## Homework 5: NLTK and Machine Learning Pipelines

In this final homework assignment, you'll be bringing together ideas from Natural Language Processing as well as
Machine Learning Pipelines.  This assignment uses materials adapted from [Benjamin Bengfort](https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html).

We are going back to work that you did in previous courses with object-oriented Python to create a preprocessor
that does NLP on some text in the context of machine learning pipelines.

First, we are going to review code that you should find reusable and helpful moving forward.  There's a lot of setup 
for this homework assignment.

In [1]:
import string

from nltk.corpus import stopwords as sw
from nltk.corpus import wordnet as wn
from nltk import wordpunct_tokenize
from nltk import WordNetLemmatizer
from nltk import sent_tokenize
from nltk import pos_tag

from sklearn.base import BaseEstimator, TransformerMixin

# objects are from sklearn
class NLTKPreprocessor(BaseEstimator, TransformerMixin):

    def __init__(self, stopwords=None, punct=None,
                 lower=True, strip=True):
        self.lower      = lower
        self.strip      = strip
        self.stopwords  = stopwords or set(sw.words('english'))
        self.punct      = punct or set(string.punctuation)
        # better lemmatizer?
        self.lemmatizer = WordNetLemmatizer()

    def fit(self, X, y=None):
        return self

    def inverse_transform(self, X):
        return [" ".join(doc) for doc in X]

    def transform(self, X): 
        # calls tokenize
        return [
            list(self.tokenize(doc)) for doc in X
        ]

    def tokenize(self, document):
        # Break the document into sentences
        for sent in sent_tokenize(document):
            # Break the sentence into part of speech tagged tokens
            for token, tag in pos_tag(wordpunct_tokenize(sent)):
                # Apply preprocessing to the token
                # WILL NEED TO ADD TO THIS LIST
                # BeautifulSoup will make this painful
                # regex?
                token = token.lower() if self.lower else token
                token = token.strip() if self.strip else token
                token = token.strip('_') if self.strip else token
                token = token.strip('*') if self.strip else token

                # If stopword, ignore token and continue
                if token in self.stopwords:
                    continue

                # If punctuation, ignore token and continue
                if all(char in self.punct for char in token):
                    continue

                # Lemmatize the token and yield
                # yield is like return, but the next time the generator is advanced, we pick up right where we left off
                lemma = self.lemmatize(token, tag)
                yield lemma

    def lemmatize(self, token, tag):
        tag = {
            'N': wn.NOUN,
            'V': wn.VERB,
            'R': wn.ADV,
            'J': wn.ADJ
        }.get(tag[0], wn.NOUN)

        return self.lemmatizer.lemmatize(token, tag)
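
Because `tokenize` uses `yield`, it is a generator: it produces one token at a time and pauses between tokens, which is why `transform` wraps it in `list(...)`. The toy function below is a minimal sketch of that behaviour; `tokens` and its inputs are made up for illustration and are not part of the homework code.

```python
def tokens(words, stopwords):
    """Yield each lowercased word that is not a stopword, one at a time."""
    for word in words:
        word = word.lower()
        if word in stopwords:
            continue  # skip this word and move on to the next one
        yield word    # hand one token to the caller, then pause here

gen = tokens(["The", "plot", "was", "thin"], stopwords={"the", "was"})
first = next(gen)   # runs the body until the first yield
rest = list(gen)    # resumes after that yield and drains the remaining tokens
print(first, rest)
```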

The next function just takes text and returns it unmodified.  We need it in the next section, where the vectorizer receives documents that are already tokenized.

In [2]:
# takes some text and returns the text (null function)
# needed in the next section
def identity_tokenizer(text):
    return text
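
The reason for the identity function is that `NLTKPreprocessor.transform` already returns lists of tokens, so the vectorizer must not try to tokenize or lowercase them again. The sketch below shows this on two hand-made token lists (the documents are invented for illustration); recent scikit-learn versions may print a warning that `token_pattern` is unused, which is expected here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def identity_tokenizer(text):
    return text

# Two documents that are already lists of tokens, like the preprocessor's output
docs = [["great", "fun", "movie"], ["bad", "plot", "bad", "script"]]

vectorizer = TfidfVectorizer(
    tokenizer=identity_tokenizer,        # hand the token lists through unchanged
    preprocessor=None, lowercase=False,  # lowercasing would call .lower() on a list
)
matrix = vectorizer.fit_transform(docs)
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(matrix.shape)
```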

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report as clsr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split as tts
import time
import pickle


def build_and_evaluate(X, y,
    classifier=SGDClassifier, outpath=None, verbose=True):

    def build(classifier, X, y=None):
        """
        Inner build function that builds a single model.
        """
        if isinstance(classifier, type):
            classifier = classifier()

        model = Pipeline([
            ('preprocessor', NLTKPreprocessor()), # class we just created
            ('vectorizer', TfidfVectorizer( # de-emphasize frequent words, emphasize unusual words
                tokenizer=identity_tokenizer, # note that this will fail unless you use the identity_tokenizer
                preprocessor=None, lowercase=False
            )),
            ('classifier', classifier),
        ])

        model.fit(X, y)
        return model

    # Label encode the targets
    #if "labels are not numeric", call LabelEncoder
    labels = LabelEncoder()
    y = labels.fit_transform(y)

    # Begin evaluation
    if verbose: print("Building for evaluation")
    X_train, X_test, y_train, y_test = tts(X, y, test_size=0.2)
    start_time = time.time()

    model = build(classifier, X_train, y_train)

    if verbose:
        print("Evaluation model fit in {:0.3f} seconds".format(time.time() - start_time))
        print("Classification Report:\n")

    y_pred = model.predict(X_test)
    print(clsr(y_test, y_pred, target_names=labels.classes_))

    if verbose:
        print("Building complete model and saving ...")
    start_time = time.time()
    model = build(classifier, X, y)
    model.labels_ = labels

    if verbose:
        print("Complete model fit in {:0.3f} seconds".format(time.time() - start_time))

    if outpath:
        with open(outpath, 'wb') as f:
            pickle.dump(model, f) # pickle saves any object to your file system

        print("Model written out to {}".format(outpath))

    return model
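
The comment on the vectorizer step says TF-IDF de-emphasizes frequent words and emphasizes unusual ones. A minimal sketch of the inverse document frequency part is shown below, assuming scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The three toy documents are invented, and the sketch skips the term-frequency multiplication and the row normalization that the real vectorizer also applies.

```python
import math

docs = [["great", "fun", "movie"], ["bad", "movie"], ["boring", "movie"]]
n_docs = len(docs)

def smooth_idf(term):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, where df(t) counts documents containing t."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

idf_movie = smooth_idf("movie")  # appears in every document, so it gets the minimum weight
idf_fun = smooth_idf("fun")      # appears in one document, so it is weighted higher
print(round(idf_movie, 3), round(idf_fun, 3))
```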

Now that we've got everything set up for our pipelines, we can load some data.  Here we're going to use the Movie Reviews
corpus from the NLTK package.

In [5]:
from nltk.corpus import movie_reviews as reviews

X = [reviews.raw(fileid) for fileid in reviews.fileids()]
y = [reviews.categories(fileid)[0] for fileid in reviews.fileids()]
print("There are {} reviews".format(len(y)))


There are 2000 reviews


In [6]:
# we can take a closer look at the structure of 'reviews'
reviews.fileids()[0]

'neg/cv000_29416.txt'

In [7]:
reviews.raw('neg/cv000_29416.txt')

'plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat\'s the deal ? \nwatch the movie and " sorta " find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn\'t snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it\'s simply too jumbled . \nit starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience membe

In [8]:
PATH = "movie_reviews_model.pickle"
model = build_and_evaluate(X,y, classifier=SGDClassifier, outpath=PATH)

Building for evaluation
Evaluation model fit in 99.701 seconds
Classification Report:

              precision    recall  f1-score   support

         neg       0.88      0.84      0.86       200
         pos       0.85      0.89      0.87       200

    accuracy                           0.86       400
   macro avg       0.86      0.86      0.86       400
weighted avg       0.86      0.86      0.86       400

Building complete model and saving ...
Complete model fit in 118.805 seconds
Model written out to movie_reviews_model.pickle


In [9]:
# precision: of the reviews we predicted as a class, how many really belong to it?
# recall: of the reviews that really belong to a class, how many did we find?
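
To make the two definitions concrete, here is a small worked example for the `pos` class. The counts are hypothetical: they were chosen so that the results round to the `pos` row of the report above, and they are not taken from the actual run.

```python
# Hypothetical counts for the 'pos' class (200 actual pos reviews in the test split)
true_pos = 178   # actual pos reviews predicted pos
false_neg = 22   # actual pos reviews predicted neg
false_pos = 32   # actual neg reviews predicted pos

precision = true_pos / (true_pos + false_pos)  # of the predicted pos, how many were right
recall = true_pos / (true_pos + false_neg)     # of the actual pos, how many were found
f1 = 2 * precision * recall / (precision + recall)
print("{:.2f} {:.2f} {:.2f}".format(precision, recall, f1))
```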

As you can see, building a model takes a considerable amount of time (and resources), so we're going to use the
"pickled" version of the model so we don't have to recreate it.

In [10]:
with open(PATH, 'rb') as f:
    model = pickle.load(f)

yhat = model.predict([
    "This is the worst movie I have ever seen!",
    "The movie was great action packed and full of adventure!",
    "Wow!",
    "This was the best and the worst at the same time!"
])


print(yhat)
print(model.labels_.inverse_transform(yhat)) # returns categorical

[0 1 0 0]
['neg' 'pos' 'neg' 'neg']
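
The integers in the first line map back to class names because `LabelEncoder` sorts the distinct labels and numbers them in that order. The plain-Python sketch below imitates that mapping; it is an illustration of the idea, not the library's implementation.

```python
labels = ["neg", "pos", "pos", "neg"]

classes = sorted(set(labels))  # sorted distinct labels: index 0 is 'neg', index 1 is 'pos'
to_int = {name: i for i, name in enumerate(classes)}

encoded = [to_int[name] for name in labels]   # plays the role of fit_transform
decoded = [classes[i] for i in [0, 1, 0, 0]]  # plays the role of inverse_transform
print(encoded, decoded)
```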


Finally, we can take a look to see which words are most highly associated with each sentiment:

In [11]:
from operator import itemgetter
def show_most_informative_features(model, text=None, n=20):
    # Extract the vectorizer and the classifier from the pipeline
    vectorizer = model.named_steps['vectorizer']
    classifier = model.named_steps['classifier']

    # Check to make sure that we can perform this computation
    if not hasattr(classifier, 'coef_'):
        raise TypeError(
            "Cannot compute most informative features on {}.".format(
                classifier.__class__.__name__
            )
        )

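    if text is not None:
        # Compute the TF-IDF vector for the text using the transformer steps only;
        # a pipeline that ends in a classifier has no transform() of its own
        tvec = vectorizer.transform(
            model.named_steps['preprocessor'].transform([text])
        ).toarray()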
    else:
        # Otherwise simply use the coefficients
        tvec = classifier.coef_

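    # get_feature_names() was removed in scikit-learn 1.2; prefer its replacement when present
    if hasattr(vectorizer, 'get_feature_names_out'):
        feature_names = vectorizer.get_feature_names_out()
    else:
        feature_names = vectorizer.get_feature_names()

    # Zip the feature names with the coefs and sort
    coefs = sorted(
        zip(tvec[0], feature_names),
        key=itemgetter(0), reverse=True
    )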

    # Get the top n and bottom n coef, name pairs
    topn  = zip(coefs[:n], coefs[:-(n+1):-1])

    # Create the output string to return
    output = []

    # If text, add the predicted value to the output.
    if text is not None:
        output.append("\"{}\"".format(text))
        output.append(
            "Classified as: {}".format(model.predict([text]))
        )
        output.append("")

    # Create two columns with most negative and most positive features.
    for (cp, fnp), (cn, fnn) in topn:
        output.append(
            "{:0.4f}{: >15}    {:0.4f}{: >15}".format(
                cp, fnp, cn, fnn
            )
        )

    return "\n".join(output)

In [12]:
print(show_most_informative_features(model))

2.6164            fun    -4.8341            bad
2.5542          great    -2.6382  unfortunately
2.1440    performance    -2.5793          waste
2.1328            see    -2.4778           plot
1.9539         matrix    -2.4465        suppose
1.8630          quite    -2.3901        nothing
1.8409           trek    -2.3505        attempt
1.6554      memorable    -1.9573           poor
1.6510       bulworth    -1.9456          awful
1.5805       terrific    -1.9437         stupid
1.5649      different    -1.8706         boring
1.5486            job    -1.8488           look
1.5284      enjoyable    -1.8077     ridiculous
1.5223      hilarious    -1.7973          guess
1.5187        portray    -1.7840          could
1.5070     especially    -1.7226           even
1.5070              7    -1.7110      carpenter
1.4938           also    -1.6744         script
1.4907        overall    -1.6698          harry
1.4898           true    -1.6659           lame
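
The two columns above come from sorting the (coefficient, feature) pairs in descending order and slicing both ends of the list. The sketch below replays that slicing on a handful of made-up pairs so the `coefs[:-(n+1):-1]` expression is easier to follow.

```python
from operator import itemgetter

# Made-up (coefficient, feature) pairs for illustration
pairs = [(0.4, "see"), (2.6, "fun"), (-4.8, "bad"), (-2.6, "waste"), (2.5, "great")]
coefs = sorted(pairs, key=itemgetter(0), reverse=True)

n = 2
top = coefs[:n]                # the n largest coefficients
bottom = coefs[:-(n + 1):-1]   # the n smallest, walking backwards from the end
print(top)
print(bottom)
```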


## Your challenge:
Build a sentiment classifier for the IMDB Dataset, which is available in the data/ directory.  Please note that
the IMDB Dataset consists of 50000 rows, so it's probably best to do most of your work on a sample of the
original dataset.  In the code below we use a sample size of 1000.  That's probably fine to start with but your final submission should be based on a sample of at least 5000.

You should attempt to improve the default classifier shown above by trying to get a higher accuracy score.  For example, you might want to try one of the other classifiers from the list shown in class 22.  Another way to improve your pipeline is to spend more time
building a better text preprocessor (e.g. you can see some reviews contain HTML, which you might decide to strip out).  Another thing you might want to do is to look more closely at the stopword list.
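
As one possible starting point for both ideas, the sketch below strips HTML from a whole review before tokenizing and builds a stopword set that keeps negations. The helper name `strip_html`, the sample review, and the tiny base stopword set are all hypothetical; in the notebook the base set would come from `sw.words('english')`, and the result could be passed as `NLTKPreprocessor(stopwords=...)`.

```python
import html
import re

BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")

def strip_html(text):
    """Replace HTML tags with spaces and decode entities such as &amp;."""
    text = BR_TAG.sub(" ", text)
    text = ANY_TAG.sub(" ", text)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical base list standing in for the NLTK English stopwords
base_stopwords = {"the", "was", "not", "no", "a"}
# Keep negations, since "not good" and "good" carry opposite sentiment
stopwords = base_stopwords - {"not", "no"}

cleaned = strip_html("Great cast.<br /><br />The plot was <i>not</i> good &amp; slow.")
print(cleaned)
```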

Please note that if you resample the dataset you will get slightly different accuracy values.  The values should not fluctuate wildly, so don't get too concerned about their absolute value.  What we're looking for is an improvement from the baseline and evidence that you tried a variety of approaches to improving the classifier.  We're also looking for evidence that you can manipulate text data into a machine learning pipeline and correctly interpret the results.
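
If you want to compare approaches on exactly the same rows, one option is to fix the random seed when sampling. The sketch below shows the idea with the standard library only; `draw_sample` is a hypothetical helper, and the analogous pandas call would be `m.sample(n=1000, random_state=42)` (`train_test_split` accepts a `random_state` argument as well).

```python
import random

row_ids = range(50000)  # stand-in for the 50000 rows of the dataset

def draw_sample(seed, n=1000):
    """Draw n distinct row ids; the same seed always gives the same rows."""
    return random.Random(seed).sample(row_ids, n)

same_a = draw_sample(seed=42)
same_b = draw_sample(seed=42)
print(same_a == same_b)
```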

You should include code and interpretation of your results in this notebook.   If you tried many different approaches and ultimately chose one over the others, please include that in your write-up.  You do not need to include code for analyses that you discarded.

You should be able to plug the new data into the old pipeline code to get started (another handy thing about pipelines) and then start experimenting with improving the code!

In [1]:
import pandas as pd

In [2]:
import string
import re

from nltk.corpus import stopwords as sw
from nltk.corpus import wordnet as wn
from nltk import wordpunct_tokenize
from nltk import WordNetLemmatizer
from nltk import sent_tokenize
from nltk import pos_tag

from sklearn.base import BaseEstimator, TransformerMixin

# objects are from sklearn
class NLTKPreprocessor(BaseEstimator, TransformerMixin):

    def __init__(self, stopwords=None, punct=None,
                 lower=True, strip=True):
        self.lower      = lower
        self.strip      = strip
        self.stopwords  = stopwords or set(sw.words('english'))
        self.punct      = punct or set(string.punctuation)
        # better lemmatizer?
        self.lemmatizer = WordNetLemmatizer()

    def fit(self, X, y=None):
        return self

    def inverse_transform(self, X):
        return [" ".join(doc) for doc in X]

    def transform(self, X): 
        # calls tokenize
        return [
            list(self.tokenize(doc)) for doc in X
        ]

    def tokenize(self, document):
        # Remove HTML line breaks from the whole document first; wordpunct_tokenize
        # splits "<br />" into "<", "br" and "/>", so a per-token regex never sees the tag
        document = re.sub(r"<br\s*/?>", " ", document)

        # Break the document into sentences
        for sent in sent_tokenize(document):
            # Break the sentence into part of speech tagged tokens
            for token, tag in pos_tag(wordpunct_tokenize(sent)):
                # If punctuation, ignore token and continue
                if all(char in self.punct for char in token):
                    continue

                # Apply preprocessing to the token
                token = token.replace("'", "") # for removing distracting single quotes
                token = re.sub(r"\d+", "", token) # for removing digits
                token = token.lower() if self.lower else token
                token = token.strip() if self.strip else token
                token = token.strip('_') if self.strip else token
                token = token.strip('*') if self.strip else token

                # If nothing is left after cleaning (e.g. a number), ignore token
                if not token:
                    continue
                

                # If stopword, ignore token and continue
                if token in self.stopwords:
                    continue

                # Lemmatize the token and yield
                # yield is like return, but the next time the generator is advanced, we pick up right where we left off
                lemma = self.lemmatize(token, tag)
                yield lemma

    def lemmatize(self, token, tag):
        tag = {
            'N': wn.NOUN,
            'V': wn.VERB,
            'R': wn.ADV,
            'J': wn.ADJ
        }.get(tag[0], wn.NOUN)

        return self.lemmatizer.lemmatize(token, tag)

In [3]:
# takes some text and returns the text (null function)
# needed in the next section
def identity_tokenizer(text):
    return text

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report as clsr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split as tts
import time
import pickle


def build_and_evaluate(X, y,
    classifier=SGDClassifier, outpath=None, verbose=True):

    def build(classifier, X, y=None):
        """
        Inner build function that builds a single model.
        """
        if isinstance(classifier, type):
            classifier = classifier()

        model = Pipeline([
            ('preprocessor', NLTKPreprocessor()), # class we just created
            ('vectorizer', TfidfVectorizer( # de-emphasize frequent words, emphasize unusual words
                tokenizer=identity_tokenizer, # note that this will fail unless you use the identity_tokenizer
                preprocessor=None, lowercase=False
            )),
            ('classifier', classifier),
        ])

        model.fit(X, y)
        return model

    # Label encode the targets
    #if "labels are not numeric", call LabelEncoder
    labels = LabelEncoder()
    y = labels.fit_transform(y)

    # Begin evaluation
    if verbose: print("Building for evaluation")
    X_train, X_test, y_train, y_test = tts(X, y, test_size=0.2)
    start_time = time.time()

    model = build(classifier, X_train, y_train)

    if verbose:
        print("Evaluation model fit in {:0.3f} seconds".format(time.time() - start_time))
        print("Classification Report:\n")

    y_pred = model.predict(X_test)
    print(clsr(y_test, y_pred, target_names=labels.classes_))

    if verbose:
        print("Building complete model and saving ...")
    start_time = time.time()
    model = build(classifier, X, y)
    model.labels_ = labels

    if verbose:
        print("Complete model fit in {:0.3f} seconds".format(time.time() - start_time))

    if outpath:
        with open(outpath, 'wb') as f:
            pickle.dump(model, f) # pickle saves any object to your file system

        print("Model written out to {}".format(outpath))

    return model

In [6]:
m = pd.read_csv("imdb-dataset-of-50k-movie-reviews.zip")
# Let's do most of our work on a smaller sample of the 50000 rows
m = m.sample(1000)

In [7]:
m.head()

Unnamed: 0  review                                             sentiment
35248       This has been one of the best vampire movies t...  positive
19570       For many years I thought I was the only person...  positive
6040        I came across this movie back in the mid eight...  positive
26625       I was bored one night and Red Eye was on and t...  positive
10593       Broken Silence or "Race Against Fear"1998): St...  positive


In [8]:
X = m.review
y = m.sentiment

In [9]:
for x in range(len(m.review)):
    print(m.review.iloc[x])
    print()

This has been one of the best vampire movies that I have seen in a long time. It was very seductive and alluring, I liked that it did not have the usual gore and carnage that comes along with most vampire movies. The music was excellent. It would be great if there was a sequel.

For many years I thought I was the only person on the planet who had seen TEMPEST, and I am so glad to learn that I am not the only person who discovered this sleeper somewhere in their movie-going travails. Loosely based on the Shakesperean play, TEMPEST follows an architect (the late John Cassavettes, in one of his best performances), bored with his work and his crumbling marriage (to real life spouse Gene Rowlads), who decides to chuck it all, say the hell with the rat race and go live on an island with his daughter (Molly Ringwald, in her film debut), and new girlfriend Aretha (a luminous Susan Sarandon). Even though Paul Mazursky is credited as director, Cassavettes hand is all over this film...the long sc

In [10]:
PATH = "imdb_model.pickle"

In [11]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="rbf", C=0.025, probability=True),
    NuSVC(nu=0.1,probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    SGDClassifier
    ]
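
The list above mixes configured instances with the bare class `SGDClassifier`. That works because `build` checks `isinstance(classifier, type)` and instantiates a class with its defaults. The sketch below isolates that check; `ToyClassifier` and `resolve` are hypothetical names used only for illustration.

```python
class ToyClassifier:
    """Hypothetical stand-in for a scikit-learn estimator."""
    def __init__(self, k=5):
        self.k = k

def resolve(classifier):
    # A bare class is instantiated with its defaults; an instance is used as-is
    if isinstance(classifier, type):
        classifier = classifier()
    return classifier

from_class = resolve(ToyClassifier)        # class given, defaults applied
from_instance = resolve(ToyClassifier(3))  # instance given, settings kept
print(from_class.k, from_instance.k)
```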

In [12]:
model = build_and_evaluate(X,y, classifier=SGDClassifier, outpath=PATH)

Building for evaluation


LookupError: 
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')

  Searched in:
    - '/Users/andrewdicks/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/Users/andrewdicks/anaconda3/nltk_data'
    - '/Users/andrewdicks/anaconda3/share/nltk_data'
    - '/Users/andrewdicks/anaconda3/lib/nltk_data'
**********************************************************************
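
The error above goes away once the missing resource has been downloaded. Since the preprocessor also tokenizes sentences, tags parts of speech and lemmatizes, it is likely to need a few more resources. The helper below is a sketch under that assumption: `download_nltk_resources` is a hypothetical name, and the resource names assume a classic NLTK 3.x install (newer releases may name some of them differently, and the `LookupError` message states the exact name to use).

```python
# Resource names are an assumption; follow the LookupError message if it asks for another name
REQUIRED_RESOURCES = ["stopwords", "punkt", "averaged_perceptron_tagger", "wordnet"]

def download_nltk_resources(names=REQUIRED_RESOURCES):
    """Download each resource and return a dict of name -> success flag."""
    import nltk  # imported here so that defining the helper does not require NLTK
    return {name: nltk.download(name, quiet=True) for name in names}
```

Calling `download_nltk_resources()` once before `build_and_evaluate` should be enough; it needs network access the first time.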


In [60]:
for classifier in classifiers:
    print(classifier)
    model = build_and_evaluate(X,y, classifier=classifier, outpath=PATH)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')
Building for evaluation
Evaluation model fit in 25.979 seconds
Classification Report:

              precision    recall  f1-score   support

    negative       0.71      0.76      0.73       107
    positive       0.70      0.65      0.67        93

    accuracy                           0.70       200
   macro avg       0.70      0.70      0.70       200
weighted avg       0.70      0.70      0.70       200

Building complete model and saving ...
Complete model fit in 32.093 seconds
Model written out to imdb_model.pickle
SVC(C=0.025, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=True, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
Building for evaluation
Evaluation model fit in 31.359 seconds
Classification Report:

              precision    recall  f1-score   support

    negative       0.51      1.00      0.67       101
    positive       0.00      0.00      0.00        99

    accuracy                           0.51       200
   macro avg       0.25      0.50      0.34       200
weighted avg       0.26      0.51      0.34       200

Building complete model and saving ...
Complete model fit in 38.969 seconds
Model written out to imdb_model.pickle
NuSVC(cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
      kernel='rbf', max_iter=-1, nu=0.1, probability=True, random_state=None,
      shrinking=True, tol=0.001, verbose=False)
Building for evaluation
Evaluation model fit in 26.257 seconds
Classification Report:

              precision    recall  f1-score   support

    negative       0.82      0.66      0.74       113
    positive       0.65      0.82      0.72        87

    accuracy                           0.73       200
   macro avg       0.74      0.74      0.73       200
weighted avg       0.75      0.73      0.73       200

Building complete model and saving ...
Complete model fit in 32.533 seconds
Model written out to imdb_model.pickle
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
Building for evaluation
Evaluation model fit in 26.206 seconds
Classification Report:

              precision    recall  f1-score   support

    negative       0.57      0.65      0.61        94
    positive       0.65      0.57      0.60       106

    accuracy                           0.60       200
   macro avg       0.61      0.61      0.60       200
weighted avg       0.61      0.60      0.60       200

Building complete model and saving ...
Complete model fit in 31.919 seconds
Model written out to imdb_mod
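
To compare the classifiers at a glance, the accuracies from the reports above can be collected and ranked. The sketch below copies the four values that are visible in the output; it is only a summary aid and does not rerun anything.

```python
# Accuracy on the held-out split, copied from the classification reports above
accuracy = {
    "KNeighborsClassifier": 0.70,
    "SVC": 0.51,
    "NuSVC": 0.73,
    "DecisionTreeClassifier": 0.60,
}

ranked = sorted(accuracy.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print("{:<24}{:.2f}".format(name, score))

best_name, best_score = ranked[0]
```

Two caveats apply when reading such a ranking. Each call to `build_and_evaluate` draws a new random train/test split (the support counts differ between reports), so gaps of a few points may not be meaningful. The loop also writes every model to the same `PATH`, so the pickle on disk holds whichever classifier was fitted last, not necessarily the best one.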

In [13]:
# You should include the output from the following code in your notebook:
with open(PATH, 'rb') as f:
    model = pickle.load(f)

yhat = model.predict([
    "This is the worst movie I have ever seen!",
    "The movie was great action packed and full of adventure!",
    "Wow!",
    "This was the best and the worst at the same time!"
])


print(yhat)
print(model.labels_.inverse_transform(yhat)) # returns categorical

[0 1 0 0]
['negative' 'positive' 'negative' 'negative']


In [None]:
# build_and_evaluate(m.review,m.sentiment)
# do better at preprocessing
# do better at choosing the classifier

In [21]:
# PATH = m
# X = m.review
# y = m.sentiment
# model = build_and_evaluate(X,y, classifier=SGDClassifier, outpath=PATH)