## Text Classification Tutorial 
Credit to: https://stackabuse.com/text-classification-with-python-and-scikit-learn/

In [1]:
import numpy as np
import re
import nltk
from sklearn.datasets import load_files
import pickle
from nltk.corpus import stopwords

Dataset: movie reviews from the Cornell Natural Language Processing Group. The dataset contains 2000 documents in total.

Check out what the data looks like in the review_polarity/txt_sentoken directory, particularly cv001 in the neg folder: it is a negative review, yet it contains words like "good".

In [2]:
movie_data = load_files(r"../review_polarity/txt_sentoken")
X, y = movie_data.data, movie_data.target

Below we print the length of the array "X" and see that 2000 documents were read. "y" holds the matching labels, negative or positive, taken from the folder each review was in ('neg' or 'pos').

Remember that someone had to read through all of these reviews and manually sort them into the two folders based on their sentiment.

Please note that the variable names are case sensitive when we use them later on: "X" is uppercase and "y" is lowercase.
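To see how `load_files` turns folder names into labels, here is a minimal sketch on a throwaway directory. The file names and review snippets are made up for illustration; only the `neg`/`pos` folder layout mirrors the real dataset.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a tiny stand-in for review_polarity/txt_sentoken (hypothetical content)
with tempfile.TemporaryDirectory() as root:
    for label, text in [("neg", "this is crap"), ("pos", "a wonderful film")]:
        folder = Path(root) / label
        folder.mkdir()
        (folder / "cv000.txt").write_text(text)
    toy = load_files(root)  # file contents are read while the directory still exists

target_names = list(toy.target_names)   # folder names, sorted alphabetically
first_type = type(toy.data[0])          # documents are loaded as raw bytes
# Map each folder name to the integer label load_files assigned to it
label_of = {Path(f).parent.name: int(t) for f, t in zip(toy.filenames, toy.target)}

print(target_names, label_of, first_type)
```

Folders are sorted alphabetically, which is why `neg` becomes 0 and `pos` becomes 1. The documents come back as bytes unless an `encoding` is passed to `load_files`.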

**Notes to self**: Remind the audience here that a Jupyter notebook runs sequentially (which is why there are numbers by its side), so we can't skip ahead (for example, running the block that prints X before the block that reads X).

Also, if they ever get in trouble and aren't sure where they are, just refresh the notebook and run all. There's also a completed notebook in the /completed/ folder that you can follow if you're newer to all this (show them where to get it)

Lastly, slow down. You'll be fine :) 

**Disclaimer**: Binder is a free, publicly available tool, and we strongly suggest not using it with any proprietary data: we are unsure about its security, and you may lose data when logging out. There are plenty of tools and people around the company who can assist you with any of these needs if you're interested in using it for actual projects (just reach out!). 

In [3]:
len(X)

2000

Printing out the first review in X just to see the data we've pulled:

In [4]:
print(X[0])

b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so called dark thriller , the devil ( gabriel byrne ) has come upon earth , to impregnate a woman ( robin tunney ) which happens every 1000 years , and basically destroy the world , but apparently god has chosen one man , and that one man is jericho cane ( arnold himself ) . \nwith the help of a trusty sidekick ( kevin pollack ) , they will stop at nothing to let the devil take over the world ! \nparts of this are actually so absurd , that they would fit right in with dogma . \nyes , the film is 

y labels: 0 is NEGATIVE, 1 is POSITIVE

In [5]:
y[0]

0

## Preprocessing The Data

In [6]:
documents = []

from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer needs the WordNet corpus: run nltk.download('wordnet') once if it is missing
lemmatizer = WordNetLemmatizer()

for raw_review in X:
    # Each review is a bytes object, so str() yields text of the form b'...'
    # Replace every non-word character with a space
    document = re.sub(r'\W', ' ', str(raw_review))

    # Remove single characters surrounded by whitespace
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Remove a single character at the start of the text
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

    # Collapse runs of whitespace into a single space
    document = re.sub(r'\s+', ' ', document)

    # Remove the leading 'b' left over from the bytes prefix, if still present
    document = re.sub(r'^b\s+', '', document)

    # Convert to lowercase
    document = document.lower()

    # Lemmatize each word and rebuild the document
    words = [lemmatizer.lemmatize(word) for word in document.split()]
    documents.append(' '.join(words))


In [7]:
X[0]

b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so called dark thriller , the devil ( gabriel byrne ) has come upon earth , to impregnate a woman ( robin tunney ) which happens every 1000 years , and basically destroy the world , but apparently god has chosen one man , and that one man is jericho cane ( arnold himself ) . \nwith the help of a trusty sidekick ( kevin pollack ) , they will stop at nothing to let the devil take over the world ! \nparts of this are actually so absurd , that they would fit right in with dogma . \nyes , the film is 

In [8]:
documents[0]

'arnold schwarzenegger ha been an icon for action enthusiast since the late 80 but lately his film have been very sloppy and the one liner are getting worse nit hard seeing arnold a mr freeze in batman and robin especially when he say ton of ice joke but hey he got 15 million what it matter to him nonce again arnold ha signed to do another expensive blockbuster that can compare with the like of the terminator series true lie and even eraser nin this so called dark thriller the devil gabriel byrne ha come upon earth to impregnate woman robin tunney which happens every 1000 year and basically destroy the world but apparently god ha chosen one man and that one man is jericho cane arnold himself nwith the help of trusty sidekick kevin pollack they will stop at nothing to let the devil take over the world nparts of this are actually so absurd that they would fit right in with dogma nyes the film is that weak but it better than the other blockbuster right now sleepy hollow but it make the wo

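The cleaned version above cuts out the noise: punctuation is gone, everything is lowercase, and words are reduced to their lemmas.

Other NLP preprocessing choices worth considering: n-grams, stop word removal, and lemmatization versus stemming.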
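The stray tokens in the cleaned review ('nit', 'nonce', 'nwith') come from calling `str()` on a bytes object: each newline shows up as the two characters `\n`, the backslash is stripped as a special character, and the leftover 'n' sticks to the next word. One possible alternative is to decode the bytes first. The sketch below uses a hypothetical `clean_review` helper that is not part of the original notebook and skips the lemmatization step. Switching to it would change the vocabulary and the scores reported later.

```python
import re

def clean_review(raw):
    """Hypothetical helper: normalise one review given as bytes or str."""
    # Decode bytes so that newlines stay real whitespace instead of the text '\n'
    text = raw.decode("utf-8", errors="replace") if isinstance(raw, bytes) else raw
    text = re.sub(r"\W", " ", text)              # non-word characters -> space
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)  # isolated single letters
    text = re.sub(r"^[a-zA-Z]\s+", " ", text)    # single letter at the start
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text.lower()

sample = b"it's hard . \nonce again arnold"

# What the str()-based approach sees: the newline has become the characters '\' and 'n'
via_str = re.sub(r"\s+", " ", re.sub(r"\W", " ", str(sample))).strip()
via_decode = clean_review(sample)

print(via_str)     # contains 'nonce'
print(via_decode)  # contains 'once'
```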

## Feature Engineering

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X1 = vectorizer.fit_transform(documents).toarray()

CountVectorizer makes every unique word a feature for the ML model. Parameters:

- max_features=1500 keeps at most 1500 features (the 1500 most frequently occurring unique words across the corpus)
- min_df=5 is the minimum number of documents a word must appear in to be kept
- max_df=0.7 is the maximum fraction of documents that may contain a word (words found in more than 70% of the documents are dropped)
- Finally, stop_words removes very common words in the English language
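The effect of these parameters is easier to see on a tiny corpus. The four documents below are made up, and the thresholds are scaled down (`min_df=2`, `max_features=2`) so the filtering is visible. This is a sketch of the behaviour, not the notebook's actual settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "good good good movie plot",
    "bad bad movie",
    "good acting bad plot",
    "movie movie movie",
]

# 'movie' is in 3 of 4 documents (75% > 70%), so max_df drops it;
# 'acting' is in only 1 document, so min_df=2 drops it.
filtered = CountVectorizer(min_df=2, max_df=0.7).fit(toy_docs)
kept = sorted(filtered.vocabulary_)

# max_features then keeps the most frequent of the survivors:
# 'good' occurs 4 times, 'bad' 3 times, 'plot' 2 times.
top_two = CountVectorizer(min_df=2, max_df=0.7, max_features=2).fit(toy_docs)
kept_top_two = sorted(top_two.vocabulary_)

print(kept)
print(kept_top_two)
```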

In [10]:
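# get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
print(vectorizer.get_feature_names())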

['000', '10', '13', '1997', '1998', '1999', '20', '80', '90', 'ability', 'able', 'absolutely', 'academy', 'accent', 'accident', 'across', 'act', 'acting', 'action', 'actor', 'actress', 'actual', 'actually', 'adam', 'adaptation', 'add', 'addition', 'admit', 'adult', 'adventure', 'affair', 'affleck', 'age', 'agent', 'ago', 'air', 'alan', 'alien', 'alive', 'allen', 'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'amazing', 'america', 'american', 'among', 'amount', 'amusing', 'anderson', 'angel', 'angle', 'angry', 'animal', 'animated', 'animation', 'annoying', 'another', 'answer', 'anti', 'anyone', 'anything', 'anyway', 'apart', 'apartment', 'ape', 'apparent', 'apparently', 'appeal', 'appealing', 'appear', 'appearance', 'appears', 'appreciate', 'approach', 'appropriate', 'arm', 'army', 'around', 'art', 'artist', 'aside', 'ask', 'asked', 'asks', 'aspect', 'atmosphere', 'attack', 'attempt', 'attention', 'attitude', 'audience', 'average', 'award', 'away', 'awful

In [11]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidfconverter = TfidfTransformer()
X2 = tfidfconverter.fit_transform(X1).toarray()

TfidfTransformer turns the earlier count array into an array of weights, one per unique word per document:
- TF-IDF (term frequency, inverse document frequency) is a weight often used in information retrieval and text mining
- It is a statistical measure of how important a word is to a document in a collection or corpus
- The importance increases proportionally to the number of times a word appears in the document, but is offset by how many documents in the corpus contain the word
- Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query
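As a worked example, the weights can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfTransformer` (smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row). The small count matrix is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Rows are documents, columns are words (toy counts)
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # rarer words get a larger idf
weighted = counts * idf                          # term frequency times idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalise rows

from_sklearn = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, from_sklearn))
```

The first word appears in every document, so its idf is the minimum value of 1. The other two words are rarer and get a higher idf.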

In [12]:
np.set_printoptions(threshold=np.inf)
print(X2[0])

[0.         0.         0.         0.         0.         0.
 0.         0.07599077 0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.20052791 0.         0.         0.         0.04253558 0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.03153765 0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.07461927 0.         0.         0.
 0.         0.06965278 0.         0.         0.         0.
 0.06772453 0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.03756588 0.         0.         0.         0.         

In the part of the array printed above, the highest weight for this first document sits at index 18. Looking into it further, the word at index 18 is 'action', which occurs a lot in that review.

In [13]:
print(vectorizer.get_feature_names()[18])

action
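Instead of scanning the printout by eye, the highest-weighted word can be looked up with `np.argmax`. The sketch below uses a made-up three-document corpus and a hypothetical `feature_names` helper that works on both older scikit-learn releases (`get_feature_names`) and newer ones (`get_feature_names_out`).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def feature_names(vectorizer):
    """Hypothetical helper: return the vocabulary as a list on old and new scikit-learn."""
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return vectorizer.get_feature_names()

toy_docs = ["action action action hero", "quiet drama hero", "quiet romance"]
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)
toy_weights = TfidfTransformer().fit_transform(toy_counts).toarray()

# Column with the largest TF-IDF weight in the first document
top_index = int(np.argmax(toy_weights[0]))
top_word = feature_names(toy_vectorizer)[top_index]
print(top_index, top_word)
```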


## Splitting The Data

In [14]:
from sklearn.model_selection import train_test_split
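# Pass the row indices through the split as well, so each test row can be traced back to `documents`
indices = list(range(len(X2)))
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(X2, y, indices, test_size=0.2, random_state=0)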

Notice that the rows are randomly shuffled: X_test is filled with randomly selected items from the documents array, but we can still tell which is which because we saved the shuffled indices. For example, the **first item in X_test** is the item at **index 405 in documents**.
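The index-tracking trick can be checked on a small made-up array: any extra sequence passed to `train_test_split` is shuffled and split in the same way as the features and labels. The names below are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

toy_X = np.arange(20).reshape(10, 2)   # 10 rows, 2 features
toy_y = np.array([0, 1] * 5)
toy_indices = list(range(len(toy_X)))  # original row numbers

X_tr, X_te, y_tr, y_te, idx_tr, idx_te = train_test_split(
    toy_X, toy_y, toy_indices, test_size=0.2, random_state=0)

# Every test row can be traced back to its original position
for position, original_index in enumerate(idx_te):
    assert (X_te[position] == toy_X[original_index]).all()
    assert y_te[position] == toy_y[original_index]

print(idx_te)
```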

In [15]:
print(indices_test)

[405, 1190, 1132, 731, 1754, 1178, 1533, 1303, 1857, 18, 1266, 1543, 249, 191, 721, 1896, 452, 1947, 1544, 1205, 1905, 1235, 799, 1594, 1091, 1357, 479, 1148, 511, 1931, 1257, 1105, 1938, 1750, 1058, 427, 1025, 156, 491, 995, 27, 1326, 1239, 17, 1622, 519, 361, 289, 1790, 1135, 1549, 1327, 76, 579, 279, 935, 475, 1786, 1023, 1556, 842, 1522, 1108, 1369, 47, 794, 918, 1220, 506, 353, 276, 881, 1835, 1703, 333, 1616, 135, 182, 1761, 1501, 930, 107, 390, 1165, 1698, 262, 1732, 260, 487, 303, 53, 1015, 148, 1442, 1580, 1999, 1927, 80, 1854, 264, 1979, 1273, 538, 1529, 37, 896, 1129, 1617, 1276, 878, 386, 1820, 977, 1074, 1542, 648, 1922, 77, 689, 9, 861, 6, 1083, 161, 1600, 1110, 962, 1385, 775, 1174, 823, 145, 240, 215, 1889, 465, 987, 1238, 760, 252, 1633, 781, 1451, 572, 85, 634, 317, 118, 1784, 1876, 443, 1900, 516, 1646, 30, 713, 254, 900, 587, 1317, 1887, 1811, 1121, 1334, 1918, 906, 1635, 1858, 796, 1076, 522, 1411, 124, 187, 517, 1945, 686, 342, 1787, 200, 1421, 1511, 1954, 233, 16

## Training The Model

In [16]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
classifier.fit(X_train, y_train) 

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

**Note to self**: Draw out a single movie review tree here (if the review contains >3 'bad' words, it's negative)

Random Forest example:

<img align="left" src=https://miro.medium.com/max/900/1*EFBVZvHEIoMdYHjvAZg8Zg.gif width="450" /> 
<img align="left" src=https://static.javatpoint.com/tutorial/machine-learning/images/random-forest-algorithm2.png width="450" />


Pros of random forest:
- Often strong predictive performance with little tuning, including on binary classification tasks like this one
- Provides a built-in feature importance estimate
- Out-of-bag samples offer an efficient estimate of the test error without the repeated model training that cross-validation requires
- Handles thousands of input variables without variable deletion

Cons of random forest:
- An ensemble model is inherently less interpretable than an individual decision tree
- Training a large number of deep trees can have high computational costs (but can be parallelized) and use a lot of memory
- Predictions are slower than with a single tree, which may create challenges for low-latency applications

More considerations: 
- https://github.com/TayariAmine/ML_cheat_sheet/wiki/Random-forest-Pros-and-Cons
- https://www.oreilly.com/library/view/hands-on-machine-learning/9781789346411/e17de38e-421e-4577-afc3-efdd4e02a468.xhtml
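Several of the points above map directly onto `RandomForestClassifier` options. The sketch below runs on synthetic data from `make_classification` rather than the movie reviews, and the parameter values are arbitrary choices for illustration. `oob_score=True` gives the out-of-bag accuracy estimate, `feature_importances_` gives the importance estimate, and `n_jobs=-1` trains the trees in parallel.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification problem (not the review data)
toy_X, toy_y = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,   # number of trees
    oob_score=True,     # score each sample using only trees that did not see it
    n_jobs=-1,          # build trees in parallel on all available cores
    random_state=0,
)
forest.fit(toy_X, toy_y)

oob_accuracy = forest.oob_score_
importances = forest.feature_importances_
top_features = importances.argsort()[::-1][:3]  # indices of the three most important features

print(round(oob_accuracy, 3), top_features)
```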

In [17]:
y_pred = classifier.predict(X_test)

In [18]:
len(X_train)

1600

In [19]:
X[405]

b'this is crap , but , honestly , what older american audience is going to be able to resist seeing jack lemmon and james garner as bicker- ing ex-presidents ? \nespecially when their supporting players in- clude dan aykroyd as the current commander in chief , lauren bacall as a former first lady , and john heard as the dan quayle-ish vice president . \nyup , you\'re talkin\' pre-sold property here and , for warner brothers , the perfect fit into their now-ritual grumpy old men holiday slot . \nfor the non-discriminating viewer , my fellow americans is fine . \nthe raw star power alone will have audiences applauding this atrocious political- thriller road-comedy . \n ( they did in mine , heaven help us . ) \nfor the rest of us , the movie is immediately tiresome . \nthe tone is terrible and the banter is worse . \nforget wit-- lemmon and garner merely exchange profanities through most of the movie . \n ( has anyone counted the number of first penis references ? ) \nsure , some of the b

In [20]:
y_pred[0]

0

## Evaluating The Model

In [21]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test,y_pred))
print("\nACCURACY: {}".format(accuracy_score(y_test, y_pred)))

[[180  28]
 [ 30 162]]

ACCURACY: 0.855


<img src=https://glassboxmedicine.files.wordpress.com/2019/02/confusion-matrix.png width="500">

In [22]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.86      0.87      0.86       208
           1       0.85      0.84      0.85       192

    accuracy                           0.85       400
   macro avg       0.85      0.85      0.85       400
weighted avg       0.85      0.85      0.85       400
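The numbers in the report can be recomputed from the confusion matrix printed earlier. In scikit-learn's layout the rows are the true labels and the columns are the predictions, so this short sketch needs nothing beyond NumPy.

```python
import numpy as np

# Confusion matrix from above: rows = actual (0, 1), columns = predicted (0, 1)
cm = np.array([[180, 28],
               [30, 162]])

accuracy = np.trace(cm) / cm.sum()           # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)     # correct / everything predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)        # correct / everything actually in that class
f1 = 2 * precision * recall / (precision + recall)

print(accuracy)
print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
```

For class 0, for example, precision is 180 / (180 + 30) and recall is 180 / (180 + 28), which round to the 0.86 and 0.87 shown in the report.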



In [23]:
# Finding False Positives (actual = 0, pred = 1)
for i in range(len(y_pred)): 
    if (y_pred[i]==1 and y[indices_test[i]]!=y_pred[i]):
        print(y[indices_test[i]], y_pred[i], indices_test[i], i)
        break

0 1 1266 10


A bit of digging finds the first false positive: index 1266 in 'documents', which is index 10 in 'y_pred'.
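To list every misclassified review at once rather than stopping at the first, boolean masks are one option. The arrays below are small made-up stand-ins for `y_test`, `y_pred` and `indices_test`.

```python
import numpy as np

toy_true = np.array([0, 1, 0, 0, 1, 1])                 # stand-in for y_test
toy_pred = np.array([0, 1, 1, 0, 0, 1])                 # stand-in for y_pred
toy_indices_test = [405, 1190, 1132, 731, 1754, 1178]   # stand-in for indices_test

# Positions in the test set where prediction and truth disagree
false_pos_positions = np.flatnonzero((toy_pred == 1) & (toy_true == 0))
false_neg_positions = np.flatnonzero((toy_pred == 0) & (toy_true == 1))

# Translate test-set positions back to indices in the original documents list
false_pos_documents = [toy_indices_test[p] for p in false_pos_positions]
false_neg_documents = [toy_indices_test[p] for p in false_neg_positions]

print(false_pos_documents, false_neg_documents)
```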

In [24]:
print(documents[1266])



In [25]:
print(y[1266])
print(y_pred[10])

0
1


## Saving The Model

In [26]:
with open('text_classifier', 'wb') as picklefile:
    pickle.dump(classifier,picklefile)

In [27]:
with open('text_classifier', 'rb') as training_model:
    model = pickle.load(training_model)

In [28]:
y_pred2 = model.predict(X_test)

print(confusion_matrix(y_test, y_pred2))
print(classification_report(y_test, y_pred2))
print(accuracy_score(y_test, y_pred2)) 

[[180  28]
 [ 30 162]]
              precision    recall  f1-score   support

           0       0.86      0.87      0.86       208
           1       0.85      0.84      0.85       192

    accuracy                           0.85       400
   macro avg       0.85      0.85      0.85       400
weighted avg       0.85      0.85      0.85       400

0.855
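The pickle above stores only the classifier. To score new text later, the fitted vectorizer and TF-IDF transformer are needed as well, so that new reviews are mapped onto the same columns the model was trained on. One way to keep everything together is a scikit-learn `Pipeline`. The sketch below assumes a made-up toy corpus and uses `TfidfVectorizer`, which combines `CountVectorizer` and `TfidfTransformer` in one step. It is an alternative design, not what the notebook does.

```python
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical miniature training set
toy_reviews = [
    "loved this wonderful movie",
    "great acting and great plot",
    "terrible movie with bad actors",
    "bad plot and awful acting",
]
toy_labels = [1, 1, 0, 0]

text_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),   # learns the vocabulary and idf weights
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
text_pipeline.fit(toy_reviews, toy_labels)

# Serialise and restore the whole pipeline, vocabulary included
restored = pickle.loads(pickle.dumps(text_pipeline))

new_reviews = ["great movie", "awful actors"]
original_pred = text_pipeline.predict(new_reviews)
restored_pred = restored.predict(new_reviews)
print(original_pred, restored_pred)
```

Because the restored pipeline carries its own vocabulary, raw strings can be passed straight to `predict` without re-fitting anything.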


## Try It Yourself!

In [29]:
import numpy as np
import re
import nltk
from sklearn.datasets import load_files
import pickle
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfTransformer

movie_data = load_files(r"../review_polarity/txt_sentoken")
X, y = movie_data.data, movie_data.target
documents = []

new_X_test = ['I loved this movie so much!', 'This movie was bad and had terrible actors']
count = len(new_X_test)
new_X_test.extend(X)

# Preprocessing functions
def preprocessing(X):

    stemmer = WordNetLemmatizer()

    for sen in range(0, len(X)):
        # Remove all the special characters
        document = re.sub(r'\W', ' ', str(X[sen]))

        # remove all single characters
        document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

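        # Meant to remove a single character at the start; note that '\^' matches a
        # literal caret, so as written a leading single letter such as 'I' is kept
        document = re.sub(r'\^[a-zA-Z]\s+', ' ', document) 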

        # Substituting multiple spaces with single space
        document = re.sub(r'\s+', ' ', document, flags=re.I)

        # Removing prefixed 'b'
        document = re.sub(r'^b\s+', '', document)

        # Converting to Lowercase
        document = document.lower()

        # Lemmatization
        document = document.split()

        document = [stemmer.lemmatize(word) for word in document]
        document = ' '.join(document)

        documents.append(document)

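    # Caution: the vectorizer is re-fitted on the new reviews plus the original corpus,
    # so its columns match the saved model only if the resulting vocabulary is unchanged
    vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
    X1 = vectorizer.fit_transform(documents).toarray()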

    tfidfconverter = TfidfTransformer()
    X2 = tfidfconverter.fit_transform(X1).toarray()
    
    return X2

new_X2 = preprocessing(new_X_test)

In [30]:
with open('text_classifier', 'rb') as training_model:
    model = pickle.load(training_model)

for i in range(count):
    review = documents[i]
    prediction = model.predict([new_X2[i]])
    print("review: {}, pred: {}".format(review, prediction))
    

review: i loved this movie so much, pred: [1]
review: this movie wa bad and had terrible actor, pred: [0]
