## Text Classification Tutorial 
Credit to: https://stackabuse.com/text-classification-with-python-and-scikit-learn/

In [36]:
import numpy as np
import re
import nltk
from sklearn.datasets import load_files
import pickle
from nltk.corpus import stopwords

Show what the data looks like. In particular, cv001 in the negative reviews still contains positive words like "good".

In [37]:
movie_data = load_files(r"review_polarity/txt_sentoken")
X, y = movie_data.data, movie_data.target

In [69]:
print(X[0])

b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so called dark thriller , the devil ( gabriel byrne ) has come upon earth , to impregnate a woman ( robin tunney ) which happens every 1000 years , and basically destroy the world , but apparently god has chosen one man , and that one man is jericho cane ( arnold himself ) . \nwith the help of a trusty sidekick ( kevin pollack ) , they will stop at nothing to let the devil take over the world ! \nparts of this are actually so absurd , that they would fit right in with dogma . \nyes , the film is 

In [68]:
print(y[0])

0


0 is NEGATIVE, 1 is POSITIVE
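The 0/1 mapping comes from how `load_files` reads the folder layout: it sorts the subfolder names and numbers them in that order. The sketch below rebuilds that layout in a temporary directory to make the mapping visible. The folder names `neg` and `pos`, the file name, and the review texts are assumptions made up for illustration.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a tiny throwaway corpus: one subfolder per class, one text file per review.
# Folder names, file name, and texts are invented for this illustration.
with tempfile.TemporaryDirectory() as root:
    for label, text in [("pos", "a great film"), ("neg", "a dull film")]:
        folder = Path(root) / label
        folder.mkdir()
        (folder / "review_000.txt").write_text(text)

    # Files are read while the temporary directory still exists.
    toy = load_files(root)

# Subfolder names are sorted, so "neg" gets index 0 and "pos" gets index 1.
print(toy.target_names)

# Map each numeric label back to its folder name; the documents are raw bytes.
labels = [toy.target_names[t] for t in toy.target]
print(list(zip(labels, toy.data)))
```

On the real dataset, `movie_data.target_names` should show the same kind of mapping.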

## Preprocessing The Data

In [40]:
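from nltk.stem import WordNetLemmatizer

# This is a lemmatizer, not a stemmer. It needs the WordNet corpus,
# which can be fetched once with nltk.download('wordnet').
lemmatizer = WordNetLemmatizer()

documents = []

for raw_review in X:
    # Replace every non-word character with a space.
    # Note: str() on a bytes object returns its repr (b'...'), so each newline
    # turns into the two characters "\n" and leaves a stray "n" behind.
    document = re.sub(r'\W', ' ', str(raw_review))

    # Remove single letters surrounded by whitespace
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Remove a single letter at the start of the text. The caret must not be
    # escaped, otherwise the pattern looks for a literal '^' and never matches.
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

    # Collapse runs of whitespace into a single space
    document = re.sub(r'\s+', ' ', document)

    # Remove the 'b' prefix left by the bytes repr, if it is still present
    document = re.sub(r'^b\s+', '', document)

    # Convert to lowercase
    document = document.lower()

    # Lemmatize each word and rejoin the words into one string
    words = [lemmatizer.lemmatize(word) for word in document.split()]
    documents.append(' '.join(words))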

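The loop above cuts down the noise: it strips punctuation, isolated single letters, and extra whitespace, then lowercases and lemmatizes each review.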

In [41]:
documents[6]

'capsule trippy hyperspeed action machine from hong kong accomplished tsui hark nvan damme and rodman have nice chemistry the stunt are eyepopping and stuff get blowed up real good what more do you want ni admit wa all set to loathe double team it reeked of cheapjack timing oriented marketing stick dennis rodman in movie quick while he hot and do something about jean claude van damme flagging career while we re at it nsurprise double team transcends it dumb root and turn out to be mess of fun nbring some friend get some pretzel and have blast nvan damme is jack quinn an ex agent who is brought back in for one last mission you think any spy worth his shoe phone would run like hell when he hears those word nbut van damme character ha pregnant wife who also sculptor and some unpleasant pressure get used to get him to come through on this mission nhe been assigned to take down an old enemy terrorist named stavros mickey rourke looking oddly subdued who may be back up to his old trick nin t
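The stray `n` prefixes in the output above (`nvan`, `ni`, `nsurprise`) come from calling `str()` on a bytes object: the result is the bytes repr, in which every newline is the two characters `\` and `n`. A minimal sketch of the difference, assuming the reviews are UTF-8 (or plain ASCII) text; the `clean` helper and the sample bytes are made up for illustration and leave out lemmatization:

```python
import re

# Made-up sample in the same style as the reviews (bytes with an embedded newline).
raw = b"van damme is back . \nit's fun"


def clean(text):
    """Apply the same regex steps as the preprocessing loop, minus lemmatization."""
    text = re.sub(r"\W", " ", text)              # non-word characters -> space
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)  # isolated single letters
    text = re.sub(r"^[a-zA-Z]\s+", " ", text)    # single letter at the start
    text = re.sub(r"\s+", " ", text)             # collapse whitespace
    return text.strip().lower()


# str() keeps the repr: a leading b, quotes, and a literal backslash-n.
via_str = clean(str(raw))

# decode() turns the bytes into real text, so the newline is ordinary whitespace.
via_decode = clean(raw.decode("utf-8"))

print(via_str)     # van damme is back nit fun
print(via_decode)  # van damme is back it fun
```

Decoding also makes the separate "remove the `b` prefix" step unnecessary, since there is no repr prefix to remove.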

## Feature Engineering

In [42]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X1 = vectorizer.fit_transform(documents).toarray()

In [46]:
print(X1[10])

[0 0 0 ... 0 0 0]
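The row above is mostly zeros because each review uses only a small part of the 1500-word vocabulary. To see what `min_df` and `max_df` do, here is a small sketch on four made-up sentences. The thresholds are scaled down to fit the toy corpus and are not the values used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Four made-up documents. Document frequencies:
#   movie=3, good=2, bad=2, plot=1, acting=1
docs = [
    "good movie",
    "bad movie",
    "good plot",
    "movie bad acting",
]

# min_df=2   -> drop words that appear in fewer than 2 documents (plot, acting)
# max_df=0.7 -> drop words that appear in more than 70% of documents (movie, 3 of 4)
toy_vectorizer = CountVectorizer(min_df=2, max_df=0.7)
counts = toy_vectorizer.fit_transform(docs).toarray()

# The surviving vocabulary; columns of `counts` follow this alphabetical order.
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(counts)
```

`max_features=1500` works on top of these filters: of the words that survive, only the 1500 most frequent across the corpus are kept.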


In [44]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidfconverter = TfidfTransformer()
X2 = tfidfconverter.fit_transform(X1).toarray()

In [61]:
X2[0]

[0. 0. 0. ... 0. 0. 0.]


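`CountVectorizer` produces raw word counts. `TfidfTransformer` reweights them: each count (term frequency) is multiplied by the word's inverse document frequency, so words that appear in many reviews count for less than words concentrated in a few. Each row is then scaled to unit length.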
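One way to show what the transformer does is to redo the calculation by hand on a tiny count matrix and compare. This sketch assumes the default `TfidfTransformer` settings (smoothed IDF, L2 row normalization); the count matrix is made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up counts: 3 documents (rows) x 3 words (columns).
counts = np.array([
    [2, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
])

n_docs = counts.shape[0]

# Document frequency: in how many documents does each word appear?
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency, as used by the default settings.
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale every row to unit Euclidean length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# The library result for the same matrix.
from_sklearn = TfidfTransformer().fit_transform(counts).toarray()

print(idf)
print(manual)
print(np.allclose(manual, from_sklearn))
```

The first word appears in every document, so its IDF is the minimum value of 1, while the two rarer words get a larger weight.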

## Splitting The Data

In [59]:
from sklearn.model_selection import train_test_split
# Pass the document indices as a third array so each test row can be traced back to its original review
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(X2, y, list(range(len(y))), test_size=0.2, random_state=0)

In [62]:
print(indices_test)

[405, 1190, 1132, 731, 1754, 1178, 1533, 1303, 1857, 18, 1266, 1543, 249, 191, 721, 1896, 452, 1947, 1544, 1205, 1905, 1235, 799, 1594, 1091, 1357, 479, 1148, 511, 1931, 1257, 1105, 1938, 1750, 1058, 427, 1025, 156, 491, 995, 27, 1326, 1239, 17, 1622, 519, 361, 289, 1790, 1135, 1549, 1327, 76, 579, 279, 935, 475, 1786, 1023, 1556, 842, 1522, 1108, 1369, 47, 794, 918, 1220, 506, 353, 276, 881, 1835, 1703, 333, 1616, 135, 182, 1761, 1501, 930, 107, 390, 1165, 1698, 262, 1732, 260, 487, 303, 53, 1015, 148, 1442, 1580, 1999, 1927, 80, 1854, 264, 1979, 1273, 538, 1529, 37, 896, 1129, 1617, 1276, 878, 386, 1820, 977, 1074, 1542, 648, 1922, 77, 689, 9, 861, 6, 1083, 161, 1600, 1110, 962, 1385, 775, 1174, 823, 145, 240, 215, 1889, 465, 987, 1238, 760, 252, 1633, 781, 1451, 572, 85, 634, 317, 118, 1784, 1876, 443, 1900, 516, 1646, 30, 713, 254, 900, 587, 1317, 1887, 1811, 1121, 1334, 1918, 906, 1635, 1858, 796, 1076, 522, 1411, 124, 187, 517, 1945, 686, 342, 1787, 200, 1421, 1511, 1954, 233, 16
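Passing the index list as an extra array works because `train_test_split` applies the same shuffle to every array it receives. A small sketch on made-up data shows that the returned indices line up with the returned documents:

```python
from sklearn.model_selection import train_test_split

# Made-up documents and labels.
toy_docs = ["doc a", "doc b", "doc c", "doc d", "doc e"]
toy_labels = [0, 1, 0, 1, 0]
toy_indices = list(range(len(toy_docs)))

# All three arrays are split with the same permutation.
docs_train, docs_test, labels_train, labels_test, idx_train, idx_test = train_test_split(
    toy_docs, toy_labels, toy_indices, test_size=0.4, random_state=0
)

# Each test document can be looked up in the original list via its index.
for position, original_index in enumerate(idx_test):
    print(original_index, docs_test[position], toy_docs[original_index])
```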

## Training The Model

In [25]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
classifier.fit(X_train, y_train) 

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [26]:
y_pred = classifier.predict(X_test)

In [48]:
len(X_train)

1600

In [71]:
X[405]

b'this is crap , but , honestly , what older american audience is going to be able to resist seeing jack lemmon and james garner as bicker- ing ex-presidents ? \nespecially when their supporting players in- clude dan aykroyd as the current commander in chief , lauren bacall as a former first lady , and john heard as the dan quayle-ish vice president . \nyup , you\'re talkin\' pre-sold property here and , for warner brothers , the perfect fit into their now-ritual grumpy old men holiday slot . \nfor the non-discriminating viewer , my fellow americans is fine . \nthe raw star power alone will have audiences applauding this atrocious political- thriller road-comedy . \n ( they did in mine , heaven help us . ) \nfor the rest of us , the movie is immediately tiresome . \nthe tone is terrible and the banter is worse . \nforget wit-- lemmon and garner merely exchange profanities through most of the movie . \n ( has anyone counted the number of first penis references ? ) \nsure , some of the b

In [70]:
y_pred[0]

0
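`indices_test[0]` is 405, so `y_pred[0]` is the prediction for `X[405]`, the review printed above. The same lookup can be used to list the reviews the model got wrong. The `misclassified` helper below is hypothetical, and the sample values are made up rather than taken from the real run.

```python
def misclassified(indices_test, y_true, y_pred):
    """Return (original_index, true_label, predicted_label) for every wrong prediction."""
    return [
        (index, int(true), int(pred))
        for index, true, pred in zip(indices_test, y_true, y_pred)
        if true != pred
    ]


# Made-up values standing in for indices_test, y_test, and y_pred.
example_indices = [7, 2, 9, 4]
example_true = [0, 1, 1, 0]
example_pred = [0, 1, 0, 0]

errors = misclassified(example_indices, example_true, example_pred)
print(errors)  # [(9, 1, 0)]
```

With the real variables, `misclassified(indices_test, y_test, y_pred)` would give the positions in `X` of the reviews worth reading by hand.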

## Evaluating The Model

In [29]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

[[180  28]
 [ 30 162]]
              precision    recall  f1-score   support

           0       0.86      0.87      0.86       208
           1       0.85      0.84      0.85       192

    accuracy                           0.85       400
   macro avg       0.85      0.85      0.85       400
weighted avg       0.85      0.85      0.85       400

0.855
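Every number in the report can be recomputed from the confusion matrix, whose rows are the true labels and whose columns are the predicted labels. A short sketch of the arithmetic, using the matrix printed above:

```python
# Confusion matrix from the output above: rows = true label, columns = predicted label.
matrix = [[180, 28],
          [30, 162]]

tn, fp = matrix[0]  # true negatives, negatives predicted as positive
fn, tp = matrix[1]  # positives predicted as negative, true positives
total = tn + fp + fn + tp

# Accuracy: share of all reviews that were labeled correctly.
accuracy = (tn + tp) / total

# Precision: of the reviews predicted as a class, how many really belong to it.
precision_neg = tn / (tn + fn)
precision_pos = tp / (tp + fp)

# Recall: of the reviews that really belong to a class, how many were found.
recall_neg = tn / (tn + fp)
recall_pos = tp / (tp + fn)

# F1: harmonic mean of precision and recall.
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy {accuracy:.3f}")
print(f"class 0: precision {precision_neg:.2f} recall {recall_neg:.2f} f1 {f1_neg:.2f}")
print(f"class 1: precision {precision_pos:.2f} recall {recall_pos:.2f} f1 {f1_pos:.2f}")
```

The row sums (208 and 192) are the `support` column of the report.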


## Saving The Model

In [31]:
with open('text_classifier', 'wb') as picklefile:
    pickle.dump(classifier, picklefile)

In [32]:
with open('text_classifier', 'rb') as training_model:
    model = pickle.load(training_model)

In [33]:
y_pred2 = model.predict(X_test)

print(confusion_matrix(y_test, y_pred2))
print(classification_report(y_test, y_pred2))
print(accuracy_score(y_test, y_pred2)) 

[[180  28]
 [ 30 162]]
              precision    recall  f1-score   support

           0       0.86      0.87      0.86       208
           1       0.85      0.84      0.85       192

    accuracy                           0.85       400
   macro avg       0.85      0.85      0.85       400
weighted avg       0.85      0.85      0.85       400

0.855
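The pickle above holds only the classifier. To classify a new review later, the fitted vectorizer and TF-IDF transformer are needed too, because the model expects the same 1500 columns it was trained on. One possible approach, sketched here on made-up data, is to bundle the steps in a scikit-learn `Pipeline` and pickle that instead. The step names, training texts, and small forest size are assumptions for illustration, not the setup used in this tutorial.

```python
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Made-up training data: 1 = positive, 0 = negative.
train_texts = [
    "great fun and a great cast",
    "dull plot and bad acting",
    "fun and clever",
    "bad and boring",
]
train_labels = [1, 0, 1, 0]

# Vectorizer, TF-IDF weighting, and classifier fitted and stored as one object.
text_pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
])
text_pipeline.fit(train_texts, train_labels)

# Round-trip through pickle (in memory here; a file works the same way).
blob = pickle.dumps(text_pipeline)
restored = pickle.loads(blob)

# The restored pipeline accepts raw text directly.
new_reviews = ["a clever and fun film", "boring and dull"]
print(restored.predict(new_reviews))
```

Any text cleaning applied before training, such as the regex and lemmatization loop, would still have to be applied to new reviews before calling `predict`. Pickle files should only be loaded from trusted sources and with the same scikit-learn version that wrote them.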
