# NLP Classifier questions

The aim is to build a binary classifier that, given a set of unseen statements, correctly decides whether each one is a movie question or a StackOverflow post.

## Libraries

In [1]:
import pandas as pd
from sklearn.utils import shuffle
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix

# Classifiers
from sklearn.naive_bayes import GaussianNB
from sklearn import linear_model
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Metrics
from sklearn.metrics import f1_score

# Dimensionality Reduction
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

np.random.seed(42)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Colaboratory Specific Code

This mounts your Google Drive folder so you can access it from Colaboratory

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


We also want to change the working directory so that the file references work. Here I've placed the files for this class inside a folder called '0719_question' that is itself inside a folder called 'NLP_class'.

In [3]:
%cd drive/My\ Drive/NLP_class/0719_question
!ls

/content/drive/My Drive/NLP_class/0719_question
 dialogues.tsv			   Solution_hw_stackoverflow.ipynb
 July_2019_NLP_Questions.ipynb	   tagged_posts.tsv
'NLP Additional Exercise.docx'	   tfidf_vectorizer.pkl
'NLP class.ipynb'		   utils.py
 Question_hw_stackoverflow.ipynb


## Pre-process the data

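This function cleans the text: it lowercases it, replaces punctuation with spaces, deletes other symbols, and removes English stopwords.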

In [0]:
# Note: after cleaning, a line can end up with every word removed.
# The `if len(text) > 1` check near the end of text_prepare guards the trimming for that case.

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def text_prepare(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = text.lower() # lowercase text
    text = re.sub(REPLACE_BY_SPACE_RE, " ", text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = re.sub(BAD_SYMBOLS_RE, "", text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = " " + text + " " # pad so that the first and last words can match " word "
    for sw in STOPWORDS:
        text = text.replace(" "+sw+" ", " ") # delete stopwords from text
    text = re.sub('[ ][ ]+', " ", text) # collapse runs of spaces

    if len(text) > 1: # only trim if there are words present
        if text[0] == ' ':
            text = text[1:]
        if text[-1] == ' ':
            text = text[:-1]

    return text
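
A token-based variant of the same cleaning step can make the empty-result case easier to reason about. The sketch below is an alternative, not the function used in this notebook: `SMALL_STOPWORDS` is a hypothetical hard-coded subset standing in for the NLTK list, and `text_prepare_tokens` is an illustrative name.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real NLTK list is much longer.
SMALL_STOPWORDS = {"do", "you", "have", "to", "that", "the", "in", "a", "how"}

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')

def text_prepare_tokens(text, stop_words=SMALL_STOPWORDS):
    """Lowercase, strip symbols, then drop stopwords token by token."""
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(" ", text)
    text = BAD_SYMBOLS_RE.sub("", text)
    # split() discards all surrounding whitespace, so no manual trimming is needed
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)

print(repr(text_prepare_tokens("Calculate age in C#")))
print(repr(text_prepare_tokens("Do you have to do that?")))
```

With this variant an input made up only of stopwords comes back as an empty string, which is straightforward to filter out afterwards.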

### Inspect Data
Let's see what the data looks like

In [0]:
dialogues = pd.read_csv('dialogues.tsv', sep='\t')
tagged = pd.read_csv('tagged_posts.tsv', sep='\t')

In [6]:
dialogues.head(2)

Unnamed: 0,text,tag
0,Okay -- you're gonna need to learn how to lie.,dialogue
1,I'm kidding. You know how sometimes you just ...,dialogue


In [7]:
tagged.head(2)

Unnamed: 0,post_id,title,tag
0,9,Calculate age in C#,c#
1,16,Filling a DataSet or DataTable from a LINQ que...,c#


# Movie or StackOverflow?

### Create base dataset

Now we can drop the 'tag' columns of both and assign new labels. 

We assign label 0 to movie questions and label 1 to StackOverflow posts.

We also need to drop the 'post_id' column of the StackOverflow data and rename 'title' to 'text'.

In [8]:
movie = dialogues.drop(columns=['tag'])
movie['label'] = int(0)
movie.head(2)

Unnamed: 0,text,label
0,Okay -- you're gonna need to learn how to lie.,0
1,I'm kidding. You know how sometimes you just ...,0


In [9]:
stack = tagged.drop(columns=['post_id','tag'])
stack['label'] = int(1)
stack = stack.rename(columns={"title": "text"})
stack.head(2)

Unnamed: 0,text,label
0,Calculate age in C#,1
1,Filling a DataSet or DataTable from a LINQ que...,1


We now combine the two different data sets into one single data set.

In [10]:
data = pd.concat([movie, stack], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
data.head()

Unnamed: 0,text,label
0,Okay -- you're gonna need to learn how to lie.,0
1,I'm kidding. You know how sometimes you just ...,0
2,Like my fear of wearing pastels?,0
3,I figured you'd get to the good stuff eventually.,0
4,Thank God! If I had to hear one more story ab...,0


### Clean the text

In [11]:
data.shape

(615477, 2)

Here we can see that certain lines are entirely made up of stopwords and symbols. We need to account for this.

In [12]:
print("This line: ", data.loc[82457, 'text'])
print("\nBecomes: ", text_prepare(data.loc[82457, 'text']))

This line:  Do you have to do that?

Becomes:   
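
One way to account for such rows is to drop them once the text has been cleaned. The sketch below assumes that discarding them is acceptable for this task; the tiny frame and its values are illustrative only and are not taken from the real data.

```python
import pandas as pd

# Illustrative cleaned texts; the second and fourth entries lost every word during cleaning.
frame = pd.DataFrame({
    "clean_data": ["calculate age c#", " ", "let buyer beware", ""],
    "label": [1, 0, 0, 0],
})

# Keep only rows whose cleaned text still contains at least one non-space character.
has_words = frame["clean_data"].str.strip().str.len() > 0
kept = frame[has_words].reset_index(drop=True)

print(kept)
```

Whether to drop such rows or keep them as an explicit "empty" example is a design choice; dropping simply avoids feeding the vectorizer documents with no tokens.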


Now we run the cleaning function on the whole dataset.

In [13]:
%time cleaned_text = [text_prepare(x) for x in data['text']]

CPU times: user 39.5 s, sys: 80.1 ms, total: 39.6 s
Wall time: 39.7 s


In [14]:
clean_text = pd.DataFrame({'clean_data': cleaned_text})
clean_text.head()

Unnamed: 0,clean_data
0,okay youre gonna need learn lie
1,im kidding know sometimes become persona dont ...
2,like fear wearing pastels
3,figured youd get good stuff eventually
4,thank god hear one story coiffure


In [15]:
clean = pd.concat([clean_text, data], axis = 1, ignore_index=True)
clean = shuffle(clean)
clean.head()

Unnamed: 0,0,1,2
378098,syntax error insert statement c# oledb,Syntax error in INSERT INTO statement in c# ol...,1
274212,applying methods object private variables java...,applying methods to object and private variabl...,1
527013,xampp openssl errors calling openssl_pkey_new,xampp openssl errors when calling openssl_pkey...,1
25072,let buyer beware,Let the buyer beware.,0
491370,read write locking confusion,Read/Write locking confusion,1


In [16]:
clean = clean.drop(columns=[1])
clean.head(20)

Unnamed: 0,0,2
378098,syntax error insert statement c# oledb,1
274212,applying methods object private variables java...,1
527013,xampp openssl errors calling openssl_pkey_new,1
25072,let buyer beware,0
491370,read write locking confusion,1
176289,im alright,0
323152,get last segment regular expression,1
588685,wcf service completely locked,1
102626,ask theres nothing new coke,0
380088,complex example project java desktopstyle gui,1


Check that there are only 2 labels (binary) and count the unique cleaned texts under each.

In [17]:
unique_labels = clean.groupby(2).nunique()
unique_labels.head()

Unnamed: 0_level_0,0,2
2,Unnamed: 1_level_1,Unnamed: 2_level_1
0,206948,1
1,394522,1


### Train Test Validation Split

In [18]:
clean_data = clean[0]
clean_data.head()

378098               syntax error insert statement c# oledb
274212    applying methods object private variables java...
527013        xampp openssl errors calling openssl_pkey_new
25072                                      let buyer beware
491370                         read write locking confusion
Name: 0, dtype: object

In [19]:
clean_labels = clean[2]
clean_labels.head()

378098    1
274212    1
527013    1
25072     0
491370    1
Name: 2, dtype: int64

First we split off the validation set as 20% of the overall data set.

In [0]:
df_data, X_val, df_labels, y_val = train_test_split(
    clean_data, clean_labels, test_size=0.2, random_state=42, shuffle=False)

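Then we split the remaining 80% into training and test data. Taking 25% of the remainder as test data gives 20% of the whole. The data was already shuffled above, so `shuffle=False` is fine here.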

In [0]:
X_train, X_test, y_train, y_test = train_test_split(df_data, df_labels, test_size=0.25,
                                                        random_state=42,
                                                        shuffle=False)

The end result is 60% training data, 20% test data and 20% validation data

In [22]:
print("Shape of X_train", X_train.shape)
print("Shape of y_train", y_train.shape)
print("Shape of X_test", X_test.shape)
print("Shape of y_test", y_test.shape)
print("Shape of X_val", X_val.shape)
print("Shape of y_val", y_val.shape)

Shape of X_train (369285,)
Shape of y_train (369285,)
Shape of X_test (123096,)
Shape of y_test (123096,)
Shape of X_val (123096,)
Shape of y_val (123096,)
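
The 60/20/20 figure follows from the two successive splits: 20% is held out first, then 25% of the remaining 80%. The sketch below reproduces the sizes printed above from the row count reported by `data.shape`. It assumes the held-out share is rounded up at each step, which is consistent with the shapes shown; `two_stage_split_sizes` is an illustrative helper, not part of the notebook.

```python
import math

def two_stage_split_sizes(n_rows, val_fraction=0.2, test_fraction_of_rest=0.25):
    """Return (train, test, val) sizes for two successive fractional splits."""
    # Assumption: the held-out share is rounded up and the other side keeps the rest.
    n_val = math.ceil(n_rows * val_fraction)
    n_rest = n_rows - n_val
    n_test = math.ceil(n_rest * test_fraction_of_rest)
    n_train = n_rest - n_test
    return n_train, n_test, n_val

sizes = two_stage_split_sizes(615477)
print(sizes)
print([round(s / 615477, 3) for s in sizes])
```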


### Apply TF-IDF Weighting

We first need to learn the vocabulary from our training data.

In [23]:
vectorizer = TfidfVectorizer(norm='l1', token_pattern=r'(\S+)', min_df=5, max_df=0.9, ngram_range=(1,2))

%time vectorizer.fit_transform(X_train)
# print(vectorizer.get_feature_names(10))

CPU times: user 9.94 s, sys: 259 ms, total: 10.2 s
Wall time: 10.2 s


<369285x65395 sparse matrix of type '<class 'numpy.float64'>'
	with 2693828 stored elements in Compressed Sparse Row format>

Apply this learned TF-IDF transform to each of our splits.

In [24]:
%time train_tf = vectorizer.transform(X_train)
# train_tf = csr_matrix.mean(train_tf, axis=0)

%time test_tf = vectorizer.transform(X_test)
# test_tf = csr_matrix.mean(test_tf, axis=0)

%time val_tf = vectorizer.transform(X_val)
# val_tf = csr_matrix.mean(val_tf, axis=0)


CPU times: user 4.93 s, sys: 7.95 ms, total: 4.94 s
Wall time: 4.95 s
CPU times: user 1.6 s, sys: 1.99 ms, total: 1.61 s
Wall time: 1.61 s
CPU times: user 1.63 s, sys: 1.97 ms, total: 1.63 s
Wall time: 1.64 s
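
To make the fit-then-transform pattern concrete, here is a minimal sketch on a toy corpus. The documents are made up for illustration, and `min_df`/`max_df` are left at their defaults because the corpus is too small for the thresholds used in the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "syntax error insert statement c#",
    "let buyer beware",
    "read write locking confusion",
]
test_docs = ["syntax error c# unknownword"]

vec = TfidfVectorizer(norm="l1", token_pattern=r"\S+", ngram_range=(1, 2))
train_m = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_m = vec.transform(test_docs)        # reuses them; unseen tokens are ignored

print(train_m.shape, test_m.shape)
print("unknownword" in vec.vocabulary_)
# With norm='l1' the weights in each non-empty row sum to 1.
print(np.allclose(train_m.sum(axis=1), 1.0))
```

Because the vocabulary is fixed at fit time, every split ends up with the same number of columns, which is what the classifiers require.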


We can see that each split has been transformed into a Compressed Sparse Row (CSR) matrix. This matters because CSR stores only the non-zero entries, which keeps a matrix with 65395 columns small enough to hold in memory.

In [25]:
print("Shape of train_tf", train_tf.shape)

print("Shape of test_tf", test_tf.shape)

print("Shape of val_tf", val_tf.shape)

Shape of train_tf (369285, 65395)
Shape of test_tf (123096, 65395)
Shape of val_tf (123096, 65395)


## Classify CSR

First we will attempt to classify the full CSR dataset

### Naive Bayes
This crashes and resets the runtime: `GaussianNB` needs a dense array, and `train_tf.todense()` would have to allocate 369285 × 65395 float64 values, roughly 193 GB.

In [0]:
# nb = GaussianNB()
# nb.fit(train_tf.todense(), y_train.astype(int))

In [0]:
# nb_score_test = nb.score(test_tf.todense(), y_test.astype(int))
# nb_score_test
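
A back-of-the-envelope estimate shows why the dense conversion cannot work on a typical runtime. The shape and the number of stored elements are taken from the `fit_transform` output earlier in this section; the 32-bit index size is an assumption about how the CSR arrays are stored.

```python
# Shape and non-zero count reported for the training matrix.
n_rows, n_cols, n_nonzero = 369285, 65395, 2693828

BYTES_FLOAT64 = 8
BYTES_INT32 = 4  # assumption: the CSR index arrays use 32-bit integers

# A dense array stores every cell, zero or not.
dense_bytes = n_rows * n_cols * BYTES_FLOAT64

# CSR stores one value and one column index per non-zero,
# plus one row pointer per row (and one extra at the end).
csr_bytes = n_nonzero * (BYTES_FLOAT64 + BYTES_INT32) + (n_rows + 1) * BYTES_INT32

print("dense: %.1f GB" % (dense_bytes / 1e9))
print("CSR:   %.1f MB" % (csr_bytes / 1e6))
```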

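### Linear Regression
Note that `LinearRegression.score` returns the coefficient of determination (R²), not classification accuracy, so the number below is not directly comparable with the classifier scores.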

In [29]:
linr = linear_model.LinearRegression(n_jobs = -1)
%time linr.fit(train_tf, y_train.astype(int))

CPU times: user 1min 14s, sys: 45.7 s, total: 2min
Wall time: 1min 1s


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=False)

In [30]:
%time linr_score_test = linr.score(test_tf, y_test.astype(int))
linr_score_test*100

CPU times: user 9.41 ms, sys: 11.2 ms, total: 20.6 ms
Wall time: 13.3 ms


92.57128540397106
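
To compare a regressor with the classifiers on equal terms, its continuous predictions can be turned into labels first. The sketch below uses a tiny made-up dataset and a 0.5 cut-off, which is an assumed (though common) threshold for 0/1 targets rather than something the notebook does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy one-feature problem with binary targets.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

reg = LinearRegression().fit(X, y)

r2 = reg.score(X, y)  # coefficient of determination, not accuracy

# Threshold the continuous predictions to obtain class labels.
pred_labels = (reg.predict(X) >= 0.5).astype(int)
accuracy = (pred_labels == y).mean()

print("R^2:", r2)
print("accuracy:", accuracy)
```

On this toy data the two numbers differ even though every label is predicted correctly, which is why R² should not be read as a percentage of correct answers.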

### Logistic Regression

In [31]:
lr = linear_model.LogisticRegression(C=10.0, penalty='l2', solver = 'newton-cg', n_jobs=-1)
%time lr.fit(train_tf, y_train)

CPU times: user 58.6 ms, sys: 101 ms, total: 159 ms
Wall time: 9.58 s


LogisticRegression(C=10.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=-1, penalty='l2',
                   random_state=None, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)

In [32]:
%time lr_score_test = lr.score(test_tf, y_test.astype(int))
lr_score_test*100

CPU times: user 16.5 ms, sys: 1.84 ms, total: 18.4 ms
Wall time: 19.7 ms


98.8480535516995
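
`f1_score` is imported at the top of the notebook but never used. Since the two classes are not the same size, it could complement the accuracy figure. The sketch below uses made-up labels purely to show the calls; it does not report results for the model above.

```python
from sklearn.metrics import accuracy_score, f1_score

# Illustrative labels only.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)
f1_pos = f1_score(y_true, y_pred)               # F1 treating label 1 as positive
f1_neg = f1_score(y_true, y_pred, pos_label=0)  # F1 treating label 0 as positive

print("accuracy:", acc)
print("F1 (label 1):", f1_pos)
print("F1 (label 0):", f1_neg)
```

In the notebook the equivalent call would take `y_test` and `lr.predict(test_tf)` as its two arguments.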

### Support Vector Machine
With this many samples and features, `SVC` takes too long to train to be practical here, so the cells are left commented out.

In [0]:
# clf = svm.SVC(kernel='linear', C=1, random_state=42)
# %time clf.fit(train_tf, y_train.astype(int))

In [0]:
# %time clf_score = clf.score(test_tf, y_test.astype(int))
# print("SVM accuracy on test:\t %f" % clf_score)
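
If a linear SVM is still wanted, `LinearSVC` is a different estimator that accepts sparse input directly and generally scales better with the number of samples than `SVC(kernel='linear')`. The sketch below swaps it in on a toy corpus; it is not the model used in this notebook, the documents and labels are made up, and it is not guaranteed to give the same results as `SVC`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy corpus: label 0 for dialogue-like lines, 1 for programming titles.
docs = [
    "let buyer beware",
    "im alright",
    "syntax error insert statement c#",
    "read write locking confusion",
]
labels = [0, 0, 1, 1]

features = TfidfVectorizer(token_pattern=r"\S+").fit_transform(docs)

clf_linear = LinearSVC(C=1.0, random_state=42)
clf_linear.fit(features, labels)  # sparse CSR input is accepted as-is

predictions = clf_linear.predict(features)
print(predictions)
```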

### Random Forest
Training on the full sparse matrix is slow: the fit below takes close to 3 minutes of wall time.

In [26]:
rf = RandomForestClassifier(criterion='gini', max_depth=5, 
                               min_samples_leaf=5, min_samples_split=2, 
                               n_estimators = 220, oob_score=True, 
                               max_features=0.5, n_jobs = -1, random_state=42)

%time rf.fit(train_tf, y_train.astype(int))

CPU times: user 5min 19s, sys: 835 ms, total: 5min 20s
Wall time: 2min 46s


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=5, max_features=0.5, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=220,
                       n_jobs=-1, oob_score=True, random_state=42, verbose=0,
                       warm_start=False)

In [27]:
%time rf_score = rf.score(test_tf, y_test.astype(int))
print("RF accuracy on test:\t %f" % rf_score)

CPU times: user 2.44 s, sys: 40.1 ms, total: 2.48 s
Wall time: 1.33 s
RF accuracy on test:	 0.752665


We could use a grid search to find optimal parameters for a classifier.

In [0]:
# rf_grid = RandomForestClassifier(random_state=42)

# param_grid = { 
#     'n_estimators': [100, 120, 140, 160, 180, 200, 220, 240, 260],
#     'min_samples_leaf': [3, 5, 7],
#     'min_samples_split': [2, 3, 4, 5, 6],
#     'max_depth' : [5, 10, 15, 20, 25, 30, 35, 40, 45],
#     'criterion' :['gini', 'entropy']
# }


# CV_rf = GridSearchCV(estimator=rf_grid, param_grid=param_grid, cv= 5, n_jobs = -1, 
#                      verbose = 2)
# CV_rf.fit(train_tf, y_train.astype(int))
# CV_rf.best_params_
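
Before running a grid like this it is worth counting how many models it implies. The sketch below multiplies out the commented grid above with 5-fold cross-validation, ignoring the final refit on the full training set.

```python
import math

param_grid = {
    'n_estimators': [100, 120, 140, 160, 180, 200, 220, 240, 260],
    'min_samples_leaf': [3, 5, 7],
    'min_samples_split': [2, 3, 4, 5, 6],
    'max_depth': [5, 10, 15, 20, 25, 30, 35, 40, 45],
    'criterion': ['gini', 'entropy'],
}
cv_folds = 5

# An exhaustive grid search tries every combination of the listed values.
n_combinations = math.prod(len(values) for values in param_grid.values())
n_fits = n_combinations * cv_folds

print("combinations:", n_combinations)
print("model fits:", n_fits)
```

Given that a single forest takes minutes to train on this data, a grid of this size would need to be cut down considerably or replaced by a randomized search.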

## Scores on CSR Validation Set

In [0]:
# nb_score_val = nb.score(val_tf.todense(), y_val.astype(int))
# nb_score_val

In [39]:
linr_score_val = linr.score(val_tf, y_val.astype(int))
linr_score_val

0.92762086921615

In [40]:
lr_score_val = lr.score(val_tf, y_val.astype(int))
lr_score_val

0.9890654448560473

In [0]:
# clf_score_val = clf.score(val_tf, y_val.astype(int))
# clf_score_val

In [28]:
rf_score_val = rf.score(val_tf, y_val.astype(int))
rf_score_val

0.7531845064015078

## Classify Reduced
We will now reduce the dimensionality of the dataset and retry classification.

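### Reduce Dimensions using SVD
The execution counts and output shapes in this section (238120 training rows rather than 369285) show that these cells were last run on the StackOverflow language data from the second part of the notebook, so the scores recorded here are not results for the movie/StackOverflow task.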

In [0]:
NUM_DIMEN = 50
svd = TruncatedSVD(n_components=NUM_DIMEN)

In [70]:
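%time train_sv = svd.fit_transform(train_tf)
# explained_variance_ holds absolute variances; explained_variance_ratio_ gives the share of total variance retained
var = svd.explained_variance_.sum() * 100
print("Training data variance with %d SVD components is %f" % (NUM_DIMEN, var))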

CPU times: user 8.18 s, sys: 1.83 s, total: 10 s
Wall time: 6.61 s
Training data variance with 50 SVD components is 1.728273


In [71]:
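# Reuse the SVD fitted on train_tf; fit_transform here would learn a new basis per split.
# (The scores recorded below came from the fit_transform version.)
%time test_sv = svd.transform(test_tf)
%time val_sv = svd.transform(val_tf)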

CPU times: user 3.37 s, sys: 1.27 s, total: 4.64 s
Wall time: 2.59 s
CPU times: user 3.38 s, sys: 1.34 s, total: 4.72 s
Wall time: 2.61 s
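
A reduction like this is normally learned once on the training split and then reused unchanged, so that every split is expressed in the same coordinates. The sketch below illustrates that on random toy data; the array sizes, the variable names and the choice of 3 components are arbitrary.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(42)
train = rng.rand(40, 12)
test = rng.rand(10, 12)

svd_demo = TruncatedSVD(n_components=3, random_state=42)
train_red = svd_demo.fit_transform(train)  # learn the components on train only
test_red = svd_demo.transform(test)        # reuse them for the other split

# transform is a projection onto the components learned from train.
projected = test @ svd_demo.components_.T
print(np.allclose(test_red, projected))

# Share of the training variance kept by the 3 components.
print(svd_demo.explained_variance_ratio_.sum())
```

Calling `fit_transform` separately on each split would instead give each one its own components, and a classifier trained on one set of coordinates cannot be expected to work on another.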


In [72]:
print("Shape of train_sv", train_sv.shape)

print("Shape of test_sv", test_sv.shape)

print("Shape of val_sv", val_sv.shape)

Shape of train_sv (238120, 50)
Shape of test_sv (79374, 50)
Shape of val_sv (79374, 50)


### Naive Bayes
With the reduced dimensionality we can use a Naive Bayes classifier

In [73]:
nb_sv = GaussianNB()
%time nb_sv.fit(train_sv, y_train.astype(int))

CPU times: user 231 ms, sys: 4.02 ms, total: 236 ms
Wall time: 238 ms


GaussianNB(priors=None, var_smoothing=1e-09)

In [74]:
%time nb_score_test_sv = nb_sv.score(test_sv, y_test.astype(int))
nb_score_test_sv*100

CPU times: user 286 ms, sys: 20 ms, total: 306 ms
Wall time: 307 ms


18.271726257968606

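### Linear Regression
`LinearRegression.score` returns R² rather than accuracy. A negative value means the predictions fit the test labels worse than a constant prediction of the mean label would.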

In [75]:
linr_sv = linear_model.LinearRegression(n_jobs = -1)
%time linr_sv.fit(train_sv, y_train.astype(int))

CPU times: user 1.11 s, sys: 76.2 ms, total: 1.18 s
Wall time: 754 ms


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=False)

In [76]:
%time linr_score_test_sv = linr_sv.score(test_sv, y_test.astype(int))
linr_score_test_sv*100

CPU times: user 18.1 ms, sys: 3.04 ms, total: 21.2 ms
Wall time: 13.2 ms


-44.693470645871656

### Logistic Regression

In [77]:
lr_sv = linear_model.LogisticRegression(C=1.0, penalty='l2', solver = 'newton-cg', n_jobs=-1)
%time lr_sv.fit(train_sv, y_train)



CPU times: user 101 ms, sys: 165 ms, total: 266 ms
Wall time: 1min 35s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=-1, penalty='l2',
                   random_state=None, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)

In [78]:
%time lr_score_test_sv = lr_sv.score(test_sv, y_test.astype(int))
lr_score_test_sv*100

CPU times: user 42.7 ms, sys: 15 ms, total: 57.7 ms
Wall time: 48.6 ms


21.60531156298032

### Support Vector Machine
Even at low dimension this takes a lot of time to run.

In [0]:
# clf_sv = svm.SVC(kernel='linear', C=1, random_state=42)
# %time clf_sv.fit(train_sv, y_train.astype(int))

In [0]:
# %time clf_score_sv = clf_sv.score(test_sv, y_test.astype(int))
# print("SVM accuracy on test:\t %f" % clf_score_sv)

### Random Forest

In [79]:
rf_sv = RandomForestClassifier(criterion='gini', max_depth=5, 
                               min_samples_leaf=5, min_samples_split=2, 
                               n_estimators = 220, oob_score=True, 
                               max_features=0.5, n_jobs = -1, random_state=42)

%time rf_sv.fit(train_sv, y_train.astype(int))

CPU times: user 15min 24s, sys: 959 ms, total: 15min 25s
Wall time: 7min 54s


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=5, max_features=0.5, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=220,
                       n_jobs=-1, oob_score=True, random_state=42, verbose=0,
                       warm_start=False)

In [80]:
%time rf_score_sv = rf_sv.score(test_sv, y_test.astype(int))
print("RF accuracy on test:\t %f" % rf_score_sv)

CPU times: user 2.08 s, sys: 59.2 ms, total: 2.13 s
Wall time: 1.23 s
RF accuracy on test:	 0.210850


# Which Programming Language?

### Clean the Text and Assign Labels

First we load in the StackOverflow dataset and drop the unused post_id column

In [29]:
tagged.head()

Unnamed: 0,post_id,title,tag
0,9,Calculate age in C#,c#
1,16,Filling a DataSet or DataTable from a LINQ que...,c#
2,39,Reliable timer in a console application,c#
3,42,Best way to allow plugins for a PHP application,php
4,59,"How do I get a distinct, ordered list of names...",c#


In [30]:
stackover = tagged.drop(columns=['post_id'])
stackover.head()

Unnamed: 0,title,tag
0,Calculate age in C#,c#
1,Filling a DataSet or DataTable from a LINQ que...,c#
2,Reliable timer in a console application,c#
3,Best way to allow plugins for a PHP application,php
4,"How do I get a distinct, ordered list of names...",c#


In [31]:
relevant_stack = [text_prepare(x) for x in stackover['title']]
clean_titles = pd.DataFrame({'clean_data': relevant_stack})
clean_titles.head()

Unnamed: 0,clean_data
0,calculate age c#
1,filling dataset datatable linq query result set
2,reliable timer console application
3,best way allow plugins php application
4,get distinct ordered list names datatable usin...


In [32]:
clean_stack = pd.concat([clean_titles, stackover], axis = 1, ignore_index=True)
clean_stack = shuffle(clean_stack)
clean_stack.head()

Unnamed: 0,0,1,2
360370,systemruntimecache expiring,System.Runtime.Cache not expiring?,c#
169070,php convert server timestamp users timezone,PHP: How do I convert a server timestamp to th...,php
347961,stripping slashes mysql_real_escape_string out...,Stripping slashes from mysql_real_escape_strin...,php
329705,appending contents two richtextbox single rich...,Appending Contents of two RichTextbox as a sin...,c#
100740,converting linq memberexpression lambda work c...,Converting Linq MemberExpression lambda to wor...,c#


In [33]:
clean_stack = clean_stack.drop(columns=[1])
clean_stack.head(15)

Unnamed: 0,0,2
360370,systemruntimecache expiring,c#
169070,php convert server timestamp users timezone,php
347961,stripping slashes mysql_real_escape_string out...,php
329705,appending contents two richtextbox single rich...,c#
100740,converting linq memberexpression lambda work c...,c#
343256,android sdk running functions background,java
98188,problem function vbnet,vb
40870,windows nonblocking stdin redirected pipe,c_cpp
22890,listing odbc data sources c#,c#
308914,iphone app windows android mobile,c_cpp


Check how many unique labels we have

In [34]:
unique_languages = clean_stack.groupby(2).nunique()
unique_languages.head(15)

Unnamed: 0_level_0,0,2
2,Unnamed: 1_level_1,Unnamed: 2_level_1
c#,93828,1
c_cpp,63917,1
java,63150,1
javascript,49697,1
php,56747,1
python,33263,1
r,2390,1
ruby,23572,1
swift,3,1
vb,8402,1


Now we convert these categorical labels into numerical ones.

In [35]:
clean_stack[2] = clean_stack[2].astype('category')
clean_stack["label"] = clean_stack[2].cat.codes
clean_stack.head(10)

Unnamed: 0,0,2,label
360370,systemruntimecache expiring,c#,0
169070,php convert server timestamp users timezone,php,4
347961,stripping slashes mysql_real_escape_string out...,php,4
329705,appending contents two richtextbox single rich...,c#,0
100740,converting linq memberexpression lambda work c...,c#,0
343256,android sdk running functions background,java,2
98188,problem function vbnet,vb,9
40870,windows nonblocking stdin redirected pipe,c_cpp,1
22890,listing odbc data sources c#,c#,0
308914,iphone app windows android mobile,c_cpp,1


In [36]:
clean_stack = clean_stack.drop(columns=[2])
clean_stack.head()

Unnamed: 0,0,label
360370,systemruntimecache expiring,0
169070,php convert server timestamp users timezone,4
347961,stripping slashes mysql_real_escape_string out...,4
329705,appending contents two richtextbox single rich...,0
100740,converting linq memberexpression lambda work c...,0


### Train Test Validation Split

In [37]:
stack_data = clean_stack[0]
stack_data.head()

360370                          systemruntimecache expiring
169070          php convert server timestamp users timezone
347961    stripping slashes mysql_real_escape_string out...
329705    appending contents two richtextbox single rich...
100740    converting linq memberexpression lambda work c...
Name: 0, dtype: object

In [38]:
stack_labels = clean_stack['label']
stack_labels.head()

360370    0
169070    4
347961    4
329705    0
100740    0
Name: label, dtype: int8

First we split off the validation set as 20% of the overall data set.

In [0]:
df_data, X_val, df_labels, y_val = train_test_split(
    stack_data, stack_labels, test_size=0.2, random_state=42, shuffle=False)

Then we split the remaining 80% into training and test data. Taking 25% of the remainder as test data gives 20% of the whole.

In [0]:
X_train, X_test, y_train, y_test = train_test_split(df_data, df_labels, test_size=0.25,
                                                        random_state=42,
                                                        shuffle=False)

The end result is 60% training data, 20% test data and 20% validation data

In [41]:
print("Shape of X_train", X_train.shape)
print("Shape of y_train", y_train.shape)
print("Shape of X_test", X_test.shape)
print("Shape of y_test", y_test.shape)
print("Shape of X_val", X_val.shape)
print("Shape of y_val", y_val.shape)

Shape of X_train (238120,)
Shape of y_train (238120,)
Shape of X_test (79374,)
Shape of y_test (79374,)
Shape of X_val (79374,)
Shape of y_val (79374,)


### Apply TF-IDF Weighting

We first need to learn the vocabulary from our training data.

In [42]:
vectorizer = TfidfVectorizer(norm='l1', token_pattern=r'(\S+)', min_df=5, max_df=0.9, ngram_range=(1,2))

%time vectorizer.fit_transform(X_train)
# print(vectorizer.get_feature_names(10))

CPU times: user 5.19 s, sys: 40 ms, total: 5.23 s
Wall time: 5.24 s


<238120x39389 sparse matrix of type '<class 'numpy.float64'>'
	with 1642493 stored elements in Compressed Sparse Row format>

Apply this learned TF-IDF transform to each of our splits.

In [43]:
%time train_tf = vectorizer.transform(X_train)
# train_tf = csr_matrix.mean(train_tf, axis=0)

%time test_tf = vectorizer.transform(X_test)
# test_tf = csr_matrix.mean(test_tf, axis=0)

%time val_tf = vectorizer.transform(X_val)
# val_tf = csr_matrix.mean(val_tf, axis=0)


CPU times: user 2.82 s, sys: 9.99 ms, total: 2.83 s
Wall time: 2.83 s
CPU times: user 934 ms, sys: 2.99 ms, total: 937 ms
Wall time: 937 ms
CPU times: user 942 ms, sys: 4 ms, total: 946 ms
Wall time: 946 ms


We can see that each split has been transformed into a Compressed Sparse Row (CSR) matrix. This matters because CSR stores only the non-zero entries, which keeps a matrix with 39389 columns small enough to hold in memory.

In [44]:
print("Shape of train_tf", train_tf.shape)

print("Shape of test_tf", test_tf.shape)

print("Shape of val_tf", val_tf.shape)

Shape of train_tf (238120, 39389)
Shape of test_tf (79374, 39389)
Shape of val_tf (79374, 39389)


## Classify CSR

First we will attempt to classify the full CSR dataset

### Naive Bayes
This crashes and resets the runtime: `GaussianNB` needs a dense array, and `train_tf.todense()` would have to allocate 238120 × 39389 float64 values, roughly 75 GB.

In [0]:
# nb = GaussianNB()
# nb.fit(train_tf.todense(), y_train.astype(int))

In [0]:
# nb_score_test = nb.score(test_tf.todense(), y_test.astype(int))
# nb_score_test

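### Linear Regression
Regressing on the integer category codes treats the languages as if they were ordered numbers, so this is only a rough baseline. `LinearRegression.score` returns R², not accuracy.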

In [47]:
linr = linear_model.LinearRegression(n_jobs = -1)
%time linr.fit(train_tf, y_train.astype(int))

CPU times: user 19.6 s, sys: 12.4 s, total: 32 s
Wall time: 16.3 s


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=False)

In [48]:
%time linr_score_test = linr.score(test_tf, y_test.astype(int))
linr_score_test*100

CPU times: user 7.55 ms, sys: 2.99 ms, total: 10.5 ms
Wall time: 10.1 ms


45.049126542735095

### Logistic Regression
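This is automatically a multi-class classifier. With the scikit-learn version used here (the repr below shows `multi_class='warn'`), it should effectively be a One-vs-Rest logistic regression.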

In [49]:
# lr = linear_model.LogisticRegression(C=1.0, penalty='l2', solver = 'saga', n_jobs=-1)
lr = linear_model.LogisticRegression(solver='newton-cg',C=5, penalty='l2',n_jobs=-1)

%time lr.fit(train_tf, y_train)



CPU times: user 102 ms, sys: 82.5 ms, total: 185 ms
Wall time: 39.2 s


LogisticRegression(C=5, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=-1, penalty='l2',
                   random_state=None, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)

In [50]:
%time lr_score_test = lr.score(test_tf, y_test.astype(int))
lr_score_test*100

CPU times: user 26.8 ms, sys: 1.06 ms, total: 27.9 ms
Wall time: 28.7 ms


80.61959835714467

### Support Vector Machine
With this many samples and features, `SVC` takes too long to train to be practical here, so the cells are left commented out.

In [0]:
# clf = svm.SVC(kernel='linear', C=1, random_state=42)
# %time clf.fit(train_tf, y_train.astype(int))

In [0]:
# %time clf_score = clf.score(test_tf, y_test.astype(int))
# print("SVM accuracy on test:\t %f" % clf_score)

### Random Forest
Training on the full sparse matrix is slow: the fit below takes about 1 min 40 s of wall time.

In [53]:
rf = RandomForestClassifier(criterion='gini', max_depth=5, 
                               min_samples_leaf=5, min_samples_split=2, 
                               n_estimators = 220, oob_score=True, 
                               max_features=0.5, n_jobs = -1, random_state=42)

%time rf.fit(train_tf, y_train.astype(int))

CPU times: user 3min 9s, sys: 425 ms, total: 3min 9s
Wall time: 1min 39s


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=5, max_features=0.5, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=220,
                       n_jobs=-1, oob_score=True, random_state=42, verbose=0,
                       warm_start=False)

In [54]:
%time rf_score = rf.score(test_tf, y_test.astype(int))
print("RF accuracy on test:\t %f" % rf_score)

CPU times: user 2.17 s, sys: 53.2 ms, total: 2.22 s
Wall time: 1.23 s
RF accuracy on test:	 0.515005


## Scores on CSR Validation Set

In [0]:
# nb_score_val = nb.score(val_tf.todense(), y_val.astype(int))
# nb_score_val

In [81]:
linr_score_val = linr.score(val_tf, y_val.astype(int))
linr_score_val

0.4470098302511871

In [82]:
lr_score_val = lr.score(val_tf, y_val.astype(int))
lr_score_val

0.8082873485020283

In [0]:
# clf_score_val = clf.score(val_tf, y_val.astype(int))
# clf_score_val

In [83]:
rf_score_val = rf.score(val_tf, y_val.astype(int))
rf_score_val

0.5152190893743543

## Classify Reduced
We will now reduce the dimensionality of the dataset and retry classification.

### Reduce Dimensions using SVD

In [0]:
NUM_DIMEN = 50
svd = TruncatedSVD(n_components=NUM_DIMEN)

In [56]:
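%time train_sv = svd.fit_transform(train_tf)
# explained_variance_ holds absolute variances; explained_variance_ratio_ gives the share of total variance retained
var = svd.explained_variance_.sum() * 100
print("Training data variance with %d SVD components is %f" % (NUM_DIMEN, var))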

CPU times: user 8.19 s, sys: 1.97 s, total: 10.2 s
Wall time: 6.77 s
Training data variance with 50 SVD components is 1.728230


In [57]:
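# Reuse the SVD fitted on train_tf; fit_transform here would learn a new basis per split.
# (The scores recorded below came from the fit_transform version.)
%time test_sv = svd.transform(test_tf)
%time val_sv = svd.transform(val_tf)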

CPU times: user 3.33 s, sys: 1.32 s, total: 4.66 s
Wall time: 2.57 s
CPU times: user 3.38 s, sys: 1.3 s, total: 4.69 s
Wall time: 2.56 s


In [58]:
print("Shape of train_sv", train_sv.shape)

print("Shape of test_sv", test_sv.shape)

print("Shape of val_sv", val_sv.shape)

Shape of train_sv (238120, 50)
Shape of test_sv (79374, 50)
Shape of val_sv (79374, 50)


### Naive Bayes
With the reduced dimensionality we can use a Naive Bayes classifier

In [59]:
nb_sv = GaussianNB()
%time nb_sv.fit(train_sv, y_train.astype(int))

CPU times: user 233 ms, sys: 8.01 ms, total: 241 ms
Wall time: 252 ms


GaussianNB(priors=None, var_smoothing=1e-09)

In [60]:
%time nb_score_test_sv = nb_sv.score(test_sv, y_test.astype(int))
nb_score_test_sv*100

CPU times: user 270 ms, sys: 22.1 ms, total: 292 ms
Wall time: 293 ms


18.446846574445033

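### Linear Regression
`LinearRegression.score` returns R² rather than accuracy. A negative value means the predictions fit the test labels worse than a constant prediction of the mean label would.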

In [61]:
linr_sv = linear_model.LinearRegression(n_jobs = -1)
%time linr_sv.fit(train_sv, y_train.astype(int))

CPU times: user 1.07 s, sys: 57.1 ms, total: 1.13 s
Wall time: 714 ms


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=False)

In [62]:
%time linr_score_test_sv = linr_sv.score(test_sv, y_test.astype(int))
linr_score_test_sv*100

CPU times: user 13.6 ms, sys: 11 ms, total: 24.6 ms
Wall time: 20.6 ms


-43.7126168316895

### Logistic Regression

In [63]:
lr_sv = linear_model.LogisticRegression(C=5.0, penalty='l2', solver = 'newton-cg', n_jobs=-1)
%time lr_sv.fit(train_sv, y_train)



CPU times: user 42.9 s, sys: 68.3 ms, total: 42.9 s
Wall time: 21.9 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=-1, penalty='l2',
                   random_state=None, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)

In [64]:
%time lr_score_test_sv = lr_sv.score(test_sv, y_test.astype(int))
lr_score_test_sv*100

CPU times: user 38.2 ms, sys: 10 ms, total: 48.2 ms
Wall time: 30.1 ms


21.650666465089323

### Support Vector Machine
Even at low dimension this takes a lot of time to run.

In [0]:
# clf_sv = svm.SVC(kernel='linear', C=1, random_state=42)
# %time clf_sv.fit(train_sv, y_train.astype(int))

In [0]:
# %time clf_score_sv = clf_sv.score(test_sv, y_test.astype(int))
# print("SVM accuracy on test:\t %f" % clf_score_sv)

### Random Forest

In [67]:
rf_sv = RandomForestClassifier(criterion='gini', max_depth=5, 
                               min_samples_leaf=5, min_samples_split=2, 
                               n_estimators = 220, oob_score=True, 
                               max_features=0.5, n_jobs = -1, random_state=42)

%time rf_sv.fit(train_sv, y_train.astype(int))

CPU times: user 15min 25s, sys: 886 ms, total: 15min 25s
Wall time: 7min 54s


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=5, max_features=0.5, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=220,
                       n_jobs=-1, oob_score=True, random_state=42, verbose=0,
                       warm_start=False)

In [68]:
%time rf_score_sv = rf_sv.score(test_sv, y_test.astype(int))
print("RF accuracy on test:\t %f" % rf_score_sv)

CPU times: user 2.02 s, sys: 103 ms, total: 2.12 s
Wall time: 1.13 s
RF accuracy on test:	 0.210081
