# Sketched SVM

We want to download data and trial the method for iterative sketching on an svm.  First we need to set up in sklearn to ensure the QP approach is correct.

Tutorial: https://stackabuse.com/text-classification-with-python-and-scikit-learn/
Data: http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

## Step 1: Setup

Library imports

In [1]:
import numpy as np
import re
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.datasets import load_files
nltk.download('stopwords')
import pickle
from nltk.corpus import stopwords
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/charliedickens/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/charliedickens/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## 2 - Importing the dataset

In [2]:
movie_data = load_files(r"review_polarity/txt_sentoken")
X,y = movie_data.data, movie_data.target

The above lines of code load all files recursively in the directory `txt_sentoken`, storing the data in $X$ and the class variables in $y$.  $X$ is a list whose length is the length of the review.

In [8]:
print(len(X))
print(X[1])

2000
b"good films are hard to find these days . \ngreat films are beyond rare . \nproof of life , russell crowe's one-two punch of a deft kidnap and rescue thriller , is one of those rare gems . \na taut drama laced with strong and subtle acting , an intelligent script , and masterful directing , together it delivers something virtually unheard of in the film industry these days , genuine motivation in a story that rings true . \nconsider the strange coincidence of russell crowe's character in proof of life making the moves on a distraught wife played by meg ryan's character in the film -- all while the real russell crowe was hitching up with married woman meg ryan in the outside world . \ni haven't seen this much chemistry between actors since mcqueen and mcgraw teamed up in peckinpah's masterpiece , the getaway . \nbut enough with the gossip , let's get to the review . \nthe film revolves around the kidnapping of peter bowman ( david morse ) , an american engineer working in south am

In [9]:
y

array([0, 1, 1, ..., 1, 0, 0])

$y$ is an array with categoric labels for positive and negative.

## 3 - Text preprocessing
Now we have the data we need to convert the text into a usable format.  For this, we will remove special characters, numbers, and unwanted spaces.

In [10]:
documents = []
stemmer = WordNetLemmatizer()
for sen in range(len(X)):
    
    # remove special characters
    document = re.sub(r'W', ' ', str(X[sen]))
    
    # remove single characters 
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
    
    # remove multiple spaces and replace with single space
    document = re.sub(r'\s+', ' ', document, flags=re.I)
    
    # Removing prefixed 'b'
    document = re.sub(r'^b\s+', '', document)

    # Converting to Lowercase
    document = document.lower()

    # Lemmatization
    document = document.split()
    stemmer = WordNetLemmatizer()
    document = [stemmer.lemmatize(word) for word in document]
    document = ' '.join(document)

    documents.append(document)

In [11]:
documents[1]

'b"good film are hard to find these day . \\ngreat film are beyond rare . \\nproof of life , russell crowe\'s one-two punch of deft kidnap and rescue thriller , is one of those rare gem . \\na taut drama laced with strong and subtle acting , an intelligent script , and masterful directing , together it delivers something virtually unheard of in the film industry these day , genuine motivation in story that ring true . \\nconsider the strange coincidence of russell crowe\'s character in proof of life making the move on distraught wife played by meg ryan\'s character in the film -- all while the real russell crowe wa hitching up with married woman meg ryan in the outside world . \\ni haven\'t seen this much chemistry between actor since mcqueen and mcgraw teamed up in peckinpah\'s masterpiece , the getaway . \\nbut enough with the gossip , let\'s get to the review . \\nthe film revolves around the kidnapping of peter bowman ( david morse ) , an american engineer working in south america 

## 4 - Converting the text data to numeric data
There are different approaches to this step but our approach will be agnostic to this either way.  This tutorial uses the bag of words model.

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

In [14]:
vectorizer = CountVectorizer(max_features=2500, min_df=5,\
                             max_df=0.7,
                            stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(documents).toarray()

These lines of code use the `CountVectorizer` to transfer the text data to numeric data. The `max_features` takes the 1500 most occuring words. The `min_df` and `max_df` control how many words we keep: minimum includes words occuring in at least 5 documents and 0.7 discards words occuring in more than 70% of the documents. Think about words that occur in every document `max_df=1.0` - these would not be useful for classification. We also remove stopwords.

Next we compute term-frequency i(nverse) d(oc) frequency.

$$
TF = \frac{\text{num occurences of a word}}{\text{total number of words in doc}}
$$

$$
IDF = \log \frac{\text{total num of docs}}{\text{number of docs containing word}}
$$
TFIDF value is higher if a word occurs a lot in one document but occurs relatively infrequently over the entire set of documents.  We now convert values from the bag of words model to the TFIDF values.

In [15]:
from sklearn.feature_extraction.text import TfidfTransformer

In [16]:
tfidf_converter = TfidfTransformer()
X = tfidf_converter.fit_transform(X).toarray()


In [17]:
print(X.shape)

(2000, 2500)


nb can also convert text to tfidf features directly.

```
from sklearn.feature_extraction.text import TfidfVectorizer  
tfidfconverter = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))  
X = tfidfconverter.fit_transform(documents).toarray() 
```

## 5 - Training and Testing

We need to divide the set into a train and test set for the supervised ML problem.  To begin with we will train using a random forest classifier.

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0) 

In [20]:
clf = RandomForestClassifier(n_estimators=1000,random_state=0)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

## 6 - Evaluating the model
Now we need to import some metrics in order to determine how good the method is.

In [21]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [22]:
print(confusion_matrix(y_test,y_pred))

[[178  30]
 [ 32 160]]


In [23]:
print(classification_report(y_test,y_pred))

             precision    recall  f1-score   support

          0       0.85      0.86      0.85       208
          1       0.84      0.83      0.84       192

avg / total       0.84      0.84      0.84       400



In [24]:
print(accuracy_score(y_test,y_pred))

0.845


Next tutorial at : https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a

https://www.kaggle.com/alvations/basic-nlp-with-nltk

# Section 2 
We will try the approach following https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a which works on the 20 newsgroups dataset

In [25]:
from sklearn.datasets import fetch_20newsgroups

In [26]:
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

In [32]:
y_train = twenty_train.target
print(twenty_train.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [28]:
print("\n".join(twenty_train.data[0].split("\n")[:3])) #prints first line of the first data file

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu


Now we willd o the feature extraction using the bag of words model again.  The following code generates a `document-term` matrix of shape `n_samples x n_features`.  Note that the number of features exceeds the number of samples so this is good for our application!

In [29]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 130107)

In [30]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(11314, 130107)

## 2.1 Fitting the model
Now the data is ready to fit the model on so we can use the svm.

In [31]:
from sklearn.naive_bayes import MultinomialNB

In [33]:
clf = MultinomialNB().fit(X_train_tfidf, y_train)

A more robust method of using the algorithms is building a pipeline so that parts of the analysis are automated across all methods.

In [34]:
from sklearn.pipeline import Pipeline

In [36]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

In [37]:
X_test = fetch_20newsgroups(subset='test',shuffle=True)
predicted = text_clf.predict(X_test.data)
np.mean(predicted == X_test.target)

0.7738980350504514

This was a naive bayes classifier for comparison but our work is on the SVM so we will use that next.

In [39]:
from sklearn.svm import LinearSVC

In [57]:
svm_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf-svm', LinearSVC()),
])

svm_clf = svm_clf.fit(twenty_train.data, twenty_train.target)

In [58]:
svm_predicted = svm_clf.predict(X_test.data)
np.mean(svm_predicted == X_test.target)

0.8531598513011153

# 2.2 Hyperparameters and grid search
The classifiers have different parameters which we can set using `GridSearchCV`.

In [46]:
from sklearn.model_selection import GridSearchCV

In [47]:
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
               'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
 }

In [48]:
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)

In [49]:
print(gs_clf.best_score_)
print(gs_clf.best_params_)

0.9067526957751458
{'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}


In [59]:
# for the svm
svm_parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
               'tfidf__use_idf': (True, False),
              'clf-svm__C': (1e-3, 1e-5, 1e-1,1.0,1e1,1e3),
 }

In [60]:
gs_clf_svm = GridSearchCV(svm_clf, svm_parameters, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(twenty_train.data, twenty_train.target)

In [61]:
print(gs_clf_svm.best_score_)
print(gs_clf_svm.best_params_)

0.923811207353721
{'clf-svm__C': 10.0, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}


In [62]:
# saving the model
from sklearn.externals import joblib

In [63]:
joblib.dump(gs_clf_svm.best_estimator_, "newsgroup_svm.pkl", compress=1)

['newsgroup_svm.pkl']