# Assignment 4

In this assignment, you will build a multi-class SVM classifier for text classification. First, import the required libraries. Next, pre-process the data by removing stop words and stemming. After cleaning the data, weight each word using tf-idf; these weights are the features for your classifier. Split the data into training (80%) and testing (20%) sets. Then train different SVM classifiers and select the best model using 10-fold cross-validation. Finally, save your best model and evaluate it on the test data.

## Import Libraries

In [261]:
%matplotlib inline
import pickle
import string

import nltk
import pandas
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC, LinearSVC

stemmer = PorterStemmer()



## Load Data

In [262]:
# Read the file once: column 0 holds the labels, column 1 the tweet text
data = pandas.read_csv("Tweets.csv", usecols=[0, 1], encoding="Latin1")
y = data.iloc[:, 0].values
x = data.iloc[:, 1].values


## Clean Data

Pre-process your data by removing stop words and performing stemming.

In [263]:
# Translation table that deletes every punctuation character
table = str.maketrans("", "", string.punctuation)
# Build the stop-word set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))

cleaned = []
for sentence in x:
    # Strip punctuation, split on whitespace, then drop stop words
    tokens = sentence.translate(table).split()
    tokens = [w for w in tokens if w.lower() not in stop_words]
    # Stem each remaining token and join the tokens back into one string
    cleaned.append(" ".join(stemmer.stem(w) for w in tokens))
x = cleaned
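
The cleaning steps above can also be wrapped in a small helper so they are easy to try on a single sentence. The sketch below is only an illustration: `clean_text`, `TOY_STOP_WORDS` and `PUNCT_TABLE` are hypothetical names, the stop-word list is a tiny stand-in for nltk's English list, and the default `stem` function merely lowercases.

```python
import string

# Hypothetical, tiny stop-word list for illustration only;
# the notebook uses nltk's stopwords.words('english') instead.
TOY_STOP_WORDS = {"the", "is", "a", "to", "and"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text, stop_words, stem=lambda w: w.lower()):
    """Strip punctuation, drop stop words, then apply `stem` to each token."""
    tokens = text.translate(PUNCT_TABLE).split()
    kept = [w for w in tokens if w.lower() not in stop_words]
    return " ".join(stem(w) for w in kept)


cleaned_example = clean_text("The flight is delayed, AGAIN!", TOY_STOP_WORDS)
print(cleaned_example)
```

Passing `PorterStemmer().stem` as the `stem` argument and the nltk stop-word set as `stop_words` should give the same kind of output as the cell above.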

## Split Data

Split your data into training (80%) and testing (20%)

In [264]:
# Hold out 20% of the cleaned tweets for testing
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
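
If the classes are imbalanced, a stratified and seeded split may be worth considering, so that both sets keep the same class proportions and the split is reproducible. The following is a minimal sketch on made-up data; the toy texts, the label names and the choice of `random_state=0` are assumptions, not part of the assignment.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy, imbalanced data standing in for the tweets and their labels.
texts = ["tweet %d" % i for i in range(100)]
labels = ["negative"] * 60 + ["neutral"] * 25 + ["positive"] * 15

# stratify keeps the class proportions equal in both splits;
# random_state makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)
print(Counter(y_tr))
print(Counter(y_te))
```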


## TF-IDF

Transform your cleaned training text so that each word is given a tf-idf weight.

In [265]:
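# min_df=1 keeps every term; unigrams, bigrams and trigrams are used as features
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
# Fit on the training split only, so that nothing is learnt from the test data
X_train = tf.fit_transform(X_train)
# X_test is transformed later with the saved vectorizer: X_test = tf.transform(X_test)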
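
To make the weighting concrete, here is a hand-rolled sketch of tf-idf on a three-document toy corpus. It assumes the default `TfidfVectorizer` settings (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation) and ignores the n-gram and stop-word options used in the cell above. The corpus and the helper name `tfidf_weights` are made up for illustration.

```python
import math
from collections import Counter

docs = ["flight delay", "flight cancel", "great crew"]  # toy corpus
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for doc in tokenized for term in set(doc))


def tfidf_weights(doc):
    """Return L2-normalised tf-idf weights for one tokenised document."""
    counts = Counter(doc)
    raw = {
        t: counts[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t in counts
    }
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}


weights = tfidf_weights(tokenized[0])
print(weights)
```

In this toy corpus "delay" occurs in one document and "flight" in two, so "delay" receives the larger weight in the first document.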

## Save TF-IDF 

Save the tf-idf learnt from the textual data

In [266]:
# pickle.dump returns None, so tf must not be rebound; the with-block also closes the file
with open("tfidf.p", "wb") as f:
    pickle.dump(tf, f)

## Exercise 1

Is your data linearly separable? Verify using a hard margin SVM and explain why or why not.

### Training and Validation

Train a multi-class hard margin SVM

In [242]:
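# A very large C makes margin violations expensive, which approximates a hard margin
reg = LinearSVC(C=1000)
reg.fit(X_train, y_train)
# Training accuracy: 1.0 would mean the training set is perfectly separated
accuracy = reg.score(X_train, y_train)
print(accuracy)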

0.995987021858


### Explanation

In [243]:
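# The training accuracy is about 99.6%, so the data is almost linearly separable.
# Because it is below 100%, this run does not show strict separability:
# a few training points are still misclassified.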
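
The separability check can be illustrated on two tiny datasets where the answer is known in advance. This is a sketch under assumptions: `fits_perfectly` is a hypothetical helper, `SVC(kernel="linear")` with `C=1e6` is used as a stand-in for a hard margin, and the four-point datasets are invented.

```python
from sklearn.svm import SVC


def fits_perfectly(X, y, C=1e6):
    """Fit a linear SVM with a very large C and test for 100% training accuracy."""
    clf = SVC(kernel="linear", C=C)
    clf.fit(X, y)
    return bool(clf.score(X, y) == 1.0)


# Two well-separated clusters: a separating hyperplane exists.
X_sep = [[0, 0], [0, 1], [3, 3], [3, 4]]
y_sep = [0, 0, 1, 1]

# XOR layout: no single line separates the two classes.
X_xor = [[0, 0], [1, 1], [0, 1], [1, 0]]
y_xor = [0, 0, 1, 1]

print(fits_perfectly(X_sep, y_sep), fits_perfectly(X_xor, y_xor))
```

Only a training accuracy of exactly 1.0 supports the claim that a separating hyperplane was found; anything lower means some training points sit on the wrong side.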

## Exercise 2

### Training and Validation

You need to try the following models: a hard margin SVM classifier with a polynomial kernel and a hard margin SVM classifier with an RBF kernel. For each, set the values of the different hyperparameters using an Exhaustive Grid Search Cross-Validation. Report the number of support vectors in each case.

In [246]:
gamma = [0.1, 1, 10, 100]
degree = [1, 2, 3]

# Polynomial kernel; C=1000 approximates a hard margin
svc_poly = SVC()
params = {'kernel': ['poly'], 'C': [1000], 'gamma': gamma, 'degree': degree}
clf_poly = GridSearchCV(svc_poly, params)  # cv is left at the scikit-learn default
clf_poly.fit(X_train, y_train)
best = clf_poly.best_score_
print("for poly:")
print("best_score:", best)
print("# of support vectors", clf_poly.best_estimator_.n_support_)  # one count per class
print("Degree", clf_poly.best_estimator_.degree)
print("Gamma", clf_poly.best_estimator_.gamma)

# RBF kernel; 'degree' is ignored by this kernel, so searching over it only repeats fits
svc_rbf = SVC()
params = {'kernel': ['rbf'], 'C': [1000], 'gamma': gamma, 'degree': degree}
clf_rbf = GridSearchCV(svc_rbf, params)
clf_rbf.fit(X_train, y_train)
best = clf_rbf.best_score_
print("for rbf:")
print("best_score:", best)
print("# of support vectors", clf_rbf.best_estimator_.n_support_)
print("Degree", clf_rbf.best_estimator_.degree)
print("Gamma", clf_rbf.best_estimator_.gamma)

for poly:
best_score: 0.764173497268
# of support vectors [5774 2441 1646]
Degree 1
Gamma 0.1
for rbf:
best_score: 0.764856557377
# of support vectors [5871 2443 1648]
Degree 1
Gamma 0.1
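
The `n_support_` attribute holds one count per class, so the overall number of support vectors is its sum. The sketch below shows this together with an explicit `cv=10` for 10-fold cross-validation. It runs on the iris dataset bundled with scikit-learn purely as a small stand-in for the tf-idf features, and the gamma grid is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_toy, y_toy = load_iris(return_X_y=True)  # toy stand-in for the tf-idf features

search = GridSearchCV(
    SVC(kernel="rbf", C=1000),          # large C approximates a hard margin
    param_grid={"gamma": [0.1, 1, 10]},
    cv=10,                              # 10-fold cross-validation
)
search.fit(X_toy, y_toy)

best_model = search.best_estimator_
per_class = best_model.n_support_       # one count per class
total = int(per_class.sum())            # overall number of support vectors
print("gamma:", search.best_params_["gamma"])
print("support vectors per class:", per_class, "total:", total)
```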


## Exercise 3

### Training and Validation

Repeat the above using a soft margin SVM classifier for all the cases, including the linear one, and again set the hyperparameters using an Exhaustive Grid Search Cross-Validation. Report the number of support vectors in each case.

In [255]:
# Soft margin: C is now searched instead of being fixed at a large value
C = [0.1, 1, 10, 100]
params_lin = {'kernel': ['linear'], 'C': C}
params_poly = {'kernel': ['poly'], 'C': C, 'gamma': gamma, 'degree': degree}
# 'degree' is ignored by the RBF kernel; it is kept so the printout below is unchanged
params_rbf = {'kernel': ['rbf'], 'C': C, 'gamma': gamma, 'degree': degree}

svc_lin = SVC()
clf_lin = GridSearchCV(svc_lin, params_lin)
clf_lin.fit(X_train, y_train)
best = clf_lin.best_score_
print("Linear:")
print("best_score:", best)
print("# of support vectors", clf_lin.best_estimator_.n_support_)

svc_poly = SVC()
clf_poly = GridSearchCV(svc_poly, params_poly)
clf_poly.fit(X_train, y_train)
best = clf_poly.best_score_
print("Poly:")
print("best_score:", best)
print("# of support vectors", clf_poly.best_estimator_.n_support_)
print("Degree", clf_poly.best_estimator_.degree)
print("Gamma", clf_poly.best_estimator_.gamma)

svc_rbf = SVC()
clf_rbf = GridSearchCV(svc_rbf, params_rbf)
clf_rbf.fit(X_train, y_train)
best = clf_rbf.best_score_
print("Rbf:")
print("best_score:", best)
print("# of support vectors", clf_rbf.best_estimator_.n_support_)
print("Degree", clf_rbf.best_estimator_.degree)
print("Gamma", clf_rbf.best_estimator_.gamma)

Linear:
best_score: 0.765454234973
# of support vectors [5747 2444 1646]
Poly:
best_score: 0.765454234973
# of support vectors [5747 2444 1646]
Degree 1
Gamma 100
Rbf:
best_score: 0.768015710383
# of support vectors [5780 2432 1630]
Degree 1
Gamma 0.1
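
`GridSearchCV` also accepts a list of parameter grids, which lets all three kernels be compared in one search while giving each kernel only the hyperparameters it actually uses. The following is a minimal sketch on the bundled iris dataset; the value ranges and `cv=5` are assumptions chosen to keep the toy run fast.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_toy, y_toy = load_iris(return_X_y=True)  # toy stand-in for the tf-idf features

C_values = [0.1, 1, 10]
param_grid = [
    {"kernel": ["linear"], "C": C_values},
    {"kernel": ["poly"], "C": C_values, "gamma": [0.01, 0.1], "degree": [2, 3]},
    {"kernel": ["rbf"], "C": C_values, "gamma": [0.01, 0.1]},  # no degree for rbf
]

# One search over all three grids; best_params_ names the winning kernel.
multi_search = GridSearchCV(SVC(), param_grid, cv=5)
multi_search.fit(X_toy, y_toy)
print(multi_search.best_params_, round(multi_search.best_score_, 3))
print("support vectors per class:", multi_search.best_estimator_.n_support_)
```

Leaving `degree` out of the RBF grid avoids refitting identical models, since that parameter only affects the polynomial kernel.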


## Exercise 4

### Save Model

Save your best SVM model

In [270]:
# Save the fitted RBF grid search, which has the best cross-validation score above
with open("best_model.p", "wb") as f:
    pickle.dump(clf_rbf, f)

### Test Model

Load your saved tf-idf vectorizer and use it to transform the test data into tf-idf weights. Then load your best SVM classifier and plot the confusion matrix for the test data.
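
One possible shape for this step is sketched below on a toy dataset, so that it runs on its own. The texts, labels and temporary file location are invented for illustration; in the notebook, the files would be the `tfidf.p` and `best_model.p` saved earlier and the inputs would be `X_test` and `y_test`.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

# Toy stand-ins for the cleaned tweets and their labels.
train_texts = ["great flight", "love crew", "late flight", "lost bag"]
train_labels = ["positive", "positive", "negative", "negative"]
test_texts = ["great crew", "late bag"]
test_labels = ["positive", "negative"]

with tempfile.TemporaryDirectory() as tmp:
    tfidf_path = os.path.join(tmp, "tfidf.p")
    model_path = os.path.join(tmp, "best_model.p")

    # Fit and save, mirroring the earlier cells.
    vec = TfidfVectorizer()
    clf = SVC(kernel="linear").fit(vec.fit_transform(train_texts), train_labels)
    with open(tfidf_path, "wb") as f:
        pickle.dump(vec, f)
    with open(model_path, "wb") as f:
        pickle.dump(clf, f)

    # Load both objects back before touching the test data.
    with open(tfidf_path, "rb") as f:
        loaded_vec = pickle.load(f)
    with open(model_path, "rb") as f:
        loaded_clf = pickle.load(f)

# transform (not fit_transform): reuse the vocabulary learnt on the training set.
X_test_toy = loaded_vec.transform(test_texts)
y_pred = loaded_clf.predict(X_test_toy)

class_names = ["negative", "positive"]
cm = confusion_matrix(test_labels, y_pred, labels=class_names)
print(cm)  # rows are true classes, columns are predicted classes
```

To draw the matrix rather than print it, `sklearn.metrics.ConfusionMatrixDisplay(cm, display_labels=class_names).plot()` should work in recent scikit-learn versions, assuming matplotlib is available.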

Rename the Jupyter notebook to Assignment4_*netid*.ipynb (for example, Assignment4_xyz01.ipynb) and upload it to Moodle no later than Wednesday, Nov 8, 11:55 pm.