
# HW2: Text Classification
In this homework, we will apply text classification techniques to identify the authors of works of fiction. The dataset contains text from public-domain works by three spooky authors: Edgar Allan Poe, HP Lovecraft, and Mary Shelley. Your objective is to accurately identify the author of each sentence in the test set.

The data was prepared by chunking larger texts into sentences using CoreNLP's MaxEnt sentence tokenizer, so you may notice the odd non-sentence here and there. The task is to predict the author (EAP, HPL, or MWS) of a given text. In other words, it is a text classification problem with three classes.





In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.svm import SVC
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from nltk import word_tokenize
from nltk.corpus import stopwords

## Load data 

We first load the data and then split it into train and test sets. Please do not modify the code in this cell.


In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/zariable/UW-MSIS541/master/assignments/hw1/data/data.csv')
data.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


We first preprocess the labels by converting them from strings to integers. In particular, we use the `LabelEncoder` from scikit-learn to convert the text labels to the integers 0, 1, and 2.

In [None]:
# label encode the author names into 0, 1 and 2 for easy evaluation.
y = preprocessing.LabelEncoder().fit_transform(data.author.values)
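The mapping from author codes to integers can be inspected with a minimal sketch like the one below. It uses a small hypothetical label list rather than the real `data.author` column, and it keeps a reference to the encoder (the variable name `encoder` is an assumption, not part of the original cell) so that predictions can later be translated back to author codes.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels using the three author codes from the dataset (hypothetical order)
authors = ["MWS", "EAP", "HPL", "EAP"]

# Keep the fitted encoder so the mapping can be reversed later
encoder = LabelEncoder()
encoded = encoder.fit_transform(authors)

# classes_ is sorted, so each class's position is its integer code
label_map = {name: code for code, name in enumerate(encoder.classes_)}
print(label_map)                                    # {'EAP': 0, 'HPL': 1, 'MWS': 2}
print(encoded.tolist())                             # [2, 0, 1, 0]
print(encoder.inverse_transform(encoded).tolist())  # ['MWS', 'EAP', 'HPL', 'EAP']
```

Because the classes are sorted alphabetically, EAP maps to 0, HPL to 1, and MWS to 2.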

Then we randomly split the original data into training and test sets: the training set is used to fit the text classifier and the test set is used to evaluate model performance. We can do this using `train_test_split` from the `model_selection` module of scikit-learn.

In [None]:
# split the data into train and test 
x_train, x_test, y_train, y_test = train_test_split(
    data.text.values, y, 
    stratify=y, 
    random_state=42, 
    test_size=0.1, 
    shuffle=True
)

print(x_train.shape)
print(x_test.shape)

# print the first training example
print(x_train[0])

(17621,)
(1958,)
Her hair was the brightest living gold, and despite the poverty of her clothing, seemed to set a crown of distinction on her head.
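The `stratify=y` argument used above keeps the class proportions the same in both splits. The following sketch illustrates this on hypothetical, deliberately imbalanced labels (the arrays `labels` and `texts` are made up for illustration and are not the homework data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 / 30 / 10 examples across three classes
labels = np.array([0] * 60 + [1] * 30 + [2] * 10)
texts = np.array([f"sentence {i}" for i in range(len(labels))])

_, _, y_tr, y_te = train_test_split(
    texts, labels,
    stratify=labels,   # preserve class proportions in both splits
    test_size=0.1,
    random_state=42,
)

train_counts = np.bincount(y_tr).tolist()
test_counts = np.bincount(y_te).tolist()
print(train_counts)  # [54, 27, 9]
print(test_counts)   # [6, 3, 1]
```

Both splits keep the original 6:3:1 ratio, which matters when accuracy is compared across classes of different sizes.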


## Task 1: Build a Naive Bayes Model 

Your very first model is a simple TF-IDF (Term Frequency - Inverse Document Frequency) followed by a Naive Bayes classifier.

Try to use Grid Search to find the best hyperparameters from the following settings (feel free to explore other options as well):

* Different n-gram ranges
* Whether or not to remove stop words
* Whether or not to apply IDF

I am intentionally keeping the requirements vague to encourage you to explore different options and find the best solution. After identifying the best model, use that model to make predictions on the test data and report its accuracy.

Hint: you have two options for extracting TF-IDF features from the raw text. Option 1 follows the code we covered in the lab, where we apply `CountVectorizer` followed by `TfidfTransformer`. Option 2 uses `TfidfVectorizer`, which is equivalent to Option 1.
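The equivalence of the two options can be checked with a small sketch. The three sentences in `docs` are hypothetical placeholders, and both options are run with scikit-learn's default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus
docs = [
    "the raven sat upon the bust",
    "the creature fled into the night",
    "strange shadows filled the night",
]

# Option 1: raw counts, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Option 2: both steps in a single estimator
one_step = TfidfVectorizer().fit_transform(docs)

# With matching settings the two feature matrices are numerically identical
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)  # True
```

Option 1 is convenient in a grid search because the vectorizer and the TF-IDF step can be tuned under separate prefixes (`vect__` and `tfidf__`).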

In [None]:
# TF-IDF, Option 1: count the terms first, then configure the TF-IDF weighting
count_vect = CountVectorizer(
    stop_words = 'english',
    max_features = None,
    ngram_range = (1, 1)
)

X_train_counts = count_vect.fit_transform(x_train)
X_train_counts.shape

from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer(
    norm = 'l2',
    use_idf = True,
    smooth_idf = True
)

In [None]:
from sklearn.pipeline import Pipeline

text_clf = Pipeline(
    [
        ('vect', CountVectorizer()), 
        ('tfidf', TfidfTransformer()), 
        ('clf', MultinomialNB())
    ]
)

# Grid search over n-gram range, IDF weighting, and stop-word removal
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3), (2, 3), (1, 4)],
    'tfidf__use_idf': (True, False),
    'vect__stop_words': ["english", None],
}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=1, cv=2)
gs_clf = gs_clf.fit(x_train, y_train)

print(gs_clf.best_params_)
print(gs_clf.best_score_)

{'tfidf__use_idf': True, 'vect__ngram_range': (1, 1), 'vect__stop_words': 'english'}
0.7861072431517151
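The score printed above is the mean cross-validated accuracy on the training data. To report accuracy on held-out text as the task asks, the fitted search object can be used to predict directly. Below is a minimal sketch on a hypothetical two-class toy corpus (all `toy_*` names and sentences are made up for illustration).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for x_train / x_test (two classes only)
toy_train = [
    "the raven perched above my chamber door",
    "a black cat haunted the cellar wall",
    "the pendulum swung above the pit",
    "the old house of usher crumbled",
    "the creature wandered across the frozen ice",
    "my laboratory held the lifeless frame",
    "the monster begged for a companion",
    "geneva lay quiet beneath the mountains",
]
toy_train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_test = ["the raven and the black cat", "the monster crossed the ice"]
toy_test_labels = [0, 1]

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
toy_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
}
toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=2)
toy_search.fit(toy_train, toy_train_labels)

# refit=True (the default) retrains the best configuration on all training
# data, so the search object can predict on unseen text directly
toy_predictions = toy_search.predict(toy_test)
toy_test_accuracy = accuracy_score(toy_test_labels, toy_predictions)
print(toy_search.best_params_)
print(toy_test_accuracy)
```

In the notebook, the corresponding call would be `accuracy_score(y_test, gs_clf.predict(x_test))`, assuming `gs_clf` has been fitted as above.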


## Task 2: Build a Support Vector Machines (SVM) Model 

Similar to the first task, you will now build an SVM model for the same problem. Use Grid Search to find the best hyperparameters and report the accuracy of your best model on the test set.


In [None]:
from sklearn.svm import LinearSVC

text_clf_svm = Pipeline(
    [
        ('vect', CountVectorizer()), 
        ('tfidf', TfidfTransformer()),
        ('clf-svm', LinearSVC(random_state=0))
    ]
)

text_clf_svm = text_clf_svm.fit(x_train, y_train)
predicted_svm = text_clf_svm.predict(x_test)
np.mean(predicted_svm == y_test)

# Same search space as for Naive Bayes
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3), (2, 3), (1, 4)],
    'tfidf__use_idf': (True, False),
    'vect__stop_words': ["english", None],
}

gs_svm = GridSearchCV(text_clf_svm, parameters, n_jobs=1, cv=2)
gs_svm = gs_svm.fit(x_train, y_train)

print(gs_svm.best_params_)
print(gs_svm.best_score_)

{'tfidf__use_idf': True, 'vect__ngram_range': (1, 2), 'vect__stop_words': None}
0.8102831487984978


## Task 3: Further improve the prediction accuracy

Can you think of other ways to improve the prediction accuracy?

* Can you further improve the prediction accuracy of the Naive Bayes or SVM models?
* Can you use a deep neural network to further improve model accuracy?
* Can you apply a different feature extraction method?
* Is there anything else you can think of to improve the prediction accuracy on the test data?

Please write code to explore those options and report your findings.

### **Answer:**
To further improve prediction accuracy, I decided to tune additional parameters for both the Naive Bayes and SVM models in the Grid Search.

In addition to the initial three parameters (N-gram range, usage of stop words, and usage of IDF), I added the following:


> **For SVM:** options to change the penalty term (`penalty`), to disable the intercept (`fit_intercept=False`), to solve the primal instead of the dual problem (`dual`), and to vary the regularization strength (`C`).

> **For Naive Bayes:** additional values for the smoothing parameter (`alpha`).

For extra feature extraction, I applied sublinear tf scaling to both models. With `sublinear_tf=True`, each raw term count tf is replaced by 1 + log(tf), which dampens the influence of words that are repeated many times within a sentence.
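The effect of sublinear tf scaling can be seen in isolation with a small sketch. IDF weighting and normalization are switched off here so that only the term-frequency scaling is visible. The count matrix is a hypothetical single document with three terms.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# One hypothetical document with three terms occurring 1, 4 and 10 times
counts = csr_matrix(np.array([[1.0, 4.0, 10.0]]))

# Disable IDF and normalization so only the tf scaling differs
raw = TfidfTransformer(use_idf=False, norm=None, sublinear_tf=False)
sub = TfidfTransformer(use_idf=False, norm=None, sublinear_tf=True)

raw_tf = raw.fit_transform(counts).toarray()
sub_tf = sub.fit_transform(counts).toarray()

print(raw_tf)  # the raw counts, unchanged
print(sub_tf)  # 1 + ln(count) for each term

# Check against the formula 1 + log(tf)
expected = 1.0 + np.log(np.array([[1.0, 4.0, 10.0]]))
print(np.allclose(sub_tf, expected))  # True
```

A term that appears ten times ends up with a weight of roughly 3.3 rather than 10, so heavily repeated words dominate the feature vector less.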



In [None]:
text_clf_svm2 = Pipeline(
    [
        ('vect', CountVectorizer()), 
        ('tfidf', TfidfTransformer()),
        ('clf-svm', LinearSVC(random_state=0, max_iter=2000))
    ]
)


parameters = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False), 'vect__stop_words': ["english", None],
              'clf-svm__penalty': ['l1', 'l2'], 'clf-svm__fit_intercept':[True, False], 'clf-svm__dual':[False, True],
              'tfidf__sublinear_tf':[True, False], 'clf-svm__C': [0.5, 1, 2]}

gs_svm2 = GridSearchCV(text_clf_svm2, parameters, n_jobs=-1, cv=2)
gs_svm2 = gs_svm2.fit(x_train, y_train)

print(gs_svm2.best_params_)
print(gs_svm2.best_score_)

{'clf-svm__C': 2, 'clf-svm__dual': True, 'clf-svm__fit_intercept': True, 'clf-svm__penalty': 'l2', 'tfidf__sublinear_tf': True, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2), 'vect__stop_words': None}
0.8129504691213169
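One caveat about the grid above: `LinearSVC` does not support `penalty='l1'` together with `dual=True`, so those candidates fail to fit. With the default `error_score` setting, `GridSearchCV` should record them as NaN and emit a warning rather than abort. A possible alternative, sketched below as an assumption rather than the approach used in this notebook, is to pass a list of grids that contains only supported combinations. `ParameterGrid` is used here just to count candidates without fitting anything.

```python
from sklearn.model_selection import ParameterGrid

# Parameters shared by every candidate (same values as in the cell above)
shared = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'vect__stop_words': ['english', None],
    'tfidf__use_idf': [True, False],
    'tfidf__sublinear_tf': [True, False],
    'clf-svm__fit_intercept': [True, False],
    'clf-svm__C': [0.5, 1, 2],
}

# Original single grid: crosses both penalties with both solver settings
full_grid = {
    **shared,
    'clf-svm__penalty': ['l1', 'l2'],
    'clf-svm__dual': [False, True],
}

# Hypothetical replacement: a list of grids with supported combinations only
valid_grid = [
    # l2 penalty works with both the primal and the dual formulation
    {**shared, 'clf-svm__penalty': ['l2'], 'clf-svm__dual': [True, False]},
    # l1 penalty is only supported in the primal (dual=False)
    {**shared, 'clf-svm__penalty': ['l1'], 'clf-svm__dual': [False]},
]

n_full = len(ParameterGrid(full_grid))
n_valid = len(ParameterGrid(valid_grid))
print(n_full)   # 384
print(n_valid)  # 288
```

Passing `valid_grid` as the parameter grid to `GridSearchCV` would skip the 96 unsupported candidates. The best parameters reported above use `penalty='l2'`, which is included in the first sub-grid.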


In [None]:
text_clf2 = Pipeline(
    [
        ('vect', CountVectorizer()), 
        ('tfidf', TfidfTransformer()), 
        ('clf', MultinomialNB())
    ]
)

from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False), 'vect__stop_words': ["english", None],
              'clf__alpha': [0, 0.5, 1, 2], 'tfidf__sublinear_tf':[True, False]}

gs_clf2 = GridSearchCV(text_clf2, parameters, n_jobs=-1, cv=2)
gs_clf2 = gs_clf2.fit(x_train, y_train)

print(gs_clf2.best_params_)

print(gs_clf2.best_score_)

{'clf__alpha': 0.5, 'tfidf__sublinear_tf': True, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1), 'vect__stop_words': None}
0.8077858125697022


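### **Observations:**

As a result of my modifications to the initial Grid Search, the SVM model's cross-validated accuracy improved by **0.27 percentage points** (81.03% vs. 81.30%), and the Naive Bayes model's cross-validated accuracy improved by **2.17 percentage points** (78.61% vs. 80.78%). These figures are the `best_score_` values from 2-fold cross-validation on the training data, not accuracy on the held-out test set.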