# <Center> Homework 5: Classification </center>

This assignment needs the following two data files:
- train.csv: dataset for training
- test.csv: dataset for testing. 
    
Both of them have samples in the following format. The `text` column contains documents and the `label` column gives the sentiment of each document.

|label | text |
|------|------|
|1|  I must admit that I'm addicted to "Version 2.0...|
|0|  I think it's such a shame that an enormous tal...|
|1|  The Sunsout No Room at The Inn Puzzle has oddl...|
|...|...|


## Q1 Classification

- Define a function `create_model(train_docs, train_y, test_docs, test_y, model_type='svm', stop_words='english', min_df = 1, print_result = True, algorithm_para=1.0`), where

    - `train_docs`: is a list of documents for training
    - `train_y`: is the ground-truth labels of training documents
    - `test_docs`: is a list of documents for test
    - `test_y`: is the ground-truth labels of test documents
    - `model_type`: two options: `nb` (Multinomial Naive Bayes) or `svm` (Linear SVM)
    - `stop_words`: indicate whether stop words should be removed. The default value is 'english', i.e. remove English stopwords.
    - `min_df`: only word with document frequency above this threshold can be included. The default is 1. 
    - `print_result`: controls whether to show classification report or plots. The default is True.
    - `algorithm_para`: the hyperparameter of algorithms, here refers to C in SVM and alpha in NB. Both C and alpha have default values of 1.0. Please check sklearn documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
    https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html


- This function does the following:
    - Fit a `TfidfVectorizer` using `train_docs` with options `stop_words, min_df` as specified in the function inputs. Extract features from `train_docs` using the fitted `TfidfVectorizer`.
    - Build `linear SVM` or `Multinomial Naive Bayes` model as specified by `model_type` with `algorithm_para` setting. Train the model using the extracted features and `train_y`. 
    - Transform `test_docs` by the fitted `TfidfVectorizer` (hint: use function `transform` not `fit_transform`).
    - Predict the labels for `test_docs`. If `print_result` is True, print the classification report.
    - Calculate the AUC score and PRC score (or Average Precision) for class 1 on the test dataset. If `print_result` is True, plot the ROC and PRC curves. **Hint**: 
        - `sklearn.svm.LinearSVM` does not provide `predict_proba` function. 
        - Instead, you can use its `decision_function` (see <a href = "https://stackoverflow.com/questions/59227176/how-to-plot-roc-and-calculate-auc-for-binary-classifier-with-no-probabilities-s">some referenc code</a>) 
        - Another option is to use `sklearn.svm.SVC` with `kernel='linear' and probability=False` (see <a href = "https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html"> reference</a>)
    - Return the AUC and PRC scores.

    
- Test your function with following cases:
    - model_type='svm', stop_words = 'english', min_df = 1
    - model_type='nb', stop_words = 'english', min_df = 1

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm
import string
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import roc_curve, auc,precision_recall_curve
from sklearn.metrics import classification_report
from matplotlib import pyplot as plt
from sklearn.pipeline import Pipeline
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.model_selection import GridSearchCV

train = pd.read_csv("train5.csv")
test = pd.read_csv("test5.csv")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\avitr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
from textblob import TextBlob

In [2]:
# Define the function for Task 1

def create_model(train_docs, train_y, test_docs, test_y, \
              model_type='svm', stop_words=None, min_df = 1, print_result = True, algorithm_para=1.0):
    
    
    f = lambda x: ''.join([i for i in x if i not in string.punctuation]).lower()

    train_docs = train_docs.apply(f)
    test_docs = test_docs.apply(f)

    vectorizer = TfidfVectorizer(stop_words=stop_words,min_df=min_df)

    vectorizer.fit(train_docs)
    train_docs = vectorizer.transform(train_docs)
    train_docs = train_docs.toarray()

    # model building

    if model_type=='svm':
        clf = svm.SVC(kernel='linear', probability=False,C=algorithm_para)
    else:
        clf = MultinomialNB(alpha=algorithm_para)

    clf.fit(train_docs, train_y)
    test_docs = vectorizer.transform(test_docs)
    test_docs = test_docs.toarray()
    y_pred = clf.predict(test_docs)
    
    if print_result:
        print(classification_report(test_y,y_pred))
        
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    
    if model_type=='svm':
        y_score = clf.decision_function(test_docs)
    else:
        y_score = clf.predict_proba(test_docs)
        
    if model_type=='svm':
        for i in range(2):
            fpr[i], tpr[i], _ = roc_curve(test_y, y_score)
            roc_auc[i] = auc(fpr[i], tpr[i])
        lr_precision, lr_recall, _ = precision_recall_curve(test_y, y_score)
    else:
        for i in range(2):
            fpr[i], tpr[i], _ = roc_curve(test_y, y_score[:,1])
            roc_auc[i] = auc(fpr[i], tpr[i])
        lr_precision, lr_recall, _ = precision_recall_curve(test_y, y_score[:,1])

    

    auc_score = roc_auc[1]
    prc_score = lr_precision.mean()
            
    print('AUC: {:.2f}% PRC: {:.2f}%'.format(auc_score*100,prc_score*100))
    
    plt.figure()
    lw = 2
    plt.plot(
        fpr[1],
        tpr[1],
        color="darkorange",
        lw=lw,
        label="ROC curve (area = %0.2f)" % roc_auc[1],
    )
    plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'{model_type} AUC')
    plt.show()
    
    plt.plot(lr_recall, lr_precision, marker='.', label='Logistic')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title(f'{model_type} PRC')
    plt.show()
    
    return auc_score, prc_score


In [None]:
# Test the function with 'svm'

auc_score, prc_socre = create_model(train["text"], train["label"], test["text"], test["label"], \
          model_type='svm', stop_words = 'english', min_df = 1, print_result=True, algorithm_para=1.0)

In [None]:
# Test the function with 'nb' 

auc_score, prc_score = create_model(train["text"], train["label"], test["text"], test["label"], \
                    model_type='nb', stop_words = 'english', min_df = 1, print_result=True, algorithm_para=1.0)

### Task 2: Search for best parameters 

From Task 1, you may find there are many possible ways to configure parameters. Next, let's use grid search to find the optimal parameters

- Define a function `search_para(docs, y)` where `docs` are training documents and `y` is the ground-truth labels.
- This function does the following:
    - Create a pipleline which integrates `TfidfVectorizer` and `SVM` classifier 
    - Define the parameter ranges as follow:
        - `stop_words': [None, 'english']`
        - `min_df: [1,2,5]`
        - `C: [0.2,0.5,0.8]`
    - Set the scoring metric to "f1_macro"
    - Use `GridSearchCV` with `5-fold cross validation` to find the best parameter values based on the training dataset. 
    - Print the best parameter values
    
- Call the function `svm_model` defined in Task 1 `with the best parameter values`. You will have to add the best C value to the original function, because you used default settings of the SVM classifier previously. Then please briefly answer the following:
    - Compare with the model in Task 1, how is the performance improved on the test dataset?
    - Why do you think the new parameter values help sentiment classification?

In [21]:
# Define grade search function
from sklearn.model_selection import GridSearchCV

def search_para(docs, y):
    
    objs = [   
        ("tfidf", TfidfVectorizer()),
        ("clf", svm.SVC())
    ]
    
    pipe = Pipeline(objs)

    param_grid = { 
    'tfidf__stop_words': (None,'english'),# None
    'tfidf__min_df': (1,2,5),
    'clf__C': (0.2,0.5,0.8)
    }
    
    grid_search = GridSearchCV(pipe,param_grid,cv= 5, n_jobs=-1, verbose=3)
    grid_search.fit(docs, y)
    best_parameters = grid_search.best_estimator_
    print('best_parameters:',best_parameters)
    # Add your code here


In [23]:
search_para(train["text"][:100], train["label"][:100])

Fitting 5 folds for each of 18 candidates, totalling 90 fits
best_parameters: Pipeline(steps=[('tfidf', TfidfVectorizer(min_df=5, stop_words='english')),
                ('clf', SVC(C=0.8))])


In [None]:
# call the function with the best parameters

auc_score, prc_score = create_model(train["text"], train["label"], test["text"], test["label"], \
          model_type='svm', stop_words = None, min_df = 2, print_result=True, algorithm_para=0.2)

# Please answer the questions


## Task 3. How many samples are enough? 

This task is to help you understand the impact of sample size on classifier performance. 

- Define a function `sample_size_impact(train_docs, train_y, test_docs, test_y)` with the input as defined above. 
    
    
- This function does the following:
    - Starting with 500 samples, in each round you build a classifier with 500 more samples. i.e. in round 1, you use samples from 0:500, and in round 2, you use samples from 0:1000, …, until you use all samples. 
    - In each round, call the `create_model` function with `model_type='svm'` and then `model_type='nb'`. Keep other parameters as default. Record the AUC score of each model type and the sample size.
    - For each model type, plot a line chart show the relationship between sample size and the AUC score. 
    - This function has no return.
    
    
- Write your analysis on the following:
    - How sample size affects each classifier’s performance? 
    - If it is expensive to collect and label samples, can you decide an optimal sample size with model performance and the cost of samples both considered? 
    - How is performance of SVM classifier compared with Naïve Bayes classifier?

In [None]:
# Show the impact of sample size

def sample_size_impact(train_docs, train_y, test_docs, test_y):  
    x = []
    y1 = []
    y2 = []
    for i in (500,len(train_docs),500):
        auc_score, prc_score = create_model(train_docs[:i], train_y[:i], test_docs[:i], test_y[:i],
                                      model_type='svm', stop_words = None, 
                                      min_df = 2, print_result=True,algorithm_para=0.2)
        
        y1.append(auc_score)
        auc_score, prc_score = create_model(train_docs[:i], train_y[:i], test_docs[:i], test_y[:i],
                                      model_type='nb', stop_words = None, 
                                      min_df = 2, print_result=True,algorithm_para=0.2)
        
        y2.append(auc_score)
        x.append(i)
        
    plt.plot(x,y1,label='svm')
    plt.plot(x,y1,label='nb')
    plt.xlabel('sample size')
    plt.ylabel('AUC')

In [None]:
# Test the function

sample_size_impact(train["text"], train["label"], test["text"], test["label"])


# Briefly answer the questions

## Task 4 (Bonus): How to improve sentiment classification?

- Can you tune other parameters of TfidfVectorizer or SVM model to further improve the classification performance? Try to beat the model you get in Task 2!

In [None]:
 # add your code here