# Lab 6 - Text classification by machine learning

In this lab, you will learn:
* How to use machine learning models to classify text
* How to evaluate the performance of different models

This lab is written by Jisun AN (jisunan@smu.edu.sg) and Michelle KAN (michellekan@smu.edu.sg).


# 0. Import Packages

In [None]:
# Packages for data
import pandas as pd
import numpy as np

# Packages for machine learning models
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate

# Packages for visualization
import matplotlib.pyplot as plt
%matplotlib inline 

# 1. Getting the data

In this lab, we will use restaurant review data. 

This data is manually annotated by humans according to their aspect and sentiment. 

One review may have two or more aspects and thus two or more sentiments. 

Note that reviews with conflicting annotations have been excluded.

"restaurant_reviews.tsv" is a tab-separated file whose fields are: 

- `sid` is review id
- `text` is a review
- `aspect` is the area of interest the review addresses. The labels include: <i>food, service, ambience, price</i> 
- `sentiment` consists of one of these labels: <i>positive, negative, neutral</i>


From this dataset, we will create **a 'balanced' dataset** to build classification models. 

The balanced dataset contains an equal number of samples for each label. 

**We will sample 500 positive texts and 500 negative texts.**

In [None]:
ori_df = pd.read_table("https://raw.githubusercontent.com/anjisun221/css_codes/main/restaurant_reviews.tsv", sep="\t")
print(ori_df.shape)
ori_df.head()

In [None]:
ori_df['text'][10]

In [None]:
ori_df['sentiment'].value_counts()


In [None]:
# Sample 500 rows from dataframe --> sample 500 positive texts.
df_pos = ori_df.query('sentiment == "positive"').sample(500, random_state=999)
df_pos.head()

In [None]:
# Sample 500 rows from dataframe --> sample 500 negative texts.
df_neg = ori_df.query('sentiment == "negative"').sample(500, random_state=999)
df_neg.head()

In [None]:
# Combine two dataframes 
df = pd.concat([df_pos, df_neg])
df.shape

In [None]:
df['sentiment'].value_counts()

In [None]:
# We extract text and label to build the model

sentences = df['text'].values
y = df['sentiment'].values


[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) is a function in scikit-learn's `model_selection` module that splits data arrays into two subsets: training data and testing data. With this function, you don't need to divide the dataset manually. It has the following syntax:

    train_test_split(X, y, train_size=0.*,test_size=0.*, random_state=*)

The function takes the following parameters:
- `X, y`: the dataset you're selecting to use. Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
- `train_size`: sets the size of the training dataset. There are three options: `None` (the default), an int giving the exact number of training samples, or a float between 0.0 and 1.0 giving the proportion of the dataset to use for training.
- `test_size`: sets the size of the testing dataset, with the same int and float options. If left as `None`, it is set to the complement of `train_size`; if `train_size` is also `None`, it defaults to 0.25.
- `random_state`: controls the shuffling applied before the split. By default, the split differs from run to run; pass an integer to get the same split every time.
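
To make these parameters concrete, here is a minimal sketch on made-up data. The names `toy_texts` and `toy_labels` are hypothetical and are not part of the lab dataset.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): ten short "texts" and their labels
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = ["positive", "negative"] * 5

# Hold out 20% of the samples for testing; a fixed random_state makes the split repeatable
texts_train, texts_test, labels_train, labels_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=0
)
print(len(texts_train), len(texts_test))

# Running the split again with the same random_state gives the same partition
texts_train_again, _, _, _ = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=0
)
print(texts_train == texts_train_again)
```

With ten samples and `test_size=0.2`, eight samples go to training and two to testing.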

In [None]:
# Randomly split the data into training (80%) and test (20%) datasets
sentences_train, sentences_test, y_train_str, y_test_str = train_test_split(sentences, y, test_size=0.20, random_state=999)


# 2. Extract features (words are features)

### Document-Term Matrix

For many of the techniques we'll be using in future notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), where every row will represent a different document and every column will represent a different word.
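
As an illustration, the sketch below builds a document-term matrix for three made-up sentences (`toy_docs` is hypothetical, not taken from the review data). Each row is a document, each column is a word, and each cell counts how often that word occurs in that document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents (hypothetical examples)
toy_docs = ["the food was great", "the service was slow", "great food great price"]

toy_vectorizer = CountVectorizer()
toy_dtm = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix: documents x words

# vocabulary_ maps each word to its column index; sort the words by that index
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
print(toy_dtm.toarray())
```

The third row has a 2 in the "great" column because that word appears twice in the third document.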



In [None]:
# We are going to create a document-term matrix using CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)


In [None]:
# label encode the target variable - this will change our string labels to integer labels. 
encoder = preprocessing.LabelEncoder()
y_train = encoder.fit_transform(y_train_str)
y_test = encoder.transform(y_test_str)  # reuse the mapping learned from the training labels
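
The sketch below shows what a label encoder does, using made-up labels (`toy_encoder` and the label lists are hypothetical). `fit_transform` learns a label-to-integer mapping and applies it, while `transform` only applies a mapping that was already learned.

```python
from sklearn.preprocessing import LabelEncoder

toy_encoder = LabelEncoder()

# fit_transform learns the mapping (classes are sorted alphabetically) and applies it
toy_train = toy_encoder.fit_transform(["positive", "negative", "negative", "positive"])
print(list(toy_encoder.classes_))  # position in this list = integer code
print(toy_train)

# transform reuses the mapping learned above instead of fitting a new one
toy_test = toy_encoder.transform(["positive", "positive"])
print(toy_test)
```

Here "negative" becomes 0 and "positive" becomes 1. Fitting a fresh encoder on labels that contain only "positive" would encode that label as 0 instead, which is why the mapping is normally learned once and then reused.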


In [None]:
tmp = pd.DataFrame({'y':y_train})
tmp['y'].value_counts()

In [None]:
tmp = pd.DataFrame({'y':y_test})
tmp['y'].value_counts()

### Exercise 1. Improve Document-Term Matrix (DTM)

You can improve the performance of the classification models by using better features. 
In text classification, this can be done by, for example, excluding common English stop words or adding bigrams.
You can do so by setting the relevant parameters of scikit-learn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). 

Challenge: Remove stop words and add bigrams to the DTM and see whether the performance of the model improves. 


In [None]:
vectorizer = #[WRITE YOUR CODE]
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)


# 3. Build the model and evaluate via cross-validation


We will use two classification algorithms. 
* [Naïve Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html): a family of probabilistic algorithms that uses Bayes’s Theorem to predict the category of a text.
* [Support Vector Machines](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html): a non-probabilistic model which uses a representation of text examples as points in a multidimensional space. Examples of different categories (sentiments) are mapped to distinct regions within that space. Then, new texts are assigned a category based on similarities with existing texts and the regions they’re mapped to.

Cross-validation is a common method to evaluate the performance of a text classifier. It works by splitting the training dataset into random, equal-length example sets (e.g., 4 sets with 25% of the data). For each set, a text classifier is trained with the remaining samples (e.g., 75% of the samples). Next, the classifiers make predictions on their respective sets, and the results are compared against the human-annotated tags. This will determine when a prediction was right (true positives and true negatives) and when it made a mistake (false positives, false negatives).

We will use the `cross_validate` function from [scikit-learn's cross-validation tools](https://scikit-learn.org/stable/modules/cross_validation.html) to implement it.
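
To see how the data is divided, the sketch below splits ten made-up samples into 5 folds with `KFold` (`toy_X` is hypothetical). Note one assumption: the lab passes `cv = 5` to `cross_validate`, which for classifiers uses a stratified variant of this splitter; plain `KFold` is used here only to show the idea.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(20).reshape(10, 2)  # 10 hypothetical samples with 2 features each

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(toy_X), start=1):
    # Each fold trains on 4/5 of the data and validates on the remaining 1/5
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"fold {fold}: train on {len(train_idx)} samples, validate on {val_idx.tolist()}")
```

Every sample is used for validation exactly once across the five folds.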

With these results, you can build performance metrics that give a quick assessment of how well a classifier works:

* Accuracy: the percentage of texts that were categorized with the correct tag.
* Precision: of the examples the classifier assigned to a given tag, the percentage that truly belong to that tag.
* Recall: of the examples that truly belong to a given tag, the percentage that the classifier assigned to that tag.
* F1 Score: the harmonic mean of precision and recall.
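
As a worked example, the sketch below computes the four metrics by hand for the positive class, using made-up true and predicted labels (`y_true_toy` and `y_pred_toy` are hypothetical).

```python
# Hypothetical labels for eight reviews (1 = positive, 0 = negative)
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

The `_macro` scorers used in this lab compute precision, recall, and F1 for each class in this way and then average the per-class values.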


In [None]:
# Define evaluation metrics we want to get from cross validation
scoring = ['precision_macro', 'recall_macro', 'f1_macro', 'accuracy', 'balanced_accuracy']

In [None]:
# Print the mean values of evaluation metrics across the 5 cross-validation folds
def print_cross_validation_result(cross_val_result):
    print("Cross Validation Accuracy : ",round(cross_val_result['test_accuracy'].mean() * 100 , 2),"%")
    print("Cross Validation Precision : ",round(cross_val_result['test_precision_macro'].mean() * 100 , 2),"%")
    print("Cross Validation Recall : ",round(cross_val_result['test_recall_macro'].mean() * 100 , 2),"%")
    print("Cross Validation F1 : ",round(cross_val_result['test_f1_macro'].mean() * 100 , 2),"%")

In [None]:
print("Naive Bayes --- ")
cross_val_naive = cross_validate(estimator = MultinomialNB(), X = X_train, y = y_train, scoring=scoring, cv = 5, n_jobs = -1)
print_cross_validation_result(cross_val_naive)


In [None]:
print("Linear SVM results --- ")
cross_val_svc_linear = cross_validate(estimator = SVC(kernel='linear'), X = X_train, y = y_train, scoring=scoring, cv = 5, n_jobs = -1)
print_cross_validation_result(cross_val_svc_linear)


### Exercise 2. Let's change the parameters of SVM to improve the performance. 

[sklearn's SVM documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

**SVM's C parameter:** The C parameter allows you to decide how much you want to penalize misclassified points. A larger C penalizes them more strongly, which means weaker regularization.

**SVM's Kernel:** You can use various kernels with SVM. You can specify the kernel type to be used in the algorithm with the 'kernel' parameter. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, or ‘precomputed’.

Try various values of C and kernel, and find the parameters with the best performance. 

In [None]:
print("SVM results --- ")
cross_val_svc_linear_2 = # WRITE YOUR CODE

print_cross_validation_result(cross_val_svc_linear_2)


### Exercise 3. Let's build the model using Random forest. 

You can find the example here:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

You will need to import the necessary library, and then change the 'estimator' to the one for random forest. 


In [None]:
#WRITE-YOUR-CODE # import the library 

print("Random Forest --- ")
cross_val_rfc = #WRITE-YOUR-CODE
print_cross_validation_result(cross_val_rfc)


# 4. Find the most important features

We can visualize the features that matter most when classifying texts as positive or negative reviews.


In [None]:
def plot_coefficients(classifier, feature_names, modelname, top_features=20):
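    if hasattr(classifier, 'coef_'):
        coef = classifier.coef_.ravel()
    else:
        # MultinomialNB no longer exposes coef_ in recent scikit-learn releases;
        # use the per-word log-probability gap between class 1 and class 0 instead
        coef = classifier.feature_log_prob_[1] - classifier.feature_log_prob_[0]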
    top_positive_coefficients = np.argsort(coef)[-top_features:]
    top_negative_coefficients = np.argsort(coef)[:top_features]
    top_coefficients = np.hstack([top_negative_coefficients, top_positive_coefficients])
    # create plot
    plt.figure(figsize=(16, 6))

    plt.title('Important features by %s model' % (modelname), fontsize=20)
    plt.ylabel('Coefficient', fontsize=18)
    plt.xlabel('Negative Reviews <<------------------ Important features ------------------>> Positive Reviews', fontsize=18)
    
    colors = ['red' if c < 0 else 'blue' for c in coef[top_coefficients]]
    plt.bar(np.arange(2 * top_features), coef[top_coefficients], color=colors)
    feature_names = np.array(feature_names)
    plt.xticks(np.arange(0, 0 + 2 * top_features), feature_names[top_coefficients], rotation=60, ha='right', fontsize=14)    

    plt.show() 

In [None]:
# We build the model using all our training data
# LinearSVC is another way to define a linear SVM; its coef_ holds one weight per feature, which we plot below
svm = LinearSVC()
svm.fit(X_train, y_train)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() requires version 1.0 or later
plot_coefficients(svm, vectorizer.get_feature_names_out(), "Linear SVM")


In [None]:
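# We build the model using all our training data
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train,y_train)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() requires version 1.0 or later
plot_coefficients(naive_bayes, vectorizer.get_feature_names_out(), "Naive Bayes")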

# 5. Classify new texts into positive or negative class & find the best model

Using our test dataset, we will find the best model. 

In [None]:
# Let's build the best-performing models
m_naive = MultinomialNB().fit(X_train, y_train)
m_svm = SVC(kernel='linear').fit(X_train, y_train)


In [None]:
# We need to extract features for our test set, so we build the DTM for the test texts
X_test = vectorizer.transform(sentences_test)


In [None]:
def classify_and_evaluate(mymodel, X_test):
    predicted = mymodel.predict(X_test)
    y_true = y_test
    y_pred = predicted
    print(classification_report(y_true, y_pred))


In [None]:
print("naive bayes ---")
classify_and_evaluate(m_naive, X_test)


In [None]:
print("SVM Linear --- ")
classify_and_evaluate(m_svm, X_test)


### Exercise 4. Classify the below new texts!

You have two new texts. Please use the best model to classify those texts into positive or negative. 
Print out the result. 

In [None]:
new_texts = ["Authentic, cheap, huge portion Korean food in orchard. ", 
             "The food is server quite fast but compare to the quantity of tender beef given last time & now is like a reduction in size."]


In [None]:
# Write your code
[Write your code]

### Advanced exercise (optional) 

Instead of CountVectorizer, consider using a TF-IDF vectorizer to improve the performance of the model.
See details about TF-IDF vectorization [here](https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a) and see sklearn's [TFIDF vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

Define a TF-IDF vectorizer, use it as a replacement for CountVectorizer, and see whether it improves the performance of the classification model. You can rerun the notebook from Section 3. 

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = # WRITE YOUR CODE
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
