# Text Classification 

## What is Text Classification?

Text classification (or document classification) is a fundamental problem in information science and computer science. It aims to assign a given text to one or more categories. Text classification has a wide range of real-world applications, such as spam filtering and sentiment analysis. As the following example shows, a classifier categorizes emails into two classes (i.e., spam and non-spam). To build such a classifier, we train a machine learning model on labelled (or, in some settings, unlabelled) data. The learning problem is formulated as:

$$y = f(X)$$, where $X$ denotes the input samples, $y$ the corresponding labels, and $f$ the classifier.
 
![](https://developers.google.com/machine-learning/guides/text-classification/images/TextClassificationExample.png)

Picture Courtesy: https://developers.google.com/machine-learning/guides/text-classification
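
To make the notation $y = f(X)$ concrete, here is a toy sketch in which $f$ is a hand-written keyword rule rather than a learned model. The keyword list and the function name `f` are illustrative assumptions and are not part of this tutorial's pipeline.

```python
# Hypothetical spam keywords, chosen only for illustration.
SPAM_KEYWORDS = {"free", "winner", "prize"}

def f(texts):
    """Map each input text to a label: 1 (spam) or 0 (non-spam)."""
    labels = []
    for text in texts:
        words = set(text.lower().split())
        # Label as spam if the text shares at least one word with the keyword list.
        labels.append(1 if words & SPAM_KEYWORDS else 0)
    return labels

X = ["Claim your free prize now", "Meeting moved to 3 pm"]
y = f(X)
print(y)  # [1, 0]
```

In the rest of the tutorial, $f$ is learned from labelled examples instead of being written by hand.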

## Supervised Text Classification with Classical Machine Learning




In this tutorial, we use the corpus from the [CL-Aff shared task](https://sites.google.com/view/affcon2019/cl-aff-shared-task?authuser=0), which draws on HappyDB. HappyDB is a dataset of about 100,000 `happy moments` crowd-sourced via Amazon’s Mechanical Turk. Each worker was asked to describe, in a complete sentence, `what made them happy in the past 24 hours`, and to report three such moments.

In this tutorial, we focus on `sociality classification`. Sociality refers to `whether or not other people than the author are involved in the emotion situation`. For example, an emotional experience with a sociality value of "yes" (i.e., other people are involved) could teach us about social groups (e.g., families) and the range of emotions expressed during specific types of situations (e.g., weddings, deaths).

We use only the labelled portion of the dataset, which includes 10,560 labelled samples.

We placed the dataset in the ``./happy_db`` folder as three files: ``train.tsv``, ``dev.tsv``, and ``test.tsv``.

In [None]:
# load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
import string
import json
import os

#### a. Load datasets

The files are tab-separated (TSV), i.e., the delimiter is "\t".

In [None]:
train_set = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/happy_db/train.tsv", sep="\t")
dev_set = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/happy_db/dev.tsv", sep="\t")
test_set = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/happy_db/test.tsv", sep="\t")
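
As a quick sanity check of how `sep="\t"` behaves, the sketch below parses a tiny in-memory TSV instead of the real files. The two rows are made up, and only the `moment` and `social` columns seen in this tutorial are assumed.

```python
import io
import pandas as pd

# A tiny in-memory TSV with made-up rows; the real files live in the happy_db folder.
tsv_text = (
    "moment\tsocial\n"
    "I had lunch with a friend.\tyes\n"
    "I finished a good book.\tno\n"
)

toy_set = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(toy_set.shape)          # (2, 2)
print(list(toy_set.columns))  # ['moment', 'social']
```

Because the delimiter is a tab, commas and periods inside a moment do not split the text into extra columns.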

In [None]:
train_set.head()

Unnamed: 0,moment,social
0,"It was my birthday, and my wife and daughter s...",yes
1,The weather has been warm and gorgeous for the...,no
2,"Yesterday, my boyfriend and I went to a beauti...",yes
3,I bought a new cell phone for my dad.,yes
4,I got news from my supervisor that my request ...,yes


To train a machine learning system, we need to convert the text labels to numerical representations. 

In [None]:
# convert textual labels to numerical labels
label2ind = {'no': 0, 'yes': 1}

train_set["social"] = train_set["social"].apply(lambda x: label2ind[x])
dev_set["social"] = dev_set["social"].apply(lambda x: label2ind[x])
test_set["social"] = test_set["social"].apply(lambda x: label2ind[x])

In [None]:
train_set.head()

Unnamed: 0,moment,social
0,"It was my birthday, and my wife and daughter s...",1
1,The weather has been warm and gorgeous for the...,0
2,"Yesterday, my boyfriend and I went to a beauti...",1
3,I bought a new cell phone for my dad.,1
4,I got news from my supervisor that my request ...,1
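
The lookup `label2ind[x]` raises a `KeyError` if a file contains a label other than `yes` or `no`. As an alternative sketch (not what this tutorial runs), `Series.map` with the same dictionary turns unknown labels into `NaN`, which can then be counted. The `"maybe"` label below is made up to show the effect.

```python
import pandas as pd

label2ind = {"no": 0, "yes": 1}

# "maybe" is a made-up, unexpected label used only for illustration.
labels = pd.Series(["yes", "no", "yes", "maybe"])

# Labels missing from the dictionary become NaN instead of raising an error.
mapped = labels.map(label2ind)
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)  # 1
```

Checking that the count of unmapped labels is zero before training is a cheap way to catch label typos in the data.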


#### b. Data Preparation

In [None]:
X_train = train_set['moment']
y_train = train_set['social']

X_dev = dev_set['moment']
y_dev = dev_set['social']

X_test = test_set['moment']
y_test = test_set['social']

##### b.1 Normalize input texts

In [None]:
import re

def cleanHtml(sentence):
    # remove HTML tags (anything between "<" and ">"), replacing each with a space
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', str(sentence))
    return cleantext

def cleanPunc(sentence):
    # delete the characters ? ! ' " # and | outright
    cleaned = re.sub(r'[?!\'"#|]', r'', sentence)
    # replace . , ( ) and / with a space so neighbouring words stay separated
    cleaned = re.sub(r'[.,()/]', r' ', cleaned)
    cleaned = cleaned.strip()
    cleaned = cleaned.replace("\n", " ")
    return cleaned

def preprocessing(X):
    X = X.str.lower()  # lower case
    X = X.apply(cleanHtml)
    X = X.apply(cleanPunc)
    return X

In [None]:
# apply preprocessing function
X_train = preprocessing(X_train)
X_dev = preprocessing(X_dev)
X_test = preprocessing(X_test)

In [None]:
print(X_train[0])
print(y_train[0])

it was my birthday  and my wife and daughter surprised me with some surprise guests and a small party
1
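
To see what each normalization step does, the sketch below traces a made-up sentence through lower-casing, tag removal, and punctuation cleaning. The two helpers are restated with equivalent regular expressions so that the snippet runs on its own.

```python
import re

def cleanHtml(sentence):
    # Replace anything that looks like an HTML tag with a space.
    return re.sub(r"<.*?>", " ", str(sentence))

def cleanPunc(sentence):
    # Delete ? ! ' " # | and turn . , ( ) / into spaces.
    cleaned = re.sub("[?!'\"#|]", "", sentence)
    cleaned = re.sub(r"[.,()/]", " ", cleaned)
    return cleaned.strip().replace("\n", " ")

raw = "I can't wait!<br>Dinner (with friends), then a movie."  # made-up example
lowered = raw.lower()
no_tags = cleanHtml(lowered)
cleaned = cleanPunc(no_tags)
print(repr(cleaned))
print(cleaned.split())
```

Runs of spaces can remain after cleaning, as in the tutorial's own output above. This is harmless here because the vectorizer used in the next step splits text into word tokens.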


##### b.2 Vectorization. 

We convert each input text string into a numerical vector using `CountVectorizer` from sklearn. First, the vectorizer learns an $n$-gram vocabulary from `X_train`; in this tutorial, we use unigrams only. Then, the vectorizer is applied to the input texts.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer( analyzer='word', ngram_range=(1,1))
vectorizer.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [None]:
# vocabulary_ maps each learned word to its column index
print("Vocabulary size: ", len(vectorizer.vocabulary_))

Vocabulary size:  6780


The learned vocabulary contains 6,780 words.


In [None]:
# Processing input texts
# Transform and save to a numpy array
X_vec_train = vectorizer.transform(X_train).toarray() 
X_vec_dev = vectorizer.transform(X_dev).toarray() 
X_vec_test = vectorizer.transform(X_test).toarray() 

The `CountVectorizer` converts each text sample to a vocabulary-sized vector $v = [v_1, \dots, v_{6780}]$, where $v_i$ is the number of times word $i$ occurs in the given input text.

In [None]:
print(X_vec_train[0].shape)

(6780,)
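
The count vectors are easier to picture on a corpus small enough to read. The sketch below fits a `CountVectorizer` with the same settings on two made-up sentences and vectorizes a third; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences, used only to keep the vocabulary readable.
toy_corpus = ["my wife and my daughter", "a small party"]

toy_vectorizer = CountVectorizer(analyzer="word", ngram_range=(1, 1))
toy_vectorizer.fit(toy_corpus)

# Order the learned words by their column index.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
counts = toy_vectorizer.transform(["my wife and my party"]).toarray()[0]

print(vocab)            # ['and', 'daughter', 'my', 'party', 'small', 'wife']
print(counts.tolist())  # [1, 0, 2, 1, 0, 1]
```

The word `a` is missing from the vocabulary because the default `token_pattern` keeps only tokens of two or more characters. Words that never appeared during `fit` are ignored at transform time.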


### Modeling 

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report,multilabel_confusion_matrix
from sklearn.metrics import f1_score, accuracy_score
 

#### a. Gaussian Naive Bayes

We use the implementation from sklearn, [`sklearn.naive_bayes.GaussianNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html). The likelihood of each feature is assumed to be Gaussian:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$$

where the mean $\mu_y$ and variance $\sigma^2_y$ are estimated from the training data.

We train a GaussianNB on the training set and then evaluate it on the Dev and Test sets.
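
The Gaussian assumption can be illustrated by hand for a single feature. The sketch below estimates a mean and variance from made-up word counts in one class and evaluates the Gaussian density at a new count. It is a simplified illustration: it uses the population variance and omits details of the sklearn implementation such as variance smoothing.

```python
import math

def gaussian_likelihood(x, mean, var):
    """Gaussian density exp(-(x - mean)^2 / (2 var)) / sqrt(2 pi var)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Made-up counts of one word across five documents of the same class.
counts_in_class = [0, 1, 0, 2, 1]

mean = sum(counts_in_class) / len(counts_in_class)
var = sum((c - mean) ** 2 for c in counts_in_class) / len(counts_in_class)

# Likelihood of observing this word once in a new document of that class.
likelihood = gaussian_likelihood(1, mean, var)
print(mean, var, likelihood)
```

Naive Bayes multiplies such per-feature likelihoods together with the class prior and predicts the class with the largest product.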

In [None]:
# use training set to fit model 
classfier_gnb = GaussianNB()
classfier_gnb.fit(X_vec_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [None]:
# Use trained model to predict labels of Dev and Test sets and evaluate the predictions. 

pred_dev = classfier_gnb.predict(X_vec_dev)
pred_test = classfier_gnb.predict(X_vec_test)

In [None]:
# Performance on Dev set
acc_dev = accuracy_score(y_dev, pred_dev)
f1_macro_dev = f1_score(y_dev, pred_dev, average="macro")

print("Performance on Dev set: f1_macro",f1_macro_dev, "Accuracy", acc_dev)


In [None]:
# Performance on Test set
acc_test = accuracy_score(y_test, pred_test)
f1_macro_test = f1_score(y_test, pred_test, average="macro")

print("Performance on Test set: f1_macro", f1_macro_test, "Accuracy", acc_test)

Performance on Test set: f1_macro 0.7301075857022306 Accuracy 0.7301136363636364
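
Accuracy and macro-averaged F1 can tell different stories when classes are imbalanced. The sketch below computes both by hand on made-up labels so the definitions are explicit; the helper names are illustrative and are not sklearn functions.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, cls):
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    # Unweighted mean of the per-class F1 scores.
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Made-up imbalanced example: the model always predicts class 1.
y_true_toy = [1, 1, 1, 1, 0]
y_pred_toy = [1, 1, 1, 1, 1]

acc_toy = accuracy(y_true_toy, y_pred_toy)
f1_toy = macro_f1(y_true_toy, y_pred_toy)
print(acc_toy, f1_toy)
```

Here accuracy is 0.8 while macro F1 is 4/9, because class 0 is never predicted and contributes an F1 of 0 to the average.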


#### b. Logistic Regression

We use the implementation from sklearn, [`sklearn.linear_model.LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Logistic regression is a linear model for classification.
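
For the binary case, "linear model" means that the predicted probability is a sigmoid applied to a weighted sum of the count features, $P(y=1 \mid x) = 1 / (1 + e^{-(w^\top x + b)})$. The sketch below evaluates this by hand; the weights, bias, and count vector are hypothetical values, not parameters learned in this tutorial.

```python
import math

def predict_proba_positive(x, w, b):
    """Sigmoid of the linear score w.x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights for a three-word vocabulary and a hypothetical bias.
w = [1.5, -2.0, 0.0]
b = -0.5

x = [2, 0, 1]  # made-up count vector for one text
p = predict_proba_positive(x, w, b)
label = 1 if p >= 0.5 else 0
print(p, label)
```

A positive weight pushes the prediction toward class 1 each time its word occurs, and a negative weight pushes it toward class 0.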

In [None]:
# use training set to fit model 
classfier_lr = LogisticRegression()
classfier_lr.fit(X_vec_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
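
The message above indicates that the `lbfgs` solver stopped at the default `max_iter=100` before converging. One of the remedies the message suggests is to raise `max_iter`. The sketch below shows that option on small synthetic count data (an assumption made so the snippet is self-contained) and checks whether a `ConvergenceWarning` was raised.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic count-like features; not the HappyDB data.
rng = np.random.RandomState(0)
X_toy = rng.poisson(1.0, size=(200, 20))
y_toy = (X_toy[:, 0] > X_toy[:, 1]).astype(int)

clf = LogisticRegression(max_iter=1000)

# Record warnings so we can check for a ConvergenceWarning afterwards.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clf.fit(X_toy, y_toy)

converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("converged:", converged, "iterations used:", clf.n_iter_[0])
```

The scores reported in this tutorial come from the default settings, so refitting the tutorial's model with a larger `max_iter` may change them slightly.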

In [None]:
# Use trained model to predict labels of Dev and Test sets and evaluate the predictions. 

pred_dev = classfier_lr.predict(X_vec_dev)
pred_test = classfier_lr.predict(X_vec_test)

In [None]:
# Performance on Dev set
acc_dev = accuracy_score(y_dev, pred_dev)
f1_macro_dev = f1_score(y_dev, pred_dev, average="macro")

print("Performance on Dev set: f1_macro", f1_macro_dev, "Accuracy", acc_dev)

Performance on Dev set: f1_macro 0.8880262249827466 Accuracy 0.8882575757575758


In [None]:
# Performance on Test set
acc_test = accuracy_score(y_test, pred_test)
f1_macro_test = f1_score(y_test, pred_test, average="macro")

print("Performance on Test set: f1_macro", f1_macro_test, "Accuracy", acc_test)

Performance on Test set: f1_macro 0.887735685867497 Accuracy 0.8882575757575758


### References: 
* scikit-learn: https://scikit-learn.org/stable/ind
* https://developers.google.com/machine-learning/guides/text-classification