# Word Classification

This notebook provides an introduction to using NLTK and Scikit-Learn for word classification.

## Initialize NLTK

Download some of the resources that NLTK needs.

In [1]:
import nltk
nltk.download('book', quiet=True)
nltk.download('opinion_lexicon', quiet=True)

True

## Import the additional modules

The `random` module is loaded to randomize the raw data. The seed is set so that repeated runs of the notebook are reproducible.

While NLTK also contains implementations of machine learning algorithms, Scikit-Learn provides general-purpose implementations. Learning to use its modules means the same skills carry over to data outside NLP.

In [2]:
import random
random.seed(0)

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import (
    classification_report,
    roc_auc_score,
    confusion_matrix,
)
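
As a small illustration of why the seed matters, the sketch below shuffles the same list twice with identically seeded generators. The list `items` is a made-up example, and separate `random.Random` instances are used here (an assumption made for illustration) so that the notebook's global seed is left untouched.

```python
import random

items = list(range(10))  # hypothetical toy data, not the notebook's words

# Two independent generators started from the same seed.
first_rng = random.Random(0)
second_rng = random.Random(0)

first = items.copy()
first_rng.shuffle(first)

second = items.copy()
second_rng.shuffle(second)

# Same seed, same sequence of random draws, same shuffled order.
print(first == second)
```

Both shuffles produce the same order, which is what makes a seeded run repeatable.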

## Word Classification

This part tests a hypothesis: does the structure of a word relate to its sentiment? That is to say, do negative words and positive words follow recognizable patterns?

### Loading the data

NLTK provides the `opinion_lexicon` corpus, which lists words that are negative or positive in nature. The words can be loaded as shown below.

In [3]:
negatives = nltk.corpus.opinion_lexicon.negative()
negatives

['2-faced', '2-faces', 'abnormal', 'abolish', ...]

In [4]:
len(negatives)

4783

In [5]:
positives = nltk.corpus.opinion_lexicon.positive()
positives

['a+', 'abound', 'abounds', 'abundance', 'abundant', ...]

In [6]:
len(positives)

2006

### Balancing the Dataset

The counts show that there are more than twice as many negative words as positive words. An imbalance like this can bias many machine learning algorithms toward the majority class. As a simple solution, the negative words are downsampled to match the number of positive words.

The negatives are shuffled before truncation so that the retained sample is drawn from across the whole list rather than only from the first letters of the alphabet. The result is converted to a list first because `random.shuffle` works in place.

In [7]:
negatives = list(negatives)
random.shuffle(negatives)

In [8]:
negatives = negatives[:len(positives)]

In [9]:
words = negatives + list(positives)
labels = [0] * len(negatives) + [1] * len(positives)

len(words), len(labels)

(4012, 4012)
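
The shuffle-then-slice steps above amount to sampling without replacement. A minimal sketch of the same idea with `random.sample` follows; the toy word lists and the names `toy_negatives`, `toy_positives`, and `rng` are assumptions made for illustration, not the notebook's data.

```python
import random
from collections import Counter

# Hypothetical toy data: more negatives than positives.
toy_negatives = ['bad', 'awful', 'poor', 'dire', 'grim', 'vile']
toy_positives = ['good', 'fine', 'nice']

rng = random.Random(0)  # local generator, leaves the global seed alone

# Draw as many negatives as there are positives, without replacement.
sampled_negatives = rng.sample(toy_negatives, k=len(toy_positives))

toy_words = sampled_negatives + toy_positives
toy_labels = [0] * len(sampled_negatives) + [1] * len(toy_positives)

# Each label now appears the same number of times.
print(Counter(toy_labels))
```

Either approach discards data; it is a simple choice for a small lexicon rather than the only way to handle class imbalance.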

### Feature Engineering

Given that some English affixes can change the meaning of a word, the 3-character suffix and 3-character prefix of each word are used as initial features for determining its sentiment.

In [10]:
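def extract_features(word):
    # Use the last and first three characters as rough stand-ins for affixes.
    return {
        '3-suffix': word[-3:],
        '3-prefix': word[:3],
    }

In [11]:
# The function has its own name so that this list does not shadow it.
word_features = [
    extract_features(w)
    for w in words
]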

In [12]:
word_features[:10]

[{'3-suffix': 'ons', '3-prefix': 'sus'},
 {'3-suffix': 'age', '3-prefix': 'dis'},
 {'3-suffix': 'sty', '3-prefix': 'rus'},
 {'3-suffix': 'ble', '3-prefix': 'gul'},
 {'3-suffix': 'ems', '3-prefix': 'pro'},
 {'3-suffix': 'ess', '3-prefix': 'dev'},
 {'3-suffix': 'rsy', '3-prefix': 'con'},
 {'3-suffix': 'sly', '3-prefix': 'gri'},
 {'3-suffix': 'rim', '3-prefix': 'gri'},
 {'3-suffix': 'nky', '3-prefix': 'cra'}]
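
One detail of slicing is worth checking: words shorter than three characters yield the whole word for both features, and 3-character words yield identical prefix and suffix. The sketch below uses a stand-alone helper named `affix_features` (a hypothetical name chosen for this illustration) with the same slicing logic as the cell above.

```python
def affix_features(word):
    # Same slicing as the notebook's feature function: last 3 and first 3 characters.
    return {
        '3-suffix': word[-3:],
        '3-prefix': word[:3],
    }

# 'a+' appears in the positive lexicon shown earlier; 'disagreeable' is a made-up probe.
short_word = affix_features('a+')
long_word = affix_features('disagreeable')

print(short_word)
print(long_word)
```

Python slices never raise on short strings, so no length check is needed, but very short words contribute the same string to both features.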

### Splitting the Dataset

The dataset will be split into three sets: the training set, the validation set, and the test set.

*   Train (70%): Used for training the machine learning model
*   Validation (10%): Used for evaluation of pipeline changes such as feature engineering and model hyperparameters
*   Test (20%): Used to evaluate the whole pipeline.

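The scikit-learn `train_test_split` function only splits data into two sets, so it is used twice. The first call holds out 20% for testing; the second call uses `test_size=1/8` because one eighth of the remaining 80% equals 10% of the full dataset. The `stratify` option makes sure that each split preserves the balance of the labels, while `random_state` sets the seed so that the split is reproducible.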

In [13]:
train_features, test_features, train_labels, test_labels = \
    train_test_split(word_features, labels, test_size=0.2, stratify=labels, random_state=0)

train_features, val_features, train_labels, val_labels = \
    train_test_split(train_features, train_labels, test_size=1/8, stratify=train_labels, random_state=0)

len(train_features), len(test_features), len(val_features)

(2807, 803, 402)
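
The sizes reported above can be reproduced with plain arithmetic. This is a sketch under one assumption, labeled in the code: a fractional `test_size` is rounded up to a whole number of samples. That assumption is consistent with the counts printed by the notebook.

```python
import math

n_total = 4012  # number of words reported earlier in the notebook

# Assumption: the held-out size is the fraction of the data, rounded up.
n_test = math.ceil(0.2 * n_total)

# The second split takes one eighth of what remains after the test split.
n_remaining = n_total - n_test
n_val = math.ceil(n_remaining / 8)
n_train = n_remaining - n_val

print(n_train, n_test, n_val)

# Shares of the full dataset, rounded to two decimals.
shares = [round(n / n_total, 2) for n in (n_train, n_val, n_test)]
print(shares)
```

The shares come out at roughly 70%, 10%, and 20%, matching the list above.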

### Dictionary to Vectors

Machine learning algorithms work with numeric vectors rather than dictionaries of features. Scikit-Learn's `DictVectorizer` converts each feature dictionary into a vector.

Note that the `fit_transform` method is used on the training set only. This means that only the feature values seen in the training data are considered during vectorization. The validation and test sets use only the `transform` method, so any affix that did not occur in the training data is ignored.

The `DictVectorizer` converts non-numeric values into a one-hot encoding. This means that each distinct feature-value pair, such as `3-prefix=dis`, becomes its own 0/1 flag.

In [14]:
dv = DictVectorizer(sparse=False)
train_vectors = dv.fit_transform(train_features)
val_vectors = dv.transform(val_features)
test_vectors = dv.transform(test_features)

In [15]:
dv.feature_names_[:10], dv.feature_names_[-10:]

(['3-prefix=a+',
  '3-prefix=abo',
  '3-prefix=abr',
  '3-prefix=abs',
  '3-prefix=abu',
  '3-prefix=acc',
  '3-prefix=ace',
  '3-prefix=ach',
  '3-prefix=acr',
  '3-prefix=acu'],
 ['3-suffix=yed',
  '3-suffix=yer',
  '3-suffix=yon',
  '3-suffix=ype',
  '3-suffix=zed',
  '3-suffix=zer',
  '3-suffix=zes',
  '3-suffix=zle',
  '3-suffix=zzy',
  '3-suffix=Żve'])
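
To make the one-hot behaviour concrete, here is a toy sketch with two training dictionaries and one unseen dictionary. The values `ike`, `ome`, and `xyz` and the `toy_` names are made up for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy features in the same shape as the notebook's dictionaries.
toy_train = [
    {'3-prefix': 'dis', '3-suffix': 'ike'},
    {'3-prefix': 'wel', '3-suffix': 'ome'},
]
toy_unseen = [{'3-prefix': 'dis', '3-suffix': 'xyz'}]

toy_dv = DictVectorizer(sparse=False)
toy_train_vectors = toy_dv.fit_transform(toy_train)  # learns the columns
toy_unseen_vectors = toy_dv.transform(toy_unseen)    # reuses the learned columns

print(toy_dv.feature_names_)
print(toy_unseen_vectors)
```

The unseen suffix `xyz` has no column, so it is dropped silently and only the `3-prefix=dis` flag is set in the resulting vector.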

### Naive Bayes

`BernoulliNB` is an implementation of Naive Bayes designed for boolean (0/1) features, which suits the one-hot flags produced above. Note that the `fit` method is called only on the training set.

In [16]:
nb = BernoulliNB()
nb.fit(train_vectors, train_labels)

BernoulliNB()
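
A tiny example can show what the fitted model stores. In the sketch below, the four toy vectors and the `toy_` names are assumptions made for illustration; the model is left at its default settings, which apply additive smoothing with `alpha=1.0`.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data: flag 0 is set only for class 0, flag 1 only for class 1.
toy_vectors = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
toy_labels = [0, 0, 1, 1]

toy_nb = BernoulliNB()  # default alpha=1.0
toy_nb.fit(toy_vectors, toy_labels)

# feature_log_prob_ holds log P(flag is set | class); exponentiate to read it.
flag_prob = np.exp(toy_nb.feature_log_prob_)
print(flag_prob)

toy_predictions = toy_nb.predict(np.array([[1, 0], [0, 1]]))
print(toy_predictions)
```

With smoothing, a flag seen in 2 of 2 examples gets probability (2 + 1) / (2 + 2) = 0.75 rather than 1, and a flag never seen gets 0.25 rather than 0. This keeps a single unseen affix from forcing a class probability to zero.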

In [17]:
train_predict = nb.predict(train_vectors)
val_predict = nb.predict(val_vectors)
test_predict = nb.predict(test_vectors)

### Evaluation

Each set is evaluated separately to show how performance degrades from the training data to the out-of-sample data. The following metrics are computed for each set.

*   ROC AUC
*   Precision
*   Recall
*   F1-Score
*   Accuracy
*   Confusion Matrix

#### Training Performance

In [18]:
roc_auc_score(train_labels, train_predict)

0.8642596349296279
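
The cells in this section pass hard 0/1 predictions to `roc_auc_score`. For binary labels that yields the mean of the two per-class recalls, not a ranking-based score. A common alternative is to pass a continuous score, for example the positive-class column of `predict_proba`. The sketch below contrasts the two on made-up labels and scores; none of the values come from the notebook.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical toy labels and model scores for the positive class.
toy_true = [0, 0, 0, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.6, 0.4, 0.7, 0.9]

# Thresholding at 0.5 throws away the ranking information.
toy_hard = [1 if s >= 0.5 else 0 for s in toy_scores]

auc_from_scores = roc_auc_score(toy_true, toy_scores)
auc_from_hard = roc_auc_score(toy_true, toy_hard)

print(auc_from_scores, auc_from_hard)
```

In this toy case 8 of the 9 positive-negative pairs are ranked correctly, so the score-based value is higher than the one computed from thresholded labels.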

In [19]:
print(classification_report(train_labels, train_predict))

              precision    recall  f1-score   support

           0       0.88      0.84      0.86      1403
           1       0.85      0.89      0.87      1404

    accuracy                           0.86      2807
   macro avg       0.87      0.86      0.86      2807
weighted avg       0.87      0.86      0.86      2807



In [20]:
confusion_matrix(train_labels, train_predict)

array([[1180,  223],
       [ 158, 1246]])

#### Validation Performance

In [21]:
roc_auc_score(val_labels, val_predict)

0.7213930348258707

In [22]:
print(classification_report(val_labels, val_predict))

              precision    recall  f1-score   support

           0       0.73      0.71      0.72       201
           1       0.71      0.74      0.73       201

    accuracy                           0.72       402
   macro avg       0.72      0.72      0.72       402
weighted avg       0.72      0.72      0.72       402



In [23]:
confusion_matrix(val_labels, val_predict)

array([[142,  59],
       [ 53, 148]])
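
The figures in the validation report can be recovered from the confusion matrix above, where rows are true labels and columns are predicted labels. The sketch below does the arithmetic for the positive class (label 1); the variable names are chosen for this illustration.

```python
# Counts copied from the validation confusion matrix above.
true_neg, false_pos = 142, 59
false_neg, true_pos = 53, 148

# Of the words predicted positive, how many are truly positive?
precision = true_pos / (true_pos + false_pos)

# Of the truly positive words, how many were found?
recall = true_pos / (true_pos + false_neg)

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Share of all words classified correctly.
accuracy = (true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

The rounded values agree with the row for label 1 and the accuracy line in the validation report.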

#### Test Performance

In [24]:
roc_auc_score(test_labels, test_predict)

0.6912259153112243

In [25]:
print(classification_report(test_labels, test_predict))

              precision    recall  f1-score   support

           0       0.72      0.64      0.67       402
           1       0.67      0.75      0.71       401

    accuracy                           0.69       803
   macro avg       0.69      0.69      0.69       803
weighted avg       0.69      0.69      0.69       803



In [26]:
confusion_matrix(test_labels, test_predict)

array([[256, 146],
       [102, 299]])

### Interpretability

One of the advantages of Naive Bayes is its simplicity, which allows the decisions made by the model to be analyzed.

In the first part, the most probable features for each label are examined. In the second part, the features that are skewed toward one label are identified.

#### Highest probabilities

By checking the features with the highest probability at each side, the common features at both labels can be examined.

For example, `ing`, `ble`, `ess`, and `ion` appear near the top for both labels. This makes sense, since these suffixes are common in descriptive words regardless of sentiment.

In [27]:
best_neg = nb.feature_log_prob_[0].argsort()
best_pos = nb.feature_log_prob_[1].argsort()

In [28]:
[dv.feature_names_[idx] for idx in best_neg[-10:][::-1]]

['3-prefix=dis',
 '3-suffix=ing',
 '3-suffix=ion',
 '3-suffix=ble',
 '3-suffix=ous',
 '3-suffix=ess',
 '3-suffix=ent',
 '3-suffix=sly',
 '3-suffix=ate',
 '3-suffix=ted']

In [29]:
[dv.feature_names_[idx] for idx in best_pos[-10:][::-1]]

['3-suffix=ing',
 '3-suffix=ble',
 '3-suffix=ess',
 '3-suffix=ous',
 '3-suffix=ent',
 '3-prefix=pro',
 '3-suffix=sly',
 '3-suffix=ion',
 '3-suffix=lly',
 '3-suffix=tly']
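
The indexing pattern used above relies on `argsort` returning indices in ascending order, so the last entries are the most probable features. A minimal sketch with made-up probabilities (the array `toy_log_prob` is an assumption for illustration):

```python
import numpy as np

# Hypothetical log-probabilities for four features of one class.
toy_log_prob = np.log(np.array([0.05, 0.40, 0.10, 0.25]))

order = toy_log_prob.argsort()  # indices from least to most probable

# Take the last two indices and reverse them: most probable first.
top_two = order[-2:][::-1]

print(top_two)
```

Here the result lists index 1 (probability 0.40) followed by index 3 (probability 0.25).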

#### Biased Features

Taking the difference between the log-probabilities of the two labels, which is the log of their ratio, shows which features are skewed toward one label.

For example, `dis`, `mis`, `ina`, `mal`, and `ant` are strongly associated with the negative label. This makes sense: most of them correspond to negating or pejorative prefixes such as `dis-`, `mis-`, `in-`, `mal-`, and `anti-`, which tend to produce words with a negative sentiment.

In [30]:
diff_pos_neg = (nb.feature_log_prob_[1] - nb.feature_log_prob_[0]).argsort()

In [31]:
[dv.feature_names_[idx] for idx in diff_pos_neg[:10]]

['3-prefix=dis',
 '3-prefix=mis',
 '3-prefix=cra',
 '3-prefix=ina',
 '3-prefix=obs',
 '3-prefix=mal',
 '3-prefix=sca',
 '3-prefix=gri',
 '3-prefix=scr',
 '3-prefix=ant']

In [32]:
[dv.feature_names_[idx] for idx in diff_pos_neg[-10:][::-1]]

['3-prefix=wel',
 '3-prefix=rea',
 '3-prefix=eff',
 '3-prefix=ast',
 '3-prefix=tru',
 '3-prefix=aff',
 '3-prefix=eas',
 '3-prefix=cle',
 '3-prefix=end',
 '3-prefix=gen']
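
The log-ratio idea can be checked on a toy example. The feature names below are taken from the outputs above, but the probabilities are made up for illustration and are not the model's fitted values.

```python
import numpy as np

toy_names = ['3-prefix=dis', '3-suffix=ing', '3-prefix=wel']

# Hypothetical P(feature | label) values for each label.
toy_log_prob_neg = np.log(np.array([0.20, 0.10, 0.01]))
toy_log_prob_pos = np.log(np.array([0.01, 0.10, 0.15]))

# Positive minus negative: below zero leans negative, above zero leans positive.
toy_log_ratio = toy_log_prob_pos - toy_log_prob_neg
toy_order = toy_log_ratio.argsort()

most_negative = toy_names[toy_order[0]]
most_positive = toy_names[toy_order[-1]]

print(most_negative, most_positive)
```

A feature that is equally likely under both labels, like `ing` in this toy setup, has a log-ratio of zero and lands in the middle of the ordering, which is why common suffixes drop out of the biased lists.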