This is a simplified example of building a basic supervised text classification model.

- **supervised**: we know the correct output class for each text in sample data
- **text**: input data is in a text format
- **classification model**: a model that uses input data to predict output class

We will build a supervised sentiment classifier, using sentiment polarity data on movie reviews with a binary target.

In [1]:
import nltk
nltk.download('stopwords') 
nltk.download('wordnet')
nltk.download('movie_reviews')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/adarshsalapaka/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/adarshsalapaka/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/adarshsalapaka/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

## Import sample data and packages

Firstly, let’s prepare the environment by importing the required packages:

In [2]:
import pandas as pd
from nltk.corpus import movie_reviews, stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score

We will convert the tagged movie_reviews corpus from nltk into a pandas dataframe.

In [3]:
reviews = []
for fileid in movie_reviews.fileids():
    tag, filename = fileid.split('/')
    reviews.append((tag, movie_reviews.raw(fileid)))

In [4]:
sample = pd.DataFrame(reviews, columns=['target', 'document'])
print(f'Dimensions: {sample.shape}')
sample.head()

Dimensions: (2000, 2)


|   | target | document |
|---|--------|----------|
| 0 | neg | plot : two teen couples go to a church party ,... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |


You will see that the dataframe has 2000 rows, one per review, and 2 columns: the target (the sentiment polarity) and the review text (i.e. the document). Each review is tagged as either positive or negative. Let’s check the counts of the target classes.

In [5]:
sample['target'].value_counts()

pos    1000
neg    1000
Name: target, dtype: int64

## Partition data

When it comes to partitioning data, we have 2 options:

 - Split the sample data into 3 groups: train, validation and test, where train is used to fit the model, validation is used to evaluate fitness of interim models, and test is used to assess final model fitness.
 - Split the sample data into 2 groups: train and test, where train is further split into train and validation sets k times using k-fold cross validation, and test is used to assess final model fitness. With k-fold cross validation:
    1. Train is split into k pieces.
    2. One piece is held out as the validation set; the model is fitted to the remaining k-1 pieces and evaluated on the held-out piece.
    3. The second step is repeated k-1 more times, each time with a different piece as the validation set, so that every piece of train is used for validation exactly once.


Interim models here refer to the models created during the iterative process of comparing different machine learning classifiers as well as trying different hyperparameters for a given classifier to find the best model.
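
The k-fold steps described above can be sketched on toy data as follows. This is only an illustration: the ten integer "documents" are made up, and scikit-learn's `KFold` is used here to make the splitting visible. When `cross_val_score` is given `cv=5` with a classifier, it uses a stratified variant of the same idea.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy "documents", identified only by their index (made up for illustration)
indices = np.arange(10)

kfold = KFold(n_splits=5, shuffle=True, random_state=123)

# Count how often each item lands in a validation set
validation_counts = np.zeros(len(indices), dtype=int)

for fold, (train_idx, val_idx) in enumerate(kfold.split(indices), start=1):
    validation_counts[val_idx] += 1
    print(f"Fold {fold}: train={train_idx.tolist()}, validation={val_idx.tolist()}")

# Every item is used for validation exactly once across the 5 folds
print(validation_counts.tolist())
```

Each fold trains on 8 items and validates on the remaining 2, and the final count confirms that no item is validated on twice.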

We will use the second option to partition the sample data. Let’s put aside some test data so that we can later check how well the final model generalises to unseen data:

In [6]:
X_train, X_test, y_train, y_test = train_test_split(sample['document'], sample['target'], test_size=0.3, random_state=123)
print(f'Train dimensions: {X_train.shape, y_train.shape}')
print(f'Test dimensions: {X_test.shape, y_test.shape}')

# Check out target distribution
print(y_train.value_counts())
print(y_test.value_counts())

Train dimensions: ((1400,), (1400,))
Test dimensions: ((600,), (600,))
pos    700
neg    700
Name: target, dtype: int64
pos    300
neg    300
Name: target, dtype: int64


We have 1400 documents in the train dataset and 600 in the test dataset. The target is evenly distributed in both.
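
A random split is not guaranteed to preserve the class proportions. If it does not, one option is the `stratify` argument of `train_test_split`. The sketch below is not part of the original workflow; it uses a made-up imbalanced toy dataframe (80 negative, 20 positive) to show the effect.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 80 'neg' and 20 'pos' labels (made up for illustration)
toy = pd.DataFrame({
    'document': [f'review {i}' for i in range(100)],
    'target': ['neg'] * 80 + ['pos'] * 20,
})

# stratify keeps the 80/20 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    toy['document'], toy['target'],
    test_size=0.3, random_state=123, stratify=toy['target'])

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Both partitions keep 20% positives (14 of 70 in train, 6 of 30 in test).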

## Preprocess documents

It’s time to preprocess the training documents, that is, to transform unstructured text into a matrix of numbers. We will use an approach called bag-of-words, where each text is represented by the words it contains, regardless of their order or the grammar of the sentence. The steps are:

1. Tokenise
2. Normalise
3. Remove stop words
4. Count vectorise
5. Transform to tf-idf representation
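
To make the last two steps concrete, here is a minimal sketch on three made-up sentences. It uses scikit-learn's default tokenisation rather than the custom preprocessing defined later, so it only illustrates the idea: word order is discarded, and `TfidfVectorizer` behaves like `CountVectorizer` followed by `TfidfTransformer`.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Made-up toy documents; the first two differ only in word order
toy_docs = ["dog bites man", "man bites dog", "dog bites dog"]

# Step 4: count vectorise
count_vectoriser = CountVectorizer()
counts = count_vectoriser.fit_transform(toy_docs)
vocabulary = sorted(count_vectoriser.vocabulary_,
                    key=count_vectoriser.vocabulary_.get)
print(vocabulary)
print(counts.toarray())

# Step 5: transform the counts to a tf-idf representation
two_step = TfidfTransformer().fit_transform(counts)

# TfidfVectorizer performs both steps in one go
one_step = TfidfVectorizer().fit_transform(toy_docs)
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The first two rows of the count matrix are identical even though the sentences mean different things, which is exactly the information a bag-of-words representation gives up.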

In [7]:
def preprocess_text(text):
    # Tokenise words while ignoring punctuation
    tokeniser = RegexpTokenizer(r'\w+')
    tokens = tokeniser.tokenize(text)
    
    # Lowercase and lemmatise (every token is treated as a verb)
    lemmatiser = WordNetLemmatizer()
    lemmas = [lemmatiser.lemmatize(token.lower(), pos='v') for token in tokens]
    
    # Remove stop words; build the set once rather than reloading the list for every lemma
    stop_words = set(stopwords.words('english'))
    keywords = [lemma for lemma in lemmas if lemma not in stop_words]
    return keywords

In [8]:
# Create an instance of TfidfVectorizer
vectoriser = TfidfVectorizer(analyzer=preprocess_text)

# Fit to the data and transform to feature matrix
X_train_tfidf = vectoriser.fit_transform(X_train)
X_train_tfidf.shape

(1400, 27676)

After preprocessing, our training data is a 1400 x 27676 feature matrix stored in a sparse format, which stores the data efficiently and speeds up subsequent processing. The 27676 features correspond to the unique lemmatised, non-stop-word tokens in the training dataset. The training data is now ready for modelling.
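
To see why the sparse format matters, note that a dense 1400 x 27676 matrix would hold about 38.7 million values, most of them zero. The sketch below compares dense and sparse storage on a smaller, randomly generated matrix. The sizes and fill rate are arbitrary assumptions, not measurements from the review data.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(123)

# Arbitrary toy dimensions: 200 "documents" x 2000 "terms"
n_docs, n_terms = 200, 2000
dense = np.zeros((n_docs, n_terms))

# Fill at most 4000 cells (about 1% of the matrix) with non-zero weights
n_filled = 4000
rows = rng.integers(0, n_docs, size=n_filled)
cols = rng.integers(0, n_terms, size=n_filled)
dense[rows, cols] = rng.random(n_filled) + 0.1

# Compressed sparse row format stores only the non-zero entries
csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(f"Non-zero entries: {csr.nnz} of {n_docs * n_terms}")
print(f"Dense storage:  {dense_bytes} bytes")
print(f"Sparse storage: {sparse_bytes} bytes")
```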

## Data Modelling

Let’s build a baseline model using a Stochastic Gradient Descent classifier. I have chosen this classifier because it is fast and works well with sparse matrices. Using 5-fold cross validation, let’s fit the model to the data and evaluate it.

In [9]:
sgd_clf = SGDClassifier(random_state=123)
sgd_clf_scores = cross_val_score(sgd_clf, X_train_tfidf, y_train, cv=5)

print(sgd_clf_scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (sgd_clf_scores.mean(), sgd_clf_scores.std() * 2))

[0.82857143 0.85       0.84285714 0.81785714 0.81428571]
Accuracy: 0.83 (+/- 0.03)


Given that the data is perfectly balanced and we want both labels to be predicted as correctly as possible, we will use accuracy as the metric to evaluate model fitness.

However, accuracy is not always the best measure: depending on the distribution of the target and the relative misclassification costs of the classes, other evaluation metrics such as precision, recall or f1 may be more appropriate.
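
As a made-up illustration of this point, consider an imbalanced toy target with 95 negatives and 5 positives, and a "model" that always predicts negative. The labels below are invented for the example and are unrelated to the movie review data.

```python
from sklearn.metrics import accuracy_score, recall_score

# Invented imbalanced target: 95 'neg' and 5 'pos'
y_true = ['neg'] * 95 + ['pos'] * 5

# A degenerate classifier that always predicts the majority class
y_pred = ['neg'] * 100

acc = accuracy_score(y_true, y_pred)
pos_recall = recall_score(y_true, y_pred, pos_label='pos')

print(f"Accuracy: {acc:.2f}")
print(f"Recall for 'pos': {pos_recall:.2f}")
```

Accuracy comes out at 0.95 even though not a single positive example is found (recall of 0.00), which is why accuracy alone can mislead when classes are imbalanced.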

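The initial performance does not look bad: the baseline model predicts correctly about 83% of the time, give or take 3 percentage points (two standard deviations across the folds).
Of note, when no scoring is given, cross_val_score falls back to the estimator’s own score method, which is accuracy for classifiers. Hence we don’t need to specify it unless we want to be explicit, like below: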

In [10]:
cross_val_score(sgd_clf, X_train_tfidf, y_train, cv=5, scoring='accuracy')

array([0.82857143, 0.85      , 0.84285714, 0.81785714, 0.81428571])

Let’s understand the predictions a bit further by looking at the confusion matrix:

In [11]:
sgd_clf_pred = cross_val_predict(sgd_clf, X_train_tfidf, y_train, cv=5)

print(confusion_matrix(y_train, sgd_clf_pred))

[[580 120]
 [117 583]]


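The accuracy of predictions is similar for both classes: 580 of the 700 negative reviews (first row) and 583 of the 700 positive reviews (second row) are classified correctly.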
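
The per-class figures can be derived directly from the confusion matrix printed above. The sketch below hard-codes that matrix and relies on scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order (`neg`, then `pos`).

```python
import numpy as np

# Confusion matrix from the cross-validated predictions above
# rows = true labels, columns = predicted labels, order = ['neg', 'pos']
cm = np.array([[580, 120],
               [117, 583]])
labels = ['neg', 'pos']

correct = cm.diagonal()
recall = correct / cm.sum(axis=1)      # share of each true class that was found
precision = correct / cm.sum(axis=0)   # share of each predicted class that was right
accuracy = correct.sum() / cm.sum()

for label, p, r in zip(labels, precision, recall):
    print(f"{label}: precision={p:.3f}, recall={r:.3f}")
print(f"accuracy={accuracy:.3f}")
```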

## Attempt to improve performance

The purpose of this section is to find the best machine learning algorithm as well as its hyperparameters. 

Let’s see if we are able to improve the model by tweaking some hyperparameters. We will leave most of the hyperparameters at their sensible default values. With the help of grid search, we will run a model with every combination of the hyperparameters specified below and cross validate the results to get a feel for its accuracy:

In [12]:
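# Note: 'log' is the spelling accepted by the scikit-learn version used in this notebook;
# from release 1.3 onwards the same loss is named 'log_loss'
grid = {'fit_intercept': [True, False],
        'early_stopping': [True, False],
        'loss': ['hinge', 'log', 'squared_hinge'],
        'penalty': ['l2', 'l1', 'none']}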

search = GridSearchCV(estimator=sgd_clf, param_grid=grid, cv=5)
search.fit(X_train_tfidf, y_train)

search.best_params_

{'early_stopping': False,
 'fit_intercept': False,
 'loss': 'log',
 'penalty': 'l1'}

These are the best values for the hyperparameters specified above. Let’s train and validate the model using these values for the selected hyperparameters:

In [13]:
grid_sgd_clf_scores = cross_val_score(search.best_estimator_, X_train_tfidf, y_train, cv=5)
print(grid_sgd_clf_scores)

print("Accuracy: %0.2f (+/- %0.2f)" % (grid_sgd_clf_scores.mean(), grid_sgd_clf_scores.std() * 2))

[0.85       0.85714286 0.83571429 0.84285714 0.82857143]
Accuracy: 0.84 (+/- 0.02)


The model fitness is slightly better than that of the initial model (mean accuracy of 0.84 versus 0.83).
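
To put the cost of the grid search from earlier in this section in perspective, the number of candidate models can be counted with scikit-learn's `ParameterGrid`. The sketch below repeats the grid definition so that it runs on its own, and assumes the same 5-fold cross validation.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above, repeated so this snippet is self-contained
grid = {'fit_intercept': [True, False],
        'early_stopping': [True, False],
        'loss': ['hinge', 'log', 'squared_hinge'],
        'penalty': ['l2', 'l1', 'none']}

n_candidates = len(ParameterGrid(grid))   # 2 * 2 * 3 * 3 combinations
n_fits = n_candidates * 5                 # each candidate is fitted once per fold

print(f"Candidates: {n_candidates}")
print(f"Model fits during cross validation: {n_fits}")
```

That is 36 candidates and 180 fits, plus one final refit of the best candidate on the full training data. Each added hyperparameter multiplies the total, so the grid grows quickly.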

We will choose this hyperparameter combination for our final model and stop this section here in the interest of time. However, the section could be extended by trying different modelling techniques and by searching for optimal hyperparameter values for each of them with a grid search.
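
As a rough sketch of what such an extension might look like, the snippet below compares three classifiers with cross validation. Everything in it is an assumption made for illustration: the eight-sentence corpus is invented, the choice of `LogisticRegression` and `MultinomialNB` as alternatives is arbitrary, and scores on data this small say nothing about the movie reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus, only to make the comparison loop runnable
toy_docs = ["great film loved it", "wonderful acting and great plot",
            "loved the wonderful story", "great fun loved every minute",
            "terrible film hated it", "awful acting and boring plot",
            "hated the boring story", "awful mess hated every minute"]
toy_labels = ['pos'] * 4 + ['neg'] * 4

X_toy = TfidfVectorizer().fit_transform(toy_docs)

# Hypothetical shortlist of candidate classifiers
candidates = {
    'sgd': SGDClassifier(random_state=123),
    'logistic_regression': LogisticRegression(),
    'naive_bayes': MultinomialNB(),
}

# Mean cross-validated accuracy per candidate (2 folds because the data is tiny)
mean_scores = {name: cross_val_score(model, X_toy, toy_labels, cv=2).mean()
               for name, model in candidates.items()}

for name, score in mean_scores.items():
    print(f"{name}: {score:.2f}")
```

On the real data, the same loop would use `X_train_tfidf`, `y_train` and `cv=5`, and each shortlisted classifier could then get its own grid search.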

## Final model

Now that we have finalised the model, let’s put the data transformation step as well as the model in a pipeline:

In [14]:
pipe = Pipeline([('vectoriser', vectoriser),
                 ('classifier', search.best_estimator_)])

pipe.fit(X_train, y_train)

Pipeline(steps=[('vectoriser',
                 TfidfVectorizer(analyzer=<function preprocess_text at 0x7f7f9c640050>)),
                ('classifier',
                 SGDClassifier(fit_intercept=False, loss='log', penalty='l1',
                               random_state=123))])

In the code shown above, the pipeline first transforms the unstructured data to a feature matrix, then fits the model to the preprocessed data. This is an elegant way of putting together the essential steps in a single pipeline.

Let’s assess the predictive power of the model on the test set. Here, we will pass the test data to the pipeline, which will first preprocess the data then make predictions using the previously fitted model:

In [15]:
y_test_pred = pipe.predict(X_test)
print("Accuracy: %0.2f" % (accuracy_score(y_test, y_test_pred)))

print(confusion_matrix(y_test, y_test_pred))

Accuracy: 0.85
[[249  51]
 [ 37 263]]


The accuracy of the final model on unseen data is ~85%. If this test data is representative of future data, the predictive power of the model is decent.
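
Once fitted, a pipeline like this can score raw, unseen text directly, with no separate preprocessing call. The sketch below shows the pattern on a toy pipeline. The training sentences are invented and the default `TfidfVectorizer` stands in for the custom preprocessing, so the predictions themselves are not meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Invented training sentences, only to make the example self-contained
train_docs = ["great film loved it", "wonderful acting and great plot",
              "terrible film hated it", "awful acting and boring plot"]
train_labels = ['pos', 'pos', 'neg', 'neg']

toy_pipe = Pipeline([('vectoriser', TfidfVectorizer()),
                     ('classifier', SGDClassifier(random_state=123))])
toy_pipe.fit(train_docs, train_labels)

# Raw strings go in; the pipeline vectorises them with the fitted
# vocabulary and passes the result to the fitted classifier
new_docs = ["loved the great acting", "boring and awful"]
predictions = toy_pipe.predict(new_docs)

for doc, label in zip(new_docs, predictions):
    print(f"{label}: {doc}")
```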