# Exercise 5. Text Classification

## Text, Web and Social Media Analytics Lab

In this exercise, we will build multiple classification models for the newsgroup dataset. We will apply the following steps:

- Document representation with TF-IDF, Word2Vec and BERT
- Naive Bayes Classification Model
- Random Forests
- Grid Search
- BERT for Sequence Classification

We first import all the required libraries that we are going to use. 

In [None]:
import pandas as pd
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import RandomForestClassifier

In [None]:
base_directory = '/content/drive/MyDrive/Colab Notebooks/TWSM Analytics Lab/storage/'

## Part A: Document Representation with TF-IDF, Word2Vec and BERT

We load the three datasets that we had previously preprocessed. 

In [None]:
data_stem = pickle.load(open(base_directory + 'stemmed_data.p', 'rb'))
data_w2v = pickle.load(open(base_directory + 'WordtoVecModel.pkl', 'rb'))
data_bert = pickle.load(open(base_directory + 'BertModel.pkl', 'rb'))

We show the head of the stemmed dataset to check its contents.

In [None]:
data_stem.head()

Unnamed: 0,content,target,target_names
0,car wonder enlighten car saw dai door sport ca...,7,rec.autos
1,clock poll final clock report acceler clock up...,4,comp.sys.mac.hardware
2,question folk mac plu final gave ghost weekend...,4,comp.sys.mac.hardware
3,weitek robert kyanko rob rjck uucp wrote abrax...,1,comp.graphics
4,shuttl launch question articl cowcb world std ...,14,sci.space


Since the Word2Vec data only contains the document embeddings, we add the 'target' and 'target_names' columns from the stemmed dataset, because we need both columns for the classification task.

In [None]:
data_w2v = data_w2v.join(data_stem[['target','target_names']], on=data_w2v.index, how='left')

print(data_w2v.head())

          0         1         2  ...        99  target           target_names
0  0.145009  0.074307  0.062820  ... -0.051636       7              rec.autos
1 -0.328259 -0.023729  0.118994  ... -0.025866       4  comp.sys.mac.hardware
2 -0.236099  0.099917  0.135177  ... -0.233269       4  comp.sys.mac.hardware
3 -0.143364 -0.172010  0.220363  ... -0.076263       1          comp.graphics
4 -0.043743  0.041052 -0.155883  ... -0.262626      14              sci.space

[5 rows x 102 columns]


As before, we add the 'target' and 'target_names' columns to the BERT data, since it also contains only the embeddings.

In [None]:
data_bert = data_bert.join(data_stem[['target', 'target_names']], on=data_bert.index, how='left')

print(data_bert.head())

          0         1         2  ...       767  target           target_names
0 -0.127447 -0.267551 -0.952460  ... -0.155924       7              rec.autos
1 -0.128594 -0.297495 -0.960266  ... -0.225011       4  comp.sys.mac.hardware
2 -0.364291 -0.418737 -0.972758  ... -0.155387       4  comp.sys.mac.hardware
3 -0.208075 -0.404881 -0.984882  ... -0.209980       1          comp.graphics
4 -0.134173 -0.371447 -0.981081  ... -0.286874      14              sci.space

[5 rows x 770 columns]
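
Joining the labels by index only gives correct results if the index of the embedding frame refers to the same documents as the index of the stemmed frame. The following is a minimal sketch of how a silent misalignment could arise and how it could be detected. The frames `labels` and `embeddings` and their values are hypothetical and are not taken from the exercise data.

```python
import pandas as pd

# Hypothetical label frame for three documents (plays the role of data_stem)
labels = pd.DataFrame(
    {'target': [7, 4, 1],
     'target_names': ['rec.autos', 'comp.sys.mac.hardware', 'comp.graphics']},
    index=[0, 1, 2])

# Hypothetical embeddings: document 1 was dropped and the index was reset,
# so row 1 actually holds the embedding of document 2
embeddings = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]])

# The join succeeds, but row 1 receives the label of document 1
joined = embeddings.join(labels, how='left')
print(joined)

# Cheap sanity checks before trusting the joined labels
same_length = len(embeddings) == len(labels)
same_index = embeddings.index.equals(labels.index)
print(same_length, same_index)
```

If either check fails, the labels attached by the join may not belong to the embeddings in the same row.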


We decide to use only a specific subset of topics, so we remove all the rows from all the datasets that do not belong to these topics.

In [None]:
topics = ['soc.religion.christian', 'rec.sport.hockey', 'talk.politics.mideast', 'rec.motorcycles']

data_stem = data_stem[data_stem['target_names'].isin(topics)]
data_w2v = data_w2v[data_w2v['target_names'].isin(topics)]
data_bert = data_bert[data_bert['target_names'].isin(topics)]

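Since the stemmed dataset still contains the document text, we compute the TF-IDF weights for it and build a new dataframe that also holds the 'target' and 'target_names' columns. With max_df=0.7 and min_df=0.1, only terms that occur in at least 10% and at most 70% of the documents are kept, which leaves 87 features.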

In [None]:
tfidf_frequency = TfidfVectorizer(max_df=0.7, min_df=0.1).fit_transform(data_stem['content'])
data_stem2 = pd.DataFrame(tfidf_frequency.toarray()).join(data_stem[['target', 'target_names']], on=data_stem.index, how='left')

print(data_stem2.head())

          0         1    2  ...        86  target            target_names
0  0.000000  0.000000  0.0  ...  0.000000       8         rec.motorcycles
1  0.169876  0.132560  0.0  ...  0.000000      10        rec.sport.hockey
2  0.097571  0.076138  0.0  ...  0.000000      15  soc.religion.christian
3  0.181629  0.212597  0.0  ...  0.000000      17   talk.politics.mideast
4  0.000000  0.136480  0.0  ...  0.195978      10        rec.sport.hockey

[5 rows x 89 columns]
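
To illustrate how the document-frequency thresholds prune the vocabulary, here is a small sketch on a made-up corpus. The documents in `toy_docs` and the thresholds 0.5 and 0.75 are assumptions chosen so that the effect is visible on four documents; they are not the values used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; the terms are made up for illustration
toy_docs = [
    "hockey game team goal",
    "hockey team player season",
    "bike engine ride road",
    "bike ride helmet road",
]

# Keep terms that appear in at least 50% and at most 75% of the documents
vectorizer = TfidfVectorizer(min_df=0.5, max_df=0.75)
weights = vectorizer.fit_transform(toy_docs)

# Terms that occur in only one document (e.g. 'game', 'helmet') are dropped
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(weights.shape)
```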


Before training the models, we define a function that splits the data into the independent variables, which are all columns except the last two, and the target variable, which is the 'target' column. We use a test size of 20% of the whole dataset and a fixed random state so that the results can be reproduced.

In [None]:
def text_train(df):
  return train_test_split(df.iloc[:, :-2], df.target, test_size=0.20, random_state=12)
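
As a quick check of what this function returns, the same slicing can be tried on a small frame. This is a sketch under assumptions: the frame `toy` with three feature columns and ten rows is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: three feature columns followed by the two label columns
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((10, 3)))
toy['target'] = [0, 1] * 5
toy['target_names'] = ['a', 'b'] * 5

# Same slicing as text_train: every column except the last two is a feature
X_train, X_test, y_train, y_test = train_test_split(
    toy.iloc[:, :-2], toy['target'], test_size=0.20, random_state=12)

print(X_train.shape, X_test.shape)
# Features and labels keep the same row index after the split
print((X_train.index == y_train.index).all())
```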

## Part B: Naive Bayes Classification Model

First we split our TF-IDF data into training and test sets for the independent and target variables. We then define a Naive Bayes model and fit it on the training data. Finally, we predict on the test data.

In [None]:
docs_train_s, docs_test_s, y_train_s, y_test_s = text_train(data_stem2)

clf = MultinomialNB()
clf.fit(docs_train_s, y_train_s)

y_pred_s = clf.predict(docs_test_s)

We print the training score, the test accuracy and the classification report, which show that the model performs reasonably well (0.83 accuracy). According to the f1-scores, the model has the most trouble with class 17, 'talk.politics.mideast' (0.77), while the other classes score between 0.84 and 0.87.

In [None]:
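print('Training Score: {}'.format(round(clf.score(docs_train_s, y_train_s), 4)))
print('Accuracy: {}'.format(round(accuracy_score(y_pred_s, y_test_s), 4)))
# Note: classification_report expects (y_true, y_pred). With the arguments in this order,
# the precision and recall columns are swapped and 'support' counts the predicted labels.
print('Classification Report:\n{}'.format(classification_report(y_pred_s, y_test_s)))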

Training Score: 0.8528
Accuracy: 0.833
Classification Report:
              precision    recall  f1-score   support

           8       0.87      0.83      0.85       140
          10       0.86      0.87      0.87       117
          15       0.87      0.81      0.84       113
          17       0.73      0.82      0.77       103

    accuracy                           0.83       473
   macro avg       0.83      0.83      0.83       473
weighted avg       0.84      0.83      0.83       473
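
The scikit-learn metric functions take the true labels first and the predictions second. The reports in this exercise pass the predictions first, which leaves accuracy and f1-score unchanged but exchanges precision and recall. A minimal sketch with made-up labels (`y_true` and `y_pred` are hypothetical) shows the effect:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels for six documents
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]

# Documented order: (y_true, y_pred)
precision = precision_score(y_true, y_pred, pos_label=0)
recall = recall_score(y_true, y_pred, pos_label=0)

# Swapped order: (y_pred, y_true)
precision_swapped = precision_score(y_pred, y_true, pos_label=0)
recall_swapped = recall_score(y_pred, y_true, pos_label=0)

print(precision, recall)
print(precision_swapped, recall_swapped)
```

With the swapped order, the value reported as precision is the recall of the documented order, and vice versa.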



## Part C: Random Forests

We now define a Random Forest classifier and fit it on the same data that we used for the previous model. We then predict on our test data and print the classification report, which shows that this model performs slightly better than Naive Bayes: the accuracy rises from 0.83 to 0.86 and all f1-scores are at least 0.85. Note, however, that the training score (0.99) is much higher than the test accuracy, which suggests that the model is overfitting the training data.

In [None]:
clf2 = RandomForestClassifier(random_state=42)
clf2.fit(docs_train_s, y_train_s)

y_pred_s = clf2.predict(docs_test_s)

print('Training Score: {}'.format(round(clf2.score(docs_train_s, y_train_s), 4)))
print(classification_report(y_pred_s, y_test_s))

Training Score: 0.9936
              precision    recall  f1-score   support

           8       0.86      0.92      0.89       125
          10       0.89      0.82      0.85       128
          15       0.85      0.84      0.85       107
          17       0.86      0.88      0.87       113

    accuracy                           0.86       473
   macro avg       0.86      0.86      0.86       473
weighted avg       0.87      0.86      0.86       473
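
The gap between training score and test accuracy can be explored on synthetic data. The sketch below is only an illustration under assumptions: it uses `make_classification` data with label noise instead of the newsgroup features, and compares the default forest with one that uses `min_samples_leaf=10`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic four-class data with some label noise (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_classes=4, n_clusters_per_class=1,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12)

scores = {}
for name, leaf in [('default', 1), ('min_samples_leaf=10', 10)]:
    forest = RandomForestClassifier(min_samples_leaf=leaf, random_state=42)
    forest.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each setting
    scores[name] = (forest.score(X_train, y_train), forest.score(X_test, y_test))

for name, (train_acc, test_acc) in scores.items():
    print('{}: train={:.3f} test={:.3f}'.format(name, train_acc, test_acc))
```

Fully grown trees can memorize the training set, so a training score close to 1.0 says little about how the model generalizes.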



Next we train another Random Forest classifier, this time on a different dataset. We split the Word2Vec data into training and test sets, fit the model and predict. The classification report shows that the model performs poorly, with an accuracy of 0.25, which is what random guessing would achieve on four classes. The training score is again very high (0.998), which means that the classifier overfitted. The poor performance on the test set might be caused by the larger number of features and the value ranges of the Word2Vec embeddings. Since the result is at chance level and the test set has 472 rows instead of 473, a misalignment between the embeddings and the labels added by the join is another possible cause that is worth checking.

In [None]:
docs_train_w, docs_test_w, y_train_w, y_test_w = text_train(data_w2v)

clf3 = RandomForestClassifier(random_state=42)
clf3.fit(docs_train_w, y_train_w)

y_pred_w = clf3.predict(docs_test_w)

print('Training Score: {}'.format(round(clf3.score(docs_train_w, y_train_w), 4)))
print(classification_report(y_pred_w, y_test_w))

Training Score: 0.9984
              precision    recall  f1-score   support

           8       0.26      0.32      0.29       115
          10       0.27      0.21      0.24       140
          15       0.27      0.21      0.23       135
          17       0.19      0.27      0.23        82

    accuracy                           0.25       472
   macro avg       0.25      0.25      0.25       472
weighted avg       0.25      0.25      0.25       472



As before, we define a new Random Forest classifier and train it on another dataset. This time we split the BERT dataset into the respective sets, train our model and predict on our test data. The classification report shows that the model performs better than the Word2Vec one (0.67 accuracy), but still worse than the first Random Forest classifier. The training score is 100%, which means that this model also overfitted; however, the test score is higher than the previous one, which might reflect how informative the BERT embeddings are.

In [None]:
docs_train_b, docs_test_b, y_train_b, y_test_b = text_train(data_bert)

clf4 = RandomForestClassifier(random_state=42)
clf4.fit(docs_train_b, y_train_b)

y_pred_b = clf4.predict(docs_test_b)

print('Training Score: {}'.format(round(clf4.score(docs_train_b, y_train_b), 4)))
print(classification_report(y_pred_b, y_test_b))

Training Score: 1.0
              precision    recall  f1-score   support

           8       0.70      0.69      0.70       136
          10       0.69      0.64      0.66       126
          15       0.75      0.65      0.70       124
          17       0.55      0.72      0.62        87

    accuracy                           0.67       473
   macro avg       0.67      0.68      0.67       473
weighted avg       0.68      0.67      0.67       473



## Part D: Grid Search

Most machine learning models have hyperparameters that can be tuned to improve performance. One way to do this is a grid search: we define a set of candidate values, train one model per combination, and compare the cross-validated performance to select the best-performing combination.

To do this, we first define a parameter grid with a couple of values for the 'min_samples_leaf' and 'n_estimators' parameters. We then initialize a new Random Forest Classifier, as well as a GridSearchCV object, to which we pass the estimator and the parameter grid in order to train the models.

In [None]:
param_grid = {'min_samples_leaf': [5, 10], 'n_estimators': [3, 5]}

rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=10)
grid_search.fit(docs_train_s, y_train_s)

GridSearchCV(cv=10, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False, random_state=42,
                                 

We can check what value combination of the parameters performed the best.

In [None]:
best_params = grid_search.best_params_

print('Best Parameters:\n{}'.format(best_params))

Best Parameters:
{'min_samples_leaf': 5, 'n_estimators': 5}
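
Besides `best_params_`, the fitted search object stores the cross-validated score of every combination in `cv_results_`. The following sketch shows how these could be inspected. It runs on synthetic `make_classification` data with `cv=5`, both of which are assumptions made to keep the example self-contained.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the TF-IDF features (an assumption)
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           random_state=0)

param_grid = {'min_samples_leaf': [5, 10], 'n_estimators': [3, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

# One row per parameter combination, ordered from best to worst
columns = ['param_min_samples_leaf', 'param_n_estimators',
           'mean_test_score', 'rank_test_score']
results = pd.DataFrame(search.cv_results_)[columns]
print(results.sort_values('rank_test_score'))
```

Looking at the full table shows how close the combinations are to each other, which `best_params_` alone does not reveal.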


We retrieve the best estimator, which is the model trained with the parameters above, and predict on our test data. The classification report shows that the model performs well (0.82 accuracy), but not quite as well as the Naive Bayes (0.83) and the first Random Forest (0.86) models. However, the training score (0.89) and the test accuracy are now much closer together, which suggests that overfitting is reduced with these parameters.

In [None]:
best_model = grid_search.best_estimator_
y_pred_bm = best_model.predict(docs_test_s)

print('Training Score: {}'.format(round(best_model.score(docs_train_s, y_train_s), 4)))
print(classification_report(y_pred_bm, y_test_s))

Training Score: 0.8893
              precision    recall  f1-score   support

           8       0.83      0.85      0.84       131
          10       0.88      0.82      0.85       127
          15       0.78      0.82      0.80       101
          17       0.79      0.80      0.79       114

    accuracy                           0.82       473
   macro avg       0.82      0.82      0.82       473
weighted avg       0.82      0.82      0.82       473



For comparison, we print the classification report of the first Random Forest classifier again. It performed slightly better than the tuned model (0.86 vs. 0.82 accuracy), but overfitted the training set more strongly.

In [None]:
print('First Random Forest Results')
print('Training Score: {}'.format(round(clf2.score(docs_train_s, y_train_s), 4)))
print(classification_report(y_pred_s, y_test_s))

First Random Forest Results
Training Score: 0.9936
              precision    recall  f1-score   support

           8       0.86      0.92      0.89       125
          10       0.89      0.82      0.85       128
          15       0.85      0.84      0.85       107
          17       0.86      0.88      0.87       113

    accuracy                           0.86       473
   macro avg       0.86      0.86      0.86       473
weighted avg       0.87      0.86      0.86       473



## Part E: BERT Model

In order to train a BERT model, we have to make sure to have the transformers package, which we install here. 

In [None]:
!pip install transformers



We import some additional libraries that we will be using to train our next model.

In [None]:
import numpy as np
import random
import torch
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler
from torch.nn.functional import softmax
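# Note: AdamW was deprecated in later transformers releases in favor of torch.optim.AdamW
from transformers import BertTokenizer, BertForSequenceClassification, AdamW, BertConfig
# Note: newer Keras releases expose this function as keras.utils.pad_sequences
from keras.preprocessing.sequence import pad_sequences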

We check if we have an available GPU, which we then tell PyTorch to use. 

In [None]:
if torch.cuda.is_available():
    device = torch.device('cuda')
    print('There are %d GPU(s) available.' %torch.cuda.device_count())
    print('We will use the GPU: ', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device('cpu')

There are 1 GPU(s) available.
We will use the GPU:  Tesla T4


We first do some more preprocessing. Most of it was done in the previous exercise, but we need to make a few changes to the data for our BERT model.

We load our lemmatized data from the pickle file that we saved before. We then filter by the four topics that we chose previously and remap the target values so that they range from 0 to 3 for a four-class classification task. We then print the head of our dataframe to check that the filtering and the remapping were done correctly.

In [None]:
lemmatized_data = pickle.load(open(base_directory + 'lemmatized_data.p', 'rb'))

topics = {'soc.religion.christian':0, 'rec.sport.hockey':1, 'talk.politics.mideast':2, 'rec.motorcycles':3}
# Copy the filtered rows so that the assignment below does not act on a view of the original frame
lemmatized_data = lemmatized_data[lemmatized_data['target_names'].isin(topics.keys())].copy()
lemmatized_data['target'] = lemmatized_data['target_names'].map(topics)

lemmatized_data.head()

Unnamed: 0,content,target,target_names
10,recommendation duc worth ducati gts line ducat...,3,rec.motorcycles
21,nhl team captain article apr samba oit unc edu...,1,rec.sport.hockey
28,pantheism environmentalism article apr athos r...,0,soc.religion.christian
33,israeli expansion lust article spam math adela...,2,talk.politics.mideast
35,goalie masks article netnews upenn edu kkeller...,1,rec.sport.hockey


We add the '[CLS]' and '[SEP]' markers at the beginning and at the end of each document respectively, since BERT needs them. We then tokenize each document and print the first one to check how it looks.

In [None]:
sentences = ['[CLS] ' + query + ' [SEP]' for query in lemmatized_data['content']]

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized_texts = [tokenizer.tokenize(sentence) for sentence in sentences]

print(tokenized_texts[0])

['[CLS]', 'recommendation', 'duc', 'worth', 'duc', '##ati', 'gt', '##s', 'line', 'duc', '##ati', 'gt', '##s', 'model', 'clock', 'run', 'paint', 'bronze', 'brown', 'orange', 'fade', 'leak', 'bit', 'oil', 'pop', 'hard', 'acc', '##el', 'shop', 'fix', 'trans', 'oil', 'leak', 'sell', 'bike', 'owner', 'want', 'think', 'like', 'opinion', 'email', 'thank', 'nice', 'stable', 'mate', 'bee', '##mer', 'ja', '##p', 'bike', 'axis', 'motors', 'tuba', 'irwin', 'hon', '##k', 'com', '##put', '##rac', 'richardson', 'irwin', 'cm', '##pt', '##rc', 'lone', '##star', 'org', 'dod', '[SEP]']


We now pad and truncate each of our documents. We set the 'maxlen' parameter to 512, since that is the maximum sequence length BERT accepts, use '[PAD]' as the padding value, and specify that padding and truncating are performed 'post', which means at the end of the token sequence. We then print the first document after the transformation to see what it looks like, but we only show the first 77 tokens, since the rest only contains the '[PAD]' marker.

In [None]:
sentences_padded = pad_sequences(tokenized_texts, dtype=object, maxlen=512, value='[PAD]', truncating='post', padding='post')

print(sentences_padded[0][:77])

['[CLS]' 'recommendation' 'duc' 'worth' 'duc' '##ati' 'gt' '##s' 'line'
 'duc' '##ati' 'gt' '##s' 'model' 'clock' 'run' 'paint' 'bronze' 'brown'
 'orange' 'fade' 'leak' 'bit' 'oil' 'pop' 'hard' 'acc' '##el' 'shop' 'fix'
 'trans' 'oil' 'leak' 'sell' 'bike' 'owner' 'want' 'think' 'like'
 'opinion' 'email' 'thank' 'nice' 'stable' 'mate' 'bee' '##mer' 'ja' '##p'
 'bike' 'axis' 'motors' 'tuba' 'irwin' 'hon' '##k' 'com' '##put' '##rac'
 'richardson' 'irwin' 'cm' '##pt' '##rc' 'lone' '##star' 'org' 'dod'
 '[SEP]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]']


Now we convert each token to its id. We print the first 77 ids of the first document, where we can identify the ids of the special markers: 101 for '[CLS]', 102 for '[SEP]' and 0 for '[PAD]'.

In [None]:
sentences_converted = [tokenizer.convert_tokens_to_ids(token) for token in sentences_padded]

print(sentences_converted[0][:77])

[101, 12832, 26363, 4276, 26363, 10450, 14181, 2015, 2240, 26363, 10450, 14181, 2015, 2944, 5119, 2448, 6773, 4421, 2829, 4589, 12985, 17271, 2978, 3514, 3769, 2524, 16222, 2884, 4497, 8081, 9099, 3514, 17271, 5271, 7997, 3954, 2215, 2228, 2066, 5448, 10373, 4067, 3835, 6540, 6775, 10506, 5017, 14855, 2361, 7997, 8123, 9693, 29242, 17514, 10189, 2243, 4012, 18780, 22648, 9482, 17514, 4642, 13876, 11890, 10459, 14117, 8917, 26489, 102, 0, 0, 0, 0, 0, 0, 0, 0]


We also create an attention mask for each document, which holds a one where the token id is greater than zero and a zero otherwise. Since '[PAD]' has the id 0, all padding positions become zeros and every other token becomes a one. We print the same slice as before to check that this is really the case.

In [None]:
masks = []

for seq in sentences_converted:
  seq_mask = [int(i>0) for i in seq]
  masks.append(seq_mask)
    
print(masks[0][:77])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
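
The padding and masking steps above can also be expressed directly on the id sequences. The helper `pad_and_mask` below is a hypothetical sketch and not part of the exercise code; it assumes that the padding id is 0, as seen in the output above, and mirrors the 'post' padding and truncating.

```python
PAD_ID = 0  # id of '[PAD]', as seen in the converted output above


def pad_and_mask(ids, max_len, pad_id=PAD_ID):
    """Truncate or right-pad a list of token ids and build its attention mask."""
    ids = ids[:max_len]              # 'post' truncation: drop tokens at the end
    n_pad = max_len - len(ids)       # number of padding positions to append
    mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, mask


# Short sequence: two padding positions are appended
padded, mask = pad_and_mask([101, 12832, 4276, 102], max_len=6)
print(padded, mask)

# Long sequence: 'post' truncation also cuts off the final '[SEP]' id (102)
truncated, truncated_mask = pad_and_mask([101, 12832, 26363, 4276, 102], max_len=3)
print(truncated, truncated_mask)
```

The second call shows a side effect of 'post' truncation: documents longer than the maximum length lose their closing '[SEP]' marker.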


Now that our data is preprocessed, we split it into training and test sets. We split the ids, the masks and the labels with the same test size and random state that we used for the previous models, so that the results are comparable. We then print the lengths of the training and test sets to make sure that all the data was split correctly.

In [None]:
inputs_train, inputs_test, labels_train, labels_test = train_test_split(sentences_converted, lemmatized_data['target'], 
                                                                        test_size=0.2, random_state=12)
masks_train, masks_test = train_test_split(masks, test_size=0.2, random_state=12)

print('Train-Test Lengths\nInputs: {} - {}\nMasks: {} - {}\nLabels: {} - {}'.format(len(inputs_train), 
            len(inputs_test), len(masks_train), len(masks_test), len(labels_train), len(labels_test)))

Train-Test Lengths
Inputs: 1888 - 473
Masks: 1888 - 473
Labels: 1888 - 473
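
Two separate calls with the same `random_state` produce the same partition as long as the inputs have the same length. An alternative that makes the alignment explicit is to pass all arrays to a single `train_test_split` call. The sketch below uses made-up ids, masks and labels; all values are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: ten documents with an identifying token id in position 1
docs = list(range(10))
ids = [[101, 1000 + d, 102] + [0] * (d % 3) for d in docs]
masks = [[int(t > 0) for t in seq] for seq in ids]
labels = [d % 4 for d in docs]

# One call splits all three inputs with the same row selection
ids_tr, ids_te, masks_tr, masks_te, y_tr, y_te = train_test_split(
    ids, masks, labels, test_size=0.2, random_state=12)

print(len(ids_tr), len(ids_te))
# Every mask still belongs to the id sequence in the same position
print(all(m == [int(t > 0) for t in seq] for seq, m in zip(ids_tr, masks_tr)))
```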


We now convert all of our dataset variables into torch tensors. 

In [None]:
inputs_train = torch.LongTensor(inputs_train)
inputs_test = torch.LongTensor(inputs_test)

labels_train = torch.tensor(labels_train.values)
labels_test = torch.tensor(labels_test.values)

masks_train = torch.LongTensor(masks_train)
masks_test = torch.LongTensor(masks_test)

We define the batch size that we want to use for training. We then create a TensorDataset from our training ids, masks and labels, a SequentialSampler over this dataset, and a DataLoader that yields the training batches.

In [None]:
batch_size = 8

train_data = TensorDataset(inputs_train, masks_train, labels_train)
train_sampler = SequentialSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

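Before we start the training, we have to change the BERT model configuration, since by default it is configured for binary classification. In our case, we have four classes, so that is the number of labels we have to set. We then create a new BertForSequenceClassification model with this configuration and send it to our GPU. Note that constructing the model from a configuration object initializes its weights randomly; to start from the pretrained weights, one would instead call BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4). We also define a new AdamW optimizer for our model.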

In [None]:
config = BertConfig.from_pretrained('bert-base-uncased')
config.num_labels = 4

model = BertForSequenceClassification(config)
model.to(device)

optimizer = AdamW(model.parameters(), lr=1e-5)

We first set the model to training mode. We then train the model for five epochs, looping through each batch of data from the data loader. Inside the loop, we take the ids, masks and labels of the batch and send them to our GPU. We then set the gradients to zero and run the forward pass on the model. From the outputs we get the loss, on which we perform the backward propagation, and finally the optimizer updates the model parameters.

In [None]:
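model.train()

for epoch in range(5):
    for batch in train_dataloader:
        # Move the ids, masks and labels of the batch to the selected device
        b_input_ids = batch[0].to(device)
        b_input_masks = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Reset the gradients accumulated in the previous step
        model.zero_grad()

        # Forward pass; passing the labels makes the model return the loss
        outputs = model(b_input_ids, attention_mask=b_input_masks, labels=b_labels)
        loss = outputs.loss

        # Backward pass and parameter update
        loss.backward()
        optimizer.step()

print('Training Done...')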

Training Done...


As with the training data, we create a TensorDataset from our test ids, masks and labels, as well as a SequentialSampler and a DataLoader.

In [None]:
test_data = TensorDataset(inputs_test, masks_test, labels_test)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)

We now set the model to evaluation mode. We create an empty list for our batch predictions and tell torch not to track gradients, since we are testing and no longer training. For each batch in the test data loader, we take only the ids and the masks and pass them to our model for a forward pass. We pass the logits from the outputs to a softmax function to get the probabilities for each class. We then take the class with the highest probability as our predicted class, move it from the GPU to the CPU, and append it to our batch predictions.

In [None]:
model.eval()

batch_predictions = []

with torch.no_grad():
  
  for batch in test_dataloader:

    b_input_ids = batch[0].to(device)
    b_input_masks = batch[1].to(device) 

    outputs = model(b_input_ids, attention_mask=b_input_masks)

    probs = softmax(outputs.logits, dim=1)
    predicted = torch.max(probs, 1).indices
    predicted_detached = predicted.detach().cpu().numpy()
    batch_predictions.append(predicted_detached)
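
Because softmax preserves the ordering of the logits, the predicted class is the same whether the argmax is taken over the logits or over the probabilities. A small NumPy sketch with made-up logits illustrates this; the `softmax` function here is a plain reimplementation for illustration and the values in `logits` are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Hypothetical logits for two documents and four classes
logits = np.array([[2.0, -1.0, 0.5, 0.1],
                   [-0.3, 0.2, 3.1, 0.0]])
probs = softmax(logits)

pred_from_probs = probs.argmax(axis=1)
pred_from_logits = logits.argmax(axis=1)
print(pred_from_probs, pred_from_logits)
```

The softmax step is therefore only needed when the probabilities themselves are of interest.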

We then loop through each batch and append each class value to a flat list. We then print the length to make sure everything was executed correctly and indeed it matches with the test size. 

In [None]:
predictions = []

for batch in batch_predictions:
  for value in batch:
    predictions.append(value)

print(len(predictions))

473
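
The nested loop above can also be written as a single NumPy call. This is a sketch with made-up batch predictions; the arrays in `batch_predictions` are hypothetical, and the last one is shorter, like a final incomplete batch.

```python
import numpy as np

# Hypothetical per-batch predictions; the last batch is smaller than the others
batch_predictions = [np.array([3, 1, 0]), np.array([2, 2, 1]), np.array([0])]

# Join all batches into one flat array, keeping the original order
predictions = np.concatenate(batch_predictions)
print(predictions, len(predictions))
```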


Finally, we print the classification report, which shows that the BERT model outperforms every other model that we tried before, and by a surprisingly large margin, with a 98% accuracy.

In [None]:
print(classification_report(labels_test.numpy(), np.array(predictions)))

              precision    recall  f1-score   support

           0       0.96      0.99      0.98       106
           1       1.00      0.95      0.97       118
           2       1.00      0.97      0.99       115
           3       0.95      0.99      0.97       134

    accuracy                           0.98       473
   macro avg       0.98      0.98      0.98       473
weighted avg       0.98      0.98      0.98       473

