<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.4: Text Classification

In this lab you will implement different types of feature engineering for text classification:
* Count vectors
* TF-IDF vectors (word level, n-gram level, character level)
* Text/NLP based features
* Topic models
  
The following classification algorithms will be applied to the count and TF-IDF vector features:
* Naïve Bayes
* Logistic Regression
* Support Vector Machine
* Random Forest
* Gradient Boosting

## Import libraries

In [1]:
## Import Libraries
import numpy as np
import pandas as pd

import string
import spacy

from collections import Counter

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# import warnings
# warnings.filterwarnings('ignore')

## Load data

Sample:

    __label__2 Stuning even for the non-gamer: This sound ...
    __label__2 The best soundtrack ever to anything.: I'm ...
    __label__2 Amazing!: This soundtrack is my favorite m ...
    __label__2 Excellent Soundtrack: I truly like this so ...
    __label__2 Remember, Pull Your Jaw Off The Floor Afte ...
    __label__2 an absolute masterpiece: I am quite sure a ...
    __label__1 Buyer beware: This is a self-published boo ...
    . . .
    
There are only two **labels**:
- `__label__1`
- `__label__2`

In [2]:
## Loading the data

df_corpus = pd.read_fwf(
    filepath_or_buffer = 'corpus.txt',
    colspecs = [(9, 10),   # label: get only the numbers 1 or 2
                (11, 9000) # text: make the range wide enough to reach the end of the line
               ],
    header = 0,
    names = ['label', 'text'],
    lineterminator = '\n'
)

# convert label from [1, 2] to [0, 1]
df_corpus['label'] = df_corpus['label'] - 1
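
The fixed-width read above relies on the label digit always sitting at character position 9. As an alternative, here is a minimal sketch that splits each line on its first space instead. The helper name `parse_line` and the two sample lines are made up for illustration; they are not taken from `corpus.txt`.

```python
def parse_line(line):
    """Split '__label__N some text' into (N - 1, 'some text')."""
    # everything before the first space is the tag, the rest is the review text
    tag, _, text = line.rstrip("\n").partition(" ")
    label = int(tag.replace("__label__", ""))
    # shift labels from [1, 2] to [0, 1], as in the cell above
    return label - 1, text


# hypothetical sample lines in the same format as the corpus
sample_lines = [
    "__label__2 Great album: loved every track\n",
    "__label__1 Not good: broke after a week\n",
]

parsed = [parse_line(line) for line in sample_lines]
print(parsed)
```

A list of `(label, text)` tuples like this could be passed to `pd.DataFrame(parsed, columns=['label', 'text'])` to obtain the same two columns.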

## Inspect the data

In [3]:
# ANSWER
print(df_corpus.info())
print(df_corpus.sample(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   9999 non-null   int64 
 1   text    9999 non-null   object
dtypes: int64(1), object(1)
memory usage: 156.4+ KB
None
      label                                               text
5080      1  Great Science Fiction: I read books for "escap...
5790      1  All-Star Lineup: This is a very good recording...
6326      0  Not what she wanted: Bought this for my wife o...
628       1  I love It!: it does what it is supposed to do ...
8598      0  In-line Writing Style: This is the worst progr...
3033      0  Fun movie, but not worth buying at this price-...
7637      0  I'm Baffled: I've tried a couple of times to r...
2572      0  Buy a different jujitsu book: I bought it befo...
1431      1  Amazingly Beautiful: I own over 800 CD's and t...
8657      1  Great: A good one... and a few more words to f...


## Split the data into train and test

In [4]:
## ANSWER
## split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    df_corpus['text'],
    df_corpus['label'],
    test_size = 0.2,
    random_state = 42
)
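
As a quick check on the split, the arithmetic below works out how many rows land in each set. It is a sketch that assumes the test share is rounded up to a whole number of rows, which is how `train_test_split` is understood to treat a fractional `test_size`.

```python
import math

n_samples = 9999   # rows reported by df_corpus.info() above
test_size = 0.2

# assumption: the test share is rounded up, the remainder goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)   # 7999 2000
print(1 / n_test)        # smallest possible change in test accuracy: 0.0005
```

With 2000 test rows, every accuracy reported later in this lab is a multiple of 0.0005.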

## Feature Engineering

### Count Vectors as features

In [5]:
# create a count vectorizer object
count_vect = CountVectorizer(token_pattern = r'\w{1,}')

# Learn a vocabulary dictionary of all tokens in the raw documents
count_vect.fit(X_train)

# Transform documents to document-term matrix.
X_train_count = count_vect.transform(X_train)
X_test_count = count_vect.transform(X_test)
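
To make the document-term matrix concrete, here is a minimal pure-Python sketch of what a count vectorizer produces. The two documents are made up, and the code only mimics the behaviour (lowercasing, the `\w{1,}` token pattern, an alphabetically sorted vocabulary); it is not scikit-learn's implementation.

```python
import re
from collections import Counter

# made-up documents for illustration
docs = ["Great sound, great price", "Bad sound"]

# same token pattern as in the cell above: runs of one or more word characters
token_re = re.compile(r"\w{1,}")
tokenised = [token_re.findall(doc.lower()) for doc in docs]

# vocabulary: every distinct token, in sorted order
vocab = sorted({tok for toks in tokenised for tok in toks})

# one row per document, one column per vocabulary term
matrix = [[Counter(toks)[term] for term in vocab] for toks in tokenised]

print(vocab)    # ['bad', 'great', 'price', 'sound']
print(matrix)   # [[0, 2, 1, 1], [1, 0, 0, 1]]
```

Note that `\w{1,}` keeps single-character tokens such as `a` or `1`, whereas scikit-learn's default pattern only keeps tokens of two or more characters.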

### TF-IDF Vectors as features
- Word level
- N-Gram level
- Character level
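
Before building the vectors, it may help to see how a TF-IDF weight is computed. The sketch below assumes scikit-learn's default settings (smoothed IDF, raw term counts, L2 row normalisation) and uses two made-up, already tokenised documents; the helper names `idf` and `tfidf_row` are illustrative only.

```python
import math
from collections import Counter

# made-up, already tokenised documents
docs = [["good", "sound", "good"], ["bad", "sound"]]
n_docs = len(docs)

vocab = sorted({term for doc in docs for term in doc})
# document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))


def idf(term):
    # smoothed inverse document frequency (assumed default: smooth_idf=True)
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf_row(doc):
    counts = Counter(doc)
    raw = [counts[term] * idf(term) for term in vocab]
    # L2-normalise so that every document vector has unit length
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


rows = [tfidf_row(doc) for doc in docs]
for doc, row in zip(docs, rows):
    print(doc, [round(value, 3) for value in row])
```

A term that occurs in every document (`sound` here) gets the minimum IDF of 1, so rarer terms such as `bad` receive a larger weight within the same document.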

In [6]:
%%time
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer = 'word',
                             token_pattern = r'\w{1,}',
                             max_features = 5000)
print(tfidf_vect)

tfidf_vect.fit(X_train)
X_train_tfidf = tfidf_vect.transform(X_train)
X_test_tfidf  = tfidf_vect.transform(X_test)

TfidfVectorizer(max_features=5000, token_pattern='\\w{1,}')
CPU times: user 2.09 s, sys: 26.3 ms, total: 2.11 s
Wall time: 2.93 s


In [7]:
%%time
# ngram level tf-idf
tfidf_vect_ngram = TfidfVectorizer(analyzer = 'word',
                                   token_pattern = r'\w{1,}',
                                   ngram_range = (2, 3),
                                   max_features = 5000)
print(tfidf_vect_ngram)

tfidf_vect_ngram.fit(X_train)
X_train_tfidf_ngram = tfidf_vect_ngram.transform(X_train)
X_test_tfidf_ngram  = tfidf_vect_ngram.transform(X_test)

TfidfVectorizer(max_features=5000, ngram_range=(2, 3), token_pattern='\\w{1,}')
CPU times: user 6.71 s, sys: 223 ms, total: 6.93 s
Wall time: 7.8 s


In [8]:
%%time
# character level tf-idf
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer = 'char',
                                         ngram_range = (2, 3),
                                         max_features = 5000)
print(tfidf_vect_ngram_chars)

tfidf_vect_ngram_chars.fit(X_train)
X_train_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(X_train)
X_test_tfidf_ngram_chars  = tfidf_vect_ngram_chars.transform(X_test)

TfidfVectorizer(analyzer='char', max_features=5000, ngram_range=(2, 3))
CPU times: user 7.64 s, sys: 92.8 ms, total: 7.73 s
Wall time: 7.77 s
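
The `ngram_range = (2, 3)` setting used in the last two cells can be illustrated with a short pure-Python sketch. The helpers `word_ngrams` and `char_ngrams` are hypothetical names written for this example; they only show which n-grams are generated, not how scikit-learn builds them internally.

```python
def word_ngrams(tokens, n_min, n_max):
    """All runs of n_min..n_max consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n_min, n_max):
    """All substrings of length n_min..n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


tokens = "not worth the price".split()
print(word_ngrams(tokens, 2, 3))
# ['not worth', 'worth the', 'the price', 'not worth the', 'worth the price']

print(char_ngrams("bad", 2, 3))
# ['ba', 'ad', 'bad']
```

Because the range starts at 2, single words (and single characters) are not features in these two vectorizers.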


### Text / NLP based features

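Create some additional features from the raw text:

- `char_count`: number of characters in the text
- `word_count`: number of whitespace-separated words in the text
- `word_density`: average number of characters per word, computed as `char_count / (word_count + 1)`
- `punctuation_count`: number of punctuation characters in the text
- `title_word_count`: number of title-case words in the text
- `uppercase_word_count`: number of fully uppercase words in the text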


In [9]:
%%time
# ANSWER
df_corpus['char_count'] = df_corpus['text'].apply(len)
df_corpus['word_count'] = df_corpus['text'].apply(lambda x: len(x.split()))
df_corpus['word_density'] = df_corpus['char_count'] / (df_corpus['word_count'] + 1)
df_corpus['punctuation_count'] = df_corpus['text'].apply(lambda x: sum(ch in string.punctuation for ch in x))
df_corpus['title_word_count'] = df_corpus['text'].apply(lambda x: len([w for w in x.split() if w.istitle()]))
df_corpus['uppercase_word_count'] = df_corpus['text'].apply(lambda x: len([w for w in x.split() if w.isupper()]))

CPU times: user 425 ms, sys: 2.11 ms, total: 428 ms
Wall time: 428 ms
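
To see what each of these features measures, here is a small sketch that computes them for a single string. The function name `text_features` and the example sentence are made up for illustration; the formulas mirror the cell above.

```python
import string


def text_features(text):
    words = text.split()
    char_count = len(text)
    word_count = len(words)
    return {
        "char_count": char_count,
        "word_count": word_count,
        # the + 1 avoids division by zero for empty texts
        "word_density": char_count / (word_count + 1),
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "title_word_count": sum(w.istitle() for w in words),
        "uppercase_word_count": sum(w.isupper() for w in words),
    }


features = text_features("Great CD: I LOVE it!")
print(features)
```

For this sentence the title-case words are `Great` and `I`, and the uppercase words are `CD:`, `I` and `LOVE`, so a one-letter word such as `I` is counted by both features.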


In [10]:
## load spaCy
nlp = spacy.load('en_core_web_sm')

Part-of-speech tags in **spaCy**

    POS   DESCRIPTION               EXAMPLES
    ----- ------------------------- ---------------------------------------------
    ADJ   adjective                 big, old, green, incomprehensible, first
    ADP   adposition                in, to, during
    ADV   adverb                    very, tomorrow, down, where, there
    AUX   auxiliary                 is, has (done), will (do), should (do)
    CONJ  conjunction               and, or, but
    CCONJ coordinating conjunction  and, or, but
    DET   determiner                a, an, the
    INTJ  interjection              psst, ouch, bravo, hello
    NOUN  noun                      girl, cat, tree, air, beauty
    NUM   numeral                   1, 2017, one, seventy-seven, IV, MMXIV
    PART  particle                  's, not,
    PRON  pronoun                   I, you, he, she, myself, themselves, somebody
    PROPN proper noun               Mary, John, London, NATO, HBO
    PUNCT punctuation               ., (, ), ?
    SCONJ subordinating conjunction if, while, that
    SYM   symbol                    $, %, §, ©, +, −, ×, ÷, =, :), 😝
    VERB  verb                      run, runs, running, eat, ate, eating
    X     other                     sfpksdpsxmsa
    SPACE space
    
Find out the number of Adjectives, Adverbs, Nouns, Numerals, Pronouns, Proper Nouns, Verbs.

    Hint:
    1. Convert text to spacy document
    2. Use pos_
    3. Use Counter

In [11]:
# Initialise columns for the part-of-speech counts
df_corpus['adj_count'] = 0
df_corpus['adv_count'] = 0
df_corpus['noun_count'] = 0
df_corpus['num_count'] = 0
df_corpus['pron_count'] = 0
df_corpus['propn_count'] = 0
df_corpus['verb_count'] = 0

In [12]:
%%time
# ANSWER
# for each text
for i in range(df_corpus.shape[0]):
  # convert into a spaCy document
  doc = nlp(df_corpus.iloc[i]['text'])
  # count how often each part-of-speech tag occurs in the document
  c = Counter(t.pos_ for t in doc)

  df_corpus.at[i, 'adj_count'] = c['ADJ']
  df_corpus.at[i, 'adv_count'] = c['ADV']
  df_corpus.at[i, 'noun_count'] = c['NOUN']
  df_corpus.at[i, 'num_count'] = c['NUM']
  df_corpus.at[i, 'pron_count'] = c['PRON']
  df_corpus.at[i, 'propn_count'] = c['PROPN']
  df_corpus.at[i, 'verb_count'] = c['VERB']

CPU times: user 4min 1s, sys: 531 ms, total: 4min 2s
Wall time: 4min 4s
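
The counting step in the loop above can be tried without loading spaCy. In the sketch below the `(token, tag)` pairs are assigned by hand and stand in for `(t.text, t.pos_)`; they are not actual spaCy output.

```python
from collections import Counter

# hand-assigned tags for illustration only, not produced by spaCy
tagged = [("This", "PRON"), ("is", "AUX"), ("a", "DET"), ("great", "ADJ"),
          ("album", "NOUN"), ("by", "ADP"), ("Queen", "PROPN"), (".", "PUNCT")]

counts = Counter(pos for _, pos in tagged)

# a Counter returns 0 for tags that never occurred, so no key checks are needed
wanted = ["ADJ", "ADV", "NOUN", "NUM", "PRON", "PROPN", "VERB"]
row = {f"{pos.lower()}_count": counts[pos] for pos in wanted}

print(row)
```

On the full corpus most of the four minutes is spent inside `nlp(...)`. spaCy's `nlp.pipe` accepts an iterable of texts and processes them in batches, which may reduce that time.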


In [13]:
cols = [
    'char_count', 'word_count', 'word_density',
    'punctuation_count', 'title_word_count',
    'uppercase_word_count', 'adj_count',
    'adv_count', 'noun_count', 'num_count',
    'pron_count', 'propn_count', 'verb_count']

df_corpus[cols].sample(5)

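| | char_count | word_count | word_density | punctuation_count | title_word_count | uppercase_word_count | adj_count | adv_count | noun_count | num_count | pron_count | propn_count | verb_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2295 | 439 | 78 | 5.556962 | 13 | 8 | 0 | 8 | 9 | 12 | 3 | 11 | 3 | 13 |
| 4735 | 339 | 59 | 5.65 | 9 | 4 | 3 | 8 | 4 | 14 | 1 | 7 | 1 | 7 |
| 6116 | 452 | 84 | 5.317647 | 8 | 10 | 7 | 5 | 6 | 13 | 0 | 16 | 1 | 17 |
| 1516 | 258 | 48 | 5.265306 | 7 | 8 | 0 | 7 | 4 | 6 | 0 | 6 | 3 | 8 |
| 7426 | 159 | 30 | 5.129032 | 9 | 4 | 0 | 5 | 2 | 5 | 3 | 0 | 3 | 2 |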


### Topic Models as features

In [14]:
%%time
# train an LDA model
lda_model = LatentDirichletAllocation(n_components = 20, learning_method = 'online', max_iter = 20)

X_topics = lda_model.fit_transform(X_train_count)
topic_word = lda_model.components_
vocab = count_vect.get_feature_names_out()

CPU times: user 59.6 s, sys: 160 ms, total: 59.7 s
Wall time: 1min


In [15]:
# view the topic models
n_top_words = 10
topic_summaries = []
print('Group Top Words')
print('-----', '-'*80)
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    top_words = ' '.join(topic_words)
    topic_summaries.append(top_words)
    print('  %3d %s' % (i, top_words))

Group Top Words
----- --------------------------------------------------------------------------------
    0 scooter goodie breathe steer dancers albert serum hollies poets controller
    1 camera works use software with price unit canon player my
    2 apple skin japanese soundtrack g4 eye lola opening powerbook exam
    3 cd music album game are songs they song great on
    4 his he book manson hollywood page diane lane theory bible
    5 sings titan astrology anthology boiling shades marie yea curie eyed
    6 y en que los del un mas lo una si
    7 movie movies watch acting special effects bad seen actors haunting
    8 l violence et philadelphia les il hide dedicated pour est
    9 u show season 1 stargate episode winston episodes 2 higgins
   10 greatest memory economics required romantic scanner error yoga touching bass
   11 buddy paris remix maker achieve ry faces mario compartment bernie
   12 pros sea piper rome latin touches celiac sensor knox perabo
   13 local mad rock vo
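
The slice `[:-(n_top_words+1):-1]` used to pick the top words is compact but easy to misread. The sketch below reproduces the same logic on plain Python lists with a made-up vocabulary and a made-up 2 x 5 topic-word matrix.

```python
# made-up vocabulary and topic-word weights for illustration
vocab = ["camera", "lens", "movie", "actor", "plot"]
topic_word = [
    [5.0, 3.0, 0.1, 0.2, 0.1],   # topic 0
    [0.1, 0.2, 4.0, 2.5, 3.0],   # topic 1
]
n_top_words = 2

top_words_per_topic = []
for i, weights in enumerate(topic_word):
    # indices sorted by ascending weight (the role np.argsort plays above)
    ascending = sorted(range(len(vocab)), key=lambda j: weights[j])
    # walk backwards from the end and stop after n_top_words items
    top = [vocab[j] for j in ascending[:-(n_top_words + 1):-1]]
    top_words_per_topic.append(top)
    print(i, " ".join(top))

# 0 camera lens
# 1 movie plot
```

The negative step reverses the ascending order, so the word with the largest weight comes first.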

## Modelling

Run the following cells to train a number of models on the count vector and TF-IDF vector feature sets generated above.
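
The cells in this section repeat the same fit-and-score step for every pairing of feature set and model. As a compact illustration of that grid, here is a sketch on a tiny made-up corpus; the texts, labels and dictionary names are assumptions for this example, and the scores it prints say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# tiny made-up corpus: 1 = positive review, 0 = negative review
train_texts = ["great album, love it", "terrible book, waste of money",
               "love this movie", "waste of time, terrible"]
train_labels = [1, 0, 1, 0]
test_texts = ["love it", "terrible waste"]
test_labels = [1, 0]

vectorizers = {"Count Vectors": CountVectorizer(),
               "WordLevel TF-IDF": TfidfVectorizer()}
classifiers = {"Naive Bayes": MultinomialNB,
               "Logistic Regression": LogisticRegression}

scores = {}
for feature_name, vectorizer in vectorizers.items():
    # learn the vocabulary on the training texts only
    X_tr = vectorizer.fit_transform(train_texts)
    X_te = vectorizer.transform(test_texts)
    for model_name, model_class in classifiers.items():
        model = model_class()          # fresh, unfitted model for each pairing
        model.fit(X_tr, train_labels)
        scores[(model_name, feature_name)] = accuracy_score(
            test_labels, model.predict(X_te))

for key, value in scores.items():
    print(key, value)
```

Keeping the scores in a dictionary keyed by `(model, feature set)` makes it straightforward to build a comparison table afterwards.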

In [16]:
## helper function

def train_model(classifier, feature_vector_train, label, feature_vector_valid):
    # fit the classifier on the training features and labels
    classifier.fit(feature_vector_train, label)

    # predict the labels for the validation (test) features
    predictions = classifier.predict(feature_vector_valid)

    # score against the global y_test created by train_test_split above
    return accuracy_score(y_test, predictions)
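
The metric returned by the helper is plain accuracy: the fraction of predictions that match the true labels. A minimal sketch with made-up labels (the function name `accuracy` is hypothetical and only mirrors what `accuracy_score` computes for this simple case):

```python
def accuracy(y_true, y_pred):
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # count positions where the prediction equals the true label
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# made-up labels for illustration
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]

print(accuracy(y_true, y_pred))   # 0.75
```

Accuracy gives the same value if the two arguments are swapped, but other metrics such as precision and recall do not, so passing the true labels first is a good habit.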

In [17]:
# Keep the results in a dataframe
results = pd.DataFrame(columns = ['Count Vectors',
                                  'WordLevel TF-IDF',
                                  'N-Gram Vectors',
                                  'CharLevel Vectors'])

### Naïve Bayes Classifier

In [18]:
%%time
# Naive Bayes on Count Vectors
accuracy1 = train_model(MultinomialNB(), X_train_count, y_train, X_test_count)
print('NB, Count Vectors    : %.4f\n' % accuracy1)

NB, Count Vectors    : 0.8520

CPU times: user 10.8 ms, sys: 989 µs, total: 11.8 ms
Wall time: 12.8 ms


In [19]:
%%time
# Naive Bayes on Word Level TF IDF Vectors
accuracy2 = train_model(MultinomialNB(), X_train_tfidf, y_train, X_test_tfidf)
print('NB, WordLevel TF-IDF : %.4f\n' % accuracy2)

NB, WordLevel TF-IDF : 0.8560

CPU times: user 10.2 ms, sys: 992 µs, total: 11.2 ms
Wall time: 10.5 ms


In [20]:
%%time
# Naive Bayes on Ngram Level TF IDF Vectors
accuracy3 = train_model(MultinomialNB(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('NB, N-Gram Vectors   : %.4f\n' % accuracy3)

NB, N-Gram Vectors   : 0.8375

CPU times: user 7.7 ms, sys: 0 ns, total: 7.7 ms
Wall time: 7.71 ms


In [21]:
%%time
# Naive Bayes on Character Level TF IDF Vectors
accuracy4 = train_model(MultinomialNB(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('NB, CharLevel Vectors: %.4f\n' % accuracy4)

NB, CharLevel Vectors: 0.8195

CPU times: user 36.6 ms, sys: 0 ns, total: 36.6 ms
Wall time: 36.3 ms


In [22]:
results.loc['Naïve Bayes'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Linear Classifier

In [23]:
%%time
# Linear Classifier on Count Vectors
accuracy1 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 350), X_train_count, y_train, X_test_count)
print('LR, Count Vectors    : %.4f\n' % accuracy1)

LR, Count Vectors    : 0.8515

CPU times: user 9.99 s, sys: 13.9 ms, total: 10 s
Wall time: 5.21 s


In [24]:
%%time
# Linear Classifier on Word Level TF IDF Vectors
accuracy2 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf, y_train, X_test_tfidf)
print('LR, WordLevel TF-IDF : %.4f\n' % accuracy2)

LR, WordLevel TF-IDF : 0.8710

CPU times: user 181 ms, sys: 996 µs, total: 182 ms
Wall time: 109 ms


In [25]:
%%time
# Linear Classifier on Ngram Level TF IDF Vectors
accuracy3 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('LR, N-Gram Vectors   : %.4f\n' % accuracy3)

LR, N-Gram Vectors   : 0.8305

CPU times: user 67.4 ms, sys: 2.98 ms, total: 70.4 ms
Wall time: 38.9 ms


In [26]:
%%time
# Linear Classifier on Character Level TF IDF Vectors
accuracy4 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('LR, CharLevel Vectors: %.4f\n' % accuracy4)

LR, CharLevel Vectors: 0.8485

CPU times: user 540 ms, sys: 0 ns, total: 540 ms
Wall time: 275 ms


In [27]:
results.loc['Logistic Regression'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Support Vector Machine

In [28]:
%%time
# Support Vector Machine on Count Vectors
accuracy1 = train_model(LinearSVC(), X_train_count, y_train, X_test_count)
print('SVM, Count Vectors    : %.4f\n' % accuracy1)

SVM, Count Vectors    : 0.8345

CPU times: user 1.14 s, sys: 6.93 ms, total: 1.15 s
Wall time: 1.15 s


In [29]:
%%time
# Support Vector Machine on Word Level TF IDF Vectors
accuracy2 = train_model(LinearSVC(), X_train_tfidf, y_train, X_test_tfidf)
print('SVM, WordLevel TF-IDF : %.4f\n' % accuracy2)

SVM, WordLevel TF-IDF : 0.8630

CPU times: user 142 ms, sys: 0 ns, total: 142 ms
Wall time: 147 ms


In [30]:
%%time
# Support Vector Machine on Ngram Level TF IDF Vectors
accuracy3 = train_model(LinearSVC(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('SVM, N-Gram Vectors   : %.4f\n' % accuracy3)

SVM, N-Gram Vectors   : 0.8135

CPU times: user 72.6 ms, sys: 0 ns, total: 72.6 ms
Wall time: 84.7 ms


In [31]:
%%time
# Support Vector Machine on Character Level TF IDF Vectors
accuracy4 = train_model(LinearSVC(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('SVM, CharLevel Vectors: %.4f\n' % accuracy4)

SVM, CharLevel Vectors: 0.8600

CPU times: user 1.19 s, sys: 43.8 ms, total: 1.24 s
Wall time: 1.3 s


In [32]:
results.loc['Support Vector Machine'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Bagging Models

In [33]:
%%time
# Bagging (Random Forest) on Count Vectors
accuracy1 = train_model(RandomForestClassifier(n_estimators = 100), X_train_count, y_train, X_test_count)
print('RF, Count Vectors    : %.4f\n' % accuracy1)

RF, Count Vectors    : 0.8265

CPU times: user 14.1 s, sys: 31.1 ms, total: 14.1 s
Wall time: 14.2 s


In [34]:
%%time
# Bagging (Random Forest) on Word Level TF IDF Vectors
accuracy2 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf, y_train, X_test_tfidf)
print('RF, WordLevel TF-IDF : %.4f\n' % accuracy2)

RF, WordLevel TF-IDF : 0.8220

CPU times: user 9.18 s, sys: 15 ms, total: 9.2 s
Wall time: 9.2 s


In [35]:
%%time
# Bagging (Random Forest) on Ngram Level TF IDF Vectors
accuracy3 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('RF, N-Gram Vectors   : %.4f\n' % accuracy3)

RF, N-Gram Vectors   : 0.7900

CPU times: user 9.64 s, sys: 13 ms, total: 9.65 s
Wall time: 9.68 s


In [36]:
%%time
# Bagging (Random Forest) on Character Level TF IDF Vectors
accuracy4 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('RF, CharLevel Vectors: %.4f\n' % accuracy4)

RF, CharLevel Vectors: 0.7870

CPU times: user 31.4 s, sys: 64.9 ms, total: 31.4 s
Wall time: 31.5 s


In [37]:
results.loc['Random Forest'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Boosting Models

In [38]:
%%time
# Gradient Boosting on Count Vectors
accuracy1 = train_model(GradientBoostingClassifier(), X_train_count, y_train, X_test_count)
print('GB, Count Vectors    : %.4f\n' % accuracy1)

GB, Count Vectors    : 0.7990

CPU times: user 8.52 s, sys: 8.98 ms, total: 8.53 s
Wall time: 8.5 s


In [39]:
%%time
# Gradient Boosting on Word Level TF IDF Vectors
accuracy2 = train_model(GradientBoostingClassifier(), X_train_tfidf, y_train, X_test_tfidf)
print('GB, WordLevel TF-IDF : %.4f\n' % accuracy2)

GB, WordLevel TF-IDF : 0.7970

CPU times: user 21.8 s, sys: 25 ms, total: 21.8 s
Wall time: 21.9 s


In [40]:
%%time
# Gradient Boosting on Ngram Level TF IDF Vectors
accuracy3 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('GB, N-Gram Vectors   : %.4f\n' % accuracy3)

GB, N-Gram Vectors   : 0.7380

CPU times: user 13.2 s, sys: 19 ms, total: 13.2 s
Wall time: 13.2 s


In [41]:
%%time
# Gradient Boosting on Character Level TF IDF Vectors
accuracy4 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('GB, CharLevel Vectors: %.4f\n' % accuracy4)

GB, CharLevel Vectors: 0.8025

CPU times: user 3min 14s, sys: 252 ms, total: 3min 14s
Wall time: 3min 15s


In [42]:
results.loc['Gradient Boosting'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

In [43]:
results

| | Count Vectors | WordLevel TF-IDF | N-Gram Vectors | CharLevel Vectors |
|---|---|---|---|---|
| Naïve Bayes | 0.852 | 0.856 | 0.8375 | 0.8195 |
| Logistic Regression | 0.8515 | 0.871 | 0.8305 | 0.8485 |
| Support Vector Machine | 0.8345 | 0.863 | 0.8135 | 0.86 |
| Random Forest | 0.8265 | 0.822 | 0.79 | 0.787 |
| Gradient Boosting | 0.799 | 0.797 | 0.738 | 0.8025 |


Which combination of features and model performed the best?

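Answer: Word-level TF-IDF features with logistic regression performed best, with a test accuracy of 0.8710. The support vector machine came close on word-level TF-IDF (0.8630) and on character-level TF-IDF (0.8600).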
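
The best combination can also be read off programmatically. The sketch below copies the accuracies from the results table into a plain dictionary (the name `reported` is made up for this example) and looks up the maximum overall and per model.

```python
# accuracies copied from the results table above
reported = {
    "Naïve Bayes":            {"Count Vectors": 0.8520, "WordLevel TF-IDF": 0.8560,
                               "N-Gram Vectors": 0.8375, "CharLevel Vectors": 0.8195},
    "Logistic Regression":    {"Count Vectors": 0.8515, "WordLevel TF-IDF": 0.8710,
                               "N-Gram Vectors": 0.8305, "CharLevel Vectors": 0.8485},
    "Support Vector Machine": {"Count Vectors": 0.8345, "WordLevel TF-IDF": 0.8630,
                               "N-Gram Vectors": 0.8135, "CharLevel Vectors": 0.8600},
    "Random Forest":          {"Count Vectors": 0.8265, "WordLevel TF-IDF": 0.8220,
                               "N-Gram Vectors": 0.7900, "CharLevel Vectors": 0.7870},
    "Gradient Boosting":      {"Count Vectors": 0.7990, "WordLevel TF-IDF": 0.7970,
                               "N-Gram Vectors": 0.7380, "CharLevel Vectors": 0.8025},
}

# best (model, feature set, accuracy) over the whole table
best = max(
    ((model, feature, acc)
     for model, row in reported.items()
     for feature, acc in row.items()),
    key=lambda item: item[2],
)
print(best)   # ('Logistic Regression', 'WordLevel TF-IDF', 0.871)

# best feature set for each model
best_per_model = {model: max(row, key=row.get) for model, row in reported.items()}
print(best_per_model)
```

With the `results` DataFrame from the notebook, `results.stack().idxmax()` should give the same `(model, feature set)` pair.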



---

© 2025 Institute of Data