<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.4: Text Classification

In this lab you will implement different types of feature engineering for text classification:
* Count vectors
* TF-IDF vectors (word level, n-gram level, character level)
* Text/NLP based features
* Topic models
  
The following classification algorithms will be applied to the count and TF-IDF vector features:
* Naïve Bayes
* Logistic Regression
* Support Vector Machine
* Random Forest
* Gradient Boosting

## Import libraries

In [4]:
## Import Libraries
import numpy as np
import pandas as pd

import string
import spacy

from collections import Counter

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

import warnings
warnings.filterwarnings('ignore')

## Load data

Sample:

    __label__2 Stuning even for the non-gamer: This sound ...
    __label__2 The best soundtrack ever to anything.: I'm ...
    __label__2 Amazing!: This soundtrack is my favorite m ...
    __label__2 Excellent Soundtrack: I truly like this so ...
    __label__2 Remember, Pull Your Jaw Off The Floor Afte ...
    __label__2 an absolute masterpiece: I am quite sure a ...
    __label__1 Buyer beware: This is a self-published boo ...
    . . .
    
There are only two **labels**:
- `__label__1`
- `__label__2`
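
Because every line has the form `__label__N`, a space, then the review, the label and text can also be separated by splitting on the first space. The sketch below is an illustration only: `parse_labelled_line` is a hypothetical helper that is not part of the lab, and the two strings are the truncated sample lines shown above.

```python
def parse_labelled_line(line):
    """Split '__label__N <text>' into (label, text), mapping labels 1/2 to 0/1."""
    tag, _, text = line.strip().partition(' ')   # split on the first space only
    label = int(tag.replace('__label__', '')) - 1
    return label, text

# Truncated sample lines copied from the listing above
sample_lines = [
    "__label__2 Amazing!: This soundtrack is my favorite m ...",
    "__label__1 Buyer beware: This is a self-published boo ...",
]

parsed = [parse_labelled_line(line) for line in sample_lines]
print(parsed)
```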

In [8]:
## Loading the data

df_corpus = pd.read_fwf(
    filepath_or_buffer = '/Users/annaxu/Documents/Data Science/DATA/corpus.txt',
    colspecs = [(9, 10),   # label: get only the numbers 1 or 2
                (11, 9000) # text: makes it big enough to get to the end of the line
               ],
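    header = 0, # NOTE: treats the first line of the file as a header row, so the first review is dropped; header = None would keep it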
    names = ['label', 'text'],
    lineterminator = '\n'
)

# convert label from [1, 2] to [0, 1]
df_corpus['label'] = df_corpus['label'] - 1

## Inspect the data

In [11]:
df_corpus.head()

|   | label | text |
|---|-------|------|
| 0 | 1 | The best soundtrack ever to anything.: I'm rea... |
| 1 | 1 | Amazing!: This soundtrack is my favorite music... |
| 2 | 1 | Excellent Soundtrack: I truly like this soundt... |
| 3 | 1 | Remember, Pull Your Jaw Off The Floor After He... |
| 4 | 1 | an absolute masterpiece: I am quite sure any o... |


In [13]:
df_corpus.shape

(9999, 2)

## Split the data into train and test

In [16]:
X = df_corpus['text']
y = df_corpus['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)
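
When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows for testing. The sketch below uses hypothetical stand-in arrays with the same number of rows as `df_corpus` (9,999) to show the resulting split sizes; the real text and labels are not needed for this check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 9,999 rows (assumption: only the row count matters here)
X_demo = np.arange(9999)
y_demo = np.arange(9999) % 2

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# The default test_size of 0.25 is rounded up to a whole number of rows
print(len(X_tr), len(X_te))
```

With 2,500 test rows, each correctly classified review contributes 0.0004 to the accuracy scores reported later.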

## Feature Engineering

### Count Vectors as features

In [20]:
#Create a count vectorizer object
count_vect = CountVectorizer(token_pattern = r'\w{1,}') #treat every run of one or more word characters (letters, digits, underscore) as a token

#Build a dictionary of all unique tokens in the raw documents
count_vect.fit(X_train)

#Transform data into a document-term matrix where each document is a row in X_train/X_test
X_train_count = count_vect.transform(X_train)
X_test_count = count_vect.transform(X_test)

### TF-IDF Vectors as features
- Word level
- N-Gram level
- Character level

In [23]:
%%time
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer = 'word',
                             token_pattern = r'\w{1,}',
                             max_features = 5000)
print(tfidf_vect)

tfidf_vect.fit(X_train)
X_train_tfidf = tfidf_vect.transform(X_train)
X_test_tfidf  = tfidf_vect.transform(X_test)

TfidfVectorizer(max_features=5000, token_pattern='\\w{1,}')
CPU times: user 413 ms, sys: 9.09 ms, total: 422 ms
Wall time: 423 ms
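
With its default settings, `TfidfVectorizer` multiplies each raw count by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales every row to unit L2 length. The sketch below recomputes this by hand on a toy corpus (an assumption for illustration) and compares it with the vectorizer's output.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (assumption, not taken from the lab data)
docs_demo = ["good game", "bad game", "good good soundtrack"]

# Reference: scikit-learn's TF-IDF with default settings
tfidf_demo = TfidfVectorizer(token_pattern=r'\w{1,}').fit_transform(docs_demo).toarray()

# Manual computation from raw counts
counts = CountVectorizer(token_pattern=r'\w{1,}').fit_transform(docs_demo).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                  # number of documents containing each token
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1      # smoothed idf
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalise each row

print(np.allclose(tfidf_demo, manual))
```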


In [24]:
%%time
# n-gram level tf-idf; each feature is a sequence of n consecutive words
tfidf_vect_ngram = TfidfVectorizer(analyzer = 'word',
                                   token_pattern = r'\w{1,}',
                                   ngram_range = (2, 3), #bigrams and trigrams only (no single words)
                                   max_features = 5000)
print(tfidf_vect_ngram)

tfidf_vect_ngram.fit(X_train)
X_train_tfidf_ngram = tfidf_vect_ngram.transform(X_train)
X_test_tfidf_ngram  = tfidf_vect_ngram.transform(X_test)

TfidfVectorizer(max_features=5000, ngram_range=(2, 3), token_pattern='\\w{1,}')
CPU times: user 1.96 s, sys: 54.8 ms, total: 2.01 s
Wall time: 2.02 s
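
A quick way to check which features `ngram_range = (2, 3)` produces is to call the vectorizer's analyzer on a single string. This is a small sketch on a made-up sentence; `build_analyzer()` does not require the vectorizer to be fitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

word_analyzer = TfidfVectorizer(analyzer='word',
                                token_pattern=r'\w{1,}',
                                ngram_range=(2, 3)).build_analyzer()

# Made-up sentence (assumption); text is lower-cased before tokenising
word_ngrams = word_analyzer("This is not good")
print(word_ngrams)
```

Bigrams such as `not good` are what allow the model to see negation that single-word features would miss.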


In [26]:
%%time
# character level tf-idf; each feature is a sequence of n consecutive characters
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer = 'char',
                                         ngram_range = (2, 3),
                                         max_features = 5000)
print(tfidf_vect_ngram_chars)

tfidf_vect_ngram_chars.fit(X_train)
X_train_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(X_train)
X_test_tfidf_ngram_chars  = tfidf_vect_ngram_chars.transform(X_test)

TfidfVectorizer(analyzer='char', max_features=5000, ngram_range=(2, 3))
CPU times: user 3 s, sys: 55.8 ms, total: 3.06 s
Wall time: 3.1 s
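
One plausible benefit of character n-grams is that a misspelt word still shares most of its features with the correct spelling. The sketch below compares the character bigrams and trigrams of "stuning" (the spelling used in the first sample review) with those of "stunning"; the comparison itself is an illustration, not part of the lab.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

char_analyzer = TfidfVectorizer(analyzer='char', ngram_range=(2, 3)).build_analyzer()

misspelt = set(char_analyzer("stuning"))
correct = set(char_analyzer("stunning"))

# n-grams that both spellings have in common
shared = misspelt & correct
print(sorted(shared))
print(len(shared), 'of', len(misspelt), 'n-grams of the misspelt word are shared')
```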


### Text / NLP based features

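Create some additional features from the raw text:

* `char_count`: number of characters in the text
* `word_count`: number of whitespace-separated words in the text
* `word_density`: `char_count / word_count`, roughly the average number of characters per word (spaces and punctuation are included in the character count)
* `punctuation_count`: number of punctuation characters in the text
* `title_word_count`: number of title-case words in the text (e.g. "Lorem")
* `uppercase_word_count`: number of words written entirely in uppercase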


In [28]:
%%time
def text_features(text):
    char_count = len(text)
    word_count = len(text.split())
    word_density = char_count / word_count
    punctuation_count = sum(1 for c in text if c in string.punctuation)
    title_word_count = sum(1 for word in text.split() if word.istitle())
    uppercase_word_count = sum(1 for word in text.split() if word.isupper())

    return {
        'char_count': char_count,
        'word_count': word_count,
        'word_density': word_density,
        'punctuation_count': punctuation_count,
        'title_word_count': title_word_count,
        'uppercase_word_count': uppercase_word_count
    }

CPU times: user 2 μs, sys: 1e+03 ns, total: 3 μs
Wall time: 3.1 μs


In [29]:
#test function
test_df = pd.DataFrame({'text': ["Lorem ipsum dolor sit amet, consectetuer adipiscing elit! Aenean's commodo ligula eget dolor."]})
test_features = test_df['text'].apply(text_features).apply(pd.Series) #apply(pd.Series) converts returned dictionaries into columns
print(test_features)

   char_count  word_count  word_density  punctuation_count  title_word_count  \
0        93.0        13.0      7.153846                4.0               1.0   

   uppercase_word_count  
0                   0.0  
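
The title-case count of 1 may look low, since both "Lorem" and "Aenean's" start with a capital letter. The sketch below shows how `str.istitle()` and `str.isupper()` treat a few words; the word list is chosen for illustration.

```python
# Words chosen for illustration; the first two come from the test sentence above
words_demo = ["Lorem", "Aenean's", "NASA", "I", "amet,"]

flags = {w: (w.istitle(), w.isupper()) for w in words_demo}
for word, (is_title, is_upper) in flags.items():
    print(f'{word!r:12} istitle={is_title!s:5} isupper={is_upper}')
```

`"Aenean's"` is not title case because the lowercase `s` follows an apostrophe rather than a letter, and a one-letter word such as `"I"` counts as both title case and uppercase.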


In [30]:
#apply function to dataset
features_df = df_corpus['text'].apply(text_features).apply(pd.Series) 
df_corpus = pd.concat([df_corpus, features_df], axis=1)

In [31]:
## load spaCy
nlp = spacy.load('en_core_web_sm')

Part-of-speech tags in **spaCy**:

    POS   DESCRIPTION               EXAMPLES
    ----- ------------------------- ---------------------------------------------
    ADJ   adjective                 big, old, green, incomprehensible, first
    ADP   adposition                in, to, during
    ADV   adverb                    very, tomorrow, down, where, there
    AUX   auxiliary                 is, has (done), will (do), should (do)
    CONJ  conjunction               and, or, but
    CCONJ coordinating conjunction  and, or, but
    DET   determiner                a, an, the
    INTJ  interjection              psst, ouch, bravo, hello
    NOUN  noun                      girl, cat, tree, air, beauty
    NUM   numeral                   1, 2017, one, seventy-seven, IV, MMXIV
    PART  particle                  's, not,
    PRON  pronoun                   I, you, he, she, myself, themselves, somebody
    PROPN proper noun               Mary, John, London, NATO, HBO
    PUNCT punctuation               ., (, ), ?
    SCONJ subordinating conjunction if, while, that
    SYM   symbol                    $, %, §, ©, +, −, ×, ÷, =, :), 😝
    VERB  verb                      run, runs, running, eat, ate, eating
    X     other                     sfpksdpsxmsa
    SPACE space
    
Count the adjectives, adverbs, nouns, numerals, pronouns, proper nouns and verbs in each text.

    Hint:
    1. Convert text to spacy document
    2. Use pos_
    3. Use Counter
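
Step 3 of the hint can be sketched without spaCy. Here the tag list is written by hand as a hypothetical stand-in for `[token.pos_ for token in doc]`; `Counter.get(tag, 0)` returns 0 for tags that never occur.

```python
from collections import Counter

# Hypothetical POS tags standing in for [token.pos_ for token in doc]
tags_demo = ['DET', 'ADJ', 'NOUN', 'VERB', 'ADV', 'PUNCT', 'NOUN']

c_demo = Counter(tags_demo)
wanted = ['ADJ', 'ADV', 'NOUN', 'NUM', 'PRON', 'PROPN', 'VERB']

# .get(tag, 0) avoids a missing entry when a tag is absent from the text
pos_counts = {tag: c_demo.get(tag, 0) for tag in wanted}
print(pos_counts)
```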

In [34]:
%%time
def pos_features(df, text_col='text', batch_size=100):
    adj = []
    adv = []
    noun = []
    num = []
    pron = []
    propn = []
    verb = []

    # Process texts in batches with nlp.pipe()
    # Disable NER and dependency parsing to speed up processing (POS tagging, including PROPN, does not need them)
    for doc in nlp.pipe(df[text_col], batch_size=batch_size, disable=["ner", "parser"]):
        c = Counter([token.pos_ for token in doc])
        adj.append(c.get('ADJ', 0)) #return 0 if no ADJ
        adv.append(c.get('ADV', 0))
        noun.append(c.get('NOUN', 0))
        num.append(c.get('NUM', 0))
        pron.append(c.get('PRON', 0))
        propn.append(c.get('PROPN', 0))
        verb.append(c.get('VERB', 0))

    return pd.DataFrame({
        'adj_count': adj,
        'adv_count': adv,
        'noun_count': noun,
        'num_count': num,
        'pron_count': pron,
        'propn_count': propn,
        'verb_count': verb
    })

CPU times: user 1 μs, sys: 1 μs, total: 2 μs
Wall time: 3.81 μs


In [35]:
#test function
test_df = pd.DataFrame({'text': ["Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts."]})
test_features = pos_features(test_df)
print(test_features)

   adj_count  adv_count  noun_count  num_count  pron_count  propn_count  \
0          1          4           4          0           1            2   

   verb_count  
0           1  


In [36]:
#apply function to dataset
features_df = pos_features(df_corpus)
df_corpus = pd.concat([df_corpus, features_df], axis=1)

### Topic Models as features

In [38]:
%%time
# train an LDA model
# LDA is a topic modelling algorithm that assumes each document is generated from a mixture of underlying topics, and each topic is, in turn, a distribution over words
# e.g. with 2 topics, sports and politics, the topic-word distribution for "sports" might assign high probabilities to words like "game", "team", "athlete" and "tournament", while "politics" might favour words like "government", "policy", "election" and "law"
lda_model = LatentDirichletAllocation(n_components = 20, learning_method = 'online', max_iter = 20)

X_topics = lda_model.fit_transform(X_train_count)
topic_word = lda_model.components_
vocab = count_vect.get_feature_names_out()

CPU times: user 22 s, sys: 1.2 s, total: 23.2 s
Wall time: 23.2 s


In [121]:
# view the topic models
n_top_words = 10
topic_summaries = []
print('Group Top Words')
print('-----', '-'*80)
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    #topic_dist = array of word weights (unnormalised pseudo-counts) for one topic
    #np.argsort = returns the indices that sort topic_dist in ascending order
    #[:-(n_top_words+1):-1] = takes the last n_top_words entries in reverse order (highest weight first)
    #np.array(vocab)[...] = maps those indices back to the actual words
    top_words = ' '.join(topic_words) #combines 10 top words into a string
    topic_summaries.append(top_words) #append to empty list
    print('  %3d %s' % (i, top_words))

Group Top Words
----- --------------------------------------------------------------------------------
    0 game non games unit plug device graphics food gammell brando
    1 system waist tight yoga magazine cooking surface squeem secret salt
    2 battle pet steve health errors charged index publisher exchange pearl
    3 plastic dumb fake gay edges cartridge ho serving dissapointing applications
    4 her she mother apple woman adapter diane power lane flick
    5 max boys blade law flow train lady breaking cutting pin
    6 windows war future images compare xp finds higgins experienced censorship
    7 diary youth tag lil lemon holocaust streets huppert peice scarlett
    9 le connector cliff est discussed knots caps angles versus symphonic
   10 manson emarker haiku bugliosi beatles authority influence bringing powers helter
   11 of and is s the a album music his in
   12 tough sadly showed comparison stiff damn em crocodile rollerball bernstein
   13 lens chinese fw pile voodoo 
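
The `argsort` and slicing idiom used to pick the top words can be checked on a tiny example. The four words and weights below are made up for illustration.

```python
import numpy as np

# Made-up vocabulary and weights for a single topic (assumption)
vocab_small = np.array(['album', 'game', 'lens', 'music'])
weights_small = np.array([0.2, 0.9, 0.1, 0.5])
n_top = 2

order = np.argsort(weights_small)                   # indices from lowest to highest weight
top_small = vocab_small[order][:-(n_top + 1):-1]    # last n_top words, highest weight first

print(order)
print(top_small)
```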

## Modelling

Run the following cells to train a number of models on the count vector and TF-IDF vector feature sets generated above.

In [79]:
def train_model(classifier, feature_vector_train, y_train, feature_vector_test):
    # fit the classifier on the training dataset
    classifier.fit(feature_vector_train, y_train) #feature vectors are passed in because we have multiple X_trains and X_tests

    # predict the labels of the test dataset
    predictions = classifier.predict(feature_vector_test)

    # y_test is read from the global scope; accuracy_score expects (y_true, y_pred)
    return accuracy_score(y_test, predictions)

In [81]:
# Keep the results in a dataframe
results = pd.DataFrame(columns = ['Count Vectors',
                                  'WordLevel TF-IDF',
                                  'N-Gram Vectors',
                                  'CharLevel Vectors'])
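
The cells that follow repeat the same pattern for every model and feature set. As an alternative, the same table could be filled with two nested loops. The sketch below is not the lab's method: it uses a tiny made-up corpus, only two vectorizers and two classifiers, and `_demo` names so it does not overwrite any lab variables.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus (assumption); 1 = positive, 0 = negative
train_texts = ["great album", "love this music", "awful book",
               "boring and bad", "great music", "bad awful plot"]
train_labels = [1, 1, 0, 0, 1, 0]
test_texts = ["great book", "awful music"]
test_labels = [1, 0]

vectorizers_demo = {
    'Count Vectors': CountVectorizer(token_pattern=r'\w{1,}'),
    'WordLevel TF-IDF': TfidfVectorizer(token_pattern=r'\w{1,}'),
}
classifiers_demo = {
    'Naïve Bayes': MultinomialNB,
    'Logistic Regression': LogisticRegression,
}

results_demo = pd.DataFrame(index=list(classifiers_demo),
                            columns=list(vectorizers_demo), dtype=float)

for feat_name, vect in vectorizers_demo.items():
    X_tr = vect.fit_transform(train_texts)   # fit on training text only
    X_te = vect.transform(test_texts)
    for clf_name, clf_class in classifiers_demo.items():
        clf = clf_class().fit(X_tr, train_labels)   # fresh classifier per feature set
        results_demo.loc[clf_name, feat_name] = accuracy_score(test_labels, clf.predict(X_te))

print(results_demo)
```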

### Naive Bayes Classifier

In [84]:
%%time
# Naive Bayes on Count Vectors
accuracy1 = train_model(MultinomialNB(), X_train_count, y_train, X_test_count)
print('NB, Count Vectors    : %.4f\n' % accuracy1)

NB, Count Vectors    : 0.8368

CPU times: user 12.5 ms, sys: 5.54 ms, total: 18 ms
Wall time: 14.8 ms


In [86]:
%%time
# Naive Bayes on Word Level TF IDF Vectors
accuracy2 = train_model(MultinomialNB(), X_train_tfidf, y_train, X_test_tfidf)
print('NB, WordLevel TF-IDF : %.4f\n' % accuracy2)

NB, WordLevel TF-IDF : 0.8400

CPU times: user 10.9 ms, sys: 5.21 ms, total: 16.1 ms
Wall time: 12.6 ms


In [88]:
%%time
# Naive Bayes on Ngram Level TF IDF Vectors
accuracy3 = train_model(MultinomialNB(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('NB, N-Gram Vectors   : %.4f\n' % accuracy3)

NB, N-Gram Vectors   : 0.8356

CPU times: user 3.19 ms, sys: 1.73 ms, total: 4.92 ms
Wall time: 3.99 ms


In [90]:
%%time
# Naive Bayes on Character Level TF IDF Vectors
accuracy4 = train_model(MultinomialNB(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('NB, CharLevel Vectors: %.4f\n' % accuracy4)

NB, CharLevel Vectors: 0.8144

CPU times: user 18.2 ms, sys: 2.82 ms, total: 21 ms
Wall time: 20 ms


In [92]:
results.loc['Naïve Bayes'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Linear Classifier

In [94]:
%%time
# Linear Classifier on Count Vectors
accuracy1 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 350), X_train_count, y_train, X_test_count)
print('LR, Count Vectors    : %.4f\n' % accuracy1)

LR, Count Vectors    : 0.8548

CPU times: user 2.05 s, sys: 194 ms, total: 2.24 s
Wall time: 503 ms


In [95]:
%%time
# Linear Classifier on Word Level TF IDF Vectors
accuracy2 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf, y_train, X_test_tfidf)
print('LR, WordLevel TF-IDF : %.4f\n' % accuracy2)

LR, WordLevel TF-IDF : 0.8688

CPU times: user 15.7 ms, sys: 3.1 ms, total: 18.8 ms
Wall time: 17.7 ms


In [96]:
%%time
# Linear Classifier on Ngram Level TF IDF Vectors
accuracy3 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('LR, N-Gram Vectors   : %.4f\n' % accuracy3)

LR, N-Gram Vectors   : 0.8392

CPU times: user 10.5 ms, sys: 1.86 ms, total: 12.4 ms
Wall time: 12.5 ms


In [100]:
%%time
# Linear Classifier on Character Level TF IDF Vectors
accuracy4 = train_model(LogisticRegression(solver = 'lbfgs', max_iter = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('LR, CharLevel Vectors: %.4f\n' % accuracy4)

LR, CharLevel Vectors: 0.8404

CPU times: user 85.7 ms, sys: 3.72 ms, total: 89.5 ms
Wall time: 88.5 ms


In [102]:
results.loc['Logistic Regression'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Support Vector Machine

In [105]:
%%time
# Support Vector Machine on Count Vectors
accuracy1 = train_model(LinearSVC(), X_train_count, y_train, X_test_count)
print('SVM, Count Vectors    : %.4f\n' % accuracy1)

SVM, Count Vectors    : 0.8424

CPU times: user 274 ms, sys: 7.49 ms, total: 281 ms
Wall time: 283 ms


In [106]:
%%time
# Support Vector Machine on Word Level TF IDF Vectors
accuracy2 = train_model(LinearSVC(), X_train_tfidf, y_train, X_test_tfidf)
print('SVM, WordLevel TF-IDF : %.4f\n' % accuracy2)

SVM, WordLevel TF-IDF : 0.8508

CPU times: user 42.2 ms, sys: 3.6 ms, total: 45.8 ms
Wall time: 45.6 ms


In [109]:
%%time
# Support Vector Machine on Ngram Level TF IDF Vectors
accuracy3 = train_model(LinearSVC(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('SVM, N-Gram Vectors   : %.4f\n' % accuracy3)

SVM, N-Gram Vectors   : 0.8280

CPU times: user 27.6 ms, sys: 3.59 ms, total: 31.2 ms
Wall time: 31.3 ms


In [111]:
%%time
# Support Vector Machine on Character Level TF IDF Vectors
accuracy4 = train_model(LinearSVC(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('SVM, CharLevel Vectors: %.4f\n' % accuracy4)

SVM, CharLevel Vectors: 0.8480

CPU times: user 371 ms, sys: 8.58 ms, total: 380 ms
Wall time: 380 ms


In [112]:
results.loc['Support Vector Machine'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Bagging Models

In [116]:
%%time
# Bagging (Random Forest) on Count Vectors
accuracy1 = train_model(RandomForestClassifier(n_estimators = 100), X_train_count, y_train, X_test_count)
print('RF, Count Vectors    : %.4f\n' % accuracy1)

RF, Count Vectors    : 0.8284

CPU times: user 3.08 s, sys: 34 ms, total: 3.12 s
Wall time: 3.12 s


In [117]:
%%time
# Bagging (Random Forest) on Word Level TF IDF Vectors
accuracy2 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf, y_train, X_test_tfidf)
print('RF, WordLevel TF-IDF : %.4f\n' % accuracy2)

RF, WordLevel TF-IDF : 0.8280

CPU times: user 1.97 s, sys: 23.9 ms, total: 2 s
Wall time: 2 s


In [118]:
%%time
# Bagging (Random Forest) on Ngram Level TF IDF Vectors
accuracy3 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('RF, N-Gram Vectors   : %.4f\n' % accuracy3)

RF, N-Gram Vectors   : 0.7868

CPU times: user 2.23 s, sys: 20 ms, total: 2.25 s
Wall time: 2.25 s


In [119]:
%%time
# Bagging (Random Forest) on Character Level TF IDF Vectors
accuracy4 = train_model(RandomForestClassifier(n_estimators = 100), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('RF, CharLevel Vectors: %.4f\n' % accuracy4)

RF, CharLevel Vectors: 0.7956

CPU times: user 6.78 s, sys: 48.9 ms, total: 6.83 s
Wall time: 6.88 s


In [120]:
results.loc['Random Forest'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

### Boosting Models

In [124]:
%%time
# Gradient Boosting on Count Vectors
accuracy1 = train_model(GradientBoostingClassifier(), X_train_count, y_train, X_test_count)
print('GB, Count Vectors    : %.4f\n' % accuracy1)

GB, Count Vectors    : 0.8008

CPU times: user 2.97 s, sys: 33.1 ms, total: 3 s
Wall time: 3.11 s


In [125]:
%%time
# Gradient Boosting on Word Level TF IDF Vectors
accuracy2 = train_model(GradientBoostingClassifier(), X_train_tfidf, y_train, X_test_tfidf)
print('GB, WordLevel TF-IDF : %.4f\n' % accuracy2)

GB, WordLevel TF-IDF : 0.8004

CPU times: user 8.51 s, sys: 56.6 ms, total: 8.57 s
Wall time: 8.66 s


In [126]:
%%time
# Gradient Boosting on Ngram Level TF IDF Vectors
accuracy3 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram, y_train, X_test_tfidf_ngram)
print('GB, N-Gram Vectors   : %.4f\n' % accuracy3)

GB, N-Gram Vectors   : 0.7352

CPU times: user 5.12 s, sys: 39.5 ms, total: 5.16 s
Wall time: 5.23 s


In [128]:
%%time
# Gradient Boosting on Character Level TF IDF Vectors
accuracy4 = train_model(GradientBoostingClassifier(), X_train_tfidf_ngram_chars, y_train, X_test_tfidf_ngram_chars)
print('GB, CharLevel Vectors: %.4f\n' % accuracy4)

GB, CharLevel Vectors: 0.8080

CPU times: user 1min 22s, sys: 285 ms, total: 1min 23s
Wall time: 1min 23s


In [129]:
results.loc['Gradient Boosting'] = {
    'Count Vectors': accuracy1,
    'WordLevel TF-IDF': accuracy2,
    'N-Gram Vectors': accuracy3,
    'CharLevel Vectors': accuracy4}

In [130]:
results

|   | Count Vectors | WordLevel TF-IDF | N-Gram Vectors | CharLevel Vectors |
|---|---------------|------------------|----------------|-------------------|
| Naïve Bayes | 0.8368 | 0.8400 | 0.8356 | 0.8144 |
| Logistic Regression | 0.8548 | 0.8688 | 0.8392 | 0.8404 |
| Support Vector Machine | 0.8424 | 0.8508 | 0.8280 | 0.8480 |
| Random Forest | 0.8284 | 0.8280 | 0.7868 | 0.7956 |
| Gradient Boosting | 0.8008 | 0.8004 | 0.7352 | 0.8080 |


Which combination of features and model performed the best?

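Logistic Regression on word-level TF-IDF vectors performed best, with a test accuracy of 0.8688. Logistic Regression on count vectors (0.8548) and the Support Vector Machine on word-level TF-IDF vectors (0.8508) were the next best combinations.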
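
The best combination can also be read off programmatically. The sketch below rebuilds the results table from the accuracies reported above under a separate, hypothetical name (`results_check`) so it runs on its own; in the notebook the same two lines could be applied to `results` directly.

```python
import pandas as pd

# Accuracies copied from the results table above
results_check = pd.DataFrame(
    {'Count Vectors':     [0.8368, 0.8548, 0.8424, 0.8284, 0.8008],
     'WordLevel TF-IDF':  [0.8400, 0.8688, 0.8508, 0.8280, 0.8004],
     'N-Gram Vectors':    [0.8356, 0.8392, 0.8280, 0.7868, 0.7352],
     'CharLevel Vectors': [0.8144, 0.8404, 0.8480, 0.7956, 0.8080]},
    index=['Naïve Bayes', 'Logistic Regression', 'Support Vector Machine',
           'Random Forest', 'Gradient Boosting'])

best_model = results_check.max(axis=1).idxmax()           # row holding the highest accuracy
best_features = results_check.loc[best_model].idxmax()    # column of that accuracy
best_accuracy = results_check.loc[best_model, best_features]

print(best_model, '|', best_features, '|', best_accuracy)
```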



---

© 2025 Institute of Data