# Simple text classifier

Build a simple text classifier that uses a Bag of Words (BoW) representation to convert text into numerical features and then assigns each text to a topic.

## Bag of Words (BoW)

In [1]:
# Get dummy data
X = ['I love the book', 'This is a great book', 'This is a nice shirt', 'I love your shoes']
y = ['book', 'book', 'clothes', 'clothes']

from sklearn.feature_extraction.text import CountVectorizer

vectorizer_uni = CountVectorizer(binary=True, # binary = whether to apply a binary encoding (True) or keep the count of occurrences (False)
                                 ngram_range=(1, 1)) # (1, 1) = unigrams only, (1, 2) = unigrams + bigrams, (2, 2) = bigrams only
# Fit to the data
X_uni = vectorizer_uni.fit_transform(X)

vectorizer_bi = CountVectorizer(binary=True,
                                ngram_range=(2, 2))
X_bi = vectorizer_bi.fit_transform(X)

vectorizer_uni_bi = CountVectorizer(binary=True,
                                    ngram_range=(1, 2))
X_uni_bi = vectorizer_uni_bi.fit_transform(X)

print('First sentence without encoding:')
print(X[0])
print('\n')

print('Unigram vectorizer learned n-grams:')
print(vectorizer_uni.get_feature_names_out().tolist())
print('\n')

print('Bigram vectorizer learned n-grams:')
print(vectorizer_bi.get_feature_names_out().tolist())
print('\n')

print('Unigram + bigram vectorizer learned n-grams:')
print(vectorizer_uni_bi.get_feature_names_out().tolist())
print('\n')

print(f'First sentence unigrams: {X_uni.toarray()[0]}')
print(f'First sentence bigrams: {X_bi.toarray()[0]}')
print(f'First sentence uni + bigrams: {X_uni_bi.toarray()[0]}')

First sentence without encoding:
I love the book


Unigram vectorizer learned n-grams:
['book', 'great', 'is', 'love', 'nice', 'shirt', 'shoes', 'the', 'this', 'your']


Bigram vectorizer learned n-grams:
['great book', 'is great', 'is nice', 'love the', 'love your', 'nice shirt', 'the book', 'this is', 'your shoes']


Unigram + bigram vectorizer learned n-grams:
['book', 'great', 'great book', 'is', 'is great', 'is nice', 'love', 'love the', 'love your', 'nice', 'nice shirt', 'shirt', 'shoes', 'the', 'the book', 'this', 'this is', 'your', 'your shoes']


First sentence unigrams: [1 0 0 1 0 0 0 1 0 0]
First sentence bigrams: [0 0 0 1 0 0 1 0 0]
First sentence uni + bigrams: [1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0]
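
To make the encoding less of a black box, the following is a minimal sketch of a binary n-gram BoW written by hand. It assumes the same preprocessing as `CountVectorizer`'s defaults (lowercasing, and keeping only tokens of two or more word characters, which is why "I" and "a" never appear in the vocabularies above). The helper names `tokenize`, `ngrams`, `fit_vocabulary` and `encode` are made up for this illustration and are not part of scikit-learn.

```python
import re

def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters (this drops "I" and "a")
    return re.findall(r"\b\w\w+\b", text.lower())

def ngrams(tokens, n_min, n_max):
    # All contiguous n-grams with n_min <= n <= n_max, joined by spaces
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def fit_vocabulary(corpus, n_min, n_max):
    # Sorted list of the distinct n-grams seen in the corpus
    return sorted({g for text in corpus for g in ngrams(tokenize(text), n_min, n_max)})

def encode(text, vocabulary, n_min, n_max):
    # Binary encoding: 1 if the n-gram occurs in the text, 0 otherwise
    present = set(ngrams(tokenize(text), n_min, n_max))
    return [1 if g in present else 0 for g in vocabulary]

corpus = ['I love the book', 'This is a great book', 'This is a nice shirt', 'I love your shoes']

vocab_uni = fit_vocabulary(corpus, 1, 1)
vocab_bi = fit_vocabulary(corpus, 2, 2)
vocab_uni_bi = fit_vocabulary(corpus, 1, 2)

print(len(vocab_uni), len(vocab_bi), len(vocab_uni_bi))  # 10 9 19
print(encode(corpus[0], vocab_uni, 1, 1))                # [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
```

Even on four short sentences, adding bigrams nearly doubles the number of features (10 to 19), which is worth keeping in mind when choosing `ngram_range`.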




## Simple classifier using LinearSVC

In [2]:
from sklearn import svm

clf = svm.LinearSVC()
clf.fit(X_uni, y)

X_test = ['I like this book', 'I love your shirt', 'Nice shoes']

clf.predict(vectorizer_uni.transform(X_test))

array(['book', 'clothes', 'clothes'], dtype='<U7')

## BoW issues

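1) The larger the n-grams, the more specific the features are to the training sentences, so the classifier is more likely to overfit. For instance, a 10-gram BoW only matches 10-word sequences that appeared verbatim in the training data, so it would rarely recognize anything in a new sentence.

2) If a word is not in the BoW vocabulary, it is silently dropped during encoding, so the classifier never sees it.

3) The vocabulary grows with the corpus, so the encoded vectors become high-dimensional and mostly zeros, which may increase memory usage.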

In [3]:
clf.predict(vectorizer_uni.transform(['I love the boots'])) # "boots" is not in the vocabulary

array(['book'], dtype='<U7')
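
The prediction above can be traced back to the encoding. The sketch below is a hand-written stand-in for the unigram vectorizer (the `tokenize`, `encode` and `unknown_words` helpers are hypothetical, and the vocabulary is copied from the unigram output earlier). It shows that an unseen word contributes nothing to the feature vector.

```python
import re

def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Unigram vocabulary learned from the four training sentences
vocabulary = ['book', 'great', 'is', 'love', 'nice', 'shirt', 'shoes', 'the', 'this', 'your']

def encode(text):
    # Binary unigram encoding; words outside the vocabulary have no column
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]

def unknown_words(text):
    # Tokens that the encoding silently drops
    return [w for w in tokenize(text) if w not in vocabulary]

with_boots = encode('I love the boots')
without_boots = encode('I love the')

print(unknown_words('I love the boots'))  # ['boots']
print(with_boots == without_boots)        # True
```

Once "boots" is dropped, only "love" and "the" remain. Both occur in the training sentence "I love the book", while only "love" occurs in "I love your shoes", which may be why the classifier leans toward `book`.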

# Word vectors

spaCy's trained pipeline includes pre-trained word vectors: each word is mapped to a vector of shape (300,) that represents the word's position in an embedding space. For a sentence, spaCy returns the mean of its word (token) vectors.
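
As a rough illustration of the averaging step, here is a minimal sketch with made-up 3-dimensional vectors. The values in `toy_vectors` and the `sentence_vector` helper are hypothetical; real spaCy vectors have 300 dimensions and come from the loaded pipeline.

```python
# Hypothetical 3-dimensional word vectors (real spaCy vectors have 300 dimensions)
toy_vectors = {
    'love': [0.9, 0.1, 0.0],
    'the': [0.1, 0.1, 0.1],
    'book': [0.0, 0.8, 0.3],
}

def sentence_vector(tokens):
    # Element-wise mean of the token vectors
    vectors = [toy_vectors[t] for t in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

vec = sentence_vector(['love', 'the', 'book'])
print(vec)  # one 3-dimensional vector for the whole sentence
```

The sentence vector has the same size as a single word vector no matter how long the sentence is, which is what lets it feed directly into a classifier.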

## Downloading the pipeline

In [4]:
# Download vectors
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.3.0/en_core_web_md-3.3.0-py3-none-any.whl (33.5 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.3.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


## Loading the pipeline and trying to find some similarities

In [5]:
import numpy as np
import spacy

# spaCy provides a trained English pipeline
nlp = spacy.load('en_core_web_md')

def similarity(a, b):
    """
    Computes the cosine similarity between the vectors of two strings.
    """
    # Get the vector representation of each string
    a_vec = nlp(a).vector
    b_vec = nlp(b).vector

    return np.dot(a_vec, b_vec) / (np.linalg.norm(a_vec) * np.linalg.norm(b_vec))

print(similarity('cat', 'book')) # Not similar at all
print(similarity('cat', 'dog')) # Really similar
print(similarity('book', 'library')) # Pretty similar

0.06928015
1.0
0.71957725


## Retraining the simple classifier

In [6]:
# Get vector representation for each sentence
X_vectors = np.array([nlp(text).vector for text in X])

# Retrain the classifier
clf = svm.LinearSVC()
clf.fit(X_vectors, y)

# The BoW classifier predicted "book" for this sentence.
# Now, even though the word "boots" is not in the training data,
# the classifier predicts the correct label
clf.predict(np.array([nlp('I love the boots').vector]))

array(['clothes'], dtype='<U7')

In [7]:
# Another slightly tricky example
clf.predict(np.array([nlp('I love going to the library').vector]))

array(['book'], dtype='<U7')

## Embeddings issues

1) Squashing a whole sentence, text, or document into a single vector can lose the meaning of individual words.

2) Averaged word embeddings do not take word order into account, even though a different order can give a sentence a completely different meaning.

In [8]:
# Same words rearranged into different sentences
sentence_1 = 'Her dog has gone for a walk with Mary'
sentence_2 = 'Mary has gone for a walk with her dog'

similarity(sentence_1, sentence_2)

1.0
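
A similarity of 1.0 is what one would expect from any order-insensitive summary of the same words. The sketch below uses plain Python and does not touch spaCy. It lowercases the text as a simplifying assumption, and the `tokens` and `bigrams` helpers are made up for this illustration. It checks that the two sentences contain exactly the same words, and contrasts this with bigrams, which keep some local word order.

```python
def tokens(text):
    # Naive whitespace tokenization on lowercased text
    return text.lower().split()

def bigrams(text):
    # Set of adjacent word pairs
    words = tokens(text)
    return {' '.join(words[i:i + 2]) for i in range(len(words) - 1)}

sentence_1 = 'Her dog has gone for a walk with Mary'
sentence_2 = 'Mary has gone for a walk with her dog'

# Same words, different order: a mean over word vectors cannot tell them apart
same_words = sorted(tokens(sentence_1)) == sorted(tokens(sentence_2))
print(same_words)  # True

# Bigrams that appear in only one of the two sentences
differing = sorted(bigrams(sentence_1) ^ bigrams(sentence_2))
print(differing)  # ['dog has', 'mary has', 'with her', 'with mary']
```

Bigram features do separate the two sentences, at the price of the sparsity and generalization issues described in the BoW section.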

# Stemming/Lemmatization

Both techniques aim to reduce the inflected forms of a word to a common base form. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly, using a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

In [9]:
import nltk

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...


True

## Stemming

Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word.

In [10]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

phrase = 'reading the books'
words = word_tokenize(phrase) # Break the sentence into individual words
stemmed_words = [stemmer.stem(word) for word in words] # Apply the stemmer to each word
new_phrase = ' '.join(stemmed_words)
print(new_phrase) # Expected "read the book"

read the book
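
To give a feel for the "chop off the ending" idea, here is a deliberately tiny suffix stripper. It is a toy and not the Porter algorithm, which applies many more rules and conditions. The `SUFFIXES` list, the `crude_stem` function and the `min_stem_length` parameter are all hypothetical.

```python
SUFFIXES = ('ing', 'ed', 's')  # Hypothetical, deliberately tiny suffix list

def crude_stem(word, min_stem_length=3):
    # Strip the first matching suffix, as long as enough of the word remains
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[:-len(suffix)]
    return word

stems = [crude_stem(w) for w in ['reading', 'the', 'books']]
print(stems)                # ['read', 'the', 'book']
print(crude_stem('loved'))  # lov
```

The last line shows the crude side of the heuristic: the result does not have to be a real word, only a stable key that groups related forms together.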


## Lemmatization

In [11]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

phrase = 'reading the books'
words = word_tokenize(phrase) # Break the sentence into individual words
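lemmatized_words = [lemmatizer.lemmatize(word) for word in words] # Apply the lemmatizer to each word
new_phrase = ' '.join(lemmatized_words)
print(new_phrase) # One might expect "read the book", but lemmatize() defaults to pos='n' (noun), so "reading" is left unchanged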

reading the book
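
The output is "reading the book" and not "read the book" because a lemmatizer needs to know the part of speech: NLTK's `lemmatize()` takes a `pos` argument that defaults to `'n'` (noun), and "reading" is a valid noun. The sketch below mimics that behavior with a hypothetical miniature dictionary. `LEMMAS`, `toy_lemmatize` and the hand-assigned tags are made up for illustration, and a real pipeline would get the tags from a part-of-speech tagger.

```python
# Hypothetical miniature lemma dictionary keyed by (word, part of speech)
LEMMAS = {
    ('reading', 'v'): 'read',
    ('books', 'n'): 'book',
    ('books', 'v'): 'book',
}

def toy_lemmatize(word, pos='n'):
    # Return the dictionary form if known for this part of speech, else the word itself
    return LEMMAS.get((word, pos), word)

words = ['reading', 'the', 'books']

# Treat every word as a noun (the default)
as_nouns = ' '.join(toy_lemmatize(w) for w in words)
print(as_nouns)  # reading the book

# Hand-assigned tags: "reading" is a verb here
tags = ['v', 'n', 'n']
with_tags = ' '.join(toy_lemmatize(w, pos) for w, pos in zip(words, tags))
print(with_tags)  # read the book
```

With NLTK, the equivalent adjustment would be passing `pos='v'` for the verb, for example `lemmatizer.lemmatize('reading', pos='v')`.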
