# Text classification/categorization

    What is text classification?

Text classification is the process of assigning text documents into one or more classes or categories, assuming that we have a predefined set of classes.

Here, a document is any piece of text; it can be as short as a single sentence or as long as a paragraph.

## Two types of text classification

    What types of text classifications are available?

- content-based classification
- request-based classification

__Content-based classification__ is the type of text classification where priorities or weights are given to specific subjects or topics in the text content, which helps determine the class of the document.

E.g., a book with more than 30 percent of its content about food preparations can be classified under cooking/recipes. 

__Request-based classification__ is influenced by user requests and targeted towards specific user groups and audiences. This type of classification is governed by specific policies and ideals.

## Text classification blueprint

1. prepare test, train and validation (optional) datasets
2. text normalization
3. feature extraction
4. model training
5. model prediction and evaluation
6. model deployment
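
Step 1 of the blueprint can be sketched in a few lines. This is a minimal illustration using only the standard library; the function name `split_dataset`, the labels, and the fourth toy sentence are assumptions made for the example, not part of the notebook's modules.

```python
import random

def split_dataset(documents, labels, test_fraction=0.25, seed=42):
    """Shuffle (document, label) pairs and split them into train and test sets."""
    pairs = list(zip(documents, labels))
    random.Random(seed).shuffle(pairs)  # seeded so the split is reproducible
    n_test = max(1, int(len(pairs) * test_fraction))
    test, train = pairs[:n_test], pairs[n_test:]
    return train, test

# Toy data with hypothetical labels.
docs = ['the sky is blue', 'i love blue cheese',
        'the beautiful sky is blue', 'cheese is tasty']
labels = ['weather', 'food', 'weather', 'food']

train, test = split_dataset(docs, labels)
print(len(train), len(test))  # 3 1
```

A validation set, when needed, can be carved out of the training portion the same way.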

## Text normalization

- expanding contractions
- text standardization through lemmatization
- removing special characters and symbols
- removing stopwords

Others:
- correcting spelling
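
Two of the steps listed above, removing special characters and removing stopwords, can be sketched with the standard library alone. The stopword set and the function names here are assumptions for illustration; the sketch skips contraction expansion, lemmatization, and spelling correction.

```python
import re

# Tiny illustrative stopword list (an assumption); real pipelines use a fuller list.
STOPWORDS = {'the', 'is', 'and', 'a', 'an', 'this', 'i'}

def remove_special_characters(text):
    """Keep only letters, digits, and whitespace."""
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword set."""
    return [token for token in tokens if token not in STOPWORDS]

def simple_normalize(text):
    """Lowercase, strip special characters, and remove stopwords."""
    cleaned = remove_special_characters(text.lower())
    return ' '.join(remove_stopwords(cleaned.split()))

print(simple_normalize('The sky is blue!'))  # sky blue
```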

In [37]:
# To import these modules, create a directory named module containing an __init__.py file.
# Note that a .py file cannot be in the same folder as the .ipynb, else it will throw an exception.
from module.contractions import expand_contractions 
from module.tokenize import tokenize_text
from module.lemmatize import lemmatize_text, pos_tag_text
from module.normalization import normalize_corpus
from module.feature_extractors import bow_extractor, tfidf_transformer, tfidf_extractor, averaged_word_vectorizer, tfidf_weighted_averaged_word_vectorizer

In [38]:
expand_contractions("this isn't good")

'this is not good'

In [39]:
# Tokenize text into tokens that our other normalization functions will use.
tokenize_text('hello world')

['hello', 'world']

In [40]:
import re

# Match any hello.
pattern = re.compile('hello')

# Define a substitution function that allows us access to the matched word.
def subfn(m):
    match = m.group(0)
    return f'[{match}]'
    
pattern.sub(subfn, 'hello world')

'[hello] world'
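
The same substitution-function technique is one plausible way to expand contractions. The sketch below is an assumption about how such a helper could work, not the actual code in `module.contractions`; the mapping is deliberately tiny.

```python
import re

# Small illustrative mapping (an assumption); a real map would be much larger.
CONTRACTION_MAP = {"isn't": 'is not', "can't": 'cannot', "i'm": 'i am'}

# One alternation pattern that matches any key in the map, case-insensitively.
contraction_pattern = re.compile(
    '|'.join(re.escape(key) for key in CONTRACTION_MAP),
    flags=re.IGNORECASE,
)

def expand_contractions_sketch(text):
    """Replace each matched contraction with its expansion from the map."""
    def expand(match):
        return CONTRACTION_MAP[match.group(0).lower()]
    return contraction_pattern.sub(expand, text)

print(expand_contractions_sketch("this isn't good"))  # this is not good
```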

In [41]:
# lemmatize_text('where are you playing football')

In [42]:
CORPUS = [
    'the sky is blue',
    'sky is blue and sky is beautiful',
    'the beautiful sky is blue',
    'i love blue cheese'
]
new_doc = ['loving this blue sky today']

In [43]:
normalize_corpus(CORPUS, True)

['sky blue',
 ['sky', 'blue'],
 'sky blue sky beautiful',
 ['sky', 'blue', 'sky', 'beautiful'],
 'beautiful sky blue',
 ['beautiful', 'sky', 'blue'],
 'love blue cheese',
 ['love', 'blue', 'cheese']]

## Feature Extraction


### What is feature extraction/engineering?
    
- The process of extracting and selecting features from raw data so that they can be fed to a machine learning algorithm

### What is a feature?

- features are unique, measurable attributes or properties of each observation or data point in a dataset.
- features are usually numeric in nature; they can be absolute numeric values or categorical features, which can be encoded as one binary feature per category using a process called __one-hot encoding__.
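
As a minimal sketch of one-hot encoding, assuming a plain list of category strings (the function name and the color values are made up for the example):

```python
def one_hot_encode(values):
    """Map each categorical value to a binary vector with a single 1."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column that belongs to this category
        encoded.append(row)
    return categories, encoded

categories, encoded = one_hot_encode(['red', 'blue', 'red', 'green'])
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```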

### What are examples of feature extraction techniques?

- bag of words model
- tf-idf model
- advanced word vectorization model

# Model 1: Bag of Words

Disadvantage:
- vectors are based entirely on the absolute frequencies of word occurrences
- words that occur frequently across all documents in the corpus get high counts and tend to overshadow rarer words that may be more interesting and effective as features for identifying specific categories.
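
To make the absolute-frequency behavior concrete, here is a standard-library sketch of a bag-of-words count matrix over the four-sentence corpus defined earlier. It is not the `bow_extractor` helper; the tokenization rule (tokens of two or more word characters) is an assumption chosen for the example.

```python
import re
from collections import Counter

CORPUS = [
    'the sky is blue',
    'sky is blue and sky is beautiful',
    'the beautiful sky is blue',
    'i love blue cheese',
]

def tokenize(text):
    # Assumption: keep lowercase tokens of two or more word characters,
    # so the single-letter 'i' never enters the vocabulary.
    return re.findall(r'\b\w\w+\b', text.lower())

def bag_of_words(corpus):
    """Return the sorted vocabulary and one count row per document."""
    tokenized = [tokenize(doc) for doc in corpus]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

vocabulary, bow = bag_of_words(CORPUS)
print(vocabulary)
for row in bow:
    print(row)
```

Common words such as `is` and `sky` get the largest counts in the second document, which is exactly the overshadowing effect described above.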

In [44]:
# Build bow vectorizer and get features.
bow_vectorizer, bow_features = bow_extractor(CORPUS)
features = bow_features.todense()
features

matrix([[0, 0, 1, 0, 1, 0, 1, 1],
        [1, 1, 1, 0, 2, 0, 2, 0],
        [0, 1, 1, 0, 1, 0, 1, 1],
        [0, 0, 1, 1, 0, 1, 0, 0]])

In [45]:
# Extract features from new document using built vectorizer.
new_doc_features = bow_vectorizer.transform(new_doc)
new_doc_features = new_doc_features.todense()
new_doc_features

matrix([[0, 0, 1, 0, 0, 0, 1, 0]])

In [46]:
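# Print the feature names.
# Note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# newer versions use get_feature_names_out(), which returns a NumPy array.
feature_names = bow_vectorizer.get_feature_names()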
feature_names

['and', 'beautiful', 'blue', 'cheese', 'is', 'love', 'sky', 'the']

In [47]:
import pandas as pd

def display_features(features, feature_names):
    df = pd.DataFrame(data=features,
                      columns=feature_names)
    print(df)

In [48]:
display_features(features, feature_names)

   and  beautiful  blue  cheese  is  love  sky  the
0    0          0     1       0   1     0    1    1
1    1          1     1       0   2     0    2    0
2    0          1     1       0   1     0    1    1
3    0          0     1       1   0     1    0    0


In [49]:
display_features(new_doc_features, feature_names)

   and  beautiful  blue  cheese  is  love  sky  the
0    0          0     1       0   0     0    1    0


# Model 2: TF-IDF Model

- product of two metrics, term frequency (tf) and inverse document frequency (idf)
- term frequency is the raw frequency value of that term in a particular document
- $tf(w, D) = f_{wD}$, where $f_{wD}$ denotes the frequency of word $w$ in document $D$
- inverse document frequency is the inverse of the document frequency for each term.
- idf is computed by dividing the total number of documents in our corpus by the document frequency for each term and then applying logarithmic scaling on the result

We add 1 to the document frequency of each term, as if the corpus contained one extra document that includes every term in the vocabulary; for the same reason, the implementation below also adds 1 to the total document count. This prevents potential division-by-zero errors and smooths the inverse document frequencies. We also add 1 to the idf itself so that terms with zero idf are not ignored completely:

$idf(t) = 1 + \log\frac{1 + C}{1 + df(t)}$

Where:
- $C$ is the count of the total number of documents in our corpus
- $idf(t)$ is the idf for term t
- $df(t)$ is the number of documents in which term t is present
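
The idf computation can be checked by hand with a short sketch. It assumes the smoothed variant used by the from-scratch cells further down, where 1 is added to both the document count and each document frequency; the document frequencies are those of the four-sentence toy corpus.

```python
import math

# Number of documents (out of 4) that contain each term in the toy corpus.
doc_freq = {'and': 1, 'beautiful': 2, 'blue': 4, 'cheese': 1,
            'is': 3, 'love': 1, 'sky': 3, 'the': 2}
total_docs = 4

def smoothed_idf(df, n_docs):
    # Add 1 to both counts, as if one extra document contained every term,
    # then add 1 to the log so no term ends up with an idf of zero.
    return 1.0 + math.log((1 + n_docs) / (1 + df))

idf = {term: round(smoothed_idf(df, total_docs), 2)
       for term, df in doc_freq.items()}
print(idf)
```

Note that `blue`, which appears in every document, gets the lowest idf of 1.0, while terms seen in a single document get 1.92.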

In [50]:
import numpy as np

# Build tfidf transformer and show train corpus tfidf features.
tfidf_trans, tfidf_features = tfidf_transformer(bow_features)
features = np.round(tfidf_features.todense(), 2)
display_features(features, feature_names)

    and  beautiful  blue  cheese    is  love   sky   the
0  0.00       0.00  0.40    0.00  0.49  0.00  0.49  0.60
1  0.44       0.35  0.23    0.00  0.56  0.00  0.56  0.00
2  0.00       0.52  0.34    0.00  0.42  0.00  0.42  0.52
3  0.00       0.00  0.35    0.66  0.00  0.66  0.00  0.00


In [51]:
# Show tfidf features for new_doc using built tfidf transformer.
nd_tfidf = tfidf_trans.transform(new_doc_features)
nd_features = np.round(nd_tfidf.todense(), 2)
display_features(nd_features, feature_names)

   and  beautiful  blue  cheese   is  love   sky  the
0  0.0        0.0  0.63     0.0  0.0   0.0  0.77  0.0


## Implementing TF-IDF from scratch

In [52]:
import scipy.sparse as sp
from numpy.linalg import norm

feature_names = bow_vectorizer.get_feature_names()

# Compute term frequency.
tf = bow_features.todense()
tf = np.array(tf, dtype='float64')

In [53]:
# Show term frequency.
display_features(tf, feature_names)

   and  beautiful  blue  cheese   is  love  sky  the
0  0.0        0.0   1.0     0.0  1.0   0.0  1.0  1.0
1  1.0        1.0   1.0     0.0  2.0   0.0  2.0  0.0
2  0.0        1.0   1.0     0.0  1.0   0.0  1.0  1.0
3  0.0        0.0   1.0     1.0  0.0   1.0  0.0  0.0


In [54]:
# Build the document frequency matrix.
df = np.diff(sp.csc_matrix(bow_features, copy=True).indptr)
df = 1 + df # To smoothen the idf later.

In [55]:
# Number of documents in which each term appears, plus 1.
display_features([df], feature_names)

   and  beautiful  blue  cheese  is  love  sky  the
0    2          3     5       2   4     2    4    3


In [56]:
# Compute inverse document frequencies.
total_docs = 1 + len(CORPUS)
idf = 1.0 + np.log(float(total_docs) / df)

In [57]:
# Show inverse document frequencies.
display_features([np.round(idf, 2)], feature_names)

    and  beautiful  blue  cheese    is  love   sky   the
0  1.92       1.51   1.0    1.92  1.22  1.92  1.22  1.51


In [58]:
# Compute idf diagonal matrix.
total_features = bow_features.shape[1]
idf_diag = sp.spdiags(idf, diags=0, m=total_features, n=total_features)
idf = idf_diag.todense()

In [59]:
np.round(idf, 2)

array([[1.92, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 1.51, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 1.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 1.92, 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 1.22, 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 1.92, 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 1.22, 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 1.51]])

In [60]:
tfidf = tf * idf
display_features(np.round(tfidf, 2), feature_names)

    and  beautiful  blue  cheese    is  love   sky   the
0  0.00       0.00   1.0    0.00  1.22  0.00  1.22  1.51
1  1.92       1.51   1.0    0.00  2.45  0.00  2.45  0.00
2  0.00       1.51   1.0    0.00  1.22  0.00  1.22  1.51
3  0.00       0.00   1.0    1.92  0.00  1.92  0.00  0.00


In [61]:
# Compute L2 norms.
norms = norm(tfidf, axis=1)

In [62]:
# Print norms for each document.
np.round(norms, 2)

array([2.5 , 4.35, 2.93, 2.89])

In [63]:
# Compute normalized tfidf.
norm_tfidf = tfidf / norms[:, None]
norm_tfidf

matrix([[0.        , 0.        , 0.39921021, 0.        , 0.48829139,
         0.        , 0.48829139, 0.60313701],
        [0.44051607, 0.34730793, 0.22987956, 0.        , 0.5623514 ,
         0.        , 0.5623514 , 0.        ],
        [0.        , 0.51646957, 0.34184591, 0.        , 0.41812662,
         0.        , 0.41812662, 0.51646957],
        [0.        , 0.        , 0.34618161, 0.66338461, 0.        ,
         0.66338461, 0.        , 0.        ]])

In [64]:
# Show final tfidf feature matrix.
display_features(np.round(norm_tfidf, 2), feature_names)

    and  beautiful  blue  cheese    is  love   sky   the
0  0.00       0.00  0.40    0.00  0.49  0.00  0.49  0.60
1  0.44       0.35  0.23    0.00  0.56  0.00  0.56  0.00
2  0.00       0.52  0.34    0.00  0.42  0.00  0.42  0.52
3  0.00       0.00  0.35    0.66  0.00  0.66  0.00  0.00


In [65]:
# Compute new doc terms freqs from bow freqs.
nd_tf = new_doc_features
nd_tf = np.array(nd_tf, dtype='float64')

# Compute tfidf using idf matrix from train corpus.
nd_tfidf = nd_tf * idf
nd_norms = norm(nd_tfidf, axis=1)
norm_nd_tfidf = nd_tfidf / nd_norms[:, None]

In [66]:
# Show new_doc tfidf feature vector.
display_features(np.round(norm_nd_tfidf, 2), feature_names)

   and  beautiful  blue  cheese   is  love   sky  the
0  0.0        0.0  0.63     0.0  0.0   0.0  0.77  0.0
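
The new-document result can be double-checked without NumPy. The sketch below is a plain-Python restatement of the tf x idf product followed by L2 normalization; the helper name `l2_normalize` is an assumption, and the smoothed document frequencies are copied from the output shown earlier.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; leave all-zero vectors unchanged."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else list(vector)

# Term frequencies of new_doc over the vocabulary; only 'blue' and 'sky' are known terms.
nd_tf = [0, 0, 1, 0, 0, 0, 1, 0]
# Smoothed idf per term: 1 + ln(5 / (1 + df)), using the (1 + df) values 2, 3, 5, 2, 4, 2, 4, 3.
idf = [1 + math.log(5 / d) for d in (2, 3, 5, 2, 4, 2, 4, 3)]

nd_tfidf = l2_normalize([tf * weight for tf, weight in zip(nd_tf, idf)])
print([round(x, 2) for x in nd_tfidf])  # [0.0, 0.0, 0.63, 0.0, 0.0, 0.0, 0.77, 0.0]
```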


# Implementing Tfidf Vectorizer

In [67]:
# Build tfidf vectorizer and get training corpus feature vectors.
tfidf_vectorizer, tdidf_features = tfidf_extractor(CORPUS)
display_features(np.round(tdidf_features.todense(), 2), feature_names)

    and  beautiful  blue  cheese    is  love   sky   the
0  0.00       0.00  0.40    0.00  0.49  0.00  0.49  0.60
1  0.44       0.35  0.23    0.00  0.56  0.00  0.56  0.00
2  0.00       0.52  0.34    0.00  0.42  0.00  0.42  0.52
3  0.00       0.00  0.35    0.66  0.00  0.66  0.00  0.00


In [68]:
# Get tfidf feature vector for the new document.
nd_tfidf = tfidf_vectorizer.transform(new_doc)
display_features(np.round(nd_tfidf.todense(), 2), feature_names)

   and  beautiful  blue  cheese   is  love   sky  the
0  0.0        0.0  0.63     0.0  0.0   0.0  0.77  0.0


## Advanced Word Vectorization Models

In [69]:
import gensim
import nltk

In [70]:
CORPUS = [
    'the sky is blue',
    'sky is blue and sky is beautiful',
    'the beautiful sky is blue',
    'i love blue cheese'
]
new_doc = ['loving this blue sky today']

In [71]:
# Tokenize corpora.
TOKENIZED_CORPUS = [nltk.word_tokenize(sentence)
                    for sentence in CORPUS]
tokenized_new_doc = [nltk.word_tokenize(sentence)
                     for sentence in new_doc]

In [72]:
# Build the word2vec model on our training corpus.

# size: the dimensionality of the word vectors (renamed vector_size in gensim >= 4.0).
# window: the context window size, i.e. how many neighboring words the algorithm considers as context during training.
# min_count: the minimum number of occurrences across the corpus for a word to be included in the vocabulary.
# sample: threshold used to downsample very frequent words.
model = gensim.models.Word2Vec(TOKENIZED_CORPUS, size=10, window=10,
                               min_count=2, sample=1e-3)

In [73]:
model.wv['sky']

array([ 0.04561428, -0.02996336, -0.04556778, -0.00351243,  0.04590314,
        0.01923104, -0.0222545 ,  0.01540163, -0.04887827, -0.03940738],
      dtype=float32)

In [74]:
model.wv['blue']

array([-0.00696618, -0.03747655,  0.00500463, -0.02279691,  0.0324855 ,
       -0.04436615,  0.03656287,  0.03615776,  0.00809231,  0.01606599],
      dtype=float32)

## Averaged Word Vectors

Problem:
- each word vector has length 10, based on the size parameter specified earlier
- but documents contain different numbers of words
- some combining and aggregation operations are therefore required so that the final feature vector has the same number of dimensions regardless of the length of the text document


Solution:
- use an averaged word vectorization scheme: for each text document, extract all its tokens and, for each token, look up the corresponding word vector if the token is present in the vocabulary.
- sum up these word vectors and divide the result by the number of words matched in the vocabulary to get the final averaged word vector representation of the text document.

Pseudo-code:
```
model := the word2vec model we built
vocabulary := unique_words(model)
document := [words]
matched_word_count := 0
vector := zeros(vector_size)

for word in words:
    if word in vocabulary:
        vector := vector + model[word]
        matched_word_count := matched_word_count + 1

averaged_word_vector := vector / matched_word_count
```
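
The pseudo-code translates fairly directly into runnable Python. This sketch is not the `averaged_word_vectorizer` helper from the notebook's module: it assumes the word vectors are available as a plain dict, and the two-dimensional toy vectors are invented for the example.

```python
def average_word_vector(tokens, word_vectors, num_features):
    """Average the vectors of all tokens found in the vocabulary."""
    total = [0.0] * num_features
    matched = 0
    for token in tokens:
        if token in word_vectors:
            total = [a + b for a, b in zip(total, word_vectors[token])]
            matched += 1
    if matched == 0:
        return total  # no known words: fall back to the zero vector
    return [x / matched for x in total]

# Hypothetical 2-dimensional vectors standing in for a trained model.
toy_vectors = {'sky': [1.0, 0.0], 'blue': [0.0, 1.0]}
tokens = ['loving', 'this', 'blue', 'sky', 'today']
print(average_word_vector(tokens, toy_vectors, 2))  # [0.5, 0.5]
```

The guard for `matched == 0` avoids a division by zero when a document contains no in-vocabulary words, a case the pseudo-code leaves open.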

In [75]:
# Get averaged word vectors for our training CORPUS.
avg_word_vec_features = averaged_word_vectorizer(corpus=TOKENIZED_CORPUS,
                                                 model=model,
                                                 num_features=10)
np.round(avg_word_vec_features, 3)

array([[ 0.006, -0.014, -0.001, -0.008,  0.017, -0.008, -0.013,  0.027,
        -0.015, -0.015],
       [ 0.01 , -0.025, -0.013, -0.005,  0.014, -0.003, -0.007,  0.019,
        -0.014, -0.021],
       [ 0.005, -0.011,  0.005, -0.008,  0.006,  0.   , -0.003,  0.024,
        -0.017, -0.01 ],
       [-0.007, -0.037,  0.005, -0.023,  0.032, -0.044,  0.037,  0.036,
         0.008,  0.016]])

In [76]:
# Get averaged word vectors for our test new_doc.
nd_avg_word_vec_features = averaged_word_vectorizer(corpus=tokenized_new_doc, 
                                                    model=model,
                                                    num_features=10)
np.round(nd_avg_word_vec_features, 3)

array([[ 0.019, -0.034, -0.02 , -0.013,  0.039, -0.013,  0.007,  0.026,
        -0.02 , -0.012]])

## TF-IDF Weighted Average Word Vectors

```
model := the word2vec model we built
vocabulary := unique_words(model)
document := [words]
tfidfs := [tfidf(word) for each word in words]
matched_word_wts := 0
vector := zeros(vector_size)

for word in words:
    if word in vocabulary:
        word_vector := model[word]
        weighted_word_vector := tfidfs[word] x word_vector
        vector := vector + weighted_word_vector
        matched_word_wts := matched_word_wts + tfidfs[word]

tfidf_wtd_avgd_word_vector := vector / matched_word_wts
```
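
A runnable counterpart to the pseudo-code, under the same caveats as before: this is a sketch rather than the module's `tfidf_weighted_averaged_word_vectorizer`, it takes plain dicts for the word vectors and the per-word tf-idf weights, and the toy values are invented.

```python
def tfidf_weighted_average(tokens, word_vectors, tfidf_weights, num_features):
    """Average word vectors, weighting each word by its tf-idf score."""
    total = [0.0] * num_features
    weight_sum = 0.0
    for token in tokens:
        if token in word_vectors and token in tfidf_weights:
            weight = tfidf_weights[token]
            total = [a + weight * b for a, b in zip(total, word_vectors[token])]
            weight_sum += weight
    if weight_sum == 0:
        return total  # nothing matched: fall back to the zero vector
    return [x / weight_sum for x in total]

# Hypothetical vectors and weights for illustration only.
toy_vectors = {'sky': [1.0, 0.0], 'blue': [0.0, 1.0]}
toy_weights = {'sky': 0.75, 'blue': 0.25}
result = tfidf_weighted_average(['blue', 'sky'], toy_vectors, toy_weights, 2)
print(result)  # [0.75, 0.25]
```

Because `sky` carries three times the weight of `blue`, the result is pulled toward the `sky` vector instead of sitting halfway between the two.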

In [77]:
# Get tfidf weights and vocabulary from earlier results and compute result.
corpus_tfidf = tdidf_features
vocab = tfidf_vectorizer.vocabulary_
wt_tfidf_word_vec_features = tfidf_weighted_averaged_word_vectorizer(corpus=TOKENIZED_CORPUS, 
                                                                     tfidf_vectors=corpus_tfidf,
                                                                     tfidf_vocabulary=vocab,
                                                                     model=model,
                                                                     num_features=10)
np.round(wt_tfidf_word_vec_features, 3)

array([[ 0.007, -0.01 ,  0.002, -0.008,  0.015, -0.005, -0.016,  0.027,
        -0.017, -0.016],
       [ 0.012, -0.025, -0.018, -0.002,  0.016, -0.   , -0.016,  0.017,
        -0.015, -0.028],
       [ 0.005, -0.008,  0.007, -0.008,  0.003,  0.004, -0.003,  0.024,
        -0.019, -0.01 ],
       [-0.007, -0.037,  0.005, -0.023,  0.032, -0.044,  0.037,  0.036,
         0.008,  0.016]])

In [78]:
# Compute tfidf-weighted averaged word vector for test new_doc.
nd_wt_tfidf_word_vec_features = tfidf_weighted_averaged_word_vectorizer(corpus=tokenized_new_doc,
                                                                        tfidf_vectors=nd_tfidf,
                                                                        tfidf_vocabulary=vocab,
                                                                        model=model,
                                                                        num_features=10)

In [79]:
np.round(nd_wt_tfidf_word_vec_features, 3)

array([[ 0.022, -0.033, -0.023, -0.012,  0.04 , -0.009,  0.004,  0.025,
        -0.023, -0.014]])