# Text classification/categorization

    What is text classification?

Text classification is the process of assigning text documents to one or more classes or categories, assuming that we have a predefined set of classes.

Documents here are textual documents, and each document can contain a sentence or even a paragraph of words. 

## Two types of text classification

    What types of text classifications are available?

- content-based classification
- request-based classification

__Content-based classification__ is the type of text classification where priorities or weights are given to specific subjects or topics in the text content to help determine the class of the document.

E.g., a book with more than 30 percent of its content about food preparations can be classified under cooking/recipes. 
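
As a rough illustration of that 30 percent rule of thumb, the sketch below measures what share of a document's tokens fall inside a topic vocabulary. The `COOKING_TERMS` set, the `topic_share` helper, and the plain whitespace tokenization are assumptions made for this example only.

```python
# Hypothetical topic vocabulary; a real system would use a much larger list.
COOKING_TERMS = {"bake", "boil", "recipe", "oven", "flour", "simmer"}

def topic_share(text, vocabulary):
    """Return the fraction of whitespace-separated tokens found in vocabulary."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in vocabulary)
    return hits / len(tokens)

doc = "preheat the oven then bake the flour mixture"
share = topic_share(doc, COOKING_TERMS)

# Apply the 30 percent threshold from the example above.
label = "cooking/recipes" if share > 0.30 else "other"
print(round(share, 2), label)
```

Here 3 of the 8 tokens are cooking terms, so the document clears the threshold.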

__Request-based classification__ is influenced by user requests and targeted towards specific user groups and audiences. This type of classification is governed by specific policies and ideals.

## Text classification blueprint

1. prepare test, train and validation (optional) datasets
2. text normalization
3. feature extraction
4. model training
5. model prediction and evaluation
6. model deployment
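
Step 1 of the blueprint can be sketched with the standard library alone. The `split_dataset` helper, the toy documents, and the `weather`/`food` labels below are hypothetical; in practice a library routine for splitting would likely be used instead.

```python
import random

def split_dataset(documents, labels, test_fraction=0.25, seed=42):
    """Shuffle (document, label) pairs and split them into train and test sets."""
    pairs = list(zip(documents, labels))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(pairs)
    n_test = max(1, int(len(pairs) * test_fraction))
    return pairs[n_test:], pairs[:n_test]

# Toy labeled data, invented for this sketch.
docs = ["the sky is blue", "i love blue cheese", "sky is beautiful", "cheese is tasty"]
labels = ["weather", "food", "weather", "food"]

train, test = split_dataset(docs, labels)
print(len(train), len(test))
```

Every later step (normalization, feature extraction, training) is then fitted on `train` only and evaluated on `test`.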

## Text normalization

- expanding contractions
- text standardization through lemmatization
- removing special characters and symbols
- removing stopwords

Others:
- correcting spelling

In [84]:
# In order to use modules, create a directory module and a __init__.py file there.
# Note that a .py file cannot be in the same folder as the .ipynb, else it will throw an exception.
from module.contractions import expand_contractions 
from module.tokenize import tokenize_text
from module.lemmatize import lemmatize_text, pos_tag_text
# from module.feature_extractor import bow_extractor

In [85]:
expand_contractions("this isn't good")

'this is not good'
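
The `module.contractions` source is not shown here, so the following is only a minimal sketch of how such a function might work. It assumes a small lookup table and a single regular expression; `CONTRACTION_MAP` and `expand_contractions_sketch` are hypothetical names, and the real module may differ.

```python
import re

# Hypothetical, deliberately tiny contraction map.
CONTRACTION_MAP = {
    "isn't": "is not",
    "can't": "cannot",
    "i'm": "i am",
    "it's": "it is",
}

# One alternation that matches any known contraction as a whole word.
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions_sketch(text):
    """Replace each known contraction with its (lowercase) expanded form."""
    def replace(match):
        return CONTRACTION_MAP[match.group(0).lower()]
    return _pattern.sub(replace, text)

print(expand_contractions_sketch("this isn't good"))
```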

In [86]:
# Define function to tokenize text into tokens that will be used by our other normalization functions.
tokenize_text('hello world')

['hello', 'world']

In [87]:
import re

# Match any hello.
pattern = re.compile('hello')

# Define a substitution function that allows us access to the matched word.
def subfn(m):
    match = m.group(0)
    return f'[{match}]'
    
pattern.sub(subfn, 'hello world')

'[hello] world'

In [88]:
lemmatize_text('where are you playing football')

'where be you play football'

In [89]:
import string
import re
from nltk.corpus import stopwords

stopword_list = stopwords.words('english')

def remove_special_characters(text):
    tokens = tokenize_text(text)
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_tokens = filter(None, [pattern.sub('', token) 
                                    for token in tokens])
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

def remove_stopwords(text):
    tokens = tokenize_text(text)
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

def normalize_corpus(corpus, tokenize=False):
    normalized_corpus = []
    for text in corpus:
        text = expand_contractions(text)
        text = lemmatize_text(text)
        text = remove_special_characters(text)
        text = remove_stopwords(text)
        # Append each document once: as a token list if requested, else as a string.
        if tokenize:
            text = tokenize_text(text)
        normalized_corpus.append(text)
    return normalized_corpus

In [90]:
CORPUS = [
    'the sky is blue',
    'sky is blue and sky is beautiful',
    'the beautiful sky is blue',
    'i love blue cheese'
]
new_doc = ['loving this blue sky today']

In [91]:
normalize_corpus(CORPUS, True)

[['sky', 'blue'],
 ['sky', 'blue', 'sky', 'beautiful'],
 ['beautiful', 'sky', 'blue'],
 ['love', 'blue', 'cheese']]

## Feature Extraction


### What is feature extraction/engineering?
    
- The process of extracting and selecting features

### What is a feature?

- Features are unique, measurable attributes or properties of each observation or data point in a dataset.
- Features are usually numeric. They can be absolute numeric values, or categorical values encoded as one binary feature per category through a process called __one-hot encoding__.
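
A minimal sketch of one-hot encoding in plain Python is shown below. The `one_hot_encode` helper and the category values are made up for illustration.

```python
def one_hot_encode(values):
    """Map each categorical value to a binary vector with a single 1."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # Mark the column that belongs to this category.
        vectors.append(row)
    return categories, vectors

categories, vectors = one_hot_encode(["sports", "food", "sports", "politics"])
print(categories)
for row in vectors:
    print(row)
```

Each observation ends up with one binary column per category, and exactly one of them is set.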

### What are examples of feature extraction techniques?

- bag of words model
- tf-idf model
- advanced word vectorization model

# Model 1: Bag of Words

Disadvantages:
- vectors are based entirely on the absolute frequencies of word occurrences
- words that occur often across all documents in the corpus get higher frequencies and tend to overshadow rarer words that may be more interesting and effective as features for identifying specific categories
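
To make the idea concrete, here is a small bag-of-words sketch built on `collections.Counter`. The regular expression is meant to mirror `CountVectorizer`'s default behaviour (lowercasing, tokens of at least two word characters); the `tokenize` helper and the variable names are otherwise assumptions of this example.

```python
import re
from collections import Counter

corpus = [
    "the sky is blue",
    "sky is blue and sky is beautiful",
    "the beautiful sky is blue",
    "i love blue cheese",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# One Counter of raw term frequencies per document.
counts = [Counter(tokenize(doc)) for doc in corpus]

# The vocabulary is the sorted set of all terms seen in the corpus.
vocabulary = sorted(set().union(*counts))

# Rows are documents, columns follow the vocabulary order.
matrix = [[c[term] for term in vocabulary] for c in counts]

# Corpus-wide totals show which terms dominate the raw counts.
totals = {term: sum(row[i] for row in matrix) for i, term in enumerate(vocabulary)}
print(vocabulary)
print(matrix)
print(totals)
```

The uninformative word `is` ends up with the same corpus-wide total as `blue` and `sky`, which is the overshadowing effect described above.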

In [92]:
from sklearn.feature_extraction.text import CountVectorizer

def bow_extractor(corpus, ngram_range=(1, 1)):
    vectorizer = CountVectorizer(min_df=1, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

In [93]:
# Build bow vectorizer and get features.
bow_vectorizer, bow_features = bow_extractor(CORPUS)
features = bow_features.todense()
features

matrix([[0, 0, 1, 0, 1, 0, 1, 1],
        [1, 1, 1, 0, 2, 0, 2, 0],
        [0, 1, 1, 0, 1, 0, 1, 1],
        [0, 0, 1, 1, 0, 1, 0, 0]])

In [94]:
# Extract features from new document using built vectorizer.
new_doc_features = bow_vectorizer.transform(new_doc)
new_doc_features = new_doc_features.todense()
new_doc_features

matrix([[0, 0, 1, 0, 0, 0, 1, 0]])

In [95]:
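# Print the feature names.
# Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out().
feature_names = bow_vectorizer.get_feature_names()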
feature_names

['and', 'beautiful', 'blue', 'cheese', 'is', 'love', 'sky', 'the']

In [96]:
import pandas as pd

def display_features(features, feature_names):
    df = pd.DataFrame(data=features,
                      columns=feature_names)
    print(df)

In [97]:
display_features(features, feature_names)

   and  beautiful  blue  cheese  is  love  sky  the
0    0          0     1       0   1     0    1    1
1    1          1     1       0   2     0    2    0
2    0          1     1       0   1     0    1    1
3    0          0     1       1   0     1    0    0


In [98]:
display_features(new_doc_features, feature_names)

   and  beautiful  blue  cheese  is  love  sky  the
0    0          0     1       0   0     0    1    0


# Model 2: TF-IDF Model

- product of two metrics, term frequency (tf) and inverse document frequency (idf)
- term frequency is the raw frequency value of that term in a particular document
- $tf(w, D) = f_{w,D}$, where $f_{w,D}$ denotes the frequency of word $w$ in document $D$
- inverse document frequency is the inverse of the document frequency for each term.
- idf is computed by dividing the total number of documents in our corpus by the document frequency for each term and then applying logarithmic scaling on the result

We add 1 to the document frequency for each term, as if the corpus had one extra document containing every term in the vocabulary; the total document count therefore grows by 1 as well. This prevents potential division-by-zero errors and smooths the inverse document frequencies. We also add 1 to the result so that terms whose idf would otherwise be zero are not ignored completely:

$idf(t) = 1 + \log\frac{1 + C}{1 + df(t)}$

Where:
- $C$ is the total number of documents in our corpus
- $idf(t)$ is the idf for term $t$
- $df(t)$ is the number of documents in which term $t$ is present
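
As a quick worked check, the sketch below evaluates the smoothed idf for the four-document toy corpus used in this section. It uses the variant that adds 1 to both the document count and the document frequency, which is what scikit-learn computes with `smooth_idf=True`. The `smoothed_idf` helper is a hypothetical name, and the document frequencies are counted by hand from the corpus.

```python
import math

def smoothed_idf(total_docs, doc_freq):
    """Compute 1 + ln((1 + C) / (1 + df(t)))."""
    return 1.0 + math.log((1 + total_docs) / (1 + doc_freq))

# Number of documents (out of 4) that contain each term.
doc_freqs = {
    "and": 1, "beautiful": 2, "blue": 4, "cheese": 1,
    "is": 3, "love": 1, "sky": 3, "the": 2,
}

idf = {term: round(smoothed_idf(4, df), 2) for term, df in doc_freqs.items()}
print(idf)
```

A term that appears in every document, such as `blue`, gets the minimum idf of 1.0, while terms confined to a single document get the largest weight.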

In [99]:
from sklearn.feature_extraction.text import TfidfTransformer

def tfidf_transformer(bow_matrix):
    transformer = TfidfTransformer(norm='l2',
                                   smooth_idf=True,
                                   use_idf=True)
    tfidf_matrix = transformer.fit_transform(bow_matrix)
    return transformer, tfidf_matrix

In [100]:
import numpy as np

# Build tfidf transformer and show train corpus tfidf features.
tfidf_trans, tfidf_features = tfidf_transformer(bow_features)
features = np.round(tfidf_features.todense(), 2)
display_features(features, feature_names)

    and  beautiful  blue  cheese    is  love   sky   the
0  0.00       0.00  0.40    0.00  0.49  0.00  0.49  0.60
1  0.44       0.35  0.23    0.00  0.56  0.00  0.56  0.00
2  0.00       0.52  0.34    0.00  0.42  0.00  0.42  0.52
3  0.00       0.00  0.35    0.66  0.00  0.66  0.00  0.00


In [101]:
# Show tfidf features for new_doc using built tfidf transformer.
nd_tfidf = tfidf_trans.transform(new_doc_features)
nd_features = np.round(nd_tfidf.todense(), 2)
display_features(nd_features, feature_names)

   and  beautiful  blue  cheese   is  love   sky  the
0  0.0        0.0  0.63     0.0  0.0   0.0  0.77  0.0


## Implementing TF-IDF from scratch

In [106]:
import scipy.sparse as sp
from numpy.linalg import norm

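# Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out().
feature_names = bow_vectorizer.get_feature_names()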

# Compute term frequency.
tf = bow_features.todense()
tf = np.array(tf, dtype='float64')

In [111]:
# Show term frequency.
display_features(tf, feature_names)

   and  beautiful  blue  cheese   is  love  sky  the
0  0.0        0.0   1.0     0.0  1.0   0.0  1.0  1.0
1  1.0        1.0   1.0     0.0  2.0   0.0  2.0  0.0
2  0.0        1.0   1.0     0.0  1.0   0.0  1.0  1.0
3  0.0        0.0   1.0     1.0  0.0   1.0  0.0  0.0


In [114]:
# Build the document frequency matrix.
df = np.diff(sp.csc_matrix(bow_features, copy=True).indptr)
df = 1 + df # To smoothen the idf later.

In [116]:
# Number of documents that contain each term, plus 1.
display_features([df], feature_names)

   and  beautiful  blue  cheese  is  love  sky  the
0    2          3     5       2   4     2    4    3


In [117]:
# Compute inverse document frequencies.
total_docs = 1 + len(CORPUS)
idf = 1.0 + np.log(float(total_docs) / df)

In [118]:
# Show inverse document frequencies.
display_features([np.round(idf, 2)], feature_names)

    and  beautiful  blue  cheese    is  love   sky   the
0  1.92       1.51   1.0    1.92  1.22  1.92  1.22  1.51


In [119]:
# Compute idf diagonal matrix.
total_features = bow_features.shape[1]
idf_diag = sp.spdiags(idf, diags=0, m=total_features, n=total_features)
idf = idf_diag.todense()

In [120]:
np.round(idf, 2)

array([[1.92, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 1.51, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 1.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 1.92, 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 1.22, 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 1.92, 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 1.22, 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 1.51]])

In [124]:
# idf is an np.matrix here, so * is a matrix product that scales each tf column by its idf.
tfidf = tf * idf
display_features(np.round(tfidf, 2), feature_names)

    and  beautiful  blue  cheese    is  love   sky   the
0  0.00       0.00   1.0    0.00  1.22  0.00  1.22  1.51
1  1.92       1.51   1.0    0.00  2.45  0.00  2.45  0.00
2  0.00       1.51   1.0    0.00  1.22  0.00  1.22  1.51
3  0.00       0.00   1.0    1.92  0.00  1.92  0.00  0.00


In [125]:
# Compute L2 norms.
norms = norm(tfidf, axis=1)

In [128]:
# Print norms for each document.
np.round(norms, 2)

array([2.5 , 4.35, 2.93, 2.89])

In [127]:
# Compute normalized tfidf.
norm_tfidf = tfidf / norms[:, None]
norm_tfidf

matrix([[0.        , 0.        , 0.39921021, 0.        , 0.48829139,
         0.        , 0.48829139, 0.60313701],
        [0.44051607, 0.34730793, 0.22987956, 0.        , 0.5623514 ,
         0.        , 0.5623514 , 0.        ],
        [0.        , 0.51646957, 0.34184591, 0.        , 0.41812662,
         0.        , 0.41812662, 0.51646957],
        [0.        , 0.        , 0.34618161, 0.66338461, 0.        ,
         0.66338461, 0.        , 0.        ]])
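
The normalization step can also be checked by hand for the first document, 'the sky is blue'. The sketch below uses only the standard library; `l2_normalize` is a hypothetical helper, and the weights are the tf-idf values for `blue`, `is`, `sky`, and `the`, each with a term frequency of 1.

```python
import math

def l2_normalize(vector):
    """Divide a vector by its Euclidean (L2) norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        return norm, list(vector)
    return norm, [v / norm for v in vector]

# Unnormalized tf-idf weights for blue, is, sky, the (tf = 1 for each term).
# The document frequencies 4, 3, 3, 2 come from the four-document corpus.
weights = [1.0 + math.log(5 / (1 + df)) for df in (4, 3, 3, 2)]

norm, unit = l2_normalize(weights)
print(round(norm, 2))
print([round(v, 2) for v in unit])
```

After normalization the squared entries of each row sum to 1, so documents of different lengths become comparable.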

In [131]:
# Show final tfidf feature matrix.
display_features(np.round(norm_tfidf, 2), feature_names)

    and  beautiful  blue  cheese    is  love   sky   the
0  0.00       0.00  0.40    0.00  0.49  0.00  0.49  0.60
1  0.44       0.35  0.23    0.00  0.56  0.00  0.56  0.00
2  0.00       0.52  0.34    0.00  0.42  0.00  0.42  0.52
3  0.00       0.00  0.35    0.66  0.00  0.66  0.00  0.00


In [132]:
# Compute new doc terms freqs from bow freqs.
nd_tf = new_doc_features
nd_tf = np.array(nd_tf, dtype='float64')

# Compute tfidf using idf matrix from train corpus.
nd_tfidf = nd_tf * idf
nd_norms = norm(nd_tfidf, axis=1)
norm_nd_tfidf = nd_tfidf / nd_norms[:, None]

In [133]:
# Show new_doc tfidf feature vector.
display_features(np.round(norm_nd_tfidf, 2), feature_names)

   and  beautiful  blue  cheese   is  love   sky  the
0  0.0        0.0  0.63     0.0  0.0   0.0  0.77  0.0


# Implementing Tfidf Vectorizer

In [134]:
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_extractor(corpus, ngram_range=(1, 1)):
    vectorizer = TfidfVectorizer(min_df=1,
                                 norm='l2',
                                 smooth_idf=True,
                                 use_idf=True,
                                 ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

In [135]:
# Build tfidf vectorizer and get training corpus feature vectors.
tfidf_vectorizer, tfidf_features = tfidf_extractor(CORPUS)
display_features(np.round(tfidf_features.todense(), 2), feature_names)

    and  beautiful  blue  cheese    is  love   sky   the
0  0.00       0.00  0.40    0.00  0.49  0.00  0.49  0.60
1  0.44       0.35  0.23    0.00  0.56  0.00  0.56  0.00
2  0.00       0.52  0.34    0.00  0.42  0.00  0.42  0.52
3  0.00       0.00  0.35    0.66  0.00  0.66  0.00  0.00


In [136]:
# Get tfidf feature vector for the new document.
nd_tfidf = tfidf_vectorizer.transform(new_doc)
display_features(np.round(nd_tfidf.todense(), 2), feature_names)

   and  beautiful  blue  cheese   is  love   sky  the
0  0.0        0.0  0.63     0.0  0.0   0.0  0.77  0.0
