### Text Classification
Two broad types of text classification are commonly distinguished:
- Content-based classification
    - Content-based classification assigns priorities or weights to specific subjects or topics in the text, and those weights determine the class of the document. For example, a book with more than 30 percent of its content about food preparation can be classified under cooking/recipes.
- Request-based classification
    - Request-based classification is driven by user requests and targets specific user groups and audiences. It is governed by specific policies and ideals.
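
The 30 percent rule of thumb from the content-based example can be sketched in a few lines. This is only an illustration: the keyword set `COOKING_TERMS`, the function name `classify_by_content`, and the whitespace tokenization are all assumptions made here, not part of the original material.

```python
# Hypothetical keyword list standing in for a real topic model.
COOKING_TERMS = {"recipe", "bake", "boil", "simmer", "oven", "ingredients", "saute"}

def classify_by_content(text, topic_terms=COOKING_TERMS, threshold=0.30):
    """Label a text 'cooking/recipes' if more than `threshold` of its tokens are topic terms."""
    tokens = [token.strip(".,;:!?") for token in text.lower().split()]
    if not tokens:
        return "other"
    share = sum(token in topic_terms for token in tokens) / len(tokens)
    return "cooking/recipes" if share > threshold else "other"

cooking_label = classify_by_content("Bake the ingredients in the oven, then simmer the sauce.")
other_label = classify_by_content("The sky is blue")
print(cooking_label, other_label)
```

Here 4 of the 10 tokens are cooking terms, so the first sentence crosses the 30 percent threshold while the second does not.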

### Text Classification Blueprint
1. Prepare train and test datasets
2. Text normalization
3. Feature extraction
4. Model training
5. Model prediction and evaluation
6. Model deployment

#### Text Normalization
- Expanding contractions
- Text standardization through lemmatization
- Removing special characters and symbols
- Removing stopwords

In [2]:
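# CONTRACTION_MAP is assumed to come from a local contractions.py (not shown here)
from contractions import CONTRACTION_MAP
import re
import nltk
import string
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn  # needed by pos_tag_text below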

stopword_list = nltk.corpus.stopwords.words('english')
wnl = WordNetLemmatizer()

def tokenize_text(text):
    tokens = nltk.word_tokenize(text) 
    tokens = [token.strip() for token in tokens]
    return tokens

def expand_contractions(text, contraction_mapping):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text
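
`CONTRACTION_MAP` is imported from a module that is not shown in this notebook. As a rough sketch of how the expansion works, assume it is a plain dict from lowercase contractions to their expanded forms. The four-entry `CONTRACTION_MAP_SAMPLE` and the function name `expand_contractions_sample` below are made up for illustration.

```python
import re

# Hypothetical stand-in for the CONTRACTION_MAP imported above.
CONTRACTION_MAP_SAMPLE = {
    "can't": "cannot",
    "isn't": "is not",
    "i'm": "i am",
    "it's": "it is",
}

def expand_contractions_sample(text, contraction_mapping):
    # One alternation over all known contractions, matched case-insensitively.
    pattern = re.compile(
        "({})".format("|".join(map(re.escape, contraction_mapping))),
        flags=re.IGNORECASE,
    )

    def expand_match(found):
        match = found.group(0)
        expanded = contraction_mapping.get(match) or contraction_mapping.get(match.lower())
        # Keep the original first character so capitalization survives.
        return match[0] + expanded[1:]

    return pattern.sub(expand_match, text)

expanded = expand_contractions_sample("I'm sure it isn't blue. Can't you see?",
                                      CONTRACTION_MAP_SAMPLE)
print(expanded)  # I am sure it is not blue. Cannot you see?
```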

In [3]:
# Annotate text tokens with POS tags
def pos_tag_text(text):
    
    def penn_to_wn_tags(pos_tag):
        if pos_tag.startswith('J'):
            return wn.ADJ
        elif pos_tag.startswith('V'):
            return wn.VERB
        elif pos_tag.startswith('N'):
            return wn.NOUN
        elif pos_tag.startswith('R'):
            return wn.ADV
        else:
            return None
    
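    # `tag` is not defined anywhere in this notebook; NLTK's Penn Treebank
    # tagger is used here as a stand-in
    tagged_text = nltk.pos_tag(tokenize_text(text))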
    tagged_lower_text = [(word.lower(), penn_to_wn_tags(pos_tag))
                         for word, pos_tag in
                         tagged_text]
    return tagged_lower_text

In [4]:
# lemmatize text based on POS tags    
def lemmatize_text(text):
    
    pos_tagged_text = pos_tag_text(text)
    lemmatized_tokens = [wnl.lemmatize(word, pos_tag) if pos_tag
                         else word                     
                         for word, pos_tag in pos_tagged_text]
    lemmatized_text = ' '.join(lemmatized_tokens)
    return lemmatized_text
    

def remove_special_characters(text):
    tokens = tokenize_text(text)
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_tokens = filter(None, [pattern.sub('', token) for token in tokens])
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text
    
    
def remove_stopwords(text):
    tokens = tokenize_text(text)
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text


In [5]:
def normalize_corpus(corpus, tokenize=False):
    
    normalized_corpus = []    
    for text in corpus:
        text = expand_contractions(text, CONTRACTION_MAP)
        text = lemmatize_text(text)
        text = remove_special_characters(text)
        text = remove_stopwords(text)
        # append each document exactly once, tokenized only when requested
        if tokenize:
            text = tokenize_text(text)
        normalized_corpus.append(text)
            
    return normalized_corpus

#### Feature Extraction
- Bag of Words model
- TF-IDF model
- Advanced word vectorization models

In [72]:
from sklearn.feature_extraction.text import CountVectorizer

def bow_extractor1(corpus, ngram_range=(1,1)):
    
    vectorizer = CountVectorizer(min_df=1, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

In [73]:
CORPUS = [
'the sky is blue',
'sky is blue and sky is beautiful',
'the beautiful sky is so blue',
'i love blue cheese'
]
new_doc = ['loving this blue sky today']

In [74]:
# build bow vectorizer and get features
bow_vectorizer, bow_features = bow_extractor1(CORPUS)
features = bow_features.todense()
print (features)

[[0 0 1 0 1 0 1 0 1]
 [1 1 1 0 2 0 2 0 0]
 [0 1 1 0 1 0 1 1 1]
 [0 0 1 1 0 1 0 0 0]]


In [75]:
# extract features from new document using built vectorizer
new_doc_features = bow_vectorizer.transform(new_doc)
new_doc_features = new_doc_features.todense()
print (new_doc_features)

[[0 0 1 0 0 0 1 0 0]]


In [76]:
# print the feature names
# get_feature_names() was removed in scikit-learn 1.2
feature_names = bow_vectorizer.get_feature_names_out().tolist()
print (feature_names)

['and', 'beautiful', 'blue', 'cheese', 'is', 'love', 'sky', 'so', 'the']
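
To make the counts above less of a black box, here is a minimal standard-library sketch of the Bag of Words computation. It assumes `CountVectorizer`'s default behaviour of lowercasing and keeping only tokens of two or more word characters, which is why `'i'` does not appear among the features. The helper names `tokenize` and `bow_vector` are made up here.

```python
import re
from collections import Counter

CORPUS = [
    'the sky is blue',
    'sky is blue and sky is beautiful',
    'the beautiful sky is so blue',
    'i love blue cheese',
]

def tokenize(doc):
    # Lowercase and keep tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

# The vocabulary is the sorted set of all tokens seen in the training corpus.
vocabulary = sorted({token for doc in CORPUS for token in tokenize(doc)})

def bow_vector(doc):
    # Count each vocabulary term; words outside the vocabulary are ignored.
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]

bow_matrix = [bow_vector(doc) for doc in CORPUS]
new_doc_vector = bow_vector('loving this blue sky today')
print(vocabulary)
print(bow_matrix)
print(new_doc_vector)
```

The new document only contributes counts for `blue` and `sky`, because `loving`, `this`, and `today` were never seen in the training corpus.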


In [77]:
import pandas as pd
def display_features(features, feature_names):
    df = pd.DataFrame(data=features,columns=feature_names)
    return df

In [78]:
display_features(features, feature_names)

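|   | and | beautiful | blue | cheese | is | love | sky | so | the |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 2 | 0 | 2 | 0 | 0 |
| 2 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| 3 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |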


In [80]:
display_features(new_doc_features, feature_names)

|   | and | beautiful | blue | cheese | is | love | sky | so | the |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |


In [60]:
def bow_extractor3(corpus, ngram_range=(1,3)):
    
    vectorizer = CountVectorizer(min_df=1, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

In [61]:
# build bow vectorizer and get features
bow_vectorizer, bow_features = bow_extractor3(CORPUS)
features = bow_features.todense()
print (features)

[[0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0 1 1]
 [1 1 1 1 0 0 1 1 1 0 0 2 1 1 1 0 0 0 0 0 2 2 1 1 0 0 0 0 0 0 0 0]
 [0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1 1 1 1 1 0 0]
 [0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0]]


In [62]:
# extract features from new document using built vectorizer
new_doc_features = bow_vectorizer.transform(new_doc)
new_doc_features = new_doc_features.todense()
print (new_doc_features)

[[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]]


In [63]:
# print the feature names
# get_feature_names() was removed in scikit-learn 1.2
feature_names = bow_vectorizer.get_feature_names_out().tolist()
print (feature_names)

['and', 'and sky', 'and sky is', 'beautiful', 'beautiful sky', 'beautiful sky is', 'blue', 'blue and', 'blue and sky', 'blue cheese', 'cheese', 'is', 'is beautiful', 'is blue', 'is blue and', 'is so', 'is so blue', 'love', 'love blue', 'love blue cheese', 'sky', 'sky is', 'sky is beautiful', 'sky is blue', 'sky is so', 'so', 'so blue', 'the', 'the beautiful', 'the beautiful sky', 'the sky', 'the sky is']
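
The 32 features above are every unigram, bigram, and trigram found in the corpus. A small standard-library sketch of that expansion follows. The helper name `word_ngrams` is made up here, and the tokenization again assumes `CountVectorizer`'s default of lowercased tokens with two or more word characters.

```python
import re

def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams

CORPUS = [
    'the sky is blue',
    'sky is blue and sky is beautiful',
    'the beautiful sky is so blue',
    'i love blue cheese',
]

tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in CORPUS]
ngram_vocabulary = sorted({gram for tokens in tokenized for gram in word_ngrams(tokens)})

first_doc_ngrams = word_ngrams(tokenized[0])
print(first_doc_ngrams)
print(len(ngram_vocabulary))  # 32
```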


In [65]:
display_features(features, feature_names)

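|   | and | and sky | and sky is | beautiful | beautiful sky | beautiful sky is | blue | blue and | blue and sky | blue cheese | ... | sky is beautiful | sky is blue | sky is so | so | so blue | the | the beautiful | the beautiful sky | the sky | the sky is |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |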


In [66]:
display_features(new_doc_features, feature_names)

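|   | and | and sky | and sky is | beautiful | beautiful sky | beautiful sky is | blue | blue and | blue and sky | blue cheese | ... | sky is beautiful | sky is blue | sky is so | so | so blue | the | the beautiful | the beautiful sky | the sky | the sky is |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |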


### TF-IDF Model

In [68]:
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_extractor(corpus, ngram_range=(1,1)):
    
    vectorizer = TfidfVectorizer(min_df=1, 
                                 norm='l2',
                                 smooth_idf=True,
                                 use_idf=True,
                                 ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features
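
The next cell imports `tfidf_transformer` from a `feature_extractors` module that is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming it simply wraps scikit-learn's `TfidfTransformer` with the same settings used in `tfidf_extractor` above. The real helper may differ, and `sample_counts` is just the unigram count matrix printed earlier.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

def tfidf_transformer(bow_matrix):
    # Assumed stand-in for the helper in the unseen feature_extractors module.
    transformer = TfidfTransformer(norm='l2', smooth_idf=True, use_idf=True)
    tfidf_matrix = transformer.fit_transform(bow_matrix)
    return transformer, tfidf_matrix

# Unigram Bag of Words counts for the four-document corpus.
sample_counts = np.array([
    [0, 0, 1, 0, 1, 0, 1, 0, 1],
    [1, 1, 1, 0, 2, 0, 2, 0, 0],
    [0, 1, 1, 0, 1, 0, 1, 1, 1],
    [0, 0, 1, 1, 0, 1, 0, 0, 0],
])

sample_transformer, sample_tfidf = tfidf_transformer(sample_counts)
first_row = np.round(sample_tfidf.toarray()[0], 2).tolist()
print(first_row)
```

Unlike `tfidf_extractor`, which starts from raw text, this variant starts from an existing count matrix, so the same Bag of Words features can be reused.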

In [81]:
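import numpy as np
# tfidf_transformer lives in a local feature_extractors module (not shown here)
from feature_extractors import tfidf_transformer

# rebuild the unigram features, since the n-gram cells above overwrote them
bow_vectorizer, bow_features = bow_extractor1(CORPUS)
new_doc_features = bow_vectorizer.transform(new_doc).todense()
feature_names = bow_vectorizer.get_feature_names_out().tolist()
# build tfidf transformer and show train corpus tfidf features
tfidf_trans, tfidf_features = tfidf_transformer(bow_features)
features = np.round(tfidf_features.todense(), 2)
display_features(features, feature_names)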

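|   | and | beautiful | blue | cheese | is | love | sky | so | the |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.4 | 0.0 | 0.49 | 0.0 | 0.49 | 0.0 | 0.6 |
| 1 | 0.44 | 0.35 | 0.23 | 0.0 | 0.56 | 0.0 | 0.56 | 0.0 | 0.0 |
| 2 | 0.0 | 0.43 | 0.29 | 0.0 | 0.35 | 0.0 | 0.35 | 0.55 | 0.43 |
| 3 | 0.0 | 0.0 | 0.35 | 0.66 | 0.0 | 0.66 | 0.0 | 0.0 | 0.0 |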


In [82]:
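# show tfidf features for new_doc using built tfidf transformer
# np.asarray converts the np.matrix from todense(), which recent scikit-learn versions reject
nd_tfidf = tfidf_trans.transform(np.asarray(new_doc_features))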
nd_features = np.round(nd_tfidf.todense(), 2)
display_features(nd_features, feature_names)

|   | and | beautiful | blue | cheese | is | love | sky | so | the |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.0 | 0.0 | 0.77 | 0.0 | 0.0 |


In [85]:
import scipy.sparse as sp
from numpy.linalg import norm
# get_feature_names() was removed in scikit-learn 1.2
feature_names = bow_vectorizer.get_feature_names_out().tolist()

# compute term frequency
tf = bow_features.todense()
tf = np.array(tf, dtype='float64')

# show term frequencies
display_features(tf, feature_names)



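|   | and | beautiful | blue | cheese | is | love | sky | so | the |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 1 | 1.0 | 1.0 | 1.0 | 0.0 | 2.0 | 0.0 | 2.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 3 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |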


In [86]:
# build the document frequency matrix
df = np.diff(sp.csc_matrix(bow_features, copy=True).indptr)
df = 1 + df # to smoothen idf later

# show document frequencies
display_features([df], feature_names)


|   | and | beautiful | blue | cheese | is | love | sky | so | the |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 3 | 5 | 2 | 4 | 2 | 4 | 2 | 3 |


In [87]:
 # compute inverse document frequencies
total_docs = 1 + len(CORPUS)
idf = 1.0 + np.log(float(total_docs) / df)

# show inverse document frequencies
display_features([np.round(idf, 2)], feature_names)



|   | and | beautiful | blue | cheese | is | love | sky | so | the |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.92 | 1.51 | 1.0 | 1.92 | 1.22 | 1.92 | 1.22 | 1.92 | 1.51 |


In [88]:
# compute idf diagonal matrix  
total_features = bow_features.shape[1]
idf_diag = sp.spdiags(idf, diags=0, m=total_features, n=total_features)
idf = idf_diag.todense()

# print the idf diagonal matrix
print (np.round(idf, 2))

[[ 1.92  0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    1.51  0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    1.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    1.92  0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    1.22  0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    1.92  0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    1.22  0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    1.92  0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    1.51]]


In [90]:
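# compute tfidf feature matrix
# idf is now the diagonal np.matrix, so `*` is a matrix product
# that scales each term column by its idf weight
tfidf = tf * idf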
# show tfidf feature matrix
display_features(np.round(tfidf, 2), feature_names)

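|   | and | beautiful | blue | cheese | is | love | sky | so | the |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.22 | 0.0 | 1.22 | 0.0 | 1.51 |
| 1 | 1.92 | 1.51 | 1.0 | 0.0 | 2.45 | 0.0 | 2.45 | 0.0 | 0.0 |
| 2 | 0.0 | 1.51 | 1.0 | 0.0 | 1.22 | 0.0 | 1.22 | 1.92 | 1.51 |
| 3 | 0.0 | 0.0 | 1.0 | 1.92 | 0.0 | 1.92 | 0.0 | 0.0 | 0.0 |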


In [91]:
# compute the L2 norm of each document row
norms = norm(tfidf, axis=1)
# compute normalized tfidf
norm_tfidf = tfidf / norms[:, None]

In [92]:
# show final tfidf feature matrix
display_features(np.round(norm_tfidf, 2), feature_names)

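|   | and | beautiful | blue | cheese | is | love | sky | so | the |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.4 | 0.0 | 0.49 | 0.0 | 0.49 | 0.0 | 0.6 |
| 1 | 0.44 | 0.35 | 0.23 | 0.0 | 0.56 | 0.0 | 0.56 | 0.0 | 0.0 |
| 2 | 0.0 | 0.43 | 0.29 | 0.0 | 0.35 | 0.0 | 0.35 | 0.55 | 0.43 |
| 3 | 0.0 | 0.0 | 0.35 | 0.66 | 0.0 | 0.66 | 0.0 | 0.0 | 0.0 |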
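
The manual steps above can be summarized in formulas. The notation is introduced here for convenience and is not from the original cells: $N$ is the number of training documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathrm{tf}(d, t)$ the raw count of $t$ in document $d$. With the add-one smoothing used in the code:

```latex
\mathrm{idf}(t) = 1 + \ln\frac{1 + N}{1 + \mathrm{df}(t)}, \qquad
\mathrm{tfidf}(d, t) = \mathrm{tf}(d, t)\,\mathrm{idf}(t), \qquad
\widehat{\mathrm{tfidf}}(d, t) = \frac{\mathrm{tfidf}(d, t)}{\sqrt{\sum_{t'} \mathrm{tfidf}(d, t')^{2}}}

% Worked check for document 0 ('the sky is blue'), N = 4:
\mathrm{idf}(\text{blue}) = 1 + \ln\tfrac{5}{5} = 1, \quad
\mathrm{idf}(\text{is}) = \mathrm{idf}(\text{sky}) = 1 + \ln\tfrac{5}{4} \approx 1.22, \quad
\mathrm{idf}(\text{the}) = 1 + \ln\tfrac{5}{3} \approx 1.51

\sqrt{1^2 + 1.22^2 + 1.22^2 + 1.51^2} \approx 2.50
\;\Rightarrow\;
\tfrac{1}{2.50} \approx 0.40, \quad \tfrac{1.22}{2.50} \approx 0.49, \quad \tfrac{1.51}{2.50} \approx 0.60
```

These values agree with the first row of the normalized matrix: 0.4 for `blue`, 0.49 for `is` and `sky`, and 0.6 for `the`.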


In [93]:
# compute new doc term freqs from bow freqs
nd_tf = new_doc_features
nd_tf = np.array(nd_tf, dtype='float64')

In [94]:
# compute tfidf using idf matrix from train corpus
nd_tfidf = nd_tf*idf
nd_norms = norm(nd_tfidf, axis=1)
norm_nd_tfidf = nd_tfidf / nd_norms[:, None]

In [95]:
# show new_doc tfidf feature vector
display_features(np.round(norm_nd_tfidf, 2), feature_names)

|   | and | beautiful | blue | cheese | is | love | sky | so | the |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.0 | 0.0 | 0.77 | 0.0 | 0.0 |


In [96]:
# tfidf_extractor is the function defined earlier in this notebook,
# so no import from feature_extractors is needed
tfidf_vectorizer, tfidf_features = tfidf_extractor(CORPUS)
display_features(np.round(tfidf_features.todense(), 2), feature_names)

nd_tfidf = tfidf_vectorizer.transform(new_doc)
display_features(np.round(nd_tfidf.todense(), 2), feature_names)

|   | and | beautiful | blue | cheese | is | love | sky | so | the |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.0 | 0.0 | 0.77 | 0.0 | 0.0 |


### Advanced Word Vectorization Models

The word2vec
framework is much faster than other neural network–based implementations and does
not require manual labels to create meaningful representations among words.

In [98]:
import gensim
import nltk

TOKENIZED_CORPUS = [nltk.word_tokenize(sentence) 
                    for sentence in CORPUS]
tokenized_new_doc = [nltk.word_tokenize(sentence) 
                    for sentence in new_doc]                        

model = gensim.models.Word2Vec(TOKENIZED_CORPUS,
                               vector_size=10,  # called `size` in gensim < 4.0
                               window=10,
                               min_count=2,
                               sample=1e-3)



### Averaged Word Vectors
The preceding model creates a vector representation for each word in the vocabulary. We
can access them by just typing in the following code:

In [99]:
print (model.wv['sky'])  # word vectors live on model.wv; indexing the model directly is deprecated

[ 0.01346539 -0.00181841  0.00765929 -0.01275312 -0.01014564  0.02556595
  0.04384702  0.0131581   0.01751876  0.03605422]


In [100]:
print (model.wv['blue'])  # word vectors live on model.wv; indexing the model directly is deprecated

[ 0.0260599   0.01641689 -0.00504607  0.04584883 -0.02595955  0.0434286
  0.04731379 -0.02152569 -0.03001199 -0.0466424 ]


In [None]:
model := the word2vec model we built
vocabulary := unique_words(model)
document := [words]
matched_word_count := 0
vector := []
for word in words:
    if word in vocabulary:
        vector := vector + model[word]
        matched_word_count := matched_word_count + 1
averaged_word_vector := vector / matched_word_count
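
The pseudocode above translates fairly directly into Python. The sketch below is one possible rendering, not a definitive implementation: the function name `average_word_vectors`, its parameters, and the two-dimensional `toy_vectors` are assumptions made for illustration. Unlike the pseudocode, it guards against dividing by zero when no word matches.

```python
import numpy as np

def average_word_vectors(words, word_vectors, num_features):
    """Average the vectors of all words that have an entry in `word_vectors`."""
    feature_vector = np.zeros(num_features, dtype="float64")
    matched_word_count = 0
    for word in words:
        if word in word_vectors:
            feature_vector += word_vectors[word]
            matched_word_count += 1
    # Leave the zero vector untouched if no word was in the vocabulary.
    if matched_word_count:
        feature_vector /= matched_word_count
    return feature_vector

# Toy vectors, made up for illustration only.
toy_vectors = {"sky": np.array([1.0, 3.0]), "blue": np.array([3.0, 5.0])}
averaged = average_word_vectors(["loving", "this", "blue", "sky", "today"],
                                toy_vectors, num_features=2)
print(averaged)  # [2. 4.]
```

Any mapping that supports `in` and indexing by word works as `word_vectors`. With the gensim model built earlier, `model.wv` should be usable in place of the toy dict, with `num_features` set to the vector size of 10.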