# Text Classification

* Data Retrieval
* Data Preprocessing and Normalization
* Building Train and Test Datasets
* Feature Engineering Techniques
    1. Traditional
    2. Advanced
* Classification Models
    1. Multinomial Naive Bayes
    2. Logistic Regression
    3. Support Vector Machines
    4. Ensemble Models
    5. Random Forest
    6. Gradient Boosting Machines
* Evaluating Classification Models
    1. Confusion Matrix
* Building and Evaluating Our Text Classifier
    1. Bag of Words Features with Classification Models
    2. TF-IDF Features with Classification Models
    3. Comparative Model Performance Evaluation
    4. Word2Vec Embeddings with Classification Models
    5. GloVe Embeddings with Classification Models
    6. FastText Embeddings with Classification Models
    7. Model Tuning
    8. Model Performance Evaluation

## Data Retrieval

In [17]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import text_normalizer as tn
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

# fetch every post (train and test subsets), stripping headers, footers, and quoted replies
data = fetch_20newsgroups(subset='all', shuffle=True, remove=('headers', 'footers', 'quotes'))
data_labels_map = dict(enumerate(data.target_names))

In [18]:
# building the dataframe
corpus, target_labels, target_names = (data.data, data.target, [data_labels_map[label] for label in data.target])
data_df = pd.DataFrame({'Article': corpus, 'Target Label': target_labels, 'Target Name': target_names})
print(data_df.shape)
data_df.head(10)

(18846, 3)


| | Article | Target Label | Target Name |
|---|---|---|---|
| 0 | \n\nI am sure some bashers of Pens fans are pr... | 10 | rec.sport.hockey |
| 1 | My brother is in the market for a high-perform... | 3 | comp.sys.ibm.pc.hardware |
| 2 | \n\n\n\n\tFinally you said what you dream abou... | 17 | talk.politics.mideast |
| 3 | \nThink!\n\nIt's the SCSI card doing the DMA t... | 3 | comp.sys.ibm.pc.hardware |
| 4 | 1) I have an old Jasmine drive which I cann... | 4 | comp.sys.mac.hardware |
| 5 | \n\nBack in high school I worked as a lab assi... | 12 | sci.electronics |
| 6 | \n\nAE is in Dallas...try 214/241-6060 or 214/... | 4 | comp.sys.mac.hardware |
| 7 | \n[stuff deleted]\n\nOk, here's the solution t... | 10 | rec.sport.hockey |
| 8 | \n\n\nYeah, it's the second one. And I believ... | 10 | rec.sport.hockey |
| 9 | \nIf a Christian means someone who believes in... | 19 | talk.religion.misc |


## Data Preprocessing and Normalization

In [19]:
total_nulls = data_df[data_df.Article.str.strip() == ''].shape[0]
print("Empty documents:", total_nulls)

Empty documents: 515


In [20]:
# drop empty documents; copy so later column assignments do not act on a view
data_df = data_df[~(data_df.Article.str.strip() == '')].copy()
data_df.shape

(18331, 3)

In [21]:
import nltk
stopword_list = nltk.corpus.stopwords.words('english')

# keep negation words so they survive stopword removal (useful for bi-grams)
stopword_list.remove('no')
stopword_list.remove('not')

# normalize our corpus
norm_corpus = tn.normalize_corpus(corpus=data_df['Article'], html_stripping=True, contraction_expansion=True, 
                                  accented_char_removal=True, text_lower_case=True, text_lemmatization=True, 
                                  text_stemming=False, special_char_removal=True, remove_digits=True, 
                                  stopword_removal=True, stopwords=stopword_list)

data_df['Clean Article'] = norm_corpus

# view sample data
data_df = data_df[['Article', 'Clean Article', 'Target Label', 'Target Name']]
data_df.head(10)

| | Article | Clean Article | Target Label | Target Name |
|---|---|---|---|---|
| 0 | \n\nI am sure some bashers of Pens fans are pr... | sure basher pens fan pretty confused lack kind... | 10 | rec.sport.hockey |
| 1 | My brother is in the market for a high-perform... | brother market high performance video card sup... | 3 | comp.sys.ibm.pc.hardware |
| 2 | \n\n\n\n\tFinally you said what you dream abou... | finally say dream mediterranean new area great... | 17 | talk.politics.mideast |
| 3 | \nThink!\n\nIt's the SCSI card doing the DMA t... | think scsi card dma transfer not disk scsi car... | 3 | comp.sys.ibm.pc.hardware |
| 4 | 1) I have an old Jasmine drive which I cann... | old jasmine drive not use new system understan... | 4 | comp.sys.mac.hardware |
| 5 | \n\nBack in high school I worked as a lab assi... | back high school work lab assistant bunch expe... | 12 | sci.electronics |
| 6 | \n\nAE is in Dallas...try 214/241-6060 or 214/... | ae dallas try tech support may line one get start | 4 | comp.sys.mac.hardware |
| 7 | \n[stuff deleted]\n\nOk, here's the solution t... | stuff delete ok solution problem move canada y... | 10 | rec.sport.hockey |
| 8 | \n\n\nYeah, it's the second one. And I believ... | yeah second one believe price try get good loo... | 10 | rec.sport.hockey |
| 9 | \nIf a Christian means someone who believes in... | christian mean someone believe divinity jesus ... | 19 | talk.religion.misc |
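
The `text_normalizer` module is not shown in this section, so the exact behavior of `tn.normalize_corpus` is not visible here. As a rough, dependency-free sketch of a few of the steps enabled above (accent removal, lowercasing, special character and digit removal, stopword removal), something like the following could be used. The names `simple_normalize`, `simple_normalize_corpus`, and the tiny `STOPWORDS` set are hypothetical; HTML stripping, contraction expansion, and lemmatization are left out.

```python
import re
import unicodedata

# Tiny illustrative stopword set (an assumption for this sketch). The notebook
# uses NLTK's English list with 'no' and 'not' removed so negations are kept.
STOPWORDS = {"a", "an", "the", "is", "are", "i", "am", "of", "to", "it", "in", "for"}


def simple_normalize(doc, stopwords=STOPWORDS):
    """Apply a simplified subset of the normalization steps to one document."""
    # Accented character removal: decompose, then drop non-ASCII marks.
    doc = unicodedata.normalize("NFKD", doc).encode("ascii", "ignore").decode("ascii")
    # Lowercasing.
    doc = doc.lower()
    # Special character and digit removal: keep only letters and whitespace.
    doc = re.sub(r"[^a-z\s]", " ", doc)
    # Stopword removal; split() also collapses runs of whitespace and newlines.
    tokens = [tok for tok in doc.split() if tok not in stopwords]
    return " ".join(tokens)


def simple_normalize_corpus(corpus):
    """Normalize every document in an iterable of strings."""
    return [simple_normalize(doc) for doc in corpus]
```

A document made up only of digits, punctuation, or stopwords normalizes to an empty string under this sketch, which is one plausible reason some cleaned articles end up blank.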


In [22]:
# treat documents that are empty or whitespace-only after normalization as missing
data_df = data_df.replace(r'^\s*$', np.nan, regex=True)
data_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18331 entries, 0 to 18845
Data columns (total 4 columns):
Article          18331 non-null object
Clean Article    18300 non-null object
Target Label     18331 non-null int64
Target Name      18331 non-null object
dtypes: int64(1), object(3)
memory usage: 716.1+ KB


In [23]:
data_df = data_df.dropna().reset_index(drop=True)
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18300 entries, 0 to 18299
Data columns (total 4 columns):
Article          18300 non-null object
Clean Article    18300 non-null object
Target Label     18300 non-null int64
Target Name      18300 non-null object
dtypes: int64(1), object(3)
memory usage: 572.0+ KB
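
The drop from 18331 to 18300 rows comes from documents whose cleaned text is blank, presumably because nothing survived normalization. A minimal sketch of the same replace-then-drop pattern on a made-up frame (`toy_df` and its contents are purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two rows have blank or whitespace-only cleaned text.
toy_df = pd.DataFrame({
    'Article': ['Some text here', '12345 !!!', 'More text', '   '],
    'Clean Article': ['text', '', 'text', '  '],
})

# Turn empty or whitespace-only cells into NaN so pandas treats them as missing.
toy_df = toy_df.replace(r'^\s*$', np.nan, regex=True)
n_missing = int(toy_df['Clean Article'].isna().sum())

# Drop rows with any missing value and renumber the index from zero.
toy_df = toy_df.dropna().reset_index(drop=True)

print(n_missing, toy_df.shape)
```

Resetting the index matters later: positional and label-based indexing agree again once the gaps left by the dropped rows are removed.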


In [24]:
data_df.to_csv('clean_newsgroups.csv', index=False)

In [25]:
data_df = pd.read_csv('clean_newsgroups.csv')

## Building Train and Test Datasets

In [26]:
from sklearn.model_selection import train_test_split

# split articles, label ids, and label names together so they stay aligned
(train_corpus, test_corpus, train_label_nums, test_label_nums,
 train_label_names, test_label_names) = train_test_split(
    np.array(data_df['Clean Article']), np.array(data_df['Target Label']),
    np.array(data_df['Target Name']), test_size=0.33, random_state=42)

train_corpus.shape, test_corpus.shape

((12261,), (6039,))
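
The split above is purely random, so per-class proportions in the train and test sets can drift slightly. If matching proportions are wanted, `train_test_split` accepts a `stratify` argument. This is a variation and not what the notebook does; the toy arrays `docs` and `labels` below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 300 documents, two classes in a 2:1 ratio.
docs = np.array([f"doc {i}" for i in range(300)])
labels = np.array([0] * 200 + [1] * 100)

# stratify=labels asks for the same class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.33, random_state=42, stratify=labels)

# Fraction of class 1 in each split; both should be close to 1/3.
print(len(X_train), len(X_test), y_train.mean(), y_test.mean())
```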

In [27]:
from collections import Counter

trd = dict(Counter(train_label_names))
tsd = dict(Counter(test_label_names))

(pd.DataFrame([[key, trd[key], tsd[key]] for key in trd], 
             columns=['Target Label', 'Train Count', 'Test Count'])
.sort_values(by=['Train Count', 'Test Count'],
             ascending=False))

| | Target Label | Train Count | Test Count |
|---|---|---|---|
| 15 | sci.crypt | 667 | 295 |
| 0 | soc.religion.christian | 662 | 312 |
| 5 | rec.motorcycles | 660 | 309 |
| 10 | comp.sys.ibm.pc.hardware | 654 | 309 |
| 8 | comp.windows.x | 653 | 327 |
| 11 | rec.sport.hockey | 651 | 322 |
| 19 | sci.space | 649 | 304 |
| 7 | sci.med | 648 | 312 |
| 17 | rec.sport.baseball | 648 | 303 |
| 4 | sci.electronics | 647 | 309 |
