# Tutorial - Text Mining - Classification - SCIKIT-LEARN

We will predict the category of discussion posts in a newsgroup.

**The unit of analysis is a discussion post**

In [1]:
import pandas as pd
import numpy as np

In [2]:
news = pd.read_csv('news.csv')

In [4]:
news.head(5)

| Unnamed: 0 | TEXT | graphics | hockey | medical | newsgroup |
|---|---|---|---|---|---|
| 0 | I have a few reprints left of chapters from my... | 1 | 0 | 0 | graphics |
| 1 | gnuplot, etc. make it easy to plot real valued... | 1 | 0 | 0 | graphics |
| 2 | Article-I.D.: snoopy.1pqlhnINN8k1 References: ... | 1 | 0 | 0 | graphics |
| 3 | Hello, I am looking to add voice input capabil... | 1 | 0 | 0 | graphics |
| 4 | I recently got a file describing a library of ... | 1 | 0 | 0 | graphics |


## Change the target variable to ordinal

This is a multi-class classification problem. There are three categories we will predict:<br>
Whether a post is "graphics," "hockey," or "medical" related

In [5]:
#Convert the target to ordinal
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()

news['target'] = enc.fit_transform(news[['newsgroup']])
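
To check which number the encoder assigned to each newsgroup, one option is to look at `categories_`. The sketch below uses a small made-up frame (`toy`) instead of `news.csv`; the labels are sorted alphabetically, so a label's position in the array is its code.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-in for the 'newsgroup' column of news.csv
toy = pd.DataFrame({'newsgroup': ['hockey', 'graphics', 'medical', 'hockey']})

enc = OrdinalEncoder()
# fit_transform returns a 2-D array with one column; flatten it for assignment
toy['target'] = enc.fit_transform(toy[['newsgroup']]).ravel()

# categories_ holds one array per encoded column; the position of each
# label in that array is its numeric code
mapping = {label: code for code, label in enumerate(enc.categories_[0])}
print(mapping)
```

`OrdinalEncoder` returns floats, which is why the target shows up as `0.0`, `1.0`, `2.0`. For a target column, scikit-learn's `LabelEncoder` is an alternative that returns integers.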



In [6]:
news.head()

| Unnamed: 0 | TEXT | graphics | hockey | medical | newsgroup | target |
|---|---|---|---|---|---|---|
| 0 | I have a few reprints left of chapters from my... | 1 | 0 | 0 | graphics | 0.0 |
| 1 | gnuplot, etc. make it easy to plot real valued... | 1 | 0 | 0 | graphics | 0.0 |
| 2 | Article-I.D.: snoopy.1pqlhnINN8k1 References: ... | 1 | 0 | 0 | graphics | 0.0 |
| 3 | Hello, I am looking to add voice input capabil... | 1 | 0 | 0 | graphics | 0.0 |
| 4 | I recently got a file describing a library of ... | 1 | 0 | 0 | graphics | 0.0 |


In [9]:
news['target'].value_counts()

2.0    200
1.0    200
0.0    197
Name: target, dtype: int64

In [10]:
len(news)

597

In [7]:
target = news['target']

## Select the "text" (input) variable

In [11]:
# Check for missing values

news[['TEXT']].isna().sum()

TEXT    0
dtype: int64

In [12]:
input_data = news['TEXT']

## Split the data

In [13]:
from sklearn.model_selection import train_test_split

train_set, test_set, train_y, test_y = train_test_split(input_data, target, test_size=0.3, random_state=42)

In [14]:
train_set.shape, train_y.shape

((417,), (417,))

In [15]:
test_set.shape, test_y.shape

((180,), (180,))

## Sklearn: Text preparation

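We need to turn the raw text into numeric features. We'll use sklearn's TfidfVectorizer, which builds on CountVectorizer: it counts how often each word appears in each post, then reweights those counts by TF-IDF so that words common to most posts count for less.<br>
CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html<br>
TfidfVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html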

If you don't use a vectorizer such as CountVectorizer or TfidfVectorizer, you have to do all the text prep on your own:<br>
1- Convert to lowercase<br>
2- Remove numbers (if needed)<br>
3- Remove punctuation<br>
4- Remove whitespace<br>
5- Tokenize<br>
6- Stemming<br>
etc.

Note that CountVectorizer and TfidfVectorizer do not stem or lemmatize. You may want to use NLTK for that (`import nltk`).
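
For comparison, the manual steps listed above could be sketched with the standard library alone. This is only an illustration: `basic_clean` is a hypothetical helper, the sample sentence is made up, and stemming is left out (NLTK's `PorterStemmer` could be applied to the returned tokens).

```python
import re
import string

def basic_clean(text):
    """Lowercase, drop digits and punctuation, collapse whitespace, tokenize."""
    text = text.lower()                       # 1- convert to lowercase
    text = re.sub(r'\d+', ' ', text)          # 2- remove numbers
    # 3- replace every punctuation character with a space
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    text = text.translate(table)
    # 4/5- split() with no argument both collapses whitespace and tokenizes
    return text.split()

print(basic_clean("Hello,   I am looking to add 2 voice-input capabilities!"))
```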

In [16]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english')

train_x_tr = tfidf_vect.fit_transform(train_set)

**Notice in the previous step that we use `fit_transform` on TRAIN. When we transform the TEST data, we need to use `transform` only. This enables us to keep the number of columns (features) the same across the data sets. Otherwise, they WILL be different, and no model will work!**

In [17]:
# Perform the TfidfVectorizer transformation
# Be careful: we use the vectorizer fitted on the train set to transform the test set.
# Otherwise, the test features would not match the train features!

test_x_tr = tfidf_vect.transform(test_set)
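
A small sketch of why `transform` matters, using made-up documents (`train_docs`, `test_docs`) rather than the newsgroup posts. Transforming with the fitted vectorizer keeps the train columns; refitting on the test documents learns a different vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, used only for illustration
train_docs = ["the goalie made a great save",
              "the doctor prescribed a new drug"]
test_docs = ["the goalie saw a doctor about his knee"]

vect = TfidfVectorizer(stop_words='english')
train_m = vect.fit_transform(train_docs)   # learns the vocabulary from TRAIN
test_m = vect.transform(test_docs)         # reuses that vocabulary

# Same number of columns, because the vocabulary is shared
print(train_m.shape, test_m.shape)

# Refitting on the test documents builds a new vocabulary, so the columns
# no longer line up with what a model was trained on
refit_vect = TfidfVectorizer(stop_words='english')
refit_m = refit_vect.fit_transform(test_docs)
print(sorted(refit_vect.vocabulary_))
```

Words that appear only in the test documents (here, "knee") are simply ignored by `transform`, since the train vocabulary has no column for them.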


In [20]:
train_x_tr.shape

(417, 11716)

In [21]:
test_x_tr.shape

(180, 11716)

In [19]:
# The transformed data sets are sparse matrices; displaying one shows only a summary
train_x_tr

<417x11716 sparse matrix of type '<class 'numpy.float64'>'
	with 32087 stored elements in Compressed Sparse Row format>

In [22]:
# Convert with toarray() to see the actual values (mostly zeros)
train_x_tr.toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.07535523, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
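
Converting to a dense array is fine at this size, but a sparse matrix can also be inspected directly. The sketch below uses a tiny made-up matrix (`m`) in place of `train_x_tr`, and repeats the density calculation with the figures from the summary above.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small made-up matrix standing in for train_x_tr
dense = np.array([[0.0, 0.5, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0],
                  [0.3, 0.0, 0.0, 0.7]])
m = csr_matrix(dense)

# Fraction of cells that hold a non-zero value
density = m.nnz / (m.shape[0] * m.shape[1])
print(density)

# Column indices and values of the non-zero entries in one row,
# without converting the whole matrix
row = m[2]
print(row.indices, row.data)

# Same density calculation with the numbers reported for train_x_tr
print(32087 / (417 * 11716))
```

With 32087 stored elements in a 417 x 11716 matrix, well under 1% of the cells are non-zero, which is why the dense output above is almost entirely zeros.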

In [34]:
tfidf_vect.vocabulary_  # the number is the term's column index, not a count

{'article': 2031,
 'alchemy': 1770,
 '1993apr6': 641,
 '142037': 395,
 '9246': 1390,
 'references': 8961,
 'rauser': 8857,
 '734062608': 1207,
 'sfu': 9663,
 'ca': 2760,
 '044323': 40,
 '22829': 768,
 'pasteur': 8105,
 'berkeley': 2358,
 'edu': 4390,
 'organization': 7932,
 'university': 11029,
 'toronto': 10748,
 'chemistry': 3036,
 'department': 3930,
 'lines': 6760,
 '14': 382,
 'daniell': 3775,
 'cory': 3554,
 'daniel': 3774,
 'lyddy': 6909,
 'writes': 11563,
 'don': 4209,
 'americans': 1828,
 'study': 10238,
 'history': 5690,
 'french': 5115,
 'settled': 9644,
 'north': 7725,
 'america': 1826,
 'early': 4352,
 'british': 2602,
 'lemieux': 6673,
 'probably': 8582,
 'trace': 10765,
 'american': 1827,
 'heritage': 5644,
 'lot': 6861,
 'gerald': 5275,
 '42': 986,
 'yr': 11665,
 'old': 7859,
 'male': 6983,
 'friend': 5124,
 'misdiagnosed': 7315,
 'having': 5580,
 'osteopporosis': 7960,
 'years': 11636,
 'recently': 8917,
 'illness': 5904,
 'rare': 8843,
 'gaucher': 5217,
 'disease': 41

## Latent Semantic Analysis (Singular Value Decomposition)

In [23]:
from sklearn.decomposition import TruncatedSVD

In [55]:
#For Latent Semantic Analysis, the scikit-learn documentation recommends n_components=100; this tutorial uses 400

svd = TruncatedSVD(n_components=400, n_iter=10)

In [56]:
train_x_lsa = svd.fit_transform(train_x_tr)

In [57]:
train_x_lsa.shape

(417, 400)

In [58]:
train_x_lsa

array([[ 1.90881906e-01, -7.79143057e-02, -4.31061141e-02, ...,
         2.50558626e-02,  3.39829886e-02,  1.99648908e-03],
       [ 9.58606619e-02, -2.60948982e-02,  7.46506435e-02, ...,
        -4.10024627e-03, -5.84129801e-03, -1.12393224e-02],
       [ 1.17993507e-01, -5.07330215e-02,  5.39099455e-02, ...,
         1.64138299e-02, -1.86070691e-02, -9.31701595e-03],
       ...,
       [ 1.84910352e-01, -2.06761773e-01, -1.41444035e-01, ...,
        -1.30938040e-02, -7.72025915e-02,  1.42406161e-02],
       [ 6.80806341e-02, -3.05288828e-02,  5.89509791e-02, ...,
        -4.00680477e-03, -1.39355591e-04, -1.57210391e-02],
       [ 4.87594356e-02, -3.78255069e-02,  8.34634697e-02, ...,
        -7.95832246e-03,  1.56743114e-03, -1.17517556e-04]])

### Let's transform the test data set

In [59]:
test_x_lsa = svd.transform(test_x_tr)

In [60]:
test_x_lsa.shape

(180, 400)

### Explore the SVDs - OPTIONAL

In [61]:
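# Total variance captured by the components (explained_variance_ratio_ gives it as a share of the total variance)
svd.explained_variance_.sum()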

0.9721032833795678
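
When choosing `n_components`, the share of variance is often easier to read than the raw variance. A minimal sketch on made-up random data (`X`), not on the TF-IDF matrix:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
X = rng.rand(50, 20)   # made-up data, not the TF-IDF matrix

svd = TruncatedSVD(n_components=5, n_iter=10, random_state=0)
svd.fit(X)

# Variance captured by each component, in the units of the data
print(svd.explained_variance_)

# Share of the total variance captured by each component
ratio = svd.explained_variance_ratio_
print(ratio.sum())

# Cumulative share, useful for deciding how many components to keep
print(np.cumsum(ratio))
```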

In [62]:
#These are all the components:
svd.components_

array([[ 4.15023038e-03,  7.55367177e-03,  2.10764546e-04, ...,
         7.16749137e-04,  2.61284117e-03,  1.50466721e-03],
       [-4.99769973e-03, -4.78001354e-03, -1.66731821e-04, ...,
        -1.37127790e-03, -8.96681502e-04, -1.57434256e-03],
       [ 2.29562423e-03,  4.73562968e-03,  4.14949126e-04, ...,
        -1.16802659e-03,  3.23566309e-03,  1.41876821e-03],
       ...,
       [ 8.29095645e-03, -7.12896612e-03, -2.83443213e-04, ...,
         1.97391921e-04, -1.40257182e-03,  8.77373317e-05],
       [ 2.73608103e-04,  1.01189073e-03, -2.34234443e-04, ...,
         2.03570822e-03, -5.95731957e-04,  2.91703726e-03],
       [-7.10477592e-03,  3.22022788e-03,  4.25156733e-04, ...,
         3.61845223e-04,  5.09883208e-03,  8.93917451e-04]])

In [63]:
svd.components_.shape

(400, 11716)

In [64]:
#Let's select the first component:

first_component = svd.components_[0,:]

In [65]:
# Sort the weights in the first component, and get the indices

indeces = np.argsort(first_component).tolist()

In [66]:
#Be careful: argsort returns the indices in ascending order of weight (least important first)

print(indeces)

[11612, 5402, 11597, 11566, 7560, 8732, 8469, 9678, 2993, 2936, 4258, 9243, 6702, 11425, 5095, 2303, 5936, 3366, 5426, 9637, 9255, 11376, 4938, 5813, 4836, 2224, 1610, 5484, 3876, 6986, 9176, 7649, 9391, 5953, 11108, 10089, 2806, 8820, 5289, 3657, 7780, 9495, 7113, 10249, 3646, 11068, 2516, 11201, 1729, 2901, 3291, 10490, 7890, 5204, 2574, 11367, 9177, 3447, 9434, 5791, 6566, 8076, 1932, 5471, 9436, 7486, 11316, 1232, 11431, 11432, 11366, 10055, 2112, 7049, 5836, 1971, 11429, 1830, 2372, 4637, 2111, 7018, 9452, 11380, 11384, 5203, 7138, 9453, 2832, 10476, 9180, 4902, 5587, 7726, 8183, 2805, 2707, 7732, 11466, 7753, 6294, 7632, 5683, 2614, 9122, 2609, 6815, 6914, 2591, 1816, 11498, 8787, 2008, 8414, 2950, 4940, 11300, 10173, 11656, 5428, 5498, 3435, 11368, 10906, 9960, 7972, 9008, 9376, 5093, 2287, 9414, 5638, 7540, 9930, 5565, 9270, 3029, 3858, 2335, 6982, 1547, 11482, 10395, 6382, 5680, 3162, 11424, 10125, 5983, 4806, 9827, 10111, 2589, 2842, 8374, 3407, 4805, 10120, 7570, 2407, 10322

In [67]:
#Let's get the feature names from the TF-IDF vectorizer
#(get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
feat_names = tfidf_vect.get_feature_names_out()

In [68]:
#Print the last 10 terms (i.e., the 10 terms that have the highest weights)

for index in indeces[-10:]:
    print(feat_names[index], "\t\tweight =", first_component[index])

ca 		weight = 0.11872355068995326
cs 		weight = 0.12542146929810832
com 		weight = 0.14113622347157895
gordon 		weight = 0.14413645467248487
geb 		weight = 0.14501359988895576
banks 		weight = 0.14501359988895576
writes 		weight = 0.14842561656640788
article 		weight = 0.17897972170136625
pitt 		weight = 0.23728946728714637
edu 		weight = 0.3176103795017922


## Random Forest

In [69]:
from sklearn.ensemble import RandomForestClassifier 

from sklearn.metrics import accuracy_score

In [84]:
rnd_clf = RandomForestClassifier(n_estimators=100, max_depth=3, max_leaf_nodes=16, n_jobs=-1) 

rnd_clf.fit(train_x_lsa, train_y)



RandomForestClassifier(max_depth=3, max_leaf_nodes=16, n_jobs=-1)

## Accuracy

In [85]:
from sklearn.metrics import accuracy_score

In [86]:
#Train accuracy

train_y_pred = rnd_clf.predict(train_x_lsa)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}'.format(train_acc))

Train acc: 0.9448441247002398


In [87]:
#Test accuracy

test_y_pred = rnd_clf.predict(test_x_lsa)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}'.format(test_acc))

Test acc: 0.8555555555555555


## Confusion Matrix

In [88]:
from sklearn.metrics import confusion_matrix

#Usually created on test set
confusion_matrix(test_y, test_y_pred)

array([[58,  0, 11],
       [ 2, 51,  2],
       [10,  1, 45]], dtype=int64)
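
To read the matrix: rows are the true classes and columns are the predicted classes, in the order 0 = graphics, 1 = hockey, 2 = medical. The sketch below recomputes accuracy, and per-class recall and precision, from the numbers printed above.

```python
import numpy as np

# Confusion matrix reported above (rows = true class, columns = predicted class)
cm = np.array([[58, 0, 11],
               [2, 51, 2],
               [10, 1, 45]])

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()

# Recall: of the posts that truly belong to a class, the share predicted correctly
recall = np.diag(cm) / cm.sum(axis=1)

# Precision: of the posts predicted as a class, the share that truly belong to it
precision = np.diag(cm) / cm.sum(axis=0)

print(accuracy)
print(recall.round(3))
print(precision.round(3))
```

The accuracy works out to 154 / 180, which matches the test accuracy reported above. Most of the errors are between graphics and medical: 11 graphics posts were predicted as medical and 10 medical posts as graphics.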

## Stochastic Gradient Descent Classifier

In [89]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=100, tol=1e-3)


In [90]:
sgd_clf.fit(train_x_lsa, train_y)

SGDClassifier(max_iter=100)

## Accuracy

In [91]:
#Train accuracy

train_y_pred = sgd_clf.predict(train_x_lsa)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}'.format(train_acc))

Train acc: 0.9952038369304557


In [92]:
#Test accuracy

test_y_pred = sgd_clf.predict(test_x_lsa)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}'.format(test_acc))

Test acc: 0.9055555555555556


## Confusion Matrix

In [93]:
from sklearn.metrics import confusion_matrix

#Usually created on test set
confusion_matrix(test_y, test_y_pred)

array([[65,  2,  2],
       [ 1, 54,  0],
       [ 7,  5, 44]], dtype=int64)
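
As a recap, the three steps used in this tutorial (TF-IDF, SVD, classifier) could also be chained in a scikit-learn `Pipeline`. The sketch below is one possible arrangement, not the method used above; the documents and labels are made up, and `n_components=2` is chosen only because the toy vocabulary is tiny.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Made-up documents and labels, used only for illustration
docs = ["the goalie made a great save in the hockey game",
        "the hockey team won the playoff game",
        "the team scored a late goal in the game",
        "the doctor prescribed a new drug for the disease",
        "the patient asked the doctor about the disease",
        "the drug helped the patient recover from the illness"]
labels = ["hockey", "hockey", "hockey", "medical", "medical", "medical"]

pipe = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    TruncatedSVD(n_components=2, random_state=42),
    SGDClassifier(max_iter=100, tol=1e-3, random_state=42),
)

# fit() calls fit_transform on the vectorizer and the SVD, then fits the classifier
pipe.fit(docs, labels)

# predict() only calls transform on the first two steps
pred = pipe.predict(["the goalie stopped the puck"])
print(pred)
```

Because `predict` never refits the vectorizer or the SVD, the rule of fitting on train and only transforming test is applied automatically.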