# Tutorial - Text Mining - Classification 

We will predict the category of discussion posts in a newsgroup.

**The unit of analysis is a discussion post**

### Import common packages

In [1]:
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
# Seed NumPy's global random number generator for reproducibility
np.random.seed(1)

### Load data

In [2]:
news = pd.read_csv('C:/Users/Nithin Yadav/Desktop/DSP/news.csv')

news.shape


(597, 5)

In [3]:
news.head(5)

Unnamed: 0,TEXT,graphics,hockey,medical,newsgroup
0,I have a few reprints left of chapters from my...,1,0,0,graphics
1,"gnuplot, etc. make it easy to plot real valued...",1,0,0,graphics
2,Article-I.D.: snoopy.1pqlhnINN8k1 References: ...,1,0,0,graphics
3,"Hello, I am looking to add voice input capabil...",1,0,0,graphics
4,I recently got a file describing a library of ...,1,0,0,graphics


### Check for missing values

In [4]:
news[['TEXT']].isna().sum()

TEXT    0
dtype: int64

## Assign the input variable to X and the target variable to y

In [5]:
X = news['TEXT']

This is a multi-class classification problem. There are three categories we will predict:<br>
Whether a post is "graphics," "hockey," or "medical" related

In [6]:
y = news['newsgroup']
y.unique()

array(['graphics', 'hockey', 'medical'], dtype=object)

In [7]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(y)
print(le.classes_)
y = le.transform(y)

y


['graphics' 'hockey' 'medical']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
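
To read encoded predictions back as newsgroup names, a fitted encoder can map the integers back to the original strings. The following is a minimal sketch; the `labels` list and the name `le_demo` are made up for illustration and only reuse the three class names from above.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (illustrative only), using the three class names from the data
labels = ["hockey", "graphics", "medical", "hockey"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)   # classes are sorted alphabetically

# Map the integer codes back to the original strings
decoded = le_demo.inverse_transform(encoded)

print(list(le_demo.classes_))
print(encoded.tolist())
print(list(decoded))
```

Because the classes are sorted, 0 stands for graphics, 1 for hockey, and 2 for medical, which is the order used by the confusion matrices later on.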

## Split the data

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [9]:
X_train.shape, y_train.shape

((417,), (417,))

In [10]:
X_test.shape, y_test.shape

((180,), (180,))
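
The split above passes neither `random_state` nor `stratify`, so the rows that land in each part change from run to run. A possible variant is sketched below on toy arrays; the seed value 42 and the names `X_demo` and `y_demo` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y: 30 samples, 3 balanced classes (illustrative)
X_demo = np.arange(30)
y_demo = np.repeat([0, 1, 2], 10)

# random_state fixes the shuffle; stratify keeps class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Class counts in the train and test parts
print(np.bincount(y_tr), np.bincount(y_te))
```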

In [11]:
X_train.head(5)

433    Article-I.D.: aio.1993Apr6.133244.14717 Refere...
575    In article < szikopou.734725851@cunews: szikop...
340    2nd uptade: Here are the standings for the pol...
293    Article-I.D.: abyss.1psqioINN3mg References: <...
403    How long does it take a smoker's lungs to clea...
Name: TEXT, dtype: object

In [12]:
y_train[:5]

array([2, 2, 1, 1, 2])

## Sklearn: Text preparation

For simplicity (and focus), we will not do any separate text cleaning or preprocessing. We will pass the raw text to `TfidfVectorizer`, which lowercases it, tokenizes it, and filters English stop words. See the text mining fundamentals tutorial for more details on text cleaning and preprocessing.

In [13]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

# Raw string so that \W and \d reach the regex engine unchanged
tfidf_vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern=r"[^\W\d_]+")

X_train = tfidf_vect.fit_transform(X_train)
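
The `token_pattern` passed above, `[^\W\d_]+`, matches runs of letters only, so digits and underscores never become part of a token. A quick check with Python's `re` module shows the effect; the sample string is made up for illustration.

```python
import re

# Same pattern as passed to TfidfVectorizer above, written as a raw string
pattern = re.compile(r"[^\W\d_]+")

sample = "GIF89a files, rev_2 of 1993 stats"   # made-up sample text
tokens = pattern.findall(sample.lower())

print(tokens)   # ['gif', 'a', 'files', 'rev', 'of', 'stats']
```

Note that this check only shows what the pattern matches; stop-word filtering is a separate step in the vectorizer.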

**Notice in the previous step that we use `fit_transform` on TRAIN. When we transform the TEST data, we need to use `transform` only. This enables us to keep the number of columns (features) the same across the data sets. Otherwise, they WILL be different, and no model will work!**

In [14]:
# Perform the TfidfVectorizer transformation
# Be careful: we use the vectorizer fitted on the train set to transform the test set. Otherwise, the test
# features would be very different and would not match the train set.

X_test = tfidf_vect.transform(X_test)


In [15]:
X_train.shape, X_test.shape

((417, 9866), (180, 9866))
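
The `fit_transform` versus `transform` rule can be seen on a toy corpus. This is a small sketch with made-up documents and default vectorizer settings, not the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpora for illustration
train_docs = ["the goalie made a save", "render the image in color"]
test_docs = ["the doctor made a diagnosis"]

vect = TfidfVectorizer()
train_mat = vect.fit_transform(train_docs)   # learns the vocabulary from train
test_mat = vect.transform(test_docs)         # reuses that vocabulary

# Refitting on test learns a different vocabulary (what NOT to do)
wrong_mat = TfidfVectorizer().fit_transform(test_docs)

print(train_mat.shape, test_mat.shape, wrong_mat.shape)
```

The first two matrices share the same number of columns, while the refitted one does not, so a model trained on `train_mat` could not score it.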

In [16]:
# These data sets are sparse matrices. Displaying one shows only a summary, not the values.
X_train

<417x9866 sparse matrix of type '<class 'numpy.float64'>'
	with 29835 stored elements in Compressed Sparse Row format>

In [17]:
# Convert the sparse matrix to a dense array with toarray() to see the values
X_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
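
Calling `toarray()` builds the full dense array, which is mostly zeros here. A sparse matrix can also be inspected directly. The sketch below uses a small made-up matrix as a stand-in for `X_train`, then repeats the density calculation with the shape and stored-element count reported above.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small made-up matrix standing in for X_train
dense = np.array([[0.0, 0.5, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.8],
                  [0.3, 0.0, 0.0, 0.0]])
m = csr_matrix(dense)

# Fraction of non-zero cells, computed without densifying
density = m.nnz / (m.shape[0] * m.shape[1])

# Non-zero column indices and values of the first row only
first_row = m[0]
print(density, first_row.indices, first_row.data)

# Same calculation for the numbers reported above: 29835 stored elements in 417 x 9866
print(29835 / (417 * 9866))
```

For the train matrix above, well under 1% of the cells are non-zero.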

## Latent Semantic Analysis (Singular Value Decomposition)

In [18]:
from sklearn.decomposition import TruncatedSVD
#n_components is the number of topics, which should be less than the number of features

## n components = 100
svd_1 = TruncatedSVD(n_components=100, n_iter=10)
X_train_svd_1 = svd_1.fit_transform(X_train)
X_test_svd_1 = svd_1.transform(X_test)

## n components = 300
svd_3 = TruncatedSVD(n_components=300, n_iter=10)
X_train_svd_3 = svd_3.fit_transform(X_train)
X_test_svd_3 = svd_3.transform(X_test)

## n components = 500
svd_5 = TruncatedSVD(n_components=500, n_iter=10)
X_train_svd_5 = svd_5.fit_transform(X_train)
X_test_svd_5 = svd_5.transform(X_test)




In [19]:
X_train.shape, X_test.shape

((417, 9866), (180, 9866))
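
One way to judge how many components are worth keeping is the cumulative explained variance ratio of the fitted `TruncatedSVD`. The following sketch uses a random matrix as a stand-in for the TF-IDF features, so the numbers it prints say nothing about the newsgroup data; the names `X_demo` and `svd_demo` are illustrative.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
# Random non-negative matrix standing in for the TF-IDF features (illustrative)
X_demo = rng.rand(50, 200)

svd_demo = TruncatedSVD(n_components=20, n_iter=10, random_state=0)
X_reduced = svd_demo.fit_transform(X_demo)

# Share of variance captured by the first k components, for k = 1..20
cumulative = np.cumsum(svd_demo.explained_variance_ratio_)

print(X_reduced.shape)
print(cumulative[[4, 9, 19]])   # after 5, 10, and 20 components
```

If the curve flattens early on the real data, the later components add little and a smaller `n_components` may be enough.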

## Random Forest and evaluating model performance 

In [34]:


# Define n_components to try
n_components_list = [100, 300, 500]

for n in n_components_list:
    print(f"n_components = {n}")
    
    # Apply TruncatedSVD
    svd = TruncatedSVD(n_components=n, n_iter=10)
    X_train_svd = svd.fit_transform(X_train)
    X_test_svd = svd.transform(X_test)
    
    
    rnd_clf = RandomForestClassifier()
    _ = rnd_clf.fit(X_train_svd, y_train)
    
    # train
    y_pred_train = rnd_clf.predict(X_train_svd)
    train_acc = accuracy_score(y_train, y_pred_train)
    print(f"Train acc: {train_acc:.4f}")
    
    # test
    y_pred_test = rnd_clf.predict(X_test_svd)
    test_acc = accuracy_score(y_test, y_pred_test)
    print(f"Test acc: {test_acc:.4f}")
    
   
    print(f"Confusion matrix:\n{confusion_matrix(y_test, y_pred_test)}\n")

n_components = 100
Train acc: 0.9976
Test acc: 0.8389
Confusion matrix:
[[46  1 13]
 [ 3 49  5]
 [ 6  1 56]]

n_components = 300
Train acc: 0.9976
Test acc: 0.8056
Confusion matrix:
[[42  3 15]
 [ 3 50  4]
 [ 6  4 53]]

n_components = 500
Train acc: 0.9976
Test acc: 0.8333
Confusion matrix:
[[44  2 14]
 [ 2 52  3]
 [ 6  3 54]]
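
A confusion matrix holds more than the overall accuracy. As a worked example, the sketch below takes the random forest matrix for n_components = 100 from the output above and derives accuracy, per-class recall, and per-class precision with NumPy. Rows are true classes and columns are predicted classes, in the encoder's order: graphics, hockey, medical.

```python
import numpy as np

# Random forest confusion matrix for n_components = 100, copied from above
# Rows are true classes, columns are predictions: graphics, hockey, medical
cm = np.array([[46, 1, 13],
               [3, 49, 5],
               [6, 1, 56]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per true class
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class

print(round(accuracy, 4))      # 0.8389
print(np.round(recall, 3))
print(np.round(precision, 3))
```

The accuracy matches the reported test accuracy, and the first row shows the main weakness: 13 of the 60 graphics posts were predicted as medical.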



## Stochastic Gradient Descent Classifier and evaluating model performance 

In [36]:


for n in [100, 300, 500]:
    print(f"n_components = {n}")
    svd = TruncatedSVD(n_components=n, n_iter=10)
    X_train_svd = svd.fit_transform(X_train)
    X_test_svd = svd.transform(X_test)

    sgd_clf = SGDClassifier(max_iter=100)
    _ = sgd_clf.fit(X_train_svd, y_train)

    # Train
    y_pred_train = sgd_clf.predict(X_train_svd)
    print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

    # Test 
    y_pred_test = sgd_clf.predict(X_test_svd)
    print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

    print(f"Confusion Matrix : \n{confusion_matrix(y_test, y_pred_test)}\n")

n_components = 100
Train acc: 0.9904
Test acc: 0.8944
Confusion Matrix : 
[[54  4  2]
 [ 4 53  0]
 [ 5  4 54]]

n_components = 300
Train acc: 0.9976
Test acc: 0.8333
Confusion Matrix : 
[[44  0 16]
 [ 1 46 10]
 [ 3  0 60]]

n_components = 500
Train acc: 0.9976
Test acc: 0.9222
Confusion Matrix : 
[[55  2  3]
 [ 3 52  2]
 [ 4  0 59]]
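
The SGD test accuracies above do not move steadily with n_components. The split and the models in this notebook are not given a `random_state`, so part of that swing may come from randomness rather than from the number of components. One way to gauge this is to refit the same model with several seeds and look at the spread. The sketch below does so on synthetic data from `make_classification`, which is an assumption for illustration and not the newsgroup data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the SVD features (illustrative)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Same data and settings each time; only the classifier's seed changes
scores = []
for seed in range(10):
    clf = SGDClassifier(max_iter=1000, random_state=seed)
    clf.fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, clf.predict(X_te)))

print(f"mean={np.mean(scores):.3f} std={np.std(scores):.3f}")
print(f"min={min(scores):.3f} max={max(scores):.3f}")
```

If the spread across seeds is comparable to the gaps between settings, a single run per setting is not enough to rank them.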



## Conclusion

We fit and evaluated both the random forest and SGD models with n_components set to 100, 300, and 500.
The results are as follows:

Random forest:

n_components = 100
Train acc: 0.9976
Test acc: 0.8389

n_components = 300
Train acc: 0.9976
Test acc: 0.8056

n_components = 500
Train acc: 0.9976
Test acc: 0.8333

Increasing n_components from 100 to 300 lowers the test accuracy from 0.8389 to 0.8056, and increasing it further to 500 brings it back up to 0.8333. The training set has only 417 observations, so its TF-IDF matrix has rank at most 417 and we do not need n_components to be 500. Test accuracy is highest at n_components = 100.
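
The point about 417 observations can be made concrete: a matrix has no more singular values than its smaller dimension, so components beyond the number of training rows cannot carry extra information about the training set. A minimal NumPy sketch on a made-up matrix with fewer rows than columns:

```python
import numpy as np

rng = np.random.RandomState(0)
# Made-up matrix with fewer rows than columns, like the 417 x 9866 train set
A = rng.rand(5, 40)

# Singular values of A: there are only min(5, 40) = 5 of them
singular_values = np.linalg.svd(A, compute_uv=False)

print(len(singular_values), np.linalg.matrix_rank(A))
```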

SGD:

n_components = 100
Train acc: 0.9904
Test acc: 0.8944

n_components = 300
Train acc: 0.9976
Test acc: 0.8333

n_components = 500
Train acc: 0.9976
Test acc: 0.9222

For the SGD model, the test accuracy is 0.8944 when n_components is 100, drops to 0.8333 when n_components is 300, and rises to 0.9222 when n_components is 500, the highest of the three. Because the split and the models were not given a fixed random state, some of this variation may reflect run-to-run randomness.

SVD can be used to reduce the dimensionality of the data set by approximating the large matrix with a smaller one, which also removes noise. But SVD is not mandatory, because not every data set is complex enough to need it.


