# Exercise - Text Mining - Classification - SCIKIT-LEARN

We will predict the category of a product based on the description of the product.

**The unit of analysis is a product**

In [1]:
import pandas as pd
import numpy as np

In [2]:
products = pd.read_csv('products.csv')

In [3]:
products.head(5)

| Unnamed: 0 | Category | Description |
|---|---|---|
| 0 | Electronics | HP 680 Original Ink Advantage Cartridge (Black... |
| 1 | Clothing & Accessories | Bold N Elegant Navy Blue Thin Summer Pregnancy... |
| 2 | Books | The Travel Book: A Journey Through Every Count... |
| 3 | Electronics | Tiny Deal Compact 10x25 Mini Binoculars Telesc... |
| 4 | Clothing & Accessories | Nimble House 16Pcs/Set Unisex Women Men No Tie... |


In [4]:
products.shape

(5000, 2)

## Assign the "target" variable



In [5]:
target = products['Category']

## Assign the "text" (input) variable

In [6]:
input_data = products[['Description']]

## Split the data

In [7]:
from sklearn.model_selection import train_test_split

train_set, test_set, train_y, test_y = train_test_split(input_data, target, test_size=0.3, random_state=42)

In [8]:
train_set.shape, train_y.shape

((3500, 1), (3500,))

In [9]:
test_set.shape, test_y.shape

((1500, 1), (1500,))

# Data Prep

In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import FunctionTransformer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

## Sklearn: Text preparation

Create a pipeline to transform the text column and create a term by document matrix.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [12]:
def new_col(df):
    # Convert the single-column dataframe to a numpy array, then call ravel()
    # to flatten it to one dimension, which is what TfidfVectorizer expects.
    # np.array() builds a new array, so the original dataframe is not modified.
    return np.array(df).ravel()

In [13]:
new_col(train_set)

array(['Dehman 100-Percent Silky Satin Hair Beauty Pillowcase, Black, King Size DEHMANSATIN Silk originated in INDIA and it is an honor to present this ancient oriental gift to you. The collective wisdom of the sericulture people of this land has been condensed DEHMAN has always adhered to its original heart from generation to generation. From raw satin silk material collection to processing and to finished products,each step is carefully selected. DEHMAN hopes that every customer will have a wonderful satin silk experience and enjoy the silky noble life. Why Choose DEHMAN Pillowcase ? Material：DEHMAN Satin Silk pillowcase is crafted by 19 momme pure mulberry silk. Both sides of the pillowcase is organic and natural satin silk.Light weight and easy to carry. Design: It is designed with hidden zipper closure.With Queen(20x30inches) and King(20x36inches) and Standard size(20x26inches) in various kinds colours. Quality：Superior durable plain color ,not easy to run after washing.Exquisite 

In [14]:
text_column = ['Description']

In [15]:
number_svd_components = 310

In [16]:
text_transformer = Pipeline(steps=[
                ('my_new_column', FunctionTransformer(new_col)),
                ('text', TfidfVectorizer(stop_words='english')),
                ('svd', TruncatedSVD(n_components=number_svd_components, n_iter=10))
            ])

In [17]:
preprocessor = ColumnTransformer([
                     ('text', text_transformer, text_column),
                    ],
        remainder='drop')

# remainder='passthrough' would keep the other columns; it is optional. Here remainder='drop' (the default) discards them.
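
The `new_col` helper exists because `TfidfVectorizer` expects a one-dimensional sequence of strings, while selecting columns with a list (`['Description']`) hands the transformer a two-dimensional frame. As a possible alternative, `ColumnTransformer` passes a one-dimensional column when the selector is a single string rather than a list. The sketch below illustrates this on a small hypothetical frame (`toy_df` and its contents are made up for the example, not taken from `products.csv`).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the product descriptions
toy_df = pd.DataFrame({'Description': ['black ink cartridge',
                                       'navy blue summer dress',
                                       'travel book of every country']})

# A string selector (not a list) passes a 1-D column to the transformer,
# so no flattening step is needed before TfidfVectorizer
toy_preprocessor = ColumnTransformer(
    [('text', TfidfVectorizer(stop_words='english'), 'Description')],
    remainder='drop')

toy_matrix = toy_preprocessor.fit_transform(toy_df)
print(toy_matrix.shape)  # one row per document
```

Either approach should give the vectorizer the same input; the `FunctionTransformer` version keeps the flattening step explicit inside the pipeline.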

In [18]:
#Fit and transform the train data
train_x  = preprocessor.fit_transform(train_set)

train_x

array([[ 1.32127476e-01, -3.83815352e-02, -5.52127985e-02, ...,
         1.29439783e-02,  1.20356603e-02, -3.79514919e-02],
       [ 1.19672838e-03,  1.09071177e-03, -9.41521707e-06, ...,
         7.08928859e-02, -2.90129415e-02,  2.54643488e-04],
       [ 5.74701078e-02,  5.16487292e-02,  8.81057288e-03, ...,
        -1.14756614e-02, -5.73256211e-02, -4.36066224e-03],
       ...,
       [ 1.59120292e-01, -7.66943792e-02, -1.60781830e-01, ...,
         3.38256022e-04, -1.66197783e-02,  3.27053948e-02],
       [ 8.31440381e-02, -4.33959857e-02, -3.44149656e-02, ...,
         1.22070043e-02,  8.39511606e-03,  3.43223149e-02],
       [ 9.66538621e-02, -3.95970843e-02, -1.19857668e-01, ...,
         7.57567559e-04, -2.93496081e-02, -3.59466901e-02]])

In [19]:
train_x.shape

(3500, 310)

In [20]:
# Transform the test data
test_x = preprocessor.transform(test_set)

test_x

array([[ 0.0197115 ,  0.0121762 ,  0.00564652, ..., -0.0065298 ,
         0.00498033, -0.00739366],
       [ 0.08027585, -0.04060881,  0.02606995, ...,  0.02766857,
        -0.02375448,  0.01969134],
       [ 0.17327444, -0.07025064, -0.0460695 , ...,  0.00465202,
         0.01972045,  0.03558481],
       ...,
       [ 0.08834409,  0.03663293,  0.00627167, ...,  0.01454669,
        -0.01236264, -0.01846717],
       [ 0.08826245, -0.03472008,  0.03563941, ..., -0.0191585 ,
         0.02662236, -0.00927352],
       [ 0.01982621, -0.00944311,  0.02909595, ...,  0.01335532,
        -0.00799947,  0.00405315]])

In [21]:
test_x.shape

(1500, 310)

# Calculate the baseline

In [22]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")

dummy_clf.fit(train_x, train_y)

In [23]:
from sklearn.metrics import accuracy_score

In [24]:
#Baseline Train Accuracy
dummy_train_pred = dummy_clf.predict(train_x)

baseline_train_acc = accuracy_score(train_y, dummy_train_pred)

print('Baseline Train Accuracy: {}' .format(baseline_train_acc))

Baseline Train Accuracy: 0.38142857142857145


In [25]:
#Baseline Test Accuracy
dummy_test_pred = dummy_clf.predict(test_x)

baseline_test_acc = accuracy_score(test_y, dummy_test_pred)

print('Baseline Test Accuracy: {}' .format(baseline_test_acc))

Baseline Test Accuracy: 0.386
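
Because the `most_frequent` strategy always predicts the majority class of the training labels, its accuracy on a set of labels equals the share of that class in the set. The sketch below checks this on hypothetical toy labels (`toy_y` is invented for illustration and does not reflect the real class distribution).

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy labels, not the product data
toy_y = pd.Series(['Electronics'] * 4 + ['Books'] * 3
                  + ['Clothing & Accessories'] * 3)
toy_x = [[0]] * len(toy_y)  # DummyClassifier ignores the feature values

toy_dummy = DummyClassifier(strategy='most_frequent').fit(toy_x, toy_y)
dummy_acc = accuracy_score(toy_y, toy_dummy.predict(toy_x))

# Share of the most frequent class (value_counts sorts by frequency)
majority_share = toy_y.value_counts(normalize=True).iloc[0]

print(dummy_acc, majority_share)
```

On the product data, the same comparison could be made with `train_y.value_counts(normalize=True)`.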


# Try one of the classifiers we have covered so far

In [26]:
from sklearn.ensemble import RandomForestClassifier 

from sklearn.metrics import accuracy_score

In [27]:
rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 

rnd_clf.fit(train_x, train_y)


## Accuracy

In [28]:
from sklearn.metrics import accuracy_score

In [29]:
#Train accuracy

train_y_pred = rnd_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.9011428571428571


In [30]:
#Test accuracy

test_y_pred = rnd_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.8933333333333333


## Generate the confusion matrix

In [31]:
from sklearn.metrics import confusion_matrix

#Usually created on test set
confusion_matrix(test_y, test_y_pred)

array([[325,   5,   4,  31],
       [  1, 231,   3,  23],
       [ 10,   0, 230,  58],
       [ 15,   4,   6, 554]])
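
To read the matrix beyond overall accuracy, per-class recall and precision can be derived from it directly. The sketch below copies the random forest matrix printed above; rows are true classes and columns are predicted classes, in the sorted label order that should match `rnd_clf.classes_`.

```python
import numpy as np

# Random forest test-set confusion matrix copied from the output above
cm = np.array([[325,   5,   4,  31],
               [  1, 231,   3,  23],
               [ 10,   0, 230,  58],
               [ 15,   4,   6, 554]])

# Recall: correct predictions divided by the number of true items per class
recall_per_class = cm.diagonal() / cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions per class
precision_per_class = cm.diagonal() / cm.sum(axis=0)

# Overall accuracy: the diagonal sum divided by the total count
overall_accuracy = cm.diagonal().sum() / cm.sum()

print(recall_per_class.round(3))
print(precision_per_class.round(3))
print(overall_accuracy)
```

The diagonal sums to 1340 out of 1500, which reproduces the reported test accuracy. The third class has the lowest recall (230 of 298), with most of its errors assigned to the fourth class.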

# Try another classifier we have covered so far

In [32]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=100)


In [33]:
sgd_clf.fit(train_x, train_y)

## Accuracy

In [34]:
#Train accuracy

train_y_pred = sgd_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.9451428571428572


In [35]:
#Test accuracy

test_y_pred = sgd_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.9373333333333334


## Generate the confusion matrix

In [36]:
from sklearn.metrics import confusion_matrix

#Usually created on test set
confusion_matrix(test_y, test_y_pred)

array([[332,   6,   5,  22],
       [  0, 252,   1,   5],
       [  7,   0, 260,  31],
       [  5,   6,   6, 562]])

# Check for the number of components

**Determine whether increasing/decreasing the number of components increases/decreases the two models' accuracies** Discuss your findings below.

In [37]:
# Let's retrieve the TruncatedSVD step from the column transformer
# using chained indexing:
# "preprocessor" has a "transformers_" attribute, a list of (name, transformer, columns) tuples
# Index 0 selects the first tuple, the "text" entry
# Index 1 selects that tuple's transformer, the text pipeline
# Index 2 selects the pipeline's third step, "svd"

svd = preprocessor.transformers_[0][1][2]

svd

In [38]:
# Now, retrieve the variance explained by each component and sum the values

svd.explained_variance_.sum()

0.3562853867301484
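
Note that `explained_variance_` holds the variance of each transformed component, not a proportion. To see what share of the input's total variance is retained, `explained_variance_ratio_` is the attribute to sum. The sketch below shows the cumulative share on a hypothetical toy corpus (`toy_docs` is invented so the block runs on its own).

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the product descriptions
toy_docs = ['black ink cartridge for printer',
            'wireless mouse and keyboard set',
            'usb charger cable for phone',
            'navy blue cotton summer dress',
            'cotton socks set black',
            'travel book through every country',
            'history book for students',
            'poetry collection paperback edition']

toy_tfidf = TfidfVectorizer(stop_words='english').fit_transform(toy_docs)
toy_svd = TruncatedSVD(n_components=3, random_state=42).fit(toy_tfidf)

# Running total of the share of variance captured by the first k components
cumulative_ratio = np.cumsum(toy_svd.explained_variance_ratio_)
print(cumulative_ratio)
```

On the fitted notebook pipeline, the equivalent call would be `svd.explained_variance_ratio_.sum()`.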

In [39]:
#These are all the components:
svd.components_

array([[ 1.63402171e-03,  7.19153466e-03,  6.99364414e-04, ...,
         7.97173883e-06,  2.59443705e-04,  2.59443705e-04],
       [-1.01786207e-03, -1.34315380e-03, -4.69872705e-04, ...,
         5.99853873e-06,  2.84166628e-04,  2.84166628e-04],
       [ 1.26893159e-03,  5.84297822e-03,  1.92277169e-03, ...,
         4.00954618e-06,  8.12129562e-05,  8.12129562e-05],
       ...,
       [-2.94735634e-03, -2.84571237e-02,  1.85155192e-03, ...,
        -2.13739008e-05, -6.87490014e-04, -6.87490014e-04],
       [-7.09570023e-03, -1.81672169e-02,  2.67121038e-04, ...,
        -2.00727056e-04,  1.60404383e-03,  1.60404383e-03],
       [-1.16624210e-03,  5.40192746e-03,  2.00584936e-03, ...,
         1.53382954e-04, -1.53251008e-03, -1.53251008e-03]])

In [40]:
#Let's select the first component:

first_component = svd.components_[0,:]

first_component

array([1.63402171e-03, 7.19153466e-03, 6.99364414e-04, ...,
       7.97173883e-06, 2.59443705e-04, 2.59443705e-04])

In [41]:
# Sort the weights in the first component, and get the indices

indeces = np.argsort(first_component).tolist()

In [42]:
#Be careful, the indices are in ascending order of weight (least important first)

print(indeces)

[1040, 16094, 25514, 10268, 11522, 26065, 26020, 15104, 26053, 26054, 26055, 26059, 26057, 26049, 26060, 26061, 26062, 26064, 26056, 26021, 26047, 26023, 26029, 26033, 26028, 26034, 26026, 26036, 26037, 26038, 26040, 26041, 26042, 26043, 26044, 26045, 26024, 26031, 26066, 26030, 26079, 26068, 26083, 26086, 26087, 26077, 26075, 26072, 26090, 26074, 26092, 26096, 26069, 26094, 26095, 26082, 12124, 5620, 167, 26058, 26091, 26070, 26039, 26051, 26032, 26080, 7646, 8681, 8804, 8844, 10062, 26078, 1492, 4813, 23475, 12657, 13537, 11502, 5717, 9418, 20425, 297, 10854, 8312, 14141, 7668, 22683, 14550, 2767, 16199, 200, 1755, 944, 9627, 16671, 12386, 21764, 17253, 14287, 6697, 10738, 20577, 14256, 6439, 6028, 15036, 19760, 7295, 11320, 9906, 14729, 14347, 15119, 16724, 14349, 11135, 8271, 10045, 10849, 19895, 23023, 16979, 9391, 18753, 14977, 9641, 13120, 9864, 2637, 10001, 24832, 25155, 23666, 21749, 18378, 2636, 3396, 17730, 4716, 11505, 3703, 11850, 19508, 255, 3753, 23024, 22567, 2334, 6634

In [43]:
#Let's get the feature names from the TF-IDF vectorizer:
# First, we need to retrieve the TfidfVectorizer from the column transformer

tfidf = preprocessor.transformers_[0][1][1]

tfidf

In [44]:
# Now, get the feature names

feat_names = tfidf.get_feature_names_out()

In [45]:
#Print the last 10 terms (i.e., the 10 terms that have the highest weights)

for index in indeces[-10:]:
    print(f'term: {feat_names[index]}\t weight = {first_component[index]}')

term: product	 weight = 0.09851489999028218
term: design	 weight = 0.10037239064841945
term: black	 weight = 0.10842596161008759
term: size	 weight = 0.1094158814881905
term: book	 weight = 0.11039126314394121
term: quality	 weight = 0.11714008268983249
term: students	 weight = 0.11965214406660453
term: set	 weight = 0.12164642329193157
term: cotton	 weight = 0.13830682137653286
term: content	 weight = 0.1494769087127782
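
The same lookup can be wrapped in a small helper so that any component can be inspected, not only the first. This is a sketch: `top_terms` is a hypothetical helper name, and the toy arrays stand in for `svd.components_` and `feat_names`.

```python
import numpy as np

def top_terms(components, feature_names, component_index, n_terms=10):
    """Return (term, weight) pairs with the largest weights, highest first."""
    weights = components[component_index]
    # argsort is ascending, so reverse it before taking the first n_terms
    order = np.argsort(weights)[::-1][:n_terms]
    return [(str(feature_names[i]), float(weights[i])) for i in order]

# Hypothetical toy inputs standing in for svd.components_ and feat_names
toy_components = np.array([[0.1, 0.7, 0.2],
                           [-0.5, 0.3, 0.4]])
toy_names = np.array(['book', 'cotton', 'size'])

print(top_terms(toy_components, toy_names, 0, n_terms=2))
```

With the notebook's objects, a call such as `top_terms(svd.components_, feat_names, 1)` would list the strongest terms of the second component.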


## Findings
The notebook applies TruncatedSVD for dimensionality reduction after TF-IDF vectorization. A breakdown of the process:

Dimensionality Reduction Technique: TruncatedSVD is applied after TF-IDF vectorization. It reduces the TF-IDF matrix to 310 components while retaining the directions that carry the most variance.

Baseline: The baseline is a DummyClassifier with the 'most_frequent' strategy, which always predicts the most frequent class in the training set. It reaches an accuracy of about 0.381 on the train set and 0.386 on the test set.

Model Evaluation: Two models are trained on the reduced features, RandomForestClassifier and SGDClassifier, and both are scored on train and test accuracy. The random forest reaches about 0.901 (train) and 0.893 (test); the SGD classifier reaches about 0.945 (train) and 0.937 (test). Both are far above the baseline, and the small gaps between train and test accuracy suggest little overfitting. Confusion matrices on the test set show where the remaining errors fall.

Analysis of Components: The fitted TruncatedSVD is then inspected. The code sums explained_variance_ (about 0.356), which is the explained variance itself rather than explained_variance_ratio_, and lists the highest-weighted terms of the first component.

Number of Components Evaluation: The notebook does not yet answer the question posed in this section. It inspects the components for a single setting (310) but never varies the number of components and re-evaluates the models.

To evaluate the impact of the number of components on model performance, the number of components must be varied systematically, with the preprocessor and both models refit at each setting. Comparing the resulting train and test accuracies would then show whether increasing or decreasing the number of components improves or degrades each model.
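
The experiment described above could be sketched as follows. This is a minimal version under stated assumptions: `accuracy_by_components` is a hypothetical helper, the toy corpus is invented so the block runs on its own, and `random_state=42` is added for repeatability (the notebook's models do not set it). The flattening step is omitted because the helper receives the text as a one-dimensional sequence.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline


def accuracy_by_components(train_texts, train_y, test_texts, test_y, component_grid):
    """Refit both models for each SVD size and collect train/test accuracy."""
    results = []
    for n_components in component_grid:
        # Fresh, unfitted models for every setting
        models = {
            'random_forest': RandomForestClassifier(
                n_estimators=100, max_leaf_nodes=16, n_jobs=-1, random_state=42),
            'sgd': SGDClassifier(max_iter=100, random_state=42),
        }
        for name, model in models.items():
            pipe = Pipeline(steps=[
                ('text', TfidfVectorizer(stop_words='english')),
                ('svd', TruncatedSVD(n_components=n_components, n_iter=10,
                                     random_state=42)),
                ('model', model),
            ])
            pipe.fit(train_texts, train_y)
            results.append({
                'n_components': n_components,
                'model': name,
                'train_acc': pipe.score(train_texts, train_y),
                'test_acc': pipe.score(test_texts, test_y),
            })
    return results


# Hypothetical toy corpus, not the product data
toy_train = ['black ink cartridge for printer',
             'wireless mouse and keyboard set',
             'usb charger cable for phone',
             'compact mini binoculars telescope',
             'travel book through every country',
             'novel about a young detective',
             'history book for students',
             'poetry collection paperback edition']
toy_train_y = ['Electronics'] * 4 + ['Books'] * 4
toy_test = ['printer ink black', 'book for young students']
toy_test_y = ['Electronics', 'Books']

toy_results = accuracy_by_components(toy_train, toy_train_y,
                                     toy_test, toy_test_y, [2, 4])
for row in toy_results:
    print(row)
```

On the product data, the call might look like `accuracy_by_components(train_set['Description'], train_y, test_set['Description'], test_y, [50, 100, 310, 500])`; the grid values are illustrative, and each must not exceed the vocabulary size. Scoring on the test set repeatedly while choosing a setting risks tuning to it, so a validation split or cross-validation would be a safer basis for the final choice.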