# Exercise - Text Mining - Classification - SCIKIT-LEARN

We will predict a product's category from its description.

**The unit of analysis is a product**

In [68]:
import pandas as pd
import numpy as np

In [69]:
products = pd.read_csv('products.csv')

In [70]:
products.head(5)

| | Category | Description |
|---|---|---|
| 0 | Electronics | HP 680 Original Ink Advantage Cartridge (Black... |
| 1 | Clothing & Accessories | Bold N Elegant Navy Blue Thin Summer Pregnancy... |
| 2 | Books | The Travel Book: A Journey Through Every Count... |
| 3 | Electronics | Tiny Deal Compact 10x25 Mini Binoculars Telesc... |
| 4 | Clothing & Accessories | Nimble House 16Pcs/Set Unisex Women Men No Tie... |


In [71]:
products.shape

(5000, 2)

## Assign the "target" variable



In [72]:
target = products['Category']

## Assign the "text" (input) variable

In [73]:
input_data = products[['Description']]

## Split the data

In [74]:
from sklearn.model_selection import train_test_split

train_set, test_set, train_y, test_y = train_test_split(input_data, target, test_size=0.3, random_state=42)

In [75]:
train_set.shape, train_y.shape

((3500, 1), (3500,))

In [76]:
test_set.shape, test_y.shape

((1500, 1), (1500,))

# Data Prep

In [77]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

## Sklearn: Text preparation

Create a pipeline to transform the text column and create a term by document matrix.

In [78]:
def new_col(df):
    # Convert the single-column DataFrame to a NumPy array, then call ravel()
    # to flatten it to one dimension, which is what TfidfVectorizer expects.
    # np.array() builds a new array, so the original DataFrame is not modified.
    return np.array(df).ravel()
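
ColumnTransformer passes the `Description` selection to the pipeline as a two-dimensional, single-column object, whereas `TfidfVectorizer` expects a one-dimensional sequence of strings. The sketch below uses a hypothetical three-row frame named `toy` (made up for illustration, not taken from products.csv) to show what the flattening step changes.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data, not taken from products.csv
toy = pd.DataFrame({'Description': ['black ink cartridge',
                                    'navy blue summer dress',
                                    'travel book']})

two_d = np.array(toy)     # shape (3, 1): one row per product, one column
one_d = two_d.ravel()     # shape (3,): a flat array of strings
print(two_d.shape, one_d.shape)

# Iterating over a DataFrame yields column labels, not rows,
# which is why the text must be flattened before it is vectorized.
print(list(toy))

# With the flat array, the vectorizer sees one document per product.
tfidf = TfidfVectorizer().fit_transform(one_d)
print(tfidf.shape)
```

Without the flattening step, the vectorizer would iterate over the column label rather than over the individual descriptions.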

In [79]:
new_col(train_set)

array(['Dehman 100-Percent Silky Satin Hair Beauty Pillowcase, Black, King Size DEHMANSATIN Silk originated in INDIA and it is an honor to present this ancient oriental gift to you. The collective wisdom of the sericulture people of this land has been condensed DEHMAN has always adhered to its original heart from generation to generation. From raw satin silk material collection to processing and to finished products,each step is carefully selected. DEHMAN hopes that every customer will have a wonderful satin silk experience and enjoy the silky noble life. Why Choose DEHMAN Pillowcase ? Material：DEHMAN Satin Silk pillowcase is crafted by 19 momme pure mulberry silk. Both sides of the pillowcase is organic and natural satin silk.Light weight and easy to carry. Design: It is designed with hidden zipper closure.With Queen(20x30inches) and King(20x36inches) and Standard size(20x26inches) in various kinds colours. Quality：Superior durable plain color ,not easy to run after washing.Exquisite 

## Identify the text column

In [80]:
text_column = ['Description']

# Pipeline

In [81]:
number_svd_components = 100

In [82]:
text_transformer = Pipeline(steps=[
                ('my_new_column', FunctionTransformer(new_col)),
                ('text', TfidfVectorizer(stop_words='english')),
                ('svd', TruncatedSVD(n_components=number_svd_components, n_iter=10))
            ])

In [83]:
preprocessor = ColumnTransformer([
                     ('text', text_transformer, text_column),
                    ],
        remainder='drop')

# remainder='passthrough' is optional. You don't have to use it; remainder='drop' discards any column not listed above.
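
To make the two text steps concrete, here is a minimal sketch on a hypothetical five-document corpus (the documents and the choice of 2 components are assumptions for illustration). TF-IDF produces a sparse matrix with one row per document and one column per term; TruncatedSVD then compresses those term columns into a small number of dense components.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical mini-corpus, for illustration only
docs = [
    'black ink cartridge for printer',
    'colour ink cartridge for printer',
    'navy blue summer dress',
    'blue cotton summer shirt',
    'travel book about every country',
]

tfidf = TfidfVectorizer(stop_words='english')
term_doc = tfidf.fit_transform(docs)      # sparse matrix: documents x terms
print(term_doc.shape)

# random_state is added here only so that the sketch is reproducible
svd = TruncatedSVD(n_components=2, n_iter=10, random_state=42)
reduced = svd.fit_transform(term_doc)     # dense array: documents x components
print(reduced.shape)

# Share of the TF-IDF variance retained by the kept components
print(svd.explained_variance_ratio_.sum())
```

The pipeline above does the same thing at a larger scale, with `number_svd_components` controlling how many columns the classifier eventually sees.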

# Transform: fit_transform() for TRAIN

In [84]:
train_set.shape

(3500, 1)

In [85]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_set)

train_x

array([[ 1.32127476e-01, -3.83815352e-02,  5.52127989e-02, ...,
         1.36914071e-02,  3.44935824e-02,  2.77563738e-02],
       [ 1.19672838e-03,  1.09071177e-03,  9.41546052e-06, ...,
        -3.81370454e-04,  1.08456808e-02,  9.46528943e-05],
       [ 5.74701078e-02,  5.16487292e-02, -8.81057322e-03, ...,
         1.09394010e-02, -2.67024857e-02,  1.85782378e-02],
       ...,
       [ 1.59120292e-01, -7.66943792e-02,  1.60781829e-01, ...,
        -5.39372830e-02,  4.47880154e-04,  3.18229835e-02],
       [ 8.31440381e-02, -4.33959857e-02,  3.44149656e-02, ...,
        -9.85543343e-03,  6.44846407e-02, -3.28265621e-02],
       [ 9.66538621e-02, -3.95970844e-02,  1.19857668e-01, ...,
         6.71208708e-02,  8.89555282e-03, -1.11650919e-02]])

In [86]:
train_x.shape

(3500, 100)

# Transform: transform() for TEST

In [87]:
# Transform the test data
test_x = preprocessor.transform(test_set)

test_x

array([[ 0.0197115 ,  0.0121762 , -0.00564652, ...,  0.00070306,
        -0.00932534, -0.00530162],
       [ 0.08027585, -0.04060881, -0.02606995, ..., -0.0728195 ,
        -0.02162933, -0.02158341],
       [ 0.17327444, -0.07025064,  0.0460695 , ..., -0.02509734,
        -0.01922008,  0.02219291],
       ...,
       [ 0.08834409,  0.03663293, -0.00627167, ..., -0.00841541,
         0.08481412, -0.00332921],
       [ 0.08826245, -0.03472008, -0.03563941, ..., -0.01237829,
        -0.00141656,  0.00284105],
       [ 0.01982621, -0.00944311, -0.02909595, ...,  0.02922745,
        -0.0168786 , -0.00369394]])

In [88]:
test_x.shape

(1500, 100)
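
The reason for calling `fit_transform()` on the train set but only `transform()` on the test set can be seen in a small sketch. The documents below are hypothetical; the point is that the vocabulary and weights are learned from the training text alone and then reused unchanged, so the test matrix has the same columns and words unseen in training are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, for illustration only
train_docs = ['black ink cartridge', 'blue summer dress']
test_docs = ['blue ink telescope']      # 'telescope' never appears in train

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)   # learns vocabulary and IDF weights
test_matrix = vectorizer.transform(test_docs)         # reuses them unchanged

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
print('telescope' in vectorizer.vocabulary_)
```

Fitting on the test set as well would let information from the test descriptions leak into the features.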

# Calculate the baseline

In [89]:
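# The dummy classifier doesn't learn from the features: with strategy="most_frequent"
# it always predicts the most common class in the training data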
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")

dummy_clf.fit(train_x, train_y)

In [90]:
from sklearn.metrics import accuracy_score

In [91]:
#Baseline Train Accuracy
dummy_train_pred = dummy_clf.predict(train_x)

baseline_train_acc = accuracy_score(train_y, dummy_train_pred)

print('Baseline Train Accuracy: {}'.format(baseline_train_acc))

Baseline Train Accuracy: 0.38142857142857145


In [92]:
#Baseline Test Accuracy
dummy_test_pred = dummy_clf.predict(test_x)

baseline_test_acc = accuracy_score(test_y, dummy_test_pred)

print('Baseline Test Accuracy: {}'.format(baseline_test_acc))

Baseline Test Accuracy: 0.386
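
With `strategy="most_frequent"`, the baseline accuracy is simply the share of rows that belong to the majority class. The sketch below checks this on hypothetical labels (`A`, `B`, `C` are made up). The test baseline of 0.386 is consistent with this reading: the largest class in the confusion matrices further down has 579 of the 1500 test rows, and 579 / 1500 = 0.386.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 'A' is the majority class with 5 of 10 rows
y = pd.Series(['A'] * 5 + ['B'] * 3 + ['C'] * 2)
X = np.zeros((len(y), 1))     # the dummy model ignores the features

dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
dummy_acc = accuracy_score(y, dummy.predict(X))

# Share of the most common class, computed directly from the labels
majority_share = y.value_counts(normalize=True).max()
print(dummy_acc, majority_share)
```

Any real model should beat this number before its accuracy is considered meaningful.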


# Try one of the classifiers we have covered so far

In [93]:
# Logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Initialize the logistic regression model
logistic_model = LogisticRegression(max_iter=1000, random_state=42)

# Fit the model on the training data
logistic_model.fit(train_x, train_y)

## Accuracy

In [94]:
# Make predictions on the test data
predictions = logistic_model.predict(preprocessor.transform(test_set))
# Calculate and print accuracy

accuracy = accuracy_score(test_y, predictions)
print("Logistic Regression Test Accuracy: {:.3f}".format(accuracy))

Logistic Regression Test Accuracy: 0.927


In [95]:
# Make predictions on the train data
predictions_train = logistic_model.predict(preprocessor.transform(train_set))
# Calculate and print accuracy

accuracy = accuracy_score(train_y, predictions_train)
print("Logistic Regression Train Accuracy: {:.3f}".format(accuracy))

Logistic Regression Train Accuracy: 0.923


## Generate the confusion matrix

In [96]:
# Generate the confusion matrix
conf_matrix = confusion_matrix(test_y, predictions)
print("Confusion Matrix:\n", conf_matrix)

Confusion Matrix:
 [[326   5   5  29]
 [  1 247   2   8]
 [  6   0 256  36]
 [  5   6   6 562]]
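
scikit-learn's `confusion_matrix` puts the true classes in rows and the predicted classes in columns, in sorted label order. As a minimal sketch, the matrix printed above can be turned into per-class recall and precision; the class names are left out because only the row order is known from the output.

```python
import numpy as np

# Test-set confusion matrix printed above
# (rows = true class, columns = predicted class)
conf = np.array([[326,   5,   5,  29],
                 [  1, 247,   2,   8],
                 [  6,   0, 256,  36],
                 [  5,   6,   6, 562]])

recall = conf.diagonal() / conf.sum(axis=1)      # correct / all rows of that true class
precision = conf.diagonal() / conf.sum(axis=0)   # correct / all predictions of that class
overall_accuracy = conf.trace() / conf.sum()     # correct / all test rows

print(np.round(recall, 3))
print(np.round(precision, 3))
print(round(overall_accuracy, 3))
```

Most of the off-diagonal counts sit in the last column (29, 8 and 36), which suggests the model's errors are mainly other classes being predicted as the largest class.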


# Try another classifier we have covered so far

In [97]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Initialize the Random Forest model
rf_model = RandomForestClassifier(n_estimators=50, random_state=42)

# Fit the model on the training data
rf_model.fit(train_x, train_y)

## Accuracy

In [98]:
# Make predictions on the test data
rf_predictions = rf_model.predict(preprocessor.transform(test_set))

# Calculate and print accuracy
rf_accuracy = accuracy_score(test_y, rf_predictions)
print("Random Forest Test Accuracy: {:.3f}".format(rf_accuracy))

Random Forest Test Accuracy: 0.935


In [99]:
# Make predictions on the train data
rf_predictions_train = rf_model.predict(preprocessor.transform(train_set))

# Calculate and print accuracy
rf_accuracy_train = accuracy_score(train_y, rf_predictions_train)
print("Random Forest Train Accuracy: {:.3f}".format(rf_accuracy_train))

Random Forest Train Accuracy: 1.000


## Generate the confusion matrix

In [100]:
# Generate the confusion matrix
rf_conf_matrix = confusion_matrix(test_y, rf_predictions)
print("Confusion Matrix:\n", rf_conf_matrix)

Confusion Matrix:
 [[330   7   6  22]
 [  0 253   2   3]
 [  6   0 267  25]
 [ 10   8   9 552]]


# Check for the number of components

**Determine whether increasing/decreasing the number of components increases/decreases the two models' accuracies** Discuss your findings below.

In [101]:
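# `accuracy` was reassigned to the train accuracy in the earlier train-accuracy cell,
# so recompute the logistic regression test accuracy from `predictions`
print("Logistic Regression Test Accuracy: {:.3f}".format(accuracy_score(test_y, predictions)))
print("Random Forest Test Accuracy: {:.3f}".format(rf_accuracy))

Logistic Regression Test Accuracy: 0.927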
Random Forest Test Accuracy: 0.935


### 100 Components

- Logistic Regression Test Accuracy: 0.923
- Random Forest Test Accuracy: 0.933

### 300 components

- Logistic Regression Test Accuracy: 0.937
- Random Forest Test Accuracy: 0.929

### 600 Components

- Logistic Regression Test Accuracy: 0.945
- Random Forest Test Accuracy: 0.934

### 1000 Components

- Logistic Regression Test Accuracy: 0.958
- Random Forest Test Accuracy: 0.921
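
The recorded results are easier to compare side by side. The sketch below collects the accuracies listed above into a DataFrame (the column names are chosen here for illustration) and reports where each model peaks and how accuracy changes from one setting to the next.

```python
import pandas as pd

# Test accuracies as recorded in the lists above
results = pd.DataFrame({
    'components': [100, 300, 600, 1000],
    'logistic_regression': [0.923, 0.937, 0.945, 0.958],
    'random_forest': [0.933, 0.929, 0.934, 0.921],
}).set_index('components')

print(results)

# Number of components at which each model scored highest
print(results.idxmax())

# Change in accuracy relative to the previous setting
print(results.diff().round(3))
```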

<span style="color: #81e64b;">For the Random Forest model, test accuracy was 0.933 at 100 components. It dipped to 0.929 at 300 components, recovered to 0.934 at 600, and fell to 0.921 at 1000, so there is no clear trend: the differences are small and change direction from one setting to the next. To find the best number of components, it may be a good idea to use GridSearchCV from scikit-learn for hyper-parameter tuning, which would compare candidate values using cross-validation.</span>
<p>
<span style="color: #81e64b;">For the Logistic Regression model, test accuracy was 0.923 at 100 components and rose with every increase: 0.937 at 300, 0.945 at 600, and 0.958 at 1000. Because accuracy was still improving at the largest setting tested, my next step would be to keep increasing the number of components until accuracy drops off, then search more finely below that point to pin down the best value.</span>
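
A minimal sketch of the GridSearchCV idea mentioned above, under stated assumptions: the corpus, the shortened label names, the `clf` step name and the candidate values are all hypothetical stand-ins, and the grid is kept tiny so the example runs on a dozen documents. Putting the vectorizer, the SVD and the classifier into one pipeline lets the search refit every step inside each cross-validation fold, and the `svd__n_components` key follows scikit-learn's `step__parameter` naming convention.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical mini-corpus standing in for the product descriptions
docs = [
    'black ink cartridge for printer', 'colour ink cartridge refill',
    'laser printer toner cartridge', 'mini binoculars telescope lens',
    'navy blue summer dress', 'blue cotton summer shirt',
    'elastic shoe laces unisex', 'cotton pregnancy dress',
    'travel book about every country', 'cookbook with easy recipes',
    'history book of ancient india', 'paperback novel bestseller',
]
labels = ['Electronics'] * 4 + ['Clothing'] * 4 + ['Books'] * 4

pipe = Pipeline(steps=[
    ('text', TfidfVectorizer(stop_words='english')),
    ('svd', TruncatedSVD(n_iter=10, random_state=42)),
    ('clf', LogisticRegression(max_iter=1000, random_state=42)),
])

# Placeholder candidates; on the real data one might try 100, 300, 600 and 1000
param_grid = {'svd__n_components': [2, 3, 4]}

# cv=2 only because the toy corpus is tiny; 5 folds is a more common choice
search = GridSearchCV(pipe, param_grid, cv=2, scoring='accuracy')
search.fit(docs, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data, the same pattern would take the raw descriptions and `train_y` as input, and swapping the `clf` step for a `RandomForestClassifier` would tune the Random Forest in the same way.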