<h1> Text Classifier </h1>

<h3> 1: Setting up the Notebook </h3>

In [1]:
import pandas as pd
import numpy as np

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn

from sklearn.feature_extraction.text import TfidfVectorizer

# Import machine learning algorithms
from sklearn import model_selection, naive_bayes, svm, tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Import accuracy checker
from sklearn.metrics import accuracy_score

# Import pickle to save models
import pickle

# Import matplotlib.pyplot for graphs
import matplotlib.pyplot as plt

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# Set random seed
np.random.seed(500)

In [3]:
# Read data
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

<h3> 2: Data Exploration </h3>
<p> We first performed some descriptive analytics to better understand the train and test data that we were working with. </p>

In [4]:
train_df.head()

Unnamed: 0,itemid,title,Category,image_path
0,307504,nyx sex bomb pallete natural palette,0,beauty_image/6b2e9cbb279ac95703348368aa65da09.jpg
1,461203,etude house precious mineral any cushion pearl...,1,beauty_image/20450222d857c9571ba8fa23bdedc8c9.jpg
2,3592295,milani rose powder blush,2,beauty_image/6a5962bed605a3dd6604ca3a4278a4f9.jpg
3,4460167,etude house baby sweet sugar powder,3,beauty_image/56987ae186e8a8e71fcc5a261ca485da.jpg
4,5853995,bedak revlon color stay aqua mineral make up,3,beauty_image/9c6968066ebab57588c2f757a240d8b9.jpg


In [5]:
test_df.head()

Unnamed: 0,itemid,title,image_path
0,370855998,flormar 7 white cream bb spf 30 40ml,beauty_image/1588591395c5a254bab84042005f2a9f.jpg
1,637234604,maybelline clear smooth all in one bb cream sp...,beauty_image/920985ed9587ea20f58686ea74e20f93.jpg
2,690282890,murah innisfree eco natural green tea bb cream...,beauty_image/90b40e5710f54352b243fcfb0f5d1d7f.jpg
3,930913462,loreal white perfect day cream spf 17 pa white...,beauty_image/289c668ef3d70e1d929d602d52d5d78a.jpg
4,1039280071,hada labo cc cream ultimate anti aging spf 35 ...,beauty_image/d5b3e652c5822d2306f4560488ec30c6.jpg


<p> We observe that the test data has the same columns as the train data except that it is missing the "Category" column that we are meant to create with our predictive model. </p>

<p> Moreover, the title data has some stopwords, e.g. "in" and "all" that are unlikely to have much predictive value for our model. </p>

In [4]:
train_df.describe()

Unnamed: 0,itemid,Category
count,666615.0,666615.0
mean,1155562000.0,18.071577
std,522688800.0,13.090931
min,112574.0,0.0
25%,812002100.0,4.0
50%,1252422000.0,18.0
75%,1612608000.0,28.0
max,1868917000.0,57.0


In [5]:
test_df.describe()

Unnamed: 0,itemid
count,172402.0
mean,1152965000.0
std,547206000.0
min,112655.0
25%,779196400.0
50%,1282126000.0
75%,1631265000.0
max,1868894000.0


<p> The title in the first row of the train dataset says "palette". While this word was in singular form, we expect that there would be some titles with "palettes" instead. We thus slice the train dataset to confirm our suspicion. </p>

In [None]:
train_df[train_df['title'].str.contains('palette')]

In [None]:
train_df[train_df['title'].str.contains('palettes')]

<p> We expect that there would be other words, e.g. "color" or "colors" that will have both singular and plural form as well. </p>
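<p> One caveat with the two slices above: a plain substring match for 'palette' also matches 'palettes'. The following is a minimal sketch of separating the two forms with word boundaries. It uses a few hypothetical toy titles (the first is copied from the train data shown earlier) rather than the full dataset. </p>

```python
import pandas as pd

# Hypothetical toy titles; only the first one comes from train.csv
titles = pd.Series([
    "nyx sex bomb pallete natural palette",
    "eyeshadow palettes set",
    "milani rose powder blush",
])

# Plain substring match: True for both the singular and the plural form
either_form = titles.str.contains("palette")

# Word-boundary regexes: match only the exact word
singular_only = titles.str.contains(r"\bpalette\b", regex=True)
plural_only = titles.str.contains(r"\bpalettes\b", regex=True)

print(pd.DataFrame({
    "title": titles,
    "either": either_form,
    "singular": singular_only,
    "plural": plural_only,
}))
```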

<p> <strong> Conclusions </strong> </p>
<p> After looking at the title data, we identified the following issues that would reduce the accuracy of our machine learning model:
<ol>
    <li> Some titles contained punctuation, special characters and numbers that carry little information about the product category. </li>
    <li> Some titles listed products in plural form and others in singular form, e.g. "palette" and "palettes" refer to the same thing but would be registered by a model as different words. </li>
    <li> There were a lot of stopwords, e.g. "the", "and", "in", that we do not expect to have much predictive value. </li>
</ol>
</p>

<h3> 3: Data Pre-Processing </h3>
<p> Having identified the above problems, we ran the title data through a pre-processing pipeline (mainly using the nltk library) to make the titles easier to work with. The pipeline has the following steps: </p>

<ol>
    <li> Removing any blank rows in the data. </li>
    <li> Changing all letters to lowercase, since Python string comparison is case-sensitive and treats 'color' and 'COLOR' as different words. </li>
    <li> Removing punctuation, special characters (like *, | or .) and numbers from the title data. </li>
    <li> Lemmatizing the words (reducing each word to its dictionary form) to remove variance from word inflection (i.e. we want our model to know that "palette" and "palettes" are the same thing). </li>
    <li> Removing stopwords (the, and, in, etc.) because we do not expect them to have much predictive value. </li> 
</ol>

<p> We also split the title string in each row into a list of words to make it easier to process each word individually. </p>
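<p> The lowercase, split and filter steps listed above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: <code>TOY_STOPWORDS</code> and <code>clean_title</code> are hypothetical names, the stopword set is a tiny stand-in for the nltk English list, whitespace splitting stands in for <code>word_tokenize</code>, and lemmatization is left out entirely. </p>

```python
# Hypothetical stand-in for the nltk English stopword list
TOY_STOPWORDS = {"the", "and", "in", "all"}


def clean_title(title):
    """Lowercase a title, split it on whitespace, and keep only
    alphabetic tokens that are not in the toy stopword set."""
    tokens = title.lower().split()
    return [t for t in tokens if t.isalpha() and t not in TOY_STOPWORDS]


# Stopwords "all" and "in" are dropped
cleaned_a = clean_title("Maybelline Clear Smooth ALL in One BB Cream")

# Numeric and mixed tokens such as "7", "30" and "40ml" are dropped
cleaned_b = clean_title("flormar 7 white cream bb spf 30 40ml")

print(cleaned_a)
print(cleaned_b)
```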

In [None]:
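# Step 1: Remove rows with a missing title
# (calling dropna on the selected column alone would leave train_df itself unchanged)
train_df = train_df.dropna(subset=['title']).reset_index(drop=True)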
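<p> A minimal sketch of removing rows with a missing title at the DataFrame level, shown on a hypothetical three-row frame. Resetting the index afterwards keeps the row labels contiguous, which matters for any later step that addresses rows by position. </p>

```python
import pandas as pd

# Hypothetical toy frame with one missing title
toy_df = pd.DataFrame({
    "title": ["milani rose powder blush", None, "etude house baby sweet sugar powder"],
    "Category": [2, 0, 3],
})

# Drop rows whose 'title' is missing, then renumber the remaining rows from 0
cleaned_df = toy_df.dropna(subset=["title"]).reset_index(drop=True)

print(cleaned_df)
```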

In [None]:
# Step 2: Change all the text to lower case
train_df['title'] = [entry.lower() for entry in train_df['title']]

In [None]:
# Step 3: Tokenization: each title in train_df is split into a list of words
train_df['title']= [word_tokenize(entry) for entry in train_df['title']]

In [None]:
# Step 4: Remove stopwords, punctuation, special characters and numeric tokens, then lemmatize each word
# WordNetLemmatizer needs a POS tag to know whether a word is a noun, verb, adjective or adverb; the default is noun
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

# Build the stopword set and the lemmatizer once instead of once per word or row
stop_words = set(stopwords.words('english'))
word_lemmatized = WordNetLemmatizer()

title_final = []
for entry in train_df['title']:
    # Words from this title that pass the filters below
    final_words = []
    # pos_tag labels each token with a tag, e.g. one starting with 'N' (noun) or 'V' (verb)
    for word, tag in pos_tag(entry):
        # Keep only alphabetic tokens that are not stopwords
        if word not in stop_words and word.isalpha():
            word_final = word_lemmatized.lemmatize(word, tag_map[tag[0]])
            final_words.append(word_final)
    title_final.append(str(final_words))

# Store the processed words for every row in 'title_final'
train_df['title_final'] = title_final
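<p> To illustrate how the tag lookup in the cell above resolves, here is a small sketch that avoids nltk altogether. It assumes the WordNet constants <code>wn.NOUN</code>, <code>wn.ADJ</code>, <code>wn.VERB</code> and <code>wn.ADV</code> have the values 'n', 'a', 'v' and 'r', and it uses a handful of example Penn Treebank tags of the kind <code>pos_tag</code> returns. </p>

```python
from collections import defaultdict

# Assumption: these letters are the values of wn.NOUN, wn.ADJ, wn.VERB, wn.ADV
NOUN, ADJ, VERB, ADV = "n", "a", "v", "r"

# Any tag whose first letter is not J, V or R falls back to NOUN
tag_map = defaultdict(lambda: NOUN)
tag_map["J"] = ADJ
tag_map["V"] = VERB
tag_map["R"] = ADV

# Example tags: plural noun, adjective, past-tense verb, adverb, cardinal number
penn_tags = ["NNS", "JJ", "VBD", "RB", "CD"]

# Only the first letter of each tag is used for the lookup
resolved = {tag: tag_map[tag[0]] for tag in penn_tags}
print(resolved)
```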

In [None]:
# Save clean data for future use
train_df.to_pickle("train_df_cleaned.pickle")

In [None]:
# Split the labelled train data into a training set (70%) and a validation set (30%)
train_X, test_X, train_Y, test_Y = model_selection.train_test_split(train_df['title_final'], train_df['Category'],test_size=0.3)

<h4> Extracting Features from Title Data </h4>
<p> In order to run machine learning algorithms, we need to convert the titles into numerical feature vectors. We chose the Term Frequency - Inverse Document Frequency (TF-IDF) method, which down-weights words that occur in many titles (e.g. "color") relative to rarer, more distinctive words. </p>
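<p> As a worked illustration of the down-weighting, the sketch below computes the smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the formula TfidfVectorizer applies with its default settings. The three toy titles and the helper <code>smoothed_idf</code> are hypothetical, and the sketch stops at the IDF term: it does not reproduce the term-frequency part or the row normalization that the vectorizer also performs. </p>

```python
import math

# Hypothetical toy corpus: "color" appears in every title, the other words once
toy_titles = [
    ["color", "palette"],
    ["color", "blush"],
    ["color", "powder"],
]
n_docs = len(toy_titles)


def smoothed_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    doc_freq = sum(term in title for title in toy_titles)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


idf = {term: smoothed_idf(term) for term in ["color", "palette"]}

# The word found in every title receives the lowest possible weight
print(idf)
```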

In [None]:
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(train_df['title_final'])
train_X_Tfidf = Tfidf_vect.transform(train_X)
test_X_Tfidf = Tfidf_vect.transform(test_X)

In [None]:
print(Tfidf_vect.vocabulary_)

In [None]:
train_df.to_pickle("train_df.pkl")

<h3> 4: Machine Learning Models </h3>
<p> We then tested out a few different supervised machine learning models and chose the one with the highest accuracy. </p>
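<p> The compare-and-pick procedure described above can be sketched as a loop over candidate models. This is an illustration on hypothetical toy titles with only two of the classifiers, not the notebook's actual experiment; the names <code>candidates</code>, <code>scores</code> and <code>best_name</code> are made up for the example. </p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: class 0 = palette-like titles, class 1 = powder-like titles
toy_train = ["eyeshadow palette", "natural palette", "face powder", "loose powder"]
toy_train_y = [0, 0, 1, 1]
toy_valid = ["matte palette", "mineral powder"]
toy_valid_y = [0, 1]

# Fit the vectorizer on the training titles, then transform both sets
vect = TfidfVectorizer()
X_train = vect.fit_transform(toy_train)
X_valid = vect.transform(toy_valid)

candidates = {
    "Logistic Regression": LogisticRegression(),
    "Naive Bayes": MultinomialNB(),
}

# Fit each candidate and record its validation accuracy in percent
scores = {}
for name, model in candidates.items():
    model.fit(X_train, toy_train_y)
    scores[name] = accuracy_score(toy_valid_y, model.predict(X_valid)) * 100

# Pick the candidate with the highest validation accuracy
best_name = max(scores, key=scores.get)
print(scores, "->", best_name)
```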

<h4> Logistic Regression </h4>

In [None]:
LogRegression = LogisticRegression()
LogRegression.fit(train_X_Tfidf, train_Y)

# predict the labels on validation dataset
predictions_LogRegression = LogRegression.predict(test_X_Tfidf)

# Use accuracy_score function to get the accuracy
print("Logistic Regression Accuracy Score -> ", accuracy_score(predictions_LogRegression, test_Y)*100)
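<p> For reference, the accuracy reported above is the share of validation titles whose predicted category equals the true one. A plain-Python sketch with hypothetical labels follows; the helper <code>accuracy</code> is not part of sklearn and is defined here only for illustration. Because the measure only counts matches, swapping the two arguments gives the same value. </p>

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Hypothetical labels: three of the four predictions are correct
y_true = [0, 1, 2, 3]
y_pred = [0, 1, 2, 0]

score = accuracy(y_true, y_pred) * 100
print("Toy accuracy ->", score)
```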

In [None]:
import pickle
filename_LogReg = 'finalized_model_LogReg.sav'
pickle.dump(LogRegression, open(filename_LogReg, 'wb'))

<h4> Naive Bayes Classifier Algorithm </h4>

In [None]:
# fit the training dataset on the NB classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(train_X_Tfidf,train_Y)

# predict the labels on validation dataset
predictions_NB = Naive.predict(test_X_Tfidf)

# Use accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score -> ", accuracy_score(predictions_NB, test_Y)*100)

In [None]:
filename_Naive = 'finalized_model_NaiveBayes.sav'
pickle.dump(Naive, open(filename_Naive, 'wb'))

<h4> Support Vector Machine </h4>

In [None]:
# fit the training dataset on the classifier
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(train_X_Tfidf, train_Y)

# predict the labels on validation dataset
predictions_SVM = SVM.predict(test_X_Tfidf)

# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ", accuracy_score(predictions_SVM, test_Y)*100)

In [None]:
filename_SVM = 'finalized_model_SVM.sav'
pickle.dump(SVM, open(filename_SVM, 'wb'))

<h4> Decision Tree </h4>

In [None]:
# fit the training dataset on the classifier
DecisionTree = tree.DecisionTreeClassifier()
DecisionTree.fit(train_X_Tfidf, train_Y)

# predict the labels on validation dataset
predictions_DecisionTree = DecisionTree.predict(test_X_Tfidf)

# use accuracy_score function to get accuracy
print("Decision Tree Accuracy Score -> ", accuracy_score(predictions_DecisionTree, test_Y)*100)

In [None]:
filename_DecisionTree = 'finalized_model_DecisionTree.sav'
pickle.dump(DecisionTree, open(filename_DecisionTree, 'wb'))

<h4> K-Nearest Neighbors </h4>

In [None]:
# fit the training dataset on the classifier
KNN = KNeighborsClassifier(n_neighbors = 7)
KNN.fit(train_X_Tfidf, train_Y)

# predict the labels on validation dataset
predictions_KNN = KNN.predict(test_X_Tfidf)

# use accuracy_score function to get accuracy
print("KNN Accuracy Score -> ", accuracy_score(predictions_KNN, test_Y)*100)

In [None]:
filename_KNN = 'finalized_model_KNN.sav'
pickle.dump(KNN, open(filename_KNN, 'wb'))

<h4> Stochastic Gradient Descent </h4>

In [None]:
# fit the training dataset on the classifier
SGD = SGDClassifier(loss = "hinge", penalty = "l2", max_iter = 5)
SGD.fit(train_X_Tfidf, train_Y)

# predict the labels on validation dataset
predictions_SGD = SGD.predict(test_X_Tfidf)

# use accuracy_score function to get accuracy
print("SGD Accuracy Score -> ", accuracy_score(predictions_SGD, test_Y)*100)

In [None]:
filename_SGD = 'finalized_model_SGD.sav'
pickle.dump(SGD, open(filename_SGD, 'wb'))

<h3> 5: Improving the Model </h3>
<h4> Logistic Regression </h4>

In [None]:
# optimizing the Logistic Regression Model

# account for class imbalances (if any)
LogRegression2 = LogisticRegression(class_weight = 'balanced')
LogRegression2.fit(train_X_Tfidf, train_Y)

# predict the labels on validation dataset
predictions_LogRegression2 = LogRegression2.predict(test_X_Tfidf)

# use accuracy_score function to get the accuracy
print("Logistic Regression V2 Accuracy Score -> ", accuracy_score(y_true=test_Y, y_pred=predictions_LogRegression2)*100)

# final result
# Logistic Regression V2 Accuracy Score ->  68.47013526014452
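<p> For context on the 'balanced' option used above: sklearn weights each class by n_samples / (n_classes * class_count), so rarer categories count for more during fitting. The sketch below reproduces that formula by hand on hypothetical toy labels. </p>

```python
from collections import Counter

# Hypothetical toy labels: class 0 is three times as frequent as class 1
toy_labels = [0, 0, 0, 1]

counts = Counter(toy_labels)
n_samples = len(toy_labels)
n_classes = len(counts)

# Balanced weight per class: n_samples / (n_classes * class_count)
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}
print(weights)
```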

In [None]:
# optimizing the Logistic Regression Model

# changing the solver from the default of 'liblinear' which does not handle multinomial loss
# note that 'lbfgs' generates a convergence warning
LogRegression3 = LogisticRegression(solver = 'newton-cg')
LogRegression3.fit(train_X_Tfidf, train_Y)

# predict the labels on validation dataset
predictions_LogRegression3 = LogRegression3.predict(test_X_Tfidf)

# use accuracy_score function to get the accuracy
print("Logistic Regression V3 Accuracy Score -> ", accuracy_score(y_true=test_Y, y_pred=predictions_LogRegression3)*100)

# final result
# Logistic Regression V3 Accuracy Score ->  71.07633072480436

In [None]:
# optimizing the Logistic Regression Model

# adding multi_class = 'auto'
LogRegression4 = LogisticRegression(solver = 'newton-cg', multi_class = 'auto')
LogRegression4.fit(train_X_Tfidf, train_Y)

# predict the labels on validation dataset
predictions_LogRegression4 = LogRegression4.predict(test_X_Tfidf)

# use accuracy_score function to get the accuracy
print("Logistic Regression V4 Accuracy Score -> ", accuracy_score(y_true=test_Y, y_pred=predictions_LogRegression4)*100)

# final result
# Logistic Regression V4 Accuracy Score ->  71.17683826286971

In [None]:
# optimizing the Logistic Regression Model

# changing the solver from the default of 'liblinear' 
LogRegression5 = LogisticRegression(solver = 'saga', multi_class = 'auto')
LogRegression5.fit(train_X_Tfidf, train_Y)

# predict the labels on validation dataset
predictions_LogRegression5 = LogRegression5.predict(test_X_Tfidf)

# Use accuracy_score function to get the accuracy
print("Logistic Regression V5 Accuracy Score -> ", accuracy_score(y_true=test_Y, y_pred=predictions_LogRegression5)*100)

# final result
# Logistic Regression V5 Accuracy Score ->  71.1803385253894

In [None]:
# optimizing the Logistic Regression Model

# changing the solver from the default of 'liblinear' 
LogRegression6 = LogisticRegression(solver = 'sag', multi_class = 'auto')
LogRegression6.fit(train_X_Tfidf, train_Y)

# predict the labels on validation dataset
predictions_LogRegression6 = LogRegression6.predict(test_X_Tfidf)

# Use accuracy_score function to get the accuracy
print("Logistic Regression V6 Accuracy Score -> ", accuracy_score(y_true=test_Y, y_pred=predictions_LogRegression6)*100)

# final result
# Logistic Regression V6 Accuracy Score -> 71.17733830037253

<p> The Logistic Regression using the SAGA solver and auto multiclass had the highest accuracy. Thus, we saved it and used it to predict the categories of data in 'test.csv'. </p>
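<p> The comparison behind this choice can be made explicit with the accuracies copied from the result comments in the cells above. The dictionary keys are labels made up for this sketch. Note that the gap between the top solvers is a few thousandths of a percentage point, so on a different random split the ranking might change. </p>

```python
# Validation accuracies (in percent) copied from the result comments above
reported = {
    "V2 (balanced)": 68.47013526014452,
    "V3 (newton-cg)": 71.07633072480436,
    "V4 (newton-cg, auto)": 71.17683826286971,
    "V5 (saga, auto)": 71.1803385253894,
    "V6 (sag, auto)": 71.17733830037253,
}

# Variant with the highest reported accuracy
best = max(reported, key=reported.get)

# Gap between the best variant and the runner-up, in percentage points
ranked = sorted(reported.values(), reverse=True)
gap = ranked[0] - ranked[1]

print(best, "leads by", round(gap, 4), "percentage points")
```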

In [None]:
import pickle
filename_LogReg = 'LogReg_SagaSolver.sav'
pickle.dump(LogRegression5, open(filename_LogReg, 'wb'))

<h4> Using Entire 'train.csv' </h4>
<p> After finding the best model, we chose to use all the train data to train the model. </p>

In [4]:
train_X = train_df['title_final']
train_Y = train_df['Category']

In [5]:
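Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(train_df['title_final'])
train_X_Tfidf = Tfidf_vect.transform(train_X)
# test_df['title_final'] is created by the pre-processing in Section 6, which must be run before this line
test_Tfidf = Tfidf_vect.transform(test_df['title_final'])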

In [None]:
# Log Regression with all data using solver = 'saga', multi_class = 'auto'
LogRegressionAll = LogisticRegression(solver = 'saga', multi_class = 'auto')
LogRegressionAll.fit(train_X_Tfidf, train_Y)

In [22]:
filename_LogReg = 'LogReg_SagaSolver_All.sav'
pickle.dump(LogRegressionAll, open(filename_LogReg, 'wb'))

<h3> 6: Predictions </h3>
<p> In order to make use of our machine learning models, we first need to process the test data. </p>

In [None]:
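# Step 1: Remove rows with a missing title
# (calling dropna on the selected column alone would leave test_df itself unchanged)
test_df = test_df.dropna(subset=['title']).reset_index(drop=True)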

In [None]:
# Step 2: Change all the text to lower case
test_df['title'] = [entry.lower() for entry in test_df['title']]

In [None]:
# Step 3: Tokenization: each title in test_df is split into a list of words
test_df['title']= [word_tokenize(entry) for entry in test_df['title']]

In [None]:
# Step 4: Remove stopwords, punctuation, special characters and numeric tokens, then lemmatize each word
# WordNetLemmatizer needs a POS tag to know whether a word is a noun, verb, adjective or adverb; the default is noun
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

# Build the stopword set and the lemmatizer once instead of once per word or row
stop_words = set(stopwords.words('english'))
word_lemmatized = WordNetLemmatizer()

title_final = []
for entry in test_df['title']:
    # Words from this title that pass the filters below
    final_words = []
    # pos_tag labels each token with a tag, e.g. one starting with 'N' (noun) or 'V' (verb)
    for word, tag in pos_tag(entry):
        # Keep only alphabetic tokens that are not stopwords
        if word not in stop_words and word.isalpha():
            word_final = word_lemmatized.lemmatize(word, tag_map[tag[0]])
            final_words.append(word_final)
    title_final.append(str(final_words))

# Store the processed words for every row in 'title_final'
test_df['title_final'] = title_final

In [None]:
# Save the cleaned test data (not train_df) for the prediction step
test_df.to_pickle("test_df_cleaned.pkl")

<h3> 7: Loading and Using the Models </h3>
<p> We opened the models previously saved and then used those models to predict our test data. We then added the predictions as a 'Category' column to the existing dataframe. </p>

In [2]:
# Open cleaned test dataframe
with open('test_df_cleaned.pkl', 'rb') as test:
    test_df = pickle.load(test)

# Open cleaned train dataframe
with open('train_df_cleaned.pickle', 'rb') as train:
    train_df = pickle.load(train)

In [3]:
# Open final model
with open('LogReg_SagaSolver_All.sav', 'rb') as f:
    LogReg_model = pickle.load(f)

In [None]:
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(train_df['title_final'])
test_Tfidf = Tfidf_vect.transform(test_df['title_final'])
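<p> The cell above refits the vectorizer on the cleaned train data so that the test features line up with the ones the model was trained on. One alternative, sketched below on hypothetical toy titles, would be to pickle the fitted vectorizer next to the model and load it back, which avoids the refit. This is an assumption about a possible workflow, not something the notebook does. </p>

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy titles standing in for train_df['title_final']
toy_train_titles = ["natural palette", "rose powder blush", "mineral powder"]

vect = TfidfVectorizer(max_features=5000)
vect.fit(toy_train_titles)

# Serialize the fitted vectorizer in memory (a file would work the same way)
payload = pickle.dumps(vect)
restored_vect = pickle.loads(payload)

# The restored vectorizer has the same vocabulary and yields the same features
new_titles = ["mineral powder palette"]
original_features = vect.transform(new_titles).toarray().tolist()
restored_features = restored_vect.transform(new_titles).toarray().tolist()

print(restored_vect.vocabulary_ == vect.vocabulary_)
print(original_features == restored_features)
```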

In [None]:
# Predict categories
predictions_LogReg = LogReg_model.predict(test_Tfidf)

In [None]:
# Add predicted categories to test_df
test_df['Category'] = predictions_LogReg

In [None]:
# View the predictions
fig, axl = plt.subplots()
axl.set_title("Category Predictions")
axl.set_ylabel("Count")
axl.set_xlabel("Category")
test_df.Category.value_counts().sort_index().plot(ax = axl, marker = '.')

<h4> Submission File </h4>

In [18]:
# Drop irrelevant columns
submission_df = test_df.loc[:, ['itemid', 'Category']]

In [20]:
# Save the submission_df file
submission_df.to_csv("submission.csv", index=False)