### Import Libraries
Here I import all the libraries required for this project. I use the `pandas` library to read the dataset and the `pickle` library to store the trained models.

For this project I use a `Naive Bayes` model, which I import from the `sklearn` library.

In [1]:
import re
import os
import time
import pickle
import random
import pandas as pd
import numpy as np
from nltk.stem import PorterStemmer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Used to hide the warnings from pandas library while replacing a value in a df copy. 
pd.options.mode.chained_assignment = None  # default='warn'

### Load and explore the data
Now I will load the dataset. The dataframe has four columns:
- `Description`: This is the feature column; it contains the text descriptions to be categorised.
- `Level_1`: First level of categories.
- `Level_2`: Second level of categories.
- `Level_3`: Third level of categories.

In [2]:
dataframe = pd.read_csv("product-cat-dataset.csv")
dataframe

Unnamed: 0,Description,Level_1,Level_2,Level_3
0,gerb cap help keep littl on head cov warm day ...,09BF5150,C7E19,D06E
1,newborn inf toddl boy hoody jacket oshkosh b g...,2CEC27F1,ADAD6,98CF
2,tut ballet anym leap foxy fash ruffl tul toddl...,09BF5150,C7E19,D06E
3,newborn inf toddl boy hoody jacket oshkosh b g...,2CEC27F1,ADAD6,98CF
4,easy keep feel warm cozy inf toddl girl hoody ...,2CEC27F1,ADAD6,98CF
...,...,...,...,...
10644,term 10 issu on year subscriptionyo sav 75 cov...,90A8B052,C719A,A0E2
10645,term 12 issu on year subscriptionyo sav 86 cov...,90A8B052,C719A,A0E2
10646,term 9 issu on year subscriptionyo sav 64 cov ...,90A8B052,C719A,A0E2
10647,term 26 issu on year subscriptionyo sav 54 cov...,90A8B052,C719A,A0E2


Now I will use the `describe` method of the pandas library. It computes summary statistics such as percentiles, mean and standard deviation for numerical columns. Our dataframe has no numerical data, so for these object (text) columns it instead reports the count of non-null values, the number of unique values, the most frequent value (`top`) and how often it occurs (`freq`).

In [3]:
dataframe.describe()

Unnamed: 0,Description,Level_1,Level_2,Level_3
count,10637,10649,10649,10649
unique,9677,15,39,43
top,glory gorg col fing complet outfit express moo...,B092BA29,2D5A3,28A7
freq,24,900,797,797
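
As a small illustration of what `describe` reports for text columns, here is a minimal sketch on a made-up three-row frame. The values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only
toy = pd.DataFrame({
    "Description": ["red shirt", "red shirt", None],
    "Level_1": ["A", "A", "B"],
})

# With only text columns, describe() reports count (non-null values),
# unique, top (most frequent value) and freq (how often top occurs)
summary = toy.describe()
print(summary)
```

The `count` row is the number of non-null values, which is why `Description` has a lower count than the label columns whenever some descriptions are missing.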


### Deal with Missing Data
Now I will check for missing values in the dataframe using the `isnull()` method. This method returns a copy of the dataframe filled with boolean values, where every cell that holds a missing value is `True`. Calling `sum()` on the result then counts how many missing values each column has.

In [4]:
# Check if data has missing values in the Description column
dataframe.isnull().sum()

Description    12
Level_1         0
Level_2         0
Level_3         0
dtype: int64

As we can see, `Description` is the only column with missing entries (**12** of them), so I will delete those rows from the dataframe. The cell below collects the index of every row whose `Description` is missing.

In [5]:
missing_values_index = dataframe[dataframe['Description'].isnull()].index.tolist()

Next I drop those rows and check again whether any missing values remain in the dataset.

In [6]:
# Deal with missing values
dataframe.drop(dataframe.index[missing_values_index], inplace=True)
dataframe.isnull().any()

Description    False
Level_1        False
Level_2        False
Level_3        False
dtype: bool
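
The same cleaning step can also be written in one call. This is a minimal sketch on a hypothetical toy frame, using `dropna` with the `subset` argument instead of collecting the indexes by hand; the optional `reset_index` is my addition and is not something the cells above do.

```python
import pandas as pd

# Hypothetical toy frame with one missing description
toy = pd.DataFrame({
    "Description": ["warm hat", None, "toddler jacket"],
    "Level_1": ["A", "B", "A"],
})

# Drop rows whose Description is missing, then renumber the index from 0
cleaned = toy.dropna(subset=["Description"]).reset_index(drop=True)
print(cleaned.isnull().sum())
```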

### Drop Classes where the number of instances is < 10
In this part I will check the number of instances for each class level. If any class has fewer than **10 instances**, I will delete it from the dataframe.

To implement this, I will create a helper function that takes the dataframe and a class level as parameters, deletes the classes with fewer than 10 instances, and returns the filtered dataframe.

In [7]:
def filter_dataframe(df, class_level):
    class_freq = df[class_level].value_counts().to_dict()
    delete_flag = False
    classes_to_delete = []
    
    # Iterating over each class frequency
    for key, value in class_freq.items():
        if value < 10:
            delete_flag = True
            print("Deleting Class {} with {} number of instances.".format(key, value))
            classes_to_delete.append(key)

    # Selecting a copy of dataframe in which those classes will not be added which we have to delete
    df = df.loc[~df[class_level].isin(classes_to_delete)]
    
    if not delete_flag:
        print("No class Found with less than 10 instances")
    else:
        print("Class Frequencies after filtering dataframe")        
        
    updated_class_freq = df[class_level].value_counts().to_dict()

    for key, value in updated_class_freq.items():
        print("Class {} having {} number of instances.".format(key, value))
        
    return df
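
The filtering itself can be expressed without an explicit loop. The sketch below is a hypothetical, print-free variant (the name `filter_rare_classes` and the `min_count` parameter are my own) that keeps only the rows whose class occurs often enough.

```python
import pandas as pd

def filter_rare_classes(df, class_level, min_count=10):
    """Hypothetical helper: keep rows whose class has at least min_count rows."""
    counts = df[class_level].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[class_level].isin(keep)]

# Toy data: class X has 12 rows, class Y only 3
toy = pd.DataFrame({"Level_2": ["X"] * 12 + ["Y"] * 3})
filtered = filter_rare_classes(toy, "Level_2")
print(filtered["Level_2"].value_counts().to_dict())
```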


In [8]:
# Apply to Level_1 
dataframe = filter_dataframe(dataframe, class_level="Level_1")

No class Found with less than 10 instances
Class B092BA29 having 900 number of instances.
Class 35E04739 having 896 number of instances.
Class AAC8EE56 having 890 number of instances.
Class 57164AC1 having 877 number of instances.
Class 2CEC27F1 having 859 number of instances.
Class EFEF723B having 800 number of instances.
Class 09BF5150 having 799 number of instances.
Class 69286F45 having 797 number of instances.
Class 96F95EEC having 587 number of instances.
Class 3E1E0D78 having 579 number of instances.
Class 4C3D8686 having 574 number of instances.
Class 4513C920 having 558 number of instances.
Class 014303D1 having 511 number of instances.
Class 90A8B052 having 506 number of instances.
Class D410C91A having 504 number of instances.


In [9]:
# Apply to Level_2
dataframe = filter_dataframe(dataframe, class_level="Level_2")

Deleting Class 80D5B with 6 number of instances.
Deleting Class A6301 with 1 number of instances.
Deleting Class C66C5 with 1 number of instances.
Class Frequencies after filtering dataframe
Class 2D5A3 having 797 number of instances.
Class ACD06 having 504 number of instances.
Class C719A having 482 number of instances.
Class 9D9EE having 462 number of instances.
Class 5A8AB having 450 number of instances.
Class 375FE having 450 number of instances.
Class BAE8A having 449 number of instances.
Class B2DB4 having 449 number of instances.
Class CB803 having 448 number of instances.
Class 9B69F having 447 number of instances.
Class 74974 having 446 number of instances.
Class 914A1 having 443 number of instances.
Class 390F1 having 441 number of instances.
Class 94728 having 440 number of instances.
Class C7E19 having 429 number of instances.
Class 7B638 having 420 number of instances.
Class A04D3 having 411 number of instances.
Class ADAD6 having 410 number of instances.
Class F4055 havin

In [10]:
# Apply to Level_3
dataframe = filter_dataframe(dataframe, class_level="Level_3")

Deleting Class DE3D with 1 number of instances.
Deleting Class CF52 with 1 number of instances.
Class Frequencies after filtering dataframe
Class 28A7 having 797 number of instances.
Class 33D1 having 504 number of instances.
Class A0E2 having 482 number of instances.
Class 05A0 having 462 number of instances.
Class 1F61 having 450 number of instances.
Class AA6B having 450 number of instances.
Class 21DA having 449 number of instances.
Class 627D having 448 number of instances.
Class 2ABA having 448 number of instances.
Class 80C4 having 447 number of instances.
Class 62E8 having 446 number of instances.
Class D97D having 443 number of instances.
Class 6856 having 441 number of instances.
Class 5912 having 439 number of instances.
Class D06E having 429 number of instances.
Class 0F8B having 420 number of instances.
Class C5B4 having 411 number of instances.
Class 98CF having 410 number of instances.
Class 6539 having 282 number of instances.
Class 078B having 264 number of instances.


### Now let's write a Function to Prepare Text
We will apply it to our DataFrame later on. This function receives a text string and performs the following:

* Convert text to lower case
* Remove punctuation marks
* Apply stemming using the popular Snowball or Porter Stemmer (optional)
* Apply NGram Tokenisation
* Return the tokenised text as a list of strings

In [11]:
# Function for fetching n-grams
def get_n_gram(words_list, n=1):
    n_grams = []
    
    for i in range(len(words_list)):
        gram = []
        j = i
        for _ in range(n):
            gram.append(words_list[j])
            j += 1
            
            if j >= len(words_list):
                break
                
        if len(gram) == n:
            n_grams.append(gram)    
    
    return n_grams


def process_text(text, n = 1):
    """
    Takes in a string of text, then performs the following:
    1. Convert text to lower case and remove all punctuation
    2. Optionally apply stemming
    3. Apply Ngram Tokenisation
    4. Returns the tokenised text as a list
    """
    # Lower-case the text, then strip every character except letters and spaces (digits are removed too)
    text = text.lower()
    text = re.sub(r'[^a-zA-Z ]+', '',text)
    words = text.split(" ")
    stem_words = []
    
    porter = PorterStemmer()
    for word in words:
        stem_words.append(porter.stem(word))
    
    n_grams = get_n_gram(stem_words, n)
    tokenised = []
    
    for gram in n_grams:
        sentence = " ".join(gram)
        tokenised.append(sentence)
    
    return tokenised
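
For comparison, the n-gram step can be written as a single list comprehension over slices. This is only a sketch of the same idea; the function name `ngrams` is hypothetical and it joins each n-gram into a string directly.

```python
def ngrams(tokens, n=1):
    """Return every run of n consecutive tokens, joined by spaces."""
    # range() is empty when n is larger than the number of tokens
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["here", "were", "test", "the"]
bigrams = ngrams(tokens, n=2)
print(bigrams)
```

A list of 4 tokens yields 3 bigrams, and asking for n-grams longer than the token list returns an empty list.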

In [12]:
# Here is an example function call
process_text("Here we're testing the process_text function, results are as follows:", n = 3)

['here were test',
 'were test the',
 'test the processtext',
 'the processtext function',
 'processtext function result',
 'function result are',
 'result are as',
 'are as follow']

In [13]:
# Results should look like this:
['here were test',
 'were test the',
 'test the processtext',
 'the processtext function',
 'processtext function result',
 'function result are',
 'result are as',
 'are as follow']

### Now let's apply TF-IDF to extract features from plain text

In [14]:
# Might take a while...
# Apply the process_text function to the Description column of the data,
# then pass the results to the bag-of-words transformer
# See here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
dataframe["Description"] = dataframe["Description"].values.astype('unicode')
descriptions = dataframe["Description"].to_numpy()
labels = dataframe.drop(["Description"], axis=1)
proccessed_features = []

for description in descriptions:
    proccessed_features.append(process_text(description, n = 2))
    
# The text is already lower-cased and split into n-grams, so we turn off lowercasing
# and pass an identity analyzer that uses each token list as-is
vectorizer = CountVectorizer(lowercase=False, analyzer=lambda x:x)
count_vectorizer = vectorizer.fit_transform(proccessed_features)
count_vectorizer

<10627x154666 sparse matrix of type '<class 'numpy.int64'>'
	with 336704 stored elements in Compressed Sparse Row format>

The bag-of-words counts for the entire corpus are now stored in a large sparse matrix with 10627 rows (one per description) and 154666 columns (one per distinct n-gram). Next I will pass this matrix to `TfidfTransformer`, which converts the raw counts into TF-IDF weights:

In [15]:
# After that you pass the result of the previous step to sklearn's TfidfTransformer
# which will convert them into a feature matrix
# See here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

tfidftransformer = TfidfTransformer()
tfidf_vectorizer = tfidftransformer.fit_transform(count_vectorizer)
tfidf_vectorizer

<10627x154666 sparse matrix of type '<class 'numpy.float64'>'
	with 336704 stored elements in Compressed Sparse Row format>
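
To see what the two steps do on something small, here is a minimal sketch with two hypothetical pre-tokenised documents (they are not from the dataset). It uses the same `CountVectorizer` settings as above and the default `TfidfTransformer`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical documents, already tokenised into bigrams
docs = [["warm hat", "hat toddler"], ["warm hat", "boy jacket"]]

# The identity analyzer makes CountVectorizer use each list as-is
vectorizer = CountVectorizer(lowercase=False, analyzer=lambda x: x)
counts = vectorizer.fit_transform(docs)

# Convert raw counts to TF-IDF weights (rows are L2-normalised by default)
tfidf = TfidfTransformer().fit_transform(counts)

print(sorted(vectorizer.vocabulary_))
print(np.round(tfidf.toarray(), 3))
```

The bigram `warm hat` appears in both documents, so it gets a lower weight than a bigram that appears in only one of them.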

In [16]:
# The resulting matrix is in sparse format, we can transform it into dense
# Code prepared for you so you can see what results look like
text_tfidf = pd.DataFrame(tfidf_vectorizer[:10].toarray())

In [17]:
# This is an example result, the matrix will contain lots of zero values, that is expected
# Some values will be non-zero
text_tfidf.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,154656,154657,154658,154659,154660,154661,154662,154663,154664,154665
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Now the Data is Ready for Classifier Usage

### Split Data into Train and Test sets
Now I will split the dataset into training and testing sets. I will use **10%** of the data for testing and **90%** for training (`test_size=0.1`).

In [18]:
# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(tfidf_vectorizer, labels, test_size=0.1, random_state=42)
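
To check how `test_size` translates into row counts, here is a minimal sketch on 50 hypothetical samples, using the same `test_size=0.1` and `random_state=42` as the call above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 hypothetical samples with 2 features each
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=42)
print(len(X_tr), len(X_te))
```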

In [19]:
X_train = pd.DataFrame(X_train.toarray())
X_test = pd.DataFrame(X_test.toarray())
X_train[:5]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,154656,154657,154658,154659,154660,154661,154662,154663,154664,154665
0,0.063976,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.053572,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
# You might need to reset index in each dataframe (depends on you how you do things)
# done for you to make it clearer
X_train.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)
y_test.reset_index(inplace=True, drop=True)

In [21]:
# You might need to take classes as separate columns (depends on you how you do things)
class1training = y_train['Level_1'].astype(str)
class1testing = y_test['Level_1'].astype(str)

## Model training for the three levels
I will be using a `Naive Bayes` classifier (`GaussianNB` from `sklearn`).

In [22]:
classifier = GaussianNB()

In [23]:
start = time.time()
classifier.fit(X_train, class1training)
stop = time.time()

In [24]:
accuracy = classifier.score(X_test, class1testing)
mins, sec = divmod(stop - start, 60)

print("Testing Accuracy of Level 1 model: {}%".format(round((accuracy * 100) , 2)))
print("It took {} minutes and {} seconds to train the model.".format(round(mins), round(sec)))

Testing Accuracy of Level 1 model: 85.61%
It took 0 minutes and 12 seconds to train the model.
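
`GaussianNB` expects dense input, which is why the feature matrices were converted with `toarray()` before training. As a hedged aside, and not what this notebook uses, scikit-learn's `MultinomialNB` accepts sparse matrices directly. The sketch below trains it on a made-up 4x3 sparse matrix; the numbers and labels are hypothetical, and I have not measured how this model would score on the real dataset.

```python
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical TF-IDF-like features: 4 samples, 3 features, kept sparse
X_sparse = csr_matrix([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.2, 0.8],
])
y = ["A", "A", "B", "B"]

# MultinomialNB is a different model from GaussianNB; it fits on sparse input
model = MultinomialNB()
model.fit(X_sparse, y)

pred = model.predict(csr_matrix([[0.8, 0.0, 0.1]]))
print(pred[0])
```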


In [25]:
# Saving Level 1 model
path = "./Models"
file_name = "Level1_model.pickle"
save_path = os.path.join(path, file_name)

# Create the target directory if it does not exist yet
os.makedirs(path, exist_ok=True)

with open(save_path, 'wb') as f:
    pickle.dump(classifier, f)

Now I will create some helper functions which will help through the training process of Level 2 and Level 3 models.

In [26]:
def get_unique_classes_from_next_level(data, previous_level, model_class, current_level):
    """
    Takes in data with category level information, then return the unique classes for required level.
    """
    temp = data[data[previous_level] == model_class]
    return temp[current_level].unique()

def filter_data(X_train, y_train, current_level, current_classes):
    """
    Takes in data with category level information, 
    then return filtered dataframe according to a class in a specific level.
    """
    filtered_X_dataframes = []
    filtered_y_dataframes = []

    for cl in current_classes:
        train_data_indexes = y_train[y_train[current_level] == cl].index.tolist()
        #features
        filtered_X_dataframes.append(X_train.iloc[train_data_indexes])        
        #labels
        filtered_y_dataframes.append(y_train.iloc[train_data_indexes])

        
    if len(current_classes) == 1:
        updated_X_dataframe, updated_y_dataframe = update_filtered_data_one_class(X_train, y_train,\
                                                                              len(train_data_indexes), \
                                                                              current_classes[0],\
                                                                              current_level)
        
        filtered_X_dataframes.append(updated_X_dataframe)
        filtered_y_dataframes.append(updated_y_dataframe)
        
    return pd.concat(filtered_X_dataframes), pd.concat(filtered_y_dataframes)


# #############################################
def update_filtered_data_one_class(X_train, y_train, available_data_length, current_class, current_level):
    """
    Takes in data with only one class, then selects a random sample of same size
    and assign another class to make an even dataset to train the model for that class.
    """
    data_indexes = y_train[y_train[current_level] != current_class].index.tolist()
    indexes_to_select = random.sample(data_indexes, available_data_length)
    
    selected_X_train = X_train.iloc[indexes_to_select]
    selected_y_train = y_train.iloc[indexes_to_select]
    replaced_class = selected_y_train[current_level].unique()[0]
    selected_y_train.loc[:, [current_level]] = replaced_class
    
    return selected_X_train, selected_y_train

# #########################################
def train_models(X_train, y_train, model_class, previous_level, current_level):
    """
    Takes in data with category level information, then train and save the models.
    """
    unique_classes = get_unique_classes_from_next_level(y_train, previous_level, model_class, current_level)
    filtered_X_train, filtered_y_train = filter_data(X_train, y_train, current_level, unique_classes)

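    # Create, train and save the model for this class
    classifier = GaussianNB()
    classifier.fit(filtered_X_train, filtered_y_train[current_level])
    # Accuracy on the training subset (computed for reference, not reported)
    accuracy = classifier.score(filtered_X_train, filtered_y_train[current_level])
    
    # Path: ./Models/<current_level>/<model_class>.pickle
    path = "./Models"
    file_name = model_class + ".pickle"
 
    save_path = os.path.join(path, current_level, file_name)
    # Create the level directory if it does not exist yet
    os.makedirs(os.path.join(path, current_level), exist_ok=True)
    
    with open(save_path, 'wb') as f:
        pickle.dump(classifier, f)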

    print("Trained and saved Model for Class {}".format(model_class))
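
To make the hierarchy lookup concrete, here is a small sketch of what `get_unique_classes_from_next_level` returns. The function body is copied from above so the block runs on its own; the label values are hypothetical.

```python
import pandas as pd

def get_unique_classes_from_next_level(data, previous_level, model_class, current_level):
    """Return the unique child classes found under one parent class."""
    temp = data[data[previous_level] == model_class]
    return temp[current_level].unique()

# Hypothetical labels: parent A has children A1 and A2, parent B has B1
toy_labels = pd.DataFrame({
    "Level_1": ["A", "A", "B", "A"],
    "Level_2": ["A1", "A2", "B1", "A1"],
})

children = get_unique_classes_from_next_level(toy_labels, "Level_1", "A", "Level_2")
print(list(children))
```

Each Level 2 model is trained only on rows belonging to these child classes, so it only has to separate the children of one parent.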

In [27]:
level1_classes = dataframe["Level_1"].unique()
for cl in level1_classes:
    train_models(X_train, y_train, cl, "Level_1", "Level_2")

print("{} Total Models Trained and Saved for Level 2".format(len(level1_classes)))

Trained and saved Model for Class 09BF5150
Trained and saved Model for Class 2CEC27F1
Trained and saved Model for Class AAC8EE56
Trained and saved Model for Class 4C3D8686
Trained and saved Model for Class 69286F45
Trained and saved Model for Class 57164AC1
Trained and saved Model for Class 4513C920
Trained and saved Model for Class 35E04739
Trained and saved Model for Class EFEF723B
Trained and saved Model for Class 96F95EEC
Trained and saved Model for Class 014303D1
Trained and saved Model for Class 90A8B052
Trained and saved Model for Class B092BA29
Trained and saved Model for Class 3E1E0D78
Trained and saved Model for Class D410C91A
15 Total Models Trained and Saved for Level 2


In [28]:
level2_classes = dataframe["Level_2"].unique()
for cl in level2_classes:
    train_models(X_train, y_train, cl, "Level_2", "Level_3")

print("{} Total Models Trained and Saved for Level 3".format(len(level2_classes)))

Trained and saved Model for Class C7E19
Trained and saved Model for Class ADAD6
Trained and saved Model for Class 914A1
Trained and saved Model for Class 74974
Trained and saved Model for Class 2D5A3
Trained and saved Model for Class 9B69F
Trained and saved Model for Class 7B638
Trained and saved Model for Class F4055
Trained and saved Model for Class 0864A
Trained and saved Model for Class F824F
Trained and saved Model for Class B2DB4
Trained and saved Model for Class 02FA0
Trained and saved Model for Class D5531
Trained and saved Model for Class CB803
Trained and saved Model for Class BAE8A
Trained and saved Model for Class 31FED
Trained and saved Model for Class E69F5
Trained and saved Model for Class 390F1
Trained and saved Model for Class 94728
Trained and saved Model for Class 36080
Trained and saved Model for Class 77F62
Trained and saved Model for Class A04D3
Trained and saved Model for Class 7AED7
Trained and saved Model for Class 915D4
Trained and saved Model for Class 6C6B1


## Predict the test set

In [29]:
# Creating an empty Dataframe with column names only (depends on you how you do things)
results = pd.DataFrame(columns=['Level1_Pred', 'Level2_Pred', 'Level3_Pred'])

## Here we reload the saved models and use them to predict the levels
# load model for level 1 (done for you)
with open('./Models/Level1_model.pickle', 'rb') as nb:
    level1_model = pickle.load(nb)

## loop through the test data, predict level 1, then based on that predict level 2
## and based on level 2 predict level 3 (you need to load saved models accordingly)
dir_path = "./Models"

for index, test_data in X_test.iterrows():
    # We are predicting categories for a single sample, so reshape it into a 2D array first.
    data = np.array(test_data).reshape(1, -1)
    # Level 1 prediction
    level1_pred = level1_model.predict(data)    
    level1_pred_class = level1_pred[0] + ".pickle"
    
    model_path = os.path.join(dir_path, "Level_2", level1_pred_class)
    with open(model_path, 'rb') as nb:
        level2_model = pickle.load(nb)
    # Level 2 prediction
    level2_pred = level2_model.predict(data)    
    
    level2_pred_class = level2_pred[0] + ".pickle"
    model_path = os.path.join(dir_path, "Level_3", level2_pred_class)
    with open(model_path, 'rb') as nb:
        level3_model = pickle.load(nb)
    # Level 3 prediction
    level3_pred = level3_model.predict(data)
    
    results.at[index, "Level1_Pred"] = level1_pred[0]
    results.at[index, "Level2_Pred"] = level2_pred[0]
    results.at[index, "Level3_Pred"] = level3_pred[0]  
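
The loop above opens and unpickles the Level 2 and Level 3 model files again for every test row. One possible optimisation, sketched here with a hypothetical `load_model` helper and `functools.lru_cache`, is to load each file once and reuse the object. The demonstration pickles a stand-in dictionary rather than a trained classifier.

```python
import os
import pickle
import tempfile
from functools import lru_cache

@lru_cache(maxsize=None)
def load_model(model_path):
    """Hypothetical helper: unpickle a model once and reuse it on later calls."""
    with open(model_path, "rb") as f:
        return pickle.load(f)

# Demonstration with a stand-in object instead of a trained classifier
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pickle")
    with open(demo_path, "wb") as f:
        pickle.dump({"name": "stand-in model"}, f)

    first = load_model(demo_path)   # reads the file
    second = load_model(demo_path)  # served from the cache
    print(first is second)
```

In the prediction loop, the two `with open(...)` blocks could then be replaced by calls such as `load_model(model_path)`, assuming the model files do not change while the loop runs.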


In [30]:
# Prediction Results
results

Unnamed: 0,Level1_Pred,Level2_Pred,Level3_Pred
0,014303D1,77F62,5AE1
1,09BF5150,915D4,A2FA
2,2CEC27F1,BAE8A,2ABA
3,57164AC1,94728,5912
4,57164AC1,94728,5912
...,...,...,...
1058,AAC8EE56,9B69F,80C4
1059,AAC8EE56,9B69F,80C4
1060,AAC8EE56,914A1,D97D
1061,2CEC27F1,ADAD6,98CF


## Compute Accuracy on each level
Now that you have the predictions for each level (on the test data) and the actual levels, you can compute the accuracy.

In [31]:
# Level 1 accuracy
level1_accuracy = accuracy_score(y_test["Level_1"], results["Level1_Pred"])
print("Testing Accuracy of Level 1 model: {}%".format(round((level1_accuracy * 100) , 2)))

Testing Accuracy of Level 1 model: 85.61%


In [32]:
# Level 2 accuracy
level2_accuracy = accuracy_score(y_test["Level_2"], results["Level2_Pred"])
print("Testing Accuracy of Level 2 model: {}%".format(round((level2_accuracy * 100) , 2)))

Testing Accuracy of Level 2 model: 78.17%


In [33]:
# Level 3 accuracy
level3_accuracy = accuracy_score(y_test["Level_3"], results["Level3_Pred"])
print("Testing Accuracy of Level 3 model: {}%".format(round((level3_accuracy * 100) , 2)))

Testing Accuracy of Level 3 model: 76.76%
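
The three scores above are computed per level. A stricter measure, which this notebook does not report, is the share of samples where all three levels are predicted correctly at once. Here is a minimal sketch on hypothetical labels and predictions, using the same column names as `y_test` and `results`.

```python
import pandas as pd

# Hypothetical true labels and predictions for 4 samples
y_true = pd.DataFrame({
    "Level_1": ["A", "A", "B", "B"],
    "Level_2": ["A1", "A2", "B1", "B1"],
    "Level_3": ["x", "y", "z", "z"],
})
preds = pd.DataFrame({
    "Level1_Pred": ["A", "A", "B", "A"],
    "Level2_Pred": ["A1", "A1", "B1", "A1"],
    "Level3_Pred": ["x", "x", "z", "x"],
})

# A sample counts as correct only if every level matches
all_correct = (
    (y_true["Level_1"] == preds["Level1_Pred"])
    & (y_true["Level_2"] == preds["Level2_Pred"])
    & (y_true["Level_3"] == preds["Level3_Pred"])
)
exact_match = all_correct.mean()
print("Exact-match accuracy: {}%".format(round(exact_match * 100, 2)))
```

Because a wrong Level 1 prediction routes the sample to the wrong Level 2 model, errors carry down the hierarchy, so this figure can never exceed the lowest per-level accuracy.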
