# Product Category Prediction - Exploratory Analysis

In this notebook, we explore the dataset and build a machine learning model
that predicts a product's category from its title.


In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns


In [2]:
url = "https://raw.githubusercontent.com/andreivizi99/product-category-prediction/main/data/IMLP4_TASK_03-products%20(3).csv"

df = pd.read_csv(url)

df.head()


Unnamed: 0,product ID,Product Title,Merchant ID,Category Label,_Product Code,Number_of_Views,Merchant Rating,Listing Date
0,1,apple iphone 8 plus 64gb silver,1,Mobile Phones,QA-2276-XC,860.0,2.5,5/10/2024
1,2,apple iphone 8 plus 64 gb spacegrau,2,Mobile Phones,KA-2501-QO,3772.0,4.8,12/31/2024
2,3,apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...,3,Mobile Phones,FP-8086-IE,3092.0,3.9,11/10/2024
3,4,apple iphone 8 plus 64gb space grey,4,Mobile Phones,YI-0086-US,466.0,3.4,5/2/2022
4,5,apple iphone 8 plus gold 5.5 64gb 4g unlocked ...,5,Mobile Phones,NZ-3586-WP,4426.0,1.6,4/12/2023


## Dataset Overview

In this section, we inspect the structure of the dataset:
- number of rows and columns
- column names
- data types
- missing values


In [3]:
df.shape


(35311, 8)

In [4]:
df.columns


Index(['product ID', 'Product Title', 'Merchant ID', ' Category Label',
       '_Product Code', 'Number_of_Views', 'Merchant Rating',
       ' Listing Date  '],
      dtype='object')

In [5]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35311 entries, 0 to 35310
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   product ID       35311 non-null  int64  
 1   Product Title    35139 non-null  object 
 2   Merchant ID      35311 non-null  int64  
 3    Category Label  35267 non-null  object 
 4   _Product Code    35216 non-null  object 
 5   Number_of_Views  35297 non-null  float64
 6   Merchant Rating  35141 non-null  float64
 7    Listing Date    35252 non-null  object 
dtypes: float64(2), int64(2), object(4)
memory usage: 2.2+ MB


In [6]:
df.isnull().sum()


Unnamed: 0,0
product ID,0
Product Title,172
Merchant ID,0
Category Label,44
_Product Code,95
Number_of_Views,14
Merchant Rating,170
Listing Date,59


In [7]:
# this selection fails: the raw column name is ' Category Label', with a leading space
df = df[['Product Title', 'Category Label']]
df.head()


KeyError: "['Category Label'] not in index"

In [8]:
df.columns


Index(['product ID', 'Product Title', 'Merchant ID', ' Category Label',
       '_Product Code', 'Number_of_Views', 'Merchant Rating',
       ' Listing Date  '],
      dtype='object')

In [9]:
# strip leading and trailing whitespace from every column name
df.columns = df.columns.str.strip()
df.columns


Index(['product ID', 'Product Title', 'Merchant ID', 'Category Label',
       '_Product Code', 'Number_of_Views', 'Merchant Rating', 'Listing Date'],
      dtype='object')

In [10]:
df = df[['Product Title', 'Category Label']]
df.head()


Unnamed: 0,Product Title,Category Label
0,apple iphone 8 plus 64gb silver,Mobile Phones
1,apple iphone 8 plus 64 gb spacegrau,Mobile Phones
2,apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...,Mobile Phones
3,apple iphone 8 plus 64gb space grey,Mobile Phones
4,apple iphone 8 plus gold 5.5 64gb 4g unlocked ...,Mobile Phones


## Data Cleaning and Feature Engineering

In this section, we clean the text data and create additional features
that may help the model better understand product titles.


In [11]:
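# drop rows with a missing title or label; .copy() gives an independent
# DataFrame so the column assignments below do not raise SettingWithCopyWarning
df = df.dropna().copy()

# ensure all titles are strings and lowercase
df['Product Title'] = df['Product Title'].astype(str).str.lower()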

df.head()


Unnamed: 0,Product Title,Category Label
0,apple iphone 8 plus 64gb silver,Mobile Phones
1,apple iphone 8 plus 64 gb spacegrau,Mobile Phones
2,apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...,Mobile Phones
3,apple iphone 8 plus 64gb space grey,Mobile Phones
4,apple iphone 8 plus gold 5.5 64gb 4g unlocked ...,Mobile Phones


In [12]:
# number of characters in the title
df['title_length'] = df['Product Title'].str.len()

# number of whitespace-separated words in the title
df['word_count'] = df['Product Title'].str.split().str.len()

# 1 if the title contains at least one digit, else 0
df['has_numbers'] = df['Product Title'].str.contains(r'\d').astype(int)

df.head()


Unnamed: 0,Product Title,Category Label,title_length,word_count,has_numbers
0,apple iphone 8 plus 64gb silver,Mobile Phones,31,6,1
1,apple iphone 8 plus 64 gb spacegrau,Mobile Phones,35,7,1
2,apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...,Mobile Phones,70,13,1
3,apple iphone 8 plus 64gb space grey,Mobile Phones,35,7,1
4,apple iphone 8 plus gold 5.5 64gb 4g unlocked ...,Mobile Phones,54,11,1
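The engineered columns above are not passed to the models trained later in this notebook, which use only the title text. If they were to be used, one possible approach is to append them to the TF-IDF matrix. The sketch below is an illustration on toy data, not part of the original pipeline; the names `toy`, `text_features`, `numeric_features`, and `combined` are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the notebook's DataFrame (assumption: two example titles).
toy = pd.DataFrame({"Product Title": [
    "apple iphone 8 plus 64gb silver",
    "bosch integrated dishwasher white",
]})

# Same three engineered features as in the cell above.
toy["title_length"] = toy["Product Title"].str.len()
toy["word_count"] = toy["Product Title"].str.split().str.len()
toy["has_numbers"] = toy["Product Title"].str.contains(r"\d").astype(int)

# Sparse text features from the titles.
text_features = TfidfVectorizer().fit_transform(toy["Product Title"])

# Numeric features converted to a sparse matrix so they can be stacked.
numeric_cols = ["title_length", "word_count", "has_numbers"]
numeric_features = csr_matrix(toy[numeric_cols].to_numpy(dtype=float))

# Column-wise concatenation: one row per title, text columns then numeric columns.
combined = hstack([text_features, numeric_features]).tocsr()

print(text_features.shape, combined.shape)
```

The numeric columns are left unscaled here; since TF-IDF values lie between 0 and 1, scaling the counts before stacking would probably be needed in practice.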


In [13]:
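# only the title text is used as model input;
# title_length, word_count and has_numbers are not passed to the models
X = df['Product Title']
y = df['Category Label']

print(X.shape, y.shape)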


(35096,) (35096,)


In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)


(28076,) (7020,)
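The split above is random and not stratified, so small categories may be represented unevenly in the train and test sets. As a hedged alternative, `train_test_split` accepts a `stratify` argument that preserves class proportions. The sketch below uses toy data with a 90/10 class ratio; the names `toy`, `X_toy`, and `y_toy` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in: 90 rows of a common class and 10 rows of a rare one.
toy = pd.DataFrame({
    "Product Title": [f"title {i}" for i in range(100)],
    "Category Label": ["Fridges"] * 90 + ["CPU"] * 10,
})

X_toy = toy["Product Title"]
y_toy = toy["Category Label"]

# stratify=y_toy keeps the 90/10 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Applying this to the notebook's data would change which rows land in the test set, so the scores reported below would not be directly comparable.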


In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),
    stop_words='english'
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

X_train_tfidf.shape


(28076, 5000)

In [16]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)

y_pred_nb = nb_model.predict(X_test_tfidf)

print("Naive Bayes accuracy:", accuracy_score(y_test, y_pred_nb))
print(classification_report(y_test, y_pred_nb))


Naive Bayes accuracy: 0.9282051282051282
                  precision    recall  f1-score   support

             CPU       0.00      0.00      0.00        14
            CPUs       0.98      1.00      0.99       726
 Digital Cameras       0.99      0.99      0.99       535
     Dishwashers       0.90      0.95      0.92       684
        Freezers       0.99      0.72      0.83       422
 Fridge Freezers       0.84      0.92      0.88      1087
         Fridges       0.84      0.84      0.84       702
      Microwaves       0.99      0.95      0.97       464
    Mobile Phone       0.00      0.00      0.00        17
   Mobile Phones       0.97      0.98      0.98       795
             TVs       0.96      0.99      0.98       724
Washing Machines       0.94      0.95      0.95       821
          fridge       0.00      0.00      0.00        29

        accuracy                           0.93      7020
       macro avg       0.72      0.71      0.72      7020
    weighted avg       0.92  

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [17]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_tfidf, y_train)

y_pred_lr = lr_model.predict(X_test_tfidf)

print("Logistic Regression accuracy:", accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))


Logistic Regression accuracy: 0.9445868945868946
                  precision    recall  f1-score   support

             CPU       0.00      0.00      0.00        14
            CPUs       0.98      1.00      0.99       726
 Digital Cameras       1.00      0.99      1.00       535
     Dishwashers       0.91      0.95      0.93       684
        Freezers       0.98      0.86      0.91       422
 Fridge Freezers       0.93      0.93      0.93      1087
         Fridges       0.86      0.90      0.88       702
      Microwaves       0.98      0.96      0.97       464
    Mobile Phone       0.00      0.00      0.00        17
   Mobile Phones       0.96      0.99      0.97       795
             TVs       0.96      0.99      0.98       724
Washing Machines       0.94      0.95      0.95       821
          fridge       0.00      0.00      0.00        29

        accuracy                           0.94      7020
       macro avg       0.73      0.73      0.73      7020
    weighted avg     

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


## Model Comparison

Two models were trained and evaluated:
- Multinomial Naive Bayes
- Logistic Regression

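Logistic Regression scored higher on the test set: accuracy 0.945 versus
0.928 for Naive Bayes, and macro-averaged F1 0.73 versus 0.72. It also
recovered more of the `Freezers` class (recall 0.86 versus 0.72), so it was
selected as the final model.

Both models score 0.00 on `CPU`, `Mobile Phone`, and `fridge`. These labels
have only 14, 17, and 29 test rows and appear to be inconsistent spellings of
`CPUs`, `Mobile Phones`, and `Fridges`, which is what pulls the macro
averages well below the overall accuracy.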
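One possible way to handle the `CPU`, `Mobile Phone`, and `fridge` labels, which score 0.00 in both classification reports, is to merge them into the larger categories before training. The sketch below is a minimal illustration under the assumption that these labels are spelling variants of `CPUs`, `Mobile Phones`, and `Fridges`; that assumption is not verified in this notebook. `LABEL_FIXES` and `normalize_labels` are hypothetical names.

```python
import pandas as pd

# Assumed mapping: each key is treated as a variant of the label it maps to.
LABEL_FIXES = {
    "CPU": "CPUs",
    "Mobile Phone": "Mobile Phones",
    "fridge": "Fridges",
}


def normalize_labels(labels: pd.Series) -> pd.Series:
    """Strip surrounding whitespace, then merge the assumed label variants."""
    cleaned = labels.str.strip()
    # Series.replace with a dict matches whole values, so "CPUs" is untouched.
    return cleaned.replace(LABEL_FIXES)


toy_labels = pd.Series(
    ["CPU", "CPUs", " Mobile Phone", "fridge", "Fridges", "TVs"]
)
normalized = normalize_labels(toy_labels)

print(normalized.value_counts().to_dict())
```

In the notebook this would be applied to `df['Category Label']` before the train/test split, after which both models would need to be retrained and re-evaluated.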


In [18]:
import pickle

# save model
with open("product_category_model.pkl", "wb") as f:
    pickle.dump(lr_model, f)

# save vectorizer
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(tfidf, f)
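Because the model and the vectorizer are saved separately, both have to be loaded and applied in the same order at prediction time: vectorize first, then classify. The sketch below shows that round trip on a tiny stand-in model written to a temporary directory, so it runs without the notebook's files. The toy training data and the helper `predict_category` are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in training set (assumption: not the notebook's real data).
titles = [
    "apple iphone 8 plus 64gb silver",
    "samsung galaxy s9 64gb black",
    "bosch integrated dishwasher white",
    "beko freestanding dishwasher silver",
]
labels = ["Mobile Phones", "Mobile Phones", "Dishwashers", "Dishwashers"]

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(titles), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "product_category_model.pkl"
    vec_path = Path(tmp) / "tfidf_vectorizer.pkl"

    # Save both artifacts, as in the cell above.
    with open(model_path, "wb") as f:
        pickle.dump(clf, f)
    with open(vec_path, "wb") as f:
        pickle.dump(vec, f)

    # Load them back, as a separate prediction script would.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(vec_path, "rb") as f:
        loaded_vec = pickle.load(f)


def predict_category(title: str) -> str:
    """Lowercase the title (as in training), vectorize it, and classify it."""
    features = loaded_vec.transform([title.lower()])
    return loaded_model.predict(features)[0]


print(predict_category("Apple iPhone 8 64GB gold"))
```

Pickle files should only be loaded from trusted sources, and the scikit-learn version used for loading should match the one used for saving.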
