<a href="https://colab.research.google.com/github/damjan18/ml-product-category-prediction/blob/main/notebook/product_category_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Loading and inspecting the dataset
Steps:
- Load the CSV file from GitHub
- Check how many rows and columns we have
- Display the first five rows
- Print the column types and non-null counts with `df.info()`


In [5]:
import pandas as pd

url = "https://raw.githubusercontent.com/damjan18/ml-product-category-prediction/main/data/products.csv"

df = pd.read_csv(url)

print("Dataset shape: ", df.shape)

print("First five rows:")
display(df.head())

print("Info:")
df.info()


Dataset shape:  (35311, 8)
First five rows:


Unnamed: 0,product ID,Product Title,Merchant ID,Category Label,_Product Code,Number_of_Views,Merchant Rating,Listing Date
0,1,apple iphone 8 plus 64gb silver,1,Mobile Phones,QA-2276-XC,860.0,2.5,5/10/2024
1,2,apple iphone 8 plus 64 gb spacegrau,2,Mobile Phones,KA-2501-QO,3772.0,4.8,12/31/2024
2,3,apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...,3,Mobile Phones,FP-8086-IE,3092.0,3.9,11/10/2024
3,4,apple iphone 8 plus 64gb space grey,4,Mobile Phones,YI-0086-US,466.0,3.4,5/2/2022
4,5,apple iphone 8 plus gold 5.5 64gb 4g unlocked ...,5,Mobile Phones,NZ-3586-WP,4426.0,1.6,4/12/2023


Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35311 entries, 0 to 35310
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   product ID       35311 non-null  int64  
 1   Product Title    35139 non-null  object 
 2   Merchant ID      35311 non-null  int64  
 3    Category Label  35267 non-null  object 
 4   _Product Code    35216 non-null  object 
 5   Number_of_Views  35297 non-null  float64
 6   Merchant Rating  35141 non-null  float64
 7    Listing Date    35252 non-null  object 
dtypes: float64(2), int64(2), object(4)
memory usage: 2.2+ MB
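
The `info()` output above lists some column names with a leading space (for example ` Category Label`), which is easy to miss in a printed table. A minimal sketch of one way to make such whitespace visible, using a small made-up in-memory CSV as a stand-in for the real file (the sample rows and the reduced column set are assumptions for illustration only):

```python
import io
import pandas as pd

# Hypothetical sample standing in for the real CSV; note the space
# before "Category Label" in the header row.
sample_csv = io.StringIO(
    "product ID,Product Title, Category Label\n"
    "1,apple iphone 8 plus 64gb silver,Mobile Phones\n"
    "2,bosch serie 4 dishwasher,Dishwashers\n"
)
sample = pd.read_csv(sample_csv)

# repr() keeps the quotes, so leading or trailing spaces become visible.
column_reprs = [repr(col) for col in sample.columns]
print(column_reprs)

# Any column whose name changes when stripped carries stray whitespace.
padded = [col for col in sample.columns if col != col.strip()]
print(padded)
```

Running the same two checks on `df.columns` right after loading would show which names need cleaning before they are used as keys.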


# 2. Normalizing column names and removing irrelevant columns

In [15]:
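# Strip spaces and underscores so every column name is a single token.
df.columns = df.columns.str.replace(' ', '').str.replace('_', '')
# Keep only the title (feature) and the category (target).
df = df.drop(columns=['productID', 'MerchantID', 'ProductCode', 'NumberofViews', 'MerchantRating', 'ListingDate'])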
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35311 entries, 0 to 35310
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   ProductTitle   35139 non-null  object
 1   CategoryLabel  35267 non-null  object
dtypes: object(2)
memory usage: 551.9+ KB
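
To see what the renaming step does to each name, here is a small sketch on an empty frame that carries only the raw column names from the first `info()` output. Selecting the two columns to keep, rather than listing every column to drop, is an alternative shown here as an assumption about style, not what the notebook does:

```python
import pandas as pd

# Raw column names copied from the first info() output, including the
# leading spaces and underscores. The frame itself is empty.
raw_columns = [
    "product ID", "Product Title", "Merchant ID", " Category Label",
    "_Product Code", "Number_of_Views", "Merchant Rating", " Listing Date",
]
demo = pd.DataFrame(columns=raw_columns)

# regex=False makes both replacements literal string replacements.
demo.columns = (
    demo.columns.str.replace(" ", "", regex=False)
                .str.replace("_", "", regex=False)
)
normalized = list(demo.columns)
print(normalized)

# Alternative to drop(): name the columns to keep, so a forgotten
# column cannot slip through.
demo = demo[["ProductTitle", "CategoryLabel"]]
print(list(demo.columns))
```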


# 3. Dealing with missing values
We will:
- Check for missing values
- Remove rows with missing values

In [17]:
print("Missing values per column: ")
print(df.isna().sum())

df = df.dropna()

print("Missing values per column: ")
print(df.isna().sum())

Missing values per column: 
ProductTitle     172
CategoryLabel     44
dtype: int64
Missing values per column: 
ProductTitle     0
CategoryLabel    0
dtype: int64
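
`dropna()` only removes values that pandas treats as missing. A title made up of spaces alone is not missing in that sense and would survive the step above. Whether the real dataset contains such titles is not known from the output shown here, so the following is only a sketch on made-up rows of how one might check:

```python
import pandas as pd

# Made-up rows: one missing title, one whitespace-only title,
# and one missing label.
demo = pd.DataFrame({
    "ProductTitle": ["apple iphone 8", None, "   ", "bosch dishwasher"],
    "CategoryLabel": ["Mobile Phones", "Mobile Phones", "TVs", None],
})

# Removes the rows with a missing title or a missing label.
cleaned = demo.dropna(subset=["ProductTitle", "CategoryLabel"])

# Whitespace-only titles are still present, so filter them separately.
blank_mask = cleaned["ProductTitle"].str.strip().eq("")
cleaned = cleaned[~blank_mask].reset_index(drop=True)

print(len(demo), "rows before,", len(cleaned), "rows after")
```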


# 4. Standardizing the `CategoryLabel` column

In [28]:
print(df['CategoryLabel'].value_counts())
replace_map = {
    'Fridge Freezers': 'Fridge',
    'Fridges': 'Fridge',
    'Freezers': 'Fridge',
    'fridge': 'Fridge',

    'Washing Machines': 'Washing Machine',

    'Mobile Phones': 'Mobile Phone',
    'Mobile Phone': 'Mobile Phone',

    'CPUs': 'CPU',
    'CPU': 'CPU',

    'TVs': 'TV',

    'Dishwashers': 'Dishwasher',

    'Digital Cameras': 'Digital Camera',

    'Microwaves': 'Microwave',
}
df['CategoryLabel'] = df['CategoryLabel'].replace(replace_map)
print(df['CategoryLabel'].value_counts())

CategoryLabel
Fridge Freezers     5470
Washing Machines    4015
Mobile Phones       4002
CPUs                3747
TVs                 3541
Fridges             3436
Dishwashers         3405
Digital Cameras     2689
Microwaves          2328
Freezers            2201
fridge               123
CPU                   84
Mobile Phone          55
Name: count, dtype: int64
CategoryLabel
Fridge             11230
Mobile Phone        4057
Washing Machine     4015
CPU                 3831
TV                  3541
Dishwasher          3405
Digital Camera      2689
Microwave           2328
Name: count, dtype: int64
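
A hand-written mapping silently leaves any label it does not mention unchanged. A short sketch of a sanity check that reports labels outside an expected set after the replacement; the label `Tablets` and the shortened mapping are hypothetical and used only for illustration:

```python
import pandas as pd

# Made-up labels; "Tablets" is a hypothetical label the mapping misses.
labels = pd.Series(
    ["Fridges", "fridge", "CPUs", "CPU", "Mobile Phones", "Tablets"]
)

# Shortened version of the mapping idea used above.
replace_map = {
    "Fridges": "Fridge",
    "fridge": "Fridge",
    "CPUs": "CPU",
    "Mobile Phones": "Mobile Phone",
}
expected = {"Fridge", "CPU", "Mobile Phone"}

standardized = labels.replace(replace_map)

# Anything left outside the expected set was not covered by the mapping.
unexpected = sorted(set(standardized) - expected)
print(unexpected)
```

An empty list means every label ended up in one of the intended categories.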


# 5. Training and comparing multiple ML models
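We train five classifiers on TF-IDF features of the product titles, evaluate each on a stratified 20% hold-out set, and choose the best model from the classification reports.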
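
As a minimal illustration of the text-to-category idea, the sketch below chains a `TfidfVectorizer` and a `LogisticRegression` with scikit-learn's `make_pipeline` on a handful of made-up titles. The titles and labels are invented for the example, and wrapping both steps in a pipeline is an assumption about style; the notebook cell that follows fits the vectorizer and the models as separate steps.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training titles and labels, for illustration only.
titles = [
    "apple iphone 8 plus 64gb silver",
    "samsung galaxy s9 64gb black",
    "bosch serie 4 freestanding dishwasher",
    "beko slimline dishwasher white",
    "intel core i7 8700k processor",
    "amd ryzen 5 2600 processor",
]
labels = [
    "Mobile Phone", "Mobile Phone",
    "Dishwasher", "Dishwasher",
    "CPU", "CPU",
]

# The pipeline fits TF-IDF on the training titles only and applies the
# same vocabulary to anything passed to predict().
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(titles, labels)

pred = clf.predict(["samsung galaxy s8 32gb gold"])
print(pred[0])
```

Bundling the steps this way keeps the vectorizer from ever seeing test titles during fitting, which matters once cross-validation is used instead of a single split.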

In [30]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

x = df['ProductTitle']
y = df['CategoryLabel']

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)


vectorizer = TfidfVectorizer()
x_train_vec = vectorizer.fit_transform(x_train)
x_test_vec = vectorizer.transform(x_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Support Vector Machine": LinearSVC()
}



for name, model in models.items():
    model.fit(x_train_vec, y_train)
    y_pred = model.predict(x_test_vec)
    print(f"{name}\n")
    print(classification_report(y_test, y_pred))


Logistic Regression

                 precision    recall  f1-score   support

            CPU       1.00      0.99      1.00       766
 Digital Camera       1.00      0.99      0.99       538
     Dishwasher       0.96      0.91      0.94       681
         Fridge       0.94      0.99      0.97      2246
      Microwave       1.00      0.94      0.97       466
   Mobile Phone       0.99      0.99      0.99       812
             TV       0.98      0.98      0.98       708
Washing Machine       0.99      0.92      0.95       803

       accuracy                           0.97      7020
      macro avg       0.98      0.97      0.97      7020
   weighted avg       0.97      0.97      0.97      7020

Naive Bayes

                 precision    recall  f1-score   support

            CPU       1.00      1.00      1.00       766
 Digital Camera       1.00      0.99      0.99       538
     Dishwasher       0.98      0.89      0.93       681
         Fridge       0.94      1.00      0.97    