This notebook trains a machine learning model that predicts the clothing **subCategory** based on:
- `baseColour`
- `season`
- `gender`
- `usage`

I'll use the `styles.csv` metadata file from the Kaggle Fashion Product Images dataset.


In [11]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
import joblib

## Loading Dataset

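I'll load the `styles.csv` file using pandas. This file contains metadata about fashion items, such as category, colour, and season. Passing `on_bad_lines='skip'` tells pandas to drop any line that has more fields than the header instead of raising a parsing error.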


In [12]:
df = pd.read_csv('../data/fashion_product_images/styles.csv', on_bad_lines='skip')

print("Shape:", df.shape)
df.head()


Shape: (44424, 10)


|   | id | gender | masterCategory | subCategory | articleType | baseColour | season | year | usage | productDisplayName |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15970 | Men | Apparel | Topwear | Shirts | Navy Blue | Fall | 2011.0 | Casual | Turtle Check Men Navy Blue Shirt |
| 1 | 39386 | Men | Apparel | Bottomwear | Jeans | Blue | Summer | 2012.0 | Casual | Peter England Men Party Blue Jeans |
| 2 | 59263 | Women | Accessories | Watches | Watches | Silver | Winter | 2016.0 | Casual | Titan Women Silver Watch |
| 3 | 21379 | Men | Apparel | Bottomwear | Track Pants | Black | Fall | 2011.0 | Casual | Manchester United Men Solid Black Track Pants |
| 4 | 53759 | Men | Apparel | Topwear | Tshirts | Grey | Summer | 2012.0 | Casual | Puma Men Grey T-shirt |
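
To see what `on_bad_lines='skip'` does, here is a minimal sketch on a made-up CSV string (the rows and the `raw` variable are my own illustration, not taken from `styles.csv`). One line contains an extra unquoted comma, so it has more fields than the header and pandas drops it silently. Counting the dropped lines is a cheap sanity check on how much data is lost.

```python
import io

import pandas as pd

# Toy CSV (assumption, for illustration only): the line for id 2 has an extra,
# unquoted comma, so it has 4 fields while the header declares 3.
raw = (
    "id,gender,productDisplayName\n"
    "1,Men,Plain Shirt\n"
    "2,Women,Scarf, Red\n"
    "3,Men,Track Pants\n"
)

# on_bad_lines="skip" (pandas >= 1.3) drops lines with too many fields.
kept = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")

n_data_lines = raw.count("\n") - 1  # every line except the header
n_skipped = n_data_lines - len(kept)
print(f"kept {len(kept)} rows, skipped {n_skipped}")
```

The same subtraction works on the real file if the raw line count is read separately, for example by iterating over the open file once.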


## Selecting and Cleaning Data

I'll only use relevant columns and drop any rows that contain missing data.


In [13]:
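# Keep only the four input features and the target column.
columns = ['baseColour', 'season', 'gender', 'usage', 'subCategory']

# dropna() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that in-place edits on a column slice can trigger.
df = df[columns].dropna()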
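
Before dropping rows it can help to check how many values are actually missing per column. The sketch below uses a small made-up frame (`toy` is an assumption, not the real data) to show the pattern: count missing values, drop incomplete rows, and report how many rows were lost.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption): two rows have a missing value in different columns.
toy = pd.DataFrame({
    "baseColour": ["Blue", None, "Black", "Grey"],
    "season": ["Summer", "Fall", np.nan, "Winter"],
    "gender": ["Men", "Women", "Men", "Men"],
    "usage": ["Casual", "Casual", "Sports", "Casual"],
    "subCategory": ["Topwear", "Bottomwear", "Topwear", "Watches"],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
clean = toy.dropna().reset_index(drop=True)
rows_dropped = len(toy) - len(clean)
print(f"dropped {rows_dropped} of {len(toy)} rows")
```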

## Encoding Categorical Data

scikit-learn estimators expect numeric input, so I'll convert each categorical column into integer codes with `LabelEncoder`. The codes follow the sorted order of the labels and carry no meaning beyond identifying a category.
I'll also save the `subCategory` encoder to decode predictions later.


In [14]:
le_color = LabelEncoder()
le_season = LabelEncoder()
le_gender = LabelEncoder()
le_usage = LabelEncoder()
le_subcat = LabelEncoder()

df['baseColour'] = le_color.fit_transform(df['baseColour'])
df['season'] = le_season.fit_transform(df['season'])
df['gender'] = le_gender.fit_transform(df['gender'])
df['usage'] = le_usage.fit_transform(df['usage'])
df['subCategory'] = le_subcat.fit_transform(df['subCategory'])

joblib.dump(le_subcat, '../models/subcategory_encoder.pkl')


['../models/subcategory_encoder.pkl']
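
As a small illustration of why the `subCategory` encoder is worth keeping, the sketch below fits a `LabelEncoder` on a few made-up labels and maps integer codes back to names with `inverse_transform`. The label values are toy data; only the `LabelEncoder` calls mirror what the notebook uses.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels (assumption, for illustration only).
le = LabelEncoder()
codes = le.fit_transform(["Topwear", "Bottomwear", "Watches", "Topwear"])

# classes_ holds the labels in sorted order; a label's code is its index there.
print(list(le.classes_))
print(list(codes))

# A model predicts integer codes; inverse_transform turns them back into names.
decoded = le.inverse_transform(np.array([2, 0]))
print(list(decoded))

# A label the encoder never saw during fit cannot be transformed.
try:
    le.transform(["Shoes"])
    unseen_ok = True
except ValueError:
    unseen_ok = False
print("unseen label accepted:", unseen_ok)
```

The last part matters for inference: any colour, season, gender, or usage value that was not in the training data will raise an error unless it is handled before encoding.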

## Splitting into Train and Test Sets

I'll split the data into training and test sets to evaluate model performance.


In [15]:
X = df[['baseColour', 'season', 'gender', 'usage']]
y = df['subCategory']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
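
The split above is purely random. With imbalanced classes, one possible variation is a stratified split, which keeps the class proportions the same in both sets. The sketch below shows the idea on made-up labels; it is not what the cell above does. One caveat I'm aware of: `stratify` raises an error if any class has only a single sample, so very rare classes would need to be filtered or merged first.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (assumption): class 0 is twice as common as classes 1 and 2.
y = np.array([0] * 10 + [1] * 5 + [2] * 5)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y preserves the 50/25/25 class mix in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("train:", train_counts.tolist())
print("test: ", test_counts.tolist())
```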

## Training the Model

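I'll use a `RandomForestClassifier` because it is a robust baseline that needs little tuning and no feature scaling. Note that scikit-learn's implementation treats the integer codes as ordinary numbers and splits on thresholds, so the arbitrary order of the codes can influence the trees.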


In [16]:
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
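
An alternative worth trying, sketched here under my own assumptions rather than as the approach this notebook takes, is to one-hot encode the features inside a scikit-learn `Pipeline`. That removes the artificial ordering of integer codes, and `handle_unknown="ignore"` lets the model accept categories it never saw in training. The `toy` frame below is made up (its first rows echo the `df.head()` output).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

features = ["baseColour", "season", "gender", "usage"]

# Toy training data (assumption, for illustration only).
toy = pd.DataFrame({
    "baseColour": ["Navy Blue", "Blue", "Silver", "Black", "Grey", "Black"],
    "season": ["Fall", "Summer", "Winter", "Fall", "Summer", "Winter"],
    "gender": ["Men", "Men", "Women", "Men", "Men", "Women"],
    "usage": ["Casual"] * 6,
    "subCategory": ["Topwear", "Bottomwear", "Watches",
                    "Bottomwear", "Topwear", "Watches"],
})

# Encoding and model live in one object, so a single file holds everything
# needed at inference time.
pipe = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), features)]
    )),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipe.fit(toy[features], toy["subCategory"])

# "Purple" never appears in training; it is encoded as all zeros, not an error.
new_item = pd.DataFrame([{
    "baseColour": "Purple", "season": "Summer",
    "gender": "Men", "usage": "Casual",
}])
prediction = pipe.predict(new_item)
print(prediction[0])
```

Because the pipeline is fit on the string labels directly, it also predicts strings, so no separate target encoder is needed in this variant.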

## Evaluating the Model

I'll evaluate the model using accuracy and a classification report.


In [17]:
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))

Accuracy: 0.547073502722323

Classification Report:

              precision    recall  f1-score   support

           0       0.71      0.19      0.29        27
           1       0.00      0.00      0.00        23
           2       0.45      0.43      0.44       607
           3       0.00      0.00      0.00         3
           5       0.00      0.00      0.00       144
           6       0.46      0.12      0.19       531
           7       1.00      0.43      0.60        14
           8       0.50      0.03      0.06       103
           9       0.00      0.00      0.00         8
          10       0.65      0.13      0.22       229
          11       0.20      0.01      0.02       169
          12       0.66      0.82      0.73       206
          13       0.00      0.00      0.00        21
          14       0.00      0.00      0.00         4
          15       0.00      0.00      0.00         5
          16       0.46      0.10      0.17        59
          18       0.55     

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
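
The report above lists classes by integer code, and some codes (for example 4 and 17) do not appear in it at all. A sketch of a more readable report, on made-up labels, passes the encoder's class names as `target_names` together with an explicit `labels` list so that names and codes stay aligned even when a class is absent from the test split. Setting `zero_division=0` reports 0.0 for classes that are never predicted, without emitting a warning.

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Toy labels (assumption): "Socks" is known to the encoder but absent from
# this test split, and "Bags" is never predicted.
le_subcat = LabelEncoder().fit(["Bags", "Socks", "Topwear", "Watches"])
y_test = le_subcat.transform(["Bags", "Topwear", "Topwear", "Watches", "Topwear"])
y_pred = le_subcat.transform(["Topwear", "Topwear", "Topwear", "Watches", "Topwear"])

# One integer code per known class, in the same order as the class names.
labels = list(range(len(le_subcat.classes_)))
names = [str(c) for c in le_subcat.classes_]

# Human-readable report with class names instead of integer codes.
print(classification_report(
    y_test, y_pred, labels=labels, target_names=names, zero_division=0
))

# The same numbers as a nested dict, convenient for programmatic checks.
report = classification_report(
    y_test, y_pred, labels=labels, target_names=names,
    zero_division=0, output_dict=True,
)
```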


## Saving the Trained Model

I'll save the trained model so I can later load it in a web API (Flask/FastAPI) or use it directly in the app.


In [18]:
joblib.dump(model, '../models/style_predictor.pkl')
print("✅ Model saved to models/style_predictor.pkl")

✅ Model saved to models/style_predictor.pkl


In [19]:
# Save the encoders used for the input features so the same
# label-to-code mappings can be applied at inference time
joblib.dump(le_color, '../models/color_encoder.pkl')
joblib.dump(le_season, '../models/season_encoder.pkl')
joblib.dump(le_gender, '../models/gender_encoder.pkl')
joblib.dump(le_usage, '../models/usage_encoder.pkl')


['../models/usage_encoder.pkl']
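
To show how the saved files might fit together at inference time, here is a minimal round-trip sketch. It trains on a tiny made-up frame, writes the model and encoders to a temporary directory using the same file names as above, then reloads them in a helper. The helper `predict_subcategory`, the `ENCODER_FILES` mapping, and the toy data are my own assumptions, not part of the project.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

FEATURES = ["baseColour", "season", "gender", "usage"]
# File names follow the ones used in this notebook.
ENCODER_FILES = {
    "baseColour": "color_encoder.pkl",
    "season": "season_encoder.pkl",
    "gender": "gender_encoder.pkl",
    "usage": "usage_encoder.pkl",
}

# Toy training data (assumption, for illustration only).
toy = pd.DataFrame({
    "baseColour": ["Navy Blue", "Blue", "Silver", "Black", "Grey", "Silver"],
    "season": ["Fall", "Summer", "Winter", "Fall", "Summer", "Winter"],
    "gender": ["Men", "Men", "Women", "Men", "Men", "Women"],
    "usage": ["Casual"] * 6,
    "subCategory": ["Topwear", "Bottomwear", "Watches",
                    "Bottomwear", "Topwear", "Watches"],
})


def predict_subcategory(model_dir, item):
    """Hypothetical helper: load the artefacts and decode one prediction."""
    model_dir = Path(model_dir)
    model = joblib.load(model_dir / "style_predictor.pkl")
    target_encoder = joblib.load(model_dir / "subcategory_encoder.pkl")
    # Encode each feature with the encoder that was fitted during training.
    row = pd.DataFrame({
        col: joblib.load(model_dir / fname).transform([item[col]])
        for col, fname in ENCODER_FILES.items()
    })[FEATURES]
    code = model.predict(row)
    return str(target_encoder.inverse_transform(code)[0])


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Training side: fit encoders and model, then persist everything.
    encoders = {col: LabelEncoder().fit(toy[col]) for col in FEATURES}
    target = LabelEncoder().fit(toy["subCategory"])
    X = pd.DataFrame({c: encoders[c].transform(toy[c]) for c in FEATURES})
    y = target.transform(toy["subCategory"])
    model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

    joblib.dump(model, tmp / "style_predictor.pkl")
    joblib.dump(target, tmp / "subcategory_encoder.pkl")
    for col, fname in ENCODER_FILES.items():
        joblib.dump(encoders[col], tmp / fname)

    # Inference side: only the saved files and a raw item are needed.
    item = {"baseColour": "Navy Blue", "season": "Fall",
            "gender": "Men", "usage": "Casual"}
    result = predict_subcategory(tmp, item)
    print(result)
```

In a web API the artefacts would normally be loaded once at startup rather than on every call; the helper reloads them each time only to keep the sketch short.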