Set up Python + scikit-learn

In this cell I make sure the environment is clean and has the packages I need.
First I print sys.executable so I can see exactly which Python interpreter this notebook is using.
Then I call pip through that same executable to:

upgrade pip itself
install scikit-learn for this interpreter

In [3]:
#1
import sys
print(sys.executable)
# upgrade pip for this interpreter
!"{sys.executable}" -m pip install --upgrade pip

# install scikit-learn for this interpreter
!"{sys.executable}" -m pip install scikit-learn

d:\Ds_Project\venv\python.exe
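A quick way to double-check that the install really landed in this interpreter is to ask Python whether the module can be found. This is only a small sketch; is_installed is a name I made up for illustration, it is not part of the project.

```python
import importlib.util
import sys


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None


print("Interpreter:", sys.executable)
# "sklearn" is the import name of the scikit-learn package.
for name in ("sklearn", "json"):
    print(name, "available:", is_installed(name))
```

If sklearn shows up as unavailable here, pip most likely installed into a different interpreter than the one the notebook kernel is running.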


Load all raw Kaggle CSVs

Here I import load_raw_kaggle_data from utils_data_io and call it once.

This helper function goes into data/raw/ and loads all the Kaggle CSV files (sales_train.csv, items.csv, shops.csv, etc.) into a dictionary of DataFrames.

I store that into raw_data and quickly check the keys just to see what datasets I have available.

In [4]:
#2
from utils_data_io import load_raw_kaggle_data

raw_data = load_raw_kaggle_data()
raw_data.keys()


dict_keys(['sales_train', 'items', 'item_categories', 'shops', 'test', 'sample_submission'])
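The code of load_raw_kaggle_data is not shown in this notebook, so the following is only a guess at the idea: read every CSV in a folder into a dict keyed by file name. load_csv_folder is a hypothetical stand-in, and the demo uses a throwaway folder instead of the real data/raw/.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_folder(raw_dir):
    """Hypothetical stand-in for load_raw_kaggle_data: one DataFrame per CSV file."""
    raw_dir = Path(raw_dir)
    return {path.stem: pd.read_csv(path) for path in sorted(raw_dir.glob("*.csv"))}


# Tiny demonstration on a temporary folder with a single made-up CSV.
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({"shop_name": ["A", "B"], "shop_id": [0, 1]})
    toy.to_csv(Path(tmp) / "shops.csv", index=False)
    demo = load_csv_folder(tmp)

print(list(demo.keys()))
print(demo["shops"].shape)
```

Keying the dict by file stem is what would give names like sales_train and items, matching the keys printed above.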

Quick overview of each raw table

In this loop I walk through every (name, df) pair from raw_data.items() and print:

the table name (sales_train, items, shops, …)

the shape (rows, columns)

the list of column names

This is a fast “sanity check” so I know the files loaded correctly and the columns look like what I expect from the Kaggle description.

In [5]:
#3
for name, df in raw_data.items():
    print(f"=== {name} ===")
    print("Shape:", df.shape)
    print("Columns:", list(df.columns))
    print()


=== sales_train ===
Shape: (2935849, 6)
Columns: ['date', 'date_block_num', 'shop_id', 'item_id', 'item_price', 'item_cnt_day']

=== items ===
Shape: (22170, 3)
Columns: ['item_name', 'item_id', 'item_category_id']

=== item_categories ===
Shape: (84, 2)
Columns: ['item_category_name', 'item_category_id']

=== shops ===
Shape: (60, 2)
Columns: ['shop_name', 'shop_id']

=== test ===
Shape: (214200, 3)
Columns: ['ID', 'shop_id', 'item_id']

=== sample_submission ===
Shape: (214200, 2)
Columns: ['ID', 'item_cnt_month']



Peek at the first 5 rows of each table

Now I do a more visual check.

For every raw DataFrame I print a header line and then use display(df.head()) to show the first 5 rows.

This helps me see actual example values: date format, shop IDs, item IDs, etc. It’s easier to catch odd issues (like a wrong delimiter or encoding) when looking at real rows instead of just shapes.

In [6]:
#4
for name, df in raw_data.items():
    print(f"=== {name} (first 5 rows) ===")
    display(df.head())


=== sales_train (first 5 rows) ===


Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,2013-01-02,0,59,22154,999.0,1.0
1,2013-01-03,0,25,2552,899.0,1.0
2,2013-01-05,0,25,2552,899.0,-1.0
3,2013-01-06,0,25,2554,1709.05,1.0
4,2013-01-15,0,25,2555,1099.0,1.0


=== items (first 5 rows) ===


Unnamed: 0,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40
4,***КОРОБКА (СТЕКЛО) D,4,40


=== item_categories (first 5 rows) ===


Unnamed: 0,item_category_name,item_category_id
0,PC - Гарнитуры/Наушники,0
1,Аксессуары - PS2,1
2,Аксессуары - PS3,2
3,Аксессуары - PS4,3
4,Аксессуары - PSP,4


=== shops (first 5 rows) ===


Unnamed: 0,shop_name,shop_id
0,"!Якутск Орджоникидзе, 56 фран",0
1,"!Якутск ТЦ ""Центральный"" фран",1
2,"Адыгея ТЦ ""Мега""",2
3,"Балашиха ТРК ""Октябрь-Киномир""",3
4,"Волжский ТЦ ""Волга Молл""",4


=== test (first 5 rows) ===


Unnamed: 0,ID,shop_id,item_id
0,0,5,5037
1,1,5,5320
2,2,5,5233
3,3,5,5232
4,4,5,5268


=== sample_submission (first 5 rows) ===


Unnamed: 0,ID,item_cnt_month
0,0,0.5
1,1,0.5
2,2,0.5
3,3,0.5
4,4,0.5


Explore sales_train a bit more

Here I focus specifically on the main dataset, sales_train:

I print the dtypes of each column, to confirm that date is already parsed as a datetime and that the IDs are integers.

I print the min and max of the date column to see the calendar coverage.

I print how many unique shops and items there are.

This is basically me getting a feeling for how big and how diverse the dataset is before aggregating.

In [7]:
#5
sales = raw_data["sales_train"]

print("Sales_train dtypes:")
print(sales.dtypes)
print()

print("Date range in sales_train:")
print("Min date:", sales["date"].min())
print("Max date:", sales["date"].max())
print()

print("Number of unique shops:", sales["shop_id"].nunique())
print("Number of unique items:", sales["item_id"].nunique())


Sales_train dtypes:
date              datetime64[ns]
date_block_num             int64
shop_id                    int64
item_id                    int64
item_price               float64
item_cnt_day             float64
dtype: object

Date range in sales_train:
Min date: 2013-01-01 00:00:00
Max date: 2015-10-31 00:00:00

Number of unique shops: 60
Number of unique items: 21807


Build the monthly shop–item table

Now I switch from raw daily data to a monthly view.

I import build_base_training_table from GraphQL_utils and call it on sales_train.
Inside that helper it:

Aggregates daily records into monthly item_cnt_month per (date_block_num, shop_id, item_id)

Adds simple calendar features: year and month

The result is the monthly DataFrame, which is my main starting point for modelling.

In [8]:
#6
from GraphQL_utils import build_base_training_table

sales = raw_data["sales_train"]
monthly = build_base_training_table(sales)

monthly.head()


Unnamed: 0,date_block_num,shop_id,item_id,item_cnt_month,avg_item_price,year,month
0,0,0,32,6.0,221.0,2013,1
1,0,0,33,3.0,347.0,2013,1
2,0,0,35,1.0,247.0,2013,1
3,0,0,43,1.0,221.0,2013,1
4,0,0,51,2.0,128.5,2013,1
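The inside of build_base_training_table is not shown here, so this is only a sketch of how such an aggregation could look in pandas. aggregate_monthly is a hypothetical name, and I am assuming avg_item_price is the mean daily price and that date_block_num 0 is January 2013 (which matches the rows above).

```python
import pandas as pd


def aggregate_monthly(sales: pd.DataFrame) -> pd.DataFrame:
    """Sketch: collapse daily rows to one row per (month block, shop, item)."""
    keys = ["date_block_num", "shop_id", "item_id"]
    monthly = sales.groupby(keys, as_index=False).agg(
        item_cnt_month=("item_cnt_day", "sum"),   # total units in the month
        avg_item_price=("item_price", "mean"),    # assumption: simple mean price
    )
    # Assumption: block 0 is January 2013, and blocks are consecutive months.
    monthly["year"] = 2013 + monthly["date_block_num"] // 12
    monthly["month"] = monthly["date_block_num"] % 12 + 1
    return monthly


# Made-up daily rows: a sale and a return in block 0, two sales in block 13.
toy_sales = pd.DataFrame({
    "date_block_num": [0, 0, 13, 13],
    "shop_id": [25, 25, 5, 5],
    "item_id": [2552, 2552, 10, 10],
    "item_price": [899.0, 899.0, 100.0, 200.0],
    "item_cnt_day": [1.0, -1.0, 2.0, 3.0],
})
monthly_demo = aggregate_monthly(toy_sales)
print(monthly_demo)
```

Note how a sale followed by a return nets out to a monthly count of 0, which is one way zero or negative targets can appear in the monthly table.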


Summarize the monthly table

Here I print:

the overall shape of monthly

the list of columns

the min / max of date_block_num

the unique years and months present

This tells me the time span and confirms that the added year and month columns are working fine (e.g., 2013–2015, months 1–12).

In [9]:
#7
print("Monthly table shape:", monthly.shape)
print("Columns:", list(monthly.columns))
print()

print("date_block_num range:", monthly["date_block_num"].min(), "→", monthly["date_block_num"].max())
print("Years:", sorted(monthly["year"].unique()))
print("Months:", sorted(monthly["month"].unique()))


Monthly table shape: (1609124, 7)
Columns: ['date_block_num', 'shop_id', 'item_id', 'item_cnt_month', 'avg_item_price', 'year', 'month']

date_block_num range: 0 → 33
Years: [np.int64(2013), np.int64(2014), np.int64(2015)]
Months: [np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(10), np.int64(11), np.int64(12)]


Basic stats of the target (item_cnt_month)

This snippet is just descriptive statistics of the target:

monthly["item_cnt_month"].describe()


I print the result so I can see:

min / max sales per month

mean, std, and quantiles

It helps me understand if the target is very skewed, if there are outliers, and roughly what range of values the model will be predicting.

In [10]:
#8
print("item_cnt_month basic stats:")
print(monthly["item_cnt_month"].describe())


item_cnt_month basic stats:
count    1.609124e+06
mean     2.267200e+00
std      8.649882e+00
min     -2.200000e+01
25%      1.000000e+00
50%      1.000000e+00
75%      2.000000e+00
max      2.253000e+03
Name: item_cnt_month, dtype: float64
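To put numbers on the skew and the outliers, a few extra summaries on top of describe() can help. This is just a sketch on made-up values; target_summary is a helper name I invented, it is not in GraphQL_utils.

```python
import pandas as pd


def target_summary(target: pd.Series) -> dict:
    """Sketch: shares of negative/zero values plus the upper tail of the target."""
    return {
        "share_negative": float((target < 0).mean()),
        "share_zero": float((target == 0).mean()),
        "q99": float(target.quantile(0.99)),
        "max": float(target.max()),
    }


# Made-up target values, only to show the shape of the result.
toy_target = pd.Series([-1.0, 0.0, 1.0, 1.0, 2.0, 100.0])
summary = target_summary(toy_target)
print(summary)
```

On the real table this would be called as target_summary(monthly["item_cnt_month"]). A large gap between the 99th percentile and the max points to a handful of extreme rows.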


Create lagged features and import training helpers

Here I import several modeling helpers from GraphQL_utils:

make_lagged_features –> adds lag_1, lag_2, lag_3 on item_cnt_month

make_train_val_sets –> builds the train/validation split

train_baseline_model –> trains a RandomForest baseline

evaluate_model –> computes RMSE on the validation set

Then I call:

lagged = make_lagged_features(monthly, lags=(1, 2, 3))


This produces a new table lagged that has the original monthly columns plus lag_1, lag_2, and lag_3. Those lags are the main predictive features for the model.

In [11]:
#9
from GraphQL_utils import (
    make_lagged_features,
    make_train_val_sets,
    train_baseline_model,
    evaluate_model,
)

lagged = make_lagged_features(monthly, lags=(1, 2, 3))
lagged.head()


Unnamed: 0,date_block_num,shop_id,item_id,item_cnt_month,avg_item_price,year,month,lag_1,lag_2,lag_3
63224,1,0,30,31.0,265.0,2013,2,0.0,0.0,0.0
63225,1,0,31,11.0,434.0,2013,2,0.0,0.0,0.0
0,0,0,32,6.0,221.0,2013,1,0.0,0.0,0.0
63226,1,0,32,10.0,221.0,2013,2,6.0,0.0,0.0
1,0,0,33,3.0,347.0,2013,1,0.0,0.0,0.0
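I don't show the code of make_lagged_features in this notebook, so here is one possible way to build calendar-based lags: shift date_block_num forward by the lag and merge back on (block, shop, item). add_lags is a hypothetical name, and filling missing lags with 0 is an assumption based on the 0.0 values in the output above.

```python
import pandas as pd


def add_lags(monthly: pd.DataFrame, lags=(1, 2, 3), target="item_cnt_month") -> pd.DataFrame:
    """Sketch: lag_k = target of the same (shop, item) exactly k months earlier."""
    keys = ["date_block_num", "shop_id", "item_id"]
    out = monthly.copy()
    for lag in lags:
        shifted = monthly[keys + [target]].copy()
        shifted["date_block_num"] += lag  # month b becomes the lag-k source for b + k
        shifted = shifted.rename(columns={target: f"lag_{lag}"})
        out = out.merge(shifted, on=keys, how="left")
    lag_cols = [f"lag_{lag}" for lag in lags]
    out[lag_cols] = out[lag_cols].fillna(0.0)  # assumption: no history means 0 sales
    return out


# Made-up history for one shop-item pair with a gap at block 2.
toy_monthly = pd.DataFrame({
    "date_block_num": [0, 1, 3],
    "shop_id": [0, 0, 0],
    "item_id": [32, 32, 32],
    "item_cnt_month": [6.0, 10.0, 4.0],
})
lagged_demo = add_lags(toy_monthly)
print(lagged_demo)
```

The merge approach treats a missing month as zero sales. A plain groupby().shift() would instead pull the previous existing row, which may be several months old, so the two approaches can disagree when a pair has gaps.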


Build train and validation sets

Now I turn the lagged table into proper modeling matrices.

I call make_train_val_sets with

features_df = lagged

val_block = 32 (this is the month I hold out as validation)

target_col = "item_cnt_month"

The helper returns:

X_train, y_train –> all rows with date_block_num < 32

X_val, y_val –> rows with date_block_num == 32

feature_cols –> the list of feature column names (lags + price + calendar)

Then I print shapes and the feature list so I know exactly what the model will be using.

In [12]:
#10
# Create train/validation sets from the lagged table
X_train, y_train, X_val, y_val, feature_cols = make_train_val_sets(
    lagged,
    val_block=32,               # validation month (you can change later)
    target_col="item_cnt_month" # what we are predicting
)

print("Train shape:", X_train.shape)
print("Validation shape:", X_val.shape)
print("Number of features:", len(feature_cols))
print("Feature columns:", feature_cols)


Train shape: (1547915, 8)
Validation shape: (29678, 8)
Number of features: 8
Feature columns: ['shop_id', 'item_id', 'avg_item_price', 'year', 'month', 'lag_1', 'lag_2', 'lag_3']
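For reference, a time-based split like this can be sketched in a few lines. split_by_block is a hypothetical stand-in for make_train_val_sets, and I am assuming the features are simply every column except the target and date_block_num, which matches the feature list printed above.

```python
import pandas as pd


def split_by_block(df, val_block, target_col="item_cnt_month", time_col="date_block_num"):
    """Sketch: train on blocks before val_block, validate on val_block itself."""
    feature_cols = [c for c in df.columns if c not in (target_col, time_col)]
    train = df[df[time_col] < val_block]
    val = df[df[time_col] == val_block]
    return train[feature_cols], train[target_col], val[feature_cols], val[target_col], feature_cols


# Made-up rows covering blocks 30..33.
toy = pd.DataFrame({
    "date_block_num": [30, 31, 32, 33],
    "shop_id": [1, 1, 1, 1],
    "lag_1": [0.0, 2.0, 3.0, 1.0],
    "item_cnt_month": [2.0, 3.0, 1.0, 4.0],
})
X_tr, y_tr, X_va, y_va, cols = split_by_block(toy, val_block=32)
print(X_tr.shape, X_va.shape, cols)
```

With val_block=32, rows from block 33 end up in neither set. That fits the shapes above: 1,547,915 + 29,678 is less than the 1,609,124 rows of the full table.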


Train baseline RandomForest and get validation RMSE

In this cell I actually train a first model:

model = train_baseline_model(
    X_train,
    y_train,
    n_estimators=50,
    max_depth=12,
)

n_estimators=50 keeps training fast while we test.

max_depth=12 avoids extremely deep trees.

After training, I call evaluate_model(model, X_val, y_val) to compute RMSE on the validation month.
The printed Validation RMSE is the main metric I use to judge whether this baseline works acceptably before I go further.

In [13]:
#11
# Train a baseline RandomForest model
# (you can tweak n_estimators / max_depth if it's slow)
model = train_baseline_model(
    X_train,
    y_train,
    n_estimators=50,   # start small for speed; you can increase later
    max_depth=12,      # limit tree depth so it finishes in reasonable time
    # random_state is fixed inside the function for reproducibility
)

print("Model trained.")


Model trained.


In [14]:
# Evaluate the model on the validation month
rmse = evaluate_model(model, X_val, y_val)
print("Validation RMSE:", rmse)


Validation RMSE: 16.920514008462145
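Since train_baseline_model and evaluate_model live in GraphQL_utils and are not shown here, this is only a rough sketch of what such helpers could look like with scikit-learn. fit_forest and rmse_score are hypothetical names, random_state=42 is an assumed value, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


def fit_forest(X, y, n_estimators=50, max_depth=12, random_state=42):
    """Sketch of a baseline trainer; random_state=42 is an assumed seed."""
    model = RandomForestRegressor(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=random_state,
    )
    model.fit(X, y)
    return model


def rmse_score(model, X, y):
    """Root mean squared error of the model's predictions on (X, y)."""
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))


# Synthetic data only, to show the call pattern.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

demo_model = fit_forest(X_demo[:150], y_demo[:150], n_estimators=10, max_depth=4)
demo_rmse = rmse_score(demo_model, X_demo[150:], y_demo[150:])
print("Demo RMSE:", demo_rmse)
```

Fixing random_state is what makes repeated runs give the same forest, so differences in RMSE come from the data or the hyperparameters and not from random seeding.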


MAE and MAPE

Here I keep the same model but look at more evaluation numbers.

Steps:

Predict y_val_pred = model.predict(X_val)

Compute MAE with mean_absolute_error

Recompute RMSE manually using NumPy as a consistency check

Compute MAPE as a percentage.

In [15]:
#12
import numpy as np
from sklearn.metrics import mean_absolute_error

# Predictions on validation set
y_val_pred = model.predict(X_val)

# 1. MAE
mae = mean_absolute_error(y_val, y_val_pred)

# 2. RMSE (should match what evaluate_model gave)
mse = ((y_val - y_val_pred) ** 2).mean()
rmse_again = np.sqrt(mse)

# 3. MAPE (in %)
mape = (np.abs((y_val - y_val_pred) / (y_val + 1e-8))).mean() * 100

print("MAE :", mae)
print("RMSE:", rmse_again)
print("MAPE:", mape, "%")



MAE : 1.5181185747123525
RMSE: 16.920514008462145
MAPE: 33680588.17675428 %
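The huge MAPE is most likely caused by validation rows where the true value is 0 (or very close to it): dividing by y_val + 1e-8 then multiplies the error by about 1e8. The sketch below shows two common workarounds on made-up numbers: MAPE over non-zero targets only, and WAPE (total absolute error divided by total absolute target). Both function names are mine, not from the project.

```python
import numpy as np


def mape_nonzero(y_true, y_pred):
    """MAPE in %, computed only on rows where the true value is not zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100)


def wape(y_true, y_pred):
    """Weighted absolute percentage error in %: sum|error| / sum|true|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.abs(y_true - y_pred).sum() / np.abs(y_true).sum() * 100)


# Made-up example with one zero target.
y_true_demo = np.array([0.0, 1.0, 2.0, 4.0])
y_pred_demo = np.array([1.0, 1.0, 1.0, 5.0])

# Same formula as in the cell above: the zero target dominates everything.
naive_mape = float(np.mean(np.abs((y_true_demo - y_pred_demo) / (y_true_demo + 1e-8))) * 100)

print("Naive MAPE   :", naive_mape)
print("Non-zero MAPE:", mape_nonzero(y_true_demo, y_pred_demo))
print("WAPE         :", wape(y_true_demo, y_pred_demo))
```

Neither variant is a drop-in replacement for MAPE. The first ignores the zero rows completely and the second weights large-volume rows more, so which one is appropriate depends on what the number is used for.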


Naive lag-1 baseline vs RandomForest

In this snippet I build a very simple naive baseline, “next month equals last month”, or more exactly:

prediction for month t = lag_1 (sales from month t-1).

Steps:

Filter lagged to only the validation month (date_block_num == 32).

y_val_baseline is the true item_cnt_month.

y_pred_naive is lag_1.

Compute RMSE between those two using mean_squared_error and np.sqrt.

Print:

Naive lag-1 RMSE

RandomForest RMSE (from earlier)

This directly compares:

whether the RandomForest is actually better than a simple “use last month” strategy

by how much we are improving over that simple baseline.

In [16]:
#13
import numpy as np
from sklearn.metrics import mean_squared_error

val_block = 32  # same block you used for validation

# 1. Take only the validation month rows
val_mask = lagged["date_block_num"] == val_block
y_val_baseline = lagged.loc[val_mask, "item_cnt_month"]

# 2. Naive prediction = lag_1 (previous month’s sales)
y_pred_naive = lagged.loc[val_mask, "lag_1"]

# 3. Compute RMSE for the naive baseline
mse_naive = mean_squared_error(y_val_baseline, y_pred_naive)
rmse_naive = np.sqrt(mse_naive)

print("Naive lag-1 RMSE:", rmse_naive)
print("RandomForest RMSE:", rmse)   # 'rmse' from evaluate_model(model, X_val, y_val)


Naive lag-1 RMSE: 16.879604634930846
RandomForest RMSE: 16.920514008462145
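To express the comparison as one number, the relative change in RMSE against the naive baseline can be computed like this. The two values are copied from the output above, and relative_improvement is just an illustrative helper name.

```python
def relative_improvement(baseline_rmse: float, model_rmse: float) -> float:
    """Percent RMSE reduction of the model relative to the baseline (positive = better)."""
    return (baseline_rmse - model_rmse) / baseline_rmse * 100.0


rmse_naive_printed = 16.879604634930846  # naive lag-1 RMSE from the output above
rmse_rf_printed = 16.920514008462145     # RandomForest RMSE from the output above

improvement = relative_improvement(rmse_naive_printed, rmse_rf_printed)
print(f"RandomForest vs naive lag-1: {improvement:+.2f} %")
```

For these two numbers the result is slightly negative, meaning that on validation month 32 the forest is marginally worse than the naive lag-1 prediction, not better.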


Train final model on all history

Once the model structure looks good, I retrain it on all available data (all 34 months, blocks 0–33), not just the 32 training months.

Print feature_cols again so it’s clear what the final model is using.

Build:

X_full = lagged[feature_cols]

y_full = lagged["item_cnt_month"]

Call train_baseline_model again with n_estimators=100 and max_depth=12 on this full dataset.

This final_model is the one I will save to disk and serve later through GraphQL and Docker.

In [17]:
#14
# Train a final model on ALL available history (for serving via GraphQL)

# 1. Use the same feature columns we used for train/val
print("Feature columns:", feature_cols)

X_full = lagged[feature_cols].copy()
y_full = lagged["item_cnt_month"].copy()

print("Full training shape:", X_full.shape)

# 2. Train a slightly bigger forest for the final model
final_model = train_baseline_model(
    X_full,
    y_full,
    n_estimators=100,   # a bit larger than before for stability
    max_depth=12,
)

print("Final model trained on all data.")


Feature columns: ['shop_id', 'item_id', 'avg_item_price', 'year', 'month', 'lag_1', 'lag_2', 'lag_3']
Full training shape: (1609124, 8)
Final model trained on all data.


Save trained model and metadata

Here I write the trained model and some metadata to disk so other scripts (API, notebooks) can reload them.

os.makedirs("models", exist_ok=True) creates a folder to hold model files.

joblib.dump(final_model, model_path) writes the RandomForest to models/rf_ecommerce_monthly.joblib.

I build a small model_meta dict with

feature_cols –> the exact feature order

lags –> which lags I used (1, 2, 3)

I save that dict as JSON to models/model_meta.json.
Later, load_trained_model_and_features reads these two files to rebuild the full inference pipeline.

In [18]:
#15
import os
import json
import joblib

# Make a folder inside your project to hold models
os.makedirs("models", exist_ok=True)

model_path = "models/rf_ecommerce_monthly.joblib"
meta_path  = "models/model_meta.json"

# 1. Save the trained RandomForest model
joblib.dump(final_model, model_path)
print("Saved model to:", model_path)

# 2. Save metadata (feature columns and lags used)
model_meta = {
    "feature_cols": feature_cols,
    "lags": [1, 2, 3],
}

with open(meta_path, "w") as f:
    json.dump(model_meta, f, indent=2)

print("Saved metadata to:", meta_path)


Saved model to: models/rf_ecommerce_monthly.joblib
Saved metadata to: models/model_meta.json
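The real load_trained_model_and_features is not shown in this notebook. As a rough sketch of the reload side, something like the following would read both files back. load_model_and_meta is a hypothetical name, and the demo saves a plain dict as a stand-in for the RandomForest so it runs anywhere.

```python
import json
import tempfile
from pathlib import Path

import joblib


def load_model_and_meta(model_path, meta_path):
    """Sketch: reload the pickled model plus the feature order and lags from JSON."""
    model = joblib.load(model_path)
    with open(meta_path) as f:
        meta = json.load(f)
    return model, meta["feature_cols"], meta["lags"]


# Round-trip demo in a temporary folder; the "model" is only a placeholder object.
with tempfile.TemporaryDirectory() as tmp:
    demo_model_path = Path(tmp) / "model.joblib"
    demo_meta_path = Path(tmp) / "model_meta.json"
    joblib.dump({"placeholder": True}, demo_model_path)
    with open(demo_meta_path, "w") as f:
        json.dump({"feature_cols": ["shop_id", "lag_1"], "lags": [1, 2, 3]}, f)
    loaded_model, loaded_cols, loaded_lags = load_model_and_meta(demo_model_path, demo_meta_path)

print(loaded_cols, loaded_lags)
```

At inference time the incoming features should be arranged as X[loaded_cols] before calling predict, so the column order matches what the model was trained on.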


Save processed tables (monthly + lagged)

Finally, I also save the intermediate processed tables for convenience.

Create data/processed if it doesn’t exist.

Save:

monthly → data/processed/monthly.pkl
lagged → data/processed/lagged.pkl

These make it easy to reload the engineered features quickly, without redoing all the raw CSV reading and aggregation every single time.
It’s not strictly required for the GraphQL API, but it’s useful for debugging and future experiments.

In [19]:
#16
import os

os.makedirs("data/processed", exist_ok=True)

monthly_path = "data/processed/monthly.pkl"
lagged_path  = "data/processed/lagged.pkl"

monthly.to_pickle(monthly_path)
lagged.to_pickle(lagged_path)

print("Saved monthly to:", monthly_path)
print("Saved lagged  to:", lagged_path)


Saved monthly to: data/processed/monthly.pkl
Saved lagged  to: data/processed/lagged.pkl
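Reloading the tables later is the mirror image of the cell above. This is a small sketch using pd.read_pickle; load_processed is a made-up helper name, and the demo runs on a temporary folder with a toy frame instead of the real data/processed files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_processed(processed_dir="data/processed"):
    """Sketch: read monthly.pkl and lagged.pkl back into DataFrames."""
    processed_dir = Path(processed_dir)
    monthly_df = pd.read_pickle(processed_dir / "monthly.pkl")
    lagged_df = pd.read_pickle(processed_dir / "lagged.pkl")
    return monthly_df, lagged_df


# Round-trip demo on made-up data.
toy_table = pd.DataFrame({"date_block_num": [0, 1], "item_cnt_month": [6.0, 10.0]})
with tempfile.TemporaryDirectory() as tmp:
    toy_table.to_pickle(Path(tmp) / "monthly.pkl")
    toy_table.assign(lag_1=[0.0, 6.0]).to_pickle(Path(tmp) / "lagged.pkl")
    monthly_back, lagged_back = load_processed(tmp)

print(monthly_back.equals(toy_table), list(lagged_back.columns))
```

Pickle keeps dtypes (including the datetime columns) intact, but pickle files should only be loaded from sources you trust.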
