Single-point forecast using the shared `predict_sales` helper

In this step I’m just checking that the shared helper functions from `GraphQL_utils.py` work end-to-end.

1. I call `load_trained_model_and_features()` which:
   * loads the saved RandomForest model,
   * rebuilds the full lagged feature table from the raw Kaggle CSVs,
   * returns the exact `feature_cols` the model expects.

2. I pick one real combination from the data:
   * `date_block_num = 1`
   * `shop_id = 0`
   * `item_id = 30`

3. I call the **shared** `predict_sales(...)` helper, passing:
   * the model,
   * the full lagged table,
   * the feature column list,
   * and that (shop, item, month) combo.

The output I got here is:

- **Sample prediction ≈ 2.39 units**

This is the same helper that the GraphQL API uses internally, so this cell is basically
a quick “does our end-to-end forecasting pipeline actually work?” check before we wrap
it behind the API and Docker.

In [2]:
from GraphQL_utils import load_trained_model_and_features, predict_sales

# Load model + full lagged feature table
model, lagged, feature_cols = load_trained_model_and_features()
print("Lagged shape:", lagged.shape)
print("First feature cols:", feature_cols[:10])

# Take one real (date_block_num, shop_id, item_id) combo
sample = lagged[["date_block_num", "shop_id", "item_id"]].iloc[0]
test_block = int(sample["date_block_num"])
test_shop  = int(sample["shop_id"])
test_item  = int(sample["item_id"])

print("Testing combo:", test_block, test_shop, test_item)

pred = predict_sales(
    model=model,
    lagged=lagged,
    feature_cols=feature_cols,
    shop_id=test_shop,
    item_id=test_item,
    date_block_num=test_block,
)
print("Sample prediction:", pred)


Lagged shape: (1609124, 10)
First feature cols: ['shop_id', 'item_id', 'avg_item_price', 'year', 'month', 'lag_1', 'lag_2', 'lag_3']
Testing combo: 1 0 30
Sample prediction: 2.393372841806763


Install GraphQL / API dependencies

In this snippet I install the extra packages needed for the GraphQL and FastAPI parts of the project into the current Jupyter environment.
I use sys.executable so that pip runs against the same Python interpreter that this notebook kernel is using.

This only needs to run once per environment (the first time, or after creating a fresh environment). After that the packages are already installed and this snippet can be skipped.

In [3]:
#1
import sys
print(sys.executable)
!"{sys.executable}" -m pip install graphene==3.3 fastapi starlette-graphene3

d:\Ds_Project\venv\python.exe
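
Whether the install can be skipped can be checked from Python itself. The following is a minimal sketch, not part of the original notebook: the helper name missing_modules is hypothetical, and the module names are the import names used in this notebook (graphene, fastapi, starlette_graphene3).

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Import names of the packages installed by the pip command above
required = ["graphene", "fastapi", "starlette_graphene3"]

missing = missing_modules(required)
if missing:
    print("Run the install snippet; missing modules:", missing)
else:
    print("All GraphQL/API packages are already importable")
```

The check only looks at whether a module can be located, so it does not verify package versions such as the graphene==3.3 pin.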


Import GraphQL library for in-notebook tests

Here I import graphene, which I use to define a small GraphQL schema directly inside the notebook. The cell also imports FastAPI and the starlette_graphene3 helpers (GraphQLApp, make_graphiql_handler), but the in-memory tests below only use graphene.

This lets me test the predictSales query in memory with schema.execute(...), without starting a web server. The “real” HTTP API runs from GraphQL_API.py; for local schema experiments the graphene import is enough.

In [11]:
#2
import graphene
from fastapi import FastAPI
from starlette_graphene3 import GraphQLApp, make_graphiql_handler


Import helper to load the trained model + features

This snippet imports the main helper function from GraphQL_utils.py:

load_trained_model_and_features()

That function hides the messy details: it loads the trained RandomForest model from disk, rebuilds the monthly and lagged tables using the same logic as the training notebook, and returns everything already merged together.

The notebook doesn’t need to rebuild the pipeline manually; it calls this helper once and gets back the model, the lagged DataFrame, and the list of feature columns.

In [4]:
#3
import GraphQL_utils
from importlib import reload
reload(GraphQL_utils)

from GraphQL_utils import (
    load_trained_model_and_features,
    build_base_training_table,
    make_lagged_features,
    make_train_val_sets,
    train_baseline_model,
    evaluate_model,
)


Load trained model, lagged table, and feature list

Here I call load_trained_model_and_features() and unpack the result into three objects:

model –> the trained RandomForest regressor

lagged –> the full lagged training table (with lags, calendar features, etc.)

feature_cols –> the exact list of feature columns, in the order used during training

I also print a bit of information (the shape of lagged and the feature column names) so I can quickly confirm that everything loaded correctly and matches what I saw in the training notebook.

This is the main “setup” step for the rest of the notebook.

In [5]:
#4
import os
import json
import joblib
import pandas as pd

from GraphQL_utils import load_trained_model_and_features

model, lagged, feature_cols = load_trained_model_and_features()

print("Loaded model and features")
print("Lagged shape:", lagged.shape)
print("Feature cols:", feature_cols)


Loaded model and features
Lagged shape: (1609124, 10)
Feature cols: ['shop_id', 'item_id', 'avg_item_price', 'year', 'month', 'lag_1', 'lag_2', 'lag_3']
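
The shape (1609124, 10) together with the eight feature names suggests that lagged carries two columns the model does not consume. A minimal sketch of that check follows. It assumes the two extra columns are date_block_num (which later cells do select from lagged) and the target item_cnt_month; the column order and the helper name split_columns are hypothetical. In the notebook, the first argument would be list(lagged.columns).

```python
feature_cols = ["shop_id", "item_id", "avg_item_price", "year", "month",
                "lag_1", "lag_2", "lag_3"]

# ASSUMPTION: the real `lagged` table has these 10 columns (order is a guess).
assumed_lagged_columns = ["date_block_num", *feature_cols, "item_cnt_month"]


def split_columns(table_columns, feature_cols):
    """Return (missing, extra).

    missing: feature columns that are absent from the table
    extra:   table columns that are not used as model features
    """
    missing = [c for c in feature_cols if c not in table_columns]
    extra = [c for c in table_columns if c not in feature_cols]
    return missing, extra


missing, extra = split_columns(assumed_lagged_columns, feature_cols)
print("Missing feature columns:", missing)
print("Non-feature columns:", extra)
```

An empty "missing" list is the important part: it means every column the model expects can be selected from the table.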


Peek at valid (month, shop, item) combinations

In this snippet I just inspect the first few rows of the lagged table, but only the date_block_num, shop_id, and item_id columns.

The goal is to see some real combinations that actually exist in the data. That makes it easier to pick a valid (shop_id, item_id, date_block_num) for testing, and avoids errors like “no data found for this combination” later when I query the model or the GraphQL schema.

In [6]:
#5
# Look at some combinations we can use for testing
lagged[["date_block_num", "shop_id", "item_id"]].head()


       date_block_num  shop_id  item_id
63224               1        0       30
63225               1        0       31
0                   0        0       32
63226               1        0       32
1                   0        0       33


Helpers to build feature rows and predict sales

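This snippet defines two helper functions whose logic I reuse in both the notebook and the API. These notebook versions read model, lagged, and feature_cols from the notebook's global variables, whereas the shared predict_sales in GraphQL_utils.py takes them as arguments: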

build_feature_row(shop_id, item_id, date_block_num)

Filters the lagged DataFrame to find the row that matches the requested (shop_id, item_id, date_block_num).

If no row is found, it raises a clear ValueError so I know the combination is invalid.

If rows exist, it takes the first one and builds a 1-row DataFrame containing only feature_cols, in the exact same order the model was trained on.

predict_sales(shop_id, item_id, date_block_num)

Calls build_feature_row() to build the feature matrix X for that specific combination.

Passes X into the trained RandomForest model with model.predict(X).

Returns the predicted monthly sales (item_cnt_month) as a plain Python float.

Together, these helpers give me a path from a human-friendly query (shop, item, month) to a model-ready feature vector and a final numeric prediction.

In [7]:
#6
def build_feature_row(shop_id: int, item_id: int, date_block_num: int):
    """
    Find the matching row in `lagged` and return a one-row DataFrame
    with the correct feature columns and order.
    """
    # Filter the lagged table
    mask = (
        (lagged["shop_id"] == shop_id) &
        (lagged["item_id"] == item_id) &
        (lagged["date_block_num"] == date_block_num)
    )
    rows = lagged.loc[mask]

    if rows.empty:
        raise ValueError(
            f"No data found for shop_id={shop_id}, item_id={item_id}, "
            f"date_block_num={date_block_num}"
        )

    # If there are multiple rows, just take the first one
    row = rows.iloc[[0]]  # keep it as DataFrame with one row

    # Keep only the feature columns, in the same order as training
    X = row[feature_cols].copy()

    return X


def predict_sales(shop_id: int, item_id: int, date_block_num: int) -> float:
    """
    Use the trained RandomForest model to predict item_cnt_month
    for a given (shop, item, date_block_num).
    """
    X = build_feature_row(shop_id, item_id, date_block_num)
    y_pred = model.predict(X)[0]

    # Convert numpy scalar to plain float
    return float(y_pred)
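

To see both branches of build_feature_row without loading the full table, the same filtering logic can be run against a tiny stand-in DataFrame. This is a sketch with made-up values: toy_lagged, toy_feature_cols, and toy_build_feature_row are hypothetical names, and only the lookup logic mirrors the snippet above.

```python
import pandas as pd

# Toy stand-in for `lagged`; the lag_1 values are invented for illustration
toy_lagged = pd.DataFrame(
    {
        "date_block_num": [1, 1, 0],
        "shop_id": [0, 0, 0],
        "item_id": [30, 31, 32],
        "lag_1": [5.0, 2.0, 0.0],
    }
)
toy_feature_cols = ["shop_id", "item_id", "lag_1"]


def toy_build_feature_row(shop_id: int, item_id: int, date_block_num: int):
    """Same lookup logic as build_feature_row, applied to the toy table."""
    mask = (
        (toy_lagged["shop_id"] == shop_id)
        & (toy_lagged["item_id"] == item_id)
        & (toy_lagged["date_block_num"] == date_block_num)
    )
    rows = toy_lagged.loc[mask]
    if rows.empty:
        raise ValueError(
            f"No data found for shop_id={shop_id}, item_id={item_id}, "
            f"date_block_num={date_block_num}"
        )
    # Double brackets keep a one-row DataFrame instead of a Series
    return rows.iloc[[0]][toy_feature_cols].copy()


# Existing combination: one row, feature columns only
X = toy_build_feature_row(0, 30, 1)
print(X.shape, list(X.columns))

# Unknown combination: the helper raises instead of returning an empty frame
try:
    toy_build_feature_row(0, 999, 1)
except ValueError as err:
    print("Rejected:", err)
```

Raising on an empty match keeps the failure close to its cause; passing an empty frame on to model.predict would fail later with a less helpful message.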


Pick a real (month, shop, item) for testing

Here I pick one real example from the lagged table:

I take the first row of lagged and keep only date_block_num, shop_id, and item_id,

then I print that mini-series so I can see exactly which month, shop, and item I’m about to use.

This guarantees I’m using a combination that actually exists in the data, which keeps the later tests simple and avoids “no data found” errors.

In [8]:
#7
# Step 1: pick a real (date_block_num, shop_id, item_id) that exists
sample = lagged[["date_block_num", "shop_id", "item_id"]].iloc[0]
print(sample)


date_block_num     1
shop_id            0
item_id           30
Name: 63224, dtype: int64


Sanity-check the predict_sales helper

Using the sample from snippet 7, I:

Pull out test_date_block, test_shop_id, and test_item_id from the sample row.

Print those values so I can see which combination I’m testing.

Call pred = predict_sales(test_shop_id, test_item_id, test_date_block).

Print the predicted item_cnt_month value.

This is a basic end-to-end sanity check: it confirms that the model, the lagged table, and the feature-building logic are all wired together correctly before I involve GraphQL or HTTP.

In [9]:
#8
# Step 2: use that sample row to test our prediction function

test_date_block = int(sample["date_block_num"])
test_shop_id    = int(sample["shop_id"])
test_item_id    = int(sample["item_id"])

print("Using combo:",
      "date_block_num =", test_date_block,
      "shop_id =", test_shop_id,
      "item_id =", test_item_id)

pred = predict_sales(test_shop_id, test_item_id, test_date_block)
print("Predicted item_cnt_month:", pred)


Using combo: date_block_num = 1 shop_id = 0 item_id = 30
Predicted item_cnt_month: 3.419104059723947


Define the GraphQL schema for in-notebook use

This snippet defines a small GraphQL schema using graphene so I can test queries directly inside the notebook:

SalesPredictionType describes the shape of a prediction object, with fields:

shop_id

item_id

date_block_num

prediction

Query defines a single field called predict_sales (which appears as predictSales to GraphQL clients).
It takes three arguments:

shop_id (Int, required)

item_id (Int, required)

date_block_num (Int, required)

The resolver resolve_predict_sales:

calls the predict_sales(...) helper from snippet 6,

and returns a SalesPredictionType object with all fields filled in.

Finally, I build schema = graphene.Schema(query=Query), which lets me run queries like schema.execute(query_string) to test the GraphQL layer entirely in memory.

In [12]:
#9
# ---- GraphQL schema (uses our predict_sales helper) ----

class SalesPredictionType(graphene.ObjectType):
    shop_id = graphene.Int()
    item_id = graphene.Int()
    date_block_num = graphene.Int()
    prediction = graphene.Float()


class Query(graphene.ObjectType):
    # Graphene will expose this as `predictSales` in the GraphQL schema
    predict_sales = graphene.Field(
        SalesPredictionType,
        shop_id=graphene.Int(required=True),
        item_id=graphene.Int(required=True),
        date_block_num=graphene.Int(required=True),
    )

    def resolve_predict_sales(self, info, shop_id, item_id, date_block_num):
        # Call our model helper
        y = predict_sales(shop_id, item_id, date_block_num)
        return SalesPredictionType(
            shop_id=shop_id,
            item_id=item_id,
            date_block_num=date_block_num,
            prediction=float(y),
        )


schema = graphene.Schema(query=Query)
print("GraphQL schema ready.")


GraphQL schema ready.
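

The snake_case to camelCase renaming (predict_sales becoming predictSales) can be illustrated in a few lines of plain Python. This only illustrates the naming rule as it applies to the names in this schema; to_camel_case is a hypothetical stand-in, not graphene's own implementation, which may treat edge cases differently.

```python
def to_camel_case(snake_name: str) -> str:
    """Convert a snake_case identifier to camelCase (illustrative only)."""
    first, *rest = snake_name.split("_")
    return first + "".join(part.capitalize() for part in rest)


# Python-side names used in the schema above
for name in ["predict_sales", "shop_id", "item_id", "date_block_num", "prediction"]:
    print(f"{name} -> {to_camel_case(name)}")
```

This is why the query strings in the following snippets use shopId, itemId, and dateBlockNum even though the resolver arguments are written in snake_case.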


Run a GraphQL query directly against the schema

In this snippet I build a GraphQL query string that calls:

predictSales(shopId: ..., itemId: ..., dateBlockNum: ...)


and asks for:

shopId

itemId

dateBlockNum

prediction

I then run:

result = schema.execute(query)


This executes the query fully in memory using the schema from snippet 9 and the predict_sales helper from snippet 6.

Printing result.data lets me see the same structure I would get from the real HTTP API, but without needing to start a web server. It’s a nice way to confirm that the schema and resolver logic are working correctly.

(Separately, I install the requests library in another cell so I can call the real HTTP endpoint in the next snippet.)

In [13]:
#10
query = """
{
  predictSales(shopId: 0, itemId: 30, dateBlockNum: 1) {
    shopId
    itemId
    dateBlockNum
    prediction
  }
}
"""

result = schema.execute(query)
print(result.data)


{'predictSales': {'shopId': 0, 'itemId': 30, 'dateBlockNum': 1, 'prediction': 3.4191040597239457}}


In [19]:

%pip install requests


Collecting requests
  Downloading requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting charset_normalizer<4,>=2 (from requests)
  Downloading charset_normalizer-3.4.4-cp312-cp312-win_amd64.whl.metadata (38 kB)
Collecting urllib3<3,>=1.21.1 (from requests)
  Downloading urllib3-2.6.2-py3-none-any.whl.metadata (6.6 kB)
Collecting certifi>=2017.4.17 (from requests)
  Downloading certifi-2025.11.12-py3-none-any.whl.metadata (2.5 kB)
Downloading requests-2.32.5-py3-none-any.whl (64 kB)
Downloading charset_normalizer-3.4.4-cp312-cp312-win_amd64.whl (107 kB)
Downloading urllib3-2.6.2-py3-none-any.whl (131 kB)
Downloading certifi-2025.11.12-py3-none-any.whl (159 kB)
Installing collected packages: urllib3, charset_normalizer, certifi, requests


Call the running HTTP GraphQL API (Docker / uvicorn)

This is the final end-to-end test: here I act as a real client and call the running GraphQL API over HTTP.

Assumptions:

The API server is already running, either:

via Docker:
docker run --rm -p 8000:8000 ecommerce-graphql

or via uvicorn directly:
uvicorn GraphQL_API:app --reload --port 8000
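
Before sending the query, it can help to confirm that something is listening on port 8000. A minimal sketch using only the standard library follows; the helper name is_port_open is hypothetical, and a successful TCP connection only shows that some process accepts connections on that port, not that it is this particular API.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts
        return False


if is_port_open("127.0.0.1", 8000):
    print("Something is listening on 127.0.0.1:8000")
else:
    print("Nothing is listening on 127.0.0.1:8000; start Docker or uvicorn first")
```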

In the snippet I:

Set url = "http://127.0.0.1:8000/graphql", which is where the server listens.

Build the same GraphQL query string as in snippet 10.

Use requests.post(url, json={"query": query}) to send the query as JSON.

Print the HTTP status code (should be 200 if everything is OK).

Pretty-print the JSON response with json.dumps(response.json(), indent=2).

The prediction field in this JSON is the final model output coming from the Dockerized FastAPI + GraphQL service, which is exactly what the project is meant to demonstrate.

In [2]:
#11
import requests
import json

url = "http://127.0.0.1:8000/graphql"

query = """
{
  predictSales(shopId: 0, itemId: 30, dateBlockNum: 1) {
    shopId
    itemId
    dateBlockNum
    prediction
  }
}
"""

response = requests.post(url, json={"query": query})
print(response.status_code)
print(json.dumps(response.json(), indent=2))


200
{
  "data": {
    "predictSales": {
      "shopId": 0,
      "itemId": 30,
      "dateBlockNum": 1,
      "prediction": 3.4191040597239457
    }
  }
}
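
The query above hard-codes the three IDs in the query string. A common alternative is to send them as GraphQL variables and to check the response for an "errors" entry before reading the prediction. The sketch below shows both ideas with the standard library only. The names PREDICT_QUERY, build_payload, and extract_prediction are hypothetical, and it assumes the server accepts a "variables" field in the POST body, as GraphQL HTTP servers commonly do; this has not been verified against GraphQL_API.py.

```python
import json

# Same fields as the query above, with the IDs lifted into variables
PREDICT_QUERY = """
query Predict($shopId: Int!, $itemId: Int!, $dateBlockNum: Int!) {
  predictSales(shopId: $shopId, itemId: $itemId, dateBlockNum: $dateBlockNum) {
    shopId
    itemId
    dateBlockNum
    prediction
  }
}
"""


def build_payload(shop_id: int, item_id: int, date_block_num: int) -> dict:
    """Build the JSON body for a GraphQL POST request."""
    return {
        "query": PREDICT_QUERY,
        "variables": {
            "shopId": shop_id,
            "itemId": item_id,
            "dateBlockNum": date_block_num,
        },
    }


def extract_prediction(body: dict) -> float:
    """Return the prediction from a decoded response, or raise if errors are present."""
    errors = body.get("errors")
    if errors:
        messages = "; ".join(
            str(e["message"]) if isinstance(e, dict) and "message" in e else str(e)
            for e in errors
        )
        raise RuntimeError(f"GraphQL errors: {messages}")
    return float(body["data"]["predictSales"]["prediction"])


# The response body printed above, used here as sample input
sample_body = {
    "data": {
        "predictSales": {
            "shopId": 0,
            "itemId": 30,
            "dateBlockNum": 1,
            "prediction": 3.4191040597239457,
        }
    }
}

print(json.dumps(build_payload(0, 30, 1)["variables"]))
print(extract_prediction(sample_body))
```

In the notebook this would be used as requests.post(url, json=build_payload(0, 30, 1)) followed by extract_prediction(response.json()). Checking "errors" matters because an invalid (shop, item, month) combination makes the resolver raise, and a GraphQL server reports that in the response body.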
