# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"> **Air Quality** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 04: Batch Inference</span>

## 🗒️ This notebook is divided into the following sections:

1. Download model and batch inference data
2. Make predictions, generate PNG for forecast
3. Store predictions in a monitoring feature group and generate PNG for hindcast

## <span style='color:#ff5f27'> 📝 Imports</span>

In [1]:
import sys
from pathlib import Path
import os

def is_google_colab() -> bool:
    # get_ipython() is injected by IPython/Jupyter at runtime
    return "google.colab" in str(get_ipython())

def clone_repository() -> None:
    !git clone https://github.com/featurestorebook/mlfs-book.git
    %cd mlfs-book

def install_dependencies() -> None:
    !pip install --upgrade uv
    !uv pip install --all-extras --system --requirement pyproject.toml


if is_google_colab():
    clone_repository()
    install_dependencies()
    root_dir = str(Path().absolute())
    print("Google Colab environment")
else:
    root_dir = Path().absolute()
    # Strip the trailing notebooks/airquality components so that root_dir points at the repository root
    if root_dir.parts[-1:] == ('airquality',):
        root_dir = Path(*root_dir.parts[:-1])
    if root_dir.parts[-1:] == ('notebooks',):
        root_dir = Path(*root_dir.parts[:-1])
    root_dir = str(root_dir) 
    print("Local environment")

# Add the root directory to the `PYTHONPATH` to use the `mlfs` Python module from the notebook.
if root_dir not in sys.path:
    sys.path.append(root_dir)
print(f"Added the following directory to the PYTHONPATH: {root_dir}")
    
# Read the API keys and configuration variables from the file <root_dir>/.env
from mlfs import config
if os.path.exists(f"{root_dir}/.env"):
    settings = config.HopsworksSettings(_env_file=f"{root_dir}/.env")

Local environment
Added the following directory to the PYTHONPATH: /home/sunnypawat/mlfs-book
HopsworksSettings initialized!


In [2]:
import datetime
import pandas as pd
from xgboost import XGBRegressor
import hopsworks
import json
from mlfs.airquality import util

In [3]:
today = datetime.datetime.now() - datetime.timedelta(0)
tomorrow = today + datetime.timedelta(days = 1)
today

datetime.datetime(2025, 11, 18, 17, 50, 9, 815988)
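
Note that `today` carries the time of day, while the weather rows printed later in this notebook are stamped at midnight UTC. A `>=` comparison against `today` therefore skips the row for the current day. The sketch below reproduces this with plain `datetime` objects. `daily_rows` is made-up sample data, and whether the current day should be included is a choice for the pipeline, not something this notebook specifies.

```python
import datetime

# Timestamp taken from the cell output above (time of day included).
now = datetime.datetime(2025, 11, 18, 17, 50, 9, 815988)

# Hypothetical daily rows, stamped at midnight like the weather feature group.
daily_rows = [
    datetime.datetime(2025, 11, 18) + datetime.timedelta(days=i)
    for i in range(3)
]

# Comparing against `now` drops the row for the current day.
from_now = [d for d in daily_rows if d >= now]

# Truncating to midnight first keeps it.
midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
from_midnight = [d for d in daily_rows if d >= midnight]

print(len(from_now), len(from_midnight))
```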

## <span style="color:#ff5f27;"> 📡 Connect to Hopsworks Feature Store </span>

In [4]:
project = hopsworks.login()
fs = project.get_feature_store() 

secrets = hopsworks.get_secrets_api()
location_str = secrets.get_secret("SENSOR_LOCATION_JSON").value
location = json.loads(location_str)
country=location['country']
city=location['city']
street=location['street']

2025-11-18 17:50:11,974 INFO: Initializing external client
2025-11-18 17:50:11,975 INFO: Base URL: https://c.app.hopsworks.ai:443
To ensure compatibility please install the latest bug fix release matching the minor version of your backend (4.2) by running 'pip install hopsworks==4.2.*'







2025-11-18 17:50:14,629 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1296534


## <span style="color:#ff5f27;">🪝 Download the model from Model Registry</span>

In [5]:
mr = project.get_model_registry()

# Task C: Retrieve the model 'air_quality_xgboost_model_lagged' from the model registry
retrieved_model = mr.get_model(
    name="air_quality_xgboost_model_lagged",
    version=1,
)

fv = retrieved_model.get_feature_view()

# Download the saved model artifacts to a local directory
saved_model_dir = retrieved_model.download()

2025-11-18 17:50:33,238 INFO: Initializing for batch retrieval of feature vectors


Downloading: 0.000%|          | 0/527904 elapsed<00:00 remaining<?

Downloading model artifact (0 dirs, 1 files)... 

Downloading: 0.000%|          | 0/115066 elapsed<00:00 remaining<?

Downloading model artifact (0 dirs, 2 files)... 

Downloading: 0.000%|          | 0/26687 elapsed<00:00 remaining<?

Downloading model artifact (1 dirs, 3 files)... DONE

In [8]:
# Load the XGBoost regressor from the downloaded model directory
retrieved_xgboost_model = XGBRegressor()

retrieved_xgboost_model.load_model(saved_model_dir + "/model.json")

# Displaying the retrieved XGBoost regressor model
retrieved_xgboost_model

## <span style="color:#ff5f27;">✨ Get Weather Forecast Features with Feature View</span>



In [27]:
weather_fg = fs.get_feature_group(
    name='weather',
    version=1,
)
batch_data = weather_fg.filter(weather_fg.date >= today).read()
batch_data

Finished: Reading data from Hopsworks, using Hopsworks Feature Query Service (0.97s) 


Unnamed: 0,date,temperature_2m_mean,precipitation_sum,wind_speed_10m_max,wind_direction_10m_dominant,city
0,2025-11-20 00:00:00+00:00,0.65,0.1,18.250259,22.011339,Stockholm
1,2025-11-19 00:00:00+00:00,-2.1,0.0,5.116561,320.710602,Stockholm
2,2025-11-21 00:00:00+00:00,1.2,0.0,16.965895,342.718414,Stockholm
3,2025-11-23 00:00:00+00:00,1.5,0.0,6.638072,229.398788,Stockholm
4,2025-11-22 00:00:00+00:00,-0.4,0.0,14.57889,212.905243,Stockholm
5,2025-11-24 00:00:00+00:00,2.8,0.0,11.269782,63.435013,Stockholm


In [28]:
### Task C: Get lagged features for batch inference
air_quality_fg = fs.get_feature_group(
    name='air_quality',
    version=1,
)

aq_history = air_quality_fg.read()[["date", "pm25"]]

# Merge weather with historical pm25
batch_data = batch_data.merge(aq_history, on="date", how="left")

# Sort by date (important)
batch_data = batch_data.sort_values("date")

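# Create lag columns. The forecast dates have no observed pm25 yet, so these
# lags are all NaN at this point; the recursive prediction cell further down fills them in.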
batch_data["pm25_lag_1"] = batch_data["pm25"].shift(1)
batch_data["pm25_lag_2"] = batch_data["pm25"].shift(2)
batch_data["pm25_lag_3"] = batch_data["pm25"].shift(3)

batch_data

Finished: Reading data from Hopsworks, using Hopsworks Feature Query Service (2.26s) 


Unnamed: 0,date,temperature_2m_mean,precipitation_sum,wind_speed_10m_max,wind_direction_10m_dominant,city,pm25,pm25_lag_1,pm25_lag_2,pm25_lag_3
1,2025-11-19 00:00:00+00:00,-2.1,0.0,5.116561,320.710602,Stockholm,,,,
0,2025-11-20 00:00:00+00:00,0.65,0.1,18.250259,22.011339,Stockholm,,,,
2,2025-11-21 00:00:00+00:00,1.2,0.0,16.965895,342.718414,Stockholm,,,,
4,2025-11-22 00:00:00+00:00,-0.4,0.0,14.57889,212.905243,Stockholm,,,,
3,2025-11-23 00:00:00+00:00,1.5,0.0,6.638072,229.398788,Stockholm,,,,
5,2025-11-24 00:00:00+00:00,2.8,0.0,11.269782,63.435013,Stockholm,,,,
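
Every lag in the table above is empty because the forecast dates have no observed `pm25` yet, so the left merge has nothing to match and `shift` only moves missing values around. One possible way to seed the first forecast row is to take its lags from the most recent observations instead. The sketch below shows the idea in plain Python. The `history` readings and the `seed_lags` helper are hypothetical and not part of `mlfs`.

```python
import datetime

def seed_lags(history, first_forecast_date, n_lags=3):
    """Return {lag: value} taken from the days just before first_forecast_date.

    `history` maps a date to an observed pm25 value. A lag is None when no
    observation exists for that day.
    """
    lags = {}
    for lag in range(1, n_lags + 1):
        day = first_forecast_date - datetime.timedelta(days=lag)
        lags[lag] = history.get(day)
    return lags

# Hypothetical observations; the real values live in the air_quality feature group.
history = {
    datetime.date(2025, 11, 16): 12.0,
    datetime.date(2025, 11, 17): 15.0,
    datetime.date(2025, 11, 18): 9.0,
}

lags = seed_lags(history, datetime.date(2025, 11, 19))
print(lags)
```

Later forecast dates cannot be seeded this way, since their lags fall on days that have not been observed either. That is the gap the recursive cell near the end of the notebook addresses.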


### <span style="color:#ff5f27;">🤖 Making the predictions</span>

In [10]:
# The model was trained on the weather features plus three pm25 lags,
# so all seven columns must be passed, in the training order.
feature_cols = ['temperature_2m_mean', 'precipitation_sum', 'wind_speed_10m_max',
                'wind_direction_10m_dominant', 'pm25_lag_1', 'pm25_lag_2', 'pm25_lag_3']
batch_data['predicted_pm25'] = retrieved_xgboost_model.predict(batch_data[feature_cols])
batch_data
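
XGBoost checks the input column names against the names the model was trained with, so it can help to compare the two before calling `predict`. The sketch below does that comparison on plain lists. The expected names are the four weather columns plus the three lags used in this notebook, and `check_features` is a hypothetical helper. On a fitted model the expected names are usually available from `get_booster().feature_names`.

```python
def check_features(expected, provided):
    """Return (missing, unexpected) column names, preserving input order."""
    missing = [name for name in expected if name not in provided]
    unexpected = [name for name in provided if name not in expected]
    return missing, unexpected

# Feature names the lagged model expects, in training order.
expected = [
    "temperature_2m_mean", "precipitation_sum", "wind_speed_10m_max",
    "wind_direction_10m_dominant", "pm25_lag_1", "pm25_lag_2", "pm25_lag_3",
]
# Passing only the weather columns leaves the three lags missing.
weather_only = expected[:4]

missing, unexpected = check_features(expected, weather_only)
print("missing:", missing)
print("unexpected:", unexpected)
```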

In [None]:
batch_data.info()

### <span style="color:#ff5f27;">🤖 Saving the predictions (for monitoring) to a Feature Group</span>

In [None]:
batch_data['street'] = street
batch_data['city'] = city
batch_data['country'] = country
# Sort by date first so that the horizon counter below lines up with the dates
batch_data = batch_data.sort_values(by=['date'])
# Number of days between the forecast base date and each forecast date (1 = first forecast day)
batch_data['days_before_forecast_day'] = range(1, len(batch_data)+1)
batch_data
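
`days_before_forecast_day` counts how far ahead each row was predicted: 1 for the first forecast date, 2 for the next, and so on. Assigning `range(...)` relies on the rows being sorted by date and on there being no gaps between dates. A variant that derives the value from the dates themselves is sketched below. `base_date` (taken here as the day before the first forecast date) and the helper name are assumptions for illustration.

```python
import datetime

def days_ahead(forecast_dates, base_date):
    """Map each forecast date to the number of days between it and base_date."""
    return {d: (d - base_date).days for d in forecast_dates}

base_date = datetime.date(2025, 11, 18)

# Deliberately unsorted, with one day (2025-11-21) missing.
forecast_dates = [
    datetime.date(2025, 11, 20),
    datetime.date(2025, 11, 19),
    datetime.date(2025, 11, 22),
]

horizon = days_ahead(forecast_dates, base_date)
print(horizon)
```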

In [None]:
batch_data.info()

### Create Forecast Graph
Draw a graph of the predictions against their dates, save it as a PNG in the GitHub repository, and show it on GitHub Pages.

In [None]:

pred_file_path = f"{root_dir}/docs/air-quality/assets/img/pm25_forecast.png"
plt = util.plot_air_quality_forecast(city, street, batch_data, pred_file_path)

plt.show()

In [None]:
# Get or create feature group
monitor_fg = fs.get_or_create_feature_group(
    name='aq_predictions',
    description='Air Quality prediction monitoring',
    version=1,
    primary_key=['city','street','date','days_before_forecast_day'],
    event_time="date"
)

In [None]:
monitor_fg.insert(batch_data, wait=True)

In [None]:
# Create a hindcast chart using only the forecasts made 1 day beforehand
monitoring_df = monitor_fg.filter(monitor_fg.days_before_forecast_day == 1).read()
monitoring_df

In [None]:
air_quality_fg = fs.get_feature_group(name='air_quality', version=1)
air_quality_df = air_quality_fg.read()
air_quality_df

In [None]:
outcome_df = air_quality_df[['date', 'pm25']]
preds_df =  monitoring_df[['date', 'predicted_pm25']]

hindcast_df = pd.merge(preds_df, outcome_df, on="date")
hindcast_df = hindcast_df.sort_values(by=['date'])

# If there are no outcomes for predictions yet, generate some predictions/outcomes from existing data
if len(hindcast_df) == 0:
    hindcast_df = util.backfill_predictions_for_monitoring(weather_fg, air_quality_df, monitor_fg, retrieved_xgboost_model)
hindcast_df

In [None]:
# --- Recursive 3-day lag preparation + recursive prediction ---
# This cell seeds lag columns from historical `aq_history` when available and then
# iterates the batch dates in chronological order, predicting day-by-day and
# using prior predictions to fill later-day lags.

import pandas as pd

# Ensure `aq_history`, `batch_data`, and `retrieved_xgboost_model` are present
try:
    _ = aq_history
    _ = batch_data
    model = retrieved_xgboost_model
except NameError as e:
    # Chain the original NameError so the missing variable is visible in the traceback
    raise RuntimeError("Expected variables `aq_history`, `batch_data`, and `retrieved_xgboost_model` to be available in the notebook kernel.") from e

# Normalize dates and sort
aq_history = aq_history.copy()
aq_history['date'] = pd.to_datetime(aq_history['date'])
aq_history = aq_history.sort_values('date')

batch_data = batch_data.copy()
batch_data['date'] = pd.to_datetime(batch_data['date'])
batch_data = batch_data.sort_values('date').reset_index(drop=True)

# Build a lookup for known/seeded pm25 values (historical)
pm25_lookup = {pd.Timestamp(d): float(v) for d, v in zip(aq_history['date'], aq_history['pm25'])}

# Feature columns expected by the model
weather_cols = [
    'temperature_2m_mean',
    'precipitation_sum',
    'wind_speed_10m_max',
    'wind_direction_10m_dominant',
]
lag_cols = ['pm25_lag_1', 'pm25_lag_2', 'pm25_lag_3']
model_feature_cols = weather_cols + lag_cols

# Ensure lag columns exist
for c in lag_cols + ['predicted_pm25']:
    if c not in batch_data.columns:
        batch_data[c] = pd.NA

# Use the last observed pm25 as fallback; default to 0.0 when no observation exists at all
observed_pm25 = aq_history['pm25'].dropna()
if len(observed_pm25) > 0:
    last_hist_pm25 = float(observed_pm25.iloc[-1])
else:
    last_hist_pm25 = 0.0

# Iterate unique batch dates in ascending order, predicting per-date
batch_dates = sorted(batch_data['date'].unique())

for d in batch_dates:
    mask = batch_data['date'] == d

    # Fill lagged values from lookup (which will be augmented with predictions as we go)
    for lag in (1, 2, 3):
        lookup_date = pd.Timestamp(d - pd.Timedelta(days=lag))
        val = pm25_lookup.get(lookup_date, None)
        if val is None:
            # leave NaN for now; will fill with fallback below
            batch_data.loc[mask, f'pm25_lag_{lag}'] = pd.NA
        else:
            batch_data.loc[mask, f'pm25_lag_{lag}'] = val

    # Prepare features for prediction; if weather columns are missing, raise a helpful error
    missing_weather = [c for c in weather_cols if c not in batch_data.columns]
    if missing_weather:
        raise RuntimeError(f"Missing required weather columns in `batch_data`: {missing_weather}")

    X_row = batch_data.loc[mask, model_feature_cols].copy()

    # Fill missing lag values with sensible fallback (last historical pm25)
    X_row[lag_cols] = X_row[lag_cols].fillna(last_hist_pm25)

    # If any remaining NaNs (shouldn't happen), fill with column mean
    X_row = X_row.fillna(X_row.mean(numeric_only=True))

    # Convert to numeric dtype for model
    X_row = X_row.astype(float)

    if X_row.shape[0] == 0:
        continue

    # Predict for all rows of this date
    preds = model.predict(X_row[model_feature_cols])

    # Assign predictions back
    batch_data.loc[mask, 'predicted_pm25'] = preds

    # Seed the lookup with the day's prediction for recursive use by later dates.
    # If multiple rows for the same date, use the mean as the seed value.
    seed_val = float(pd.Series(preds).mean())
    pm25_lookup[pd.Timestamp(d)] = seed_val

# Finalize: create batch_data_lagged for downstream steps
batch_data_lagged = batch_data.copy()

print('Recursive prediction completed.')
print('Batch rows:', len(batch_data_lagged))
print('Sample rows with predictions:')
print(batch_data_lagged.loc[batch_data_lagged['predicted_pm25'].notna()].head())
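
Stripped of the DataFrame handling, the recursive scheme in the cell above reduces to a short loop: look up the three previous values, predict, then store the prediction so later days can use it as a lag. The sketch below shows only that loop. `toy_model` is a stand-in that averages its lags, not the XGBoost model, and the seed values are made up.

```python
import datetime

def toy_model(lags):
    """Stand-in for model.predict: the average of the lag values."""
    return sum(lags) / len(lags)

def recursive_forecast(known, forecast_dates, fallback, predict=toy_model):
    """Predict each date in order, feeding earlier predictions back in as lags."""
    lookup = dict(known)  # copy so the caller's history is not modified
    predictions = {}
    for day in sorted(forecast_dates):
        # Lags 1..3; fall back to a constant when a day is unknown.
        lags = [
            lookup.get(day - datetime.timedelta(days=k), fallback)
            for k in (1, 2, 3)
        ]
        predictions[day] = predict(lags)
        lookup[day] = predictions[day]  # later dates see this as a lag
    return predictions

# Hypothetical observations for the three days before the forecast window.
known = {
    datetime.date(2025, 11, 16): 3.0,
    datetime.date(2025, 11, 17): 15.0,
    datetime.date(2025, 11, 18): 9.0,
}
dates = [datetime.date(2025, 11, 19), datetime.date(2025, 11, 20)]
preds = recursive_forecast(known, dates, fallback=9.0)
print(preds)
```

Because each prediction feeds the next, errors can compound over the forecast horizon, which is one reason the hindcast in this notebook only evaluates forecasts made one day ahead.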


### Plot the Hindcast comparing predicted with observed values (1-day prior forecast)

__This graph will be empty to begin with - this is normal.__

After a few days of predictions and observations, you will get data points in this graph.
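
The hindcast is an inner join between stored predictions and observed values, so a date appears only once both exist. Right after the first run every prediction is for a future date, which is why the join is empty. The sketch below mimics this with dictionaries. All values are invented and `hindcast` is a hypothetical helper.

```python
import datetime

def hindcast(predictions, observations):
    """Inner join on date: keep dates with both a prediction and an observation."""
    return {
        day: (predictions[day], observations[day])
        for day in sorted(predictions)
        if day in observations
    }

d = datetime.date
predictions = {d(2025, 11, 19): 10.5, d(2025, 11, 20): 11.0}

# Day of the first run: observations stop at 2025-11-18, so nothing overlaps.
observations = {d(2025, 11, 18): 9.0}
first_run = hindcast(predictions, observations)

# One day later the measurement for 2025-11-19 has arrived.
observations[d(2025, 11, 19)] = 12.0
next_day = hindcast(predictions, observations)

print(len(first_run), len(next_day))
```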

In [None]:
hindcast_file_path = f"{root_dir}/docs/air-quality/assets/img/pm25_hindcast_1day.png"
plt = util.plot_air_quality_forecast(city, street, hindcast_df, hindcast_file_path, hindcast=True)
plt.show()

### Upload the prediction and hindcast dashboards (png files) to Hopsworks


In [None]:
dataset_api = project.get_dataset_api()
str_today = today.strftime("%Y-%m-%d")
if not dataset_api.exists("Resources/airquality"):
    dataset_api.mkdir("Resources/airquality")
dataset_api.upload(pred_file_path, f"Resources/airquality/{city}_{street}_{str_today}", overwrite=True)
dataset_api.upload(hindcast_file_path, f"Resources/airquality/{city}_{street}_{str_today}", overwrite=True)

proj_url = project.get_url()
print(f"See images in Hopsworks here: {proj_url}/settings/fb/path/Resources/airquality")

---