# Day 2: Data Preprocessing for Tabular Model (Rossmann)

## 1. Objective
- Load and inspect the Rossmann dataset.
- Merge datasets and handle missing values.
- Identify categorical and continuous features.

## 2. Key Steps
- Load `train.csv` and `store.csv`
- Merge them on `Store`
- Inspect nulls, data types, and structure

In [None]:
# Import required packages
from fastai.tabular.all import *
import pandas as pd
import numpy as np
from pathlib import Path

# Set data path
path = Path('../data/rossmann')
assert path.exists(), "Dataset path not found."

# Load CSVs
train_df = pd.read_csv(path/'train.csv', low_memory=False)
store_df = pd.read_csv(path/'store.csv')

print("Files loaded")
train_df.shape, store_df.shape

In [None]:
# Merge train with store metadata on 'Store'
df = pd.merge(train_df, store_df, how='left', on='Store')

# Peek at combined data
df.head()

In [None]:
df.shape

## 3. Results
- Combined dataset has 1017209 rows and 18 columns.
- Next: clean date column, identify missing values, and define variable types.
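
The merged row count equals that of `train.csv`, which is what a left join onto a table with one row per `Store` should give. Below is a minimal sketch of how that expectation could be checked. It uses toy frames with invented values rather than the real files; `validate` and `indicator` are standard `pandas.merge` parameters.

```python
import pandas as pd

# Toy stand-ins (invented values): many rows per store vs. one row per store
train_toy = pd.DataFrame({"Store": [1, 1, 2, 3], "Sales": [5263, 6064, 8314, 0]})
store_toy = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

# validate="many_to_one" raises MergeError if Store is duplicated on the right side;
# indicator=True adds a _merge column recording where each row found a match
merged = pd.merge(train_toy, store_toy, how="left", on="Store",
                  validate="many_to_one", indicator=True)

# A left join on a unique key keeps exactly the left-hand row count
assert len(merged) == len(train_toy)

# Rows without store metadata are flagged as "left_only"
unmatched = merged.loc[merged["_merge"] == "left_only", "Store"].tolist()
print(unmatched)
```

On the real data the same two arguments could be passed to the `pd.merge` call above; an empty `unmatched` list would mean every training row found its store metadata.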

In [None]:
print(f"{'Column':<20} {'Type':<15} {'Example'}")
print("-" * 60)
for col in df.columns:
    dtype = df[col].dtype
    example = df[col].dropna().iloc[0] if df[col].notna().any() else "NaN"
    print(f"{col:<20} {str(dtype):<15} {str(example)}")

## 4. Preprocessing
We now convert the `Date` column to datetime, extract date features, and inspect missing values.
We'll also define which columns are categorical vs. continuous for use in fastai; the actual filling of missing values happens later, when the `procs` are applied.
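
To make the date-feature step concrete, here is a rough sketch that derives a few comparable columns with plain pandas `.dt` accessors. It is an approximation for illustration, not fastai's implementation, and the column names are chosen to mirror those mentioned in this notebook.

```python
import pandas as pd

dates = pd.DataFrame({"Date": pd.to_datetime(["2015-07-31", "2015-01-01"])})

# A handful of date parts derived with standard pandas accessors
parts = pd.DataFrame({
    "Year": dates["Date"].dt.year,
    "Month": dates["Date"].dt.month,
    "Week": dates["Date"].dt.isocalendar().week.astype("int64"),  # ISO week number
    "Day": dates["Date"].dt.day,
    "Dayofweek": dates["Date"].dt.dayofweek,  # pandas convention: Monday=0 ... Sunday=6
    "Is_month_end": dates["Date"].dt.is_month_end,
})
print(parts)
```

Expanding one timestamp into several columns lets a tabular model pick up weekly, monthly, and yearly patterns separately.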

In [None]:
# Convert Date column
df['Date'] = pd.to_datetime(df['Date'])

# Extract date features (fastai-style)
add_datepart(df, 'Date', drop=True)

# Look at missing values
na_counts = df.isna().sum()
na_counts[na_counts > 0]

In [None]:
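# Define the target plus categorical and continuous variables
dep_var = 'Sales'

# Note: 'DayOfWeek' (from train.csv) and 'Dayofweek' (from add_datepart) carry the same information
cat_names = ['Store', 'DayOfWeek', 'StateHoliday', 'SchoolHoliday', 'StoreType',
             'Assortment', 'Promo', 'Promo2', 'PromoInterval', 'Month', 'Day', 'Year', 'Week', 'Dayofweek']

# Note: 'Customers' is absent from Kaggle's test.csv; drop it if you plan to predict on that file
cont_names = ['Customers', 'Open', 'CompetitionDistance', 'CompetitionOpenSinceMonth',
              'CompetitionOpenSinceYear', 'Promo2SinceWeek', 'Promo2SinceYear']

# Preprocessing steps, applied in this order when the data object is built
procs = [Categorify, FillMissing, Normalize]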

## 5. Summary
- Date features extracted
- Missing values inspected; they will be filled by fastai's `FillMissing` once `procs` are applied
- Feature types split into categorical and continuous
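
As a rough illustration of the idea behind `FillMissing`, the sketch below fills a continuous column with the median of the training rows and adds a companion `_na` indicator column. It is a plain-pandas approximation with invented values and a hypothetical set of training indices, not fastai's own code.

```python
import numpy as np
import pandas as pd

# Toy column with gaps (values invented for illustration)
toy = pd.DataFrame({"CompetitionDistance": [1270.0, np.nan, 570.0, 14130.0, np.nan]})
train_rows = [0, 1, 2, 3]  # hypothetical training indices

col = "CompetitionDistance"

# Record which values were missing before anything is filled
toy[f"{col}_na"] = toy[col].isna()

# Fill with the median computed from the training rows only
median = toy.loc[train_rows, col].median()
toy[col] = toy[col].fillna(median)
print(toy)
```

Computing the fill value from the training rows alone keeps information from the validation rows out of the preprocessing statistics.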

## 6. Create TabularDataLoaders
We'll use fastai's `TabularDataLoaders` to:
- Apply preprocessing (categorify, fillmissing, normalize)
- Build training/validation splits
- Preview batches before model training
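
The categorify and normalize steps can be pictured with the following plain-pandas sketch. It only approximates the behaviour (integer codes with 0 reserved for missing or unseen categories, and z-scoring with training-split statistics); the toy values and the split indices are assumptions made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "StoreType": ["a", "c", None, "a"],
    "CompetitionDistance": [100.0, 300.0, 200.0, 400.0],
})
train_rows, valid_rows = [0, 1, 2], [3]  # hypothetical split

# Categorify-style: learn the vocabulary from training rows, then map to integer codes.
# pandas uses -1 for missing/unknown values, so adding 1 reserves code 0 for them.
categories = pd.Categorical(toy.loc[train_rows, "StoreType"].dropna()).categories
codes = pd.Categorical(toy["StoreType"], categories=categories).codes + 1

# Normalize-style: z-score using mean and standard deviation of the training rows only
mean = toy.loc[train_rows, "CompetitionDistance"].mean()
std = toy.loc[train_rows, "CompetitionDistance"].std(ddof=0)
scaled = (toy["CompetitionDistance"] - mean) / std

print(codes.tolist())
print(scaled.round(3).tolist())
```

Integer codes are what an embedding layer indexes into, while scaled continuous inputs tend to make neural network training better behaved.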

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

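# Random train/valid split (RandomSplitter holds out 20% by default)
splits = RandomSplitter(seed=42)(df)

# Create DataLoaders; from_df takes the validation indices via valid_idx, not splits
dls = TabularDataLoaders.from_df(
    df,
    procs=procs,
    cat_names=cat_names,
    cont_names=cont_names,
    y_names=dep_var,
    valid_idx=list(splits[1]),
    bs=64
)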

dls.show_batch(max_n=10)

In [None]:
# Save the first 100 rows of the merged dataframe (optional, for reference/debugging)
sample_path = path/'rossmann_sample.csv'
df.head(100).to_csv(sample_path, index=False)
print(f"Sample saved to {sample_path}")

In [None]:
# Optional: build the full TabularPandas object and pickle it for reuse in later notebooks
import pickle

to = TabularPandas(
    df,
    procs=procs,
    cat_names=cat_names,
    cont_names=cont_names,
    y_names=dep_var,
    splits=splits
)

with open('../data/rossmann_tabular.pkl', 'wb') as f:
    pickle.dump(to, f)

## 7. Results

- Loaded and merged `train.csv` and `store.csv` on `Store`
- Parsed `Date` column and extracted date-based features (`Year`, `Month`, `Week`, etc.)
- Handled missing values using fastai's `FillMissing` processor
- Defined `cat_names` and `cont_names` for model inputs
- Created `TabularDataLoaders` with randomized training/validation split
- Previewed a clean batch with categorical encoding and normalization applied
- Saved a 100-row sample of the merged dataframe as `rossmann_sample.csv`
- Pickled the full `TabularPandas` object as `rossmann_tabular.pkl`

## 8. Summary

- Preprocessing with fastai's `TabularDataLoaders` handled categorical encoding, normalization, and missing values in a single pipeline
- `add_datepart()` extracted useful temporal features from `Date` for tabular learning
- fastai's `FillMissing` adds companion `_na` indicator columns for continuous features with missing values, so imputed entries remain identifiable
- Defined structured lists of categorical and continuous variables for use in model training
- Dataset is now processed and ready for deep learning. Next step: build and train the model