# Imports

In [0]:
!pip install -r requirements.txt

In [0]:
import pandas as pd
import numpy as np
import os
import zipfile
import getpass

--- SECTION 1: KAGGLE SETUP FOR DATABRICKS ---

In [0]:
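os.environ['KAGGLE_USERNAME'] = "elenegabeskiria"
# Prompt for the API key instead of hard-coding it; a key stored in a notebook should be treated as leaked.
if 'KAGGLE_KEY' not in os.environ:
    os.environ['KAGGLE_KEY'] = getpass.getpass("Kaggle API key: ")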

DATA_DIR = '/dbfs/FileStore/walmart_project/data/raw'
COMPETITION_NAME = 'walmart-recruiting-store-sales-forecasting'

# exist_ok=True already makes this a no-op when the directory exists
os.makedirs(DATA_DIR, exist_ok=True)

if not os.path.exists(os.path.join(DATA_DIR, 'train.csv')):
    print("Raw data not found. Downloading from Kaggle...")
    from kaggle.api.kaggle_api_extended import KaggleApi
    
    api = KaggleApi()
    api.authenticate()
    
    api.competition_download_files(COMPETITION_NAME, path=DATA_DIR, quiet=True)

    master_zip_path = os.path.join(DATA_DIR, f'{COMPETITION_NAME}.zip')
    with zipfile.ZipFile(master_zip_path, 'r') as z:
        z.extractall(DATA_DIR)
    for item in ['train.csv.zip', 'test.csv.zip', 'features.csv.zip']:
        with zipfile.ZipFile(os.path.join(DATA_DIR, item), 'r') as z:
            z.extractall(DATA_DIR)
    print("Data successfully downloaded and unzipped to DBFS.")
else:
    print("Raw data already exists in DBFS. Skipping download.")

# LOAD, MERGE, AND PROCESS DATA

In [0]:
# Load raw data from DBFS
train_df = pd.read_csv(os.path.join(DATA_DIR, 'train.csv'))
features_df = pd.read_csv(os.path.join(DATA_DIR, 'features.csv'))
stores_df = pd.read_csv(os.path.join(DATA_DIR, 'stores.csv'))
test_df = pd.read_csv(os.path.join(DATA_DIR, 'test.csv'))

The Training Set (Feb 2010 - Jan 2012)

Action: You will train your forecasting model only on this 2-year chunk of data (104 weeks).

The Validation Set (Feb 2012 - Oct 2012)

Purpose: To give your model a "practice exam" and get a reliable score.

Action: After training on the training set, you will use the model to forecast the 39 weeks of the validation set. You then compare your forecast to the actual sales in this period. The error you calculate here is your best estimate of your future Kaggle leaderboard score.

The Competition Test Set (The real test.csv data)

Purpose: To generate your final submission file for Kaggle.

Action: Once you are happy with your model's performance on the validation set, you do one final step: You retrain your model on the entire original train.csv data (all 143 weeks). By using all the data available, you create the most powerful version of your model. You then use this final, fully-trained model to predict the sales for the competition test set.
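The "error you calculate" on the validation set could be computed along the following lines. This is a minimal sketch, not the notebook's own scoring code: it assumes the competition's weighted mean absolute error (WMAE), in which holiday weeks carry a weight of 5 and all other weeks a weight of 1. The function name `wmae` and the toy numbers are illustrative assumptions.

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Weighted mean absolute error; holiday weeks count holiday_weight times."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Weight is holiday_weight for holiday weeks and 1 for every other week
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))

# Toy example (made-up numbers): three weeks, the last one a holiday week
example_score = wmae(
    y_true=[100.0, 200.0, 300.0],
    y_pred=[110.0, 190.0, 330.0],
    is_holiday=[False, False, True],
)
print(example_score)
```

On the real data you would pass the validation set's `Weekly_Sales` and `IsHoliday` columns together with your model's forecasts.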

In [0]:
train_df.head()

In [0]:
features_df.head()

In [0]:
stores_df.head()

In [0]:
# Merge data
df = train_df.merge(features_df, on=['Store', 'Date', 'IsHoliday'], how='left')
df = df.merge(stores_df, on='Store', how='left')
df['Date'] = pd.to_datetime(df['Date'])

In [0]:
# --- 2. Split the Training Data into Train and Validation Sets ---
from datetime import timedelta

# The competition test set is 39 weeks long.
# We will create a validation set of the same length from the end of our training data.
validation_length_weeks = 39

# Find the last date in the training data
last_train_date = df['Date'].max()

# Calculate the split date
split_date = last_train_date - timedelta(weeks=validation_length_weeks)

# Split the data
train_data = df[df['Date'] <= split_date].copy()
validation_data = df[df['Date'] > split_date].copy()

print(f"Data split into training and validation sets at: {split_date.date()}")
print(f"Training set shape: {train_data.shape}")
print(f"Validation set shape: {validation_data.shape}")

In [0]:
df.head()

In [0]:
PROCESSED_DIR = '/dbfs/FileStore/walmart_project/data/processed'
os.makedirs(PROCESSED_DIR, exist_ok=True)

from src.preprocessing import advanced_feature_engineering

train_processed = advanced_feature_engineering(train_data)
train_processed.to_csv(os.path.join(PROCESSED_DIR, 'train_final.csv'), index=False)


validation_processed = advanced_feature_engineering(validation_data)
validation_processed.to_csv(os.path.join(PROCESSED_DIR, 'validation_final.csv'), index=False)

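raw_test_data = test_df.merge(features_df, on=['Store', 'Date', 'IsHoliday'], how='left')
raw_test_data = raw_test_data.merge(stores_df, on='Store', how='left')
# Parse dates exactly as for the training frame so feature engineering sees the same dtype
raw_test_data['Date'] = pd.to_datetime(raw_test_data['Date'])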
test_processed = advanced_feature_engineering(raw_test_data)
test_processed.to_csv(os.path.join(PROCESSED_DIR, 'test_final.csv'), index=False)

In [0]:
# Rolling feature columns
# Reason: These missing values are expected. They appear at the beginning of each
# time series (for each store/department) because there isn't enough historical data
# to fill the rolling window. For example, sales_roll_mean_4 is empty for the first
# three weeks of data for each group.
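The effect described above can be reproduced on a toy frame. This is only a sketch: the actual implementation lives in `src.preprocessing.advanced_feature_engineering`, which is not shown here, so the grouped rolling mean below is an assumption about how a column like `sales_roll_mean_4` might be built.

```python
import pandas as pd

# Toy data: two store/department series of five weeks each (made-up values)
toy = pd.DataFrame({
    "Store": [1] * 5 + [2] * 5,
    "Dept": [1] * 10,
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0, 50.0, 5.0, 5.0, 5.0, 5.0, 5.0],
})

# A 4-week rolling mean computed separately per series; by default the window
# must be full, so the first three rows of every group come out as NaN
toy["sales_roll_mean_4"] = (
    toy.groupby(["Store", "Dept"])["Weekly_Sales"]
       .transform(lambda s: s.rolling(window=4).mean())
)

# Count the missing values per series
missing_per_group = (
    toy["sales_roll_mean_4"].isna().groupby([toy["Store"], toy["Dept"]]).sum()
)
print(toy)
print(missing_per_group)
```

Each group contributes exactly three missing values, which is why the gaps scale with the number of store/department combinations rather than with the number of rows.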

In [0]:
# Check seasonality
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf, month_plot, quarter_plot


In [0]:
import matplotlib.pyplot as plt
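# The raw frame interleaves every store/department series, so aggregate to
# one total per week before decomposing with a 52-week period
weekly_sales = train_data.groupby('Date')['Weekly_Sales'].sum()
decomposition = seasonal_decompose(weekly_sales, model='additive', period=52)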
fig = decomposition.plot()
fig.set_size_inches(14, 8)
plt.show()

In [0]:
import matplotlib.pyplot as plt
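# period=12 only makes sense for monthly data, so aggregate to one total per week
# and then average those totals within each calendar month
monthly_sales = train_data.groupby('Date')['Weekly_Sales'].sum().resample('M').mean()
decomposition = seasonal_decompose(monthly_sales, model='add', period=12)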
fig = decomposition.plot()
fig.set_size_inches(14, 8)
plt.show()

For Weekly_Sales data, autocorrelation measures how strongly this week's sales are correlated with the sales of previous weeks.

Identifying Momentum: A significant correlation at lag 1 means that high sales one week are likely followed by high sales the next week (and vice versa). This is a strong indicator of sales trends or momentum.

Finding Seasonality: A significant correlation at lag 52 is a dead giveaway for yearly seasonality. It means the sales in a given week are strongly correlated with the sales from the same week last year. This is a critical pattern for your forecasting model to learn.
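To make the idea of a lag concrete, here is a small sketch on synthetic data (not the Walmart series): a weekly series with a built-in yearly cycle is compared with itself shifted by 52 and by 26 weeks. The series, its parameters, and the variable names are assumptions for illustration. `Series.autocorr` is the plain Pearson correlation between the series and its shifted copy, which is close to, but not identical to, the estimator used by `plot_acf`.

```python
import numpy as np
import pandas as pd

# Synthetic weekly sales: three years with a 52-week cycle plus noise
rng = np.random.default_rng(42)
n_weeks = 156
t = np.arange(n_weeks)
toy_sales = pd.Series(100 + 10 * np.sin(2 * np.pi * t / 52) + rng.normal(0, 2, n_weeks))

# Correlation with the same week one year earlier (lag 52)
lag_52 = toy_sales.autocorr(lag=52)
# Correlation with the week half a year earlier (lag 26), where the cycle is inverted
lag_26 = toy_sales.autocorr(lag=26)

print(f"lag 52: {lag_52:.2f}, lag 26: {lag_26:.2f}")
```

A strongly positive value at lag 52 together with a negative value at lag 26 is the pattern a yearly cycle produces.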

1. Strong Trend / Momentum (The First Few Lags)

What you see: The first few bars (lags 1, 2, 3, etc.) are very tall and extend far beyond the light blue shaded area.

What it means: This indicates that the sales in any given week are highly correlated with the sales from the past few weeks. In simple terms, if sales were high last week, they are likely to be high this week. This shows a strong, positive short-term trend or "momentum" in the sales data.

2. Clear Yearly Seasonality (The Spike at Lag 52)

What you see: Look all the way to the right of the plot. There is a very clear, significant spike at lag 52.

What it means: This is the most important finding. It tells you that the sales in a given week are strongly and positively correlated with the sales from the same week last year (since there are 52 weeks in a year). For example, sales during the week of Christmas this year are very similar to sales during the week of Christmas last year. This is a classic sign of yearly seasonality.

What This Means for Your Project
This plot gives you a clear roadmap for how to build your forecasting model:

Because of the short-term momentum, your model needs to look at recent past values. This is the "AR" (AutoRegressive) part of a model like ARIMA.

Because of the yearly seasonality, your model must account for this repeating annual pattern. This is the "S" (Seasonal) part of a model like SARIMA.

Based on this plot, a SARIMA (Seasonal AutoRegressive Integrated Moving Average) model is a reasonable candidate for forecasting the sales of this specific store and department.

In [0]:
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

store_id_to_analyze = 13
dept_id_to_analyze = 7

# Filter for the specific store and department
ts_df = train_data[(train_data['Store'] == store_id_to_analyze) & (train_data['Dept'] == dept_id_to_analyze)].copy()

ts_df = ts_df.set_index('Date')

# Now, select the 'Weekly_Sales' column (which now has a DatetimeIndex)
ts_data = ts_df['Weekly_Sales']

# Enforce a regular weekly (Friday) frequency; weeks with no row become NaN
ts_data = ts_data.asfreq('W-FRI')

# Forward-fill any missing weeks
ts_data = ts_data.ffill()

# Check if the series is empty before plotting
if ts_data.empty:
    print(f"No data available for Store {store_id_to_analyze}, Dept {dept_id_to_analyze} in the training set.")
else:
    # Plot the ACF for the first 60 weekly lags
    fig, ax = plt.subplots(figsize=(14, 7))
    plot_acf(ts_data, lags=60, ax=ax)

    plt.title(f'Autocorrelation Function (ACF) for Store {store_id_to_analyze}, Dept {dept_id_to_analyze}', fontsize=16)
    plt.xlabel('Lag (Number of Weeks)', fontsize=12)
    plt.ylabel('Autocorrelation', fontsize=12)
    plt.grid(True)
    plt.show()

In [0]:
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf

store_id_to_analyze = 4
dept_id_to_analyze = 1

# Filter for the specific store and department
ts_df = train_data[(train_data['Store'] == store_id_to_analyze) & (train_data['Dept'] == dept_id_to_analyze)].copy()

# Set the 'Date' column as the index
ts_df = ts_df.set_index('Date')

# Select the 'Weekly_Sales' column
ts_data = ts_df['Weekly_Sales']

# Ensure a consistent weekly frequency
ts_data = ts_data.asfreq('W-FRI')

# Forward-fill any missing weeks
ts_data = ts_data.ffill()

# plot_pacf needs fewer lags than half the sample size (104 / 2 = 52),
# so request 50 lags and check that the series is long enough.
if not ts_data.empty and len(ts_data) > 2 * 50:
    fig, ax = plt.subplots(figsize=(14, 7))
    plot_pacf(ts_data, lags=50, ax=ax)

    plt.title(f'Partial Autocorrelation Function (PACF) for Store {store_id_to_analyze}, Dept {dept_id_to_analyze}', fontsize=16)
    plt.xlabel('Lag (Number of Weeks)', fontsize=12)
    plt.ylabel('Partial Autocorrelation', fontsize=12)
    plt.grid(True)
    plt.show()
else:
    print(f"Not enough data to plot PACF for Store {store_id_to_analyze}, Dept {dept_id_to_analyze}.")

1. Significant Spike at Lag 1

What you see: The bar at Lag 1 is very tall and positive, extending far beyond the light blue shaded area.

What it means: This shows a strong, direct relationship between this week's sales and last week's sales. At lag 1 there are no intermediate weeks to control for, so the partial and plain autocorrelations coincide, and last week's sales figure is the strongest single predictor of this week's. This is a clear signal for the "AR" (AutoRegressive) part of your model.

2. Significant Spike at Lag 5

What you see: There is a significant negative spike at Lag 5.

What it means: After accounting for the sales of the last four weeks, the sales from five weeks ago still have a direct negative relationship with this week's sales. This could be due to a specific sales cycle, like a big promotion every month that pulls sales forward, leading to a dip later on.

3. Significant Spikes Around Lag 52

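What you see: There are a couple of significant spikes at the right edge of the plot, just short of Lag 52 (one positive, one negative). The plot stops at lag 50 because the PACF can only be estimated for fewer lags than half the sample size.

What it means: This is consistent with the yearly seasonality we saw in the ACF plot. It tells you that even after accounting for all the sales in between, the sales from roughly the same time last year still have a direct, significant relationship with this week's sales.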

What This Means for Your Project
This PACF plot, combined with the ACF plot, gives you a strong starting point for building a SARIMA model:

The 'p' (AR) Parameter: The sharp cutoff after lag 1 in the PACF plot suggests that a p value of 1 is a good starting point for your model.

The 'P' (Seasonal AR) Parameter: The significant spikes around lag 52 suggest you need a seasonal AR component. A P value of 1 would be a reasonable place to start.

In summary, this plot tells you that to predict sales for this department, you should primarily look at last week's sales and the sales from this time last year.

In [0]:
train_data = train_data.set_index('Date')
month_plot(train_data['Weekly_Sales'].resample('M').mean(), ylabel='Weekly_Sales')

In [0]:
quarter_plot(train_data['Weekly_Sales'].resample('Q').mean(), ylabel='Weekly_Sales')