## Data Exploration 

**Objective:** Load and inspect the Walmart time-series data from the M5 competition.
**Dataset:** The M5 Walmart dataset comes from the Kaggle competition “M5 Forecasting – Accuracy”, which focuses on hierarchical time-series forecasting in a large-scale retail setting. The data represent daily unit sales for Walmart products across the United States over several years, enriched with calendar events and pricing information.

The dataset is designed to reflect real-world forecasting challenges, including:
- thousands of related time series
- multiple aggregation levels
- promotions and holidays
- non-stationary demand patterns
- strong cross-sectional heterogeneity
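
To make "multiple aggregation levels" concrete, the sketch below sums a few item-store series up to store, category, and total level. The `toy_sales` frame and its values are made up for illustration; only the column names (`item_id`, `cat_id`, `store_id`, `d_*`) follow the real sales table.

```python
import pandas as pd

# Toy stand-in for the wide sales table: one row per item-store series,
# one column per day. All values are invented for illustration.
toy_sales = pd.DataFrame({
    "item_id":  ["FOODS_1_001", "FOODS_1_001", "HOBBIES_1_001", "HOBBIES_1_001"],
    "cat_id":   ["FOODS", "FOODS", "HOBBIES", "HOBBIES"],
    "store_id": ["CA_1", "TX_1", "CA_1", "TX_1"],
    "d_1": [3, 0, 1, 0],
    "d_2": [2, 1, 0, 0],
    "d_3": [4, 2, 0, 1],
})
day_cols = [c for c in toy_sales.columns if c.startswith("d_")]

# Each aggregation level is a group-by sum over the bottom-level series.
by_store = toy_sales.groupby("store_id")[day_cols].sum()
by_cat = toy_sales.groupby("cat_id")[day_cols].sum()
total = toy_sales[day_cols].sum()

print(by_store)
print(by_cat)
print(total)
```

Because every higher level is a plain sum of the item-store rows, forecasts made at the bottom level can be aggregated the same way.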

**Dataset Objective:** The objective of the competition is to forecast the next 28 days of daily unit sales for every item–store combination. Forecasts are scored with the Weighted Root Mean Squared Scaled Error (WRMSSE), which evaluates them at every level of the hierarchy, from individual item–store series up to total sales.
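
As a rough illustration of the metric, the sketch below computes the scaled error for one series and a weighted average over several. The function names `rmsse` and `wrmsse` are hypothetical helpers, not part of any library. The sketch is simplified: it scales by the whole training history and takes the weights as given, whereas the official evaluation, as far as the competition guide describes it, starts each series at its first non-zero sale and derives weights from recent dollar sales.

```python
import numpy as np

def rmsse(train, actual, forecast):
    """Root Mean Squared Scaled Error for a single series (simplified)."""
    train = np.asarray(train, dtype=float)
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    # Scale: mean squared error of the one-step naive forecast on the history.
    scale = np.mean(np.diff(train) ** 2)
    if scale == 0:
        return np.nan  # constant history: the scaled error is undefined
    return float(np.sqrt(np.mean((actual - forecast) ** 2) / scale))

def wrmsse(errors, weights):
    """Weighted average of per-series RMSSE values (weights are normalized)."""
    errors = np.asarray(errors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights / weights.sum() * errors))

# Made-up numbers, for illustration only.
score = rmsse(train=[1, 3, 2, 4], actual=[2, 4], forecast=[3, 1])
combined = wrmsse(errors=[1.0, 2.0], weights=[3, 1])
print(score, combined)
```

An RMSSE below 1 means the forecast beats the in-sample one-step naive forecast on a squared-error basis.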

In [23]:
# Import Libraries
import pandas as pd
import zipfile
import os
from sklearn.model_selection import train_test_split

# Define file paths
data_dir = '../data/'
zip_file_path = os.path.join(data_dir, 'raw', 'm5-forecasting-accuracy.zip')
extracted_dir = os.path.join(data_dir, 'raw', 'dataset')
preprocessed_dir = os.path.join(data_dir, 'proc')
os.makedirs(extracted_dir, exist_ok=True)
os.makedirs(preprocessed_dir, exist_ok=True)

# Function to unzip files
def zipextract(zip_path, extract_to):
    """Extract every member of the archive at zip_path into extract_to."""
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_to)

# Unzip dataset
zipextract(zip_file_path, extracted_dir)
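
Re-running the cell above extracts the full archive every time. One possible variant, sketched here under the assumption that files already on disk are complete, extracts only the members that are missing. `zipextract_once` is a hypothetical name, not a function used elsewhere in this notebook.

```python
import os
import zipfile

def zipextract_once(zip_path, extract_to):
    """Extract only the archive members that are not already on disk.

    Returns the list of member names that were extracted.
    """
    os.makedirs(extract_to, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        missing = [
            name for name in zip_ref.namelist()
            if not os.path.exists(os.path.join(extract_to, name))
        ]
        # An empty `members` list extracts nothing.
        zip_ref.extractall(extract_to, members=missing)
    return missing

# Usage with the paths defined above (left commented out here):
# zipextract_once(zip_file_path, extracted_dir)
```

The check is by file name only, so a partially written file from an interrupted run would be skipped; delete the target directory to force a clean extraction.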

In [24]:
# Load the CSV files with pandas
sales = pd.read_csv(os.path.join(extracted_dir, 'sales_train_evaluation.csv'))
calendar = pd.read_csv(os.path.join(extracted_dir, 'calendar.csv'))
prices = pd.read_csv(os.path.join(extracted_dir, 'sell_prices.csv'))


In [25]:
# Sales: check shape, first rows, nulls and datatypes
print("Sales Data Shape:", sales.shape)
print(sales.head())
print("Nulls in Sales Data:\n", sales.isnull().sum())
print("Sales Data Types:\n", sales.dtypes)

Sales Data Shape: (30490, 1947)
                              id        item_id  ... d_1940 d_1941
0  HOBBIES_1_001_CA_1_evaluation  HOBBIES_1_001  ...      0      1
1  HOBBIES_1_002_CA_1_evaluation  HOBBIES_1_002  ...      0      0
2  HOBBIES_1_003_CA_1_evaluation  HOBBIES_1_003  ...      0      1
3  HOBBIES_1_004_CA_1_evaluation  HOBBIES_1_004  ...      2      6
4  HOBBIES_1_005_CA_1_evaluation  HOBBIES_1_005  ...      1      0

[5 rows x 1947 columns]
Nulls in Sales Data:
 id          0
item_id     0
dept_id     0
cat_id      0
store_id    0
           ..
d_1937      0
d_1938      0
d_1939      0
d_1940      0
d_1941      0
Length: 1947, dtype: int64
Sales Data Types:
 id          object
item_id     object
dept_id     object
cat_id      object
store_id    object
             ...  
d_1937       int64
d_1938       int64
d_1939       int64
d_1940       int64
d_1941       int64
Length: 1947, dtype: object
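
The output above shows a wide table: one row per series and one `int64` column per day. A common next step, sketched here on toy data rather than the real frames, is to downcast the day columns, reshape to one row per series and day, and attach dates through the `d` key that the calendar also carries. The frames `toy_sales` and `toy_calendar` and the column name `units` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins that mimic the column layout of `sales` and `calendar`.
toy_sales = pd.DataFrame({
    "id": ["HOBBIES_1_001_CA_1_evaluation", "HOBBIES_1_002_CA_1_evaluation"],
    "item_id": ["HOBBIES_1_001", "HOBBIES_1_002"],
    "store_id": ["CA_1", "CA_1"],
    "d_1": [0, 2],
    "d_2": [1, 0],
})
toy_calendar = pd.DataFrame({
    "d": ["d_1", "d_2"],
    "date": ["2011-01-29", "2011-01-30"],
})

id_cols = ["id", "item_id", "store_id"]
day_cols = [c for c in toy_sales.columns if c.startswith("d_")]

# Daily unit counts are non-negative integers, so let pandas pick the
# smallest unsigned integer dtype that fits each column.
downcast = toy_sales[day_cols].apply(pd.to_numeric, downcast="unsigned")
toy_sales = pd.concat([toy_sales[id_cols], downcast], axis=1)

# Wide -> long: one row per (series, day).
long_sales = toy_sales.melt(id_vars=id_cols, var_name="d", value_name="units")

# Attach the calendar date through the shared `d` key.
long_sales = long_sales.merge(toy_calendar, on="d", how="left")
long_sales["date"] = pd.to_datetime(long_sales["date"])
print(long_sales)
```

On the full table the long format has 30490 × 1941 rows, so downcasting before the reshape helps keep memory use manageable.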


In [26]:
# Calendar: check shape, first rows, nulls and datatypes
print("Calendar Data Shape:", calendar.shape)
print(calendar.head())
print("Nulls in calendar Data:\n", calendar.isnull().sum())
print("Calendar Data Types:\n", calendar.dtypes)

Calendar Data Shape: (1969, 14)
         date  wm_yr_wk    weekday  ...  snap_CA  snap_TX  snap_WI
0  2011-01-29     11101   Saturday  ...        0        0        0
1  2011-01-30     11101     Sunday  ...        0        0        0
2  2011-01-31     11101     Monday  ...        0        0        0
3  2011-02-01     11101    Tuesday  ...        1        1        0
4  2011-02-02     11101  Wednesday  ...        1        0        1

[5 rows x 14 columns]
Nulls in calendar Data:
 date               0
wm_yr_wk           0
weekday            0
wday               0
month              0
year               0
d                  0
event_name_1    1807
event_type_1    1807
event_name_2    1964
event_type_2    1964
snap_CA            0
snap_TX            0
snap_WI            0
dtype: int64
Calendar Data Types:
 date            object
wm_yr_wk         int64
weekday         object
wday             int64
month            int64
year             int64
d               object
event_name_1    object
event