<a href="https://colab.research.google.com/github/davidrimon2004/DEPI_project/blob/calendar/project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Formulation

## Problem Description
This project aims to develop a machine learning model that predicts daily sales
and demand for the next 28 days, using historical sales data and external
factors such as product details, promotions, seasonality, holidays, and economic
indicators. The goal is to analyse historical patterns and generate reliable
forecasts that help businesses make data-driven decisions that reduce costs,
increase efficiency, and improve customer satisfaction.

## Objectives

● Collect and preprocess historical sales and demand data.

● Identify key features that influence sales trends.

● Build, train, and optimise forecasting models to predict future sales and
demand.

● Deploy the best-performing model to generate forecasts in real time or in
batches.

## Data source
Hierarchical sales data from Walmart, the world’s largest company by revenue, covering stores in three US states (CA, TX, and WI).



# Setup for Google Drive

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Code setup

## Important libraries

In [2]:
import pandas as pd

## Data reading

In [None]:
sales_validation=pd.read_csv('/content/drive/MyDrive/DEPI_project/sales_train_validation.csv')
sales_validation.head(3)

Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,d_1,d_2,d_3,d_4,...,d_1904,d_1905,d_1906,d_1907,d_1908,d_1909,d_1910,d_1911,d_1912,d_1913
0,HOBBIES_1_001_CA_1_validation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,1,3,0,1,1,1,3,0,1,1
1,HOBBIES_1_002_CA_1_validation,HOBBIES_1_002,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,HOBBIES_1_003_CA_1_validation,HOBBIES_1_003,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,2,1,2,1,1,1,0,1,1,1


In [7]:
cal=pd.read_csv('/content/drive/MyDrive/DEPI_project/calendar.csv')
cal.head(3)

Unnamed: 0,date,wm_yr_wk,weekday,wday,month,year,d,event_name_1,event_type_1,event_name_2,event_type_2,snap_CA,snap_TX,snap_WI
0,2011-01-29,11101,Saturday,1,1,2011,d_1,,,,,0,0,0
1,2011-01-30,11101,Sunday,2,1,2011,d_2,,,,,0,0,0
2,2011-01-31,11101,Monday,3,1,2011,d_3,,,,,0,0,0


In [6]:
final_data = pd.read_csv('/content/drive/MyDrive/DEPI_project/data.csv')
final_data.head(3)

Unnamed: 0,item_id,dept_id,cat_id,store_id,state_id,d,sales
0,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0
1,HOBBIES_1_002,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0
2,HOBBIES_1_003,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0


# EDA

In [None]:
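# `data` is only created later, in the schema formatting section, so inspect
# the long-format table loaded above instead.
final_data.head(20)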

In [None]:
cal.info()

In [None]:
cal.head(200)
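
Beyond `head()` and `info()`, a per-column count of missing values shows which calendar columns need filling before the merge. The snippet below is a minimal sketch on a toy frame; `toy_cal` and the event name `"ExampleEvent"` are made up for illustration and are not taken from the real calendar file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the calendar table (hypothetical values).
toy_cal = pd.DataFrame({
    "d": ["d_1", "d_2", "d_3"],
    "event_name_1": [np.nan, "ExampleEvent", np.nan],
    "event_name_2": [np.nan, np.nan, np.nan],
    "snap_CA": [0, 1, 1],
})

# Number of missing values in each column.
missing = toy_cal.isna().sum()

# Fraction of days that carry a primary event.
event_share = toy_cal["event_name_1"].notna().mean()

print(missing)
print(f"share of days with an event: {event_share:.2f}")
```

On the real data the same two lines would be applied to `cal` directly.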

# Schema formatting

## sales_validation file

In [None]:
data = sales_validation.melt(
    id_vars=["id","item_id", "dept_id","cat_id","store_id","state_id"],  # columns to keep
    var_name="d",                          # new column name for day labels (d_1, d_2, ...)
    value_name="sales"                     # new column name for sales values
)



In [None]:
data.drop(columns=['id'],inplace=True)
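
To make the wide-to-long reshape concrete, here is a small sketch on a toy table. The frame `wide`, its values, and the extra `day_num` column are assumptions added for illustration; the notebook itself only keeps the string label `d`.

```python
import pandas as pd

# Toy wide table: one row per item/store, one column per day (hypothetical values).
wide = pd.DataFrame({
    "item_id": ["A", "B"],
    "store_id": ["CA_1", "CA_1"],
    "d_1": [0, 2],
    "d_2": [1, 0],
    "d_3": [3, 5],
})

# Reshape to long format: one row per (item, store, day).
long = wide.melt(id_vars=["item_id", "store_id"], var_name="d", value_name="sales")

# Extract the integer day index so rows sort chronologically
# (as strings, "d_10" would sort before "d_2").
long["day_num"] = long["d"].str.split("_").str[1].astype(int)
long = long.sort_values(["item_id", "day_num"]).reset_index(drop=True)

print(long)
```

Each wide row becomes one long row per day column, so the long table has (number of rows) × (number of day columns) rows.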

## calendar file

In [8]:
cal["date"]= pd.to_datetime(cal["date"])
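
Once `date` is a real datetime column, it becomes easy to sanity-check the calendar. The sketch below uses the three rows shown in the `cal.head(3)` output above; the check names `is_daily` and `weekday_ok` are hypothetical helpers, not part of the original notebook.

```python
import pandas as pd

# The first three calendar rows, as displayed earlier in the notebook.
toy_cal = pd.DataFrame({
    "date": ["2011-01-29", "2011-01-30", "2011-01-31"],
    "weekday": ["Saturday", "Sunday", "Monday"],
    "d": ["d_1", "d_2", "d_3"],
})
toy_cal["date"] = pd.to_datetime(toy_cal["date"])

# Consecutive rows should be exactly one day apart.
gaps = toy_cal["date"].diff().dropna()
is_daily = bool((gaps == pd.Timedelta(days=1)).all())

# The stored weekday name should agree with the parsed date.
weekday_ok = bool((toy_cal["date"].dt.day_name() == toy_cal["weekday"]).all())

print(is_daily, weekday_ok)
```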

In [9]:
# Days without an event have NaN in the event columns; label them explicitly.
event_cols = ["event_name_1", "event_type_1", "event_name_2", "event_type_2"]
cal[event_cols] = cal[event_cols].fillna("No event")

In [None]:
import numpy as np

# Merge the long-format sales table (final_data) with the calendar on the day label 'd'
merged_data = pd.merge(final_data, cal, on='d', how='left')

conditions = [
    merged_data["state_id"] == "CA",
    merged_data["state_id"] == "TX",
    merged_data["state_id"] == "WI"
]
choices= [
    merged_data["snap_CA"],
    merged_data["snap_TX"],
    merged_data["snap_WI"]
]
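# For each row, pick the SNAP flag of the row's own state (0 if no condition matches)
merged_data["snap"]= np.select(conditions, choices)

# The per-state flags are now redundant
merged_data.drop(columns=['snap_CA','snap_TX','snap_WI'],inplace=True)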
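
The per-state selection can be checked on a toy frame. This is a minimal sketch with made-up values; `default=-1` is an assumption added here so that a row with an unexpected state stands out instead of silently receiving 0, which is what the cell above would give it.

```python
import numpy as np
import pandas as pd

# Toy rows: each has its own state plus the three per-state SNAP flags.
toy = pd.DataFrame({
    "state_id": ["CA", "TX", "WI", "CA"],
    "snap_CA": [1, 1, 0, 0],
    "snap_TX": [0, 1, 0, 1],
    "snap_WI": [0, 0, 1, 1],
})

states = ("CA", "TX", "WI")
conditions = [toy["state_id"] == s for s in states]
choices = [toy[f"snap_{s}"] for s in states]

# For each row, take the flag from the column that matches the row's state.
toy["snap"] = np.select(conditions, choices, default=-1)

print(toy[["state_id", "snap"]])
```

Only the flag belonging to each row's own state survives, so the three `snap_*` columns can be dropped afterwards.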

In [None]:
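# Caution: this overwrites data.csv, the same file final_data was read from above,
# so re-running the notebook would merge the calendar columns a second time.
merged_data.to_csv('/content/drive/MyDrive/DEPI_project/data.csv', index=False)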
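
Before saving, it can be worth confirming that the left merge neither duplicated nor lost rows. The following sketch uses toy frames (`sales` and `calendar` are hypothetical stand-ins for `final_data` and `cal`) and relies on the `validate` argument of `pd.merge`, which raises an error if the calendar has more than one row per `d`.

```python
import pandas as pd

# Toy stand-ins for the long sales table and the calendar (hypothetical values).
sales = pd.DataFrame({
    "item_id": ["A", "A", "B"],
    "d": ["d_1", "d_2", "d_1"],
    "sales": [0, 1, 2],
})
calendar = pd.DataFrame({
    "d": ["d_1", "d_2"],
    "date": pd.to_datetime(["2011-01-29", "2011-01-30"]),
})

# many_to_one: many sales rows per day, exactly one calendar row per day.
merged = pd.merge(sales, calendar, on="d", how="left", validate="many_to_one")

# A left merge against unique keys must keep the row count unchanged.
assert len(merged) == len(sales)

# Sales rows whose day label had no calendar match would show up as NaT here.
unmatched = int(merged["date"].isna().sum())
print(f"rows: {len(merged)}, unmatched days: {unmatched}")
```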