# 0. The problem

Source: https://www.kaggle.com/competitions/rossmann-store-sales/overview

## Context

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

## Solution

Predict 6 weeks of daily sales for 1115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what’s most important to them: their customers and their teams! 

# 1. Data description

**Files**
- train.csv - historical data including Sales
- test.csv - historical data excluding Sales
- store.csv - supplemental information about the stores

**Data fields**
- Id - an Id that represents a (Store, Date) pair within the test set
- Store - a unique Id for each store
- Sales - the turnover for any given day (this is what you are predicting)
- Customers - the number of customers on a given day
- Open - an indicator for whether the store was open: 0 = closed, 1 = open
- StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
- SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
- StoreType - differentiates between 4 different store models: a, b, c, d
- Assortment - describes an assortment level: a = basic, b = extra, c = extended
- CompetitionDistance - distance in meters to the nearest competitor store
- CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
- Promo - indicates whether a store is running a promo on that day
- Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
- Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
- PromoInterval - lists the months in which a new round of Promo2 starts. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
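
Since PromoInterval is a comma-separated string, it usually has to be parsed before it can be compared with a sale date. The block below is a minimal sketch of one way to do that, not part of the competition material. The helper name `promo2_start_months` is hypothetical. Note that the interval strings in store.csv spell September as "Sept", so the month lookup accepts both spellings.

```python
# Month abbreviations as they appear in PromoInterval, mapped to month numbers.
MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
MONTH_TO_NUMBER = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}
MONTH_TO_NUMBER["Sept"] = 9  # store.csv uses "Sept" rather than "Sep"


def promo2_start_months(promo_interval):
    """Return the month numbers in which a Promo2 round starts (hypothetical helper)."""
    if not isinstance(promo_interval, str) or not promo_interval:
        return []  # missing value: the store does not take part in Promo2
    return [MONTH_TO_NUMBER[name.strip()] for name in promo_interval.split(",")]


print(promo2_start_months("Feb,May,Aug,Nov"))
print(promo2_start_months("Mar,Jun,Sept,Dec"))
```

A sale date can then be flagged as falling in a Promo2 start month by testing `date.month in promo2_start_months(interval)`.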

## 1.1. Imports

In [70]:
import pandas as pd

from sklearn.model_selection import train_test_split

## 1.2. Loading data

In [71]:
df_store = pd.read_csv(filepath_or_buffer="../data/store.csv")
df_train_raw = pd.read_csv(filepath_or_buffer="../data/train.csv", low_memory=False)
df_val = pd.read_csv(filepath_or_buffer="../data/test.csv")

# 2. Exploratory data analysis

## 2.1. Data description

### Store

In [72]:
# Information of each store
df_store.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115 entries, 0 to 1114
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Store                      1115 non-null   int64  
 1   StoreType                  1115 non-null   object 
 2   Assortment                 1115 non-null   object 
 3   CompetitionDistance        1112 non-null   float64
 4   CompetitionOpenSinceMonth  761 non-null    float64
 5   CompetitionOpenSinceYear   761 non-null    float64
 6   Promo2                     1115 non-null   int64  
 7   Promo2SinceWeek            571 non-null    float64
 8   Promo2SinceYear            571 non-null    float64
 9   PromoInterval              571 non-null    object 
dtypes: float64(5), int64(2), object(3)
memory usage: 87.2+ KB


In [73]:
# Each store should appear exactly once
assert df_store.duplicated(subset=['Store']).sum() == 0

In [74]:
df_store.sample(10)

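| | Store | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear | PromoInterval |
|---|---|---|---|---|---|---|---|---|---|---|
| 282 | 283 | a | a | 2260.0 | NaN | NaN | 1 | 40.0 | 2014.0 | Jan,Apr,Jul,Oct |
| 712 | 713 | a | c | 220.0 | NaN | NaN | 1 | 10.0 | 2014.0 | Jan,Apr,Jul,Oct |
| 136 | 137 | a | a | 1730.0 | 7.0 | 2015.0 | 1 | 40.0 | 2014.0 | Jan,Apr,Jul,Oct |
| 126 | 127 | d | a | 1350.0 | 12.0 | 2005.0 | 1 | 13.0 | 2010.0 | Jan,Apr,Jul,Oct |
| 503 | 504 | c | c | 820.0 | NaN | NaN | 0 | NaN | NaN | NaN |
| 390 | 391 | a | a | 460.0 | 11.0 | 2014.0 | 1 | 31.0 | 2013.0 | Feb,May,Aug,Nov |
| 345 | 346 | a | c | 8090.0 | NaN | NaN | 0 | NaN | NaN | NaN |
| 1088 | 1089 | d | a | 5220.0 | 5.0 | 2009.0 | 0 | NaN | NaN | NaN |
| 537 | 538 | a | a | 990.0 | 2.0 | 2010.0 | 0 | NaN | NaN | NaN |
| 241 | 242 | d | a | 6880.0 | 9.0 | 2001.0 | 1 | 14.0 | 2011.0 | Jan,Apr,Jul,Oct |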


In [75]:
df_store["PromoInterval"].value_counts()

PromoInterval
Jan,Apr,Jul,Oct     335
Feb,May,Aug,Nov     130
Mar,Jun,Sept,Dec    106
Name: count, dtype: int64

**OBS.:**
- StoreType: convert to category
- Assortment: convert to category
- CompetitionOpenSinceMonth: convert to int; where it is NaN, fill with the month of the row's own date
- CompetitionOpenSinceYear: convert to int; where it is NaN, fill with the year of the row's own date
- Promo2SinceWeek: convert to int; where it is NaN, fill with the calendar week of the row's own date
- Promo2SinceYear: convert to int; how to handle NaN is still to be decided
- PromoInterval: only three distinct intervals occur, so convert to category
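
The type changes listed above could be sketched as follows. This is only an illustration under stated assumptions: `toy_store` is a made-up stand-in for `df_store` with invented values, and the NaN filling is left out because it needs the sale date from the train table. The sketch uses pandas' nullable `Int64` dtype so that integer columns can keep their missing values until the filling strategy is settled.

```python
import pandas as pd

# Toy stand-in for df_store with the same column names (values are made up).
toy_store = pd.DataFrame({
    "Store": [1, 2, 3],
    "StoreType": ["a", "d", "c"],
    "Assortment": ["a", "c", "a"],
    "CompetitionOpenSinceMonth": [9.0, None, 12.0],
    "CompetitionOpenSinceYear": [2008.0, None, 2005.0],
    "Promo2SinceWeek": [None, 13.0, 40.0],
    "Promo2SinceYear": [None, 2010.0, 2014.0],
    "PromoInterval": [None, "Jan,Apr,Jul,Oct", "Feb,May,Aug,Nov"],
})

categorical_cols = ["StoreType", "Assortment", "PromoInterval"]
integer_cols = ["CompetitionOpenSinceMonth", "CompetitionOpenSinceYear",
                "Promo2SinceWeek", "Promo2SinceYear"]

# Build one dtype mapping and apply it in a single astype call.
dtype_map = {col: "category" for col in categorical_cols}
dtype_map.update({col: "Int64" for col in integer_cols})  # nullable integer, keeps <NA>
toy_typed = toy_store.astype(dtype_map)

print(toy_typed.dtypes)
```

Plain `int64` would raise on the missing values, which is why the nullable dtype is used here; once the NaNs are filled, the columns can be cast to `int64` if preferred.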

### Train

In [76]:
df_train_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017209 entries, 0 to 1017208
Data columns (total 9 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   Store          1017209 non-null  int64 
 1   DayOfWeek      1017209 non-null  int64 
 2   Date           1017209 non-null  object
 3   Sales          1017209 non-null  int64 
 4   Customers      1017209 non-null  int64 
 5   Open           1017209 non-null  int64 
 6   Promo          1017209 non-null  int64 
 7   StateHoliday   1017209 non-null  object
 8   SchoolHoliday  1017209 non-null  int64 
dtypes: int64(7), object(2)
memory usage: 69.8+ MB


In [77]:
# Convert Date type to datetime
df_train_raw["Date"] = pd.to_datetime(df_train_raw["Date"])

In [78]:
df_train_raw

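| | Store | DayOfWeek | Date | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 5 | 2015-07-31 | 5263 | 555 | 1 | 1 | 0 | 1 |
| 1 | 2 | 5 | 2015-07-31 | 6064 | 625 | 1 | 1 | 0 | 1 |
| 2 | 3 | 5 | 2015-07-31 | 8314 | 821 | 1 | 1 | 0 | 1 |
| 3 | 4 | 5 | 2015-07-31 | 13995 | 1498 | 1 | 1 | 0 | 1 |
| 4 | 5 | 5 | 2015-07-31 | 4822 | 559 | 1 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1017204 | 1111 | 2 | 2013-01-01 | 0 | 0 | 0 | 0 | a | 1 |
| 1017205 | 1112 | 2 | 2013-01-01 | 0 | 0 | 0 | 0 | a | 1 |
| 1017206 | 1113 | 2 | 2013-01-01 | 0 | 0 | 0 | 0 | a | 1 |
| 1017207 | 1114 | 2 | 2013-01-01 | 0 | 0 | 0 | 0 | a | 1 |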


**OBS.:**
- StateHoliday: convert to category
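
A minimal sketch of that conversion is shown below. It assumes the column may contain the "no holiday" code both as the integer `0` and as the string `"0"`, which can happen when a CSV column mixes digits and letters; whether this is the case in train.csv should be checked with `value_counts()` first. The frame `toy` and its values are made up.

```python
import pandas as pd

# Toy column mixing the integer 0 and the string "0" (an assumed situation).
toy = pd.DataFrame({"StateHoliday": [0, "0", "a", "b", "c"]})

# Fix the category set explicitly so train and test share the same codes.
holiday_dtype = pd.CategoricalDtype(categories=["0", "a", "b", "c"])

# Cast to str first so 0 and "0" collapse into one label, then to category.
toy["StateHoliday"] = toy["StateHoliday"].astype(str).astype(holiday_dtype)

print(toy["StateHoliday"].value_counts())
```

Declaring the categories up front also means an unexpected code turns into NaN instead of silently creating a new category, which makes it easy to spot.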

In [79]:
# There should be no repeated dates for each store
assert df_train_raw.duplicated(subset=["Store", "Date"]).sum() == 0

In [80]:
# Number of instances for each store
df_train_raw.groupby("Store").size().value_counts()

942    934
758    180
941      1
Name: count, dtype: int64
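
The counts show that not every store has the same number of rows. One way to see whether the shorter histories come from gaps between a store's first and last date is sketched below, on made-up data; the frame `toy` and the column names `first`, `last`, `n_rows`, `expected_rows`, and `missing_days` are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Toy data: store 1 has a row for every day, store 2 skips two days.
toy = pd.DataFrame({
    "Store": [1, 1, 1, 1, 2, 2],
    "Date": pd.to_datetime([
        "2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04",
        "2013-01-01", "2013-01-04",
    ]),
})

# First date, last date, and row count per store.
coverage = toy.groupby("Store")["Date"].agg(first="min", last="max", n_rows="count")

# Days between first and last date (inclusive) versus rows actually present.
coverage["expected_rows"] = (coverage["last"] - coverage["first"]).dt.days + 1
coverage["missing_days"] = coverage["expected_rows"] - coverage["n_rows"]

print(coverage)
```

Running the same aggregation on `df_train_raw` before the closed days are dropped would show which stores have gaps in their date range.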

In [81]:
# Days on which a store is closed have no sales to predict, so drop those rows
original_shape = df_train_raw.shape
df_train_raw = df_train_raw[df_train_raw["Open"] == 1]
print(f"Original shape: {original_shape}")
print(f"New shape: {df_train_raw.shape}")
print(f"{original_shape[0] - df_train_raw.shape[0]} lines were removed.")

Original shape: (1017209, 9)
New shape: (844392, 9)
172817 lines were removed.


### Join train and store dataframes

In [82]:
df_raw = pd.merge(df_train_raw, df_store, how="left", on="Store")
df_raw.shape

(844392, 18)
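
A left join on `Store` should keep the row count unchanged and find a store record for every sale. The sketch below shows how pandas can enforce both conditions during the merge, using made-up frames `toy_sales` and `toy_stores` in place of the real data.

```python
import pandas as pd

# Toy stand-ins for the train and store tables (values are made up).
toy_sales = pd.DataFrame({"Store": [1, 1, 2], "Sales": [5263, 5020, 6064]})
toy_stores = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

merged = pd.merge(
    toy_sales,
    toy_stores,
    how="left",
    on="Store",
    validate="many_to_one",  # raises MergeError if a store appears twice on the right
    indicator=True,          # adds a _merge column telling where each row was found
)

# Every sale row should have matched a store, and no rows should be added or lost.
assert (merged["_merge"] == "both").all()
assert len(merged) == len(toy_sales)

merged = merged.drop(columns="_merge")
print(merged.shape)
```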

## 2.2. Train-test split

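- The most recent data of each store is set aside for testing, mirroring the six-week forecast horizon of the competition
- The split holds out the last 42 (6 × 7) rows of each store; because closed days were already dropped, these 42 rows can span more than six calendar weeks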

In [None]:
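def split_store_data(df, test_size=6*7):
    """Split each store chronologically, holding out its last `test_size` rows for testing."""
    stores = df["Store"].unique()
    train_list = []
    test_list = []

    for store in stores:
        # Sort by date so that the tail of the frame holds the most recent rows
        store_data = df[df["Store"] == store].sort_values(by="Date")
        # shuffle=False keeps the order: the last `test_size` rows become the test set
        train, test = train_test_split(store_data, test_size=test_size, shuffle=False)
        train_list.append(train)
        test_list.append(test)

    # Stack the per-store pieces back into single train and test frames
    df_train = pd.concat(train_list)
    df_test = pd.concat(test_list)

    return df_train, df_test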

In [None]:
df_train, df_test = split_store_data(df=df_train_raw, test_size=6*7)

In [None]:
df_train.shape, df_test.shape

In [None]:
# Check data splitting
assert df_train.shape[0] + df_test.shape[0] == df_train_raw.shape[0], "Problem in data splitting"
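
If the hold-out should cover exactly the last six calendar weeks of each store rather than a fixed number of rows, a date-based cutoff is an alternative. The sketch below is one possible way to do it and is not the method used in this notebook; `split_last_weeks` is a hypothetical helper and the toy dates are made up.

```python
import pandas as pd


def split_last_weeks(df, weeks=6):
    """Hold out each store's final `weeks` calendar weeks (hypothetical helper)."""
    # Last recorded date of each store, aligned to every row of that store
    last_date = df.groupby("Store")["Date"].transform("max")
    cutoff = last_date - pd.Timedelta(weeks=weeks)
    # Rows strictly after the cutoff fall inside the final `weeks` weeks
    is_test = df["Date"] > cutoff
    return df[~is_test], df[is_test]


# Toy data: one store, 10 weeks of days with Sundays removed to mimic closed days.
dates = pd.date_range("2015-01-05", periods=70, freq="D")
toy = pd.DataFrame({"Store": 1, "Date": dates})
toy = toy[toy["Date"].dt.dayofweek != 6]

toy_train, toy_test = split_last_weeks(toy, weeks=6)
print(len(toy_train), len(toy_test))
```

With this approach the number of test rows varies from store to store, depending on how many days each store was open during those weeks.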

In [None]:
# For every store, the latest train date must come before the earliest test date
stores = df_store["Store"].unique()
for store in stores:
    assert df_train[df_train["Store"] == store]["Date"].max() < df_test[df_test["Store"] == store]["Date"].min(), \
        f"Store {store}: earliest test date is not later than the latest train date"

# 3. Feature Engineering