In [1]:
import torch
import numpy as np
import pandas as pd
from category_encoders import *
from tqdm import notebook
import matplotlib.pyplot as plt
import gc
import pickle as pkl
%matplotlib inline

In [3]:
train_data = pd.read_csv('../../data/sales_train_validation.csv')
sell_prices = pd.read_csv('../../data/sell_prices.csv')
calendar = pd.read_csv('../../data/calendar.csv')
sample_submission = pd.read_csv('../../data/sample_submission.csv')
weights = pd.read_csv('../../data/weights_validation.csv')

In [4]:
with open('./../data/data.pickle', 'rb') as f:
    data_dict = pkl.load(f)
    
sales_data_index = data_dict['sales_data_index']
calendar_index = data_dict['calendar_index']
X_prev_day_sales = data_dict['X_prev_day_sales']
X_sell_price = data_dict['X_sell_price']
X_calendar = data_dict['X_calendar']
X_calendar_cols = data_dict['X_calendar_cols']
Y = data_dict['Y']

In [11]:
weights[weights.Level_id=='Level12']

Unnamed: 0,Level_id,Agg_Level_1,Agg_Level_2,Weight
12350,Level12,FOODS_1_001,CA_1,1.970000e-05
12351,Level12,FOODS_1_001,CA_2,1.850000e-05
12352,Level12,FOODS_1_001,CA_3,1.430000e-05
12353,Level12,FOODS_1_001,CA_4,5.380000e-06
12354,Level12,FOODS_1_001,TX_1,5.980000e-07
...,...,...,...,...
42835,Level12,HOUSEHOLD_2_516,TX_2,1.270000e-05
42836,Level12,HOUSEHOLD_2_516,TX_3,7.920000e-06
42837,Level12,HOUSEHOLD_2_516,WI_1,1.580000e-06
42838,Level12,HOUSEHOLD_2_516,WI_2,1.580000e-06


In [12]:
weights[weights.Level_id=='Level2']

Unnamed: 0,Level_id,Agg_Level_1,Agg_Level_2,Weight
1,Level2,CA,X,0.442371
2,Level2,TX,X,0.269297
3,Level2,WI,X,0.288332


### Possible approaches:
* Forecast the 30490 bottom-level series very well, using embeddings for item_id, state_id, etc., and add weights in a smarter way (increase an item's weight if it also contributes significantly to a higher level's weight?).
* Forecast all the series, maybe with a different model for the aggregated series, then devise a postprocessing method to make the forecasts coherent.
* (Best one?) The problem with bottom-up approaches (1st approach) is that the aggregated series are easier to forecast than the bottom-level ones, as they are more filled, i.e. have fewer empty days(?). What if we provide the time series for Levels 1-5 as features but still forecast at Level 12? Can try this after implementing the 1st approach.
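
A minimal sketch of the third idea, assuming the sales table has one row per item/store series with daily columns named `d_1`, `d_2`, ... and an id column such as `state_id`. The toy values and the helper name `add_aggregate_features` are made up for illustration; only the state-level aggregate is shown.

```python
import pandas as pd

# Toy stand-in for the sales table (values are invented for illustration).
toy_sales = pd.DataFrame({
    "item_id":  ["FOODS_1_001", "FOODS_1_001", "HOBBIES_1_001", "HOBBIES_1_001"],
    "store_id": ["CA_1", "TX_1", "CA_1", "TX_1"],
    "state_id": ["CA", "TX", "CA", "TX"],
    "d_1": [3, 0, 1, 2],
    "d_2": [1, 4, 0, 2],
})
day_cols = [c for c in toy_sales.columns if c.startswith("d_")]


def add_aggregate_features(df, group_col, day_cols):
    """Attach to every bottom-level row the summed daily sales of its group."""
    # transform("sum") keeps the original row index, so each row receives
    # the total of the group (here: the state) it belongs to.
    agg = df.groupby(group_col)[day_cols].transform("sum")
    agg.columns = [f"{group_col}_{c}" for c in day_cols]
    return pd.concat([df, agg], axis=1)


features = add_aggregate_features(toy_sales, "state_id", day_cols)
print(features[["item_id", "store_id", "state_id_d_1", "state_id_d_2"]])
```

If these columns are fed to a model, they would need the same lag as the bottom-level sales inputs, since the same-day aggregate already contains the value being predicted.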

For now, train using the first approach, decide on a validation method, clean the code, improve the model a bit, and then come back to see if any other approach should be used to improve it further.


### What are the series all about?

* #### Level 1: Aggregate everything
      Number of series: 1
      Graph: Item --> Department --> Category --> Store --> State --> All
* #### Level 2: Aggregate for each state
      Number of series: 3
      Graph: Item --> Department --> Category --> Store --> State
* #### Level 3: Aggregate for each store
      Number of series: 10
      Graph: Item --> Department --> Category --> Store
* #### Level 4: Aggregate for each category
      Number of series: 3
      Graph: Item --> Department --> Category
* #### Level 5: Aggregate for each department
      Number of series: 7
      Graph: Item --> Department
* #### Level 6: Aggregate for each state and category
      Number of series: 9
      Graph: Item --> Department --> Category+State
* #### Level 7: Aggregate for each state and department
      Number of series: 21
      Graph: Item --> Department+State
* #### Level 8: Aggregate for each store and category
      Number of series: 30
      Graph: Item --> Department --> Category+Store
* #### Level 9: Aggregate for each store and department
      Number of series: 70
      Graph: Item --> Department+Store
* #### Level 10: Aggregate of each product across all stores/states
      Number of series: 3049
      Graph: Store --> Item
* #### Level 11: Aggregate of each product for each state
      Number of series: 9147
      Graph: State+Item
* #### Level 12: Aggregate of each product for each store
      Number of series: 30490
      Graph: Item --> Store
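
The series counts above can be checked by counting unique key combinations per level. This is a rough sketch, assuming the sales table carries the id columns `item_id`, `dept_id`, `cat_id`, `store_id` and `state_id`; the `LEVEL_GROUPS` mapping and the toy frame are illustrative, not part of the data files.

```python
from itertools import product

import pandas as pd

# Grouping columns per level; an empty list means "sum everything together".
LEVEL_GROUPS = {
    1: [],
    2: ["state_id"],
    3: ["store_id"],
    4: ["cat_id"],
    5: ["dept_id"],
    6: ["state_id", "cat_id"],
    7: ["state_id", "dept_id"],
    8: ["store_id", "cat_id"],
    9: ["store_id", "dept_id"],
    10: ["item_id"],
    11: ["item_id", "state_id"],
    12: ["item_id", "store_id"],
}


def count_series_per_level(df):
    """Return {level: number of distinct series} for a bottom-level table."""
    counts = {}
    for level, cols in LEVEL_GROUPS.items():
        counts[level] = 1 if not cols else len(df[cols].drop_duplicates())
    return counts


# Toy table: 2 items sold in 3 stores across 2 states (made-up ids).
items = [("FOODS_1_001", "FOODS_1", "FOODS"), ("HOBBIES_1_001", "HOBBIES_1", "HOBBIES")]
stores = [("CA_1", "CA"), ("CA_2", "CA"), ("TX_1", "TX")]
toy = pd.DataFrame(
    [(i, d, c, s, st) for (i, d, c), (s, st) in product(items, stores)],
    columns=["item_id", "dept_id", "cat_id", "store_id", "state_id"],
)
counts = count_series_per_level(toy)
print(counts)
```

Called on the full sales table instead of `toy`, the same function should reproduce the counts listed for each level if the column assumptions hold.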