# Project Description

This project practices time series forecasting on daily historical sales data. The task is to forecast the total number of products sold in every shop for the test set.
Note that the list of shops and products changes slightly every month. Creating a robust model that can handle such situations is part of the challenge.
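
Because the shop and product lists shift from month to month, some (shop, item) pairs in the test set may never appear in the training data. The sketch below shows one possible way to find such pairs with a left merge. The two small frames and their values are made up for illustration, and the names `train_pairs`, `test_pairs`, and `unseen` are hypothetical.

```python
import pandas as pd

# Hypothetical (shop_id, item_id) pairs; not taken from the real files
train_pairs = pd.DataFrame({"shop_id": [25, 25, 59], "item_id": [2552, 2554, 22154]})
test_pairs = pd.DataFrame({"shop_id": [25, 59, 59], "item_id": [2552, 22154, 9999]})

# Left merge with indicator=True marks test pairs that have no match in train
merged = test_pairs.merge(
    train_pairs.drop_duplicates(),
    on=["shop_id", "item_id"],
    how="left",
    indicator=True,
)
unseen = merged[merged["_merge"] == "left_only"]
print(unseen[["shop_id", "item_id"]])
```

Pairs flagged this way have no sales history at all, so a model needs some fallback for them, for example a default prediction.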

* What is a robust model? :
    It relies on robust estimation: an estimation technique that is insensitive to small departures from the idealized assumptions used to optimize the algorithm.
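
As a rough illustration of this idea (not part of the competition pipeline), the sketch below compares how far the mean and the median move when a single extreme value is added to a small sample. The sales counts are invented for the example.

```python
import statistics

# Hypothetical daily sales counts for one item
daily_sales = [2, 3, 2, 4, 3, 2, 3]
# The same sample with one extreme value appended (e.g. a data entry error)
with_outlier = daily_sales + [500]

# How far each estimator moves when the outlier is added
mean_shift = abs(statistics.mean(with_outlier) - statistics.mean(daily_sales))
median_shift = abs(statistics.median(with_outlier) - statistics.median(daily_sales))

print(f"mean shift: {mean_shift:.2f}, median shift: {median_shift:.2f}")
```

The median barely reacts to the extreme value while the mean is pulled far away from the typical level, which is the sense in which the median is the more robust estimator here.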

## File Description

* sales_train.csv : training set. Daily historical data from January 2013 to October 2015
* test.csv : the test set. You need to forecast the sales for these shops and products for November 2015
* sample_submission.csv : sample submission with correct format
* items.csv : supplemental information about the items and their categories
* shops.csv : supplemental information about the shops

## Data fields

* ID : a tuple of (Shop, Item) within the test set
* shop_id : unique identifier of a shop
* item_id : unique identifier of a product
* item_category_id : unique identifier of item category
* item_cnt_day : number of products sold. You are predicting a monthly amount of this measure
* item_price : current price of an item
* date : date in format dd.mm.yyyy
* date_block_num : a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
* item_name : name of item
* shop_name : name of shop
* item_category_name : name of item category
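
The `date_block_num` field can be reproduced from a date with simple arithmetic. The sketch below assumes dates written as dd.mm.yyyy, as they appear in sales_train.csv; the helper name `month_block` is hypothetical.

```python
from datetime import datetime

def month_block(date_str: str) -> int:
    """Hypothetical helper: number of months since January 2013."""
    d = datetime.strptime(date_str, "%d.%m.%Y")
    return (d.year - 2013) * 12 + (d.month - 1)

print(month_block("02.01.2013"))  # January 2013
print(month_block("15.10.2015"))  # October 2015
```

Under the same numbering, the forecast month (November 2015) would be block 34.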

# Methodology to be used

### Time series

### Importing Packages

In [1]:
# Import basic packages
import numpy as np  # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import random as rd # generating random numbers
import datetime # manipulating date formats

# Import packages for visualization
import matplotlib.pyplot as plt # basic plotting
import seaborn as sns # for prettier plots

# Import packages for time series
from statsmodels.tsa.arima.model import ARIMA  # current location; the old statsmodels.tsa.arima_model.ARIMA was removed in statsmodels 0.13
from statsmodels.tsa.statespace.sarimax import SARIMAX
from pandas.plotting import autocorrelation_plot
from statsmodels.tsa.stattools import adfuller, acf, pacf, arma_order_select_ic
import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt
import statsmodels.api as sm
import scipy.stats as scs

# Settings
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Import datasets
train = pd.read_csv('./dataset/competitive-data-science-predict-future-sales/sales_train.csv')          # sales data for training
item_cat = pd.read_csv('./dataset/competitive-data-science-predict-future-sales/item_categories.csv')   # item category names and ids
item = pd.read_csv('./dataset/competitive-data-science-predict-future-sales/items.csv')                 # supplemental information about the items and their categories
sample = pd.read_csv('./dataset/competitive-data-science-predict-future-sales/sample_submission.csv')   # sample submission with correct format
shops = pd.read_csv('./dataset/competitive-data-science-predict-future-sales/shops.csv')                # supplemental information about the shops
test = pd.read_csv('./dataset/competitive-data-science-predict-future-sales/test.csv')                  # test set

In [3]:
# Checking the data
train.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0
2,05.01.2013,0,25,2552,899.0,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.0,1.0


### Preprocessing

In [4]:
# Parse the date column from dd.mm.yyyy strings into datetime objects
train.date = train.date.apply(lambda x: datetime.datetime.strptime(x, '%d.%m.%Y'))
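
As an alternative to the row-by-row `apply`, `pd.to_datetime` with an explicit `format` parses the whole column in one vectorized call. The sketch below uses a small hypothetical series in place of `train.date`; passing the format explicitly avoids any ambiguity between day-first and month-first readings of a value such as 02.01.2013.

```python
import pandas as pd

# Hypothetical sample of dd.mm.yyyy strings standing in for train.date
raw = pd.Series(["02.01.2013", "03.01.2013", "15.01.2013"])

# An explicit format makes the day-first interpretation unambiguous
parsed = pd.to_datetime(raw, format="%d.%m.%Y")

print(parsed.dt.day.tolist())
print(parsed.dt.month.tolist())
```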

In [5]:
# checking the data again
train.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,2013-01-02,0,59,22154,999.0,1.0
1,2013-01-03,0,25,2552,899.0,1.0
2,2013-01-05,0,25,2552,899.0,-1.0
3,2013-01-06,0,25,2554,1709.05,1.0
4,2013-01-15,0,25,2555,1099.0,1.0


In [6]:
# Aggregation: the test set requires sales predictions on a monthly basis, so the daily data must be aggregated to the monthly level --> groupby

# Aggregation plan: rows are keyed by (date_block_num: the number of months since the first month, shop_id, item_id); columns are date: min and max, item_price: mean (average price), item_cnt_day: sum (total units sold)

In [7]:
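# Group by month, shop, and item, then aggregate as described in the comments above
monthly_sales = train.groupby(['date_block_num', 'shop_id', 'item_id']).agg(
    {'date': ['min', 'max'], 'item_price': 'mean', 'item_cnt_day': 'sum'})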
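
The aggregation plan from the comments in cell 6 can be checked on a toy frame before running it on the full data. This is a minimal sketch: the rows are invented (loosely modeled on the `train.head()` output), and the output column names `first_date`, `last_date`, `mean_price`, and `item_cnt_month` are assumptions chosen for readability. It uses pandas named aggregation so the result has flat column names instead of a column MultiIndex.

```python
import pandas as pd

# Hypothetical daily rows; note the -1.0, which represents a returned item
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-02", "2013-01-05", "2013-01-06", "2013-02-01"]),
    "date_block_num": [0, 0, 0, 1],
    "shop_id": [25, 25, 25, 25],
    "item_id": [2552, 2552, 2554, 2552],
    "item_price": [899.0, 899.0, 1709.05, 899.0],
    "item_cnt_day": [1.0, -1.0, 1.0, 2.0],
})

# One row per (month, shop, item) with the aggregates from the plan
monthly = (
    toy.groupby(["date_block_num", "shop_id", "item_id"])
    .agg(
        first_date=("date", "min"),
        last_date=("date", "max"),
        mean_price=("item_price", "mean"),
        item_cnt_month=("item_cnt_day", "sum"),
    )
    .reset_index()
)
print(monthly)
```

Passing an empty list to `groupby` raises `ValueError: No group keys passed!`, so the key columns must always be listed explicitly.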