# Predict future sales
In this competition you will work with a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms, 1C Company.

We are asking you to predict total sales for every product and store in the next month. By solving this competition you will be able to apply and enhance your data science skills.

In [1]:
import pandas as pd
import numpy as np
import pickle
import os
import gc
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

pd.set_option('display.max_rows', 600)
pd.set_option('display.max_columns', 50)

import lightgbm as lgb
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from tqdm import tqdm_notebook

from itertools import product

In [2]:
DATA_FOLDER = '../readonly/final_project_data/'

sales    = pd.read_csv(os.path.join(DATA_FOLDER, 'sales_train.csv.gz'))
items           = pd.read_csv(os.path.join(DATA_FOLDER, 'items.csv'))
item_categories = pd.read_csv(os.path.join(DATA_FOLDER, 'item_categories.csv'))
shops           = pd.read_csv(os.path.join(DATA_FOLDER, 'shops.csv'))
train           = pd.read_csv(os.path.join(DATA_FOLDER, 'sales_train.csv.gz'), compression='gzip')
test           = pd.read_csv(os.path.join(DATA_FOLDER, 'test.csv'))

# EDA

First, I check the size of each table.

In [3]:
# np.str was only an alias for the built-in str and is gone from recent NumPy releases
print('sales shape %s' % str(sales.shape))
print('items shape %s' % str(items.shape))
print('item_categories shape %s' % str(item_categories.shape))
print('shops shape %s' % str(shops.shape))
print('train shape %s' % str(train.shape))
print('test shape %s' % str(test.shape))

sales shape (2935849, 6)
items shape (22170, 3)
item_categories shape (84, 2)
shops shape (60, 2)
train shape (2935849, 6)
test shape (214200, 3)


I take a first look at the data.
`sales` and `train` have the same shape. Are they the same DataFrame?

In [4]:
sales.equals(train)

True

In [5]:
sales.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0
2,05.01.2013,0,25,2552,899.0,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.0,1.0


In [6]:
shops.head()

Unnamed: 0,shop_name,shop_id
0,"!Якутск Орджоникидзе, 56 фран",0
1,"!Якутск ТЦ ""Центральный"" фран",1
2,"Адыгея ТЦ ""Мега""",2
3,"Балашиха ТРК ""Октябрь-Киномир""",3
4,"Волжский ТЦ ""Волга Молл""",4


In [7]:
items.head()

Unnamed: 0,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40
4,***КОРОБКА (СТЕКЛО) D,4,40


In [8]:
item_categories.head()

Unnamed: 0,item_category_name,item_category_id
0,PC - Гарнитуры/Наушники,0
1,Аксессуары - PS2,1
2,Аксессуары - PS3,2
3,Аксессуары - PS4,3
4,Аксессуары - PSP,4


In [9]:
test.head()

Unnamed: 0,ID,shop_id,item_id
0,0,5,5037
1,1,5,5320
2,2,5,5233
3,3,5,5232
4,4,5,5268


So I need to predict sales for each combination of shop and product, not just the total sales per shop.
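Since the raw data is daily and the target is monthly, the quantity to predict can be pictured as a sum of `item_cnt_day` per month, shop and item. The sketch below uses a made-up toy frame, and `item_cnt_month` is a hypothetical column name chosen for illustration; it is not a step taken in this notebook so far.

```python
import pandas as pd

# Toy stand-in for the sales frame; the values are made up for illustration
toy_sales = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1],
    'shop_id':        [25, 25, 59, 25],
    'item_id':        [2552, 2552, 22154, 2552],
    'item_cnt_day':   [1.0, -1.0, 1.0, 2.0],
})

# Sum the daily counts per (month, shop, item).
# 'item_cnt_month' is a hypothetical name, not a column of the original data.
monthly = (
    toy_sales
    .groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False)['item_cnt_day']
    .sum()
    .rename(columns={'item_cnt_day': 'item_cnt_month'})
)
print(monthly)
```

Note that a sale and a return of the same item in the same month cancel out, so a monthly total can be zero even when rows exist.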

First, I add the shop and category descriptions to the sales DataFrame.

In [10]:
# Attach the category name to every item
items_merge = pd.merge(items, item_categories, on='item_category_id')

In [11]:
# Attach shop names, then item and category descriptions, to every sales row
sales_merge = pd.merge(sales, shops, on='shop_id')
sales_merge = pd.merge(sales_merge, items_merge, on='item_id')

In [12]:
sales_merge.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,shop_name,item_name,item_category_id,item_category_name
0,02.01.2013,0,59,22154,999.0,1.0,"Ярославль ТЦ ""Альтаир""",ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray
1,02.01.2013,0,25,22154,999.0,1.0,"Москва ТРК ""Атриум""",ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray
2,03.01.2013,0,25,22154,999.0,1.0,"Москва ТРК ""Атриум""",ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray
3,20.01.2013,0,25,22154,999.0,1.0,"Москва ТРК ""Атриум""",ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray
4,23.01.2013,0,25,22154,999.0,1.0,"Москва ТРК ""Атриум""",ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray


I check the column types.

In [13]:
sales_merge.dtypes

date                   object
date_block_num          int64
shop_id                 int64
item_id                 int64
item_price            float64
item_cnt_day          float64
shop_name              object
item_name              object
item_category_id        int64
item_category_name     object
dtype: object

The daily sales count is stored as a float. Are there any fractional sales?

In [14]:
(sales_merge.item_cnt_day%1 != 0).any()

False

Are there any NaNs?

In [15]:
sales_merge.isnull().values.any()

False

In [16]:
test.isnull().values.any()

False

So all cells have been populated with some values

Does each item belong to just one category?

In [27]:
len(items_merge.groupby(['item_name','item_category_name']).nunique()) == len(items_merge)

True
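A more direct way to phrase the same question is to count the distinct categories per `item_id`. This is only a sketch on a made-up toy frame; the names `toy_items`, `cats_per_item` and `one_category_each` are assumptions introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the items table; the values are made up for illustration
toy_items = pd.DataFrame({
    'item_id':          [0, 1, 2],
    'item_category_id': [40, 76, 40],
})

# Number of distinct categories attached to each item
cats_per_item = toy_items.groupby('item_id')['item_category_id'].nunique()

# True only if every item maps to exactly one category
one_category_each = bool((cats_per_item == 1).all())
print(one_category_each)
```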

Now let's look at how the train and test sets are constructed.

In [31]:
set(sales_merge.shop_id) - set(test.shop_id)

{0, 1, 8, 9, 11, 13, 17, 20, 23, 27, 29, 30, 32, 33, 40, 43, 51, 54}

The 18 shops above are in the train set but not in the test set. All shops in the test set are also in the train set.
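Set difference is not symmetric, so the direction of the subtraction decides which question is answered. A minimal sketch with made-up shop IDs (not the real ones):

```python
# Made-up shop IDs for illustration only
train_shops = {0, 1, 2, 3, 4}
test_shops = {2, 3, 4}

# Shops seen in train but absent from test
only_in_train = train_shops - test_shops

# Shops in test that were never seen in train; empty means test is covered
only_in_test = test_shops - train_shops

# Equivalent subset check
test_covered = test_shops <= train_shops

print(only_in_train, only_in_test, test_covered)
```

On the real data, the second form would correspond to `set(test.shop_id) - set(sales_merge.shop_id)`.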

In [35]:
len(set(test.item_id) - set(sales_merge.item_id))

363

Damn! 363 items appear in the test set but have never been observed in the train set...
This could be a hint that the test set was constructed artificially.

In [37]:
len(test.groupby(['shop_id','item_id']))

214200

In [38]:
len(set(test.item_id)) * len(set(test.shop_id))

214200

Hypothesis confirmed. The test set was built by combining a set of items with a set of shops.
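Matching counts strongly suggest a full grid. A stricter check would compare the actual pairs against the Cartesian product of shops and items. The following is a sketch on made-up pairs, using `itertools.product`; `test_pairs` and `full_grid` are hypothetical names.

```python
from itertools import product

# Made-up (shop_id, item_id) pairs standing in for the rows of the test set
test_pairs = [(5, 5037), (5, 5320), (6, 5037), (6, 5320)]

shops_in_test = {shop for shop, _ in test_pairs}
items_in_test = {item for _, item in test_pairs}

# Every possible shop/item combination
full_grid = set(product(shops_in_test, items_in_test))

# A full grid has no missing pair and no duplicated pair
is_full_grid = set(test_pairs) == full_grid and len(test_pairs) == len(full_grid)
print(is_full_grid)
```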

Now I would like to understand whether there was any logic behind the selection of the items and shops.

In [46]:
add_items_test = set(test.item_id) - set(sales_merge.item_id)

In [55]:
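# Note: .loc selects by index label, not by item_id. Recent pandas also rejects a set as an indexer, hence list().
items_merge.loc[list(add_items_test)].sort_values(['item_category_name'])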

Unnamed: 0,item_name,item_id,item_category_id,item_category_name
20439,Universal: Кабель оптический Gioteck XC-6 цифр...,7235,3,Аксессуары - PS4
20378,PS Vita 1000: Крышка Hori Face Cover защитная ...,5579,5,Аксессуары - PSVita
20400,PS Vita: Чехол 4gamers Clean 'n' Protect мягки...,5601,5,Аксессуары - PSVita
20403,PS Vita: Чехол 4gamers Travel Case дорожный кр...,5604,5,Аксессуары - PSVita
20401,PS Vita: Чехол 4gamers Deluxe Travel Case доро...,5602,5,Аксессуары - PSVita
20461,X360: Файтстик Hori Fighting Edge (HX3-70U),7896,6,Аксессуары - XBOX 360
20476,Аксессуар: Xbox 360 Жесткий диск 500 ГБ (6FM-0...,8448,6,Аксессуары - XBOX 360
20677,Комплект Sony PS3 (320 GB) (CECH-3008B) + игра...,13433,11,Игровые консоли - PS3
20680,Комплект Sony PS3 Super Slim (12 Gb) (CECH-400...,13436,11,Игровые консоли - PS3
21974,Игровая приставка SEGA Genesis Nano Trainer (б...,12126,17,Игровые консоли - Прочие
