# Predict Future Sales

You are provided with daily historical sales data. The task is to forecast the total amount of products sold in every shop for the test set. Note that the list of shops and products slightly changes every month. Creating a robust model that can handle such situations is part of the challenge.

### File Description
* sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
* sales_test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.
* sample_submission.csv - a sample submission file in the correct format.
* items.csv - supplemental information about the items/products.
* item_categories.csv  - supplemental information about the items categories.
* shops.csv- supplemental information about the shops.


### Data Fields
* ID - an Id that represents a (Shop, Item) tuple within the test set
* shop_id - unique identifier of a shop
* item_id - unique identifier of a product
* item_category_id - unique identifier of item category
* item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
* item_price - current price of an item
* date - date in format dd/mm/yyyy
* date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
* item_name - name of item
* shop_name - name of shop
* item_category_name - name of item category

In [77]:
import pandas as pd
import numpy as np

In [12]:
train = pd.read_csv('data/sales_train.csv')
test = pd.read_csv('data/sales_test.csv')
train.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0
2,05.01.2013,0,25,2552,899.0,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.0,1.0


In [3]:
items = pd.read_csv('data/items.csv')
items.head()

Unnamed: 0,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40
4,***КОРОБКА (СТЕКЛО) D,4,40


In [7]:
item_categories = pd.read_csv('data/item_categories.csv')
item_categories.head()

Unnamed: 0,item_category_name,item_category_id
0,PC - Гарнитуры/Наушники,0
1,Аксессуары - PS2,1
2,Аксессуары - PS3,2
3,Аксессуары - PS4,3
4,Аксессуары - PSP,4


In [6]:
shops = pd.read_csv('data/shops.csv')
shops.head()

Unnamed: 0,shop_name,shop_id
0,"!Якутск Орджоникидзе, 56 фран",0
1,"!Якутск ТЦ ""Центральный"" фран",1
2,"Адыгея ТЦ ""Мега""",2
3,"Балашиха ТРК ""Октябрь-Киномир""",3
4,"Волжский ТЦ ""Волга Молл""",4


## Data Optimization for `sales_train.csv`

In [9]:
# real memory usage dataframe 'train'  - 299.6MB
train.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
date              object
date_block_num    int64
shop_id           int64
item_id           int64
item_price        float64
item_cnt_day      float64
dtypes: float64(2), int64(3), object(1)
memory usage: 299.6 MB


In [10]:
# check the number of null values in each column
train.isnull().sum()

date              0
date_block_num    0
shop_id           0
item_id           0
item_price        0
item_cnt_day      0
dtype: int64

### Optimize numeric columns

In [11]:
# optimize float columns to fit the data size in actual values

float_cols = train.select_dtypes(include=['float'])
float_cols.columns

Index(['item_price', 'item_cnt_day'], dtype='object')

In [16]:
for fc in float_cols.columns:
    train[fc] = pd.to_numeric(train[fc], downcast='float')

In [18]:
int_cols = train.select_dtypes(include=['int'])
int_cols.columns

Index(['date_block_num', 'shop_id', 'item_id'], dtype='object')

In [20]:
for ic in int_cols.columns:
    train[ic] = pd.to_numeric(train[ic], downcast='integer')

### Convert `id` columns to `category` type column
* Condition : `(number of unique `id` values / total number of data rows) < .5`

In [31]:
print(len(train['shop_id'].unique()) / len(train) < .5)
print(len(train['item_id'].unique()) / len(train) < .5)

True
True


In [32]:
train['shop_id'] = train['shop_id'].astype('category')

In [33]:
train['item_id'] = train['item_id'].astype('category')

### `Float` type to `Int` type & optimize column size

In [43]:
len(train['item_cnt_day'].unique())

198

In [52]:
# converting available?
for unq in train['item_cnt_day'].unique():
    if unq % 1 != 0:
        print('not available')
        break
else:
    print('converting from float to int available')

converting from float to int available


In [54]:
train['item_cnt_day'] = pd.to_numeric(train['item_cnt_day'].astype('int'),
                                      downcast='integer')

### `date` column to `date.datetime` datatype

In [62]:
train['date'] = pd.to_datetime(train['date'])

### How much optimized?

In [None]:
# original
pd.read_csv('data/sales_train.csv').info(memory_usage='deep')

In [None]:
# optimized
train.info(memory_usage='deep')

## Concatenate & combine data


In [94]:
# map the 'item_category_id' column values from 'item' dataframe
# and change the type as category

item_item_cat_id_mapped = dict(zip(items['item_id'].tolist(), items['item_category_id'].tolist()))
train['item_category_id'] = train['item_id'].map(item_item_cat_id_mapped)

In [100]:
train['item_category_id'] = train['item_category_id'].astype('category')  

In [95]:
# map the 'item_category_name' column values from 'item_category_name'
# we do not change 'item_category_name' in train dataframe at now

item_cat_id_item_name_mapped = dict(zip(item_categories['item_category_id'].tolist(),
                                       item_categories['item_category_name'].tolist()))
train['item_category_name'] = train['item_category_id'].map(item_cat_id_item_name_mapped)

In [96]:
train.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,item_category_name
0,2013-02-01,0,59,22154,999.0,1,37,Кино - Blu-Ray
1,2013-03-01,0,25,2552,899.0,1,58,Музыка - Винил
2,2013-05-01,0,25,2552,899.0,-1,58,Музыка - Винил
3,2013-06-01,0,25,2554,1709.050049,1,58,Музыка - Винил
4,2013-01-15,0,25,2555,1099.0,1,56,Музыка - CD фирменного производства


In [101]:
train.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 8 columns):
date                  datetime64[ns]
date_block_num        int8
shop_id               category
item_id               category
item_price            float32
item_cnt_day          int16
item_category_id      category
item_category_name    object
dtypes: category(3), datetime64[ns](1), float32(1), int16(1), int8(1), object(1)
memory usage: 398.6 MB


## Get additional data using `Google API`
* **Translate Russian text in `item_category_name` into English text using Google Translator**
* **Get the full address of the store by searching the `shop_name` via Google Searching**

In [107]:
# Get translated item_category_name as English

item_category_names = train['item_category_name'].unique()
len(item_category_names)

84

In [165]:
from google.oauth2 import service_account
credentials = service_account.Credentials.from_service_account_file(
    'path_of_credentials')

In [166]:
translate_client = translate.Client(credentials=credentials)

In [168]:
translated_item_category_names = []

for unique_cat_name in item_category_names:
    # The text to translate
    text = unique_cat_name

    # The target language
    target = 'en'

    # Translates some text into Russian
    translation = translate_client.translate(
        text,
        target_language=target)

    #print(u'Text: {}'.format(text))
    #print(u'Translation: {}'.format(translation['translatedText']))
    translated_item_category_names.append(translation['translatedText'])

In [170]:
translated_catname_mapped = dict(zip(item_category_names,
                                    translated_item_category_names))

In [171]:
train['item_category_name_eng'] = train['item_category_name'].map(translated_catname_mapped)
train = train.drop(['item_category_name'], axis=1)

In [174]:
# Get translated shop_name as English

shop_id_item_name_mapped = dict(zip(shops['shop_id'].tolist(),
                                       shops['shop_name'].tolist()))
train['shop_name'] = train['shop_id'].map(shop_id_item_name_mapped)

In [176]:
shop_names = train['shop_name'].unique()
len(shop_names)

60

In [177]:
translated_shop_names = []

for unique_shop_name in shop_names:
    # The text to translate
    text = unique_shop_name

    # The target language
    target = 'en'

    # Translates some text into Russian
    translation = translate_client.translate(
        text,
        target_language=target)

    #print(u'Text: {}'.format(text))
    #print(u'Translation: {}'.format(translation['translatedText']))
    translated_shop_names.append(translation['translatedText'])

In [178]:
translated_shopname_mapped = dict(zip(shop_names,
                                    translated_shop_names))

train['shop_name_eng'] = train['shop_name'].map(translated_shopname_mapped)
train = train.drop(['shop_name'], axis=1)

In [179]:
train.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,item_category_name_eng,shop_name_eng
0,2013-02-01,0,59,22154,999.0,1,37,Cinema - Blu-Ray,Yaroslavl Shopping center &quot;Altair&quot;
1,2013-03-01,0,25,2552,899.0,1,58,Music - Vinyl,Moscow TRK &quot;Atrium&quot;
2,2013-05-01,0,25,2552,899.0,-1,58,Music - Vinyl,Moscow TRK &quot;Atrium&quot;
3,2013-06-01,0,25,2554,1709.050049,1,58,Music - Vinyl,Moscow TRK &quot;Atrium&quot;
4,2013-01-15,0,25,2555,1099.0,1,56,Music - CD of branded production,Moscow TRK &quot;Atrium&quot;


## Explanatory Data Exploration (EDA)
* ~~~
* ~~~


# Feature Engineering
* `item_category_name_eng` : we can divide the information into two category names by hierarchy.
* `shop_name_eng` : we can extract additional information about the area which the shop is located at.
* Creating columns like:
  * `season`, `lifecycle_for_item`, `item_in_trend_or_not`, ....
 
### We will finish data optimization at the point feature engineering completed.

In [180]:
# never forget to set the identical touch on the `test` data set!
test.head()

Unnamed: 0,ID,shop_id,item_id
0,0,5,5037
1,1,5,5320
2,2,5,5233
3,3,5,5232
4,4,5,5268
