# Predict Future Sales

You are provided with daily historical sales data. The task is to forecast the total amount of products sold in every shop for the test set. Note that the list of shops and products slightly changes every month. Creating a robust model that can handle such situations is part of the challenge.

### File Description
* sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
* sales_test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.
* sample_submission.csv - a sample submission file in the correct format.
* items.csv - supplemental information about the items/products.
* item_categories.csv - supplemental information about the item categories.
* shops.csv - supplemental information about the shops.


### Data Fields
* ID - an Id that represents a (Shop, Item) tuple within the test set
* shop_id - unique identifier of a shop
* item_id - unique identifier of a product
* item_category_id - unique identifier of item category
* item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
* item_price - current price of an item
* date - date in format dd.mm.yyyy
* date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
* item_name - name of item
* shop_name - name of shop
* item_category_name - name of item category
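
Because the target is a monthly amount while the training rows are daily, the daily counts have to be summed per month, shop and item before modeling. A minimal sketch on a toy frame; the rows and the `item_cnt_month` column name are assumptions for illustration, not part of the data set:

```python
import pandas as pd

# Toy daily sales; the rows are illustrative, not taken from sales_train.csv
daily = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1],
    'shop_id':        [25, 25, 59, 25],
    'item_id':        [2552, 2552, 22154, 2552],
    'item_cnt_day':   [1, -1, 1, 3],
})

# Sum the daily counts for each (month, shop, item) combination
monthly = (daily
           .groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False)['item_cnt_day']
           .sum()
           .rename(columns={'item_cnt_day': 'item_cnt_month'}))
print(monthly)
```

Returns (negative `item_cnt_day` values) cancel out sales in the same month, so a monthly total can be zero or even negative.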

In [45]:
import pandas as pd
import numpy as np

In [46]:
train = pd.read_csv('data/sales_train.csv')
test = pd.read_csv('data/sales_test.csv')
train.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0
2,05.01.2013,0,25,2552,899.0,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.0,1.0


In [47]:
items = pd.read_csv('data/items.csv')
items.head()

Unnamed: 0,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40
4,***КОРОБКА (СТЕКЛО) D,4,40


In [48]:
item_categories = pd.read_csv('data/item_categories.csv')
item_categories.head()

Unnamed: 0,item_category_name,item_category_id
0,PC - Гарнитуры/Наушники,0
1,Аксессуары - PS2,1
2,Аксессуары - PS3,2
3,Аксессуары - PS4,3
4,Аксессуары - PSP,4


In [49]:
shops = pd.read_csv('data/shops.csv')
shops.head()

Unnamed: 0,shop_name,shop_id
0,"!Якутск Орджоникидзе, 56 фран",0
1,"!Якутск ТЦ ""Центральный"" фран",1
2,"Адыгея ТЦ ""Мега""",2
3,"Балашиха ТРК ""Октябрь-Киномир""",3
4,"Волжский ТЦ ""Волга Молл""",4


## Data Optimization for `sales_train.csv`

In [50]:
# real memory usage of the 'train' dataframe: 299.6 MB
train.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
date              object
date_block_num    int64
shop_id           int64
item_id           int64
item_price        float64
item_cnt_day      float64
dtypes: float64(2), int64(3), object(1)
memory usage: 299.6 MB


In [51]:
# check the number of null values in each column
train.isnull().sum()

date              0
date_block_num    0
shop_id           0
item_id           0
item_price        0
item_cnt_day      0
dtype: int64

### Optimize numeric columns

In [52]:
# downcast float columns to the smallest float dtype that fits the actual values

float_cols = train.select_dtypes(include=['float'])
float_cols.columns

Index(['item_price', 'item_cnt_day'], dtype='object')

In [53]:
for fc in float_cols.columns:
    train[fc] = pd.to_numeric(train[fc], downcast='float')

In [54]:
int_cols = train.select_dtypes(include=['int'])
int_cols.columns

Index(['date_block_num', 'shop_id', 'item_id'], dtype='object')

In [55]:
for ic in int_cols.columns:
    train[ic] = pd.to_numeric(train[ic], downcast='integer')

### Convert `id` columns to the `category` dtype
* Condition: (number of unique `id` values / total number of rows) < 0.5
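
The condition can be wrapped in a small helper so that it is applied the same way to every candidate column. This is only a sketch: the function name `is_category_candidate` and the toy values are assumptions.

```python
import pandas as pd

def is_category_candidate(series, threshold=0.5):
    """True when the share of unique values is below the threshold."""
    return series.nunique() / len(series) < threshold

shop_ids = pd.Series([59, 25, 25, 25, 25])  # toy values
print(is_category_candidate(shop_ids))
```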

In [56]:
print(len(train['shop_id'].unique()) / len(train) < .5)
print(len(train['item_id'].unique()) / len(train) < .5)

True
True


In [57]:
train['shop_id'] = train['shop_id'].astype('category')

In [58]:
train['item_id'] = train['item_id'].astype('category')

### `Float` type to `Int` type & optimize column size

In [59]:
len(train['item_cnt_day'].unique())

198

In [60]:
# can every value be converted to int without losing information?
for unq in train['item_cnt_day'].unique():
    if unq % 1 != 0:
        print('not available')
        break
else:
    print('converting from float to int available')

converting from float to int available


In [61]:
train['item_cnt_day'] = pd.to_numeric(train['item_cnt_day'].astype('int'),
                                      downcast='integer')

### Convert the `date` column to the `datetime64` dtype

In [62]:
# the raw dates are dd.mm.yyyy, so state the format explicitly
train['date'] = pd.to_datetime(train['date'], format='%d.%m.%Y')
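
The raw `date` strings look like `02.01.2013` (2 January 2013), which a parser can easily read as month-first. A minimal sketch of parsing such strings with an explicit format; the two sample values follow the layout of the `date` column:

```python
import pandas as pd

# Sample strings in the same dd.mm.yyyy layout as the `date` column
raw = pd.Series(['02.01.2013', '15.01.2013'])

# An explicit format removes any ambiguity between day and month
parsed = pd.to_datetime(raw, format='%d.%m.%Y')
print(parsed.dt.month.tolist())  # both rows fall in January
```

A quick sanity check on the real data is that every row with `date_block_num == 0` should have a date in January 2013.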

### How much memory was saved?

In [63]:
# original
pd.read_csv('data/sales_train.csv').info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
date              object
date_block_num    int64
shop_id           int64
item_id           int64
item_price        float64
item_cnt_day      float64
dtypes: float64(2), int64(3), object(1)
memory usage: 299.6 MB


In [64]:
# optimized
train.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
date              datetime64[ns]
date_block_num    int8
shop_id           category
item_id           category
item_price        float32
item_cnt_day      int16
dtypes: category(2), datetime64[ns](1), float32(1), int16(1), int8(1)
memory usage: 51.2 MB


## Concatenate & combine data


In [65]:
# add 'item_category_id' to train by looking it up in the 'items' dataframe;
# the column is converted to category in the next cell

item_item_cat_id_mapped = dict(zip(items['item_id'].tolist(), items['item_category_id'].tolist()))
train['item_category_id'] = train['item_id'].map(item_item_cat_id_mapped)

In [66]:
train['item_category_id'] = train['item_category_id'].astype('category')  

In [67]:
# add 'item_category_name' by looking it up in the 'item_categories' dataframe;
# the new column is left as object dtype for now

item_cat_id_item_name_mapped = dict(zip(item_categories['item_category_id'].tolist(),
                                       item_categories['item_category_name'].tolist()))
train['item_category_name'] = train['item_category_id'].map(item_cat_id_item_name_mapped)

In [68]:
train.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,item_category_name
0,2013-01-02,0,59,22154,999.0,1,37,Кино - Blu-Ray
1,2013-01-03,0,25,2552,899.0,1,58,Музыка - Винил
2,2013-01-05,0,25,2552,899.0,-1,58,Музыка - Винил
3,2013-01-06,0,25,2554,1709.050049,1,58,Музыка - Винил
4,2013-01-15,0,25,2555,1099.0,1,56,Музыка - CD фирменного производства


In [69]:
train.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 8 columns):
date                  datetime64[ns]
date_block_num        int8
shop_id               category
item_id               category
item_price            float32
item_cnt_day          int16
item_category_id      category
item_category_name    object
dtypes: category(3), datetime64[ns](1), float32(1), int16(1), int8(1), object(1)
memory usage: 398.6 MB
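
The jump from 51.2 MB to 398.6 MB comes from the new `item_category_name` column, which is stored as Python strings (`object`). Since only 84 distinct names occur, storing it as `category` should remove most of that overhead. A small sketch on a toy column, where the two names and the row count are illustrative:

```python
import pandas as pd

# Toy column with a few repeated strings, standing in for item_category_name
names = pd.Series(['Кино - Blu-Ray', 'Музыка - Винил'] * 50_000, dtype='object')

as_object = names.memory_usage(deep=True)
as_category = names.astype('category').memory_usage(deep=True)
print(as_object, as_category)
```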


## Get additional data using the `Google API`
* **Translate the Russian text in `item_category_name` into English using Google Translate**
* **Get the full address of each store by searching for its `shop_name` with Google Search**

In [70]:
# Translate item_category_name into English

item_category_names = train['item_category_name'].unique()
len(item_category_names)

84

In [72]:
from google.oauth2 import service_account
credentials = service_account.Credentials.from_service_account_file(
    'local-path-of-credentials')

In [73]:
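# Import the Google Cloud Translation client library.
# translate.Client is the v2 (Basic) client; google-cloud-translate 2.0+
# exposes it as translate_v2
from google.cloud import translate_v2 as translate

translate_client = translate.Client(credentials=credentials)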

In [132]:
translated_item_category_names = []

# The target language
target = 'en'

for unique_cat_name in item_category_names:
    # Translate the Russian category name into English
    translation = translate_client.translate(
        unique_cat_name,
        target_language=target)

    translated_item_category_names.append(translation['translatedText'])

In [133]:
# Check: were all names translated successfully?
print(translated_item_category_names)

['Cinema - Blu-Ray', 'Music - Vinyl', 'Music - CD of branded production', 'Music - Musical video', 'Music - CD of local production', 'Games - XBOX 360', 'Games - PS3', 'PC Games - Additional Edition', 'PC Games - Standard Edition', 'Games - PSP', 'Cinema - DVD', 'Programs - Home and Office', 'Books - Methodical materials 1С', 'PC Games - Collector&#39;s Edition', 'Games - PSVita', 'Gifts - Development', 'Programs - 1C: Enterprise 8', 'Programs - Teaching', 'Music - MP3', 'Music - Gift edition', 'Accessories - PSP', 'Gifts - Gadgets, robots, sports', 'Books - Audiobooks', 'Game Consoles - XBOX 360', 'Accessories - PS3', 'Accessories - PS4', 'Accessories - PSVita', 'Gifts - Certificates, services', 'Payment cards - PSN', 'Payment cards - Live!', 'Accessories - XBOX 360', 'Cinema - Blu-Ray 3D', 'Games - Accessories for games', 'Game Consoles - PSVita', 'Books - Audiobooks 1C', 'Cinema - Collector&#39;s', 'Gifts - Postcards, stickers', 'Game Consoles - PS3', 'Gifts - Souvenirs', 'Gifts - B

## Text column preprocessing

* Validate that every translation succeeded
  * If not, run the translation again
* Convert to lowercase
* Remove unnecessary characters
* Anything else?
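
One kind of unnecessary character is already visible in the output above: the API returned HTML entities such as `&#39;` instead of an apostrophe. A minimal sketch of decoding them with the standard library before lowercasing; the two names are copied from the translated list:

```python
import html

# Two of the translated names shown above; the API returned HTML entities
names = ['PC Games - Collector&#39;s Edition', 'Cinema - Collector&#39;s']

# Decode the entities, then lowercase
cleaned = [html.unescape(name).lower() for name in names]
print(cleaned)
```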

In [134]:
translated_item_category_names.index('Книги - Путеводители')

63

In [135]:
translation = translate_client.translate(
    translated_item_category_names[63],
    target_language=target)

translation['translatedText']

'Книги - Путеводители'

In [136]:
# the Google Translate API returns this name untranslated,
# so set the English translation manually

translated_item_category_names[63] = 'Books - Travel Guides'

In [137]:
print(translated_item_category_names)

['Cinema - Blu-Ray', 'Music - Vinyl', 'Music - CD of branded production', 'Music - Musical video', 'Music - CD of local production', 'Games - XBOX 360', 'Games - PS3', 'PC Games - Additional Edition', 'PC Games - Standard Edition', 'Games - PSP', 'Cinema - DVD', 'Programs - Home and Office', 'Books - Methodical materials 1С', 'PC Games - Collector&#39;s Edition', 'Games - PSVita', 'Gifts - Development', 'Programs - 1C: Enterprise 8', 'Programs - Teaching', 'Music - MP3', 'Music - Gift edition', 'Accessories - PSP', 'Gifts - Gadgets, robots, sports', 'Books - Audiobooks', 'Game Consoles - XBOX 360', 'Accessories - PS3', 'Accessories - PS4', 'Accessories - PSVita', 'Gifts - Certificates, services', 'Payment cards - PSN', 'Payment cards - Live!', 'Accessories - XBOX 360', 'Cinema - Blu-Ray 3D', 'Games - Accessories for games', 'Game Consoles - PSVita', 'Books - Audiobooks 1C', 'Cinema - Collector&#39;s', 'Gifts - Postcards, stickers', 'Game Consoles - PS3', 'Gifts - Souvenirs', 'Gifts - B

#### We can split each category name on ` - `; however:
* Many words seem to have been translated as `number`, `figure`, `numeral`, etc.
  * We doubt whether these translations are correct.
  * **We treat them as `digital` purchases and keep them in the data.**
* Some categories are split inconsistently
  * As first-level names, `'payment cards'` and `'payment cards (movies, music, games)'` exist separately
  * `official - tickets` and `tickets (figure)` also exist separately as first-level names
    * **We can combine the `payment cards` categories. The `tickets` categories are left separate.**


In [138]:
import re

In [139]:
translated_item_category_names_dict = {}

for word in translated_item_category_names:

    # split 'first level - second level'; some names have no second level
    split = word.lower().split(' - ')

    first_cat = split[0]
    second_cat = split[1] if len(split) > 1 else ''

    # group the second-level names under their first-level name
    translated_item_category_names_dict.setdefault(first_cat, []).append(second_cat)

translated_item_category_names_dict

{'accessories': ['psp', 'ps3', 'ps4', 'psvita', 'xbox 360', 'ps2', 'xbox one'],
 'android games': ['digit'],
 'books': ['methodical materials 1с',
  'audiobooks',
  'audiobooks 1c',
  'business literature',
  'computer literature',
  'number',
  'audiobooks (numbers)',
  'travel guides',
  'fiction',
  'cognitive literature',
  'comics, manga',
  'postcards',
  'artbook, encyclopedia'],
 'cinema': ['blu-ray', 'dvd', 'blu-ray 3d', 'collector&#39;s'],
 'clean carriers (spire)': [''],
 'clean media (piece)': [''],
 'delivery of goods': [''],
 'elements of a food': [''],
 'game consoles': ['xbox 360',
  'psvita',
  'ps3',
  'psp',
  'ps2',
  'ps4',
  'other',
  'xbox one'],
 'games': ['xbox 360',
  'ps3',
  'psp',
  'psvita',
  'accessories for games',
  'ps2',
  'ps4',
  'xbox one'],
 'gifts': ['development',
  'gadgets, robots, sports',
  'certificates, services',
  'postcards, stickers',
  'souvenirs',
  'board games (compact)',
  'board games',
  'soft toys',
  'souvenirs (per sample)'

In [140]:
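# apply the decisions above: unify the 'digital' wording and
# fold the separate 'payment cards (...)' name into 'payment cards - ...'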

for i, word in enumerate(translated_item_category_names):
    
    word_ = re.sub(r'figure|numbers|number|numeral|digital|digit', 'digital', word.lower())
    translated_item_category_names[i] = word_
    
    if word_ == 'payment cards (movies, music, games)':
        translated_item_category_names[i] = 'payment cards - movies, music, games'
    
    print(translated_item_category_names[i])

cinema - blu-ray
music - vinyl
music - cd of branded production
music - musical video
music - cd of local production
games - xbox 360
games - ps3
pc games - additional edition
pc games - standard edition
games - psp
cinema - dvd
programs - home and office
books - methodical materials 1с
pc games - collector&#39;s edition
games - psvita
gifts - development
programs - 1c: enterprise 8
programs - teaching
music - mp3
music - gift edition
accessories - psp
gifts - gadgets, robots, sports
books - audiobooks
game consoles - xbox 360
accessories - ps3
accessories - ps4
accessories - psvita
gifts - certificates, services
payment cards - psn
payment cards - live!
accessories - xbox 360
cinema - blu-ray 3d
games - accessories for games
game consoles - psvita
books - audiobooks 1c
cinema - collector&#39;s
gifts - postcards, stickers
game consoles - ps3
gifts - souvenirs
gifts - board games (compact)
clean media (piece)
clean carriers (spire)
gifts - board games
office
gifts - soft toys
pc - heads

In [141]:
translated_catname_mapped = dict(zip(item_category_names,
                                    translated_item_category_names))

train['item_category_name_eng'] = train['item_category_name'].map(translated_catname_mapped)
train = train.drop(['item_category_name'], axis=1)

In [142]:
train.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,item_category_name_eng
0,2013-01-02,0,59,22154,999.0,1,37,cinema - blu-ray
1,2013-01-03,0,25,2552,899.0,1,58,music - vinyl
2,2013-01-05,0,25,2552,899.0,-1,58,music - vinyl
3,2013-01-06,0,25,2554,1709.050049,1,58,music - vinyl
4,2013-01-15,0,25,2555,1099.0,1,56,music - cd of branded production


In [143]:
# Translate shop_name into English.
# The shop id -> name mapping has not been applied to train yet, so add shop_name first

shop_id_name_mapped = dict(zip(shops['shop_id'].tolist(),
                               shops['shop_name'].tolist()))
train['shop_name'] = train['shop_id'].map(shop_id_name_mapped)

In [144]:
shop_names = train['shop_name'].unique()
len(shop_names)

60

### As above, we gather the translations, validate them, and apply them to the `shop_name` data.

In [163]:
# the first word of each shop name is its city (or sales channel)
shop_loc_names = train['shop_name'].apply(lambda x: x.split(' ')[0])
train['shop_loc_name'] = shop_loc_names

In [164]:
train.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,item_category_name_eng,shop_name,shop_loc_name
0,2013-01-02,0,59,22154,999.0,1,37,cinema - blu-ray,"Ярославль ТЦ ""Альтаир""",Ярославль
1,2013-01-03,0,25,2552,899.0,1,58,music - vinyl,"Москва ТРК ""Атриум""",Москва
2,2013-01-05,0,25,2552,899.0,-1,58,music - vinyl,"Москва ТРК ""Атриум""",Москва
3,2013-01-06,0,25,2554,1709.050049,1,58,music - vinyl,"Москва ТРК ""Атриум""",Москва
4,2013-01-15,0,25,2555,1099.0,1,56,music - cd of branded production,"Москва ТРК ""Атриум""",Москва


In [173]:
shop_loc_names_uniques = train['shop_loc_name'].unique().tolist()
shop_loc_names_translated = []

for unique_shop_loc_name in shop_loc_names_uniques:
    text = unique_shop_loc_name
    target = 'en'
    translation = translate_client.translate(
        text,
        target_language=target)

    shop_loc_names_translated.append(translation['translatedText'].lower())

In [174]:
print(shop_loc_names_uniques)

['Ярославль', 'Москва', 'Курск', 'Красноярск', 'Волжский', 'Воронеж', 'Адыгея', 'Балашиха', '!Якутск', 'Коломна', 'Калуга', 'Жуковский', 'Казань', 'Интернет-магазин', 'Уфа', 'Н.Новгород', 'Чехов', 'Химки', 'Сургут', 'Тюмень', 'СПб', 'РостовНаДону', 'Омск', 'Самара', 'Новосибирск', 'Сергиев', 'Вологда', 'Якутск', 'Цифровой', 'Выездная', 'Томск', 'Мытищи']


In [175]:
print(shop_loc_names_translated)

['yaroslavl', 'moscow', 'kursk', 'krasnoyarsk', 'volzhsky', 'voronezh', 'adygea', 'balashiha', 'yakutsk', 'kolomna', 'kaluga', 'zhukovsky', 'kazan', 'online store', 'ufa', 'n.novgorod', 'chekhov', 'khimki', 'surgut', 'tyumen', 'st. petersburg', 'rostovnadonu', 'omsk', 'samara', 'novosibirsk', 'sergiev', 'vologda', 'yakutsk', 'digital', 'traveling', 'tomsk', 'mytischi']


In [176]:
shoploc_traveling_idx = shop_loc_names_translated.index('traveling')  # convert to 'export'
shoploc_onlinestore_idx = shop_loc_names_translated.index('online store')  # convert to 'online'
shoploc_digital_idx = shop_loc_names_translated.index('digital')  # convert to 'online'

shop_loc_names_translated[shoploc_traveling_idx] = 'export'
shop_loc_names_translated[shoploc_onlinestore_idx] = 'online'
shop_loc_names_translated[shoploc_digital_idx] = 'online'

In [179]:
shop_name_loc_mapped = dict(zip(shop_loc_names_uniques,
                               shop_loc_names_translated))
train['shop_loc_name_eng'] = train['shop_loc_name'].map(shop_name_loc_mapped)
train.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,item_category_name_eng,shop_name,shop_loc_name,shop_loc_name_eng
0,2013-01-02,0,59,22154,999.0,1,37,cinema - blu-ray,"Ярославль ТЦ ""Альтаир""",Ярославль,yaroslavl
1,2013-01-03,0,25,2552,899.0,1,58,music - vinyl,"Москва ТРК ""Атриум""",Москва,moscow
2,2013-01-05,0,25,2552,899.0,-1,58,music - vinyl,"Москва ТРК ""Атриум""",Москва,moscow
3,2013-01-06,0,25,2554,1709.050049,1,58,music - vinyl,"Москва ТРК ""Атриум""",Москва,moscow
4,2013-01-15,0,25,2555,1099.0,1,56,music - cd of branded production,"Москва ТРК ""Атриум""",Москва,moscow


### Get additional information from the name of the city (location) using `web scraping`
SOURCE TO CRAWL : https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Russia_by_population

* HTML PATTERN
```html
<tr>
<th align="center">1</th>
<td><i><b><a href="/wiki/Moscow" title="Moscow">Moscow</a></b></i></td>
<td><span lang="ru" xml:lang="ru">Москва</span></td>
<td><span class="flagicon"><img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Flag_of_Moscow.svg/23px-Flag_of_Moscow.svg.png" width="23" height="15" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Flag_of_Moscow.svg/35px-Flag_of_Moscow.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Flag_of_Moscow.svg/45px-Flag_of_Moscow.svg.png 2x" data-file-width="1200" data-file-height="800" />&#160;</span><a href="/wiki/Moscow" title="Moscow">Moscow (federal city)</a><sup id="cite_ref-3" class="reference"><a href="#cite_note-3">[3]</a></sup></td>
<td><a href="/wiki/Central_Federal_District" title="Central Federal District">Central</a></td>
<td align="right" rowspan="1" style="background-color:#F9F9F9;">12,228,685</td>
<td align="right" rowspan="1" style="background-color:#F9F9F9;">11,503,501</td>
<td align="right" rowspan="1" style="background-color:#F9F9F9;"><span style="display:none" class="sortkey">7000630402866049210♠</span><span style="color:green">+6.30%</span></td>
</tr>
<tr>
<th align="center">2</th>
<td><i><b><a href="/wiki/Saint_Petersburg" title="Saint Petersburg">Saint Petersburg</a></b></i></td>
<td><span lang="ru" xml:lang="ru">Санкт-Петербург</span></td>
<td><span class="flagicon"><img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/f4/Flag_of_Saint_Petersburg_Russia.svg/23px-Flag_of_Saint_Petersburg_Russia.svg.png" width="23" height="15" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/f4/Flag_of_Saint_Petersburg_Russia.svg/35px-Flag_of_Saint_Petersburg_Russia.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/f4/Flag_of_Saint_Petersburg_Russia.svg/45px-Flag_of_Saint_Petersburg_Russia.svg.png 2x" data-file-width="1200" data-file-height="800" />&#160;</span><a href="/wiki/Saint_Petersburg" title="Saint Petersburg">Saint Petersburg (federal city)</a><sup id="cite_ref-4" class="reference"><a href="#cite_note-4">[4]</a></sup></td>
<td><a href="/wiki/Northwestern_Federal_District" title="Northwestern Federal District">Northwest</a></td>
<td align="right" rowspan="1" style="background-color:#F9F9F9;">5,281,579</td>
<td align="right" rowspan="1" style="background-color:#F9F9F9;">4,879,566</td>
<td align="right" rowspan="1" style="background-color:#F9F9F9;"><span style="display:none" class="sortkey">7000823870401588990♠</span><span style="color:green">+8.24%</span></td>
</tr>
```

### What to crawl for each city

* `shop_loc_population`
* `shop_loc_latitude`
* `shop_loc_longitude`

In [183]:
# get the information above for each city except for 'online' and 'export' values.

In [184]:
import requests
from bs4 import BeautifulSoup

In [199]:
print(shop_loc_names_uniques)

['Ярославль', 'Москва', 'Курск', 'Красноярск', 'Волжский', 'Воронеж', 'Адыгея', 'Балашиха', '!Якутск', 'Коломна', 'Калуга', 'Жуковский', 'Казань', 'Интернет-магазин', 'Уфа', 'Н.Новгород', 'Чехов', 'Химки', 'Сургут', 'Тюмень', 'СПб', 'РостовНаДону', 'Омск', 'Самара', 'Новосибирск', 'Сергиев', 'Вологда', 'Якутск', 'Цифровой', 'Выездная', 'Томск', 'Мытищи']


In [201]:
print(shop_name_loc_mapped)

{'Ярославль': 'yaroslavl', 'Москва': 'moscow', 'Курск': 'kursk', 'Красноярск': 'krasnoyarsk', 'Волжский': 'volzhsky', 'Воронеж': 'voronezh', 'Адыгея': 'adygea', 'Балашиха': 'balashiha', '!Якутск': 'yakutsk', 'Коломна': 'kolomna', 'Калуга': 'kaluga', 'Жуковский': 'zhukovsky', 'Казань': 'kazan', 'Интернет-магазин': 'online', 'Уфа': 'ufa', 'Н.Новгород': 'n.novgorod', 'Чехов': 'chekhov', 'Химки': 'khimki', 'Сургут': 'surgut', 'Тюмень': 'tyumen', 'СПб': 'st. petersburg', 'РостовНаДону': 'rostovnadonu', 'Омск': 'omsk', 'Самара': 'samara', 'Новосибирск': 'novosibirsk', 'Сергиев': 'sergiev', 'Вологда': 'vologda', 'Якутск': 'yakutsk', 'Цифровой': 'online', 'Выездная': 'export', 'Томск': 'tomsk', 'Мытищи': 'mytischi'}


In [215]:
url = 'https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Russia_by_population'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'lxml')
# take the first table whose class list includes 'sortable' or 'plainrowheaders'
table_classes = {"class": ["sortable", "plainrowheaders"]}
wikitable = soup.findAll("table", table_classes)[0]

#print(wikitable)

In [246]:
# column names of wiki table

wikitable.findAll("tr")[0].text.split('\n')[1:-1]

['Rank (2017)',
 'City/town',
 'Russian',
 'Federal Subject',
 'Federal District',
 'Population',
 '(2017 estimate)[1]',
 'Population',
 '(2010 Census)[2]',
 'Change']

In [247]:
# values in the first row of wiki table
wikitable.findAll("tr")[1].text.split('\n')[1:-1]

['1',
 'Moscow',
 'Москва',
 '\xa0Moscow (federal city)[3]',
 'Central',
 '12,228,685',
 '11,503,501',
 '7000630402866049210♠+6.30%']

In [254]:
# one list per table column (8 values in each data row)
wikitable_cols = [[] for _ in range(8)]

# skip the header row, then append each cell to its column list
for row in wikitable.findAll("tr")[1:]:
    row_values = row.text.split('\n')[1:-1]

    for j, val in enumerate(row_values):
        wikitable_cols[j].append(val)

In [256]:
wiki_ru_table = pd.DataFrame({'rank_2017' : wikitable_cols[0],
                             'name_en' : wikitable_cols[1],
                             'name_ru' : wikitable_cols[2],
                             'fed_sub' : wikitable_cols[3],
                             'fed_dist' : wikitable_cols[4],
                             'pop_2017_est' : wikitable_cols[5],
                             'pop_2010_cen' : wikitable_cols[6],
                             'change_perc' : wikitable_cols[7]})

wiki_ru_table.head()

Unnamed: 0,change_perc,fed_dist,fed_sub,name_en,name_ru,pop_2010_cen,pop_2017_est,rank_2017
0,7000630402866049210♠+6.30%,Central,Moscow (federal city)[3],Moscow,Москва,11503501,12228685,1
1,7000823870401588990♠+8.24%,Northwest,Saint Petersburg (federal city)[4],Saint Petersburg,Санкт-Петербург,4879566,5281579,2
2,7000876408138671720♠+8.76%,Siberia,Novosibirsk Oblast,Novosibirsk,Новосибирск,1473754,1602915,3
3,7000783406382707600♠+7.83%,Ural,Sverdlovsk Oblast,Yekaterinburg,Екатеринбург,1349772,1455514,4
4,6999883322578659040♠+0.88%,Volga,Nizhny Novgorod Oblast,Nizhny Novgorod,Нижний Новгород,1250619,1261666,5
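
The scraped table can be joined back to the shop locations through the Russian city name. A minimal sketch, assuming the population cells still carry thousands separators as in the raw row text; the `shop_locs` series is a toy stand-in for `shop_loc_name`:

```python
import pandas as pd

# Two rows with values from the scraped table above
wiki = pd.DataFrame({
    'name_ru': ['Москва', 'Новосибирск'],
    'pop_2017_est': ['12,228,685', '1,602,915'],
})

# Strip the thousands separators and convert to integers
wiki['pop_2017_est'] = wiki['pop_2017_est'].str.replace(',', '', regex=False).astype(int)

# Map each Russian location name to its population; names missing from
# the table (for example 'СПб' or 'Интернет-магазин') become NaN
shop_locs = pd.Series(['Москва', 'Новосибирск', 'СПб'])
shop_loc_population = shop_locs.map(wiki.set_index('name_ru')['pop_2017_est'])
print(shop_loc_population)
```

Abbreviated names such as `СПб` or `Н.Новгород` would need an alias table before they can match the full names used on Wikipedia.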


In [276]:
'6999883322578659040♠+0.88%'[-5:-1]

'0.88'
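
The fixed slice `[-5:-1]` works for this cell but would cut off the sign and any two-digit percentage. A sketch of a more tolerant parser that splits on the `♠` sort-key separator instead; the function name `parse_change` is an assumption:

```python
def parse_change(cell):
    """Return the percentage change in a 'sortkey♠+6.30%' cell as a float."""
    # Keep the visible part after the sort-key separator, then drop the '%'
    visible = cell.split('♠')[-1]
    # Wikipedia may typeset negative values with a Unicode minus sign
    return float(visible.rstrip('%').replace('−', '-'))

print(parse_change('6999883322578659040♠+0.88%'))
print(parse_change('7000630402866049210♠+6.30%'))
```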

In [271]:
train.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,item_category_name_eng,shop_name,shop_loc_name,shop_loc_name_eng
0,2013-01-02,0,59,22154,999.0,1,37,cinema - blu-ray,"Ярославль ТЦ ""Альтаир""",Ярославль,yaroslavl
1,2013-01-03,0,25,2552,899.0,1,58,music - vinyl,"Москва ТРК ""Атриум""",Москва,moscow
2,2013-01-05,0,25,2552,899.0,-1,58,music - vinyl,"Москва ТРК ""Атриум""",Москва,moscow
3,2013-01-06,0,25,2554,1709.050049,1,58,music - vinyl,"Москва ТРК ""Атриум""",Москва,moscow
4,2013-01-15,0,25,2555,1099.0,1,56,music - cd of branded production,"Москва ТРК ""Атриум""",Москва,moscow


In [185]:
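shoploc_pop, shoploc_areasize, shoploc_lat, shoploc_lon = [], [], [], []

# CSS selectors for each field; the infobox row positions (20, 21) are
# assumptions and vary from page to page
selectors = [
    (shoploc_lat, 'span.geo-default > span > span.latitude'),
    (shoploc_lon, 'span.geo-default > span > span.longitude'),
    (shoploc_areasize, 'table.infobox.geography.vcard > tbody > tr:nth-child(20) > td'),
    (shoploc_pop, 'table.infobox.geography.vcard > tbody > tr:nth-child(21) > td'),
]

for shoploc in shop_loc_names_translated:
    print(shoploc)

    # 'online' and 'export' are not cities, so nothing is fetched for them
    if shoploc in ('online', 'export'):
        page = BeautifulSoup('', 'lxml')
    else:
        url = 'https://en.wikipedia.org/wiki/' + shoploc
        page = BeautifulSoup(requests.get(url).content, 'lxml')

    # store the text of the first match, or None when nothing matches
    for values, selector in selectors:
        node = page.select_one(selector)
        values.append(node.text.strip() if node is not None else None)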

yaroslavl
moscow
kursk
krasnoyarsk
volzhsky
voronezh
adygea
balashiha
yakutsk
kolomna
kaluga
zhukovsky
kazan
online
ufa
n.novgorod
chekhov
khimki
surgut
tyumen
st. petersburg
rostovnadonu
omsk
samara
novosibirsk
sergiev
vologda
yakutsk
online
export
tomsk
mytischi


### Back to `category_name`: split the category name into two levels

In [42]:
train['item_category_name_eng'].unique()

array(['Cinema - Blu-Ray', 'Music - Vinyl',
       'Music - CD of branded production', 'Music - Musical video',
       'Music - CD of local production', 'Games - XBOX 360', 'Games - PS3',
       'PC Games - Additional Edition', 'PC Games - Standard Edition',
       'Games - PSP', 'Cinema - DVD', 'Programs - Home and Office',
       'Books - Methodical materials 1С',
       'PC Games - Collector&#39;s Edition', 'Games - PSVita',
       'Gifts - Development', 'Programs - 1C: Enterprise 8',
       'Programs - Teaching', 'Music - MP3', 'Music - Gift edition',
       'Accessories - PSP', 'Gifts - Gadgets, robots, sports',
       'Books - Audiobooks', 'Game Consoles - XBOX 360',
       'Accessories - PS3', 'Accessories - PS4', 'Accessories - PSVita',
       'Gifts - Certificates, services', 'Payment cards - PSN',
       'Payment cards - Live!', 'Accessories - XBOX 360',
       'Cinema - Blu-Ray 3D', 'Games - Accessories for games',
       'Game Consoles - PSVita', 'Books - Audiobooks 1C',
  

In [43]:
missed_cat_name_map = {'Книги - Путеводители' : 'Books - Travel Guides'}

# Series.map() with a partial dict would turn every unmapped name into NaN;
# replace() changes only the name that was left untranslated
train['item_category_name_eng'] = train['item_category_name_eng'].replace(missed_cat_name_map)

In [40]:

# unique values in the first-level category

train['item_category_name_eng'].apply(lambda x: x.split(' - ')[0]).unique()

array(['Cinema', 'Music', 'Games', 'PC Games', 'Programs', 'Books',
       'Gifts', 'Accessories', 'Game Consoles', 'Payment cards',
       'Clean media (piece)', 'Clean carriers (spire)', 'Office', 'PC',
       'Elements of a food', 'Delivery of goods', 'Книги',
       'Payment cards (Movies, Music, Games)', 'Movies',
       'Tickets (figure)', 'Android games', 'MAC Games', 'Official',
       'Payment Cards'], dtype=object)

In [41]:
# unique values in the second-level category
train['item_category_name_eng'].apply(lambda x: x.split(' - ')[1]).unique()

IndexError: list index out of range
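
The `IndexError` arises because names such as `Office` contain no ` - ` separator, so `split` returns a single element. One way around it, sketched here on a few of the names listed above, is the vectorized `str.split` with `expand=True`; the column names `cat_level_1` and `cat_level_2` are assumptions:

```python
import pandas as pd

# A few of the category names listed above; some have no ' - ' separator
cats = pd.Series(['Cinema - Blu-Ray', 'Office', 'Clean media (piece)'])

# n=1 splits on the first separator only; expand=True returns two columns,
# with a missing value where there is no second level
levels = cats.str.split(' - ', n=1, expand=True)
levels.columns = ['cat_level_1', 'cat_level_2']
levels['cat_level_2'] = levels['cat_level_2'].fillna('')
print(levels)
```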

### Drop unnecessary columns

### Process the `sales_test.csv` dataset in the same way for the modeling step

## Exploratory Data Analysis (EDA)
* ~~~
* ~~~


# Feature Engineering
* `item_category_name_eng`: the name can be split into two hierarchical category levels.
* `shop_name_eng`: we can extract additional information about the area where the shop is located.
* Create columns such as:
  * `season`, `lifecycle_for_item`, `item_in_trend_or_not`, ...
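
As an illustration of the last bullet, a `season` column could be derived from the month of the sale date. This is only a sketch: the month-to-season mapping (meteorological seasons) and the toy dates are assumptions.

```python
import pandas as pd

# Toy dates; in the notebook this would be the parsed `date` column
dates = pd.Series(pd.to_datetime(['2013-01-15', '2013-04-02', '2013-07-20', '2013-10-05']))

# Map each month to a meteorological season (an assumed definition)
month_to_season = {12: 'winter', 1: 'winter', 2: 'winter',
                   3: 'spring', 4: 'spring', 5: 'spring',
                   6: 'summer', 7: 'summer', 8: 'summer',
                   9: 'autumn', 10: 'autumn', 11: 'autumn'}
season = dates.dt.month.map(month_to_season)
print(season.tolist())
```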
 
### We will finish data optimization once feature engineering is complete.

In [180]:
# remember to apply the identical preprocessing to the `test` data set!
test.head()

Unnamed: 0,ID,shop_id,item_id
0,0,5,5037
1,1,5,5320
2,2,5,5233
3,3,5,5232
4,4,5,5268


### Final data optimization

## Export datafile as `csv` and get started with modeling for predictions!