### Data Details

Data fields
* **ID** - an Id that represents a (Shop, Item) tuple within the test set
* **shop_id** - unique identifier of a shop
* **item_id** - unique identifier of a product
* **item_category_id** - unique identifier of item category
* **item_cnt_day** - number of products sold. You are predicting a monthly amount of this measure
* **item_price** - current price of an item
* **date** - date in format dd.mm.yyyy
* **date_block_num** - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1, ..., October 2015 is 33
* **item_name** - name of item
* **shop_name** - name of shop
* **item_category_name** - name of item category
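
The `date_block_num` convention can be reproduced from a calendar date. The following is a minimal sketch, assuming the index is simply the number of months elapsed since January 2013 and that dates use the `%d.%m.%Y` layout parsed later in this notebook. The helper name `date_block_num_from_date` is hypothetical and not part of the dataset.

```python
from datetime import datetime


def date_block_num_from_date(date_str, fmt="%d.%m.%Y"):
    """Hypothetical helper: months elapsed since January 2013."""
    parsed = datetime.strptime(date_str, fmt)
    return (parsed.year - 2013) * 12 + (parsed.month - 1)


print(date_block_num_from_date("02.01.2013"))  # January 2013 -> 0
print(date_block_num_from_date("15.10.2015"))  # October 2015 -> 33
```

The two printed values match the endpoints given in the field description.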



In [2]:

import pandas as pd
import numpy as np
from datetime import datetime, date



In [3]:
train = pd.read_csv("data/sales_train.csv")
item_cat_df = pd.read_csv('data/items.csv')

# Parse the dd.mm.yyyy date strings once, then add zero-padded month and four-digit year columns (as strings)
parsed_dates = pd.to_datetime(train['date'], format='%d.%m.%Y')
train['month'] = parsed_dates.dt.strftime('%m')
train['year'] = parsed_dates.dt.strftime('%Y')

# train

# Sum the daily counts into a monthly total for each (month, shop, item) group,
# keep only the columns needed downstream,
# and drop the rows that are now duplicates (one row per group remains).
train['item_cnt_month'] = train.groupby(['date_block_num', 'shop_id', 'item_id', 'month', 'year'])['item_cnt_day'].transform('sum')
train = train[['date_block_num', 'shop_id', 'item_id', 'month', 'year', 'item_cnt_month']]
train = train.drop_duplicates()

# train.info()
train = pd.merge(train, item_cat_df[['item_id', 'item_category_id']], on='item_id')
# train.info()


train.to_csv("data/train_input_data.csv", index = False)
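
To see what the aggregation step produces, here is a minimal sketch on a made-up three-row frame (the values are illustrative and not taken from `sales_train.csv`). It assumes that a direct `groupby(...).sum()` is an acceptable alternative to the `transform` plus `drop_duplicates` pattern, and computes the monthly totals both ways for comparison.

```python
import pandas as pd

# Toy daily sales; values are invented for illustration only.
daily = pd.DataFrame({
    "date_block_num": [0, 0, 1],
    "shop_id": [5, 5, 5],
    "item_id": [5037, 5037, 5037],
    "item_cnt_day": [1.0, 2.0, 4.0],
})
keys = ["date_block_num", "shop_id", "item_id"]

# Pattern used in the notebook: broadcast the group sum to every row, then de-duplicate.
via_transform = daily.copy()
via_transform["item_cnt_month"] = via_transform.groupby(keys)["item_cnt_day"].transform("sum")
via_transform = via_transform[keys + ["item_cnt_month"]].drop_duplicates().reset_index(drop=True)

# Alternative: aggregate directly to one row per group.
via_agg = (
    daily.groupby(keys, as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)

print(via_agg)
```

Both frames hold one row per (month, shop, item) group with totals of 3.0 and 4.0. The direct aggregation avoids creating the duplicated rows in the first place, which may matter on the full dataset.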



In [10]:
test = pd.read_csv("data/test.csv")
test.head()


   ID  shop_id  item_id
0   0        5     5037
1   1        5     5320
2   2        5     5233
3   3        5     5232
4   4        5     5268