# Predict Future Sales
This challenge serves as the final project for the ["How to win a data science competition"](https://www.coursera.org/learn/competitive-data-science/home/welcome) Coursera course.

In this competition you will work with a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms - **[1C Company](http://1c.ru/eng/title.htm)**. 

We are asking you to predict total sales for every product and store in the next month. By solving this competition you will be able to apply and enhance your data science skills.

You are provided with daily historical sales data. The task is to forecast the total number of units of each product sold in each shop in the test set. Note that the list of shops and products changes slightly every month. Creating a robust model that can handle such situations is part of the challenge.
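
One way to gauge how much the shop and product lists drift is to compare the (shop, item) pairs in the test set with those seen in training. The sketch below is only an illustration on made-up frames; `toy_train` and `toy_test` are hypothetical stand-ins, not the competition files.

```python
import pandas as pd

# Toy stand-ins for sales_train.csv and test.csv (values are made up)
toy_train = pd.DataFrame({"shop_id": [5, 5, 45], "item_id": [5037, 5320, 18454]})
toy_test = pd.DataFrame({"shop_id": [5, 5, 45], "item_id": [5037, 5233, 18454]})

# Left-merge with an indicator column to flag test pairs never seen in training
seen = toy_train.drop_duplicates()
flagged = toy_test.merge(seen, on=["shop_id", "item_id"], how="left", indicator=True)
unseen_pairs = flagged[flagged["_merge"] == "left_only"][["shop_id", "item_id"]]

# Items that do not appear in training at all, in any shop
new_items = sorted(set(toy_test["item_id"]) - set(toy_train["item_id"]))

print(unseen_pairs)
print(new_items)
```

Pairs flagged this way have no sales history, so a model has to fall back on something else for them, for example a zero forecast.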

File descriptions
-----------------

*   **sales\_train.csv** - the training set. Daily historical data from January 2013 to October 2015.
*   **test.csv** - the test set. You need to forecast the sales for these shops and products for November 2015.
*   **sample\_submission.csv** - a sample submission file in the correct format.
*   **items.csv** - supplemental information about the items/products.
*   **item\_categories.csv** - supplemental information about the item categories.
*   **shops.csv** - supplemental information about the shops.

Data fields
-----------

*   **ID** - an Id that represents a (Shop, Item) tuple within the test set
*   **shop\_id** - unique identifier of a shop
*   **item\_id** - unique identifier of a product
*   **item\_category\_id** - unique identifier of item category
*   **item\_cnt\_day** - number of products sold. You are predicting a monthly amount of this measure
*   **item\_price** - current price of an item
*   **date** - date in format dd.mm.yyyy
*   **date\_block\_num** - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1, ..., October 2015 is 33
*   **item\_name** - name of item
*   **shop\_name** - name of shop
*   **item\_category\_name** - name of item category
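
The month numbering used by **date\_block\_num** can be reproduced from a date string. This is a minimal sketch, assuming day.month.year strings such as `02.01.2013`; the helper name `date_block_num` is hypothetical and not part of the dataset tooling.

```python
from datetime import datetime


def date_block_num(date_str: str) -> int:
    """Map a dd.mm.yyyy date string to the month index (January 2013 -> 0)."""
    d = datetime.strptime(date_str, "%d.%m.%Y")
    return (d.year - 2013) * 12 + (d.month - 1)


print(date_block_num("02.01.2013"))  # 0
print(date_block_num("22.10.2015"))  # 33
print(date_block_num("01.11.2015"))  # 34, the month to be forecast
```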

This dataset is permitted to be used for any purpose, including commercial use.

Link: https://www.kaggle.com/competitions/competitive-data-science-predict-future-sales/overview

In [30]:
import pandas as pd
import numpy as np
from catboost import CatBoostRegressor, Pool
from tqdm.notebook import tqdm


In [2]:
%load_ext nb_black


In [3]:
!ls ../data/competitive-data-science-predict-future-sales

item_categories.csv  sales_train.csv	    shops.csv
items.csv	     sample_submission.csv  test.csv



In [4]:
item_categories_df = pd.read_csv(
    "../data/competitive-data-science-predict-future-sales/item_categories.csv"
).set_index("item_category_id")
item_categories_df

item_category_id,item_category_name
0,PC - Гарнитуры/Наушники
1,Аксессуары - PS2
2,Аксессуары - PS3
3,Аксессуары - PS4
4,Аксессуары - PSP
...,...
79,Служебные
80,Служебные - Билеты
81,Чистые носители (шпиль)
82,Чистые носители (штучные)



In [5]:
items_df = pd.read_csv(
    "../data/competitive-data-science-predict-future-sales/items.csv"
).set_index("item_id")
items_df

item_id,item_name,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,40
1,!ABBYY FineReader 12 Professional Edition Full...,76
2,***В ЛУЧАХ СЛАВЫ (UNV) D,40
3,***ГОЛУБАЯ ВОЛНА (Univ) D,40
4,***КОРОБКА (СТЕКЛО) D,40
...,...,...
22165,"Ядерный титбит 2 [PC, Цифровая версия]",31
22166,Язык запросов 1С:Предприятия [Цифровая версия],54
22167,Язык запросов 1С:Предприятия 8 (+CD). Хрустале...,49
22168,Яйцо для Little Inu,62



In [6]:
sales_train_df = pd.read_csv(
    "../data/competitive-data-science-predict-future-sales/sales_train.csv"
)
sales_train_df

,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.00,1.0
1,03.01.2013,0,25,2552,899.00,1.0
2,05.01.2013,0,25,2552,899.00,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.00,1.0
...,...,...,...,...,...,...
2935844,10.10.2015,33,25,7409,299.00,1.0
2935845,09.10.2015,33,25,7460,299.00,1.0
2935846,14.10.2015,33,25,7459,349.00,1.0
2935847,22.10.2015,33,25,7440,299.00,1.0



In [7]:
shops_df = pd.read_csv(
    "../data/competitive-data-science-predict-future-sales/shops.csv"
).set_index("shop_id")
shops_df

shop_id,shop_name
0,"!Якутск Орджоникидзе, 56 фран"
1,"!Якутск ТЦ ""Центральный"" фран"
2,"Адыгея ТЦ ""Мега"""
3,"Балашиха ТРК ""Октябрь-Киномир"""
4,"Волжский ТЦ ""Волга Молл"""
5,"Вологда ТРЦ ""Мармелад"""
6,"Воронеж (Плехановская, 13)"
7,"Воронеж ТРЦ ""Максимир"""
8,"Воронеж ТРЦ Сити-Парк ""Град"""
9,Выездная Торговля



In [8]:
sample_submission_df = pd.read_csv(
    "../data/competitive-data-science-predict-future-sales/sample_submission.csv"
)
sample_submission_df

,ID,item_cnt_month
0,0,0.5
1,1,0.5
2,2,0.5
3,3,0.5
4,4,0.5
...,...,...
214195,214195,0.5
214196,214196,0.5
214197,214197,0.5
214198,214198,0.5



In [9]:
test_df = pd.read_csv("../data/competitive-data-science-predict-future-sales/test.csv")
test_df

,ID,shop_id,item_id
0,0,5,5037
1,1,5,5320
2,2,5,5233
3,3,5,5232
4,4,5,5268
...,...,...,...
214195,214195,45,18454
214196,214196,45,16188
214197,214197,45,15757
214198,214198,45,19648



# Data preparation

In [10]:
train_df = sales_train_df[
    ["date_block_num", "shop_id", "item_id", "item_cnt_day"]
].copy()
train_df = (
    train_df.groupby(["date_block_num", "shop_id", "item_id"]).sum().reset_index()
)
train_df

,date_block_num,shop_id,item_id,item_cnt_day
0,0,0,32,6.0
1,0,0,33,3.0
2,0,0,35,1.0
3,0,0,43,1.0
4,0,0,51,2.0
...,...,...,...,...
1609119,33,59,22087,6.0
1609120,33,59,22088,2.0
1609121,33,59,22091,1.0
1609122,33,59,22100,1.0
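
The aggregation above sums daily counts per (month, shop, item). Note that the daily data contains negative counts (presumably returns, as in the `-1.0` row for item 2552 shown earlier), so a month can net out to zero. A small sketch on made-up rows, with `toy_daily` as a hypothetical stand-in for the real frame:

```python
import pandas as pd

# Made-up daily rows: item 2552 is sold once and then returned in month 0
toy_daily = pd.DataFrame(
    {
        "date_block_num": [0, 0, 0, 1],
        "shop_id": [25, 25, 25, 25],
        "item_id": [2552, 2552, 2554, 2552],
        "item_cnt_day": [1.0, -1.0, 1.0, 3.0],
    }
)

# Sum daily counts into one row per (month, shop, item)
toy_monthly = toy_daily.groupby(
    ["date_block_num", "shop_id", "item_id"], as_index=False
)["item_cnt_day"].sum()

print(toy_monthly)
```

Here item 2552 ends month 0 with a total of 0.0, which the later training loop would treat the same as a pair with no sales at all.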



In [11]:
# Build the full (month, item, shop) grid in the same order as nested loops
# would, then fill in the observed monthly sales. Combinations with no
# recorded sales get 0.
full_index = pd.MultiIndex.from_product(
    [range(34), items_df.index.unique(), shops_df.index.unique()],
    names=["date_block_num", "item_id", "shop_id"],
)

df1 = (
    train_df.set_index(["date_block_num", "item_id", "shop_id"])["item_cnt_day"]
    .reindex(full_index, fill_value=0.0)
    .reset_index()[["shop_id", "item_id", "date_block_num", "item_cnt_day"]]
)
df1

,shop_id,item_id,date_block_num,item_cnt_day
0,0,0,0,0.0
1,1,0,0,0.0
2,2,0,0,0.0
3,3,0,0,0.0
4,4,0,0,0.0
...,...,...,...,...
45226795,55,22169,33,0.0
45226796,56,22169,33,0.0
45226797,57,22169,33,0.0
45226798,58,22169,33,0.0



In [12]:
X_test = test_df.drop("ID", axis=1).copy()
X_test["date_block_num"] = 34
X_test["item_cnt_day"] = np.nan
X_test

,shop_id,item_id,date_block_num,item_cnt_day
0,5,5037,34,
1,5,5320,34,
2,5,5233,34,
3,5,5232,34,
4,5,5268,34,
...,...,...,...,...
214195,45,18454,34,
214196,45,16188,34,
214197,45,15757,34,
214198,45,19648,34,



# Model preparation and training

In [32]:
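submission = []

# Fit one small model per (shop, item) pair in the test set, using the month
# index as the only feature and the monthly sales count as the target.
for idx, row in tqdm(X_test.iterrows(), total=len(X_test)):
    sub_df = df1[
        (df1["shop_id"] == row["shop_id"]) & (df1["item_id"] == row["item_id"])
    ]
    X, y = sub_df[["date_block_num"]], sub_df[["item_cnt_day"]]

    # Pairs whose sales sum to 0 keep a prediction of 0 and skip training.
    y_pred = 0.0
    if sub_df["item_cnt_day"].sum() != 0:
        model = CatBoostRegressor(logging_level="Silent")
        model.fit(Pool(X, y))

        # Predict for month 34 (November 2015).
        y_pred = model.predict([34])

    # idx matches the test ID because test_df has a default 0..N-1 index.
    submission.append([idx, y_pred])

submission_df = pd.DataFrame(submission, columns=["ID", "item_cnt_month"])
submission_df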


,ID,item_cnt_month
0,0,0.018254
1,1,0.000000
2,2,1.003325
3,3,-0.000079
4,4,0.000000
...,...,...
214195,214195,0.991535
214196,214196,0.000000
214197,214197,-0.001327
214198,214198,0.000000
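
The loop above rescans the whole grid with a boolean mask for every test row, which is likely to dominate the run time. A possible alternative, sketched here on made-up data, is to keep only the needed pairs and split the frame once with `groupby`. The names `toy_grid`, `toy_pairs` and `histories` are hypothetical, and this is not the approach the notebook actually ran.

```python
import pandas as pd

# Made-up stand-in for the full month grid: two (shop, item) pairs, three months
toy_grid = pd.DataFrame(
    {
        "shop_id": [5, 5, 5, 45, 45, 45],
        "item_id": [5037, 5037, 5037, 18454, 18454, 18454],
        "date_block_num": [0, 1, 2, 0, 1, 2],
        "item_cnt_day": [0.0, 2.0, 1.0, 0.0, 0.0, 0.0],
    }
)
# Made-up stand-in for the test pairs
toy_pairs = pd.DataFrame({"shop_id": [5, 45], "item_id": [5037, 18454]})

# Keep only rows for the pairs we need, then split once by (shop, item)
needed = toy_grid.merge(toy_pairs, on=["shop_id", "item_id"], how="inner")
histories = {key: group for key, group in needed.groupby(["shop_id", "item_id"])}

for row in toy_pairs.itertuples(index=False):
    sub_df = histories[(row.shop_id, row.item_id)]
    has_sales = sub_df["item_cnt_day"].sum() != 0
    print(row.shop_id, row.item_id, len(sub_df), has_sales)
```

Each lookup in `histories` is then a dictionary access rather than a pass over every row of the grid.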



In [35]:
submission_df["item_cnt_month"] = submission_df["item_cnt_month"].clip(0.0, 20.0)
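
The `clip(0.0, 20.0)` call bounds every prediction to the range [0, 20], so small negative outputs such as `-0.001327` become 0. A minimal illustration follows; the first three values are taken from the output above and `35.0` is made up to show the upper bound.

```python
import pandas as pd

# Three predictions from the output above, plus a made-up large value
preds = pd.Series([-0.001327, 0.018254, 1.003325, 35.0])

# Values below 0 become 0.0, values above 20 become 20.0, others are unchanged
clipped = preds.clip(0.0, 20.0)

print(clipped.tolist())  # [0.0, 0.018254, 1.003325, 20.0]
```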


In [36]:
submission_df.to_csv(
    "../data/competitive-data-science-predict-future-sales/submission.csv", index=False
)
