# Predict Future Sales - Baseline
## Final project for "How to win a data science competition" Coursera course
https://www.kaggle.com/c/competitive-data-science-predict-future-sales/data  
>Student: Rafael Caneiro de Oliveira  
>Email: rafael.caneiro@gmail.com  
>Date: 04/08/2020

## Load the data

In [1]:
import numpy as np
import pandas as pd

from pathlib import Path

# Project root is the parent of the notebook's working directory.
PATH = Path.cwd().parent
DATA_PATH = PATH / "data" / "raw"

In [2]:
sales_train_df = pd.read_csv(Path(DATA_PATH,"sales_train.csv"))
test_df = pd.read_csv(Path(DATA_PATH,"test.csv"))
items_df = pd.read_csv(Path(DATA_PATH,"items.csv"))

train_df = pd.merge(sales_train_df,
                    items_df[["item_id", "item_category_id"]],
                    how="inner",
                    on="item_id")

train_df["date"] = pd.to_datetime(train_df["date"], format="%d.%m.%Y")

test_df = pd.merge(test_df,
                   items_df[["item_id", "item_category_id"]],
                   how="inner",
                   on="item_id")

# The test set is the month right after the last training block (33).
test_df["date_block_num"] = 34

In [3]:
print(train_df.shape)
train_df.head()

(2935849, 7)


Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id
0,2013-01-02,0,59,22154,999.0,1.0,37
1,2013-01-23,0,24,22154,999.0,1.0,37
2,2013-01-20,0,27,22154,999.0,1.0,37
3,2013-01-02,0,25,22154,999.0,1.0,37
4,2013-01-03,0,25,22154,999.0,1.0,37


In [4]:
print(test_df.shape)
test_df.head()

(214200, 5)


Unnamed: 0,ID,shop_id,item_id,item_category_id,date_block_num
0,0,5,5037,19,34
1,5100,4,5037,19,34
2,10200,6,5037,19,34
3,15300,3,5037,19,34
4,20400,2,5037,19,34


## Basic Stats Reminder

In [5]:
train_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
date_block_num,2935849.0,14.569911,9.422988,0.0,7.0,14.0,23.0,33.0
shop_id,2935849.0,33.001728,16.226973,0.0,22.0,31.0,47.0,59.0
item_id,2935849.0,10197.227057,6324.297354,0.0,4476.0,9343.0,15684.0,22169.0
item_price,2935849.0,890.853233,1729.799631,-1.0,249.0,399.0,999.0,307980.0
item_cnt_day,2935849.0,1.242641,2.618834,-22.0,1.0,1.0,1.0,2169.0
item_category_id,2935849.0,40.001383,17.100759,0.0,28.0,40.0,55.0,83.0


- Investigate the value -1 in `item_price`.
- Try NLP on the item and category names, e.g. a Russian Word2Vec model.
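
The first note can be followed up with a simple filter. The sketch below uses a small made-up frame in place of `train_df` (the values are illustrative, not taken from the competition data). It isolates non-positive prices and shows one possible repair: replacing them with the median valid price of the same item. Whether to repair or drop such rows is a judgment call.

```python
import pandas as pd

# Toy stand-in for train_df; values are illustrative only.
toy = pd.DataFrame({
    "shop_id": [1, 1, 2, 2],
    "item_id": [10, 10, 10, 20],
    "item_price": [100.0, -1.0, 120.0, 50.0],
})

# Rows whose price is not strictly positive are suspicious.
bad_price = toy["item_price"] <= 0
print(toy[bad_price])

# One possible repair: median of the valid prices of the same item.
valid_median = (
    toy.loc[~bad_price]
    .groupby("item_id")["item_price"]
    .median()
)
repaired = toy.copy()
repaired.loc[bad_price, "item_price"] = (
    repaired.loc[bad_price, "item_id"].map(valid_median)
)
```

On the real data the same mask would be built from `train_df["item_price"]`.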

## Outliers
https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba

In [6]:
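cols = ["item_price", "item_cnt_day"]

# Start with no rows flagged; float dtype matches the original NaN-then-fillna result.
train_df["is_outlier"] = 0.0
for col in cols:
    # Flag rows above the 99th percentile of each column.
    upperbound = np.percentile(train_df[col], 99)
    train_df.loc[train_df[col] > upperbound, "is_outlier"] = 1.0

train_df.is_outlier.value_counts()

# Keep only the rows that were not flagged.
train_df = train_df[train_df.is_outlier == 0]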
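
The percentile rule above can also be written as a single boolean mask, which avoids the helper column. This is a minimal sketch assuming the same 99th-percentile upper bound. `flag_upper_outliers` is a hypothetical helper name, and the toy data is made up for illustration.

```python
import numpy as np
import pandas as pd

def flag_upper_outliers(df, cols, q=99):
    """Return a boolean Series that is True where any column in `cols`
    exceeds its own q-th percentile."""
    mask = pd.Series(False, index=df.index)
    for col in cols:
        upperbound = np.percentile(df[col], q)
        mask |= df[col] > upperbound
    return mask

# Toy data: 100 ordinary rows plus one extreme value per column.
toy = pd.DataFrame({
    "item_price": list(range(1, 101)) + [10_000],
    "item_cnt_day": [1] * 100 + [500],
})
mask = flag_upper_outliers(toy, ["item_price", "item_cnt_day"])
clean = toy[~mask]
```

Like the cell above, this only trims the upper tail. Negative prices and negative daily counts are left for a separate check.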

## Model

In [7]:
train_df.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,is_outlier
0,2013-01-02,0,59,22154,999.0,1.0,37,0.0
1,2013-01-23,0,24,22154,999.0,1.0,37,0.0
2,2013-01-20,0,27,22154,999.0,1.0,37,0.0
3,2013-01-02,0,25,22154,999.0,1.0,37,0.0
4,2013-01-03,0,25,22154,999.0,1.0,37,0.0


In [8]:
block_33_sales = (
    train_df[(train_df.date_block_num == 33) & (train_df.item_cnt_day >= 0)]
    .groupby(["shop_id", "item_id"])["item_cnt_day"]
    .sum()
)

block_33_sales.head()

shop_id  item_id
2        31         1.0
         486        3.0
         787        1.0
         794        1.0
         968        1.0
Name: item_cnt_day, dtype: float64

In [9]:
test_df.head()

Unnamed: 0,ID,shop_id,item_id,item_category_id,date_block_num
0,0,5,5037,19,34
1,5100,4,5037,19,34
2,10200,6,5037,19,34
3,15300,3,5037,19,34
4,20400,2,5037,19,34


In [10]:
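# Left merge keeps every test row; pairs with no sales in block 33 get NaN.
submission = test_df.merge(block_33_sales,
                           how="left",
                           on=["shop_id", "item_id"])

# No sales in the previous month is treated as a forecast of zero.
submission.fillna(0, inplace=True)

# Cap the forecast at 20 items per month.
submission.loc[submission.item_cnt_day>20, "item_cnt_day"] = 20
submission.rename({"item_cnt_day":"item_cnt_month"},
                  inplace=True,
                  axis=1)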

print(test_df.shape)
print(submission.shape)
submission.head()

(214200, 5)
(214200, 6)


Unnamed: 0,ID,shop_id,item_id,item_category_id,date_block_num,item_cnt_month
0,0,5,5037,19,34,0.0
1,5100,4,5037,19,34,0.0
2,10200,6,5037,19,34,1.0
3,15300,3,5037,19,34,0.0
4,20400,2,5037,19,34,0.0
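
Since the daily submission allowance is limited, the previous-month baseline can also be scored locally by forecasting block 33 from block 32. The sketch below runs on made-up toy data. `monthly_sales` is a hypothetical helper, and the RMSE on targets clipped to [0, 20] is assumed here to mirror the competition metric.

```python
import numpy as np
import pandas as pd

def monthly_sales(df, block):
    """Sum daily counts per (shop_id, item_id) for one date_block_num,
    clipped to the [0, 20] range."""
    month = df[df["date_block_num"] == block]
    return (
        month.groupby(["shop_id", "item_id"])["item_cnt_day"]
        .sum()
        .clip(0, 20)
    )

# Toy stand-in for train_df; values are illustrative only.
toy = pd.DataFrame({
    "date_block_num": [32, 32, 32, 33, 33, 33],
    "shop_id":        [1, 1, 2, 1, 2, 2],
    "item_id":        [10, 10, 10, 10, 10, 20],
    "item_cnt_day":   [1.0, 2.0, 1.0, 2.0, 1.0, 4.0],
})

pred = monthly_sales(toy, 32).rename("pred")     # previous month as forecast
truth = monthly_sales(toy, 33).rename("truth")   # month being forecast

# Align on (shop_id, item_id); pairs absent from block 32 get a forecast of 0.
eval_df = truth.to_frame().join(pred, how="left").fillna(0)
rmse = np.sqrt(((eval_df["truth"] - eval_df["pred"]) ** 2).mean())
print(rmse)
```

This check only covers pairs that sold in the held-out month, whereas the test set also contains pairs with zero sales, so the local score is a rough guide rather than an estimate of the leaderboard score.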


In [13]:
submission[["ID", "item_cnt_month"]].to_csv("submission.csv",
                                          index=False)

In [14]:
!kaggle competitions submit -c competitive-data-science-predict-future-sales -f submission.csv -m "Baseline3"

100%|███████████████████████████████████████| 2.14M/2.14M [00:05<00:00, 404kB/s]
403 - Your team has used its submission allowance (5 of 5). This resets at midnight UTC (5.9 hours from now).
