# Project name: [Predict Future Sales](https://www.kaggle.com/c/competitive-data-science-predict-future-sales)

## Objective

Score better than 1.05 (RMSE, lower is better) on the [Public Leaderboard](https://www.kaggle.com/c/competitive-data-science-predict-future-sales/leaderboard)

## Version

In [1]:
__ver__ = "0.1"

## Setup

In [64]:
import numpy as np
import pandas as pd
import catboost
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline 

import itertools

## Data load

In [65]:
# path
cat_path = "./raw_data/item_categories.csv"
items_path = "./raw_data/items.csv"
shop_path = "./raw_data/shops.csv"
sales_path = "./raw_data/sales_train.csv.gz"
test_path = "./raw_data/test.csv.gz"

In [66]:
# load
cat = pd.read_csv(cat_path)
items = pd.read_csv(items_path)
shops = pd.read_csv(shop_path)
# dates are formatted dd.mm.yyyy; infer_datetime_format is deprecated since pandas 2.0
sales_params = dict(parse_dates=[0], infer_datetime_format=True, dayfirst=True)
sales = pd.read_csv(sales_path, **sales_params)
# load and save ID in index
test = pd.read_csv(test_path).set_index('ID')

## Monthly sales

In [67]:
# drop columns that are not in the test set
sales.drop(["date", "item_price"], axis=1, inplace=True)

In [68]:
# get monthly sales
sales = sales.groupby(["date_block_num", "shop_id", "item_id"], as_index=False).sum()

## Stack train and test data

In [69]:
# add date_block_num to test
test_date_block = sales.date_block_num.max() + 1
test["date_block_num"] = test_date_block

In [71]:
# stack train and test data
data = pd.concat([sales, test], axis=0, sort=False)
data.index.name = "ID"
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1823324 entries, 0 to 214199
Data columns (total 4 columns):
date_block_num    int64
shop_id           int64
item_id           int64
item_cnt_day      float64
dtypes: float64(1), int64(3)
memory usage: 69.6 MB


In [72]:
# downcast types
down_cast = dict(
    date_block_num='int8',
    shop_id='category',
    item_id='category',
    item_cnt_day='float16'  # must be float: the test part holds NaNs; float16 is exact for integers only up to 2048
)
data = data.astype(down_cast)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1823324 entries, 0 to 214199
Data columns (total 4 columns):
date_block_num    int8
shop_id           category
item_id           category
item_cnt_day      float16
dtypes: category(2), float16(1), int8(1)
memory usage: 25.1 MB


In [73]:
# save
processed_path = f"./processed_data/data_{__ver__}.pickle"
data.to_pickle(processed_path)

## Exploratory data analysis

Histograms and Statistics:

* NaN
* Outliers
* Constant features
* Duplicated features
* Duplicated rows
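
The checks in the list above could be sketched as follows. This is a minimal illustration on a made-up toy frame (the columns `a` to `d` are hypothetical), not the analysis run in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; columns "a".."d" are hypothetical.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 2.0],
    "b": [5, 5, 5, 5],             # constant feature
    "c": [1.0, 2.0, 3.0, 2.0],     # duplicate of "a"
    "d": [0.5, 0.7, np.nan, 0.7],  # contains a NaN
})

# NaN count per column
nan_counts = df.isna().sum()

# Constant features: a single distinct value (NaN counted as a value)
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]

# Duplicated features: columns whose values repeat an earlier column
duplicated_cols = df.columns[df.T.duplicated().to_numpy()].tolist()

# Duplicated rows: rows identical to an earlier row
n_duplicated_rows = int(df.duplicated().sum())
```

Transposing a wide frame to find duplicated columns can be slow on the full data set, so it may be worth running on a sample first.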

Domain knowledge logic

Label-colored feature cross-plots

Features correlations and groups

Features means and groups

Label-feature cross-plots

Feature-index label-color plot

Label-index plot

## Preprocessing

#### Numeric

Scaling and nonlinear transformation

Outliers
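
A rough sketch of the numeric steps above, on made-up counts. The 99th-percentile cap and the choice of `log1p` are assumptions for illustration, not settings taken from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up monthly counts with one extreme value.
counts = pd.Series([0.0, 1.0, 2.0, 3.0, 1000.0])

# Outliers: cap values at the 99th percentile (the threshold is an arbitrary choice).
upper = counts.quantile(0.99)
clipped = counts.clip(upper=upper)

# Nonlinear transformation: log1p compresses the long right tail and keeps 0 at 0.
logged = np.log1p(clipped)

# Scaling: standardize to zero mean and unit (sample) standard deviation.
scaled = (logged - logged.mean()) / logged.std()
```

Tree-based models are largely insensitive to monotonic rescaling, so this step matters mostly for linear models, KNN, and neural networks.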

#### Categorical and ordinal features

One-hot encoding

Frequency encoding
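
Both encodings can be illustrated on a toy `shop_id` column. The values are made up, and the output column names (`shop_id_freq`, the `shop` prefix) are hypothetical.

```python
import pandas as pd

# Toy categorical column with made-up values.
df = pd.DataFrame({"shop_id": [1, 1, 2, 3, 3, 3]})

# Frequency encoding: replace each category with its share of rows.
freq = df["shop_id"].value_counts(normalize=True)
df["shop_id_freq"] = df["shop_id"].map(freq)

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["shop_id"], prefix="shop")
```

With thousands of distinct items, one-hot encoding produces a very wide matrix, which is one reason frequency encoding is often tried first for high-cardinality features.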

#### Date and time 

Day number in week, month, season, year; second, minute, hour

Time since

Difference between dates
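
A minimal sketch of these date features, assuming dates in the `dd.mm.yyyy` format used by the sales file. The two sample dates and the feature names are made up for illustration.

```python
import pandas as pd

# Two made-up dates in dd.mm.yyyy format.
dates = pd.to_datetime(pd.Series(["02.01.2013", "15.03.2014"]), format="%d.%m.%Y")

feats = pd.DataFrame({
    "day_of_week": dates.dt.dayofweek,  # Monday=0 ... Sunday=6
    "month": dates.dt.month,
    "year": dates.dt.year,
    # Time since a reference point: days since the earliest date in the series.
    "days_since_start": (dates - dates.min()).dt.days,
})
```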

#### Coordinates

Rotations

POI

Centers of clusters

Aggregated statistics

#### Text

Lowercase

Lemmatization

Stemming

Stopwords

Bag of words

N-grams

TF-IDF

Embeddings

#### Image

Augmentation

NN

#### Missing values

Fill

Isnull feature
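
One possible way to combine the two ideas above: record where values were missing before filling them. The column `item_price` is used only as an example, and filling with the median is an assumption, not a choice made in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up prices with one missing value.
df = pd.DataFrame({"item_price": [100.0, np.nan, 300.0]})

# Isnull feature: keep the information that the value was missing.
df["item_price_isnull"] = df["item_price"].isna().astype("int8")

# Fill: replace NaN with the median of the observed values.
df["item_price"] = df["item_price"].fillna(df["item_price"].median())
```

The indicator has to be created before filling, otherwise the missingness information is already gone.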

#### Downcast types and save preprocessed data

## Feature generation

Mean encoding
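
A sketch of mean encoding with expanding-mean regularization, one of several common variants. It assumes rows are already sorted by time, so each row is encoded using only the target values of earlier rows with the same category. The column names `target` and `item_id_mean_enc` are hypothetical.

```python
import pandas as pd

# Made-up rows, assumed to be sorted by time.
df = pd.DataFrame({
    "item_id": [1, 1, 2, 1],
    "target": [1.0, 3.0, 5.0, 2.0],
})

grouped = df.groupby("item_id")["target"]

# Sum and count of the target over *previous* rows of the same item.
prev_sum = grouped.cumsum() - df["target"]
prev_cnt = grouped.cumcount()

# Expanding mean; the first occurrence of an item has no history and becomes NaN.
df["item_id_mean_enc"] = prev_sum / prev_cnt

# Fall back to the global mean where no history exists.
df["item_id_mean_enc"] = df["item_id_mean_enc"].fillna(df["target"].mean())
```

Excluding the current row from its own encoding is what limits target leakage; a plain per-category mean computed on the whole training set would not do that.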

Percentiles, std, distribution bins

Crosses, multiplications, divisions, group-by features

Dimensionality reduction (SVD, PCA, NMF, t-SNE)

Extract features via decision trees

Lags, Rolling statistics
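
A lag feature for monthly data could be built by shifting `date_block_num` and merging back, as sketched below. The helper `add_lag` and the column `item_cnt_month` are hypothetical names, and the values are made up.

```python
import pandas as pd

# Made-up monthly sales for one shop and two items.
df = pd.DataFrame({
    "date_block_num": [0, 1, 2, 0, 1, 2],
    "shop_id": [5, 5, 5, 5, 5, 5],
    "item_id": [10, 10, 10, 11, 11, 11],
    "item_cnt_month": [1.0, 2.0, 3.0, 0.0, 4.0, 1.0],
})

def add_lag(frame, lag, col="item_cnt_month"):
    """Attach the value of `col` from `lag` months earlier (hypothetical helper)."""
    keys = ["date_block_num", "shop_id", "item_id"]
    shifted = frame[keys + [col]].copy()
    # Moving the month forward aligns month t - lag with month t.
    shifted["date_block_num"] = shifted["date_block_num"] + lag
    shifted = shifted.rename(columns={col: f"{col}_lag_{lag}"})
    return frame.merge(shifted, on=keys, how="left")

out = add_lag(df, 1)
```

Months without a matching earlier record come out as NaN, which then needs a fill strategy (for example 0 when a missing record means no sales).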

KNN

## Load processed data

In [54]:
data = pd.read_pickle(processed_path)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1823324 entries, 0 to 214199
Data columns (total 4 columns):
date_block_num    int8
shop_id           category
item_id           category
item_cnt_day      float16
dtypes: category(2), float16(1), int8(1)
memory usage: 24.5 MB


## Simple baseline solution

In [74]:
# baseline: median monthly sales over all training months (date_block_num < test_date_block)
baseline_sales = data.item_cnt_day[data.date_block_num < test_date_block].median()

In [75]:
sub_index = data.index[data.date_block_num == test_date_block]

In [76]:
sub_df = pd.DataFrame(baseline_sales, index=sub_index, columns=["item_cnt_month"])

## Submission

In [78]:
sub_path = f"./submissions/submission_{__ver__}.csv"
sub_df.to_csv(sub_path)

## Validation strategies

Holdout

KFold

StratifiedKFold

Timewise

Public test score vs Validation score
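
For time-ordered data such as this, a timewise holdout is the natural starting point: hold out the last training month and score predictions on it. The sketch below evaluates a median baseline with RMSE on made-up numbers; the column name `item_cnt_month` is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up monthly sales over three months.
df = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1, 2, 2],
    "item_cnt_month": [1.0, 3.0, 2.0, 2.0, 1.0, 5.0],
})

# Timewise split: the last available month is the validation set.
last_block = df["date_block_num"].max()
train = df[df["date_block_num"] < last_block]
valid = df[df["date_block_num"] == last_block]

# Median baseline fitted on the training months only.
pred = np.full(len(valid), train["item_cnt_month"].median())

# RMSE on the held-out month.
errors = valid["item_cnt_month"].to_numpy() - pred
rmse = float(np.sqrt(np.mean(errors ** 2)))
```

Comparing this validation score with the public leaderboard score after each submission shows whether the split mimics the real train/test split.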

## Hyperparameter tuning

Only if you don’t have any more ideas or you have spare computational resources

Average everything:

* Over random seed
* Or over small deviations from optimal parameters
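
Averaging over random seeds can be sketched with any seed-dependent model. Here a hypothetical `fit_predict` fits least squares on a bootstrap sample, standing in for a real model such as gradient boosting; the data are synthetic.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test, seed):
    """Hypothetical seed-dependent model: least squares on a bootstrap sample."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    coef, *_ = np.linalg.lstsq(X_train[idx], y_train[idx], rcond=None)
    return X_test @ coef

# Synthetic data: y = 1 + 2 * x + noise.
data_rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), data_rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + data_rng.normal(scale=0.5, size=200)
X_test = np.array([[1.0, 0.0], [1.0, 1.0]])

# One prediction vector per seed, then the average over seeds.
preds = np.stack([fit_predict(X, y, X_test, seed) for seed in range(10)])
avg_pred = preds.mean(axis=0)
```

The averaged prediction varies less from run to run than any single-seed prediction, which is the point of the technique.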

## Ensembling

* Save all good models
* Make diverse models

* Averaging
* Weighted averaging
* Bagging (BaggingClassifier and BaggingRegressor from Sklearn, seed bagging)
* Boosting (AdaBoostClassifier from Sklearn)
* Stacking (Meta model should be modest)
* StackNet
    - Diversity of base algorithms
    - Diversity of base data
    - Simpler algorithms on higher levels
    - Feature engineering of meta features (differences, std ...)
    - For every level, 1 model per 5-10 models in the previous level
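
As an illustration of weighted averaging from the list above, the blend weight of two models can be picked on a validation set. The predictions below are made up, and the grid of 11 weights is an arbitrary choice.

```python
import numpy as np

# Made-up validation targets and predictions from two hypothetical models.
y_valid = np.array([1.0, 2.0, 3.0, 4.0])
pred_a = np.array([1.2, 1.8, 3.3, 3.9])
pred_b = np.array([0.5, 2.5, 2.5, 4.5])

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Try blends w * A + (1 - w) * B for w in 0.0, 0.1, ..., 1.0.
weights = np.linspace(0.0, 1.0, 11)
scores = [rmse(y_valid, w * pred_a + (1.0 - w) * pred_b) for w in weights]
best_w = float(weights[int(np.argmin(scores))])
best_score = min(scores)
```

Because the grid includes the weights 0 and 1, the best blend is never worse on the validation set than either model alone; whether that carries over to the test set depends on how reliable the validation split is.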

## Iterate

* Organize ideas in some structure
* Sort all parameters by Importance, Feasibility and Understanding
* Select the most important and promising ideas
* Start with simple (or even primitive) solution
* Debug full pipeline from reading data to writing submission file
* It is very important to have reproducible results; keep important code clean
* Submit and save notebook per submission
* Evaluate and try to understand the reasons why something does/doesn’t work
* Get new ideas (read papers) about ML-related things and the problem domain
* Add or change features and iterate