<a href="https://colab.research.google.com/github/ValentinoVizner/google_Colab/blob/master/ML_grocery_store.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Brick-and-mortar grocery stores are always in a delicate dance with purchasing and sales forecasting. Predict a little over, and grocers are stuck with overstocked, perishable goods. Guess a little under, and popular items quickly sell out, leaving money on the table and customers fuming.

The problem becomes more complex as retailers add new locations with unique needs, new products, ever transitioning seasonal tastes, and unpredictable product marketing. Corporación Favorita, a large Ecuadorian-based grocery retailer, knows this all too well. They operate hundreds of supermarkets, with over 200,000 different products on their shelves.

Corporación Favorita has challenged the Kaggle community to build a model that more accurately forecasts product sales. They currently rely on subjective forecasting methods with very little data to back them up and very little automation to execute plans. They’re excited to see how machine learning could better ensure they please customers by having just enough of the right products at the right time.

# New Section

In [1]:
!pip install fastai==0.7.0
!pip install "torchtext==0.2.3"
!pip uninstall -y kaggle
!pip install --upgrade pip
!pip install kaggle==1.5.6

Collecting fastai==0.7.0
[?25l  Downloading https://files.pythonhosted.org/packages/50/6d/9d0d6e17a78b0598d5e8c49a0d03ffc7ff265ae62eca3e2345fab14edb9b/fastai-0.7.0-py3-none-any.whl (112kB)
[K     |████████████████████████████████| 122kB 3.5MB/s 
Collecting torch<0.4
[?25l  Downloading https://files.pythonhosted.org/packages/5b/a5/e8b50b55b1abac9f1e3346c4242f1e42a82d368a8442cbd50c532922f6c4/torch-0.3.1-cp36-cp36m-manylinux1_x86_64.whl (496.4MB)
[K     |████████████████████████████████| 496.4MB 24kB/s 
Collecting pandas-summary
  Downloading https://files.pythonhosted.org/packages/f5/20/865d4785f86e86f7ad7ebd52fc14810ef59ef13cd959f1363a2615d1c665/pandas_summary-0.0.7-py2.py3-none-any.whl
Collecting bcolz
[?25l  Downloading https://files.pythonhosted.org/packages/5c/4e/23942de9d5c0fb16f10335fa83e52b431bcb8c0d4a8419c9ac206268c279/bcolz-1.2.1.tar.gz (1.5MB)
[K     |████████████████████████████████| 1.5MB 40.8MB/s 
Collecting isoweek
  Downloading https://files.pythonhosted.org/package

In [2]:
import os
from google.colab import files
files.upload()
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
os.environ['valentinovizner'] = "4654bf5a93bb48985b13165780e16df4" # username from the json file os.environ['KAGGLEKEY'] = "xxxxxxxxxxxxxxxxxxxxxxxxxxxx" # key from the json file

Saving kaggle.json to kaggle.json


In [3]:
!kaggle competitions list -s favorita-grocery-sales-forecasting
!kaggle competitions download -c favorita-grocery-sales-forecasting -p /content/data/grocery_store

ref                                 deadline             category   reward  teamCount  userHasEntered  
----------------------------------  -------------------  --------  -------  ---------  --------------  
favorita-grocery-sales-forecasting  2018-01-15 23:59:00  Featured  $30,000       1674            True  
Downloading favorita-grocery-sales-forecasting.zip to /content/data/grocery_store
 98% 449M/458M [00:08<00:00, 53.5MB/s]
100% 458M/458M [00:08<00:00, 57.8MB/s]


In [4]:
!unzip /content/data/grocery_store/*.zip -d /content/data/grocery_store/parent_unzip
!7za x -y "/content/data/grocery_store/parent_unzip/*.7z" -o/content/data/grocery_store/parent_unzip/unzipped

Archive:  /content/data/grocery_store/favorita-grocery-sales-forecasting.zip
  inflating: /content/data/grocery_store/parent_unzip/holidays_events.csv.7z  
  inflating: /content/data/grocery_store/parent_unzip/items.csv.7z  
  inflating: /content/data/grocery_store/parent_unzip/oil.csv.7z  
  inflating: /content/data/grocery_store/parent_unzip/sample_submission.csv.7z  
  inflating: /content/data/grocery_store/parent_unzip/stores.csv.7z  
  inflating: /content/data/grocery_store/parent_unzip/test.csv.7z  
  inflating: /content/data/grocery_store/parent_unzip/train.csv.7z  
  inflating: /content/data/grocery_store/parent_unzip/transactions.csv.7z  

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,40 CPUs Intel(R) Xeon(R) CPU @ 2.30GHz (306F0),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan /content/data/grocery_store/parent_unzip/      

In [7]:
!pip install scikit-learn==0.19.1

Collecting scikit-learn==0.19.1
  Downloading scikit_learn-0.19.1-cp36-cp36m-manylinux1_x86_64.whl (12.4 MB)
[K     |████████████████████████████████| 12.4 MB 3.5 MB/s 
[31mERROR: yellowbrick 0.9.1 has requirement scikit-learn>=0.20, but you'll have scikit-learn 0.19.1 which is incompatible.[0m
[31mERROR: imbalanced-learn 0.4.3 has requirement scikit-learn>=0.20, but you'll have scikit-learn 0.19.1 which is incompatible.[0m
[?25hInstalling collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.22.1
    Uninstalling scikit-learn-0.22.1:
      Successfully uninstalled scikit-learn-0.22.1
Successfully installed scikit-learn-0.19.1


In [0]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [2]:
from fastai.imports import *
from fastai.structured import *


from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display

from sklearn import metrics
import dask.dataframe as dd

  from numpy.core.umath_tests import inner1d


In [0]:
PATH = "data/grocery_store/parent_unzip/unzipped/"

In [0]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df


def import_data(file):
    """create a dataframe and optimize its memory usage"""
    df = pd.read_csv(file, parse_dates=True, keep_date_col=True)
    df = reduce_mem_usage(df)
    return df

In [5]:
!ls {PATH}

holidays_events.csv  oil.csv		    stores.csv	train.csv
items.csv	     sample_submission.csv  test.csv	transactions.csv


In [6]:
%%time
df = pd.read_csv(f"{PATH}test.csv")
df.head()

CPU times: user 1.16 s, sys: 192 ms, total: 1.35 s
Wall time: 1.35 s


In [0]:
types = {"id": "int64", "store_nbr": "int8", "item_nbr": "int32", "onpromotion": "object", "unit_sales": "float32"}
# the reason onpromotion is object is because it has missing values

In [8]:
df

Unnamed: 0,id,date,store_nbr,item_nbr,onpromotion
0,125497040,2017-08-16,1,96995,False
1,125497041,2017-08-16,1,99197,False
2,125497042,2017-08-16,1,103501,False
3,125497043,2017-08-16,1,103520,False
4,125497044,2017-08-16,1,103665,False
...,...,...,...,...,...
3370459,128867499,2017-08-31,54,2132163,False
3370460,128867500,2017-08-31,54,2132318,False
3370461,128867501,2017-08-31,54,2132945,False
3370462,128867502,2017-08-31,54,2132957,False


# 1. Read data

In [9]:
%%time
#df = pd.read_csv(f"{PATH}train.csv", parse_dates=['date'], dtype=types, infer_datetime_format=True)
df = import_data(f"{PATH}train.csv")


  call = lambda f, *a, **k: f(*a, **k)


Memory usage of dataframe is 5744.80 MB
Memory usage after optimization is: 1915.02 MB
Decreased by 66.7%
CPU times: user 1min 8s, sys: 13.3 s, total: 1min 22s
Wall time: 1min 22s


# Optimization

In [10]:
# scaling part 1 to the entire dataset
dtype_dict={"id":np.uint32,
            "store_nbr":np.uint8,
            "item_nbr":np.uint32,
            "unit_sales":np.float32
           }

train_part1 = pd.read_csv(f"{PATH}train.csv",dtype=dtype_dict,usecols=[0,2,3,4])
print(train_part1.dtypes)

id             uint32
store_nbr       uint8
item_nbr       uint32
unit_sales    float32
dtype: object


In [11]:
# now scaling it to the entire dataset of train
%%time
train_part2=pd.read_csv(f"{PATH}train.csv",dtype=dtype_dict,usecols=[1,5],parse_dates=[0])
train_part2['Year'] = pd.DatetimeIndex(train_part2['date']).year
train_part2['Month'] = pd.DatetimeIndex(train_part2['date']).month
train_part2['Day'] =pd.DatetimeIndex(train_part2['date']).day.astype(np.uint8)
del(train_part2['date'])
train_part2['Day']=train_part2['Day'].astype(np.uint8)
train_part2['Month']=train_part2['Month'].astype(np.uint8)
train_part2['Year']=train_part2['Year'].astype(np.uint16)

#impute the missing values to be -1
train_part2["onpromotion"].fillna(0, inplace=True)
train_part2["onpromotion"]=train_part2["onpromotion"].astype(np.int8)
print(train_part2.head())
print(train_part2.dtypes)



   onpromotion  Year  Month  Day
0            0  2013      1    1
1            0  2013      1    1
2            0  2013      1    1
3            0  2013      1    1
4            0  2013      1    1
onpromotion      int8
Year           uint16
Month           uint8
Day             uint8
dtype: object
CPU times: user 1min 46s, sys: 2.79 s, total: 1min 49s
Wall time: 1min 49s


In [0]:
# joining part one and two
# For people familiar with R , the equivalent of cbind in pandas is the following command
train = pd.concat([train_part1.reset_index(drop=True), train_part2], axis=1)
#drop temp files
del(train_part1)
del(train_part2)
#Further Id is just an indicator column, hence not required for analysis
id=train['id']
del(train['id'])
# check memory
print(train.memory_usage())
#The extracted train.csv file is approx 5 GB
mem_train=5*1024**3
new_mem_train=train.memory_usage().sum()
print("Train dataset uses ",new_mem_train/ 1024**2," MB after changes")
print("memory saved is approx",(mem_train-new_mem_train)/ 1024**2," MB")

Index                128
store_nbr      125497040
item_nbr       501988160
unit_sales     501988160
onpromotion    125497040
Year           250994080
Month          125497040
Day            125497040
dtype: int64
Train dataset uses  1675.5663757324219  MB after changes
memory saved is approx 3444.433624267578  MB


In [12]:
df = train

NameError: ignored

In [13]:
df.onpromotion.fillna(False, inplace=True)
df.onpromotion = df.onpromotion.map({'False' : False, 'True' : True})
df.onpromotion = df.onpromotion.astype(bool)

df

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion
0,0,2013-01-01,25,103665,7.0,True
1,1,2013-01-01,25,105574,1.0,True
2,2,2013-01-01,25,105575,2.0,True
3,3,2013-01-01,25,108079,1.0,True
4,4,2013-01-01,25,108701,1.0,True
...,...,...,...,...,...,...
125497035,125497035,2017-08-15,54,2089339,4.0,True
125497036,125497036,2017-08-15,54,2106464,1.0,True
125497037,125497037,2017-08-15,54,2110456,192.0,True
125497038,125497038,2017-08-15,54,2113914,198.0,True


In [14]:
%time df.to_feather('raw_groceries')

CPU times: user 1.2 s, sys: 1.34 s, total: 2.53 s
Wall time: 10.8 s


In [15]:
df.describe(include='all')

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion
count,125497000.0,125497040,125497000.0,125497000.0,125497000.0,125497040
unique,,1684,,,,1
top,,2017-07-01,,,,True
freq,,118194,,,,125497040
mean,62748520.0,,27.46458,972769.2,5.319669,
std,36227880.0,,16.33051,520533.6,23.06714,
min,0.0,,1.0,96995.0,-15372.0,
25%,31374260.0,,12.0,522383.0,2.0,
50%,62748520.0,,28.0,959500.0,4.0,
75%,94122780.0,,43.0,1354380.0,9.0,


In [16]:
df_test = pd.read_csv(f'{PATH}test.csv', parse_dates=['date'], dtype=types, infer_datetime_format=True)

df_test.onpromotion.fillna(False, inplace=True)
df_test.onpromotion = df_test.onpromotion.map({'False' : False, 'True' : True})
df_test.onpromotion = df_test.onpromotion.astype(bool)
df_test.describe(include='all')

Unnamed: 0,id,date,store_nbr,item_nbr,onpromotion
count,3370464.0,3370464,3370464.0,3370464.0,3370464
unique,,16,,,2
top,,2017-08-27 00:00:00,,,False
freq,,210654,,,3171867
first,,2017-08-16 00:00:00,,,
last,,2017-08-31 00:00:00,,,
mean,127182300.0,,27.5,1244798.0,
std,972969.3,,15.58579,589836.2,
min,125497000.0,,1.0,96995.0,
25%,126339700.0,,14.0,805321.0,


In [0]:
df_all = pd.read_feather('raw_groceries')

In [18]:
df_all.unit_sales = np.log1p(np.clip(df_all.unit_sales, 0, None))
df_all.unit_sales

0            2.079442
1            0.693147
2            1.098612
3            0.693147
4            0.693147
               ...   
125497035    1.609438
125497036    0.693147
125497037    5.262690
125497038    5.293305
125497039    1.098612
Name: unit_sales, Length: 125497040, dtype: float32

In [0]:
#df_all['date'] = df_all['date'].astype('datetime64[ns]')

In [0]:
df_all.dtypes

store_nbr        uint8
item_nbr        uint32
unit_sales     float32
onpromotion       bool
Year            uint16
Month            uint8
Day              uint8
dtype: object

In [0]:
%time add_datepart(df_all, 'date')

In [0]:
def split_vals(a, n):
    return a[:n].copy(), a[n:].copy()

In [20]:
n_valid = len(df_test)
n_train = len(df_all) - n_valid
train, valid = split_vals(df_all, n_train)
train.shape, valid.shape

((122126576, 6), (3370464, 6))

In this case we dont need to apply categories to get all the numbers, since we have all the numbers already

In [0]:
# train_cats(raw_train)
# apply_cats(raw_valid, raw_train)

In [21]:
%%time
trn, y, nas = proc_df(train, 'unit_sales')

CPU times: user 5.15 s, sys: 52.9 ms, total: 5.2 s
Wall time: 5.2 s


In [0]:
val, y_val, nas = proc_df(valid, 'unit_sales')

In [0]:
??proc_df

In [0]:
def rmse(x,y): 
    return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(x), y), rmse(m.predict(val), y_val),
                m.score(x, y), m.score(val, y_val)]
    if hasattr(m, 'oob_score_'): 
        res.append(m.oob_score_)
    print(res)

In [0]:
# We probably don't want to make a tree from 120 million record, so we use 1 000 000 here.
set_rf_samples(1000000)

In [27]:
# We do this step to convert everyything to float32 because RF does this everytime for every tree, this we do this only once and not 20 times
%time x = np.array(trn, dtype=np.float32)

CPU times: user 29.7 s, sys: 10.4 s, total: 40.1 s
Wall time: 40 s


In [28]:
%%time 
m = RandomForestRegressor(n_estimators=20, min_samples_leaf=100, n_jobs=8)
m.fit(x,y)

CPU times: user 3min 43s, sys: 11.6 s, total: 3min 54s
Wall time: 36.2 s


In [29]:
print_score(m)

[0.7426612191572745, 0.7481560406293348, 0.2908222518099983, 0.25503105474282195]
