# Walmart sales prediction

...

Summary:

- Prediction type: __Regression model__
- Domain: __Retail__
- Prediction target: __Sales__ 
- Population size: __???__

_Author: Dr. Patrick Urbanke_

# Background

...

It has been downloaded from the [CTU Prague relational learning repository](https://relational.fit.cvut.cz/dataset/Walmart) (Motl and Schulte, 2015).

### A web frontend for getML

The getML monitor is a frontend built to support your work with getML. It displays information such as the imported data frames and trained pipelines, and it allows easy data and feature exploration. You can launch the getML monitor [here](http://localhost:1709).

### Where is this running?

Your getML live session is running inside a Docker container on [mybinder.org](https://mybinder.org/), a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.

# Analysis

Let's get started with the analysis and set up your session:

In [1]:
import copy
import os
from urllib import request

import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline  

import getml

getml.engine.set_project('walmart')


Connected to project 'walmart'


## 1. Loading data

### 1.1 Download from source

We begin by connecting to the MariaDB database that hosts the data and loading the tables from it:

In [2]:
conn = getml.database.connect_mariadb(
    host="relational.fit.cvut.cz",
    dbname="Walmart",
    port=3306,
    user="guest",
    password="relational"
)

conn

Connection(conn_id='default',
           dbname='Walmart',
           dialect='mysql',
           host='relational.fit.cvut.cz',
           port=3306)

In [3]:
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.data.DataFrame.from_db(
            name=name,
            table_name=name,
            conn=conn
        )
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame

In [4]:
weather = load_if_needed("weather")
key = load_if_needed("key")
train = load_if_needed("train")

In [None]:
weather

In [5]:
train

Name,store_nbr,item_nbr,units,date
Role,unused_float,unused_float,unused_float,unused_string
0.0,1,1,0,2012-01-01
1.0,1,2,0,2012-01-01
2.0,1,3,0,2012-01-01
3.0,1,4,0,2012-01-01
4.0,1,5,0,2012-01-01
,...,...,...,...
4617595.0,45,107,0,2014-10-31
4617596.0,45,108,0,2014-10-31
4617597.0,45,109,0,2014-10-31
4617598.0,45,110,0,2014-10-31


### 1.2 Prepare data for getML

getML requires that we define *roles* for each of the columns.

In [9]:
weather.set_role("station_nbr", getml.data.roles.join_key)
weather.set_role("date", getml.data.roles.time_stamp)
weather.set_role(weather.unused_float_names, getml.data.roles.numerical)
weather.set_role(weather.unused_string_names, getml.data.roles.categorical)

weather

Name,date,station_nbr,sunrise,sunset,codesum,tmax,tmin,tavg,depart,dewpoint,wetbulb,heat,cool,snowfall,preciptotal,stnpressure,sealevel,resultspeed,resultdir,avgspeed
Role,time_stamp,join_key,categorical,categorical,categorical,numerical,numerical,numerical,numerical,numerical,numerical,numerical,numerical,numerical,numerical,numerical,numerical,numerical,numerical,numerical
Units,"time stamp, comparison only",,,,,,,,,,,,,,,,,,,
0.0,2012-01-01,1,,,RA FZFG BR,52,31,42,,36,40,23,0,,0.05,29.78,29.92,3.6,20,4.6
1.0,2012-01-02,1,,,,50,31,41,,26,35,24,0,,0.01,29.44,29.62,9.8,24,10.3
2.0,2012-01-03,1,,,,32,11,22,,4,18,43,0,,0,29.67,29.87,10.8,31,11.6
3.0,2012-01-04,1,,,,28,9,19,,-1,14,46,0,,0,29.86,30.03,6.3,27,8.3
4.0,2012-01-05,1,,,,38,25,32,,13,25,33,0,,0,29.67,29.84,6.9,25,7.8
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20512.0,2014-10-27,20,,,,85,66,76,,59,65,0,11,0,0,29.11,29.82,10,18,10.4
20513.0,2014-10-28,20,,,,80,68,74,,60,65,0,9,0,0,29.3,29.97,3.1,36,6.4
20514.0,2014-10-29,20,,,,78,55,67,,47,56,0,2,0,0,29.42,30.12,4.9,6,6.1
20515.0,2014-10-30,20,,,,80,52,66,,50,57,0,1,0,0,29.4,30.11,1.6,14,4.9


Moreover, we want to add the station number directly to the sales data. This could be useful, because it tells us which stores are close to each other.

In [10]:
store_nbr = key.store_nbr.to_numpy()
station_nbr = key.station_nbr.to_numpy()

mapping = {store: station for (store, station) in zip(store_nbr, station_nbr)}

train["station_nbr"] = np.asarray([mapping[store] for store in train.store_nbr.to_numpy()])

train

Name,store_nbr,item_nbr,units,station_nbr,date
Role,unused_float,unused_float,unused_float,unused_float,unused_string
0.0,1,1,0,1,2012-01-01
1.0,1,2,0,1,2012-01-01
2.0,1,3,0,1,2012-01-01
3.0,1,4,0,1,2012-01-01
4.0,1,5,0,1,2012-01-01
,...,...,...,...,...
4617595.0,45,107,0,16,2014-10-31
4617596.0,45,108,0,16,2014-10-31
4617597.0,45,109,0,16,2014-10-31
4617598.0,45,110,0,16,2014-10-31
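
The lookup used above can be sketched in plain Python on a few made-up rows. The store and station numbers below are illustrative values, not taken from the `key` table. The sketch only shows how a dictionary built with `zip` maps each sales row to its weather station.

```python
# Hypothetical excerpt of the key table: one weather station per store.
store_nbr = [1, 2, 3, 45]
station_nbr = [1, 14, 7, 16]

# Build the store -> station lookup once.
mapping = dict(zip(store_nbr, station_nbr))

# Hypothetical store numbers of five sales rows.
sales_store_nbr = [1, 1, 3, 45, 2]

# One dictionary lookup per row. A store missing from the key table
# would raise a KeyError instead of silently producing a wrong station.
sales_station_nbr = [mapping[store] for store in sales_store_nbr]

print(sales_station_nbr)
```

Building the dictionary once keeps the per-row cost constant, which matters for a table with several million rows.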


Next, we assign roles to the columns of the sales data: the store, item and station numbers become join keys, the date becomes the time stamp and `units` is the target:

In [11]:
train.set_role(["store_nbr", "item_nbr", "station_nbr"], getml.data.roles.join_key)
train.set_role("date", getml.data.roles.time_stamp)
train.set_role("units", getml.data.roles.target)

train

Name,date,store_nbr,item_nbr,station_nbr,units
Role,time_stamp,join_key,join_key,join_key,target
Units,"time stamp, comparison only",,,,
0.0,2012-01-01,1,1,1,0
1.0,2012-01-01,1,2,1,0
2.0,2012-01-01,1,3,1,0
3.0,2012-01-01,1,4,1,0
4.0,2012-01-01,1,5,1,0
,...,...,...,...,...
4617595.0,2014-10-31,45,107,16,0
4617596.0,2014-10-31,45,108,16,0
4617597.0,2014-10-31,45,109,16,0
4617598.0,2014-10-31,45,110,16,0


We add categorical copies of the key columns to the population table and then split it randomly into training, validation and testing sets.

In [12]:
population = train.with_column(
    train.store_nbr, name="store_nbr_cat", role=getml.data.roles.categorical
).with_column(
    train.item_nbr, name="item_nbr_cat", role=getml.data.roles.categorical
).with_column(
    train.station_nbr, name="station_nbr_cat", role=getml.data.roles.categorical
).with_column(
    train.store_nbr + "-" + train.item_nbr, name="store_item_nbr_cat", role=getml.data.roles.categorical
)

In [None]:
split = getml.data.split.random(train=0.3, validation=0.3, test=0.4)

data_train = population[split == "train"]
data_validation = population[split == "validation"]

In [None]:
data_validation

In [None]:
container = getml.data.DataContainer(train=data_train, validation=data_validation)
container.add(weather=weather, past_sales=train)
container.freeze()
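
To make the idea of a random split concrete, here is a minimal sketch in plain Python. It is not getML's implementation: the helper `random_split`, the seed and the row count are assumptions chosen for illustration. Each row is assigned a label with the given probabilities, and the labels are then used like a mask to select subsets.

```python
import random


def random_split(num_rows, train=0.3, validation=0.3, test=0.4, seed=5849):
    """Assign each row to one subset with the given probabilities."""
    rng = random.Random(seed)
    labels = ["train", "validation", "test"]
    weights = [train, validation, test]
    return rng.choices(labels, weights=weights, k=num_rows)


num_rows = 10_000  # hypothetical table size
split = random_split(num_rows)

# Mask-style selection, analogous to population[split == "train"].
rows = range(num_rows)
train_rows = [row for row, label in zip(rows, split) if label == "train"]
validation_rows = [row for row, label in zip(rows, split) if label == "validation"]

print(len(train_rows) / num_rows, len(validation_rows) / num_rows)
```

Because every row gets exactly one label, the subsets cannot overlap. Note that the cells above only use the `train` and `validation` subsets; the rows labelled `test` are left aside.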

## 2. Predictive modeling

We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning.

### 2.1 Define relational model

...

In [None]:
dm = getml.data.DataModel(data_train.to_placeholder("population"))

dm.add(getml.data.to_placeholder(weather=weather, past_sales=train))

dm.population.join(
    dm.past_sales,
    on=['store_nbr', 'item_nbr'],
    time_stamps="date",
    horizon=getml.data.time.days(1),
    memory=getml.data.time.days(180),
    allow_lagged_targets=True,
)

dm.population.join(
    dm.weather,
    on='station_nbr',
    memory=getml.data.time.days(2),
    time_stamps="date",
)

dm
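
The `horizon` and `memory` arguments restrict which past sales may be aggregated for a given row. The sketch below illustrates the idea with plain `datetime` arithmetic. The helper `in_window` and the exact boundary handling (at least `horizon` old, less than `horizon + memory` old) are assumptions for illustration, not a statement about getML internals.

```python
from datetime import date, timedelta


def in_window(population_date, peripheral_date, horizon, memory):
    """
    True if a past row may be used for a population row.
    Assumed rule: the past row is at least `horizon` old
    and less than `horizon + memory` old.
    """
    age = population_date - peripheral_date
    return horizon <= age < horizon + memory


horizon = timedelta(days=1)
memory = timedelta(days=180)
today = date(2013, 7, 1)

# Same day: excluded by the horizon, so the target cannot leak into its own features.
print(in_window(today, date(2013, 7, 1), horizon, memory))
# One day old: inside the window.
print(in_window(today, date(2013, 6, 30), horizon, memory))
# Seven months old: outside the memory.
print(in_window(today, date(2012, 12, 1), horizon, memory))
```

A positive horizon matters here because `past_sales` contains the target column: without it, the sales of the very day being predicted could be used as an input.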

### 2.2 getML pipeline

<!-- #### 2.1.1  -->
__Set-up the feature learner & predictor__

We use the FastProp feature learner for this problem, preceded by the `Seasonal` and `Mapping` preprocessors, and an XGBoost regressor as the predictor. Because the population table is large, we subsample it (`sampling_factor`) and limit the number of features to 100 (`num_features`).

In [None]:
seasonal = getml.preprocessors.Seasonal()

mapping = getml.preprocessors.Mapping()

fast_prop = getml.feature_learning.FastPropModel(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    aggregation=getml.feature_learning.FastPropModel.agg_sets.All,
    sampling_factor=0.05,
    num_features=100,
)

predictor = getml.predictors.XGBoostRegressor()
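
As a rough illustration of what a seasonal preprocessor does, the sketch below splits a time stamp into calendar components that a model can treat as categories. This shows the general idea only; the helper `seasonal_components` and the choice of components are assumptions, not getML's implementation.

```python
from datetime import datetime


def seasonal_components(time_stamp):
    """Split a time stamp into categorical calendar components."""
    return {
        "year": time_stamp.year,
        "month": time_stamp.month,
        # Monday is 0 and Sunday is 6.
        "weekday": time_stamp.weekday(),
    }


# First date in the sales data.
print(seasonal_components(datetime(2012, 1, 1)))
```

Components like the weekday let the model pick up recurring patterns, such as weekend shopping, that a raw time stamp does not expose directly.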

__Build the pipeline__

In [None]:
pipe = getml.pipeline.Pipeline(
    tags=['fast_prop'],
    data_model=dm,
    preprocessors=[seasonal, mapping],
    feature_learners=[fast_prop],
    predictors=[predictor]
)

pipe

### 2.3 Model training

In [None]:
pipe.check(container.train)

In [None]:
pipe.fit(container.train)

### 2.4 Model evaluation

In [None]:
pipe.score(container.validation)
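
For a regression problem such as this one, typical scores are the mean absolute error, the root mean squared error and R-squared. The sketch below computes them in plain Python on made-up numbers. The helper `regression_metrics` is hypothetical, and treating R-squared as the squared Pearson correlation between predictions and targets is an assumption.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R-squared (as squared Pearson correlation)."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    rsquared = cov * cov / (var_t * var_p)

    return {"mae": mae, "rmse": rmse, "rsquared": rsquared}


# Hypothetical unit sales and predictions.
y_true = [0.0, 0.0, 3.0, 1.0, 5.0]
y_pred = [0.5, 0.0, 2.0, 1.5, 4.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

The RMSE penalizes large errors more strongly than the MAE, which is relevant for sales data where most rows are zero and a few are large.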

### 2.5 Studying features

__Feature correlations__

We want to analyze how the features are correlated with the target variable.

__Feature importances__
 
Feature importances are calculated by analyzing the improvement in predictive accuracy on each node of the trees in the XGBoost predictor. They are then normalized, so that all importances add up to 100%.

In [None]:
names, importances = pipe.features.importances()

plt.subplots(figsize=(20, 10))

plt.bar(names, importances, color='#6829c2')

plt.title('Feature importances')
plt.xlabel('Features')
plt.ylabel('Importances')
plt.xticks(rotation='vertical')
plt.show()
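
The normalization step described above can be sketched as follows. The raw gains are made-up numbers and the helper `normalize` is hypothetical; the point is only that dividing by the total makes the importances sum to 100%.

```python
def normalize(raw_importances):
    """Scale raw importances so that they add up to 1 (i.e. 100%)."""
    total = sum(raw_importances.values())
    return {name: value / total for name, value in raw_importances.items()}


# Hypothetical accumulated gains per feature.
raw_gains = {"feature_1": 12.0, "feature_2": 6.0, "feature_3": 2.0}

normalized = normalize(raw_gains)

# Rank features from most to least important.
ranked = sorted(normalized.items(), key=lambda item: item[1], reverse=True)
for name, share in ranked:
    print(f"{name}: {share:.0%}")
```

Normalized importances are easier to compare across pipelines, because they do not depend on the absolute scale of the gains.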

__Column importances__

Because getML uses relational learning, we can apply the principles we used to calculate the feature importances to individual columns as well.

In [None]:
names, importances = pipe.columns.importances()

plt.subplots(figsize=(20, 10))

plt.bar(names, importances, color='#6829c2')

plt.title('Column importances')
plt.xlabel('Columns')
plt.ylabel('Importances')
plt.xticks(rotation='vertical')
plt.show()

## 3. Conclusion

...

## References

...

# Next Steps

This tutorial showed how to use getML to predict retail sales from relational data by combining past sales with weather records.

If you are interested in further real-world applications of getML, head back to the [notebook overview](welcome.md) and choose one of the remaining examples.

Here is some additional material from our [documentation](https://docs.getml.com/latest/) if you want to learn more about getML:
* [Feature learning with Multirel](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#multirel)
* [Feature learning with Relboost](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#relboost)

# Get in contact

If you have any questions, schedule a [call with Alex](https://go.getml.com/meetings/alexander-uhlig/getml-demo), the co-founder of getML, or write us an [email](mailto:team@getml.com). Prefer a private demo of getML? Just contact us to make an appointment.