# Rossmann Store Sales

## Forecast sales using store, promotion, and competitor data

[Rossmann Store Sales](https://www.kaggle.com/c/rossmann-store-sales)

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

In their first Kaggle competition, Rossmann is challenging you to predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what’s most important to them: their customers and their teams! 
- - -

In [5]:
import pandas as pd
import numpy as np

## 1) Loading the Data

In [9]:
print("Load the training, test and store data using pandas")
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
store = pd.read_csv("data/store.csv")

Load the training, test and store data using pandas




In [10]:
test.head()

Unnamed: 0,Id,Store,DayOfWeek,Date,Open,Promo,StateHoliday,SchoolHoliday
0,1,1,4,2015-09-17,1,1,0,0
1,2,3,4,2015-09-17,1,1,0,0
2,3,7,4,2015-09-17,1,1,0,0
3,4,8,4,2015-09-17,1,1,0,0
4,5,9,4,2015-09-17,1,1,0,0


In [11]:
train.head()

Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday
0,1,5,2015-07-31,5263,555,1,1,0,1
1,2,5,2015-07-31,6064,625,1,1,0,1
2,3,5,2015-07-31,8314,821,1,1,0,1
3,4,5,2015-07-31,13995,1498,1,1,0,1
4,5,5,2015-07-31,4822,559,1,1,0,1


In [12]:
store.head()

Unnamed: 0,Store,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
0,1,c,a,1270,9,2008,0,,,
1,2,a,a,570,11,2007,1,13.0,2010.0,"Jan,Apr,Jul,Oct"
2,3,a,a,14130,12,2006,1,14.0,2011.0,"Jan,Apr,Jul,Oct"
3,4,c,c,620,9,2009,0,,,
4,5,a,a,29910,4,2015,0,,,


In [14]:
test.describe()

Unnamed: 0,Id,Store,DayOfWeek,Open,Promo,SchoolHoliday
count,41088.0,41088.0,41088.0,41077.0,41088.0,41088.0
mean,20544.5,555.899533,3.979167,0.854322,0.395833,0.443487
std,11861.228267,320.274496,2.015481,0.352787,0.489035,0.496802
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,10272.75,279.75,2.0,1.0,0.0,0.0
50%,20544.5,553.5,4.0,1.0,0.0,0.0
75%,30816.25,832.25,6.0,1.0,1.0,1.0
max,41088.0,1115.0,7.0,1.0,1.0,1.0


In [15]:
train.describe()

Unnamed: 0,Store,DayOfWeek,Sales,Customers,Open,Promo,SchoolHoliday
count,1017209.0,1017209.0,1017209.0,1017209.0,1017209.0,1017209.0,1017209.0
mean,558.429727,3.998341,5773.818972,633.145946,0.830107,0.381515,0.178647
std,321.908651,1.997391,3849.926175,464.411734,0.375539,0.485759,0.383056
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,280.0,2.0,3727.0,405.0,1.0,0.0,0.0
50%,558.0,4.0,5744.0,609.0,1.0,0.0,0.0
75%,838.0,6.0,7856.0,837.0,1.0,1.0,0.0
max,1115.0,7.0,41551.0,7388.0,1.0,1.0,1.0


In [16]:
store.describe()

Unnamed: 0,Store,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear
count,1115.0,1112.0,761.0,761.0,1115.0,571.0,571.0
mean,558.0,5404.901079,7.224704,2008.668857,0.512108,23.595447,2011.763573
std,322.01708,7663.17472,3.212348,6.195983,0.500078,14.141984,1.674935
min,1.0,20.0,1.0,1900.0,0.0,1.0,2009.0
25%,279.5,717.5,4.0,2006.0,0.0,13.0,2011.0
50%,558.0,2325.0,8.0,2010.0,1.0,22.0,2012.0
75%,836.5,6882.5,10.0,2013.0,1.0,37.0,2013.0
max,1115.0,75860.0,12.0,2015.0,1.0,50.0,2015.0


### What are the different features?

- - -

## 2) Exploratory Data Analysis & Preliminary Data Visualization

Look at [Rossmann Exploratory Analysis](https://www.kaggle.com/thie1e/rossmann-store-sales/exploratory-analysis-rossmann) and [Basic Visualization](https://www.kaggle.com/shadimoodad/rossmann-store-sales/basic-visualization).

* Create a scatterplot matrix to visualize the pairwise correlations between the features in this dataset in one place
* Create a correlation matrix to quantify the linear relationships between the features
* Compute descriptive statistics
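
A minimal sketch of the correlation-matrix step. The frame `toy` is made up for illustration: its column names mirror `train`, but the values are not a sample of the real data. On the actual dataset the same call would be applied to the numeric columns of `train`; a scatterplot matrix could be drawn with `pandas.plotting.scatter_matrix`, which additionally requires matplotlib.

```python
import pandas as pd

# Hypothetical toy data; column names follow the train.head() preview.
toy = pd.DataFrame({
    "Sales":     [5263, 6064, 8314, 13995, 4822, 0],
    "Customers": [555, 625, 821, 1498, 559, 0],
    "Promo":     [1, 1, 1, 1, 0, 0],
    "Open":      [1, 1, 1, 1, 1, 0],
})

# Pearson correlation coefficient for every pair of numeric columns.
corr = toy.corr()
print(corr.round(2))

# Rank the other columns by the strength of their linear relationship with Sales.
ranking = corr["Sales"].drop("Sales").abs().sort_values(ascending=False)
print(ranking)
```

Descriptive statistics for the same frame are available through `toy.describe()`, as already used on `train`, `test` and `store`.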
- - -

## 3) Data Preprocessing

### Dealing with missing data

### Removing observations or features with missing values

### Imputing missing values

Look at [https://www.kaggle.com/nsecord/rossmann-store-sales/filling-gaps-in-the-training-set](https://www.kaggle.com/nsecord/rossmann-store-sales/filling-gaps-in-the-training-set)
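
The two options named in the headings above, removing and imputing, can be sketched on a small frame. `toy_store` is a hypothetical excerpt shaped like `store.csv`, and median imputation is only one possible choice, used here as an assumption rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt shaped like store.csv; values are illustrative.
toy_store = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "CompetitionDistance": [1270.0, 570.0, np.nan, 620.0],
    "Promo2SinceWeek": [np.nan, 13.0, 14.0, np.nan],
})

# Count the missing values in each column.
print(toy_store.isnull().sum())

# Option 1: drop every row that contains at least one missing value.
dropped = toy_store.dropna(axis=0)

# Option 2: impute a column, here with its median.
imputed = toy_store.copy()
median = imputed["CompetitionDistance"].median()
imputed["CompetitionDistance"] = imputed["CompetitionDistance"].fillna(median)
print(imputed)
```

Dropping rows is simple but discards three of the four toy rows here, which illustrates why imputation is often preferred when many rows have gaps.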

#### Is there any categorical Data?

* Ordinal Features?
* Converting class labels to integers
* Should we create a new dummy feature for each unique value in the nominal feature column (using one-hot encoding)?
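
A hedged sketch of both encodings on a hypothetical excerpt of `store`. Treating `Assortment` as ordinal with the order a < b < c is an assumption made for illustration, as is the new column name `AssortmentLevel`; `StoreType` is treated as nominal and one-hot encoded with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical excerpt shaped like store.csv.
toy_store = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "StoreType": ["c", "a", "a", "d"],
    "Assortment": ["a", "a", "c", "b"],
})

# Assumption: Assortment is ordinal (a < b < c), so map it to integers.
assortment_order = {"a": 0, "b": 1, "c": 2}
toy_store["AssortmentLevel"] = toy_store["Assortment"].map(assortment_order)

# StoreType is nominal: create one dummy column per unique value.
encoded = pd.get_dummies(toy_store, columns=["StoreType"], prefix="StoreType")
print(encoded.columns.tolist())
```

An integer mapping implies an order, so it suits ordinal features only; for nominal features, dummy columns avoid suggesting that one category is "larger" than another.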



### Using Pipelines for multiple preprocessing steps

### Partition dataset into training and test sets

"If we are dividing a dataset into training and test datasets, we have to keep in mind that we are withholding valuable information that the learning algorithm could benfit from. Thus, we don't want to allocate too much information to the test set. However, the smaller the test set, the more inaccurate the estimation of the generalization error. Dividing a dataset into training and test sets is all about balancing this trade-off. In practice, the most commonly used splits are 60:40, 70:30, or 80:20, depending on the size of the initial dataset. However, for large datasets, 90:10 or 99:1 splits into training and test subsets are also common and appropriate. Instead of discarding the allocated test data after model training and evaluation, it is a good idea to retrain a classi er on the entire dataset for optimal performance."

### Evaluation Function

[Evaluation function](https://www.kaggle.com/c/rossmann-store-sales/forums/t/16908/evaluation-function/95532)
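
The competition is scored with the root mean square percentage error (RMSPE). Below is a small NumPy sketch, assuming that rows with zero actual sales are excluded from the score, as the competition's evaluation rules describe. The function name `rmspe` is illustrative and not taken from the linked thread.

```python
import numpy as np


def rmspe(y_true, y_pred):
    """Root mean square percentage error, skipping rows where y_true is 0.

    RMSPE = sqrt(mean(((y_true - y_pred) / y_true) ** 2))
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0  # zero-sales rows would divide by zero
    pct_error = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return float(np.sqrt(np.mean(pct_error ** 2)))


# Two scored rows with a 10% error each; the zero-sales row is ignored.
print(rmspe([100, 200, 0], [110, 180, 50]))
```

Because the error is relative, a miss of 10 units on a day with 100 in sales counts as much as a miss of 100 units on a day with 1,000.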

### Feature Scaling

There are two common approaches to bringing different features onto the same scale: **normalization** and **standardization**.

Typically, normalization refers to the rescaling of the features to a range of [0, 1], which is a special case of min-max scaling. To normalize our data, we can simply apply the min-max scaling to each feature column.

In [2]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
# X_train_norm = mms.fit_transform(X_train)
# X_test_norm = mms.transform(X_test)

Although normalization via min-max scaling is a commonly used technique that is useful when we need values in a bounded interval, standardization can be more practical for many machine learning algorithms. The reason is that many linear models, such as logistic regression and support vector machines, initialize the weights to 0 or small random values close to 0. 

Using standardization, we center the feature columns at mean 0 with standard deviation 1, so that each column has the same mean and variance as a standard normal distribution (the shape of the distribution itself is unchanged). This makes it easier to learn the weights.

Furthermore, standardization maintains useful information about outliers and makes the algorithm less sensitive to them in contrast to min-max scaling, which scales the data to a limited range of values.

In [3]:
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
# X_train_std = stdsc.fit_transform(X_train)
# X_test_std = stdsc.transform(X_test)

It is important to note that we fit the `StandardScaler` only once, on the training data, and reuse those parameters to transform the test set or any new data point.
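
A runnable sketch of that rule, with small hypothetical arrays standing in for `X_train` and `X_test`: the scaler learns its mean and scale from the training rows only, and the test rows are transformed with those same parameters.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test matrices; stand-ins for X_train and X_test.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = np.array([[2.5], [10.0]])

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)  # learns mean_ and scale_ from train
X_test_std = stdsc.transform(X_test)        # reuses the training parameters

print(stdsc.mean_, stdsc.scale_)
print(X_test_std.ravel())
```

Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`), so the two give slightly different values on small samples.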

The following table illustrates the difference between the two commonly used feature scaling techniques, standardization and normalization on a simple sample dataset consisting of numbers 0 to 5:

In [6]:
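ex = pd.DataFrame([0, 1, 2, 3, 4, 5])

# standardize: zero mean, unit standard deviation
# (pandas' .std() uses the sample standard deviation, ddof=1)
ex[1] = (ex[0] - ex[0].mean()) / ex[0].std()
# normalize: min-max scaling to the range [0, 1]
ex[2] = (ex[0] - ex[0].min()) / (ex[0].max() - ex[0].min())
ex.columns = ['input', 'standardized', 'normalized']
ex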

Unnamed: 0,input,standardized,normalized
0,0,-1.336306,0.0
1,1,-0.801784,0.2
2,2,-0.267261,0.4
3,3,0.267261,0.6
4,4,0.801784,0.8
5,5,1.336306,1.0


- - -

## 4) Feature Selection

If we notice that a model performs much better on a training dataset than on the test dataset, this observation is a strong indicator of overfitting. Overfitting means that the model fits the parameters too closely to the particular observations in the training dataset but does not generalize well to real data; we say that the model has high variance. A reason for overfitting is that our model is too complex for the given training data. Common solutions to reduce the generalization error are listed as follows:

* Collect more training data
* Introduce a penalty for complexity via regularization
* Choose a simpler model with fewer parameters
* Reduce the dimensionality of the data

Collecting more training data is often not applicable. In the following sections and subsections, we will look at common ways to reduce overfitting by regularization and dimensionality reduction via feature selection.

### Feature selection algorithms in scikit-learn

There are many more feature selection algorithms available via scikit-learn. Those include recursive backward elimination based on feature weights, tree-based methods to select features by importance, and univariate statistical tests. A good summary with illustrative examples can be found at [http://scikit-learn.org/stable/modules/feature_selection.html](http://scikit-learn.org/stable/modules/feature_selection.html).
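
As one example of the univariate tests mentioned above, here is a sketch using scikit-learn's `SelectKBest` with the `f_regression` score. The data is synthetic and built so that only columns 0 and 3 influence the target; that setup is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for illustration): only columns 0 and 3 drive y.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Keep the k features with the highest univariate F-statistic.
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.get_support(indices=True))  # indices of the kept columns
print(X_selected.shape)
```

Univariate tests score each feature on its own, so they are fast but cannot detect features that only matter in combination with others.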

### Sequential feature selection algorithms

An alternative way to reduce the complexity of the model and avoid overfitting is dimensionality reduction via feature selection, which is especially useful for unregularized models. There are two main categories of dimensionality reduction techniques: feature selection and feature extraction. Using feature selection, we select a subset of the original features. In feature extraction, we derive information from the feature set to construct a new feature subspace.

Sequential feature selection algorithms are a family of greedy search algorithms that are used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace where k < d. The motivation behind feature selection algorithms is to automatically select a subset of features that are most relevant to the problem to improve computational efficiency or reduce the generalization error of the model by removing irrelevant features or noise, which can be useful for algorithms that don't support regularization. A classic sequential feature selection algorithm is Sequential Backward Selection (SBS), which aims to reduce the dimensionality of the initial feature subspace with a minimum decay in performance of the classifier to improve upon computational efficiency. In certain cases, SBS can even improve the predictive power of the model if a model suffers from overfitting.
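
A backward selection of this kind can be sketched with scikit-learn's `SequentialFeatureSelector` (available in recent scikit-learn releases), used here with `direction="backward"` in place of a hand-written SBS. The synthetic data, in which only columns 0 and 2 matter, and the choice of `LinearRegression` as the estimator are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): only columns 0 and 2 drive y.
rng = np.random.RandomState(1)
X = rng.normal(size=(150, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=150)

# Start with all 5 features and greedily remove the one whose removal
# hurts the cross-validated score the least, until 2 features remain.
sbs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="backward",
    cv=5,
)
sbs.fit(X, y)

print(sbs.get_support(indices=True))  # indices of the retained features
```

The search is greedy: once a feature is removed it is never reconsidered, which keeps the cost manageable but does not guarantee the best possible subset.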

### Assessing feature importance with random forests

A useful approach to select relevant features from a dataset is to use a random forest, an ensemble technique. 

Using a random forest, we can measure feature importance as the averaged impurity decrease computed from all decision trees in the forest without making any assumptions whether our data is linearly separable or not. 

Conveniently, the random forest implementation in scikit-learn already collects feature importances for us, so we can access them via the `feature_importances_` attribute after fitting a `RandomForestClassifier` (or, for a regression target such as sales, a `RandomForestRegressor`).
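
A minimal sketch of reading `feature_importances_`, using a `RandomForestRegressor` because sales is a continuous target. The data and the feature names `strong`, `weak` and `noise` are synthetic and hypothetical; they only illustrate how the ranking is read.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the first feature dominates, the third is pure noise.
rng = np.random.RandomState(0)
feature_names = ["strong", "weak", "noise"]  # hypothetical labels
X = rng.rand(300, 3)
y = 10.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Importances are normalized to sum to 1; print them from largest to smallest.
for idx in np.argsort(forest.feature_importances_)[::-1]:
    print(f"{feature_names[idx]:<10} {forest.feature_importances_[idx]:.3f}")
```

One caveat worth keeping in mind: when two features are highly correlated, the forest may rank one of them high and the other low even though both carry the same information.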

### What about external features such as weather, etc.?

Look at:
* https://www.kaggle.com/c/rossmann-store-sales/forums/t/17229/external-data-and-other-information
* https://www.kaggle.com/c/rossmann-store-sales/forums/t/17048/putting-stores-on-the-map
* [Clarification on "future" external data](https://www.kaggle.com/c/rossmann-store-sales/forums/t/16905/clarification-on-future-external-data)

- - -

## 5) Linear Regression Models

### Ordinary Least Squares Linear Regression

### Cross Validation

#### Implementation of OLS regression

### Sparse solutions with Lasso (L1 regularization)

[Move and group with Ridge and ElasticNet?]

L1 regularization yields sparse weight vectors; many feature weights will be exactly zero. Sparsity can be useful in practice if we have a high-dimensional dataset with many features that are irrelevant, especially in cases where we have more irrelevant dimensions than samples. In this sense, L1 regularization can be understood as a technique for feature selection.

#### Implementation of Lasso linear regression
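
A minimal sketch with scikit-learn's `Lasso`, on synthetic data in which only the first two of ten features influence the target. The data and the value `alpha=0.5` are assumptions for illustration; on real data the regularization strength would normally be tuned.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of ten features influence the target.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# alpha controls the strength of the L1 penalty.
lasso = Lasso(alpha=0.5)
lasso.fit(X, y)

print(np.round(lasso.coef_, 2))
print("non-zero weights:", int(np.sum(lasso.coef_ != 0)))
```

The weights of the irrelevant features are driven to exactly zero, while the two relevant weights are kept but shrunk towards zero by the penalty.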

#### Plot regularization path

### Ridge Regression

### Elastic Net

### Turning a linear regression model into a curve – polynomial regression

## 5B) Additional ML Algorithms 

* Support Vector Machines
* Decision Trees
* Random Forests
* Genetic Algorithms?

## 5C) Evaluating Models

- - -

## 6) Compressing Data via Dimensionality Reduction

An alternative approach to feature selection for dimensionality reduction is feature extraction. The three fundamental techniques listed below help us summarize the information content of a dataset by transforming it onto a new feature subspace of lower dimensionality than the original one.

Data compression is an important topic in machine learning, and it helps us to store and analyze the increasing amounts of data that are produced
and collected in the modern age of technology.

* Principal component analysis (PCA) for unsupervised data compression
* Linear Discriminant Analysis (LDA) as a supervised dimensionality reduction technique for maximizing class separability
* Nonlinear dimensionality reduction via kernel principal component analysis
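
The first technique, PCA, can be sketched as follows. The data is synthetic: six observed features are built as noisy mixes of two latent factors, and the `mixing` matrix is a hypothetical choice made so that two components capture nearly all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: six features generated from two latent factors plus noise.
rng = np.random.RandomState(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([
    [1.0, 0.8, 0.6, 0.0, 0.2, 0.5],
    [0.0, 0.3, 0.5, 1.0, 0.9, 0.6],
])  # hypothetical mixing weights
X = latent @ mixing + rng.normal(scale=0.05, size=(200, 6))

# PCA is sensitive to feature scales, so standardize first.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)
print(pca.explained_variance_ratio_.sum())
```

`explained_variance_ratio_` reports the share of total variance each component retains, which is a common guide for choosing the number of components.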

- - -

## 7) Model Evaluation and Hyperparameter Tuning

Best practices for building good machine learning models by fine-tuning the algorithms and evaluating the model's performance.

* Obtain unbiased estimates of a model's performance
* Diagnose the common problems of machine learning algorithms
* Fine-tune machine learning models
* Evaluate predictive models using different performance metrics
* Grid Search
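
A minimal sketch of grid search with cross-validation, using scikit-learn's `GridSearchCV`. The `Ridge` estimator, the `alpha` grid and the synthetic data are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data for illustration.
rng = np.random.RandomState(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=120)

# Try each alpha with 5-fold cross-validation and keep the best one.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(
    Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error"
)
search.fit(X, y)

print(search.best_params_)
print("cross-validated MSE:", -search.best_score_)
```

scikit-learn scorers follow a "higher is better" convention, which is why the mean squared error is reported as a negative number and negated again for printing.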
- - -

## 8) Combining Different Models for Ensemble Learning

Explore different methods for constructing a set of classifiers that can often have a better predictive performance than any of its individual members. 

* Make predictions based on majority voting
* Reduce overfitting by drawing random combinations of the training set with repetition
* Build powerful models from weak learners that learn from their mistakes
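
The second idea, drawing random samples of the training set with repetition (bagging), can be sketched for a regression target as below. For regression the ensemble averages the members' predictions instead of taking a majority vote. The sine-shaped synthetic data and the choice of decision trees as base learners are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression problem with noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A single fully grown tree tends to fit the noise in the training data.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Bagging: 50 trees, each fit on a bootstrap sample, predictions averaged.
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0)
bag.fit(X_train, y_train)

print("single tree R^2:", round(tree.score(X_test, y_test), 3))
print("bagged trees R^2:", round(bag.score(X_test, y_test), 3))
```

Averaging many high-variance models typically reduces the variance of the prediction, which is the mechanism by which bagging counters overfitting.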

* [Using Ensembles in Kaggle Data Science Competitions – Part 1](http://www.kdnuggets.com/2015/06/ensembles-kaggle-data-science-competition-p1.html)
* [Using Ensembles in Kaggle Data Science Competitions – Part 2](http://www.kdnuggets.com/2015/06/ensembles-kaggle-data-science-competition-p2.html)
* [Using Ensembles in Kaggle Data Science Competitions – Part 3](http://www.kdnuggets.com/2015/06/ensembles-kaggle-data-science-competition-p3.html)
* [Kaggle Ensembling Guide](http://mlwave.com/kaggle-ensembling-guide/)
- - -

## 9) Embedding a Machine Learning Model into a Web Application

Embed a machine learning model into a web application that can not only classify but also learn from data in real time.

* Saving the current state of a trained machine learning model
* Using SQLite databases for data storage
* Developing a web application using the popular Flask web framework
* Deploying a machine learning application to a public web server
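
The first step, saving the state of a trained model, can be sketched with Python's standard `pickle` module. The tiny `LinearRegression` model is a stand-in, and the example serializes to bytes in memory; writing those bytes to a file would persist the model for a web application to load later.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model fit on a perfectly linear toy relationship (y = 2x + 1).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

# Serialize to bytes; writing these bytes to a file would persist the model.
blob = pickle.dumps(model)

# Later, for example inside the web application, restore and predict.
restored = pickle.loads(blob)
print(restored.predict(np.array([[4.0]])))
```

A pickled model should only be loaded from a trusted source and ideally with the same scikit-learn version that created it, since unpickling can execute arbitrary code and the format is not guaranteed to be stable across versions.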
- - -

## 10) Boosting Algorithms

* [Quick Introduction to Boosting Algorithms in Machine Learning](http://www.analyticsvidhya.com/blog/2015/11/quick-introduction-boosting-algorithms-machine-learning/)
* [XGBoost](https://github.com/dmlc/xgboost)
* [XGBoost Documentation](https://github.com/dmlc/xgboost/blob/master/doc/index.md)
- - -

## 11) Neural Networks and Deep Learning Algorithms
- - -

## 12) Model Presentation, Visualization and Interpretation

- - -

### Info about Kaggle

* [6 Tricks I Learned From The OTTO Kaggle Challenge](https://medium.com/@chris_bour/6-tricks-i-learned-from-the-otto-kaggle-challenge-a9299378cd61)
- - -

### Info about running on the cloud

* [Running a notebook server](https://ipython.org/ipython-doc/3/notebook/public_server.html)
- - -