In [1]:
#vector/matrix library
import numpy as np

#data frame library (similar to R)
import pandas as pd

#visualization library
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

#regular expression library for data cleansing
import re

In [None]:
path_full_dataset = <FILL_IN>
df_full = pd.read_csv(path_full_dataset, sep='\t')

#check the data types for the columns and cast if necessary (type information gets lost during serialization)
<FILL_IN>

print(df_full.dtypes)

#read test
df_test = pd.read_csv('test.csv')

# 1. Merging train and test data

* We will create a matrix representation of the dataset we have created so far. As the transformations will be the same for both the train and the test data, merge the two sets together. Note that the test set has an 'id' column, which will be NaN for the training data (that's how you can separate them again at the end).



In [None]:
df_combined = <FILL_IN>
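
As a rough illustration of the merge described above, here is a minimal sketch on invented toy frames. The names `toy_train`, `toy_test` and `toy_combined` and all values are assumptions made for this example, and `pd.concat` is only one possible way to stack the two sets.

```python
import pandas as pd

# Toy stand-ins for the real frames (values are invented for illustration).
toy_train = pd.DataFrame({"day_offset": [0, 1], "quantity_ordered": [5, 7]})
toy_test = pd.DataFrame({"id": [10, 11], "day_offset": [2, 3]})

# Stack the rows; columns missing on one side are filled with NaN.
toy_combined = pd.concat([toy_train, toy_test], ignore_index=True, sort=False)

# Training rows have no 'id', test rows have no target value.
print(toy_combined)
```

The NaN pattern in the 'id' column is what later allows the two sets to be separated again.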

# 2. Simple Linear Regression Model

#### X: features

* weekend: boolean value extracted from day of the week
* quarter: extracted from month (first 3 months are Q1,... last 3 are Q4)
* province: can be extracted from the province
* animal_type
* day_offset: can be calculated as the number of days since the first order

#### y: sum of quantities ordered for a certain combination of features


## A. Creating the training/test matrix



Machine learning models can only deal with numerical data. Now we will map our dataframe onto a numerical representation (i.e. numpy arrays).

From now on scikit-learn will be our best friend. This is the de facto library for training models in Python. Other specialized libraries exist, for example focusing on neural networks, but scikit-learn will often be your first resort.

Sklearn follows a very simple approach: http://scikit-learn.org/stable/data_transforms.html

- **Estimators** have a fit method, which learns model parameters from a dataset
- **Transformers** have a transform method which applies this transformation model to unseen data. Making a prediction is considered a transformation, but also scaling the data for example!
- fit_transform combines the above
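
The fit / transform pattern above can be sketched with a scaler on invented numbers. The arrays `seen` and `unseen` are hypothetical; the point is only that the parameters are learned once and then reused.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

seen = np.array([[0.0], [5.0], [10.0]])   # data used to learn the parameters
unseen = np.array([[2.5], [20.0]])        # data the scaler has never seen

scaler = MinMaxScaler()                   # default feature_range is (0, 1)
scaler.fit(seen)                          # learns min=0 and max=10 from 'seen'
scaled_unseen = scaler.transform(unseen)  # applies the learned mapping

# fit_transform is shorthand for fit followed by transform on the same data.
scaled_seen = MinMaxScaler().fit_transform(seen)
```

Note that a value outside the range seen during `fit` (here 20.0) is mapped outside [0, 1], because the scaler only reuses the minimum and maximum it learned.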

Two important data preparation steps are:

- Continuous variables should be **scaled** (map all values to a fixed range such as [0,1]); this tends to improve the performance of many learning algorithms (sklearn `MinMaxScaler`)

- Categorical data should be mapped on a binary scale {0,1} (sklearn `LabelEncoder`). If you have more than two categories, the recommended technique for mapping is called **one-hot-encoding** (the pandas routine `get_dummies` can help you with that)




In [19]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
#pd.get_dummies(series, prefix='...')

### 1. day_offset is a continuous variable which could be mapped on [0,1]

* convert using the MinMaxScaler

In [20]:
minmax = MinMaxScaler()
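
A minimal sketch of how day_offset could be derived and scaled, assuming the order dates live in a datetime column. The column name `order_date` and the dates are hypothetical and may not match the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical order dates; the real column name may differ.
toy = pd.DataFrame({"order_date": pd.to_datetime(
    ["2017-01-01", "2017-01-11", "2017-01-21"])})

# Number of days since the first order.
toy["day_offset"] = (toy["order_date"] - toy["order_date"].min()).dt.days

# MinMaxScaler expects a 2-D input, hence the double brackets.
minmax = MinMaxScaler()
toy["day_offset_scaled"] = minmax.fit_transform(toy[["day_offset"]]).ravel()
```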

### 2. One-hot-encoding

* `pd.get_dummies(series)`
* OHE for categorical data: animal types, province

**HINT1** As a lot of additional columns are created during OHE, it can get difficult to keep track of where the variables originally came from. The get_dummies function allows you to prefix the new column names. Here you could use 'animal' for example. Then you'd get columns: animal_chicken, animal_cow,...

**HINT2** The output of the OHE call is again a dataframe. Merge with the original frame using pandas concat function:



`pd.concat([df1, df2], axis=1)` 

**NOTE** If you only have a binary categorical variable you can also use `LabelEncoder`
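
The two hints and the note above could be combined roughly as follows. The frame `toy` and its values are invented, and storing 'weekend' as "yes"/"no" strings is an assumption made only to show `LabelEncoder`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "animal_type": ["chicken", "cow", "chicken"],
    "weekend": ["no", "yes", "no"],
})

# One-hot encode with a prefix so the origin of each column stays visible.
animal_ohe = pd.get_dummies(toy["animal_type"], prefix="animal")

# Attach the new columns next to the original ones.
toy = pd.concat([toy, animal_ohe], axis=1)

# A binary categorical variable can be mapped to {0, 1} with LabelEncoder.
toy["weekend_encoded"] = LabelEncoder().fit_transform(toy["weekend"])
```

`LabelEncoder` assigns the integer codes in sorted order of the class labels, so here "no" becomes 0 and "yes" becomes 1.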


### 3. Additional features can be calculated...

- 'weekend' is a boolean variable which can be easily extracted from 'weekday'
- 'quarter' can be easily extracted from 'month'

**HINT:** Simply use pandas to create new columns via the following template:
`df['new_column'] = df['column_name'].apply(lambda x: f(x))`
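
Applied to the two features above, the template might look like this. It assumes 'weekday' is encoded as 0 = Monday through 6 = Sunday and 'month' as 1 through 12; check the actual encoding in your data before reusing it.

```python
import pandas as pd

toy = pd.DataFrame({"weekday": [0, 5, 6], "month": [1, 6, 12]})

# Saturday (5) and Sunday (6) count as weekend under the assumed encoding.
toy["weekend"] = toy["weekday"].apply(lambda d: d >= 5)

# Months 1-3 -> Q1, 4-6 -> Q2, 7-9 -> Q3, 10-12 -> Q4.
toy["quarter"] = toy["month"].apply(lambda m: (m - 1) // 3 + 1)
```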

## B. Linear Regression Model

* Split training and test data based on the presence/absence of a value in the id column

In [36]:
df_train = <FILL_IN>
df_test = <FILL_IN>
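
One possible way to do this split, sketched on an invented combined frame. The names and values are hypothetical, and dropping the unused column on each side is a design choice rather than a requirement.

```python
import numpy as np
import pandas as pd

toy_combined = pd.DataFrame({
    "id": [np.nan, np.nan, 10.0, 11.0],
    "day_offset": [0.0, 0.5, 0.75, 1.0],
    "quantity_ordered": [5.0, 7.0, np.nan, np.nan],
})

# Rows without an id came from the training set, rows with an id from the test set.
toy_train = toy_combined[toy_combined["id"].isna()].drop(columns=["id"])
toy_test = toy_combined[toy_combined["id"].notna()].drop(columns=["quantity_ordered"])
```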


* Convert your dataframes to numpy matrices via `df.values` 

In [39]:
X_train = <FILL_IN>
y_train = <FILL_IN>

X_test = <FILL_IN>
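
A small sketch of the conversion on toy data. The list `feature_columns` is an assumed feature set; in the exercise it would contain whatever columns you engineered above.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "day_offset": [0.0, 0.5, 1.0],
    "weekend": [0, 1, 0],
    "quantity_ordered": [5.0, 7.0, 9.0],
})

feature_columns = ["day_offset", "weekend"]   # assumed feature set
X_toy = toy_train[feature_columns].values     # shape (n_samples, n_features)
y_toy = toy_train["quantity_ordered"].values  # shape (n_samples,)
```

Selecting the feature columns explicitly keeps the target (and the 'id' column) out of the feature matrix.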


* initialize a linear model from sklearn:

In [None]:
from sklearn import linear_model
lin_reg = linear_model.LinearRegression()

* Fit your model to the data

In [40]:
lin_reg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
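
To get a feel for what fitting does, here is a sketch on synthetic data where the true relation is known. The data are generated from y = 2x + 1 purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 2 * x + 1, so the fit is exact.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 2.0 * X_toy.ravel() + 1.0

toy_reg = LinearRegression()
toy_reg.fit(X_toy, y_toy)

print(toy_reg.coef_, toy_reg.intercept_)   # learned slope and intercept
toy_pred = toy_reg.predict(np.array([[4.0]]))
```

On real data the coefficients in `coef_` give a first impression of how strongly each feature influences the prediction.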

* With this model we can now predict the ordered quantities for the test data

In [43]:
y_predict = lin_reg.predict(X_test)

* Merge your predictions with the df_test dataframe and prepare a Kaggle submission

In [None]:
df_test['quantity_ordered'] = y_predict

In [46]:
df_test[['id', 'quantity_ordered']].to_csv('./submissions/LR1.csv', sep=',', header=True, index=False)

# ...Make your first Kaggle Submission and prepare for GLORY!

# Offline Model Evaluation

### Train / Test - Cross Validations - or?

* In a machine learning task, part of the data is typically used for training the model (training set) and part of the data is used for offline evaluation (test set).

* An even better approach is to opt for <a href="http://scikit-learn.org/stable/modules/cross_validation.html"> K-fold cross validation </a>. In this case you divide your data into K equal fractions. K-1 fractions are used for training and 1 is used for evaluation; this can be repeated K times.

* Depending on the nature of the features you used, it might be necessary to make a 'causal' split of the data. In this case your training data consists of all the orders before a date T, and the test set consists of the days after T. This might show very different behaviour, but might in fact be the most realistic choice for a forecasting model. Further reading about **data leakage**: https://www.kaggle.com/wiki/Leakage
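
The K-fold and causal strategies above can be compared on synthetic data. In this sketch the first feature column plays the role of a scaled time axis, and the threshold `T = 0.8` and all data are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)
X_toy = rng.uniform(size=(100, 2))
X_toy[:, 0] = np.sort(X_toy[:, 0])   # first column plays the role of scaled time
y_toy = 3.0 * X_toy[:, 0] + X_toy[:, 1] + rng.normal(scale=0.1, size=100)

# K-fold cross validation: 5 folds, each used once for evaluation.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=cv,
                         scoring="neg_mean_squared_error")
mse_per_fold = -scores   # scikit-learn reports negated MSE, so flip the sign

# Causal split: train on everything before T, evaluate on everything after.
T = 0.8
before = X_toy[:, 0] < T
causal_model = LinearRegression().fit(X_toy[before], y_toy[before])
causal_pred = causal_model.predict(X_toy[~before])
```

With shuffled K-fold, rows from the "future" end up in the training folds, which is exactly the kind of leakage the causal split avoids.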

### Hands-first approach

* While you are used to in-depth theoretical coverage before doing anything hands-on, I personally think it is interesting to flip this around: treat the algorithms as black boxes and learn more about them as you go. As a first overview of what is available, I suggest taking a tour of the supervised methods starting here: http://scikit-learn.org/stable/supervised_learning.html . Focus on the regression algorithms, as we are trying to predict a continuous variable. For a higher-level overview of some popular ML algorithms, read for example this post on KDnuggets: https://www.kdnuggets.com/2016/08/10-algorithms-machine-learning-engineers.html/2

### Some algorithms...


* Basic algorithms which I recommend you to play with are:

    - <a href="http://scikit-learn.org/stable/modules/linear_model.html"> Linear Regression </a>
    - <a href="http://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-regression"> Nearest Neighbours </a>
    - <a href="http://scikit-learn.org/stable/modules/svm.html#regression"> SVMs </a>
    - <a href="http://scikit-learn.org/stable/modules/tree.html#regression"> Decision Trees </a>
    - <a href="http://scikit-learn.org/stable/modules/ensemble.html#random-forests"> Random Forests </a>: very important here is that RFs can be used for feature selection, you can ask them what are the features of highest importance.
    - <a href="http://scikit-learn.org/stable/modules/ensemble.html#"> Ensemble Methods </a> (example: creating your own ensemble where each algorithm gets voting rights)
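
Two of the ideas in this list, feature importances from a random forest and a voting ensemble, can be sketched on synthetic data. The data are invented so that only the first feature matters, and `VotingRegressor` assumes scikit-learn 0.21 or later; with an older version you would have to average the predictions yourself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X_toy = rng.uniform(size=(200, 3))
# Only the first feature carries signal; the other two are noise.
y_toy = 5.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_toy, y_toy)
importances = forest.feature_importances_   # one value per feature, sums to 1

# A simple ensemble that averages the predictions of several regressors.
ensemble = VotingRegressor([
    ("lr", LinearRegression()),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
])
ensemble.fit(X_toy, y_toy)
ensemble_pred = ensemble.predict(X_toy[:5])
```

Sorting the feature names by `importances` gives a quick ranking that can guide feature selection.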

### Diagnostics: Learning curves

* Learning curves show the error on the training vs. test data as a function of the training set size. These curves tell you whether your model has too many parameters and is overfitting (small training error but large test error) or whether it is too simple and underfitting (large training error and a similarly large test error). A blog post which visualizes this concept: https://www.dataquest.io/blog/learning-curves-machine-learning/

* sklearn page: http://scikit-learn.org/stable/modules/learning_curve.html

* sklearn page about model evaluation: http://scikit-learn.org/stable/modules/model_evaluation.html
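
A minimal sketch of computing a learning curve with scikit-learn on synthetic data. The data, the five training sizes and the choice of MSE as the metric are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

rng = np.random.RandomState(0)
X_toy = rng.uniform(size=(200, 2))
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.1, size=200)

train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X_toy, y_toy,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="neg_mean_squared_error",
)

# Average over the folds and flip the sign to get plain MSE per training size.
train_mse = -train_scores.mean(axis=1)
valid_mse = -valid_scores.mean(axis=1)
```

Plotting `train_mse` and `valid_mse` against `train_sizes` (for example with `plt.plot`) gives the two curves described above.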

### And don't hesitate to discuss and ask questions, both on Minerva and on Kaggle.

* If you have questions I can invite Geert Jacobs to the Kaggle forum, you can also contact him directly (he told me he doesn't mind => geert.jacobs@actemium.com)


In [48]:
#how to import these...
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

#algorithms
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import VotingRegressor  #regression counterpart of VotingClassifier (scikit-learn >= 0.21)

#model evaluation
from sklearn.model_selection import learning_curve
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
