> **Note:** Every week, you will be solving exercises posed in a Jupyter notebook that looks like this one. Because you are cloning a GitHub repository that only we can push to, you should **NOT EDIT** any of the files you pull from GitHub. Instead, either make a new notebook and write your solutions there, or **make a copy of this notebook and save it somewhere else** on your computer, outside the `tsds` folder that you cloned, and write your answers in the copy. If you edit the notebook you pulled from GitHub, those edits (possibly your solutions to the exercises) may be overwritten and lost the next time you pull from GitHub. This is important, so don't hesitate to ask if it is unclear.

# Week 3: Data modelling

*Thursday, February 22, 2018*

In this part of today's session you will implement machine learning models:
- Exercise 3.1: supervised regression
- Exercise 3.2: supervised classification
- Exercise 3.3: unsupervised learning with k-means and principal components

We begin with loading the standard packages:

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns
%matplotlib inline

## Exercises

### Part 3.1: Predicting tips

In this part of the exercise we will implement a machine learning model for predicting tips. We will use the same data as in Exercise 1.

> **Ex. 3.1.0**: Load the tips data from Seaborn. This can be done with `load_dataset` giving it `'tips'` as input.

In [2]:
tips = sns.load_dataset('tips')

#### Structuring the data

We have loaded a simple dataset. We are interested in modelling the tips as a function of the dining characteristics (bill size, participants, time). Thus `tips` is our labelled data which we will use in our supervised machine learning problem.

Although the data is more or less structured, we need to tweak it a little. Linear models that use regularization need data that is normalized, i.e. put on a common scale, because the penalty treats all coefficients alike. Furthermore, linear models cannot handle categorical data directly, so it must be replaced by dummy variables.
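As a minimal sketch of these two preparation steps, the snippet below rescales numeric columns to mean zero and unit standard deviation and replaces a categorical column by dummies. The toy DataFrame and its column names (`bill`, `party`, `time`) are made up for illustration and are not the tips data. Note that pandas' `std` uses the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Hypothetical toy data with two numeric columns and one categorical column.
toy = pd.DataFrame({
    "bill": [10.0, 20.0, 30.0, 40.0],
    "party": [1, 2, 2, 3],
    "time": ["Lunch", "Dinner", "Dinner", "Lunch"],
})

numeric_cols = ["bill", "party"]

# Rescale: subtract the column mean and divide by the column standard deviation.
scaled = toy.copy()
scaled[numeric_cols] = (toy[numeric_cols] - toy[numeric_cols].mean()) / toy[numeric_cols].std()

# Replace the categorical column by dummies. drop_first=True drops one category
# so that the dummies are not perfectly collinear with a constant term.
toy_data = pd.get_dummies(scaled, columns=["time"], drop_first=True)
print(toy_data)
```

Dropping the first category means the remaining dummy is interpreted relative to the dropped one (here `Dinner`).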

> **Ex. 3.1.1**: Standardize (i.e. mean zero, unit std. dev.) the numeric columns, i.e. `'total_bill'`, `'tip'`, `'size'`. Convert categorical/discrete data to dummy variables. Call the final dataset `data`.

> *Hint 1*: `sklearn.preprocessing` has a helpful class called `StandardScaler`. (The function `normalize` in the same module does something different: it rescales each row to unit norm.)

> *Hint 2*: the `get_dummies` method in pandas may be useful with the keyword argument `drop_first=True`.

In [3]:
# [Answer to Ex. 3.1.1]

To evaluate our model, we want to measure its performance on a different sample from the one used to estimate it. That is, we split the data into a training set for estimation and a test set for evaluation.

The size of the training set relative to the test set depends on the number of observations you have. The more observations you have, the larger the share you can allocate to training. Because we have a small dataset, i.e. fewer than 1,000 observations, we should use a large share of test data, around 30 pct.
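A 70/30 split can be sketched with the DataFrame `sample` method as follows. The toy DataFrame is hypothetical; the point is only that the test rows are drawn at random and the training rows are whatever remains.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with ten observations.
toy = pd.DataFrame({"x": np.arange(10), "y": 2 * np.arange(10)})

# Draw 30 pct. of the rows at random as test data; fixing random_state
# makes the split reproducible.
test = toy.sample(frac=0.3, random_state=42)

# The training data is every row that was not drawn for the test set.
train = toy.drop(test.index)

print(len(train), len(test))
```

Keeping `test.index` around is useful later: the same indices can be reused to split any other feature set built from the same rows.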

> **Ex. 3.1.2**: Specify the target variable **y** as `'tip'` and **X** as the remaining variables. Then split the dataset into a training set and a test set. The respective sizes of the train and test data should be 70 pct. and 30 pct.

> *Hint*: try using the `sample` method available to dataframes. Use `random_state` = 42.

In [5]:
# [Answer to Ex. 3.1.2]

#### Fitting a model of tips

Having prepared the data, we can now begin the modelling. We want to compare ordinary least squares (OLS) against the Lasso. Note that in `sklearn` the regularization parameter $\lambda$ is called `alpha`; the reason is that in Python, and in programming in general, `lambda` refers to a specific type of function.
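To illustrate the difference between the two estimators, here is a small sketch on synthetic data (not the tips data). The data-generating process, the choice `alpha=0.5` and the helper `rmse` are assumptions made for this example only. The penalty shrinks the Lasso coefficients and sets the irrelevant ones exactly to zero, whereas OLS keeps small non-zero estimates for them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

# Synthetic data: five features, but only the first two affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Fit both models on the same data. alpha is sklearn's name for lambda.
ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


print("OLS coefficients:  ", ols.coef_.round(3))
print("Lasso coefficients:", lasso.coef_.round(3))
print("In-sample RMSE, OLS:  ", rmse(y, ols.predict(X)))
print("In-sample RMSE, Lasso:", rmse(y, lasso.predict(X)))
```

In-sample, OLS can never have a higher RMSE than the Lasso, since it minimizes exactly that error; the interesting comparison is therefore on held-out test data.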

> **Ex. 3.1.3**: Use the training data to fit models with the explanatory variables. Try both OLS and the Lasso with $\alpha=0.001$.

> *Hint*: OLS is called `LinearRegression` in `sklearn`

In [7]:
# [Answer to Ex. 3.1.3]



> **Ex. 3.1.4**: Compute the predicted tips. Evaluate the models on the test data. What is the root mean squared error (RMSE)?

In [9]:
# [Answer to Ex. 3.1.4]

#### Feature engineering

One of the main challenges of machine learning is creating meaningful variables that may be relevant for prediction. The simplest way to do that is to try different interactions between the variables.
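As a small illustration of what interactions look like, the sketch below uses `PolynomialFeatures` from `sklearn.preprocessing` on a made-up array with three columns. With `interaction_only=True` only the pairwise products are added; with the default `interaction_only=False` the squared terms are included as well.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical feature matrix: two observations, three variables (x0, x1, x2).
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Pairwise interactions only: x0, x1, x2, x0*x1, x0*x2, x1*x2.
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = inter.fit_transform(X)

# Full quadratic feature set: additionally contains x0^2, x1^2, x2^2.
quad = PolynomialFeatures(degree=2, include_bias=False)
X_quad = quad.fit_transform(X)

print(X_inter)
print(X_inter.shape, X_quad.shape)
```

The number of columns grows quickly with the degree, which matters once the number of features approaches the number of observations.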

> **Ex. 3.1.5**: Create interactions between all explanatory variables.

> *Hint*: Try the `PolynomialFeatures` method in sklearn's preprocessing submodule.

In [11]:
# [Answer to Ex. 3.1.5]

> **Ex. 3.1.6**: Repeat the above comparison of OLS and Lasso. What happens to RMSE if we set $\alpha=10^{-6}$? Does Lasso improve the test predictions? Does the $\alpha$ hyperparameter matter?

> Note you need to split the quadratic feature set as well into test and training - use the same split indices.

In [13]:
# [Answer to Ex. 3.1.6]

> **Ex. 3.1.7**: What happens to the Lasso if we compute polynomial features for higher orders of `n`? Try n=1,2,3,...,10. What would happen to OLS if we tried to estimate for higher and higher n - is this feasible? How many variables do we have in the features set for n=10? Do we have more variables than observations?

> *Hint*: We can reuse the above code using a loop.

In [15]:
# [Answer to Ex. 3.1.7]

> **Ex. 3.1.8**: Use ridge regression to make a predictive model using the quadratic feature set. Use $\alpha=10^{-3}$. How does it perform in terms of RMSE?

In [17]:
# [Answer to Ex. 3.1.8]

### Part 3.2: Predicting high or low wage

Download the wage dataset available in the data folder at:

 https://archive.ics.uci.edu/ml/datasets/adult

> **Ex 3.2.1**: Load both train and test data. Ensure na_values are parsed correctly, assign column names and check if some rows should be skipped.

In [19]:
# [Answer to Ex. 3.2.1]

> **Ex 3.2.2:** Structure the data using the following steps:
1. Drop columns that have no relevant information (e.g. encoded both as number and as text).
2. Standardize (i.e. mean zero, unit std. dev.) the continuous variables (e.g. 'age').
3. Ensure wage is encoded correctly.
4. Add dummies for categorical or string variables.
5. Drop rows with any missing data.
6. Partition into y,X as well as test and training sets.

In [21]:
# [Answer to Ex. 3.2.2]

#### Estimating and evaluating classification models

> **Ex 3.2.3:** Estimate logistic regression for $C = 10^{-2}, 10^{-1}, \ldots, 10^{3}, 10^{4}$. Read the documentation of the logistic regression; what is the relationship between $\alpha$ and $C$? What is the overall accuracy of the models?

In [23]:
# [Answer to Ex. 3.2.3]

When evaluating classification models, the overall accuracy is rarely a good indicator of the kinds of errors the model makes. Often we are interested in whether we make Type I or Type II errors. This can be measured by precision and recall. If you are unsure what precision, recall and F1 are, Wikipedia has a good overview that you should read [here](https://en.wikipedia.org/wiki/Precision_and_recall).
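The definitions can be made concrete with a small hand computation. The labels and predictions below are made up for illustration; class 1 plays the role of the positive class.

```python
import numpy as np

# Hypothetical true labels and model predictions (1 = positive class).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 0])

# Counts of the four possible outcomes.
tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives (Type I errors)
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives (Type II errors)
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Here the accuracy of 0.625 looks acceptable, yet the model finds only one of the three actual positives, which is what recall reveals.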

> **Ex 3.2.4:** For each of the estimated models compute precision, recall and F1 for the class `wage_above_50k==1`.

> *Hint*: Although the computation is straightforward, it may be easier to use scikit-learn's built-in functions.

In [25]:
# [Answer to Ex. 3.2.4]

### Part 3.3: Unsupervised learning 

In this exercise we will work with unlabelled data. We will try out two fundamental approaches to unsupervised learning. We will be using one of the classic datasets in machine learning, the `iris` dataset.


> **Ex. 3.3.0**: Load the `iris` data from Seaborn. 

In [26]:
# [Answer to Ex. 3.3.0]

#### K-means clustering

Data clustering is a method for dividing observations into groups whose points are similar to one another. In particular, clustering algorithms aim to give the same label to similar observations. This approach has the advantage that it can detect underlying groups without access to labelled data. Thus it may be used for feature engineering.

For a given input set, the k-means clustering algorithm outputs $k$ points that serve as cluster centers. Each point is assigned to the cluster whose center it is closest to in Euclidean space. The assignment uses an iterative optimization procedure. It therefore needs a starting point, and the result depends on the initial (random) guess.
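The iterative procedure can be sketched in a few lines of NumPy. This is a bare-bones version of the standard assign-and-update loop (often called Lloyd's algorithm), written for illustration and not how `sklearn` implements it. The function name `kmeans`, its parameters and the synthetic two-blob data are assumptions of this example.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: returns (centers, labels) for data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Initial guess: k distinct observations picked at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point gets the label of its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # converged: the centers no longer move
        centers = new_centers
    return centers, labels


# Synthetic data: two well-separated blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.1, size=(20, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.1, size=(20, 2)),
])
centers, labels = kmeans(X, k=2)
print(centers)
```

Changing `seed` changes the initial guess. With clearly separated groups the result is the same up to the numbering of the clusters, but with overlapping groups different starting points can lead to different solutions.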

To get an idea of how the k-means algorithm works try out the implementation visualized [here](http://stanford.edu/class/ee103/visualizations/kmeans/kmeans.html).


> **Ex. 3.3.1**: Compute a k-means clustering. Use three clusters.

> *Hint*: the submodule `cluster` in sklearn has a class called `KMeans`.

In [28]:
# [Answer to Ex. 3.3.1]


> **Ex. 3.3.2**: Make a scatterplot with `petal_length` and `petal_width` on the x and y axes, respectively. Color the points by species. Make another plot colored by the computed k-means labels. Would the k-means label provide a good feature for predicting `species`?

> *Hint:* try seaborn's `lmplot` without fitting a regression.

In [30]:
# [Answer to Ex. 3.3.2]

#### Dimensionality reduction and principal components

In datasets with many variables it can often be difficult to distinguish which ones are relevant, especially if the variables are highly correlated. One approach is to reduce the number of dimensions by finding one or more variables that explain large parts of the data. A linear approach of this kind is principal component analysis (**PCA**). The aim of PCA is to find variables that explain as much of the variation in the data as possible.
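As a sketch of the idea, the example below builds synthetic data in which two of three variables are almost perfectly correlated, so a single component should capture nearly all the variation. The data-generating process is an assumption of this example; `PCA` and `explained_variance_ratio_` come from `sklearn.decomposition`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: the second column is roughly twice the first,
# the third column is low-variance noise.
rng = np.random.default_rng(0)
z = rng.normal(size=300)
X = np.column_stack([
    z,
    2.0 * z + rng.normal(scale=0.1, size=300),
    rng.normal(scale=0.1, size=300),
])

# Fit PCA with as many components as variables.
pca = PCA(n_components=3).fit(X)

# Share of total variance explained by each component, in non-increasing order.
ratios = pca.explained_variance_ratio_
print(ratios.round(4))

# Coordinates of the observations in the space of the components.
scores = pca.transform(X)
print(scores.shape)
```

Because the first component explains almost everything here, the three original variables could be replaced by the first column of `scores` with little loss of information.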

> **Ex. 3.3.3**: Estimate a PCA model using the four features: `'sepal_length', 'sepal_width', 'petal_length', 'petal_width'`. 

In [32]:
# [Answer to Ex. 3.3.3]

> **Ex. 3.3.4**: How much variance is explained by the respective components, i.e. eigenvectors? Try to plot the variance explained by each component, with the eigenvalues in non-increasing order; this is known as a scree plot. What does the scree plot look like? Make a plot of the species using the first two components. Does the new dimension help distinguish between the species?

In [33]:
# [Answer to Ex. 3.3.4]