## 1. Overview & Background

**Machine Learning**: Developing algorithms and models that improve their performance at a task by processing data.

**Data**: Sets of features $x_i$ and (optionally) labels $y_i$. A *label* is the ideal answer we'd like our ML model to give when it is given the features $x_i$.

In **Supervised Learning**, our model expects to train on both features and labels. The task is to construct a model that can predict the label of an object given its set of features.

Supervised learning is further broken down into two categories: **classification** and **regression**.
In *classification* problems, the labels are discrete and don't come from a metric space. An example of classification is determining whether a person has the flu based on their vital signs.
In *regression* problems, the labels are continuous and come from a metric space. An example of regression is predicting a person's body temperature from their other vital signs.

For the exercises in this section, we shall be using the [scikit-learn](http://scikit-learn.org) library.

Most machine learning algorithms implemented in scikit-learn expect data to be stored in a
**two-dimensional array or matrix**.  The arrays can be
either ``numpy`` arrays, or ``scipy.sparse`` matrices.
The shape of the array is expected to be `[n_samples, n_features]`:

- **n_samples:**   The number of samples: each sample is an item to process (e.g. classify).
  A sample can be a document, a picture, a sound, a video, a row in a database or CSV file,
  or whatever you can describe with a fixed set of quantitative traits.
- **n_features:**  The number of features or distinct traits that can be used to describe each
  item in a quantitative manner.  Features are generally real-valued, but may be boolean or
  discrete-valued in some cases.

The number of features must be fixed in advance. However, it can be very large
(e.g. millions of features). For supervised learning problems, data in scikit-learn is represented as a **feature matrix** and a **label vector**:

$$
{\rm feature~matrix:~~~} {\bf X}~=~\left[
\begin{matrix}
x_{11} & x_{12} & \cdots & x_{1D}\\
x_{21} & x_{22} & \cdots & x_{2D}\\
x_{31} & x_{32} & \cdots & x_{3D}\\
\vdots & \vdots & \ddots & \vdots\\
x_{N1} & x_{N2} & \cdots & x_{ND}\\
\end{matrix}
\right]
$$

$$
{\rm label~vector:~~~} {\bf y}~=~ [y_1, y_2, y_3, \cdots, y_N]
$$

Here there are $N$ samples and $D$ features.
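
As a small illustration of this layout, the sketch below builds a toy feature matrix and label vector. The numbers and the names `X_demo`, `y_demo` and `X_single` are invented for the example and are not used elsewhere in this notebook.

```python
import numpy as np

# Hypothetical toy data: 4 samples (rows), 3 features (columns).
X_demo = np.array([
    [36.6, 72.0, 120.0],
    [38.9, 95.0, 110.0],
    [37.1, 80.0, 125.0],
    [39.4, 101.0, 105.0],
])
# One label per sample (made up: 0 = no flu, 1 = flu).
y_demo = np.array([0, 1, 0, 1])

n_samples, n_features = X_demo.shape   # N and D in the notation above
print(n_samples, n_features)
print(y_demo.shape)

# A single feature stored as a 1D array has to be reshaped
# to [n_samples, 1] before scikit-learn will accept it.
x_single = np.linspace(0, 10, 5)
X_single = x_single[:, np.newaxis]
print(X_single.shape)
```

Note that the label vector stays one-dimensional, with one entry per row of the feature matrix.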

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn import linear_model
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn import datasets
from sklearn.datasets import make_moons
from sklearn import metrics
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
%matplotlib inline

## 2. Supervised Learning: Regression

**Exercise 1** 

a) Please use numpy to create a data set with 50 points corresponding to a line with slope 2 and intercept 5, with standard normal noise at each point. Please arrange this data with all the x-values as a feature matrix, and the y values as the label vector. ($y_i=2 \times x_i + 5 + \epsilon_i$)

b) Create a scatter plot of these using matplotlib.

(I've started the exercise code for you. The solutions are at the bottom of this notebook)

#### 2.1 Data Structuring

In [None]:
np.random.seed(0)
X = np.linspace(0,10,50)....

#### 2.2 Normalize Data


![alt text](Errata/Normalization.jpg)



For gradient-descent-type optimization, we'd like all features to have similar means, variances and ranges, so that the error surface is well conditioned. Otherwise, the optimizer can oscillate wildly between iterations or even fail to converge.

`MinMaxScaler` is one of the data scaling and normalization classes available in scikit-learn.
It takes a **2D** matrix as the input to be scaled and (optionally) the range to scale to.

It applies a two-step transform to each column:

Step 1: `X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))`

Step 2: `X_scaled = X_std * (max_of_range - min_of_range) + min_of_range`
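
The two steps can be checked by hand. The following is a minimal sketch, assuming a small made-up matrix `X_demo` and a target range of (-1, 1); it applies the formulas with numpy and compares the result with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: two features on very different scales.
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [4.0, 1000.0]])
min_of_range, max_of_range = -1.0, 1.0

# Step 1: map each column onto [0, 1].
X_std = (X_demo - X_demo.min(axis=0)) / (X_demo.max(axis=0) - X_demo.min(axis=0))
# Step 2: stretch and shift onto the requested range.
X_manual = X_std * (max_of_range - min_of_range) + min_of_range

# The same transform using the scikit-learn scaler.
scaler_demo = MinMaxScaler(feature_range=(min_of_range, max_of_range))
X_sklearn = scaler_demo.fit_transform(X_demo)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```

Because the minimum and maximum are taken along `axis=0`, every column is scaled independently of the others.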

To use this scaler, 

a) Instantiate the scaler with the range that you'd like to scale to,

b) Fit the scaler on your data,

c) Use this fitted scaler to transform your data.

In [None]:
scalerX=MinMaxScaler(feature_range=(-1,1)) #instantiate with the range
scalerX.fit(X)                             #Fit on data
Xnorm=scalerX.transform(X)                 #Transform data

In [None]:
scalerY=MinMaxScaler(feature_range=(-1,1))
scalerY.fit(y[:,np.newaxis]) 
ynorm=scalerY.transform(y[:,np.newaxis])

In [None]:
plt.scatter(Xnorm,ynorm)

**Important**
When you're making your final plots, you may want the results in the original (unscaled) units.
In that case, use the `inverse_transform()` method of the fitted scaler.

In [None]:
Xunscaled=scalerX.inverse_transform(Xnorm)
yunscaled=scalerY.inverse_transform(ynorm)
plt.scatter(Xunscaled,yunscaled)

#### 2.3 Regression Model

Using a basic scikit-learn supervised learning model involves three steps:

a) Instantiate: Declare an instance of a specific model,

b) Train: Fit (or train) this model instance on data,

c) Analyze: Use this trained model to make predictions for new data, analyze model coefficients.

Let's see these steps in action, using a linear regression model on our data.

In [None]:
# Step 1: Declare an instance of a linear regressor
model = LinearRegression()  # the old normalize= argument was removed in scikit-learn 1.2

In [None]:
# Step 2: Fit the model on data
model.fit(X, y) 

In [None]:
# Step 3: Make predictions
y_pred=model.predict(X)

plt.figure()
plt.scatter(X,y,label="Data")
plt.plot(X,y_pred,label="Model Predictions")
plt.legend()

In [None]:
# Evaluate Coefficients of Trained Model:
print("Slope: ",model.coef_)
print("Intercept: ",model.intercept_)

In [None]:
# Evaluate Error of trained Model:
print("mean squared error:", metrics.mean_squared_error(y, model.predict(X)))
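
For reference, the mean squared error is the average of the squared differences between the labels and the predictions. The sketch below uses made-up values (`y_true` and `y_hat` are illustrative names) to compute it by hand and compare it with `metrics.mean_squared_error`.

```python
import numpy as np
from sklearn import metrics

# Made-up labels and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# Average of the squared residuals.
mse_manual = np.mean((y_true - y_hat) ** 2)
# scikit-learn expects the true labels first, then the predictions.
mse_sklearn = metrics.mean_squared_error(y_true, y_hat)

print(mse_manual, mse_sklearn)
```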

### Difference between Curve-fitting and Machine Learning

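**Generalizability**: How well does my trained model perform on data that it has never seen?
An analogy is exam questions: a student who has only memorized the answers to past papers may struggle with questions they have not seen before.

**Dividing the learning dataset into train and test sets**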

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

**Exercise 2** 

a) Use this split dataset to train and evaluate a (new) Linear Regression model. Use the train dataset (X_train, y_train) to train the model. Use the test dataset to evaluate the performance. 

b) Plot the model predictions on the test dataset versus the test data.

### Linear regression: Using higher order fits

In [None]:
def complex_regression_data(x, err=0.5):
    y = 10 - 1. / (x + 0.1)
    if err > 0:
        y = np.random.normal(y, err)
    return y

In [None]:
X = np.linspace(0,1,100)[:,np.newaxis]
y=complex_regression_data(X)

plt.scatter(X,y)

In [None]:
model = LinearRegression()
model.fit(X, y)
X_test = np.linspace(-0.01, 1.01, 500)[:, None]
y_test = model.predict(X_test)

plt.scatter(X.ravel(), y)
plt.plot(X_test.ravel(), y_test)
print("mean squared error:", metrics.mean_squared_error(y, model.predict(X)))

In [None]:
class PolynomialRegression(LinearRegression):
    def __init__(self, degree=1, **kwargs):
        self.degree = degree
        LinearRegression.__init__(self, **kwargs)
        
    def fit(self, X, y):
        if X.shape[1] != 1:
            raise ValueError("Only 1D data, Please!")
        Xp = X ** (1 + np.arange(self.degree))
        return LinearRegression.fit(self, Xp, y)
        
    def predict(self, X):
        Xp = X ** (1 + np.arange(self.degree))
        return LinearRegression.predict(self, Xp)
    
    def err(self,X,y):
        Xp = X ** (1 + np.arange(self.degree))
        yp=LinearRegression.predict(self, Xp)
        return np.mean((y-yp)*(y-yp))
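
The expression `X ** (1 + np.arange(self.degree))` in the class above relies on numpy broadcasting to build the polynomial features. A minimal sketch of what it produces, using a made-up three-sample column `X_demo`:

```python
import numpy as np

X_demo = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1): one feature
degree = 3
exponents = 1 + np.arange(degree)          # array([1, 2, 3])

# Broadcasting (3, 1) against (3,) gives a (3, 3) matrix whose
# columns are x, x**2 and x**3.
Xp = X_demo ** exponents

print(exponents)
print(Xp)
print(Xp.shape)
```

Each row of `Xp` holds the powers of one sample, so a linear model fitted on `Xp` is a polynomial in the original feature.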

In [None]:
model = PolynomialRegression(degree=42)
model.fit(X, y)
y_test = model.predict(X_test)

plt.scatter(X.ravel(), y)
plt.plot(X_test.ravel(), y_test)
print("mean squared error:", metrics.mean_squared_error(y, model.predict(X)))

### Hyperparameters & Validation Data

*Parameters*: Variables describing the model that are tuned using the training data, for instance, the slope and the intercept, the weights & biases of a neural network.

*Hyperparameters*: Variables describing the model that are not tuned using the training data, for instance, the degree of the model expression, the architecture of the neural network.

Hyperparameters are traditionally set using validation or cross-validation. In plain validation, you divide your original dataset into **3** parts: training, validation (or "holdout") and testing.

The training data is used to *tune* model parameters, 

the validation data is used to *select* model hyperparameters,

the testing data is used to *report* an estimate for generalization error.
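
The three roles can be sketched on a separate toy problem. The example below is an illustration only: the noisy sine data, the candidate degrees and the use of `np.polyfit` are assumptions made for this sketch, and all names carry a `_demo` suffix so they do not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: a noisy sine curve.
rng = np.random.RandomState(0)
x_demo = np.linspace(0, 1, 200)
y_demo = np.sin(2 * np.pi * x_demo) + rng.normal(scale=0.2, size=x_demo.size)

# 60% training, 20% validation, 20% testing.
x_train_demo, x_temp_demo, y_train_demo, y_temp_demo = train_test_split(
    x_demo, y_demo, train_size=0.6, random_state=0)
x_val_demo, x_test_demo, y_val_demo, y_test_demo = train_test_split(
    x_temp_demo, y_temp_demo, train_size=0.5, random_state=0)

def mse_demo(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

candidate_degrees = [1, 3, 5]
# Training data: tune the parameters (polynomial coefficients).
fits_demo = {d: np.polyfit(x_train_demo, y_train_demo, d) for d in candidate_degrees}
# Validation data: select the hyperparameter (the degree).
val_errors_demo = {d: mse_demo(fits_demo[d], x_val_demo, y_val_demo)
                   for d in candidate_degrees}
best_degree_demo = min(val_errors_demo, key=val_errors_demo.get)
# Testing data: report the error once, for the selected model only.
test_error_demo = mse_demo(fits_demo[best_degree_demo], x_test_demo, y_test_demo)

print(val_errors_demo)
print(best_degree_demo, test_error_demo)
```

The test set is touched only after the degree has been chosen, so the reported error is not biased by the selection step.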

In [None]:
X_train, X_temp, y_train, y_temp = train_test_split(X, y, train_size=0.6)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, train_size=0.5)

X_train.shape,X_val.shape,X_test.shape

**Exercise 3** 

Use this split dataset to train and evaluate a (new) Regression model. Select the best degree of the model by using the error on the validation dataset.
The splits have been made for you.

Create a loop over the degrees.

For each iteration of the loop, instantiate a PolynomialRegression model with the correct degree.

Train this instance on the training data.

Evaluate the trained model on the validation dataset.

Plot the validation error versus the degree of the PolynomialRegression after the loop.

Note: Let's only have degrees from 1 to 8.

In [None]:
degrees=range(1,9)
e=np.zeros(8)

for i in range(8):
    model = PolynomialRegression(degree=degrees[i])
    .....


### Overfitting & Regularization

*Regularization*: Adding external information to a machine learning model to make it "well-posed".

Traditionally, we express a preference towards simpler, smoother models by adding a penalty.

Model Error: $\sum_i (y_i-\hat{y}_i)^2 + \lambda \sum_j ||W_j||^2$

This is L2 regularization, also known as Ridge Regression.

$\lambda$ is the regularization parameter, and it is a hyperparameter. In scikit-learn's `Ridge`, it is called `alpha`.
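
To make the penalty concrete, the sketch below solves the penalized least-squares problem in closed form and compares it with scikit-learn's `Ridge`. It assumes made-up data and no intercept term (`fit_intercept=False`) so that the two formulations line up; the names `X_demo`, `y_demo`, `w_true` and `lam` are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Made-up linear data with three features.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(30, 3))
w_true = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ w_true + rng.normal(scale=0.1, size=30)

lam = 1.0
# Minimizer of sum_i (y_i - x_i . w)^2 + lam * sum_j w_j^2:
#   w = (X^T X + lam * I)^(-1) X^T y
w_closed = np.linalg.solve(X_demo.T @ X_demo + lam * np.eye(3), X_demo.T @ y_demo)
w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X_demo, y_demo).coef_
print(np.allclose(w_closed, w_sklearn))

# A larger penalty shrinks the weights towards zero.
norms = []
for lam_i in [0.01, 1.0, 100.0]:
    w_i = Ridge(alpha=lam_i, fit_intercept=False).fit(X_demo, y_demo).coef_
    norms.append(np.linalg.norm(w_i))
print(norms)
```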

The **Diabetes Dataset**: 442 measurements from diabetes patients on features like age, sex, BMI, blood pressure and blood serum measurements. The target is a quantitative measure of disease progression one year after these measurements were taken.

A **good** ML model would be useful not just for predicting disease progression in new patients, but also for evaluating the importance of different features to the progress of the disease.

Reference: Efron, Hastie, Johnstone and Tibshirani, "Least Angle Regression", Annals of Statistics (2004).

In [None]:
dataset=datasets.load_diabetes()

In [None]:
X=dataset.data
y=dataset.target

In [None]:
print(dataset.feature_names)

In [None]:
plt.scatter(X[:,2],y)
plt.xlabel("BMI (Normalized)")
plt.ylabel("Response")

In [None]:
model=Ridge()

In [None]:
alphas=np.array([1,0.1,0.01,0.001,0.0001])

In [None]:
grid=GridSearchCV(estimator=model, param_grid=dict(alpha=alphas), cv=3)

In [None]:
grid.fit(X,y)

In [None]:
print(grid.best_estimator_.alpha)
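
Roughly speaking, the grid search above fits one model per candidate `alpha` on each cross-validation fold and keeps the value with the best average score. The sketch below reproduces that loop by hand with `cross_val_score`, assuming the default scoring (the estimator's R² score); names ending in `_demo` are chosen for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

data_demo = datasets.load_diabetes()
X_demo, y_demo = data_demo.data, data_demo.target
alphas_demo = np.array([1, 0.1, 0.01, 0.001, 0.0001])

# Manual search: average 3-fold score for each candidate alpha.
mean_scores = [cross_val_score(Ridge(alpha=a), X_demo, y_demo, cv=3).mean()
               for a in alphas_demo]
best_alpha_manual = alphas_demo[int(np.argmax(mean_scores))]

# The same search with GridSearchCV.
grid_demo = GridSearchCV(estimator=Ridge(), param_grid=dict(alpha=alphas_demo), cv=3)
grid_demo.fit(X_demo, y_demo)

print(best_alpha_manual, grid_demo.best_params_["alpha"])
print(grid_demo.cv_results_["mean_test_score"])
```

Besides selecting `alpha`, `GridSearchCV` refits the best model on the full data by default, which is what `best_estimator_` returns.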

## 3. Supervised Learning: Classification



In [None]:
X, y = make_moons(n_samples=500, noise=0.1)

In [None]:
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)

**Exercise 4**

Split the classification dataset into a training dataset with 80% of the samples and a testing dataset with 20% of the samples. Check and print the shapes of the datasets.

In [None]:
clf = LogisticRegression(random_state=0, penalty='l2')
clf.fit(X_train, y_train)

In [None]:
y_pred=clf.predict(X_test)

In [None]:
plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test)
plt.title('Test Data')

plt.subplot(1,2,2)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred)
plt.title('Predictions')
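
Alongside the plots, a single number can summarize how well the classifier does on the test data. The following is a minimal, self-contained sketch that uses accuracy (the fraction of correct predictions); the fixed `random_state` values and the `_demo` names are choices made for this example, not part of the exercise.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_moons(n_samples=500, noise=0.1, random_state=0)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0)

clf_demo = LogisticRegression(random_state=0)
clf_demo.fit(X_train_demo, y_train_demo)
y_pred_demo = clf_demo.predict(X_test_demo)

# Accuracy: fraction of test samples whose predicted label is correct.
acc_sklearn = metrics.accuracy_score(y_test_demo, y_pred_demo)
acc_manual = np.mean(y_pred_demo == y_test_demo)
print(acc_sklearn, acc_manual)

# Rows are true classes, columns are predicted classes.
print(metrics.confusion_matrix(y_test_demo, y_pred_demo))
```

A linear decision boundary cannot follow the curved shape of the two moons, so some misclassified points near the overlap are to be expected.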

### Solutions

In [None]:
# Exercise 1
np.random.seed(0)
X = np.linspace(0,10,50)[:,np.newaxis]
y = 2 * X.squeeze() + 5 + np.random.normal(size=50)

plt.scatter(X,y)

In [None]:
# Exercise 2
model2 = LinearRegression()
model2.fit(X_train, y_train)

# Evaluate the newly trained model (model2) on the held-out test data
print("mean squared error:", metrics.mean_squared_error(y_test, model2.predict(X_test)))



In [None]:
y_pred = model2.predict(X_test)
plt.figure()
plt.scatter(X_test,y_test,label="Data")
plt.plot(X_test,y_pred,label="Model Predictions")
plt.legend()

In [None]:
X.shape,X_test.shape

In [None]:
# Exercise 3
degrees = range(1, 9)  # degrees 1 to 8, as stated in the exercise
e = np.zeros(len(degrees))

for i in range(len(degrees)):
    model = PolynomialRegression(degree=degrees[i])
    model.fit(X_train, y_train)
    e[i]=model.err(X_val,y_val)
    print("Degree: ", degrees[i]," Error: ",e[i])
    
plt.plot(degrees,e)

In [None]:
# Exercise 4
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

X_train.shape, X_test.shape, y_train.shape, y_test.shape