# Supervised Learning

### scikit-learn:
scikit-learn's built-in datasets are of type `Bunch`, a dictionary-like object. \
Use dictionary-style or attribute-style notation to access `Bunch` keys \
(`Bunch['image']` or `Bunch.image`).
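
A minimal sketch of both access styles, assuming the bundled iris dataset (`load_iris`) as the example `Bunch`; the variable name `same_array` is only for illustration:

```python
from sklearn.datasets import load_iris

# load_iris() returns a Bunch: a dictionary-like container
iris = load_iris()
print(type(iris))
print(iris.keys())

# Dictionary-style and attribute-style access return the same object
same_array = iris['data'] is iris.data
print(same_array)

# Features: one row per observation, one column per feature
print(iris.data.shape, iris.target.shape)
```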

## K-Nearest Neighbors (KNN)

`from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(iris['data'], iris['target'])`

* `.fit()` takes the data and the labels/targets, aka X and y
* requires the data and target to be NumPy arrays or pandas DataFrames
* requires that the features take on continuous values 
* requires that there are no missing values 
* in particular, the scikit-learn API requires the features in an array where each column is a feature and each row is a different observation (data point)
* data and target must contain the same number of observations

__Predicting on unlabeled data:__

`X_new = np.array([[5.6, 2.8, 3.9, 1.1,], [5.7, 2.6, 3.8, 1.3], [4.7, 3.2, 1.3, 0.2]])
prediction = knn.predict(X_new)`

***

`#Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier`

`#Create arrays for the features and the response variable
y = df['party'].values
X = df.drop('party', axis=1).values`

`#Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)`

`#Fit the classifier to the data
knn.fit(X,y)`

Having fit a k-NN classifier, you can now use it to predict the label of a new data point.

### Measuring model performance:

### Train test split
__`from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)`__

`train_test_split` returns four arrays (unpacked into four variables):
1) Training data\
2) Test data\
3) Training labels\
4) Test labels

* By default, `train_test_split` splits the data into 75% training data and 25% testing data.
* `random_state` = random seed, so the split can be reproduced downstream.
* `test_size` = proportion of the data to allocate to the test set (here, 30%).
* `stratify=y` distributes the labels in the train and test sets as they are in the original dataset, where `y` is the list or array containing the labels.
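
To make the effect of `test_size` and `stratify` concrete, here is a small sketch on made-up data (the arrays `X_demo` and `y_demo` are assumptions for illustration, with an 80/20 class imbalance):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 observations, 2 features, 80 of class 0 and 20 of class 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=21, stratify=y_demo
)

# 70 training rows and 30 test rows
print(X_tr.shape, X_te.shape)
# With stratify, the share of class 1 is the same in both sets
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the class shares in the two sets would vary with the random seed.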

To check the accuracy of our model, use __`knn.score(X_test, y_test)`__

### Model complexity
__Larger k__ = smoother decision boundary = less complex model\
__Smaller k__ = more complex model = can lead to overfitting\
See: [Model complexity curve](https://www.analyticsvidhya.com/blog/2020/08/bias-and-variance-tradeoff-machine-learning/), Bias-Variance Tradeoff\
Plot __model complexity curves__ to determine the optimal value for k

`#Import necessary modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split`

`#Create feature and target arrays
X = digits.data
y = digits.target`

`#Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)`

`#Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=7)`

`#Fit the classifier to the training data
knn.fit(X_train,y_train)`

`#Print the accuracy
print(knn.score(X_test, y_test))`

### Compute and plot the training and testing accuracy scores for a variety of different neighbor values

```
#Setup arrays to store train and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

#Loop over different values of k
for i, k in enumerate(neighbors):
    #Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors=k)
    #Fit the classifier to the training data
    knn.fit(X_train, y_train)
    #Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)
    #Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)
```

```
#Generate plot
import matplotlib.pyplot as plt
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()
```

# Regression

Scikit-learn wants features and target values in distinct arrays.

So first, split the DataFrame into separate arrays, X and y:

```
boston = pd.read_csv(...)  #the file path is not given in these notes
X = boston.drop('MEDV', axis=1).values
y = boston['MEDV'].values #Using the .values attribute returns the NumPy arrays we can use
```

Predicting house value from a single feature:

```
X_rooms = X[:,5]
y = y.reshape(-1, 1)
X_rooms = X_rooms.reshape(-1, 1)
```

__`np.reshape()`__: [numpy.reshape](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html) \
numpy.reshape(a, newshape, order='C')
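
A quick sketch of what `reshape(-1, 1)` does to a single feature; the values in `column` are made up for illustration:

```python
import numpy as np

# A single feature stored as a 1-D array: shape (4,)
column = np.array([6.5, 5.9, 7.1, 6.0])
print(column.shape)

# -1 lets NumPy infer the number of rows; 1 forces a single column
column_2d = column.reshape(-1, 1)
print(column_2d.shape)
```

The 2-D shape (rows = observations, columns = features) is what scikit-learn expects for the feature array.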

__Fitting a regression model:__
```
import numpy as np
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_rooms, y)
prediction_space = np.linspace(min(X_rooms), max(X_rooms)).reshape(-1, 1)
```

__Importing data for supervised learning:__
* Import the data and get it into the form needed by scikit-learn. 
* This involves creating feature and target variable arrays. 
* Since you are going to use only one feature to begin with, you need to do some reshaping using NumPy's .reshape() method.

```
#Import numpy and pandas
import numpy as np
import pandas as pd
#Read the CSV file into a DataFrame: df
df = pd.read_csv('gapminder.csv')
#Create arrays for features and target variable (.values gives NumPy arrays, which have .reshape())
y = df['life'].values
X = df['fertility'].values
#Print the dimensions of y and X before reshaping
print("Dimensions of y before reshaping: ", y.shape)
print("Dimensions of X before reshaping: ", X.shape)
#Reshape X and y
y_reshaped = y.reshape(-1,1)
X_reshaped = X.reshape(-1,1)
#Print the dimensions of y_reshaped and X_reshaped
print("Dimensions of y after reshaping: ", y_reshaped.shape)
print("Dimensions of X after reshaping: ", X_reshaped.shape)
```

Heatmap: \
__`sns.heatmap(df.corr(), square=True, cmap='RdYlGn')`__

## __error__ function = __loss__ function = __cost__ function

__superscript:__ $R^2$ \
__subscript:__ $R_2$

__Ordinary Least Squares (OLS):__ Minimize sum of squares of residuals; same as minimizing mean squared error; calling `.fit()` on a linear regression model in scikit-learn performs OLS "under the hood."

__$R^2$:__ The default metric returned by `.score()` __for linear regression__; intuitively, it quantifies the proportion of variance in the target variable that is explained by the feature variables.

__RMSE (Root Mean Squared Error):__ Another popular metric of model performance for regression: the square root of the mean squared difference between predictions and true values, expressed in the units of the target.

__`reg_all.score(X_test, y_test)`__
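
Both metrics can be computed by hand, which shows what `.score()` and `mean_squared_error` are measuring. The sketch below uses made-up targets and predictions (`y_true`, `y_hat`) and the standard definitions $R^2 = 1 - SS_{res} / SS_{tot}$ and $RMSE = \sqrt{SS_{res} / n}$:

```python
import math

# Hypothetical true targets and model predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 7.0, 9.5]

# Residual sum of squares: the quantity OLS minimizes
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))

# Total sum of squares around the mean of the targets
mean_true = sum(y_true) / len(y_true)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

r_squared = 1 - ss_res / ss_tot
rmse = math.sqrt(ss_res / len(y_true))
print(r_squared, rmse)
```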

You will almost never use linear regression straight out of the box; you will almost always want to use __regularization__.


```
#Import necessary modules
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
#Create the regressor: reg
reg = LinearRegression()
#Create the prediction space
prediction_space = np.linspace(min(X_fertility), max(X_fertility)).reshape(-1,1)
#Fit the model to the data
reg.fit(X_fertility, y)
#Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)
#Print R^2 
print(reg.score(X_fertility, y))
#Plot regression line
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()
```

```
#Import necessary modules
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
#Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
#Create the regressor: reg_all
reg_all = LinearRegression()
#Fit the regressor to the training data
reg_all.fit(X_train, y_train)
#Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)
#Compute and print R^2 and RMSE
print("R^2: {}".format(reg_all.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
```

## Cross-validation

Model performance is dependent on the way the data are split.

__K-fold cross-validation:__ Split the data into K folds; each fold serves once as the test set while the remaining K-1 folds form the training set, giving K scores. For example, 5-fold CV = five folds (4 training, 1 test in each round). 

More folds = more computationally expensive.

This method avoids the problem of your metric of choice depending on a single train/test split.
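
The fold rotation can be sketched in plain Python. This is an illustration of the idea only, using contiguous, unshuffled folds; the helper `k_fold_indices` is hypothetical and is not how scikit-learn implements its splitters:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

# 5-fold CV over 10 observations: each round holds out 2 and trains on 8
folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every observation appears in exactly one test fold, so the K scores together cover the whole dataset.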

### Cross-validation in scikit-learn:

```
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
cv_results = cross_val_score(reg, X, y, cv = 5)
print(cv_results)
print(np.mean(cv_results))
```

`cross_val_score()` from scikit-learn uses $R^2$ as the default metric for regression.

```
#Import the necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
#Create a linear regression object: reg
reg = LinearRegression()
#Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, X, y, cv=5)
#Print the 5-fold cross-validation scores
print(cv_scores)
print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))
```

__`%timeit`:__

`%timeit cross_val_score(reg, X, y, cv = ____)`

## Regularized regression
Why regularize? 
* Linear regression minimizes a loss function
* It chooses a coefficient for each feature variable 
* Large coefficients can lead to overfitting (especially with many features)
* It is common practice to alter the loss function so that it penalizes for large coefficients
* __Regularization:__ penalizing large coefficients; there are different types of regularized regression.

### Ridge regression
* alpha is a parameter we need to choose
* picking alpha for ridge regression is similar to picking k in k-NN <-- __hyperparameter tuning__
* alpha is sometimes also referred to as lambda
* __alpha__ = a parameter that controls model complexity
    * when alpha = 0, we get back OLS (can lead to __overfitting__)
    * A very large alpha means that large coefficients are significantly penalized
    * Very high alpha can lead to __underfitting__.
* Also known as __L2 regularization__ (because the regularization term is the squared L2 norm of the coefficients). 
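
The effect of alpha is easiest to see in a deliberately simplified case: one feature and no intercept, where the penalized loss $\sum_i (y_i - w x_i)^2 + \alpha w^2$ has the closed-form minimizer $w = \sum_i x_i y_i / (\sum_i x_i^2 + \alpha)$. The helper `ridge_slope` and the data below are assumptions for illustration, not scikit-learn code:

```python
def ridge_slope(x, y, alpha):
    """Slope w minimizing sum((y_i - w * x_i)**2) + alpha * w**2 (one feature, no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)

# Made-up data lying exactly on y = 2x
x_demo = [1.0, 2.0, 3.0, 4.0]
y_demo = [2.0, 4.0, 6.0, 8.0]

# alpha = 0 recovers the OLS slope; larger alpha shrinks it toward 0
slopes = {alpha: ridge_slope(x_demo, y_demo, alpha) for alpha in [0.0, 1.0, 30.0, 1000.0]}
for alpha, w in slopes.items():
    print(alpha, w)
```

The slope shrinks toward zero as alpha grows but never reaches exactly zero, which is the underfitting risk of a very high alpha.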
    
__Ridge regression in scikit-learn:__

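```
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
#normalize = True was removed from Ridge in scikit-learn 1.2; scale the features beforehand if needed
ridge = Ridge(alpha = 0.1)
ridge.fit(X_train, y_train)
ridge_pred = ridge.predict(X_test)
ridge.score(X_test, y_test)
```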

### Function for fitting ridge regression over a range of different alphas:

```
def display_plot(cv_scores, cv_scores_std):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.plot(alpha_space, cv_scores)

    std_error = cv_scores_std / np.sqrt(10)

    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()
```

```
#Import necessary modules
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
#Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []
#Create a ridge regressor: ridge (normalize = True was removed in scikit-learn 1.2)
ridge = Ridge()
#Compute scores over range of alphas
for alpha in alpha_space:
    #Specify the alpha value to use: ridge.alpha
    ridge.alpha = alpha    
    #Perform 10-fold CV: ridge_cv_scores
    ridge_cv_scores = cross_val_score(ridge, X, y, cv= 10)   
    #Append the mean of ridge_cv_scores to ridge_scores
    ridge_scores.append(np.mean(ridge_cv_scores))
    #Append the std of ridge_cv_scores to ridge_scores_std
    ridge_scores_std.append(np.std(ridge_cv_scores))
#Display the plot
display_plot(ridge_scores, ridge_scores_std)
```

## Lasso regression
* mirrors Ridge regression (substitute `Lasso` for `Ridge` in code above)
* Lasso regression can be used for feature selection, that is, to select the important features of a dataset
* tends to shrink the coefficients of less important features to exactly 0
    * the features whose coefficients are not shrunk to zero are then selected by the Lasso algorithm
* Lasso for feature selection in scikit-learn:

```
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
names = boston.drop('MEDV', axis = 1).columns
lasso = Lasso(alpha = 0.1) #older scikit-learn releases also accepted normalize = True (removed in 1.2)
lasso_coef = lasso.fit(X, y).coef_
_ = plt.plot(range(len(names)), lasso_coef)
_ = plt.xticks(range(len(names)), names, rotation = 60)
_ = plt.ylabel("Coefficients")
plt.show()
```

Plotting the coefficients this way is a powerful visual tool for communicating the important features to bosses and colleagues.

Lasso regularization is also known as __L1 regularization__ (because the regularization term is the L1 norm of the coefficients).
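
Why the L1 penalty produces coefficients of exactly 0 can be seen in a simplified one-feature, no-intercept case. Assuming the loss $\frac{1}{2n}\sum_i (y_i - w x_i)^2 + \alpha |w|$, the minimizer is a soft-thresholded version of the unpenalized slope. The helper `lasso_slope` and the data are hypothetical, for illustration only:

```python
def lasso_slope(x, y, alpha):
    """Slope w minimizing (1 / (2n)) * sum((y_i - w * x_i)**2) + alpha * abs(w)."""
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n   # average of x_i * y_i
    z = sum(xi * xi for xi in x) / n                 # average of x_i ** 2
    # Soft-thresholding: shrink |rho| by alpha, clipping at zero
    shrunk = max(abs(rho) - alpha, 0.0)
    return (shrunk if rho >= 0 else -shrunk) / z

# Made-up data lying exactly on y = 2x
x_demo = [1.0, 2.0, 3.0, 4.0]
y_demo = [2.0, 4.0, 6.0, 8.0]

for alpha in [0.0, 7.5, 15.0, 20.0]:
    print(alpha, lasso_slope(x_demo, y_demo, alpha))
```

Once alpha is large enough, the slope is exactly 0 rather than merely small, which is what lets Lasso drop features.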

### Confusion matrices
__class imbalance:__ the situation when one class is more frequent. Example: of emails, 99% real and 1% spam; a very common situation in practice that requires a more nuanced metric (than accuracy) to assess the performance of a model.

__Accuracy:__ sum of the diagonal, divided by the total sum of the confusion matrix: \
__(tp + tn) / (tp + tn + fp + fn)__ \
(true positive + true negative) / (true positive + true negative + false positive + false negative)

__Precision:__ Also called "Positive Predictive Value" or PPV: \
__tp / (tp + fp)__

__Recall:__ Also called __sensitivity__, "hit rate", or True Positive Rate: \
__tp / (tp + fn)__

__F1 score:__ the harmonic mean of precision and recall: \
__2 * (precision * recall) / (precision + recall)__


__High precision:__ means that our classifier produced few false positives; not many real emails were predicted as spam.

__High recall:__ means that our classifier predicted most positive (spam) emails correctly.

__Support column:__ gives the number of samples of the true response that lie in that class
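
A worked example of the formulas above, using made-up confusion-matrix counts for 100 emails of which 10 are spam (the positive class):

```python
# Hypothetical counts: 6 spam caught, 2 real emails flagged, 4 spam missed, 88 real emails passed
tp, fp, fn, tn = 6, 2, 4, 88

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)
print(accuracy, precision, recall, f1)

# A classifier that labels every email "real" has tp = 0, fp = 0, fn = 10, tn = 90
baseline_accuracy = (0 + 90) / 100
print(baseline_accuracy)
```

The do-nothing baseline still scores high on accuracy while catching no spam at all, which is why precision and recall matter under class imbalance.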

```
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(confusion_matrix(y_test, y_pred)) #to print the confusion matrix
print(classification_report(y_test, y_pred)) #to compute the resulting metrics
```


```
#Import necessary modules
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
#Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)
#Instantiate a k-NN classifier: knn
knn = KNeighborsClassifier(n_neighbors= 6)
#Fit the classifier to the training data
knn.fit(X_train, y_train)
#Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)
#Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

## Logistic Regression and the ROC curve

* Also known as log reg (used for binary classification)
* Log reg outputs probabilities
* If the probability 'p' is greater than 0.5:
    * the data is labeled '1'
* If 'p' is less than 0.5:
    * the data is labeled '0'
* Log reg produces a linear decision boundary
* follows the same instantiate/fit/predict pattern as `KNeighborsClassifier` and `LinearRegression`
* __By default, the logistic regression threshold = 0.5__
* the set of points we get when trying all possible thresholds is called the __Receiver Operating Characteristic (ROC) Curve__.
* Classification reports and confusion matrices are great methods to quantitatively evaluate model performance, while ROC curves provide a way to visually evaluate models. 
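
How the threshold traces out the ROC curve can be sketched by hand. The helper `roc_point` and the toy labels and probabilities below are assumptions for illustration; an observation is labeled '1' when its probability is at or above the threshold:

```python
def roc_point(y_true, probs, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 0)
    return fp / (fp + tn), tp / (tp + fn)

# Hypothetical true labels and predicted probabilities of class 1
labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]

# Lowering the threshold moves from (0, 0) toward (1, 1) on the ROC plot
points = [roc_point(labels, probs, t) for t in [1.0, 0.5, 0.3, 0.0]]
print(points)
```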

```
from sklearn.metrics import roc_curve
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
#first argument: actual labels; second argument: predicted probabilities
#unpack the results into three variables
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
_ = plt.plot([0,1], [0,1], 'k--')
_ = plt.plot(fpr, tpr, label = 'Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show()
```

__`predict_proba()`__ returns an array with two columns; each column contains the probabilities for the respective target values (in the above example we chose the second column, the one with index 1, that is, the predicted probabilities of the label being 1). Most classifiers in scikit-learn have a `.predict_proba()` method.

scikit-learn makes it very easy to try different models, since the Train-Test-Split/Instantiate/Fit/Predict paradigm applies to all classifiers and regressors - which are known in scikit-learn as 'estimators'. 

```
#Import the necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
#Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42)
#Create the classifier: logreg
logreg = LogisticRegression()
#Fit the classifier to the training data
logreg.fit(X_train, y_train)
#Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)
```

```
#Import necessary modules
from sklearn.metrics import roc_curve
#Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]
#Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
#Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
```

Like the alpha parameter in Ridge and Lasso Regression, Logistic Regression has a regularization __parameter: *C*__.\
__*C*__ controls the *inverse* of the regularization strength.\
A large __*C*__ can *overfit* a model.\
A small __*C*__ can *underfit* a model 

## AUC in scikit-learn

```
from sklearn.metrics import roc_auc_score
logreg = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)
logreg.fit(X_train, y_train)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, y_pred_prob)
```


* To compute the AUC, we first compute the predicted probabilities as above, and then pass the true labels and the predicted probabilities to roc_auc_score.
* Also, compute AUC using cross-validation:

```
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(logreg, X, y, cv = 5, scoring = 'roc_auc')
print(cv_scores)
```

```
#Import necessary modules
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
#Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]
#Compute and print AUC score
print("AUC: {}".format(roc_auc_score(y_test, y_pred_prob)))
#Compute cross-validated AUC scores: cv_auc
cv_auc = cross_val_score(logreg, X, y, cv = 5, scoring = 'roc_auc')
#Print list of AUC scores
print("AUC scores computed using 5-fold cross-validation: {}".format(cv_auc))
```

## Grid Search Cross Validation:
```
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors' : np.arange(1, 50)}
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, param_grid, cv = 5)
knn_cv.fit(X, y)
knn_cv.best_params_
knn_cv.best_score_
```

* Specify the hyperparameters as a dictionary in which the keys are the hyperparameter names (such as `n_neighbors` or `alpha`) and the values are lists containing the values we wish to tune the relevant hyperparameter over. 
* If we specify multiple parameters, all possible combinations will be tried. 
* `GridSearchCV` returns a grid search object that you can then fit to the data; this fit performs the actual grid search in place. 
* `RandomizedSearchCV` is similar to grid search, except that it samples a fixed number of settings instead of trying them all.
* GridSearch can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters
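
To see why the cost grows quickly, the combinations a grid implies can be enumerated by hand. This sketch uses `itertools.product` on a made-up grid (`param_grid_demo`) over two `KNeighborsClassifier` hyperparameters; it only counts the work and does not fit anything:

```python
from itertools import product

# Hypothetical grid: 3 values of n_neighbors x 2 values of weights
param_grid_demo = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}

# Every combination of one value per hyperparameter
names = list(param_grid_demo)
combinations = [dict(zip(names, values)) for values in product(*param_grid_demo.values())]
for combo in combinations:
    print(combo)

# With 5-fold CV, each combination is fit 5 times during the search
n_fits = len(combinations) * 5
print(n_fits)
```

Adding a third hyperparameter with four candidate values would multiply the number of fits by four.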

```
#Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
#Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}
#Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression()
#Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
#Fit it to the data
logreg_cv.fit(X, y)
#Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))
```

## RandomizedSearchCV
* GridSearch can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters
* A solution to this is to use RandomizedSearchCV, in which __not__ all hyperparameter values are tried out. 
* Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions. 
* Decision trees have *many* parameters that can be tuned, such as `max_features`, `max_depth`, and `min_samples_leaf`. This makes Decision trees an ideal use case for RandomizedSearchCV.


## DecisionTreeClassifier
* Decision trees have many parameters that can be tuned, such as `max_features`, `max_depth`, and `min_samples_leaf`. This makes Decision trees an ideal use case for `RandomizedSearchCV`.

```
#Import necessary modules
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
#Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}
#Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()
#Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv= 5)
#Fit it to the data
tree_cv.fit(X,y)
#Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))
```

## Hold out set for final evaluation
### Hold out set reasoning
* How well can the model perform on never before seen data (given your scoring method of choice)?
* Using ALL data for cross-validation is not ideal (because estimating model performance on any of it may not provide an accurate picture of how it will perform on unseen data).
* __Split data into training and hold-out set at the beginning.__
* Perform grid cross validation on the training set to tune model's hyperparameters
* Then, select model's best hyperparameters and evaluate on hold-out set. 

### Hold out set: Classification
* In addition to __C__, logistic regression has a 'penalty' hyperparameter which specifies whether to use 'L1' or 'L2' regularization.

```
#Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
#Create the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}
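#Instantiate the logistic regression classifier: logreg
#(the 'liblinear' solver supports both 'l1' and 'l2'; the default solver does not support 'l1')
logreg = LogisticRegression(solver='liblinear')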
#Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state= 42)
#Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv= 5)
#Fit it to the training data
logreg_cv.fit(X_train, y_train)
#Print the optimal parameters and best score
print("Tuned Logistic Regression Parameter: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Accuracy: {}".format(logreg_cv.best_score_))
```

### Hold out set: Regression
* Remember lasso and ridge regression from the previous chapter? Lasso used the L1 penalty to regularize, while ridge used the L2 penalty. 
* There is another type of regularized regression known as the elastic net. In elastic net regularization, the penalty term is a linear combination of the L1 and L2 penalties: 
                                    a * L1 + b * L2 
* In scikit-learn, this term is represented by the 'l1_ratio' parameter: an 'l1_ratio' of 1 corresponds to an L1 penalty, and anything lower is a combination of L1 and L2.

```
#Import necessary modules
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
#Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)
#Create the hyperparameter grid
l1_space = np.linspace(0, 1, 30)
param_grid = {'l1_ratio': l1_space}
#Instantiate the ElasticNet regressor: elastic_net
elastic_net = ElasticNet()
#Setup the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(elastic_net, param_grid, cv=5)
#Fit it to the training data
gm_cv.fit(X_train, y_train)
#Predict on the test set and compute metrics
y_pred = gm_cv.predict(X_test)
r2 = gm_cv.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
print("Tuned ElasticNet MSE: {}".format(mse))
```

## Preprocessing Data
* In the real world, data is messy and you'll have to preprocess your data before you can build models. 

### Dealing with categorical features
* Scikit-learn will not accept categorical features by default; you will have to preprocess these features into the correct format. 
* Need to encode categorical features numerically.
* We achieve this by splitting the feature into a number of binary features called 'dummy variables' (one for each category).
    * 0 means the observation was NOT that category.
    * 1 means the observation WAS that category.
    * When every observation must belong to exactly one of the categories, you only need n-1 dummy columns for n categories. For example, for cars of origin US, Asia, Europe you only need two columns; the third is implied. Keeping all n columns duplicates information, which may be an issue for some models.

### Dealing with categorical features in Python
* scikit-learn: `OneHotEncoder()`
* pandas: `get_dummies()`

#### Encoding dummy variables

```
import pandas as pd
df = pd.read_csv('auto.csv')
df_origin = pd.get_dummies(df)
print(df_origin.head())
```
__Note:__ Here, __by default__ pandas creates three dummy variables for the cars' `origin` column.
* To prevent duplicate information, which may cause issues for some models, drop one of the redundant columns:

`df_origin = df_origin.drop('origin_Asia', axis = 1)`

### Linear Regression with dummy variables

```
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
#normalize = True was removed from Ridge in scikit-learn 1.2
ridge = Ridge(alpha= 0.5).fit(X_train, y_train)
ridge.score(X_test, y_test)
```

### Boxplots
Boxplots are particularly useful for visualizing categorical features:
```
#Import pandas and matplotlib
import pandas as pd
import matplotlib.pyplot as plt
# Read 'gapminder.csv' into a DataFrame: df
df = pd.read_csv('gapminder.csv')
#Create a boxplot of life expectancy per region
df.boxplot('life', 'Region', rot=60)
#Show the plot
plt.show()
```

```
#Create dummy variables: df_region
df_region = pd.get_dummies(df)
#Print the columns of df_region
print(df_region.columns)
#Create dummy variables with drop_first=True: df_region
df_region = pd.get_dummies(df, drop_first= True)
#Print the new columns of df_region
print(df_region.columns)
```

```
#Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
#Instantiate a ridge regressor: ridge (normalize=True was removed in scikit-learn 1.2)
ridge = Ridge(alpha= 0.5)
#Perform 5-fold cross-validation: ridge_cv
ridge_cv = cross_val_score(ridge, X, y, cv=5)
#Print the cross-validated scores
print(ridge_cv)
```

## Handling missing data
### Turn 0's into NaNs and drop:
__Drop all rows containing missing data using `.dropna()`__
```
#Assign the results back; chained inplace calls may not update df in recent pandas
df['insulin'] = df['insulin'].replace(0, np.nan)
df['triceps'] = df['triceps'].replace(0, np.nan)
df['bmi'] = df['bmi'].replace(0, np.nan)
df = df.dropna()
```

### Imputing missing data
* Data imputation is one of several important preprocessing steps for ML models
* Imputing missing values = making an educated guess about the missing values
* A common strategy is to use the __mean__ or __median__ of the non-missing entries.
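
What mean imputation does to a single column can be shown in a few lines of plain Python; the values in `column` are made up for illustration:

```python
import math

# Hypothetical feature column with two missing entries
column = [2.0, float('nan'), 4.0, 6.0, float('nan')]

# Mean of the non-missing entries only
observed = [v for v in column if not math.isnan(v)]
mean_value = sum(observed) / len(observed)

# Replace each missing entry with that mean
imputed = [mean_value if math.isnan(v) else v for v in column]
print(imputed)
```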

``` 
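import numpy as np
#SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy= 'mean') #imputes column by column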
imp.fit(X)
X = imp.transform(X)
```
* Imputers are known as transformers
* Any model that can transform data in this way, using a `.transform()` method, is called a transformer
* Imputing and fitting a model can be done at once with scikit-learn's `Pipeline`

```
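import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer #replaces the removed sklearn.preprocessing.Imputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
imp = SimpleImputer(missing_values = np.nan, strategy = 'mean')
logreg = LogisticRegression()
steps = [('imputation', imp), ('logistic_regression', logreg)]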
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3, random_state= 42)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
pipeline.score(X_test, y_test)
```

* build a pipeline object by constructing a list of steps in the pipeline, where each step is a two-tuple containing the name you wish to give the relevant step and the estimator.
* Then pass this list to the Pipeline constructor
* Then split data into training and test sets
* Then fit the Pipeline to the training set
* Then predict on the test set
* For good measure, compute accuracy
* __NOTE:__ In a pipeline, each step but the last __must__ be a transformer, and the last __must__ be an estimator, such as a classifier, regressor, or transformer

```
#Convert '?' to NaN
df[df == '?'] = np.nan
#Print the number of NaNs
print(df.isnull().sum())
#Print shape of original DataFrame
print("Shape of Original DataFrame: {}".format(df.shape))
#Drop missing values and print shape of new DataFrame
df = df.dropna()
#Print shape of new DataFrame
print("Shape of DataFrame After Dropping All Rows with Missing Values: {}".format(df.shape))
```

###  Impute missing data in an ML Pipeline:
* There are many steps to building a model, from creating training and test sets, to fitting a classifier or regressor, to tuning its parameters, to evaluating its performance on new data. 
* Imputation can be seen as the first step of this machine learning process, the entirety of which can be viewed within the context of a pipeline. 
* Scikit-learn provides a pipeline constructor that allows you to piece together these steps into one process and thereby simplify your workflow.
* You can practice setting up a pipeline with two steps: the imputation step, followed by the instantiation of a classifier. 
    * example classifiers: k-NN, logistic regression, and the decision tree, Support Vector Machine (SVM)
    * SVC stands for Support Vector Classification, which is a type of SVM.
    
```
#Import SimpleImputer (replaces the removed sklearn.preprocessing.Imputer) and SVC
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC
#Setup the Imputation transformer: imp
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
#Instantiate the SVC classifier: clf
clf = SVC()
#Setup the pipeline with the required steps: steps
steps = [('imputation', imp),
        ('SVM', clf)]
```

* What makes pipelines so incredibly useful is the simple interface that they provide. 
* You can use the .fit() and .predict() methods on pipelines just as with classifiers and regressors!
```
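#Import necessary modules
import numpy as np
from sklearn.impute import SimpleImputer #replaces the removed sklearn.preprocessing.Imputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
#Setup the pipeline steps: steps
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
        ('SVM', SVC())]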
#Create the pipeline: pipeline
pipeline = Pipeline(steps)
#Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3, random_state= 42)
#Fit the pipeline to the train set
pipeline.fit(X_train, y_train)
#Predict the labels of the test set
y_pred = pipeline.predict(X_test)
#Compute metrics
print(classification_report(y_test, y_pred))
```

```
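#Import necessary modules
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
#Setup the pipeline
steps = [('scaler', StandardScaler()),
         ('SVM', SVC())]
pipeline = Pipeline(steps)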
#Specify the hyperparameter space
parameters = {'SVM__C':[1, 10, 100],
              'SVM__gamma':[0.1, 0.01]}
#Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state= 21)
#Instantiate the GridSearchCV object: cv
cv = GridSearchCV(pipeline, param_grid= parameters)
#Fit to the training set
cv.fit(X_train, y_train)
#Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)
#Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))
```

## Bringing it all together: Pipeline for regression
* Goal: build a pipeline that imputes the missing data, scales the features, and fits an ElasticNet to the Gapminder data, then tune the ElasticNet's `l1_ratio` using GridSearchCV.
```
#Import necessary modules
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
#Setup the pipeline steps: steps
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
         ('scaler', StandardScaler()),
         ('elasticnet', ElasticNet())]
#Create the pipeline: pipeline
pipeline = Pipeline(steps)
#Specify the hyperparameter space
parameters = {'elasticnet__l1_ratio': np.linspace(0, 1, 30)}
#Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
#Create the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(pipeline, param_grid=parameters)
#Fit to the training set
gm_cv.fit(X_train, y_train)
#Compute and print the metrics
r2 = gm_cv.score(X_test, y_test)
print("Tuned ElasticNet l1_ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
```

Book: __Introduction to Machine Learning with Python__ by Andreas C. Müller and Sarah Guido

## Centering and Scaling
* Centering and scaling is another important preprocessing step in ML
* Scaling features can significantly improve the performance of a model (however, this is not always the case: when all features are binary, scaling will have minimal effect)

__Why scale your data?__
* Many ML models use some form of distance measurement to inform their predictions
* Therefore, differing ranges or scales can be a huge problem
* Features on larger scales can unduly influence the model
* Example: k-NN uses distance explicitly when making predictions
* For these reasons, we want features to be on a similar scale
* To achieve this, we use __normalizing__ (or __centering and scaling__)

__Ways to normalize your data:__
* There are several ways to normalize your data
* __Standardization:__ Given any column, subtract the mean and divide by the standard deviation
    * Makes all features centered around zero
    * Makes all features have variance one

* __Min-max scaling:__ Subtract the minimum and divide by the range
    * Makes each feature have minimum 0 and maximum 1

* Can also normalize so that the data ranges from -1 to +1

* See scikit-learn docs for further info on each method above.
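
The normalization methods listed above can be sketched side by side. This is a minimal illustration, not a recipe: the toy matrix `X` is made up, and the by-hand NumPy versions are written out only to show the arithmetic that `StandardScaler` and `MinMaxScaler` perform column by column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature matrix (hypothetical): two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Standardization by hand: subtract the column mean, divide by the column standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling by hand: subtract the column minimum, divide by the column range
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same transformations using scikit-learn's transformers
X_std_sk = StandardScaler().fit_transform(X)
X_minmax_sk = MinMaxScaler().fit_transform(X)

# Rescale each column to the range -1 to +1
X_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# Per-column checks: mean 0 and standard deviation 1 after standardization
print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
print(X_pm1.min(axis=0), X_pm1.max(axis=0))
```

Passing `axis=0` computes the statistics per column (per feature), which is the level at which scaling is applied.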

## Standardization
```
from sklearn.preprocessing import scale
X_scaled = scale(X)
```

* Check columns for new mean and standard deviation: 
```
np.mean(X), np.std(X)
np.mean(X_scaled), np.std(X_scaled)
```

### Scaling in a pipeline
* A scaler can also be placed in a Pipeline object, so that scaling and model fitting happen in one step
```
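from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
knn_scaled = pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
#Compute the accuracy of the scaled pipeline on the test set
accuracy_score(y_test, y_pred)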
```
* In the above example from DataCamp, scaling resulted in an improvement in accuracy score from 0.928 to 0.956
* So here, scaling did improve our model performance

### CV and scaling in a pipeline
```
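import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
#Keys are strings of the form '<step name>__<hyperparameter name>'
parameters = {'knn__n_neighbors': np.arange(1, 50)}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
print(cv.best_params_)
print(cv.score(X_test, y_test))
print(classification_report(y_test, y_pred))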
```
* Specify hyperparameter space by creating a dictionary 
* The keys are the pipeline step name, followed by a double underscore, followed by the hyperparameter name
* The corresponding value is a list or array of values to try for that particular hyperparameter
* Split the data into training and test sets, so that the tuned model can be evaluated on data it has not seen
* We then perform a grid search over the parameters in the pipeline by instantiating the GridSearchCV object and fitting it to training data
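
The `step__hyperparameter` naming rule described above can be checked directly on a pipeline. The sketch below assumes the same step names (`'scaler'`, `'knn'`) used in these notes; `get_params` and `set_params` are standard estimator methods, and the variable `knn_param_names` is introduced here only for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])

# List every tunable name that belongs to the 'knn' step
knn_param_names = sorted(name for name in pipeline.get_params() if name.startswith('knn__'))
print(knn_param_names)

# set_params accepts the same names that GridSearchCV uses as param_grid keys
pipeline.set_params(knn__n_neighbors=3)
print(pipeline.named_steps['knn'].n_neighbors)
```

Printing the available names first is a quick way to catch typos in a `param_grid` before running a long search.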

```
#Import scale
from sklearn.preprocessing import scale
#Scale the features: X_scaled
X_scaled = scale(X)
#Print the mean and standard deviation of the unscaled features
print("Mean of Unscaled Features: {}".format(np.mean(X))) 
print("Standard Deviation of Unscaled Features: {}".format(np.std(X)))
#Print the mean and standard deviation of the scaled features
print("Mean of Scaled Features: {}".format(np.mean(X_scaled))) 
print("Standard Deviation of Scaled Features: {}".format(np.std(X_scaled)))
```
***
```
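#Import the necessary modules
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler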
#Setup the pipeline steps: steps
steps = [('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())]      
#Create the pipeline: pipeline
pipeline = Pipeline(steps)
#Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state= 42)
#Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)
#Instantiate and fit a k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)
#Compute and print metrics
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))
print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test)))
```

## Bringing it all together: Pipeline for classification
* Goal: build a classification pipeline that includes scaling and hyperparameter tuning
* __SVM classifier__ hyperparameters:
    * __*C*__ controls the regularization strength; it is analogous to the *C* of Logistic Regression (a smaller *C* means stronger regularization)
    * __*gamma*__ controls the kernel coefficient
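
To get a feel for what `gamma` does, the following sketch fits the same scaled SVC pipeline with a few `gamma` values and records training and test accuracy for each. The iris dataset and the specific values of `gamma` are assumptions chosen only for illustration; they are not from the original exercise.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Built-in dataset used only for illustration
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

results = {}
for gamma in [0.01, 1, 100]:
    pipeline = Pipeline([('scaler', StandardScaler()),
                         ('SVM', SVC(C=1, gamma=gamma))])
    pipeline.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this gamma
    results[gamma] = (pipeline.score(X_train, y_train),
                      pipeline.score(X_test, y_test))

for gamma, (train_acc, test_acc) in results.items():
    print("gamma={}: train accuracy={:.3f}, test accuracy={:.3f}".format(
        gamma, train_acc, test_acc))
```

A large gap between training and test accuracy at a high `gamma` would suggest overfitting, which is why `C` and `gamma` are usually tuned together with a grid search rather than set by hand.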
    

# Notes for Cap2:

* baseline error of using mean to predict values
* train test split
* optimal value of k for k-fold cross validation
* SCALE / normalize values
* Centering/ scaling/ normalizing values-- Standardization?
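
For the first note, one possible way to get a baseline error is scikit-learn's `DummyRegressor`, which always predicts the training-set mean. This is only a sketch under assumptions: the synthetic data below is made up, and RMSE is chosen here as the error metric for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical), used only to make the example runnable
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Baseline: ignore the features and always predict the mean of y_train
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)

# Root mean squared error of the baseline on the test set
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
print("Baseline RMSE: {}".format(baseline_rmse))
```

Any real model should beat this number on the same test set; if it does not, the features are adding little beyond the mean.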