# Class 24: Linear regression and unsupervised learning

Plan for today:
- Linear regression
- Clustering


In [32]:
import YData

# YData.download.download_class_code(24)   # get class code    
# YData.download.download_class_code(24, True) # get the code with the answers 

# YData.download.download_homework(9)  # downloads the homework 

# project review template
# YData.download.download_class_file('reviewer_template.ipynb', 'homework')


If you are using Google Colab, you should run the code below.

In [33]:
# !pip install https://github.com/emeyers/YData_package/tarball/master
# from google.colab import drive
# drive.mount('/content/drive')

In [34]:
import statistics
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from urllib.request import urlopen

import matplotlib.pyplot as plt
%matplotlib inline

# Suppress FutureWarning messages - please ignore this code 
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

## 0. Feature normalization

If you look at the features we have been using in our analyses, you will notice that they are on very different scales. This is quite problematic for a KNN classifier: the classifier computes the distance between data points, so features that have large values will dominate this distance. 
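To make the scale problem concrete, here is a minimal sketch using two made-up penguins (the values are invented for illustration and are not taken from the dataset). The bill lengths differ by a lot relative to typical bill sizes, while the body masses differ only slightly, yet body mass contributes most of the Euclidean distance.

```python
import numpy as np

# Two hypothetical penguins: [bill_length_mm, body_mass_g] (made-up values)
penguin_a = np.array([35.0, 3700.0])
penguin_b = np.array([55.0, 3750.0])

# Euclidean distance computed on the raw, unscaled features
raw_distance = np.linalg.norm(penguin_a - penguin_b)

# Fraction of the squared distance that comes from the body mass feature
squared_diffs = (penguin_a - penguin_b) ** 2
share_from_mass = squared_diffs[1] / squared_diffs.sum()

print(raw_distance)
print(share_from_mass)
```

A 20 mm bill difference is large for a penguin and a 50 g mass difference is small, but because grams produce bigger numbers, most of the distance comes from body mass.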

Let's explore the scales that different features have by looking at some descriptive statistics. In particular, let's go back to the manually created `X_train`, `X_test`, `y_train`, `y_test` to examine the scale that different features are measured on.


In [35]:
# get the features and the labels

penguins = sns.load_dataset("penguins")
penguins = penguins.dropna()
penguins = penguins.sample(frac = 1)

X_penguin_features = penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]

y_penguin_labels = penguins['species']


In [36]:
from sklearn.model_selection import train_test_split


# Create the training and test splits of the data using train_test_split



# Get summary statistics of the training data using the .describe() method



Let's do a z-score transformation of our features, which sets the mean of each feature to 0 and the standard deviation to 1. We can do this using the `StandardScaler()` object as follows: 

1. Create a new `StandardScaler()` object using `scaler = StandardScaler()` 

2. Have the `scaler` object learn the means and standard deviations of our training data by calling the `scaler.fit(X)` method on the training data.

3. Use the fitted `scaler` object to transform both the training and test features so that all features are on a similar scale by calling the `.transform(X)` method. 
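The three steps above can be sketched as follows. This is only an illustration on synthetic data (the random arrays stand in for two features on different scales and are not the penguin data); the variable names are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for two features on very different scales (not the real data)
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.normal(45, 5, 100), rng.normal(4200, 800, 100)])
X_test = np.column_stack([rng.normal(45, 5, 30), rng.normal(4200, 800, 30)])

# 1. create the scaler
scaler = StandardScaler()

# 2. learn the means and standard deviations from the training data only
scaler.fit(X_train)

# 3. apply the same learned transformation to the training and test features
X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)

print(X_train_transformed.mean(axis=0))  # approximately 0 for each feature
print(X_train_transformed.std(axis=0))   # approximately 1 for each feature
```

The transformed test features will usually not have a mean of exactly 0 or a standard deviation of exactly 1, because they are scaled with the statistics learned from the training set.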


In [37]:
from sklearn.preprocessing import StandardScaler


# learning the mean and standard deviations to scale the features






In [38]:
# z-score transform the features 






Let's now look at our transformed training data...

In [39]:
# view descriptive statistics on the transformed features





Let's see how our classification accuracy changes using the z-score transformed data

In [40]:
from sklearn.neighbors import KNeighborsClassifier

# apply KNN classification on the normalized features





In order to transform our features inside a cross-validation loop, we can set up a pipeline. This pipeline will do the following:

1. It will split the data into a training and test set
2. It will fit the transformation of the features on the training set (i.e., learn the means and standard deviations on the training set). 
3. It will apply a z-score transformation to the training and test sets based on the means and standard deviations learned in step 2
4. It will train the classifier on the transformed data
5. It will measure the classification accuracy on the test data
6. It will repeat this process k times, where k here refers to how many cross-validation splits we are using

In order to do this in scikit-learn we can use a `Pipeline` object which sets up the stages of transformation and classification. We can then use the `cross_val_score()` function to run cross-validation on this pipeline. 
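One way such a pipeline might look is sketched below. The data is synthetic and the step names (`"scaler"`, `"knn"`), the choice of 5 neighbors, and the 5 shuffled folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic two-class data with features on different scales (illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([40, 3500], [2, 300], size=(50, 2)),
               rng.normal([50, 5000], [2, 300], size=(50, 2))])
y = np.array(["A"] * 50 + ["B"] * 50)

# components that go into the pipeline
scaler = StandardScaler()
knn = KNeighborsClassifier(n_neighbors=5)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# build the pipeline: the scaler is re-fit on the training part of every split
pipeline = Pipeline([("scaler", scaler), ("knn", knn)])

# get the cross-validation scores (classification accuracy by default)
scores = cross_val_score(pipeline, X, y, cv=cv)

# mean accuracy over the 5 cross-validation splits
print(scores.mean())
```

Because the scaler lives inside the pipeline, the test fold never influences the means and standard deviations used for normalization.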

In [41]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score


# create a pipeline for running cross-validation with feature normalization

# components that go into the pipeline



# build the pipeline



# get the cross-validation scores



# print out the mean score over the 5 cross-validation splits


## 1. Linear regression

In regression, we try to predict a quantitative variable y from a set of features X. 

Let's explore this by predicting the body mass of penguins (in grams) from other quantitative features of a penguin (e.g., their bill and flipper sizes). 


In [42]:
penguins = sns.load_dataset("penguins")

penguins = penguins.dropna()

penguins = penguins.sample(frac = 1)

penguins.head()

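|     | species   | island    | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex    |
|-----|-----------|-----------|----------------|---------------|-------------------|-------------|--------|
| 100 | Adelie    | Biscoe    | 35.0           | 17.9          | 192.0             | 3725.0      | Female |
| 311 | Gentoo    | Biscoe    | 52.2           | 17.1          | 228.0             | 5400.0      | Male   |
| 0   | Adelie    | Torgersen | 39.1           | 18.7          | 181.0             | 3750.0      | Male   |
| 195 | Chinstrap | Dream     | 45.5           | 17.0          | 196.0             | 3500.0      | Female |
| 299 | Gentoo    | Biscoe    | 45.2           | 16.4          | 223.0             | 5950.0      | Male   |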


In [43]:
# get the features and the labels

X_penguin_features = penguins[['bill_length_mm', 'bill_depth_mm','flipper_length_mm']]

y_penguin = penguins['body_mass_g']


# also save the penguin species to use later
y_penguin_species = penguins['species']


Let's use scikit-learn to generate training and test data as we did previously for our KNN classifier. 

In [44]:
from sklearn.model_selection import train_test_split

# split data into a training and test set






We can now create a new linear regression model, fit it to data, and make predictions. The method names are again very similar to what we used for the KNN classifier (i.e., the `fit()` and `predict()` methods). 
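As a rough sketch of this create, fit, predict workflow, the example below uses synthetic data in place of the penguin features; the coefficients used to generate `y` and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: three features and a response that is linear in them plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 4000 + 300 * X[:, 0] - 100 * X[:, 1] + 500 * X[:, 2] + rng.normal(0, 50, 200)

# split data into a training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# create a new linear regression model
linear_model = LinearRegression()

# fit the model to the training data
linear_model.fit(X_train, y_train)

# make predictions on the test data
y_predictions = linear_model.predict(X_test)
print(y_predictions[:5])
```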

In [45]:
from sklearn.linear_model import LinearRegression

# create a new linear regression model



In [46]:
# fit the model to our training data



In [47]:
# make predictions of the penguins' body mass on the test data




We can assess the accuracy of our predictions using the root mean squared error, which is defined as: 

$$RMSE = \sqrt{ \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

Here the $\hat{y}_i$ are the predictions made by our linear model on the test data (i.e., the predicted body weights), the $y_i$ are the actual body weights for the points in our test set, and $n$ is the number of points in the test set.
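The RMSE formula can be computed directly with NumPy. The sketch below uses four made-up actual and predicted values so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Made-up actual and predicted body masses (grams), for illustration only
y_actual = np.array([3700.0, 4200.0, 5100.0, 3900.0])
y_predicted = np.array([3600.0, 4300.0, 5000.0, 4100.0])

# residuals: 100, -100, 100, -200
residuals = y_actual - y_predicted

# square the residuals, average them, then take the square root
rmse = np.sqrt(np.mean(residuals ** 2))
print(rmse)
```

Here the squared residuals sum to 70000, so the mean squared error is 17500 and the RMSE is its square root, roughly 132 grams.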

In [48]:
# test the RMSE on the test data




We can also use scikit-learn's `mean_squared_error()` to get the MSE, and we can use the `cross_val_score` to run k-fold cross-validation (again, in a very similar way to what we did for our KNN classifier). 
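A minimal sketch of both approaches on synthetic data is shown below. The `"neg_mean_squared_error"` scoring string is used because scikit-learn scorers treat higher values as better, so the MSE is returned negated; the data and the choice of 5 folds are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 2))
y = 10 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 1, 150)

# MSE from scikit-learn, then the square root to get the RMSE
model = LinearRegression().fit(X, y)
mse = mean_squared_error(y, model.predict(X))
rmse = np.sqrt(mse)

# 5-fold cross-validation; scores are negated MSE values, one per fold
neg_mse_scores = cross_val_score(LinearRegression(), X, y, cv=5,
                                 scoring="neg_mean_squared_error")

# flip the sign and take the square root to get an RMSE per fold
cv_rmse = np.sqrt(-neg_mse_scores)
print(rmse, cv_rmse.mean())
```

Note that `rmse` here is computed on the same data the model was fit to, so it is a training error; the cross-validated values give a fairer estimate of prediction error.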

In [49]:
from sklearn.metrics import mean_squared_error

# Use scikit-learn's mean_squared_error() function to get the MSE, then take the square root to get the RMSE




In [50]:
# using cross-validation
from sklearn.model_selection import cross_val_score





### Regression model equation

In linear regression, our predicted $\hat{y}$ values are given by the equation: $\hat{y} = b_0 + b_1 x_1 + ... + b_k x_k$.

Let's fill out this equation for predicting penguin body mass. 

To do this, let's start by extracting the intercept ($b_0$) and slope coefficients (the $b_i$'s) from our scikit-learn model.
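In scikit-learn, a fitted `LinearRegression` stores the intercept in `intercept_` and the slopes in `coef_`. The sketch below fits a model to noiseless synthetic data generated from a known equation (an assumption made so the recovered coefficients are easy to check) and prints the resulting regression equation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noiseless synthetic data generated from y = 1000 + 20*x1 - 5*x2
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 1000 + 20 * X[:, 0] - 5 * X[:, 1]

# fit the linear regression model
model = LinearRegression().fit(X, y)

# get the intercept (b0) and the slope coefficients (b1, b2)
b0 = model.intercept_
b = model.coef_

# print the fitted regression equation
print(f"y_hat = {b0:.1f} + ({b[0]:.1f}) * x1 + ({b[1]:.1f}) * x2")
```

The order of the values in `coef_` matches the order of the feature columns passed to `fit()`.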


In [51]:

# fit the linear regression model to our training data


# get the intercept and slope coefficients



# print out the coefficient values



Given these coefficient values, can you write out the regression equation for predicting penguin body mass? 


#### Answer




#### Writing our own prediction function

Let's also write our own function called `get_predictions(b0_intercept, b_coefficients, X_data)` that takes the coefficient values and X values and returns predicted $\hat{y}$ values for each X value. In particular, the arguments to the function are:

1. `b0_intercept`: The linear regression intercept
2. `b_coefficients`: The linear regression slope coefficients
3. `X_data`: The X data values 

The returned value is a numpy ndarray of predictions for each X data point. 
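One possible way to write such a function is sketched below; it is not the only solution. It relies on the fact that the regression equation is the intercept plus the dot product of each row of X with the coefficient vector. The small example values are made up.

```python
import numpy as np

def get_predictions(b0_intercept, b_coefficients, X_data):
    """Return b0 + b1*x1 + ... + bk*xk for every row of X_data."""
    # accept DataFrames, lists, or arrays by converting to a float ndarray
    X_array = np.asarray(X_data, dtype=float)
    coefficients = np.asarray(b_coefficients, dtype=float)
    # the matrix-vector product gives one weighted sum per row
    return b0_intercept + X_array @ coefficients

# two made-up data points with two features each
X_example = np.array([[1.0, 2.0],
                      [3.0, 4.0]])
predictions = get_predictions(10.0, [2.0, -1.0], X_example)
print(predictions)
```

For the first row the prediction is 10 + 2(1) - 1(2) = 10, and for the second row it is 10 + 2(3) - 1(4) = 12.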


In [52]:
# write a function to get the predictions
def get_predictions(b0_intercept, b_coefficients, X_data):
    ...



# get the predicted values on the test data




# see that it matches the scikit-learn predictions




## 1b. Inference on regression coefficients

We can also run inference procedures on our regression model using the statsmodels package. In particular, we can run hypothesis tests and create confidence intervals for our regression coefficients. 

When running a hypothesis test, our hypotheses are:
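Assuming the usual two-sided test for each slope, where $\beta_i$ denotes the population coefficient that $b_i$ estimates (notation introduced here for illustration), the hypotheses can be written as:

$$H_0: \beta_i = 0 \qquad H_A: \beta_i \neq 0$$

A small p-value for a coefficient is evidence against $H_0$, i.e., evidence that the corresponding feature has a nonzero linear relationship with y after accounting for the other features in the model.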





In [53]:
# Hypothesis test on regression coefficients - which coefficients are statistically significantly different from zero? 
# (and confidence interval)

import statsmodels.api as sm

# add a constant value of 1 to our data


# fit the linear regression model using the OLS function


# get information on the regression coefficients found



## 2. Unsupervised learning: clustering

We can do k-means clustering in scikit-learn using the `KMeans()` object.
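As a rough sketch, the example below runs `KMeans()` on three well-separated synthetic blobs and cross-tabulates the known groups against the clusters found. The data, group names, and the `n_init` and `random_state` settings are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs (illustration only)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [20, 0]])
X = np.vstack([rng.normal(c, 0.5, size=(40, 2)) for c in centers])
true_groups = np.repeat(["g1", "g2", "g3"], 40)

# fit k-means with 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)

# see which cluster each point belongs to
cluster_labels = kmeans.labels_

# matrix of which known groups end up in which cluster
confusion = pd.crosstab(true_groups, cluster_labels,
                        rownames=["group"], colnames=["cluster"])
print(confusion)
```

The cluster numbers are arbitrary, so a good clustering shows up as one large count per row rather than counts on the diagonal.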


In [54]:
from sklearn.cluster import KMeans

# fit k-means with 3 clusters 





In [55]:
# see which cluster each point belongs to 



In [56]:
# look at a matrix of which penguin types end up in which cluster 





In [57]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# do clustering with feature normalization 
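One way to combine normalization and clustering is `make_pipeline()`, which chains a `StandardScaler()` and a `KMeans()` object. The sketch below uses synthetic data with one feature in millimeter-like units and one in gram-like units; the data and settings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic groups whose two features are on very different scales
rng = np.random.default_rng(0)
centers = np.array([[40.0, 3000.0], [45.0, 4000.0], [50.0, 5000.0]])
X = np.vstack([rng.normal(c, [0.5, 50.0], size=(40, 2)) for c in centers])
true_groups = np.repeat(["g1", "g2", "g3"], 40)

# z-score the features, then run k-means on the normalized values
pipeline = make_pipeline(StandardScaler(),
                         KMeans(n_clusters=3, n_init=10, random_state=0))

# fit the pipeline and get the cluster for each (normalized) point
cluster_labels = pipeline.fit_predict(X)

# matrix of which known groups end up in which cluster
confusion = pd.crosstab(true_groups, cluster_labels,
                        rownames=["group"], colnames=["cluster"])
print(confusion)
```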





In [58]:
# see which cluster each (normalized) point belongs to





In [59]:
# look at a matrix of which penguin types end up in which cluster 





### 2b. Unsupervised learning: Hierarchical clustering


In [60]:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster import hierarchy

# Ward's method merges, at each step, the pair of clusters that gives the smallest increase in the sum of squared differences within all clusters
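The sketch below shows one way to run hierarchical clustering with Ward's method on synthetic data, using SciPy for the linkage matrix and dendrogram and scikit-learn's `AgglomerativeClustering` for the cluster assignments. The data and variable names are assumptions.

```python
import numpy as np
from scipy.cluster import hierarchy
from sklearn.cluster import AgglomerativeClustering

# Three well-separated synthetic blobs of 20 points each (illustration only)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [20, 0]])
X = np.vstack([rng.normal(c, 0.5, size=(20, 2)) for c in centers])

# linkage matrix using Ward's method; each row records one merge of two clusters
Z = hierarchy.linkage(X, method="ward")

# compute the dendrogram layout; no_plot=True skips drawing,
# so remove that argument in a notebook to display the plot
dendro = hierarchy.dendrogram(Z, no_plot=True)

# cluster the points into 3 clusters and get the predicted cluster for each point
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
cluster_labels = agg.fit_predict(X)
print(cluster_labels)
```

With n data points there are n - 1 merges, so the linkage matrix here has 59 rows.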




In [61]:
# display a dendrogram




In [62]:
# cluster points into 3 clusters 




# get the predicted cluster for each point



In [63]:
# visualize how well the clustering matches the penguin species




