# Class 25: Linear regression and unsupervised learning

Plan for today:
- Linear regression
- Clustering


## Notes on the class Jupyter setup

If you have the *ydata123_2024a* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [1]:
import YData

# YData.download.download_class_code(25)   # get class code    
# YData.download.download_class_code(25, True) # get the code with the answers 


If you are using Google Colab, you should install the YData package by uncommenting and running the code below.

In [2]:
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [3]:
# from google.colab import drive
# drive.mount('/content/drive')

In [4]:
import statistics
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from urllib.request import urlopen

import matplotlib.pyplot as plt
%matplotlib inline

# Suppress FutureWarning messages - please ignore this code 
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

## 1. Linear regression

In regression, we try to predict a quantitative variable y from a set of features X. 

Let's explore this by predicting the body mass of penguins (in grams) from other quantitative features of a penguin (e.g., their bill and flipper sizes). 


In [5]:
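# load the penguins data that comes with seaborn
penguins = sns.load_dataset("penguins")

# drop rows that have missing values
penguins = penguins.dropna()

# shuffle the rows (frac = 1 samples all of the rows in a random order)
penguins = penguins.sample(frac = 1)

penguins.head()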

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
| 107 | Adelie | Biscoe | 38.2 | 20.0 | 190.0 | 3900.0 | Male |
| 183 | Chinstrap | Dream | 54.2 | 20.8 | 201.0 | 4300.0 | Male |
| 184 | Chinstrap | Dream | 42.5 | 16.7 | 187.0 | 3350.0 | Female |
| 128 | Adelie | Torgersen | 39.0 | 17.1 | 191.0 | 3050.0 | Female |


In [6]:
# get the features and the labels

X_penguin_features = penguins[['bill_length_mm', 'bill_depth_mm','flipper_length_mm']]

y_penguin = penguins['body_mass_g']


# also save the penguin species to use later
y_penguin_species = penguins['species']


Let's use scikit-learn to generate training and test data as we did previously for our KNN classifier. 

In [7]:
from sklearn.model_selection import train_test_split

# split data into a training and test set
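
One possible way to do the split is sketched below. It is only an illustration: the data frame `X_demo` and the target `y_demo` are synthetic stand-ins made up for this example (they are not real penguin measurements), and the 25% test size and `random_state=0` are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the penguin features (NOT real measurements)
rng = np.random.default_rng(0)
n = 100
X_demo = pd.DataFrame({
    "bill_length_mm": rng.normal(44, 5, n),
    "bill_depth_mm": rng.normal(17, 2, n),
    "flipper_length_mm": rng.normal(200, 14, n),
})

# Made-up target that depends on flipper length plus noise
y_demo = 50 * X_demo["flipper_length_mm"] - 5800 + rng.normal(0, 300, n)

# Hold out 25% of the rows as a test set; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```

With the class data, the same call would presumably take `X_penguin_features` and `y_penguin` in place of the stand-ins.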






We can now create a new linear regression model, fit it to data, and make predictions. The method names are again very similar to what we used for the KNN classifier (i.e., the `fit()` and `predict()` methods). 

In [8]:
from sklearn.linear_model import LinearRegression

# create a new linear regression model



In [9]:
# fit the model to our training data



In [10]:
# make predictions of the penguins body weight on the test data
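
The create, fit, and predict steps might look like the sketch below. It uses a noise-free synthetic data set (the names `X_demo`, `y_demo`, and `lin_reg` are made up for this example) so that the fitted coefficients can be checked against the values used to generate the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, noise-free data: y = 2*x0 + 3*x1 + 5 (an assumption for illustration)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(80, 2))
y_demo = 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1] + 5.0

X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=1)

# create a new linear regression model
lin_reg = LinearRegression()

# fit the model to the training data
lin_reg.fit(X_train, y_train)

# make predictions on the test data
y_pred = lin_reg.predict(X_test)

# the learned slopes and intercept are stored on the fitted model
print(lin_reg.coef_, lin_reg.intercept_)
```

Because the stand-in data has no noise, the printed coefficients should match the generating values up to floating-point error; real data will not fit this cleanly.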




We can assess the accuracy of our predictions using the root mean squared error (RMSE), which is defined as: 

$$RMSE = \sqrt{ \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

Here the $\hat{y}_i$ are the predictions made by our linear model on the test data (i.e., the predicted body weights), the $y_i$ are the actual body weights for the points in our test set, and $n$ is the number of points in the test set.

In [11]:
# test the RMSE on the test data
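
As a small worked example of the RMSE formula, the sketch below applies it directly with NumPy. The three actual and predicted weights are made-up numbers chosen for easy arithmetic, not values from the penguin data.

```python
import numpy as np

# Made-up actual and predicted body weights in grams (illustration only)
y_actual = np.array([3000.0, 4000.0, 5000.0])
y_predicted = np.array([3100.0, 3900.0, 5200.0])

# squared differences between the actual and predicted values
squared_errors = (y_actual - y_predicted) ** 2

# average the squared errors, then take the square root
rmse = np.sqrt(np.mean(squared_errors))

print(rmse)
```

The squared errors here are 10000, 10000, and 40000, so the mean squared error is 20000 and the RMSE is its square root. The RMSE is in the same units as the response (grams).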




We can also use scikit-learn's `mean_squared_error()` function to get the MSE (taking its square root gives the RMSE), and we can use `cross_val_score()` to run k-fold cross-validation (again, in a very similar way to what we did for our KNN classifier). 

In [12]:
from sklearn.metrics import mean_squared_error

# Use scikit-learn's mean_squared_error() function to get the RMSE
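
A minimal sketch of this step is shown below, using made-up test values and predictions rather than output from a fitted model. Taking the square root explicitly with `np.sqrt()` is one approach that should work across scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up test values and predictions in grams (illustration only)
y_test_demo = np.array([3750.0, 4200.0, 5400.0, 3300.0])
y_pred_demo = np.array([3650.0, 4400.0, 5400.0, 3500.0])

# mean_squared_error() returns the MSE
mse = mean_squared_error(y_test_demo, y_pred_demo)

# the RMSE is the square root of the MSE
rmse = np.sqrt(mse)

print(mse, rmse)
```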




In [13]:
# using cross-validation
from sklearn.model_selection import cross_val_score
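
One way to get a cross-validated RMSE is sketched below, assuming 5 folds and synthetic data (`X_demo` and `y_demo` are made up for this example). scikit-learn scorers follow a "higher is better" convention, so the MSE is requested as `"neg_mean_squared_error"` and negated before taking the square root.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: a linear signal plus a small amount of noise (illustration only)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 5-fold cross-validation; each score is the negative MSE on one held-out fold
neg_mse_per_fold = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)

# flip the sign and take the square root to get one RMSE per fold
rmse_per_fold = np.sqrt(-neg_mse_per_fold)

print(rmse_per_fold)
print(rmse_per_fold.mean())
```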





## 2. Unsupervised learning: clustering

We can do k-means clustering in scikit-learn using the `KMeans()` object.


In [14]:
from sklearn.cluster import KMeans

# fit k-means with 3 clusters 





In [15]:
# see which cluster each point belongs to 
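
The fit and cluster-assignment steps could look like the sketch below. The data come from scikit-learn's `make_blobs()` with three well-separated centers chosen for this example, not from the penguins. Setting `n_init=10` and `random_state=0` is an assumption made here to keep the result repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: three well-separated blobs of 50 points each (illustration only)
X_demo, true_groups = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# fit k-means with 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X_demo)

# see which cluster each point belongs to
cluster_labels = kmeans.predict(X_demo)

print(cluster_labels[:10])
print(kmeans.cluster_centers_)
```

The cluster numbers (0, 1, 2) are arbitrary labels: they identify groups of points but carry no meaning of their own and need not match any ordering of the true groups.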



In [16]:
# look at a matrix of which penguin types end up in which cluster 
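
One way to build such a matrix is with `pd.crosstab()`, as sketched below. The species and cluster labels are a tiny hand-written toy example, not the result of clustering the penguin data.

```python
import pandas as pd

# Toy labels (illustration only): six penguins and the cluster each one landed in
species = pd.Series(
    ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    name="species",
)
clusters = pd.Series([0, 0, 1, 1, 0, 2], name="cluster")

# rows are species, columns are clusters, entries are counts
table = pd.crosstab(species, clusters)

print(table)
```

A clustering that matches the species well would put most of each row's counts in a single column, with a different column for each species.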





In [17]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# do clustering with feature normalization 





In [18]:
# see which cluster each (normalized) point belongs to
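
Clustering with feature normalization can be sketched with a pipeline, as below. The data are synthetic blobs in which the second feature has been multiplied by 1000 to mimic features measured in very different units. The names `scaled_kmeans` and `scaled_labels` are made up for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three blobs (illustration only)
X_demo, true_groups = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# put the second feature on a much larger scale than the first
X_demo[:, 1] *= 1000

# the pipeline z-scores each feature and then runs k-means on the scaled values
scaled_kmeans = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
scaled_kmeans.fit(X_demo)

# see which cluster each (normalized) point belongs to
scaled_labels = scaled_kmeans.predict(X_demo)

print(scaled_labels[:10])
```

Without the scaler, the Euclidean distances that k-means relies on would be dominated by whichever feature has the largest numeric range, regardless of how informative that feature is.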





In [19]:
# look at a matrix of which penguin types end up in which cluster 





### 2b. Unsupervised learning: Hierarchical clustering


In [20]:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster import hierarchy

# Ward's method merges, at each step, the pair of clusters whose merger gives the smallest increase in the total within-cluster sum of squared differences




In [21]:
# display a dendrogram
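
A minimal sketch of building a Ward linkage and a dendrogram with SciPy follows. The ten points are synthetic (two small groups made up for this example). Passing `no_plot=True` is an assumption made here so the sketch runs without a display; leaving it out draws the tree with matplotlib.

```python
import numpy as np
from scipy.cluster import hierarchy

# Synthetic data: two small groups of 5 points each (illustration only)
rng = np.random.default_rng(3)
X_demo = np.vstack([
    rng.normal(0, 0.5, size=(5, 2)),
    rng.normal(5, 0.5, size=(5, 2)),
])

# each row of the linkage matrix records one merge:
# the two clusters joined, the merge distance, and the size of the new cluster
linkage_matrix = hierarchy.linkage(X_demo, method="ward")

# compute the dendrogram layout without drawing it
dendro = hierarchy.dendrogram(linkage_matrix, no_plot=True)

# order of the original points along the bottom of the tree
print(dendro["leaves"])
```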




In [22]:
# cluster points into 3 clusters 




# get the predicted cluster for each point
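
These two steps might be written as in the sketch below, which runs `AgglomerativeClustering` with Ward linkage on synthetic, well-separated blobs (not the penguin data). `fit_predict()` fits the model and returns the cluster label for each point in one call.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic data: three well-separated blobs of 30 points each (illustration only)
X_demo, true_groups = make_blobs(
    n_samples=90,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# cluster points into 3 clusters using Ward's method
hier_model = AgglomerativeClustering(n_clusters=3, linkage="ward")

# get the predicted cluster for each point
hier_labels = hier_model.fit_predict(X_demo)

print(hier_labels[:10])
```

Note that `AgglomerativeClustering` has no separate `predict()` method for new points, so the labels describe only the data the model was fit on.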



In [23]:
# visualize how well the clustering matches the penguin species




