# Class 25: Unsupervised learning

Plan for today:
- Linear regression
- Clustering


In [None]:
import YData

# YData.download.download_class_code(25)   # get class code    
# YData.download.download_class_code(25, True) # get the code with the answers 


If you are using Google Colab, run the code below.

In [None]:
# !pip install https://github.com/emeyers/YData_package/tarball/master
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
import statistics
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from urllib.request import urlopen

import matplotlib.pyplot as plt
%matplotlib inline

# Suppress FutureWarning messages - please ignore this code 
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

## 0. Linear regression review

In regression, we try to predict a quantitative variable `y` from a set of features `X`. 

Let's explore this by predicting the body mass of penguins (in grams) from other quantitative features of a penguin (e.g., their bill and flipper sizes). 


In [None]:
penguins = sns.load_dataset("penguins")
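penguins = penguins.dropna()            # drop rows with missing values
penguins = penguins.sample(frac = 1)    # shuffle the rows (no seed is set, so the order differs on each run)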
penguins.head()

In [None]:
# get the features and the labels

X_penguin_features = penguins[['bill_length_mm', 
                               'bill_depth_mm',
                               'flipper_length_mm']]

y_penguin = penguins['body_mass_g']


# also save the penguin species to use later
y_penguin_species = penguins['species']


Let's use scikit-learn to generate training and test data as we did previously for our KNN classifier. 

In [None]:
from sklearn.model_selection import train_test_split

# split data into a training and test set

X_train, X_test, y_train, y_test = train_test_split(X_penguin_features,  
                                                    y_penguin, 
                                                    random_state = 0)

print(X_train.shape)
print(X_test.shape)

X_train.head(5)


We can now create a new linear regression model, fit it to data, and make predictions. The method names are again very similar to what we used for the KNN classifier (i.e., the `fit()` and `predict()` methods). 

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# create a new linear regression model
linear_model = LinearRegression()


# fit the model to our training data
linear_model.fit(X_train, y_train)


# make predictions of the penguins' body mass on the test data
body_mass_predictions = linear_model.predict(X_test)


# Use scikit-learn's mean_squared_error() function to get the RMSE
np.sqrt(mean_squared_error(y_test, body_mass_predictions))
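
To make the RMSE calculation above concrete, here is a minimal sketch that computes it by hand with NumPy. The body masses and predictions are made-up numbers for illustration only; they are not taken from the penguin data.

```python
import numpy as np

# Hypothetical true body masses and model predictions (in grams)
y_true = np.array([3000.0, 4000.0, 5000.0])
y_pred = np.array([3100.0, 3900.0, 5200.0])

# Mean squared error: the average of the squared prediction errors
squared_errors = (y_true - y_pred) ** 2
mse = squared_errors.mean()

# Root mean squared error: the square root puts the error back in grams
rmse = np.sqrt(mse)

print(mse)
print(rmse)
```

Taking the square root of the output of `mean_squared_error()` performs the same calculation as the last two steps here.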


In [None]:
# using cross-validation
from sklearn.model_selection import cross_val_score

linear_model = LinearRegression()

scores = cross_val_score(linear_model, 
                         X_penguin_features,  
                         y_penguin, 
                         cv = 5, 
                         scoring='neg_mean_squared_error')

np.sqrt(np.mean(-1 * scores))

### Regression model equation

In linear regression, our predicted $\hat{y}$ values are given by the equation: $\hat{y} = b_0 + b_1 x_1 + \dots + b_k x_k$.

Let's fill out this equation for predicting penguin body mass. 

To do this, let's start by extracting the intercept ($b_0$) and slope coefficients (the $b_i$'s) from our scikit-learn model.


In [None]:

# fit the linear regression model to our training data
linear_model.fit(X_train, y_train)

# get the intercept and slope coefficients
sklearn_intercept = linear_model.intercept_
sklearn_coefficients = linear_model.coef_

# print out the coefficient values
(sklearn_intercept, sklearn_coefficients)  


Given these coefficient values, can you write out the regression equation for predicting penguin body mass? 


#### Answer

$\hat{y}_{mass} = -6680.13 + 0.5049 \cdot x_{bill-length} + 21.6384 \cdot x_{bill-depth} + 52.1411 \cdot x_{flipper-length}$
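
As a quick check of how this equation is used, the sketch below plugs one penguin's measurements into it. The intercept and coefficients are copied from the answer above; the measurements in `x_new` are hypothetical values chosen only for illustration.

```python
import numpy as np

# Intercept and slope coefficients copied from the regression equation above
intercept = -6680.13
coefficients = np.array([0.5049, 21.6384, 52.1411])

# Hypothetical penguin (assumed values, not from the data set), in the same
# feature order: bill length (mm), bill depth (mm), flipper length (mm)
x_new = np.array([45.0, 17.0, 200.0])

# y_hat = b_0 + b_1 * x_1 + b_2 * x_2 + b_3 * x_3
predicted_mass = intercept + coefficients @ x_new

print(round(predicted_mass, 1))   # predicted body mass in grams
```

The order of the values in `x_new` has to match the column order used when the model was fit. A fitted scikit-learn model's `predict()` method carries out this same intercept-plus-dot-product calculation.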



## 1. Inference on regression coefficients

We can also run inference procedures on our regression model using the statsmodels package. In particular, we can run hypothesis tests and create confidence intervals for our regression coefficients. 

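When running a hypothesis test on the $i$-th coefficient, our hypotheses are $H_0: \beta_i = 0$ and $H_A: \beta_i \neq 0$, where $\beta_i$ is the population coefficient that $b_i$ estimates.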





In [None]:
# Hypothesis test on regression coefficients - which coefficients are statistically significantly different from zero? 
# (and confidence interval)

import statsmodels.api as sm

# add a constant value of 1 to our data


# fit the linear regression model using the OLS function


# get information on the regression coefficients found



## 2. Unsupervised learning: clustering

We can do k-means clustering in scikit-learn using the `KMeans()` object.
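
Before working with the penguins, here is a minimal sketch of the `KMeans()` workflow on synthetic data. The three groups "A", "B", and "C" and their centers are made up for this example; they stand in for three well-separated species.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic data (an assumption for this sketch): 3 groups of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
group_names = np.repeat(["A", "B", "C"], 50)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centers])

# fit k-means with 3 clusters (the group names are NOT given to the model)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)

# see which cluster each point belongs to
cluster_labels = kmeans.labels_

# look at a matrix of which groups end up in which cluster
confusion = pd.crosstab(group_names, cluster_labels,
                        rownames=["group"], colnames=["cluster"])
print(confusion)
```

The cluster numbers themselves are arbitrary, so what matters in the table is whether each group falls mostly into a single cluster, not which number that cluster received.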


In [None]:
from sklearn.cluster import KMeans

# fit k-means with 3 clusters 





In [None]:
# see which cluster each point belongs to 



In [None]:
# look at a matrix of which penguin types end up in which cluster 





In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# do clustering with feature normalization 
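
The sketch below illustrates why normalization can matter for k-means, using synthetic data constructed for this purpose (an assumption, not the penguin data). One feature separates the two groups but has a small numeric range; the other is pure noise with a large range. The pipeline standardizes each feature before clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two groups that differ only in the small-scale feature
rng = np.random.default_rng(0)
n = 200
true_groups = np.repeat([0, 1], n)
small_scale = np.concatenate([rng.normal(0.0, 0.1, n), rng.normal(1.0, 0.1, n)])
large_scale = rng.normal(0.0, 100.0, 2 * n)   # noise unrelated to the groups
X = np.column_stack([small_scale, large_scale])

# k-means on the raw features: distances are dominated by the large-scale noise
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# k-means after standardizing each feature to mean 0 and standard deviation 1
scaled_kmeans = make_pipeline(StandardScaler(),
                              KMeans(n_clusters=2, n_init=10, random_state=0))
scaled_labels = scaled_kmeans.fit_predict(X)

# adjusted Rand index: 1 means perfect agreement with the true groups
ari_raw = adjusted_rand_score(true_groups, raw_labels)
ari_scaled = adjusted_rand_score(true_groups, scaled_labels)
print(ari_raw, ari_scaled)
```

How much normalization helps on real data depends on how different the feature scales are and on which features carry the group structure.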





In [None]:
# see which cluster each (normalized) point belongs to





In [None]:
# look at a matrix of which penguin types end up in which cluster 





### 2b. Unsupervised learning: Hierarchical clustering


In [None]:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster import hierarchy

# Ward's method repeatedly merges the pair of clusters whose merger gives the smallest increase in the total within-cluster sum of squares




In [None]:
# display a dendrogram




In [None]:
# cluster points into 3 clusters 




# get the predicted cluster for each point



In [None]:
# visualize how well the clustering matches the penguin species
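
A minimal sketch of the hierarchical clustering steps above, again on synthetic data (three made-up groups standing in for the species). It builds the Ward linkage with SciPy, computes the dendrogram layout, and assigns points to 3 clusters with scikit-learn.

```python
import numpy as np
from scipy.cluster import hierarchy
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Synthetic data (an assumption for this sketch): 3 groups of 30 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
true_groups = np.repeat([0, 1, 2], 30)
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(30, 2)) for c in centers])

# build the merge tree with Ward's method; each row describes one merge
linkage_matrix = hierarchy.linkage(X, method="ward")

# compute the dendrogram layout; remove no_plot=True to draw it with matplotlib
dendrogram_info = hierarchy.dendrogram(linkage_matrix, no_plot=True)

# cluster points into 3 clusters and get the predicted cluster for each point
agglomerative = AgglomerativeClustering(n_clusters=3, linkage="ward")
predicted_clusters = agglomerative.fit_predict(X)

# compare the clusters to the true groups (1 means perfect agreement)
ari = adjusted_rand_score(true_groups, predicted_clusters)
print(ari)
```

For the penguins, a cross-tabulation of the species against the predicted clusters gives a more detailed view than a single agreement score.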






## 3. Chatbots 

Large language models (LLMs) are taking over the world. I, for one, welcome our new robot [overlords](https://www.youtube.com/watch?v=8lcUHQYhPTE).

Let's explore how we can use a model from HuggingFace to create a chatbot.

To do this we need to install some additional packages. I recommend cloning your Jupyter environment, and then adding these packages to the new environment.


In [None]:
# Modified from code created by Giuliano Formisano
# Updated to work with the new version of the Hugging Face packages

from transformers import pipeline, BlenderbotTokenizer, BlenderbotForConditionalGeneration

model_name = "facebook/blenderbot-400M-distill"

# Option A: Using text2text-generation pipeline (single-turn):
chatbot = pipeline("text2text-generation", model=model_name, tokenizer=model_name, device = "cpu")

# This uses "mps" by default on my mac which should be faster but there appears to be a bug in the code
#chatbot = pipeline("text2text-generation", model=model_name, tokenizer=model_name)


user_input = "Hi! What can you do?"
response = chatbot(user_input, max_new_tokens=50)

print(f"User:  {user_input}")
print(f"Bot:   {response[0]['generated_text']}")



### User-chatbot interaction loop

In [None]:
# Interaction loop between the user and the chatbot
while True:
    user_input = input("You: ")  # type your message in the box that appears below
    if user_input.lower() == "quit":  # type "quit" to exit the loop
        break
    response = chatbot(user_input)  # this is a bit slow
    print(f"Chatbot: {response[0]['generated_text']}")

### Hugging Face data sets

In [None]:
# Hugging Face also has a number of large/interesting data sets
# (some of them controversial, and of course, one should be cautious of the veracity of all data sets)



# Load a data set 

from datasets import load_dataset

#emails = load_dataset("tensonaut/EPSTEIN_FILES_20K", split="train")
emails = load_dataset("corbt/enron-emails", split="train")




# print type and shape






# convert to a pandas data frame





In [None]:
# search for keywords
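
A minimal sketch of a keyword search with pandas. The small data frame and its column names (`subject`, `body`) are hypothetical stand-ins; the actual columns of the loaded data set may differ and should be checked first. A Hugging Face `Dataset` can usually be converted to a data frame with its `to_pandas()` method.

```python
import pandas as pd

# Hypothetical stand-in for the converted e-mail data frame
emails_df = pd.DataFrame({
    "subject": ["Schedule", "Q3 report", "Empty", "Follow-up"],
    "body": ["Can we move the Meeting to Friday?",
             "Quarterly numbers attached.",
             None,
             "Notes from yesterday's meeting are below."],
})

keyword = "meeting"

# case-insensitive literal match; rows with missing text count as non-matches
is_match = emails_df["body"].str.contains(keyword, case=False, na=False, regex=False)
matches = emails_df[is_match]

print(matches)
```

Setting `regex=False` treats the keyword as plain text, which avoids surprises when it contains characters such as `.` or `?`.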








