# Class 24: Unsupervised learning

Plan for today:
- Clustering
- Object-oriented programming
  

In [16]:
import YData

# YData.download.download_class_code(24)   # get class code    
# YData.download.download_class_code(24, True) # get the code with the answers 

# YData.download.download_homework(9)  # downloads the homework 

# project review template
# YData.download.download_class_file('reviewer_template.ipynb', 'homework')


If you are using Google Colab, you should run the code below.

In [17]:
# !pip install https://github.com/emeyers/YData_package/tarball/master
# from google.colab import drive
# drive.mount('/content/drive')

In [18]:
import statistics
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from urllib.request import urlopen

import matplotlib.pyplot as plt
%matplotlib inline

# Suppress FutureWarning messages - please ignore this code 
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [19]:
# Get our penguin data that we can use to test that our code is working properly

from sklearn.model_selection import train_test_split

penguins = sns.load_dataset("penguins")
penguins = penguins.dropna()
penguins = penguins.sample(frac = 1)

X_penguin_features = penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
y_penguin_labels = penguins['species']


## 1. Unsupervised learning: clustering

We can do k-means clustering in scikit-learn using the `KMeans()` object.
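
As a minimal sketch of the workflow (fit, read off the cluster assignments, then compare them with known groups), the example below uses synthetic data rather than the penguins. The names `X_demo`, `true_groups`, and `cluster_table` are made up for this illustration, and the choices `n_init=10` and `random_state=0` are assumptions made only so the result is reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for real features: three well-separated 2-D blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in centers])
true_groups = np.repeat(["a", "b", "c"], 30)

# Fit k-means with 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X_demo)

# Cluster index (0, 1, or 2) assigned to each point
cluster_labels = kmeans.labels_

# Matrix of known groups (rows) versus assigned clusters (columns)
cluster_table = pd.crosstab(true_groups, cluster_labels)
print(cluster_table)
```

The cluster numbers themselves are arbitrary, so a good clustering shows up as one large count per row of the table rather than as counts on the diagonal.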


In [20]:
from sklearn.cluster import KMeans

# fit k-means with 3 clusters 





In [21]:
# see which cluster each point belongs to 



In [22]:
# look at a matrix of which penguin types end up in which cluster 





In [23]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# do clustering with feature normalization 
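
Normalization matters because k-means relies on Euclidean distances, so a feature measured on a large scale (for example grams) can dominate features measured on a small scale (for example millimeters). The sketch below illustrates this on synthetic data; `X_demo`, `group`, and the helper `agreement()` are hypothetical names introduced only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the real grouping lives in a small-scale feature,
# while the second feature is large-scale noise
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 50)
small_feature = group * 10.0 + rng.normal(scale=0.5, size=100)
large_feature = rng.normal(scale=1000.0, size=100)
X_demo = np.column_stack([small_feature, large_feature])

# Standardize each feature to mean 0 and standard deviation 1, then cluster
scaled_kmeans = make_pipeline(StandardScaler(),
                              KMeans(n_clusters=2, n_init=10, random_state=0))
scaled_labels = scaled_kmeans.fit_predict(X_demo)

# For comparison: cluster the raw, unscaled features
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)


def agreement(labels, truth):
    # Fraction of points matching the true grouping, up to swapping cluster names
    match = np.mean(labels == truth)
    return max(match, 1 - match)


print("scaled:", agreement(scaled_labels, group))
print("raw:   ", agreement(raw_labels, group))
```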





In [24]:
# see which cluster each (normalized) point belongs to





In [25]:
# look at a matrix of which penguin types end up in which cluster 





### 1b. Unsupervised learning: Hierarchical clustering
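
Hierarchical (agglomerative) clustering starts with every point in its own cluster and repeatedly merges clusters. The sketch below shows one possible way to build the merge tree with SciPy and to cut it into 3 clusters with scikit-learn. It uses synthetic data, and the names `X_demo`, `linkage_matrix`, and `agg_labels` are assumptions made for this illustration.

```python
import numpy as np
from scipy.cluster import hierarchy
from sklearn.cluster import AgglomerativeClustering

# Synthetic data: three well-separated blobs of 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 2)) for c in centers])

# Build the merge tree using Ward linkage
linkage_matrix = hierarchy.linkage(X_demo, method="ward")

# Each row records one merge, so n points give n - 1 rows
print(linkage_matrix.shape)

# Compute the dendrogram layout; no_plot=True skips drawing,
# drop that argument in a notebook to display the figure
dendro = hierarchy.dendrogram(linkage_matrix, no_plot=True)

# Cut the hierarchy into 3 clusters and get the cluster of each point
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
agg_labels = agg.fit_predict(X_demo)
print(np.bincount(agg_labels))
```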


In [26]:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster import hierarchy

# Ward's method repeatedly merges the two clusters whose merger gives the smallest increase in the total within-cluster sum of squared differences




In [27]:
# display a dendrogram




In [28]:
# cluster points into 3 clusters 




# get the predicted cluster for each point



In [29]:
# visualize how well the clustering matches the penguin species






## 2. Object-oriented programming

[Object-oriented programming (OOP)](https://en.wikipedia.org/wiki/Object-oriented_programming) is a programming paradigm based on the concept of objects, which can contain data and code: data in the form of fields (often known as attributes or properties), and code in the form of procedures (often known as methods). In OOP, computer programs are designed by making them out of objects that interact with one another.
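
To make the terms concrete, here is a small sketch of a class with fields and methods. `RunningMean` is a hypothetical class invented for this illustration and is not part of the class code.

```python
class RunningMean:
    """Hypothetical example class: tracks the mean of the values it has seen."""

    def __init__(self):
        # Fields: data stored on each object
        self.total = 0.0
        self.count = 0

    def add(self, value):
        # Method: code attached to the object that updates its fields
        self.total += value
        self.count += 1

    def mean(self):
        # Method that computes a result from the object's fields
        return self.total / self.count


tracker = RunningMean()
for value in [2, 4, 6]:
    tracker.add(value)

print(tracker.mean())  # 4.0

# Each object keeps its own data, independent of other objects of the same class
other_tracker = RunningMean()
other_tracker.add(10)
print(other_tracker.mean(), tracker.count)  # 10.0 3
```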

Let's write our own K-Nearest Neighbor class that can create K-Nearest Neighbor classifiers!


### KNN functions

Below are the functions we previously wrote in class 22 to do K-Nearest Neighbor classification. 

We will now turn this code into a KNN object.


In [30]:
from sklearn.model_selection import train_test_split

# split data into a training and test set
X_train, X_test, y_train, y_test = train_test_split(X_penguin_features,  
                                                    y_penguin_labels, 
                                                    random_state = 0)

print(X_train.shape)


(249, 4)


In [31]:
# From class 22

# Calculate the Euclidean distance
def euclid_dist(x1, x2):
    return np.sqrt(np.sum((x1 - x2)**2))


# Get the labels and distances between a test point and all the training data
def get_labels_and_distances(test_point, X_train_features, y_train_labels):
    
    the_distances = []
    
    # get the distance between the test point and all training points
    for i in range(X_train_features.shape[0]):
        the_distances.append(euclid_dist(test_point, X_train_features.iloc[i]))

    
    # Create a DataFrame with the training labels and distances 
    labels_and_distances = pd.DataFrame({'label': y_train_labels, 'distance':the_distances})
    return labels_and_distances



# Classify a single test point
def classify_point(test_point, k, X_train_features, y_train_labels):
    
    labels_and_distances =  get_labels_and_distances(test_point, 
                                                     X_train_features, 
                                                     y_train_labels)

    sorted_labels_dist = labels_and_distances.sort_values("distance")
    sorted_labels_dist = sorted_labels_dist.iloc[0:k]
    
    
    count_table = sorted_labels_dist.groupby("label").count().reset_index()
    sorted_count_table = count_table.sort_values("distance", ascending = False)
    majority_class = sorted_count_table.iloc[0]["label"]
    
    return majority_class



# Classify a whole test set
def classify_all_test_data(X_test_data, k, X_train_features, y_train_labels):
    
    predictions = []
    
    for i in range(X_test_data.shape[0]):
        
        curr_test_point = X_test_data.iloc[i]
        
        curr_prediction = classify_point(curr_test_point, 
                                         k, 
                                         X_train_features, 
                                         y_train_labels)
        
        predictions.append(curr_prediction)

    return np.array(predictions)



all_predictions = classify_all_test_data(X_test, 5, X_train, y_train)

all_predictions

array(['Chinstrap', 'Adelie', 'Gentoo', 'Gentoo', 'Adelie', 'Chinstrap',
       'Gentoo', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',
       'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Gentoo',
       'Adelie', 'Adelie', 'Gentoo', 'Adelie', 'Gentoo', 'Gentoo',
       'Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Gentoo', 'Adelie',
       'Adelie', 'Gentoo', 'Gentoo', 'Adelie', 'Chinstrap', 'Adelie',
       'Gentoo', 'Gentoo', 'Adelie', 'Chinstrap', 'Adelie', 'Gentoo',
       'Gentoo', 'Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Adelie',
       'Gentoo', 'Adelie', 'Gentoo', 'Gentoo', 'Adelie', 'Adelie',
       'Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Adelie',
       'Adelie', 'Gentoo', 'Gentoo', 'Adelie', 'Adelie', 'Adelie',
       'Gentoo', 'Gentoo', 'Adelie', 'Adelie', 'Chinstrap', 'Adelie',
       'Adelie', 'Adelie', 'Gentoo', 'Adelie', 'Gentoo', 'Gentoo',
       'Gentoo', 'Gentoo', 'Adelie', 'Gentoo', 'Adelie', 'Adelie'],
      dtype='<U9')

### Object constructor

To start, let's write the "constructor" code that can be used to create a new KNN object. This code will simply store the number of neighbors used in a field called `k`. 
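
One possible way to write such a constructor is sketched below. The variable names `knn_3` and `knn_7` are made up for the illustration.

```python
class KNN:

    # Constructor: runs automatically when a new KNN object is created
    def __init__(self, n_neighbors):
        # Store the number of neighbors on the new object
        self.k = n_neighbors


# Each call to KNN(...) creates a separate object with its own value of k
knn_3 = KNN(3)
knn_7 = KNN(n_neighbors=7)

print(knn_3.k, knn_7.k)  # 3 7
```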


In [32]:
class KNN:
    
    # Constructor
    def __init__(self, n_neighbors): 

        ...  



In [33]:
# create an instance




In [34]:
# get the value stored in the property k



### The .fit() method

Let's now write the `.fit()` method. This method will merely store the training features and training labels in fields called `X_train` and `y_train`. 
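
A minimal sketch of this idea follows, assuming small toy lists in place of the penguin data (`toy_features` and `toy_labels` are hypothetical names).

```python
class KNN:

    # Constructor
    def __init__(self, n_neighbors):
        self.k = n_neighbors

    # The fit method: K-Nearest Neighbor does no computation at fit time,
    # it only remembers the training data for later use
    def fit(self, X_features_train, y_labels_train):
        self.X_train = X_features_train
        self.y_train = y_labels_train


toy_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_labels = ["a", "b", "a"]

knn = KNN(n_neighbors=1)
knn.fit(toy_features, toy_labels)

# The training data is now stored on the object
print(len(knn.X_train), knn.y_train)
```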


In [35]:
class KNN:
    
    # Constructor
    def __init__(self, n_neighbors): 
        self.k = n_neighbors 

    # The fit method
    def fit(self, X_features_train, y_labels_train):
        ...



In [36]:
# Create a KNN object and try the .fit() method 






### The .predict() method

Now let's write the `.predict()` method which will take a test data set `X_test` and will make predictions for which class each test point belongs to. 

To do this we will cheat a little and use the classification functions we wrote previously (i.e., the functions above). We could also just include these functions in our object (i.e., cut and paste them into our object). 
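
As a sketch of the second option, the hypothetical class below (`SelfContainedKNN`, a name invented here) keeps the distance and majority-vote logic inside the object. It is a NumPy-based variant rather than a copy of the class 22 functions, and ties in the vote are broken in favor of the label encountered first among the nearest neighbors, which is an assumption of this sketch.

```python
from collections import Counter

import numpy as np


class SelfContainedKNN:
    """Hypothetical variant with the classification logic inside the class."""

    def __init__(self, n_neighbors):
        self.k = n_neighbors

    def fit(self, X_train, y_train):
        self.X_train = np.asarray(X_train, dtype=float)
        self.y_train = np.asarray(y_train)

    def _classify_point(self, test_point):
        # Euclidean distance from the test point to every training point
        distances = np.sqrt(np.sum((self.X_train - test_point) ** 2, axis=1))
        # Labels of the k closest training points
        nearest_labels = self.y_train[np.argsort(distances)[:self.k]]
        # Majority vote among those labels
        return Counter(nearest_labels.tolist()).most_common(1)[0][0]

    def predict(self, X_test_data):
        X_test_data = np.asarray(X_test_data, dtype=float)
        return np.array([self._classify_point(p) for p in X_test_data])


X_toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
y_toy = ["low", "low", "low", "high", "high", "high"]

knn = SelfContainedKNN(n_neighbors=3)
knn.fit(X_toy, y_toy)

toy_predictions = knn.predict([[0.5, 0.5], [9.5, 10.5]])
print(toy_predictions)
```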


In [37]:
class KNN:
    
    # Constructor
    def __init__(self, n_neighbors): 
        self.k = n_neighbors 

    # The fit method
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    # The predict method
    def predict(self, X_test_data):
        ...


In [38]:
# Create a KNN object and try the .predict() method 




### Special methods

"Special methods" (also known as "dunder methods") allow objects to work in consistent/predictable ways. 

Let's add a method that makes it so our KNN object displays more useful information when we call the `print()` function on it. 
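
A minimal sketch of such a method is shown here. The exact message text is an arbitrary choice made for this illustration. Note that `__str__` applies to instances such as `KNN(5)`, not to the class itself.

```python
class KNN:

    # Constructor
    def __init__(self, n_neighbors):
        self.k = n_neighbors

    # print() and str() call this method to get the text to display
    def __str__(self):
        return f"KNN classifier with k = {self.k}"


knn = KNN(5)
print(knn)  # KNN classifier with k = 5
```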


In [39]:
# What is printed when we call the print() function on our current KNN class
# (printing an instance, e.g. print(KNN(5)), shows a similarly uninformative default message)

print(KNN)

<class '__main__.KNN'>


In [40]:
class KNN:
    
    # Constructor
    def __init__(self, n_neighbors): 
        self.k = n_neighbors 

    # The fit method
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    # The predict method
    def predict(self, X_test_data):
        return classify_all_test_data(X_test_data, self.k, self.X_train, self.y_train)

    # The print "special" method
    def __str__(self):
        ...


In [41]:
# Test the print() function on a KNN object



<br>
<br>
<br>
<br>
<br>
<br>

![](https://theenglishtree.it/wp-content/uploads/2016/10/Untitled.png)