<a href="https://colab.research.google.com/github/dylanwalker/MGSC496/blob/main/MGSC496_R06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#@title Your Info

your_name = '' #@param {type:"string"}
your_email = '' #@param {type:"string"}
today_date = '' #@param {type:"date"}


# How to "read" this notebook

As you go through this notebook (or any notebook for this class), you will encounter new concepts and python code that implements them -- just like you would see in a textbook. Of course, in a textbook, it's easy to read code and an explanation of what it does and think that you understand it.
<br />
<br />

### Learn by doing
But this notebook is different from a textbook because it allows you to not just read the code, but play with it. **You can and should try out changing the code that you see**. In fact, in many places throughout this reading notebook, you will be asked to write your own code to experiment with a concept that was just covered. This is a form of "active reading" and the idea behind it is that we really learn by **doing**. 
<br />
<br />

### Change everything
But don't feel limited to only change code when I prompt you. This notebook is your learning environment and your playground. I encourage you to try changing and running all the code throughout the notebook and even to **add your own notes and new code blocks**. Adding comments to code to explain what you are testing, experimenting with or trying to do is really helpful to understand what you were thinking when you revisit it later. 
<br />
<br />
### Make this notebook your own
Make this notebook your own. Write your questions and thoughts. At the end of every reading notebook, I will ask the same set of questions to try to elicit your questions, reaction and feedback. When we review the reading notebook in class, I encourage you to   



# Code Preface

In [None]:
import sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# scikit-learn: Machine Learning in Python

<img src="https://drive.google.com/uc?id=19uME3dIsvBDOtVg5eDeIN7tmLLiRpjwx" width=300>


Scikit Learn (sklearn for short) is a machine learning library in Python that provides access to many common [ML algorithms](https://scikit-learn.org/stable/index.html). You've already seen supervised and unsupervised ML algorithms. So in this lecture, we'll focus on the basics of the sklearn APIs.  "scikit" refers to "scipy toolkit". 

The six main categories of scikit-learn algorithms are:
* Regression
* Classification
* Clustering
* Dimensionality reduction
* Model selection
* Preprocessing

Data (features and targets) is passed into sklearn algorithms typically as Pandas dataframes or numpy arrays or even python lists. 

We'll first discuss how to use sklearn with artificial data, as this simplifies the basic concepts of fitting different models to data. We will follow this with a discussion of steps, tips and tricks for using sklearn with real data.

We'll start by talking about the Estimator API.


## Estimators

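In scikit-learn, a machine learning model is called an **Estimator**.

Each **Estimator** is a Python `class` and has a form roughly like this:

```python
class Estimator:
    def __init__(self, **hyperparameters):
        # store the settings chosen before seeing any data
        self.hyperparameters = hyperparameters

    def fit(self, X, y=None):
        # do some calculations with the data X (and y, if supervised)
        return self
```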

Most commonly, the steps in using the Scikit-Learn Estimator API are as follows:

1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with desired values.
3. Arrange data into a features matrix and target vector.
4. Fit the model to your data by calling the ``fit()`` method of the model instance.
5. Apply the Model to new data:
   - For supervised learning, often we predict labels for unknown data using the ``predict()`` method.
   - For unsupervised learning, we often transform or infer properties of the data using the ``transform()`` or ``predict()`` method.
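
The five steps above can be sketched on a tiny made-up dataset. This is only an illustration: the `Ridge` estimator, the value `alpha=1.0`, and the numbers are arbitrary choices for the sketch and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge   # 1. choose a class of model

model = Ridge(alpha=1.0)                 # 2. choose hyperparameters by instantiating

# 3. arrange data: features matrix (n_samples, n_features) and target vector (n_samples,)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])

model.fit(X_demo, y_demo)                # 4. fit the model to the data

X_new = np.array([[4.0], [5.0]])
y_new = model.predict(X_new)             # 5. apply the model to new data
print(y_new.shape)                       # one prediction per new sample
```

Note that the hyperparameter goes into the constructor, while the data only appears in `fit()` and `predict()`.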




So if you want to estimate coefficients or learn patterns within data, simply do the following:

1. Initialize an estimator.
2. Fit the estimator with the data of interest.

We'll walk through some very common ML algorithms, starting with regression.

## Regression : Linear Regression - Ordinary Least Squares (OLS)

You should all be familiar with the concept of regression. In regression, we have a set of *independent* variables:

$x_1, x_2, ..., x_p$

and a *dependent* variable:

$y$

We might have independent and dependent variables in a dataframe, where the rows are the different data points, the independent variables or **features** of each data point are the columns $x_i$, and the **outcome** (also called the **target**) $y$ is a column.

Our goal is to learn a model that "predicts" the outcome $\hat{y}$. Every model has parameters or weights that can be adjusted or tuned in order to best **fit** the model to the data -- we call this procedure "training the model" or "fitting the model to the data".

For Ordinary Least Squares regression, our model is:

$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + ... + w_p x_p$

and the parameters or weights are $w_0,w_1,...,w_p$.  

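This is often written in a more compact way in vector/matrix notation (with a column of ones added to $X$, so that the intercept $w_0$ is included in the weight vector $w$):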

$\hat{y} = Xw$

$\hat{y}$ is the predicted outcome (we'll often use `y_pred` to denote this in our code). In other words, for a given set of weights, the model will predict the outcome $\hat{y}$ for a data point, while the actual outcome of that data point is $y$. They will not necessarily be the same, because our model will not fit the data perfectly. Each data point will have an error $\epsilon = y-\hat{y}$


Learning the model is all about finding the set of weights $w_i$ that minimizes the sum of squared errors over all data points:

$\sum_{data\ points}{(y-\hat{y})^2} = \|y-Xw\|^2$
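
To make the objective concrete, here is a minimal NumPy sketch that finds the least-squares weights directly. It assumes five made-up data points and adds a column of ones to the features so that the intercept $w_0$ is part of the weight vector; `np.linalg.lstsq` is used here as a stand-in solver and is not necessarily what sklearn does internally.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 2.0, 4.0, 1.0, 4.0])

# design matrix: a column of ones (for w_0) next to the feature column
X = np.column_stack([np.ones_like(x), x])

# weights that minimize the sum of squared errors
w, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ w                        # predicted outcomes
sse = np.sum((y - y_hat) ** 2)       # value of the objective at the optimum
print(w)                             # approximately [0.1, 0.7]
print(sse)                           # approximately 7.9
```

Any other choice of weights gives a larger sum of squared errors on these points.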


To accomplish this in sklearn, we'll start by importing `linear_model` and initializing it:

In [None]:
# initialize a linear model estimator
from sklearn import linear_model # import model
lm = linear_model.LinearRegression() # instantiate model (no hyperparameters here)

The `fit()` method of `LinearRegression` accepts two inputs, $X$ and $y$.

We'll typically use dataframes or 2d numpy arrays (though the shape depends on whether each feature $x$ is a single number or a vector), so we'll try to use the convention that the first dimension is the rows (or data points) and the second dimension is the columns (or features).

For example, suppose we have 3 data points, a single 1d feature $x$ for each data point, and a single 1d outcome $y$ for each data point. Then:




$X = [[x_{11}], [x_{21}],[x_{31}]]$

$y = [y_{1}, y_{2}, y_{3}]$

More generally, the shape of $X$ is `(n_samples, n_features)` and the shape of $y$ is `(n_samples,)` or, if we have multiple outcomes for each data point, `(n_samples, n_outcomes)` (here, `n_samples` is the number of data points).

Here's a simple example:


In [None]:
# A simple example of data features and outcomes
X = np.array([[1], [2], [3], [4], [5]]) # five samples or data points, each has a single 1d feature
y = np.array([0,2,4,1,4]) # five samples or data points, each has a single 1d outcome
print(f'Input features X have shape {X.shape}')
print(f'Outcomes y have shape {y.shape}')
plt.scatter(X, y);

Now, we can fit the estimator `lm` like this:

In [None]:
lm.fit(X, y)

Estimated intercept $w_0$ and coefficients $w_i$ are stored in `lm.intercept_` and `lm.coef_` respectively.

In [None]:
print(f'Estimated intercept is {lm.intercept_} and estimated coefficient is {lm.coef_[0]}')

The estimated linear model is $\hat{y} = 0.1 + 0.7x$.
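
With more than one feature, `coef_` holds one weight per column of $X$. The sketch below assumes made-up, noise-free data generated from known weights (intercept 1, coefficients 2 and -3), so the fitted values can be compared against them; the name `lm2` is only used for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_two = rng.normal(size=(50, 2))                     # 50 samples, 2 features
y_two = 1.0 + 2.0 * X_two[:, 0] - 3.0 * X_two[:, 1]  # outcome built from known weights

lm2 = LinearRegression().fit(X_two, y_two)
print(lm2.intercept_)   # close to 1.0
print(lm2.coef_)        # close to [2.0, -3.0], one entry per feature
```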

Once we have fit the model, we'll use the model's `predict()` method to get a prediction for new data.

For example, to predict a value for $x=10$, we can use `lm.predict()`:

In [None]:
lm.predict(np.array([[10]])) # be sure that the X you use with predict is a two dimensional array (hence the double square brackets)
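
If the new values start out as a flat list or a one-dimensional array, `reshape(-1, 1)` turns them into the two-dimensional shape that `predict()` expects. A minimal sketch with made-up values:

```python
import numpy as np

x_flat = np.array([10, 20, 30])   # shape (3,): one-dimensional
X_new = x_flat.reshape(-1, 1)     # shape (3, 1): three samples, one feature each
print(x_flat.shape, X_new.shape)
```

The `-1` asks NumPy to infer the number of rows, so the same line works for any number of new data points.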

Lastly, we can use `lm.predict()` to get the predicted outcome for the training data:

In [None]:
y_pred = lm.predict(X)
plt.scatter(X, y)
plt.plot(X, y_pred, color='gray', marker='o', linestyle='--');

Notice in the above that the gray line that our model learned is the one that minimizes the squared error across all data points. The gray dots are the predicted outcomes $\hat{y}$, while the blue dots are the actual outcomes $y$.


Remember the rules of scikit-learn. Initialize and fit.

## Classification : Support Vector Machine (SVM)

A [Support Vector Machine (SVM)](https://en.wikipedia.org/wiki/Support_vector_machine) is a supervised learning model that classifies data points into a set of given classes or labels.

An SVM finds a "hyperplane" (think of the higher-dimensional extension of a line dividing a two-dimensional space into two regions) that separates the data into the labelled groups with the widest possible margin. All data on one side of the hyperplane gets one label (or class), all data on the other side gets a different label (or class). Therefore we say the model "classifies" the data. 

* For $p$-dimensional vectors, a hyperplane of $(p-1)$ dimensions can separate the vectors into labels.
* For example, if each observation has two values, a hyperplane that divides observations is a line (1-dim).
* If each observation has three values, a hyperplane is a plane (2-dim).

A hyperplane that divides data points can be expressed as the set of points $\overrightarrow{x}$ that satisfy $\overrightarrow{w}\cdot\overrightarrow{x}-b = 0$, where $\overrightarrow{w}$ is a vector of weights and $b$ is an offset. Points with $\overrightarrow{w}\cdot\overrightarrow{x}-b > 0$ get one label and points with $\overrightarrow{w}\cdot\overrightarrow{x}-b < 0$ get the other. But don't worry about this mathematical description, let's see how it works:
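
As a small numerical illustration, the "which side of the hyperplane" rule can be checked in a few lines of NumPy. The sketch assumes the boundary is the set of points where $\overrightarrow{w}\cdot\overrightarrow{x}-b$ equals zero; the weight vector `w`, the offset `b`, and the three points are hypothetical values chosen by hand, not learned from data.

```python
import numpy as np

w = np.array([1.0, 1.0])   # hypothetical weight vector
b = 3.0                    # hypothetical offset

points = np.array([[0.5, 0.0], [1.5, 3.0], [3.0, 2.0]])
scores = points @ w - b              # signed score of each point
labels = (scores > 0).astype(int)    # 1 on the positive side, 0 otherwise
print(scores)                        # [-2.5  1.5  2. ]
print(labels)                        # [0 1 1]
```

Training an SVM amounts to choosing `w` and `b` from the data instead of by hand.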

### SVM Example

To illustrate this, we'll create an artificial dataset by randomly generating samples from two multivariate normal distributions. The "class" or "label" in this case is the distribution from which it is drawn (in this example, there are two different distributions that we draw points from, so there are two classes).

In [None]:
np.random.seed(1) # set the seed, so we all get the same random draws and therefore have the same data.
dat1 = np.random.multivariate_normal(mean=[1,1], cov=[[0.3, 0], [0, 0.3]], size=50)
dat2 = np.random.multivariate_normal(mean=[2,1.5], cov=[[0.3, 0], [0, 0.3]], size=50)

In [None]:
# Visualize the datapoints drawn from these two distributions
plt.scatter(dat1[:,0], dat1[:,1]);
plt.scatter(dat2[:,0], dat2[:,1]);
plt.xlabel('$x_1$');
plt.ylabel('$x_2$');

Notice each data point has a 2d feature $(x_1,x_2)$ that is plotted. The color of the datapoint refers to which class it belongs to (i.e., which distribution it was drawn from).

We'll combine these to make a single dataset:

In [None]:
X = np.concatenate((dat1, dat2)) # The first 50 rows are from the first 2d normal distribution, the next 50 rows are from the second 2d normal distribution
y = [0]*50 + [1]*50 # This is just a quick way to make the labels by repeating the elements of two lists and concatenating them together. 
y = np.array(y)

In [None]:
print(f'Input features X have shape {X.shape}')
print(f'Outcomes y have shape {y.shape}')

To proceed, we simply follow the steps, starting with:

1. Initialize an estimator

In [None]:
from sklearn import svm
clf = svm.LinearSVC() # There are different types of SVMs, here we'll use a Linear support vector machine

2. Fit the estimator to the data.

In [None]:
clf.fit(X, y)

To see what the model arrived at for the fit, we can print the coefficients:

In [None]:
print(clf.intercept_)
print(clf.coef_)

Thus, the learned decision function is $-2.87 + 1.30x_{1}+0.67x_{2}$. The classifier predicts class 1 when this value is positive and class 0 otherwise.
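
The link between the printed coefficients and the predictions can be verified directly. The sketch below assumes a separate, made-up dataset of two well-separated groups (the names `X_demo`, `y_demo` and `demo_clf` are only for this illustration) and compares a by-hand score against the model's own `decision_function()` and `predict()`.

```python
import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)
# two made-up, well-separated groups of 2d points
X_demo = np.vstack([rng.normal(0.0, 0.5, size=(30, 2)),
                    rng.normal(3.0, 0.5, size=(30, 2))])
y_demo = np.array([0] * 30 + [1] * 30)

demo_clf = svm.LinearSVC(max_iter=10000).fit(X_demo, y_demo)

# intercept + coefficient-weighted sum of the features, computed by hand
manual_scores = X_demo @ demo_clf.coef_[0] + demo_clf.intercept_[0]
manual_labels = (manual_scores > 0).astype(int)   # positive score -> class 1

print(np.allclose(manual_scores, demo_clf.decision_function(X_demo)))
print(np.array_equal(manual_labels, demo_clf.predict(X_demo)))
```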

And we can make predictions using this model for new data:

In [None]:
Xnew = np.array([[0.5, 0], [1.5, 3], [3, 2]])
ynew_pred = clf.predict(Xnew)

#### Check the results on a plot

Let's visualise the results:

In [None]:
plt.scatter(dat1[:,0], dat1[:,1], alpha=0.2);
plt.scatter(dat2[:,0], dat2[:,1], alpha=0.2);

X_tmp = np.arange(0.5, 2.5, 0.1) # generate tickpoints along the x axis, so we can use them to draw the estimated line
SVM_line = 1/clf.coef_[0][1]*(-clf.intercept_[0] - clf.coef_[0][0]*X_tmp) # get the dividing line
plt.plot(X_tmp, SVM_line, color='gray', linestyle='--', label='SVM');


plt.scatter(Xnew[:,0], Xnew[:,1], marker='s', s=100, 
            color = ['tab:blue' if pred==0 else 'tab:orange' for pred in ynew_pred], label='Predicted'); # plot the predicted data for our three new datapoints
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.legend();

In the above plot the faded blue and orange dots are the data that we used to train the SVM classifier. The dashed line is the line that the model learned that best divides the data into the two classes. The blue and orange boxes are new data (data the model never saw during training) that the model classifies. 

### Example: Let's build a SVM classifier that predicts who survived the Titanic

We'll start by looking at some data on who survived the Titanic. The full dataset is included as part of the `seaborn` package, but we'll use a version that I've reduced and cleaned up a bit:

In [None]:
titanicFile='https://raw.githubusercontent.com/dylanwalker/MGSC496/main/datasets/titanic_cleaned.csv'
titanic = pd.read_csv(titanicFile)
titanic

In [None]:
X = titanic.loc[:,'pclass':'fare'] # select all rows, and all columns starting with 'pclass' and ending with 'fare'
y = titanic.loc[:,'survived'] 

In [None]:
X.head()

In [None]:
y.head()

We'll build a model using the pclass (passenger class), sex, age, and fare (the fare they paid) to try to predict survived (whether the passenger survived).

First, we'll instantiate an SVM model. Unlike the previous example, we'll use the SVC type that provides some advanced kernels. I won't talk about these in detail, though you can read about SVM kernels [here](https://data-flair.training/blogs/svm-kernel-functions/), except to say that nonlinear kernels permit finding divisions in the data beyond a simple line dividing the space.

In [None]:
from sklearn import svm
clf = svm.SVC(gamma='auto', random_state=0) # SVC covers not only linear kernel as LinearSVC but also nonlinear kernels
clf.fit(X, y)

You might have noticed in the above that we set `random_state=0` when we instantiated the `SVC` model. You might be wondering what role random numbers play in the `SVC` model. In `SVC`, `random_state` only controls the shuffling of the data used for probability estimates, which happens when `probability=True` and relies on internal cross-validation. With the default `probability=False`, as here, it is ignored. We'll talk more about splitting data in just a minute. We set the `random_state` to a specified seed as a habit: for estimators that do involve randomness, this ensures that anyone running the code gets the same results. If we were doing this in a real-world application, we would not do this.


In order to evaluate an ML model, there are a variety of metrics that you should already be aware of (e.g., accuracy, precision, recall, F-score, etc.). We'll use accuracy, which we can import from `sklearn.metrics`. 

In [None]:
from sklearn.metrics import accuracy_score
y_pred = clf.predict(X)
print('{:.2%}\n'.format(accuracy_score(y, y_pred)))

So you might think this model performs pretty well. However, this is not a good assessment of the model, because we measured performance on the same data that the model was trained on. A model can score well on its training data simply by memorizing it, so this number tells us little about how the model will do on new data. As in all machine learning, we need to split data into training and test sets.
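
To see how misleading training accuracy can be, consider a deliberately extreme sketch: features that are pure noise and labels assigned by coin flip, so there is nothing real to learn. A decision tree is used here only as a stand-in for a very flexible model (it is not part of the Titanic example), and all data are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(300, 5))       # features that carry no information
y_noise = rng.integers(0, 2, size=300)    # labels assigned by coin flip

# first 200 rows are used for training, the remaining 100 are never seen during training
X_seen, y_seen = X_noise[:200], y_noise[:200]
X_unseen, y_unseen = X_noise[200:], y_noise[200:]

tree = DecisionTreeClassifier(random_state=0).fit(X_seen, y_seen)
train_acc = accuracy_score(y_seen, tree.predict(X_seen))
test_acc = accuracy_score(y_unseen, tree.predict(X_unseen))
print(f'accuracy on seen data: {train_acc:.2%}, on unseen data: {test_acc:.2%}')
```

The tree memorizes the rows it was trained on, so its accuracy there is perfect, while on the unseen rows it does no better than guessing.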

## Model selection : How good is our model?

Up until this point, we used the entire dataset to learn the model. However, you know that this is not the correct approach. We need to hold out some data in order to evaluate the learned model on "unseen data".

A typical step after loading data is to split it into train, validation, and test sets:

* **Training data**: a subset of data to train a model
* **Validation data**: a subset of data to pick the best hyperparameters of a model or compare across models
* **Test data**: a subset of data to evaluate the chosen learned model

We won't be tuning model hyperparameters or choosing between models right now, so we'll just split the data into a 70\% training set and 30\% as test set.  

Fortunately, `sklearn` provides an easy way to split data into train and test sets, using `train_test_split()`:

In [None]:
from sklearn.model_selection import train_test_split
# Hold out 30% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

In the above, I used `random_state=0` when splitting the data. This ensures that anyone that runs the code gets the exact same split. This is good for learning, since we can eliminate things varying due to random splitting of the data, but **when doing this "for real", we would not want to set `random_state`**.
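
A quick way to see what `train_test_split()` returns, and what `random_state` does, is to split a tiny array twice with the same seed. The arrays below are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y_demo = np.arange(10)

a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.30, random_state=0)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.30, random_state=0)

print(a_train.shape, a_test.shape)       # 7 training rows and 3 test rows
print(np.array_equal(a_test, b_test))    # same seed, so the same rows are held out
```

Dropping `random_state` from the calls would make the two splits differ from each other on most runs.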


Now we'll learn an SVM model, using only the training data:

In [None]:
from sklearn import svm
titanic_svc = svm.SVC(gamma='auto', random_state=0) # SVC covers not only linear kernel as LinearSVC but also nonlinear kernels
titanic_svc.fit(X_train, y_train)

We can then get the predicted labels for the test set using the model's `predict()` method and then test the accuracy (out of sample performance):

In [None]:
y_pred = titanic_svc.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
print('{:.2%}\n'.format(accuracy_score(y_test, y_pred)))

So the model performs **okay** (for comparison, guessing each label at random with equal probability would give about 50% accuracy).
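
When the two classes are not equally common, another useful reference point is a baseline that always predicts the most frequent class. The sketch below uses sklearn's `DummyClassifier` on made-up labels with a 60/40 split; how the Titanic data compares depends on its own class balance, which is not checked here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# made-up labels: 60% of samples in class 0, 40% in class 1
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.zeros((100, 1))   # the baseline ignores the features entirely

baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)
print(baseline_acc)           # 0.6: always guessing the majority class
```

A trained model is only interesting to the extent that it beats such a baseline on held-out data.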

<hr/>
<img src="https://drive.google.com/uc?id=1sk8CSP26YY7sfyzmHGFXncuNRujkvu9v" align="left">

<font size=3 color="darkred">Exercise: Titanic SVM - Repeat with random test/train split</font>


Repeat the above procedure, but get rid of `random_state=0` (you can do so in both the train/test split AND in the SVC instantiation).  In other words, split the data randomly into 70% training / 30% testing data. Train the SVC model on the training data. Then, evaluate its performance on the holdout test data. Put all of your code into a single code cell so that you can run it multiple times quickly. Notice how the accuracy changes each time you run it, due to the model being trained on different portions of the data.


In [None]:
# Try it out
#1. Split the titanic data randomly into 70% train / 30% test (make sure it's truly random, so do NOT include the random_state keyword argument)

#2. Initialize the SVC model and fit it (train it) on the training data

#3. Calculate the model's accuracy on the holdout testing data and print it out

# Run the above multiple times to see how accuracy of the trained model varies with random variation in data train/test split

<hr/>

## Clustering : K-means

What if we have data where no labels are given?

One class of ML models finds clusters in the data based on patterns, without using any labels. This process is called **unsupervised learning**. [K-means clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) is one popular unsupervised learning method.

Here's the basic idea: Assume that data points belong to one of $K$ different clusters. Our task is to find $K$ different centroids (each is the center of a cluster) and assign each data point to one of these $K$ clusters, such that the within-cluster variances are minimized.

Let $(x_1, x_2, ..., x_n)$ be the observations and let $\mu_i$ be the centroid of the points in cluster $C_i$, for $i = 1, ..., K$.

Then, our task is to minimize:

 $\sum_{i=1}^{K}\sum_{x\in C_{i}}\|x-\mu_i\|^2$.

This is an example of a model with a hyperparameter, because <font color=red> **we have to specify the number of clusters $K$ at the beginning**</font>
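
The objective above can be computed by hand from a fitted model and compared with the value sklearn reports in the `inertia_` attribute. This is a minimal sketch on made-up points drawn around two centers; `n_init=10` is set explicitly only to keep the behavior the same across sklearn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.5, size=(40, 2)),
                 rng.normal(3.0, 0.5, size=(40, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)

# sum over clusters of the squared distances from each point to its own centroid
objective = sum(
    np.sum((pts[km.labels_ == i] - km.cluster_centers_[i]) ** 2)
    for i in range(2)
)
print(objective, km.inertia_)   # the two numbers agree
```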

Just as before, we'll start by generating synthetic data by drawing datapoints from different distributions. We'll use the exact same data as in the SVM example.

In [None]:
# The same data that are generated in the SVM example
np.random.seed(1)
dat1 = np.random.multivariate_normal(mean=[1,1], cov=[[0.3, 0], [0, 0.3]], size=50)
dat2 = np.random.multivariate_normal(mean=[2,1.5], cov=[[0.3, 0], [0, 0.3]], size=50)
dat = np.concatenate((dat1, dat2))

Assume that we do not know the underlying clusters to which the data points belong.

In [None]:
plt.scatter(dat[:,0], dat[:,1], color='gray'); # but it is originally generated by two different distributions

We'll train a model with $K=2$ centroids. We *know* that the data is generated from two distributions in this example, but in a real-world scenario we would not know this, because it is unsupervised learning. In other words, in our training data, we don't know the label (which cluster they belong to) of the data points, so we can't supervise the model by telling it to get these labels correct.

We can specify $K=2$ by setting the keyword argument `n_clusters=2` when we instantiate the `KMeans` model:

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0) 
kmeans.fit(dat)

In [None]:
plt.figure(figsize=(10,4));
plt.subplot(1,2,1);
plt.scatter(dat1[:,0], dat1[:,1]);
plt.scatter(dat2[:,0], dat2[:,1]);
plt.title('Original data');

plt.subplot(1,2,2);
labels = kmeans.labels_
plt.scatter(dat[:,0], dat[:,1], color=['tab:orange' if x==0 else 'tab:blue' for x in labels]);
plt.title('K-means clustering: K=2');

You can see that K-means clustering did a fairly good job at detecting the clusters that we created synthetically. Of course, since this is unsupervised learning, we would not know the "true" clusters or classes that the data belonged to -- so we can't assess the performance directly.


We would also have no way of knowing whether the data is best characterized by two clusters or more. So let's have a look at other values of $K$

<hr/>
<img src="https://drive.google.com/uc?id=1sk8CSP26YY7sfyzmHGFXncuNRujkvu9v" align="left">

<font size=3 color="darkred">Exercise: Train a Kmeans model with $K=4$ on the same data</font>

Repeat the above steps, but modify the model so that we are using $K=4$. We know that the data for this example were created from 2 clusters (not 4), but in a real-world example, we would not know this. That means we would have to experiment with different values of $K$ and try to find the model that fits our data the best. One way to do this would be to hold out even more data from training, creating a validation subset of the data. We could then train the model on the training data for different values of $K$, use the validation subset to pick the model (the value of $K$) that performs the best, and then finally evaluate the performance of that model on the test data.

For now, just train the model with $K=4$



In [None]:
# Try it out, as before call your model kmeans

You can use the below code to plot the original data and the fitted data

In [None]:
plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
plt.scatter(dat1[:,0], dat1[:,1]);
plt.scatter(dat2[:,0], dat2[:,1]);
plt.title('Original data');

plt.subplot(1,2,2);
labels = kmeans.labels_
cmap = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red']
plt.scatter(dat[:,0], dat[:,1], color=[cmap[x] for x in labels]);
plt.title('K-means clustering: K=4');

<hr/>

### Example: Unsupervised clustering applied to wine data

In this data, chemical compositions and types of wines are given. There are three types of wines (i.e., coming from different winemakers). We will investigate the extent to which k-means clustering can recover the wine types.

Note that the cluster numbers k-means assigns are arbitrary and bear no relationship to the numbering of the true wine labels (on one run, a particular type of wine might be represented by the cluster labeled 1, but on another run the same approximate cluster might be assigned the label 2). Because of this, we'll need a metric that does not depend on the actual values of the labels. A good metric is the [Mutual Information Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html), which is defined by:

 $MI(U,V)=\sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i\cap V_j|}{N}\log\frac{N|U_i \cap V_j|}{|U_i||V_j|}$

where $|U_i|$ is the number of samples in cluster $U_i$, $|V_j|$ is the number of samples in cluster $V_j$, and $N$ is the total number of samples. We'll use the predicted labels and true labels as $U$ and $V$.
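
The formula can be implemented in a few lines and checked against `mutual_info_score`. The sketch below uses made-up label lists; the helper name `mutual_information` is hypothetical and exists only for this illustration. The last line renames the clusters to show that the score does not depend on which number each cluster happens to get.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mutual_information(u, v):
    """Mutual information (natural log) between two labelings of the same samples."""
    u, v = np.asarray(u), np.asarray(v)
    n = len(u)
    mi = 0.0
    for a in np.unique(u):
        for b in np.unique(v):
            n_ab = np.sum((u == a) & (v == b))   # size of the overlap of U_i and V_j
            if n_ab > 0:                         # empty overlaps contribute nothing
                mi += (n_ab / n) * np.log(n * n_ab / (np.sum(u == a) * np.sum(v == b)))
    return mi

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
cluster_ids = [1, 1, 0, 0, 0, 0, 2, 2, 2]
renamed_ids = [2, 2, 1, 1, 1, 1, 0, 0, 0]   # same grouping, different cluster numbers

print(mutual_information(true_labels, cluster_ids))   # by hand
print(mutual_info_score(true_labels, cluster_ids))    # sklearn, same value
print(mutual_info_score(true_labels, renamed_ids))    # unchanged after renaming
```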

In [None]:
from sklearn.datasets import load_wine
from sklearn.metrics import mutual_info_score
wine = load_wine() 
X = pd.DataFrame(wine.data, columns = wine.feature_names)
y = wine.target # there are three types

In [None]:
X.head()

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

In [None]:
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X_train)

In [None]:
y_pred = kmeans.predict(X_test)
print('{:.3f}\n'.format(mutual_info_score(y_test, y_pred))) # mutual information is measured in nats, so print it as a number rather than a percentage

Not bad, but there is room for improvement. How?

## Preprocessing : Standardization

Often you will encounter data where the features have very different distributions (with very different mean and variance for each column). In such a case, you can benefit from **standardization** -- a process of transforming the features so that each column has zero mean and unit variance. Some ML algorithms will suffer if the data on which they are trained is not standardized. To standardize data, we will use the [`sklearn.preprocessing`](https://scikit-learn.org/stable/modules/preprocessing.html) package.

`StandardScaler` calculates the mean and standard deviation of each feature in the training set.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() # instantiate the scaler (nothing is computed yet)
scaler.fit(X_train) # calculate the mean and standard deviation of each column of the training set

Let's print the means of the different features (columns):

In [None]:
scaler.mean_

`scaler.scale_` returns the standard deviations for each feature:

In [None]:
scaler.scale_

To standardize the data, we simply call `scaler.transform` on the original values to get the transformed data. The transformed training data will have zero mean and unit variance in each column.

In [None]:
X_train_scaled = scaler.transform(X_train) # the same fitted scaler can also be applied to the test set
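
What the scaler computes can be reproduced by hand: subtract each column's mean and divide by each column's standard deviation. A minimal sketch on a made-up 3-by-2 array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 500.0]])

# column-wise standardization; np.std defaults to the population standard deviation
by_hand = (data - data.mean(axis=0)) / data.std(axis=0)
by_sklearn = StandardScaler().fit_transform(data)

print(np.allclose(by_hand, by_sklearn))   # True
```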

Let's check the means of the transformed data.

In [None]:
np.mean(X_train_scaled, axis=0)

Compared to the means of the original data, those of the standardized data are essentially zero. Now, train a k-means clustering model with the standardized data. 

In [None]:
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X_train_scaled)

In [None]:
X_test_scaled = scaler.transform(X_test)
y_pred = kmeans.predict(X_test_scaled)
print('{:.3f}\n'.format(mutual_info_score(y_test, y_pred))) # mutual information is measured in nats, so print it as a number rather than a percentage

You can see that standardizing the data has a pretty substantial impact on the performance.
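
One way to see why scaling matters for k-means is to look at how much each feature contributes to the squared distance between two samples. The two samples below are made up, with one feature on a scale of about 1 and the other on a scale of about 1000.

```python
import numpy as np

# two made-up samples: feature 0 is on a scale of ~1, feature 1 on a scale of ~1000
a = np.array([1.0, 1000.0])
b = np.array([3.0, 1100.0])

squared_diffs = (a - b) ** 2                 # contribution of each feature
share = squared_diffs / squared_diffs.sum()  # fraction of the squared distance
print(share)                                 # almost all of it comes from feature 1
```

Since k-means assigns points to the nearest centroid, an unscaled large-valued feature can decide the clustering almost on its own.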

### Pipelines : chaining pre-processors and estimators

We can do all procedures (standardize data and learn a model) at once by taking advantage of the [`sklearn.pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) package!

The below code block shows how to combine the different operations, scaling with the `StandardScaler` and running the `KMeans` algorithm.

In [None]:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, random_state=0)
)

To run this pipeline, following the rules of sklearn, use the `.fit` method (`KMeans` is unsupervised, so the labels passed as the second argument are ignored).
```python
pipe.fit(X_train, Y_train)
```

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine() 
X = pd.DataFrame(wine.data, columns = wine.feature_names)
Y = wine.target # there are three types
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=0)

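pipe.fit(X_train, Y_train) # fits the scaler and then KMeans; the labels are ignored by KMeans

Y_pred = pipe.predict(X_test) # the pipeline scales X_test with the fitted scaler, then assigns clusters
print('{:.3f}\n'.format(mutual_info_score(Y_test, Y_pred))) # mutual information is in nats, not a percentage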

And of course we get the same result as before.  Making a pipeline is useful as frequently you'll want to do several transformations of your data as part of model fitting.
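
To check that a pipeline really is just the two steps chained together, the sketch below fits both versions on the wine data bundled with sklearn and compares their cluster assignments on the test set. The variable names are chosen only for this illustration, and `n_init=10` is set explicitly to keep the behavior the same across sklearn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_w, y_w = load_wine(return_X_y=True)
Xw_train, Xw_test, _, _ = train_test_split(X_w, y_w, test_size=0.30, random_state=0)

# pipeline: scaling and clustering in one object
demo_pipe = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=0))
demo_pipe.fit(Xw_train)

# the same two steps done by hand
demo_scaler = StandardScaler().fit(Xw_train)
demo_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(demo_scaler.transform(Xw_train))

same = np.array_equal(demo_pipe.predict(Xw_test),
                      demo_kmeans.predict(demo_scaler.transform(Xw_test)))
print(same)
print(list(demo_pipe.named_steps))   # step names generated by make_pipeline
```

The individual fitted steps remain accessible by name, for example `demo_pipe.named_steps['kmeans']`.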

## Dimensionality reduction

Dimensionality reduction offers [several advantages](https://en.wikipedia.org/wiki/Dimensionality_reduction):
* It reduces the time and storage space required.
* It removes multi-collinearity, which improves the interpretation of the parameters of the machine learning model.
* It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D.
* It avoids the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality).

For example, we can reduce 13 dimensions in the wine data into 2 dimensions. This process helps to understand and visualize complicated data. In this section, we will cover a popular dimensionality reduction method: **principal component analysis (PCA)**.

#### What is PCA?

[PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) finds linearly uncorrelated variables by combining existing correlated variables. Let's explore the concept with the wine data. In the wine data, `alcohol` and `color_intensity` are correlated.

In [None]:
from sklearn.datasets import load_wine
wine = load_wine() 
X = pd.DataFrame(wine.data, columns = wine.feature_names)
y = wine.target # there are three types

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
X_train_scaled = scaler.transform(X_train) # reuse the scaler that was fit earlier on the training set

In [None]:
plt.figure(figsize=(6,6))
plt.scatter(X_train_scaled[:,0], X_train_scaled[:,-4])
plt.xlabel("Alcohol (standardized)")
plt.ylabel("Color intensity (standardized)")
print("Correlation between alcohol and color intensity is", round(X.alcohol.corr(X.color_intensity), 2))

As the two variables are correlated, a significant amount of their combined variance can be captured by a single new variable. PCA constructs this new variable by combining the correlated ones. The new variable is represented as the arrow on the plot below.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=0) # here: 2 standardized features in, 2 principal components out
pca.fit(X_train_scaled[:, (0,-4)]) # find principal components

In [None]:
plt.figure(figsize=(6,6));
plt.scatter(X_train_scaled[:,0], X_train_scaled[:,-4], alpha=0.3);
plt.xlabel("Alcohol (standardized)");
plt.ylabel("Color intensity (standardized)");
# each row of pca.components_ is one principal component, so components_[0] is the first one
plt.annotate("", [0,0], -3*pca.explained_variance_ratio_[0]*pca.components_[0], 
             arrowprops=dict(arrowstyle='<-', linewidth=3, color='red'));

In this way, PCA finds a given number of components (`n_components`) that are uncorrelated and explain variance well.

`pca.explained_variance_ratio_` summarizes how much of the variance each component explains.

In [None]:
pca.explained_variance_ratio_

This shows that the first principal component (red arrow) explains about 78% of the total variance.
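
Under the hood, the explained variance ratios are the eigenvalues of the data's covariance matrix divided by their sum. The sketch below checks this on two made-up correlated features and also confirms that the projected components are uncorrelated; it illustrates the idea and is not a description of sklearn's internal algorithm.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
f1 = rng.normal(size=200)
f2 = 0.8 * f1 + 0.6 * rng.normal(size=200)   # made-up feature correlated with f1
data = np.column_stack([f1, f2])

# eigenvalues of the covariance matrix, largest first
eigvals = np.linalg.eigh(np.cov(data, rowvar=False))[0][::-1]
ratio_by_hand = eigvals / eigvals.sum()

demo_pca = PCA(n_components=2).fit(data)
projected = demo_pca.transform(data)          # data expressed in principal components

print(ratio_by_hand, demo_pca.explained_variance_ratio_)   # the two agree
print(np.corrcoef(projected, rowvar=False)[0, 1])          # essentially zero
```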

Let's apply PCA for the entire wine data.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=0) # 13 dimensions to 2 dimensions
pca.fit(X_train_scaled) # find principal components

In [None]:
pca.explained_variance_ratio_

About 37\% of the variance is explained by the first principal component and about 19\% of the variance is explained by the second principal component. This means that the first two components capture more than half of all the variance of the data. 

So, projecting the wine data onto the first two principal components can give a good overview of the data.

In [None]:
pca_transformed = pca.transform(X_train_scaled) # project the data onto principal components

In [None]:
cmap = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red']
plt.scatter(pca_transformed[:,0], pca_transformed[:,1], color = [cmap[x] for x in y_train]);

The three wine types are separated well by the first two principal components.

# Preprocessing Real-World Data for sklearn models

## Transforming categorical data to dummy variables

Almost all models in sklearn rely on numerical data (as we have seen with the artificial and real-world examples we have looked at so far). But often we have dataframes that contain categorical data. How can we pre-process categorical data so that we can train sklearn models on it?

Consider the following dataset from a bank that has granted some applicants loans:

In [None]:
loan_df = pd.read_csv('https://raw.githubusercontent.com/dylanwalker/MGSC496/main/datasets/loan_data_set.csv')
loan_df.dropna(inplace=True) # We'll drop rows that have missing values
loan_df.drop(columns=['Loan_ID'],inplace=True) # this is a unique identifier and we wouldn't want to let our model train on it, since this is a useless feature for test data 
loan_df.head()

Some of the columns are numerical, such as `ApplicantIncome` and `Loan_Amount_Term`, while other columns are categorical, such as `Gender` and `Property_Area`. We saw before (in the pandas lecture), that we can use some tricks to convert categorical data to numerical data.

For example, we could create a column called `Gender_num` by doing:
```python
loan_df['Gender_num'] = (loan_df.Gender=='Male')*1.0
```
This would create a numerical column where the value is 1 for rows where `Gender` is `Male` and 0 otherwise. However, this trick would not work if the column contained more than two values. Another problem with this approach is that the model we train with the data would not know that the only possible values for this column are 1 and 0. A better approach is to create multiple columns from a single categorical column and make a dummy variable for each of the values it could take. We can do this using the function [`pd.get_dummies()`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) on the categorical column.

For example:

In [None]:
pd.get_dummies(loan_df.Property_Area)

You can see that this created 3 dummy variable columns, one for each of the 3 possible values of `Property_Area` (`Rural`, `Semiurban`, `Urban`).

In practice, we would want to hold out one of these categories: once we know the values for `Semiurban` and `Urban`, we also know the value for the third category. We can do this using the keyword argument `drop_first=True`:

In [None]:
pd.get_dummies(loan_df.Property_Area, drop_first=True)
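
To see why nothing is lost by dropping the first category, here is a small sketch on a made-up series (not the loan data). It rebuilds the dropped `Rural` column from the two remaining ones. The `dtype=int` argument is used so the dummies are 0/1 integers, since newer pandas versions return booleans by default.

```python
import pandas as pd

# Made-up values, using the same three categories as Property_Area
area = pd.Series(['Urban', 'Rural', 'Semiurban', 'Urban', 'Rural'], name='Property_Area')

full = pd.get_dummies(area, dtype=int)                      # Rural, Semiurban, Urban
reduced = pd.get_dummies(area, drop_first=True, dtype=int)  # Semiurban, Urban

# A row is Rural exactly when it is neither Semiurban nor Urban
rebuilt_rural = 1 - reduced.sum(axis=1)

print(reduced.assign(Rural_rebuilt=rebuilt_rural))
print((rebuilt_rural == full['Rural']).all())
```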

It is even possible to apply `pd.get_dummies()` to multiple columns in a dataframe at once. For example: 

In [None]:
pd.get_dummies(loan_df.loc[:,['Gender','Property_Area', 'Education']], drop_first=True)

In fact, `pd.get_dummies()` will ignore columns that are numerical, so we could apply it to the entire dataframe in this case.
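
A minimal sketch of this behavior on a made-up dataframe (`toy_df` and its columns are hypothetical, not the loan data): the numerical column passes through unchanged, while each text column is replaced by its dummy columns.

```python
import pandas as pd

# Hypothetical toy dataframe with one numerical and two categorical columns
toy_df = pd.DataFrame({
    'Income': [5000, 3200, 4100],
    'Gender': ['Male', 'Female', 'Male'],
    'Area': ['Urban', 'Rural', 'Urban'],
})

toy_dummies = pd.get_dummies(toy_df, drop_first=True)

print(toy_dummies.columns.tolist())  # ['Income', 'Gender_Male', 'Area_Urban']
print(toy_dummies)
```

Notice that the new columns are named `<original column>_<value>`, which is how you can find the dummy that corresponds to a particular category.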

<hr/>
<img src="https://drive.google.com/uc?id=1sk8CSP26YY7sfyzmHGFXncuNRujkvu9v" align="left">

<font size=3 color="darkred">Exercise: Use `pd.get_dummies()`, make `X` and `y` from `loan_df`, and train/test split the data </font>

Let's complete all the steps necessary to go from a starting dataframe to data that we will use to train a model in sklearn.

<br />

Do the following:

1. Apply `pd.get_dummies()` to the whole `loan_df` dataframe with `drop_first=True` and look at the results.
2. Using the output from the last step, create a variable called `X` that contains all the feature columns (everything except the one that captures whether the loan was granted, `Loan_Status_Y`).
3. Create a variable called `y` that captures the outcome variable (`Loan_Status_Y`).
4. Use `train_test_split()` on `X` and `y` to get `Xtrain`, `Xtest`, `ytrain`, `ytest` 





In [None]:
# Try it out

# 1. Use pd.get_dummies() with drop_first=True to transform the dataframe

# 2. Create the features variable X

# 3. Create the outcome variable y

# 4. Use train_test_split to get training and test data for X and y


Look over what you created. Does it make sense? Do you understand what you're doing?

<hr/>

## Transforming numerical data to categorical data

Sometimes we have numerical columns and we want to transform them to categorical ones (which we may then transform to dummies).

To see how numerical data is distributed, we can use the dataframe methods:
* [`df.col.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.describe.html) - shows statistics of a column including mean, std, min, and percentile values
* [`df.col.value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) - shows how many rows take each distinct value

Once we understand this, we can bin numerical data into meaningful categories using the pandas functions:
* [`pd.cut()`](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) - cut the data into bins, based on the edges of the bins that you specify
* [`pd.qcut()`](https://pandas.pydata.org/docs/reference/api/pandas.qcut.html) - cut the data into bins, based on the percentile values that you specify 
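
The difference between the two functions is easiest to see on data with an outlier. The sketch below uses a made-up series (`values`) and illustrative labels: `pd.cut()` uses the edges exactly as given, so bins can end up very uneven or even empty, while `pd.qcut()` picks the edges from the data so that each bin holds the requested share of rows.

```python
import pandas as pd

# Made-up values with one large outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# pd.cut: we choose the bin edges ourselves
by_edges = pd.cut(values, bins=[0, 10, 50, 100], labels=['low', 'mid', 'high'])

# pd.qcut: we choose the quantiles, and pandas finds the edges
by_quantiles = pd.qcut(values, q=[0, 0.5, 1.0], labels=['bottom half', 'top half'])

print(by_edges.value_counts(sort=False))      # low: 9, mid: 0, high: 1
print(by_quantiles.value_counts(sort=False))  # bottom half: 5, top half: 5
```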

For example, let's look at the column `ApplicantIncome`:

In [None]:
loan_df.ApplicantIncome.describe()

Since the minimum (0%), 25%, 75%, and maximum (100%) values of `ApplicantIncome` are all different (that might not always be the case), we can create bins whose edges are these values and bin the rows into categories:

In [None]:
loan_df.ApplicantIncome

In [None]:
pd.qcut(loan_df.ApplicantIncome, q=[0,0.25,0.75,1.0], labels=['low','medium','high']) # The low label will be assigned to rows whose values fall between the 0% and 25% value of ApplicantIncome, etc.

Now let's have a look at the column `Loan_Amount_Term`:

In [None]:
loan_df.Loan_Amount_Term.describe()

It seems like most of the loans have a term of 360 months, and the percentile values are the same for 25%, 50%, and 75%. So binning the data by percentiles doesn't really make sense for this variable. Let's have a look at how the distinct values of `Loan_Amount_Term` are distributed:

In [None]:
loan_df.Loan_Amount_Term.value_counts()

The most popular loan term is 360 months. We might say that such loans have `normal` terms, while loans with shorter terms are categorized as `short`, and loans with longer terms are categorized as `long`. We can accomplish this categorization with `pd.cut()`:

In [None]:
loan_term_cat = pd.cut(loan_df.Loan_Amount_Term, bins=[0,359,361,480], labels=['short','normal','long'])
loan_term_cat

We can then see how many loans fall into each of the categories we specified:

In [None]:
loan_term_cat.value_counts()

As you can see, choosing the right categorization requires thinking about the context of the data and making judgment calls. 
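
Here is a sketch of what goes wrong if you try percentile bins on a column dominated by a single value. The series `terms` is made up to mimic `Loan_Amount_Term` (mostly 360); it is not the real data. Because the 25% and 75% values coincide, `pd.qcut()` ends up with duplicate bin edges and raises a `ValueError`, while `pd.cut()` with hand-picked edges works fine.

```python
import pandas as pd

# Made-up loan terms: most are 360 months, like the real column
terms = pd.Series([360] * 8 + [180, 480])

try:
    pd.qcut(terms, q=[0, 0.25, 0.75, 1.0], labels=['low', 'medium', 'high'])
    qcut_failed = False
except ValueError as err:
    qcut_failed = True
    print("qcut failed:", err)

# Hand-picked edges, same idea as the pd.cut() call above
by_edges = pd.cut(terms, bins=[0, 359, 361, 480], labels=['short', 'normal', 'long'])
print(by_edges.value_counts(sort=False))
```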

You might be wondering:
>**Why should I change numerical columns to categorical ones, just to change them into dummy variables? Why not just use the numerical columns we already have?**

Here are two reasons:
1. Changing a numerical variable to a categorical one reduces variability that might not be meaningful. This can yield results which are easier to interpret.
2. Some models in sklearn expect categorical features rather than continuous numerical columns (see, for example, [Categorical Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html#categorical-naive-bayes)).
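
As a rough sketch of the second reason, the code below bins a made-up numerical column with `pd.cut()` and feeds the result to `CategoricalNB`. The data (`terms`, `granted`) is invented for illustration, and using `.cat.codes` to turn the categories into integer codes (0, 1, 2) is one possible encoding choice, since `CategoricalNB` expects each feature as non-negative integer category codes.

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB

# Made-up data: loan terms in months and whether the loan was granted (1) or not (0)
terms = pd.Series([120, 180, 360, 360, 360, 480, 360, 240])
granted = [0, 0, 1, 1, 1, 0, 1, 0]

# Bin the numerical column into categories, then encode them as integer codes
term_cat = pd.cut(terms, bins=[0, 359, 361, 480], labels=['short', 'normal', 'long'])
X_toy = term_cat.cat.codes.to_frame(name='term_code')  # short=0, normal=1, long=2

model = CategoricalNB().fit(X_toy, granted)
predictions = model.predict(X_toy)
print(predictions)
```

With only a handful of rows this is just a demonstration of the mechanics, not a meaningful model.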

# A ton of models

Sklearn has a ton of models that you can use. Covering them all is well outside the scope of this class, though you are likely familiar with many of them from your prior coursework. In lecture, I will talk about:
* K nearest neighbor
* Decision trees
* Naive Bayes


# Feedback
What did you think about this notebook? What questions do you have? Were any parts confusing? Write your thoughts in the text box below.

<font size =2> note: You can double click this text box in colab to edit it.</font>

PUT YOUR THOUGHTS HERE

# Submit
Don't forget to submit your notebook before class! Make sure you have saved your work (**Colab Menu: File-> Save**) and then download a pure python copy (**Colab Menu: File-> Download -> Download .py**) and a python notebook copy (**Colab Menu: File-> Download -> Download .ipynb**). You will upload both of these to the assignment on the Canvas page.
