<a href="https://colab.research.google.com/github/axel-sirota/ml_ad_ai_course/blob/main/Classical%20ML/5_KNN_and_Random_Forest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  K-Nearest Neighbors and Random Forest

© Data Trainers LLC. GPL v 3.0.

Author: Axel Sirota

<img src="https://www.dropbox.com/scl/fi/esem47f0n5rc37bxrrogw/knn.png?rlkey=c30sk3cn5d7tw65xl2qmha6ag&raw=1"  align="center"/>

<a id="learning-objectives"></a>
## Learning Objectives

1. Utilize the KNN model on the iris data set.
2. Implement scikit-learn's KNN model.
3. Assess the fit of a KNN Model using scikit-learn.

In this lesson, we will get an intuitive and practical feel for the **k-Nearest Neighbors** model. kNN is a **non-parametric model**. So, the model is not represented as an equation with parameters (e.g. the $\beta$ values in linear regression).

First, we will make a model by hand to classify iris flower data. Next, we will build a model automatically using kNN.

> You may have heard of the clustering algorithm **k-Means Clustering**. These techniques have nothing in common, aside from both having a parameter k!

<img src="https://www.dropbox.com/scl/fi/tkh2nmcaitcott0h41024/iris.jpeg?rlkey=623t4kmn606z74fwtqitlzbg6&raw=1"  align="center"/>

<a id="overview-of-the-iris-dataset"></a>
## Loading the Iris Data Set
---

#### Read the iris data into a pandas DataFrame, including column names.

In [None]:
%%writefile get_data.sh
mkdir -p data
# Quote each URL so the shell does not treat '&' as the background operator.
if [ ! -f data/cell_phone_churn.csv ]; then
  wget -O data/cell_phone_churn.csv "https://www.dropbox.com/scl/fi/qutq3sa7dge9vx133to6o/cell_phone_churn.csv?rlkey=1jpwo0ork58254lzxxy0qz4kf&dl=0"
fi
if [ ! -f data/churn_missing.csv ]; then
  wget -O data/churn_missing.csv "https://www.dropbox.com/scl/fi/rab18zeo6bq58fz1tadwc/churn_missing.csv?rlkey=32tcp05gaj8rgnpc76vh2dbca&dl=0"
fi
if [ ! -f data/iris.data ]; then
  wget -O data/iris.data "https://www.dropbox.com/scl/fi/0vpbcxsiesofpknnkz1mo/iris.data?rlkey=8lz6biaoccef8ggvpx4kebrbm&dl=0"
fi
if [ ! -f data/NBA_players_2015.csv ]; then
  wget -O data/NBA_players_2015.csv "https://www.dropbox.com/scl/fi/0jgo8u5lbphvwwl2btq1w/NBA_players_2015.csv?rlkey=q86m5lp3ycndh5jbegvjewwzu&dl=0"
fi
if [ ! -f data/NHL_Data_GA.csv ]; then
  wget -O data/NHL_Data_GA.csv "https://www.dropbox.com/scl/fi/lf41hb2tfe212dfqof9w8/NHL_Data_GA.csv?rlkey=jzgi8133t53wk6ybmjay1duig&dl=0"
fi



In [None]:
!bash get_data.sh

In [None]:
# Read the iris data into a DataFrame.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Display plots in-notebook
%matplotlib inline

# Increase default figure and font sizes for easier viewing.
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

data = 'data/iris.data'
iris = pd.read_csv(data)

In [None]:
iris.head()

In [None]:
iris.species.unique()

In [None]:
iris[iris['species']=='Iris-versicolor']

<a id="terminology"></a>
### Terminology

- **150 observations** (n=150): Each observation is one iris flower.
- **Four features** (p=4): sepal length, sepal width, petal length, and petal width.
- **Response**: One of three possible iris species (setosa, versicolor, or virginica)
- **Classification problem** because response is categorical.

In [None]:
iris.head(2)

<a id="exercise-human-learning-with-iris-data"></a>
## Guided Practice: "Human Learning" With Iris Data

**Question:** Can we predict the species of an iris using petal and sepal measurements? Together, we will:

1. Read the iris data into a Pandas DataFrame, including column names.
2. Gather some basic information about the data.
3. Use sorting, split-apply-combine, and/or visualization to look for differences between species.
4. Write down a set of rules that could be used to predict species based on iris measurements.

**BONUS:** Define a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data and check the accuracy of your predictions.

#### Gather some basic information about the data.

In [None]:
# 150 observations, 5 columns (the 4 features & response)
iris.shape

In [None]:
iris.dtypes

In [None]:
# Verify the basic stats look appropriate
iris.describe()

In [None]:
# Test for imbalanced classes
iris.species.value_counts()

In [None]:
# Verify we are not missing any data
iris.isnull().sum()

#### Use sorting, split-apply-combine, and/or visualization to look for differences between species.

In [None]:
iris.head()

In [None]:
# Sort the DataFrame by petal_width.
iris.sort_values(by='petal_width', ascending=True, inplace=True)
iris.head()

In [None]:
# Sort the DataFrame by petal_width and display the NumPy array.
iris.sort_values(by='petal_width', ascending=True).values[0:5]

#### Split-apply-combine: Explore the data while using a `groupby` on `'species'`.

In [None]:
# Mean of sepal_length, grouped by species.
iris.groupby('species').sepal_length.mean()

In [None]:
# Mean of all numeric columns, grouped by species.
iris.groupby('species').mean()

In [None]:
# describe() of all numeric columns, grouped by species.
iris.groupby('species').describe()

In [None]:
# Box plot of petal_width, grouped by species.
iris.boxplot(column='petal_width', by='species');

In [None]:
# Box plot of all numeric columns, grouped by species.
iris.boxplot(by='species', rot=45);

In [None]:
# Map species to a numeric value so that plots can be colored by species.
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})

# Alternative method:
iris['species_num'] = iris.species.factorize()[0]

In [None]:
iris

In [None]:
# Scatterplot of petal_length vs. petal_width, colored by species
iris.plot(kind='scatter', x='petal_length', y='petal_width', c='species_num', colormap='brg');


In [None]:
# Scatter matrix of all features, colored by species.
pd.plotting.scatter_matrix(iris.drop('species_num', axis=1), c=iris.species_num, figsize=(12, 10));

#### Class Exercise: Using the graphs above, can you write down a set of rules that can accurately predict species based on iris measurements?

In [None]:
# Feel free to do more analysis if needed to make good rules!

In [None]:
iris.head()

#### Bonus: Try to implement these rules to make your own classifier!

Write a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data and check the accuracy of your predictions.

In [None]:
def predict_flower(df):
    pass


predict_flower(iris)
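
One possible way to fill in the blank is a pair of threshold rules on the petal measurements. The sketch below is only an illustration: the function name `predict_species` and the cut-offs 2.5 and 1.75 are assumptions read off plots like the ones above, not values given in this lesson, so treat them as a starting point for your own rules.

```python
def predict_species(row):
    """Return a species label from petal measurements using two hand-picked thresholds."""
    # Assumed rule 1: setosa has much shorter petals than the other two species.
    if row["petal_length"] < 2.5:
        return "Iris-setosa"
    # Assumed rule 2: among the remaining flowers, narrower petals suggest versicolor.
    if row["petal_width"] < 1.75:
        return "Iris-versicolor"
    return "Iris-virginica"

# Three illustrative flowers, one per species.
samples = [
    {"petal_length": 1.4, "petal_width": 0.2},
    {"petal_length": 4.7, "petal_width": 1.4},
    {"petal_length": 6.0, "petal_width": 2.5},
]
print([predict_species(s) for s in samples])
```

Because a pandas row supports the same `row["column"]` lookup, the function can be applied with `iris['prediction'] = iris.apply(predict_species, axis=1)`, which creates the `prediction` column used in the cells that follow.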

In [None]:
iris.head()

In [None]:
# Let's see what percentage your manual classifier gets correct!
# 0.3333 means 1/3 are classified correctly

(iris.species == iris.prediction).mean()

<a id="human-learning-on-the-iris-dataset"></a>
## Human Learning on the Iris Data Set
---

How did we (as humans) predict the species of an iris?

1. We observed that the different species had (somewhat) dissimilar measurements.
2. We focused on features that seemed to correlate with the response.
3. We created a set of rules (using those features) to predict the species of an unknown iris.

We assumed that if an **unknown iris** had measurements similar to **previous irises**, then its species was most likely the same as those previous irises.

In [None]:
# Allow plots to appear in the notebook.
%matplotlib inline
import matplotlib.pyplot as plt

# Increase default figure and font sizes for easier viewing.
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['font.size'] = 14

# Create a custom color map.
from matplotlib.colors import ListedColormap
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

In [None]:
# Map each iris species to a number.
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})

In [None]:
# Box plot of all numeric columns, grouped by species.
iris.drop('species_num', axis=1).boxplot(by='species', rot=45);

In [None]:
# Create a scatterplot of PETAL LENGTH versus PETAL WIDTH and color by SPECIES.
iris.plot(kind='scatter', x='petal_length', y='petal_width', c='species_num', colormap=cmap_bold);

In [None]:
iris['pred_num'] = iris.prediction.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})



# Create a scatter plot of PETAL LENGTH versus PETAL WIDTH and color by PREDICTION.
iris.plot(kind='scatter', x='petal_length', y='petal_width', c='pred_num', colormap=cmap_bold);

---

<a id="k-nearest-neighbors-knn-classification"></a>
## K-Nearest Neighbors (KNN) Classification
---

K-nearest neighbors classification is (as its name implies) a classification model that uses the "K" most similar observations in order to make a prediction.

KNN is a supervised learning method; therefore, the training data must have known target values.

The process of prediction using KNN is fairly straightforward:

1. Pick a value for K.
2. Search for the K observations in the data that are "nearest" to the measurements of the unknown iris.
    - Euclidean distance is often used as the distance metric, but other metrics are allowed.
3. Use the most popular response value from the K "nearest neighbors" as the predicted response value for the unknown iris.
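
The three steps above can be sketched in plain Python. This is a minimal from-scratch illustration, not scikit-learn's implementation; the function name `knn_predict` and the toy measurements are made up for the example, and ties are broken by whichever label was counted first.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Step 2: compute the Euclidean distance from the query to every training point.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the k closest (distance, label) pairs.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: the most common label among the neighbors is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (petal_length, petal_width) measurements, for illustration only.
train_X = [(1.4, 0.2), (1.5, 0.2), (1.3, 0.3), (4.7, 1.4), (4.5, 1.5), (4.9, 1.5)]
train_y = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

print(knn_predict(train_X, train_y, query=(1.6, 0.25), k=3))
print(knn_predict(train_X, train_y, query=(4.6, 1.45), k=3))
```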

The visualizations below show how a given area can change in its prediction as K changes.

- Colored points represent true values and colored areas represent a **prediction space**. (For K=1, these regions are built from the cells of a Voronoi diagram of the training points.)
- Each prediction space is the region where the majority of the K nearest points have the color of the space.
- To predict the class of a new point, we guess the class corresponding to the color of the space it lies in.

<a id="knn-classification-map-for-iris-k"></a>
### KNN Classification Map for Iris (K=1)

![1NN classification map](https://www.dropbox.com/scl/fi/naecsufoh5sqnow7lqhgg/iris_01nn_map.png?rlkey=zn4v66l3rnzfhlgzzd48u85sc&raw=1)

### KNN Classification Map for Iris (K=5)

![5NN classification map](https://www.dropbox.com/scl/fi/65fyukmy6l2hj23yaosr2/iris_05nn_map.png?rlkey=hdfwixb6ox7kx9v1vku16jk9d&raw=1)

### KNN Classification Map for Iris (K=15)

![15NN classification map](https://www.dropbox.com/scl/fi/33zv70b54knrepc0avexh/iris_15nn_map.png?rlkey=n0wb383pg2md4hc7gk1n8j26d&raw=1)

<a id="knn-classification-map-for-iris-k"></a>
### KNN Classification Map for Iris (K=50)

![50NN classification map](https://www.dropbox.com/scl/fi/wfaobmeanl0xzt7hq4en2/iris_50nn_map.png?rlkey=1w1dnj6n3i1f5o5r6zv79g1ws&raw=1)

We can see that, as K increases, the borders between the classification spaces become smoother and more distinct. However, you can also see that the spaces are not perfectly pure when it comes to the known elements within them.

**How are outliers affected by K?** As K increases, outliers are "smoothed out". Look at the four plots above and notice how outliers strongly affect the prediction space when K=1. When K=50, outliers no longer affect region boundaries. This is a classic bias-variance tradeoff: with increasing K, the variance decreases but the bias increases.

**Question:** What's the "best" value for K in this case?

**Answer:** ...

## Let's Build a KNN Model

In [None]:
iris.head()

In [None]:
iris.columns

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = iris[feature_cols]
y = iris.species_num

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=99)


knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train, y_train)

y_pred_class = knn.predict(X_test)
print((metrics.accuracy_score(y_test, y_pred_class)))

## Guided Intro to KNN: NBA Position KNN Classifier

For the rest of the lesson, we will be using a dataset containing the 2015 season statistics for ~500 NBA players. This dataset leads to a nice choice of K, as we'll see below. The columns we'll use for features (and the target 'pos') are:


| Column | Meaning |
| ---    | ---     |
| pos | C: Center. F: Forward. G: Guard |
| ast | Assists per game |
| stl | Steals per game |
| blk | Blocks per game |
| tov | Turnovers per game |
| pf  | Personal fouls per game |

For information about the other columns, see [this glossary](https://www.basketball-reference.com/about/glossary.html).

<img src="https://www.dropbox.com/scl/fi/igomtyzq4ubk5y9otx1fy/basketball.png?rlkey=3q03bh6jhx7yg36s4mtnzeusf&raw=1"  align="center"/>

In [None]:
# Read the NBA data into a DataFrame.
import pandas as pd

path = 'data/NBA_players_2015.csv'
nba = pd.read_csv(path, index_col=0)

In [None]:
nba.head()

In [None]:
nba.columns

In [None]:
nba.pos.factorize()

In [None]:
# Map positions to numbers
nba['pos_num'] = nba.pos.factorize()[0]

In [None]:
nba.head()

In [None]:
# Create feature matrix (X).
feature_cols = ['ast', 'stl', 'blk', 'tov', 'pf']
X = nba[feature_cols]

In [None]:
X.head()

In [None]:
# Create response vector (y).
y = nba.pos_num
y

<a id="using-the-traintest-split-procedure-k"></a>
### Using the Train/Test Split Procedure (K=1)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

#### Step 1: Split X and y into training and testing sets (`test_size=0.25`, `random_state=99` for reproducibility).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=99)

In [None]:
y.shape

In [None]:
y_train.shape

In [None]:
y_test.shape

#### Step 2: Train the model on the training set (using K=1).

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

#### Step 3: Test the model on the testing set and check the accuracy.

In [None]:
y_pred_class = knn.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred_class)
print(accuracy)

**Question:** If we had trained on the entire dataset and tested on the entire dataset, using 1-KNN what accuracy would we likely get? If the resulting accuracy is not this number, what must some data points look like?

**Answer:** ...

#### Repeating for K=50.

In [None]:
knn = None
accuracy_50 = None
print(accuracy_50)

#### Comparing Testing Accuracy With Null Accuracy

Null accuracy is the accuracy that can be achieved by **always predicting the most frequent class**. For example, if most players are Centers, we would always predict Center.

The null accuracy is a benchmark against which you may want to measure every classification model.

#### Examine the class distribution from the training set.

Remember that we are comparing KNN to this simpler model. So, we must find the most frequent class **of the training set**.

In [None]:
most_freq_class = y_train.value_counts().index[0]

print(y_train.value_counts())
most_freq_class

#### Compute null accuracy.

In [None]:
y_test.value_counts()[most_freq_class] / len(y_test)

<a id="tuning-a-knn-model"></a>
## Tuning a KNN Model
---

In [None]:
# Instantiate the model (using the value K=5).
knn = None

# Fit the model with data.
knn.fit(X, y)

# Store the predicted response values.
y_pred_class = None

In [None]:
# Calculate predicted probabilities of class membership.
# Each row sums to one; column i is the probability of the class with pos_num == i
# (the code-to-position mapping is given by nba.pos.factorize()[1]).
knn.predict_proba(X)

## What is the "best" value of K?

In [None]:
# Calculate TRAINING ERROR and TESTING ERROR for K=1 through 100.

k_range = list(range(1, 101))
training_error = []
testing_error = []

# Find test accuracy for all values of K between 1 and 100 (inclusive).
for k in k_range:

    # Instantiate the model with the current K value.
    knn = None

    # Calculate training error (error = 1 - accuracy).
    y_pred_class = None
    training_accuracy = None
    training_error.append(1 - training_accuracy)

    # Calculate testing error.
    y_pred_class = None
    testing_accuracy = None
    testing_error.append(1 - testing_accuracy)
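
One way to fill in the blanks in the loop above is sketched here. To stay self-contained it uses a synthetic three-class data set from `make_classification` rather than the NBA data, and a shorter range of K; both are assumptions made for the example, so the resulting numbers say nothing about the NBA data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 300 points, 5 features, 3 classes.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, n_informative=3,
                                     n_redundant=0, n_classes=3, random_state=99)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=99)

k_values = list(range(1, 31))
train_err, test_err = [], []

for k in k_values:
    # Instantiate and fit the model with the current K value.
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)

    # Training error: predict on the same data the model was fit on.
    train_err.append(1 - accuracy_score(y_tr, model.predict(X_tr)))

    # Testing error: predict on the held-out data.
    test_err.append(1 - accuracy_score(y_te, model.predict(X_te)))

# Smallest testing error and the K that produced it.
best_err, best_k = min(zip(test_err, k_values))
print(best_k, round(best_err, 3))
```

In the notebook itself, the same loop body would use `X_train`, `X_test`, `y_train`, and `y_test` from the NBA split and append to `training_error` and `testing_error`.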

In [None]:
# Allow plots to appear in the notebook.
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

In [None]:
# Create a DataFrame of K, training error, and testing error.
column_dict = {'K': k_range, 'training error':training_error, 'testing error':testing_error}
df = pd.DataFrame(column_dict).set_index('K').sort_index(ascending=False)
df.head()

In [None]:
# Plot the relationship between K (HIGH TO LOW) and TESTING ERROR.
df.plot(y='testing error');
plt.xlabel('Value of K for KNN');
plt.ylabel('Error (lower is better)');

In [None]:
# Find the minimum testing error and the associated K value.
df.sort_values('testing error').head()

In [None]:
# Alternative method:
min(list(zip(testing_error, k_range)))

<a id="training-error-versus-testing-error"></a>
### Training Error Versus Testing Error

In [None]:
# Plot the relationship between K (HIGH TO LOW) and both TRAINING ERROR and TESTING ERROR.
df.plot();
plt.xlabel('Value of K for KNN');
plt.ylabel('Error (lower is better)');

- **Training error** decreases as model complexity increases (lower value of K).
- **Testing error** is minimized at the optimum model complexity.

Evaluating the training and testing error is important. For example:

- If the training error is much lower than the test error, then our model is likely overfitting.
- If the test error starts increasing as we vary a hyperparameter, we may be overfitting.
- If both errors stay high and plateau, our model is likely underfitting (not complex enough).

#### Making Predictions on Out-of-Sample Data

Given the statistics of a (truly) unknown NBA player, how do we predict his position?

In [None]:
import numpy as np

# Instantiate the model with the best-known parameters.
knn = KNeighborsClassifier(n_neighbors=13)

# Re-train the model with X and y (not X_train and y_train). Why?
knn.fit(X, y)

# Make a prediction for an out-of-sample observation.
knn.predict_proba(np.array([2, 1, 0, 1, 2]).reshape(1, -1))

What could we conclude?

- When using KNN on this data set with these features, the **best value for K** is likely to be around 14.
- Given the statistics of an **unknown player**, we estimate that we would be able to correctly predict his position about 74% of the time.

<a id="standardizing-features"></a>
## Standardizing Features
---

There is one major issue that applies to many machine learning models: They are sensitive to feature scale.

> KNN in particular is sensitive to feature scale because it (by default) uses the Euclidean distance metric. To determine closeness, Euclidean distance sums the squared differences along each axis. So, if one axis has large differences and another has small differences, the former axis will contribute much more to the distance than the latter axis.

This means that it matters whether our features are centered around zero and have similar variance to each other.
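
A small worked example makes the scale problem concrete. The two features and their values below are hypothetical (they are not columns used in this lesson): one is measured in the thousands, the other in single digits.

```python
import math

# Hypothetical players described by (minutes played per season, blocks per game).
player_a = (2400.0, 0.2)
player_b = (2500.0, 2.8)

# Squared difference contributed by each feature to the Euclidean distance.
squared_diffs = [(a - b) ** 2 for a, b in zip(player_a, player_b)]
distance = math.sqrt(sum(squared_diffs))

# Share of the squared distance that comes from each feature.
shares = [d / sum(squared_diffs) for d in squared_diffs]
print(squared_diffs, distance, shares)
```

Here the minutes feature accounts for more than 99.9% of the squared distance, so the large relative difference in blocks barely registers.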

Unfortunately, most data does not naturally start at a mean of zero and a shared variance. Other models tend to struggle with scale as well, even linear regression, when you get into more advanced methods such as regularization.

Fortunately, this is an easy fix.

<a id="use-standardscaler-to-standardize-our-data"></a>
### Use `StandardScaler` to Standardize our Data

StandardScaler standardizes our data by subtracting the mean from each feature and dividing by its standard deviation.
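
The same arithmetic can be written out by hand. The sketch below uses made-up numbers and the hypothetical helper names `fit_scaler` and `apply_scaler`; it uses the population standard deviation, which is what `StandardScaler` uses, and it learns the mean and standard deviation from the training values only before reusing them on the test values.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and population standard deviation of one feature."""
    return statistics.fmean(values), statistics.pstdev(values)

def apply_scaler(values, mean, std):
    """Standardize values with a previously learned mean and standard deviation."""
    return [(v - mean) / std for v in values]

# Made-up assists-per-game values, for illustration only.
train_ast = [1.0, 2.0, 3.0, 4.0, 10.0]
test_ast = [2.5, 6.0]

mean, std = fit_scaler(train_ast)                 # fit on the training data only
train_scaled = apply_scaler(train_ast, mean, std)
test_scaled = apply_scaler(test_ast, mean, std)   # reuse the training statistics

print(round(statistics.fmean(train_scaled), 6), round(statistics.pstdev(train_scaled), 6))
```

After scaling, the training values have mean 0 and standard deviation 1 (up to rounding); the test values generally do not, because they are scaled with the training statistics.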

#### Separate feature matrix and response for scikit-learn.

In [None]:
# Create feature matrix (X).
feature_cols = ['ast', 'stl', 'blk', 'tov', 'pf']

X = nba[feature_cols]
y = nba.pos_num  # Create response vector (y).

In [None]:
X.head()

#### Create the train/test split.

Notice that we create the train/test split first. If we standardized before splitting, the scaler's mean and standard deviation would be computed partly from the testing data, which would leak information about the testing data into training.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)

#### Instantiate and fit `StandardScaler`.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# We fit to figure out the distribution
scaler.fit(X_train)

# now we transform everything using that
# if you wanted to do it all in one step ==> X_train = scaler.fit_transform(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
pd.DataFrame(X_train).describe()

#### Fit a KNN model and look at the testing error.
Can you find a number of neighbors that improves our results from before?

In [None]:
# Calculate testing error.
knn = KNeighborsClassifier(n_neighbors=13)
knn.fit(X_train, y_train)

y_pred_class = knn.predict(X_test)
testing_accuracy = metrics.accuracy_score(y_test, y_pred_class)
print('the accuracy is: ',testing_accuracy)
testing_error = 1 - testing_accuracy

print('the error is: ',testing_error)

<a id="comparing-knn-with-other-models"></a>
## Comparing KNN With Other Models
---

**Advantages of KNN:**

- It's simple to understand and explain.
- Model training is fast.
- It can be used for classification and regression (for regression, take the average value of the K nearest points!).
- Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular.

**Disadvantages of KNN:**

- It must store all of the training data.
- Its prediction phase can be slow when n is large.
- It is sensitive to irrelevant features.
- It is sensitive to the scale of the data.
- Accuracy is (generally) not competitive with the best supervised learning methods.
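
The regression variant mentioned in the advantages list can be sketched in a few lines: instead of voting, average the target values of the K nearest points. This is a plain-Python illustration with made-up data and a hypothetical function name `knn_regress`, not scikit-learn's `KNeighborsRegressor`.

```python
import math

def knn_regress(train_X, train_y, query, k):
    """Predict a numeric target as the mean target of the k nearest training points."""
    # Sort training points by Euclidean distance to the query.
    by_distance = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    # Average the targets of the k closest points.
    nearest_targets = [target for _, target in by_distance[:k]]
    return sum(nearest_targets) / k

# Made-up one-feature data set, for illustration only.
train_X = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [10.0, 20.0, 30.0, 100.0]
print(knn_regress(train_X, train_y, query=(2.2,), k=2))
```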

# Random Forest
----
Dictatorship or diplomacy?

<img src="https://www.dropbox.com/scl/fi/agihb72fdl8yozz3k3an9/rf.jpg?rlkey=yxry1ud5m2obcm2sjhhfp49fj&raw=1"  align="center"/>

In [None]:
# Let's import the random forest classifier.
from sklearn.ensemble import RandomForestClassifier

In [None]:
X.head()

In [None]:
# Calculate testing error.
rf_model = RandomForestClassifier(n_estimators=20,max_depth=8, random_state=0)


rf_model.fit(X_train, y_train)

y_pred_class = rf_model.predict(X_test)
testing_accuracy = metrics.accuracy_score(y_test, y_pred_class)
print('the accuracy is: ',testing_accuracy)
testing_error = 1 - testing_accuracy

print('the error is: ',testing_error)

# Now you do it
<img src="https://www.dropbox.com/scl/fi/s9kv1dytq4qzr8g19y3r0/hands_on.jpg?rlkey=yz8kq22sfdgc7lsgmm1e0fksr&raw=1" width="100" height="100" align="right"/>

The dataset is about "churn" in cell phone plans. It has information on how different account holders used their phones and whether or not they churned.

Our goal is to predict whether a user will churn or not based on the other features.

<img src="https://www.dropbox.com/scl/fi/r0di6ju7bm2pskg5nqd0n/churn.png?rlkey=xclo5ytlre63kb6o31sjub956&raw=1"  align="center"/>

### Use these parameters for testing

> random_state = 99

> test_size = 0.2

In [None]:
churn = pd.read_csv('./data/churn_missing.csv')
churn.head()

In [None]:
churn.shape

In [None]:
churn.isnull().sum().plot(kind='bar')