<a href="https://colab.research.google.com/github/alimoorreza/CS167-fall24-notes/blob/main/Day07_Normalization_and_w_knn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS167: Day07
## Normalization and Weighted k Nearest Neighbors

#### CS167: Machine Learning, Fall 2024


📆 [Course Schedule](https://analytics.drake.edu/~reza/teaching/cs167_fall24/cs167_schedule.html) | 📜 [Syllabus](https://analytics.drake.edu/~reza/teaching/cs167_fall24/cs167_syllabus_fall24.pdf)

## Can't forget to load our data:

And some of our favorite modules, `pandas` and `numpy`.

In [None]:
#run this cell if you're using Colab:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#import the data:
#make sure the path on the line below corresponds to the path where you put your dataset.
import pandas as pd
import numpy as np

path2 = '/content/drive/MyDrive/cs167_fall24/datasets/titanic.csv'
titanic = pd.read_csv(path2)
titanic.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


# ✨ New Material

## Normalization Motivation:

In datasets that have numeric data, the columns that have the largest magnitude will have a greater 'say' in the decision of what to predict.

In the penguin dataset, `body_mass_g` will have a much bigger say in the prediction than the other options.
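
To see why, here is a minimal sketch with made-up penguin measurements (the numbers are hypothetical, chosen only for illustration). Even though neighbor A has a very different bill length, it ends up "closer" than neighbor B, because the distance is driven almost entirely by the body mass column:

```python
import numpy as np

# Hypothetical measurements: [bill_length_mm, body_mass_g]
new_penguin = np.array([40.0, 4000.0])
neighbor_a  = np.array([60.0, 4010.0])   # very different bill, similar mass
neighbor_b  = np.array([40.5, 4100.0])   # nearly identical bill, 100 g heavier

# Euclidean distance from the new penguin to each neighbor
dist_a = np.linalg.norm(new_penguin - neighbor_a)
dist_b = np.linalg.norm(new_penguin - neighbor_b)

print('distance to A:', dist_a)
print('distance to B:', dist_b)
```

A 20 mm difference in bill length is huge for a penguin, but it barely registers next to a 100 g difference in mass.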

# Normalization:

__Normalizing data:__
- rescale attribute values so they're about the same
- adjusting values measured on different scales to a common scale

## A Simple Normalization:
One simple method of normalizing data is to replace each value with a proportion relative to the max value.

For example, the oldest person on the Titanic was 80, so:

| **age** | **replaced by** |
|---------|:------------------|
| 80      | 80/80 = 1        |
| 50      | 50/80 = 0.625    |
| 48      | 48/80 = 0.6      |
| 25      | 25/80 = 0.3125   |
| 4       | 4/80 = 0.05      |
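
The table above can be reproduced with a short sketch in `pandas`. The variable names `ages` and `scaled` are just for illustration:

```python
import pandas as pd

# the five example ages from the table
ages = pd.Series([80, 50, 48, 25, 4])

# divide every value by the largest value in the column
scaled = ages / ages.max()

print(scaled.tolist())
```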

## Before Normalization
<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_fall24/notes/images/day03_zscore_improvement.png" width=600/>
</div>

### Age is overemphasized here

## Z-Score: Another Normalization Method

__Idea__: rather than normalize to a proportion of the max, normalize based on how many standard deviations each value is away from the mean.

__Standard Deviation__: usually represented as $\sigma$ (sigma), a kind of 'average' distance from the average value.
- a low standard deviation indicates that the values tend to be close to the mean
- a high standard deviation indicates that the values are spread out over a wider range.

## Standard Deviation:
<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_fall24/notes/images/day03_std.png" width=400/>
</div>

## Standard Deviation Calculation:

## $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$

1. Find the mean, represented as $\mu$ (mu)
2. Then, for each number, subtract the mean and square the result.
3. Then, find the mean of those squared differences.
4. Take the square root of that and we are done.

Let $\mu$ be the mean; then the standard deviation of $x_1, x_2, ..., x_N$ is:

## $\sigma = \sqrt{\frac{(x_1-\mu)^2 + (x_2 - \mu)^2+ ... + (x_N-\mu)^2}{N}}$
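
The four steps can be followed one at a time in `numpy`. This is a minimal sketch on a small made-up list of numbers, checked against `np.std()`, which divides by $N$ by default:

```python
import numpy as np

# a small made-up list of values
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# 1. find the mean
mu = values.mean()

# 2. subtract the mean from each number and square the result
squared_diffs = (values - mu) ** 2

# 3. find the mean of those squared differences
mean_squared_diff = squared_diffs.mean()

# 4. take the square root
sigma = np.sqrt(mean_squared_diff)

print('by hand:', sigma)
print('np.std :', np.std(values))
```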

# Corrected Sample Standard Deviation

The mean of a sample tends to be a good estimate of the mean of the entire population (on average), but:
- the standard deviation of a sample tends to be _smaller_ than the standard deviation of the entire population.

__Bessel's correction__ says that you should divide by $N-1$ instead of $N$ when working with a sample (as we usually do in machine learning tasks), and your estimate will be a little less biased.

## $\sigma = \sqrt{\frac{(x_1-\mu)^2 + (x_2 - \mu)^2+ ... + (x_N-\mu)^2}{N-1}}$
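
In code, the difference between dividing by $N$ and by $N-1$ is controlled by the `ddof` argument. A small sketch on made-up values: `np.std()` divides by $N$ unless you pass `ddof=1`, while the `pandas` `.std()` method applies the $N-1$ correction by default.

```python
import numpy as np
import pandas as pd

# a small made-up sample
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

population_std = np.std(values)           # divides by N
sample_std     = np.std(values, ddof=1)   # divides by N-1 (Bessel's correction)
pandas_std     = pd.Series(values).std()  # pandas divides by N-1 by default

print('divide by N  :', population_std)
print('divide by N-1:', sample_std)
print('pandas .std():', pandas_std)
```

This is why the `.std()` calls on dataframe columns later in these notes already give the corrected sample standard deviation.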

# Computing the Z-Score
After computing the corrected sample standard deviation, normalize by replacing each value $x_i$ with its z-score, based on the mean ($\mu$) and standard deviation ($\sigma$) of its column.

## $z_i = \frac{x_i - \mu}{\sigma}$

## Example Z-Score Calculation

For example:
On the Titanic:
- sex mean (0: male, 1: female): 0.35
- sex standard deviation: 0.48
- age mean: 29.7
- age standard deviation: 13
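
Plugging the rounded Titanic statistics listed above into the z-score formula gives a quick sanity check. The helper name `z_score_value` and the choice of a 22-year-old passenger are just for illustration:

```python
def z_score_value(x, mean, std):
    """Return how many standard deviations x is from the mean."""
    return (x - mean) / std

# rounded statistics from the list above
age_z    = z_score_value(22, 29.7, 13)    # a 22-year-old passenger
female_z = z_score_value(1, 0.35, 0.48)   # sex encoded as 1 (female)
male_z   = z_score_value(0, 0.35, 0.48)   # sex encoded as 0 (male)

print('age 22:', round(age_z, 3))
print('female:', round(female_z, 3))
print('male  :', round(male_z, 3))
```

After normalization, age and sex are both measured in "standard deviations from the mean", so neither column dominates the distance just because of its units.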

<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_fall24/notes/images/day03_zscore.png" width=400/>
</div>


<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_fall24/notes/images/day03_zscore_ex.png" width=600/>
</div>

# Normalization Code:
Let's try out some code now:



In [None]:
#make sure your data is loaded and ready to go (one of the top few cells)
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## New function `replace()`

Called on a dataframe or a single column, `replace()` will replace the values given in `to_replace` with `value`.

Let's use this to make the `sex` column of the dataset numeric (`male` becomes 1, `female` becomes 0).

In [None]:
titanic['sex'] = titanic['sex'].replace(to_replace='male', value=1)
titanic['sex'] = titanic['sex'].replace(to_replace='female', value=0)
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,0,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,0,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,0,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,1,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
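
As an alternative to calling `replace()` once per value, the `pandas` `map()` method accepts a dictionary and converts every value in one pass. This is a sketch on a tiny made-up column rather than the full Titanic data; note that any value missing from the dictionary becomes `NaN`.

```python
import pandas as pd

# a tiny made-up column standing in for titanic['sex']
sex = pd.Series(['male', 'female', 'female', 'male'])

# look each value up in the dictionary
sex_numeric = sex.map({'male': 1, 'female': 0})

print(sex_numeric.tolist())
```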


In [None]:
titanic['who'] = titanic['who'].replace(to_replace='woman', value=1)
titanic['who'] = titanic['who'].replace(to_replace='man', value=0)
titanic.head(7)


## Calculating z-score:
Now that we have the data as 1s and 0s, let's calculate the mean and standard deviation.

In [None]:
s_mean      = titanic.sex.mean()
s_std       = titanic.sex.std()

#replace column with each entry's z-score
titanic.sex = (titanic.sex - s_mean)/s_std
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,0.737281,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,-1.354813,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,-1.354813,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,-1.354813,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,0.737281,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


Next, you'd need to repeat this process for all of the predictor columns so they're all of comparable size.

## 💻 Programming Exercise #1:

Normalize each of the predictor columns in the iris dataset.

> Note: you need a way to transform the new reading (the specimen) that you will make the prediction on so that the new one and the training data will all be on the same scale. How can you do that?

Repeat your kNN prediction code with the normalized data.
- Does the value of k change the predictions?

In [None]:
path1   = '/content/drive/MyDrive/cs167_fall24/datasets/irisData.csv'
iris    = pd.read_csv(path1)
iris.head(7)

Unnamed: 0,sepal length,sepal width,petal length,petal width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa


In [None]:
# use z-score to normalize the iris data

# petal length:
petal_length_mean     = iris['petal length'].mean()
petal_length_std      = iris['petal length'].std()
# z score
z_score_pl            = (iris['petal length'] - petal_length_mean)/petal_length_std
iris['petal length']  = z_score_pl

# petal width:
petal_width_mean      = iris['petal width'].mean()
petal_width_std       = iris['petal width'].std()
# z score
z_score_pw            = (iris['petal width'] - petal_width_mean)/petal_width_std
iris['petal width']   = z_score_pw



In [None]:
iris.head(7)

Unnamed: 0,sepal length,sepal width,petal length,petal width,species
0,5.1,3.5,-1.336794,-1.308593,Iris-setosa
1,4.9,3.0,-1.336794,-1.308593,Iris-setosa
2,4.7,3.2,-1.39347,-1.308593,Iris-setosa
3,4.6,3.1,-1.280118,-1.308593,Iris-setosa
4,5.0,3.6,-1.336794,-1.308593,Iris-setosa
5,5.4,3.9,-1.166767,-1.046525,Iris-setosa
6,4.6,3.4,-1.336794,-1.177559,Iris-setosa


In [None]:
def knn(specimen, data, k):
    # write your code in here to make this function work
    # 1. calculate distances
    data_copy = data.copy() #good practice to make a copy of the data
    data_copy['distance_to_new'] = np.sqrt(
        (specimen['petal length'] - data_copy['petal length'])**2
        +(specimen['sepal length'] - data_copy['sepal length'])**2
        +(specimen['petal width'] - data_copy['petal width'])**2
        +(specimen['sepal width'] - data_copy['sepal width'])**2)

    # 2. sort
    sorted_data = data_copy.sort_values(['distance_to_new'])

    # 3. predict
    prediction = sorted_data.iloc[0:k]['species'].mode()[0]

    #return prediction
    return prediction

In [None]:
#what will you have to do here to make it work?

path1 = '/content/drive/MyDrive/cs167_fall24/datasets/irisData.csv'
iris = pd.read_csv(path1)
iris.head(7)

new_iris = {}
new_iris['petal length']  = 5.1
new_iris['sepal length']  = 7.2
new_iris['petal width']   = 1.5
new_iris['sepal width']   = 2.5

pred = knn(new_iris, iris, 15)
print(pred)

## Programming Exercise #2:

Write a function called `z_score()` that will take in a list of the names of the columns that you want to normalize, and the dataframe, and will return a dataframe where those columns have been z-score normalized.

In [None]:
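def z_score(columns, data):
    """
    takes in a list of columns to normalize using the z-score method
    Params:
        columns, a list of columns to normalize
        data, the dataframe, preferably a copy
    Return:
        a copy of the dataframe with the specified columns normalized,
        plus the list of column means and the list of column standard
        deviations (both in the same order as columns)
    """
    normalized_data = data.copy()

    mean_list = []
    std_list  = []

    for col in columns:

        # get the mean and std (pandas .std() divides by N-1 by default)
        col_mean = data[col].mean()
        col_std  = data[col].std()

        # keep appending the mean, std into the lists initialized above
        mean_list.append(col_mean)
        std_list.append(col_std)

        # z score
        z = (data[col] - col_mean)/col_std

        # replace the column with the z-score
        normalized_data[col] = z

    return normalized_data, mean_list, std_list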

In [None]:
path1 = '/content/drive/MyDrive/cs167_fall24/datasets/irisData.csv'
iris = pd.read_csv(path1)
iris.head(7)

Unnamed: 0,sepal length,sepal width,petal length,petal width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa


In [None]:
column_names = ['sepal length', 'sepal width', 'petal width', 'petal length']

iris_norm, mean_list, std_list = z_score(column_names, iris)

iris_norm.head()

Unnamed: 0,sepal length,sepal width,petal length,petal width,species
0,-0.897674,1.028611,-1.336794,-1.308593,Iris-setosa
1,-1.1392,-0.12454,-1.336794,-1.308593,Iris-setosa
2,-1.380727,0.33672,-1.39347,-1.308593,Iris-setosa
3,-1.50149,0.10609,-1.280118,-1.308593,Iris-setosa
4,-1.018437,1.259242,-1.336794,-1.308593,Iris-setosa
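
One way to check a z-score normalization is to confirm that every normalized column has a mean of (approximately) 0 and a standard deviation of 1. Here is a self-contained sketch that uses only the first five iris rows shown earlier, so the numbers will not match the full-dataset output above:

```python
import numpy as np
import pandas as pd

# first five rows of two iris columns, copied from the table earlier in the notes
df = pd.DataFrame({
    'sepal length': [5.1, 4.9, 4.7, 4.6, 5.0],
    'petal length': [1.4, 1.4, 1.3, 1.5, 1.4],
})

# z-score every column at once: subtract the column mean, divide by the column std
normalized = (df - df.mean()) / df.std()

print(normalized.mean().round(6))  # each column mean should be about 0
print(normalized.std())            # each column std should be 1
```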


## Are all neighbors created equal?

The way we've learned kNN so far, each neighbor gets an equal vote in the decision of what to predict.

Do we see any problems with this? If so, what?

<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_fall24/notes/images/day04_wknn_motivation.png" width = 400/>
</div>

Should neighbors that are closer to the new instance get a larger share of the vote?

# Weighted k-NN Intuition:

In weighted kNN, the nearest k points are given a weight, and the weights are grouped by the target variable. The class with the largest sum of weights will be the class that is predicted.

The intuition is to give more weight to the points that are nearby and less weight to the points that are farther away.
- distance-weighted voting

In w-kNN, we want to predict the class with the most total weight, where each neighbor's weight is defined by the inverse squared distance function.

## $w_{q,i} = \frac{1}{d(x_q, x_i)^2}$

> In English, you can read that as the __weight__ of a training example is equal to 1 divided by the distance between the new instance and the training example squared.
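
To get a feel for how quickly the weight falls off, here is a small sketch that applies the formula to a few made-up distances. Each time the distance doubles, the weight drops to a quarter of what it was:

```python
import numpy as np

# made-up distances from a new instance to four training examples
distances = np.array([0.5, 1.0, 2.0, 4.0])

# inverse squared distance
weights = 1 / distances ** 2

print(weights.tolist())
```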

## A w-kNN Example: Step 1

Start by calculating the distance between the new example ('X'), and each of the other training examples:

<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_fall24/notes/images/day04_wknn_ex.png" width=700/>
</div>

## A w-kNN Example: Step 2

Then, __calculate the weight__ of each training example using the inverse distance squared.

<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_fall24/notes/images/day04_wknn_ex1.png" width=700/>
</div>

## A w-kNN Example: Step 3

Find the k closest neighbors--let's assume `k=3` for this example:
<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_fall24/notes/images/day04_wknn_ex2.png" width=700/>
</div>

Then, sum the weights for each possible class:
- __orange__: $1$
- __blue__: $1/16 + 1/9 \approx 0.174$

### What would a __normal 3NN__ predict? Weighted 3NN?
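
One way to check your answer is to tally the votes in code. This sketch assumes the three nearest neighbors are one orange point at distance 1 and two blue points at distances 4 and 3; those distances are inferred from the weights $1$, $1/16$, and $1/9$ listed above, and the column names are just for illustration.

```python
import pandas as pd

# the k=3 nearest neighbors (distances inferred from the weights above)
neighbors = pd.DataFrame({
    'color':    ['orange', 'blue', 'blue'],
    'distance': [1.0, 4.0, 3.0],
})
neighbors['weight'] = 1 / neighbors['distance'] ** 2

# normal 3NN: every neighbor gets one vote, the most common class wins
unweighted_prediction = neighbors['color'].mode()[0]

# weighted 3NN: sum the weights per class, the largest total wins
weight_sums = neighbors.groupby('color')['weight'].sum()
weighted_prediction = weight_sums.idxmax()

print('normal 3NN  :', unweighted_prediction)
print('weighted 3NN:', weighted_prediction)
print(weight_sums)
```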

## Let's write some code:

Write a new function `weighted_knn()`

Pass the iris measurements (specimen), data frame, and k as parameters and return the predicted class.

In [None]:
import numpy as np

def weighted_knn(specimen, data, k):

  # step 1: calculate the distances from 'specimen' to all other samples in 'data'
  data = data.copy()  # work on a copy so the caller's dataframe is not modified
  data['distances'] = np.sqrt( (specimen['petal length'] - data['petal length'])**2 +
                               (specimen['sepal length'] - data['sepal length'])**2 +
                               (specimen['petal width']  - data['petal width'])**2 +
                               (specimen['sepal width']  - data['sepal width'])**2 )


  # step 2: calculate the weights for each sample (remember, weights are 1/d^2)
  data['weights']      = 1/data['distances']**2


  # step 3: find the k closest neighbors as follows
  # first: sort the data and take the first k samples as neighbors
  sorted_data        = data.sort_values(['distances'])
  print('Nearest k samples in the training data:')
  neighbors          = sorted_data.iloc[0:k]
  print(neighbors)


  # second: use groupby to sum the weights of each species in the closest k
  result = neighbors.groupby('species')['weights'].sum()
  print(result)


  # third: return the class that has the largest sum of weights
  predicted_label = result.idxmax()

  return predicted_label


In [None]:
new_iris = {}
new_iris['petal length']  = 5.1
new_iris['sepal length']  = 7.2
new_iris['petal width']   = 1.5
new_iris['sepal width']   = 2.5
k = 5
prediction = weighted_knn(new_iris, iris, k)
print('Prediction (weighted knn + unnormalized data): ', prediction)

Nearest k samples in the training data:
     sepal length  sepal width  petal length  petal width          species  \
76            6.8          2.8           4.8          1.4  Iris-versicolor   
52            6.9          3.1           4.9          1.5  Iris-versicolor   
77            6.7          3.0           5.0          1.7  Iris-versicolor   
50            7.0          3.2           4.7          1.4  Iris-versicolor   
129           7.2          3.0           5.8          1.6   Iris-virginica   

     distances   weights  
76    0.591608  2.857143  
52    0.700000  2.040816  
77    0.741620  1.818182  
50    0.836660  1.428571  
129   0.866025  1.333333  
species
Iris-versicolor    8.144712
Iris-virginica     1.333333
Name: weights, dtype: float64
Prediction (weighted knn + unnormalized data):  Iris-versicolor
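
One edge case to keep in mind: if the new specimen has exactly the same measurements as a training example, its distance is 0 and $1/d^2$ divides by zero. The notes do not cover this case; a common workaround, shown here only as a sketch, is to add a tiny constant to the denominator. The name `eps` and its value are assumptions for illustration.

```python
import numpy as np

# made-up distances, including an exact match (distance 0)
distances = np.array([0.0, 0.5, 2.0])

# assumption: a tiny constant keeps the weight finite for exact matches
eps = 1e-9
safe_weights = 1 / (distances ** 2 + eps)

print(safe_weights)
```

The exact match still gets by far the largest weight, so it dominates the vote without producing an infinite value.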


## Exercises:

Normalize each of the predictor columns in the iris dataset, or just use `iris_norm` which we created above.

>__Note__: you need a way to transform the new reading (the specimen) that you will make the prediction on so that the new one and the training data will all be on the same scale. How can you do that?

Repeat your k-NN prediction code for the normalized data.
- Does the value of k change the predictions?
    - compare using `k=3`, and `k=5` on each method (normalized and non-normalized), (weighted and unweighted)

In [None]:
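def z_score(columns, data):
    """
    takes in a list of columns to normalize using the z-score method
    Params:
        columns, a list of columns to normalize
        data, the dataframe, preferably a copy
    Return:
        a copy of the dataframe with the specified columns normalized,
        plus the list of column means and the list of column standard
        deviations (both in the same order as columns)
    """
    normalized_data = data.copy()

    mean_list = []
    std_list  = []

    for col in columns:

        # get the mean and std (pandas .std() divides by N-1 by default)
        col_mean = data[col].mean()
        col_std  = data[col].std()

        # keep appending the mean, std into the lists initialized above
        mean_list.append(col_mean)
        std_list.append(col_std)

        # z score
        z = (data[col] - col_mean)/col_std

        # replace the column with the z-score
        normalized_data[col] = z

    return normalized_data, mean_list, std_list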

In [None]:
# create a new sample
new_iris = {}
new_iris['petal length']  = 5.1
new_iris['sepal length']  = 7.2
new_iris['petal width']   = 1.5
new_iris['sepal width']   = 2.5

print("Not normalized:")
print('unweighted kNN, k=3:', knn(new_iris, iris, 3))
print('unweighted kNN, k=5:', knn(new_iris, iris, 5))
print('weighted kNN,   k=3:', weighted_knn(new_iris, iris, 3))
print('weighted kNN,   k=5:', weighted_knn(new_iris, iris, 5))

In [None]:
# get the mean() and std() for each column of iris
column_names                    = ['petal length', 'sepal length', 'petal width', 'sepal width']
iris_norm, mean_list, std_list  = z_score(column_names, iris)
print('column mean: ', mean_list, ' and std: ', std_list)

In [None]:
# create a new specimen and then normalize it with the (mean, std) values computed from the training data

# create a new sample
norm_iris = {}
norm_iris['petal length']   = 5.1
norm_iris['sepal length']   = 7.2
norm_iris['petal width']    = 1.5
norm_iris['sepal width']    = 2.5
# normalize the test sample using the training mean and std you computed earlier
norm_iris['petal length']   = (norm_iris['petal length'] - mean_list[0])/std_list[0]
norm_iris['sepal length']   = (norm_iris['sepal length'] - mean_list[1])/std_list[1]
norm_iris['petal width']    = (norm_iris['petal width']  - mean_list[2])/std_list[2]
norm_iris['sepal width']    = (norm_iris['sepal width']  - mean_list[3])/std_list[3]
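
Indexing `mean_list` and `std_list` by position only works if the order matches `column_names` exactly. A sketch of an alternative that looks the statistics up by column name instead is shown below; it uses the first seven iris rows shown earlier as stand-in training data, and the names `train`, `means`, `stds`, and `norm_specimen` are just for illustration.

```python
import pandas as pd

# first seven iris rows from earlier in the notes, standing in for the training data
train = pd.DataFrame({
    'sepal length': [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6],
    'sepal width':  [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4],
    'petal length': [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4],
    'petal width':  [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3],
})
columns = ['petal length', 'sepal length', 'petal width', 'sepal width']

# per-column statistics of the training data, indexed by column name
means = train[columns].mean()
stds  = train[columns].std()

specimen = {'petal length': 5.1, 'sepal length': 7.2,
            'petal width': 1.5, 'sepal width': 2.5}

# normalize the specimen with the TRAINING mean and std of each column
norm_specimen = {col: (specimen[col] - means[col]) / stds[col] for col in columns}

print(norm_specimen)
```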


In [None]:
print("Normalized:")
print('unweighted kNN, k=3:', knn(norm_iris, iris_norm, 3))
print('unweighted kNN, k=5:', knn(norm_iris, iris_norm, 5))
print('weighted kNN, k=3:', weighted_knn(norm_iris, iris_norm, 3))
print('weighted kNN, k=5:', weighted_knn(norm_iris, iris_norm, 5))

## Use these tables to keep track of your predictions:
### `k=3`
|                    | **not normalized** | **normalized** |
|--------------------|--------------------|----------------|
| **unweighted kNN** |                    |                |
| **weighted kNN**   |                    |                |

### `k=5`

|                    | **not normalized** | **normalized** |
|--------------------|--------------------|----------------|
| **unweighted kNN** |                    |                |
| **weighted kNN**   |                    |                |

# 💬 Discussion Question

Should we __always__ normalize our data? Why or why not?

When does it make sense to normalize? When might it make more sense not to?