<a href="https://colab.research.google.com/github/alimoorreza/CS167-sp25-notes/blob/main/Day06_Normalization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS167: Day05
## Data Normalization

#### CS167: Machine Learning, Spring 2025


📜 [Syllabus](https://analytics.drake.edu/~reza/teaching/cs167_sp25/cs167_syllabus_sp25.pdf)

# Announcement
[Notebook #2: kNN and Normalization](https://github.com/alimoorreza/CS167-SP25-Notebook-2) has been released, due Monday 02/24 by 11:59pm.


## Before we get started, let's load in our datasets:

In [None]:
#run this cell if you're using Colab:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#import the data:
#make sure the path on the line below corresponds to the path where you put your dataset.
import pandas as pd
path = '/content/drive/MyDrive/cs167_sp25/datasets/titanic.csv'
titanic = pd.read_csv(path)
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


# 💬 Discussion Question:

What do we do if the features aren't numbers?
- like Titanic `embark_town`... how can we calculate a distance between `Southampton` and `Queenstown`?

In [None]:
titanic.embark_town.unique()

array(['Southampton', 'Cherbourg', 'Queenstown', nan], dtype=object)

In [None]:
pd.get_dummies(titanic.embark_town)

Unnamed: 0,Cherbourg,Queenstown,Southampton
0,False,False,True
1,True,False,False
2,False,False,True
3,False,False,True
4,False,False,True
...,...,...,...
886,False,False,True
887,False,False,True
888,False,False,True
889,True,False,False
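
One way to read the `get_dummies` output above is that each town becomes its own 0/1 column, so ordinary distance arithmetic works again. The sketch below uses a tiny made-up Series in place of the real `embark_town` column; the helper `town_distance_sq` is a hypothetical name introduced only for illustration.

```python
import pandas as pd

# Toy stand-in for the titanic `embark_town` column (made-up rows).
towns = pd.Series(['Southampton', 'Cherbourg', 'Queenstown', 'Southampton'],
                  name='embark_town')

# One indicator column per category; dtype=int gives 0/1 instead of True/False.
dummies = pd.get_dummies(towns, dtype=int)

def town_distance_sq(i, j):
    """Squared Euclidean distance between rows i and j of the indicator columns."""
    diff = dummies.iloc[i] - dummies.iloc[j]
    return int((diff ** 2).sum())

print(town_distance_sq(0, 3))  # rows 0 and 3 share a town
print(town_distance_sq(0, 2))  # rows 0 and 2 have different towns
```

With this encoding, any two different towns are the same distance apart, which avoids implying an order such as `Southampton < Queenstown`.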


# 💬 Discussion Question:

What if our __target variable__ is continuous rather than categorical? How would we make a prediction using kNN?
- Can we do regression with kNN? If so, how?

Examples of Regression problems:
- predict tomorrow's temperature
- predict the fuel efficiency of a vehicle
- predict how much someone will like a show on Netflix
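
One common answer to the question above is to average the target values of the k nearest neighbors instead of taking a majority vote. The sketch below is a minimal version of that idea on made-up numbers; `knn_regress` is a hypothetical helper, not a function from the course code.

```python
import numpy as np

def knn_regress(train_X, train_y, query, k=3):
    """Predict a continuous target as the mean of the k nearest neighbors' targets."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y, dtype=float)
    # Euclidean distance from the query to every training row.
    dists = np.sqrt(((train_X - np.asarray(query, dtype=float)) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]   # indices of the k closest rows
    return train_y[nearest].mean()

# Made-up data: one predictor, one continuous target.
X = [[1.0], [2.0], [3.0], [10.0]]
y = [10.0, 20.0, 30.0, 100.0]
prediction = knn_regress(X, y, query=[2.1], k=3)
print(prediction)
```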

In [None]:
path = '/content/drive/MyDrive/cs167_sp25/datasets/vehicles.csv'
vehicles = pd.read_csv(path)
vehicles.head()


Unnamed: 0,barrels08,barrelsA08,charge120,charge240,city08,city08U,cityA08,cityA08U,cityCD,cityE,...,mfrCode,c240Dscr,charge240b,c240bDscr,createdOn,modifiedOn,startStop,phevCity,phevHwy,phevComb
0,15.695714,0.0,0.0,0.0,19,0.0,0,0.0,0.0,0.0,...,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
1,29.964545,0.0,0.0,0.0,9,0.0,0,0.0,0.0,0.0,...,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
2,12.207778,0.0,0.0,0.0,23,0.0,0,0.0,0.0,0.0,...,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
3,29.964545,0.0,0.0,0.0,10,0.0,0,0.0,0.0,0.0,...,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
4,17.347895,0.0,0.0,0.0,17,0.0,0,0.0,0.0,0.0,...,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0


## Summary: Missing Data Functions
- `isna()`: returns `True` for each value that is missing (`NaN`)
- `notna()`: returns `True` for each value that is __not__ `NaN`
- `any()`: returns `True` if any of the elements in a Series is `True`
- `value_counts()`: returns a Series counting how often each value occurs; use `dropna=False` to include `NaN` values
- `dropna()`: drops rows (`axis=0`) or columns (`axis=1`) that have missing data. Don't forget to either save the result of the call or add `inplace=True` as a parameter.
- `fillna()`: replaces missing data with a given value (generally 0 or the mean)
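
The functions in this summary can be tried together on a small frame. The sketch below uses made-up data (the column names `age` and `deck` only echo the Titanic columns) and saves the results of `dropna()` and `fillna()` rather than using `inplace=True`.

```python
import numpy as np
import pandas as pd

# Small made-up frame with a few missing values.
df = pd.DataFrame({'age': [22.0, np.nan, 26.0, 35.0],
                   'deck': ['C', None, None, 'C']})

print(df['age'].isna())                        # True only where age is missing
print(df['age'].isna().any())                  # is at least one age missing?
print(df['deck'].value_counts(dropna=False))   # counts per value, including missing

rows_kept = df.dropna(axis=0)                  # drop rows with any missing value
cols_kept = df.dropna(axis=1)                  # drop columns with any missing value
age_filled = df['age'].fillna(df['age'].mean())  # fill gaps with the column mean
```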

# Normalization:

__Normalizing data:__
- rescale attribute values so they're about the same size
- adjusting values measured on different scales to a common scale

## A Simple Normalization:
One simple method of normalizing data is to replace each value with its proportion relative to the max value.

For example, the oldest person on the Titanic was 80, so:

| **age** | **replaced by** |
|---------|:------------------|
| 80      | 80/80 = 1        |
| 50      | 50/80 = 0.625    |
| 48      | 48/80 = 0.6      |
| 25      | 25/80 = 0.3125   |
| 4       | 4/80 = 0.05      |
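
The table above can be reproduced in one line of pandas. This is a small sketch using just the five ages from the table rather than the full `age` column.

```python
import pandas as pd

# The five ages from the table above.
ages = pd.Series([80, 50, 48, 25, 4], name='age')

# Divide every value by the column maximum so the largest value becomes 1.
ages_scaled = ages / ages.max()
print(ages_scaled.tolist())
```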

## Before Normalization
<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day03_zscore_improvement.png" width=600/>
</div>

### Age is overemphasized here

## Z-Score: Another Normalization Method

__Idea__: rather than normalizing to a proportion of the max, normalize each value based on how many standard deviations it is away from the mean.

__Standard Deviation__: usually represented as $\sigma$ (sigma), a kind of 'average' distance from the average value.
- a low standard deviation indicates that the values tend to be close to the mean
- a high standard deviation indicates that the values are spread out over a wider range.

## Standard Deviation:
<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day03_std.png" width=400/>
</div>

## Standard Deviation Calculation:

## $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$

1. Find the mean, represented as $\mu$ (mu)
2. Then, for each number, subtract the mean and square the result.
3. Then, find the mean of those squared differences.
4. Take the square root of that and we are done.

Let $\mu$ be the mean, then the standard deviation of $x_1, x_2, ..., x_N$ is:

## $\sigma = \sqrt{\frac{(x_1-\mu)^2 + (x_2 - \mu)^2+ ... + (x_N-\mu)^2}{N}}$
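
The four steps can be followed one at a time in NumPy. The values below are made up so that the arithmetic comes out to round numbers; the variable names are only for illustration.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up values

mu = x.mean()                # step 1: the mean
sq_diffs = (x - mu) ** 2     # step 2: subtract the mean and square
variance = sq_diffs.mean()   # step 3: mean of the squared differences (divide by N)
sigma = np.sqrt(variance)    # step 4: square root

print(mu, variance, sigma)
```

For these values the squared differences sum to 32, so dividing by $N = 8$ gives 4 and the standard deviation is 2.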

# Corrected Sample Standard Deviation

The mean of a sample tends to be a good estimate for the mean of the entire population (on average), but:
- the standard deviation of a sample tends to be _smaller_ than the standard deviation of the entire population.

__Bessel's correction__ says that you should divide by $N-1$ instead of $N$ when working with a sample (as we usually do in machine learning tasks), and your estimate will be a little less biased.

## $\sigma = \sqrt{\frac{(x_1-\mu)^2 + (x_2 - \mu)^2+ ... + (x_N-\mu)^2}{N-1}}$
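
In code, the choice between $N$ and $N-1$ is controlled by the `ddof` argument. The sketch below, on made-up values, shows that pandas `Series.std()` divides by $N-1$ by default while NumPy `np.std()` divides by $N$ by default, which is worth keeping in mind when comparing results from the two libraries.

```python
import numpy as np
import pandas as pd

x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up values

pop_std = x.std(ddof=0)            # divide by N
sample_std = x.std()               # pandas default is ddof=1: divide by N-1
np_default = np.std(x.to_numpy())  # NumPy default is ddof=0: divide by N

print(pop_std, sample_std, np_default)
```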

# Computing the Z-Score
After computing the corrected sample standard deviation, normalize by replacing each value $x_i$ with its Z-score, based on the mean ($\mu$) and standard deviation ($\sigma$) of its column.

## $\text{Z-score} = \frac{x_i - \mu}{\sigma}$
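
A quick way to check a Z-score computation is that the transformed column should have a mean of about 0 and a standard deviation of about 1. The sketch below uses five made-up ages rather than the real dataset.

```python
import pandas as pd

ages = pd.Series([80.0, 50.0, 48.0, 25.0, 4.0])  # made-up values

# Subtract the mean, then divide by the (N-1) standard deviation.
z = (ages - ages.mean()) / ages.std()

print(z.tolist())
print(z.mean(), z.std())  # approximately 0 and 1, up to floating-point error
```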

## Example Z-Score Calculation

For example, on the Titanic:
- sex mean (0: male, 1: female): 0.35
- sex standard deviation: 0.48
- age mean: 29.7
- age standard deviation
<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day03_zscore.png" width=400/>
</div>


<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day03_zscore_ex.png" width=600/>
</div>

# Normalization Code:
Let's try out some code now:



In [None]:
#make sure your data is loaded and ready to go (one of the top few cells)
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## New function `replace()`

Called on a DataFrame or a single column (Series), `replace()` will replace the values given in `to_replace` with `value`.

Let's use this to make the `sex` column of the dataset numeric.

In [None]:
titanic['sex'] = titanic['sex'].replace(to_replace='female', value=1)
titanic['sex'] = titanic['sex'].replace(to_replace='male', value=0)
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,0,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,1,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,1,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,1,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,0,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## Calculating z-score:
Now that we have the data as 1s and 0s, let's calculate the mean and standard deviation.

In [None]:
s_mean = titanic.sex.mean()
s_std = titanic.sex.std()

#replace column with each entry's z-score
titanic.sex = (titanic.sex - s_mean)/s_std
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,-0.737281,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,1.354813,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,1.354813,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,1.354813,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,-0.737281,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


Next, you'd need to repeat this process for all of the predictor columns, so they're all on a comparable scale.
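
Repeating the `sex` computation for several columns can be done with a short loop. The sketch below assumes a small made-up frame and a hand-picked `predictors` list (both hypothetical); the target column is left untouched.

```python
import pandas as pd

# Made-up rows with numeric predictors and a target column.
df = pd.DataFrame({'sex': [0, 1, 1, 1, 0],
                   'age': [22.0, 38.0, 26.0, 35.0, 35.0],
                   'fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
                   'survived': [0, 1, 1, 1, 0]})

predictors = ['sex', 'age', 'fare']  # columns to normalize; the target is excluded

for col in predictors:
    # Replace the column with each entry's z-score.
    df[col] = (df[col] - df[col].mean()) / df[col].std()

print(df.head())
```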

## Exercise:

Normalize each of the predictor columns in the iris dataset.

> Note: you need a way to transform the new reading (the specimen) that you will make the prediction on, so that the new reading and the training data are all on the same scale. How can you do that?

Repeat your kNN prediction code with the normalized data.
- Does the value of k change the predictions?
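
As a hint for the note above, one possible approach is to save the mean and standard deviation of each training column and apply those same numbers to the new reading. The sketch below uses made-up measurements and hypothetical column names, not the actual iris file.

```python
import pandas as pd

# Made-up training measurements (column names are assumptions).
train = pd.DataFrame({'petal_length': [1.4, 4.7, 6.0, 1.3],
                      'petal_width': [0.2, 1.4, 2.5, 0.2]})
new_reading = pd.Series({'petal_length': 5.0, 'petal_width': 1.8})

# Compute the statistics from the training data only.
train_mean = train.mean()
train_std = train.std()

# Apply the same statistics to both the training data and the new reading.
train_z = (train - train_mean) / train_std
new_z = (new_reading - train_mean) / train_std

print(new_z)
```

The new reading is not used when computing the mean or standard deviation, so it ends up on the same scale as the training rows.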