<a href="https://colab.research.google.com/github/alimoorreza/CS167-sp25-notes/blob/main/Day05_Missing_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS167: Day05
## kNN code, Missing Data and Normalization

#### CS167: Machine Learning, Spring 2025


📜 [Syllabus](https://analytics.drake.edu/~reza/teaching/cs167_sp25/cs167_syllabus_sp25.pdf)

## Before we get started, let's load in our datasets:

In [5]:
#run this cell if you're using Colab:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Quick Review:

**k Nearest Neighbors**: predict the most commonly occurring class of the *k nearest neighbors*.

Three main steps:
1. Calculate the distance between the new point (e.g. the Iris we would like to make a prediction on), and the existing training examples.

2. Sort the data by the newly calculated distance so that the nearest training examples are first

3. Take the top `k` neighbors and:
    - if the problem is a *classification*, **take the mode of the target variable** to find the most commonly appearing class and return that as your prediction.
    - if the problem is a *regression*, **take the average of the target variables for the k closest neighbors** and return that as your prediction

# Implement kNN in Python/Pandas:

[Please, take a look at the notebook from the last lecture.](https://github.com/alimoorreza/CS167-sp25-notes/blob/main/Day04_kNN.ipynb)
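
As a refresher, the three steps from the review can be sketched roughly as follows. This is a minimal sketch, not the code from the Day04 notebook: the function name `knn_predict`, its parameters, and the tiny Iris-like DataFrame are all made up for illustration.

```python
import numpy as np
import pandas as pd

def knn_predict(train, new_point, feature_cols, target_col, k=3):
    """Classify new_point (a dict of feature values) with k nearest neighbors."""
    # Step 1: Euclidean distance from new_point to every training row
    point = np.array([new_point[c] for c in feature_cols])
    diffs = train[feature_cols].to_numpy() - point
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    # Step 2: sort so the nearest training examples come first
    nearest = train.assign(distance=distances).sort_values("distance").head(k)
    # Step 3 (classification): take the mode of the target among the k neighbors
    return nearest[target_col].mode()[0]

# hypothetical toy data, only for illustration
train = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 1.8],
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica"],
})
new_iris = {"petal_length": 1.5, "petal_width": 0.3}

prediction = knn_predict(train, new_iris, ["petal_length", "petal_width"], "species", k=3)
print(prediction)
```

For a regression problem, the last line of the function would take `.mean()` of the target column instead of the mode.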

In [3]:
#import the data:
#make sure the path on the line below corresponds to the path where you put your dataset.
import pandas as pd
path = '/content/drive/MyDrive/cs167_sp25/datasets/titanic.csv'
titanic = pd.read_csv(path)
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


# ✨ New Material

# Missing Data:
Most datasets you will work with will not be in perfect shape--you'll need to "clean" the data before you can run any machine learning algorithms on it.

Missing data is a pretty common thing--so much so that there's a special value for missing data: `NaN`, or not a number.
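
To see where `NaN` comes from, here is a small sketch that reads a made-up CSV string (not the Titanic file) with two empty fields. By default, `pd.read_csv` turns empty fields into `NaN`.

```python
import io
import pandas as pd

# hypothetical three-column CSV with two empty fields
csv_text = "name,age,deck\nAnn,22,\nBob,,C\n"
toy = pd.read_csv(io.StringIO(csv_text))

print(toy)          # the empty fields show up as NaN
print(toy.dtypes)   # 'age' becomes a float column so that it can hold NaN
```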

The steps of cleaning data normally include:
1. Identifying which columns have missing data
2. Determining how much data is missing in each column
3. Deciding what to do with the missing data: drop it, fill it, let it be

Notice, in the `deck` column, there are 3 instances of `NaN` we can see...

But what about the other 800 or so rows? Do we have to go through and find them manually? Gross.

In [None]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## Step 1: Detecting Missing Data

To identify missing data, we will use a combination of three pandas methods:
- `isna()`, `notna()`, and `any()`

## Using `isna()` and `notna()` to find missing data:
- `isna()` will return a boolean Series (or DataFrame) that is True wherever the element is `NaN`.
- `notna()` will return a boolean Series (or DataFrame) that is True wherever the element is __not__ `NaN`.


In [None]:
%%html
<iframe src="https://pandas.pydata.org/docs/reference/api/pandas.isna.html" width="1000" height="500"></iframe>

Let's call `isna()` on the first 8 rows of Titanic (index labels 0 through 7; `loc` slices include the end label) and see what we get as output:

In [None]:
#titanic.loc[0:4]
titanic.loc[0:7]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False


In [None]:
#titanic.loc[0:4].isna()
titanic.loc[0:7].isna()
#look at the 'deck' column...

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
5,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False
6,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False


In [None]:
titanic.loc[0:7].notna()
#look at the 'deck' column...

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,True,True,True,True,True,True,True,True,True,True,True,False,True,True,True
1,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True,True,True,True,False,True,True,True
3,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True,True,True,False,True,True,True
5,True,True,True,False,True,True,True,True,True,True,True,False,True,True,True
6,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
7,True,True,True,True,True,True,True,True,True,True,True,False,True,True,True


That's pretty nifty, but there's gotta be a better way of summarizing this...

## `any()`

In [None]:
%%html
<iframe src="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html" width="1000" height="350"></iframe>

Let's use `any()` on the call to `isna()` we just did to let us know which columns have missing data:

In [None]:
titanic.isna().any()

Unnamed: 0,0
survived,False
pclass,False
sex,False
age,True
sibsp,False
parch,False
fare,False
embarked,True
class,False
who,False


Several columns are missing data: `age`, `embarked`, `deck`, and `embark_town`.
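
If you would rather get that list of column names programmatically than read it off the output, one possible sketch is to use the boolean result of `isna().any()` as a mask. The DataFrame below is a made-up stand-in for `titanic`, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the titanic DataFrame
toy = pd.DataFrame({
    "survived": [0, 1, 1],
    "age": [22.0, 38.0, np.nan],
    "deck": [np.nan, "C", np.nan],
})

# any() works column by column by default: True if the column has at least one NaN
has_missing = toy.isna().any()
cols_with_missing = toy.columns[has_missing].tolist()
print(cols_with_missing)

# any(axis=1) works row by row instead: which rows have at least one NaN?
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)
```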

Wouldn't it be great to know how much data is missing in each of those columns?

## Step 2: How much data is missing?
It's important to determine *how much missing data each column has* before we decide how to handle our missing data:
- If the missing data is a small proportion of the data, we can choose to drop those rows completely from the dataset.
- However, if most of the rows are missing data for a specific column, maybe it's a sign that we don't need to use that column.

There are multiple ways of doing this, but one of the quickest/easiest is using `value_counts()`

## `value_counts()`
Great, so now that we know which columns are missing data, let's check to see how much data they are missing using `value_counts()`.

In [None]:
%%html
<iframe src="https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html" width="1000" height="350"></iframe>

In [None]:
titanic.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

In [None]:
titanic.deck.value_counts(dropna=False)
#688 missing values

Unnamed: 0_level_0,count
deck,Unnamed: 1_level_1
,688
C,59
B,47
D,33
E,32
A,15
F,13
G,4


In [None]:
titanic.age.value_counts(dropna=False)
#177 missing values

Unnamed: 0_level_0,count
age,Unnamed: 1_level_1
,177
24.00,30
22.00,27
18.00,26
28.00,25
...,...
36.50,1
55.50,1
0.92,1
23.50,1


In [None]:
titanic.embarked.value_counts(dropna=False)
#2 missing values

Unnamed: 0_level_0,count
embarked,Unnamed: 1_level_1
S,644
C,168
Q,77
,2


In [None]:
titanic.embark_town.value_counts(dropna=False)
#2 missing values

Unnamed: 0_level_0,count
embark_town,Unnamed: 1_level_1
Southampton,644
Cherbourg,168
Queenstown,77
,2


So, here are our results:

| **Column**    | **Num Rows Missing** |
|:---------------|----------------------|
| `deck`         | 688                  |
| `age`          | 177                  |
| `embarked`    | 2                    |
| `embark_town` | 2                    |

Now, with this new information, it's up to us to decide what to do with these missing values.
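
A table like this can also be built in one call instead of one `value_counts()` per column. Because `True` counts as 1, summing the result of `isna()` gives the number of missing values per column, and taking the mean gives the fraction missing. This is a sketch on a made-up DataFrame, not the Titanic data.

```python
import numpy as np
import pandas as pd

# hypothetical toy data with some missing values
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 8.05],
})

missing_counts = toy.isna().sum()      # number of NaN values in each column
missing_fraction = toy.isna().mean()   # fraction of rows that are NaN in each column

print(missing_counts)
print(missing_fraction)
```

The fraction is handy for Step 3: a column that is mostly missing is a candidate for dropping entirely, while a column missing only a few rows is a candidate for dropping just those rows.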

## Step 3: Decide how to handle missing data

There are 3 main options here:
- drop the missing data from the dataset (either col or row)
- fill the missing data with a suitable replacement
- let it be and cross our fingers

### Option 1: Drop it using `dropna()`

If there isn't much missing data, and/or you have a very large dataset, dropping the rows that include missing data is a viable option.

In [None]:
path = '/content/drive/MyDrive/cs167_sp25/datasets/titanic.csv'
titanic = pd.read_csv(path)

In [None]:
print(f'size of the dataframe before dropna(): {titanic.shape}')
titanic.dropna()
print(f'size of the dataframe after dropna(): {titanic.shape}')

size of the dataframe before dropna(): (891, 15)
size of the dataframe after dropna(): (891, 15)


Huh... that's weird. We know that there's missing data, so why didn't the shape change?

Pandas is trying to protect you, and rather than dropping the rows "in place", it is returning a new dataframe with the rows dropped--as written, we're just not saving its return value. There are two ways to fix this:
- save what `dropna()` is returning in a variable (see below)
- add the parameter `inplace=True` to the function call, and it will drop the rows in the original dataset (be careful with this one)

In [None]:
print("before dropna(): ", titanic.shape)
no_missing_data = titanic.dropna()
print("after dropna(): ", no_missing_data.shape)

before dropna():  (891, 15)
after dropna():  (182, 15)


In [None]:
print("before dropna(): ", titanic.shape)
titanic.dropna(inplace=True)
print("after dropna(inplace=True): ", titanic.shape)

before dropna():  (891, 15)
after dropna(inplace=True):  (182, 15)


That's better, but wow: dropping every row that has missing data leaves only 182 of our 891 rows. If this happens to you, you'll probably want to re-load your data to have the full dataset to work with.

In [None]:
# if that happens, you'll want to re-run your data loading code:
path = '/content/drive/MyDrive/cs167_sp25/datasets/titanic.csv'
titanic = pd.read_csv(path)

`embarked` and `embark_town` are each missing only 2 rows, so let's use `dropna()` to drop just those rows in place:
- the parameter `subset` takes a list of columns; only rows with missing data in those columns are dropped.

In [None]:
path = '/content/drive/MyDrive/cs167_sp25/datasets/titanic.csv'
titanic = pd.read_csv(path)

print("before: ", titanic.shape)
titanic.dropna(inplace=True, subset=["embark_town"])
print("after: ", titanic.shape)

before:  (891, 15)
after:  (889, 15)


In [None]:
path = '/content/drive/MyDrive/cs167_sp25/datasets/titanic.csv'
titanic = pd.read_csv(path)

print("before: ", titanic.shape)
titanic.dropna(inplace=True, subset=["age"])
print("after: ", titanic.shape)

before:  (891, 15)
after:  (714, 15)


In [None]:
path = '/content/drive/MyDrive/cs167_sp25/datasets/titanic.csv'
titanic = pd.read_csv(path)

print("before: ", titanic.shape)
titanic.dropna(inplace=True, subset=["deck"])
print("after: ", titanic.shape)

before:  (891, 15)
after:  (203, 15)


In [None]:
path = '/content/drive/MyDrive/cs167_sp25/datasets/titanic.csv'
titanic = pd.read_csv(path)

print("before: ", titanic.shape)
titanic.dropna(inplace=True, subset=['embarked', 'embark_town']) # multiple columns
print("after: ", titanic.shape)

before:  (891, 15)
after:  (889, 15)
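
Option 1 also covers dropping a whole column, which the cells above do not show. As a sketch on a made-up DataFrame (not the Titanic data), `drop(columns=[...])` removes columns you name, while `dropna(axis=1)` removes every column that contains any missing value.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: 'deck' is mostly missing, 'age' is missing once
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan],
    "deck": [np.nan, "C", np.nan],
    "fare": [7.25, 71.28, 7.93],
})

# drop one specific column by name
no_deck = toy.drop(columns=["deck"])
print(no_deck.columns.tolist())

# drop every column that has at least one missing value
complete_cols = toy.dropna(axis=1)
print(complete_cols.columns.tolist())
```

Both calls return a new DataFrame and leave `toy` unchanged, just like `dropna()` on rows.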


### Option 2: Fill it using `fillna()`

If dropping every row with missing data would leave you with too little data, consider filling the missing values with something else.

What do you think we should use to fill in the missing data in the `age` column?
- we probably don't want to throw off our statistics...

In [None]:
path = '/content/drive/MyDrive/cs167_sp25/datasets/titanic.csv'
titanic = pd.read_csv(path)

titanic.head(7)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True


In [None]:
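avg_age = titanic.age.mean()
# fill through the DataFrame: calling fillna(inplace=True) on titanic.age alone is
# chained assignment, which recent pandas versions warn about and may not apply
titanic.fillna({'age': avg_age}, inplace=True)
titanic.head(7)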

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,29.699118,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True


The `fillna()` function allows `NaN` values to be filled with a given value like so:

In [None]:
titanic = pd.read_csv(path)  # reload so the NaN values are back
print("before: ", titanic['age'].isna().any())
age_mean = titanic['age'].mean()
titanic.fillna({'age': age_mean}, inplace=True)
print("after: ", titanic['age'].isna().any())
titanic.head(7)

before:  True
after:  False


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,29.699118,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True


In [None]:
# alternate option: assign the filled column back instead of using inplace=True
avg_age = titanic['age'].mean()
titanic['age'] = titanic['age'].fillna(avg_age)
titanic.head(7)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,29.699118,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
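
Why the mean rather than, say, 0? A quick sketch on made-up numbers (not the Titanic ages): filling with the column mean leaves the mean of the column unchanged, while filling with 0 drags it down. For a text column such as `embark_town`, one common choice (an assumption here, not something these notes prescribe) is to fill with the most frequent value.

```python
import numpy as np
import pandas as pd

# hypothetical ages with two missing values
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan])

filled_mean = ages.fillna(ages.mean())   # mean() skips NaN by default
filled_zero = ages.fillna(0)

print(ages.mean(), filled_mean.mean(), filled_zero.mean())

# hypothetical categorical column: fill with the mode (most frequent value)
towns = pd.Series(["S", "C", np.nan, "S"])
filled_towns = towns.fillna(towns.mode()[0])
print(filled_towns.tolist())
```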


### Option 3: Let it be ❄️

What's so bad about missing data? Why do we care if some data is missing?

What happens if we try to do math with `NaN`? Try it out for yourself:

In [None]:
import numpy as np
a = np.nan

In [None]:
#try out some addition/subtraction

In [None]:
#try out some multiplication/division

In [None]:
#what about taking something to the power? (**)

In [None]:
# modulo or integer division? (% or //)

In [None]:
# what about '=='? Is np.nan == np.nan?

In [None]:
# what happens if you take the average of this list of numbers?
my_series = pd.Series([2,2,3,np.nan,3])
my_series.mean()

2.5

In [None]:
my_series.median()

2.5
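
For reference, here is one way to fill in the empty cells above, plus a reminder of why this matters for kNN. This is a sketch; the two "points" at the end are made-up feature vectors.

```python
import numpy as np
import pandas as pd

a = np.nan

# arithmetic with NaN gives NaN back
print(a + 1, a - 1, a * 2, a / 2, a % 2, a // 2)

# NaN is not equal to anything, not even itself, so use isna()/np.isnan() to test for it
nan_equals_itself = (a == a)
print(nan_equals_itself, np.isnan(a), pd.isna(a))

# pandas statistics skip NaN by default; skipna=False shows what plain math would give
s = pd.Series([2, 2, 3, np.nan, 3])
mean_skip = s.mean()
mean_keep = s.mean(skipna=False)
print(mean_skip, mean_keep)

# a single NaN feature makes a Euclidean distance NaN, so kNN cannot rank that row
point_a = np.array([1.0, np.nan])
point_b = np.array([0.0, 2.0])
distance = np.sqrt(((point_a - point_b) ** 2).sum())
print(distance)
```

So "letting it be" only works if every later step either skips `NaN` (like `mean()`) or never touches the affected column.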

## Summary: Missing Data Functions
- `isna()`: returns True for any missing data
- `notna()`: returns True for any data that is __not__ `NaN`
- `any()`: returns True if any of the elements in a Series is True (on a DataFrame, it gives one result per column)
- `value_counts()`: returns a Series counting how often each value appears; use `dropna=False` to include `NaN` in the counts
- `dropna()`: drops rows (`axis=0`, the default) or columns (`axis=1`) that have missing data. Don't forget to either save the result of the call or add `inplace=True` as a parameter.
- `fillna()`: replaces missing data with a given value (generally 0 or the mean)