[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%208%20Notebooks/GDAN%205400%20-%20Week%208%20Notebooks%20%28IV%29%20-%20Task%204%20-%20Fill%20Missing%20Values.ipynb)

This notebook provides a mini-tutorial on different ways of filling in missing data in the Titanic training dataset.

In [None]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

# Load Packages and Set Working Directory
Import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various display options for inspecting the data. I like to be able to see all of the columns, so I typically include these lines at the top of all my notebooks.

In [None]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_info_columns', 500)

### Read in the Titanic Training Data

In [None]:
import numpy as np
import pandas as pd

train_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Titanic/train.csv'
train = pd.read_csv(train_url)
print('# of rows in training dataset:', len(train), '\n')
train[:2]

# Fill in Missing Values for `Age`

Before we run our machine learning models we have to get the data ready to analyze, including identifying and fixing variables with missing values. Task 3 in the fourth coding assignment has the following requirements:

**Fill in Missing Values for `Age`**
- The `Age` column contains missing values that must be filled before modeling.  
- Use the **median** age to replace missing values, as it is less affected by outliers.  
- After filling in the missing values, verify that `Age` no longer has any missing entries.  


## Why Fill in Missing Values Instead of Dropping Data?  
Previously, we handled missing values by **dropping** rows or columns that contained them. While this works in some cases, dropping data can lead to problems:  
- **Loss of Information** – Removing rows with missing `Age` values means we lose other important details about those passengers.  
- **Smaller Dataset** – Fewer data points can make our model less accurate.  
- **Bias in Data** – If certain groups (e.g., older passengers) are more likely to have missing values, dropping them could skew our analysis.  

Instead of dropping rows, we can **fill in missing values** using an appropriate method.
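
The trade-off between dropping and filling can be sketched on a small toy table. The names and values below are invented for illustration and are not taken from the Titanic file.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration (not the Titanic file)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Age": [22.0, np.nan, 35.0, np.nan, 58.0],
    "Fare": [7.25, 71.28, 8.05, 13.00, 26.55],
})

# Option 1: drop every row where Age is missing (Name and Fare are lost too)
dropped = toy.dropna(subset=["Age"])

# Option 2: keep every row and fill Age with the median of the observed ages
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

print("Rows after dropping:", len(dropped))
print("Rows after filling: ", len(filled))
```

Dropping removes the `Name` and `Fare` values for the affected rows as well, while filling keeps them available to the model.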

---

## Step 1: Check for Missing Values in `Age`  
Before filling in missing values, let’s confirm how many are missing.  

```python
train["Age"].isnull().sum()
```

This will return the number of missing values in the `Age` column.
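
It can also help to see the missing values as a share of the column. A minimal sketch, using a toy Series with made-up values rather than the real `train` DataFrame:

```python
import numpy as np
import pandas as pd

# Toy values, invented for illustration
ages = pd.Series([22.0, np.nan, 35.0, np.nan, 58.0], name="Age")

n_missing = ages.isnull().sum()       # count of missing entries
share_missing = ages.isnull().mean()  # True counts as 1, so the mean is the proportion

print(f"{n_missing} of {len(ages)} values missing ({share_missing:.1%})")
```

The same two expressions should work on `train["Age"]` once the data is loaded.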

---

## Step 2: Fill Missing `Age` Values with the Median  
There are different ways to fill in missing values, but the **median** is often the best choice for numerical data because:  
- It is **less affected by outliers** than the mean.  
- It represents a typical value for the dataset.  

### Replace Missing `Age` Values with the Median  
```python
train["Age"] = train["Age"].fillna(train["Age"].median())
```

This replaces all missing values in the `Age` column with the median age.
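
To see why the median is less sensitive to outliers, here is a small sketch on toy ages (invented values, with 80 standing in for an outlier). It compares the two candidate fill values and then applies the median fill.

```python
import numpy as np
import pandas as pd

# Toy ages, invented for illustration; 80 plays the role of an outlier
ages = pd.Series([21.0, 24.0, 26.0, 28.0, 30.0, 80.0, np.nan, np.nan], name="Age")

median_age = ages.median()  # NaN values are skipped by default
mean_age = ages.mean()      # pulled upward by the single large value

# fillna returns a new Series, so the result has to be assigned to keep it
filled = ages.fillna(median_age)

print("Median:", median_age)
print("Mean:  ", round(mean_age, 2))
print("Missing after fill:", filled.isnull().sum())
```

In this toy example the single large value pulls the mean well above the median, which stays close to the bulk of the ages.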

---

## Alternative Ways to Fill Missing Values  
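Different situations may call for different imputation methods. Each option below is meant to be used in place of the median fill in Step 2; once `Age` has been filled, a second `fillna` call has nothing left to replace.  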

### 1️⃣ Fill with the **Mean**  
```python
train["Age"] = train["Age"].fillna(train["Age"].mean())
```
- The **mean (average)** is easy to compute but can be skewed by extreme values.  

### 2️⃣ Fill with the **Most Frequent Value (Mode)**  
```python
train["Age"] = train["Age"].fillna(train["Age"].mode()[0])
```
- This replaces missing values with the most common age.  
- Works well for categorical data but may not be ideal for continuous numbers like age.  
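
As a sketch of the categorical case, the example below fills a toy column of port codes with its mode. The column name `Embarked` and the values are used here only as an illustrative assumption, not as part of the assignment.

```python
import pandas as pd

# Toy categorical data, invented for illustration
ports = pd.Series(["S", "C", "S", None, "Q", "S", None], name="Embarked")

# mode() returns a Series (there can be ties), so [0] takes the first entry
most_common = ports.mode()[0]

filled = ports.fillna(most_common)

print("Most frequent value:", most_common)
print("Missing after fill:", filled.isnull().sum())
```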

### 3️⃣ Fill with a **Fixed Value** (e.g., 30 years old)  
```python
train["Age"] = train["Age"].fillna(30)
```
- Useful if domain knowledge suggests a reasonable default value.  

---

## Step 3: Verify That `Age` Has No Missing Values  
After filling in missing values, let’s confirm that there are no more missing entries.  

```python
train["Age"].isnull().sum()
```

If the output is `0`, it means all missing values have been successfully replaced.
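
If you would rather have the notebook stop when something went wrong, one option is to wrap the check in an `assert`. This is a minimal sketch on toy values, not a required part of the assignment.

```python
import numpy as np
import pandas as pd

# Toy values, invented for illustration
ages = pd.Series([22.0, np.nan, 35.0, np.nan, 58.0], name="Age")
ages = ages.fillna(ages.median())

remaining = ages.isnull().sum()

# Raises AssertionError with a readable message if any value is still missing
assert remaining == 0, f"Age still has {remaining} missing values"
print("Missing values in Age column:", remaining)
```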

---

## Conclusion  
Filling in missing values instead of dropping them preserves the rest of each passenger's data, keeps the dataset at its full size, and can reduce the bias that dropping rows may introduce. The **median** is often a good choice for numerical variables like age because it is **robust to outliers**.  

Once we have completed this step, our dataset will be cleaner and ready for further analysis!


## Examples 

Below I run several examples. I will also show the descriptive statistics before and after running the code.

We can first see that `Age` is missing 177 of the 891 values.

In [None]:
train.info()

The descriptive statistics show the same thing: only 714 observations have a value for `Age`. The mean is 29.699118.

In [None]:
train[['Age']].describe().T

Now let's fill in the 177 missing values with the median value of `Age` and then print out how many rows are still missing data.

In [None]:
train['Age'] = train['Age'].fillna(train["Age"].median())
print("Missing values in Age column:", train["Age"].isnull().sum())

Let's re-run the descriptive statistics. All 891 rows now have data and the mean has changed slightly to 29.361582. There is no change in the minimum and maximum values.

In [None]:
train[['Age']].describe().T
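
The before-and-after comparison can also be placed in a single table. The sketch below does this on toy ages (invented values) so it runs without downloading the data; swapping in `train["Age"]` before filling should give the equivalent table for the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy ages, invented for illustration
ages = pd.Series([21.0, 24.0, 26.0, 28.0, 30.0, 80.0, np.nan, np.nan], name="Age")
filled = ages.fillna(ages.median())

# One row per version of the column, one column per statistic
comparison = pd.DataFrame({
    "before": ages.describe(),
    "after": filled.describe(),
}).T

print(comparison[["count", "mean", "std", "min", "max"]])
```

The count rises to the full number of rows, the minimum and maximum stay the same, and the standard deviation shrinks because the filled rows all share one value.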