[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%208%20Notebooks/GDAN%205400%20-%20Week%208%20Notebooks%20%28IV%29%20-%20Task%204%20-%20Fill%20Missing%20Values.ipynb)

This notebook provides a mini-tutorial on different ways of identifying missing data in the Titanic training dataset.

In [None]:
%%time
import datetime
print("Current date and time:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

# Load Packages and Set Working Directory
Import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include these lines at the top of all my notebooks.

In [None]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_info_columns', 500)
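
As a small sketch of how these options behave, the settings can be read back with `pd.get_option`, and `pd.option_context` applies a setting only inside a `with` block. The temporary value of 50 is an illustrative assumption and is not used elsewhere in this notebook.

```python
import pandas as pd

# Set the display options using their fully qualified names.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 250)

# Read the options back to confirm they took effect.
max_columns = pd.get_option('display.max_columns')
max_colwidth = pd.get_option('display.max_colwidth')
print(max_columns, max_colwidth)

# Temporarily override an option; the old value returns after the block.
with pd.option_context('display.max_colwidth', 50):
    inside = pd.get_option('display.max_colwidth')
after = pd.get_option('display.max_colwidth')
print(inside, after)
```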

### Read in the Titanic Training Data

In [None]:
import numpy as np
import pandas as pd

train_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Titanic/train.csv'
train = pd.read_csv(train_url)
print('# of rows in training dataset:', len(train), '\n')
train[:2]

# Explore the Data with Histograms 

Before we run our machine learning models, we have to get the data ready to analyze, which includes identifying and fixing variables with missing values. Task 3 in the fourth coding assignment has the following requirements:

Explore the Data with Histograms  
- Generate histograms for all **numeric features** in the dataset.  
- Use these histograms to understand the distribution of key variables such as `Age`, `Fare`, and `Pclass`.  
- **Tips:** 
  - Instead of plotting separate histograms for each variable, use the **shortcut method** we covered in class to generate all histograms at once.
  - Make sure to import the plotting packages (*hint*: there are two relevant import lines we used in our Week 7 and Week 8 notebooks, as well as in Weeks 5 and 6).


## Why Use Histograms?  
Before we build our machine learning model, it’s important to **understand the distribution of our data**. Histograms help us:  
- Identify patterns, such as whether a variable is **normally distributed**, **skewed**, or has **outliers**.  
- Detect potential **data issues**, like extreme or implausible values. Note that histograms leave out missing values, so those need a separate check.  
- Compare distributions of key features like `Age`, `Fare`, and `Pclass`.  

---

### Step 1: Import Required Libraries  
To create histograms, we need two key visualization libraries:  
- **Matplotlib** – A foundational plotting library for Python.  
- **Seaborn** – A statistical visualization library that builds on Matplotlib.  

```python
import matplotlib.pyplot as plt  
import seaborn as sns  
```

---

### Step 2: Generate Histograms for All Numeric Features  
Instead of plotting each histogram separately, we can use a **shortcut method** to generate histograms for all numeric columns at once.  

```python
train.select_dtypes(include='number').hist(figsize=(13, 8))  
plt.tight_layout()  
plt.show()  
```

#### 📌 **How It Works**  
- `train.select_dtypes(include='number')` selects only numerical columns.  
- `.hist(figsize=(13, 8))` generates histograms for each numeric column.  
- `plt.tight_layout()` adjusts spacing to prevent overlapping.  
- `plt.show()` displays the plots.  
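
To see which columns `select_dtypes(include='number')` keeps, here is a minimal sketch on a made-up miniature of the Titanic data. The frame `toy` and its values are hypothetical and stand in for `train`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic data (values are made up).
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Pclass': [3, 1, 3, 2],
    'Name': ['A', 'B', 'C', 'D'],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, np.nan, 35.0],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# Keep only integer and float columns; text columns are dropped.
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
```

Running the same `.columns.tolist()` call on `train` lists the columns that will each get a histogram.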

---

### Step 3: Interpret the Histograms  
After running the code, you’ll see multiple histograms. Here are examples of what to look for:  

#### 🔹 `Age`
- If the distribution is roughly **bell-shaped**, the data may be approximately normally distributed.  
- If it is **skewed left or right**, we may need to transform the data.  

#### 🔹 `Fare`
- A **right-skewed** distribution (many small values, few large values) has a long right tail, and the few very large fares may be **outliers**.  
- We might consider applying a **log transformation** to reduce skewness.  
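
As a rough sketch of the log-transformation idea, the snippet below compares the sample skewness before and after `np.log1p`. The fares are made-up values with a long right tail, not the real Titanic data, and `np.log1p` (log of 1 + x) is one possible choice assumed here because it tolerates zero fares.

```python
import numpy as np
import pandas as pd

# Made-up fares with a long right tail (not the real Titanic values).
fare = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 30.0, 71.3, 512.3])

# Sample skewness of the raw values (positive means right-skewed).
skew_raw = fare.skew()

# log1p computes log(1 + x), which is defined even when a fare is 0.
fare_log = np.log1p(fare)
skew_log = fare_log.skew()

print(round(skew_raw, 2), round(skew_log, 2))
```

On `train`, the same comparison would be `train['Fare'].skew()` versus `np.log1p(train['Fare']).skew()`.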

#### 🔹 `Pclass`
- Since `Pclass` represents **passenger class (1st, 2nd, 3rd)**, it should have **three distinct bars** in the histogram.  
- This indicates it is a categorical (ordered) feature, even though it’s stored as a number.  
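
A quick numeric check can back up what the three bars suggest. The sketch below uses a made-up `Pclass` series, counts its distinct values, and shows one possible way to mark it as an ordered categorical. The conversion is an illustration, not a required step in the assignment.

```python
import pandas as pd

# Made-up passenger classes (not the real Titanic values).
pclass = pd.Series([3, 1, 3, 2, 3, 1, 3, 2], name='Pclass')

# Number of distinct levels and how often each occurs.
n_levels = pclass.nunique()
counts = pclass.value_counts().sort_index()
print(n_levels)
print(counts)

# Optionally treat the column as an ordered categorical (1st < 2nd < 3rd).
pclass_cat = pd.Categorical(pclass, categories=[1, 2, 3], ordered=True)
print(pclass_cat.categories.tolist())
```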

---

### Conclusion  
Histograms give us a **quick and powerful way** to understand our data before modeling. They help us identify:  
- **Skewness** (e.g., `Fare`)  
- **Potential transformations** (e.g., log-scaling `Fare`)  
- **Categorical vs. numeric features** (`Pclass` looks numeric but is categorical)  
- **Outliers** and other data issues (missing values need a separate check, since histograms leave them out)  

Once we understand our dataset’s distributions, we can move on to further data preprocessing!  
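
Because histograms are drawn from the non-missing values only, a direct count is a useful companion check. This is a minimal sketch on a made-up frame. `toy` and its values are hypothetical, and the same two calls can be run on `train`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing ages (values are made up).
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan, 35.0],
    'Fare': [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Count of missing values per column.
missing_counts = toy.isna().sum()

# Share of missing values per column (the mean of True/False flags).
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```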

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

train.select_dtypes(include='number').hist(figsize=(13, 8))
plt.tight_layout()
plt.show()

### Optional Step: Customize Individual Histograms
If you want to analyze a specific variable in more detail, you can plot its histogram separately using **Seaborn**:  

```python
sns.histplot(train["Age"], bins=30, kde=True)  
plt.title("Distribution of Age")  
plt.xlabel("Age")  
plt.ylabel("Count")  
plt.show()  
```

#### 🔹 **What’s Different?**
- `bins=30` adjusts the number of bars.  
- `kde=True` adds a **smooth density curve** to visualize the shape.  
- `plt.title()`, `plt.xlabel()`, and `plt.ylabel()` label the plot.  
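
To make the effect of `bins=30` concrete, the sketch below uses NumPy's `np.histogram` on made-up ages. An integer `bins` there means that many equal-width intervals, which gives 30 counts and 31 bin edges. The random ages are an assumption for illustration, not the Titanic values.

```python
import numpy as np

# Made-up ages drawn uniformly between 0 and 80 (not the real data).
rng = np.random.default_rng(0)
ages = rng.uniform(0, 80, size=200)

# Split the observed range into 30 equal-width bins and count each one.
counts, edges = np.histogram(ages, bins=30)

print(len(counts), len(edges))
print(counts.sum())
```

Fewer bins give a smoother but coarser picture, while more bins show finer detail at the cost of noisier bars.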

---

In [None]:
sns.histplot(train["Age"], bins=30, kde=True)  
plt.title("Distribution of Age")  
plt.xlabel("Age")  
plt.ylabel("Count")  
plt.show() 