[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%208%20Notebooks/GDAN%205400%20-%20Week%208%20Notebooks%20%28III%29%20-%20Task%203%20-%20Identify%20Missing%20Data.ipynb)

This notebook provides a mini-tutorial on different ways of identifying missing data in the Titanic training dataset.

In [None]:
%%time
import datetime
print("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

# Load Packages and Set Working Directory
Import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks.

In [None]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_info_columns', 500)

### Read in the Titanic Training Data

In [None]:
import numpy as np
import pandas as pd

train_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Titanic/train.csv'
train = pd.read_csv(train_url)
print('# of rows in training dataset:', len(train), '\n')
train[:2]

# Identify Variables with Missing Data 

Before we run our machine learning models, we have to get the data ready to analyze, which includes identifying and fixing variables with missing values. Task 3 in the fourth coding assignment has the following requirements:

- **Part (i)**
  - Determine which variables contain missing values in the dataset using any acceptable method
- **Part (ii)**
  - Use the provided input function to manually enter the variables that have missing values, then hit `Enter` to continue. 

*Hints:*
- You can use the `.info()` method, `.isnull().sum()`, `.isna().sum()`, or `.describe()`


## Why Do Missing Values Matter?  
In real-world datasets, it’s common for some values to be missing. This can happen for many reasons, such as incomplete data collection or errors in recording information.

Missing values can affect how well our model performs, so before building a machine learning model, we must first **identify** which variables contain missing data.  


## Methods for Identifying Missing Data  

We can use several methods to check for missing values in our dataset:

### Option 1: Use `.info()` to get an overview  
```python
train.info()  
```
- This method shows the number of **non-null** values in each column.
- If the number of non-null values is less than the total number of rows, that column has missing values.
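
The comparison that `.info()` asks you to make by eye can also be done in code. The following is a minimal sketch on a small made-up DataFrame (`toy` and its values are hypothetical, not the Titanic data); `.count()` returns the same per-column non-null counts that `.info()` prints.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Non-null values per column (the figures .info() reports)
non_null_counts = toy.count()

# True for every column whose non-null count falls short of the row total
has_missing = non_null_counts < len(toy)

print(non_null_counts)
print(has_missing[has_missing].index.tolist())
```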

### Option 2: Use `.isnull().sum()` to count missing values  
```python
train.isnull().sum()  
```
- This method returns a count of missing values for each column.
- Any column with a number greater than **0** has missing values.

### Option 3: Use `.isna().sum()` (alternative method)  
```python
train.isna().sum()  
```
- This works the same way as `.isnull().sum()`, since `isna()` and `isnull()` are interchangeable in PANDAS.
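
As a quick check of that equivalence, the sketch below compares the two methods on a small made-up DataFrame (`toy` is hypothetical). It also illustrates what counts as missing: `NaN` and `None` do, but an empty string does not.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one NaN and one None
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", None, "C"],
})

# Both methods return the same boolean mask, so the sums match as well
same_mask = toy.isna().equals(toy.isnull())
print(same_mask)
print(toy.isna().sum())

# An empty string is a real (non-missing) value as far as pandas is concerned
empty_string_missing = pd.Series(["", "S"]).isna().sum()
print(empty_string_missing)
```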

### Option 4: Use `.describe()` for summary statistics  
```python
train.describe(include="all")  
```
- If the **count** for a column is lower than the total number of rows, it means some values are missing.
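
The `count` row can also be pulled out of the summary and compared with the number of rows directly. This is a minimal sketch on made-up data (`toy` is hypothetical); the `astype(float)` call is there because the summary table mixes numeric and text columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

summary = toy.describe(include="all")

# The "count" row holds the number of non-missing values per column
counts = summary.loc["count"].astype(float)

# Columns whose count is below the total number of rows have missing values
short_columns = counts[counts < len(toy)].index.tolist()
print(short_columns)
```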

Below I will implement these different options for identifying the columns with missing data:

In [None]:
train.info()

The following two methods compute the total number of missing values for each column in the dataset. Both are native PANDAS methods; `.isna()` is simply an alias of `.isnull()`.

Each returns a `Series` where each column name is paired with its count of missing values. Columns with zero missing values are also shown.

In [None]:
train.isnull().sum()

In [None]:
train.isna().sum()

The next two options are variations of the above that filter out columns that do not have missing values. They first calculate the missing value count (`train.isnull().sum()`), then keep only those where the count is greater than zero.

This is slightly more involved, but the output is cleaner because it lists only the columns with missing values.

In [None]:
train.isnull().sum()[train.isnull().sum() > 0]

In [None]:
train.isna().sum()[train.isna().sum() > 0]
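
One possible variation stores the counts in a variable so the missing-value check runs only once, and adds the share of missing values per column. This is a sketch on made-up data; `toy`, `missing_counts`, `missing_only`, and `missing_pct` are illustrative names, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Count missing values once, then filter the stored result
missing_counts = toy.isna().sum()
missing_only = missing_counts[missing_counts > 0]

# The mean of a boolean mask is the fraction of True values, i.e. the share missing
missing_pct = (toy.isna().mean() * 100).round(1)

print(missing_only)
print(missing_pct[missing_pct > 0])
```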

## Enter Variables with Missing Data  

Once you have identified the variables with missing values, manually enter them using the input function:  

```python
missing_vars = input("Enter the column names with missing values, separated by commas: ")  
print(f"You entered: {missing_vars}")  
```

This step records which columns you identified as having missing data before moving on to handling them.  

From the above output, we can see that `Age`, `Cabin`, and `Embarked` contain missing values. You would then enter:  

```
Age, Cabin, Embarked
```
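
If you want to double-check what you typed, the comma-separated string can be split into a list and compared against the column names. The helper below is a hypothetical sketch (`parse_column_list` is not part of the assignment), and it uses a hard-coded string and a shortened column list in place of `input()` and `train.columns`.

```python
def parse_column_list(raw, columns):
    """Split a comma-separated string and flag names that are not real columns."""
    # Strip surrounding whitespace and drop empty entries (e.g. a trailing comma)
    names = [name.strip() for name in raw.split(",") if name.strip()]
    # Matching is case-sensitive, just like DataFrame column lookup
    unknown = [name for name in names if name not in columns]
    return names, unknown

# Assumed stand-ins for train.columns and for the text typed at the prompt
columns = ["PassengerId", "Survived", "Age", "Cabin", "Embarked"]
names, unknown = parse_column_list("Age, Cabin, Embarked", columns)

print(names)
print(unknown)
```

An empty `unknown` list means every name you entered matches a column exactly.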

Now that we’ve identified missing values, we can decide how to handle them in the subsequent steps.

In [None]:
missing_values = input("Enter the variables with missing values: ")
print(f"Variables missing data: {missing_values}")