[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%208%20Notebooks/GDAN%205400%20-%20Week%208%20Notebooks%20%28VI%29%20-%20Task%206%20-%20Automated%20Data%20Report.ipynb)

This notebook provides a mini-tutorial on generating an automated data report for the Titanic training dataset.

In [1]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

Current date and time :  2025-02-25 15:13:01 

CPU times: user 101 µs, sys: 39 µs, total: 140 µs
Wall time: 117 µs


# Load Packages and Set Working Directory
Import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials.

In [2]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS lets you set various display options for, among other things, inspecting the data. I like to be able to see all of the columns, so I typically include these lines at the top of all my notebooks.

In [3]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_info_columns', 500)

### Read in the Titanic Training Data

In [4]:
import numpy as np
import pandas as pd

train_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Titanic/train.csv'
train = pd.read_csv(train_url)
print('# of rows in training dataset:', len(train), '\n')
train[:2]

# of rows in training dataset: 891 



Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C


# Generating an Automated Data Report  

Manually exploring each variable can be time-consuming. Instead, we can use an *automated tool* to generate a detailed report that summarizes key pieces of information. To get you set up using this tool, task 6 in the fourth coding assignment has the following requirements:

- Install and use `ydata-profiling` to create a detailed report of the dataset.  
- This report will provide insights into **missing values, distributions, correlations, and more**.  
- **Tip:** Instead of manually exploring each variable, use this **automated tool** to summarize the data in one step.  
- Save the report as an **HTML file** for easy viewing.


## Why Use an Automated Data Report?  
A single automated report summarizes:
- **Missing values** – Identify which columns need data cleaning.  
- **Distributions** – Understand the spread of numeric variables.  
- **Correlations** – Detect relationships between variables.  
- **Duplicates & Outliers** – Spot potential issues in the data.  

For this, we will use the **`ydata-profiling`** package (previously known as `pandas-profiling`).  
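
For comparison, the checks listed above can also be done by hand in pandas, one call at a time. The sketch below is a minimal illustration rather than what the package does internally; the `toy` DataFrame and its values are made up for the example and stand in for `train`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, 38.0, np.nan, 35.0, 38.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 71.28],
    "Cabin": [None, "C85", None, "C123", "C85"],
})

missing_counts = toy.isna().sum()            # missing values per column
numeric_summary = toy.describe()             # count, mean, std, quartiles for numeric columns
correlations = toy.corr(numeric_only=True)   # pairwise Pearson correlations (numeric columns only)
n_duplicates = int(toy.duplicated().sum())   # rows that exactly repeat an earlier row

print(missing_counts)
print("Duplicate rows:", n_duplicates)
```

Each of these calls answers one question; the automated report bundles the same kinds of summaries, plus plots, into a single file.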

---

## Step 1: Install `ydata-profiling`  
If you haven’t installed it yet, run:  

```python
!pip install ydata-profiling
```

---

## Step 2: Import Required Libraries  
```python
import pandas as pd  
from ydata_profiling import ProfileReport  
```

---

## Step 3: Generate the Data Report  
Now, create an automated data report:  

```python
profile = ProfileReport(train, explorative=True)  
profile.to_file("titanic_data_report.html")  
```

---

## Step 4: Open and View the Report  
The report is saved as **`titanic_data_report.html`** in your working directory.  
- Open it in a web browser to explore **missing values, distributions, correlations, and more** in an interactive format.  

---

## Key Benefits of `ydata-profiling`
- Saves time by automating data exploration.  
- Provides a **visual summary** of missing values, correlations, and distributions.  
- Helps in **identifying outliers** and **understanding variable relationships** before modeling.  

Once we have a detailed overview of our dataset, we can move on to further data preprocessing and model building!

In [None]:
# Install ydata-profiling
!pip install ydata_profiling --quiet

In [43]:
#Import the package
from ydata_profiling import ProfileReport


In [None]:
# Generate the report
profile = ProfileReport(train,title="Titanic")

In [55]:
# Save the report to an HTML file
profile.to_file("titanic.html")

### Quick Start Guide: How to Use the Data Report  

Once the report has been generated, follow these steps to open and interpret it:

### Open the Report  
- Locate the **`titanic.html`** file in your working directory.  
- Double-click to open it in your web browser.  
- If you are working in a hosted environment (Google Colab or a remote Jupyter server), **download the file** and open it locally.
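
When running Python locally, the report can also be opened from code instead of double-clicking. The helpers below are a small sketch using only the standard library; the function names `report_uri` and `open_report` are hypothetical, and the default file name assumes the report was saved as `titanic.html` as above.

```python
from pathlib import Path
import webbrowser


def report_uri(path="titanic.html"):
    """Return a file:// URI for the report, resolved against the working directory."""
    return Path(path).resolve().as_uri()


def open_report(path="titanic.html"):
    """Open the report in the default browser; return False if the file is missing."""
    report = Path(path)
    if not report.exists():
        return False
    return webbrowser.open(report_uri(report))
```

Calling `open_report()` after the report has been saved should launch it in your default browser. This only works on a local machine; in Google Colab, `files.download("titanic.html")` from the `google.colab` module is one way to fetch the file first.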

### What to Look For  

#### 🔹 **Overview Section**  
- Provides a **summary** of the dataset, including the number of rows, columns, and missing values.

#### 🔹 **Missing Values**  
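- Check the **count** and **matrix** plots to see which columns have missing data; the **heatmap** shows whether missing values in one column tend to coincide with missing values in another.  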
- Look at the **percent missing** for each column to decide whether to fill or drop values.
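
The percent-missing figures can be reproduced in pandas and turned into a first-pass fill-or-drop list. This is a minimal sketch on made-up data: the `toy` DataFrame and the 50% cutoff (`DROP_THRESHOLD`) are assumptions for illustration, not a rule from the assignment.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few Titanic columns (hypothetical values)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "S"],
})

# Share of missing values per column, as a percentage, largest first
pct_missing = toy.isna().mean().mul(100).sort_values(ascending=False)
print(pct_missing)

DROP_THRESHOLD = 50.0  # hypothetical cutoff; choose one that suits your analysis

# Columns that are mostly empty are candidates for dropping
to_drop = pct_missing[pct_missing > DROP_THRESHOLD].index.tolist()
# Columns with some, but not too many, gaps are candidates for filling
to_fill = pct_missing[(pct_missing > 0) & (pct_missing <= DROP_THRESHOLD)].index.tolist()

print("Consider dropping:", to_drop)
print("Consider filling:", to_fill)
```

A single threshold is only a starting point; a mostly empty column can still be informative if the fact that a value is missing carries meaning.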

#### 🔹 **Variable Descriptions**  
- Shows the distribution of numeric variables (`Age`, `Fare`, etc.).  
- Categorical variables (`Pclass`, `Sex`, `Embarked`) are displayed with frequency counts.

#### 🔹 **Correlations**  
- The **correlation matrix** helps identify relationships between variables.  
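- A strong positive or negative correlation with `Survived` suggests a variable may be a useful predictor, while strong correlations among the predictors themselves can signal redundancy.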
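
To rank variables by their correlation with the target outside the report, something like the following sketch may help. The `toy` values are invented for illustration; with the real data you would use `train` instead. Keep in mind that Pearson correlation captures only linear association, so it is a rough screen rather than a measure of feature importance.

```python
import pandas as pd

# Toy stand-in for the Titanic data (hypothetical values)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 3],
    "Fare":     [7.25, 71.28, 13.00, 8.05, 53.10, 7.90],
})

# Correlation of every numeric column with the target, sorted from most negative to most positive
corr_with_target = (
    toy.corr(numeric_only=True)["Survived"]
    .drop("Survived")       # the target's correlation with itself is always 1
    .sort_values()
)
print(corr_with_target)
```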

#### 🔹 **Warnings & Outliers**  
- The report highlights potential **data issues**, such as highly skewed distributions or duplicate rows.  
- Helps you spot extreme values in `Fare` and `Age`.
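
One common way to flag extreme values by hand is the 1.5 × IQR rule. The sketch below applies it to a handful of made-up fares; the rule and the multiplier are a conventional choice assumed here, not necessarily the criterion the report itself uses.

```python
import pandas as pd

# Hypothetical fares, for illustration only
fares = pd.Series([7.25, 7.90, 8.05, 13.00, 26.00, 512.33], name="Fare")

# Interquartile range: distance between the 25th and 75th percentiles
q1, q3 = fares.quantile([0.25, 0.75])
iqr = q3 - q1

# Values more than 1.5 * IQR beyond the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]
print(outliers)
```

Flagged values are not necessarily errors; a very high fare may be a genuine first-class ticket, so inspect them before removing or capping anything.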

### Next Steps  
- Use the report to **identify missing values** and decide on filling strategies.  
- Check for **outliers** that may need to be handled.  
- Note which variables might be useful for predictive modeling.

This automated report gives you a fast, interactive way to explore your dataset and make data-driven preprocessing decisions.