<a href="https://colab.research.google.com/github/brendanpshea/database_sql/blob/main/Database_11_Data_Storytelling_with_Zombies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Telling a Story (About Zombies) with Your Data
### Brendan Shea, PhD

In this chapter, we delve into the heart of data analysis: the art and science of interpreting your data and communicating your findings effectively. Your journey through this course has equipped you with the fundamental skills to gather, organize, and manipulate data. Now, we turn our attention towards the real-world challenge of extracting meaningful insights from our datasets and presenting these insights in a manner that allows others to understand and act upon them.

We will begin by exploring how to clean and structure data to facilitate downstream analysis. This often overlooked step is crucial in ensuring the integrity of your analyses and, consequently, the reliability of your findings.

Next, we turn to the essential task of visualizing data. We'll delve into various techniques for presenting different types of data, including geographical and temporal data. We will also discuss the development of interactive dashboards, a powerful tool for real-time data monitoring and decision-making.

We will then transition into the realm of descriptive and inferential statistics. Descriptive statistics help us summarize and understand the characteristics of our dataset. Inferential statistics, on the other hand, enable us to make predictions and inferences about a population based on a sample.

Finally, we will address the all-important task of report writing. Here we will discuss how to structure a report, how to communicate complex information clearly and succinctly, and how to ensure your report is both engaging and informative.

Throughout this chapter, we will be mindful of potential biases and confounding factors that could impact our analyses. We will also discuss the importance of data privacy and ethics, particularly when handling sensitive information.



**Brendan's Note:** This chapter will demonstrate some more "advanced" topics in data analysis. The goal here is not to memorize each and every function, but rather to get a sense of "how these things work" at a high level. (So, don't worry if

## Case Study: Zombies!

Now, let's bring these concepts to life with a case study. You are a junior analyst at the Centers for Disease Control and Prevention (CDC). Suddenly, an outbreak of a mysterious illness begins to spread rapidly across the country. The symptoms appear zombie-like, and public fear is rising. Your task is to analyze the data coming in from across the nation, make sense of it, and communicate your findings to various stakeholders. This high-stakes scenario will give you a taste of what data analysts face in real-world crises.

As we proceed through this chapter, we will apply the concepts we learn to this unfolding crisis, helping the CDC understand and combat the spread of this terrifying outbreak. Let's get started!

## Extract, Transform, Load (ETL)

Extract, Transform, Load (ETL) is a critical process used in data handling, particularly in data warehousing. It's a three-step procedure:

1.  **Extract:** Data is extracted from various sources, which could be databases, Excel files, web pages, or even text files. The nature of these sources often means the data is in different formats and structures.

2.  **Transform:** This is the process of converting the extracted data into a form that can be analyzed more effectively. Transformations may include cleaning (removing errors or inconsistencies), filtering, splitting or merging fields, converting data types, or creating new calculated fields.

3.  **Load:** The final step is to load the transformed data into a final target database or data warehouse, where it can be accessed and analyzed. This data store is often designed differently from operational databases, optimized for analysis rather than transactional processing.

As a junior analyst at the CDC tasked with responding to a Zombie outbreak, you'll be dealing with data from many sources. You might have case reports coming in from hospitals in various formats, laboratory test results coming from different lab systems, demographic data from census databases, and even social media posts or news reports. Each of these data sources will have its own structure and quirks.

The ETL process allows you to consolidate all this diverse data into a consistent format in a single location. This makes it much easier to analyze the data, spot trends and patterns, and generate reports. For example, you might need to generate a daily report of new Zombie cases, or analyze case data to identify risk factors for severe symptoms. ETL is the process that enables these activities, ensuring that the data you're working with is accurate, consistent, and up-to-date.

### Extract

The first step in the ETL process is to extract the data from its source. In this case, our data is stored in a CSV file. We'll use the Python library `pandas` to read this CSV file and load the data into a **pandas** DataFrame.

Let's start by loading our data.

In [2]:
import pandas as pd

# download our data
!wget https://github.com/brendanpshea/database_sql/raw/main/data/zombie_outbreak.csv -q -N
# Load the data from the CSV file
df = pd.read_csv('zombie_outbreak.csv')
# Display the first few rows of the dataframe
df.head(10)

Unnamed: 0,case_id,report_date,location,symptom_severity,case_status,age
0,0,2028-02-01,MN,Mild,Deceased,73
1,1,2028-02-01,MN,Mild,Infected,29
2,2,2028-02-01,MN,Mild,Recovered,74
3,3,2028-02-01,MN,Mild,Infected,62
4,4,2028-02-01,MN,Mild,Infected,65
5,5,2028-02-01,MN,Moderate,Recovered,47
6,6,2028-02-01,MN,Severe,Recovered,-1
7,7,2028-02-01,MN,Moderate,Recovered,57
8,8,2028-02-01,MN,Mild,Recovered,11
9,9,2028-02-01,MN,Severe,Infected,-1


Our data consists of several fields:

1.  `case_id`: A unique identifier for each case.
2.  `report_date`: The date when the case was reported.
3.  `location`: The location where the case was reported.
4.  `symptom_severity`: The severity of the symptoms.
5.  `case_status`: The status of the case (e.g., Infected, Recovered).
6.  `age`: The age of the individual.
    - It looks like -1 is used to code "age not known."
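
Before deciding how to handle the `-1` code, it helps to know how often it occurs. The following is a minimal sketch on a small hand-made sample (the `sample` DataFrame and its values are hypothetical and only mimic the real columns):

```python
import pandas as pd

# Hypothetical sample that mimics the structure of the outbreak data
sample = pd.DataFrame({
    "case_id": [0, 1, 2, 3],
    "age": [73, -1, 11, -1],
})

# Boolean mask: True wherever age holds the -1 placeholder
age_unknown = sample["age"] == -1

n_unknown = int(age_unknown.sum())   # number of rows with unknown age
share_unknown = age_unknown.mean()   # fraction of rows with unknown age
print(n_unknown, share_unknown)
```

The same two lines should work on the full `df`, assuming its `age` column uses `-1` in the same way.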

Now, let's see what sorts of values we have in the (non-numeric) columns:

In [10]:
print(df.location.unique(), "\n",
df.symptom_severity.unique(), "\n",
df.case_status.unique(), "\n",
)

['MN' 'WI' 'IA' 'SD' 'ND' 'NE' 'IL' 'MI' 'IN' 'OH'] 
 ['Mild' 'Moderate' 'Severe'] 
 ['Deceased' 'Infected' 'Recovered' 'Unknown'] 



This code block is using the `print` function in Python, which simply displays the output of the code inside its parentheses. It is printing three different outputs here, separated by newline characters (`"\n"`), which create a break or new line in the output.

Let's break this down further:

1.  `df.location.unique()`: The expression `df.location` selects the 'location' column of the DataFrame (`df`) as a pandas Series, and the Series method `unique` returns its distinct values. In other words, it shows all the distinct values that exist in the 'location' column of the data. This is useful when you want to understand the diversity of your dataset, for instance to see how many different locations are represented in your data. Here, the data is focused on Midwestern states.

2.  `df.symptom_severity.unique()`: Similar to the above, this statement is finding all the unique elements in the 'symptom_severity' column of the DataFrame. This could be useful for understanding what different severity levels exist in your dataset.

3.  `df.case_status.unique()`: Again, this statement is finding all unique elements in the 'case_status' column of the DataFrame.

In data cleaning and the ETL (Extract, Transform, Load) process, this kind of code is often used in the exploratory phase. Before you can clean or transform your data, you need to understand what's in it. This code helps by showing what unique values exist in these particular columns, and can assist in identifying inconsistencies, typos, or outliers in the data. For example, we see that some values of `case_status` are "Unknown".
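
Knowing which values exist is a start; knowing how often each one appears is usually the next question. As a small sketch (the `status` Series below is made-up sample data, not the real dataset), `value_counts` gives the frequency of each distinct value:

```python
import pandas as pd

# Hypothetical sample of case statuses
status = pd.Series(["Infected", "Recovered", "Unknown", "Infected", "Deceased"])

# Count how often each distinct value appears (most frequent first)
counts = status.value_counts()
print(counts)
```

A rare value with a count of 1 or 2 in a large dataset is often a typo or an inconsistent spelling worth investigating.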

### Transform

Just as an unlucky bite from a Zombie can change a person into a Zombie, transformation involves cleaning and preparing the data for the database. The exact transformations will depend on the specific data and the needs of the database.

One of the most common issues when working with data is dealing with missing or unknown values. These can be represented in many ways, including special symbols, placeholder text, or specific numbers. In our Zombie outbreak dataset, "Unknown" values in the `case_status` column and -1 values in the `age` column represent missing or unknown data.

Why might we want to replace these placeholders with NULLs?

1.  Accuracy: The placeholder values might be mistaken for actual data. For example, -1 is a valid number, so if someone didn't know it was being used as a placeholder, they might include it in numerical calculations, which would give inaccurate results.

2.  Compatibility: Some systems and databases have built-in support for handling NULLs. For example, SQL has several functions that specifically deal with NULL values.

3.  Standardization: NULL is a widely recognized representation for missing or unknown data. Using NULL instead of various different placeholders helps make your data more consistent and understandable.
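
To make the accuracy point concrete, here is a short illustration using a handful of invented ages (the numbers are hypothetical). It compares the mean computed with the `-1` placeholder left in against the mean computed after marking those entries as missing:

```python
import numpy as np
import pandas as pd

ages = pd.Series([73, 29, -1, 62, -1])   # hypothetical ages; -1 means "unknown"

# Treating -1 as a real age drags the mean down
naive_mean = ages.mean()

# Marking the placeholder as missing lets pandas skip it automatically
clean_mean = ages.replace(-1, np.nan).mean()

print(naive_mean, clean_mean)
```

With these five values the naive mean is 32.4, while the mean of the three known ages is about 54.7, so the placeholder changes the answer substantially.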

Let's replace these placeholder values with NULLs in our data. We'll also convert the `report_date` column to a datetime data type, which is more appropriate for date data.

In [12]:
# Replace "Unknown" and -1 with NULL (note the capital "U", matching the values in the data)
df['case_status'] = df['case_status'].replace('Unknown', None)
df['age'] = df['age'].replace(-1, None)

# Convert report_date to datetime
df['report_date'] = pd.to_datetime(df['report_date'])

# Check the data types and the first few rows of the dataframe
df.dtypes, df.head(10)


(case_id                      int64
 report_date         datetime64[ns]
 location                    object
 symptom_severity            object
 case_status                 object
 age                         object
 dtype: object,
    case_id report_date location symptom_severity case_status   age
 0        0  2028-02-01       MN             Mild    Deceased    73
 1        1  2028-02-01       MN             Mild    Infected    29
 2        2  2028-02-01       MN             Mild   Recovered    74
 3        3  2028-02-01       MN             Mild    Infected    62
 4        4  2028-02-01       MN             Mild    Infected    65
 5        5  2028-02-01       MN         Moderate   Recovered    47
 6        6  2028-02-01       MN           Severe   Recovered  None
 7        7  2028-02-01       MN         Moderate   Recovered    57
 8        8  2028-02-01       MN             Mild   Recovered    11
 9        9  2028-02-01       MN           Severe    Infected  None)

We have successfully replaced the placeholders with NULLs and converted the `report_date` column to datetime format. Our data is now clean and ready to be loaded into a database.

It's important to note that the transformations needed will depend on the specific data you're working with. For example, you might need to handle different types of missing value indicators, convert other data types, or perform more complex cleaning tasks.
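
One example of a further transformation: in the output above, the `age` column ends up with the generic `object` dtype rather than a numeric one. A possible follow-up, sketched here on a hypothetical sample rather than the real data, is to convert such a column to pandas' nullable integer type `Int64`, which keeps whole-number ages while still allowing missing values:

```python
import pandas as pd

# Hypothetical object-typed column like the one produced by replacing -1 with None
age = pd.Series([73, 29, None, 57], dtype="object")

# Coerce to numbers (None becomes NaN), then to the nullable integer dtype
age_int = pd.to_numeric(age, errors="coerce").astype("Int64")

print(age_int.dtype)    # the nullable integer dtype
print(age_int.mean())   # the missing value is skipped
```

Whether this step is worthwhile depends on what the downstream database or analysis expects; it is shown only as one option.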

### Load

The final step in the ETL process is loading the transformed data into the destination system. In our case, we're loading the data into a SQLite database, which is a lightweight disk-based database that doesn't require a separate server process.

We'll use the `sqlite3` library in Python to load our data into a SQLite database. Let's create a new SQLite database and load our data into a table in this database. We'll name the table "zombie_outbreak".

Note that if the specified database does not exist, `sqlite3` will automatically create it. If the table already exists, it will be replaced with our new data. Let's proceed with loading the data.

In [14]:
import sqlite3

# Create a connection to the SQLite database
# If the database does not exist, it will be created
conn = sqlite3.connect('zombie_outbreak.db')

# Write the data to a SQLite table
df.to_sql('zombie_outbreak', conn, if_exists='replace', index=False)

# Close the connection to the database
conn.close()

The data has now been loaded into a SQLite database in a table named "zombie_outbreak".

To verify that the data has been loaded correctly, we can fetch and display the first few rows from the "zombie_outbreak" table in the SQLite database. Let's do that now.

In [15]:
# Connect to the database we just created using the %sql magic.
# The pinned SQLAlchemy version is needed to avoid problems with more recent versions in Colab.
!pip install SQLAlchemy==1.3.24 -q

%load_ext sql
%sql sqlite:///zombie_outbreak.db

%sql SELECT * FROM zombie_outbreak LIMIT 10;

 * sqlite:///zombie_outbreak.db
Done.


case_id,report_date,location,symptom_severity,case_status,age
0,2028-02-01 00:00:00,MN,Mild,Deceased,73
1,2028-02-01 00:00:00,MN,Mild,Infected,29
2,2028-02-01 00:00:00,MN,Mild,Recovered,74
3,2028-02-01 00:00:00,MN,Mild,Infected,62
4,2028-02-01 00:00:00,MN,Mild,Infected,65


As we can see, the data has been successfully loaded into the SQLite database. The queried rows from the SQLite database match the first few rows of our original data.
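
Outside a notebook, the same kind of check can be done in plain Python. The sketch below uses a throwaway in-memory database and a hypothetical three-row table (the table name `cases` and its values are invented for illustration). It also shows how SQL aggregate functions treat NULLs: `COUNT(*)` counts every row, while `COUNT(age)` and `AVG(age)` skip rows where `age` is NULL.

```python
import sqlite3
import pandas as pd

# Hypothetical three-row table; the None becomes a SQL NULL
cases = pd.DataFrame({"case_id": [0, 1, 2], "age": [73, None, 29]})

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cases.to_sql("cases", conn, index=False)

# COUNT(*) counts rows; COUNT(age) and AVG(age) ignore NULLs
check = pd.read_sql_query(
    "SELECT COUNT(*) AS n_rows, COUNT(age) AS n_ages, AVG(age) AS mean_age FROM cases",
    conn,
)
conn.close()
print(check)
```

Comparing `n_rows` with `len(df)` is a quick way to confirm that no rows were lost during loading.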

To summarize, we've successfully conducted an ETL process:

1.  Extracted data from a CSV file using `pandas`.
2.  Transformed the data by replacing placeholder values with NULLs and converting `report_date` to a datetime type.
3.  Loaded the data into a SQLite database using `sqlite3`.

This data is now ready for analysis. As a CDC analyst, you can now query this database to find patterns, generate reports, or input this data into a machine learning model to make predictions.
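
As a taste of what such a query might look like, here is a sketch that counts cases per state. It runs against a small in-memory table with invented rows (only the table and column names are borrowed from our dataset), so the numbers are illustrative only:

```python
import sqlite3
import pandas as pd

# Hypothetical rows using the same column names as the real table
cases = pd.DataFrame({
    "location": ["MN", "MN", "WI", "IA", "WI", "MN"],
    "symptom_severity": ["Mild", "Severe", "Mild", "Moderate", "Severe", "Mild"],
})

conn = sqlite3.connect(":memory:")
cases.to_sql("zombie_outbreak", conn, index=False)

# Count cases per state, largest first (ties broken alphabetically)
per_state = pd.read_sql_query(
    """
    SELECT location, COUNT(*) AS n_cases
    FROM zombie_outbreak
    GROUP BY location
    ORDER BY n_cases DESC, location
    """,
    conn,
)
conn.close()
print(per_state)
```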

Remember, the ETL process is a crucial step in ensuring that your data is accurate, consistent, and ready for analysis. By following this guide, you should be able to apply these ETL concepts to your own data tasks.

## Introduction to Pandas

Pandas is a powerful data manipulation library in Python. It provides flexible and efficient data structures that make data manipulation and analysis easy.

Pandas is built on top of NumPy, the core Python library for numerical arrays and mathematical operations, and it integrates with Matplotlib for data visualization. Pandas allows you to do complex data manipulation with simple, one-line commands.

One of the key data structures in Pandas is the DataFrame, which is a two-dimensional labeled data structure with columns that can be of different types (like integers, strings, and datetimes).

Let's take a deeper look at some of the Pandas functionalities we've used in our ETL process:

### `pd.read_csv()`

This is a function in pandas that reads a CSV file and converts it into a DataFrame. It has various options allowing you to, for example, specify the delimiter, handle missing values, skip rows, etc.

In our ETL process, we used this function to load our Zombie outbreak data from a CSV file:


```
df = pd.read_csv('zombie_outbreak.csv')
```
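
Some of the options mentioned above can do part of the "Transform" work at read time. The following sketch is one possible approach, not what we did in this chapter: it reads a tiny made-up CSV from a string (via `io.StringIO`, so no file is needed) and uses the `na_values` and `parse_dates` parameters to mark placeholders as missing and parse dates in one step.

```python
import io
import pandas as pd

# Hypothetical CSV content with the same kinds of placeholders as our data
csv_text = """case_id,report_date,case_status,age
0,2028-02-01,Deceased,73
1,2028-02-01,Unknown,29
2,2028-02-02,Recovered,-1
"""

df_small = pd.read_csv(
    io.StringIO(csv_text),
    # Per-column placeholders to treat as missing
    na_values={"age": ["-1"], "case_status": ["Unknown"]},
    # Parse this column as datetimes instead of strings
    parse_dates=["report_date"],
)
print(df_small.dtypes)
```

Doing the replacement as a separate, explicit step (as in the ETL walkthrough) is easier to follow; handling it in `read_csv` is more compact once you know which placeholders to expect.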

### `df.head()`

This is a DataFrame method that returns the first `n` rows of the DataFrame. This is useful to get a glimpse of the data after loading it. By default, it returns the first 5 rows; passing a number, as in `df.head(10)`, returns that many rows instead.

We used this function several times to display the first few rows of our data:

```
df.head()
```

### `df.replace()`

This method replaces one set of values with another. It exists on both DataFrames and Series; we called it on individual columns (Series) to replace the "Unknown" values and -1 values with NULLs in our data:

```
df['case_status'] = df['case_status'].replace('Unknown', None)
df['age'] = df['age'].replace(-1, None)
```
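
When several columns each have their own placeholder, `DataFrame.replace` also accepts a nested dictionary of the form `{column: {old_value: new_value}}`. The sketch below applies this to a hypothetical three-row DataFrame and uses `np.nan` as the missing-value marker:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one placeholder in each column
df_small = pd.DataFrame({
    "case_status": ["Infected", "Unknown", "Recovered"],
    "age": [29, 62, -1],
})

# Nested dict: {column: {old_value: new_value}}
cleaned = df_small.replace({
    "case_status": {"Unknown": np.nan},
    "age": {-1: np.nan},
})
print(cleaned)
```

Note that `replace` returns a new DataFrame here; `df_small` itself is left unchanged.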

### `pd.to_datetime()`

This function converts a series of string representations of dates and times to a series of datetime objects. We used this function to convert the `report_date` column to datetime:

```
df['report_date'] = pd.to_datetime(df['report_date'])
```
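
Real-world date columns often contain a few malformed entries. As a hedged sketch on invented strings, the `errors="coerce"` option turns anything that cannot be parsed into `NaT` (the datetime equivalent of a missing value) instead of raising an error, and the `.dt` accessor then exposes parts of each date:

```python
import pandas as pd

# Hypothetical raw strings, one of which is not a valid date
raw = pd.Series(["2028-02-01", "2028-03-15", "not a date"])

# Unparseable entries become NaT rather than stopping the conversion
dates = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Extract the month number from each parsed date
months = dates.dt.month
print(dates)
print(months)
```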

### `df.dtypes`

This is an attribute (not a function) that returns the data types of each column in the DataFrame. This is useful to check if the data types are what we expect. For example, we want `report_date` to be a datetime, `location` to be a string, `case_status` to be a categorical variable, and `age` to be an integer.

```
df.dtypes
```
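
If a column's dtype is not what you want, `astype` can convert it. For instance, a column with a small, fixed set of labels can be stored as pandas' `category` dtype. This is a sketch on a hypothetical two-column sample, not a step we performed above:

```python
import pandas as pd

# Hypothetical sample with a text column that has few distinct values
df_small = pd.DataFrame({
    "location": ["MN", "WI", "MN"],
    "case_status": ["Infected", "Recovered", "Infected"],
})

# Convert the text column to a categorical column
df_small["case_status"] = df_small["case_status"].astype("category")

print(df_small.dtypes)
categories = list(df_small["case_status"].cat.categories)
print(categories)
```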

### `df.to_sql()`

This function writes records stored in a DataFrame to a SQL database. We used this function to load our data into a SQLite database:

```
df.to_sql('zombie_outbreak', conn, if_exists='replace', index=False)
```

Here, `conn` is a connection object to our SQLite database, `'zombie_outbreak'` is the name of the table we want to create in the database, `if_exists='replace'` means that if the table already exists, it will be replaced with our new data, and `index=False` means that the index of the DataFrame will not be included as a column in the table.
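
The `if_exists` parameter accepts three values: `'fail'` (the default), `'replace'`, and `'append'`. The sketch below demonstrates all three against an in-memory database; the table name `cases`, the sample rows, and the `row_count` helper are hypothetical and exist only for this illustration.

```python
import sqlite3
import pandas as pd

batch = pd.DataFrame({"case_id": [0, 1], "location": ["MN", "WI"]})
conn = sqlite3.connect(":memory:")

def row_count(connection):
    """Hypothetical helper: number of rows currently in the cases table."""
    return connection.execute("SELECT COUNT(*) FROM cases").fetchone()[0]

batch.to_sql("cases", conn, index=False)                       # creates the table
batch.to_sql("cases", conn, if_exists="append", index=False)   # adds two more rows
after_append = row_count(conn)

batch.to_sql("cases", conn, if_exists="replace", index=False)  # drops and recreates
after_replace = row_count(conn)

try:
    # 'fail' refuses to touch a table that already exists
    batch.to_sql("cases", conn, if_exists="fail", index=False)
    failed = False
except ValueError:
    failed = True

conn.close()
print(after_append, after_replace, failed)
```

For a daily feed of new cases, `'append'` is likely the more natural choice, while `'replace'` suits a full reload like the one in this chapter.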

## Appendix: A Script to Generate a Random Zombie Outbreak
You don't need to run this! I just included it here to show how I generated the data set we're working with. Feel free to play with it to see what happens! (Right now, this is *not* a realistic pandemic data set).

In [None]:
NUM_CASES = 5000

import pandas as pd
import numpy as np
from random import choices, randint
from datetime import datetime, timedelta
import math

# List of states to simulate spreading of the outbreak
states = ['MN', 'WI', 'IA', 'SD', 'ND', 'NE', 'IL', 'MI', 'IN', 'OH']

# Define the severity levels and case status
severity_levels = ['Mild', 'Moderate', 'Severe']
case_status = ['Infected', 'Recovered', 'Deceased', 'Unknown']

# Function to generate case_status based on age and severity
def generate_case_status(age, severity):
    if severity == 'Mild':
        return choices(case_status, weights=[70, 29, 1, 0])[0] if age != -1 else choices(case_status, weights=[0, 0, 0, 100])[0]
    elif severity == 'Moderate':
        return choices(case_status, weights=[40, 59, 1, 0])[0] if age != -1 else choices(case_status, weights=[0, 0, 0, 100])[0]
    else: # 'Severe'
        return choices(case_status, weights=[20, 29, 51, 0])[0] if age > 50 else choices(case_status, weights=[40, 59, 1, 0])[0]

# Create dataframe
data = {'case_id': [], 'report_date': [], 'location': [], 'symptom_severity': [], 'case_status': [], 'age': []}

start_date = datetime(2028, 2, 1)
end_date = datetime(2028, 8, 1)
num_days = (end_date - start_date).days

# Exponential growth parameters
a = 1  # initial number of cases
r = np.log(2) / 30  # rate, set to double every month

# Generate t values (one for each case), evenly spaced between 0 and the total number of days
t_values = np.linspace(0, num_days, num=NUM_CASES)

# Evaluate the exponential growth curve at each t, rounded to the nearest integer
# (note: case_counts is not used further below)
case_counts = np.rint(a * np.exp(r * t_values)).astype(int)

# Generate one report date per case (t_values is evenly spaced, so dates are spread uniformly)
dates = [start_date + timedelta(days=int(t)) for t in t_values]

for i in range(NUM_CASES):
    # Get the date
    date = dates[i]

    # generate location based on the date (more recent dates have more states)
    location = np.random.choice(states[:max(1, (date-start_date).days//(num_days//len(states)))])

    # generate age from a normal distribution (mean 50, sd 20), floored at 0
    age = int(np.random.normal(loc=50, scale=20))
    if age < 0: age = 0
    if np.random.rand() < 0.1:  # 10% chance of age being unknown
        age = -1

    # generate symptom severity (skewed toward 'Severe' when age is unknown)
    symptom_severity = choices(severity_levels, weights=[70, 20, 10] if age != -1 else [10, 20, 70])[0]

    # generate case_status based on symptom_severity and age
    case_status_generated = generate_case_status(age, symptom_severity)

    # append generated data to the dictionary
    data['case_id'].append(i)
    data['report_date'].append(date.strftime("%Y-%m-%d"))
    data['location'].append(location)
    data['symptom_severity'].append(symptom_severity)
    data['case_status'].append(case_status_generated)
    data['age'].append(age)

# Create DataFrame
df = pd.DataFrame(data)

# Save DataFrame to CSV
df.to_csv('zombie_outbreak.csv', index=False)
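
The script above spaces its `t_values` evenly with `np.linspace`, so report dates end up spread uniformly over the period. If you want to experiment with dates that actually follow the exponential curve, one possible approach (a sketch under my own assumptions, not part of the original script) is to sample day offsets with probabilities proportional to `exp(r * t)`:

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seeded so the result is reproducible
num_days = 182                         # days between 2028-02-01 and 2028-08-01
r = np.log(2) / 30                     # same doubling rate as the script

# Weight each day by the exponential curve, then normalize to probabilities
days = np.arange(num_days)
weights = np.exp(r * days)
weights = weights / weights.sum()

# Draw one day offset per case; later days are drawn more often
sampled_days = rng.choice(days, size=5000, p=weights)

first_month = int((sampled_days < 30).sum())
last_month = int((sampled_days >= num_days - 30).sum())
print(first_month, last_month)
```

Each sampled offset could then be turned into a date with `start_date + timedelta(days=int(offset))`, as the script already does for `t_values`.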
