<a href="https://colab.research.google.com/github/brendanpshea/database_sql/blob/main/Database_11_Data_Storytelling_with_Zombies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Telling a Story with Your Data
### Brendan Shea, PhD

In this chapter, we delve into the heart of data analysis: the art and science of interpreting your data and communicating your findings effectively. Your journey through this course has equipped you with the fundamental skills to gather, organize, and manipulate data. Now, we turn our attention towards the real-world challenge of extracting meaningful insights from our datasets and presenting these insights in a manner that allows others to understand and act upon them.

We will begin by exploring how to clean and structure data to facilitate downstream analysis. This often-overlooked step is crucial to the integrity of your analyses and, consequently, to the reliability of your findings.

Next, we turn to the essential task of visualizing data. We'll cover techniques for presenting different types of data, including geographical and temporal data. We will also discuss interactive dashboards, a powerful tool for real-time data monitoring and decision-making.

We will then transition into the realm of descriptive and inferential statistics. Descriptive statistics help us summarize and understand the characteristics of our dataset. Inferential statistics, on the other hand, enable us to make predictions and inferences about a population based on a sample.
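
To make the distinction concrete, here is a minimal sketch, assuming a simulated sample of 200 case ages (the values are generated on the spot and are not the chapter's data set). The first part summarizes the sample itself; the second uses a normal approximation to estimate a range for the mean age of the wider population.

```python
import numpy as np

# Hypothetical sample: simulated ages for 200 reported cases (not real data)
rng = np.random.default_rng(seed=0)
ages = np.clip(rng.normal(loc=50, scale=20, size=200), 0, None)

# Descriptive statistics: summarize the sample we actually have
mean_age = ages.mean()
median_age = np.median(ages)
std_age = ages.std(ddof=1)  # sample standard deviation

# Inferential statistics: approximate 95% confidence interval for the
# population mean, using the normal approximation (z = 1.96)
standard_error = std_age / np.sqrt(len(ages))
ci_low = mean_age - 1.96 * standard_error
ci_high = mean_age + 1.96 * standard_error

print(f"mean={mean_age:.1f}, median={median_age:.1f}, sd={std_age:.1f}")
print(f"approximate 95% CI for the population mean: ({ci_low:.1f}, {ci_high:.1f})")
```

The interval is only as trustworthy as the sample: if the reported cases are not representative of everyone infected, the inference will be biased no matter how the arithmetic is done.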

Finally, we will address the all-important task of report writing. Here we will discuss how to structure a report, how to communicate complex information clearly and succinctly, and how to ensure your report is both engaging and informative.

Throughout this chapter, we will be mindful of potential biases and confounding factors that could impact our analyses. We will also discuss the importance of data privacy and ethics, particularly when handling sensitive information.


## Case Study: Zombies!

Now, let's bring these concepts to life with a case study. You are a junior analyst at the Centers for Disease Control and Prevention (CDC). Suddenly, an outbreak of a mysterious illness begins to spread rapidly across the country. The symptoms appear zombie-like, and public fear is rising. Your task is to analyze the data coming in from across the nation, make sense of it, and communicate your findings to various stakeholders. This high-stakes scenario will give you a taste of what data analysts face in real-world crises.

As we proceed through this chapter, we will apply the concepts we learn to this unfolding crisis, helping the CDC understand and combat the spread of this terrifying outbreak. Let's get started!

In [None]:
# First, we create a database and connect to it
!pip install SQLAlchemy==1.3.24 -q # Needed to avoid problems with more recent versions in Colab

%load_ext sql
%sql sqlite:///zombies.db
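
Once connected, the case data can be loaded into a table and queried with SQL. The sketch below shows one possible way to do this with plain `sqlite3` and pandas rather than the `%sql` magic. It assumes a table named `cases` (a hypothetical name) and uses a few hand-made rows with the same columns as the generated data set, plus an in-memory database so that it runs on its own.

```python
import sqlite3
import pandas as pd

# A few hand-made rows with the same columns as the generated data set
# (illustrative values only)
sample = pd.DataFrame({
    "case_id": [0, 1, 2, 3],
    "report_date": ["2028-02-01", "2028-02-01", "2028-03-15", "2028-03-16"],
    "location": ["MN", "MN", "WI", "MN"],
    "symptom_severity": ["Mild", "Severe", "Mild", "Moderate"],
    "case_status": ["Infected", "Deceased", "Recovered", "Infected"],
    "age": [34, 71, -1, 52],  # -1 marks an unknown age
})

# In-memory database for this sketch; the notebook itself connects to zombies.db
conn = sqlite3.connect(":memory:")

# "cases" is a hypothetical table name chosen for this example
sample.to_sql("cases", conn, index=False, if_exists="replace")

# Count cases per location, largest first
query = """
SELECT location, COUNT(*) AS num_cases
FROM cases
GROUP BY location
ORDER BY num_cases DESC;
"""
cases_by_location = pd.read_sql_query(query, conn)
print(cases_by_location)

conn.close()
```

To work with the full data set instead, the sample DataFrame could be replaced with `pd.read_csv('zombie_outbreak.csv')` and the connection pointed at `zombies.db`.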

## Appendix: A Script to Generate a Random Zombie Outbreak
You don't need to run this! I included it only to show how I generated the data set we're working with. Feel free to experiment with it and see what happens. (Right now, this is *not* a realistic pandemic data set.)

In [1]:
NUM_CASES = 5000

import pandas as pd
import numpy as np
from random import choices
from datetime import datetime, timedelta

# List of states to simulate spreading of the outbreak
states = ['MN', 'WI', 'IA', 'SD', 'ND', 'NE', 'IL', 'MI', 'IN', 'OH']

# Define the severity levels and case status
severity_levels = ['Mild', 'Moderate', 'Severe']
case_status = ['Infected', 'Recovered', 'Deceased', 'Unknown']

# Function to generate case_status based on age and severity
def generate_case_status(age, severity):
    if severity == 'Mild':
        return choices(case_status, weights=[70, 29, 1, 0])[0] if age != -1 else choices(case_status, weights=[0, 0, 0, 100])[0]
    elif severity == 'Moderate':
        return choices(case_status, weights=[40, 59, 1, 0])[0] if age != -1 else choices(case_status, weights=[0, 0, 0, 100])[0]
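    else: # 'Severe'
        # Note: an unknown age (-1) fails the age > 50 test, so it uses the second set of weights and is never 'Unknown'
        return choices(case_status, weights=[20, 29, 51, 0])[0] if age > 50 else choices(case_status, weights=[40, 59, 1, 0])[0]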

# Create dataframe
data = {'case_id': [], 'report_date': [], 'location': [], 'symptom_severity': [], 'case_status': [], 'age': []}

start_date = datetime(2028, 2, 1)
end_date = datetime(2028, 8, 1)
num_days = (end_date - start_date).days

# Exponential growth parameters
a = 1  # initial number of cases
r = np.log(2) / 30  # rate, set to double every month

# Generate t values (one for each case), evenly spaced between 0 and the total number of days
t_values = np.linspace(0, num_days, num=NUM_CASES)

# Evaluate the exponential growth curve a * exp(r * t) at each t value, rounded to the nearest integer
# (note: case_counts is not used anywhere below)
case_counts = np.rint(a * np.exp(r * t_values)).astype(int)

# Convert each t value to a calendar date; because the t values are evenly spaced,
# cases end up spread roughly uniformly over the date range
dates = [start_date + timedelta(days=int(t)) for t in t_values]

for i in range(NUM_CASES):
    # Get the date
    date = dates[i]

    # generate location based on the date (more recent dates have more states)
    location = np.random.choice(states[:max(1, (date-start_date).days//(num_days//len(states)))])

    # generate age from a normal distribution (mean 50, sd 20), floored at 0
    age = int(np.random.normal(loc=50, scale=20))
    age = max(age, 0)
    if np.random.rand() < 0.1:  # 10% chance of age being unknown
        age = -1

    # generate symptom severity (mostly Mild when age is known, mostly Severe when age is unknown)
    symptom_severity = choices(severity_levels, weights=[70, 20, 10] if age != -1 else [10, 20, 70])[0]

    # generate case_status based on symptom_severity and age
    case_status_generated = generate_case_status(age, symptom_severity)

    # append generated data to the dictionary
    data['case_id'].append(i)
    data['report_date'].append(date.strftime("%Y-%m-%d"))
    data['location'].append(location)
    data['symptom_severity'].append(symptom_severity)
    data['case_status'].append(case_status_generated)
    data['age'].append(age)

# Create DataFrame
df = pd.DataFrame(data)

# Save DataFrame to CSV
df.to_csv('zombie_outbreak.csv', index=False)
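
One reason the data set is not realistic: the script above spaces its `t_values` evenly with `np.linspace`, so each day receives roughly the same number of cases even though a growth rate `r` is defined. If you want daily counts that actually double every 30 days, one possible approach (a sketch, not the method used to build the chapter's data) is inverse-transform sampling. For a density proportional to $e^{rt}$ on $[0, T]$, the cumulative distribution is $F(t) = (e^{rt} - 1)/(e^{rT} - 1)$, so a uniform draw $u$ maps to $t = \ln\left(1 + u\,(e^{rT} - 1)\right)/r$.

```python
import numpy as np
from datetime import datetime, timedelta

NUM_CASES = 5000
start_date = datetime(2028, 2, 1)
end_date = datetime(2028, 8, 1)
num_days = (end_date - start_date).days

r = np.log(2) / 30  # growth rate: cases double every 30 days

# Inverse-transform sampling: map uniform draws u in [0, 1) to times t in [0, num_days)
# whose density is proportional to exp(r * t)
rng = np.random.default_rng(seed=42)
u = rng.random(NUM_CASES)
t_values = np.log1p(u * np.expm1(r * num_days)) / r
t_values.sort()  # keep cases in chronological order

# Convert each sampled time to a calendar date
dates = [start_date + timedelta(days=int(t)) for t in t_values]

# Compare the first and last 30 days of the outbreak
first_month = int(np.sum(t_values < 30))
last_month = int(np.sum(t_values >= num_days - 30))
print(f"cases in first 30 days: {first_month}, in last 30 days: {last_month}")
```

With this change, the final 30 days should hold many times more cases than the first 30. The resulting `dates` list could be substituted for the one in the script, leaving the rest of the generation loop as it is.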
