# Impact of COVID-19 - CDC Data Analysis

A comprehensive analysis of the pandemic using CDC health datasets.

## Scope of the Project

Analyze COVID-19 cases and deaths across demographic groups in the United States using CDC datasets. The analysis covers age group, race/ethnicity, and sex to show how the pandemic affected different populations.

## Key Questions

1. How does COVID-19 impact different age groups in terms of cases and deaths?
2. What are the disparities in COVID-19 cases and deaths across different racial/ethnic groups?
3. How do COVID-19 cases and deaths differ by sex?
4. What patterns can we identify in the demographic distribution of COVID-19 impact?

## Importing Required Libraries

In [2]:
import pandas as pd
from IPython.display import display
import os

# set maximum number of rows and columns to display
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)

# set console width to a larger value
pd.set_option('display.width', 1000)

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


## Load CDC COVID-19 Datasets

In [3]:
# Define base path for datasets
base_path = 'Datasets/Impact of Covid-19 in different countries/CDC'

# Import CDC datasets with relative paths
try:
    cdc_cases_by_age = pd.read_csv(os.path.join(base_path, 'Cases by Age Group/cases_by_age_group.csv'))
    cdc_cases_by_race = pd.read_csv(os.path.join(base_path, 'Cases by Race:Ethnicity/cases_by_race_ethnicity__all_age_groups.csv'))
    cdc_cases_by_sex = pd.read_csv(os.path.join(base_path, 'Cases by Sex/cases_by_sex__all_age_groups.csv'))
    cdc_deaths_by_age = pd.read_csv(os.path.join(base_path, 'Deaths by Age Group/deaths_by_age_group.csv'))
    cdc_deaths_by_race = pd.read_csv(os.path.join(base_path, 'Deaths by Race:Ethnicity/deaths_by_race_ethnicity__all_age_groups.csv'))
    cdc_deaths_by_sex = pd.read_csv(os.path.join(base_path, 'Deaths by Sex/deaths_by_sex__all_age_groups.csv'))
    
    print("✅ All CDC datasets loaded successfully!")
except Exception as e:
    print(f"❌ Error loading datasets: {e}")
    print("Please ensure the datasets are in the correct directory structure.")
    raise  # stop here: every later cell depends on all six DataFrames

✅ All CDC datasets loaded successfully!


## Organize Datasets for Analysis

In [4]:
# CDC Datasets and Titles
cdc_data = [cdc_cases_by_age, cdc_cases_by_race, cdc_cases_by_sex, cdc_deaths_by_age, cdc_deaths_by_race, cdc_deaths_by_sex]
cdc_titles = ['Cases by Age Group', 'Cases by Race/Ethnicity', 'Cases by Sex', 'Deaths by Age Group', 'Deaths by Race/Ethnicity', 'Deaths by Sex']

print(f"Number of CDC datasets loaded: {len(cdc_data)}")
print("\nDatasets available for analysis:")
for i, title in enumerate(cdc_titles):
    print(f"{i+1}. {title}")

Number of CDC datasets loaded: 6

Datasets available for analysis:
1. Cases by Age Group
2. Cases by Race/Ethnicity
3. Cases by Sex
4. Deaths by Age Group
5. Deaths by Race/Ethnicity
6. Deaths by Sex


## Dataset Overview - Shape and Structure

In [5]:
# Display shape of each dataset
print("Dataset Shapes (rows, columns):")
print("=" * 40)
for i, dataset in enumerate(cdc_data):
    print(f"{cdc_titles[i]}: {dataset.shape}")

Dataset Shapes (rows, columns):
Cases by Age Group: (11, 4)
Cases by Race/Ethnicity: (7, 4)
Cases by Sex: (3, 4)
Deaths by Age Group: (11, 4)
Deaths by Race/Ethnicity: (7, 4)
Deaths by Sex: (3, 4)


## Detailed Dataset Information

In [6]:
# Display detailed information for each dataset
for i, dataset in enumerate(cdc_data):
    print(f"\n{'='*60}")
    print(f"{cdc_titles[i]}")
    print(f"{'='*60}")
    
    print("\nDataset Info:")
    print(f"Shape: {dataset.shape}")
    print(f"Columns: {list(dataset.columns)}")
    
    print("\nData Types:")
    display(dataset.dtypes)
    
    print("\nFirst few rows:")
    display(dataset.head())
    
    print("\nSummary Statistics:")
    display(dataset.describe(include='all'))


Cases by Age Group

Dataset Info:
Shape: (11, 4)
Columns: ['Age Group', 'Percent of cases', 'Count of cases', 'Percent of US population']

Data Types:


Age Group                    object
Percent of cases            float64
Count of cases                int64
Percent of US population    float64
dtype: object


First few rows:


Unnamed: 0,Age Group,Percent of cases,Count of cases,Percent of US population
0,0-4 Years,3.7,3583384,6.0
1,5-11 Years,6.4,6272066,8.7
2,12-15 Years,4.4,4345110,5.1
3,16-17 Years,2.6,2533285,2.5
4,18-29 Years,20.2,19740037,16.4



Summary Statistics:


Unnamed: 0,Age Group,Percent of cases,Count of cases,Percent of US population
count,11,11.0,11.0,11.0
unique,11,,,
top,0-4 Years,,,
freq,1,,,
mean,,9.090909,8883138.0,9.109091
std,,6.87742,6723165.0,5.679164
min,,2.0,1937125.0,2.0
25%,,3.85,3747124.0,5.0
50%,,6.4,6272066.0,8.7
75%,,15.3,14960070.0,12.9



Cases by Race/Ethnicity

Dataset Info:
Shape: (7, 4)
Columns: ['Race/Ethnicity', 'Percent of cases', 'Count of cases', 'Percent of US population']

Data Types:


Race/Ethnicity               object
Percent of cases            float64
Count of cases                int64
Percent of US population    float64
dtype: object


First few rows:


Unnamed: 0,Race/Ethnicity,Percent of cases,Count of cases,Percent of US population
0,Hispanic/Latino,24.3,15979927,18.45
1,American Indian / Alaska Native Non-Hispanic,1.0,691508,0.74
2,Asian Non-Hispanic,4.4,2874309,5.76
3,Black Non-Hispanic,12.4,8183888,12.54
4,Native Hawaiian / Other Pacific Islander Non-H...,0.3,180853,0.182



Summary Statistics:


Unnamed: 0,Race/Ethnicity,Percent of cases,Count of cases,Percent of US population
count,7,7.0,7.0,7.0
unique,7,,,
top,Hispanic/Latino,,,
freq,1,,,
mean,,14.285714,9405397.0,14.286
std,,19.343425,12729860.0,21.295243
min,,0.3,180853.0,0.182
25%,,2.4,1602767.0,1.48
50%,,4.4,2874309.0,5.76
75%,,18.35,12081910.0,15.495



Cases by Sex

Dataset Info:
Shape: (3, 4)
Columns: ['Sex', 'Percent of cases', 'Count of cases', 'Percent of US population']

Data Types:


Sex                          object
Percent of cases             object
Count of cases                int64
Percent of US population    float64
dtype: object


First few rows:


Unnamed: 0,Sex,Percent of cases,Count of cases,Percent of US population
0,Female,53.8,52322303,50.75
1,Male,46.1,44849180,49.25
2,Other,<0.1,4102,



Summary Statistics:


Unnamed: 0,Sex,Percent of cases,Count of cases,Percent of US population
count,3,3.0,3.0,2.0
unique,3,3.0,,
top,Female,53.8,,
freq,1,1.0,,
mean,,,32391860.0,50.0
std,,,28296420.0,1.06066
min,,,4102.0,49.25
25%,,,22426640.0,49.625
50%,,,44849180.0,50.0
75%,,,48585740.0,50.375



Deaths by Age Group

Dataset Info:
Shape: (11, 4)
Columns: ['Age Group', 'Percentage of deaths', 'Count of deaths', 'Percent of US population']

Data Types:


Age Group                    object
Percentage of deaths         object
Count of deaths               int64
Percent of US population    float64
dtype: object


First few rows:


Unnamed: 0,Age Group,Percentage of deaths,Count of deaths,Percent of US population
0,0-4 Years,0.1,761,6.0
1,5-11 Years,0.1,547,8.7
2,12-15 Years,0.1,519,5.1
3,16-17 Years,<0.1,365,2.5
4,18-29 Years,0.7,6946,16.4



Summary Statistics:


Unnamed: 0,Age Group,Percentage of deaths,Count of deaths,Percent of US population
count,11,11.0,11.0,11.0
unique,11,9.0,,
top,0-4 Years,0.1,,
freq,1,3.0,,
mean,,,87838.909091,9.109091
std,,,112156.325318,5.679164
min,,,365.0,2.0
25%,,,654.0,5.0
50%,,,16963.0,8.7
75%,,,191887.0,12.9



Deaths by Race/Ethnicity

Dataset Info:
Shape: (7, 4)
Columns: ['Race/Ethnicity', 'Percent of deaths', 'Count of deaths', 'Percent of US population']

Data Types:


Race/Ethnicity               object
Percent of deaths           float64
Count of deaths               int64
Percent of US population    float64
dtype: object


First few rows:


Unnamed: 0,Race/Ethnicity,Percent of deaths,Count of deaths,Percent of US population
0,Hispanic/Latino,16.8,139082,18.45
1,American Indian / Alaska Native Non-Hispanic,1.1,8794,0.74
2,Asian Non-Hispanic,3.3,27013,5.76
3,Black Non-Hispanic,12.7,104915,12.54
4,Native Hawaiian / Other Pacific Islander Non-H...,0.2,1930,0.182



Summary Statistics:


Unnamed: 0,Race/Ethnicity,Percent of deaths,Count of deaths,Percent of US population
count,7,7.0,7.0,7.0
unique,7,,,
top,Hispanic/Latino,,,
freq,1,,,
mean,,14.3,118225.857143,14.286
std,,22.773742,188474.991656,21.295243
min,,0.2,1930.0,0.182
25%,,1.6,12971.0,1.48
50%,,3.3,27013.0,5.76
75%,,14.75,121998.5,15.495



Deaths by Sex

Dataset Info:
Shape: (3, 4)
Columns: ['Sex', 'Percentage of deaths', 'Count of deaths', 'Percent of US population']

Data Types:


Sex                          object
Percentage of deaths         object
Count of deaths              object
Percent of US population    float64
dtype: object


First few rows:


Unnamed: 0,Sex,Percentage of deaths,Count of deaths,Percent of US population
0,Female,45.3,436120 - 436129,50.75
1,Male,54.7,527490 - 527499,49.25
2,Other,<0.1,0 - 9,



Summary Statistics:


Unnamed: 0,Sex,Percentage of deaths,Count of deaths,Percent of US population
count,3,3.0,3,2.0
unique,3,3.0,3,
top,Female,45.3,436120 - 436129,
freq,1,1.0,1,
mean,,,,50.0
std,,,,1.06066
min,,,,49.25
25%,,,,49.625
50%,,,,50.0
75%,,,,50.375


## Data Quality Check - Missing Values

In [7]:
# Check for missing values in each dataset
print("Missing Values Analysis:")
print("=" * 60)

for i, dataset in enumerate(cdc_data):
    print(f"\n{cdc_titles[i]}:")
    missing_values = dataset.isnull().sum()
    if missing_values.sum() == 0:
        print("✅ No missing values found")
    else:
        print("⚠️  Missing values detected:")
        print(missing_values[missing_values > 0])

Missing Values Analysis:

Cases by Age Group:
✅ No missing values found

Cases by Race/Ethnicity:
✅ No missing values found

Cases by Sex:
⚠️  Missing values detected:
Percent of US population    1
dtype: int64

Deaths by Age Group:
✅ No missing values found

Deaths by Race/Ethnicity:
✅ No missing values found

Deaths by Sex:
⚠️  Missing values detected:
Percent of US population    1
dtype: int64
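
The null check above counts only NaN, so string placeholders such as "<0.1" and count ranges such as "0 - 9" (both visible in the earlier previews) pass it undetected. The following is a minimal sketch of a complementary check. The helper name `find_non_numeric` and its `skip` parameter are illustrative assumptions, and the sample frame is typed in by hand from the Deaths by Sex preview rather than loaded from the CSV.

```python
import pandas as pd

def find_non_numeric(df, skip=()):
    """Return {column: [values]} for non-numeric columns holding text that is not a number.

    Hypothetical helper; `skip` lists label columns that are expected to be text.
    """
    report = {}
    for col in df.columns:
        if col in skip or pd.api.types.is_numeric_dtype(df[col]):
            continue
        values = df[col].dropna().astype(str)
        # Anything that fails numeric conversion becomes NaN and is reported
        bad = values[pd.to_numeric(values, errors="coerce").isna()]
        if not bad.empty:
            report[col] = bad.tolist()
    return report

# Rows copied by hand from the Deaths by Sex preview (assumption: no other rows matter here)
sample = pd.DataFrame({
    "Sex": ["Female", "Male", "Other"],
    "Percentage of deaths": ["45.3", "54.7", "<0.1"],
    "Count of deaths": ["436120 - 436129", "527490 - 527499", "0 - 9"],
    "Percent of US population": [50.75, 49.25, None],
})

issues = find_non_numeric(sample, skip=("Sex",))
for column, values in issues.items():
    print(f"{column}: {values}")
```

Running a check like this alongside `isnull()` would flag the columns that load as `object` before any totals are computed from them.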


## Key Statistics Summary

In [8]:
# Display key statistics for each dataset
print("CDC COVID-19 Data Summary")
print("=" * 60)

for i, dataset in enumerate(cdc_data):
    print(f"\n{cdc_titles[i]}:")
    print(f"  • Number of records: {len(dataset)}")
    print(f"  • Number of columns: {len(dataset.columns)}")
    
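    # Sum every numeric column whose name mentions counts, cases, or deaths
    # (this also matches percentage columns, whose totals should be close to 100)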
    count_cols = [col for col in dataset.columns if 'count' in col.lower() or 'cases' in col.lower() or 'deaths' in col.lower()]
    if count_cols:
        for col in count_cols:
            if dataset[col].dtype in ['int64', 'float64']:
                total = dataset[col].sum()
                if pd.notna(total):
                    print(f"  • Total {col}: {total:,}")

CDC COVID-19 Data Summary

Cases by Age Group:
  • Number of records: 11
  • Number of columns: 4
  • Total Percent of cases: 100.0
  • Total Count of cases: 97,714,517

Cases by Race/Ethnicity:
  • Number of records: 7
  • Number of columns: 4
  • Total Percent of cases: 99.99999999999999
  • Total Count of cases: 65,837,776

Cases by Sex:
  • Number of records: 3
  • Number of columns: 4
  • Total Count of cases: 97,175,585

Deaths by Age Group:
  • Number of records: 11
  • Number of columns: 4
  • Total Count of deaths: 966,228

Deaths by Race/Ethnicity:
  • Number of records: 7
  • Number of columns: 4
  • Total Percent of deaths: 100.1
  • Total Count of deaths: 827,581

Deaths by Sex:
  • Number of records: 3
  • Number of columns: 4
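
The totals above differ between breakdowns: the race/ethnicity tables sum to fewer cases and deaths than the age tables. The sketch below quantifies that gap using the totals printed above. Treating the age-group total as the reference is an assumption made for illustration, and the function name `coverage` is hypothetical. One plausible reading, not confirmed by the source data, is that records without a reported race/ethnicity or sex are left out of those breakdowns.

```python
# Totals copied from the summary output above
totals = {
    "cases": {"by_age": 97_714_517, "by_race": 65_837_776, "by_sex": 97_175_585},
    "deaths": {"by_age": 966_228, "by_race": 827_581},
}

def coverage(totals_for_metric, reference="by_age"):
    """Percent of the reference total captured by each breakdown (hypothetical helper)."""
    ref = totals_for_metric[reference]
    return {name: round(100 * value / ref, 1) for name, value in totals_for_metric.items()}

for metric, breakdowns in totals.items():
    print(metric, coverage(breakdowns))
```

If the gap does come from unreported demographics, the race/ethnicity percentages describe only the records with known race/ethnicity, which is worth keeping in mind when comparing them with population shares.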


## Demographic Analysis - Cases by Age Group

In [9]:
print("COVID-19 Cases by Age Group Analysis")
print("=" * 50)

# Display the full cases by age dataset
display(cdc_cases_by_age)

# Find age groups with highest case counts
if 'Count of cases' in cdc_cases_by_age.columns:
    top_age_groups = cdc_cases_by_age.nlargest(3, 'Count of cases')
    print("\nTop 3 Age Groups by Case Count:")
    for _, row in top_age_groups.iterrows():
        print(f"  • {row['Age Group']}: {row['Count of cases']:,} cases")

COVID-19 Cases by Age Group Analysis


Unnamed: 0,Age Group,Percent of cases,Count of cases,Percent of US population
0,0-4 Years,3.7,3583384,6.0
1,5-11 Years,6.4,6272066,8.7
2,12-15 Years,4.4,4345110,5.1
3,16-17 Years,2.6,2533285,2.5
4,18-29 Years,20.2,19740037,16.4
5,30-39 Years,16.5,16154939,13.5
6,40-49 Years,14.1,13765198,12.3
7,50-64 Years,18.6,18164908,19.2
8,65-74 Years,7.5,7307601,9.6
9,75-84 Years,4.0,3910864,4.9



Top 3 Age Groups by Case Count:
  • 18-29 Years: 19,740,037 cases
  • 50-64 Years: 18,164,908 cases
  • 30-39 Years: 16,154,939 cases


## Demographic Analysis - Deaths by Age Group

In [10]:
print("COVID-19 Deaths by Age Group Analysis")
print("=" * 50)

# Display the full deaths by age dataset
display(cdc_deaths_by_age)

# Find age groups with highest death counts
if 'Count of deaths' in cdc_deaths_by_age.columns:
    top_death_age_groups = cdc_deaths_by_age.nlargest(3, 'Count of deaths')
    print("\nTop 3 Age Groups by Death Count:")
    for _, row in top_death_age_groups.iterrows():
        print(f"  • {row['Age Group']}: {row['Count of deaths']:,} deaths")

COVID-19 Deaths by Age Group Analysis


Unnamed: 0,Age Group,Percentage of deaths,Count of deaths,Percent of US population
0,0-4 Years,0.1,761,6.0
1,5-11 Years,0.1,547,8.7
2,12-15 Years,0.1,519,5.1
3,16-17 Years,<0.1,365,2.5
4,18-29 Years,0.7,6946,16.4
5,30-39 Years,1.8,16963,13.5
6,40-49 Years,4,38500,12.3
7,50-64 Years,17.5,169040,19.2
8,65-74 Years,22.2,214734,9.6
9,75-84 Years,26.2,252917,4.9



Top 3 Age Groups by Death Count:
  • 85+ Years: 264,936 deaths
  • 75-84 Years: 252,917 deaths
  • 65-74 Years: 214,734 deaths
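
Cases peak among younger adults while deaths concentrate in the oldest groups, and dividing one table by the other makes that contrast explicit. The sketch below computes deaths per 1,000 reported cases from the rows displayed above. The counts are typed in by hand, the 85+ row is omitted because its case count is not shown in the displayed table, and the column name "Deaths per 1,000 cases" is introduced here for illustration. This is a crude ratio of two aggregate tables, not a formal case-fatality rate.

```python
import pandas as pd

# Counts copied from the displayed Cases by Age Group and Deaths by Age Group rows
by_age = pd.DataFrame({
    "Age Group": ["0-4 Years", "5-11 Years", "12-15 Years", "16-17 Years", "18-29 Years",
                  "30-39 Years", "40-49 Years", "50-64 Years", "65-74 Years", "75-84 Years"],
    "Count of cases": [3583384, 6272066, 4345110, 2533285, 19740037,
                       16154939, 13765198, 18164908, 7307601, 3910864],
    "Count of deaths": [761, 547, 519, 365, 6946,
                        16963, 38500, 169040, 214734, 252917],
})

# Crude severity indicator: reported deaths per 1,000 reported cases
by_age["Deaths per 1,000 cases"] = (
    1000 * by_age["Count of deaths"] / by_age["Count of cases"]
).round(2)

print(by_age.to_string(index=False))
```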


## Demographic Analysis - Cases and Deaths by Race/Ethnicity

In [11]:
print("COVID-19 Cases by Race/Ethnicity")
print("=" * 50)
display(cdc_cases_by_race)

print("\n" + "=" * 50)
print("COVID-19 Deaths by Race/Ethnicity")
print("=" * 50)
display(cdc_deaths_by_race)

COVID-19 Cases by Race/Ethnicity


Unnamed: 0,Race/Ethnicity,Percent of cases,Count of cases,Percent of US population
0,Hispanic/Latino,24.3,15979927,18.45
1,American Indian / Alaska Native Non-Hispanic,1.0,691508,0.74
2,Asian Non-Hispanic,4.4,2874309,5.76
3,Black Non-Hispanic,12.4,8183888,12.54
4,Native Hawaiian / Other Pacific Islander Non-H...,0.3,180853,0.182
5,White Non-Hispanic,53.8,35413265,60.11
6,Multiple/Other Non-Hispanic,3.8,2514026,2.22



COVID-19 Deaths by Race/Ethnicity


Unnamed: 0,Race/Ethnicity,Percent of deaths,Count of deaths,Percent of US population
0,Hispanic/Latino,16.8,139082,18.45
1,American Indian / Alaska Native Non-Hispanic,1.1,8794,0.74
2,Asian Non-Hispanic,3.3,27013,5.76
3,Black Non-Hispanic,12.7,104915,12.54
4,Native Hawaiian / Other Pacific Islander Non-H...,0.2,1930,0.182
5,White Non-Hispanic,63.9,528699,60.11
6,Multiple/Other Non-Hispanic,2.1,17148,2.22
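
One way to read the two tables above is to divide each group's share of cases or deaths by its share of the US population: a ratio above 1 means the group is over-represented. The sketch below does this with the percentages shown above, typed in by hand (the truncated Native Hawaiian label is kept exactly as displayed). The column names "Case ratio" and "Death ratio" are introduced here for illustration. The inputs are rounded percentages and are not age-adjusted, so ratios for small groups are only approximate.

```python
import pandas as pd

# Percentages copied from the displayed race/ethnicity tables
race = pd.DataFrame({
    "Race/Ethnicity": [
        "Hispanic/Latino",
        "American Indian / Alaska Native Non-Hispanic",
        "Asian Non-Hispanic",
        "Black Non-Hispanic",
        "Native Hawaiian / Other Pacific Islander Non-H...",
        "White Non-Hispanic",
        "Multiple/Other Non-Hispanic",
    ],
    "Percent of cases": [24.3, 1.0, 4.4, 12.4, 0.3, 53.8, 3.8],
    "Percent of deaths": [16.8, 1.1, 3.3, 12.7, 0.2, 63.9, 2.1],
    "Percent of US population": [18.45, 0.74, 5.76, 12.54, 0.182, 60.11, 2.22],
}).set_index("Race/Ethnicity")

# Share of cases (or deaths) relative to share of population
ratios = pd.DataFrame({
    "Case ratio": race["Percent of cases"] / race["Percent of US population"],
    "Death ratio": race["Percent of deaths"] / race["Percent of US population"],
}).round(2)

print(ratios.sort_values("Death ratio", ascending=False))
```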


## Demographic Analysis - Cases and Deaths by Sex

In [12]:
print("COVID-19 Cases by Sex")
print("=" * 30)
display(cdc_cases_by_sex)

print("\n" + "=" * 30)
print("COVID-19 Deaths by Sex")
print("=" * 30)
display(cdc_deaths_by_sex)

COVID-19 Cases by Sex


Unnamed: 0,Sex,Percent of cases,Count of cases,Percent of US population
0,Female,53.8,52322303,50.75
1,Male,46.1,44849180,49.25
2,Other,<0.1,4102,



COVID-19 Deaths by Sex


Unnamed: 0,Sex,Percentage of deaths,Count of deaths,Percent of US population
0,Female,45.3,436120 - 436129,50.75
1,Male,54.7,527490 - 527499,49.25
2,Other,<0.1,0 - 9,


## Data Quality Issues

Examining all six CDC datasets reveals the following data quality issues, which must be addressed before visualization:

⚠️ Critical issues found:

1. Inconsistent data types and special characters
   - Deaths by Sex: "Count of deaths" contains ranges such as "436120 - 436129" and "0 - 9" instead of single numbers.
   - Cases by Sex: "Percent of cases" contains the string "<0.1", so the whole column loads as object rather than numeric.
   - Deaths by Age Group and Deaths by Sex: "Percentage of deaths" contains "<0.1".

2. Missing values
   - "Percent of US population" has no value for the "Other" sex category in either sex dataset (pandas loads it as NaN).

3. Column name inconsistencies
   - The cases datasets use "Percent of cases", and Deaths by Race/Ethnicity uses "Percent of deaths".
   - Deaths by Age Group and Deaths by Sex use "Percentage of deaths".

The next cell applies the required transformations.



In [13]:
"""
CDC COVID-19 Data Cleaning and Transformation
This cell cleans all the data quality issues identified in the CDC datasets.
"""

import numpy as np

print("🧹 Starting CDC Data Cleaning Process...")
print("=" * 50)

# Function to convert range values to numeric (for Deaths by Sex dataset)
def convert_range_to_numeric(value):
    """Convert range values like '436120 - 436129' to midpoint"""
    if isinstance(value, str) and ' - ' in value:
        try:
            start, end = value.split(' - ')
            return (int(start) + int(end)) / 2
        except ValueError:
            return np.nan
    return value

# Function to handle special characters like "<0.1"
def convert_less_than(value):
    """Convert '<0.1' to 0.05 (midpoint estimate)"""
    if isinstance(value, str) and value.startswith('<'):
        try:
            return float(value[1:]) / 2
        except ValueError:
            return np.nan
    return value

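# Function to clean percentage columns
def clean_percentage_column(series):
    """Clean percentage columns that may contain '<' symbols and return floats"""
    # Entries such as '53.8' are still strings after the '<' handling,
    # so coerce the whole column to numeric
    return pd.to_numeric(series.apply(convert_less_than), errors='coerce')

# Function to clean count columns
def clean_count_column(series):
    """Clean count columns that may contain ranges and return numeric values"""
    return pd.to_numeric(series.apply(convert_range_to_numeric), errors='coerce')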

print("🔧 Cleaning Cases by Sex dataset...")
# Clean Cases by Sex
# Handle '<0.1' in Percent of cases
cdc_cases_by_sex['Percent of cases'] = clean_percentage_column(cdc_cases_by_sex['Percent of cases'])
# Replace any literal 'N/A' strings with NaN (read_csv already parses 'N/A' as NaN by default, so this is a safeguard)
cdc_cases_by_sex = cdc_cases_by_sex.replace('N/A', np.nan)
print("✅ Cases by Sex cleaned")

print("🔧 Cleaning Deaths by Sex dataset...")
# Clean Deaths by Sex (most problematic dataset)
# Handle range values in Count of deaths
cdc_deaths_by_sex['Count of deaths'] = clean_count_column(cdc_deaths_by_sex['Count of deaths'])
# Handle '<0.1' in Percentage of deaths
cdc_deaths_by_sex['Percentage of deaths'] = clean_percentage_column(cdc_deaths_by_sex['Percentage of deaths'])
# Replace any literal 'N/A' strings with NaN (read_csv already parses 'N/A' as NaN by default, so this is a safeguard)
cdc_deaths_by_sex = cdc_deaths_by_sex.replace('N/A', np.nan)
# Standardize column name
cdc_deaths_by_sex = cdc_deaths_by_sex.rename(columns={'Percentage of deaths': 'Percent of deaths'})
print("✅ Deaths by Sex cleaned")

print("🔧 Cleaning Deaths by Age dataset...")
# Clean Deaths by Age
# Handle '<0.1' in Percentage of deaths
cdc_deaths_by_age['Percentage of deaths'] = clean_percentage_column(cdc_deaths_by_age['Percentage of deaths'])
# Standardize column name
cdc_deaths_by_age = cdc_deaths_by_age.rename(columns={'Percentage of deaths': 'Percent of deaths'})
print("✅ Deaths by Age cleaned")

print("🔧 Cleaning Deaths by Race/Ethnicity dataset...")
# Clean Deaths by Race/Ethnicity
# This dataset already uses 'Percent of deaths', so the rename is a no-op kept for consistency
cdc_deaths_by_race = cdc_deaths_by_race.rename(columns={'Percentage of deaths': 'Percent of deaths'})
print("✅ Deaths by Race/Ethnicity cleaned")

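# Rebuild the dataset list: replace() and rename() returned new DataFrames,
# so the list created earlier still points at the original objects
cdc_data = [cdc_cases_by_age, cdc_cases_by_race, cdc_cases_by_sex, cdc_deaths_by_age, cdc_deaths_by_race, cdc_deaths_by_sex]

print("\n✅ Data cleaning completed!")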
print("📈 The datasets are now ready for visualization and analysis.")
print("\n💡 Cleaning notes:")
print("  - '<0.1' values converted to 0.05 (midpoint estimate)")
print("  - Range values converted to midpoint (e.g., '436120 - 436129' → 436124.5)")
print("  - 'N/A' values converted to NaN for proper handling")
print("  - Column names standardized across datasets")

🧹 Starting CDC Data Cleaning Process...
🔧 Cleaning Cases by Sex dataset...
✅ Cases by Sex cleaned
🔧 Cleaning Deaths by Sex dataset...
✅ Deaths by Sex cleaned
🔧 Cleaning Deaths by Age dataset...
✅ Deaths by Age cleaned
🔧 Cleaning Deaths by Race/Ethnicity dataset...
✅ Deaths by Race/Ethnicity cleaned

✅ Data cleaning completed!
📈 The datasets are now ready for visualization and analysis.

💡 Cleaning notes:
  - '<0.1' values converted to 0.05 (midpoint estimate)
  - Range values converted to midpoint (e.g., '436120 - 436129' → 436124.5)
  - 'N/A' values converted to NaN for proper handling
  - Column names standardized across datasets
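
The row-wise helpers in the cleaning cell can also be written with vectorized string methods, which return numeric columns directly. The following is a sketch under the same conventions stated in the cleaning notes ("<x" becomes half of x, a range becomes its midpoint). The function names `parse_percent` and `parse_count` are hypothetical, and the sample rows are typed in by hand from the Deaths by Sex table.

```python
import pandas as pd

def parse_percent(series):
    """Convert '<0.1' to 0.05 (half the bound) and other entries to float."""
    text = series.astype(str).str.strip()
    below = text.str.startswith("<")
    numbers = pd.to_numeric(text.str.lstrip("<"), errors="coerce")
    # Keep ordinary values; halve the ones that were written as '<bound'
    return numbers.where(~below, numbers / 2)

def parse_count(series):
    """Convert 'low - high' ranges to their midpoint; plain numbers pass through."""
    parts = series.astype(str).str.split("-", n=1, expand=True)
    low = pd.to_numeric(parts[0].str.strip(), errors="coerce")
    if parts.shape[1] > 1:
        high = pd.to_numeric(parts[1].str.strip(), errors="coerce").fillna(low)
    else:
        high = low
    return (low + high) / 2

# Rows copied by hand from the Deaths by Sex table
deaths_by_sex = pd.DataFrame({
    "Sex": ["Female", "Male", "Other"],
    "Percentage of deaths": ["45.3", "54.7", "<0.1"],
    "Count of deaths": ["436120 - 436129", "527490 - 527499", "0 - 9"],
})

deaths_by_sex["Percent of deaths"] = parse_percent(deaths_by_sex.pop("Percentage of deaths"))
deaths_by_sex["Count of deaths"] = parse_count(deaths_by_sex["Count of deaths"])

print(deaths_by_sex.dtypes)
print(deaths_by_sex)
```

Checking `dtypes` after cleaning is a quick way to confirm that no string values remain in the numeric columns.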


## ETL Process Completed! 📊

All six CDC COVID-19 datasets have been loaded, inspected, and cleaned.

### Next Steps:
- Add data visualizations using matplotlib, seaborn, or plotly
- Perform statistical analysis and comparisons
- Create demographic trend analysis
- Build interactive dashboards
