In [1]:
# Configure pandas display options for scrollable output
import pandas as pd
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', None)  # Auto-detect display width
pd.set_option('display.max_colwidth', None)  # Show full content of each column

# Vaccination Data Analysis and Visualization: A Step-by-Step Project Notebook

This Jupyter Notebook provides a complete, step-by-step guide to executing the Vaccination Data Analysis and Visualization project. It follows the project guidelines, covering each phase from data cleaning to database setup and final visualization in Power BI.

## Project Title: Vaccination Data Analysis and Visualization

**Skills Takeaway From This Project:**
* Python scripting
* Data Cleaning
* Exploratory Data Analysis (EDA)
* SQL
* Power BI

**Domain:**
* Public Health and Epidemiology

### Problem Statement
Analyze global vaccination data to understand trends in vaccination coverage, disease incidence, and effectiveness. Data will be cleaned and stored in a SQL database. Power BI will be used to connect to the SQL database and create interactive dashboards that provide insights on vaccination strategies and their impact on disease control.

### Business Use Cases
1.  **Public Health Strategy:**
    *   Assess the effectiveness of vaccination programs in different regions and populations.
    *   Prioritize areas with low vaccination coverage for targeted interventions.
2.  **Disease Prevention:**
    *   Identify diseases with high incidence rates despite vaccination efforts, suggesting vaccine inefficacies or areas for improvement.
    *   Support policies on booster vaccines or new vaccine introductions.
3.  **Resource Allocation:**
    *   Determine regions with low vaccination coverage and plan targeted resource distribution to improve vaccination rates.
    *   Forecast vaccine demand based on current trends for better supply chain management.
4.  **Global Health Policy:**
    *   Provide data-driven recommendations for vaccination policy formulation.
    *   Support governments and health organizations with evidence on vaccine effectiveness.

### Approach Overview
The project will be executed in the following phases, which are detailed in this notebook:
*   **Data Cleaning:** Handle missing data, normalize units, and ensure date consistency.
*   **SQL Database Setup:** Create and normalize relational SQL tables with data integrity constraints.
*   **Power BI Integration:** Connect Power BI to the SQL database for analysis and visualization.
*   **Data Visualization in Power BI:** Create interactive dashboards with heatmaps, trend lines, and KPIs.
*   **Exploratory Data Analysis (EDA):** Analyze trends, disparities, and correlations to answer key questions.

### Questions to be Answered
The analysis and visualizations should answer the following questions:

**Easy Level:**
*   How do vaccination rates correlate with a decrease in disease incidence?
*   What is the drop-off rate between 1st dose and subsequent doses?
*   Are vaccination rates different between genders? (*Note: Data not available in the provided dataset*)
*   How does education level impact vaccination rates? (*Note: Data not available*)
*   What is the urban vs. rural vaccination rate difference? (*Note: Data not available*)
*   Has the rate of booster dose uptake increased over time?
*   Is there a seasonal pattern in vaccination uptake? (*Note: Data aggregated by year, seasonality analysis not possible*)
*   How does population density relate to vaccination coverage? (*Note: Data not available*)
*   Which regions have high disease incidence despite high vaccination rates?

**Medium Level (combination of different tables):**
*   Is there a correlation between vaccine introduction and a decrease in disease cases?
*   What is the trend in disease cases before and after vaccination campaigns?
*   Which diseases have shown the most significant reduction in cases due to vaccination?
*   What percentage of the target population has been covered by each vaccine?
*   How does the vaccination schedule (e.g., booster doses) impact target population coverage?
*   Are there significant disparities in vaccine introduction timelines across WHO regions?
*   How does vaccine coverage correlate with disease reduction for specific antigens?
*   Are there specific regions or countries with low coverage despite high availability of vaccines?
*   What are the gaps in coverage for vaccines targeting high-priority diseases (e.g., TB, Hepatitis B)?
*   Are certain diseases more prevalent in specific geographic areas?

**Scenario Based:**
*   A government health agency wants to identify regions with low vaccination coverage to allocate resources effectively.
*   A public health organization wants to evaluate the effectiveness of a measles vaccination campaign launched five years ago.
*   A vaccine manufacturer wants to estimate vaccine demand for a specific disease in the upcoming year.
*   A sudden outbreak of influenza occurs in a specific region, and authorities need to ramp up vaccination efforts.
*   Researchers want to explore the incidence rates of polio in populations with no vaccination coverage.
*   WHO wants to track global progress toward achieving a target of 95% vaccination coverage for measles by 2030.
*   A health agency wants to allocate vaccines to high-risk populations such as children under five and the elderly.
*   A non-profit wants to detect disparities in vaccination coverage across different socioeconomic groups within a country.
*   Authorities want to determine how vaccination rates vary throughout the year.
*   Two regions use different vaccination strategies (e.g., door-to-door vs. centralized vaccination clinics). Authorities want to know which strategy is more effective.

### Results & Project Evaluation Metrics

**By the end of this project, learners will achieve:**
*   A structured SQL database with clean and normalized vaccination and disease data.
*   A set of Power BI reports and dashboards that visually represent key insights, trends, and comparisons.
*   Insights derived from data analysis, such as vaccination coverage trends, disease outbreaks, and regional disparities.

**Project Evaluation metrics:**
*   **Data Cleaning Process:** Evaluate the handling of missing data, normalization, and consistency checks.
*   **SQL Database Quality:** Assess the integrity, normalization, and structure of the SQL database.
*   **Quality of Power BI Visualizations:** Review the clarity and relevance of the Power BI visualizations.
*   **Insights and Actionability:** Evaluate how well the Power BI reports provide actionable insights.

### Data Set Explanation
**Source:** Vaccination project

**Table 1: Coverage Data (`coverage-data.xlsx`)**
*   **Variables:** Group, Code, Name, Year, Antigen, Antigen_description, Coverage_category, Coverage_category_description, Target number, Doses, Coverage.

**Table 2: Incidence Rate (`incidence-rate-data.xlsx`)**
*   **Variables:** Group, Code, Name, Year, Disease, Disease_description, Denominator, Incidence_rate.

**Table 3: Reported Cases (`reported-cases-data.xlsx`)**
*   **Variables:** Group, Code, Name, Year, Disease, Disease_description, Cases.

**Table 4: Vaccine Introduction (`vaccine-introduction-data.xlsx`)**
*   **Variables:** ISO_3_Code, Country Name, Who Region, Year, Description, Intro.

**Table 5: Vaccine Schedule Data (`vaccine-schedule-data.xlsx`)**
*   **Variables:** ISO_3_Code, Country Name, Who Region, Year, Vaccine code, Vaccine description, Schedule rounds, Target pop, Target pop description, Geoarea, Age administered, Source comment.


### Project Deliverables & Guidelines

**Project Deliverables:**
*   **Source Code:** Python scripts for data cleaning and SQL queries.
*   **SQL Database:** A structured database with the cleaned data.
*   **Power BI Reports:** Interactive dashboards.
*   **Documentation:** Explanation of the process, challenges, and solutions.

**Project Guidelines:**
*   Use best practices for SQL database design and normalization.
*   Follow Power BI best practices for creating interactive, user-friendly dashboards.

**Timeline:** 7 Days

--- 
## Project Execution: A Phased Walkthrough

We will now execute the project following the structured approach.

### **Step 1: Data Cleaning**
Our first step is to clean the raw data. A clean and consistent dataset is the foundation for any reliable analysis. We will use the Pandas library in Python for this task.

#### **1.1 Setup and Loading Data**
First, we import the Pandas library and load our five Excel files into Pandas DataFrames. Ensure the `.xlsx` files are in the same directory as this notebook; `pd.read_excel` also needs the `openpyxl` package installed to read `.xlsx` files.

In [2]:
# Import the pandas library
import pandas as pd

# Load the datasets from Excel files
coverage_df = pd.read_excel('coverage-data.xlsx')
incidence_df = pd.read_excel('incidence-rate-data.xlsx')
cases_df = pd.read_excel('reported-cases-data.xlsx')
introduction_df = pd.read_excel('vaccine-introduction-data.xlsx')
schedule_df = pd.read_excel('vaccine-schedule-data.xlsx')

#### **1.2 Initial Data Exploration**
Let's inspect each DataFrame to understand its structure, identify missing values, and check data types.

In [3]:
print("--- Coverage Data Info ---")
coverage_df.info()
print("\n--- Incidence Rate Data Info ---")
incidence_df.info()
print("\n--- Reported Cases Data Info ---")
cases_df.info()
print("\n--- Vaccine Introduction Data Info ---")
introduction_df.info()
print("\n--- Vaccine Schedule Data Info ---")
schedule_df.info()

--- Coverage Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 399859 entries, 0 to 399858
Data columns (total 11 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   GROUP                          399859 non-null  object 
 1   CODE                           399858 non-null  object 
 2   NAME                           398584 non-null  object 
 3   YEAR                           399858 non-null  float64
 4   ANTIGEN                        399858 non-null  object 
 5   ANTIGEN_DESCRIPTION            399858 non-null  object 
 6   COVERAGE_CATEGORY              399858 non-null  object 
 7   COVERAGE_CATEGORY_DESCRIPTION  399858 non-null  object 
 8   TARGET_NUMBER                  79030 non-null   float64
 9   DOSES                          79327 non-null   float64
 10  COVERAGE                       230477 non-null  float64
dtypes: float64(4), object(7)
memory usage: 33.6+ MB

--- Incidence Rate Data Info ---
...
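
The `.info()` listing reports non-null counts, and a per-column missing share can make the same picture easier to scan. The following is a minimal sketch: `missing_summary` is a hypothetical helper and the `demo` frame is made-up data standing in for `coverage_df`.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Toy frame standing in for coverage_df (values are made up)
demo = pd.DataFrame({
    "CODE": ["ABW", "ABW", "AFG", None],
    "COVERAGE": [88.0, None, None, None],
})
demo_summary = missing_summary(demo)
print(demo_summary)
```

Calling `missing_summary(coverage_df)` on the real data should surface the sparsely populated columns (`TARGET_NUMBER`, `DOSES`, `COVERAGE`) at the top.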

#### **1.3 Cleaning Each Dataset**
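Now we apply the cleaning logic to each DataFrame. The `.info()` output above shows, for example, that only 230,477 of the 399,859 coverage rows carry a `COVERAGE` value, so rows without the key measure are dropped, duplicates are removed, and grouping columns that are not needed downstream are discarded.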

In [4]:
# A. Cleaning the Coverage Data
print("Cleaning Coverage Data...")
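# Keep only rows that actually report a coverage value
coverage_df.dropna(subset=['COVERAGE'], inplace=True)
# Sort by category so that, among duplicates, the alphabetically first category is kept
coverage_df.sort_values('COVERAGE_CATEGORY', ascending=True, inplace=True)
coverage_df.drop_duplicates(subset=['CODE', 'YEAR', 'ANTIGEN', 'COVERAGE'], keep='first', inplace=True)
# Drop grouping columns that are not used in later steps
coverage_df.drop(columns=['GROUP', 'COVERAGE_CATEGORY', 'COVERAGE_CATEGORY_DESCRIPTION'], inplace=True)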

# B. Cleaning the Incidence Rate Data
print("Cleaning Incidence Rate Data...")
incidence_df.dropna(subset=['INCIDENCE_RATE'], inplace=True)
incidence_df['INCIDENCE_RATE'] = pd.to_numeric(incidence_df['INCIDENCE_RATE'], errors='coerce')
incidence_df.dropna(subset=['INCIDENCE_RATE'], inplace=True)
incidence_df.drop(columns=['GROUP'], inplace=True)

# C. Cleaning the Reported Cases Data
print("Cleaning Reported Cases Data...")
cases_df.dropna(subset=['CASES'], inplace=True)
cases_df['CASES'] = cases_df['CASES'].astype(int)
cases_df.drop(columns=['GROUP'], inplace=True)

# D. Cleaning the Vaccine Introduction Data
print("Cleaning Vaccine Introduction Data...")
introduction_df.rename(columns={'ISO_3_CODE': 'CODE', 'COUNTRYNAME': 'NAME'}, inplace=True)
introduction_df['INTRO'] = introduction_df['INTRO'] == 'Yes'  # True only for 'Yes'; anything else (including missing) becomes False

# E. Cleaning the Vaccine Schedule Data
print("Cleaning Vaccine Schedule Data...")
schedule_df.rename(columns={'ISO_3_CODE': 'CODE', 'COUNTRYNAME': 'NAME'}, inplace=True)
schedule_df['SCHEDULEROUNDS'] = pd.to_numeric(schedule_df['SCHEDULEROUNDS'], errors='coerce')

print("\nData cleaning complete.")

Cleaning Coverage Data...
Cleaning Incidence Rate Data...
Cleaning Reported Cases Data...
Cleaning Vaccine Introduction Data...
Cleaning Vaccine Schedule Data...

Data cleaning complete.
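
The coverage cleaning relies on a sort followed by `drop_duplicates(keep='first')`. The toy example below sketches what that combination keeps. The rows and the category labels are illustrative values, not taken from the real file. Because `COVERAGE` is part of the duplicate subset, two rows for the same country, year, and antigen survive whenever their coverage values differ.

```python
import pandas as pd

# Made-up rows: one country/year/antigen reported under several categories
demo = pd.DataFrame({
    "CODE": ["ABW", "ABW", "ABW", "ABW"],
    "YEAR": [2020, 2020, 2020, 2020],
    "ANTIGEN": ["DTPCV1"] * 4,
    "COVERAGE_CATEGORY": ["WUENIC", "ADMIN", "OFFICIAL", "WUENIC"],
    "COVERAGE": [90.0, 90.0, 90.0, 85.0],
})

deduped = (
    demo.sort_values("COVERAGE_CATEGORY", kind="stable")
        .drop_duplicates(subset=["CODE", "YEAR", "ANTIGEN", "COVERAGE"], keep="first")
)
print(deduped[["COVERAGE_CATEGORY", "COVERAGE"]])
```

Here the three rows with a coverage of 90.0 collapse to the alphabetically first category, while the row with 85.0 is kept as a second record for the same key.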


#### **1.4 Saving Cleaned Data**
Finally, we save our cleaned DataFrames to new CSV files. These files will be used for the next step: importing into our SQL database.

In [5]:
coverage_df.to_csv('cleaned_coverage_data.csv', index=False)
incidence_df.to_csv('cleaned_incidence_data.csv', index=False)
cases_df.to_csv('cleaned_cases_data.csv', index=False)
introduction_df.to_csv('cleaned_introduction_data.csv', index=False)
schedule_df.to_csv('cleaned_schedule_data.csv', index=False)

print("All cleaned data has been saved to .csv files.")

All cleaned data has been saved to .csv files.
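
The `.info()` output lists `YEAR` as `float64`, so years are written to CSV as values like `2020.0`. One possible tweak, sketched here on made-up data, is to cast the column to pandas' nullable `Int64` type before saving. This assumes every non-missing year is a whole number.

```python
import io
import pandas as pd

# Toy frame: YEAR arrives as float because one value is missing
demo = pd.DataFrame({"CODE": ["ABW", "AFG", None], "YEAR": [2020.0, 2021.0, None]})

# Nullable integer dtype keeps the missing value while dropping the ".0"
demo["YEAR"] = demo["YEAR"].astype("Int64")

buffer = io.StringIO()  # in-memory stand-in for a CSV file
demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```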


---
### **Step 2: SQL Database Setup**

With our data cleaned, we will now set up a MySQL database to store it in a structured, relational format.

#### **2.1 MySQL Setup and Schema Creation**

1.  **Install MySQL:** Download and install **MySQL Community Server** and **MySQL Workbench** from the official MySQL website.
2.  **Create Schema:** Open MySQL Workbench, connect to your local server, and create a new schema named `vaccination_db`.

#### **2.2 Table Creation**

Execute the following SQL script in MySQL Workbench to create the required tables. This script defines the structure, data types, and primary keys for data integrity.

```sql

-- Create the database if it doesn't exist
CREATE DATABASE IF NOT EXISTS vaccination_db;

-- Switch to the correct database
USE vaccination_db;

-- Table 1: Coverage Data
CREATE TABLE coverage (
    country_code VARCHAR(3) NOT NULL,
    country_name VARCHAR(255),
    record_year INT,
    antigen VARCHAR(50),
    antigen_description TEXT,
    target_number BIGINT,
    doses BIGINT,
    coverage_percentage FLOAT,
    PRIMARY KEY (country_code, record_year, antigen)
);

-- Table 2: Incidence Rate Data
CREATE TABLE incidence_rate (
    country_code VARCHAR(3) NOT NULL,
    country_name VARCHAR(255),
    record_year INT,
    disease VARCHAR(50),
    disease_description TEXT,
    denominator TEXT,
    incidence_rate FLOAT,
    PRIMARY KEY (country_code, record_year, disease)
);

-- Table 3: Reported Cases Data
CREATE TABLE reported_cases (
    country_code VARCHAR(3) NOT NULL,
    country_name VARCHAR(255),
    record_year INT,
    disease VARCHAR(50),
    disease_description TEXT,
    cases INT,
    PRIMARY KEY (country_code, record_year, disease)
);

-- Table 4: Vaccine Introduction Data
CREATE TABLE vaccine_introduction (
    country_code VARCHAR(3) NOT NULL,
    country_name VARCHAR(255),
    who_region VARCHAR(50),
    record_year INT,
    vaccine_description TEXT,
    is_introduced BOOLEAN,
    PRIMARY KEY (country_code, record_year, vaccine_description(255))
);

-- Table 5: Vaccine Schedule Data
CREATE TABLE vaccine_schedule (
    country_code VARCHAR(3) NOT NULL,
    country_name VARCHAR(255),
    who_region VARCHAR(50),
    record_year INT,
    vaccine_code VARCHAR(50),
    vaccine_description TEXT,
    schedule_rounds INT,
    target_population VARCHAR(50),
    target_population_description TEXT,
    geo_area VARCHAR(100),
    age_administered VARCHAR(100),
    source_comment TEXT
);
```
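
The composite primary keys above only hold if each key combination occurs once in the cleaned data. A quick pre-load check could look like the sketch below. `duplicate_key_rows` is a hypothetical helper and the `demo` frame is made-up data.

```python
import pandas as pd

def duplicate_key_rows(df: pd.DataFrame, key_columns: list) -> pd.DataFrame:
    """Return every row whose key combination appears more than once."""
    mask = df.duplicated(subset=key_columns, keep=False)
    return df[mask].sort_values(key_columns)

# Toy data: ABW/2020/MCV1 appears twice with different coverage values
demo = pd.DataFrame({
    "CODE": ["ABW", "ABW", "AFG"],
    "YEAR": [2020, 2020, 2020],
    "ANTIGEN": ["MCV1", "MCV1", "MCV1"],
    "COVERAGE": [90.0, 85.0, 70.0],
})
conflicts = duplicate_key_rows(demo, ["CODE", "YEAR", "ANTIGEN"])
print(conflicts)
```

An empty result for `duplicate_key_rows(coverage_df, ['CODE', 'YEAR', 'ANTIGEN'])` would indicate that the cleaned coverage data fits the `coverage` table's primary key.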

#### **2.3 Data Ingestion**
Run the Python code below to load the data from your `cleaned_*.csv` files into the newly created MySQL tables. **Remember to update the `db_password` variable with your MySQL root password.**

In [7]:
import sqlalchemy
print(sqlalchemy.__version__)

2.0.43
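
If the MySQL password contains characters such as `@`, `:` or `/`, an f-string connection URL can be parsed incorrectly. SQLAlchemy's `URL.create` builds the URL from separate fields and escapes them when rendering. The sketch below uses a made-up password and only constructs the URL; it does not connect to a server.

```python
from sqlalchemy.engine import URL

connection_url = URL.create(
    drivername="mysql+mysqlconnector",
    username="root",
    password="p@ss:word/1",   # made-up password with special characters
    host="localhost",
    database="vaccination_db",
)

# render_as_string shows the escaped form that would be used for connecting
rendered = connection_url.render_as_string(hide_password=False)
print(rendered)
```

The resulting `connection_url` object can be passed to `create_engine` in place of a hand-built string.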


In [10]:
from sqlalchemy import create_engine, text
import mysql.connector
import pandas as pd

# --- IMPORTANT: Update these connection details ---
db_user = 'root'
db_password = 'root'  # Enter your MySQL root password here
db_host = 'localhost'
db_name = 'vaccination_db'
# ------------------------------------------------

def clean_and_validate_data(df, table_name):
    """Clean and validate data before insertion"""
    # Define maximum lengths for VARCHAR columns
    max_lengths = {
        'country_code': 3,
        'country_name': 255,
        'disease': 50,
        'who_region': 50,
        'vaccine_code': 50,
        'target_population': 50,
        'geo_area': 100,
        'age_administered': 100
    }
    
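    # Truncate string columns to their VARCHAR limits; missing values stay missing
    # instead of being turned into the literal string 'nan'
    for col in df.columns:
        if col in max_lengths:
            truncated = df[col].astype(str).str.slice(0, max_lengths[col])
            df[col] = truncated.where(df[col].notna())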
    
    return df

# First, try to create the database if it doesn't exist
try:
    # Create a temporary connection without specifying a database
    temp_conn = mysql.connector.connect(
        host=db_host,
        user=db_user,
        password=db_password
    )
    cursor = temp_conn.cursor()
    
    # Create database if it doesn't exist
    cursor.execute(f"CREATE DATABASE IF NOT EXISTS {db_name}")
    print(f"Database '{db_name}' is ready.")
    
    # Close temporary connection
    cursor.close()
    temp_conn.close()
except mysql.connector.Error as err:
    print(f"Error: {err}")
    raise Exception("Failed to connect to MySQL. Please check your credentials.")

# Create SQLAlchemy engine with the database
try:
    connection_string = f"mysql+mysqlconnector://{db_user}:{db_password}@{db_host}/{db_name}"
    engine = create_engine(connection_string)
    
    # Test the connection
    with engine.connect() as connection:
        print("Successfully connected to MySQL database!")
    
    # Dictionary mapping CSV files to table names
    csv_to_table_map = {
        'cleaned_coverage_data.csv': 'coverage',
        'cleaned_incidence_data.csv': 'incidence_rate',
        'cleaned_cases_data.csv': 'reported_cases',
        'cleaned_introduction_data.csv': 'vaccine_introduction',
        'cleaned_schedule_data.csv': 'vaccine_schedule'
    }

    # Load data and push to SQL
    for csv_file, table_name in csv_to_table_map.items():
        try:
            print(f"Loading {csv_file} into {table_name} table...")
            df = pd.read_csv(csv_file)
            
            # Standardize column names to match SQL schema
            # ('WHOREGION' is assumed by analogy with 'COUNTRYNAME'; keys that do not occur are ignored)
            column_renames = {
                'CODE': 'country_code', 'NAME': 'country_name', 'YEAR': 'record_year',
                'ANTIGEN': 'antigen', 'ANTIGEN_DESCRIPTION': 'antigen_description',
                'TARGET_NUMBER': 'target_number', 'DOSES': 'doses', 'COVERAGE': 'coverage_percentage',
                'DISEASE': 'disease', 'DISEASE_DESCRIPTION': 'disease_description',
                'DENOMINATOR': 'denominator', 'INCIDENCE_RATE': 'incidence_rate', 'CASES': 'cases',
                'WHOREGION': 'who_region', 'DESCRIPTION': 'vaccine_description', 'INTRO': 'is_introduced',
                'VACCINECODE': 'vaccine_code', 'VACCINE_DESCRIPTION': 'vaccine_description',
                'SCHEDULEROUNDS': 'schedule_rounds', 'TARGETPOP': 'target_population',
                'TARGETPOP_DESCRIPTION': 'target_population_description', 'GEOAREA': 'geo_area',
                'AGEADMINISTERED': 'age_administered', 'SOURCECOMMENT': 'source_comment'
            }
            df.rename(columns=lambda c: column_renames.get(c, c), inplace=True)
            
            # Clean and validate data
            df = clean_and_validate_data(df, table_name)
            
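            # Drop the table so every run starts from an empty table. Note that to_sql
            # then recreates it with inferred column types, so the keys and column
            # types from the script in section 2.2 are not retained.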
            with engine.connect() as connection:
                connection.execute(text(f"DROP TABLE IF EXISTS {table_name}"))
                connection.commit()
            
            # Insert data in chunks to handle large datasets
            chunk_size = 1000
            total_rows = 0
            for chunk_start in range(0, len(df), chunk_size):
                chunk = df[chunk_start:chunk_start + chunk_size]
                chunk.to_sql(table_name, con=engine, if_exists='append', index=False)
                total_rows += len(chunk)
                print(f"Inserted {total_rows}/{len(df)} rows into {table_name}")
            
            print(f"Successfully loaded {csv_file} ({total_rows} rows).")
        except Exception as e:
            print(f"An error occurred with {csv_file}: {e}")
            continue  # Continue with next file even if current one fails

except Exception as e:
    print(f"Error connecting to database: {e}")

print("\nData ingestion process completed.")

Database 'vaccination_db' is ready.
Successfully connected to MySQL database!
Loading cleaned_coverage_data.csv into coverage table...
Inserted 1000/164102 rows into coverage
Inserted 2000/164102 rows into coverage
Inserted 3000/164102 rows into coverage
Inserted 4000/164102 rows into coverage
Inserted 5000/164102 rows into coverage
Inserted 6000/164102 rows into coverage
Inserted 7000/164102 rows into coverage
Inserted 8000/164102 rows into coverage
Inserted 9000/164102 rows into coverage
Inserted 10000/164102 rows into coverage
Inserted 11000/164102 rows into coverage
Inserted 12000/164102 rows into coverage
Inserted 13000/164102 rows into coverage
Inserted 14000/164102 rows into coverage
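
Whether a table keeps its primary key depends on whether `to_sql` appends to an existing table or has to create one. The sketch below illustrates the difference with an in-memory SQLite database as a stand-in for MySQL (an assumption made so the example runs without a server). `primary_key_columns` is a hypothetical helper and the rows are made up.

```python
import sqlite3
import pandas as pd

def primary_key_columns(conn, table):
    """List the primary-key columns SQLite reports for a table, in key order."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    keyed = [r for r in rows if r[5] > 0]  # r[5] is the position within the key
    return [r[1] for r in sorted(keyed, key=lambda r: r[5])]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE coverage (
        country_code TEXT NOT NULL,
        record_year INTEGER,
        antigen TEXT,
        coverage_percentage REAL,
        PRIMARY KEY (country_code, record_year, antigen)
    )
""")
df = pd.DataFrame({
    "country_code": ["ABW", "AFG"],
    "record_year": [2020, 2020],
    "antigen": ["MCV1", "MCV1"],
    "coverage_percentage": [90.0, 70.0],
})

# Appending to the pre-created table keeps its schema
df.to_sql("coverage", conn, if_exists="append", index=False)
kept = primary_key_columns(conn, "coverage")

# After a DROP, to_sql has to create the table itself and defines no key
conn.execute("DROP TABLE coverage")
df.to_sql("coverage", conn, if_exists="append", index=False)
lost = primary_key_columns(conn, "coverage")
print(kept, lost)
```

To keep a hand-written schema, one option is to empty the table (for example with `DELETE FROM`) rather than dropping it before appending.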

---
### **Step 3: Power BI Integration and Data Modeling**

Now we connect Power BI to our database and define the relationships between our tables. This is crucial for creating meaningful visualizations.

1.  **Install Prerequisite:** Download and install the **MySQL Connector/NET** driver from the official MySQL website. Power BI cannot connect without it.
2.  **Connect Power BI:**
    -   Open Power BI Desktop.
    -   Go to **Get Data > Database > MySQL database**.
    -   Enter `localhost` for the Server and `vaccination_db` for the Database.
    -   Under the **Database** tab, enter your `root` username and password.
3.  **Load Tables:** In the Navigator window, select all five tables (`coverage`, `incidence_rate`, etc.) and click **Load**.
4.  **Create Relationships (Model View):**
    -   Go to the **Model view** in Power BI.
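    -   Create relationships by dragging and dropping between key columns. Power BI relationships are defined on a single column, and only one relationship between any two tables can be active at a time. Linking the data tables directly to each other (for example `coverage.country_code` <-> `incidence_rate.country_code`) also produces many-to-many relationships, because each country appears in many rows on both sides.
    -   A more reliable layout is to add lookup (dimension) tables, here given the hypothetical names `country` (distinct `country_code` / `country_name` pairs) and `year` (distinct `record_year` values), and to relate each of the five tables to them:
        -   `country.country_code` -> the `country_code` column of `coverage`, `incidence_rate`, `reported_cases`, `vaccine_introduction`, and `vaccine_schedule`
        -   `year.record_year` -> the `record_year` column of each of those tables
    -   This gives a star-like schema in which slicers on the lookup tables filter all five tables simultaneously.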
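
Relating tables on `country_code` and `record_year` tends to work best when each key has its own lookup table with unique values. The following is a minimal sketch of deriving such tables in pandas. `build_dimension`, `dim_country`, and `dim_year` are hypothetical names, and the two small frames are made-up stand-ins for the cleaned tables.

```python
import pandas as pd

def build_dimension(frames, columns):
    """Stack the given columns from several tables and keep one row per key.

    The first entry in `columns` is treated as the key.
    """
    stacked = pd.concat([f[columns] for f in frames], ignore_index=True)
    return (
        stacked.dropna(subset=[columns[0]])
               .drop_duplicates(subset=[columns[0]])
               .sort_values(columns[0])
               .reset_index(drop=True)
    )

# Toy stand-ins for two of the cleaned tables
coverage = pd.DataFrame({
    "country_code": ["ABW", "AFG"],
    "country_name": ["Aruba", "Afghanistan"],
    "record_year": [2020, 2021],
})
incidence = pd.DataFrame({
    "country_code": ["AFG", "AGO"],
    "country_name": ["Afghanistan", "Angola"],
    "record_year": [2021, 2022],
})

dim_country = build_dimension([coverage, incidence], ["country_code", "country_name"])
dim_year = build_dimension([coverage, incidence], ["record_year"])
print(dim_country)
print(dim_year)
```

The resulting frames could be written to the database with `to_sql` like the other tables and then used as the "one" side of each relationship in the Model view.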

---
### **Step 4: Answering Questions with Visualizations in Power BI**

This section provides a detailed guide on how to answer every question from the project brief using specific visualizations in Power BI.

#### **Easy Level Questions**

**1. How do vaccination rates correlate with a decrease in disease incidence?**
- **Recommended Visualization:** Scatter Plot.
- **Data Fields:**
  - **X-Axis:** `coverage` -> `coverage_percentage` (use Average).
  - **Y-Axis:** `incidence_rate` -> `incidence_rate` (use Average).
  - **Legend/Values:** `incidence_rate` -> `disease_description`.
- **Filters:** Add a **Slicer** for `disease_description` and `country_name` to explore specific cases.
- **Interpretation:** Look for a trend where points on the right (higher coverage) are lower on the chart (lower incidence). A downward-sloping trend line would indicate a negative correlation.
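
The same relationship can be cross-checked numerically in pandas. The sketch below joins toy coverage and incidence rows on country and year and computes a Pearson correlation. The data and the country code `AAA` are made up, and with real data each antigen would first need to be paired with the disease it targets.

```python
import pandas as pd

# Made-up yearly values for one hypothetical country
coverage = pd.DataFrame({
    "country_code": ["AAA"] * 4,
    "record_year": [2018, 2019, 2020, 2021],
    "coverage_percentage": [60.0, 70.0, 80.0, 90.0],
})
incidence = pd.DataFrame({
    "country_code": ["AAA"] * 4,
    "record_year": [2018, 2019, 2020, 2021],
    "incidence_rate": [40.0, 30.0, 20.0, 10.0],
})

# Align the two measures on country and year, then correlate them
merged = coverage.merge(incidence, on=["country_code", "record_year"], how="inner")
r = merged["coverage_percentage"].corr(merged["incidence_rate"])
print(round(r, 3))
```

A value close to -1 corresponds to the downward-sloping pattern described above, and a value near 0 indicates no linear relationship.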

**2. What is the drop-off rate between 1st dose and subsequent doses?**
- **Recommended Visualization:** Clustered Bar Chart or a Table.
- **Data Fields:**
  - **X-Axis:** `coverage` -> `country_name` or `record_year`.
  - **Y-Axis:** `coverage` -> `coverage_percentage` (use Average).
  - **Legend:** `coverage` -> `antigen_description`.
- **Filters:** On the visual's filter pane, filter `antigen_description` to show only related doses (e.g., 'DTP-containing vaccine, 1st dose' and 'DTP-containing vaccine, 3rd dose').
- **Interpretation:** Compare the heights of the bars for the 1st dose versus the 3rd dose. The difference represents the drop-off.
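
As a numeric counterpart to the bar chart, the drop-off can be expressed as the relative difference between first-dose and third-dose coverage. The sketch below assumes the antigen codes `DTPCV1` and `DTPCV3` identify those doses and uses made-up values for two hypothetical countries.

```python
import pandas as pd

coverage = pd.DataFrame({
    "country_code": ["AAA", "AAA", "BBB", "BBB"],
    "record_year": [2020, 2020, 2020, 2020],
    "antigen": ["DTPCV1", "DTPCV3", "DTPCV1", "DTPCV3"],  # assumed dose codes
    "coverage_percentage": [95.0, 76.0, 80.0, 80.0],
})

# One row per country and year, one column per dose
wide = coverage.pivot_table(
    index=["country_code", "record_year"],
    columns="antigen",
    values="coverage_percentage",
)

# Share of first-dose coverage that is lost by the third dose
wide["drop_off_pct"] = (wide["DTPCV1"] - wide["DTPCV3"]) / wide["DTPCV1"] * 100
print(wide)
```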

**3. Questions on Gender, Education, Urban/Rural, etc.**
- **Note:** The provided dataset **does not contain columns** for gender, education level, or urban vs. rural populations. These questions cannot be answered without supplementary data. If such data were available, you would use a **Bar Chart** with the demographic category on the X-axis and `coverage_percentage` on the Y-axis to compare rates.

**4. Has the rate of booster dose uptake increased over time?**
- **Recommended Visualization:** Line Chart.
- **Data Fields:**
  - **X-Axis:** `coverage` -> `record_year`.
  - **Y-Axis:** `coverage` -> `coverage_percentage` (use Average).
- **Filters:** In the visual's filter pane, filter `antigen_description` to include terms like 'booster' or '4th dose'.
- **Interpretation:** An upward-trending line indicates that booster dose uptake has increased over time.

**5. Which regions have high disease incidence despite high vaccination rates?**
- **Recommended Visualization:** Scatter Plot.
- **Data Fields:**
  - **X-Axis:** `coverage` -> `coverage_percentage` (use Average).
  - **Y-Axis:** `incidence_rate` -> `incidence_rate` (use Average).
  - **Values/Details:** `coverage` -> `country_name`.
- **Filters:** Use a Slicer for `disease_description` to focus on one disease at a time.
- **Interpretation:** Look for countries (dots) in the **top-right quadrant** of the plot. These represent regions with both high coverage and high incidence, indicating potential issues with vaccine effectiveness, reporting, or other external factors.
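
The "top-right quadrant" can also be made explicit with thresholds. In the sketch below the 90% coverage cut-off and the use of the median incidence as the second cut-off are assumptions chosen for illustration, and the summary rows are made up.

```python
import pandas as pd

# Made-up per-country averages for a single disease
summary = pd.DataFrame({
    "country_name": ["A", "B", "C", "D"],
    "avg_coverage": [95.0, 92.0, 60.0, 55.0],
    "avg_incidence": [50.0, 2.0, 45.0, 1.0],
})

coverage_cut = 90.0                                # assumed threshold for "high coverage"
incidence_cut = summary["avg_incidence"].median()  # assumed threshold for "high incidence"

flagged = summary[
    (summary["avg_coverage"] >= coverage_cut)
    & (summary["avg_incidence"] > incidence_cut)
]
print(flagged)
```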

#### **Medium Level Questions**

**1. Is there a correlation between vaccine introduction and a decrease in disease cases?**
- **Recommended Visualization:** Combo Chart (Line and Clustered Column).
- **Data Fields:**
  - **Shared X-Axis:** `reported_cases` -> `record_year`.
  - **Column Y-Axis:** `reported_cases` -> `cases` (use Sum).
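  - **Line Y-Axis:** Create a measure to mark the introduction year. DAX Measure: `Vaccine Intro Year = CALCULATE(MIN(vaccine_introduction[record_year]), vaccine_introduction[is_introduced] = TRUE())`. Without the `is_introduced` filter, `MIN` would return the earliest year recorded in the table rather than the first year the vaccine was in use.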
- **Filters:** Use Slicers for `country_name` and `disease_description`.
- **Interpretation:** Select a country and disease. Observe the number of cases (bars) before and after the introduction year (marked by the line or a significant change). A sharp drop in cases after introduction suggests a positive impact.

**2. Which diseases have shown the most significant reduction in cases due to vaccination?**
- **Recommended Visualization:** Table or Matrix.
- **Data Fields & Measures:**
  - **Rows:** `reported_cases` -> `disease_description`.
  - **Values:** Create three measures:
    1. **Pre-Vaccine Cases:** `CALCULATE(AVERAGE(reported_cases[cases]), FILTER(ALL(vaccine_introduction), vaccine_introduction[is_introduced] = FALSE()))`
    2. **Post-Vaccine Cases:** `CALCULATE(AVERAGE(reported_cases[cases]), FILTER(ALL(vaccine_introduction), vaccine_introduction[is_introduced] = TRUE()))`
    3. **Reduction %:** `DIVIDE([Pre-Vaccine Cases] - [Post-Vaccine Cases], [Pre-Vaccine Cases])`
- **Interpretation:** Sort the table by the 'Reduction %' column in descending order. The diseases at the top have shown the most significant reduction.
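
The pre/post comparison can be prototyped in pandas before building the measures. The sketch below uses made-up data for one hypothetical country: it takes the first year flagged as introduced, labels each case record as before or after that year, and computes the percentage reduction in average cases. A before/after difference on its own does not show that vaccination caused the change.

```python
import pandas as pd

years = [2000, 2001, 2002, 2003, 2004, 2005]
cases = pd.DataFrame({
    "country_code": ["AAA"] * 6,
    "record_year": years,
    "cases": [100, 120, 80, 50, 30, 40],
})
intro = pd.DataFrame({
    "country_code": ["AAA"] * 6,
    "record_year": years,
    "is_introduced": [False, False, False, True, True, True],
})

# First year in which the vaccine is flagged as introduced, per country
first_year = (
    intro[intro["is_introduced"]]
    .groupby("country_code")["record_year"].min()
    .rename("intro_year")
    .reset_index()
)

merged = cases.merge(first_year, on="country_code")
is_post = merged["record_year"].ge(merged["intro_year"])
merged["period"] = is_post.map({True: "post", False: "pre"})

# Average cases per period, then the relative reduction
avg = merged.groupby(["country_code", "period"])["cases"].mean().unstack()
avg["reduction_pct"] = (avg["pre"] - avg["post"]) / avg["pre"] * 100
print(avg)
```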

**3. How does vaccine coverage correlate with disease reduction for specific antigens?**
- **Recommended Visualization:** Scatter Plot.
- **Data Fields:**
  - **X-Axis:** `coverage` -> `coverage_percentage`.
  - **Y-Axis:** `incidence_rate` -> `incidence_rate`.
  - **Values:** `coverage` -> `country_name`.
- **Filters:** Add a **Slicer** for `antigen_description`. When you select an antigen (e.g., 'Measles-containing vaccine, 1st dose'), the plot will update to show the correlation for that specific vaccine.
- **Interpretation:** For a selected antigen, a clear downward trend from left to right indicates that as coverage increases, disease incidence decreases.

#### **Scenario-Based Questions**

**1. A government health agency wants to identify regions with low vaccination coverage to allocate resources effectively.**
- **Recommended Visualization:** Map.
- **Data Fields:**
  - **Location:** `coverage` -> `country_name`.
  - **Color saturation / Legend:** `coverage` -> `coverage_percentage` (use Average).
- **Filters:** Add a **Slicer** for `antigen_description` and `record_year` to focus on a specific vaccine and time period.
- **Interpretation:** The regions with the lightest colors (or smallest bubbles) have the lowest vaccination coverage. These are the areas where resources should be prioritized.
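
A ranked list is a useful companion to the map. The sketch below filters made-up rows to one antigen (the code `MCV1` is an assumption) and to the latest year, then lists the countries with the lowest coverage.

```python
import pandas as pd

coverage = pd.DataFrame({
    "country_name": ["A", "A", "B", "B", "C", "C"],
    "record_year": [2021, 2022, 2021, 2022, 2021, 2022],
    "antigen": ["MCV1"] * 6,  # assumed antigen code
    "coverage_percentage": [90.0, 92.0, 50.0, 54.0, 70.0, 75.0],
})

latest_year = coverage["record_year"].max()
latest = coverage[
    (coverage["antigen"] == "MCV1") & (coverage["record_year"] == latest_year)
]

# Two lowest-coverage countries in the latest year
lowest = latest.nsmallest(2, "coverage_percentage")[["country_name", "coverage_percentage"]]
print(lowest)
```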

**2. A public health organization wants to evaluate the effectiveness of a measles vaccination campaign launched five years ago.**
- **Recommended Visualization:** Combo Chart (Line and Column).
- **Data Fields:**
  - **Shared X-Axis:** `reported_cases` -> `record_year`.
  - **Column Y-Axis:** `reported_cases` -> `cases` (use Sum).
  - **Line Y-Axis:** `coverage` -> `coverage_percentage` (use Average).
- **Filters:**
  - Filter `disease_description` to 'Measles'.
  - Use a Slicer to select the specific country/region.
  - Set the `record_year` filter to show the last 10 years to see before and after the campaign.
- **Interpretation:** Look for a rise in the coverage percentage line (the campaign) followed by a significant drop in the case count bars. That pattern is consistent with an effective campaign, although other factors can also reduce case counts.

**3. WHO wants to track global progress toward achieving a target of 95% vaccination coverage for measles by 2030.**
- **Recommended Visualization:** KPI Card and a Gauge Chart.
- **KPI Card Data Fields:**
  - **Indicator:** `coverage` -> `coverage_percentage` (use Average).
  - **Trend axis:** `coverage` -> `record_year`.
- **Gauge Chart Data Fields:**
  - **Value:** `coverage` -> `coverage_percentage` (use Average).
  - **Minimum value:** 0.
  - **Maximum value:** 100.
  - **Target value:** 95.
- **Filters:** Set a filter on the visuals for `antigen_description` containing 'Measles' and `record_year` to the most recent year.
- **Interpretation:** The gauge visually shows how close the current average coverage is to the 95% target. The KPI card shows the current value and its trend over time.
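
The gauge's value and its distance from the target can be reproduced with a short calculation. The sketch below uses made-up rows and a simple unweighted average across records, mirroring the "Average" aggregation above. A population-weighted figure would need the target population as well. The antigen code `MCV1` is an assumption.

```python
import pandas as pd

TARGET = 95.0  # coverage target in percent, from the scenario

coverage = pd.DataFrame({
    "record_year": [2020, 2020, 2021, 2021],
    "antigen": ["MCV1"] * 4,  # assumed antigen code
    "coverage_percentage": [80.0, 90.0, 84.0, 92.0],
})

# Unweighted average coverage per year for the selected antigen
yearly = (
    coverage[coverage["antigen"] == "MCV1"]
    .groupby("record_year")["coverage_percentage"].mean()
)

# Percentage points still missing to reach the target
gap = TARGET - yearly
print(gap)
```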