# Graded Lab: Multi-Source Data Integration Challenge

## Overview 
Congratulations on reaching your first graded lab in this course! In this challenge, you'll apply the Python data manipulation skills you've learned throughout this module to solve a real-world business problem.

Meet MediTrack Health Solutions, a cutting-edge healthcare management company that will serve as your case study for this assessment. As a new data scientist on the MediTrack analytics team, you've been tasked with an important data integration project that will help improve their healthcare management platform.

## Your Challenge
MediTrack currently maintains separate systems for:
- Electronic health records
- Billing transactions
- Patient visit histories

Your task is to build a data pipeline that will combine these sources, enabling seamless communication between healthcare providers, insurers, and patients. This integration will help MediTrack:
- Track patient visit patterns and associated costs
- Analyze insurance coverage effectiveness
- Identify billing efficiency opportunities
- Support data-driven decision-making for both clinical and financial teams

## Learning Outcomes 
By the end of this lab, you will demonstrate your ability to:
- Import and process multiple data formats (CSV and Excel)
- Apply data cleaning techniques to handle missing values and inconsistencies
- Merge datasets using appropriate join operations
- Create calculated fields for analysis
- Extract specific metrics for automated assessment

## Dataset Information
You'll work with two datasets from MediTrack's systems:
- <b>billing_graded_lab.xlsx:</b> Billing System Records
    - Payment status
    - Insurance coverage
    - Total charges
    - Patient payments
- <b>medical_visits_graded_lab.xlsx:</b> Visit Tracking System
    - Visit dates
    - Diagnoses
    - Treatments
    - Doctor fees

## Graded Challenges
### Graded Challenge 1: Data Import and Initial Inspection 

<b>Step 1:</b> Import Required Libraries

In [None]:
import pandas as pd
import numpy as np

<b>Step 2:</b> Load Datasets
- Load <b>billing_graded_lab.xlsx</b> into a DataFrame, <b>billing_df</b>
- Load <b>medical_visits_graded_lab.xlsx</b> into a DataFrame, <b>medical_visits_df</b>
- Display the first few rows of each dataset

In [None]:
# Load billing_graded_lab.xlsx into billing_df and medical_visits_graded_lab.xlsx into medical_visits_df
billing_df = pd.read_excel("billing_graded_lab.xlsx")
medical_visits_df = pd.read_excel("medical_visits_graded_lab.xlsx")

# Display the first few rows of each dataset
display(billing_df.head())
display(medical_visits_df.head())

In [None]:
# Do not edit this cell, just run it. This cell contains test cases.


<b>Step 3:</b> Inspect Data Quality
- Check for missing values
- Review data types
- Examine date formats

In [None]:
# Review column data types and non-null counts for each dataset
billing_df.info()
medical_visits_df.info()

# Review summary statistics for the numeric columns
display(billing_df.describe())
display(medical_visits_df.describe())

<b>Tip:</b> Use the info() and describe() methods to get a quick overview of your data.
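
Beyond info() and describe(), missing values can be counted directly per column. The following is a minimal sketch on a made-up toy DataFrame (`toy_df` and its values are hypothetical, not the lab data); the same calls should apply to the lab's DataFrames.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a billing table (values are made up for illustration)
toy_df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "payment_status": ["Paid", None, "Pending", None],
    "total_charge": [120.0, 80.5, np.nan, 40.0],
})

# Count missing values in each column
missing_counts = toy_df.isna().sum()
print(missing_counts)

# List the data type of each column
print(toy_df.dtypes)
```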

### Graded Challenge 2: Data Cleaning and Preprocessing 

<b>Step 1:</b> Handle Missing Values
- Fill missing payment_status with 'Pending' in <b>billing_df</b>
- Handle any missing dates in "bill_date" and "visit_date" columns appropriately

In [None]:
# Count each payment_status value, including missing values (NaN)
billing_df["payment_status"].value_counts(dropna=False)

In [None]:
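# Fill the missing values in the payment_status column of billing_df with 'Pending'
# (assign the result back; calling fillna(inplace=True) on a selected column may not update the DataFrame)
billing_df["payment_status"] = billing_df["payment_status"].fillna("Pending")

# Replace missing dates in both DataFrames with a default date of '1970-01-01'
billing_df["bill_date"] = billing_df["bill_date"].fillna("1970-01-01")
medical_visits_df["visit_date"] = medical_visits_df["visit_date"].fillna("1970-01-01")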

In [None]:
billing_df["bill_date"]

In [None]:
billing_df["bill_date"].info()

In [None]:
# Do not edit this cell, just run it. This cell contains test cases.


<b>Step 2:</b> Standardize Date Formats
- Convert bill_date to datetime
- Convert visit_date to datetime
- Ensure consistent date formatting
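
As an illustration of date standardization, the sketch below parses a made-up Series of date strings (`raw_dates` is hypothetical). It assumes that unparseable entries should become missing values via `errors="coerce"`; the lab's test cases may expect different handling.

```python
import pandas as pd

# Made-up date strings: one valid, one impossible date, one missing, one malformed
raw_dates = pd.Series(["2024-01-15", "2024-02-30", None, "not a date"])

# errors="coerce" turns anything that cannot be parsed into NaT instead of raising
parsed = pd.to_datetime(raw_dates, errors="coerce")
print(parsed)

# Confirm that the result has a datetime dtype and count the entries that failed to parse
print(pd.api.types.is_datetime64_any_dtype(parsed))
print(parsed.isna().sum())
```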

In [None]:
# Convert bill_date and visit_date columns to datetime

billing_df["bill_date"] = pd.to_datetime(billing_df["bill_date"])
medical_visits_df["visit_date"] = pd.to_datetime(medical_visits_df["visit_date"])

In [None]:
# Do not edit this cell, just run it. This cell contains test cases.

<b>Step 3:</b> Validate Data Types
- Ensure numeric columns (total_charge, insurance_coverage, patient_paid) in billing_df are properly formatted 
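
One common way to enforce numeric types is `pd.to_numeric`. The sketch below uses a made-up toy DataFrame (`toy_billing` and its values are hypothetical) and assumes that non-numeric entries should become NaN; whether the lab's test cases expect this behavior is not stated.

```python
import pandas as pd

# Toy billing data with numbers stored as strings and a few invalid entries
toy_billing = pd.DataFrame({
    "total_charge": ["100.50", "200", "N/A"],
    "insurance_coverage": [0.8, "0.5", None],
    "patient_paid": [20, "abc", 100.0],
})

numeric_columns = ["total_charge", "insurance_coverage", "patient_paid"]

# Convert each column; values that cannot be parsed become NaN
for col in numeric_columns:
    toy_billing[col] = pd.to_numeric(toy_billing[col], errors="coerce")

print(toy_billing.dtypes)
print(toy_billing.isna().sum())
```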

In [None]:
# Ensure numeric columns in billing_df contain numeric values

numeric_columns = ['total_charge','insurance_coverage','patient_paid']
# YOUR CODE HERE
 

In [None]:
# Do not edit this cell, just run it. This cell contains test cases.

### Graded Challenge 3: Data Integration

<b>Step 1:</b> Prepare Join Keys
- Identify common fields between datasets

In [None]:
# Take a look at both datasets to identify the common fields 
display(billing_df.head())
display(medical_visits_df.head())

<b>Step 2:</b> Merge Datasets
- Join billing and visits data on appropriate keys
- Validate the merged dataset
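
To illustrate how a merge can be validated, the sketch below joins two made-up toy tables (`toy_billing`, `toy_visits`, and the `doctor_fee` column name are hypothetical). The outer join and the `one_to_many` relationship are assumptions chosen for demonstration, not the join the lab requires.

```python
import pandas as pd

# Toy tables: patient 3 has no visits, patient 4 has no billing record
toy_billing = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "total_charge": [100.0, 250.0, 75.0],
})
toy_visits = pd.DataFrame({
    "patient_id": [1, 1, 2, 4],
    "doctor_fee": [50.0, 60.0, 80.0, 40.0],
})

# validate raises MergeError if patient_id is not unique on the left side;
# indicator adds a _merge column showing where each row came from
toy_merged = toy_billing.merge(
    toy_visits,
    on="patient_id",
    how="outer",
    validate="one_to_many",
    indicator=True,
)

print(toy_merged["_merge"].value_counts())
```

Rows marked `left_only` or `right_only` are the ones an inner join would drop.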

In [None]:
# Merge the datasets using patient_id as the key into a dataframe called merged_df

# YOUR CODE HERE
  
display(merged_df)

In [None]:
# Do not edit this cell, just run it. This cell contains test cases.


<b>Step 3:</b> Validate No Data Loss
- Check for any data loss during merging
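
One possible check is to compare the join keys present before and after merging. The sketch below uses made-up toy tables (all names and values are hypothetical) and an inner join chosen only to make the loss visible.

```python
import pandas as pd

# Toy tables: patient 3 has no visits, patient 4 has no billing record
toy_billing = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "total_charge": [100.0, 250.0, 75.0],
})
toy_visits = pd.DataFrame({
    "patient_id": [1, 1, 2, 4],
    "doctor_fee": [50.0, 60.0, 80.0, 40.0],
})

toy_inner = toy_billing.merge(toy_visits, on="patient_id", how="inner")

# Compare the sets of keys before and after the merge
merged_ids = set(toy_inner["patient_id"])
lost_from_billing = set(toy_billing["patient_id"]) - merged_ids
lost_from_visits = set(toy_visits["patient_id"]) - merged_ids

print("Rows before:", len(toy_billing), len(toy_visits), "after:", len(toy_inner))
print("Billing patients missing after merge:", lost_from_billing)
print("Visit patients missing after merge:", lost_from_visits)
```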

In [None]:
# YOUR CODE HERE

In [None]:
# Do not edit this cell, just run it. This cell contains test cases.

### Graded Challenge 4: Analysis Pipeline

<b>Step 1:</b> Create the following calculated fields:
- Calculate total_revenue 
- Determine average insurance coverage 
- Compute visit frequency metrics per patient

The output of the function should be a dictionary with the exact keys: 
- total_revenue (2 decimal places)
- avg_insurance_coverage (4 decimal places)
- patient_visit_frequency (2 decimal places)
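
The sketch below shows only the output shape and rounding on a made-up toy DataFrame. The metric definitions are assumptions for illustration (revenue as the sum of `total_charge`, coverage as the mean of an `insurance_coverage` ratio, frequency as the mean number of rows per patient), and `summarize_toy` is a hypothetical helper; the lab's intended formulas may differ.

```python
import pandas as pd

# Toy merged data (values are made up); insurance_coverage is assumed to be a ratio
toy_merged = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "total_charge": [100.0, 150.25, 200.0],
    "insurance_coverage": [0.8, 0.5, 0.75],
})

def summarize_toy(df):
    """Return the three metrics as a dictionary, rounded as specified."""
    total_revenue = df["total_charge"].sum()                  # assumed definition
    avg_coverage = df["insurance_coverage"].mean()            # assumed definition
    visits_per_patient = df.groupby("patient_id").size().mean()  # assumed definition
    return {
        "total_revenue": round(float(total_revenue), 2),
        "avg_insurance_coverage": round(float(avg_coverage), 4),
        "patient_visit_frequency": round(float(visits_per_patient), 2),
    }

toy_results = summarize_toy(toy_merged)
print(toy_results)
```

Converting to `float` before rounding keeps the dictionary values as plain Python numbers rather than NumPy scalars.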

In [None]:
def process_meditrack_data(merged_df):
    # YOUR CODE HERE
    
# Print the calculated fields

results = process_meditrack_data(merged_df)
print(results)
    

In [None]:
# Do not edit this cell, just run it. This cell contains test cases.

<b>Step 2:</b> Extract Key Metrics
- Generate summary statistics
- Calculate specific business metrics

In [None]:
# Generate summary statistics
summary_stats = merged_df.describe()
display(summary_stats)

## Verify Your Results 
Before submission, validate your analysis against these key checkpoints:
    
Your final output must be a dictionary with <b>Key:Value</b> pairs as follows: 
- total_revenue: $2,500,005 – $3,193,305
- avg_insurance_coverage: 0 to 1
- patient_visit_frequency: less than 10

Data Quality Checks:
- Datetime format for all dates
- No missing values in critical columns
- Logical row count in merged dataset
- Valid numeric values in calculated fields

Output Format Requirements:
- Dictionary output with exact keys:
    - total_revenue (2 decimal places)
    - avg_insurance_coverage (4 decimal places)
    - patient_visit_frequency (2 decimal places)

### Troubleshooting
If you encounter issues:
1. Data Import
    - Check file paths and names
    - Verify column presence
    - Confirm data types
2. Calculations
    - Review mathematical operations
    - Check null value handling
    - Verify aggregation methods
3. Formatting
    - Confirm proper round() function usage
    - Match dictionary key names exactly
    - Verify decimal place specifications

You can retry the lab as many times as needed until you get the correct results. Remember to run all cells in order and verify all outputs before submission!