# Week 11 Assignment


Please do the programming exercise and verify that your code works using the tests, then think about your final project and fill out the questions in the second part.

---

### 47.1: Filtering and summarizing data

For this work, you'll find a data file in `/data/complications_all.csv`.

Read in the data file and create a variable called `mo_hospitals` that contains a data frame from the `complications_all.csv` file, filtered down to only contain those hospitals from the state of Missouri (MO).
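As a minimal sketch of the filtering step, the example below applies a boolean mask to a small made-up data frame. The rows and the names `toy_all`, `is_missouri`, and `toy_mo` are hypothetical, and it assumes the state abbreviation lives in a column named `State`.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from complications_all.csv
toy_all = pd.DataFrame({
    'Facility Name': ['HOSPITAL A', 'HOSPITAL B', 'HOSPITAL C'],
    'State': ['MO', 'IL', 'MO'],
})

# Boolean mask: True where the row belongs to Missouri
is_missouri = toy_all['State'] == 'MO'

# .copy() returns an independent frame, so later column assignments
# do not raise a SettingWithCopyWarning
toy_mo = toy_all[is_missouri].copy()
```

Taking a copy is optional for a pure filter, but it tends to avoid warnings once you start assigning converted columns back onto the filtered frame.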

Then aggregate that data by hospital into a variable named `mo_summary`.  There are some key fields that we want to summarize:
* We want to know the earliest date that each hospital was participating in any program
* We want to know the latest date that each hospital stopped participating in any program
* We want to know the total number of patients in the denominators of these programs

Some things to note:
* You will need to convert the `Start Date` and `End Date` to actual datetime fields
* You will need to clean up and convert the `Denominator` field to be numeric. The rule you should use is to simply remove any records where the `Denominator` is `'Not Available'`
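The two notes above can be sketched on toy data as follows. The values, the `MM/DD/YYYY` date format, and the names `toy_rows` and `toy_clean` are assumptions made for illustration; check the real file before relying on a specific date format.

```python
import pandas as pd

# Hypothetical rows for illustration only; the date format is an assumption
toy_rows = pd.DataFrame({
    'Start Date': ['04/01/2015', '07/01/2015', '04/01/2015'],
    'End Date': ['06/30/2018', '06/30/2018', '03/31/2018'],
    'Denominator': ['120', 'Not Available', '45'],
})

# Parse the date strings into real datetime values
toy_rows['Start Date'] = pd.to_datetime(toy_rows['Start Date'], format='%m/%d/%Y')
toy_rows['End Date'] = pd.to_datetime(toy_rows['End Date'], format='%m/%d/%Y')

# Drop the 'Not Available' rows first, then convert what remains to numbers
toy_clean = toy_rows[toy_rows['Denominator'] != 'Not Available'].copy()
toy_clean['Denominator'] = pd.to_numeric(toy_clean['Denominator'])
```

Filtering before calling `pd.to_numeric` matters here: with its default `errors='raise'`, the conversion would fail on the `'Not Available'` string.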


The final result of this step should be a new data frame called `mo_summary` that contains one row for each hospital and contains the min start date, max end date, and total denominator.  Use the names `start_date`, `end_date`, and `number` for those columns in `mo_summary`.
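One way to get one row per hospital with exactly those column names is pandas named aggregation. The sketch below uses made-up rows and assumes the hospital is identified by a `Facility Name` column; `toy_programs` and `toy_summary` are hypothetical names.

```python
import pandas as pd

# Hypothetical, already-cleaned rows for illustration only
toy_programs = pd.DataFrame({
    'Facility Name': ['HOSPITAL A', 'HOSPITAL A', 'HOSPITAL B'],
    'Start Date': pd.to_datetime(['2015-04-01', '2015-07-01', '2016-01-01']),
    'End Date': pd.to_datetime(['2018-06-30', '2017-12-31', '2018-03-31']),
    'Denominator': [100, 50, 30],
})

# Named aggregation: output_column=(input_column, aggregation_function)
toy_summary = toy_programs.groupby('Facility Name').agg(
    start_date=('Start Date', 'min'),
    end_date=('End Date', 'max'),
    number=('Denominator', 'sum'),
)
```

The grouping column becomes the index of the result, so a single hospital can be looked up with `toy_summary.loc['HOSPITAL A']`.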


You do not need to create your code in the form of a function, just make sure your variable names match what I've described above so the tests work.

In [13]:
import pandas as pd
# This is just to show you the name to use for the variable you need to create for this step to pass.
all_hospitals = pd.read_csv('https://hds5210-data.s3.amazonaws.com/complications_all.csv')


In [14]:
# Do your work here and in as many cells as you need to create a variable called `mo_summary` that matches the requirements

# Step 1: Reuse the `all_hospitals` data frame (and the pandas import) from the previous cell

# Step 2: Keep only the hospitals located in Missouri; copy so later edits do not touch `all_hospitals`
mo_hospitals = all_hospitals[all_hospitals['State'] == 'MO'].copy()

# Step 3: Convert the 'Start Date' and 'End Date' columns to datetime format
mo_hospitals['Start Date'] = pd.to_datetime(mo_hospitals['Start Date'])
mo_hospitals['End Date'] = pd.to_datetime(mo_hospitals['End Date'])

# Step 4: Exclude rows where the 'Denominator' is marked as 'Not Available',
# then convert 'Denominator' to a numeric type. Plain column assignment on the
# copy replaces the column, so its dtype actually becomes numeric.
valid_hospitals = mo_hospitals[mo_hospitals['Denominator'] != 'Not Available'].copy()
valid_hospitals['Denominator'] = pd.to_numeric(valid_hospitals['Denominator'])

# Step 5: Group by 'Facility Name' and compute the minimum start date, maximum end date,
# and the sum of denominators for each hospital. The result must be named `mo_summary`
# so the tests below can find it.
mo_summary = valid_hospitals.groupby('Facility Name').agg(
    start_date=('Start Date', 'min'),
    end_date=('End Date', 'max'),
    number=('Denominator', 'sum')
)

# Step 6: Display the first few rows of the summary data for review
print(mo_summary.head())



                                   start_date   end_date  number
Facility Name                                                   
BARNES JEWISH HOSPITAL             2015-04-01 2018-06-30  131313
BARNES-JEWISH ST PETERS HOSPITAL   2015-04-01 2018-06-30   15668
BARNES-JEWISH WEST COUNTY HOSPITAL 2015-04-01 2018-06-30    9622
BATES COUNTY MEMORIAL HOSPITAL     2015-07-01 2018-06-30    3117
BELTON REGIONAL MEDICAL CENTER     2015-04-01 2018-06-30    9270


In [15]:
assert(mo_summary['number'].sum() == 1766908)
assert(mo_summary['start_date'].min() == pd.Timestamp(2015,4,1))
assert(mo_summary['end_date'].max() == pd.Timestamp(2018,6,30))
assert(mo_summary.shape == (108,3))
assert(mo_summary.loc['BARNES JEWISH HOSPITAL'].number == 131313)
assert(mo_summary.loc['BOONE HOSPITAL CENTER'].number == 63099)

---

### 47.2 Planning your final project

You should be thinking about the things we've been learning and how you can apply them to your final project.  Use the rubric to help guide your thinking, then answer the questions below.  This is meant as a guide to help you think through what you will do.

#### A. Data Access

Your project should include data from at least three distinct types of sources.  For example: AWS S3, Relational Databases, Internet, Web Services, local files.  List what data sources you're planning to use.

For my final project, I plan to use the following three distinct data sources: 1. relational databases, 2. the internet, and 3. local files.


#### B. Data Formats

Your project should include data that comes in different file formats.  For example: HL7, EDI, HTML, CSV, Excel, JSON, XML.  List what data formats you're planning to use.


For this project, I plan to use data in the following file formats: CSV, Excel, and JSON.


#### C. Objective

What purpose would your project serve in a real work setting?  Take a couple of paragraphs to write down why this is an interesting product.


In a real-world work setting, this project could serve as a critical tool for monitoring and evaluating the performance of hospitals participating in diabetes management programs. Diabetes is a chronic condition that requires ongoing care, and hospitals are often involved in programs aimed at improving patient outcomes, such as reducing complications, hospital readmissions, and managing blood sugar levels. By tracking key metrics like participation start and end dates, as well as patient denominators, healthcare organizations can assess the effectiveness of these programs over time and make data-driven decisions to improve care delivery.

For instance, understanding when hospitals first joined diabetes-related initiatives and how long they participated can provide insights into program adherence and sustainability. Furthermore, aggregating data on the number of patients involved (denominators) helps quantify the scope of each hospital’s efforts in managing diabetes. This information can be used for performance benchmarking, identifying best practices, and allocating resources more effectively. Overall, the project would be valuable for healthcare administrators and policymakers aiming to enhance the quality of care for diabetic patients while ensuring that healthcare facilities are consistently meeting program goals.











---



## Submit your work via GitHub as normal
