Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = "Emily Holcomb"
COLLABORATORS = ""

---

# Week 11 Exercises

_McKinney 7.2, 11.1_

The general activity that we are doing in this week's exercise is to explore and try to understand a particular data set.  In this case, it is 

1. Read in the data file and filter down to only looking at MO hospitals
2. Aggregate by Hospital: sum the Denominator, use min Start Date, use max End Date  (watch for rows with no Denominator value!)
3. Calculate the average per day across that entire span
4. Histogram
5. Rank and find the hospitals with the highest per-day average

## STEP 1 - Read and Filter

<img src="images/step1.png" alt="Read and Filter Output" style="width: 500px; float: right; margin-left: 20px; border: 1px solid">

In the first step, read in the data file from this directory named `complications.csv`.  It is a CSV file and Pandas should read it in just fine.  Explore the file so that you understand the columns and values.  At the end of this step, create a variable called `mo_hospitals` that contains a data frame from the `complications.csv` file, filtered down to only contain those hospitals from the state of Missouri (MO).

A screenshot is included for reference.
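
As a rough illustration of the read-and-filter pattern described above, here is a minimal sketch on a made-up, in-memory CSV. The rows and hospital names are hypothetical and are not taken from `complications.csv`; only the `State` filter mirrors the instructions.

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for complications.csv (made-up rows).
csv_text = """Hospital Name,State,Denominator
ALPHA HOSPITAL,MO,120
BETA CLINIC,IL,75
GAMMA MEDICAL CENTER,MO,Not Available
"""

toy = pd.read_csv(io.StringIO(csv_text))

# Quick exploration: column types and the distinct state codes.
print(toy.dtypes)
print(toy['State'].unique())

# A boolean mask keeps only the Missouri rows; .copy() gives an
# independent frame that is safe to add columns to later.
toy_mo = toy[toy['State'] == 'MO'].copy()
print(toy_mo.shape)
```

The same mask-and-copy pattern should carry over to the real file, assuming its state column is also named `State`.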



In [3]:
import pandas as pd

# This is just to show you the name to use for the variable you need to create for this step to pass.
mo_hospitals = pd.DataFrame()

# Put your code below and make sure that you reassign `mo_hospitals` 
# to have the contents described in the instructions.

# YOUR CODE HERE
hospitals = pd.read_csv('complications.csv')

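# .copy() makes the filtered frame independent, so adding columns later does not trigger SettingWithCopyWarning
mo_hospitals = hospitals[hospitals['State'] == 'MO'].copy()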

FileNotFoundError: [Errno 2] File b'complications.csv' does not exist: b'complications.csv'

In [2]:
# These assertions will help make sure that you're on the right track.
assert(mo_hospitals['State'].unique() == ['MO'])
assert(mo_hospitals.shape == (2171,19))

KeyError: 'State'

## STEP 2 - Transform and Aggregate

<img src="images/step2.png" alt="Transform and Aggregate Output" style="width: 500px; float: right; margin-left: 20px; border: 1px solid">

In the next step, we need to aggregate the results by hospital.  There are some key fields that we want to summarize, though:
* We want to know the earliest date that each hospital was participating in any program
* We want to know the latest date that each hospital stopped participating in any program
* We want to know the total number of patients in the denominators of these programs

Some things to note:
* You will need to convert the `Measure Start Date` and `Measure End Date` to actual datetime fields
* You will need to clean up and convert the `Denominator` field to be numeric - the rule that you should use is to simply remove any records where the `Denominator` is `'Not Available'`

The final result of this step should be a new data frame called `mo_summary` that contains one row for each hospital and contains the min start date, max end date, and total denominator.  Use the names `start_date`, `end_date`, and `number` for those columns in `mo_summary`.

A screenshot is included for reference.
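
One possible way to express the clean-convert-aggregate sequence is pandas named aggregation. The sketch below uses made-up rows; only the column names `Hospital Name`, `Measure Start Date`, `Measure End Date`, and `Denominator` come from the exercise, and the `%m/%d/%Y` date format is an assumption about how the dates are written.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the exercise.
toy = pd.DataFrame({
    'Hospital Name': ['ALPHA', 'ALPHA', 'BETA', 'BETA'],
    'Measure Start Date': ['04/01/2014', '07/01/2015', '07/01/2014', '04/01/2015'],
    'Measure End Date': ['03/31/2017', '06/30/2017', '06/30/2016', '03/31/2017'],
    'Denominator': ['100', 'Not Available', '40', '60'],
})

# Drop the rows with no usable Denominator, then convert the types.
clean = toy[toy['Denominator'] != 'Not Available'].copy()
clean['Denominator'] = pd.to_numeric(clean['Denominator'])
clean['Measure Start Date'] = pd.to_datetime(clean['Measure Start Date'], format='%m/%d/%Y')
clean['Measure End Date'] = pd.to_datetime(clean['Measure End Date'], format='%m/%d/%Y')

# Named aggregation: one row per hospital, one output column per rule.
toy_summary = clean.groupby('Hospital Name').agg(
    start_date=('Measure Start Date', 'min'),
    end_date=('Measure End Date', 'max'),
    number=('Denominator', 'sum'),
)
print(toy_summary)
```

Because the group key becomes the index, the result can be looked up by hospital name with `.loc`, which is how the assertions in this step access it.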

In [None]:
# This is just to show you the name to use for the variable you need to create for this step to pass.
mo_summary = pd.DataFrame()

# Put your code below and make sure that you reassign `mo_summary` 
# to have the contents described in the instructions.

# YOUR CODE HERE
start = mo_hospitals['Measure Start Date'].astype(str)
end = mo_hospitals['Measure End Date'].astype(str)
mo_hospitals['start_date'] = pd.to_datetime(start.str[0:10], format='%m/%d/%Y')
mo_hospitals['end_date'] = pd.to_datetime(end.str[0:10], format='%m/%d/%Y')

In [None]:
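# Drop rows whose Denominator is 'Not Available'; .copy() keeps the result independent of the original frame
mo_hospitals = mo_hospitals[mo_hospitals['Denominator'] != 'Not Available'].copy()
mo_hospitals['number'] = pd.to_numeric(mo_hospitals['Denominator'])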

In [None]:
# Select 'number' before summing so non-numeric columns are never aggregated
mo_summary['Hospital Name'] = list(mo_hospitals.groupby('Hospital Name')['number'].sum().index)

In [None]:
# Total denominator per hospital, in the same (sorted) hospital order as above
mo_summary['number'] = list(mo_hospitals.groupby('Hospital Name')['number'].sum())

In [None]:
mo_summary['start_date'] = list(mo_hospitals.groupby('Hospital Name').aggregate({'start_date':'min','end_date':'max'})['start_date'])

In [None]:
mo_summary['end_date'] = list(mo_hospitals.groupby('Hospital Name').aggregate({'start_date':'min','end_date':'max'})['end_date'])

In [None]:
# Hospital names are already unique, so make them the index directly
mo_summary = mo_summary.set_index('Hospital Name')

In [None]:
assert(mo_summary['number'].sum() == 1596311)
assert(mo_summary['start_date'].min() == pd.Timestamp(2014,4,1))
assert(mo_summary['end_date'].max() == pd.Timestamp(2017,6,30))
assert(mo_summary.shape == (105,3))
assert(mo_summary.loc['BARNES JEWISH HOSPITAL'].number == 119125)
assert(mo_summary.loc['BOONE HOSPITAL CENTER'].number == 61395)

## STEP 3: Average Per Day

<img src="images/step3.png" alt="Average per Day" style="width: 500px; margin-left: 20px; float: right; border: 1px solid">


The next step, now that we have a start date, end date, and total patients for each hospital, is to calculate how many patients per day this represents on average.  You will need to calculate the number of days between the start date and end date, and then compute the average as total patients divided by number of days.

Your final output should still be the variable `mo_summary`, and it will need to have two additional columns: `days` and `per_day`.

A screenshot is included for reference.
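
The day-count arithmetic can be sketched on a tiny made-up summary table. The hospital names and numbers here are hypothetical; only the column names `start_date`, `end_date`, `number`, `days`, and `per_day` follow the exercise.

```python
import pandas as pd

# Hypothetical two-hospital summary with the column names from Step 2.
toy_summary = pd.DataFrame(
    {
        'start_date': pd.to_datetime(['2014-04-01', '2014-07-01']),
        'end_date': pd.to_datetime(['2014-04-11', '2014-07-21']),
        'number': [50, 10],
    },
    index=pd.Index(['ALPHA', 'BETA'], name='Hospital Name'),
)

# Subtracting two datetime columns gives a Timedelta column;
# .dt.days extracts the whole-day count as integers.
toy_summary['days'] = (toy_summary['end_date'] - toy_summary['start_date']).dt.days

# Average patients per day over each hospital's span.
toy_summary['per_day'] = toy_summary['number'] / toy_summary['days']
print(toy_summary[['days', 'per_day']])
```

Note that the subtraction counts the gap between the two dates, so April 1 to April 11 is 10 days rather than 11.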


In [None]:
# Put your code below and make sure that you add new columns to `mo_summary` 
# to have the contents described in the instructions.

# YOUR CODE HERE
mo_summary['days'] = mo_summary['end_date'] - mo_summary['start_date']

mo_summary['days'] = mo_summary['days'].dt.days

In [None]:
mo_summary['per_day'] = mo_summary['number'] / mo_summary['days']

In [None]:
assert(mo_summary['days'].sum() == 120981)
assert(mo_summary['per_day'].mean() == 12.85715761949863)
assert(mo_summary['per_day'].min() == 0.028310502283105023)
assert(mo_summary['days'].min() == 1095)
assert(mo_summary['days'].max() == 1186)
assert(mo_summary.loc['BARNES JEWISH HOSPITAL'].per_day == 100.44266441821247)
assert(mo_summary.loc['BARNES JEWISH HOSPITAL'].days == 1186)

## STEP 4: Histogram

<img src="images/step4.png" alt="Histogram" style="width: 300px; float: right; margin-left: 20px; border: 1px solid">

For this step, I want you to plot a histogram to give yourself a better understanding of this new metric we've calculated: how many patients per day have a potential complication.  Looking at the shape and distribution of the data might give us some indication of the hospital volume and quality.  (Don't take this too seriously, though.  We're cutting some important corners.)

A screenshot is included for reference.
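
The bars of a histogram are just counts of values per bin. As a cross-check on the plot, the sketch below computes those counts numerically with NumPy instead of drawing them. The values and the choice of 4 bins are made up for illustration; the exercise itself asks for the built-in Pandas plot.

```python
import numpy as np
import pandas as pd

# Made-up per_day values: mostly small, with a couple of large outliers.
per_day = pd.Series([1.0, 2.0, 3.0, 10.0, 50.0, 100.0])

# np.histogram returns the count in each bin and the bin edges.
counts, edges = np.histogram(per_day, bins=4)

# Print one line per bin: its range and how many values fall in it.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:7.2f} to {right:7.2f}: {count}")
```

A distribution like this one, with most values in the first bin and a few far to the right, is the kind of right-skewed shape to look for in the plotted histogram.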

In [None]:
%matplotlib inline
# Use the built-in Pandas histogram plotting capability to plot a histogram of the `per_day` values.

# YOUR CODE HERE
mo_summary['per_day'].hist()


## STEP 5: Rank Order

Now that we see what the distribution looks like, you can tell that most of the `per_day` values are below 20.  There are, however, some outliers up around 100!  Who are those?  Let's rank the data set by the `per_day` value in descending order, examine the results, and produce a list of the "top 3" based on this criterion.  That is, who are the top three hospitals based on having the highest `per_day` values?

At the end of this step, put those hospital names in a list called `top_hospitals` and that will be used for testing.  They should be in the order they appear `[#1, #2, #3]`.  That is, if the top hospitals were Mercy (99.3), BJC (97.2), and MoBap (90.1), then you would have `top_hospitals = ['Mercy','BJC','MoBap']`
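
A minimal sketch of the ordering described above, using the illustrative Mercy, BJC, and MoBap values from the text plus one hypothetical low-volume entry (`Other`):

```python
import pandas as pd

# Values from the example in the text, plus a hypothetical fourth hospital.
per_day = pd.Series({'MoBap': 90.1, 'Other': 12.4, 'Mercy': 99.3, 'BJC': 97.2})

# Sort largest first, keep the first three, and take their index labels.
ranked = per_day.sort_values(ascending=False)
top_three = ranked.head(3).index.tolist()
print(top_three)  # ['Mercy', 'BJC', 'MoBap']
```

`per_day.nlargest(3).index.tolist()` should give the same list in one call; sorting first is shown here because it also lets you examine the full ranking.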

In [None]:
# Put your code below and make sure that your final result ends up in the variable `top_hospitals`
# as described in the instructions above.

top_hospitals = []

# YOUR CODE HERE
# Sort by per_day, largest first, then take the names of the first three
top_hosp = mo_summary['per_day'].sort_values(ascending=False)

top_hospitals = list(top_hosp.head(3).index)

In [None]:
assert(type(top_hospitals) == list)