### Brief Overview 

- The goal of this assignment is to practice some important data wrangling functionality commonly required in real-world projects.

- Here we will use two datasets:
  - IRS Statistics of Income (SOI) dataset
  - The Medicaid Data per State  


- The final product here is a table with medication cost per Medicaid enrollee per state. This dataset will allow us to answer questions such as:
  - Which medications account for the bulk of a state's spending?
  - Which drugs are prescribed much more in one state than in the other states?


In [1]:
import pandas as pd
import numpy as np


* Load the IRS Statistics of Income (SOI) dataset into a `DataFrame` called `tax_data`. The file `tax_data.csv` is located in the `data` directory of the assignment folder.

* This dataset was preprocessed but the original one was obtained at the following URL:

        https://www.irs.gov/pub/irs-soi/15zpallagi.csv


In [2]:
## WRITE YOUR CODE HERE 
tax_data = pd.read_csv('data/tax_data.csv')

* Use a `tax_data` method or property to display the first eight (8) rows of the `DataFrame`

In [3]:
## WRITE YOUR CODE HERE 
tax_data.head(8)

Unnamed: 0,STATEFIPS,STATE,zipcode,agi_stub,N1,mars1,MARS2,MARS4,PREP,N2,...,N10300,A10300,N85530,A85530,N85300,A85300,N11901,A11901,N11902,A11902
0,1.0,AL,0.0,1.0,836320.0,481570.0,109790.0,233260.0,455560.0,1356760.0,...,373410.0,328469.0,0.0,0.0,0.0,0.0,61920.0,48150.0,732670.0,1933120.0
1,1.0,AL,0.0,2.0,494830.0,206630.0,146250.0,129390.0,275920.0,1010990.0,...,395880.0,965011.0,0.0,0.0,0.0,0.0,73720.0,107304.0,415410.0,1187403.0
2,1.0,AL,0.0,3.0,261250.0,80720.0,139280.0,36130.0,155100.0,583910.0,...,251490.0,1333418.0,0.0,0.0,0.0,0.0,64200.0,139598.0,193030.0,536699.0
3,1.0,AL,0.0,4.0,166690.0,28510.0,124650.0,10630.0,99950.0,423990.0,...,165320.0,1414283.0,0.0,0.0,0.0,0.0,45460.0,128823.0,116440.0,377177.0
4,1.0,AL,0.0,5.0,212660.0,19520.0,184320.0,4830.0,126860.0,589490.0,...,212000.0,3820152.0,420.0,168.0,60.0,31.0,83330.0,421004.0,121570.0,483682.0
5,1.0,AL,0.0,6.0,55360.0,2950.0,49260.0,350.0,41410.0,160530.0,...,55300.0,6027793.0,22090.0,39519.0,27550.0,95112.0,28590.0,791573.0,15960.0,250289.0
6,1.0,AL,35004.0,1.0,1490.0,970.0,230.0,280.0,700.0,2160.0,...,690.0,610.0,0.0,0.0,0.0,0.0,120.0,94.0,1290.0,2792.0
7,1.0,AL,35004.0,2.0,1350.0,630.0,360.0,300.0,610.0,2540.0,...,1140.0,3019.0,0.0,0.0,0.0,0.0,210.0,301.0,1130.0,2935.0


*  Modify `tax_data` to uppercase all the header names. 
  * Your answer can only use `DataFrame` or `Series` methods or properties
  * Do not hardcode the operation by uppercasing the columns yourself
  * You can use `tax_data.columns`, which returns an `Index` of the column names; like a `Series`, it exposes string methods through `.str`.
  
* The resulting column names should look as follows:


```
STATEFIPS    STATE    ZIPCODE    AGI_STUB    N1    MARS1    MARS2    MARS4    PREP    N2    ...    
```

This operation is useful for standardizing column names: it avoids guessing whether a column header is in upper case, lower case, or a mix of both.

In [4]:
tax_data.columns = tax_data.columns.str.upper()

* What is the total number of entries (also called observations) in `tax_data`?

  * Your answer can only use `DataFrame` or `Series` methods or properties


In [5]:
## WRITE YOUR CODE HERE 
tax_data.shape

(166698, 91)

- If `STATEFIPS` is the header title of the first column of the `tax_data` `DataFrame`, what is the title of the 32nd column?
  - Your answer can only use `DataFrame` or `Series` methods or properties and should use a single python expression



In [6]:
## WRITE YOUR CODE HERE 
tax_data.columns[31]

'N00900'

* If `STATEFIPS` is the first column, what is the index of the column named `N10300`?

  * Your answer can only use `DataFrame` or `Series` methods or properties


In [7]:
## WRITE YOUR CODE HERE 
tax_data.columns.get_loc('N10300')

81

In [8]:
## WRITE YOUR CODE HERE 
states_zips = tax_data.groupby('STATE')['ZIPCODE'].nunique()
states_zips = states_zips.reset_index()
states_zips = states_zips.sort_values(by='ZIPCODE', ascending=False)

- Identify the position of HI in the list of zip code counts per state (questions directly above)
  - Your answer can only use `DataFrame` or `Series` methods or properties

In [9]:
## WRITE YOUR CODE HERE
states_zips.loc[states_zips['STATE'] == 'HI']

Unnamed: 0,STATE,ZIPCODE
11,HI,60
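
The label `11` shown above is the row label that `reset_index()` assigned before sorting, so it need not equal HI's rank by zip code count. A minimal sketch of one way to obtain the rank, using made-up counts (the `toy` frame and its numbers are illustrative assumptions, not values from `tax_data`):

```python
import pandas as pd

# Hypothetical counts for illustration only; not taken from `tax_data`.
toy = pd.DataFrame({"STATE": ["AK", "HI", "TX"], "ZIPCODE": [5, 2, 9]})

# Sort by count, then renumber the rows so the index equals the 0-based rank.
ranked = toy.sort_values(by="ZIPCODE", ascending=False).reset_index(drop=True)

# The index label of the HI row is now its position in the sorted table.
hi_position = int(ranked.index[ranked["STATE"] == "HI"][0])
print(hi_position)
```

Applied to `states_zips`, the same pattern would be a `reset_index(drop=True)` after the `sort_values` call.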


### Identifying and Removing Ambiguous Zip Codes

- Count the number of entries where ZIPCODE is 0, assign your results to a variable named  `nb_invalid_zip`

In [10]:
## WRITE YOUR CODE HERE 
nb_invalid_zip = tax_data['ZIPCODE'].value_counts()[0.0].tolist()
#type(nb_invalid_zip)
print(nb_invalid_zip)

306


* Run the line below to make sure that `nb_invalid_zip` is an integer (`int`)
  * Note that `assert` will only raise an error if `type(nb_invalid_zip)` is not `int`
  
* If `nb_invalid_zip` is not an `int`, change your answer above so that the value returned is a plain Python integer.

In [11]:
assert(type(nb_invalid_zip) == int)

- Remove from `tax_data` all the lines where the zip code is `0` and save resulting `DataFrame` to a variable named `tax_data_valid_zip`
  - Your answer can only use `DataFrame` or `Series` methods or properties


In [12]:
tax_data_valid_zip = tax_data[tax_data['ZIPCODE'] != 0.0]
tax_data_valid_zip.shape

(166392, 91)

* Run the line below to confirm that the operation worked as expected

  * The assertion below is testing that the number of lines with "zip code equal to  0" + number of lines in `tax_data_valid_zip` is equal to the number of lines in the original `DataFrame` `tax_data`
  
  * The assertion below will fail (and print an error message) if the results do not match. If that is the case, please review your code above.

In [13]:
assert((tax_data_valid_zip.shape[0] + nb_invalid_zip) == tax_data.shape[0])

### Identifying and Removing Lines with Missing Values

* How many lines contain at least one missing value `NaN` in the `tax_data_valid_zip` `DataFrame`?
  * Your answer can only use `DataFrame` methods and properties
* Assign the count of lines with `NaN` to a variable called `nb_missing_values`

In [14]:
## WRITE YOUR CODE HERE
nb_missing_values = tax_data_valid_zip.isnull().any(axis=1).sum()
#tax_data_valid_zip.info()
nb_missing_values

139

* Create a new `DataFrame` containing all the lines from `tax_data_valid_zip`, except lines containing missing values

* Call the new `DataFrame` `tax_data_valid_zip_cleaned`

In [15]:
## WRITE YOUR CODE HERE 
tax_data_valid_zip_cleaned = tax_data_valid_zip.dropna(axis=0)
tax_data_valid_zip_cleaned.shape

(166253, 91)

* Run the line below to confirm that the operation worked as expected. The assertion below is testing that:  
`nb_missing_values` + number of lines in `tax_data_valid_zip_cleaned` is equal to the number of lines in `tax_data_valid_zip`
  
* Note that assert will only print an error if the results do not match

In [16]:
assert((tax_data_valid_zip_cleaned.shape[0] + nb_missing_values) == tax_data_valid_zip.shape[0])

### Computing the Percentile Income per Zip Code

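* The function `compute_percentile_zip` below computes the percentile income per zip code

* In the original version the default is percentile=0.5, i.e., the function computes the median; the redefinition further down uses percentile=0.65 for the 65th-percentile question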

* Read the code and make sure you understand what it does before moving on to the next question


In [17]:
#def compute_percentile_zip(df_zip, percentile=0.5):
    #index_median = sum(( df_zip["N1"]/ sum(df_zip["N1"])).cumsum() <= percentile)
    #val_below_or_at_median = (df_zip["A00100"] /df_zip["N1"]).iloc[index_median]
    #return val_below_or_at_median

In [18]:
tax_data_valid_zip_cleaned = tax_data_valid_zip_cleaned.reset_index(drop=True)

In [19]:
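def compute_percentile_zip(df_zip, percentile=0.65):
    # Count the rows whose cumulative share of returns (N1) is <= percentile
    index_median = sum(( df_zip["N1"]/ sum(df_zip["N1"])).cumsum() <= percentile)
    # Income per return (A00100 / N1) for the row at that position
    val_below_or_at_median = (df_zip["A00100"] /df_zip["N1"]).iloc[index_median]
    return val_below_or_at_median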
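
To see what the function does on a single zip code, here is a small worked sketch. The helper name `percentile_income_sketch` and the numbers in `toy_zip` are hypothetical; the logic mirrors the function above under the assumption that the rows of a zip code are ordered from lowest to highest income bracket.

```python
import pandas as pd

def percentile_income_sketch(df_zip, percentile=0.5):
    # Cumulative share of returns across the rows of one zip code.
    share = (df_zip["N1"] / df_zip["N1"].sum()).cumsum()
    # Number of rows whose cumulative share is at or below the percentile.
    position = int((share <= percentile).sum())
    # Income per return for the row at that position.
    return (df_zip["A00100"] / df_zip["N1"]).iloc[position]

# Made-up zip code with four brackets: 40, 30, 20 and 10 returns.
toy_zip = pd.DataFrame({"N1": [40, 30, 20, 10],
                        "A00100": [400, 900, 1200, 1500]})

median_value = percentile_income_sketch(toy_zip, percentile=0.5)
upper_value = percentile_income_sketch(toy_zip, percentile=0.75)
print(median_value, upper_value)
```

The cumulative shares here are 0.4, 0.7, 0.9 and 1.0, so the median falls in the second row and the 75th percentile in the third. Note that `percentile=1.0` would push the position past the last row, so the sketch (like the original) assumes a percentile below 1.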

In [20]:
## WRITE YOUR CODE HERE
grouper = tax_data_valid_zip_cleaned.groupby('ZIPCODE').apply(compute_percentile_zip)

In [21]:
zip_rev_all = grouper.sort_values(ascending=False)
zip_rev_all.head()

ZIPCODE
33109.0    3954.114286
33480.0    3413.301538
94301.0    3109.443711
94027.0    3091.537013
10577.0    2414.855556
dtype: float64

- What are the three zip codes with the highest 65th percentile value for income?

In [22]:
## WRITE YOUR CODE HERE
zip_rev_all.head(3)

ZIPCODE
33109.0    3954.114286
33480.0    3413.301538
94301.0    3109.443711
dtype: float64

# 2 Working with the Medicaid Data

### Loading and exploring the data 

* Load the Medicaid data stored in the file `medicaid_data.csv` into a `DataFrame` called `medicaid_data`. The file is located in the `data` directory of the assignment folder. 
* Note that this file is quite large and may take some time to load on a computer with modest RAM resources (4GB or less)


In [23]:
## WRITE YOUR CODE HERE
medicaid_data = pd.read_csv('data/medicaid_data.csv')
medicaid_data.head()

Unnamed: 0,Utilization Type,State,NDC,Product Name,Units Reimbursed,Number of Prescriptions,Total Amount Reimbursed,Medicaid Amount Reimbursed,Non Medicaid Amount Reimbursed,Location
0,MCOU,PA,55150023930,Dexamethas,33.0,19.0,234.98,234.98,0.0,"(40.5773, -77.264)"
1,FFSU,NY,23917710,ALPHAGAN P,570.0,57.0,16006.34,16006.34,0.0,"(42.1497, -74.9384)"
2,MCOU,OR,13925050501,Dapsone 10,456.0,15.0,1052.42,1052.42,0.0,"(44.5672, -122.1269)"
3,FFSU,MN,51862006401,DIAZEPAM,780.0,16.0,89.6,77.6,12.0,"(45.7326, -93.9196)"
4,FFSU,MN,781237101,DEXTROAMPH,451.0,12.0,1411.24,198.93,1212.31,"(45.7326, -93.9196)"



- Modify `medicaid_data` to uppercase all the column names 

  - If your solution uses an assignment, the righthand side of the assignment (rvalue) can only use `DataFrame` or `Series` methods or properties


In [24]:
## WRITE YOUR CODE HERE
medicaid_data.columns = medicaid_data.columns.str.upper()
medicaid_data.head()

Unnamed: 0,UTILIZATION TYPE,STATE,NDC,PRODUCT NAME,UNITS REIMBURSED,NUMBER OF PRESCRIPTIONS,TOTAL AMOUNT REIMBURSED,MEDICAID AMOUNT REIMBURSED,NON MEDICAID AMOUNT REIMBURSED,LOCATION
0,MCOU,PA,55150023930,Dexamethas,33.0,19.0,234.98,234.98,0.0,"(40.5773, -77.264)"
1,FFSU,NY,23917710,ALPHAGAN P,570.0,57.0,16006.34,16006.34,0.0,"(42.1497, -74.9384)"
2,MCOU,OR,13925050501,Dapsone 10,456.0,15.0,1052.42,1052.42,0.0,"(44.5672, -122.1269)"
3,FFSU,MN,51862006401,DIAZEPAM,780.0,16.0,89.6,77.6,12.0,"(45.7326, -93.9196)"
4,FFSU,MN,781237101,DEXTROAMPH,451.0,12.0,1411.24,198.93,1212.31,"(45.7326, -93.9196)"



- Familiarize yourself with the data
  - The `NDC` column stands for National Drug Code, a universal product identifier for human drugs in the United States
  
  - The remaining column names are self-explanatory
  
- Explore the number of lines and columns in the data

- Check that your column headers are in uppercase

In [25]:
## WRITE YOUR CODE HERE
print(medicaid_data.columns)
print(medicaid_data.shape)
print(medicaid_data.size)

Index(['UTILIZATION TYPE', 'STATE', 'NDC', 'PRODUCT NAME', 'UNITS REIMBURSED',
       'NUMBER OF PRESCRIPTIONS', 'TOTAL AMOUNT REIMBURSED',
       'MEDICAID AMOUNT REIMBURSED', 'NON MEDICAID AMOUNT REIMBURSED',
       'LOCATION'],
      dtype='object')
(1695546, 10)
16955460


* If you explore the `LOCATION` column for all the entries whose `STATE` value is "HI", you'll notice that all the values are identical

* Are there any states that have more than one value for `LOCATION`?
  * Hint: think about using a sorted aggregation as part of a split-apply-combine operation to answer this question
  * Your answer can only use `DataFrame` or `Series` methods or properties


In [26]:
hawaii = medicaid_data.groupby('STATE').get_group('HI')
hawaii

Unnamed: 0,UTILIZATION TYPE,STATE,NDC,PRODUCT NAME,UNITS REIMBURSED,NUMBER OF PRESCRIPTIONS,TOTAL AMOUNT REIMBURSED,MEDICAID AMOUNT REIMBURSED,NON MEDICAID AMOUNT REIMBURSED,LOCATION
154,MCOU,HI,16714034804,DASETTA 1-,448.0,16.0,267.07,267.07,0.00,"(21.1098, -157.5311)"
294,MCOU,HI,49348002972,SM TRIPLE,821.6,27.0,143.34,143.34,0.00,"(21.1098, -157.5311)"
318,MCOU,HI,378043301,BUPROPION,5461.0,97.0,2206.56,2206.56,0.00,"(21.1098, -157.5311)"
397,MCOU,HI,49348004537,SM ALLERGY,5016.0,35.0,105.57,105.57,0.00,"(21.1098, -157.5311)"
557,MCOU,HI,54007928,BALSALAZID,3630.0,14.0,3204.75,3204.75,0.00,"(21.1098, -157.5311)"
...,...,...,...,...,...,...,...,...,...,...
1694941,MCOU,HI,45802006436,TRIAMCINOL,11520.0,138.0,1331.15,1331.15,0.00,"(21.1098, -157.5311)"
1695124,MCOU,HI,299590825,EPIDUO 0.1,1890.0,37.0,6509.69,6509.69,0.00,"(21.1098, -157.5311)"
1695291,MCOU,HI,69097084507,CYCLOBENZA,4811.0,129.0,714.83,707.49,7.34,"(21.1098, -157.5311)"
1695327,MCOU,HI,52544024928,MICROGESTI,9576.0,131.0,5166.67,5166.67,0.00,"(21.1098, -157.5311)"


In [27]:
def loc_comp(df):
    df['match'] = df.LOCATION.eq(df.LOCATION.shift())
    return df

In [28]:
groups = medicaid_data.groupby('STATE').apply(loc_comp)

In [29]:
new_df = groups[groups['match'] == False]
counts = new_df.value_counts(new_df['STATE'])
counts

STATE
XX    153253
AL         1
ND         1
NE         1
NH         1
NJ         1
NM         1
NV         1
NY         1
OH         1
OK         1
OR         1
PA         1
RI         1
SC         1
SD         1
TN         1
TX         1
UT         1
VA         1
VT         1
WA         1
WI         1
WV         1
WY         1
NC         1
AK         1
MS         1
IA         1
AR         1
AZ         1
CA         1
CO         1
CT         1
DC         1
DE         1
FL         1
GA         1
HI         1
ID         1
MO         1
IL         1
IN         1
KS         1
KY         1
LA         1
MA         1
MD         1
ME         1
MI         1
MN         1
MT         1
dtype: int64
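
The hint about a sorted aggregation can also be followed more directly by counting distinct `LOCATION` values per state. The sketch below uses a made-up frame (the state code `ZZ` and all coordinates are illustrative assumptions) rather than `medicaid_data`:

```python
import pandas as pd

# Hypothetical rows for illustration; "ZZ" is not a real state code.
toy = pd.DataFrame({
    "STATE": ["HI", "HI", "ZZ", "ZZ", "PA"],
    "LOCATION": ["(21.1, -157.5)", "(21.1, -157.5)",
                 "(1.0, 2.0)", "(3.0, 4.0)", "(40.5, -77.2)"],
})

# Split by state, count distinct locations, and sort the result.
locations_per_state = (
    toy.groupby("STATE")["LOCATION"].nunique().sort_values(ascending=False)
)

# States with more than one distinct location.
multi = locations_per_state[locations_per_state > 1]
print(multi)
```

`nunique()` ignores missing values by default, so a state whose `LOCATION` is entirely missing would show a count of 0 rather than a large number.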



* To compare medication prescriptions across states fairly, we need the number of Medicaid beneficiaries in each state. The following example illustrates why the `UNITS REIMBURSED` values for each medication should be normalized by the number of Medicaid enrollees in each state.
  
* The `medicaid_data` DataFrame shows that for the drug with NDC `61958180101` (HARVONI, used to treat Hepatitis C) there were 11,886 units sold in KY versus 40,142 in CA, which is about 3.4 times more units sold in CA than in KY. However, there are 1,284,193 Medicaid enrollees in KY versus 13,096,861 in CA. Once the units sold are normalized by enrollment, there were close to 3 times more HARVONI units per enrollee in KY than in CA. This is ___perhaps___ justified by the fact that KY has one of the highest rates of reported cases of Hepatitis C in the US (2.7% in KY versus 0.2% in CA).
  
https://www.cdc.gov/hepatitis/statistics/2015surveillance/pdfs/2015hepsurveillancerpt.pdf
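
The arithmetic behind the HARVONI comparison can be checked with a few lines of Python. The figures are the ones quoted above; the variable names are only illustrative.

```python
# Units sold and Medicaid enrollees, as quoted in the text above.
units = {"KY": 11_886, "CA": 40_142}
enrollees = {"KY": 1_284_193, "CA": 13_096_861}

# Raw comparison: how many times more units were sold in CA than in KY.
raw_ratio = units["CA"] / units["KY"]

# Normalized comparison: units per enrollee in KY relative to CA.
per_enrollee = {state: units[state] / enrollees[state] for state in units}
normalized_ratio = per_enrollee["KY"] / per_enrollee["CA"]

print(round(raw_ratio, 2), round(normalized_ratio, 2))
```

The raw ratio favors CA by a bit more than a factor of 3, while the per-enrollee ratio favors KY by roughly a factor of 3, which is the reversal the normalization is meant to expose.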

* The number of enrollees per state was obtained here:

    https://www.medicaid.gov/medicaid/managed-care/enrollment/index.html
    
    
* A parsed/processed version (`medicaid_enrollment.tsv`) can be found in the `data` directory of the assignment folder. Use `pandas` to load the `medicaid_enrollment.tsv` file into a DataFrame named `medicaid_enrollment`

In [30]:
## WRITE YOUR CODE HERE
medicaid_enrollment = pd.read_csv('data/medicaid_enrollment.tsv', sep='\t')

* Modify `medicaid_enrollment` to uppercase all the column names 

  * Your answer can only use `DataFrame` or `Series` methods or properties
  * Do not hardcode the operation by uppercasing the columns yourself


In [31]:
## WRITE YOUR CODE HERE
medicaid_enrollment.columns = medicaid_enrollment.columns.str.upper()
medicaid_enrollment.columns

Index(['STATE', 'TOTAL MEDICAID ENROLLEES'], dtype='object')

* Note that some states/territories have missing values. Remove the missing values and save the resulting `DataFrame` as a new variable named `medicaid_enrollment_cleaned`

* Pay attention to how 'n/a' is given here!
* After cleaning, do you still have the Guam entry? If so, reconsider what missing values means in this context

In [32]:
## WRITE YOUR CODE HERE
medicaid_enrollment_cleaned = medicaid_enrollment.dropna()
medicaid_enrollment_cleaned = medicaid_enrollment_cleaned[medicaid_enrollment_cleaned['TOTAL MEDICAID ENROLLEES'].str.contains('n/a') == False]

### Converting `TOTAL MEDICAID ENROLLEES` Data Type

* Because the values in the `TOTAL MEDICAID ENROLLEES` column contain commas in the file (e.g., 3,269,999 instead of 3269999), `pandas` has read that column as strings. We need to convert the column from string to `int` since we will be using it in an arithmetic expression during normalization

 

* Inspect the `dtype` property of "TOTAL MEDICAID ENROLLEES" column, and  make sure that the data type is `int`


In [33]:
## WRITE YOUR CODE HERE
medicaid_enrollment_cleaned['TOTAL MEDICAID ENROLLEES'] = medicaid_enrollment_cleaned['TOTAL MEDICAID ENROLLEES'].str.replace(',', '').astype(int)
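
The two cleaning steps (dropping the 'n/a' entries and converting the comma-formatted strings) can also be combined with `pd.to_numeric`. This is a minimal sketch on a made-up frame: the Alabama and Alaska figures are the ones shown later in this assignment, while the exact spelling of the Guam entry (`" n/a"`, with a leading space) is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for `medicaid_enrollment`; the Guam value is hypothetical.
toy = pd.DataFrame({
    "STATE": ["Alabama", "Guam", "Alaska"],
    "TOTAL MEDICAID ENROLLEES": ["1,050,989", " n/a", "164,783"],
})

# Strip the thousands separators, then coerce anything non-numeric to NaN.
numeric = pd.to_numeric(
    toy["TOTAL MEDICAID ENROLLEES"].str.replace(",", "", regex=False),
    errors="coerce",
)

# Replace the column, drop the rows that failed to parse, and cast to int.
cleaned = (
    toy.assign(**{"TOTAL MEDICAID ENROLLEES": numeric})
       .dropna()
       .astype({"TOTAL MEDICAID ENROLLEES": int})
)
print(cleaned.dtypes)
```

Using `errors="coerce"` means the code does not depend on how the missing marker is spelled, at the cost of silently discarding any other malformed value.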

### Associating `medicaid_data` and `medicaid_enrollment_cleaned`

- We can use the shared State information across both tables to associate both tables (SQL JOIN).
- However,  `medicaid_data` contains two-letter state abbreviations, while `medicaid_enrollment_cleaned` contains the complete state name
  - We need to convert (or append) two-letter state abbreviations to `medicaid_enrollment_cleaned`

- Pandas can read HTML and parse it for tables. We will use that functionality to read in the state abbreviations from a web page (50states.com).
  - A brief description of what the code does is included in the comments

In [34]:
import requests

url = 'https://www.50states.com/abbreviations.htm'
header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}

r = requests.get(url, headers=header)

tables = pd.read_html(r.text)


# We access the desired table by giving its index.
# Since the page contains only one table, we can access that table using index 0
Codes_abbreviations = tables[0]
Codes_abbreviations.head(5)

Unnamed: 0,US STATE,POSTAL ABBREVIATION,STANDARD ABBREVIATION
0,Alabama,AL,Ala.
1,Alaska,AK,Alaska
2,Arizona,AZ,Ariz.
3,Arkansas,AR,Ark.
4,California,CA,Calif.


* Change the `DataFrame`'s headers from ['US State:', 'Abbreviation:'] to ['US STATE', 'ABBREVIATION']

  * You can hard code this operation


In [35]:
## WRITE YOUR CODE HERE 
Codes_abbreviations = Codes_abbreviations.rename(columns={'POSTAL ABBREVIATION': 'ABBREVIATION'})

* Combine the tables `medicaid_enrollment_cleaned` and `Codes_abbreviations` such that the resulting `DataFrame` contains all the columns in `medicaid_enrollment_cleaned` and only `ABBREVIATION` from `Codes_abbreviations` 
* Save the results to variable named `medicaid_enrollment_cleaned_with_zip`

- `medicaid_enrollment_cleaned_with_zip` should look like the following ( '...' represents remaining data that is not shown):


```
   STATE    Total Medicaid Enrollees    ABBREVIATION
0    Alabama    1,050,989    AL
1    Alaska    164,783    AK
2    Arizona    1,740,520    AZ
3    Arkansas    762,166    AR
4    California    13,096,861    CA
...
```

* We did not cover joins in class, but you can find a plethora of examples on how to do this online. See for instance:

`https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html`

* If you cannot get it to work, contact the `TA` for the solution. You will not be penalized if you don't answer this question. 

In [36]:
medicaid_enrollment_cleaned_with_zip = medicaid_enrollment_cleaned.merge(Codes_abbreviations, how='left', left_on='STATE', right_on='US STATE')
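
For readers new to joins, the sketch below shows the same kind of left join on two tiny made-up tables. The extra `indicator=True` argument adds a `_merge` column that makes unmatched rows easy to spot. The `Atlantis` row is a hypothetical entry included only to show what a failed match looks like.

```python
import pandas as pd

# Toy stand-ins for the enrollment and abbreviation tables.
left = pd.DataFrame({
    "STATE": ["Alabama", "Alaska", "Atlantis"],
    "TOTAL MEDICAID ENROLLEES": [1050989, 164783, 1],
})
right = pd.DataFrame({
    "US STATE": ["Alabama", "Alaska"],
    "ABBREVIATION": ["AL", "AK"],
})

# Left join: keep every row of `left`, attach matching columns from `right`.
merged = left.merge(right, how="left", left_on="STATE",
                    right_on="US STATE", indicator=True)

# Rows present only in `left` found no abbreviation.
unmatched = merged.loc[merged["_merge"] == "left_only", "STATE"].tolist()
print(unmatched)
```

Checking the unmatched rows right after a merge is a cheap way to catch spelling differences between the two key columns.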

* We have no further use for the column STATE in  `medicaid_enrollment_cleaned_with_zip`
  * Remove the column and make sure the data in `medicaid_enrollment_cleaned_with_zip` looks like the following  ( `...` represents remaining data that is not shown):

```
    Total Medicaid Enrollees    ABBREVIATION
0   1,050,989                  AL
1   164,783                    AK
2   1,740,520                  AZ
3   762,166                    AR
4   13,096,861                 CA
....
```

In [37]:
medicaid_enrollment_cleaned_with_zip = medicaid_enrollment_cleaned_with_zip.drop(['US STATE',
                                                                                  'STANDARD ABBREVIATION',
                                                                                  'STATE'], axis=1)

- Use `DataFrame medicaid_enrollment_cleaned_with_zip` to assign the appropriate number of Medicaid enrollees to each entry in the `medicaid_data`

   That is, instead of the 10 original columns, `medicaid_data` will now have an 11th column representing the `TOTAL MEDICAID ENROLLEES` according to the STATE value in the entry.
   
- Save the resulting DataFrame into a new variable called `medicaid_data_w_enrollments`
- The resulting DataFrame should look like the following (`...` represents remaining data that is not shown):

```
UTILIZATION TYPE    STATE    NDC    PRODUCT NAME    UNITS REIMBURSED    NUMBER OF PRESCRIPTIONS    TOTAL AMOUNT REIMBURSED    MEDICAID AMOUNT REIMBURSED    NON MEDICAID AMOUNT REIMBURSED    LOCATION
0    MCOU    PA    55150023930    Dexamethas    33.0    19.0    234.98    234.98    0.0    (40.5773, -77.264)
1    FFSU    NY    23917710    ALPHAGAN P    570.0    57.0    16006.34    16006.34    0.0    (42.1497, -74.9384)
2    MCOU    OR    13925050501    Dapsone 10    456.0    15.0    1052.42    1052.42    0.0    (44.5672, -122.1269)
...
```

* The order of the columns in the `DataFrame` is not important. This answer uses the same approach as the one used to `merge` the tables above. 

In [38]:
## WRITE YOUR CODE HERE
medicaid_data_w_enrollments = medicaid_data.merge(medicaid_enrollment_cleaned_with_zip,
                                                  how='left', left_on='STATE', right_on='ABBREVIATION')

In [39]:
medicaid_data_w_enrollments = medicaid_data_w_enrollments.drop(['ABBREVIATION'], axis=1)

- Remove any lines where "STATE" or "PRODUCT NAME" are missing from  `medicaid_data_w_enrollments`

In [40]:
## WRITE YOUR CODE HERE
medicaid_data_w_enrollments.dropna(subset=['STATE', 'PRODUCT NAME'], inplace=True)
medicaid_data_w_enrollments = medicaid_data_w_enrollments[~medicaid_data_w_enrollments.STATE.str.contains('XX')]

- Use ["STATE", "NDC"] as a hierarchical index for `medicaid_data_w_enrollments`. Recall that a hierarchical index is simply an index with multiple levels of indexing (multiple columns)
  * Hint: the function to set an index on a `DataFrame` can take a single column name or a list of column names. The list here is  ["STATE", "NDC"]
- Call the new data `medicaid_data_w_enrollments_hierarch`
- Inspect your data to make sure the new index now has two levels (STATE and NDC)

In [41]:
## WRITE YOUR CODE HERE 
medicaid_data_w_enrollments_hierarch = medicaid_data_w_enrollments.set_index(['STATE', 'NDC'])

In [42]:
medicaid_data_w_enrollments_hierarch.head()

STATE,NDC,UTILIZATION TYPE,PRODUCT NAME,UNITS REIMBURSED,NUMBER OF PRESCRIPTIONS,TOTAL AMOUNT REIMBURSED,MEDICAID AMOUNT REIMBURSED,NON MEDICAID AMOUNT REIMBURSED,LOCATION,TOTAL MEDICAID ENROLLEES
PA,55150023930,MCOU,Dexamethas,33.0,19.0,234.98,234.98,0.0,"(40.5773, -77.264)",2569232.0
NY,23917710,FFSU,ALPHAGAN P,570.0,57.0,16006.34,16006.34,0.0,"(42.1497, -74.9384)",6281038.0
OR,13925050501,MCOU,Dapsone 10,456.0,15.0,1052.42,1052.42,0.0,"(44.5672, -122.1269)",1123913.0
MN,51862006401,FFSU,DIAZEPAM,780.0,16.0,89.6,77.6,12.0,"(45.7326, -93.9196)",1052521.0
MN,781237101,FFSU,DEXTROAMPH,451.0,12.0,1411.24,198.93,1212.31,"(45.7326, -93.9196)",1052521.0



* Write a single Pandas expression to print all the lines containing NDC 61958180101 in "PA"

  * Use a single indexing call (bracket notation) using `loc`

  * Hint: since your index is hierarchical, `loc` expects two values, the first for STATE and the second for NDC
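
As a small illustration of indexing a two-level index with `loc`, the sketch below builds a made-up frame (the NDC values `111` and `222` and the unit counts are hypothetical) and selects all rows for one (STATE, NDC) pair:

```python
import pandas as pd

# Toy frame with a repeated (STATE, NDC) pair, as in the real data.
toy = pd.DataFrame({
    "STATE": ["PA", "PA", "PA", "NY"],
    "NDC": [111, 111, 222, 111],
    "UNITS REIMBURSED": [10.0, 15.0, 20.0, 30.0],
}).set_index(["STATE", "NDC"]).sort_index()

# A tuple addresses both index levels; the trailing ':' keeps all columns.
rows = toy.loc[("PA", 111), :]
print(rows)
```

Sorting the index first is optional for correctness, but lookups on a sorted `MultiIndex` are generally faster.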


In [43]:
medicaid_data_w_enrollments_hierarch.loc['PA', 61958180101]

STATE,NDC,UTILIZATION TYPE,PRODUCT NAME,UNITS REIMBURSED,NUMBER OF PRESCRIPTIONS,TOTAL AMOUNT REIMBURSED,MEDICAID AMOUNT REIMBURSED,NON MEDICAID AMOUNT REIMBURSED,LOCATION,TOTAL MEDICAID ENROLLEES
PA,61958180101,FFSU,Harvoni (,924.0,33.0,1017644.54,985269.84,32374.7,"(40.5773, -77.264)",2569232.0
PA,61958180101,MCOU,Harvoni (,2604.0,93.0,2952891.16,2860100.54,92790.62,"(40.5773, -77.264)",2569232.0
PA,61958180101,FFSU,Harvoni (,1008.0,36.0,1107725.22,1107725.22,0.0,"(40.5773, -77.264)",2569232.0
PA,61958180101,MCOU,Harvoni (,1932.0,69.0,2178602.3,2118612.3,59990.0,"(40.5773, -77.264)",2569232.0
PA,61958180101,FFSU,Harvoni (,924.0,33.0,1015383.6,1015383.6,0.0,"(40.5773, -77.264)",2569232.0
PA,61958180101,MCOU,Harvoni (,3220.0,115.0,3599192.35,3482188.68,117003.67,"(40.5773, -77.264)",2569232.0


In [44]:
grouped = medicaid_data_w_enrollments_hierarch.groupby(['STATE', 'NDC'])
ratio_reimbursed = grouped['UNITS REIMBURSED'].sum() / grouped['TOTAL MEDICAID ENROLLEES'].first()

In [45]:
ratio_reimbursed = ratio_reimbursed.astype(float)
medicaid_reimbursement_per_enrollee = np.log2(ratio_reimbursed)
medicaid_reimbursement_per_enrollee.head()

STATE  NDC    
AK     2143380    -9.609109
       2143480   -10.008280
       2322730    -6.109830
       2322830    -4.444321
       2322930    -3.855995
dtype: float64

- To facilitate working with the final data, we are going to unstack `medicaid_reimbursement_per_enrollee` into a variable called  `medicaid_norm_ndc`

- Using `medicaid_reimbursement_per_enrollee`, generate a `DataFrame` where: 
  - index should be the two-letter state symbol 
  - the column names should be the NDC codes 

- The `DataFrame`  should be formatted as in the image below
  - Hint: simply unstack the data


<img src="media/unstacked.png" alt="drawing" style="width:900px;"/>
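
The unstack step can be illustrated on a tiny `Series` with a (STATE, NDC) index. The values and NDC codes below are made up for the sketch:

```python
import pandas as pd

# Toy log-ratios indexed by (STATE, NDC); AL has no entry for NDC 222.
idx = pd.MultiIndex.from_tuples(
    [("AK", 111), ("AK", 222), ("AL", 111)], names=["STATE", "NDC"]
)
long_form = pd.Series([-9.6, -10.0, -9.9], index=idx)

# Move the NDC level from the row index to the columns.
wide = long_form.unstack(level="NDC")
print(wide)
```

Combinations that do not occur in the long form, such as (AL, 222) here, show up as `NaN` in the wide table.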


In [46]:
## WRITE YOUR CODE HERE 
medicaid_norm_ndc = medicaid_reimbursement_per_enrollee.unstack(level=-1)

In [47]:
medicaid_norm_ndc.head()

STATE,2143380,2143480,2144511,2144527,2197590,2322730,2322830,2322930,2323030,2323130,...,76439035930,76439035990,76439036290,76439036390,76439036490,76439036590,99207013070,99207024005,99207026012,99207046330
AK,-9.609109,-10.00828,,,,-6.10983,-4.444321,-3.855995,,,...,,,,,,,,,,
AL,-9.940595,-9.805485,,,,-5.33626,-4.243688,-3.645575,,,...,,,,,-8.518493,,,,,
AR,-12.985157,-12.657103,,,,-4.720964,-2.886635,-2.617579,,,...,,,,,,,,,,
AZ,-11.53387,-10.821194,,,,-5.312806,-3.998085,-3.40151,,,...,-11.824196,,,,,,,,,
CA,-10.926863,-10.698372,-16.127018,,-9.464364,-7.429936,-6.286911,-5.864352,,,...,,,,,,,-11.952415,,,-15.150865


#### Exploring the data (very briefly) 
- What is the drug with the highest log-normalized ratio in Hawaii?


In [48]:
medicaid_norm_ndc.loc['HI'].sort_values(ascending=False).head()

NDC
43386009019    10.511181
43386006019     3.448037
10572010001     2.599661
116200116       2.284969
62175044601     2.195764
Name: HI, dtype: float64

* Investigate the NDC of the product with the highest log-normalized ratio in Hawaii 
  * What is it used for?

* Compare the value of `Units Reimbursed` for that product between HI and other states (take for instance MA, FL, OR and WA)
* Can you think of reasons why this product has the highest log-normalized ratio in Hawaii?

In [49]:
medicaid_data_w_enrollments_hierarch.loc['HI', 43386009019]

STATE,NDC,UTILIZATION TYPE,PRODUCT NAME,UNITS REIMBURSED,NUMBER OF PRESCRIPTIONS,TOTAL AMOUNT REIMBURSED,MEDICAID AMOUNT REIMBURSED,NON MEDICAID AMOUNT REIMBURSED,LOCATION,TOTAL MEDICAID ENROLLEES
HI,43386009019,MCOU,GAVILYTE-G,472000.0,118.0,1754.68,1754.68,0.0,"(21.1098, -157.5311)",340513.0
HI,43386009019,MCOU,GAVILYTE-G,376012.0,105.0,1462.6,1462.6,0.0,"(21.1098, -157.5311)",340513.0
HI,43386009019,MCOU,GAVILYTE-G,496104006.0,63.0,847.56,839.61,7.95,"(21.1098, -157.5311)",340513.0


In [50]:
medicaid_data_w_enrollments_hierarch.loc['WA', 43386009019]

STATE,NDC,UTILIZATION TYPE,PRODUCT NAME,UNITS REIMBURSED,NUMBER OF PRESCRIPTIONS,TOTAL AMOUNT REIMBURSED,MEDICAID AMOUNT REIMBURSED,NON MEDICAID AMOUNT REIMBURSED,LOCATION,TOTAL MEDICAID ENROLLEES
WA,43386009019,FFSU,GaviLyte -,58.0,56.0,1083.76,1083.76,0.0,"(47.3917, -121.5708)",1771679.0
WA,43386009019,FFSU,GaviLyte -,65.0,65.0,1045.0,1031.66,13.34,"(47.3917, -121.5708)",1771679.0
WA,43386009019,MCOU,GaviLyte -,2702.118,2677.0,31709.24,31691.66,17.58,"(47.3917, -121.5708)",1771679.0
WA,43386009019,MCOU,GaviLyte -,2642.0,2578.0,28734.28,28732.78,1.5,"(47.3917, -121.5708)",1771679.0
WA,43386009019,MCOU,GaviLyte -,2753.149,2719.0,34745.72,34743.22,2.5,"(47.3917, -121.5708)",1771679.0
WA,43386009019,FFSU,GaviLyte -,87.0,85.0,1389.96,1387.77,2.19,"(47.3917, -121.5708)",1771679.0


In [51]:
medicaid_data_w_enrollments_hierarch.loc['MA', 43386009019]

STATE,NDC,UTILIZATION TYPE,PRODUCT NAME,UNITS REIMBURSED,NUMBER OF PRESCRIPTIONS,TOTAL AMOUNT REIMBURSED,MEDICAID AMOUNT REIMBURSED,NON MEDICAID AMOUNT REIMBURSED,LOCATION,TOTAL MEDICAID ENROLLEES
MA,43386009019,FFSU,GAVILYTE-G,461.0,294.0,7135.59,5720.4,1415.19,"(42.2373, -71.5314)",1829618.0
MA,43386009019,MCOU,GAVILYTE-G,486.0,471.0,4587.04,4587.04,0.0,"(42.2373, -71.5314)",1829618.0
MA,43386009019,FFSU,GAVILYTE-G,1086.0,479.0,9185.81,9185.81,0.0,"(42.2373, -71.5314)",1829618.0
MA,43386009019,FFSU,GAVILYTE-G,676.0,499.0,11456.6,9617.89,1838.71,"(42.2373, -71.5314)",1829618.0
MA,43386009019,MCOU,GAVILYTE-G,753.0,727.0,7264.42,7264.42,0.0,"(42.2373, -71.5314)",1829618.0
MA,43386009019,MCOU,GAVILYTE-G,1971.0,925.0,12118.86,9820.33,2298.53,"(42.2373, -71.5314)",1829618.0


In [52]:
medicaid_data_w_enrollments_hierarch.loc['FL', 43386009019]

STATE,NDC,UTILIZATION TYPE,PRODUCT NAME,UNITS REIMBURSED,NUMBER OF PRESCRIPTIONS,TOTAL AMOUNT REIMBURSED,MEDICAID AMOUNT REIMBURSED,NON MEDICAID AMOUNT REIMBURSED,LOCATION,TOTAL MEDICAID ENROLLEES
FL,43386009019,MCOU,GAVILYTE-G,596000.0,149.0,1910.49,1906.79,3.7,"(27.8333, -81.717)",3808334.0
FL,43386009019,MCOU,GAVILYTE-G,676000.0,169.0,2137.42,2137.42,0.0,"(27.8333, -81.717)",3808334.0
FL,43386009019,MCOU,GAVILYTE-G,576000.0,144.0,1920.29,1920.29,0.0,"(27.8333, -81.717)",3808334.0


In [53]:
medicaid_data_w_enrollments_hierarch.loc['AZ', 43386009019]



STATE,NDC,UTILIZATION TYPE,PRODUCT NAME,UNITS REIMBURSED,NUMBER OF PRESCRIPTIONS,TOTAL AMOUNT REIMBURSED,MEDICAID AMOUNT REIMBURSED,NON MEDICAID AMOUNT REIMBURSED,LOCATION,TOTAL MEDICAID ENROLLEES
AZ,43386009019,MCOU,GAVILYTE-G,7828400.0,1909.0,28182.24,28131.21,51.03,"(33.7712, -111.3877)",1740520.0
AZ,43386009019,MCOU,GAVILYTE-G,6060000.0,1499.0,20287.85,20267.83,20.02,"(33.7712, -111.3877)",1740520.0
AZ,43386009019,MCOU,GAVILYTE-G,7040000.0,1731.0,26035.13,25939.95,95.18,"(33.7712, -111.3877)",1740520.0


* Find and list all unique `NDC`s for which the difference between the largest and second-largest log-normalized ratio across states is at least 10.

* For instance:
  * The highest log-normalized ratio for `00591289749` (`AZACITIDIN`) is in `OK`, where it is `-1.025642`.
  * The second-highest log-normalized ratio for `00591289749` is in `GA`, where it is `-12.623428`.
  * The difference is about 11.6, so this `NDC` qualifies.
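One way to express the task above is to compute, for each `NDC` column, the gap between its two largest non-missing values and keep the columns where that gap is at least 10. The sketch below uses a small made-up frame (`toy_norm`, with hypothetical column names `ndc_a`, `ndc_b`, `ndc_c`) in place of `medicaid_norm_ndc`; the helper `top_two_gap` is also an illustrative name, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Toy stand-in for medicaid_norm_ndc: rows are states, columns are NDCs,
# values are log-normalized ratios (NaN where a state has no record).
toy_norm = pd.DataFrame(
    {
        "ndc_a": [-1.0, -12.5, -13.0, np.nan],
        "ndc_b": [-5.0, -4.5, -6.0, -4.0],
        "ndc_c": [np.nan, -3.0, np.nan, np.nan],
    },
    index=pd.Index(["OK", "GA", "TX", "DC"], name="STATE"),
)


def top_two_gap(col):
    """Largest minus second-largest non-missing value; NaN if fewer than two."""
    vals = col.dropna().sort_values(ascending=False)
    if len(vals) < 2:
        return np.nan
    return vals.iloc[0] - vals.iloc[1]


# One gap per NDC column.
gaps = toy_norm.apply(top_two_gap)

# NDCs whose top state is at least 10 log units above the runner-up.
# NaN gaps compare as False, so single-state NDCs are excluded.
outlier_ndcs = gaps[gaps >= 10].index.tolist()
print(outlier_ndcs)
```

On the real data the same two lines (`apply` followed by the `>= 10` filter) would be run against `medicaid_norm_ndc`, assuming it keeps the states-by-NDC layout shown in the next cell.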
  


In [54]:
medicaid_norm_ndc

NDC,2143380,2143480,2144511,2144527,2197590,2322730,2322830,2322930,2323030,2323130,...,76439035930,76439035990,76439036290,76439036390,76439036490,76439036590,99207013070,99207024005,99207026012,99207046330
STATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
AK,-9.609109,-10.00828,,,,-6.10983,-4.444321,-3.855995,,,...,,,,,,,,,,
AL,-9.940595,-9.805485,,,,-5.33626,-4.243688,-3.645575,,,...,,,,,-8.518493,,,,,
AR,-12.985157,-12.657103,,,,-4.720964,-2.886635,-2.617579,,,...,,,,,,,,,,
AZ,-11.53387,-10.821194,,,,-5.312806,-3.998085,-3.40151,,,...,-11.824196,,,,,,,,,
CA,-10.926863,-10.698372,-16.127018,,-9.464364,-7.429936,-6.286911,-5.864352,,,...,,,,,,,-11.952415,,,-15.150865
CO,-10.862982,-9.925954,,,-7.608472,-5.075762,-3.870805,-3.362432,,,...,,,,,,,,,,-11.041431
CT,-8.405103,-7.948475,-12.411014,,-6.677343,-4.920683,-3.735907,-3.137814,,,...,,,,,,-7.683292,,,,-10.459198
DC,,,,,,,,,,,...,,,,,,,,,,
DE,,-10.274536,,,,-6.343799,-4.54307,-3.888018,,,...,,,,,,,,,,
FL,-13.534299,-12.569558,-18.053374,,-10.239592,-5.474293,-3.984721,-3.643438,,,...,,,,,,,,,,


In [55]:
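## WRITE YOUR CODE HERE 
# Note: given a list of columns, nlargest ranks whole rows (states) by the first
# NDC column and uses the remaining columns only to break ties, so this returns
# two states rather than the top two values of every NDC.
medicaid_norm_ndc.nlargest(2, medicaid_norm_ndc.columns)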

NDC,2143380,2143480,2144511,2144527,2197590,2322730,2322830,2322930,2323030,2323130,...,76439035930,76439035990,76439036290,76439036390,76439036490,76439036590,99207013070,99207024005,99207026012,99207046330
STATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
PA,-7.378007,-7.6728,-14.663549,,-8.51565,-4.623149,-3.091035,-2.718405,,,...,,,,,-9.698115,-9.289037,,,,
MA,-7.921997,-7.803463,-13.472194,,-7.741402,-5.414724,-4.592154,-4.205626,,,...,,,,,,,,,,


In [56]:
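# Sort each NDC column independently, largest first (NaN last). The state labels
# in the index no longer match the values; row 0 holds each column's maximum.
medicaid_norm_ndc.apply(lambda x: x.sort_values(ascending=False).values)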

NDC,2143380,2143480,2144511,2144527,2197590,2322730,2322830,2322930,2323030,2323130,...,76439035930,76439035990,76439036290,76439036390,76439036490,76439036590,99207013070,99207024005,99207026012,99207046330
STATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
AK,-7.378007,-7.6728,-12.411014,-16.63842,-5.55162,-3.204003,-1.802734,-1.327376,-9.372872,-8.261931,...,-8.55878,-10.151948,-8.441462,-7.767459,-5.335093,-6.053763,-11.952415,-14.031825,-5.502345,-10.459198
AL,-7.921997,-7.688026,-13.472194,,-6.491564,-3.42415,-2.224795,-1.793863,,,...,-9.749994,-10.993223,-10.37409,-12.274537,-7.626261,-6.725677,,,-6.134353,-11.041431
AR,-8.305096,-7.803463,-14.055095,,-6.634022,-3.647826,-2.334932,-1.990146,,,...,-11.824196,,,,-8.24698,-7.216145,,,-6.247495,-15.150865
AZ,-8.325745,-7.883399,-14.069577,,-6.677343,-3.894089,-2.487967,-2.09429,,,...,,,,,-8.518493,-7.683292,,,-9.093277,
CA,-8.405103,-7.948475,-14.338234,,-6.942597,-4.210681,-2.549454,-2.370303,,,...,,,,,-9.698115,-7.914855,,,-9.560169,
CO,-8.540125,-7.987181,-14.481856,,-7.487729,-4.272083,-2.587634,-2.40611,,,...,,,,,-10.854467,-9.060588,,,-10.235734,
CT,-8.661399,-8.097982,-14.663549,,-7.584248,-4.400475,-2.627334,-2.499801,,,...,,,,,-11.150101,-9.289037,,,-12.005426,
DC,-8.747994,-8.598997,-14.825948,,-7.608472,-4.433167,-2.725213,-2.51749,,,...,,,,,-11.151096,-9.464321,,,,
DE,-8.788942,-8.629338,-15.234263,,-7.741402,-4.462918,-2.886635,-2.546478,,,...,,,,,-12.341651,-10.150101,,,,
FL,-9.150146,-8.747732,-16.127018,,-8.51565,-4.564379,-2.979393,-2.562984,,,...,,,,,-12.588218,-10.24149,,,,


- The drug `AZACITIDINE` has a much higher normalized `UNITS REIMBURSED` in OK than in any other state.
   - Its log-normalized value in OK is -1.025642 (a ratio of 0.49119167009735065).
   - The second-highest state, GA, has a log value of -12.623428 (a ratio of 0.000158478197834722).
- Oklahoma is not a high-incidence state for cancer.
- Could the following explain what is happening in Oklahoma?

https://www.centerwatch.com/clinical-trials/listings/92093/acute-myeloid-leukemia-aml-study-asp2215-gilteritinib-by/?&geo_lat=35.4675602&geo_lng=-97.5164276&radius=10
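The ratios quoted for OK and GA are consistent with the log values being base-2 logarithms; that base is inferred from the numbers themselves rather than stated in the notebook. Under that assumption, a short check converts the logs back to ratios and expresses the gap as a fold difference:

```python
# Log-normalized values quoted above for NDC 00591289749.
log_ok = -1.025642
log_ga = -12.623428

# Assumption: the logs are base 2, which reproduces the quoted ratios.
ratio_ok = 2 ** log_ok
ratio_ga = 2 ** log_ga

# A difference of d log2 units corresponds to a 2**d fold difference.
fold = 2 ** (log_ok - log_ga)

print(ratio_ok, ratio_ga, fold)
```

A gap of about 11.6 log units therefore means the normalized units in OK are on the order of a few thousand times those in GA, which is why the threshold of 10 isolates such extreme cases.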