<a href="https://colab.research.google.com/github/christophermalone/HLA311/blob/main/Module2_Part3C_Advanced.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 2 | Part 3C: GROUP BY / PIVOT - Advanced Level 

The purpose of this IPython notebook is to show how a data scientist obtains basic summaries of a data table using a GROUP BY or PIVOT action.

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

## Example #1 - Uninsured Rates 

For this example, we will reconsider the Small Area Health Insurance Estimates (SAHIE) data from the United States Census Bureau. These data provide counts of the number of people with and without health insurance. This subset includes state-level information for all ages (agecat = 0) and all income groups (iprcat = 0), broken out by race (racecat).

Source: https://www.census.gov/data/datasets/time-series/demo/sahie/estimates-acs.html 


Google Folder: <a href="https://drive.google.com/drive/folders/1Z827qwQyI0AaeVy4n-XnxC3iFKDm8sxi?usp=sharing" target="_blank">Link to Data</a>

The following fields are included in this data table:
*   State_Name: state name
*   State_Abbreviation: abbreviation for state
*   Census_Region: Identifies the Census Region for the State (Levels: Northeast, South, Midwest, or West)
*   Medicare_Expansion: Whether the state had adopted Medicare expansion in the year the data were collected (Yes or No)
*   agecat: category of age (agecat = 0, all ages)
*   racecat: category of race (0:All races, 1:White, 2:Black, 3:Hispanic)
*   sexcat: category of sex (sexcat = 0, all sexes)
*   iprcat: category of income (iprcat = 0, all income groups)
*   NumberInGroup: Number of people in this demographic group
*   Uninsured: Number of people uninsured in this demographic group
*   Insured: Number of people insured in this demographic group
*   PercentUninsured: Percent of people uninsured in this demographic group


<table width='100%' ><tr><td bgcolor='green'></td></tr></table>


<strong>Goal</strong>: Compute the percent of people who are uninsured by race, and then by race across Medicare expansion status.


### Load the data in Python

In [31]:
#Load the pandas package
import pandas as pd

In [32]:
#Use read_csv to read the comma-delimited file into Python
SAHIE_Race = pd.read_csv('http://www.statsclass.org/online/hla311/datasets/SAHIE_StateData_Race.csv') 

In [33]:
#How many records and fields?
SAHIE_Race.shape

(204, 12)

In [34]:
#Look at first few rows of the data
SAHIE_Race.head(n=5)

|   | State_Name | State_Abbreviation | Census_Region | Medicare_Expansion | agecat | racecat | sexcat | iprcat | NumberInGroup | Uninsured | Insured | PercentUninsured |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | AL | South | No | 0 | 0 | 0 | 0 | 3955117 | 470052 | 3485065 | 11.9 |
| 1 | Alabama | AL | South | No | 0 | 1 | 0 | 0 | 2512149 | 255754 | 2256395 | 10.2 |
| 2 | Alabama | AL | South | No | 0 | 2 | 0 | 0 | 1081918 | 143788 | 938130 | 13.3 |
| 3 | Alabama | AL | South | No | 0 | 3 | 0 | 0 | 202025 | 52469 | 149556 | 26.0 |
| 4 | Alaska | AK | West | Yes | 0 | 0 | 0 | 0 | 636220 | 88519 | 547701 | 13.9 |
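
The count fields in these rows are related in a simple way: Insured plus Uninsured equals NumberInGroup, and PercentUninsured is Uninsured divided by NumberInGroup, expressed as a percent and rounded to one decimal place. A minimal sketch that checks this for the Alabama, all-races row (index 0) is given below; the variable names are illustrative only and are not part of the data file.

```python
# Values copied from the first row of the head() output (Alabama, racecat = 0)
number_in_group = 3955117
uninsured = 470052
insured = 3485065

# The two counts should partition the group
counts_add_up = (uninsured + insured == number_in_group)

# PercentUninsured is the uninsured share of the group, as a percent
percent_uninsured = round(100 * uninsured / number_in_group, 1)

print(counts_add_up)
print(percent_uninsured)
```

The computed percent agrees with the 11.9 reported in the PercentUninsured field for that row.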


Next, install the dfply package, which provides dplyr-style data verbs for Python.

In [35]:
#Install the dfply package (the % prefix makes the notebook magic explicit)
%pip install dfply



In [36]:
#Load the dfply package
from dfply import *

### Obtain % Uninsured by Race

In [37]:
# Use Python to obtain the % uninsured by race
# racecat levels {0: All races, 1: White, 2: Black, 3: Hispanic}
# racecat = 0 is excluded from the summary table
# Note: Percent_Uninsured is reported as a proportion (multiply by 100 for a percent)
OutcomeTable = (
    SAHIE_Race
    >> filter_by(X.racecat != 0)
    >> group_by(X.racecat)
    >> summarize(Percent_Uninsured = X.Uninsured.sum() / X.NumberInGroup.sum())
)

#Pretty print the desired table
print(OutcomeTable.to_string(index=False))

 racecat  Percent_Uninsured
       1           0.075210
       2           0.113761
       3           0.191171
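
The summary above pools counts across states (total uninsured divided by total people) rather than averaging the state-level PercentUninsured values, so larger states carry more weight. As a rough sketch, the same computation can be written in plain pandas without dfply. The small table below is hypothetical and used only for illustration; it is not taken from the SAHIE file.

```python
import pandas as pd

# Hypothetical toy data (not from the SAHIE file): two states, two race categories
toy = pd.DataFrame({
    "State_Name":    ["A", "A", "B", "B"],
    "racecat":       [1, 2, 1, 2],
    "NumberInGroup": [1000, 200, 100, 800],
    "Uninsured":     [100, 40, 30, 80],
})

# Pooled rate per race: sum the counts within each group, then divide
totals = toy.groupby("racecat")[["Uninsured", "NumberInGroup"]].sum()
pooled = (totals["Uninsured"] / totals["NumberInGroup"]).rename("Percent_Uninsured")

# Unweighted mean of the state-level rates, shown only for contrast
toy["rate"] = toy["Uninsured"] / toy["NumberInGroup"]
unweighted = toy.groupby("racecat")["rate"].mean()

print(pooled)
print(unweighted)
```

In this toy table the two approaches disagree because the states differ greatly in size, which is why the notebook divides the summed counts instead of averaging percentages.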


### Obtain % Uninsured by Race across Medicare Expansion

In [40]:
# Use Python to obtain the % uninsured by race across Medicare expansion status
# racecat levels {0: All races, 1: White, 2: Black, 3: Hispanic}
# racecat = 0 is excluded from the summary table
# spread() pivots the Medicare_Expansion levels (No, Yes) into columns
OutcomeTable = (
    SAHIE_Race
    >> filter_by(X.racecat != 0)
    >> group_by(X.Medicare_Expansion, X.racecat)
    >> summarize(Percent_Uninsured = X.Uninsured.sum() / X.NumberInGroup.sum())
    >> spread(X.Medicare_Expansion, X.Percent_Uninsured)
)

#Pretty print the desired table
print(OutcomeTable.to_string(index=False))

 racecat        No       Yes
       1  0.104923  0.059937
       2  0.147426  0.082677
       3  0.265559  0.146594
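
The spread step turns the long summary (one row per Medicare_Expansion and racecat pair) into a wide table with one column per expansion status. A hedged sketch of the same reshaping in plain pandas follows. It uses the values printed above, typed in by hand as a long table, and the variable names and the Difference column are illustrative additions.

```python
import pandas as pd

# Long-format summary, typed in from the output printed above
long_table = pd.DataFrame({
    "Medicare_Expansion": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "racecat":            [1, 2, 3, 1, 2, 3],
    "Percent_Uninsured":  [0.104923, 0.147426, 0.265559,
                           0.059937, 0.082677, 0.146594],
})

# pivot: racecat values become rows, Medicare_Expansion values become columns
wide_table = long_table.pivot(index="racecat",
                              columns="Medicare_Expansion",
                              values="Percent_Uninsured")

# Gap between non-expansion and expansion states, by race
wide_table["Difference"] = wide_table["No"] - wide_table["Yes"]

print(wide_table.round(3))
```

The wide layout makes side-by-side comparisons, such as the difference column here, a one-line operation.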



