<a href="https://colab.research.google.com/github/christophermalone/HLA311/blob/main/Module2_Part3A_Advanced.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 2 | Part 3A: Summary (Top 10 List) - Advanced Level 

The purpose of this IPython notebook is to show the process by which a data scientist would obtain a Top 10 List using Python.

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

## Example #1 - Uninsured Rates for various Demographic Categories

For this example, we will consider the Small Area Health Insurance Estimates (SAHIE) data from the United States Census Bureau. The data give counts of people with and without health insurance across a variety of categories (age/race/sex/income), reported at the county level.

Source: https://www.census.gov/data/datasets/time-series/demo/sahie/estimates-acs.html 


Google Folder: <a href="https://drive.google.com/drive/folders/1ba54w0Q3TCCYN37h3aFQrSwVTwXwWoPh?usp=sharing" target="_blank">Link to Data</a>

The following fields are included in this data table:
*   State_Name: state name
*   State_Abbreviation: abbreviation for state
*   County_Name: name of county
*   CountyName_State: Combination of County_Name field and State_Abbreviation
*   MedicareExpansion_Adopted_StateLevel: Whether the state had adopted Medicare expansion in the year the data were collected
*   agecat: category of age (see data dictionary)
*   racecat: category of race (see data dictionary)
*   sexcat: category of sex (see data dictionary)
*   iprcat: category of income (see data dictionary)
*   NumberInGroup: Number of people in this demographic group
*   Uninsured: Number of people uninsured in this demographic group
*   Insured: Number of people insured in this demographic group
*   PercentUninsured: Percent of people uninsured in this demographic group


<table width='100%' ><tr><td bgcolor='green'></td></tr></table>


<strong>Goal</strong>: Obtain a "Top 10 List" for the <i>worst</i> counties in Minnesota regarding the percentage of children without health insurance. 


The following data processing steps are necessary to obtain this Top 10 List.
*   <strong>Step #0</strong>: Load the data into Python
*   <strong>Step #1</strong>: Get only counties from MN
*   <strong>Step #2</strong>: Exclude any state level information
*   <strong>Step #3</strong>: Get only records that meet the following criteria
<ul>
  <li>agecat = 4 (4 indicates children -- i.e. under the age of 19)</li>
  <li>racecat = 0 (0 indicates All races)</li>
  <li>sexcat = 0 (0 indicates All sexes)</li>
  <li>iprcat = 0 (0 indicates All incomes)</li>
</ul>
*   <strong>Step #4</strong>: Apply <strong>ARRANGE</strong> data action to sort
*   <strong>Step #5</strong>: Retain only the top 10 records and the desired fields
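
As a rough orientation before the step-by-step walkthrough, the same sequence of steps can be sketched in plain pandas on a small table. The rows below are made up for illustration (they are not SAHIE values), the county names are hypothetical, and the list is cut to the top 2 instead of the top 10 to keep the example short. This is only a sketch of the steps listed above, not the code used in the rest of this notebook.

```python
import pandas as pd

# Made-up rows for illustration only; these are not SAHIE values
toy = pd.DataFrame({
    "State_Abbreviation": ["MN", "MN", "MN", "MN", "MN", "WI"],
    "County_Name": [None, "A County", "A County", "B County", "C County", "D County"],
    "CountyName_State": [None, "A County, MN", "A County, MN",
                         "B County, MN", "C County, MN", "D County, WI"],
    "agecat": [4, 4, 0, 4, 4, 4],
    "racecat": [0, 0, 0, 0, 0, 0],
    "sexcat": [0, 0, 0, 0, 0, 0],
    "iprcat": [0, 0, 0, 0, 0, 0],
    "PercentUninsured": [3.0, 4.5, 6.0, 7.5, 2.5, 9.0],
})

top2 = (
    toy
    .loc[lambda d: d["State_Abbreviation"] == "MN"]            # Step #1: MN only
    .loc[lambda d: d["County_Name"].notna()]                   # Step #2: drop state-level rows
    .loc[lambda d: (d["agecat"] == 4) & (d["racecat"] == 0)
                   & (d["sexcat"] == 0) & (d["iprcat"] == 0)]  # Step #3: children, all races/sexes/incomes
    .sort_values("PercentUninsured", ascending=False)          # Step #4: worst first
    .head(2)                                                   # Step #5: top 2 in this toy example
    .loc[:, ["CountyName_State", "PercentUninsured"]]          # keep only the desired fields
)

print(top2.to_string(index=False))
```

Each `.loc[...]` call plays the role of one FILTER action, so the chain reads top to bottom in the same order as the step list.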


### Step #0: Load the data into Python

In [20]:
#Load the pandas package
import pandas as pd

In [21]:
#Use read_csv to read the comma-delimited file into Python
SAHIE = pd.read_csv('http://www.statsclass.org/online/hla311/datasets/SAHIE.csv') 

In [22]:
#How many records and fields?
SAHIE.shape

(119538, 13)

In [23]:
#Look at first few rows of the data
SAHIE.head(n=5)

Unnamed: 0,State_Name,State_Abbreviation,County_Name,CountyName_State,MedicareExpansion_Adopted_StateLevel,agecat,racecat,sexcat,iprcat,NumberInGroup,Uninsured,Insured,PercentUninsured
0,Alabama,AL,,,No,0,0,0,0,3955117.0,470052.0,3485065.0,11.9
1,Alabama,AL,,,No,0,0,0,1,1460808.0,286457.0,1174351.0,19.6
2,Alabama,AL,,,No,0,0,0,2,1805111.0,334174.0,1470937.0,18.5
3,Alabama,AL,,,No,0,0,0,3,989540.0,203801.0,785739.0,20.6
4,Alabama,AL,,,No,0,0,0,4,2679733.0,415673.0,2264060.0,15.5
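
The field descriptions suggest two simple relationships: Uninsured plus Insured should equal NumberInGroup, and PercentUninsured should be 100 × Uninsured / NumberInGroup rounded to one decimal. These formulas are an assumption read off the field names, not something the data dictionary is quoted as stating here. A minimal check on the five Alabama rows printed above, typed in by hand:

```python
import pandas as pd

# The five Alabama rows shown in the output above, typed in by hand
rows = pd.DataFrame({
    "NumberInGroup": [3955117.0, 1460808.0, 1805111.0, 989540.0, 2679733.0],
    "Uninsured": [470052.0, 286457.0, 334174.0, 203801.0, 415673.0],
    "Insured": [3485065.0, 1174351.0, 1470937.0, 785739.0, 2264060.0],
    "PercentUninsured": [11.9, 19.6, 18.5, 20.6, 15.5],
})

# Assumed relationship 1: the two counts add up to the group size
counts_add_up = bool((rows["Uninsured"] + rows["Insured"] == rows["NumberInGroup"]).all())

# Assumed relationship 2: percent = 100 * Uninsured / NumberInGroup, to one decimal
recomputed = (100 * rows["Uninsured"] / rows["NumberInGroup"]).round(1)
percent_matches = bool((recomputed - rows["PercentUninsured"]).abs().lt(0.05).all())

print(counts_add_up, percent_matches)
```

Both checks hold for these five rows; the same two lines could be applied to the full `SAHIE` table to look for rows where they fail.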


Next, install the dfply package, which provides data verbs (such as filter_by, arrange, and select) that can be applied to pandas data frames.

In [24]:
pip install dfply



In [25]:
#Load the dfply package
from dfply import *

### Step #1: Get only information for Minnesota

In [26]:
##Apply a FILTER action on State_Abbreviation = MN
OutcomeTable = (
            SAHIE
            >> filter_by(X.State_Abbreviation == 'MN')
                  )

#Print first 5 rows of OutcomeTable
OutcomeTable.head(n=5)

Unnamed: 0,State_Name,State_Abbreviation,County_Name,CountyName_State,MedicareExpansion_Adopted_StateLevel,agecat,racecat,sexcat,iprcat,NumberInGroup,Uninsured,Insured,PercentUninsured
50202,Minnesota,MN,,,Yes,0,0,0,0,4636753.0,238150.0,4398603.0,5.1
50203,Minnesota,MN,,,Yes,0,0,0,1,1074990.0,104423.0,970567.0,9.7
50204,Minnesota,MN,,,Yes,0,0,0,2,1404951.0,132302.0,1272649.0,9.4
50205,Minnesota,MN,,,Yes,0,0,0,3,677447.0,64691.0,612756.0,9.5
50206,Minnesota,MN,,,Yes,0,0,0,4,2392688.0,188720.0,2203968.0,7.9


### Step #2: Exclude any state level information


In [27]:
##Apply a FILTER action on 
#                          State_Abbreviation = MN
#                          AND County_Name is not null
OutcomeTable = (
            SAHIE
            >> filter_by(X.State_Abbreviation == 'MN')
            >> filter_by(X.County_Name.notnull())
          )

#Print first 5 rows of OutcomeTable
OutcomeTable.head(n=5)

Unnamed: 0,State_Name,State_Abbreviation,County_Name,CountyName_State,MedicareExpansion_Adopted_StateLevel,agecat,racecat,sexcat,iprcat,NumberInGroup,Uninsured,Insured,PercentUninsured
50328,Minnesota,MN,Aitkin County,"Aitkin County, MN",Yes,0,0,0,0,10445.0,752.0,9693.0,7.2
50329,Minnesota,MN,Aitkin County,"Aitkin County, MN",Yes,0,0,0,1,3623.0,340.0,3283.0,9.4
50330,Minnesota,MN,Aitkin County,"Aitkin County, MN",Yes,0,0,0,2,4634.0,434.0,4200.0,9.4
50331,Minnesota,MN,Aitkin County,"Aitkin County, MN",Yes,0,0,0,3,2310.0,217.0,2093.0,9.4
50332,Minnesota,MN,Aitkin County,"Aitkin County, MN",Yes,0,0,0,4,7158.0,613.0,6545.0,8.6


### Step #3: Apply additional FILTER actions

In [28]:
##Apply a FILTER action on 
#                          State_Abbreviation = MN
#                          AND County_Name is not null
#                          AND agecat = 4 (only people under age 19)
#                          AND racecat = 0 (all races)
#                          AND sexcat = 0 (all sexes)
#                          AND iprcat = 0 (all income categories)

OutcomeTable = (
            SAHIE
            >> filter_by(X.State_Abbreviation == 'MN')
            >> filter_by(X.County_Name.notnull())
            >> filter_by(X.agecat == 4, X.racecat == 0, X.sexcat == 0, X.iprcat == 0)
          )

#Print first 5 rows of OutcomeTable
OutcomeTable.head(n=5)

Unnamed: 0,State_Name,State_Abbreviation,County_Name,CountyName_State,MedicareExpansion_Adopted_StateLevel,agecat,racecat,sexcat,iprcat,NumberInGroup,Uninsured,Insured,PercentUninsured
50352,Minnesota,MN,Aitkin County,"Aitkin County, MN",Yes,4,0,0,0,2684.0,125.0,2559.0,4.7
50388,Minnesota,MN,Anoka County,"Anoka County, MN",Yes,4,0,0,0,87542.0,2431.0,85111.0,2.8
50424,Minnesota,MN,Becker County,"Becker County, MN",Yes,4,0,0,0,8576.0,403.0,8173.0,4.7
50460,Minnesota,MN,Beltrami County,"Beltrami County, MN",Yes,4,0,0,0,11877.0,624.0,11253.0,5.3
50496,Minnesota,MN,Benton County,"Benton County, MN",Yes,4,0,0,0,10467.0,321.0,10146.0,3.1
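
After Step #3, the printed rows show each county once, which is what a Top 10 List of counties needs. Because every county has one row per combination of categories, pinning down only some of the four category fields would leave repeated counties in the table. The sketch below illustrates this on made-up rows (hypothetical county names, not SAHIE values) using the pandas `is_unique` property as a quick check.

```python
import pandas as pd

# Made-up rows: two counties, each with several category combinations
toy = pd.DataFrame({
    "CountyName_State": ["A County, MN"] * 3 + ["B County, MN"] * 3,
    "agecat": [0, 4, 4, 0, 4, 4],
    "racecat": [0, 0, 0, 0, 0, 0],
    "sexcat": [0, 0, 1, 0, 0, 1],
    "iprcat": [0, 0, 0, 0, 0, 0],
})

# Filtering on agecat alone still leaves one row per sexcat value
age_only = toy[toy["agecat"] == 4]

# Fixing all four categories leaves exactly one row per county
all_four = toy[
    (toy["agecat"] == 4) & (toy["racecat"] == 0)
    & (toy["sexcat"] == 0) & (toy["iprcat"] == 0)
]

print(age_only["CountyName_State"].is_unique)   # counties repeat
print(all_four["CountyName_State"].is_unique)   # one row per county
```

Running the same `is_unique` check on `OutcomeTable["CountyName_State"]` after Step #3 is one way to confirm the filter is complete before sorting.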


### Step #4: Apply ARRANGE action to sort by Percent Uninsured

In [29]:
##Apply an ARRANGE action to SORT by Percent Uninsured
OutcomeTable = (
            SAHIE
            >> filter_by(X.State_Abbreviation == 'MN')
            >> filter_by(X.County_Name.notnull())
            >> filter_by(X.agecat == 4, X.racecat == 0, X.sexcat == 0, X.iprcat == 0)
            >> arrange(X.PercentUninsured, ascending=False)
          )

#Print first 5 rows of OutcomeTable
OutcomeTable.head(n=5)

Unnamed: 0,State_Name,State_Abbreviation,County_Name,CountyName_State,MedicareExpansion_Adopted_StateLevel,agecat,racecat,sexcat,iprcat,NumberInGroup,Uninsured,Insured,PercentUninsured
53088,Minnesota,MN,Todd County,"Todd County, MN",Yes,4,0,0,0,5898.0,426.0,5472.0,7.2
50712,Minnesota,MN,Cass County,"Cass County, MN",Yes,4,0,0,0,6320.0,450.0,5870.0,7.1
52224,Minnesota,MN,Nobles County,"Nobles County, MN",Yes,4,0,0,0,6055.0,413.0,5642.0,6.8
51144,Minnesota,MN,Fillmore County,"Fillmore County, MN",Yes,4,0,0,0,5310.0,336.0,4974.0,6.3
52440,Minnesota,MN,Pipestone County,"Pipestone County, MN",Yes,4,0,0,0,2365.0,141.0,2224.0,6.0


### Step #5: Retain only the top 10 records and the desired fields for the Top 10 List

In [30]:
# Using Python to obtain Top 10 List (using pandas and dfply packages)
OutcomeTable = (
            SAHIE
            >> filter_by(X.State_Abbreviation == 'MN')
            >> filter_by(X.County_Name.notnull())
            >> filter_by(X.agecat == 4, X.racecat == 0, X.sexcat == 0, X.iprcat == 0)
            >> arrange(X.PercentUninsured, ascending=False)
            >> top_n(10)
            >> select(X.CountyName_State, X.PercentUninsured)
          )

#Pretty print the desired table
print(OutcomeTable.to_string(index=False))

      CountyName_State  PercentUninsured
       Todd County, MN               7.2
       Cass County, MN               7.1
     Nobles County, MN               6.8
   Fillmore County, MN               6.3
  Pipestone County, MN               6.0
 Clearwater County, MN               5.8
   Mahnomen County, MN               5.8
   Traverse County, MN               5.7
   Watonwan County, MN               5.4
   Beltrami County, MN               5.3
       Cook County, MN               5.3


**Note**:  The code block for Step #5 includes the necessary code for all previous steps; thus, only the code block for Step #5 needs to be run to obtain the desired Top 10 List.
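
The printed list above has 11 rows, not 10: Beltrami County and Cook County are tied at 5.3 for the last place, and `top_n(10)` appears to keep both. The sketch below reproduces that behavior with plain pandas on the eleven printed rows, typed in by hand. `nlargest(..., keep="all")` is used here as an analogous pandas call, not as a description of how dfply implements `top_n`; `head(10)` shows one way to force exactly ten rows, at the cost of dropping one of the tied counties arbitrarily.

```python
import pandas as pd

# The eleven rows printed by Step #5, typed in by hand (already sorted)
listed = pd.DataFrame({
    "CountyName_State": [
        "Todd County, MN", "Cass County, MN", "Nobles County, MN",
        "Fillmore County, MN", "Pipestone County, MN", "Clearwater County, MN",
        "Mahnomen County, MN", "Traverse County, MN", "Watonwan County, MN",
        "Beltrami County, MN", "Cook County, MN",
    ],
    "PercentUninsured": [7.2, 7.1, 6.8, 6.3, 6.0, 5.8, 5.8, 5.7, 5.4, 5.3, 5.3],
})

# Keep every row tied with the 10th-largest value
with_ties = listed.nlargest(10, "PercentUninsured", keep="all")

# Keep exactly ten rows; which tied county survives depends on row order
exactly_ten = listed.head(10)

print(len(with_ties), len(exactly_ten))
```

Keeping ties is usually the safer default for a ranked list, since neither tied county has a better claim to the tenth spot.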



---

