<a href="https://colab.research.google.com/github/christophermalone/HLA311/blob/main/Module2_Part3A_Advanced.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 2 | Part 3A: Summary (Top 10 List) - Advanced Level 

The purpose of this IPython notebook is to show the process a data scientist would follow to obtain a Top 10 List using Python.

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

## Example #1 - Uninsured Rates for various Demographic Categories

For this example, we will use the Small Area Health Insurance Estimates (SAHIE) data from the United States Census Bureau. The data gives counts of people with and without health insurance across a variety of categories (age/race/sex/income). Estimates are provided at the county level, with state-level totals included as separate rows.

Source: https://www.census.gov/data/datasets/time-series/demo/sahie/estimates-acs.html 


Google Folder: <a href="https://drive.google.com/drive/folders/1ba54w0Q3TCCYN37h3aFQrSwVTwXwWoPh?usp=sharing" target="_blank">Link to Data</a>

The following fields are included in this data table:
*   State_Name: state name
*   State_Abbreviation: abbreviation for state
*   County_Name: name of county
*   CountyName_State: Combination of County_Name field and State_Abbreviation
*   MedicareExpansion_Adopted_StateLevel: whether the state had adopted Medicare expansion in the year the data was collected
*   agecat: category of age (see data dictionary)
*   racecat: category of race (see data dictionary)
*   sexcat: category of sex (see data dictionary)
*   iprcat: category of income (see data dictionary)
*   NumberInGroup: Number of people in this demographic group
*   Uninsured: Number of people uninsured in this demographic group
*   Insured: Number of people insured in this demographic group
*   PercentUninsured: Percent of people uninsured in this demographic group


<table width='100%' ><tr><td bgcolor='green'></td></tr></table>


<strong>Goal</strong>: Obtain a "Top 10 List" for the <i>worst</i> counties in Minnesota regarding the percentage of children without health insurance. 


The following data processing steps are necessary to obtain this Top 10 List.
*   <strong>Step #0</strong>: Load the data into Python
*   <strong>Step #1</strong>: Get only counties from MN
*   <strong>Step #2</strong>: Exclude any state level information
*   <strong>Step #3</strong>: Get only records that meet the following criteria:
<ul>
  <li>agecat = 4 (4 indicates children -- i.e. under the age of 19)</li>
  <li>racecat = 0 (0 indicates All races)</li>
  <li>sexcat = 0 (0 indicates All sexes)</li>
  <li>iprcat = 0 (0 indicates All incomes)</li>
</ul>
*   <strong>Step #4</strong>: Apply <strong>ARRANGE</strong> data action to sort
*   <strong>Step #5</strong>: Retain only the top 10 records and the desired fields


### Step #0: Load the data into Python

In [1]:
#Load the pandas package
import pandas as pd

In [31]:
#Use read_csv to read the comma-delimited file into Python
SAHIE = pd.read_csv('http://www.statsclass.org/online/hla311/datasets/SAHIE.csv') 

In [32]:
#How many records (rows) and fields (columns)?
SAHIE.shape

(119538, 13)

In [33]:
#Look at first few rows of the data
SAHIE.head(n=5)

Unnamed: 0,State_Name,State_Abbreviation,County_Name,CountyName_State,MedicareExpansion_Adopted_StateLevel,agecat,racecat,sexcat,iprcat,NumberInGroup,Uninsured,Insured,PercentUninsured
0,Alabama,AL,,,No,0,0,0,0,3955117.0,470052.0,3485065.0,11.9
1,Alabama,AL,,,No,0,0,0,1,1460808.0,286457.0,1174351.0,19.6
2,Alabama,AL,,,No,0,0,0,2,1805111.0,334174.0,1470937.0,18.5
3,Alabama,AL,,,No,0,0,0,3,989540.0,203801.0,785739.0,20.6
4,Alabama,AL,,,No,0,0,0,4,2679733.0,415673.0,2264060.0,15.5
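
The rows displayed above hint at how the count and percent fields relate to one another. The sketch below retypes those five Alabama rows by hand and checks two assumptions that the notebook does not state: that `Uninsured + Insured` equals `NumberInGroup`, and that `PercentUninsured` is `100 * Uninsured / NumberInGroup` rounded to one decimal place. It is a quick sanity check on five rows only, not a statement about how the Census Bureau computes the field.

```python
import pandas as pd

# The five rows shown by SAHIE.head(n=5), retyped by hand (Alabama, state level)
sample = pd.DataFrame({
    "iprcat": [0, 1, 2, 3, 4],
    "NumberInGroup": [3955117.0, 1460808.0, 1805111.0, 989540.0, 2679733.0],
    "Uninsured": [470052.0, 286457.0, 334174.0, 203801.0, 415673.0],
    "Insured": [3485065.0, 1174351.0, 1470937.0, 785739.0, 2264060.0],
    "PercentUninsured": [11.9, 19.6, 18.5, 20.6, 15.5],
})

# Assumption 1: the two counts add up to the size of the group
counts_add_up = (sample["Uninsured"] + sample["Insured"] == sample["NumberInGroup"]).all()

# Assumption 2: the percent is the uninsured share of the group, times 100
recomputed = 100 * sample["Uninsured"] / sample["NumberInGroup"]
max_gap = (recomputed - sample["PercentUninsured"]).abs().max()

print("counts add up:", counts_add_up)
print("largest gap from the listed percent:", round(max_gap, 3))
```

A gap below 0.05 is what rounding to one decimal place would produce, so a small value here is consistent with the assumed formula.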


Next, install the dfply package, which provides data verbs (such as filter, arrange, and select) that can be chained together in Python.

In [5]:
%pip install dfply

Collecting dfply
  Downloading dfply-0.3.3-py3-none-any.whl (612kB)

In [6]:
#Load the dfply package
from dfply import *

### Step #1: Get only information for Minnesota

In [None]:
##Apply a FILTER action on State_Abbreviation = MN
OutcomeTable = (
            SAHIE
            >> filter_by(X.State_Abbreviation == 'MN')
                  )

#Print first 5 rows of OutcomeTable
OutcomeTable.head(n=5)

### Step #2: Exclude any state level information


In [None]:
##Apply a FILTER action on 
#                          State_Abbreviation = MN
#                          AND County_Name is not null
OutcomeTable = (
            SAHIE
            >> filter_by(X.State_Abbreviation == 'MN')
            >> filter_by(X.County_Name.notnull())
          )

#Print first 5 rows of OutcomeTable
OutcomeTable.head(n=5)
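
Why does a not-null test on County_Name exclude state-level information? In the rows printed in Step #0, the state-level records have an empty County_Name, so keeping only rows with a county name drops them. The sketch below shows the same idea in plain pandas (not dfply) on a made-up three-row table; the county names and percentages are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the SAHIE layout: one state-level row, two county rows
toy = pd.DataFrame({
    "State_Abbreviation": ["MN", "MN", "MN"],
    "County_Name": [None, "A County", "B County"],
    "PercentUninsured": [4.0, 6.0, 3.5],
})

# Boolean mask: True where a county name is present
is_county_row = toy["County_Name"].notna()

# Keep only the county-level rows
county_rows = toy[is_county_row]

print(county_rows)
print("state-level rows dropped:", int((~is_county_row).sum()))
```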

### Step #3: Apply additional FILTER actions

In [None]:
##Apply a FILTER action on 
#                          State_Abbreviation = MN
#                          AND County_Name is not null
#                          AND agecat = 4 (only people under the age of 19)
#                          AND racecat = 0 (all races)
#                          AND sexcat = 0 (all sexes)
#                          AND iprcat = 0 (all income categories)

OutcomeTable = (
            SAHIE
            >> filter_by(X.State_Abbreviation == 'MN')
            >> filter_by(X.County_Name.notnull())
            >> filter_by(X.agecat == 4, X.racecat == 0, X.sexcat == 0, X.iprcat == 0)
          )

#Print first 5 rows of OutcomeTable
OutcomeTable.head(n=5)
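
For readers who want to see the AND logic spelled out, here is a minimal pandas-only sketch of the same kind of multi-condition filter. It is an alternative to the dfply code above, not a replacement for it, and the rows are hypothetical. Two spellings are shown: a boolean mask built with `&`, and a `query` string.

```python
import pandas as pd

# Hypothetical rows; category codes follow the notebook (agecat 4 = under 19, 0 = all)
toy = pd.DataFrame({
    "CountyName_State": ["A County, MN", "A County, MN", "B County, MN", "B County, MN"],
    "agecat": [4, 0, 4, 4],
    "racecat": [0, 0, 0, 1],
    "sexcat": [0, 0, 0, 0],
    "iprcat": [0, 0, 0, 0],
})

# Option 1: combine conditions with & (each condition needs its own parentheses)
mask = (
    (toy["agecat"] == 4)
    & (toy["racecat"] == 0)
    & (toy["sexcat"] == 0)
    & (toy["iprcat"] == 0)
)
children_all_groups = toy[mask]

# Option 2: the same filter written as a query string
same_rows = toy.query("agecat == 4 and racecat == 0 and sexcat == 0 and iprcat == 0")

print(children_all_groups)
```

A row is kept only when every condition holds, so the second row (all ages) and the fourth row (a single race category) are dropped.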

### Step #4: Apply ARRANGE action to sort by Percent Uninsured

In [None]:
##Apply an ARRANGE action to SORT by Percent Uninsured
OutcomeTable = (
            SAHIE
            >> filter_by(X.State_Abbreviation == 'MN')
            >> filter_by(X.County_Name.notnull())
            >> filter_by(X.agecat == 4, X.racecat == 0, X.sexcat == 0, X.iprcat == 0)
            >> arrange(X.PercentUninsured, ascending=False)
          )

#Print first 5 rows of OutcomeTable
OutcomeTable.head(n=5)
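
When several counties share the same percentage, their relative order after sorting on that one field is not obviously determined. One way to make the order reproducible is to add a second sort key. The sketch below does this in plain pandas with `sort_values`; the county names and values are hypothetical, and breaking ties alphabetically is an assumption made here, not something the notebook requires.

```python
import pandas as pd

# Hypothetical values, including a tie at 5.8
toy = pd.DataFrame({
    "CountyName_State": ["C County, MN", "A County, MN", "B County, MN", "D County, MN"],
    "PercentUninsured": [5.8, 7.2, 5.8, 4.1],
})

# Sort by percent (largest first), then by county name (A to Z) to break ties
ranked = toy.sort_values(
    by=["PercentUninsured", "CountyName_State"],
    ascending=[False, True],
).reset_index(drop=True)

print(ranked)
```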

### Step #5: Retain only the top 10 records and the desired fields for the Top 10 List

In [42]:
# Using Python to obtain Top 10 List (using pandas and dfply packages)
OutcomeTable = (
            SAHIE
            >> filter_by(X.State_Abbreviation == 'MN')
            >> filter_by(X.County_Name.notnull())
            >> filter_by(X.agecat == 4, X.racecat == 0, X.sexcat == 0, X.iprcat == 0)
            >> arrange(X.PercentUninsured, ascending=False)
            >> top_n(10)
            >> select(X.CountyName_State, X.PercentUninsured)
          )

#Pretty print the desired table
print(OutcomeTable.to_string(index=False))

      CountyName_State  PercentUninsured
       Todd County, MN               7.2
       Cass County, MN               7.1
     Nobles County, MN               6.8
   Fillmore County, MN               6.3
 Clearwater County, MN               5.8
   Mahnomen County, MN               5.8
   Traverse County, MN               5.7
   Watonwan County, MN               5.4
   Beltrami County, MN               5.3
       Cook County, MN               5.3


**Note**:  The code block for Step #5 includes the necessary code for all previous steps; thus, only the code block for Step #5 needs to be run to obtain the desired Top 10 List.
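
If dfply is not available, the same sequence of steps can be approximated with pandas alone. The function below is a minimal sketch under that assumption: `top_uninsured_counties` is a hypothetical helper name, the toy table is made up for illustration, and `head(n)` is used in place of dfply's `top_n`, so the result may differ when counties are tied at the cutoff.

```python
import pandas as pd

def top_uninsured_counties(df, state="MN", n=10):
    """Return the n counties in `state` with the highest percent of uninsured children."""
    keep = (
        (df["State_Abbreviation"] == state)   # Step #1: one state only
        & df["County_Name"].notna()           # Step #2: drop state-level rows
        & (df["agecat"] == 4)                 # Step #3: children (under 19)
        & (df["racecat"] == 0)                #          all races
        & (df["sexcat"] == 0)                 #          all sexes
        & (df["iprcat"] == 0)                 #          all incomes
    )
    return (
        df.loc[keep, ["CountyName_State", "PercentUninsured"]]   # desired fields
        .sort_values("PercentUninsured", ascending=False)        # Step #4: sort
        .head(n)                                                 # Step #5: top n rows
    )

# Hypothetical miniature data set for illustration only
toy = pd.DataFrame({
    "State_Abbreviation": ["MN", "MN", "MN", "MN", "WI"],
    "County_Name": [None, "A County", "B County", "C County", "D County"],
    "CountyName_State": [None, "A County, MN", "B County, MN", "C County, MN", "D County, WI"],
    "agecat": [4, 4, 4, 4, 4],
    "racecat": [0, 0, 0, 0, 0],
    "sexcat": [0, 0, 0, 0, 0],
    "iprcat": [0, 0, 0, 0, 0],
    "PercentUninsured": [3.1, 2.0, 6.5, 4.4, 9.9],
})

top2 = top_uninsured_counties(toy, n=2)
print(top2.to_string(index=False))
```

On the real table, the call would be `top_uninsured_counties(SAHIE)`, which uses the defaults of Minnesota and ten rows.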



---




