<a href="https://colab.research.google.com/github/christophermalone/HLA311/blob/main/Module2_Part2_Advanced.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 2 | Part 2: FILTER() Action - Advanced Level 

The purpose of this IPython notebook is to show how a data scientist would perform a FILTER() action using Python.

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

## Example #1 - Chronic Disease Data

For this example, we will consider the disparities that exist across race for common chronic diseases in the United States.  The ChronicDiseasebyRace_Counties.csv file can be found in the Data Folder.

Source: https://data.cms.gov/mapping-medicare-disparities

Data Folder: https://drive.google.com/drive/folders/1619cjbolTO-UDmrhir9KIa7fXe0Yainw?usp=sharing 


The data processing steps for this example will include:
*   Automatically load the data file into Python
*   Perform a FILTER action to obtain only the counties in MN or WI.
*   Perform a SELECT action to obtain the following fields: FIPS, County, State, Condition, UrbanRural, Total, NumberBeneficiaries
*   Write out the desired data table so further analyses can be completed

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

### Step 1: Load the data in Python

In [32]:
#Load the pandas package
import pandas as pd

In [33]:
#Use read_csv to read the comma-delimited file into Python
ChronicDiseases = pd.read_csv('http://www.statsclass.org/online/hla311/datasets/ChronicDiseasebyRace_Counties.csv') 

In [34]:
#How many records and fields
ChronicDiseases.shape

(16115, 19)

In [None]:
#View the first 5 records
ChronicDiseases.head(n=5)

Next, install the dfply package, which provides the data verbs (such as filter_by and select) used in the rest of this notebook.

In [36]:
#Install dfply into the notebook's environment (the %pip magic targets the active kernel)
%pip install dfply



In [37]:
#Load the dfply package
from dfply import *

### Step #2: FILTER Action

As a first step, let us obtain information for Minnesota (MN) counties only.

In [None]:
##Apply a filter on State for MN counties
ChronicDiseases_MN = (
                    ChronicDiseases
                    >> filter_by(X.State == 'MN')
                  )

#Pretty print the desired table
print(ChronicDiseases_MN.to_string(index=False))

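Next, let us also include the counties from Wisconsin (WI). The two conditions are combined with the | (OR) operator, and each condition must be wrapped in its own parentheses.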

In [None]:
##Apply a filter on State for MN OR WI counties
ChronicDiseases_MNandWI = (
                    ChronicDiseases
                    >> filter_by( (X.State == 'MN') | (X.State == 'WI') )
                  )

#Pretty print the desired table
print(ChronicDiseases_MNandWI.to_string(index=False))
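
For comparison, the same OR filter can be sketched in plain pandas, without dfply. The small table below is made up and only mimics the State and County fields of the real file; the names `toy`, `mask_or`, and `mask_isin` are illustrative.

```python
import pandas as pd

# Made-up rows that only mimic the County/State layout of the real file
toy = pd.DataFrame({
    "County": ["Winona", "La Crosse", "Polk", "Houston"],
    "State": ["MN", "WI", "IA", "MN"],
})

# Element-wise OR of two comparisons; each comparison needs its own parentheses
mask_or = (toy["State"] == "MN") | (toy["State"] == "WI")

# isin() expresses the same test and is easier to extend to more states
mask_isin = toy["State"].isin(["MN", "WI"])

# Keep only the rows where the mask is True
toy_mn_wi = toy[mask_isin]
print(toy_mn_wi.to_string(index=False))
```

Both masks select the same rows here; `isin()` tends to be the more readable choice once the list of states grows beyond two.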

### Step #3: Apply FILTER action and SELECT action

In [None]:
##Apply filter on State of MN or WI counties, and apply select to obtain desired columns
ChronicDiseases_MNandWI = (
                    ChronicDiseases
                    >> filter_by( (X.State == 'MN') | (X.State == 'WI') )
                    >> select(X.FIPS,X.County, X.State, X.Condition, X.UrbanRural, X.Total, X.NumberBeneficiaries )
                  )

#Pretty print the desired table
print(ChronicDiseases_MNandWI.to_string(index=False))
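
The combined FILTER and SELECT can also be written as a single `.loc` call in plain pandas. This is a minimal sketch on made-up data: the `Extra` column and all values are hypothetical, and only a few of the real field names are reused.

```python
import pandas as pd

# Made-up table; the FIPS codes and totals are placeholders, not real data
toy = pd.DataFrame({
    "FIPS": [11111, 22222, 33333],
    "County": ["Winona", "La Crosse", "Polk"],
    "State": ["MN", "WI", "IA"],
    "Condition": ["Diabetes", "Diabetes", "Diabetes"],
    "Total": [10.0, 11.0, 12.0],
    "Extra": ["a", "b", "c"],  # hypothetical column that should be dropped
})

# FILTER: which rows to keep
keep_rows = toy["State"].isin(["MN", "WI"])

# SELECT: which columns to keep, in the desired order
keep_cols = ["FIPS", "County", "State", "Condition", "Total"]

# .loc[rows, columns] applies both actions at once
subset = toy.loc[keep_rows, keep_cols]
print(subset.to_string(index=False))
```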

### Step #4: Write out the desired data table (csv file)

In [44]:
#Use to_csv to write the filtered table to a comma-delimited file
ChronicDiseases_MNandWI.to_csv('/content/ChronicDiseases_MNandWI.csv', sep=',', encoding='utf-8', index=False)
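
One way to confirm that a file was written as intended is to read it back and compare it with the table in memory. The sketch below uses a made-up two-row table and a temporary directory rather than the /content folder, so the path and the name `toy_MNandWI.csv` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up table standing in for the filtered result
toy = pd.DataFrame({"County": ["Winona", "La Crosse"], "State": ["MN", "WI"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_MNandWI.csv")

    # index=False keeps the row index out of the file
    toy.to_csv(out_path, index=False)

    # Read the file back while the temporary directory still exists
    round_trip = pd.read_csv(out_path)

# The round-tripped table should have the same shape and column names
same_shape = round_trip.shape == toy.shape
same_columns = list(round_trip.columns) == list(toy.columns)
print(same_shape, same_columns)
```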

**Note**: The desired data file can be downloaded from the content folder. Expand the folder icon near the upper-left side of the screen to view its contents.



---






<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

## Example 2 - Home Health Care

For this example, we will consider information regarding Home Health Care agencies across the United States.  The data used for this example can be downloaded from the folder below.

Source: https://data.cms.gov/provider-data/dataset/6jpm-sxkc

Data Folder: https://drive.google.com/drive/folders/1619cjbolTO-UDmrhir9KIa7fXe0Yainw?usp=sharing 


The data processing steps for this example will include:
*   Automatically load the data file into Python
*   Perform a FILTER action to obtain only the home health care agencies in MN. There is no designated State field; State information can be found by searching the MailingAddress field.
*   Write out the desired data table so further analyses can be completed

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

### Step 1: Load the data into Python

The required Python packages are already loaded; thus, we can proceed with loading this file into Python.

In [46]:
#Use read_csv to read the comma-delimited file into Python
HomeHealthCare = pd.read_csv('http://www.statsclass.org/online/hla311/datasets/HomeHealthCare_AllAgencies_US.csv') 

In [47]:
#How many records and fields
HomeHealthCare.shape

(11176, 18)

In [48]:
#View the first 5 records
HomeHealthCare.head(n=5)

Unnamed: 0,CMSID,ProviderName,MailingAddress,Phone,Type of Ownership,NursingCareServices,PhysicalTherapyServices,OccupationalTherapyServices,SpeechPathologyServices,MedicalSocialServices,HomeHealthAideServices,DateCertified,PatientCareStarRating,HowOften_PatientsCareTimely,HowOften_Checkedfordepression,HowOften_Admittedtohospital,HowMuch_Medicarespendsperepisode_comparedtoallagenciesnationally,NumberCases_Medicarespendsperepisode_comparedtoallagenciesnationally
0,17000,ALABAMA DEPARTMENT OF PUBLIC HEALTH HOME CARE,"201 MONROE STREET THE RSA TOWER SUITE 1200,MO...",3342065341,GOVERNMENT - STATE/COUNTY,Yes,Yes,Yes,Yes,Yes,Yes,7/1/1966,4.0,93.4,95.4,14.4,0.89,2130
1,17009,ENCOMPASS HEALTH HOME HEALTH,"2970 LORNA ROAD,BIRMINGHAM,AL,35216",2058242680,PROPRIETARY,Yes,Yes,Yes,Yes,Yes,Yes,1/18/1973,3.5,97.1,99.5,16.0,0.99,19072
2,17013,KINDRED AT HOME,"1239 RUCKER BLVD,ENTERPRISE,AL,36330",3343470234,PROPRIETARY,Yes,Yes,Yes,No,No,Yes,7/24/1975,4.0,99.8,99.6,15.4,1.08,1734
3,17014,AMEDISYS HOME HEALTH,"68278 MAIN STREET,BLOUNTSVILLE,AL,35031",8664864919,PROPRIETARY,Yes,Yes,Yes,Yes,Yes,Yes,9/4/1975,4.5,99.6,84.1,11.0,0.98,882
4,17016,SOUTHEAST ALABAMA HOMECARE,"804 GLOVER AVENUE,ENTERPRISE,AL,36330",3343474800,PROPRIETARY,Yes,Yes,Yes,Yes,Yes,Yes,6/9/1976,4.5,99.4,100.0,15.7,0.99,1187


### Step 2: Apply FILTER Action

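Here, the MailingAddress field must be searched to decide whether a record should be retained, i.e., whether the agency is in MN.  In the records shown above, the state abbreviation appears between two commas (for example, ",AL,"), so the search text is ",MN,".  The <strong>str.contains()</strong> function will be used to accomplish this task.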

In [None]:
##Apply a filter on MailingAddress to keep only MN agencies
HomeHealthCare_MN = (
                   HomeHealthCare
                    >> filter_by(X.MailingAddress.str.contains(',MN,'))
                  )

#Pretty print the desired table
print(HomeHealthCare_MN.to_string(index=False))
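
Two details of `str.contains()` are worth knowing when searching an address field: by default the search text is treated as a regular expression, and missing values produce missing results rather than False. The sketch below, on made-up addresses, passes `regex=False` and `na=False` to handle both. It also shows an alternative that extracts the state by splitting from the right, under the assumption that every address ends in ",STATE,ZIP". The provider names and addresses are hypothetical.

```python
import pandas as pd

# Made-up agencies; the addresses mimic the "street,city,state,zip" layout
toy = pd.DataFrame({
    "ProviderName": ["AGENCY A", "AGENCY B", "AGENCY C", "AGENCY D"],
    "MailingAddress": [
        "1 FIRST ST,WINONA,MN,55987",
        "2 SECOND ST,MADISON,WI,53703",
        None,                               # missing address
        "3 MN AVENUE,DES MOINES,IA,50309",  # 'MN' appears in the street only
    ],
})

# regex=False treats ',MN,' as literal text; na=False maps missing values to False
is_mn = toy["MailingAddress"].str.contains(",MN,", regex=False, na=False)
toy_mn = toy[is_mn]

# Alternative: split twice from the right and take the second-to-last piece
state = toy["MailingAddress"].str.rsplit(",", n=2).str[-2]

print(toy_mn.to_string(index=False))
print(state.tolist())
```

Splitting from the right is less sensitive to extra commas in the street portion of an address, but it depends on the assumed ",STATE,ZIP" ending holding for every record.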

### Step 3: Write the desired data table (CSV file)

In [51]:
#Use to_csv to write the filtered table to a comma-delimited file
HomeHealthCare_MN.to_csv('/content/HomeHealthCare_MN.csv', sep=',', encoding='utf-8', index=False)



---




