<a href="https://colab.research.google.com/github/christophermalone/HLA311/blob/main/Module1_Part3_Advanced.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 1 | Part 3: Importing Data - Advanced Level 

The purpose of this IPython notebook is to communicate the process by which a data scientist would obtain data.

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

## Example - Common Chronic Diseases by Race for Counties in US
This example uses data from the Mapping Medicare Disparities tool on the data.CMS.gov website.

Source:  https://data.cms.gov/mapping-medicare-disparities

The data processing steps for this example will include:
*   Automatically download a zip file containing the data from the internet
*   Automatically unzip the file so that the individual data files can be retrieved
*   Automatically remove all header information from the file
*   Replace all * characters (which indicate insufficient data) with NaN
*   Remove all counties except those in MN and WI, as these are the only states needed for the desired data set
*   Finally, read the data into Python for future analyses

<table width='100%' ><tr><td bgcolor='green'></td></tr></table>

### Step 1: Download the data

Create the script that downloads the data

In [181]:
%%bash
#Create a file that contains the command to download the zip file
{  
 echo 'wget -O  /content/ChronicDiseases_byRace.zip http://www.statsclass.org/online/hla311/datasets/ChronicDiseases_byRace.zip'
 } > Download_ChronicDiseaseData.sh

Run the script to complete the download

In [182]:
%%bash
bash Download_ChronicDiseaseData.sh

--2021-05-21 22:11:46--  http://www.statsclass.org/online/hla311/datasets/ChronicDiseases_byRace.zip
Resolving www.statsclass.org (www.statsclass.org)... 192.254.227.17
Connecting to www.statsclass.org (www.statsclass.org)|192.254.227.17|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1810023 (1.7M) [application/zip]
Saving to: ‘/content/ChronicDiseases_byRace.zip’

     0K .......... .......... .......... .......... ..........  2%  507K 3s
    50K .......... .......... .......... .......... ..........  5% 1011K 2s
   100K .......... .......... .......... .......... ..........  8% 54.0M 2s
   150K .......... .......... .......... .......... .......... 11% 1.02M 2s
   200K .......... .......... .......... .......... .......... 14% 32.8M 1s
   250K .......... .......... .......... .......... .......... 16% 74.6M 1s
   300K .......... .......... .......... .......... .......... 19% 60.3M 1s
   350K .......... .......... .......... .......... .......... 22% 90.6M 1
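
The same download can also be done from Python without a shell script. The sketch below is one possible approach using only the standard library; the helper name `download_file` is hypothetical and is not part of the original notebook. The call that fetches the real file is left as a comment so the cell does not touch the network unless you choose to run it.

```python
import urllib.request
from pathlib import Path


def download_file(url, destination):
    """Download url to destination and return the destination as a Path."""
    destination = Path(destination)
    # Create the target folder if it does not exist yet
    destination.parent.mkdir(parents=True, exist_ok=True)
    # urlretrieve copies the response body into the given local file
    urllib.request.urlretrieve(url, str(destination))
    return destination


# Usage in Colab (URL and path taken from the wget command above):
# download_file(
#     'http://www.statsclass.org/online/hla311/datasets/ChronicDiseases_byRace.zip',
#     '/content/ChronicDiseases_byRace.zip',
# )
```

Unlike `wget`, this helper prints no progress information, so the shell version may still be preferable for large files.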

### Step 2: Unzip the file

Create the script to unzip the file

In [183]:
%%bash
{   
 echo 'unzip -o /content/ChronicDiseases_byRace.zip -d "/content/"'
 } > Unzip_ChronicDiseaseData.sh

Run the script to unzip the file

In [184]:
%%bash
bash Unzip_ChronicDiseaseData.sh

Archive:  /content/ChronicDiseases_byRace.zip
  inflating: /content/Population Report 2018 June 2020 - County.txt  
  inflating: /content/Population Report 2018 June 2020 - State.txt  
  inflating: /content/Population Report 2018 June 2020.xlsx  
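
Unzipping can likewise be done in Python. The following is a minimal sketch using the standard `zipfile` module; the helper name `unzip_file` and the throwaway demo archive are assumptions made for illustration and do not appear in the original notebook.

```python
import tempfile
import zipfile
from pathlib import Path


def unzip_file(zip_path, target_dir):
    """Extract every member of zip_path into target_dir; return the member names."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # extractall overwrites files that already exist, similar to unzip -o
        archive.extractall(target_dir)
        return archive.namelist()


# Quick check on a small archive built on the fly (made-up content)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / 'demo.zip'
    with zipfile.ZipFile(demo_zip, 'w') as archive:
        archive.writestr('County.txt', 'FIPS\tCounty\n')
    extracted = unzip_file(demo_zip, Path(tmp) / 'out')
    demo_text = (Path(tmp) / 'out' / 'County.txt').read_text()

print(extracted)
```

In Colab, the equivalent of the shell script above would be `unzip_file('/content/ChronicDiseases_byRace.zip', '/content/')`.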


In [185]:
#View the contents of the file
!head -15 '/content/Population Report 2018 June 2020 - County.txt'

2018 U.S. County-level Medicare Prevalence Rates (Percent of Population) of Chronic Conditions among Racial and Ethnic Groups 																		
																		
Note:																		
[1] Color coding for quick identification of the race/ethnicity group with the highest prevalence rate in a county. Each race/ethnicity group has its own unique color. 																		
"[2] Use filters to select specific county, state, condition, and urban/rural location."																		
[3] Source: 2018 Medicare Beneficiary Summary File.																		
"[4] For measure definition and methodology, please see the Mapping Medicare Disparities Technical Documentation at:"																		
https://www.cms.gov/About-CMS/Agency-Information/OMH/Downloads/Mapping-Technical-Documentation.pdf																		
																		
*Insufficient Data																		
FIPS	County	State	Condition	Urban/Rural	Total	Total No. of Beneficiaries	White	No. of Beneficiaries Who are White	Black	No. of Be

### Step 3: Remove all the header information from the file

In [186]:
%%bash
sed -i -e 1,10d '/content/Population Report 2018 June 2020 - County.txt'

In [187]:
#View the contents of the file
!head '/content/Population Report 2018 June 2020 - County.txt'

FIPS	County	State	Condition	Urban/Rural	Total	Total No. of Beneficiaries	White	No. of Beneficiaries Who are White	Black	No. of Beneficiaries Who are Black	Asian/Pacific Islander	No. of Beneficiaries Who are Asian/Pacific Islander	Hispanic	No. of Beneficiaries Who are Hispanic	American Indian/Alaska Native	No. of Beneficiaries Who are American Indian/Alaska Native	Other	No. of Beneficiaries Who are in Other Ethic Groups
01001	AUTAUGA	AL	Chronic Kidney Disease	Urban	26	"[1,000-4,999]"	25	"[1,000-4,999]"	32	[500-999]	25	[11-499]	24	[11-499]	*	*	0	[11-499]
01003	BALDWIN	AL	Chronic Kidney Disease	Rural	23	"[10,000+]"	22	"[10,000+]"	30	"[1,000-4,999]"	19	[11-499]	19	[11-499]	21	[11-499]	18	[11-499]
01005	BARBOUR	AL	Chronic Kidney Disease	Rural	30	"[1,000-4,999]"	26	"[1,000-4,999]"	37	[500-999]	*	*	*	*	*	*	*	*
01007	BIBB	AL	Chronic Kidney Disease	Urban	32	"[1,000-4,999]"	31	"[1,000-4,999]"	35	[11-499]	*	*	*	*	*	*	*	*
01009	BLOUNT	AL	Chronic Kidney Disease	Urban	25	"[1,000-4,999]"	25	"[1,
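
An alternative to editing the file in place is to leave it untouched and skip the header notes at read time. The sketch below assumes, as the `head -15` output suggests, that exactly 10 lines precede the column names. The miniature text used here is a stand-in for the real file, and its note lines are made up.

```python
import io

import pandas as pd

# Stand-in for the raw county file: 10 note lines, then the real header row
raw_text = "".join(f"note line {i}\n" for i in range(1, 11))
raw_text += "FIPS\tCounty\tState\tTotal\n"
raw_text += "01001\tAUTAUGA\tAL\t26\n"
raw_text += "27001\tAITKIN\tMN\t18\n"

# skiprows=10 ignores the first 10 lines, so line 11 supplies the column names
county = pd.read_csv(io.StringIO(raw_text), sep="\t", skiprows=10)
print(county.columns.tolist())
```

One advantage of this approach is that it can be rerun safely, whereas running the `sed` command a second time would delete 10 more lines from the file.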

### Step 4: Replace all * Characters with NaN

In [188]:
%%bash
sed -i 's/\*/NaN/g' '/content/Population Report 2018 June 2020 - County.txt'

In [189]:
#View the contents of the file
!head '/content/Population Report 2018 June 2020 - County.txt'

FIPS	County	State	Condition	Urban/Rural	Total	Total No. of Beneficiaries	White	No. of Beneficiaries Who are White	Black	No. of Beneficiaries Who are Black	Asian/Pacific Islander	No. of Beneficiaries Who are Asian/Pacific Islander	Hispanic	No. of Beneficiaries Who are Hispanic	American Indian/Alaska Native	No. of Beneficiaries Who are American Indian/Alaska Native	Other	No. of Beneficiaries Who are in Other Ethic Groups
01001	AUTAUGA	AL	Chronic Kidney Disease	Urban	26	"[1,000-4,999]"	25	"[1,000-4,999]"	32	[500-999]	25	[11-499]	24	[11-499]	NaN	NaN	0	[11-499]
01003	BALDWIN	AL	Chronic Kidney Disease	Rural	23	"[10,000+]"	22	"[10,000+]"	30	"[1,000-4,999]"	19	[11-499]	19	[11-499]	21	[11-499]	18	[11-499]
01005	BARBOUR	AL	Chronic Kidney Disease	Rural	30	"[1,000-4,999]"	26	"[1,000-4,999]"	37	[500-999]	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
01007	BIBB	AL	Chronic Kidney Disease	Urban	32	"[1,000-4,999]"	31	"[1,000-4,999]"	35	[11-499]	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
01009	BLOUNT	AL	Chronic Kidney Dise
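
The `sed` command replaces every asterisk anywhere in the file. If only fields that consist solely of `*` should count as missing, pandas can do the conversion while reading. This is a sketch on a three-line stand-in for the county file, with a reduced set of columns chosen for illustration.

```python
import io

import pandas as pd

# Reduced stand-in for the county file; * marks insufficient data
sample = (
    "FIPS\tCounty\tState\tTotal\tBlack\tNo. of Beneficiaries Who are Black\n"
    "01005\tBARBOUR\tAL\t30\t37\t[500-999]\n"
    "27001\tAITKIN\tMN\t18\t*\t*\n"
)

# na_values tells pandas to treat a field equal to "*" as missing (NaN)
county = pd.read_csv(io.StringIO(sample), sep="\t", na_values=["*"])
print(county["Black"].isna().tolist())
```

Because the missing values are handled at read time, the text file on disk keeps its original `*` markers.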

### Step 5: Getting only Counties in MN or WI

In [190]:
%%bash
 grep -E "^FIPS|^27|^55" '/content/Population Report 2018 June 2020 - County.txt' > '/content/Population Report 2018 June 2020 - County MN and WI.txt'

In [191]:
#View the contents of the file
!head '/content/Population Report 2018 June 2020 - County MN and WI.txt'

FIPS	County	State	Condition	Urban/Rural	Total	Total No. of Beneficiaries	White	No. of Beneficiaries Who are White	Black	No. of Beneficiaries Who are Black	Asian/Pacific Islander	No. of Beneficiaries Who are Asian/Pacific Islander	Hispanic	No. of Beneficiaries Who are Hispanic	American Indian/Alaska Native	No. of Beneficiaries Who are American Indian/Alaska Native	Other	No. of Beneficiaries Who are in Other Ethic Groups
27001	AITKIN	MN	Chronic Kidney Disease	Rural	18	"[1,000-4,999]"	18	"[1,000-4,999]"	NaN	NaN	NaN	NaN	NaN	NaN	11	[11-499]	NaN	NaN
27003	ANOKA	MN	Chronic Kidney Disease	Urban	24	"[10,000+]"	23	"[10,000+]"	29	[500-999]	22	[11-499]	25	[11-499]	35	[11-499]	28	[11-499]
27005	BECKER	MN	Chronic Kidney Disease	Rural	20	"[1,000-4,999]"	19	"[1,000-4,999]"	0	[11-499]	NaN	NaN	NaN	NaN	27	[11-499]	23	[11-499]
27007	BELTRAMI	MN	Chronic Kidney Disease	Rural	25	"[1,000-4,999]"	24	"[1,000-4,999]"	23	[11-499]	NaN	NaN	32	[11-499]	33	[500-999]	28	[11-499]
27009	BENTON	MN	Chronic Kidney Dis
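
The `grep` pattern works because the FIPS code is the first column and Minnesota and Wisconsin county codes begin with 27 and 55. The same subset can be taken in pandas by filtering on the `State` column directly, which does not depend on column order. The sketch below uses a small stand-in table; the `Total` value in the WI row is made up for illustration.

```python
import io

import pandas as pd

sample = (
    "FIPS\tCounty\tState\tTotal\n"
    "01001\tAUTAUGA\tAL\t26\n"
    "27001\tAITKIN\tMN\t18\n"
    "55001\tADAMS\tWI\t22\n"  # Total here is a made-up value
)
county = pd.read_csv(io.StringIO(sample), sep="\t")

# Keep only the rows whose State is MN or WI
mn_wi = county[county["State"].isin(["MN", "WI"])]
print(mn_wi["County"].tolist())
```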



---






### Import Data into Python

In [192]:
#Load the pandas package
import pandas as pd

In [193]:
#Use read_table to read in the tab delimited file into Python
ChronicDisease = pd.read_table('/content/Population Report 2018 June 2020 - County MN and WI.txt', sep='\t') 

In [194]:
#Looking at first 5 records
ChronicDisease.head(n=5)

Unnamed: 0,FIPS,County,State,Condition,Urban/Rural,Total,Total No. of Beneficiaries,White,No. of Beneficiaries Who are White,Black,No. of Beneficiaries Who are Black,Asian/Pacific Islander,No. of Beneficiaries Who are Asian/Pacific Islander,Hispanic,No. of Beneficiaries Who are Hispanic,American Indian/Alaska Native,No. of Beneficiaries Who are American Indian/Alaska Native,Other,No. of Beneficiaries Who are in Other Ethic Groups
0,27001,AITKIN,MN,Chronic Kidney Disease,Rural,18,"[1,000-4,999]",18,"[1,000-4,999]",,,,,,,11.0,[11-499],,
1,27003,ANOKA,MN,Chronic Kidney Disease,Urban,24,"[10,000+]",23,"[10,000+]",29.0,[500-999],22.0,[11-499],25.0,[11-499],35.0,[11-499],28.0,[11-499]
2,27005,BECKER,MN,Chronic Kidney Disease,Rural,20,"[1,000-4,999]",19,"[1,000-4,999]",0.0,[11-499],,,,,27.0,[11-499],23.0,[11-499]
3,27007,BELTRAMI,MN,Chronic Kidney Disease,Rural,25,"[1,000-4,999]",24,"[1,000-4,999]",23.0,[11-499],,,32.0,[11-499],33.0,[500-999],28.0,[11-499]
4,27009,BENTON,MN,Chronic Kidney Disease,Urban,23,"[1,000-4,999]",23,"[1,000-4,999]",35.0,[11-499],27.0,[11-499],36.0,[11-499],,,,
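
Note that pandas reads FIPS as a number by default. That is harmless for MN and WI, but codes with a leading zero (for example 01001 in Alabama) would lose it. If the codes are needed as identifiers, one option is to read the column as text. This is a sketch on a two-row stand-in for the file.

```python
import io

import pandas as pd

sample = (
    "FIPS\tCounty\tState\tTotal\n"
    "01001\tAUTAUGA\tAL\t26\n"
    "27001\tAITKIN\tMN\t18\n"
)

# Default behavior: FIPS is parsed as an integer, so 01001 becomes 1001
as_number = pd.read_table(io.StringIO(sample))

# dtype={'FIPS': str} keeps the code exactly as written in the file
as_text = pd.read_table(io.StringIO(sample), dtype={"FIPS": str})

print(as_number["FIPS"].tolist(), as_text["FIPS"].tolist())
```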


Next, install the dfply package using pip

In [195]:
%pip install dfply



In [196]:
#Load the dfply package
from dfply import *

### Getting Simple Summaries for each Condition by State

In [197]:
##Using dfply to get groups, summarize, and split columns by State
Outcomes = (
        ChronicDisease 
        >> group_by(X.State, X.Condition)
        >> summarize(Avg = X.Total.mean())
        >> spread(X.State, X.Avg)
        )

#Use pretty print for outcomes
print(Outcomes.round(1).to_string(index=False))

                             Condition    MN    WI
                Chronic Kidney Disease  21.6  22.8
 Chronic Obstructive Pulmonary Disease   9.1   9.7
              Congestive Heart Failure  13.0  13.3
                              Diabetes  22.0  23.3
                          Hypertension  44.6  48.9
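
For readers who prefer not to install dfply, roughly the same table can be produced with plain pandas. The sketch below runs on made-up numbers, with only the column names matching the notebook, and the variable names `demo` and `outcomes` are chosen for illustration.

```python
import pandas as pd

# Made-up miniature data with the same column names as ChronicDisease
demo = pd.DataFrame({
    "State": ["MN", "MN", "WI", "WI", "MN", "WI"],
    "Condition": ["Diabetes", "Diabetes", "Diabetes", "Diabetes",
                  "Hypertension", "Hypertension"],
    "Total": [20, 24, 22, 26, 44, 48],
})

outcomes = (
    demo.groupby(["Condition", "State"])["Total"]
        .mean()              # average Total per Condition and State
        .unstack("State")    # one column per State, like spread()
        .reset_index()       # turn Condition back into a regular column
)
outcomes.columns.name = None  # drop the leftover "State" label on the columns

print(outcomes.round(1).to_string(index=False))
```

Here `groupby` plus `mean` plays the role of `group_by` and `summarize`, and `unstack` plays the role of `spread`.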
