# parse_data.ipynb

This notebook parses the data files used for the FP-2 assignment. 

<br>
<br>

First let's read the attached data file:

In [1]:
import pandas as pd
import numpy as np

df0 = pd.read_csv('healthassessment.csv')

df0.describe()

| | Year | Sort_1000_or_more | Cause of death rank | Latest_data |
|---|---|---|---|---|
| count | 32079.0 | 62714.0 | 38.0 | 34.0 |
| mean | 2017.86474 | 5.248047 | 3.0 | 1.0 |
| std | 3.447658 | 1.241516 | 1.57686 | 0.0 |
| min | 1997.0 | 0.0 | 1.0 | 1.0 |
| 25% | 2017.0 | 4.0 | 2.0 | 1.0 |
| 50% | 2019.0 | 5.0 | 3.0 | 1.0 |
| 75% | 2020.0 | 6.0 | 4.0 | 1.0 |
| max | 2025.0 | 8.0 | 7.0 | 1.0 |
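
The `count` row above differs a lot between columns (32079 for `Year` versus 62714 for `Sort_1000_or_more`), which suggests that many rows have missing values. Here is a minimal sketch of how that could be checked. The `toy` DataFrame and its values are made up for illustration and are not rows from `healthassessment.csv`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df0; the values are made up.
toy = pd.DataFrame({
    'Year': [2019.0, 2020.0, np.nan, np.nan],
    'Latest_data': [1.0, np.nan, np.nan, np.nan],
})

# describe() counts only non-missing values in its 'count' row,
# so isna().sum() gives the complementary number of missing entries.
missing = toy.isna().sum()
print(missing)
```

Running the same `isna().sum()` call on `df0` would show how many entries are missing in each column of the real file.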


<br>
<br>

The dependent and independent variables (DVs and IVs) that we are interested in are:

**DVs**:
- The number of people or cases in the population group (categorized by age group and year of data recording) with the condition ("Number_with_outcome" column in the CSV file)

**IVs**:
- Race-ethnicity ("Race_ethnicity" column in the CSV file)
- Age group ("Age_group" column in the CSV file)
- The number of people in the overall group ("Denominator" column in the CSV file)
- Year ("Year" column in the CSV file)
- Zip code ("Zip_code" column in the CSV file)
- The primary neighborhoods associated with each zip code ("Primary_Neighborhood" column in the CSV file)
- Details noted about each cause of death group ("Death_tooltip" column in the CSV file)
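
Before selecting these columns, it can help to confirm that each name in the lists above actually appears in the file. This is a small sketch. The helper `missing_columns` and the `toy` DataFrame are hypothetical and only illustrate the check:

```python
import pandas as pd

# Column names taken from the DV and IV lists above.
dv_cols = ['Number_with_outcome']
iv_cols = ['Race_ethnicity', 'Age_group', 'Denominator', 'Year',
           'Zip_code', 'Primary_Neighborhood', 'Death_tooltip']

def missing_columns(frame, wanted):
    """Return the names in `wanted` that are not columns of `frame`."""
    return [c for c in wanted if c not in frame.columns]

# Hypothetical stand-in for df0 with one column deliberately left out.
toy = pd.DataFrame(columns=[c for c in dv_cols + iv_cols if c != 'Zip_code'])
print(missing_columns(toy, dv_cols + iv_cols))
```

With the real data, `missing_columns(df0, dv_cols + iv_cols)` should return an empty list if every name is spelled as in the CSV header.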


<br>
<br>

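Let's extract the relevant columns. `Number_with_outcome` and `Denominator` are stored as text with comma thousands separators, so the commas are stripped and both columns are converted to floats. Because many IVs of interest are text values and therefore do not appear in `df.describe()`, `df.head()` is shown alongside the original template's `df.describe()` as an alternative view of the dataset.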

In [2]:
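# Select the columns of interest; .copy() gives an independent DataFrame,
# so the assignments below do not trigger chained-assignment warnings.
cols = ['Race_ethnicity', 'Age_group', 'Denominator', 'Year', 'Zip_code',
        'Primary_Neighborhood', 'Death_tooltip', 'Number_with_outcome']
df = df0[cols].copy()

# Both count columns are stored as text with thousands separators;
# strip the commas and convert to float.
df['Number_with_outcome'] = df['Number_with_outcome'].replace({',':''}, regex=True).astype(float)
df['Denominator'] = df['Denominator'].replace({',':''}, regex=True).astype(float)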

df.describe()

| | Denominator | Year | Number_with_outcome |
|---|---|---|---|
| count | 62526.0 | 32079.0 | 62595.0 |
| mean | 42738.21 | 2017.86474 | 461.820491 |
| std | 130897.2 | 3.447658 | 2624.301571 |
| min | 23.0 | 1997.0 | 14.0 |
| 25% | 1116.0 | 2017.0 | 36.0 |
| 50% | 7184.0 | 2019.0 | 74.0 |
| 75% | 26367.0 | 2020.0 | 221.0 |
| max | 2804452.0 | 2025.0 | 176888.0 |
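
The comma-stripping step can be tried in isolation on a few values. The sketch below uses `Series.str.replace` rather than the `replace(..., regex=True)` call in the cell above; for this purpose it is an equivalent spelling. The `toy` values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column mimicking counts stored as text with commas.
toy = pd.DataFrame({'Denominator': ['8,307', '2,804,452', '23', np.nan]})

# Remove every comma, then cast to float; missing values stay NaN.
cleaned = toy['Denominator'].str.replace(',', '', regex=False).astype(float)
print(cleaned.tolist())
```

Another possible approach is to pass `thousands=','` to `pd.read_csv`, which asks pandas to parse such numbers while reading. Whether that handles every value in this particular file has not been checked here.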


In [3]:
df.head(5)

| | Race_ethnicity | Age_group | Denominator | Year | Zip_code | Primary_Neighborhood | Death_tooltip | Number_with_outcome |
|---|---|---|---|---|---|---|---|---|
| 0 | ALL | 0to12months | 8307.0 | 2019.0 | ALL | All | | 25.0 |
| 1 | ALL | 0to12months | 7890.0 | 2020.0 | ALL | All | | 43.0 |
| 2 | ALL | 0to12months | 7413.0 | 2021.0 | ALL | All | | 46.0 |
| 3 | ALL | 0to12months | 7049.0 | 2022.0 | ALL | All | | 24.0 |
| 4 | ALL | 0to12months | 6673.0 | | ALL | All | | 21.0 |


<br>
<br>

Next, let's use the `rename` method to give the columns simpler variable names:

In [4]:
df = df.rename( columns={'Race_ethnicity':'race_ethnicity', 'Age_group':'agegroup', 'Denominator':'denominator',
                        'Year':'year', 'Zip_code':'zipcode', 'Primary_Neighborhood':'primaryneighborhood',
                        'Death_tooltip':'deathtooltip', 'Number_with_outcome':'number_with_outcome'} )

df.describe()

| | denominator | year | number_with_outcome |
|---|---|---|---|
| count | 62526.0 | 32079.0 | 62595.0 |
| mean | 42738.21 | 2017.86474 | 461.820491 |
| std | 130897.2 | 3.447658 | 2624.301571 |
| min | 23.0 | 1997.0 | 14.0 |
| 25% | 1116.0 | 2017.0 | 36.0 |
| 50% | 7184.0 | 2019.0 | 74.0 |
| 75% | 26367.0 | 2020.0 | 221.0 |
| max | 2804452.0 | 2025.0 | 176888.0 |
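
The statistics are unchanged because `rename` only relabels the columns. This small sketch shows the behaviour on made-up data (the `toy` DataFrame is hypothetical): `rename` returns a new DataFrame, and mapping keys that do not match any column are ignored by default.

```python
import pandas as pd

# Hypothetical toy frame with two of the original column names.
toy = pd.DataFrame({'Age_group': ['0to12months'], 'Year': [2019.0]})

# rename returns a new DataFrame; keys absent from the frame
# (here 'Zip_code') are ignored under the default errors='ignore'.
renamed = toy.rename(columns={'Age_group': 'agegroup', 'Year': 'year',
                              'Zip_code': 'zipcode'})
print(list(renamed.columns))
print(list(toy.columns))
```

Because a misspelled key is skipped silently, comparing `df.columns` before and after the call is a cheap way to confirm that every intended column was renamed.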


In [5]:
df.head(5)

| | race_ethnicity | agegroup | denominator | year | zipcode | primaryneighborhood | deathtooltip | number_with_outcome |
|---|---|---|---|---|---|---|---|---|
| 0 | ALL | 0to12months | 8307.0 | 2019.0 | ALL | All | | 25.0 |
| 1 | ALL | 0to12months | 7890.0 | 2020.0 | ALL | All | | 43.0 |
| 2 | ALL | 0to12months | 7413.0 | 2021.0 | ALL | All | | 46.0 |
| 3 | ALL | 0to12months | 7049.0 | 2022.0 | ALL | All | | 24.0 |
| 4 | ALL | 0to12months | 6673.0 | | ALL | All | | 21.0 |
