# parse_data.ipynb

This notebook parses the data files used for the FP-2 assignment. 

<br>
<br>

First let's read the attached data file:

In [1]:
import pandas as pd
import numpy as np

df0 = pd.read_csv('healthassessment.csv')

df0.describe()

Unnamed: 0,Year,Sort_1000_or_more,Cause of death rank,Latest_data
count,32079.0,62714.0,38.0,34.0
mean,2017.86474,5.248047,3.0,1.0
std,3.447658,1.241516,1.57686,0.0
min,1997.0,0.0,1.0,1.0
25%,2017.0,4.0,2.0,1.0
50%,2019.0,5.0,3.0,1.0
75%,2020.0,6.0,4.0,1.0
max,2025.0,8.0,7.0,1.0
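
The `count` row above differs from column to column because `describe()` counts only non-missing values. The sketch below illustrates this on a small made-up frame (the `toy` frame and its `Rank` column are assumptions for illustration, not rows from `healthassessment.csv`):

```python
import numpy as np
import pandas as pd

# Made-up frame: "Year" has one missing entry, "Rank" has two.
toy = pd.DataFrame({
    "Year": [2019.0, 2020.0, np.nan, 2022.0],
    "Rank": [1.0, np.nan, np.nan, 3.0],
})

# The "count" row of describe() counts non-missing values only.
counts = toy.describe().loc["count"]

# isna().sum() gives the complementary number of missing values per column.
missing = toy.isna().sum()

print(counts)
print(missing)
```

Running `df0.isna().sum()` on the real data would show in the same way how many entries each column is missing.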


<br>
<br>

The dependent and independent variables (DVs and IVs) that we are interested in are:

**DVs**:
- The number of people or cases in the population group (categorized by age group and year of data recording) with the condition ("Number_with_outcome" column in the CSV file)

**IVs**:
- Race-ethnicity ("Race_ethnicity" column in the CSV file)
- Age group ("Age_group" column in the CSV file)
- The number of people in the overall group ("Denominator" column in the CSV file)
- Year ("Year" column in the CSV file)
- Zip code ("Zip_code" column in the CSV file)
- The primary neighborhoods associated with each zip code ("Primary_Neighborhood" column in the CSV file)
- Details noted about each cause of death group ("Death_tooltip" column in the CSV file)
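
Before selecting these columns, it can help to confirm that they are all present in the loaded frame. A minimal sketch, assuming a hypothetical helper `missing_columns` and a made-up two-column frame in place of the real CSV:

```python
import pandas as pd

# Column names taken from the DV and IV lists above.
DV_COLUMNS = ["Number_with_outcome"]
IV_COLUMNS = [
    "Race_ethnicity", "Age_group", "Denominator", "Year",
    "Zip_code", "Primary_Neighborhood", "Death_tooltip",
]

def missing_columns(frame, required):
    """Return the required column names that are absent from `frame` (hypothetical helper)."""
    return [name for name in required if name not in frame.columns]

# Made-up frame, used only to exercise the helper.
toy = pd.DataFrame({"Year": [2019, 2020], "Age_group": ["0to12months", "0to12months"]})

missing = missing_columns(toy, DV_COLUMNS + IV_COLUMNS)
print(missing)
```

With the real data, `missing_columns(df0, DV_COLUMNS + IV_COLUMNS)` should return an empty list if the CSV contains every column listed above.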


<br>
<br>

Let's extract the relevant columns. Because every column of interest except `Year` is stored as text, `df.describe()` summarizes only `Year`. `df.head()` has therefore been added alongside the original template's `df.describe()` to give a better picture of the dataset.

In [5]:
df = df0[
    ['Race_ethnicity', 'Age_group', 'Denominator', 'Year', 'Zip_code',
     'Primary_Neighborhood', 'Death_tooltip', 'Number_with_outcome']
]

df.describe()

Unnamed: 0,Year
count,32079.0
mean,2017.86474
std,3.447658
min,1997.0
25%,2017.0
50%,2019.0
75%,2020.0
max,2025.0


In [None]:
df.head(5)
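
The `df.describe()` output above lists only `Year` because, by default, `describe()` summarizes numeric columns only. The sketch below shows two possible ways to look at the text columns as well. It uses made-up rows, and the `"not available"` entry is an assumption for illustration, not a value known to occur in the CSV:

```python
import pandas as pd

# Made-up rows that imitate the column layout; not taken from the CSV.
toy = pd.DataFrame({
    "Age_group": ["0to12months", "0to12months", "1to4years"],
    "Denominator": ["8307", "7890", "not available"],  # stored as text
    "Year": [2019.0, 2020.0, None],
})

# By default describe() summarizes numeric columns only.
numeric_summary = toy.describe()

# include="all" adds count/unique/top/freq rows for the text columns.
full_summary = toy.describe(include="all")

# Optional: convert a text column to numbers; unparseable entries become NaN.
denominator_num = pd.to_numeric(toy["Denominator"], errors="coerce")

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
print(denominator_num)
```

Whether a numeric conversion is appropriate for `Denominator` or `Number_with_outcome` depends on what non-numeric entries the real columns contain, so inspect them before converting.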

<br>
<br>

Next, let's use the `rename` method to give the columns simpler variable names:

In [3]:
df = df.rename( columns={'Race_ethnicity':'race_ethnicity', 'Age_group':'agegroup', 'Denominator':'denominator',
                        'Year':'year', 'Zip_code':'zipcode', 'Primary_Neighborhood':'primaryneighborhood',
                        'Death_tooltip':'deathtooltip', 'Number_with_outcome':'number_with_outcome'} )

df.head(5)

Unnamed: 0,race_ethnicity,agegroup,denominator,year,zipcode,primaryneighborhood,deathtooltip,number_with_outcome
0,ALL,0to12months,8307,2019.0,ALL,All,,25
1,ALL,0to12months,7890,2020.0,ALL,All,,43
2,ALL,0to12months,7413,2021.0,ALL,All,,46
3,ALL,0to12months,7049,2022.0,ALL,All,,24
4,ALL,0to12months,6673,,ALL,All,,21


In [4]:
df.describe()

Unnamed: 0,year
count,32079.0
mean,2017.86474
std,3.447658
min,1997.0
25%,2017.0
50%,2019.0
75%,2020.0
max,2025.0
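
As a small illustration of how `rename` behaves, the sketch below uses a made-up single-row frame rather than the real data. It shows that `rename` returns a new DataFrame, leaves unmapped columns untouched, and can be asked to raise on a misspelled key via `errors="raise"`:

```python
import pandas as pd

# Made-up single-row frame with three of the original column names.
toy = pd.DataFrame({"Age_group": ["0to12months"], "Year": [2019.0], "Zip_code": ["ALL"]})

# rename() returns a new DataFrame; names missing from the mapping are kept as-is.
renamed = toy.rename(columns={"Age_group": "agegroup", "Year": "year"})

print(list(toy.columns))      # the original frame is unchanged
print(list(renamed.columns))

# By default a misspelled key is silently ignored; errors="raise" surfaces it.
try:
    toy.rename(columns={"Agegroup": "agegroup"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(typo_detected)
```

Passing `errors="raise"` is optional, but it guards against a typo in the mapping going unnoticed.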
