# Steps in a Data Collection Project

1. Identify the data source, requirements

2. Establish connection, collect sample data
*     Import libraries
*     Define constants, control variables and import credentials

3. Identify data format

4. Save data for later analysis

# 1. Identify the data source, requirements

Source type: webpage hosting a PDF that contains tabular data
https://www.cdc.gov/nchs/data/databriefs/db395-tables-508.pdf#page=4

Account/credentials required: n/a (public)

# 2. Establish connection, collect sample data

    Import libraries

In [1]:
# importnb: lets a Jupyter notebook be imported like a Python module
# run: importnb-install from Conda before using
from importnb import Notebook
with Notebook(): 
    import Utility

# custom helper class (from jupyter notebook)
helper = Utility.Helper()

import pandas as pd

import re

Class 'Helper' v1.3 has been loaded


In [2]:
# reload changes in Jupyter notebooks
from importlib import reload
with Notebook(): __name__ == '__main__' and reload(Utility)

    Define constants, control variables and import credentials

In [3]:
DATA_PATH = '../../data/'
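
Because the source is a public PDF, "establishing a connection" here can be as simple as downloading the file. The sketch below is not part of the original notebook: it assumes only the standard library, and the names `split_pdf_url` and `download_pdf` are hypothetical helpers introduced for illustration. It separates the `#page=4` fragment (a viewer hint, not part of the download address) from the URL, and defines a download function that is not called here since it needs network access.

```python
from pathlib import Path
from urllib.parse import parse_qs, urldefrag
from urllib.request import urlretrieve

SOURCE_URL = "https://www.cdc.gov/nchs/data/databriefs/db395-tables-508.pdf#page=4"
DATA_PATH = "../../data/"


def split_pdf_url(url):
    """Return (download_url, page); page is taken from a '#page=N' fragment if present."""
    base, fragment = urldefrag(url)
    params = parse_qs(fragment)
    page = int(params["page"][0]) if "page" in params else None
    return base, page


def download_pdf(url, folder=DATA_PATH):
    """Download the PDF into `folder` and return the local path (requires network access)."""
    base, _ = split_pdf_url(url)
    target = Path(folder) / Path(base).name
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(base, str(target))
    return target


pdf_url, table_page = split_pdf_url(SOURCE_URL)
print(pdf_url, table_page)
```

Calling `download_pdf(SOURCE_URL)` would then save a local copy of the PDF next to the other project data, so the table can be checked against the source later.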

# 3. Identify data format

    The data is stored in a pandas DataFrame representing the following table.
    
    Data table for Figure 4. Number of deaths, percentage of total deaths, and age-adjusted death rates for the 10 leading causes of death in 2019: United States, 2018 and 2019
        
    Legend

        Rank – Based on number of deaths
        Rate – Deaths per 100,000 U.S. standard population
        Cause of death (based on International Classification of Diseases, 10th Revision [ICD–10])
            *code not included in ICD-10

In [15]:
cause_columns = ['Rank',
    'Cause of death',
    'Classification code',
    '2018 Number',
    '2018 Percent',
    '2018 Rate',
    '2019 Number',
    '2019 Percent',
    '2019 Rate'
]
    
cause_data = [
    ['-', 'All causes', '', 2839205, 100.0, 723.6, 2854838, 100.0, 715.2],
    [1, 'Diseases of heart', 'I00–I09,I11,I13,I20–I51', 655381, 23.1, 163.6, 659041, 23.1, 161.5],
    [2, 'Malignant neoplasms', 'C00–C97', 599274, 21.1, 149.1, 599601, 21.0, 146.2],
    [3, 'Accidents (unintentional injuries)', 'V01–X59,Y85–Y86', 167127, 5.9, 48.0, 173040, 6.1, 49.3],
    [4, 'Chronic lower respiratory diseases', 'J40–J47', 159486, 5.6, 39.7, 156979, 5.5, 38.2],
    [5, 'Cerebrovascular diseases', 'I60–I69', 147810, 5.2, 37.1, 150005, 5.3, 37.0],  
    [6, 'Alzheimer disease', 'G30', 122019, 4.3, 30.5, 121499, 4.3, 29.8],
    [7, 'Diabetes mellitus', 'E10–E14', 84946, 3.0, 21.4, 87647, 3.1, 21.6],
    [8, 'Nephritis, nephrotic syndrome and nephrosis', 'N00–N07,N17–N19,N25–N27', 51386, 1.8, 12.9, 51565, 1.8, 12.7],
    [9, 'Influenza and pneumonia', 'J09–J18', 59120, 2.1, 14.9, 49783, 1.7, 12.3],
    [10, 'Intentional self-harm (suicide)', '*U03,X60–X84,Y87.0', 48344, 1.7, 14.2, 47511, 1.7, 13.9],
    ['-', 'All other causes (residual)', 'C00',  744312, 26.2, '-', 758167, 26.6, '-']  
]
    
cause_df = pd.DataFrame(columns=cause_columns, data=cause_data)
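
The `'-'` placeholders in the `Rank` and `Rate` columns make pandas store those columns with the generic `object` dtype, which gets in the way of numeric analysis later. One possible way to handle this, shown as a minimal sketch on a three-row stand-in frame (`sample_df` is a name introduced here, with rows copied from the table), is to convert with `pd.to_numeric(..., errors="coerce")` so the placeholders become missing values.

```python
import pandas as pd

# Small stand-in frame using three rows copied from the table above
sample_df = pd.DataFrame(
    columns=["Rank", "Cause of death", "2018 Rate"],
    data=[
        ["-", "All causes", 723.6],
        [1, "Diseases of heart", 163.6],
        ["-", "All other causes (residual)", "-"],
    ],
)

# errors="coerce" turns non-numeric placeholders such as '-' into NaN
sample_df["Rank"] = pd.to_numeric(sample_df["Rank"], errors="coerce").astype("Int64")
sample_df["2018 Rate"] = pd.to_numeric(sample_df["2018 Rate"], errors="coerce")

print(sample_df.dtypes)
```

The nullable `Int64` dtype keeps ranks as integers while still allowing missing values for the unranked rows.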

In [16]:
cause_df

,Rank,Cause of death,Classification code,2018 Number,2018 Percent,2018 Rate,2019 Number,2019 Percent,2019 Rate
0,-,All causes,,2839205,100.0,723.6,2854838,100.0,715.2
1,1,Diseases of heart,"I00–I09,I11,I13,I20–I51",655381,23.1,163.6,659041,23.1,161.5
2,2,Malignant neoplasms,C00–C97,599274,21.1,149.1,599601,21.0,146.2
3,3,Accidents (unintentional injuries),"V01–X59,Y85–Y86",167127,5.9,48.0,173040,6.1,49.3
4,4,Chronic lower respiratory diseases,J40–J47,159486,5.6,39.7,156979,5.5,38.2
5,5,Cerebrovascular diseases,I60–I69,147810,5.2,37.1,150005,5.3,37.0
6,6,Alzheimer disease,G30,122019,4.3,30.5,121499,4.3,29.8
7,7,Diabetes mellitus,E10–E14,84946,3.0,21.4,87647,3.1,21.6
8,8,"Nephritis, nephrotic syndrome and nephrosis","N00–N07,N17–N19,N25–N27",51386,1.8,12.9,51565,1.8,12.7
9,9,Influenza and pneumonia,J09–J18,59120,2.1,14.9,49783,1.7,12.3
10,10,Intentional self-harm (suicide),"*U03,X60–X84,Y87.0",48344,1.7,14.2,47511,1.7,13.9
11,-,All other causes (residual),C00,744312,26.2,-,758167,26.6,-
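
Since the values were typed in by hand from the PDF, a quick consistency check can catch transcription errors before saving. The following is a sketch rather than part of the original notebook: it copies only the death counts into a separate frame (`counts`, a name introduced here), verifies that the individual cause rows add up to the "All causes" row for both years, and recomputes the 2019 percentages from the counts.

```python
import pandas as pd

# Death counts copied from the table above (Number columns only)
counts = pd.DataFrame(
    columns=["Cause of death", "2018 Number", "2019 Number"],
    data=[
        ["All causes", 2839205, 2854838],
        ["Diseases of heart", 655381, 659041],
        ["Malignant neoplasms", 599274, 599601],
        ["Accidents (unintentional injuries)", 167127, 173040],
        ["Chronic lower respiratory diseases", 159486, 156979],
        ["Cerebrovascular diseases", 147810, 150005],
        ["Alzheimer disease", 122019, 121499],
        ["Diabetes mellitus", 84946, 87647],
        ["Nephritis, nephrotic syndrome and nephrosis", 51386, 51565],
        ["Influenza and pneumonia", 59120, 49783],
        ["Intentional self-harm (suicide)", 48344, 47511],
        ["All other causes (residual)", 744312, 758167],
    ],
)

number_columns = ["2018 Number", "2019 Number"]
is_total = counts["Cause of death"] == "All causes"

total = counts.loc[is_total, number_columns].iloc[0]   # the "All causes" row
parts = counts.loc[~is_total, number_columns].sum()    # sum of every other row

# The ten leading causes plus the residual should account for all deaths
assert (parts == total).all(), "cause rows do not add up to the 'All causes' row"

# Percent of total deaths, recomputed from the counts and rounded like the table
percent_2019 = (100 * counts["2019 Number"] / total["2019 Number"]).round(1)
print(percent_2019.tolist())
```

If a digit had been mistyped in any count, the assertion would fail and point to the year whose column needs rechecking.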


# 4. Save data for later analysis

In [18]:
# reuse the DATA_PATH constant defined in step 2
cause_df.to_csv(DATA_PATH + "us_deaths_by_cause.csv")
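
One detail worth knowing when the file is loaded again for analysis: by default `to_csv` also writes the row index, which `read_csv` then returns as an extra column named `Unnamed: 0`. The sketch below uses a temporary directory and a two-row stand-in frame (both assumptions for illustration, not the notebook's real output path) to compare the default with `index=False`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows copied from the table, enough to show the round trip
sample_df = pd.DataFrame(
    columns=["Rank", "Cause of death", "2019 Number"],
    data=[[1, "Diseases of heart", 659041], [2, "Malignant neoplasms", 599601]],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = Path(tmp_dir) / "with_index.csv"
    without_index = Path(tmp_dir) / "without_index.csv"

    sample_df.to_csv(with_index)                  # default: row index is written too
    sample_df.to_csv(without_index, index=False)  # only the named columns are written

    reloaded_with_index = pd.read_csv(with_index)
    reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Alternatively, a file saved with the index can be read back with `pd.read_csv(path, index_col=0)` to restore the index instead of getting the extra column.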