# CDC Wonder COD Analysis

Reads `cod113-gender-5yr.txt` and makes a couple of charts to
examine cause of death by age and relative cause of death by age.
Groups the causes of death into high-level categories.

Writes out a dataset that has cancer deaths by age and gender,
along with the relative percentage of cancer as a cause of death.

## Setup

In [1]:
import pandas as pd
from io import StringIO
import altair as alt

alt.data_transformers.disable_max_rows()

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Read data

In [2]:
!head cod113-gender-5yr.txt

"Notes"	"ICD-10 113 Cause List"	"ICD-10 113 Cause List Code"	"Five-Year Age Groups"	"Five-Year Age Groups Code"	"Gender"	"Gender Code"	Deaths	Population	Crude Rate	Crude Rate Lower 95% Confidence Interval	Crude Rate Upper 95% Confidence Interval	Crude Rate Standard Error
	"#Salmonella infections (A01-A02)"	"GR113-001"	"< 1 year"	"1"	"Female"	"F"	25	42477103	0.1	0.0	0.1	0.0
	"#Salmonella infections (A01-A02)"	"GR113-001"	"< 1 year"	"1"	"Male"	"M"	22	44436653	0.0	0.0	0.1	0.0
	"#Salmonella infections (A01-A02)"	"GR113-001"	"40-44 years"	"40-44"	"Female"	"F"	10	236277503	Unreliable	0.0	0.0	0.0
	"#Salmonella infections (A01-A02)"	"GR113-001"	"45-49 years"	"45-49"	"Male"	"M"	14	232932105	Unreliable	0.0	0.0	0.0
	"#Salmonella infections (A01-A02)"	"GR113-001"	"50-54 years"	"50-54"	"Male"	"M"	31	223575070	0.0	0.0	0.0	0.0
	"#Salmonella infections (A01-A02)"	"GR113-001"	"55-59 years"	"55-59"	"Female"	"F"	24	213742200	0.0	0.0	0.0	0.0
	"#Salmonella infections (A01-A02)"	"GR113-001"	"55-59 years"	"5

In [3]:
!tail -n 10 cod113-gender-5yr.txt

"figures for 2000 are April 1 Census counts. Population figures for 1999 are from the 1990-1999 intercensal series of July 1"
"estimates. Population figures for the infant age groups are the number of live births. <br/><b>Note:</b> Rates and population"
"figures for years 2001 - 2009 differ slightly from previously published reports, due to use of the population estimates which"
"were available at the time of release."
"13. The population figures used in the calculation of death rates for the age group 'under 1 year' are the estimates of the"
"resident population that is under one year of age. More information: http://wonder.cdc.gov/wonder/help/ucd.html#Age Group."
"14. Beginning with the 2018 data, changes have been implemented that affect the counts for ICD-10 cause of death codes O00-O99"
"compared to previous practice. In addition, data for the cause of death codes O00-O99 for 2003 through 2017 reflect differences"
"in information available to individual states and probable errors.

In [4]:
df = pd.read_csv(
    'cod113-gender-5yr.txt',
    # delim_whitespace=True,
    sep="\t",
    quotechar='"',
    # skipinitialspace=True,
    na_values=['Not Applicable', 'Unreliable'],
    thousands=','
)
df.head()

Unnamed: 0,Notes,ICD-10 113 Cause List,ICD-10 113 Cause List Code,Five-Year Age Groups,Five-Year Age Groups Code,Gender,Gender Code,Deaths,Population,Crude Rate,Crude Rate Lower 95% Confidence Interval,Crude Rate Upper 95% Confidence Interval,Crude Rate Standard Error
0,,#Salmonella infections (A01-A02),GR113-001,< 1 year,1,Female,F,25.0,42477103.0,0.1,0.0,0.1,0.0
1,,#Salmonella infections (A01-A02),GR113-001,< 1 year,1,Male,M,22.0,44436653.0,0.0,0.0,0.1,0.0
2,,#Salmonella infections (A01-A02),GR113-001,40-44 years,40-44,Female,F,10.0,236277503.0,,0.0,0.0,0.0
3,,#Salmonella infections (A01-A02),GR113-001,45-49 years,45-49,Male,M,14.0,232932105.0,,0.0,0.0,0.0
4,,#Salmonella infections (A01-A02),GR113-001,50-54 years,50-54,Male,M,31.0,223575070.0,0.0,0.0,0.0,0.0


In [5]:
df.tail()

Unnamed: 0,Notes,ICD-10 113 Cause List,ICD-10 113 Cause List Code,Five-Year Age Groups,Five-Year Age Groups Code,Gender,Gender Code,Deaths,Population,Crude Rate,Crude Rate Lower 95% Confidence Interval,Crude Rate Upper 95% Confidence Interval,Crude Rate Standard Error
5133,resident population that is under one year of ...,,,,,,,,,,,,
5134,"14. Beginning with the 2018 data, changes have...",,,,,,,,,,,,
5135,"compared to previous practice. In addition, da...",,,,,,,,,,,,
5136,in information available to individual states ...,,,,,,,,,,,,
5137,information can be found at: https://www.cdc.g...,,,,,,,,,,,,


In [6]:
df.shape

(5138, 13)

### Filter out the notes at the bottom of the file and rows with unknown age

In [7]:
df = df.loc[df.Notes.isnull()]
df.shape

(5053, 13)

In [8]:
df.tail()

Unnamed: 0,Notes,ICD-10 113 Cause List,ICD-10 113 Cause List Code,Five-Year Age Groups,Five-Year Age Groups Code,Gender,Gender Code,Deaths,Population,Crude Rate,Crude Rate Lower 95% Confidence Interval,Crude Rate Upper 95% Confidence Interval,Crude Rate Standard Error
5048,,#COVID-19 (U07.1),GR113-137,90-94 years,90-94,Male,M,16168.0,,,,,
5049,,#COVID-19 (U07.1),GR113-137,95-99 years,95-99,Female,F,11683.0,,,,,
5050,,#COVID-19 (U07.1),GR113-137,95-99 years,95-99,Male,M,5471.0,,,,,
5051,,#COVID-19 (U07.1),GR113-137,100+ years,100+,Female,F,2713.0,,,,,
5052,,#COVID-19 (U07.1),GR113-137,100+ years,100+,Male,M,731.0,,,,,


Final dataset:

In [9]:
df = df.loc[df['Five-Year Age Groups Code'] != 'NS']
df.shape

(4958, 13)
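
The two filters above can be tried on a small in-memory sample. This is a minimal sketch with made-up rows in the same tab-separated layout (the `sample` text, its numbers, and the shortened column names are illustrative, not taken from the real export). It assumes footer lines fill only the `Notes` column and that unknown ages carry the code `NS`, as in the cells above.

```python
import pandas as pd
from io import StringIO

# Made-up sample: data rows leave "Notes" empty,
# footer lines put their text in the first ("Notes") column.
sample = (
    '"Notes"\t"Cause"\t"Five-Year Age Groups Code"\tDeaths\tPopulation\n'
    '\t"#Cause A"\t"40-44"\t10\t1000\n'
    '\t"#Cause A"\t"NS"\t2\tNot Applicable\n'
    '"---"\n'
    '"Dataset: example footer line"\n'
)

toy = pd.read_csv(
    StringIO(sample),
    sep="\t",
    quotechar='"',
    na_values=['Not Applicable', 'Unreliable'],
)

# Keep rows with no note, then drop the "not stated" age code
toy = toy.loc[toy['Notes'].isnull()]
toy = toy.loc[toy['Five-Year Age Groups Code'] != 'NS']
print(toy)
```

Only the `40-44` row should survive: the footer lines are dropped by the `Notes` filter and the `NS` row by the age filter.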

In [10]:
df.to_parquet('cod113-gender-5yr.parquet')

### We actually have 134 causes in here

Some of these are combinations of the others.
There are at least three different levels of hierarchy!
As an example:

- Level 1: `#Malignant neoplasms (C00-C97)` - The top "rankable" category
- Level 2: `Malignant neoplasms of lymphoid, hematopoietic and related tissue (C81-C96)` - Major subcategory
- Level 3: Specific types like:
    - `Hodgkin disease (C81)`
    - `Non-Hodgkin lymphoma (C82-C85)`
    - `Leukemia (C91-C95)`
    
I learned this the hard way,
after spending some time making a mapping that categorized these all as cancer
(and combined the other types incorrectly),
and double checking the death rates against high level data.

You also have to be careful with the `Population` numbers when you're adding up the `Deaths`,
because this file is broken out by 5-year age group and gender, and each group's population is repeated on every cause row.
So you can't just `sum` the population, or even take the `max`, when aggregating across these groups.
Claude got this wrong, and again double checking rates helped get to the right answer (had to do this myself!).
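
To make the population pitfall concrete, here is a small sketch on made-up numbers (the `toy` frame and its values are hypothetical). Each age and gender group repeats its `Population` on every cause row, so the population is reduced to one value per group before the groups are added together, while `Deaths` can be summed directly. It uses `first()` for the per-group value; any reducer that returns the shared value would do.

```python
import pandas as pd

# Made-up rows: two causes reported for each of two age/gender groups
toy = pd.DataFrame({
    'Cause': ['#A', '#B', '#A', '#B'],
    'Age': ['40-44', '40-44', '45-49', '45-49'],
    'Gender': ['F', 'F', 'F', 'F'],
    'Deaths': [10, 20, 30, 40],
    'Population': [1000, 1000, 2000, 2000],
})

# Wrong: the same people are counted once per cause row
naive_population = toy['Population'].sum()

# Also wrong: max per cause keeps only the largest age/gender group
max_per_cause = toy.groupby('Cause')['Population'].max()

# Better: one population value per age/gender group, then add the groups
total_population = toy.groupby(['Age', 'Gender'])['Population'].first().sum()

# Deaths for one cause over the shared denominator, per 100,000
deaths_a = toy.loc[toy['Cause'] == '#A', 'Deaths'].sum()
rate_a = deaths_a / total_population * 100_000

print(naive_population, max_per_cause['#A'], total_population, rate_a)
```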

Let's see that there are 134:

In [11]:
df['ICD-10 113 Cause List Code'].unique()

array(['GR113-001', 'GR113-002', 'GR113-003', 'GR113-004', 'GR113-005',
       'GR113-006', 'GR113-007', 'GR113-009', 'GR113-010', 'GR113-011',
       'GR113-013', 'GR113-015', 'GR113-016', 'GR113-017', 'GR113-018',
       'GR113-019', 'GR113-020', 'GR113-021', 'GR113-022', 'GR113-023',
       'GR113-024', 'GR113-025', 'GR113-026', 'GR113-027', 'GR113-028',
       'GR113-029', 'GR113-030', 'GR113-031', 'GR113-032', 'GR113-033',
       'GR113-034', 'GR113-035', 'GR113-036', 'GR113-037', 'GR113-038',
       'GR113-039', 'GR113-040', 'GR113-041', 'GR113-042', 'GR113-043',
       'GR113-044', 'GR113-045', 'GR113-046', 'GR113-047', 'GR113-048',
       'GR113-049', 'GR113-050', 'GR113-051', 'GR113-052', 'GR113-053',
       'GR113-054', 'GR113-055', 'GR113-056', 'GR113-057', 'GR113-058',
       'GR113-059', 'GR113-060', 'GR113-061', 'GR113-062', 'GR113-063',
       'GR113-064', 'GR113-065', 'GR113-066', 'GR113-067', 'GR113-068',
       'GR113-069', 'GR113-070', 'GR113-071', 'GR113-072', 'GR11

In [12]:
len(df['ICD-10 113 Cause List Code'].unique())

134

In [13]:
df.loc[df['ICD-10 113 Cause List Code'] == 'GR113-019']['ICD-10 113 Cause List'].unique()

array(['#Malignant neoplasms (C00-C97)'], dtype=object)

In [14]:
df['ICD-10 113 Cause List'].sort_values().unique()

array(['#Accidents (unintentional injuries) (V01-X59,Y85-Y86)',
       '#Acute bronchitis and bronchiolitis (J20-J21)',
       '#Alzheimer disease (G30)', '#Anemias (D50-D64)',
       '#Aortic aneurysm and dissection (I71)',
       '#Arthropod-borne viral encephalitis (A83-A84,A85.2)',
       '#Assault (homicide) (*U01-*U02,X85-Y09,Y87.1)',
       '#Atherosclerosis (I70)', '#COVID-19 (U07.1)',
       '#Cerebrovascular diseases (I60-I69)',
       '#Certain conditions originating in the perinatal period (P00-P96)',
       '#Cholelithiasis and other disorders of gallbladder (K80-K82)',
       '#Chronic liver disease and cirrhosis (K70,K73-K74)',
       '#Chronic lower respiratory diseases (J40-J47)',
       '#Complications of medical and surgical care (Y40-Y84,Y88)',
       '#Congenital malformations, deformations and chromosomal abnormalities (Q00-Q99)',
       '#Diabetes mellitus (E10-E14)', '#Diseases of appendix (K35-K38)',
       '#Diseases of heart (I00-I09,I11,I13,I20-I51)',
      

This pulls the top-level top 15.
Note that the total population for these groupings is computed separately:

In [15]:
# Create mapping dictionary for the 15 leading causes
# Using the ICD-10 codes from the top 15 list
cause_mapping = {
    'GR113-054': '#Diseases of heart (I00-I09,I11,I13,I20-I51)',
    'GR113-019': '#Malignant neoplasms (C00-C97)',
    'GR113-070': '#Cerebrovascular diseases (I60-I69)',
    'GR113-082': '#Chronic lower respiratory diseases (J40-J47)',
    'GR113-112': '#Accidents (unintentional injuries) (V01-X59,Y85-Y86)',
    'GR113-052': '#Alzheimer disease (G30)',
    'GR113-046': '#Diabetes mellitus (E10-E14)',
    'GR113-076': '#Influenza and pneumonia (J09-J18)',
    'GR113-097': '#Nephritis, nephrotic syndrome and nephrosis (N00-N07,N17-N19,N25-N27)',
    'GR113-124': '#Intentional self-harm (suicide) (*U03,X60-X84,Y87.0)',
    'GR113-010': '#Septicemia (A40-A41)',
    'GR113-093': '#Chronic liver disease and cirrhosis (K70,K73-K74)',
    'GR113-069': '#Essential hypertension and hypertensive renal disease (I10,I12,I15)',
    'GR113-051': '#Parkinson disease (G20-G21)',
    'GR113-127': '#Assault (homicide) (*U01-*U02,X85-Y09,Y87.1)'
}

# Get total population (should only count each demographic group once)
total_population = df.groupby(['Five-Year Age Groups', 'Gender'])['Population'].mean().sum()

# Group by cause code and sum the deaths
grouped = df.groupby('ICD-10 113 Cause List Code').agg({
    'Deaths': 'sum'
}).reset_index()

# Add total population to each row
grouped['Population'] = total_population

print(f"\nTotal population calculated from data: {total_population:,}")

# Map to the 15 leading causes
grouped['Leading_Cause'] = grouped['ICD-10 113 Cause List Code'].map(cause_mapping)
grouped.head()


Total population calculated from data: 6,626,842,756.0


Unnamed: 0,ICD-10 113 Cause List Code,Deaths,Population,Leading_Cause
0,GR113-001,775.0,6626843000.0,
1,GR113-002,13.0,6626843000.0,
2,GR113-003,153621.0,6626843000.0,
3,GR113-004,13403.0,6626843000.0,
4,GR113-005,9994.0,6626843000.0,


## Pull the top 15 from the 113 causes

In [16]:
# Filter for only the 15 leading causes and calculate the rate
result = grouped[grouped['Leading_Cause'].notna()].copy()
result['Crude_Rate'] = (result['Deaths'] / result['Population']) * 100000

# Sort by deaths to match the original order
result = result.sort_values('Deaths', ascending=False)

print("\nResults for 15 leading causes:")
print(result[['Leading_Cause', 'Deaths', 'Population', 'Crude_Rate']])

# Calculate sum of deaths in original 113 causes for comparison
total_deaths_113 = df['Deaths'].sum()
total_deaths_15 = result['Deaths'].sum()

print(f"\nTotal deaths in 113 causes: {total_deaths_113:,}")
print(f"Total deaths in 15 leading causes: {total_deaths_15:,}")
print(f"Percentage captured by 15 leading causes: {(total_deaths_15/total_deaths_113)*100:.1f}%")

# For cancer specifically, let's check all the malignant neoplasm categories
cancer_deaths = df[df['ICD-10 113 Cause List'].str.contains('Malignant neoplasm', na=False)]['Deaths'].sum()
cancer_top15 = result[result['Leading_Cause'].str.contains('Malignant neoplasm', na=False)]['Deaths'].sum()

print(f"\nTotal cancer deaths from detailed categories: {cancer_deaths:,}")
print(f"Cancer deaths in top 15 category: {cancer_top15:,}")
print(f"Difference: {cancer_deaths - cancer_top15:,}")


Results for 15 leading causes:
                                         Leading_Cause      Deaths  \
50        #Diseases of heart (I00-I09,I11,I13,I20-I51)  14233208.0   
15                      #Malignant neoplasms (C00-C97)  12644666.0   
66                 #Cerebrovascular diseases (I60-I69)   3184522.0   
78       #Chronic lower respiratory diseases (J40-J47)   3063969.0   
108  #Accidents (unintentional injuries) (V01-X59,Y...   2887953.0   
48                            #Alzheimer disease (G30)   1872534.0   
42                        #Diabetes mellitus (E10-E14)   1674688.0   
72                  #Influenza and pneumonia (J09-J18)   1257042.0   
93   #Nephritis, nephrotic syndrome and nephrosis (...   1014083.0   
120  #Intentional self-harm (suicide) (*U03,X60-X84...    838696.0   
8                                #Septicemia (A40-A41)    795184.0   
89   #Chronic liver disease and cirrhosis (K70,K73-...    742897.0   
65   #Essential hypertension and hypertensive renal...    
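
As a quick arithmetic check on the crude rate formula (deaths divided by population, times 100,000), the sketch below plugs in the heart disease death count and the total population printed above. The variable names are illustrative.

```python
# Values copied from the notebook output above
heart_deaths = 14_233_208
total_population = 6_626_842_756

# Crude rate = deaths per 100,000 population
crude_rate = heart_deaths / total_population * 100_000
print(f"{crude_rate:.1f} per 100,000")
```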

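## See that our processing lines up with the top 15 file from CDC

The death counts above closely match these numbers (for example, 14,233,208 vs. 14,234,024 for diseases of heart); the small gap is probably the unknown-age rows dropped earlier.
The CDC population total (6,746,356,647) is larger than the one computed above (6,626,842,756), so crude rates computed here come out somewhat higher.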

In [17]:
!cat cod-top15.txt

"Notes"	"15 Leading Causes of Death"	"15 Leading Causes of Death Code"	Deaths	Population	Crude Rate	Crude Rate Lower 95% Confidence Interval	Crude Rate Upper 95% Confidence Interval	Crude Rate Standard Error
	"#Diseases of heart (I00-I09,I11,I13,I20-I51)"	"GR113-054"	14234024	6746356647	211.0	210.9	211.1	0.1
	"#Malignant neoplasms (C00-C97)"	"GR113-019"	12644869	6746356647	187.4	187.3	187.5	0.1
	"#Cerebrovascular diseases (I60-I69)"	"GR113-070"	3184602	6746356647	47.2	47.2	47.3	0.0
	"#Chronic lower respiratory diseases (J40-J47)"	"GR113-082"	3064049	6746356647	45.4	45.4	45.5	0.0
	"#Accidents (unintentional injuries) (V01-X59,Y85-Y86)"	"GR113-112"	2888942	6746356647	42.8	42.8	42.9	0.0
	"#Alzheimer disease (G30)"	"GR113-052"	1872576	6746356647	27.8	27.7	27.8	0.0
	"#Diabetes mellitus (E10-E14)"	"GR113-046"	1674724	6746356647	24.8	24.8	24.9	0.0
	"#Influenza and pneumonia (J09-J18)"	"GR113-076"	1257088	6746356647	18.6	18.6	18.7	0.0
	"#Nephritis, nephrotic syndrome and nephrosis (N00-N07,N17

### Look at some of the subcategory counts

In [18]:
# Check if sum of detailed categories equals the top level category
top_level_cancer = df[df['ICD-10 113 Cause List'] == '#Malignant neoplasms (C00-C97)']['Deaths'].sum()

detailed_cancers = df[
    (df['ICD-10 113 Cause List'].str.contains('Malignant neoplasm', na=False)) & 
    (df['ICD-10 113 Cause List'] != '#Malignant neoplasms (C00-C97)')
]['Deaths'].sum()

print(f"Top level cancer deaths: {top_level_cancer:,}")
print(f"Sum of detailed cancer deaths: {detailed_cancers:,}")
print(f"Ratio: {detailed_cancers/top_level_cancer:.2f}")

# List all cancer categories with their totals
cancer_rows = df[df['ICD-10 113 Cause List'].str.contains('Malignant neoplasm', na=False)]
print("\nAll cancer categories with death counts:")
for cause, group in cancer_rows.groupby('ICD-10 113 Cause List'):
    deaths = group['Deaths'].sum()
    print(f"{cause}: {deaths:,}")

Top level cancer deaths: 12,644,666.0
Sum of detailed cancer deaths: 10,993,008.0
Ratio: 0.87

All cancer categories with death counts:
#Malignant neoplasms (C00-C97): 12,644,666.0
Malignant neoplasm of bladder (C67): 319,250.0
Malignant neoplasm of breast (C50): 918,861.0
Malignant neoplasm of cervix uteri (C53): 89,620.0
Malignant neoplasm of esophagus (C15): 309,902.0
Malignant neoplasm of larynx (C32): 82,574.0
Malignant neoplasm of ovary (C56): 313,963.0
Malignant neoplasm of pancreas (C25): 810,577.0
Malignant neoplasm of prostate (C61): 650,767.0
Malignant neoplasm of stomach (C16): 254,086.0
Malignant neoplasms of colon, rectum and anus (C18-C21): 1,184,418.0
Malignant neoplasms of corpus uteri and uterus, part unspecified (C54-C55): 189,518.0
Malignant neoplasms of kidney and renal pelvis (C64-C65): 288,147.0
Malignant neoplasms of lip, oral cavity and pharynx (C00-C14): 191,340.0
Malignant neoplasms of liver and intrahepatic bile ducts (C22): 446,951.0
Malignant neoplasms of 

## Cause of death by age (all rankable causes)

In [19]:
df_rankable = df[df['ICD-10 113 Cause List'].str.startswith('#')]
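
Filtering on the `#` prefix matters because the cause list mixes rankable top-level causes with their subcategories. A small sketch on made-up counts (the `toy` frame is hypothetical) shows how summing every row double counts, while keeping only the `#` rows does not. It assumes, as the notebook does, that the `#` rows do not overlap each other.

```python
import pandas as pd

toy = pd.DataFrame({
    'Cause': [
        '#Malignant neoplasms (C00-C97)',
        'Malignant neoplasm of breast (C50)',
        'Malignant neoplasm of pancreas (C25)',
        '#Diabetes mellitus (E10-E14)',
    ],
    'Deaths': [100, 30, 20, 40],
})

# Summing all rows counts the subcategory deaths a second time
all_rows_total = toy['Deaths'].sum()

# Rankable causes are marked with a leading '#'
rankable = toy[toy['Cause'].str.startswith('#')]
rankable_total = rankable['Deaths'].sum()

print(all_rows_total, rankable_total)
```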

In [20]:
# Sort key for age codes: (first number, 1 if the code is a range else 0)
# The 0/1 tiebreak keeps '1' (the "< 1 year" code) ahead of '1-4'
def extract_first_age(age_str):
    age_str = str(age_str)
    if '-' in age_str:
        return int(age_str.split('-')[0]), 1
    return int(age_str.rstrip('+')), 0

# Sort age groups and create ordered list
age_order = sorted(df_rankable['Five-Year Age Groups Code'].unique(), key=extract_first_age)

In [21]:
plot_df = df_rankable.groupby(
    ['Five-Year Age Groups Code', 'Gender', 'ICD-10 113 Cause List']
)['Deaths'].sum().reset_index()

alt.Chart(plot_df).mark_bar(tooltip=True).encode(
    x=alt.X('Five-Year Age Groups Code', sort=age_order),
    y='Deaths',
    color=alt.Color('ICD-10 113 Cause List'),
    row='Gender'
).properties(
    width=800
)

### Relative grouped COD by age 

In [22]:
alt.Chart(plot_df).mark_bar(tooltip=True).encode(
    x=alt.X('Five-Year Age Groups Code', sort=age_order),
    y=alt.Y('Deaths').stack("normalize"),
    color=alt.Color('ICD-10 113 Cause List'),
    row='Gender'
).properties(
    width=800
)
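
`stack("normalize")` rescales each bar so the causes sum to 1 within an age group and gender. The same shares can be computed directly in pandas, sketched here on made-up numbers (the `toy` frame and the `Share` column are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    'Age': ['40-44', '40-44', '45-49', '45-49'],
    'Gender': ['Female'] * 4,
    'Cause': ['#A', '#B', '#A', '#B'],
    'Deaths': [10, 30, 25, 75],
})

# Total deaths per age/gender group, aligned back to every row
group_total = toy.groupby(['Age', 'Gender'])['Deaths'].transform('sum')

# Each cause's share of its group's deaths
toy['Share'] = toy['Deaths'] / group_total
print(toy)
```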

The total population should be the same for each age & gender group:

In [23]:
single_age_gender = df_rankable[(df_rankable['Five-Year Age Groups Code'] == '35-39') & (df_rankable['Gender Code'] == 'M')]
single_age_gender.shape

(41, 13)

In [24]:
len(single_age_gender['Population'].unique()) == 1

True

Check all groups:

In [25]:
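for age in df_rankable['Five-Year Age Groups Code'].unique():
    for gender in df_rankable['Gender Code'].unique():
        subset = df_rankable[(df_rankable['Five-Year Age Groups Code'] == age) & (df_rankable['Gender Code'] == gender)]
        # Report any group whose rows disagree on Population (no output means every group is consistent)
        if not subset.empty and len(subset['Population'].unique()) != 1:
            print(f"Inconsistent population for {age} {gender}")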
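
The same check can also be written without explicit loops. A hedged sketch on made-up data (the `toy` frame is hypothetical): count the distinct `Population` values per age and gender group and list any group that does not have exactly one.

```python
import pandas as pd

toy = pd.DataFrame({
    'Five-Year Age Groups Code': ['35-39', '35-39', '40-44', '40-44'],
    'Gender Code': ['M', 'M', 'M', 'M'],
    'Population': [500, 500, 600, 650],  # last group is deliberately inconsistent
})

# Number of distinct population values in each group (NaN counts as a value)
distinct = toy.groupby(
    ['Five-Year Age Groups Code', 'Gender Code']
)['Population'].nunique(dropna=False)

# Groups that break the "one population per group" expectation
inconsistent = distinct[distinct != 1]
print(inconsistent)
```

An empty result would mean every group carries a single population value.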

In [26]:
plot_df.head(5)

Unnamed: 0,Five-Year Age Groups Code,Gender,ICD-10 113 Cause List,Deaths
0,1,Female,"#Accidents (unintentional injuries) (V01-X59,Y...",10620.0
1,1,Female,#Acute bronchitis and bronchiolitis (J20-J21),372.0
2,1,Female,#Anemias (D50-D64),142.0
3,1,Female,"#Assault (homicide) (*U01-*U02,X85-Y09,Y87.1)",2835.0
4,1,Female,#Atherosclerosis (I70),34.0


In [27]:
all_deaths_by_age = df_rankable.groupby(['Five-Year Age Groups Code', 'Gender Code'])['Deaths'].sum().reset_index()
all_deaths_by_age = all_deaths_by_age.rename(columns={'Deaths': 'All-Cause Deaths'})
all_deaths_by_age

Unnamed: 0,Five-Year Age Groups Code,Gender Code,All-Cause Deaths
0,1,F,202636.0
1,1,M,251752.0
2,1-4,F,31839.0
3,1-4,M,42221.0
4,10-14,F,23653.0
5,10-14,M,37170.0
6,100+,F,319942.0
7,100+,M,71344.0
8,15-19,F,64421.0
9,15-19,M,171342.0


## Pull cancer rates and cancer as a percent of all-cause deaths

In [28]:
cancer_rates = df.loc[
    df['ICD-10 113 Cause List'] == '#Malignant neoplasms (C00-C97)'
].merge(
    all_deaths_by_age, 
    how='left', 
    on=['Five-Year Age Groups Code', 'Gender Code']
).assign(
    **{'Percent of Causes': lambda x: x['Deaths']/x['All-Cause Deaths']}
)
cancer_rates.loc[cancer_rates['Five-Year Age Groups Code'].isin(['50-54', '55-59'])]

Unnamed: 0,Notes,ICD-10 113 Cause List,ICD-10 113 Cause List Code,Five-Year Age Groups,Five-Year Age Groups Code,Gender,Gender Code,Deaths,Population,Crude Rate,Crude Rate Lower 95% Confidence Interval,Crude Rate Upper 95% Confidence Interval,Crude Rate Standard Error,All-Cause Deaths,Percent of Causes
22,,#Malignant neoplasms (C00-C97),GR113-019,50-54 years,50-54,Female,F,315704.0,232356654.0,135.9,135.4,136.3,0.2,759773.0,0.415524
23,,#Malignant neoplasms (C00-C97),GR113-019,50-54 years,50-54,Male,M,331906.0,223575070.0,148.5,147.9,149.0,0.3,1238811.0,0.267923
24,,#Malignant neoplasms (C00-C97),GR113-019,55-59 years,55-59,Female,F,457214.0,213742200.0,213.9,213.3,214.5,0.3,1041338.0,0.439064
25,,#Malignant neoplasms (C00-C97),GR113-019,55-59 years,55-59,Male,M,538119.0,201411886.0,267.2,266.5,267.9,0.4,1670426.0,0.322145
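
The merge above attaches each age and gender group's total deaths to the cause-specific rows so the share can be computed row by row. Here is a minimal sketch of that pattern on made-up numbers. The helper name `percent_of_causes` and the toy frames are hypothetical, and `validate='many_to_one'` is an added safeguard, not something the notebook uses.

```python
import pandas as pd

KEYS = ['Five-Year Age Groups Code', 'Gender Code']

def percent_of_causes(cause_rows, totals):
    """Attach group totals and compute the cause's share of deaths."""
    # validate raises if `totals` has more than one row per age/gender key
    merged = cause_rows.merge(totals, how='left', on=KEYS, validate='many_to_one')
    return merged.assign(
        **{'Percent of Causes': merged['Deaths'] / merged['All-Cause Deaths']}
    )

# Made-up cause-specific rows and group totals
cause_rows = pd.DataFrame({
    'Five-Year Age Groups Code': ['50-54', '50-54'],
    'Gender Code': ['F', 'M'],
    'Deaths': [40, 30],
})
totals = pd.DataFrame({
    'Five-Year Age Groups Code': ['50-54', '50-54'],
    'Gender Code': ['F', 'M'],
    'All-Cause Deaths': [100, 120],
})

shares = percent_of_causes(cause_rows, totals)
print(shares)
```

A helper like this could be applied to any single cause's rows, since only the filter on the cause name changes between them.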


In [29]:
alt.Chart(cancer_rates).mark_bar(tooltip=True).encode(
    x=alt.X('Five-Year Age Groups Code', sort=age_order),
    y='Percent of Causes',
    color=alt.Color('Deaths'),
    row='Gender'
).properties(
    width=800
)

In [30]:
cancer_rates.to_parquet('cod113-gender-5yr-cancer.parquet')

In [32]:
cvd_rates = df.loc[
    df['ICD-10 113 Cause List'] == '#Diseases of heart (I00-I09,I11,I13,I20-I51)'
].merge(
    all_deaths_by_age, 
    how='left', 
    on=['Five-Year Age Groups Code', 'Gender Code']
).assign(
    **{'Percent of Causes': lambda x: x['Deaths']/x['All-Cause Deaths']}
)
cvd_rates.loc[cvd_rates['Five-Year Age Groups Code'].isin(['50-54', '55-59'])]

Unnamed: 0,Notes,ICD-10 113 Cause List,ICD-10 113 Cause List Code,Five-Year Age Groups,Five-Year Age Groups Code,Gender,Gender Code,Deaths,Population,Crude Rate,Crude Rate Lower 95% Confidence Interval,Crude Rate Upper 95% Confidence Interval,Crude Rate Standard Error,All-Cause Deaths,Percent of Causes
22,,"#Diseases of heart (I00-I09,I11,I13,I20-I51)",GR113-054,50-54 years,50-54,Female,F,138435.0,232356654.0,59.6,59.3,59.9,0.2,759773.0,0.182206
23,,"#Diseases of heart (I00-I09,I11,I13,I20-I51)",GR113-054,50-54 years,50-54,Male,M,345730.0,223575070.0,154.6,154.1,155.2,0.3,1238811.0,0.279082
24,,"#Diseases of heart (I00-I09,I11,I13,I20-I51)",GR113-054,55-59 years,55-59,Female,F,202006.0,213742200.0,94.5,94.1,94.9,0.2,1041338.0,0.193987
25,,"#Diseases of heart (I00-I09,I11,I13,I20-I51)",GR113-054,55-59 years,55-59,Male,M,480449.0,201411886.0,238.5,237.9,239.2,0.3,1670426.0,0.287621


In [33]:
cvd_rates.to_parquet('cod113-gender-5yr-cvd.parquet')

In [34]:
alt.Chart(cvd_rates).mark_bar(tooltip=True).encode(
    x=alt.X('Five-Year Age Groups Code', sort=age_order),
    y='Percent of Causes',
    color=alt.Color('Deaths'),
    row='Gender'
).properties(
    width=800
)

In [35]:
selfharm_rates = df.loc[
    df['ICD-10 113 Cause List'] == '#Intentional self-harm (suicide) (*U03,X60-X84,Y87.0)'
].merge(
    all_deaths_by_age, 
    how='left', 
    on=['Five-Year Age Groups Code', 'Gender Code']
).assign(
    **{'Percent of Causes': lambda x: x['Deaths']/x['All-Cause Deaths']}
)
selfharm_rates.loc[selfharm_rates['Five-Year Age Groups Code'].isin(['50-54', '55-59'])]


Unnamed: 0,Notes,ICD-10 113 Cause List,ICD-10 113 Cause List Code,Five-Year Age Groups,Five-Year Age Groups Code,Gender,Gender Code,Deaths,Population,Crude Rate,Crude Rate Lower 95% Confidence Interval,Crude Rate Upper 95% Confidence Interval,Crude Rate Standard Error,All-Cause Deaths,Percent of Causes
18,,"#Intentional self-harm (suicide) (*U03,X60-X84...",GR113-124,50-54 years,50-54,Female,F,21334.0,232356654.0,9.2,9.1,9.3,0.1,759773.0,0.028079
19,,"#Intentional self-harm (suicide) (*U03,X60-X84...",GR113-124,50-54 years,50-54,Male,M,62538.0,223575070.0,28.0,27.8,28.2,0.1,1238811.0,0.050482
20,,"#Intentional self-harm (suicide) (*U03,X60-X84...",GR113-124,55-59 years,55-59,Female,F,17943.0,213742200.0,8.4,8.3,8.5,0.1,1041338.0,0.017231
21,,"#Intentional self-harm (suicide) (*U03,X60-X84...",GR113-124,55-59 years,55-59,Male,M,55734.0,201411886.0,27.7,27.4,27.9,0.1,1670426.0,0.033365


In [36]:
selfharm_rates.to_parquet('cod113-gender-5yr-selfharm.parquet')

In [37]:
alt.Chart(selfharm_rates).mark_bar(tooltip=True).encode(
    x=alt.X('Five-Year Age Groups Code', sort=age_order),
    y='Percent of Causes',
    color=alt.Color('Deaths'),
    row='Gender'
).properties(
    width=800
)