# Optimizing Mammography Allocation via Age and Race Segmentation #
### Contributors: Austin Huang, Kelvin Phung, Ethan Yu, Wenye (Tim) Zhou ###

Female breast cancer is the most common type of cancer for women in the U.S.<sup>1</sup> As such, it is imperative that we optimize the allocation of mammography facilities/resources among the U.S. states.

In this project, we optimize via age and race segmentation. That is, we use the age and racial composition of each state to determine relative access to mammography facilities. For age, the expected numbers of deaths of women from breast cancer in each age range were summed and divided by the number of facilities to produce the expected number of deaths of women per facility in each state. (The number of facilities in each state was extracted from our original dataset, which included information on certified mammography facilities in U.S. states and territories.) For race, the expected number of deaths of women from breast cancer per facility was determined for each race in each state. In both cases, a larger ratio of expected deaths to facilities in a state implies a greater need for federal funding in that state, especially for uncertified facilities; in other words, if more deaths are expected per facility, the state needs higher-quality facilities, as well as greater access to facilities in general, in order to provide adequate healthcare.

Additional datasets were utilized throughout our investigation. Citations are listed at the end of this notebook.

In [63]:
# imports
import pandas as pd
import numpy as np
import math
import warnings
warnings.filterwarnings('ignore')
import json
import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'browser'

In [9]:
# Datasets used in our investigation
df_beg = pd.read_csv("beginner.csv")
df_sc = pd.read_csv("sc-est2019-alldata6.csv") #Extended version of the second dataset provided to us; includes race compositions
df_ref = pd.read_csv("ref.csv") #Data linking statename to their abbreviations; from World Population Review
df_cancer = pd.read_csv("NewCancerAge.csv") #Data from American Cancer Society on probability of developing and dying of breast cancer (by age)
df_deathrate = pd.read_csv("race_death_rate.csv") #Data from American Cancer Society on breast cancer-specific mortality rate by race

### 1. Number of Facilities Per U.S. State (and District of Columbia) ###

To begin, we extracted the number of mammography facilities per U.S. state, including the District of Columbia. Although the provided dataset included mammography facilities in U.S. territories and jurisdictions, we decided to narrow our analysis to the U.S. states and the District of Columbia, especially since our additional datasets did not carry information about these territories and jurisdictions.

To identify the relevant state in each recorded observation, we looked for two-letter codes in the dataset and used an external dataset to drop all two-letter codes that were not U.S. state codes. Then, we tallied the number of facilities per state.
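The tallying step described above can be sketched in plain Python. This is a minimal illustration, not the notebook's exact cell: `VALID_CODES`, `rows`, and `count_facilities` are hypothetical names, and the rows are made-up stand-ins for facility records.

```python
from collections import Counter

# Hypothetical subset of valid state codes (the notebook reads these from ref.csv)
VALID_CODES = {"CA", "NY", "TX"}

# Made-up facility records; each row is a list of cell values
rows = [
    ["Clinic A", "1 Main St", None, "Los Angeles", "CA ", "90001"],
    ["Clinic B", "2 Oak Ave", None, "Fresno", "CA", "93650"],
    ["Clinic C", "3 Elm Rd", None, "Albany", "NY", "12203"],
    ["Clinic D", "4 Pine Ln", None, "San Juan", "PR", "00901"],  # territory, dropped
]

def count_facilities(rows, valid_codes):
    """Count rows per state by finding two-letter cells that are valid state codes."""
    counts = Counter()
    for row in rows:
        for cell in row:
            # Only string cells can hold a code; skip missing values and numbers
            if isinstance(cell, str):
                code = cell.strip()
                if len(code) == 2 and code in valid_codes:
                    counts[code] += 1
    return counts

facility_counts = count_facilities(rows, VALID_CODES)
print(facility_counts)
```

Scanning every cell, rather than only the `State` column, tolerates records whose fields are shifted, at the cost of counting a row twice if a valid code also appears in another field.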

In [10]:
df_beg=df_beg[["Facility Name","Address 1","Address 2","Address 3", "City","State","Zip Code","Phone","Fax"]]

#Calculate number of facilities per state
def hist(input_list):
    '''
    Tallies the number of instances of items in a list
    '''
    tally={}
    for item in input_list:
        if item not in tally:
            tally[item]=0
        tally[item]+=1
    return tally

state_list = []

for _, row in df_beg.iterrows():
    for elem in row:
        # Keep only string cells that are exactly two characters long once whitespace is stripped
        if isinstance(elem, str) and len(elem.strip()) == 2:
            state_list.append(elem.strip())
new_state_list=[]
for elem in state_list:
    if elem in df_ref["code"].values:
        new_state_list.append(elem)

num_of_facilities = hist(new_state_list)

In [11]:
# Number of mammography facilities in each state, organized in descending order
num_of_facilities_df = pd.DataFrame.from_dict(num_of_facilities, orient='index', columns=['Number of Facilities'])
num_of_facilities_df.sort_values(by='Number of Facilities', ascending=False)

State,Number of Facilities
CA,761
FL,627
TX,576
NY,550
PA,355
IL,335
OH,333
GA,277
MI,271
NC,271


### 2. Mammography Allocation via Age Segmentation ###

For our investigation of age, we segmented the 2019 Census data into 4 disjoint age groups. For each age group, we calculated the expected deaths of women in each state by multiplying the population of each age group by the death rate of each age group; these products were summed together to create an overall expected number of deaths per state.

Each state's total was then divided by its number of facilities, and the average number of expected deaths per facility across states was used to calculate each state's deviation from the mean.
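The calculation in this section can be summarized in a few formulas. The symbols are our own shorthand rather than notation from the datasets: $P_{s,a}$ is the estimated population of women in state $s$ and age group $a$, $r_a$ is the breast cancer mortality rate for age group $a$, $F_s$ is the number of facilities in state $s$, and $N$ is the number of states (including the District of Columbia).

$$E_s = \sum_{a} P_{s,a}\, r_a, \qquad R_s = \frac{E_s}{F_s}, \qquad D_s = R_s - \frac{1}{N}\sum_{s'} R_{s'}$$

Here $E_s$ is the expected number of deaths in state $s$, $R_s$ is the expected number of deaths per facility, and $D_s$ is the deviation of $R_s$ from the mean across states. A positive $D_s$ marks a state whose facilities carry an above-average burden.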

In [12]:
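# Group 2019 census data according to state ('NAME') and age
# Note: pd.cut uses right-closed bins by default, so the groups below are (0, 50], (50, 60], (60, 70], and (70, inf)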
df_sc=df_sc[["NAME","SEX","AGE","ORIGIN","RACE","POPESTIMATE2019"]]
df_sc_new = df_sc.groupby(["NAME",pd.cut(df_sc["AGE"],bins=[0,50,60,70,np.inf],labels=["<49","50-59","60-69",">70"])])["POPESTIMATE2019"].sum()
df_sc_new.to_frame()

NAME,AGE,POPESTIMATE2019
Alabama,<49,12367868
Alabama,50-59,2594968
Alabama,60-69,2327012
Alabama,>70,2095288
Alaska,<49,2009140
...,...,...
Wisconsin,>70,2478008
Wyoming,<49,1476420
Wyoming,50-59,285832
Wyoming,60-69,292564
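As a point of comparison, the sketch below shows one way to bin ages into half-open groups, so that age 0 falls in the first group and age 50 starts the second. It is an illustration under stated assumptions, not the notebook's method. The rows are hypothetical, `age_group` is a made-up helper, and only rows with `SEX == 2` and `ORIGIN == 0` are kept (the codes for women and for the combined Hispanic and non-Hispanic total, as described in Section 3), so each woman is counted once. The cell above instead sums every row and halves the result later.

```python
from bisect import bisect_right
from collections import defaultdict

AGE_EDGES = [50, 60, 70]                         # lower bounds of the 2nd, 3rd and 4th groups
AGE_LABELS = ["<49", "50-59", "60-69", ">70"]    # labels reused from the notebook

def age_group(age):
    """Return the label of the half-open group [0, 50), [50, 60), [60, 70) or [70, inf)."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]

# Hypothetical census-style rows: (state, sex, origin, age, population)
rows = [
    ("Alabama", 2, 0, 0, 100),
    ("Alabama", 2, 0, 49, 200),
    ("Alabama", 2, 0, 50, 300),
    ("Alabama", 2, 0, 70, 400),
    ("Alabama", 1, 0, 50, 999),   # men: skipped
    ("Alabama", 2, 2, 50, 50),    # Hispanic subset, already included in ORIGIN 0: skipped
]

totals = defaultdict(int)
for state, sex, origin, age, population in rows:
    if sex == 2 and origin == 0:
        totals[(state, age_group(age))] += population

for key, value in sorted(totals.items()):
    print(key, value)
```

`bisect_right` returns how many edges are less than or equal to the age, which is exactly the index of the half-open group the age belongs to.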


In [13]:
df_cancer=df_cancer.rename(columns={"Female / 0-49 years": "<49", "Female / 50-59 years": "50-59","Female / 60-69 years":"60-69","Female / 70+ years":">70"})
df_new_cancer = df_cancer.to_dict()

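# Expected number of deaths of women per state = sum over age groups of
# (breast cancer-specific mortality rate for that age group * population of women in that age group).
# The grouped population is halved as a rough approximation of the number of women.
expected_death = {}
for index, val in df_sc_new.items():  # Series.items() replaces the deprecated iteritems()
    if index[0] not in expected_death:
        expected_death[index[0]] = 0
    expected_death[index[0]] += val / 2 * float(df_new_cancer[index[1]][0])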
pd.DataFrame.from_dict(expected_death, orient='index', columns=['Expected Deaths'])

State,Expected Deaths
Alabama,1120.995856
Alaska,122.296946
Arizona,1735.888268
Arkansas,695.761294
California,7996.197284
Colorado,1122.871676
Connecticut,848.315734
Delaware,243.022828
District of Columbia,124.156374
Florida,5919.343984


In [14]:
# Calculate number of deaths per facility
# Map full state names to their two-letter codes so the two dictionaries can be joined
state_to_code = dict(zip(df_ref["state"], df_ref["code"]))
expected_deaths_per_facility = {}
for state, deaths in expected_death.items():
    code = state_to_code.get(state)
    if code in num_of_facilities:
        expected_deaths_per_facility[state] = deaths / num_of_facilities[code]
pd.DataFrame.from_dict(expected_deaths_per_facility, orient='index', columns=['ED / Facility'])

State,ED / Facility
Alabama,7.894337
Alaska,3.705968
Arizona,10.151393
Arkansas,8.382666
California,10.507487
Colorado,8.136751
Connecticut,7.376659
Delaware,8.100761
District of Columbia,7.759773
Florida,9.44074


In [15]:
# Subtract the average number of deaths per facility from each state's value for easier comparison
# (avoids shadowing the built-in sum)
avg = np.mean(list(expected_deaths_per_facility.values()))
comparison = {key: val - avg for key, val in expected_deaths_per_facility.items()}

pd.DataFrame.from_dict(comparison, orient='index', columns=['ED / F: Deviation from Mean'])

State,ED / F: Deviation from Mean
Alabama,0.082503
Alaska,-4.105866
Arizona,2.339559
Arkansas,0.570832
California,2.695652
Colorado,0.324917
Connecticut,-0.435176
Delaware,0.288927
District of Columbia,-0.052061
Florida,1.628906
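The deviation-from-mean step, and the ranking it supports, can be sketched as follows. The ratios are hypothetical placeholders, not values from the table above, and the variable names are our own.

```python
from statistics import mean

# Hypothetical expected-deaths-per-facility ratios (not the notebook's data)
ratios = {"State A": 10.0, "State B": 7.0, "State C": 4.0}

# Deviation of each state's ratio from the mean across states
avg = mean(ratios.values())
deviation = {state: ratio - avg for state, ratio in ratios.items()}

# States sorted from most above-average burden to most below-average
ranked = sorted(deviation.items(), key=lambda item: item[1], reverse=True)
for state, dev in ranked:
    print(f"{state}: {dev:+.2f}")
```

States at the top of `ranked` have the largest positive deviation, which is the sense in which a state is described as most in need of funding.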


### 3. Mammography Allocation via Race Segmentation ###

Using additional datasets from the Census and the American Cancer Society that contained information about racial compositions and breast cancer-specific mortality rates per race, respectively, we calculated the expected number of deaths of women per race in each state. Then, just as we did in our analysis of age, we calculated the expected number of deaths per number of facilities in each state (for every race). 

We paid special attention to marginalized groups, namely Black women and Hispanic women. These demographic groups are more likely to report barriers to mammography (and adequate healthcare in general), and they are more prone to late detection and lower localized disease rates.<sup>2</sup> As such, access to mammography facilities, and high-quality ones at that, is of special importance and urgency to our investigation. By looking at the expected number of deaths per facility for Black and Hispanic women in each state, we can identify which states need more equitable access to mammography.

In [16]:
# Group 2019 census data according to state ('name'), sex (women specifically), origin, and race
women_in_census = df_sc.loc[(df_sc['SEX'] == 2), :] # Sex 2 represents women
census_women_2019_df = women_in_census.groupby(["NAME","SEX","ORIGIN", "RACE"]).sum()
census_women_2019_df = census_women_2019_df[['POPESTIMATE2019']]
census_women_2019_df

# Origin 0 represents total number of non-Hispanic and Hispanic; Origin 1 represents non-Hispanic; Origin 2 represents Hispanic
# Race 1 represents White, Race 2 represents Black, Race 3 represents American Indian / Alaskan Native,
# Race 4 represents Asian American, Race 5 represents Pacific Islander and Native Hawaiian, Race 6 represents Two or more races (excluded from our analysis)

NAME,SEX,ORIGIN,RACE,POPESTIMATE2019
Alabama,2,0,1,1726955
Alabama,2,0,2,703474
Alabama,2,0,3,17244
Alabama,2,0,4,39304
Alabama,2,0,5,2451
...,...,...,...,...
Wyoming,2,2,2,465
Wyoming,2,2,3,1650
Wyoming,2,2,4,207
Wyoming,2,2,5,84


In [17]:
# Pivot table with state, sex, and origin as indices and race as columns
pvt = census_women_2019_df.pivot_table(index=['NAME', 'SEX', 'ORIGIN'], columns=['RACE'], values='POPESTIMATE2019', aggfunc='sum')
pvt

NAME,SEX,ORIGIN,RACE 1,RACE 2,RACE 3,RACE 4,RACE 5,RACE 6
Alabama,2,0,1726955,703474,17244,39304,2451,44240
Alabama,2,1,1637819,695866,14078,38512,1323,40393
Alabama,2,2,89136,7608,3166,792,1128,3847
Alaska,2,0,224537,11845,56305,25272,5187,26981
Alaska,2,1,206911,10468,53442,24719,4934,24547
...,...,...,...,...,...,...,...,...
Wisconsin,2,1,2368936,189219,26262,88428,1094,50706
Wisconsin,2,2,173740,9020,7904,1527,522,7867
Wyoming,2,0,262839,3059,7812,3606,275,6438
Wyoming,2,1,238585,2594,6162,3399,191,5221


In [18]:
# Approximate the number of Hispanic women in each state as Hispanic White + Hispanic Black
white_and_black = pvt[[1, 2]].reset_index()
hispanic_w_and_b = white_and_black[white_and_black['ORIGIN'] == 2].copy()  # copy to avoid SettingWithCopyWarning
hispanic_w_and_b['Hispanic'] = hispanic_w_and_b[1] + hispanic_w_and_b[2]
hispanic_w_and_b

RACE,NAME,SEX,ORIGIN,1,2,Hispanic
2,Alabama,2,2,89136,7608,96744
5,Alaska,2,2,17626,1377,19003
8,Arizona,2,2,1031478,25647,1057125
11,Arkansas,2,2,101097,3756,104853
14,California,2,2,6957387,168660,7126047
17,Colorado,2,2,544333,15376,559709
20,Connecticut,2,2,246601,34367,280968
23,Delaware,2,2,35878,5739,41617
26,District of Columbia,2,2,29483,5770,35253
29,Florida,2,2,2616435,149951,2766386


In [64]:
# Calculate (total) number of Asian American and Pacific Islander/Native Hawaiian women in each state
aapi = pvt[[4, 5]].reset_index()
his_and_nonhis_aapi = aapi[aapi['ORIGIN'] == 0].copy()  # copy so the new column is added to an independent frame
his_and_nonhis_aapi['AAPI'] = his_and_nonhis_aapi[4] + his_and_nonhis_aapi[5]
his_and_nonhis_aapi

RACE,NAME,SEX,ORIGIN,4,5,AAPI
0,Alabama,2,0,39304,2451,41755
3,Alaska,2,0,25272,5187,30459
6,Arizona,2,0,142511,9430,151941
9,Arkansas,2,0,25866,5519,31385
12,California,2,0,3208061,100349,3308410
15,Colorado,2,0,108856,5532,114388
18,Connecticut,2,0,91037,1822,92859
21,Delaware,2,0,20222,517,20739
24,District of Columbia,2,0,18172,462,18634
27,Florida,2,0,341566,12157,353723


In [65]:
# Calculate the number of non-Hispanic White, Black, and American Indian/Alaskan Native women in each state
w_b_ai = pvt[[1, 2, 3]].reset_index()
# Rename on a new frame instead of modifying a slice in place
nonhis_w_b_ai = w_b_ai[w_b_ai['ORIGIN'] == 1].rename(columns={1: 'Non-Hispanic White', 2: 'Non-Hispanic Black', 3: 'Non-Hispanic American Indian/Alaskan Native'})
nonhis_w_b_ai

RACE,NAME,SEX,ORIGIN,Non-Hispanic White,Non-Hispanic Black,Non-Hispanic American Indian/Alaskan Native
1,Alabama,2,1,1637819,695866,14078
4,Alaska,2,1,206911,10468,53442
7,Arizona,2,1,1991632,157367,147653
10,Arkansas,2,1,1106548,244397,11826
13,California,2,1,7206004,1115162,82116
16,Colorado,2,1,1939164,108049,18478
19,Connecticut,2,1,1203981,193283,3816
22,Delaware,2,1,310127,113461,1556
25,District of Columbia,2,1,132938,171363,717
28,Florida,2,1,5831619,1733516,27058


In [21]:
# Merge previous dataframes to display the number of women of each race for each state
# The merges below join on the 'NAME' column, so no index needs to be set here
hispanic_women = hispanic_w_and_b[['NAME', 'Hispanic']]
aapi_women = his_and_nonhis_aapi[['NAME', 'AAPI']]
nonhis = nonhis_w_b_ai[['NAME', 'Non-Hispanic White', 'Non-Hispanic Black', 'Non-Hispanic American Indian/Alaskan Native']]

hisp_and_aapi = hispanic_women.merge(aapi_women, how='inner', on='NAME')
all_races = hisp_and_aapi.merge(nonhis, how='inner', on='NAME')
all_races.set_index('NAME', inplace=True)
all_races

NAME,Hispanic,AAPI,Non-Hispanic White,Non-Hispanic Black,Non-Hispanic American Indian/Alaskan Native
Alabama,96744,41755,1637819,695866,14078
Alaska,19003,30459,206911,10468,53442
Arizona,1057125,151941,1991632,157367,147653
Arkansas,104853,31385,1106548,244397,11826
California,7126047,3308410,7206004,1115162,82116
Colorado,559709,114388,1939164,108049,18478
Connecticut,280968,92859,1203981,193283,3816
Delaware,41617,20739,310127,113461,1556
District of Columbia,35253,18634,132938,171363,717
Florida,2766386,353723,5831619,1733516,27058


In [22]:
# Breast cancer-specific mortality rate per race (per 100,000 people)
deathrate_female = df_deathrate[df_deathrate['Cancer Type'] == 'Breast'].filter(regex='Female')
deathrate_female

,All races & ethnicities combined / Female,American Indian and Alaska Native / Female,Asian and Pacific Islander / Female,Hispanic / Female,Non-Hispanic black / Female,Non-Hispanic white / Female
2,19.6,20.5,11.7,13.7,27.6,19.7


In [23]:
# Expected number of deaths of women from breast cancer in each state, organized by race
hispanic_ED = all_races['Hispanic'] * deathrate_female['Hispanic / Female'].item() / 100000
aapi_ED = all_races['AAPI'] * deathrate_female['Asian and Pacific Islander / Female'].item() / 100000
non_hispanic_white_ED = all_races['Non-Hispanic White'] * deathrate_female['Non-Hispanic white / Female'].item() / 100000
non_hispanic_black_ED = all_races['Non-Hispanic Black'] * deathrate_female['Non-Hispanic black / Female'].item() / 100000
aian_ED = all_races['Non-Hispanic American Indian/Alaskan Native'] * deathrate_female['American Indian and Alaska Native / Female'].item() / 100000

expected_deaths_columns = {'Hispanic ED' : hispanic_ED, 'AAPI ED' : aapi_ED, 'Non-Hispanic White ED' : non_hispanic_white_ED,
                           'Non-Hispanic Black ED' : non_hispanic_black_ED, 'Non-Hispanic American Indian/Alaskan Native ED' : aian_ED}
expected_deaths_df = pd.DataFrame(expected_deaths_columns)
expected_deaths_df

NAME,Hispanic ED,AAPI ED,Non-Hispanic White ED,Non-Hispanic Black ED,Non-Hispanic American Indian/Alaskan Native ED
Alabama,13.253928,4.885335,322.650343,192.059016,2.88599
Alaska,2.603411,3.563703,40.761467,2.889168,10.95561
Arizona,144.826125,17.777097,392.351504,43.433292,30.268865
Arkansas,14.364861,3.672045,217.989956,67.453572,2.42433
California,976.268439,387.08397,1419.582788,307.784712,16.83378
Colorado,76.680133,13.383396,382.015308,29.821524,3.78799
Connecticut,38.492616,10.864503,237.184257,53.346108,0.78228
Delaware,5.701529,2.426463,61.095019,31.315236,0.31898
District of Columbia,4.829661,2.180178,26.188786,47.296188,0.146985
Florida,378.994882,41.385591,1148.828943,478.450416,5.54689


In [24]:
expected_deaths_df['code'] = df_ref['code'].values
expected_deaths_df.set_index('code', inplace=True) # Changing indices to two-letter acronyms to facilitate merging

# Calculate expected number of deaths per facility for each race, organized by state
deaths_per_facilities = expected_deaths_df.merge(num_of_facilities_df, how='inner', left_on='code', right_index=True)
deaths_per_facilities['Hispanic ED/F'] = deaths_per_facilities['Hispanic ED'] / deaths_per_facilities['Number of Facilities']
deaths_per_facilities['AAPI ED/F'] = deaths_per_facilities['AAPI ED'] / deaths_per_facilities['Number of Facilities']
deaths_per_facilities['NHW ED/F'] = deaths_per_facilities['Non-Hispanic White ED'] / deaths_per_facilities['Number of Facilities']
deaths_per_facilities['NHB ED/F'] = deaths_per_facilities['Non-Hispanic Black ED'] / deaths_per_facilities['Number of Facilities']
deaths_per_facilities['AI/AN ED/F'] = deaths_per_facilities['Non-Hispanic American Indian/Alaskan Native ED'] / deaths_per_facilities['Number of Facilities']
deaths_per_facilities

code,Hispanic ED,AAPI ED,Non-Hispanic White ED,Non-Hispanic Black ED,Non-Hispanic American Indian/Alaskan Native ED,Number of Facilities,Hispanic ED/F,AAPI ED/F,NHW ED/F,NHB ED/F,AI/AN ED/F
AL,13.253928,4.885335,322.650343,192.059016,2.88599,142,0.093338,0.034404,2.272186,1.352528,0.020324
AK,2.603411,3.563703,40.761467,2.889168,10.95561,33,0.078891,0.107991,1.235196,0.087551,0.331988
AZ,144.826125,17.777097,392.351504,43.433292,30.268865,171,0.846936,0.10396,2.294453,0.253996,0.177011
AR,14.364861,3.672045,217.989956,67.453572,2.42433,83,0.173071,0.044242,2.626385,0.812694,0.029209
CA,976.268439,387.08397,1419.582788,307.784712,16.83378,761,1.282876,0.508652,1.865418,0.404448,0.022121
CO,76.680133,13.383396,382.015308,29.821524,3.78799,138,0.555653,0.096981,2.768227,0.216098,0.027449
CT,38.492616,10.864503,237.184257,53.346108,0.78228,115,0.334718,0.094474,2.062472,0.463879,0.006802
DE,5.701529,2.426463,61.095019,31.315236,0.31898,30,0.190051,0.080882,2.036501,1.043841,0.010633
DC,4.829661,2.180178,26.188786,47.296188,0.146985,16,0.301854,0.136261,1.636799,2.956012,0.009187
FL,378.994882,41.385591,1148.828943,478.450416,5.54689,627,0.604458,0.066006,1.832263,0.763079,0.008847
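As a worked check of the two steps above, the sketch below recomputes Alabama's row by hand. The populations, mortality rates, and facility count are copied from the tables in this notebook; the dictionary and variable names are our own.

```python
# Alabama populations of women by group, copied from the merged table above
populations = {
    "Hispanic": 96744,
    "AAPI": 41755,
    "Non-Hispanic White": 1637819,
    "Non-Hispanic Black": 695866,
    "AI/AN": 14078,
}

# Breast cancer mortality rates per 100,000 women, copied from the rate table above
rates_per_100k = {
    "Hispanic": 13.7,
    "AAPI": 11.7,
    "Non-Hispanic White": 19.7,
    "Non-Hispanic Black": 27.6,
    "AI/AN": 20.5,
}

facilities = 142  # number of facilities counted for Alabama

# Step 1: expected deaths = population * rate / 100,000
expected_deaths = {
    group: populations[group] * rates_per_100k[group] / 100_000
    for group in populations
}

# Step 2: expected deaths per facility
deaths_per_facility = {
    group: deaths / facilities for group, deaths in expected_deaths.items()
}

for group in populations:
    print(f"{group}: ED = {expected_deaths[group]:.6f}, ED/F = {deaths_per_facility[group]:.6f}")
```

The printed values agree with the `AL` row of the table above to six decimal places, for example 192.059016 expected deaths and 1.352528 expected deaths per facility for non-Hispanic Black women.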


In [25]:
deaths_per_facilities['State'] = df_ref['state'].values
deaths_per_facilities.set_index('State', inplace=True) # Re-changing indices to full state names for readability

# Truncated version of previous table; includes the expected number of deaths per facility for each race, organized by state
# reset_index() returns a new frame, so later column assignments do not act on a slice
dpf_trunc = deaths_per_facilities[['Hispanic ED/F', 'AAPI ED/F', 'NHW ED/F', 'NHB ED/F', 'AI/AN ED/F']].reset_index()
dpf_trunc

,State,Hispanic ED/F,AAPI ED/F,NHW ED/F,NHB ED/F,AI/AN ED/F
0,Alabama,0.093338,0.034404,2.272186,1.352528,0.020324
1,Alaska,0.078891,0.107991,1.235196,0.087551,0.331988
2,Arizona,0.846936,0.10396,2.294453,0.253996,0.177011
3,Arkansas,0.173071,0.044242,2.626385,0.812694,0.029209
4,California,1.282876,0.508652,1.865418,0.404448,0.022121
5,Colorado,0.555653,0.096981,2.768227,0.216098,0.027449
6,Connecticut,0.334718,0.094474,2.062472,0.463879,0.006802
7,Delaware,0.190051,0.080882,2.036501,1.043841,0.010633
8,District of Columbia,0.301854,0.136261,1.636799,2.956012,0.009187
9,Florida,0.604458,0.066006,1.832263,0.763079,0.008847


### 4. Data Visualization ###

We created geographical maps to visualize our findings and more easily derive actionable conclusions. In all of these visualizations, lighter (more yellow) states have a higher number of expected deaths per facility, while darker (more blue) states have a lower number. By extension, states in lighter colors are in greater need of federal funding (to democratize access to higher-quality mammography healthcare and lessen the burden of expected deaths on each facility).

In [26]:
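# Load the GeoJSON of state boundaries; the context manager closes the file afterwards
with open("gz_2010_us_040_00_500k.json", 'r') as f:
    us_states = json.load(f)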

In [27]:
state1 = []
deaths = []
for key in expected_deaths_per_facility:
    state1.append(key)
    deaths.append(expected_deaths_per_facility[key])

death_dataframe = {
    'State' : state1,
    'deaths per clinic' : deaths
}
df_death = pd.DataFrame(death_dataframe)

In [42]:
state_id_map = {}
for feature in us_states['features']:
    feature['id'] = feature['properties']['STATE']
    state_id_map[feature['properties']['NAME']]= feature['id']

df_death['id'] = df_death['State'].apply(lambda x: state_id_map[x])

In [47]:
# Expected number of deaths per facility by state via age segmentation
fig = px.choropleth(df_death, locations='id', geojson=us_states, color='deaths per clinic', scope='usa',
                    title='Expected number of deaths of women from breast cancer per certified mammography facility by state via age segmentation')
fig.show()

In [66]:
state_id_map_2 = {}
for feature in us_states['features']:
    feature['id'] = feature['properties']['STATE']
    state_id_map_2[feature['properties']['NAME']]= feature['id']

dpf_trunc['id'] = dpf_trunc['State'].apply(lambda x: state_id_map_2[x])
dpf_trunc

,State,Hispanic ED/F,AAPI ED/F,NHW ED/F,NHB ED/F,AI/AN ED/F,id
0,Alabama,0.093338,0.034404,2.272186,1.352528,0.020324,1
1,Alaska,0.078891,0.107991,1.235196,0.087551,0.331988,2
2,Arizona,0.846936,0.10396,2.294453,0.253996,0.177011,4
3,Arkansas,0.173071,0.044242,2.626385,0.812694,0.029209,5
4,California,1.282876,0.508652,1.865418,0.404448,0.022121,6
5,Colorado,0.555653,0.096981,2.768227,0.216098,0.027449,8
6,Connecticut,0.334718,0.094474,2.062472,0.463879,0.006802,9
7,Delaware,0.190051,0.080882,2.036501,1.043841,0.010633,10
8,District of Columbia,0.301854,0.136261,1.636799,2.956012,0.009187,11
9,Florida,0.604458,0.066006,1.832263,0.763079,0.008847,12


In [56]:
# Define function for visualizing expected number of deaths for each race per facility, organized by state
# Assume use of dpf_trunc dataframe
def edf_map(race, race_title):
    '''
    Return visualization of expected number of deaths for a given race per facility, organized by state
    '''
    fig = px.choropleth(dpf_trunc, locations ='id',geojson = us_states,color=race,scope='usa',
                    title = 'Expected number of deaths of ' + race_title + ' women from breast cancer per certified mammography facility, by state')
    return fig

In [62]:
# Expected number of deaths of Hispanic women from breast cancer per certified mammography facility, by state
fig1 = edf_map('Hispanic ED/F', 'Hispanic')
fig1.show()

In [58]:
# Expected number of deaths of AAPI and Native Hawaiian women from breast cancer per certified mammography facility, by state
fig2 = edf_map('AAPI ED/F', 'AAPI and Native Hawaiian')
fig2.show()

In [59]:
# Expected number of deaths of non-Hispanic White women from breast cancer per certified mammography facility, by state
fig3 = edf_map('NHW ED/F', 'non-Hispanic White')
fig3.show()

In [60]:
# Expected number of deaths of non-Hispanic Black women from breast cancer per certified mammography facility, by state
fig4 = edf_map('NHB ED/F', 'non-Hispanic Black')
fig4.show()

In [61]:
# Expected number of deaths of American Indian and Alaskan Native women from breast cancer per certified mammography facility, by state
fig5 = edf_map('AI/AN ED/F', 'American Indian and Alaskan Native')
fig5.show()

### 5. Conclusions ###

According to our analysis of age, **South Carolina, New Mexico, California, Washington, and Vermont** are most in need of federal funding from the FDA. These states had the highest expected number of deaths of women from breast cancer per facility.

Our analysis of race yielded varying results according to the particular race under investigation. However, we have concluded that marginalized populations of women (namely Black and Hispanic women) are most in need of aid in **Maryland, New Mexico, Texas, and California**.

### References ###

1. U.S. Cancer Statistics Working Group. U.S. Cancer Statistics Data Visualizations Tool, based on 2021 submission data (1999-2019): U.S. Department of Health and Human Services, Centers for Disease Control and Prevention and National Cancer Institute; https://www.cdc.gov/cancer/dataviz, released in November 2022.
2. Hirko, K.A., Rocque, G., Reasor, E. et al. The impact of race and ethnicity in breast cancer—disparities and implications for precision oncology. BMC Med 20, 72 (2022). https://doi.org/10.1186/s12916-022-02260-0

### External Datasets ###

1. American Cancer Society: Cancer Facts & Statistics. American Cancer Society | Cancer Facts & Statistics. (n.d.). Retrieved January 29, 2023, from https://cancerstatisticscenter.cancer.org/cancer-site/Breast/5j5EUt8W
2. American Cancer Society: Cancer Facts & Statistics. American Cancer Society | Cancer Facts & Statistics. (n.d.). Retrieved January 29, 2023, from https://cancerstatisticscenter.cancer.org/cancer-site/Breast/tENxzKXJ
3. U.S. Census Bureau. (2021, October 8). 2019 population estimates by age, sex, race and Hispanic origin. Census.gov. Retrieved January 29, 2023, from https://www.census.gov/newsroom/press-kits/2020/population-estimates-detailed.html
4. List of state abbreviations (download CSV, JSON). (n.d.). Retrieved January 29, 2023, from https://worldpopulationreview.com/states/state-abbreviations 