In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots


#to ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
#data = pd.read_csv("C:/Users/ahamm/Documents/MA4/Data viz/IHME-GBD_2019_DATA-10654582-1.csv")
#data = pd.read_csv("data/IHME-GBD_2019_DATA-10654582-1.csv")
data = pd.read_csv("data/IHME-GBD_2019_DATA-daf40202-1.csv")

In [3]:
data.head(10)

Unnamed: 0,measure_id,measure_name,location_id,location_name,sex_id,sex_name,age_id,age_name,cause_id,cause_name,metric_id,metric_name,year,val,upper,lower
0,1,Deaths,94,Switzerland,1,Male,8,15-19 years,414,Stomach cancer,1,Number,1992,0.109847,0.147407,0.077678
1,1,Deaths,94,Switzerland,2,Female,8,15-19 years,414,Stomach cancer,1,Number,1992,0.103194,0.134278,0.076533
2,1,Deaths,94,Switzerland,1,Male,8,15-19 years,414,Stomach cancer,2,Percent,1992,0.000692,0.000927,0.000491
3,1,Deaths,94,Switzerland,2,Female,8,15-19 years,414,Stomach cancer,2,Percent,1992,0.001555,0.002026,0.001155
4,1,Deaths,94,Switzerland,1,Male,8,15-19 years,414,Stomach cancer,3,Rate,1992,0.053663,0.072012,0.037947
5,1,Deaths,94,Switzerland,2,Female,8,15-19 years,414,Stomach cancer,3,Rate,1992,0.053364,0.069438,0.039577
6,1,Deaths,94,Switzerland,1,Male,9,20-24 years,414,Stomach cancer,1,Number,1992,0.581058,0.770165,0.436195
7,1,Deaths,94,Switzerland,2,Female,9,20-24 years,414,Stomach cancer,1,Number,1992,0.409267,0.526591,0.306827
8,1,Deaths,94,Switzerland,1,Male,9,20-24 years,414,Stomach cancer,2,Percent,1992,0.001453,0.001925,0.001087
9,1,Deaths,94,Switzerland,2,Female,9,20-24 years,414,Stomach cancer,2,Percent,1992,0.003664,0.004716,0.002743


### I - General information about our dataset

For our study, we will use the GBD databank to obtain datasets on various cancer types and associated mortality, filtering by metrics, location, age, and sex from 1990 to 2019. This database was selected for its user-friendly interface that allows pre-download filtering, minimizing the need for preprocessing and avoiding dataset compatibility issues.

#### 1) Basic information

In [4]:
# Display basic information about the dataset
print("Shape of the dataset:", data.shape)
print("\nColumns in the dataset:", data.columns)
print("\nData types of columns:\n", data.dtypes)
print("\nNumber of unique values per feature:\n",data.nunique())
print("\nUnique values in each feature:\n",data.apply(lambda c : c.unique()))

Shape of the dataset: (348570, 16)

Columns in the dataset: Index(['measure_id', 'measure_name', 'location_id', 'location_name', 'sex_id',
       'sex_name', 'age_id', 'age_name', 'cause_id', 'cause_name', 'metric_id',
       'metric_name', 'year', 'val', 'upper', 'lower'],
      dtype='object')

Data types of columns:
 measure_id         int64
measure_name      object
location_id        int64
location_name     object
sex_id             int64
sex_name          object
age_id             int64
age_name          object
cause_id           int64
cause_name        object
metric_id          int64
metric_name       object
year               int64
val              float64
upper            float64
lower            float64
dtype: object

Number of unique values per feature:
 measure_id            6
measure_name          6
location_id           2
location_name         2
sex_id                2
sex_name              2
age_id               17
age_name             17
cause_id             10
cause_nam

Features :

Overall we have 16 columns corresponding to 10 features in this dataset. Six of the features come as two columns, one holding the ID and one holding the name of the variable.

* `measure_name`: The measure that `val` quantifies. Its values are: `Deaths`, `DALYs` (the number of years lost due to ill-health, disability, or early death), `YLDs` (the number of years lived with disability), `YLLs` (the number of years of life lost compared to the expected life years), `Prevalence` (the proportion of cases in the population at a given time), `Incidence` (the rate of occurrence of new cases).
Notice that DALYs = YLDs + YLLs.
* `location_name`: The place where the measure has been recorded. Here we have access to measures from Switzerland and the European Union.
* `sex_name`: Male or Female.
* `age_name`: The age band the row refers to. This feature takes 17 values: fifteen non-overlapping bands from `<5 years` to `75+ years`, plus two aggregate bands (`25-49 years` and `50-69 years`) that overlap them.
* `cause_name`: The disease being measured. Here we focus on ten cancer categories: Pancreatic, Ovarian, Kidney, Brain/Central nervous system, Thyroid, Stomach, Liver, Tracheal/Bronchus/Lung, Breast, Colon/rectum.
* `metric_name`: The type of metric (the unit) of the measurement. `val` can be stated as an absolute `Number`, as a `Percent` (a proportion of the corresponding total; for `Deaths`, the share of deaths from all causes) or as a `Rate` (the frequency with which the event occurs, expressed per 100k residents).
* `year`: The year for which `val` is computed. It ranges from 1990 to 2019.
* `val`: The value computed for a given cancer, expressed in a given metric and describing a particular measure.
* `upper`: The upper bound of the confidence interval of `val`.
* `lower`: The lower bound of the confidence interval of `val`.
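
The identity DALYs = YLDs + YLLs can be checked by pivoting the long table so that each measure becomes a column. The sketch below uses made-up rows and shortened measure labels (the real labels are longer, and the real data would first need to be restricted to one metric, location, sex, age band and cause). The identity is expected to hold for `val`; it need not hold for the `upper` and `lower` bounds.

```python
import pandas as pd

# Toy rows in the same long format as the dataset; the numbers are made up.
toy = pd.DataFrame({
    "measure_name": ["YLDs", "YLLs", "DALYs", "YLDs", "YLLs", "DALYs"],
    "year": [1990, 1990, 1990, 1991, 1991, 1991],
    "val": [2.0, 8.0, 10.0, 3.0, 9.0, 12.0],
})

# One row per year, one column per measure.
wide = toy.pivot(index="year", columns="measure_name", values="val")

# Absolute gap between the reported DALYs and the sum of its two components.
wide["gap"] = (wide["DALYs"] - (wide["YLDs"] + wide["YLLs"])).abs()
print(wide)
```

On real estimates a small tolerance (for example `wide["gap"] < 1e-6`) is probably more appropriate than an exact comparison.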

#### 2) What are the different types of cancer we'll be studying ?

For our study, we decided to focus on 10 different types of cancer.

In [5]:
# Types of cancer - ids
cause_ids = data['cause_id'].unique()
print(f"Ids of the types of cancer in our dataset : {cause_ids}")

cause_names = data['cause_name'].unique()
print(f"Names of the types of cancer in our dataset : {cause_names}")

Ids of the types of cancer in our dataset : [414 417 426 429 441 465 471 477 480 456]
Names of the types of cancer in our dataset : ['Stomach cancer' 'Liver cancer' 'Tracheal, bronchus, and lung cancer'
 'Breast cancer' 'Colon and rectum cancer' 'Ovarian cancer'
 'Kidney cancer' 'Brain and central nervous system cancer'
 'Thyroid cancer' 'Pancreatic cancer']


#### 3) Assessment : How do we assess the impact of these cancers?

In [6]:
measure_names = data.measure_name.unique()
print(f"List of metrics for understanding cancer impact within our dataset: {measure_names}")

List of metrics for understanding cancer impact within our dataset: ['Deaths' 'DALYs (Disability-Adjusted Life Years)'
 'YLDs (Years Lived with Disability)' 'YLLs (Years of Life Lost)'
 'Prevalence' 'Incidence']


#### 4) What are the studied regions (geographical regions) ?

In [7]:
location_names = data.location_name.unique()
print(f"List of geographical regions we'll focus on in our study : {location_names}")

List of geographical regions we'll focus on in our study : ['Switzerland' 'European Union']


#### 5) What are the ages impacted by cancer ?

In [8]:
age_names = data.age_name.unique()
print(f"List of people's ages that are impacted by cancer : {age_names}")

age_ids = data.age_id.unique()
print(f"List of IDs people's ages that are impacted by cancer : {age_ids}")

List of people's ages that are impacted by cancer : ['15-19 years' '20-24 years' '25-29 years' '30-34 years' '35-39 years'
 '40-44 years' '45-49 years' '50-54 years' '55-59 years' '60-64 years'
 '65-69 years' '70-74 years' '50-69 years' '25-49 years' '75+ years'
 '<5 years' '5-14 years']
List of IDs people's ages that are impacted by cancer : [  8   9  10  11  12  13  14  15  16  17  18  19  25 206 234   1  23]
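
Two of the 17 age bands listed above, `25-49 years` and `50-69 years`, are aggregates that overlap the five-year bands, so summing `val` over every `age_name` would count those ages twice. A small sketch with made-up counts shows the effect; `AGGREGATE_BANDS` is a name introduced here for illustration.

```python
import pandas as pd

# Aggregate bands reported alongside the five-year bands (see the output above).
AGGREGATE_BANDS = {"25-49 years", "50-69 years"}

# Toy counts, made up for illustration: the last row is the sum of the first five.
toy = pd.DataFrame({
    "age_name": ["25-29 years", "30-34 years", "35-39 years", "40-44 years",
                 "45-49 years", "25-49 years"],
    "val": [1.0, 2.0, 3.0, 4.0, 5.0, 15.0],
})

# Summing everything counts ages 25-49 twice.
naive_total = toy["val"].sum()

# Dropping the aggregate bands first gives the intended total.
fine_total = toy.loc[~toy["age_name"].isin(AGGREGATE_BANDS), "val"].sum()

print(naive_total, fine_total)
```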


## II - EDA

#### 1) Check for Duplication

In [9]:
#assert that there are no duplicated rows
assert(data.drop_duplicates().shape == data.shape)
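
A stricter check than full-row duplication is that no combination of the identifying columns appears twice, since two rows with the same keys but a different `val` would still pass the assertion above. A minimal sketch on made-up rows follows; the choice of `key_cols` is an assumption about which columns identify one estimate.

```python
import pandas as pd

# Columns assumed to identify one estimate; the values below are made up.
key_cols = ["measure_id", "location_id", "sex_id", "age_id",
            "cause_id", "metric_id", "year"]

toy = pd.DataFrame(
    [
        [1, 94, 1, 8, 414, 1, 1992, 0.11],
        [1, 94, 2, 8, 414, 1, 1992, 0.10],
        [1, 94, 2, 8, 414, 1, 1992, 0.12],  # same keys as the row above, different val
    ],
    columns=key_cols + ["val"],
)

# Rows that are identical in every column.
full_row_dupes = int(toy.duplicated().sum())

# Rows whose key combination occurs more than once (all occurrences flagged).
key_dupes = int(toy.duplicated(subset=key_cols, keep=False).sum())

print(full_row_dupes, key_dupes)
```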

In [10]:
data.nunique()

measure_id            6
measure_name          6
location_id           2
location_name         2
sex_id                2
sex_name              2
age_id               17
age_name             17
cause_id             10
cause_name           10
metric_id             3
metric_name           3
year                 30
val              321553
upper            321609
lower            321605
dtype: int64

#### 2) Missing Values Calculation

In [11]:
data.isnull().sum()

measure_id       0
measure_name     0
location_id      0
location_name    0
sex_id           0
sex_name         0
age_id           0
age_name         0
cause_id         0
cause_name       0
metric_id        0
metric_name      0
year             0
val              0
upper            0
lower            0
dtype: int64

As we can see, there are no missing values in our dataset.

#### 3) Organizing our data

1) Sorting data : Upon reviewing the data, we recognized the potential benefits of organizing the information chronologically by year. This approach would enable us to systematically analyze the progression of cancer cases over time.

In [12]:
data = data.sort_values(by='year')
data

Unnamed: 0,measure_id,measure_name,location_id,location_name,sex_id,sex_name,age_id,age_name,cause_id,cause_name,metric_id,metric_name,year,val,upper,lower
277264,5,Prevalence,4743,European Union,2,Female,11,30-34 years,426,"Tracheal, bronchus, and lung cancer",2,Percent,1990,0.000020,0.000022,0.000018
314555,6,Incidence,94,Switzerland,2,Female,25,50-69 years,471,Kidney cancer,3,Rate,1990,5.280548,6.157606,4.490293
314554,6,Incidence,94,Switzerland,1,Male,25,50-69 years,471,Kidney cancer,3,Rate,1990,17.371771,19.240313,15.491265
314553,6,Incidence,94,Switzerland,2,Female,25,50-69 years,471,Kidney cancer,2,Percent,1990,0.000012,0.000014,0.000010
314552,6,Incidence,94,Switzerland,1,Male,25,50-69 years,471,Kidney cancer,2,Percent,1990,0.000040,0.000047,0.000034
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
313825,6,Incidence,94,Switzerland,2,Female,16,55-59 years,471,Kidney cancer,1,Number,2019,23.121214,36.859334,13.595088
313824,6,Incidence,94,Switzerland,1,Male,16,55-59 years,471,Kidney cancer,1,Number,2019,56.763051,87.684218,33.706138
313787,6,Incidence,94,Switzerland,2,Female,234,75+ years,426,"Tracheal, bronchus, and lung cancer",3,Rate,2019,132.116089,168.583689,98.126053
314005,6,Incidence,94,Switzerland,2,Female,17,60-64 years,471,Kidney cancer,1,Number,2019,32.223622,50.805569,19.762338


2) Subdividing and saving the data : We split the dataset by cancer type and by metric, and save each subset to its own CSV file.

In [13]:
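# Save one CSV file per cancer type (spaces in the name are replaced by underscores);
# index=False avoids writing the row index as an extra unnamed column
for name in data.cause_name.unique():
    data[data.cause_name==name].to_csv(f'data/by_{name.replace(" ","_")}.csv', index=False)

# Save one CSV file per metric (Number, Percent, Rate)
for name in data.metric_name.unique():
    data[data.metric_name==name].to_csv(f'data/by_{name}.csv', index=False)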

In [14]:
data_rate = pd.read_csv('data/by_Rate.csv')
data_number = pd.read_csv('data/by_Number.csv')
data_percent = pd.read_csv('data/by_Percent.csv')
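
The per-metric subsets can also be obtained in memory, without a round trip through CSV files. The sketch below is one possible alternative on made-up rows, not the approach used in this notebook; `safe_name` is a hypothetical helper that additionally strips commas and other punctuation from labels such as "Tracheal, bronchus, and lung cancer" before they are used in file names.

```python
import re

import pandas as pd


def safe_name(label: str) -> str:
    """Turn a label into a file-name-friendly token (hypothetical helper)."""
    return re.sub(r"[^A-Za-z0-9]+", "_", label).strip("_")


# Toy rows, made up for illustration.
toy = pd.DataFrame({
    "metric_name": ["Number", "Rate", "Percent", "Rate"],
    "cause_name": ["Kidney cancer", "Tracheal, bronchus, and lung cancer",
                   "Kidney cancer", "Kidney cancer"],
    "val": [1.0, 2.0, 0.1, 3.0],
})

# Split once, in memory: one DataFrame per metric.
by_metric = {name: group for name, group in toy.groupby("metric_name")}
rate_rows = by_metric["Rate"]

print(sorted(by_metric))
print(safe_name("Tracheal, bronchus, and lung cancer"))
```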

#### 4) Our study

Our study aims to uncover patterns between various cancer types and their impact on populations in Switzerland and the European Union.

We will undertake analyses utilizing a range of indicators provided by the GBD databank to gain insights into how cancers affect individuals' lives. As stated previously these indicators include:

- **`Deaths`**: The number of deaths within a population over a specific period.
- **`DALYs (Disability-adjusted life years)`**: A measure combining years of life lost due to premature death and years lived with a disability, reflecting total years of healthy life lost.
- **`YLDs (Years Lived with Disability)`**: The total time spent living with any health loss, adjusted by severity.
- **`YLLs (Years of Life Lost)`**: The years lost due to premature death.
- **`Prevalence`**: The proportion of individuals in a population who have a disease or its sequelae at a specific time.
- **`Incidence`**: The count of new disease cases within a certain time frame in a specified population.

In [16]:
data.measure_name.unique()

array(['Prevalence', 'Incidence', 'YLLs (Years of Life Lost)',
       'YLDs (Years Lived with Disability)',
       'DALYs (Disability-Adjusted Life Years)', 'Deaths'], dtype=object)

Thanks to these metrics, our analysis will strive to address key questions:

1) How have cancer cases changed over time within the studied populations?
2) What are the characteristics of the individuals affected by these cancers, including their origin (Switzerland or EU), sex, and age at death?

##### A - Evolution of Cancer Cases Over Time

**ANALYSIS 1 : average percentage of cancer deaths relative to deaths from all causes in both the European Union and Switzerland from 1990 to 2019**

To track the progression of cancer cases, we initially considered it essential to understand the percentage of deaths due to cancer within the two populations. Consequently, we decided to analyze the average percentage of cancer deaths relative to deaths from all causes in both the European Union and Switzerland from 1990 to 2019.

In [17]:
colors = px.colors.qualitative.Bold

In [22]:
fig = go.Figure()
i=0
for location, group_l in data_percent[data_percent.measure_name=='Deaths'].groupby(['location_name','sex_name','year'])[['val','lower','upper']].mean().reset_index().groupby('location_name'):
    for sex, group_s in group_l.groupby('sex_name'):
        fig.add_trace(go.Scatter(x=group_s['year'], y=group_s['val'], name=f'{sex}/{location}' ,line_color=colors[i],opacity=1))
        fig.add_trace(go.Scatter(x=group_s['year'], y=group_s['lower'], fill=None, mode='lines',hoverinfo='skip',line_color=colors[i],line_width=0.1,showlegend=False))
        fig.add_trace(go.Scatter(x=group_s['year'], y=group_s['upper'], fill='tonexty', mode='lines',hoverinfo='skip',line_color=colors[i],line_width=0.1,showlegend=False))
        i+=1
fig.update_xaxes(title_text="Year")
fig.update_yaxes(title_text="Percent out of all death causes")
fig.update_layout(hovermode='x unified', title='Mean proportion of deaths caused by cancer diseases (in general) from 1990 to 2019')
fig.show()

From the above graph, we can observe the following:

- For both regions (EU and Switzerland), the mean percentage of deaths due to cancer has been consistently higher for females than for males throughout the years. This could suggest that women are at a higher risk of dying from cancer or that certain cancers affecting women are more lethal. It could equally mean that men are more likely to die from other causes, which lowers the share attributable to cancer.

- There is an apparent increasing trend in the mean percentage of cancer deaths over time for all groups. This could reflect an aging population as well as better reporting and diagnosis over time.

- The mean percentage of deaths due to cancer in Switzerland is consistently higher than in the European Union for both sexes. Note, however, that the confidence bands for Switzerland cover those of the EU, which suggests that the difference may come from estimation uncertainty and may not be significant.
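
The last observation rests on comparing two intervals. A minimal sketch of that comparison is given below with made-up bounds; both function names are introduced here for illustration. Overlapping intervals are only a rough heuristic and not a formal significance test, and the averaged `lower` and `upper` values plotted above are not themselves an exact interval for the averaged `val`.

```python
def intervals_overlap(lower_a, upper_a, lower_b, upper_b):
    """Return True when [lower_a, upper_a] and [lower_b, upper_b] share a point."""
    return lower_a <= upper_b and lower_b <= upper_a


def interval_contains(outer_lower, outer_upper, inner_lower, inner_upper):
    """Return True when the inner interval lies entirely inside the outer one."""
    return outer_lower <= inner_lower and inner_upper <= outer_upper


# Made-up (lower, upper) bounds for one year, for illustration only.
switzerland = (0.020, 0.040)
european_union = (0.024, 0.030)

print(intervals_overlap(*switzerland, *european_union))
print(interval_contains(*switzerland, *european_union))
```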

**ANALYSIS 2 : Percentage of deaths caused by Kidney Cancer in both the European Union and Switzerland from 1990 to 2019**

Out of curiosity, we thought it might be insightful to examine the proportion of deaths attributed to kidney cancer. In future work we will investigate each type.

In [23]:
fig = go.Figure()
i=0
for location, group_l in data_percent[(data_percent.measure_name=='Deaths') & (data_percent.cause_name=='Kidney cancer')].groupby(['location_name','sex_name','year'])[['val','lower','upper']].mean().reset_index().groupby('location_name'):
    for sex, group_s in group_l.groupby('sex_name'):
        fig.add_trace(go.Scatter(x=group_s['year'], y=group_s['val'], name=f'{sex}/{location}' ,line_color=colors[i],opacity=1))
        fig.add_trace(go.Scatter(x=group_s['year'], y=group_s['lower'], fill=None, mode='lines',hoverinfo='skip',line_color=colors[i],line_width=0.1,showlegend=False))
        fig.add_trace(go.Scatter(x=group_s['year'], y=group_s['upper'], fill='tonexty', mode='lines',hoverinfo='skip',line_color=colors[i],line_width=0.1,showlegend=False))
        i+=1
fig.update_xaxes(title_text="Year")
fig.update_yaxes(title_text="Percent out of all death causes")
fig.update_layout(hovermode='x unified', title='Proportion of deaths caused by Kidney cancer')
fig.show()

The graph displays a rising trend in the share of deaths attributed to kidney cancer for both males and females in the European Union and Switzerland from 1990 to 2019, with males showing a higher share than females in both regions.

**ANALYSIS 3 : YLLs (Years of Life Lost to premature death) in both the European Union and Switzerland from 1990 to 2019**

It is also insightful to examine the YLLs (Years of Life Lost to premature death) attributed to cancer.

In [28]:
fig = go.Figure()
i=0
for location, group_l in data_rate[data_rate.measure_name=='YLLs (Years of Life Lost)'].groupby(['location_name','sex_name','year'])[['val','lower','upper']].mean().reset_index().groupby('location_name'):
    for sex, group_s in group_l.groupby('sex_name'):
        fig.add_trace(go.Scatter(x=group_s['year'], y=group_s['val'], name=f'{sex}/{location}' ,line_color=colors[i],opacity=1))
        fig.add_trace(go.Scatter(x=group_s['year'], y=group_s['lower'], fill=None, mode='lines',hoverinfo='skip',line_color=colors[i],line_width=0.1,showlegend=False))
        fig.add_trace(go.Scatter(x=group_s['year'], y=group_s['upper'], fill='tonexty', mode='lines',hoverinfo='skip',line_color=colors[i],line_width=0.1,showlegend=False))
        i+=1
fig.update_xaxes(title_text="Year")
fig.update_yaxes(title_text="YLLs per 100k res.")
fig.update_layout(hovermode='x unified', title='YLLs per 100k residents caused by cancer diseases')
fig.show()

We examine the rate of YLLs due to cancer per 100k residents. These 100k residents are 'average' residents, as if drawn at random from the whole population, so they do not necessarily have cancer. Dividing the rate by 100,000 therefore gives the YLLs for one 'average' person, not for one person with cancer.
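
The arithmetic behind a rate per 100k residents can be written out as follows. This is a sketch with made-up figures; the function names are illustrative, and it assumes the `Rate` metric is a count divided by the whole population and scaled to 100,000 residents.

```python
def rate_per_100k(count: float, population: float) -> float:
    """Scale a raw count to a rate per 100,000 residents."""
    return count / population * 100_000


def per_average_resident(rate: float) -> float:
    """Convert a rate per 100,000 residents back to a value per resident."""
    return rate / 100_000


# Made-up figures: 2,500 years of life lost in a population of 8 million.
ylls = 2_500
population = 8_000_000

rate = rate_per_100k(ylls, population)
print(rate)                        # YLLs per 100k residents
print(per_average_resident(rate))  # YLLs per 'average' resident
```

The denominator is the whole population, which is why the per-resident figure says nothing about the years lost by a person who actually has cancer.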

**ANALYSIS 4 : Prevalence (proportion of individuals in a population who have a disease or its sequelae at a specific time) in both the European Union and Switzerland from 1990 to 2019**

In [32]:
fig = go.Figure()
i=0
for location, group_l in data_rate[data_rate.measure_name=='Prevalence'].groupby(['location_name','sex_name','year'])[['val','lower','upper']].mean().reset_index().groupby('location_name'):
    for sex, group_s in group_l.groupby('sex_name'):
        fig.add_trace(go.Scatter(x=group_s['year'], y=group_s['val'], name=f'{sex}/{location}' ,line_color=colors[i],opacity=1))
        fig.add_trace(go.Scatter(x=group_s['year'], y=group_s['lower'], fill=None, mode='lines',hoverinfo='skip',line_color=colors[i],line_width=0.1,showlegend=False))
        fig.add_trace(go.Scatter(x=group_s['year'], y=group_s['upper'], fill='tonexty', mode='lines',hoverinfo='skip',line_color=colors[i],line_width=0.1,showlegend=False))
        i+=1
fig.update_xaxes(title_text="Year")
fig.update_yaxes(title_text="Rate for 100k res.")
fig.update_layout(hovermode='x unified', title='Prevalence rate of cancer per 100k residents')
fig.show()

Above we can observe that the prevalence of cancer is much higher for women. This reflects the fact that our data includes ovarian and breast cancer, which affect women almost exclusively. Below we draw the same plot twice, once excluding these two types of cancer and once with only these two types, to see how this affects the graphs.

**ANALYSIS 5 : Total cases of cancer (ovarian and breast excepted) VS (ovarian and breast only) per 100k population**

Without ovarian and breast cancer, we recover a hierarchy similar to the one observed for the average percentage of deaths. From 2000 onward, the prevalence rate for males in Switzerland is stable, whereas the rate for males in the European Union keeps increasing. We do not observe such a difference for females. When considering only breast and ovarian cancers, the rate for males falls to zero, as expected. The rate for Swiss females stays higher than that of EU females until 2014, when the two meet.

In [33]:
fig = go.Figure()
i=0
for location, group_l in data_rate[(data_rate.measure_name=='Prevalence') & (data_rate.cause_name!='Ovarian cancer') & (data_rate.cause_name!='Breast cancer')].groupby(['location_name','sex_name','year'])[['val','lower','upper']].mean().reset_index().groupby('location_name'):
    for sex, group_s in group_l.groupby('sex_name'):
        fig.add_trace(go.Scatter(x=group_s['year'], y=group_s['val'], name=f'{sex}/{location}' ,line_color=colors[i],opacity=1))
        fig.add_trace(go.Scatter(x=group_s['year'], y=group_s['lower'], fill=None, mode='lines',hoverinfo='skip',line_color=colors[i],line_width=0.1,showlegend=False))
        fig.add_trace(go.Scatter(x=group_s['year'], y=group_s['upper'], fill='tonexty', mode='lines',hoverinfo='skip',line_color=colors[i],line_width=0.1,showlegend=False))
        i+=1
fig.update_xaxes(title_text="Year")
fig.update_yaxes(title_text="Rate for 100k")
fig.update_layout(hovermode='x unified', title='Prevalence rate of cancer (ovarian and breast excepted) per 100k population')
fig.show()

In [34]:
fig = go.Figure()
i=0
for location, group_l in data_rate[(data_rate.measure_name=='Prevalence') & ((data_rate.cause_name=='Breast cancer') | (data_rate.cause_name=='Ovarian cancer'))].groupby(['location_name','sex_name','year'])[['val','lower','upper']].mean().reset_index().groupby('location_name'):
    for sex, group_s in group_l.groupby('sex_name'):
        fig.add_trace(go.Scatter(x=group_s['year'], y=group_s['val'], name=f'{sex}/{location}' ,line_color=colors[i],opacity=1))
        fig.add_trace(go.Scatter(x=group_s['year'], y=group_s['lower'], fill=None, mode='lines',hoverinfo='skip',line_color=colors[i],line_width=0.1,showlegend=False))
        fig.add_trace(go.Scatter(x=group_s['year'], y=group_s['upper'], fill='tonexty', mode='lines',hoverinfo='skip',line_color=colors[i],line_width=0.1,showlegend=False))
        i+=1
fig.update_xaxes(title_text="Year")
fig.update_yaxes(title_text="Rate for 100k res.")
fig.update_layout(hovermode='x unified', title='Prevalence rate of ovarian and breast cancer per 100k population')
fig.show()

##### B - Age-Specific Cancer Risk Overview

In our investigation into the correlation between cancer development and age, we aim to understand how cancer cases are distributed across different life stages. Here are our hypotheses : 

- **<5 years**: `Toddlers` - Cancer is rare in this age group, but it can occur. Early detection and treatment are crucial for improving outcomes.
- **5-14 years**: `Kids` - Childhood cancers, though uncommon, include leukemias and brain cancers.
- **15-19 years**: `Teenagers` - This age group sees an increase in cases of thyroid cancer.
- **20-29 years**: `Young Adults` - The risk of breast, ovarian, cervical, stomach, and thyroid cancer becomes more pronounced, especially in individuals with certain genetic predispositions or lifestyle factors.
- **30-44 years**: `Adults` - Cancer risks increase and diversify, including colon, kidney, breast, pancreatic, and lung cancers. 
- **45-59 years**: `Adults` - This group sees a higher incidence of cancer, making screenings for colorectal, breast, and prostate cancers increasingly important for early diagnosis and treatment.
- **60-64 years**: `Adults` - Cancer risk continues to rise, with a significant focus on screening for cancers that have higher survival rates when detected early, such as colorectal, lung, and bladder cancer.
- **65-69 years** and **70-74 years**: `Adults` - As individuals age, the prevalence of cancers such as prostate, lung, and colorectal cancer increases. 
- **75+ years**: `Seniors` - The highest risk of cancer is observed in this age group. 


**ANALYSIS 1 : Number of Cancer deaths from 1990 to 2019 - aggregation**

The goal of this analysis is to understand and visualize the distribution of cancer deaths across age groups over the period from 1990 to 2019. By summing the number of deaths for each age group, the analysis aims to identify which age groups are most affected by cancer.

In [36]:
# Age groups
age_groups = ['<5 years', '5-14 years', '15-19 years', '20-24 years', '25-29 years',
              '30-34 years', '35-39 years', '40-44 years', '45-49 years', '50-54 years',
              '55-59 years', '60-64 years', '65-69 years', '70-74 years', '75+ years']
              #'25-49 years', '50-69 years']

# Total number of cancer deaths (metric_id 1 = 'Number') in each age group, summed over all years
cancer_impact = [data[(data['age_name'] == age) & (data['measure_name'] == 'Deaths') & (data['metric_id'] == 1)]['val'].sum() for age in age_groups]

fig = go.Figure()

fig.add_trace(go.Bar(
    x=cancer_impact,
    y=age_groups,
    orientation='h',
    marker=dict(
        color='skyblue',
        line=dict(color='black', width=1)
    )
))

fig.update_layout(
    title='Cumulated Number of Cancer Deaths by Age Group from 1990 to 2019',
    xaxis=dict(title='Number of deaths'),
    yaxis=dict(title='Age Group'),
    template='plotly_white',  
    showlegend=False
)

fig.show()


The graph shows the cumulative count of cancer deaths across age groups over a thirty-year period. The horizontal bars indicate that the highest number of cancer deaths occurs in the oldest age group, "75+ years", with a noticeable decline in younger age groups. Each younger age group generally shows fewer deaths, illustrating a clear association between increasing age and cancer mortality. The fewest deaths occur in the youngest age groups, "<5 years" and "5-14 years". This data underscores the significance of age as a factor in cancer mortality.

**ANALYSIS 2 : Number of Cancer deaths from 1990 to 2019 - year by year**

By merging specific age ranges into broader groups and tracking the total number of cancer deaths per year for each group, this analysis offers insights into how cancer mortality changes across age groups over time.

In [38]:

# Define age groups with combined age ranges and corresponding groupings
age_group_mapping = {
    '<5 years': ['<5 years'],
    '5-14 years': ['5-14 years'],
    '15-24 years': ['15-19 years', '20-24 years'],
    '25-49 years': ['25-29 years', '30-34 years', '35-39 years', '40-44 years', '45-49 years'],
    '50-59 years': ['50-54 years', '55-59 years'],
    '60-69 years': ['60-64 years', '65-69 years'],
    '70+ years': ['70-74 years', '75+ years']
}

# Calculate the total number of death cases per age group and year
cases_by_year_age = data[(data['measure_name'] == 'Deaths') & (data['metric_id'] == 1)].groupby(['year', 'age_name'])['val'].sum().reset_index()

aggregated_data = []

for year in cases_by_year_age['year'].unique():
    for new_age_group, age_ranges in age_group_mapping.items():
        # Sum cases for all ages in the new age group for each year
        total_cases = cases_by_year_age[
            (cases_by_year_age['year'] == year) & 
            (cases_by_year_age['age_name'].isin(age_ranges))
        ]['val'].sum()

        aggregated_data.append({
            'year': year,
            'age_group': new_age_group,
            'val': total_cases
        })

aggregated_df = pd.DataFrame(aggregated_data)

pivot_df = aggregated_df.pivot(index='year', columns='age_group', values='val')

traces = []
for age_group in pivot_df.columns:
    traces.append(go.Scatter(
        x=pivot_df.index,
        y=pivot_df[age_group],
        mode='lines+markers',
        name=age_group
    ))

layout = go.Layout(
    title='Number of Cancer Deaths by Year and by Age Group',
    xaxis=dict(title='Year'),
    yaxis=dict(title='Number of Deaths'),
    template='plotly_white'
)

fig = go.Figure(data=traces, layout=layout)

fig.show()

The graph presents the total number of cancer deaths per year from 1990 to 2019, broken down by age group. The "70+ years" group consistently has the highest number of deaths, followed by "60-69 years" and "50-59 years", suggesting higher cancer mortality with increasing age. The trend for these groups is relatively stable. The younger age groups ("<5 years" through "25-49 years") show significantly fewer deaths, with the lowest numbers in the "<5 years" category. The data underscores the importance of age as a risk factor in cancer mortality.
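
The regrouping of age bands can also be expressed without nested loops, by inverting the mapping and letting pandas do the aggregation. This is a sketch on made-up rows rather than the code that produced the figure; `band_to_group` is a name introduced here. Bands missing from the mapping, such as the aggregate `50-69 years`, map to NaN and are dropped.

```python
import pandas as pd

age_group_mapping = {
    '<5 years': ['<5 years'],
    '5-14 years': ['5-14 years'],
    '15-24 years': ['15-19 years', '20-24 years'],
    '25-49 years': ['25-29 years', '30-34 years', '35-39 years', '40-44 years', '45-49 years'],
    '50-59 years': ['50-54 years', '55-59 years'],
    '60-69 years': ['60-64 years', '65-69 years'],
    '70+ years': ['70-74 years', '75+ years'],
}

# Invert the mapping: original band -> broader group.
band_to_group = {band: group
                 for group, bands in age_group_mapping.items()
                 for band in bands}

# Toy rows, made up for illustration.
toy = pd.DataFrame({
    "year": [1990, 1990, 1990, 1991, 1991],
    "age_name": ["15-19 years", "20-24 years", "75+ years", "70-74 years", "50-69 years"],
    "val": [1.0, 2.0, 10.0, 4.0, 99.0],
})

toy["age_group"] = toy["age_name"].map(band_to_group)

# One row per year, one column per broader age group.
totals = (toy.dropna(subset=["age_group"])
             .groupby(["year", "age_group"])["val"]
             .sum()
             .unstack("age_group"))
print(totals)
```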

**ANALYSIS 3 : Demographic Analysis of Kidney Cancer Mortality Over Three Decades**

The purpose is to construct a series of population pyramids that detail the distribution of kidney cancer fatalities segmented by age and sex for consecutive five-year intervals from 1990 to 2019. 

In [None]:
# Define the year groups
year_groups = {
    '1990-1994': list(range(1990, 1995)),
    '1995-1999': list(range(1995, 2000)),
    '2000-2004': list(range(2000, 2005)),
    '2005-2009': list(range(2005, 2010)),
    '2010-2014': list(range(2010, 2015)),
    '2015-2019': list(range(2015, 2020))
}

# Function to assign each year to a group
def map_year_to_group(year):
    for group, years in year_groups.items():
        if year in years:
            return group
    return None

data['year_group'] = data['year'].apply(map_year_to_group)

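# Filter data for kidney cancer deaths, keeping only absolute counts (metric 'Number')
# so that percentages and rates are not summed together with the counts.
# .copy() lets us modify the subset below without touching `data`.
data_kidney_deaths = data[(data['measure_name'] == 'Deaths') & (data['cause_name'] == 'Kidney cancer') & (data['metric_name'] == 'Number')].copy()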

colors = {
    'Female': 'red',
    'Male': 'blue'
}

sorted_age_names = ['<5 years', '5-14 years', '15-19 years', '20-24 years', '25-29 years',
                    '30-34 years', '35-39 years', '40-44 years', '45-49 years', '50-54 years',
                    '55-59 years', '60-64 years', '65-69 years', '70-74 years', '75+ years']

# Set 'age_name' to a categorical type with the specified order before grouping
data_kidney_deaths['age_name'] = pd.Categorical(
    data_kidney_deaths['age_name'], 
    categories=sorted_age_names, 
    ordered=True
)

# Iterate through each year group to create a separate population pyramid
for year_group in sorted(year_groups.keys()):
    # Filter the data for the specific year group
    group_data = data_kidney_deaths[data_kidney_deaths['year_group'] == year_group]

    fig = go.Figure()

    for sex in ['Female', 'Male']:
        sex_data = group_data[group_data['sex_name'] == sex]

        # Group by 'age_name' and sum 'val'; this respects the order of 'sorted_age_names'
        # (observed=False keeps every age band, even one with no rows)
        age_data = sex_data.groupby('age_name', observed=False)['val'].sum()

        # If the sex is male, we want the bar to go left, so we multiply by -1
        multiplier = -1 if sex == 'Male' else 1

        fig.add_trace(go.Bar(
            x=multiplier * age_data.values,  # age_data
            y=age_data.index,
            name=sex,
            orientation='h',
            marker=dict(color=colors[sex]),
            hoverinfo='x'
        ))

    fig.update_layout(
        barmode='overlay',
        title=f'Population Pyramid for Kidney Cancer Deaths: {year_group}',
        xaxis=dict(title='Number of Deaths', showgrid=False),
        yaxis=dict(title='Age Group'),
        legend=dict(x=0.1, y=1.1, orientation='h')
    )

    fig.show()


The series of population pyramids represents kidney cancer deaths within various age groups for successive five-year periods, from 1990-1994 through to 2015-2019. The graphs illustrate:

- For each period, deaths are more frequent in older age groups, with the "75+ years" category consistently showing the highest mortality.
- Male deaths outnumber female deaths across nearly all age categories, which is a consistent trend over the 30-year span. This agrees with our earlier analysis of the share of deaths caused by kidney cancer over time.
- Over time, there is a noticeable shift towards a higher number of deaths for both sexes, suggesting either an increase in kidney cancer mortality or improvements in the reporting and diagnosis of the disease.