# Data Analysis on Contributing Factors to Cardiovascular Diseases

## Introduction

In this notebook, we explore data sourced from the Our World in Data API, focusing on factors associated with cardiovascular diseases. Cardiovascular diseases are among the leading causes of death worldwide, and understanding the socioeconomic, environmental, and health-related factors involved is essential for formulating prevention and treatment strategies.

We will use several datasets that include:

- **Causes of Death**: information on various causes of mortality across different countries and years.
- **Air Pollution**: levels of different pollutants, such as sulfur dioxide (SO₂) and nitrogen oxides (NOₓ), which can affect respiratory and cardiovascular health.
- **GDP per capita**: an economic indicator that may correlate with access to healthcare and lifestyle factors.
- **Obesity and Diabetes Prevalence**: important risk factors for cardiovascular diseases.
- **Population**: demographic data for each country, essential for normalizing the data and for per capita analysis.

Throughout this notebook, we will perform data cleaning and transformation operations, merging these datasets into a single DataFrame. This consolidated data table will enable further analysis and visualization of patterns, allowing for statistical studies on how these factors influence cardiovascular diseases.


## Data Loading & Libraries

In [3]:
import sys
import os
from dotenv import load_dotenv
import pandas as pd


## Set Workdir

In [4]:
load_dotenv()


''' Configuration for running outside Docker:
work_dir = os.getenv('WORK_DIR')
sys.path.append(work_dir)
'''

# Temporarily override WORK_DIR inside the notebook
os.environ['WORK_DIR'] = '/home/jovyan/work'
sys.path.append(os.getenv('WORK_DIR'))
sys.path.append(f"{os.getenv('WORK_DIR')}/src")
sys.path.append(f"{os.getenv('WORK_DIR')}/transform")
from transform.charts import get_data

## Importing the Original Causes of Death Dataset and Preparing for Merging

In this section, we will import the original "Causes of Death" dataset and prepare it for merging with other relevant datasets. The merging process will allow us to combine related information across datasets, facilitating a comprehensive analysis.

In [5]:
cause_of_deaths = pd.read_csv('../data/cause_of_deaths.csv', sep=',')

In [6]:
cause_of_deaths

Unnamed: 0,Country,Code,Year,Meningitis,Alzheimer's Disease and Other Dementias,Parkinson's Disease,Nutritional Deficiencies,Malaria,Drowning,Interpersonal Violence,...,Diabetes Mellitus,Chronic Kidney Disease,Poisonings,Protein-Energy Malnutrition,Road Injuries,Chronic Respiratory Diseases,Cirrhosis and Other Chronic Liver Diseases,Digestive Diseases,"Fire, Heat, and Hot Substances",Acute Hepatitis
0,Afghanistan,AFG,1990,2159,1116,371,2087,93,1370,1538,...,2108,3709,338,2054,4154,5945,2673,5005,323,2985
1,Afghanistan,AFG,1991,2218,1136,374,2153,189,1391,2001,...,2120,3724,351,2119,4472,6050,2728,5120,332,3092
2,Afghanistan,AFG,1992,2475,1162,378,2441,239,1514,2299,...,2153,3776,386,2404,5106,6223,2830,5335,360,3325
3,Afghanistan,AFG,1993,2812,1187,384,2837,108,1687,2589,...,2195,3862,425,2797,5681,6445,2943,5568,396,3601
4,Afghanistan,AFG,1994,3027,1211,391,3081,211,1809,2849,...,2231,3932,451,3038,6001,6664,3027,5739,420,3816
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6115,Zimbabwe,ZWE,2015,1439,754,215,3019,2518,770,1302,...,3176,2108,381,2990,2373,2751,1956,4202,632,146
6116,Zimbabwe,ZWE,2016,1457,767,219,3056,2050,801,1342,...,3259,2160,393,3027,2436,2788,1962,4264,648,146
6117,Zimbabwe,ZWE,2017,1460,781,223,2990,2116,818,1363,...,3313,2196,398,2962,2473,2818,2007,4342,654,144
6118,Zimbabwe,ZWE,2018,1450,795,227,2918,2088,825,1396,...,3381,2240,400,2890,2509,2849,2030,4377,657,139
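
Because every later merge joins on `Country` and `Year`, it can help to confirm that this pair identifies each row uniquely before merging. The following is a minimal sketch on a toy frame (the values are illustrative, not real data); `find_duplicate_keys` is a hypothetical helper, not part of the project code.

```python
import pandas as pd

# Toy stand-in for cause_of_deaths; the values are illustrative only.
toy_deaths = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan", "Zimbabwe"],
    "Year": [1990, 1991, 1990],
    "Meningitis": [10, 12, 7],
})


def find_duplicate_keys(df, keys=("Country", "Year")):
    """Return every row whose key combination appears more than once."""
    return df[df.duplicated(subset=list(keys), keep=False)]


duplicates = find_duplicate_keys(toy_deaths)
print(f"{len(duplicates)} rows share a (Country, Year) key")
```

An empty result means each country-year pair occurs once, so a left merge on those keys cannot multiply rows of this table on its own.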


## Importing the Air Pollution Dataset

In this step, we import the **Air Pollution** dataset, a key component in our analysis, which will help us assess how environmental factors correlate with causes of death. By including this dataset, we can analyze the impact of long-term exposure to pollutants on health outcomes and specifically on cardiovascular diseases.

In [7]:
air_pollution = get_data('https://ourworldindata.org/grapher/long-run-air-pollution')
air_pollution

Unnamed: 0,entities,years,nox,so2,co,bc,nh3,nmvoc
0,Aruba,1750,1.051486e-01,2.102971e-01,8.096439e+01,1.352049,4.171478e+00,1.114749e+01
1,Aruba,1760,1.118053e-01,2.236105e-01,8.609006e+01,1.437645,4.477596e+00,1.185324e+01
2,Aruba,1770,1.186446e-01,2.372893e-01,9.135637e+01,1.525588,4.802180e+00,1.257837e+01
3,Aruba,1780,1.256144e-01,2.512289e-01,9.672311e+01,1.615209,5.145519e+00,1.331733e+01
4,Aruba,1790,1.326458e-01,2.652916e-01,1.021373e+02,1.705622,5.507698e+00,1.406284e+01
...,...,...,...,...,...,...,...,...
48220,South America,2018,6.192406e+06,3.602524e+06,1.479585e+07,311315.340000,5.642008e+06,6.844417e+06
48221,South America,2019,6.091662e+06,3.420204e+06,1.445984e+07,301738.750000,5.684173e+06,6.787708e+06
48222,South America,2020,5.574989e+06,3.246352e+06,1.347026e+07,291928.280000,5.738365e+06,6.676336e+06
48223,South America,2021,5.968312e+06,3.489811e+06,1.389319e+07,292590.250000,5.778604e+06,6.696890e+06


### Renaming Columns for Consistency

After importing the **Air Pollution** dataset, we rename its columns so that the key columns (`Country`, `Year`) match those of the Causes of Death dataset and the pollutant columns carry descriptive names. Consistent key names are required for the merges performed later and help avoid mismatches during analysis.

In [8]:
air_pollution.rename(columns={'entities': 'Country', 'years': 'Year', 'nox': 'nitrogen_oxide(NOx)' , 'so2': 'sulphur_dioxide(SO2)', 'co': 'carbon_monoxide(CO)', 'bc': 'black_carbon(BC)', 'nh3': 'ammonia(NH3)', 'nmvoc': 'non_methane_volatile_organic_compounds'}, inplace=True)

In [9]:
air_pollution

Unnamed: 0,Country,Year,nitrogen_oxide(NOx),sulphur_dioxide(SO2),carbon_monoxide(CO),black_carbon(BC),ammonia(NH3),non_methane_volatile_organic_compounds
0,Aruba,1750,1.051486e-01,2.102971e-01,8.096439e+01,1.352049,4.171478e+00,1.114749e+01
1,Aruba,1760,1.118053e-01,2.236105e-01,8.609006e+01,1.437645,4.477596e+00,1.185324e+01
2,Aruba,1770,1.186446e-01,2.372893e-01,9.135637e+01,1.525588,4.802180e+00,1.257837e+01
3,Aruba,1780,1.256144e-01,2.512289e-01,9.672311e+01,1.615209,5.145519e+00,1.331733e+01
4,Aruba,1790,1.326458e-01,2.652916e-01,1.021373e+02,1.705622,5.507698e+00,1.406284e+01
...,...,...,...,...,...,...,...,...
48220,South America,2018,6.192406e+06,3.602524e+06,1.479585e+07,311315.340000,5.642008e+06,6.844417e+06
48221,South America,2019,6.091662e+06,3.420204e+06,1.445984e+07,301738.750000,5.684173e+06,6.787708e+06
48222,South America,2020,5.574989e+06,3.246352e+06,1.347026e+07,291928.280000,5.738365e+06,6.676336e+06
48223,South America,2021,5.968312e+06,3.489811e+06,1.389319e+07,292590.250000,5.778604e+06,6.696890e+06
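
The same `entities` and `years` renaming is repeated for every dataset fetched with `get_data`. One possible way to keep that consistent is a small helper, sketched below. `standardize_columns` and `KEY_RENAMES` are hypothetical names introduced for illustration, and the toy frame only mimics the shape of the data shown above.

```python
import pandas as pd

# Key columns as they appear in the frames returned by get_data in this notebook.
KEY_RENAMES = {"entities": "Country", "years": "Year"}


def standardize_columns(df, value_renames=None):
    """Return a renamed copy: shared key names plus dataset-specific renames."""
    renames = {**KEY_RENAMES, **(value_renames or {})}
    return df.rename(columns=renames)


# Toy frame shaped like the air pollution data; the number is made up.
raw = pd.DataFrame({"entities": ["Aruba"], "years": [1750], "so2": [0.21]})
tidy = standardize_columns(raw, {"so2": "sulphur_dioxide(SO2)"})
print(tidy.columns.tolist())
```

Returning a new frame instead of using `inplace=True` leaves the raw download untouched, which makes it easier to re-run a cell without side effects.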


## Importing the GDP per Capita Dataset

In this step, we import the **GDP per Capita** dataset, which provides valuable insights into the economic status of each country. Integrating this data allows us to examine the potential influence of economic factors on health outcomes, including the prevalence of cardiovascular diseases.

In [10]:
gdp_per_capita = get_data('https://ourworldindata.org/grapher/gdp-per-capita-penn-world-table')

In [11]:
gdp_per_capita.rename(columns={'entities': 'Country', 'years': 'Year','gdp_per_capita_penn_world_table': 'gdp_per_capita'}, inplace=True)

In [12]:
gdp_per_capita

Unnamed: 0,Country,Year,gdp_per_capita
0,Albania,1971,3159.8088
1,Albania,1972,3214.6665
2,Albania,1973,3267.8481
3,Albania,1974,3330.0708
4,Albania,1975,3385.2730
...,...,...,...
10103,Zimbabwe,2015,2880.9058
10104,Zimbabwe,2016,2919.6170
10105,Zimbabwe,2017,3112.8750
10106,Zimbabwe,2018,3007.2370


## Importing the Obesity Dataset

We now import the **Obesity** dataset, which captures obesity prevalence by country. Including this dataset enables us to analyze the impact of obesity as a risk factor for cardiovascular and other health-related issues.


In [13]:
obesity = get_data('https://ourworldindata.org/grapher/obesity-prevalence-adults-who-gho')

In [14]:
obesity.rename(columns={'entities': 'Country', 'years': 'Year', 'obesity_prevalence_adults_who_gho': 'obesity_prevalence_percentage'}, inplace=True)

In [15]:
obesity

Unnamed: 0,Country,Year,obesity_prevalence_percentage
0,Afghanistan,1975,0.5
1,Africa (WHO),1975,2.0
2,Albania,1975,6.5
3,Algeria,1975,6.9
4,Americas (WHO),1975,9.5
...,...,...,...
8269,Vietnam,2016,2.1
8270,Western Pacific (WHO),2016,6.4
8271,Yemen,2016,17.1
8272,Zambia,2016,8.1


## Importing the Diabetes Dataset

The **Diabetes** dataset, detailing diabetes prevalence across different countries, is included to help us explore the role of diabetes as a risk factor in cardiovascular diseases. This dataset enriches our understanding of health conditions associated with increased mortality rates.


In [16]:
diabetes = get_data('https://ourworldindata.org/grapher/diabetes-prevalence-who-gho')

In [17]:
diabetes.rename(columns={'entities': 'Country', 'years': 'Year', 'diabetes_prevalence_who_gho': 'diabetes_prevalence_percentage'}, inplace=True)

In [18]:
diabetes

Unnamed: 0,Country,Year,diabetes_prevalence_percentage
0,Afghanistan,1980,4.9
1,Africa (WHO),1980,3.1
2,Albania,1980,4.9
3,Algeria,1980,5.2
4,Americas (WHO),1980,5.0
...,...,...,...
6890,Vietnam,2014,5.3
6891,Western Pacific (WHO),2014,8.4
6892,Yemen,2014,11.3
6893,Zambia,2014,6.6


## Importing the Population Dataset

Finally, we import the **Population** dataset, which provides total population data by country and year. This dataset is essential for normalizing health data, allowing for more accurate per capita analyses in our study.


In [19]:
population = get_data('https://ourworldindata.org/grapher/population')

In [20]:
population.rename(columns={'entities': 'Country', 'years': 'Year'}, inplace=True)

In [21]:
population

Unnamed: 0,Country,Year,population
0,Afghanistan,-10000,14737
1,Afghanistan,-9000,20405
2,Afghanistan,-8000,28253
3,Afghanistan,-7000,39120
4,Afghanistan,-6000,54166
...,...,...,...
59172,Zimbabwe,2019,15271377
59173,Zimbabwe,2020,15526888
59174,Zimbabwe,2021,15797220
59175,Zimbabwe,2022,16069061
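
The population table reaches back to the year -10000, far beyond the 1990 onward range of the Causes of Death data. A left merge discards the unmatched years anyway, but trimming the table first keeps it small and makes the overlap explicit. The sketch below assumes a hypothetical helper `restrict_to_years` and uses toy values.

```python
import pandas as pd


def restrict_to_years(df, reference, year_col="Year"):
    """Keep only rows whose year lies within the reference frame's year range."""
    first, last = reference[year_col].min(), reference[year_col].max()
    return df[df[year_col].between(first, last)].reset_index(drop=True)


# Toy stand-ins; the population values are placeholders, not real figures.
toy_deaths = pd.DataFrame({"Country": ["Zimbabwe"] * 2, "Year": [1990, 2019]})
toy_population = pd.DataFrame({
    "Country": ["Zimbabwe"] * 4,
    "Year": [-10000, 1990, 2019, 2022],
    "population": [1, 2, 3, 4],
})

trimmed = restrict_to_years(toy_population, toy_deaths)
print(trimmed)
```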


## Merging Datasets for Comprehensive Analysis

To build a unified dataset, we perform a series of merges, incorporating each of the individual datasets we imported earlier. This process allows us to maintain all key indicators in a single dataframe, facilitating a comprehensive analysis of the factors impacting health outcomes and causes of death.

In [22]:
merged_air = pd.merge(cause_of_deaths, air_pollution, on=['Country', 'Year'], how='left')

In [23]:
merged_gdp = pd.merge(merged_air, gdp_per_capita, on=['Country', 'Year'], how='left')

In [24]:
merged_obesity = pd.merge(merged_gdp, obesity, on=['Country', 'Year'], how='left')

In [25]:
merged_diabetes = pd.merge(merged_obesity, diabetes, on=['Country', 'Year'], how='left')

In [26]:
merged_df = pd.merge(merged_diabetes, population, on=['Country', 'Year'], how='left')

In [27]:
merged_df

Unnamed: 0,Country,Code,Year,Meningitis,Alzheimer's Disease and Other Dementias,Parkinson's Disease,Nutritional Deficiencies,Malaria,Drowning,Interpersonal Violence,...,nitrogen_oxide(NOx),sulphur_dioxide(SO2),carbon_monoxide(CO),black_carbon(BC),ammonia(NH3),non_methane_volatile_organic_compounds,gdp_per_capita,obesity_prevalence_percentage,diabetes_prevalence_percentage,population
0,Afghanistan,AFG,1990,2159,1116,371,2087,93,1370,1538,...,425144.75,12876.9610,1013430.94,8362.603,73274.350,404866.40,,1.3,6.6,12045664.0
1,Afghanistan,AFG,1991,2218,1136,374,2153,189,1391,2001,...,413349.72,12671.9840,983752.06,8494.117,77547.380,381666.22,,1.4,6.7,12238879.0
2,Afghanistan,AFG,1992,2475,1162,378,2441,239,1514,2299,...,272757.10,7732.8310,654986.94,8487.974,83017.660,242334.94,,1.5,6.9,13278982.0
3,Afghanistan,AFG,1993,2812,1187,384,2837,108,1687,2589,...,276675.40,7967.0625,662752.90,8756.007,89469.490,240105.75,,1.5,7.1,14943174.0
4,Afghanistan,AFG,1994,3027,1211,391,3081,211,1809,2849,...,252820.98,7698.9930,657333.00,9055.427,95695.220,234383.25,,1.6,7.2,16250799.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6115,Zimbabwe,ZWE,2015,1439,754,215,3019,2518,770,1302,...,80178.04,67009.9700,1460933.60,29917.240,116320.484,274259.53,2880.9058,15.2,,14399009.0
6116,Zimbabwe,ZWE,2016,1457,767,219,3056,2050,801,1342,...,74604.40,60951.7930,1483953.10,30161.822,121476.000,276791.56,2919.6170,15.5,,14600297.0
6117,Zimbabwe,ZWE,2017,1460,781,223,2990,2116,818,1363,...,74787.19,53447.1880,1516500.10,30984.312,121789.484,283945.00,3112.8750,,,14812484.0
6118,Zimbabwe,ZWE,2018,1450,795,227,2918,2088,825,1396,...,82210.49,56748.1840,1557296.50,32050.902,124543.016,291050.40,3007.2370,,,15034457.0
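
The chain of left merges above can also be written as a single fold over the list of datasets. The following is a sketch rather than the notebook's own method: `merge_all` is a hypothetical helper, the frames are toy data, and `validate="many_to_one"` is an added safeguard that makes pandas raise an error if a right-hand table repeats a `(Country, Year)` key.

```python
from functools import reduce

import pandas as pd


def merge_all(base, others, keys=("Country", "Year")):
    """Left-join each frame in `others` onto `base`, one after another."""
    def merge_one(left, right):
        # A repeated key on the right would silently duplicate rows of `left`;
        # validate="many_to_one" turns that situation into a MergeError.
        return left.merge(right, on=list(keys), how="left", validate="many_to_one")

    return reduce(merge_one, others, base)


# Toy frames; the values are illustrative only.
toy_deaths = pd.DataFrame({"Country": ["A", "A"], "Year": [1990, 1991], "Cardiovascular": [5, 6]})
toy_gdp = pd.DataFrame({"Country": ["A"], "Year": [1990], "gdp_per_capita": [100.0]})
toy_population = pd.DataFrame({"Country": ["A", "A"], "Year": [1990, 1991], "population": [10, 11]})

combined = merge_all(toy_deaths, [toy_gdp, toy_population])
print(combined)
```

As in the cells above, rows of the base table without a match keep their place and receive `NaN` in the new columns, so the row count of the base table is preserved.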


### Final Merged Dataset

After all merges, our dataset now contains 44 columns, including health, environmental, economic, and demographic indicators across multiple countries and years. This complete dataset forms the basis of our analysis, enabling a multi-dimensional exploration of the determinants of health and disease.

In [28]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6120 entries, 0 to 6119
Data columns (total 44 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   Country                                     6120 non-null   object 
 1   Code                                        6120 non-null   object 
 2   Year                                        6120 non-null   int64  
 3   Meningitis                                  6120 non-null   int64  
 4   Alzheimer's Disease and Other Dementias     6120 non-null   int64  
 5   Parkinson's Disease                         6120 non-null   int64  
 6   Nutritional Deficiencies                    6120 non-null   int64  
 7   Malaria                                     6120 non-null   int64  
 8   Drowning                                    6120 non-null   int64  
 9   Interpersonal Violence                      6120 non-null   int64  
 10  Maternal Dis
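
Since every merge is a left join, indicators that are unavailable for a given country and year show up as missing values (for example, `gdp_per_capita` for Afghanistan in the table above). A compact way to summarize this is the share of missing values per column. The sketch below uses a hypothetical helper `missing_share` on toy data.

```python
import pandas as pd


def missing_share(df):
    """Fraction of missing values per column, largest first, complete columns omitted."""
    share = df.isna().mean()
    return share[share > 0].sort_values(ascending=False)


# Toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "gdp_per_capita": [None, 1.0, 2.0, 3.0],
    "diabetes_prevalence_percentage": [4.9, None, None, 5.2],
})

report = missing_share(toy)
print(report)
```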

### Calculating Total Deaths

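We create a new column named **TotalDeaths** in our merged dataframe. This column is computed by summing, for each row, the 31 cause-of-death columns at positions 3 through 33 (the fourth through thirty-fourth columns). We then keep only the identifier columns, the `Cardiovascular` column, and the indicator columns added by the merges: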


In [29]:
merged_df['TotalDeaths'] = merged_df.iloc[:, 3:34].sum(axis=1)

first_columns = merged_df.columns[:3]
last_columns = merged_df.columns[34:]

merged_df = merged_df[first_columns.tolist() +  ['Cardiovascular']  + last_columns.tolist()  ]
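
The positional slice `iloc[:, 3:34]` depends on the exact column order of the merged frame. A name-based variant is sketched below as an alternative, assuming that every column of the Causes of Death table other than `Country`, `Code`, and `Year` is a cause of death. `add_total_deaths` and `ID_COLUMNS` are hypothetical names, and the data is a toy example.

```python
import pandas as pd

ID_COLUMNS = ["Country", "Code", "Year"]


def add_total_deaths(merged, deaths_source):
    """Return a copy with TotalDeaths, the row-wise sum of the cause columns."""
    # Cause columns are taken by name from the original deaths table.
    cause_columns = [c for c in deaths_source.columns if c not in ID_COLUMNS]
    out = merged.copy()
    out["TotalDeaths"] = out[cause_columns].sum(axis=1)
    return out


# Toy frames; the values are illustrative only.
toy_deaths = pd.DataFrame({
    "Country": ["A", "B"], "Code": ["AAA", "BBB"], "Year": [1990, 1990],
    "Malaria": [3, 1], "Drowning": [4, 2],
})
toy_merged = toy_deaths.assign(population=[100.0, 200.0])

with_total = add_total_deaths(toy_merged, toy_deaths)
print(with_total[["Country", "TotalDeaths"]])
```

Working on a copy also means later in-place operations on the result do not trigger pandas' copy-of-a-slice warning.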

In [30]:
# Assign the result instead of renaming in place: merged_df is a column
# selection of the previous frame, so inplace=True raises SettingWithCopyWarning.
merged_df = merged_df.rename(columns={'Cardiovascular': 'CardiovascularDeaths'})


In [31]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6120 entries, 0 to 6119
Data columns (total 15 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Country                                 6120 non-null   object 
 1   Code                                    6120 non-null   object 
 2   Year                                    6120 non-null   int64  
 3   CardiovascularDeaths                    6120 non-null   int64  
 4   nitrogen_oxide(NOx)                     5880 non-null   float64
 5   sulphur_dioxide(SO2)                    5880 non-null   float64
 6   carbon_monoxide(CO)                     5880 non-null   float64
 7   black_carbon(BC)                        5880 non-null   float64
 8   ammonia(NH3)                            5880 non-null   float64
 9   non_methane_volatile_organic_compounds  5880 non-null   float64
 10  gdp_per_capita                          5103 non-null   floa

In [32]:
merged_df = merged_df.sort_values(by=['Country', 'Year'], ascending=[True, True])
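
The `population` column was included so that death counts can be normalized. The notebook itself does not compute a rate at this point; the following is only a sketch of how a per-100,000 figure could be derived from the columns present in the final frame. `deaths_per_100k` and the column name `cardio_deaths_per_100k` are hypothetical, and the numbers are toy values.

```python
import pandas as pd


def deaths_per_100k(df, deaths_col="CardiovascularDeaths", population_col="population"):
    """Deaths per 100,000 inhabitants; NaN where the population is missing."""
    return df[deaths_col] * 100_000 / df[population_col]


# Toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "CardiovascularDeaths": [50, 20],
    "population": [1_000_000, None],
})

toy["cardio_deaths_per_100k"] = deaths_per_100k(toy)
print(toy)
```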

## Saving the Merged Dataset

After building and sorting the merged dataset, we save the final dataframe as a CSV file for future reference and use:

In [33]:
merged_df.to_csv('../data/owid.csv', index=False)
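
A quick way to confirm that a file written like this can be read back as expected is a round-trip check. The sketch below writes a toy frame to a temporary directory rather than to `../data/owid.csv`, so it can be run without touching the project data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame; a single row shaped like the final dataset's leading columns.
toy = pd.DataFrame({"Country": ["Zimbabwe"], "Year": [2018], "gdp_per_capita": [3007.237]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "owid.csv"
    toy.to_csv(path, index=False)   # index=False avoids an extra unnamed column
    restored = pd.read_csv(path)

same_shape = restored.shape == toy.shape
same_columns = restored.columns.tolist() == toy.columns.tolist()
print(same_shape, same_columns)
```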

With the merged dataset saved, we can proceed to the exploratory data analysis of the data extracted from the API. That analysis examines the relationships between the health, environmental, and economic factors collected here and cardiovascular deaths. The next steps are outlined in [005_EDA_API.ipynb](./EDA/005_EDA_API.ipynb), where we explore the trends and patterns in the data more thoroughly.
