# Table of Contents

**I. [Introduction](#introduction)** <br>
&nbsp;1.1 Background Information and Context <br>
&nbsp;1.2 Data Sources <br>
&nbsp;1.3 Research Questions <br>
&nbsp;1.4 Summary of Findings <br>

**II. [Methods](#methods)** <br>
&nbsp;2.1 Data Description <br>
&nbsp;2.2 Variables and Dataframes <br>
&nbsp;2.3 Data Analysis for Hypothesis 1 <br>
&nbsp;2.4 Data Analysis for Hypothesis 2

**III. [Results](#results)** <br>
&nbsp;3.1 Evaluation of Significance <br>
&nbsp;3.2 Conclusion <br>

**IV. [Discussion](#discussion)** <br>
&nbsp;4.1 Limitations

# I. Introduction <a name="introduction"></a>

### 1.1 Background Information and Context

Many studies have already linked lung cancer to pollution and living environment. We also know, from lived experience and a lifetime of news consumption, that in the U.S. surviving any health complication depends on which resources exist and how accessible they are. What is less well understood is how societal factors correlate with health conditions globally. In this study, we examine whether lung cancer survival rates correlate with specific geographic and sociopolitical variables, and whether any conclusions can be drawn from those correlations.

### 1.2 Data Sources
- Lung Cancer Survival Rates : https://gco.iarc.fr/overtime/en/dataviz/trends
- Proportion of Population Pushed Below 3.65 Poverty Line by Out-of-Pocket Health Care Expenditure : https://data.worldbank.org/indicator/SH.UHC.NOP2.ZS
- Forest Area Coverage : https://data.worldbank.org/indicator/AG.LND.FRST.ZS 
- Current Health Expenditure (% of GDP) : https://data.worldbank.org/indicator/SH.XPD.CHEX.GD.ZS
- Domestic General Government Health Expenditure per Capita (Current US Dollars) : https://data.worldbank.org/indicator/SH.XPD.GHED.PC.CD
- Out-of-Pocket Expenditure per Capita (Current US Dollars) : https://data.worldbank.org/indicator/SH.XPD.OOPC.PC.CD
- Physicians (Per 1000 People) : https://data.worldbank.org/indicator/SH.MED.PHYS.ZS
- Incidence of Tuberculosis (Per 100,000 People) : https://data.worldbank.org/indicator/SH.TBS.INCD
- Urban Population (% of Total Population) : https://data.worldbank.org/indicator/SP.URB.TOTL.IN.ZS


### 1.3 Research Questions 

Our general research question is: do geographical and sociopolitical factors impact lung cancer survival rates? If so, which factors, and to what degree? <br> <br>
We begin to answer this by exploring the following statements: 
1. The effects of the factors on lung cancer survival rates will vary by country. For example, developed countries will have a higher cancer survival rate if they have a greater forest area, while developing countries will have a lower cancer survival rate for a greater forest area.
2. Some factors will mainly affect long-term survival and others will mainly affect short-term survival. For example, forest area could affect the 5-year survival rate more than the 1-year survival rate: forests won’t save someone who is in critical condition, but they may help patients maintain their condition over time.

### 1.4 Summary of Findings
TODO

# II. Methods <a name="methods"></a>

## 2.1 Data Description

Individually, we selected 7 factors that we thought were likely to impact the lung cancer survival rate, one data set per factor, for a total of 7 data sets. These data sets are all sourced from the World Bank site. The 8th data set contains lung cancer survival rates by country and by year. 

**What are the observations (rows) and the attributes (columns)?:** 
Rows of the X DataFrame are indexed by 'country' (the name of the relevant country) and further contextualized by the attribute 'year' (the year of the observation, between 2000 and 2019). Most country-year combinations have a value for each of the following attributes: 
- c_dollar2_poverty : Proportion of Population Pushed Below 3 dollar and 65 cents Poverty Line by Out-of-Pocket Health Care Expenditure
- c_forest_area : The Percentage of Land Area covered by Forest
- c_health_expenditure : The Percentage of a Country's GDP that goes towards Health Expenditures
- c_out_of_pocket : Out-of-Pocket Expenditure per Capita (Current US Dollars)
- c_physician : Physicians (Per 1000 People)
- c_tuberculosis : Incidence of Tuberculosis (Per 100,000 People)
- c_urban_pop : The Percentage of the Total Population living in Urban Areas 

The observations of the y DataFrame are identified by country and by year of observation. The DataFrame contains the age-standardized rate (ASR) of mortality for lung cancer.

**Why was the dataset created? Who funded it?:** The data is all sourced from The World Bank. "The World Bank Group is one of the world’s largest sources of funding and knowledge for developing countries. Its five institutions share a commitment to reducing poverty, increasing shared prosperity, and promoting sustainable development." The World Bank collects the geographic and sociopolitical data used here, to better inform where and how funding is distributed on an international scale. https://www.worldbank.org/en/who-we-are

**What processes might have influenced what data was observed and recorded and what was not?** 
Though logistically difficult and expensive, the collection of world development indicator data is an important part of any country's ability to recognize the needs of its communities and areas of opportunity for national improvement. For example, with our world development indicators, countries with high rates of mortality from cancer may question the flaws in their infrastructure that lead to such negative outcomes. The influences that affect the recording of data can be nefarious, however. Countries may distort data to mislead international bodies, allies, and rivals; high-impact indicators such as financial rates are often distorted to encourage foreign investment.

**What preprocessing was done, and how did the data come to be in the form that you are using?** We downloaded the CSV files from the World Bank. Each data set came as a pack of three CSV files, two of which consisted only of metadata. We discarded the two metadata files. From the remaining CSV file, which held all of the relevant data, we deleted the first 4 lines.
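The preprocessing described above can also be done at load time instead of by editing the file by hand. The following is a minimal sketch, assuming the downloaded file has four preamble lines followed by a wide table with one column per year; the inline CSV, the country names, and the values are made up for illustration and are not taken from our data.

```python
import io
import pandas as pd

# Made-up miniature of a downloaded indicator CSV: four preamble lines,
# then a wide table with one column per year. Values are illustrative only.
raw_csv = '''"Data Source","World Development Indicators"

"Last Updated Date","2023-01-01"

"Country Name","Country Code","Indicator Name","Indicator Code","2000","2001"
"Country A","AAA","Forest area (% of land area)","AG.LND.FRST.ZS","30.1","30.0"
"Country B","BBB","Forest area (% of land area)","AG.LND.FRST.ZS","12.5",""
'''

# Skip the four preamble lines instead of deleting them from the file
wide = pd.read_csv(io.StringIO(raw_csv), skiprows=4)

# Reshape from one column per year to one row per (country, year)
long_df = wide.melt(
    id_vars=["Country Name"],
    value_vars=["2000", "2001"],
    var_name="year",
    value_name="c_forest_area",
).rename(columns={"Country Name": "country"})
long_df["year"] = long_df["year"].astype(int)

print(long_df)
```

Reading with `skiprows` leaves the raw download untouched, so the preprocessing can be rerun from the original files.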

**If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?** 
The source of our data is likely government census data. As such, citizens recognize that, as part of a country's population, they contribute to that country's bank of data about its citizens. In the same way that taxes are filed and birth certificates are issued, as citizens we expect the government to keep a basic profile on all of us, whether that is simply affirming existence or logging cancer mortality and household income.

**Where can your raw source data be found, if applicable?** Raw source data can be found where we got our CSV files, at https://data.worldbank.org/indicator . It can also be found in our GitHub repository, in the folder called data : https://github.coecis.cornell.edu/mk932/INFO2950_Project_Team.git .

We are hoping to explore possibilities of causation further down the road, but for phase 2 we will be looking for correlations between the factors and the survival rate.

## 2.2 Variables and DataFrames

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import duckdb
from statsmodels.stats.diagnostic import het_white
import statsmodels.api as sm



In [3]:
X = pd.read_csv('./data/dataframes/X')
X_df = X.set_index('country')
y_f = pd.read_csv('./data/dataframes/y_f')
y_m = pd.read_csv('./data/dataframes/y_m')

# Years covered by the data, as strings (2000 through 2019, no duplicates)
years = [str(yr) for yr in range(2000, 2020)]

countries = X['country'].unique()

# One DataFrame per year. Compare as strings so the filter works whether
# 'year' was read in as an integer or as a string column.
dict_of_year_dfs = {}
for yr in years:
    dict_of_year_dfs[yr] = X[X["year"].astype(str) == yr]

dict_of_country_dfs = {}
for c in countries:
    dict_of_country_dfs[c] = X[X['country'] == c]

y_mortality_stat = pd.read_csv('./data/dataframes/y_stat')

country_means_df = pd.read_csv('./data/dataframes/country_means')
country_medians_df = pd.read_csv('./data/dataframes/country_medians')
country_variances_df = pd.read_csv('./data/dataframes/country_variances')
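The per-year and per-country dictionaries built in the cell above can also be produced with `groupby`. This is a minimal sketch on a toy stand-in for `X`; the frame `toy_X` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for X; the values are made up for illustration only.
toy_X = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "c_forest_area": [10.0, 10.5, 40.0, 39.5],
})

# groupby yields (key, sub-DataFrame) pairs, one per distinct value
year_dfs = {yr: grp for yr, grp in toy_X.groupby("year")}
country_dfs = {c: grp for c, grp in toy_X.groupby("country")}

print(sorted(year_dfs))
print(country_dfs["A"])
```

One difference from the loop version: the keys take the type of the column itself, so the year keys here are integers, not strings.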

## 2.3 Data Analysis for Hypothesis 1

### Summary Statistics 

Here we look at the correlation matrix to check for strongly correlated input variables, which could act as confounders.

In [4]:
# Melt y_mortality to concatenate with X_df
# y_mortality_melt = pd.melt(y_mortality,id_vars='country',value_vars=years, \
#                           var_name='year',value_name='mortality')
# y_mortality_melt['year'] = y_mortality_melt['year'].astype(int)

# Create corr_Xy with X_df and y_mortality_melt merged for correlation matrix
Xy_df = X_df.merge(y_f,on=['country','year'])
Xy_df = Xy_df.dropna()

# Display the correlation matrix
Xy_df.drop(columns='country').corr()

| | year | c_dollar2_poverty | c_forest_area | c_health_expenditure | c_out_of_pocket | c_physician | c_tuberculosis | c_urban_pop | ASR (female) |
|---|---|---|---|---|---|---|---|---|---|
| year | 1.0 | -0.073478 | 0.558065 | 0.14687 | 0.537079 | 0.354214 | 0.316367 | -0.157359 | 0.111512 |
| c_dollar2_poverty | -0.073478 | 1.0 | 0.151159 | -0.255773 | -0.127196 | 0.34759 | 0.514823 | -0.398811 | -0.104906 |
| c_forest_area | 0.558065 | 0.151159 | 1.0 | -0.128064 | 0.887369 | 0.679421 | 0.685022 | -0.367116 | 0.363734 |
| c_health_expenditure | 0.14687 | -0.255773 | -0.128064 | 1.0 | -0.119837 | -0.085814 | -0.078611 | 0.113581 | -0.40834 |
| c_out_of_pocket | 0.537079 | -0.127196 | 0.887369 | -0.119837 | 1.0 | 0.325402 | 0.272037 | -0.348362 | 0.473556 |
| c_physician | 0.354214 | 0.34759 | 0.679421 | -0.085814 | 0.325402 | 1.0 | 0.904064 | -0.187403 | 0.02255 |
| c_tuberculosis | 0.316367 | 0.514823 | 0.685022 | -0.078611 | 0.272037 | 0.904064 | 1.0 | -0.215587 | 0.012881 |
| c_urban_pop | -0.157359 | -0.398811 | -0.367116 | 0.113581 | -0.348362 | -0.187403 | -0.215587 | 1.0 | -0.275399 |
| ASR (female) | 0.111512 | -0.104906 | 0.363734 | -0.40834 | 0.473556 | 0.02255 | 0.012881 | -0.275399 | 1.0 |
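Reading a 9-by-9 matrix by eye is error-prone, so it can help to list only the strongly correlated pairs. In the matrix above, two pairs exceed 0.8 in absolute value: c_forest_area with c_out_of_pocket (0.887) and c_physician with c_tuberculosis (0.904). The sketch below shows one way to extract such pairs; the helper `correlated_pairs`, the 0.8 threshold, and the toy data are our own illustrative assumptions, not part of the analysis we ran.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return column pairs whose absolute correlation is at least `threshold`."""
    corr = df.corr()
    # Keep the strict upper triangle so each pair appears once, without the diagonal
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs.abs() >= threshold]

# Toy data (made up): x2 is x1 plus a little noise, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
toy = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=200),
    "x3": rng.normal(size=200),
})

flagged = correlated_pairs(toy)
print(flagged)
```

Pairs flagged this way are candidates for dropping or combining before fitting a linear model, since strongly correlated inputs make individual coefficients hard to interpret.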


### Grouping Countries

We grouped countries into four groups based on their Human Development Index (HDI) as follows: 1 being "Low", 2 being "Medium", 3 being "High", and 4 being "Very High".

In [6]:
# Get dataset of countries' Human Development Index
hdi_countries = pd.read_csv('./data/dataframes/hdi')

# Merge HDI dataframe with X_df on country
X_hdi = X_df.merge(hdi_countries,on='country')

# Group into countries with different hdi values
X_hdi_1 = X_hdi[X_hdi['hdicode']==1.0] 
X_hdi_2 = X_hdi[X_hdi['hdicode']==2.0]
X_hdi_3 = X_hdi[X_hdi['hdicode']==3.0]
X_hdi_4 = X_hdi[X_hdi['hdicode']==4.0]

# # Merge dataframe y_mortality with each X_hdi dataframes to create separate 
# # dataframes for countries with different hdi values
# Xy_hdi_1 = X_hdi_1.merge(y_mortality_melt,on=['country','year']).dropna()
# Xy_hdi_2 = X_hdi_2.merge(y_mortality_melt,on=['country','year']).dropna()
# Xy_hdi_3 = X_hdi_3.merge(y_mortality_melt,on=['country','year']).dropna()
# Xy_hdi_4 = X_hdi_4.merge(y_mortality_melt,on=['country','year']).dropna()

Xy_hdi_all = X_hdi.merge(y_f,on=['country','year']).merge(y_m,on=['country','year'])
Xy_hdi_all.to_csv('')

### Model Training

Next, we trained a separate model for each HDI group so that we could compare the weight of each input across groups. First, we split each group's data into a training set and a test set.

In [5]:
# Split train test for each HDI category.
# NOTE: Xy_hdi_1 ... Xy_hdi_4 (with a 'mortality' column) come from the merge
# that is commented out in the grouping cell above; restore it before running.
drop_cols = ['mortality','country','hdicode','year','Unnamed: 0']
X_hdi1_train, X_hdi1_test, y_hdi1_train, y_hdi1_test = train_test_split(
    Xy_hdi_1.drop(columns=drop_cols), Xy_hdi_1['mortality'],
    test_size=0.2, random_state=2950)
X_hdi2_train, X_hdi2_test, y_hdi2_train, y_hdi2_test = train_test_split(
    Xy_hdi_2.drop(columns=drop_cols), Xy_hdi_2['mortality'],
    test_size=0.2, random_state=2950)
X_hdi3_train, X_hdi3_test, y_hdi3_train, y_hdi3_test = train_test_split(
    Xy_hdi_3.drop(columns=drop_cols), Xy_hdi_3['mortality'],
    test_size=0.2, random_state=2950)
X_hdi4_train, X_hdi4_test, y_hdi4_train, y_hdi4_test = train_test_split(
    Xy_hdi_4.drop(columns=drop_cols), Xy_hdi_4['mortality'],
    test_size=0.2, random_state=2950)

In [6]:
# Fit a multiple linear regression model for each HDI category
model_hdi1 = LinearRegression()
model_hdi2 = LinearRegression()
model_hdi3 = LinearRegression()
model_hdi4 = LinearRegression()

model_hdi1.fit(X_hdi1_train,y_hdi1_train)
model_hdi2.fit(X_hdi2_train,y_hdi2_train)
model_hdi3.fit(X_hdi3_train,y_hdi3_train)
model_hdi4.fit(X_hdi4_train,y_hdi4_train)

# Print MSE of each model
y_hdi1_pred = model_hdi1.predict(X_hdi1_test)
mse_hdi1 = mean_squared_error(y_hdi1_test, y_hdi1_pred)
print(f'HDI 1 Mean Squared Error: {mse_hdi1}')

y_hdi2_pred = model_hdi2.predict(X_hdi2_test)
mse_hdi2 = mean_squared_error(y_hdi2_test, y_hdi2_pred)
print(f'HDI 2 Mean Squared Error: {mse_hdi2}')

y_hdi3_pred = model_hdi3.predict(X_hdi3_test)
mse_hdi3 = mean_squared_error(y_hdi3_test, y_hdi3_pred)
print(f'HDI 3 Mean Squared Error: {mse_hdi3}')

y_hdi4_pred = model_hdi4.predict(X_hdi4_test)
mse_hdi4 = mean_squared_error(y_hdi4_test, y_hdi4_pred)
print(f'HDI 4 Mean Squared Error: {mse_hdi4}')

HDI 1 Mean Squared Error: 16.949771786522657
HDI 2 Mean Squared Error: 46.14491654415083
HDI 3 Mean Squared Error: 47.453855043340965
HDI 4 Mean Squared Error: 39.07412663779996


In [7]:
# Print coefficients of each variable for each model
coeff_hdi1 = model_hdi1.coef_
coeff_hdi2 = model_hdi2.coef_
coeff_hdi3 = model_hdi3.coef_
coeff_hdi4 = model_hdi4.coef_
print(f'HDI 1 Coefficient: \n{coeff_hdi1}')
print(f'HDI 2 Coefficient: \n{coeff_hdi2}')
print(f'HDI 3 Coefficient: \n{coeff_hdi3}')
print(f'HDI 4 Coefficient: \n{coeff_hdi4}')

HDI 1 Coefficient: 
[-0.07392172  0.12771702 -0.05161373 -0.18896928  0.94565316 -1.07900793
  0.01607057]
HDI 2 Coefficient: 
[-0.08988428  0.08733333 -0.00322938 -0.04053926  0.03084154 -0.17943786
  0.00972751]
HDI 3 Coefficient: 
[-0.08400009  0.08655574  0.00451196 -0.10076958  0.02652856 -0.11258985
  0.01490172]
HDI 4 Coefficient: 
[-0.0068523  -0.01690743 -0.11933517  0.01250459  0.0575888  -0.04337106
  0.01381115]


### Compare the Coefficients

Due to time limits, we could not perform hypothesis tests, but we could still observe differences between the coefficients fitted for each HDI group.
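The printed coefficient arrays are easier to compare once they are labeled by feature. The sketch below tabulates the coefficients printed above and lists the features whose sign differs between HDI groups. It assumes the coefficient order matches the attribute order listed in section 2.1 (the columns left after the drop); that ordering is an assumption, and the name `sign_flips` is ours.

```python
import pandas as pd

# Assumed feature order: the attribute order from section 2.1
features = [
    "c_dollar2_poverty", "c_forest_area", "c_health_expenditure",
    "c_out_of_pocket", "c_physician", "c_tuberculosis", "c_urban_pop",
]

# Coefficients copied from the printed output above
coefs = pd.DataFrame({
    "HDI 1": [-0.07392172, 0.12771702, -0.05161373, -0.18896928,
              0.94565316, -1.07900793, 0.01607057],
    "HDI 2": [-0.08988428, 0.08733333, -0.00322938, -0.04053926,
              0.03084154, -0.17943786, 0.00972751],
    "HDI 3": [-0.08400009, 0.08655574, 0.00451196, -0.10076958,
              0.02652856, -0.11258985, 0.01490172],
    "HDI 4": [-0.0068523, -0.01690743, -0.11933517, 0.01250459,
              0.0575888, -0.04337106, 0.01381115],
}, index=features)

# Features whose coefficient is negative in some groups and positive in others
sign_flips = coefs[(coefs.min(axis=1) < 0) & (coefs.max(axis=1) > 0)]

print(coefs.round(3))
print(sign_flips.index.tolist())
```

Because the inputs are on different scales, the raw magnitudes are not directly comparable across features; comparing signs across groups, as here, avoids that problem.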

## 2.4 Data Analysis for Hypothesis 2

We were also unable to complete this part.

# III. Results  <a name="results"></a>

The magnitude and direction of the coefficients differed between countries with different Human Development Index values. For example, in countries with 'Very High' HDI, forest area had a slightly negative relationship with lung cancer mortality rate (a coefficient of about -0.017), while in countries with lower HDI, forest area had a slightly positive relationship (about 0.087 to 0.128). One interpretation is that relatively developed countries benefited from forest area in terms of lower lung cancer mortality rates, while less developed countries did not. 

# IV. Discussion  <a name="discussion"></a>

## 4.1 Limitations

Overall Dataset:

- Since our data is mainly sourced from a single website (The World Bank), most of our data sets share the same limitations. For example, all of our DataFrames lack data for years before 2000, and they vary in the number of NaNs within the covered years. The amount of documentation varies by country: the United States and Norway have almost complete information, while countries such as Angola and Afghanistan lack information for some years. How we deal with these NaN values will affect which subset of years we analyze, and which countries are unsuitable for exploratory data analysis due to a lack of data.
- We are assuming different indicators affect different countries to the same degree. For example, depending on a country's location, forest area might not have as much of an impact. We selected the forest area DataFrame because forests impact levels of air pollution, which in turn affects the incidence and survival rates of lung cancer. However, the impact of forests on air pollution depends on geographical factors such as the altitude, climate, and biomes of a country. Additionally, we only look at the forest area within a country's borders. This most likely matters little for large countries such as the U.S. or Canada, but smaller, closely packed countries, such as those in Europe, could be greatly impacted.
- The data sets we included are not immediately related to lung cancer. For example, instead of the data set 'tuberculosis incidence per 100,000 people', a data set of 'incidence rates of tuberculosis patients who later developed lung cancer' would be less removed. The more directly relevant a data set is, the more likely it is that a high correlation actually means something. In short, our data sets limit how far we can take interpretations of our analysis.
- Like any data project, ours has underlying variables that are not immediately obvious. One potential underlying variable not considered in our data is age group. We cannot know whether someone who passed away at 80 died of lung cancer, other medical issues, old age, or a combination. Additionally, differences in access to medical care by age and race/ethnicity are hard to gauge properly. Lung cancer survival rates only cover people who had some form of medical access and thus could be documented. In countries like the U.S., where medical care is privatized and many avoid hospitals for fear of the bills they would incur, a significant population goes unrecognized by the data sets we pulled. 

Hypothesis 1:

- The MSE was large, so the coefficients we derived may not be reliable indicators of the relationship between the variables. 
- We could have transformed the variables to better fit the data and checked for heteroskedasticity, but could not due to time limits. Different models, such as neural networks, might have performed better.
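To put the reported MSE values in context, it can help to take their square root: the RMSE is in the same units as the mortality rate being predicted, so it reads as a typical prediction error. The sketch below only converts the MSE values printed in section 2.3; whether the resulting errors count as large depends on the spread of the mortality rates in each group, which is not computed here.

```python
import math

# MSE values copied from the printed output in section 2.3
reported_mse = {
    "HDI 1": 16.949771786522657,
    "HDI 2": 46.14491654415083,
    "HDI 3": 47.453855043340965,
    "HDI 4": 39.07412663779996,
}

# RMSE is the square root of MSE, in the units of the target variable
rmse = {group: math.sqrt(mse) for group, mse in reported_mse.items()}

for group, value in rmse.items():
    print(f"{group}: RMSE = {value:.2f}")
```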