# COGS 108 - Data Checkpoint

# Names

- Alan Wang
- An Huynh
- Hana Vaid
- Seung Huh
- Shreya Vanaki

<a id='research_question'></a>
# Research Question

According to the dataset from Gallup's World Happiness Report, which of the six measurements (GDP per capita, healthy life expectancy, social support, freedom to make life choices, generosity, corruption perception) most strongly predicts happiness across the countries surveyed? Specifically:
- Which indicator shows the largest difference between the happiest 15% of countries and the least happy 15% of countries, as ranked by life satisfaction?
- Which feature overall is the most strongly correlated with life satisfaction up to 2021?
- Which feature overall is the most strongly correlated with positive affect up to 2020?
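
One way the two correlation sub-questions might be approached is sketched below. It assumes Pearson correlation as the measure (the questions above do not fix one) and the cleaned column names introduced later in this notebook. The helper `rank_predictors` and the toy values are hypothetical and serve only to illustrate the idea.

```python
import pandas as pd

def rank_predictors(df, target, features):
    """Return each feature's Pearson correlation with `target`,
    ordered from strongest to weakest absolute correlation."""
    corr = df[features].corrwith(df[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy values for illustration only; they are not taken from the report.
toy = pd.DataFrame({
    "ladder_score": [7.0, 6.0, 5.0, 4.0],
    "GDP": [11.0, 10.0, 9.0, 8.0],
    "generosity": [0.1, -0.1, 0.2, 0.0],
})

ranking = rank_predictors(toy, "ladder_score", ["GDP", "generosity"])
print(ranking)
```

Ranking by absolute value keeps strongly negative predictors (corruption perception would plausibly be one) near the top, while the returned values preserve the sign.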

# Dataset(s)

- Dataset Name: World Happiness Report
- Link to the dataset: 
    - https://happiness-report.s3.amazonaws.com/2021/DataForFigure2.1WHR2021C2.xls
    - https://happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls
- Number of observations:
    - 149 countries in the 2021 dataset
    - 166 countries in the overall dataset (1949 observations in total)


# Setup

In [None]:
# Import necessary packages for data frames and etc.
import pandas as pd

In [1]:
# Read in the csv files
df1 = pd.read_csv('./DataForFigure2.1WHR2021C2.csv')
df2 = pd.read_csv('./DataPanelWHR2021C2.csv')

Unnamed: 0,Country name,Regional indicator,Ladder score,Standard error of ladder score,upperwhisker,lowerwhisker,Logged GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption,Ladder score in Dystopia,Explained by: Log GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption,Dystopia + residual
0,Finland,Western Europe,7.842,0.032,7.904,7.780,10.775,0.954,72.000,0.949,-0.098,0.186,2.43,1.446,1.106,0.741,0.691,0.124,0.481,3.253
1,Denmark,Western Europe,7.620,0.035,7.687,7.552,10.933,0.954,72.700,0.946,0.030,0.179,2.43,1.502,1.108,0.763,0.686,0.208,0.485,2.868
2,Switzerland,Western Europe,7.571,0.036,7.643,7.500,11.117,0.942,74.400,0.919,0.025,0.292,2.43,1.566,1.079,0.816,0.653,0.204,0.413,2.839
3,Iceland,Western Europe,7.554,0.059,7.670,7.438,10.878,0.983,73.000,0.955,0.160,0.673,2.43,1.482,1.172,0.772,0.698,0.293,0.170,2.967
4,Netherlands,Western Europe,7.464,0.027,7.518,7.410,10.932,0.942,72.400,0.913,0.175,0.338,2.43,1.501,1.079,0.753,0.647,0.302,0.384,2.798
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
144,Lesotho,Sub-Saharan Africa,3.512,0.120,3.748,3.276,7.926,0.787,48.700,0.715,-0.131,0.915,2.43,0.451,0.731,0.007,0.405,0.103,0.015,1.800
145,Botswana,Sub-Saharan Africa,3.467,0.074,3.611,3.322,9.782,0.784,59.269,0.824,-0.246,0.801,2.43,1.099,0.724,0.340,0.539,0.027,0.088,0.648
146,Rwanda,Sub-Saharan Africa,3.415,0.068,3.548,3.282,7.676,0.552,61.400,0.897,0.061,0.167,2.43,0.364,0.202,0.407,0.627,0.227,0.493,1.095
147,Zimbabwe,Sub-Saharan Africa,3.145,0.058,3.259,3.030,7.943,0.750,56.201,0.677,-0.047,0.821,2.43,0.457,0.649,0.243,0.359,0.157,0.075,1.205


# Data Cleaning

For our first and second questions, we want the World Happiness Report data to cover all years up to 2021. Therefore, we need to combine the 2021 data and the data from earlier years into a single data frame.

We select only the columns we need for data analysis, rename them consistently, and concatenate the two data frames. Because these questions focus on life satisfaction, we drop the columns for positive affect and negative affect.

For our third question, the 2021 data does not include positive affect or negative affect, so we use the data from earlier years and leave it as is.

In [3]:
# Cleaning up df1: Picking necessary columns, renaming the columns, inserting a column for proper concatenation
df1 = df1[['Country name', 'Ladder score', 'Logged GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']]
df1.columns = ['country_name', 'ladder_score', 'GDP', 'social_support', 'life_expectancy', 'choice_freedom', 'generosity', 'corr_perception']
df1.insert(loc=1, column='year', value=2021)

# Cleaning up df2: Renaming the columns
df2.columns = ['country_name', 'year', 'ladder_score', 'GDP', 'social_support', 'life_expectancy', 'choice_freedom', 'generosity', 'corr_perception', 'pos_affect', 'neg_affect']

In [4]:
# Concatenating the two data frames: First concatenate, sort by name and year, pop unnecessary columns, reset the indices
df1 = pd.concat([df1, df2])
df1 = df1.sort_values(['country_name', 'year'], ascending = True)
df1.pop('pos_affect')
df1.pop('neg_affect')
# reset_index returns a new data frame, so assign it back before displaying
df1 = df1.reset_index(drop=True)
df1

Unnamed: 0,country_name,year,ladder_score,GDP,social_support,life_expectancy,choice_freedom,generosity,corr_perception
0,Afghanistan,2008,3.724,7.370,0.451,50.800,0.718,0.168,0.882
1,Afghanistan,2009,4.402,7.540,0.552,51.200,0.679,0.190,0.850
2,Afghanistan,2010,4.758,7.647,0.539,51.600,0.600,0.121,0.707
3,Afghanistan,2011,3.832,7.620,0.521,51.920,0.496,0.162,0.731
4,Afghanistan,2012,3.783,7.705,0.521,52.240,0.531,0.236,0.776
...,...,...,...,...,...,...,...,...,...
2093,Zimbabwe,2017,3.638,8.016,0.754,55.000,0.753,-0.098,0.751
2094,Zimbabwe,2018,3.616,8.049,0.775,55.600,0.763,-0.068,0.844
2095,Zimbabwe,2019,2.694,7.950,0.759,56.200,0.632,-0.064,0.831
2096,Zimbabwe,2020,3.160,7.829,0.717,56.800,0.643,-0.009,0.789


This dataset is now in a suitable format for our purposes: df1 contains the life satisfaction data for countries up to 2021, and df2 contains the positive affect data for countries up to 2020.
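
As a sanity check on the combining step, a minimal sketch along the following lines could confirm that no rows were lost and that no country-year pair appears twice. The helper `combine_reports` is hypothetical, and the toy frames only mimic the structure of the cleaned data: the ladder scores are copied from the output shown above, while the affect values are made-up placeholders.

```python
import pandas as pd

def combine_reports(latest, panel, drop_cols=("pos_affect", "neg_affect")):
    """Concatenate two cleaned frames, drop the unused columns,
    sort by country and year, and run basic consistency checks."""
    combined = pd.concat([latest, panel], ignore_index=True)
    combined = combined.drop(columns=[c for c in drop_cols if c in combined.columns])
    combined = combined.sort_values(["country_name", "year"]).reset_index(drop=True)

    # Every input row should survive the concatenation.
    if len(combined) != len(latest) + len(panel):
        raise ValueError("row count changed during concatenation")
    # Each (country, year) pair should appear only once.
    if combined.duplicated(subset=["country_name", "year"]).any():
        raise ValueError("duplicate country-year rows found")
    return combined

# Toy frames; the affect values below are placeholders, not real data.
latest = pd.DataFrame({
    "country_name": ["Finland", "Zimbabwe"],
    "year": [2021, 2021],
    "ladder_score": [7.842, 3.145],
})
panel = pd.DataFrame({
    "country_name": ["Zimbabwe", "Afghanistan"],
    "year": [2020, 2008],
    "ladder_score": [3.160, 3.724],
    "pos_affect": [0.5, 0.5],
    "neg_affect": [0.3, 0.3],
})

combined = combine_reports(latest, panel)
print(combined)
```

Raising an error rather than silently continuing makes a mismatch visible as soon as the cell runs, for example if a country is spelled differently in the two files.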