# COGS 108 - Data Checkpoint

# Names

- Alan Wang
- An Huynh
- Hana Vaid
- Seung Huh
- Shreya Vanaki

<a id='research_question'></a>
# Research Question

According to Gallup’s World Happiness report’s dataset, which of the six measurements (GDP per capita, healthy life expectancy, social support, freedom to make life choices, generosity, corruption perception) most strongly predicts happiness across the countries surveyed ?:
- What indicator displays the largest difference in life satisfaction between the top 15% of the countries more happy than the bottom 15% of the countries, in terms of the indicators of happiness?
- Which feature overall is the most strongly correlated with life satisfaction up to 2021?
- Which feature overall is the most strongly correlated with positive affect up to 2020?

# Dataset(s)

Based on the research question we have selected, we are planning to use the dataset from Gallup’s World Happiness Report, which contains data from the Gallup World Poll from 2005-2021.

- Dataset Name: World Happiness Report for 2021
- Link to the dataset:
    - https://happiness-report.s3.amazonaws.com/2021/DataForFigure2.1WHR2021C2.xls
- Number of observations:
    - 149 countries
- Description:
    - Researchers measured the GDP of each country and took data from the Gallup World Poll. Scientists carried out this poll by asking people from each country basic questions to gather opinion on several different categories. These questions included both “yes or no” and rating questions to express their thoughts on the matter (Yes being a 1, No being a 0; rating from 1-10). This dataset contains variables measuring features of countries such as social support, healthy life expectancy, freedom to make life choices, generosity and perceptions of corruption. Finally, it contains a variable measuring average rating of life satisfaction for each country.


- Dataset Name: World Happiness Reports prior to 2021
- Link to the dataset:
    - https://happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls
- Number of observations:
    - 166 countries (1949 observations overall due to multiple years)
- Description:
    - In addition to all the features of the previous dataset, all data up to 2020 also contains a variable measuring positive and negative affects. Positive affect is a measurement of one's laughter and joy while negative affect is a measurement of one's worry, anger, and sadness.

Dataset for reports prior to 2021 will be merged with the 2021 dataset. However, due to the lack of certain features within the 2021 dataset, we will omit the data that does not overlap with each other.

# Setup

In [1]:
# Import necessary packages for data frames and etc.
import pandas as pd

In [2]:
# Read in the csv files
df1 = pd.read_csv('./DataForFigure2.1WHR2021C2.csv')
df2 = pd.read_csv('./DataPanelWHR2021C2.csv')

# Data Cleaning

For our first and second question, we want the Gallup World Health Report to include all years upto 2021. Therefore, we need to combine both the data from 2021 and the data from earlier years into one.

After reading in both our csv datasets-df1 and df2- we proceeded to first clean df1 by selecting and renaming the columns and or categories we deemed essential to our research question for better formatting. To clean up df2, we renamed the columns we needed here as well to match them with the columns of df1. Moving on, both data frames were then concatenated, or linked and sorted by alphabetical order by country. We then used the dataframe.pop() method to remove columns that wouldn’t be relevant. Lastly, after manipulating and filtering these dataframes, we proceeded to reset the indices to make our data sequential. Our data is now in a usable format.

For our third question, because the 2021's data does not include the data for positive affect and negative affect, we use the data from earlier years and leave as is.

Apart from importing the necessary library packages and dataset and taking steps to clean and manipulate our datasets, there were no other pre-processing steps done.

In [3]:
# Cleaning up df1: Picking necessary columns, renaming the columns, inserting a column for proper concatenation
df1 = df1[['Country name', 'Ladder score', 'Logged GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']]
df1.columns = ['country_name', 'ladder_score', 'GDP', 'social_support', 'life_expectancy', 'choice_freedom', 'generosity', 'corr_perception']
df1.insert(loc=1, column='year', value=2021)

# Cleaning up df2: Renaming the columns
df2.columns = ['country_name', 'year', 'ladder_score', 'GDP', 'social_support', 'life_expectancy', 'choice_freedom', 'generosity', 'corr_perception', 'pos_affect', 'neg_affect']

In [4]:
# Concatenating the two data frames: First concatenate, sort by name and year, pop unnecessary columns, reset the indices
df1 = pd.concat([df1, df2])
df1 = df1.sort_values(['country_name', 'year'], ascending = True)
df1.pop('pos_affect')
df1.pop('neg_affect')
df1.reset_index(drop=True)

Unnamed: 0,country_name,year,ladder_score,GDP,social_support,life_expectancy,choice_freedom,generosity,corr_perception
0,Afghanistan,2008,3.724,7.370,0.451,50.800,0.718,0.168,0.882
1,Afghanistan,2009,4.402,7.540,0.552,51.200,0.679,0.190,0.850
2,Afghanistan,2010,4.758,7.647,0.539,51.600,0.600,0.121,0.707
3,Afghanistan,2011,3.832,7.620,0.521,51.920,0.496,0.162,0.731
4,Afghanistan,2012,3.783,7.705,0.521,52.240,0.531,0.236,0.776
...,...,...,...,...,...,...,...,...,...
2093,Zimbabwe,2017,3.638,8.016,0.754,55.000,0.753,-0.098,0.751
2094,Zimbabwe,2018,3.616,8.049,0.775,55.600,0.763,-0.068,0.844
2095,Zimbabwe,2019,2.694,7.950,0.759,56.200,0.632,-0.064,0.831
2096,Zimbabwe,2020,3.160,7.829,0.717,56.800,0.643,-0.009,0.789


We now have two datasets that are ready for analysis: df1 contains the data on life satisfaction from countries up to 2021, df2 contains the data on positive affect from countries up to 2020

In [5]:
df1

Unnamed: 0,country_name,year,ladder_score,GDP,social_support,life_expectancy,choice_freedom,generosity,corr_perception
0,Afghanistan,2008,3.724,7.370,0.451,50.800,0.718,0.168,0.882
1,Afghanistan,2009,4.402,7.540,0.552,51.200,0.679,0.190,0.850
2,Afghanistan,2010,4.758,7.647,0.539,51.600,0.600,0.121,0.707
3,Afghanistan,2011,3.832,7.620,0.521,51.920,0.496,0.162,0.731
4,Afghanistan,2012,3.783,7.705,0.521,52.240,0.531,0.236,0.776
...,...,...,...,...,...,...,...,...,...
1945,Zimbabwe,2017,3.638,8.016,0.754,55.000,0.753,-0.098,0.751
1946,Zimbabwe,2018,3.616,8.049,0.775,55.600,0.763,-0.068,0.844
1947,Zimbabwe,2019,2.694,7.950,0.759,56.200,0.632,-0.064,0.831
1948,Zimbabwe,2020,3.160,7.829,0.717,56.800,0.643,-0.009,0.789


In [6]:
df2

Unnamed: 0,country_name,year,ladder_score,GDP,social_support,life_expectancy,choice_freedom,generosity,corr_perception,pos_affect,neg_affect
0,Afghanistan,2008,3.724,7.370,0.451,50.80,0.718,0.168,0.882,0.518,0.258
1,Afghanistan,2009,4.402,7.540,0.552,51.20,0.679,0.190,0.850,0.584,0.237
2,Afghanistan,2010,4.758,7.647,0.539,51.60,0.600,0.121,0.707,0.618,0.275
3,Afghanistan,2011,3.832,7.620,0.521,51.92,0.496,0.162,0.731,0.611,0.267
4,Afghanistan,2012,3.783,7.705,0.521,52.24,0.531,0.236,0.776,0.710,0.268
...,...,...,...,...,...,...,...,...,...,...,...
1944,Zimbabwe,2016,3.735,7.984,0.768,54.40,0.733,-0.095,0.724,0.738,0.209
1945,Zimbabwe,2017,3.638,8.016,0.754,55.00,0.753,-0.098,0.751,0.806,0.224
1946,Zimbabwe,2018,3.616,8.049,0.775,55.60,0.763,-0.068,0.844,0.710,0.212
1947,Zimbabwe,2019,2.694,7.950,0.759,56.20,0.632,-0.064,0.831,0.716,0.235
