Final project for CS360 (Machine Learning) where I use linear regression and K Neighbors Regressor to predict a country's happiness score

choudharynisha/predicting-world-happiness

Predicting World Happiness

Description

For my final project for Machine Learning, I looked at 2015 through 2019 happiness data from the World Happiness Report. To predict a country's happiness score for a particular year, I used the Linear Regression and K Neighbors Regressor models from scikit-learn.

Running the Program

The main driver program for this project is find_happiness.py.

Requirements

The language requirement is Python 3. To download Python 3, please visit the Download Python page.

To run find_happiness.py, you need the following Python packages –
matplotlib
numpy
pandas
scikit-learn

To download these packages with pip, you can use the following commands in your Terminal / Command Prompt window –
pip3 install matplotlib
pip3 install numpy
pip3 install pandas
pip3 install scikit-learn

If you do not have pip installed, please navigate to the pip documentation's Installation page.

Command Line Arguments

All command line arguments are optional. Entering python3 find_happiness.py will run the program using data from 2015 through 2019 and split the data with a random seed.

Optional Arguments

-s (year_start) – The first year's data to look at (2015 through 2019, inclusive) [DEFAULT = 2015]
-e (year_end) – The last year's data to look at (2015 through 2019, inclusive) [DEFAULT = 2019]*
-d (seed) – The specified seed to split the train and test data on [DEFAULT = None]**

Examples of commands to run with the optional arguments –
python3 find_happiness.py -s 2016 -e 2018 -d 18
Runs the program only using data from 2016 through 2018 with a seed of 18

python3 find_happiness.py -e 2017
Runs the program only using data from 2015 through 2017 with a random seed

python3 find_happiness.py -s 2019 -d 42
Runs the program only using data from 2019 with a seed of 42

* The program will not run if the specified year_end is before the specified year_start.
** If the seed is None, a random seed will be used to split the data into training and testing.
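
The flags above could be parsed roughly as follows. This is only a sketch using the standard library's argparse, not the actual code in find_happiness.py; the function name parse_args, the constants, and the error message are assumptions made for illustration.

```python
import argparse

# Assumed constants: the README states the valid range is 2015 through 2019.
FIRST_YEAR, LAST_YEAR = 2015, 2019


def parse_args(argv=None):
    """Parse the optional -s, -e, and -d flags described above (hypothetical helper)."""
    years = range(FIRST_YEAR, LAST_YEAR + 1)
    parser = argparse.ArgumentParser(description="Predict world happiness scores")
    parser.add_argument("-s", dest="year_start", type=int, default=FIRST_YEAR,
                        choices=years, metavar="YEAR",
                        help="first year's data to look at")
    parser.add_argument("-e", dest="year_end", type=int, default=LAST_YEAR,
                        choices=years, metavar="YEAR",
                        help="last year's data to look at")
    parser.add_argument("-d", dest="seed", type=int, default=None,
                        help="seed for the train/test split (random if omitted)")
    args = parser.parse_args(argv)
    # Refuse to run when year_end is before year_start.
    if args.year_end < args.year_start:
        parser.error("year_end must not be before year_start")
    return args


# Equivalent to: python3 find_happiness.py -s 2016 -e 2018 -d 18
example = parse_args(["-s", "2016", "-e", "2018", "-d", "18"])
print(example.year_start, example.year_end, example.seed)
```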

Output

The predicted happiness scores are plotted against the actual happiness scores in a new window that opens when the program runs (the first graph's window appears after a short delay). Once the Linear Regression graph is closed, the K Neighbors Regressor graph opens in a new window after another short delay. In each scatterplot, a point represents the predicted versus actual happiness score of one country in one year between year_start and year_end, inclusive. After the graphs are closed, the Terminal / Command Prompt window shows the R2 score for the Linear Regression and K Neighbors Regressor models and the weights of the Linear Regression model.

To find out which country a point represents, hover over it when the graph's window is shown.
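
As a rough illustration of where the printed R2 values and weights come from, the sketch below fits both scikit-learn models on one seeded split. It is not the project's code: the helper name fit_and_score is hypothetical, the data is synthetic stand-in data rather than the World Happiness Report, and the default test size and default number of neighbors are assumptions because this README does not state them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor


def fit_and_score(X, y, seed=None):
    """Fit both models on one split and return their test R2 (hypothetical helper)."""
    # random_state=None gives a different random split on every run.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    linear = LinearRegression().fit(X_train, y_train)
    knn = KNeighborsRegressor().fit(X_train, y_train)  # library defaults (assumption)
    return {
        "linear_r2": linear.score(X_test, y_test),  # coefficient of determination
        "knn_r2": knn.score(X_test, y_test),
        "linear_weights": linear.coef_,             # one weight per feature
    }


# Synthetic stand-in data: 200 rows, 5 features, a noisy linear target.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = X @ np.array([3.0, 1.0, 1.0, 0.5, 0.5]) + rng.normal(0, 0.1, 200)

results = fit_and_score(X, y, seed=18)
print(results)
```

Passing the same seed reproduces the same split, which is why results can be compared across fixed seeds.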

Regions

The regions on the graph are numbered as follows –
0 = Australia and New Zealand
1 = Central and Eastern Europe
2 = Eastern Asia
3 = Latin America and Caribbean
4 = Middle East and Northern Africa
5 = North America
6 = Southeastern Asia
7 = Southern Asia
8 = Sub-Saharan Africa
9 = Western Europe
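
The numbering above matches the alphabetical order of the region names. The sketch below reproduces it by sorting and enumerating the names; whether the program derives the numbers this way is an assumption, and the variable names are illustrative.

```python
# Region names as listed above, deliberately given in no particular order.
regions = {
    "Western Europe",
    "North America",
    "Sub-Saharan Africa",
    "Eastern Asia",
    "Southern Asia",
    "Southeastern Asia",
    "Australia and New Zealand",
    "Central and Eastern Europe",
    "Latin America and Caribbean",
    "Middle East and Northern Africa",
}

# Assumption: codes are assigned by alphabetical position.
region_codes = {name: code for code, name in enumerate(sorted(regions))}

for name, code in region_codes.items():
    print(code, "=", name)
```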

Dataset

I found the 2015 through 2020* data from the World Happiness Report on Kaggle.

The columns** I looked at from the dataset were Country, Region***, Happiness Score, Economy (GDP per Capita), Health (Life Expectancy), Freedom, Trust (Government Corruption), and Generosity.

Health (Life Expectancy) – based on the data extracted from the World Health Organization's Global Health Observatory data repository.****

Freedom – national average of responses to “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”

Trust (Government Corruption)***** – The questions asked were, “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?” The overall perception is just the average of the two 0-or-1 responses.

Generosity – national average of responses to the Gallup World Poll question “Have you donated money to a charity in the past month?”

The dataset is stored in a Pandas DataFrame. Certain years did not include a region column, so the regions from the 2015 dataset are mapped to every country in all of the other datasets. A year column was also added to each year's DataFrame, so that the yearly DataFrames can be concatenated into one larger DataFrame while each datapoint keeps the year it came from.

* Because of the differences and inconsistencies in the 2020 data, I used the 2015 through 2019 data from this dataset.
** The column names used are from the 2015 dataset and are matched with the equivalent columns from the other datasets.
*** Regions used are from the 2015 dataset and are mapped to all countries in the other datasets to ensure consistency and add regions to datasets that didn't previously have them.
**** The data at the source are available for the years 2000, 2005, 2010, 2015 and 2016. To match the report’s sample period, interpolation and extrapolation are used.
***** Labeled as Corruption Perception in some datasets
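
The region mapping and year column described in this section can be sketched with pandas as below. This is an illustration under assumptions, not the project's code: combine_years is a hypothetical helper, the tiny frames hold placeholder values rather than real report data, and the column names "Country", "Region", and "Happiness Score" follow the 2015 naming mentioned above.

```python
import pandas as pd


def combine_years(frames_by_year, region_year=2015):
    """Map the 2015 regions onto every year, tag each row with its year, and concatenate."""
    # Country -> Region lookup taken from the reference year's DataFrame.
    region_of = frames_by_year[region_year].set_index("Country")["Region"]
    tagged = []
    for year, frame in sorted(frames_by_year.items()):
        frame = frame.copy()
        # Countries missing from the reference year end up with a missing region.
        frame["Region"] = frame["Country"].map(region_of)
        frame["Year"] = year
        tagged.append(frame)
    return pd.concat(tagged, ignore_index=True)


# Placeholder data only; the 2016 frame has no Region column.
df_2015 = pd.DataFrame({
    "Country": ["Country A", "Country B"],
    "Region": ["Western Europe", "Sub-Saharan Africa"],
    "Happiness Score": [1.0, 2.0],
})
df_2016 = pd.DataFrame({
    "Country": ["Country A", "Country C"],
    "Happiness Score": [3.0, 4.0],
})

combined = combine_years({2015: df_2015, 2016: df_2016})
print(combined)
```

In this sketch a country that does not appear in the 2015 data ("Country C") is left without a region; how the program handles such countries is not stated here.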

Observations

I compared the results for seeds 10, 18, 26, 34, and 42.

Common observations among all seeds

  1. The K Neighbors Regressor model had a higher R2 than the Linear Regression model for every seed used for shuffling, even though linear regression is known to work well with continuous data and K Nearest Neighbors is not known to be a very effective algorithm.
  2. While a few outliers were predicted to have a lower happiness score than their actual score, most outliers were predicted to have a higher one, regardless of whether the model was K Neighbors Regressor or Linear Regression.
  3. For both models, R2 was highest with a seed of 18 and decreased as the seed increased from 18 to 42.
  4. The models performed about the same, regardless of whether or not the feature values were scaled using the MinMaxScaler.
  5. Both models had a generally high R2 (> 0.7), and most points on the scatterplots were somewhere around the ideal line of y = x.
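
For reference, the R2 value that scikit-learn regressors report from their score method is the coefficient of determination, 1 minus the ratio of the residual sum of squares to the total sum of squares. The hand-written version below is a minimal sketch of that formula; the function name r2_manual and the toy numbers are illustrative only.

```python
def r2_manual(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared error of the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always predicting the mean of the actual values.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


print(r2_manual([1, 2, 3], [1, 2, 3]))  # perfect predictions -> 1.0
print(r2_manual([1, 2, 3], [2, 2, 2]))  # always predicting the mean -> 0.0
print(r2_manual([1, 2, 3], [3, 2, 1]))  # worse than the mean -> -3.0
```

A value near 1 means the points sit close to the line y = x on the scatterplots, and a value of 0 means the model does no better than always predicting the average score.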

Seed of 10
Linear Regression
– R2 ≈ 0.75506
– Weights
 → Economy (GDP per Capita) ≈ 2.77764
 → Health (Life Expectancy) ≈ 1.33523
 → Freedom ≈ 1.45193
 → Trust (Government Corruption) ≈ 0.279344
 → Generosity ≈ 0.29783

K Neighbors Regressor
– R2 ≈ 0.81473

Seed of 18
Linear Regression
– R2 ≈ 0.79730
– Weights
 → Economy (GDP per Capita) ≈ 2.84379
 → Health (Life Expectancy) ≈ 1.27833
 → Freedom ≈ 1.25342
 → Trust (Government Corruption) ≈ 0.25597
 → Generosity ≈ 0.45937

K Neighbors Regressor
– R2 ≈ 0.85948

Seed of 26
Linear Regression
– R2 ≈ 0.75742
– Weights
 → Economy (GDP per Capita) ≈ 2.84740
 → Health (Life Expectancy) ≈ 1.29336
 → Freedom ≈ 1.36589
 → Trust (Government Corruption) ≈ 0.18373
 → Generosity ≈ 0.37860

K Neighbors Regressor
– R2 ≈ 0.81424

Seed of 34
Linear Regression
– R2 ≈ 0.71898
– Weights
 → Economy (GDP per Capita) ≈ 2.847401
 → Health (Life Expectancy) ≈ 1.26220
 → Freedom ≈ 1.42733
 → Trust (Government Corruption) ≈ 0.35726
 → Generosity ≈ 0.32781

K Neighbors Regressor
– R2 ≈ 0.78445

Seed of 42
Linear Regression
– R2 ≈ 0.71544
– Weights
 → Economy (GDP per Capita) ≈ 2.631697
 → Health (Life Expectancy) ≈ 1.38927
 → Freedom ≈ 1.38302
 → Trust (Government Corruption) ≈ 0.46644
 → Generosity ≈ 0.29325

K Neighbors Regressor
– R2 ≈ 0.78414

Other Observations

  1. Economy had a larger Linear Regression weight than personal freedom for every seed, so it was more strongly associated with how happy a country's average citizen perceived themselves to be.
  2. For the lower seeds (10, 18, and 26), generosity had a larger weight than corruption, so it was more strongly associated with a country's average citizen's perception of personal happiness. For the higher seeds (34 and 42), this was reversed.
  3. Western European countries tend to have the highest happiness scores and predictions, while Sub-Saharan African countries tend to have the lowest happiness scores.
  4. Even though the region was used only to label the points and was not one of the features, the models were still able to identify patterns shared by the countries in each region, and most countries in each region are in the same part of the graph.

Other Interpretations

  1. Before each point on the scatterplot was colored by the region its country was in, it seemed like the models were working well and the data I found on Kaggle was good data. After coloring each point by region, it seems like the dataset may assess each country's overall happiness score based only on Western Europe's definition of happiness.
  2. Countries with darker skinned / black populations tend to be on the lower end of the ranking. It makes me curious about how to change this dataset to include multiple definitions of happiness, taking into account the different ways that different cultures and individuals think about what it means to be happy, as well as other factors that could be measured instead.
  3. Going off of the last observation that a country's region was not a feature used to predict its happiness score, it seems like a lot of this computation was based on a definition held by a certain group of people. Since different countries in the same region do have similarities, it makes sense, in a way, that under this definition countries in the same region may be in the same part of the scatterplots.

Goals for the Future

– Update the dataset and predict 2021 data (and later)
– Learn more about how to collect data
– Put together different data sources
 → Consider other potential factors of happiness, including but not limited to
  ⇒ Volunteer work
  ⇒ Scale of self-appreciation
   ⇛ 1 = no self-appreciation
   ⇛ 10 = lots of self-appreciation
  ⇒ How much natural light people are exposed to
  ⇒ Air quality and pollution – scale
   ⇛ 1 = no pollution
   ⇛ 10 = high pollution
  ⇒ Personal assessment of work-life balance
  ⇒ How much experiences versus material goods are valued
 → Compare similar measures from different data sources
– Learn more about positive psychology to help determine what leads to better perceptions of personal happiness
– Make a web app to make the data and results more accessible
– Combine with the Happy Journal web app project
