# Introduction

Although often neglected, the study of happiness is a critical goal for humankind. Happy people are empirically more creative and energetic, and feel healthier both physically and emotionally. Research on the key factors of happiness has historically been carried out by philosophers and writers, as happiness was thought of as an abstract, transcendental concept based solely on the mindset of an individual ("Nobody is happy if not for his fault" - Seneca). Would a data-driven approach be able to provide us with new insights? Our project begins with this question in mind.


# Dataset

Since 2012, world happiness levels have been surveyed by asking people worldwide the question "How would you rate your happiness on a scale of 0 to 10, where 10 is the happiest?". The results, which are published annually on March 20th (World Happiness Day), are summarized on Kaggle (www.kaggle.com/unsdsn/world-happiness#2019.csv), where they are merged with additional country data to form the complete dataset.

The World Happiness Report Dataset includes one .csv file for each year from 2015 through 2019 (the 2020 report was recently published as a PDF, but has not made its way into the dataset yet), and ranks 156 countries based on their surveyed happiness score, along with additional data on several topics, including the economy (GDP), health (life expectancy), perceived freedom, trust in the government, and generosity.

This dataset is already clean and ready-to-use, allowing the focus of the project to be on creating compelling visualizations from the data.

Since understanding happiness is so complex, we anticipate that our insights may be limited if we stick to one numerical dataset. To address this, we will consider expanding the data with additional social insights. For example, we propose to scrape some Google Trends data, which gives insight into shifts in public attention, and to see how it correlates with the happiness data.


# Problem Statement

The main focus of this project is country happiness levels around the world, and understanding the factors that drive them. The project will take shape across two main axes – geographic visualizations of country happiness levels, and comparative visualizations of country happiness levels against other metrics.
 
The aim of the geographic visualizations is to allow users to interact with and gain insights into the data set in as simple and fast a way as possible. The geographic visualizations will center around maps which display the happiness index of each country worldwide, with users being able to interact with the visuals to view specific subsets of the data, filtered across different categories (e.g. year, geographic subregion, etc.).
 
The aim of the comparative visualizations is to present the data in a way that will allow users to better understand what the contributing factors to a country’s happiness may be. The comparative visualizations will take additional country data, such as GDP, freedom, and perceived corruption, and plot them against happiness, helping users see correlations (or a lack thereof) between various pieces of data.
 
The motivation for this topic is to better understand which factors make a country happy. Happiness is an abstract concept, so finding correlations between it and more concrete, tangible metrics offers valuable insight for those trying to make their countries happier. The target audience for this project comprises federal politicians, international relations researchers, and anyone else with an interest in international data.


# Exploratory Data Analysis

### Installation of packages and imports

In [1]:
!pip install plotly
!pip install chart_studio



In [2]:
#Call required libraries
import time                   # To time processes
import warnings               # To suppress warnings

import numpy as np            # Data manipulation
import pandas as pd           # Dataframe manipulation
import matplotlib.pyplot as plt                   # For graphics
import seaborn as sns
import chart_studio.plotly as py #For World Map
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import plotly.express as px

import os                     # For os related operations
import sys                    # For data size

### Reading data and cleaning


Reading files

In [3]:
d2015 = pd.read_csv("https://raw.githubusercontent.com/com-480-data-visualization/com-480-project-datavaders/master/2015.csv")
d2016 = pd.read_csv("https://raw.githubusercontent.com/com-480-data-visualization/com-480-project-datavaders/master/2016.csv")
d2017 = pd.read_csv("https://raw.githubusercontent.com/com-480-data-visualization/com-480-project-datavaders/master/2017.csv")
d2018 = pd.read_csv("https://raw.githubusercontent.com/com-480-data-visualization/com-480-project-datavaders/master/2018.csv")
d2019 = pd.read_csv("https://raw.githubusercontent.com/com-480-data-visualization/com-480-project-datavaders/master/2019.csv")

The 2017, 2018 and 2019 datasets don't have regions, so we need to add a 'Region' column with the region for every country.

In [4]:
western_europe = ['Switzerland', 'Iceland', 'Denmark', 'Norway', 'Finland',
       'Netherlands', 'Sweden', 'Austria', 'Luxembourg', 'Ireland',
       'Belgium', 'United Kingdom', 'Germany', 'France', 'Spain', 'Malta',
       'Italy', 'North Cyprus', 'Cyprus', 'Portugal', 'Greece']
north_america = ['Canada', 'United States']
australia = ['New Zealand', 'Australia']
middle_east = ['Israel', 'United Arab Emirates', 'Oman', 'Qatar', 'Saudi Arabia',
       'Kuwait', 'Bahrain', 'Libya', 'Algeria', 'Turkey', 'Jordan',
       'Morocco', 'Lebanon', 'Tunisia', 'Palestinian Territories', 'Iran',
       'Iraq', 'Egypt', 'Yemen', 'Syria']
latin_america = ['Costa Rica', 'Mexico', 'Brazil', 'Venezuela', 'Panama', 'Chile',
       'Argentina', 'Uruguay', 'Colombia', 'Suriname',
       'Trinidad and Tobago', 'El Salvador', 'Guatemala', 'Ecuador',
       'Bolivia', 'Paraguay', 'Nicaragua', 'Peru', 'Jamaica',
       'Dominican Republic', 'Honduras', 'Haiti']
southeastern_asia = ['Singapore', 'Thailand', 'Malaysia', 'Indonesia', 'Vietnam',
       'Philippines', 'Laos', 'Myanmar', 'Cambodia']
central_and_eastern_europe = ['Czech Republic', 'Uzbekistan', 'Slovakia', 'Moldova',
       'Kazakhstan', 'Slovenia', 'Lithuania', 'Belarus', 'Poland',
       'Croatia', 'Russia', 'Kosovo', 'Turkmenistan', 'Estonia',
       'Kyrgyzstan', 'Azerbaijan', 'Montenegro', 'Romania', 'Serbia',
       'Latvia', 'Macedonia', 'Albania', 'Bosnia and Herzegovina',
       'Hungary', 'Tajikistan', 'Ukraine', 'Armenia', 'Georgia',
       'Bulgaria']
eastern_asia = ['Taiwan', 'Japan', 'South Korea', 'Hong Kong', 'China', 'Mongolia']
subsaharan_africa = ['Mauritius', 'Nigeria', 'Zambia', 'Somaliland region',
       'Mozambique', 'Lesotho', 'Swaziland', 'South Africa', 'Ghana',
       'Zimbabwe', 'Liberia', 'Sudan', 'Congo (Kinshasa)', 'Ethiopia',
       'Sierra Leone', 'Mauritania', 'Kenya', 'Djibouti', 'Botswana',
       'Malawi', 'Cameroon', 'Angola', 'Mali', 'Congo (Brazzaville)',
       'Comoros', 'Uganda', 'Senegal', 'Gabon', 'Niger', 'Tanzania',
       'Madagascar', 'Central African Republic', 'Chad', 'Guinea',
       'Ivory Coast', 'Burkina Faso', 'Rwanda', 'Benin', 'Burundi',
       'Togo']
southern_asia = ['Bhutan', 'Pakistan', 'Bangladesh', 'India', 'Nepal', 'Sri Lanka',
       'Afghanistan']

In [5]:
def getRegion(country):
  if country in western_europe:
    return 'Western Europe'
  elif country in north_america:
    return 'North America'
  elif country in australia:
    return 'Australia and New Zealand'
  elif country in middle_east:
    return  'Middle East and Northern Africa'
  elif country in latin_america:
    return 'Latin America and Caribbean'
  elif country in southeastern_asia:
    return  'Southeastern Asia'
  elif country in central_and_eastern_europe:
    return 'Central and Eastern Europe'
  elif country in eastern_asia:
    return 'Eastern Asia'
  elif country in subsaharan_africa:
    return 'Sub-Saharan Africa'
  elif country in southern_asia:
    return 'Southern Asia'
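
The chain of membership tests in getRegion can also be written as a single dictionary lookup, which keeps the country lists and the region labels in one place. The following is only a sketch of that alternative: the lists are shortened stand-ins for the full ones above, and `REGION_LISTS`, `COUNTRY_TO_REGION` and `get_region` are hypothetical names that do not appear in the notebook.

```python
# Hypothetical, shortened stand-ins for the full region lists defined above.
REGION_LISTS = {
    'Western Europe': ['Switzerland', 'Iceland', 'Denmark'],
    'North America': ['Canada', 'United States'],
    'Australia and New Zealand': ['New Zealand', 'Australia'],
}

# Invert the mapping once: country name -> region name.
COUNTRY_TO_REGION = {
    country: region
    for region, countries in REGION_LISTS.items()
    for country in countries
}


def get_region(country):
    """Return the region for a country, or None if it is not listed."""
    return COUNTRY_TO_REGION.get(country)


print(get_region('Canada'))    # North America
print(get_region('Atlantis'))  # None
```

Like getRegion, this returns None for a country that is missing from every list, so unmapped names still show up as nulls in the 'Region' column.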

In [6]:
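# Look up each region from the same frame's own country column
d2017['Region'] = d2017['Country'].apply(getRegion)
# Assumes the 2018 and 2019 files keep Kaggle's 'Country or region' header
d2018['Region'] = d2018['Country or region'].apply(getRegion)
d2019['Region'] = d2019['Country or region'].apply(getRegion)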

In [7]:
d2017[d2017['Region'].isnull()]

Unnamed: 0,Country,Happiness.Rank,Happiness.Score,Whisker.high,Whisker.low,Economy..GDP.per.Capita.,Family,Health..Life.Expectancy.,Freedom,Generosity,Trust..Government.Corruption.,Dystopia.Residual,Region
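
Because getRegion returns None for any country that is missing from the lists, it can help to print the unmapped names explicitly rather than inspect the filtered frame by eye. Below is a minimal sketch on made-up rows; the `sample` frame and the `region_of` mapping are illustrative assumptions, not the project's data.

```python
import pandas as pd

# Illustrative mapping and rows (not the real dataset).
region_of = {'Canada': 'North America', 'Norway': 'Western Europe'}
sample = pd.DataFrame({'Country': ['Norway', 'Canada', 'Belize']})

# Series.map leaves NaN where the country has no entry in the mapping.
sample['Region'] = sample['Country'].map(region_of)

# Names that still need a region assigned.
unmapped = sample.loc[sample['Region'].isnull(), 'Country'].tolist()
print(unmapped)  # ['Belize']
```

An empty list here means every country in that year's file was matched; any names that do appear usually point to spelling differences between years.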


Here we delete the columns that aren't present in all of the datasets and give the remaining columns the same names in every year, so we can analyze how the factors change over time.

In [8]:

d2019.columns = ["rank","country", "score",
                  "gdp_per_capita","social_support","healthy_life_expectancy",
                 "freedom_to_life_choice","generosity","corruption_perceptions", 'region']
d2018.columns = ["rank","country", "score",
                  "gdp_per_capita","social_support","healthy_life_expectancy",
                 "freedom_to_life_choice","generosity","corruption_perceptions", 'region']
# Drop the 2017-only columns (and the leftover CSV index column, if present)
d2017.drop(["Unnamed: 0","Whisker.high","Whisker.low",
            "Family","Dystopia.Residual"],axis=1,inplace=True,errors="ignore")
d2017.columns =  ["country","rank","score",
                  "gdp_per_capita","healthy_life_expectancy",
                 "freedom_to_life_choice","generosity","corruption_perceptions", 'region']
d2016.drop(['Lower Confidence Interval','Upper Confidence Interval',
            "Family",'Dystopia Residual'],axis=1,inplace=True)
d2016.columns = ["country", "region","rank","score",
                  "gdp_per_capita","healthy_life_expectancy",
                 "freedom_to_life_choice","corruption_perceptions","generosity"]
d2015.drop(['Standard Error', 'Family', 'Dystopia Residual'],axis=1,inplace=True)
d2015.columns = ['country',"region", "rank", "score", "gdp_per_capita",
"healthy_life_expectancy", "freedom_to_life_choice", "corruption_perceptions",
"generosity"]


In [9]:
coltoselect = ["rank","country","region","score",
                "gdp_per_capita","healthy_life_expectancy",
                "freedom_to_life_choice","generosity","corruption_perceptions"]
d2015 = d2015.loc[:,coltoselect].copy()
d2016 = d2016.loc[:,coltoselect].copy()
d2017 = d2017.loc[:,coltoselect].copy()
d2018 = d2018.loc[:,coltoselect].copy()
d2019 = d2019.loc[:,coltoselect].copy()
d2015["year"] = 2015
d2016["year"] = 2016
d2017["year"] = 2017
d2018["year"] = 2018
d2019["year"] = 2019
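
Assigning a list to `.columns` relies on the column order of each file, so a file with an extra or reordered column can end up with mislabeled data. Renaming by name is one way to make this step less sensitive to order. The sketch below uses a two-row stand-in frame whose headers are assumed to follow the 2015 file; the values are illustrative, and `RENAME_2015` is a hypothetical name.

```python
import pandas as pd

# Stand-in frame; headers are assumed to follow the 2015 file, values are made up.
raw = pd.DataFrame({
    'Country': ['A', 'B'],
    'Happiness Rank': [1, 2],
    'Happiness Score': [7.5, 7.4],
    'Standard Error': [0.03, 0.05],
})

# Hypothetical mapping from original to harmonized names.
RENAME_2015 = {
    'Country': 'country',
    'Happiness Rank': 'rank',
    'Happiness Score': 'score',
}

# Drop what is not shared across years, then rename by name, not by position.
tidy = raw.drop(columns=['Standard Error']).rename(columns=RENAME_2015)
print(list(tidy.columns))  # ['country', 'rank', 'score']
```

With this approach a missing or unexpected header leaves its column under the original name, which is easier to spot than a silently shifted label.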






Now we put the modified datasets from 2015, 2016, 2017, 2018 and 2019 in one big dataset called finaldf that will be the main dataset we'll be working on. 

In [10]:
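# DataFrame.append was removed in pandas 2.0; pd.concat stacks the yearly frames
finaldf = pd.concat([d2015,d2016,d2017,d2018,d2019], ignore_index=True, sort=False)
finaldf.head()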


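
Once the yearly frames share the same columns, a quick check on the stacked result can confirm that every year contributed rows and that no extra column was introduced by a naming mismatch. This is a sketch with two tiny made-up frames; `y1`, `y2` and `combined` are illustrative names.

```python
import pandas as pd

cols = ['country', 'score', 'year']
y1 = pd.DataFrame([['A', 7.0, 2015], ['B', 5.0, 2015]], columns=cols)
y2 = pd.DataFrame([['A', 7.1, 2016]], columns=cols)

# Stack the frames row-wise and renumber the index from 0.
combined = pd.concat([y1, y2], ignore_index=True, sort=False)

# Count rows per year, converted to plain Python ints for readable output.
rows_per_year = {
    int(year): int(count)
    for year, count in combined.groupby('year').size().items()
}
print(rows_per_year)           # {2015: 2, 2016: 1}
print(list(combined.columns))  # ['country', 'score', 'year']
```

If a column name differs between two frames, `pd.concat` keeps both names and fills the gaps with NaN, so an unexpected entry in the column list is a useful warning sign.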

### Scatter plots of the happiness score against different factors

The plots below show how GDP, freedom, generosity and perceived corruption relate to the average happiness of the people in a country, and how those relationships change over time. The plots are interactive: hovering over a bubble shows the name of the country as well as some other interesting values. Clicking on a country in the legend on the right shows or hides it, and double-clicking on it isolates it on the graph. You can also move the slider to see how the relationship evolves over the years.

In [0]:
fig = px.scatter(finaldf, x="gdp_per_capita", y="score", animation_frame="year",
           animation_group="country",
           size="rank", color="country", hover_name="country",
          trendline= "ols")
fig.show(renderer = 'colab')

In [0]:
fig = px.scatter(finaldf, x="freedom_to_life_choice", y="score", animation_frame="year",
           animation_group="country",
           size="rank", color="country", hover_name="country",
          trendline= "ols")
fig.show(renderer = 'colab')

In [0]:
fig = px.scatter(finaldf, x="generosity", y="score", animation_frame="year",
           animation_group="country",
           size="rank", color="country", hover_name="country",
          trendline= "ols")
fig.show(renderer = 'colab')

In [0]:
fig = px.scatter(finaldf, x="corruption_perceptions", y="score", animation_frame="year",
           animation_group="country",
           size="rank", color="country", hover_name="country",
          trendline= "ols")
fig.show(renderer = 'colab')
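
The scatter plots show the relationships visually; a per-year correlation coefficient is one way to put a number on how each relationship changes over time. The sketch below uses made-up values arranged so that the result is easy to check by eye, with column names matching the harmonized ones above; `corr_by_year` is a hypothetical name.

```python
import pandas as pd

# Made-up values: perfectly increasing in 2015, perfectly decreasing in 2016.
df = pd.DataFrame({
    'year': [2015, 2015, 2015, 2016, 2016, 2016],
    'gdp_per_capita': [0.5, 1.0, 1.5, 0.5, 1.0, 1.5],
    'score': [4.0, 5.0, 6.0, 6.0, 5.0, 4.0],
})

# Pearson correlation between the factor and the score, one value per year.
corr_by_year = {
    int(year): group['gdp_per_capita'].corr(group['score'])
    for year, group in df.groupby('year')
}
print(corr_by_year)
```

Applied to the combined dataset, the same loop could be repeated for each factor to compare how strongly it tracks the score from 2015 to 2019.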

### Maps for every year

In [0]:
fig = px.choropleth(finaldf, locations="country", locationmode='country names',color="score", hover_name="country", animation_frame = 'year')

In [0]:
fig.show(renderer = 'colab')

### Regions Pie chart



In [0]:
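# Count the countries per region (the column was renamed to 'region' above)
region_counts = d2015['region'].value_counts()

label_d2015 = region_counts.index
size_d2015 = region_counts.values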


colors = ['#003f5c','#2f4b7c','#665191','#a05195','#d45087','#f95d6a','#ff7c43','#ffa600']

trace = go.Pie(
         labels = label_d2015, values = size_d2015, marker = dict(colors = colors), name = '2015', hole = 0.3)

data = [trace]

layout1 = go.Layout(
           title = 'Regions')

fig = go.Figure(data = data, layout = layout1)
fig.show(renderer = 'colab')

### Correlation between different variables

In [0]:
plt.rcParams['figure.figsize'] = (20, 15)
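# Keep only numeric columns; recent pandas versions raise on text columns in corr()
sns.heatmap(d2017.select_dtypes('number').corr(), cmap = 'copper', annot = True)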

plt.show()
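
A heatmap shows every pairwise correlation at once; when the question is specifically which factors move with the happiness score, the 'score' column of the correlation matrix can be sorted by strength. This is a sketch on four made-up rows, with column names following the harmonized ones above; `by_strength` is a hypothetical name.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'score': [4.0, 5.0, 6.0, 7.0],
    'gdp_per_capita': [0.4, 0.9, 1.1, 1.5],
    'generosity': [0.3, 0.1, 0.2, 0.1],
})

# Correlation matrix over the numeric columns only.
corr = df.select_dtypes('number').corr()

# Correlation of each factor with the score, strongest (by magnitude) first.
by_strength = corr['score'].drop('score').sort_values(key=abs, ascending=False)
print(by_strength)
```

Sorting by absolute value keeps strong negative relationships near the top instead of pushing them to the bottom of the list.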

# Related work

Since the data being used in this project is relatively popular on Kaggle, it has already been subject to some analysis. On Kaggle itself, there have been a few attempts at processing and visualizing the set in Python (e.g. kaggle.com/jesperdramsch/the-reason-we-re-happy). Additionally, there are already a few independent web projects on world happiness data (e.g. www.benscott.co.uk/wdvp/, alpha.iodide.io/notebooks/193/?viewMode=report). 

This project will differentiate itself by the quality of interactions and user control provided in the visualizations, as well as the detail of the analysis across metrics and between countries, allowing users to more easily find correlations between different metrics. Users will be able to sort through several years' worth of data (as opposed to the single year offered in other projects), and directly compare metrics between countries and regions of their choosing. Furthermore, comparative visualizations will explore the relation between happiness and every other metric in the dataset, with the potential to add further metrics from outside the set (e.g. scraping Google Trends data), further setting the project apart from its predecessors.

Sources of inspiration for this project include sample d3 visualization features found on www.d3-graph-gallery.com/, as well as novel ways of data display and storytelling seen on www.reddit.com/r/dataisbeautiful/ and www.bl.ocks.org/.