## Data from World Happiness Report

The World Happiness Report is an annual publication of the United Nations Sustainable Development Solutions Network. It contains articles and rankings of national happiness based on respondents' ratings of their own lives, which the report also correlates with various life factors.

In this notebook we will explore the happiness of different countries and the features associated with it.
The datasets that we will use are available in *Data*: **happiness2020.csv** and **countries_info.csv**.

Although the features are self-explanatory, here is a summary:

**happiness2020.csv**
* country: *Name of the country*
* happiness_score: *Happiness score*
* social_support: *Social support (mitigating the effects of inequality)*
* healthy_life_expectancy: *Healthy Life Expectancy*
* freedom_of_choices: *Freedom to make life choices*
* generosity: *Generosity (charity, volunteers)*
* perception_of_corruption: *Corruption Perception*
* world_region: *Area of the world of the country*

**countries_info.csv**
* country_name: *Name of the country*
* area: *Area in sq mi*
* population: *Number of people*
* literacy: *Literacy percentage*

In [2]:
!head Data/countries_info.csv

'head' is not recognized as an internal or external command,
operable program or batch file.
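
The `!head` call above relies on a Unix shell utility, which is why it fails on Windows. A minimal portable sketch is to read the first lines in Python instead. The helper name `preview_lines` and the sample rows are made up for illustration; they are not part of the exercise.

```python
from itertools import islice
import os
import tempfile


def preview_lines(path, n=10):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Demo on a throwaway file with made-up rows, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("country_name,area\nexampleland,100\nsampleland,200\n")
    first_two = preview_lines(sample_path, n=2)

print(first_two)
```

In the notebook, `preview_lines('Data/countries_info.csv')` would play the role of `!head` on any operating system.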


In [1]:
import pandas as pd
%matplotlib inline

DATA_FOLDER = 'Data/'

HAPPINESS_DATASET = DATA_FOLDER+"happiness2020.csv"
COUNTRIES_DATASET = DATA_FOLDER+"countries_info.csv"

## Task 1: Load the data

Load the two datasets into pandas dataframes (called *happiness* and *countries*) and show the first rows.


**Hint**: Use the correct reader and verify the data has the expected format.

In [2]:
happiness = pd.read_csv(HAPPINESS_DATASET)
countries = pd.read_csv(COUNTRIES_DATASET)

happiness.head()

Unnamed: 0,country,happiness_score,social_support,healthy_life_expectancy,freedom_of_choices,generosity,perception_of_corruption,world_region
0,Afghanistan,2.5669,0.470367,52.59,0.396573,-0.096429,0.933687,South Asia
1,Albania,4.8827,0.67107,68.708138,0.781994,-0.042309,0.896304,Central and Eastern Europe
2,Algeria,5.0051,0.803385,65.905174,0.466611,-0.121105,0.735485,Middle East and North Africa
3,Argentina,5.9747,0.900568,68.803802,0.831132,-0.194914,0.84201,Latin America and Caribbean
4,Armenia,4.6768,0.757479,66.750656,0.712018,-0.13878,0.773545,Commonwealth of Independent States


In [3]:
happiness['country'] = happiness['country'].str.lower()

In [4]:
countries.head()

Unnamed: 0,country_name,area,population,literacy
0,afghanistan,647500,31056997,360
1,albania,28748,3581655,865
2,algeria,2381740,32930091,700
3,argentina,2766890,39921833,971
4,armenia,29800,2976372,986


## Task 2: Let's merge the data

Create a dataframe called *country_features* by merging *happiness* and *countries*. A row of this dataframe must describe all the features that we have about a country.

**Hint**: Verify that all the rows are in the final dataframe.

In [5]:
country_features = pd.merge(countries, happiness, left_on='country_name', right_on='country', how='outer')


In [6]:
country_features.head()

Unnamed: 0,country_name,area,population,literacy,country,happiness_score,social_support,healthy_life_expectancy,freedom_of_choices,generosity,perception_of_corruption,world_region
0,afghanistan,647500,31056997,360,afghanistan,2.5669,0.470367,52.59,0.396573,-0.096429,0.933687,South Asia
1,albania,28748,3581655,865,albania,4.8827,0.67107,68.708138,0.781994,-0.042309,0.896304,Central and Eastern Europe
2,algeria,2381740,32930091,700,algeria,5.0051,0.803385,65.905174,0.466611,-0.121105,0.735485,Middle East and North Africa
3,argentina,2766890,39921833,971,argentina,5.9747,0.900568,68.803802,0.831132,-0.194914,0.84201,Latin America and Caribbean
4,armenia,29800,2976372,986,armenia,4.6768,0.757479,66.750656,0.712018,-0.13878,0.773545,Commonwealth of Independent States
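
The hint asks to verify that no row was lost in the merge. One possible check, sketched here on made-up toy frames (the rows `atlantis` and `narnia` are invented so that each side has an unmatched country), is to pass `indicator=True`, which adds a `_merge` column telling where each row came from.

```python
import pandas as pd

# Toy stand-ins for the two datasets; values are illustrative only.
countries_demo = pd.DataFrame({
    "country_name": ["albania", "algeria", "atlantis"],
    "area": [28748, 2381740, 1],
})
happiness_demo = pd.DataFrame({
    "country": ["albania", "algeria", "narnia"],
    "happiness_score": [4.8827, 5.0051, 6.0],
})

merged = pd.merge(
    countries_demo,
    happiness_demo,
    left_on="country_name",
    right_on="country",
    how="outer",
    indicator=True,  # adds a "_merge" column: left_only, right_only or both
)

# Rows that exist in only one of the two inputs.
unmatched = merged[merged["_merge"] != "both"]

print(merged["_merge"].value_counts())
print(unmatched[["country_name", "country", "_merge"]])
```

With `how='outer'` unmatched rows are kept and padded with missing values, so an empty `unmatched` frame would indicate that every country was found in both datasets.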


## Task 3: Where are people happier?

Print the top 10 countries based on their happiness score (higher is better).

In [7]:
happiness[['country', 'happiness_score']].sort_values(by = 'happiness_score', ascending=False).head(10)

Unnamed: 0,country,happiness_score
38,finland,7.8087
31,denmark,7.6456
115,switzerland,7.5599
50,iceland,7.5045
92,norway,7.488
87,netherlands,7.4489
114,sweden,7.3535
88,new zealand,7.2996
6,austria,7.2942
72,luxembourg,7.2375


We want to know in which world region people are happiest.

Create and print a dataframe with (1) the average happiness score and (2) the number of countries for each world region.
Sort the result to show the happiness ranking.

In [8]:
region_happiness = happiness.groupby('world_region').agg(
    average_happiness=('happiness_score', 'mean'),
    country_count=('country', 'size')
)

region_happiness_sorted = region_happiness.sort_values(by='average_happiness', ascending=False)

print(region_happiness_sorted)

                                    average_happiness  country_count
world_region                                                        
North America and ANZ                        7.173525              4
Western Europe                               6.967405             20
Latin America and Caribbean                  5.971280             20
Central and Eastern Europe                   5.891393             14
Southeast Asia                               5.517788              8
East Asia                                    5.483633              3
Commonwealth of Independent States           5.358342             12
Middle East and North Africa                 5.269306             16
Sub-Saharan Africa                           4.393856             32
South Asia                                   4.355083              6


The first region has only a few countries! Which countries are they, and what are their scores?

In [16]:
north_america_happiness = happiness[happiness['world_region'] == 'North America and ANZ'][['country', 'happiness_score']]

north_america_happiness.head()


Unnamed: 0,country,happiness_score
5,australia,7.2228
21,canada,7.2321
88,new zealand,7.2996
127,united states,6.9396


## Task 4: How literate is the world?

Print the names of the countries with a level of literacy of 100%. 

For each country, print the name and the world region in the format: *{region name} - {country name} ({happiness score})*

In [25]:
literacy_100 = country_features[country_features["literacy"] == "100,0"][['world_region', 'country_name', 'happiness_score']]

for _, entry in literacy_100.iterrows():
    print(f"{entry['world_region']} - {entry['country_name']} ({entry['happiness_score']})")

North America and ANZ - australia (7.222799778)
Western Europe - denmark (7.645599842)
Western Europe - finland (7.808700085)
Western Europe - luxembourg (7.237500191)
Western Europe - norway (7.487999916000001)
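
The filter above compares against the string `"100,0"`, which suggests that *literacy* was read as text with a decimal comma. Averages and comparisons such as `< 50` need numeric values, so a conversion step is required at some point. The following is a minimal sketch on a made-up series, not necessarily the conversion used in this notebook.

```python
import pandas as pd

# Made-up sample in the same style as the literacy column; None marks a missing value.
raw = pd.Series(["36,0", "86,5", "100,0", None])

# Swap the decimal comma for a dot, then parse; unparseable entries become NaN.
literacy = pd.to_numeric(raw.str.replace(",", ".", regex=False), errors="coerce")

print(literacy)
print(literacy.mean())  # missing values are skipped by default
```

`pd.read_csv` also accepts a `decimal` parameter, which may make it possible to handle this at load time instead.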


What is the global average?

In [36]:
# mean is a method: without the parentheses the cell only displays the bound method
country_features['literacy'].mean()

Calculate the proportion of countries with a literacy level below 50%. Print the value in percentage, formatted with 2 decimals.

In [56]:
below_50 = country_features[country_features["literacy"] < 50]
literacy_below_50 = len(below_50) / len(country_features) * 100
print(f"{literacy_below_50:.2f}%")

11.85%
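
The same proportion can also be computed as the mean of a boolean mask. The sketch below uses made-up literacy values and shows one subtlety: a missing value compares as `False`, so whether countries without a literacy figure count in the denominator is a choice to make explicitly.

```python
import pandas as pd

# Made-up literacy percentages; NaN marks a country with no literacy figure.
literacy = pd.Series([36.0, 86.5, 70.0, 45.0, float("nan")])

# NaN < 50 evaluates to False, so the missing country stays in the denominator.
share_all = (literacy < 50).mean() * 100

# Restricting to countries with a known literacy changes the denominator.
share_known = (literacy.dropna() < 50).mean() * 100

print(f"{share_all:.2f}% of all rows, {share_known:.2f}% of rows with known literacy")
```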


Print the raw number and the percentage of world population that is illiterate.

In [55]:
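worlds_population = country_features["population"].sum()
# below_50 holds the countries with literacy under 50%, so this sums their whole
# population rather than the illiterate share of every country
illiterate_population = below_50["population"].sum()

print(f"Population of illiterate people: {illiterate_population} \nPercentage of illiterate people: {illiterate_population/worlds_population*100}%")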

Population of illiterate people: 580572946 
Percentage of illiterate people: 9.447161309066706%
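
The figures above count the full population of every country whose literacy is below 50%. Another reading of the task is to weight each country's population by its illiteracy rate, i.e. `population * (1 - literacy / 100)`, and sum over all countries. A minimal sketch of that reading, on made-up numbers:

```python
import pandas as pd

# Made-up countries; literacy is a percentage between 0 and 100.
demo = pd.DataFrame({
    "population": [1_000_000, 2_000_000, 500_000],
    "literacy": [36.0, 90.0, 100.0],
})

# Keep only rows where both values are known, so that the numerator and the
# denominator describe the same set of countries.
known = demo.dropna(subset=["population", "literacy"])

# Estimated illiterate people per country, then the world total.
illiterate = (known["population"] * (1 - known["literacy"] / 100)).sum()
share = illiterate / known["population"].sum() * 100

print(f"Illiterate population: {illiterate:.0f}")
print(f"Share of world population: {share:.2f}%")
```

This estimate assumes that the literacy rate applies uniformly to the whole population of a country, which is a simplification.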


## Task 5: Population density

Add a new field called *population_density* to the dataframe, computed by dividing *population* by *area*.

In [58]:
country_features["population_density"] = country_features["population"] / country_features["area"]

country_features.head()

Unnamed: 0,country_name,area,population,literacy,country,happiness_score,social_support,healthy_life_expectancy,freedom_of_choices,generosity,perception_of_corruption,world_region,population_density
0,afghanistan,647500,31056997,36.0,afghanistan,2.5669,0.470367,52.59,0.396573,-0.096429,0.933687,South Asia,47.964474
1,albania,28748,3581655,86.5,albania,4.8827,0.67107,68.708138,0.781994,-0.042309,0.896304,Central and Eastern Europe,124.587971
2,algeria,2381740,32930091,70.0,algeria,5.0051,0.803385,65.905174,0.466611,-0.121105,0.735485,Middle East and North Africa,13.826065
3,argentina,2766890,39921833,97.1,argentina,5.9747,0.900568,68.803802,0.831132,-0.194914,0.84201,Latin America and Caribbean,14.428413
4,armenia,29800,2976372,98.6,armenia,4.6768,0.757479,66.750656,0.712018,-0.13878,0.773545,Commonwealth of Independent States,99.878255


What is the happiness score of the 3 countries with the lowest population density?

In [61]:
country_features.sort_values(by = "population_density", ascending=True)['happiness_score'].head(3)

83    5.4562
5     7.2228
14    3.4789
Name: happiness_score, dtype: float64
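
The output above lists only the scores, so the countries have to be looked up by index. A possible variant, sketched with `nsmallest` on the five preview rows shown earlier (so the result is not the answer for the full dataset), keeps the country name next to the score.

```python
import pandas as pd

# The five preview rows from the head() output above; not the full dataset.
demo = pd.DataFrame({
    "country_name": ["afghanistan", "albania", "algeria", "argentina", "armenia"],
    "population_density": [47.964474, 124.587971, 13.826065, 14.428413, 99.878255],
    "happiness_score": [2.5669, 4.8827, 5.0051, 5.9747, 4.6768],
})

# nsmallest returns the n rows with the lowest values, already sorted ascending.
lowest3 = demo.nsmallest(3, "population_density")[
    ["country_name", "population_density", "happiness_score"]
]

print(lowest3)
```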

## Task 6: Healthy and happy?

Draw a scatter plot of the happiness score (x) against healthy life expectancy (y).

In [64]:
happiness.plot.scatter(x = "happiness_score", y = "healthy_life_expectancy")

<Axes: xlabel='happiness_score', ylabel='healthy_life_expectancy'>
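
To put a number on what the scatter plot suggests, one option is the Pearson correlation between the two columns. The sketch below uses only the five preview rows shown earlier, so its value is illustrative and says little about the full dataset.

```python
import pandas as pd

# The five preview rows from the head() output above; not the full dataset.
demo = pd.DataFrame({
    "happiness_score": [2.5669, 4.8827, 5.0051, 5.9747, 4.6768],
    "healthy_life_expectancy": [52.59, 68.708138, 65.905174, 68.803802, 66.750656],
})

# Series.corr computes the Pearson correlation coefficient by default.
r = demo["happiness_score"].corr(demo["healthy_life_expectancy"])

print(f"Pearson r on the preview rows: {r:.2f}")
```

On the real data the same call would be `happiness["happiness_score"].corr(happiness["healthy_life_expectancy"])`. A correlation describes association only and does not show that one variable causes the other.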

Feel free to continue the exploration of the dataset! We'll release the solutions next week.

----
Enjoy EPFL and be happy, next year Switzerland must be #1.