## Data from World Happiness Report

The World Happiness Report is an annual publication of the United Nations Sustainable Development Solutions Network. It contains articles and rankings of national happiness based on respondents' ratings of their own lives, which the report also correlates with various life factors.

In this notebook we will explore the happiness of different countries and the features associated with it.
The datasets that we will use are available in *Data*: **happiness2020.csv** and **countries_info.csv**.

Although the features are self-explanatory, here is a summary:

**happiness2020.csv**
* country: *Name of the country*
* happiness_score: *Happiness score*
* social_support: *Social support (mitigating the effects of inequality)*
* healthy_life_expectancy: *Healthy Life Expectancy*
* freedom_of_choices: *Freedom to make life choices*
* generosity: *Generosity (charity, volunteers)*
* perception_of_corruption: *Corruption Perception*
* world_region: *Area of the world of the country*

**countries_info.csv**
* country_name: *Name of the country*
* area: *Area in sq mi*
* population: *Number of people*
* literacy: *Literacy percentage*

In [1]:
!head Data/countries_info.csv

"head" is not recognized as an internal or external command,
 operable program or batch file.
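The `head` shell command is not available in the Windows command prompt, which is why the cell above fails there. A portable alternative is to read the first lines in Python. The helper below is a minimal sketch; the name `head` and its parameters are illustrative and not part of the original notebook.

```python
from itertools import islice
from pathlib import Path


def head(path, n=10, encoding="utf-8"):
    """Return the first n lines of a text file, without trailing newlines."""
    with Path(path).open(encoding=encoding) as f:
        # islice stops after n lines, so large files are not read in full
        return [line.rstrip("\n") for line in islice(f, n)]
```

In the notebook it could be called as `print("\n".join(head("Data/countries_info.csv")))`, assuming the file is UTF-8 encoded.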


In [2]:
import pandas as pd
%matplotlib inline

DATA_FOLDER = 'Data/'

HAPPINESS_DATASET = DATA_FOLDER+"happiness2020.csv"
COUNTRIES_DATASET = DATA_FOLDER+"countries_info.csv"

## Task 1: Load the data

Load the 2 datasets into pandas dataframes (called *happiness* and *countries*), and show the first rows.


**Hint**: Use the correct reader and verify the data has the expected format.

In [3]:
# Write your code here
df_happiness = pd.read_csv(HAPPINESS_DATASET)
df_happiness['country'] = df_happiness['country'].str.lower()
df_countries = pd.read_csv(COUNTRIES_DATASET)

#print(df_happiness.head())
print(df_countries.head())

  country_name     area  population literacy
0  afghanistan   647500    31056997     36,0
1      albania    28748     3581655     86,5
2      algeria  2381740    32930091     70,0
3    argentina  2766890    39921833     97,1
4      armenia    29800     2976372     98,6
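The output above shows `literacy` written with a decimal comma, which suggests pandas loaded it as text rather than as a number. One way to verify the format is to inspect the shape and the column types right after loading. The sketch below re-types the first rows as CSV text; the assumption that the real file quotes the literacy field is ours.

```python
import io

import pandas as pd

# First rows as printed above, re-typed as CSV text
# (assumption: the real file quotes the literacy field)
sample_csv = io.StringIO(
    'country_name,area,population,literacy\n'
    'afghanistan,647500,31056997,"36,0"\n'
    'albania,28748,3581655,"86,5"\n'
    'algeria,2381740,32930091,"70,0"\n'
)
sample = pd.read_csv(sample_csv)

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # one type per column

# False here means the column still needs converting before any arithmetic
literacy_is_numeric = pd.api.types.is_numeric_dtype(sample["literacy"])
print(literacy_is_numeric)
```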


## Task 2: Let's merge the data

Create a dataframe called *country_features* by merging *happiness* and *countries*. A row of this dataframe must describe all the features that we have about a country.

**Hint**: Verify that all the rows are in the final dataframe.

In [34]:
# Write your code here

country_features = pd.merge(df_happiness, df_countries, left_on='country', right_on='country_name')
country_features = country_features.drop(columns=['country_name'])  # 'country' and 'country_name' hold the same values
# Equivalent: country_features = country_features.drop('country_name', axis=1)  # axis=1 means columns
country_features

| | country | happiness_score | social_support | healthy_life_expectancy | freedom_of_choices | generosity | perception_of_corruption | world_region | area | population | literacy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | afghanistan | 2.5669 | 0.470367 | 52.590000 | 0.396573 | -0.096429 | 0.933687 | South Asia | 647500 | 31056997 | 36,0 |
| 1 | albania | 4.8827 | 0.671070 | 68.708138 | 0.781994 | -0.042309 | 0.896304 | Central and Eastern Europe | 28748 | 3581655 | 86,5 |
| 2 | algeria | 5.0051 | 0.803385 | 65.905174 | 0.466611 | -0.121105 | 0.735485 | Middle East and North Africa | 2381740 | 32930091 | 70,0 |
| 3 | argentina | 5.9747 | 0.900568 | 68.803802 | 0.831132 | -0.194914 | 0.842010 | Latin America and Caribbean | 2766890 | 39921833 | 97,1 |
| 4 | armenia | 4.6768 | 0.757479 | 66.750656 | 0.712018 | -0.138780 | 0.773545 | Commonwealth of Independent States | 29800 | 2976372 | 98,6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 130 | venezuela | 5.0532 | 0.890408 | 66.505341 | 0.623278 | -0.169091 | 0.837038 | Latin America and Caribbean | 912050 | 25730435 | 93,4 |
| 131 | vietnam | 5.3535 | 0.849987 | 67.952736 | 0.939593 | -0.094533 | 0.796421 | Southeast Asia | 329560 | 84402966 | 90,3 |
| 132 | yemen | 3.5274 | 0.817981 | 56.727283 | 0.599920 | -0.157735 | 0.800288 | Middle East and North Africa | 527970 | 21456188 | 50,2 |
| 133 | zambia | 3.7594 | 0.698824 | 55.299377 | 0.806500 | 0.078037 | 0.801290 | Sub-Saharan Africa | 752614 | 11502010 | 80,6 |
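The hint for this task asks to verify that no row was lost. By default `pd.merge` does an inner join, which silently drops countries that appear in only one dataframe. A possible check is an outer merge with `indicator=True`, sketched here on toy frames that reuse a few values from the output above. The names `happiness_toy` and `countries_toy` are illustrative, and algeria is left out of the second frame on purpose.

```python
import pandas as pd

happiness_toy = pd.DataFrame({
    "country": ["afghanistan", "albania", "algeria"],
    "happiness_score": [2.5669, 4.8827, 5.0051],
})
# algeria is deliberately missing here to simulate a mismatch
countries_toy = pd.DataFrame({
    "country_name": ["afghanistan", "albania"],
    "area": [647500, 28748],
})

# how="outer" keeps every row; indicator=True adds a "_merge" column
# with the values "both", "left_only" or "right_only"
check = happiness_toy.merge(
    countries_toy,
    left_on="country",
    right_on="country_name",
    how="outer",
    indicator=True,
)

unmatched = check[check["_merge"] != "both"]
unmatched_names = unmatched["country"].tolist()
print(unmatched[["country", "country_name", "_merge"]])
```

If `unmatched` is empty on the real data, the inner merge kept every country.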


## Task 3: Where are people happier?

Print the top 10 countries based on their happiness score (higher is better).

In [11]:
# Write your code here

top_10_countries = country_features.sort_values(by='happiness_score', ascending=False).head(10)
print(top_10_countries[['country','happiness_score']])

         country  happiness_score
38       finland           7.8087
31       denmark           7.6456
115  switzerland           7.5599
50       iceland           7.5045
92        norway           7.4880
87   netherlands           7.4489
114       sweden           7.3535
88   new zealand           7.2996
6        austria           7.2942
72    luxembourg           7.2375


We want to know in which world region people are happiest.

Create and print a dataframe with (1) the average happiness score and (2) the number of countries for each world region.
Sort the result to show the happiness ranking.

In [7]:
# Write your code here

region_stats = country_features.groupby('world_region').agg(
    average_happiness=('happiness_score', 'mean'),  # Average happiness score
    number_of_countries=('country', 'count')        # Number of countries
).reset_index()


region_stats_sorted = region_stats.sort_values(by='average_happiness', ascending=False).head(10)
print(region_stats_sorted)

                         world_region  average_happiness  number_of_countries
5               North America and ANZ           7.173525                    4
9                      Western Europe           6.967405                   20
3         Latin America and Caribbean           5.971280                   20
0          Central and Eastern Europe           5.891393                   14
7                      Southeast Asia           5.517788                    8
2                           East Asia           5.483633                    3
1  Commonwealth of Independent States           5.358342                   12
4        Middle East and North Africa           5.269306                   16
8                  Sub-Saharan Africa           4.393856                   32
6                          South Asia           4.355083                    6


The first region has only a few countries! Which are they, and what are their scores?

In [26]:
# Write your code here

countries_in_region = country_features[country_features['world_region'] == 'North America and ANZ']
# Print the names of the countries
print(countries_in_region['country'].tolist())  

['australia', 'canada', 'new zealand', 'united states']
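The question also asks for the scores, while the cell above lists only the names. A small sketch of selecting both columns at once with `.loc` follows. It runs on a toy frame built from values shown elsewhere in this notebook (scores rounded to four decimals), so it covers only two of the four countries in the region.

```python
import pandas as pd

# Toy frame using values that appear in this notebook's outputs
toy = pd.DataFrame({
    "country": ["australia", "new zealand", "finland", "denmark"],
    "happiness_score": [7.2228, 7.2996, 7.8087, 7.6456],
    "world_region": [
        "North America and ANZ",
        "North America and ANZ",
        "Western Europe",
        "Western Europe",
    ],
})

# Boolean mask for the rows, list of labels for the columns
in_region = toy.loc[
    toy["world_region"] == "North America and ANZ",
    ["country", "happiness_score"],
]
print(in_region.to_string(index=False))
```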


## Task 4: How literate is the world?

Print the names of the countries with a level of literacy of 100%. 

For each country, print the name and the world region in the format: *{region name} - {country name} ({happiness score})*

In [42]:
# Write your code here

# Literacy uses a decimal comma ("36,0"); astype(str) keeps this cell safe to re-run once the column is already float
country_features['literacy'] = country_features['literacy'].astype(str).str.replace(',', '.', regex=False).astype(float)
literacy_level = country_features[country_features['literacy'] == 100.0]

In [41]:
def format_row(row):
    return f"{row['world_region']} - {row['country']} ({row['happiness_score']})"

# .apply with axis=1 calls the function once per row; 'row' is the Series holding that row
formatted_features = literacy_level.apply(format_row, axis=1)

# Equivalent with a lambda (an anonymous function):
# formatted_features = literacy_level.apply(lambda row: f"{row['world_region']} - {row['country']} ({row['happiness_score']})", axis=1)
formatted_features

5     North America and ANZ - australia (7.222799778)
31             Western Europe - denmark (7.645599842)
38             Western Europe - finland (7.808700085)
72          Western Europe - luxembourg (7.237500191)
92        Western Europe - norway (7.487999916000001)
dtype: object

What is the global average literacy?

In [43]:
# Write your code here

global_average = country_features['literacy'].mean()
global_average

81.85112781954888

Calculate the proportion of countries with a literacy level below 50%. Print the value in percentage, formatted with 2 decimals.

In [81]:
# Write your code here
literacy_below_50 = country_features[country_features['literacy'] < 50.0]
total_countries = len(country_features)
countries_below_50_count = len(literacy_below_50)
proportion_below_50 = (countries_below_50_count / total_countries) * 100

# Print the result formatted to 2 decimal places
print(f"Proportion of countries with literacy below 50%: {proportion_below_50:.2f}%")

Proportion of countries with literacy below 50%: 11.85%
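The same proportion can be computed more compactly: the mean of a boolean Series is the fraction of `True` values. The sketch below uses the five literacy values from the Task 1 output, so its result differs from the full-dataset figure above.

```python
import pandas as pd

# Literacy values of the first five countries shown in Task 1
literacy = pd.Series([36.0, 86.5, 70.0, 97.1, 98.6])

# (literacy < 50.0) is a boolean Series; its mean is the share of True values
share_below_50 = (literacy < 50.0).mean() * 100
print(f"Proportion of countries with literacy below 50%: {share_below_50:.2f}%")
```

Rows with a missing literacy value compare as `False` and still count in the denominator, which matches the `len`-based computation above.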


Print the raw number and the percentage of world population that is illiterate.

In [83]:
# Write your code here

# Estimated number of illiterate people per country
country_features['illiterate_population'] = country_features['population'] * (1 - country_features['literacy'] / 100)
illiterate_population = country_features['illiterate_population'].sum()
total_population = country_features['population'].sum()
illiterate_percentage = (illiterate_population / total_population) * 100
illiterate_percentage

20.32996582965084
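The task asks for both the raw number and the percentage. The sketch below prints the two together on a toy frame with the five countries from the Task 1 output, so the figures it produces are for that subset only, not for the world.

```python
import pandas as pd

# First five countries from the Task 1 output, literacy already converted to float
toy = pd.DataFrame({
    "country": ["afghanistan", "albania", "algeria", "argentina", "armenia"],
    "population": [31056997, 3581655, 32930091, 39921833, 2976372],
    "literacy": [36.0, 86.5, 70.0, 97.1, 98.6],
})

# Illiterate people per country = population * share of people who are not literate
illiterate = toy["population"] * (1 - toy["literacy"] / 100)

illiterate_total = illiterate.sum()
population_total = toy["population"].sum()
illiterate_pct = illiterate_total / population_total * 100

print(f"Illiterate population: {illiterate_total:,.0f}")
print(f"Share of total population: {illiterate_pct:.2f}%")
```

This is a population-weighted figure, so it differs from a plain average of the per-country literacy rates.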

## Task 5: Population density

Add to the dataframe a new field called *population_density* computed by dividing *population* by *area*.

In [84]:
# Write your code here

country_features['population_density'] = country_features['population'] / country_features['area']

What is the happiness score of the 3 countries with the lowest population density?

In [85]:
# Write your code here

# ascending=True puts the lowest densities first
lowest_3_densities = country_features.sort_values(by='population_density', ascending=True).head(3)
print(lowest_3_densities['happiness_score'])
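Selecting the lowest values depends on the sort direction: `ascending=True` puts the smallest densities first. `DataFrame.nsmallest` makes that intent explicit, as sketched below on the five countries whose area, population and score appear in the earlier outputs. The result holds for this subset only.

```python
import pandas as pd

# Values taken from the merged dataframe output in Task 2
toy = pd.DataFrame({
    "country": ["afghanistan", "albania", "algeria", "argentina", "armenia"],
    "happiness_score": [2.5669, 4.8827, 5.0051, 5.9747, 4.6768],
    "area": [647500, 28748, 2381740, 2766890, 29800],
    "population": [31056997, 3581655, 32930091, 39921833, 2976372],
})
toy["population_density"] = toy["population"] / toy["area"]

# nsmallest(3, column) returns the 3 rows with the lowest values, lowest first
lowest_3 = toy.nsmallest(3, "population_density")
print(lowest_3[["country", "population_density", "happiness_score"]])
```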


## Task 6: Healthy and happy?

Plot in a scatter plot the happiness score (x) and healthy life expectancy (y).

In [86]:
# Write your code here

country_features.plot.scatter(x='happiness_score', y='healthy_life_expectancy')

<Axes: xlabel='happiness_score', ylabel='healthy_life_expectancy'>
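As one way to continue the exploration, the relationship shown in the scatter plot can be summarised with a Pearson correlation coefficient via `Series.corr`. This is a sketch on the five countries visible in the Task 2 output, so the value it gives is only indicative and not the full-dataset result.

```python
import pandas as pd

# First five rows of the merged dataframe shown in Task 2
toy = pd.DataFrame({
    "happiness_score": [2.5669, 4.8827, 5.0051, 5.9747, 4.6768],
    "healthy_life_expectancy": [52.590000, 68.708138, 65.905174, 68.803802, 66.750656],
})

# Series.corr defaults to the Pearson correlation, which ranges from -1 to 1
r = toy["happiness_score"].corr(toy["healthy_life_expectancy"])
print(f"Pearson r on the sample: {r:.2f}")
```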

Feel free to continue the exploration of the dataset! We'll release the solutions next week.

----
Enjoy EPFL and be happy, next year Switzerland must be #1.