In [2]:
import pandas as pd

In [3]:
PATH_HAPPY = "Data/world-happiness-report-2021.csv"
PATH_DRUNK = "Data/alcohol-consumption.csv"

# Data Visualisation
### Milestone 1
---

## Datasets [Mohamed]
Our work focuses on 2 datasets:
- Results of the 2021 [World Happiness Report](https://en.wikipedia.org/wiki/World_Happiness_Report) conducted by the Gallup World Poll (GWP) in about 150 countries.
- Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age) in 217 historically identifiable countries and 49 aggregates/areas (e.g. EU). The data is available only for selected years, which vary by country.

### World Happiness Report dataset

The dataset contains yearly happiness scores by country (= item) in the period 2006 through 2022.
To obtain the scores, respondents are asked to rate their own lives on a 0 to 10 scale known as the [Cantril Scale](https://news.gallup.com/poll/122453/understanding-gallup-uses-cantril-scale.aspx).

The survey is conducted at a semi-annual to biennial frequency, depending on the country.
The sample size for each country is about 1,000 people on average, but can reach 2,000 for bigger countries like China or Russia. Samples are weighted to correct for selection bias, nonresponse and other issues.
[[1](https://www.gallup.com/178667/gallup-world-poll-work.aspx)]

Other than the happiness score, the dataset also contains the following attributes:
- regional indicator of the country,
- standard error of the score,
- lower and upper whiskers,
- 6 indicators (columns) that may contribute to a happier life:
    - Economic production
    - Social support
    - Life expectancy
    - Freedom
    - Absence of corruption
    - Generosity
- "Ladder score in Dystopia",
- "Dystopia + residual",
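- attributes named "Explained by:" corresponding to each of the 6 indicators. These do not look like R<sup>2</sup> coefficients: in the rows previewed further down, the six "Explained by:" values plus "Dystopia + residual" add up to the ladder score (up to rounding), so each one reads as the estimated contribution of that indicator to the country's score.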
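
As a sanity check on how the "Explained by:" columns can be read, the sketch below sums them for two rows copied from the `df_happy` preview in this notebook and compares the total, plus "Dystopia + residual", with the ladder score. The variable names (`rows`, `explained_cols`, `reconstructed`, `max_gap`) are ours. If the columns are additive contributions to the score, as the numbers suggest, the gap should stay within rounding error.

```python
import pandas as pd

# Two rows copied from the df_happy preview shown in this notebook.
rows = pd.DataFrame(
    {
        "Country name": ["Finland", "Denmark"],
        "Ladder score": [7.842, 7.620],
        "Explained by: Log GDP per capita": [1.446, 1.502],
        "Explained by: Social support": [1.106, 1.108],
        "Explained by: Healthy life expectancy": [0.741, 0.763],
        "Explained by: Freedom to make life choices": [0.691, 0.686],
        "Explained by: Generosity": [0.124, 0.208],
        "Explained by: Perceptions of corruption": [0.481, 0.485],
        "Dystopia + residual": [3.253, 2.868],
    }
)

# Select the six "Explained by:" columns by their common prefix.
explained_cols = [c for c in rows.columns if c.startswith("Explained by:")]

# Sum of the six contributions plus the "Dystopia + residual" term.
reconstructed = rows[explained_cols].sum(axis=1) + rows["Dystopia + residual"]

# Largest absolute difference between the reconstruction and the reported score.
max_gap = (reconstructed - rows["Ladder score"]).abs().max()
print(max_gap)
```

The same check can be run on the full dataset by using `df_happy` in place of `rows`; a tolerance of a few thousandths is reasonable because the published values are rounded to three decimals.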

Some countries lack a happiness score for some years. In those cases we interpolate the missing values so that every country has a happiness score for each year from 2006 to 2021.

### Alcohol consumption per capita, year and country dataset
The dataset contains yearly alcohol consumption per capita for each country of the world as well as for world regions (e.g. Africa Eastern and Southern, Arab World).

The value corresponds to the liters of pure alcohol consumed per capita over a calendar year by people aged 15 or older. It is adjusted for tourist consumption.

The data covers five years between 2000 and 2018 (2000, 2005, 2010, 2015 and 2018), so most countries have 5 entries. For countries with fewer entries (for example Afghanistan), we would like to interpolate the missing values, as for the other dataset.
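
One possible way to complete the missing years is sketched below, using the Afghanistan rows from the `df_drunk` preview in this notebook. It is only a sketch under assumptions: the shortened column name `alcohol_l_per_capita`, the helper `fill_years` and the choice to carry the nearest known value into leading or trailing gaps are ours, not a decision already made for the project.

```python
import pandas as pd

SURVEY_YEARS = [2000, 2005, 2010, 2015, 2018]  # years listed in the text above
VALUE = "alcohol_l_per_capita"  # shortened column name, chosen for this sketch

# Afghanistan rows copied from the df_drunk preview; 2000 and 2005 are missing.
df = pd.DataFrame(
    {
        "Entity": ["Afghanistan"] * 3,
        "Year": [2010, 2015, 2018],
        VALUE: [0.21, 0.21, 0.21],
    }
)


def fill_years(group: pd.DataFrame) -> pd.DataFrame:
    """Reindex one entity on SURVEY_YEARS and fill the gaps."""
    series = group.set_index("Year")[VALUE].reindex(SURVEY_YEARS)
    # method="index" interpolates linearly in the year value, which matters
    # because the years are unevenly spaced (2015 -> 2018).
    # limit_direction="both" also fills gaps at either end of the series;
    # those end values are copies of the nearest known value (an assumption).
    filled_series = series.interpolate(method="index", limit_direction="both")
    return filled_series.rename_axis("Year").reset_index()


# Apply the helper entity by entity and stitch the results back together.
filled = pd.concat(
    [fill_years(g).assign(Entity=name) for name, g in df.groupby("Entity")],
    ignore_index=True,
)
print(filled)
```

For Afghanistan the two missing years sit before the first observation, so what happens there is strictly an extrapolation rather than an interpolation; whether that is acceptable should be decided before the values are plotted.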

In [6]:
df_happy = pd.read_csv(PATH_HAPPY)
df_happy.head(10)

Unnamed: 0,Country name,Regional indicator,Ladder score,Standard error of ladder score,upperwhisker,lowerwhisker,Logged GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption,Ladder score in Dystopia,Explained by: Log GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption,Dystopia + residual
0,Finland,Western Europe,7.842,0.032,7.904,7.78,10.775,0.954,72.0,0.949,-0.098,0.186,2.43,1.446,1.106,0.741,0.691,0.124,0.481,3.253
1,Denmark,Western Europe,7.62,0.035,7.687,7.552,10.933,0.954,72.7,0.946,0.03,0.179,2.43,1.502,1.108,0.763,0.686,0.208,0.485,2.868
2,Switzerland,Western Europe,7.571,0.036,7.643,7.5,11.117,0.942,74.4,0.919,0.025,0.292,2.43,1.566,1.079,0.816,0.653,0.204,0.413,2.839
3,Iceland,Western Europe,7.554,0.059,7.67,7.438,10.878,0.983,73.0,0.955,0.16,0.673,2.43,1.482,1.172,0.772,0.698,0.293,0.17,2.967
4,Netherlands,Western Europe,7.464,0.027,7.518,7.41,10.932,0.942,72.4,0.913,0.175,0.338,2.43,1.501,1.079,0.753,0.647,0.302,0.384,2.798
5,Norway,Western Europe,7.392,0.035,7.462,7.323,11.053,0.954,73.3,0.96,0.093,0.27,2.43,1.543,1.108,0.782,0.703,0.249,0.427,2.58
6,Sweden,Western Europe,7.363,0.036,7.433,7.293,10.867,0.934,72.7,0.945,0.086,0.237,2.43,1.478,1.062,0.763,0.685,0.244,0.448,2.683
7,Luxembourg,Western Europe,7.324,0.037,7.396,7.252,11.647,0.908,72.6,0.907,-0.034,0.386,2.43,1.751,1.003,0.76,0.639,0.166,0.353,2.653
8,New Zealand,North America and ANZ,7.277,0.04,7.355,7.198,10.643,0.948,73.4,0.929,0.134,0.242,2.43,1.4,1.094,0.785,0.665,0.276,0.445,2.612
9,Austria,Western Europe,7.268,0.036,7.337,7.198,10.906,0.934,73.3,0.908,0.042,0.481,2.43,1.492,1.062,0.782,0.64,0.215,0.292,2.784


In [5]:
df_drunk = pd.read_csv(PATH_DRUNK)
df_drunk[:5]

Unnamed: 0,Entity,Code,Year,"Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)"
0,Afghanistan,AFG,2010,0.21
1,Afghanistan,AFG,2015,0.21
2,Afghanistan,AFG,2018,0.21
3,Africa Eastern and Southern,,2000,5.014051
4,Africa Eastern and Southern,,2005,4.856588
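
In the preview above, the regional aggregate has no value in `Code` while the country does. Assuming this holds for the whole file (which should be verified, since some aggregates may carry a code), countries and aggregates could be separated as in the sketch below. The rows are copied from the preview, and the value column is renamed to `alcohol_l_per_capita` for brevity.

```python
import pandas as pd

# Rows copied from the df_drunk preview (value column name shortened here).
df = pd.DataFrame(
    {
        "Entity": [
            "Afghanistan",
            "Afghanistan",
            "Afghanistan",
            "Africa Eastern and Southern",
            "Africa Eastern and Southern",
        ],
        "Code": ["AFG", "AFG", "AFG", None, None],
        "Year": [2010, 2015, 2018, 2000, 2005],
        "alcohol_l_per_capita": [0.21, 0.21, 0.21, 5.014051, 4.856588],
    }
)

# Assumption: aggregates are exactly the rows without a country code.
is_aggregate = df["Code"].isna()
countries = df[~is_aggregate]
aggregates = df[is_aggregate]

print(countries["Entity"].unique())
print(aggregates["Entity"].unique())
```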


TODO : 
- Find a dataset (or multiple) that you will explore. Assess the quality of the data it
contains and how much preprocessing / data-cleaning it will require before tackling
visualization. We recommend using a standard dataset as this course is not about
scraping nor data processing.
- Add info about datasets (year, source...)
- Add info about columns (description, what they contain...)

## Problematic [Stefan + Andrey (visualization)]

### Overview

Money does not make you happy… or does it? The most popular view of the relation between happiness* and life quality presents it as [logarithmic](https://www.cnbc.com/2015/12/14/money-can-buy-happiness-but-only-to-a-point.html). But what if we look at the country level? Happiness is certainly correlated with GDP per capita, but some very surprising cases lie out of bounds. What are those countries, and what other factors play a role? Is there any interesting correlation that could “explain” happy but not so rich countries, or sad rich states?

We would indeed expect a positive link between a country's happiness ("life ladder") and social support, health, freedom of choice and generosity. Conversely, perceived corruption is expected to decrease happiness. Further analysis of this topic will be addressed in the next section.

As a second part, we will investigate a less serious (but a bit funnier - at least it should make **us** happy) relation. The Association for Psychological Science has reported on studies showing that alcohol works as a social [lubricant](https://www.psychologicalscience.org/news/alcohol-is-a-social-lubricant-study-confirms.html). Alcohol is supposed to minimize negative emotions. We will therefore look for such relations at the country level: do some countries tend to be happier than others, all other parameters being equal, just because they drink alcohol? Or at least, is social support higher due to alcohol consumption? Although we of course cannot demonstrate any causal link, this could be of interest for future research.

*happiness = positive affect, not feeling blue, and low stress

### Target Audience

- Individuals interested in insights on the well-being and consumption habits worldwide
- Casual alcohol consumers eager to compare drinking habits in different countries
- Young professionals looking to find the country that will bring the most joy to their future family
- Politicians who would like to strategize on how to improve their citizens' lives

### Motivation

[TODO]

### Visualization (Andrey)

## Exploratory data analysis (Andrey/Stefan)

TODO : 
- Pre-process the data of your choice
- Show some basic statistics and get insights about the data
    - Comment on the "explained by" columns
    - Show correlations between columns
    - (optional) Linear regression results
- Show some example graphs (Andrey)
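
As a starting point for the pre-processing and correlation items, here is a minimal sketch of how the two datasets could be joined on the country name and a correlation computed. The helper names (`latest_per_entity`, `merge_on_country`), the `liters` column and all numbers in the toy frames are placeholders invented for illustration; they are not real measurements and not the final pipeline.

```python
import pandas as pd


def latest_per_entity(df_alcohol: pd.DataFrame, value_col: str) -> pd.DataFrame:
    """Keep, for each entity, the row with the most recent year."""
    idx = df_alcohol.groupby("Entity")["Year"].idxmax()
    return df_alcohol.loc[idx, ["Entity", "Year", value_col]]


def merge_on_country(df_happy: pd.DataFrame, df_alcohol_latest: pd.DataFrame) -> pd.DataFrame:
    """Outer join, so names present in only one dataset remain visible."""
    return df_happy.merge(
        df_alcohol_latest,
        left_on="Country name",
        right_on="Entity",
        how="outer",
        indicator=True,  # adds a "_merge" column: both / left_only / right_only
    )


# Placeholder values for illustration only, NOT real measurements.
toy_happy = pd.DataFrame(
    {"Country name": ["A", "B", "C", "D"], "Ladder score": [7.0, 6.0, 5.0, 4.0]}
)
toy_alcohol = pd.DataFrame(
    {
        "Entity": ["A", "A", "B", "C", "E"],
        "Year": [2015, 2018, 2018, 2018, 2018],
        "liters": [9.0, 10.0, 8.0, 6.0, 3.0],
    }
)

merged = merge_on_country(toy_happy, latest_per_entity(toy_alcohol, "liters"))

# Names that did not match need manual inspection (spelling variants, aggregates).
unmatched = merged[merged["_merge"] != "both"]
matched = merged[merged["_merge"] == "both"]

# Pearson correlation between the two variables on the matched countries.
pearson = matched["Ladder score"].corr(matched["liters"])
print(unmatched[["Country name", "Entity", "_merge"]])
print(pearson)
```

The outer join with `indicator=True` is a deliberate choice here: country names often differ between sources, and listing the unmatched rows makes those mismatches visible instead of silently dropping them.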

## Related work (Mohamed/Andrey)

TODO : 
- What others have already done with the data?
- Why is your approach original?
- What source of inspiration do you take? Visualizations that you found on other
websites or magazines (might be unrelated to your data).
- In case you are using a dataset that you have already explored in another
context (ML or ADA course, semester project...), you are required to share the
report of that work to outline the differences with the submission for this class