# EDA for Happiness

This Jupyter Notebook is an EDA report. It runs on a Python kernel inside a venv and reads the CSV files in the **data** folder to start the process described in workshop 003.

Next, three processes will be carried out:

1. Exploration of the data in the 5 files of the **data** folder, to understand them and choose the best methods for predicting happiness.

2. We will draw a path to implement Airflow in the best possible way.

3. We will explore different methods and columns that are available to predict in the best way.

---
## Process

Import libraries:

In [1]:
import importlib.util
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import json
from plotly.offline import init_notebook_mode, iplot
from plotly.subplots import make_subplots
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import joblib

spec = importlib.util.spec_from_file_location("connect_database", "../shared_functions/connect_database.py")
connect_database = importlib.util.module_from_spec(spec)
spec.loader.exec_module(connect_database)

### 1. workshop_003_dag <- happiness process

Information about the CSV files to load and explore:

In [2]:
location_files = [
    '../data/2015.csv', 
    '../data/2016.csv', 
    '../data/2017.csv', 
    '../data/2018.csv', 
    '../data/2019.csv'
    ]

Read CSV files and create the dataframes:

<div class="alert alert-block alert-info">
<b>Note:</b> These files apparently have different columns and data, so each one will be read and scanned.</div>
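
Before scanning each file one by one, a quick way to see how the schemas differ is to compare the column sets. The sketch below is illustrative only: `column_differences` is a hypothetical helper, and the toy frames stand in for the real yearly files.

```python
import pandas as pd


def column_differences(frames):
    """Return, per year, the columns of the union that the frame lacks."""
    all_columns = set()
    for frame in frames.values():
        all_columns.update(frame.columns)
    return {
        year: sorted(all_columns - set(frame.columns))
        for year, frame in frames.items()
    }


# Toy frames standing in for two of the yearly files (values are illustrative).
toy_frames = {
    "2015": pd.DataFrame({"Country": ["A"], "Region": ["X"], "Standard Error": [0.03]}),
    "2018": pd.DataFrame({"Country or region": ["A"], "Score": [7.6]}),
}
missing = column_differences(toy_frames)
print(missing)
```

With the real data, the same helper could be fed a dictionary built from `location_files`.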

#### 1. 2015 dataset

In [3]:
data_2015 = pd.read_csv(location_files[0], delimiter=',', header=0)

In [4]:
data_2015.head(3)

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204


Here, we know that it has been loaded correctly and we have some information regarding the columns and what could be the types of data stored in each column.

Let's continue exploring:

In [5]:
data_2015.columns

Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Standard Error', 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual'],
      dtype='object')

In [6]:
len(data_2015.columns)

12

In [7]:
data_2015.shape

(158, 12)

Checking for NaN values:

In [8]:
data_2015.isna().sum()

Country                          0
Region                           0
Happiness Rank                   0
Happiness Score                  0
Standard Error                   0
Economy (GDP per Capita)         0
Family                           0
Health (Life Expectancy)         0
Freedom                          0
Trust (Government Corruption)    0
Generosity                       0
Dystopia Residual                0
dtype: int64

It is good news that there are no nulls: it saves us a big cleaning task.

In [9]:
data_2015.dtypes

Country                           object
Region                            object
Happiness Rank                     int64
Happiness Score                  float64
Standard Error                   float64
Economy (GDP per Capita)         float64
Family                           float64
Health (Life Expectancy)         float64
Freedom                          float64
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
dtype: object

In [10]:
data_2015.describe()

Unnamed: 0,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
count,158.0,158.0,158.0,158.0,158.0,158.0,158.0,158.0,158.0,158.0
mean,79.493671,5.375734,0.047885,0.846137,0.991046,0.630259,0.428615,0.143422,0.237296,2.098977
std,45.754363,1.14501,0.017146,0.403121,0.272369,0.247078,0.150693,0.120034,0.126685,0.55355
min,1.0,2.839,0.01848,0.0,0.0,0.0,0.0,0.0,0.0,0.32858
25%,40.25,4.526,0.037268,0.545808,0.856823,0.439185,0.32833,0.061675,0.150553,1.75941
50%,79.5,5.2325,0.04394,0.910245,1.02951,0.696705,0.435515,0.10722,0.21613,2.095415
75%,118.75,6.24375,0.0523,1.158448,1.214405,0.811013,0.549092,0.180255,0.309883,2.462415
max,158.0,7.587,0.13693,1.69042,1.40223,1.02525,0.66973,0.55191,0.79588,3.60214


Analyzing, we obtain that:

- **Happiness Rank:** The rank ranges from 1 to 158, with a mean and median around 79.5. The high standard deviation (45.75) reflects a wide, nearly uniform spread, as expected for a ranking.

- **Happiness Score (To predict):** Happiness scores range from 2.839 to 7.587, with a mean of 5.3757 and a median of 5.2325, indicating that most scores are centered around these values. The standard deviation (1.145) shows that there is moderate variability in happiness across countries.

- **Standard Error:** The standard error is low on average (0.047885), indicating that the happiness scores have reasonable precision. The small standard deviation (0.017146) suggests that this precision is relatively consistent across countries.

- **Economy (GDP per Capita):** GDP per capita varies considerably, with values ranging from 0 to 1.69042. The mean (0.846137) and median (0.910245) are near the center of the range, but the high standard deviation (0.403121) indicates considerable economic disparity between countries.

    Economy and financial well-being are often strongly correlated with happiness. The ability to meet basic needs and enjoy a good quality of life is highly dependent on GDP per capita.

- **Family:** Family support shows values from 0 to 1.40223, with a mean close to 1 (0.991046) and a median of 1.02951, suggesting that most countries have good family support. The standard deviation (0.272369) indicates moderate variability in this factor.

    Family and interpersonal relationships are key factors in perceived happiness. Human beings are social by nature, and supportive relationships significantly influence their happiness.

- **Health (Life Expectancy):** Life expectancy ranges from 0 to 1.02525, with a mean of 0.630259 and a median of 0.696705. The standard deviation (0.247078) suggests significant differences in health between countries.

    Health and life expectancy are direct indicators of physical well-being, which has a considerable impact on happiness. People in good health tend to be happier.

- **Freedom:** Perception of freedom ranges from 0 to 0.66973, with a mean of 0.428615 and a median of 0.435515, indicating that most countries have a moderate perception of freedom.

    The freedom to make decisions about one's life is an important component of happiness. Societies with greater personal freedoms tend to have happier citizens.

- **Trust (Government Corruption):** Trust in government (absence of corruption) ranges from 0 to 0.55191, with a low mean (0.143422) and a median of 0.10722. The standard deviation (0.120034) suggests that the perception of corruption varies significantly across countries.

    Trust in institutions and perceptions of government corruption affect emotional stability and confidence in the future. Less corruption generally correlates with greater happiness.

- **Generosity:** Generosity ranges from 0 to 0.79588, with a mean of 0.237296 and a median of 0.21613. The standard deviation (0.126685) indicates considerable variability in generosity across countries.

    Acts of generosity and altruism are related to higher levels of personal happiness.

    The ability and willingness to help others can increase perceptions of well-being.

- **Dystopia Residual:** The Dystopia Residual component ranges from 0.32858 to 3.60214, with a mean of 2.098977 and a median of 2.095415. The standard deviation (0.55355) suggests considerable variability in this component, reflecting a wide disparity in baseline happiness conditions not explained by other factors.

    This factor captures what remains of happiness that is not explained by the other factors. It is a crucial component in understanding relative happiness across countries.
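
The mean-versus-median comparisons above can be made a little more systematic. The sketch below takes three (mean, median, std) triples copied from the `describe()` output and expresses the gap between mean and median in standard deviations. This is only a rough heuristic for the direction of the tail, not a formal skewness measure, and `mean_median_gap` is a hypothetical helper name.

```python
# (mean, median, std) copied from the describe() output above for three columns.
summary_2015 = {
    "Happiness Score": (5.375734, 5.2325, 1.14501),
    "Economy (GDP per Capita)": (0.846137, 0.910245, 0.403121),
    "Trust (Government Corruption)": (0.143422, 0.10722, 0.120034),
}


def mean_median_gap(mean, median, std):
    """Gap between mean and median, expressed in standard deviations."""
    return (mean - median) / std


gaps = {name: mean_median_gap(*stats) for name, stats in summary_2015.items()}
for name, gap in gaps.items():
    direction = "right" if gap > 0 else "left"
    print(f"{name}: gap = {gap:+.3f} std, tail leans {direction}")
```

A mean above the median hints at a longer right tail (a few countries with unusually high values), which is what the Trust column shows most clearly.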

We will add a **Year** column here and do the same for the following datasets. Their columns will be analyzed only where they are new, so that the datasets can be joined. A more detailed exploration will follow once everything is joined.

In [11]:
data_2015['Year'] = 2015

In [12]:
data_2015.head(3)

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,Year
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,2015
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201,2015
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,2015


#### 2. 2016 dataset

In [13]:
data_2016 = pd.read_csv(location_files[1], delimiter=',', header=0)

In [14]:
data_2016.head(3)

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137


In [15]:
data_2016.columns

Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Lower Confidence Interval', 'Upper Confidence Interval',
       'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
       'Freedom', 'Trust (Government Corruption)', 'Generosity',
       'Dystopia Residual'],
      dtype='object')

As we can see, one column from 2015 is missing:

- **Standard Error**. This column will be null for the 2016 records.

And there are two additional columns:

- **Lower Confidence Interval**.

- **Upper Confidence Interval**.

For now, these columns will be null for the 2015 records. Let's check for NaN values:

In [16]:
data_2016.isna().sum()

Country                          0
Region                           0
Happiness Rank                   0
Happiness Score                  0
Lower Confidence Interval        0
Upper Confidence Interval        0
Economy (GDP per Capita)         0
Family                           0
Health (Life Expectancy)         0
Freedom                          0
Trust (Government Corruption)    0
Generosity                       0
Dystopia Residual                0
dtype: int64

In [17]:
data_2016.dtypes

Country                           object
Region                            object
Happiness Rank                     int64
Happiness Score                  float64
Lower Confidence Interval        float64
Upper Confidence Interval        float64
Economy (GDP per Capita)         float64
Family                           float64
Health (Life Expectancy)         float64
Freedom                          float64
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
dtype: object

In [18]:
data_2016.describe()

Unnamed: 0,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
count,157.0,157.0,157.0,157.0,157.0,157.0,157.0,157.0,157.0,157.0,157.0
mean,78.980892,5.382185,5.282395,5.481975,0.95388,0.793621,0.557619,0.370994,0.137624,0.242635,2.325807
std,45.46603,1.141674,1.148043,1.136493,0.412595,0.266706,0.229349,0.145507,0.111038,0.133756,0.54222
min,1.0,2.905,2.732,3.078,0.0,0.0,0.0,0.0,0.0,0.0,0.81789
25%,40.0,4.404,4.327,4.465,0.67024,0.64184,0.38291,0.25748,0.06126,0.15457,2.03171
50%,79.0,5.314,5.237,5.419,1.0278,0.84142,0.59659,0.39747,0.10547,0.22245,2.29074
75%,118.0,6.269,6.154,6.434,1.27964,1.02152,0.72993,0.48453,0.17554,0.31185,2.66465
max,157.0,7.526,7.46,7.669,1.82427,1.18326,0.95277,0.60848,0.50521,0.81971,3.83772


These confidence intervals and the standard error could be useful to validate future predictions, so we take note of them.

Let's create the column **Year**:

In [19]:
data_2016['Year'] = 2016

In [20]:
data_2016.head(3)

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,Year
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939,2016
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463,2016
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137,2016


#### 3. 2017 dataset

In [21]:
data_2017 = pd.read_csv(location_files[2], delimiter=',', header=0)

In [22]:
data_2017.head(3)

Unnamed: 0,Country,Happiness.Rank,Happiness.Score,Whisker.high,Whisker.low,Economy..GDP.per.Capita.,Family,Health..Life.Expectancy.,Freedom,Generosity,Trust..Government.Corruption.,Dystopia.Residual
0,Norway,1,7.537,7.594445,7.479556,1.616463,1.533524,0.796667,0.635423,0.362012,0.315964,2.277027
1,Denmark,2,7.522,7.581728,7.462272,1.482383,1.551122,0.792566,0.626007,0.35528,0.40077,2.313707
2,Iceland,3,7.504,7.62203,7.38597,1.480633,1.610574,0.833552,0.627163,0.47554,0.153527,2.322715


Compared with the data_2015 dataset, we have 2 new columns (**Whisker.high** and **Whisker.low**). These could refer to the variability of the data, so we will take note of their possible importance when predicting happiness. Additionally, this dataset does not have the region column. For now, we will rename the columns so that they match those of the other datasets as closely as possible.

In [23]:
data_2017.columns = ['Country', 'Happiness Rank', 'Happiness Score', 'Whisker High', 'Whisker Low', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Generosity', 'Trust (Government Corruption)', 'Dystopia Residual']
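
The positional assignment above depends on the column order of the file. As an alternative sketch, headers can be normalized by rule so that the 2015 and 2017 spellings collapse to the same key. `to_snake_case` is a hypothetical helper, and the names it produces are not the ones used later in this notebook.

```python
import re


def to_snake_case(header):
    """Lower-case a header and collapse every run of non-alphanumerics to '_'."""
    return re.sub(r"[^0-9A-Za-z]+", "_", header).strip("_").lower()


headers_2015 = ["Happiness Rank", "Economy (GDP per Capita)", "Health (Life Expectancy)"]
headers_2017 = ["Happiness.Rank", "Economy..GDP.per.Capita.", "Health..Life.Expectancy."]

normalized_2015 = [to_snake_case(h) for h in headers_2015]
normalized_2017 = [to_snake_case(h) for h in headers_2017]
print(normalized_2017)
```

A rule like this only handles differences in punctuation and case. Columns that were genuinely renamed between years, as in 2018 and 2019, still need an explicit mapping.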

In [24]:
data_2017.head(3)

Unnamed: 0,Country,Happiness Rank,Happiness Score,Whisker High,Whisker Low,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Generosity,Trust (Government Corruption),Dystopia Residual
0,Norway,1,7.537,7.594445,7.479556,1.616463,1.533524,0.796667,0.635423,0.362012,0.315964,2.277027
1,Denmark,2,7.522,7.581728,7.462272,1.482383,1.551122,0.792566,0.626007,0.35528,0.40077,2.313707
2,Iceland,3,7.504,7.62203,7.38597,1.480633,1.610574,0.833552,0.627163,0.47554,0.153527,2.322715


In [25]:
data_2017['Year'] = 2017

In [26]:
data_2017.head(3)

Unnamed: 0,Country,Happiness Rank,Happiness Score,Whisker High,Whisker Low,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Generosity,Trust (Government Corruption),Dystopia Residual,Year
0,Norway,1,7.537,7.594445,7.479556,1.616463,1.533524,0.796667,0.635423,0.362012,0.315964,2.277027,2017
1,Denmark,2,7.522,7.581728,7.462272,1.482383,1.551122,0.792566,0.626007,0.35528,0.40077,2.313707,2017
2,Iceland,3,7.504,7.62203,7.38597,1.480633,1.610574,0.833552,0.627163,0.47554,0.153527,2.322715,2017


In [27]:
data_2017.isna().sum()

Country                          0
Happiness Rank                   0
Happiness Score                  0
Whisker High                     0
Whisker Low                      0
Economy (GDP per Capita)         0
Family                           0
Health (Life Expectancy)         0
Freedom                          0
Generosity                       0
Trust (Government Corruption)    0
Dystopia Residual                0
Year                             0
dtype: int64

#### 4. 2018 dataset

In [28]:
data_2018 = pd.read_csv(location_files[3], delimiter=',', header=0)

In [29]:
data_2018.head(3)

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.632,1.305,1.592,0.874,0.681,0.202,0.393
1,2,Norway,7.594,1.456,1.582,0.861,0.686,0.286,0.34
2,3,Denmark,7.555,1.351,1.59,0.868,0.683,0.284,0.408


In [30]:
data_2018.isna().sum()

Overall rank                    0
Country or region               0
Score                           0
GDP per capita                  0
Social support                  0
Healthy life expectancy         0
Freedom to make life choices    0
Generosity                      0
Perceptions of corruption       1
dtype: int64

In [31]:
data_2018.describe()

Unnamed: 0,Overall rank,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
count,156.0,156.0,156.0,156.0,156.0,156.0,156.0,155.0
mean,78.5,5.375917,0.891449,1.213237,0.597346,0.454506,0.181006,0.112
std,45.177428,1.119506,0.391921,0.302372,0.247579,0.162424,0.098471,0.096492
min,1.0,2.905,0.0,0.0,0.0,0.0,0.0,0.0
25%,39.75,4.45375,0.61625,1.06675,0.42225,0.356,0.1095,0.051
50%,78.5,5.378,0.9495,1.255,0.644,0.487,0.174,0.082
75%,117.25,6.1685,1.19775,1.463,0.77725,0.5785,0.239,0.137
max,156.0,7.632,2.096,1.644,1.03,0.724,0.598,0.457


Many columns are lost and others are added.
Comparing the ranges of the **Overall rank**, **Score**, **Social support**, **GDP per capita** and **Perceptions of corruption** columns, they resemble those of the corresponding columns in the **data_2015** dataset, so they will be matched when renaming the columns of the current dataset.
It is a pity that the **Dystopia Residual** data are not available.

The current columns will be renamed as follows:

- **Overall rank ->** Happiness Rank.
- **Score ->** Happiness Score.
- **GDP per capita ->** Economy (GDP per Capita).
- **Social support ->** Family.
- **Healthy life expectancy ->** Healthy (Life Expectancy).
- **Freedom to make life choices ->** Freedom.
- **Perceptions of corruption ->** Trust (Government Corruption).

The **Country or region** column apparently collects both the country and the region, so it will be left untouched for now.

In [32]:
data_2018.columns = ['Happiness Rank', 'Country or region', 'Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Healthy (Life Expectancy)', 'Freedom', 'Generosity', 'Trust (Government Corruption)']
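
The rename list above can also be expressed as a name-based mapping, which does not depend on column order. This is a minimal sketch on a toy single-row frame, using pandas' `DataFrame.rename`. `RENAME_2018` is a hypothetical name, and the target names are copied from the list as written.

```python
import pandas as pd

# Mapping taken from the rename list above; unlisted columns keep their names.
RENAME_2018 = {
    "Overall rank": "Happiness Rank",
    "Score": "Happiness Score",
    "GDP per capita": "Economy (GDP per Capita)",
    "Social support": "Family",
    "Healthy life expectancy": "Healthy (Life Expectancy)",
    "Freedom to make life choices": "Freedom",
    "Perceptions of corruption": "Trust (Government Corruption)",
}

# Toy frame built from the first 2018 row shown above, columns shuffled on purpose.
toy_2018 = pd.DataFrame(
    {"Score": [7.632], "Overall rank": [1], "Country or region": ["Finland"], "Social support": [1.592]}
)
renamed = toy_2018.rename(columns=RENAME_2018)
print(list(renamed.columns))
```

Because the mapping is keyed by name, a reordered or partially loaded file is still renamed correctly, and keys that are absent from the frame are ignored by default.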

In [33]:
data_2018.head(3)

Unnamed: 0,Happiness Rank,Country or region,Happiness Score,Economy (GDP per Capita),Family,Healthy (Life Expectancy),Freedom,Generosity,Trust (Government Corruption)
0,1,Finland,7.632,1.305,1.592,0.874,0.681,0.202,0.393
1,2,Norway,7.594,1.456,1.582,0.861,0.686,0.286,0.34
2,3,Denmark,7.555,1.351,1.59,0.868,0.683,0.284,0.408


In [34]:
data_2018['Year'] = 2018

In [35]:
data_2018.head(3)

Unnamed: 0,Happiness Rank,Country or region,Happiness Score,Economy (GDP per Capita),Family,Healthy (Life Expectancy),Freedom,Generosity,Trust (Government Corruption),Year
0,1,Finland,7.632,1.305,1.592,0.874,0.681,0.202,0.393,2018
1,2,Norway,7.594,1.456,1.582,0.861,0.686,0.286,0.34,2018
2,3,Denmark,7.555,1.351,1.59,0.868,0.683,0.284,0.408,2018


#### 5. 2019 dataset

In [36]:
data_2019 = pd.read_csv(location_files[4], delimiter=',', header=0)

In [37]:
data_2019.head(3)

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.769,1.34,1.587,0.986,0.596,0.153,0.393
1,2,Denmark,7.6,1.383,1.573,0.996,0.592,0.252,0.41
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341


In [38]:
data_2019.isna().sum()

Overall rank                    0
Country or region               0
Score                           0
GDP per capita                  0
Social support                  0
Healthy life expectancy         0
Freedom to make life choices    0
Generosity                      0
Perceptions of corruption       0
dtype: int64

In [39]:
data_2019.describe()

Unnamed: 0,Overall rank,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
count,156.0,156.0,156.0,156.0,156.0,156.0,156.0,156.0
mean,78.5,5.407096,0.905147,1.208814,0.725244,0.392571,0.184846,0.110603
std,45.177428,1.11312,0.398389,0.299191,0.242124,0.143289,0.095254,0.094538
min,1.0,2.853,0.0,0.0,0.0,0.0,0.0,0.0
25%,39.75,4.5445,0.60275,1.05575,0.54775,0.308,0.10875,0.047
50%,78.5,5.3795,0.96,1.2715,0.789,0.417,0.1775,0.0855
75%,117.25,6.1845,1.2325,1.4525,0.88175,0.50725,0.24825,0.14125
max,156.0,7.769,1.684,1.624,1.141,0.631,0.566,0.453


This dataset has the same columns as the previous one (2018), so we will rename them and move on to the stage of concatenating the datasets.

The current columns will be renamed as follows:

- **Overall rank ->** Happiness Rank.
- **Score ->** Happiness Score.
- **GDP per capita ->** Economy (GDP per Capita).
- **Social support ->** Family.
- **Healthy life expectancy ->** Healthy (Life Expectancy).
- **Freedom to make life choices ->** Freedom.
- **Perceptions of corruption ->** Trust (Government Corruption).

The **Country or region** column apparently collects both the country and the region, so it will be left untouched for now.

In [40]:
data_2019.columns = ['Happiness Rank', 'Country or Region', 'Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Healthy (Life Expectancy)', 'Freedom', 'Generosity', 'Trust (Government Corruption)']

In [41]:
data_2019.head(3)

Unnamed: 0,Happiness Rank,Country or Region,Happiness Score,Economy (GDP per Capita),Family,Healthy (Life Expectancy),Freedom,Generosity,Trust (Government Corruption)
0,1,Finland,7.769,1.34,1.587,0.986,0.596,0.153,0.393
1,2,Denmark,7.6,1.383,1.573,0.996,0.592,0.252,0.41
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341


In [42]:
data_2019['Year'] = 2019

In [43]:
data_2019.head(3)

Unnamed: 0,Happiness Rank,Country or Region,Happiness Score,Economy (GDP per Capita),Family,Healthy (Life Expectancy),Freedom,Generosity,Trust (Government Corruption),Year
0,1,Finland,7.769,1.34,1.587,0.986,0.596,0.153,0.393,2019
1,2,Denmark,7.6,1.383,1.573,0.996,0.592,0.252,0.41,2019
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341,2019


In [44]:
data_2019['Country or Region']

0                       Finland
1                       Denmark
2                        Norway
3                       Iceland
4                   Netherlands
                 ...           
151                      Rwanda
152                    Tanzania
153                 Afghanistan
154    Central African Republic
155                 South Sudan
Name: Country or Region, Length: 156, dtype: object

#### Conclusions 

After exploring the data of the 5 datasets, we obtain 17 columns, which are:

- **Country.**
- **Region.**
- **Country or Region.**
- **Happiness Rank.**
- **Happiness Score.**
- **Standard Error.**
- **Lower Confidence Interval.**
- **Upper Confidence Interval.**
- **Whisker High.**
- **Whisker Low.**
- **Economy (GDP per Capita).**
- **Family.**
- **Health (Life Expectancy).**
- **Freedom.**
- **Trust (Government Corruption).**
- **Generosity.**
- **Dystopia Residual.**

These columns provide good information for estimating the Happiness Score. 5 of them are discarded because each has data for only about one fifth of the complete dataset (one year out of five): **Standard Error**, **Upper and Lower Confidence Interval**, **Whisker High and Low**.

The **Dystopia Residual**, much like Happiness Rank, is tied directly to the Happiness Score because it is one of the components of that score, so it will not be used to predict the latter value.
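
The relationship between the score and the residual can be checked on the rows shown earlier: subtracting the six factor columns from the Happiness Score should give back the Dystopia Residual. The sketch below does this for the first 2015 row, and then computes the implied residual for the first 2018 row, assuming the same additive decomposition also holds for that year. `implied_residual` is a hypothetical helper.

```python
def implied_residual(score, factors):
    """Score minus the sum of the explanatory factors."""
    return score - sum(factors)


# Switzerland, first row of the 2015 table: the six factor columns.
switzerland_factors = [1.39651, 1.34951, 0.94143, 0.66557, 0.41978, 0.29678]
switzerland_residual = implied_residual(7.587, switzerland_factors)
print(f"2015 Switzerland: implied {switzerland_residual:.5f}, reported 2.51738")

# Finland, first row of the 2018 table, which has no residual column.
finland_factors = [1.305, 1.592, 0.874, 0.681, 0.202, 0.393]
finland_residual = implied_residual(7.632, finland_factors)
print(f"2018 Finland: implied {finland_residual:.3f}")
```

Since the residual is what is left of the score after the factors are removed, including it as a feature would leak the target into the model.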

There are 782 records in total. We will validate all of these assumptions later.

### 2. Creating the steps for workshop

This section is aimed at establishing a step-by-step approach to perform in Airflow and to take advantage of and further explore the data available to us.

All code that is not marked with `#! Don't put` will be put into Airflow.

#### 1. Steps for extraction:

In [45]:
datasets_urls = {
    '2015': '../data/2015.csv', 
    '2016': '../data/2016.csv', 
    '2017': '../data/2017.csv', 
    '2018': '../data/2018.csv', 
    '2019': '../data/2019.csv'
}

In [46]:
datasets_dict = dict()
for (year, url) in datasets_urls.items():
    dataset = pd.read_csv(url, header=0, delimiter=',')
    datasets_dict[year] = dataset.to_dict(orient='records')
json_data_1 = json.dumps(datasets_dict, indent=4)
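
Because each step hands its result to the next as a JSON string, it is worth checking that a frame survives the round trip, including missing values. The sketch below uses a toy frame with one NaN, mimicking the single missing value in the 2018 file. Note that Python's `json` module writes missing floats as `NaN` by default, which is not strict JSON, so this round trip is only assumed to work between Python tasks.

```python
import json

import pandas as pd

# Toy frame with one missing value (illustrative, not the real file).
toy = pd.DataFrame(
    {"Country or region": ["A", "B"], "Perceptions of corruption": [0.393, float("nan")]}
)

payload = json.dumps(toy.to_dict(orient="records"))  # what one step hands over
restored = pd.json_normalize(json.loads(payload))    # what the next step rebuilds

print(payload)
print(restored.isna().sum().to_dict())
```

If the JSON ever has to be read by a strict parser, replacing NaN with `None` before serializing would be one option.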

#### 2. Steps for transform:

In [58]:
json_data_2 = json.loads(json_data_1)

Let's create a function to provide the correct columns:

In [59]:
def get_columns(year):
    """ Get the columns of the datasets based on the year. """
    columns = list()
    
    if year == '2015':
        columns = ['country', 'region', 'happiness_rank', 'happiness_score', 'standard_error', 'economy_per_capita', 'family', 'life_expectancy', 'freedom', 'government_corruption', 'generosity', 'dystopia_residual', 'year']
    if year == '2016':
        columns = ['country', 'region', 'happiness_rank', 'happiness_score', 'lower_confidence_interval', 'upper_confidence_interval', 'economy_per_capita', 'family', 'life_expectancy', 'freedom', 'government_corruption', 'generosity', 'dystopia_residual', 'year']
    if year == '2017':
        columns = ['country', 'happiness_rank', 'happiness_score', 'whisker_high', 'whisker_low', 'economy_per_capita', 'family', 'life_expectancy', 'freedom','generosity', 'government_corruption',  'dystopia_residual', 'year']
    if year == '2018':
        columns = ['happiness_rank', 'country_region', 'happiness_score', 'economy_per_capita', 'family', 'life_expectancy', 'freedom', 'generosity', 'government_corruption', 'year']
    if year == '2019':
        columns = ['happiness_rank', 'country_region', 'happiness_score', 'economy_per_capita', 'family', 'life_expectancy', 'freedom', 'generosity', 'government_corruption', 'year']
        
    return columns

In [60]:
dataframes_dict = dict()
for key in json_data_2.keys():
    year = str(key)
    dataset = pd.json_normalize(json_data_2[year])
    dataset['year'] = year
    dataset.columns = get_columns(year)
    dataframes_dict[year] = dataset.to_dict(orient='records')
    print(year, '---', dataset.shape)

json_data_2 = json.dumps(dataframes_dict, indent=4)

2015 --- (158, 13)
2016 --- (157, 14)
2017 --- (155, 13)
2018 --- (156, 10)
2019 --- (156, 10)


#### 3. Concatenate

In [50]:
json_data_3 = json.loads(json_data_2)

concatenated_dataframe = pd.concat([pd.json_normalize(json_data_3[year]) for year in json_data_3.keys()], ignore_index=True)

# logging.info('Concatenated data has shape %s', concatenated_dataframe.shape)
json_data_3 = concatenated_dataframe.to_json(orient='records')

In [61]:
concatenated_dataframe.shape #! Don't put

(782, 18)

In [62]:
concatenated_dataframe.isna().sum() #! Don't put

country                      312
region                       467
happiness_rank                 0
happiness_score                0
standard_error               624
economy_per_capita             0
family                         0
life_expectancy                0
freedom                        0
government_corruption          1
generosity                     0
dystopia_residual            312
year                           0
lower_confidence_interval    625
upper_confidence_interval    625
whisker_high                 627
whisker_low                  627
country_region               470
dtype: int64

As we can see, out of 782 records, each of the 5 columns discussed above has around 625 null values.
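
These nulls come from the concatenation itself: `pd.concat` keeps the union of the columns and fills with NaN wherever a frame lacks one. The sketch below reproduces this on two toy frames and reports the share of missing values per column. In the real data, 624 of 782 corresponds to roughly 80%.

```python
import pandas as pd

# Toy frames: each "year" carries one column the other lacks.
frame_a = pd.DataFrame({"country": ["A", "B"], "standard_error": [0.03, 0.05]})
frame_b = pd.DataFrame({"country": ["C"], "whisker_high": [7.59]})

combined = pd.concat([frame_a, frame_b], ignore_index=True)

# Share of missing values per column, largest first.
null_share = combined.isna().mean().sort_values(ascending=False)
print(null_share)
```

Applied to `concatenated_dataframe`, the same `isna().mean()` call gives a direct basis for deciding which columns to drop.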

We also saw that **Country** and **Region** are not always filled in, because some years provide **Country Or Region** instead. We will therefore derive the region of each record from its country where possible, taking the values in **Country Or Region** into account.
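
One possible way to do this, sketched here on toy rows, is to coalesce the two country columns and then learn the country-to-region mapping from the years where both are present. This is an assumption-laden alternative to a hand-written mapping, and the toy values are illustrative. Countries spelled differently across years (for example with `&` instead of `and`) would stay unmatched.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "country": ["Finland", "Finland", None, None],
        "region": ["Western Europe", None, None, None],
        "country_region": [None, None, "Finland", "Trinidad & Tobago"],
        "year": ["2015", "2017", "2018", "2019"],
    }
)

# 1. Coalesce the two country columns into one.
toy["country"] = toy["country"].fillna(toy["country_region"])

# 2. Learn country -> region from the rows where the region is known.
known = toy.dropna(subset=["region"]).drop_duplicates("country")
lookup = known.set_index("country")["region"]

# 3. Fill missing regions by looking the country up; unmatched names stay NaN.
toy["region"] = toy["region"].fillna(toy["country"].map(lookup))
print(toy[["country", "region", "year"]])
```

Any rows left without a region after this step would show exactly which country names need manual handling.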

#### 4. Transform concatenated

In [89]:
json_data_4 = json.loads(json_data_3)
concatenated_dataframe = pd.json_normalize(json_data_4)

Let's look at the records where the region is NaN:

In [90]:
concatenated_dataframe[concatenated_dataframe['region'].isna()].head() #!

Unnamed: 0,country,region,happiness_rank,happiness_score,standard_error,economy_per_capita,family,life_expectancy,freedom,government_corruption,generosity,dystopia_residual,year,lower_confidence_interval,upper_confidence_interval,whisker_high,whisker_low,country_region
315,Norway,,1,7.537,,1.616463,1.533524,0.796667,0.635423,0.315964,0.362012,2.277027,2017,,,7.594445,7.479556,
316,Denmark,,2,7.522,,1.482383,1.551122,0.792566,0.626007,0.40077,0.35528,2.313707,2017,,,7.581728,7.462272,
317,Iceland,,3,7.504,,1.480633,1.610574,0.833552,0.627163,0.153527,0.47554,2.322715,2017,,,7.62203,7.38597,
318,Switzerland,,4,7.494,,1.56498,1.516912,0.858131,0.620071,0.367007,0.290549,2.276716,2017,,,7.561772,7.426227,
319,Finland,,5,7.469,,1.443572,1.540247,0.809158,0.617951,0.382612,0.245483,2.430182,2017,,,7.527542,7.410458,


Now, let's see which countries appear when the region is NaN:

In [91]:
concatenated_dataframe[concatenated_dataframe['region'].isna()]['country'].unique() #! Don't put

array(['Norway', 'Denmark', 'Iceland', 'Switzerland', 'Finland',
       'Netherlands', 'Canada', 'New Zealand', 'Sweden', 'Australia',
       'Israel', 'Costa Rica', 'Austria', 'United States', 'Ireland',
       'Germany', 'Belgium', 'Luxembourg', 'United Kingdom', 'Chile',
       'United Arab Emirates', 'Brazil', 'Czech Republic', 'Argentina',
       'Mexico', 'Singapore', 'Malta', 'Uruguay', 'Guatemala', 'Panama',
       'France', 'Thailand', 'Taiwan Province of China', 'Spain', 'Qatar',
       'Colombia', 'Saudi Arabia', 'Trinidad and Tobago', 'Kuwait',
       'Slovakia', 'Bahrain', 'Malaysia', 'Nicaragua', 'Ecuador',
       'El Salvador', 'Poland', 'Uzbekistan', 'Italy', 'Russia', 'Belize',
       'Japan', 'Lithuania', 'Algeria', 'Latvia', 'South Korea',
       'Moldova', 'Romania', 'Bolivia', 'Turkmenistan', 'Kazakhstan',
       'North Cyprus', 'Slovenia', 'Peru', 'Mauritius', 'Cyprus',
       'Estonia', 'Belarus', 'Libya', 'Turkey', 'Paraguay',
       'Hong Kong S.A.R., China', '

We could assign a region to each country and have more data for prediction, so let's get to work.

Let's look at how regions are designated to make the new region assignments:

In [92]:
concatenated_dataframe['region'].unique() #! Don't put

array(['Western Europe', 'North America', 'Australia and New Zealand',
       'Middle East and Northern Africa', 'Latin America and Caribbean',
       'Southeastern Asia', 'Central and Eastern Europe', 'Eastern Asia',
       'Sub-Saharan Africa', 'Southern Asia', None], dtype=object)

There are some regions found in the **country_region** column, so in one way or another, we must take it into account.

In [93]:
concatenated_dataframe['country_region'].unique() #! Don't put

array([None, 'Finland', 'Norway', 'Denmark', 'Iceland', 'Switzerland',
       'Netherlands', 'Canada', 'New Zealand', 'Sweden', 'Australia',
       'United Kingdom', 'Austria', 'Costa Rica', 'Ireland', 'Germany',
       'Belgium', 'Luxembourg', 'United States', 'Israel',
       'United Arab Emirates', 'Czech Republic', 'Malta', 'France',
       'Mexico', 'Chile', 'Taiwan', 'Panama', 'Brazil', 'Argentina',
       'Guatemala', 'Uruguay', 'Qatar', 'Saudi Arabia', 'Singapore',
       'Malaysia', 'Spain', 'Colombia', 'Trinidad & Tobago', 'Slovakia',
       'El Salvador', 'Nicaragua', 'Poland', 'Bahrain', 'Uzbekistan',
       'Kuwait', 'Thailand', 'Italy', 'Ecuador', 'Belize', 'Lithuania',
       'Slovenia', 'Romania', 'Latvia', 'Japan', 'Mauritius', 'Jamaica',
       'South Korea', 'Northern Cyprus', 'Russia', 'Kazakhstan', 'Cyprus',
       'Bolivia', 'Estonia', 'Paraguay', 'Peru', 'Kosovo', 'Moldova',
       'Turkmenistan', 'Hungary', 'Libya', 'Philippines', 'Honduras',
       'Belarus', '

In [94]:
concatenated_dataframe['country_region'].nunique() #! Don't put

160

With the above list in mind, let's build a dictionary that maps each country to a region:

In [69]:
country_region_dict = {
    'Finland': 'Europe',
    'Norway': 'Europe',
    'Denmark': 'Europe',
    'Iceland': 'Europe',
    'Switzerland': 'Europe',
    'Netherlands': 'Europe',
    'Canada': 'North America',
    'New Zealand': 'Oceania',
    'Sweden': 'Europe',
    'Australia': 'Oceania',
    'United Kingdom': 'Europe',
    'Austria': 'Europe',
    'Costa Rica': 'Central America',
    'Ireland': 'Europe',
    'Germany': 'Europe',
    'Belgium': 'Europe',
    'Luxembourg': 'Europe',
    'United States': 'North America',
    'Israel': 'Middle East',
    'United Arab Emirates': 'Middle East',
    'Czech Republic': 'Europe',
    'Malta': 'Europe',
    'France': 'Europe',
    'Mexico': 'Latin America and Caribbean',
    'Chile': 'South America',
    'Taiwan': 'Eastern Asia',
    'Panama': 'Central America',
    'Brazil': 'South America',
    'Argentina': 'South America',
    'Guatemala': 'Central America',
    'Uruguay': 'South America',
    'Qatar': 'Middle East',
    'Saudi Arabia': 'Middle East',
    'Singapore': 'Southeastern Asia',
    'Malaysia': 'Southeastern Asia',
    'Spain': 'Europe',
    'Colombia': 'South America',
    'Trinidad & Tobago': 'Latin America and Caribbean',
    'Slovakia': 'Central and Eastern Europe',
    'El Salvador': 'Central America',
    'Nicaragua': 'Central America',
    'Poland': 'Central and Eastern Europe',
    'Bahrain': 'Middle East',
    'Uzbekistan': 'Central Asia',
    'Kuwait': 'Middle East',
    'Thailand': 'Southeastern Asia',
    'Italy': 'Europe',
    'Ecuador': 'South America',
    'Belize': 'Central America',
    'Lithuania': 'Central and Eastern Europe',
    'Slovenia': 'Central and Eastern Europe',
    'Romania': 'Central and Eastern Europe',
    'Latvia': 'Central and Eastern Europe',
    'Japan': 'Eastern Asia',
    'Mauritius': 'Sub-Saharan Africa',
    'Jamaica': 'Latin America and Caribbean',
    'South Korea': 'Eastern Asia',
    'Northern Cyprus': 'Western Europe',
    'Russia': 'Europe',
    'Kazakhstan': 'Central Asia',
    'Cyprus': 'Western Europe',
    'Bolivia': 'South America',
    'Estonia': 'Northern Europe',
    'Paraguay': 'South America',
    'Peru': 'South America',
    'Kosovo': 'Europe',
    'Moldova': 'Europe',
    'Turkmenistan': 'Central Asia',
    'Hungary': 'Central and Eastern Europe',
    'Libya': 'Middle East and Northern Africa',
    'Philippines': 'Southeastern Asia',
    'Honduras': 'Central America',
    'Belarus': 'Central and Eastern Europe',
    'Turkey': 'Middle East',
    'Pakistan': 'Southern Asia',
    'Hong Kong': 'Eastern Asia',
    'Portugal': 'Europe',
    'Serbia': 'Central and Eastern Europe',
    'Greece': 'Europe',
    'Lebanon': 'Middle East',
    'Montenegro': 'Central and Eastern Europe',
    'Croatia': 'Central and Eastern Europe',
    'Dominican Republic': 'Latin America and Caribbean',
    'Algeria': 'Middle East and Northern Africa',
    'Morocco': 'Middle East and Northern Africa',
    'China': 'Eastern Asia',
    'Azerbaijan': 'Central Asia',
    'Tajikistan': 'Central Asia',
    'Macedonia': 'Central and Eastern Europe',
    'Jordan': 'Middle East',
    'Nigeria': 'Sub-Saharan Africa',
    'Kyrgyzstan': 'Central Asia',
    'Bosnia and Herzegovina': 'Central and Eastern Europe',
    'Mongolia': 'Eastern Asia',
    'Vietnam': 'Southeastern Asia',
    'Indonesia': 'Southeastern Asia',
    'Bhutan': 'Southern Asia',
    'Somalia': 'Sub-Saharan Africa',
    'Cameroon': 'Sub-Saharan Africa',
    'Bulgaria': 'Central and Eastern Europe',
    'Nepal': 'Southern Asia',
    'Venezuela': 'Latin America and Caribbean',
    'Gabon': 'Sub-Saharan Africa',
    'Palestinian Territories': 'Middle East and Northern Africa',
    'South Africa': 'Sub-Saharan Africa',
    'Iran': 'Middle East',
    'Ivory Coast': 'Sub-Saharan Africa',
    'Ghana': 'Sub-Saharan Africa',
    'Senegal': 'Sub-Saharan Africa',
    'Laos': 'Southeastern Asia',
    'Tunisia': 'Middle East and Northern Africa',
    'Albania': 'Central and Eastern Europe',
    'Sierra Leone': 'Sub-Saharan Africa',
    'Congo (Brazzaville)': 'Sub-Saharan Africa',
    'Bangladesh': 'Southern Asia',
    'Sri Lanka': 'Southern Asia',
    'Iraq': 'Middle East',
    'Mali': 'Sub-Saharan Africa',
    'Namibia': 'Sub-Saharan Africa',
    'Cambodia': 'Southeastern Asia',
    'Burkina Faso': 'Sub-Saharan Africa',
    'Egypt': 'Middle East and Northern Africa',
    'Mozambique': 'Sub-Saharan Africa',
    'Kenya': 'Sub-Saharan Africa',
    'Zambia': 'Sub-Saharan Africa',
    'Mauritania': 'Sub-Saharan Africa',
    'Ethiopia': 'Sub-Saharan Africa',
    'Georgia': 'Central and Eastern Europe',
    'Armenia': 'Central and Eastern Europe',
    'Myanmar': 'Southeastern Asia',
    'Chad': 'Sub-Saharan Africa',
    'Congo (Kinshasa)': 'Sub-Saharan Africa',
    'India': 'Southern Asia',
    'Niger': 'Sub-Saharan Africa',
    'Uganda': 'Sub-Saharan Africa',
    'Benin': 'Sub-Saharan Africa',
    'Sudan': 'Sub-Saharan Africa',
    'Ukraine': 'Central and Eastern Europe',
    'Togo': 'Sub-Saharan Africa',
    'Guinea': 'Sub-Saharan Africa',
    'Lesotho': 'Sub-Saharan Africa',
    'Angola': 'Sub-Saharan Africa',
    'Madagascar': 'Sub-Saharan Africa',
    'Zimbabwe': 'Sub-Saharan Africa',
    'Afghanistan': 'Southern Asia',
    'Botswana': 'Sub-Saharan Africa',
    'Malawi': 'Sub-Saharan Africa',
    'Haiti': 'Latin America and Caribbean',
    'Liberia': 'Sub-Saharan Africa',
    'Syria': 'Middle East',
    'Rwanda': 'Sub-Saharan Africa',
    'Yemen': 'Middle East',
    'Tanzania': 'Sub-Saharan Africa',
    'South Sudan': 'Sub-Saharan Africa',
    'Central African Republic': 'Sub-Saharan Africa',
    'Burundi': 'Sub-Saharan Africa',
    'North Macedonia': 'Central and Eastern Europe',
    'Gambia': 'Sub-Saharan Africa',
    'Swaziland': 'Sub-Saharan Africa',
    'Comoros': 'Sub-Saharan Africa'
}

In [95]:
concatenated_dataframe['region'] = concatenated_dataframe['country'].map(country_region_dict)

In [96]:
concatenated_dataframe['region'].nunique() #! Don't put

16
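
Since `Series.map` with a dictionary returns `NaN` for any key it cannot find, it may be worth listing the countries that were left without a region. A minimal sketch on toy data (the small dataframe and the dictionary below are illustrative assumptions, not the notebook's data):

```python
import pandas as pd

# Toy data, assumed for illustration only.
toy = pd.DataFrame({'country': ['Finland', 'Canada', 'Hong Kong S.A.R., China']})
toy_region_dict = {'Finland': 'Europe', 'Canada': 'North America', 'Hong Kong': 'Eastern Asia'}

# map() looks each country up in the dictionary; missing keys become NaN.
toy['region'] = toy['country'].map(toy_region_dict)

# Countries whose name has no exact match among the dictionary keys.
unmapped = toy.loc[toy['region'].isna(), 'country'].unique().tolist()
print(unmapped)
```

The lookup is an exact string match, so a spelling variant of a country name is enough to leave its region empty.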

Let's finish by combining the country and its region into **country_region**. Additionally, let's drop the columns we talked about earlier that will not be used; the cell below removes six of them.

In [97]:
concatenated_dataframe['country_region'] = concatenated_dataframe['country'].fillna('') + ' - ' + concatenated_dataframe['region'].fillna('')
concatenated_dataframe['country_region'] = concatenated_dataframe['country_region'].str.strip(' -')

concatenated_dataframe.drop(columns=['standard_error', 'dystopia_residual', 'lower_confidence_interval', 'upper_confidence_interval', 'whisker_high', 'whisker_low'], inplace=True)

concatenated_dataframe.index += 1
concatenated_dataframe.reset_index(inplace=True)
concatenated_dataframe.rename(columns={'index': 'id'}, inplace=True)

json_data_4 = concatenated_dataframe.to_json(orient='records')

In [98]:
concatenated_dataframe[concatenated_dataframe['country_region'].isna()].head() #! Don't put

Unnamed: 0,id,country,region,happiness_rank,happiness_score,economy_per_capita,family,life_expectancy,freedom,government_corruption,generosity,year,country_region
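
To make the label construction easier to follow, here is a small sketch of the same `fillna` / `str.strip` pattern on toy rows (the toy dataframe is an assumption for illustration). It also suggests why the `isna()` check returns no rows: after `fillna('')`, the label is always a string, possibly an empty one.

```python
import pandas as pd

# Toy rows: complete, missing region, missing country.
toy = pd.DataFrame({
    'country': ['Finland', 'Kosovo', None],
    'region': ['Europe', None, 'Europe'],
})

label = toy['country'].fillna('') + ' - ' + toy['region'].fillna('')
# str.strip(' -') removes leading/trailing spaces and hyphens,
# which cleans up the dangling separator when one side is missing.
label = label.str.strip(' -')
print(label.tolist())
```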


### 3. Creating the selection for prediction

In this part of the notebook, we will check which data and which model give the best match between predicted and actual values.

We are going to create two functions that will help us quickly plot the data we obtain:

In [156]:
def corr_graph(correlation_matrix):
    """ Create a graph with the correlation matrix. """
    fig = go.Figure(data=go.Heatmap(
        z=correlation_matrix.values,
        x=correlation_matrix.columns,
        y=correlation_matrix.columns,
        colorscale='Viridis',
        zmin=-1, zmax=1
    ))
    annotations = []
    for i, row in enumerate(correlation_matrix.values):
        for j, value in enumerate(row):
            annotations.append(
                go.layout.Annotation(
                    text=f'{value:.2f}',
                    x=correlation_matrix.columns[j],
                    y=correlation_matrix.index[i],
                    xref='x1',
                    yref='y1',
                    showarrow=False,
                    font=dict(color='white' if abs(value) > 0.5 else 'black')
                )
            )
            
    fig.update_layout(
        title="Correlation Matrix",
        annotations=annotations,
        xaxis_nticks=36
    )

    fig.show()
    
def graph_to_predict(happiness_dataframe, title = 'Observed and Estimated Happiness Scores'):
    """ Plot observed vs. estimated happiness scores as two lines over the dataframe index. """
    fig = go.Figure()

    fig.add_trace(go.Scatter(x=happiness_dataframe.index, y=happiness_dataframe['happiness_score'], mode='lines', name='Observed'))

    fig.add_trace(go.Scatter(x=happiness_dataframe.index, y=happiness_dataframe['happiness_predicted'], mode='lines', name='Estimated'))

    fig.update_layout(
        title=title,
        xaxis_title='Index',
        yaxis_title='Happiness Score',
        template='plotly_dark'
    )

    fig.show()

Let's declare the columns we can rely on to make the predictions:

In [80]:
dataframe_filtered = concatenated_dataframe[['happiness_rank', 'happiness_score', 'economy_per_capita', 'family', 'life_expectancy', 'freedom', 'government_corruption', 'generosity', 'year', 'country_region']]

We check if we have any nulls so as not to damage our model.

In [99]:
dataframe_filtered.isna().sum()

happiness_rank           0
happiness_score          0
economy_per_capita       0
family                   0
life_expectancy          0
freedom                  0
government_corruption    0
generosity               0
year                     0
country_region           0
dtype: int64

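The check reported one null in **government_corruption** when it was first run (the zero counts above appear to come from a later re-run of the cell), so let's inspect that row and, if appropriate, delete it: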

In [83]:
dataframe_filtered[dataframe_filtered['government_corruption'].isna()]

Unnamed: 0,happiness_rank,happiness_score,economy_per_capita,family,life_expectancy,freedom,government_corruption,generosity,year,country_region
489,20,6.774,2.096,0.776,0.67,0.284,,0.186,2018,


This null could break model training, so dropping the row (going from 782 to 781 records) is a reasonable decision, considering that **government_corruption** could be a useful variable to explore.

In [84]:
dataframe_filtered.dropna(inplace=True)



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
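
The message above is pandas' `SettingWithCopyWarning`: `dataframe_filtered` was created by selecting columns from another dataframe, and `dropna(inplace=True)` then modifies that selection in place. One common way to avoid it, sketched here on toy data (the names `source` and `subset` are illustrative), is to take an explicit copy and reassign the result instead of using `inplace=True`:

```python
import pandas as pd

# Toy stand-in for the concatenated dataframe.
source = pd.DataFrame({
    'government_corruption': [0.41, None, 0.14],
    'generosity': [0.29, 0.18, 0.43],
    'unused': [1, 2, 3],
})

# .copy() makes the column selection an independent dataframe.
subset = source[['government_corruption', 'generosity']].copy()
# Reassigning the result of dropna() avoids modifying a possible view in place.
subset = subset.dropna()

print(len(source), len(subset))
```

The source dataframe keeps all of its rows; only the filtered copy loses the row with the null.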



In [85]:
dataframe_filtered.loc[0]

happiness_rank                              1
happiness_score                         7.587
economy_per_capita                    1.39651
family                                1.34951
life_expectancy                       0.94143
freedom                               0.66557
government_corruption                 0.41978
generosity                            0.29678
year                                     2015
country_region           Switzerland - Europe
Name: 0, dtype: object

In [86]:
correlation_matrix = dataframe_filtered.loc[:, ~dataframe_filtered.columns.isin(['country_region'])].corr()
corr_graph(correlation_matrix)

As we can see, **economy_per_capita**, **family**, **life_expectancy** and **freedom** look like good candidates for predicting the happiness score, but that does not mean we will rule out the other columns.

Let's review the boxplots of the data we have:

In [104]:
features_columns = ['economy_per_capita', 'family', 'life_expectancy', 'freedom', 'government_corruption', 'generosity', 'year']
X = dataframe_filtered[features_columns]
y = dataframe_filtered['happiness_score']

In [108]:
for feature in features_columns:
    fig = px.box(dataframe_filtered, y=feature, title=f'Outliers in {feature}')
    fig.update_layout(
        paper_bgcolor='black',
        plot_bgcolor='black',
        font=dict(color='white')
    )
    fig.show()

As we can see, of the 7 columns we can use to predict **Happiness Score**, the **family**, **government_corruption** and **generosity** columns have a considerable number of outliers. So we are going to take two paths: on the one hand, the data as it is (unadjusted), and on the other hand, the data with outliers removed (adjusted).

In [127]:
ajusted_dataframe = dataframe_filtered.copy()
unajusted_dataframe = dataframe_filtered.copy()

##### Adjusted dataframe

In [130]:
def remove_outliers(dataframe, column):
    Q1 = dataframe[column].quantile(0.25)
    Q3 = dataframe[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return dataframe[(dataframe[column] >= lower_bound) & (dataframe[column] <= upper_bound)]

In [132]:
for feature in ['economy_per_capita', 'family', 'life_expectancy', 'freedom', 'government_corruption', 'generosity']:
    ajusted_dataframe = remove_outliers(ajusted_dataframe, feature)


In [133]:
ajusted_dataframe.shape

(675, 10)

After removing outliers with the IQR method, we went from 781 to 675 records.
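
As a worked example of the IQR rule used by `remove_outliers`, the sketch below applies the same bounds to a toy series (the values are made up for illustration). Values outside `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]` are dropped.

```python
import pandas as pd

# Toy values with one obvious outlier.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

q1 = values.quantile(0.25)   # 3.0
q3 = values.quantile(0.75)   # 7.0
iqr = q3 - q1                # 4.0
lower_bound = q1 - 1.5 * iqr # -3.0
upper_bound = q3 + 1.5 * iqr # 13.0

# Keep only the values inside the bounds.
kept = values[(values >= lower_bound) & (values <= upper_bound)]
print(lower_bound, upper_bound, kept.tolist())
```

Because the notebook's loop filters one column after another on the same dataframe, the quartiles of each later column are computed on rows that already survived the earlier filters.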

#### Model

Now, let's start with the models.

In [134]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [135]:
columns_to_use = ['happiness_score', 'economy_per_capita', 'family', 'life_expectancy', 'freedom', 'government_corruption', 'generosity', 'year']

To run the tests quickly, we will create a class that runs every step needed and automates this process.

In [115]:
class Test_Model():
    
    def __init__(self, dataframe, model, random_split_index):
        self.dataframe = dataframe
        self.model = model
        self.X = dataframe.loc[:, ~dataframe.columns.isin(['happiness_score'])]
        self.y = dataframe['happiness_score']
        self.random_split_index = random_split_index

        self.test_split()
        self.fit_model()
        self.predict_model()        
    
    def test_split(self):
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(self.X, self.y, test_size=0.3, random_state=self.random_split_index)
        
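        # Copy the held-out rows so that adding the prediction column later never writes into the source dataframe.
        self.test_data = self.dataframe.loc[self.y_test.index].copy()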
    
    def fit_model(self):
        self.model.fit(self.X_train, self.y_train)
    
    def predict_model(self):
        self.y_pred = self.model.predict(self.X_test)
        self.test_data['happiness_predicted'] = self.y_pred
    
    def metrics(self):
        r2 = r2_score(self.y_test, self.y_pred)
        mae = mean_absolute_error(self.y_test, self.y_pred)
        mse = mean_squared_error(self.y_test, self.y_pred)
        
        return f'R2: {r2} | MAE: {mae} | MSE: {mse}'
    
    def get_r2_metric(self):
        r2 = r2_score(self.y_test, self.y_pred)
        return r2

    def get_model(self):
        return self.model
    
    def get_shapes(self):
        return f"X train: {self.X_train.shape} | X test: {self.X_test.shape}"
    
    def get_test_data(self):
        return self.test_data

Let's train models:

##### Linear regression

Unadjusted dataframe:

In [159]:
linear_regression_test = Test_Model(unajusted_dataframe[['economy_per_capita', 'family', 'life_expectancy', 'freedom', 'government_corruption', 'generosity', 'year', 'happiness_score']], LinearRegression(), 125)

In [160]:
linear_regression_test.metrics()

'R2: 0.8152564389436721 | MAE: 0.38439142063503495 | MSE: 0.23266887900698213'
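
For reference, the three figures reported by `metrics()` can be reproduced by hand. The sketch below uses plain Python and made-up values (the function name `regression_metrics` is hypothetical) to show the standard definitions of R2, MAE and MSE:

```python
def regression_metrics(y_true, y_pred):
    """Return (R2, MAE, MSE) for two equally long sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared prediction errors.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations from the observed mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mse = ss_res / n
    return r2, mae, mse

r2, mae, mse = regression_metrics([3.0, 5.0, 7.0], [2.5, 5.0, 7.5])
print(r2, mae, mse)
```

An R2 of 1 means the predictions match the observations exactly, while 0 means the model does no better than always predicting the observed mean.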

In [161]:
linear_regression_test.get_shapes()

'X train: (546, 7) | X test: (235, 7)'

In [162]:
linear_regression_test.get_test_data().head()

Unnamed: 0,economy_per_capita,family,life_expectancy,freedom,government_corruption,generosity,year,happiness_score,happiness_predicted
580,0.9,0.906,0.69,0.271,0.063,0.04,2018,4.592,4.912264
693,1.183,1.452,0.726,0.334,0.031,0.082,2019,5.648,5.696573
503,1.529,1.451,1.008,0.631,0.457,0.261,2018,6.343,7.2374
330,1.487923,1.47252,0.798951,0.562511,0.276732,0.336269,2017,6.951,6.87528
748,0.204,0.986,0.39,0.494,0.138,0.197,2019,4.466,4.293861


In [163]:
graph_to_predict(linear_regression_test.get_test_data().reset_index(), 'Observed and Estimated Happiness Scores (Linear Regression) with unadjusted data')

An R2 of 0.815 is a solid result; let's see how the adjusted data performs.

Adjusted dataframe:

In [169]:
linear_regression_test = Test_Model(ajusted_dataframe[['economy_per_capita', 'family', 'life_expectancy', 'freedom', 'happiness_score']], LinearRegression(), 101)
print(linear_regression_test.metrics())
print(linear_regression_test.get_shapes())
graph_to_predict(linear_regression_test.get_test_data().reset_index(), 'Observed and Estimated Happiness Scores (Linear Regression) with adjusted data')

R2: 0.7235206807698333 | MAE: 0.41893557030419937 | MSE: 0.27715912790687863
X train: (472, 4) | X test: (203, 4)


In [170]:
linear_regression_test = Test_Model(ajusted_dataframe[['economy_per_capita', 'family', 'life_expectancy', 'freedom', 'government_corruption', 'generosity', 'year', 'happiness_score']], LinearRegression(), 101)
print(linear_regression_test.metrics())
print(linear_regression_test.get_shapes())
graph_to_predict(linear_regression_test.get_test_data().reset_index(), 'Observed and Estimated Happiness Scores (Linear Regression) with adjusted data')

R2: 0.7352605277501857 | MAE: 0.4059027061359079 | MSE: 0.2653904149344414
X train: (472, 7) | X test: (203, 7)


With the adjusted data, prediction is apparently less effective: R2 drops from 0.815 to roughly 0.72 to 0.74 (note that these runs use a different split seed, 101 instead of 125). Also, comparing the runs with and without the columns we set aside after the correlation plot, using all seven feature columns is slightly more effective (R2 of 0.735 versus 0.724).

##### Random Forest

Let's try the Random Forest model and see how it works. This might be more effective.

In [171]:
columns_to_use

['happiness_score',
 'economy_per_capita',
 'family',
 'life_expectancy',
 'freedom',
 'government_corruption',
 'generosity',
 'year']

Unadjusted data:

In [None]:
import logging
logging.basicConfig(level=logging.INFO, filename='../shared_functions/model/model_trainner.log')
r2_init = 0.80
for random_model in range(100, 201):
    for random_split in range(160):
        random_forest_regressor_test = Test_Model(unajusted_dataframe[['economy_per_capita', 'family', 'life_expectancy','freedom', 'government_corruption', 'generosity', 'year', 'happiness_score']], RandomForestRegressor(n_estimators=165, random_state=random_model), random_split)
        r2 = random_forest_regressor_test.get_r2_metric()
        if r2 > r2_init:
            r2_init = r2
            logging.info(f"[{random_model}, {random_split}, {r2}]")

With this method, an R2 of 0.8614 was achieved, which is the best result so far, so this model will be saved and passed to Airflow.

In [172]:
random_forest_regressor_test = Test_Model(dataframe_filtered[['economy_per_capita', 'family', 'life_expectancy','freedom', 'government_corruption', 'generosity', 'year', 'happiness_score']], RandomForestRegressor(n_estimators=165, random_state=3), 125)
random_forest_regressor_test.get_r2_metric()

0.8614328441454778

Adjusted data:

In [175]:
import logging
logging.basicConfig(level=logging.INFO, filename='../shared_functions/model/model_trainner.log')
r2_init = 0
for random_model in range(0, 50):
    for random_split in range(160):
        random_forest_regressor_test = Test_Model(ajusted_dataframe[['economy_per_capita', 'family', 'life_expectancy','freedom', 'happiness_score']], RandomForestRegressor(n_estimators=165, random_state=random_model), random_split)
        r2 = random_forest_regressor_test.get_r2_metric()
        if r2 > r2_init:
            r2_init = r2
            logging.info(f"[{random_model}, {random_split}, {r2}]")

KeyboardInterrupt: 

In [196]:
from joblib import Parallel, delayed
logging.basicConfig(level=logging.INFO, filename='../shared_functions/model/model_trainner.log')
r2_init = 0.8627981738240167
def evaluate_model(random_model, random_split, dataframe):
    random_forest_regressor_test = Test_Model(dataframe[['economy_per_capita', 'family', 'life_expectancy','freedom', 'government_corruption', 'generosity', 'year', 'happiness_score']], RandomForestRegressor(n_estimators=200, random_state=random_model), random_split)
    r2 = random_forest_regressor_test.get_r2_metric()
    return random_model, random_split, r2

results = Parallel(n_jobs=-1)(delayed(evaluate_model)(random_model, random_split, unajusted_dataframe)
    for random_model in range(0, 201) for random_split in range(126))

In [194]:
# r2_init = -float('inf')
for random_model, random_split, r2 in results:
    if r2 > r2_init:
        r2_init = r2
        logging.info(f"[{random_model}, {random_split}, {r2}]")
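
The loop above logs every improvement over `r2_init`. If only the single best combination is needed, it can be picked directly from the list of triples. A minimal sketch, using made-up triples in place of the real `results`:

```python
# Hypothetical (random_model, random_split, r2) triples standing in for `results`.
toy_results = [(0, 0, 0.81), (0, 1, 0.84), (1, 0, 0.79), (1, 1, 0.86)]

# max() with a key function returns the triple with the highest R2.
best_model_seed, best_split_seed, best_r2 = max(toy_results, key=lambda item: item[2])
print(best_model_seed, best_split_seed, best_r2)
```

Keep in mind that a score selected as the maximum over many seeds tends to be optimistic, since the test split itself is part of what is being selected.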

##### Gradient Boosting Regressor

In [477]:
columns_to_use

['id',
 'happiness_score',
 'economy_per_capita',
 'family',
 'life_expectancy',
 'freedom',
 'government_corruption',
 'generosity',
 'year',
 'country_region']

In [484]:
import logging
logging.basicConfig(level=logging.INFO, filename='../shared_functions/model/model_trainner.log')
r2_init = 0
for random_model in range(21):
    for random_split in range(200):
        gradient_boosting_regressor_test = Test_Model(dataframe_filtered[['economy_per_capita', 'family', 'life_expectancy', 'happiness_score']], GradientBoostingRegressor(n_estimators=150, random_state=random_model), random_split)
        r2 = gradient_boosting_regressor_test.get_r2_metric()
        if r2 > r2_init:
            r2_init = r2
            logging.info(f"[{random_model}, {random_split}, {r2}]")

#### Save trained model

In [463]:
url_model = '../shared_functions/model/random_forest_regressor_model.pkl'
joblib.dump(random_forest_regressor_test.get_model(), url_model)

['../shared_functions/model/random_forest_regressor_model.pkl']

#### Test the trained model

In [466]:
model_trainner = joblib.load(url_model)

In [468]:
dataframe_filtered.columns

Index(['id', 'happiness_rank', 'happiness_score', 'economy_per_capita',
       'family', 'life_expectancy', 'freedom', 'government_corruption',
       'generosity', 'year', 'country_region', 'happiness_predicted'],
      dtype='object')

In [472]:
happiness_dataframe = dataframe_filtered.copy()
happiness_dataframe['happiness_predicted'] = model_trainner.predict(happiness_dataframe.loc[:, ~happiness_dataframe.columns.isin(['id', 'happiness_rank', 'happiness_score', 'country_region', 'happiness_predicted'])])

In [473]:
r2_score(happiness_dataframe['happiness_score'], happiness_dataframe['happiness_predicted'])

0.9384169391836191

In [475]:
mean_absolute_error(happiness_dataframe['happiness_score'], happiness_dataframe['happiness_predicted'])

0.2033875356217257

In [476]:
mean_squared_error(happiness_dataframe['happiness_score'], happiness_dataframe['happiness_predicted'])

0.07812816508429145

In [474]:
graph_to_predict(happiness_dataframe)


#### Evidence of data loaded from the CSV file into **raw_spotify**:

![Raw spotify](https://gist.githubusercontent.com/dventep/e72eff0212b00635e1a970f2822ac87c/raw/696f7a37edf3dcc5f7a015a64431d603ec0eeedf/RawSpotify_Loaded.png)

#### Evidence of the execution of **initial_load_dag**:

![Initial_load_dag](https://gist.githubusercontent.com/dventep/e72eff0212b00635e1a970f2822ac87c/raw/696f7a37edf3dcc5f7a015a64431d603ec0eeedf/Initial_load_data_done.png)