<div class="alert alert-block alert-info">
    <p><img src="https://geekbrains.ru/apple-touch-icon-57x57.png" align="right" alt="GeekBrains"></p>
    <p style="color:DarkSlateGray"><b>Course:</b> Probability Theory and Math Statistics</p>
    <p style="color:DarkSlateGray"><b>Project:</b> Research of happiness indicators by countries</p>
    <p style="color:DarkSlateGray"><b>Supervisor:</b> Yury Lytkin</p>
    <p style="color:DarkSlateGray"><b>Author:</b> Dmitry Doni</p>
</div>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Display charts in Jupyter Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
# Display inline plots as SVG
%config InlineBackend.figure_formats = ['svg']
# Set the limit of columns displayed in the notebook
pd.options.display.max_columns = 100

### Raw Data Analysis

In [3]:
df = pd.read_csv('../../datasets/happiness.csv')

# Preview the top 10 rows; on pandas >= 2.0 use .hide(axis="index") instead of .hide_index()
(df.head(10).style.hide_index()
    .bar(subset=["Positive affect", "Social support", "Freedom", "Generosity", "Log of GDP per capita", "Healthy life expectancy"], color='lightgreen')
    .bar(subset=["Negative affect", "Corruption"], color='gray'))

Country (region),Ladder,SD of Ladder,Positive affect,Negative affect,Social support,Freedom,Corruption,Generosity,Log of GDP per capita,Healthy life expectancy
Finland,1,4,41,10,2,5,4,47,22,27
Denmark,2,13,24,26,4,6,3,22,14,23
Norway,3,8,16,29,3,3,8,11,7,12
Iceland,4,9,3,3,1,7,45,3,15,13
Netherlands,5,1,12,25,15,19,12,7,12,18
Switzerland,6,11,44,21,13,11,7,16,8,4
Sweden,7,18,34,8,25,10,6,17,13,17
New Zealand,8,15,22,12,5,8,5,8,26,14
Canada,9,23,18,49,20,9,11,14,19,8
Austria,10,10,64,24,31,26,19,25,16,15


In [4]:
df.shape

(156, 11)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 11 columns):
Country (region)           156 non-null object
Ladder                     156 non-null int64
SD of Ladder               156 non-null int64
Positive affect            155 non-null float64
Negative affect            155 non-null float64
Social support             155 non-null float64
Freedom                    155 non-null float64
Corruption                 148 non-null float64
Generosity                 155 non-null float64
Log of GDP per capita      152 non-null float64
Healthy life expectancy    150 non-null float64
dtypes: float64(8), int64(2), object(1)
memory usage: 13.5+ KB
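
The non-null counts above show that several columns have missing values (for example, `Corruption` has 148 of 156). As a minimal sketch, the gaps can be counted per column directly. The `toy` frame below is hypothetical and only stands in for `df`; its values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df (values are made up, not from happiness.csv)
toy = pd.DataFrame({
    "Ladder": [1, 2, 3, 4],
    "Corruption": [4.0, np.nan, 8.0, np.nan],
    "Freedom": [5.0, 6.0, np.nan, 7.0],
})

# Number of missing values per column, largest first
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```

Applied to the real data, the same call would be `df.isna().sum()`.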


In [6]:
df.describe()

,Ladder,SD of Ladder,Positive affect,Negative affect,Social support,Freedom,Corruption,Generosity,Log of GDP per capita,Healthy life expectancy
count,156.0,156.0,155.0,155.0,155.0,155.0,148.0,155.0,152.0,150.0
mean,78.5,78.5,78.0,78.0,78.0,78.0,74.5,78.0,76.5,75.5
std,45.177428,45.177428,44.888751,44.888751,44.888751,44.888751,42.868014,44.888751,44.022721,43.445368
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,39.75,39.75,39.5,39.5,39.5,39.5,37.75,39.5,38.75,38.25
50%,78.5,78.5,78.0,78.0,78.0,78.0,74.5,78.0,76.5,75.5
75%,117.25,117.25,116.5,116.5,116.5,116.5,111.25,116.5,114.25,112.75
max,156.0,156.0,155.0,155.0,155.0,155.0,148.0,155.0,152.0,150.0


### Correlation Analysis

#### Correlation

In [12]:
df[["Ladder", "Social support", "Freedom", "Corruption", "Generosity"]].corr()

,Ladder,Social support,Freedom,Corruption,Generosity
Ladder,1.0,0.817842,0.546777,0.190071,0.497856
Social support,0.817842,1.0,0.448903,0.118434,0.442432
Freedom,0.546777,0.448903,1.0,0.381304,0.489991
Corruption,0.190071,0.118434,0.381304,1.0,0.266138
Generosity,0.497856,0.442432,0.489991,0.266138,1.0


#### Covariance

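Unbiased sample covariance, where $n$ is the number of observations and $\overline{x}$, $\overline{y}$ are the sample means: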
$$\sigma_{xy} = \dfrac{1}{n - 1} \displaystyle\sum_{i = 1}^n (x_i - \overline{x}) \cdot (y_i - \overline{y}).$$

In [39]:
X = df["Ladder"]
Y = df["Social support"]

In [41]:
MX = X.mean()
MY = Y.mean()

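# Calculate unbiased covariance using formula
# Caveat: "Social support" has one missing value; .sum() skips that row,
# while the divisor below still counts all 156 rows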
cov = ((X - MX) * (Y - MY)).sum() / (X.shape[0] - 1)
cov

1646.7354838709678
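
As a sanity check on the formula, the manual computation can be compared with the built-in estimators. This is a minimal sketch on hypothetical toy data with no missing values (the series `x` and `y` are made up, not taken from `happiness.csv`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with no missing values (not from happiness.csv)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

n = len(x)

# Unbiased covariance from the formula: divide by n - 1
cov_manual = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)

# Built-in equivalents: both normalize by n - 1 by default
cov_pandas = x.cov(y)
cov_numpy = np.cov(x, y)[0, 1]

print(cov_manual, cov_pandas, cov_numpy)
```

On complete data all three values agree. Differences appear only when one of the series contains missing values.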

#### Correlation coefficient

Pearson correlation coefficient, where $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$:
$$r_{XY} = \dfrac{\operatorname{cov}(X, Y)}{\sigma_X \cdot \sigma_Y}.$$

In [47]:
# Calculate Pearson correlation coefficient using formula
# By default, pandas std uses ddof=1 (square root of the unbiased variance)
corr = cov / (X.std() * Y.std())
corr

0.8120164291106637
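
This value differs from the 0.817842 that `df.corr()` reports for the same pair of columns. A likely cause is the missing value in "Social support": `df.corr()` excludes incomplete pairs, whereas the manual formula uses all 156 rows for the divisor and for the mean and standard deviation of `X`. The sketch below reproduces the effect on hypothetical toy data (made up, not from the dataset) and shows that restricting both series to complete pairs brings the manual formula in line with `Series.corr`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; y has one missing value, like "Social support"
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = pd.Series([2.0, 1.0, 4.0, np.nan, 5.0, 3.0])

# Naive version: mean, std and divisor for x use all rows,
# while the sum silently skips the row where y is NaN
cov_naive = ((x - x.mean()) * (y - y.mean())).sum() / (x.shape[0] - 1)
corr_naive = cov_naive / (x.std() * y.std())

# Pairwise-complete version: keep only rows where both values are present
mask = x.notna() & y.notna()
xc, yc = x[mask], y[mask]
cov_clean = ((xc - xc.mean()) * (yc - yc.mean())).sum() / (len(xc) - 1)
corr_clean = cov_clean / (xc.std() * yc.std())

print(corr_naive, corr_clean, x.corr(y))
```

For the notebook's data, the equivalent step would be to filter `X` and `Y` with such a mask (or call `dropna()` on the two columns together) before applying the formulas.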