## Intro

### What makes a country happier?

What makes a country happy? Or, an easier question: what makes you, as an individual, happier? Your expectations for the future, your income, or an honest government? According to the <a href="https://worldhappiness.report/faq/" target="_blank">World Happiness Report</a>: "**The variables used reflect what has been broadly found in the research literature to be important in explaining national-level differences in life evaluations.**" We will use that data to better understand what makes a country happier. Without further ado, let's get started.

***
- I will be using two datasets in this analysis. One is the latest World Happiness Report (2021), and the other contains data from previous years.
***

***
- Let's start with importing required libraries.
***

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt 
import seaborn as sns 
import matplotlib as mpl



import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

***
- Let's read our data.
***

In [2]:
df_2021 = pd.read_csv("world-happiness-report-2021.csv")
df_history = pd.read_csv("world-happiness-report.csv")
pd.set_option('display.max_columns', None)

***
- First, let's inspect the two dataframes one by one.
***

***
- Let's check the summary statistics of the 2021 data.
***

In [3]:
df_2021.describe()

Unnamed: 0,Ladder score,Standard error of ladder score,upperwhisker,lowerwhisker,Logged GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption,Ladder score in Dystopia,Explained by: Log GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption,Dystopia + residual
count,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0
mean,5.532839,0.058752,5.648007,5.417631,9.432208,0.814745,64.992799,0.791597,-0.015134,0.72745,2.43,0.977161,0.793315,0.520161,0.498711,0.178047,0.135141,2.430329
std,1.073924,0.022001,1.05433,1.094879,1.158601,0.114889,6.762043,0.113332,0.150657,0.179226,5.347044e-15,0.40474,0.258871,0.213019,0.137888,0.09827,0.114361,0.537645
min,2.523,0.026,2.596,2.449,6.635,0.463,48.478,0.382,-0.288,0.082,2.43,0.0,0.0,0.0,0.0,0.0,0.0,0.648
25%,4.852,0.043,4.991,4.706,8.541,0.75,59.802,0.718,-0.126,0.667,2.43,0.666,0.647,0.357,0.409,0.105,0.06,2.138
50%,5.534,0.054,5.625,5.413,9.569,0.832,66.603,0.804,-0.036,0.781,2.43,1.025,0.832,0.571,0.514,0.164,0.101,2.509
75%,6.255,0.07,6.344,6.128,10.421,0.905,69.6,0.877,0.079,0.845,2.43,1.323,0.996,0.665,0.603,0.239,0.174,2.794
max,7.842,0.173,7.904,7.78,11.647,0.983,76.953,0.97,0.542,0.939,2.43,1.751,1.172,0.897,0.716,0.541,0.547,3.482


***
- What about historic data?
***

In [4]:
df_history.describe()

Unnamed: 0,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect
count,1949.0,1949.0,1913.0,1936.0,1894.0,1917.0,1860.0,1839.0,1927.0,1933.0
mean,2013.216008,5.466705,9.368453,0.812552,63.359374,0.742558,0.000103,0.747125,0.710003,0.268544
std,4.166828,1.115711,1.154084,0.118482,7.510245,0.142093,0.162215,0.186789,0.1071,0.085168
min,2005.0,2.375,6.635,0.29,32.3,0.258,-0.335,0.035,0.322,0.083
25%,2010.0,4.64,8.464,0.74975,58.685,0.647,-0.113,0.69,0.6255,0.206
50%,2013.0,5.386,9.46,0.8355,65.2,0.763,-0.0255,0.802,0.722,0.258
75%,2017.0,6.283,10.353,0.905,68.59,0.856,0.091,0.872,0.799,0.32
max,2020.0,8.019,11.648,0.987,77.1,0.985,0.698,0.983,0.944,0.705


***
- Now, let's check general information of both dataframes.
***

In [5]:
df_2021.info()
print()
print("-"*60)
print()
df_history.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 20 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   Country name                                149 non-null    object 
 1   Regional indicator                          149 non-null    object 
 2   Ladder score                                149 non-null    float64
 3   Standard error of ladder score              149 non-null    float64
 4   upperwhisker                                149 non-null    float64
 5   lowerwhisker                                149 non-null    float64
 6   Logged GDP per capita                       149 non-null    float64
 7   Social support                              149 non-null    float64
 8   Healthy life expectancy                     149 non-null    float64
 9   Freedom to make life choices                149 non-null    float64
 10  Generosity    

***
- The 2021 data has no missing values. Great! The historic data, on the other hand, has some. Some column names also differ between the two dataframes. Although the dataframes have 11 and 20 columns respectively, I will focus on:
    - 'Country name'
    - 'year'
    - 'Life Ladder'
    - 'Log GDP per capita'
    - 'Social support'
    - 'Healthy life expectancy at birth'
    - 'Freedom to make life choices'
    - 'Generosity'
    - 'Perceptions of corruption'
    - 'Regional indicator'
    
- Let's prepare our data and concatenate the two dataframes into one.
***

In [6]:
# Align the 2021 column names with the historic data
df_2021 = df_2021.rename(columns={
    "Ladder score": "Life Ladder",
    "Logged GDP per capita": "Log GDP per capita",
    "Healthy life expectancy": "Healthy life expectancy at birth",
})
# Stack the historic rows first, then the 2021 rows
df = pd.concat([df_history, df_2021], axis=0, join="outer", ignore_index=True)
# Drop the columns that are not part of this analysis
df = df.drop(columns=['Standard error of ladder score', 'upperwhisker', 'lowerwhisker',
       'Ladder score in Dystopia', 'Explained by: Log GDP per capita',
       'Explained by: Social support', 'Explained by: Healthy life expectancy',
       'Explained by: Freedom to make life choices',
       'Explained by: Generosity', 'Explained by: Perceptions of corruption',
       'Dystopia + residual', "Positive affect", "Negative affect"])

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2098 entries, 0 to 2097
Data columns (total 10 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Country name                      2098 non-null   object 
 1   year                              1949 non-null   float64
 2   Life Ladder                       2098 non-null   float64
 3   Log GDP per capita                2062 non-null   float64
 4   Social support                    2085 non-null   float64
 5   Healthy life expectancy at birth  2043 non-null   float64
 6   Freedom to make life choices      2066 non-null   float64
 7   Generosity                        2009 non-null   float64
 8   Perceptions of corruption         1988 non-null   float64
 9   Regional indicator                149 non-null    object 
dtypes: float64(8), object(2)
memory usage: 164.0+ KB


***
- There seem to be some missing values. Let's count them per column.
***

In [8]:
df.isnull().sum()

Country name                           0
year                                 149
Life Ladder                            0
Log GDP per capita                    36
Social support                        13
Healthy life expectancy at birth      55
Freedom to make life choices          32
Generosity                            89
Perceptions of corruption            110
Regional indicator                  1949
dtype: int64

***
- The *`year`* column has 149 missing values, but all of them belong to the 2021 rows: the historic data had no missing values in *`year`*, and the 2021 file has no such column.
- The *`Regional indicator`* column has 1949 missing values for the opposite reason: the historic data had no *`Regional indicator`* column. We will fill those values using the 2021 data.
- I will check the other columns later.
***
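
As a small illustration of why these two columns end up with missing values, here is a minimal sketch on toy frames. The frames `history` and `latest` and their values are made up for the example; only the `pd.concat` call mirrors the one used above.

```python
import pandas as pd

# Toy stand-ins for the historic and 2021 frames (illustrative values only)
history = pd.DataFrame({
    "Country name": ["A", "B"],
    "year": [2019, 2020],
    "Life Ladder": [5.0, 6.0],
})
latest = pd.DataFrame({
    "Country name": ["A", "B"],
    "Life Ladder": [5.5, 6.5],
    "Regional indicator": ["X", "Y"],
})

# An outer join keeps the union of columns; a column that is absent from
# one input is filled with NaN for that input's rows
combined = pd.concat([history, latest], axis=0, join="outer", ignore_index=True)
print(combined.isnull().sum())
```

In the toy result, `year` is missing exactly for the rows that came from `latest`, and `Regional indicator` exactly for the rows that came from `history`.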

***
- Now, let's fill *`Regional indicator`* column by using 2021 data.
***

In [9]:
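for country in df["Country name"].unique():
    filt = (df["Country name"] == country)
    # The 2021 rows were appended last, so the last unique value is the
    # region from the 2021 data (or NaN if the country has no 2021 row)
    regional = df.loc[filt, "Regional indicator"].unique()[-1]
    df.loc[filt, "Regional indicator"] = regional
df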

Unnamed: 0,Country name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Regional indicator
0,Afghanistan,2008.0,3.724,7.370,0.451,50.800,0.718,0.168,0.882,South Asia
1,Afghanistan,2009.0,4.402,7.540,0.552,51.200,0.679,0.190,0.850,South Asia
2,Afghanistan,2010.0,4.758,7.647,0.539,51.600,0.600,0.121,0.707,South Asia
3,Afghanistan,2011.0,3.832,7.620,0.521,51.920,0.496,0.162,0.731,South Asia
4,Afghanistan,2012.0,3.783,7.705,0.521,52.240,0.531,0.236,0.776,South Asia
...,...,...,...,...,...,...,...,...,...,...
2093,Lesotho,,3.512,7.926,0.787,48.700,0.715,-0.131,0.915,Sub-Saharan Africa
2094,Botswana,,3.467,9.782,0.784,59.269,0.824,-0.246,0.801,Sub-Saharan Africa
2095,Rwanda,,3.415,7.676,0.552,61.400,0.897,0.061,0.167,Sub-Saharan Africa
2096,Zimbabwe,,3.145,7.943,0.750,56.201,0.677,-0.047,0.821,Sub-Saharan Africa
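
The same fill can probably be done without a Python loop. The following is a sketch of one alternative, not the method used in this notebook: it builds a country-to-region lookup and applies it with `Series.map`. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Toy frame: country A has a region on one of its rows, B and C have none
toy = pd.DataFrame({
    "Country name": ["A", "A", "B", "C"],
    "Regional indicator": [None, "Region 1", None, None],
})

# Build a country -> region lookup from the rows that do have a region
lookup = (
    toy.dropna(subset=["Regional indicator"])
       .drop_duplicates("Country name")
       .set_index("Country name")["Regional indicator"]
)

# Map every row's country to its region; unknown countries stay NaN
toy["Regional indicator"] = toy["Country name"].map(lookup)
print(toy)
```

As with the loop, a country that never appears with a region keeps NaN in `Regional indicator`.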


***
- Now it is time for the *`year`* column.
***

In [10]:
df.year = df.year.fillna(2021)

***
- Let's check missing values again.
***

In [11]:
df.isnull().sum()

Country name                          0
year                                  0
Life Ladder                           0
Log GDP per capita                   36
Social support                       13
Healthy life expectancy at birth     55
Freedom to make life choices         32
Generosity                           89
Perceptions of corruption           110
Regional indicator                   63
dtype: int64

***
- There are still some missing values. What should we do? Filling them with mean or median values is one option. Alternatively, we could drop the affected rows, but that is not ideal: *`Perceptions of corruption`* alone is missing in 110 rows.
- I will fill each missing value with that country's own mean for the column, which seems a reasonable approach to me.
***

In [12]:
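def fill_by_country(df):
    # Find the numeric columns that still contain missing values
    missing = df.drop(["Country name", "Regional indicator"], axis=1).isnull().sum()
    missing = missing[missing>0]
    for i in missing.index:
        # Replace each NaN with the mean of the same column for the same country
        df[i] = df.groupby("Country name")[i].transform(lambda val: val.fillna(val.mean()))
    return df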

In [13]:
fill_by_country(df)

Unnamed: 0,Country name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Regional indicator
0,Afghanistan,2008.0,3.724,7.370,0.451,50.800,0.718,0.168,0.882,South Asia
1,Afghanistan,2009.0,4.402,7.540,0.552,51.200,0.679,0.190,0.850,South Asia
2,Afghanistan,2010.0,4.758,7.647,0.539,51.600,0.600,0.121,0.707,South Asia
3,Afghanistan,2011.0,3.832,7.620,0.521,51.920,0.496,0.162,0.731,South Asia
4,Afghanistan,2012.0,3.783,7.705,0.521,52.240,0.531,0.236,0.776,South Asia
...,...,...,...,...,...,...,...,...,...,...
2093,Lesotho,2021.0,3.512,7.926,0.787,48.700,0.715,-0.131,0.915,Sub-Saharan Africa
2094,Botswana,2021.0,3.467,9.782,0.784,59.269,0.824,-0.246,0.801,Sub-Saharan Africa
2095,Rwanda,2021.0,3.415,7.676,0.552,61.400,0.897,0.061,0.167,Sub-Saharan Africa
2096,Zimbabwe,2021.0,3.145,7.943,0.750,56.201,0.677,-0.047,0.821,Sub-Saharan Africa


In [14]:
df.isnull().sum()

Country name                         0
year                                 0
Life Ladder                          0
Log GDP per capita                  12
Social support                       1
Healthy life expectancy at birth     4
Freedom to make life choices         0
Generosity                          12
Perceptions of corruption            2
Regional indicator                  63
dtype: int64
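
To see why a per-country mean fill can leave some values missing, here is a minimal sketch on a toy frame (the countries and numbers are invented). It uses the same `groupby(...).transform(...)` pattern as `fill_by_country`.

```python
import numpy as np
import pandas as pd

# Country A has some observed values; country B has none at all
toy = pd.DataFrame({
    "Country name": ["A", "A", "A", "B", "B"],
    "Generosity": [0.1, np.nan, 0.3, np.nan, np.nan],
})

# Fill each NaN with the mean of the same country's observed values
toy["Generosity"] = (
    toy.groupby("Country name")["Generosity"]
       .transform(lambda s: s.fillna(s.mean()))
)
print(toy)
```

Country A's gap is filled with its mean of 0.2, while country B stays NaN because the mean of an all-NaN group is itself NaN.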

***
- We filled most of the missing values, but some remain. These belong to countries with no recorded value at all in that column, so there is no country mean to fill with. Let's check whether the same rows have more than one missing value. If so, it may be sensible to simply drop those rows.
***
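
One quick way to check for rows with several gaps is to count missing values per row. This is a sketch on a hypothetical toy frame; the column names are borrowed from the dataset but the values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Log GDP per capita": [9.1, np.nan, np.nan],
    "Generosity": [0.1, np.nan, 0.2],
    "Social support": [0.8, 0.7, 0.9],
})

# Count NaNs across the columns of each row
missing_per_row = toy.isnull().sum(axis=1)

# Keep only the rows where more than one column is missing
rows_with_several = toy[missing_per_row > 1]
print(missing_per_row.tolist())
print(rows_with_several)
```

On the real data, the same two lines applied to `df` would list the rows that are candidates for dropping.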

In [15]:
df[df["Log GDP per capita"].isnull()]

Unnamed: 0,Country name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Regional indicator
423,Cuba,2006.0,5.418,,0.97,68.44,0.281,,,
1559,Somalia,2014.0,5.528,,0.611,49.6,0.874,,0.456,
1560,Somalia,2015.0,5.354,,0.599,50.1,0.968,,0.41,
1561,Somalia,2016.0,4.668,,0.594,50.0,0.917,,0.441,
1562,Somaliland region,2009.0,4.991,,0.88,,0.746,,0.513,
1563,Somaliland region,2010.0,4.657,,0.829,,0.82,,0.471,
1564,Somaliland region,2011.0,4.931,,0.788,,0.858,,0.357,
1565,Somaliland region,2012.0,5.057,,0.786,,0.758,,0.334,
1596,South Sudan,2014.0,3.832,,0.545,49.84,0.567,,0.742,
1597,South Sudan,2015.0,4.071,,0.585,50.2,0.512,,0.71,


***
- All missing values in *`Log GDP per capita`* and *`Generosity`* fall in the same 12 rows, and these rows also lack a *`Regional indicator`*. Four of them are also missing *`Healthy life expectancy at birth`*. Let's drop those 12 rows.
***

In [16]:
df = df[~df["Log GDP per capita"].isnull()]
df.isnull().sum()

Country name                         0
year                                 0
Life Ladder                          0
Log GDP per capita                   0
Social support                       1
Healthy life expectancy at birth     0
Freedom to make life choices         0
Generosity                           0
Perceptions of corruption            1
Regional indicator                  51
dtype: int64

***
- Apart from the *`Regional indicator`* column, only 2 missing values remain. Let's look at one of them.
***

In [17]:
df[df["Social support"].isnull()]

Unnamed: 0,Country name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Regional indicator
1310,Oman,2011.0,6.853,10.382,,65.5,0.916,0.025,,


***
- Great! The last missing values, in *`Social support`* and *`Perceptions of corruption`*, are in the same row. Let's drop this row too.
***

In [18]:
df = df[~df["Social support"].isnull()]
df.isnull().sum()

Country name                         0
year                                 0
Life Ladder                          0
Log GDP per capita                   0
Social support                       0
Healthy life expectancy at birth     0
Freedom to make life choices         0
Generosity                           0
Perceptions of corruption            0
Regional indicator                  50
dtype: int64

***
- We have done it! There are no missing values left except in the *`Regional indicator`* column, where it is OK to leave NaN: the 2021 data simply has no entry for the countries that appear only in the historic data. That is not a problem, so we can start our analysis now.
***

***
## What Makes a Country Happier?
***

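# Bin "Perceptions of corruption" into trust levels. The cut-offs match the
# quartiles of this column in the 2021 data (0.667, 0.781, 0.845).
def trust(corrupt):
    if corrupt >= 0.845:
        return "Low trust in institutions"
    elif corrupt > 0.781:
        return "Lower than normal level trust in institutions"
    elif corrupt > 0.667:
        return "About the normal level trust in institutions"
    else:
        return "High level trust in institutions"
# df3 is not defined in the cells shown here; it must be a dataframe
# with a "Perceptions of corruption" column
df3["Trust in institutions"] = df3["Perceptions of corruption"].apply(trust)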

***
- First, let's check correlation among the numerical variables in the dataset.
***

In [19]:
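# Keep numeric columns only so that corr() also works on recent pandas versions
df.select_dtypes("number").drop("year",axis=1).corr()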

Unnamed: 0,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption
Life Ladder,1.0,0.788567,0.711121,0.737231,0.525287,0.178586,-0.445101
Log GDP per capita,0.788567,1.0,0.695889,0.849697,0.366862,-0.0075,-0.372409
Social support,0.711121,0.695889,1.0,0.612149,0.416092,0.056169,-0.228155
Healthy life expectancy at birth,0.737231,0.849697,0.612149,1.0,0.395658,0.02421,-0.348888
Freedom to make life choices,0.525287,0.366862,0.416092,0.395658,1.0,0.315045,-0.484465
Generosity,0.178586,-0.0075,0.056169,0.02421,0.315045,1.0,-0.281922
Perceptions of corruption,-0.445101,-0.372409,-0.228155,-0.348888,-0.484465,-0.281922,1.0


***
- Happiness score (Life Ladder) is strongly correlated with GDP, social support and healthy life expectancy at birth (all above 0.7).
- Freedom to make life choices has a moderate correlation with happiness score (about 0.53).
- Perceptions of corruption has a moderate negative correlation with happiness score (about -0.45).
***
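
The labels used above can be made explicit with a small helper. This is only a sketch: the cut-offs of 0.7 and 0.4 are an assumed convention, not something defined by the dataset, and the values are copied from the `Life Ladder` row of the correlation matrix.

```python
import pandas as pd

# Correlations with "Life Ladder", copied from the matrix above
r = pd.Series({
    "Log GDP per capita": 0.788567,
    "Social support": 0.711121,
    "Healthy life expectancy at birth": 0.737231,
    "Freedom to make life choices": 0.525287,
    "Generosity": 0.178586,
    "Perceptions of corruption": -0.445101,
})

def strength(value, strong=0.7, moderate=0.4):
    """Label |r| using assumed cut-offs (hypothetical convention)."""
    size = abs(value)
    if size >= strong:
        return "strong"
    if size >= moderate:
        return "moderate"
    return "weak"

# Sort by absolute size so the most related factors come first
summary = (
    pd.DataFrame({"r": r, "strength": r.map(strength)})
      .sort_values("r", key=abs, ascending=False)
)
print(summary)
```

With other cut-offs the labels would shift, so the wording is best read as a rough guide rather than a fixed classification.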

***
- Let's view the correlations as a heatmap.
***

In [40]:
# Compute the correlation matrix once, on numeric columns only
corr = df.select_dtypes("number").corr()
fig = go.Figure(go.Heatmap(z=corr.values, x=corr.columns.tolist(), y=corr.index.tolist(),
                          colorscale="viridis"))
fig.show()

***
- What about happiness at the regional level?
***

In [27]:
df.groupby("Regional indicator")["Life Ladder"].describe().sort_values(by="std")

Regional indicator,count,mean,std,min,25%,50%,75%,max
North America and ANZ,62.0,7.254919,0.179134,6.804,7.16675,7.2555,7.37825,7.65
East Asia,88.0,5.645511,0.563235,4.454,5.2815,5.722,5.979,6.947
Commonwealth of Independent States,182.0,5.212835,0.612177,3.675,4.7465,5.2565,5.7105,6.568
Central and Eastern Europe,242.0,5.55793,0.619974,3.844,5.10925,5.5915,6.03575,7.034
Sub-Saharan Africa,426.0,4.317688,0.633168,2.694,3.831,4.32,4.7795,6.241
South Asia,91.0,4.537363,0.68475,2.375,4.229,4.567,5.026,5.831
Western Europe,292.0,6.833096,0.705528,4.72,6.438,6.9755,7.417,8.019
Southeast Asia,125.0,5.34596,0.752216,3.569,4.88,5.339,5.885,7.062
Latin America and Caribbean,299.0,5.987666,0.766028,3.352,5.6495,6.025,6.4665,7.615
Middle East and North Africa,228.0,5.384469,0.980232,2.983,4.701,5.165,6.17475,7.433


***
- North America and ANZ has the highest mean happiness (7.25), while Sub-Saharan Africa has the lowest (4.32).
- Western Europe has the highest maximum happiness score (8.019), and South Asia has the lowest maximum (5.831).
- North America and ANZ has the smallest standard deviation and an almost identical mean and median, which suggests that scores there are tightly clustered and roughly symmetric.
- Middle East and North Africa has the highest standard deviation.
***
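
A mean close to the median hints at symmetry, but it does not by itself show that the data is normally distributed. The sketch below, on invented toy data with hypothetical regions `R1` and `R2`, shows how mean, median and skewness could be compared per region.

```python
import pandas as pd

# R1 is symmetric; R2 has one very low value that pulls the mean down
toy = pd.DataFrame({
    "Regional indicator": ["R1"] * 5 + ["R2"] * 5,
    "Life Ladder": [7.0, 7.1, 7.2, 7.3, 7.4, 3.0, 5.0, 5.1, 5.2, 5.3],
})

stats = toy.groupby("Regional indicator")["Life Ladder"].agg(["mean", "median", "skew"])

# A gap between mean and median is a simple sign of asymmetry
stats["mean_minus_median"] = stats["mean"] - stats["median"]
print(stats)
```

In the toy data, `R1` shows no gap between mean and median, while `R2` has a negative gap and negative skew caused by its low outlier.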

***
- Let's see all of this more clearly with a boxplot.
***

In [39]:
fig = px.box(df, x="Life Ladder", y="Regional indicator", hover_data=['Regional indicator', 'Country name'])
fig.show()

***
- South Asia, Latin America and Caribbean, North America and ANZ, and Western Europe have several outliers on the low side.
- Sub-Saharan Africa has at least one outlier on the high side.
- Middle East and North Africa seems interesting to me, since it has long tails on both the low and high sides.
- Let's dive into Middle East and North Africa.

In [36]:
middle_east = df[df["Regional indicator"] == "Middle East and North Africa"]
middle_east

Unnamed: 0,Country name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Regional indicator
25,Algeria,2010.0,5.464,9.287,0.803375,64.500,0.593000,-0.205000,0.618000,Middle East and North Africa
26,Algeria,2011.0,5.317,9.297,0.810000,64.660,0.530000,-0.181000,0.638000,Middle East and North Africa
27,Algeria,2012.0,5.605,9.311,0.839000,64.820,0.587000,-0.172000,0.690000,Middle East and North Africa
28,Algeria,2014.0,6.355,9.335,0.818000,65.140,0.513571,-0.133286,0.699714,Middle East and North Africa
29,Algeria,2016.0,5.341,9.362,0.749000,65.500,0.513571,-0.133286,0.699714,Middle East and North Africa
...,...,...,...,...,...,...,...,...,...,...
2071,Lebanon,2021.0,4.584,9.626,0.848000,67.355,0.525000,-0.073000,0.898000,Middle East and North Africa
2073,Palestinian Territories,2021.0,4.517,8.485,0.826000,62.250,0.653000,-0.163000,0.821000,Middle East and North Africa
2075,Jordan,2021.0,4.395,9.182,0.767000,67.000,0.755000,-0.167000,0.705000,Middle East and North Africa
2080,Egypt,2021.0,4.283,9.367,0.750000,61.998,0.749000,-0.182000,0.795000,Middle East and North Africa


***
- Let's see how the variables correlate in Middle East and North Africa. For comparison, the whole-dataset matrix is shown again first, followed by the regional one.
***

In [37]:
df.select_dtypes("number").drop("year",axis=1).corr()

Unnamed: 0,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption
Life Ladder,1.0,0.788567,0.711121,0.737231,0.525287,0.178586,-0.445101
Log GDP per capita,0.788567,1.0,0.695889,0.849697,0.366862,-0.0075,-0.372409
Social support,0.711121,0.695889,1.0,0.612149,0.416092,0.056169,-0.228155
Healthy life expectancy at birth,0.737231,0.849697,0.612149,1.0,0.395658,0.02421,-0.348888
Freedom to make life choices,0.525287,0.366862,0.416092,0.395658,1.0,0.315045,-0.484465
Generosity,0.178586,-0.0075,0.056169,0.02421,0.315045,1.0,-0.281922
Perceptions of corruption,-0.445101,-0.372409,-0.228155,-0.348888,-0.484465,-0.281922,1.0


In [38]:
middle_east.select_dtypes("number").drop("year",axis=1).corr()

Unnamed: 0,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption
Life Ladder,1.0,0.818969,0.640044,0.674217,0.524962,0.448896,-0.52646
Log GDP per capita,0.818969,1.0,0.559865,0.655847,0.544174,0.453228,-0.630186
Social support,0.640044,0.559865,1.0,0.372988,0.313726,0.237148,-0.407732
Healthy life expectancy at birth,0.674217,0.655847,0.372988,1.0,0.387367,0.382277,-0.196717
Freedom to make life choices,0.524962,0.544174,0.313726,0.387367,1.0,0.349958,-0.56586
Generosity,0.448896,0.453228,0.237148,0.382277,0.349958,1.0,-0.234272
Perceptions of corruption,-0.52646,-0.630186,-0.407732,-0.196717,-0.56586,-0.234272,1.0


***
- The Middle East and North Africa correlation matrix is broadly similar to that of the whole dataset, but there are some differences.
- GDP is slightly more correlated with happiness than in the whole dataset (0.82 vs 0.79), while generosity is far more correlated with happiness in Middle East and North Africa (0.45 vs 0.18).
- Social support and healthy life expectancy at birth are somewhat less correlated with happiness than in the whole dataset (0.64 vs 0.71 and 0.67 vs 0.74), whereas perceptions of corruption has a stronger negative correlation (-0.53 vs -0.45).
- The correlation for freedom to make life choices is almost identical in both datasets (about 0.52).
***
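
The comparison above can be laid out side by side. The sketch below copies the `Life Ladder` rows from the two correlation matrices shown earlier; the names `whole`, `mena` and `comparison` are just labels chosen for this example.

```python
import pandas as pd

factors = [
    "Log GDP per capita",
    "Social support",
    "Healthy life expectancy at birth",
    "Freedom to make life choices",
    "Generosity",
    "Perceptions of corruption",
]

# Correlation with "Life Ladder": whole dataset vs Middle East and North Africa
whole = pd.Series([0.788567, 0.711121, 0.737231, 0.525287, 0.178586, -0.445101], index=factors)
mena = pd.Series([0.818969, 0.640044, 0.674217, 0.524962, 0.448896, -0.526460], index=factors)

comparison = pd.DataFrame({"whole": whole, "mena": mena})
# Signed difference, and change in strength regardless of sign
comparison["diff"] = comparison["mena"] - comparison["whole"]
comparison["abs_change"] = comparison["mena"].abs() - comparison["whole"].abs()
print(comparison.round(3).sort_values("abs_change"))
```

Looking at `abs_change` rather than `diff` avoids misreading the corruption row, where a more negative value means a stronger relationship.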

***
- Let's look at heatmap for better understanding.
***

In [41]:
# Compute the regional correlation matrix once, on numeric columns only
me_corr = middle_east.select_dtypes("number").corr()
fig = go.Figure(go.Heatmap(z=me_corr.values, x=me_corr.columns.tolist(), y=me_corr.index.tolist(),
                          colorscale="viridis"))
fig.show()

In [42]:
middle_east.describe()

Unnamed: 0,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption
count,228.0,228.0,228.0,228.0,228.0,228.0,228.0,228.0
mean,2014.0,5.384469,9.692509,0.790107,64.557434,0.681605,-0.058448,0.724934
std,4.412637,0.980232,0.872271,0.09288,3.889171,0.139725,0.123149,0.149805
min,2005.0,2.983,7.578,0.535,53.4,0.315,-0.244,0.203
25%,2011.0,4.701,9.21975,0.7265,62.2875,0.5945,-0.151,0.6665
50%,2014.0,5.165,9.49,0.818,65.2,0.6645,-0.081,0.7535
75%,2018.0,6.17475,10.57125,0.86225,66.5,0.774,0.031,0.83125
max,2021.0,7.433,11.367,0.959,73.7,0.962,0.241,0.949


***
- Based on the descriptive statistics, possible outliers can be seen in:
    - Healthy life expectancy at birth
    - Freedom to make life choices
    - Perceptions of corruption
***
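
One common way to turn "possible outliers" into a concrete check is the 1.5 × IQR rule, which is also what boxplots typically draw. The helper `iqr_outliers` below is a hypothetical name, and the toy values are invented for the example.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Toy values loosely shaped like a life-expectancy column
toy = pd.Series([62.0, 64.0, 65.0, 66.0, 67.0, 53.0, 73.7],
                name="Healthy life expectancy at birth")
print(iqr_outliers(toy))
```

Applied to the quartiles in the table above, the same rule gives fences of roughly 55.97 and 72.82 for healthy life expectancy, so both the minimum (53.4) and the maximum (73.7) fall outside them.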

***
- In this EDA, I will mostly focus on the happiness score.
***

## Happiness in the Middle East and North Africa

***
- Let's start with a boxplot.
***

In [48]:
fig = px.box(middle_east, x="Life Ladder", hover_data=['Country name'])
fig.update_traces(quartilemethod="inclusive")
fig.show()

***
- The boxplot does not show any outliers.
***

***
- Let's continue with a bar plot.
***

In [53]:
middle_east = middle_east.sort_values(by="Life Ladder")

fig = px.bar(middle_east, x="Life Ladder", y="Country name")
fig.show()

***
- **Israel** has the highest happiness score in Middle East and North Africa.
- **Yemen** has the lowest happiness score in Middle East and North Africa.

In [62]:
# Mean happiness score per country over all available years
mean_ladder = middle_east.groupby("Country name")["Life Ladder"].mean().sort_values()

fig=go.Figure()
fig.add_trace(go.Scatter(
    x=mean_ladder.index,
    y=mean_ladder.values,
    name='Happiness Score',
    mode='markers+text',
    marker_color='blue',
    marker_size=10,
    textposition='top center',
    line=dict(color='red',dash='dash'),
))
fig.update_layout(
    title= "<b>Middle East and North Africa: Mean Happiness Score by Country</b>",
    xaxis_title="<b>Country</b>",
    yaxis_title="<b>Happiness Score</b>",
    template='plotly_white',
    font=dict(
        size=12,
        color="Black",
        family="Oswald, sans-serif"
        ),
    xaxis=dict(showgrid=True),
    yaxis=dict(showgrid=True),
    yaxis2=dict(showgrid=True,overlaying='y',side='right',title='<b>Happiness Score</b>'),
)
fig.show()

***
- It is important to understand that we are not working with just the 2021 data but with all the data, which for this region spans 2005 to 2021. These values are the mean over all years.
- Most of the countries have a happiness score between 4.5 and 6.5.
***

***
- Let's see how happiness scores have changed over the years.
***

In [99]:
middle_east = middle_east.sort_values(["Country name", "year"])

In [75]:
css3_colors = ['#add8e6', '#f08080','#e0ffff','#fafad2','#d3d3d3','#90ee90','#ffb6c1','#ffa07a','#20b2aa','#87cefa','#778899','#b0c4de','#32cd32','#ff00ff','#66cdaa','#ba55d3', '#7b68ee']
# Assign one color per country, in order of first appearance
css3_dict = {name: css3_colors[i] for i, name in enumerate(middle_east["Country name"].unique())}

In [82]:
fig=go.Figure()
for name in middle_east['Country name'].unique():
    # One line per country, using that country's rows only
    country = middle_east[middle_east['Country name']==name]
    fig.add_trace(go.Scatter(
        x=country['year'],
        y=country['Life Ladder'],
        name=name,
        mode='markers+text+lines',
        marker_color='black',
        line=dict(color=css3_dict[name]),
        marker_size=3,
        yaxis='y1'))

fig.update_layout(
    title="Happiness Score Trend in Middle East and North Africa",
    xaxis_title="Year",
    yaxis_title='Happiness Score',
    template='plotly_white',
    font=dict(
        size=14,
        color="Blue",
        family="Oswald, sans-serif"
    ),
    xaxis=dict(showgrid=True),
    yaxis=dict(showgrid=True)
)
fig.show()

***
- This plot is rather cluttered. Let's visualize the change country by country.
***

***
- These visualizations are clearer.
- Algeria, Jordan, Turkey and United Arab Emirates show a downtrend.
- Happiness in Bahrain and Iraq has increased over recent years.
***
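
Trends read off a plot can be cross-checked with a simple number. The sketch below fits a straight line per country and reports its slope, under the assumption that a linear fit is a reasonable summary; the helper `yearly_slope` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country name": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "year": [2019, 2020, 2021] * 3,
    "Life Ladder": [5.0, 5.2, 5.1, 6.0, 5.8, 5.5, 4.0, 4.3, 4.6],
})

def yearly_slope(group):
    """Least-squares slope of Life Ladder per year for one country."""
    # Center the years to keep the fit numerically well behaved
    x = group["year"] - group["year"].mean()
    slope, _intercept = np.polyfit(x, group["Life Ladder"], deg=1)
    return slope

# Negative slope = downtrend, positive slope = uptrend
slopes = pd.Series({name: yearly_slope(g) for name, g in toy.groupby("Country name")})
print(slopes.sort_values())
```

A slope summarizes the whole period, so it can hide a recent reversal; it complements the plots rather than replacing them.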

***
- Let's now look at how the other variables relate to the happiness score.
***

## GDP's Impact on Happiness Score

***
- Is money really the answer to our happiness? Let's check!
***

In [87]:
middle_east[middle_east["year"]==2021]

Unnamed: 0,Country name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Regional indicator
2057,Algeria,2021.0,4.887,9.342,0.802,66.005,0.48,-0.067,0.752,Middle East and North Africa
1970,Bahrain,2021.0,6.647,10.669,0.862,69.495,0.925,0.089,0.722,Middle East and North Africa
2080,Egypt,2021.0,4.283,9.367,0.75,61.998,0.749,-0.182,0.795,Middle East and North Africa
2066,Iran,2021.0,4.721,9.584,0.71,66.3,0.608,0.218,0.714,Middle East and North Africa
2059,Iraq,2021.0,4.854,9.24,0.746,60.583,0.63,-0.053,0.875,Middle East and North Africa
1960,Israel,2021.0,7.157,10.575,0.939,73.503,0.8,0.031,0.753,Middle East and North Africa
2075,Jordan,2021.0,4.395,9.182,0.767,67.0,0.755,-0.167,0.705,Middle East and North Africa
1995,Kuwait,2021.0,6.106,10.817,0.843,66.9,0.867,-0.104,0.736,Middle East and North Africa
2071,Lebanon,2021.0,4.584,9.626,0.848,67.355,0.525,-0.073,0.898,Middle East and North Africa
2028,Libya,2021.0,5.41,9.622,0.827,62.3,0.771,-0.087,0.667,Middle East and North Africa


In [88]:
# Filter the 2021 rows once so that x, y and hover text stay aligned
me_2021 = middle_east[middle_east["year"]==2021]
trace = go.Scatter(x=me_2021['Life Ladder'], y=me_2021['Log GDP per capita'],
                   text=me_2021['Country name'], mode='markers',
                   marker={'color':'blue', 'size':10})
data = [trace]  # separate name so the df dataframe is not overwritten
layout = go.Layout(title='Happiness Score & Logged GDP per capita in Middle East',
                   xaxis=dict(title='Ladder Score'), yaxis=dict(title='Log GDP per capita'),
                   hovermode='closest')
figure = go.Figure(data=data,layout=layout)
figure.update_layout(template='plotly_white',
                  font=dict(family="Oswald, sans-serif"))
figure.show()

In [89]:
print(f"Happiness score and GDP have a correlation of {round(middle_east['Life Ladder'].corr(middle_east['Log GDP per capita']),2)}.")

Happiness score and GDP have a correlation of 0.82.


***
- The answer to the question I asked earlier is yes, at least as far as correlation goes: GDP has a really strong correlation with happiness score, as the scatterplot also shows.
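
For reference, the correlation score printed above is Pearson's r. The sketch below computes it by hand on invented toy numbers and checks it against pandas' `Series.corr`; the helper name `pearson_r` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy values standing in for log GDP and happiness score
x = pd.Series([9.2, 9.4, 9.6, 10.6, 10.8])
y = pd.Series([4.4, 4.9, 4.7, 7.2, 6.1])

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of spreads."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

manual = pearson_r(x, y)
builtin = x.corr(y)  # pandas uses Pearson by default
print(round(manual, 4), round(builtin, 4))
```

Pearson's r only measures how well a straight line describes the relationship; it says nothing about which variable drives the other.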

***
- What about social support?
***

## Social Support's Impact on Happiness Score

In [90]:
# Filter the 2021 rows once so that x, y and hover text stay aligned
me_2021 = middle_east[middle_east["year"]==2021]
trace = go.Scatter(x=me_2021['Life Ladder'], y=me_2021['Social support'],
                   text=me_2021['Country name'], mode='markers',
                   marker={'color':'blue', 'size':10})
data = [trace]  # separate name so the df dataframe is not overwritten
layout = go.Layout(title='Happiness Score & Social support in Middle East',
                   xaxis=dict(title='Ladder Score'), yaxis=dict(title='Social support'),
                   hovermode='closest')
figure = go.Figure(data=data,layout=layout)
figure.update_layout(template='plotly_white',
                  font=dict(family="Oswald, sans-serif"))
figure.show()

In [91]:
print(f"Happiness score and social support have a correlation of {round(middle_east['Life Ladder'].corr(middle_east['Social support']),2)}.")

Happiness score and social support have a correlation of 0.64.


***
- Social support also has a fairly strong correlation with happiness.
***

## Healthy Life Expectancy at Birth's Impact on Happiness Score

In [92]:
# Filter the 2021 rows once so that x, y and hover text stay aligned
me_2021 = middle_east[middle_east["year"]==2021]
trace = go.Scatter(x=me_2021['Life Ladder'], y=me_2021['Healthy life expectancy at birth'],
                   text=me_2021['Country name'], mode='markers',
                   marker={'color':'blue', 'size':10})
data = [trace]  # separate name so the df dataframe is not overwritten
layout = go.Layout(title='Happiness Score & Healthy life expectancy at birth in Middle East',
                   xaxis=dict(title='Ladder Score'), yaxis=dict(title='Healthy life expectancy at birth'),
                   hovermode='closest')
figure = go.Figure(data=data,layout=layout)
figure.update_layout(template='plotly_white',
                  font=dict(family="Oswald, sans-serif"))
figure.show()

In [93]:
print(f"Happiness score and healthy life expectancy at birth have a correlation of {round(middle_east['Life Ladder'].corr(middle_east['Healthy life expectancy at birth']),2)}.")

Happiness score and healthy life expectancy at birth have a correlation of 0.67.


***
- Healthy life expectancy at birth also has a fairly strong correlation with happiness.
***

## Perceptions of Corruption's Impact on Happiness Score

In [94]:
# Filter the 2021 rows once so that x, y and hover text stay aligned
me_2021 = middle_east[middle_east["year"]==2021]
trace = go.Scatter(x=me_2021['Life Ladder'], y=me_2021['Perceptions of corruption'],
                   text=me_2021['Country name'], mode='markers',
                   marker={'color':'blue', 'size':10})
data = [trace]  # separate name so the df dataframe is not overwritten
layout = go.Layout(title='Happiness Score & Perceptions of corruption in Middle East',
                   xaxis=dict(title='Ladder Score'), yaxis=dict(title='Perceptions of corruption'),
                   hovermode='closest')
figure = go.Figure(data=data,layout=layout)
figure.update_layout(template='plotly_white',
                  font=dict(family="Oswald, sans-serif"))
figure.show()

In [95]:
print(f"Happiness score and perceptions of corruption have a correlation of {round(middle_east['Life Ladder'].corr(middle_east['Perceptions of corruption']),2)}.")

Happiness score and perceptions of corruption have a correlation of -0.53.


***
- As expected, perceptions of corruption has a moderate negative correlation with happiness. If the people who rule you are not honest, it is hard to stay happy.
***

## Freedom to Make Life Choices' Impact on Happiness Score

In [96]:
# Filter the 2021 rows once so that x, y and hover text stay aligned
me_2021 = middle_east[middle_east["year"]==2021]
trace = go.Scatter(x=me_2021['Life Ladder'], y=me_2021['Freedom to make life choices'],
                   text=me_2021['Country name'], mode='markers',
                   marker={'color':'blue', 'size':10})
data = [trace]  # separate name so the df dataframe is not overwritten
layout = go.Layout(title='Happiness Score & Freedom to make life choices in Middle East',
                   xaxis=dict(title='Ladder Score'), yaxis=dict(title='Freedom to make life choices'),
                   hovermode='closest')
figure = go.Figure(data=data,layout=layout)
figure.update_layout(template='plotly_white',
                  font=dict(family="Oswald, sans-serif"))
figure.show()

In [97]:
print(f"Happiness score and freedom to make life choices have a correlation of {round(middle_east['Life Ladder'].corr(middle_east['Freedom to make life choices']),2)}.")

Happiness score and freedom to make life choices have a correlation of 0.52.


***

We have come to the end of another great analysis. It was really enjoyable for me, and it was quite a pleasure to work with this dataset. I would like to thank the dataset contributor for this data. I hope you enjoyed it too. If you liked my EDA on this dataset, feel free to check out my other notebooks as well. Looking forward to your feedback. Thanks a lot.

Have a great day.