# Exploratory Data Analysis of World Happiness

- The World Happiness index dataset comes from [Kaggle](https://www.kaggle.com/unsdsn/world-happiness).
- Target variable: Happiness Score

---
## ***Data Cleaning***
- Data types
- Missing values

In [191]:
import pandas as pd
import numpy as np
# plotting tools
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
import missingno
# for reading and writing JSON files
import json
import math
# for handling outliers
from scipy.stats.mstats import winsorize
# for the Jarque-Bera test
from scipy.stats import jarque_bera
from scipy.stats import normaltest
# for the t-test and ANOVA
import scipy.stats as stats
# for PCA
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# suppress warnings
import warnings
warnings.filterwarnings('ignore')

# for matching files with the * wildcard
import glob
# for path operations
import os

# by default pandas truncates the display when there are many rows or columns,
# so raise the limits to 100 rows and 100 columns
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

# show floats with three digits after the decimal point
pd.options.display.float_format = '{:,.3f}'.format

# font definitions
title_font = {'family': 'times new roman', 'color': 'darkred','weight': 'bold','size': 14}
axis_font  = {'family': 'times new roman', 'color': 'darkred','weight': 'bold','size': 14}

- Loading data from multiple files into a single DataFrame.

In [192]:
# load the dataset into a DataFrame
# path is the current working directory
path = './'
# list of all .csv files in that directory
all_files = glob.glob(os.path.join(path, "*.csv"))
# start from an empty happiness DataFrame
happiness = pd.DataFrame([])
# read each .csv file in a loop and add it to happiness
for f in all_files:
    df = pd.read_csv(f)
    # add the year, taken from the file name (e.g. 2015.csv)
    df['year'] = int(os.path.basename(f).split('.')[0])
    # append the file's rows to the combined DataFrame
    happiness = pd.concat([happiness, df])
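
The loading step above can also be written with a single `pd.concat` call at the end, which avoids copying the growing frame on every iteration. The following is a minimal sketch rather than the notebook's own code: the helper name `load_yearly_csvs` and the two tiny demo files are made up for illustration, and it assumes, as the loop above does, that each file is named after its year.

```python
import glob
import os
import tempfile

import pandas as pd


def load_yearly_csvs(directory):
    """Read every '<year>.csv' in `directory` into one DataFrame (hypothetical helper)."""
    frames = []
    # sorted() makes the row order reproducible; glob order is arbitrary
    for csv_path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
        frame = pd.read_csv(csv_path)
        # assumption: the file name without its extension is the year
        stem = os.path.splitext(os.path.basename(csv_path))[0]
        frame["year"] = int(stem)
        frames.append(frame)
    # one concat at the end; columns missing from a file are filled with NaN
    return pd.concat(frames)


# Two made-up files, only to show the call
with tempfile.TemporaryDirectory() as tmp_dir:
    pd.DataFrame({"Country": ["A", "B"], "Happiness Score": [7.5, 6.1]}).to_csv(
        os.path.join(tmp_dir, "2015.csv"), index=False)
    pd.DataFrame({"Country or region": ["A"], "Score": [7.6]}).to_csv(
        os.path.join(tmp_dir, "2018.csv"), index=False)
    demo = load_yearly_csvs(tmp_dir)

print(demo)
```

Because the yearly files use different column names, the combined frame holds the union of all columns, which is where the many mostly empty columns in the rest of this section come from. Note that `pd.concat` raises a `ValueError` if the directory contains no CSV files.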



***Inspecting data types***

In [193]:
happiness.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 782 entries, 0 to 157
Data columns (total 31 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Overall rank                   312 non-null    float64
 1   Country or region              312 non-null    object 
 2   Score                          312 non-null    float64
 3   GDP per capita                 312 non-null    float64
 4   Social support                 312 non-null    float64
 5   Healthy life expectancy        312 non-null    float64
 6   Freedom to make life choices   312 non-null    float64
 7   Generosity                     782 non-null    float64
 8   Perceptions of corruption      311 non-null    float64
 9   year                           782 non-null    int64  
 10  Country                        470 non-null    object 
 11  Region                         315 non-null    object 
 12  Happiness Rank                 315 non-null    flo

***Missing values***

- Many columns are largely empty, but the same quantity appears under slightly different column names from year to year, so these columns can be merged.

In [194]:
happiness.isnull().mean()*100

Overall rank                    60.102
Country or region               60.102
Score                           60.102
GDP per capita                  60.102
Social support                  60.102
Healthy life expectancy         60.102
Freedom to make life choices    60.102
Generosity                       0.000
Perceptions of corruption       60.230
year                             0.000
Country                         39.898
Region                          59.719
Happiness Rank                  59.719
Happiness Score                 59.719
Lower Confidence Interval       79.923
Upper Confidence Interval       79.923
Economy (GDP per Capita)        59.719
Family                          39.898
Health (Life Expectancy)        59.719
Freedom                         39.898
Trust (Government Corruption)   59.719
Dystopia Residual               59.719
Happiness.Rank                  80.179
Happiness.Score                 80.179
Whisker.high                    80.179
Whisker.low              

- For each year, we identify the columns that are completely filled and those that are completely empty. This lets us merge similarly named columns into a single column and drop the unnecessary ones.

In [195]:
columns = happiness.columns
non_null_columns = {}
null_columns = {}
years = [2000+i for i in range(15,20)]
for year in years:
    non_null_columns[year] = []
    null_columns[year] = []
    for column in columns:
        if happiness.loc[happiness['year']==year, column].isnull().mean() == 0:
            non_null_columns[year].append(column)
        elif happiness.loc[happiness['year']==year, column].isnull().mean() == 1:
            null_columns[year].append(column)
    non_null_columns[year].sort()
    
print('Year'.ljust(8),'Fully filled'.rjust(13), 
      'Fully empty'.rjust(13), 'Total columns'.rjust(13))
for year in years:
    print('{:<8}{:>13}{:>13}{:>13}'.format(year,
                                           len(non_null_columns[year]),
                                           len(null_columns[year]),
                                           len(non_null_columns[year])+len(null_columns[year])))


Year      Fully filled   Fully empty Total columns
2015               13           18           31
2016               14           17           31
2017               13           18           31
2018                9           21           30
2019               10           21           31
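
The same per-year overview can be obtained without nested loops by grouping the null mask by year. This is a sketch on a made-up miniature frame (`toy` and its values are illustrative, not taken from the dataset); it shows the idea and is not the notebook's code.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the stacked frame: two years, differently named columns
toy = pd.DataFrame({
    "year": [2015, 2015, 2018, 2018],
    "Happiness Score": [7.5, 6.1, np.nan, np.nan],
    "Score": [np.nan, np.nan, 7.6, 6.0],
    "Perceptions of corruption": [np.nan, np.nan, 0.3, np.nan],
})

# Percentage of missing values per column, one row per year
null_pct = toy.drop(columns="year").isnull().groupby(toy["year"]).mean() * 100

# Number of completely filled and completely empty columns in each year
fully_filled = (null_pct == 0).sum(axis=1)
fully_empty = (null_pct == 100).sum(axis=1)

print(null_pct)
print(fully_filled)
print(fully_empty)
```

A column that is only partly empty in a given year, like `Perceptions of corruption` in the toy 2018 rows, is counted in neither total, so the two counts need not add up to the number of columns.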


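- In 2018 one column, 'Perceptions of corruption', is about 0.6% empty. It is neither completely filled nor completely empty, which is why the 2018 row above adds up to 30 rather than 31.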

In [196]:
happiness[happiness['year']==2018].isnull().mean()*100

Overall rank                      0.000
Country or region                 0.000
Score                             0.000
GDP per capita                    0.000
Social support                    0.000
Healthy life expectancy           0.000
Freedom to make life choices      0.000
Generosity                        0.000
Perceptions of corruption         0.641
year                              0.000
Country                         100.000
Region                          100.000
Happiness Rank                  100.000
Happiness Score                 100.000
Lower Confidence Interval       100.000
Upper Confidence Interval       100.000
Economy (GDP per Capita)        100.000
Family                          100.000
Health (Life Expectancy)        100.000
Freedom                         100.000
Trust (Government Corruption)   100.000
Dystopia Residual               100.000
Happiness.Rank                  100.000
Happiness.Score                 100.000
Whisker.high                    100.000


- Let's merge the duplicated columns and drop the unnecessary ones.

In [197]:
non_null_columns[2018]

['Country or region',
 'Freedom to make life choices',
 'GDP per capita',
 'Generosity',
 'Healthy life expectancy',
 'Overall rank',
 'Score',
 'Social support',
 'year']

In [198]:
happiness.loc[happiness['year'] == 2015,'rank'] = happiness.loc[happiness['year'] == 2015,'Happiness Rank']
happiness.loc[happiness['year'] == 2016,'rank'] = happiness.loc[happiness['year'] == 2016,'Happiness Rank']
happiness.loc[happiness['year'] == 2017,'rank'] = happiness.loc[happiness['year'] == 2017,'Happiness.Rank']
happiness.loc[happiness['year'] == 2018,'rank'] = happiness.loc[happiness['year'] == 2018,'Overall rank']
happiness.loc[happiness['year'] == 2019,'rank'] = happiness.loc[happiness['year'] == 2019,'Overall rank']
happiness.drop(columns=['Happiness Rank', 'Happiness.Rank', 'Overall rank'], axis=1, inplace=True)

In [199]:
happiness.loc[happiness['year'] == 2015,'hapiness_score'] = happiness.loc[happiness['year'] == 2015,'Happiness Score']
happiness.loc[happiness['year'] == 2016,'hapiness_score'] = happiness.loc[happiness['year'] == 2016,'Happiness Score']
happiness.loc[happiness['year'] == 2017,'hapiness_score'] = happiness.loc[happiness['year'] == 2017,'Happiness.Score']
happiness.loc[happiness['year'] == 2018,'hapiness_score'] = happiness.loc[happiness['year'] == 2018,'Score']
happiness.loc[happiness['year'] == 2019,'hapiness_score'] = happiness.loc[happiness['year'] == 2019,'Score']
happiness.drop(columns=['Happiness Score', 'Happiness.Score', 'Score'], axis=1, inplace=True)

In [200]:
happiness.loc[happiness['year'] == 2015,'gdp'] = happiness.loc[happiness['year'] == 2015,'Economy (GDP per Capita)']
happiness.loc[happiness['year'] == 2016,'gdp'] = happiness.loc[happiness['year'] == 2016,'Economy (GDP per Capita)']
happiness.loc[happiness['year'] == 2017,'gdp'] = happiness.loc[happiness['year'] == 2017,'Economy..GDP.per.Capita.']
happiness.loc[happiness['year'] == 2018,'gdp'] = happiness.loc[happiness['year'] == 2018,'GDP per capita']
happiness.loc[happiness['year'] == 2019,'gdp'] = happiness.loc[happiness['year'] == 2019,'GDP per capita']
happiness.drop(columns=['Economy (GDP per Capita)', 'Economy..GDP.per.Capita.', 'GDP per capita'], axis=1, inplace=True)

- Family (2015-17) and Social support (2018-19) were merged into a single column, family_social.

In [201]:
happiness.loc[happiness['year'] == 2015,'family_social'] = happiness.loc[happiness['year'] == 2015,'Family']
happiness.loc[happiness['year'] == 2016,'family_social'] = happiness.loc[happiness['year'] == 2016,'Family']
happiness.loc[happiness['year'] == 2017,'family_social'] = happiness.loc[happiness['year'] == 2017,'Family']
happiness.loc[happiness['year'] == 2018,'family_social'] = happiness.loc[happiness['year'] == 2018,'Social support']
happiness.loc[happiness['year'] == 2019,'family_social'] = happiness.loc[happiness['year'] == 2019,'Social support']
happiness.drop(columns=['Family', 'Social support'], axis=1, inplace=True)

In [202]:
happiness.loc[happiness['year'] == 2015,'healthy_life'] = happiness.loc[happiness['year'] == 2015,'Health (Life Expectancy)']
happiness.loc[happiness['year'] == 2016,'healthy_life'] = happiness.loc[happiness['year'] == 2016,'Health (Life Expectancy)']
happiness.loc[happiness['year'] == 2017,'healthy_life'] = happiness.loc[happiness['year'] == 2017,'Health..Life.Expectancy.']
happiness.loc[happiness['year'] == 2018,'healthy_life'] = happiness.loc[happiness['year'] == 2018,'Healthy life expectancy']
happiness.loc[happiness['year'] == 2019,'healthy_life'] = happiness.loc[happiness['year'] == 2019,'Healthy life expectancy']
happiness.drop(columns=['Health (Life Expectancy)', 'Health..Life.Expectancy.', 'Healthy life expectancy'], axis=1, inplace=True)

In [203]:
happiness.loc[happiness['year'] == 2015,'freedom'] = happiness.loc[happiness['year'] == 2015,'Freedom']
happiness.loc[happiness['year'] == 2016,'freedom'] = happiness.loc[happiness['year'] == 2016,'Freedom']
happiness.loc[happiness['year'] == 2017,'freedom'] = happiness.loc[happiness['year'] == 2017,'Freedom']
happiness.loc[happiness['year'] == 2018,'freedom'] = happiness.loc[happiness['year'] == 2018,'Freedom to make life choices']
happiness.loc[happiness['year'] == 2019,'freedom'] = happiness.loc[happiness['year'] == 2019,'Freedom to make life choices']
happiness.drop(columns=['Freedom', 'Freedom to make life choices'], axis=1, inplace=True)

In [204]:
happiness.loc[happiness['year'] == 2015,'corruption'] = happiness.loc[happiness['year'] == 2015,'Trust (Government Corruption)']
happiness.loc[happiness['year'] == 2016,'corruption'] = happiness.loc[happiness['year'] == 2016,'Trust (Government Corruption)']
happiness.loc[happiness['year'] == 2017,'corruption'] = happiness.loc[happiness['year'] == 2017,'Trust..Government.Corruption.']
happiness.loc[happiness['year'] == 2018,'corruption'] = happiness.loc[happiness['year'] == 2018,'Perceptions of corruption']
happiness.loc[happiness['year'] == 2019,'corruption'] = happiness.loc[happiness['year'] == 2019,'Perceptions of corruption']
happiness.drop(columns=['Trust (Government Corruption)', 'Trust..Government.Corruption.', 'Perceptions of corruption'], axis=1, inplace=True)
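
The merge cells above all follow one pattern: each row has a value in exactly one of several differently named columns, depending on its year. Under that assumption, the pattern can be sketched as a small helper that takes the first non-null value across the source columns. The name `coalesce_columns` and the toy data are hypothetical; this is an alternative formulation, not the code the notebook ran.

```python
import numpy as np
import pandas as pd


def coalesce_columns(frame, target, sources):
    """Merge `sources` into one `target` column and drop the sources (hypothetical helper)."""
    # bfill along the row axis moves the first non-null value of each row
    # into the leftmost source column, which is then taken as the result
    merged = frame[sources].bfill(axis=1).iloc[:, 0]
    result = frame.drop(columns=sources)
    # assign by position so a repeated index (as after stacking files) is harmless
    result[target] = merged.to_numpy()
    return result


# Made-up rows: one rank per row, stored under a year-specific name
toy = pd.DataFrame({
    "year": [2015, 2017, 2018],
    "Happiness Rank": [1.0, np.nan, np.nan],
    "Happiness.Rank": [np.nan, 2.0, np.nan],
    "Overall rank": [np.nan, np.nan, 3.0],
})
toy = coalesce_columns(toy, "rank",
                       ["Happiness Rank", "Happiness.Rank", "Overall rank"])
print(toy)
```

Unlike the explicit per-year assignments, this version does not check which year a value came from, so it is only equivalent when no row has values in more than one source column.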

In [205]:
happiness['Country or region'].unique()

array(['Finland', 'Norway', 'Denmark', 'Iceland', 'Switzerland',
       'Netherlands', 'Canada', 'New Zealand', 'Sweden', 'Australia',
       'United Kingdom', 'Austria', 'Costa Rica', 'Ireland', 'Germany',
       'Belgium', 'Luxembourg', 'United States', 'Israel',
       'United Arab Emirates', 'Czech Republic', 'Malta', 'France',
       'Mexico', 'Chile', 'Taiwan', 'Panama', 'Brazil', 'Argentina',
       'Guatemala', 'Uruguay', 'Qatar', 'Saudi Arabia', 'Singapore',
       'Malaysia', 'Spain', 'Colombia', 'Trinidad & Tobago', 'Slovakia',
       'El Salvador', 'Nicaragua', 'Poland', 'Bahrain', 'Uzbekistan',
       'Kuwait', 'Thailand', 'Italy', 'Ecuador', 'Belize', 'Lithuania',
       'Slovenia', 'Romania', 'Latvia', 'Japan', 'Mauritius', 'Jamaica',
       'South Korea', 'Northern Cyprus', 'Russia', 'Kazakhstan', 'Cyprus',
       'Bolivia', 'Estonia', 'Paraguay', 'Peru', 'Kosovo', 'Moldova',
       'Turkmenistan', 'Hungary', 'Libya', 'Philippines', 'Honduras',
       'Belarus', 'Turkey

- ```Country or region``` actually holds the country, so we merged it into ```Country``` and dropped the ```Country or region``` column.

In [206]:
happiness.loc[happiness['year'] == 2018,'Country'] = happiness.loc[happiness['year'] == 2018,'Country or region']
happiness.loc[happiness['year'] == 2019,'Country'] = happiness.loc[happiness['year'] == 2019,'Country or region']
happiness.drop(columns=['Country or region'], axis=1, inplace=True)

In [207]:
happiness.isnull().mean()*100

Generosity                   0.000
year                         0.000
Country                      0.000
Region                      59.719
Lower Confidence Interval   79.923
Upper Confidence Interval   79.923
Dystopia Residual           59.719
Whisker.high                80.179
Whisker.low                 80.179
Dystopia.Residual           80.179
Standard Error              79.795
rank                         0.000
hapiness_score               0.000
gdp                          0.000
family_social                0.000
healthy_life                 0.000
freedom                      0.000
corruption                   0.128
dtype: float64

- We dropped the columns we consider unnecessary.

In [208]:
columns_to_rm = ['Lower Confidence Interval', 'Upper Confidence Interval',
                     'Dystopia Residual', 'Dystopia.Residual', 
                     'Whisker.high', 'Whisker.low', 'Standard Error']
happiness.drop(columns=columns_to_rm, axis=1, inplace=True)

- We renamed and reordered the columns.

In [209]:
rename_dict = {'Generosity': 'generosity', 'Country': 'country', 'Region': 'region' }
happiness.rename(columns=rename_dict, inplace=True)

reordered_columns = ['country', 'region', 'hapiness_score',
                     'rank', 'gdp', 'healthy_life', 'family_social', 
                     'freedom', 'corruption', 'generosity', 'year']
happiness = happiness[reordered_columns]

In [210]:
happiness.isnull().mean()*100

country           0.000
region           59.719
hapiness_score    0.000
rank              0.000
gdp               0.000
healthy_life      0.000
family_social     0.000
freedom           0.000
corruption        0.128
generosity        0.000
year              0.000
dtype: float64

- Let's look at the columns with missing values year by year.

In [211]:
columns = ['region', 'corruption']
for year in years:
    print(year)
    for column in columns:
        print('{:<18}'.format(column), end='')
        print(happiness.loc[happiness['year']==year,column].isnull().mean()*100)

2015
region            0.0
corruption        0.0
2016
region            0.0
corruption        0.0
2017
region            100.0
corruption        0.0
2018
region            100.0
corruption        0.641025641025641
2019
region            100.0
corruption        0.0


- The region values are missing for 2017-19; we can fill them by looking up each country's region in the 2015-16 rows.
- We will also fill the missing corruption value (about 0.6% of the rows) for 2018.
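
The two fills described above could look roughly like the sketch below. The data is a made-up miniature, and the imputation rule for corruption (the same country's mean over its other years) is an assumption chosen for illustration; the text above does not say which rule is used.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the cleaned frame; the numbers are illustrative only
toy = pd.DataFrame({
    "country": ["Finland", "Chile", "Finland", "Chile", "Finland", "Chile"],
    "region": ["Western Europe", "Latin America and Caribbean",
               "Western Europe", "Latin America and Caribbean",
               np.nan, np.nan],
    "corruption": [0.41, 0.11, 0.41, 0.09, 0.39, np.nan],
    "year": [2015, 2015, 2016, 2016, 2018, 2018],
})

# Build a country -> region lookup from the rows where region is known
lookup = (toy.dropna(subset=["region"])
             .drop_duplicates(subset="country")
             .set_index("country")["region"])

# Fill only the missing regions; existing values are left untouched
toy["region"] = toy["region"].fillna(toy["country"].map(lookup))

# Assumption: impute corruption with the same country's mean over other years
country_mean = toy.groupby("country")["corruption"].transform("mean")
toy["corruption"] = toy["corruption"].fillna(country_mean)

print(toy)
```

A country that does not appear in the 2015-16 rows has no entry in the lookup, so its region stays empty and would need to be filled by hand.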