<div style="display: flex; background-color: RGB(255,114,0);" >

# PROJECT - Bonheur World <mark>Data exploration and cleaning</mark>
</div>

<div style="display: flex; background-color: Blue; padding: 15px;" >

## 1. Mission
</div>

<div style="display: flex; background-color: Green; padding: 7px;" >

### 1.1. Project brief
</div>

Challenge goals

How accurately can we predict regional temperature anomalies based on past and neighbouring climate observations?

In this case study, we train an unsupervised machine learning algorithm to cluster countries according to features such as economic output, social support, life expectancy, freedom, absence of corruption, and generosity. The World Happiness Report assesses the state of global happiness. Happiness scores and rankings were collected by asking individuals to rate their lives from 0 (the worst possible life) to 10 (the best possible life).

- A notebook containing the data preprocessing functions and the clustering results (including graphical representations) ....
- A slide deck presenting the approach and the clustering results.

<div style="display: flex; background-color: Green; padding: 7px;" >

### 1.2. Notebook description
</div>

This notebook explores the data and performs the cleaning.

<div style="display: flex; background-color: Green; padding: 7px;" >

### 1.3. Data description
</div>



| Column | Description |
|--------|-------------|
| country_official | official country name |
| country | country name as originally received |
| year | year |
| score | happiness score |
| PIB | GDP per capita |
| Soutien | social support |
| Esperance vie BS | healthy life expectancy |
| Liberte des choix de vie | freedom to make life choices |
| Generosite | perceived generosity |
| Corruption | perceived low level of corruption |
| Regional indicator | |
| id_country | official country identifier |
| alpha3 | official alpha-3 country code |
| alpha2 | official alpha-2 country code |
| continent_code | official continent code of the country |
| latitude | country latitude |
| longitude | country longitude |
| Annual Sunshine | annual sunshine duration of the country |
| Annual Sunshine NCDC Computed | |
| deaths_rural_urban | number of deaths in urban areas |
| divorces | number of divorces |
| Precipitations in million cubic metres | precipitation volume |
| rate of Population connected to wastewater collecting system | share of the population connected to a wastewater collection system (sewerage) |
| rate of Population connected to wastewater treatment | share of the population connected to wastewater treatment |
| gini | Gini index |
| intentional homicide victims Female nb | number of female homicide victims |
| intentional homicide victims Male nb | number of male homicide victims |
| Homicide victime nb | number of homicide victims |


In [1]:
from os import getcwd
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
from tqdm import tqdm
import csv
from bonheur_bed_ara import *
from IPython.display import HTML
import plotly.express as px

<div style="display: flex; background-color: Blue; padding: 15px;" >

## 2. Data loading
</div>

In [2]:
# ---------------------------------------------------------------------------------------------
#                               MAIN
# ---------------------------------------------------------------------------------------------
verbose = False

# Get the program's working directory
file_path = getcwd() + "\\"
data_set_path = file_path + "dataset\\"
country_col_name = "country"
country_official_col_name = "country_official"
data_set_file_name = "DATA_SET_FULL.csv"

print(f"Current execution path : {file_path}")
print(f"Dataset path : {data_set_path}")

Current execution path : c:\Users\User\WORK\workspace-ia\PROJETS\projet_bonheur_bed\
Dataset path : c:\Users\User\WORK\workspace-ia\PROJETS\projet_bonheur_bed\dataset\


<div style="display: flex; background-color: Green; padding: 7px;" >

### 2.1. Loading
</div>

With default settings, pandas reads the continent code `NA` (`North America`) as NaN, so the loading step must account for this.
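
The pitfall can be reproduced on a tiny in-memory CSV. This is a minimal sketch with made-up rows (the `Atlantis` entry is purely illustrative), assuming only that empty fields should remain the sole missing-value marker:

```python
import io

import pandas as pd

# Tiny in-memory CSV (made-up rows): "NA" is a real continent code,
# while the empty field on the last row is a genuine gap.
csv_text = "country,continent_code\nCanada,NA\nFrance,EU\nAtlantis,\n"

# Default parsing: the string "NA" belongs to pandas' default missing-value list.
df_default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False disables that list; na_values=[""] keeps empty
# fields as the only missing-value marker.
df_fixed = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

print(df_default["continent_code"].isna().tolist())
print(df_fixed["continent_code"].tolist())
```

With the second call, `NA` survives as a string and only the empty field is treated as missing.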

In [3]:
# The North America continent code (NA) was being parsed as NaN instead of kept as a value
df_origin = pd.read_csv(data_set_path+data_set_file_name, quoting=csv.QUOTE_NONNUMERIC, na_values=["", np.nan], keep_default_na=False, sep=',')
df_origin = df_origin.sort_values(by=[country_official_col_name, "year"])

print(f"{df_origin.shape} données chargées ------> {list(df_origin.columns)}")
df_origin.head()

(2359, 28) données chargées ------> ['country_official', 'country', 'year', 'score', 'PIB', 'Soutien', 'Esperance vie BS', 'Liberte des choix de vie', 'Generosite', 'Corruption', 'Regional indicator', 'id_country', 'alpha3', 'alpha2', 'continent_code', 'latitude', 'longitude', 'Annual Sunshine', 'Annual Sunshine NCDC Computed', 'deaths_rural_urban', 'divorces', 'Precipitations in million cubic metres', 'rate of Population connected to wastewater collecting system', 'rate of Population connected to wastewater treatment', 'gini', 'intentional homicide victims Female nb', 'intentional homicide victims Male nb', 'Homicide victime nb']


Unnamed: 0,country_official,country,year,score,PIB,Soutien,Esperance vie BS,Liberte des choix de vie,Generosite,Corruption,...,Annual Sunshine NCDC Computed,deaths_rural_urban,divorces,Precipitations in million cubic metres,rate of Population connected to wastewater collecting system,rate of Population connected to wastewater treatment,gini,intentional homicide victims Female nb,intentional homicide victims Male nb,Homicide victime nb
443,Arab Republic of Egypt,Egypt,2005.0,5.168,9.036,0.848,59.7,0.817,,,...,3293.3,450646.0,65047.0,1300.0,,,,63.0,459.0,10962.0
828,Arab Republic of Egypt,Egypt,2007.0,5.541,9.135,0.686,59.82,0.609,-0.121,,...,3790.8,450596.0,77878.0,1300.0,,,,163.0,517.0,14280.0
1082,Arab Republic of Egypt,Egypt,2008.0,4.632,9.186,0.738,59.88,,-0.087,0.914,...,3790.8,461934.0,84430.0,,,,,110.0,856.0,20286.0
936,Arab Republic of Egypt,Egypt,2009.0,5.066,9.213,0.744,59.94,0.611,-0.1,0.801,...,3298.9,476592.0,141467.0,,,,,193.0,719.0,19152.0
961,Arab Republic of Egypt,Egypt,2010.0,4.669,9.244,0.769,60.0,0.486,-0.076,0.826,...,3666.4,483385.0,149376.0,,,,,197.0,1642.0,38619.0


<div style="display: flex; background-color: Green; padding: 7px;" >

### 2.2. Shape exploration
</div>

In [4]:
df_origin[df_origin['continent_code'].isna()]

Unnamed: 0,country_official,country,year,score,PIB,Soutien,Esperance vie BS,Liberte des choix de vie,Generosite,Corruption,...,Annual Sunshine NCDC Computed,deaths_rural_urban,divorces,Precipitations in million cubic metres,rate of Population connected to wastewater collecting system,rate of Population connected to wastewater treatment,gini,intentional homicide victims Female nb,intentional homicide victims Male nb,Homicide victime nb


In [6]:
df = df_origin.copy()

<div style="display: flex; background-color: indigo;" >

#### Continent encoding
</div>

In [5]:
# Continent encoding:
from sklearn.preprocessing import LabelEncoder

In [7]:
transformer_continent = LabelEncoder()
df['continent_encode'] = transformer_continent.fit_transform(df['continent_code'])
df = df[['country_official', 'country', 'continent_code','continent_encode','year', 'score', 'PIB', 'Soutien',
       'Esperance vie BS', 'Liberte des choix de vie', 'Generosite',
       'Corruption', 'Regional indicator', 'alpha3', 'alpha2', 'latitude', 'longitude', 'id_country', 'Annual Sunshine',
       'Annual Sunshine NCDC Computed', 'deaths_rural_urban', 'divorces',
       'Precipitations in million cubic metres',
       'rate of Population connected to wastewater collecting system',
       'rate of Population connected to wastewater treatment', 'gini',
       'intentional homicide victims Female nb',
       'intentional homicide victims Male nb', 'Homicide victime nb'
       ]]
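
To see which integer each continent code receives, the fitted encoder can be inspected. The sketch below uses a made-up list of codes rather than the real column, and the names `codes`, `encoder` and `mapping` are illustrative:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of continent codes, for illustration only.
codes = ["EU", "AF", "NA", "AS", "EU"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(codes)

# classes_ is sorted, and each label's position is its integer code.
mapping = {label: int(idx) for idx, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original codes from the integers.
decoded = encoder.inverse_transform(encoded)
```

The integers simply follow alphabetical order of the labels, so they carry no ordinal meaning. If the encoded column later feeds a distance-based clustering step, a one-hot encoding may be worth considering instead.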

<div style="display: flex; background-color: indigo;" >

#### Data typing
</div>

In [8]:
df["year"] = df["year"].astype(int)
df["id_country"] = df["id_country"].astype(int)
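
The casts above succeed only if neither column contains a missing value. As a hedged sketch on made-up values, the following shows how a plain integer cast fails on a gap and how the nullable `Int64` dtype could be used as a fallback:

```python
import numpy as np
import pandas as pd

# Made-up values: floats as loaded from the CSV, one of them missing.
years = pd.Series([2005.0, 2007.0, np.nan])

# A plain integer cast fails when a missing value is present.
try:
    years.astype(int)
    plain_cast_ok = True
except ValueError:
    plain_cast_ok = False

# The nullable "Int64" dtype keeps the gap as <NA> instead of failing.
years_nullable = years.astype("Int64")
print(plain_cast_ok, years_nullable.tolist())
```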

In [9]:
df.dtypes

country_official                                                 object
country                                                          object
continent_code                                                   object
continent_encode                                                  int32
year                                                              int32
score                                                           float64
PIB                                                             float64
Soutien                                                         float64
Esperance vie BS                                                float64
Liberte des choix de vie                                        float64
Generosite                                                      float64
Corruption                                                      float64
Regional indicator                                               object
alpha3                                                          

In [10]:
df.describe(include="all")

Unnamed: 0,country_official,country,continent_code,continent_encode,year,score,PIB,Soutien,Esperance vie BS,Liberte des choix de vie,...,Annual Sunshine NCDC Computed,deaths_rural_urban,divorces,Precipitations in million cubic metres,rate of Population connected to wastewater collecting system,rate of Population connected to wastewater treatment,gini,intentional homicide victims Female nb,intentional homicide victims Male nb,Homicide victime nb
count,2359,2359,2359,2359.0,2359.0,2359.0,2265.0,2274.0,2242.0,2256.0,...,1470.0,1035.0,946.0,534.0,461.0,386.0,841.0,1063.0,1057.0,1055.0
unique,164,190,6,,,,,,,,...,,,,,,,,,,
top,Arab Republic of Egypt,Egypt,AS,,,,,,,,...,,,,,,,,,,
freq,17,17,707,,,,,,,,...,,,,,,,,,,
mean,,,,1.518016,2014.507842,5.442263,8.84142,0.817168,59.3323,0.730743,...,2363.185646,362451.8,72895.47,790954.9,66.35536,67.537196,36.906421,599.159925,2555.536424,66352.91
std,,,,1.384665,4.768411,1.121303,2.249192,0.137225,17.045576,0.151687,...,672.81209,1180071.0,267807.8,2366401.0,25.82858,33.103334,7.890536,2207.075003,7445.275762,190168.4
min,,,,0.0,2005.0,2.375,0.0,0.0,0.0,0.0,...,8.7,1216.0,42.0,15.4,4.9,0.0,24.0,0.0,0.0,0.0
25%,,,,0.0,2011.0,4.6245,8.167,0.746,56.4,0.636,...,1820.4,28788.5,5162.25,32515.7,51.040001,46.625,30.9,22.0,45.0,1407.0
50%,,,,1.0,2015.0,5.383,9.37,0.838,64.8,0.7505,...,2438.6,80876.0,13109.5,101055.2,69.800003,80.299999,35.4,58.0,162.0,5124.0
75%,,,,2.0,2019.0,6.2565,10.263,0.909,68.2,0.853,...,2892.85,257900.0,35836.25,365961.4,87.0,97.0,42.0,202.0,752.0,20223.0


In [16]:
df_clean = df.copy()
df_clean = df_clean.drop(['Annual Sunshine', 'intentional homicide victims Male nb', 'intentional homicide victims Female nb'], axis=1)
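
The notebook does not state the criterion used to drop these columns. One way to inform such a decision is to look at the share of missing values per column; the sketch below does so on a made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for df; the values are illustrative.
toy = pd.DataFrame({
    "score": [5.1, 4.6, 6.2, 5.0],
    "gini": [np.nan, 35.4, np.nan, np.nan],
    "divorces": [65047.0, np.nan, 84430.0, 141467.0],
})

# Share of missing values per column, most sparse first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```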

In [150]:
def complete_df_with_water_data(df_clean, verbose=0):
    """Fill trailing gaps in the wastewater columns with each country's latest known value.

    Only rows whose year is later than the country's last observation are filled;
    gaps before or between observations are left as NaN.
    """
    col_names = ['rate of Population connected to wastewater collecting system', 'rate of Population connected to wastewater treatment']
    df_completed = df_clean.copy()

    for i, col_name in enumerate(col_names):
        year_col_name = "year_max_" + str(i)
        # Temporary DataFrame holding the latest known value per country
        df_temp_max = df_completed[df_completed[col_name].notna()]
        df_temp_max = df_temp_max.sort_values(by=[country_official_col_name, "year"], ascending=False)
        df_temp_max = df_temp_max[[country_official_col_name, "year", col_name]]
        df_temp_max = df_temp_max.rename(columns={"year": year_col_name, col_name: col_name + "_max"})
        # Keep only the most recent year for each country
        df_temp_max = df_temp_max.drop_duplicates(subset=[country_official_col_name], keep="first")

        # Merge both DataFrames so the missing values can be filled in
        df_completed = df_completed.merge(df_temp_max, on=country_official_col_name, how="left", indicator=False)
        mask = df_completed[col_name].isna() & (df_completed["year"] > df_completed[year_col_name])
        df_completed.loc[mask, col_name] = df_completed.loc[mask, col_name + "_max"]

        if verbose:
            print("SHAPE df_clean : => ", df_clean.shape)
            print("SHAPE df_temp_max : => ", df_temp_max.shape)
        if verbose > 1:
            # Replaces the original debug comparison, which compared Series with different indexes
            print(f"{col_name} : {int(mask.sum())} rows filled")

        # Drop the helper columns added by the merge
        df_completed = df_completed.drop(columns=[year_col_name, col_name + "_max"])
        if verbose:
            print(f"{col_name} : {df_clean[col_name].isna().sum()} NA BEFORE => {df_completed[col_name].isna().sum()} NA AFTER")
    return df_completed
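
The fill rule used by this function can be illustrated on a single made-up country. The sketch below is not the function itself but a compact variant of the same idea, with an illustrative `rate` column: only rows later than the last observation receive the last known value.

```python
import numpy as np
import pandas as pd

# Made-up data: one country, observed in 2010 and 2012 only.
toy = pd.DataFrame({
    "country_official": ["A"] * 5,
    "year": [2009, 2010, 2011, 2012, 2013],
    "rate": [np.nan, 50.0, np.nan, 60.0, np.nan],
})

# Latest observed row per country.
known = toy.dropna(subset=["rate"]).sort_values("year")
last = known.groupby("country_official").tail(1).set_index("country_official")

# Map the last observed year and value back onto every row of that country.
last_year = toy["country_official"].map(last["year"])
last_value = toy["country_official"].map(last["rate"])

# Fill only rows that are missing AND later than the last observation.
mask = toy["rate"].isna() & (toy["year"] > last_year)
filled = toy["rate"].mask(mask, last_value)
print(filled.tolist())
```

Here 2013 receives 60.0, while 2009 and 2011 stay missing. This differs from a plain forward fill, which would also fill the 2011 gap with the 2010 value.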


In [152]:
df_completed = complete_df_with_water_data(df_clean, verbose=verbose)
df_completed.head()

SHAPE df_clean : =>  (2359, 26)
SHAPE df_temp_max : =>  (76, 3)
rate of Population connected to wastewater collecting system : 1898 NA BEFORE => 1279 NA AFTER
SHAPE df_clean : =>  (2359, 26)
SHAPE df_temp_max : =>  (66, 3)
rate of Population connected to wastewater treatment : 1973 NA BEFORE => 1453 NA AFTER


Unnamed: 0,country_official,country,continent_code,continent_encode,year,score,PIB,Soutien,Esperance vie BS,Liberte des choix de vie,...,longitude,id_country,Annual Sunshine NCDC Computed,deaths_rural_urban,divorces,Precipitations in million cubic metres,rate of Population connected to wastewater collecting system,rate of Population connected to wastewater treatment,gini,Homicide victime nb
0,Arab Republic of Egypt,Egypt,AF,0,2005,5.168,9.036,0.848,59.7,0.817,...,119.525419,818,3293.3,450646.0,65047.0,1300.0,,,,10962.0
1,Arab Republic of Egypt,Egypt,AF,0,2007,5.541,9.135,0.686,59.82,0.609,...,119.525419,818,3790.8,450596.0,77878.0,1300.0,,,,14280.0
2,Arab Republic of Egypt,Egypt,AF,0,2008,4.632,9.186,0.738,59.88,,...,119.525419,818,3790.8,461934.0,84430.0,,,,,20286.0
3,Arab Republic of Egypt,Egypt,AF,0,2009,5.066,9.213,0.744,59.94,0.611,...,119.525419,818,3298.9,476592.0,141467.0,,,,,19152.0
4,Arab Republic of Egypt,Egypt,AF,0,2010,4.669,9.244,0.769,60.0,0.486,...,119.525419,818,3666.4,483385.0,149376.0,,,,,38619.0
