<font color = 'purple'>
    
# Introduction 

<font color = "black" >

This dataset is based on the World Happiness Report. It covers the happiness levels of countries together with socio-economic variables such as GDP per capita, social support, health, freedom, and corruption.

<font  color = 'purple'>
Content:
    
1. [Load and Check Data](#1)
2. [Variable Description](#2)
   * [Univariate Variable Analysis](#3)
      * [Categorical Variable Analysis](#4)
      * [Numerical Variable Analysis](#5)
3. [Basic Data Analysis](#6)
4. [Outlier Detection](#7)
5. [Missing Values](#8)




In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
plt.style.use("seaborn-v0_8-whitegrid")

import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

from collections import Counter

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id="1"></a>

<font color = 'red'>

## Load and Check Data





In [None]:
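train_df = pd.read_csv("/kaggle/input/world-happiness-report-2024-yearly-updated/World-happiness-report-2024.csv")

# Note: test_df is read from the same file as train_df, so it is a second copy of the data, not a separate test set.
test_df = pd.read_csv("/kaggle/input/world-happiness-report-2024-yearly-updated/World-happiness-report-2024.csv")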

In [None]:
train_df.columns

In [None]:
train_df.head()

In [None]:
train_df.describe()

<font color = 'red'>
    
<a id ="2"></a>

## Variable Description
<font color = 'black'>


1. Country name: The name of the country.
2. Regional indicator: The geographical region to which the country belongs.
3. Ladder score: The happiness score for each country, based on responses to the Cantril ladder question, which asks respondents to rate their current life on a ladder from 0 (the worst possible life) to 10 (the best possible life).
4. upperwhisker: The upper bound of the 95% confidence interval around the ladder score.
5. lowerwhisker: The lower bound of the 95% confidence interval around the ladder score.
6. Log GDP per capita: The natural logarithm of the country's GDP per capita, adjusted for purchasing power parity (PPP) to account for differences in the cost of living between countries.
7. Social support: Whether respondents have relatives or friends they can count on in times of trouble, reported as a national average of survey responses.
8. Healthy life expectancy: The average number of years a newborn infant would live in good health.
9. Freedom to make life choices: The national average of survey responses measuring satisfaction with freedom to choose what to do in life.
10. Generosity: The residual obtained by regressing the national average of responses to a question about charitable donations on GDP per capita.
11. Perceptions of corruption: The national average of survey responses reflecting perceived corruption in government and business sectors.
12. Dystopia + residual: A benchmark combining the score of a hypothetical least-happy country (Dystopia) with the unexplained residual for each country, ensuring all happiness scores remain positive.
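
As a quick sanity check on the whisker columns described above, the sketch below verifies that each score lies between its lower and upper bound and computes the width of that interval. The frame `toy` and all of its values are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical values for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "Country name": ["A", "B", "C"],
    "Ladder score": [7.5, 5.0, 3.0],
    "upperwhisker": [7.75, 5.25, 3.5],
    "lowerwhisker": [7.25, 4.75, 2.5],
})

# Each score should sit inside its own [lowerwhisker, upperwhisker] interval.
inside = toy["Ladder score"].between(toy["lowerwhisker"], toy["upperwhisker"])

# Width of the interval: a wider interval means a less certain estimate.
toy["interval width"] = toy["upperwhisker"] - toy["lowerwhisker"]

print(inside.all())
print(toy[["Country name", "interval width"]])
```

The same two lines could be run on `train_df`, assuming its column names match the list above.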


In [None]:
train_df.info()

* float64(10) : Ladder score, upperwhisker, lowerwhisker, Log GDP per capita, Social support, Healthy life expectancy, Freedom to make life choices, Generosity, Perceptions of corruption, Dystopia + residual
* object(2) : Country name, Regional indicator
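
The dtype summary above can also be derived in code rather than read off the `info()` output. The following is a minimal sketch on a hypothetical three-column frame (`toy` stands in for `train_df`; its values are made up).

```python
import pandas as pd

# Hypothetical three-column frame standing in for train_df.
toy = pd.DataFrame({
    "Country name": ["A", "B"],
    "Regional indicator": ["R1", "R2"],
    "Ladder score": [7.5, 5.0],
})

# Count columns per dtype, the same grouping that info() prints at the bottom.
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# Split column names by kind instead of typing them by hand.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, text_cols)
```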

<font color = 'blue'>
    
### Univariate Variable Analysis
<font color = 'black'>
    
* Categorical Variable: Country name, Regional indicator
* Numerical Variable: Ladder score, upperwhisker, lowerwhisker, Log GDP per capita, Social support, Healthy life expectancy, Freedom to make life choices, Generosity, Perceptions of corruption, Dystopia + residual

<font color = 'green'>
    
### Categorical Variable

In [None]:
def bar_plot(variable):
    """Plot the frequency of each category in a column of train_df."""
    var = train_df[variable]
    # count how many rows fall into each category
    varValue = var.value_counts()
    # visualize
    plt.figure(figsize=(20,5))
    plt.bar(varValue.index,varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print(f"{variable}:\n{varValue}")
    

In [None]:
category = ["Regional indicator"]
for c in category:
    bar_plot(c)

* Country name is an identifier-like variable (unique for each row), so a frequency plot of it would not be meaningful.
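
Whether a column is identifier-like can be checked rather than assumed. The sketch below uses a hypothetical frame `toy` and flags columns whose number of distinct values equals the number of rows. The threshold of exactly 1.0 is an assumption made for this illustration.

```python
import pandas as pd

# Hypothetical frame: one identifier-like column, one true category.
toy = pd.DataFrame({
    "Country name": ["A", "B", "C", "D"],
    "Regional indicator": ["R1", "R1", "R2", "R2"],
})

# Ratio of distinct values to rows: 1.0 means every row has its own value.
unique_ratio = toy.nunique() / len(toy)
identifier_like = unique_ratio[unique_ratio == 1.0].index.tolist()
print(identifier_like)
```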

<font color = 'green'>
    
### Numerical Variable 

In [None]:
numerical_cols = train_df.select_dtypes(include="number").columns


In [None]:
train_df[numerical_cols].describe()


In [None]:
def plot_hist(variable):
    plt.figure(figsize=(10,5))
    plt.hist(train_df[variable], bins=50)
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title(f"Distribution of {variable}")
    plt.show()
    

In [None]:
for c in numerical_cols:
    plot_hist(c)

<font color = 'red'>

## Basic Data Analysis

<font color = 'black'>
    
* Ladder score - Log GDP per capita
* Social support - Ladder score
* Healthy life expectancy - Freedom to make life choices
* Generosity - Ladder score
* Regional indicator - Ladder score
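
Scatter plots show the shape of each relationship; a correlation coefficient puts a number on it. The sketch below computes the Pearson coefficient for two of the pairs listed above on a hypothetical frame `toy`, whose values are made up so that the example runs on its own.

```python
import pandas as pd

# Hypothetical numbers, chosen only so the example runs on its own.
toy = pd.DataFrame({
    "Ladder score": [7.0, 6.0, 5.0, 4.0],
    "Log GDP per capita": [1.8, 1.5, 1.2, 0.9],
    "Generosity": [0.1, 0.3, 0.2, 0.4],
})

pairs = [
    ("Log GDP per capita", "Ladder score"),
    ("Generosity", "Ladder score"),
]

# Series.corr defaults to the Pearson coefficient, which lies in [-1, 1].
correlations = {(x, y): toy[x].corr(toy[y]) for x, y in pairs}
for (x, y), r in correlations.items():
    print(f"{x} vs {y}: r = {r:.2f}")
```

Pearson correlation only captures linear association, so it complements the plots rather than replacing them.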
  

In [None]:
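def plot_bivariate(x, y, df=None):
    # Look up train_df at call time; a default of df=train_df is bound once, when the function is defined.
    if df is None:
        df = train_df
    plt.figure(figsize=(5,4))
    plt.scatter(df[x], df[y])
    plt.xlabel(x)
    plt.ylabel(y)
    plt.title(f"{y} vs {x}")
    plt.show()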


<font color = 'brown'>

* **Ladder score - Log GDP per capita**

In [None]:
plot_bivariate("Ladder score", "Log GDP per capita")

<font color = 'brown'>

* **Social support - Ladder score**

In [None]:
plot_bivariate("Social support", "Ladder score")

<font color = 'brown'>
    
* **Healthy life expectancy - Freedom to make life choices**

In [None]:
plot_bivariate("Healthy life expectancy", "Freedom to make life choices")

<font color = 'brown'>

* **Generosity - Ladder score**

In [None]:
plot_bivariate("Generosity", "Ladder score")

<font color = 'brown'>
    
* **Regional indicator - Ladder score**
  

In [None]:
region_happiness = (train_df
    .groupby("Regional indicator")["Ladder score"]
    .mean()
    .reset_index()
    .rename(columns={"Ladder score": "Average Ladder Score"})
    .sort_values(by="Average Ladder Score", ascending=False))
region_happiness

<font color = 'red'>

## Outlier Detection

In [None]:
def detect_outliers(df, col):
    """Return index labels of rows that are IQR outliers in more than two of the given columns."""
    outlier_indices = []
    for c in col:
        # first and third quartiles, ignoring missing values
        Q1 = np.percentile(df[c].dropna(), 25)
        Q3 = np.percentile(df[c].dropna(), 75)
        IQR = Q3 - Q1
        # fences at 1.5 * IQR beyond the quartiles
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outlier_list_col = df[(df[c] < lower_bound) | (df[c] > upper_bound)].index
        outlier_indices.extend(outlier_list_col)
    # count how many columns flagged each row
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = [i for i, v in outlier_indices.items() if v > 2]
    return multiple_outliers
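
To make the rule inside `detect_outliers` concrete, here is a minimal worked example of the 1.5 × IQR fences on a single column. The array `values` is hypothetical. Note that the function above only returns rows flagged in more than two columns, whereas this sketch shows the per-column step alone.

```python
import numpy as np

# Hypothetical sample with one obviously extreme value.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Anything outside [q1 - 1.5 * iqr, q3 + 1.5 * iqr] counts as an outlier.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(q1, q3, iqr)
print(lower_bound, upper_bound)
print(outliers)
```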


In [None]:
train_df.loc[detect_outliers(train_df,numerical_cols)]

In [None]:
# drop rows flagged as outliers in more than two columns
train_df = train_df.drop(detect_outliers(train_df,numerical_cols),axis=0).reset_index(drop=True)

<font color = 'red'>

## Missing Values
<font color = 'black'>
    
* Find Missing Values
* Fill Missing Values

In [None]:
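train_df_len = len(train_df)
# Note: test_df was read from the same CSV as train_df, so this appends a second copy of the full dataset, including rows dropped above as outliers.
train_df = pd.concat([train_df, test_df], axis=0).reset_index(drop=True)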

* **Find Missing Values**

In [None]:
train_df.columns[train_df.isnull().any()]

In [None]:
train_df.isnull().sum()

* **Fill Missing Values**
  Log GDP per capita, Social support, Healthy life expectancy, Freedom to make life choices, Generosity, Perceptions of corruption, and Dystopia + residual each have 6 missing values.
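
Median imputation can be sketched on a small example before applying it to the real columns. The frame `toy` and its values are hypothetical. Passing the Series returned by `median()` to `fillna` fills each column with its own median.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap per column.
toy = pd.DataFrame({
    "Social support": [1.0, np.nan, 3.0, 5.0],
    "Generosity": [0.5, 0.25, np.nan, 0.75],
})

# Number of missing values per column before filling.
missing_before = toy.isnull().sum()

# median() skips NaN by default; fillna with a Series matches on column name.
filled = toy.fillna(toy.median())

print(missing_before)
print(filled)
```

The median is less sensitive to extreme values than the mean, which is one common reason to prefer it for skewed columns.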

In [None]:
cols_with_missing = ["Log GDP per capita","Social support","Healthy life expectancy","Freedom to make life choices","Generosity","Perceptions of corruption","Dystopia + residual"]
for col in cols_with_missing:
    # assign the result back; calling fillna(inplace=True) on a column selection may not update train_df
    train_df[col] = train_df[col].fillna(train_df[col].median())


In [None]:
train_df[cols_with_missing].isnull().sum()
