# World Happiness Report

### Problem Statement:
### Context

>The World Happiness Report is a landmark survey of the state of global happiness. The first report was published in 2012, the second in 2013, the third in 2015, and the fourth in the 2016 Update. The World Happiness 2017, which ranks 155 countries by their happiness levels, was released at the United Nations at an event celebrating International Day of Happiness on March 20th. The report continues to gain global recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields – economics, psychology, survey analysis, national statistics, health, public policy and more – describe how measurements of well-being can be used effectively to assess the progress of nations. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness.

### Content

>The happiness scores and rankings use data from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder with the best possible life for them being a 10 and the worst possible life being a 0 and to rate their own current lives on that scale. The scores are from nationally representative samples for the years 2013-2016 and use the Gallup weights to make the estimates representative. The columns following the happiness score estimate the extent to which each of six factors – 

1. Economic production
2. Social support
3. Life expectancy
4. Freedom
5. Absence of corruption
6. Generosity

>contribute to making life evaluations higher in each country than they are in Dystopia, a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors. They have no impact on the total score reported for each country, but they do explain why some countries rank higher than others.

### Inspiration

>What countries or regions rank the highest in overall happiness and each of the six factors contributing to happiness? How did country ranks or scores change between the 2015 and 2016 as well as the 2016 and 2017 reports? Did any country experience a significant increase or decrease in happiness?

### What is Dystopia?

>Dystopia is an imaginary country that has the world’s least-happy people. The purpose in establishing Dystopia is to have a benchmark against which all countries can be favorably compared (no country performs more poorly than Dystopia) in terms of each of the six key variables, thus allowing each sub-bar to be of positive width. The lowest scores observed for the six key variables, therefore, characterize Dystopia. Since life would be very unpleasant in a country with the world’s lowest incomes, lowest life expectancy, lowest generosity, most corruption, least freedom and least social support, it is referred to as “Dystopia,” in contrast to Utopia.

### What are the residuals?

>The residuals, or unexplained components, differ for each country, reflecting the extent to which the six variables either over- or under-explain average 2014-2016 life evaluations. These residuals have an average value of approximately zero over the whole set of countries. Figure 2.2 shows the average residual for each country when the equation in Table 2.1 is applied to average 2014-2016 data for the six variables in that country. We combine these residuals with the estimate for life evaluations in Dystopia so that the combined bar will always have positive values. As can be seen in Figure 2.2, although some life evaluation residuals are quite large, occasionally exceeding one point on the scale from 0 to 10, they are always much smaller than the calculated value in Dystopia, where the average life is rated at 1.85 on the 0 to 10 scale.

### What do the columns after the Happiness Score (like Family, Generosity, etc.) describe?

>The columns that follow, namely GDP per Capita, Family, Life Expectancy, Freedom, Generosity and Trust (Government Corruption), describe the extent to which each factor contributes to the happiness evaluation in each country.
>The Dystopia Residual metric is the Dystopia happiness score (1.85) plus the residual, or unexplained value, for each country, as stated in the previous answer.

>If you add all these factors up, you get the happiness score, so using them to predict the Happiness Score may be unreliable: the model would largely be recovering a sum it was given.
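
The additive relationship described above can be checked directly. The sketch below is only an illustration: `max_additivity_gap` is a hypothetical helper, and the toy frame holds made-up values, not rows from the dataset.

```python
import pandas as pd


def max_additivity_gap(frame, score_col, component_cols):
    """Largest absolute difference between the score and the sum of its components."""
    reconstructed = frame[component_cols].sum(axis=1)
    return (frame[score_col] - reconstructed).abs().max()


# Made-up illustrative values, NOT taken from the happiness dataset
toy = pd.DataFrame({
    "gdp": [1.0, 0.5],
    "family": [1.2, 0.75],
    "dystopia_residual": [2.5, 1.25],
})
component_cols = ["gdp", "family", "dystopia_residual"]
toy["score"] = toy[component_cols].sum(axis=1)

gap = max_additivity_gap(toy, "score", component_cols)
print(f"Largest gap between score and summed components: {gap}")
```

On the real data frame, the same helper could be called with `"Happiness Score"` as the score column and the factor columns plus `"Dystopia Residual"` as components. A gap close to zero would confirm that the score is a plain sum of those columns.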

### Importing required libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Read from the raw-file URL. The github.com/.../blob/... address returns an HTML page,
# which pandas cannot parse as CSV.
df = pd.read_csv("https://raw.githubusercontent.com/dsrscientist/DSData/master/happiness_score_dataset.csv")


In [None]:
df

## EDA:

In [None]:
df.columns

In [None]:
df.isna().sum()

There are no null values.

In [None]:
cols = df.columns
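# select_dtypes is the public way to pick numeric columns (avoids the private _get_numeric_data)
num_cols = df.select_dtypes(include="number").columns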

catagorical_data =list(set(cols) - set(num_cols))
catagorical_data

### Encoding the data with LabelEncoder: converting all categorical columns to numeric values

In [None]:
from sklearn.preprocessing import LabelEncoder
LE= LabelEncoder()


for i in catagorical_data:
    df[i]=LE.fit_transform(df[i])
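
Because the loop above refits a single encoder for every column, the mapping from integer codes back to the original labels is lost after the last column. A minimal sketch of one way to keep it, assuming one encoder per column stored in a hypothetical `encoders` dict, and using a made-up toy frame rather than the real data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up toy frame standing in for the real data
toy = pd.DataFrame({"Region": ["B", "A", "B", "C"], "value": [1.0, 2.0, 3.0, 4.0]})

encoders = {}  # hypothetical helper: column name -> fitted LabelEncoder
for col in ["Region"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ is sorted, so integer code k corresponds to classes_[k]
mapping = dict(enumerate(encoders["Region"].classes_))
decoded = encoders["Region"].inverse_transform(toy["Region"])
print(mapping)
print(list(decoded))
```

Keeping the fitted encoders makes it possible to translate encoded values in plots or predictions back to readable region names.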

In [None]:
df.info()

### Let's look at the statistical summary of the dataset.

In [None]:
df.describe()

### Key observation:

1. Happiness Rank
2. Country

>Every row has a unique value in these columns, so they will not help with prediction.

1. Happiness Score
2. Standard Error
3. Family
4. Health (Life Expectancy)
5. Freedom
6. Trust
7. Generosity
8. Dystopia Residual

>For these continuous columns, the mean and the 50th percentile are nearly equal, and the gap between the 75th percentile and the maximum is small. This suggests low skewness and few outliers.

In [None]:
df.drop(["Happiness Rank","Country"], axis= 1, inplace = True)

### Skewness Identification


In [None]:
plt.figure(figsize=(20,5))
collist = df.columns.values
# One KDE plot per column, titled with that column's skewness
for i, col in enumerate(collist):
    plt.subplot(2,5,i+1)
    sns.kdeplot(df[col], color = "purple")
    plt.title(f"Skewness = {round(df[col].skew(),5)}",fontsize=15)
# Adjust the layout once, after all subplots are drawn
plt.tight_layout()
plt.show()

Let's see the distribution of the data.

In [None]:
df.hist(edgecolor="red",linewidth= 1.5, figsize= (20,10))
plt.show()

In [None]:
# Skewness of each column, sorted from most positive to most negative
df_skewness= pd.DataFrame({"Feature_names": collist,"Skew": df.skew().values})
df_skewness= df_skewness.sort_values(by="Skew", ascending=False, ignore_index= True)

# Columns whose skewness exceeds the 0.49 cutoff in either direction
skew_postive_row= df_skewness.loc[df_skewness["Skew"] > 0.49, "Feature_names"].tolist()
skew_negative_row= df_skewness.loc[df_skewness["Skew"] < -0.49, "Feature_names"].tolist()

df_skewness

In [None]:
print("\n\nFeatures with skewness beyond +/-0.49:\n","\n\nPositively skewed data:\n", skew_postive_row,"\n\nNegatively skewed data:\n", skew_negative_row)

In [None]:
# Work on a copy so the untransformed frame is not modified in place
DF=df.copy()

from scipy.stats import yeojohnson

# Apply the Yeo-Johnson transform to every column flagged as skewed
for i in skew_postive_row + skew_negative_row:
    DF[i]= yeojohnson(DF[i])[0]

print("BELOW GRAPH WILL SHOW THE SKEWNESS OF THE DATA")
plt.figure(figsize=(15,15))
for i in range (0, len(collist)):
    plt.subplot(8,4,i+1)
    plt.title(f"Skewness = {round(DF[collist[i]].skew(),5)}",fontsize=20)
    # histplot with kde=True replaces the deprecated distplot
    sns.histplot(DF[collist[i]], kde=True, stat="density", color = "#f80424")
plt.tight_layout()

### We have reduced the skewness of the data with the Yeo-Johnson method. Next we look for outliers, and then at the relationship of the features with the target variable.
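
To see what the transform does in isolation, here is a small sketch on synthetic right-skewed data (an exponential sample, not the happiness data). It uses `scipy.stats.skew`, which by default uses a slightly different estimator than the pandas `.skew()` method, so the numbers are only indicative.

```python
import numpy as np
from scipy.stats import skew, yeojohnson

# Synthetic right-skewed sample; the seed and size are arbitrary choices
rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=500)

# yeojohnson returns the transformed values and the fitted lambda
transformed, lam = yeojohnson(sample)

skew_before = skew(sample)
skew_after = skew(transformed)
print(f"lambda = {lam:.3f}")
print(f"skewness before = {skew_before:.3f}, after = {skew_after:.3f}")
```

The fitted lambda is chosen by maximum likelihood, so the transformed sample should be much closer to symmetric than the input.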

In [None]:
df = DF
plt.figure(figsize=(20,50))
collist = df.columns.values
for i in range (0, len(collist)):
    plt.subplot(15,4,i+1)
    # Pass the column as x explicitly so the box is drawn horizontally
    ax=sns.boxplot(x=df[collist[i]], color = "#fb0a29" , orient = "h")
    ax.set_facecolor("#fec1c9")
plt.tight_layout()

Some outliers exist in the data. Let's quantify them.

In [None]:
df1=df

In [None]:
from scipy.stats import zscore

# Absolute z-score of every value; a row is kept only if all its values are below the threshold
z= np.abs(zscore(df1))
threshold= 3
df_new = df1[(z < threshold).all(axis=1)]


In [None]:
print(f"Original data {df1.shape}\nAfter removing outliers {df_new.shape}\nPercentage of data lost {(df1.shape[0] - df_new.shape[0]) / df1.shape[0] * 100:.2f}%")

### Key observation:
    Outliers exist in the data. Removing them costs very little data, so we can drop them.
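
The z-score rule used above can be sketched on synthetic data with one planted outlier. The toy frame, the seed and the planted value are all made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Synthetic frame: two roughly normal columns plus one planted extreme value
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
toy.loc[0, "a"] = 15.0  # planted outlier

# Column-wise absolute z-scores; keep rows where every value is within 3 standard deviations
z = np.abs(zscore(toy.to_numpy()))
kept = toy[(z < 3).all(axis=1)]

loss_pct = (len(toy) - len(kept)) / len(toy) * 100
print(f"Rows before: {len(toy)}, after: {len(kept)}, lost: {loss_pct:.2f}%")
```

Computing the loss from the two row counts, rather than typing the counts by hand, keeps the reported percentage correct if the data or threshold changes.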

In [None]:
df=df_new

### Let's look at the correlations in the dataset.

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(),annot = True, cmap = "coolwarm")
plt.show()

### Key observation:
    Above we can see that the happiness score has a positive correlation with most of the variables, and only Region has a negative correlation.
    This means higher values of factors like GDP per capita, Family, Freedom and life expectancy are associated with a higher happiness score.
    Let's plot the correlation with the target variable alone.
    

In [None]:
plt.figure(figsize=(15,7))
df.corr()["Happiness Score"].sort_values(ascending=False).drop(["Happiness Score"]).plot.bar()
plt.xlabel("Feature", fontsize= 14)
plt.ylabel("Correlation with target column", fontsize = 18)
plt.title("Correlation of features with the target column", fontsize=25)
plt.show()

Above we can see the feature variables ranked by their correlation with the target variable.

In [None]:
orange ="#60010d"
green ="#980216"
grey ="#fc6a7d"
yellow ="#fecad1"
blue ="#fc8292"

In [None]:
target_variable_1 = "Happiness Score"
# Each feature listed once (the original list repeated several names)
feature_variable = ['Region',"Standard Error","Economy (GDP per Capita)","Family","Health (Life Expectancy)","Freedom",
                    "Trust (Government Corruption)","Generosity","Dystopia Residual"]


def num_plots(feature_name):
    """Draw a box plot, a histogram with KDE, and a scatter plot against the target."""
    fig, axs = plt.subplots(1, 3, figsize=(15, 2))
    a1=sns.boxplot(x=df[feature_name], ax=axs[0], color=blue)
    a1.set_facecolor(yellow)
    # histplot with kde=True replaces the deprecated distplot
    a2=sns.histplot(df[feature_name], bins=20, kde=True, ax=axs[1],color=orange)
    a2.set_facecolor(grey)
    a3=sns.scatterplot(data=df, x=feature_name, y=target_variable_1, ax=axs[2], color="k")
    a3.set_facecolor(blue)
    plt.show()

for i in feature_variable:
    num_plots(i)

### Key observation:
         The scatter plots are consistent with the correlations: the higher a feature's correlation with the target, the tighter the spread of the points.

In [None]:
sns.pairplot(df)

### Key observation:
    1. We have only 1 categorical variable, "Region", with 10 regions in total.
    2. Most of the features have a high positive correlation with the target. The points mostly spread diagonally from lower left to upper right.

### Scaling is not required because the feature values are all on a small scale. Let's split the data and train our model.

In [None]:
x_1=df.drop(["Happiness Score"], axis = 1)
y_1=df["Happiness Score"]

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import r2_score, mean_absolute_error,mean_squared_error

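# Search for the split seed that gives the highest test R2.
# Note: the metric is R2, not accuracy, and choosing the seed by its test score
# makes the reported figure optimistic.
accu = -np.inf
best_rstate = 0
for i in range(0,1000):
    x_train, x_test, y_train, y_test = train_test_split(x_1,y_1,test_size = .25, random_state = i)
    mod = LinearRegression()
    mod.fit(x_train,y_train)
    y_pred = mod.predict(x_test)
    tempacc = r2_score(y_test,y_pred)
    if tempacc> accu:
        accu= tempacc
        best_rstate=i

print(f"Best R2 score {accu*100} found on random state {best_rstate}")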
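
Selecting the seed with the highest test score tends to overstate how well the model generalises. As an alternative (a different technique from the one used in this notebook), the spread of scores across seeds can be reported instead. The sketch below uses synthetic data with arbitrary coefficients and noise, purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem; coefficients and noise level are arbitrary
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=150)

# Collect the test R2 for many different splits
scores = []
for seed in range(50):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    scores.append(r2_score(y_te, model.predict(X_te)))
scores = np.array(scores)

print(f"mean R2 = {scores.mean():.3f}, std = {scores.std():.3f}, best = {scores.max():.3f}")
```

The gap between the best score and the mean gives a rough sense of how much of a single "best seed" result comes from a lucky split.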
        

In [None]:
x_train_1, x_test_1, y_train_1, y_test_1 = train_test_split(x_1,y_1,test_size = .25, random_state = best_rstate)

In [None]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor

### Let's shortlist promising regression models.

In [None]:
models = [LinearRegression(), Lasso(), Ridge(alpha=1, random_state=42), ElasticNet(), SVR(), KNeighborsRegressor(), DecisionTreeRegressor(), AdaBoostRegressor(random_state=42), RandomForestRegressor(random_state=42)]

model_names = ["LinearRegression", "Lasso", "Ridge", "ElasticNet", "SVR", "KNeighborsRegressor", "DecisionTreeRegressor", "AdaBoostRegressor", "RandomForestRegressor"]

In [None]:
score= []        # R2 on the training split
mean_abs_e=[]
mean_sqr_e=[]
root_mean_e=[]
r2=[]            # R2 on the test split

for m in models:
    m.fit(x_train_1,y_train_1)
    train_score = m.score(x_train_1,y_train_1)
    predm=m.predict(x_test_1)
    # Compute each test metric once and reuse it for printing and storing
    mae = mean_absolute_error(y_test_1,predm)
    mse = mean_squared_error(y_test_1,predm)
    rmse = np.sqrt(mse)
    r2_test = r2_score(y_test_1,predm)

    print("Score of", m, "is:", train_score)
    print("\nERROR:")
    print("MEAN ABSOLUTE ERROR: ", mae)
    print("MEAN SQUARED ERROR: ", mse)
    print("ROOT MEAN SQUARED ERROR :", rmse)
    print("R2 SCORE: ", r2_test)
    print("**********************************************************************************************************")
    print('\n\n')

    score.append(train_score)
    mean_abs_e.append(mae)
    mean_sqr_e.append(mse)
    root_mean_e.append(rmse)
    r2.append(r2_test)

In [None]:
mean_score= []
STD=[]
for m in models:
    CV=cross_val_score(m,x_1,y_1,cv=5,scoring="r2")
    print("SCORE OF",m,"IS:")
    print("SCORE IS:", CV)
    print("MEAN OF SCORE is :", CV.mean())
    mean_score.append(CV.mean())
    print("Standard Deviation :", CV.std())
    STD.append(CV.std())
    print("**************************************************************************************************")
    print("\n\n")

In [None]:
# "MAE" holds the mean absolute error (previously mislabelled "MBE")
Regression_result = pd.DataFrame({"MODEL": model_names,
                                  "SCORE": score,
                                  "CV_mean_score": mean_score,
                                  "CV_STD": STD,
                                  "MAE": mean_abs_e,
                                  "MSE": mean_sqr_e,
                                  "RMSE": root_mean_e,
                                  "R2":r2 
                                 })
Regression_result.sort_values(by="CV_mean_score", ascending=False)



In [None]:
metrics_list = ["SCORE", "CV_mean_score", "CV_STD", "MAE", "MSE", "RMSE", "R2"]

for metric in metrics_list:
    Regression_result.sort_values(by=metric).plot.bar("MODEL", metric, color = orange)
    plt.title(f"MODEL by {metric}")
    plt.show()

Based on the results above, Linear Regression is the best model, with a model score (R2 on the training split) of 99% and a good CV score of 84%. This near-perfect fit is expected, because the factor columns add up to the happiness score. We therefore train the final model with linear regression and save it without hyperparameter tuning.
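
The very high score is what one would expect when the target is a sum of the features. A minimal sketch on synthetic data (random stand-in components, not the real columns) shows a linear model recovering coefficients of about 1 and an R2 of about 1 in that situation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for the factor columns; the target is exactly their sum
rng = np.random.default_rng(0)
components = rng.uniform(0.0, 1.5, size=(100, 4))
target = components.sum(axis=1)

model = LinearRegression().fit(components, target)
r2 = model.score(components, target)

print("coefficients:", np.round(model.coef_, 6))
print("intercept:", round(float(model.intercept_), 6))
print("R2:", round(r2, 6))
```

In the notebook the fit is not exactly perfect, since some columns were transformed and one feature is an encoded category, but the same additive structure drives the result.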

In [None]:
lr = LinearRegression()
lr.fit(x_train_1,y_train_1)

In [None]:
pred = lr.predict(x_test_1)

In [None]:
lr.score(x_train_1,y_train_1)

### Saving our model and predictions

In [None]:
import joblib
joblib.dump(lr,"happiness_score.obj")
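
A saved model can be restored with `joblib.load`. The sketch below round-trips a small model fitted on synthetic data through a temporary directory. The file name mirrors the one used above, while the data and the temporary path are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Small synthetic model standing in for the fitted `lr`
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = X.sum(axis=1)
fitted = LinearRegression().fit(X, y)

# Save to a temporary location, then load it back
path = os.path.join(tempfile.mkdtemp(), "happiness_score.obj")
joblib.dump(fitted, path)
restored = joblib.load(path)

# The restored model should give the same predictions as the original
same = np.allclose(fitted.predict(X), restored.predict(X))
print("Predictions match after reload:", same)
```

A loaded model should be used with the same library versions and the same preprocessing (encoding and transforms) that were applied before training.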

In [None]:
pred_Happiness_Score = pd.DataFrame({"pred_Happiness_Score":pred})
pred_Happiness_Score.head()

In [None]:
pred_Happiness_Score.to_csv("pred_Happiness_Score.csv")

### Conclusion:
>This analysis covers the happiness scores and ranks of 158 countries. It helps us understand which factors increase the happiness score, and how GDP per capita and trust in government contribute to quality of living.
    