# About the Dataset

### World Happiness Report 2017 data, as introduced on the report's website:

The first World Happiness Report was published in April, 2012, in support of the UN High Level Meeting on happiness and well-being. Since then the world has come a long way. Increasingly, happiness is considered to be the proper measure of social progress and the goal of public policy. In June 2016 the OECD committed itself “to redefine the growth narrative to put people’s well-being at the center of governments’ efforts”. In February 2017, the United Arab Emirates held a full-day World Happiness meeting, as part of the World Government Summit. Now on World Happiness Day, March 20th, we launch the World Happiness Report 2017, once again back at the United Nations, again published by the Sustainable Development Solutions Network, and now supported by a generous three-year grant from the Ernesto Illy Foundation. Some highlights are as follows.

Source: Helliwell, J., Layard, R., & Sachs, J. (2017). World Happiness Report 2017, New York: Sustainable Development Solutions Network.

##### Dictionary:

![Dictionary](img/DictionaryData.png)

##### Labeled visualizations comparing happiness data across countries on various factors:

![data1-image](img/Data1_Happiness.png)


![data2-image](img/Data2_Happiness.png)


![data3-image](img/Data3_Happiness.png)





### Review of Regression and Correlation [ To be Completed ]

What is Linear Regression?

Why use Linear Regression?

What is Correlation?

Explain Regression with formulas

- Cost function
- Gradient descent step
- Optimal Coefficients
- Predicting the value
- Accuracy Score
- Plotting
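
For reference, the standard textbook forms of the regression items listed above can be written as follows. The notation ($\theta$ for the coefficient vector, $\alpha$ for the learning rate, $m$ for the number of samples, $X$ for the feature matrix with a leading column of ones) is an assumption introduced here for illustration and does not come from the report or the notebook.

```latex
\begin{align}
J(\theta) &= \frac{1}{2m}\sum_{i=1}^{m}\left(\theta^{\top}x^{(i)} - y^{(i)}\right)^{2}
  && \text{(cost function)} \\
\theta &\leftarrow \theta - \frac{\alpha}{m}\,X^{\top}\left(X\theta - y\right)
  && \text{(gradient descent step)} \\
\hat{\theta} &= \left(X^{\top}X\right)^{-1}X^{\top}y
  && \text{(optimal coefficients, normal equation)} \\
\hat{y} &= \hat{\theta}^{\top}x
  && \text{(predicting the value)} \\
R^{2} &= 1 - \frac{\sum_{i}\left(y^{(i)} - \hat{y}^{(i)}\right)^{2}}{\sum_{i}\left(y^{(i)} - \bar{y}\right)^{2}}
  && \text{(score used for regression)}
\end{align}
```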

Explain Correlation with formula and graphs

- Correlation Formula
- Correlation Coefficient
- Correlation values and the meaning of a positive or negative correlation

`scipy.stats.pearsonr` calculates a Pearson correlation coefficient and the p-value for testing non-correlation.

The Pearson correlation coefficient measures the linear relationship between two datasets. The coefficient itself can be computed for any pair of numeric datasets, but the p-value calculation assumes that each dataset is normally distributed. Like other correlation coefficients, this one varies between -1 and +1, with 0 implying no linear correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.

The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.
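
To make the sign and range of the coefficient concrete, here is a minimal sketch that computes Pearson's r from its deviation form using NumPy only. The function name `pearson_r` and the sample values are assumptions made for illustration; they are not part of the notebook or the dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Illustrative values only, not taken from the happiness dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
rising = 2.0 * x + 1.0                       # exact increasing line
falling = -0.5 * x + 4.0                     # exact decreasing line
noisy = np.array([2.0, 1.0, 4.0, 3.0, 5.0])  # increasing trend with scatter

print("rising :", pearson_r(x, rising))      # close to +1
print("falling:", pearson_r(x, falling))     # close to -1
print("noisy  :", pearson_r(x, noisy))       # between 0 and +1
```

`scipy.stats.pearsonr` returns the same coefficient together with the p-value described above.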



# Explanation of the Code






### Import important libraries 

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy as sp
import scipy.stats  # ensures sp.stats is loaded for the Pearson correlation used later
# SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression as LR
from sklearn.model_selection import train_test_split

### Read the dataset

In [9]:
# data=pd.read_excel("World Happiness Data.xlsx")
data=pd.read_csv("World Happiness Data.csv")
# Keep only the 2016 rows; reset_index() moves the old row labels into a new "index" column
data1=data[data.year==2016]
data1=data1.reset_index()

In [10]:
data1.head()

| | index | WP5 Country | country | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect | Confidence in national government | Democratic Quality | Delivery Quality | Standard deviation of ladder by country-year | Standard deviation/Mean of ladder by country-year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | Afghanistan | Afghanistan | 2016 | 4.220169 | 7.497288 | 0.559072 | 49.871265 | 0.522566 | 0.057393 | 0.793246 | 0.564953 | 0.348332 | 0.32499 | NaN | NaN | 1.796219 | 0.425627 |
| 1 | 17 | Albania | Albania | 2016 | 4.511101 | 9.2823 | 0.638411 | 68.69838 | 0.729819 | -0.017927 | 0.901071 | 0.675244 | 0.321706 | 0.40091 | NaN | NaN | 2.646668 | 0.586701 |
| 2 | 22 | Algeria | Algeria | 2016 | 5.388171 | 9.549138 | 0.74815 | 64.829948 | NaN | NaN | NaN | 0.668838 | 0.371372 | NaN | NaN | NaN | 2.109472 | 0.391501 |
| 3 | 37 | Argentina | Argentina | 2016 | 6.427221 | NaN | 0.882819 | 67.443993 | 0.847702 | NaN | 0.850924 | 0.841907 | 0.311646 | 0.419562 | NaN | NaN | 2.127109 | 0.330953 |
| 4 | 48 | Armenia | Armenia | 2016 | 4.325472 | 8.989569 | 0.709218 | 65.40947 | 0.610987 | -0.155814 | 0.921421 | 0.5936 | 0.437228 | 0.184713 | NaN | NaN | 2.126364 | 0.491591 |


### Preprocessing the Data

In [11]:
# Drop bookkeeping and identifier columns that are not model features
data1=data1.drop(["index","WP5 Country","country","year"],axis=1)
# Target: the happiness score ("Life Ladder")
Y=data1["Life Ladder"].values
# Features: every remaining column except the two that have no values in the rows shown above
X=data1.drop(["Life Ladder","Democratic Quality","Delivery Quality"],axis=1).values
# Impute missing values: each NaN is replaced with the mean of its column
imp=SimpleImputer(missing_values=np.nan,strategy="mean")
X=imp.fit_transform(X)
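
As a rough illustration of what mean imputation does to a feature matrix, the sketch below fills each missing entry with the mean of its own column using NumPy only. The toy array `X_demo` is made up for this example and is not taken from the dataset.

```python
import numpy as np

# Toy matrix with missing entries (illustrative values only)
X_demo = np.array([[1.0, np.nan],
                   [3.0, 4.0],
                   [np.nan, 8.0]])

# Per-column mean, ignoring NaN entries
col_means = np.nanmean(X_demo, axis=0)

# Where a value is missing, take the mean of its column; otherwise keep the value
X_filled = np.where(np.isnan(X_demo), col_means, X_demo)

print(col_means)
print(X_filled)
```

Mean imputation keeps every row available for training, at the cost of shrinking the variance of the imputed columns.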

### Split data into training and test set


In [12]:
xtrain,xtest,ytrain,ytest=train_test_split(X,Y)
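
Called without arguments, `train_test_split` shuffles the rows and holds out 25% of them, so the fitted coefficients change from run to run; passing `random_state` (and optionally `test_size`) makes the split repeatable. The sketch below mimics the idea with NumPy only. It is not scikit-learn's implementation, and the helper name `shuffled_split` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def shuffled_split(X, y, test_size=0.25, seed=0):
    """Shuffle row indices with a fixed seed, then hold out a test_size fraction."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(np.ceil(test_size * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 8 samples, 2 features (illustrative values only)
X_demo = np.arange(16, dtype=float).reshape(8, 2)
y_demo = np.arange(8, dtype=float)

xtr, xte, ytr, yte = shuffled_split(X_demo, y_demo, test_size=0.25, seed=0)
print(xtr.shape, xte.shape)

# The same seed reproduces the same split
xtr2, xte2, ytr2, yte2 = shuffled_split(X_demo, y_demo, test_size=0.25, seed=0)
print(np.array_equal(yte, yte2))
```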

### Fit Linear Regression Model and Predict

In [13]:
alg=LR()
alg.fit(xtrain,ytrain)
print('Intercept: ',alg.intercept_)
print("Coefficients : ",alg.coef_)
ypred=alg.predict(xtest)
# accuracy_score is a classification metric and rejects a continuous target,
# so the fit should be scored with R^2 (LinearRegression.score) instead:
# print('R^2 of prediction on training set is : ',alg.score(xtrain,ytrain))
# print('R^2 of prediction on test set is : ',alg.score(xtest,ytest))


Intercept:  1.906808774110139
Coefficients :  [ 2.05274717e-01  1.01483401e+00  5.36633761e-03  1.76040865e-01
  6.26174432e-01 -6.80429620e-01  1.41113695e+00  9.66886130e-01
 -3.90062533e-01  1.07817109e+00 -6.12650109e+00]


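`ValueError: continuous is not supported` is raised when scikit-learn's `accuracy_score` is called on this data: it is a classification metric, and `Life Ladder` is a continuous target. The R² score is used for evaluation instead.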
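
For a continuous target, the usual replacement for accuracy is the coefficient of determination, R². The sketch below computes it by hand with NumPy to show what the number means; the function name `r2` and the sample scores are illustrative assumptions.

```python
import numpy as np

def r2(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)

# Illustrative ladder-like scores, not actual model output
y_true = np.array([4.2, 4.5, 5.4, 6.4, 4.3])

perfect = r2(y_true, y_true)                                 # predictions match exactly
baseline = r2(y_true, np.full_like(y_true, y_true.mean()))   # always predict the mean

print(perfect, baseline)
```

A score of 1 means the predictions reproduce the targets exactly, 0 means the model does no better than always predicting the mean, and negative values mean it does worse.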

# Data Visualization and Inferences

In [None]:
from scipy import stats
from sklearn.metrics import r2_score

def PearsonR(x,y):
    # Returns (r, p-value); callers index [0] to get r
    return stats.pearsonr(x,y)
def AccuracyScore(ytrue,ypred):
    # For regression, "accuracy" here means the R^2 score
    return r2_score(ytrue,ypred)

# Formula used internally (computational form of Pearson's r)
# xy=x*y
# xsquare=x**2
# ysquare=y**2
# sumXY=xy.sum()
# sumXSquare=xsquare.sum()
# sumYSquare=ysquare.sum()
# N=len(x)

# import math
# denominator=math.sqrt((N*sumXSquare-x.sum()*x.sum())*(N*sumYSquare-y.sum()*y.sum()))
# numerator=(N*sumXY-x.sum()*y.sum())
# r=numerator/denominator

In [None]:
plt.scatter(xtrain[:,0],ytrain,color="red")   # training data
plt.scatter(xtest[:,0],ypred,color="blue")    # multivariate model's predictions on the test set
# Single-feature fit, used only to draw the trend line
alg_pred=LR()
alg_pred.fit(xtrain[:,0].reshape(-1,1),ytrain)
plt.plot(xtrain[:,0],alg_pred.predict(xtrain[:,0].reshape(-1,1)),color="blue")
# plt.scatter(xtest[:,0],ytest,color="black")
plt.xlabel("Log GDP per capita")
plt.ylabel("Happiness Index")
plt.title("Graph of GDP vs Happiness Index")
plt.show()

# Metric for judging the result: correlation between the feature and the actual test targets
x=xtest[:,0]
y=ytest
print('Correlation between data is : ',PearsonR(x,y)[0])


print("Note: blue dots are predictions from the multivariate linear regression model")



In [None]:
plt.scatter(xtrain[:,1],ytrain,color="red")
plt.scatter(xtest[:,1],ypred,color="blue")
# Single-feature fit, used only to draw the trend line
alg_pred=LR()
alg_pred.fit(xtrain[:,1].reshape(-1,1),ytrain)
plt.plot(xtrain[:,1],alg_pred.predict(xtrain[:,1].reshape(-1,1)),color="blue")
plt.xlabel("Social Support")
plt.ylabel("Happiness Index")
plt.title("Graph of Social Support vs Happiness")

plt.show()
# Correlation between the feature and the actual test targets
x=xtest[:,1]
y=ytest
print('Correlation between data is : ',PearsonR(x,y)[0])

print("Note: blue dots are predictions from the multivariate linear regression model")



In [None]:
plt.scatter(xtrain[:,2],ytrain,color="red")
plt.scatter(xtest[:,2],ypred,color="blue")
# Single-feature fit, used only to draw the trend line
alg_pred=LR()
alg_pred.fit(xtrain[:,2].reshape(-1,1),ytrain)
plt.plot(xtrain[:,2],alg_pred.predict(xtrain[:,2].reshape(-1,1)),color="blue")
plt.xlabel("Healthy Life Expectancy")
plt.ylabel("Happiness Index")
plt.title("Graph of Life Expectancy vs Happiness")
plt.show()
# Correlation between the feature and the actual test targets
x=xtest[:,2]
y=ytest
print('Correlation between data is : ',PearsonR(x,y)[0])
print("Note: blue dots are predictions from the multivariate linear regression model")


In [None]:
plt.scatter(xtrain[:,3],ytrain,color="red")
plt.scatter(xtest[:,3],ypred,color="blue")
# Single-feature fit, used only to draw the trend line
alg_pred=LR()
alg_pred.fit(xtrain[:,3].reshape(-1,1),ytrain)
plt.plot(xtrain[:,3],alg_pred.predict(xtrain[:,3].reshape(-1,1)),color="blue")
plt.xlabel("Freedom to make life choices")
plt.ylabel("Happiness Index")
plt.title("Graph of Freedom to Make Life Choices vs Happiness")
plt.show()
# Correlation between the feature and the actual test targets
x=xtest[:,3]
y=ytest
print('Correlation between data is : ',PearsonR(x,y)[0])
print("Note: blue dots are predictions from the multivariate linear regression model")


In [None]:
plt.scatter(xtrain[:,4],ytrain,color="red")
plt.scatter(xtest[:,4],ypred,color="blue")
# Single-feature fit, used only to draw the trend line
alg_pred=LR()
alg_pred.fit(xtrain[:,4].reshape(-1,1),ytrain)
plt.plot(xtrain[:,4],alg_pred.predict(xtrain[:,4].reshape(-1,1)),color="blue")
plt.xlabel("Generosity")
plt.ylabel("Happiness Index")
plt.title("Graph of Generosity vs Happiness")
plt.show()
# Correlation between the feature and the actual test targets
x=xtest[:,4]
y=ytest
print('Correlation between data is : ',PearsonR(x,y)[0])
print("Conclusion: Happy people may or may not be donating :( - World Bank")
print("Note: blue dots are predictions from the multivariate linear regression model")


In [None]:
data1.head(1)

In [None]:
plt.scatter(xtrain[:,5],ytrain,color="red")
plt.scatter(xtest[:,5],ypred,color="blue")
# Single-feature fit, used only to draw the trend line
alg_pred=LR()
alg_pred.fit(xtrain[:,5].reshape(-1,1),ytrain)
plt.plot(xtrain[:,5],alg_pred.predict(xtrain[:,5].reshape(-1,1)),color="blue")
plt.xlabel("Perceptions of Corruption")
plt.ylabel("Happiness Index")
plt.title("Graph of Perceptions of Corruption vs Happiness")
plt.show()
print("Conclusion: Negative correlation: the higher the perceived corruption, the lower the happiness")
print("Conclusion: Where corruption is low, people are very happy; among more corrupt countries the distinction is lost, though the correlation is clearly negative")
print("Note: blue dots are predictions from the multivariate linear regression model")


In [None]:
plt.scatter(xtrain[:,6],ytrain,color="red")
plt.scatter(xtest[:,6],ypred,color="blue")
# Single-feature fit, used only to draw the trend line
alg_pred=LR()
alg_pred.fit(xtrain[:,6].reshape(-1,1),ytrain)
plt.plot(xtrain[:,6],alg_pred.predict(xtrain[:,6].reshape(-1,1)),color="blue")
plt.xlabel("Positive Affect")
plt.ylabel("Happiness Index")
plt.title("Graph of Positive Affect vs Happiness")
plt.show()

print("Note: blue dots are predictions from the multivariate linear regression model")


In [None]:
plt.scatter(xtrain[:,7],ytrain,color="red")
plt.scatter(xtest[:,7],ypred,color="blue")
# Single-feature fit, used only to draw the trend line
alg_pred=LR()
alg_pred.fit(xtrain[:,7].reshape(-1,1),ytrain)
plt.plot(xtrain[:,7],alg_pred.predict(xtrain[:,7].reshape(-1,1)),color="blue")
plt.xlabel("Negative Affect")
plt.ylabel("Happiness Index")
plt.title("Graph of Negative Affect vs Happiness")
plt.show()
print("Note: blue dots are predictions from the multivariate linear regression model")
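
The plotting cells above repeat the same steps for each column of `X`. As a hedged sketch, the numeric part of each cell (the one-feature trend line and the Pearson correlation) could be computed in a single loop. The `FEATURES` list follows the column order left after the drops in the preprocessing cell; the helper name `summarize_features` and the synthetic stand-in data are assumptions so that the example runs on its own.

```python
import numpy as np

# Column order of X after the drops in the preprocessing cell
FEATURES = [
    "Log GDP per capita", "Social support", "Healthy life expectancy at birth",
    "Freedom to make life choices", "Generosity", "Perceptions of corruption",
    "Positive affect", "Negative affect", "Confidence in national government",
    "Standard deviation of ladder by country-year",
    "Standard deviation/Mean of ladder by country-year",
]

def summarize_features(X, y, names):
    """Return (name, slope, intercept, r) for a one-feature line fit per column."""
    rows = []
    for j, name in enumerate(names):
        slope, intercept = np.polyfit(X[:, j], y, deg=1)  # least-squares line
        r = np.corrcoef(X[:, j], y)[0, 1]                 # Pearson correlation
        rows.append((name, float(slope), float(intercept), float(r)))
    return rows

# Synthetic stand-in data (not the happiness dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, len(FEATURES)))
y_demo = 2.0 * X_demo[:, 0] + 1.0   # depends only on the first column

summary = summarize_features(X_demo, y_demo, FEATURES)
for name, slope, intercept, r in summary:
    print(f"{name:50s} slope={slope:+.3f} r={r:+.3f}")
```

With the real data, `X_demo` and `y_demo` would be replaced by the test or training arrays, and the same loop could drive the scatter plots.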
