# About the Dataset

### World Happiness Report data for 2017, as described on the report's website:

The first World Happiness Report was published in April, 2012, in support of the UN High Level Meeting on happiness and well-being. Since then the world has come a long way. Increasingly, happiness is considered to be the proper measure of social progress and the goal of public policy. In June 2016 the OECD committed itself “to redefine the growth narrative to put people’s well-being at the center of governments’ efforts”. In February 2017, the United Arab Emirates held a full-day World Happiness meeting, as part of the World Government Summit. Now on World Happiness Day, March 20th, we launch the World Happiness Report 2017, once again back at the United Nations, again published by the Sustainable Development Solutions Network, and now supported by a generous three-year grant from the Ernesto Illy Foundation. Some highlights are as follows.

Source: Helliwell, J., Layard, R., & Sachs, J. (2017). World Happiness Report 2017, New York: Sustainable Development Solutions Network.

##### Dictionary:

![Dictionary](img/DictionaryData.png)

##### Labeled visualizations comparing happiness data across countries on various factors:

![data1-image](img/Data1_Happiness.png)


![data2-image](img/Data2_Happiness.png)


![data3-image](img/Data3_Happiness.png)





### Review of Regression and Correlation [ To be Completed ]

What is Linear Regression?

Why use Linear Regression?

What is Correlation?

Explain Regression with formulas
    Cost function
    Gradient descent step
    Optimal Coefficients
    Predicting the value
    Accuracy Score
    Plotting
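
The regression steps outlined above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the half mean squared error cost J(theta) = 1/(2m) * sum((X theta - y)^2) and plain batch gradient descent. The function names, the learning rate, the iteration count and the synthetic data are all hypothetical choices for this sketch and are not taken from the notebook.

```python
import numpy as np

def cost(X, y, theta):
    """Half mean squared error: J(theta) = 1/(2m) * sum((X @ theta - y)^2)."""
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * len(y))

def gradient_descent(X, y, lr=0.5, n_iters=2000):
    """Batch gradient descent: theta <- theta - lr * dJ/dtheta."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        gradient = X.T @ (X @ theta - y) / len(y)
        theta -= lr * gradient
    return theta

# Synthetic, noiseless data (an assumption for illustration): y = 2 + 3x
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 2.0 + 3.0 * x

# The column of ones lets theta[0] act as the intercept
X = np.column_stack([np.ones_like(x), x])

theta = gradient_descent(X, y)

# Optimal coefficients from a least-squares solver, for comparison
closed_form = np.linalg.lstsq(X, y, rcond=None)[0]

# Predicting the value for a new input x = 0.5
prediction = np.array([1.0, 0.5]) @ theta

print("gradient descent:", theta)
print("least squares   :", closed_form)
print("final cost      :", cost(X, y, theta))
```

On this toy problem both approaches should agree closely. In practice `LinearRegression` solves the least-squares problem directly rather than iterating.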

Explain Correlation with formula and graphs

    Correlation Formula
    Correlation Coeff
    Correlation Values and meaning of a positive or negative correlation

`scipy.stats.pearsonr` calculates a Pearson correlation coefficient and the p-value for testing non-correlation.

The Pearson correlation coefficient measures the strength of the linear relationship between two datasets. The coefficient itself can be computed for any two numeric datasets of equal length; strictly speaking, it is the p-value calculation that assumes each dataset is normally distributed. Like other correlation coefficients, it varies between -1 and +1, with 0 implying no linear correlation. Correlations of -1 or +1 imply an exact linear relationship. A positive correlation means that y tends to increase as x increases; a negative correlation means that y tends to decrease as x increases.

The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.
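
To make the description above concrete, here is a small sketch that computes the coefficient by hand and compares it with `scipy.stats.pearsonr`. The helper name `pearson_r` and the toy arrays are hypothetical and only serve as an illustration.

```python
import numpy as np
from scipy import stats

def pearson_r(x, y):
    """r = sum((x - mean(x)) * (y - mean(y))) / sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))"""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx @ dy) / np.sqrt((dx @ dx) * (dy @ dy))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2.0 * x + 1.0       # exact increasing line
y_down = -0.5 * x + 4.0    # exact decreasing line
y_noisy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # roughly linear

print("exact increasing line:", pearson_r(x, y_up))
print("exact decreasing line:", pearson_r(x, y_down))

# scipy returns the coefficient together with the p-value
r_scipy, p_value = stats.pearsonr(x, y_noisy)
print("noisy data (manual):", pearson_r(x, y_noisy))
print("noisy data (scipy) :", r_scipy, "p-value:", p_value)
```

With only five points the p-value here should be read with caution, in line with the remark above about small datasets.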



# Explanation of the Code






### Import the required libraries

In [1]:
import pandas as pd
import numpy as np
# SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression as LR
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
import scipy as sp

### Read the dataset

In [2]:
# data=pd.read_excel("World Happiness Data.xlsx")
data=pd.read_csv("World Happiness Data.csv")
data1=data[data.year==2016]
data1=data1.reset_index()

In [3]:
data1.head()

Unnamed: 0,index,WP5 Country,country,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year
0,8,Afghanistan,Afghanistan,2016,4.220169,7.497288,0.559072,49.871265,0.522566,0.057393,0.793246,0.564953,0.348332,0.32499,,,1.796219,0.425627
1,17,Albania,Albania,2016,4.511101,9.2823,0.638411,68.69838,0.729819,-0.017927,0.901071,0.675244,0.321706,0.40091,,,2.646668,0.586701
2,22,Algeria,Algeria,2016,5.388171,9.549138,0.74815,64.829948,,,,0.668838,0.371372,,,,2.109472,0.391501
3,37,Argentina,Argentina,2016,6.427221,,0.882819,67.443993,0.847702,,0.850924,0.841907,0.311646,0.419562,,,2.127109,0.330953
4,48,Armenia,Armenia,2016,4.325472,8.989569,0.709218,65.40947,0.610987,-0.155814,0.921421,0.5936,0.437228,0.184713,,,2.126364,0.491591
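
The preview above contains empty cells (for example in the Algeria and Argentina rows), which pandas reads as `NaN`. The following sketch shows how mean imputation fills such gaps on a tiny made-up array. It assumes a scikit-learn version that provides `sklearn.impute.SimpleImputer`; the array values are hypothetical and are not taken from the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with one missing value per column
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# Each NaN is replaced by the mean of the observed values in its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
# [[ 1. 10.]
#  [ 2. 20.]
#  [ 3. 15.]]
```

Mean imputation keeps every row but shrinks the variance of the imputed columns, so it is a simple baseline rather than the only reasonable choice.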


# Without Polynomial Features

In [4]:
import pandas as pd
import numpy as np
# SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression as LR
import scipy as sp

# data=pd.read_excel("World Happiness Data.xlsx")
data=pd.read_csv("World Happiness Data.csv")
# Keep only the 2016 rows
data1=data[data.year==2016]
data1=data1.reset_index()

# Drop identifier columns that are not used as features
data1=data1.drop(["index","WP5 Country","country","year"],axis=1)
# Target: the happiness score ("Life Ladder")
Y=data1["Life Ladder"].values
# Features: everything except the target and the columns excluded from the model
data1=data1.drop(["Life Ladder","Democratic Quality","Delivery Quality","Standard deviation of ladder by country-year","Standard deviation/Mean of ladder by country-year"],axis=1)
X=data1.values

# Impute missing values: each NaN is replaced with the mean of its column
imp=SimpleImputer(missing_values=np.nan,strategy="mean")
X=imp.fit_transform(X)


# Degree-2 polynomial expansion is left disabled in this cell
# from sklearn.preprocessing import PolynomialFeatures
# pol_reg=PolynomialFeatures(degree=2)
# X=pol_reg.fit_transform(X)

# No random_state is set, so the split (and the scores below) change from run to run
xtrain,xtest,ytrain,ytest=train_test_split(X,Y)


# Optional feature scaling, left disabled
# from sklearn.preprocessing import StandardScaler as SC
# sc_X=SC()
# xtrain=sc_X.fit_transform(xtrain)
# xtest=sc_X.transform(xtest)
from scipy import stats
from sklearn.metrics import r2_score

def PearsonR(x,y):
    # Returns the Pearson correlation coefficient and its p-value
    return stats.pearsonr(x,y)
def AccuracyScore(ytrue,ypred):
    # Despite the name, this is the R^2 score (coefficient of determination)
    return r2_score(ytrue,ypred)
alg=LR()
alg.fit(xtrain,ytrain)
print('Intercept: ',alg.intercept_)
print("Coefficients : ",alg.coef_)
ypred=alg.predict(xtest)
print('Accuracy of prediction on training set is : ',AccuracyScore(ytrain,alg.predict(xtrain)))
print('Accuracy of prediction on test set is : ',AccuracyScore(ytest,ypred))


Intercept:  -4.051024757253949
Coefficients :  [ 0.32557992  2.50855526  0.03520315  1.38818214  0.61097546 -0.09728142
  1.9795098   0.12836087 -0.6065006 ]
Accuracy of prediction on training set is :  0.8061516119862016
Accuracy of prediction on test set is :  0.7767877535986406
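
The helper `AccuracyScore` wraps scikit-learn's `r2_score`, so the two values printed above are R² (coefficient of determination) scores rather than classification accuracies. As a small sketch of what that number means, the snippet below computes R² by hand as 1 - SS_res / SS_tot and checks it against `r2_score`. The function name `r_squared` and the toy values are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

manual = r_squared(y_true, y_pred)
print("manual R^2 :", manual)
print("sklearn R^2:", r2_score(y_true, y_pred))

# A model that always predicts the mean of y_true scores exactly 0
baseline = r_squared(y_true, np.full_like(y_true, y_true.mean()))
print("mean-only baseline:", baseline)
```

An R² of 1 means a perfect fit, 0 means the model does no better than predicting the mean, and negative values are possible on held-out data.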


# With Polynomial Features

In [5]:

import pandas as pd
import numpy as np
# SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression as LR
import scipy as sp

# data=pd.read_excel("World Happiness Data.xlsx")
data=pd.read_csv("World Happiness Data.csv")
# Keep only the 2016 rows
data1=data[data.year==2016]
data1=data1.reset_index()

# Drop identifier columns that are not used as features
data1=data1.drop(["index","WP5 Country","country","year"],axis=1)
# Target: the happiness score ("Life Ladder")
Y=data1["Life Ladder"].values
# Features: everything except the target and the columns excluded from the model
data1=data1.drop(["Life Ladder","Democratic Quality","Delivery Quality","Standard deviation of ladder by country-year","Standard deviation/Mean of ladder by country-year"],axis=1)
X=data1.values

# Impute missing values: each NaN is replaced with the mean of its column
imp=SimpleImputer(missing_values=np.nan,strategy="mean")
X=imp.fit_transform(X)
# Expand the features to degree 2: a constant column, the original features, their squares and pairwise products
from sklearn.preprocessing import PolynomialFeatures
pol_reg=PolynomialFeatures(degree=2)
X=pol_reg.fit_transform(X)


# No random_state is set, so the split (and the scores below) change from run to run
xtrain,xtest,ytrain,ytest=train_test_split(X,Y)


# Optional feature scaling, left disabled
# from sklearn.preprocessing import StandardScaler as SC
# sc_X=SC()
# xtrain=sc_X.fit_transform(xtrain)
# xtest=sc_X.transform(xtest)
from scipy import stats
from sklearn.metrics import r2_score

def PearsonR(x,y):
    # Returns the Pearson correlation coefficient and its p-value
    return stats.pearsonr(x,y)
def AccuracyScore(ytrue,ypred):
    # Despite the name, this is the R^2 score (coefficient of determination)
    return r2_score(ytrue,ypred)
alg=LR()
alg.fit(xtrain,ytrain)
print('Intercept: ',alg.intercept_)
print("Coefficients : ",alg.coef_)
ypred=alg.predict(xtest)
print('Accuracy of prediction on training set is : ',AccuracyScore(ytrain,alg.predict(xtrain)))
print('Accuracy of prediction on test set is : ',AccuracyScore(ytest,ypred))


Intercept:  42902817558.91934
Coefficients :  [-4.29028176e+10  3.91505208e+00 -3.79109819e+01 -4.36548805e-01
  1.36314752e+00 -1.17844027e+01  2.14302602e+01  9.56609135e+00
 -2.46798916e+01  8.80344687e+00 -1.19386078e-01  7.43530722e-01
 -3.68579142e-03  1.21226802e+00 -4.58020229e-01 -3.43217489e+00
 -4.46339704e-03  4.00736853e+00 -2.40384488e+00 -4.73575230e+00
  2.49581095e-01  1.77817399e+01  8.08633502e+00  2.59911790e+01
 -1.04617826e+01 -1.93715223e+01  1.28654977e+01  4.55503233e-04
 -2.30052811e-01 -1.85637466e-02  9.68610338e-02  3.08162644e-01
 -2.16109723e-03  2.63884516e-01 -1.18655062e+01  2.38885930e+00
 -1.01122551e+01  1.53579537e+01  6.08484570e-01  9.12628798e+00
 -3.49082104e+00  7.76859369e+00 -5.19899569e+00  2.41394562e+01
  2.64742434e+00  8.99654883e-01 -9.70343875e+00 -3.38252602e+00
 -2.25165435e+00 -1.10443325e+01  1.05000955e+01 -2.55367196e+01
 -6.98331515e+00  7.98367874e+00 -3.27369491e+00]
Accuracy of prediction on training set is :  0.869480263778
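
The output above lists 55 coefficients, which matches a degree-2 expansion of 9 features: 1 constant column, 9 linear terms, 9 squares and 36 pairwise products. The intercept (about 4.29e10) and the first coefficient (about -4.29e10) nearly cancel. That is what one would expect if the constant column duplicates the intercept that `LinearRegression` already fits by default, although this explanation is an inference from the printed numbers. The sketch below uses random stand-in data (not the happiness data) to show how `include_bias=False` drops that column.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Random stand-in for a matrix with 9 features (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 9))

# Default behaviour: the first output column is a constant 1 (the "bias" column)
with_bias = PolynomialFeatures(degree=2).fit_transform(X)

# include_bias=False omits that constant column
without_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

print(with_bias.shape)     # (20, 55)
print(without_bias.shape)  # (20, 54)
print(np.all(with_bias[:, 0] == 1.0))  # True
```

With 54 or 55 expanded features and only one year of country rows, the higher training score may partly reflect overfitting, so comparing it with the test score is the more informative check.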