# Project Title

## Nick Buller, Kerry Harp, Sofanit Mengesha, Christopher Pope

### Narrative, Hypothesis, Intro, etc.
World Happiness Report (https://worldhappiness.report/)

Social Progress Index (https://www.socialprogress.org/)

In [None]:
import os
import csv
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import linregress
import requests
import json

In [None]:
#file path
#Import clean csv file
project_df_path = "Data/merge.csv"
project_df = pd.read_csv(project_df_path)
project_df.head()

In [None]:
#suppress warning
import warnings
warnings.simplefilter("ignore")

Over the five years analyzed (2015-2019), the Happiness Score covered up to 156 countries and the Social Progress Index covered up to 192. After the two datasets were merged and cleaned, 76 countries remained that appear in both the Happiness Score and the Social Progress Index.

![countries.png](attachment:countries.png)

# The Social Progress Index Score and Happiness Score have a positive correlation

In [None]:
#List the unique participating countries
map_locations = project_df["Country or region_HS"].unique()
#map_locations

In [None]:
#Get ranking and totals data (copy so later edits do not modify a slice of project_df)
totals_df = project_df[['Year_HS', 'Country_SP', 'Overall rank_HS', 'Score_HS', 'SPI Rank_SP', 'Social Progress Index_SP']].copy()

#Rename columns to shorter names without spaces
totals_df = totals_df.rename(columns={
    "Year_HS": "Year",
    "Country_SP": "Country",
    "Overall rank_HS": "Rank_HS",
    "SPI Rank_SP": "Rank_SPI",
    "Social Progress Index_SP": "Score_SPI",
})

#Store the SPI ranking as an integer so it has no decimal
totals_df["Rank_SPI"] = totals_df["Rank_SPI"].astype(int)

## 2015-2019 Average Scores

Add observations of the scores...

In [None]:
#Chart the average scores for each country
scores_df = totals_df[["Country", "Score_HS", "Score_SPI"]]

#Group by country
scoresgroup_df = scores_df.groupby("Country")
scores_all = scoresgroup_df.mean()

#Plot Happiness Score
HS_df = totals_df.groupby(["Country"])["Score_HS"].mean().reset_index()
ax = plt.subplot(1,2,1)
HS_plot = HS_df.sort_values(["Score_HS"], ascending=True).plot(kind="barh", y="Score_HS", x="Country", figsize=(15,25), color="steelblue", ax=ax)

plt.title("Average Happiness Scores (2015-2019)", fontsize=15)
plt.ylabel(" Participating Countries", fontsize=12)
plt.xlabel("Score", fontsize=12)
plt.grid(True, linestyle="-", which="major", color="gray", alpha=0.25)

#Plot Social Progress Index
SP_df = totals_df.groupby(["Country"])["Score_SPI"].mean().reset_index()
ax = plt.subplot(1,2,2)
SP_plot = SP_df.sort_values(["Score_SPI"], ascending=True).plot(kind="barh", y="Score_SPI", x="Country", figsize=(15,25), color="teal", ax=ax)

plt.title("Average Social Progress Index (2015-2019)", fontsize=15)
plt.ylabel(" Participating Countries", fontsize=12)
plt.xlabel("Score", fontsize=12)
plt.grid(True, linestyle="-", which="major", color="gray", alpha=0.25)

#plt.savefig("Data/AverageScores.png")
plt.show()

## Do the two systems correlate?

Add observations and findings...

In [None]:
#Scatterplot all years, all countries
x_values = totals_df["Score_HS"]
y_values = totals_df["Score_SPI"]

(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = slope * x_values + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
r2_eq = "r^2 = " + str(round(rvalue**2,2))

plt.rcParams["figure.figsize"] = (7,7)
plt.title("Social Progress Index and Happiness Scores \n for All Countries (2015-2019)",  fontsize=15)
plt.ylabel(" Social Progress Index",  fontsize=12)
plt.xlabel("Happiness Score",  fontsize=12)
plt.ylim(20,100)
plt.xlim(0,10)
plt.grid(True, linestyle="-", which="major", color="gray", alpha=0.25)
plt.annotate(line_eq, (2,30), fontsize=12, color="red")
plt.annotate(r2_eq, (2,25), fontsize=12, color="red")

plt.scatter(x_values, y_values, color="black", marker="o", alpha=0.4, s=125 )
plt.plot(x_values, regress_values, "r-")

#plt.savefig("Data/HP_SPI_scatter.png")
plt.show()

#Scatterplot averaged scores, all countries
#Group by country
grouped_totals_df = totals_df.groupby(["Country"])
#grouped_totals_df.mean()

#Converting groupby to a dataframe
averages_df = grouped_totals_df[["Rank_HS", "Score_HS", "Rank_SPI", "Score_SPI"]].mean()

#Scatterplot and annotate averaged data 
x_values = averages_df["Score_HS"]
y_values = averages_df["Score_SPI"]

(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = slope * x_values + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
r2_eq = "r^2 = " + str(round(rvalue**2,2))

plt.rcParams["figure.figsize"] = (7,7)
plt.title("5-year Average Social Progress Index \n and Happiness Score for Each Country",  fontsize=15)
plt.ylabel(" Social Progress Index",  fontsize=12)
plt.xlabel("Happiness Score",  fontsize=12)
plt.ylim(20,100)
plt.xlim(0,10)
plt.grid(True, linestyle="-", which="major", color="gray", alpha=0.25)
plt.annotate(line_eq, (2,30), fontsize=12, color="red")
plt.annotate(r2_eq, (2,25), fontsize=12, color="red")

plt.scatter(x_values, y_values, color="black", marker="o",  alpha=0.4, s=125)
plt.plot(x_values, regress_values, "r-")

#plt.savefig("Data/HP_SPI_average_scatter")
plt.show()

## Top of the Top 10

Countries that are ranked in the top 10 of BOTH the Happiness Score and the Social Progress Index (2015-2019).

Add observations...

![countries10.png](attachment:countries10.png)

In [None]:
#Rankings of Happiness Score, grouped by year
ranking_HS = totals_df.groupby(["Country","Year"])
#ranking_HS = totals_df.sort_values(["Year","Score_HS"])
#ranking_HS.mean()

In [None]:
ranking_both_top10 = totals_df[(totals_df["Rank_HS"] <= 10) & (totals_df["Rank_SPI"] <= 10)]
#use | (vertical bar) instead of & to keep rows that meet either condition


In [None]:
ranking_HS = ranking_both_top10.sort_values(["Rank_HS"])
ranking_SPI = ranking_both_top10.sort_values(["Rank_SPI"])
ranking_country = ranking_both_top10.sort_values(["Country"])
ranking_top10 = ranking_both_top10.groupby(["Country","Rank_HS","Rank_SPI"])
count10 = ranking_top10["Rank_SPI"].count()

In [None]:
groups = ranking_country.groupby("Country")
colors = {'Australia':'crimson', 'Canada':'darkviolet', 'Denmark':'green', 'Finland':'deepskyblue', 'Netherlands':'gold', 'New Zealand':'orange', 'Norway':'mediumblue', 'Sweden':'teal', 'Switzerland':'yellowgreen'}

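for name, group in groups:
    #Use the color assigned to each country; fall back to the default color cycle if a country is missing
    plt.plot(group["Rank_HS"], group["Rank_SPI"], marker="o", markersize=20, linestyle="", label=name, color=colors.get(name))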

plt.rcParams["figure.figsize"] = (7,7)
plt.rcParams["legend.markerscale"] = 0.4

plt.ylim(-1,11)
plt.xlim(-1,11)
plt.title("Countries that Rank in the Top 10 in Both \n Social Progress and Happiness Rankings (2015-2019)", fontsize=15)
plt.ylabel(" Social Progress Rank", fontsize=12)
plt.xlabel("Happiness Rank", fontsize=12)
plt.grid(True, linestyle="-", which="major", color="gray", alpha=0.25)
plt.legend()

#plt.savefig("Data/TopTenRanking-alternate.png")
plt.show()

This plot did not convey all the information we wanted, so it was redone with a different method that allows customizing the dot colors (coldest climate in blue through hottest climate in red) and sizing each dot by the two rankings combined (i.e., the lower the sum of the two ranks, the bigger the dot). Unfortunately, we could not get the legend to display correctly. Visually, the redone plot is the preferred one.
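
The dot sizing described above can be sketched in plain Python. This is only an illustration: the helper name `dot_sizes` and the sample rank pairs are hypothetical, and the default `scale=2500` simply mirrors the multiplier used in the plotting cell.

```python
def dot_sizes(ranks_hs, ranks_spi, scale=2500):
    """Return marker sizes that grow as the combined rank gets smaller."""
    sizes = []
    for hs, spi in zip(ranks_hs, ranks_spi):
        combined = float(hs) + float(spi)  # best possible value is 1 + 1 = 2
        sizes.append(scale / combined)     # reciprocal: small sum -> large dot
    return sizes


# Hypothetical (Happiness rank, Social Progress rank) pairs
example_sizes = dot_sizes([1, 5, 10], [1, 5, 10])
print(example_sizes)
```

Dividing by the sum keeps every size positive and makes a country ranked first in both systems the largest dot on the chart.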

In [None]:
#Create variable for list of ranking data
plot_HS_rank = ranking_country["Rank_HS"].to_list()

plot_SPI_rank = ranking_country["Rank_SPI"].to_list()

country = ranking_country["Country"].to_list()

#Create variable to add and inverse the rankings so the smallest number for the ranking displays a larger dot using list comprehension.
#https://www.geeksforgeeks.org/python-adding-two-list-elements/
#display_dot = [plot_HS_rank[i] + plot_SPI_rank[i] for i in range(len(plot_HS_rank))]

display_dot = [float(plot_HS_rank[i]) + float(plot_SPI_rank[i]) for i in range(len(plot_HS_rank))]

#Find reciprocal
#https://www.geeksforgeeks.org/numpy-reciprocal
display_dot_inv = np.reciprocal(display_dot)

#display_dot
#display_dot_inv

In [None]:
#Scatterplot of the two rankings with countries that rank 1-10 in both.
#https://stackoverflow.com/questions/26139423/plot-different-color-for-different-categorical-levels-using-matplotlib
#https://python-graph-gallery.com/270-basic-bubble-plot/
#https://matplotlib.org/stable/gallery/lines_bars_and_markers/scatter_with_legend.html
    
df = pd.DataFrame(dict(plot_HS_rank=plot_HS_rank, plot_SPI_rank=plot_SPI_rank, country=country, display_dot_inv=display_dot_inv))
fig, ax = plt.subplots()
plt.rcParams["figure.figsize"] = (7,7)

colors = {'Australia':'crimson', 'Canada':'darkviolet', 'Denmark':'green', 'Finland':'deepskyblue', 'Netherlands':'gold', 'New Zealand':'orange', 'Norway':'mediumblue', 'Sweden':'teal', 'Switzerland':'yellowgreen'}
scatter = ax.scatter(df['plot_HS_rank'], df['plot_SPI_rank'], s=df['display_dot_inv']*2500, c=df['country'].apply(lambda x: colors[x]))

legend1 = ax.legend(*scatter.legend_elements(), loc="lower right", title="Countries")
ax.add_artist(legend1)

plt.ylim(-1,11)
plt.xlim(-1,11)
plt.title("Countries that Rank in the Top 10 in Both \n Social Progress and Happiness Rankings (2015-2019)", fontsize=15)
plt.ylabel(" Social Progress Rank", fontsize=12)
plt.xlabel("Happiness Rank", fontsize=12)
plt.grid(True, linestyle="-", which="major", color="gray", alpha=0.25)

#plt.savefig("Data/TopTenRanking.png")
plt.show()

# Analysis of Relationships between Social Progress Indicators and Happiness Score

### Basic Human Needs vs. Happiness Linear Regression

In [None]:
# Add the linear regression equation and line to plot
x_values = project_df['Score_HS']
y_values = project_df['Basic Human Needs_SP']
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
plt.scatter(x_values,y_values, edgecolors= "black")
plt.plot(x_values,regress_values,"r-")
plt.annotate(line_eq,(5,45),fontsize=15,color="red")

# Labels
plt.title('Basic Human Needs vs. Happiness Linear Regression')
plt.xlabel('Happiness Score')
plt.ylabel('Basic Human Needs Score')
plt.grid()
plt.show()

In [None]:
score = f"The r value is: {rvalue}"
if rvalue == 0:
    relationship = "nonexistent"
    strength = ""
else:
    relationship = "positive" if rvalue > 0 else "negative"
    if abs(rvalue) >= .7:
        strength = "strong"
    elif abs(rvalue) >= .5:
        strength = "moderate"
    elif abs(rvalue) >= .3:
        strength = "weak"
    else:
        strength = "very weak"
    
print(f"The r value is: {rvalue}.  This is a {strength} {relationship} relationship")
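
Because the same classification logic repeats for every indicator, it could be wrapped in one small function. The sketch below is an assumption about how that might look, not part of the original analysis: the name `describe_correlation`, the `weak_cutoff` parameter, and the "very weak" label for values below the cutoff are all hypothetical. The 0.7, 0.5, and 0.3 cutoffs are the ones used in the cell above.

```python
def describe_correlation(r, weak_cutoff=0.3):
    """Return (strength, direction) labels for a Pearson r value."""
    if r == 0:
        return "no", "linear"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size >= 0.7:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    elif size >= weak_cutoff:
        strength = "weak"
    else:
        strength = "very weak"  # hypothetical label for values below the cutoff
    return strength, direction


# Hypothetical r values, chosen only to exercise each branch
for r in (0.85, -0.55, 0.12):
    strength, direction = describe_correlation(r)
    print(f"The r value is: {r}. This is a {strength} {direction} relationship")
```

Returning a label for every possible input avoids the case where `strength` is never assigned because the r value falls below the lowest cutoff.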

### Foundations of Wellbeing vs Happiness Linear Regression

In [None]:
# Add the linear regression equation and line to plot
x_values = project_df['Score_HS']
y_values = project_df['Foundations of Wellbeing_SP']
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
plt.scatter(x_values,y_values, edgecolors= "black")
plt.plot(x_values,regress_values,"r-")
plt.annotate(line_eq,(5,51),fontsize=15,color="red")

# Labels
plt.title('Foundations of Wellbeing vs. Happiness Linear Regression')
plt.xlabel('Happiness Score')
plt.ylabel('Foundations of Wellbeing Score')
plt.grid()
plt.show()

In [None]:
score = f"The r value is: {rvalue}"
if rvalue == 0:
    relationship = "nonexistent"
    strength = ""
else:
    relationship = "positive" if rvalue > 0 else "negative"
    if abs(rvalue) >= .7:
        strength = "strong"
    elif abs(rvalue) >= .5:
        strength = "moderate"
    elif abs(rvalue) >= .3:
        strength = "weak"
    else:
        strength = "very weak"
    
print(f"The r value is: {rvalue}.  This is a {strength} {relationship} relationship")

### Opportunity vs. Happiness Linear Regression

In [None]:
# Add the linear regression equation and line to plot
x_values = project_df['Score_HS']
y_values = project_df['Opportunity_SP']
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
plt.scatter(x_values,y_values, edgecolors= "black")
plt.plot(x_values,regress_values,"r-")
plt.annotate(line_eq,(5.7,40),fontsize=15,color="red")

# Labels
plt.title('Opportunity vs. Happiness Linear Regression')
plt.xlabel('Happiness Score')
plt.ylabel('Opportunity Score')
plt.grid()
plt.show()

In [None]:
score = f"The r value is: {rvalue}"
if rvalue == 0:
    relationship = "nonexistent"
    strength = ""
else:
    relationship = "positive" if rvalue > 0 else "negative"
    if abs(rvalue) >= .7:
        strength = "strong"
    elif abs(rvalue) >= .5:
        strength = "moderate"
    elif abs(rvalue) >= .3:
        strength = "weak"
    else:
        strength = "very weak"
    
print(f"The r value is: {rvalue}.  This is a {strength} {relationship} relationship")

# Finding Sub-metric with Strongest Correlation to Happiness Score

In [None]:
# Create lists for storage of information
sub_metric_list = []
rvalue_list = []

# Select only columns relating to Social Progress Indicators
submetric_df = project_df.iloc[:, 22:84]

# Iterate through the submetric data frame and store rvalues and submetric label for each
for column in submetric_df:
    submetric_string = column[:-3]
    x_values = project_df['Score_HS']
    y_values = project_df[column]
    (slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
    sub_metric_list.append(submetric_string)
    rvalue_list.append(rvalue)
    
# Store lists in a dictionary
submetric_results_dict = {
    'Sub Metric' : sub_metric_list,
    'R Value' : rvalue_list
}

# Create a dataframe with the dictionary
submetric_results_df = pd.DataFrame(data = submetric_results_dict)
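
The loop above computes one Pearson r per indicator column. As a possible alternative, pandas can make the same comparison in a single call with `DataFrame.corrwith`. The sketch below uses a small made-up frame; `demo_df` and its indicator column names are hypothetical stand-ins for `project_df`, and it does not strip the `_SP` suffix the way the loop does.

```python
import pandas as pd

# Hypothetical stand-in for project_df with two indicator columns
demo_df = pd.DataFrame({
    "Score_HS": [4.0, 5.0, 6.0, 7.0],
    "Indicator A_SP": [40.0, 50.0, 60.0, 70.0],  # rises with the score
    "Indicator B_SP": [9.0, 7.0, 5.0, 3.0],      # falls as the score rises
})

# Pearson r between every indicator column and the happiness score
indicators = demo_df.drop(columns="Score_HS")
r_values = indicators.corrwith(demo_df["Score_HS"])

# Shape the result like submetric_results_df, strongest positive first
results_df = (
    r_values.rename("R Value")
    .rename_axis("Sub Metric")
    .reset_index()
    .sort_values("R Value", ascending=False)
)
print(results_df)
```

The Pearson r returned here is the same statistic that `linregress` reports as `rvalue`, so both approaches should rank the indicators identically.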

### Positive Relationship Results

In [None]:
positive_results_df = submetric_results_df.loc[submetric_results_df['R Value'] >= 0]
weak_pos_results_df = positive_results_df.sort_values(by=['R Value'])
strong_pos_results_df = positive_results_df.sort_values(by=['R Value'], ascending = False)

<b>Strongest Positive Indicators of Happiness</b>

In [None]:
strong_pos_results_df.head()

<b>Weakest Positive Indicators of Happiness</b>

In [None]:
weak_pos_results_df.head()

### Negative Relationship Results

In [None]:
negative_results_df = submetric_results_df.loc[submetric_results_df['R Value'] < 0]
strong_neg_results_df = negative_results_df.sort_values(by=['R Value'])
weak_neg_results_df = negative_results_df.sort_values(by=['R Value'], ascending = False)

<b>Strongest Negative Indicators of Happiness</b>

In [None]:
strong_neg_results_df.head()

<b>Weakest Negative Indicators of Happiness</b>

In [None]:
weak_neg_results_df.head()

# Analysis of Relationships between Happiness Indicators and Social Progress Score

In [None]:
# Add the linear regression equation and line to plot
x_values = project_df['GDP per capita_HS']
y_values = project_df['Social Progress Index_SP']
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
plt.scatter(x_values,y_values, edgecolors= "black")
plt.plot(x_values,regress_values,"r-")
plt.annotate(line_eq,( 1.1, 60),fontsize=15,color="red")

# Labels
plt.title('GDP per capita vs. Social Progress Index Score')
plt.xlabel('GDP per capita')
plt.ylabel('Social Progress Index Score')
plt.grid()
plt.scatter(x_values, y_values, color="black", marker="o")
plt.show()

In [None]:
score = f"The r value is: {rvalue}"
if rvalue == 0:
    relationship = "nonexistent"
    strength = ""
else:
    relationship = "positive" if rvalue > 0 else "negative"
    if abs(rvalue) >= .7:
        strength = "strong"
    elif abs(rvalue) >= .5:
        strength = "moderate"
    elif abs(rvalue) >= .3:
        strength = "weak"
    else:
        strength = "very weak"
    
print(f"The r value is: {rvalue}.  This is a {strength} {relationship} relationship")

In [None]:
x_values = project_df['GDP per capita_HS']
y_values = project_df['Social Progress Index_SP']

correlation_matrix = np.corrcoef(x_values, y_values)
correlation_xy = correlation_matrix[0,1]
r_squared = correlation_xy**2

#print(r_squared)
print(f"The R^2 value for GDP per capita is: {r_squared}. Of the happiness indicators, GDP per capita has the strongest correlation with Social Progress Index score.")
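
The R^2 computed in this cell is the square of the Pearson correlation coefficient. For a regression with a single predictor, that is the same quantity as `rvalue**2` from `linregress`. As a rough illustration of what is being calculated, the sketch below computes Pearson r from its definition in plain Python; the function name `pearson_r` and the sample data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data, not taken from the merged dataset
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]

r = pearson_r(xs, ys)
r_squared = r ** 2  # share of the variance in ys explained by a straight line in xs
print(f"r = {r:.4f}, R^2 = {r_squared:.4f}")
```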

In [None]:
# Add the linear regression equation and line to plot
x_values = project_df['Social support_HS']
y_values = project_df['Social Progress Index_SP']
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
plt.scatter(x_values,y_values, edgecolors= "black")
plt.plot(x_values,regress_values,"r-")
plt.annotate(line_eq,( 1.1, 60),fontsize=15,color="red")

# Labels
plt.title('Social support vs. Social Progress Index Score')
plt.xlabel('Social support')
plt.ylabel('Social Progress Index Score')
plt.grid()
plt.scatter(x_values, y_values, color="black", marker="o")
plt.show()

There is a strong positive relationship between GDP per capita and Social Progress Index score: countries with a higher GDP per capita tend to have a higher Social Progress Index score, and vice versa.

In [None]:
score = f"The r value is: {rvalue}"
if rvalue == 0:
    relationship = "nonexistent"
    strength = ""
else:
    relationship = "positive" if rvalue > 0 else "negative"
    if abs(rvalue) >= .7:
        strength = "strong"
    elif abs(rvalue) >= .5:
        strength = "moderate"
    elif abs(rvalue) >= .3:
        strength = "weak"
    else:
        strength = "very weak"
    
print(f"The r value is: {rvalue}.  This is a {strength} {relationship} relationship")

In [None]:
x_values = project_df['Social support_HS']
y_values = project_df['Social Progress Index_SP']

correlation_matrix = np.corrcoef(x_values, y_values)
correlation_xy = correlation_matrix[0,1]
r_squared = correlation_xy**2

#print(r_squared)
print(f"The R^2 value for Social support is: {r_squared}. Of the happiness indicators, Social support has the third strongest correlation with Social Progress Index score.")

In [None]:
# Add the linear regression equation and line to plot
x_values = project_df['Healthy life expectancy_HS']
y_values = project_df['Social Progress Index_SP']
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
plt.scatter(x_values,y_values, edgecolors= "black")
plt.plot(x_values,regress_values,"r-")
plt.annotate(line_eq,( 0.7, 60),fontsize=15,color="red")

# Labels
plt.title('Healthy life expectancy vs. Social Progress Index Score')
plt.xlabel('Healthy life expectancy')
plt.ylabel('Social Progress Index Score')
plt.grid()
plt.scatter(x_values, y_values, color="black", marker="o")
plt.show()

There is a moderate positive relationship between Social support and Social Progress Index score: countries with more social support tend to have a moderately higher Social Progress Index score, and vice versa.

In [None]:
score = f"The r value is: {rvalue}"
if rvalue == 0:
    relationship = "nonexistent"
    strength = ""
else:
    relationship = "positive" if rvalue > 0 else "negative"
    if abs(rvalue) >= .7:
        strength = "strong"
    elif abs(rvalue) >= .5:
        strength = "moderate"
    elif abs(rvalue) >= .3:
        strength = "weak"
    else:
        strength = "very weak"
    
print(f"The r value is: {rvalue}.  This is a {strength} {relationship} relationship")


In [None]:
x_values = project_df['Healthy life expectancy_HS']
y_values = project_df['Social Progress Index_SP']

correlation_matrix = np.corrcoef(x_values, y_values)
correlation_xy = correlation_matrix[0,1]
r_squared = correlation_xy**2

#print(r_squared)
print(f"The R^2 value for Healthy life expectancy is: {r_squared}. Of the happiness indicators, Healthy life expectancy has the second strongest correlation with Social Progress Index score.")


There is a strong positive relationship between Healthy life expectancy and Social Progress Index score: countries with a higher healthy life expectancy tend to have a higher Social Progress Index score as well.

In [None]:
# Add the linear regression equation and line to plot
x_values = project_df['Freedom to make life choices_HS']
y_values = project_df['Social Progress Index_SP']
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
plt.scatter(x_values,y_values, edgecolors= "black")
plt.plot(x_values,regress_values,"r-")
plt.annotate(line_eq,( 0.5, 60),fontsize=15,color="red")

# Labels
plt.title('Freedom to make life choices vs. Social Progress Index Score')
plt.xlabel('Freedom to make life choices')
plt.ylabel('Social Progress Index Score')
plt.grid()
plt.scatter(x_values, y_values, color="black", marker="o")
plt.show()

In [None]:
score = f"The r value is: {rvalue}"
if rvalue == 0:
    relationship = "nonexistent"
    strength = ""
else:
    relationship = "positive" if rvalue > 0 else "negative"
    if abs(rvalue) >= .7:
        strength = "strong"
    elif abs(rvalue) >= .5:
        strength = "moderate"
    elif abs(rvalue) >= .2:
        strength = "weak"
    else:
        strength = "very weak"
    
print(f"The r value is: {rvalue}.  This is a {strength} {relationship} relationship")

There is a weak positive correlation between Freedom to make life choices and Social Progress Index score, so freedom to make life choices is a poor predictor of the Social Progress Index.

In [None]:
# Add the linear regression equation and line to plot
x_values = project_df['Generosity_HS']
y_values = project_df['Social Progress Index_SP']
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
plt.scatter(x_values,y_values, edgecolors= "black")
plt.plot(x_values,regress_values,"r-")
plt.annotate(line_eq,( 0.5, 60),fontsize=15,color="red")

# Labels
plt.title('Generosity vs. Social Progress Index Score')
plt.xlabel('Generosity')
plt.ylabel('Social Progress Index Score')
plt.grid()
plt.scatter(x_values, y_values, color="black", marker="o")
plt.show()

In [None]:
score = f"The r value is: {rvalue}"
if rvalue == 0:
    relationship = "nonexistent"
    strength = ""
else:
    relationship = "positive" if rvalue > 0 else "negative"
    if abs(rvalue) >= .7:
        strength = "strong"
    elif abs(rvalue) >= .5:
        strength = "moderate"
    elif abs(rvalue) >= .1:
        strength = "weak"
    else:
        strength = "very weak"
    
print(f"The r value is: {rvalue}.  This is a {strength} {relationship} relationship")

There is a weak positive relationship between Generosity and Social Progress Index score, so a country's generosity is a poor predictor of its Social Progress Index score.

In [None]:
# Add the linear regression equation and line to plot
x_values = project_df['Perceptions of corruption_HS']
y_values = project_df['Social Progress Index_SP']
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
plt.scatter(x_values,y_values, edgecolors= "black")
plt.plot(x_values,regress_values,"r-")
plt.annotate(line_eq,( 0.5, 60),fontsize=15,color="red")

# Labels
plt.title('Perceptions of corruption vs. Social Progress Index Score')
plt.xlabel('Perceptions of corruption')
plt.ylabel('Social Progress Index Score')
plt.grid()
plt.scatter(x_values, y_values, color="black", marker="o")
plt.show()

In [None]:
score = f"The r value is: {rvalue}"
if rvalue == 0:
    relationship = "nonexistent"
    strength = ""
else:
    relationship = "positive" if rvalue > 0 else "negative"
    if abs(rvalue) >= .7:
        strength = "strong"
    elif abs(rvalue) >= .5:
        strength = "moderate"
    elif abs(rvalue) >= .1:
        strength = "weak"
    else:
        strength = "very weak"
    
print(f"The r value is: {rvalue}.  This is a {strength} {relationship} relationship")

There is also a weak positive relationship between Perceptions of corruption and Social Progress Index score, so perceptions of corruption are a poor predictor of a country's Social Progress Index score.

# Analysis of the Relationship between Population Density and Social Progress / World Happiness

In this section we look at the relationship between a country's population density and its Social Progress Index score, as well as its World Happiness Score. Our hypothesis was that a denser population would likely produce more innovation. Greater innovation would lead to a higher Social Progress Index score, which would in turn lead to a higher World Happiness Score.

In [None]:
# Locate the World Density file
den_file = "Data/population_density.csv"

# Read our World Density data into pandas
density_df = pd.read_csv(den_file, encoding = "ISO-8859-1")
density_df

In [None]:
# Only show density of specific countries
country_df = density_df.loc[density_df['Type'] == 'Country/Area']
rename_country = country_df.rename(columns={'Region, subregion, country or area *':'Country'})

# Break the data out by country and create separate dataframes for years 2015 - 2019
# (copy first so the renames below do not modify a slice of rename_country)
clean_country = rename_country[['Country','2015', '2016', '2017', '2018', '2019']].copy()
clean_country['Country'] = clean_country['Country'].replace({
    'United States of America':'United States',
    'United Republic of Tanzania':'Tanzania',
    'Russian Federation':'Russia',
})
density_2015 = clean_country[['Country', '2015']]
density_2016 = clean_country[['Country', '2016']]
density_2017 = clean_country[['Country', '2017']]
density_2018 = clean_country[['Country', '2018']]
density_2019 = clean_country[['Country', '2019']]
density_2019.head()

In [None]:
#Change each dataframes column label from the year of the datafile to "Population Density"
density_2015 = density_2015.rename(columns={'2015':'Population Density'})
density_2016 = density_2016.rename(columns={'2016':'Population Density'})
density_2017 = density_2017.rename(columns={'2017':'Population Density'})
density_2018 = density_2018.rename(columns={'2018':'Population Density'})
density_2019 = density_2019.rename(columns={'2019':'Population Density'})
density_2019.loc[density_2019['Country'] == 'Russia']

In [None]:
# Create a key column that concatenates the year and the country's name
density_2015['Country & Year'] = ('2015')+density_2015['Country']
density_2016['Country & Year'] = ('2016')+density_2016['Country']
density_2017['Country & Year'] = ('2017')+density_2017['Country']
density_2018['Country & Year'] = ('2018')+density_2018['Country']
density_2019['Country & Year'] = ('2019')+density_2019['Country']

In [None]:
# Concatenate the yearly dataframes above into one combined dataframe for 2015-2019
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
v4 = pd.concat([density_2015, density_2016, density_2017, density_2018, density_2019])
v4
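
The five per-year dataframes and the combined frame could also be produced in one step by reshaping the wide table from one column per year to one row per country and year. The sketch below assumes a tiny made-up table; `wide_df`, `long_df`, and the density values are hypothetical and only mirror the column layout used above.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year
wide_df = pd.DataFrame({
    "Country": ["Norway", "Rwanda"],
    "2015": [10.0, 20.0],  # made-up density values
    "2016": [11.0, 21.0],
})

# Reshape to one row per (country, year) pair
long_df = wide_df.melt(
    id_vars="Country",
    var_name="Year",
    value_name="Population Density",
)

# Build the same kind of merge key as above: year followed by country name
long_df["Country & Year"] = long_df["Year"] + long_df["Country"]
print(long_df)
```

A reshape like this avoids repeating the same rename and key-building lines once per year, and adding another year only requires another column in the source table.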

In [None]:
#Merge the population density dataframe with the complete dataframe from above. Merge on "Country & Year"

new_merge = project_df.copy()  # copy so the key column is not added to project_df itself
new_merge['Country & Year'] = new_merge["Year_HS"].astype(str) + new_merge["Country or region_HS"]

combined_merge = pd.merge(v4, new_merge, on="Country & Year", how='right')
combined_merge

In [None]:
#Pull out a list for population density, happiness score & social progress index
pop_den = combined_merge['Population Density'].astype('float64')
hap_score = combined_merge['Score_HS'].astype('float64')
sp_index = combined_merge['Social Progress Index_SP'].astype('float64')

# Visual Analysis of the top 10 countries from each criteria

If we compare the chart of the 10 densest countries with the Happiness Score chart and the Social Progress Index chart, we find countries that appear in more than one chart.

Comparing the 10 densest countries with the 10 highest Happiness Scores, these countries appear in both:
- The Netherlands
- Israel

Comparing the 10 densest countries with the countries in the Social Progress Index chart, these countries appear in both:
- Rwanda
- India

The Social Progress Index chart is built from the largest `SPI Rank_SP` values. Since rank 1 is the best rank, it shows the lowest-ranked countries rather than the highest scores.

Because some countries appear on both sides of each comparison, it is reasonable to ask whether population density correlates with either score. The regressions in the following sections test this.

In [None]:
#Break each list into a column for country and a column for Density, Happiness Score or Social Progress Index 
den_reduce = combined_merge[["Country", "Population Density"]]
hs_reduce = combined_merge[["Country", "Score_HS"]]
sp_reduce = combined_merge[["Country", "SPI Rank_SP"]]

#Group Density by country
den_group = den_reduce.groupby("Country")
den_mean = den_group.mean()
sort_den = den_mean.sort_values('Population Density', ascending =False)
top_den = sort_den.loc[sort_den['Population Density'] > 270]

#Group Happiness by country
hs_group = hs_reduce.groupby("Country")
hs_mean = hs_group.mean()
sort_hs = hs_mean.sort_values('Score_HS', ascending =False)
top_hs = sort_hs.loc[sort_hs['Score_HS'] > 7.1421]

#Group Social Progress Index by country
sp_group = sp_reduce.groupby("Country")
sp_mean = sp_group.mean()
sort_sp = sp_mean.sort_values('SPI Rank_SP', ascending =False)
top_sp = sort_sp.loc[sort_sp['SPI Rank_SP'] > 114]

#Plot the 10 largest Density values by Country
top_den = top_den.plot(kind='bar', figsize=(10,10))

#Label the graph
plt.title('10 Densest Countries in the World')
plt.ylabel('Population Density (persons per square km)')
plt.xlabel('Country Name')

plt.grid()
plt.show()

In [None]:
#Plot the 10 largest Happiness Score values by Country
top_hs = top_hs.plot(kind='bar', figsize=(10,10))

#Labels for the graph
plt.title('Top 10 Happiness Scores by Country')
plt.ylabel('Happiness Score')
plt.xlabel('Country Name')

plt.grid()
plt.show()

In [None]:
# Plot the largest average SPI rank values by Country
# (rank 1 is the best rank, so these are the lowest-ranked countries)
top_sp = top_sp.plot(kind='bar', figsize=(10,10))

#Labels for the graph
plt.title('10 Lowest-Ranked Countries by Social Progress Index')
plt.ylabel('Average Social Progress Index Rank')
plt.xlabel('Country Name')


plt.grid()
plt.show()
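
The cells above pick the top countries with hard-coded cutoffs (for example `> 270` or `> 7.1421`), which only yield ten rows for this particular dataset. One possible alternative is `Series.nlargest`, which asks for a fixed number of rows directly. The sketch below uses made-up data; `demo_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical yearly scores for three countries
demo_df = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C", "C"],
    "Score_HS": [7.0, 7.2, 5.0, 5.4, 6.0, 6.2],
})

# Average per country, then keep the n highest averages
mean_scores = demo_df.groupby("Country")["Score_HS"].mean()
top_two = mean_scores.nlargest(2)
print(top_two)
```

With the real data, `nlargest(10)` would return ten countries no matter how the scores are distributed, and `nsmallest(10)` would do the same for a rank column where lower is better.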

# Observations: Population Density Vs. Happiness Score

In [None]:
#Plot a scatterplot of Population Density vs. Happiness score. Print the r value at the bottom of the graph

slope, intercept, rvalue, pvalue, stderr = linregress(pop_den, hap_score)
regress_values = pop_den * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
plt.scatter(pop_den, hap_score, facecolors = 'black', edgecolors = 'black', s = 40)
plt.plot(pop_den,regress_values,"r-")
plt.annotate(line_eq,(150,4.5),fontsize=15,color="red")
plt.title("Population Density Vs Happiness Score")
plt.xlabel("Population Density (persons per square km)")
plt.ylabel("Happiness Score")
plt.grid()
plt.show()
print(f" The r Value is:{rvalue}")

As the graph shows, most of the dots fall between x values of 0 and 100, and within that range most lie between y values of 4 and 7. Visually, I would hypothesize that, for the countries we compared, a country's population density has little relationship with its Happiness Score. The dataset's r value of -0.076 supports that: it indicates almost no linear relationship.

# Observations: Population Density Vs. Social Progress Index Score

In [None]:
#Plot a scatterplot of Population Density vs. Social Progress Index score. Print the r value at the bottom of the graph

slope, intercept, rvalue, pvalue, stderr = linregress(pop_den, sp_index)
regress_values = pop_den * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
plt.scatter(pop_den, sp_index, facecolors = 'black', edgecolors = 'black', s = 40)
plt.plot(pop_den,regress_values,"r-")
plt.annotate(line_eq,(300,60),fontsize=15,color="red")
plt.title("Population Density Vs Social Progress Score")
plt.xlabel("Population Density (persons per square km)")
plt.ylabel("Social Progress Score")
plt.grid()
plt.show()
print(f" The r Value is:{rvalue}")

As the graph shows, most of the dots fall between x values of 0 and 100, and within that range most lie between y values of 50 and 90. Visually, I would hypothesize that, for the countries we compared, a country's population density has little relationship with its Social Progress Index score. The dataset's r value of -0.117 supports that: it indicates at most a very weak negative linear relationship.