In [None]:
from google.colab import drive
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
drive.mount("/content/drive")

In [None]:
df = pd.read_csv("/content/drive/MyDrive/207R Files/2016.csv")

In [None]:
df.head(10)

In [None]:
df.tail(10)

In [None]:
df.info()

##High Level View

This dataset describes the happiness of citizens in countries around the world, with one row (data sample) per country. It quantifies happiness with a Happiness Score and a Happiness Rank, and it includes explanatory variables such as GDP per capita, life expectancy, freedom, and generosity. It also has the Dystopia Residual metric, which compares each country to a hypothetical dystopia with the least happy people in the world.

In [None]:
# No null values in any column
df.isnull().sum()

In [None]:
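plt.figure(figsize=(15,10))
# Pass the frame via data= and keep only numeric columns;
# Happiness Rank is dropped because its scale would dwarf the other variables
sns.boxplot(data=df.drop("Happiness Rank", axis=1).select_dtypes(include="number"))
plt.xticks(fontsize=5)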

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
sns.regplot(x="Happiness Score", y="Economy (GDP per Capita)", data=df, ax=axes[0,0])
sns.regplot(x="Happiness Score", y="Freedom", data=df, ax=axes[0,1])
sns.regplot(x="Happiness Score", y="Health (Life Expectancy)", data=df, ax=axes[1,0])
sns.regplot(x="Happiness Score", y="Trust (Government Corruption)", data=df, ax=axes[1,1])

In [None]:
df.describe()

In [None]:
print(df[df["Generosity"] == 0]["Country"])
df[df["Country"] == "Greece"].head()
# print(df[df["Freedom"] == 0]["Country"])
# print(df[df["Trust (Government Corruption)"] == 0]["Country"])
# print(df[df["Family"] == 0]["Country"])
# print(df[df["Economy (GDP per Capita)"] <= .01]["Country"])
# print(df[df["Health (Life Expectancy)"] <= .01]["Country"])

#Preliminary Exploration

There aren't any null values in the dataset, and the potential outliers where variables like Freedom and Generosity equal zero seem to be legitimate. The dataset therefore doesn't appear to have significant quality issues and needs little pre-processing. One transformation we could apply is to subtract the constant value from the Dystopia Residual to get the actual residual values, but this isn't necessary.


Judging by the regression plots, the plotted variables all seem to be associated with Happiness Score. On initial inspection, Economy and Health have the strongest linear relationships, while Freedom and Trust have the weakest. The distributions of Economy, Family, Health, and Freedom are left-skewed, indicating that a few countries with very low scores in these categories pull the distributions downward.
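
The skew described above can also be checked numerically rather than by eye. The following is a minimal sketch, assuming the column names used earlier in this notebook; `skew_report` is a hypothetical helper, not part of pandas. Negative values indicate a left skew.

```python
import pandas as pd

def skew_report(frame, columns):
    """Return the sample skewness of each requested column, most negative first."""
    return frame[columns].skew().sort_values()

# Hypothetical usage with the notebook's dataframe:
# skew_report(df, ["Economy (GDP per Capita)", "Family",
#                  "Health (Life Expectancy)", "Freedom"])
```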

#Objectives

1. What is the relationship between generosity and happiness score?

2. Which regions of the world have the highest/lowest happiness scores?

3. If we split the dataset in half based on happiness rank, do the relationships between any of the variables and the happiness score change?

#Objective 1
We are using a scatter plot of the Happiness Score and Generosity to see if there are any potential relationships between the two variables. We are also using a heatmap to see the strength of correlation between the two variables in numerical terms.

In [None]:
sns.scatterplot(x="Happiness Score", y="Generosity", data=df).set_title("Scatterplot of Happiness Score and Generosity")

In [None]:
corr = df.drop(["Country", "Region", "Lower Confidence Interval", "Upper Confidence Interval"], axis=1).corr()
sns.heatmap(corr, annot=True, fmt=".1f")

#Objective 1 Conclusion

Generosity does not seem to have a strong linear relationship with Happiness Score, based on visual inspection of the scatterplot and the correlation coefficient of about 0.2 in the heatmap.
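
Because the heatmap rounds to one decimal place, it may help to compute the coefficient directly. The following is a minimal sketch: `correlation_pair` is a hypothetical helper that returns the Pearson coefficient together with a Spearman coefficient, the latter computed as the Pearson correlation of the ranks so that a monotonic but non-linear relationship would also show up.

```python
import pandas as pd

def correlation_pair(frame, x, y):
    """Return (pearson, spearman) correlation between columns x and y."""
    pearson = frame[x].corr(frame[y])                 # linear association
    spearman = frame[x].rank().corr(frame[y].rank())  # Pearson applied to ranks
    return pearson, spearman

# Hypothetical usage with the notebook's dataframe:
# correlation_pair(df, "Happiness Score", "Generosity")
```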

#Objective 2

To answer this question, we drop the Country column, group the rows by the Region column, and take the mean of each group.

In [None]:
region_df = df.drop("Country", axis=1).groupby("Region").mean().sort_values("Happiness Score", ascending=False)
region_df.info()
region_df.head(10)
# Region is the index after groupby, so list the regions via the index
#print(region_df.index.unique())

In [None]:
plt.figure(figsize=(10, 6))
# Region is the index of region_df, so reset it to expose it as a column
ax = sns.barplot(x="Region", y="Happiness Score", data=region_df.reset_index(), palette="viridis")
ax.set_title("Mean Happiness Score by Region")
plt.xlabel("Region")
plt.ylabel("Mean Happiness Score")
plt.xticks(rotation=45, ha="right", fontsize=8)
plt.show()

#Objective 2 Conclusion

I decided to use a bar plot of the mean Happiness Score by region to show how the regions rank. The Australia and New Zealand region has the highest mean Happiness Score, while Sub-Saharan Africa has the lowest. Predictably, the regional means for Economy and Health seem to track the rank of the region very closely.
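
Regional means can rest on very different numbers of countries, so it can help to view the count and spread next to each mean. The following is a minimal sketch, assuming the `Region` and `Happiness Score` columns used above; `region_summary` is a hypothetical helper.

```python
import pandas as pd

def region_summary(frame, group_col="Region", value_col="Happiness Score"):
    """Mean, standard deviation, and row count per group, highest mean first."""
    summary = frame.groupby(group_col)[value_col].agg(["mean", "std", "count"])
    return summary.sort_values("mean", ascending=False)

# Hypothetical usage with the notebook's dataframe:
# region_summary(df)
```

A region with a small count deserves a more cautious reading of its mean than a region with many countries.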

#Objective 3

I split the dataset into halves by happiness rank and created a correlation heatmap for each half, as well as a heatmap of the difference between the two correlation matrices. The goal was to see whether any variable relates to Happiness Score differently in countries with higher happiness scores than in countries with lower happiness scores.

In [None]:
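# Sort by rank explicitly so the first half always holds the happier countries
df_sorted = df.sort_values("Happiness Rank")
half = len(df_sorted) // 2
df_first_half = df_sorted.iloc[:half]
df_second_half = df_sorted.iloc[half:]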
df_first_half.info()
#df_first_half.head(10)
df_second_half.info()
df_second_half.head(10)

In [None]:
df_1_corr = df_first_half.drop(["Country", "Region", "Lower Confidence Interval", "Upper Confidence Interval"], axis=1).corr()
sns.heatmap(df_1_corr, annot=True, fmt=".2f")

In [None]:
df_2_corr = df_second_half.drop(["Country", "Region", "Lower Confidence Interval", "Upper Confidence Interval"], axis=1).corr()
sns.heatmap(df_2_corr, annot=True, fmt=".2f")

In [None]:
df_corr_diff = df_1_corr - df_2_corr
sns.heatmap(df_corr_diff, annot=True, fmt=".2f").set_title("Difference in Correlation (First Half - Second Half)")

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 8))
sns.scatterplot(x="Happiness Score", y="Trust (Government Corruption)", data=df_first_half, ax=axes[0]
  ).set_title("Scatterplot of Happiness Score and Trust--First Half", fontsize=10)
axes[0].set_xticks([1,2,3,4,5,6,7,8])
axes[0].set_yticks([0.1,0.2,0.3,0.4,0.5,0.6,0.7])

sns.scatterplot(x="Happiness Score", y="Trust (Government Corruption)", data=df_second_half,ax=axes[1]
  ).set_title("Scatterplot of Happiness Score and Trust--Second Half", fontsize=10)
axes[1].set_xticks([1,2,3,4,5,6,7,8])
axes[1].set_yticks([0.1,0.2,0.3,0.4,0.5,0.6,0.7])

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 8))
sns.scatterplot(x="Happiness Score", y="Generosity", data=df_first_half, ax=axes[0]
  ).set_title("Scatterplot of Happiness Score and Generosity--First Half", fontsize=10)
axes[0].set_xticks([1,2,3,4,5,6,7,8])
axes[0].set_yticks([0.1,0.2,0.3,0.4,0.5,0.6,0.7])

sns.scatterplot(x="Happiness Score", y="Generosity", data=df_second_half,ax=axes[1]
  ).set_title("Scatterplot of Happiness Score and Generosity--Second Half", fontsize=10)
axes[1].set_xticks([1,2,3,4,5,6,7,8])
axes[1].set_yticks([0.1,0.2,0.3,0.4,0.5,0.6,0.7])

#Objective 3 Conclusion
The difference between the correlation heatmaps seemed to indicate that Trust and Generosity relate to Happiness Score differently in the first half of the dataset than in the second half. I then plotted these variables against Happiness Score in four scatterplots, two for each half, but the difference was not clear in the plots. I'm unsure whether these variables really relate to Happiness Score differently across the two halves, and I would probably need to do some hypothesis testing to get a better answer.
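
One possible form of that hypothesis test is Fisher's z-test for the difference between two independent correlation coefficients. The sketch below is an assumption about how such a test could be set up, not something run in this notebook. It relies on approximate bivariate normality, and splitting on happiness rank restricts the range of Happiness Score within each half, which can itself change the correlations. `fisher_z_test` is a hypothetical helper.

```python
import math

def fisher_z_test(r1, n1, r2, n2):
    """Two-sided test of H0: two independent samples share the same correlation.

    r1, r2 are Pearson coefficients; n1, n2 are the sample sizes (each > 3).
    Returns (z statistic, two-sided p-value).
    """
    z1 = math.atanh(r1)  # Fisher transform of each coefficient
    z2 = math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of z1 - z2
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p

# Hypothetical usage with the halves defined above:
# r1 = df_first_half["Happiness Score"].corr(df_first_half["Generosity"])
# r2 = df_second_half["Happiness Score"].corr(df_second_half["Generosity"])
# z, p = fisher_z_test(r1, len(df_first_half), r2, len(df_second_half))
```

A small p-value would suggest that the two correlations differ by more than sampling noise alone would explain, subject to the caveats above.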



#Potential Ethical Issue:

One ethical issue is whether these numbers are all accurately reported. For example, a country may have a vested interest in boosting its happiness score for optics, so it might report higher numbers on metrics like generosity or trust to earn a higher happiness rank than other countries. There is precedent for this kind of behavior: during the COVID pandemic, some countries underreported cases to make the situation seem better than it was.
