# Gallup Analysis Notebook

Author: Jade Bullock

Date: 06/04/2024

This Jupyter Notebook performs exploratory data analysis and modeling on the merged Gallup Global dataset, with the aim of identifying which factors best predict national happiness (as measured by the
World Happiness Report's "Happiness Index").

## Main Steps:
-----------
1. Load the cleaned and merged Gallup dataset.
2. Visualize correlations via heatmaps and bar charts.
3. Explore individual variable relationships using scatter plots.
4. Fit and compare two predictive models:
    - Linear Regression
    - Random Forest Regressor
5. Analyze and interpret feature importance from both models.
6. Evaluate model accuracy using R² and RMSE.
7. Highlight key insights, including expected and surprising findings (e.g. smiled_yes behavior).

## Outputs:
--------
- Heatmap and bar plots of feature correlations
- Scatter plots of happiness vs top features
- Ranked feature importances (coefficients and tree-based)
- Model accuracy summary (R², RMSE)
- A written summary of predictive insights

## Note:
-----
The analysis excludes `*_no` variables to avoid multicollinearity and focuses only
on `*_yes` indicators or scaled scores (e.g. law and order).

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

### 1. Load the cleaned and merged Gallup dataset.

In [None]:
# Load Gallup merged dataset
gallup_df = pd.read_csv("../data/clean/gallup_merge.csv")

### 2. Visualize correlations via heatmaps and bar charts.

In [None]:
# Select only numeric columns for correlation analysis
numeric_data = gallup_df.select_dtypes(include='number')

# Compute the correlation matrix
corr_matrix = numeric_data.corr()

# Set up the matplotlib figure
plt.figure(figsize=(14, 10))

# Generate a heatmap using seaborn
sns.heatmap(
    corr_matrix,
    annot=True,          # show correlation coefficients
    fmt=".2f",           # format to 2 decimal places
    cmap="coolwarm",     # color map
    square=True,
    cbar_kws={"shrink": 0.8}
)

# Add title and adjust layout
plt.title("Correlation Heatmap of Gallup Indicators (Including Happiness Index)", fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()

# Save the figure
plt.savefig("visuals/gallup_correlation_heatmap.png", dpi=300, bbox_inches='tight')
print("Heatmap saved to: visuals/gallup_correlation_heatmap.png")

# Show the plot
plt.show()

In [None]:
# Extract correlations with 'Happiness Index' only
target = "Happiness Index"
correlation_df = corr_matrix[[target]].drop(index=target).reset_index()
correlation_df.columns = ["Indicator", "Correlation with Happiness"]
correlation_df = correlation_df.sort_values("Correlation with Happiness", ascending=False)


# Create bar chart of correlation results
plt.figure(figsize=(12, 6))
sns.barplot(
    data=correlation_df,
    x="Correlation with Happiness",
    y="Indicator", hue="Indicator",
    palette="coolwarm"
)

plt.title("Correlation of Indicators with Happiness Index")
plt.xlabel("Pearson Correlation")
plt.ylabel("Indicator")
plt.axvline(0, color='gray', linestyle='--')  # zero line for reference
plt.tight_layout()

# Save the chart
plt.savefig("visuals/happiness_correlation_barplot.png", dpi=300, bbox_inches="tight")
print("Bar chart saved to: visuals/happiness_correlation_barplot.png")

# Show the plot
plt.show()


### 3. Explore individual variable relationships against Happiness Index using scatter plots.

In [None]:
# Define key variables to plot against the Happiness Index
key_variables = [
    "sadness_yes",
    "enjoyment_yes",
    "respect_yes",
    "pain_yes",
    "PERCENTAGE_Safety",
    "SCORE_law_order"
]

# Create a 2x3 grid of scatter plots
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 10))
axes = axes.flatten()

for i, var in enumerate(key_variables):
    sns.scatterplot(
        data=gallup_df,
        x=var,
        y="Happiness Index",
        ax=axes[i],
        edgecolor="w"
    )
    axes[i].set_title(f"Happiness Index vs. {var}", fontsize=12)
    axes[i].set_xlabel(var)
    axes[i].set_ylabel("Happiness Index")

# Layout adjustment
plt.tight_layout()

# Save to file
plt.savefig("visuals/happiness_scatter_plots1.png", dpi=300, bbox_inches="tight")
print("Scatter plots saved to: visuals/happiness_scatter_plots1.png")

# Show the plots
plt.show()

In [None]:
# Define key variables to plot against the Happiness Index
key_variables = [
    "anger_yes",
    "learned_yes",
    "worry_yes",
    "well-rested_yes",
    "smiled_yes",
    "stress_yes"
]

# Create a 2x3 grid of scatter plots
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 10))
axes = axes.flatten()

for i, var in enumerate(key_variables):
    sns.scatterplot(
        data=gallup_df,
        x=var,
        y="Happiness Index",
        ax=axes[i],
        edgecolor="w"
    )
    axes[i].set_title(f"Happiness Index vs. {var}", fontsize=12)
    axes[i].set_xlabel(var)
    axes[i].set_ylabel("Happiness Index")

# Layout adjustment
plt.tight_layout()

# Save to file
plt.savefig("visuals/happiness_scatter_plots2.png", dpi=300, bbox_inches="tight")
print("Scatter plots saved to: visuals/happiness_scatter_plots2.png")

# Show the plots
plt.show()

### 4. Fit and compare two predictive models

Use regression to understand which Gallup indicators are most predictive of a country's Happiness Index.
Since the `_yes` and `_no` pairs are strongly negatively correlated, as expected (e.g. enjoyment_yes vs enjoyment_no), only one value from each pair is used in the regression. This prevents multicollinearity from distorting the results.

- Drop non-numeric columns (like Country) since they can't be used in regression.
- Drop rows with missing values to keep the analysis simple and clean.
- Standardise the features with StandardScaler so they all share the same scale (mean = 0, std = 1).
- Train a LinearRegression model using the cleaned data.
- Read the fitted coefficients, which show how much each feature contributes to the predicted Happiness Index when the other features are held fixed.
- Interpret the coefficients of the linear model:
    - Positive coefficients = variables associated with higher happiness;
    - Negative coefficients = variables associated with lower happiness;
    - Larger absolute value = stronger association (comparable across features only because they are standardised).
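
To illustrate why only one column from each `_yes`/`_no` pair is kept, here is a minimal sketch on synthetic data. The numbers, the sample size, and the small "don't know" share are assumptions made up for illustration; they are not taken from the Gallup dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 120  # hypothetical number of countries

# Synthetic survey shares in percent: "yes", a small "don't know" share,
# and "no" as whatever remains out of 100
yes = rng.uniform(40, 90, size=n)
dont_know = rng.uniform(0, 3, size=n)
demo = pd.DataFrame({"enjoyment_yes": yes, "enjoyment_no": 100 - yes - dont_know})

# The pair carries almost the same information with opposite sign
pair_corr = demo["enjoyment_yes"].corr(demo["enjoyment_no"])
print(f"Correlation between the pair: {pair_corr:.3f}")

# Keep only the *_yes column, mirroring the filter used in this notebook
kept = demo.drop(columns=[c for c in demo.columns if c.endswith("_no")])
print(list(kept.columns))
```

When two columns are this close to being mirror images, a linear model cannot separate their contributions reliably, so dropping one of them loses very little information.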


In [None]:
# Clean and Prepare the Data

df_filtered = (
    gallup_df
    .dropna()  # Remove rows with missing values
    .drop(columns=[col for col in gallup_df.columns if col.endswith("_no")])  # Drop '_no' columns
)
X = df_filtered.drop(columns=["Country", "Happiness Index"])  # Drop the target and the non-numeric Country column
y = df_filtered["Happiness Index"]  # Target variable

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

### Linear Model

In [None]:
# Fit the Linear model
lin_model = LinearRegression()
lin_model.fit(X_scaled, y)

# Extract and rank feature importance
coefficients = pd.Series(lin_model.coef_, index=X.columns)
coeff_df = coefficients.sort_values(ascending=False).reset_index()
coeff_df.columns = ["Feature", "Predictive Strength (Coefficient)"]
print(coeff_df)

# Visualize the coefficients of all features
plt.figure(figsize=(12, 6))
sns.barplot(data=coeff_df, x="Predictive Strength (Coefficient)", y="Feature", hue="Feature", palette="coolwarm")
plt.title("Most Predictive Features of Happiness (Linear Regression)")
plt.axvline(0, color='gray', linestyle='--')
plt.tight_layout()

# Save to file
plt.savefig("visuals/Linear_reg_predictive_features.png", dpi=300, bbox_inches="tight")
print("Plot saved to: visuals/Linear_reg_predictive_features.png")
plt.show()


- enjoyment_yes: +0.48. The most powerful positive predictor. More enjoyment → more happiness.
- smiled_yes: -0.47. Unexpected! The scatter plot shows a positive relationship with the Happiness Index, yet the regression assigns a strong negative coefficient. This suggests a problem in the regression, such as multicollinearity, rather than a genuinely negative effect.
- learned_yes: +0.39. Learning something interesting is highly linked to happiness.
- sadness_yes: -0.36. Strongly associated with unhappiness.
- SCORE_law_order: Trust in law and order is linked to higher national happiness.
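
One common way to check whether correlated predictors are behind a surprising coefficient sign is the variance inflation factor (VIF). The sketch below uses synthetic stand-ins for three columns; the data and the strength of the relationship between the two "positive emotion" features are assumptions for illustration only. A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a warning sign.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200  # hypothetical sample size

# Synthetic stand-ins: two closely related features and one unrelated feature
enjoyment = rng.normal(size=n)
smiled = enjoyment + 0.3 * rng.normal(size=n)
safety = rng.normal(size=n)
features = pd.DataFrame(
    {"enjoyment_yes": enjoyment, "smiled_yes": smiled, "PERCENTAGE_Safety": safety}
)

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry
# of the inverse of the feature correlation matrix
corr = features.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=features.columns)
print(vif.round(2))
```

Applied to the real data, the same calculation on `X` would show which features overlap enough to make their individual coefficients unstable.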


### Random forest (RandomForestRegressor)


In [None]:
# Fit Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_scaled, y)

# Get feature importances
importances = rf_model.feature_importances_
rf_importance_df = pd.Series(importances, index=X.columns).sort_values(ascending=False).reset_index()
rf_importance_df.columns = ["Feature", "RF_Importance"]

# Print top features
print("\nTop Predictors (Random Forest):")
print(rf_importance_df.head(10))

# Visualize feature importances
plt.figure(figsize=(10, 6))
sns.barplot(data=rf_importance_df.head(10), x="RF_Importance", y="Feature", hue="Feature", palette="viridis")
plt.title("Top 10 Predictors of Happiness (Random Forest)")
plt.tight_layout()

# Save to file
plt.savefig("visuals/random_forest_predictive_features.png", dpi=300, bbox_inches="tight")
print("Feature importance plot saved to: visuals/random_forest_predictive_features.png")
plt.show()

- pain_yes: 36.5%. Most important predictor. Importance scores carry no sign; the negative direction (more pain, lower happiness) comes from the correlation analysis.
- respect_yes: 15.6%. Feeling respected plays a large role in perceived well-being.
- SCORE_law_order: 10.0%. Trust in institutions and rule of law is strongly tied to happiness.
- enjoyment_yes: 7.6%. Still a strong predictor, consistent with the linear model.
- learned_yes: 6.7%. Learning contributes meaningfully to happiness.
- smiled_yes: 4.9%. Less dominant than in the linear model, but still relevant.
- sadness_yes: 4.3%. Tied to lower happiness in the correlation analysis.
- PERCENTAGE_Safety: 3.9%. Suggests that personal safety is part of happiness.
- well-rested_yes: 3.2%. Sleep and rest affect well-being.
- stress_yes: 2.8%. Stress still matters, though it is not dominant.
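
The importances above are impurity-based, which means they are computed on the training data and can be skewed when features are correlated. A possible cross-check is permutation importance on held-out data. The sketch below uses synthetic data with hypothetical feature names (`f0`, `f1`, `f2`), so the numbers it prints say nothing about the Gallup indicators.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300  # hypothetical sample size

# Synthetic data: f0 drives the target strongly, f1 weakly, f2 not at all
X_demo = rng.normal(size=(n, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature in the held-out set and measure the drop in R²
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
for name, mean, std in zip(["f0", "f1", "f2"], result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

If the two kinds of importance rank the real features similarly, that adds some confidence to the ranking; large disagreements would be worth investigating.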


### 5. Analyze and interpret feature importance from both models.
Compare Linear vs. Random Forest Rankings

In [None]:
# Merge and compare
comparison_df = pd.merge(coeff_df, rf_importance_df, on="Feature")
comparison_df = comparison_df.sort_values("RF_Importance", ascending=False)
print(comparison_df)

In [None]:
# Calculate univariate Pearson correlation for each feature with Happiness Index
correlations = {
    feature: df_filtered[feature].corr(df_filtered["Happiness Index"])
    for feature in comparison_df["Feature"]
}

# Convert to DataFrame
correlation_df = pd.DataFrame.from_dict(correlations, orient="index", columns=["Pearson Correlation"])
correlation_df.reset_index(inplace=True)
correlation_df.rename(columns={"index": "Feature"}, inplace=True)

# Drop existing Pearson Correlation columns before merging (just in case)
comparison_df = comparison_df.drop(
    columns=[col for col in comparison_df.columns if "Pearson Correlation" in col],
    errors="ignore"
)

# Merge correlation back in
comparison_df = comparison_df.merge(correlation_df, on="Feature")

# Sort again by Random Forest importance
comparison_df = comparison_df.sort_values("RF_Importance", ascending=False)

# Show results
print(comparison_df.to_string(index=False))
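
As a rough summary of how far the two rankings agree, the rank correlation between coefficient magnitude and forest importance can be computed. The sketch below uses only the four features for which this notebook's written notes quote both a coefficient and an importance, so it is an illustration on a small subset rather than a full comparison.

```python
import pandas as pd

# Values quoted in this notebook's written notes; only features with both numbers are included
notes = pd.DataFrame(
    {
        "Feature": ["enjoyment_yes", "smiled_yes", "learned_yes", "sadness_yes"],
        "Coefficient": [0.48, -0.47, 0.39, -0.36],
        "RF_Importance": [0.076, 0.049, 0.067, 0.043],
    }
)

# Rank by coefficient magnitude and by forest importance (1 = strongest)
notes["Linear_Rank"] = notes["Coefficient"].abs().rank(ascending=False)
notes["RF_Rank"] = notes["RF_Importance"].rank(ascending=False)

# Spearman's rho is the Pearson correlation of the two rank columns
rank_agreement = notes["Linear_Rank"].corr(notes["RF_Rank"])
print(notes.to_string(index=False))
print(f"Spearman rank correlation: {rank_agreement:.2f}")
```

The same two `rank` calls could be applied to `comparison_df` to cover every feature.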



### 6. Evaluate model accuracy using R² and RMSE

In [None]:
# Cross-validate R^2 scores
lin_r2 = cross_val_score(lin_model, X_scaled, y, cv=5, scoring='r2')
rf_r2 = cross_val_score(rf_model, X_scaled, y, cv=5, scoring='r2')

# Train/Test RMSE
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Linear model
lin_model.fit(X_train, y_train)
y_pred_lin = lin_model.predict(X_test)
lin_rmse = np.sqrt(mean_squared_error(y_test, y_pred_lin))

# Random Forest model
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
rf_rmse = np.sqrt(mean_squared_error(y_test, y_pred_rf))

# Pearson correlation of smiled_yes with the target (referenced in the summary)
smile_corr = df_filtered["smiled_yes"].corr(df_filtered["Happiness Index"])

# Generate results
print("\nModel Accuracy Comparison")
print("-" * 30)
print(f"Linear Regression R² (mean CV): {lin_r2.mean():.3f}")
print(f"Random Forest R² (mean CV):    {rf_r2.mean():.3f}")
print(f"Linear Regression RMSE:        {lin_rmse:.3f}")
print(f"Random Forest RMSE:            {rf_rmse:.3f}")
print(f"smiled_yes vs Happiness Index (Pearson r): {smile_corr:.3f}")
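
The evaluation above scales the features once on the full dataset and reports RMSE from a single train/test split. A possible variant, sketched below on synthetic data, puts the scaler inside a pipeline so it is refit on each training fold, and reports both R² and RMSE from the same shuffled folds. The data, fold settings, and variable names here are assumptions for illustration. For ordinary least squares and tree-based models the difference is likely negligible, because their predictions do not depend on feature scaling; the pattern matters more once regularisation or feature selection is added.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 150  # hypothetical sample size

# Synthetic percentage-like features and a target driven by the first two
X_demo = rng.normal(loc=50, scale=10, size=(n, 4))
y_demo = 0.05 * X_demo[:, 0] - 0.03 * X_demo[:, 1] + rng.normal(scale=0.3, size=n)

# The scaler is fit on the training part of each fold only
pipeline = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    pipeline, X_demo, y_demo, cv=cv, scoring=("r2", "neg_root_mean_squared_error")
)

r2_mean = scores["test_r2"].mean()
rmse_mean = -scores["test_neg_root_mean_squared_error"].mean()  # scorer returns negative RMSE
print(f"Mean CV R²:   {r2_mean:.3f}")
print(f"Mean CV RMSE: {rmse_mean:.3f}")
```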


### 7. Summary and key insights

Top Predictive Indicators (Across Linear & Random Forest Models)
- Enjoyment (enjoyment_yes): The strongest positive predictor in the linear model, and also ranked among the top four features by the Random Forest.
- Respect (respect_yes): Feeling respected is highly predictive of higher happiness. Social dignity matters.
- Pain (pain_yes): Negatively correlated with happiness and the most important feature in the Random Forest, highlighting how physical/emotional suffering impacts well-being.
- Learning (learned_yes): Associated with increased happiness, indicating the role of engagement and intellectual stimulation.
- Law & Order (SCORE_law_order): Trust in legal systems and safety correlates strongly with well-being.

Unexpected Findings
- smiled_yes had a positive correlation with happiness (as expected), but a negative coefficient in the linear model. This likely indicates:
    - Multicollinearity (strong correlation with enjoyment_yes or other positive emotions),
    - A suppression effect, where its true impact is masked when modeled jointly with correlated variables.
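
The sign flip described above can be reproduced on synthetic data. In the sketch below, the variable names echo the notebook's columns, but the data-generating process and its weights are assumptions chosen to show the effect; they are not estimates from the Gallup data.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500  # hypothetical sample size

# "smiled" is mostly a noisy copy of "enjoyment"
enjoyment = rng.normal(size=n)
smiled = enjoyment + 0.3 * rng.normal(size=n)

# Target built with a positive weight on enjoyment and a negative weight on smiled
happiness = 2.0 * enjoyment - 0.8 * smiled + 0.2 * rng.normal(size=n)

# On its own, smiled is positively correlated with the target
marginal_corr = np.corrcoef(smiled, happiness)[0, 1]

# Modeled jointly with enjoyment, its coefficient is negative
design = np.column_stack([np.ones(n), enjoyment, smiled])
coefs, *_ = np.linalg.lstsq(design, happiness, rcond=None)
joint_coef_smiled = coefs[2]

print(f"Marginal correlation of smiled with target: {marginal_corr:.2f}")
print(f"Coefficient of smiled in joint regression:  {joint_coef_smiled:.2f}")
```

A positive univariate correlation and a negative joint coefficient are therefore not contradictory: the coefficient describes the effect of `smiled` once `enjoyment` is held fixed.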