#### Notebook Summary: Happiness Score Prediction (2015–2024)

This notebook performs predictive analysis using the World Happiness dataset from 2015 to 2024.

1. **Use data from 2015–2024** to predict the **Happiness Score** of countries.
2. **Create lag features** for key factors (e.g., Economy, Family, Health) to capture the effect of previous-year values.
3. **Keep all rows, including 2015,** by filling missing lag values with same-year values. This affects each country's first available row, which has no earlier value to lag from.
4. **Train a Linear Regression model** using data from 2015 to 2022.
5. **Test the model** using data from 2023 and 2024.
6. **Evaluate model performance** using:
   - R² Score (model fit)
   - Mean Squared Error (prediction accuracy)

---


In [5]:
import pandas as pd

# Load the cleaned, winsorized dataset
df = pd.read_csv('../data/cleaned/winsorized_df_all.csv')

In [6]:
print(df.columns)

Index(['Country', 'Happiness Rank', 'Happiness Score', 'Economy', 'Family',
       'Healthy life expectancy', 'Freedom to make life choices',
       'Perceptions of corruption', 'Generosity', 'Continent', 'Year',
       'Happiness Rank (Cleaned Data)'],
      dtype='object')


In [3]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler


In [7]:
# Sort the dataset by Country and Year
df = df.sort_values(by=['Country', 'Year'])

In [9]:
# List of numeric features to create lag features for
num_features = [
    'Economy', 'Family', 'Healthy life expectancy',
    'Freedom to make life choices', 'Generosity', 'Perceptions of corruption'
]

In [10]:
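# Create lag features: the value from each country's previous row.
# Rows are sorted by Country and Year above, so this is the previous
# available year, which is not always Year - 1 if a country has gaps.
for col in num_features:
    lag_col = f'{col}_lag1'
    df[lag_col] = df.groupby('Country')[col].shift(1)
    # A country's first row (2015, or the first year it appears) has no
    # previous value; fall back to the same-year value so the row is kept.
    df[lag_col] = df[lag_col].fillna(df[col])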
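
The lag step above can be checked on a small toy frame. The sketch below uses made-up values and two hypothetical helper columns (`lag_was_missing`, `lag_gap_years`) that are not part of the notebook; they only make it visible which rows were filled with same-year values and where the "previous row" is more than one year back.

```python
import pandas as pd

# Toy data (hypothetical values): country B has no 2016 row.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "Year":    [2015, 2016, 2017, 2015, 2017],
    "Economy": [1.0, 1.1, 1.2, 0.5, 0.7],
}).sort_values(["Country", "Year"])

# Previous row's value within each country.
toy["Economy_lag1"] = toy.groupby("Country")["Economy"].shift(1)

# Record which lags were missing before filling them with same-year values.
toy["lag_was_missing"] = toy["Economy_lag1"].isna()
toy["Economy_lag1"] = toy["Economy_lag1"].fillna(toy["Economy"])

# Years between a row and the row its lag came from (NaN for first rows).
toy["lag_gap_years"] = toy.groupby("Country")["Year"].diff()

print(toy)
```

In this toy frame, the 2017 row for country B takes its lag from 2015, two years earlier. Whether such gaps exist in the real dataset is something to check rather than assume.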


In [11]:
# Encode 'Continent' column (one-hot encoding)
df = pd.get_dummies(df, columns=['Continent'], drop_first=True)

# Update continent columns after encoding
continent_cols = [col for col in df.columns if col.startswith('Continent_')]
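
As a quick illustration of what `drop_first=True` does, the sketch below encodes a made-up `Continent` column. The first category in sorted order becomes the implicit baseline and gets no column of its own.

```python
import pandas as pd

# Toy data (hypothetical rows), not the notebook's dataset.
toy = pd.DataFrame({"Continent": ["Africa", "Asia", "Europe", "Asia"]})

# With drop_first=True, 'Africa' (first in sorted order) is dropped
# and is represented by all remaining indicator columns being zero/False.
encoded = pd.get_dummies(toy, columns=["Continent"], drop_first=True)

print(list(encoded.columns))
print(encoded)
```

Dropping one level avoids perfectly collinear indicator columns, which matters for a linear model fitted with an intercept.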


In [12]:
# Define target and features
target = 'Happiness Score'
feature_cols = [f'{col}_lag1' for col in num_features] + continent_cols

X = df[feature_cols]
y = df[target]

In [13]:
# Split dataset into training (2015–2022) and testing (2023–2024)
train = df[df['Year'] <= 2022]
test = df[df['Year'] > 2022]

X_train = train[feature_cols]
y_train = train[target]
X_test = test[feature_cols]
y_test = test[target]


In [14]:
# Standardize features; the scaler is fit on the training years only so no test-set statistics leak into it
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
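
To make the scaling step concrete, here is a NumPy-only sketch of the same idea on made-up numbers: the mean and standard deviation are computed from the training rows and then reused, unchanged, on the test rows. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# Hypothetical single-feature data.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Statistics come from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0 by default

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.ravel(), test_scaled.ravel())
```

Test values can fall outside the range seen in training after scaling, as the 4.0 does here; that is expected and not a sign of an error.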

In [15]:
# Train Linear Regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

In [16]:
# Predict on test set
y_pred = model.predict(X_test_scaled)

In [17]:
# Evaluate performance
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f'R² Score: {r2:.4f}')
print(f'Mean Squared Error: {mse:.4f}')

R² Score: 0.5832
Mean Squared Error: 0.5263


In [18]:
# Random Forest training and evaluation
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)  # unscaled features: tree splits are unaffected by feature scaling
y_pred_rf = rf_model.predict(X_test)

r2_rf = r2_score(y_test, y_pred_rf)
mse_rf = mean_squared_error(y_test, y_pred_rf)

print(f'Random Forest R² Score: {r2_rf:.4f}')
print(f'Random Forest Mean Squared Error: {mse_rf:.4f}')


Random Forest R² Score: 0.5314
Random Forest Mean Squared Error: 0.5918


#### Model Performance and Interpretation

We trained a Linear Regression model to predict the Happiness Score. On the held-out 2023–2024 test set it achieved an **R² score of approximately 0.58**, meaning it explains about **58% of the variation** in happiness scores in those years.

This is a moderate fit: lagged features such as Economy, Family, and Health capture a substantial share of the pattern, while the remaining 42% or so of the variation comes from factors not included in the model or from noise.
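
As a reminder of what the R² figure measures, the sketch below computes it by hand as 1 − SS_res / SS_tot on a small made-up example. The numbers are illustrative only and are not taken from the dataset; `r2_by_hand` is a hypothetical helper name.

```python
def r2_by_hand(y_true, y_pred):
    """Fraction of the variance in y_true that the predictions account for."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of y_true.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical happiness scores and predictions.
y_true = [4.0, 5.0, 6.0, 7.0]
y_pred = [4.5, 5.0, 5.5, 7.0]

r2_demo = r2_by_hand(y_true, y_pred)
print(f"R² = {r2_demo:.2f}")
```

An R² of 0 corresponds to always predicting the test-set mean, so 0.58 means the model removes a little over half of that baseline's squared error.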

In addition to Linear Regression, we also tried a Random Forest model with 100 trees and otherwise default settings. It achieved an **R² score of about 0.53** on the same test set, slightly lower than Linear Regression, and a higher Mean Squared Error (0.59 versus 0.53).

This can happen when the relationships are mostly linear or when the Random Forest needs further tuning. Given these results, **we decided to proceed with the Linear Regression model** because it offers a good balance of simplicity, interpretability, and predictive performance.
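
Taking the square root of the reported MSE values gives an RMSE in the same units as the Happiness Score, which can be easier to interpret. The snippet below uses only the two MSE figures printed by the evaluation cells.

```python
import math

# MSE values as printed in the evaluation outputs above.
mse_linear = 0.5263
mse_forest = 0.5918

# RMSE is the square root of MSE, expressed in Happiness Score points.
rmse_linear = math.sqrt(mse_linear)
rmse_forest = math.sqrt(mse_forest)

print(f"Linear Regression RMSE: {rmse_linear:.3f}")
print(f"Random Forest RMSE:     {rmse_forest:.3f}")
```

This works out to a typical error of roughly 0.73 points for Linear Regression and 0.77 points for the Random Forest, so the gap between the two models is small in absolute terms.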

---

In [19]:
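# Persist the fitted model with joblib so it can be reused without retraining.
import joblib

# The model was fit on standardized features, so the fitted scaler is saved
# too (this filename is an arbitrary choice); new inputs must pass through it.
joblib.dump(scaler, 'happiness_feature_scaler.joblib')

# Save the trained Linear Regression model to a file
joblib.dump(model, 'linear_regression_happiness_model.joblib')

# To load both later:
# loaded_scaler = joblib.load('happiness_feature_scaler.joblib')
# loaded_model = joblib.load('linear_regression_happiness_model.joblib')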


['linear_regression_happiness_model.joblib']
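
The saved model expects standardized inputs, because it was trained on `X_train_scaled`. One way to keep the scaler and the model together is a scikit-learn `Pipeline`, so a single file holds both steps. The sketch below is an assumption about how this could be done, not what the notebook does; it uses synthetic data and a hypothetical file name in a temporary directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 3.0

# Scaling and regression are fitted and stored as one object.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_demo, y_demo)

# Save and reload; the restored pipeline accepts raw, unscaled features.
path = os.path.join(tempfile.mkdtemp(), "happiness_pipeline_demo.joblib")
joblib.dump(pipeline, path)
restored = joblib.load(path)

same_predictions = np.allclose(pipeline.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

With this layout, whoever loads the file cannot forget the scaling step, since `predict` applies it automatically.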

#### About the Country Column

The `Country` column serves only as the grouping key when building lag features, so that each row's lag values come from the same country's earlier rows.

Because of this, the `Country` column itself is **not** passed to the model as a feature.

So, we **didn't need to encode** it.

Instead, the one-hot encoded `Continent` columns give the model some coarse geographic information.

**In short:**  
There is no need to encode `Country` when it is used only to build lag features.
