## Data Preprocessing and Feature Engineering
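Preprocessing is an important step in machine learning. Here we will perform feature scaling (either normalisation or standardisation) so that features measured on very different scales, such as emissions and population counts, become comparable. This mainly benefits scale-sensitive models such as linear regression; tree-based models are largely unaffected by it.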

In [3]:
#import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.ensemble import GradientBoostingRegressor

In [2]:
# import the cleaned dataset from CSV

df = pd.read_csv("Regression project cleaned dataset", index_col=None)

# drop the unnamed column: the CSV was exported with its index (index should have been set to False)
##df = df.drop(columns='Unnamed: 0')
# output top 5 rows
df.head()

Unnamed: 0,Country,Year,Savanna_fires,Forest_fires,Crop_Residues,Rice_Cultivation,Drained_organic_soils_(CO2),Pesticides_Manufacturing,Food_Transport,Forestland,...,Manure_Management,Fires_in_organic_soils,Fires_in_humid_tropical_forests,On-farm_energy_use,Rural_population,Urban_population,Total_Population_-_Male,Total_Population_-_Female,total_emission,Average_Temperature_°C
0,Afghanistan,1993,14.7237,0.0557,230.8175,686.0,0.0,11.712073,54.3617,-2388.803,...,352.2947,0.0,0.0,140.6888,11858090.0,3237009.0,7003641.0,7000119.0,2368.470529,0.101917
1,Afghanistan,1994,14.7237,0.0557,242.0494,705.6,0.0,11.712073,53.9874,-2388.803,...,367.6784,0.0,0.0,140.6888,12690115.0,3482604.0,7733458.0,7722096.0,2500.768729,0.37225
2,Afghanistan,1995,14.7237,0.0557,243.8152,666.4,0.0,11.712073,54.6445,-2388.803,...,397.5498,0.0,0.0,140.6888,13401971.0,3697570.0,8219467.0,8199445.0,2624.612529,0.285583
3,Afghanistan,1996,38.9302,0.2014,249.0364,686.0,0.0,11.712073,53.1637,-2388.803,...,465.205,0.0,0.0,140.6888,13952791.0,3870093.0,8569175.0,8537421.0,2838.921329,0.036583
4,Afghanistan,1997,30.9378,0.1193,276.294,705.6,0.0,11.712073,52.039,-2388.803,...,511.5927,0.0,0.0,140.6888,14373573.0,4008032.0,8916862.0,8871958.0,3204.180115,0.415167
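
The comment about the unnamed column can be illustrated with a small round trip. This is a minimal sketch on a made-up two-row frame (`toy` and its values are assumptions, not the project data), showing that exporting with `index=False` avoids the extra `Unnamed: 0` column on reload.

```python
import io
import pandas as pd

# Toy frame standing in for the cleaned dataset (illustrative values only)
toy = pd.DataFrame({"Country": ["A", "B"], "Year": [1993, 1994]})

# Default export writes the index as a first column with an empty header
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_bad = pd.read_csv(with_index)

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_ok = pd.read_csv(without_index)

print(list(reloaded_bad.columns))  # contains 'Unnamed: 0'
print(list(reloaded_ok.columns))
```

If the file has already been exported with its index, passing `index_col=0` to `pd.read_csv` is another way to keep that column out of the features.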


#### Perform Encoding for Categorical or Non-Numeric features
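This step may not be needed: `Country`, the only categorical column shown above, is dropped from the predictors below, so the encoding is left commented out.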

In [16]:
# dummy variable encoding for categorical variables
#df_encoded = pd.get_dummies(df, drop_first = True)

(6270, 257)
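
For reference, here is a minimal sketch of what dummy encoding with `drop_first=True` does, using a made-up four-row frame (the values are assumptions for illustration). Each categorical level becomes its own column, minus the first level, which is why encoding a column like `Country` can widen a frame considerably.

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset
toy = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Algeria", "Albania"],
    "Rice_Cultivation": [686.0, 23.5, 0.0, 24.1],
})

# drop_first=True drops one level per categorical column,
# so three countries produce two dummy columns
toy_encoded = pd.get_dummies(toy, drop_first=True)

print(toy_encoded.columns.tolist())
print(toy_encoded.shape)
```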

#### Create Predictor and Response Variables

In [5]:
# split data into predictor and response variables
X = df.drop(['Average_Temperature_°C','Country', 'Year'], axis = 1)

y = df['Average_Temperature_°C']


#### Create Train Test Split
Here we split the data into a training set and a testing set. We use an 80/20 split: 20% of the rows in X and y are held out for testing the model, and the remaining 80% are used for training.

In [6]:
#create train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)
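
As a quick sanity check of the 80/20 split, the sketch below runs the same call on a synthetic array of 100 rows (the `X_toy` and `y_toy` names are assumptions for illustration) and confirms the sizes of the two partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows with 2 features each
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42
)

print(X_tr.shape, X_te.shape)  # 80 training rows, 20 test rows
# No row ends up in both partitions
print(set(y_tr) & set(y_te))
```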

#### Feature Scaling
I'm going to use standardisation, which centres each feature on its mean and rescales it to a standard deviation of one, using `StandardScaler` from scikit-learn's `sklearn.preprocessing` module. We noticed that the dataset contains a number of outliers in various columns. Standardisation copes with these better than min-max normalisation, which squeezes the remaining values into a narrow band when extremes are present, although the mean and standard deviation are themselves still influenced by outliers.

In [13]:
# Standardise the predictors (X); the target y is left unscaled
scaler = StandardScaler()

# Fit scaler on training data, transform both train and test data
X_train_scaled = scaler.fit_transform(X_train) 
X_test_scaled = scaler.transform(X_test)

# Optional: convert back to DataFrame (keeping the original row index) for easier column reference
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns, index=X_test.index)
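
To make the fit-on-train, transform-both pattern concrete, here is a small sketch on random data (the arrays are synthetic assumptions). It checks that `StandardScaler` applies the training mean and standard deviation to the test set, which is what prevents test information from leaking into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
X_te = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

demo_scaler = StandardScaler().fit(X_tr)

# Manual z-score of the test set using TRAINING statistics
# (np.std defaults to the population std, ddof=0, as StandardScaler does)
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)

print(np.allclose(demo_scaler.transform(X_te), manual))
# The scaled training columns have mean ~0 and std ~1
print(demo_scaler.transform(X_tr).mean(axis=0).round(6))
print(demo_scaler.transform(X_tr).std(axis=0).round(6))
```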

#### Feature Selection
By selecting the most relevant features, we can improve the model’s interpretability, reduce overfitting, and potentially enhance performance. Given the weak linear correlations between the features and the target, we use two techniques: RFE with a linear model, and random forest importances, which can pick up non-linear relationships and interactions that simple correlations might miss.

 Feature Selection with Recursive Feature Elimination (RFE)

In [14]:
from sklearn.feature_selection import RFE
# Initialize the linear regression model for RFE
lr_model = LinearRegression()

# Set up RFE: start with all features and remove one per iteration (step=1) until 10 remain
selector = RFE(estimator=lr_model, n_features_to_select=10, step=1)
selector.fit(X_train_scaled, y_train)

# Get selected features
selected_features_rfe = X.columns[selector.support_]
print("Selected features by RFE:", selected_features_rfe)

Selected features by RFE: Index(['Pesticides_Manufacturing', 'Food_Packaging',
       'Agrifood_Systems_Waste_Disposal', 'Fertilizers_Manufacturing', 'IPPU',
       'Manure_applied_to_Soils', 'Manure_Management', 'Rural_population',
       'Urban_population', 'Total_Population_-_Female'],
      dtype='object')
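Besides `support_`, a fitted `RFE` object also exposes `ranking_`, which records the order in which features were eliminated. The sketch below uses synthetic data from `make_regression` (an assumption for illustration, not the project data): selected features get rank 1 and higher ranks were dropped earlier.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic regression problem: 8 features, 3 of them informative
X_syn, y_syn = make_regression(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

rfe = RFE(estimator=LinearRegression(), n_features_to_select=3, step=1)
rfe.fit(X_syn, y_syn)

print(rfe.support_)   # boolean mask of the kept features
print(rfe.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```

On the real data, `pd.Series(selector.ranking_, index=X.columns).sort_values()` would give the same view with column names.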


Feature Selection with Random Forest

In [15]:
# Train Random Forest model for feature importance
rf_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Get top 10 features based on importance
feature_importances = pd.Series(rf_model.feature_importances_, index=X.columns)
selected_features_rf = feature_importances.nlargest(10).index
print("Top 10 features by Random Forest importance:", selected_features_rf)

Top 10 features by Random Forest importance: Index(['Food_Retail', 'Manure_left_on_Pasture', 'Fertilizers_Manufacturing',
       'Rice_Cultivation', 'Fires_in_humid_tropical_forests',
       'Drained_organic_soils_(CO2)', 'IPPU', 'Food_Transport',
       'Rural_population', 'On-farm_Electricity_Use'],
      dtype='object')


In [16]:
#combine features from various selectors
final_features = list(set(selected_features_rfe) | set(selected_features_rf))

#update X_train and X_test with final features
X_train_final = X_train_scaled[final_features]
X_test_final = X_test_scaled[final_features]
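
One caveat with a set union is that Python sets of strings have no guaranteed order, so the column order of `final_features` may differ between sessions. A minimal sketch of a deterministic variant, using a few of the feature names above as stand-ins (the two short `Index` objects are assumptions for illustration):

```python
import pandas as pd

# Shortened stand-ins for the two selections
rfe_cols = pd.Index(["IPPU", "Rural_population", "Manure_Management"])
rf_cols = pd.Index(["Food_Retail", "IPPU", "Rural_population"])

# sorted() fixes the order, so repeated runs produce identical column layouts
combined = sorted(set(rfe_cols) | set(rf_cols))
print(combined)
```

A stable column order can matter for tree ensembles, where feature subsampling depends on column positions even with a fixed `random_state`.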

#### Select and develop three primary models
I'm going to make use of the following models:
1. Linear Regression
2. Random Forest Regressor
3. Gradient Boosting Regressor

We can add more models or swap some out later if need be.

#### Linear Regression Model

In [30]:
# model object has already been created during the feature selection phase - we can now train the model
lr_model.fit(X_train_final, y_train)

# predict on test data
lr_y_pred = lr_model.predict(X_test_final)

#evaluate model performance
mae_lr = mean_absolute_error(y_test, lr_y_pred)
mse_lr = mean_squared_error(y_test, lr_y_pred)
r2_lr = r2_score(y_test, lr_y_pred)
rmse_lr = np.sqrt(mse_lr) 


#output results
print('Linear Regression:')
print(f'MAE: {mae_lr}')
print(f'MSE: {mse_lr}')
print(f'r2: {r2_lr}')
print(f'RMSE: {rmse_lr}')


Linear Regression:
MAE: 0.3877482973052707
MSE: 0.261405
r2: 0.023337858157445934
RMSE: 0.5112781136526017
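
Because the same four metrics are computed for every model, the evaluation step could be wrapped in a small function so that each metric name is always paired with its own value. The sketch below is one possible shape; `evaluate_regression` is a hypothetical helper, not part of scikit-learn, and the demo values are made up.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate_regression(name, y_true, y_pred):
    """Return MAE, MSE, R² and RMSE for one model as a dict (hypothetical helper)."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "Model": name,
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "R²": r2_score(y_true, y_pred),
        "RMSE": np.sqrt(mse),
    }


# Tiny made-up example: one prediction is off by 2
demo = evaluate_regression("demo", [0.0, 1.0, 2.0], [0.0, 1.0, 4.0])
for key, value in demo.items():
    print(f"{key}: {value}")
```

In the notebook this could be called as `evaluate_regression("Linear Regression", y_test, lr_y_pred)`, and a list of such dicts can be passed straight to `pd.DataFrame` to build a summary table.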


#### Random Forest Model

In [31]:
#train model
rf_model.fit(X_train_final, y_train)

#predict y
rf_y_pred = rf_model.predict(X_test_final)

#evaluate model performance
mae_rf = mean_absolute_error(y_test, rf_y_pred)
mse_rf = mean_squared_error(y_test, rf_y_pred)
r2_rf = r2_score(y_test, rf_y_pred)
rmse_rf = np.sqrt(mse_rf) 

#output results
print('Random Forest Regressor:')
print(f'MAE: {mae_rf}')
print(f'MSE: {mse_rf}')
print(f'r2: {r2_rf}')
print(f'RMSE: {rmse_rf}')


Random Forest Regressor:
MAE: 0.30383722159864285
MSE: 0.16256481749127405
r2: 0.39262556241546864
RMSE: 0.4031932756027487


#### Gradient Boosting Regressor

In [32]:
# Instantiate model object
gb_model = GradientBoostingRegressor(random_state=42)

#train model
gb_model.fit(X_train_final, y_train)

#predict y
gb_y_pred = gb_model.predict(X_test_final)

#evaluate model performance
mae_gb = mean_absolute_error(y_test, gb_y_pred)
mse_gb = mean_squared_error(y_test, gb_y_pred)
r2_gb = r2_score(y_test, gb_y_pred)
rmse_gb = np.sqrt(mse_gb) 
#output results
print('Gradient Boosting Regressor:')
print(f'MAE: {mae_gb}')
print(f'MSE: {mse_gb}')
print(f'r2: {r2_gb}')
print(f'RMSE: {rmse_gb}')


Gradient Boosting Regressor:
MAE: 0.3335567278780887
MSE: 0.19264862667325394
r2: 0.2802264777655187
RMSE: 0.43891756250263436


#### Evaluate model performance

In [33]:
model_results = {
    "Model": ["Linear Regression", "Random Forest Regressor", "Gradient Boosting Regressor"],
    "MAE": [mae_lr, mae_rf, mae_gb],
    "MSE": [mse_lr, mse_rf, mse_gb],
    "R²": [r2_lr, r2_rf, r2_gb],
    "RMSE": [rmse_lr, rmse_rf, rmse_gb]
}

summary_df = pd.DataFrame(model_results)

summary_df

,Model,MAE,MSE,R²,RMSE
0,Linear Regression,0.387748,0.261405,0.023338,0.511278
1,Random Forest Regressor,0.303837,0.162565,0.392626,0.403193
2,Gradient Boosting Regressor,0.333557,0.192649,0.280226,0.438918
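
To read the comparison at a glance, the summary can be sorted by an error metric. This is a minimal sketch that rebuilds the table from the rounded values shown above (`summary_demo` is a stand-in name; in the notebook the same call would apply to `summary_df`).

```python
import pandas as pd

# Rounded values copied from the summary table above
summary_demo = pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest Regressor", "Gradient Boosting Regressor"],
    "MAE": [0.387748, 0.303837, 0.333557],
    "MSE": [0.261405, 0.162565, 0.192649],
    "R²": [0.023338, 0.392626, 0.280226],
    "RMSE": [0.511278, 0.403193, 0.438918],
})

# Lowest RMSE first
ranked = summary_demo.sort_values("RMSE").reset_index(drop=True)
print(ranked[["Model", "RMSE", "R²"]])
```

On these figures the Random Forest Regressor has the lowest error and the highest R², followed by Gradient Boosting and then Linear Regression.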
