Multivariate Regression Analysis

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [3]:
# Load the dataset from the CSV file
df = pd.read_csv('co2_emission_preprocessed.csv')  # Adjust the path if the CSV file lives elsewhere



In [4]:
# Calculate the correlation matrix excluding non-numeric columns
correlation_matrix = df.select_dtypes(include=[np.number]).corr()
print(correlation_matrix)


                    Year  Population       CO2  CO2_per_capita
Year            1.000000    0.058253  0.111116       -0.022562
Population      0.058253    1.000000  0.846613       -0.014618
CO2             0.111116    0.846613  1.000000        0.000094
CO2_per_capita -0.022562   -0.014618  0.000094        1.000000


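The strongest relationship in the dataset is the positive correlation between Population and CO2 (r ≈ 0.85), which suggests that larger populations tend to produce more total CO2 emissions. CO2_per_capita, by contrast, is essentially uncorrelated with every other numeric column (|r| < 0.03).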

In [5]:


# Prepare the independent variables (features) and the target variable
X = df[['Population', 'Year']]  # Independent variables
y = df['CO2_per_capita']  # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Fit the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Output the results
print("Mean Squared Error:", mse)
print("R-squared:", r2)
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)


Mean Squared Error: 0.0004161874853057511
R-squared: 0.00015509475606179102
Intercept: 0.00961964310792363
Coefficients: [-4.12940194e-03 -4.52069484e-06]
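To put these numbers in context, the printed MSE and R² can be turned into an RMSE and into the error of a mean-only baseline. This is a small arithmetic sketch using only the values printed above. It relies on the fact that `r2_score` computes R² = 1 - SS_res / SS_tot, with SS_tot taken around the mean of `y_test`; the "baseline" here is a hypothetical predictor that always outputs that mean.

```python
import math

# Values copied from the linear model's printed output above
mse = 0.0004161874853057511
r2 = 0.00015509475606179102

# RMSE is expressed in the same units as CO2_per_capita
rmse = math.sqrt(mse)

# Dividing SS_res and SS_tot by n gives R^2 = 1 - MSE / MSE_baseline,
# where MSE_baseline is the error of always predicting the mean of y_test
baseline_mse = mse / (1 - r2)

# Absolute reduction in MSE gained by using the model instead of the mean
improvement = baseline_mse - mse

print(f"RMSE: {rmse:.6f}")
print(f"Baseline (mean-only) MSE: {baseline_mse:.10f}")
print(f"MSE reduction from the model: {improvement:.3e}")
```

The low MSE therefore reflects the small scale of the target rather than a good fit: the model barely improves on predicting the mean.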


The linear model explains almost none of the variance in CO2_per_capita (R² ≈ 0.00016). Since the relationship may be non-linear, we next fit a polynomial regression model.

## Polynomial Regression Model

In [6]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)  # You can change the degree as needed
X_poly = poly.fit_transform(X)

# Split the polynomial features into training and testing sets
# (same test_size and random_state as before, so the rows match the linear model's split)
X_train_poly, X_test_poly, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)

# Create and fit the model
model_poly = LinearRegression()
model_poly.fit(X_train_poly, y_train)

# Predict and evaluate
y_pred_poly = model_poly.predict(X_test_poly)
mse_poly = mean_squared_error(y_test, y_pred_poly)
r2_poly = r2_score(y_test, y_pred_poly)

print("Polynomial Mean Squared Error:", mse_poly)
print("Polynomial R-squared:", r2_poly)


Polynomial Mean Squared Error: 0.0004161084354711419
Polynomial R-squared: 0.0003450033361728222


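An R² of 0.000345 means the model explains about 0.03% of the variance in CO2_per_capita, so the degree-2 model underfits much as the linear model did (R² ≈ 0.000155). This is consistent with the correlation matrix, where Population and Year are nearly uncorrelated with CO2_per_capita. Increasing the polynomial degree further raises the risk of overfitting. We need to explore alternative methods to improve the performance and accuracy of the analysis.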
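One common way to weigh underfitting against overfitting when choosing the degree is k-fold cross-validation. The sketch below is an illustration under stated assumptions, not a result for this dataset: it uses synthetic data with a known quadratic relationship, and the names `X_demo`, `y_demo`, and `cv_r2` are hypothetical. Scaling and polynomial expansion sit inside a pipeline so that they are refitted on each training fold only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption, NOT the CO2 dataset):
# the target depends quadratically on the first feature
rng = np.random.default_rng(42)
X_demo = rng.uniform(-2, 2, size=(200, 2))
y_demo = (
    1
    + 2 * X_demo[:, 0]
    - 1.5 * X_demo[:, 0] ** 2
    + 0.5 * X_demo[:, 1]
    + rng.normal(0, 0.1, 200)
)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_r2 = {}
for degree in (1, 2, 3, 4):
    # Scale first, then expand, then fit ordinary least squares
    pipeline = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="r2")
    cv_r2[degree] = scores.mean()

for degree, score in cv_r2.items():
    print(f"degree={degree}: mean CV R^2 = {score:.4f}")
```

If the cross-validated R² stops improving, or starts dropping, as the degree grows, the extra terms are fitting noise. On the notebook's data, `X` and `y` could be substituted for the synthetic arrays.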