# <center> Simple Linear Regression on CO2 Emissions : Core Logic Approach </center>

## Importing Dependencies

In [94]:
import pandas as pd
import random
import math

## Importing the dataset into the environment

In [95]:
co2Data = pd.read_csv('Datasets/CO2emission.csv')

## Inspecting the dataset

In [96]:
co2Data.head()

Unnamed: 0,Make,Model,VehicleClass,EngineSize,Cylinders,FuelConsumptionCity,CO2Emissions
0,ACURA,ILX,COMPACT,2.0,4,9.9,196
1,ACURA,ILX,COMPACT,2.4,4,11.2,221
2,ACURA,ILX HYBRID,COMPACT,1.5,4,6.0,136
3,ACURA,MDX 4WD,SUV - SMALL,3.5,6,12.7,255
4,ACURA,RDX AWD,SUV - SMALL,3.5,6,12.1,244


In [97]:
co2Data.shape

(7385, 7)

In [98]:
co2Data.isnull().sum()

Make                   0
Model                  0
VehicleClass           0
EngineSize             0
Cylinders              0
FuelConsumptionCity    0
CO2Emissions           0
dtype: int64

In [99]:
co2Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7385 entries, 0 to 7384
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Make                 7385 non-null   object 
 1   Model                7385 non-null   object 
 2   VehicleClass         7385 non-null   object 
 3   EngineSize           7385 non-null   float64
 4   Cylinders            7385 non-null   int64  
 5   FuelConsumptionCity  7385 non-null   float64
 6   CO2Emissions         7385 non-null   int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 404.0+ KB


In [100]:
co2Data.describe()

Unnamed: 0,EngineSize,Cylinders,FuelConsumptionCity,CO2Emissions
count,7385.0,7385.0,7385.0,7385.0
mean,3.160068,5.61503,12.556534,250.584699
std,1.35417,1.828307,3.500274,58.512679
min,0.9,3.0,4.2,96.0
25%,2.0,4.0,10.1,208.0
50%,3.0,6.0,12.1,246.0
75%,3.7,6.0,14.6,288.0
max,8.4,16.0,30.6,522.0


In [101]:
co2Data.duplicated().sum()

1295

In [102]:
co2Data = co2Data.drop_duplicates()
co2Data.duplicated().sum()

0

In [103]:
co2Data.shape

(6090, 7)

## Identifying correlations between the dependent and independent features in the dataset

### Pearson's Correlation Coefficient

- The Pearson correlation coefficient \( r \) is a measure of the linear relationship between two variables. It is calculated using the formula:

$$ 
r = \frac{n \sum (XY) - \sum X \sum Y}{\sqrt{[n \sum (X^2) - (\sum X)^2][n \sum (Y^2) - (\sum Y)^2]}}
$$

- Dependent Variable    = CO2Emissions
- Independent Variables = EngineSize, Cylinders, FuelConsumptionCity

In [104]:
def Karl_Pearson_Correlation(indeFeature, depenFeature = "CO2Emissions"):
    X = co2Data[indeFeature]
    Y = co2Data[depenFeature]
    n = len(Y)        
    sumX = sum(X)
    sumY = sum(Y)
    sumXY = sum(X[:] * Y[:])
    sumXsq = sum(X[:] * X[:])
    sumYsq = sum(Y[:] * Y[:])
    numerator = (n * sumXY) - (sumX * sumY)
    denominator = (n * sumXsq - (sumX ** 2)) * (n * sumYsq - (sumY ** 2))
    if denominator != 0:
        corr = numerator / math.sqrt(denominator)
    else:
        corr = None
    return corr

In [105]:
correlations = {}
for feature in ["EngineSize", "Cylinders", "FuelConsumptionCity"]:
    correlations[f"{feature}-CO2Emissions"] = Karl_Pearson_Correlation(feature)
pd.DataFrame(correlations, index = ['Correlation (r)'])

Unnamed: 0,EngineSize-CO2Emissions,Cylinders-CO2Emissions,FuelConsumptionCity-CO2Emissions
Correlation (r),0.855194,0.834444,0.918415


## <center> The Simple Linear Regression </center>
### <center> <i> Between FuelConsumptionCity and CO2Emissions, as well as Between EngineSize and CO2Emissions </i> </center>

### ► Splitting the data into training and testing sets in an 80:20 ratio

In [106]:
n = int(len(co2Data) * 80/100)
indexForTrainingSet = random.sample(range(len(co2Data)), n)
print(f"the length of the sampled indices: {len(indexForTrainingSet)}")
print(f"Is duplicated? : {len(indexForTrainingSet) != len(set(indexForTrainingSet))}")
indexForTrainingSet[:10]

the length of the sampled indices: 4872
Is duplicated? : False


[2183, 2921, 5986, 4335, 4528, 1356, 1154, 1145, 4332, 5222]

In [107]:
print(f"Is this exactly 80%? : {len(indexForTrainingSet) == n}")

Is this exactly 80%? : True


In [108]:
trainingSet = {'EngineSize': [], 'FuelConsumptionCity': [], 'CO2Emissions': []}
testingSet = {'EngineSize' : [], 'FuelConsumptionCity': [], 'CO2Emissions': []}
for index in range(len(co2Data)):
    if index in indexForTrainingSet:
        trainingSet['EngineSize'].append(co2Data.iloc[index]['EngineSize'])
        trainingSet['FuelConsumptionCity'].append(co2Data.iloc[index]['FuelConsumptionCity'])
        trainingSet['CO2Emissions'].append(co2Data.iloc[index]['CO2Emissions'])
    elif index not in indexForTrainingSet:
        testingSet['EngineSize'].append(co2Data.iloc[index]['EngineSize'])
        testingSet['FuelConsumptionCity'].append(co2Data.iloc[index]['FuelConsumptionCity'])
        testingSet['CO2Emissions'].append(co2Data.iloc[index]['CO2Emissions'])
trainingSet = pd.DataFrame(trainingSet).reset_index(drop = True)
testingSet = pd.DataFrame(testingSet).reset_index(drop = True)

In [109]:
print(f"the length of the training set: {len(trainingSet)}")
print(f"the length of the testing set: {len(testingSet)}")
print(f"the ratio of the training set to the entire dataset: {int((len(trainingSet) / len(co2Data)) * 100)}%")
print(f"the ratio of the testing set to the entire dataset: {int((len(testingSet) / len(co2Data)) * 100)}%")

the length of the training set: 4872
the length of the testing set: 1218
the ratio of the training set to the entire dataset: 80%
the ratio of the testing set to the entire dataset: 20%


In [110]:
trainingSet.head()

Unnamed: 0,EngineSize,FuelConsumptionCity,CO2Emissions
0,2.0,9.9,196
1,2.4,11.2,221
2,3.5,12.7,255
3,3.5,11.9,230
4,3.5,11.8,232


In [111]:
testingSet.head()

Unnamed: 0,EngineSize,FuelConsumptionCity,CO2Emissions
0,1.5,6.0,136
1,3.5,12.1,244
2,3.7,12.8,255
3,2.4,11.2,225
4,5.9,18.0,359


### ► Model training

#### ➔ Calculating Coefficients for Simple Linear Regression Using Ordinary Least Squares

In [112]:
def coefficients(X, Y):
    meanX = sum(X) / len(X)
    meanY = sum(Y) / len(Y)
    numerator = sum((X[:] - meanX) * (Y[:] - meanY))
    denominator = sum((X[:] - meanX) ** 2)
    beta1 = numerator / denominator
    beta0 = meanY - (beta1 * meanX)
    return beta0, beta1

#### ➔ Simple Linear Regression between FuelConsumptionCity and CO2Emissions

In [113]:
intercept, regressionCoefficient = coefficients(trainingSet['FuelConsumptionCity'], trainingSet['CO2Emissions'])
modelLR = lambda inputData : intercept + regressionCoefficient * inputData

In [114]:
print(f"Intercept: {intercept}")
print(f"Regression Coefficient: {regressionCoefficient}")

Intercept: 58.40910834478734
Regression Coefficient: 15.27851974238597


In [115]:
predictions = modelLR(testingSet['FuelConsumptionCity'])
predictions[:5]

0    150.080227
1    243.279197
2    253.974161
3    229.528529
4    333.422464
Name: FuelConsumptionCity, dtype: float64

#### ⇒ Model Evaluation

- ##### Mean Absolute Error (MAE) : Measures the average magnitude of errors in a set of predictions, without considering their direction.
$$ 
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$

---

- ##### Mean Squared Error (MSE) :  Measures the average squared difference between predicted and actual values, giving more weight to larger errors.
$$ 
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

---

- ##### Root Mean Squared Error (RMSE) : The square root of the average of squared differences between prediction and actual observation, representing error in the same units as the original data.
$$ 
\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
$$

---

- ##### R-squared (R²) : Represents the proportion of variance for a dependent variable that's explained by an independent variable(s) in a regression model.
$$ 
R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}
$$

---

- ##### Adjusted R-squared (Adjusted R²) : Adjusts R² for the number of predictors in a model, providing a more accurate measure for multiple regression.
$$ 
\text{Adjusted } R^2 = 1 - \left(1 - R^2\right) \frac{n-1}{n-p-1}
$$

where \( n \) is the number of observations and \( p \) is the number of predictors.


In [134]:
print(f"MAE ➔ {sum(abs(testingSet['CO2Emissions'] - predictions)) / len(testingSet)}")
print(f"MSE ➔ {sum((testingSet['CO2Emissions'] - predictions) ** 2) / len(testingSet)}")
print(f"RMSE ➔ {math.sqrt(sum((testingSet['CO2Emissions'] - predictions) ** 2) / len(testingSet))}")

MAE ➔ 23.428725834834832
MSE ➔ 941.1305750215491
RMSE ➔ 30.677851538553824


In [124]:
meanVal = sum(co2Data['CO2Emissions']) / len(co2Data['CO2Emissions'])
sumOfSquaredError = sum((testingSet['CO2Emissions'] - predictions) ** 2)
totalSumOfSquare = sum((testingSet['CO2Emissions'] - meanVal) ** 2)
R_Squared = 1 - (sumOfSquaredError / totalSumOfSquare)
print(f"R-Squared/ Coefficient of determination ➔ { R_Squared }")

R-Squared/ Coefficient of determination ➔ 0.737521215337696


In [125]:
numerator = (1 - R_Squared) * (len(testingSet) - 1)
denominator = len(testingSet) - 1 - 1
adjustedR_Squared = 1 - numerator / denominator
print(f"Adjusted R-Squared/ Coefficient of determination ➔ { adjustedR_Squared }")

Adjusted R-Squared/ Coefficient of determination ➔ 0.7373053610739935


#### ➔ Simple Linear Regression between Engine and CO2 Emissions

In [135]:
intercept, regressionCoefficient = coefficients(trainingSet['EngineSize'], trainingSet['CO2Emissions'])
modelLR = lambda inputData : intercept + regressionCoefficient * inputData

In [136]:
print(f"Intercept: {intercept}")
print(f"Regression Coefficient: {regressionCoefficient}")

Intercept: 133.95520568545692
Regression Coefficient: 37.072613553498336


In [137]:
predictions = modelLR(testingSet['EngineSize'])
predictions[:5]

0    189.564126
1    263.709353
2    271.123876
3    222.929478
4    352.683626
Name: EngineSize, dtype: float64

#### ⇒ Model Evaluation

- ##### Mean Absolute Error (MAE) : Measures the average magnitude of errors in a set of predictions, without considering their direction.
$$ 
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$

---

- ##### Mean Squared Error (MSE) :  Measures the average squared difference between predicted and actual values, giving more weight to larger errors.
$$ 
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

---

- ##### Root Mean Squared Error (RMSE) : The square root of the average of squared differences between prediction and actual observation, representing error in the same units as the original data.
$$ 
\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
$$

---

- ##### R-squared (R²) : Represents the proportion of variance for a dependent variable that's explained by an independent variable(s) in a regression model.
$$ 
R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}
$$

---

- ##### Adjusted R-squared (Adjusted R²) : Adjusts R² for the number of predictors in a model, providing a more accurate measure for multiple regression.
$$ 
\text{Adjusted } R^2 = 1 - \left(1 - R^2\right) \frac{n-1}{n-p-1}
$$

where \( n \) is the number of observations and \( p \) is the number of predictors.


In [138]:
print(f"MAE ➔ {sum(abs(testingSet['CO2Emissions'] - predictions)) / len(testingSet)}")
print(f"MSE ➔ {sum((testingSet['CO2Emissions'] - predictions) ** 2) / len(testingSet)}")
print(f"RMSE ➔ {math.sqrt(sum((testingSet['CO2Emissions'] - predictions) ** 2) / len(testingSet))}")

MAE ➔ 23.428725834834832
MSE ➔ 941.1305750215491
RMSE ➔ 30.677851538553824


In [139]:
meanVal = sum(co2Data['CO2Emissions']) / len(co2Data['CO2Emissions'])
sumOfSquaredError = sum((testingSet['CO2Emissions'] - predictions) ** 2)
totalSumOfSquare = sum((testingSet['CO2Emissions'] - meanVal) ** 2)
print(f"R-Squared/ Coefficient of determination ➔ { 1 - sumOfSquaredError / totalSumOfSquare}")

R-Squared/ Coefficient of determination ➔ 0.737521215337696


In [140]:
numerator = (1 - R_Squared) * (len(testingSet) - 1)
denominator = len(testingSet) - 1 - 1
adjustedR_Squared = 1 - numerator / denominator
print(f"Adjusted R-Squared/ Coefficient of determination ➔ { adjustedR_Squared }")

Adjusted R-Squared/ Coefficient of determination ➔ 0.7373053610739935
