## What is Regression?

The file "FuelConsumptionCo2.csv" contains data on fuel consumption and CO2 emissions for various car models. It includes information like the year, make, model, engine size, fuel type, and city/highway/combined fuel consumption. This data can be used to analyze trends in fuel efficiency and CO2 emissions across different car models and manufacturers.

In [1]:
import pandas as pd
import numpy as np

# Read the CSV file
df = pd.read_csv('FuelConsumptionCo2.csv')

# Display the head of the dataframe
print(df.head())

   MODELYEAR   MAKE       MODEL VEHICLECLASS  ENGINESIZE  CYLINDERS  \
0       2014  ACURA         ILX      COMPACT         2.0          4   
1       2014  ACURA         ILX      COMPACT         2.4          4   
2       2014  ACURA  ILX HYBRID      COMPACT         1.5          4   
3       2014  ACURA     MDX 4WD  SUV - SMALL         3.5          6   
4       2014  ACURA     RDX AWD  SUV - SMALL         3.5          6   

  TRANSMISSION FUELTYPE  FUELCONSUMPTION_CITY  FUELCONSUMPTION_HWY  \
0          AS5        Z                   9.9                  6.7   
1           M6        Z                  11.2                  7.7   
2          AV7        Z                   6.0                  5.8   
3          AS6        Z                  12.7                  9.1   
4          AS6        Z                  12.1                  8.7   

   FUELCONSUMPTION_COMB  FUELCONSUMPTION_COMB_MPG  CO2EMISSIONS  
0                   8.5                        33           196  
1                   

## Select Numerical Features and Target Variable

In [2]:
# Select only the columns with float64 and int64 data types
numerical_df = df.select_dtypes(include=['float64', 'int64'])

# Display the head of the new dataframe
print(numerical_df.head())

   MODELYEAR  ENGINESIZE  CYLINDERS  FUELCONSUMPTION_CITY  \
0       2014         2.0          4                   9.9   
1       2014         2.4          4                  11.2   
2       2014         1.5          4                   6.0   
3       2014         3.5          6                  12.7   
4       2014         3.5          6                  12.1   

   FUELCONSUMPTION_HWY  FUELCONSUMPTION_COMB  FUELCONSUMPTION_COMB_MPG  \
0                  6.7                   8.5                        33   
1                  7.7                   9.6                        29   
2                  5.8                   5.9                        48   
3                  9.1                  11.1                        25   
4                  8.7                  10.6                        27   

   CO2EMISSIONS  
0           196  
1           221  
2           136  
3           255  
4           244  


## Remove MODELYEAR variable

In [3]:
# Remove MODELYEAR variable
numerical_df = numerical_df.drop('MODELYEAR', axis=1)

# Display the head of the updated dataframe
print(numerical_df.head())

   ENGINESIZE  CYLINDERS  FUELCONSUMPTION_CITY  FUELCONSUMPTION_HWY  \
0         2.0          4                   9.9                  6.7   
1         2.4          4                  11.2                  7.7   
2         1.5          4                   6.0                  5.8   
3         3.5          6                  12.7                  9.1   
4         3.5          6                  12.1                  8.7   

   FUELCONSUMPTION_COMB  FUELCONSUMPTION_COMB_MPG  CO2EMISSIONS  
0                   8.5                        33           196  
1                   9.6                        29           221  
2                   5.9                        48           136  
3                  11.1                        25           255  
4                  10.6                        27           244  


## How does linear regression work?

In [4]:
import plotly.express as px

# Create a scatter plot using Plotly
fig = px.scatter(numerical_df, x='ENGINESIZE', y='CO2EMISSIONS', 
                 title='Scatter Plot of CO2 Emissions vs Engine Size',
                 labels={'ENGINESIZE': 'Engine Size', 'CO2EMISSIONS': 'CO2 Emissions'})

# Show the plot
fig.show()

In [5]:
import plotly.express as px

# Create a scatter plot using Plotly
fig = px.scatter(numerical_df, x='ENGINESIZE', y='CO2EMISSIONS', 
                 title='Scatter Plot of CO2 Emissions vs Engine Size',
                 labels={'ENGINESIZE': 'Engine Size', 'CO2EMISSIONS': 'CO2 Emissions'},
                 trendline="ols")

# Show the plot
fig.show()

## Estimating the parameters

In [6]:
import statsmodels.api as sm

# Define the response variable
y = numerical_df['CO2EMISSIONS']
# Define the explanatory variable
x = numerical_df['ENGINESIZE']

# Add an intercept column of ones (sm.OLS does not add one by default)
x = sm.add_constant(x)

# Fit the ordinary least squares model
model = sm.OLS(y, x).fit()

# View the model summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:           CO2EMISSIONS   R-squared:                       0.764
Model:                            OLS   Adj. R-squared:                  0.764
Method:                 Least Squares   F-statistic:                     3451.
Date:                Fri, 06 Sep 2024   Prob (F-statistic):               0.00
Time:                        12:47:22   Log-Likelihood:                -5169.9
No. Observations:                1067   AIC:                         1.034e+04
Df Residuals:                    1065   BIC:                         1.035e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        125.3041      2.420     51.779      0.0


Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
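
For a single explanatory variable, ordinary least squares has a closed-form solution: the slope is the covariance of x and y divided by the variance of x, and the intercept places the line through the point of means. The sketch below applies those formulas to only the five rows printed by `df.head()` earlier, so its numbers will not match the full-data summary; `np.polyfit` is used purely as a cross-check and is not part of the original notebook.

```python
import numpy as np

# First five rows of ENGINESIZE and CO2EMISSIONS, as printed by df.head() above
engine_size = np.array([2.0, 2.4, 1.5, 3.5, 3.5])
co2 = np.array([196.0, 221.0, 136.0, 255.0, 244.0])

# Closed-form least-squares estimates for y = intercept + slope * x
x_mean, y_mean = engine_size.mean(), co2.mean()
slope = np.sum((engine_size - x_mean) * (co2 - y_mean)) / np.sum((engine_size - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Cross-check against NumPy's degree-1 polynomial fit (highest power first)
check_slope, check_intercept = np.polyfit(engine_size, co2, deg=1)

print(f"closed form: slope={slope:.3f}, intercept={intercept:.3f}")
print(f"np.polyfit:  slope={check_slope:.3f}, intercept={check_intercept:.3f}")
```

Both lines should print the same pair of values, which is the point of the check: the fitting routine is solving the same minimization that the formulas solve directly.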



## What is training and out-of-sample accuracy?

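The cell below reloads the fuel consumption data, holds out 20% of the rows as a test set, fits a linear regression of CO2 emissions on engine size using the remaining 80%, and scores the predictions for the held-out rows with mean squared error and R-squared. Because the test rows play no part in fitting, these metrics measure out-of-sample accuracy; the same metrics computed on the training rows would measure training accuracy.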

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Read the CSV file and select numerical columns
df = pd.read_csv('FuelConsumptionCo2.csv')
numerical_df = df.select_dtypes(include=['float64', 'int64'])
numerical_df = numerical_df.drop('MODELYEAR', axis=1)

# Define the features and target variable
X = numerical_df[['ENGINESIZE']]
y = numerical_df['CO2EMISSIONS']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate the mean squared error and R-squared value
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Display the coefficients and performance metrics
print("Coefficient:", model.coef_[0])
print("Intercept:", model.intercept_)
print("Mean Squared Error:", mse)
print("R-squared:", r2)

Coefficient: 38.992978724434074
Intercept: 126.28970217408721
Mean Squared Error: 985.9381692274999
R-squared: 0.7615595731934373
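
The cell above reports only the test-set metrics. To see training and out-of-sample accuracy side by side, the same metrics can be computed on both halves of the split. The sketch below is one way to do that; it uses synthetic data (an assumption, so that it runs without the CSV), and the generating slope, intercept, and noise level are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CSV (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 8.0, size=(500, 1))                       # stand-in engine sizes
y = 125.0 + 39.0 * X[:, 0] + rng.normal(0.0, 30.0, size=500)   # stand-in emissions

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Training accuracy: scored on the rows the model was fitted to
train_pred = model.predict(X_train)
train_mse = mean_squared_error(y_train, train_pred)
train_r2 = r2_score(y_train, train_pred)

# Out-of-sample accuracy: scored on rows held back from fitting
test_pred = model.predict(X_test)
test_mse = mean_squared_error(y_test, test_pred)
test_r2 = r2_score(y_test, test_pred)

print(f"train: MSE={train_mse:.1f}, R-squared={train_r2:.3f}")
print(f"test:  MSE={test_mse:.1f}, R-squared={test_r2:.3f}")
```

A large gap between the two rows, with the training metrics much better, would suggest overfitting. A model with a single feature has little capacity to overfit, so the two rows are expected to be close here.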


## Interpretations

The linear regression model predicting CO2 emissions from engine size produced the following results:

Coefficient: 38.992978724434074

Intercept: 126.28970217408721

Mean Squared Error: 985.9381692274999

R-squared: 0.7615595731934373

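The coefficient indicates that each one-unit increase in engine size is associated with an increase of approximately 38.99 units in predicted CO2 emissions. The intercept, about 126.29, is the predicted value at an engine size of zero and mainly serves to position the line. The R-squared value of 0.76 means that the model explains about 76% of the variance in CO2 emissions among the held-out test rows.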
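
To make the interpretation concrete, the fitted line can be evaluated by hand as intercept plus coefficient times engine size. The sketch below copies the two fitted values from the printed output; the helper `predict_co2` and the engine sizes passed to it are hypothetical, introduced only for illustration.

```python
# Fitted values copied from the printed output above
coefficient = 38.992978724434074
intercept = 126.28970217408721


def predict_co2(engine_size):
    """Return the fitted line's CO2 estimate for a given engine size."""
    return intercept + coefficient * engine_size


# Hypothetical engine sizes chosen for illustration
for size in (2.0, 3.5):
    print(f"engine size {size}: predicted CO2 emissions {predict_co2(size):.2f}")

# Moving one unit along the x-axis changes the prediction by the coefficient
print(f"difference per unit: {predict_co2(3.0) - predict_co2(2.0):.2f}")
```

This is the same calculation that `model.predict` performs for a single-feature `LinearRegression`.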

## Predicting continuous values with MLR

The cell below trains a multiple linear regression (MLR) model that predicts CO2 emissions from all six remaining numerical features, using the same 80/20 train/test split as before, and evaluates it on the test set with mean squared error and R-squared.

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Define the features and target variable
X = numerical_df[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_CITY', 'FUELCONSUMPTION_HWY', 'FUELCONSUMPTION_COMB', 'FUELCONSUMPTION_COMB_MPG']]
y = numerical_df['CO2EMISSIONS']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the multiple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate the mean squared error and R-squared value
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Display the coefficients and performance metrics
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean Squared Error:", mse)
print("R-squared:", r2)

Coefficients: [  8.96115125   7.24023686 -13.00156667  -5.47215573  21.86697334
  -4.03446041]
Intercept: 262.57216086987376
Mean Squared Error: 399.4186889144271
R-squared: 0.9034041224574011


## Interpretations

The multiple linear regression model predicting CO2 emissions from the specified features produced the following results:

Coefficients: [8.96115125, 7.24023686, -13.00156667, -5.47215573, 21.86697334, -4.03446041]

Intercept: 262.57216086987376

Mean Squared Error: 399.4186889144271

R-squared: 0.9034041224574011

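The R-squared value of 0.90 means that the model explains about 90% of the variance in CO2 emissions among the held-out test rows, up from about 76% for the model that uses engine size alone. Over the same test split, the mean squared error falls from about 985.9 to about 399.4.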
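
Mean squared error is expressed in squared units of the target, which makes it hard to read directly. Taking the square root gives the root mean squared error (RMSE), which is in the same units as CO2EMISSIONS. The sketch below does this for the two test-set MSE values printed above; RMSE is not computed in the original notebook and is shown here only as an aid to comparison.

```python
import math

# Test-set mean squared errors copied from the two printed outputs above
mse_simple = 985.9381692274999     # ENGINESIZE only
mse_multiple = 399.4186889144271   # all six numerical features

# RMSE is the square root of MSE, so it shares the units of CO2EMISSIONS
rmse_simple = math.sqrt(mse_simple)
rmse_multiple = math.sqrt(mse_multiple)

print(f"RMSE, simple model:   {rmse_simple:.2f}")
print(f"RMSE, multiple model: {rmse_multiple:.2f}")
print(f"Relative reduction in MSE: {1 - mse_multiple / mse_simple:.1%}")
```

Both models were scored on the same test rows (same `test_size` and `random_state`), so the two RMSE values are directly comparable.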