<a href="https://colab.research.google.com/github/g-e-mm/SupervisedLearning/blob/main/LinearRegressionSGD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vehicle Performance Prediction by Linear Regression
---
using Stochastic Gradient Descent

- In this project, we aim to predict vehicle mileage (kilometers per liter) from technical specifications such as cylinder count, engine displacement, horsepower, weight, and acceleration.
- We use Linear Regression to model the relationship between these features and mileage.
- Stochastic Gradient Descent (SGD) is an optimization technique that minimizes the model's error by updating the weights iteratively, using one sample (or a small mini-batch) per update rather than the full dataset.
- Because each update is cheap, SGD scales well to large or high-dimensional datasets and often makes faster initial progress than full-batch gradient descent, although its updates are noisier. It is also sensitive to feature scale, so features are usually standardized before training. This efficiency and scalability make it a practical choice for real-world automotive datasets.
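
To make the update rule concrete, here is a minimal NumPy sketch of per-sample SGD for linear regression with squared-error loss. The function name `sgd_linear_regression`, the learning rate, the epoch count, and the synthetic data are illustrative assumptions and are not part of this project's pipeline.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=50, seed=0):
    """Fit y ~ X @ w + b by per-sample SGD on squared error (illustrative)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        # Visit the samples in a new random order each epoch
        for i in rng.permutation(n_samples):
            error = X[i] @ w + b - y[i]   # prediction error on one sample
            w -= lr * error * X[i]        # gradient of 0.5 * error**2 w.r.t. w
            b -= lr * error               # gradient of 0.5 * error**2 w.r.t. b
    return w, b

# Synthetic, noise-free data (hypothetical): y = 3*x1 - 2*x2 + 1
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 1.0

w_demo, b_demo = sgd_linear_regression(X_demo, y_demo)
print("weights:", w_demo, "intercept:", b_demo)
```

On this toy problem the learned weights should land close to the generating coefficients. The demo features are already on a unit scale; with raw columns such as `weight`, the same learning rate would likely diverge unless the features are standardized first.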

## Loading data and importing necessary libraries
---


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
path = "/content/drive/MyDrive/IMARTICUS/Linear Regression SGD/SGDdata.csv"
data = pd.read_csv(path)

## Exploratory Data Analysis
---

In [None]:
print(data.head())
print(data.describe())
data.info()  # info() prints its summary directly and returns None

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data['Kilometer_per_liter'], bins=30, kde=True)
plt.title('Distribution of Kilometer_per_liter')
plt.xlabel('Kilometer per Liter')
plt.ylabel('Frequency')
plt.show()

- Most cars have a mileage between 6 and 12 km/l.

- There's a noticeable right skew (long tail on the right), meaning a few cars have higher mileage (above 16 km/l), but they're less frequent.

- The mode (most common mileage) seems to be around 6-8 km/l.
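
The visual impression of right skew can also be checked numerically. The sketch below uses pandas' `Series.skew()` and compares mean and median on a small synthetic sample (hypothetical values, not the project data); with the real dataset, the same calls would apply to `data['Kilometer_per_liter']`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for the mileage column (hypothetical)
rng = np.random.default_rng(0)
mileage_demo = pd.Series(
    rng.gamma(shape=4.0, scale=2.5, size=500), name="Kilometer_per_liter"
)

skewness = mileage_demo.skew()          # > 0 indicates a longer right tail
mean_val = mileage_demo.mean()
median_val = mileage_demo.median()

print(f"skewness: {skewness:.2f}")
# A mean above the median is typical of a right-skewed distribution
print(f"mean: {mean_val:.2f}, median: {median_val:.2f}")
```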

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='weight', y='Kilometer_per_liter', data=data)
plt.title('Weight vs Kilometer_per_liter')
plt.xlabel('Weight of Car')
plt.ylabel('Kilometer per Liter')
plt.show()

- Negative Correlation:
There’s a clear downward trend — as car weight increases, mileage decreases. Heavier cars tend to be less fuel efficient.

- Compact Cars (Weight < 2500 lbs):
These have a wide mileage range, with some achieving over 18 km/l.

- Mid-weight Cars (2500–3500 lbs):
Mileage is more consistent here, generally between 8 and 14 km/l.

- Heavy Cars (> 4000 lbs):
Almost all have very low mileage (mostly under 8 km/l), confirming the inefficiency of bulky vehicles.

- Outliers:
A few heavy cars still manage decent mileage — worth investigating what makes them different (e.g., engine type, design).
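
The strength of the downward trend can be quantified with the Pearson correlation coefficient. The sketch below uses `Series.corr` on a synthetic frame (the weights, slope, and noise level are hypothetical); on the project data the equivalent call would be `data['weight'].corr(data['Kilometer_per_liter'])`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
weight_demo = rng.uniform(1600, 5000, size=300)   # hypothetical weights in lbs
# Hypothetical linear decline in mileage with weight, plus noise
kpl_demo = 20.0 - 0.003 * weight_demo + rng.normal(0, 1.5, size=300)
demo = pd.DataFrame({"weight": weight_demo, "Kilometer_per_liter": kpl_demo})

r = demo["weight"].corr(demo["Kilometer_per_liter"])   # Pearson by default
print(f"Pearson r: {r:.2f}")
```

A value near -1 indicates a strong negative linear relationship, while a value near 0 indicates little linear association.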

## Model building
---

**Splitting the data**

In [None]:

# Feature columns and target
X = data[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration']]
y = data['Kilometer_per_liter']

# Missing values are encoded as '?'; convert them to NaN, then fill with the column mean
X = X.replace('?', np.nan)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)  # note: fit on all rows, so test rows influence the imputed means

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.2, random_state=42)

In [None]:
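# Ordinary least squares via LinearRegression (a direct solver, not SGD)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)  # predictions on the held-out test set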

In [None]:
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
r2 = r2_score(y_test, y_pred)
print("R2 Score:", r2)
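
The cells above fit `LinearRegression`, which solves ordinary least squares directly, while `SGDRegressor` is imported but not used. A minimal sketch of an SGD-based variant follows. It assumes the features are standardized first, since SGD is sensitive to feature scale, and it runs on synthetic data (hypothetical coefficients and distributions) so that it is self-contained. The `max_iter`, `tol`, and `random_state` values are illustrative choices.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five feature columns and the target (hypothetical values)
rng = np.random.default_rng(0)
X_demo = rng.normal(
    loc=[6, 200, 100, 3000, 15], scale=[1.5, 100, 40, 800, 3], size=(400, 5)
)
y_demo = 20.0 - 0.003 * X_demo[:, 3] - 0.02 * X_demo[:, 2] + rng.normal(0, 1.0, size=400)
X_demo[rng.random(400) < 0.02, 2] = np.nan   # a few missing values in one column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Imputer and scaler are fit on the training split only, inside the pipeline
sgd_model = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=42),
)
sgd_model.fit(X_tr, y_tr)
y_hat = sgd_model.predict(X_te)

sgd_mse = mean_squared_error(y_te, y_hat)
sgd_r2 = r2_score(y_te, y_hat)
print("SGD MSE:", sgd_mse)
print("SGD R2:", sgd_r2)
```

Wrapping the imputer and scaler in one pipeline keeps preprocessing statistics tied to the training split, so the test rows do not influence them. To try this on the project data, `X_demo` and `y_demo` would be replaced by the feature frame (after mapping `'?'` to `NaN`) and the `Kilometer_per_liter` column.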

## ✅ Model Evaluation Summary

- **Mean Squared Error (MSE):** `2.65`  
  The average squared difference between predicted and actual values. A lower value indicates better performance.  
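  → RMSE = √2.65 ≈ **1.63**, so a typical prediction error is roughly 1.63 km/l. (RMSE weights large errors more heavily than the mean absolute error does, so it is not a plain average of the errors.)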

- **R² Score:** `0.7272`  
  Indicates that about **72.7%** of the variance in `Kilometer_per_liter` on the test set is explained by the model.  
  This suggests a **fairly strong linear relationship** between the features and the target variable, with roughly 27% of the variance left unexplained.
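
As a quick check of the arithmetic, the sketch below takes the square root of the reported MSE and shows how R² is computed from the residual and total sums of squares. The helper name `r2_from_definition` and the four-value toy arrays are illustrative assumptions.

```python
import numpy as np

mse_reported = 2.65                 # MSE value reported in this summary
rmse = np.sqrt(mse_reported)
print(f"RMSE: {rmse:.2f} km/l")

def r2_from_definition(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Toy values (hypothetical) to exercise the helper
y_true_demo = [8.0, 10.0, 12.0, 14.0]
y_pred_demo = [9.0, 10.0, 11.0, 14.0]
r2_demo = r2_from_definition(y_true_demo, y_pred_demo)
print(f"R^2 on toy values: {r2_demo:.2f}")
```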

---

## 📌 Conclusion

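- The model has **reasonable predictive accuracy**: an RMSE of about 1.63 km/l and an R² of about 0.73 on the test set.
- These results come from scikit-learn's `LinearRegression` (ordinary least squares), which is what the code above fits. `SGDRegressor` is imported but not used, so the metrics do not yet reflect a model trained with **Stochastic Gradient Descent (SGD)**.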

