
## 🚗 **Problem Statement**

Predict the **price of a car** based on various features such as car size, engine specifications, fuel type, and more.


## 🧾 **1. Understanding the Dataset**

### ✅ Target Variable:
- `price`: The price of the car (Continuous - Regression)

### 🧱 Feature Types:
- **Categorical Features (object)**: `carCompany`, `fueltype`, `aspiration`, `doornumber`, `carbody`, `drivewheel`, `enginelocation`, `enginetype`, `cylindernumber`, `fuelsystem`
- **Ordinal/Categorical**: `symboling`
- **Numerical Features**: All remaining columns
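
As a rough illustration of how this grouping can be checked programmatically, the sketch below splits columns by dtype with `select_dtypes`. The `toy` frame and its values are made up for the example; only the column names are borrowed from the dataset. Note that `symboling` shows up as numeric by dtype even though it is treated as ordinal above.

```python
import pandas as pd

# Hypothetical toy frame: column names come from the dataset, values are invented
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "carbody": ["sedan", "hatchback", "wagon"],
    "symboling": [3, 1, 0],
    "horsepower": [111, 154, 102],
})

# Non-numeric columns are candidates for encoding; numeric ones for scaling
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
numerical_cols = toy.select_dtypes(include="number").columns.tolist()

print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
```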



## 🧹 **2. Data Preprocessing**

### 🧼 A. Data Cleaning

In [89]:
import pandas as pd
import numpy as np

In [90]:
# Load the dataset
df = pd.read_csv("Car_Price.csv")

In [91]:
# Check for nulls
df.isnull().sum()

Column,Null count
car_ID,0
symboling,0
CarName,0
fueltype,0
aspiration,0
doornumber,0
carbody,0
drivewheel,0
enginelocation,0
wheelbase,0


In [92]:
# Remove car_ID as it's just a unique identifier
df.drop(['car_ID'], axis=1, inplace=True)

In [93]:
# Lowercase CarName and keep its first word as the car company (typos are fixed in a later cell)
df['CarName'] = df['CarName'].str.lower()
df['carCompany'] = df['CarName'].apply(lambda x: x.split(" ")[0])
df.drop('CarName', axis=1, inplace=True)

In [94]:
# Check for typos in car company names
df['carCompany'].value_counts()

carCompany,count
toyota,31
nissan,18
mazda,15
honda,13
mitsubishi,13
subaru,12
peugeot,11
volvo,11
volkswagen,9
dodge,9


In [95]:
# Fix common typos (example: 'maxda' -> 'mazda', 'porcshce' -> 'porsche')
# Assign the result back; calling replace(..., inplace=True) on a column selection is deprecated in pandas
df['carCompany'] = df['carCompany'].replace({'maxda':'mazda','porcshce':'porsche','toyouta':'toyota','vokswagen':'volkswagen','vw':'volkswagen'})


In [96]:
# Drop rows where price is missing
df.dropna(subset=['price'], inplace=True)

### 🔁 B. Data Transformation & Encoding

#### i. Encoding Categorical Variables

In [97]:
# Convert categorical variables to dummy/one-hot encoding
categorical_vars = ['fueltype', 'aspiration', 'doornumber', 'carbody', 'drivewheel',
                    'enginelocation', 'enginetype', 'cylindernumber', 'fuelsystem', 'carCompany']
df = pd.get_dummies(df, columns=categorical_vars, drop_first=True)
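
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up two-column frame (the values are illustrative, not taken from `Car_Price.csv`). For each encoded column, the first category in sorted order is dropped and acts as the implicit baseline.

```python
import pandas as pd

# Hypothetical toy data reusing two of the dataset's column names
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "drivewheel": ["fwd", "rwd", "4wd"],
})

# 'diesel' and '4wd' sort first, so their dummy columns are dropped
encoded = pd.get_dummies(toy, columns=["fueltype", "drivewheel"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level per feature avoids perfectly collinear dummy columns, which matters mainly for the unregularized linear model.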

### ⚖️ C. Feature Scaling

In [98]:
from sklearn.preprocessing import StandardScaler

In [99]:
# List of numerical columns to scale
num_cols = ['wheelbase', 'carlength', 'carwidth', 'carheight', 'curbweight',
            'enginesize', 'boreratio', 'stroke', 'compressionratio', 'horsepower',
            'peakrpm', 'citympg', 'highwaympg']

scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

## 🧪 **3. Train-Test Split**

In [100]:
from sklearn.model_selection import train_test_split

X = df.drop('price', axis=1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
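
Because the scaler in section 2C was fit on the whole DataFrame before this split, the test rows contribute to the means and standard deviations used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; `X_demo`, `y_demo`, and `demo_scaler` are hypothetical names, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 100 rows, 3 numeric features)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=100)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# ... then learn the scaling statistics from the training rows only
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)  # reuse training mean/std, no refit

print(X_tr_scaled.mean(axis=0).round(6))  # approximately zero per column
```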

## 🤖 **4. Model Building & Evaluation**

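We will evaluate each model on the test set with:
- **R2 Score**: share of the variance in price explained by the model; 1.0 is a perfect fit, and it can go negative when the model does worse than predicting the mean
- **RMSE**: root mean squared error, in the same units as price; penalizes large errors more heavily
- **MAE**: mean absolute error, in the same units as price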
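
To show what these three metrics compute, the sketch below works them out by hand with NumPy on four made-up prices and cross-checks the results against scikit-learn. The arrays `y_true` and `y_hat` are illustrative values, not model output from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted prices
y_true = np.array([10000.0, 15000.0, 20000.0, 25000.0])
y_hat = np.array([11000.0, 14000.0, 21000.0, 23000.0])

residuals = y_true - y_hat

mae = np.mean(np.abs(residuals))               # average absolute error
rmse = np.sqrt(np.mean(residuals ** 2))        # square root of the mean squared error
ss_res = np.sum(residuals ** 2)                # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2) # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

# The hand-computed values should agree with scikit-learn's implementations
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_hat)))
assert np.isclose(r2, r2_score(y_true, y_hat))

print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}  R2: {r2:.4f}")
```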

In [101]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

In [102]:
def evaluate_model(model, name):
    """Print R2, RMSE, and MAE for a fitted model on the held-out test set.

    Relies on the notebook-level X_test, X_poly_test, and y_test variables.
    """
    # The polynomial model was trained on expanded features, so it needs the transformed test data
    if name == "Polynomial Regression":
        y_pred = model.predict(X_poly_test)
    else:
        y_pred = model.predict(X_test)
    print(f"\n{name}")
    print(f"R2 Score: {r2_score(y_test, y_pred):.4f}")
    print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
    print(f"MAE: {mean_absolute_error(y_test, y_pred):.2f}")

### 1️⃣ **Linear Regression**

In [103]:
from sklearn.linear_model import LinearRegression

In [104]:
lr = LinearRegression()
lr.fit(X_train, y_train)
evaluate_model(lr, "Linear Regression")


Linear Regression
R2 Score: 0.9097
RMSE: 2669.93
MAE: 1763.57


### 2️⃣ **Ridge Regression**

In [105]:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1)
ridge.fit(X_train, y_train)
evaluate_model(ridge, "Ridge Regression")


Ridge Regression
R2 Score: 0.9006
RMSE: 2801.05
MAE: 1951.82


### 3️⃣ **Lasso Regression**

In [106]:
from sklearn.linear_model import Lasso

In [107]:
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
evaluate_model(lasso, "Lasso Regression")


Lasso Regression
R2 Score: 0.9003
RMSE: 2805.52
MAE: 1816.45
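
One way to see the practical difference between the L2 and L1 penalties is to count how many coefficients each model sets exactly to zero. The sketch below does this on synthetic data from `make_regression`, where only 5 of 20 features are informative. The data, the `alpha` values, and the `*_demo` names are assumptions for illustration and say nothing about the car dataset itself.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 of which actually drive the target
X_demo, y_demo = make_regression(
    n_samples=100, n_features=20, n_informative=5, noise=5.0, random_state=42
)

ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=1.0).fit(X_demo, y_demo)

# Ridge shrinks coefficients; Lasso can remove features by zeroing them
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))

print("Zero coefficients (Ridge):", n_zero_ridge)
print("Zero coefficients (Lasso):", n_zero_lasso)
```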


### 4️⃣ **Polynomial Regression**

In [108]:
from sklearn.preprocessing import PolynomialFeatures

In [109]:
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_train)
X_poly_test = poly.transform(X_test)

In [110]:
lr_poly = LinearRegression()
lr_poly.fit(X_poly, y_train)
evaluate_model(lr_poly, "Polynomial Regression")


Polynomial Regression
R2 Score: -10.5721
RMSE: 30225.00
MAE: 12387.80


### 5️⃣ **Decision Tree Regressor**

In [111]:
from sklearn.tree import DecisionTreeRegressor

In [112]:
dt = DecisionTreeRegressor(random_state=42, max_depth=5)
dt.fit(X_train, y_train)
evaluate_model(dt, "Decision Tree Regressor")


Decision Tree Regressor
R2 Score: 0.8910
RMSE: 2933.68
MAE: 2086.81


### 6️⃣ **Random Forest Regressor**

In [113]:
from sklearn.ensemble import RandomForestRegressor

In [114]:
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
evaluate_model(rf, "Random Forest Regressor")


Random Forest Regressor
R2 Score: 0.9583
RMSE: 1814.44
MAE: 1292.07


### 7️⃣ **Support Vector Regressor (SVR)**

In [115]:
from sklearn.svm import SVR

In [116]:
svr = SVR(kernel='rbf', C=100)
svr.fit(X_train, y_train)
evaluate_model(svr, "Support Vector Regressor")


Support Vector Regressor
R2 Score: 0.1078
RMSE: 8392.42
MAE: 4535.92
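
A possible contributor to the low SVR score is that the features were standardized but the target was left in raw price units, while `C` limits how strongly each training point can pull the prediction. This is an assumption that has not been verified on this dataset. The sketch below shows one way to test the idea on synthetic, price-like data by wrapping `SVR` in `TransformedTargetRegressor`, which scales the target for fitting and inverts the scaling at prediction time. All `*_demo` names and the data-generating formula are hypothetical.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: standardized features, target on a price-like scale (made up)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 15000 + 4000 * X_demo[:, 0] - 2500 * X_demo[:, 1] + rng.normal(scale=500, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same SVR settings as the notebook, fit on the raw target
svr_raw = SVR(kernel="rbf", C=100).fit(X_tr, y_tr)

# Same SVR, but the target is standardized internally and un-scaled on predict
svr_scaled = TransformedTargetRegressor(
    regressor=SVR(kernel="rbf", C=100),
    transformer=StandardScaler(),
).fit(X_tr, y_tr)

raw_r2 = svr_raw.score(X_te, y_te)
scaled_r2 = svr_scaled.score(X_te, y_te)
print(f"R2 with raw target: {raw_r2:.4f}")
print(f"R2 with scaled target: {scaled_r2:.4f}")
```

Raising `C` or tuning `epsilon` are other options that could be compared in the same way.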


### 8️⃣ **K-Nearest Neighbors Regressor**

In [117]:
from sklearn.neighbors import KNeighborsRegressor

In [118]:
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)
evaluate_model(knn, "K-Nearest Neighbors Regressor")


K-Nearest Neighbors Regressor
R2 Score: 0.7062
RMSE: 4815.88
MAE: 2808.11


## 🔍 **Hyperparameter Tuning Example: Random Forest**

In [119]:
from sklearn.model_selection import GridSearchCV

In [120]:
params = {
    'n_estimators': [100, 200],
    'max_depth': [5, 10],
    'min_samples_split': [2, 5]
}

In [121]:
grid_rf = GridSearchCV(RandomForestRegressor(random_state=42), param_grid=params, cv=3, scoring='r2')
grid_rf.fit(X_train, y_train)

In [122]:
print("Best Params:", grid_rf.best_params_)
evaluate_model(grid_rf.best_estimator_, "Tuned Random Forest")

Best Params: {'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 200}

Tuned Random Forest
R2 Score: 0.9598
RMSE: 1781.94
MAE: 1233.29
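
Beyond `best_params_`, a fitted `GridSearchCV` keeps the cross-validation scores for every combination in `cv_results_`, which helps judge whether the best setting is clearly better or only marginally so. The sketch below inspects that table on synthetic data with a smaller grid than the one above. The data and the `demo_grid` name are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the car features
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=42)

demo_grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [20, 50], "max_depth": [3, 6]},
    cv=3,
    scoring="r2",
)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination, with mean and spread of the fold scores
results = pd.DataFrame(demo_grid.cv_results_)
cols = ["param_n_estimators", "param_max_depth",
        "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
```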


## 🔍 **Regularization Summary**

- **Ridge**: L2 Regularization - reduces coefficient magnitudes to prevent overfitting.
- **Lasso**: L1 Regularization - can set some coefficients to 0 for feature selection.
- **Polynomial**: Useful for capturing non-linear trends but risks overfitting without regularization.
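
The last point can be explored by putting a regularized model behind the polynomial expansion. The sketch below builds two pipelines on synthetic data, one with plain `LinearRegression` and one with `Ridge`, so their test scores can be compared. The data, the `alpha=10.0` value, and the pipeline names are assumptions chosen for illustration; they are not tuned for the car dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: few rows relative to the number of degree-2 terms
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 8))
y_demo = 3.0 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.5, size=60)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Unregularized polynomial regression
poly_plain = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

# Polynomial expansion, rescaling of the expanded terms, then an L2-penalized fit
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=10.0),
)

for label, pipe in [("Plain", poly_plain), ("Ridge", poly_ridge)]:
    pipe.fit(X_tr, y_tr)
    print(f"{label} polynomial test R2: {pipe.score(X_te, y_te):.4f}")
```

With 8 input features, degree 2 already produces 44 terms, which is close to the number of training rows in this toy setup. The one-hot encoded car data expands far more.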



## 🔮 **Predicting New Samples**

In [130]:
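# Reuse the first training row as a stand-in for a new sample
# (a genuinely new car must first be encoded and scaled into the same columns as X_train)
sample = X_train.iloc[[0]]
price_pred = rf.predict(sample)
print("Predicted Price:", price_pred[0])
# To compare with the actual price: y_train.iloc[0]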

Predicted Price: 15821.9
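
Since `rf` expects rows that are already encoded and scaled, predicting for a raw new car means repeating the preprocessing by hand. One way to avoid that, sketched below under stated assumptions, is to bundle preprocessing and model into a single scikit-learn `Pipeline` so raw rows can be passed directly to `predict`. The `toy` frame, its values, and the `new_car` row are made up; only the column names are borrowed from the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature training set (values invented for illustration)
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas", "gas", "diesel", "gas"],
    "horsepower": [111, 154, 102, 115, 110, 140],
    "curbweight": [2548, 2823, 2337, 2824, 2507, 3086],
    "price": [13495, 16500, 13950, 17450, 15250, 23875],
})

# Encode categoricals and scale numerics inside the pipeline
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fueltype"]),
    ("num", StandardScaler(), ["horsepower", "curbweight"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
])
model.fit(toy.drop(columns="price"), toy["price"])

# A raw, unprocessed new sample with the same columns as the training features
new_car = pd.DataFrame({"fueltype": ["gas"], "horsepower": [120], "curbweight": [2600]})
predicted = model.predict(new_car)
print("Predicted Price:", predicted[0])
```

`handle_unknown="ignore"` lets the encoder accept a category it did not see during training instead of raising an error.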


## 📊 **Conclusion**

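- **Random Forest** gave the best test scores in this notebook (R2 0.9583). It and **Lasso Regression** are reasonable choices when dealing with many features or potential overfitting.
- **Polynomial Regression** is powerful but needs caution due to overfitting risk: with degree 2 on all features, the test R2 dropped to -10.5721.
- Always scale your data for distance- and kernel-based models like **SVR** and **KNN**.
- Use **GridSearchCV** to fine-tune your models; here tuning raised the Random Forest R2 only slightly, to 0.9598.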