### LOADING AND PREPROCESSING


The dataset is loaded into a pandas DataFrame with `pd.read_csv()`.

In [10]:
import pandas as pd

# Load dataset
df = pd.read_csv("CarPrice_Assignment.csv")



PREPROCESSING
- Inspect data types and identify missing values and other preprocessing needs.
- Handle missing values.
- Remove duplicates.
- Encode categorical features.



In [13]:
# Display basic information about the dataset

print("Dataset Information:")
df.info()  # info() prints its report directly and returns None

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   car_ID            205 non-null    int64  
 1   symboling         205 non-null    int64  
 2   CarName           205 non-null    object 
 3   fueltype          205 non-null    object 
 4   aspiration        205 non-null    object 
 5   doornumber        205 non-null    object 
 6   carbody           205 non-null    object 
 7   drivewheel        205 non-null    object 
 8   enginelocation    205 non-null    object 
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64  
 14  enginetype        205 non-null    object 
 15  cylindernumber    205 non-null    object 
 16  enginesize        205 n

In [15]:
# Remove duplicate rows

df.drop_duplicates(inplace=True)

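Duplicate rows are removed, leaving a cleaned dataset. The row count stays at 205 (compare `RangeIndex` above with `X_scale.shape` later), so the data contained no duplicate rows.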

In [18]:
# Check for missing values

print("\nMissing Values:")
print(df.isnull().sum())


Missing Values:
car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64


Handle Missing Values
The output above confirms that no column contains missing values, so no imputation is needed here. If missing values did exist, we could:

- Use mean or median imputation for numerical data.
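
As a minimal sketch of median imputation, assuming a small toy frame rather than the car dataset (the values below are hypothetical and chosen only for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing numeric value (hypothetical data, not from the car dataset)
toy = pd.DataFrame({
    "horsepower": [100.0, np.nan, 120.0, 200.0],
    "fueltype": ["gas", "gas", "diesel", "gas"],
})

# Impute only the numeric columns, each with its own median
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

print(toy)
```

The median is less sensitive to outliers than the mean, which is why it is often preferred for skewed columns such as prices.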



In [21]:
df.head()


Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0


### MODEL IMPLEMENTATION

In [24]:
# Separate the features (all columns except the last) from the target (price)

X = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [26]:
print(X.dtypes) 


car_ID                int64
symboling             int64
CarName              object
fueltype             object
aspiration           object
doornumber           object
carbody              object
drivewheel           object
enginelocation       object
wheelbase           float64
carlength           float64
carwidth            float64
carheight           float64
curbweight            int64
enginetype           object
cylindernumber       object
enginesize            int64
fuelsystem           object
boreratio           float64
stroke              float64
compressionratio    float64
horsepower            int64
peakrpm               int64
citympg               int64
highwaympg            int64
dtype: object


In [28]:
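# Coerce every column to numeric; values that cannot be parsed (all text columns) become NaN
X = X.apply(pd.to_numeric, errors='coerce')
# Replace the resulting NaN values with 0
X = X.fillna(0)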

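Most scikit-learn estimators require numeric input, so the text columns must be converted before training.
Note that `pd.to_numeric(..., errors='coerce')` does not encode categories: every value in a text column such as `CarName` or `fueltype` becomes NaN, and `fillna(0)` then turns the whole column into zeros. These columns become constant and carry no information for the models.
This is a quick way to obtain an all-numeric matrix, but it discards the categorical features rather than encoding them. The identifier column `car_ID` is also kept as a feature.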
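
One common alternative that keeps the categorical information is one-hot encoding. The following is a minimal sketch on a toy frame, not the method used in this notebook; the values are hypothetical:

```python
import pandas as pd

# Toy frame mimicking two of the dataset's columns (hypothetical values)
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "enginesize": [130, 152, 109],
})

# One-hot encode the text column; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["fueltype"], dtype=int)
print(encoded)
```

Each category becomes its own 0/1 column, so the models can still use it. Applying this to the real data would change every metric reported below.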

In [31]:
print(X.dtypes) 

car_ID                int64
symboling             int64
CarName             float64
fueltype            float64
aspiration          float64
doornumber          float64
carbody             float64
drivewheel          float64
enginelocation      float64
wheelbase           float64
carlength           float64
carwidth            float64
carheight           float64
curbweight            int64
enginetype          float64
cylindernumber      float64
enginesize            int64
fuelsystem          float64
boreratio           float64
stroke              float64
compressionratio    float64
horsepower            int64
peakrpm               int64
citympg               int64
highwaympg            int64
dtype: object


In [33]:
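# Standardisation: rescale each feature to zero mean and unit variance

from sklearn.preprocessing import StandardScaler

# Named `scaler` to avoid shadowing Python's built-in `object`
scaler = StandardScaler()
X_scale = scaler.fit_transform(X)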

In [34]:
X_scale

array([[-1.72362229,  1.74347043,  0.        , ..., -0.26296022,
        -0.64655303, -0.54605874],
       [-1.70672403,  1.74347043,  0.        , ..., -0.26296022,
        -0.64655303, -0.54605874],
       [-1.68982577,  0.133509  ,  0.        , ..., -0.26296022,
        -0.95301169, -0.69162706],
       ...,
       [ 1.68982577, -1.47645244,  0.        , ...,  0.78785546,
        -1.10624102, -1.12833203],
       [ 1.70672403, -1.47645244,  0.        , ..., -0.68328649,
         0.11959362, -0.54605874],
       [ 1.72362229, -1.47645244,  0.        , ...,  0.57769233,
        -0.95301169, -0.83719538]])

In [35]:
X_scale.shape

(205, 25)

In [36]:
# Linear Regression: hold out 33% of the rows for testing

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(
    X_scale, y, test_size=0.33, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

Linear Regression

How It Works:

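Linear Regression fits a linear function of the features 𝑋 (a line for one feature, a hyperplane for several) that minimizes the sum of squared differences between the predicted and actual values of the target 𝑌.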
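
A minimal sketch of this fit on synthetic data (hypothetical, not the car dataset). It also wraps the scaler and the regression in a pipeline so that the scaler is fitted on the training rows only, which is an assumption about good practice rather than what the cells above do:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): y depends linearly on two features plus noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 2))
y_demo = 5.0 + 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# The pipeline fits the scaler on the training rows only, then fits the regression
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

print("Test R^2:", model.score(X_te, y_te))
```

Because the data here is truly linear with little noise, the fit should be close to perfect. Real data such as car prices will generally fit less well.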

In [44]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [46]:
# Mean Squared Error (MSE); scikit-learn expects (y_true, y_pred)

mean_squared_error(y_test, y_pred_lr)
    

11424184.576565895

Mean Squared Error (MSE)

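Measures the average squared difference between actual and predicted values. Because the errors are squared, MSE is expressed in squared price units and penalizes large errors heavily. For the linear regression above, the square root of the MSE (RMSE) is about 3,380 in price units.

Lower MSE means better model performance.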



In [52]:
# Mean Absolute Error (MAE); scikit-learn expects (y_true, y_pred)

mean_absolute_error(y_test, y_pred_lr)


2402.831456067675

Mean Absolute Error (MAE)

Measures the average absolute difference between actual and predicted values, in the same units as the price.

Lower MAE means better model accuracy.
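
To make the two definitions concrete, here is a small worked example that computes MSE and MAE by hand and compares them with scikit-learn. The prices are hypothetical:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted prices
y_true = np.array([10000.0, 15000.0, 20000.0])
y_hat = np.array([11000.0, 14000.0, 23000.0])

errors = y_true - y_hat                 # [-1000, 1000, -3000]
mse_manual = np.mean(errors ** 2)       # (1e6 + 1e6 + 9e6) / 3
mae_manual = np.mean(np.abs(errors))    # (1000 + 1000 + 3000) / 3

print(mse_manual, mean_squared_error(y_true, y_hat))
print(mae_manual, mean_absolute_error(y_true, y_hat))
```

The single 3,000 error accounts for 9 of the 11 million in the squared sum, which shows how much more MSE weights large errors than MAE does.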

In [61]:
#R-squared Score (𝑅^2)

r2_score(y_pred_lr, y_test)


0.7937604167144586


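R-squared (R²)

Measures the proportion of variance in the target variable that the model explains. Higher values indicate better performance, and 1.0 is a perfect fit.

Unlike MSE and MAE, R² is not symmetric in its arguments. scikit-learn's signature is `r2_score(y_true, y_pred)`, whereas the cells in this notebook call `r2_score(y_pred, y_test)`. The reported values therefore use the variance of the predictions as the baseline and can differ from the conventional R².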
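
The effect of the argument order can be seen in a small sketch with hypothetical numbers, where the predictions barely vary. This may also explain very large negative scores such as the one reported for SVR later in this notebook, although that has not been verified here:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual prices
y_true = np.array([10000.0, 15000.0, 20000.0, 25000.0])
# Predictions that barely vary, like a model that outputs nearly a constant
y_hat = np.array([17400.0, 17500.0, 17500.0, 17600.0])

r2_conventional = r2_score(y_true, y_hat)   # baseline: variance of the actual values
r2_swapped = r2_score(y_hat, y_true)        # baseline: variance of the predictions

print("r2_score(y_true, y_pred):", r2_conventional)
print("r2_score(y_pred, y_true):", r2_swapped)
```

With the arguments swapped, the same residuals are divided by the tiny variance of the predictions, so the score becomes a large negative number.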


In [90]:
from sklearn.tree import DecisionTreeRegressor


dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)  
y_pred_dt = dt.predict(X_test)

The Decision Tree Regressor is a non-linear model that splits the data into different regions based on feature values and makes predictions using a tree-like structure.

It works well for capturing complex relationships but can overfit if not properly tuned.
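
The overfitting risk can be sketched by comparing an unconstrained tree with a depth-limited one on synthetic noisy data (hypothetical, not the car dataset). The choice `max_depth=3` is an arbitrary illustration, not a tuned value:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

deep = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)                  # no depth limit
shallow = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_tr, y_tr)  # constrained

for name, tree in [("deep", deep), ("shallow", shallow)]:
    print(name, "train R^2:", tree.score(X_tr, y_tr), "test R^2:", tree.score(X_te, y_te))
```

The unconstrained tree memorizes the training rows, so its training score is essentially perfect. A large gap between training and test scores is the usual sign of overfitting.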

In [93]:
mean_squared_error(y_test, y_pred_dt)


6224720.150263074

In [95]:
mean_absolute_error(y_test, y_pred_dt)


1734.151955882353

In [97]:
r2_score(y_pred_dt, y_test)


0.9151225135950479

In [76]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)  
y_pred_rf = rf.predict(X_test)

In [79]:
mean_squared_error(y_test, y_pred_rf)


3713387.5718385144

In [81]:
mean_absolute_error(y_test, y_pred_rf)


1391.958573529412

In [83]:
r2_score(y_pred_rf, y_test)


0.9355920115857586

In [85]:
from sklearn.ensemble import GradientBoostingRegressor
gb = GradientBoostingRegressor(n_estimators=100, random_state=42)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)


The Gradient Boosting Regressor (GBR) is an ensemble learning method that builds trees sequentially, with each new tree fitted to the errors left by the trees before it.

It is usually more robust than a single Decision Tree and can outperform Random Forest when well tuned. With the default settings used here, however, it scores below Random Forest on this test split (see the metrics below).
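
The sequential behaviour can be observed with `staged_predict`, which yields the ensemble's prediction after each added tree. This is a minimal sketch on synthetic data (hypothetical), tracking the training error only:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.2, size=300)

gb_demo = GradientBoostingRegressor(n_estimators=100, random_state=42)
gb_demo.fit(X_demo, y_demo)

# staged_predict yields the ensemble's prediction after each added tree
train_mse = [mean_squared_error(y_demo, pred) for pred in gb_demo.staged_predict(X_demo)]

print("MSE after   1 tree :", train_mse[0])
print("MSE after 100 trees:", train_mse[-1])
```

The training error falls as trees are added. The error on held-out data can start rising again at some point, which is why the number of trees is usually tuned.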

In [35]:
mean_squared_error(y_test, y_pred_gb)


5249013.95599746

In [37]:
mean_absolute_error(y_test, y_pred_gb)


1666.4562367007948

In [39]:
r2_score(y_pred_gb, y_test)


0.9269552482133443

In [26]:
from sklearn.svm import SVR
svr = SVR(kernel='rbf')
svr.fit(X_train, y_train)  # SVR is sensitive to feature scale; X_train is already standardized
y_pred_svr = svr.predict(X_test)

The Support Vector Regressor (SVR) is a regression model based on Support Vector Machines (SVM).

With a non-linear kernel such as RBF, it can capture complex, non-linear relationships between the features and the target.

Unlike tree-based models, SVR fits a function that keeps as many points as possible within a margin (epsilon) of its predictions, ignoring deviations smaller than the margin and penalizing larger ones.
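
SVR's default `C=1.0` and `epsilon=0.1` are small relative to a target measured in thousands, which may be one reason for the weak SVR results below. One possible remedy, sketched here on synthetic data (hypothetical) and not applied in this notebook, is to standardize the target with `TransformedTargetRegressor`:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data (hypothetical) with a target on a price-like scale
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 10000 + 3000 * X_demo[:, 0] + 2000 * X_demo[:, 1] + rng.normal(scale=100, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# Plain SVR with default C and epsilon on the raw target
plain = SVR(kernel="rbf").fit(X_tr, y_tr)

# Same SVR, but the target is standardized for fitting and mapped back for predictions
scaled = TransformedTargetRegressor(regressor=SVR(kernel="rbf"), transformer=StandardScaler())
scaled.fit(X_tr, y_tr)

print("plain  test R^2:", plain.score(X_te, y_te))
print("scaled test R^2:", scaled.score(X_te, y_te))
```

Tuning `C` and `epsilon` directly is another option. Either way, the result on the car data would need to be measured rather than assumed.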



In [27]:
mean_squared_error(y_test, y_pred_svr)


69764315.30103254

In [28]:
mean_absolute_error(y_test, y_pred_svr)


5329.326682945467

In [29]:
r2_score(y_pred_svr, y_test)


-359301.1097648825

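### Random Forest Regressor is the best-performing model

It has the best score on all three metrics among the five models. Compared with the Decision Tree Regressor:

- It has a lower MSE (3,713,387.57 vs. 6,224,720.15), meaning fewer large prediction errors.
- It has a lower MAE (1,391.96 vs. 1,734.15), meaning its predictions are closer to the actual prices on average.
- It has a higher R² score (0.9356 vs. 0.9151). Because this score was computed as `r2_score(y_pred_rf, y_test)`, it should be read as indicative rather than as the exact share of variance in car prices explained.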
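
For a side-by-side view, the metrics reported in the cells above can be collected into one table. The numbers below are copied from the notebook outputs; the R² column keeps the values exactly as the notebook computed them:

```python
import pandas as pd

# Metrics exactly as reported in the cells above
results = pd.DataFrame(
    {
        "MSE": [11424184.576565895, 6224720.150263074, 3713387.5718385144,
                5249013.95599746, 69764315.30103254],
        "MAE": [2402.831456067675, 1734.151955882353, 1391.958573529412,
                1666.4562367007948, 5329.326682945467],
        "R2": [0.7937604167144586, 0.9151225135950479, 0.9355920115857586,
               0.9269552482133443, -359301.1097648825],
    },
    index=["Linear Regression", "Decision Tree", "Random Forest", "Gradient Boosting", "SVR"],
)

# Sort so the model with the lowest MAE comes first
print(results.sort_values("MAE"))
```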

In [31]:
#Feature Importance Analysis

from sklearn.feature_selection import SelectKBest, f_regression

# Using SelectKBest to find top features
selector = SelectKBest(score_func=f_regression, k=10)
X_new = selector.fit_transform(X_scale, y)
# Selected features
selected_features = X.columns.values[selector.get_support()]
print("Top Features Selected:", list(selected_features))



Top Features Selected: ['wheelbase', 'carlength', 'carwidth', 'carheight', 'curbweight', 'enginesize', 'boreratio', 'horsepower', 'citympg', 'highwaympg']


Feature Importance Analysis
Feature importance analysis identifies the variables most strongly associated with car prices. Selecting a smaller, relevant set of features can improve model accuracy, reduce overfitting, and make the model easier to interpret.



Feature Selection using SelectKBest
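SelectKBest is a feature selection method in scikit-learn that keeps the K features with the highest scores from a univariate statistical test. Here `f_regression` scores each feature by the F-statistic of its linear relationship with the price, and `k=10` keeps the ten highest-scoring features. All ten selected features are numeric columns of the original data. The selection is only reported here: the models in this notebook are trained on all 25 columns.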
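
How the scores drive the selection can be sketched on synthetic data (hypothetical), where only the first two of five features influence the target. The feature names `f0` to `f4` are placeholders:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (hypothetical): only the first two of five features drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)
names = np.array(["f0", "f1", "f2", "f3", "f4"])

selector_demo = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# scores_ holds one F-statistic per feature; get_support() marks the k kept features
for name, score in zip(names, selector_demo.scores_):
    print(f"{name}: F = {score:.1f}")
print("Selected:", list(names[selector_demo.get_support()]))
```

Because `f_regression` tests each feature on its own, it can miss features that only matter in combination with others, and it only detects linear relationships.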

Hyperparameter Tuning

Hyperparameter tuning searches for the parameter settings that give a model its best performance.
We will use `GridSearchCV` to tune the Random Forest Regressor, scoring each parameter combination by its mean R² over 5-fold cross-validation on the training set.

In [106]:
from sklearn.model_selection import GridSearchCV

# Define a refined grid based on previous results
param_grid = {
    "n_estimators": [200, 300, 400],
    "max_depth": [10, 15, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt", "log2"]
}

# Grid Search
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42), param_grid, cv=5, scoring="r2", n_jobs=-1
)
grid_search.fit(X_train, y_train)

# Best Parameters and Score
print("Best Parameters:", grid_search.best_params_)
print("Best R² Score:", grid_search.best_score_)

Best Parameters: {'max_depth': 20, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}
Best R² Score: 0.9048085683592093
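
Note that `best_score_` is the mean cross-validated R² on the training folds, so it is not directly comparable to the test-set values reported earlier. A natural follow-up, sketched here on synthetic stand-in data with a deliberately tiny grid (both are assumptions made so the example is self-contained), is to evaluate `best_estimator_` on the held-out test set:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (hypothetical); in the notebook these would be X_train, X_test, etc.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# Deliberately tiny grid so the sketch runs quickly
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    {"n_estimators": [50, 100], "max_depth": [5, 10]},
    cv=5,
    scoring="r2",
)
search.fit(X_tr, y_tr)

# best_estimator_ is refit on the full training set (refit=True by default)
y_pred = search.best_estimator_.predict(X_te)
print("Mean CV R^2 :", search.best_score_)
print("Test-set R^2:", r2_score(y_te, y_pred))
```

Comparing the tuned model's test-set metrics with those of the untuned Random Forest would show whether the tuning actually helped on this dataset.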
