# Comprehensive Assessment: Machine Learning
This project models the price of cars using the available independent variables. Management will use the model to understand how prices vary with those variables, so that they can adjust car design, business strategy, and so on to meet target price levels. The model also gives management a way to understand the pricing dynamics of a new market.


# 1. Loading and Preprocessing
## Step 1: Load the Dataset



In [14]:
import pandas as pd

data = pd.read_csv('C:/Users/user/Downloads/CarPrice_Assignment.csv')
data.head()

,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0


## Step 2: Data Exploration

Explore the dataset to understand its structure and check for any missing values.

In [15]:
data.info()
data.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   car_ID            205 non-null    int64  
 1   symboling         205 non-null    int64  
 2   CarName           205 non-null    object 
 3   fueltype          205 non-null    object 
 4   aspiration        205 non-null    object 
 5   doornumber        205 non-null    object 
 6   carbody           205 non-null    object 
 7   drivewheel        205 non-null    object 
 8   enginelocation    205 non-null    object 
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64  
 14  enginetype        205 non-null    object 
 15  cylindernumber    205 non-null    object 
 16  enginesize        205 non-null    int64  
 ...

,car_ID,symboling,wheelbase,carlength,carwidth,carheight,curbweight,enginesize,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
count,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0
mean,103.0,0.834146,98.756585,174.049268,65.907805,53.724878,2555.565854,126.907317,3.329756,3.255415,10.142537,104.117073,5125.121951,25.219512,30.75122,13276.710571
std,59.322565,1.245307,6.021776,12.337289,2.145204,2.443522,520.680204,41.642693,0.270844,0.313597,3.97204,39.544167,476.985643,6.542142,6.886443,7988.852332
min,1.0,-2.0,86.6,141.1,60.3,47.8,1488.0,61.0,2.54,2.07,7.0,48.0,4150.0,13.0,16.0,5118.0
25%,52.0,0.0,94.5,166.3,64.1,52.0,2145.0,97.0,3.15,3.11,8.6,70.0,4800.0,19.0,25.0,7788.0
50%,103.0,1.0,97.0,173.2,65.5,54.1,2414.0,120.0,3.31,3.29,9.0,95.0,5200.0,24.0,30.0,10295.0
75%,154.0,2.0,102.4,183.1,66.9,55.5,2935.0,141.0,3.58,3.41,9.4,116.0,5500.0,30.0,34.0,16503.0
max,205.0,3.0,120.9,208.1,72.3,59.8,4066.0,326.0,3.94,4.17,23.0,288.0,6600.0,49.0,54.0,45400.0


In [12]:
data.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
200    False
201    False
202    False
203    False
204    False
Length: 205, dtype: bool
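
The Boolean Series above is hard to read at a glance. As a minimal sketch of a more compact check, the example below reduces both the duplicate check and the missing-value check to counts. It uses a small made-up frame (`toy` is hypothetical and is not the car dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for the car data (values are made up).
toy = pd.DataFrame({
    "fueltype": ["gas", "gas", "diesel", "gas"],
    "horsepower": [111, 111, 68, None],
    "price": [13495.0, 13495.0, 7099.0, 9980.0],
})

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

print(f"Duplicate rows: {n_duplicates}")
print(missing_per_column)
```

On the car data, the same two calls would be `data.duplicated().sum()` and `data.isna().sum()`.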

## Step 3: Data Cleaning

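Drop unnecessary columns: car_ID is only a row identifier and CarName is a free-text model name, so neither is used as a predictor here. Drop both.

Handle missing values: Drop any rows where the target variable (price) is missing. The describe() output above reports a count of 205 for price, so no rows are actually removed.

One-hot encoding: Convert the categorical features to numeric indicator columns. Passing drop_first=True drops the first level of each feature, which avoids redundant columns.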

In [16]:
data.drop(['car_ID', 'CarName'], axis=1, inplace=True)

data.dropna(subset=['price'], inplace=True)

categorical_cols = ['fueltype', 'aspiration', 'doornumber', 'carbody', 
                    'drivewheel', 'enginelocation', 'enginetype', 
                    'cylindernumber', 'fuelsystem']
data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)

data.head()

,symboling,wheelbase,carlength,carwidth,carheight,curbweight,enginesize,boreratio,stroke,compressionratio,...,cylindernumber_three,cylindernumber_twelve,cylindernumber_two,fuelsystem_2bbl,fuelsystem_4bbl,fuelsystem_idi,fuelsystem_mfi,fuelsystem_mpfi,fuelsystem_spdi,fuelsystem_spfi
0,3,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,...,False,False,False,False,False,False,False,True,False,False
1,3,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,...,False,False,False,False,False,False,False,True,False,False
2,1,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,9.0,...,False,False,False,False,False,False,False,True,False,False
3,2,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,10.0,...,False,False,False,False,False,False,False,True,False,False
4,2,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,8.0,...,False,False,False,False,False,False,False,True,False,False
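
To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical three-row frame (`toy` is illustrative only, though the category values appear in the dataset). It compares the indicator columns produced with and without the option:

```python
import pandas as pd

# Hypothetical three-row frame; one row per drivewheel category.
toy = pd.DataFrame({"drivewheel": ["rwd", "fwd", "4wd"]})

# Without drop_first: one indicator column per category.
full = pd.get_dummies(toy, columns=["drivewheel"])

# With drop_first=True: the first category in sorted order ("4wd") is dropped.
# It becomes the baseline, encoded by all remaining indicators being False.
reduced = pd.get_dummies(toy, columns=["drivewheel"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

This matches the encoded car data, which contains `drivewheel_fwd` and `drivewheel_rwd` but no `drivewheel_4wd` column.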


# 2. Model Implementation
## Step 1: Split the Data

Separate features and target variable, and split the data into training and testing sets.

In [17]:
from sklearn.model_selection import train_test_split

X = data.drop('price', axis=1)
y = data['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
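
As a quick sanity check on what this split produces, the sketch below applies the same `test_size` and `random_state` to a synthetic array with 205 rows. `X_demo` and `y_demo` are made-up stand-ins, not the car data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the car data (205).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(205, 3))
y_demo = rng.normal(size=205)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Report how many rows end up in each partition.
print(f"Training rows: {X_tr.shape[0]}, test rows: {X_te.shape[0]}")
```

With only a few dozen test rows, the test-set metrics reported later can shift noticeably from one split to another, which is worth keeping in mind when comparing models.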

## Step 2: Implement Regression Models

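Fit five regression models from scikit-learn on the training set: linear regression, a decision tree, a random forest, gradient boosting, and a support vector regressor.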

In [18]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(),
    'Random Forest': RandomForestRegressor(),
    'Gradient Boosting': GradientBoostingRegressor(),
    'Support Vector Regressor': SVR()
}

for name, model in models.items():
    model.fit(X_train, y_train)
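
The tree-based models above are created without a `random_state`, so their fitted trees, and therefore their scores, can differ slightly between runs. A minimal sketch of how a fixed seed makes a fit repeatable, using synthetic data (`X_demo`, `y_demo`, and the seed value 42 are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends mainly on the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)

# Two forests built with the same seed on the same data.
rf_a = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)
rf_b = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Identical seeds give identical predictions.
same_predictions = np.allclose(rf_a.predict(X_demo), rf_b.predict(X_demo))
print(same_predictions)
```

The same argument is accepted by `DecisionTreeRegressor` and `GradientBoostingRegressor`.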

# 3. Model Evaluation
## Step 1: Evaluate Performance

Use R-squared, Mean Squared Error (MSE), and Mean Absolute Error (MAE) to evaluate each model.
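
As a reference for how the three metrics are defined, here is a minimal hand computation in plain Python. The prices are made up for illustration and do not come from the dataset:

```python
# Made-up true and predicted prices, for illustration only.
y_true = [10000.0, 15000.0, 20000.0, 25000.0]
y_hat = [11000.0, 14000.0, 20000.0, 27000.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_hat)]

# MAE: average absolute error, in the same units as price.
mae = sum(abs(e) for e in errors) / n

# MSE: average squared error, which penalizes large misses more heavily.
mse = sum(e ** 2 for e in errors) / n

# R-squared: 1 minus (residual sum of squares / total sum of squares).
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae}, MSE={mse}, R-squared={r2:.3f}")
```

Because R-squared compares the model against always predicting the mean, it becomes negative whenever the residual sum of squares exceeds the total sum of squares.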

In [19]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

results = {}
for name, model in models.items():
    y_pred = model.predict(X_test)
    results[name] = {
        'R-squared': r2_score(y_test, y_pred),
        'MSE': mean_squared_error(y_test, y_pred),
        'MAE': mean_absolute_error(y_test, y_pred)
    }

results_df = pd.DataFrame(results).T
print(results_df)

                          R-squared           MSE          MAE
Linear Regression          0.892557  8.482008e+06  2089.382729
Decision Tree              0.890801  8.620583e+06  1981.967488
Random Forest              0.956316  3.448584e+06  1270.103447
Gradient Boosting          0.923058  6.074109e+06  1736.886027
Support Vector Regressor  -0.101973  8.699416e+07  5707.106446
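
The negative R-squared for the Support Vector Regressor means it predicts worse than a constant equal to the mean test price. One plausible cause, stated here as an assumption rather than something verified on the car data, is that `SVR` with default settings was fitted on unscaled features and an unscaled target. The sketch below illustrates the idea on synthetic data (`engine`, `weight`, and the price formula are invented), comparing a default `SVR` against one wrapped with `StandardScaler` for both the features and the target:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic, price-like data on very different scales (all values invented).
rng = np.random.default_rng(0)
n = 300
engine = rng.uniform(60, 330, size=n)
weight = rng.uniform(1500, 4000, size=n)
X_demo = np.column_stack([engine, weight])
y_demo = 5000 + 100 * engine + 3 * weight + rng.normal(scale=500, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Default SVR on raw features and raw target.
raw_svr = SVR().fit(X_tr, y_tr)
r2_raw = r2_score(y_te, raw_svr.predict(X_te))

# Same SVR, but with standardized features and a standardized target.
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
)
scaled_svr.fit(X_tr, y_tr)
r2_scaled = r2_score(y_te, scaled_svr.predict(X_te))

print(f"R-squared without scaling: {r2_raw:.3f}")
print(f"R-squared with scaling:    {r2_scaled:.3f}")
```

Whether scaling closes the gap on the real dataset would need to be checked by refitting the model there.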


## Step 2: Identify Best Model

Select the model with the highest R-squared on the test set.

In [20]:
best_model = results_df['R-squared'].idxmax()
print(f"The best performing model is: {best_model}")

The best performing model is: Random Forest
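
A ranking based on a single small test set can be sensitive to how the data happened to be split. One way to cross-check it is k-fold cross-validation. The sketch below is illustrative only: it uses synthetic data from `make_regression` and just two of the candidate models, and `candidates` and `cv_summary` are names introduced here:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem with the same row count as the car data.
X_demo, y_demo = make_regression(
    n_samples=205, n_features=5, noise=10.0, random_state=0
)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Mean and spread of R-squared across 5 folds for each candidate.
cv_summary = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="r2")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R-squared = {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the spread alongside the mean shows whether the gap between two models is larger than the fold-to-fold variation.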


# 4. Feature Importance Analysis
## Step 1: Extract Feature Importance

For tree-based models like Random Forest, extract feature importances.

In [21]:
importances = models['Random Forest'].feature_importances_
feature_importances = pd.Series(importances, index=X.columns).sort_values(ascending=False)

# Display feature importances
print(feature_importances)

enginesize               0.600386
curbweight               0.252484
highwaympg               0.049909
citympg                  0.017172
horsepower               0.016404
carwidth                 0.011495
carlength                0.009790
peakrpm                  0.007383
wheelbase                0.006517
stroke                   0.005001
boreratio                0.004581
carheight                0.003710
compressionratio         0.002891
enginetype_ohc           0.001265
carbody_hatchback        0.001164
symboling                0.001042
aspiration_turbo         0.001021
fuelsystem_mpfi          0.001016
fuelsystem_2bbl          0.001012
cylindernumber_four      0.001004
carbody_hardtop          0.000847
carbody_sedan            0.000789
cylindernumber_six       0.000592
drivewheel_rwd           0.000492
doornumber_two           0.000467
drivewheel_fwd           0.000363
enginetype_l             0.000230
carbody_wagon            0.000186
fuelsystem_idi           0.000160
fuelsystem_spf
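
The values above are impurity-based importances, which are computed from the training data. As a cross-check, permutation importance measures how much a held-out score drops when one feature is shuffled. The following is a sketch on synthetic data, not a result for the car dataset. The column names `signal`, `noise_a`, and `noise_b` are invented, and only `signal` influences the target:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on "signal" only.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(300, 3)), columns=["signal", "noise_a", "noise_b"]
)
y_demo = 5.0 * X_demo["signal"] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out set and average the score drop.
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
perm_importances = pd.Series(
    perm.importances_mean, index=X_demo.columns
).sort_values(ascending=False)

print(perm_importances)
```

If both methods rank the same features at the top, that agreement adds some confidence to the interpretation.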

# 5. Hyperparameter Tuning
## Step 1: Hyperparameter Tuning with Grid Search

Use GridSearchCV to tune the hyperparameters of the best model, Random Forest.

In [25]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid for Random Forest
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

best_rf = grid_search.best_estimator_
y_pred_best_rf = best_rf.predict(X_test)

# Evaluate the best tuned model
best_r2 = r2_score(y_test, y_pred_best_rf)
best_mse = mean_squared_error(y_test, y_pred_best_rf)
best_mae = mean_absolute_error(y_test, y_pred_best_rf)

print(f"Tuned Random Forest R-squared: {best_r2}, MSE: {best_mse}, MAE: {best_mae}")

Tuned Random Forest R-squared: 0.9570279778226529, MSE: 3392384.2160095964, MAE: 1287.2716779148639
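
It is often useful to see which parameter combination the search selected and its cross-validated score, not only the test metrics. A minimal sketch on synthetic data, with a deliberately small grid so it runs quickly (`small_grid`, `search`, and the data are illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the training set.
X_demo, y_demo = make_regression(
    n_samples=120, n_features=5, noise=10.0, random_state=0
)

# A reduced grid: 2 x 2 = 4 combinations, each evaluated with 3-fold CV.
small_grid = {"n_estimators": [10, 30], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    small_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# best_score_ holds the negated MSE, so flip the sign to recover the MSE.
best_cv_mse = -search.best_score_

print(f"Best parameters: {search.best_params_}")
print(f"Cross-validated MSE of the best combination: {best_cv_mse:.1f}")
```

On the notebook's own objects, the corresponding attributes are `grid_search.best_params_` and `grid_search.best_score_`.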


# Conclusion
This analysis identifies the factors that most affect car prices in the dataset. Random Forest was the best-performing model on the test set (R-squared of about 0.956), and its feature importances point to enginesize and curbweight as the dominant drivers of price, together accounting for roughly 85% of total importance. Grid search changed test performance only marginally (R-squared 0.957 versus 0.956, MAE 1287 versus 1270), so the default settings were already close to the best configuration in the searched grid. The tuned model can be used for further predictions and to support data-driven decision-making for the automobile company.



