## Multiple Linear Regression with some categorical X columns
The **Automobile dataset (UCI Car Price dataset)** describes **1985 car models and their prices**.

* It was collected from the 1985 Ward's Automotive Yearbook.
* Each row represents a **specific car model (make, type, specs)** that was on the market at that time.
* The **target variable** (`price`) is the **selling price of the car (in U.S. dollars)**.

### 📌 What it contains:

* **Car specifications** (e.g., horsepower, engine-size, curb-weight, fuel system).
* **Categorical attributes** (e.g., make, fuel-type, body-style, drive-wheels).
* **Numeric attributes** (e.g., length, width, mpg, horsepower).
* **Price** = the **market value** (depends on brand, specs, and features).

##### This dataset is well suited to **Multiple Linear Regression (MLR)**
---


In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


In [3]:

# Load dataset (from UCI or Kaggle link)

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"

# Column names (as per UCI documentation)
columns = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
           "num-of-doors", "body-style", "drive-wheels", "engine-location",
           "wheel-base", "length", "width", "height", "curb-weight",
           "engine-type", "num-of-cylinders", "engine-size", "fuel-system",
           "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm",
           "city-mpg", "highway-mpg", "price"]

df = pd.read_csv(url, names=columns)
df


Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.40,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.40,8.0,115,5500,18,22,17450
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114,5400,23,28,16845
201,-1,95,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160,5300,19,25,19045
202,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134,5500,18,23,21485
203,-1,95,volvo,diesel,turbo,four,sedan,rwd,front,109.1,...,145,idi,3.01,3.40,23.0,106,4800,26,27,22470


In [5]:
df.drop("make", axis=1, inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 25 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  205 non-null    object 
 2   fuel-type          205 non-null    object 
 3   aspiration         205 non-null    object 
 4   num-of-doors       205 non-null    object 
 5   body-style         205 non-null    object 
 6   drive-wheels       205 non-null    object 
 7   engine-location    205 non-null    object 
 8   wheel-base         205 non-null    float64
 9   length             205 non-null    float64
 10  width              205 non-null    float64
 11  height             205 non-null    float64
 12  curb-weight        205 non-null    int64  
 13  engine-type        205 non-null    object 
 14  num-of-cylinders   205 non-null    object 
 15  engine-size        205 non-null    int64  
 16  fuel-system        205 non

In [9]:
# To observe the numerical columns appearing as object data type

object_cols = df.drop("price", axis=1).select_dtypes(include="object").columns.to_list()
df[object_cols].T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,195,196,197,198,199,200,201,202,203,204
normalized-losses,?,?,?,164,164,?,158,?,158,?,...,74,103,74,103,74,95,95,95,95,95
fuel-type,gas,gas,gas,gas,gas,gas,gas,gas,gas,gas,...,gas,gas,gas,gas,gas,gas,gas,gas,diesel,gas
aspiration,std,std,std,std,std,std,std,std,turbo,turbo,...,std,std,std,turbo,turbo,std,turbo,std,turbo,turbo
num-of-doors,two,two,two,four,four,two,four,four,four,two,...,four,four,four,four,four,four,four,four,four,four
body-style,convertible,convertible,hatchback,sedan,sedan,sedan,sedan,wagon,sedan,hatchback,...,wagon,sedan,wagon,sedan,wagon,sedan,sedan,sedan,sedan,sedan
drive-wheels,rwd,rwd,rwd,fwd,4wd,fwd,fwd,fwd,fwd,4wd,...,rwd,rwd,rwd,rwd,rwd,rwd,rwd,rwd,rwd,rwd
engine-location,front,front,front,front,front,front,front,front,front,front,...,front,front,front,front,front,front,front,front,front,front
engine-type,dohc,dohc,ohcv,ohc,ohc,ohc,ohc,ohc,ohc,ohc,...,ohc,ohc,ohc,ohc,ohc,ohc,ohc,ohcv,ohc,ohc
num-of-cylinders,four,four,six,four,five,five,five,five,five,five,...,four,four,four,four,four,four,four,six,six,four
fuel-system,mpfi,mpfi,mpfi,mpfi,mpfi,mpfi,mpfi,mpfi,mpfi,mpfi,...,mpfi,mpfi,mpfi,mpfi,mpfi,mpfi,mpfi,mpfi,idi,mpfi


**Columns ["normalized-losses", "bore", "stroke", "horsepower",
"peak-rpm", "price"] are actually numeric but are stored as `object` because their missing values are written as the string "?".**

In [11]:
df.isnull().sum()

symboling            0
normalized-losses    0
fuel-type            0
aspiration           0
num-of-doors         0
body-style           0
drive-wheels         0
engine-location      0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-type          0
num-of-cylinders     0
engine-size          0
fuel-system          0
bore                 0
stroke               0
compression-ratio    0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64
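
Every count above is zero even though the data contains `?` entries, because `?` is an ordinary string rather than a missing value. One possible way to spot such columns is to try a numeric conversion and see how much of each column parses. The sketch below uses a small made-up frame; `toy`, `numeric_like_columns` and the `threshold` value are illustrative names and choices, not part of this notebook.

```python
import pandas as pd

# Made-up stand-in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "horsepower": ["111", "154", "?", "102"],
    "fuel-type": ["gas", "gas", "diesel", "gas"],
    "bore": ["3.47", "?", "3.19", "3.19"],
})

def numeric_like_columns(frame, threshold=0.5):
    """Return non-numeric columns where at least `threshold` of values parse as numbers."""
    found = []
    for col in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[col]):
            continue  # already numeric, nothing to detect
        # errors="coerce" turns unparseable entries such as "?" into NaN
        parsed = pd.to_numeric(frame[col], errors="coerce")
        if parsed.notna().mean() >= threshold:
            found.append(col)
    return found

numeric_like = numeric_like_columns(toy)
print(numeric_like)
```

Here `horsepower` and `bore` are flagged because three of their four values parse as numbers, while `fuel-type` is not.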

In [13]:

# Data cleaning

df.replace("?", np.nan, inplace=True)
df.isnull().sum()

symboling             0
normalized-losses    41
fuel-type             0
aspiration            0
num-of-doors          2
body-style            0
drive-wheels          0
engine-location       0
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-type           0
num-of-cylinders      0
engine-size           0
fuel-system           0
bore                  4
stroke                4
compression-ratio     0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64
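
As an alternative that this notebook does not use, the `?` marker can be declared at load time through the `na_values` parameter of `pd.read_csv`, in which case the affected columns should be parsed as numeric directly. The sketch below runs on three made-up rows held in memory; the column subset and the values are assumptions for illustration.

```python
import io
import pandas as pd

# Three made-up rows in the same headerless layout, with "?" marking missing values
raw = io.StringIO("111,5000,13495\n?,5500,?\n154,?,16500\n")

toy = pd.read_csv(raw, names=["horsepower", "peak-rpm", "price"], na_values="?")

print(toy.dtypes)          # numeric dtypes instead of object
print(toy.isnull().sum())  # one missing value per column
```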

In [15]:
# Drop rows where price (the target variable) is missing;
# .copy() avoids pandas' SettingWithCopyWarning on later column assignments
df = df.dropna(subset=["price"]).copy()
df

Unnamed: 0,symboling,normalized-losses,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,gas,std,two,convertible,rwd,front,88.6,168.8,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,gas,std,two,convertible,rwd,front,88.6,168.8,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,gas,std,two,hatchback,rwd,front,94.5,171.2,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,gas,std,four,sedan,fwd,front,99.8,176.6,...,109,mpfi,3.19,3.40,10.0,102,5500,24,30,13950
4,2,164,gas,std,four,sedan,4wd,front,99.4,176.6,...,136,mpfi,3.19,3.40,8.0,115,5500,18,22,17450
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95,gas,std,four,sedan,rwd,front,109.1,188.8,...,141,mpfi,3.78,3.15,9.5,114,5400,23,28,16845
201,-1,95,gas,turbo,four,sedan,rwd,front,109.1,188.8,...,141,mpfi,3.78,3.15,8.7,160,5300,19,25,19045
202,-1,95,gas,std,four,sedan,rwd,front,109.1,188.8,...,173,mpfi,3.58,2.87,8.8,134,5500,18,23,21485
203,-1,95,diesel,turbo,four,sedan,rwd,front,109.1,188.8,...,145,idi,3.01,3.40,23.0,106,4800,26,27,22470


In [17]:
# Convert the numeric columns stored as object to a numeric dtype
numeric_cols = ["normalized-losses", "bore", "stroke", "horsepower",
                "peak-rpm", "price"]
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col])

# Fill missing numeric values with the column median
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
# Fill the missing categorical values with the most frequent category
df["num-of-doors"] = df["num-of-doors"].fillna(df["num-of-doors"].mode()[0])

# Split into features (X) and target (y)
X = df.drop("price", axis=1)
y = df["price"]


In [19]:
df.isnull().sum()

symboling            0
normalized-losses    0
fuel-type            0
aspiration           0
num-of-doors         0
body-style           0
drive-wheels         0
engine-location      0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-type          0
num-of-cylinders     0
engine-size          0
fuel-system          0
bore                 0
stroke               0
compression-ratio    0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64

In [23]:
# Count the categories in each object (categorical) column
categorical_cols = X.select_dtypes(include="object").columns

for x in categorical_cols:
    print(x, "has", df[x].nunique(), "categories")


fuel-type has 2 categories
aspiration has 2 categories
num-of-doors has 2 categories
body-style has 5 categories
drive-wheels has 3 categories
engine-location has 2 categories
engine-type has 6 categories
num-of-cylinders has 7 categories
fuel-system has 8 categories


In [25]:
# ---------------------------
# One-hot encode categorical features
# ---------------------------
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

print("Final feature matrix shape:", X.shape)
X.iloc[:,10:20]

Final feature matrix shape: (201, 43)


Unnamed: 0,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,fuel-type_gas,aspiration_turbo,num-of-doors_two,body-style_hardtop,body-style_hatchback
0,9.0,111.0,5000.0,21,27,True,False,True,False,False
1,9.0,111.0,5000.0,21,27,True,False,True,False,False
2,9.0,154.0,5000.0,19,26,True,False,True,False,True
3,10.0,102.0,5500.0,24,30,True,False,False,False,False
4,8.0,115.0,5500.0,18,22,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
200,9.5,114.0,5400.0,23,28,True,False,False,False,False
201,8.7,160.0,5300.0,19,25,True,True,False,False,False
202,8.8,134.0,5500.0,18,23,True,False,False,False,False
203,23.0,106.0,4800.0,26,27,False,True,False,False,False


-->**`drop_first=True` drops the first dummy column of each categorical variable; the dropped category becomes the baseline.**

-->**This removes the redundant column and avoids the dummy variable trap (perfect multicollinearity with the intercept).**
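
The dummy variable trap can be checked numerically. In the minimal sketch below, a made-up `drive-wheels` column is encoded with and without `drop_first=True`, an intercept column of ones is added, and the rank of each design matrix is compared with its column count. The frame `toy` and the helper `with_intercept` are illustrative names, not part of this notebook.

```python
import numpy as np
import pandas as pd

# Made-up categorical column with three levels
toy = pd.DataFrame({"drive-wheels": ["fwd", "rwd", "4wd", "fwd", "rwd", "4wd"]})

full = pd.get_dummies(toy, columns=["drive-wheels"], dtype=float)
reduced = pd.get_dummies(toy, columns=["drive-wheels"], drop_first=True, dtype=float)

def with_intercept(frame):
    """Prepend a column of ones, as a regression with an intercept does implicitly."""
    ones = np.ones((len(frame), 1))
    return np.hstack([ones, frame.to_numpy()])

full_design = with_intercept(full)
reduced_design = with_intercept(reduced)

rank_full = np.linalg.matrix_rank(full_design)
rank_reduced = np.linalg.matrix_rank(reduced_design)

print(full_design.shape[1], rank_full)        # 4 columns but rank 3: redundant
print(reduced_design.shape[1], rank_reduced)  # 3 columns and rank 3: full rank
```

With all three dummies, the dummy columns sum to the intercept column, so one column carries no new information. Dropping the first level removes exactly that redundancy.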

In [27]:

# Train-test split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                            test_size=0.2, random_state=42)

# Fit Multiple Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)


In [16]:
# ---------------------------
# Predictions and evaluation
# ---------------------------
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Model Evaluation:")
print("RMSE:", rmse)
print("R² Score:", r2)


Model Evaluation:
RMSE: 3807.279386298739
R² Score: 0.8815222015251803
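
To make the two metrics concrete, the sketch below recomputes RMSE and R² by hand with NumPy and compares them with the scikit-learn functions. The four true prices are taken from the rows displayed earlier; the predictions in `y_hat` are made up for illustration and are not outputs of the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([13495.0, 16500.0, 13950.0, 17450.0])  # prices from rows shown earlier
y_hat = np.array([14000.0, 15500.0, 14500.0, 18000.0])   # made-up predictions

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual, in the same units as price
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
print(r2_manual, r2_score(y_true, y_hat))
```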


In [17]:

# ---------------------------
# Show coefficients
# ---------------------------
coefficients = pd.DataFrame({
    "Feature": X.columns,
    "Coefficient": model.coef_}).sort_values(
        by="Coefficient", ascending=False)

# Sorted by signed value, so the largest positive coefficients come first
print("\nTop 20 coefficients (largest first):")
print(coefficients.head(20))


Top 20 coefficients (largest first):
                 Feature   Coefficient
24  engine-location_rear  11455.745873
26       engine-type_ohc   2554.126179
38       fuel-system_idi   1854.020852
23      drive-wheels_rwd   1528.742449
16      aspiration_turbo   1304.643655
27      engine-type_ohcf    969.914909
4                  width    707.916233
14           highway-mpg    177.200278
0              symboling    160.714870
7            engine-size     62.045602
5                 height     34.684330
2             wheel-base     25.560906
11            horsepower      9.776300
1      normalized-losses      7.173585
6            curb-weight      3.591876
12              peak-rpm      0.829362
22      drive-wheels_fwd    -27.753547
3                 length    -54.829795
13              city-mpg   -234.701550
10     compression-ratio   -243.389779
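
A caution when reading this table: raw coefficients depend on each feature's units, so a large coefficient does not by itself mean a large influence. One common way to make coefficients comparable is to standardize the features before fitting. The sketch below illustrates the effect on synthetic data; the feature names, scales and true coefficients are assumptions chosen for the example, not values from this dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic features on very different scales (illustrative only)
width = rng.normal(66, 2, n)             # small spread
curb_weight = rng.normal(2500, 500, n)   # large spread
price = 500 * width + 8 * curb_weight + rng.normal(0, 500, n)

X_toy = np.column_stack([width, curb_weight])

# Fit once on the raw features and once on standardized features
raw_fit = LinearRegression().fit(X_toy, price)
scaled_fit = LinearRegression().fit(StandardScaler().fit_transform(X_toy), price)

print("raw coefficients:   ", raw_fit.coef_)
print("scaled coefficients:", scaled_fit.coef_)
```

On the raw scale the first feature has the larger coefficient, but after standardization the second feature dominates, because a one-standard-deviation change in it moves the target far more.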


#### Run Lasso

In [33]:
# Import Lasso regression (L1-regularized linear regression) from sklearn
from sklearn.linear_model import Lasso

# Fit Lasso regression; alpha sets the strength of the L1 penalty
model = Lasso(alpha=3)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)


In [35]:
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Model Evaluation:")
print("RMSE:", rmse)
print("R² Score:", r2)

Model Evaluation:
RMSE: 3427.3777025561776
R² Score: 0.9039866960116714
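
Lasso's L1 penalty can shrink some coefficients exactly to zero, and a larger `alpha` generally zeroes more of them. The sketch below shows this on synthetic data in which only the first three of ten features affect the target. The data, the `alpha` grid and the names `X_toy`, `y_toy` and `toy_model` are assumptions for illustration and are unrelated to the car data.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: 10 features, only the first 3 have a nonzero true coefficient
X_toy = rng.normal(size=(200, 10))
true_coef = np.array([5.0, -3.0, 2.0] + [0.0] * 7)
y_toy = X_toy @ true_coef + rng.normal(scale=0.5, size=200)

# Count the coefficients set exactly to zero for each penalty strength
zero_counts = {}
for alpha in [0.01, 0.1, 1.0]:
    toy_model = Lasso(alpha=alpha).fit(X_toy, y_toy)
    zero_counts[alpha] = int(np.sum(toy_model.coef_ == 0))

print(zero_counts)
```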


In [37]:
coefficients = pd.DataFrame({
    "Feature": X.columns,
    "Coefficient": model.coef_}).sort_values(by="Coefficient",
                                    ascending=False)

# Sorted by signed value, so the largest positive coefficients come first
print("\nTop 30 coefficients (largest first):")
print(coefficients.head(30))


Top 30 coefficients (largest first):
                   Feature   Coefficient
24    engine-location_rear  10152.400558
26         engine-type_ohc   2235.755005
16        aspiration_turbo   1772.061126
23        drive-wheels_rwd   1223.683812
27        engine-type_ohcf    856.246461
4                    width    720.595191
14             highway-mpg    143.182511
7              engine-size    116.662250
0                symboling     98.518039
5                   height     93.712450
10       compression-ratio     44.401451
2               wheel-base     41.798492
1        normalized-losses      7.753537
11              horsepower      3.434628
6              curb-weight      1.957876
12                peak-rpm      0.963449
38         fuel-system_idi      0.000000
40        fuel-system_mpfi      0.000000
15           fuel-type_gas     -0.000000
33  num-of-cylinders_three     -0.000000
37        fuel-system_4bbl    -44.175728
3                   length    -64.386100
13                city-mpg 