# Practice 3: Implement linear regression to perform prediction.

**Name:** Spandan Sahai
**Roll Number:** RA2311026010918

**Dataset:** Cars Dataset

**Tool Used:** Google Colab  

### Steps involved:
Step 1. Loading the dataset  
Step 2. Fixing missing values  
Step 3. Applying linear regression  
Step 4. Evaluating the model  
Step 5. Performing prediction

### Steps 1 & 2: Loading the dataset and fixing values

In [None]:
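import io

import pandas as pd
from google.colab import files

# Upload the workbook; files.upload() returns a dict of {filename: file bytes}
uploaded = files.upload()

# Read the workbook that was just uploaded. Colab renames repeated uploads
# (e.g. "Cars Datasets 2025 (6).xlsx"), so a hard-coded name may open an older copy.
filename = next(iter(uploaded))
df = pd.read_excel(io.BytesIO(uploaded[filename]))

print("\n=== Dataset Information ===")
df.info()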

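# Convert text columns to numbers. Each pattern keeps the first number found in a cell,
# so a range such as "70-85 hp" is reduced to its lower bound; cells with no number become NaN.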
df['HorsePower_numeric'] = pd.to_numeric(df['HorsePower'].str.extract(r'(\d+)')[0], errors='coerce')
df['TotalSpeed_numeric'] = pd.to_numeric(df['Total Speed'].str.extract(r'(\d+)')[0], errors='coerce')
df['Performance_0_100_sec'] = pd.to_numeric(df['Performance(0 - 100 )KM/H'].str.extract(r'(\d+\.?\d*)')[0], errors='coerce')
df['CC_Battery_numeric'] = pd.to_numeric(df['CC/Battery Capacity'].str.replace(',', '').str.extract(r'(\d+)')[0], errors='coerce')
df['Seats_numeric'] = pd.to_numeric(df['Seats'], errors='coerce')
df['Torque_numeric'] = pd.to_numeric(df['Torque'].str.extract(r'(\d+)')[0], errors='coerce')

# Clean up Cars Prices
df['CarsPrices_numeric'] = (
    df['Cars Prices']
    .astype(str)
    .str.replace(r'[\$,]', '', regex=True)   # remove $ and commas
    .str.replace('/', '-', regex=False)      # handle ranges
    .str.split('-').str[0]                   # take first value if range
    .str.strip()
)
df['CarsPrices_numeric'] = pd.to_numeric(df['CarsPrices_numeric'], errors='coerce')

# Select numeric columns
numeric_cols = [
    'HorsePower_numeric', 'TotalSpeed_numeric', 'Performance_0_100_sec',
    'CC_Battery_numeric', 'Seats_numeric', 'Torque_numeric', 'CarsPrices_numeric'
]

numeric_df = df[numeric_cols].dropna(how='all')

Saving Cars Datasets 2025.xlsx to Cars Datasets 2025 (6).xlsx

=== Dataset Information ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1218 entries, 0 to 1217
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Company Names              1218 non-null   object
 1   Cars Names                 1218 non-null   object
 2   Engines                    1218 non-null   object
 3   CC/Battery Capacity        1215 non-null   object
 4   HorsePower                 1218 non-null   object
 5   Total Speed                1218 non-null   object
 6   Performance(0 - 100 )KM/H  1212 non-null   object
 7   Cars Prices                1218 non-null   object
 8   Fuel Types                 1218 non-null   object
 9   Seats                      1218 non-null   object
 10  Torque                     1217 non-null   object
dtypes: object(11)
memory usage: 104.8+ KB
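
Every column is read as text (`object` dtype), which is why the conversion step is needed before any modelling. The sketch below runs the same `str.extract` and price-cleaning patterns on a few made-up strings to show what each pattern keeps and what it drops. The sample values are assumptions for illustration, not rows taken from the dataset.

```python
import pandas as pd

# Hypothetical sample cells; the real dataset's formatting may differ.
samples = pd.DataFrame({
    "HorsePower": ["963 hp", "70-85 hp", "unknown"],
    "Performance": ["2.5 sec", "10 sec", None],
    "Cars Prices": ["$1,100,000", "$12,000-$15,000", "n/a"],
})

# First run of digits only: a range keeps its lower bound, no digits -> NaN.
hp = pd.to_numeric(samples["HorsePower"].str.extract(r"(\d+)")[0], errors="coerce")

# Digits with an optional decimal part.
perf = pd.to_numeric(samples["Performance"].str.extract(r"(\d+\.?\d*)")[0], errors="coerce")

# Strip "$" and ",", then keep the part before the first "-" or "/".
price = pd.to_numeric(
    samples["Cars Prices"].astype(str)
    .str.replace(r"[\$,]", "", regex=True)
    .str.replace("/", "-", regex=False)
    .str.split("-").str[0]
    .str.strip(),
    errors="coerce",
)

print(pd.DataFrame({"hp": hp, "perf": perf, "price": price}))
```

Cells that do not match a pattern end up as `NaN`, so the number of missing values can grow after conversion and is worth re-checking on the derived columns.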


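### Step 2: Checking and fixing null values

The cell below counts the missing values in each column. Rows with missing values in the model columns are dropped in Step 3, before the model is fitted.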

In [None]:

df.isnull().sum()

| Column | Missing values |
|---|---|
| Company Names | 0 |
| Cars Names | 0 |
| Engines | 0 |
| CC/Battery Capacity | 3 |
| HorsePower | 0 |
| Total Speed | 0 |
| Performance(0 - 100 )KM/H | 6 |
| Cars Prices | 0 |
| Fuel Types | 0 |
| Seats | 0 |
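
This notebook deals with such gaps by dropping incomplete rows before fitting. A common alternative is to fill missing numeric values with the column median, so that fewer rows are lost. The sketch below compares both options on a toy frame; the values and the choice of the median are assumptions for illustration, not the method used in this practice.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the derived numeric columns (values are made up).
toy = pd.DataFrame({
    "HorsePower_numeric": [100.0, 200.0, np.nan, 400.0],
    "Torque_numeric": [150.0, np.nan, 300.0, 500.0],
    "CarsPrices_numeric": [20000.0, 35000.0, 50000.0, 90000.0],
})

# Option A: drop any row that has a missing value.
dropped = toy.dropna()

# Option B: fill each column's gaps with that column's median.
filled = toy.fillna(toy.median(numeric_only=True))

print("rows after dropna:", len(dropped))
print("rows after median fill:", len(filled))
print(filled)
```

Dropping is simpler and avoids inventing values, while filling keeps more rows for training; which one is preferable depends on how many rows are affected.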


### Step 3: Applying Linear Regression

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Features (inputs) and target (value to predict)
feature_cols = [
    'HorsePower_numeric',
    'TotalSpeed_numeric',
    'Performance_0_100_sec',
    'CC_Battery_numeric',
    'Seats_numeric',
    'Torque_numeric'
]
target_col = 'CarsPrices_numeric'

# Drop rows with a missing feature or target; LinearRegression does not accept NaN
df_numeric = df.dropna(subset=feature_cols + [target_col])

X = df_numeric[feature_cols]
y = df_numeric[target_col]

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model on the training rows
model = LinearRegression()
model.fit(X_train, y_train)

# Score the model on the held-out test rows
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R² Score:", r2)

Mean Squared Error: 135034972826.84402
R² Score: 0.17087035709090725
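
An R² of about 0.17 means the model accounts for roughly 17% of the variance in the test-set prices. One way to put that number in context is to compare it with a baseline that always predicts the mean training price, whose test R² is never above 0. The sketch below does this on synthetic data; both the data and the `DummyRegressor` baseline are assumptions added for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a linear signal plus noise (not the cars dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=2.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: always predict the mean of the training targets.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

r2_baseline = r2_score(y_test, baseline.predict(X_test))
r2_linear = r2_score(y_test, linear.predict(X_test))

print(f"baseline R²: {r2_baseline:.3f}")
print(f"linear R²:   {r2_linear:.3f}")
```

Running the same comparison on `X_train`, `X_test`, `y_train`, and `y_test` from the cell above would show how far the fitted model is from the mean-price baseline.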


### Step 4: Evaluating the model

Once the linear regression model is trained, we evaluate it on the held-out test set (20% of the rows) using the mean absolute error (MAE), the mean squared error (MSE), and the R² score.
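
For reference, the three metrics computed in the next cell have the following standard definitions, where $y_i$ is the actual price, $\hat{y}_i$ the predicted price, $\bar{y}$ the mean actual price, and $n$ the number of test rows:

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
$$

MAE is in price units, MSE is in squared price units, and R² is unitless with a maximum of 1.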

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Model Evaluation Metrics:")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R² Score: {r2:.2f}")

Model Evaluation Metrics:
Mean Absolute Error (MAE): 205530.57
Mean Squared Error (MSE): 135034972826.84
R² Score: 0.17
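
MSE is expressed in squared price units, which makes it hard to read next to MAE. Taking its square root gives the root mean squared error (RMSE), which is in the same units as the price. The short sketch below applies this to the values reported above; RMSE is an added metric here, not one the notebook computes.

```python
import math

# Values copied from the evaluation output above.
mse = 135034972826.84
mae = 205530.57

# RMSE is the square root of MSE, expressed in price units like MAE.
rmse = math.sqrt(mse)

print(f"RMSE: {rmse:,.2f}")
print(f"RMSE / MAE: {rmse / mae:.2f}")
```

An RMSE well above the MAE suggests that a small number of large errors dominate the squared-error metrics, which would be consistent with a few very expensive cars in the test set.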


### Step 5: Performing prediction

In [None]:
y_pred = model.predict(X_test)

# Show the first 10 test rows: actual price next to the model's prediction
prediction_results = pd.DataFrame({
    "Actual Values": y_test.values,
    "Predicted Values": y_pred
}).head(10)
prediction_results

| | Actual Values | Predicted Values |
|---|---|---|
| 0 | 18400.0 | -83157.146819 |
| 1 | 41500.0 | 129993.792992 |
| 2 | 12000.0 | -79551.286155 |
| 3 | 26400.0 | -18974.731644 |
| 4 | 40000.0 | -34004.205901 |
| 5 | 28000.0 | -109468.670958 |
| 6 | 25000.0 | -80231.7831 |
| 7 | 355000.0 | 488433.334345 |
| 8 | 72000.0 | 106290.684285 |
| 9 | 60000.0 | 126759.84534 |
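
Several of the predictions in the table above are negative, which is impossible for a price: an ordinary linear model places no lower bound on its output. The sketch below re-enters the ten rows from the table by hand, counts the negative predictions, and lists the rows with the largest errors. The column names `actual`, `predicted`, and `abs_error` are introduced here for illustration.

```python
import pandas as pd

# The ten rows from the prediction table above, re-entered by hand.
results = pd.DataFrame({
    "actual": [18400.0, 41500.0, 12000.0, 26400.0, 40000.0,
               28000.0, 25000.0, 355000.0, 72000.0, 60000.0],
    "predicted": [-83157.146819, 129993.792992, -79551.286155, -18974.731644,
                  -34004.205901, -109468.670958, -80231.7831, 488433.334345,
                  106290.684285, 126759.84534],
})

# Absolute error per row, in price units.
results["abs_error"] = (results["actual"] - results["predicted"]).abs()

n_negative = int((results["predicted"] < 0).sum())
print(f"negative predictions: {n_negative} of {len(results)}")
print(results.sort_values("abs_error", ascending=False).head(3))
```

Fitting on the logarithm of the price, or clipping predictions at zero, are two common ways to avoid negative outputs; neither is used in this notebook.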
