# Assignment 0 - Task 1: Linear & Polynomial Regression

## Problem Statement
We are working with a dataset of used car listings from **PakWheels** (`data-a1.csv`).
**Goal**: Predict the **price** of a car based on its attributes.

## Dataset Description
- **make, model**: Manufacturer and specific car name.
- **year**: Manufacturing year.
- **engine**: Engine capacity (cc).
- **mileage**: Distance traveled (km).
- **price**: Target variable (PKR).
- **Others**: fuel, transmission, city, registered, color, etc.

## Approach
Baby steps: each stage builds on the previous one, evolving from hand-written functions to a reusable class.

## 1. Setup
Importing the libraries used in this notebook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
%matplotlib inline

## 2. Data Preparation
reading file.

In [None]:
df = pd.read_csv('data-a1.csv')

checking the raw data.

In [None]:
df.head()

checking columns to find `addref`.

In [None]:
df.columns

dropping `addref` (it's just an id).

In [None]:
if 'addref' in df.columns:
    df = df.drop(columns=['addref'])

verifying drop.

In [None]:
df.columns

### 2.1 Missing Values
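Filling missing values.
Numeric columns get their median.
Text columns get their mode (the most frequent value).
Any numeric column is included, so missing `price` values are imputed too.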

In [None]:
# identifying number columns
num_cols = df.select_dtypes(include=['number']).columns
print("Numeric:", num_cols)

In [None]:
# filling numbers
for c in num_cols:
    df[c] = df[c].fillna(df[c].median())

In [None]:
# identifying text columns
cat_cols = df.select_dtypes(include=['object']).columns
print("Categorical:", cat_cols)

In [None]:
# filling text
for c in cat_cols:
    df[c] = df[c].fillna(df[c].mode()[0])

checking if clean.

In [None]:
df.isnull().sum()
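
As a small worked illustration of the rule above, the sketch below applies median and mode filling to a made-up frame. The column values are invented for the example and are not taken from `data-a1.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame with one hole per column (values are invented for illustration)
toy = pd.DataFrame({
    "mileage": [10000.0, np.nan, 50000.0, 90000.0],
    "fuel": ["Petrol", "Petrol", None, "Diesel"],
})

# Numeric column: median() skips NaN, so it is computed from 10000, 50000, 90000
toy["mileage"] = toy["mileage"].fillna(toy["mileage"].median())

# Text column: mode() returns the most frequent value(s); take the first one
toy["fuel"] = toy["fuel"].fillna(toy["fuel"].mode()[0])

print(toy)
```

The missing mileage becomes 50000 (the median of the three known values) and the missing fuel becomes "Petrol" (the most frequent label).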

### 2.2 Feature Engineering
creating `make_model` by joining `make` and `model` with a space.

In [None]:
df['make_model'] = df['make'] + " " + df['model']

looking at the result.

In [None]:
df[['make', 'model', 'make_model']].head()

### 2.3 Encoding
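converting all text columns to numbers (one-hot). `drop_first=True` drops one category per column, since it is implied when all the other dummies are 0.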

In [None]:
word_cols = df.select_dtypes(include=['object']).columns
df_encoded = pd.get_dummies(df, columns=word_cols, drop_first=True)

checking shape after encoding (will be wide).

In [None]:
df_encoded.shape
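
To see what one-hot encoding with `drop_first=True` produces, here is a minimal sketch on an invented four-row frame (the `fuel` and `engine` values are assumptions for illustration only).

```python
import pandas as pd

# Toy frame: one text column and one numeric column (invented values)
toy = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "Hybrid", "Petrol"],
    "engine": [1300, 2000, 1800, 1000],
})

# One dummy column per category, minus the first category in sorted order ("Diesel")
encoded = pd.get_dummies(toy, columns=["fuel"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

The dropped category ("Diesel") is represented by a row where both remaining dummy columns are 0. This is why the real encoded frame is wide: every distinct make, model, city, and so on adds a column.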

## 3. Splitting Data
separating target `price` from features `X`.

In [None]:
X = df_encoded.drop(columns=['price'])
y = df_encoded['price']

splitting into Train (70%) and Temp (30%).

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)

splitting Temp into Val (15%) and Test (15%).

In [None]:
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

verifying counts.

In [None]:
print("Train:", X_train.shape)
print("Val:", X_val.shape)
print("Test:", X_test.shape)
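
The two-stage split can be checked on synthetic data. The sketch below uses 100 made-up rows (not the car data) and confirms that 30% followed by a 50/50 split of the remainder gives 70/15/15 with no row shared between sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows; y doubles as a row id so overlap is easy to check
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Stage 1: 70% train, 30% held out
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Stage 2: split the held-out 30% in half -> 15% validation, 15% test
X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

print(len(X_tr), len(X_va), len(X_te))
```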

## 4. Phase 1: The Playground
understanding linear regression manually.

### 4.1 Prediction Function
$$ \hat{y} = Xw + b $$

In [None]:
def predict_v1(X, w, b):
    return np.dot(X, w) + b

testing manual prediction math.

In [None]:
dummy_X = np.array([[2]])
dummy_w = np.array([[10]])
dummy_b = 5

# 2*10+5 = 25
print("Dummy Pred:", predict_v1(dummy_X, dummy_w, dummy_b))
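
The same function handles several rows and several features at once. A minimal sketch with invented numbers, where `X_demo` has shape (2, 2) and `w_demo` has shape (2, 1):

```python
import numpy as np

def predict_v1(X, w, b):
    # (n_rows, n_features) @ (n_features, 1) -> (n_rows, 1), then add the bias
    return np.dot(X, w) + b

X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
w_demo = np.array([[10.0], [1.0]])
b_demo = 5.0

# row 0: 1*10 + 2*1 + 5 = 17
# row 1: 3*10 + 4*1 + 5 = 39
preds = predict_v1(X_demo, w_demo, b_demo)
print(preds)
```

Keeping `w` as a column vector means the predictions come back as a column too, which is why the targets are reshaped to `(-1, 1)` later on.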

### 4.2 Gradient Function
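calculating how much to change the weights and bias. With the $1/n$ factor used here, these are the gradients of the half-MSE loss $\frac{1}{2n}\sum(\hat{y} - y)^2$; the gradient of the plain MSE is twice as large, which only rescales the learning rate.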

In [None]:
def grad_v1(X, pred, real):
    n = len(X)
    diff = pred - real
    # gradient of weights
    dw = (1/n) * np.dot(X.T, diff)
    # gradient of bias
    db = (1/n) * np.sum(diff)
    return dw, db
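
One way to gain confidence in a hand-written gradient is to compare it with finite differences. The sketch below does that on random data; `half_mse` is a helper introduced here for the check (it is not part of the notebook), and it encodes the assumption that the loss being differentiated is the half-MSE.

```python
import numpy as np

def grad_v1(X, pred, real):
    n = len(X)
    diff = pred - real
    dw = (1/n) * np.dot(X.T, diff)
    db = (1/n) * np.sum(diff)
    return dw, db

def half_mse(X, y, w, b):
    # Hypothetical helper: the loss whose gradient grad_v1 is assumed to compute
    return np.sum((X @ w + b - y) ** 2) / (2 * len(X))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = rng.normal(size=(50, 1))
w_demo = rng.normal(size=(3, 1))
b_demo = 0.5

# Analytic gradient from the notebook's function
dw, db = grad_v1(X_demo, X_demo @ w_demo + b_demo, y_demo)

# Numerical gradient by central differences, one parameter at a time
eps = 1e-6
dw_num = np.zeros_like(w_demo)
for j in range(w_demo.shape[0]):
    step = np.zeros_like(w_demo)
    step[j] = eps
    dw_num[j] = (half_mse(X_demo, y_demo, w_demo + step, b_demo)
                 - half_mse(X_demo, y_demo, w_demo - step, b_demo)) / (2 * eps)
db_num = (half_mse(X_demo, y_demo, w_demo, b_demo + eps)
          - half_mse(X_demo, y_demo, w_demo, b_demo - eps)) / (2 * eps)

print(np.allclose(dw, dw_num, atol=1e-6), np.isclose(db, db_num, atol=1e-6))
```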

### 4.3 The Failure (Unscaled)
trying to train on raw data to see what happens.

In [None]:
# setting up for raw run
X_raw = X_train.astype(float)
y_raw = y_train.values.reshape(-1, 1).astype(float)

w = np.zeros((X_raw.shape[1], 1))
b = 0
lr = 0.0001

In [None]:
# taking 10 steps
for i in range(10):
    p = predict_v1(X_raw, w, b)
    dw, db = grad_v1(X_raw, p, y_raw)
    w -= lr * dw
    b -= lr * db
    
print("Weights after raw training (check for nan):")
print(w[0])

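it exploded! every step overshoots, so the weights grow without bound until they overflow to `inf` or `nan`.
we must scale the features before training.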
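
Why does this happen? For a quadratic loss like this one, gradient descent is only stable when the learning rate is below $2/\lambda_{max}$, where $\lambda_{max}$ is the largest eigenvalue of the Hessian. The sketch below estimates that limit on synthetic features drawn on car-like scales. The ranges are assumptions chosen for illustration, not statistics of `data-a1.csv`, and `max_stable_lr` is a helper made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical raw features: year, engine (cc), mileage (km)
X_raw_demo = np.column_stack([
    rng.integers(2000, 2021, size=n),
    rng.integers(660, 3001, size=n),
    rng.integers(1_000, 200_001, size=n),
]).astype(float)

def max_stable_lr(X):
    # Append a column of ones so the bias is part of the same quadratic problem
    A = np.column_stack([X, np.ones(len(X))])
    hessian = A.T @ A / len(A)
    lam_max = np.linalg.eigvalsh(hessian).max()
    return 2.0 / lam_max

# Standardize each column: zero mean, unit variance
X_scaled_demo = (X_raw_demo - X_raw_demo.mean(axis=0)) / X_raw_demo.std(axis=0)

print("largest stable lr, raw:   ", max_stable_lr(X_raw_demo))
print("largest stable lr, scaled:", max_stable_lr(X_scaled_demo))
```

On data like this, the raw limit is many orders of magnitude below `0.0001`, while the scaled limit comfortably exceeds `0.1`.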

### 4.4 The Fix (Scaling)
using StandardScaler.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

transforming train, val, test.

In [None]:
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)
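
What the scaler does can be written out by hand. This is a sketch of the same idea in plain NumPy on invented data, not scikit-learn's code: statistics come from the training rows only and are then reused for the other sets. `StandardScaler` uses the population standard deviation (`ddof=0`), which is NumPy's default.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented "year" and "engine" columns for a train and a validation set
train = rng.normal(loc=[2010.0, 1500.0], scale=[5.0, 400.0], size=(100, 2))
val = rng.normal(loc=[2010.0, 1500.0], scale=[5.0, 400.0], size=(20, 2))

# "fit": learn per-column mean and standard deviation from the training set only
mu = train.mean(axis=0)
sigma = train.std(axis=0)

# "transform": apply the training statistics to every set
train_s = (train - mu) / sigma
val_s = (val - mu) / sigma

print(train_s.mean(axis=0).round(6), train_s.std(axis=0).round(6))
print(val_s.mean(axis=0).round(3), val_s.std(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1 exactly; the validation columns are only close to that, which is expected because they were not used to compute `mu` and `sigma`.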

running manual loop on scaled data (Success).

In [None]:
w = np.zeros((X_train_s.shape[1], 1))
b = 0
lr = 0.1

history = []

for i in range(500):
    p = predict_v1(X_train_s, w, b)
    # calc error
    mse = np.mean((p - y_raw)**2)
    history.append(mse)
    
    dw, db = grad_v1(X_train_s, p, y_raw)
    w -= lr * dw
    b -= lr * db

print("Final Manual Error:", mse)

error is going down (compare `history[0]` with `history[-1]`)! math works.
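
A stronger check than watching the error fall is to compare the learned parameters with the exact least-squares solution. The sketch below does this on synthetic standardized data with known weights (all values are made up for the test), using the same update rule and the same `lr=0.1` and 500 steps.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic standardized features and a target with known weights plus noise
X_demo = rng.normal(size=(n, 3))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
true_w = np.array([[3.0], [-2.0], [0.5]])
y_demo = X_demo @ true_w + 7.0 + rng.normal(scale=0.1, size=(n, 1))

# Gradient descent, same update rule as the manual loop
w = np.zeros((3, 1))
b = 0.0
lr = 0.1
for _ in range(500):
    diff = X_demo @ w + b - y_demo
    w -= lr * (X_demo.T @ diff) / n
    b -= lr * diff.sum() / n

# Exact least-squares solution; the appended column of ones carries the bias
A = np.column_stack([X_demo, np.ones(n)])
solution, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
w_exact, b_exact = solution[:3], solution[3, 0]

print(np.allclose(w, w_exact, atol=1e-6), np.isclose(b, b_exact, atol=1e-6))
```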

## 5. Phase 2: Maturation
organizing manual code into a class.

In [None]:
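class CustomLinearRegression:
    """Linear regression trained with batch gradient descent."""

    def __init__(self, lr=0.01, steps=1000):
        self.lr = lr          # learning rate
        self.steps = steps    # number of gradient descent iterations
        self.w = None
        self.b = None
        self.loss = []        # training MSE recorded at every step

    def predict(self, X):
        return np.dot(X, self.w) + self.b

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        # accept a Series, list, or array and force a column vector
        y = np.asarray(y, dtype=float).reshape(-1, 1)
        n = X.shape[0]
        # init (reset the loss history so refitting starts clean)
        self.w = np.zeros((X.shape[1], 1))
        self.b = 0.0
        self.loss = []

        for i in range(self.steps):
            p = self.predict(X)

            # error
            diff = p - y
            self.loss.append(np.mean(diff**2))

            # update
            self.w -= self.lr * (1/n) * np.dot(X.T, diff)
            self.b -= self.lr * (1/n) * np.sum(diff)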

## 6. Phase 3: Production Training
training the mature class.

In [None]:
model = CustomLinearRegression(lr=0.1, steps=1500)
model.fit(X_train_s, y_train.values)

checking convergence graph.

In [None]:
plt.plot(model.loss)
plt.title("Error over Time")
plt.ylabel("MSE")
plt.show()

validating model.

In [None]:
p_val = model.predict(X_val_s)
mse_val = np.mean((p_val - y_val.values.reshape(-1, 1))**2)
print("Validation MSE:", mse_val)

final testing.

In [None]:
p_test = model.predict(X_test_s)
mse_test = np.mean((p_test - y_test.values.reshape(-1, 1))**2)
print("Test MSE:", mse_test)
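
MSE is in squared PKR, which is hard to read. A small helper like the one below (a hypothetical `regression_report`, not part of the notebook) also reports RMSE, which is in the same unit as the price, and $R^2$. It flattens both inputs first, which guards against accidentally subtracting an `(n, 1)` array from an `(n,)` array and getting an `(n, n)` broadcast.

```python
import numpy as np

def regression_report(y_true, y_pred):
    # Flatten so (n, 1) predictions and (n,) targets line up element by element
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()

    mse = np.mean((y_pred - y_true) ** 2)
    rmse = np.sqrt(mse)                              # same unit as the target
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "r2": r2}

# Tiny invented example: three perfect predictions and one that is off by 2
report = regression_report([1.0, 2.0, 3.0, 4.0], [[1.0], [2.0], [3.0], [6.0]])
print(report)
```

Here the squared errors sum to 4 over 4 rows, so MSE and RMSE are both 1, and $R^2 = 1 - 4/5 = 0.2$.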

saving model.

In [None]:
import pickle
with open('linear_model.pkl', 'wb') as f:
    pickle.dump(model, f)
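
Loading works the same way in reverse. The sketch below round-trips a stand-in dictionary of parameters through a temporary file rather than the real model. Two things to keep in mind for the real case: unpickling a `CustomLinearRegression` object requires the class definition to be available in the loading session, and because the model was trained on scaled features, the fitted scaler would need to be saved as well before predicting on new raw rows.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in for a trained model: invented weights and bias
params = {"w": np.array([[1.5], [-0.5]]), "b": 2.0}

# Save to a temporary location (the file name is arbitrary)
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(path, "wb") as f:
    pickle.dump(params, f)

# Load it back and use it for a prediction
with open(path, "rb") as f:
    restored = pickle.load(f)

x_new = np.array([[2.0, 4.0]])
# 2*1.5 + 4*(-0.5) + 2 = 3
prediction = x_new @ restored["w"] + restored["b"]
print(prediction)
```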

## 7. Polynomial (Non-Linear)
using only three numeric features: `year`, `engine`, `mileage`.

In [None]:
poly_cols = ['year', 'engine', 'mileage']
X_poly = df[poly_cols]

splitting into Train (70%) and Test (30%); no validation set here.

In [None]:
Xp_train, Xp_test, yp_train, yp_test = train_test_split(X_poly, df['price'], test_size=0.3, random_state=42)

scaling & transforming (degree 2).

In [None]:
from sklearn.preprocessing import PolynomialFeatures

scaler_p = StandardScaler()
poly = PolynomialFeatures(degree=2)

# pipe manually
Xp_train_s = scaler_p.fit_transform(Xp_train)
Xp_train_poly = poly.fit_transform(Xp_train_s)

Xp_test_s = scaler_p.transform(Xp_test)
Xp_test_poly = poly.transform(Xp_test_s)
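
To make the degree-2 expansion concrete, here is a hand-rolled stand-in (`degree2_expand` is a name invented for this sketch, not scikit-learn's implementation). For three inputs it yields 1 + 3 + 6 = 10 columns: a constant, the originals, and every square and pairwise product, which is the same column count `PolynomialFeatures(degree=2)` produces with default settings.

```python
from itertools import combinations_with_replacement

import numpy as np

def degree2_expand(X):
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    cols = [np.ones(n)]                        # constant term
    cols.extend(X[:, j] for j in range(d))     # original features
    # squares and pairwise products: (0,0), (0,1), (0,2), (1,1), (1,2), (2,2)
    cols.extend(X[:, i] * X[:, j]
                for i, j in combinations_with_replacement(range(d), 2))
    return np.column_stack(cols)

# One invented row with values a=1, b=2, c=3
row = np.array([[1.0, 2.0, 3.0]])
expanded = degree2_expand(row)

print(expanded.shape)
# columns: 1, a, b, c, a^2, ab, ac, b^2, bc, c^2
print(expanded)
```

Scaling before expanding, as done above, keeps the squared and product terms on a comparable scale; squaring a raw mileage would produce very large values.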

training Ridge regression (default `alpha=1.0`) and scoring it with $R^2$ on the test set.

In [None]:
from sklearn.linear_model import Ridge

reg = Ridge()
reg.fit(Xp_train_poly, yp_train)

print("Polynomial R2 Score:", reg.score(Xp_test_poly, yp_test))
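
Ridge is linear regression with a penalty on the size of the weights: it minimizes $\|y - Xw\|^2 + \alpha\|w\|^2$, with the intercept left unpenalized. The sketch below solves that problem in closed form on synthetic data to show the effect of `alpha`. It is an illustration of the idea under those assumptions, not the solver scikit-learn uses, and `ridge_closed_form` is a name made up for the example.

```python
import numpy as np

def ridge_closed_form(X, y, alpha=1.0):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    # Center the data so the intercept can be recovered afterwards, unpenalized
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

w0, b0 = ridge_closed_form(X_demo, y_demo, alpha=0.0)     # plain least squares
w1, b1 = ridge_closed_form(X_demo, y_demo, alpha=100.0)   # strong penalty

print("weight norm, alpha=0:  ", np.linalg.norm(w0))
print("weight norm, alpha=100:", np.linalg.norm(w1))
```

A larger `alpha` shrinks the weights toward zero, which helps when polynomial terms are strongly correlated with each other.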