# Analysis Notebook — AIIS-WH2

This notebook follows the CRISP-DM process and contains the full pipeline: data loading, preprocessing, modeling, and evaluation. Before running it, place the Kaggle CSV at `data/vehicle_data.csv`.


## 1. Business & Data Understanding
Describe the business goal and the dataset (CarDekho).

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

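path = 'data/vehicle_data.csv'
try:
    df = pd.read_csv(path)
    print('Loaded', df.shape)
except FileNotFoundError:
    # Leave df as None so the later cells skip cleanly until the file is in place;
    # other errors (e.g. a malformed CSV) are not swallowed
    print('Place the Kaggle CSV at', path)
    df = None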


Place the Kaggle CSV at data/vehicle_data.csv


In [3]:
# Preprocessing example
if df is not None:
    display(df.head())
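    # Drop rows with missing values, then exact duplicate rows
    df = df.dropna().drop_duplicates()
    # One-hot encode the categorical columns. dtype=float keeps the frame fully
    # numeric: recent pandas versions default to bool dummies, which the
    # statsmodels OLS cell further down may reject
    cat_cols = df.select_dtypes(include=['object']).columns.tolist()
    df = pd.get_dummies(df, columns=cat_cols, drop_first=True, dtype=float)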
    print('After dummies', df.shape)
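
To make the preprocessing step concrete, here is a minimal sketch on a toy frame. The `fuel` column and its values are hypothetical and only stand in for whatever categorical columns the Kaggle file contains.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
toy = pd.DataFrame({
    "selling_price": [250000, 400000, 400000, 150000],
    "fuel": ["Petrol", "Diesel", "Diesel", "CNG"],
})

# Same steps as the cell above: drop missing rows and exact duplicates
toy = toy.dropna().drop_duplicates()

# drop_first=True removes the first category (alphabetically, "CNG"),
# which then acts as the baseline level
encoded = pd.get_dummies(toy, columns=["fuel"], drop_first=True, dtype=float)
print(encoded.columns.tolist())
print(encoded.shape)
```

Dropping one level per categorical column avoids perfectly collinear dummy columns once an intercept is added to the regression.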


In [4]:
# Modeling example: Linear Regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.metrics import mean_squared_error, r2_score

if df is not None:
    if 'selling_price' not in df.columns:
        raise KeyError("Expected a 'selling_price' column in the dataset")
    # Target and feature matrix
    y = df['selling_price']
    X = df.drop(columns=['selling_price'])
    # Hold out 20% of the rows for testing; fixed seed for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    lr = LinearRegression().fit(X_train, y_train)
    pred = lr.predict(X_test)
    print('R2', r2_score(y_test, pred))
    print('RMSE', np.sqrt(mean_squared_error(y_test, pred)))
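
The modeling cell imports `RFE` but does not use it. As a rough sketch of how recursive feature elimination could slot in, the example below runs it on synthetic data. The arrays `X_demo` and `y_demo` and the choice of two features are assumptions made for illustration, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): six features, only columns 0 and 3 drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# RFE refits the estimator repeatedly, dropping the feature with the
# smallest coefficient magnitude each round until two remain
selector = RFE(LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```

With a linear model, RFE ranks features by raw coefficient size, so on real data the features would normally be scaled to comparable units first.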


In [5]:
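# Confidence and prediction intervals via statsmodels OLS
import statsmodels.api as sm
if df is not None:
    # has_constant='add' forces the intercept column even when a feature
    # (e.g. a rare dummy) happens to be constant within one split
    X2 = sm.add_constant(X_train, has_constant='add')
    ols = sm.OLS(y_train, X2).fit()
    preds = ols.get_prediction(sm.add_constant(X_test, has_constant='add'))
    frame = preds.summary_frame(alpha=0.05)  # 95% intervals
    # frame columns: mean, mean_se, mean_ci_lower/upper (confidence interval for
    # the mean), obs_ci_lower/upper (prediction interval for a new observation)
    print(frame.head())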
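
For reference, the two interval types in `frame` differ by a single term under the usual OLS assumptions. The formulas below are the standard textbook ones. The symbols are notation introduced here rather than names from the notebook: $x_0$ is one test row including the constant, $X$ is the training design matrix with $n$ rows and $p$ columns, $\hat\beta$ is the fitted coefficient vector, and $\hat\sigma^2$ is the residual variance estimate.

$$
\text{mean CI:}\quad x_0^\top\hat\beta \;\pm\; t_{1-\alpha/2,\;n-p}\,\hat\sigma\,\sqrt{x_0^\top (X^\top X)^{-1} x_0}
$$

$$
\text{prediction interval:}\quad x_0^\top\hat\beta \;\pm\; t_{1-\alpha/2,\;n-p}\,\hat\sigma\,\sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}
$$

The extra $1$ under the square root accounts for the noise of a single new observation, which is why the `obs_ci_*` bounds are always wider than the `mean_ci_*` bounds.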


## Evaluation and Plots
Plot actual against predicted `selling_price` values, together with the prediction intervals.