# 🚗 Car Price Prediction - Regression Project

## 🧾 Problem Description
A Chinese automobile company wants to understand how car prices vary in the US market. The aim is to build a regression model that identifies the key factors influencing car prices and to use that model to predict prices from car features.

---

## 🎯 Business Goal
- Identify significant variables affecting car prices.
- Build predictive models to estimate car prices.
- Evaluate and compare model performance.
- Interpret results for strategic business decisions.


## 1️⃣ Loading and Preprocessing

In [None]:
import pandas as pd
import numpy as np

# Load data
data_path = "car_data.csv"  # Replace with your local file path if needed
df = pd.read_csv(data_path)

# Initial Inspection
df.head()


In [None]:
# Basic Info
df.info()

# Check for missing values
df.isnull().sum()
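
The missing-value counts show how much data a blanket `dropna()` would discard. If that share is large, imputation is one alternative. The sketch below is only an illustration on a hypothetical toy frame (the column names `horsepower` and `curbweight` are assumptions, not taken from the dataset): it reports the share of missing values per column and fills gaps with the column median using scikit-learn's `SimpleImputer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame standing in for the car dataset
toy = pd.DataFrame({
    "horsepower": [100.0, np.nan, 150.0, 200.0],
    "curbweight": [2000.0, 2500.0, np.nan, 3000.0],
})

# Share of missing values per column (0.25 means 25% of rows)
missing_share = toy.isnull().mean()
print(missing_share)

# Replace each NaN with the median of its column
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(toy_imputed)
```

In a real workflow the imputer would be fitted on the training rows only and then applied to the test rows, so that test data does not influence the fill values.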


In [None]:
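# Handle missing values (drops every row that contains at least one NaN)
df = df.dropna()

# One-hot encode categorical columns; drop_first removes one redundant dummy per category
df = pd.get_dummies(df, drop_first=True)

# Separate target and features
X = df.drop('price', axis=1)
y = df['price']

# Train/Test Split (done before scaling so test rows do not influence the scaler)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling: fit on the training set only, then apply the same transform to the test set
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)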
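
One way to keep preprocessing tied to the training data is to wrap encoding, scaling, and the model in a single scikit-learn `Pipeline`. The sketch below is a minimal illustration under stated assumptions: the toy data, the column names (`horsepower`, `curbweight`, `fueltype`), and the price formula are all hypothetical and only stand in for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the car dataset
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "horsepower": rng.uniform(50, 250, n),
    "curbweight": rng.uniform(1500, 4000, n),
    "fueltype": rng.choice(["gas", "diesel"], n),
})
toy["price"] = 50 * toy["horsepower"] + 3 * toy["curbweight"] + rng.normal(0, 500, n)

X_toy = toy.drop(columns="price")
y_toy = toy["price"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Scale numeric columns and one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["horsepower", "curbweight"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fueltype"]),
])

# fit() learns the scaler and encoder from the training rows only
pipe = Pipeline([("prep", preprocess), ("model", LinearRegression())])
pipe.fit(X_tr, y_tr)

pipeline_r2 = pipe.score(X_te, y_te)
print(f"Pipeline test R²: {pipeline_r2:.3f}")
```

Because the transformers live inside the pipeline, the same object can later be passed to cross-validation or grid search without refitting the scaler on held-out rows.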


## 2️⃣ Model Implementation

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "SVR": SVR()
}

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "R²": r2_score(y_test, y_pred),
        "MSE": mean_squared_error(y_test, y_pred),
        "MAE": mean_absolute_error(y_test, y_pred)
    }

results_df = pd.DataFrame(results).T.sort_values("R²", ascending=False)
results_df
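
A single train/test split can rank models differently depending on which rows land in the test set. As a hedged complement to the table above, the sketch below uses k-fold cross-validation via `cross_val_score`. It runs on synthetic data from `make_regression` (an assumption made purely so the example is self-contained) and compares only two of the models.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the prepared car features
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

cv_models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Mean and standard deviation of R² across 5 folds for each model
cv_summary = {}
for name, model in cv_models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R² = {scores.mean():.3f}, std = {scores.std():.3f}")
```

The standard deviation gives a rough sense of how stable each model's score is across folds.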


## 3️⃣ Model Evaluation

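- **Best Model**: The model with the highest R² and the lowest MSE and MAE on the test set.
- **Worst Model**: The model with the lowest R² and/or the highest error metrics.

On a fixed test set, R² and MSE always rank models identically, so only MAE can disagree with them. Justify the ranking using the metrics in the results table from the previous section.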
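
The selection rule above can be sketched in code. The example below is an illustration only: `model_a`, `model_b`, and all numbers are hypothetical and do not come from the car dataset. It also reports RMSE, the square root of MSE, which is expressed in the same units as price and is often easier to interpret.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true prices and predictions from two made-up models
y_true = np.array([10000.0, 15000.0, 20000.0, 25000.0])
preds = {
    "model_a": np.array([11000.0, 14000.0, 21000.0, 24000.0]),
    "model_b": np.array([13000.0, 12000.0, 24000.0, 20000.0]),
}

rows = {}
for name, y_hat in preds.items():
    mse = mean_squared_error(y_true, y_hat)
    rows[name] = {
        "R2": r2_score(y_true, y_hat),
        "RMSE": np.sqrt(mse),  # same units as the target
        "MAE": mean_absolute_error(y_true, y_hat),
    }

summary = pd.DataFrame(rows).T
best_name = summary["R2"].idxmax()   # highest R²
worst_name = summary["R2"].idxmin()  # lowest R²
print(summary)
print("Best:", best_name, "| Worst:", worst_name)
```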


## 4️⃣ Feature Importance Analysis

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Use Random Forest for feature importance
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

importances = pd.Series(rf.feature_importances_, index=X.columns)
top_features = importances.sort_values(ascending=False).head(10)

plt.figure(figsize=(10,6))
sns.barplot(x=top_features.values, y=top_features.index)
plt.title("Top 10 Important Features")
plt.show()
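
Impurity-based importances from `feature_importances_` are computed on the training data and can favor features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below is a minimal illustration on synthetic data; the feature names `feature_0` to `feature_5` are placeholders, not columns of the car dataset.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only 2 of the 6 features actually drive the target
X_arr, y_arr = make_regression(
    n_samples=300, n_features=6, n_informative=2, noise=5.0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(X_arr, y_arr, test_size=0.25, random_state=42)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the test set and measure the average drop in R²
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)
perm_importances = pd.Series(perm.importances_mean, index=feature_names)
perm_importances = perm_importances.sort_values(ascending=False)
print(perm_importances)
```

Features whose shuffling barely changes the score contribute little to the model's predictions on unseen data.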


## 5️⃣ Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

grid = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=3, scoring='r2', n_jobs=-1)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)

best_model = grid.best_estimator_
y_pred_best = best_model.predict(X_test)

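# Test-set R² of the tuned model; compare it with the untuned Random Forest before calling it an improvement
print("Tuned Random Forest R²:", r2_score(y_test, y_pred_best))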
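
Tuning does not guarantee a better test score, so it helps to evaluate the untuned and tuned models on the same held-out rows. The sketch below is a hedged illustration on synthetic data with a deliberately small, hypothetical grid so that it runs quickly; it is not the grid used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the prepared car features
X_arr, y_arr = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_arr, y_arr, test_size=0.2, random_state=42)

# Baseline: fixed hyperparameters, no search
baseline = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)
baseline_r2 = r2_score(y_te, baseline.predict(X_te))

# Tuned: small illustrative grid, scored by cross-validated R² on the training set
small_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestRegressor(random_state=42), small_grid, cv=3, scoring="r2")
search.fit(X_tr, y_tr)
tuned_r2 = r2_score(y_te, search.best_estimator_.predict(X_te))

print("Best parameters:", search.best_params_)
print(f"Baseline test R²: {baseline_r2:.3f}")
print(f"Tuned test R²:    {tuned_r2:.3f}")
print(f"Difference:       {tuned_r2 - baseline_r2:+.3f}")
```

A small or negative difference suggests the default settings were already adequate for the data at hand.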


## 📤 Submission
- Upload this notebook and your dataset to **GitHub**.
- Share the GitHub repository link.
- Ensure all markdown explanations are clear.

---

✅ **End of Project**
