# Data Science Project Workflow

## Project Overview
- **Objective**: The goal of this project is to build a machine learning model that accurately predicts the estimated price of a used car based on its specifications. This project aims to make car pricing more transparent and data-driven, helping users make more informed buying and selling decisions.
- **Milestones**: Data Collection, Exploration, Preprocessing, Advanced Analysis, Model Development, Deployment, and Final Documentation.

---

## Domain and Research Questions

### Domain of the Project
- Car Market and Resale (Petrol or Electric Vehicles)

### Research Questions to be Answered
1. **Research Question 1:**
Does mileage significantly affect car price across brands?
2. **Question 2:**
Are electric and hybrid cars priced competitively compared to gasoline and diesel cars? 
3. **Question 3:**
How does transmission type impact fuel efficiency (MPG)?  
4. **Question 4:**
What combination of features best predicts the **price** of a car based on its specifications?

---

# Team Information

## Student Information
- **Name**: Mohamed Bedda
- **Email**: MB2401093@tkh.edu.eg
- **Role**: Data Science Student  
- **Institution**: Coventry University - TKH

## Additional Information
- **Project Timeline**: [Insert Start Date - End Date]  
- **Tools Used**: [Insert List of Tools or Frameworks, e.g., Python, SQLite, Pandas, etc.]  
- **Advisor/Instructor**: [Insert Advisor/Instructor Name, if applicable]  
- **Contact for Inquiries**: [Insert Email or Point of Contact]

---

# Milestone 1: Data Collection, Exploration, and Preprocessing

## Data Collection
- Acquire a dataset from reliable sources (e.g., Kaggle, UCI Repository, or APIs).
- **Scraping Data**:
  - Increase dataset size through web scraping or APIs (e.g., Selenium, BeautifulSoup).
  - Explore public repositories or other accessible sources for additional data.

In [22]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
import pandas as pd
import os
import time

# Suppress warnings from Selenium
import logging
from selenium.webdriver.remote.remote_connection import LOGGER
LOGGER.setLevel(logging.WARNING)


# ===== SETUP =====

service = Service(executable_path="./chromedriver")
options = Options()
# options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(service=service, options=options)

brands = ["bmw", "honda", "chevrolet", "toyota", "tesla", "nissan", "ford", "audi", "kia", "hyundai"]
pages_per_brand = 25

all_data = []

for brand in brands:
    print(f"🔍 Scraping brand: {brand.upper()}")
    for page in range(1, pages_per_brand + 1):
        print(f"  → Page {page}")
        url = f"https://www.cars.com/shopping/{brand}/?page={page}"
        driver.get(url)
        time.sleep(4)

        vehicle_cards = driver.find_elements(By.CLASS_NAME, "vehicle-card")

        for card in vehicle_cards:
            try:
                title = card.find_element(By.CLASS_NAME, "title").text
                price = card.find_element(By.CLASS_NAME, "primary-price").text
                mileage = card.find_element(By.CLASS_NAME, "mileage").text
                year = title.split()[0]
            except NoSuchElementException:
                continue

            try:
                spark_reveal = card.find_element(By.TAG_NAME, "spark-reveal")
                driver.execute_script("arguments[0].setAttribute('aria-expanded', 'true')", spark_reveal)
                time.sleep(0.5)
                html_content = spark_reveal.get_attribute("innerHTML")
            except Exception as e:
                print("⚠️ Failed to open 'Show details':", e)
                html_content = ""

            soup = BeautifulSoup(html_content, "html.parser")

            mpg = fuel = transmission = drivetrain = engine = None

            for feature in soup.select(".vehicle-feature"):
                label = feature.text.lower()
                if "mpg:" in label:
                    mpg = feature.text.split(":")[1].strip()
                elif "fuel type" in label:
                    fuel = feature.text.split(":")[1].strip()
                elif "transmission" in label:
                    transmission = feature.text.split(":")[1].strip()
                elif "drivetrain" in label:
                    drivetrain = feature.text.split(":")[1].strip()
                elif "engine" in label:
                    engine = feature.text.split(":")[1].strip()

            all_data.append({
                "Brand": brand.title(),
                "Title": title,
                "Year": year,
                "Price": price,
                "Mileage": mileage,
                "Fuel Type": fuel,
                "Transmission": transmission,
                "Drivetrain": drivetrain,
                "MPG": mpg,
                "Engine": engine
            })

driver.quit()

# ===== SAVE RESULTS =====

os.makedirs("output", exist_ok=True)
df = pd.DataFrame(all_data)
df.to_csv("output/scrape_cars_com.csv", index=False)
print("✅ Done! Data saved to output/scrape_cars_com.csv")


🔍 Scraping brand: BMW
  → Page 1
  → Page 2
  → Page 3
  → Page 4
  → Page 5
  → Page 6
  → Page 7
  → Page 8
  → Page 9
  → Page 10
  → Page 11
  → Page 12
  → Page 13
  → Page 14
  → Page 15
  → Page 16
  → Page 17
  → Page 18
  → Page 19
  → Page 20
  → Page 21
  → Page 22
  → Page 23
  → Page 24
  → Page 25
🔍 Scraping brand: HONDA
  → Page 1
⚠️ Failed to open 'Show details': Message: no such element: Unable to locate element: {"method":"tag name","selector":"spark-reveal"}
  (Session info: chrome=135.0.7049.85); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
0   chromedriver                        0x0000000100979f14 cxxbridge1$str$ptr + 2816404
1   chromedriver                        0x00000001009721cc cxxbridge1$str$ptr + 2784332
2   chromedriver                        0x00000001004ba40c cxxbridge1$string$len + 93024
3   chromedriver                        0x00000001005011

## Dataset Description
- Create a table to explain:
  - **Column Names**
  - **Data Types**
  - **Descriptions**
  - **Potential Use in Analysis**

## Dataset Description

| **Column Name** | **Data Type** | **Description** | **Potential Use in Analysis** |
|------------------|---------------|------------------|-------------------------------|
| `Brand` | String | The manufacturer of the car (e.g., Toyota, BMW). | Used for brand-based comparisons and market share analysis. |
| `Title` | String | Full listing title, usually includes year, make, and model. | Helps identify model variations and validate other fields. |
| `Year` | Integer | Model year of the vehicle. | Useful for price depreciation trends and historical insights. |
| `Price` | String (formatted currency) | The listed resale price of the vehicle. | Essential for resale value prediction and pricing strategy. |
| `Mileage` | String (formatted text) | The total miles the car has been driven. | Crucial for assessing wear, value, and vehicle condition. |
| `Fuel Type` | String | Type of fuel used (e.g., Gasoline, Electric). | Supports trend analysis in fuel preferences and EV adoption. |
| `Transmission` | String | Type of transmission (e.g., Automatic, Manual). | Relevant for buyer preferences and vehicle classification. |
| `Drivetrain` | String | Power distribution system (e.g., AWD, FWD). | Can influence pricing, regional demand, and performance segmentation. |
| `MPG` | String (range) | Estimated fuel efficiency in miles per gallon (e.g., 15–22). | Useful for eco-efficiency analysis and sustainability insights. |
| `Engine` | String | Description of engine specs (e.g., V6, Electric Motor). | Helps segment vehicles by performance, power, or engine type. |


## Data Exploration
- Summary statistics (mean, median, variance).
- Identify missing values, duplicates, and outliers.
- Data distribution visualizations: histograms, box plots, scatter plots.

In [17]:
# Import libraries
import pandas as pd
import plotly.express as px

# Load dataset
df = pd.read_csv("output/scrape_cars_com.csv")

# Display basic info
print("✅ Dataset Info:")
print(df.info(), "\n")

# Summary statistics
print("📊 Summary Statistics:")
print(df.describe(include='all'), "\n")

# Check for missing values
print("🕳️ Missing Values:")
print(df.isnull().sum(), "\n")

# Check for duplicates
duplicates = df.duplicated().sum()
print(f"🧹 Duplicate Rows: {duplicates}\n")

# (Optional) Remove duplicates
# df.drop_duplicates(inplace=True)

# --- Convert 'Price' and 'Mileage' to numeric ---
df = df[~df['Price'].isin(['Not Priced'])]  # Remove "Not Priced" rows
df['Price'] = df['Price'].replace('[\$,]', '', regex=True).astype(float)
df['Mileage'] = df['Mileage'].replace('[\, mi.]', '', regex=True).astype(float)

# --- Plotly Visualizations ---

# 1️⃣ Histogram for 'Year'
fig_year = px.histogram(
    df, x="Year", nbins=30, title="Year Distribution",
    color_discrete_sequence=["blue"]
)
fig_year.update_layout(xaxis_range=[1990, 2025])  # Zoom in to recent years
fig_year.show()

# 2️⃣ Histogram for 'Price'
fig_price = px.histogram(
    df, x="Price", nbins=40, title="Price Distribution",
    color_discrete_sequence=["green"]
)
fig_price.update_layout(xaxis_range=[0, 180000])  # Zoom into common price range
fig_price.show()

# 3️⃣ Histogram for 'Mileage'
fig_mileage = px.histogram(
    df, x="Mileage", nbins=40, title="Mileage Distribution",
    color_discrete_sequence=["orange"]
)
fig_mileage.update_layout(xaxis_range=[0, 200000])  # Focus on lower mileage cars
fig_mileage.show()

# 4️⃣ Box plot for 'Price'
fig_box_price = px.box(
    df, x="Price", title="Price Box Plot",
    color_discrete_sequence=["green"]
)
fig_box_price.update_layout(xaxis_range=[0, 180000])
fig_box_price.show()

# 5️⃣ Box plot for 'Mileage'
fig_box_mileage = px.box(
    df, x="Mileage", title="Mileage Box Plot",
    color_discrete_sequence=["orange"]
)
fig_box_mileage.update_layout(xaxis_range=[0, 200000])
fig_box_mileage.show()

# 6️⃣ Scatter plot: Mileage vs Price
fig_scatter = px.scatter(
    df, x="Mileage", y="Price", color="Brand",
    title="Price vs. Mileage",
    opacity=0.7
)
fig_scatter.update_layout(
    xaxis_range=[0, 200000],
    yaxis_range=[0, 100000]
)
fig_scatter.show()


✅ Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Brand         4999 non-null   object
 1   Title         4999 non-null   object
 2   Year          4999 non-null   int64 
 3   Price         4999 non-null   object
 4   Mileage       4999 non-null   object
 5   Fuel Type     4973 non-null   object
 6   Transmission  4993 non-null   object
 7   Drivetrain    4989 non-null   object
 8   MPG           4176 non-null   object
 9   Engine        4982 non-null   object
dtypes: int64(1), object(9)
memory usage: 390.7+ KB
None 

📊 Summary Statistics:
       Brand                       Title         Year    Price    Mileage  \
count   4999                        4999  4999.000000     4999       4999   
unique    10                        2732          NaN     2949       4876   
top      Bmw  2022 Chevrolet Equinox 1LT          NaN  $20,99

## Preprocessing and Feature Engineering
- Handle missing values.
- Remove duplicates and outliers.
- Apply transformations (scaling, encoding, feature interactions).

In [None]:
%pip install scikit-learn

In [35]:
import pandas as pd
import numpy as np
import re
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load data
df = pd.read_csv("output/scrape_cars_com.csv")

# ========== HANDLE MISSING VALUES ==========
print("🔍 Missing values before handling:")
print(df.isnull().sum())

# Drop rows missing essential values
df.dropna(subset=["Title", "Year", "Price", "Mileage"], inplace=True)

# Fill missing string-based columns with "Unknown"
for col in ["Fuel Type", "Transmission", "Drivetrain", "Engine", "MPG"]:
    df[col] = df[col].fillna("Unknown")

# ========== FILTER FUEL TYPE ==========
valid_fuels = ["gasoline", "electric", "hybrid", "plug in hybrid", "diesel"]
df["Fuel Type"] = df["Fuel Type"].apply(
    lambda x: x.lower() if isinstance(x, str) else "unknown"
)
df["Fuel Type"] = df["Fuel Type"].apply(
    lambda x: x if any(fuel in x for fuel in valid_fuels) else "unknown"
)
df["Fuel Type"] = df["Fuel Type"].replace({
    "plug in hybrid": "hybrid"
}).str.title()

# ========== EXTRACT ENGINE SIZE ==========
def extract_engine_size(row):
    fuel = row["Fuel Type"].lower()
    engine = str(row["Engine"]).lower()
    if fuel == "electric":
        return "Electric"
    match = re.search(r'(\d\.\d|\d)(\s)?(l|liter)', engine)
    if match:
        size = float(match.group(1))
        return f"{size:.1f}L"
    return "Unknown"

df["Engine Size"] = df.apply(extract_engine_size, axis=1)

# Drop the messy 'Engine' text column
df.drop(columns=["Engine"], inplace=True)

# ========== GROUP TRANSMISSION TYPES ==========
def group_transmission(trans):
    trans_lower = trans.lower()
    if "manual" in trans_lower:
        return "Manual"
    elif "automatic" in trans_lower or "a/t" in trans_lower:
        return "Automatic"
    else:
        return np.nan  # Remove Other transmissions

df["Transmission"] = df["Transmission"].apply(group_transmission)

# ========== GROUP DRIVETRAIN TYPES ==========
def group_drivetrain(drive):
    drive_lower = drive.lower()
    if "all" in drive_lower or "awd" in drive_lower:
        return "AWD"
    elif "four" in drive_lower or "4wd" in drive_lower or "4x4" in drive_lower:
        return "4WD"
    elif "front" in drive_lower or "fwd" in drive_lower:
        return "FWD"
    elif "rear" in drive_lower or "rwd" in drive_lower:
        return "RWD"
    else:
        return np.nan  # Remove Other drivetrains

df["Drivetrain"] = df["Drivetrain"].apply(group_drivetrain)

# ========== REMOVE DUPLICATES ==========
df.drop_duplicates(inplace=True)

# ========== CLEAN AND TRANSFORM NUMERICAL FIELDS ==========
# Remove rows where Price is "Not Priced" or invalid
df = df[~df["Price"].str.contains("Not Priced", na=False)]
df["Price"] = df["Price"].str.replace("[$,]", "", regex=True).astype(float)
df["Mileage"] = df["Mileage"].str.replace("[^0-9.]", "", regex=True).astype(float)
df["Year"] = df["Year"].astype(int)

# ========== CLEAN MPG ==========
def extract_avg_mpg(mpg_str):
    if "-" in mpg_str:
        low, high = mpg_str.split("-")
        return (float(low.strip()) + float(high.strip())) / 2
    try:
        return float(mpg_str)
    except:
        return np.nan

df["MPG"] = df["MPG"].apply(extract_avg_mpg)
df["MPG"].fillna(df["MPG"].median(), inplace=True)

# ========== REMOVE OUTLIERS ==========
df = df[(df["Mileage"] < 300_000) & (df["Price"] < 200_000)]

# ========== REMOVE UNKNOWN/OTHER ==========
df = df.dropna(subset=["Fuel Type", "Transmission", "Drivetrain", "Engine Size"])
df = df[~df[["Fuel Type", "Transmission", "Drivetrain", "Engine Size"]].isin(["Unknown"]).any(axis=1)]

# ========== FINAL FEATURE SCALING AND ENCODING SETUP ==========
numeric_features = ["Year", "Price", "Mileage", "MPG"]
categorical_features = ["Fuel Type", "Transmission", "Drivetrain", "Brand", "Engine Size"]

numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(drop="first", handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Apply transformations
X_preprocessed = preprocessor.fit_transform(df)

print("✅ Preprocessing complete.")

# Save cleaned dataset
df.to_csv("output/cleaned_scrape_cars_com.csv", index=False)


🔍 Missing values before handling:
Brand             0
Title             0
Year              0
Price             0
Mileage           0
Fuel Type        26
Transmission      6
Drivetrain       10
MPG             823
Engine           17
dtype: int64
✅ Preprocessing complete.


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df["MPG"].fillna(df["MPG"].median(), inplace=True)



---

# Milestone 2: Advanced Data Analysis and Feature Engineering

## Statistical Analysis
- Conduct tests such as t-tests, ANOVA, and chi-squared to explore relationships.

In [None]:
%pip install statsmodels


In [36]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind, chi2_contingency
from statsmodels.formula.api import ols
import statsmodels.api as sm

# Load the cleaned dataset
df = pd.read_csv("output/cleaned_scrape_cars_com.csv")

# ========== T-TEST: ELECTRIC vs GASOLINE CAR PRICES ==========
electric_prices = df[df["Fuel Type"].str.lower() == "electric"]["Price"]
gasoline_prices = df[df["Fuel Type"].str.lower() == "gasoline"]["Price"]

t_stat, p_value = ttest_ind(electric_prices, gasoline_prices, equal_var=False)

print("🔍 T-Test (Electric vs Gasoline Prices):")
print(f"T-statistic = {t_stat:.2f}, P-value = {p_value:.4f}\n")

# ========== ANOVA: MPG ACROSS TRANSMISSION TYPES ==========
model = ols("MPG ~ C(Transmission)", data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print("🔬 ANOVA: MPG across Transmission Types")
print(anova_table, "\n")

# ========== CHI-SQUARED TEST: Drivetrain vs Fuel Type ==========
contingency_table = pd.crosstab(df["Drivetrain"], df["Fuel Type"])
chi2, p, dof, _ = chi2_contingency(contingency_table)

print("🧪 Chi-Squared Test (Drivetrain vs Fuel Type):")
print(f"Chi2 = {chi2:.2f}, P-value = {p:.4f}, Degrees of Freedom = {dof}")


🔍 T-Test (Electric vs Gasoline Prices):
T-statistic = 8.14, P-value = 0.0000

🔬 ANOVA: MPG across Transmission Types
                        sum_sq      df          F        PR(>F)
C(Transmission)    1089.206592     1.0  41.522574  1.290683e-10
Residual         115078.349799  4387.0        NaN           NaN 

🧪 Chi-Squared Test (Drivetrain vs Fuel Type):
Chi2 = 496.29, P-value = 0.0000, Degrees of Freedom = 9


## 📊 Interpreting Statistical Analysis Results

This analysis helps us uncover relationships between key variables in the car dataset. Here's what each test reveals and why it matters:

### 🔍 T-Test: Electric vs Gasoline Prices
- **What it tells us:** Whether there’s a statistically significant difference in prices between electric and gasoline cars.
- **Why it matters:** Helps evaluate if electric vehicles (EVs) are priced higher or lower than traditional gasoline cars — crucial for market segmentation, pricing strategy, and consumer behavior analysis.
- **Use case:** A car dealership might use this to understand if EVs are justifiably priced higher, and whether customer education or price alignment is needed.

### 🔬 ANOVA: MPG across Transmission Types
- **What it tells us:** Whether average fuel efficiency (MPG) differs across transmission types (Manual, Automatic, Other).
- **Why it matters:** Provides insight into how the choice of transmission affects efficiency. This could influence manufacturing decisions or highlight which types offer the best balance between performance and fuel economy.
- **Use case:** Marketing teams can highlight the fuel-efficiency advantage of a specific transmission to attract eco-conscious buyers.

### 🧪 Chi-Squared Test: Drivetrain vs Fuel Type
- **What it tells us:** Whether there's an association between drivetrain types (AWD, FWD, etc.) and fuel types (Electric, Gasoline, etc.).
- **Why it matters:** Understanding drivetrain–fuel type relationships can help with product planning. For instance, AWD might be more common in hybrids or gas cars than in EVs.
- **Use case:** Car manufacturers can optimize configurations based on market demands or geographical preferences (e.g., AWD in snowy regions).

---
By identifying these relationships, stakeholders—from data scientists to marketers to manufacturers—can make informed, data-backed decisions that align with trends and user preferences.

## Feature Engineering
- Create derived features based on domain knowledge.
- Apply transformations such as normalization, log scaling, or polynomial features.

In [None]:
# ===============================================
# 📦 FEATURE ENGINEERING
# ===============================================

import pandas as pd
import numpy as np

# Load cleaned dataset
df = pd.read_csv("output/cleaned_scrape_cars_com.csv")

# ====== Feature Engineering ======

# 1. Engine Size (numeric format)
def parse_engine_size(engine_str):
    if engine_str != "Electric":
        try:
            return float(engine_str.replace("L", ""))
        except:
            return np.nan
    return np.nan  # Electric cars have no engine size

df["Engine Size (L)"] = df["Engine Size"].apply(parse_engine_size)

# Remove original "Engine Size" text column
df.drop(columns=["Engine Size"], inplace=True)

# 2. Log transformations (for normalization)
df["Log Price"] = np.log1p(df["Price"])
df["Log Mileage"] = np.log1p(df["Mileage"])

# 3. Car Age
current_year = pd.Timestamp.now().year
df["Car Age"] = current_year - df["Year"]

# 4. MPG * Engine interaction
df["MPG*Engine"] = df["MPG"] * df["Engine Size (L)"]
df["MPG*Engine"].fillna(0, inplace=True)  # Set 0 for electric cars (no engine size)

# 5. Binary flags for Fuel Type
df["IsElectric"] = df["Fuel Type"].str.lower().apply(lambda x: 1 if "electric" in x else 0)
df["IsHybrid"] = df["Fuel Type"].str.lower().apply(lambda x: 1 if "hybrid" in x else 0)
df["IsGasoline"] = df["Fuel Type"].str.lower().apply(lambda x: 1 if "gasoline" in x else 0)
df["IsDiesel"] = df["Fuel Type"].str.lower().apply(lambda x: 1 if "diesel" in x else 0)

# 6. Low Mileage Flag
df["IsLowMileage"] = df["Mileage"].apply(lambda x: 1 if x < 20000 else 0)

# 7. Target Variable: IsExpensive (Top 25% of prices)
price_threshold = df["Price"].quantile(0.75)
df["IsExpensive"] = df["Price"].apply(lambda x: 1 if x > price_threshold else 0)

# ====== Final Cleanup ======

# Drop rows where critical engineered fields are missing (except Engine Size L)
df.dropna(subset=["Log Price", "Log Mileage", "Car Age"], inplace=True)

# ====== Save ======
df.to_csv("output/engineered_scrape_cars_com.csv", index=False)

print("✅ Feature engineering completed and saved to 'output/engineered_scrape_cars_com.csv'")


✅ Feature engineering completed and saved to 'output/engineered_scrape_cars_com.csv'


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df["MPG*Engine"].fillna(0, inplace=True)  # Set 0 for electric cars (no engine size)


## 🔍 Feature Engineering Explanation

Below is a breakdown of the newly engineered columns, along with their purpose and potential benefit in downstream analysis or machine learning:

### 🔧 Numeric Features

- **Engine Size (L):**  
  Extracts numeric engine size (in liters) from descriptive strings.  
  *Purpose:* Helps capture vehicle power and fuel efficiency trends. Often correlated with performance, emissions, and fuel economy.

- **Log Price:**  
  Applies a log transformation to the Price column.  
  *Purpose:* Normalizes price data, reducing skewness and making it more suitable for statistical models or machine learning.

- **Log Mileage:**  
  Applies a log transformation to the Mileage column.  
  *Purpose:* Reduces the impact of high-mileage outliers and better captures the mileage distribution.

- **Car Age:**  
  Calculates how old the car is based on the current year.  
  *Purpose:* Age is a key variable in car depreciation and reliability assessments.


- **MPG*Engine:**  
  Interaction feature combining fuel economy (MPG) and engine size.  
  *Purpose:* Balances performance and efficiency; useful in regression models or clustering.

---

### 🔲 Binary Flag Features

- **IsElectric:**  
  1 if the car is electric, else 0.  
  *Purpose:* Useful for comparing electric cars against traditional fuel types or filtering for EVs.

- **IsHybrid:**  
  1 if the car is hybrid or plug-in hybrid, else 0.  
  *Purpose:* Groups hybrid technologies under one flag for analysis and modeling.

- **IsGasoline:**  
  1 if the car runs on gasoline, else 0.  
  *Purpose:* Enables comparative analysis and segmentation by fuel type.

- **IsDiesel:**  
  1 if the car uses diesel, else 0.  
  *Purpose:* Helps study older or commercial vehicle segments that typically use diesel.

- **IsLowMileage:**  
  1 if mileage is under 20,000 miles.  
  *Purpose:* Flags low-usage vehicles, usually indicating better condition or newer usage.

- **IsExpensive:**  
  1 if the price is in the top 25% percentile.  
  *Purpose:* Identifies luxury or premium pricing tiers and helps in market segmentation.

---

These engineered features enhance interpretability, enable better model performance, and offer deeper insights for both statistical analysis and machine learning workflows.


## Data Visualization
- Generate insightful visualizations:
  - Correlation heatmaps, pair plots.
  - Trends and comparisons using bar charts, line charts, and dashboards.

In [2]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# Load feature-engineered dataset
df = pd.read_csv("output/engineered_scrape_cars_com.csv")

# ========== 1. Correlation Heatmap ==========
corr_matrix = df[["Price", "Mileage", "MPG", "Engine Size (L)", "Car Age"]].corr()
heatmap = px.imshow(
    corr_matrix,
    text_auto=True,
    color_continuous_scale="RdBu_r",
    title="Correlation Heatmap of Key Numerical Features"
)

# ========== 2. Scatter Matrix ==========
scatter_matrix = px.scatter_matrix(
    df,
    dimensions=["Price", "Mileage", "MPG", "Engine Size (L)", "Car Age"],
    color="Fuel Type",
    title="Scatter Matrix of Key Features Colored by Fuel Type",
    height=700
)

# ========== 3. Price Distribution by Transmission Type ==========
box_transmission = px.box(
    df, x="Transmission", y="Price", color="Transmission",
    title="Price Distribution by Transmission Type"
)

# ========== 4. Bar Chart: Average MPG by Fuel Type ==========
avg_mpg_by_fuel = df.groupby("Fuel Type")["MPG"].mean().reset_index()
bar_mpg = px.bar(
    avg_mpg_by_fuel, x="Fuel Type", y="MPG", color="Fuel Type",
    title="Average MPG by Fuel Type"
)

# ========== 5. Line Chart: Average MPG vs Car Age ==========
avg_mpg_by_age = df.groupby("Car Age")["MPG"].mean().reset_index()
line_mpg_age = px.line(
    avg_mpg_by_age, x="Car Age", y="MPG", markers=True,
    title="Average MPG by Car Age"
)

# ========== 6. Count of Listings per Drivetrain ==========
drivetrain_counts = df["Drivetrain"].value_counts().reset_index()
drivetrain_counts.columns = ["Drivetrain", "Count"]
bar_drivetrain = px.bar(
    drivetrain_counts, x="Drivetrain", y="Count", color="Drivetrain",
    title="Listing Count by Drivetrain"
)

# ========== DISPLAY ==========
heatmap.show()
scatter_matrix.show()
box_transmission.show()
bar_mpg.show()
line_mpg_age.show()
bar_drivetrain.show()



---


# Milestone 3: Machine Learning Model Development and Optimization

## Model Selection
- Choose appropriate models for the problem type (classification, regression, clustering, etc.).


In [1]:
## Model Selection
# - Choose appropriate models for the problem type (classification, regression, clustering, etc.).

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Load data
df = pd.read_csv("output/engineered_scrape_cars_com.csv")

# Define target and drop irrelevant columns
target = "Price"  # 🎯 Now predicting continuous Price
drop_cols = ["Title", "Log Price"]  # Drop Title (non-numeric) and Log Price (we predict raw Price)

X = df.drop(columns=drop_cols + [target])
y = df[target]

print("✅ Features and target for regression defined.")


✅ Features and target for regression defined.


## Model Training
- Split data into training, validation, and testing sets.
- Address imbalances using techniques like SMOTE or stratified sampling.

In [2]:
## Model Training
# - Split data into training, validation, and testing sets.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# ===== 1. Load data (already done above)

# ===== 2. Split Data (70/15/15) (no stratification for regression)
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.176, random_state=42
)

# ===== 3. Separate feature types
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_features = X.select_dtypes(include=["object"]).columns.tolist()

# ===== 4. Preprocessing Pipelines
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(drop="first", handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])

# ===== 5. Preprocess the sets
X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)

print(f"✅ Data preprocessed for regression. Shapes: {X_train_processed.shape}, {X_val_processed.shape}, {X_test_processed.shape}")


✅ Data preprocessed for regression. Shapes: (3073, 29), (657, 29), (659, 29)


## Model Evaluation
- Metrics to consider: Accuracy, Precision, Recall, F1-score, RMSE, etc.
- Visual tools: Confusion matrices, ROC curves.

In [4]:
# --- Preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import pandas as pd
import numpy as np
import plotly.express as px

preprocessor.fit(X_train)

X_train_processed = preprocessor.transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)

# --- Models
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42)
}

evaluation_results = []

for name, model in models.items():
    print(f"\n🔍 Evaluating: {name}")
    
    model.fit(X_train_processed, y_train)  # <--- No SMOTE needed

    y_pred = model.predict(X_val_processed)

    mae = mean_absolute_error(y_val, y_pred)
    mse = mean_squared_error(y_val, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_val, y_pred)

    evaluation_results.append({
        "Model": name,
        "MAE": mae,
        "MSE": mse,
        "RMSE": rmse,
        "R2 Score": r2
    })

# --- Plot
results_df = pd.DataFrame(evaluation_results).sort_values(by="R2 Score", ascending=False)

fig = px.bar(
    results_df.melt(id_vars=["Model"], value_vars=["MAE", "RMSE", "R2 Score"]),
    x="variable", y="value", color="Model", barmode="group",
    title="Model Evaluation Metrics",
    labels={"variable": "Metric", "value": "Score"}
)

fig.update_layout(yaxis_title="Score", xaxis_title="Metric", height=500)
fig.show()



🔍 Evaluating: Linear Regression

🔍 Evaluating: Random Forest


## Hyperparameter Tuning
- Techniques: Grid Search, Random Search, or advanced methods like Bayesian Optimization.

In [5]:
## Hyperparameter Tuning
# - Grid search for best model parameters.

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# --- Updated Parameter Grids for Regression Models ---
param_grids = {
    "Linear Regression": {
        # No major hyperparameters to tune for basic Linear Regression
    },
    "Random Forest": {
        "model__n_estimators": [100, 200],
        "model__max_depth": [None, 10, 20],
        "model__min_samples_split": [2, 5],
        "model__min_samples_leaf": [1, 2, 4]
    }
}

# --- Best models dictionary ---
best_models = {}

for name, model in models.items():
    print(f"\n🔧 Tuning hyperparameters for {name}...")

    pipeline = Pipeline(steps=[
        ("model", model)
    ])

    if param_grids[name]:  # If there are hyperparameters to tune
        grid_search = GridSearchCV(
            pipeline,
            param_grids[name],
            cv=5,
            scoring="neg_root_mean_squared_error",  # Use RMSE for regression scoring
            n_jobs=-1
        )

        grid_search.fit(X_train_processed, y_train)  # No SMOTE needed now
        best_models[name] = grid_search.best_estimator_

        print(f"✅ Best Params for {name}: {grid_search.best_params_}")
        print(f"📉 Best CV RMSE: {-grid_search.best_score_:.2f}")

    else:  # For Linear Regression (no real tuning needed)
        pipeline.fit(X_train_processed, y_train)
        best_models[name] = pipeline

        print(f"✅ {name} trained with default parameters.")



🔧 Tuning hyperparameters for Linear Regression...
✅ Linear Regression trained with default parameters.

🔧 Tuning hyperparameters for Random Forest...
✅ Best Params for Random Forest: {'model__max_depth': 20, 'model__min_samples_leaf': 2, 'model__min_samples_split': 2, 'model__n_estimators': 200}
📉 Best CV RMSE: 7178.84


## Model Comparison
- Compare multiple models and justify the final model selection.

In [6]:
## Model Comparison (Regression)
# - Compare multiple models and justify the final model selection.

import plotly.express as px
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Create table to store comparison metrics
comparison_results = []

for name, model in best_models.items():
    y_pred = model.predict(X_val_processed)

    mae = mean_absolute_error(y_val, y_pred)
    mse = mean_squared_error(y_val, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_val, y_pred)

    comparison_results.append({
        "Model": name,
        "MAE": mae,
        "MSE": mse,
        "RMSE": rmse,
        "R2 Score": r2
    })

# Put into DataFrame
comparison_df = pd.DataFrame(comparison_results).sort_values(by="R2 Score", ascending=False)

print("\n📊 Model Comparison Results:")
print(comparison_df)

# Plot with Plotly
fig = px.bar(
    comparison_df,
    x="Model",
    y="R2 Score",
    color="Model",
    title="Model Comparison (based on R2 Score)",
    text="R2 Score"
)
fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')
fig.show()



📊 Model Comparison Results:
               Model          MAE           MSE         RMSE  R2 Score
1      Random Forest  3722.868023  6.215580e+07  7883.895009  0.801571
0  Linear Regression  5474.120988  8.950569e+07  9460.744558  0.714258


In [1]:
## Save Final Random Forest Regressor Model
# - Predicting actual price (not just expensive yes/no)
# - Matching the final cleaned + feature-engineered dataset

import pandas as pd
import joblib
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# --- Load engineered dataset ---
df = pd.read_csv("output/engineered_scrape_cars_com.csv")

# --- Final selected features ---
final_features = [
    "Brand", "Fuel Type", "Transmission", "Drivetrain",
    "Year", "Mileage", "MPG", "Engine Size (L)",
    "MPG*Engine", "Log Mileage",
    "IsElectric", "IsHybrid", "IsGasoline", "IsDiesel", "IsLowMileage"
]

X_final = df[final_features]
y_final = df["Price"]   # 🛑 Predicting Price

# --- Drop rows with missing values ---
final_data = pd.concat([X_final, y_final], axis=1).dropna()
X_final_clean = final_data[final_features]
y_final_clean = final_data["Price"]

# --- Define feature types ---
numeric_features = ["Year", "Mileage", "MPG", "Engine Size (L)", "MPG*Engine", "Log Mileage"]
categorical_features = ["Brand", "Fuel Type", "Transmission", "Drivetrain"]
binary_features = ["IsElectric", "IsHybrid", "IsGasoline", "IsDiesel", "IsLowMileage"]

# --- Preprocessing pipeline ---
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(drop="first", handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
    ("bin", "passthrough", binary_features)
])

# --- Final Model (Random Forest Regressor) ---
final_model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("regressor", RandomForestRegressor(
        n_estimators=200,
        max_depth=20,
        min_samples_split=2,
        random_state=42
    ))
])

# --- Fit on full cleaned data ---
final_model.fit(X_final_clean, y_final_clean)

# --- Save model with compression ---
joblib.dump(final_model, "output/random_forest_model.pkl", compress=3)
print("✅ Final Random Forest Regressor model trained and saved as 'output/random_forest_model.pkl' with compression ✅")


✅ Final Random Forest Regressor model trained and saved as 'output/random_forest_model.pkl' with compression ✅


## Visualization for Research Questions
- This section will include the visualizations that provide insights for the research questions defined earlier.  
- **Development Steps for Answering the Research Questions**:
  1. During **Exploratory Data Analysis (EDA)**, visualize initial patterns or trends related to the research questions.
  2. During **Model Evaluation**, provide visualizations to interpret model performance with respect to the research questions.
  3. During the **Final Analysis and Reporting**, present polished visualizations that summarize findings for each research question.

- Create the visualizations for each research question you defined, prove it or answer it, then add a markdown cell after each visual to comment and explain how the visual support your research question.

🧪 **Research Question 1:**
Does mileage significantly affect car price across brands?

In [15]:
import pandas as pd
import plotly.express as px

df = pd.read_csv("output/engineered_scrape_cars_com.csv")

fig_mileage_price = px.scatter(
    df, x="Mileage", y="Price", color="Brand",
    title="Mileage vs Price by Brand",
    labels={"Mileage": "Mileage", "Price": "Price ($)"},
    trendline="ols", log_y=True, opacity=0.6, height=500
)
fig_mileage_price.show()


**Interpretation:**  
This scatter plot shows that as mileage increases, car prices tend to drop, although the trend varies by brand. The trendlines help visualize this negative correlation. Luxury brands generally retain higher prices even at higher mileage, while economy brands depreciate more quickly.


🔋 **Research Question 2:**
Are electric and hybrid cars priced competitively compared to gasoline and diesel cars?

In [16]:
fig_price_fuel = px.box(
    df, x="Fuel Type", y="Price",
    title="Price Comparison by Fuel Type",
    color="Fuel Type",
    points="all", log_y=True,
    labels={"Fuel Type": "Fuel Type", "Price": "Price ($)"}
)
fig_price_fuel.show()


**Interpretation:**  
Electric and hybrid vehicles generally fall in the mid-to-high price range, with electric cars occasionally spiking into premium territory. This suggests that while electric vehicles can be competitively priced, many are positioned as higher-end models compared to gasoline or diesel alternatives.


⚙️ **Research Question 3:**
How does transmission type impact fuel efficiency (MPG)?

In [None]:
fig_mpg_trans = px.box(
    df, x="Transmission", y="MPG",
    title="Fuel Efficiency (MPG) by Transmission Type",
    color="Transmission",
    points="all",
    labels={"Transmission": "Transmission Type", "MPG": "Miles per Gallon"}
)
fig_mpg_trans.show()


**Interpretation:**  
Automatic transmissions show a wider range of fuel efficiency, likely due to a mix of vehicle sizes and engine types. Manuals tend to have slightly better MPG on average in this dataset, possibly reflecting a concentration in smaller, economy-focused cars.

💰 **Research Question 4:**
What combination of features best predicts the **price** of a car based on its specifications?


In [64]:
import plotly.figure_factory as ff

# Select relevant columns (remove 'IsExpensive', focus on Price prediction)
corr_cols = [
    "Price", "Year", "Mileage", "MPG", "Engine Size (L)", "Car Age",
    "IsElectric", "IsHybrid", "IsLowMileage", "MPG*Engine"
]

# Compute correlation matrix
corr_matrix = df[corr_cols].corr().round(2)

# Create Plotly annotated heatmap
fig_corr = ff.create_annotated_heatmap(
    z=corr_matrix.values,
    x=corr_matrix.columns.tolist(),
    y=corr_matrix.index.tolist(),
    colorscale="YlGnBu",
    showscale=True,
    hoverinfo="z"
)

fig_corr.update_layout(
    title_text="Correlation Heatmap with Price",
    height=700
)
fig_corr.show()


**Interpretation:**  
The heatmap reveals that higher engine size, lower mileage, and newer car age are positively associated with a higher car price. MPG shows a negative correlation, indicating that more fuel-efficient cars generally have lower prices. This supports the idea that premium vehicles tend to have bigger engines, lower usage, and newer model years.


---


# Milestone 4: Deployment and Monitoring

## Deployment
- Deploy the model as a REST API (Flask, FastAPI) or interactive dashboards (Streamlit, Dash).
- Host on cloud platforms (AWS, Azure, GCP) or local servers.

In [None]:
%%writefile streamlit_app.py

import streamlit as st
import pandas as pd
import numpy as np
import joblib
import plotly.express as px
import plotly.graph_objects as go

# --- Load trained pipeline model ---
model_path = "output/random_forest_model.pkl"
model = joblib.load(model_path)

# --- Load dataset (for price search) ---
data_path = "output/engineered_scrape_cars_com.csv"
df_full = pd.read_csv(data_path)

# --- Page Settings ---
st.set_page_config(page_title="Car Price Estimator", layout="wide")
st.title("🚘 Car Price Estimator")
st.markdown("Estimate the **expected price** of a car based on its specifications.")

# --- User Input Form ---
with st.form("input_form"):
    col1, col2 = st.columns(2)

    with col1:
        brand = st.selectbox("Brand", ["BMW", "Audi", "Chevrolet", "Toyota", "Honda", "Ford", "Tesla", "Hyundai", "Kia", "Nissan"])
        fuel = st.selectbox("Fuel Type", ["Gasoline", "Hybrid", "Electric", "Diesel"])
        transmission = st.selectbox("Transmission", ["Automatic", "Manual"])
        drivetrain = st.selectbox("Drivetrain", ["AWD", "4WD", "FWD", "RWD"])
        year = st.slider("Year", 2000, 2024, 2023)

    with col2:
        mileage = st.slider("Mileage (mi)", 0, 100000, 5000, step=1000)
        mpg = st.slider("Fuel Efficiency (MPG)", 5.0, 80.0, 25.0)

        if fuel == "Electric":
            engine_size = 0.0
            st.caption("⚡ Electric selected — Engine Size set to 0.0L automatically.")
        else:
            engine_size = st.slider("Engine Size (Liters)", 0.0, 6.0, 2.0, step=0.1)

    submitted = st.form_submit_button("🔍 Predict")

# --- Prediction Logic ---
if submitted:
    car_age = 2025 - year
    mpg_engine = mpg * engine_size
    is_electric = 1 if fuel.lower() == "electric" else 0
    is_hybrid = 1 if "hybrid" in fuel.lower() else 0
    is_gasoline = 1 if fuel.lower() == "gasoline" else 0
    is_diesel = 1 if fuel.lower() == "diesel" else 0
    is_low_mileage = 1 if mileage < 20000 else 0
    log_mileage = np.log1p(mileage)

    input_df = pd.DataFrame([{
        "Brand": brand,
        "Fuel Type": fuel,
        "Transmission": transmission,
        "Drivetrain": drivetrain,
        "Year": year,
        "Mileage": mileage,
        "MPG": mpg,
        "Engine Size (L)": engine_size,
        "MPG*Engine": mpg_engine,
        "Log Mileage": log_mileage,
        "IsElectric": is_electric,
        "IsHybrid": is_hybrid,
        "IsGasoline": is_gasoline,
        "IsDiesel": is_diesel,
        "IsLowMileage": is_low_mileage
    }])

    try:
        predicted_price = model.predict(input_df)[0]
        buffer_percentage = 0.05  # ±5%
        price_min_estimate = predicted_price * (1 - buffer_percentage)
        price_max_estimate = predicted_price * (1 + buffer_percentage)

        st.subheader("🔎 Prediction Result")
        st.success(f"💰 Estimated Price Range: **${price_min_estimate:,.0f} - ${price_max_estimate:,.0f}**")
        st.caption(f"🎯 Central Estimate: ${predicted_price:,.2f}")

        # --- Three Side-by-Side Charts ---
        col_a, col_b, col_c = st.columns(3)

        # 1️⃣ Gauge Chart for Price
        with col_a:
            fig_gauge = go.Figure(go.Indicator(
                mode="gauge+number",
                value=predicted_price,
                domain={"x": [0, 1], "y": [0, 1]},
                title={"text": "Estimated Price ($)"},
                gauge={
                    "axis": {"range": [0, df_full["Price"].max()]},
                    "bar": {"color": "darkblue"},
                    "steps": [
                        {"range": [0, df_full["Price"].median()], "color": "lightgray"},
                        {"range": [df_full["Price"].median(), df_full["Price"].max()], "color": "lightgreen"},
                    ],
                }
            ))
            st.plotly_chart(fig_gauge, use_container_width=True)

        # 2️⃣ Radar Chart for Inputs
        with col_b:
            features = ["Mileage", "MPG", "Year", "Engine Size (L)"]
            values = [mileage, mpg, year, engine_size]

            bounds = {
                "Mileage": (0, 200000),
                "MPG": (5, 80),
                "Year": (2000, 2025),
                "Engine Size (L)": (0, 6)
            }

            normalized = [(v - bounds[f][0]) / (bounds[f][1] - bounds[f][0]) for f, v in zip(features, values)]

            radar_df = pd.DataFrame({
                "Feature": features + [features[0]],
                "Normalized": normalized + [normalized[0]]
            })

            fig_radar = px.line_polar(
                radar_df,
                r="Normalized",
                theta="Feature",
                line_close=True,
                title="Your Car's Profile",
                range_r=[0, 1]
            )
            fig_radar.update_traces(fill="toself")
            st.plotly_chart(fig_radar, use_container_width=True)

        # 3️⃣ Feature Importance
        with col_c:
            try:
                base_model = model.named_steps.get("regressor") or model.named_steps.get("model") or model
                if hasattr(base_model, "feature_importances_"):
                    importances = base_model.feature_importances_
                    feature_names = model.named_steps["preprocess"].get_feature_names_out()

                    allowed_features = [
                        "Brand", "Fuel Type", "Transmission", "Drivetrain", "Year", "Mileage", "MPG", "Engine Size (L)"
                    ]

                    mapped_features = []
                    for fname in feature_names:
                        clean = fname.split("__")[-1]
                        for af in allowed_features:
                            if af in clean:
                                mapped_features.append((af, importances[list(feature_names).index(fname)]))
                                break

                    if mapped_features:
                        df_feat = pd.DataFrame(mapped_features, columns=["Feature", "Importance"])
                        df_feat = df_feat.groupby("Feature").sum().sort_values("Importance", ascending=False).reset_index()

                        fig_bar = px.bar(
                            df_feat.head(10),
                            x="Importance",
                            y="Feature",
                            orientation="h",
                            color="Importance",
                            color_continuous_scale="Blues"
                        )
                        fig_bar.update_layout(yaxis=dict(autorange="reversed"))
                        st.plotly_chart(fig_bar, use_container_width=True)
                    else:
                        st.info("ℹ️ No matched features.")
                else:
                    st.info("ℹ️ Feature importance unavailable.")
            except Exception as e:
                st.warning(f"⚠️ Feature importance plot error: {e}")

    except Exception as e:
        st.error(f"❌ Error during prediction: {e}")

# --- Price Range Recommendation Feature ---
st.header("🎯 Best Cars within Your Estimated Price Range")

if submitted:
    # Filter the cars within predicted personalized range
    df_recommend = df_full[
        (df_full["Price"] >= price_min_estimate) &
        (df_full["Price"] <= price_max_estimate)
    ]

    if not df_recommend.empty:
        st.subheader("🏎️ 5 Best Cars by Sub-Price Categories")
        
        # Split into 5 price ranges
        price_bins = np.linspace(price_min_estimate, price_max_estimate, 6)
        df_recommend['Price Bin'] = pd.cut(df_recommend["Price"], bins=price_bins, labels=[1,2,3,4,5])

        final_selection = df_recommend.sort_values(
            ["Price Bin", "MPG", "Mileage", "Year"],
            ascending=[True, False, True, False]
        ).groupby("Price Bin").head(1)

        if not final_selection.empty:
            final_selection["Price"] = final_selection["Price"].apply(lambda x: f"${x:,.2f}")
            final_selection["Mileage"] = final_selection["Mileage"].apply(lambda x: f"{x:,.1f} mi")
            final_selection["Average MPG"] = final_selection["MPG"].apply(lambda x: f"{int(round(x))} MPG")

            display_cols = ["Title", "Price", "Mileage", "Average MPG", "Fuel Type", "Transmission", "Drivetrain"]
            final_selection = final_selection[display_cols].reset_index(drop=True)
            final_selection.index = np.arange(1, len(final_selection) + 1)
            st.table(final_selection)
        else:
            st.info("🔍 No cars found matching across sub-ranges.")
    else:
        st.info("🔍 No matching cars found in estimated range.")



Writing streamlit_app2.py



---


# Milestone 5: Final Documentation and Presentation

## Final Report
- Summarize all project phases, including methodologies and insights.
- Provide actionable recommendations based on findings.

## Final Presentation
- Create a presentation for stakeholders, highlighting key results.
- Include a demo of the deployed application or model.

## Future Improvements
- Suggest areas for enhancement:
  - Incorporating more diverse data.
  - Experimenting with additional algorithms.
  - Optimizing deployment for scalability.

---

# Additional Sections

## Challenges Faced
- Document key challenges encountered during the project lifecycle.

## Lessons Learned
- Reflect on insights and skills gained through the project.

## References
- List resources for datasets, tools, and techniques utilized.

---

# More Sections for Specific Projects

## Ethical Considerations
- Discuss privacy, fairness, or other ethical implications.

## Business Impact
- Highlight how the findings address the original objective.

## Team Contributions
- Acknowledge contributions from team members and collaborators.


---


# Reflection: Data Science Lifecycle Steps and Big Data

Reflect on which steps of the data science lifecycle can and cannot be effectively applied to big data, and justify your answers:

## Steps That Can Be Used with Big Data

Create a list for the steps that can be used with big data with how each step can be used, why is such step applicable and an example tool used for such step in big data world.

Include tools, methods, or technologies that make these steps scalable.

## Steps That Are Challenging with Big Data

Create a list for the steps that are challenging with big data with the solution for the challenge, why is such step challenging and an example tool used for such step in big data world.

Explain why these steps are difficult and suggest potential solutions.

## Recommendations for Big Data Projects

Create a list for the recommendations you have for other data scientists willing to take your project with a big data approach.

---