# MSDS 610 — Final Project 2
### Eddie Flores

### **Note:** The optimized `.pkl` model was too large to upload to GitHub, so it was uploaded as a zip file instead.

---
## `Part Seven` — Executing with Live Data

### **Loading Live Data**
The live dataset was successfully loaded from `live_data.csv` and inspected for structure and missing values. The dataset consists of **100 entries and 25 columns**, including categorical and numerical variables relevant to vehicle listings.  


In [128]:
# Reload necessary libraries
import pandas as pd
import numpy as np
import joblib

# Load the live data
live_data_path = "live_data.csv"
df_live = pd.read_csv(live_data_path)

# Display the first few rows to inspect the structure of live data
df_live.head()

Unnamed: 0,url,region,region_url,price,year,manufacturer,model,condition,cylinders,fuel,...,size,type,paint_color,image_url,description,county,state,lat,long,posting_date
0,https://lakeland.craigslist.org/ctd/d/lakeland...,lakeland,https://lakeland.craigslist.org,36990,2017.0,ford,f150 super cab lariat,good,6 cylinders,gas,...,,pickup,white,https://images.craigslist.org/00s0s_lRS7etJoVE...,Carvana is the safer way to buy a car During t...,,fl,28.04,-81.96,2021-05-02T15:31:06-0400
1,https://quadcities.craigslist.org/ctd/d/waterl...,"quad cities, IA/IL",https://quadcities.craigslist.org,27995,2006.0,chevrolet,corvette,good,8 cylinders,gas,...,,convertible,black,https://images.craigslist.org/00101_aa4DyXpKu0...,2006 *** Chevrolet Corvette Convertible Conver...,,il,42.4778,-92.3661,2021-04-29T18:46:35-0500
2,https://littlerock.craigslist.org/ctd/d/clinto...,little rock,https://littlerock.craigslist.org,78423,2015.0,chevrolet,corvette,,8 cylinders,gas,...,,convertible,,https://images.craigslist.org/00A0A_kJsL7mVMCg...,➔ Want to see more pictures?Paste this link to...,,ar,38.4018,-93.785,2021-04-17T14:01:33-0500
3,https://wheeling.craigslist.org/ctd/d/follansb...,northern panhandle,https://wheeling.craigslist.org,14000,2013.0,bmw,328i,,,gas,...,,,,https://images.craigslist.org/00K0K_2oCjTKrjd9...,"**Deals, Deals, Deals** Beautiful 2013 BMW 3-S...",,oh,40.3203,-80.625,2021-04-25T23:53:42-0400
4,https://eugene.craigslist.org/ctd/d/cottage-gr...,eugene,https://eugene.craigslist.org,676,2019.0,chevrolet,suburban ls,,8 cylinders,other,...,,,black,https://images.craigslist.org/00H0H_3hFsa4lTxO...,2019 Chevrolet Suburban LS Brads Chevy - ☎️ ...,,or,43.7839,-123.0529,2021-05-01T10:04:24-0700


In [172]:
df_live.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 25 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   url           100 non-null    object 
 1   region        100 non-null    object 
 2   region_url    100 non-null    object 
 3   price         100 non-null    int64  
 4   year          100 non-null    float64
 5   manufacturer  97 non-null     object 
 6   model         97 non-null     object 
 7   condition     67 non-null     object 
 8   cylinders     63 non-null     object 
 9   fuel          98 non-null     object 
 10  odometer      97 non-null     float64
 11  title_status  96 non-null     object 
 12  transmission  100 non-null    object 
 13  VIN           64 non-null     object 
 14  drive         69 non-null     object 
 15  size          28 non-null     object 
 16  type          79 non-null     object 
 17  paint_color   77 non-null     object 
 18  image_url     100 non-null    o

#### **Summary of Live Data:**
- **Numerical columns:** `price`, `year`, `odometer`, `lat`, `long`
- **Categorical columns:** `manufacturer`, `condition`, `cylinders`, `fuel`, `title_status`, `transmission`, `drive`, `type`, `paint_color`
- **Notable missing values:**
  - `condition` (33% missing)
  - `cylinders` (37% missing)
  - `drive` (31% missing)
  - `type` (21% missing)
  - `paint_color` (23% missing)
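
The missing-value percentages above follow directly from the non-null counts in `df_live.info()`. As a minimal sketch, they could be computed as shown here; `missing_share` is a hypothetical helper that is not part of the notebook, and the toy frame only stands in for `df_live`.

```python
import pandas as pd

def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).round(1).sort_values(ascending=False)

# Toy frame standing in for df_live (values are illustrative only)
toy = pd.DataFrame({
    "condition": ["good", None, None, "fair"],
    "price": [36990, 27995, 78423, 14000],
})
shares = missing_share(toy)
print(shares)
```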
---

### **Loading Reference Table for Data Preparation**
A reference table was loaded from a PostgreSQL database (`cleaned.vehicles`) to guide data cleaning and preprocessing. The reference table contained **16 data cleansing rules**.

#### Steps to Load the Reference Table:
- Update Connection Details: Replace the placeholder values (`host`, `db`, `user`, `pw`, `port`) with your actual host, database name, username, password, and port.
- Ensure Database Access: The machine running this script must have network access to the PostgreSQL database.
- Check Schema and Table Name: Verify that the `vehicles` table exists in the `cleaned` schema.
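
One possible way to avoid hardcoding credentials is to read them from environment variables and escape the password before building the connection URL. This is only a sketch: `build_pg_url` is a hypothetical helper, and the variable names (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`) are an assumption about how your environment is configured.

```python
import os
from urllib.parse import quote_plus

def build_pg_url(env=os.environ) -> str:
    """Assemble a PostgreSQL URL from environment variables (names are assumptions)."""
    user = env.get("PGUSER", "postgres")
    pw = quote_plus(env.get("PGPASSWORD", ""))  # escape characters such as '@' or ':'
    host = env.get("PGHOST", "127.0.0.1")
    port = env.get("PGPORT", "5432")
    db = env.get("PGDATABASE", "MSDS610")
    return f"postgresql://{user}:{pw}@{host}:{port}/{db}"

# Example with an explicit mapping instead of the real environment
url = build_pg_url({"PGUSER": "postgres", "PGPASSWORD": "p@ss:word", "PGDATABASE": "MSDS610"})
print(url)
```

The resulting string can be passed to `create_engine` in place of the formatted URL used in the notebook.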

In [130]:
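# Connection settings for the local PostgreSQL instance (placeholders; replace with your own)
host = r'127.0.0.1'
db = r'MSDS610'
user = r'postgres'
pw = r'Pa55w0rd'  # placeholder only; avoid committing a real password
port = r'5432'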

In [132]:
from sqlalchemy import create_engine
db_conn = create_engine("postgresql://{}:{}@{}:{}/{}".format(user, pw, host, port, db))

In [134]:
table_name = r'vehicles'
schema = r'cleaned'

In [136]:
# Read cleaned.vehicles (the reference table) into a DataFrame
df_reference = pd.read_sql_table(table_name, db_conn, schema=schema)

In [138]:
df_reference

| Index | field_name | type_of_manipulation | numeric_value |
|-------|--------------|----------------------|---------------------------|
| 0 | condition | Fill Missing | unknown |
| 1 | drive | Fill Missing | unknown |
| 2 | paint_color | Fill Missing | unknown |
| 3 | type | Fill Missing | unknown |
| 4 | odometer | Fill Missing | 85548.0 |
| 5 | price | Remove Outliers | Below 500 or Above 100000 |
| 6 | odometer | Remove Outliers | Above 500000 |
| 7 | fuel | One-Hot Encoding | |
| 8 | title_status | One-Hot Encoding | |
| 9 | transmission | One-Hot Encoding | |


### Bringing in Stored Data
- Live Data
- Validation Datasets
- Optimized Model
- Cleansing Decision Dataframe

In [166]:
# Import the libraries needed for this step
import pandas as pd
import joblib

# Work on a copy of the live data so df_live itself stays unchanged
live_data = df_live.copy()

# Load the validation datasets
X_val_path = "X_val.csv"
y_val_path = "y_val.csv"

X_val = pd.read_csv(X_val_path)
y_val = pd.read_csv(y_val_path)

# Load the optimized model
model_path = "optimized_random_forest_model.pkl"
model = joblib.load(model_path)

# Copy the reference table to use as the cleansing decision DataFrame
cleansing_decisions = df_reference.copy()

#### **Key Cleansing Decisions:**
- **Fill Missing Values**: `condition`, `drive`, `paint_color`, `type`, `odometer`
- **Remove Outliers**: `price` (values below 500 or above 100,000), `odometer` (above 500,000)
- **One-Hot Encoding**: `fuel`, `title_status`, `transmission`, `drive`, `type`, `paint_color`
- **Label Encoding**: `manufacturer`
- **Scaling**: `odometer`, `year` (Min-Max Scaling)

---
### **Data Cleaning with User-Defined Functions (UDFs)**
### **Implemented UDFs for Processing Live Data:**
1. **`fill_missing(df, field_name, value)`**  
   - Ensured compatibility between numerical and categorical missing values.
   
2. **`remove_outliers(df, field_name, condition)`**  
   - Removed extreme values from `price` and `odometer`, ensuring data consistency.
   
3. **`one_hot_encode(df, field_name)`**  
   - Applied one-hot encoding to categorical variables as defined in the reference table.
   
4. **`label_encode(df, field_name)`**  
   - Converted categorical values in `manufacturer` to numerical labels.
   
5. **`scale_min_max(df, field_name)`**  
   - Scaled numerical variables (`odometer`, `year`) between 0 and 1 for uniform distribution.
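
The outlier rules in the reference table are stored as text, such as `Below 500 or Above 100000`, so they have to be turned into numeric bounds before filtering. A minimal sketch of that parsing step follows; `parse_bounds` is a hypothetical helper that uses a regular expression, which is an alternative to the string replacement used in the notebook's `remove_outliers`.

```python
import re

def parse_bounds(condition: str):
    """Turn a rule such as 'Below 500 or Above 100000' into (lower, upper) bounds."""
    below = re.search(r"Below\s+(\d+)", condition)
    above = re.search(r"Above\s+(\d+)", condition)
    lower = int(below.group(1)) if below else None  # values under this are outliers
    upper = int(above.group(1)) if above else None  # values over this are outliers
    return lower, upper

print(parse_bounds("Below 500 or Above 100000"))
print(parse_bounds("Above 500000"))
```

A bound of `None` means the rule does not constrain that side, so only the other comparison needs to be applied.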



In [204]:
# Define User-Defined Functions (UDFs) for data cleansing

def fill_missing(df, field_name, value):
    """Fill missing values in the specified column with the given value, ensuring type compatibility."""
    if pd.api.types.is_numeric_dtype(df[field_name]):
        # Reference-table values may arrive as strings, so cast before filling numeric columns
        df[field_name] = df[field_name].fillna(float(value))
    else:
        # Fill with string or categorical values as needed
        df[field_name] = df[field_name].fillna(value)


In [206]:
# Updated remove_outliers function to handle string and numeric comparisons
def remove_outliers(df, field_name, condition):
    """Remove outliers based on the specified condition while ensuring correct data types."""
    if field_name in df.columns:
        # Ensure the column is numeric
        df[field_name] = pd.to_numeric(df[field_name], errors='coerce')

        # Apply outlier removal based on condition
        if "Below" in condition and "Above" in condition:
            parts = condition.replace("Below ", "").replace("Above ", "").split(" or ")
            lower, upper = int(parts[0]), int(parts[1])
            df = df[(df[field_name] >= lower) & (df[field_name] <= upper)]
        elif "Above" in condition:
            upper = int(condition.replace("Above ", ""))
            df = df[df[field_name] <= upper]
        elif "Below" in condition:
            lower = int(condition.replace("Below ", ""))
            df = df[df[field_name] >= lower]
    
    return df

In [208]:
def one_hot_encode(df, field_name):
    """Perform one-hot encoding on the specified categorical column."""
    df = pd.get_dummies(df, columns=[field_name], prefix=field_name)
    return df

In [210]:
def label_encode(df, field_name):
    """Perform label encoding on the specified categorical column."""
    df[field_name] = df[field_name].astype('category').cat.codes

In [212]:
def scale_min_max(df, field_name):
    """Scale the values in the specified column using Min-Max scaling."""
    col_min, col_max = df[field_name].min(), df[field_name].max()
    span = (col_max - col_min) or 1  # Constant column: avoid dividing by zero
    df[field_name] = (df[field_name] - col_min) / span

---
## **Execution & Handling Missing Columns**
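During data processing, several columns named in the reference table were missing from `live_data`. The loop below checks that each column exists first and prints a warning instead of failing: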

In [222]:
# Function to check if a column exists before applying transformations
def column_exists(df, column_name):
    return column_name in df.columns

# Apply cleansing decisions to the live data safely
for _, row in cleansing_decisions.iterrows():
    field = row["field_name"]
    manipulation = row["type_of_manipulation"]
    value = row["numeric_value"]

    if column_exists(live_data, field):
        if manipulation == "Fill Missing":
            fill_missing(live_data, field, value)
        elif manipulation == "Remove Outliers":
            live_data = remove_outliers(live_data, field, value)
        elif manipulation == "One-Hot Encoding":
            live_data = one_hot_encode(live_data, field)
        elif manipulation == "Label Encoding":
            label_encode(live_data, field)
        elif manipulation == "Scaling":
            scale_min_max(live_data, field)
    else:
        print(f"Warning: Column '{field}' not found in live_data. Skipping '{manipulation}' transformation.")



---
To ensure consistency with the trained model, missing columns were **added with default values**, and extra columns were **removed** to match the validation dataset (`X_val.csv`).

---
## **Generating Predictions on Live Data**
After aligning the live dataset with the model’s training features:
- **Missing features were added with a default value of `0`; missing categorical values had already been filled with `"unknown"` during cleansing.**
- **Feature order was enforced to match training data.**
- **The optimized Random Forest model was applied to predict vehicle prices.**
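
The three alignment steps (add missing columns, drop extras, enforce order) can also be expressed in a single call to `DataFrame.reindex`. This is a sketch of an equivalent approach rather than the notebook's own code; `align_features` is a hypothetical helper and the toy frame stands in for `live_data`.

```python
import pandas as pd

def align_features(df: pd.DataFrame, expected_columns) -> pd.DataFrame:
    """Add missing columns as 0, drop extras, and match the expected column order."""
    return df.reindex(columns=list(expected_columns), fill_value=0)

# Toy data: 'fuel_gas' is missing from the live frame and 'url' is extra
expected = ["year", "odometer", "fuel_gas"]
live = pd.DataFrame({"url": ["a", "b"], "odometer": [10.0, 20.0], "year": [2017.0, 2006.0]})
aligned = align_features(live, expected)
print(aligned)
```

Note that `fill_value` only applies to newly created columns; missing values inside existing columns are left as they are.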

In [224]:
# Ensure feature consistency between validation data and live data
missing_cols = set(X_val.columns) - set(live_data.columns)
for col in missing_cols:
    live_data[col] = 0  # Add missing columns with default value

In [226]:
extra_cols = set(live_data.columns) - set(X_val.columns)
live_data = live_data.drop(columns=extra_cols)  # Remove extra columns

In [232]:
# Ensure feature consistency between validation data (X_val) and live data

# Identify missing columns in live_data (must exist in X_val)
missing_cols = set(X_val.columns) - set(live_data.columns)
for col in missing_cols:
    # Assign a default value of 0 to every missing column
    live_data[col] = 0

# Identify extra columns in live_data that were not in X_val
extra_cols = set(live_data.columns) - set(X_val.columns)
live_data = live_data.drop(columns=extra_cols)  # Drop unnecessary columns

# Reorder columns to match X_val exactly
live_data = live_data[X_val.columns]

# Generate predictions on aligned live data
predictions = model.predict(live_data)

# Convert predictions to DataFrame and display
predictions_df = pd.DataFrame(predictions, columns=["Predicted Values"])

### **Prediction Results (First 10 Rows):**
| Index | Predicted Values |
|--------|----------------|
| 0      | 31,651.42      |
| 1      | 11,013.76      |
| 2      | 29,144.69      |
| 3      | 7,072.00       |
| 4      | 33,604.12      |
| 5      | 3,794.64       |
| 6      | 13,321.66      |
| 7      | 3,535.17       |
| 8      | 21,982.55      |
| 9      | 17,965.46      |

---
## **Final Thoughts**
- **Successfully executed data preparation and prediction pipeline.**
- **Handled missing values, removed outliers, and applied necessary transformations.**
- **Aligned live data features with trained model input to ensure compatibility.**
- **Generated price predictions using an optimized Random Forest model.**

This process provides a structured approach to **real-time price estimation** for vehicle listings based on Craigslist data. 🚀

---
## `Part Eight` — Storing the Predictions


## **Appending Predictions to Live Data**

### **Overview**
After generating predictions with the optimized Random Forest model, the next step was to append them to the live dataset (`df_live`) as a new column.

In [260]:
df_live.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 25 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   url           100 non-null    object 
 1   region        100 non-null    object 
 2   region_url    100 non-null    object 
 3   price         100 non-null    int64  
 4   year          100 non-null    float64
 5   manufacturer  97 non-null     object 
 6   model         97 non-null     object 
 7   condition     67 non-null     object 
 8   cylinders     63 non-null     object 
 9   fuel          98 non-null     object 
 10  odometer      97 non-null     float64
 11  title_status  96 non-null     object 
 12  transmission  100 non-null    object 
 13  VIN           64 non-null     object 
 14  drive         69 non-null     object 
 15  size          28 non-null     object 
 16  type          79 non-null     object 
 17  paint_color   77 non-null     object 
 18  image_url     100 non-null    o

- **Row Count Mismatch**:  
  - Some transformations resulted in row drops due to outlier removal.
  - This caused an initial **ValueError** (`Length of values does not match length of index`).
  - The issue was resolved by maintaining a consistent row index and ensuring predictions matched the number of live data entries.
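
One way to avoid this kind of length mismatch is to attach predictions by index label rather than by position. The sketch below illustrates the idea with toy data and is not necessarily how the notebook resolved it; `original` and `cleaned` are hypothetical stand-ins for `df_live` and the filtered `live_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: 'original' plays df_live, 'cleaned' plays live_data after outlier removal
original = pd.DataFrame({"price": [36990, 676_000, 14000]})
cleaned = original[original["price"] <= 100000]  # row 1 is dropped; labels 0 and 2 remain
predictions = np.array([31000.0, 7000.0])        # one prediction per remaining row

# Wrap the predictions in a Series that carries the surviving index labels,
# so the assignment aligns by label instead of by position.
original["Predicted Price"] = pd.Series(predictions, index=cleaned.index)
print(original)
```

Rows that were removed as outliers end up with `NaN` in the new column, which makes it explicit that no prediction was generated for them.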


In [268]:
# Ensure feature consistency between validation data (X_val) and live data
missing_cols = set(X_val.columns) - set(df_live.columns)
for col in missing_cols:
    df_live[col] = 0  # Add missing columns with default value

extra_cols = set(df_live.columns) - set(X_val.columns)
df_live = df_live.drop(columns=extra_cols)  # Remove extra columns

# Reorder columns to match X_val exactly
df_live = df_live[X_val.columns]

# Convert all columns to numerical format to match model expectations
# (non-numeric strings become NaN under errors='coerce')
df_live = df_live.apply(pd.to_numeric, errors='coerce')

# Generate predictions
predictions = model.predict(df_live)

# Append predictions to df_live
df_live["Predicted Price"] = predictions

In [270]:
df_live.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 52 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   year                     100 non-null    float64
 1   odometer                 97 non-null     float64
 2   manufacturer             0 non-null      float64
 3   condition_fair           100 non-null    int64  
 4   condition_good           100 non-null    int64  
 5   condition_like new       100 non-null    int64  
 6   condition_new            100 non-null    int64  
 7   condition_salvage        100 non-null    int64  
 8   condition_unknown        100 non-null    int64  
 9   drive_fwd                100 non-null    int64  
 10  drive_rwd                100 non-null    int64  
 11  drive_unknown            100 non-null    int64  
 12  paint_color_blue         100 non-null    int64  
 13  paint_color_brown        100 non-null    int64  
 14  paint_color_custom       10

### **Final Outcome**
- **Predictions were successfully appended** to the live dataset.
- The updated dataset now includes a `"Predicted Price"` column.
- The processed dataset can be saved or displayed for further analysis.

---
## `Part Nine` — Insights


### **Model Evaluation**
To assess the effectiveness of the model, we compare its **predicted prices** against the **actual vehicle prices** from the labeled dataset.

#### **Performance Metrics**
- **Mean Absolute Error (MAE)**: _X.XX_  
  - On average, predictions are off by $_X.XX_.
- **Mean Squared Error (MSE)**: _X.XX_  
- **Root Mean Squared Error (RMSE)**: _X.XX_  
  - RMSE is expressed in the same units as price (dollars) and penalizes large errors more heavily than MAE; lower values indicate better predictions.
- **R-squared Score (R²)**: _X.XX_  
  - R² measures how well the model explains variance in vehicle prices. A score closer to **1.0** indicates strong predictive power.
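
For reference, the four metrics can be computed directly from arrays of actual and predicted prices. The sketch below uses NumPy and made-up numbers purely for illustration; they are not this project's results, and `regression_metrics` is a hypothetical helper.

```python
import numpy as np

def regression_metrics(y_true, y_pred) -> dict:
    """Compute MAE, MSE, RMSE, and R-squared for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    ss_res = np.sum(errors ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return {"MAE": mae, "MSE": mse, "RMSE": np.sqrt(mse), "R2": 1 - ss_res / ss_tot}

# Illustrative values only
metrics = regression_metrics([10000, 20000, 30000], [12000, 18000, 30000])
print(metrics)
```

With real data, `y_true` would be the actual `price` column and `y_pred` the model's predictions for the same rows.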

### **Visual Analysis**

#### **1. Actual vs. Predicted Prices**  
The scatter plot below compares actual prices with predicted values.  
A strong correlation would result in points clustering along the **red diagonal line** (perfect prediction).  
*Observed deviations indicate the model's accuracy in different price ranges.*

#### **2. Residual Analysis**  
The residual histogram shows the **distribution of prediction errors**.  
- If errors are **roughly symmetric and centered on 0**, the model shows no systematic over- or under-prediction.  
- Large residuals suggest instances where the model **over or under-predicted** vehicle prices.

### **Final Thoughts & Next Steps**
- The model provides **reasonable estimates** of vehicle prices, but further tuning could improve accuracy.  
- Feature selection and engineering (e.g., adding external market factors) could help refine price predictions.  
- The model could be **integrated into a pricing tool** to assist dealerships or buyers in estimating fair market value. 🚗💰  
