**Boat Listing Conversion Propensity Model**

OBJECTIVE

Build an interpretable classification model to predict the probability that a boat listing will generate a lead in a marine marketplace.

This model will demonstrate how internal machine learning models can support:
- Listing ranking
- Sales prioritization
- Pricing optimization
- Inventory performance monitoring

This project emphasizes:
- Feature engineering
- Model interpretability
- Business-aligned evaluation
- Clear documentation of both assumptions and limitations

In [None]:
### Imports & Environment Setup ###

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import (
    roc_auc_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix
)

import matplotlib.pyplot as plt
import seaborn as sns

# reproducibility

np.random.seed(42)

In [None]:
### Generate Synthetic Marketplace Data ###

n = 5000 # Number of listings in the marketplace

df = pd.DataFrame({
    "price": np.random.normal(80000, 25000, n).clip(10000, 300000),
    "length_ft": np.random.normal(28, 8, n).clip(10, 80),
    "year": np.random.randint(1995, 2024, n),

    "boat_type": np.random.choice(
        ["center_console", "bowrider", "pontoon", "cabin_cruiser", "sportfish"],
        n
    ),

    "condition": np.random.choice(
        ["new", "used", "salvage"],
        n,
        p = [0.15, 0.82, 0.03]

    ),

    "dealer_type": np.random.choice(
        ["dealer", "private"],
        n,
        p = [0.6, 0.4]
    ),

    "seller_rating": np.round(
        np.random.normal(4.2, 0.5, n).clip(1.0, 5.0),
        2
    ),

    "days_on_site": np.random.randint(1, 180, n),

    "views_7d": np.random.poisson(50, n),
    "saves_7d": np.random.poisson(5, n),
    "inquiries_7d": np.random.poisson(2, n),

    "num_photos": np.random.randint(3, 40, n)

})

df.head()


Unnamed: 0,price,length_ft,year,boat_type,condition,dealer_type,seller_rating,days_on_site,views_7d,saves_7d,inquiries_7d,num_photos
0,92417.853825,24.609923,2004,center_console,used,dealer,4.57,8,50,6,2,10
1,76543.392471,24.372687,2018,bowrider,used,dealer,4.07,93,46,7,2,31
2,96192.213453,13.634855,2011,sportfish,used,private,4.2,165,47,3,4,7
3,118075.74641,25.359278,2015,cabin_cruiser,used,dealer,4.22,91,61,6,4,21
4,74146.165632,33.862633,1996,cabin_cruiser,used,private,3.94,165,62,1,1,37


In [None]:
### Feature Engineering ###

CURRENT_YEAR = 2026

# Boat Age
df["boat_age"] = CURRENT_YEAR - df["year"]

# Price Per Foot (Relative Value)
df["price_per_foot"] = df["price"] / df["length_ft"]

# Engagement Intensity
df["engagement_rate"] = (
    df["saves_7d"] + df["inquiries_7d"]
) / (df["views_7d"] + 1)

# Dealer Indicator (binary)
df["is_dealer"] = (df["dealer_type"] == "dealer").astype(int)

# Log Transform For Days On Site
df["log_days_on_site"] = np.log1p(df["days_on_site"])

df.head()

Unnamed: 0,price,length_ft,year,boat_type,condition,dealer_type,seller_rating,days_on_site,views_7d,saves_7d,inquiries_7d,num_photos,boat_age,price_per_foot,engagement_rate,is_dealer,log_days_on_site
0,92417.853825,24.609923,2004,center_console,used,dealer,4.57,8,50,6,2,10,22,3755.308602,0.156863,1,2.197225
1,76543.392471,24.372687,2018,bowrider,used,dealer,4.07,93,46,7,2,31,8,3140.539738,0.191489,1,4.543295
2,96192.213453,13.634855,2011,sportfish,used,private,4.2,165,47,3,4,7,15,7054.876355,0.145833,0,5.111988
3,118075.74641,25.359278,2015,cabin_cruiser,used,dealer,4.22,91,61,6,4,21,11,4656.116165,0.16129,1,4.521789
4,74146.165632,33.862633,1996,cabin_cruiser,used,private,3.94,165,62,1,1,37,30,2189.6161,0.031746,0,5.111988


In [None]:
### Create Conversion Target ###

# Linear combination of the features assumed to drive conversion (synthetic ground truth)
logit = (
    3.0 * df["engagement_rate"]
    - 0.8 * (df["price_per_foot"] / df["price_per_foot"].mean())
    + 0.25 * df["is_dealer"]
    + 0.3 * (df["seller_rating"] - 4.0)
    - 0.01 * df["days_on_site"]
    + 0.02 * df["num_photos"]
)

# Convert to probability using logistic function
prob = 1 / (1 + np.exp(-logit))

# Generate binary outcome
df["conversion"] = np.random.binomial(1, prob)

# Check conversion rate
df["conversion"].mean()

np.float64(0.364)
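
Written out, the synthetic target generated in the cell above corresponds to the following model. The symbols are notation introduced here for illustration, not names from the code: $g_i$ is `engagement_rate`, $f_i$ is `price_per_foot` with sample mean $\bar{f}$, $d_i$ is `is_dealer`, $r_i$ is `seller_rating`, $t_i$ is `days_on_site`, and $m_i$ is `num_photos`.

$$
\begin{aligned}
z_i &= 3.0\,g_i - 0.8\,\frac{f_i}{\bar{f}} + 0.25\,d_i + 0.3\,(r_i - 4.0) - 0.01\,t_i + 0.02\,m_i \\
p_i &= \frac{1}{1 + \exp(-z_i)} \\
y_i &\sim \mathrm{Bernoulli}(p_i)
\end{aligned}
$$

Note that `boat_type`, `condition`, and `boat_age` do not appear in $z_i$, and that time on site enters as raw days rather than as `log_days_on_site`.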

In [None]:
### Train / Test Split ###

# Define Target
target = "conversion"

# Define numeric and categorical features
numeric_features = [
    "price_per_foot",
    "boat_age",
    "engagement_rate",
    "log_days_on_site",
    "seller_rating",
    "num_photos",
    "is_dealer"
]

categorical_features = [
    "boat_type",
    "condition"
]

x = df[numeric_features + categorical_features]
y = df[target]

# Stratified split to preserve conversion ratio
X_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    test_size = 0.2,
    random_state = 42,
    stratify = y
)

print("Train size", X_train.shape)
print("Test size", x_test.shape)
print("Test conversion rate:", y_test.mean())

Train size (4000, 9)
Test size (1000, 9)
Test conversion rate: 0.364


In [None]:
### Preprocessing + Model Pipeline ###

# Preprocessing:
# -- Numeric features will be standardized (zero mean, unit variance)
# -- Categorical features will be one-hot encoded

preprocess = ColumnTransformer(
    transformers = [
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown = "ignore"), categorical_features),
    ]
)

# Full pipeline: preprocessing + logistic regression
model = Pipeline(steps = [
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter = 2000))
])

# Training The Model
model.fit(X_train, y_train)

print("Model training completed.")

Model training completed.


In [None]:
### Model Evaluation ###

# Get predicted probabilities
y_pred_proba = model.predict_proba(x_test)[:, 1]

# Convert probabilities to class predictions using threshold 0.5
threshold = 0.5
y_pred_class = (y_pred_proba >= threshold).astype(int)

# Metrics
auc = roc_auc_score(y_test, y_pred_proba)
precision = precision_score(y_test, y_pred_class)
recall = recall_score(y_test, y_pred_class)
f1 = f1_score(y_test, y_pred_class)
cm = confusion_matrix(y_test, y_pred_class)

print("ROC_AUC:", round(auc, 3))
print("Precision:", round(precision, 3))
print("Recall:", round(recall, 3))
print("F1 Score:", round(f1, 3))
print("Confusion Matrix:")
print(cm)

ROC_AUC: 0.673
Precision: 0.555
Recall: 0.236
F1 Score: 0.331
Confusion Matrix:
[[567  69]
 [278  86]]
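
The reported precision, recall, and F1 follow directly from the confusion matrix above. As a quick check, the arithmetic can be reproduced in plain Python. The variable names are illustrative, and the layout assumes scikit-learn's convention of rows as true classes and columns as predicted classes.

```python
# Confusion matrix from the evaluation output, in scikit-learn's layout:
# rows = true class (0, 1), columns = predicted class (0, 1).
tn, fp = 567, 69
fn, tp = 278, 86

total = tn + fp + fn + tp

precision = tp / (tp + fp)          # share of flagged listings that converted
recall = tp / (tp + fn)             # share of converting listings that were flagged
f1 = 2 * tp / (2 * tp + fp + fn)    # harmonic mean of precision and recall
flagged_share = (tp + fp) / total   # share of listings predicted as converters
base_rate = (tp + fn) / total       # share of listings that actually converted

print("Precision:", round(precision, 3))
print("Recall:", round(recall, 3))
print("F1 Score:", round(f1, 3))
print("Flagged share:", round(flagged_share, 3))
print("Base rate:", round(base_rate, 3))
```

At the 0.5 cutoff the model flags 15.5% of test listings while 36.4% convert, which is why recall is low relative to precision.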


In [None]:
### Extract Model Coefficients ###

# Get One-hot encoded feature names
ohe = model.named_steps["preprocess"].named_transformers_["cat"]
cat_feature_names = ohe.get_feature_names_out(categorical_features)

# Combine numeric and encoded categorical names
all_feature_names = numeric_features + list(cat_feature_names)

# Get coefficients
coefficients = model.named_steps["clf"].coef_[0]

coef_df = pd.DataFrame({
    "feature": all_feature_names,
    "coefficient": coefficients
}).sort_values(by = "coefficient", ascending = False)

coef_df.head(15)

Unnamed: 0,feature,coefficient
5,num_photos,0.223538
2,engagement_rate,0.139928
4,seller_rating,0.126661
6,is_dealer,0.104714
8,boat_type_cabin_cruiser,0.027851
10,boat_type_pontoon,0.009607
1,boat_age,-0.03559
13,condition_salvage,-0.049415
7,boat_type_bowrider,-0.086377
11,boat_type_sportfish,-0.11212


**Business Interpretation & Key Drivers**
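- Engagement intensity carries the largest raw weight in the synthetic target (3.0), but because `engagement_rate` varies over a narrow range, its standardized coefficient (0.140) ranks second among the positive coefficients shown, behind `num_photos` (0.224).
- Seller reputation (`seller_rating`, 0.127) and dealer affiliation (`is_dealer`, 0.105) are both positively associated with conversion, consistent with buyer trust.
- Longer time on site reduces conversion likelihood by construction: the synthetic target includes a -0.01 per-day term on `days_on_site`.
- Boat type coefficients are small (absolute value at most 0.11), and boat type does not enter the synthetic target, so these differences reflect sampling noise rather than real category effects.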
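
Because the numeric features are standardized in the pipeline, each coefficient can be read as a change in log-odds per one-standard-deviation increase in the feature. A minimal sketch of converting the coefficients from the table above into odds ratios follows. The dictionary is copied by hand from that output, and its name is illustrative.

```python
import math

# Standardized coefficients copied from the coefficient table (numeric features only).
coefficients = {
    "num_photos": 0.223538,
    "engagement_rate": 0.139928,
    "seller_rating": 0.126661,
    "is_dealer": 0.104714,
    "boat_age": -0.03559,
}

# exp(coefficient) is the multiplicative change in the odds of conversion
# for a one-standard-deviation increase in the feature, holding others fixed.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<16} odds ratio per +1 SD: {ratio:.3f}")
```

An odds ratio above 1 means the feature raises the odds of conversion and a value below 1 lowers them. Note that `is_dealer` is also standardized here, so its ratio is per standard deviation rather than dealer versus private.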

**Operational Applications**
- Prioritize high-probability listings for sales outreach.
- Rank listings dynamically based on predicted conversion likelihood.
- Flag stale inventory for pricing review.
- Use engagement metrics to identify listings that are attracting high-intent buyers.

**Operational Use & Threshold Strategy**

This model outputs a probability score for each listing.

The classification threshold should align with business objectives:
- If sales resources are limited (e.g., can only prioritize 20% of listings), the threshold should be adjusted to maximize precision and reduce false positives.
- If the goal is to identify all potential high-converting listings for ranking optimization, recall may be prioritized instead.

For internal prioritization, a dynamic threshold (e.g., top decile by predicted probability) may be more appropriate than a 0.5 cutoff.
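
A rank-based cutoff like the top decile can be sketched as follows. This is a minimal illustration, not the notebook's method: the helper names `top_fraction_mask` and `precision_at_fraction` are hypothetical, and the scores and outcomes are simulated so the block runs on its own.

```python
import numpy as np

def top_fraction_mask(scores, fraction=0.10):
    """Boolean mask selecting the highest-scoring `fraction` of listings."""
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(round(fraction * scores.size)))
    # Indices of the k largest scores; ties are resolved by sort order.
    top_idx = np.argsort(scores)[::-1][:k]
    mask = np.zeros(scores.size, dtype=bool)
    mask[top_idx] = True
    return mask

def precision_at_fraction(y_true, scores, fraction=0.10):
    """Conversion rate among the listings selected by top_fraction_mask."""
    mask = top_fraction_mask(scores, fraction)
    return float(np.asarray(y_true)[mask].mean())

# Simulated scores and outcomes, for illustration only.
rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, 1000)
y_true = rng.binomial(1, scores)  # higher scores convert more often

mask = top_fraction_mask(scores, 0.10)
top_precision = precision_at_fraction(y_true, scores, 0.10)
overall_rate = float(y_true.mean())

print("Listings selected:", int(mask.sum()))
print("Precision in top decile:", round(top_precision, 3))
print("Overall conversion rate:", round(overall_rate, 3))
```

With the fitted pipeline, `scores` would be `y_pred_proba` and `y_true` would be `y_test`. The selected share then matches sales capacity directly, regardless of how the probabilities are calibrated.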

**Model Limitations & Assumptions**
- This analysis uses synthetic data for demonstration purposes; model performance metrics are illustrative rather than production-validated.
- Engagement features may introduce temporal leakage depending on when prediction occurs (e.g., using 7-day engagement signals for early-stage listings). In particular, `inquiries_7d` may overlap with the lead outcome being predicted.
- Logistic regression assumes linear relationships in log-odds space; nonlinear interactions may not be fully captured.
- Feature distributions, seasonality, and buyer behavior can shift over time; in production, both inputs and model performance should be monitored for drift in engagement patterns, pricing trends, and inventory mix.
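
One common way to monitor input drift is the population stability index (PSI), which compares a feature's binned distribution in a baseline sample against a recent sample. The sketch below is an assumption about how such monitoring could look, not part of the notebook: the function name, bin count, and simulated data are illustrative.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a baseline sample and a recent sample of one numeric feature."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Bin edges from baseline quantiles; duplicates are dropped for discrete features.
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1)))
    interior = edges[1:-1]  # out-of-range values fall into the first or last bin

    n_cells = len(interior) + 1
    exp_counts = np.bincount(np.searchsorted(interior, expected, side="right"), minlength=n_cells)
    act_counts = np.bincount(np.searchsorted(interior, actual, side="right"), minlength=n_cells)

    # Clip shares away from zero so the logarithm stays finite.
    exp_share = np.clip(exp_counts / exp_counts.sum(), eps, None)
    act_share = np.clip(act_counts / act_counts.sum(), eps, None)

    return float(np.sum((act_share - exp_share) * np.log(act_share / exp_share)))

# Simulated weekly view counts, for illustration only.
rng = np.random.default_rng(1)
baseline = rng.poisson(50, 5000)
recent_same = rng.poisson(50, 5000)     # same distribution as the baseline
recent_shifted = rng.poisson(65, 5000)  # engagement has shifted upward

psi_same = population_stability_index(baseline, recent_same)
psi_shifted = population_stability_index(baseline, recent_shifted)

print("PSI, no shift:", round(psi_same, 4))
print("PSI, shifted:", round(psi_shifted, 4))
```

A PSI near zero indicates a stable distribution. Commonly cited rules of thumb treat values above roughly 0.25 as a notable shift, though any alert cutoff is an assumption that should be validated against real data.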