# Regression Model Comparison

This project compares the performance of various regression models to predict a target variable (e.g., net worth) using image embeddings or other features. The models evaluated include:

- Linear Regression
- Ridge Regression
- Lasso Regression
- ElasticNet Regression
- Polynomial Regression (degree=2)

## Dataset

Assumes feature matrix `X` and log-transformed target `y` are available (e.g., image-based features and net worth in millions of USD). The cells below load them from a Qdrant collection and apply the transformation.


In [120]:
from qdrant_client import QdrantClient
import numpy as np
import pandas as pd

client = QdrantClient(host="localhost", port=6333)
COLLECTION_NAME = "celebrity_faces"

# Retrieve all vectors and payloads, following pagination so that
# collections with more than one page of points are read completely
vectors = []
net_worths = []
offset = None

while True:
    # scroll returns (points, next_page_offset)
    points, offset = client.scroll(
        collection_name=COLLECTION_NAME,
        with_payload=True,
        with_vectors=True,
        limit=1000,  # page size
        offset=offset,
    )
    for point in points:
        vectors.append(point.vector)
        net_worths.append(point.payload["net_worth"])
    if offset is None:  # no further pages
        break

X = np.array(vectors)
y = np.array(net_worths)


In [121]:
y = y / 1000000  # Convert net worth to millions
y = np.log1p(y)  # Apply log1p transformation
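
The transformation above can be checked with a small round trip. This is a minimal sketch using made-up values (the `raw_usd` array is hypothetical, not taken from the collection); it shows that `np.expm1` undoes `np.log1p`, which is why predictions are later mapped back with `expm1`.

```python
import numpy as np

# Hypothetical net worths in raw USD (illustrative values, not from the dataset)
raw_usd = np.array([0.0, 2_500_000.0, 400_000_000.0])

millions = raw_usd / 1_000_000   # same scaling as the cell above
y_log = np.log1p(millions)       # log(1 + x), well defined at x = 0
recovered = np.expm1(y_log)      # inverse of log1p

print(np.allclose(recovered, millions))
```

Using `log1p` rather than `log` keeps a net worth of zero finite, since `log1p(0)` is `0`.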

In [122]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_absolute_error, r2_score


# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Models to compare
models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "Polynomial Regression (deg=2)": make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
}

# Evaluate models
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Inverse transformation for evaluation
    y_pred_lin = np.expm1(y_pred)  # Inverse of log1p
    y_test_lin = np.expm1(y_test)  # Inverse of log1p

    # Metrics
    mae = mean_absolute_error(y_test_lin, y_pred_lin)
    r2 = r2_score(y_test_lin, y_pred_lin)

    print(f"{name}: MAE = {mae:.3f}, R² = {r2:.3f}")


Linear Regression: MAE = 47165.613, R² = -340.114
Ridge: MAE = 3829.742, R² = -0.025
Lasso: MAE = 3884.463, R² = -0.032
ElasticNet: MAE = 3884.392, R² = -0.032
Polynomial Regression (deg=2): MAE = 5219.524, R² = -0.381


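Among the models tested, **Ridge Regression** had the lowest MAE (3829.742, in millions of USD) and the highest R² (-0.025) after inverting the log transformation, so it was selected as the best model. Every R² is negative, however, which means none of the models outperforms simply predicting the mean net worth on the original scale. Unregularized Linear Regression is far worse than the regularized models (R² = -340.114).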
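
One way to put the comparison above in context is to score a constant-prediction baseline next to Ridge, on both the log scale and the original scale. The sketch below is an illustration only: it uses synthetic stand-in data (`X_demo`, `y_demo`) rather than the Qdrant embeddings, and the `evaluate` helper is a hypothetical name introduced here. `DummyRegressor` is scikit-learn's built-in mean predictor.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the real embeddings or net worths)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 16))
y_demo = np.log1p(rng.lognormal(mean=3.0, sigma=1.5, size=200))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

def evaluate(model):
    """Fit on the log target and report metrics on both scales."""
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return {
        "mae_log": mean_absolute_error(y_te, pred),
        "r2_log": r2_score(y_te, pred),
        "mae_lin": mean_absolute_error(np.expm1(y_te), np.expm1(pred)),
        "r2_lin": r2_score(np.expm1(y_te), np.expm1(pred)),
    }

results = {
    "Mean baseline": evaluate(DummyRegressor(strategy="mean")),
    "Ridge": evaluate(Ridge(alpha=1.0)),
}

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

A model whose R² does not clearly exceed the baseline's on either scale is not extracting usable signal from the features. Reporting the log-scale metrics as well helps separate a weak fit from the distortion that a few very large net worths introduce after `expm1`.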

In [123]:
import joblib

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

joblib.dump(model, "app/networth_regressor.pkl")

['app/networth_regressor.pkl']
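
The saved model can later be reloaded with `joblib.load`. The sketch below is a self-contained illustration under stated assumptions: it trains a small Ridge model on synthetic data and writes it to a temporary path instead of `app/networth_regressor.pkl`, so the names `X_demo`, `y_demo`, and `path` are hypothetical. The key step is applying `np.expm1` to the prediction, because the model was trained on the `log1p` target.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in data (assumption: not the real embeddings or net worths)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 8))
y_demo = np.log1p(rng.uniform(1, 500, size=50))  # log1p of net worth in millions

# Save to a temporary location for this example
path = os.path.join(tempfile.mkdtemp(), "networth_regressor.pkl")
joblib.dump(Ridge(alpha=1.0).fit(X_demo, y_demo), path)

# Reload and predict for one feature vector
loaded = joblib.load(path)
new_embedding = X_demo[:1]             # shape (1, n_features)
pred_log = loaded.predict(new_embedding)
pred_millions = np.expm1(pred_log)     # back to millions of USD

print(pred_millions)
```

The input passed to `predict` needs the same number of features as the training matrix, and only pickle files from a trusted source should be loaded with `joblib.load`.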