# Machine Learning Analysis: Predictability of AI Art Popularity

In my earlier EDA and hypothesis testing, individual metadata variables (such as tool, style, platform, and region) did not show statistically significant effects on popularity.  
In this notebook, I test whether **multivariate machine learning models** can extract predictive signal from metadata alone, or whether popularity is fundamentally driven by unobserved factors.

In [27]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix, classification_report, mean_squared_error, r2_score

## Feature Engineering

To enrich the input space without using visual content, I derive additional features from the existing metadata:

- Temporal features (year, month, season, and a recency indicator)
- Cyclic (sine/cosine) encodings of the month to capture seasonal patterns
- Frequency and rarity flags for each categorical column

These features aim to capture potential hidden structure while remaining fully interpretable.

In [28]:
df = pd.read_csv("ai_generated_art_trends_2024.csv")

# ----- DATE FEATURES -----
df["Creation_Date"] = pd.to_datetime(df["Creation_Date"], errors="coerce")
df["Year"]  = df["Creation_Date"].dt.year
df["Month"] = df["Creation_Date"].dt.month

df["is_recent"] = (df["Year"] >= 2023).astype(int)

def month_to_season(m):
    if pd.isna(m): return np.nan
    if m in [12,1,2]: return "Winter"
    if m in [3,4,5]: return "Spring"
    if m in [6,7,8]: return "Summer"
    return "Fall"

df["Season"] = df["Month"].apply(month_to_season)

# cyclic month
df["month_sin"] = np.sin(2*np.pi*df["Month"]/12)
df["month_cos"] = np.cos(2*np.pi*df["Month"]/12)
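
The cyclic encoding above places each month on the unit circle, so that December and January end up as neighbours rather than eleven units apart. The following is a minimal, standalone sketch of that property; the helper `circle_distance` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

months = np.arange(1, 13)
# Same mapping as in the notebook: month -> point on the unit circle
month_sin = np.sin(2 * np.pi * months / 12)
month_cos = np.cos(2 * np.pi * months / 12)
points = np.column_stack([month_sin, month_cos])

def circle_distance(m1, m2):
    """Euclidean distance between two months in (sin, cos) space."""
    return float(np.linalg.norm(points[m1 - 1] - points[m2 - 1]))

dec_jan = circle_distance(12, 1)  # adjacent across the year boundary
jan_feb = circle_distance(1, 2)   # adjacent within the year
jan_jul = circle_distance(1, 7)   # half a year apart

print(f"Dec-Jan: {dec_jan:.3f}, Jan-Feb: {jan_feb:.3f}, Jan-Jul: {jan_jul:.3f}")
```

Adjacent months are equally far apart wherever they fall in the year, which a raw `Month` integer cannot express.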

In [29]:
#-----FREQUENCY/RARITY FEATURES------

cat_cols = ["Art_Style","Tools_Used","Platform","Region","Art_Genre","Medium"]

for c in cat_cols:
    freq = df[c].value_counts()
    df[f"{c}_freq"] = df[c].map(freq)
    df[f"{c}_is_rare"] = (df[f"{c}_freq"] <= 50).astype(int)

Despite incorporating temporal, cyclic, and rarity-based features, the models in the following sections still perform at chance level, suggesting that these engineered features fail to capture the underlying drivers of popularity.
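
One caveat about the frequency features: the counts above are computed on the full dataset before the train/test split, so test rows contribute to their own feature values. That is unlikely to matter much for simple category counts, but a leakage-safe variant would learn the counts from the training rows only. The sketch below illustrates the idea on toy data; the column values and the `toy` frame are assumptions made for the example, not part of the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for one categorical metadata column (hypothetical values)
toy = pd.DataFrame({
    "Platform": ["A"] * 60 + ["B"] * 30 + ["C"] * 10,
    "target": range(100),
})

train, test = train_test_split(toy, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

# Learn category counts from the training rows only
freq = train["Platform"].value_counts()

# Apply the same mapping to both splits; unseen categories get a count of 0
train["Platform_freq"] = train["Platform"].map(freq).fillna(0).astype(int)
test["Platform_freq"] = test["Platform"].map(freq).fillna(0).astype(int)

print(freq)
```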

## Regression Task: Predicting Popularity Score

I first formulate the problem as a regression task: the goal is to predict the numerical popularity score of an artwork from metadata features. A Ridge (regularized linear) model tests whether a linear relationship exists between metadata and popularity.

In [30]:
y = df["Popularity_Score"]
X = df.drop(columns=["Popularity_Score", "Artwork_ID", "Artist_Name"], errors="ignore")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

cat_cols = X.select_dtypes(include="object").columns
num_cols = X.select_dtypes(include="number").columns

preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), num_cols),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("ohe", OneHotEncoder(handle_unknown="ignore"))
        ]), cat_cols)
    ]
)

pipe = Pipeline(steps=[
    ("prep", preprocessor),
    ("model", Ridge(alpha=1.0))
])

pipe.fit(X_train, y_train)

pred = pipe.predict(X_test)

mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)

print("RMSE:", rmse)
print("R2  :", r2_score(y_test, pred))

RMSE: 1436.5059680175225
R2  : -0.007570819100202408


### Regression Results Interpretation

The regression model yields a slightly negative R² (about -0.008), indicating that it performs marginally worse than simply predicting the mean popularity score for all artworks.

This suggests that linear relationships between metadata and popularity are weak or nonexistent, and that popularity is likely driven by non-linear or unobserved factors.
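
To make the meaning of a negative R² concrete: R² is defined as 1 - SS_res / SS_tot, where SS_res is the squared error of the model and SS_tot is the squared deviation of the test targets from their own mean. A model that ignores the features and predicts the training mean therefore scores at or just below zero. The sketch below checks this on synthetic scores (the distribution parameters are arbitrary assumptions).

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
y_train = rng.normal(loc=5000, scale=1400, size=800)  # synthetic scores
y_test = rng.normal(loc=5000, scale=1400, size=200)

# A "model" that ignores all features and predicts the training mean
pred = np.full_like(y_test, y_train.mean())

ss_res = np.sum((y_test - pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # spread around the test mean
r2_manual = 1 - ss_res / ss_tot

print("manual R2 :", r2_manual)
print("sklearn R2:", r2_score(y_test, pred))
```

An R² of roughly zero, as reported above, therefore means the fitted model is doing about as well as this feature-free predictor.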

## Non-linear Regression: Random Forest Regressor

Since linear regression fails to capture meaningful patterns, I apply a Random Forest Regressor to model potential non-linear interactions between metadata features.

In [31]:
rf_pipe = Pipeline(steps=[
    ("prep", preprocessor),
    ("model", RandomForestRegressor(
        n_estimators=200,
        random_state=42,
        n_jobs=-1
    ))
])

rf_pipe.fit(X_train, y_train)

rf_pred = rf_pipe.predict(X_test)

rf_rmse = np.sqrt(mean_squared_error(y_test, rf_pred))
rf_r2   = r2_score(y_test, rf_pred)

print("Random Forest RMSE:", rf_rmse)
print("Random Forest R2  :", rf_r2)

Random Forest RMSE: 1463.141653703112
Random Forest R2  : -0.04528197261842948


### Random Forest Regression Results

Random Forest does not improve performance: the R² score remains negative (about -0.045) and the RMSE is slightly higher than with Ridge (1463 vs. 1437).
Since a flexible non-linear ensemble also fails, the problem is more likely insufficient information in the metadata than the choice of model.
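
A single train/test split can be noisy, so one way to back up this reading is to look at R² across several folds. The following is a minimal sketch using `cross_val_score` on synthetic data in which the target is, by construction, independent of the features; `X_demo` and `y_demo` are placeholders, and in the notebook the fitted pipeline and the real `X`, `y` could be passed instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 5))  # synthetic features
y_demo = rng.normal(size=500)       # target generated independently of X_demo

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X_demo, y_demo, cv=cv, scoring="r2",
)

print("R2 per fold:", np.round(scores, 3))
print("mean R2    :", scores.mean())
```

When the features carry no information, every fold lands at or below zero, which is the same pattern seen in the two regression results above.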

## From Regression to Classification

Predicting the exact popularity score is a hard, noisy target, so I reframe the task as binary classification and ask:
**Can metadata distinguish high-performing from low-performing artworks?**

## Classification Models: High vs. Low Popularity

I define two classes:

- **High popularity:** popularity score ≥ median
- **Low popularity:** popularity score < median

I train and evaluate two models:
- Logistic Regression (linear baseline classifier)
- Random Forest Classifier (non-linear classifier)

Model performance is evaluated using accuracy, ROC-AUC, the confusion matrix, and per-class precision, recall, and F1.
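
For reference, it helps to know what these metrics look like for a classifier that has no information at all. The sketch below uses scikit-learn's `DummyClassifier` on synthetic balanced labels (an assumption that mirrors the median split); it is not part of the original analysis.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))  # features are ignored by the dummy
y_demo = np.repeat([0, 1], 500)      # balanced labels, as with a median split

# "prior" always predicts the majority class and outputs the class priors
baseline = DummyClassifier(strategy="prior").fit(X_demo, y_demo)
base_pred = baseline.predict(X_demo)
base_proba = baseline.predict_proba(X_demo)[:, 1]

base_acc = accuracy_score(y_demo, base_pred)
base_auc = roc_auc_score(y_demo, base_proba)
print("baseline accuracy:", base_acc)
print("baseline ROC-AUC :", base_auc)
```

With balanced classes both numbers sit at 0.5, so any real model has to clear that bar to show signal.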

In [32]:
# ---- 1) TARGET: High vs Low popularity -----
median_pop = df["Popularity_Score"].median()
df["popularity_class"] = (df["Popularity_Score"] >= median_pop).astype(int)

y = df["popularity_class"]
X = df.drop(columns=["popularity_class", "Popularity_Score"], errors="ignore")
X = X.drop(columns=["Artwork_ID", "Artist_Name"], errors="ignore")

# ---- 2) SPLIT (stratify important)-----
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ----- 3) PREPROCESSING -----
cat_cols = X.select_dtypes(include=["object"]).columns.tolist()
num_cols = X.select_dtypes(include=[np.number]).columns.tolist()

numeric_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median"))
])

categorical_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, num_cols),
        ("cat", categorical_pipe, cat_cols)
    ]
)

# ---- 4) MODEL 1: Logistic Regression -----
log_clf = Pipeline(steps=[
    ("prep", preprocessor),
    ("model", LogisticRegression(
        solver="saga",
        max_iter=20000,
        C=0.1,
        tol=1e-3, 
    ))
])

log_clf.fit(X_train, y_train)
log_pred = log_clf.predict(X_test)
log_proba = log_clf.predict_proba(X_test)[:, 1]

print("=== Logistic Regression ===")
print("Accuracy:", accuracy_score(y_test, log_pred))
print("ROC-AUC :", roc_auc_score(y_test, log_proba))
print("Confusion matrix:\n", confusion_matrix(y_test, log_pred))
print(classification_report(y_test, log_pred, digits=4))

# ---- 5) MODEL 2: Random Forest -----
rf_clf = Pipeline(steps=[
    ("prep", preprocessor),
    ("model", RandomForestClassifier(
        n_estimators=300,
        random_state=42,
        n_jobs=-1
    ))
])

rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_test)
rf_proba = rf_clf.predict_proba(X_test)[:, 1]

print("\n=== Random Forest Classifier ===")
print("Accuracy:", accuracy_score(y_test, rf_pred))
print("ROC-AUC :", roc_auc_score(y_test, rf_proba))
print("Confusion matrix:\n", confusion_matrix(y_test, rf_pred))
print(classification_report(y_test, rf_pred, digits=4))


=== Logistic Regression ===
Accuracy: 0.506
ROC-AUC : 0.49875849999999994
Confusion matrix:
 [[500 500]
 [488 512]]
              precision    recall  f1-score   support

           0     0.5061    0.5000    0.5030      1000
           1     0.5059    0.5120    0.5089      1000

    accuracy                         0.5060      2000
   macro avg     0.5060    0.5060    0.5060      2000
weighted avg     0.5060    0.5060    0.5060      2000


=== Random Forest Classifier ===
Accuracy: 0.516
ROC-AUC : 0.5097195000000001
Confusion matrix:
 [[535 465]
 [503 497]]
              precision    recall  f1-score   support

           0     0.5154    0.5350    0.5250      1000
           1     0.5166    0.4970    0.5066      1000

    accuracy                         0.5160      2000
   macro avg     0.5160    0.5160    0.5158      2000
weighted avg     0.5160    0.5160    0.5158      2000



### Classification Results Interpretation

Both Logistic Regression (accuracy 0.506, ROC-AUC 0.499) and Random Forest (accuracy 0.516, ROC-AUC 0.510) score close to 0.50, which is what random guessing achieves on balanced classes.

This indicates that even after simplifying the target into two classes, the available metadata does not provide enough signal to separate high vs low popularity.
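
A quick way to judge whether accuracies of 0.506 and 0.516 differ meaningfully from chance is to compare them with the sampling noise of a coin-flip classifier on a test set of this size. The sketch below uses the binomial standard error; treating the 2000 test predictions as independent trials is a simplifying assumption.

```python
import math

n_test = 2000  # test-set size from the reports above
accuracies = {"Logistic Regression": 0.506, "Random Forest": 0.516}

# Standard error of accuracy for a classifier that guesses at random (p = 0.5)
se_chance = math.sqrt(0.5 * 0.5 / n_test)

# Distance of each observed accuracy from chance, in standard errors
z_scores = {name: (acc - 0.5) / se_chance for name, acc in accuracies.items()}

print(f"standard error at chance: {se_chance:.4f}")
for name, z in z_scores.items():
    print(f"{name}: z = {z:.2f}")
```

Both accuracies lie within two standard errors of 0.5, so neither is distinguishable from random guessing at the usual 5% level.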

## Machine Learning Conclusion for Dataset 1 and Additional Machine Learning Experiment

Across regression and classification experiments, machine learning models consistently fail to predict or distinguish artwork popularity using metadata alone.

As an additional experiment, I repeat the analysis on a second dataset (`AI_Generated_Art_Popularity.csv`). Unlike the first dataset, it contains direct user interaction metrics such as views, likes, comments, and shares, which makes it possible to test whether engagement metrics carry predictive signal beyond descriptive metadata.

I therefore evaluate whether these features can predict the overall engagement score.

In [33]:
df2 = pd.read_csv("AI_Generated_Art_Popularity.csv")
y = df2["Engagement_Score"]
X = df2.drop(columns=["Engagement_Score"])

numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(include=["object"]).columns

# Preprocessing
numeric_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_features),
    ("cat", categorical_pipe, categorical_features)
])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -------- Ridge Regression --------
ridge = Pipeline([
    ("prep", preprocessor),
    ("model", Ridge(alpha=1.0))
])

ridge.fit(X_train, y_train)
pred_ridge = ridge.predict(X_test)

print("=== Ridge Regression ===")
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred_ridge)))
print("R2  :", r2_score(y_test, pred_ridge))

# -------- Random Forest --------
rf = Pipeline([
    ("prep", preprocessor),
    ("model", RandomForestRegressor(
        n_estimators=300,
        random_state=42,
        n_jobs=-1
    ))
])

rf.fit(X_train, y_train)
pred_rf = rf.predict(X_test)

print("\n=== Random Forest Regressor ===")
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred_rf)))
print("R2  :", r2_score(y_test, pred_rf))

=== Ridge Regression ===
RMSE: 0.2520875396662983
R2  : -0.005725867425898423

=== Random Forest Regressor ===
RMSE: 0.258758757848567
R2  : -0.059661062728380454


### Interpretation of Initial Engagement Score Regression Results

The initial regression models yield an RMSE of approximately 0.25 and a slightly negative R² value.
A negative R² indicates that the models perform worse than a simple mean-based predictor,
so the selected features explain essentially none of the variance in the engagement score.

This outcome does not look like a modeling or implementation error, since two very different model families agree.
It also argues against the engagement score being a simple aggregate of the interaction metrics in this file:
if the score were computed from views, likes, comments, and shares, models given those same variables
should recover it almost perfectly. An R² near zero instead suggests that, in this dataset,
the score is statistically unrelated to the available columns.
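
A synthetic check can show what the two scenarios look like in practice. In the sketch below, one score is built as a weighted sum of four stand-in interaction metrics and another is drawn independently of them; the data, the weights, and the helper `holdout_r2` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical stand-ins for views, likes, comments, shares
X_demo = rng.normal(size=(2000, 4))

# Case 1: score is a weighted sum of the metrics (weights are made up)
y_dependent = X_demo @ np.array([0.4, 0.3, 0.2, 0.1])
# Case 2: score is drawn independently of the metrics
y_independent = rng.normal(size=2000)

def holdout_r2(X, y):
    """Fit Ridge on 80% of the rows and return R2 on the remaining 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    return r2_score(y_te, Ridge(alpha=1.0).fit(X_tr, y_tr).predict(X_te))

r2_dependent = holdout_r2(X_demo, y_dependent)
r2_independent = holdout_r2(X_demo, y_independent)
print("R2, score derived from metrics:", r2_dependent)
print("R2, score independent of them :", r2_independent)
```

A score that is truly derived from its inputs is recovered almost exactly, while an unrelated score gives an R² near zero, which is the pattern observed on the real data.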

## Baseline Comparison Using Dummy Regressor
To assess whether regression models learn meaningful patterns, a Dummy
Regressor is used as a baseline that predicts the mean engagement score.
Model performance is compared against this baseline.


In [34]:
from sklearn.dummy import DummyRegressor

dummy = Pipeline([
    ("prep", preprocessor),
    ("model", DummyRegressor(strategy="mean"))
])

dummy.fit(X_train, y_train)
pred_dummy = dummy.predict(X_test)

print("=== Dummy Regressor ===")
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred_dummy)))
print("R2  :", r2_score(y_test, pred_dummy))

=== Dummy Regressor ===
RMSE: 0.25137079555161135
R2  : -1.4967734849102854e-05


Neither regression model outperforms the dummy baseline (RMSE 0.2521 for Ridge and
0.2588 for Random Forest, versus 0.2514 for the mean predictor), indicating that the
available features do not provide meaningful predictive information for engagement score estimation.
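
The comparison can be condensed into a single skill score, defined here as the fraction of the baseline RMSE that a model removes. This is a minimal sketch using the RMSE values printed above; the name `skill` and the choice of RMSE ratio are illustrative assumptions rather than part of the original analysis.

```python
# RMSE values copied from the outputs above
rmse_dummy = 0.25137079555161135
rmse_models = {
    "Ridge": 0.2520875396662983,
    "Random Forest": 0.258758757848567,
}

# Skill score: fraction of the baseline RMSE removed by the model.
# Positive means better than the mean predictor, negative means worse.
skill = {name: 1 - rmse / rmse_dummy for name, rmse in rmse_models.items()}

for name, s in skill.items():
    print(f"{name}: skill = {s:+.4f}")
```

Both scores come out negative, with Random Forest further below the baseline than Ridge.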

## Classification Models – Binary Popularity Prediction
In addition to regression, engagement prediction is reformulated as a
binary classification task, where artworks are labeled as high or low
engagement depending on whether their engagement score is above the dataset median.

In [35]:
df2["High_Engagement"] = (
    df2["Engagement_Score"] > df2["Engagement_Score"].median()
).astype(int)

y_cls = df2["High_Engagement"]
X_cls = df2.drop(columns=["Engagement_Score", "High_Engagement"])
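
Note that the label here uses a strict `>` comparison, whereas the first dataset used `>=`. The two only differ when scores are tied at the median, in which case the classes stop being exactly balanced. The toy series below (hypothetical values) shows the effect.

```python
import pandas as pd

# Toy scores with several values tied at the median (hypothetical data)
scores = pd.Series([0.1, 0.3, 0.5, 0.5, 0.5, 0.7, 0.9])
median = scores.median()

strict = (scores > median).astype(int)      # ties fall into the low class
inclusive = (scores >= median).astype(int)  # ties fall into the high class

print("median:", median)
print("high-class count with > :", int(strict.sum()))
print("high-class count with >=:", int(inclusive.sum()))
```

Checking `value_counts()` on the label column is a cheap way to confirm the split is close to balanced before interpreting accuracy.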

In [36]:
X_train, X_test, y_train, y_test = train_test_split(
    X_cls, y_cls, test_size=0.2, random_state=42, stratify=y_cls
)

### Logistic Regression
Logistic Regression is applied as a probabilistic linear classifier to
establish a baseline for binary popularity prediction.

In [37]:
log_clf = Pipeline([
    ("prep", preprocessor),
    ("model", LogisticRegression(
        solver="saga",
        max_iter=20000,
        C=0.1,
        tol=1e-3
    ))
])

log_clf.fit(X_train, y_train)
pred_log = log_clf.predict(X_test)
proba_log = log_clf.predict_proba(X_test)[:, 1]

print("=== Logistic Regression ===")
print("Accuracy:", accuracy_score(y_test, pred_log))
print("ROC-AUC :", roc_auc_score(y_test, proba_log))
print(classification_report(y_test, pred_log))


=== Logistic Regression ===
Accuracy: 0.483
ROC-AUC : 0.4722730018280658
              precision    recall  f1-score   support

           0       0.49      0.50      0.49       503
           1       0.48      0.46      0.47       497

    accuracy                           0.48      1000
   macro avg       0.48      0.48      0.48      1000
weighted avg       0.48      0.48      0.48      1000



### Random Forest Classifier
A Random Forest Classifier is used to model non-linear decision boundaries
and compare performance against the linear classifier.

In [38]:
rf_clf = Pipeline([
    ("prep", preprocessor),
    ("model", RandomForestClassifier(
        n_estimators=300,
        random_state=42
    ))
])

rf_clf.fit(X_train, y_train)
pred_rf = rf_clf.predict(X_test)
proba_rf = rf_clf.predict_proba(X_test)[:, 1]

print("=== Random Forest Classifier ===")
print("Accuracy:", accuracy_score(y_test, pred_rf))
print("ROC-AUC :", roc_auc_score(y_test, proba_rf))
print(classification_report(y_test, pred_rf))


=== Random Forest Classifier ===
Accuracy: 0.469
ROC-AUC : 0.45917853042709544
              precision    recall  f1-score   support

           0       0.47      0.48      0.48       503
           1       0.47      0.46      0.46       497

    accuracy                           0.47      1000
   macro avg       0.47      0.47      0.47      1000
weighted avg       0.47      0.47      0.47      1000



Both Logistic Regression and Random Forest achieve accuracy (0.48 and 0.47) and ROC-AUC
(0.47 and 0.46) slightly below 0.50, which is no better than random guessing. High and low
engagement levels therefore cannot be reliably distinguished using the current feature set.
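
To get a rough scale for how far these ROC-AUC values sit from 0.5, one can use the spread of ROC-AUC for uninformative scores, which follows from the Mann-Whitney U statistic: its variance is (n0 + n1 + 1) / (12 · n0 · n1) for class sizes n0 and n1. The sketch below applies this to the reported numbers; it assumes untied scores and independent test rows, and is meant as a back-of-the-envelope check rather than a formal test.

```python
import math

n_neg, n_pos = 503, 497  # class supports from the reports above
aucs = {
    "Logistic Regression": 0.4722730018280658,
    "Random Forest": 0.45917853042709544,
}

# Standard deviation of ROC-AUC when scores carry no information
# (Mann-Whitney U null variance, assuming no tied scores)
se_null = math.sqrt((n_neg + n_pos + 1) / (12 * n_neg * n_pos))

# Signed distance of each AUC from 0.5, in null standard deviations
auc_z = {name: (auc - 0.5) / se_null for name, auc in aucs.items()}

print(f"null standard deviation of AUC: {se_null:.4f}")
for name, z in auc_z.items():
    print(f"{name}: {z:+.2f} standard deviations from 0.5")
```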


## Overall Conclusion

Across two different datasets and multiple machine learning formulations,
the models consistently fail to achieve meaningful predictive performance.

In the first dataset, which relies primarily on descriptive metadata
such as style, tool, platform, region, and time, neither regression nor
classification models outperform simple baselines. This indicates that
artwork popularity cannot be reliably inferred from high-level attributes alone.

In the second dataset, despite the inclusion of interaction-related variables,
regression models do not improve over a dummy predictor and classification
models perform close to random guessing. This suggests that engagement and
popularity are likely driven by latent factors (such as visual aesthetics,
platform-specific algorithms, or social dynamics) that are not captured in
the available features.

Overall, these results highlight the limitations of metadata-driven approaches
and emphasize the importance of richer content-based or network-level information
for predicting popularity and engagement in AI-generated art.
