# Title and summary 

NeurIPS 2025 Polymer Prediction — Non-GNN Baselines with Stacking and Blending

This notebook builds two strong non-GNN pipelines from RDKit descriptors and 2048-bit Morgan fingerprints. Each pipeline trains LightGBM, XGBoost, and CatBoost with KFold out-of-fold (OOF) predictions and stacks them with ElasticNetCV. The two pipelines are then blended 50/50 and lightly post-processed (Tg mean matching). The output is submission.csv.

# Setup and configuration 

In [None]:
# --- Cell 1: Setup and Configuration ---

# General data handling and modeling
import pandas as pd
import numpy as np
import pickle
from tqdm.auto import tqdm

# Scikit-learn for modeling and feature engineering
from sklearn.model_selection import KFold
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import PolynomialFeatures

# RDKit for chemistry features
from rdkit import Chem, DataStructs
from rdkit.Chem import Descriptors, AllChem

# Gradient Boosting Models
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

# --- Global Configuration ---
SEED = 42
np.random.seed(SEED)
targets = ["Tg", "FFV", "Rg", "Density", "Tc"]

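# --- Tuned hyperparameters (identical for both pipelines) ---
# optuna_results holds the CatBoost parameters; lgbm_results and xgb_results
# hold the LightGBM and XGBoost parameters. All are keyed by target name.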
optuna_results = {'Tg': {'depth': 4, 'learning_rate': 0.0138, 'iterations': 909, 'l2_leaf_reg': 0.0029},'FFV': {'depth': 6, 'learning_rate': 0.2792, 'iterations': 319, 'l2_leaf_reg': 7.5440},'Rg': {'depth': 4, 'learning_rate': 0.0429, 'iterations': 548, 'l2_leaf_reg': 0.0342},'Density': {'depth': 10, 'learning_rate': 0.0626, 'iterations': 272, 'l2_leaf_reg': 0.8955},'Tc': {'depth': 7, 'learning_rate': 0.0281, 'iterations': 386, 'l2_leaf_reg': 0.0038}}
lgbm_results = {'Tg': {'num_leaves': 127, 'learning_rate': 0.0126, 'n_estimators': 177, 'max_depth': 12, 'reg_alpha': 0.0011, 'reg_lambda': 0.0243},'FFV': {'num_leaves': 65, 'learning_rate': 0.0694, 'n_estimators': 983, 'max_depth': 15, 'reg_alpha': 0.0218, 'reg_lambda': 0.0020},'Rg': {'num_leaves': 136, 'learning_rate': 0.0151, 'n_estimators': 297, 'max_depth': 4, 'reg_alpha': 8.5566, 'reg_lambda': 0.0121},'Density': {'num_leaves': 96, 'learning_rate': 0.0126, 'n_estimators': 180, 'max_depth': 8, 'reg_alpha': 0.0456, 'reg_lambda': 1.4955},'Tc': {'num_leaves': 109, 'learning_rate': 0.0268, 'n_estimators': 699, 'max_depth': 6, 'reg_alpha': 0.1622, 'reg_lambda': 0.3150}}
xgb_results = {'Tg': {'max_depth': 3, 'learning_rate': 0.0100, 'n_estimators': 668, 'reg_alpha': 0.0014, 'reg_lambda': 0.9604},'FFV': {'max_depth': 7, 'learning_rate': 0.0454, 'n_estimators': 942, 'reg_alpha': 0.0057, 'reg_lambda': 6.3910},'Rg': {'max_depth': 13, 'learning_rate': 0.0331, 'n_estimators': 906, 'reg_alpha': 8.4273, 'reg_lambda': 1.3291},'Density': {'max_depth': 3, 'learning_rate': 0.2504, 'n_estimators': 511, 'reg_alpha': 0.0114, 'reg_lambda': 1.8864},'Tc': {'max_depth': 9, 'learning_rate': 0.1164, 'n_estimators': 475, 'reg_alpha': 0.0131, 'reg_lambda': 3.4711}}

print("✅ Setup and configuration complete.")


# Data loading and base feature generation

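Loads the competition train/test files, appends the official supplements (Tc/Tg/FFV), deduplicates on SMILES (the first occurrence is kept, so a competition row takes precedence over a supplement row with the same SMILES), and computes base features: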

* RDKit descriptors from Descriptors.descList.
* 2048-bit Morgan fingerprints (radius=2).
  
Features are concatenated and cleaned with np.nan_to_num (NaN and ±inf become 0), then split back into train/test matrices.
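
As a small illustration of the cleaning step, the sketch below applies the same `np.nan_to_num` arguments to a toy matrix. The values are made up for the example and are not real descriptor outputs.

```python
import numpy as np

# Toy feature matrix standing in for stacked descriptors and fingerprint bits
# (values are illustrative only).
toy = np.array([
    [1.5, np.nan, 1.0],
    [np.inf, 2.0, 0.0],
    [-np.inf, 3.0, 1.0],
])

# Same arguments as the notebook: NaN, +inf, and -inf all become 0.0.
cleaned = np.nan_to_num(toy, nan=0.0, posinf=0.0, neginf=0.0)
print(cleaned)
```

Passing `posinf=0.0` and `neginf=0.0` explicitly matters: with the defaults, `np.nan_to_num` replaces infinities with very large finite numbers rather than zero.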

In [None]:
# --- Cell 2: Data Loading and Base Feature Generation ---

print("--- Loading and preparing data ---")
# Load all data sources
train_df_raw = pd.read_csv(' ')
test_df_raw = pd.read_csv(' ')
dataset_tc = pd.read_csv(' ')
dataset_tg = pd.read_csv(' ')
dataset_ffv = pd.read_csv(' ')

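# Combine and deduplicate. drop_duplicates keeps the first occurrence, so a
# competition row wins over a supplement row with the same SMILES.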
augmented = pd.concat([dataset_tc, dataset_tg, dataset_ffv], ignore_index=True)
train_full = pd.concat([train_df_raw, augmented], ignore_index=True).drop_duplicates(subset=['SMILES']).reset_index(drop=True)
train_size = len(train_full)
combined_df = pd.concat([train_full.drop(columns=targets), test_df_raw], ignore_index=True)

print(f"Data loaded. Train size: {train_size}, Test size: {len(test_df_raw)}")

# Feature generation function
print("\n--- Generating base RDKit and Morgan fingerprint features ---")
descriptor_names, descriptor_funcs = zip(*Descriptors.descList)
fp_bits = 2048

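def compute_base_features(smiles_series):
    """Return (descriptor matrix, fingerprint matrix) for a series of SMILES strings."""
    rdkit, fp = [], []
    for smiles in tqdm(smiles_series, desc="Calculating Base Features"):
        mol = Chem.MolFromSmiles(smiles)  # None if the SMILES cannot be parsed
        # Unparseable molecules get NaN descriptors (zeroed later) and an all-zero fingerprint
        rdkit_vals = [func(mol) if mol is not None else np.nan for func in descriptor_funcs]
        rdkit.append(rdkit_vals)
        arr = np.zeros((fp_bits,), dtype=np.float32)
        if mol is not None:
            bitvect = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=fp_bits)
            DataStructs.ConvertToNumpyArray(bitvect, arr)
        fp.append(arr)
    return np.array(rdkit), np.array(fp)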

X_rdkit_all, X_fp_all = compute_base_features(combined_df['SMILES'])

# Combine and clean the feature matrix
X_all_features = np.nan_to_num(
    np.hstack([X_rdkit_all, X_fp_all]), 
    nan=0.0, posinf=0.0, neginf=0.0
)
feature_names = list(descriptor_names) + [f'FP_{i}' for i in range(fp_bits)]
print(f"Base feature matrix created with shape: {X_all_features.shape}")

# Split back into training and testing sets
X_train_all = X_all_features[:train_size]
X_test_all = X_all_features[train_size:]

print("✅ Base feature generation complete.")


# Feature subsets per pipeline

Two pipelines are prepared:

* Original (≈0.067): Tg and FFV use their top-15 descriptors plus top-15 FP bits; Rg uses its top-15 descriptors plus all 2048 FP bits; Density and Tc use all features.
* Interaction (≈0.066): Tg, FFV, and Tc as in Original; Rg and Density keep their Original features and add degree-2 interaction-only terms built from the 15 Rg descriptors and the 8 Density descriptors, respectively.
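
To make "degree-2 interaction-only terms" concrete, here is a minimal NumPy sketch of the expansion. It is a stand-in for `PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)`, not the notebook's code; the helper names `interaction_only_degree2` and `n_output_columns` and the toy matrix are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def interaction_only_degree2(X):
    """Return the original columns followed by all pairwise products x_i * x_j (i < j)."""
    X = np.asarray(X, dtype=float)
    pairs = [X[:, i] * X[:, j] for i, j in combinations(range(X.shape[1]), 2)]
    return np.column_stack([X] + pairs)


# Toy input: 2 samples, 3 hypothetical descriptor columns a, b, c.
X_toy = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
expanded = interaction_only_degree2(X_toy)
print(expanded)        # columns: a, b, c, a*b, a*c, b*c
print(expanded.shape)  # (2, 6)


def n_output_columns(d):
    """Column count for d inputs: d linear terms plus d*(d-1)/2 pairwise products."""
    return d + d * (d - 1) // 2


print(n_output_columns(15))  # 120, if all 15 Rg descriptors resolve to columns
print(n_output_columns(8))   # 36, if all 8 Density descriptors resolve to columns
```

Because the expansion keeps the linear terms, stacking it next to a matrix that already contains those descriptors duplicates them. Tree models are generally insensitive to duplicated columns, but the duplication does inflate the feature count.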

In [None]:
# --- Cell 3: Pipeline-Specific Feature Subset Functions ---

# Define the feature lists from the successful runs
top15_descriptors = {'Tg': ['MolLogP', 'TPSA', 'FractionCSP3', 'NumHDonors', 'NumValenceElectrons','NumRings', 'NumHAcceptors', 'NumRotatableBonds', 'HeavyAtomCount','NumAliphaticRings', 'NumBridgeheadAtoms', 'NumSaturatedRings','BertzCT', 'NHOHCount', 'NOCount'],'FFV': ['MolLogP', 'TPSA', 'NumHDonors', 'NumHAcceptors', 'FractionCSP3','NumRings', 'NumRotatableBonds', 'HeavyAtomCount', 'NumHeteroatoms','NumBridgeheadAtoms', 'NumSaturatedRings', 'BertzCT', 'NHOHCount', 'NOCount', 'RingCount'],'Rg': ['MolLogP', 'NumValenceElectrons', 'FractionCSP3', 'MolWt','NumRotatableBonds', 'HeavyAtomCount', 'NumHAcceptors', 'NumHDonors','NumRings', 'BertzCT', 'NumAliphaticRings', 'NumSaturatedRings','NHOHCount', 'NOCount', 'RingCount']}
top15_fps = {'Tg': ['FP_650', 'FP_1152', 'FP_587', 'FP_378', 'FP_1057', 'FP_1928', 'FP_80', 'FP_695','FP_1143', 'FP_891', 'FP_1911', 'FP_314', 'FP_1873', 'FP_1855', 'FP_1097'],'FFV': ['FP_587', 'FP_650', 'FP_1152', 'FP_378', 'FP_1057', 'FP_80', 'FP_695', 'FP_1928','FP_1143', 'FP_414', 'FP_891', 'FP_807', 'FP_1831', 'FP_322', 'FP_1873']}
top8_density_desc = ['MolWt','TPSA','HeavyAtomCount','NumHAcceptors','FractionCSP3','NumSaturatedRings','MolLogP','NumRings']

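def get_feature_indices(features_to_select, all_feature_names):
    """Return column indices (in matrix order) of the selected features.

    Names that are absent from all_feature_names are skipped silently, so a
    subset can end up with fewer columns than the list that requested it.
    """
    wanted = set(features_to_select)
    return [i for i, f in enumerate(all_feature_names) if f in wanted]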

def create_original_feature_subsets(X_train_all, X_test_all, feature_names):
    """Creates the feature subsets for the original 0.067 pipeline."""
    X_train_subsets, X_test_subsets = {}, {}
    num_rdkit_features = len(descriptor_names)
    
    for target in ['Tg', 'FFV']:
        selected_features = top15_descriptors[target] + top15_fps[target]
        idxs = get_feature_indices(selected_features, feature_names)
        X_train_subsets[target], X_test_subsets[target] = X_train_all[:, idxs], X_test_all[:, idxs]
        
    rg_desc_indices = get_feature_indices(top15_descriptors['Rg'], feature_names)
    rg_fp_indices = list(range(num_rdkit_features, len(feature_names)))
    rg_indices = sorted(list(set(rg_desc_indices + rg_fp_indices)))
    X_train_subsets['Rg'], X_test_subsets['Rg'] = X_train_all[:, rg_indices], X_test_all[:, rg_indices]

    X_train_subsets['Density'], X_test_subsets['Density'] = X_train_all, X_test_all
    X_train_subsets['Tc'], X_test_subsets['Tc'] = X_train_all, X_test_all
    
    print("Created feature subsets for the 'Original' pipeline.")
    return X_train_subsets, X_test_subsets

def create_interaction_feature_subsets(X_train_all, X_test_all, feature_names):
    """Creates the feature subsets for the 0.066 pipeline with interactions."""
    X_train_subsets, X_test_subsets = {}, {}
    num_rdkit_features = len(descriptor_names)

    # Standard subsets for Tg, FFV, Tc
    for target in ['Tg', 'FFV']:
        selected_features = top15_descriptors[target] + top15_fps[target]
        idxs = get_feature_indices(selected_features, feature_names)
        X_train_subsets[target], X_test_subsets[target] = X_train_all[:, idxs], X_test_all[:, idxs]
    X_train_subsets['Tc'], X_test_subsets['Tc'] = X_train_all, X_test_all

    # Interaction features for Rg
    rg_desc_indices = get_feature_indices(top15_descriptors['Rg'], feature_names)
    rg_fp_indices = list(range(num_rdkit_features, len(feature_names)))
    rg_full_indices = sorted(list(set(rg_desc_indices + rg_fp_indices)))
    poly_rg = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit(X_train_all[:, rg_desc_indices])
    X_train_subsets['Rg'] = np.hstack([X_train_all[:, rg_full_indices], poly_rg.transform(X_train_all[:, rg_desc_indices])])
    X_test_subsets['Rg'] = np.hstack([X_test_all[:, rg_full_indices], poly_rg.transform(X_test_all[:, rg_desc_indices])])

    # Interaction features for Density
    density_desc_indices = get_feature_indices(top8_density_desc, feature_names)
    poly_density = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit(X_train_all[:, density_desc_indices])
    X_train_subsets['Density'] = np.hstack([X_train_all, poly_density.transform(X_train_all[:, density_desc_indices])])
    X_test_subsets['Density'] = np.hstack([X_test_all, poly_density.transform(X_test_all[:, density_desc_indices])])
    
    print("Created feature subsets for the 'Interaction' pipeline.")
    return X_train_subsets, X_test_subsets

print("✅ Feature subset creation functions defined.")


# CV, OOF, and stacking

For each target:

* Mask rows with missing labels.
* KFold (10 splits when the target has at least 10 labeled rows, otherwise 5), shuffled with a fixed seed; train LGBM/XGB/Cat on each fold.
* Collect OOF predictions; average test predictions across folds.
* Train ElasticNetCV meta-learner on OOF; predict meta on test stack.
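
The four steps above can be sketched end to end with NumPy alone. This is a simplified illustration under stated assumptions, not the notebook's training code: closed-form ridge regressions stand in for the boosted models, ordinary least squares stands in for ElasticNetCV, the data are synthetic, and `fit_ridge` / `predict_ridge` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression data; some labels are set to NaN to mimic a sparse target.
X = rng.normal(size=(60, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=60)
y[::7] = np.nan
X_test = rng.normal(size=(10, 4))


def fit_ridge(X, y, alpha):
    """Closed-form ridge regression with intercept (stand-in for a boosted model)."""
    A = np.column_stack([np.ones(len(X)), X])
    penalty = alpha * np.eye(A.shape[1])
    penalty[0, 0] = 0.0  # do not penalize the intercept
    return np.linalg.solve(A.T @ A + penalty, A.T @ y)


def predict_ridge(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef


# 1. Mask rows with missing labels.
mask = ~np.isnan(y)
X_lab, y_lab = X[mask], y[mask]

# 2. Shuffled K-fold split (plain NumPy in place of sklearn's KFold).
n_splits = 5
folds = np.array_split(rng.permutation(len(y_lab)), n_splits)

alphas = [0.1, 10.0]  # two "base models" that differ only in regularization
oof = np.zeros((len(y_lab), len(alphas)))
test_stack = np.zeros((len(X_test), len(alphas)))

for val_idx in folds:
    train_idx = np.setdiff1d(np.arange(len(y_lab)), val_idx)
    for m, alpha in enumerate(alphas):
        coef = fit_ridge(X_lab[train_idx], y_lab[train_idx], alpha)
        # 3. Each row is predicted only by a model that never saw it.
        oof[val_idx, m] = predict_ridge(coef, X_lab[val_idx])
        # Test predictions are averaged over the folds.
        test_stack[:, m] += predict_ridge(coef, X_test) / n_splits

# 4. Meta-learner on the OOF columns (least squares here, not ElasticNetCV).
A_meta = np.column_stack([np.ones(len(oof)), oof])
meta_coef, *_ = np.linalg.lstsq(A_meta, y_lab, rcond=None)
final_test_pred = np.column_stack([np.ones(len(test_stack)), test_stack]) @ meta_coef
print(final_test_pred.shape)  # (10,)
```

The key property is in step 3: the meta-learner is fitted on predictions made for held-out rows, so it does not learn from base-model outputs that have already memorized their own training labels.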

In [None]:
# --- Cell 4: Training, OOF, and Stacking ---

def train_and_predict_pipeline(X_train_subsets, X_test_subsets, train_labels, test_scaffold):
    """Trains a full stacking ensemble and returns test predictions."""
    # dtype=float keeps the prediction columns numeric instead of object-typed
    pipeline_test_preds = pd.DataFrame(index=test_scaffold.index, columns=targets, dtype=float)

    for target in tqdm(targets, desc="Training Pipeline"):
        X_tr = X_train_subsets[target]
        X_te = X_test_subsets[target]

        # Keep only rows that have a label for this target
        y_mask = train_labels[target].notna().values
        X_tr_masked, y_tr_masked = X_tr[y_mask], train_labels.loc[y_mask, target].values
        
        num_train_samples = X_tr_masked.shape[0]
        n_test_samples = X_te.shape[0]
        n_splits = 10 if num_train_samples >= 10 else 5

        kf = KFold(n_splits=n_splits, shuffle=True, random_state=SEED)
        
        oof_preds_lgbm = np.zeros(num_train_samples)
        oof_preds_xgb  = np.zeros(num_train_samples)
        oof_preds_cat  = np.zeros(num_train_samples)
        test_preds_lgbm = np.zeros(n_test_samples)
        test_preds_xgb  = np.zeros(n_test_samples)
        test_preds_cat  = np.zeros(n_test_samples)

        for fold, (train_idx, val_idx) in enumerate(kf.split(X_tr_masked)):
            X_train, y_train = X_tr_masked[train_idx], y_tr_masked[train_idx]
            X_val = X_tr_masked[val_idx]
            
            # Base models
            lgbm = LGBMRegressor(**lgbm_results[target], random_state=SEED, verbosity=-1)
            xgb  = XGBRegressor(**xgb_results[target], random_state=SEED)
            cat  = CatBoostRegressor(**optuna_results[target], random_seed=SEED, logging_level='Silent')
            
            lgbm.fit(X_train, y_train)
            xgb.fit(X_train, y_train)
            cat.fit(X_train, y_train)
            
            # Out-of-fold preds
            oof_preds_lgbm[val_idx] = lgbm.predict(X_val)
            oof_preds_xgb[val_idx]  = xgb.predict(X_val)
            oof_preds_cat[val_idx]  = cat.predict(X_val)
            
            # Test preds (averaged)
            test_preds_lgbm += lgbm.predict(X_te) / n_splits
            test_preds_xgb  += xgb.predict(X_te) / n_splits
            test_preds_cat  += cat.predict(X_te) / n_splits

        # Meta model training
        X_meta_train = np.column_stack([oof_preds_lgbm, oof_preds_xgb, oof_preds_cat])
        X_meta_test  = np.column_stack([test_preds_lgbm, test_preds_xgb, test_preds_cat])
        meta_model = ElasticNetCV(cv=5, random_state=SEED, max_iter=10000)
        meta_model.fit(X_meta_train, y_tr_masked)
        
        pipeline_test_preds.loc[:, target] = meta_model.predict(X_meta_test)

    return pipeline_test_preds


# Main execution, blending, submission

* Build Original and Interaction feature subsets, train and predict each pipeline.
* 50/50 blend per target.
* Post-process: shift the Tg predictions by a constant so that their mean equals the training Tg mean, to reduce bias from distribution shift.
* Save submission.csv.
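
The blend and the mean-matching step are simple arithmetic, sketched below on made-up numbers. The prediction values and the training mean of 95.0 are assumptions for illustration only.

```python
import numpy as np

# Hypothetical Tg predictions from the two pipelines (illustrative numbers).
preds_original = np.array([100.0, 120.0, 140.0])
preds_interaction = np.array([110.0, 130.0, 150.0])

# 50/50 blend.
blend = 0.5 * preds_original + 0.5 * preds_interaction  # [105, 125, 145]

# Mean matching: add one constant so the prediction mean equals the training mean.
tg_train_mean = 95.0                    # assumed training-set mean of Tg
offset = tg_train_mean - blend.mean()   # 95 - 125 = -30
matched = blend + offset                # [75, 95, 115]

print(matched, matched.mean())
```

The shift changes only the mean. Differences between individual predictions are untouched, so the step helps only if the test Tg distribution really is centered near the training mean.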

In [None]:
# --- Cell 5: Main Execution, Blending, and Submission ---

# 1. Generate feature subsets for the Original Pipeline
print("--- Generating predictions for Original Pipeline (0.067) ---")
X_train_orig, X_test_orig = create_original_feature_subsets(X_train_all, X_test_all, feature_names)
preds_original = train_and_predict_pipeline(X_train_orig, X_test_orig, train_full, test_df_raw)
print("Original pipeline predictions generated.\n")

# 2. Generate feature subsets for the Interaction Pipeline
print("--- Generating predictions for Interaction Pipeline (0.066) ---")
X_train_inter, X_test_inter = create_interaction_feature_subsets(X_train_all, X_test_all, feature_names)
preds_interaction = train_and_predict_pipeline(X_train_inter, X_test_inter, train_full, test_df_raw)
print("Interaction pipeline predictions generated.\n")

# 3. Blend the predictions
print("--- Blending predictions from both pipelines ---")
submission = test_df_raw[['id']].copy()
for target in targets:
    # Using a simple 50/50 blend
    submission[target] = 0.5 * preds_original[target] + 0.5 * preds_interaction[target]
print("Blending complete.")

# 4. Apply final post-processing
print("\n--- Applying post-processing (Tg mean-matching) ---")
tg_train_mean = train_full['Tg'].mean()
submission['Tg'] += tg_train_mean - submission['Tg'].mean()
print("Post-processing for Tg complete.")

# 5. Save the final submission file
submission.to_csv('submission.csv', index=False)
print("\n✅ Final blended submission file 'submission.csv' created successfully.")
display(submission.head())


# Methodology

This notebook delivers a clean, reproducible non‑GNN solution for the NeurIPS Polymer Prediction task. The pipeline unifies deterministic chemistry features with strong tree ensembles and a light meta‑learner:

1. Data and features

    * Merged competition train with supplements and deduplicated on SMILES.

    * Generated RDKit physico‑chemical descriptors and 2048‑bit Morgan fingerprints (radius 2).

    * Cleaned features with consistent NaN/Inf handling; split back into train/test matrices.

2. Two complementary pipelines

    * Original (≈0.067): Tg/FFV use curated 15 descriptors + 15 FP bits; Rg uses its descriptors + all FP; Density/Tc use full features.

    * Interaction (≈0.066): Same Tg/FFV setup; Tc uses full features; Rg/Density add degree‑2 interaction‑only terms on selected descriptors.

3. Modeling and stacking

    * For each target, trained LightGBM, XGBoost, and CatBoost with fixed, pre-tuned hyperparameters shared by both pipelines.

    * Used KFold (5 or 10 folds, depending on how many labeled rows a target has) to obtain out‑of‑fold predictions, masking rows with missing labels per target.

    * Trained an ElasticNetCV meta‑learner on OOF stacks; inferred on the stacked test predictions.

4. Blending and post‑processing

    * Blended the two pipelines 50/50 per target.

    * Applied a minimal Tg mean‑matching step to reduce bias from distribution shift.

    * Wrote a ready‑to‑submit CSV.

5. Notes and learnings

    * Interaction terms help Rg and Density most; Tg, FFV, and Tc use identical features and hyperparameters in both pipelines, so the two pipelines differ only for Rg and Density.

    * Stacking outperforms simple averaging with small but consistent gains.

    * Mean matching stabilizes Tg on the leaderboard without heavy post‑processing.


Planned extensions (handled in companion notebooks):

* GNN embeddings/predictions blended with this stack for additional lift.

* Diagnostics and ablations: feature importances, residuals, and shift checks.