## 

## Autogluon Notebook

This is a modified version of this amazing notebook: [smiles-rdkit-lgbm-ftw](https://www.kaggle.com/code/richolson/smiles-rdkit-lgbm-ftw). Instead of using LGBM, we use Autogluon, an auto-ml framework.

# SMILES->RDKIT->LGBM->FTW  🧪⚡🚀

**SMILES** *(Simplified Molecular Input Line Entry System)*  
**RDKIT** *(Open-source cheminformatics toolkit)*   
**AutoGLuon** *AUtomated Machine Learning*
**FTW** *(For The Win!)*  

## 1. Use RDKit to calculate *chemical descriptors* from our SMILES molecule data
- **Structural Counts:** Ring counts, rotatable bonds, molecular weight
- **Calculated Properties:** LogP (oiliness), TPSA (surface area), qed (drug-likeness), complexity/shape stuff
- We infer these for both **train** and **test** data
- **We are using RDKit to do feature engineering**

## 2. Train models using those features to predict our *targets*:
- **Tg** - Glass transition temperature (°C)
- **FFV** - Fractional free volume
- **Tc** - Thermal conductivity (W/m·K)
- **Density** - Polymer density (g/cm³)
- **Rg** - Radius of gyration (Å)

## We train unique Autogluon models for each target!
- Actually 5 models per target using CV / averaging predictions (**25 models total!**)
- **RDKit is doing the heavy-lifting here** - we just train a model to figure out how to translate the data to our targets...

*Friendly Reminder:* If re-using large parts of this work in a public notebook - **please credit where you found the code**

# Install RDKit and Autogluon
* https://www.kaggle.com/datasets/richolson/rdkit-install-whl

In [22]:
# Install Autogluon for offline use
!pip install -q autogluon --no-index --find-links=file:///kaggle/input/autogluon-install-notebook

In [23]:
# install RDKit for offline use
!pip install /kaggle/input/rdkit-install-whl/rdkit_wheel/rdkit_pypi-2022.9.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

Processing /kaggle/input/rdkit-install-whl/rdkit_wheel/rdkit_pypi-2022.9.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
rdkit-pypi is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.


In [24]:
import os
def is_interactive_session():
    return os.environ.get('KAGGLE_KERNEL_RUN_TYPE','') == 'Interactive'

is_interactive_session()

config = {
    "autogluon_time": 60*60*0.2,
    "autogluon_presets": "best_quality",
    #"reduce_features": 0, # Set to >0 to use only the first n features
    "tail_rows": 0 # Set to >0 to use only the last n rows in the file
    
}

if is_interactive_session():
    print("Interactive session")
    config["autogluon_time"] = 100
    #config["reduce_features"] = 200
    config["autogluon_presets"] = "medium_quality"
    config["tail_rows"] = 2000
    print(config)
else:
    print("running as job")
    print(config)

Interactive session
{'autogluon_time': 100, 'autogluon_presets': 'medium_quality', 'tail_rows': 2000}


In [25]:
import pandas as pd
import autogluon as ag
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
import lightgbm as lgb
import warnings
warnings.filterwarnings('ignore')

# Simple SMILES / RDKit Demo
* So you can see how this works...

In [26]:
molecules = [
    ('CCO', 'Ethanol - simple alcohol'),
    ('CCCCCCCC', 'Octane - long chain'),
    ('c1ccccc1', 'Benzene - aromatic ring'),
    ('COO', 'CO2'),
    ("O", "Water")

]

for smiles, description in molecules:
    mol = Chem.MolFromSmiles(smiles)
    
    print(f"\n{description}")
    print(f"SMILES: {smiles}")
    print(f"  Molecular Weight: {Descriptors.MolWt(mol):.1f}")
    print(f"  LogP (oiliness): {Descriptors.MolLogP(mol):.2f}")
    print(f"  Rotatable Bonds: {Descriptors.NumRotatableBonds(mol)}")
    print(f"  Aromatic Rings: {Descriptors.NumAromaticRings(mol)}")
    print(f"  Complexity (BertzCT): {Descriptors.BertzCT(mol):.0f}")


Ethanol - simple alcohol
SMILES: CCO
  Molecular Weight: 46.1
  LogP (oiliness): -0.00
  Rotatable Bonds: 0
  Aromatic Rings: 0
  Complexity (BertzCT): 3

Octane - long chain
SMILES: CCCCCCCC
  Molecular Weight: 114.2
  LogP (oiliness): 3.37
  Rotatable Bonds: 5
  Aromatic Rings: 0
  Complexity (BertzCT): 25

Benzene - aromatic ring
SMILES: c1ccccc1
  Molecular Weight: 78.1
  LogP (oiliness): 1.69
  Rotatable Bonds: 0
  Aromatic Rings: 1
  Complexity (BertzCT): 72

CO2
SMILES: COO
  Molecular Weight: 48.0
  LogP (oiliness): 0.11
  Rotatable Bonds: 0
  Aromatic Rings: 0
  Complexity (BertzCT): 3

Water
SMILES: O
  Molecular Weight: 18.0
  LogP (oiliness): -0.82
  Rotatable Bonds: 0
  Aromatic Rings: 0
  Complexity (BertzCT): 0


# Load Data

In [27]:
# Load data
train_df = pd.read_csv('/kaggle/input/neurips-open-polymer-prediction-2025/train.csv')
test_df = pd.read_csv('/kaggle/input/neurips-open-polymer-prediction-2025/test.csv')

In [28]:
train_df.head()

Unnamed: 0,id,SMILES,Tg,FFV,Tc,Density,Rg
0,87817,*CC(*)c1ccccc1C(=O)OCCCCCC,,0.374645,0.205667,,
1,106919,*Nc1ccc([C@H](CCC)c2ccc(C3(c4ccc([C@@H](CCC)c5ccc(N*)cc5)cc4)CCC(CCCCC)CC3)cc2)cc1,,0.37041,,,
2,388772,*Oc1ccc(S(=O)(=O)c2ccc(Oc3ccc(C4(c5ccc(Oc6ccc(S(=O)(=O)c7ccc(Oc8ccc(C=C9CCCC(=Cc%10ccc(*)cc%10)C9=O)cc8)cc7)cc6)cc5)CCCCC4)cc3)cc2)cc1,,0.37886,,,
3,519416,*Nc1ccc(-c2c(-c3ccc(C)cc3)c(-c3ccc(C)cc3)c(N*)c(-c3ccc(C)cc3)c2-c2ccc(C)cc2)cc1,,0.387324,,,
4,539187,*Oc1ccc(OC(=O)c2cc(OCCCCCCCCCOCC3CCCN3c3ccc([N+](=O)[O-])cc3)c(C(*)=O)cc2OCCCCCCCCCOCC2CCCN2c2ccc([N+](=O)[O-])cc2)cc1,,0.35547,,,


# Define molecular descriptions to be generated by RDKit
* These are properties that RDKit can determine based on SMILES data
* Auto-discovers 400 descriptors defined by RDKit
* Only uses a subset of 192 AUTOCORR2D descriptors (as defined by max_autocorr) 
* We run this function on **train and test** data
* We develop models to take this information and predict our actual targets  ('Tg', 'FFV', 'Tc', 'Density', 'Rg')

In [29]:
from rdkit.Chem import Descriptors
from rdkit import Chem
import numpy as np

def get_molecular_descriptors(max_autocorr=10):
    """Get molecular descriptors - either hardcoded list or auto-discovered"""

    descriptor_list_all = []
    test_mol = Chem.MolFromSmiles('CCO')

    # Collect all valid descriptors first
    for name in dir(Descriptors):
        if not name.startswith('_'):
            try:
                func = getattr(Descriptors, name)
                if callable(func):
                    result = func(test_mol)
                    if isinstance(result, (int, float)) and not np.isnan(result):
                        descriptor_list_all.append((name, func))
            except:
                pass

    print(f"🔍 Total discovered descriptors before filtering: {len(descriptor_list_all)}")

    # Sort AUTOCORR2D descriptors by their numeric suffix
    autocorr_descriptors = [
        (name, func)
        for name, func in descriptor_list_all
        if name.startswith('AUTOCORR2D_')
    ]
    autocorr_descriptors.sort(key=lambda x: int(x[0].split('_')[-1]))

    # Select only the lowest-numbered ones
    limited_autocorr = autocorr_descriptors[:max_autocorr]

    # Include all other descriptors
    other_descriptors = [
        (name, func)
        for name, func in descriptor_list_all
        if not name.startswith('AUTOCORR2D_')
    ]

    # Final descriptor list
    descriptor_list = limited_autocorr + other_descriptors

    print(f"✅ Auto-discovered {len(descriptor_list)} descriptors (limited to {max_autocorr} AUTOCORR2D):")
    names = [name for name, _ in descriptor_list]
    print("  " + ", ".join(names))

    feature_names = [name for name, _ in descriptor_list]
    return descriptor_list, feature_names

molecular_descriptors =  get_molecular_descriptors(max_autocorr=10) 

🔍 Total discovered descriptors before filtering: 400
✅ Auto-discovered 218 descriptors (limited to 10 AUTOCORR2D):
  AUTOCORR2D_1, AUTOCORR2D_2, AUTOCORR2D_3, AUTOCORR2D_4, AUTOCORR2D_5, AUTOCORR2D_6, AUTOCORR2D_7, AUTOCORR2D_8, AUTOCORR2D_9, AUTOCORR2D_10, BCUT2D_CHGHI, BCUT2D_CHGLO, BCUT2D_LOGPHI, BCUT2D_LOGPLOW, BCUT2D_MRHI, BCUT2D_MRLOW, BCUT2D_MWHI, BCUT2D_MWLOW, BalabanJ, BertzCT, Chi0, Chi0n, Chi0v, Chi1, Chi1n, Chi1v, Chi2n, Chi2v, Chi3n, Chi3v, Chi4n, Chi4v, EState_VSA1, EState_VSA10, EState_VSA11, EState_VSA2, EState_VSA3, EState_VSA4, EState_VSA5, EState_VSA6, EState_VSA7, EState_VSA8, EState_VSA9, ExactMolWt, FpDensityMorgan1, FpDensityMorgan2, FpDensityMorgan3, FractionCSP3, HallKierAlpha, HeavyAtomCount, HeavyAtomMolWt, Ipc, Kappa1, Kappa2, Kappa3, LabuteASA, MaxAbsEStateIndex, MaxAbsPartialCharge, MaxEStateIndex, MaxPartialCharge, MinAbsEStateIndex, MinAbsPartialCharge, MinEStateIndex, MinPartialCharge, MolLogP, MolMR, MolWt, NHOHCount, NOCount, NumAliphaticCarbocycles, 

# Run RDKit on SMILES Train data
* Predicts molecular descriptors we previously defined
* This is time intensive - so we do it once here instead of in training loop
* This function is also used to process Test data later

In [30]:
def smiles_to_features(smiles_list, descriptor_functions):
   """Convert SMILES strings to raw feature matrix"""
   
   features = []
   total = len(smiles_list)
   
   print(f"Processing {total} SMILES...", end="", flush=True)
   
   for i, smiles in enumerate(smiles_list):
       # Progress indicator every 1000 molecules or at milestones
       if i > 0 and (i % 1000 == 0 or i == total - 1):
           print(f" {i+1}/{total}", end="", flush=True)
       
       mol_features = []
       try:
           mol = Chem.MolFromSmiles(smiles)
           if mol is None:
               # Invalid SMILES - fill with NaN
               mol_features = [np.nan] * len(descriptor_functions)
           else:
               # Calculate each descriptor
               for name, func in descriptor_functions:
                   try:
                       value = func(mol)
                       # Handle problematic values
                       if np.isinf(value) or abs(value) > 1e10:
                           value = np.nan
                       mol_features.append(value)
                   except:
                       # Descriptor calculation failed
                       mol_features.append(np.nan)
       except:
           # Complete failure - fill entire row with NaN
           mol_features = [np.nan] * len(descriptor_functions)
       
       features.append(mol_features)
   
   print(" ✅", flush=True)
   return np.array(features, dtype=float)

descriptor_functions, feature_names = molecular_descriptors
X_raw = smiles_to_features(train_df['SMILES'].values, descriptor_functions)    

Processing 7973 SMILES... 1001/7973 2001/7973 3001/7973 4001/7973 5001/7973 6001/7973 7001/7973 7973/7973 ✅


# Clean Train Data
* Just replaces any NaNs with Median for the column
* This function is also used to process Test data later

In [31]:
def clean_features(X):
   """Handle NaN/inf values and impute missing data"""
   X[np.isinf(X)] = np.nan
   
   # Count and report missing values
   missing = np.isnan(X).sum()
   print(f"🧹 Cleaned {missing:,} missing values ({missing/X.size*100:.1f}%)")
   
   # Median imputation
   for i in range(X.shape[1]):
       col = X[:, i]
       if np.isnan(col).any():
           X[np.isnan(col), i] = np.nanmedian(col) if not np.isnan(np.nanmedian(col)) else 0
   
   return X

train_features = pd.DataFrame(clean_features(X_raw))

🧹 Cleaned 96,915 missing values (5.6%)


In [32]:
train_features.columns = feature_names
train_features.head()

Unnamed: 0,AUTOCORR2D_1,AUTOCORR2D_2,AUTOCORR2D_3,AUTOCORR2D_4,AUTOCORR2D_5,AUTOCORR2D_6,AUTOCORR2D_7,AUTOCORR2D_8,AUTOCORR2D_9,AUTOCORR2D_10,...,fr_sulfonamd,fr_sulfone,fr_term_acetylene,fr_tetrazole,fr_thiazole,fr_thiocyan,fr_thiophene,fr_unbrch_alkane,fr_urea,qed
0,2.944,3.126,3.178,3.12,2.89,2.539,2.428,2.335,2.842,2.978,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.500278
1,3.919,4.229,4.352,4.361,4.374,4.258,4.174,4.248,3.902,4.205,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.125364
2,4.631,4.955,5.035,4.908,4.892,4.814,4.697,4.754,4.393,4.724,...,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.092387
3,3.878,4.244,4.357,4.435,4.67,4.793,4.785,4.631,3.861,4.22,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.20959
4,4.429,4.697,4.734,4.718,4.653,4.617,4.66,4.629,4.227,4.468,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,18.0,0.0,0.014164


 # Function to Train AUTOGLUON
 for a given target
 * Runs RDKit feature generation on SMILES data
 * Creates / trains a model for a specific target ('Tg', 'FFV', 'Tc', 'Density', 'Rg')
 * Uses 5x cross-validation to utilize all training data (5 models per feature)

In [33]:
train_targets = train_df[['Tg', 'FFV', 'Tc', 'Density', 'Rg']]  # Y targets


In [36]:
from autogluon.tabular import TabularPredictor

def train_target_property_autogluon(X, train_df, target_name, time_limit=300, presets="best_quality"):
    """
    Trains an AutoGluon model to predict a single target property.

    Returns:
        predictor: Trained TabularPredictor.
        scaler: None (for compatibility with legacy unpacking).
        feature_names: List of feature names used.
        best_model_score: MAE on internal validation.
        leaderboard_df: AutoGluon leaderboard DataFrame.
    """
    # Filter samples with target value
    mask = train_df[target_name].notna()
    X_target = X.loc[mask].copy()
    y_target = train_df.loc[mask, target_name].copy()

    print(f"📊 Training on {len(y_target)} samples with target = '{target_name}'")
    print(f"📈 Target range: {y_target.min():.4f} to {y_target.max():.4f}")

    # Prepare training data
    train_data = X_target.copy()
    train_data[target_name] = y_target
    feature_names = list(X_target.columns)

    # Train
    predictor = TabularPredictor(label=target_name, eval_metric='mae').fit(
        train_data,
        time_limit=time_limit,
        presets=presets
    )

    # Leaderboard
    leaderboard_df = predictor.leaderboard(silent=True)
    best_model_score = leaderboard_df.loc[0, 'score_val']
    print(f"✅ Best AutoGluon model MAE: {best_model_score:.4f}")

    return predictor, None, feature_names, best_model_score, leaderboard_df

# Define all target properties
targets = ['Tg', 'FFV', 'Tc', 'Density', 'Rg']

# Store trained models and scalers
trained_models = {}
trained_scalers = {}  # will remain None
cv_scores = []
leaderboards = {}

for target in targets:
    print(f"Training {target}...")
    model, scaler, features, cv_score, lb = train_target_property_autogluon(
        train_features, train_df, target,
        time_limit=config["autogluon_time"],
        presets=config["autogluon_presets"],
        hyperparameters={
        'GBM': {}, 'CAT': {}, 'RF': {}, 'XT': {}, # Removing XGboost due to incompatibility with scikit's version
        'NN_TORCH': {}, 'FASTAI': {}, 'VW': {}, 'KNN': {},
    }
    )
    trained_models[target] = model
    trained_scalers[target] = scaler  # remains None
    cv_scores.append(cv_score)
    leaderboards[target] = lb
    print()

No path specified. Models will be saved in: "AutogluonModels/ag-20250618_142605"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.3.1
Python Version:     3.11.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Sun Nov 10 10:07:59 UTC 2024
CPU Count:          4
Memory Avail:       29.35 GB / 31.35 GB (93.6%)
Disk Space Avail:   19.36 GB / 19.52 GB (99.2%)
Presets specified: ['medium_quality']
Beginning AutoGluon training ... Time limit = 100s
AutoGluon will save models to "/kaggle/working/AutogluonModels/ag-20250618_142605"
Train Data Rows:    511
Train Data Columns: 218
Label Column:       Tg
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (472.25, -148.0297376, 96.45231, 111.22828)
	If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You

Training Tg...
📊 Training on 511 samples with target = 'Tg'
📈 Target range: -148.0297 to 472.2500


	Useless Original Features (Count: 32): ['BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', 'NumRadicalElectrons', 'SMR_VSA8', 'SlogP_VSA9', 'fr_HOCCN', 'fr_SH', 'fr_aldehyde', 'fr_barbitur', 'fr_benzodiazepine', 'fr_dihydropyridine', 'fr_epoxide', 'fr_guanido', 'fr_isocyan', 'fr_isothiocyan', 'fr_lactam', 'fr_morpholine', 'fr_nitroso', 'fr_oxime', 'fr_phos_acid', 'fr_phos_ester', 'fr_priamide', 'fr_prisulfonamd', 'fr_term_acetylene', 'fr_tetrazole', 'fr_thiocyan']
		These features carry no predictive signal and should be manually investigated.
		This is typically a feature which has the same value for all rows.
		These features do not need to be present at inference time.
	Unused Original Features (Count: 9): ['MaxEStateIndex', 'fr_Al_OH_noTert', 'fr_COO2', 'fr_NH2', 'fr_Nhpyrrole', 'fr_benzene', 'fr_diazo', 'fr_phenol', 'fr_phenol_noOrthoHbond']
		These features were not used to generate any of the output f

[1000]	valid_set's l1: 55.4612
[2000]	valid_set's l1: 55.461


	-55.461	 = Validation score   (-mean_absolute_error)
	21.68s	 = Training   runtime
	0.03s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 99.04s of the 60.80s of remaining time.
	Ensemble Weights: {'XGBoost': 0.412, 'NeuralNetTorch': 0.412, 'NeuralNetFastAI': 0.176}
	-47.4217	 = Validation score   (-mean_absolute_error)
	0.08s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 39.31s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 1331.8 rows/s (103 batch size)
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/kaggle/working/AutogluonModels/ag-20250618_142605")
No path specified. Models will be saved in: "AutogluonModels/ag-20250618_142644"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.3.1
Python Version:     3.11.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Sun Nov 10 10:07:59 UTC 2024
CPU Count: 

✅ Best AutoGluon model MAE: -47.4217

Training FFV...
📊 Training on 7030 samples with target = 'FFV'
📈 Target range: 0.2270 to 0.7771


	Stage 5 Generators:
		Fitting DropDuplicatesFeatureGenerator...
	Useless Original Features (Count: 22): ['BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', 'NumRadicalElectrons', 'SMR_VSA8', 'SlogP_VSA9', 'fr_azide', 'fr_barbitur', 'fr_benzodiazepine', 'fr_diazo', 'fr_dihydropyridine', 'fr_epoxide', 'fr_isothiocyan', 'fr_lactam', 'fr_nitroso', 'fr_prisulfonamd', 'fr_thiocyan']
		These features carry no predictive signal and should be manually investigated.
		This is typically a feature which has the same value for all rows.
		These features do not need to be present at inference time.
	Unused Original Features (Count: 10): ['fr_COO2', 'fr_Nhpyrrole', 'fr_aldehyde', 'fr_benzene', 'fr_hdrzone', 'fr_morpholine', 'fr_nitro_arom', 'fr_phenol', 'fr_phenol_noOrthoHbond', 'fr_phos_ester']
		These features were not used to generate any of the output features. Add a feature generator compatible with these features to 

[1000]	valid_set's l1: 0.00547555
[2000]	valid_set's l1: 0.00522952
[3000]	valid_set's l1: 0.00513289
[4000]	valid_set's l1: 0.00510065
[5000]	valid_set's l1: 0.00508986
[6000]	valid_set's l1: 0.00508876
[7000]	valid_set's l1: 0.00508824
[8000]	valid_set's l1: 0.00508363
[9000]	valid_set's l1: 0.00508133
[10000]	valid_set's l1: 0.00508124


	-0.0051	 = Validation score   (-mean_absolute_error)
	36.82s	 = Training   runtime
	0.48s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 58.99s of the 58.99s of remaining time.


[1000]	valid_set's l1: 0.00579225
[2000]	valid_set's l1: 0.00567151
[3000]	valid_set's l1: 0.005655
[4000]	valid_set's l1: 0.00564348
[5000]	valid_set's l1: 0.00564302


	-0.0056	 = Validation score   (-mean_absolute_error)
	33.48s	 = Training   runtime
	0.22s	 = Validation runtime
Fitting model: RandomForestMSE ... Training model for up to 25.03s of the 25.02s of remaining time.
	-0.0066	 = Validation score   (-mean_absolute_error)
	56.86s	 = Training   runtime
	0.11s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 97.22s of the -32.20s of remaining time.
	Ensemble Weights: {'LightGBMXT': 0.909, 'RandomForestMSE': 0.091}
	-0.0051	 = Validation score   (-mean_absolute_error)
	0.03s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 132.28s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 1182.3 rows/s (703 batch size)
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/kaggle/working/AutogluonModels/ag-20250618_142644")
No path specified. Models will be saved in: "AutogluonModels/ag-20250618_142857"
Verbosity: 2 (Standard Logging

✅ Best AutoGluon model MAE: -0.0051

Training Tc...
📊 Training on 737 samples with target = 'Tc'
📈 Target range: 0.0465 to 0.5240


	Useless Original Features (Count: 38): ['BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', 'NumRadicalElectrons', 'PEOE_VSA5', 'SMR_VSA8', 'SlogP_VSA9', 'fr_ArN', 'fr_C_S', 'fr_HOCCN', 'fr_SH', 'fr_azide', 'fr_barbitur', 'fr_benzodiazepine', 'fr_diazo', 'fr_dihydropyridine', 'fr_epoxide', 'fr_guanido', 'fr_hdrzine', 'fr_hdrzone', 'fr_isocyan', 'fr_isothiocyan', 'fr_lactam', 'fr_nitroso', 'fr_oxazole', 'fr_oxime', 'fr_phos_acid', 'fr_phos_ester', 'fr_piperzine', 'fr_prisulfonamd', 'fr_term_acetylene', 'fr_tetrazole', 'fr_thiocyan']
		These features carry no predictive signal and should be manually investigated.
		This is typically a feature which has the same value for all rows.
		These features do not need to be present at inference time.
	Unused Original Features (Count: 8): ['MaxEStateIndex', 'fr_COO2', 'fr_Nhpyrrole', 'fr_benzene', 'fr_nitro_arom_nonortho', 'fr_phenol', 'fr_phenol_noOrthoHbond', 'fr_priam

✅ Best AutoGluon model MAE: -0.0264

Training Density...
📊 Training on 613 samples with target = 'Density'
📈 Target range: 0.7487 to 1.8410


	Useless Original Features (Count: 37): ['BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', 'NumRadicalElectrons', 'PEOE_VSA5', 'SMR_VSA8', 'SlogP_VSA9', 'fr_ArN', 'fr_C_S', 'fr_HOCCN', 'fr_SH', 'fr_azide', 'fr_barbitur', 'fr_benzodiazepine', 'fr_diazo', 'fr_dihydropyridine', 'fr_epoxide', 'fr_guanido', 'fr_hdrzone', 'fr_isocyan', 'fr_isothiocyan', 'fr_lactam', 'fr_nitroso', 'fr_oxazole', 'fr_oxime', 'fr_phos_acid', 'fr_phos_ester', 'fr_piperzine', 'fr_prisulfonamd', 'fr_term_acetylene', 'fr_tetrazole', 'fr_thiocyan']
		These features carry no predictive signal and should be manually investigated.
		This is typically a feature which has the same value for all rows.
		These features do not need to be present at inference time.
	Unused Original Features (Count: 8): ['MaxEStateIndex', 'fr_COO2', 'fr_Nhpyrrole', 'fr_benzene', 'fr_nitro_arom_nonortho', 'fr_phenol', 'fr_phenol_noOrthoHbond', 'fr_priamide']
		These 

[1000]	valid_set's l1: 0.0324844


	-0.0324	 = Validation score   (-mean_absolute_error)
	1.12s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 97.63s of the 97.63s of remaining time.
	-0.0388	 = Validation score   (-mean_absolute_error)
	0.94s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: RandomForestMSE ... Training model for up to 96.69s of the 96.68s of remaining time.
	-0.0402	 = Validation score   (-mean_absolute_error)
	2.75s	 = Training   runtime
	0.09s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 93.82s of the 93.81s of remaining time.
	-0.0356	 = Validation score   (-mean_absolute_error)
	48.59s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: ExtraTreesMSE ... Training model for up to 45.20s of the 45.20s of remaining time.
	-0.04	 = Validation score   (-mean_absolute_error)
	1.14s	 = Training   runtime
	0.09s	 = Validation runtime
Fitting model: NeuralNetFastAI ... Training model for up t

✅ Best AutoGluon model MAE: -0.0271

Training Rg...
📊 Training on 614 samples with target = 'Rg'
📈 Target range: 9.7284 to 34.6729


	Useless Original Features (Count: 38): ['BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', 'NumRadicalElectrons', 'PEOE_VSA5', 'SMR_VSA8', 'SlogP_VSA9', 'fr_ArN', 'fr_C_S', 'fr_HOCCN', 'fr_SH', 'fr_azide', 'fr_barbitur', 'fr_benzodiazepine', 'fr_diazo', 'fr_dihydropyridine', 'fr_epoxide', 'fr_guanido', 'fr_hdrzine', 'fr_hdrzone', 'fr_isocyan', 'fr_isothiocyan', 'fr_lactam', 'fr_nitroso', 'fr_oxazole', 'fr_oxime', 'fr_phos_acid', 'fr_phos_ester', 'fr_piperzine', 'fr_prisulfonamd', 'fr_term_acetylene', 'fr_tetrazole', 'fr_thiocyan']
		These features carry no predictive signal and should be manually investigated.
		This is typically a feature which has the same value for all rows.
		These features do not need to be present at inference time.
	Unused Original Features (Count: 8): ['MaxEStateIndex', 'fr_COO2', 'fr_Nhpyrrole', 'fr_benzene', 'fr_nitro_arom_nonortho', 'fr_phenol', 'fr_phenol_noOrthoHbond', 'fr_priam

✅ Best AutoGluon model MAE: -1.7163



In [None]:
import matplotlib.pyplot as plt

for target, lb in leaderboards.items():
    plt.figure()
    plt.title(f"Leaderboard: {target}")
    plt.bar(lb['model'], lb['score_val'])
    plt.xticks(rotation=90)
    plt.ylabel("MAE")
    plt.tight_layout()
    plt.show()


# Perform Training
* Loops through all 5 polymer target properties (Tg, FFV, Tc, Density, Rg)
* Trains LGBM models for each target

In [None]:
# # Define all target properties
# targets = ['Tg', 'FFV', 'Tc', 'Density', 'Rg']

# # Store trained models and scalers
# trained_models = {}
# trained_scalers = {}
# cv_scores = []

# # Train each target - collect results for summary
# for target in targets:
#     print(f"Training {target}...")
#     models, scaler, features, cv_score = train_target_property_autogluon(train_features, train_df, target)
#     trained_models[target] = models
#     trained_scalers[target] = scaler
#     cv_scores.append(cv_score)
#     print()

# # Clean summary with average
# print("=" * 40)
# print(f"Trained: {len(targets)} targets × 5 CV folds = {len(targets) * 5} models")
# print(f"Average CV MAE across all targets: {np.mean(cv_scores):.4f}")

# Function to Predict using trained LGBM models for a given target
* Runs same RDKit feature generation on test SMILES data
* Uses the 5 trained models to predict a specific target
* Averages predictions from all 5 models for final result

In [None]:
def predict_target_property_autogluon(test_df, target_name, predictor):
    print(f"PREDICTING: {target_name}")
    
    if predictor is None:
        print(f"❌ No trained predictor available for {target_name}, returning zeros")
        return np.zeros(len(test_df))
    
    # Make sure test_df is processed to match training features
    descriptor_functions, _ = molecular_descriptors
    X_raw = smiles_to_features(test_df['SMILES'].values, descriptor_functions)
    X = clean_features(X_raw)
    
    # AutoGluon works directly with DataFrames
    predictions = predictor.predict(X).values
    print(f"📊 Predictions range: {predictions.min():.4f} to {predictions.max():.4f}")
    
    return predictions


# Predict All Targets / Submit
* Predicts on test data
* Creates final submission CSV with all predictions

In [None]:
print(f"\nMAKING PREDICTIONS...")
all_predictions = {}
for target in targets:
    predictions = predict_target_property_autogluon(
        test_df, target, trained_models[target]
    )
    all_predictions[target] = predictions


# Create submission
submission = pd.DataFrame({'id': test_df['id']})
for target in targets:
    submission[target] = all_predictions[target]

submission.to_csv('submission.csv', index=False)

print(f"Predicted: {len(test_df)} test samples")
print(f"Saved: submission.csv")

print(f"\n👀 SUBMISSION PREVIEW:")
print(submission.head().to_string(index=False, float_format='%.4f'))