# POI Success Prediction - Enhanced Solution

**Goal:** Predict POI success rating (1-5) using location, demographics, and reviews.

**Strategy:**
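- Feature engineering: coordinates, geo clusters, density ratios, review statistics, TF-IDF + SVD, sentence embeddings (the `xlmr_*` columns, encoded with `sergeyzh/BERTA`), PCA
- Target encoding fitted inside each CV fold, so validation and test rows never see their own targets
- Ensemble: CatBoost + LightGBM with tuned hyperparameters, blended by a weight chosen on out-of-fold predictions
- 5-fold stratified CV (target binned into deciles), scored with MAE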

In [1]:
import os
import pandas as pd
import numpy as np
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_absolute_error
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, PCA
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm
import re
from typing import Tuple
import warnings
warnings.filterwarnings('ignore')

tqdm.pandas()
pd.set_option('display.max_columns', 50)




## 1. Load Data

In [2]:
def read_tsv(path: str) -> pd.DataFrame:
    try:
        return pd.read_csv(path, sep='\t', encoding='utf-8')
    except UnicodeDecodeError:
        return pd.read_csv(path, sep='\t', encoding='cp1251')

train = read_tsv('vseross/C/train.tsv')
test = read_tsv('vseross/C/test.tsv')
reviews = read_tsv('vseross/C/reviews.tsv')

# Optional debug sampling
debug_rows = os.environ.get('DEBUG_ROWS')
if debug_rows is not None:
    try:
        n = int(debug_rows)
        train = train.head(n).copy()
        test = test.head(max(1, n // 2)).copy()
        reviews = reviews[reviews['id'].isin(pd.concat([train['id'], test['id']]).unique())].copy()
        print(f'DEBUG mode: train {len(train)}, test {len(test)}, reviews {len(reviews)}')
    except Exception as e:
        print('DEBUG_ROWS set but failed to sample:', e)

print(f'Train: {train.shape}, Test: {test.shape}, Reviews: {reviews.shape}')
print(f'Target distribution:\n{train["target"].value_counts().sort_index()}')

Train: (41105, 286), Test: (9276, 285), Reviews: (440082, 2)
Target distribution:
target
0.0    3938
0.1      16
0.2       3
0.4       1
0.5       1
0.8      14
0.9      13
1.0      14
1.1       7
1.2       6
1.3      21
1.4      16
1.5      16
1.6      17
1.7      25
1.8      47
1.9      43
2.0      43
2.1      61
2.2      85
2.3     111
2.4      84
2.5     101
2.6     111
2.7     106
2.8     271
2.9     207
3.0     246
3.1     346
3.2     616
3.3    1801
3.4    2321
3.5    2977
3.6    4426
3.7    4288
3.8    3370
3.9    3012
4.0    2887
4.1    2668
4.2    2735
4.3    1763
4.4     880
4.5     418
4.6     275
4.7     196
4.8     225
4.9     133
5.0     144
Name: count, dtype: int64


## 2. Feature Engineering

In [3]:
# Parse coordinates
for df in [train, test]:
    coords = df['coordinates'].str.strip('[]').str.split(',', expand=True)
    df['longitude'] = pd.to_numeric(coords[0].str.strip(), errors='coerce')
    df['latitude'] = pd.to_numeric(coords[1].str.strip(), errors='coerce')
    df['category'] = df['category'].fillna('unknown').astype(str)

    df["lat_rad"] = np.radians(df["latitude"])
    df["lon_rad"] = np.radians(df["longitude"])
    df["sin_lat"] = np.sin(df["lat_rad"])
    df["cos_lat"] = np.cos(df["lat_rad"])
    df["sin_lon"] = np.sin(df["lon_rad"])
    df["cos_lon"] = np.cos(df["lon_rad"])
    df["lat_lon_ratio"] = df["latitude"] / (df["longitude"] + 1e-6)
    df["coord_density"] = df.groupby(["latitude", "longitude"])["id"].transform("count")

print('Coordinates parsed')

Coordinates parsed
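
To make the parsing step concrete, here is a minimal plain-Python sketch of what happens to a single `coordinates` string. It assumes the `[longitude, latitude]` ordering that the cell above implies. `parse_coordinates` and `cyclic_features` are hypothetical helper names, and the sample point is illustrative rather than taken from the dataset.

```python
import math
from typing import Optional, Tuple

def parse_coordinates(raw: str) -> Optional[Tuple[float, float]]:
    """Parse a '[lon, lat]' string; return None when it cannot be parsed."""
    parts = raw.strip().strip('[]').split(',')
    if len(parts) != 2:
        return None
    try:
        # float() tolerates surrounding whitespace, so ' 55.7558' is fine.
        return float(parts[0]), float(parts[1])
    except ValueError:
        return None

def cyclic_features(lon: float, lat: float) -> dict:
    """Sine/cosine encodings of latitude and longitude given in degrees."""
    lat_rad, lon_rad = math.radians(lat), math.radians(lon)
    return {
        'sin_lat': math.sin(lat_rad), 'cos_lat': math.cos(lat_rad),
        'sin_lon': math.sin(lon_rad), 'cos_lon': math.cos(lon_rad),
    }

lon_lat = parse_coordinates('[37.6173, 55.7558]')  # illustrative point
feats = cyclic_features(*lon_lat)
bad = parse_coordinates('not a coordinate')        # unparseable input gives None
```

The notebook's vectorised version coerces unparseable values to NaN instead of `None`, which plays the same role.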


In [4]:
# Name/address features and chain indicator
both = pd.concat([train[['id', 'name']], test[['id', 'name']]])
name_counts = both['name'].fillna('').astype(str).value_counts()
train['name_count'] = train['name'].fillna('').astype(str).map(name_counts).fillna(1).astype(int)
test['name_count'] = test['name'].fillna('').astype(str).map(name_counts).fillna(1).astype(int)

for df in [train, test]:
    df['is_chain'] = (df['name_count'] > 1).astype(int)
    df['name_len'] = df['name'].fillna('').astype(str).str.len()
    df['name_words'] = df['name'].fillna('').astype(str).str.split().map(len)
    df['has_digits_name'] = df['name'].fillna('').astype(str).str.contains(r'\d').astype(int)
    df['address_len'] = df['address'].fillna('').astype(str).str.len()
    df['has_digits_address'] = df['address'].fillna('').astype(str).str.contains(r'\d').astype(int)
    # NEW: Extract brand/network from name (first word often brand);
    # empty names yield '' rather than NaN so the column stays a valid categorical
    df['name_first_word'] = df['name'].fillna('').astype(str).str.split().str[0].str.lower().fillna('')

print('Name/address features created')

Name/address features created
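
The chain indicator rests on counting how often a name occurs across train and test together. A small sketch of that idea in plain Python, using made-up names (the lists and the `name_features` helper are hypothetical, not part of the notebook):

```python
from collections import Counter

# Hypothetical toy names standing in for train['name'] and test['name'].
train_names = ['Coffee House', 'Bakery No. 1', 'Coffee House']
test_names = ['Coffee House', 'Flower Shop']

# Count each name over train and test combined, as the cell above does.
name_counts = Counter(train_names + test_names)

def name_features(name: str) -> dict:
    words = name.split()
    count = name_counts.get(name, 1)
    return {
        'name_count': count,
        'is_chain': int(count > 1),
        'name_words': len(words),
        'has_digits_name': int(any(ch.isdigit() for ch in name)),
        # An empty name has no first word; fall back to '' here.
        'name_first_word': words[0].lower() if words else '',
    }

coffee = name_features('Coffee House')
bakery = name_features('Bakery No. 1')
```

Because counts include test rows, a name that appears once in train and once in test is still flagged as a chain.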


### Review Features

In [5]:
# Aggregate reviews: count, mean length, and sentiment
review_stats = reviews.groupby('id').agg(
    review_count=('text', 'count'),
    review_len_mean=('text', lambda x: np.mean([len(str(t)) for t in x])),
    review_len_std=('text', lambda x: np.std([len(str(t)) for t in x]))
).reset_index()

# Simple sentiment: count positive/negative words
POS_WORDS = ['отлично', 'хорошо', 'рекомендую', 'понравилось', 'лучший', 'доброжелательный', 
             'вкусно', 'чисто', 'удобно', 'дешево', 'прекрасно', 'замечательно', 'отличный']
NEG_WORDS = ['плохо', 'ужасно', 'не понравилось', 'дорого', 'грязно', 'хамство', 'разочарован', 
             'отвратительно', 'проблема', 'жалоба', 'ужасный', 'никогда', 'обман']

def sentiment_score(text):
    text = str(text).lower()
    pos = sum(word in text for word in POS_WORDS)
    neg = sum(word in text for word in NEG_WORDS)
    return pos - neg

ids_grouped = reviews.groupby('id')['text']
sentiment = ids_grouped.progress_apply(lambda x: np.mean([sentiment_score(t) for t in x]))
sentiment = sentiment.reset_index().rename(columns={'text': 'sentiment_score'})
review_stats = review_stats.merge(sentiment, on='id', how='left')

train = train.merge(review_stats, on='id', how='left')
test = test.merge(review_stats, on='id', how='left')
for col in ['review_count', 'review_len_mean', 'review_len_std', 'sentiment_score']:
    train[col] = train[col].fillna(0)
    test[col] = test[col].fillna(0)

print('Review stats created')

Review stats created
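
The sentiment score counts substring hits, which has one consequence worth seeing on a toy input: a negated phrase such as "не понравилось" also contains the positive word "понравилось", so the two hits cancel. The sketch below reuses the scoring logic with a shortened word list; the sample sentences are invented for illustration.

```python
POS_WORDS = ['отлично', 'понравилось', 'вкусно']   # subset of the notebook's list
NEG_WORDS = ['плохо', 'не понравилось', 'дорого']  # subset of the notebook's list

def sentiment_score(text) -> int:
    """Number of positive substrings found minus number of negative ones."""
    text = str(text).lower()
    pos = sum(word in text for word in POS_WORDS)
    neg = sum(word in text for word in NEG_WORDS)
    return pos - neg

positive = sentiment_score('Очень вкусно, всё понравилось')  # two positive hits
negated = sentiment_score('Совсем не понравилось')            # one positive, one negative hit
mixed = sentiment_score('Вкусно, но дорого')                  # one positive, one negative hit
```

Each word is counted at most once per review, so the score reflects which listed words appear, not how often.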


### TF-IDF Text Features

In [6]:
all_reviews = (
    ids_grouped.progress_apply(lambda x: ' '.join(x))
    .reset_index()
)

# Unigram + bigram TF-IDF capped at 400 terms, reduced to 32 dims with truncated SVD
vectorizer = TfidfVectorizer(max_features=400, ngram_range=(1, 2), lowercase=True)
X_tfidf = vectorizer.fit_transform(all_reviews['text'])
svd = TruncatedSVD(n_components=32, random_state=42)
X_svd = svd.fit_transform(X_tfidf)
svd_df = pd.DataFrame(X_svd, columns=[f'tfidf_{i}' for i in range(X_svd.shape[1])])
svd_df['id'] = all_reviews['id']
train = train.merge(svd_df, on='id', how='left')
test = test.merge(svd_df, on='id', how='left')
for col in [c for c in train.columns if c.startswith('tfidf_')]:
    train[col] = train[col].fillna(0)
    test[col] = test[col].fillna(0)

print(f'TF-IDF features: {X_svd.shape[1]} dims')

TF-IDF features: 32 dims


### XLM-RoBERTa Embeddings

In [9]:
USE_XLMR = os.environ.get('USE_XLMR', '1') != '0'
if USE_XLMR and len(reviews) > 0:
    MAX_REVIEWS_PER_ID = int(os.environ.get('MAX_REVIEWS_PER_ID', '50'))
    MAX_CHARS = int(os.environ.get('MAX_REVIEW_CHARS', '2000'))
    # Hard-coded to sergeyzh/BERTA; the original env-configurable XLM-R default was:
    # os.environ.get('XLMR_MODEL', 'paraphrase-xlm-r-multilingual-v1')
    model_name = 'sergeyzh/BERTA'
    batch_size = int(os.environ.get('XLMR_BATCH', '32'))
    n_comp = int(os.environ.get('XLMR_SVD', '64'))

    agg_reviews = (
        reviews.groupby('id')['text']
        .apply(lambda x: ' '.join(list(x)[:MAX_REVIEWS_PER_ID])[:MAX_CHARS])
        .reset_index()
        .rename(columns={'text': 'agg_text'})
    )

    print(f'Encoding XLM-R embeddings using {model_name} ...')
    st_model = SentenceTransformer(model_name)
    emb = st_model.encode(
        agg_reviews['agg_text'].tolist(),
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
        normalize_embeddings=False
    )

    if n_comp > 0 and emb.shape[1] > n_comp:
        print(f'Reducing XLM-R embeddings from {emb.shape[1]} to {n_comp} via SVD')
        svd_x = TruncatedSVD(n_components=n_comp, random_state=42)
        emb = svd_x.fit_transform(emb)

    # Create column names based on embedding dimensions
    xlmr_cols = [f'xlmr_{i}' for i in range(emb.shape[1])]
    xlmr_df = pd.DataFrame(emb, columns=xlmr_cols)
    xlmr_df['id'] = agg_reviews['id'].values
    train = train.merge(xlmr_df, on='id', how='left')
    test = test.merge(xlmr_df, on='id', how='left')
    for c in xlmr_cols:
        train[c] = train[c].fillna(0)
        test[c] = test[c].fillna(0)
    print(f'XLM-R embeddings: {len(xlmr_cols)} dims')
else:
    print('XLM-R embeddings skipped')

Encoding XLM-R embeddings using sergeyzh/BERTA ...


Default prompt name is set to 'Classification'. This prompt will be applied to all `encode()` calls, except if `encode()` is called with `prompt` or `prompt_name` parameters.


Reducing XLM-R embeddings from 768 to 64 via SVD
XLM-R embeddings: 64 dims
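
Before encoding, each POI's reviews are joined and truncated twice: first to a maximum number of reviews, then to a maximum number of characters. A minimal sketch of that step with deliberately tiny limits so the effect is visible (`aggregate_reviews` is a hypothetical helper name and the texts are invented):

```python
MAX_REVIEWS_PER_ID = 2   # the notebook's default is 50; kept small here on purpose
MAX_CHARS = 20           # the notebook's default is 2000

def aggregate_reviews(texts, max_reviews=MAX_REVIEWS_PER_ID, max_chars=MAX_CHARS) -> str:
    """Join the first `max_reviews` texts with spaces, then cut to `max_chars` characters."""
    return ' '.join(list(texts)[:max_reviews])[:max_chars]

# The third review is dropped by the review cap; the character cap then trims mid-word.
agg = aggregate_reviews(['great coffee', 'slow service', 'never shown'])
```

The character cut is applied to the joined string, so it can end in the middle of a word or review.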


### Geo Features

In [10]:
# Geo clusters from coordinates
coords_all = pd.concat([
    train[['longitude', 'latitude']],
    test[['longitude', 'latitude']]
], axis=0).reset_index(drop=True)

coords_fit = coords_all.dropna()
if len(coords_fit) > 0:
    # Dynamic cluster count based on dataset size
    try:
        n_clusters = int(np.clip(np.sqrt(len(coords_fit) / 2), 20, 80))
    except Exception:
        n_clusters = 50
    n_clusters = max(2, min(n_clusters, len(coords_fit)))
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    kmeans.fit(coords_fit[['longitude', 'latitude']])
    
    def predict_clusters(df_xy: pd.DataFrame) -> np.ndarray:
        xy = df_xy[['longitude', 'latitude']].copy()
        xy['longitude'] = xy['longitude'].fillna(coords_fit['longitude'].mean())
        xy['latitude'] = xy['latitude'].fillna(coords_fit['latitude'].mean())
        return kmeans.predict(xy)
    
    train['kmeans_cluster'] = predict_clusters(train)
    test['kmeans_cluster'] = predict_clusters(test)
else:
    train['kmeans_cluster'] = -1
    test['kmeans_cluster'] = -1

print(f'KMeans clusters: {n_clusters if len(coords_fit) > 0 else 0}')

KMeans clusters: 80
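
The reported cluster count follows from the clipping rule. A short worked sketch of the arithmetic, assuming that no train or test row has missing coordinates (otherwise the point count would be slightly lower); `dynamic_n_clusters` is a hypothetical name for the inline logic above:

```python
import math

def dynamic_n_clusters(n_points: int, low: int = 20, high: int = 80) -> int:
    """sqrt(n / 2) clipped to [low, high], truncated, then capped at the number of points."""
    n = int(min(max(math.sqrt(n_points / 2), low), high))
    return max(2, min(n, n_points))

n_all = 41105 + 9276                  # train rows + test rows from the load cell
k_full = dynamic_n_clusters(n_all)    # sqrt(25190.5) is about 158.7, clipped down to 80
k_small = dynamic_n_clusters(1000)    # sqrt(500) is about 22.4, truncated to 22
k_tiny = dynamic_n_clusters(10)       # clipped up to 20, then capped at the 10 points
```

At this dataset size the upper bound is always the binding constraint, so the cluster count is effectively fixed at 80.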


In [11]:
# Distances: to global median center and to cluster center
if len(coords_fit) > 0:
    center_lon = coords_fit['longitude'].median()
    center_lat = coords_fit['latitude'].median()
    
    def haversine(lon1, lat1, lon2, lat2):
        R = 6371.0
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
        c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
        return R * c
    
    for df in (train, test):
        df['dist_to_center_km'] = haversine(df['longitude'], df['latitude'], center_lon, center_lat)
    
    if 'kmeans_cluster' in train.columns:
        centers = kmeans.cluster_centers_ if 'kmeans' in locals() else None
        if centers is not None:
            def dist_to_cluster(df):
                idx = df['kmeans_cluster'].fillna(-1).astype(int).values
                idx = np.where((idx >= 0) & (idx < len(centers)), idx, 0)
                c_lon = centers[idx, 0]
                c_lat = centers[idx, 1]
                return haversine(df['longitude'].values, df['latitude'].values, c_lon, c_lat)
            train['dist_to_kmeans_km'] = dist_to_cluster(train)
            test['dist_to_kmeans_km'] = dist_to_cluster(test)

print('Distance features created')

Distance features created
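
As a sanity check on the distance formula, the sketch below restates the haversine computation for scalar inputs with the standard library and evaluates it on points whose great-circle distance is known in closed form. The function name `haversine_km` and the test points are illustrative choices, not part of the notebook.

```python
import math

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance in km between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * radius_km * math.atan2(math.sqrt(a), math.sqrt(1 - a))

same_point = haversine_km(37.6173, 55.7558, 37.6173, 55.7558)  # identical points
one_degree_lat = haversine_km(0.0, 0.0, 0.0, 1.0)              # R * pi / 180, about 111.19 km
quarter_equator = haversine_km(0.0, 0.0, 90.0, 0.0)            # R * pi / 2
```

Note the argument order (longitude first), which matches the notebook's `haversine` and the column order used for the KMeans centers.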


### Density Ratios & Log Transform

In [12]:
# Density ratios between 300m and 1000m rings
def add_density_ratios(tr: pd.DataFrame, te: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    cols_300 = [c for c in tr.columns if c.endswith('_300m')]
    for c300 in tqdm(cols_300, desc='Density ratios'):
        base = c300[:-5]
        c1000 = base + '_1000m'
        if c1000 in tr.columns:
            for df in (tr, te):
                num = pd.to_numeric(df[c300], errors='coerce').fillna(0.0)
                den = pd.to_numeric(df[c1000], errors='coerce').fillna(0.0)
                df[f'{base}_ratio_300_1000'] = (num / (den + 1e-6)).astype(float)
                df[f'{base}_perimeter_1000_only'] = (den - num).clip(lower=0)
    return tr, te

train, test = add_density_ratios(train, test)

# Log-transform skewed features
skewed = [c for c in train.columns if re.search(r'_(300m|1000m)$', c)]
for col in tqdm(skewed, desc='Log transform'):
    train[col] = np.log1p(train[col].clip(lower=0))
    test[col] = np.log1p(test[col].clip(lower=0))

print('Density ratios and log transform applied')

Density ratios and log transform applied
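
For a single pair of ring counts, the derived features work out as follows. This is a plain-Python sketch with invented counts; `ring_features` is a hypothetical helper that mirrors the order used above, where ratios are computed on raw counts and only the original `_300m` / `_1000m` columns are log-transformed afterwards.

```python
import math

def ring_features(count_300m: float, count_1000m: float) -> dict:
    ratio = count_300m / (count_1000m + 1e-6)          # share of the 1000 m count inside 300 m
    outer_only = max(count_1000m - count_300m, 0.0)    # objects between 300 m and 1000 m
    return {
        'ratio_300_1000': ratio,
        'perimeter_1000_only': outer_only,
        'log_300m': math.log1p(max(count_300m, 0.0)),
        'log_1000m': math.log1p(max(count_1000m, 0.0)),
    }

dense_core = ring_features(8, 10)   # most nearby objects sit within 300 m
empty_area = ring_features(0, 0)    # the epsilon keeps the ratio finite at 0
```

The `1e-6` in the denominator only matters when the 1000 m count is zero; otherwise its effect on the ratio is negligible.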


### PCA on Dense Demographic Features (NEW)

In [13]:
# Apply PCA to capture latent demographic patterns
USE_PCA = True
if USE_PCA:
    demo_cols = [c for c in train.columns if re.search(r'_(300m|1000m)$', c)]
    # Take a subset to speed up
    demo_subset = demo_cols[:100] if len(demo_cols) > 100 else demo_cols
    
    pca = PCA(n_components=20, random_state=42)
    X_pca_train = pca.fit_transform(train[demo_subset].fillna(0))
    X_pca_test = pca.transform(test[demo_subset].fillna(0))
    
    pca_cols = [f'pca_demo_{i}' for i in range(X_pca_train.shape[1])]
    train[pca_cols] = X_pca_train
    test[pca_cols] = X_pca_test
    print(f'PCA features: {len(pca_cols)} components, explained variance: {pca.explained_variance_ratio_.sum():.3f}')
else:
    print('PCA skipped')

PCA features: 20 components, explained variance: 0.993


## 3. Prepare Features & Target Encoding Helper

In [14]:
# Select features: keep 'category' as categorical
exclude = ['id', 'name', 'address', 'coordinates', 'target']
features = [c for c in train.columns if c not in exclude]
cat_features = []
if 'category' in features:
    cat_features.append('category')
if 'kmeans_cluster' in features:
    cat_features.append('kmeans_cluster')
if 'name_first_word' in features:
    cat_features.append('name_first_word')

print(f'Total features: {len(features)}')
print(f'Categorical features: {cat_features}')

Total features: 702
Categorical features: ['category', 'kmeans_cluster', 'name_first_word']


In [15]:
# Target encoding helper inside CV for selected categoricals (smoothed)
def target_encode(train_df: pd.DataFrame, y_tr: pd.Series, val_df: pd.DataFrame, 
                  test_df: pd.DataFrame, cols, alpha=20.0):
    prior = y_tr.mean()
    tr_out = train_df.copy()
    val_out = val_df.copy()
    te_out = test_df.copy()
    for col in cols:
        stats = y_tr.groupby(train_df[col]).agg(['mean', 'count'])
        m = stats['mean']
        c = stats['count']
        smooth = (m * c + prior * alpha) / (c + alpha)
        tr_out[f'te_{col}'] = train_df[col].map(smooth)
        val_out[f'te_{col}'] = val_df[col].map(smooth).fillna(prior)
        te_out[f'te_{col}'] = test_df[col].map(smooth).fillna(prior)
    return tr_out, val_out, te_out
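
The smoothing formula pulls small categories toward the global mean. A worked sketch in plain Python with invented rows shows the effect; `smoothed_target_encoding` is a hypothetical stand-alone version of the per-column logic in `target_encode`, and `alpha=1.0` is chosen only to keep the arithmetic easy to follow.

```python
from collections import defaultdict

def smoothed_target_encoding(categories, targets, alpha=20.0):
    """Return ({category: smoothed mean}, prior) fitted on the given training rows."""
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # (mean * count + prior * alpha) / (count + alpha); mean * count is just the sum.
    mapping = {cat: (sums[cat] + prior * alpha) / (counts[cat] + alpha) for cat in counts}
    return mapping, prior

# Hypothetical training-fold rows.
cats = ['cafe', 'cafe', 'cafe', 'bank']
ys = [4.0, 4.0, 4.0, 2.0]
mapping, prior = smoothed_target_encoding(cats, ys, alpha=1.0)

# Categories unseen in the training fold fall back to the prior.
encoded_val = [mapping.get(c, prior) for c in ['cafe', 'bank', 'museum']]
```

Here the prior is 3.5, so "cafe" (three rows, mean 4.0) encodes to 3.875 while "bank" (one row, mean 2.0) is pulled further, to 2.75. With the notebook's `alpha=30.0` the pull toward the prior is much stronger for rare categories.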

## 4. Train Ensemble Model (CatBoost + LightGBM)

In [16]:
# Train with 5-fold Stratified CV (MAE)
# Rows with target == 0 (3938 in the full train set) are excluded from training and validation
mask = train['target'] > 0
X_all = train.loc[mask, features].copy()
y_all = train.loc[mask, 'target'].copy()
X_test_base = test[features].copy()

# Stratification bins for regression stability
bins = pd.qcut(y_all, q=10, labels=False, duplicates='drop')
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

oof_cat = np.zeros(len(X_all))
oof_lgb = np.zeros(len(X_all))
test_preds_cat = np.zeros(len(test))
test_preds_lgb = np.zeros(len(test))
fold_maes_cat = []
fold_maes_lgb = []

splits = list(skf.split(X_all, bins))
print(f'Starting {len(splits)}-fold CV with CatBoost + LightGBM ensemble...')

Starting 5-fold CV with CatBoost + LightGBM ensemble...


In [17]:
for fold, (tr_idx, va_idx) in enumerate(tqdm(splits, total=len(splits), desc='CV folds'), 1):
    print(f'\n{"="*60}')
    print(f'FOLD {fold}/{len(splits)}')
    print(f'{"="*60}')
    
    X_tr, X_val = X_all.iloc[tr_idx].copy(), X_all.iloc[va_idx].copy()
    y_tr, y_val = y_all.iloc[tr_idx], y_all.iloc[va_idx]

    # Target encoding on selected categorical cols
    te_cols = []
    if 'category' in X_tr.columns:
        te_cols.append('category')
    if 'kmeans_cluster' in X_tr.columns:
        te_cols.append('kmeans_cluster')
    if 'is_chain' in X_tr.columns:
        te_cols.append('is_chain')
    if 'name_first_word' in X_tr.columns:
        te_cols.append('name_first_word')
    
    X_tr_te, X_val_te, X_test_te = target_encode(X_tr, y_tr, X_val, X_test_base, te_cols, alpha=30.0)

    # --- CatBoost ---
    print(f'\n[Fold {fold}] Training CatBoost...')
    model_cat = CatBoostRegressor(
        iterations=3000,
        learning_rate=0.05,
        depth=7,
        loss_function='MAE',
        eval_metric='MAE',
        random_state=42 + fold,
        l2_leaf_reg=3.0,
        subsample=0.8,
        bootstrap_type='Bernoulli',
        verbose=False,
        task_type='CPU',
        thread_count=-1
    )
    model_cat.fit(
        X_tr_te, y_tr,
        eval_set=(X_val_te, y_val),
        use_best_model=True,
        early_stopping_rounds=300,
        cat_features=cat_features,
        verbose=False
    )
    val_pred_cat = model_cat.predict(X_val_te)
    oof_cat[va_idx] = val_pred_cat
    fold_mae_cat = mean_absolute_error(y_val, val_pred_cat)
    fold_maes_cat.append(fold_mae_cat)
    test_preds_cat += model_cat.predict(X_test_te) / skf.n_splits
    print(f'[Fold {fold}] CatBoost MAE: {fold_mae_cat:.5f} (iterations: {model_cat.best_iteration_})')

    # --- LightGBM ---
    # Convert object columns to numeric for LightGBM
    X_tr_te_lgb = X_tr_te.copy()
    X_val_te_lgb = X_val_te.copy()
    X_test_te_lgb = X_test_te.copy()
    
    for col in cat_features:
        if col in X_tr_te_lgb.columns and X_tr_te_lgb[col].dtype == 'object':
            # Convert to category codes
            all_vals = pd.concat([X_tr_te_lgb[col], X_val_te_lgb[col], X_test_te_lgb[col]]).astype(str)
            categories = all_vals.unique()
            cat_map = {cat: i for i, cat in enumerate(categories)}
            
            X_tr_te_lgb[col] = X_tr_te_lgb[col].astype(str).map(cat_map).fillna(-1).astype(int)
            X_val_te_lgb[col] = X_val_te_lgb[col].astype(str).map(cat_map).fillna(-1).astype(int)
            X_test_te_lgb[col] = X_test_te_lgb[col].astype(str).map(cat_map).fillna(-1).astype(int)
    
    print(f'[Fold {fold}] Training LightGBM...')
    model_lgb = LGBMRegressor(
        n_estimators=3000,
        learning_rate=0.05,
        num_leaves=64,
        max_depth=7,
        subsample=0.8,
        colsample_bytree=0.8,
        reg_alpha=0.3,
        reg_lambda=1.5,
        random_state=42 + fold,
        n_jobs=-1,
        verbose=-1
    )
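    # No early-stopping callback is passed, so all 3000 trees are fit;
    # eval_set only tracks validation MAE, and the objective stays at the default L2.
    model_lgb.fit(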
        X_tr_te_lgb, y_tr,
        eval_set=[(X_val_te_lgb, y_val)],
        eval_metric='mae'
    )
    val_pred_lgb = model_lgb.predict(X_val_te_lgb)
    oof_lgb[va_idx] = val_pred_lgb
    fold_mae_lgb = mean_absolute_error(y_val, val_pred_lgb)
    fold_maes_lgb.append(fold_mae_lgb)
    test_preds_lgb += model_lgb.predict(X_test_te_lgb) / skf.n_splits
    print(f'[Fold {fold}] LightGBM MAE: {fold_mae_lgb:.5f}')
    
    print(f'\n[Fold {fold}] Summary: CatBoost={fold_mae_cat:.5f}, LightGBM={fold_mae_lgb:.5f}')

print(f'\n{"="*60}')
print('FINAL RESULTS')
print(f'{"="*60}')

print('\n=== CatBoost ===')
print(f'OOF MAE: {mean_absolute_error(y_all, oof_cat):.5f}')
print(f'Fold MAEs: {[f"{m:.5f}" for m in fold_maes_cat]}')
print(f'Mean: {np.mean(fold_maes_cat):.5f} ± {np.std(fold_maes_cat):.5f}')

print('\n=== LightGBM ===')
print(f'OOF MAE: {mean_absolute_error(y_all, oof_lgb):.5f}')
print(f'Fold MAEs: {[f"{m:.5f}" for m in fold_maes_lgb]}')
print(f'Mean: {np.mean(fold_maes_lgb):.5f} ± {np.std(fold_maes_lgb):.5f}')


FOLD 1/5

[Fold 1] Training CatBoost...
[Fold 1] CatBoost MAE: 0.26101 (iterations: 2943)
[Fold 1] Training LightGBM...
[Fold 1] LightGBM MAE: 0.26609

[Fold 1] Summary: CatBoost=0.26101, LightGBM=0.26609

FOLD 2/5

[Fold 2] Training CatBoost...
[Fold 2] CatBoost MAE: 0.26383 (iterations: 2951)
[Fold 2] Training LightGBM...
[Fold 2] LightGBM MAE: 0.26440

[Fold 2] Summary: CatBoost=0.26383, LightGBM=0.26440

FOLD 3/5

[Fold 3] Training CatBoost...
[Fold 3] CatBoost MAE: 0.26294 (iterations: 2867)
[Fold 3] Training LightGBM...
[Fold 3] LightGBM MAE: 0.26568

[Fold 3] Summary: CatBoost=0.26294, LightGBM=0.26568

FOLD 4/5

[Fold 4] Training CatBoost...
[Fold 4] CatBoost MAE: 0.25992 (iterations: 1355)
[Fold 4] Training LightGBM...
[Fold 4] LightGBM MAE: 0.26041

[Fold 4] Summary: CatBoost=0.25992, LightGBM=0.26041

FOLD 5/5

[Fold 5] Training CatBoost...
[Fold 5] CatBoost MAE: 0.26090 (iterations: 2024)
[Fold 5] Training LightGBM...
[Fold 5] LightGBM MAE: 0.26277

[Fold 5] Summary: CatBo

## 5. Ensemble & Generate Submission

In [18]:
# Weighted ensemble: blend CatBoost and LightGBM
# Find optimal weight on OOF
best_weight = 0.5
best_mae = float('inf')
for w in np.linspace(0, 1, 21):
    oof_blend = w * oof_cat + (1 - w) * oof_lgb
    mae = mean_absolute_error(y_all, oof_blend)
    if mae < best_mae:
        best_mae = mae
        best_weight = w

print(f'\nOptimal ensemble weight (CatBoost): {best_weight:.2f}')
print(f'Ensemble OOF MAE: {best_mae:.5f}')

# Apply ensemble to test predictions
pred = best_weight * test_preds_cat + (1 - best_weight) * test_preds_lgb
pred = np.clip(pred, 1, 5)

submission = pd.DataFrame({'id': test['id'], 'target': pred})
submission.to_csv('mellstroy.game.csv', index=False)


Optimal ensemble weight (CatBoost): 0.55
Ensemble OOF MAE: 0.25795
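
The weight search is a one-dimensional grid search over convex blends of the two out-of-fold prediction vectors. A minimal sketch in plain Python, with invented predictions in which one model overshoots and the other undershoots by the same amount (`mae` and `best_blend_weight` are hypothetical helper names):

```python
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def best_blend_weight(y_true, pred_a, pred_b, steps=21):
    """Grid-search w in [0, 1] minimising the MAE of w * pred_a + (1 - w) * pred_b."""
    best_w, best_mae = 0.5, float('inf')
    for i in range(steps):
        w = i / (steps - 1)
        blend = [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]
        score = mae(y_true, blend)
        if score < best_mae:
            best_w, best_mae = w, score
    return best_w, best_mae

# Hypothetical out-of-fold predictions: model A is 0.2 too high, model B is 0.2 too low.
y_true = [3.0, 4.0, 5.0]
pred_a = [3.2, 4.2, 5.2]
pred_b = [2.8, 3.8, 4.8]
w, score = best_blend_weight(y_true, pred_a, pred_b)
```

In this toy case the errors cancel at an equal blend. Since the weight is chosen on the same out-of-fold predictions used to report the ensemble MAE, that figure is likely to be slightly optimistic.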


In [19]:
submission

         id    target
0     21472  3.932949
1      9837  3.386047
2     41791  3.857323
3     18441  3.372389
4     49348  3.226973
...     ...       ...
9271  30097  3.540172
9272  21993  3.938983
9273  43919  3.983552
9274  46598  3.561007


# AutoML

In [22]:
from autogluon.tabular import TabularDataset, TabularPredictor

In [21]:
full_df = X_all.copy()
full_df['target'] = y_all

In [26]:
train_data = TabularDataset(full_df)
predictor = TabularPredictor(
    label='target', 
    eval_metric='mae'
).fit(
    train_data,
    time_limit=900,        # time limit in seconds
    presets='medium_quality',  # other options: 'high_quality', 'medium_quality', 'fast_train'
    ag_args_fit={'num_gpus': 1},  # use the GPU (1 GPU)
    excluded_model_types=['RF']
)

No path specified. Models will be saved in: "AutogluonModels\ag-20251021_155055"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.11.9
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.19045
CPU Count:          12
Memory Avail:       46.51 GB / 63.46 GB (73.3%)
Disk Space Avail:   179.70 GB / 879.97 GB (20.4%)
Presets specified: ['medium_quality']
Using hyperparameters preset: hyperparameters='default'
Beginning AutoGluon training ... Time limit = 900s
AutoGluon will save models to "C:\Projects\aiijc\AutogluonModels\ag-20251021_155055"
Train Data Rows:    37167
Train Data Columns: 702
Label Column:       target
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and label-values can't be converted to int).
	Label info (max, min, mean, stddev): (5.0, 0.1, 3.75986, 0.46157)
	If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during Pre

[1000]	valid_set's l1: 0.232579


	-0.2325	 = Validation score   (-mean_absolute_error)
	38.67s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 858.35s of the 858.35s of remaining time.
	Fitting with cpus=6, gpus=1, mem=1.0/46.7 GB
	Training LightGBM with GPU, note that this may negatively impact model quality compared to CPU training.
	-0.2339	 = Validation score   (-mean_absolute_error)
	9.54s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 848.77s of the 848.77s of remaining time.
	Fitting with cpus=6, gpus=1, mem=1.9/46.7 GB
	Training CatBoost with GPU, note that this may negatively impact model quality compared to CPU training.
Default metric period is 5 because MAE is/are not implemented for GPU
Default metric period is 5 because MAE is/are not implemented for GPU
	-0.2317	 = Validation score   (-mean_absolute_error)
	38.36s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: ExtraTreesMSE

In [27]:
automl_sub = submission.copy()
test_data = TabularDataset(X_test_base)

In [29]:
automl_sub['target'] = predictor.predict(test_data)
automl_sub

         id    target
0     21472  3.723485
1      9837  3.352265
2     41791  4.197441
3     18441  3.405224
4     49348  3.263135
...     ...       ...
9271  30097  3.103590
9272  21993  3.968985
9273  43919  3.977589
9274  46598  3.586091


In [30]:
automl_sub.to_csv('automl_submission.csv', index=False)