In [None]:
from sklearn.experimental import enable_iterative_imputer
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import lightgbm as lgb
import matplotlib.pyplot as plt
import seaborn as sns

## Legend
## Slava's glass. Pt. 2

The safe clicked open with a satisfying sound, almost like it knew Slava was not just any thief, but the one who came for his treasure.

He pulled out his glass. The real one. Warm. Intact.
For a moment, everything felt right. "This is it," he thought to himself, but then, the familiar, haunting voice of the mansion's security system echoed:

"What if I told you, this isn't what it seems?"
Slava shook his head, pushing the strange thought away, but that ominous question lingered in his mind.

He pulled out his phone to call a taxi — and realized something was wrong.
The GPS showed he was in the Pacific Ocean.
The Wi-Fi was connected to “Andrey_AI_Lab_5GHz” and was giving him a cryptic message:

“no escape until the map is whole.”

He looked around. The mansion’s hallways stretched endlessly, twisted and changed.
One minute he was in a library filled with books on linear algebra and the sound of an owl,
the next — in a gym with a punching bag labeled “Bayes”.

And then it hit him: the map had glitched. Its pieces were distorted, scrambled.
He needed to assemble it like a puzzle to figure out how to escape.
Al Pacino’s voice suddenly echoed in his headphones — or maybe it was just in his mind:

"Inch by inch, play by play, until we're finished. We’re in hell right now, gentlemen. Believe me. But we can climb out. Every inch is a step up.”

Slava looked at the glass. It wasn’t just glass, it was his lucky glass, the one that always brought him good fortune. He felt his fingers tighten around the cold surface, his treasure once again in his hands.

— “It’s mine... my precious...” — he whispered, gripping the glass.
Slava cracked his knuckles and muttered:

"I’m gonna do what’s called a ‘dirty hack’." 
That was the only way out of this twisted mansion: break the rules.

He opened Jupyter.

On one of the desks, he found an old laptop with a task flashing on the screen:

❗ To restore the map, solve this:

You have a dataset that can only be used with gradient boosting — no neural networks, no fancy transformers.

But:
— Some parts of the data are missing or corrupted.
— It’s unclear how to handle these gaps efficiently.

Your task:
— Understand how to prepare and process this incomplete data;
— Train a gradient boosting model that still gives strong, reliable predictions;
— Piece together the fragments of the dataset... and of the map itself.

Slava looked at the glass again. It was cold and real, and it was all he needed. But the way out was only through solving this task. He grabbed the laptop, started typing, and plunged into the world of data, determined to find a way to put this puzzle together.

"I’m going to have to make the choice," he whispered to himself, recalling a familiar line from The Matrix. "The red pill or the blue pill?"

But there was no turning back now.


## Overview

In this task you will have to solve a standard problem on tabular data. However, the final solution must be obtained using the function **clf_train**.

## Metric

Square root of [RMSE](https://en.wikipedia.org/wiki/Root_mean_square_deviation), i.e. the fourth root of the mean squared error:
$$ SCORE = \left(\frac{1}{n}\sum_{i=1}^{n}{(true_i - predict_i)^{2}}\right)^{1/4} $$
* **true** - real value of the target
* **predict** - your prediction
* **n** - number of target values
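
For local validation, the metric can be sketched in a few lines of NumPy. The function name `score` is a hypothetical helper, not part of the competition code; it simply evaluates the formula above.

```python
import numpy as np

def score(y_true, y_pred):
    """Fourth root of the mean squared error, i.e. sqrt(RMSE).

    `score` is an illustrative helper name, not part of the task's code.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)  # mean squared error
    return mse ** 0.25                     # (MSE)^(1/4)

# Toy check: every error equals 4, so MSE = 16, RMSE = 4, SCORE = 2
example = score([0.0, 0.0], [4.0, 4.0])
```

Because of the fourth root, large errors are compressed: halving the RMSE lowers the score by a factor of about 1.41, not 2.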

## Restriction

You cannot change the code of the **clf_train** function, and only submissions produced by this function are accepted. The function takes the following inputs: the training and test datasets with features preprocessed by you, the training target, per-row weights for training, the id column used to generate the submission file, and an optional function for inverting a target transformation.

## Data

* **train_tables.csv** - train dataset with 9 numeric features, 3 datetime features and target
* **test_tables.csv** - test dataset with 9 numeric features, 3 datetime features and id for submission
* **sample_submission.csv** - example of a submission file with the id column and the target column that needs to be predicted.

Read the train and test dataframes

In [None]:
train = pd.read_csv('/kaggle/input/neoai-2025-tricy-table-data/train_tables.csv')
test = pd.read_csv('/kaggle/input/neoai-2025-tricy-table-data/test_tables.csv')

Inference function

**You cannot change this function.**

In [None]:
def clf_train(train, test, target, weight_col, id_col, name_file = 'sub.csv', func_inv =None):

    param = {
    'learning_rate': 0.1,
    'num_leaves': 48,
    'lambda_l1' : 1,
    'lambda_l2' : 1,
    'min_data_in_leaf' : 100,
    'objective': 'mae',
    'verbosity':-1,
    }
    
    predict_test = np.zeros(len(test))

    tr = lgb.Dataset(train, target, weight=weight_col)
    bst = lgb.train(param, tr, num_boost_round=500)
    lgb.plot_importance(bst, importance_type="gain", title="LightGBM Feature Importance (Gain)")
    plt.show()

    predict_test = bst.predict(test)
    if func_inv:
        predict_test = func_inv(predict_test)
    sub = pd.DataFrame()
    sub['id'] = id_col
    sub['target'] = predict_test
    sub.to_csv(name_file, index = None)

Function for mapping predictions back to the original target scale, if you transform the target

In [None]:
def func_inv(x):
    return x
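
If the target is transformed before training, `func_inv` has to undo that transformation exactly. The sketch below shows one possible pair, a log transform and its inverse. The names `func_fwd` and `func_inv_log` are hypothetical, and the pair assumes the target is non-negative; whether such a transform helps on this dataset is not established here.

```python
import numpy as np

def func_fwd(y):
    # Hypothetical forward transform: log(1 + y), assumes y > -1
    return np.log1p(y)

def func_inv_log(x):
    # Exact inverse of func_fwd: exp(x) - 1
    return np.expm1(x)

# Round-trip check on toy values (not competition data)
y_toy = np.array([0.0, 1.0, 10.0, 250.0])
roundtrip_ok = np.allclose(func_inv_log(func_fwd(y_toy)), y_toy)
```

Under these assumptions, the transformed target `func_fwd(target)` would be passed as `target` and `func_inv_log` as `func_inv` when calling `clf_train`.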

Train and inference. 
You should use **clf_train** for generating submission

In [None]:
num_cols = [c for c in train.columns if train[c].dtype != bool]

corr = train[num_cols].corr(numeric_only=True)


sns.heatmap(corr, cmap='coolwarm')

In [None]:
drop_cols = ['target']
target = train["target"]
df = train.drop(columns=drop_cols)  # feature matrix without the target
train_cols = [c for c in train.columns if c not in drop_cols]
cols = [f"feat_{x}" for x in range(9)] + ["day", "hour", "minute"]
# Down-weight rows with many gaps: weight = share of observed feature values per row
missing_frac = train[cols].isna().mean(axis=1)
weight = 1.0 - missing_frac
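
The weighting above can be illustrated on a toy frame. This is a minimal sketch with made-up values and hypothetical variable names (`toy`, `toy_weight`), showing how the per-row missing fraction turns into a sample weight.

```python
import numpy as np
import pandas as pd

# Toy data (not from the competition): two features, four rows
toy = pd.DataFrame({
    "feat_0": [1.0, np.nan, np.nan, 4.0],
    "feat_1": [np.nan, np.nan, 3.0, 4.0],
})

toy_missing_frac = toy.isna().mean(axis=1)  # share of NaN values per row
toy_weight = 1.0 - toy_missing_frac         # share of observed values per row
```

Here the rows get weights 0.5, 0.0, 0.5 and 1.0. Note that a row with every feature missing receives weight 0 and therefore does not contribute to the training loss at all.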

In [None]:
test_data = test.drop(['id'], axis=1)
# Last expression in the cell, so the per-row missing fraction is displayed
train[cols].isna().mean(axis=1)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

In [None]:
df.describe()

In [None]:
scaler = MinMaxScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
X_test_scaled  = pd.DataFrame(scaler.transform(test_data), columns=test_data.columns)

In [None]:
kimp = KNNImputer(n_neighbors=3)
iimp = IterativeImputer()

specials = ['feat_0','feat_3','feat_5', 'feat_6']
X_train_scaled[specials] = pd.DataFrame(iimp.fit_transform(X_train_scaled[specials]), columns=specials)
X_test_scaled[specials] = pd.DataFrame(iimp.transform(X_test_scaled[specials]), columns=specials)

In [None]:
X_train_filled = pd.DataFrame(kimp.fit_transform(X_train_scaled), columns=df.columns)
X_test_filled = pd.DataFrame(kimp.transform(X_test_scaled), columns=test_data.columns)

In [None]:
# clf_train returns None; it writes the submission to 'knn_i3.csv'
clf_train(X_train_filled, X_test_filled, target, weight, test['id'].tolist(), 'knn_i3.csv', func_inv=func_inv)