<a href="https://www.kaggle.com/code/shokhjahonisroilov/ioai-hyperspase?scriptVersionId=294163165" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<img src="./figs/IOAI-Logo.png" alt="IOAI Logo" width="200" height="auto">

[IOAI 2024 (Burgas, Bulgaria), On-Site Round](https://ioai-official.org/bulgaria-2024)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IOAI-official/IOAI-2024/blob/main/On-Site-Round/Lost_in_Hyperspace/Lost_in_Hyperspace.ipynb)

# Lost in Hyperspace: ML Regression Challenge

<img src="./figs/Lost in Hyperspace Fig 1.png" width="300">

## Story Background
Congratulations on your promotion to Principal Engineering Detective! Your impressive work on the previous task has earned you this exciting new challenge.
Now, you are entrusted with the ancient and mesmerizing Glowing Hypercubes, which share some intriguing characteristics with the "Pulse of the Machine" widgets from your last mission (refer to Important Tips for details).
Your mission is to unravel the mysteries of these Glowing Hypercubes by predicting three vital properties using the provided data.
## Objective and Limitations
- Your ultimate goal is to predict three properties of the Glowing Hypercubes.
- Every Glowing Hypercube is represented by a (5 x 5 x 5 x 6) array with many symmetries and unique properties (see the Important Tips section for details).
- You need to engineer a small number of features from the Glowing Hypercube data, since efficient factory procedures allow you to **use only Linear Regression** as a model, with no hyperparameter changes allowed. You are also limited to 300 features for each task.
- Your success will be measured by the Root Mean Square Error (RMSE) for each property independently, which is then translated into the score on the leaderboard.
- Note that different properties have different weights in the final score. See the `SCALING_WEIGHTS` variable for details. To produce a single score, each property's RMSE is multiplied by its weight and the scaled RMSEs are averaged.
- Your solution for each task should not exceed 5 minutes for feature generation, training, and inference on the standard Colab non-GPU instance.
- Share the `ml_feature_0.txt`, `ml_feature_1.txt`, and `ml_feature_2.txt` files with us, and don't forget to supply your Google Colab as well
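
As a rough illustration of how the single leaderboard number is formed, the sketch below computes one RMSE per property, multiplies it by its weight, and averages the three results. The weights are copied from the `SCALING_WEIGHTS` variable defined in this notebook; the helper names `rmse` and `total_score` and the toy numbers are made up for illustration.

```python
import math

# Weights copied from the SCALING_WEIGHTS variable in this notebook.
SCALING_WEIGHTS = [100 / 15, 100 / 8, 100 / 100]

def rmse(y_true, y_pred):
    """Root mean square error of two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def total_score(y_true_cols, y_pred_cols):
    """Average of weight-scaled RMSEs, one per property (lower is better)."""
    scaled = [
        rmse(t, p) * w
        for t, p, w in zip(y_true_cols, y_pred_cols, SCALING_WEIGHTS)
    ]
    return sum(scaled) / len(scaled)

# Toy data: three properties, two samples each (hypothetical values).
y_true = [[1.0, 2.0], [0.5, 0.5], [10.0, 20.0]]
y_pred = [[1.0, 2.0], [0.5, 0.5], [13.0, 24.0]]
score = total_score(y_true, y_pred)
```

In this toy case only the third property contributes an error, so the score is that property's RMSE times its weight, divided by three.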

## Important Tips

<img src="./figs/Lost in Hyperspace Fig 2.png" width="600">

- Linear Regression documentation
  - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

- Handy Numpy functions:
  - https://numpy.org/doc/stable/reference/generated/numpy.swapaxes.html
  - https://numpy.org/doc/stable/reference/generated/numpy.ravel.html
  - https://numpy.org/doc/stable/reference/generated/numpy.reshape.html

- Root Mean Square Error
  - https://en.wikipedia.org/wiki/Root_mean_square_deviation
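
A small sketch of how these NumPy helpers behave on an array with the hypercube layout `(samples, 5, 5, 5, 6)`. The random array is only a stand-in for the real data, and the last lines restate the RMSE formula from the linked article.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5, 5, 5, 6))  # stand-in for a batch of 4 hypercubes

# swapaxes exchanges two axes; element [i, a, b, ...] moves to [i, b, a, ...].
X_swapped = np.swapaxes(X, 1, 2)
same_element = X_swapped[0, 1, 2, 3, 4] == X[0, 2, 1, 3, 4]

# reshape flattens each hypercube into one row of 5*5*5*6 = 750 values.
X_flat = X.reshape(X.shape[0], -1)

# ravel flattens a single hypercube into a 1-D vector.
one_cube = np.ravel(X[0])

# RMSE between two vectors: sqrt(mean((a - b)^2)).
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
rmse = np.sqrt(np.mean((a - b) ** 2))
```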

## Clarification of the methods' usage

The usage limitations mostly resemble those of the home task, namely:

- Mind the time limits. These are separate for each forecasting model and assume a non-GPU instance. The limit covers feature generation, training, and inference on the validation/test set. Data analysis, feature search, and feature selection are not subject to the time limit.

- Supervised neural networks (and any other supervised models: LDA, boosted trees, etc.) are not allowed as feature extractors. Simpler supervised models (e.g., tree ensembles, linear regression) may be used for feature selection, provided the model is not used as a feature extractor. Unsupervised learning is allowed (including autoencoders).

- Pretrained models and AutoML solutions are not allowed. Libraries that automatically search through various approaches (including unsupervised ones) on the user's behalf are not allowed either.

- Given the time constraints, Colab notebooks should be as reproducible as possible. In case of doubt (abuse of the time limits, data usage, etc.), the Jury has the right to use the models/answers generated by the notebook and pick the answers with the lower score.

- Yes, different models can use different feature sets.

- No, you cannot use the validation data for training.

**If you are not sure, please ask the Jury!**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

SCALING_WEIGHTS = [100/15, 100/8, 100/100]

In [None]:
data = pd.read_pickle('/kaggle/input/input-data/ml_data_onsite_start.pickle')
for key in data.keys():
  print(key)

In [None]:
for key in data['X'].keys():
  print(key)

In [None]:
for key in data['y'].keys():
  print(key)

In [None]:
X_train = data['X']['train']
y_train = data['y']['train']

X_val = data['X']['val']
y_val = data['y']['val']

X_test = data['X']['live_test']

In [None]:
X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape

In [None]:
def vis(arr):
  plt.figure(figsize=(8, 8))

  cnt = 1
  for z in range(5):
    for q in range(6):
      plt.subplot(5, 6, cnt)
      plt.imshow(arr[:, :, z, q], vmin=-50, vmax=40, cmap='hsv')
      plt.grid()
      plt.axis('off')
      cnt += 1
  plt.tight_layout()

In [None]:
X_train.shape

In [None]:
vis(X_train[11])

## Functions for result evaluation / writing predictions

Do not change it!

In [None]:
def test_solution(X_train, y_train, X_val, y_val, feature_num=0):
    assert X_train.shape[-1] <= 300, "Too many features! Should be less than 300"
    assert X_val.shape[-1] <= 300, "Too many features! Should be less than 300"

    model =  LinearRegression().fit(
        X_train,
        y_train[:, feature_num]
    )
    predictions = model.predict(X_val)
    rmse = mean_squared_error(
        predictions,
        y_val[:, feature_num]
    )**.5
    normalized_rmse = rmse * SCALING_WEIGHTS[feature_num]
    print(f"Property #{feature_num}:    raw RMSE={rmse:.6f}")
    print(f"Property #{feature_num}: scaled RMSE={normalized_rmse:.6f}")
    return round(normalized_rmse, 6)

## Let's try a baseline solution

In [None]:
def symmetrize_x(X_tr, y_tr):
    xxx = [
        X_tr,
        X_tr.swapaxes(1, 2),
        X_tr.swapaxes(1, 3),
        X_tr.swapaxes(2, 3),
        X_tr.swapaxes(1, 3).swapaxes(1, 2),
        X_tr.swapaxes(1, 3).swapaxes(2, 3),
    ]
    return np.concatenate(xxx), np.vstack([y_tr]*6)
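
The six arrays listed in `symmetrize_x` appear to be the six possible orderings of the three spatial axes (axes 1, 2, 3), with the sample axis and the last axis left in place. Assuming that reading is right, an equivalent sketch can generate them with `itertools.permutations`. The name `symmetrize_by_permutation` is hypothetical, and the copies come out in a different order than in `symmetrize_x`.

```python
import itertools
import numpy as np

def symmetrize_by_permutation(X, y):
    """Stack all 6 orderings of the spatial axes (1, 2, 3) of X."""
    copies = [
        X.transpose(0, *perm, 4)
        for perm in itertools.permutations((1, 2, 3))
    ]
    # The targets are repeated once per copy, as in symmetrize_x.
    return np.concatenate(copies), np.vstack([y] * len(copies))

# Toy check on random data with the same axis layout as the hypercubes.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(2, 3, 3, 3, 2))
y_toy = rng.normal(size=(2, 3))
X_aug, y_aug = symmetrize_by_permutation(X_toy, y_toy)
```

Writing the permutations explicitly makes it easier to see that no ordering is missing or repeated.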

In [None]:
def augment_rotations_xyz(X, y=None):

    def rotate_xy(X, k):
        return np.rot90(X, k=k, axes=(1, 2))

    def rotate_xz(X, k):
        return np.rot90(X, k=k, axes=(1, 3))

    def rotate_yz(X, k):
        return np.rot90(X, k=k, axes=(2, 3))

    
    X_aug = []
    y_aug = []

    for k in [0, 1, 2, 3]:
        X1 = rotate_xy(X, k)
        X_aug.append(X1)
        if y is not None:
            y_aug.append(y)

        X2 = rotate_xz(X, k)
        X_aug.append(X2)
        if y is not None:
            y_aug.append(y)

        X3 = rotate_yz(X, k)
        X_aug.append(X3)
        if y is not None:
            y_aug.append(y)

    X_aug = np.concatenate(X_aug, axis=0)

    if y is not None:
        y_aug = np.concatenate(y_aug, axis=0)
        return X_aug, y_aug

    return X_aug
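
One detail worth noting: `np.rot90` with `k=0` returns the input unchanged, so the loop in `augment_rotations_xyz` adds the unrotated array three times (once per plane). The sketch below counts distinct copies for a single random stand-in cube. Whether the duplicates matter for the fit is not something this notebook tests.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1, 3, 3, 3, 2))  # one random stand-in hypercube

# Same planes and quarter-turn counts as in augment_rotations_xyz.
planes = [(1, 2), (1, 3), (2, 3)]
copies = [
    np.rot90(X_toy, k=k, axes=axes)
    for k in range(4)
    for axes in planes
]

# tobytes() serialises the logical contents, so equal arrays give equal keys.
n_total = len(copies)
n_distinct = len({c.tobytes() for c in copies})
```

On a generic array this gives 12 copies, of which 10 are distinct: nine genuine rotations plus the identity counted once.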


In [None]:
X_train_symm, y_train_symm = symmetrize_x(X_train, y_train)

In [None]:
X_train_symm.shape

In [None]:
def base(X):
    X_new = X.reshape((X.shape[0], -1)) # ravel
    X_new = X_new[:, :300] # pick first 300 features
    return X_new

In [None]:
def get_diff(X):
    bgs = []
    for _x in X:
        bg = _x[:,:,:,3].ravel().min() - _x[:,:,:,2].ravel().max()
        bgs.append(bg)
    bgs = np.array(bgs)

    bgs2 = []
    for x in X:
        bg = x[:,:,:,5].ravel().min() - x[:,:,:,4].ravel().max()
        bgs2.append(bg)
    bgs2 = np.array(bgs2)
    return np.concatenate([bgs[:, None], bgs2[:, None]], axis=1)
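
The two loops in `get_diff` can also be written without Python-level iteration. This is a sketch under the assumption that `X` has shape `(n, 5, 5, 5, 6)` as elsewhere in the notebook. `get_diff_vectorized` is a hypothetical name, "channel" is used here informally for an index along the last axis, and the function is meant to return the same values as `get_diff`.

```python
import numpy as np

def get_diff_vectorized(X):
    """Per-sample gaps: min(ch. 3) - max(ch. 2) and min(ch. 5) - max(ch. 4)."""
    spatial = (1, 2, 3)  # reduce over the three spatial axes
    gap_32 = X[..., 3].min(axis=spatial) - X[..., 2].max(axis=spatial)
    gap_54 = X[..., 5].min(axis=spatial) - X[..., 4].max(axis=spatial)
    return np.stack([gap_32, gap_54], axis=1)

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(3, 5, 5, 5, 6))
diff = get_diff_vectorized(X_toy)  # shape (3, 2)
```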

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def ravelize(x):
    return x.reshape(x.shape[0], -1)

pca = PCA(n_components=300-2-98-50-16)
scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(ravelize(X_train_symm))
X_val_sc = scaler.transform(ravelize(X_val))
X_train_pca = pca.fit_transform(X_tr_sc)
X_val_pca = pca.transform(X_val_sc)
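
The expression `300-2-98-50-16` reserves room for the other feature groups in this notebook, so that the concatenated matrix has exactly 300 columns. A small bookkeeping sketch follows; the dictionary and its keys are illustrative, while the sizes are taken from this notebook.

```python
FEATURE_LIMIT = 300  # limit stated in the task description

# Sizes of the non-PCA feature groups used in this notebook.
other_groups = {
    "diff": 2,          # get_diff output columns
    "kmeans": 98,       # KMeans n_clusters
    "ica": 50,          # FastICA n_components
    "autoencoder": 16,  # latent_dim
}

n_pca = FEATURE_LIMIT - sum(other_groups.values())
total = n_pca + sum(other_groups.values())
```

Keeping the group sizes in one place makes it harder to exceed the limit by accident when a group is resized.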

In [None]:
X_train_diff = get_diff(X_train_symm)
X_val_diff = get_diff(X_val)

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters=98,
    random_state=42,
    n_init=10
)

X_tr_cluster = kmeans.fit_transform(X_tr_sc)
X_val_cluster = kmeans.transform(X_val_sc)
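
`KMeans.fit_transform` and `KMeans.transform` map each sample to its distances from the cluster centres, which is why 98 clusters yield 98 features. A NumPy-only sketch of that mapping, using made-up centres rather than fitted ones (`distances_to_centers` is a hypothetical helper):

```python
import numpy as np

def distances_to_centers(X, centers):
    """Euclidean distance from each row of X to each row of centers."""
    diff = X[:, None, :] - centers[None, :, :]
    return np.linalg.norm(diff, axis=2)

X_toy = np.array([[0.0, 0.0], [3.0, 4.0]])
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])  # hypothetical centres
features = distances_to_centers(X_toy, centers)  # shape (2, 3)
```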

In [None]:
from sklearn.decomposition import FastICA

ica = FastICA(
    n_components=50,
    random_state=42,
    max_iter=1000,
    whiten="unit-variance"
)

X_train_ica = ica.fit_transform(X_tr_sc)
X_val_ica   = ica.transform(X_val_sc)

In [None]:
import tensorflow as tf
from tensorflow.keras import layers, Model

input_dim = ravelize(X_train_symm).shape[1]
latent_dim = 16

inp = layers.Input(shape=(input_dim,))
x = layers.Dense(128, activation="relu")(inp)
x = layers.Dense(64, activation="relu")(x)
latent = layers.Dense(latent_dim, activation="linear")(x)

x = layers.Dense(64, activation="relu")(latent)
x = layers.Dense(128, activation="relu")(x)
out = layers.Dense(input_dim, activation="linear")(x)

autoencoder = Model(inp, out)
encoder = Model(inp, latent)

autoencoder.compile(
    optimizer="adam",
    loss="mse"
)


In [None]:
autoencoder.fit(
    ravelize(X_train_symm),
    ravelize(X_train_symm),
    validation_data=(ravelize(X_val), ravelize(X_val)),
    epochs=50,
    batch_size=64,
    verbose=0
)

In [None]:
X_tr_ae = encoder.predict(ravelize(X_train_symm))
X_val_ae = encoder.predict(ravelize(X_val))

In [None]:
X_tr_comb = np.concatenate([X_train_pca, X_train_diff, X_tr_cluster, X_train_ica, X_tr_ae], axis=-1)
X_val_comb = np.concatenate([X_val_pca, X_val_diff, X_val_cluster, X_val_ica, X_val_ae], axis=-1)

In [None]:
X_val_comb.shape

In [None]:
%%time
total_score = 0
for feature_number in range(3):
  total_score += test_solution(
      base(X_train),
      y_train,
      base(X_val),
      y_val,
      feature_num=feature_number
  )
  print()
total_score /= 3
print('='*16)
print(f"Total score = {total_score:.6f}")

In [None]:
%%time
total_score = 0
for feature_number in range(3):
  total_score += test_solution(
      X_train_pca,
      y_train_symm,
      X_val_pca,
      y_val,
      feature_num=feature_number
  )
  print()
total_score /= 3
print('='*16)
print(f"Total score = {total_score:.6f}")

In [None]:
%%time
total_score = 0
for feature_number in range(3):
  total_score += test_solution(
      X_tr_comb,
      y_train_symm,
      X_val_comb,
      y_val,
      feature_num=feature_number
  )
  print()
total_score /= 3
print('='*16)
print(f"Total score = {total_score:.6f}")

**Best score: 1.433150**

## How to prepare the answer files

In [None]:
# def generate_predictions(X_train, y_train, X_test, feature_num=0):
#     assert X_train.shape[-1] <= 300
#     assert X_test.shape[-1] <= 300

#     model =  LinearRegression().fit(
#         X_train,
#         y_train[:, feature_num]
#     )
#     predictions = model.predict(X_test)
#     return predictions


# ## Generate solutions and write to the file
# combined = {'ID': np.arange(X_test.shape[0])}

# for feature_number in range(3):
#     predictions = generate_predictions(
#         base(X_train),  # `base` is the feature extractor defined above
#         y_train,
#         base(X_test),
#         feature_num=feature_number
#     )

#     combined[f'y{feature_number+1}'] = predictions

# pd.DataFrame(combined).to_csv('predictions.csv', index=False)

In [None]:
# # load the test dataset
# loaded = pd.read_pickle("/kaggle/input/test-data/ml_data_onsite_final_test.pickle")
# X_test_final = loaded['X']['final_test']


# # make final predictions
# combined = {'ID': np.arange(X_test_final.shape[0])}

# for feature_number in range(3):
#     predictions = generate_predictions(
#         base(X_train),  # `base` is the feature extractor defined above
#         y_train,
#         base(X_test_final),
#         feature_num=feature_number
#     )

#     combined[f'y{feature_number+1}'] = predictions

# pd.DataFrame(combined).to_csv('final_predictions.csv', index=False)