# Imputation Code for Jupyter Euler Server

To make running the imputation on the Euler server as easy as possible, we put the code into this separate file. Copy it onto the Euler server together with the designated Euler folder and work from there.

----
Process:

We opened a Jupyterhub Euler Server (https://jupyter.euler.hpc.ethz.ch/hub/login?next=%2Fhub%2F) with the following specifications:
- 25 CPU Cores
- 0 GPUs
- 32 GB RAM
- 12 hours Run Time
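
Since the estimator further down uses `n_jobs=-1`, it can be useful to confirm how many cores the session actually received. The following is a minimal sketch, not part of our original run; the helper name `usable_cpu_count` is hypothetical, and whether the scheduler affinity matches the requested core count depends on how the cluster is configured.

```python
import os


def usable_cpu_count() -> int:
    """Return the number of CPU cores this process is allowed to use."""
    # On Linux, the scheduler affinity lists the cores available to this
    # process, which may be fewer than the cores physically on the node.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback for platforms without sched_getaffinity.
    return os.cpu_count() or 1


print(f"Usable CPU cores: {usable_cpu_count()}")
```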


We copied the Euler folder with all its contents, together with this file, into Jupyterhub.

(The three files in the Data folder are created by 02_Data Preprocessing.py. We recommend skipping this imputation notebook at first: run everything else based on the imputation files we provide, and afterwards check that the imputation itself runs by executing it separately on a server with several CPU cores.)

These files need to be available on the server in the following structure:
- This file (02_Data Preprocessing 11-Imputations.ipynb)
- Euler folder
  - Data
    - X_test.csv
    - X_train.csv
    - X.csv
  - Output
    - (empty)
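
The structure above can be checked before starting the long run. This is a minimal sketch that assumes the notebook sits next to the `Euler folder` directory, as the paths in the cells below do; the helper names `missing_inputs` and `ensure_output_dir` are hypothetical.

```python
from pathlib import Path


def missing_inputs(root="./Euler folder"):
    """Return the expected input files that are not present under root."""
    data_dir = Path(root) / "Data"
    expected = [data_dir / name for name in ("X.csv", "X_train.csv", "X_test.csv")]
    return [str(path) for path in expected if not path.is_file()]


def ensure_output_dir(root="./Euler folder"):
    """Create the Output folder if it does not exist yet and return its path."""
    out_dir = Path(root) / "Output"
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


missing = missing_inputs()
if missing:
    print("Missing input files:", missing)
else:
    print("All input files found, output folder:", ensure_output_dir())
```

Running this first avoids discovering a wrong path only after the imputation has finished.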

With this setup we ran the imputation.

The imputation took 8.22 hours. A VPN connection is required for this whole duration, so make sure your device does not go into sleep mode in the meantime and break the connection.

----

Import packages:

In [2]:
import pandas as pd
import numpy as np

# 11
import time
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

Import data sets:

In [3]:
X = pd.read_csv("./Euler folder/Data/X.csv", header=0)

X_train = pd.read_csv("./Euler folder/Data/X_train.csv", header=0)
X_test = pd.read_csv("./Euler folder/Data/X_test.csv", header=0)

# Convert pandas DataFrame to numpy array
X_train, X_test = (
    np.array(X_train),
    np.array(X_test),
)

Imputation:

In [4]:
# Estimator
est = RandomForestRegressor(
        n_jobs=-1, # The number of jobs to run in parallel. -1 means using all processors. # 20 CPU cores on Euler -> 496.10 minutes (~8.27h)
        random_state=42,
        n_estimators=50, # Half of default to reduce computation time.
    )

# Imputer
imp = IterativeImputer(
        estimator=est,
        random_state=42,
        max_iter=5, # Half of default to reduce computation time.
    )

# Imputing process
t1_train = time.time()
X_train_imputed = imp.fit_transform(X_train)
t2_train = time.time()

t1_test = time.time()
X_test_imputed = imp.transform(X_test)
t2_test = time.time()

# Save Imputed Data
df_X_train_imputed = pd.DataFrame(X_train_imputed)
df_X_test_imputed = pd.DataFrame(X_test_imputed)

# Add names to imputed data
df_X_train_imputed.columns = X.columns
df_X_test_imputed.columns = X.columns

print(f"The training imputation process took {(t2_train - t1_train) / 60:.2f} minutes.")
print(f"The testing imputation process took {(t2_test - t1_test) / 60:.2f} minutes.")
print()


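Before committing to the multi-hour run, the same pipeline can be tried on a small synthetic data set. The sketch below is an illustration under assumptions, not our actual run: the toy DataFrame, its column names, and the reduced `n_estimators` and `max_iter` values are made up to keep it fast. It also guards the step where column names are reattached, because by default `IterativeImputer` drops columns that contain only missing values during fitting, which would make the column assignment fail.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Toy data (hypothetical): 40 rows, 3 columns, every 5th value of "b" missing.
rng = np.random.default_rng(42)
toy = pd.DataFrame(rng.normal(size=(40, 3)), columns=["a", "b", "c"])
toy.loc[::5, "b"] = np.nan
toy_train, toy_test = toy.iloc[:30], toy.iloc[30:]

# Same estimator and imputer types as above, with tiny settings for speed.
toy_imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=5, random_state=42),
    max_iter=2,
    random_state=42,
)
toy_train_imputed = toy_imp.fit_transform(toy_train.to_numpy())
toy_test_imputed = toy_imp.transform(toy_test.to_numpy())

# Reattach names only if no column was dropped during fitting.
if toy_train_imputed.shape[1] != len(toy.columns):
    raise ValueError("Imputer output has a different number of columns than the input.")
df_toy_train = pd.DataFrame(toy_train_imputed, columns=toy.columns)
df_toy_test = pd.DataFrame(toy_test_imputed, columns=toy.columns)

print("Remaining missing values (train):", int(df_toy_train.isna().sum().sum()))
print("Remaining missing values (test):", int(df_toy_test.isna().sum().sum()))
```

With so few iterations scikit-learn may emit a convergence warning; for a smoke test like this that is harmless.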
Save files:

In [None]:
# Save Imputed data to .csv file
df_X_train_imputed.to_csv("./Euler folder/Output/X_train_imputed.csv", sep=',', index=False, encoding='utf-8')
df_X_test_imputed.to_csv("./Euler folder/Output/X_test_imputed.csv", sep=',', index=False, encoding='utf-8')
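
After saving, the files can be read back to confirm that no missing values remain and that the column names survived. This is a minimal sketch; the helper name `check_imputed_csv` is hypothetical and not part of our original notebook.

```python
import pandas as pd


def check_imputed_csv(path, expected_columns):
    """Read an imputed CSV and report remaining NaNs and column agreement."""
    df = pd.read_csv(path, header=0)
    n_missing = int(df.isna().sum().sum())
    same_columns = list(df.columns) == list(expected_columns)
    return n_missing, same_columns
```

For example, `check_imputed_csv("./Euler folder/Output/X_train_imputed.csv", X.columns)` should return `(0, True)` if the imputation and the export both worked as intended.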

----
Now copy the output files into the folder "./Output/Data/02 11-Imputations".

We already provide these files in our submission, so 02_Data Preprocessing runs in its entirety without you running this imputation.

This way, however, you can run the imputation on the server without much extra effort.

----