
<div style="text-align: center;">
  <img width="420" height="420" src="https://www.naterscreations.com/imputegap/logo_imputegab.png" />
</div>

<h1>ImputeGAP: Repair Notebook</h1>

In [None]:
%pip install imputegap
%pip install -U ipywidgets

<h1>Loading</h1>

ImputeGAP comes with several time series datasets.

As an example, we use the eeg-alcohol dataset, recorded from individuals with a genetic predisposition to alcoholism. The dataset contains measurements from 64 electrodes placed on the subjects’ scalps, sampled at 256 Hz. It consists of 64 series, each containing 256 values (one second of recording at 256 Hz).

To load and plot the eeg-alcohol dataset from the library:

In [10]:
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the time series object
ts = TimeSeries()

# load and normalize the dataset from the library
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# plot and print a subset of time series
ts.print(nbr_series=3, nbr_val=7)
ts.plot(input_data=ts.data, nbr_series=9, nbr_val=100, save_path="./imputegap_assets")


(SYS) The time series have been loaded from /mnt/c/Users/nquen/switchdrive/MST_MasterThesis/imputegap/imputegap/dataset/eeg-alcohol.txt

> logs: normalization (z_score) of the data - runtime: 0.0007 seconds

shape of eeg-alcohol : (64, 256)
	number of series = 64
	number of values = 256

                  TS_1           TS_2           TS_3           
idx_0             2.1903285853   1.6705262632   2.2587543576
idx_1             1.9472603285   1.5655748037   2.1966398110
idx_2             1.7041920717   1.4605158119   2.1966398110
idx_3             1.5825957138   1.3029810904   2.1345252645
idx_4             1.4003878654   1.0929706391   1.8858125103
idx_5             1.2180555580   0.7779011960   1.5751124936
idx_6             0.9749873012   0.6204740067   1.2644124768
...

plots saved in  ./imputegap_assets/25_04_03_18_00_08_imputegap_plot.jpg


'./imputegap_assets/25_04_03_18_00_08_imputegap_plot.jpg'
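
For readers unfamiliar with the `z_score` normalizer used above, here is a minimal sketch of what z-score normalization does. It is written in plain Python, assumes each series is normalized independently with a population standard deviation, and uses a hypothetical helper name `z_score`; the library's own implementation may differ (for example, by normalizing along a different axis).

```python
import math

def z_score(series):
    """Return a copy of `series` rescaled to zero mean and unit variance."""
    n = len(series)
    mean = sum(series) / n
    # population standard deviation (divides by n)
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    if std == 0:
        # a constant series has no spread to rescale
        return [0.0 for _ in series]
    return [(v - mean) / std for v in series]

# hypothetical toy matrix: 2 series with 4 values each
toy = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 20.0, 20.0]]
normalized = [z_score(row) for row in toy]
print(normalized)
```

After this transformation every series has mean 0 and standard deviation 1, which makes series recorded on different scales comparable.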

The module **ts.datasets** contains all the publicly available datasets provided by the library, which can be listed as follows:

In [11]:
ts.datasets

['airq',
 'bafu',
 'chlorine',
 'climate',
 'drift',
 'eeg-alcohol',
 'eeg-reading',
 'electricity',
 'fmri-objectviewing',
 'fmri-stoptask',
 'forecast-economy',
 'meteo',
 'motion',
 'soccer',
 'temperature']

To load your own dataset, pass the path to your file to the **ts.load_series** function:

In [12]:
from imputegap.recovery.manager import TimeSeries
ts = TimeSeries()
ts.load_series("./dataset/test.txt")
ts.data


(SYS) The time series have been loaded from ./dataset/test.txt



array([[ 1.5,  2.5,  1.5,  2.5,  1.5,  2. ,  2. ,  2. ,  2. ,  2. ,  2. ,
         1. ,  1. ,  1. ,  1. ,  2. ,  1. ,  2. ,  1. ,  2. ,  2. ,  1.8,
         2.2,  2. ,  2.1],
       [ 0.5,  0.2,  0.3,  0.4,  0.9,  1. ,  0. ,  0. ,  0. ,  0. ,  0. ,
         0. ,  0. ,  1. ,  1. ,  1. ,  1. ,  1. ,  1. ,  1. ,  1. ,  0.8,
         1.2,  1. ,  1.1],
       [ 0.1,  2.9,  2.8,  2.7,  2.6,  2.5,  2.4,  2.3,  2.1,  2. ,  1.9,
         1.8,  1.7,  1.6,  1.5,  1.3,  1.2,  1.1,  1. ,  0.9,  0.9,  0.7,
         1.1,  1. ,  1.1],
       [ 0.1,  2. ,  1.9,  1.8,  1.7,  1.6,  1.5,  1.4,  1.3,  1.2,  1.1,
         1. ,  0.9,  0.8,  0.7,  0.6,  0.5,  0.4,  0.3,  0.2,  0.2,  0. ,
         0.4,  0.2,  0.1],
       [ 0.1,  1.8,  1.7,  1.6,  1.5,  1.4,  1.3,  1.2,  1.1,  1. ,  0.9,
         0.8,  0.7,  0.6,  0.5,  0.4,  0.3,  0.2,  0.1,  0. ,  0. , -0.2,
         0.2,  0. ,  0.1],
       [ 0.1,  1.8,  1.9,  1.6,  0.8,  1.8,  1.9,  1.1,  1. ,  1.9,  0.1,
         0.2,  0.6,  0.7,  0.8,  0.9,  0.4,  0.1,  

To import the time series as a matrix, pass it to the **ts.import_matrix** function:

In [13]:
from imputegap.recovery.manager import TimeSeries
ts = TimeSeries()
mat = [[2,3,9], [3,10,5], [-1,4,5], [0,0,0]]
ts.import_matrix(mat)
ts.data

array([[ 2,  3,  9],
       [ 3, 10,  5],
       [-1,  4,  5],
       [ 0,  0,  0]])

<h1>Contamination</h1>

We now describe how to simulate missing values in the loaded dataset. ImputeGAP implements eight different missingness patterns.

As an example, we show how to contaminate the eeg-alcohol dataset with the MCAR pattern:

In [14]:
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the time series object
ts = TimeSeries()

# load and normalize the dataset
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# contaminate the time series with the MCAR pattern
ts_m = ts.Contamination.mcar(ts.data, rate_dataset=0.2, rate_series=0.4, block_size=10, offset=0.05, seed=True)


(SYS) The time series have been loaded from /mnt/c/Users/nquen/switchdrive/MST_MasterThesis/imputegap/imputegap/dataset/eeg-alcohol.txt

> logs: normalization (z_score) of the data - runtime: 0.0005 seconds

(CONT) missigness pattern: MCAR
	selected series: ['52', '58', '0', '44', '5', '36', '16', '12', '25', '61', '56', '9', '40']
	rate series impacted: 20.0%
	missing rate per series: 40.0%
	block size: 10
	starting position: 12
	seed value: 42


In [15]:
ts_m

array([[ 2.19032859,  1.94726033,  1.70419207, ...,  1.76492802,
         1.64345612,  1.03572325],
       [ 1.67052626,  1.5655748 ,  1.46051581, ...,  1.61805053,
         1.61805053,  1.30298109],
       [ 2.25875436,  2.19663981,  2.19663981, ...,  1.51299795,
         1.26441248,  0.26994331],
       ...,
       [-0.88963417, -0.56993316,  0.07012397, ..., -0.0900541 ,
        -0.0900541 ,  0.07012397],
       [ 2.69887602,  2.1816959 ,  1.77939239, ...,  0.11274005,
         0.22761665, -0.28956347],
       [ 2.70885585,  2.18716136,  1.83928587, ...,  0.15808559,
         0.27396493, -0.24772957]])

In [17]:
# plot the contaminated time series
ts.plot(ts.data, ts_m, nbr_series=9, subplot=True, save_path="./imputegap_assets/contamination")


plots saved in  ./imputegap_assets/contamination/25_04_03_18_00_29_imputegap_plot.jpg


'./imputegap_assets/contamination/25_04_03_18_00_29_imputegap_plot.jpg'
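
To make the MCAR parameters more concrete, the following is a rough sketch of block-wise random contamination in plain Python. It is not the library's implementation: the function name `mcar_blocks`, the integer `seed`, the non-overlapping block grid, and the rounding rules are all assumptions. The defaults mirror the values printed in the logs of this notebook (20% of the series, 20% missing per series, block size 10). For the 64-series dataset, rounding 20% up gives 13 series and truncating a 0.05 offset on 256 values gives position 12, which is consistent with the log above.

```python
import math
import random

def mcar_blocks(data, rate_dataset=0.2, rate_series=0.2,
                block_size=10, offset=0.1, seed=42):
    """Return a contaminated copy of `data` and the indices of the affected series."""
    rng = random.Random(seed)
    out = [list(row) for row in data]
    n_series, n_values = len(out), len(out[0])

    # pick which series receive missing values
    n_selected = math.ceil(n_series * rate_dataset)
    selected = rng.sample(range(n_series), n_selected)

    # the first `offset` fraction of every series is left untouched
    start = int(n_values * offset)
    n_blocks = int(n_values * rate_series / block_size)

    for s in selected:
        # candidate block starts lie on a grid so that blocks never overlap
        slots = list(range(start, n_values - block_size + 1, block_size))
        for pos in rng.sample(slots, min(n_blocks, len(slots))):
            for i in range(pos, pos + block_size):
                out[s][i] = float("nan")
    return out, sorted(selected)

# hypothetical toy data: 10 series with 100 values each
toy = [[float(i) for i in range(100)] for _ in range(10)]
contaminated, selected = mcar_blocks(toy, rate_dataset=0.2, rate_series=0.4,
                                     block_size=10, offset=0.05)

# count the missing values per series (NaN is the only value not equal to itself)
missing_per_series = [sum(1 for v in row if v != v) for row in contaminated]
print(selected, missing_per_series)
```

Only the selected series contain NaN values, and the protected prefix of each series stays intact, which is why the first values of `ts_m` shown earlier are identical to the original data.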

All missingness patterns developed in ImputeGAP are available in the **ts.patterns** module, which can be listed as follows:

In [18]:
ts.patterns

['aligned',
 'disjoint',
 'distribution',
 'gaussian',
 'mcar',
 'overlap',
 'scattered']

<h1>Imputation</h1>

In this section, we illustrate how to impute the contaminated time series. Our library implements five families of imputation algorithms: Statistical, Machine Learning, Matrix Completion, Deep Learning, and Pattern Search.

Let’s illustrate the imputation using the CDRec algorithm from the Matrix Completion family.

In [19]:
from imputegap.recovery.imputation import Imputation
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the time series object
ts = TimeSeries()

# load and normalize the dataset
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# contaminate the time series
ts_m = ts.Contamination.mcar(ts.data)

# impute the contaminated series
imputer = Imputation.MatrixCompletion.CDRec(ts_m)
imputer.impute()

# compute and print the imputation metrics
imputer.score(ts.data, imputer.recov_data)
ts.print_results(imputer.metrics)

# plot the recovered time series
ts.plot(input_data=ts.data, incomp_data=ts_m, recov_data=imputer.recov_data, nbr_series=9, subplot=True, algorithm=imputer.algorithm, save_path="./imputegap_assets/imputation")


(SYS) The time series have been loaded from /mnt/c/Users/nquen/switchdrive/MST_MasterThesis/imputegap/imputegap/dataset/eeg-alcohol.txt

> logs: normalization (z_score) of the data - runtime: 0.0005 seconds

(CONT) missigness pattern: MCAR
	selected series: ['52', '58', '0', '44', '5', '36', '16', '12', '25', '61', '56', '9', '40']
	rate series impacted: 20.0%
	missing rate per series: 20.0%
	block size: 10
	starting position: 25
	seed value: 42

(IMPUTATION) CDRec: (64,256) for rank 3, epsilon 1e-06, and iterations 100.
> logs: imputation cdrec - Execution Time: 0.4537 seconds.

Results of the analysis :
RMSE                 = 0.40395406855137334
MAE                  = 0.3116556927747662
MI                   = 0.8410754313179323
CORRELATION          = 0.9127290819984344

plots saved in  ./imputegap_assets/imputation/25_04_03_18_01_03_cdrec_plot.jpg


'./imputegap_assets/imputation/25_04_03_18_01_03_cdrec_plot.jpg'

In [20]:
imputer.recov_data

array([[ 2.19032859,  1.94726033,  1.70419207, ...,  1.76492802,
         1.54269325,  0.6481282 ],
       [ 1.67052626,  1.5655748 ,  1.46051581, ...,  1.61805053,
         1.61805053,  1.30298109],
       [ 2.25875436,  2.19663981,  2.19663981, ...,  1.51299795,
         1.26441248,  0.26994331],
       ...,
       [-0.88963417, -0.56993316,  0.07012397, ..., -0.0900541 ,
        -0.0900541 ,  0.07012397],
       [ 2.69887602,  2.1816959 ,  1.77939239, ...,  0.11274005,
         0.22761665, -0.28956347],
       [ 2.70885585,  2.18716136,  1.83928587, ...,  0.15808559,
         0.27396493, -0.24772957]])
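
As an illustration of how error metrics such as RMSE and MAE can be obtained, here is a small plain-Python sketch. It assumes the errors are measured only at the positions that were missing in the contaminated matrix, which is a common convention but not something this notebook states; the helper name `rmse_mae_on_missing` and the toy values are hypothetical. The MI and correlation metrics are not covered.

```python
import math

def rmse_mae_on_missing(truth, incomplete, recovered):
    """Compute RMSE and MAE over the positions that are NaN in `incomplete`."""
    errors = [
        t - r
        for t_row, m_row, r_row in zip(truth, incomplete, recovered)
        for t, m, r in zip(t_row, m_row, r_row)
        if math.isnan(m)
    ]
    if not errors:
        raise ValueError("no missing values to evaluate")
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae

nan = float("nan")
truth = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]       # ground truth
incomplete = [[1.0, nan, 3.0], [4.0, 5.0, nan]]  # contaminated copy
recovered = [[1.0, 2.5, 3.0], [4.0, 5.0, 5.0]]   # imputed output
rmse, mae = rmse_mae_on_missing(truth, incomplete, recovered)
print(rmse, mae)
```

Restricting the evaluation to the imputed positions keeps the untouched values, which are reproduced exactly, from artificially lowering the error.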

Imputation can be performed using either default values or user-defined values. To specify the parameters, please use a dictionary in the following format:

In [21]:
config = {"rank": 5, "epsilon": 0.01, "iterations": 100}
imputer.impute(params=config)


(IMPUTATION) CDRec: (64,256) for rank 5, epsilon 0.01, and iterations 100.
> logs: imputation cdrec - Execution Time: 0.3087 seconds.


<imputegap.recovery.imputation.Imputation.MatrixCompletion.CDRec at 0x7fb7dc331d90>

All algorithms developed in ImputeGAP are available in the **ts.algorithms** module, which can be listed as follows:


In [22]:
ts.algorithms

['BRITS',
 'BayOTIDE',
 'BitGraph',
 'CDRec',
 'DeepMVI',
 'DynaMMo',
 'GAIN',
 'GRIN',
 'GROUSE',
 'HKMF_T',
 'IIM',
 'Interpolation',
 'IterativeSVD',
 'KNNImpute',
 'MICE',
 'MPIN',
 'MRNN',
 'MeanImpute',
 'MeanImputeBySeries',
 'MinImpute',
 'MissForest',
 'MissNet',
 'PRISTI',
 'ROSL',
 'SPIRIT',
 'STMVL',
 'SVT',
 'SoftImpute',
 'TKCM',
 'TRMF',
 'XGBOOST',
 'ZeroImpute']

<h1>Parameter Tuning</h1>

The Optimizer component manages algorithm configuration and hyperparameter tuning. To invoke the tuning process, set user_def=False in the impute call and pass a params dictionary containing the ground truth, the chosen optimizer, and, optionally, the optimizer’s options. Several search algorithms are available, including those provided by Ray Tune.

In [78]:
from imputegap.recovery.imputation import Imputation
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the time series object
ts = TimeSeries()

# load and normalize the dataset
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# contaminate the time series and set up the imputer
ts_m = ts.Contamination.mcar(ts.data)
imputer = Imputation.MatrixCompletion.CDRec(ts_m)

# use Ray Tune to fine tune the imputation algorithm
imputer.impute(user_def=False, params={"input_data": ts.data, "optimizer": "ray_tune"})

# compute the imputation metrics with optimized parameter values
imputer.score(ts.data, imputer.recov_data)

# compute the imputation metrics with default parameter values
imputer_def = Imputation.MatrixCompletion.CDRec(ts_m).impute()
imputer_def.score(ts.data, imputer_def.recov_data)

# print the imputation metrics with default and optimized parameter values
ts.print_results(imputer_def.metrics, text="Default values")
ts.print_results(imputer.metrics, text="Optimized values")

# plot the recovered time series
ts.plot(input_data=ts.data, incomp_data=ts_m, recov_data=imputer.recov_data, nbr_series=9, subplot=True, algorithm=imputer.algorithm, save_path="./imputegap_assets/imputation")

# save hyperparameters
utils.save_optimization(optimal_params=imputer.parameters, algorithm=imputer.algorithm, dataset="eeg-alcohol", optimizer="ray_tune")

2025-04-03 15:51:53,660	INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 134.21.219.180:6379...
2025-04-03 15:51:53,678	INFO worker.py:1841 -- Connected to Ray cluster.
2025-04-03 15:51:53,703	INFO tune.py:616 -- [output] This uses the legacy output and progress reporter, as Jupyter notebooks are not supported by the new engine, yet. For more information, please see https://github.com/ray-project/ray/issues/36949



(SYS) The time series have been loaded from /tmp/pycharm_project_615/imputegap/dataset/eeg-alcohol.txt

> logs: normalization (z_score) of the data - runtime: 0.0002 seconds

(CONT) missigness pattern: MCAR
	selected series: ['52', '58', '0', '44', '5', '36', '16', '12', '25', '61', '56', '9', '40']
	rate series impacted: 20.0%
	missing rate per series: 20.0%
	block size: 10
	starting position: 25
	seed value: 42

(OPTI) optimizer ray_tune has been called with cdrec ...


		(OPTI) > Ray Total accessible CPU cores for parallelization: 63.0

		(OPTI) > Ray Total accessible memory for parallelization: 341.62 GB

		(OPTI) > Ray tune max_concurrent_trials 63.0, for 1 calls and metric RMSE


		(OPTI) > Ray tune - SEARCH SPACE: {'rank': {'grid_search': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]}, 'eps': <ray.tune.search.sample.Float object at 0x7f2d5a7c12a0>, 'iters': {'grid_search': [50, 100, 150]}}



Current time: 2025-04-03 15:51:54
Running for:  00:00:00.34
Memory:       10.5/503.6 GiB

Trial name                     status   loc  eps          iters  rank
objective_wrapper_9069f_00000  PENDING       0.0850788    50     2
objective_wrapper_9069f_00001  PENDING       0.000569628  100    2
objective_wrapper_9069f_00002  PENDING       0.000140213  150    2
objective_wrapper_9069f_00003  PENDING       0.000923155  50     3
objective_wrapper_9069f_00004  PENDING       0.0932171    100    3
objective_wrapper_9069f_00005  PENDING       0.000102624  150    3
objective_wrapper_9069f_00006  PENDING       1.10298e-05  50     4
objective_wrapper_9069f_00007  PENDING       0.0187104    100    4
objective_wrapper_9069f_00008  PENDING       0.0903378    150    4
objective_wrapper_9069f_00009  PENDING       4.01629e-06  50     5


(objective_wrapper pid=4153593) PARAMS  {'rank': 3, 'eps': 0.0009231551216708345, 'iters': 50}
(objective_wrapper pid=4153593) (IMPUTATION) CDRec: (64,256) for rank 3, epsilon 0.0009231551216708345, and iterations 50.


Current time: 2025-04-03 15:52:15
Running for:  00:00:21.98
Memory:       12.0/503.6 GiB

Trial name                     status      loc                     eps          iters  rank  iter  total time (s)  RMSE
objective_wrapper_9069f_00000  TERMINATED  134.21.219.180:4153455  0.0850788    50     2     1     0.532666        0.54874
objective_wrapper_9069f_00001  TERMINATED  134.21.219.180:4153206  0.000569628  100    2     1     0.0249593       0.543549
objective_wrapper_9069f_00002  TERMINATED  134.21.219.180:4153201  0.000140213  150    2     1     0.335326        0.543543
objective_wrapper_9069f_00003  TERMINATED  134.21.219.180:4153593  0.000923155  50     3     1     0.569868        0.40398
objective_wrapper_9069f_00004  TERMINATED  134.21.219.180:4153409  0.0932171    100    3     1     0.466867        0.405368
objective_wrapper_9069f_00005  TERMINATED  134.21.219.180:4153832  0.000102624  150    3     1     0.0128779       0.403958
objective_wrapper_9069f_00006  TERMINATED  134.21.219.180:4153506  1.10298e-05  50     4     1     0.0532017       0.352783
objective_wrapper_9069f_00007  TERMINATED  134.21.219.180:4153098  0.0187104    100    4     1     0.0353734       0.352575
objective_wrapper_9069f_00008  TERMINATED  134.21.219.180:4154110  0.0903378    150    4     1     0.278268        0.354158
objective_wrapper_9069f_00009  TERMINATED  134.21.219.180:4153500  4.01629e-06  50     5     1     0.039011        0.309537


(objective_wrapper pid=4153593) > logs: imputation cdrec - Execution Time: 0.1568 seconds.

2025-04-03 15:52:15,684	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/ubuntu/ray_results/objective_wrapper_2025-04-03_15-51-53' in 0.0083s.
2025-04-03 15:52:15,694	INFO tune.py:1041 -- Total run time: 21.99 seconds (21.97 seconds for the tuning loop).



		(OPTI) > Ray tune - BEST CONFIG: {'rank': 11, 'eps': 2.3071580266676692e-05, 'iters': 150}


> logs: optimization ray tune - Execution Time: 22.0338 seconds

(objective_wrapper pid=4156172) PARAMS  {'rank': 14, 'eps': 6.663131854225725e-05, 'iters': 150} [repeated 2x across cluster]
(objective_wrapper pid=4156172) (IMPUTATION) CDRec: (64,256) for rank 14, epsilon 6.663131854225725e-05, and iterations 150. [repeated 2x across cluster]
(objective_wrapper pid=4156172) > logs: imputation cdrec - Execution Time: 0.1052 seconds. [repeated 2x across cluster]

(IMPUTATION) CDRec: (64,256) for rank 11, epsilon 2.3071580266676692e-05, and iterations 150.
> logs: imputation cdrec - Execution Time: 0.5628 seconds.

(IMPUTATION) CDRec: (64,256) for rank 3, epsilon 1e-06, and iterations 100.
> logs: imputation cdrec - Execution Time: 0.1204 seconds.

Default values :
RMSE                 = 0.40395406855137334
MAE                  = 0.3116

In [79]:
imputer.parameters

(11, 2.3071580266676692e-05, 150)
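
The tuple printed above appears to follow the order rank, epsilon, iterations, matching the imputation log line for the best configuration. Assuming that order holds, the values can be turned back into a dictionary with the same keys as the user-defined configuration shown in the Imputation section. This is only a sketch with the values copied by hand; it does not call the library.

```python
# values copied from the tuple printed above
best = (11, 2.3071580266676692e-05, 150)

# key names follow the user-defined configuration example (assumed order)
keys = ("rank", "epsilon", "iterations")
config = dict(zip(keys, best))
print(config)
```

A dictionary of this shape has the same format as the one passed through the `params` argument earlier in this notebook.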

All optimizers developed in ImputeGAP are available in the **ts.optimizers** module, which can be listed as follows:

In [23]:
ts.optimizers

['bayesian', 'greedy', 'particle_swarm', 'ray_tune', 'successive_halving']
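
To give an intuition for what a tuner does with a search space like the one logged during the Ray Tune run (a grid over `rank` and `iters`, plus a sampled `eps`), here is a simplified plain-Python sketch. It does not use Ray Tune or ImputeGAP: `toy_objective` is a hypothetical stand-in for "impute and return the RMSE", and the log-uniform range for `eps` is an assumption inferred from the sampled values in the trial table.

```python
import itertools
import math
import random

def toy_objective(rank, eps, iters):
    """Hypothetical stand-in for 'run the imputation and return the RMSE'."""
    return 1.0 / rank + eps + 1.0 / iters

def search(objective, ranks, iters_grid, eps_low=1e-6, eps_high=1e-1, seed=42):
    """Try every (rank, iters) pair with a randomly sampled eps; keep the best."""
    rng = random.Random(seed)
    best_config, best_score = None, math.inf
    for rank, iters in itertools.product(ranks, iters_grid):
        # draw eps log-uniformly between eps_low and eps_high (assumed range)
        eps = 10 ** rng.uniform(math.log10(eps_low), math.log10(eps_high))
        score = objective(rank, eps, iters)
        if score < best_score:
            best_config = {"rank": rank, "eps": eps, "iters": iters}
            best_score = score
    return best_config, best_score

best_config, best_score = search(
    toy_objective, ranks=range(2, 16), iters_grid=[50, 100, 150]
)
print(best_config, best_score)
```

A real tuner evaluates each trial by running the imputation against the ground truth and can run trials in parallel, but the selection logic is the same: keep the configuration with the lowest metric.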