# Creating records 
## Pipeline 2.0
##### ASTROMER dev team

*JAN 17 2023*

In [5]:
cd /home/ubuntu/astromer

/home/ubuntu/astromer


In [11]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os

from src.data.record import DataPipeline

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [12]:
METAPATH = './data/raw_data/alcock/metadata.csv'
# LCDIR = 'LCs/' 
LCDIR = './data/raw_data/alcock/LCs/'

In [13]:
metadata = pd.read_csv(METAPATH)

In [14]:
metadata['Class'] = pd.Categorical(metadata['Class'])
metadata['Label'] = metadata['Class'].cat.codes
metadata['Path'] = metadata['Path'].apply(lambda x: os.path.join(LCDIR, x)) 

In [15]:
metadata.sample()

Unnamed: 0,ID,Class,Path,Band,Label
7509,22.4385.13,LPV,./data/raw_data/alcock/LCs/22.4385.13.dat,1.0,3


### Using DataPipeline class

In [16]:
pipeline = DataPipeline(metadata=metadata, 
                        context_features=['ID', 'Label', 'Class'],
                        sequential_features=['mjd', 'mag'],)

[INFO] 21444 samples loaded


To create the training, validation, and testing splits, use the `train_val_test` method:
```
train_val_test(val_frac=0.2,
               test_frac=0.2,
               test_meta=None,
               val_meta=None,
               shuffle=True,
               id_column_name=None,
               k_fold=1)
``` 
where `val_frac` and `test_frac` are the fractions of the metadata to be used as the validation and testing subsets, respectively.
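As a rough illustration of what a fraction-based split does, here is a minimal sketch on a toy table. The helper `split_by_fraction` and the `toy_meta` DataFrame are hypothetical and are not part of `DataPipeline`. The sketch takes both fractions from the full table, which may differ from how `train_val_test` applies or rounds them.

```python
import numpy as np
import pandas as pd

def split_by_fraction(meta, val_frac=0.2, test_frac=0.2, seed=0):
    """Hypothetical helper: label each row as train, validation, or test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(meta))       # shuffled row positions
    n_test = int(len(meta) * test_frac)
    n_val = int(len(meta) * val_frac)

    labels = np.full(len(meta), 'train', dtype=object)
    labels[order[:n_test]] = 'test'
    labels[order[n_test:n_test + n_val]] = 'validation'
    return meta.assign(subset=labels)

# Toy metadata with ten objects (illustrative only)
toy_meta = pd.DataFrame({'ID': [f'obj_{i}' for i in range(10)]})
toy_split = split_by_fraction(toy_meta, val_frac=0.2, test_frac=0.2)
print(toy_split['subset'].value_counts())
```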

Additionally, you can pass `val_meta` and `test_meta` to use preselected subsets. **If you supply your own test/val subset, it must share one of the identifier columns of the main DataFrame** (by default, the first column of the dataset is assumed to be the identifier).

Both `test_meta` and `val_meta` must be lists of `DataFrame`s.
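The role of the identifier column can be pictured with plain pandas. This is only a sketch of the matching idea on hypothetical toy tables (`toy_meta`, `toy_test`), not the library's internal code: rows of the main table whose identifier appears in the preselected subset get marked as test.

```python
import pandas as pd

# Toy main table and a preselected test subset (both hypothetical)
toy_meta = pd.DataFrame({'ID': ['a', 'b', 'c', 'd', 'e'],
                         'Label': [0, 1, 0, 1, 0]})
toy_test = pd.DataFrame({'ID': ['b', 'e']})

# Only the shared identifier column is needed to find the matching rows
is_test = toy_meta['ID'].isin(toy_test['ID'])
toy_meta['subset'] = is_test.map({True: 'test', False: 'train'})
print(toy_meta)
```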

For cross-validation, we can also sample different folds from the same dataset by setting the `k_fold` parameter of `train_val_test` to a value greater than 1.

If `k_fold > 1` and **you want to use a predefined test/val selection**, you should pass a list of `DataFrame`s, one per fold, as `test_meta`/`val_meta` as appropriate.

Don't worry about removing duplicated indices; the `train_val_test` method will do it for you.
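For the `k_fold > 1` case with predefined selections, the list could be built as below. This is a sketch on a hypothetical toy table. The commented call at the end follows the description above and has not been run here.

```python
import pandas as pd

k_folds = 3
# Toy metadata (hypothetical); the real table comes from metadata.csv
toy_meta = pd.DataFrame({'ID': [f'obj_{i}' for i in range(30)]})

# One preselected test DataFrame per fold, each drawn with a different seed
test_metas = [toy_meta.sample(n=5, random_state=k) for k in range(k_folds)]

# With a real pipeline, the list would then be passed as described above:
# pipeline.train_val_test(val_frac=0.2, test_meta=test_metas, k_fold=k_folds)
```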

In [17]:
test_metadata = metadata.sample(n=100)

In [18]:
k_folds = 4

pipeline.train_val_test(val_frac=0.2, 
                        test_meta=[test_metadata], 
                        k_fold=k_folds)

[INFO] Using ID col as sample identifier
[INFO] Shuffling
[INFO] Shuffling
[INFO] Shuffling
[INFO] Shuffling


In [19]:
a = pipeline.metadata['subset_0']
for k in range(1, k_folds):
    b = pipeline.metadata[f'subset_{k}']
    # Compare the non-test assignments of consecutive folds
    c = np.array_equal(a[a != 'test'].values, b[b != 'test'].values)
    a = b
    print('Do {}-folds partitions have the same elements: '.format(k_folds), c)

Do 4-folds partitions have the same elements:  False
Do 4-folds partitions have the same elements:  False
Do 4-folds partitions have the same elements:  False


Now our metadata contains one extra column per fold (`subset_0`, ..., `subset_3`) indicating the subset each sample belongs to.

In [20]:
pipeline.metadata.sample(3)

Unnamed: 0,ID,Class,Path,Band,Label,subset_0,subset_1,subset_2,subset_3
12966,6.6940.7522,RRab,./data/raw_data/alcock/LCs/6.6940.7522.dat,1.0,4,train,train,validation,train
10451,5.4646.26,LPV,./data/raw_data/alcock/LCs/5.4646.26.dat,1.0,3,validation,train,test,test
4828,17.2351.24,EC,./data/raw_data/alcock/LCs/17.2351.24.dat,1.0,2,train,test,validation,train
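A quick way to summarize these columns is to count the subset labels per fold. The sketch below uses a small hypothetical stand-in for `pipeline.metadata`; with the real table, the same two lines after the toy definition should apply unchanged.

```python
import pandas as pd

# Toy stand-in for pipeline.metadata after a two-fold split (hypothetical)
toy_meta = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd', 'e'],
    'subset_0': ['train', 'train', 'validation', 'test', 'train'],
    'subset_1': ['test', 'train', 'train', 'train', 'validation'],
})

# One column per fold, one row per subset name
subset_cols = [c for c in toy_meta.columns if c.startswith('subset_')]
sizes = toy_meta[subset_cols].apply(lambda col: col.value_counts())
print(sizes)
```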


In [21]:
for k in range(k_folds):
    train_subset = pipeline.metadata[pipeline.metadata[f'subset_{k}'] == 'train']
    val_subset   = pipeline.metadata[pipeline.metadata[f'subset_{k}'] == 'validation']
    test_subset  = pipeline.metadata[pipeline.metadata[f'subset_{k}'] == 'test']

    print(train_subset.shape, val_subset.shape, test_subset.shape)

    # .any() flags any shared ID between two subsets (.all() would only detect full containment)
    print('test in train?: ', test_subset['ID'].isin(train_subset['ID']).any(),'\n',
          'val in train?: ', val_subset['ID'].isin(train_subset['ID']).any(),'\n',
          'val in test?: ', val_subset['ID'].isin(test_subset['ID']).any())

(17075, 9) (4269, 9) (100, 9)
test in train?:  False 
 val in train?:  False 
 val in test?:  False
(13724, 9) (3431, 9) (4289, 9)
test in train?:  False 
 val in train?:  False 
 val in test?:  False
(13724, 9) (3431, 9) (4289, 9)
test in train?:  False 
 val in train?:  False 
 val in test?:  False
(13724, 9) (3431, 9) (4289, 9)
test in train?:  False 
 val in train?:  False 
 val in test?:  False


Note that if you want to redo the splits, you must initialize the `DataPipeline` object again.

Now it is **time to run the pipeline**

In [22]:
%%time
var = pipeline.run(n_jobs=8)

Processing train subset_0:   0%|          | 0/4 [00:00<?, ?it/s]

Writting train fold 0:   0%|          | 0/4 [00:16<?, ?it/s]

CPU times: user 1min 38s, sys: 9.38 s, total: 1min 47s
Wall time: 2min 5s





### Customize the preprocessing method of DataPipeline

Your custom method must keep the same parameters as the original, i.e., `row, context_features, sequential_features`.

The **output** should also be a tuple containing the light curve (`pd.DataFrame`) and the context values (`dict`).


To modify the `process_sample` method we need to create a new class (`MyPipeline`) that inherits from `DataPipeline` 

In [23]:
class MyPipeline(DataPipeline):
    @staticmethod
    def process_sample(row, context_features, sequential_features):
        # Load the light curve and name its three columns
        observations = pd.read_csv(row['Path'])
        observations.columns = ['mjd', 'mag', 'errmag']
        observations = observations.dropna()
        # sort_values and boolean filtering return new objects, so reassign
        observations = observations.sort_values('mjd')
        observations = observations[observations['errmag'] < 1]
        context_features_values = row[context_features]
        return observations, context_features_values
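Before handing a custom method to the pipeline, it can help to try the same logic on a single toy file. The sketch below is a standalone stand-in with a hypothetical name (`process_sample_sketch`) and a made-up three-column file, so it runs without `DataPipeline`. It converts the context values with `to_dict()` to follow the output contract stated above.

```python
import os
import tempfile
import pandas as pd

def process_sample_sketch(row, context_features, sequential_features):
    """Standalone stand-in for a custom process_sample (hypothetical name)."""
    observations = pd.read_csv(row['Path'])
    observations.columns = ['mjd', 'mag', 'errmag']
    observations = observations.dropna().sort_values('mjd')
    observations = observations[observations['errmag'] < 1]
    return observations, row[context_features].to_dict()

with tempfile.TemporaryDirectory() as tmp:
    # Made-up light curve: one row with a missing value, one with a large error
    path = os.path.join(tmp, 'toy.dat')
    pd.DataFrame({'t': [2.0, 1.0, 3.0],
                  'm': [15.1, 15.0, None],
                  'e': [0.1, 2.0, 0.1]}).to_csv(path, index=False)
    row = pd.Series({'ID': 'toy', 'Label': 0, 'Class': 'LPV', 'Path': path})
    lightcurve, context = process_sample_sketch(
        row, ['ID', 'Label', 'Class'], ['mjd', 'mag'])

print(lightcurve)
print(context)
```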

The next steps are the same as with the original `DataPipeline` class.

In [24]:
custom_pipeline = MyPipeline(metadata=metadata, 
                             context_features=['ID', 'Label', 'Class'],
                             sequential_features=['mjd', 'mag'])

[INFO] 21444 samples loaded


In [25]:
%%time
var = custom_pipeline.run(n_jobs=8)

Processing full subset_0:   0%|          | 0/1 [00:00<?, ?it/s]

Writting full fold 0:   0%|          | 0/1 [00:29<?, ?it/s]

CPU times: user 28.5 s, sys: 1.37 s, total: 29.9 s
Wall time: 30 s



