# Creating records 
## Pipeline 2.0
##### ASTROMER dev team

*JAN 17 2023*

In [1]:
cd /home

/home


In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import os

from src.data.record import DataPipeline

%load_ext autoreload
%autoreload 2

In [3]:
METAPATH = './data/raw_data/alcock/metadata.csv'
# LCDIR = 'LCs/' 
LCDIR = './data/raw_data/alcock/LCs/'

In [4]:
metadata = pd.read_csv(METAPATH)

In [5]:
metadata['Class'] = pd.Categorical(metadata['Class'])
metadata['Label'] = metadata['Class'].cat.codes
metadata['Path'] = metadata['Path'].apply(lambda x: os.path.join(LCDIR, x)) 

In [6]:
metadata.sample()

Unnamed: 0,ID,Class,Path,Band,Label
5798,18.3210.910,RRab,./data/raw_data/alcock/LCs/18.3210.910.dat,1.0,4


### Using DataPipeline class

In [7]:
pipeline = DataPipeline(metadata=metadata, 
                        context_features=['ID', 'Label', 'Class'],
                        sequential_features=['mjd', 'mag'],)

[INFO] 21444 samples loaded


To create training, validation, and testing splits, use the `train_val_test` method:
```
train_val_test(val_frac=0.2,
               test_frac=0.2,
               test_meta=None,
               val_meta=None,
               shuffle=True,
               id_column_name=None)
``` 
where `val_frac` and `test_frac` are the fractions of the metadata to use as the validation and testing subsets, respectively. Alternatively, you can pass your own preselected subsets through `val_meta` and `test_meta`. **Notice that if you supply your own test/val subset, it must share one of the identifier columns of the main DataFrame** (by default, the first column of the dataset is assumed to be the identifier).
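
The splitting logic described above can be sketched in plain pandas. This is only an illustration, not the `DataPipeline` implementation: `split_metadata`, its `seed` argument, and the toy `ID` values are hypothetical. It assumes the validation fraction is taken from the samples left after the test subset is set aside, which is consistent with the subset sizes printed later in this notebook (4269 is about 0.2 × 21344).

```python
import pandas as pd


def split_metadata(metadata, val_frac=0.2, test_frac=0.2, test_meta=None,
                   id_column_name="ID", seed=0):
    """Hypothetical helper: label each row as train, validation, or test."""
    meta = metadata.copy()
    meta["subset"] = "train"

    if test_meta is not None:
        # A preselected test subset is matched through the identifier column.
        is_test = meta[id_column_name].isin(test_meta[id_column_name])
    else:
        test_ids = meta[id_column_name].sample(frac=test_frac, random_state=seed)
        is_test = meta[id_column_name].isin(test_ids)
    meta.loc[is_test, "subset"] = "test"

    # Validation rows are drawn only from rows not already assigned to test.
    remaining = meta[~is_test]
    val_index = remaining.sample(frac=val_frac, random_state=seed).index
    meta.loc[val_index, "subset"] = "validation"
    return meta


# Toy metadata with made-up identifiers and a preselected test subset.
toy = pd.DataFrame({"ID": [f"obj_{i}" for i in range(100)]})
preselected = toy.sample(n=10, random_state=1)
labelled = split_metadata(toy, val_frac=0.2, test_meta=preselected)
print(labelled["subset"].value_counts().to_dict())
```

Because the test rows are identified by ID before anything else is sampled, they cannot leak into the training or validation subsets.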

In [8]:
test_metadata = metadata.sample(n=100)

There is no need to remove the duplicated indices yourself; the `train_val_test` method does it for you.

In [9]:
pipeline.train_val_test(val_frac=0.2, test_meta=test_metadata)

[INFO] Using ID col as sample identifier
[INFO] Shuffling


The metadata now contains an extra column, `subset`, indicating the subset each sample belongs to.

In [12]:
pipeline.metadata.sample(3)

Unnamed: 0,ID,Class,Path,Band,Label,subset
13406,7.7655.582,RRab,./data/raw_data/alcock/LCs/7.7655.582.dat,1.0,4,validation
11739,55.3616.278,RRc,./data/raw_data/alcock/LCs/55.3616.278.dat,1.0,5,train
20589,9.4518.64,EC,./data/raw_data/alcock/LCs/9.4518.64.dat,1.0,2,validation
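
A quick way to inspect a `subset` column like this one is `value_counts`, which reports how many samples fall in each subset. The sketch below uses a small made-up DataFrame as a stand-in for `pipeline.metadata`; the names `toy_metadata`, `subset_counts`, and `subset_fractions` are illustrative only.

```python
import pandas as pd

# Toy stand-in for pipeline.metadata after the split (values are made up).
toy_metadata = pd.DataFrame({
    "ID": ["a", "b", "c", "d", "e"],
    "subset": ["train", "train", "train", "validation", "test"],
})

# Absolute number of samples per subset.
subset_counts = toy_metadata["subset"].value_counts()
# Same information expressed as fractions of the whole table.
subset_fractions = toy_metadata["subset"].value_counts(normalize=True)

print(subset_counts)
print(subset_fractions)
```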


In [20]:
train_subset = pipeline.metadata[pipeline.metadata['subset'] == 'train']
val_subset   = pipeline.metadata[pipeline.metadata['subset'] == 'validation']
test_subset  = pipeline.metadata[pipeline.metadata['subset'] == 'test']

print(train_subset.shape, val_subset.shape, test_subset.shape)

# .any() flags an overlap as soon as a single ID is shared between two subsets
print('test in train?: ', test_subset['ID'].isin(train_subset['ID']).any(),'\n',
      'val in train?: ', val_subset['ID'].isin(train_subset['ID']).any(),'\n',
      'val in test?: ', val_subset['ID'].isin(test_subset['ID']).any())

(17075, 6) (4269, 6) (100, 6)
test in train?:  False 
 val in train?:  False 
 val in test?:  False


In [8]:
%%time
var = pipeline.run(n_jobs=8)

[INFO] Processing data...
[INFO] Writing records...


  0%|                                                              | 0/21444 [00:00<?, ?it/s]
2023-01-17 15:12:58.726941: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: UNKNOWN ERROR (34)
2023-01-17 15:12:58.726982: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (e1977df45e7c): /proc/driver/nvidia/version does not exist
2023-01-17 15:12:58.727304: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
100%|████████████████████████████████████████████████| 21444/21444 [00:06<00:00, 3192.22it/s]

CPU times: user 22.3 s, sys: 1.24 s, total: 23.5 s
Wall time: 26.6 s





### Customize the preprocessing method of DataPipeline

A custom `process_sample` method must keep the same parameters as the original, i.e., `row, context_features, sequential_features`.

The **output** must also be a tuple containing the light curve (`pd.DataFrame`) and the context values (`dict`).


To modify the `process_sample` method, create a new class (`MyPipeline`) that inherits from `DataPipeline`:

In [20]:
class MyPipeline(DataPipeline):
    @staticmethod
    def process_sample(row, context_features, sequential_features):
        # Load the light curve referenced by this metadata row
        observations = pd.read_csv(row['Path'])
        observations.columns = ['mjd', 'mag', 'errmag']
        observations = observations.dropna()
        # sort_values and boolean indexing return new objects, so assign them back
        observations = observations.sort_values('mjd')
        observations = observations[observations['errmag'] < 1]
        context_features_values = row[context_features]
        return observations, context_features_values
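
Before launching a full run, it can help to try the preprocessing on a single row. The sketch below is a standalone function modelled on the method above, with the sort and filter results assigned back, applied to a tiny made-up light curve written to a temporary file. The file name `toy.dat`, the values, and the `row` contents are all hypothetical; nothing here touches `DataPipeline` itself.

```python
import os
import tempfile

import pandas as pd


def process_sample(row, context_features, sequential_features):
    """Standalone copy of a custom preprocessing step, for testing only."""
    observations = pd.read_csv(row["Path"])
    observations.columns = ["mjd", "mag", "errmag"]
    observations = observations.dropna()
    observations = observations.sort_values("mjd")
    observations = observations[observations["errmag"] < 1]
    return observations, row[context_features]


# Write a tiny made-up light curve to a temporary file.
tmp_dir = tempfile.mkdtemp()
lc_path = os.path.join(tmp_dir, "toy.dat")
pd.DataFrame({
    "mjd": [3.0, 1.0, 2.0, 4.0],
    "mag": [15.1, 15.3, None, 15.2],   # one missing magnitude
    "errmag": [0.02, 0.03, 0.01, 2.5],  # one error above the cut
}).to_csv(lc_path, index=False)

# A single metadata-like row pointing at the temporary file.
row = pd.Series({"ID": "toy", "Label": 0, "Class": "RRab", "Path": lc_path})
lightcurve, context = process_sample(row, ["ID", "Label", "Class"], ["mjd", "mag"])

print(lightcurve)
print(context.to_dict())
```

The row with the missing magnitude and the row with `errmag >= 1` are dropped, and the remaining observations come back sorted by `mjd`.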

The next steps are the same as with the original `DataPipeline` class.

In [21]:
custom_pipeline = MyPipeline(metadata=metadata, 
                             context_features=['ID', 'Label', 'Class'],
                             sequential_features=['mjd', 'mag'])

[INFO] 21444 samples loaded


In [22]:
%%time
var = custom_pipeline.run(n_jobs=8)

[INFO] Processing data...
[INFO] Writing records...


100%|████████████████████████████████████████████████| 21444/21444 [00:08<00:00, 2488.90it/s]

CPU times: user 22.1 s, sys: 1.27 s, total: 23.4 s
Wall time: 24.9 s



