# Tutorial: Creating TF Records

Last updated on March 21, 2022 by Cristobal Donoso

In [1]:
cd ../..

/home/cridonoso/Documents/astromer


In [3]:
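import os
import matplotlib.pyplot as plt
import pandas as pd

# Assumed import path: create_dataset is used below and the text places it in core/data.py
from core.data import create_dataset

os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # hide GPUs; run on CPU only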

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


`ASTROMER/core/data.py` contains everything you need to create your own tfrecord dataset. The only prerequisites are:
- a folder with light curves, each stored in its own file and readable as a single dataframe containing `times`, `mag`, and `obserr`
- a metadata file with at least these columns: `Path` (location of the light curve file within the light curves folder), `Class` (the object class, which can be set to `UNK` for unknown sources), `Band` (the filter), and `ID` (a unique identifier).
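Before building records, it can help to confirm that a metadata file has the four columns listed above. The snippet below is a minimal sketch of such a check; the helper `missing_metadata_columns` is hypothetical (it is not part of ASTROMER) and only inspects column names.

```python
# Columns the tutorial lists as the minimum for the metadata file
REQUIRED_COLUMNS = ("ID", "Class", "Path", "Band")


def missing_metadata_columns(columns):
    """Return the required metadata columns that are absent from `columns`.

    Hypothetical helper for illustration; `columns` can be any iterable of
    names, for example the `.columns` attribute of a pandas DataFrame.
    """
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# A header without `Band` is reported as incomplete
print(missing_metadata_columns(["ID", "Class", "Path"]))  # prints ['Band']
```

With pandas, the same check can be applied to a loaded file by passing `pd.read_csv("metadata.csv").columns`. Extra columns are ignored, so an exported index column does not trigger a failure.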

To download the datasets used in this work, run `./presentation/scripts/get_data.py`. For example:
```
python -m presentation.scripts.get_data --dataset alcock
```

Assuming we already have the `alcock` dataset,

In [27]:
lcs_path = './data/raw_data/alcock/LCs/' # lightcurves folder
metadata = './data/raw_data/alcock/metadata.csv' # metadata file

Reading `metadata.csv` file

In [28]:
meta = pd.read_csv(metadata)
meta.sample(1)

| Unnamed: 0 | ID | Class | Path | Band |
|---|---|---|---|---|
| 18555 | 80.7322.3124 | RRc | 80.7322.3124.dat | 1.0 |


### Lightcurve frame sample

In [30]:
lc_path = os.path.join(lcs_path, meta['Path'].sample(1).values[0])
lc_df = pd.read_csv(lc_path)
lc_df.head()

| Unnamed: 0 | mjd | mag | err |
|---|---|---|---|
| 0 | 48884.79297 | -7.934 | 0.017 |
| 1 | 48885.74219 | -7.708 | 0.016 |
| 2 | 48908.55859 | -7.942 | 0.015 |
| 3 | 48915.67969 | -7.865 | 0.042 |
| 4 | 48917.62891 | -7.95 | 0.014 |


### Creating dataset 

First, we split the metadata into `training` and `testing` subsets. In this case, we randomly select 100 objects per class for testing:

In [31]:
test_meta  = pd.concat([frame.sample(n=100) for g, frame in meta.groupby('Class')])
train_meta = meta[~meta['ID'].isin(test_meta['ID'])]
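The per-class split above draws a different sample on every run and raises an error if a class has fewer than 100 objects. The following is a minimal sketch of a reproducible variant, assuming a metadata frame with `ID` and `Class` columns; the function `split_per_class`, its `seed` argument, and the toy data are illustrative and not part of ASTROMER.

```python
import pandas as pd


def split_per_class(meta, n_test=100, seed=0):
    """Hold out up to `n_test` rows per class; the remaining rows form the training set.

    Hypothetical helper for illustration only.
    """
    test = pd.concat([
        # min(...) avoids an error when a class has fewer than n_test objects
        frame.sample(n=min(n_test, len(frame)), random_state=seed)
        for _, frame in meta.groupby("Class")
    ])
    # Everything not selected for testing goes to training
    train = meta[~meta["ID"].isin(test["ID"])]
    return train, test


# Toy metadata with two classes (made-up values, for illustration only)
toy = pd.DataFrame({
    "ID": [f"obj{i}" for i in range(10)],
    "Class": ["RRc"] * 6 + ["UNK"] * 4,
})
train, test = split_per_class(toy, n_test=5)
print(len(train), len(test))
```

Fixing `random_state` makes the split repeatable across notebook runs, which is convenient when comparing models on the same test set. In the toy data the `UNK` class has only 4 objects, so all of them go to the test set.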

In [40]:
'testing samples: {} training samples: {}'.format(test_meta.shape[0], train_meta.shape[0])

'testing samples: 700 training samples: 20744'

The `create_dataset()` function receives:
- `meta_df`: training set metadata 
- `source`: folder with light curves samples
- `target`: folder to save tfrecord files
- `n_jobs`: number of cores to distribute writing
- `subsets_frac`: training and validation fractions
- `test_subset`: testing set metadata
- `max_lcs_per_record`: maximum number of light curves per tfrecord chunk
- `**kwargs`: additional parameters for the `pd.read_csv()` function 
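To get a feel for `max_lcs_per_record`, the sketch below estimates how many tfrecord chunks a group of light curves would need. It assumes that light curves written together are packed into chunks of at most `max_lcs_per_record` each; this is an assumption about the behavior, not something taken from `core/data.py`, and the helper `n_record_chunks` is hypothetical.

```python
import math


def n_record_chunks(n_lcs, max_lcs_per_record):
    """Estimate the number of chunks needed to hold `n_lcs` light curves.

    Hypothetical helper: assumes each chunk holds at most
    `max_lcs_per_record` light curves.
    """
    if max_lcs_per_record <= 0:
        raise ValueError("max_lcs_per_record must be positive")
    return math.ceil(n_lcs / max_lcs_per_record)


print(n_record_chunks(700, 20000))    # prints 1
print(n_record_chunks(45000, 20000))  # prints 3
```

Under this assumption, a large `max_lcs_per_record` yields fewer, bigger files, while a small value yields many small files.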

In [41]:
%%time
create_dataset(train_meta, 
               lcs_path, 
               target='./data/records/alcock/', 
               max_lcs_per_record=20000, 
               n_jobs=7, 
               subsets_frac=(0.8, 0.2), 
               test_subset=test_meta)

100%|████████████████████████████████████████████████████████████████████| 7/7 [00:30<00:00,  4.39s/it]

CPU times: user 20.2 s, sys: 726 ms, total: 20.9 s
Wall time: 30.7 s





To explore the resulting records, see `./presentation/notebooks/explore_records.ipynb`.