# Data Standardisation Pipeline

In this notebook, I detail the process of importing data into `crowdastro`.

## Input data sources

The input data sources are:

- Radio Galaxy Zoo
    - Subjects (`radio_subjects.json`)
    - Classifications (`radio_classifications.json`)
- ATLAS
    - Catalogue (`ATLASDR3_cmpcat_23July2015.dat`)
    - FITS images of the radio sky (`cdfs` & `elais`)
- Spitzer
    - SWIRE Catalogue (`SWIRE3_CDFS_cat_IRAC24_21Dec05.tbl`)
    - FITS images of the infrared sky (`cdfs` & `elais`)

Paths to these should be specified in `crowdastro.json`. The Radio Galaxy Zoo data should be imported into a MongoDB database, whose name is given by `radio_galaxy_zoo_db` in `crowdastro.json`. The following is an example `crowdastro.json`:

```json
{
    "data_sources": {
        "atlas_catalogue": "data/ATLASDR3_cmpcat_23July2015.dat",
        "cdfs_fits": "data/cdfs",
        "elais_s1_fits": "data/elais",
        "radio_galaxy_zoo_db": "radio",
        "swire_catalogue": "data/SWIRE3_CDFS_cat_IRAC24_21Dec05.tbl"
    },

    "mongo": {
        "host": "localhost",
        "port": 27017
    }
}
```
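As a minimal sketch of how this configuration might be consumed, the snippet below parses the example and pulls out the MongoDB address and database name. The `load_config` helper and the inlined `EXAMPLE_CONFIG` string are illustrative assumptions, not part of `crowdastro`; in practice the text would be read from `crowdastro.json` on disk.

```python
import json

# Inlined copy of the example configuration, so the sketch is self-contained.
# Assumption: in practice this text is read from crowdastro.json on disk.
EXAMPLE_CONFIG = """
{
    "data_sources": {
        "atlas_catalogue": "data/ATLASDR3_cmpcat_23July2015.dat",
        "cdfs_fits": "data/cdfs",
        "elais_s1_fits": "data/elais",
        "radio_galaxy_zoo_db": "radio",
        "swire_catalogue": "data/SWIRE3_CDFS_cat_IRAC24_21Dec05.tbl"
    },
    "mongo": {
        "host": "localhost",
        "port": 27017
    }
}
"""

# Keys that the example configuration lists under "data_sources".
REQUIRED_SOURCES = {
    "atlas_catalogue",
    "cdfs_fits",
    "elais_s1_fits",
    "radio_galaxy_zoo_db",
    "swire_catalogue",
}


def load_config(text):
    """Parse the configuration text and check the expected keys exist.

    Hypothetical helper written for this sketch.
    """
    config = json.loads(text)
    missing = REQUIRED_SOURCES - set(config["data_sources"])
    if missing:
        raise KeyError("missing data sources: {}".format(sorted(missing)))
    return config


config = load_config(EXAMPLE_CONFIG)
mongo_address = (config["mongo"]["host"], config["mongo"]["port"])
db_name = config["data_sources"]["radio_galaxy_zoo_db"]
print(mongo_address, db_name)
```

With PyMongo installed, `pymongo.MongoClient(*mongo_address)[db_name]` would then give a handle to the Radio Galaxy Zoo database.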

## Output data format

This pipeline converts the input data into two output files for the training data and two for the testing data: `rgz_atlas_data_{test,train}.h5` and `rgz_atlas_data_{test,train}.csv`. The partitioning ratio is specified in `crowdastro.json` as `test_size`.
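To illustrate the partitioning, here is a rough sketch of a seeded random split. It assumes that `test_size` is a fraction between 0 and 1 giving the share of subjects held out for testing; the `partition_indices` function and its `seed` parameter are hypothetical names introduced for this example, and the actual pipeline may split differently.

```python
import random


def partition_indices(n_subjects, test_size, seed=0):
    """Split subject indices 0..n_subjects-1 into (train, test) lists.

    Assumption: test_size is the fraction of subjects used for testing.
    """
    if not 0.0 <= test_size <= 1.0:
        raise ValueError("test_size must be between 0 and 1")
    indices = list(range(n_subjects))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_subjects * test_size))
    test = sorted(indices[:n_test])
    train = sorted(indices[n_test:])
    return train, test


train, test = partition_indices(10, 0.2)
print(len(train), len(test))
```

Fixing the seed means the `.h5` and `.csv` outputs can be regenerated with the same training and testing subjects.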

The `.h5` files contain numeric data, including all FITS images (both radio and infrared), classifications, and subject metadata. The structure is as follows, with datasets italicised:

- `/`
    - `atlas`
        - `cdfs`
            - *`catalogue`*
            - *`images`*
            - *`classifications`*
            - *`positions`*
        - `elais-s1`
            - *`catalogue`*
            - *`images`*
            - *`classifications`*
            - *`positions`*
    - `swire`
        - `cdfs`
            - *`catalogue`*
            - *`images`*
            - *`classifications`*
            - *`positions`*
        - `elais-s1`
            - *`catalogue`*
            - *`images`*
            - *`classifications`*
            - *`positions`*
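The hierarchy above is regular enough to enumerate programmatically. The following sketch builds the full HDF5 path of each of the four named datasets for every survey and field; the `dataset_paths` helper and the constant names are assumptions made for illustration.

```python
from itertools import product

# Groups and datasets taken from the structure listed above.
SURVEYS = ("atlas", "swire")
FIELDS = ("cdfs", "elais-s1")
DATASETS = ("catalogue", "images", "classifications", "positions")


def dataset_paths():
    """Return the absolute HDF5 path of every dataset in the layout."""
    return [
        "/{}/{}/{}".format(survey, field, dataset)
        for survey, field, dataset in product(SURVEYS, FIELDS, DATASETS)
    ]


paths = dataset_paths()
for path in paths:
    print(path)
```

With `h5py`, a path such as `/atlas/cdfs/images` can be used directly to index an open file, for example `f["/atlas/cdfs/images"]`.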

I'm only using ATLAS and SWIRE for now, but this structure is easily generalised to other pairs of radio and infrared surveys, such as FIRST and WISE or EMU and MIGHTEE.

The `.csv` files contain textual data such as Zooniverse IDs, and serve as lookup tables. I use CSV instead of HDF5 tables partly for human readability and partly because dealing with textual data in HDF5 is unpleasant. The columns (examples parenthesised) are:

- `survey` (atlas)
- `field` (cdfs)
- `zooniverse_id` (ARG0003r18)
- `name` (ATLAS3_J033403.6-282423C)
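As a sketch of how such a file could be used as a lookup table, the snippet below reads rows with the standard `csv` module and indexes them by Zooniverse ID. It assumes the file has a header row naming the four columns, and the single sample row is assembled from the parenthesised examples above, which may not all describe the same subject. The `build_lookup` helper is hypothetical.

```python
import csv
import io

# Assumption: a header row followed by one row per subject.
SAMPLE_CSV = (
    "survey,field,zooniverse_id,name\n"
    "atlas,cdfs,ARG0003r18,ATLAS3_J033403.6-282423C\n"
)


def build_lookup(handle):
    """Map each zooniverse_id to its full row (a dict keyed by column)."""
    reader = csv.DictReader(handle)
    return {row["zooniverse_id"]: row for row in reader}


# io.StringIO stands in for an open rgz_atlas_data_{test,train}.csv file.
lookup = build_lookup(io.StringIO(SAMPLE_CSV))
print(lookup["ARG0003r18"]["name"])
```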