### Recognizing High-redshift Galaxy Mergers with Convolutional Neural Networks using DeepMerge simulated data with an application on real-world data
# Simulated data

In [1]:
import os
import numpy as np
import time

from astropy.io import fits
from astropy.utils.data import download_file

import matplotlib.pyplot as plt

import torch
from torch.utils.data import DataLoader
from torch.utils.data import TensorDataset
import torch.nn as nn
from sklearn.model_selection import train_test_split
import pandas as pd

*Ema Donev, 2023.*

This notebook describes the DeepMerge simulated data and shows how to prepare it, following *DeepMerge: Classifying High-redshift Merging Galaxies with Deep Neural Networks* by Ćiprijanović A., Snyder G. F., Nord B. and Peek J. E. G., 2020.

## Section 1: about the data

Data source: https://archive.stsci.edu/hlsp/deepmerge

Data type: .fits file

#### Simulated images
Every image in this dataset is 75x75 pixels and has 3 layers, one per filter. The images show simulated galaxies and galaxy mergers.

The pictures originally come from the Illustris project, a cosmological simulation of galaxy formation. The DeepMerge team downloaded 7000 images and then processed them. First, the light from each star was smoothed out using adaptive spreading of stellar light. Next, the gas and dust were modified to be less clear. Finally, nebular emission was modified so that bright stars were covered with the gas and dust from their formation.

Each galaxy was simulated from 4 camera positions and 4 different imaging angles. The simulated instruments were the James Webb Space Telescope with its NIRCam camera and the Hubble Space Telescope.

The set of 7000 original images was imbalanced, so the scientists used **data augmentation** to balance it. **Data augmentation** creates modified copies of existing images, for example by rotating or flipping them, so that the model gets new images for training. After this process, the dataset contains 15426 images, of which 8120 are pictures of galaxies and 7306 are pictures of galaxy mergers.
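To illustrate the idea of data augmentation, here is a minimal sketch that builds rotated and mirrored copies of a single image with NumPy. The exact transformations used by the DeepMerge team are not specified here, so the function `augment`, the random stand-in image and the assumed `(filters, height, width)` array layout are illustrative assumptions only.

```python
import numpy as np

def augment(image):
    """Return 8 rotated and mirrored copies of one (filters, height, width) image."""
    copies = []
    for k in range(4):  # rotations by 0, 90, 180 and 270 degrees
        rotated = np.rot90(image, k=k, axes=(1, 2))
        copies.append(rotated)
        copies.append(np.flip(rotated, axis=2))  # left-right mirror of each rotation
    return np.stack(copies)

rng = np.random.default_rng(0)
image = rng.random((3, 75, 75))  # stand-in for one 3-filter, 75x75 image
augmented = augment(image)
print(augmented.shape)  # (8, 3, 75, 75)
```

Only the spatial axes are rotated and flipped, so the 3 filter layers stay aligned with each other in every copy.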

#### Data format

The data is stored in a `.fits` file. `FITS`, which stands for **Flexible Image Transport System**, is the most widely used file format in astronomy. FITS files carry both the data and information about the data. The file is composed of a `Header` and a `Data` part. The `Header` contains the basic information about the FITS file and about the data. `Data` can consist of a single part or of several parts; in this FITS file it has 2 parts: `Images` and `MergerLabel`. The `Images` part contains all the images, and `MergerLabel` is a table with 15426 rows, one label per image: (1) if positive, or *galaxy merger*, and (0) if negative, or *no galaxy merger*.

#### Step 1: downloading the data

In [2]:
version_pristine = "pristine"  # version without background noise
version_noisy = "noisy"  # version with added background noise

# common part of both download links
base_url = 'https://archive.stsci.edu/hlsps/deepmerge/hlsp_deepmerge_hst-jwst_acs-wfc3-nircam_illustris-z2_f814w-f160w-f356w_v1_sim-'

# link to download pristine data
file_url_pristine = base_url + version_pristine + '.fits'
# link to download noisy data
file_url_noisy = base_url + version_noisy + '.fits'

There are 2 versions of the data: *noisy* and *pristine*. The **pristine** dataset contains images that are perfectly clear and have no background noise. The **noisy** dataset contains images with added background noise to mimic more realistic observations. We are going to train 2 models, one on each type of data.
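To get a feeling for the difference between the two versions, the sketch below adds random background noise to a clean toy image. It is not the procedure used to create the DeepMerge noisy dataset: the zero-mean Gaussian noise model, the value `sigma=0.1`, the helper `add_background_noise` and the bright square standing in for a galaxy are all assumptions made for illustration.

```python
import numpy as np

def add_background_noise(image, sigma, rng):
    """Return a copy of `image` with zero-mean Gaussian noise of standard deviation `sigma`."""
    return image + rng.normal(loc=0.0, scale=sigma, size=image.shape)

rng = np.random.default_rng(42)
pristine = np.zeros((3, 75, 75))
pristine[:, 30:45, 30:45] = 1.0  # a bright square standing in for a galaxy
noisy = add_background_noise(pristine, sigma=0.1, rng=rng)
print(noisy.shape)  # (3, 75, 75)
```

Plotting `pristine[0]` and `noisy[0]` side by side with `plt.imshow` shows how the noise fills the empty background while the bright source remains visible.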