### Recognizing High-redshift Galaxy Mergers with Convolutional Neural Networks using DeepMerge simulated data with an application on real-world data
# Real data

In [1]:
import os
import numpy as np
import time

from astropy.io import fits
from astropy.utils.data import download_file
from astropy.visualization import simple_norm

import matplotlib.pyplot as plt

import torch
from torch.utils.data import DataLoader
from torch.utils.data import TensorDataset
import torch.nn as nn
from sklearn.model_selection import train_test_split
import pandas as pd

*Ema Donev, 2023.*

In this notebook you will find all the information about the real data downloaded from AstroNN and how to prepare it for modelling.

## Section 1: Downloading the data

The data comes from AstroNN: https://astronn.readthedocs.io/en/latest/galaxy10.html. I am downloading the Galaxy10 DECals dataset, which contains 17736 images of different galaxy types. The labels come from Galaxy Zoo volunteer classifications, starting with Galaxy Zoo Data Release 2, which was based on SDSS images. The images in this version of the dataset come from the DESI Legacy Imaging Surveys (DECaLS). Galaxy Zoo processed and published the images so that the general public could classify them, since there are far too many for professional astronomers to classify alone. It was determined that ~38 classifications were made per image and that the resulting labels are about as trustworthy as classifications made by professional scientists.

The data is stored in an `H5` file, short for `Hierarchical Data Format` (version 5). This format is used to store large amounts of data as multidimensional arrays, of which images are an example. Scientific data is typically stored in this format, and NASA originally selected it as a standard data format in science.
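
As a small illustration of the format, the sketch below writes a toy HDF5 file and reads back its structure. The file name `toy_galaxies.h5` and the array sizes are made up for the example; only the dataset names `images` and `ans` mirror the ones used in Galaxy10.

```python
import os
import tempfile

import h5py
import numpy as np

# Toy stand-in data: 4 tiny RGB "images" and 4 integer labels (made-up sizes).
toy_images = np.zeros((4, 8, 8, 3), dtype=np.uint8)
toy_labels = np.array([0, 3, 3, 9], dtype=np.uint8)

# Hypothetical file name, written to a temporary directory.
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_galaxies.h5')

# An HDF5 file behaves much like a dictionary of named arrays ("datasets").
with h5py.File(toy_path, 'w') as f:
    f.create_dataset('images', data=toy_images)
    f.create_dataset('ans', data=toy_labels)

# Shapes and dtypes can be inspected without loading the pixel data into memory.
with h5py.File(toy_path, 'r') as f:
    toy_keys = sorted(f.keys())
    toy_shapes = {name: f[name].shape for name in toy_keys}
    toy_dtypes = {name: str(f[name].dtype) for name in toy_keys}

print(toy_keys)
print(toy_shapes)
print(toy_dtypes)
```

Looking at the keys and shapes first is a cheap way to check what a downloaded file contains before reading a large array into memory.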

In [15]:
import h5py
with h5py.File('../input/Galaxy10_DECals.h5', 'r') as F:
    images_real = np.array(F['images'])
    labels_real = np.array(F['ans'])

This piece of code follows the loading instructions on the AstroNN website. After opening the file, we read the arrays stored under the keys `'images'` and `'ans'` to obtain the pixel values of the images and their labels.
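
A quick sanity check after loading is to look at the array shapes, the data types, and how many images fall into each class. The sketch below wraps this in a hypothetical helper, `summarize_dataset`, and runs it on small synthetic arrays so that it works without the data file. In this notebook it would be called as `summarize_dataset(images_real, labels_real)`.

```python
import numpy as np


def summarize_dataset(images, labels):
    """Return basic facts about an image array and its label array."""
    classes, counts = np.unique(labels, return_counts=True)
    return {
        'n_images': images.shape[0],
        'image_shape': images.shape[1:],
        'image_dtype': str(images.dtype),
        # map each class label to the number of images carrying it
        'class_counts': {int(c): int(n) for c, n in zip(classes, counts)},
    }


# Synthetic stand-in for the real arrays (sizes and labels are made up).
demo_images = np.zeros((6, 16, 16, 3), dtype=np.uint8)
demo_labels = np.array([0, 1, 1, 2, 2, 2])

summary = summarize_dataset(demo_images, demo_labels)
print(summary)
```

Strongly unequal class counts are worth noting at this stage, because they affect how the data should later be split and how model accuracy should be read.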

In [None]:
# select 16 random image indices (without repeats):
example_ids = np.random.choice(images_real.shape[0], 16, replace=False)
examples = [images_real[j] for j in example_ids]

# initialize your figure
fig = plt.figure(figsize=(8, 8))

# loop through the randomly selected images and plot with labels
for i, image in enumerate(examples):
    ax = fig.add_subplot(4, 4, i+1)

    # the images are RGB, so no colormap or normalization is needed;
    # the cast keeps pixel values in the 0-255 integer range imshow expects
    ax.imshow(image.astype(np.uint8), aspect='equal')
    ax.set_title('Class='+str(int(labels_real[example_ids[i]])))

    ax.axis('off')

plt.show()

In [18]:
images_real = images_real.astype(np.float32)
labels_real = labels_real.astype(np.float32)
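
Converting to `float32` is usually only the first step before feeding images to a PyTorch model. As a sketch of what typically follows, and assuming the images are 8-bit RGB arrays with the channel axis last, the hypothetical helper `to_model_input` below rescales pixel values to [0, 1] and moves the channel axis to the position that PyTorch convolution layers expect. Whether this notebook applies exactly these steps is not shown in this section.

```python
import numpy as np


def to_model_input(images):
    """Scale 8-bit pixel values to [0, 1] and reorder (N, H, W, C) -> (N, C, H, W)."""
    scaled = images.astype(np.float32) / 255.0
    return np.transpose(scaled, (0, 3, 1, 2))


# Synthetic stand-in batch: 2 images of 4x5 pixels, all at maximum brightness.
demo_batch = np.full((2, 4, 5, 3), 255, dtype=np.uint8)

prepared = to_model_input(demo_batch)
print(prepared.shape, prepared.dtype, prepared.max())
```

The resulting array can then be turned into a tensor with `torch.from_numpy` and wrapped in the `TensorDataset` imported at the top of the notebook.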