# Performing QSO Classification using Variational Autoencoders

This notebook performs quasar (QSO) classification with a simple autoencoder. The deep learning frameworks used are TensorFlow and PyTorch.


## Authors

* Ash Karale
    

## Contents:

* [Introduction](#one)
* [Importing Modules](#two)
* [Data Acquisition](#three)
* [Data Processing](#four)
* [Model Definition](#five)
* [Model Training](#six)


## Versions:

Initial Version: November 2022 (Ash Karale)

Updated Version: April 2023 (Ash Karale)



## Introduction <a class="anchor" id="one"></a>

In [1]:
# 

## Importing Modules <a class="anchor" id="two"></a>

It is considered good practice to import all the modules at the beginning of a Jupyter Notebook or any Python program.
By importing all the modules at the start, we ensure that the required dependencies are present and available when we need them.

In [2]:
# Importing all required modules

# System modules allow Python programs to interact with the operating system and perform tasks 
# such as reading and writing files, managing processes, and accessing environment variables 
import os
import sys
import importlib
import pickle
import argparse
import itertools
import csv
from tqdm import tqdm
import time

# Data manipulation modules allow users to perform various operations on data,
# such as cleaning, transforming, aggregating, filtering, and visualizing data
import math
import numpy as np
import pandas as pd

# Visualization modules allow users to create visual representations of data
import matplotlib as mpl
import matplotlib.pyplot as plt
import palettable
import seaborn as sns
from bokeh.io import output_notebook, show
from bokeh.plotting import figure, output_file
# pd.set_option('display.max_columns', 1000)

# Scikit-learn provides a range of supervised and unsupervised learning algorithms,
# as well as tools for model selection and data preprocessing
from sklearn import model_selection, preprocessing
from sklearn.model_selection import train_test_split
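# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay is its replacement
from sklearn.metrics import ConfusionMatrixDisplay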
from sklearn.metrics import accuracy_score, f1_score, normalized_mutual_info_score, adjusted_rand_score

# Scipy is a Python library for scientific computing and technical computing
from scipy import stats
from scipy.optimize import linear_sum_assignment as linear_assignment

# Astropy is a Python library for astronomy and astrophysics
from astropy.io import fits
from astropy.table import Table

# TensorFlow is an open-source machine learning library that provides an extensive set of tools and libraries
# for building, training, and deploying neural networks, as well as other machine learning algorithms
import tensorflow as tf
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import LeakyReLU
from tensorflow.keras.layers import MaxPooling2D, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model
from tensorflow.keras.optimizers import SGD

# PyTorch is an open-source machine learning library for Python that provides a range of tools
# and functions for building and training neural networks and other machine learning models
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR
from torchvision.transforms import ToTensor
from torch.utils.data import DataLoader,TensorDataset
from torch.autograd import Variable

print(sys.version)

3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:48:25) 
[Clang 14.0.6 ]


## Data Acquisition <a class="anchor" id="three"></a>

Data acquisition is the process of collecting data from its sources. It is the first step in the data analysis pipeline and involves identifying the data sources and obtaining the data in a usable format.

The next cell sets the path to the data. To use a different data source, change the path assigned there.

In [3]:
# Defining a variable named 'data_dir' and assigning it the string value /Users/ash/Research/Data/DELVE/ 
# This is the path to the directory where the dataset is stored on the local machine
data_dir = '/Users/ash/Research/Data/DELVE/'

# Using the display() function to display the value of the 'data_dir' variable in the output of the Jupyter Notebook
display(data_dir)

'/Users/ash/Research/Data/DELVE/'

Reading in the data file.
We use Astropy's Table to read in data files because it provides a powerful and flexible way to manipulate and work with tabular data, such as data stored in CSV, FITS, or other formats.

In [4]:
from astropy.table import Table

# Reading a data file stored in the FITS format using the Table.read() method 
# The path to the data file is constructed using the os.path.join() method to join the data_dir variable, 
# which specifies the directory containing the data file, and the filename 'fullcat15_30.fits'
data = Table.read(os.path.join(data_dir, 'fullcat15_30.fits'))

# Converting the FITS formatted data to a Pandas DataFrame using the to_pandas() method
# of the Table object for easier pandas manipulation
fcDF_15_30 = data.to_pandas()

#### Data types

Measurements fall into the following main categories:
- __Astrometry__ -> ra, dec, proper motion and parallax
- __Photometry__ -> point and extended source photometry, in both AB magnitudes and fluxes (nJy)
- __Color__ -> Computed using the fluxes
- __Morphology__ -> 1 for extended and 0 for point-like
- __Light Curve Features__ -> Extracted from the SDSS light curves if matched
- __Redshift__ -> Both spectroscopic and photometric, wherever available
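
The relation between fluxes, AB magnitudes, and colors mentioned in the list can be sketched as follows. This is an illustration, not the catalog's own pipeline: the function names are hypothetical, and the zero point of 31.4 is the rounded AB zero point for fluxes expressed in nJy (3631 Jy corresponds to magnitude 0). Non-positive fluxes have no defined magnitude and are returned as NaN.

```python
import numpy as np

# Rounded AB zero point for fluxes in nanojansky (assumption: fluxes are in nJy,
# as stated in the list above).
AB_ZEROPOINT_NJY = 31.4

def flux_njy_to_abmag(flux_njy):
    """Convert fluxes in nJy to AB magnitudes; non-positive fluxes give NaN."""
    flux = np.asarray(flux_njy, dtype=float)
    with np.errstate(divide='ignore', invalid='ignore'):
        mag = -2.5 * np.log10(flux) + AB_ZEROPOINT_NJY
    return np.where(flux > 0, mag, np.nan)

def color_from_fluxes(flux_a_njy, flux_b_njy):
    """Color (mag_a - mag_b) computed from two fluxes."""
    return flux_njy_to_abmag(flux_a_njy) - flux_njy_to_abmag(flux_b_njy)

# Toy fluxes (made up for illustration), not values from the catalog
g_flux = np.array([3631.0, 1000.0, -5.0])
r_flux = np.array([3631.0, 2000.0, 10.0])
g_minus_r = color_from_fluxes(g_flux, r_flux)
```

A source that is twice as bright in r as in g has g - r of about 0.75 mag, and a negative flux propagates as NaN rather than raising an error.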

Inspecting the attributes

In [5]:
# Create a list of feature column names for the dataset
# These features include photometric magnitudes, extended class, proper motion,
# radial velocity, and class probabilities
fc_list = [
    'mag_auto_g', 'mag_auto_i', 'mag_auto_r', 'mag_auto_z', 
    # Magnitudes in g, i, r, and z bands from AUTO photometry
    'ypetromag', 'jpetromag', 'hpetromag', 'kspetromag',
    # Magnitudes in Y, J, H, and Ks bands from Petrosian photometry
    'w1mpro', 'w2mpro',
    # Magnitudes in WISE W1 and W2 bands
    'extended_class_g', 'extended_class_r', 'extended_class_i', 'extended_class_z', 
    # Extended class in g, r, i, and z bands
    'pm', 'pmdec', 'pmra', 
    # Total proper motion, proper motion in declination, and proper motion in right ascension
    'radial_velocity',  
    # Radial velocity of the objects
    'classprob_dsc_combmod_star','classprob_dsc_combmod_galaxy','classprob_dsc_combmod_quasar', 
    # Class probabilities of the objects (star, galaxy, quasar)
]

# Selecting a subset of columns from the DataFrame 'fcDF_15_30' based on the list 'fc_list'
fcDF_15_30 = fcDF_15_30[fc_list]

Summarizing the data

In [13]:
# Display descriptive statistics to describe and explore data

# The describe() method provides summary statistics for each column of the DataFrame, 
# giving insight into the distribution and spread of the data 
fcDF_15_30.describe()

| | mag_auto_g | mag_auto_i | mag_auto_r | mag_auto_z | ypetromag | jpetromag | hpetromag | kspetromag | w1mpro | w2mpro | ... | extended_class_r | extended_class_i | extended_class_z | pm | pmdec | pmra | radial_velocity | classprob_dsc_combmod_star | classprob_dsc_combmod_galaxy | classprob_dsc_combmod_quasar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8226904.0 | 8226904.0 | 8226904.0 | 8226904.0 | 1312523.0 | 7681616.0 | 2162878.0 | 6286130.0 | 8226904.0 | 8226904.0 | ... | 8226904.0 | 8226904.0 | 8226904.0 | 1557373.0 | 1557373.0 | 1557373.0 | 48494.0 | 1854957.0 | 1854957.0 | 1854957.0 |
| mean | 36.21252 | 22.65878 | 24.80181 | 22.48277 | 18.57671 | 18.67103 | 17.83285 | 17.14291 | 16.80659 | 16.63629 | ... | 1.607298 | 1.857492 | 1.831789 | 13.86721 | -3.923559 | 8.006602 | 15.065438 | 0.9447181 | 0.03015133 | 0.02166022 |
| std | 30.01855 | 12.82219 | 16.52845 | 13.4482 | 1.565382 | 1.523811 | 1.433349 | 1.274097 | 1.079368 | 1.070547 | ... | 2.654142 | 2.206325 | 2.266141 | 16.07001 | 12.40325 | 14.73784 | 42.134888 | 0.2224325 | 0.1683584 | 0.1381415 |
| min | 12.7819 | 11.67814 | 12.10906 | 11.3963 | 10.37403 | 9.689001 | 8.260656 | 8.25636 | 7.068 | 6.085 | ... | -9.0 | -9.0 | -9.0 | 0.0026 | -802.6215 | -365.6444 | -389.8801 | 0.0 | 0.0 | 0.0 |
| 25% | 21.37134 | 19.74994 | 20.29234 | 19.43825 | 17.78231 | 17.93413 | 17.16681 | 16.54 | 16.286 | 16.087 | ... | 0.0 | 0.0 | 0.0 | 5.070828 | -7.266419 | 0.8808007 | -5.499497 | 0.999527 | 0.0 | 0.0 |
| 50% | 22.71964 | 20.84213 | 21.50048 | 20.49579 | 18.87614 | 18.9775 | 18.12667 | 17.39955 | 16.936 | 16.742 | ... | 3.0 | 3.0 | 3.0 | 9.90895 | -1.996961 | 5.454478 | 12.312459 | 0.999964 | 0.0 | 0.0 |
| 75% | 24.06641 | 21.70479 | 22.57661 | 21.28124 | 19.71436 | 19.74666 | 18.824 | 18.02353 | 17.53 | 17.382 | ... | 3.0 | 3.0 | 3.0 | 17.31142 | 0.9636427 | 12.18151 | 31.317938 | 0.99999 | 0.0 | 0.0 |
| max | 99.0 | 99.0 | 99.0 | 99.0 | 28.68277 | 30.56738 | 29.72356 | 32.15351 | 20.041 | 18.892 | ... | 3.0 | 3.0 | 3.0 | 802.6368 | 551.2816 | 676.6674 | 761.0851 | 1.0 | 1.0 | 1.0 |


## Data Processing <a class="anchor" id="four"></a>

Data Processing refers to the process of transforming raw data into a form that is suitable for analysis. It involves a series of steps that may include data cleaning, data integration, data transformation, data reduction, and data visualization.

Create a subset containing as many objects as possible for which the data values are meaningful.
Specifically:
* Merge the star, galaxy, and QSO class probabilities into a single class label
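
The summary statistics above show a maximum of 99.0 in the `mag_auto_*` columns and a minimum of -9.0 in the `extended_class_*` columns. These look like placeholder values for missing measurements, although that reading is an assumption and is not confirmed anywhere in this notebook. Under that assumption, a minimal sketch of masking them before selecting meaningful rows could look like this (the helper name and the toy data are hypothetical):

```python
import numpy as np
import pandas as pd

# Assumed placeholder values, inferred from the describe() output only
MAG_SENTINEL = 99.0
CLASS_SENTINEL = -9

def mask_sentinels(df, mag_cols, class_cols):
    """Return a copy of df with the assumed placeholder values set to NaN."""
    out = df.copy()
    for col in mag_cols:
        # where() keeps values that satisfy the condition and inserts NaN elsewhere
        out[col] = out[col].where(out[col] != MAG_SENTINEL)
    for col in class_cols:
        out[col] = out[col].where(out[col] != CLASS_SENTINEL)
    return out

# Toy rows for illustration, not taken from the catalog
toy = pd.DataFrame({
    'mag_auto_g': [21.4, 99.0, 22.7],
    'mag_auto_r': [20.3, 21.5, 99.0],
    'extended_class_g': [0, -9, 3],
})
clean = mask_sentinels(toy, ['mag_auto_g', 'mag_auto_r'], ['extended_class_g'])

# Rows with a complete set of meaningful values
complete = clean.dropna()
```

Masking first and dropping afterwards keeps the two decisions separate: which values count as missing, and how many missing values a row may have before it is excluded.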

In [14]:
# Making an explicit copy of the column subset so that the 'class' column
# added below is written to an independent DataFrame
fcDF_15_30 = fcDF_15_30.copy()

# The three 'classprob_dsc_combmod_*' columns hold floating-point class probabilities.
# They are merged below into one integer label: 0 = star, 1 = galaxy, 2 = quasar

# Define a function that assigns a class based on the highest class probability
# 'assign_class' takes a row as input and returns 0 (star), 1 (galaxy), or 2 (quasar);
# rows with a missing probability return -1 instead of falling through to the quasar branch
def assign_class(row):
    star_prob = row['classprob_dsc_combmod_star']
    galaxy_prob = row['classprob_dsc_combmod_galaxy']
    quasar_prob = row['classprob_dsc_combmod_quasar']

    # Comparisons with NaN are always False, so unclassified rows must be handled explicitly
    if pd.isna(star_prob) or pd.isna(galaxy_prob) or pd.isna(quasar_prob):
        return -1
    if star_prob > galaxy_prob and star_prob > quasar_prob:
        return 0
    elif galaxy_prob > quasar_prob:
        return 1
    else:
        return 2

# Merging the class probability attributes of galaxies, quasars, and stars into a single 'class' attribute 
# based on the highest probability value
# The apply() method applies the function 'assign_class' to each row of the DataFrame
fcDF_15_30['class'] = fcDF_15_30[['classprob_dsc_combmod_galaxy', 'classprob_dsc_combmod_quasar',
                            'classprob_dsc_combmod_star']].apply(assign_class, axis=1)
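
Applying a Python function row by row is slow on a table with millions of rows. A vectorized alternative is sketched below as an option, not as the method used in this notebook. The function name is hypothetical, the label -1 for rows with a missing probability is an assumption of the sketch, and ties are resolved in favor of the first column, which differs from the row-wise function above.

```python
import numpy as np
import pandas as pd

# Column order fixes the label encoding: 0 = star, 1 = galaxy, 2 = quasar
PROB_COLS = [
    'classprob_dsc_combmod_star',
    'classprob_dsc_combmod_galaxy',
    'classprob_dsc_combmod_quasar',
]

def assign_class_vectorized(df):
    """Label each row by its largest class probability; -1 if any is missing."""
    probs = df[PROB_COLS].to_numpy(dtype=float)
    missing = np.isnan(probs).any(axis=1)
    # Replace NaN by -inf so argmax is well defined, then overwrite those rows
    labels = np.argmax(np.where(np.isnan(probs), -np.inf, probs), axis=1)
    labels[missing] = -1
    return pd.Series(labels, index=df.index, name='class')

# Toy probabilities for illustration, not taken from the catalog
toy = pd.DataFrame(
    [[0.90, 0.05, 0.05],
     [0.10, 0.70, 0.20],
     [0.00, 0.10, 0.90],
     [np.nan, np.nan, np.nan]],
    columns=PROB_COLS,
)
toy['class'] = assign_class_vectorized(toy)
```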


## Model Definition <a class="anchor" id="five"></a>

Model architecture refers to the overall structure and design of a machine learning model. It includes the number and type of layers, the number of neurons or units in each layer, the activation functions used in each layer, the optimization algorithm used for training, and other design choices that are made when creating a model.
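
To make these design choices concrete, here is a minimal sketch of a fully connected autoencoder in PyTorch. It is not the model defined by the author: the class name, layer sizes, and latent dimension are assumptions chosen for illustration, and the layer types (`Linear`, `BatchNorm1d`, `LeakyReLU`) are picked only because similar layers appear in the import cell.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Hypothetical autoencoder: compress n_features to latent_dim and back."""

    def __init__(self, n_features, latent_dim=2, hidden_dim=16):
        super().__init__()
        # Encoder: input features -> hidden layer -> latent representation
        self.encoder = nn.Sequential(
            nn.Linear(n_features, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )
        # Decoder: mirror of the encoder, latent representation -> input features
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, n_features),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Random input standing in for a batch of 8 objects with 10 features each
model = Autoencoder(n_features=10)
x = torch.randn(8, 10)
reconstruction, latent = model(x)
```

Returning the latent vector alongside the reconstruction makes it available for clustering without a second forward pass.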

As we proceed with unsupervised classification, it is necessary to have functions to evaluate the performance of the models and the quality of the clustering results. 
The following functions are commonly used for evaluation:
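
One such function, sketched here under assumptions, is clustering accuracy. Cluster IDs produced by an unsupervised model are arbitrary, so they are first matched to the reference classes with the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`, which the import cell already loads), and accuracy is computed after that matching. The function name is hypothetical, and labels are assumed to be non-negative integers. Normalized mutual information and the adjusted Rand index, both imported from scikit-learn above, are common complements because they do not need this matching step.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_accuracy(y_true, y_pred):
    """Accuracy after the best one-to-one matching of cluster IDs to classes."""
    y_true = np.asarray(y_true, dtype=np.int64)
    y_pred = np.asarray(y_pred, dtype=np.int64)
    n = max(y_true.max(), y_pred.max()) + 1
    # contingency[i, j] counts objects placed in cluster i whose true class is j
    contingency = np.zeros((n, n), dtype=np.int64)
    np.add.at(contingency, (y_pred, y_true), 1)
    # Choose the cluster-to-class assignment that maximizes the matched count
    row_ind, col_ind = linear_sum_assignment(contingency, maximize=True)
    return contingency[row_ind, col_ind].sum() / y_true.size

# Toy labels: the clusters are a pure relabeling of the classes
perfect = cluster_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
# Toy labels: one object of class 1 falls into the cluster matched to class 2
imperfect = cluster_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 2, 2, 2])
```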

## Model Training <a class="anchor" id="six"></a>

Model training is the process of training a machine learning model to make accurate predictions on new data.
In unsupervised learning, model training refers to the process of learning the underlying structure of the data without the use of explicit labels. 

Loading Data

Data Loading is a critical step in the data preprocessing pipeline and involves preparing the data for use in a machine learning model.

To help ensure that the data is correctly loaded and preprocessed, two of the most common programming constructs are defined below: 
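
Two constructs commonly used for this in PyTorch are `TensorDataset` and `DataLoader`, both of which appear in the import cell. Whether these are the two constructs the author intended is an assumption. The sketch below standardizes a feature matrix and wraps it in a loader; the helper name, batch size, and random toy data are hypothetical.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(features, batch_size=256, shuffle=True):
    """Standardize each feature column and wrap the result in a DataLoader."""
    features = np.asarray(features, dtype=np.float32)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    scaled = (features - mean) / std
    # An autoencoder reconstructs its input, so the dataset holds features only
    dataset = TensorDataset(torch.from_numpy(scaled))
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)

# Random data standing in for 1000 objects with 10 features each
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(1000, 10))
loader = make_loader(toy_features, batch_size=128)

# Each batch is a one-element tuple because the dataset wraps a single tensor
(first_batch,) = next(iter(loader))
```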

Training Loop

Training Loop refers to the process of iteratively optimizing the internal parameters of a model to minimize the error on a training dataset.
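
A minimal sketch of such a loop for a reconstruction objective is shown below. It is an illustration under assumptions, not the author's training code: the function name, the mean squared error loss, the Adam settings, and the small stand-in model and random data are all hypothetical.

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import DataLoader, TensorDataset

def train(model, loader, n_epochs=5, lr=1e-3):
    """Minimize reconstruction error; return the mean loss of each epoch."""
    optimizer = Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    history = []
    model.train()
    for epoch in range(n_epochs):
        running, n_seen = 0.0, 0
        for (batch,) in loader:
            optimizer.zero_grad()                # clear gradients from the last step
            loss = loss_fn(model(batch), batch)  # the input is also the target
            loss.backward()                      # backpropagate
            optimizer.step()                     # update the parameters
            running += loss.item() * batch.size(0)
            n_seen += batch.size(0)
        history.append(running / n_seen)
    return history

# Stand-in model and random data, for illustration only
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(10, 3), nn.LeakyReLU(), nn.Linear(3, 10))
toy_loader = DataLoader(TensorDataset(torch.randn(512, 10)), batch_size=64, shuffle=True)
losses = train(toy_model, toy_loader, n_epochs=5, lr=1e-2)
```

Tracking the per-epoch mean loss gives a simple convergence check: the values in `losses` should fall over the epochs and then level off.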