<font size=6>**DL Regression**</font>

In this session we provide a short introduction to **Regression**, a topic that can be tackled both with classical machine learning tools (such as those in [scikit-learn](https://scikit-learn.org/stable/index.html) that we saw in the clustering and classification session) and with deep learning approaches. <br>
The goals are:

- to get an idea of how we **build and train** a deep learning model
- to explore some **basic concepts** to better understand their use and impact. 

In the example that follows we go through these main steps:

    1. Load and Select Data
    2. Define Model
    3. Compile Model
    4. Fit Model 
    5. Iterate steps 2-3-4 (by adjusting various parameters or the model architecture)
    6. Evaluate Model
    7. Make Predictions


# What is regression?

    Modeling problems where the output is a continuous numeric value.


# Linear regression and least squares

A linear regression model will try to predict the single value of the dependent variable $y$ given the independent variable $x$, with the most generic form being: 

$$y = f (x | β), $$

where $x$ corresponds to the input variable(s) and $β$ is an array of parameters.

The simplest linear model we can think of is a line:

$$y = β_0 + β_1 x $$

where $β_0$ and $β_1$ have the usual meaning of intercept and slope of a line. 

Generalizing a bit, for an input vector $x^T = (x_1, x_2, ..., x_p)$, where $p$ is the number of input variables (features), the linear model to predict the real-valued output $y$ is:

$$ f(x) = β_0 + \sum_{j=1}^{p} x_j β_j. $$

To find these parameters we use the Ordinary Least Squares approach, i.e. we try to **minimize** the residual sum of squares between the observations and the model predictions (i.e. the **loss or cost** function), summed over the $N$ observations (or samples):

$$ RSS(β) = \sum_{i=1}^{N} (y_i - f(x_i))^2 =  \sum_{i=1}^{N} \left( y_i - β_0 - \sum_{j=1}^{p} x_{ij} β_j \right)^2 ,$$

where $x_{ij}$ is the value of feature $j$ for observation $i$.
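As a minimal sketch of the least-squares idea, the snippet below fits a linear model to synthetic data with NumPy. The data, the "true" coefficients and all variable names are assumptions made up for this illustration; they are not part of the dataset used later.

```python
import numpy as np

# Synthetic data (assumption for illustration): y = 2 + 3*x1 - 1*x2 + small noise
rng = np.random.default_rng(0)
N, p = 200, 2                                 # N samples, p features
X = rng.normal(size=(N, p))
beta_true = np.array([2.0, 3.0, -1.0])        # [β_0, β_1, β_2]
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=N)

# Design matrix with a leading column of ones for the intercept β_0
A = np.column_stack([np.ones(N), X])

# Least-squares solution: the β that minimizes RSS(β)
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# Residual sum of squares at the fitted parameters
rss = np.sum((y - A @ beta_hat) ** 2)
print("estimated β:", beta_hat)
print("RSS:", rss)
```

By construction, no other choice of parameters (including the ones used to generate the data) gives a smaller RSS on this sample.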




# The more general linear regression

Linear regression refers to modeling functions that are linear with respect to the parameters (coefficients) and not with respect to the variables! For example, the function:

$$f (x|β) = \sum_{i=1}^M β_i g_i(x) = β_1 g_1(x) + β_2 g_2(x)~+~...~+~β_M g_M(x) $$

describes a linear problem as long as the sub-functions $g_i(x)$ do not depend on any of the parameters $β_i$. This is not the most generic formulation of the linear regression but we are going to use this form in the following applications.
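To make this concrete, here is a small sketch in which the sub-functions are a constant, $x^2$ and $\sin(x)$. The model is non-linear in $x$ but still linear in the parameters, so ordinary least squares applies. The choice of sub-functions and the coefficients are hypothetical, picked only for illustration.

```python
import numpy as np

# Hypothetical noise-free data: y = 1.0 + 0.5*x^2 + 2.0*sin(x)
x = np.linspace(0.0, 5.0, 50)
y = 1.0 + 0.5 * x**2 + 2.0 * np.sin(x)

# Fixed sub-functions g_i(x); none of them depends on the parameters β_i
basis = [lambda t: np.ones_like(t), lambda t: t**2, np.sin]

# Each column of G holds one g_i evaluated at all samples
G = np.column_stack([g(x) for g in basis])

# Same least-squares machinery as for a straight line
beta_hat, *_ = np.linalg.lstsq(G, y, rcond=None)
print(beta_hat)
```

Because the data here contain no noise, the fit recovers the three coefficients used to generate them (up to floating-point error).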

# On metrics ... or how well can we do

> _Accuracy is a measure for classification not regression_<br>
> _We cannot calculate accuracy for a regression model_
>
>  _by Jason Brownlee [Regression Metrics for Machine Learning](https://machinelearningmastery.com/regression-metrics-for-machine-learning/)_

In regression we are dealing with continuous values. Therefore, it is practically impossible to predict exactly the same values. Instead, the idea is to estimate how close the predictions are to the expected values.

We are going to refer to only a few of the available metrics in [sklearn](https://machinelearningmastery.com/regression-metrics-for-machine-learning/). In the following, $y$ refers to the true (observed) values of the dependent variable and $\hat{y}$ to the predicted values.

**--> Mean Squared Error (MSE)**

$$ MSE = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y_i})^2 $$  

It is the cost function of the Ordinary Least Squares ($RSS(β)$ above) divided by the number of samples $N$. The returned value is in squared units of the target. Best score is 0.

**--> Root Mean Squared Error (RMSE)**

$$ RMSE = \sqrt{ \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y_i})^2 } $$  

It returns the square root of the MSE, so that the units match the units of the target value (which makes it easier to interpret). Best score is 0.

**--> Mean Absolute Error (MAE)**

$$ MAE = \frac{1}{N}\sum_{i=1}^{N} | y_i - \hat{y_i} | $$  

It is less sensitive to large errors when compared to (R)MSE. The score is in units of the target value and the best score is 0.

> Comment: Although arithmetically the best score for (R)MSE and MAE is 0, this cannot be reached in real-life problems. Instead, a baseline model has to be defined and its score calculated. Then, any model that achieves a better score than the baseline model is accepted as a skilful model.  


**--> R2 (coefficient of determination)**

$$R^2 = 1 - {\sum_{i=1}^N{(y_i-\hat{y_i})^2} \over \sum_{i=1}^N{(y_i - \bar{y})^2}}, $$ 

where $\bar{y} = \frac{1}{N}\sum_{i=1}^N y_i$. 

It represents the proportion of variance (of y) that has been explained by the independent variables in the model. It provides an indication of goodness of fit and therefore a measure of how well unseen samples are likely to be predicted by the model, through the proportion of explained variance. Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). 
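The formulas above can be sketched directly in NumPy. The helper name `regression_metrics` and the toy numbers are assumptions made up for this illustration. The second call scores a simple baseline that always predicts the mean of the targets, which gives $R^2 = 0$ by construction.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R2 computed from the formulas above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(residuals))
    ss_res = np.sum(residuals ** 2)                     # numerator of the R2 ratio
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)    # denominator of the R2 ratio
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

# Toy values (made up for illustration)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])

model_scores = regression_metrics(y_true, y_pred)

# Baseline model: always predict the mean of the targets
baseline_scores = regression_metrics(y_true, np.full_like(y_true, y_true.mean()))

print(model_scores)
print(baseline_scores)
```

In a real application the baseline mean would normally be taken from the training targets and scored on the test set; here a single toy array is used to keep the example short.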


# Application 1: Estimate SFR

We will use data derived from the Heraklion Extragalactic Catalogue (HECATE; [Kovlakas et al, 2021](https://ui.adsabs.harvard.edu/abs/2021MNRAS.506.1896K/abstract)) which is an all-sky galaxy catalogue, containing about 200k galaxies (up to z=0.047, D≲200Mpc), and it offers positions, sizes, distances, morphological classifications, star formation rates, stellar masses, metallicities, and nuclear activity classifications. 

In particular, we are going to use a dataset from the work of [Kouroumpatzakis et al. 2023](https://ui.adsabs.harvard.edu/abs/2023A%26A...673A..16K/abstract) where they estimate the Star Formation Rates (SFR) and stellar mass using model Spectral Energy Distributions with a variety of parameters (optical and IR photometry, color terms, and stellar populations). 

**TASK: build a regression model to predict SFR**


<font size=4>**CHALLENGE**:</font>   <br> 
    
They used MCMC and Random Forest, so your task is to build a DL model that will **outperform their results**!!! <br>

**Can you ?**
<br>
<br>

<center><img src="images/yoda_meme.jpg"></center>



In [None]:
# Loading packages and defining some functions

import numpy as np
import matplotlib.pyplot as plt
import time
import keras
from keras.layers import Activation, Dropout, Flatten, Dense, Input, BatchNormalization, Conv3D, MaxPooling3D, Add
from keras import regularizers
from keras.models import Model, Sequential
from keras.optimizers import Adam, SGD, Adagrad, RMSprop
from IPython.display import clear_output

class PlotLosses(keras.callbacks.Callback):
    # Live plot of the loss and of the first metric in the global list `test_metrics`
    # NOTE: the current version can plot only the first metric from the list provided
    def on_train_begin(self, logs=None):
        self.i = 0
        self.x = []
        self.losses = []
        self.val_losses = []
        self.losses2 = []
        self.val_losses2 = []
        
        self.fig = plt.figure()
        
        self.logs = []

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}   # avoid a mutable default argument
        self.logs.append(logs)
        self.x.append(self.i)
        self.losses.append(logs.get('loss'))
        self.val_losses.append(logs.get('val_loss'))
        self.losses2.append(logs.get( test_metrics[0] ) )
        self.val_losses2.append(logs.get( f'val_{test_metrics[0]}' ))
#         print(test_metrics[0] , logs.get( test_metrics[0] ), logs.get( f'val_{test_metrics[0]}' ))
#         print(self.losses2)

        self.i += 1
        
        clear_output(wait=True)
        plt.subplot(1,2,1)
        plt.plot(self.x, self.losses2, label="Train",linestyle='-')
        plt.plot(self.x, self.val_losses2, label="Validate",linestyle='--')
#        plt.ylim(0,1)
        plt.legend()
        plt.xlabel('Epoch')
        plt.ylabel(test_metrics[0])
        
        plt.subplot(1,2,2)
        plt.plot(self.x, self.losses, label="Train",linestyle='-')
        plt.plot(self.x, self.val_losses, label="Validate",linestyle='--')

        plt.legend()
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        
        plt.tight_layout()
        
        plt.show();
        
plot_losses = PlotLosses()


def sub_ploots(plots_cols, plots_array):
        """ Automatic adjustment of subplots
        Given
        plots_cols     : plots per row (set manually)
        plots_array    : array from which the number of 
                         individual subplots is derived

        the output is returned in this order:
        plots_rows     : number of necessary rows to create
        plots_cols     : plots per row
        """
        plots_unique = len((list(set(plots_array))))
        plots_rows = int(np.ceil(plots_unique/plots_cols))
        
        return plots_rows, plots_cols

## Loading and checking data 

In [None]:
from astropy.io import fits
from astropy.table import Table
# following this tutorial :https://learn.astropy.org/tutorials/FITS-tables.html

data_fits = fits.open('data/CIGALE_SFR.fits', memmap=True)

# printing some information 
data_fits.info()
print('-----')
print(data_fits[1].columns)

In [None]:
# printing the data table 

data_tab = Table(data_fits[1].data)
data_tab

## Selecting features

In this part we can select which features we want to keep and use to train our model. Feel free to choose any number of features - keep in mind though **NOT** to select `sfr` and `mstar` as these are needed for the output (i.e. they are the values we want to predict).

_HINT_: for the column names check above.

In [None]:
sel_features = [ ... , ...  ]

fig = plt.figure(figsize=(15,15))

rws, cls = sub_ploots(2, sel_features)  # for subplots
for f in range(len(sel_features)):
    f_name = sel_features[f]
    ax = fig.add_subplot(rws, cls, f+1)
    ax.set_title(f'Feature {f_name}')
    ax.hist(data_tab[f_name], bins='auto', align='mid', density=True)
    ax.set_xlabel('feature value')
    ax.set_ylabel('probability density')  # density=True normalizes the histogram
    
plt.show()

and the data table looks like ...

In [None]:
data_tab[sel_features]


In [None]:
# converting table objects to numpy arrays

values = data_tab[sel_features].as_array()
values = values.view((float, len(values.dtype.names)))
target = np.array(data_tab['sfr']).reshape(-1,1)

print('The values for selected features:')
print(values)
print()
print('The target quantity:')
print(target)
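The `view`-based conversion of a structured array to a plain 2D array relies on all selected columns sharing the same float type. As a hedged alternative sketch, NumPy's `structured_to_unstructured` performs the same conversion and also handles mixed numeric column types. The tiny array and field names below are made up and only stand in for the output of `as_array()`.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Made-up structured array with two fields of different float precision
structured = np.array(
    [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)],
    dtype=[("feat_a", "f4"), ("feat_b", "f8")],
)

# One row per object, one column per field, cast to a common dtype
plain = rfn.structured_to_unstructured(structured)

print(plain.shape)
print(plain.dtype)
```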


## Split the sample and ... ?

**TASK: what should you do before training the model?**
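As a hint, here is a minimal NumPy-only sketch of one common preparation: splitting the sample into train, validation and test sets, and standardizing the features with statistics computed on the training set only. The synthetic feature matrix and the 60/20/20 split are assumptions for illustration, not the values to use in the exercise.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up feature matrix: 1000 samples, 3 features on very different scales
X = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 10.0, 0.1], size=(1000, 3))

# Shuffle the indices, then cut into 60% train, 20% validation, 20% test
idx = rng.permutation(len(X))
n_train, n_valid = 600, 200
train_idx = idx[:n_train]
valid_idx = idx[n_train:n_train + n_valid]
test_idx = idx[n_train + n_valid:]

# Mean and standard deviation come from the training set ONLY ...
mean = X[train_idx].mean(axis=0)
std = X[train_idx].std(axis=0)

# ... and are then applied unchanged to all three sets
X_train_scaled = (X[train_idx] - mean) / std
X_valid_scaled = (X[valid_idx] - mean) / std
X_test_scaled = (X[test_idx] - mean) / std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

Computing the statistics on the training set alone keeps information from the validation and test sets out of the training process. The scaled validation and test sets will therefore have means close to, but not exactly, zero.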


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


X_train_full, X_test, y_train_full, y_test = train_test_split( ... , ... ,
                        test_size= ... ) #, random_state=42) 
X_train, X_valid, y_train, y_valid = train_test_split( ... , ...,
                        test_size= ... ) #, random_state=42) 

scaler = StandardScaler()
X_train_scaled = ... 
X_valid_scaled = ...
X_test_scaled = ...

print(f'> From {len(target)} sources:')
print(f'   {len(X_train)} (train)')
print(f'   {len(X_valid)} (validation)')
print(f'   {len(X_test)} (test)') 
print('\n> Statistics per feature:')
print(f'Mean: {scaler.mean_}')
print(f'Variance: {scaler.var_}')
print()
print(f' Target shape: {y_train.shape}') # print the shape of the output values


## Time to build our model! 

In [None]:
# if you want you can add more output metrics!
# but keep in mind that the plot_losses will use the first one.

test_metrics = [ 'mean_squared_error'] #, 'mean_absolute_error', ]


In [None]:
model = Sequential()

model.add( Dense( ... , input_shape=X_train_full.shape[1:], activation='relu') )

#model.add( Dense( ... , activation='relu') )


model.add( Dense( ... ) )

In [None]:
optzr = Adam(learning_rate= ... )  # `lr` is deprecated in recent Keras versions
model.compile(loss='mean_squared_error', optimizer=optzr, metrics = test_metrics)
model.summary()

## Finally ... train the model!

In [None]:
start_time = time.time() 

history=model.fit( ... , ... ,
                    batch_size= ..., 
                    epochs= ...,
                    validation_data=( X_valid_scaled, y_valid ),
                    callbacks=[plot_losses],shuffle=True)

# to print history contents
# history.history
elapsed_time = time.time() - start_time
time.strftime("%H:%M:%S", time.gmtime(elapsed_time))

## ... and the moment of truth - evaluation

_HINT_: remember on which set we evaluate our model

In [None]:
evaluation = model.evaluate( ... , ... )
print(f"Loss value: {evaluation[0]:.2f}")  
for m in range(len(test_metrics)):
    print(f"{test_metrics[m]}: {evaluation[m+1]:0.2f}")  
    
y_pred = model.predict( ... )    

## Checking the predictions

In [None]:
from sklearn.metrics import r2_score

fig = plt.figure(figsize=(10,10))

r2 = r2_score( ... , ...)

min_ax, max_ax = np.min(y_test), np.max(y_test)
    
plt.plot(y_test, y_pred, '.', alpha=0.4)
plt.title( f'R2 = {r2:0.2f}', fontsize=20)
plt.xlabel('true', fontsize=16)
plt.ylabel('prediction', fontsize=16)
plt.xlim([min_ax, max_ax])

# 1-to-1 line:
xx = np.linspace(min_ax, max_ax, 100)
plt.plot(xx, xx, c="grey", linestyle='--')

plt.show()

## Comparing with paper results! Is it better ???

Compare your results with Fig. 3 of [Kouroumpatzakis et al. (2023)](https://ui.adsabs.harvard.edu/abs/2023A%26A...673A..16K/abstract), where the ratio of the predicted to the true SFR values is plotted against the true SFR values.
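When ordering the points by true value for such a plot, each prediction has to stay attached to its true value. The sketch below, on made-up numbers, shows that `np.argsort` on the first column keeps the pairs together, whereas `np.sort(..., axis=0)` sorts every column independently and scrambles them. Using base-10 logarithms is an assumption here, following the usual convention for "log SFR".

```python
import numpy as np

# Made-up true and predicted values, shaped (n, 1)
y_true_demo = np.array([[3.0], [1.0], [2.0]])
y_pred_demo = np.array([[2.0], [1.5], [2.6]])

pairs = np.concatenate((y_true_demo, y_pred_demo), axis=1)   # shape (n, 2)

# Sort the rows by the first column: pairs stay intact
order = np.argsort(pairs[:, 0])
pairs_sorted = pairs[order]

# Sorting along axis 0 treats each column separately: pairs are broken
independent = np.sort(pairs, axis=0)

# Quantities for the comparison plot: log10(pred / true) against log10(true)
log_true = np.log10(pairs_sorted[:, 0])
log_ratio = np.log10(pairs_sorted[:, 1] / pairs_sorted[:, 0])

print(pairs_sorted)
print(independent)
```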

In [None]:
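# pairing true and predicted values and sorting the pairs by true value
arr = np.concatenate( (y_test, y_pred), axis=1 )
res = arr[np.argsort(arr[:, 0])]   # np.sort(arr, axis=0) would sort each column independently

fig = plt.figure(figsize=(10,10))

objs = np.arange(len(res))

# log10 of the ratio pred / true (matching the axis label) vs log10 of the true value
plt.plot( np.log10(res[:,0]), np.log10( res[:,1]/res[:,0]), '-g')
plt.xlabel('log SFR true', fontsize=16)
plt.ylabel('log (SFR pred / SFR true)', fontsize=16)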

plt.show()


<center><img src="images/Kouroumpatzakis2023-Fig3b.png"> 
Figure 5.1. Figure 3 from   <a href="https://ui.adsabs.harvard.edu/abs/2023A%26A...673A..16K/abstract" target="_blank" rel="noopener noreferrer">Kouroumpatzakis et al. (2023)</a> presenting the comparison between the true SFR as given by the CIGALE output (data used) and the best-fit models of MCMC and the Random Forest.</center>

_**Question**: So... did you manage to do better ?_

_HINT_: the resulting range should be smaller...


# Application 2: Estimate SFR and stellar mass

In this case we want to use all/part of the available features to estimate both the SFR (`sfr`) **and** the stellar mass (`mstar`). 

For this we need to adjust the number of output nodes and the shape of the data. 

**TASK 1: adjust the model to train and predict both values**

**TASK 2: use a custom loss function**
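To see what a weighted loss over two outputs does to the array shapes, here is a NumPy-only sketch. The function name `weighted_mse_per_sample`, the weights and the toy numbers are hypothetical; a loss used for training would have to be written with the deep learning framework's own tensor operations instead.

```python
import numpy as np

def weighted_mse_per_sample(y_true, y_pred, weights=(0.5, 0.5)):
    """Weighted sum of squared errors over the outputs, one value per sample."""
    y_true = np.asarray(y_true, dtype=float)      # (batch_size, n_outputs)
    y_pred = np.asarray(y_pred, dtype=float)      # (batch_size, n_outputs)
    sq_err = (y_pred - y_true) ** 2               # (batch_size, n_outputs)
    weighted = sq_err * np.asarray(weights)       # weights broadcast over the batch
    return weighted.sum(axis=1)                   # (batch_size,)

# Two made-up samples with two targets each (e.g. one column per predicted quantity)
y_true = np.array([[1.0, 10.0], [2.0, 20.0]])
y_pred = np.array([[1.5, 10.0], [2.0, 22.0]])

per_sample = weighted_mse_per_sample(y_true, y_pred)

# Giving more weight to the second output increases its contribution to the loss
per_sample_w = weighted_mse_per_sample(y_true, y_pred, weights=(0.2, 0.8))

print(per_sample, per_sample_w)
```

Unequal weights can be useful when the two targets have very different numerical ranges, since otherwise the output with the larger values tends to dominate the loss.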

In [None]:
print(data_fits[1].columns)

In [None]:
# select data 

sel_features = [ ... ... ]
pre_features = [ ... ...]

values = data_tab[sel_features].as_array()
values = values.view((float, len(values.dtype.names)))

target = data_tab[pre_features].as_array()
target = target.view((float, len(target.dtype.names)))


print('The values for selected features:')
print(values)
print()
print('The target quantity:')
print(target)


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


X_train_full, X_test, y_train_full, y_test = train_test_split( ... ,... ,
                        test_size= ... ) #, random_state=42) 
X_train, X_valid, y_train, y_valid = train_test_split(... ,... ,
                        test_size= ... ) #, random_state=42) 

scaler = StandardScaler()
X_train_scaled = ...
X_valid_scaled = ...
X_test_scaled = ...

print(f'> From {len(target)} sources:')
print(f'   {len(X_train)} (train)')
print(f'   {len(X_valid)} (validation)')
print(f'   {len(X_test)} (test)') 
print('\n> Statistics per feature:')
print(f'Mean: {scaler.mean_}')
print(f'Variance: {scaler.var_}')
print()
print(f' Target shape: {y_train.shape}') # print the shape of the output values


The function below defines an example of a custom loss function. 

In [None]:
import keras.backend as K

def custom_mse(y_true, y_pred):
 
    # calculating squared difference between target and predicted values 
    loss = K.square(y_pred - y_true)  # (batch_size, 2)
    print(loss)                       # debug: shows the tensor and its shape
    
    # multiplying each output column by its weight (broadcast over the batch)
    loss = loss * [0.5, 0.5]          # (batch_size, 2)
    print(loss)  
    # summing the weighted losses over the output dimension (axis=1)
    loss = K.sum(loss, axis=1)        # (batch_size,)

    return loss

# if you want to add more output metrics
# test_metrics = [ 'mean_absolute_error', 'mean_squared_error']
test_metrics = [ 'mean_squared_error'] #, 'mean_absolute_error', ]


In [None]:
model = Sequential()

model.add( Dense( ... , input_shape=X_train_full.shape[1:], activation='relu') )

#model.add( Dense( ... , activation='relu') )


model.add( Dense( ... ) )

In [None]:
optzr =  Adam(learning_rate= ... )  # `lr` is deprecated in recent Keras versions
model.compile(loss='mean_squared_error', optimizer=optzr, metrics = test_metrics)
model.summary()

In [None]:
start_time = time.time() 

history=model.fit( ... , ... ,
                    batch_size= ..., 
                    epochs= ...,
                    validation_data=( ... , ... ),
                    callbacks=[plot_losses],shuffle=True)

# to print history contents
# history.history
elapsed_time = time.time() - start_time
time.strftime("%H:%M:%S", time.gmtime(elapsed_time))

In [None]:
evaluation = model.evaluate( ... ,...)
print(f"Loss value: {evaluation[0]:.2f}")  
for m in range(len(test_metrics)):
    print(f"{test_metrics[m]}: {evaluation[m+1]:0.2f}")  

print('------')

y_pred = model.predict( ... ) 

fig, axes = plt.subplots(1,2, figsize=(12,6))
axes = axes.flatten()

for i, ax in enumerate(axes):
  
    r2 = r2_score( y_test[:,i] , y_pred[:,i])
    
    min_ax, max_ax = np.min(y_test[:,i]), np.max(y_test[:,i])

    ax.plot(y_test[:,i], y_pred[:,i], '.', alpha=0.4)
    ax.set_title( f'R2 = {r2:0.2f} for {pre_features[i]}', fontsize=20)
    ax.set_xlabel('true', fontsize=16)
    ax.set_ylabel('prediction', fontsize=16)
    ax.set_xlim([min_ax, max_ax])

    # 1-to-1 line:
    xx = np.linspace(min_ax, max_ax, 100)
    ax.plot(xx, xx, c="grey", linestyle='--')
    

plt.show()