## ML models

This notebook trains the following ML models:

1. Linear Regressor
2. Decision Tree
3. Support-Vector Machine
4. K-Nearest Neighbours
5. Random Forests

as well as two boosting methods:

1. Extreme Gradient Boosting Machine
2. Light Gradient Boosting Machine

In [3]:
import pandas as pd
import numpy as np
import time
import os
import h5py
from loguru import logger
%matplotlib inline

from sklearn.linear_model import Ridge, LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, VotingRegressor)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
from lightgbm import LGBMRegressor as lgb
from xgboost import XGBRegressor as xgb

from sklearn.utils import shuffle
from sklearn.decomposition import PCA
from sklearn.metrics import (mean_squared_error, mean_absolute_error, r2_score, root_mean_squared_error, root_mean_squared_log_error, mean_absolute_percentage_error)
from sklearn.model_selection import (cross_validate, KFold, cross_val_score, train_test_split)
import optuna

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Function to convert RA and Dec to Cartesian coordinates
def spherical_to_cartesian(ra, dec, distance=1):
    ra_rad = np.radians(ra)
    dec_rad = np.radians(dec)
    x = distance * np.cos(dec_rad) * np.cos(ra_rad)
    y = distance * np.cos(dec_rad) * np.sin(ra_rad)
    z = distance * np.sin(dec_rad)
    return x, y, z

# Sample data
ra = [10, 20, 30, 40]  # in degrees
dec = [-10, 0, 10, 20] # in degrees

# Convert to Cartesian coordinates
x, y, z = spherical_to_cartesian(ra, dec)

# Plotting
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z)

ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')

plt.show()
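
One way to sanity-check the conversion above is to invert it and confirm that the original angles come back. The helper `cartesian_to_spherical` below is a hypothetical addition for illustration and is not part of the notebook.

```python
import numpy as np

def spherical_to_cartesian(ra, dec, distance=1):
    # Same conversion as in the cell above (angles in degrees).
    ra_rad = np.radians(ra)
    dec_rad = np.radians(dec)
    x = distance * np.cos(dec_rad) * np.cos(ra_rad)
    y = distance * np.cos(dec_rad) * np.sin(ra_rad)
    z = distance * np.sin(dec_rad)
    return x, y, z

def cartesian_to_spherical(x, y, z):
    # Hypothetical inverse: returns RA and Dec in degrees plus the distance.
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    distance = np.sqrt(x**2 + y**2 + z**2)
    ra = np.degrees(np.arctan2(y, x)) % 360.0   # wrap RA into [0, 360)
    dec = np.degrees(np.arcsin(z / distance))
    return ra, dec, distance

ra = np.array([10, 20, 30, 40])    # in degrees
dec = np.array([-10, 0, 10, 20])   # in degrees

x, y, z = spherical_to_cartesian(ra, dec)
ra_back, dec_back, dist_back = cartesian_to_spherical(x, y, z)

print(np.allclose(ra_back, ra), np.allclose(dec_back, dec), np.allclose(dist_back, 1.0))
```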


In [4]:
log_dir = '../../../logs'
kfold = KFold(n_splits=5)
pca = PCA(n_components=1)

In [7]:
def train_model(model, X, y):
    
    y = pca.fit_transform(y)
    mse_scores = []
    st = time.time()
    for fold, (train_idx, test_idx) in enumerate(kfold.split(X, y)):

        print(f"Training on fold {fold}")
        x_train, x_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        model.fit(x_train, y_train.squeeze(1))
        
        y_pred = model.predict(x_test)
        
        # scoring metrics
        mse = mean_squared_error(y_test.squeeze(1), y_pred)

        print(f"MSE for fold {fold}: {mse}")
        mse_scores.append(mse)

    print("Mean MSE:",np.mean(mse_scores))
    

    log_file = os.path.join(log_dir, f'train_{model.__class__.__name__}.log')
    handler_id = logger.add(log_file, format="{time} - {level} - {message}")
    logger.info(f"Cross-validation technique:{kfold.__class__.__name__},\
                Number of splits:{kfold.n_splits},\
                Time taken:{time.time()-st},\
                MSE:{mse_scores},\
                Mean MSE:{np.mean(mse_scores)}")
    # Remove the file sink so later calls do not also write to this model's log file
    logger.remove(handler_id)
    return
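
For comparison, the manual fold loop can be cross-checked against scikit-learn's `cross_val_score`. This is a minimal sketch on synthetic data, not the notebook's training routine: the `_demo` arrays are made up, and `LinearRegression` is used only because it is fast and deterministic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))                                   # synthetic features
y_demo = X_demo @ rng.normal(size=5) + 0.1 * rng.normal(size=200)    # synthetic target

cv = KFold(n_splits=5)

# Manual loop, mirroring the structure of train_model.
manual_mse = []
for train_idx, test_idx in cv.split(X_demo):
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    y_pred = model.predict(X_demo[test_idx])
    manual_mse.append(mean_squared_error(y_demo[test_idx], y_pred))

# Built-in equivalent: scikit-learn reports the negated MSE, so flip the sign.
builtin_mse = -cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error"
)

print("Manual MSE per fold:  ", np.round(manual_mse, 6))
print("Built-in MSE per fold:", np.round(builtin_mse, 6))
```

The manual loop remains useful when per-fold printing or custom logging is needed, as in `train_model`.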

In [6]:
def get_data(name:str='', SHUFFLE_FLAG:bool=False, NORM_FLAG:bool=True, random_state:int=42):
    '''
    Function to select data

    Arguments
    ---------
    name: str, (required)
        name of dataset to be returned
    SHUFFLE_FLAG: bool, (optional)
        Flag for if the data should be shuffled
    NORM_FLAG: bool, (optional)
        If the data should be normalized
    random_state: int, (optional)
        random_state
    
    Returns
    -------
    X: numpy.ndarray
        input features (one spectrum per row)
    y: numpy.ndarray
        target labels
    '''

    if not name:
        raise ValueError("Required argument 'name' is missing.")
    
    if name == "gaia":
        dir = '../data/Gaia DR3/gaia_lm_m_stars.parquet'
        data = pd.read_parquet(dir)
        if SHUFFLE_FLAG:
            df = shuffle(data, random_state=random_state)
        else:
            df = data
        X = np.vstack(df['flux'])
        y = np.vstack(df['Cat'])
        
        y = np.where(y == 'M', 1, y)
        y = np.where(y == 'LM', 0, y)

        y = y.astype(int)

        if NORM_FLAG:
            norm = np.linalg.norm(X,keepdims=True)
            X = X/norm
            

    elif name == 'apogee':
        dir = '../../../data/APOGEE'
        train_dir = dir + '/training_data.h5'
        test_dir = dir +'/test_data.h5'

        with h5py.File(train_dir, 'r') as f:
            X = f['spectrum'][:]
            y = np.hstack((f['TEFF'],
                        f['LOGG'],
                        f['FE_H']))
        
        #TODO: add shuffle

        if NORM_FLAG:
            norm_dir = dir + '/mean_and_std.npy'
            norm_data = np.load(norm_dir)
            
            mean = norm_data[0]
            std = norm_data[1]
            y = (y-mean)/std

    else:
        raise ValueError(f"Unknown dataset name: {name!r}")

    return X, y
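
The label normalization in the APOGEE branch is a per-column standardization. The sketch below reproduces it on synthetic labels and shows how predictions could be mapped back to physical units. It assumes that `mean_and_std.npy` holds a `(2, 3)` array with the means in row 0 and the standard deviations in row 1; the notebook does not confirm that layout, and the label values here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the three labels (TEFF, LOGG, FE_H); values are invented.
y_raw = np.column_stack([
    rng.normal(4800.0, 300.0, size=50),
    rng.normal(2.5, 0.5, size=50),
    rng.normal(-0.2, 0.3, size=50),
])

# Assumed layout of the saved statistics: row 0 = means, row 1 = standard deviations.
norm_data = np.vstack([y_raw.mean(axis=0), y_raw.std(axis=0)])
mean, std = norm_data[0], norm_data[1]

# Forward transform, as in get_data: each column is centred and scaled independently.
y_scaled = (y_raw - mean) / std

# Inverse transform: recovers the labels in their original units.
y_restored = y_scaled * std + mean

print("Scaled column means:", np.round(y_scaled.mean(axis=0), 6))
print("Scaled column stds: ", np.round(y_scaled.std(axis=0), 6))
```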

In [8]:
#X, y = get_data('gaia', SHUFFLE_FLAG=True)
X, y = get_data('apogee')

num_samples = X.shape[0]
spectrum_width = X.shape[1]

num_samples_m = np.count_nonzero(y)
num_samples_lm = len(y) - num_samples_m
num_classes = len(np.unique(y))

print("Total number of spectra:", num_samples)
print("Number of bins in each spectrum:", spectrum_width)

Total number of spectra: 44784
Number of bins in each spectrum: 7214


## Linear Regressor

In [None]:
lr = LinearRegression()
train_model(lr, X, y)

## Decision Trees

In [None]:
dtr = DecisionTreeRegressor()
train_model(dtr, X, y)

## Random Forest

In [None]:
rfr = RandomForestRegressor()
train_model(rfr, X, y)

## K-Nearest Neighbours

In [None]:
knn = KNeighborsRegressor()
train_model(knn, X, y)

## Light Gradient Boosting Machine

In [9]:
lgbm = lgb()
train_model(lgbm, X, y)

Training on fold 0
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 3.865823 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1839570
[LightGBM] [Info] Number of data points in the train set: 35827, number of used features: 7214
[LightGBM] [Info] Start training from score 0.001407
MSE for fold 0: 0.011321316591446471
Training on fold 1
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 3.943052 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1839570
[LightGBM] [Info] Number of data points in the train set: 35827, number of used features: 7214
[LightGBM] [Info] Start training from score 0.002391
MSE for fold 1: 0.009775777811238371
Training on fold 2
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 3.550972 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [I

2024-05-22 18:13:15.602 | INFO     | __main__:train_model:27 - Cross-validation technique:KFold,                Number of splits:5,                Time taken:624.9241380691528,                MSE:[0.011321316591446471, 0.009775777811238371, 0.01067043287345182, 0.011385926848572142, 0.010112977260240071],                Mean MSE:0.010653286276989773


MSE for fold 4: 0.010112977260240071
Mean MSE: 0.010653286276989773


## Extreme Gradient Boosting Machine

In [10]:
xgbm = xgb()
train_model(xgbm, X, y)

Training on fold 0
MSE for fold 0: 0.015175866894423962
Training on fold 1
MSE for fold 1: 0.012978886254131794
Training on fold 2
MSE for fold 2: 0.014031355269253254
Training on fold 3
MSE for fold 3: 0.014668754301965237
Training on fold 4


2024-05-22 18:34:50.912 | INFO     | __main__:train_model:27 - Cross-validation technique:KFold,                Number of splits:5,                Time taken:1294.7455422878265,                MSE:[0.015175867, 0.012978886, 0.014031355, 0.014668754, 0.013670755],                Mean MSE:0.014105123467743397


MSE for fold 4: 0.013670754618942738
Mean MSE: 0.0141051235
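
The two boosting runs can be compared directly from the logged values. The sketch below copies the per-fold MSE scores and timings from the log lines above and recomputes the summary statistics. The variable names are illustrative, and the MSE is measured on the normalized, PCA-reduced target rather than in physical units.

```python
import numpy as np

# Per-fold MSE and wall-clock time (seconds), copied from the log output above.
lgbm_mse = np.array([0.011321316591446471, 0.009775777811238371,
                     0.01067043287345182, 0.011385926848572142,
                     0.010112977260240071])
xgb_mse = np.array([0.015175867, 0.012978886, 0.014031355,
                    0.014668754, 0.013670755])
lgbm_time = 624.9241380691528
xgb_time = 1294.7455422878265

lgbm_mean, xgb_mean = lgbm_mse.mean(), xgb_mse.mean()

# Relative gap in mean MSE and ratio of training times.
mse_gap = (xgb_mean - lgbm_mean) / lgbm_mean
time_ratio = xgb_time / lgbm_time

print(f"LightGBM mean MSE: {lgbm_mean:.6f} (std {lgbm_mse.std():.6f})")
print(f"XGBoost  mean MSE: {xgb_mean:.6f} (std {xgb_mse.std():.6f})")
print(f"XGBoost mean MSE is {mse_gap:.1%} higher than LightGBM")
print(f"XGBoost took {time_ratio:.2f}x as long as LightGBM")
```

With default hyperparameters, LightGBM has the lower MSE on every fold and finishes in roughly half the time. Both models are untuned, so this ranking may change after a hyperparameter search.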
