# Exploring COSMOS with machine learning: galaxies’ physical properties

###### Pedro António Malta Ferreira

This notebook gives a concise, accessible description of the steps followed throughout the study.
In total, five Jupyter notebooks were developed:

- data_upload.ipynb
- data_preparation.ipynb
- ML_implementation.ipynb
- ML_algorithm_analysis.ipynb
- ML_algorithm_analysis_EASY.ipynb

Let's analyze each one in detail.

##### data_upload.ipynb

This notebook establishes a first contact with the catalogs. We begin by opening the .fit files, which contain all the information we are looking for. The data are pre-selected at this stage: since the study focuses on galaxies, we kept only the galaxies (using the 'lp_type' parameter), removing irrelevant structures, and (optionally) applied a flag parameter to eliminate potentially unreliable values. The catalogs were then saved as .parquet files.

In [None]:
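# Optional: drop sources flagged as potentially unreliable
#catalog = catalog[catalog['FLAG_COMBINED'] == 0]
# Keep only galaxies
catalog = catalog[catalog['lp_type'] == 0]
catalog = catalog.to_pandas()
# to_parquet writes the file and returns None, so its result is not reassigned
catalog.to_parquet('path')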
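
The selection step can be tried on a small in-memory table before touching the real catalog. The sketch below is only an illustration, assuming the catalog has already been converted to a pandas DataFrame; the helper name `select_galaxies` and the toy values are hypothetical, while `lp_type` and `FLAG_COMBINED` are the column names used above.

```python
import pandas as pd


def select_galaxies(catalog: pd.DataFrame, use_flag: bool = False) -> pd.DataFrame:
    """Keep rows with lp_type == 0; optionally also require FLAG_COMBINED == 0."""
    mask = catalog["lp_type"] == 0
    if use_flag:
        mask = mask & (catalog["FLAG_COMBINED"] == 0)
    return catalog[mask]


# Toy table standing in for the real catalog (values are made up)
toy = pd.DataFrame(
    {"lp_type": [0, 1, 0, 2, 0], "FLAG_COMBINED": [0, 0, 1, 0, 0]}
)

galaxies = select_galaxies(toy)                        # galaxy selection only
clean_galaxies = select_galaxies(toy, use_flag=True)   # galaxy selection plus flag cut
print(len(galaxies), len(clean_galaxies))
```

Keeping the flag cut behind a keyword argument mirrors the commented-out line in the cell above: the same function can produce both versions of the catalog.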

##### data_preparation.ipynb

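This notebook handles the preselection of the features and labels that we intend to extract from the catalogs. In this version (which can be quickly adapted to other configurations), only the magnitudes and fluxes measured in the APER3 aperture were selected from the Classic catalog; from the Farmer catalog, the magnitude and flux columns not already present in the Classic selection were kept. Error columns are excluded in both cases.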

In [None]:
columns_classic = set()
columns_farmer_not_classic = set()

# Classic: APER3 columns, excluding error columns
for column in catalog_classic.columns:
    if 'APER3' in column and 'ERR' not in column:
        columns_classic.add(column)

# Farmer: magnitude and flux columns, excluding errors and columns already taken from Classic
for column in catalog_farmer.columns:
    if ('MAG' in column or 'FLUX' in column) and 'ERR' not in column:
        if column not in columns_classic:
            columns_farmer_not_classic.add(column)

columns_classic = list(columns_classic)
columns_farmer_not_classic = list(columns_farmer_not_classic)
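
To see which names pass these filters, the same rules can be applied to plain lists of column names. This is a sketch under stated assumptions: the function name `select_feature_columns` and the example column names are hypothetical and are not taken from the catalogs. Unlike the set-based cell above, this variant uses lists, so the original column order is preserved.

```python
def select_feature_columns(classic_columns, farmer_columns):
    """Apply the APER3 / MAG / FLUX name filters and drop error columns."""
    classic = [
        c for c in classic_columns
        if "APER3" in c and "ERR" not in c
    ]
    farmer = [
        c for c in farmer_columns
        if ("MAG" in c or "FLUX" in c) and "ERR" not in c and c not in classic
    ]
    return classic, farmer


# Hypothetical column names, for illustration only
classic_cols = [
    "ID", "HSC_r_MAG_APER3", "HSC_r_MAGERR_APER3",
    "HSC_r_FLUX_APER3", "HSC_r_MAG_APER2",
]
farmer_cols = ["ID", "HSC_r_MAG", "HSC_r_MAGERR", "HSC_r_FLUX", "HSC_r_FLUXERR"]

classic_selected, farmer_selected = select_feature_columns(classic_cols, farmer_cols)
print(classic_selected)
print(farmer_selected)
```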

Two catalogs were created by merging the Classic and Farmer datasets, one containing the feature data and the other containing the labels that will be used.

In [None]:
# Join the two feature tables on their shared index, then replace missing values with 0
data = cat_classic.merge(cat_farmer, left_index=True, right_index=True).fillna(0)

In [None]:
target = cat_labels.fillna(0)

Note: in this version, NaN (Not a Number) values are replaced with zero, but other choices are possible, such as the value -99.9 used in the relevant study.
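
The effect of the fill value is easy to check on toy data. The following is a minimal sketch, assuming two small DataFrames that share an index; the helper `build_features`, the column names, and the values are hypothetical. Note that `merge` with `left_index=True, right_index=True` performs an inner join by default, so rows present in only one table are dropped before the missing values are filled.

```python
import numpy as np
import pandas as pd


def build_features(cat_classic, cat_farmer, fill_value=0.0):
    """Inner-join two feature tables on their index and fill missing values."""
    merged = cat_classic.merge(cat_farmer, left_index=True, right_index=True)
    return merged.fillna(fill_value)


# Toy tables (made-up values); index 12 and 13 each appear in one table only
cat_classic = pd.DataFrame({"MAG_APER3": [24.1, np.nan, 22.7]}, index=[10, 11, 12])
cat_farmer = pd.DataFrame({"MAG": [24.0, 23.5, np.nan]}, index=[10, 11, 13])

data_zero = build_features(cat_classic, cat_farmer)                       # NaN -> 0
data_sentinel = build_features(cat_classic, cat_farmer, fill_value=-99.9)  # NaN -> -99.9
print(data_zero)
print(data_sentinel)
```

A fill value of zero is a physically plausible number for some columns, whereas a sentinel such as -99.9 lies far outside the range of real measurements; which is preferable depends on how the model should treat missing data.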

##### ML_implementation.ipynb

This notebook focuses solely on the practical application: predicting the physical quantities with three algorithms (XGBoost, LightGBM, and CatBoost), configured with the parameters below, and saving the results in a dataframe.

In [None]:
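from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

models = [
    XGBRegressor(
        objective='reg:squarederror',
        n_estimators=100,
        max_depth=8,
        n_jobs=-1,  # use all available cores (replaces the older nthread name)
    ),
    LGBMRegressor(
        objective='regression',
        n_jobs=-1,
        n_estimators=100,
        max_depth=8,
        subsample=0.8,  # only takes effect when subsample_freq > 0
        verbosity=-1
    ),
    CatBoostRegressor(
        loss_function='RMSE',
        logging_level='Silent',
        n_estimators=100,
        max_depth=8
    )
]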
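
One plausible way to gather the output of several models in a single dataframe is sketched below. It is not the notebook's actual code: `collect_predictions` is a hypothetical helper, and `MeanRegressor` is a deliberately trivial stand-in that only mimics the `fit`/`predict` interface shared by the three regressors above, so that the example runs without those libraries.

```python
import numpy as np
import pandas as pd


class MeanRegressor:
    """Hypothetical stand-in: predicts the mean of the training target."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


def collect_predictions(models, X_train, y_train, X_test):
    """Fit each model and store its test-set predictions, one column per model."""
    predictions = {}
    for model in models:
        model.fit(X_train, y_train)
        predictions[type(model).__name__] = model.predict(X_test)
    return pd.DataFrame(predictions, index=X_test.index)


# Toy data (made-up values)
X_train = pd.DataFrame({"mag": [20.0, 21.0, 22.0, 23.0]})
y_train = pd.Series([0.5, 1.0, 1.5, 2.0])
X_test = pd.DataFrame({"mag": [20.5, 22.5]}, index=[100, 101])

results = collect_predictions([MeanRegressor()], X_train, y_train, X_test)
print(results)
```

Because the helper relies only on `fit` and `predict`, the `models` list defined above could be passed to it unchanged for a single-target regression.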

A train-validation-test split was used with the following parameters:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
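
Because the second split is applied to what remains after the first, the validation fraction is 20% of 85%, not 20% of the full sample. The short calculation below makes the resulting proportions explicit; `split_fractions` is a hypothetical helper name introduced only for this illustration.

```python
def split_fractions(test_size, val_size):
    """Return (train, validation, test) fractions for two chained splits."""
    remaining = 1.0 - test_size       # share left after holding out the test set
    val = remaining * val_size        # validation is taken from the remainder
    train = remaining - val
    return train, val, test_size


train_frac, val_frac, test_frac = split_fractions(test_size=0.15, val_size=0.2)
print(f"train={train_frac:.2f}, validation={val_frac:.2f}, test={test_frac:.2f}")
# prints: train=0.68, validation=0.17, test=0.15
```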

##### ML_algorithm_analysis.ipynb

Here, we focus solely on developing the analysis. We opted for the CatBoost algorithm with MultiOutputRegression (using the LePhare labels), which proved to be the most efficient within a reasonable time frame.

In [None]:
X = data
y = target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)

models = [
    CatBoostRegressor(
        n_estimators=200,
        max_depth=8,
        verbose=0
    )
]

Metrics were defined for the predicted labels, and various graphical studies were presented, such as lines separating catastrophic outliers from the other points and histograms depicting the data distribution.

In [None]:
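import numpy as np
import pandas as pd
from sklearn.metrics import r2_score

def metric_redshift(x, y):
    # x: reference redshifts, y: predicted redshifts (x enters the 1 + z normalization)
    met = np.abs(pd.Series(y - x))
    # Absolute residual normalized by (1 + z)
    f_out = met / (1 + x.astype(np.float32))
    nmad = 1.48 * np.median(f_out)
    # Median of the absolute normalized residuals, so this value is never negative
    bias = np.median(f_out)
    # Catastrophic outliers: normalized residual above 0.15
    y_outlier = pd.Series(np.where(f_out > 0.15, 'outlier', 'not outlier'))
    print("{}\n".format(y_outlier.value_counts()))
    print("Bias: {:.2f}\n".format(bias))
    print("NMAD score: {:.2f}\n".format(nmad))
    print('r2= {:.2f}'.format(r2_score(x, y)))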
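
For comparison, a commonly used variant of these statistics takes the bias as the median of the signed normalized residuals and centres the NMAD on that median. The sketch below illustrates that variant only; it is not necessarily the definition adopted in the study, and the function name `redshift_metrics` and the toy values are hypothetical. It returns the numbers instead of printing them, which makes them easier to tabulate.

```python
import numpy as np


def redshift_metrics(z_true, z_pred, outlier_threshold=0.15):
    """Signed bias, median-centred NMAD, and outlier fraction (illustrative variant)."""
    z_true = np.asarray(z_true, dtype=float)
    z_pred = np.asarray(z_pred, dtype=float)
    dz = (z_pred - z_true) / (1.0 + z_true)          # signed normalized residual
    bias = np.median(dz)                              # can be negative or positive
    nmad = 1.48 * np.median(np.abs(dz - bias))        # scatter around the median
    outlier_fraction = np.mean(np.abs(dz) > outlier_threshold)
    return {
        "bias": float(bias),
        "nmad": float(nmad),
        "outlier_fraction": float(outlier_fraction),
    }


# Toy values (made up) with one catastrophic outlier
z_true = np.array([0.0, 1.0, 1.0, 3.0, 1.0])
z_pred = np.array([0.1, 1.0, 1.2, 1.0, 0.8])
summary = redshift_metrics(z_true, z_pred)
print(summary)
```

With a signed bias, a systematic over- or under-estimation of the redshift shows up as a positive or negative value, which the absolute-value version cannot distinguish.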

In [None]:
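def metrics(x, y):
    # x: reference values, y: predicted values; residuals are not normalized here
    f_out = np.abs(pd.Series(y - x))
    nmad = 1.48 * np.median(f_out)
    # Median of the absolute residuals, so this value is never negative
    bias = np.median(f_out)
    # Outliers: absolute residual above 0.3
    y_outlier = pd.Series(np.where(f_out > 0.3, 'outlier', 'not outlier'))
    print("{}\n".format(y_outlier.value_counts()))
    print("Bias: {:.2f}\n".format(bias))
    print("NMAD score: {:.2f}\n".format(nmad))
    print('r2= {:.2f}'.format(r2_score(x, y)))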
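
The lines separating catastrophic outliers from the other points follow directly from the thresholds in these functions. For the redshift metric, a point is an outlier when |y - x| / (1 + x) > 0.15, so the boundaries are y = x ± 0.15 (1 + x). The sketch below computes these two lines on a grid; `outlier_bounds` is a hypothetical helper name, and the plotting itself is left out.

```python
import numpy as np


def outlier_bounds(z_true, threshold=0.15):
    """Lower and upper boundary lines: z_true -/+ threshold * (1 + z_true)."""
    z_true = np.asarray(z_true, dtype=float)
    margin = threshold * (1.0 + z_true)
    return z_true - margin, z_true + margin


grid = np.linspace(0.0, 4.0, 5)      # reference values 0, 1, 2, 3, 4
lower, upper = outlier_bounds(grid)
print(lower)
print(upper)
```

For the generic `metrics` function, where the residual is not normalized, the corresponding boundaries are simply the parallel lines y = x ± 0.3.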

##### ML_algorithm_analysis_EASY.ipynb

Here we performed exactly the same analysis, but using the labels provided by the EAZY software instead.