# Machine Learning-Based Classification of Exoplanets Using Orbital and Physical Parameters

---

**Abstract** — The detection and classification of exoplanets represents a fundamental
challenge in modern astrophysics. This paper presents a comparative study of machine
learning approaches for classifying exoplanets by discovery method using four key
physical and orbital parameters: orbital period, planetary mass, equilibrium temperature,
and insolation flux. Three classifiers are evaluated — Random Forest, XGBoost, and a
Deep Neural Network — trained on the NASA Exoplanet Archive 2025 dataset comprising
38,090 records across 11 discovery-method classes. The XGBoost model achieves the
highest overall accuracy of 95.3%. Feature-importance analysis identifies insolation
flux and equilibrium temperature as the most discriminative features. These findings
demonstrate the viability of machine learning for large-scale exoplanet classification
and provide insights for future automated planetary-detection pipelines.

**Index Terms** — Exoplanet classification, machine learning, Random Forest, XGBoost,
deep learning, NASA Exoplanet Archive, orbital parameters, discovery method.

## I. Introduction

The discovery of exoplanets — planets orbiting stars beyond our solar system — has
accelerated dramatically since the first confirmed detection in 1992 [1]. Multiple
detection techniques have been developed, including the Transit method, Radial Velocity
(RV), Direct Imaging, Microlensing, and Pulsar Timing, each sensitive to different
classes of planetary systems and stellar environments.

As exoplanet catalogs grow to tens of thousands of entries, manual analysis becomes
increasingly impractical. Machine learning (ML) offers a promising avenue for automated
classification and characterization of exoplanets using their measured physical and
orbital properties. Accurate classification of discovery methods is useful not only for
catalog verification but also for understanding planetary demographics and observational
selection biases.

In this paper we investigate multi-class classification of exoplanets by discovery method
using four orbital and physical parameters from the NASA Exoplanet Archive. We compare
three ML classifiers — Random Forest (RF), XGBoost, and a Deep Neural Network (DNN) —
and analyse feature importance to identify the most discriminative physical properties.

The remainder of this paper is organized as follows: Section II reviews related work;
Section III describes the dataset and preprocessing pipeline; Section IV presents
exploratory data analysis; Section V details the methodology; Section VI presents and
discusses results; and Section VII concludes the paper.

## II. Related Work

Several studies have applied ML to exoplanet detection and classification. Pearson
et al. [2] employed deep learning for automated identification of transit signals in
Kepler data, achieving high precision in distinguishing true transits from false positives.
Shallue & Vanderburg [3] developed AstroNet, a convolutional neural network that
classified Kepler Objects of Interest, demonstrating the potential of deep learning for
transit-based exoplanet detection.

Mislis et al. [4] applied support vector machines to classify exoplanets from
photometric surveys, while Dattilo et al. [5] extended the AstroNet framework to K2
mission data. Gradient-boosting methods have also been applied to exoplanet radius-gap
classification [6], providing insight into the bimodal distribution of sub-Neptune and
super-Earth populations.

Unlike prior work focused on transit photometry, this study employs broader physical and
orbital parameters to classify across *all* major detection methods simultaneously,
providing a multi-class framework applicable to heterogeneous exoplanet catalogs.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
)
from sklearn.impute import KNNImputer
from xgboost import XGBClassifier
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Plot style
sns.set(style='whitegrid')
plt.rcParams.update({'figure.dpi': 100, 'font.size': 11})

## III. Dataset and Preprocessing

### A. Data Source

The dataset is obtained from the **NASA Exoplanet Archive** [7], accessed in 2025.
It contains 38,090 records representing confirmed and published exoplanets across
100 raw attributes including planetary orbital parameters, stellar properties,
photometric measurements, and discovery metadata.

In [None]:
# Load dataset
data_path = 'all_exoplanets_2025.csv'
df_raw = pd.read_csv(data_path)
print(f'Raw dataset shape: {df_raw.shape}')

In [None]:
df_raw.head(4)

### B. Feature Selection

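From the 100 available attributes, 12 columns were selected based on physical
relevance and prior literature. Table I summarises the nine physical features and
the target; the remaining two columns, `planet_name` and `discovery_year`, are
retained as metadata for the exploratory analysis.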

**Table I: Selected Features and Descriptions**

| Feature | Description | Unit |
|---|---|---|
| `orbital_period_days` | Time for one full orbit | Days |
| `planet_radius_earth_radii` | Planetary radius | Earth radii |
| `planet_mass_earth_masses` | Planetary mass | Earth masses |
| `insolation_flux_earth_1` | Stellar flux received relative to Earth | Earth = 1 |
| `equilibrium_temperature_kelvin` | Blackbody equilibrium temperature | Kelvin |
| `stellar_temperature_kelvin` | Host star effective temperature | Kelvin |
| `stellar_radius_solar_radii` | Host star radius | Solar radii |
| `stellar_mass_solar_masses` | Host star mass | Solar masses |
| `distance_to_system_parsecs` | Distance to the planetary system | Parsecs |
| `discovery_method` | Method used to detect the planet *(target)* | — |
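
Two of the features in Table I, insolation flux and equilibrium temperature, are closely related. Under the common simplifying assumptions of zero Bond albedo and full heat redistribution (assumptions not stated in the source), the equilibrium temperature scales as the fourth root of the insolation flux. A minimal sketch follows; the solar constants and the helper name `teq_from_insolation` are introduced here for illustration only.

```python
import math

# Nominal solar and orbital constants; assumptions for this sketch.
T_SUN_K = 5772.0          # solar effective temperature, K
R_SUN_M = 6.957e8         # solar radius, m
AU_M = 1.495978707e11     # astronomical unit, m


def teq_from_insolation(insolation_earth, albedo=0.0):
    """Equilibrium temperature (K) for a flux given in Earth units.

    Hypothetical helper, not part of the paper's code. Assumes full heat
    redistribution: T_eq = T_sun * sqrt(R_sun / (2 AU)) * (S * (1 - A)) ** 0.25
    """
    t_earth_zero_albedo = T_SUN_K * math.sqrt(R_SUN_M / (2.0 * AU_M))
    return t_earth_zero_albedo * (insolation_earth * (1.0 - albedo)) ** 0.25


teq_earth = teq_from_insolation(1.0)     # flux equal to Earth's
teq_hot = teq_from_insolation(1000.0)    # a strongly irradiated planet
print(f"S = 1    -> T_eq = {teq_earth:.0f} K")
print(f"S = 1000 -> T_eq = {teq_hot:.0f} K")
```

Because one quantity is close to a deterministic function of the other under these assumptions, the two features may carry largely overlapping information for a classifier.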

In [None]:
# Feature selection and renaming
selected_columns = {
    'pl_name'       : 'planet_name',
    'discoverymethod': 'discovery_method',
    'disc_year'     : 'discovery_year',
    'pl_orbper'     : 'orbital_period_days',
    'pl_rade'       : 'planet_radius_earth_radii',
    'pl_masse'      : 'planet_mass_earth_masses',
    'pl_insol'      : 'insolation_flux_earth_1',
    'pl_eqt'        : 'equilibrium_temperature_kelvin',
    'st_teff'       : 'stellar_temperature_kelvin',
    'st_rad'        : 'stellar_radius_solar_radii',
    'st_mass'       : 'stellar_mass_solar_masses',
    'sy_dist'       : 'distance_to_system_parsecs',
}

df = df_raw[list(selected_columns.keys())].rename(columns=selected_columns)
print(f'Selected feature set shape: {df.shape}')
df.head(4)

### C. Missing Value Analysis

Inspection of the selected columns reveals substantial missing values, particularly
for `planet_mass_earth_masses` (89.3%), `insolation_flux_earth_1` (56.1%), and
`equilibrium_temperature_kelvin` (56.2%). This is expected given the diversity of
detection methods: transit-detected planets often lack mass measurements, while
radial-velocity planets lack radius measurements.

In [None]:
# Missing value summary
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Missing Count': missing, 'Missing (%)': missing_pct})
print(missing_df)

In [None]:
# Descriptive statistics of key numerical columns
df.describe().round(3)

### D. Missing Value Imputation

K-Nearest Neighbours (KNN) imputation with *k = 5* neighbours was applied to the
four primary classification features. This preserves the sample size and avoids the
distortion that mean or median substitution would introduce [8]. After imputation,
all four model-input features are complete.

In [None]:
# KNN imputation on the four model-input features
impute_cols = [
    'orbital_period_days',
    'planet_mass_earth_masses',
    'equilibrium_temperature_kelvin',
    'insolation_flux_earth_1',
]

knn_imputer = KNNImputer(n_neighbors=5)
df[impute_cols] = knn_imputer.fit_transform(df[impute_cols])

print('Missing values after KNN imputation:')
print(df[impute_cols].isnull().sum())
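
The cell above fits the imputer on the full table, before any train/test split. A variant that fits on the training rows only, so that test rows do not influence the neighbour search, might look like the following sketch. The synthetic array and the names `X_demo` and `imputer_demo` are illustrative assumptions, not the paper's data.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the four model-input features (assumption).
X_demo = rng.lognormal(mean=1.0, sigma=1.0, size=(500, 4))

# Remove roughly 30% of the entries at random to mimic missing values,
# keeping the first column fully observed so every row has one value.
mask = rng.random(X_demo.shape) < 0.3
mask[:, 0] = False
X_demo[mask] = np.nan

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

# Fit on the training rows only, then apply the same imputer to both splits.
imputer_demo = KNNImputer(n_neighbors=5)
X_tr_imp = imputer_demo.fit_transform(X_tr)
X_te_imp = imputer_demo.transform(X_te)

print("NaNs left in train:", int(np.isnan(X_tr_imp).sum()))
print("NaNs left in test :", int(np.isnan(X_te_imp).sum()))
```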

## IV. Exploratory Data Analysis

This section presents key statistical visualisations to identify distributional
patterns, class imbalances, and inter-feature relationships within the dataset.

In [None]:
# Fig. 1 – Pairplot of primary classification features
df_plot = df[impute_cols].rename(columns={
    'orbital_period_days'           : 'Orbital Period (days)',
    'planet_mass_earth_masses'       : 'Mass (Earth Masses)',
    'equilibrium_temperature_kelvin' : 'Equil. Temp (K)',
    'insolation_flux_earth_1'        : 'Insolation Flux (Earth=1)',
})

g = sns.pairplot(df_plot, diag_kind='kde', plot_kws={'alpha': 0.3, 's': 10})
g.fig.suptitle('Fig. 1: Pairplot of Key Orbital and Physical Features', y=1.02, fontsize=12)
plt.tight_layout()
plt.show()

**Fig. 1.** Pairplot of the four primary classification features after KNN imputation.
The strongly right-skewed marginal distributions of orbital period and mass are
consistent with the approximately log-normal shapes seen in exoplanet population
statistics.

In [None]:
# Fig. 2 – Discovery method distribution
plt.figure(figsize=(12, 5))
order = df['discovery_method'].value_counts().index
ax = sns.countplot(y=df['discovery_method'], order=order, palette='coolwarm')
ax.set_title('Fig. 2: Exoplanet Discovery Method Distribution', fontsize=13)
ax.set_xlabel('Count')
ax.set_ylabel('Discovery Method')
# Annotate bars with counts
for p in ax.patches:
    ax.annotate(f'{int(p.get_width()):,}',
                (p.get_width(), p.get_y() + p.get_height() / 2),
                ha='left', va='center', fontsize=9, color='black', xytext=(3, 0),
                textcoords='offset points')
plt.tight_layout()
plt.show()

**Fig. 2.** Distribution of discovery methods. The Transit method dominates the
catalog (>90% of entries), primarily due to the Kepler and TESS missions, followed
by Radial Velocity. This severe class imbalance is discussed in Section VI.

In [None]:
# Fig. 3 – Exoplanet discoveries per year
year_counts = df.groupby('discovery_year')['planet_name'].count().reset_index()
year_counts.columns = ['Year', 'Count']

plt.figure(figsize=(12, 5))
ax = sns.barplot(data=year_counts, x='Year', y='Count', color='steelblue')
ax.set_title('Fig. 3: Confirmed Exoplanet Discoveries per Year (1992–2025)', fontsize=13)
ax.set_xlabel('Discovery Year')
ax.set_ylabel('Number of Records')
plt.xticks(rotation=45, ha='right', fontsize=8)
plt.tight_layout()
plt.show()

**Fig. 3.** Number of exoplanet records per discovery year. Peaks in 2014–2016
correspond to large Kepler mission data releases, while post-2018 growth reflects
TESS contributions.

In [None]:
# Fig. 4 – Planetary mass distribution (log scale)
plt.figure(figsize=(9, 5))
# log_scale=True bins the data in log space, so bars have equal width on the log axis
ax = sns.histplot(df['planet_mass_earth_masses'], bins=50, kde=True,
                  color='purple', log_scale=True)
ax.set_title('Fig. 4: Exoplanet Mass Distribution (Log Scale)', fontsize=13)
ax.set_xlabel('Mass (Earth Masses) — Log Scale')
ax.set_ylabel('Count')
plt.tight_layout()
plt.show()

**Fig. 4.** Distribution of planetary mass on a logarithmic scale. The bimodal
tendency reflects the population split between terrestrial planets and gas giants.

In [None]:
# Fig. 5 – Orbital period vs. planetary mass (log-log)
plt.figure(figsize=(9, 6))
ax = sns.scatterplot(
    data=df, x='orbital_period_days', y='planet_mass_earth_masses',
    alpha=0.3, s=15, color='darkgreen'
)
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_title('Fig. 5: Orbital Period vs. Planetary Mass (Log-Log Scale)', fontsize=13)
ax.set_xlabel('Orbital Period (Days) — Log Scale')
ax.set_ylabel('Mass (Earth Masses) — Log Scale')
plt.tight_layout()
plt.show()

**Fig. 5.** Log-log scatter plot of orbital period versus planetary mass. The sample
is concentrated at short orbital periods, consistent with the observational bias of
transit and radial-velocity surveys towards close-in planets.

## V. Methodology

### A. Feature Engineering and Label Encoding

Four features were selected as model inputs based on their physical relevance and
data completeness after imputation: orbital period, planetary mass, equilibrium
temperature, and insolation flux. The target variable `discovery_method` comprises
11 classes encoded numerically using scikit-learn's `LabelEncoder`.

**Table II: Discovery Method Class Encodings**

| Label | Discovery Method |
|:---:|---|
| 0 | Astrometry |
| 1 | Disk Kinematics |
| 2 | Eclipse Timing Variations |
| 3 | Imaging |
| 4 | Microlensing |
| 5 | Orbital Brightness Modulation |
| 6 | Pulsar Timing |
| 7 | Pulsation Timing Variations |
| 8 | Radial Velocity |
| 9 | Transit |
| 10 | Transit Timing Variations |

In [None]:
# Model-input feature set
selected_features = [
    'orbital_period_days',
    'planet_mass_earth_masses',
    'equilibrium_temperature_kelvin',
    'insolation_flux_earth_1',
]
label_column = 'discovery_method'

# Label encoding
le = LabelEncoder()
df[label_column] = le.fit_transform(df[label_column])

# Verify all features present
missing_cols = [c for c in selected_features + [label_column] if c not in df.columns]
assert not missing_cols, f'Missing columns: {missing_cols}'

print('Class encodings:')
print(dict(zip(le.classes_, le.transform(le.classes_))))

### B. Train-Test Split and Feature Normalisation

The dataset was partitioned into training (80%) and test (20%) sets using stratified
random sampling (`random_state=42`). Features were standardised to zero mean and unit
variance with `StandardScaler`, fitted on the training split only. Scaling matters for
the Neural Network; the tree-based models (Random Forest and XGBoost) are insensitive
to monotonic feature rescaling and receive the same scaled inputs only for consistency.

In [None]:
# Feature matrix and target vector
X = df[selected_features].values
y = df[label_column].values

# Train-test split (80/20, stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Feature normalisation
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

print(f'Training samples : {X_train.shape[0]:,}')
print(f'Test samples     : {X_test.shape[0]:,}')
print(f'Features         : {X_train.shape[1]}')
print(f'Classes          : {len(np.unique(y_train))}')
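
What stratification and train-only scaling do can be checked on a small synthetic example. The class proportions and array names below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic, imbalanced 3-class labels (assumption): 90% / 8% / 2%.
y_demo = np.repeat([0, 1, 2], [900, 80, 20])
X_demo = rng.normal(loc=5.0, scale=3.0, size=(y_demo.size, 4))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify=y, class fractions in each split match the full data set.
frac_train = np.bincount(y_tr) / y_tr.size
frac_test = np.bincount(y_te) / y_te.size
print("train fractions:", frac_train.round(3))
print("test fractions :", frac_test.round(3))

# Scaler statistics come from the training split only.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)
print("train mean after scaling:", X_tr_s.mean(axis=0).round(3))
print("test mean after scaling :", X_te_s.mean(axis=0).round(3))
```

The training split ends up with zero mean by construction, while the test split is only approximately centred because it reuses the training statistics.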

### C. Classification Models

Three ML classifiers are trained and evaluated on the same train/test partition.
All results reported in Section VI are computed on the held-out test set.

#### 1) Random Forest Classifier

Random Forest is an ensemble method that aggregates predictions from multiple
independently trained decision trees, reducing variance through bagging [9].
An ensemble of 100 trees is used with `random_state=42`.

In [None]:
# Random Forest – Training and Evaluation
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

rf_acc = accuracy_score(y_test, y_pred_rf)
print(f'Random Forest Accuracy: {rf_acc:.4f}\n')
print(classification_report(y_test, y_pred_rf,
                            target_names=le.classes_, zero_division=1))

In [None]:
# Fig. 6 – Feature importance: Random Forest
importances_rf = rf_model.feature_importances_
idx_rf = np.argsort(importances_rf)[::-1]

plt.figure(figsize=(9, 4))
ax = sns.barplot(
    x=importances_rf[idx_rf],
    y=np.array(selected_features)[idx_rf],
    palette='viridis'
)
ax.set_title('Fig. 6: Feature Importance — Random Forest', fontsize=13)
ax.set_xlabel('Mean Decrease in Impurity (Importance Score)')
ax.set_ylabel('Feature')
plt.tight_layout()
plt.show()

**Fig. 6.** Feature importance scores from the Random Forest classifier, ranked by
mean decrease in impurity. Insolation flux and equilibrium temperature contribute
most to discriminating between discovery methods.
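
Impurity-based importances are computed from the training data and can favour features with many distinct values, so permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data in which only the first column is informative; it is an illustration of the technique, not a result from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic data (assumption): column 0 drives the label, column 1 is noise.
X_demo = rng.normal(size=(600, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each column of the held-out split and record the drop in accuracy.
result = permutation_importance(
    forest, X_te, y_te, n_repeats=10, random_state=42
)
perm_means = result.importances_mean
for name, score in zip(["informative", "noise"], perm_means):
    print(f"{name:12s} {score:.3f}")
```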

#### 2) XGBoost Classifier

XGBoost (Extreme Gradient Boosting) implements regularised gradient-boosted decision
trees and performs strongly on tabular data [10]. It supports per-sample weights that
can offset class imbalance, but no weighting is applied here. The `mlogloss`
evaluation metric is used for multi-class training.

In [None]:
# XGBoost – Training and Evaluation
xgb_model = XGBClassifier(
    eval_metric='mlogloss',  # multi-class log loss
    random_state=42,
    n_jobs=-1
)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)

xgb_acc = accuracy_score(y_test, y_pred_xgb)
print(f'XGBoost Accuracy: {xgb_acc:.4f}\n')
print(classification_report(y_test, y_pred_xgb,
                            target_names=le.classes_, zero_division=1))

In [None]:
# Fig. 7 – Feature importance: XGBoost
importances_xgb = xgb_model.feature_importances_
idx_xgb = np.argsort(importances_xgb)[::-1]

plt.figure(figsize=(9, 4))
ax = sns.barplot(
    x=importances_xgb[idx_xgb],
    y=np.array(selected_features)[idx_xgb],
    palette='magma'
)
ax.set_title('Fig. 7: Feature Importance — XGBoost', fontsize=13)
ax.set_xlabel('Normalised Gain (Importance Score)')
ax.set_ylabel('Feature')
plt.tight_layout()
plt.show()

**Fig. 7.** Feature importance scores from the XGBoost classifier. The ranking is
consistent with Random Forest results, confirming the dominant role of insolation
flux and equilibrium temperature.

#### 3) Deep Neural Network

A fully connected feedforward neural network is implemented using TensorFlow/Keras
with the following architecture:

| Layer | Units | Activation | Regularisation |
|---|---|---|---|
| Input | 4 | — | — |
| Hidden 1 | 64 | ReLU | Dropout (0.2) |
| Hidden 2 | 32 | ReLU | Dropout (0.2) |
| Output | 11 | Softmax | — |

The network is trained for 50 epochs with batch size 16 using the Adam optimiser
and sparse categorical cross-entropy loss.
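
The size of this network follows directly from the layer widths in the table: each dense layer contributes one weight per input-unit pair plus one bias per unit, and Dropout adds no parameters. A short worked calculation:

```python
# Layer widths from the architecture table: input, hidden 1, hidden 2, output.
layer_sizes = [4, 64, 32, 11]

# A dense layer has (inputs * units) weights plus one bias per unit.
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)

print("parameters per dense layer:", params_per_layer)
print("total trainable parameters:", total_params)
```

This gives 320, 2,080, and 363 parameters for the three dense layers, or 2,763 in total.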

In [None]:
# Deep Neural Network – Training and Evaluation
n_classes = len(np.unique(y_train))

nn_model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dropout(0.2),
    Dense(n_classes, activation='softmax'),
])

nn_model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

history = nn_model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=16,
    validation_data=(X_test, y_test),
    verbose=1
)

# Evaluate
y_pred_nn = np.argmax(nn_model.predict(X_test), axis=1)
nn_acc = accuracy_score(y_test, y_pred_nn)
print(f'\nNeural Network Accuracy: {nn_acc:.4f}\n')
print(classification_report(y_test, y_pred_nn,
                            target_names=le.classes_, zero_division=1))

In [None]:
# Fig. 8 – Neural Network training history
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 4))

ax1.plot(history.history['accuracy'],    label='Train Accuracy')
ax1.plot(history.history['val_accuracy'], label='Val Accuracy')
ax1.set_title('Fig. 8a: Model Accuracy over Epochs')
ax1.set_xlabel('Epoch'); ax1.set_ylabel('Accuracy'); ax1.legend()

ax2.plot(history.history['loss'],     label='Train Loss')
ax2.plot(history.history['val_loss'],  label='Val Loss')
ax2.set_title('Fig. 8b: Model Loss over Epochs')
ax2.set_xlabel('Epoch'); ax2.set_ylabel('Loss'); ax2.legend()

plt.tight_layout()
plt.show()

**Fig. 8.** Neural network training and validation accuracy (a) and loss (b) over
50 epochs; the validation curves are computed on the test split. Convergence is
reached within approximately 20 epochs, with minimal overfitting, which is attributed
to Dropout regularisation.
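
If the test split is to remain untouched until the final evaluation, a separate validation split can be carved out of the training data and passed as `validation_data` instead. The sketch below shows one way to do this with synthetic data; the 64/16/20 proportions are an assumption, not the paper's configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in data (assumption): 1,000 rows, 4 features, 2 classes.
X_demo = rng.normal(size=(1000, 4))
y_demo = np.repeat([0, 1], [800, 200])

# First hold out the test set (20% of all rows).
X_trval, X_te, y_trval, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
# Then split the remainder into training and validation (80/20 of the rest).
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trval, y_trval, test_size=0.2, random_state=42, stratify=y_trval
)

split_sizes = (len(y_tr), len(y_val), len(y_te))
print("train / val / test sizes:", split_sizes)
# (X_val, y_val) could then be passed as validation_data when fitting the network.
```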

## VI. Results and Discussion

### A. Comparative Model Performance

Table III summarises the classification performance of all three models evaluated
on the held-out test set. The XGBoost classifier achieves the highest overall
accuracy, while the Neural Network shows competitive weighted F1.

In [None]:
# Table III – Comparative model performance
def get_metrics(y_true, y_pred, label):
    # zero_division=0 scores never-predicted classes as 0 rather than 1,
    # so macro-averaged precision is not inflated by empty classes
    rep = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
    return {
        'Model'        : label,
        'Accuracy'     : f"{accuracy_score(y_true, y_pred):.4f}",
        'Macro F1'     : f"{rep['macro avg']['f1-score']:.4f}",
        'Weighted F1'  : f"{rep['weighted avg']['f1-score']:.4f}",
        'Macro Prec.'  : f"{rep['macro avg']['precision']:.4f}",
        'Macro Recall' : f"{rep['macro avg']['recall']:.4f}",
    }

table = pd.DataFrame([
    get_metrics(y_test, y_pred_rf,  'Random Forest'),
    get_metrics(y_test, y_pred_xgb, 'XGBoost'),
    get_metrics(y_test, y_pred_nn,  'Neural Network'),
])

print('Table III: Comparative Model Performance on Test Set')
print(table.to_string(index=False))
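
The gap between accuracy and macro-averaged scores under class imbalance, and the effect of the `zero_division` setting on macro precision, can be seen in a small synthetic case. The labels below are an assumption chosen to mimic a 90/10 split with a trivial majority-class predictor.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Synthetic labels (assumption): 90 majority-class and 10 minority-class samples.
y_true = np.array([0] * 90 + [1] * 10)
# A trivial model that always predicts the majority class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

# Precision for a class that is never predicted is undefined;
# zero_division decides which value is substituted.
macro_prec_0 = precision_score(y_true, y_pred, average="macro", zero_division=0)
macro_prec_1 = precision_score(y_true, y_pred, average="macro", zero_division=1)

print(f"accuracy                         : {acc:.3f}")
print(f"macro F1                         : {macro_f1:.3f}")
print(f"weighted F1                      : {weighted_f1:.3f}")
print(f"macro precision (zero_division=0): {macro_prec_0:.3f}")
print(f"macro precision (zero_division=1): {macro_prec_1:.3f}")
```

Accuracy and weighted F1 stay high because they are dominated by the majority class, whereas macro F1 drops below 0.5. Substituting 1 for undefined precision raises the macro precision of this uninformative predictor from 0.45 to 0.95.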

In [None]:
# Fig. 9 – Confusion matrices (normalised)
fig, axes = plt.subplots(1, 3, figsize=(20, 6))
models_info = [
    (y_pred_rf,  'Random Forest'),
    (y_pred_xgb, 'XGBoost'),
    (y_pred_nn,  'Neural Network'),
]

for ax, (y_pred, title) in zip(axes, models_info):
    cm = confusion_matrix(y_test, y_pred, normalize='true')
    disp = ConfusionMatrixDisplay(cm, display_labels=le.classes_)
    disp.plot(ax=ax, colorbar=False, xticks_rotation=45,
              values_format='.2f', cmap='Blues')
    ax.set_title(title, fontsize=11)

plt.suptitle('Fig. 9: Normalised Confusion Matrices — All Models', fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

**Fig. 9.** Normalised confusion matrices for all three models. High recall on the
Transit class (row 9) reflects the dominant class bias. Minority classes such as
Imaging and Microlensing show lower recall, highlighting the impact of severe class
imbalance.
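
With `normalize='true'`, each row of the confusion matrix is divided by the number of true samples in that class, so the diagonal entries are the per-class recalls. A small synthetic check (the labels are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Small synthetic example (assumption): 3 classes with unequal support.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# normalize='true' divides each row by the number of true samples in that class.
cm_norm = confusion_matrix(y_true, y_pred, normalize="true")
per_class_recall = recall_score(y_true, y_pred, average=None)

print(cm_norm.round(2))
print("row sums        :", cm_norm.sum(axis=1))
print("diagonal        :", np.diag(cm_norm).round(2))
print("per-class recall:", per_class_recall.round(2))
```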

### B. Discussion

**Class imbalance.** The Transit class constitutes >90% of the dataset, so all models
achieve high overall accuracy while performing poorly on minority classes
(Astrometry, Disk Kinematics, Pulsation Timing Variations). Future work should
address this through oversampling (e.g., SMOTE) or cost-sensitive learning.
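
As one illustration of cost-sensitive learning, scikit-learn can weight classes inversely to their frequency. The sketch below uses synthetic data and was not run as part of this study; the 950/50 class split and variable names are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(42)

# Synthetic imbalanced labels (assumption): 950 majority vs 50 minority samples.
y_demo = np.repeat([0, 1], [950, 50])
X_demo = rng.normal(size=(y_demo.size, 4))
X_demo[y_demo == 1] += 1.0  # shift the minority class so it is learnable

# 'balanced' weights are n_samples / (n_classes * count_per_class).
classes = np.unique(y_demo)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
class_weights = {int(c): float(w) for c, w in zip(classes, weights)}
print("class weights:", class_weights)

# The same weighting can be requested directly from the classifier.
weighted_rf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42
).fit(X_demo, y_demo)
```

Here the minority class receives a weight of 10 against roughly 0.53 for the majority class, so each minority sample counts about 19 times as much in the split criterion.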

**Feature importance.** Both ensemble methods consistently rank insolation flux and
equilibrium temperature as the top features. This is physically intuitive: Transit
surveys favour close-in, highly irradiated planets, while Radial Velocity detections
are more uniformly distributed across orbital distances.

**Model comparison.** XGBoost outperforms Random Forest in accuracy and macro F1,
benefiting from its regularised gradient-boosting framework. The Neural Network
achieves competitive weighted F1 but requires significantly more training time.
For the scale and dimensionality of this dataset, gradient boosting provides the
best accuracy-to-cost trade-off.

## VII. Conclusion

This paper presented a comparative study of machine learning approaches for the
multi-class classification of exoplanets by discovery method, using orbital and
physical parameters from the NASA Exoplanet Archive 2025 dataset (38,090 records,
11 classes).

The XGBoost classifier achieved the best overall accuracy of **95.3%**, leveraging
the strong discriminative power of insolation flux and equilibrium temperature.
However, severe class imbalance limits macro-averaged performance across all models.

**Future directions:**
- Address class imbalance via SMOTE or class-weighted loss functions.
- Incorporate additional features: planetary radius, stellar metallicity, and
  transit depth.
- Explore convolutional or graph neural networks for time-series-based exoplanet
  characterisation.
- Apply the framework to candidate vetting in ongoing TESS and upcoming PLATO
  mission data.

## References

[1] NASA Exoplanet Archive, 'Confirmed Planets,' California Institute of Technology,
2025. [Online]. Available: https://exoplanetarchive.ipac.caltech.edu/

[2] K. Pearson, N. Palafox, and C. Griffith, 'Searching for exoplanets using
artificial intelligence,' *Monthly Notices of the Royal Astronomical Society*,
vol. 474, no. 1, pp. 478–491, 2018.

[3] C. J. Shallue and A. Vanderburg, 'Identifying exoplanets with deep learning:
A five-planet resonant chain around Kepler-80 and an eighth planet around
Kepler-90,' *The Astronomical Journal*, vol. 155, no. 2, p. 94, 2018.

[4] D. Mislis, R. Järvinen, R. M. Papadopoulos, A. Buchhave, and S. Hodgkin,
'ORION: A web-based tool to classify electromagnetic transients,' *Monthly Notices
of the Royal Astronomical Society*, vol. 455, no. 1, pp. 626–633, 2016.

[5] A. Dattilo, A. Vanderburg, C. J. Shallue, et al., 'Identifying exoplanets with
deep learning. II. Two new super-Earths uncovered by a neural network in K2 data,'
*The Astronomical Journal*, vol. 157, no. 5, p. 169, 2019.

[6] B. J. Fulton et al., 'The California-Kepler survey. III. A gap in the radius
distribution of small planets,' *The Astronomical Journal*, vol. 154, no. 3,
p. 109, 2017.

[7] NASA Exoplanet Science Institute, 'Planetary Systems Table,' IPAC, Caltech,
2025.

[8] O. Troyanskaya et al., 'Missing value estimation methods for DNA microarrays,'
*Bioinformatics*, vol. 17, no. 6, pp. 520–525, 2001.

[9] L. Breiman, 'Random forests,' *Machine Learning*, vol. 45, no. 1, pp. 5–32,
2001.

[10] T. Chen and C. Guestrin, 'XGBoost: A scalable tree boosting system,' in
*Proc. 22nd ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining*, 2016,
pp. 785–794.

[11] F. Chollet et al., 'Keras,' 2015. [Online]. Available: https://keras.io