#### Model Choice

##### Why Random Forest over other types?

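Random Forest was chosen over linear and logistic regression because it can model more complex patterns in the data. As an ensemble of decision trees, it captures non-linear relationships and interactions between features, which linear models miss unless those terms are engineered by hand.
It is also robust to outliers and noise, needs no feature scaling, and works with simply label-encoded categorical variables, so little preprocessing is required.
Class imbalance was a major issue in this dataset: in the test split, two of the eight status classes account for roughly 95% of the samples.
The linear and logistic regression models struggled with this imbalance. Random Forest with `class_weight='balanced'` (the setting selected by the grid search) coped with it better, making it the preferred model, although the rarest classes remain the hardest to predict (recall of 0.42 and 0.30 for classes 2 and 3).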
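
As a rough illustration of what `class_weight='balanced'` does, the sketch below computes the per-class weights scikit-learn would assign. The class counts are an assumption made for illustration: they are the row sums of the test-set confusion matrix printed in this notebook's output, not the training-set counts the model actually saw, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative class counts: row sums of the test-set confusion matrix
# reported in this notebook's output (training-set counts will differ).
class_counts = np.array([5711, 5612, 71, 195, 206, 35, 24, 103])
classes = np.arange(len(class_counts))

# Rebuild a label vector with those counts so scikit-learn can weight it.
y_example = np.repeat(classes, class_counts)

# "balanced" weights follow n_samples / (n_classes * count_per_class).
manual_weights = len(y_example) / (len(classes) * class_counts)
sklearn_weights = compute_class_weight(
    class_weight="balanced", classes=classes, y=y_example
)

for label, count, weight in zip(classes, class_counts, sklearn_weights):
    print(f"class {label}: {count:>5} samples -> weight {weight:.2f}")
```

With these counts, the rarest class receives a weight more than two hundred times that of the most common one. This is how the `'balanced'` setting keeps the two majority classes from dominating the splits in each tree.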

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import joblib

# Load the dataset
file_path = '../data/combined_df.csv'
data = pd.read_csv(file_path, low_memory=False)

# Selecting features and target
features = [
    'period_mins', 'perigee_km', 'apogee_km', 'inclination', 
    'object_type', 'object_owner'
]
target = 'status'

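# Handling missing values
original_dtypes = data[features].dtypes.to_dict()
imputer = SimpleImputer(strategy='most_frequent')
data[features] = imputer.fit_transform(data[features])
# Imputing mixed numeric/text columns together returns an object array; restore the
# original dtypes so the encoding loop below only label-encodes the text columns
data[features] = data[features].astype(original_dtypes)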

# Ensure the status column has no missing values
data[target] = data[target].fillna(data[target].mode()[0])

# Encode categorical features
label_encoders = {}
for col in features:
    if data[col].dtype == 'object':
        label_encoders[col] = LabelEncoder()
        data[col] = label_encoders[col].fit_transform(data[col])

# Identify unique values in the 'status' column
unique_statuses = data[target].unique()

# Create a status mapping dynamically to include all unique statuses
status_mapping = {label: idx for idx, label in enumerate(unique_statuses)}

# Map the status column using the dynamic status mapping
data[target] = data[target].map(status_mapping)

# Check for any NaN values in the mapped target column
if data[target].isnull().sum() > 0:
    raise ValueError("The 'status' column contains NaN values after mapping.")

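# Save the status mapping and the fitted label encoders for later use
# (both are needed to preprocess new data and decode predictions)
joblib.dump(status_mapping, '../artifacts/status_mapping.joblib')
# Assumption: the file name below is illustrative and not part of the original pipeline
joblib.dump(label_encoders, '../artifacts/label_encoders.joblib')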

# Splitting the dataset into training and testing sets
X = data[features]
y = data[target]

# Check for any NaN values in y
if y.isnull().sum() > 0:
    raise ValueError("Input y contains NaN values.")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

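# Standardize all features, including the label-encoded categorical columns.
# Random Forest does not require scaling; the fitted scaler is saved below so
# that new inputs can be transformed the same way at prediction time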
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Save the scaler for later use
joblib.dump(scaler, '../artifacts/scaler.joblib')

# Define the reduced parameter grid for hyperparameter tuning
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'class_weight': [None, 'balanced']
}

# Initialize the Random Forest classifier
rf = RandomForestClassifier(random_state=42)

# Perform GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, 
                           cv=3, n_jobs=-1, verbose=2, scoring='f1_macro')
grid_search.fit(X_train, y_train)

# Best parameters from the grid search
best_params = grid_search.best_params_

# GridSearchCV refits the best configuration on the full training set by default
# (refit=True), so the tuned model can be taken directly from the search
best_rf = grid_search.best_estimator_

# Predict on the test set
y_pred = best_rf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Output the evaluation metrics
print("Best Parameters:", best_params)
print("Accuracy:", accuracy)
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

# Save the trained model to a file using joblib
model_path = '../artifacts/best_random_forest_orbital_parameters_model.joblib'
joblib.dump(best_rf, model_path)
print(f"Model saved to {model_path}")


Fitting 3 folds for each of 48 candidates, totalling 144 fits
Best Parameters: {'class_weight': 'balanced', 'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 200}
Accuracy: 0.9311700259262357
Confusion Matrix:
[[5378  285    0    5    8    7    1   27]
 [ 187 5351   14   57    1    0    2    0]
 [   0   41   30    0    0    0    0    0]
 [   8  129    0   58    0    0    0    0]
 [   8    0    0    0  198    0    0    0]
 [   3    4    0    0    2   24    0    2]
 [   9    7    0    0    2    1    5    0]
 [   8    2    0    0    3    0    0   90]]
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.94      0.95      5711
           1       0.92      0.95      0.94      5612
           2       0.68      0.42      0.52        71
           3       0.48      0.30      0.37       195
           4       0.93      0.96      0.94       206
           5       0.75      0.69      0.72        35
           