Final version of our pipeline. The code trains a Random Forest classifier on geospatial data. It first merges a ground-truth dataset with a larger feature dataset on Latitude and Longitude. The merged data is then preprocessed: features and target labels are separated, the features are standardized, and the data is split into training and test sets. A Random Forest model is trained on the training set and its accuracy is measured on the test set. Feature importances are computed and ranked to identify the most influential features. Next, univariate feature selection (ANOVA F-value via SelectKBest) picks the ten highest-scoring features, the model is retrained on them, and further features are then added to improve accuracy. Finally, GridSearchCV tunes the hyperparameters, and the tuned model is evaluated with accuracy, precision, recall, and F1-score, with a classification report giving the per-class breakdown.
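
The merge step joins on raw floating-point coordinates, so only rows whose Latitude and Longitude are exactly equal in both files will match. A minimal sketch of a more defensive join follows. The toy frames, the six-decimal rounding precision, and the helper name `merge_on_coords` are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

def merge_on_coords(all_df, gt_df, decimals=6):
    """Inner-join two frames on rounded Latitude/Longitude (hypothetical helper)."""
    keys = ["Latitude", "Longitude"]
    left = all_df.copy()
    right = gt_df.copy()
    # Round both sides so tiny float differences do not drop matches.
    left[keys] = left[keys].round(decimals)
    right[keys] = right[keys].round(decimals)
    # validate="one_to_one" raises MergeError if a coordinate pair is duplicated.
    return pd.merge(left, right, on=keys, how="inner", validate="one_to_one")

# Toy data (made up): the second point differs only by float noise.
all_df = pd.DataFrame({
    "Latitude": [50.1, 50.2 + 1e-12, 50.3],
    "Longitude": [-120.1, -120.2, -120.3],
    "NDVI": [0.31, 0.45, 0.52],
})
gt_df = pd.DataFrame({
    "Latitude": [50.1, 50.2],
    "Longitude": [-120.1, -120.2],
    "Labels": [0, 1],
})

merged = merge_on_coords(all_df, gt_df)
print(len(merged), "of", len(gt_df), "ground-truth rows matched")
```

Comparing the matched row count against the size of the ground-truth file is a quick way to notice silently dropped points.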

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

ground_truth_file = r'C:/Users/T00667334/Downloads/gt.csv'  # Replace with the path to your ground truth file
all_data_file = r'C:/Users/T00667334/Downloads/combined.csv'  # Replace with the path to your full feature table

# Load the data
ground_truth_df = pd.read_csv(ground_truth_file)
all_data_df = pd.read_csv(all_data_file)

# Merge data
merged_df = pd.merge(all_data_df, ground_truth_df, on=['Latitude', 'Longitude'], how='inner')

# Prepare features and target
X = merged_df.drop(columns=['Latitude', 'Longitude', 'Labels'])  # Drop non-feature columns
y = merged_df['Labels']

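# Split the data first so the test set stays unseen during scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the features: fit on the training set only, then apply to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)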

# Train the Random Forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions and calculate accuracy
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Accuracy: {accuracy * 100:.2f}%")

# Get feature importances and create DataFrame with feature names
feature_importances = pd.DataFrame(rf_model.feature_importances_,
                                   index=X.columns,  # Use the original column names
                                   columns=['importance']).sort_values('importance', ascending=False)

# Print the feature importances
print(feature_importances)


Random Forest Accuracy: 69.44%
                                 importance
Terrain_mrvbf                      0.070032
Terrain_tri                        0.067357
Terrain_openness_pos               0.061438
Terrain_aspect                     0.056668
Terrain_tpi                        0.053253
Terrain_slope                      0.050433
Terrain_hcurv                      0.049810
Terrain_vcurv                      0.048956
Terrain_mrrtf                      0.048575
Terrain_convergence                0.038960
Terrain_openness_neg               0.035214
Terrain_ls_factor                  0.035119
Terrain_tca                        0.032157
MS_MSAVI                           0.031763
Terrain_valley_depth               0.031604
MS_MSAVI2                          0.030105
MS_GNDVI                           0.028789
Terrain_dah                        0.027752
Terrain_twi                        0.026979
NDVI                               0.026188
Terrain_chnl_dist                  0.026075
T
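
The ranking above uses impurity-based importances, which are derived from the training data and can favour features with many distinct values. One common cross-check is permutation importance measured on the held-out set. The sketch below runs on synthetic data from `make_classification` as a stand-in for the merged table, so the feature names and numbers are not those of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged terrain/spectral table (assumption).
X, y = make_classification(n_samples=240, n_features=8, n_informative=3,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Shuffle each feature on the held-out set and measure the drop in accuracy.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

# Report features from most to least important, with spread over repeats.
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```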

In [2]:
from sklearn.feature_selection import SelectKBest, f_classif
import pandas as pd


data = merged_df

# Separate features (X) and target (y)
# Note: Latitude and Longitude stay in X here, so they are candidate features
X = data.drop('Labels', axis=1)  # Drop the target column from features
y = data['Labels']  # Assign the target column to y

# Select the top k best features based on f_classif (ANOVA F-value)
selector = SelectKBest(f_classif, k=10)  # You can change k based on your needs
X_new = selector.fit_transform(X, y)

# Get the selected feature names
selected_features = X.columns[selector.get_support()]

# Print the selected features
print("Selected features:", selected_features)


Selected features: Index(['Latitude', 'Terrain_convergence', 'Terrain_ls_factor', 'Terrain_mrrtf',
       'Terrain_mrvbf', 'Terrain_openness_neg', 'Terrain_openness_pos',
       'Terrain_slope', 'Terrain_tca', 'Terrain_tri'],
      dtype='object')
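
The selector above is fitted on every labelled row, including rows that later land in the test split, so the test labels influence which features are kept. A minimal sketch of one way to avoid that is to put `SelectKBest` inside a `Pipeline` and score it with cross-validation, so selection is refitted on each training fold. The data here is synthetic and the step names `select` and `rf` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 30 candidate features, 5 informative.
X, y = make_classification(n_samples=240, n_features=30, n_informative=5,
                           random_state=42)

# Selection and model are fitted together, fold by fold.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```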


In [3]:
# Train a new Random Forest model using only the selected features
X_train, X_test, y_train, y_test = train_test_split(X[selected_features], y, test_size=0.25, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Evaluate the model
y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 0.7166666666666667


In [4]:
# Additional features to append to the ANOVA-selected set
add_features = ['NDVI', 'MS_GNDVI', 'Longitude', 'Terrain_aspect', 'Terrain_twi', 'Terrain_tpi', 'MS_MSAVI', 'Terrain_hcurv', 'Terrain_vcurv', 'MS_MSAVI2']

In [5]:
final_features = list(selected_features) + add_features

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X[final_features], y, test_size=0.20, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Evaluate the model
y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 0.7916666666666666
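
The three accuracies so far were measured on different splits (`test_size` of 0.3, 0.25, and 0.20), so they are not directly comparable. A hedged sketch of a like-for-like comparison scores each feature set on the same stratified folds. The column names `f0` to `f5` and the two feature sets are placeholders on synthetic data, not the notebook's features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in table with placeholder column names (assumption).
X_arr, y = make_classification(n_samples=240, n_features=6, n_informative=3,
                               n_redundant=0, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

feature_sets = {
    "selected": ["f0", "f1", "f2"],
    "selected_plus_added": ["f0", "f1", "f2", "f3", "f4", "f5"],
}

# A fixed random_state makes the splitter yield identical folds on every call.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, cols in feature_sets.items():
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    results[name] = cross_val_score(rf, X[cols], y, cv=cv).mean()
    print(f"{name}: {results[name]:.3f}")
```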


In [13]:
!pip install xgboost

Defaulting to user installation because normal site-packages is not writeable


In [14]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report


# Calculate accuracy and support-weighted precision, recall, and F1 score.
# The labels here are binary (0/1); use average='binary' to score the positive class only.
# zero_division=0 returns 0 without a warning when a class is never predicted.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)

# Print the evaluation metrics
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

# Print detailed classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.79
Precision: 0.78
Recall: 0.79
F1 Score: 0.77

Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.94      0.87        35
           1       0.71      0.38      0.50        13

    accuracy                           0.79        48
   macro avg       0.76      0.66      0.68        48
weighted avg       0.78      0.79      0.77        48
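
The weighted averages hide how uneven the two classes are. The sketch below rebuilds the confusion matrix implied by the report (33 of 35 class-0 samples and 5 of 13 class-1 samples correct) and recomputes the metrics from it. The counts are inferred from the rounded report values and the supports; the notebook does not print them.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Counts inferred from the rounded report (not printed by the notebook).
# Rows are the true class, columns are the predicted class.
cm = np.array([[33, 2],
               [8, 5]])

# Expand the four cell counts back into label vectors so sklearn can score them.
y_true = np.repeat([0, 0, 1, 1], cm.ravel())
y_hat = np.repeat([0, 1, 0, 1], cm.ravel())
assert (confusion_matrix(y_true, y_hat) == cm).all()

acc = accuracy_score(y_true, y_hat)
prec_w = precision_score(y_true, y_hat, average="weighted")
rec_w = recall_score(y_true, y_hat, average="weighted")
f1_w = f1_score(y_true, y_hat, average="weighted")
rec_pos = recall_score(y_true, y_hat, pos_label=1)

print(f"Accuracy: {acc:.2f}")
print(f"Weighted precision / recall / F1: {prec_w:.2f} / {rec_w:.2f} / {f1_w:.2f}")
print(f"Recall for class 1 only: {rec_pos:.2f}")
```

Read this way, 8 of the 13 class-1 samples are predicted as class 0, which the weighted recall of 0.79 does not reveal on its own.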



In [20]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the parameter grid for Random Forest tuning
param_grid = {
    'n_estimators': [100, 200, 300, 500],      # Number of trees in the forest
    'max_depth': [None, 10, 20, 30],            # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],            # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],              # Minimum number of samples required at a leaf node
    'max_features': ['auto', 'sqrt', 'log2'],  # Features considered per split; 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
}

# Initialize Random Forest classifier
rf_model = RandomForestClassifier(random_state=42)

# Perform Grid Search with cross-validation
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Print the best parameters found by GridSearchCV
print(f"Best Hyperparameters: {grid_search.best_params_}")

# Use the best model found by GridSearchCV
best_rf_model = grid_search.best_estimator_

# Predict with the tuned model
y_pred = best_rf_model.predict(X_test)

# Print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Improved Accuracy with Tuned Random Forest: {accuracy:.2f}")


Fitting 5 folds for each of 432 candidates, totalling 2160 fits
Best Hyperparameters: {'max_depth': None, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}
Improved Accuracy with Tuned Random Forest: 0.81
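
The search above optimises plain accuracy, which on a 35-to-13 class split can reward predicting the majority class. A minimal sketch of an alternative setup uses stratified folds, macro-F1 as the selection metric, and `class_weight` as an extra grid entry. The data is synthetic, the grid is deliberately small, and `'auto'` is left out because recent scikit-learn releases no longer accept it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic, imbalanced stand-in data (assumption): roughly 73% / 27% classes.
X, y = make_classification(n_samples=240, n_features=10, weights=[0.73, 0.27],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Small illustrative grid; the values are not tuned recommendations.
param_grid = {
    "n_estimators": [50, 100],
    "min_samples_split": [2, 5],
    "max_features": ["sqrt", "log2"],
    "class_weight": [None, "balanced"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      cv=cv, scoring="f1_macro")
search.fit(X_train, y_train)

print(f"Best parameters: {search.best_params_}")
print(f"Cross-validated macro-F1: {search.best_score_:.3f}")
# score() uses the same scorer as the search, so this is macro-F1 as well.
print(f"Held-out macro-F1: {search.score(X_test, y_test):.3f}")
```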


In [21]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report


# Calculate accuracy and support-weighted precision, recall, and F1 score for the tuned model.
# The labels here are binary (0/1); use average='binary' to score the positive class only.
# zero_division=0 returns 0 without a warning when a class is never predicted.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)

# Print the evaluation metrics
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

# Print detailed classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.81
Precision: 0.80
Recall: 0.81
F1 Score: 0.80

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.94      0.88        35
           1       0.75      0.46      0.57        13

    accuracy                           0.81        48
   macro avg       0.79      0.70      0.73        48
weighted avg       0.80      0.81      0.80        48
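
On a 48-sample test set, the change from 0.79 to 0.81 accuracy corresponds to 38 versus 39 correct predictions. As a rough check on how much weight that single sample can carry, the sketch below computes an approximate 95% Wilson score interval for each accuracy. The helper `wilson_interval` is written here for illustration and is not part of the notebook.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (illustrative helper)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# 38/48 and 39/48 correct are implied by the reported accuracies
# on the 48-sample test set.
base_lo, base_hi = wilson_interval(38, 48)
tuned_lo, tuned_hi = wilson_interval(39, 48)

print(f"Before tuning: 38/48, interval [{base_lo:.2f}, {base_hi:.2f}]")
print(f"After tuning:  39/48, interval [{tuned_lo:.2f}, {tuned_hi:.2f}]")
```

The two intervals overlap almost entirely, so a larger test set or repeated cross-validation would be needed before attributing the gain to tuning.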

