# Ore Proximity Model

A consultant is using metal abundance changes to predict proximity to an orebody.  Samples are classified as proximal (A) or distal (B) based on Euclidean distance to a wireframe model of the orebody.

1. Can we use the same geochemical data and labels to build a predictive model that labels samples from future drill holes as class A or class B?
2. More data have been acquired since the geochemist completed her work. Can we predict labels for these data points (labelled “?”)?


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, RandomizedSearchCV 
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler, LabelEncoder 
import seaborn as sns
from scipy.stats import randint

## Step 1: Data - QA/QC

In [None]:
data = pd.read_csv('data/data_for_distribution.csv')
data.head()

In [None]:
data.info()

In [None]:
# Check how many NaN values are in each column
nan_values = data.isna().sum()

# Print the number of NaN values per column
print("Number of NaN values per column:")
print(nan_values)

Looking at the data, I saw some QA/QC issues:
- Au data type is object
- missing values
- values below the detection limit (<0.005)
- unsuitable values (-999)
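
The fixes for these issues can be tried out on a small toy table first. The sketch below is only an illustration: the helper name `clean_assay_column` and the toy values are hypothetical, and it assumes that any value written as `<x` should become half of `x` and that -999 always means "no data".

```python
import numpy as np
import pandas as pd

def clean_assay_column(series, sentinel=-999):
    """Convert one assay column to float (hypothetical helper, for illustration)."""
    text = series.astype(str).str.strip()
    # Flag values reported as below the detection limit, e.g. '<0.005'
    below = text.str.startswith('<')
    # Strip the '<' and parse; anything unparseable (e.g. 'None') becomes NaN
    numeric = pd.to_numeric(text.str.lstrip('<'), errors='coerce')
    # Below-detection values are set to half of the stated limit
    values = numeric.where(~below, numeric / 2)
    # Sentinel codes such as -999 are treated as missing
    return values.mask(values == sentinel)

# Toy data only: 'Au' arrives as text, 'Pb' as numbers
toy = pd.DataFrame({'Au': ['0.12', '<0.005', '-999', None],
                    'Pb': [10.0, -999, 3.5, 2.0]})
cleaned = toy.apply(clean_assay_column)
print(cleaned)
```

Applying the helper column by column keeps the text and numeric cases on one code path, so the sentinel is caught whether it was read as `'-999'` or as `-999`.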

In [None]:
elements = ['As', 'Au', 'Pb', 'Fe', 'Mo', 'Cu', 'S', 'Zn']

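# Replace '<0.005' with half of the detection limit
data[elements] = data[elements].replace('<0.005', 0.005 / 2)

# Convert every element column to float64 ('Au' is read as object)
data[elements] = data[elements].apply(pd.to_numeric, errors='coerce')

# Replace the -999 sentinel with NaN now that all columns are numeric
data[elements] = data[elements].replace(-999, np.nan)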

# Drop NaN values
data_cleaned = data.dropna(subset=elements)

data_cleaned.to_csv("./data/cleaned_data.csv", index=False)
data_cleaned.head()


In [None]:
# Some checks:

# Check the data type of the 'Au' column in the cleaned data
print(data_cleaned['Au'].dtype)

nan_check = data_cleaned.isna().sum()

# Print NaN values if any exist
if nan_check.any():
    print("NaN values in cleaned data:")
    print(nan_check[nan_check > 0], "\n")

for element in elements:
    # Check for values equal to '-999'
    if (data_cleaned[element] == -999).any():
        print(f"Warning: '{element}' contains '-999' values.")
    
    # Check for values equal to '<0.005'
    if (data_cleaned[element] == '<0.005').any():
        print(f"Warning: '{element}' contains '<0.005' values.")

Other data checks are possible, but I am assuming these are the only issues and moving on.
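
As an illustration of what such extra checks might look like, here is a minimal sketch that counts duplicate sample IDs, negative concentrations and missing values. The function name `qaqc_report` and the toy table are hypothetical; `Unique_ID` is the ID column used later in this notebook.

```python
import pandas as pd

def qaqc_report(df, id_col, value_cols):
    """Return simple counts of common data problems (illustrative only)."""
    return {
        # Rows whose ID repeats an earlier row
        'duplicate_ids': int(df[id_col].duplicated().sum()),
        # Concentrations cannot be negative, so count any that are
        'negative_values': {c: int((df[c] < 0).sum()) for c in value_cols},
        # Remaining gaps per column
        'missing_values': {c: int(df[c].isna().sum()) for c in value_cols},
    }

# Toy data only, not taken from the real file
toy = pd.DataFrame({'Unique_ID': [1, 2, 2, 4],
                    'Pb': [10.0, -1.0, 3.5, None],
                    'Zn': [5.0, 6.0, 7.0, 8.0]})
report = qaqc_report(toy, 'Unique_ID', ['Pb', 'Zn'])
print(report)
```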

## Step 2: Modelling
### Aim 1: Generate a predictive model

In [None]:
# Check the distribution of classes
class_counts = data_cleaned['Class'].value_counts()
print(class_counts)

In [None]:
# Mostly from https://www.datacamp.com/tutorial/random-forests-classifier-python

# Separate labeled and unlabeled data
labeled_data = data_cleaned[data_cleaned['Class'].isin(['A', 'B'])]
unlabeled_data = data_cleaned[data_cleaned['Class'] == '?']

# Features and target
X = labeled_data[elements]
y = labeled_data['Class']

# Encode labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)

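# Train-test split (stratified so both sets keep the same A/B ratio)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded)

# Standardize features, fitting the scaler on the training set only to avoid leakage
# (tree-based models do not need scaling, but it is harmless here)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)
X_scaled = scaler.transform(X)  # all labelled samples, used by the optional plot below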


# Hyperparameter tuning
param_dist = {'n_estimators': randint(50,500),
              'max_depth': randint(1,20)}

# Fix the seeds so the search and the fitted forest are reproducible
rf = RandomForestClassifier(random_state=42)

rand_search = RandomizedSearchCV(rf,
                                 param_distributions=param_dist,
                                 n_iter=5,
                                 cv=5,
                                 random_state=42)
rand_search.fit(X_train, y_train)

best_rf = rand_search.best_estimator_

# Print the best hyperparameters
print('Best hyperparameters:',  rand_search.best_params_)



In [None]:
# Evaluate model
y_pred = best_rf.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred))

In [None]:
print("Confusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm).plot();


In [None]:
# #optional extra
# # Checking to see how the data was scaled

# # Visualize histograms before scaling for 'Fe' (example)
# plt.figure(figsize=(12, 6))
# plt.subplot(1, 2, 1)
# plt.hist(X['Fe'], bins=50, color='blue', alpha=0.7)
# plt.title('Before Scaling: Fe')

# # Visualize histograms after scaling for 'Fe'
# plt.subplot(1, 2, 2)
# plt.hist(X_scaled[:, X.columns.get_loc('Fe')], bins=50, color='green', alpha=0.7)
# plt.title('After Scaling: Fe')


In [None]:
# #optional extra
# Feature Importance
# feature_importances = best_rf.feature_importances_
# feature_importance_df = pd.DataFrame({'Element': elements, 'Importance': feature_importances})
# feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# plt.figure(figsize=(10, 6))
# sns.barplot(x='Importance', y='Element', data=feature_importance_df, color='steelblue')
# plt.title('Feature Importance for Predicting Class A vs Class B')
# plt.tight_layout()
# plt.show()

### Aim 2: Predict labels 

In [None]:
# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
unlabeled_data = unlabeled_data.copy()

# Predict unlabeled data
X_new = unlabeled_data[elements]
X_new_scaled = scaler.transform(X_new)
unlabeled_data['Predicted_Class'] = le.inverse_transform(best_rf.predict(X_new_scaled))

print("Predictions on Unlabeled Data:")
print(unlabeled_data[['Unique_ID', 'holeid', 'Predicted_Class']].head(70))
unlabeled_data.to_csv('./data/predictions_on_unlabeled_data.csv', index=False)


Results are saved to `./data/predictions_on_unlabeled_data.csv`.

In [None]:
# Filter the rows where the original 'Class' column is '?'
unlabeled_data_question = unlabeled_data[unlabeled_data['Class'] == '?']

# Count how many were predicted to be 'A' and 'B'
class_counts_from_question = unlabeled_data_question['Predicted_Class'].value_counts()

# Print out the prediction counts for 'A' and 'B' from '?'
print("Predictions for '?' class:")
print(class_counts_from_question)

### Observations

- The model performs well overall, with high accuracy and strong performance for Class A.
- Class B predictions are less accurate.
- The imbalance in the dataset (more Class A samples) is probably why the model performs better for Class A.
- Pb has the highest feature importance, which makes sense as this is a lead deposit.
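
One way to probe the class imbalance point is to compare minority-class recall and balanced accuracy with and without class weighting. The sketch below is a hedged illustration on synthetic data from `make_classification`, not on the assay data: the 80/20 split and all parameter values are assumptions, and `class_weight='balanced'` is one possible remedy rather than something used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 8 features, roughly 80% class 0 and 20% class 1
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           weights=[0.8, 0.2], random_state=42)

# Stratify so the minority class is represented in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

results = {}
for label, weight in [('unweighted', None), ('balanced', 'balanced')]:
    model = RandomForestClassifier(n_estimators=100, class_weight=weight,
                                   random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[label] = {
        # Recall on the minority class (label 1 in this synthetic data)
        'minority_recall': recall_score(y_test, y_pred, pos_label=1),
        # Mean of per-class recall, less flattering than plain accuracy
        'balanced_accuracy': balanced_accuracy_score(y_test, y_pred),
    }
print(results)
```

Whether weighting helps depends on the data, so the two rows should be compared rather than assumed to differ in a particular direction.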

### Improvements
- Handle missing values more effectively (e.g., imputation instead of dropping rows).
- Perform additional model validation (e.g., cross-validation, ROC curves).
- Experiment with other models (e.g., SVM, Gradient Boosting) for comparison.
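
The three improvements above can be combined in one small experiment. The following is a minimal sketch on synthetic data, assuming median imputation, 5-fold stratified cross-validation and ROC AUC as the score; none of these choices come from the notebook, and the missing values are injected artificially.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the assay table, with about 5% of values knocked out
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'gradient_boosting': GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # The imputer sits inside the pipeline, so it is refit on each training fold
    pipe = make_pipeline(SimpleImputer(strategy='median'), model)
    scores[name] = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
    print(name, scores[name].mean().round(3))
```

Keeping the imputer inside the pipeline means each fold's medians come only from that fold's training rows, which avoids leaking information from the held-out rows.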