# MOSAIKS Demonstration: Identifying Artisanal and Small-scale Mining in Sierra Leone

This notebook demonstrates the use of machine learning and satellite imagery to identify potential artisanal and small-scale mining (ASM) sites in Sierra Leone.




## Methodology

The process involves the following steps:

1. **Data Acquisition:**  Label data on known ASM sites is acquired. High-resolution satellite imagery is used to extract features for analysis.
2. **Data Preprocessing:** The label and feature data are preprocessed and cleaned. This involves filtering for high-confidence labels, removing irrelevant mine types, and ensuring sufficient inspection coverage. The features are scaled and formatted for model training.
3. **Data Joining:** The label and feature datasets are joined based on spatial intersection, creating a dataset linking site locations to satellite-derived features.
4. **Model Training and Evaluation:** A machine learning model, specifically Ridge Regression with isotonic calibration, is trained on the joined dataset to predict the likelihood of ASM presence. The model is evaluated using metrics like the Receiver Operating Characteristic Area Under the Curve (ROC AUC).
5. **Visualization and Interpretation:** The model results are visualized using maps and plots. Important features identified by the model are analyzed to understand the patterns associated with ASM sites.


## Setup

This section outlines the necessary steps to set up the environment for running the analysis. It involves:

1. **Loading Libraries:** Essential Python libraries for data manipulation, visualization, and machine learning are loaded.
2. **Mounting Google Drive:** Your Google Drive is mounted to access project data stored on the cloud.
3. **Defining Drive Directory:** The specific path to the project folder on your Google Drive is set.
4. **Creating Local Directory:** A temporary local directory within Colab is created to accelerate data processing. Data is copied to this directory to improve performance.



### Libraries

This notebook utilizes several key Python libraries:

* **Data Handling and Analysis:**  `pandas` and `numpy` provide fundamental data structures and functions for manipulating and analyzing data. `geopandas` extends these capabilities to work with geospatial data.
* **Visualization:** `matplotlib.pyplot` and `seaborn` create static plots for data exploration and result presentation. Interactive maps come from the `geopandas` `.explore()` method, which relies on `mapclassify` (installed below).
* **Machine Learning:** `sklearn` offers a comprehensive suite of tools for building and evaluating machine learning models, including algorithms like Ridge Regression and methods for data preprocessing.

In [None]:
!pip install mapclassify

In [50]:
import os
import shutil
import numpy as np
import pandas as pd
import seaborn as sns
import geopandas as gpd
import matplotlib.pyplot as plt

from google.colab import drive
from sklearn.linear_model import RidgeClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split, GridSearchCV

In [51]:
import warnings
from scipy.linalg import LinAlgWarning

warnings.filterwarnings("ignore", category=UserWarning, module="sklearn.linear_model._ridge")
warnings.filterwarnings("ignore", category=LinAlgWarning, module="sklearn.linear_model._ridge")

### Mounting Google Drive

Mounting Google Drive in Colab essentially connects your Google Drive storage to the Colab environment. This allows you to access files and folders stored in your Drive directly within your Colab notebook. Think of it as creating a shortcut to your Drive within Colab.

**Why is this necessary?**

Colab notebooks run on temporary virtual machines. Mounting your Drive ensures that you can load and save data to your personal storage, persisting even after the Colab session ends. It also enables seamless access to larger datasets stored in your Drive, which would be impractical to upload directly to Colab.

After mounting Google Drive, we then define the path to the project folder on Google Drive. This directory contains the data files needed for the analysis.

In [None]:
drive.mount('/content/drive')

In [None]:
drive_directory = os.path.join(
    "/",
    "content",
    "drive",
    "MyDrive",
    "November 2024 Conference",
)
drive_directory

### Creating a Local Directory

This section creates a local directory within the Colab environment to store the project data. Data files are copied from the Google Drive directory to this local folder.

**Why use a local directory?**  

Accessing data locally on the Colab virtual machine (VM) significantly improves processing speed compared to reading directly from Google Drive. While copying the data initially takes a bit of time, this is outweighed by the performance gains during analysis.

In [None]:
local_dir = "/content/data/"

os.makedirs(local_dir, exist_ok=True)

files_to_copy = os.path.join(drive_directory, "data")

shutil.copytree(files_to_copy, local_dir, dirs_exist_ok=True)

Here we take a look at the contents of the local drive to ensure we have the necessary files copied.

In [None]:
!ls -lh /content/data

## Load in the ASM label data

This section focuses on loading and preparing the artisanal and small-scale mining (ASM) label data for analysis. The data is read from a CSV file (`SLE_mines.csv`) located in the project directory.  Several preprocessing steps are then applied to ensure data quality and relevance:

1. **Initial Loading and Column Selection:** The label data is loaded into a pandas DataFrame. Specific columns relevant to the analysis, such as location coordinates, mine type, confidence level, and inspection coverage, are selected.
2. **Data Filtering:**
   * Sites with low inspection coverage (less than 20%) are excluded, unless they are labeled as positive for ASM activity. This ensures that only sites with sufficient inspection data are considered.
   * Sites classified as "commercial" mines are removed, as the focus is on artisanal and small-scale operations.
   * Sites with low confidence levels (below 3) are filtered out to retain high-quality labels.
3. **Data Restructuring:** The DataFrame is reset to ensure a contiguous index after filtering.
4. **GeoDataFrame Creation:**  The DataFrame is converted into a GeoDataFrame using `geopandas`. This allows for spatial operations and visualizations by associating each site with a geographical point based on its longitude and latitude. This GeoDataFrame is used in subsequent steps for spatial analysis and joining with satellite features.
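
As a minimal illustration of how the three filters combine, the sketch below applies them to a small made-up table. The rows and `unique_id` values are invented; only the column names come from the notebook. The notebook applies the filters one cell at a time, which gives the same result as combining the masks with `&`.

```python
import pandas as pd

# Invented rows for illustration; only the column names match the notebook.
toy = pd.DataFrame(
    {
        "unique_id": ["a", "b", "c", "d", "e"],
        "label": [1, 0, 0, 1, 0],
        "mine_type": ["artisanal", None, None, "commercial", None],
        "confidence": [4, 5, 4, 5, 2],
        "proportion_inspected": [0.10, 0.05, 0.90, 1.00, 0.80],
    }
)

# Keep well-inspected sites, plus any positive site regardless of coverage.
keep_coverage = (toy["proportion_inspected"] >= 0.2) | (toy["label"] == 1)
# Drop commercial mines.
keep_type = toy["mine_type"] != "commercial"
# Keep labels with confidence of 3 or higher.
keep_confidence = toy["confidence"] >= 3

filtered = toy.loc[keep_coverage & keep_type & keep_confidence].reset_index(drop=True)
print(filtered["unique_id"].tolist())
```

Row `a` survives despite its low coverage because it is a positive label, while `b` (low coverage, negative), `d` (commercial), and `e` (low confidence) are dropped.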

In [None]:
labels = pd.read_csv(os.path.join(local_dir, "SLE_mines.csv"))
labels = labels[
    [
        "lon",
        "lat",
        "unique_id",
        "country",
        "sample_type",
        "mine_type",
        "confidence",
        "label",
        "proportion_inspected"
    ]
]
labels

In [None]:
labels = labels.loc[(labels["proportion_inspected"] >= 0.2) | (labels["label"] == 1)]
labels.shape

In [None]:
labels = labels.loc[labels["mine_type"] != "commercial"]
labels.shape

In [None]:
labels = labels.loc[labels["confidence"] >= 3]
labels.shape

In [None]:
labels = labels.reset_index(drop=True)
labels

In [None]:
labels.groupby('label').size().plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right',]].set_visible(False)

### Transforming Labels into a GeoDataFrame

In this step, we convert the pandas DataFrame containing the ASM label data into a GeoDataFrame using the `geopandas` library. This transformation is crucial for enabling spatial analysis and visualization.

**How it's done:**

1. **Create GeoDataFrame:** We use the `geopandas.GeoDataFrame()` function to create a GeoDataFrame from the existing pandas DataFrame (`labels`).
2. **Define Geometry:** Within the `GeoDataFrame()` function, we specify the `geometry` argument. This argument defines the spatial representation of each data point. We utilize `geopandas.points_from_xy()` to create point geometries from the longitude (`lon`) and latitude (`lat`) columns of the DataFrame.
3. **Set Coordinate Reference System (CRS):** We also set the `crs` argument to "EPSG:4326". This defines the coordinate reference system used for the data, ensuring that the spatial information is interpreted correctly. EPSG:4326 is the WGS 84 system, with coordinates given as longitude and latitude in degrees.

**Why it's necessary:**

* **Spatial Operations:** GeoDataFrames allow us to perform spatial operations, such as determining which labels intersect with satellite imagery features. This is essential for linking the ASM site locations to the extracted features.
* **Visualization:** GeoDataFrames can be easily plotted on maps, enabling visualization of the spatial distribution of ASM sites. This aids in understanding the geographic patterns of ASM activity.
* **Integration with Geospatial Tools:** GeoDataFrames are compatible with other geospatial tools and libraries, providing flexibility for further analysis and integration with GIS workflows.

By transforming the labels into a GeoDataFrame, we empower our analysis with spatial capabilities, paving the way for more insightful explorations of ASM activity in Sierra Leone.

In [None]:
labels_gdf = gpd.GeoDataFrame(
    labels,
    geometry=gpd.points_from_xy(labels.lon, labels.lat),
    crs="EPSG:4326"
)
labels_gdf

In [None]:
labels_gdf.explore(
    column = "mine_type",
    cmap = "Dark2",
    tiles = "CartoDB positron",
    categorical=True
)

## Loading and Processing Satellite Features

This section focuses on loading and preparing the satellite imagery features for analysis. These features, derived from high-resolution satellite imagery, are crucial for identifying potential ASM sites.

**Loading the Features:**

* The features are read from a Feather file (`SLE_features.feather`) stored in the project directory. Feather is a fast and efficient format for storing dataframes, making it ideal for handling large datasets like this.
* The features are loaded into a pandas DataFrame. Each row represents a specific location, identified by an `image_id`, and the columns contain the extracted random convolutional features.


In [None]:
features = pd.read_feather(os.path.join(local_dir, "SLE_features.feather"))
features.head()


**Extracting Latitude and Longitude:**

* The `image_id` contains latitude and longitude information encoded in its structure.
* To use these for spatial analysis, we extract the latitude and longitude from the `image_id`.
* The extracted values are stored as separate 'lat' and 'lon' columns in the DataFrame.
* Features downloaded from the API already include latitude and longitude columns, so this step is not needed for them.
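
The parsing can be checked on a single identifier. The example string below is hypothetical: its pattern (`lat_` and `lon_` prefixes, `__` between the two parts, `--` standing in for the decimal point) is inferred from the string operations used in this notebook, not from documentation of the file.

```python
import pandas as pd

# Hypothetical identifier following the pattern implied by the parsing code.
ids = pd.Series(["lat_8--505__lon_-11--495"], name="image_id")

# Split into the latitude part and the longitude part.
parts = ids.str.split("__", expand=True)

# Strip the prefix, restore the decimal point, and convert to float.
lat = (
    parts[0]
    .str.replace("lat_", "", regex=False)
    .str.replace("--", ".", regex=False)
    .astype(float)
)
lon = (
    parts[1]
    .str.replace("lon_", "", regex=False)
    .str.replace("--", ".", regex=False)
    .astype(float)
)

print(lat.iloc[0], lon.iloc[0])
```

Note that the leading minus sign of a negative longitude is a single `-`, so it is left untouched when `--` is replaced.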


In [65]:
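# Split the image_id into its latitude and longitude parts
features[['lat_part', 'lon_part']] = features['image_id'].str.split('__', expand=True)
# Strip the prefix and turn the '--' placeholder back into a decimal point
features['lat'] = features['lat_part'].str.replace('lat_', '', regex=False).str.replace('--', '.', regex=False).astype(float)
features['lon'] = features['lon_part'].str.replace('lon_', '', regex=False).str.replace('--', '.', regex=False).astype(float)
features = features.drop(['lat_part', 'lon_part'], axis=1)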


**Creating a GeoDataFrame:**

* The pandas DataFrame is then transformed into a GeoDataFrame using the `geopandas` library. This step adds spatial capabilities to the data.
* We create point geometries from the 'lat' and 'lon' columns, representing each feature's location on a map.
* The coordinate reference system (CRS) is set to "EPSG:4326", ensuring proper interpretation of the spatial information.
* Finally, a buffer of 0.005 degrees is applied to each point geometry. By default a buffer around a point is a circle, but passing `cap_style=3` (square caps) produces a square instead, here 0.01 degrees on a side and centered on the point. This turns each point into a polygon covering the same physical space as the imagery the features were computed from.
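
For a sense of scale, the arithmetic below converts the 0.01-degree cell (twice the 0.005 buffer) into approximate ground distances. It assumes a spherical Earth with a radius of 6371 km and a representative latitude of 8.5°N for Sierra Leone; both are simplifications chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation (assumption)
BUFFER_DEG = 0.005        # half-width of each cell, as used in the notebook
LATITUDE_DEG = 8.5        # assumed representative latitude for Sierra Leone

# Length of one degree of latitude on the sphere.
km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360.0
cell_deg = 2 * BUFFER_DEG

# North-south extent does not depend on latitude; east-west shrinks with cos(lat).
height_km = cell_deg * km_per_degree
width_km = cell_deg * km_per_degree * math.cos(math.radians(LATITUDE_DEG))

print(f"cell is about {width_km:.2f} km wide and {height_km:.2f} km tall")
```

Under these assumptions each cell is roughly 1.1 km on a side, and because Sierra Leone is close to the equator the east-west shrinkage is small.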


In [None]:
features = gpd.GeoDataFrame(
    features,
    geometry=gpd.points_from_xy(features.lon, features.lat),
    crs="EPSG:4326"
)
features.geometry = features.geometry.buffer(0.005, cap_style=3)
features.head(3)

## Plot example features

This completes the loading and preparation of the satellite features. The resulting GeoDataFrame, containing both feature values and geographic locations, is now ready to be joined with the ASM labels to link the two datasets based on spatial proximity. This integration will enable the machine learning model to associate satellite-derived characteristics with ASM presence or absence.

**Plot**  
- The plot here shows a single layer of random convolutional features
- `feature_0` is chosen arbitrarily
- We are unable to use the `.explore()` method due to constraints with system memory on the free tier of Google Colab   

In [None]:
features.plot(column = "feature_0", legend = True)

## Joining Labels to Features

This section describes the process of joining the ASM label data with the satellite imagery features. This step is crucial for creating a dataset that links the location of potential ASM sites with their corresponding satellite-derived characteristics.

**1. Spatial Join:**

* We use the `geopandas.sjoin()` function to perform a spatial join between the labels GeoDataFrame (`labels_gdf`) and the features GeoDataFrame (`features`).
* The `how="inner"` argument ensures that only sites with both label and feature information are included in the resulting joined dataset.
* The `predicate="intersects"` argument specifies that the join should be based on spatial intersection. This means that a label will be joined to a feature if their geometries overlap.

**2. Resulting Dataset:**

* The spatial join creates a new GeoDataFrame called `joined`.
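* This GeoDataFrame contains the columns from both the labels and features datasets, along with a new `index_right` column holding the index of the matching feature row for each label. Because both inputs have `lat` and `lon` columns, `sjoin` renames them with `_left` (labels) and `_right` (features) suffixes.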
* This joined dataset is now ready for use in the subsequent model training and evaluation steps. The combined information allows the model to learn patterns between satellite imagery features and the presence of ASM activity.
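
The behavior of the join can be seen on a toy case. The two points and the single square cell below are invented for illustration; the call to `gpd.sjoin` uses the same arguments as the notebook.

```python
import geopandas as gpd
from shapely.geometry import box

# Two invented label points: one inside the cell, one far away from it.
points = gpd.GeoDataFrame(
    {
        "unique_id": ["inside", "outside"],
        "lat": [8.500, 9.000],
        "lon": [-11.500, -12.000],
    },
    geometry=gpd.points_from_xy([-11.500, -12.000], [8.500, 9.000]),
    crs="EPSG:4326",
)

# One invented square cell, 0.01 degrees on a side.
cells = gpd.GeoDataFrame(
    {"feature_0": [0.42], "lat": [8.5], "lon": [-11.5]},
    geometry=[box(-11.505, 8.495, -11.495, 8.505)],
    crs="EPSG:4326",
)

# Inner join: only points that intersect a cell are kept.
toy_joined = gpd.sjoin(points, cells, how="inner", predicate="intersects")
print(toy_joined["unique_id"].tolist())
print(sorted(toy_joined.columns))
```

Only the `inside` point is kept, and the shared `lat` and `lon` column names come back with `_left` and `_right` suffixes next to `index_right`.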

In [None]:
joined = gpd.sjoin(labels_gdf, features, how="inner", predicate="intersects")
joined.head()

## Model Training and Evaluation

This section outlines the process of training a machine learning model to predict the likelihood of ASM presence based on the satellite imagery features, and evaluating its performance.

**1. Data Preparation:**

* The joined dataset is split into features (X) and the target variable (y).
    * `X` contains the satellite imagery features, specifically the columns with names starting with 'feature_'.
    * `y` contains the ASM label (1 for ASM presence, 0 for absence).
* The data is further split into training and testing sets using `train_test_split` to evaluate the model's performance on unseen data.
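
The split in this notebook is not stratified. If positive labels turn out to be rare, one optional variation (not what the notebook does) is to pass `stratify=y` so that both sets keep the same class balance. The sketch below shows the effect on synthetic labels; the array names and the 10% positive rate are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 10 positives out of 100 rows.
y_toy = np.array([1] * 10 + [0] * 90)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the positive rate the same in both splits.
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1991, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```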


In [69]:
feature_cols = [f"feature_{i}" for i in range(4000)]

In [70]:
X = joined[feature_cols].values
y = joined["label"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1991
)

**2. Model Selection and Training:**

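* A ridge classifier (`RidgeClassifier`) followed by isotonic calibration is used for prediction. Ridge is chosen because its L2 penalty keeps the fit stable when there are thousands of features (4,000 here). It is not especially robust to outliers, since it still minimizes a squared-error loss.
    * **Ridge Regression** is a linear model that adds a penalty on the squared size of the coefficients to the loss function, which reduces overfitting.
    * **Isotonic Calibration** fits a non-decreasing step function that maps the classifier's raw decision scores to values between 0 and 1, so that they can be read as probabilities.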
* Hyperparameter tuning is performed using `GridSearchCV` to find the best value for the regularization parameter (`alpha`) of the Ridge Regression model. This helps to optimize the model's performance.
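
For reference, the ridge penalty can be written out explicitly. The symbols below are introduced here for illustration and do not appear elsewhere in the notebook: $X$ is the feature matrix, $y$ the target vector, $w$ the coefficient vector, and $\alpha$ the regularization strength. The intercept is omitted for brevity. For `RidgeClassifier`, scikit-learn encodes the two classes as $-1$ and $+1$ and fits this regression to those targets.

$$
\hat{w} = \arg\min_{w} \; \lVert Xw - y \rVert_2^2 + \alpha \, \lVert w \rVert_2^2
$$

Larger values of $\alpha$ shrink the coefficients more strongly, and this is the quantity that the grid search tunes.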


In [71]:
alphas = [0.01, 0.1, 1.0, 10.0]
ridge = RidgeClassifier()
param_grid = {'alpha': alphas}
grid_search = GridSearchCV(
    ridge,
    param_grid,
    cv=5,
    scoring='roc_auc',
    verbose=2,
    n_jobs=-1
)


In [None]:
grid_search.fit(X_train, y_train)

**3. Model Evaluation:**

* The trained classifier is evaluated on the testing set, which was not used to fit it or to select `alpha`. The isotonic calibrator, by contrast, is fitted on the test-set scores.
* The primary evaluation metric is the Receiver Operating Characteristic Area Under the Curve (ROC AUC). This metric measures the model's ability to distinguish between positive (ASM presence) and negative (ASM absence) cases.
* The ROC curve is visualized to provide a graphical representation of the model's performance across different thresholds.
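
One way to read ROC AUC is as the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative site. The sketch below checks this on four made-up scores by counting pairs directly and comparing against `roc_auc_score`; the data are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score; ties count as half.
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc, roc_auc_score(y_true, scores))
```

Three of the four positive-negative pairs are ordered correctly, so both calculations give 0.75.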


In [None]:
# Get uncalibrated predictions (decision function)
y_pred_uncal = grid_search.decision_function(X_test)

# Fit isotonic calibration
iso_reg = IsotonicRegression(out_of_bounds='clip')
iso_reg.fit(y_pred_uncal, y_test)

# Get calibrated predictions
y_pred_cal = iso_reg.predict(y_pred_uncal)

# Calculate AUC scores
auc_uncal = roc_auc_score(y_test, y_pred_uncal)
auc_cal = roc_auc_score(y_test, y_pred_cal)

print(f"Best alpha: {grid_search.best_params_['alpha']}")
print(f"AUC before calibration: {auc_uncal:.3f}")
print(f"AUC after calibration: {auc_cal:.3f}")
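
A calibrator can also be fitted on rows kept apart from both training and testing, so that the calibrated scores are evaluated on data the calibrator has not seen. The sketch below shows this on synthetic data, not on the notebook's data; the split sizes, `alpha=1.0`, and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the joined features and labels.
X_syn, y_syn = make_classification(n_samples=600, n_features=20, random_state=0)

# Three-way split: fit the classifier, fit the calibrator, then evaluate.
X_fit, X_rest, y_fit, y_rest = train_test_split(
    X_syn, y_syn, test_size=0.5, random_state=0
)
X_cal, X_eval, y_cal, y_eval = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

clf = RidgeClassifier(alpha=1.0).fit(X_fit, y_fit)

# The calibrator only sees the calibration split.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(clf.decision_function(X_cal), y_cal)

# Calibrated scores on rows seen by neither the classifier nor the calibrator.
prob_eval = calibrator.predict(clf.decision_function(X_eval))
print(f"held-out AUC after calibration: {roc_auc_score(y_eval, prob_eval):.3f}")
```

The trade-off is that a separate calibration split leaves fewer rows for training, which matters when labels are scarce.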

In [None]:
# Calculate ROC curves
fpr_uncal, tpr_uncal, _ = roc_curve(y_test, y_pred_uncal)
fpr_cal, tpr_cal, _ = roc_curve(y_test, y_pred_cal)

# Create the plot
plt.figure(figsize=(5, 5))

# Plot both curves
plt.plot(fpr_uncal, tpr_uncal, 'b-', label=f'Uncalibrated (AUC = {auc_uncal:.3f})')
plt.plot(fpr_cal, tpr_cal, 'r-', label=f'Calibrated (AUC = {auc_cal:.3f})')

# Plot the diagonal reference line
plt.plot([0, 1], [0, 1], 'k--', label='Random')

# Customize the plot
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves - Before and After Calibration')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)

# Set the plot limits
plt.xlim([-0.01, 1.01])
plt.ylim([-0.01, 1.01])

plt.tight_layout()
plt.show()

## Predicting ASM Probability Across Sierra Leone

This section extends the analysis by applying the trained machine learning model to the entire feature set, generating predictions for the likelihood of ASM presence across Sierra Leone.

**Prediction Process:**

1. **Applying the Model:** The trained Ridge Regression model, along with the Isotonic Calibration, is used to predict the probability of ASM presence for each location in the feature dataset. This involves applying the model to the satellite imagery features associated with each location.

2. **Storing Predictions:** The predicted probabilities are stored in a new column named 'predicted_probability' within the features GeoDataFrame. This column now represents the model's estimated likelihood of ASM activity at each location.


In [75]:
y_pred_full = grid_search.decision_function(features[feature_cols].values)
y_pred_full_calibrated = iso_reg.predict(y_pred_full)

features['predicted_probability'] = y_pred_full_calibrated
# features

**Visualization:**

- **Probability Map:** A map is generated using the `plot()` method to visualize the predicted probabilities spatially across Sierra Leone. With the `viridis` colormap used here, low probabilities appear dark purple and high probabilities appear yellow, passing through blue and green in between.
- This map provides a comprehensive view of potential ASM hotspots within the region based on the model's predictions.

**Insights:**

The generated probability map serves as a valuable tool for identifying and prioritizing areas for further investigation or potential interventions related to ASM. By visually highlighting areas of higher ASM likelihood, this analysis supports decision-making processes aimed at mitigating the negative impacts or harnessing the economic potential of ASM in Sierra Leone.

In [None]:
# plt.figure(figsize=(10, 8))
features.plot(
    column="predicted_probability",
    cmap="viridis",
    legend=True,
    legend_kwds={'label': "Predicted ASM Probability"},
    figsize=(10, 10)
)
plt.title("Predicted ASM Probabilities Across Sierra Leone")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()
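
To turn the map into a shortlist of places to inspect, one possible approach is to rank cells that do not yet have a label by their predicted probability. The sketch below uses an invented table; the `labeled` column is hypothetical and does not exist in the notebook's `features` frame, where it would have to be derived first (for example from the `index_right` values in `joined`).

```python
import pandas as pd

# Invented predictions; `labeled` marks cells that already have a label.
cells = pd.DataFrame(
    {
        "image_id": ["c1", "c2", "c3", "c4", "c5"],
        "predicted_probability": [0.91, 0.15, 0.78, 0.66, 0.05],
        "labeled": [True, False, False, False, True],
    }
)

# Keep unlabeled cells and take the two with the highest predicted probability.
shortlist = cells.loc[~cells["labeled"]].nlargest(2, "predicted_probability")
print(shortlist["image_id"].tolist())
```

Cell `c1` has the highest score but is skipped because it is already labeled, so the shortlist is `c3` and `c4`.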

## Conclusion

This notebook demonstrated the use of machine learning and satellite imagery to identify potential artisanal and small-scale mining (ASM) sites in Sierra Leone. The process involved data acquisition, preprocessing, joining, model training and evaluation, and visualization and interpretation.

The model showed promising results for the identification of ASM activity in Sierra Leone. The calibrated model is capable of capturing relevant signals from satellite imagery. The features are most likely picking up things like bare earth, water, infrastructure, and changes in vegetation, which can indicate possible mining operations.

This approach can be extended to analyze ASM in other regions, test alternative machine learning models, investigate specific mine types, and analyze additional satellite data sources or features.

**Further Work**

* Investigate the use of other machine learning models, such as Random Forest or Support Vector Machines.
* Incorporate other data sources, such as geological data or socio-economic indicators.
* Look at ASM hotspots on the prediction map and validate the signal in the imagery (i.e., look for mines and make new labels).

**Acknowledgements**

We acknowledge the contributions of the Project on Resource Governance (PRG) from the University of California, Los Angeles, the Environmental Markets Lab (emLab) at the University of California, Santa Barbara, and the Center for Effective Global Action (CEGA) at the University of California, Berkeley in providing the data and support for this analysis.

**Disclaimer**

The results presented in this notebook are for informational and demonstration purposes only and should not be considered as definitive evidence of ASM activity. Further verification and ground-truthing are recommended for any decision-making related to ASM. The work behind this notebook is ongoing and is expected to be published within the next year.