This tutorial is a simple guide for **beginner resource geologists** on how to use the `pandas` and `scipy` libraries in Python to find duplicates in drill hole data sets.

A simulated collars file, containing the coordinates of drill hole collars, will be used to find:

* exact duplicates
* partial duplicates
* 'near' duplicates

In [106]:
# Import libraries
import numpy as np
import pandas as pd


# Load collars file (hosted on Github)
url = 'https://raw.githubusercontent.com/erebus-mre/Geostats_Code/main/00_Datasets/collars.csv'
collars = pd.read_csv(url)

# Have a look at the first few rows
display(collars.head())

Unnamed: 0,dhid,x,y,z
0,DH-1,-0.932096,-1.026137,97.937086
1,DH-2,9.439632,0.554131,117.312559
2,DH-3,17.622817,1.012468,105.339303
3,DH-4,30.031428,-0.423953,109.858463
4,DH-5,38.771637,0.830514,90.951326


# Exact Duplicates
Exact duplicates are rows that are identical in all columns.

These are easily identified using the [`duplicated()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) method in `pandas`:

In [107]:
# Find exact duplicates
dups = collars.duplicated(keep = False)

The method returns a series of boolean values, where `True` indicates that the row is a member of a group of duplicated rows. Because `keep = False` was used, every copy is flagged, not just the repeats.

In [108]:
dups.head()

0    False
1    False
2    False
3    False
4    False
dtype: bool

This series can then be used to filter the original data frame to show only the duplicate rows.

In [109]:
duplicates = collars[dups].sort_values(by = 'dhid')
display(duplicates)

Unnamed: 0,dhid,x,y,z
23,DH-24,9.577436,20.124624,106.695154
122,DH-24,9.577436,20.124624,106.695154
77,DH-78,-0.477881,73.579353,98.29461
124,DH-78,-0.477881,73.579353,98.29461
80,DH-81,31.380224,69.606047,101.537821
125,DH-81,31.380224,69.606047,101.537821
84,DH-85,69.071556,68.889773,103.010377
121,DH-85,69.071556,68.889773,103.010377
98,DH-99,100.071209,78.464094,100.01005
123,DH-99,100.071209,78.464094,100.01005


The *keep* parameter determines which of the duplicates are marked as `True`. The default is `'first'`, which marks the first occurrence of the duplicate as `False` and the subsequent occurrences as `True`. Setting *keep* to `'last'` marks the last occurrence as `False` and the previous occurrences as `True`. Setting *keep* to `False`, as in the example above, marks every occurrence as `True`.
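
The effect of the three *keep* settings is easiest to see on a very small table. The sketch below uses a hypothetical three-row data frame (`toy`, not part of the collars file) in which the first and last rows are identical:

```python
import pandas as pd

# Hypothetical toy collar table: rows 0 and 2 are identical
toy = pd.DataFrame({
    "dhid": ["DH-1", "DH-2", "DH-1"],
    "x": [0.0, 10.0, 0.0],
    "y": [0.0, 0.0, 0.0],
    "z": [100.0, 101.0, 100.0],
})

first = toy.duplicated(keep="first")  # only the later copy is flagged
last = toy.duplicated(keep="last")    # only the earlier copy is flagged
both = toy.duplicated(keep=False)     # every copy is flagged

# Show the three results side by side, one column per keep setting
summary = pd.DataFrame({"first": first, "last": last, "False": both})
print(summary)
```

Row 1 (DH-2) is never flagged, row 2 is flagged by `'first'`, row 0 by `'last'`, and both rows 0 and 2 by `False`.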

The boolean series can be used to remove **exact duplicates** from a data file, keeping the first occurrence of each record:

In [110]:
collars_no_dups = collars[~collars.duplicated(keep = 'first')]

The `pandas` method [`drop_duplicates()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) removes the duplicates in a single step, without the need to filter the data frame.

In [111]:
collars = collars.drop_duplicates(keep = 'first')
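
As a quick check, the two approaches can be compared on a small table. This is a minimal sketch using a hypothetical data frame (`toy`); it also shows that the original index labels are kept after the duplicates are removed, which matters later when records are referred to by index number.

```python
import pandas as pd

# Hypothetical toy table: row 2 is an exact copy of row 0
toy = pd.DataFrame({
    "dhid": ["DH-1", "DH-2", "DH-1", "DH-3"],
    "x": [0.0, 10.0, 0.0, 20.0],
    "y": [0.0, 0.0, 0.0, 0.0],
})

# Approach 1: filter with the inverted boolean series
filtered = toy[~toy.duplicated(keep="first")]

# Approach 2: let pandas do the filtering
dropped = toy.drop_duplicates(keep="first")

print(filtered.equals(dropped))  # both approaches give the same table
print(dropped.index.tolist())    # index labels of the surviving rows
```

If a clean, consecutive index is preferred, `dropped.reset_index(drop=True)` renumbers the rows from zero.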

# Partial Duplicates
Partial duplicates are rows that are identical in some columns but not all. These could occur when one or more of the coordinates in a record are changed without deleting the original record.

These holes will not be flagged by the `duplicated()` method as used above, because it compares all columns by default. The *subset* parameter must be used to specify the columns to consider when identifying duplicates. It takes a list of the columns that must be the same in two or more records for them to be considered duplicates.

In the example below the *dhid*, *x* and *y* values are the same but the *z* coordinate differs.

In [112]:
dups = collars.duplicated(subset = ['dhid','x','y'],keep = False)
duplicates = collars[dups].sort_values(by = 'dhid')
display(duplicates)

Unnamed: 0,dhid,x,y,z
106,DH-107,70.914712,89.471938,104.296403
127,DH-107,70.914712,89.471938,105.296403
27,DH-28,49.208602,19.620317,97.498452
128,DH-28,49.208602,19.620317,98.498452
35,DH-36,18.694472,29.944372,114.207091
129,DH-36,18.694472,29.944372,115.207091
37,DH-38,39.857632,30.420612,110.844214
130,DH-38,39.857632,30.420612,111.844214
62,DH-63,70.04347,50.552525,95.4267
126,DH-63,70.04347,50.552525,96.4267


The data frame can now be exported and the correct coordinates identified. Make sure you use `keep = False` so that every member of each duplicate pair is listed, as one does not know in advance which of the duplicates is the correct one.

Remember to check all the possible combinations:  
* dhid, x and y are the same but z is different
* dhid, x and z are the same but y is different
* dhid, y and z are the same but x is different
* x, y and z are the same but dhid is different
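
The four checks listed above can be run in one loop rather than one at a time. The following is a minimal sketch, assuming the exact duplicates have already been removed (otherwise they would be flagged by every check). The data frame `toy` and the names `key_cols` and `partial` are hypothetical and used for illustration only.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy table:
# rows 0 and 2 differ only in z, rows 1 and 3 differ only in dhid
toy = pd.DataFrame({
    "dhid": ["DH-1", "DH-2", "DH-1", "DH-9"],
    "x": [0.0, 10.0, 0.0, 10.0],
    "y": [5.0, 5.0, 5.0, 5.0],
    "z": [100.0, 101.0, 102.0, 101.0],
})

key_cols = ["dhid", "x", "y", "z"]
partial = {}

# Leave out one column at a time and look for matches on the other three
for subset in combinations(key_cols, len(key_cols) - 1):
    mask = toy.duplicated(subset=list(subset), keep=False)
    partial[subset] = toy[mask]
    print(subset, "->", toy.index[mask].tolist())
```

Each entry in `partial` holds the records flagged for one combination of columns, ready to be exported and reviewed.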

Deleting partial duplicates is a bit more involved. I keep track of the index numbers of the erroneous records and then use the `.drop()` method to remove them from the original data frame.

For example, suppose that review shows the records with index numbers 101, 128, 12, 130 and 82 are wrong and must be removed (of these, only 128 and 130 appear in the table above):

In [113]:
collars = collars.drop(index = [101, 128, 12, 130, 82])

# Check which partial duplicates on dhid, x and y remain
dups = collars.duplicated(subset = ['dhid','x','y'],keep = False)
duplicates = collars[dups].sort_values(by = 'dhid')
display(duplicates)

Unnamed: 0,dhid,x,y,z
106,DH-107,70.914712,89.471938,104.296403
127,DH-107,70.914712,89.471938,105.296403
35,DH-36,18.694472,29.944372,114.207091
129,DH-36,18.694472,29.944372,115.207091
62,DH-63,70.04347,50.552525,95.4267
126,DH-63,70.04347,50.552525,96.4267
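
Because `.drop()` works on index labels, it can be worth confirming that every label on the removal list actually exists before dropping, and then re-running the duplicate check. The sketch below does this on a hypothetical table (`toy`) with a hypothetical removal list (`bad_index`); neither comes from the collars file.

```python
import pandas as pd

# Hypothetical toy table with non-consecutive index labels
toy = pd.DataFrame(
    {
        "dhid": ["DH-1", "DH-2", "DH-1"],
        "x": [0.0, 10.0, 0.0],
        "y": [5.0, 5.0, 5.0],
        "z": [100.0, 101.0, 102.0],
    },
    index=[7, 8, 42],
)

# Hypothetical: record 42 was judged to be wrong after review
bad_index = [42]

# Labels on the removal list that are not present in the table
missing = pd.Index(bad_index).difference(toy.index)
if len(missing) > 0:
    raise KeyError(f"Index labels not found: {missing.tolist()}")

cleaned = toy.drop(index=bad_index)

# Re-run the partial duplicate check on the cleaned table
remaining = cleaned.duplicated(subset=["dhid", "x", "y"], keep=False)
print(remaining.any())
```

If `remaining.any()` is `True`, some partial duplicates are still present and the removal list is incomplete.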


# Close Collars



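When validating a collar dataset, it is worthwhile checking for drill hole collars that lie (uncomfortably) close to one another. These 'near' duplicates can be found with a nearest-neighbour search using `KDTree` from `scipy`: any collar whose nearest neighbour is closer than a chosen distance (1 coordinate unit in the example below) is flagged.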

In [114]:
from scipy.spatial import KDTree

# 'collars' is a DataFrame with 'x', 'y', 'z' columns
coords = collars[['x', 'y', 'z']].to_numpy()

# Create a KDTree for fast nearest-neighbour searches
tree = KDTree(coords)

# Find the two nearest points for each collar
# k=2 because the nearest neighbour of a point is the point itself, so we need the second nearest
nnb_dist, nnb_index = tree.query(coords, k = 2)

# Row positions of the nearest neighbours (excluding the point itself)
nearest_neighbors = nnb_index[:, 1]

# Row positions of records whose nearest neighbour is less than 1 coordinate unit away
close_indices = np.flatnonzero(nnb_dist[:, 1] < 1)

# Return records in collars DataFrame that are within the specified distance
# (iloc is used because KDTree results are row positions, not index labels)
close_records = collars.iloc[close_indices].drop_duplicates()

print(close_records)

       dhid           x          y           z
10    DH-11  100.436391  -1.248671   82.953069
13    DH-14   20.382481   9.051499  119.513460
37    DH-38   39.857632  30.420612  110.844214
57    DH-58   20.111327  49.607759  106.457482
109  DH-110  100.529449  89.735392  104.911750
141   DH-11  100.464807  -0.469951   83.542657
142   DH-14   20.410898   9.830219  120.103048
143  DH-110  100.557865  90.514111  105.501339
144   DH-58   20.139744  50.386479  107.047070
145   DH-38   39.886049  31.199332  111.433802
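
The table above lists the close collars but does not say which collar is close to which. One way to get explicit pairs is the `query_pairs()` method of `KDTree`, which returns every pair of points within a given radius. The following is a minimal sketch on a hypothetical four-row table (`toy`); the names `max_dist` and `close_pairs` are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.spatial import KDTree

# Hypothetical toy table: the last record lies 0.5 units from DH-1
toy = pd.DataFrame(
    {
        "dhid": ["DH-1", "DH-2", "DH-3", "DH-1"],
        "x": [0.0, 10.0, 20.0, 0.3],
        "y": [0.0, 0.0, 0.0, 0.4],
        "z": [100.0, 100.0, 100.0, 100.0],
    },
    index=[0, 1, 2, 140],
)

coords = toy[["x", "y", "z"]].to_numpy()
tree = KDTree(coords)

# query_pairs returns row positions (i, j), with i < j,
# for all points at most max_dist apart
max_dist = 1.0
pairs = sorted(tree.query_pairs(r=max_dist))

# Translate row positions back to index labels and hole IDs
rows = []
for i, j in pairs:
    rows.append({
        "index_a": toy.index[i],
        "dhid_a": toy["dhid"].iloc[i],
        "index_b": toy.index[j],
        "dhid_b": toy["dhid"].iloc[j],
        "distance": float(np.linalg.norm(coords[i] - coords[j])),
    })

close_pairs = pd.DataFrame(rows)
print(close_pairs)
```

Each row of `close_pairs` describes one pair of close collars, which makes it easier to decide which record of the pair to keep.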


In [116]:
display(collars)

Unnamed: 0,dhid,x,y,z
0,DH-1,-0.932096,-1.026137,97.937086
1,DH-2,9.439632,0.554131,117.312559
2,DH-3,17.622817,1.012468,105.339303
3,DH-4,30.031428,-0.423953,109.858463
4,DH-5,38.771637,0.830514,90.951326
...,...,...,...,...
141,DH-11,100.464807,-0.469951,83.542657
142,DH-14,20.410898,9.830219,120.103048
143,DH-110,100.557865,90.514111,105.501339
144,DH-58,20.139744,50.386479,107.047070


In [92]:
collars.loc[10]

dhid         DH-11
x       100.436391
y        -1.248671
z        82.953069
Name: 10, dtype: object

In [94]:
collars.loc[131]

dhid         DH-10
x       100.995884
y         0.518798
z        97.504703
Name: 131, dtype: object