This tutorial provides a simple guide for **beginner resource geologists** on how to use the `pandas` and `scipy` libraries in Python to find duplicates in drill hole data sets.

A simulated collars file containing the coordinates of drill hole collars will be used to find:  

* exact duplicates  
* partial duplicates
* and 'near' duplicates

In [1]:
# Import libraries
import numpy as np
import pandas as pd


# Load collars file (hosted on Github)
url = 'https://raw.githubusercontent.com/erebus-mre/Geostats_Code/main/00_Datasets/collars.csv'
collars = pd.read_csv(url)

# Have a look at the first few rows
display(collars.head())

Unnamed: 0,dhid,x,y,z
0,DH-1,-1.195555,0.329439,83.831597
1,DH-2,8.390114,-2.036542,97.305504
2,DH-3,19.382881,0.722607,102.682669
3,DH-4,31.14089,0.666755,128.992683
4,DH-5,39.768896,-1.313822,97.109442


# Exact Duplicates
Exact duplicates are rows that are identical in all columns.

These are easily identified using the [`duplicated()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) method in `pandas`:

In [2]:
# Find exact duplicates
dups = collars.duplicated(keep = False)

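The method returns a series of boolean values, where `True` indicates that the row has at least one identical copy elsewhere in the data frame. Because `keep = False` was used, every member of a duplicate group is flagged.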

In [3]:
dups.head()

0    False
1    False
2    False
3    False
4    False
dtype: bool

This series can then be used to filter the original data frame to show only the duplicate rows.

In [4]:
duplicates = collars[dups].sort_values(by = 'dhid')
display(duplicates)

Unnamed: 0,dhid,x,y,z
5,DH-6,49.479278,-0.834971,90.883643
121,DH-6,49.479278,-0.834971,90.883643
65,DH-66,100.920413,49.86666,98.566139
122,DH-66,100.920413,49.86666,98.566139
77,DH-78,-0.117835,71.621415,103.6386
124,DH-78,-0.117835,71.621415,103.6386
81,DH-82,40.209253,70.613537,89.73793
123,DH-82,40.209253,70.613537,89.73793
93,DH-94,51.541867,80.891065,101.115226
125,DH-94,51.541867,80.891065,101.115226


The *keep* parameter determines which of the duplicates are marked as `True`. The default is `'first'`, which marks the first occurrence of the duplicate as `False` and the subsequent occurrences as `True`. Setting *keep* to `'last'` marks the last occurrence as `False` and the previous occurrences as `True`. Setting it to `False`, as in the cell above, marks every occurrence as `True`.
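
The effect of the `keep` settings `'first'`, `'last'` and `False` (the value used in the cell above) is easiest to see on a tiny table. The sketch below uses a made-up three-row data frame (`toy`, with hypothetical values) rather than the collars file:

```python
import pandas as pd

# Hypothetical toy collar table: DH-1 appears twice (rows 0 and 2)
toy = pd.DataFrame({
    'dhid': ['DH-1', 'DH-2', 'DH-1'],
    'x': [10.0, 20.0, 10.0],
    'y': [5.0, 5.0, 5.0],
})

first = toy.duplicated(keep='first')  # flags only the later copy
last = toy.duplicated(keep='last')    # flags only the earlier copy
both = toy.duplicated(keep=False)     # flags every copy

# Show the three masks side by side
print(pd.DataFrame({'first': first, 'last': last, 'False': both}))
```

With `keep=False` both copies of DH-1 are flagged, which is what you want when inspecting duplicates; `'first'` or `'last'` is what you want when deleting them.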

This helps to remove **exact duplicates** from a data file:

In [5]:
collars_no_dups = collars[~collars.duplicated(keep = 'first')]

The `pandas` method [`drop_duplicates()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.drop_duplicates.html#pandas.Series.drop_duplicates) removes the duplicates directly, keeping only the occurrence selected by *keep*, without the need to filter the data frame. 

In [6]:
collars = collars.drop_duplicates(keep = 'first')
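
As a quick check that the two approaches give the same result, the sketch below compares them on a small made-up data frame (`toy` is hypothetical, not the collars file). It also shows that the surviving rows keep their original index labels, which matters later when records are dropped by index:

```python
import pandas as pd

# Hypothetical toy table with one exact duplicate (rows 0 and 2)
toy = pd.DataFrame({
    'dhid': ['DH-1', 'DH-2', 'DH-1'],
    'x': [10.0, 20.0, 10.0],
})

# Approach 1: boolean filter on the duplicated() mask
filtered = toy[~toy.duplicated(keep='first')]

# Approach 2: drop_duplicates() in a single call
dropped = toy.drop_duplicates(keep='first')

same = filtered.equals(dropped)
print(same)                    # True
print(dropped.index.tolist())  # [0, 1] (original labels are kept)
```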

# Partial Duplicates
Partial duplicates are rows that are identical in some columns but not all. These could occur when one or more of the coordinates in a record are changed without deleting the original record.

These holes will not be flagged as duplicated using the `duplicated()` method as implemented above. The **subset** parameter must be used to specify the columns to consider when identifying duplicates.

In the example below the *dhid*, *x*, and *y* values are the same but the *z* coordinate differs. Note the subset parameter: it is a list of the columns that must be the same in two or more records for them to be considered duplicates.  

In [7]:
dups = collars.duplicated(subset = ['dhid','x','y'],keep = False)
duplicates = collars[dups].sort_values(by = 'dhid')
display(duplicates)

Unnamed: 0,dhid,x,y,z
104,DH-105,50.928328,90.558396,128.006641
127,DH-105,50.928328,90.558396,129.006641
116,DH-117,60.782523,98.982337,88.927288
128,DH-117,60.782523,98.982337,89.927288
40,DH-41,70.149374,28.815263,99.095452
130,DH-41,70.149374,28.815263,100.095452
66,DH-67,-1.392264,59.360594,113.709485
126,DH-67,-1.392264,59.360594,114.709485
78,DH-79,9.195151,70.35558,89.373922
129,DH-79,9.195151,70.35558,90.373922


The data frame can now be exported and the correct coordinates identified. Make sure you use the **keep = False** parameter as one does not know which of the duplicates is the correct one.

Remember to check all the possible combinations:  
* dhid, x and y are the same but z is different  
* dhid, x and z are the same but y is different  
* dhid, y and z are the same but x is different  
* x, y and z are the same but dhid is different  
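
The four checks listed above can be run in one loop rather than four separate cells. The following is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical); it assumes exact duplicates have already been removed, so anything flagged is a partial duplicate:

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy table:
#   rows 0 and 1 share dhid, x, y but differ in z
#   rows 2 and 3 share x, y, z but differ in dhid
toy = pd.DataFrame({
    'dhid': ['DH-1', 'DH-1', 'DH-2', 'DH-3'],
    'x': [10.0, 10.0, 20.0, 20.0],
    'y': [5.0, 5.0, 8.0, 8.0],
    'z': [100.0, 101.0, 90.0, 90.0],
})

columns = ['dhid', 'x', 'y', 'z']
partial = {}

# Every way of choosing three of the four columns
for subset in combinations(columns, 3):
    mask = toy.duplicated(subset=list(subset), keep=False)
    partial[subset] = toy[mask]
    print(subset, '->', toy[mask].index.tolist())
```

Each entry of `partial` holds the rows flagged for one combination, ready to be inspected or exported.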

Deleting the duplicate records is a bit more involved. I keep track of the index numbers of the erroneous records and then use the `.drop()` method to remove them from the original data frame.

For example, if the records with index numbers 128 and 130 in the table above are found to be wrong, they can be removed: 

In [8]:
# Drop the erroneous records by index label
collars = collars.drop(index = [128, 130])

# List the partial duplicates that still need to be resolved
dups = collars.duplicated(subset = ['dhid','x','y'],keep = False)
duplicates = collars[dups].sort_values(by = 'dhid')
display(duplicates)

Unnamed: 0,dhid,x,y,z
104,DH-105,50.928328,90.558396,128.006641
127,DH-105,50.928328,90.558396,129.006641
66,DH-67,-1.392264,59.360594,113.709485
126,DH-67,-1.392264,59.360594,114.709485
78,DH-79,9.195151,70.35558,89.373922
129,DH-79,9.195151,70.35558,90.373922


# Close Collars



When validating a collar dataset it is worthwhile checking for drill hole collars which lie (uncomfortably) close to one another. 

This can be done by looping through the collar data twice, i.e. loop through the collars and determine the distance between each collar and all the other collars (the second loop). If the distance is less than a certain threshold, the two collars are considered to be too close to one another. Because the number of comparisons grows with the square of the number of collars, this can be rather time consuming for a large dataset.
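
For reference, the double loop described above might look like the sketch below. The coordinates and the threshold of 2 are made-up values for illustration, not taken from the collars file:

```python
import numpy as np

# Hypothetical collar coordinates (x, y, z)
coords = np.array([
    [0.0, 0.0, 100.0],
    [0.5, 0.5, 100.0],
    [50.0, 0.0, 100.0],
    [50.0, 50.0, 100.0],
])
threshold = 2.0  # assumed distance threshold

close_pairs = []
for i in range(len(coords)):
    # Start at i + 1 so each pair is compared only once
    for j in range(i + 1, len(coords)):
        dist = np.linalg.norm(coords[i] - coords[j])
        if dist < threshold:
            close_pairs.append((i, j, dist))

print(close_pairs)
```

For n collars this performs n(n-1)/2 distance calculations.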

A more efficient way is to use the [`KDTree`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html) class in the `scipy.spatial` module. The `KDTree` class is a data structure for quick nearest-neighbor lookup. It is a binary tree that partitions the space so that the nearest neighbors of a point can be found without comparing it to every other point.

First, we need to import the `KDTree` class from `scipy.spatial`:

In [9]:
from scipy.spatial import KDTree

If you get the following error: 

> ModuleNotFoundError: No module named 'scipy'  

you can install the library by running the following command in your terminal: `pip install scipy`


In [10]:
# KDTree is initialized with the coordinates
# of the drill hole collars. 
# The coordinates must be presented as a NumPy array:
coords = collars[['x', 'y', 'z']].to_numpy()

# Create the 3d tree:
tree = KDTree(coords)

# Find the nearest neighbors of each point
# in the dataset
nnb_dist, nnb_index = tree.query(coords, k=2)

# Extract the nearest neighbors (excluding the 
# point itself)
nearest_neighbors = nnb_index[:, 1]

# Filter records where the nearest neighbor 
# distance is less than 2
close_indices = [i for i, dist in enumerate(nnb_dist[:, 1]) if dist < 2]

# Return records in collars DataFrame that are within 
# the specified distance
close_records = collars.iloc[close_indices].drop_duplicates().sort_values(by = ['x','y','z'])

display(close_records)

Unnamed: 0,dhid,x,y,z
66,DH-67,-1.392264,59.360594,113.709485
126,DH-67,-1.392264,59.360594,114.709485
33,DH-34,-0.751361,30.697031,93.434043
143,DH-34c,-0.087231,31.606804,94.110586
22,DH-23,1.532359,21.331244,100.611858
145,DH-23c,2.196489,22.241017,101.2884
78,DH-79,9.195151,70.35558,89.373922
129,DH-79,9.195151,70.35558,90.373922
140,DH-14,19.337046,20.588132,113.884179
24,DH-25,19.824546,20.329603,113.170456


---