###Duplicate Detection

These demonstrations highlight issues with common document similarity and duplicate detection methods. They cover two main issues:

1. Using TF-IDF (or any vector option) to determine metadata similarity. 
2. Using vector-based approaches to identify similar identifiers.

####Metadata Similarity

TF-IDF is commonly used to detect duplicate content on websites, and it has been presented as the solution for detecting duplicate metadata content. However, work demonstrating its utility for these tasks is generally limited to a single metadata specification and/or metadata generated by a single system. That is not a realistic test, as we deal with metadata across different kinds of federated platforms.

For this demonstration we have three sets of metadata: from each of three data portals, we've collected multiple representations of the same dataset. So for one dataset, for example, we have a DIF record, an FGDC record and an ISO record. We are assuming that much of the text within those documents is the same and that the majority of the difference comes from the differences between the standards.

So each set contains metadata that we know describes a single dataset. Using common tools, can we identify those sets automatically?


In [24]:
import re
import glob
import os
from lxml import etree

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel
from sklearn import metrics

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
_stopwords = set(stopwords.words('english'))

from IPython.display import display

In [3]:
parser = etree.XMLParser(
    encoding='utf-8'
)

def find_nodes(xml):
    nodes = []
    for elem in xml.iter():
        t = elem.text.strip() if elem.text else ''
        tags = [elem.tag] + [e.tag for e in elem.iterancestors()]
        tags.reverse()

        att_texts = parse_node_attributes(elem)

        nodes += [a for a in [t] + att_texts if a]
  
    return nodes

def parse_node_attributes(node):
    # attribute values go into the bag alongside the element text;
    # an element without attributes simply yields an empty list
    return list(node.attrib.values())

def strip_punctuation(text, simple_pattern=r'[;|>+:=<?(){}`\'"]', replace_char=' '):
    text = re.sub(simple_pattern, replace_char, text)
    return text.replace("/", ' ')


def tokenize(text):
    return word_tokenize(text)


def remove_stopwords(words):
    return ' '.join([w for w in words if w not in _stopwords and w])

# bag of words parsing
# we'll drop stopwords but nothing else
def get_bag(text):
    try:
        xml = etree.fromstring(text, parser=parser)
    except Exception as ex:
        print ex
        xml = None
        
    if xml is None:
        print 'failed to parse'
        return ''
    
    nodes = find_nodes(xml)
    content = ' '.join(nodes if nodes else [])
    if not content:
        print 'failed to iterate'
        return ''

    content = strip_punctuation(content)
    words = tokenize(content)
    bag = remove_stopwords(words)
    bag = strip_punctuation(bag, r'[.,]', '')
    return bag

def prep_set(files):
    identifiers = []
    bags = []
    for f in files:
        bag = ''
        with open(f, 'r') as g:
            text = g.read()
            
        identifier = os.path.basename(f).split('_')[-1].replace('.xml', '')
        bag = get_bag(text)
        
        bags.append(bag)
        identifiers.append(identifier)
    
    return identifiers, bags

In [4]:
# tf-idf set up
def similarity(all_identifiers, all_bags):
    '''
    run through the tf-idf calcs using each bag as the initial comparator
    the scores vary based on that first object
    '''
    tfidf_vectorizer = TfidfVectorizer()
    
    # store the sets
    similarity_scores = {}
    for i in xrange(len(all_identifiers)):
        tf_identifiers = [all_identifiers[i]] + all_identifiers
        tf_data = [all_bags[i]] + all_bags

        tfidf_matrix_trainer = tfidf_vectorizer.fit_transform(tf_data)

        cos_sim = cosine_similarity(tfidf_matrix_trainer[0:1], tfidf_matrix_trainer)

        lk = linear_kernel(tfidf_matrix_trainer[0:1], tfidf_matrix_trainer).flatten()

        related_indices = cos_sim.argsort()[:-len(tf_data)-1:-1] 
        related_indices_with_lk_value = [r for r in related_indices[0] if lk[r]]
        related_indices_with_lk_value.reverse()

        related_set = [(lk[k], tf_data[k], tf_identifiers[k]) for k in related_indices_with_lk_value]

        similarity_scores[all_identifiers[i]] = [
            (similarity, identifier) for similarity, bag, identifier in related_set[1:]
        ]
    
    return similarity_scores

def print_matrix(scores):
    keys = scores.keys()
    keys.sort()
    
    print '|'.join([' '*10] + ['{:^11s}'.format(k) for k in keys])
    print '-' * ((len(keys) * 11) + len(keys) + 10)
    
    for k in keys:
        vert = scores[k]
        vert_vals = ['{:^10s}'.format(k)]
        for e in keys:
            val = next(iter([v[0] for v in vert if v[1] == e]), -99)
            vert_vals.append('{:^10.2f} '.format(val))
        
        print '|'.join(vert_vals)
        

In [7]:
# start with medin
medins = glob.glob('inputs/medin/*.xml')
identifiers, bags = prep_set(medins)
medin_scores = similarity(identifiers, bags)
print_matrix(medin_scores)

          |    dc     |    dif    |    iso    
----------------------------------------------
    dc    |   1.00    |   0.56    |   0.27    
   dif    |   0.59    |   1.00    |   0.27    
   iso    |   0.32    |   0.30    |   1.00    


In [10]:
# the medin bags of words, just to show
for i, b in enumerate(bags):
    f = identifiers[i]
    print f
    print b
    print '-'* 100
    print

dc
http wwwopenarchivesorg OAI 20 oai_dc http wwwopenarchivesorg OAI 20 oai_dcxsd British Geological Survey BGS Geophysical Survey 1995 2 The Wash 25 07 1995 01 08 1995 dataset bgsnercacuk DC abc9f747-537c-0f38-e044-0003ba9b0d98 This British Geological Survey BGS survey took place July August 1995 Wash board Greyhound Tracker  Technical details contained BGS Report WB 95 36  Report Brett  CP  1995  LOEPS Shallow Seismic Survey  Wash  N Norfolk  Humber Operations Report Project 95 02  1995-08-01
----------------------------------------------------------------------------------------------------

dif
http gcmdgsfcnasagov Aboutus xml dif http gcmdgsfcnasagov Aboutus xml dif dif_v94xsd bgsnercacuk DIF BGS_CMD_REF272 British Geological Survey BGS Geophysical Survey 1995 2 The Wash 25 07 1995 01 08 1995 NDGO0001 Geology soil sediment crust Bathymetry Elevation Two-dimensional seismic reflection Side-scan sonar geoscientificInformation NDGO0001 Geology soil sediment crust Bathymetry Elevation

In [83]:
# now gstore
gstores = glob.glob('inputs/gstore/*.xml')
identifiers, bags = prep_set(gstores)
gstore_scores = similarity(identifiers, bags)
print_matrix(gstore_scores)

          |   fgdc    |    iso    |    wfs    |    wms    | wms19119  
----------------------------------------------------------------------
   fgdc   |   1.00    |   0.70    |   0.44    |   0.45    |   0.09    
   iso    |   0.73    |   1.00    |   0.37    |   0.38    |   0.19    
   wfs    |   0.42    |   0.35    |   1.00    |   0.71    |   0.08    
   wms    |   0.44    |   0.37    |   0.72    |   1.00    |   0.09    
 wms19119 |   0.11    |   0.22    |   0.09    |   0.10    |   1.00    


In [84]:
# and finally devotes
devotes = glob.glob('inputs/devotes/*.xml')
identifiers, bags = prep_set(devotes)
devotes_scores = similarity(identifiers, bags)
print_matrix(devotes_scores)

          |   atom    |    csw    |    dif    |   fgdc    |    iso    
----------------------------------------------------------------------
   atom   |   1.00    |   0.89    |   0.77    |   0.86    |   0.28    
   csw    |   0.88    |   1.00    |   0.80    |   0.92    |   0.26    
   dif    |   0.79    |   0.82    |   1.00    |   0.84    |   0.24    
   fgdc   |   0.85    |   0.91    |   0.81    |   1.00    |   0.26    
   iso    |   0.32    |   0.30    |   0.28    |   0.30    |   1.00    


In [20]:
# because i am curious and we've got a tiny set here.
# let's just compare the three iso records and see those scores
# the text should not be similar, just knowing they are different datasets

isos = [
    'inputs/devotes/devotes-80c47f28-d0bc-11e3-9261-00163c43a2bd_iso.xml',
    'inputs/gstore/gstore-b4ae8f53-8dff-46bb-9058-e5501cabdd1b_iso.xml',
    'inputs/medin/medin-abc9f747-537c-0f38-e044-0003ba9b0d98_iso.xml'
]

identifiers = []
bags = []
for f in isos:
    bag = ''
    with open(f, 'r') as g:
        text = g.read()

    identifier = os.path.basename(f)
    bag = get_bag(text)

    bags.append(bag)
    identifiers.append(identifier)

iso_scores = similarity(identifiers, bags)

for k, v in iso_scores.iteritems():
    print k
    print '\n'.join(['\t{0}: {1}'.format(s, i) for s, i in v if i != k])


devotes-80c47f28-d0bc-11e3-9261-00163c43a2bd_iso.xml
	0.259552891407: medin-abc9f747-537c-0f38-e044-0003ba9b0d98_iso.xml
	0.235627369434: gstore-b4ae8f53-8dff-46bb-9058-e5501cabdd1b_iso.xml
medin-abc9f747-537c-0f38-e044-0003ba9b0d98_iso.xml
	0.267971178365: devotes-80c47f28-d0bc-11e3-9261-00163c43a2bd_iso.xml
	0.225254351844: gstore-b4ae8f53-8dff-46bb-9058-e5501cabdd1b_iso.xml
gstore-b4ae8f53-8dff-46bb-9058-e5501cabdd1b_iso.xml
	0.24774153906: devotes-80c47f28-d0bc-11e3-9261-00163c43a2bd_iso.xml
	0.227559645423: medin-abc9f747-537c-0f38-e044-0003ba9b0d98_iso.xml


#####Outcome

Disparities in vector size matter: there's more standards cruft in the ISO than in the DC, and the DC tends to have fewer text elements. So two records can share the same abstract and title, but the additional text found in, for example, the ISO lineage creates a mismatch that lowers the similarity score.
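
To make the size effect concrete, here is a minimal sketch using plain term counts and cosine similarity (no IDF weighting, so it is simpler than the TF-IDF run above). The shared tokens are loosely based on the MEDIN title; the `lineage_N` tokens are made-up stand-ins for standard-specific text, not content from the real records.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bags of words given as token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

# text shared by every representation of the dataset
shared = "british geological survey geophysical survey 1995 the wash".split()

# hypothetical extra tokens: a little for a terse record, a lot for a verbose one
dc_like = shared + ["dataset"]
iso_like = shared + ["lineage_%d" % i for i in range(30)]

same = cosine(shared, shared)
near = cosine(shared, dc_like)
far = cosine(shared, iso_like)

print('%.2f' % same)  # 1.00
print('%.2f' % near)  # 0.95
print('%.2f' % far)   # 0.50
```

The shared text is identical in all three bags; only the amount of surrounding text changes, and that alone is enough to halve the score.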

The order in which the bags of words are compared also affects the scores. Not hugely, but it could make the difference between similar or not depending on the acceptance threshold. And for this set we would have to set the threshold low to find any matches.
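
One likely contributor to the order effect, judging from the `similarity` function above, is that the query bag is prepended to the corpus before fitting, so it is counted twice and the document frequencies shift with each query. The sketch below is a pure-Python approximation of smoothed, L2-normalised TF-IDF (the formula documented for scikit-learn's defaults); the toy bags `a`, `b` and `c` are invented for illustration.

```python
from collections import Counter
from math import log, sqrt

def tfidf_vectors(corpus):
    """Smoothed TF-IDF with L2 normalisation for a list of token lists."""
    n = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    idf = {term: log((1.0 + n) / (1.0 + count)) + 1.0 for term, count in df.items()}
    vectors = []
    for doc in corpus:
        weights = {term: tf * idf[term] for term, tf in Counter(doc).items()}
        norm = sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

def cosine(u, v):
    """Dot product of two already-normalised sparse vectors."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())

# toy bags, not taken from the portal records
a = "x y".split()
b = "x z q".split()
c = ["w"]

# mimic the loop above: the query bag is prepended, so it appears twice
vecs = tfidf_vectors([a, a, b, c])
a_first = cosine(vecs[0], vecs[2])   # a as the query, compared with b
vecs = tfidf_vectors([b, a, b, c])
b_first = cosine(vecs[0], vecs[1])   # b as the query, compared with a

# fit once on the unmodified corpus: both directions agree
vecs = tfidf_vectors([a, b, c])
fit_once = (cosine(vecs[0], vecs[1]), cosine(vecs[1], vecs[0]))

print('%.4f vs %.4f' % (a_first, b_first))
print('%.4f vs %.4f' % fit_once)
```

If symmetric scores are wanted, fitting the vectorizer once on the full set and computing the pairwise matrix from that single fit would avoid the drift.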

There are content issues as well. For cross-standard similarity, we can only rely on certain types of text: titles, abstracts, perhaps a keyword set. This is not unreasonable until we consider a certain cultural practice, that of using an organizational or project description as the abstract for every dataset published. That changes the similarity question into a kind of forensic data portal analysis. And if the titles aren't highly variable, we then have near-duplicate metadata for what are likely highly variable datasets.

Still, this is a small demonstration simply to highlight issues with these vector-based approaches. Identifying similar or duplicate metadata information requires a different approach and likely one that employs a variety of tools.

Finally, phew, at least there's not enough standards cruft in the ISO to render all of them similar here. Although those cross-dataset ISO scores (roughly 0.23 to 0.27) do read like a stable number. Someone should go sort out how much of an ISO record is likely to be the same across any ISO record (I am more on the side of better methods to identify high value text rather than generating some standards-specific extractor).

####Identifier Similarity

For this we're relying on the kinds of non-cryptographic hashing methods common in crawling projects.

This is as much about understanding what it means to provide different representations of identifiers across (or within) systems. It may not be the best method for getting at this, but it does suggest that there's likely no single method to reliably identify an object across different identifier representations. For certain kinds of identifiers, and more so for certain representations of them, I'd suggest that none of the well-known vector methods are suitable. Specifically, for mnemonic URLs where a one-character difference indicates an entirely different dataset, these methods will all return a very high similarity/near-duplicate score for the identifier, so these representations are more likely to return false positives. And we can see, in the examples below, that other very well-defined PIDs might not be identifiable as referencing the same object because of how they're represented. In some cases, the set of representations appears stable and fairly limited, so a system can reliably extract the PID component of the representation. In others, it again becomes a server-by-server issue, and that code burden very quickly becomes unmanageable.
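
For the "stable and fairly limited" case, extraction can be as small as one regular expression. This is only a sketch: `extract_doi` is a hypothetical helper, the prefixes it strips are just the ones that appear in the DOI examples in this notebook, and the `10.NNNN/` shape is a common convention rather than a complete definition of a DOI.

```python
import re

# Assumption: only these representations occur (bare, "doi:" prefix, doi.org URL).
DOI_PATTERN = re.compile(
    r'^(?:https?://(?:dx\.)?doi\.org/|doi:)?(10\.\d{4,9}/\S+)$',
    re.IGNORECASE
)

def extract_doi(identifier):
    """Return the bare DOI from a known representation, or None."""
    match = DOI_PATTERN.match(identifier.strip())
    return match.group(1) if match else None

forms = [
    'http://dx.doi.org/10.7916/D85B019G',
    '10.7916/D85B019G',
    'doi:10.7916/D85B019G',
]

# all three representations collapse to one value
bare = {extract_doi(f) for f in forms}
print(bare)

# a different DOI stays different, and a non-DOI is rejected
print(extract_doi('10.7916/D85B0121'))
print(extract_doi('http://example.org/x'))
```

After extraction the comparison is an exact string match, so a one-character difference keeps two objects apart instead of scoring as a near-duplicate. The catch is the one already noted: every new representation needs another rule.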

The GCIS is an early system for providing very limited concordances (mapping short names and "cool" URIs, for example). It's not meant to be a generic identifier concordance provider - it's specific to NASA/NOAA and the assumptions they make about PIDs. 

This effort comes from less of a metadata quality perspective, so it does have different biases. We're starting from an open world understanding: the data here can come from any server, any service, so long as it's XML.

So, can we grok references to an object from the extracted identifiers at scale, knowing that they can be represented in different ways? (For our set of ~600K XMLs, over 6 million potential identifiers were extracted.)

**Simhashes**

These are a kind of non-cryptographic hash described by Google for performant similarity checks at scale during web crawling activities. 

Two methods: a simple demonstration of simhashes with unmodified identifiers, and The Daniel Method (splitting, sorting, concatenating, simhashing).

Refs:

https://liangsun.org/posts/a-python-implementation-of-simhash-algorithm/

https://github.com/liangsun/simhash
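
For readers new to simhashes, here is a minimal sketch of the idea: every feature of the input is hashed, each hash votes on every bit of the fingerprint, and two fingerprints are compared by counting differing bits (the Hamming distance). It is not the referenced library's code; the character 4-gram features and the MD5 hash are illustrative assumptions.

```python
import hashlib

def features(text, width=4):
    """Overlapping character n-grams; width=4 is an illustrative choice."""
    return [text[i:i + width] for i in range(max(len(text) - width + 1, 1))]

def fingerprint(text, bits=64):
    """Each feature hash votes +1/-1 per bit; a positive tally sets the bit."""
    tallies = [0] * bits
    for feature in features(text):
        h = int(hashlib.md5(feature.encode('utf-8')).hexdigest(), 16)
        for i in range(bits):
            tallies[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if tallies[i] > 0)

def hamming(a, b):
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count('1')

fp_a = fingerprint('10.7916/D85B019G')
fp_b = fingerprint('doi:10.7916/D85B019G')

print('identical input: %d bits differ' % hamming(fp_a, fp_a))
print('prefixed input:  %d of 64 bits differ' % hamming(fp_a, fp_b))
```

The *k* parameter used throughout the cells below is the largest Hamming distance at which two fingerprints are still reported as near-duplicates.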


In [22]:
from simhash import Simhash
from itertools import chain
from operator import itemgetter
import collections
import pandas as pd

# a shallow fork of the index - we want to return
# the scores for demonstration purposes

class Indexer(object):
    def __init__(self, f=64, k=2):
        self.bucket = collections.defaultdict(set)
        self.k = k
        self.f = f
        
        # note: bucket mods were made for a different 
        # process, not really necessary to init the 
        # thing outside of the indexer proper

    def get_near_dups(self, simhash):
        """
        `simhash` is an instance of Simhash
        return a list of obj_id (pipe-delimited string of sha|text|distance)
        """
        assert simhash.f == self.f

        ans = set()

        for key in self.get_keys(simhash):
            dups = self.bucket.get(key, set())

            for dup in dups:
                sim2, obj_blob = dup.split(',', 1)
                sim2 = Simhash(long(sim2, 16), self.f)

                d = simhash.distance(sim2)
                if d <= self.k:
                    # modified to emit the score for testing/evaluation
                    ans.add('{0}|{1}'.format(obj_blob, d))
        return list(ans)
    
    def add(self, obj_id, obj_str, simhash):
        """
        `obj_id` is a string
        `simhash` is an instance of Simhash
        """
        assert simhash.f == self.f

        for key in self.get_keys(simhash):
            # modified to store the strings
            v = '%x,%s|%s' % (simhash.value, obj_id, obj_str)

            self.bucket.setdefault(key, set())
            self.bucket[key].add(v)

    def delete(self, obj_id, obj_str, simhash):
        """
        `obj_id` is a string
        `simhash` is an instance of Simhash
        """
        assert simhash.f == self.f

        for key in self.get_keys(simhash):
            v = '%x,%s|%s' % (simhash.value, obj_id, obj_str)

            if v in self.bucket.get(key, set()):
                self.bucket[key].remove(v)
    
    @property
    def offsets(self):
        """
        You may optimize this method according to <http://www.wwwconference.org/www2007/papers/paper215.pdf>
        """
        return [self.f // (self.k + 1) * i for i in range(self.k + 1)]

    def get_keys(self, simhash):
        for i, offset in enumerate(self.offsets):
            m = (i == len(self.offsets) - 1 and 2 ** (self.f - offset) - 1 or 2 ** (self.offsets[i + 1] - offset) - 1)
            c = simhash.value >> offset & m
            yield '%x:%x' % (c, i)

    def bucket_size(self):
        return len(self.bucket)

In [2]:
def evaluate_set(test_set, k=20):
    duplicates = {}
    index = Indexer(k=k)
    # build the index
    for test_id, test_item, test_simhash in test_set:
        index.add(test_id, test_item, test_simhash)
        
    # run per test string
    for test_id, test_item, test_simhash in test_set:
        dupes = index.get_near_dups(test_simhash)
        
        # as id, string, score
        duplicates[test_item] = [d.split('|') for d in dupes]
    return duplicates

In [6]:
# our set of identifiers

# dois
dois = [
    # one set of different doi representations
    'http://dx.doi.org/10.7916/D85B019G',
    '10.7916/D85B019G',
    'doi:10.7916/D85B019G',
    # additional strings
    '10.7916/D85B0121',
    'http://dx.doi.org/10.5257/iea/ept/2011q3',
    '10.3334/ORNLDAAC/887',
    'doi:10.5281/zenodo.11169',
    '10.5281/zenodo.15638'
]


# some mnemonic things (one or two characters difference only)
mnemonics = [
    # a two char difference in the URLs (PNJ vs POR)
    'http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_PNJ_ClimateStabilization/MapServer/info/metadata',
    'http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_POR_ClimateStabilization/MapServer/info/metadata',
    # a lot of overlap between this url and the above
    'http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_TFL_CleanAir/MapServer/layers?f=pjson',
    # same mnemonic name 
    'http://geodata.epa.gov/arcgis/rest/services/ORD/ROE_LongIslandHypoxia/MapServer/info/metadata',
    'http://geodata.epa.gov/arcgis/services/ORD/ROE_LongIslandHypoxia/MapServer/WMSServer?request=GetCapabilities&service=WMS',
    'http://gispub10.epa.gov/arcgis/rest/services/ORD/ROE_LongIslandHypoxia/MapServer/info/metadata',
    # another arcserver for kicks
    'http://certmapper.cr.usgs.gov/arcgis/rest/services/geology/southasia/MapServer/WMTS/1.0.0/WMTSCapabilities.xml',
    # some urls
    'http://pubs.usgs.gov/ds/556/data_files/be/be_e358_n3342_16/be_e358_n3342_16.las.xml',
    'http://pubs.usgs.gov/ds/628/data_files/fs/fs_e488_n4238_18/fs_e488_n4238_18.las.xml'
    
]


# thredds things (vector size issues )
thredds = [
    'http://acdisc.gsfc.nasa.gov/opendap/catalog.xml',
    'http://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/catalog.xml',
    'http://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/AIRX3QP5.006/catalog.xml',
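    # NOTE: there is no comma after the next string, so Python concatenates it
    # with the one that follows; the recorded results below include that merged entry
    'http://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/AIRX3QP5.006/2015/catalog.xml'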
    'http://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/AIRX3QP5.006/2015/AIRS.2015.01.26.L3.RetQuant006.v6.0.11.0.G15041181518.hdf.xml',
    # a stack of thredds links
    'http://acdisc.gsfc.nasa.gov/opendap/HDF-EOS5/Aura_MLS_Level2/ML2BRO.004/2004/MLS-Aura_L2GP-BrO_v04-20-c01_2004d220.he5.rdf',
    'http://disc2.nascom.nasa.gov/opendap/ncml/TRMM_RT/TRMM_3B42RT.007/2011/082/3B42RT.2011.03.23.15z.bin.ncml',
    'http://disc2.nascom.nasa.gov/opendap/ncml/TRMM_L2/TRMM_2A25/2000/096/2A25.20000405.13558.7.HDF.Z.ncml',
    'http://dataserver3.nccs.nasa.gov/thredds/view/idv.jnlp?url=http://dataserver3.nccs.nasa.gov/thredds/dodsC/NEX-GDDP/IND/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2092.nc',
    'http://dataserver3.nccs.nasa.gov/thredds/ncss/grid/NEX-GDDP/IND/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2069.nc/dataset.xml',
    # from potential "cool" uris (really not)
    '/archive/AIRSOps/test/CO2/v5_4_11/2002/09/19/airx2stc/AIRS.2002.09.19.109.L2.CO2_Std.v5.4.11.0.CO2.T09211021256.hdf'
]

In [7]:
k_values = [3, 5, 10, 20, 30]

In [12]:
doi_checks = []
for k in k_values:
    doi_index = [(d, d, Simhash(d)) for d in dois]
    dupes = evaluate_set(doi_index, k=k)
    
    # rows of k, comparison string, hamming distance, potential dupe id, string used
    for key, vals in dupes.iteritems():
        for i, t, s in vals:
            if i == key or int(s) == 0:
                continue
            doi_checks.append([k, key, s, i, t])

In [13]:
df = pd.DataFrame(doi_checks, columns=['k', 'Identifier', 'Score', 'Potential Match', 'String Used'])
df

index,k,Identifier,Score,Potential Match,String Used
0,20,http://dx.doi.org/10.7916/D85B019G,19,10.7916/D85B019G,10.7916/D85B019G
1,20,http://dx.doi.org/10.7916/D85B019G,19,http://dx.doi.org/10.5257/iea/ept/2011q3,http://dx.doi.org/10.5257/iea/ept/2011q3
2,20,http://dx.doi.org/10.5257/iea/ept/2011q3,19,http://dx.doi.org/10.7916/D85B019G,http://dx.doi.org/10.7916/D85B019G
3,20,10.7916/D85B0121,16,10.7916/D85B019G,10.7916/D85B019G
4,20,doi:10.5281/zenodo.11169,16,10.5281/zenodo.15638,10.5281/zenodo.15638
5,20,doi:10.7916/D85B019G,16,10.7916/D85B019G,10.7916/D85B019G
6,20,10.5281/zenodo.15638,16,doi:10.5281/zenodo.11169,doi:10.5281/zenodo.11169
7,20,10.7916/D85B019G,19,http://dx.doi.org/10.7916/D85B019G,http://dx.doi.org/10.7916/D85B019G
8,20,10.7916/D85B019G,16,doi:10.7916/D85B019G,doi:10.7916/D85B019G
9,20,10.7916/D85B019G,16,10.7916/D85B0121,10.7916/D85B0121


First thing to note is that there are no matches (outside of the identity match) for any k less than 20. Not great given the length of the strings. I am sure there are reasonable approaches from record matching for short strings, but everything I've found to date raised performance issues at even the smaller scale of BCube.
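
One way to see why short identifiers are hard: they produce very few features, so a small change touches a large share of them. The sketch below counts overlapping character 4-grams and compares sets with Jaccard overlap. It is a rough proxy for what feeds a fingerprint, not the simhash library's own computation, and the width of 4 is an assumption.

```python
def shingles(text, width=4):
    """Set of overlapping character n-grams (width=4 is an assumption)."""
    return {text[i:i + width] for i in range(max(len(text) - width + 1, 1))}

def jaccard(a, b):
    """Shared shingles divided by all distinct shingles."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / float(len(sa | sb))

base = '10.7916/D85B019G'
same_object = 'doi:10.7916/D85B019G'   # same DOI, different representation
other_object = '10.7916/D85B0121'      # different DOI

print(len(shingles(base)))                     # 13 shingles for 16 characters
print(round(jaccard(base, same_object), 3))    # 13 shared of 17 -> 0.765
print(round(jaccard(base, other_object), 3))   # 11 shared of 15 -> 0.733
```

By this measure the prefixed form of the same DOI and an entirely different DOI sit at almost the same distance from the bare DOI, which lines up with both pairs scoring 16 in the table above.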

In [18]:
# the daniel method:
# split string
# sort
# concatenate
# score

import re
pttn = r'[:./]'

sorted_dois = []
for d in dois:
    x = [s for s in re.split(pttn, d) if s]
    x.sort()

    sorted_dois.append((d, ''.join(x)))

sorted_matches = []
for k in k_values:
    doi_index = [(i, d, Simhash(d)) for i, d in sorted_dois]
    dupes = evaluate_set(doi_index, k=k)
    for key, vals in dupes.iteritems():
        for i, t, s in vals:
            if i == key or int(s) == 0:
                continue
            sorted_matches.append([k, key, s, i, t])
            
df = pd.DataFrame(sorted_matches, columns=['k', 'Identifier', 'Score', 'Potential Match', 'String Used'])
df

index,k,Identifier,Score,Potential Match,String Used
0,20,107916D85B019G,14,doi:10.7916/D85B019G,107916D85B019Gdoi
1,20,107916D85B019G,16,10.7916/D85B0121,107916D85B0121
2,20,107916D85B019G,15,http://dx.doi.org/10.7916/D85B019G,107916D85B019Gdoidxhttporg
3,20,107916D85B019Gdoidxhttporg,17,doi:10.7916/D85B019G,107916D85B019Gdoi
4,20,107916D85B019Gdoidxhttporg,15,10.7916/D85B019G,107916D85B019G
5,20,107916D85B0121,16,10.7916/D85B019G,107916D85B019G
6,20,107916D85B019Gdoi,17,http://dx.doi.org/10.7916/D85B019G,107916D85B019Gdoidxhttporg
7,20,107916D85B019Gdoi,14,10.7916/D85B019G,107916D85B019G
8,30,107916D85B019G,14,doi:10.7916/D85B019G,107916D85B019Gdoi
9,30,107916D85B019G,16,10.7916/D85B0121,107916D85B0121


Looks like splitting & sorting reduces the score, i.e. the strings are slightly more similar, but that could simply be from removing the punctuation. [Totally making blanket statements about effectiveness based on three things. ¯\\_(ツ)_/¯]

From the [Manku paper](http://www2007.cpsc.ucalgary.ca/papers/paper215.pdf):
> Figure 1 clearly shows the trade-offs for various values of k: A very low value misses near-duplicates (false negatives), and a very high value tags incorrect pairs as near-duplicates (false positives). Choosing k = 3 is reasonable because both precision and recall are near 0.75. So, for 64-bit fingerprints, declaring two documents as near-duplicates when their fingerprints differ in at most 3 bits gives fairly high accuracy.

So the Python implementation defaults to a 64-bit hash, and we're setting the *k* value to 20 rather than the paper's 3. We are also working with much shorter inputs than web documents.
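
To put those *k* values in perspective, this sketch computes what share of a 64-bit fingerprint each threshold tolerates, and how often two unrelated fingerprints would land within *k* bits by chance. The chance figure assumes every bit is independent and uniformly random, which real fingerprints of short, similar-looking strings will not satisfy, so treat it as a floor rather than an estimate. It uses `math.comb`, so it needs Python 3.8 or later, unlike the Python 2 cells in this notebook.

```python
from math import comb  # Python 3.8+

def chance_within(k, bits=64):
    """P(two independent, uniformly random fingerprints differ in <= k bits)."""
    return sum(comb(bits, d) for d in range(k + 1)) / float(2 ** bits)

for k in [3, 5, 10, 20, 30]:
    share = 100.0 * k / 64
    print('k=%2d  %4.1f%% of bits  chance match=%.2e' % (k, share, chance_within(k)))
```

At k=3 fewer than 5% of the bits may differ; at k=20 nearly a third may, and at k=30 the threshold sits just below the 32 bits that two random fingerprints differ by on average.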

Next, we run all of the identifier sets, as is and restructured, across a range of *k* values.

In [27]:
# with the mnemonic identifiers embedded in urls
sorted_matches = []
for k in k_values:
    index = [(d, d, Simhash(d)) for  d in mnemonics]
    dupes = evaluate_set(index, k=k)
    for key, vals in dupes.iteritems():
        for i, t, s in vals:
            if i == key or int(s) == 0:
                continue
            sorted_matches.append([k, key, s, i])
            
df = pd.DataFrame(sorted_matches, columns=['k', 'Identifier', 'Score', 'Potential Match'])
with pd.option_context('display.max_colwidth', 500, 'display.max_rows', 100):
    display(df)

index,k,Identifier,Score,Potential Match
0,3,http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_POR_ClimateStabilization/MapServer/info/metadata,3,http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_PNJ_ClimateStabilization/MapServer/info/metadata
1,3,http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_PNJ_ClimateStabilization/MapServer/info/metadata,3,http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_POR_ClimateStabilization/MapServer/info/metadata
2,5,http://gispub10.epa.gov/arcgis/rest/services/ORD/ROE_LongIslandHypoxia/MapServer/info/metadata,4,http://geodata.epa.gov/arcgis/rest/services/ORD/ROE_LongIslandHypoxia/MapServer/info/metadata
3,5,http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_POR_ClimateStabilization/MapServer/info/metadata,3,http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_PNJ_ClimateStabilization/MapServer/info/metadata
4,5,http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_PNJ_ClimateStabilization/MapServer/info/metadata,3,http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_POR_ClimateStabilization/MapServer/info/metadata
5,5,http://geodata.epa.gov/arcgis/rest/services/ORD/ROE_LongIslandHypoxia/MapServer/info/metadata,4,http://gispub10.epa.gov/arcgis/rest/services/ORD/ROE_LongIslandHypoxia/MapServer/info/metadata
6,10,http://gispub10.epa.gov/arcgis/rest/services/ORD/ROE_LongIslandHypoxia/MapServer/info/metadata,4,http://geodata.epa.gov/arcgis/rest/services/ORD/ROE_LongIslandHypoxia/MapServer/info/metadata
7,10,http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_POR_ClimateStabilization/MapServer/info/metadata,3,http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_PNJ_ClimateStabilization/MapServer/info/metadata
8,10,http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_PNJ_ClimateStabilization/MapServer/info/metadata,3,http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_POR_ClimateStabilization/MapServer/info/metadata
9,10,http://geodata.epa.gov/arcgis/rest/services/ORD/ROE_LongIslandHypoxia/MapServer/info/metadata,4,http://gispub10.epa.gov/arcgis/rest/services/ORD/ROE_LongIslandHypoxia/MapServer/info/metadata


Lower scores! Matching datasets from the same system with a two character difference. Also note that the scores don't vary depending on the initial match string.

The issue with these mnemonic identifiers embedded in URLs is that, without knowledge of the service, it is very difficult to identify the identifier segment. It is worse if you need some other segment of the route to ensure uniqueness. In our open world, we don't have the knowledge to understand the identifier. We're left with options that twig to similarity, and those EPA links are related! They're from the same platform and probably describe similar kinds of datasets (nutrient information in coastal waters, for example). But we are looking for similar identifiers as an initial round of identifying previously seen things (updating the knowledge graph) before triggering the processing actions.
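
The false positive is easy to reproduce with any similarity measure, not just simhashes. As a stand-in, this sketch uses the standard library's `difflib.SequenceMatcher` (not one of the methods used in this notebook) on the two EnviroAtlas URLs from the `mnemonics` list.

```python
from difflib import SequenceMatcher

base = 'http://enviroatlas.epa.gov/arcgis/rest/services/Communities/'
pnj = base + 'ESC_PNJ_ClimateStabilization/MapServer/info/metadata'
por = base + 'ESC_POR_ClimateStabilization/MapServer/info/metadata'

# ratio() is 2 * matched characters / total characters across both strings
ratio = SequenceMatcher(None, pnj, por, autojunk=False).ratio()

print('similarity ratio: %.3f' % ratio)        # very close to 1.0
print('same identifier:  %s' % (pnj == por))   # False
```

Any threshold loose enough to link different representations of one identifier will also link these two, even though they point at different datasets.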

Also, k=30 is way too high for everything.

In [38]:
# sort and repack the mnemonics
import re
pttn = r'[:./]'

sorted_mnemonics = []
for d in mnemonics:
    x = [s for s in re.split(pttn, d) if s]
    x.sort()

    sorted_mnemonics.append((d, ''.join(x)))

sorted_matches = []
for k in k_values[:-1]:
    index = [(i, d, Simhash(d)) for  i, d in sorted_mnemonics]
    dupes = evaluate_set(index, k=k)
    for key, vals in dupes.iteritems():
        for i, t, s in vals:
            if i == key or int(s) == 0:
                continue
            sorted_matches.append([k, key, s, i, t])
            
df = pd.DataFrame(sorted_matches, columns=['k', 'Identifier', 'Score', 'Potential Match', 'String Used'])
with pd.option_context('display.max_colwidth', 150, 'display.max_rows', 100):
    display(df)

index,k,Identifier,Score,Potential Match,String Used
0,10,MapServerORDROE_LongIslandHypoxiaarcgisepageodatagovhttpinfometadatarestservices,10,http://gispub10.epa.gov/arcgis/rest/services/ORD/ROE_LongIslandHypoxia/MapServer/info/metadata,MapServerORDROE_LongIslandHypoxiaarcgisepagispub10govhttpinfometadatarestservices
1,10,CommunitiesESC_POR_ClimateStabilizationMapServerarcgisenviroatlasepagovhttpinfometadatarestservices,7,http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_PNJ_ClimateStabilization/MapServer/info/metadata,CommunitiesESC_PNJ_ClimateStabilizationMapServerarcgisenviroatlasepagovhttpinfometadatarestservices
2,10,MapServerORDROE_LongIslandHypoxiaarcgisepagispub10govhttpinfometadatarestservices,10,http://geodata.epa.gov/arcgis/rest/services/ORD/ROE_LongIslandHypoxia/MapServer/info/metadata,MapServerORDROE_LongIslandHypoxiaarcgisepageodatagovhttpinfometadatarestservices
3,10,CommunitiesESC_PNJ_ClimateStabilizationMapServerarcgisenviroatlasepagovhttpinfometadatarestservices,7,http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_POR_ClimateStabilization/MapServer/info/metadata,CommunitiesESC_POR_ClimateStabilizationMapServerarcgisenviroatlasepagovhttpinfometadatarestservices
4,20,CommunitiesESC_TFL_CleanAirMapServerarcgisenviroatlasepagovhttplayers?f=pjsonrestservices,18,http://geodata.epa.gov/arcgis/services/ORD/ROE_LongIslandHypoxia/MapServer/WMSServer?request=GetCapabilities&service=WMS,MapServerORDROE_LongIslandHypoxiaWMSServer?request=GetCapabilities&service=WMSarcgisepageodatagovhttpservices
5,20,CommunitiesESC_TFL_CleanAirMapServerarcgisenviroatlasepagovhttplayers?f=pjsonrestservices,20,http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_PNJ_ClimateStabilization/MapServer/info/metadata,CommunitiesESC_PNJ_ClimateStabilizationMapServerarcgisenviroatlasepagovhttpinfometadatarestservices
6,20,MapServerORDROE_LongIslandHypoxiaWMSServer?request=GetCapabilities&service=WMSarcgisepageodatagovhttpservices,18,http://pubs.usgs.gov/ds/556/data_files/be/be_e358_n3342_16/be_e358_n3342_16.las.xml,556bebe_e358_n3342_16be_e358_n3342_16data_filesdsgovhttplaspubsusgsxml
7,20,MapServerORDROE_LongIslandHypoxiaWMSServer?request=GetCapabilities&service=WMSarcgisepageodatagovhttpservices,18,http://enviroatlas.epa.gov/arcgis/rest/services/Communities/ESC_TFL_CleanAir/MapServer/layers?f=pjson,CommunitiesESC_TFL_CleanAirMapServerarcgisenviroatlasepagovhttplayers?f=pjsonrestservices
8,20,MapServerORDROE_LongIslandHypoxiaWMSServer?request=GetCapabilities&service=WMSarcgisepageodatagovhttpservices,20,http://gispub10.epa.gov/arcgis/rest/services/ORD/ROE_LongIslandHypoxia/MapServer/info/metadata,MapServerORDROE_LongIslandHypoxiaarcgisepagispub10govhttpinfometadatarestservices
9,20,MapServerORDROE_LongIslandHypoxiaWMSServer?request=GetCapabilities&service=WMSarcgisepageodatagovhttpservices,18,http://geodata.epa.gov/arcgis/rest/services/ORD/ROE_LongIslandHypoxia/MapServer/info/metadata,MapServerORDROE_LongIslandHypoxiaarcgisepageodatagovhttpinfometadatarestservices


In [28]:
# once more with thredds
sorted_matches = []
for k in k_values:
    index = [(d, d, Simhash(d)) for  d in thredds]
    dupes = evaluate_set(index, k=k)
    for key, vals in dupes.iteritems():
        for i, t, s in vals:
            if i == key or int(s) == 0:
                continue
            sorted_matches.append([k, key, s, i])
            
df = pd.DataFrame(sorted_matches, columns=['k', 'Identifier', 'Score', 'Potential Match'])
with pd.option_context('display.max_colwidth', 500, 'display.max_rows', 100):
    display(df)

Unnamed: 0,k,Identifier,Score,Potential Match
0,20,http://disc2.nascom.nasa.gov/opendap/ncml/TRMM_L2/TRMM_2A25/2000/096/2A25.20000405.13558.7.HDF.Z.ncml,17,http://dataserver3.nccs.nasa.gov/thredds/view/idv.jnlp?url=http://dataserver3.nccs.nasa.gov/thredds/dodsC/NEX-GDDP/IND/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2092.nc
1,20,http://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/AIRX3QP5.006/catalog.xml,15,http://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/catalog.xml
2,20,http://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/AIRX3QP5.006/catalog.xml,12,http://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/AIRX3QP5.006/2015/catalog.xmlhttp://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/AIRX3QP5.006/2015/AIRS.2015.01.26.L3.RetQuant006.v6.0.11.0.G15041181518.hdf.xml
3,20,http://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/AIRX3QP5.006/catalog.xml,18,http://acdisc.gsfc.nasa.gov/opendap/catalog.xml
4,20,http://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/AIRX3QP5.006/2015/catalog.xmlhttp://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/AIRX3QP5.006/2015/AIRS.2015.01.26.L3.RetQuant006.v6.0.11.0.G15041181518.hdf.xml,12,http://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/AIRX3QP5.006/catalog.xml
5,20,http://acdisc.gsfc.nasa.gov/opendap/catalog.xml,15,http://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/catalog.xml
6,20,http://acdisc.gsfc.nasa.gov/opendap/catalog.xml,18,http://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/AIRX3QP5.006/catalog.xml
7,20,http://dataserver3.nccs.nasa.gov/thredds/ncss/grid/NEX-GDDP/IND/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2069.nc/dataset.xml,12,http://dataserver3.nccs.nasa.gov/thredds/view/idv.jnlp?url=http://dataserver3.nccs.nasa.gov/thredds/dodsC/NEX-GDDP/IND/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2092.nc
8,20,http://dataserver3.nccs.nasa.gov/thredds/view/idv.jnlp?url=http://dataserver3.nccs.nasa.gov/thredds/dodsC/NEX-GDDP/IND/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2092.nc,12,http://dataserver3.nccs.nasa.gov/thredds/ncss/grid/NEX-GDDP/IND/rcp45/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp45_r1i1p1_inmcm4_2069.nc/dataset.xml
9,20,http://dataserver3.nccs.nasa.gov/thredds/view/idv.jnlp?url=http://dataserver3.nccs.nasa.gov/thredds/dodsC/NEX-GDDP/IND/rcp85/day/atmos/pr/r1i1p1/v1.0/pr_day_BCSD_rcp85_r1i1p1_inmcm4_2092.nc,17,http://disc2.nascom.nasa.gov/opendap/ncml/TRMM_L2/TRMM_2A25/2000/096/2A25.20000405.13558.7.HDF.Z.ncml


In [35]:
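# Sort and repack the THREDDS identifiers: split on ':', '.' and '/',
# drop empty tokens, sort the tokens, and join them back into one string.
import re
pttn = r'[:./]'

sorted_thredds = []
for d in thredds:
    x = sorted(s for s in re.split(pttn, d) if s)
    sorted_thredds.append((d, ''.join(x)))

sorted_matches = []
for k in k_values:
    # Hash the repacked string, but keep the original identifier alongside it.
    index = [(i, d, Simhash(d)) for i, d in sorted_thredds]
    dupes = evaluate_set(index, k=k)
    for key, vals in dupes.items():  # items() works on Python 2 and 3
        for i, t, s in vals:
            # Skip self-matches and entries whose score is 0.
            if i == key or int(s) == 0:
                continue
            sorted_matches.append([k, key, s, i, t])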
            
df = pd.DataFrame(sorted_matches, columns=['k', 'Identifier', 'Score', 'Potential Match', 'String Used'])
with pd.option_context('display.max_colwidth', 75, 'display.max_rows', 100):
    display(df)

Unnamed: 0,k,Identifier,Score,Potential Match,String Used
0,10,006AIRX3QP5Aqua_AIRS_Level3acdisccataloggovgsfchttpnasaopendapxml,8,http://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/catalog.xml,Aqua_AIRS_Level3acdisccataloggovgsfchttpnasaopendapxml
1,10,Aqua_AIRS_Level3acdisccataloggovgsfchttpnasaopendapxml,9,http://acdisc.gsfc.nasa.gov/opendap/catalog.xml,acdisccataloggovgsfchttpnasaopendapxml
2,10,Aqua_AIRS_Level3acdisccataloggovgsfchttpnasaopendapxml,8,http://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/AIRX3QP5.006/catal...,006AIRX3QP5Aqua_AIRS_Level3acdisccataloggovgsfchttpnasaopendapxml
3,10,acdisccataloggovgsfchttpnasaopendapxml,9,http://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/catalog.xml,Aqua_AIRS_Level3acdisccataloggovgsfchttpnasaopendapxml
4,20,00006006011120152015201526AIRSAIRX3QP5AIRX3QP5Aqua_AIRS_Level3Aqua_AIRS...,20,http://acdisc.gsfc.nasa.gov/opendap/catalog.xml,acdisccataloggovgsfchttpnasaopendapxml
5,20,006AIRX3QP5Aqua_AIRS_Level3acdisccataloggovgsfchttpnasaopendapxml,11,http://acdisc.gsfc.nasa.gov/opendap/catalog.xml,acdisccataloggovgsfchttpnasaopendapxml
6,20,006AIRX3QP5Aqua_AIRS_Level3acdisccataloggovgsfchttpnasaopendapxml,8,http://acdisc.gsfc.nasa.gov/opendap/Aqua_AIRS_Level3/catalog.xml,Aqua_AIRS_Level3acdisccataloggovgsfchttpnasaopendapxml
7,20,0INDNEX-GDDPatmosdataserver3datasetdaygovgridhttpnasancnccsncssprpr_day...,20,http://acdisc.gsfc.nasa.gov/opendap/catalog.xml,acdisccataloggovgsfchttpnasaopendapxml
8,20,0INDNEX-GDDPatmosdataserver3datasetdaygovgridhttpnasancnccsncssprpr_day...,12,http://dataserver3.nccs.nasa.gov/thredds/view/idv.jnlp?url=http://datas...,0INDNEX-GDDPatmosdataserver3dataserver3daydodsCgovgovhttpidvjnlp?url=ht...
9,20,0070308215z20112011233B42RTTRMM_3B42RTTRMM_RTbindisc2govhttpnasanascomn...,19,http://disc2.nascom.nasa.gov/opendap/ncml/TRMM_L2/TRMM_2A25/2000/096/2A...,096135582000200004052A257HDFTRMM_2A25TRMM_L2Zdisc2govhttpnasanascomncml...


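Ok, is the difference just down to removing the punctuation? Let's strip the punctuation without sorting the pieces to verify that here.

(Also, we're not going to act on any of this without the larger tests against the different types of identifiers.)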

In [74]:
# The "is it just removing punctuation?" question, asked of the first
# three DOIs: one known object in three different representations.

import re
pttn = r'[:./]'

unsorted_dois = []
for d in dois[:3]:
    # Split on the punctuation and rejoin WITHOUT sorting the tokens.
    x = [s for s in re.split(pttn, d) if s]
    unsorted_dois.append((d, ''.join(x)))

doi_index = [(i, d, Simhash(d)) for i, d in unsorted_dois]
dupes = evaluate_set(doi_index)
print_dupes(dupes)

httpdxdoiorg107916D85B019G
	10.7916/D85B019G : 19 (107916D85B019G)

doi107916D85B019G
	10.7916/D85B019G : 16 (107916D85B019G)

107916D85B019G
	http://dx.doi.org/10.7916/D85B019G : 19 (httpdxdoiorg107916D85B019G)
	doi:10.7916/D85B019G : 16 (doi107916D85B019G)



So, no, it isn't just stripping out the punctuation. We'll stick with The Daniel Method for comparing normal simhashing to this modified string simhashing.
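
To make the comparison concrete, here is a minimal sketch of the two string transformations used in the cells above, written as standalone helpers. The function names `strip_punctuation` and `sort_and_repack` are ours for illustration and do not appear elsewhere in the notebook; the delimiter pattern and the sample DOIs are taken from the cells and output above. No simhashing happens here, this only shows what string each variant would hand to `Simhash`.

```python
import re

# Same delimiter pattern as the cells above.
PTTN = r'[:./]'


def strip_punctuation(identifier):
    """Split on ':', '.' and '/', then rejoin the tokens in original order."""
    return ''.join(s for s in re.split(PTTN, identifier) if s)


def sort_and_repack(identifier):
    """Split the same way, but sort the tokens before rejoining them."""
    return ''.join(sorted(s for s in re.split(PTTN, identifier) if s))


# The three DOI representations that appear in the output above.
samples = [
    'http://dx.doi.org/10.7916/D85B019G',
    'doi:10.7916/D85B019G',
    '10.7916/D85B019G',
]

for s in samples:
    print((strip_punctuation(s), sort_and_repack(s)))
```

The two helpers differ only in token order, so any difference in scores between the two runs has to come from the ordering and not from the missing punctuation.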