## Objective: compare several similarity metrics for biocollection data, with a view to building a similarity graph

### Essential answers (from Dr. Fortes):
- % of records with a certain % of similarity
- % of records with a certain % of similarity when n fields are considered
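
One possible reading of the first question, sketched under assumptions: a record counts as "similar" at a given threshold if its value has at least one other distinct value whose similarity reaches that threshold. The function name `share_of_similar_records` and the sample counts are hypothetical, and `difflib.SequenceMatcher.ratio` from the standard library is only a stand-in for whichever metric is finally chosen. The n-field case is not covered here.

```python
from difflib import SequenceMatcher
from itertools import combinations

def share_of_similar_records(value_counts, threshold):
    """Percentage of records whose value has at least one other distinct
    value with similarity >= threshold.

    value_counts maps each distinct value to its number of records.
    """
    matched = set()
    # Compare every unordered pair of distinct values once.
    for a, b in combinations(value_counts, 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            matched.update((a, b))
    total = sum(value_counts.values())
    hits = sum(value_counts[v] for v in matched)
    return 100.0 * hits / total if total else 0.0

# Hypothetical record counts, for illustration only.
counts = {
    "Gonatopus neotropicus Olmi 1984": 3,
    "Gonatopus neotropicus Olmi 1986": 1,
    "Sorubim lima (Bloch & Schneider 1801)": 6,
}
print(share_of_similar_records(counts, 0.9))
```

With these made-up counts, the two Gonatopus spellings match each other, so 4 of the 10 records are reported as having a similar neighbour.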

### Some implementation reasoning:
- When we talk about similarity, we assume the existence of two strings, or of a ground-truth value that is compared to other test values. In the case of biocollection metadata, what do we want to compare to what?
- A cross product of all the values of all the terms against all the other values seems too costly computationally.
- The similarity computed in this notebook does not use any biological knowledge: we do not consider similarity between specimens' families, genera, or other taxonomic ranks. It is a general-purpose procedure that ignores the semantic value of the data.

### Probabilistic databases: 
I checked some documentation about probabilistic databases, because they could be a good storage option for the processed data, but the available software I found is outdated or not mature enough. 
Moreover, most of these databases include probability values for the tuple (tuple-level uncertainty) or for some column (attribute-level uncertainty), not for the relations among values.

## Similarity analysis of the Scientific name

In [2]:
import pandas as pd
from pyspark.sql.functions import count, col

In [3]:
dataset = spark.read.parquet("./preston-amazon/data-processed/core.parquet")
dataset = dataset.fillna('')
N = dataset.count()
print("Number of rows: " + str(N))

Number of rows: 3858


In [4]:
df = pd.DataFrame( columns = ['sc_name', 'count'] )


In [5]:
dataset_sn = dataset.groupBy("`http://rs.tdwg.org/dwc/terms/scientificName`").count().select(col("`http://rs.tdwg.org/dwc/terms/scientificName`").alias('sc_name'), col('count').alias('n'))

In [6]:
dataset_sn.count()
# Number of different scientific names

708

This means that for our collection of 3,858 records, the complete scientific name analysis, comparing each unordered pair of distinct names once, would imply (708 x 708 - 708) / 2 = 250,278 comparisons.
We should think about questions like:
- Is it meaningful to conclude that a scientific name is 65% similar to another one?
- Is it meaningful to conclude that a date is 70% similar to another one?
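
The number of pairwise comparisons among n distinct values is n(n-1)/2, which grows quadratically. A minimal check of that count for the 708 distinct scientific names, plus two other sizes chosen only for illustration (`pair_count` is a hypothetical helper name):

```python
from math import comb

def pair_count(n):
    """Number of unordered pairs among n distinct values: n * (n - 1) / 2."""
    return comb(n, 2)

# 708 is the number of distinct scientific names in this collection;
# the other sizes only illustrate the quadratic growth.
for n in (708, 3858, 100_000):
    print(n, pair_count(n))
```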

In [7]:
dataset_sn.orderBy('n', ascending=False).show(10, False)

+-------------------------------------------+---+
|sc_name                                    |n  |
+-------------------------------------------+---+
|Triportheus angulatus (Spix & Agassiz 1829)|51 |
|Pygocentrus nattereri Kner 1858            |49 |
|Serrasalmus rhombeus (Linnaeus 1766)       |49 |
|Pimelodus blochii Valenciennes 1840        |48 |
|Sorubim lima (Bloch & Schneider 1801)      |45 |
|Moenkhausia dichroura (Kner 1858)          |43 |
|Ageneiosus inermis (Linnaeus 1766)         |43 |
|Roeboides affinis (Günther 1868)           |42 |
|Schizodon fasciatus Spix & Agassiz 1829    |41 |
|Acestrorhynchus gr. lacustris (Lütken 1875)|40 |
+-------------------------------------------+---+
only showing top 10 rows



In [8]:
sn_list = dataset_sn.select('sc_name').rdd.flatMap(lambda x: x).collect()

In [9]:
print(sn_list[0])

Arocera (Euopta) placens Walker 1867


In [10]:
from pyxdameraulevenshtein import normalized_damerau_levenshtein_distance

In [28]:
%%time
from itertools import combinations

# Compare every unordered pair of distinct names exactly once
# and print the pairs whose similarity exceeds 0.9.
#subset = sn_list[0:100]
subset = sn_list
for a, b in combinations(subset, 2):
    sim = 1 - normalized_damerau_levenshtein_distance(a, b)
    if sim > 0.9:
        print(a + " vs. " + b + "\t" + str(sim))

Rio indistinctus Fortes & Grazia 2000 vs. Rio distinctus Fortes & Grazia 2000	0.9459459446370602
Edessa rufodorsata Silva, Fernandes & Grazia 2006 vs. Edessa virididorsata Silva, Fernandes & Grazia 2006	0.9019607827067375
Gonatopus neotropicus Olmi 1984 vs. Gonatopus neotropicus Olmi 1986	0.9677419364452362
CPU times: user 14.9 s, sys: 0 ns, total: 14.9 s
Wall time: 14.8 s


For short strings, such as scientific names, the Damerau-Levenshtein similarity may be more useful than n-gram-based measures.
To get a better idea of how to implement it and how to present the results, I believe we should first discuss how these results are expected to be used.
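
To make the comparison between edit-based and n-gram-based similarity concrete, here is a small pure-Python sketch. It is not the `pyxdameraulevenshtein` implementation: `osa_distance` is a stand-in that computes the optimal string alignment variant of Damerau-Levenshtein, `edit_similarity` assumes normalization by the longer string's length (which is consistent with the values printed above), and `trigram_jaccard` is just one of several possible n-gram measures. All three names are hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein):
    insertions, deletions, substitutions and adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def edit_similarity(a, b):
    """1 - distance / length of the longer string (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - osa_distance(a, b) / longest

def trigram_jaccard(a, b, n=3):
    """Jaccard similarity between the sets of character n-grams."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0

a = "Gonatopus neotropicus Olmi 1984"
b = "Gonatopus neotropicus Olmi 1986"
print(edit_similarity(a, b))
print(trigram_jaccard(a, b))
```

A single-character substitution adds one unit of edit distance but can change up to n of the n-grams, so on short strings the n-gram score tends to drop faster for the same typo.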