# Building a tf-idf matrix for species collected from the GBIF database

A) tf-idf

- load and clean gbif dataset
- create a Tf matrix
- create a tf-idf matrix

B) find the documents most similar to a search query
- look for the most appropriate distance measure


### Term Frequency (tf): 
The term frequency measures how often a word occurs in each document of the corpus. It is the ratio of the number of times the word appears in a document to the total number of words in that document, so it increases with the number of occurrences of the word within the document. Each document has its own tf values.

- tf_i = number of occurrences of term i in the document / total number of terms in the document
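
The tf ratio can be checked by hand on a small document. The sketch below is only an illustration in plain Python: `term_frequencies` is a hypothetical helper, not a function of `search_gbif`, and the toy document simply mimics the shape of the cleaned GBIF records.

```python
from collections import Counter

def term_frequencies(terms):
    """Return {term: count / total number of terms} for one document."""
    counts = Counter(terms)
    total = len(terms)
    return {term: count / total for term, count in counts.items()}

# Toy document: a list of terms, as in the cleaned records.
doc = ["animalia", "animalia", "animalia", "kingdom", "accepted"]
tf = term_frequencies(doc)
print(tf)
```

Here "animalia" accounts for 3 of the 5 terms, so its tf is 3/5, while "kingdom" and "accepted" each get 1/5.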


### Inverse Document Frequency (idf): 
The inverse document frequency weights words by how rare they are across all documents in the corpus. Words that occur in few documents have a high idf score.


- idf_i = log(number of documents / number of documents in which term i appears)

- Combining tf and idf gives the tf-idf score (w) of a word in a document of the corpus:

tf-idf_i = tf_i * idf_i
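
To make the combination concrete, here is a minimal sketch that applies the two definitions to a three-document toy corpus. The function name `tf_idf` and the corpus are assumptions made for illustration; the notebook's own implementation lives in `search_gbif` and is not shown.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(corpus)
    # df[term] = number of documents that contain the term at least once
    df = Counter(term for doc in corpus for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({t: (c / len(doc)) * idf[t] for t, c in counts.items()})
    return weights

corpus = [
    ["animalia", "animalia", "animalia", "kingdom", "accepted"],
    ["archaea", "archaea", "archaea", "kingdom", "accepted"],
    ["bacteria", "bacteria", "bacteria", "kingdom", "accepted"],
]
weights = tf_idf(corpus)
print(weights[0])
```

With this unsmoothed formula, "kingdom" and "accepted" appear in every document, so their idf is log(3/3) = 0 and they carry no weight, whereas "animalia" keeps a weight of 0.6 * log(3).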

### TF-IDF (term frequency-inverse document frequency) 
- A weighting method often used in information retrieval, and in particular in text mining.
It evaluates how important a term is to a document, relative to a collection or corpus. The weight increases proportionally to the number of occurrences of the word in the document.  
It is also offset by the frequency of the word in the corpus: the more documents contain the word, the lower its weight. Variants of the original formula are often used in search engines to assess the relevance of a document to the user's search criteria.
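
As an example of such a variant, scikit-learn's `TfidfVectorizer` documents a smoothed idf, log((1 + N) / (1 + df)) + 1, as its default. The notebook does not state which variant `search_gbif` uses, so the comparison below is only a sketch; the function names and the document counts are illustrative assumptions.

```python
import math

def idf_plain(n_docs, df):
    """Textbook idf: log(N / df)."""
    return math.log(n_docs / df)

def idf_smoothed(n_docs, df):
    """Smoothed variant: log((1 + N) / (1 + df)) + 1.

    Terms present in every document keep a non-zero weight.
    """
    return math.log((1 + n_docs) / (1 + df)) + 1

n_docs = 27600  # corpus size; the df values below are made up
for df in (27600, 300):
    print(df, idf_plain(n_docs, df), idf_smoothed(n_docs, df))
```

A term found in every document gets a plain idf of 0 but a smoothed idf of 1, so it is down-weighted rather than discarded.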

In [5]:

import sys
sys.path.append('../lib')


from search_gbif import load_clean_and_generate_tf_idf, transform_query, read_clean_dataset, load_tfidf
from scipy.spatial import distance


In [2]:
fname = "../data/tfidf/data_gbif.json"

### A) tf-idf

- load and clean gbif dataset
- create a Tf matrix
- create a tf-idf matrix
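
The cleaning step can be pictured with the sketch below. It is a guess at the behaviour, not the code of `search_gbif`: `record_to_terms` and the field order in `TEXT_FIELDS` are hypothetical, and the sketch ignores the values of `higherClassificationMap`, which the real function may well include.

```python
# Hypothetical field order; the order used by search_gbif is not shown.
TEXT_FIELDS = ("canonicalName", "kingdom", "scientificName", "rank", "status")

def record_to_terms(record, fields=TEXT_FIELDS):
    """Keep the selected text fields in lowercase and drop everything else
    (numeric keys such as 'key' and 'kingdomKey', booleans, empty maps)."""
    terms = [record[f].lower() for f in fields if isinstance(record.get(f), str)]
    return {"terms": terms}

# First record of the dataset, as printed by the notebook.
record = {
    "canonicalName": "incertae sedis",
    "higherClassificationMap": {},
    "key": 0,
    "kingdom": "incertae sedis",
    "kingdomKey": 0,
    "rank": "KINGDOM",
    "scientificName": "incertae sedis",
    "status": "DOUBTFUL",
    "synonym": False,
}
print(record_to_terms(record))
```

Each multi-word value such as "incertae sedis" stays a single term, which matches the term lists printed by the notebook.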


In [3]:
load_clean_and_generate_tf_idf(fname)

 load dataset, and clean
- We suppress key fields, and convert to minus
First doc, before
{ 'canonicalName': 'incertae sedis',
  'higherClassificationMap': {},
  'key': 0,
  'kingdom': 'incertae sedis',
  'kingdomKey': 0,
  'rank': 'KINGDOM',
  'scientificName': 'incertae sedis',
  'status': 'DOUBTFUL',
  'synonym': False}
then
{'terms': ['incertae sedis', 'incertae sedis', 'incertae sedis', 'kingdom', 'doubtful']}
- Verification:


Unnamed: 0,terms,d
0,"[incertae sedis, incertae sedis, incertae sedi...",0
1,"[animalia, animalia, animalia, kingdom, accepted]",1
2,"[archaea, archaea, archaea, kingdom, accepted]",2
3,"[bacteria, bacteria, bacteria, kingdom, accepted]",3
4,"[chromista, chromista, chromista, kingdom, acc...",4


Calcul de la frequence des mots
Words list created, size: 388
- mots les plus fréquents:
                  0      1
0          accepted  26853
1             class  19400
2          animalia  11993
3           plantae   6146
4          bacteria   4552
..              ...    ...
95          cestoda    322
96  verrucomicrobia    320
97      cycliophora    311
98      eurotatoria    308
99         heliozoa    306

[100 rows x 2 columns]
corpus len: 27600
- create tf matrix
-Vectorized matrix,  (27600, 390)
 first line:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 


### B) Find the documents most similar to a query
- load the tf-idf model and the gbif dataset

In [3]:
# load models
data_names, X, vectorizer_model = load_tfidf()
    
print(" load gbif dataset, and clean")
fname = "../data/tfidf/data_gbif.json"
dataset = read_clean_dataset(fname)
    
print('- Verification:')
print('head:')
display(dataset['terms'].head())
print('tail:')
display(dataset['terms'].tail())




-Example of transform names, with query
-Example of transform names, with query ['zygnematophyceae zygomycota']
 result vector:
(1, 390)
[[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.        

0    [incertae sedis, incertae sedis, incertae sedi...
1    [animalia, animalia, animalia, kingdom, accepted]
2       [archaea, archaea, archaea, kingdom, accepted]
3    [bacteria, bacteria, bacteria, kingdom, accepted]
4    [chromista, chromista, chromista, kingdom, acc...
Name: terms, dtype: object

tail:


27595    [animalia, platyhelminthes, amphilinidea, cest...
27596    [animalia, rotifera, collothecacea, eurotatori...
27597    [chromista, ochrophyta, stictocyclales, bacill...
27598    [chromista, ochrophyta, paraliales, bacillario...
27599    [chromista, ochrophyta, thalassiosirales, baci...
Name: terms, dtype: object

### Compute different distances for a single query 

In [6]:
query = ['zygnematophyceae zygomycota']
print('Query:', query[0])
x0 = transform_query(vectorizer_model, query)

# Densify the tf-idf matrix once instead of once per metric.
X_dense = X.toarray()

# Metric names passed to scipy.spatial.distance.cdist. Some of them
# (e.g. 'kulsinski') may be deprecated or missing in recent SciPy releases.
kind = ['braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice', 'euclidean', 'hamming', 'jaccard',
        'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto',
        'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']
for metric in kind:
    d = distance.cdist(x0, X_dense, metric)
    print('metric', metric)
    print(d)


Query: zygnematophyceae zygomycota
metric braycurtis
[[1. 1. 1. ... 1. 1. 1.]]
metric canberra
[[6. 5. 5. ... 8. 8. 8.]]
metric chebyshev
[[0.71861851 0.77977446 0.86693916 ... 0.90983784 0.9142572  0.91997606]]
metric cityblock
[[3.15855387 2.91872833 2.84320793 ... 3.05431457 3.04143086 3.02398492]]
metric correlation
[[1.00636631 1.0054856  1.00520879 ... 1.00598318 1.00593587 1.0058718 ]]
metric cosine
[[1. 1. 1. ... 1. 1. 1.]]
metric dice
[[1. 1. 1. ... 1. 1. 1.]]
metric euclidean
[[1.41421356 1.41421356 1.41421356 ... 1.41421356 1.41421356 1.41421356]]
metric hamming
[[0.01538462 0.01282051 0.01282051 ... 0.02051282 0.02051282 0.02051282]]
metric jaccard
[[1. 1. 1. ... 1. 1. 1.]]
metric kulsinski
[[1. 1. 1. ... 1. 1. 1.]]
metric mahalanobis
[[              nan 60119386.88941303 19749607.17384794 ...
                nan               nan               nan]]
metric matching
[[0.01538462 0.01282051 0.01282051 ... 0.02051282 0.02051282 0.02051282]]
metric minkowski
[[1.41421356 1.414

- the "chebyshev" and "seuclidean" distances appear to be the most interesting
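
In the output above, the cosine distance is 1 and the Euclidean distance is 1.41421356 (that is, √2) for every displayed document. This is consistent with unit-length tf-idf vectors that share no term with the query: their dot product is 0, so the cosine distance is 1 and the squared Euclidean distance is 1 + 1 = 2. The sketch below reproduces the effect on made-up vectors; the helper names and the 5-term vocabulary are assumptions, and the notebook does not show whether its vectors are actually L2-normalized.

```python
import math

def l2_normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Made-up 5-term vocabulary: the query uses terms 0-1, the documents terms 2-4.
query = l2_normalize([1.0, 1.0, 0.0, 0.0, 0.0])
docs = [l2_normalize([0.0, 0.0, 3.0, 1.0, 1.0]),
        l2_normalize([0.0, 0.0, 0.0, 1.0, 1.0])]

for doc in docs:
    print(cosine_distance(query, doc), euclidean_distance(query, doc))
```

Both documents end up at the same cosine and Euclidean distance from the query even though their contents differ, so these two metrics cannot rank documents that have no term in common with the query.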


### Definition
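- The Chebyshev distance between two n-vectors u and v is the largest absolute difference between their corresponding elements: max_i |u_i - v_i|.
- seuclidean computes the standardized Euclidean distance: the Euclidean distance in which each squared difference (u_i - v_i)^2 is divided by the variance V_i of component i.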
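
Both definitions fit in a few lines of plain Python. The sketch below is only meant to make the formulas explicit; the notebook itself relies on `scipy.spatial.distance.cdist`, and the vectors and variances used here are made up.

```python
import math

def chebyshev(u, v):
    """Largest absolute coordinate-wise difference: max_i |u_i - v_i|."""
    return max(abs(a - b) for a, b in zip(u, v))

def seuclidean(u, v, variances):
    """Standardized Euclidean distance: sqrt(sum_i (u_i - v_i)^2 / V_i)."""
    return math.sqrt(sum((a - b) ** 2 / var
                         for a, b, var in zip(u, v, variances)))

u = [0.0, 3.0, 1.0]
v = [1.0, 1.0, 1.0]
variances = [0.25, 4.0, 1.0]  # illustrative per-coordinate variances

print(chebyshev(u, v))              # largest gap is |3 - 1| = 2
print(seuclidean(u, v, variances))  # sqrt(1/0.25 + 4/4 + 0/1) = sqrt(5)
```

The Chebyshev distance only looks at the single coordinate where the two vectors differ most, while the standardized Euclidean distance gives more influence to coordinates with a small variance.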


#### Chebyshev distance versus standardized Euclidean distance

In [13]:
metric_lst = ['chebyshev', 'seuclidean']
test_lst = ['anthocerotophyta',
            'archaea kingdom accepted',
            'chromista ochrophyta thalassiosirales']
X_dense = X.toarray()  # densify once; cdist works on dense arrays
for i, test in enumerate(test_lst):
    x0 = transform_query(vectorizer_model, [test])
    for metric in metric_lst:
        d = distance.cdist(x0, X_dense, metric)[0]
        print('query:', i+1, '"', test, '"', ', metric:', metric)
        # indices of the documents, sorted from closest to farthest
        index_lst = sorted(range(len(d)), key=lambda k: d[k])
        dataset['d'] = d
        print(dataset['terms'][index_lst[0:10]])
        print()

query: 1 " anthocerotophyta " , metric: chebyshev
12     [plantae, anthocerotophyta, plantae, anthocero...
111    [plantae, anthocerotophyta, plantae, anthocero...
210    [plantae, anthocerotophyta, plantae, anthocero...
309    [plantae, anthocerotophyta, plantae, anthocero...
408    [plantae, anthocerotophyta, plantae, anthocero...
507    [plantae, anthocerotophyta, plantae, anthocero...
606    [plantae, anthocerotophyta, plantae, anthocero...
705    [plantae, anthocerotophyta, plantae, anthocero...
804    [plantae, anthocerotophyta, plantae, anthocero...
903    [plantae, anthocerotophyta, plantae, anthocero...
Name: terms, dtype: object

query: 1 " anthocerotophyta " , metric: seuclidean
12     [plantae, anthocerotophyta, plantae, anthocero...
111    [plantae, anthocerotophyta, plantae, anthocero...
210    [plantae, anthocerotophyta, plantae, anthocero...
309    [plantae, anthocerotophyta, plantae, anthocero...
408    [plantae, anthocerotophyta, plantae, anthocero...
507    [plantae,

- Conclusion: the Chebyshev distance is more relevant for query 2, but slightly worse for query 3.
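
The ranking step used in this comparison (keep the 10 closest documents) does not need a full sort of all 27600 distances. The sketch below shows one possible alternative with the standard-library `heapq` module; `top_k` is a hypothetical helper and the distances are made up.

```python
import heapq

def top_k(distances, k=10):
    """Return the indices of the k smallest distances, closest first."""
    return heapq.nsmallest(k, range(len(distances)),
                           key=distances.__getitem__)

distances = [0.9, 0.1, 0.5, 0.1, 0.7]
print(top_k(distances, k=3))
```

Documents at exactly the same distance keep their original order, which matters here because many GBIF records share identical term lists and therefore tie.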
        