![](img/NER.png)

**NER** (named-entity recognition) is the problem of identifying and classifying proper names in text, including locations, such as *China*; people, such as *George Bush*; and organizations, such as the *United Nations*. 

**Problem**: the NER task is, given a sentence, first to segment which words are part of entities, and then to classify each entity by type (person, organization, location, and so on). 

**Challenge**: many named entities are too rare to appear even in a large training set, so the system must identify them from context alone. Classifying each word independently also fails: *New York* is a location, but *New York Times* is an organization.

**Solution**: linear chain Conditional Random Field (CRF) sequence models. 


More details here:

- [StanfordNER](https://nlp.stanford.edu/software/CRF-NER.shtml).


## Use case

Given a text document, we want to find the entities that appear in it (people and organizations) and identify them in our database of names. For example, if any of the following appears in the document:

- Lucia O'Keeffe,
- O'Keeffe, Lucia,
- Lucia Keeffe,
- O'Keeffe, okeeffe, Keefe, O'Keffe, ..., 

the algorithm should identify a PERSON named *Lucia O'Keeffe* who belongs to *GP Bullhound*. An exact comparison between the entities found by NER and the database cannot do this, so we compare them with a [Fuzzy Name Matching](https://medium.com/bcggamma/an-ensemble-approach-to-large-scale-fuzzy-name-matching-b3e3fa124e3c) algorithm.
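To see why an exact comparison is not enough, the sketch below compares normalized spellings of the name from the list above. It uses Python's standard `difflib.SequenceMatcher` purely as a stand-in for a fuzzy matcher (the notebook itself relies on MinHash LSH over character trigrams further down); the `normalize` and `similarity` helpers are hypothetical names introduced for illustration.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    # Lowercase and keep only ASCII letters and spaces (hypothetical helper).
    return re.sub(r'[^a-z ]', '', name.lower()).strip()

def similarity(a, b):
    # Ratio in [0, 1]; 1.0 means the normalized strings are identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

reference = "Lucia O'Keeffe"
for variant in ["Lucia O'Keeffe", "Lucia Keeffe", "O'Keffe", "George Bush"]:
    exact = normalize(variant) == normalize(reference)
    print(f"{variant!r}: exact={exact}, similarity={similarity(variant, reference):.2f}")
```

Only the first variant matches exactly, while the misspelled ones still score far higher than an unrelated name, which is the property a fuzzy matcher exploits.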

![](img/mmds.png)

In [30]:
import pandas as pd
import numpy as np
import time
from datetime import timezone

import requests
from requests.adapters import HTTPAdapter
from requests import Session

import re
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
from nltk import ngrams
import unicodedata

import json
import urllib3

# !pip install python-arango
import arango
from arango import ArangoClient
from arango.response import Response
from arango.http import HTTPClient

In [31]:
from sklearn import preprocessing

In [32]:
import pyspark.sql.functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF, MinHashLSH, CountVectorizer
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType, LongType
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

In [33]:
pd.set_option('display.max_colwidth', None)  # None = no truncation (-1 is rejected by recent pandas)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
import warnings
warnings.filterwarnings("ignore")

# Helper functions

In [34]:
from functools import wraps
from time import time

def timing(f):
    @wraps(f)
    def wrapper(*args, **kwargs):
        start = time()
        result = f(*args, **kwargs)
        end = time()
        print('Elapsed time: {}'.format(end-start))
        return result
    return wrapper

In [35]:
import logging

from requests.adapters import HTTPAdapter
from requests import Session

from arango.response import Response
from arango.http import HTTPClient


class CustomHTTPClient(HTTPClient):
    """My custom HTTP client with cool features."""

    def __init__(self):
        self._session = Session()
        # Initialize your logger.
        self._logger = logging.getLogger('my_logger')

    def create_session(self, host):
        session = Session()

        # Add request header.
        session.headers.update({'x-my-header': 'true'})

        # Enable retries on the session that is actually returned.
        adapter = HTTPAdapter(max_retries=5)
        session.mount('https://', adapter)

        return session

    def send_request(self,
                     session,
                     method,
                     url,
                     params=None,
                     data=None,
                     headers=None,
                     auth=None):
        # Add your own debug statement.
        self._logger.debug('Sending request to {}'.format(url))

        # Send a request.
        response = session.request(
            method=method,
            url=url,
            params=params,
            data=data,
            headers=headers,
            auth=auth,
            verify=False  # Disable SSL verification
        )
        self._logger.debug('Got {}'.format(response.status_code))

        # Return an instance of arango.response.Response.
        return Response(
            method=response.request.method,
            url=response.url,
            headers=response.headers,
            status_code=response.status_code,
            status_text=response.reason,
            raw_body=response.text,
        )

In [36]:
@timing
def execute(query):
    cursor = aql.execute(query)
    item_keys = [doc for doc in cursor]
    return item_keys

In [37]:
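# Email extraction

def textCleansing(text):
    # Replace every run of characters that cannot appear in an email address with a space.
    text = re.sub(r'[^A-Za-z0-9@ _.+-]+', ' ', str(text))
    # Collapse repeated spaces into a single one.
    return re.sub(r' {2,}', ' ', text)

def emailExtraction(text):
    # Raw string so that `\.` reaches the regex engine as an escaped dot.
    pattern = r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)"
    return re.findall(pattern, text)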
    
# text2 = textCleansing(text)
# emails = emailExtraction(textCleansing(text2))
# emails = list(dict.fromkeys(emails))
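
The commented lines above describe a clean, extract, deduplicate flow. A minimal self-contained sketch of that flow on a made-up string follows; `clean_text` and `extract_emails` are hypothetical stand-ins that mirror the two helpers, and the addresses are invented.

```python
import re

def clean_text(text):
    # Replace every run of characters that cannot be part of an email address with a space.
    return re.sub(r'[^A-Za-z0-9@ _.+-]+', ' ', text)

def extract_emails(text):
    # Same pattern as the notebook helper, written as a raw string.
    pattern = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
    return re.findall(pattern, text)

sample = "From: <jane.doe@example.com>; cc: sales@example.org, jane.doe@example.com"
emails = extract_emails(clean_text(sample))
unique_emails = list(dict.fromkeys(emails))  # deduplicate, keeping first-seen order
print(unique_emails)
```

`dict.fromkeys` is used instead of `set` so that the addresses stay in the order in which they appear in the text.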

In [38]:
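def textNormalize(text):
    text = str(text).lower()
    # Decompose accented characters first (NFD), so that 'é' becomes 'e' plus a
    # combining mark; the regex then drops the mark and keeps the base letter.
    text = unicodedata.normalize('NFD', text)
    text = re.sub('[^a-zA-Z ]', '', text)
    return text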
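
The order of the regex filter and the NFD step matters for accented names. A minimal sketch with a made-up name, independent of the notebook's data:

```python
import re
import unicodedata

name = "José Ñandú"

# Filtering to ASCII letters directly drops the accented characters entirely.
stripped = re.sub(r'[^a-z ]', '', name.lower())

# Decomposing first (NFD) splits 'é' into 'e' plus a combining accent,
# so the same regex removes only the accent and keeps the base letter.
decomposed = unicodedata.normalize('NFD', name.lower())
folded = re.sub(r'[^a-z ]', '', decomposed)

print(stripped)  # jos and
print(folded)    # jose nandu
```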

In [39]:
def extractUrlMimecast(x):
    x = str(x).replace('orgsMimecast/','')
    url_body = x.split('.')
    url_body = url_body[0:len(url_body)-1]
    return ''.join(url_body)
    
print(extractUrlMimecast('orgsMimecast/s-gltd.com'))
print(extractUrlMimecast('orgsMimecast/orgsMimecast/comms.flightglobal.com'))

s-gltd
commsflightglobal


# Database connection. Download the collections and save them as CSV

In [42]:
!ls output/

pd_documents.csv	     pd_orgPipedrive.csv  pd_peopleExtraction.csv
pd_documents_ner.csv	     pd_orgsMaster.csv	  pd_peopleMaster.csv
pd_master_documents_ner.csv  pd_orgsMimecast.csv


### peopleMaster: collection with people's names

In [43]:
#col = db.collection('peopleMaster')
#pd_peopleMaster = pd.DataFrame(list(col))
#pd_peopleMaster.to_csv('output/pd_peopleMaster.csv', index=False)
pd_peopleMaster = pd.read_csv('output/pd_peopleMaster.csv')

### orgsMaster: collection with company names

In [44]:
#col = db.collection('orgsMaster')
#pd_orgsMaster = pd.DataFrame(list(col))
#pd_orgsMaster.to_csv('output/pd_orgsMaster.csv', index=False)
pd_orgsMaster = pd.read_csv('output/pd_orgsMaster.csv')

### peopleExtraction (no longer needed) -> info in peopleMaster

In [45]:
#col = db.collection('peopleExtraction')
#pd_peopleExtraction = pd.DataFrame(list(col))
#pd_peopleExtraction.to_csv('output/pd_peopleExtraction.csv', index=False)
pd_peopleExtraction = pd.read_csv('output/pd_peopleExtraction.csv')

### orgPipedrive (no longer needed) -> info in orgsMaster

In [64]:
#col = db.collection('orgPipedrive')
#pd_orgPipedrive = pd.DataFrame(list(col))
#pd_orgPipedrive.to_csv('output/pd_orgPipedrive.csv', index=False)
pd_orgPipedrive = pd.read_csv('output/pd_orgPipedrive.csv')

### orgsMimecast (no longer needed) -> info in orgsMaster

In [65]:
#col = db.collection('orgsMimecast')
#pd_orgsMimecast = pd.DataFrame(list(col))
#pd_orgsMimecast.to_csv('output/pd_orgsMimecast.csv', index=False)
pd_orgMimecast = pd.read_csv('output/pd_orgsMimecast.csv')

### master_documents_ner: collection with relations p1 $\to$ p2 found in d. This collection is the output of Yessica's Sparta workflow with NER in Java.

In [48]:
#col = db.collection('master_documents_ner')
#pd_master_documents_ner = pd.DataFrame(list(col))
#pd_master_documents_ner.to_csv('output/pd_master_documents_ner.csv', index=False)
pd_master_documents_ner = pd.read_csv('output/pd_master_documents_ner.csv')

### documents_ner: collection with the entities found by NER in Java.

In [49]:
#col = db.collection('documents_NER')
#pd_documents_ner = pd.DataFrame(list(col))
#pd_documents_ner.to_csv('output/pd_documents_ner.csv', index=False)
pd_documents_ner = pd.read_csv('output/pd_documents_ner.csv')

### documents: collection with the documents to analyze

In [51]:
#collection = 'documents'
#query=r'''FOR d IN '''+collection+'''
#  LIMIT 1000
#  RETURN { key: d._key, text: d.text, uploader: d.uploaded_by, created: d.created_at, updated: d.updated_at }
#'''
#print(query)
#col = execute(query)
#pd_documents = pd.DataFrame(list(col))
#pd_documents.to_csv('output/pd_documents.csv', index=False)
pd_documents = pd.read_csv('output/pd_documents.csv')

In [None]:
pd_documents.head(3)

# Preparing the DataFrames with the data

### People

In [58]:
# keep the database records whose type is 'name'
# (.copy() avoids pandas' SettingWithCopyWarning when the new column is added)
pd_people = pd_peopleExtraction[pd_peopleExtraction.type == 'name'].copy()
pd_people['name_nfd'] = pd_people.value.apply(textNormalize)
pd_people = pd_people[['key_peopleMaster','id_originally', 'name_nfd']]
pd_people.head(3)

Unnamed: 0,key_peopleMaster,id_originally,name_nfd
0,1571812576-132380815,peopleMimecast/ultano.kindelan@fyber.com,ultano kindelan
5,1571819599-55121593,peopleMimecast/info@cognitionx.io,charlie muirhead
8,1572879274-100347265,peopleMimecast/movie@youthsquare.hk,youth square


In [54]:
# Keep only the people whose name has more than one word (first name + surname)

pd_people_2 = pd_people[['key_peopleMaster', 'name_nfd']].copy()
pd_people_2['name_nfd'] = pd_people_2.name_nfd.str.strip()
pd_people_2 = pd_people_2.drop_duplicates()
pd_people_2 = pd_people_2[pd_people_2.name_nfd.str.contains(' ')]
print(len(pd_people))
print(len(pd_people_2))
pd_people_2.head(3)

305879
197417


Unnamed: 0,key_peopleMaster,name_nfd
0,1571812576-132380815,ultano kindelan
5,1571819599-55121593,charlie muirhead
8,1572879274-100347265,youth square


In [55]:
# Work in Spark

schema = StructType([StructField("key_peopleMaster", StringType(), True),
                     StructField("name_nfd", StringType(), True)])
df_people = spark.createDataFrame(pd_people_2, schema=schema)
df_people.show(2, False)

+--------------------+----------------+
|key_peopleMaster    |name_nfd        |
+--------------------+----------------+
|1571812576-132380815|ultano kindelan |
|1571819599-55121593 |charlie muirhead|
+--------------------+----------------+
only showing top 2 rows



### Organizations

In [66]:
pd_orgPipedrive['name_nfd'] = pd_orgPipedrive.name.apply(lambda x: textNormalize(x))
pd_orgPipedrive = pd_orgPipedrive[['_id','name','name_nfd']]
pd_orgPipedrive.columns = ['id_originally', 'name', 'name_nfd']
print(len(pd_orgPipedrive))
# pd_orgPipedrive.head(3)

pd_orgMimecast['name'] = pd_orgMimecast._id.apply(lambda x: extractUrlMimecast(x))
pd_orgMimecast['name_nfd'] = pd_orgMimecast.name.apply(lambda x: textNormalize(x))
pd_orgMimecast = pd_orgMimecast[['_id','name','name_nfd']]
pd_orgMimecast.columns = ['id_originally', 'name', 'name_nfd']
print(len(pd_orgMimecast))
# pd_orgMimecast.head(3)

pd_orgs = pd.concat([pd_orgPipedrive, pd_orgMimecast])
print(len(pd_orgs))
pd_orgs.head(3)

37459
32858
70317


Unnamed: 0,id_originally,name,name_nfd
0,orgPipedrive/39497,RedMere Technology,redmere technology
1,orgPipedrive/65382,Gestaweb 2020,gestaweb
2,orgPipedrive/69002,Remy Valette,remy valette


In [67]:
# Work in Spark

schema = StructType([StructField("id_originally", StringType(), True),\
                     StructField("name", StringType(), True),\
                     StructField("name_nfd", StringType(), True)])
df_orgs = spark.createDataFrame(pd_orgs, schema=schema)
df_orgs.show(2, False)

+------------------+------------------+------------------+
|id_originally     |name              |name_nfd          |
+------------------+------------------+------------------+
|orgPipedrive/39497|RedMere Technology|redmere technology|
|orgPipedrive/65382|Gestaweb 2020     |gestaweb          |
+------------------+------------------+------------------+
only showing top 2 rows



# Entity search

In [68]:
def find_entities2(entities):
    """
    Gets dataframe output of StanfordNER tagger
    and finds entities by looking at consecutive indices.
    
    input : entities - DataFrame with tagged words StanfordNER tagger (output)
    output : list - entities (PERSON, ORGANIZATION)
    """

    x = entities.index.values
    L = []

    i = 0
    while i < len(x):
        l = [x[i]]
        j = i + 1
        while (j < len(x)) and (x[j]-x[j-1]==1):
            l.append(x[j])
            j = j + 1
        L.append(l)
        i = j
        
    return [[' '.join(list(entities.loc[l, 'word'])), ' '.join(list(entities.loc[l, 'label'].unique()))] for l in L]
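
The grouping idea can be tried without running the tagger. The sketch below applies the same consecutive-index rule to a hand-made tagged sentence (illustrative data, not StanfordNER output), using a compact pandas `groupby` instead of the explicit loop.

```python
import pandas as pd

# Hand-made tagger output (illustrative only).
tagged = pd.DataFrame(
    [("The", "O"), ("New", "ORGANIZATION"), ("York", "ORGANIZATION"),
     ("Times", "ORGANIZATION"), ("interviewed", "O"),
     ("George", "PERSON"), ("Bush", "PERSON"), (".", "O")],
    columns=["word", "label"],
)

# Keep tagged tokens only; the original row index records each token's position.
entities = tagged[tagged.label.isin(["PERSON", "ORGANIZATION"])]

# A new group starts whenever the position jumps by more than one.
group_id = (pd.Series(entities.index, index=entities.index).diff() != 1).cumsum()
grouped = [
    [" ".join(g.word), " ".join(g.label.unique())]
    for _, g in entities.groupby(group_id)
]
print(grouped)  # [['New York Times', 'ORGANIZATION'], ['George Bush', 'PERSON']]
```

As in `find_entities2`, adjacent tokens with different labels would land in the same group, which is why the labels of a group are joined rather than assumed to be unique.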

In [69]:
#folder1 = '/home/eduardofernandez/Archivo/_INTELLIJ/2019_07_gpbullhound/eduardoner/src/main/resources/classifiers/'
folder1 = '/home/acalle/Documentos/GP-bullhound/stanford-ner-2018-10-16/classifiers/'
st = StanfordNERTagger(folder1+'english.all.3class.distsim.crf.ser.gz')


***********************************

### Function: findEntityInText(dict_text, df_orgs, df_people)

***********************************

In [70]:
def findEntityInText(dict_text, df_orgs, df_people):
    """
    Receives a document and finds the entities (organizations and people) in its text.
    Returns the fuzzy matches of those entities against the organization and people tables.
    
    :param dict_text: dict with keys {'key', 'text', 'uploader', 'created', 'updated'}
    :param df_orgs: Spark DataFrame with organization data (orgPipedrive, orgsMimecast)
    :param df_people: Spark DataFrame with people data (peopleExtraction)
    :return: (pd_results1, pd_results2), pandas DataFrames with the organization and
             people matches, or [key, 'No text in document'] when there is no text
    """

    key = dict_text['key']
    text = dict_text['text']
    uploader = dict_text['uploader']
    created = dict_text['created']    
    updated = dict_text['updated']        
    
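    # Text read back from CSV can be NaN (a float) as well as an empty string.
    if not isinstance(text, str) or len(text.strip()) == 0:
        return [key, 'No text in document']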
    
    #filter names with more than three characters:
    df_orgs_2 = df_orgs.filter(F.length(F.col('name_nfd'))>3)
    df_people_2 = df_people.filter(F.length(F.col('name_nfd'))>3)
    
    #NER entities
    labelled = st.tag(text.split())
    labelled = pd.DataFrame(labelled, columns=['word', 'label'])
    # keep only the tokens tagged as PERSON or ORGANIZATION
    entities = labelled[labelled.label.isin(['PERSON', 'ORGANIZATION'])]
    entities = find_entities2(entities)
    # naming the columns here also works when no entity was found
    entities = pd.DataFrame(entities, columns=['entity', 'type']).drop_duplicates().reset_index(drop=True)

    #spark df entities
    schema = StructType([StructField("entity", StringType(), True),\
                         StructField("type", StringType(), True)])
    df_entities = spark.createDataFrame(entities, schema=schema)
    df_entities_2 = df_entities.withColumn('name_nfd', F.trim(F.lower(F.col('entity'))))\
                               .filter(F.length(F.col('name_nfd'))>3)
        
    # -----------------------
    #fuzzy orgs vs. entities
    # -----------------------   

    #hashing model
    model = Pipeline(stages=[
        RegexTokenizer(pattern="", inputCol="name_nfd", outputCol="tokens", minTokenLength=1),
        NGram(n=3, inputCol="tokens", outputCol="ngrams"),
        HashingTF(inputCol="ngrams", outputCol="vectors"),
        MinHashLSH(inputCol="vectors", outputCol="lsh", numHashTables=5)
    ])    
    model = model.fit(df_orgs_2)    
    df_orgs_hashed = model.transform(df_orgs_2)
    df_entities_hashed = model.transform(df_entities_2)
    
    #similarity
    threshold = 0.4
    results_names = model.stages[-1]\
                     .approxSimilarityJoin(df_orgs_hashed, df_entities_hashed, threshold, distCol="dist_jaccard")\
                     .select(
                            F.col("datasetA.id_originally"),
                            F.col("datasetA.name"),
                            F.col("datasetB.entity"),
                            F.col("dist_jaccard"))
    pd_results1 = results_names.toPandas()    
    
    # -------------------------
    #fuzzy people vs. entities
    # -------------------------  
    #hashing model
    model = Pipeline(stages=[
        RegexTokenizer(pattern="", inputCol="name_nfd", outputCol="tokens", minTokenLength=1),
        NGram(n=3, inputCol="tokens", outputCol="ngrams"),
        HashingTF(inputCol="ngrams", outputCol="vectors"),
        MinHashLSH(inputCol="vectors", outputCol="lsh", numHashTables=5)
    ])    
    model = model.fit(df_people_2)
    df_people_hashed = model.transform(df_people_2)
    df_entities_hashed = model.transform(df_entities_2)
    
    #similarity
    threshold = 0.4
    results_names = model.stages[-1]\
                     .approxSimilarityJoin(df_people_hashed, df_entities_hashed, threshold, distCol="dist_jaccard")\
                     .select(
                            F.col("datasetA.key_peopleMaster"),
                            F.col("datasetA.name_nfd"),
                            F.col("datasetB.entity"),
                            F.col("dist_jaccard"))
    pd_results2 = results_names.toPandas()
    
    return pd_results1, pd_results2



In [71]:
df_orgs.show(2)

+------------------+------------------+------------------+
|     id_originally|              name|          name_nfd|
+------------------+------------------+------------------+
|orgPipedrive/39497|RedMere Technology|redmere technology|
|orgPipedrive/65382|     Gestaweb 2020|         gestaweb |
+------------------+------------------+------------------+
only showing top 2 rows



In [72]:
df_people.show(2)

+--------------------+----------------+
|    key_peopleMaster|        name_nfd|
+--------------------+----------------+
|1571812576-132380815| ultano kindelan|
| 1571819599-55121593|charlie muirhead|
+--------------------+----------------+
only showing top 2 rows



In [85]:
dict_text = dict(pd_documents.iloc[10])

In [None]:
dict_text['text']

In [87]:
pd_results1, pd_results2 = findEntityInText(dict_text, df_orgs, df_people)

In [88]:
pd_results1

Unnamed: 0,id_originally,name,entity,dist_jaccard
0,orgsMimecast/avito.ru,avito,Avito,0.0
1,orgPipedrive/38181,21 Ventures,Kite Ventures,0.363636
2,orgPipedrive/49244,Ayala Corporation,SINA Corporation,0.388889
3,orgPipedrive/40076,Investment AB Kinnevik,Investment AB Kinnevik,0.0
4,orgsMimecast/e-tengelmann.de,e-tengelmann,Tengelmann,0.111111
5,orgPipedrive/58120,Unica Corporation,SINA Corporation,0.388889
6,orgPipedrive/72870,Rite Ventures,Kite Ventures,0.166667
7,orgPipedrive/51605,HE Ventures,Kite Ventures,0.333333
8,orgPipedrive/41662,Avito,Avito,0.0
9,orgPipedrive/39975,Granite Ventures,Kite Ventures,0.333333


In [89]:
pd_results2

Unnamed: 0,key_peopleMaster,name_nfd,entity,dist_jaccard
0,1571761769-93155709,goldman sachs,Goldman Sachs,0.0
1,1571761769-85222693,tengelmann eday,Tengelmann,0.384615
2,1571828697-97525763,rse ventures,Kite Ventures,0.384615
3,1571813286-27934682,andreessen horowitz,Andreessen Horowitz,0.0
4,1571943672-77086776,jme ventures,Kite Ventures,0.384615
5,1571828697-129553725,financial services,Financial Services Authority,0.384615


### What does the function do internally?

In [91]:
key = dict_text['key']
text = dict_text['text']
uploader = dict_text['uploader']
created = dict_text['created']
updated = dict_text['updated']        
    
# stop here if there is no text
if not isinstance(text, str) or len(text.strip()) == 0:
    print('No text in document')  # inside the function: return [key, 'No text in document']

In [94]:
# keep only the database names longer than three characters (NER gets a bit erratic otherwise)
df_orgs_2 = df_orgs.filter(F.length(F.col('name_nfd'))>3)
df_people_2 = df_people.filter(F.length(F.col('name_nfd'))>3)

# apply NER
labelled = st.tag(text.split())
labelled = pd.DataFrame(labelled, columns=['word', 'label'])

In [122]:
labelled

Unnamed: 0,word,label
0,1,O
1,Valuation,O
2,Analysis,O
3,November,O
4,2012,O
...,...,...
4688,FAX,O
4689,+49(0)30,O
4690,610,O
4691,80,O


In [109]:
labelled[labelled['label'].str.contains("PERSON|ORGANIZATION")]

Unnamed: 0,word,label
10,APPENDIX,ORGANIZATION
12,VALUATION,ORGANIZATION
13,SUMMARY,ORGANIZATION
165,DCF,ORGANIZATION
203,DCF,ORGANIZATION
...,...,...
4230,London,ORGANIZATION
4545,Financial,ORGANIZATION
4546,Services,ORGANIZATION
4547,Authority,ORGANIZATION


In [110]:
# NLTK's tagger labels individual tokens rather than whole entities, so find_entities2 merges consecutive tagged tokens
entities = labelled[labelled.label.isin(['PERSON', 'ORGANIZATION'])]
entities = find_entities2(entities)
entities = pd.DataFrame(entities, columns=['entity', 'type']).drop_duplicates().reset_index(drop=True)

In [111]:
entities

Unnamed: 0,entity,type
0,APPENDIX,ORGANIZATION
1,VALUATION SUMMARY,ORGANIZATION
2,DCF,ORGANIZATION
3,EBITDA,ORGANIZATION
4,PPS,ORGANIZATION
...,...,...
63,SoundCloud Ltd.,ORGANIZATION
64,Financial Services Authority,ORGANIZATION
65,United Kingdom Financial Services Authority,ORGANIZATION
66,"Jermyn Street, London",ORGANIZATION


In [114]:
# Spark df_entities_2 -> contains the entities found by NER
schema = StructType([StructField("entity", StringType(), True),\
                     StructField("type", StringType(), True)])
df_entities = spark.createDataFrame(entities, schema=schema)
df_entities_2 = df_entities.withColumn('name_nfd', F.trim(F.lower(F.col('entity'))))\
                           .filter(F.length(F.col('name_nfd'))>3)
df_entities_2.show()

+--------------------+------------+--------------------+
|              entity|        type|            name_nfd|
+--------------------+------------+--------------------+
|            APPENDIX|ORGANIZATION|            appendix|
|   VALUATION SUMMARY|ORGANIZATION|   valuation summary|
|              EBITDA|ORGANIZATION|              ebitda|
|               Avito|      PERSON|               avito|
|    Avito IRR Slando|ORGANIZATION|    avito irr slando|
|           LeBonCoin|ORGANIZATION|           leboncoin|
|           eBay Inc.|ORGANIZATION|           ebay inc.|
|   REA Group Limited|ORGANIZATION|   rea group limited|
|SouFun Holdings Ltd.|ORGANIZATION|soufun holdings ltd.|
|       Zillow , Inc.|ORGANIZATION|       zillow , inc.|
| Dice Holdings, Inc.|ORGANIZATION| dice holdings, inc.|
|JobStreet Corp. Bhd.|ORGANIZATION|jobstreet corp. bhd.|
|         Yandex N.V.|ORGANIZATION|         yandex n.v.|
|         Google Inc.|ORGANIZATION|         google inc.|
|      Facebook, Inc.|ORGANIZAT

**Mining of Massive Datasets.** Jure Leskovec, Anand Rajaraman, Jeff Ullman

**Minhashing** -> compresses large documents into short signatures that preserve the expected similarity of any pair of documents.

>> _The probability that the minhash function for a random permutation of rows in a document matrix produces the same value for two sets equals the Jaccard similarity of those sets. The Jaccard similarities of the underlying sets are estimated from the signature matrix produced by applying the hash functions._
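
The statement above can be checked numerically. The sketch below is a plain-Python illustration, not Spark's implementation: salted `hashlib.blake2b` hashes stand in for random permutations, and the two names and the number of hash functions (500) are arbitrary choices made for the example.

```python
import hashlib

def trigrams(s):
    # Set of character trigrams of a string.
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def salted_hash(seed, item):
    # One salted hash function per seed plays the role of one random permutation.
    digest = hashlib.blake2b(f"{seed}:{item}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(items, seeds):
    # The signature keeps the minimum hash value under each hash function.
    return [min(salted_hash(seed, item) for item in items) for seed in seeds]

seeds = range(500)
a, b = trigrams("kite ventures"), trigrams("rite ventures")
sig_a, sig_b = minhash_signature(a, seeds), minhash_signature(b, seeds)

# Fraction of signature positions that agree estimates the Jaccard similarity.
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
print(f"true Jaccard similarity: {jaccard(a, b):.3f}")
print(f"minhash estimate:        {estimate:.3f}")
```

With more hash functions the estimate tightens around the true value, at the cost of longer signatures.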

**Locality-Sensitive Hashing** -> focuses only on the most similar pairs, or on all pairs whose similarity is above some lower bound.
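
As a rough illustration, assume that two names become a candidate pair when at least one of `k` independent minhash values collides (an OR construction; treating `numHashTables=5` as `k = 5` is an assumption made here, not something the notebook states). A pair with Jaccard similarity `s` is then compared with probability `1 - (1 - s)^k`:

```python
def candidate_probability(similarity, num_tables):
    # Probability that at least one of `num_tables` independent minhashes collides,
    # given that each one collides with probability equal to the Jaccard similarity.
    return 1 - (1 - similarity) ** num_tables

for s in (0.2, 0.4, 0.6, 0.8):
    p = candidate_probability(s, 5)
    print(f"similarity {s:.1f} -> candidate with probability {p:.3f}")
```

Under this assumption, a Jaccard distance threshold of 0.4 (similarity 0.6) means that roughly 99% of the pairs at the threshold, and more of the closer ones, are retrieved as candidates, while dissimilar pairs are mostly skipped.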

In [113]:
# -----------------------
#fuzzy orgs vs. entities
# -----------------------   

#hashing model -> Locality-Sensitive Hashing when comparing big datasets
model = Pipeline(stages=[
        RegexTokenizer(pattern="", inputCol="name_nfd", outputCol="tokens", minTokenLength=1),
        NGram(n=3, inputCol="tokens", outputCol="ngrams"),
        HashingTF(inputCol="ngrams", outputCol="vectors"),
        MinHashLSH(inputCol="vectors", outputCol="lsh", numHashTables=5)
    ])    

model = model.fit(df_orgs_2)    
df_orgs_hashed = model.transform(df_orgs_2)
df_entities_hashed = model.transform(df_entities_2)


In [119]:
df_entities_hashed.show(5)

+-----------------+------------+-----------------+--------------------+--------------------+--------------------+--------------------+
|           entity|        type|         name_nfd|              tokens|              ngrams|             vectors|                 lsh|
+-----------------+------------+-----------------+--------------------+--------------------+--------------------+--------------------+
|         APPENDIX|ORGANIZATION|         appendix|[a, p, p, e, n, d...|[a p p, p p e, p ...|(262144,[30529,88...|[[2.47270919E8], ...|
|VALUATION SUMMARY|ORGANIZATION|valuation summary|[v, a, l, u, a, t...|[v a l, a l u, l ...|(262144,[11798,50...|[[6.7059534E7], [...|
|           EBITDA|ORGANIZATION|           ebitda|  [e, b, i, t, d, a]|[e b i, b i t, i ...|(262144,[33543,13...|[[3.94996328E8], ...|
|            Avito|      PERSON|            avito|     [a, v, i, t, o]|[a v i, v i t, i ...|(262144,[201892,2...|[[1.036406121E9],...|
| Avito IRR Slando|ORGANIZATION| avito irr slando|[a, v
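
The `tokens` and `ngrams` columns in this output can be mimicked in plain Python. The sketch below imitates only the first two pipeline stages for one name; it is an illustration of what the columns contain, not Spark code.

```python
name = "appendix"

# RegexTokenizer(pattern="") splits on the empty pattern, i.e. into single characters.
tokens = list(name)

# NGram(n=3) joins every window of three consecutive tokens with a space.
ngrams = [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

print(tokens)  # ['a', 'p', 'p', 'e', 'n', 'd', 'i', 'x']
print(ngrams)  # ['a p p', 'p p e', 'p e n', 'e n d', 'n d i', 'd i x']
```

`HashingTF` then maps each trigram to an index of a sparse vector (of size 262144 in the output above), and `MinHashLSH` hashes the set of non-zero indices.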

In [120]:
# compute the similarity
threshold = 0.4
results_names = model.stages[-1]\
                     .approxSimilarityJoin(df_orgs_hashed, df_entities_hashed, threshold, distCol="dist_jaccard")\
                     .select(
                            F.col("datasetA.id_originally"),
                            F.col("datasetA.name"),
                            F.col("datasetB.entity"),
                            F.col("dist_jaccard"))
pd_results1 = results_names.toPandas()    

In [121]:
pd_results1.head(5)

Unnamed: 0,id_originally,name,entity,dist_jaccard
0,orgsMimecast/avito.ru,avito,Avito,0.0
1,orgPipedrive/38181,21 Ventures,Kite Ventures,0.363636
2,orgPipedrive/49244,Ayala Corporation,SINA Corporation,0.388889
3,orgPipedrive/40076,Investment AB Kinnevik,Investment AB Kinnevik,0.0
4,orgsMimecast/e-tengelmann.de,e-tengelmann,Tengelmann,0.111111
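
Because `dist_jaccard` is the Jaccard distance between the two trigram sets, the values in this table can be checked by hand. A minimal sketch in plain Python, assuming no hash collisions in `HashingTF` and that spaces count as characters; the left-hand strings are the database names after `textNormalize`, the right-hand ones the lowercased entities.

```python
def trigrams(name):
    # Character trigrams, spaces included, as produced by the tokenizer + NGram stages.
    return {name[i:i + 3] for i in range(len(name) - 2)}

def jaccard_distance(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return 1 - len(ta & tb) / len(ta | tb)

# (normalized database name, lowercased entity) pairs taken from the table above.
pairs = [
    ("avito", "avito"),
    (" ventures", "kite ventures"),           # "21 Ventures" once the digits are removed
    ("ayala corporation", "sina corporation"),
    ("etengelmann", "tengelmann"),            # "e-tengelmann" once the hyphen is removed
]
for db_name, entity in pairs:
    print(f"{db_name!r} vs {entity!r}: {jaccard_distance(db_name, entity):.6f}")
```

This prints 0.000000, 0.363636, 0.388889 and 0.111111, matching the `dist_jaccard` column. It also shows why "21 Ventures" matches "Kite Ventures": after normalization only " ventures" is left, and all of its trigrams occur in the entity.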
