# Word Overlap
A simple way to measure lexical overlap between two texts (sentences, paragraphs, or documents).
It removes punctuation and stopwords, then divides the number of tokens in the first text that also occur in the second by the total number of tokens in both texts. Matching is case-sensitive. The score therefore ranges from 0 (no words in common) to 0.5 (identical texts).

In [1]:
import nltk
import string

In [2]:
stopwords_list = nltk.corpus.stopwords.words('portuguese')

def tokenize_words(t):
    return nltk.tokenize.word_tokenize(t)

def remove_stopwords(tokens):
    return [t for t in tokens if t not in stopwords_list]

def remove_punctuation(t):
    return t.translate(str.maketrans('','',string.punctuation))

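def word_overlap(t1, t2):
    # Preprocess both texts: strip punctuation, tokenize, drop stopwords.
    tokens1 = remove_stopwords(tokenize_words(remove_punctuation(t1)))
    tokens2 = remove_stopwords(tokenize_words(remove_punctuation(t2)))
    # Count every token of the first text that also appears in the second.
    matches = [1 for t in tokens1 if t in tokens2]
    # Normalize by the combined length, so identical texts score 0.5.
    return sum(matches) / (len(tokens1) + len(tokens2))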

In [3]:
t1 = "Contagem manual de milhões de votos deixa 272 pessoas mortas e outras 1.878 doentes na Indonésia"
word_overlap(t1, t1)

0.5

In [4]:
t2 = "Primeira eleição que juntou o voto presidencial com parlamentares nacionais e regionais num mesmo pleito."
word_overlap(t1, t2)

0.0

In [5]:
t3 = "272 pessoas mortas e outras 1.878 doentes na Indonésia após contagem manual de milhões de votos."
word_overlap(t1, t3)

0.4166666666666667
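
The score above can be checked by hand with a dependency-free sketch. It assumes a tiny hand-picked stopword set (`STOPWORDS`) and whitespace tokenization in place of NLTK, and the `lowercase` flag is a hypothetical addition, not part of the function above. It illustrates that `t1` and `t3` share 10 of their 24 tokens, and that "Contagem" and "contagem" only match once case is ignored.

```python
import string

# Assumption: a tiny hand-picked stopword set standing in for NLTK's Portuguese list.
STOPWORDS = {"de", "e", "na"}

def simple_tokens(text):
    # Strip ASCII punctuation, split on whitespace, drop stopwords (case-sensitive).
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.split() if w not in STOPWORDS]

def overlap(a, b, lowercase=False):
    # `lowercase` is a hypothetical option used only to show the effect of case.
    if lowercase:
        a, b = a.lower(), b.lower()
    tokens_a, tokens_b = simple_tokens(a), simple_tokens(b)
    matches = sum(1 for w in tokens_a if w in tokens_b)
    return matches / (len(tokens_a) + len(tokens_b))

t1 = "Contagem manual de milhões de votos deixa 272 pessoas mortas e outras 1.878 doentes na Indonésia"
t3 = "272 pessoas mortas e outras 1.878 doentes na Indonésia após contagem manual de milhões de votos."

print(overlap(t1, t3))                  # 10 matches over 24 tokens
print(overlap(t1, t3, lowercase=True))  # 11 matches over 24 tokens
```

Note that both versions divide by zero when neither text has any token left after filtering, so real inputs may need a guard.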

# Word2Vec
Download a pretrained word embedding from http://www.nilc.icmc.usp.br/nilc/index.php/repositorio-de-word-embeddings-do-nilc

You can use the following direct link: http://143.107.183.175:22980/download.php?file=embeddings/word2vec/skip_s50.zip. Unzip the file inside the notebooks directory.

In [2]:
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('skip_s50.txt')

In [3]:
model.most_similar('carro')

[('passageiro', 0.907621443271637),
 ('trator', 0.9069172143936157),
 ('caminhão', 0.8961774706840515),
 ('jipe', 0.889147162437439),
 ('avião', 0.8786295652389526),
 ('guindaste', 0.8719200491905212),
 ('contêiner', 0.8684994578361511),
 ('parabrisa', 0.866083562374115),
 ('elevador', 0.8559513688087463),
 ('motorista', 0.8482216596603394)]

In [8]:
import numpy as np
from scipy.spatial.distance import cosine

def cosine_distance_wordembedding_method(t1, t2):
    # Represent each text as the mean of the vectors of its in-vocabulary words.
    # Note: `model.vocab` is the gensim 3.x API; gensim 4.x uses `model.key_to_index`.
    vector_1 = np.mean([model.get_vector(word) for word in t1.split() if word in model.vocab], axis=0)
    vector_2 = np.mean([model.get_vector(word) for word in t2.split() if word in model.vocab], axis=0)
    # scipy's cosine() returns a distance, so similarity = 1 - distance.
    similarity = 1 - cosine(vector_1, vector_2)
    print('Word embedding method with cosine distance assesses that our two sentences are similar to', round(similarity * 100, 2), '%')

cosine_distance_wordembedding_method(t1, t1)
cosine_distance_wordembedding_method(t1, t2)
cosine_distance_wordembedding_method(t2, t3)

Word embedding method with cosine distance assesses that our two sentences are similar to 100.0 %
Word embedding method with cosine distance assesses that our two sentences are similar to 46.93 %
Word embedding method with cosine distance assesses that our two sentences are similar to 47.78 %
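
To make the averaging step concrete, here is a minimal sketch that does not need the pretrained model. The three-dimensional `toy_vectors` are invented for illustration only, and `mean_vector` and `cosine_similarity` are hypothetical helper names. It also returns `None` when a text has no known word, a case the function above does not handle.

```python
import math

# Hypothetical 3-dimensional toy embeddings; real vectors come from the pretrained model.
toy_vectors = {
    "carro": [1.0, 0.0, 0.0],
    "caminhão": [0.8, 0.6, 0.0],
    "voto": [0.0, 0.0, 1.0],
}

def mean_vector(text, vectors):
    # Average the vectors of the words that are in the vocabulary.
    found = [vectors[w] for w in text.split() if w in vectors]
    if not found:
        return None  # no known word, so there is nothing to average
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]

def cosine_similarity(text_a, text_b, vectors):
    va, vb = mean_vector(text_a, vectors), mean_vector(text_b, vectors)
    if va is None or vb is None:
        return None
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm

print(cosine_similarity("carro", "caminhão", toy_vectors))
print(cosine_similarity("carro", "voto", toy_vectors))
print(cosine_similarity("palavra desconhecida", "carro", toy_vectors))
```

Averaging discards word order, so two sentences with the same words in a different order get the same vector.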


# Wordnets
Do not forget to download the required NLTK data: ```nltk.download()```

In [9]:
from nltk.corpus import wordnet

In [10]:
dog_synsets = wordnet.synsets('dog')
for i, dog in enumerate(dog_synsets):
    print('{} -> {}'.format(i, dog.definition()))

0 -> a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
1 -> a dull unattractive unpleasant girl or woman
2 -> informal term for a man
3 -> someone who is morally reprehensible
4 -> a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll
5 -> a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward
6 -> metal supports for logs in a fireplace
7 -> go after with the intent to catch


In [11]:
print(wordnet.synset('dog.n.01').lowest_common_hypernyms(wordnet.synset('cat.n.01')))
print(wordnet.synset('dog.n.01').lowest_common_hypernyms(wordnet.synset('duck.n.01')))

[Synset('carnivore.n.01')]
[Synset('vertebrate.n.01')]


### See http://www.nltk.org/howto/wordnet.html

In [12]:
dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')
car = wordnet.synset('car.n.01')

print('Path similarity')
print(dog.path_similarity(cat))
print(dog.path_similarity(car))
print()
print('Leacock-Chodorow similarity')
print(dog.lch_similarity(cat))
print(dog.lch_similarity(car))
print()
print('Wu-Palmer similarity')
print(dog.wup_similarity(cat))
print(dog.wup_similarity(car))

Path similarity
0.2
0.07692307692307693

Leacock-Chodorow similarity
2.0281482472922856
1.072636802264849

Wu-Palmer similarity
0.8571428571428571
0.4
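
As a rough illustration of what two of these scores measure, the sketch below recomputes them on a small invented taxonomy (`PARENT` is hypothetical, not WordNet data). It assumes path similarity is `1 / (shortest_path_length + 1)` and Wu-Palmer is `2 * depth(lcs) / (depth(a) + depth(b))`, with the root at depth 1. The values differ from the WordNet outputs above because the toy tree is much shallower.

```python
# Hypothetical toy taxonomy: each node maps to its single parent (None for the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "carnivore": "animal",
    "dog": "carnivore",
    "cat": "carnivore",
    "vehicle": "entity",
    "car": "vehicle",
}

def ancestors(node):
    # Path from the node up to the root, node included.
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def lowest_common_ancestor(a, b):
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    return None

def path_similarity(a, b):
    lca = lowest_common_ancestor(a, b)
    # Number of edges on the shortest path between a and b through their LCA.
    distance = ancestors(a).index(lca) + ancestors(b).index(lca)
    return 1 / (distance + 1)

def wup_similarity(a, b):
    lca = lowest_common_ancestor(a, b)
    depth = lambda n: len(ancestors(n))  # root has depth 1
    return 2 * depth(lca) / (depth(a) + depth(b))

print(path_similarity("dog", "cat"), path_similarity("dog", "car"))
print(wup_similarity("dog", "cat"), wup_similarity("dog", "car"))
```

Under the same path formula, the WordNet outputs 0.2 and 0.0769 correspond to shortest paths of 4 and 12 edges respectively.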


In [13]:
from itertools import islice
for synset in islice(wordnet.all_synsets('n'), 10):
    print(synset, synset.hypernyms())

Synset('entity.n.01') []
Synset('physical_entity.n.01') [Synset('entity.n.01')]
Synset('abstraction.n.06') [Synset('entity.n.01')]
Synset('thing.n.12') [Synset('physical_entity.n.01')]
Synset('object.n.01') [Synset('physical_entity.n.01')]
Synset('whole.n.02') [Synset('object.n.01')]
Synset('congener.n.03') [Synset('whole.n.02')]
Synset('living_thing.n.01') [Synset('whole.n.02')]
Synset('organism.n.01') [Synset('living_thing.n.01')]
Synset('benthos.n.02') [Synset('organism.n.01')]


# Portuguese usage

In [14]:
carro = wordnet.synsets('carro', lang='por')
print(carro)
# 'carro' maps to several synsets, so a Word Sense Disambiguation tool is needed to pick one;
# below, car.n.01 (index 1) is chosen by hand.

hyper = carro[1].hypernyms()[0]
print(hyper)
print(hyper.lemma_names(lang='por'))
hyper_hyper = hyper.hypernyms()[0]

print(hyper_hyper)
print(hyper_hyper.lemma_names(lang='por'))

hyper_hyper_hyper = hyper_hyper.hypernyms()[0]
print(hyper_hyper_hyper)
print(hyper_hyper_hyper.lemma_names(lang='por'))

[Synset('beach_wagon.n.01'), Synset('car.n.01'), Synset('car.n.02'), Synset('cart.n.01')]
Synset('motor_vehicle.n.01')
['veículos_motorizados']
Synset('self-propelled_vehicle.n.01')
[]
Synset('wheeled_vehicle.n.01')
[]
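
The three repeated hypernym lookups can be folded into a loop. The sketch below is one possible way to do it. `hypernym_chain` is a hypothetical helper that relies only on the `hypernyms()`, `name()`, and `lemma_names(lang=...)` methods used by NLTK synsets, and it simplifies by always following the first hypernym.

```python
def hypernym_chain(synset, lang='por', max_depth=3):
    """Follow the first hypernym of each synset, collecting (name, lemmas) pairs."""
    chain = []
    current = synset
    for _ in range(max_depth):
        parents = current.hypernyms()
        if not parents:
            break  # reached the top of the hierarchy
        current = parents[0]  # simplification: ignore any additional hypernyms
        chain.append((current.name(), current.lemma_names(lang=lang)))
    return chain

# Usage with the objects from the cell above (requires the NLTK WordNet data):
# for name, lemmas in hypernym_chain(carro[1]):
#     print(name, lemmas)
```

An empty lemma list, as in the output above, means that synset has no Portuguese lemma, not that the lookup failed.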


# Synsets
Download the synset database from http://www.nilc.icmc.usp.br/tep2/download.htm.
The file ```base_tep2.txt``` follows the format:

synset_sequence_number. \[synset_category_name\] {comma-separated set of synset words} <identifier number(s) of the antonym synset(s)>

In [15]:
import re

synsets = {}
with open('base_tep2/base_tep2.txt', 'r', encoding='latin1') as tep_file:
    for line in tep_file:
        # Each line holds: sequence number, [category], {comma-separated words}, optional <antonym ids>.
        syn_id = re.match(r'^(?P<int>\d+)', line).group('int')
        cat = re.search(r'\[(?P<cat>.*?)\]', line).group('cat')
        syn = re.search(r'\{(?P<syn>.*?)\}', line).group('syn')
        s = re.search(r'<(?P<ant>.*?)>', line)
        # The antonym field is optional, so keep None when it is absent.
        ant = s.group('ant') if s is not None else None
        synsets[syn_id] = {'words': [w.strip() for w in syn.split(',')],
                           'category': cat,
                           'antonym': ant}
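
Once the file is parsed, a natural next step is to look words up. The sketch below shows one way to do that. `SAMPLE_LINES` are invented lines that only mimic the format described above (they are not taken from `base_tep2.txt`), and `parse_line`, `synonyms`, and `antonyms` are hypothetical helper names. It also assumes the antonym field holds a single synset id.

```python
import re
from collections import defaultdict

# Hypothetical sample lines written to mimic the described format.
SAMPLE_LINES = [
    "1. [Adjetivo] {alegre, contente, feliz} <2>",
    "2. [Adjetivo] {infeliz, triste} <1>",
    "3. [Substantivo] {automóvel, carro}",
]

LINE_PATTERN = re.compile(
    r"^(?P<id>\d+)\.\s*\[(?P<cat>.*?)\]\s*\{(?P<syn>.*?)\}\s*(?:<(?P<ant>.*?)>)?"
)

def parse_line(line):
    m = LINE_PATTERN.match(line)
    words = [w.strip() for w in m.group("syn").split(",")]
    return m.group("id"), {"words": words, "category": m.group("cat"),
                           "antonym": m.group("ant")}

synsets = dict(parse_line(line) for line in SAMPLE_LINES)

# Inverted index: word -> ids of the synsets that contain it.
word_index = defaultdict(list)
for syn_id, entry in synsets.items():
    for word in entry["words"]:
        word_index[word].append(syn_id)

def synonyms(word):
    found = {w for sid in word_index.get(word, []) for w in synsets[sid]["words"]}
    return sorted(found - {word})

def antonyms(word):
    # Assumption: the antonym field holds a single synset id.
    found = set()
    for sid in word_index.get(word, []):
        ant_id = synsets[sid]["antonym"]
        if ant_id in synsets:
            found.update(synsets[ant_id]["words"])
    return sorted(found)

print(synonyms("feliz"))
print(antonyms("feliz"))
```

The inverted index avoids scanning every synset on each lookup, which matters once the full database is loaded.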
