# Using gensim

In [94]:
import os
import wget

url = 'http://rgai.inf.u-szeged.hu/~berend/compsem/hu-cbow.vec.gz'
embedding_file_name = url.split('/')[-1]
if not os.path.exists(embedding_file_name):
    filename = wget.download(url)
    print(filename, " got downloaded")

hu-cbow.vec.gz  got downloaded


In [82]:
from gensim.models.keyedvectors import KeyedVectors

embeddings = KeyedVectors.load_word2vec_format(embedding_file_name, limit=30000)

In [83]:
print(embeddings.most_similar('kutya', topn=5))

[('macska', 0.8715766668319702), ('ló', 0.8034482002258301), ('majom', 0.7777713537216187), ('madár', 0.7722772359848022), ('egér', 0.7345019578933716)]



In [84]:
print(embeddings.most_similar(positive=['kutya', 'ellenséges'], negative=['barátságos'], topn=5))

[('zsákmány', 0.649147629737854), ('ellenség', 0.6485747694969177), ('állat', 0.6432604193687439), ('fegyver', 0.6141625642776489), ('macska', 0.6066000461578369)]
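
The analogy query above can be read as vector arithmetic followed by a cosine-similarity ranking. The following is a minimal sketch of that idea in plain Python, assuming the usual formulation (unit-normalize the input vectors, add the positive ones, subtract the negative ones, rank the remaining words by cosine similarity). The three-dimensional `toy_vectors` and the helper names are invented for illustration; they are not taken from the Hungarian embeddings or from gensim itself.

```python
import math

# Toy 3-dimensional embeddings; the numbers are made up for illustration only.
toy_vectors = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.5, 0.5],
}

def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def most_similar(vectors, positive, negative=(), topn=5):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    unit = {word: normalize(vec) for word, vec in vectors.items()}
    dim = len(next(iter(unit.values())))
    query = [0.0] * dim
    signed_words = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed_words:
        for i, x in enumerate(unit[word]):
            query[i] += sign * x
    query = normalize(query)
    # The query words themselves are left out of the ranking.
    excluded = set(positive) | set(negative)
    scores = [
        (word, sum(q * u for q, u in zip(query, vec)))
        for word, vec in unit.items()
        if word not in excluded
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

result = most_similar(toy_vectors, positive=["king", "woman"], negative=["man"], topn=2)
print(result)
```

With these toy vectors, `queen` ranks first for `king - man + woman`, mirroring how `kutya + ellenséges - barátságos` is ranked against the real vocabulary above.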


# Querying WordNet
  
[WordNet](https://wordnet.princeton.edu/) is a lexical database originally created for English only.
Since its creation, many extensions (such as [BabelNet](http://babelnet.org/)) have been created.

WordNet organizes word forms into semantically coherent groups of words, called _synsets_.
The database consists of multiple sub-databases for the senses of words belonging to different parts of speech (_noun_, _verb_, _adverb_ and _adjective_).
Synsets of WordNet can be connected if certain relations hold, such as
* _synonymy_ such as **synonym(dog, domestic dog)**
* _antonymy_ such as **antonym(happy, sad)**
* _meronymy_ such as **meronymy(dog, tail)**
* _holonymy_ such as **holonymy(tail, dog)**
* _hypernymy_ such as **hypernymy(dog, animal)**
* _hyponymy_ such as **hyponymy(dog, chihuahua)**

In [2]:
from nltk.corpus import wordnet as wn

Let's check which synsets the word _dog_ occurs in!

In [3]:
dog_synsets = wn.synsets('dog')
for s in dog_synsets:
    print(s)

Synset('dog.n.01')
Synset('frump.n.01')
Synset('dog.n.03')
Synset('cad.n.01')
Synset('frank.n.02')
Synset('pawl.n.01')
Synset('andiron.n.01')
Synset('chase.v.01')


Print the definition and the part-of-speech category of each synset of the word _dog_. Additionally, print the example sentences that WordNet lists for each synset.

In [4]:
for i, s in enumerate(dog_synsets):
    print(i, s.pos(), s.definition())
    for j, e in enumerate(s.examples()):
        print('\tExample sentence #{}: {}'.format(j+1, e))

0 n a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
	Example sentence #1: the dog barked all night
1 n a dull unattractive unpleasant girl or woman
	Example sentence #1: she got a reputation as a frump
	Example sentence #2: she's a real dog
2 n informal term for a man
	Example sentence #1: you lucky dog
3 n someone who is morally reprehensible
	Example sentence #1: you dirty dog
4 n a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll
5 n a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward
6 n metal supports for logs in a fireplace
	Example sentence #1: the andirons were too hot to touch
7 v go after with the intent to catch
	Example sentence #1: The policeman chased the mugger down the alley
	Example sentence #2: the dog chased the rabbit


Measure the similarity of each pair of synsets for the words _dog_ and _cat_
according to Lin's similarity.
The [paper of Budanitsky and Hirst](https://www.mitpressjournals.org/doi/pdf/10.1162/coli.2006.32.1.13)
contains an extensive overview of WordNet-based similarity measures.
  
[Some further reading on measuring semantic similarity](https://web.stanford.edu/~jurafsky/slp3/17.pdf)

Additionally, there is a [nice demo](http://ws4jdemo.appspot.com/) illustrating various similarity metrics.
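
Lin's measure compares the information content (IC) of the most specific concept that two synsets share with the IC of the synsets themselves: `lin(c1, c2) = 2 * IC(lcs) / (IC(c1) + IC(c2))`, where `IC(c) = -log P(c)`. The sketch below works through this formula on a tiny hand-made taxonomy. The taxonomy, the counts and the function names are assumptions made for illustration only; they are not WordNet data and not the NLTK implementation.

```python
import math

# Hypothetical tree-shaped taxonomy: each concept points to its single parent.
PARENT = {
    "dog": "carnivore",
    "cat": "carnivore",
    "carnivore": "animal",
    "animal": "entity",
    "car": "artifact",
    "artifact": "entity",
    "entity": None,
}

# Made-up corpus counts of how often each concept itself was observed.
OWN_COUNT = {
    "dog": 30, "cat": 20, "carnivore": 5, "animal": 10,
    "car": 25, "artifact": 10, "entity": 0,
}

def ancestors(concept):
    """Return the concept followed by all of its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def information_content(concept):
    """IC(c) = -log P(c); P(c) counts the concept and everything below it."""
    total = sum(OWN_COUNT.values())
    subtree = sum(n for node, n in OWN_COUNT.items() if concept in ancestors(node))
    return -math.log(subtree / total)

def lin_similarity(c1, c2):
    """2 * IC(lowest common ancestor) / (IC(c1) + IC(c2))."""
    up1 = ancestors(c1)
    lcs = next(a for a in ancestors(c2) if a in up1)
    return 2 * information_content(lcs) / (information_content(c1) + information_content(c2))

print(lin_similarity("dog", "cat"))  # share the fairly specific concept "carnivore"
print(lin_similarity("dog", "car"))  # share only the root, whose IC is zero
```

When two concepts share only the root, the numerator is zero, so the score is zero. In floating point this can surface as `-0.0`, which is probably the reason such a value appears in the output of the next cell.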

In [8]:
from nltk.corpus import wordnet_ic

cat_synsets = wn.synsets('cat')
# Load information content (IC) counts estimated from the Brown and SemCor corpora
brown_ic = wordnet_ic.ic('ic-brown.dat')
semcor_ic = wordnet_ic.ic('ic-semcor.dat')

for i, d in enumerate(dog_synsets):
    print(i+1, d.definition())
    for j, c in enumerate(cat_synsets):
        # Lin similarity is only defined for synsets of the same part of speech
        if d.pos() == c.pos():
            print("\t", c.definition(), d.lin_similarity(c, brown_ic))
        else:
            print('\tPOS tags mismatch')
    print("========")

1 a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
	 feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats 0.8768009843733973
	 an informal term for a youth or man 0.2373157912950559
	 a spiteful woman gossip 0.22541367230467804
	 the leaves of the shrub Catha edulis which are chewed like tobacco or used to make tea; has the effect of a euphoric stimulant 0.08125685233221215
	 a whip with nine knotted cords 0.15524861014827232
	 a large tracked vehicle that is propelled by two endless metal belts; frequently used for moving earth in construction and farm work 0.15561545181951641
	 any of several large cats typically able to roar and living in the wild 0.8395837008502381
	 a method of examining body organs by scanning them with X rays and using a computer to construct a series of cross-sectional scans along a single axis -0.0
	POS tags mismatch
	POS

Which synsets are the hypernyms of the first synset of the word _dog_?  
Note: the hypernyms of a concept are the concepts that are more general than it.

In [97]:
for hypernym in dog_synsets[0].hypernyms():
    print(hypernym)

Synset('canine.n.02')
Synset('domestic_animal.n.01')
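
The `hypernyms()` call returns only the direct hypernyms. To collect every more general concept, the hypernym links have to be followed transitively. The sketch below shows this with a breadth-first traversal over a small, simplified graph. The `HYPERNYMS` dictionary and the `all_hypernyms` helper are hypothetical and do not reproduce the real WordNet hierarchy.

```python
from collections import deque

# Simplified, hand-made hypernym graph (not the actual WordNet chain).
HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "domestic_animal": ["animal"],
    "carnivore": ["animal"],
    "animal": ["entity"],
    "entity": [],
}

def all_hypernyms(concept, graph):
    """Collect direct and indirect hypernyms in breadth-first order."""
    seen = []
    queue = deque(graph[concept])
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # reached through another path already
        seen.append(node)
        queue.extend(graph[node])
    return seen

closure = all_hypernyms("dog", HYPERNYMS)
print(closure)
```

In NLTK the same kind of traversal should be available through `Synset.closure`, for example `dog_synsets[0].closure(lambda s: s.hypernyms())`.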


To learn about further functionality of WordNet, see the [documentation of the corresponding NLTK package](http://www.nltk.org/howto/wordnet.html).

# Querying ConceptNet

[ConceptNet](http://conceptnet.io/) is an open-source, massively **multilingual** semantic network
listing the attributes of words and phrases and the relations between them.   
The ConceptNet graph contains vertices for [100+ languages](https://github.com/commonsense/conceptnet5/wiki/Languages).
For more details, read the [paper introducing ConceptNet v5.0](http://www.lrec-conf.org/proceedings/lrec2012/pdf/1072_Paper.pdf).

ConceptNet can be hosted on a local machine; however, this requires a substantial amount of disk space.
It is therefore often more convenient to rely on its REST API, which is briefly introduced next.
For further details regarding the ConceptNet API, see its [documentation](https://github.com/commonsense/conceptnet5/wiki/API).  

In [90]:
import urllib.parse  # import the submodule explicitly; `import urllib` alone may not expose urllib.parse
import requests
from contextlib import closing

In [91]:
lang = 'hu'
word = 'kutya'
limit = 10
input_url = 'http://api.conceptnet.io/c/{}/{}?limit={}'.format(lang, urllib.parse.quote_plus(word), limit)
print('Querying {}'.format(input_url))
try:
    with closing(requests.get(input_url)) as response:
        obj = response.json()
        for edge in obj['edges']:
            s, e, r = edge['start'], edge['end'], edge['rel']
            print('{}\t{}\t{}\t{}'.format(s['term'], e['term'], r['label'], edge['weight']))
except (requests.exceptions.ConnectionError, requests.exceptions.Timeout,
        requests.exceptions.TooManyRedirects, ValueError):
    # ValueError covers a response body that is not valid JSON
    print('Request failed for {}'.format(input_url))

Querying http://api.conceptnet.io/c/hu/kutya?limit=10
/c/hu/eb	/c/hu/kutya	Synonym	2.82842712474619
/c/hu/kutya	/c/hu/eb	Synonym	2.82842712474619
/c/hu/a_hazug_embert_hamarabb_utolérik_mint_a_sánta_kutyát	/c/hu/kutya	DerivedFrom	1.0
/c/hu/a_kutya_ugat_a_karaván_halad	/c/hu/kutya	DerivedFrom	1.0
/c/hu/amelyik_kutya_ugat_az_nem_harap	/c/hu/kutya	DerivedFrom	1.0
/c/hu/egyik_kutya_másik_eb	/c/hu/kutya	DerivedFrom	1.0
/c/hu/fakutya	/c/hu/kutya	DerivedFrom	1.0
/c/hu/juhászkutya	/c/hu/kutya	DerivedFrom	1.0
/c/hu/kiskutya	/c/hu/kutya	DerivedFrom	1.0
/c/hu/könnyebb_utolérni_a_hazugot_mint_a_sánta_kutyát	/c/hu/kutya	DerivedFrom	1.0


In [93]:
lang = 'en'
word = 'dog'
limit = 10
input_url = 'http://api.conceptnet.io/c/{}/{}?limit={}'.format(lang, urllib.parse.quote_plus(word), limit)
print('Querying {}'.format(input_url))
try:
    with closing(requests.get(input_url)) as response:
        obj = response.json()
        for edge in obj['edges']:
            s, e, r = edge['start'], edge['end'], edge['rel']
            print('{}\t{}\t{}\t{}'.format(s['term'], e['term'], r['label'], edge['weight']))
except (requests.exceptions.ConnectionError, requests.exceptions.Timeout,
        requests.exceptions.TooManyRedirects, ValueError):
    # ValueError covers a response body that is not valid JSON
    print('Request failed for {}'.format(input_url))

Querying http://api.conceptnet.io/c/en/dog?limit=10
/c/en/dog	/c/en/bark	CapableOf	16.0
/c/en/dog	/c/en/guard_house	CapableOf	10.392304845413264
/c/en/dog	/c/en/pet	RelatedTo	9.82975075981075
/c/en/dog	/c/en/animal	RelatedTo	9.410419756844005
/c/en/dog	/c/en/kennel	AtLocation	9.38083151964686
/c/en/flea	/c/en/dog	RelatedTo	9.02064299260313
/c/en/dog	/c/en/canine	RelatedTo	7.625745865159683
/c/en/dog	/c/en/pet	CapableOf	7.483314773547882
/c/en/dog	/c/en/loyal_friend	IsA	6.6332495807108
/c/en/dog	/c/en/companionship	UsedFor	6.32455532033676
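
The two query cells above differ only in the language and the word, so the URL construction and the edge formatting can be factored out. The following is a minimal sketch of such helpers using only the standard library. The names `concept_url` and `format_edges` are hypothetical, and `sample_response` is a hand-written dictionary that only mimics the fields used above (`edges`, `start`, `end`, `rel`, `weight`); it is not a real API response.

```python
from urllib.parse import quote_plus

def concept_url(lang, word, limit=10):
    """Build the lookup URL for a concept, percent-encoding the word."""
    return 'http://api.conceptnet.io/c/{}/{}?limit={}'.format(lang, quote_plus(word), limit)

def format_edges(response_json):
    """Turn the 'edges' of a parsed response into tab-separated rows."""
    rows = []
    for edge in response_json.get('edges', []):
        rows.append('{}\t{}\t{}\t{}'.format(
            edge['start']['term'], edge['end']['term'],
            edge['rel']['label'], edge['weight']))
    return rows

# Hand-written stand-in for `response.json()`, shaped like the output above.
sample_response = {
    'edges': [
        {'start': {'term': '/c/en/dog'}, 'end': {'term': '/c/en/bark'},
         'rel': {'label': 'CapableOf'}, 'weight': 16.0},
    ]
}

print(concept_url('hu', 'kutya'))
for row in format_edges(sample_response):
    print(row)
```

With these helpers, a live query would reduce to passing `requests.get(concept_url(lang, word)).json()` to `format_edges`, wrapped in the same error handling as in the cells above.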
