# Case study: LiLa and Wikidata


## Latin Lexemes in Wikidata

According to [Ordia](https://ordia.toolforge.org/language/Q397), there are 32,183 Latin lexemes in Wikidata. The same total can be verified using [Wikidata's SPARQL endpoint](https://query.wikidata.org/). To list the lexemes available for the Latin language ([Q397](https://www.wikidata.org/wiki/Q397)), we can run the following query in the web interface, where the common namespace prefixes are already defined. The total is then the number of rows returned:

```sparql
select ?lexemeId ?lemma WHERE {
  ?lexemeId dct:language wd:Q397;
            wikibase:lemma ?lemma.
}
```
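
The query above returns one row per lexeme rather than a number. If only the total is needed, an aggregate query is a lighter alternative. The following is a minimal sketch: the `COUNT_QUERY` string, the `?total` variable and the `extract_count` helper are illustrative names that do not appear in the notebook, and the sample response is hand-made in the standard SPARQL JSON results shape, so nothing is fetched from the network.

```python
# Aggregate variant of the lexeme query: returns a single row with the total.
COUNT_QUERY = """
SELECT (COUNT(DISTINCT ?lexemeId) AS ?total) WHERE {
  ?lexemeId dct:language wd:Q397 .
}
"""


def extract_count(sparql_json):
    """Read the single COUNT value out of a SPARQL JSON result dict."""
    binding = sparql_json["results"]["bindings"][0]
    return int(binding["total"]["value"])


# Hand-made response, only to show the expected shape of the JSON results.
sample_response = {
    "head": {"vars": ["total"]},
    "results": {"bindings": [{"total": {"type": "literal", "value": "32183"}}]},
}
total = extract_count(sample_response)
print(total)
```

The same string could be passed to `return_sparql_query_results`, which is used below for the full download.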


We'll download the lexeme URIs and lemma strings (available via the property `wikibase:lemma`) with an extended version of the query above, which also retrieves the lexical category, using the [qwikidata](https://qwikidata.readthedocs.io/en/stable/index.html) library.

In [9]:
from qwikidata.sparql import return_sparql_query_results
from qwikidata.entity import WikidataItem, WikidataLexeme
from qwikidata.linked_data_interface import get_entity_dict_from_api

In [2]:
q = '''select ?lexemeId ?lemma ?pos WHERE {
  ?lexemeId dct:language wd:Q397;
            wikibase:lemma ?lemma ;
            wikibase:lexicalCategory ?pos 
}'''

In [3]:
res = return_sparql_query_results(q)

We can easily verify that our `res` variable holds all the expected results:

In [4]:
len(res['results']['bindings'])

32183

In [5]:
r = res['results']['bindings'][0]
r['lemma']['value']

'adquisitrix'

Before we move on, let's pack everything into a list of (URI, lemma, POS) tuples:

In [6]:
wikidata_lexemes = [(r['lexemeId']['value'], r['lemma']['value'], r['pos']['value']) 
                    for r in res['results']['bindings']]
wikidata_lexemes[1]

('http://www.wikidata.org/entity/L255916',
 'adrepticius',
 'http://www.wikidata.org/entity/Q34698')

In [11]:
# Print the label of each distinct lexical category (one API call per category)
for p in {lex[-1] for lex in wikidata_lexemes}:
    e = WikidataItem(get_entity_dict_from_api(p.split('/')[-1]))
    print(p, e.get_label())

http://www.wikidata.org/entity/Q83034 interjection
http://www.wikidata.org/entity/Q187931 phrase
http://www.wikidata.org/entity/Q24905 verb
http://www.wikidata.org/entity/Q62155 affix
http://www.wikidata.org/entity/Q63116 numeral
http://www.wikidata.org/entity/Q36224 pronoun
http://www.wikidata.org/entity/Q380057 adverb
http://www.wikidata.org/entity/Q4833830 preposition
http://www.wikidata.org/entity/Q36484 conjunction
http://www.wikidata.org/entity/Q184511 idiom
http://www.wikidata.org/entity/Q814722 participle
http://www.wikidata.org/entity/Q54310231 interrogative pronoun
http://www.wikidata.org/entity/Q2006180 pro-form
http://www.wikidata.org/entity/Q35102 proverb
http://www.wikidata.org/entity/Q9788 letter
http://www.wikidata.org/entity/Q168417 hapax legomenon
http://www.wikidata.org/entity/Q5456361 fixed expression
http://www.wikidata.org/entity/Q147276 proper noun
http://www.wikidata.org/entity/Q104051989 adjectival suffix
http://www.wikidata.org/entity/Q53996674 conjugation class

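Note that the following LiLa POS tags have no one-to-one counterpart among the Wikidata categories above: `lila:subordinating_conjunction`, `lila:coordinating_conjunction`, `lila:particle` and `lila:other`.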

In [13]:
pos_mapping = {
'http://www.wikidata.org/entity/Q83034': 'interjection' , # interjection
'http://www.wikidata.org/entity/Q187931': 'None' , # phrase
'http://www.wikidata.org/entity/Q24905': 'verb' , # verb
'http://www.wikidata.org/entity/Q62155': 'None' , # affix
'http://www.wikidata.org/entity/Q63116': 'numeral' , # numeral
'http://www.wikidata.org/entity/Q36224': 'pronoun' , # pronoun
'http://www.wikidata.org/entity/Q380057': 'adverb' , # adverb
'http://www.wikidata.org/entity/Q4833830': 'adposition' , # preposition
'http://www.wikidata.org/entity/Q36484': 'conjunction' , # conjunction
'http://www.wikidata.org/entity/Q184511': 'None' , # idiom
'http://www.wikidata.org/entity/Q814722': 'None' , # participle
'http://www.wikidata.org/entity/Q54310231': 'pronoun' , # interrogative pronoun
'http://www.wikidata.org/entity/Q2006180': 'None' , # pro-form
'http://www.wikidata.org/entity/Q35102': 'None' , # proverb
'http://www.wikidata.org/entity/Q9788': 'None' , # letter
'http://www.wikidata.org/entity/Q168417': 'None' , # hapax legomenon
'http://www.wikidata.org/entity/Q5456361': 'None' , # fixed expression
'http://www.wikidata.org/entity/Q147276': 'proper_noun' , # proper noun
'http://www.wikidata.org/entity/Q104051989': 'None' , # adjectival suffix
'http://www.wikidata.org/entity/Q53996674': 'None' , # conjugation class
'http://www.wikidata.org/entity/Q576271': 'determiner' , # determiner
'http://www.wikidata.org/entity/Q134830': 'None', # prefix
'http://www.wikidata.org/entity/Q1084': 'noun' , # noun
'http://www.wikidata.org/entity/Q34698': 'adjective' , # adjective
'http://www.wikidata.org/entity/Q102047': 'None' # suffix  
}
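
The mapping above assigns a single LiLa tag to each Wikidata category, and the string `'None'` where no equivalent exists. Since LiLa distinguishes `coordinating_conjunction` from `subordinating_conjunction` (see the tags listed before the mapping), a Wikidata `conjunction` can never be equal to a LiLa tag. One possible refinement, shown here as a sketch on a small excerpt, maps each category to a set of acceptable LiLa tags. The names `POS_CANDIDATES` and `compatible` are hypothetical and not part of the notebook.

```python
WD = "http://www.wikidata.org/entity/"

# Excerpt only: each Wikidata category maps to the set of LiLa tags it may match.
# An empty set plays the role of the 'None' entries in pos_mapping.
POS_CANDIDATES = {
    WD + "Q24905": {"verb"},
    WD + "Q1084": {"noun"},
    WD + "Q36484": {"coordinating_conjunction", "subordinating_conjunction"},
    WD + "Q187931": set(),  # phrase: no LiLa equivalent
}


def compatible(wd_pos_uri, lila_pos):
    """Return True if the LiLa tag is acceptable for the Wikidata category."""
    # Unknown categories fall back to the empty set instead of raising KeyError.
    return lila_pos in POS_CANDIDATES.get(wd_pos_uri, set())


print(compatible(WD + "Q36484", "coordinating_conjunction"))
print(compatible(WD + "Q187931", "noun"))
```

Whether a Wikidata conjunction should be allowed to match both LiLa conjunction types is a design decision; the set-based lookup only makes that choice explicit.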

## Lemmas in LiLa

In [14]:
import sys
import os

sys.path.insert(0, os.path.abspath('../'))

import pylila

In [15]:
from pylila.lemma import get_lemmas_by_writtenrep
from pylila.resources import LiLaLemmaBank

In [16]:
from tqdm import tqdm

With the help of PyLiLa we can easily query the Lemma Bank. However, sending 32k SPARQL queries to the online endpoint, one for each lemma, would take far too long.

Instead, I will load a local copy of the Lemma Bank into a `pylila.resources.LiLaLemmaBank` object and query the file locally (even that may take a while).

In [17]:
%%time
path_to_lb = os.path.expanduser('~/Downloads/lemmaBank.ttl')
lb = LiLaLemmaBank.from_file(path_to_lb)

CPU times: user 30.4 s, sys: 352 ms, total: 30.7 s
Wall time: 30.8 s


It's more than 1.3M triples...

In [18]:
len(lb.graph)

1337466

Now we can use the SPARQL support in `rdflib` to query the graph:

In [19]:
q = '''SELECT DISTINCT ?lemma ?pos
WHERE {
    ?lemma ontolex:writtenRep "amo" ;
           lila:hasPOS ?pos}'''


In [20]:
qres = lb.graph.query(q)
for row in qres:
    print(row.lemma, row.pos)

http://lila-erc.eu/data/id/lemma/88705 http://lila-erc.eu/ontologies/lila/verb
http://lila-erc.eu/data/id/lemma/29874 http://lila-erc.eu/ontologies/lila/noun


In [21]:
qres = lb.graph.query(q)
[str(row.lemma) for row in qres]

['http://lila-erc.eu/data/id/lemma/88705',
 'http://lila-erc.eu/data/id/lemma/29874']

### Matching

In [22]:
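def check_lemma_bank(wdlm):
    """Return (lemma URI, POS) pairs for the LiLa lemmas whose written representation matches `wdlm`."""
    # Normalize the Wikidata lemma before matching: lowercase, v -> u, j -> i
    chk_str = wdlm.lower().replace('v', 'u').replace('j', 'i')

    q = f'''SELECT DISTINCT ?lemma ?pos
            WHERE {{
            ?lemma ontolex:writtenRep "{chk_str}" ;
            lila:hasPOS ?pos}}'''
    qres = lb.graph.query(q)
    # Keep only the last segment of the POS URI (e.g. 'verb')
    return [(str(row.lemma), str(row.pos).split('/')[-1]) 
            for row in qres]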

In [23]:
check_lemma_bank('amo')

[('http://lila-erc.eu/data/id/lemma/88705', 'verb'),
 ('http://lila-erc.eu/data/id/lemma/29874', 'noun')]
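
`check_lemma_bank` normalizes the Wikidata lemma before querying: it lowercases the string and rewrites `v` as `u` and `j` as `i`. The sketch below isolates that step so that it can be tried on its own. The `escape_literal` and `build_query` helpers are additions of this example, not of the notebook: the first guards against lemmas that contain a double quote or a backslash, which would otherwise break the interpolated query string.

```python
def normalize(written_rep):
    """Apply the same normalization as check_lemma_bank (lowercase, v->u, j->i)."""
    return written_rep.lower().replace("v", "u").replace("j", "i")


def escape_literal(text):
    """Escape backslashes and double quotes for use inside a "..." SPARQL literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_query(written_rep):
    """Build the lookup query for one lemma string."""
    literal = escape_literal(normalize(written_rep))
    return (
        "SELECT DISTINCT ?lemma ?pos WHERE { "
        f'?lemma ontolex:writtenRep "{literal}" ; lila:hasPOS ?pos }}'
    )


print(normalize("Juvenis"))
print(build_query("amo"))
```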

In [36]:
wd_matched = []
for wd_uri, wd_lemma, wd_pos in tqdm(wikidata_lexemes):
    lila_match = check_lemma_bank(wd_lemma)
    wd_matched.append([wd_uri, wd_lemma, wd_pos, lila_match, len(lila_match)])

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32183/32183 [00:30<00:00, 1047.83it/s]
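
The loop above runs one local SPARQL query per lexeme. An alternative, sketched here with toy data, is to build a dictionary from written representation to LiLa candidates once and then look each lemma up directly. This is a swapped-in technique, not what the notebook does: `lila_rows` stands in for the rows that a single query over the Lemma Bank would return, and `build_index` and `lookup` are hypothetical names.

```python
from collections import defaultdict

# Stand-in for (lemma URI, POS, written representation) rows from the Lemma Bank.
lila_rows = [
    ("http://lila-erc.eu/data/id/lemma/88705", "verb", "amo"),
    ("http://lila-erc.eu/data/id/lemma/29874", "noun", "amo"),
    ("http://lila-erc.eu/data/id/lemma/88076", "verb", "agglomero"),
]


def build_index(rows):
    """Group (URI, POS) pairs by written representation."""
    index = defaultdict(list)
    for uri, pos, written_rep in rows:
        index[written_rep].append((uri, pos))
    return index


def lookup(index, wd_lemma):
    """Normalize a Wikidata lemma as check_lemma_bank does, then look it up."""
    key = wd_lemma.lower().replace("v", "u").replace("j", "i")
    # .get avoids creating empty entries for lemmas that are not in the index
    return index.get(key, [])


index = build_index(lila_rows)
print(lookup(index, "Amo"))
print(lookup(index, "ignotum"))
```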


In [37]:
wd_matched[100]

['http://www.wikidata.org/entity/L256712',
 'agglomero',
 'http://www.wikidata.org/entity/Q24905',
 [('http://lila-erc.eu/data/id/lemma/88076', 'verb')],
 1]

In [38]:
from collections import Counter

In [39]:
c = Counter([i[-1] for i in wd_matched])

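Earlier results, obtained without the string normalization in `check_lemma_bank`. Each pair is (number of LiLa matches, number of Wikidata lexemes):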
```
[(1, 20255), (0, 7209), (2, 3880), (3, 676), (4, 135), (5, 25), (6, 3)]
```

In [40]:
c.most_common(10)

[(1, 22860), (2, 4452), (0, 3900), (3, 780), (4, 157), (5, 31), (6, 3)]

In [41]:
len([m for m in wd_matched if m[-1] > 1])

5423

Now let's try to resolve the ambiguous cases using the POS mapping defined above:

In [42]:
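for m in wd_matched:
    if m[-1] > 1:
        # LiLa POS expected for this Wikidata lexical category
        matched_pos = pos_mapping[m[2]]
        for li in m[3]:
            if li[1] == matched_pos:
                # Keep only the first candidate with the expected POS
                m[3] = [li]
                m[4] = len(m[3])
                break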

In [43]:
newc = Counter([i[-1] for i in wd_matched])

In [44]:
newc.most_common(10)

[(1, 28046), (0, 3900), (2, 183), (3, 43), (4, 8), (6, 2), (5, 1)]
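
The three distributions reported in this section can be compared directly. The snippet below copies the counts printed above and derives how many lexemes gained a match through normalization and how many ambiguous cases the POS filter resolved. Variable names are illustrative.

```python
# Number of LiLa candidates per Wikidata lexeme -> number of lexemes,
# copied from the outputs shown in this section.
raw = {1: 20255, 0: 7209, 2: 3880, 3: 676, 4: 135, 5: 25, 6: 3}
normalized = {1: 22860, 2: 4452, 0: 3900, 3: 780, 4: 157, 5: 31, 6: 3}
disambiguated = {1: 28046, 0: 3900, 2: 183, 3: 43, 4: 8, 6: 2, 5: 1}


def ambiguous(dist):
    """Count lexemes with more than one LiLa candidate."""
    return sum(n for k, n in dist.items() if k > 1)


total = sum(raw.values())
newly_matched = raw[0] - normalized[0]
resolved = ambiguous(normalized) - ambiguous(disambiguated)

print(total, newly_matched, resolved)
print(f"{disambiguated[1] / total:.1%} of lexemes end up with exactly one link")
```

With these figures, normalization adds matches for 3,309 lexemes and the POS filter resolves 5,186 of the 5,423 ambiguous cases, leaving about 87% of the lexemes with exactly one link.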

In [48]:
[w for w in wd_matched if w[-1] > 1][0]

['http://www.wikidata.org/entity/L256677',
 'Africus',
 'http://www.wikidata.org/entity/Q1084',
 [('http://lila-erc.eu/data/id/lemma/7334', 'adjective'),
  ('http://lila-erc.eu/data/id/lemma/7338', 'proper_noun')],
 2]
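
The `Africus` entry shows why some cases survive the filter: Wikidata tags it as a noun (`Q1084`), while the two LiLa candidates are an adjective and a proper noun, so no candidate has the mapped POS. The sketch below restates the filter as a function that returns a new list instead of modifying `wd_matched` in place. `pick_by_pos` is a hypothetical name, the mapping is a two-entry excerpt of `pos_mapping`, and the `amo` call assumes a Wikidata lexeme tagged as a verb.

```python
WD = "http://www.wikidata.org/entity/"
pos_excerpt = {WD + "Q1084": "noun", WD + "Q24905": "verb"}


def pick_by_pos(candidates, wd_pos_uri, mapping):
    """Keep the first candidate whose LiLa POS equals the mapped Wikidata POS.

    If no candidate matches, the full candidate list is returned unchanged.
    """
    wanted = mapping.get(wd_pos_uri)
    for uri, lila_pos in candidates:
        if lila_pos == wanted:
            return [(uri, lila_pos)]
    return list(candidates)


amo = [
    ("http://lila-erc.eu/data/id/lemma/88705", "verb"),
    ("http://lila-erc.eu/data/id/lemma/29874", "noun"),
]
africus = [
    ("http://lila-erc.eu/data/id/lemma/7334", "adjective"),
    ("http://lila-erc.eu/data/id/lemma/7338", "proper_noun"),
]

print(pick_by_pos(amo, WD + "Q24905", pos_excerpt))  # a single candidate is kept
print(len(pick_by_pos(africus, WD + "Q1084", pos_excerpt)))  # both candidates remain
```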

### Writing down the output

I'll aim for a JSON file structured like this:

```json
[
    {
        "id": "L255905",
        "uri": "http://www.wikidata.org/entity/L255905",
        "wiki_lemma": "adquisitrix",
        "wiki_pos": "http://www.wikidata.org/entity/Q1084",
        "lila_links": [
            {
                "lila_uri": "http://lila-erc.eu/data/id/lemma/87268",
                "lila_pos": "noun"
            }
        ],
        "nr_of_links": 1
    },
    {
        "id": "L256677",
        "uri": "http://www.wikidata.org/entity/L256677",
        "wiki_lemma": "Africus",
        "wiki_pos": "http://www.wikidata.org/entity/Q1084",
        "lila_links": [
            {
                "lila_uri": "http://lila-erc.eu/data/id/lemma/7334",
                "lila_pos": "adjective"
            },
            {
                "lila_uri": "http://lila-erc.eu/data/id/lemma/7338",
                "lila_pos": "proper_noun"
            }
        ],
        "nr_of_links": 2
    }
]
```

In [49]:
import json

In [52]:
wd_matched[0]

['http://www.wikidata.org/entity/L255905',
 'adquisitrix',
 'http://www.wikidata.org/entity/Q1084',
 [('http://lila-erc.eu/data/id/lemma/87268', 'noun')],
 1]

In [54]:
jmatchs = []
for w in wd_matched:
    j = {}
    j['id'] = w[0].split("/")[-1]
    j['uri'] = w[0]
    j['wiki_lemma'] = w[1]
    j['wiki_pos'] = w[2]
    j['lila_links'] = [{'lila_uri' : ll[0], 'lila_pos' : ll[1]} for ll in w[3]]
    j['nr_of_links'] = w[-1]
    jmatchs.append(j)
    

In [57]:
with open(os.path.expanduser('~/Desktop/lila_wikidata.json'), 'w') as out:
    json.dump(jmatchs, out, ensure_ascii=False, indent=2)
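
A quick way to check the export is to read the file back and verify that each record has the expected keys and that `nr_of_links` agrees with the length of `lila_links`. The sketch below does this on a single record written to a temporary file, so it does not touch the real output. `validate_record` and `EXPECTED_KEYS` are hypothetical names introduced for this example.

```python
import json
import os
import tempfile

EXPECTED_KEYS = {"id", "uri", "wiki_lemma", "wiki_pos", "lila_links", "nr_of_links"}


def validate_record(rec):
    """Return True if a record has the expected keys and a consistent link count."""
    return set(rec) == EXPECTED_KEYS and rec["nr_of_links"] == len(rec["lila_links"])


record = {
    "id": "L255905",
    "uri": "http://www.wikidata.org/entity/L255905",
    "wiki_lemma": "adquisitrix",
    "wiki_pos": "http://www.wikidata.org/entity/Q1084",
    "lila_links": [
        {"lila_uri": "http://lila-erc.eu/data/id/lemma/87268", "lila_pos": "noun"}
    ],
    "nr_of_links": 1,
}

# Write to a temporary file and read it back, mirroring the json.dump call above.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lila_wikidata.json")
    with open(path, "w", encoding="utf-8") as out:
        json.dump([record], out, ensure_ascii=False, indent=2)
    with open(path, encoding="utf-8") as src:
        loaded = json.load(src)

print(all(validate_record(r) for r in loaded))
```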

---