# Wikidata lexicon (~ Wiktionary)

In [1]:
from tools.datasets import *

book = "L536"

book_df = fetch_dataset(book, provider="wikidata")

Dataset L536.json not available, downloading from https://wikidata.org/wiki/Special:EntityData/L536.json
Done


## Wikidata lexemes breakdown structure

(cf. Wikidata.ipynb). The online documentation on lexemes is fairly limited.

A lexeme is a unit of lexical meaning. Morphologically speaking, it can belong to only one grammatical category. Homograph lexemes (P5402) are stored as separate lexemes. In this case, L536 refers to "book" as a noun.

- `lemmas`: the lemmas of the lexeme, keyed by language code.
    - `#lang` (e.g. `en`): contains the base lemma in that language (lang->value). In general, the word could be valid in more than one language, so several keys may appear.
- `lexicalCategory`: an entity describing the grammatical category (verb, noun...)
- `language`: a lexeme corresponds to a single language only. Here too, the value is just an entity.
- `claims`: structured like normal Wikidata claims; contains grammatical features of the main lexeme and other relationships not related to senses, glosses or morphological forms. For example, `P5185` is the grammatical gender and `P5402` is a homograph lexeme.
- `forms`: an array of morphological forms. Each form has an ID of the shape `L{ENTITY_NUMBER}-F{NO}`, with `NO` starting from 1.
    - `id`
    - `representations`: like `#lang` above, but this time it represents a morphological variation.
    - `grammaticalFeatures`: an array of grammatical features
- `senses`: array of senses (either a translation or a definition, depending on the source and target language)
    - `claims`: the structure is similar to a normal claim in Wikidata, but the set of predicates is restricted to:
        - `P5972: translation`: links to other senses, written as `wd:LX-SN`, where `LX` is a lexeme and `SN` is the sense number starting from S1. Human-readable annotations can be found by querying their label (`rdfs:label` or `skos:definition` on the dataset).
        - `P5137: item for this sense`: the corresponding Wikidata entity
        - a few others (`P18: image`, ...)
    - `glosses`: a short text that categorises the sense, mainly used to disambiguate word senses. Keyed by language, like `#lang` above.
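
To make the layout above more concrete, here is a minimal sketch that walks a lexeme dictionary and pulls out the lemma, the form representations and the glosses for one language. The `sample_lexeme` dictionary is hand-written for illustration (its IDs such as `L0` and `Q_NOUN` are placeholders, not real Wikidata identifiers), and `summarise_lexeme` is a hypothetical helper, not part of `tools.datasets`.

```python
# Hand-written sample that mimics the structure described above.
# All IDs are placeholders; this is NOT real Wikidata output.
sample_lexeme = {
    "id": "L0",
    "lemmas": {"en": {"language": "en", "value": "book"}},
    "lexicalCategory": "Q_NOUN",
    "language": "Q_ENGLISH",
    "claims": {},
    "forms": [
        {
            "id": "L0-F1",
            "representations": {"en": {"language": "en", "value": "book"}},
            "grammaticalFeatures": ["Q_SINGULAR"],
        },
        {
            "id": "L0-F2",
            "representations": {"en": {"language": "en", "value": "books"}},
            "grammaticalFeatures": ["Q_PLURAL"],
        },
    ],
    "senses": [
        {
            "id": "L0-S1",
            "glosses": {"en": {"language": "en", "value": "set of written pages"}},
            "claims": {},
        },
    ],
}


def summarise_lexeme(lexeme, lang="en"):
    """Collect the lemma, form representations and glosses for one language."""
    return {
        "lemma": lexeme["lemmas"][lang]["value"],
        # Skip forms/senses that have no entry for the requested language
        "forms": [
            form["representations"][lang]["value"]
            for form in lexeme["forms"]
            if lang in form["representations"]
        ],
        "glosses": [
            sense["glosses"][lang]["value"]
            for sense in lexeme["senses"]
            if lang in sense["glosses"]
        ],
    }


summary = summarise_lexeme(sample_lexeme)
print(summary)
```

Assuming the downloaded JSON follows the same layout, the same helper could be pointed at `book_df["entities"]["L536"]`.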

In [5]:
import pandas as pd

book_pd = pd.json_normalize(book_df["entities"]["L536"])

In [7]:
book_pd.columns

Index(['pageid', 'ns', 'title', 'lastrevid', 'modified', 'type', 'id',
       'lexicalCategory', 'language', 'forms', 'senses', 'lemmas.en.language',
       'lemmas.en.value', 'claims.P5402'],
      dtype='object')
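
The `forms` and `senses` columns still hold raw lists of dictionaries after normalisation. One possible way to get a table with one row per form is to run `pd.json_normalize` on the list itself. The sketch below uses a small hand-written `sample_forms` list with placeholder IDs rather than the real L536 data.

```python
import pandas as pd

# Hand-written stand-in for the content of the `forms` column (placeholder IDs)
sample_forms = [
    {
        "id": "L0-F1",
        "representations": {"en": {"language": "en", "value": "book"}},
        "grammaticalFeatures": ["Q_SINGULAR"],
    },
    {
        "id": "L0-F2",
        "representations": {"en": {"language": "en", "value": "books"}},
        "grammaticalFeatures": ["Q_PLURAL"],
    },
]

# Nested dicts become dotted column names; lists are kept as-is in a single column
forms_pd = pd.json_normalize(sample_forms)
print(forms_pd[["id", "representations.en.value", "grammaticalFeatures"]])
```

With the real data, the list would presumably come from `book_df["entities"]["L536"]["forms"]`.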

## Testing data

We collect the 1000 most used words according to Wiktionary. The counts are based on the absolute word frequency extracted from TV series and movie scripts in the public domain up to 2006. More details [here](https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/TV/2006/1-1000). From here on, we'll refer to them as WDTV.

Similarly, we compare with an extraction from Project Gutenberg (synced in 2006; is there anything more recent?) (WDPG) and with the hunspell en-GB -ise dictionary (HUN-GB).
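
The notebook does not state how the lists are compared. As one simple possibility, the sketch below measures which share of a reference list also appears in another list, ignoring case. The `coverage` function and the toy lists are assumptions made for illustration only.

```python
def coverage(reference, candidate):
    """Share of the words in `reference` that also occur in `candidate` (case-insensitive)."""
    ref = {word.lower() for word in reference}
    cand = {word.lower() for word in candidate}
    if not ref:
        return 0.0
    return len(ref & cand) / len(ref)


# Toy lists standing in for WDTV and HUN-GB
toy_reference = ["the", "Book", "you"]
toy_candidate = ["book", "the", "cat"]

toy_coverage = coverage(toy_reference, toy_candidate)
print(f"{toy_coverage:.2f}")
```

Lower-casing is a deliberate simplification here; it would merge entries such as proper nouns with their common-noun homographs.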

In [22]:
# scrape wdtv

from bs4 import BeautifulSoup
from requests import get
from os.path import join


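def scrape_wiktionary(url):
    """Return the words listed in the first table of a Wiktionary frequency page."""
    r = get(url)
    r.raise_for_status()  # fail early on HTTP errors
    parsed = BeautifulSoup(r.content, "html.parser")
    # The frequency list is the first table on the page; skip its header row
    table = parsed.find_all("table")[0]
    rows = table.find_all("tr")[1:]
    # The second cell of each row holds the word, wrapped in a link
    words = [row.find_all("td")[1].find("a").text for row in rows]

    return words

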
wdtv_list = scrape_wiktionary("https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/TV/2006/1-1000")
wdpg_list = scrape_wiktionary("https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/PG/2006/04/1-10000")

In [29]:
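def hunspell(dataset):
    # The first line of a .dic file holds the word count; each following line
    # is an entry of the form word/FLAGS. Keep only the word, dropping the
    # flags and the trailing newline.
    with wrap_open(join("wordlists", dataset)) as f:
        num_of_words = int(f.readline())
        return [row.strip().split("/")[0] for idx, row in enumerate(f.readlines()) if idx < num_of_words]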


hun_en_gb = hunspell("en_GB-ise.dic")