In [1]:
from IPython.html.services.config import ConfigManager
from IPython.paths import locate_profile
cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))

cm.update('notebook', {"load_extensions": {"livereveal/main": True}})
cm.update('livereveal', {
    'theme': 'simple',
    'transition': 'linear',
    'slideNumber': True,
    'start_slideshow_at': 'selected',
    'scroll': True,
})



{'scroll': True,
 'slideNumber': True,
 'start_slideshow_at': 'selected',
 'theme': 'simple',
 'transition': 'linear'}

# Named entity recognition

![NER](./img/ner.png)

... is the task of **identifying** and **classifying** named entities into predefined categories

* Predefined categories are: **PER, LOC, ORG, MISC**, DATE, NUM, ...
* It is a sequence labeling problem: every token receives a tag
* Entity boundaries are encoded with the BIO or BILOU tagging scheme
* State-of-the-art performance (for English) is around 90% F-score
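
To make the two tagging schemes concrete, here is a minimal sketch that converts BIO tags into BILOU tags. The function name `bio_to_bilou` and the sample sentence are made up for illustration; they do not come from spaCy or any other library.

```python
def bio_to_bilou(tags):
    """Convert a list of BIO tags (e.g. 'B-PER', 'I-PER', 'O') to BILOU tags."""
    bilou = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bilou.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is an I- tag with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            # A B- token that is not continued is a single-token (Unit) entity.
            bilou.append(("B-" if continues else "U-") + label)
        else:
            # An I- token that is not continued is the Last token of its entity.
            bilou.append(("I-" if continues else "L-") + label)
    return bilou


# Hand-made example sentence (hypothetical data)
tokens = ["President", "Donald", "Trump", "visited", "Russia"]
bio = ["O", "B-PER", "I-PER", "O", "B-LOC"]
for token, tag in zip(tokens, bio_to_bilou(bio)):
    print(token, tag)
```

BILOU marks the last token of an entity and single-token entities explicitly, which gives a tagger more information about entity boundaries than plain BIO.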

## Anatomy of spaCy's NER

In [2]:
from spacy.en import English

enlp = English()

In [3]:
doc = enlp("President Trump has a new morning ritual. Around 6:30 a.m. on many days — "
    "before all the network news shows have come on the air — "
    "he gets on the phone with a member of his outside legal team to chew over all things Russia.")

In [4]:
doc.ents

(Trump, 6:30 a.m. on, Russia)

In [5]:
["{}/{}".format(ent.text, ent.label_) for ent in doc.ents]

['Trump/PERSON', '6:30 a.m. on/TIME', 'Russia/GPE']

spaCy recognizes [10+7 NE types](https://spacy.io/docs/usage/entity-recognition#entity-types)

In [6]:
for tok in doc[:10]:
    print(tok, tok.ent_iob_, tok.ent_type_)

President O 
Trump B PERSON
has O 
a O 
new O 
morning O 
ritual O 
. O 
Around O 
6:30 B TIME


## spaCy does not know Hungarian NEs (yet)

...but we have other tools to use

Pretrained open-source tools:
* [Szeged NER](http://www.inf.u-szeged.hu/rgai/NER)
* [HunTag](https://github.com/recski/HunTag)


You can build models using existing corpora and open-source tools:
* [Szeged NER corpus](http://rgai.inf.u-szeged.hu/index.php?lang=en&page=corpus_ne)
* [hunNERwiki](http://hlt.sztaki.hu/resources/hunnerwiki.html)

### +1

[HuNLP](https://github.com/oroszgy/hunlp) wraps [`magyarlanc`](http://www.inf.u-szeged.hu/rgai/magyarlanc) and [Szeged NER](http://www.inf.u-szeged.hu/rgai/NER)
* merges NER results with the output of magyarlanc
* convenient programmatic API
* REST API
* Dockerized

### HuNLP in practice

For the workshop, the dockerized HuNLP is running on an AWS instance, but you may want to run it locally:

```
docker pull oroszgy/hunlp
docker run -it -p 9090:9090 oroszgy/hunlp
```

In [7]:
from hunlp import HuNlp

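# read the sample article (assumed to be UTF-8 encoded) and close the file afterwards
with open("./data/hvg_cikk.txt", encoding="utf-8") as f:
    text = f.read()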
# if you have a local HuNLP instance running, replace the parameter with "http://127.0.0.1"
nlp = HuNlp("http://35.189.225.241")
doc = nlp(text)
list(doc.entities)

[('TASZ-ról', 'ORG'),
 ('Zsiga Marcellről', 'PER'),
 ('Fidesz', 'ORG'),
 ('Fidesz', 'ORG'),
 ('Zsiga Marcell-sztori', 'ORG'),
 ('Zsiga Marcellnek', 'PER'),
 ('Miskolc', 'LOC'),
 ('Szerencsejáték Zrt.', 'ORG'),
 ('Szerencsejáték Zrt.', 'ORG'),
 ('TASZ', 'ORG'),
 ('Fedák Sári', 'PER'),
 ('TASZ', 'ORG'),
 ('Strasbourgba', 'LOC'),
 ('TASZ', 'ORG'),
 ('Zsiga Marcell', 'PER'),
 ('TASZ', 'ORG'),
 ('Orbán Viktort', 'PER'),
 ('Magyarországon', 'LOC')]

In [8]:
for sent in doc:
    for tok in sent:
        if tok.entity_type != "O":
            print(tok.text, tok.entity_type)

TASZ-ról I-ORG
Zsiga I-PER
Marcellről I-PER
Fidesz I-ORG
Fidesz I-ORG
Zsiga I-ORG
Marcell-sztori I-ORG
Zsiga I-PER
Marcellnek I-PER
Miskolc I-LOC
Szerencsejáték I-ORG
Zrt. I-ORG
TASZ I-ORG
Fedák I-PER
Sári I-PER
TASZ I-ORG
Strasbourgba I-LOC
TASZ I-ORG
Zsiga I-PER
Marcell I-PER
TASZ I-ORG
Orbán I-PER
Viktort I-PER
Magyarországon I-LOC
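
The token-level tags above only use the `I-` prefix, so entity spans have to be recovered by merging runs of adjacent tokens that share a tag. The following is a rough sketch of that grouping on plain `(text, tag)` pairs. The helper `group_entities` and the sample data are assumptions made for illustration, not part of the `hunlp` API.

```python
from itertools import groupby


def group_entities(tagged_tokens):
    """Merge runs of adjacent tokens with the same non-'O' tag into (text, label) pairs."""
    entities = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == "O":
            continue
        text = " ".join(token for token, _ in run)
        # Drop the 'I-' prefix to keep only the entity label.
        entities.append((text, tag.split("-", 1)[-1]))
    return entities


# Hand-made sample in the same shape as the printed output (hypothetical data)
sample = [
    ("Zsiga", "I-PER"), ("Marcellnek", "I-PER"), ("és", "O"),
    ("Miskolc", "I-LOC"), ("Szerencsejáték", "I-ORG"), ("Zrt.", "I-ORG"),
]
print(group_entities(sample))
```

One limitation of this scheme: two entities of the same type that directly follow each other are merged into one span, because there is no `B-` tag to mark where the second one begins.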


In [9]:
for tok in doc[0]:
    print(tok.i, tok.text, tok.lemma, tok.tag, tok.head, tok.dep)

1 Így így ADV 2 MODE
2 gondozd gondoz VERB 0 ROOT
3 a a DET 4 DET
4 civiledet civil NOUN 2 OBJ


# Wikification

Wikification (also called entity linking or named entity disambiguation) is the task of identifying mentions of entities in text and linking them to entries in a knowledge base (e.g. Wikipedia).

## DBpedia

Structured content from Wikipedia

Wikipedia infobox
![Infobox](./img/wikipedia_infobox.png)

Extracted DBpedia content

![Dbpedia](./img/dbpedia1.png)
![Dbpedia](./img/dbpedia2.png)

DBpedia is:

* freely accessible and open-source
* represented as semantic triples
* easily browsable through a web or SPARQL interface

Most importantly, DBpedia 
* incorporates multiple ontologies (Yago, Umbel, ...)
* is multilingual (with interlinks between languages!)
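
As an illustration of the SPARQL interface and the interlanguage links, the sketch below builds a request URL for the public DBpedia endpoint that asks for the `owl:sameAs` links of a resource. It assumes that the endpoint lives at `https://dbpedia.org/sparql`, that it accepts `query` and `format` parameters, and that interlinks are expressed as `owl:sameAs` triples. The helper name `same_as_query_url` is hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumption: the public DBpedia SPARQL endpoint
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def same_as_query_url(resource_uri, endpoint=DBPEDIA_ENDPOINT):
    """Build a GET URL that queries the owl:sameAs links of a DBpedia resource."""
    query = (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        "SELECT ?other WHERE { <%s> owl:sameAs ?other } LIMIT 20" % resource_uri
    )
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


url = same_as_query_url("http://dbpedia.org/resource/Miskolc")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client.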

## [DBpedia Spotlight](https://github.com/dbpedia-spotlight/dbpedia-spotlight)

... is a tool for automatically annotating mentions of DBpedia/Wikipedia resources in text. ([Demo](http://demo.dbpedia-spotlight.org/))

![Spotlight](./img/spotlight.png)

Spotlight is trained on Wikipedia & DBpedia, using various features such as

* distribution of anchor words
* co-occurrences of concepts
* contexts of interlinks

### Why is it so interesting?

It could be used for

* finding/disambiguating Named Entities
* extracting topics from raw text
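
A rough sketch of the second use case: once a text is annotated, the linked resources can be counted or filtered by type to get a crude topic summary. The code assumes that annotations arrive as a list of dicts with `URI` and `types` keys, the shape of the Spotlight output shown later in this section. The helper names `topic_counts` and `resources_of_type` are made up for illustration.

```python
from collections import Counter


def topic_counts(annotations):
    """Count how often each linked DBpedia resource is mentioned."""
    return Counter(ann["URI"] for ann in annotations)


def resources_of_type(annotations, wanted_type):
    """Return the distinct URIs whose comma-separated 'types' field contains wanted_type."""
    return sorted(
        {ann["URI"] for ann in annotations if wanted_type in ann["types"].split(",")}
    )


# Sample annotations reduced to the keys used here
FIDESZ = "http://hu.dbpedia.org/resource/Fidesz_–_Magyar_Polgári_Szövetség"
MISKOLC = "http://hu.dbpedia.org/resource/Miskolc"
ORG_TYPES = "DBpedia:Agent,Schema:Organization,DBpedia:Organisation,DBpedia:PoliticalParty"
PLACE_TYPES = "Schema:Place,DBpedia:Place,DBpedia:PopulatedPlace,DBpedia:Settlement"
sample = [
    {"URI": FIDESZ, "surfaceForm": "Fidesz", "types": ORG_TYPES},
    {"URI": FIDESZ, "surfaceForm": "Fidesz", "types": ORG_TYPES},
    {"URI": MISKOLC, "surfaceForm": "Miskolc", "types": PLACE_TYPES},
]

print(topic_counts(sample).most_common())
print(resources_of_type(sample, "DBpedia:Place"))
```

Counting URIs instead of surface forms means that inflected or alternative mentions of the same entity are aggregated under one resource.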

In [10]:
import spotlight

In [11]:
spotlight.annotate('http://spotlight.sztaki.hu:2229/rest/annotate', text, support=200, confidence=0.4)

[{'URI': 'http://hu.dbpedia.org/resource/Fidesz_–_Magyar_Polgári_Szövetség',
  'offset': 130,
  'percentageOfSecondRank': 0.0,
  'similarityScore': 1.0,
  'support': 937,
  'surfaceForm': 'Fidesz',
  'types': 'DBpedia:Agent,Schema:Organization,DBpedia:Organisation,DBpedia:PoliticalParty'},
 {'URI': 'http://hu.dbpedia.org/resource/Fidesz_–_Magyar_Polgári_Szövetség',
  'offset': 333,
  'percentageOfSecondRank': 0.0,
  'similarityScore': 1.0,
  'support': 937,
  'surfaceForm': 'Fidesz',
  'types': 'DBpedia:Agent,Schema:Organization,DBpedia:Organisation,DBpedia:PoliticalParty'},
 {'URI': 'http://hu.dbpedia.org/resource/Miskolc',
  'offset': 893,
  'percentageOfSecondRank': 3.8197350632819025e-21,
  'similarityScore': 1.0,
  'support': 2851,
  'surfaceForm': 'Miskolc',
  'types': 'Schema:Place,DBpedia:Place,DBpedia:PopulatedPlace,DBpedia:Settlement'},
 {'URI': 'http://hu.dbpedia.org/resource/Strasbourg',
  'offset': 2323,
  'percentageOfSecondRank': 4.833194969505155e-15,
  'similarityScore

# Closing remarks

## Summary

Now you are able to

* extract frequent terms and keyphrases, and visualize them
* perform basic NLP tasks on Hungarian texts
* build a simple topic classifier
* automatically analyze the sentiment of tweets
* identify and classify named entities

![DT](./img/hat.png)

If you are interested in even more open-source Hungarian NLP tools, have a look at [this document](https://github.com/oroszgy/awesome-hungarian-nlp/).

* `hunlp` and the Hungarian `spaCy` models are in a very early development phase, so use them with caution!
* Bug reports and PRs are always welcome :)

In [12]:
enlp("Thank you").vector

array([ -2.18545005e-01,   2.67279983e-01,  -5.61850011e-01,
        -2.22974997e-02,  -1.04899995e-01,   1.43052489e-01,
        -7.74499774e-03,  -3.86530012e-01,  -9.36115012e-02,
         2.45519996e+00,  -1.99582502e-01,   3.25064994e-02,
         1.25824004e-01,  -7.39655048e-02,  -3.32289994e-01,
        -2.44334996e-01,  -3.53455007e-01,   1.00812006e+00,
        -3.68030012e-01,   1.03618503e-01,   5.74027523e-02,
        -7.30015039e-02,  -5.03524989e-02,  -1.29685000e-01,
        -3.57325017e-01,   9.77305025e-02,  -3.94384973e-02,
        -1.38725013e-01,   1.69670001e-01,  -1.58250004e-01,
         2.33949989e-01,   2.18370005e-01,  -1.71914995e-01,
         2.67655015e-01,  -2.49304995e-01,  -1.61386486e-02,
        -2.03913495e-01,   3.73250097e-02,  -1.63419992e-01,
        -1.18268497e-01,   4.76804972e-02,   3.09480000e-02,
        -2.76349992e-01,  -2.88515002e-01,   1.43166497e-01,
         3.66054982e-01,  -2.29525000e-01,   1.34764507e-01,
         2.46250004e-01,