<a href="https://colab.research.google.com/github/fginter/ainl_2020_tutorial/blob/main/parser_tnpp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Turku Neural Parser Pipeline - Python module version

* This is a basic tutorial for running the parser pipeline under Google Colab
* Many new features were added for the AINL2020 tutorial
  * Parser changed to Udify (LAS improved from 87% to 91%)
  * Code restructured so that it can be imported as a module and installed using pip
  * No longer depends on TensorFlow
* Makes it possible for anyone to run the parser with GPU acceleration


# Install

* Pre-built wheel, not yet on PyPI
* Only Finnish and English models in this tutorial, more coming!

## Install the parser package (this takes a while)

`pip3 install http://dl.turkunlp.org/turku-parser-models/turku_neural_parser-0.2-py3-none-any.whl`

## Download and unpack the model

`wget http://dl.turkunlp.org/turku-parser-models/models_fi_tdt.tar.gz ; tar zxvf models_fi_tdt.tar.gz`

...and you are good to go!

In [None]:
#Gdown is a utility for downloading large files from Google Drive, where the parser wheel and model are mirrored for you

!pip3 install gdown
!gdown -O turku_neural_parser-0.2-py3-none-any.whl 'https://drive.google.com/uc?export=download&id=1m8Nhd1oU6eS559D0Xboe83NbTzcMsyjN'

!pip3 install turku_neural_parser-0.2-py3-none-any.whl

In [11]:
!gdown -O models_fi_tdt.tar.gz 'https://drive.google.com/uc?export=download&id=157r-qRi0YxuN82U252I4RHOXDPv-PgXg'
!tar zxvf models_fi_tdt.tar.gz

Downloading...
From: https://drive.google.com/uc?export=download&id=157r-qRi0YxuN82U252I4RHOXDPv-PgXg
To: /content/models_fi_tdt.tar.gz
1.11GB [00:18, 61.2MB/s]
models_fi_tdt/
models_fi_tdt/Udify/
models_fi_tdt/Udify/tdt-udify-model.tar.gz
models_fi_tdt/Udify/weights.th
models_fi_tdt/Udify/vocabulary/
models_fi_tdt/Udify/vocabulary/head_tags.txt
models_fi_tdt/Udify/vocabulary/feats.txt
models_fi_tdt/Udify/vocabulary/token_characters.txt
models_fi_tdt/Udify/vocabulary/tokens.txt
models_fi_tdt/Udify/vocabulary/upos.txt
models_fi_tdt/Udify/vocabulary/xpos.txt
models_fi_tdt/Udify/vocabulary/non_padded_namespaces.txt
models_fi_tdt/Udify/vocabulary/lemmas.txt
models_fi_tdt/Udify/config.json
models_fi_tdt/pipelines.yaml
models_fi_tdt/README
models_fi_tdt/Lemmatizer/
models_fi_tdt/Lemmatizer/lemmatizer.pt
models_fi_tdt/Tokenizer/
models_fi_tdt/Tokenizer/tokenizer.udpipe
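
Before loading a pipeline, it can help to confirm that the unpack left the main files in place. The following is a minimal sketch: the file list is taken from the tar listing above, while the helper name `missing_model_files` is made up for this illustration and is not part of the parser package.

```python
from pathlib import Path

# Files taken from the tar listing above; the helper itself is hypothetical.
EXPECTED_FILES = [
    "pipelines.yaml",
    "Udify/weights.th",
    "Udify/config.json",
    "Lemmatizer/lemmatizer.pt",
    "Tokenizer/tokenizer.udpipe",
]

def missing_model_files(model_dir="models_fi_tdt"):
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

missing = missing_model_files()
print("Model directory looks complete" if not missing else f"Missing: {missing}")
```

An empty list means all five files were found; anything else suggests the download or the `tar` step should be repeated.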


# Running the parser

* Every model can specify many processing pipelines
* These are in `modeldir/pipelines.yaml`
* `parse_plaintext` is the default

In [12]:
from tnparser.pipeline import read_pipelines, Pipeline

available_pipelines=read_pipelines("models_fi_tdt/pipelines.yaml")
print(list(available_pipelines.keys()))


['parse_plaintext', 'parse_sentlines', 'parse_wslines', 'parse_conllu', 'tokenize']


In [13]:
p=Pipeline(available_pipelines["parse_plaintext"])
parsed=p.parse("Minulla on ruskea koira! Se haukkuu ja juoksee. Voi että!")

Dataset reader: <class 'udify.dataset_readers.universal_dependencies.UniversalDependenciesDatasetReader'>
0it [00:00, ?it/s]Your label namespace was 'upos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  See documentation for `non_padded_namespaces` parameter in Vocabulary.
Your label namespace was 'xpos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  See documentation for `non_padded_namespaces` parameter in Vocabulary.
Your label namespace was 'feats'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  See documentation for `non_padded_namespaces` parameter in Vocabulary.
Your label namespace was 'lemmas'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  

In [14]:
print(parsed)

1	Minulla	minä	PRON	_	Case=Ade|Number=Sing|Person=1|PronType=Prs	0	root	_	_
2	on	olla	AUX	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	1	cop:own	_	_
3	ruskea	ruskea	ADJ	_	Case=Nom|Degree=Pos|Number=Sing	4	amod	_	_
4	koira	koira	NOUN	_	Case=Nom|Number=Sing	1	nsubj:cop	_	_
5	!	!	PUNCT	_	_	1	punct	_	_

1	Se	se	PRON	_	Case=Nom|Number=Sing|PronType=Dem	2	nsubj	_	_
2	haukkuu	haukkua	VERB	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	_	_
3	ja	ja	CCONJ	_	_	4	cc	_	_
4	juoksee	juosta	VERB	_	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	2	conj	_	_
5	.	.	PUNCT	_	_	2	punct	_	_

1	Voi	voi	INTJ	_	_	0	root	_	_
2	että	että	INTJ	_	_	1	fixed	_	_
3	!	!	PUNCT	_	_	1	punct	_	_
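
Each token line in the output above has ten tab-separated columns, and the sixth one (FEATS) packs the morphological features into a `Name=Value|Name=Value` string. As a small sketch of how that column could be unpacked, the hypothetical helper `parse_feats` below turns it into a dictionary. It is an illustration, not a function of the parser package.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS value into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# First token line of the output above (columns are tab-separated).
line = "1\tMinulla\tminä\tPRON\t_\tCase=Ade|Number=Sing|Person=1|PronType=Prs\t0\troot\t_\t_"
cols = line.split("\t")
feats = parse_feats(cols[5])  # column index 5 is FEATS

print(cols[1], feats)
```

Looking up `feats["Case"]` then gives `"Ade"` for this token, while punctuation lines with `_` in the FEATS column yield an empty dictionary.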




# GPU mode

* The pipeline runs in CPU mode by default
* It needs to be told explicitly to run on the GPU
* This is a bit tricky right now but not impossible
* Note: if you switch the Runtime to GPU now, you need to re-run the pip install


In [15]:
#I do realize this ain't good! :)
import types
extra_args=types.SimpleNamespace()
extra_args.__dict__["udify_mod.device"]="0" #simulates someone giving a --device 0 parameter to Udify
extra_args.__dict__["lemmatizer_mod.device"]="0" 

p=Pipeline(available_pipelines["parse_plaintext"],extra_args)
parsed=p.parse("Minulla on ruskea koira! Se haukkuu ja juoksee. Voi että!")
print("Parsed has this many lines:",len(parsed.split("\n")))


Dataset reader: <class 'udify.dataset_readers.universal_dependencies.UniversalDependenciesDatasetReader'>
0it [00:00, ?it/s]Your label namespace was 'upos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  See documentation for `non_padded_namespaces` parameter in Vocabulary.
Your label namespace was 'xpos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  See documentation for `non_padded_namespaces` parameter in Vocabulary.
Your label namespace was 'feats'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  See documentation for `non_padded_namespaces` parameter in Vocabulary.
Your label namespace was 'lemmas'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  

Parsed has this many lines: 17
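
The device override from the cell above can be wrapped in a small helper so the dotted keys are written in one place. This is only a sketch: the key names `udify_mod.device` and `lemmatizer_mod.device` come from the cell above, whereas `make_device_args` is a hypothetical name and not part of the parser package.

```python
import types

def make_device_args(device="0", modules=("udify_mod", "lemmatizer_mod")):
    """Build the namespace that carries a per-module device override."""
    args = types.SimpleNamespace()
    for module in modules:
        # Dotted names are not valid attribute syntax, so write via the dict.
        vars(args)[f"{module}.device"] = str(device)
    return args

extra_args = make_device_args("0")
print(vars(extra_args))
```

The result can be passed as the second argument to `Pipeline`, exactly like the hand-built `extra_args` in the cell above.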


In [16]:
#Since we are on a GPU, we can try to push through quite a bit more data
parsed=p.parse("Minulla on ruskea koira! Se haukkuu ja juoksee. Voi että! "*200) #takes forever on CPU, finishes in a few seconds on GPU
print("Parsed has this many lines:",len(parsed.split("\n")))

Parsed has this many lines: 3201
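
The line counts are consistent with each sentence being followed by one blank line: the three example sentences have 13 token lines plus 3 blank lines, and `split("\n")` adds one trailing empty string, giving 17. Repeating the text 200 times gives 200 × 16 + 1 = 3201. Counting sentences and tokens is usually more informative than counting lines. A minimal sketch, using a hypothetical helper `conllu_stats` and a one-sentence sample copied from the earlier output:

```python
def conllu_stats(conllu):
    """Count sentences and token lines in a CoNLL-U string."""
    # Sentences are separated by blank lines; drop empty trailing blocks.
    blocks = [b for b in conllu.split("\n\n") if b.strip()]
    tokens = sum(
        1
        for block in blocks
        for line in block.split("\n")
        if line and not line.startswith("#")  # skip comment lines
    )
    return len(blocks), tokens

sample = (
    "1\tVoi\tvoi\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "2\tettä\tettä\tINTJ\t_\t_\t1\tfixed\t_\t_\n"
    "3\t!\t!\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
)
sentences, tokens = conllu_stats(sample)
print(sentences, tokens, len(sample.split("\n")))
```

For this sample the helper reports 1 sentence and 3 tokens, while the plain line count is 5 (3 token lines, 1 blank line, 1 trailing empty string).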


# Process the output

* The output of the pipeline run is a CoNLL-U string
* You can parse it in any number of ways
* This is my preferred way:

In [17]:
ID,FORM,LEMMA,UPOS,XPOS,FEAT,HEAD,DEPREL,DEPS,MISC=range(10) #the 10 columns

def read_conll(inp,max_sent=0,drop_tokens=True,drop_nulls=True):
    """
    inp: list of lines or an open file
    max_sent: 0 for all, >0 to limit
    drop_tokens: ignore multiword token lines
    drop_nulls: ignore null nodes in enhanced dependencies

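    Yields (sent, comments) pairs: sent is a list of token lines, each split into its columns, and comments is the list of comment lines of that sentence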
    """

    comments=[]
    sent=[]
    yielded=0
    for line in inp:
        line=line.rstrip("\n")
        if line.startswith("#"):
            comments.append(line)
        elif not line:
            if sent:
                yield sent,comments
                yielded+=1
                if max_sent>0 and yielded==max_sent:
                    break
                sent,comments=[],[]
        else:
            cols=line.split("\t")
            if drop_tokens and "-" in cols[ID]:
                continue
            if drop_nulls and "." in cols[ID]:
                continue
            sent.append(cols)
    else:
        if sent:
            yield sent,comments

for one_sent,comments in read_conll(parsed.split("\n"),5):
    words=(word_line[FORM] for word_line in one_sent)
    lemmas=(word_line[LEMMA] for word_line in one_sent)
    print(" ".join(words))
    print(" ".join(lemmas))
    print()

# and that's really all there is to it :)


Minulla on ruskea koira !
minä olla ruskea koira !

Se haukkuu ja juoksee .
se haukkua ja juosta .

Voi että !
voi että !

Minulla on ruskea koira !
minä olla ruskea koira !

Se haukkuu ja juoksee .
se haukkua ja juosta .
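
The same column constants also give access to the dependency tree through the HEAD and DEPREL columns. As a sketch, the hypothetical helper `dependency_triples` below resolves each HEAD index to the head word's form. The sentence is hard-coded from the parser output shown earlier so that the snippet runs on its own.

```python
ID, FORM, LEMMA, UPOS, XPOS, FEAT, HEAD, DEPREL, DEPS, MISC = range(10)

def dependency_triples(sent):
    """Return (dependent, relation, head) form triples for one sentence."""
    forms = {cols[ID]: cols[FORM] for cols in sent}
    forms["0"] = "ROOT"  # HEAD 0 marks the sentence root
    return [(cols[FORM], cols[DEPREL], forms[cols[HEAD]]) for cols in sent]

# Second sentence of the parser output shown earlier, split into columns.
sent = [line.split("\t") for line in [
    "1\tSe\tse\tPRON\t_\tCase=Nom|Number=Sing|PronType=Dem\t2\tnsubj\t_\t_",
    "2\thaukkuu\thaukkua\tVERB\t_\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act\t0\troot\t_\t_",
    "3\tja\tja\tCCONJ\t_\t_\t4\tcc\t_\t_",
    "4\tjuoksee\tjuosta\tVERB\t_\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act\t2\tconj\t_\t_",
    "5\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
]]

triples = dependency_triples(sent)
for dep, rel, head in triples:
    print(f"{dep} --{rel}--> {head}")
```

In a notebook, the `one_sent` lists yielded by `read_conll` have the same shape as `sent` here, so they can be passed to the helper directly.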

