# _Flair Walkthrough_

This notebook walks through how to use [flair](https://github.com/flairNLP/flair), a simple framework built directly on PyTorch that makes it easy to train your own NLP models and experiment with new approaches using Flair embeddings and classes.

In [42]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
from tqdm.notebook import tqdm

## [_Tutorial 1: NLP Base Types_](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_1_BASICS.md)

- two types of objects are central to `flair`
    - `Sentence` object: holds a textual sentence (essentially a list of `Token` objects)
    - `Token` object: holds a single token of that sentence

In [9]:
# the sentence object holds a sentence that we may want to embed or tag
from flair.data import Sentence

# make a sentence object by passing a whitespace tokenized string
sentence = Sentence('The grass is green .')

print(sentence)

Sentence: "The grass is green ." - 5 Tokens


- the printout tells us the sentence consists of 5 tokens
- you can access tokens within a sentence via their token ID (counted from 1, using `get_token()`) or their list index (counted from 0)
- the printout of a token includes its ID and lexical value
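
Token IDs start at 1 while list indices start at 0, which is why ID 4 and index 3 name the same token. As a rough illustration that does not use flair at all, the sketch below models tokens as `(id, word)` pairs. The helpers `get_by_id` and `get_by_index` are hypothetical names chosen for this example, not flair functions.

```python
# Plain-Python stand-in for a tokenized sentence (no flair required).
words = "The grass is green .".split()

# Pair every word with a 1-based ID, mirroring the "Token: 4 green" printout.
tokens = list(enumerate(words, start=1))


def get_by_id(token_id):
    """Look up a token by its 1-based ID (hypothetical helper)."""
    return tokens[token_id - 1]


def get_by_index(index):
    """Look up a token by its 0-based list index (hypothetical helper)."""
    return tokens[index]


print(get_by_id(4))     # (4, 'green')
print(get_by_index(3))  # (4, 'green')
```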

In [10]:
# print using the token id
print(sentence.get_token(4))
# print using the index itself
print(sentence[3])

Token: 4 green
Token: 4 green


In [11]:
for token in sentence:
    print(token)

Token: 1 The
Token: 2 grass
Token: 3 is
Token: 4 green
Token: 5 .


### _Tokenization_

- in cases where text is not already tokenized, can use `use_tokenizer` flag when instantiating `Sentence`

In [4]:
# make sentence object by passing an untokenized string and use_tokenizer flag
sentence = Sentence('The grass is green.', use_tokenizer=True)

print(sentence)

Sentence: "The grass is green ." - 5 Tokens


### _Adding Custom Tokenizers_

- can also pass custom tokenizers to `use_tokenizer` flag
    - pass a tokenization method instead of a `True` boolean
    - check the code of [`flair.data.space_tokenizer`](https://github.com/flairNLP/flair/blob/master/flair/data.py) to get an idea of how to implement a wrapper
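
The core of any custom tokenizer is a function that decides where the token boundaries are. The sketch below shows only that splitting logic in plain Python, returning `(token_text, start_offset)` pairs. `simple_tokenizer` is a hypothetical name, and the wrapper that turns these pairs into whatever flair expects is deliberately left out because it depends on the installed flair version (see `space_tokenizer` for the pattern).

```python
import re
from typing import List, Tuple


def simple_tokenizer(text: str) -> List[Tuple[str, int]]:
    """Split text into word and punctuation tokens.

    Hypothetical helper, not part of flair. Returns (token_text,
    start_offset) pairs so a wrapper could record character positions.
    """
    # \w+ matches a run of word characters; [^\w\s] matches one
    # punctuation character, so "green." becomes "green" and ".".
    return [(m.group(), m.start()) for m in re.finditer(r"\w+|[^\w\s]", text)]


pairs = simple_tokenizer("The grass is green.")
print(pairs)
# [('The', 0), ('grass', 4), ('is', 10), ('green', 13), ('.', 18)]
```

Keeping the start offsets around is a design choice: it lets later steps map tokens back to the original, untokenized string.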

In [12]:
from flair.data import segtok_tokenizer

# make a sentence object by passing in untokenized string and custom tokenizer
sentence = Sentence('The grass is green.', use_tokenizer=segtok_tokenizer)

print(sentence)

Sentence: "The grass is green ." - 5 Tokens


### _Adding Tags to Tokens_

- `Token` has fields for linguistic annotation, such as:
    - lemmas
    - part-of-speech tags
    - named entity tags
- can add a tag by specifying tag type & value

In [13]:
# add NER tag of type color to the word green --> tagged word as an entity of type color
sentence[3].add_tag('ner', 'color')

# print the sentence with all tags of this type
print(sentence.to_tagged_string())

The grass is green <color> .


In [16]:
# each tag is of class Label, which next to the value has a score indicating confidence
# get token 3 in sentence
token = sentence[3]

# get the ner tag of the token
tag = token.get_tag('ner')

# print token
print(f'"{token}" is tagged as "{tag.value}" with confidence score "{tag.score}"')
# our color tag will have a score of 1.0 because we manually added it
# if tag is predicted by sequence labeler, the score value will
# indicate classifier confidence

"Token: 4 green" is tagged as "color" with confidence score "1.0"


### _Adding Labels to Sentences_

- a `Sentence` can have one or more labels, which can be used in text classification tasks, for example
- example below will add label `sports` to the sentence
    - labeling it as belonging to the `sports` category

In [18]:
sentence = Sentence('France is the current World Cup winner.')

# add a label to a sentence
sentence.add_label('sports')

# a sentence can also belong to multiple classes
sentence.add_labels(['sports', 'world cup'])

# you can also set the labels while initializing the sentence
sentence = Sentence('France is the current World Cup winner.', labels=['sports', 'world cup'])

# you can print a sentence's labels like this
for label in sentence.labels:
    print(label)

sports (1.0)
world cup (1.0)


## [_Tutorial 2: Tagging your Text_](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_2_TAGGING.md)

- below will show how to use `flair`'s pretrained models to tag your text

### _Tagging with Pre-trained Sequence Tagging Models_

- we'll use a pre-trained model for named entity recognition (NER)
    - the model was trained over the English [CoNLL-03](https://dl.acm.org/doi/10.3115/1119176.1119195) task
        - it can recognize 4 different entity types (PER, LOC, ORG, MISC)

In [21]:
from flair.models import SequenceTagger

tagger = SequenceTagger.load('ner')

sentence = Sentence('George Washington went to Washington.', use_tokenizer=True)

# use predict() method of the tagger on the sentence
# this will add predicted tags to the tokens in the sentence
tagger.predict(sentence)

# print the sentence with predicted tags
print(sentence.to_tagged_string())

2020-05-20 00:06:12,826 loading file /root/.flair/models/en-ner-conll03-v0.4.pt
George <B-PER> Washington <E-PER> went to Washington <S-LOC> .
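
The prefixes in the output above (`B-`, `E-`, `S-`) appear to follow the BIOES scheme: `B` begins a multi-token entity, `I` continues it, `E` ends it, `S` marks a single-token entity, and `O` (shown here as no tag at all) means outside any entity. As a minimal sketch of how such tags can be merged into entity spans, assuming the tags have already been paired with their tokens by hand, the hypothetical helper `bioes_to_spans` below does this in plain Python. It is not flair's own implementation.

```python
from typing import List, Optional, Tuple


def bioes_to_spans(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge BIOES-tagged tokens into (entity_type, text) spans.

    Hypothetical helper for illustration; malformed sequences (for
    example an I- tag with no preceding B- tag) are silently skipped.
    """
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None
    for word, tag in tagged:
        if tag == "O":
            current, current_type = [], None
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "S":
            # Single-token entity: emit it immediately.
            spans.append((entity_type, word))
            current, current_type = [], None
        elif prefix == "B":
            # Start collecting a multi-token entity.
            current, current_type = [word], entity_type
        elif prefix in ("I", "E") and current_type == entity_type:
            current.append(word)
            if prefix == "E":
                # End tag closes the entity and emits the span.
                spans.append((entity_type, " ".join(current)))
                current, current_type = [], None
    return spans


tagged = [("George", "B-PER"), ("Washington", "E-PER"), ("went", "O"),
          ("to", "O"), ("Washington", "S-LOC"), (".", "O")]
print(bioes_to_spans(tagged))
# [('PER', 'George Washington'), ('LOC', 'Washington')]
```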


In [22]:
# we can directly get such spans in a tagged sentence like this
for entity in sentence.get_spans('ner'):
    print(entity)

PER-span [1,2]: "George Washington"
LOC-span [5]: "Washington"


- the output above indicates that:
    - "George Washington" is a person (PER)
    - "Washington" is a location (LOC)
- we can get additional information by calling the following command

In [23]:
print(sentence.to_dict(tag_type='ner'))

{'text': 'George Washington went to Washington.', 'labels': [], 'entities': [{'text': 'George Washington', 'start_pos': 0, 'end_pos': 17, 'type': 'PER', 'confidence': 0.9967881441116333}, {'text': 'Washington', 'start_pos': 26, 'end_pos': 36, 'type': 'LOC', 'confidence': 0.9993711113929749}]}
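
Because the result is an ordinary dictionary, it can be post-processed with plain Python. The sketch below copies the dictionary from the output above, checks that `start_pos` and `end_pos` slice the entity text out of the original string, and keeps only entities above a confidence threshold. `confident_entities` and the threshold of 0.999 are assumptions made for this example.

```python
# Dictionary copied from the to_dict() output above.
result = {
    'text': 'George Washington went to Washington.',
    'labels': [],
    'entities': [
        {'text': 'George Washington', 'start_pos': 0, 'end_pos': 17,
         'type': 'PER', 'confidence': 0.9967881441116333},
        {'text': 'Washington', 'start_pos': 26, 'end_pos': 36,
         'type': 'LOC', 'confidence': 0.9993711113929749},
    ],
}


def confident_entities(result, threshold):
    """Return (type, text) pairs whose confidence is at least `threshold`.

    Hypothetical helper, not part of flair.
    """
    return [(e['type'], e['text'])
            for e in result['entities']
            if e['confidence'] >= threshold]


# The character offsets index directly into the original text.
for entity in result['entities']:
    assert result['text'][entity['start_pos']:entity['end_pos']] == entity['text']

print(confident_entities(result, threshold=0.999))
# [('LOC', 'Washington')]
```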


### _List of Pre-Trained Sequence Tagger Models_

- you can choose which pre-trained model you load
    - can be done by passing appropriate string to the `load()` method of `SequenceTagger` class
- for more information on IDs (i.e. the strings to use to load models), check out the list [here](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_2_TAGGING.md#list-of-pre-trained-sequence-tagger-models)

### _Tagging a German sentence_

- there are pre-trained models for languages other than English
- current languages that are supported:
    - German, French and Dutch
- the cell below is commented out because the model would've taken ~30min to load...

In [26]:
# load German model
#tagger = SequenceTagger.load('de-ner')

# make a German sentence
#sentence = Sentence('George Washington ging nach Washington.', use_tokenizer=True)

# predict NER tags
#tagger.predict(sentence)

# print sentence with predicted tags
#print(sentence.to_tagged_string())

### _Experimental: Semantic Frame Detection_

- `flair` has a pre-trained model that detects semantic frames in text
    - trained using [Propbank 3.0 frames](https://propbank.github.io/)
- provides word sense disambiguation for frame evoking words
- commented out due to size of model...

In [32]:
# load model
#tagger = SequenceTagger.load('frame')

# make English sentence
#sentence1 = Sentence('George returned to Berlin to return his hat.', use_tokenizer=True)
#sentence2 = Sentence('He had a look at different hats.', use_tokenizer=True)

# predict frame tags
#tagger.predict(sentence1)
#tagger.predict(sentence2)

# print sentence with predicted tags
#print(sentence1.to_tagged_string())
#print(sentence2.to_tagged_string())

- this is what the output for the above cell would look like

```
George returned <return.01> to Berlin to return <return.02> his hat .

He had <have.LV> a look <look.01> at different hats .
```

- frame detector makes distinction in sentence 1 between different meanings of the word `return`
    - `return.01` means returning to a location
    - `return.02` means giving something back
- in sentence 2, the frame detector finds a light verb construction
    - `have` is the light verb
    - `look` is a frame evoking word

### _Tagging a List of Sentences_

- may want to tag an entire text corpus
    - need to split the corpus into sentences 
    - then pass a list of `Sentence` objects to `.predict()` method

In [37]:
# text of many sentences
text = 'This is a sentence. This is another sentence. I love Berlin.'

# use library to split into sentences
from segtok.segmenter import split_single

sentences = [Sentence(sent, use_tokenizer=True) for sent in split_single(text)]

# predict tags for list of sentences
tagger = SequenceTagger.load('ner')
tagger.predict(sentences)

# iterate through the sentences and print predicted labels
for sent in sentences:
    print(sent.to_tagged_string())

2020-05-20 00:37:11,465 loading file /root/.flair/models/en-ner-conll03-v0.4.pt
This is a sentence .
This is another sentence .
I love Berlin <S-LOC> .


- the `mini_batch_size` parameter of the `predict()` method
    - sets the size of the mini-batches passed to the tagger
- you may have to play around with this parameter to optimize speed
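
To make the idea of mini-batches concrete, the sketch below splits a list into consecutive chunks of a fixed size in plain Python. It only illustrates the concept; `mini_batches` is a hypothetical helper and is not how flair batches sentences internally.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def mini_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


texts = ["This is a sentence.", "This is another sentence.", "I love Berlin."]
batches = list(mini_batches(texts, batch_size=2))
print([len(batch) for batch in batches])
# [2, 1]
```

Larger batches mean fewer passes through the model but more memory per pass, which is why the best value depends on the hardware and the sentence lengths.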

### _Tagging with Pre-Trained Text Classification Models_

- can use pre-trained model for detecting positive or negative comments
    - model was trained over IMDB dataset
    - can recognize positive and negative sentiment in English text

In [44]:
#from flair.models import TextClassifier

#classifier = TextClassifier.load('en-sentiment')

#sentence = Sentence('This film hurts. It is so bad that I am confused.', use_tokenizer=True)

# predict the sentiment label
#classifier.predict(sentence)

# print sentence with predicted labels
#print(sentence.labels)

- above cell should print the following
```
[NEGATIVE (0.9598667025566101)]
```
- contains the sentiment and the confidence
- here is a [link](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_2_TAGGING.md#list-of-pre-trained-text-classification-models) to the list of pre-trained text classification models