# _Flair Walkthrough_

This notebook walks through how to use [flair](https://github.com/flairNLP/flair), a simple framework built directly on PyTorch that makes it easy to train your own NLP models and experiment with new approaches using Flair embeddings and classes.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
from tqdm.notebook import tqdm

## [_Tutorial 1: NLP Base Types_](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_1_BASICS.md)

- two types of objects are central to `flair`
    - `Sentence` object: holds a textual sentence (essentially a list of `Token` objects)
    - `Token` object: holds a single token (a word or punctuation mark) of the sentence

In [2]:
# the sentence object holds a sentence that we may want to embed or tag
from flair.data import Sentence

# make a sentence object by passing a whitespace tokenized string
sentence = Sentence('The grass is green .')

print(sentence)

Sentence: "The grass is green ." - 5 Tokens


- print out tells us sentence consists of 5 tokens
- you can access tokens within sentence via their token id (counted from 1) or their index (counted from 0)
- the print out will include the ID and lexical value of the token

In [3]:
# print using the token id
print(sentence.get_token(4))
# print using the index itself
print(sentence[3])

Token: 4 green
Token: 4 green
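
- the ID/index relationship above can be sketched in plain Python; `ToyToken` and `ToySentence` are hypothetical stand-ins for illustration, not flair's classes

```python
from dataclasses import dataclass


@dataclass
class ToyToken:
    idx: int   # token ID, counted from 1
    text: str  # lexical value


class ToySentence:
    """Hypothetical stand-in for a sentence: a list of whitespace-split tokens."""

    def __init__(self, text):
        self.tokens = [ToyToken(i, word) for i, word in enumerate(text.split(), start=1)]

    def get_token(self, token_id):
        # IDs start at 1, so shift by one to reach the list position
        return self.tokens[token_id - 1]

    def __getitem__(self, index):
        # plain list indexing starts at 0
        return self.tokens[index]


toy = ToySentence('The grass is green .')
print(toy.get_token(4))
print(toy[3])
```

- both calls return the same object, the token with ID 4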


In [4]:
for token in sentence:
    print(token)

Token: 1 The
Token: 2 grass
Token: 3 is
Token: 4 green
Token: 5 .


### _Tokenization_

- in cases where text is not already tokenized, can use `use_tokenizer` flag when instantiating `Sentence`

In [5]:
# make sentence object by passing an untokenized string and use_tokenizer flag
sentence = Sentence('The grass is green.', use_tokenizer=True)

print(sentence)

Sentence: "The grass is green ." - 5 Tokens


### _Adding Custom Tokenizers_

- can also pass custom tokenizers to `use_tokenizer` flag
    - pass a tokenization method instead of a `True` boolean
    - check the code of [`flair.data.space_tokenizer`](https://github.com/flairNLP/flair/blob/master/flair/data.py) to get idea of how to implement a wrapper 

In [6]:
from flair.data import segtok_tokenizer

# make a sentence object by passing in untokenized string and custom tokenizer
sentence = Sentence('The grass is green.', use_tokenizer=segtok_tokenizer)

print(sentence)

Sentence: "The grass is green ." - 5 Tokens
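
- a minimal sketch of the splitting logic a custom tokenizer could implement, in plain Python
    - `simple_tokenize` is a hypothetical name
    - it returns `(text, start offset)` pairs rather than flair `Token` objects, so it would still need to be wrapped (as in `space_tokenizer`) before being passed to `use_tokenizer`

```python
import re


def simple_tokenize(text):
    """Split text into words and single punctuation marks, keeping character offsets."""
    # \w+ matches a run of word characters, [^\w\s] matches one punctuation character
    return [(match.group(), match.start()) for match in re.finditer(r'\w+|[^\w\s]', text)]


pieces = simple_tokenize('The grass is green.')
print(pieces)
```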


### _Adding Tags to Tokens_

- `Token` has fields for linguistic annotation, such as:
    - lemmas
    - part-of-speech tags
    - named entity tags
- can add a tag by specifying tag type & value

In [7]:
# add NER tag of type color to the word green --> tagged word as an entity of type color
sentence[3].add_tag('ner', 'color')

# print the sentence with all tags of this type
print(sentence.to_tagged_string())

The grass is green <color> .


In [8]:
# each tag is of class Label, which next to the value has a score indicating confidence
# get the token at index 3 (token ID 4, 'green')
token = sentence[3]

# get the ner tag of the token
tag = token.get_tag('ner')

# print token
print(f'"{token}" is tagged as "{tag.value}" with confidence score "{tag.score}"')
# our color tag will have a score of 1.0 because we manually added it
# if tag is predicted by sequence labeler, the score value will
# indicate classifier confidence

"Token: 4 green" is tagged as "color" with confidence score "1.0"


### _Adding Labels to Sentences_

- a `Sentence` can have one or more labels, which can be used in text classification tasks, for example
- example below will add label `sports` to the sentence
    - labeling it as belonging to the `sports` category

In [9]:
sentence = Sentence('France is the current World Cup winner.')

# add a label to a sentence
sentence.add_label('sports')

# a sentence can also belong to multiple classes
sentence.add_labels(['sports', 'world cup'])

# you can also set the labels while initializing the sentence
sentence = Sentence('France is the current World Cup winner.', labels=['sports', 'world cup'])

# you can print a sentence's labels like this
for label in sentence.labels:
    print(label)

sports (1.0)
world cup (1.0)


## [_Tutorial 2: Tagging your Text_](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_2_TAGGING.md)

- below will show how to use `flair`'s pretrained models to tag your text

### _Tagging with Pre-trained Sequence Tagging Models_

- we'll use pre-trained model for named entity recognition (NER)
    - model was trained over the English [CoNLL-03](https://dl.acm.org/doi/10.3115/1119176.1119195) task 
        - can recognize 4 different entity types (PER, LOC, ORG, MISC)

In [10]:
from flair.models import SequenceTagger

tagger = SequenceTagger.load('ner')

sentence = Sentence('George Washington went to Washington.', use_tokenizer=True)

# use predict() method of the tagger on the sentence
# this will add predicted tags to the tokens in the sentence
tagger.predict(sentence)

# print the sentence with predicted tags
print(sentence.to_tagged_string())

2020-05-20 18:54:11,441 loading file /root/.flair/models/en-ner-conll03-v0.4.pt
George <B-PER> Washington <E-PER> went to Washington <S-LOC> .
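
- the `B-`, `E-` and `S-` prefixes in the output mark the beginning, the end and the single-token case of an entity (the Dutch tag dictionary in Tutorial 6 also lists `I-` for tokens inside an entity)
- a minimal sketch of how such tags can be grouped into spans, in plain Python; `tags_to_spans` is a hypothetical helper, not flair's own implementation

```python
def tags_to_spans(tokens, tags):
    """Group B-/I-/E-/S- tagged tokens into (type, token IDs, text) spans."""
    spans = []
    open_ids, open_words, open_type = [], [], None
    for token_id, (word, tag) in enumerate(zip(tokens, tags), start=1):
        if tag == 'O':
            # outside any entity: discard an unfinished span, if any
            open_ids, open_words, open_type = [], [], None
            continue
        prefix, entity_type = tag.split('-', 1)
        if prefix == 'S':
            # single-token entity
            spans.append((entity_type, [token_id], word))
            open_ids, open_words, open_type = [], [], None
        elif prefix == 'B':
            # start collecting a multi-token entity
            open_ids, open_words, open_type = [token_id], [word], entity_type
        elif prefix in ('I', 'E') and open_type == entity_type:
            open_ids.append(token_id)
            open_words.append(word)
            if prefix == 'E':
                # entity is complete
                spans.append((entity_type, open_ids, ' '.join(open_words)))
                open_ids, open_words, open_type = [], [], None
    return spans


tokens = ['George', 'Washington', 'went', 'to', 'Washington', '.']
tags = ['B-PER', 'E-PER', 'O', 'O', 'S-LOC', 'O']
spans = tags_to_spans(tokens, tags)
print(spans)
```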


In [11]:
# we can directly get such spans in a tagged sentence like this
for entity in sentence.get_spans('ner'):
    print(entity)

PER-span [1,2]: "George Washington"
LOC-span [5]: "Washington"


- above indicates that:
    - "George Washington" is a person (PER)
    - "Washington" is a location (LOC)
- we can get additional information by calling the following command

In [12]:
print(sentence.to_dict(tag_type='ner'))

{'text': 'George Washington went to Washington.', 'labels': [], 'entities': [{'text': 'George Washington', 'start_pos': 0, 'end_pos': 17, 'type': 'PER', 'confidence': 0.9967881441116333}, {'text': 'Washington', 'start_pos': 26, 'end_pos': 36, 'type': 'LOC', 'confidence': 0.9993711113929749}]}
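
- the dictionary above is plain Python, so it can be filtered directly
- the sketch below copies the printed output and keeps entities above a confidence threshold (the `0.999` threshold is an arbitrary value chosen for illustration)

```python
result = {
    'text': 'George Washington went to Washington.',
    'labels': [],
    'entities': [
        {'text': 'George Washington', 'start_pos': 0, 'end_pos': 17,
         'type': 'PER', 'confidence': 0.9967881441116333},
        {'text': 'Washington', 'start_pos': 26, 'end_pos': 36,
         'type': 'LOC', 'confidence': 0.9993711113929749},
    ],
}

threshold = 0.999  # arbitrary cut-off, chosen only for this example

# keep entities the tagger was very confident about
confident = [e['text'] for e in result['entities'] if e['confidence'] >= threshold]
print(confident)

# start_pos / end_pos are character offsets into the original text
for entity in result['entities']:
    print(entity['type'], result['text'][entity['start_pos']:entity['end_pos']])
```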


### _List of Pre-Trained Sequence Tagger Models_

- you can choose which pre-trained model you load
    - can be done by passing appropriate string to the `load()` method of `SequenceTagger` class
- for more information on IDs (i.e. the strings to use to load models), check out the list [here](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_2_TAGGING.md#list-of-pre-trained-sequence-tagger-models)

### _Tagging a German sentence_

- there are pre-trained models for languages other than English
- current languages that are supported:
    - German, French and Dutch
- the cell below is commented out because the model would've taken ~30min to load...

In [13]:
# load German model
#tagger = SequenceTagger.load('de-ner')

# make a German sentence
#sentence = Sentence('George Washington ging nach Washington.', use_tokenizer=True)

# predict NER tags
#tagger.predict(sentence)

# print sentence with predicted tags
#print(sentence.to_tagged_string())

### _Experimental: Semantic Frame Detection_

- flair has a pre-trained model that detects semantic frames in text
    - trained using [Propbank 3.0 frames](https://propbank.github.io/)
- provides word sense disambiguation for frame evoking words
- commented out due to size of model...

In [14]:
# load model
#tagger = SequenceTagger.load('frame')

# make English sentence
#sentence1 = Sentence('George returned to Berlin to return his hat.', use_tokenizer=True)
#sentence2 = Sentence('He had a look at different hats.', use_tokenizer=True)

# predict frame tags
#tagger.predict(sentence1)
#tagger.predict(sentence2)

# print sentence with predicted tags
#print(sentence1.to_tagged_string())
#print(sentence2.to_tagged_string())

- this is what the output for the above cell would look like

```
George returned <return.01> to Berlin to return <return.02> his hat .

He had <have.LV> a look <look.01> at different hats .
```

- frame detector makes distinction in sentence 1 between different meanings of the word `return`
    - `return.01` means returning to a location
    - `return.02` means giving something back
- in sentence two, frame detector finds light verb construction
    - `have` is the light verb
    - `look` is a frame evoking word

### _Tagging a List of Sentences_

- may want to tag an entire text corpus
    - need to split the corpus into sentences 
    - then pass a list of `Sentence` objects to `.predict()` method

In [15]:
# text of many sentences
text = 'This is a sentence. This is another sentence. I love Berlin.'

# use library to split into sentences
from segtok.segmenter import split_single

sentences = [Sentence(sent, use_tokenizer=True) for sent in split_single(text)]

# predict tags for list of sentences
tagger = SequenceTagger.load('ner')
tagger.predict(sentences)

# iterate through the sentences and print predicted labels
for sent in sentences:
    print(sent.to_tagged_string())

2020-05-20 18:54:15,979 loading file /root/.flair/models/en-ner-conll03-v0.4.pt
This is a sentence .
This is another sentence .
I love Berlin <S-LOC> .


- `mini_batch_size` parameter of `predict()` method
    - can set size of mini-batches passed to the tagger
- may have to play around with this parameter to optimize speed
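
- a minimal sketch of what mini-batching means, in plain Python; `make_batches` is a hypothetical helper and not how flair batches internally

```python
def make_batches(items, mini_batch_size):
    """Split a list into consecutive chunks of at most mini_batch_size items."""
    return [items[start:start + mini_batch_size]
            for start in range(0, len(items), mini_batch_size)]


texts = ['This is a sentence.', 'This is another sentence.', 'I love Berlin.']
batches = make_batches(texts, mini_batch_size=2)
print([len(batch) for batch in batches])
```

- larger batches mean fewer passes through the model but more memory per pass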

### _Tagging with Pre-Trained Text Classification Models_

- can use pre-trained model for detecting positive or negative comments
    - model was trained over IMDB dataset
    - can recognize positive and negative sentiment in English text

In [16]:
#from flair.models import TextClassifier

#classifier = TextClassifier.load('en-sentiment')

#sentence = Sentence('This film hurts. It is so bad that I am confused.', use_tokenizer=True)

# predict sentiment labels
#classifier.predict(sentence)

# print sentence with predicted labels
#print(sentence.labels)

- above cell should print the following
```
[NEGATIVE (0.9598667025566101)]
```
- contains the sentiment and the confidence
- here is a [link](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_2_TAGGING.md#list-of-pre-trained-text-classification-models) to the list of pre-trained text classification models

## [_Tutorial 3: Word Embeddings_](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_3_WORD_EMBEDDING.md)

### _Embeddings_

- all word embedding classes inherit from the `TokenEmbeddings` class
    - and implement the `embed()` method, which needs to be called to embed your text
    
### _Classic Word Embeddings_

- are static and word-level
    - each distinct word gets exactly one pre-computed embedding
    - most embeddings fall under this class (including GloVe or Komninos embeddings)

In [19]:
# instantiate WordEmbeddings class, pass string ID of embedding you wish to load
from flair.embeddings import WordEmbeddings

# init embedding
glove_embedding = WordEmbeddings('glove')

# create sentence
sentence = Sentence('The grass is green.', use_tokenizer=True)

# embed a sentence using glove
glove_embedding.embed(sentence)

# now check out embedded tokens
for token in sentence:
    print(token)
    print(token.embedding)

Token: 1 The
tensor([-0.0382, -0.2449,  0.7281, -0.3996,  0.0832,  0.0440, -0.3914,  0.3344,
        -0.5755,  0.0875,  0.2879, -0.0673,  0.3091, -0.2638, -0.1323, -0.2076,
         0.3340, -0.3385, -0.3174, -0.4834,  0.1464, -0.3730,  0.3458,  0.0520,
         0.4495, -0.4697,  0.0263, -0.5415, -0.1552, -0.1411, -0.0397,  0.2828,
         0.1439,  0.2346, -0.3102,  0.0862,  0.2040,  0.5262,  0.1716, -0.0824,
        -0.7179, -0.4153,  0.2033, -0.1276,  0.4137,  0.5519,  0.5791, -0.3348,
        -0.3656, -0.5486, -0.0629,  0.2658,  0.3020,  0.9977, -0.8048, -3.0243,
         0.0125, -0.3694,  2.2167,  0.7220, -0.2498,  0.9214,  0.0345,  0.4674,
         1.1079, -0.1936, -0.0746,  0.2335, -0.0521, -0.2204,  0.0572, -0.1581,
        -0.3080, -0.4162,  0.3797,  0.1501, -0.5321, -0.2055, -1.2526,  0.0716,
         0.7056,  0.4974, -0.4206,  0.2615, -1.5380, -0.3022, -0.0734, -0.2831,
         0.3710, -0.2522,  0.0162, -0.0171, -0.3898,  0.8742, -0.7257, -0.5106,
        -0.5203, -0.1459,  

- above are the print outs of the tokens in the sentence and their embeddings
    - GloVe embeddings are `PyTorch` vectors of dimensionality 100
- typically, you'll use **two-letter language code** to init an embedding
    - `en` for English
    - `de` for German
- by default, this will initialize FastText embeddings trained over Wikipedia
    - can also use FastText embeddings over web crawls with `'-crawl'`
    ```
    # example
    german_embedding = WordEmbeddings('de-crawl')
    ```
- generally recommend FastText embeddings, or GloVe if you want a smaller model
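
- "static" means the embedding is a plain lookup: the same word maps to the same vector in every sentence
- a toy sketch with made-up 3-dimensional vectors (real GloVe vectors have 100 dimensions, as noted above); `embed_static` is a hypothetical helper

```python
# made-up vectors for illustration only, not real GloVe values
toy_vectors = {
    'the': [0.1, 0.2, 0.3],
    'grass': [0.5, -0.1, 0.0],
    'is': [0.0, 0.3, -0.2],
    'green': [-0.4, 0.6, 0.1],
    'light': [0.2, 0.2, 0.7],
}
unknown = [0.0, 0.0, 0.0]  # fallback for words missing from the toy table


def embed_static(words):
    """Look up each word independently; the context plays no role."""
    return [toy_vectors.get(word.lower(), unknown) for word in words]


first = embed_static(['The', 'grass', 'is', 'green'])
second = embed_static(['The', 'light', 'is', 'green'])

# 'green' gets the identical vector in both sentences
print(first[3] == second[3])
```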

### _Flair Embeddings_

- contextual string embeddings --> capture latent syntactic-semantic information that goes beyond standard word embeddings
    - are trained without any explicit notion of words, fundamentally model words as sequences of characters
    - are contextualized by their surrounding text
        - means that _the same word will have different embeddings depending on its contextual use_

In [21]:
from flair.embeddings import FlairEmbeddings

# init embedding
flair_embedding_forward = FlairEmbeddings('news-forward')

# create a sentence
sentence = Sentence('The grass is green.', use_tokenizer=True)

# embed words in sentence
flair_embedding_forward.embed(sentence)

[Sentence: "The grass is green ." - 5 Tokens]

- for all supported languages, there is a forward and backward model
- to load a model for a language, use the two-letter language code, followed by a hyphen and either **forward or backward**
```
# example
flair_embedding_forward = FlairEmbeddings('de-forward')
flair_embedding_backward = FlairEmbeddings('de-backward')
```

### _Stacked Embeddings_

- one of the most important concepts of the library
- can use to combine different embeddings together
    - for example, if you want to use both traditional & contextual string embeddings together
- allows you to mix and match, finding a combination of embeddings that gives best results

In [23]:
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# init standard GloVe embedding
glove_embedding = WordEmbeddings('glove')

# init Flair forward and backwards embeddings
flair_embedding_forward = FlairEmbeddings('news-forward')
flair_embedding_backward = FlairEmbeddings('news-backward')

# init StackedEmbeddings class and pass it a list containing the three embeddings above
stacked_embeddings = StackedEmbeddings([
    glove_embedding,
    flair_embedding_forward,
    flair_embedding_backward
])

# create a sentence
sentence = Sentence('The grass is green.', use_tokenizer=True)

# just embed a sentence using the StackedEmbedding as you would any single embedding
stacked_embeddings.embed(sentence)

# now check out the embedded tokens
for token in sentence:
    print(token)
    print(token.embedding)

Token: 1 The
tensor([-0.0382, -0.2449,  0.7281,  ..., -0.0065, -0.0053,  0.0090],
       device='cuda:0')
Token: 2 grass
tensor([-0.8135,  0.9404, -0.2405,  ...,  0.0354, -0.0255, -0.0143],
       device='cuda:0')
Token: 3 is
tensor([-5.4264e-01,  4.1476e-01,  1.0322e+00,  ..., -5.3691e-04,
        -9.6750e-03, -2.7541e-02], device='cuda:0')
Token: 4 green
tensor([-0.6791,  0.3491, -0.2398,  ..., -0.0007, -0.1333,  0.0161],
       device='cuda:0')
Token: 5 .
tensor([-0.3398,  0.2094,  0.4635,  ...,  0.0005, -0.0177,  0.0032],
       device='cuda:0')
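
- conceptually, stacking concatenates the vectors each embedding produces for a token, so the stacked dimensionality is the sum of the individual ones
    - this matches the output above: the stacked vector for `The` starts with the same values as its GloVe vector
- a plain-Python sketch with made-up vectors (`stack` is a hypothetical helper, not the `StackedEmbeddings` implementation)

```python
def stack(*vectors):
    """Concatenate several per-token vectors into one longer vector."""
    stacked = []
    for vector in vectors:
        stacked.extend(vector)
    return stacked


# made-up vectors standing in for three embeddings of the same token
word_vector = [0.1, 0.2]
forward_vector = [0.3, 0.4, 0.5]
backward_vector = [0.6]

combined = stack(word_vector, forward_vector, backward_vector)
print(len(combined))
print(combined[:2] == word_vector)
```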


## [_Tutorial 4: List of All Word Embeddings_](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_4_ELMO_BERT_FLAIR_EMBEDDING.md)

- primarily a list of all embeddings that are supported in Flair
    - [click here](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_4_ELMO_BERT_FLAIR_EMBEDDING.md#overview) for list of these embeddings
    
### _Combining BERT and Flair_

- you can easily mix and match Flair, ELMo, BERT and classic word embeddings
    - just instantiate each embedding you wish to combine and use in `StackedEmbeddings`

In [26]:
from flair.embeddings import FlairEmbeddings, BertEmbeddings, StackedEmbeddings

# init Flair embeddings
flair_forward_embedding = FlairEmbeddings('multi-forward')
flair_backward_embedding = FlairEmbeddings('multi-backward')

# init multilingual BERT
bert_embedding = BertEmbeddings('bert-base-multilingual-cased')

# now create the StackedEmbedding object that combines all embeddings
stacked_embeddings = StackedEmbeddings(
    embeddings=[flair_forward_embedding, flair_backward_embedding, bert_embedding]
)

sentence = Sentence('The grass is green.', use_tokenizer=True)

# just embed a sentence using the StackedEmbedding as you would with any single embedding
stacked_embeddings.embed(sentence)

# now check out the embedded tokens
for token in sentence:
    print(token)
    print(token.embedding)

Token: 1 The
tensor([0.6800, 0.2429, 0.0012,  ..., 0.3829, 0.4721, 0.2985], device='cuda:0')
Token: 2 grass
tensor([ 2.9200e-01,  2.2066e-02,  4.5290e-05,  ...,  8.5283e-01,
        -5.0724e-02,  3.4476e-01], device='cuda:0')
Token: 3 is
tensor([-0.5447,  0.0229,  0.0078,  ..., -0.1828,  0.7153,  0.0051],
       device='cuda:0')
Token: 4 green
tensor([1.4772e-01, 1.0973e-01, 8.5618e-04,  ..., 1.0157e+00, 7.5358e-01,
        1.1230e-01], device='cuda:0')
Token: 5 .
tensor([-1.5555e-01,  6.7598e-03,  5.3829e-06,  ..., -6.0930e-01,
         9.0591e-01,  1.7857e-01], device='cuda:0')


## [_Tutorial 5: Document Embeddings_](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_5_DOCUMENT_EMBEDDINGS.md)

- all document embedding classes inherit from `DocumentEmbeddings` class
    - implement `embed()` method which you need to call to embed your text
- all embeddings produced with Flair's methods are `PyTorch` vectors
    - can be immediately used for training and fine-tuning
    
### _Document Embeddings_

- are created from the embeddings of all words in the document
- two different methods to obtain a document embedding from a list of word embeddings
    - pooling
    - RNN
    
### _Pooling_

- calculates pooling operation over all word embeddings in a document
    - default operation is `mean`, gives us the mean of all words in the sentence

In [27]:
# create a document embedding using GloVe with Flair
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentPoolEmbeddings

# init word embeddings
glove_embedding = WordEmbeddings('glove')
flair_embedding_forward = FlairEmbeddings('news-forward')
flair_embedding_backward = FlairEmbeddings('news-backward')

# init document embeddings, mode = mean
document_embeddings = DocumentPoolEmbeddings([
    glove_embedding,
    flair_embedding_forward,
    flair_embedding_backward
])

# create an example sentence
sentence = Sentence('The grass is green. And the sky is blue.', use_tokenizer=True)

# embed the sentence with document embedding
document_embeddings.embed(sentence)

# now check out the embedded sentence
print(sentence.get_embedding())

tensor([-0.3197,  0.2621,  0.4037,  ..., -0.0021, -0.0207, -0.0016],
       device='cuda:0', grad_fn=<CatBackward>)


- the above printed out the embedding of the document
- since document embedding is derived from word embeddings...
    - dimensionality depends on dimensionality of word embeddings you are using
- can also use `min` or `max` pooling (see below)
```
# example
document_embeddings = DocumentPoolEmbeddings([
    glove_embedding,
    flair_embedding_forward,
    flair_embedding_backward],
    pooling='min'
    )
```
- can also choose which fine-tuning operation you want
    - i.e. which transformation to apply before word embeddings get pooled
    - default operation is `linear` transformation
    - but if you only use simple word embeddings that are not task-trained, a `nonlinear` transformation is probably the better choice

```
# instantiate pre-trained word embeddings
embeddings = WordEmbeddings('glove')

# document pool embeddings
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='nonlinear')
```

- if you use word embeddings that are task-trained (such as one-hot encoded embeddings), you are better off doing no transformation at all

```
# instantiate one-hot encoded word embeddings
# (`corpus` is a Corpus object, see Tutorial 6)
from flair.embeddings import OneHotEmbeddings
embeddings = OneHotEmbeddings(corpus)

# document pool embeddings
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='none')
```
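
- the pooling operations described in this section can be sketched in plain Python; `pool` is a hypothetical helper working on made-up vectors, and it leaves out the fine-tuning transformation

```python
def pool(word_vectors, pooling='mean'):
    """Combine equally sized word vectors into one document vector, dimension by dimension."""
    columns = list(zip(*word_vectors))  # one tuple per dimension
    if pooling == 'mean':
        return [sum(column) / len(column) for column in columns]
    if pooling == 'min':
        return [min(column) for column in columns]
    if pooling == 'max':
        return [max(column) for column in columns]
    raise ValueError(f'unknown pooling operation: {pooling}')


# made-up 2-dimensional word vectors
word_vectors = [[1.0, 4.0], [3.0, 2.0]]

print(pool(word_vectors, 'mean'))
print(pool(word_vectors, 'min'))
print(pool(word_vectors, 'max'))
```

- the pooled vector has the same dimensionality as the word vectors it was built from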

### _RNN_

- Flair also supports an RNN to obtain a document embedding (`DocumentRNNEmbeddings`)
- takes the word embeddings of every token in document as input
    - provides its last output state as document embedding
- can choose which type of RNN you wish to use
- by default, a GRU-type RNN is instantiated
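
- a minimal sketch of "last output state as document embedding", in plain Python
    - it uses a simple tanh RNN with made-up weights instead of a GRU
    - `last_state` is a hypothetical helper, not part of flair

```python
import math


def last_state(word_vectors, w_in, w_rec):
    """Run a simple tanh RNN over the word vectors and return the final hidden state."""
    hidden = [0.0] * len(w_rec)
    for x in word_vectors:
        # new state mixes the current word vector with the previous state
        hidden = [
            math.tanh(
                sum(w * value for w, value in zip(w_in[j], x))
                + sum(w * value for w, value in zip(w_rec[j], hidden))
            )
            for j in range(len(hidden))
        ]
    return hidden


# made-up weights: 2 input dimensions, 2 hidden states
w_in = [[0.5, -0.5], [0.25, 0.75]]
w_rec = [[0.1, 0.0], [0.0, 0.1]]

word_vectors = [[1.0, 0.0], [0.0, 1.0]]
forward = last_state(word_vectors, w_in, w_rec)
backward = last_state(word_vectors[::-1], w_in, w_rec)

print(len(forward))
print(forward != backward)
```

- unlike mean pooling, the result depends on the order of the words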

In [28]:
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings

# init GloVe embedding
glove_embedding = WordEmbeddings('glove')

# init RNN document embedding
document_embedding = DocumentRNNEmbeddings([glove_embedding])

# create example sentence
sentence = Sentence('The grass is green. And the sky is blue.', use_tokenizer=True)

# embed the sentence with the RNN document embedding created above
document_embedding.embed(sentence)

# now check out the embedded sentence
# (values differ between runs, since the RNN weights are untrained)
print(sentence.get_embedding())


- the above outputs a single embedding for the complete sentence
    - embedding dimensionality depends on number of hidden states you are using & whether RNN is bidirectional or not
- if you want to use different type of RNN, can set the `rnn_type` parameter in the constructor

In [29]:
# init GloVe embedding 
glove_embedding = WordEmbeddings('glove')

# create LSTM document embedding
document_lstm_embeddings = DocumentRNNEmbeddings([glove_embedding], rnn_type='LSTM')

- Note: `DocumentPoolEmbeddings` are immediately meaningful
    - `DocumentRNNEmbeddings` need to be tuned on the downstream task
    - this happens automatically in Flair if you train a new model with these embeddings
    - for an example, click [here](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md#training-a-text-classification-model)
    - once model is trained, you can access tuned `DocumentRNNEmbeddings` object directly from classifier object and embed sentences

```
document_embeddings = classifier.document_embeddings

sentence = Sentence('The grass is green. And the sky is blue.', use_tokenizer=True)

document_embeddings.embed(sentence)

print(sentence.get_embedding())
```

- `DocumentRNNEmbeddings` have hyper-parameters that can be tuned to improve learning
    - `hidden_size`: number of hidden states in the RNN
    - `rnn_layers`: number of layers for the RNN
    - `reproject_words`: boolean value, indicates whether to reproject the token embeddings in a separate linear layer before putting them into the RNN or not
    - `reproject_words_dimension`: output dimension of reprojecting token embeddings; if `None` the same output dimension as before will be taken
    - `bidirectional`: boolean value, indicating whether to use bidirectional RNN or not
    - `dropout`: dropout value to be used
    - `word_dropout`: word dropout value to be used, if `0.0` word dropout is not used
    - `locked_dropout`: locked dropout value to be used, if `0.0` locked dropout is not used
    - `rnn_type`: type of RNN to use, such as `GRU` (the default, see above) or `LSTM`

## [_Tutorial 6: Loading Training Data_](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_6_CORPUS.md)

- `Corpus` represents the dataset you use to train a model
    - consists of the following, which correspond to the training, validation and testing splits during model training:
        - list of `train` sentences
        - list of `dev` sentences
        - list of `test` sentences

In [31]:
# instantiate Universal Dependency Treebank for English as corpus object
import flair.datasets
corpus = flair.datasets.UD_ENGLISH()

2020-05-20 21:23:53,988 Reading data from /root/.flair/datasets/ud_english
2020-05-20 21:23:53,989 Train: /root/.flair/datasets/ud_english/en_ewt-ud-train.conllu
2020-05-20 21:23:53,990 Test: /root/.flair/datasets/ud_english/en_ewt-ud-test.conllu
2020-05-20 21:23:53,991 Dev: /root/.flair/datasets/ud_english/en_ewt-ud-dev.conllu


- note: the first time you call this snippet, it triggers a download of the dataset onto your hard drive
    - then reads the train, test and dev splits into the `Corpus` object, which it returns

In [32]:
# print the number of Sentences in the train split
print(len(corpus.train))

# print the number of Sentences in the test split
print(len(corpus.test))

# print the number of Sentences in the dev split
print(len(corpus.dev))

12543
2077
2002


In [33]:
# can also access Sentence object directly --> print the first Sentence in the testing split
print(corpus.test[0])

Sentence: "What if Google Morphed Into GoogleOS ?" - 7 Tokens


- `Sentence` above is fully tagged with syntactic and morphological information
    - for example the POS tags
- this means that the corpus is tagged and ready for training

In [35]:
# print the first Sentence in the testing split with its POS tags
print(corpus.test[0].to_tagged_string('pos'))

What <WP> if <IN> Google <NNP> Morphed <VBD> Into <IN> GoogleOS <NNP> ? <.>


### _Helper Functions_

- `Corpus` contains useful helper functions
    - example: you can downsample the data by calling `downsample()` & passing a ratio

In [38]:
# downsample to 10% of the data
downsampled_corpus = flair.datasets.UD_ENGLISH().downsample(0.1)

print('--- 1 Original ---')
print(corpus)

print('--- 2 Downsampled ---')
print(downsampled_corpus)

2020-05-20 21:35:39,420 Reading data from /root/.flair/datasets/ud_english
2020-05-20 21:35:39,421 Train: /root/.flair/datasets/ud_english/en_ewt-ud-train.conllu
2020-05-20 21:35:39,422 Test: /root/.flair/datasets/ud_english/en_ewt-ud-test.conllu
2020-05-20 21:35:39,422 Dev: /root/.flair/datasets/ud_english/en_ewt-ud-dev.conllu
--- 1 Original ---
Corpus: 12543 train + 2002 dev + 2077 test sentences
--- 2 Downsampled ---
Corpus: 1254 train + 200 dev + 208 test sentences
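
- a sketch of what downsampling does to each split, in plain Python
    - `downsample_split` is a hypothetical helper; flair's own sampling may pick sentences differently
    - rounding the scaled split sizes reproduces the counts printed above

```python
import random


def downsample_split(items, ratio, seed=0):
    """Keep a random subset containing roughly `ratio` of the items."""
    keep = round(len(items) * ratio)
    return random.Random(seed).sample(items, keep)


# stand-ins for the three splits, using the split sizes printed above
splits = {'train': list(range(12543)), 'dev': list(range(2002)), 'test': list(range(2077))}

downsampled = {name: downsample_split(items, 0.1) for name, items in splits.items()}
print({name: len(items) for name, items in downsampled.items()})
```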


- for many learning tasks you need to create a target dictionary
    - `Corpus` enables you to create your tag or label dictionary, depending on task you want to learn

In [41]:
# create tag dictionary for a PoS task
corpus = flair.datasets.UD_ENGLISH()
print(corpus.make_tag_dictionary('upos'))

# create tag dictionary for an NER task
corpus = flair.datasets.CONLL_03_DUTCH()
print(corpus.make_tag_dictionary('ner'))

# create label dictionary for a text classification task
corpus = flair.datasets.TREC_6()
print(corpus.make_label_dictionary())

2020-05-20 22:11:09,138 Reading data from /root/.flair/datasets/ud_english
2020-05-20 22:11:09,139 Train: /root/.flair/datasets/ud_english/en_ewt-ud-train.conllu
2020-05-20 22:11:09,140 Test: /root/.flair/datasets/ud_english/en_ewt-ud-test.conllu
2020-05-20 22:11:09,140 Dev: /root/.flair/datasets/ud_english/en_ewt-ud-dev.conllu
Dictionary with 21 tags: <unk>, O, PROPN, PUNCT, ADJ, NOUN, VERB, DET, ADP, AUX, PRON, PART, SCONJ, NUM, ADV, CCONJ, X, INTJ, SYM, <START>, <STOP>
2020-05-20 22:11:13,814 Reading data from /root/.flair/datasets/conll_03_dutch
2020-05-20 22:11:13,814 Train: /root/.flair/datasets/conll_03_dutch/ned.train
2020-05-20 22:11:13,815 Dev: /root/.flair/datasets/conll_03_dutch/ned.testa
2020-05-20 22:11:13,815 Test: /root/.flair/datasets/conll_03_dutch/ned.testb
Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-PER, S-LOC, B-MISC, E-MISC, B-ORG, E-ORG, I-ORG, I-PER, B-LOC, I-LOC, E-LOC, I-MISC, <START>, <STOP>
2020-05-20 22:11:24,271 Reading data from /roo

100%|██████████| 4907/4907 [00:00<00:00, 203930.18it/s]

2020-05-20 22:11:24,599 [b'ENTY', b'DESC', b'HUM', b'LOC', b'NUM', b'ABBR']
Dictionary with 6 tags: ENTY, DESC, HUM, LOC, NUM, ABBR
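
- conceptually, a tag dictionary maps every tag seen in the corpus to an integer ID
- a plain-Python sketch using the POS-tagged sentence printed earlier
    - `build_tag_dictionary` is a hypothetical helper
    - of the special entries flair adds, only `<unk>` is kept here; `<START>` and `<STOP>` are left out

```python
def build_tag_dictionary(tagged_sentences):
    """Map each distinct tag to an integer ID, in order of first appearance."""
    tag_to_id = {'<unk>': 0}  # reserve ID 0 for unknown tags
    for sentence in tagged_sentences:
        for _word, tag in sentence:
            if tag not in tag_to_id:
                tag_to_id[tag] = len(tag_to_id)
    return tag_to_id


tagged_sentences = [[
    ('What', 'WP'), ('if', 'IN'), ('Google', 'NNP'), ('Morphed', 'VBD'),
    ('Into', 'IN'), ('GoogleOS', 'NNP'), ('?', '.'),
]]

tag_dictionary = build_tag_dictionary(tagged_sentences)
print(tag_dictionary)
```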





- another useful function is `obtain_statistics()`
    - returns a Python dictionary with useful stats about dataset
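
- a sketch of the kind of numbers such a statistics dictionary holds, computed in plain Python on made-up documents
    - `toy_statistics` is a hypothetical helper; its key names mirror the output printed below

```python
from collections import Counter


def toy_statistics(documents):
    """Count documents per class and summarize token counts per document."""
    lengths = [len(text.split()) for text, _label in documents]
    return {
        'total_number_of_documents': len(documents),
        'number_of_documents_per_class': dict(Counter(label for _text, label in documents)),
        'number_of_tokens': {
            'total': sum(lengths),
            'min': min(lengths),
            'max': max(lengths),
            'avg': sum(lengths) / len(lengths),
        },
    }


# made-up question/label pairs in the spirit of TREC-6
documents = [
    ('Who wrote Hamlet ?', 'HUM'),
    ('Where is Berlin ?', 'LOC'),
    ('Who painted the Mona Lisa ?', 'HUM'),
]

toy_stats = toy_statistics(documents)
print(toy_stats)
```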

In [42]:
# gather stats on the TREC-6 dataset
import flair.datasets
corpus = flair.datasets.TREC_6()
stats = corpus.obtain_statistics()
print(stats)

2020-05-20 22:12:43,472 Reading data from /root/.flair/datasets/trec_6
2020-05-20 22:12:43,473 Train: /root/.flair/datasets/trec_6/train.txt
2020-05-20 22:12:43,474 Dev: None
2020-05-20 22:12:43,474 Test: /root/.flair/datasets/trec_6/test.txt
{
    "TRAIN": {
        "dataset": "TRAIN",
        "total_number_of_documents": 4907,
        "number_of_documents_per_class": {
            "NUM": 808,
            "ENTY": 1138,
            "DESC": 1044,
            "LOC": 745,
            "HUM": 1098,
            "ABBR": 74
        },
        "number_of_tokens_per_tag": {},
        "number_of_tokens": {
            "total": 50043,
            "min": 3,
            "max": 37,
            "avg": 10.198288159771755
        }
    },
    "TEST": {
        "dataset": "TEST",
        "total_number_of_documents": 500,
        "number_of_documents_per_class": {
            "NUM": 113,
            "LOC": 81,
            "HUM": 65,
            "DESC": 138,
            "ENTY": 94,
            "ABBR": 9
  

### _The MultiCorpus Object_

- if you want to train multiple tasks at once, you can use `MultiCorpus`
- first need to create any number of `Corpus` objects
    - after, pass list of `Corpus` to `MultiCorpus` object
- the following loads a combination corpus consisting of English, German and Dutch Universal Dependency Treebanks
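
- conceptually, a multi-corpus pools the corresponding splits of its member corpora
- a plain-Python sketch with made-up sentences (`combine_corpora` is a hypothetical helper, not the `MultiCorpus` implementation)

```python
def combine_corpora(corpora):
    """Concatenate the train, dev and test splits of several corpora."""
    combined = {'train': [], 'dev': [], 'test': []}
    for corpus in corpora:
        for split in combined:
            combined[split].extend(corpus[split])
    return combined


# made-up miniature corpora, each a dict of splits
english = {
    'train': ['The grass is green .'],
    'dev': ['I love Berlin .'],
    'test': ['This is a sentence .'],
}
german = {
    'train': ['Das Gras ist grün .'],
    'dev': [],
    'test': ['Ich liebe Berlin .'],
}

multi = combine_corpora([english, german])
print({split: len(sentences) for split, sentences in multi.items()})
```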

In [43]:
english_corpus = flair.datasets.UD_ENGLISH()
german_corpus = flair.datasets.UD_GERMAN()
dutch_corpus = flair.datasets.UD_DUTCH()

# make multi-corpus consisting of three UDs
from flair.data import MultiCorpus
multi_corpus = MultiCorpus([
    english_corpus, german_corpus, dutch_corpus
])

2020-05-20 22:16:37,413 Reading data from /root/.flair/datasets/ud_english
2020-05-20 22:16:37,414 Train: /root/.flair/datasets/ud_english/en_ewt-ud-train.conllu
2020-05-20 22:16:37,414 Test: /root/.flair/datasets/ud_english/en_ewt-ud-test.conllu
2020-05-20 22:16:37,415 Dev: /root/.flair/datasets/ud_english/en_ewt-ud-dev.conllu
2020-05-20 22:16:52,273 Reading data from /root/.flair/datasets/ud_german
2020-05-20 22:16:52,274 Train: /root/.flair/datasets/ud_german/de_gsd-ud-train.conllu
2020-05-20 22:16:52,274 Test: /root/.flair/datasets/ud_german/de_gsd-ud-test.conllu
2020-05-20 22:16:52,275 Dev: /root/.flair/datasets/ud_german/de_gsd-ud-dev.conllu





2020-05-20 22:17:03,908 Reading data from /root/.flair/datasets/ud_dutch
2020-05-20 22:17:03,909 Train: /root/.flair/datasets/ud_dutch/nl_alpino-ud-train.conllu
2020-05-20 22:17:03,909 Test: /root/.flair/datasets/ud_dutch/nl_alpino-ud-test.conllu
2020-05-20 22:17:03,910 Dev: /root/.flair/datasets/ud_dutch/nl_alpino-ud-dev.conllu





### _Prepared Datasets_

- Flair supports growing list of prepared datasets out of the box
    - automatically downloads and sets up the data the first time you call the corresponding constructor ID
- Click [here](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_6_CORPUS.md#prepared-datasets) for the list

**Note** -- There are a few resources in Tutorial 6 that I couldn't complete due to the tutorials using generic data.

**There is a section called [`Reading a Text Classification Dataset`](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_6_CORPUS.md#reading-a-text-classification-dataset) that looks like it may be useful for trying to replicate the spam classifier project.**