# Tutorial

This tutorial takes you through the Flair library. 

## NLP base types

The `Sentence` object is central to our library. It holds a sentence made up of `Token` objects. Various layers of linguistic annotation can be added to this object, and it is also the object you use to embed your text.

Let's illustrate this with an example sentence.

In [4]:
# The Sentence object holds a sentence that we may want to embed
from flair.data import Sentence

# Make a sentence object by passing a whitespace tokenized string
sentence = Sentence('The grass is green .')

# Print the object to see what's in there
print(sentence)

Sentence: "The grass is green ." - 5 Tokens


Each word in a sentence is a `Token` object. You can access a token directly by its token id; as the output below shows, ids start at 1. Each token has attributes, such as an id and a text.

In [6]:
print(sentence[4])

Token: 4 green


You can also iterate over all tokens in a sentence.

In [7]:
for token in sentence:
    print(token)

Token: 1 The
Token: 2 grass
Token: 3 is
Token: 4 green
Token: 5 .


Tokens can also have tags, such as a named entity tag. In this example, we're adding an NER tag of type 'color' to 
the word 'green' in the example sentence.


In [8]:
# add a tag to a word in the sentence
sentence[4].add_tag('ner', 'color')

# print the sentence with all tags of this type
print(sentence.to_ner_string())

The grass is green <color> .
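
To make the data model above concrete, here is a minimal, library-free sketch of a sentence that splits on whitespace, numbers its tokens from 1, and renders tags inline. `ToyToken` and `ToySentence` are hypothetical names used only for illustration; they are not Flair classes and make no claim about how Flair implements `Sentence`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyToken:
    idx: int                      # 1-based position, mirroring the output above
    text: str
    tags: Dict[str, str] = field(default_factory=dict)

    def __str__(self) -> str:
        return f"Token: {self.idx} {self.text}"


class ToySentence:
    def __init__(self, text: str) -> None:
        # Whitespace tokenization: the input is assumed to be pre-tokenized.
        self.tokens: List[ToyToken] = [
            ToyToken(i, word) for i, word in enumerate(text.split(), start=1)
        ]

    def __getitem__(self, idx: int) -> ToyToken:
        # Translate the 1-based token id into a 0-based list index.
        if not 1 <= idx <= len(self.tokens):
            raise IndexError(f"token id {idx} out of range")
        return self.tokens[idx - 1]

    def __iter__(self):
        return iter(self.tokens)

    def __len__(self) -> int:
        return len(self.tokens)

    def to_tagged_string(self, tag_type: str) -> str:
        # Append "<tag>" after every token that carries a tag of this type.
        parts = []
        for token in self.tokens:
            parts.append(token.text)
            if tag_type in token.tags:
                parts.append(f"<{token.tags[tag_type]}>")
        return " ".join(parts)


toy = ToySentence("The grass is green .")
toy[4].tags["ner"] = "color"
print(toy[4])                       # Token: 4 green
print(toy.to_tagged_string("ner"))  # The grass is green <color> .
```

Storing tags in a per-token dictionary keyed by tag type is one simple way to let several annotation layers coexist on the same token.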


## Tagging with Pre-Trained Models

Now, let's use a pre-trained model for named entity recognition (NER).
This model was trained on the English CoNLL-03 task and can recognize four different entity
types.


In [11]:
from flair.tagging_model import SequenceTaggerLSTM

tagger = SequenceTaggerLSTM.load('ner')



All you need to do is call the `predict()` method of the tagger on a sentence. This adds predicted tags to the tokens
in the sentence. Let's use a sentence with two named
entities:

In [12]:
sentence = Sentence('George Washington went to Washington .')

# predict NER tags
tagger.predict(sentence)

# print sentence with predicted tags
print(sentence.to_tag_string())

George <B-PER> Washington <E-PER> went to Washington <S-LOC> .
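
The tags in this output appear to follow a BIOES-style scheme, in which `B-`, `I-`, and `E-` mark the beginning, inside, and end of a multi-token entity and `S-` marks a single-token entity. Assuming that reading, the sketch below groups such a tagged string into entity spans. `extract_entities` is a hypothetical helper written for this tutorial, not part of Flair.

```python
import re
from typing import List, Tuple


def extract_entities(tagged: str) -> List[Tuple[str, str]]:
    """Group a string like 'George <B-PER> Washington <E-PER>' into (text, label) spans."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []          # words of the entity being built
    pieces = tagged.split()
    i = 0
    while i < len(pieces):
        word = pieces[i]
        # A tag, if present, is the piece immediately following the word.
        match = None
        if i + 1 < len(pieces):
            match = re.fullmatch(r"<([BIES])-(.+)>", pieces[i + 1])
        if match is None:
            current = []             # untagged word: drop any unfinished entity
            i += 1
            continue
        prefix, label = match.groups()
        if prefix == "S":
            entities.append((word, label))
            current = []
        elif prefix == "B":
            current = [word]
        elif prefix == "I":
            current.append(word)
        else:                        # prefix == "E" closes the entity
            current.append(word)
            entities.append((" ".join(current), label))
            current = []
        i += 2
    return entities


print(extract_entities("George <B-PER> Washington <E-PER> went to Washington <S-LOC> ."))
# [('George Washington', 'PER'), ('Washington', 'LOC')]
```

Note that the two occurrences of "Washington" end up in different entities with different labels, which is exactly what the tagger output above encodes.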


You choose which pre-trained model to load by passing the appropriate
string to the `load()` method of the `SequenceTaggerLSTM` class. Currently, the following pre-trained models
are provided (more coming):
 

| ID | Language | Task |
| -------------     | ------------- | ------------- |
| 'ner'   | English | CoNLL-03 Named Entity Recognition |
| 'chunk' | English | CoNLL-2000 Syntactic Chunking |


## Embeddings

We provide a set of classes with which you can embed the words in sentences in various ways. Note that all embedding 
classes inherit from the `TextEmbeddings` class and implement the `embed()` method which you need to call 
to embed your text. This means that for most users of Flair, the complexity of different embeddings remains hidden 
behind this interface. Simply instantiate the embedding class you require and call `embed()` to embed your text.

All embeddings produced with our methods are PyTorch vectors, so they can be immediately used for training and
fine-tuning.
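
The shared interface described above can be pictured with a small, library-free sketch: a base class fixes the `embed()` contract, and each subclass only decides how a vector is computed. All names here (`ToyEmbeddings`, `LengthEmbeddings`, `ToyToken`) are hypothetical, and plain Python lists stand in for PyTorch vectors.

```python
from abc import ABC, abstractmethod
from typing import List


class ToyToken:
    def __init__(self, text: str) -> None:
        self.text = text
        self.embedding: List[float] = []


class ToyEmbeddings(ABC):
    """Hypothetical base class: callers only ever see embed()."""

    def embed(self, sentences: List[List[ToyToken]]) -> List[List[ToyToken]]:
        # Always takes a list of sentences so subclasses could batch the work.
        for sentence in sentences:
            for token in sentence:
                token.embedding = self._vector_for(token, sentence)
        return sentences

    @abstractmethod
    def _vector_for(self, token: ToyToken, sentence: List[ToyToken]) -> List[float]:
        ...


class LengthEmbeddings(ToyEmbeddings):
    """Trivial stand-in: a 2-dimensional vector built from surface features."""

    def _vector_for(self, token: ToyToken, sentence: List[ToyToken]) -> List[float]:
        return [float(len(token.text)), float(token.text.isalpha())]


sentence = [ToyToken(w) for w in "The grass is green .".split()]
LengthEmbeddings().embed([sentence])
print([t.embedding for t in sentence])
# [[3.0, 1.0], [5.0, 1.0], [2.0, 1.0], [5.0, 1.0], [1.0, 0.0]]
```

Because the calling code depends only on `embed()`, swapping one embedding type for another would require changing a single constructor call in a design like this.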

### Classic Word Embeddings

Classic word embeddings are static and word-level, meaning that each distinct word gets exactly one pre-computed
embedding. Most embeddings fall under this class, including the popular GloVe or Komninos embeddings.
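
What "static" means can be shown with a plain lookup table. In the sketch below, `TOY_VECTORS` is a made-up three-dimensional table, not real GloVe data, and both the lowercasing and the zero-vector fallback for unknown words are assumptions of this example rather than documented Flair behaviour.

```python
from typing import Dict, List

# Made-up 3-dimensional vectors, for illustration only.
TOY_VECTORS: Dict[str, List[float]] = {
    "washington": [0.2, -0.1, 0.7],
    "went": [0.0, 0.5, -0.3],
    "to": [0.1, 0.1, 0.1],
}
DIM = 3


def static_embed(words: List[str]) -> List[List[float]]:
    # One fixed vector per distinct word; the neighbouring words are never consulted.
    # Unknown words fall back to a zero vector (an assumption of this sketch).
    return [TOY_VECTORS.get(w.lower(), [0.0] * DIM) for w in words]


vectors = static_embed("George Washington went to Washington .".split())
print(vectors[1] == vectors[4])  # True: both occurrences of "Washington" share one vector
```

The person and the city receive the same vector here, which is the limitation that contextual embeddings are meant to address.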

Simply instantiate the WordEmbeddings class and pass a string identifier of the embedding you wish to load. So, if 
you want to use GloVe embeddings, pass the string 'glove' to the constructor: 

In [16]:
# all embeddings inherit from the TextEmbeddings class. Init a simple glove embedding.
from flair.embeddings import WordEmbeddings
glove_embedding = WordEmbeddings('glove')

Now, create an example sentence and call the embedding's `embed()` method. You always pass a list of sentences to 
this method since some embedding types make use of batching to increase speed. So if you only have one sentence, 
pass a list containing only one sentence:


In [17]:
# embed a sentence using glove.
from flair.data import Sentence
sentence = Sentence('The grass is green .')
glove_embedding.embed(sentences=[sentence])

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding.size())

Token: 1 The
torch.Size([100])
Token: 2 grass
torch.Size([100])
Token: 3 is
torch.Size([100])
Token: 4 green
torch.Size([100])
Token: 5 .
torch.Size([100])


This prints out the tokens and the size of their embeddings. GloVe embeddings are PyTorch vectors of dimensionality 100.

You choose which pre-trained embeddings to load by passing the appropriate
string to the constructor of the `WordEmbeddings` class. Currently, the following static embeddings
are provided (more coming):
 
| ID | Embedding |
| -------------     | ------------- |
| 'glove'     | GloVe embeddings |
| 'extvec'    | Komninos embeddings |
| 'ft-crawl'  | FastText embeddings |
| 'ft-german' | German FastText embeddings |

So, if you want to load German FastText embeddings, instantiate the class as follows:


In [18]:
german_embedding = WordEmbeddings('ft-german')

### Contextual String Embeddings


Contextual string embeddings are [powerful embeddings](https://drive.google.com/file/d/17yVpFA7MmXaQFTe-HDpZuqw9fJlmzg56/view?usp=sharing)
that capture latent syntactic-semantic information beyond what standard word embeddings provide.
They differ in two key ways: (1) they are trained without any explicit notion of words and
thus fundamentally model words as sequences of characters, and (2) they are **contextualized** by their
surrounding text, meaning that the *same word will have different embeddings depending on its
contextual use*.
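
The contrast with static embeddings can be illustrated with a deliberately tiny stand-in for a character-level language model. The sketch below is not the trained model and not Flair code: `toy_forward_states` is a hypothetical function that folds characters left to right into a running state and reads off that state at the last character of each word. An unbounded Python integer stands in for the fixed-size hidden vector a real model would use.

```python
from typing import List, Tuple


def toy_forward_states(text: str) -> List[Tuple[str, int]]:
    """Return (word, state) pairs; state summarises all characters up to the word's end."""
    state = 0
    results: List[Tuple[str, int]] = []
    for i, word in enumerate(text.split()):
        # Feed the separating space too: the toy model sees raw characters, not words.
        chunk = word if i == 0 else " " + word
        for ch in chunk:
            state = state * 31 + ord(ch)   # every character changes the running state
        results.append((word, state))
    return results


states = toy_forward_states("George Washington went to Washington .")
first, second = states[1], states[4]
print(first[0] == second[0])   # True: same surface word
print(first[1] == second[1])   # False: different left context, different state

# With an identical left context, the state for the word is reproduced exactly.
print(toy_forward_states("George Washington")[1] == first)   # True
```

A backward model could be mimicked the same way by folding the characters from right to left, so that each word's state depends on the text that follows it.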

With Flair, you can use these embeddings simply by instantiating the appropriate embedding class, same as before:

In [19]:
# CharLMEmbeddings also inherits from the TextEmbeddings class
from flair.embeddings import CharLMEmbeddings
charlm_embedding_forward = CharLMEmbeddings('news-forward')

# embed a sentence using CharLM.
from flair.data import Sentence
sentence = Sentence('The grass is green .')
charlm_embedding_forward.embed(sentences=[sentence])

FORWARD language mode loaded
on cuda:
False


[<flair.data.Sentence at 0x7fabc51055f8>]

You choose which embeddings to load by passing the appropriate
string to the constructor of the `CharLMEmbeddings` class. Currently, the following contextual string
embeddings are provided (more coming):
 
| ID | Language | Embedding | 
| -------------     | ------------- | ------------- |
| 'news-forward'    | English | Forward LM embeddings over 1 billion word corpus |
| 'news-backward'   | English | Backward LM embeddings over 1 billion word corpus |
| 'mix-forward'     | English | Forward LM embeddings over mixed corpus (Web, Wikipedia, Subtitles) |
| 'mix-backward'    | English | Backward LM embeddings over mixed corpus (Web, Wikipedia, Subtitles) |
| 'german-forward'  | German  | Forward LM embeddings over mixed corpus (Web, Wikipedia, Subtitles) |
| 'german-backward' | German  | Backward LM embeddings over mixed corpus (Web, Wikipedia, Subtitles) |

So, if you want to load embeddings from the English news backward LM model, instantiate the class as follows:


In [20]:
charlm_embedding_backward = CharLMEmbeddings('news-backward')

BACKWARD language mode loaded
on cuda:
False
