# FLAIR BASICS

Check the documentation and tutorials: 

https://github.com/flairNLP/flair/tree/master/resources/docs

Flair is a state-of-the-art neural toolkit for sequence labelling and text classification.

The aim of this lab is to learn how to install Flair, understand the intuitions behind its character-based contextual word representations, and become familiar with its API, which is built around the Token and Sentence objects.

In [65]:
!pip install flair

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [66]:
from flair.data import Sentence
from flair.models import SequenceTagger
from flair.models import TextClassifier

In [67]:
# The "use_tokenizer" parameter controls whether the input text is tokenized (True) or only split on whitespace (False)

sentence = Sentence('Washington University, which is located in Missouri, is named after George Washington.', use_tokenizer=False)
tokenized_sentence = Sentence('Washington University, which is located in Missouri, is named after George Washington.', use_tokenizer=True)

In [68]:
print(sentence)
print(tokenized_sentence)

Sentence: "Washington University, which is located in Missouri, is named after George Washington."
Sentence: "Washington University , which is located in Missouri , is named after George Washington ."
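
The two printed sentences differ only in how punctuation is handled. As a rough illustration (a plain-Python stand-in, not Flair's actual tokenizer), a whitespace split leaves punctuation attached to the words, while a simple regular expression turns each punctuation mark into its own token. The regular expression is an assumption chosen for this example.

```python
import re

text = "Washington University, which is located in Missouri, is named after George Washington."

# Whitespace split: punctuation stays glued to the neighbouring word
whitespace_tokens = text.split()

# Rough stand-in for a tokenizer: runs of word characters and single
# punctuation marks become separate tokens
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(len(whitespace_tokens), whitespace_tokens[-1])
print(len(regex_tokens), regex_tokens[-2:])
```

With the whitespace split the last token is `Washington.` (period included); with the regular expression the final `Washington` and `.` are separate tokens.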


In [69]:
## get_token() function retrieves the token by index (starting from 1)
print(sentence.get_token(3))

Token[2]: "which"


In [70]:
## indexes to obtain the tokens (starting from 0)
print(sentence[2])

Token[2]: "which"


In [71]:
for token in sentence:
  print(token)

Token[0]: "Washington"
Token[1]: "University,"
Token[2]: "which"
Token[3]: "is"
Token[4]: "located"
Token[5]: "in"
Token[6]: "Missouri,"
Token[7]: "is"
Token[8]: "named"
Token[9]: "after"
Token[10]: "George"
Token[11]: "Washington."


# WORD REPRESENTATIONS

1. Static Word Embeddings (fastText, Glove, etc.)
2. Flair character-based contextual embeddings


In [72]:
from flair.embeddings import WordEmbeddings

# init embedding
en_embedding = WordEmbeddings('glove')

In [73]:
#sentence = Sentence('Washington University, which is located in Missouri, is named after George Washington.')

# Obtain vector-based representation from glove pre-trained model
en_embedding.embed(sentence)

# print the vector representing each word in the sentence
for token in sentence:
    print(token)
    print(token.embedding)

Token[0]: "Washington"
tensor([-2.2048e-01, -1.1316e-01,  9.4277e-01, -3.9024e-01,  2.5004e-01,
        -4.1651e-01, -1.4640e-01,  2.3628e-03, -1.2966e-01, -1.1173e-01,
        -2.1546e-01, -8.6271e-01,  1.3817e-01,  3.3118e-01, -6.6500e-01,
         3.7134e-01,  2.0050e-01, -3.4055e-01, -1.2422e+00, -7.6653e-01,
        -1.1253e-02,  3.8440e-01, -5.0105e-02, -1.8869e-01,  1.0785e-01,
         1.7502e-01, -1.0167e-01, -5.7925e-01,  2.3529e-01,  3.2626e-02,
         3.2353e-01,  9.7457e-01,  4.5231e-01,  4.9740e-01, -8.8874e-01,
         4.9170e-01,  1.1230e-01, -2.1484e-01,  9.3187e-02,  4.7039e-01,
        -7.8776e-01, -6.8219e-01, -2.3741e-01,  2.2351e-01,  2.0269e-01,
        -1.0166e+00,  1.3095e-01, -2.3654e-01,  3.1501e-01, -3.1880e-01,
         5.9744e-01, -2.8722e-01,  2.9970e-01,  3.4877e-01, -1.6597e-01,
        -2.8483e+00,  3.2219e-01, -7.8469e-01,  1.3754e+00,  1.5050e-01,
        -8.5193e-01,  2.5303e-01,  2.0142e-01, -5.9176e-01,  8.9212e-02,
        -3.5561e-01,  2.6522

In [74]:
# Embedding of "Washington" in "Washington University"
sentence[0].get_embedding()

tensor([-2.2048e-01, -1.1316e-01,  9.4277e-01, -3.9024e-01,  2.5004e-01,
        -4.1651e-01, -1.4640e-01,  2.3628e-03, -1.2966e-01, -1.1173e-01,
        -2.1546e-01, -8.6271e-01,  1.3817e-01,  3.3118e-01, -6.6500e-01,
         3.7134e-01,  2.0050e-01, -3.4055e-01, -1.2422e+00, -7.6653e-01,
        -1.1253e-02,  3.8440e-01, -5.0105e-02, -1.8869e-01,  1.0785e-01,
         1.7502e-01, -1.0167e-01, -5.7925e-01,  2.3529e-01,  3.2626e-02,
         3.2353e-01,  9.7457e-01,  4.5231e-01,  4.9740e-01, -8.8874e-01,
         4.9170e-01,  1.1230e-01, -2.1484e-01,  9.3187e-02,  4.7039e-01,
        -7.8776e-01, -6.8219e-01, -2.3741e-01,  2.2351e-01,  2.0269e-01,
        -1.0166e+00,  1.3095e-01, -2.3654e-01,  3.1501e-01, -3.1880e-01,
         5.9744e-01, -2.8722e-01,  2.9970e-01,  3.4877e-01, -1.6597e-01,
        -2.8483e+00,  3.2219e-01, -7.8469e-01,  1.3754e+00,  1.5050e-01,
        -8.5193e-01,  2.5303e-01,  2.0142e-01, -5.9176e-01,  8.9212e-02,
        -3.5561e-01,  2.6522e-01,  1.1283e+00, -3.7

In [75]:
# Washington embedding in "George Washington"
sentence[11].get_embedding()

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.], device='cuda:0')

## ASSIGNMENT 1

In theory, the representations for tokens sentence[0] and sentence[11] should be the same, because a static model such as GloVe assigns one vector per word regardless of context.

+ Write code to establish whether the vectors are actually the same. The output should look like the one below.



In [76]:
for token in sentence:
  print(token)

Token[0]: "Washington"
Token[1]: "University,"
Token[2]: "which"
Token[3]: "is"
Token[4]: "located"
Token[5]: "in"
Token[6]: "Missouri,"
Token[7]: "is"
Token[8]: "named"
Token[9]: "after"
Token[10]: "George"
Token[11]: "Washington."


In [77]:
# The two vectors differ because Token[0] is "Washington" whereas Token[11] is "Washington." (with the trailing period), so they are looked up as different words.
sentence[11].get_embedding() == sentence[0].get_embedding()

tensor([False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False],
       device='cuda:0')

+ TODO: You need to find out why this is the case.
+ TODO: Once you find out, write code to obtain the embeddings again and to establish that they are indeed the same representations (for both occurrences of 'Washington').
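
One possible way to reason about the TODOs above, sketched in plain Python rather than with Flair: a static embedding is essentially a table lookup, and a token that is missing from the table gets an all-zero vector (as in the output for `sentence[11]`). The table `toy_vectors`, its values, and the `lookup` helper are made up for illustration; the lowercase fallback is an assumption about how such a lookup might normalise tokens.

```python
# Toy static embedding table; the numbers are invented for this example
toy_vectors = {"washington": [0.1, -0.2, 0.3], "george": [0.5, 0.0, -0.1]}
DIM = 3


def lookup(token):
    """Return the vector for a token, or an all-zero vector if it is unknown."""
    return toy_vectors.get(token.lower(), [0.0] * DIM)


first = lookup("Washington")
last_raw = lookup("Washington.")  # trailing period: not in the vocabulary
last_tok = lookup("Washington")   # after tokenization the period is a separate token

# A single True/False answer instead of one boolean per dimension
print(first == last_raw)
print(first == last_tok)
```

With PyTorch tensors, `torch.equal(a, b)` gives the same kind of single True/False answer, whereas `a == b` compares element by element.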

In [111]:
en_embedding.embed(tokenized_sentence)
for token in tokenized_sentence:
  print(token)

Token[0]: "Washington"
Token[1]: "University"
Token[2]: ","
Token[3]: "which"
Token[4]: "is"
Token[5]: "located"
Token[6]: "in"
Token[7]: "Missouri"
Token[8]: ","
Token[9]: "is"
Token[10]: "named"
Token[11]: "after"
Token[12]: "George"
Token[13]: "Washington"
Token[14]: "."


In [112]:
# vector representation of each word in the sentence
for token in tokenized_sentence:
    print(token)
    print(token.embedding)

Token[0]: "Washington"
tensor([-2.2048e-01, -1.1316e-01,  9.4277e-01,  ..., -2.2946e-03,
         1.4192e-03,  2.5719e-04], device='cuda:0')
Token[1]: "University"
tensor([ 6.9580e-01, -1.9334e-01, -7.8134e-01,  ..., -7.0233e-05,
        -1.3960e-03,  7.4919e-04], device='cuda:0')
Token[2]: ","
tensor([-1.0767e-01,  1.1053e-01,  5.9812e-01,  ...,  2.0763e-05,
        -1.9901e-02,  1.0584e-02], device='cuda:0')
Token[3]: "which"
tensor([ 0.0302,  0.4461,  0.4317,  ..., -0.0035, -0.0067,  0.0031],
       device='cuda:0')
Token[4]: "is"
tensor([-0.5426,  0.4148,  1.0322,  ..., -0.0019,  0.1306,  0.0056],
       device='cuda:0')
Token[5]: "located"
tensor([-3.2592e-01, -9.7740e-02,  4.9287e-01,  ...,  2.8786e-04,
        -1.3166e-03, -1.2682e-03], device='cuda:0')
Token[6]: "in"
tensor([ 8.5703e-02, -2.2201e-01,  1.6569e-01,  ...,  6.4110e-05,
         6.5429e-03,  2.6719e-02], device='cuda:0')
Token[7]: "Missouri"
tensor([ 5.8746e-01, -4.1809e-02,  5.4232e-01,  ..., -3.1905e-05,
         

In [115]:
# check whether the embedding of Token[0] is the same as that of Token[13]
tokenized_sentence[13].get_embedding() == tokenized_sentence[0].get_embedding()

tensor([ True,  True,  True,  ..., False, False, False], device='cuda:0')
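
The result above is `True` at the start of the vector and `False` at the end. A likely explanation, judging by the cell numbers, is that this cell ran after the Flair embeddings of Assignment 2 had already been added to `tokenized_sentence`, so each token holds the GloVe vector followed by a contextual part. The toy sketch below (invented values, plain lists instead of tensors) shows how such a concatenation produces this pattern and how restricting the comparison to the static part gives a single answer.

```python
STATIC_DIM = 3  # toy size; the GloVe vectors printed in this notebook have 100 dimensions

static_washington = [0.1, -0.2, 0.3]

# Same static part, different (made-up) contextual part for each occurrence
token_0 = static_washington + [0.01, 0.02]
token_13 = static_washington + [0.07, -0.03]

# Element-wise comparison: equal on the static part, different on the rest
elementwise = [a == b for a, b in zip(token_0, token_13)]
print(elementwise)

# Comparing only the static slice gives one boolean
print(token_0[:STATIC_DIM] == token_13[:STATIC_DIM])
```

In the notebook itself, calling `tokenized_sentence.clear_embeddings()` before embedding again with `en_embedding`, and then comparing with `torch.equal(...)`, should give a single `True`; this assumes the installed Flair version provides `clear_embeddings()`.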

## ASSIGNMENT 2

In this assignment we will show that the representations obtained for the two occurrences of 'Washington' differ when they are computed with Flair contextual embeddings.



In [79]:
from flair.embeddings import FlairEmbeddings

# init Flair embedding
flair_embedding_forward = FlairEmbeddings('news-forward')
tokenized_sentence = Sentence('Washington University, which is located in Missouri, is named after George Washington.', use_tokenizer=True)


+ TODO: compare the representations obtained for 'Washington' for both sentences, tokenized and raw.

In [80]:
# TODO compare the Flair embeddings obtained for 'Washington'
#tokenized sentence
flair_embedding_forward.embed(tokenized_sentence)

for token in tokenized_sentence:
  print(token)
  print(token.embedding)


Token[0]: "Washington"
tensor([ 3.6137e-03, -3.7161e-06, -2.9144e-02,  ..., -2.2946e-03,
         1.4192e-03,  2.5719e-04], device='cuda:0')
Token[1]: "University"
tensor([ 4.9652e-04, -6.9346e-04, -2.1376e-01,  ..., -7.0233e-05,
        -1.3960e-03,  7.4919e-04], device='cuda:0')
Token[2]: ","
tensor([-4.0296e-05,  2.9115e-05,  6.2106e-02,  ...,  2.0763e-05,
        -1.9901e-02,  1.0584e-02], device='cuda:0')
Token[3]: "which"
tensor([-0.0005,  0.0004,  0.0358,  ..., -0.0035, -0.0067,  0.0031],
       device='cuda:0')
Token[4]: "is"
tensor([ 0.0045, -0.0022,  0.1370,  ..., -0.0019,  0.1306,  0.0056],
       device='cuda:0')
Token[5]: "located"
tensor([-0.0110, -0.0005,  0.0741,  ...,  0.0003, -0.0013, -0.0013],
       device='cuda:0')
Token[6]: "in"
tensor([-1.7291e-02,  2.4153e-05,  7.0391e-02,  ...,  6.4110e-05,
         6.5429e-03,  2.6719e-02], device='cuda:0')
Token[7]: "Missouri"
tensor([ 4.3458e-07,  8.4390e-05,  1.8153e-02,  ..., -3.1905e-05,
         7.0416e-03,  7.9550e-03],

In [81]:
# comparing the embeddings of both occurrences of 'Washington'
tokenized_sentence[0].get_embedding() == tokenized_sentence[13].get_embedding()

tensor([False, False, False,  ..., False, False, False], device='cuda:0')
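
The comparison above is `False` everywhere because a forward character-level language model represents a word by its internal state after reading everything up to the end of that word. The toy function below is only a stand-in for that idea (a running checksum over characters, not a neural model, and not Flair's implementation): the same word gets a different "state" when the preceding text differs.

```python
def toy_forward_states(text):
    """Toy stand-in for a forward character LM.

    The 'state' of a word is a running checksum of every character read
    up to and including the last character of that word.
    """
    state = 0
    states = []
    word = ""
    for ch in text + " ":
        if ch == " ":
            if word:
                states.append((word, state))
                word = ""
        else:
            word += ch
        state = state * 31 + ord(ch)
    return states


text = "Washington University , which is located in Missouri , is named after George Washington ."
states = toy_forward_states(text)

first_word, first_state = states[0]
last_word, last_state = states[13]

print(first_word == last_word)    # same surface form
print(first_state == last_state)  # different state, because the left context differs
```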

In [82]:
#Raw sentence
flair_embedding_forward.embed(sentence)
for token in sentence:
  print(token)
  print(token.embedding)
 

Token[0]: "Washington"
tensor([-2.2048e-01, -1.1316e-01,  9.4277e-01,  ..., -2.2946e-03,
         1.4192e-03,  2.5719e-04], device='cuda:0')
Token[1]: "University,"
tensor([0.0000, 0.0000, 0.0000,  ..., 0.0002, 0.0303, 0.0230], device='cuda:0')
Token[2]: "which"
tensor([ 0.0302,  0.4461,  0.4317,  ..., -0.0089, -0.0048,  0.0054],
       device='cuda:0')
Token[3]: "is"
tensor([-0.5426,  0.4148,  1.0322,  ..., -0.0027,  0.0913,  0.0074],
       device='cuda:0')
Token[4]: "located"
tensor([-3.2592e-01, -9.7740e-02,  4.9287e-01,  ...,  4.2787e-04,
        -1.6303e-03, -9.5898e-04], device='cuda:0')
Token[5]: "in"
tensor([ 8.5703e-02, -2.2201e-01,  1.6569e-01,  ...,  5.2130e-05,
         7.9720e-03,  3.0658e-02], device='cuda:0')
Token[6]: "Missouri,"
tensor([ 0.0000,  0.0000,  0.0000,  ..., -0.0006,  0.0282,  0.0026],
       device='cuda:0')
Token[7]: "is"
tensor([-5.4264e-01,  4.1476e-01,  1.0322e+00,  ..., -7.2745e-04,
         1.0725e-02,  8.1581e-03], device='cuda:0')
Token[8]: "named"

In [83]:
# comparing the embeddings of 'Washington' and 'Washington.' in the raw sentence
sentence[0].get_embedding() == sentence[11].get_embedding()

tensor([False, False, False,  ..., False, False, False], device='cuda:0')



---


# TAGGING

Now we will learn how to tag our sentence using Flair pre-trained models for the following tasks:

1. POS tagging
2. Chunking
3. Named Entity Recognition
4. Frame Semantics (event detection)
5. Polarity classification

**Check the following link to see the list of available models and languages:**
[Flair Tagging Info](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_2_TAGGING.md)

---




In [84]:
pos_tagger = SequenceTagger.load('pos')



2023-02-13 21:45:39,704 loading file /root/.flair/models/pos-english/a9a73f6cd878edce8a0fa518db76f441f1cc49c2525b2b4557af278ec2f0659e.121306ea62993d04cd1978398b68396931a39eb47754c8a06a87f325ea70ac63
2023-02-13 21:45:39,965 SequenceTagger predicts: Dictionary with 53 tags: <unk>, O, UH, ,, VBD, PRP, VB, PRP$, NN, RB, ., DT, JJ, VBP, VBG, IN, CD, NNS, NNP, WRB, VBZ, WDT, CC, TO, MD, VBN, WP, :, RP, EX, JJR, FW, XX, HYPH, POS, RBR, JJS, PDT, NNPS, RBS, AFX, WP$, -LRB-, -RRB-, ``, '', LS, $, SYM, ADD


In [85]:
pos_tagger.predict(sentence)

In [86]:
# POS tags are attached to individual tokens rather than spans, so this loop prints nothing here
for postag in sentence.get_spans('pos'):
  print(postag)

In [87]:
print(sentence.to_tagged_string())

Sentence: "Washington University, which is located in Missouri, is named after George Washington." → ["Washington"/NNP, "University,"/NNP, "which"/WDT, "is"/VBZ, "located"/VBN, "in"/IN, "Missouri,"/NNP, "is"/VBZ, "named"/VBN, "after"/IN, "George"/NNP, "Washington."/NNP]


In [88]:
chunker = SequenceTagger.load('chunk')



2023-02-13 21:45:49,513 loading file /root/.flair/models/chunk-english/5b53097d6763734ee8ace8de92db67a1ee2528d5df9c6d20ec8e3e6f6470b423.d81b7fd7a38422f2dbf40f6449b1c63d5ae5b959863aa0c2c1ce9116902e8b22
2023-02-13 21:45:49,706 SequenceTagger predicts: Dictionary with 45 tags: <unk>, O, B-NP, E-NP, I-NP, S-PP, S-VP, S-SBAR, S-ADVP, S-NP, S-ADJP, B-VP, E-VP, B-PP, E-PP, I-VP, S-PRT, B-ADVP, E-ADVP, B-ADJP, E-ADJP, B-CONJP, I-CONJP, E-CONJP, I-ADJP, B-SBAR, E-SBAR, S-INTJ, I-ADVP, I-PP, B-UCP, I-UCP, E-UCP, S-LST, B-PRT, I-PRT, E-PRT, S-CONJP, B-INTJ, E-INTJ, I-INTJ, B-LST, E-LST, <START>, <STOP>


In [89]:
chunker.predict(sentence)

In [90]:
for chunktag in sentence.get_spans('np'):
  print(chunktag)

Span[0:2]: "Washington University," → NP (0.7408)
Span[2:3]: "which" → NP (0.9995)
Span[3:5]: "is located" → VP (0.8574)
Span[5:6]: "in" → PP (1.0)
Span[6:7]: "Missouri," → NP (0.9999)
Span[7:9]: "is named" → VP (0.9624)
Span[9:10]: "after" → PP (0.9981)
Span[10:12]: "George Washington." → NP (0.798)


In [91]:
print(sentence.to_tagged_string())

Sentence: "Washington University, which is located in Missouri, is named after George Washington." → ["Washington"/NNP, "Washington University,"/NP, "University,"/NNP, "which"/WDT, "which"/NP, "is"/VBZ, "is located"/VP, "located"/VBN, "in"/IN, "in"/PP, "Missouri,"/NNP, "Missouri,"/NP, "is"/VBZ, "is named"/VP, "named"/VBN, "after"/IN, "after"/PP, "George"/NNP, "George Washington."/NP, "Washington."/NNP]


In [92]:
ner_tagger = SequenceTagger.load('ner')



2023-02-13 21:45:59,001 loading file /root/.flair/models/ner-english/4f4cdab26f24cb98b732b389e6cebc646c36f54cfd6e0b7d3b90b25656e4262f.8baa8ae8795f4df80b28e7f7b61d788ecbb057d1dc85aacb316f1bd02837a4a4
2023-02-13 21:46:01,603 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>


In [93]:
ner_tagger.predict(sentence)

In [94]:
# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

Span[0:2]: "Washington University," → NP (0.7408); ORG (0.8766)
Span[6:7]: "Missouri," → NP (0.9999); LOC (0.9987)
Span[10:12]: "George Washington." → NP (0.798); PER (0.9916)
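
The tag dictionaries printed when loading the chunker and the NER model use B-, I-, E- and S- prefixes (begin, inside, end, single), while `get_spans` returns merged spans. The sketch below shows one way such tags could be merged into spans in plain Python. It is not Flair's implementation, and the tag sequence is written by hand to match the NER output above.

```python
def bioes_to_spans(tokens, tags):
    """Merge BIOES tags into (start, end, text, label) spans; end is exclusive."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":
            # Single-token span
            spans.append((i, i + 1, tokens[i], label))
            start = None
        elif prefix == "B":
            # Remember where a multi-token span begins
            start = i
        elif prefix == "E" and start is not None:
            # Close the span opened by the last B- tag
            spans.append((start, i + 1, " ".join(tokens[start:i + 1]), label))
            start = None
        # An I- tag simply continues the currently open span
    return spans


tokens = "Washington University, which is located in Missouri, is named after George Washington.".split()
# Hand-written tags, assumed for illustration
tags = ["B-ORG", "E-ORG", "O", "O", "O", "O", "S-LOC", "O", "O", "O", "B-PER", "E-PER"]

for span in bioes_to_spans(tokens, tags):
    print(span)
```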


In [95]:
print(sentence.to_tagged_string())

Sentence: "Washington University, which is located in Missouri, is named after George Washington." → ["Washington"/NNP, "Washington University,"/NP/ORG, "University,"/NNP, "which"/WDT, "which"/NP, "is"/VBZ, "is located"/VP, "located"/VBN, "in"/IN, "in"/PP, "Missouri,"/NNP, "Missouri,"/NP/LOC, "is"/VBZ, "is named"/VP, "named"/VBN, "after"/IN, "after"/PP, "George"/NNP, "George Washington."/NP/PER, "Washington."/NNP]


In [96]:
sem_tagger = SequenceTagger.load('frame')



2023-02-13 21:46:13,008 loading file /root/.flair/models/frame-english/c397b8bbddf56e35a7d4b64295712a42a1a9b7ccf430dff76d03c8c7e26b9707.fd7786a36026b383ca73a1413c0a29aa1e67551621b805a0d28ca547636353b9
2023-02-13 21:46:13,644 SequenceTagger predicts: Dictionary with 5196 tags: <unk>, O, _, do.01, get.01, kid.01, know.01, be.01, send.01, seem.01, fold.03, have.03, want.01, say.01, pass.08, play.01, be_like.04, be.03, record.01, hear.01, speak.01, go.04, mean.01, let.01, go.01, see.01, drive.01, pull.01, look.01, start.01, come.01, get.06, pay.01, go.02, miss.01, know.02, know.06, forget.01, ask.02, mail.01, wait.01, be.02, make.02, make.01, think.01, live.01, care.01, smoke.02, put_off.07, mind.01


In [97]:
sem_tagger.predict(sentence)

In [98]:
for event in sentence.get_spans('frame'):
  print(event)
print(sentence.to_tagged_string())

Sentence: "Washington University, which is located in Missouri, is named after George Washington." → ["Washington"/NNP, "Washington University,"/NP/ORG, "University,"/NNP, "which"/WDT, "which"/NP, "is"/VBZ/be.03, "is located"/VP, "located"/VBN/locate.01, "in"/IN, "in"/PP, "Missouri,"/NNP, "Missouri,"/NP/LOC, "is"/VBZ/be.03, "is named"/VP, "named"/VBN/name.01, "after"/IN, "after"/PP, "George"/NNP, "George Washington."/NP/PER, "Washington."/NNP]


In [99]:
print(sentence.to_dict(tag_type='pos'))

{'text': 'Washington University, which is located in Missouri, is named after George Washington.', 'pos': [{'value': 'NNP', 'confidence': 1.0}, {'value': 'NNP', 'confidence': 0.9999886751174927}, {'value': 'WDT', 'confidence': 0.9999895095825195}, {'value': 'VBZ', 'confidence': 0.9999998807907104}, {'value': 'VBN', 'confidence': 0.9999347925186157}, {'value': 'IN', 'confidence': 1.0}, {'value': 'NNP', 'confidence': 0.999966025352478}, {'value': 'VBZ', 'confidence': 1.0}, {'value': 'VBN', 'confidence': 0.9999909400939941}, {'value': 'IN', 'confidence': 0.9993758797645569}, {'value': 'NNP', 'confidence': 0.9999997615814209}, {'value': 'NNP', 'confidence': 0.9999992847442627}]}


In [100]:
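# Returns an empty list: the chunker's spans were retrieved with get_spans('np') above, so nothing is stored under 'chunk'
print(sentence.to_dict(tag_type='chunk'))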

{'text': 'Washington University, which is located in Missouri, is named after George Washington.', 'chunk': []}


In [101]:
print(sentence.to_dict(tag_type='ner'))

{'text': 'Washington University, which is located in Missouri, is named after George Washington.', 'ner': [{'value': 'ORG', 'confidence': 0.8765585720539093}, {'value': 'LOC', 'confidence': 0.9987462759017944}, {'value': 'PER', 'confidence': 0.9916445016860962}]}


In [102]:
print(sentence.to_dict(tag_type='frame'))

{'text': 'Washington University, which is located in Missouri, is named after George Washington.', 'frame': [{'value': 'be.03', 'confidence': 0.9955000281333923}, {'value': 'locate.01', 'confidence': 0.9831220507621765}, {'value': 'be.03', 'confidence': 0.9986500144004822}, {'value': 'name.01', 'confidence': 0.7090935707092285}]}


In [103]:
polarity_classifier = TextClassifier.load('en-sentiment')

2023-02-13 21:46:29,812 loading file /root/.flair/models/sentiment-en-mix-distillbert_4.pt


In [104]:
polarity_classifier.predict(sentence)

In [105]:
print(sentence.to_tagged_string())
print(sentence.labels)

Sentence: "Washington University, which is located in Missouri, is named after George Washington." → POSITIVE (0.9722) → ["Washington"/NNP, "Washington University,"/NP/ORG, "University,"/NNP, "which"/WDT, "which"/NP, "is"/VBZ/be.03, "is located"/VP, "located"/VBN/locate.01, "in"/IN, "in"/PP, "Missouri,"/NNP, "Missouri,"/NP/LOC, "is"/VBZ/be.03, "is named"/VP, "named"/VBN/name.01, "after"/IN, "after"/PP, "George"/NNP, "George Washington."/NP/PER, "Washington."/NNP]
['Token[0]: "Washington"'/'NNP' (1.0), 'Token[1]: "University,"'/'NNP' (1.0), 'Token[2]: "which"'/'WDT' (1.0), 'Token[3]: "is"'/'VBZ' (1.0), 'Token[4]: "located"'/'VBN' (0.9999), 'Token[5]: "in"'/'IN' (1.0), 'Token[6]: "Missouri,"'/'NNP' (1.0), 'Token[7]: "is"'/'VBZ' (1.0), 'Token[8]: "named"'/'VBN' (1.0), 'Token[9]: "after"'/'IN' (0.9994), 'Token[10]: "George"'/'NNP' (1.0), 'Token[11]: "Washington."'/'NNP' (1.0), 'Span[0:2]: "Washington University,"'/'NP' (0.7408), 'Span[2:3]: "which"'/'NP' (0.9995), 'Span[3:5]: "is located"'

## ASSIGNMENT 3

Check out the following list of sentences and perform the following tasks using the Flair system and models:

1. Perform POS tagging and Named Entity Recognition on sentences 1-4.
2. Chunking and Frame detection for sentences 5-6.
3. Sentiment Analysis for sentences 7-8.

**Do not repeat the instructions; use a loop to annotate and display the annotations of every sentence in one step per task.**

In [106]:
sentence_1 = Sentence('Jackson is placed in Microsoft located in Redmond .')
sentence_2 = Sentence('Redmond is coming to New York city .')
sentence_3 = Sentence('Redmond is coming to New York City .')
sentence_4 = Sentence('Redmond is coming to New York City.')
sentence_5 = Sentence('Redmond returned to New York City to return his hat .')
sentence_6 = Sentence('He had a look at different hats .')
sentence_7 = Sentence('This film hurts.')
sentence_8 = Sentence('It is so bad that I am confused.')

In [107]:
# POS tagging and Named Entity Recognition on sentences 1-4.
sent_ner_pos = [sentence_2, sentence_1, sentence_3, sentence_4]
#Chunking and Frame detection for sentences 5-6.
sent_chu_fra = [sentence_5, sentence_6]
#Sentiment Analysis for sentences 7-8.
sent_analysis = [sentence_8, sentence_7]

In [108]:
for sentence in sent_ner_pos:
  pos_tagger.predict(sentence)
  ner_tagger.predict(sentence)
  print(sentence.to_tagged_string())
  print(sentence.to_original_text())

Sentence: "Redmond is coming to New York city ." → ["Redmond"/NNP, "Redmond"/PER, "is"/VBZ, "coming"/VBG, "to"/IN, "New"/NNP, "New York"/LOC, "York"/NNP, "city"/NN, "."/.]
Redmond is coming to New York city .
Sentence: "Jackson is placed in Microsoft located in Redmond ." → ["Jackson"/NNP, "Jackson"/PER, "is"/VBZ, "placed"/VBN, "in"/IN, "Microsoft"/NNP, "Microsoft"/ORG, "located"/VBN, "in"/IN, "Redmond"/NNP, "Redmond"/LOC, "."/.]
Jackson is placed in Microsoft located in Redmond .
Sentence: "Redmond is coming to New York City ." → ["Redmond"/NNP, "Redmond"/PER, "is"/VBZ, "coming"/VBG, "to"/IN, "New"/NNP, "New York City"/LOC, "York"/NNP, "City"/NNP, "."/.]
Redmond is coming to New York City .
Sentence: "Redmond is coming to New York City ." → ["Redmond"/NNP, "Redmond"/PER, "is"/VBZ, "coming"/VBG, "to"/IN, "New"/NNP, "New York City"/LOC, "York"/NNP, "City"/NNP, "."/.]
Redmond is coming to New York City.


In [109]:
for sentence in sent_chu_fra:
  chunker.predict(sentence)
  sem_tagger.predict(sentence)
  print(sentence.to_tagged_string())
  print(sentence.to_original_text())

Sentence: "Redmond returned to New York City to return his hat ." → ["Redmond"/NP, "returned"/VP, "returned"/return.01, "to"/PP, "New York City"/NP, "to return"/VP, "return"/return.02, "his hat"/NP]
Redmond returned to New York City to return his hat .
Sentence: "He had a look at different hats ." → ["He"/NP, "had"/VP, "had"/have.03, "a look"/NP, "look"/look.01, "at"/PP, "different hats"/NP]
He had a look at different hats .


In [110]:
for sentence in sent_analysis:
  polarity_classifier.predict(sentence)
  print(sentence.to_original_text())
  print(sentence.labels)

It is so bad that I am confused.
['Sentence: "It is so bad that I am confused ."'/'NEGATIVE' (0.9999)]
This film hurts.
['Sentence: "This film hurts ."'/'NEGATIVE' (0.9999)]
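
The three loops above share the same shape: run one or more models on each sentence, then display the result. A minimal sketch of a reusable helper is shown below. `annotate`, `FakeSentence` and `FakeTagger` are hypothetical names introduced here; the stand-in classes only mimic the `predict` and `to_tagged_string` methods used in this notebook so that the sketch runs without downloading any model.

```python
def annotate(sentences, models):
    """Apply every model to every sentence, then collect the tagged strings."""
    tagged = []
    for sentence in sentences:
        for model in models:
            model.predict(sentence)
        tagged.append(sentence.to_tagged_string())
    return tagged


# Hypothetical stand-ins for Flair's Sentence and tagger objects
class FakeSentence:
    def __init__(self, text):
        self.text = text
        self.tags = []

    def to_tagged_string(self):
        return f"{self.text} -> {self.tags}"


class FakeTagger:
    def __init__(self, name):
        self.name = name

    def predict(self, sentence):
        # Record which model touched the sentence
        sentence.tags.append(self.name)


demo = annotate(
    [FakeSentence("a"), FakeSentence("b")],
    [FakeTagger("pos"), FakeTagger("ner")],
)
print(demo)
```

With the objects from this notebook, the call would look like `annotate(sent_ner_pos, [pos_tagger, ner_tagger])`.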


## ASSIGNMENT 4 (BONUS)

In this task you will be annotating a movie review at document and sentence level.

1. Open the text in this file: '/content/drive/My Drive/Colab Notebooks/2023-ILTAPP/resources/movie-review.txt'
2. Predict Named Entities and Sentiment for the **whole document**.
3. Predict Named Entities and Sentiment for each of the sentences in the document.
> 3.1 Hint: You will need to segment the document at sentence level using the segtok segmenter and store each sentence as a Sentence object. The final result must be a list of Sentence objects. The segtok segmenter is used as in the following code snippet:

```
from segtok.segmenter import split_single
split_single(docText)
```

4. Print both the sentiment classification output and Named Entities.
5. Spot the differences between the annotations produced at document level and at sentence level. Describe them at the end of this notebook.




In [None]:
# TODO add code here to obtain an output similar to the one below


Sentence: "Once again Mr. Costner has dragged out a movie for far longer than necessary . Aside from the terrific sea rescue sequences , of which there are very few I just did not care about any of the characters . Most of us have ghosts in the closet , and Costner 's character are realized early on , and then forgotten until much later , by which time I did not care . The character we should really care about is a very cocky , overconfident Ashton Kutcher . The problem is he comes off as kid who thinks he 's better than anyone else around him and shows no signs of a cluttered closet . His only obstacle appears to be winning over Costner . Finally when we are well past the half way point of this stinker , Costner tells us all about Kutcher 's ghosts . We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing . No magic here , it was all I could do to keep from turning it off an hour in ." → NEGATIVE (1.0) → ["Costner"/PER, "Costner"/PER, "Ashton Kutcher"/P