# Morphological Analysis

Knowing the structure of a sentence helps with many NLP tasks, from lemmatization and named entity recognition to coreference resolution and other kinds of grammatical analysis.

## Lists of tags:

Different libraries use slightly different tag sets, but the idea is largely the same. Here is the list of coarse-grained universal POS tags:
- ADJ: adjective, e.g. big, old, green, incomprehensible, first
- ADP: adposition, e.g. in, to, during  
- ADV: adverb, e.g. very, tomorrow, down, where, there  
- AUX: auxiliary, e.g. is, has (done), will (do), should (do)  
- CONJ: conjunction, e.g. and, or, but  
- CCONJ: coordinating conjunction, e.g. and, or, but  
- DET: determiner, e.g. a, an, the  
- INTJ: interjection, e.g. psst, ouch, bravo, hello  
- NOUN: noun, e.g. girl, cat, tree, air, beauty  
- NUM: numeral, e.g. 1, 2017, one, seventy-seven, IV, MMXIV  
- PART: particle, e.g. ’s, not
- PRON: pronoun, e.g. I, you, he, she, myself, themselves, somebody
- PROPN: proper noun, e.g. Mary, John, London, NATO, HBO  
- PUNCT: punctuation, e.g. ., (, ), ?  
- SCONJ: subordinating conjunction, e.g. if, while, that  
- SYM: symbol, e.g. $, %, §, ©, +, −, ×, ÷, =, :), 😝  
- VERB: verb, e.g. run, runs, running, eat, ate, eating  
- X: other, e.g. sfpksdpsxmsa  
- SPACE: whitespace, e.g. a blank or a newline (used by spaCy; not part of the universal tag set)

And here is the list of fine-grained (Penn Treebank) tags:
- .: punctuation
- CC: coordinating conjunction
- CD: cardinal number
- DT: determiner
- EX: existential there, e.g. “there is” … think of it like “there exists”
- FW: foreign word
- IN: preposition/subordinating conjunction
- JJ: adjective, e.g. big
- JJR: adjective, comparative, e.g. bigger
- JJS: adjective, superlative, e.g. biggest
- LS: list marker, e.g. 1)
- MD: modal, e.g. could, will
- NN: noun, singular, e.g. desk
- NNS: noun plural, e.g. desks
- NNP: proper noun, singular, e.g. Harrison
- NNPS: proper noun, plural, e.g. Americans
- PDT: predeterminer, e.g. all the kids
- POS: possessive ending, e.g. parent's
- PRP: personal pronoun, e.g. I, he, she
- PRP\$: possessive pronoun, e.g. my, his, her
- RB: adverb, e.g. very, silently
- RBR: adverb, comparative, e.g. better
- RBS: adverb, superlative, e.g. best
- RP: particle, e.g. give up
- TO: the word "to", e.g. to go 'to' the store
- UH: interjection, e.g. errrrrrrrm
- VB: verb, base form, e.g. take
- VBD: verb, past tense, e.g. took
- VBG: verb, gerund/present participle, e.g. taking
- VBN: verb, past participle, e.g. taken
- VBP: verb, non-3rd person singular present, e.g. take
- VBZ: verb, 3rd person sing. present, e.g. takes
- WDT: wh-determiner, e.g. which
- WP: wh-pronoun, e.g. who, what
- WP\$: possessive wh-pronoun, e.g. whose
- WRB: wh-adverb, e.g. where, when

Notice how the second tag set is much more fine-grained: it separates, for example, singular from plural nouns and the different forms of a verb. We can choose whichever is better suited to our task.
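To make the relation between the two lists concrete, here is a minimal sketch that collapses a few fine-grained tags into coarse ones. The `PREFIX_TO_COARSE` table and the `coarse_tag` helper are hand-written for this illustration (they are not part of NLTK or spaCy) and cover only a handful of tags. Real taggers choose the coarse tag in context, so treat this as an approximation.

```python
from collections import Counter

# Hypothetical, partial mapping written for this example only:
# the first two letters of a fine-grained tag select a coarse tag.
PREFIX_TO_COARSE = {"JJ": "ADJ", "NN": "NOUN", "RB": "ADV", "VB": "VERB"}

def coarse_tag(fine_tag):
    """Return an approximate coarse tag for a fine-grained tag."""
    if fine_tag in ("NNP", "NNPS"):  # proper nouns have their own coarse tag
        return "PROPN"
    return PREFIX_TO_COARSE.get(fine_tag[:2], "OTHER")

fine_tags = ["JJ", "JJR", "JJS", "NN", "NNS", "NNP", "VB", "VBD", "VBG", "RB", "RBR"]

# Several fine-grained tags end up in the same coarse category
coarse_counts = Counter(coarse_tag(tag) for tag in fine_tags)
print(coarse_counts)
```

Eleven fine-grained tags fall into only five coarse categories here, which is the trade-off between the two lists: less detail, but fewer classes to handle.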

## Part of Speech (POS) Tagging:

In [None]:
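import nltk

# Resources for tokenization, POS tagging and WordNet lemmatization.
# Recent NLTK releases may ask for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead of the first two.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')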

In [None]:
text = "We don't like to keep our lovely customers waiting for long!"
words = nltk.word_tokenize(text)
nltk.pos_tag(words)

[('We', 'PRP'),
 ('do', 'VBP'),
 ("n't", 'RB'),
 ('like', 'VB'),
 ('to', 'TO'),
 ('keep', 'VB'),
 ('our', 'PRP$'),
 ('lovely', 'JJ'),
 ('customers', 'NNS'),
 ('waiting', 'VBG'),
 ('for', 'IN'),
 ('long', 'RB'),
 ('!', '.')]
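Since the tagger returns plain `(word, tag)` pairs, selecting words by part of speech is a simple filter. The sketch below reuses the output shown above; `words_with_prefix` is a hypothetical helper name, not an NLTK function.

```python
# Output of nltk.pos_tag, copied from the cell above
tagged = [('We', 'PRP'), ('do', 'VBP'), ("n't", 'RB'), ('like', 'VB'),
          ('to', 'TO'), ('keep', 'VB'), ('our', 'PRP$'), ('lovely', 'JJ'),
          ('customers', 'NNS'), ('waiting', 'VBG'), ('for', 'IN'),
          ('long', 'RB'), ('!', '.')]

def words_with_prefix(pairs, prefix):
    """Keep the words whose tag starts with the given prefix."""
    return [word for word, tag in pairs if tag.startswith(prefix)]

verbs = words_with_prefix(tagged, "VB")  # VB, VBP, VBG, ...
nouns = words_with_prefix(tagged, "NN")  # NN, NNS, NNP, ...
print(verbs)
print(nouns)
```

Filtering on a prefix rather than on an exact tag is a design choice: it groups all verb forms (or all noun forms) without listing every fine-grained tag.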

spaCy computes several annotations for each token (lemma, coarse and fine-grained POS tags, dependency label, shape, morphology). We can gather all of them in a dataframe for easier analysis:

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We don't like to keep our lovely customers waiting for long!")

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([[token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop, token.morph] for token in doc]),
                  columns=['WORD', 'LEMMA', 'POS', 'TAG', 'DEP', 'SHAPE', 'ALPHA', 'STOP', 'MORPH'])

df

|   | WORD | LEMMA | POS | TAG | DEP | SHAPE | ALPHA | STOP | MORPH |
|---|---|---|---|---|---|---|---|---|---|
| 0 | We | we | PRON | PRP | nsubj | Xx | True | True | (Case=Nom, Number=Plur, Person=1, PronType=Prs) |
| 1 | do | do | AUX | VBP | aux | xx | True | True | (Mood=Ind, Tense=Pres, VerbForm=Fin) |
| 2 | n't | not | PART | RB | neg | x'x | False | True | (Polarity=Neg) |
| 3 | like | like | VERB | VB | ROOT | xxxx | True | False | (VerbForm=Inf) |
| 4 | to | to | PART | TO | aux | xx | True | True | () |
| 5 | keep | keep | VERB | VB | xcomp | xxxx | True | True | (VerbForm=Inf) |
| 6 | our | our | PRON | PRP$ | poss | xxx | True | True | (Number=Plur, Person=1, Poss=Yes, PronType=Prs) |
| 7 | lovely | lovely | ADJ | JJ | amod | xxxx | True | False | (Degree=Pos) |
| 8 | customers | customer | NOUN | NNS | dobj | xxxx | True | False | (Number=Plur) |
| 9 | waiting | wait | VERB | VBG | advcl | xxxx | True | False | (Aspect=Prog, Tense=Pres, VerbForm=Part) |


This type of analysis can also reveal information about the author of a text. For example, [this article](https://aclanthology.org/W18-4102.pdf) reports that people diagnosed with depression tend to use adverbs and first-person forms considerably more often than others.
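As a rough illustration of how such stylistic signals could be computed from POS tags, the sketch below measures the share of adverbs and of first-person pronouns in a tagged text. This is not the method of the cited article: the sample sentence is hand-tagged, and the `FIRST_PERSON` word list and the `style_features` helper are assumptions made for this example.

```python
# Hand-tagged sample (Penn Treebank tags), not real tagger output
tagged = [('I', 'PRP'), ('really', 'RB'), ('like', 'VBP'),
          ('my', 'PRP$'), ('job', 'NN'), ('.', '.')]

# Assumed list of first-person forms, for illustration only
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}

def style_features(pairs):
    """Return the ratio of adverbs and first-person forms among word tokens."""
    # Punctuation tags such as '.' or ',' do not start with a letter
    word_pairs = [(w, t) for w, t in pairs if t[0].isalpha()]
    total = len(word_pairs)
    adverbs = sum(1 for _, t in word_pairs if t.startswith("RB"))
    first_person = sum(1 for w, _ in word_pairs if w.lower() in FIRST_PERSON)
    return {"adverb_ratio": adverbs / total,
            "first_person_ratio": first_person / total}

features = style_features(tagged)
print(features)
```

Ratios are used instead of raw counts so that texts of different lengths remain comparable.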

## Dependency parsing

We can use POS tagging to place the words of a sentence, and the relations between them, in a dependency tree. Dependency parsing breaks a sentence into components according to the grammatical relations between its words. This can be useful, for example, for text summarization:

In [None]:
from spacy import displacy

displacy.render(doc, style="dep", options={"distance": 100}, jupyter=True)

A dependency tree starts from the root (the main predicate), branches out to the subject and to the other verbs in the sentence, and continues from each new node (word) until every word is attached. We can inspect the full tree structure the same way we inspected the POS tags:

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([[token.text, token.pos_, token.dep_, token.head.text, [child for child in token.children]] for token in doc], dtype='object'),
                  columns=['WORD', 'POS', 'DEPENDENCY', 'PARENT', 'CHILDREN'])

df

|   | WORD | POS | DEPENDENCY | PARENT | CHILDREN |
|---|---|---|---|---|---|
| 0 | We | PRON | nsubj | like | [] |
| 1 | do | AUX | aux | like | [] |
| 2 | n't | PART | neg | like | [] |
| 3 | like | VERB | ROOT | like | [We, do, n't, keep, !] |
| 4 | to | PART | aux | keep | [] |
| 5 | keep | VERB | xcomp | like | [to, customers, waiting] |
| 6 | our | PRON | poss | customers | [] |
| 7 | lovely | ADJ | amod | customers | [] |
| 8 | customers | NOUN | dobj | keep | [our, lovely] |
| 9 | waiting | VERB | advcl | keep | [for] |


If we don't understand a relation in the dependency tree, we can always look up its explanation:

In [None]:
spacy.explain('amod')

'adjectival modifier'
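The PARENT column of the dependency table is enough to rebuild the tree without spaCy. The sketch below does so for the ten rows shown in that table. The head indices were derived by hand from the WORD and PARENT columns (every word in this sentence is unique), and `render` is a hypothetical helper written for this example.

```python
from collections import defaultdict

# (index, word, index of the parent) for the ten rows of the table;
# the root 'like' is its own parent, as in the PARENT column.
tokens = [
    (0, "We", 3), (1, "do", 3), (2, "n't", 3), (3, "like", 3),
    (4, "to", 5), (5, "keep", 3), (6, "our", 8), (7, "lovely", 8),
    (8, "customers", 5), (9, "waiting", 5),
]

children = defaultdict(list)
root = None
for index, word, head in tokens:
    if head == index:      # a token that is its own head is the root
        root = index
    else:
        children[head].append(index)

def render(index, depth=0):
    """Return the subtree rooted at `index` as a list of indented lines."""
    lines = ["  " * depth + tokens[index][1]]
    for child in children[index]:
        lines.extend(render(child, depth + 1))
    return lines

tree_lines = render(root)
print("\n".join(tree_lines))
```

Storing only the parent of each token is sufficient because every word has exactly one head, so the children lists can always be recovered from it.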

## Lemmatization using POS

Remember how we identified lemmas for different words in our previous lab? spaCy already takes the POS tags into account when lemmatizing, but NLTK's WordNet lemmatizer treats every word as a noun unless we pass the part of speech as an additional argument, which we can now provide:

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

for word in words:
  print(f"{word} :", lemmatizer.lemmatize(word))

We : We
do : do
n't : n't
like : like
to : to
keep : keep
our : our
lovely : lovely
customers : customer
waiting : waiting
for : for
long : long
! : !


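The lemmatizer accepts WordNet POS tags, which are not the same as the Penn Treebank tags returned by `nltk.pos_tag`:
- n: noun (the default)
- v: verb
- a: adjective
- s: satellite adjective
- r: adverb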

Notice the difference when we provide the POS with our words:

In [None]:
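from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
tags = nltk.pos_tag(words)

# Penn Treebank tag prefixes mapped to the WordNet POS the lemmatizer expects
# (adjectives are tagged JJ*, so the first letter alone would never give 'a')
penn_to_wordnet = {'NN': 'n', 'VB': 'v', 'JJ': 'a', 'RB': 'r'}

for word, tag in tags:
  pos = penn_to_wordnet.get(tag[:2], 'n')  # any other tag falls back to the default, noun
  print(f"{word} :", lemmatizer.lemmatize(word, pos=pos))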

We : We
do : do
n't : n't
like : like
to : to
keep : keep
our : our
lovely : lovely
customers : customer
waiting : wait
for : for
long : long
! : !
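To quantify the effect of the POS argument, we can compare the two outputs position by position. The sketch below uses the lemmas printed by the two cells above, copied by hand, so it runs without NLTK.

```python
words = ["We", "do", "n't", "like", "to", "keep", "our", "lovely",
         "customers", "waiting", "for", "long", "!"]

# Lemmas printed without a POS argument (first run)
lemmas_without_pos = ["We", "do", "n't", "like", "to", "keep", "our", "lovely",
                      "customer", "waiting", "for", "long", "!"]

# Lemmas printed with a POS argument (second run)
lemmas_with_pos = ["We", "do", "n't", "like", "to", "keep", "our", "lovely",
                   "customer", "wait", "for", "long", "!"]

# Keep the positions where the two runs disagree
changed = [(word, before, after)
           for word, before, after in zip(words, lemmas_without_pos, lemmas_with_pos)
           if before != after]

percent_changed = 100 * len(changed) / len(words)
print(changed)
print(f"{percent_changed:.1f}% of the tokens changed")
```

In this sentence only the verb form "waiting" benefits from the tag, because the noun default leaves it unchanged.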


# Named Entity Recognition (NER)

A named entity is a real-world object clearly identified by a name. It can represent a date, a location, an organization, etc. Here is a list of commonly used entity types:

<img src='https://media.geeksforgeeks.org/wp-content/uploads/20210324175946/pwn-623x660.png' height=500>

NER is used in information extraction, event extraction, text summarization, etc. spaCy can highlight the named entities it finds in a text:

In [None]:
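from spacy import displacy

# Use a sentence that actually contains named entities
doc = nlp("When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously.")

displacy.render(doc, style="ent", jupyter=True)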

We can also check whether each token is Inside an entity, Outside any entity, or at the Beginning of one (the IOB scheme):

In [None]:
import pandas as pd
import numpy as np

doc = nlp("When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously.")

df = pd.DataFrame(np.array([[token.text, token.ent_iob_] for token in doc], dtype='object'),
                  columns=['WORD', 'ENTITY'])

df.head()

|   | WORD | ENTITY |
|---|---|---|
| 0 | When | O |
| 1 | Sebastian | B |
| 2 | Thrun | I |
| 3 | started | O |
| 4 | working | O |


In [None]:
# Inspect the layers of the model behind the NER pipeline component
ner = nlp.get_pipe('ner')
ner.model.layers

[<thinc.model.Model at 0x7b5832699fc0>,
 <thinc.model.Model at 0x7b583269a040>,
 <thinc.model.Model at 0x7b583269a0c0>]
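The Beginning/Inside/Outside labels shown earlier can be merged back into complete entity mentions. Here is a minimal sketch over the first five tokens of the example sentence. `iob_to_spans` is a hypothetical helper written for this example; with spaCy itself, `doc.ents` already provides the merged spans.

```python
# (word, IOB label) pairs copied from the five rows shown earlier
tagged = [("When", "O"), ("Sebastian", "B"), ("Thrun", "I"),
          ("started", "O"), ("working", "O")]

def iob_to_spans(pairs):
    """Group consecutive B/I tokens into entity strings."""
    spans, current = [], []
    for word, label in pairs:
        if label == "B":                 # a new entity starts here
            if current:
                spans.append(" ".join(current))
            current = [word]
        elif label == "I" and current:   # continue the open entity
            current.append(word)
        else:                            # 'O' closes any open entity
            if current:
                spans.append(" ".join(current))
            current = []
    if current:                          # entity that ends the sequence
        spans.append(" ".join(current))
    return spans

entities = iob_to_spans(tagged)
print(entities)
```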

# Coreference Resolution

Remember in kindergarten when the teacher would scold you for repeating words too close together instead of using synonyms? Or for always writing a person's name instead of referencing it with a pronoun?

Those replacements are what make a computer's life hard. If we want our program to know that several expressions refer to the same thing, we have to make those links explicit somehow.

Identifying what a word or a sequence refers to is called _coreference resolution_. Notice in the following example how we connect each pronoun to the corresponding noun:

<img src='https://cdn.neurosys.com/wp-content/webp-express/webp-images/uploads/2021/09/05_huggingface-demo-1-1280x226.png.webp' height=150>

Replacing the pronouns with the nouns they refer to would not change the meaning of the text, but it would make the text easier to process for a reader who cannot keep all the words in mind at once (like our models so far!).
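The replacement step itself is simple once the links are known. The sketch below uses a made-up sentence whose coreference links are annotated by hand in `antecedents` (an assumption for this example); finding those links automatically is the hard part, and it is not attempted here.

```python
tokens = ["Anna", "said", "she", "would", "bring", "her", "laptop", "."]

# Hand-annotated links: token index -> text to substitute.
# A real coreference system would have to predict these.
antecedents = {2: "Anna", 5: "Anna's"}

# Substitute every annotated pronoun and keep the other tokens as they are
resolved = [antecedents.get(i, token) for i, token in enumerate(tokens)]

# Join the tokens and re-attach the final period
resolved_text = " ".join(resolved).replace(" .", ".")
print(resolved_text)
```

The resolved sentence is more repetitive for a human reader, but each clause now names its referent explicitly.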

At the moment, there are no libraries that implement this feature in a maintained and easy-to-use manner.

# Exercises

1. Download an English text of about 20 sentences (for example, the first 20 sentences of a Wikipedia article).
    - using spaCy POS tagging, extract separately all the nouns, all the verbs, all the adjectives and all the adverbs (print how many words of each type you found);
    - for the extracted words, determine the lemma using the NLTK WordNet lemmatizer; for each word, determine the lemma both without specifying the part of speech and with it specified; for what percentage of the words do the two variants give different results?
2. Select a larger subset of data from RONEC and train a NER model (you can take the model from [here](https://www.kaggle.com/competitions/nitro-lang-processing-1/data?select=train.json)). For each class, print the precision, recall and F1 score.
    - Which class has the best score? Which one has the worst?