# Vocabulary
- **Task**: implement different strategies to tokenize and normalize text in order to weight token relevance.
- **Input**: raw text
- **Output**: a list of tokens for each text

### Main steps
0. Language detection
1. Tokenization
2. Case, punctuation, stopwords
3. Normalization

In [46]:
import json
import pandas as pd

In [2]:
dataset_file = '../data/country_dataset.json'
with open(dataset_file, 'r') as infile:
    dataset = json.load(infile)
docs = dataset['docs']
queries = dataset['queries']

In [3]:
T = docs[10]

In [4]:
T

'The group was presented to the Prince of Wales, later King Charles I, in 1623 while he was in Spain negotiating a marriage contract, and it soon became the most famous Italian sculpture in England.\n'

## Language detection
Many options are available; see, for example, [langdetect](https://pypi.org/project/langdetect).

In [5]:
from langdetect import detect, detect_langs

In [6]:
L = "Testo che mescola English words con testo italiano."

The output below shows how models trained mainly on English may be unbalanced: the sentence is mostly Italian, yet English gets the higher score.

In [7]:
print(detect_langs(L))
print(detect(L))

[en:0.7142848002489957, it:0.2857148540663416]
en


Language detection is useful when dealing with multilingual corpora, because some vocabulary-building operations (e.g., lemmatization) are language dependent.
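As a rough illustration of that idea, the sketch below groups texts by detected language so that each group can later go through its own language-specific steps. The names `group_by_language` and `toy_detect` are hypothetical; `toy_detect` is only a stand-in so the example runs without extra packages, and in the notebook `langdetect.detect` could be passed in its place.

```python
from collections import defaultdict

def group_by_language(texts, detect_fn):
    """Group texts by the language code returned by detect_fn."""
    groups = defaultdict(list)
    for text in texts:
        groups[detect_fn(text)].append(text)
    return dict(groups)

def toy_detect(text):
    # Hypothetical stand-in detector: a handful of Italian function words.
    italian_hints = {'che', 'con', 'testo', 'il', 'la'}
    words = set(text.lower().split())
    return 'it' if words & italian_hints else 'en'

corpus = [
    "The group was presented to the Prince of Wales.",
    "Testo che mescola parole diverse.",
]
by_lang = group_by_language(corpus, toy_detect)
print(by_lang)
```

Each language bucket can then be sent to a stemmer or lemmatizer built for that language.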

## Tokenization

In [8]:
from nltk.tokenize import RegexpTokenizer

In [9]:
pattern = r'\w+|\$[\d\.]+|\S+'  # raw string, so the backslashes reach the regex engine unchanged
tokenizer = RegexpTokenizer(pattern)

In [10]:
text = docs[10]
tokens = tokenizer.tokenize(text)

In [11]:
print(text)

The group was presented to the Prince of Wales, later King Charles I, in 1623 while he was in Spain negotiating a marriage contract, and it soon became the most famous Italian sculpture in England.



In [12]:
print(tokens)

['The', 'group', 'was', 'presented', 'to', 'the', 'Prince', 'of', 'Wales', ',', 'later', 'King', 'Charles', 'I', ',', 'in', '1623', 'while', 'he', 'was', 'in', 'Spain', 'negotiating', 'a', 'marriage', 'contract', ',', 'and', 'it', 'soon', 'became', 'the', 'most', 'famous', 'Italian', 'sculpture', 'in', 'England', '.']
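To see what each alternative of the pattern does, the same regular expression can be tried with the standard `re` module. This is a minimal sketch; `re.findall` should give the same tokens as `RegexpTokenizer` for this pattern, since it contains no capturing groups. The sample strings are made up for illustration.

```python
import re

PATTERN = r'\w+|\$[\d\.]+|\S+'

# \w+ takes runs of word characters, \$[\d\.]+ keeps prices together,
# \S+ catches whatever non-space run is left (punctuation, brackets, ...).
sample = "It cost $3.50 (approx.) in 1623!"
sample_tokens = re.findall(PATTERN, sample)
print(sample_tokens)
# → ['It', 'cost', '$3.50', '(approx.)', 'in', '1623', '!']

# Without the dollar sign the first alternative wins and the number is split.
print(re.findall(PATTERN, "3.50"))
# → ['3', '.50']
```

Note that a chunk starting with a non-word character, such as `(approx.)`, is swallowed whole by `\S+`, so the order of the alternatives matters.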


**Note**: when dealing with long texts, tokenization should be performed sentence by sentence, using <code>nltk.tokenize.sent_tokenize</code> to split the text before tokenization and normalization.
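The sentence-by-sentence flow can be sketched as follows. To keep the example free of downloads, `split_sentences` is a hypothetical, naive regex stand-in for `nltk.tokenize.sent_tokenize`; it is an assumption made for illustration and will mis-split abbreviations such as "Dr. Smith".

```python
import re

TOKEN_PATTERN = r'\w+|\$[\d\.]+|\S+'

def split_sentences(text):
    # Naive stand-in for sent_tokenize: split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def tokenize_by_sentence(text):
    # One token list per sentence, using the same pattern as the tokenizer above.
    return [re.findall(TOKEN_PATTERN, s) for s in split_sentences(text)]

long_text = "The group was presented in 1623. It soon became famous!"
sentences_tokens = tokenize_by_sentence(long_text)
for sent_tokens in sentences_tokens:
    print(sent_tokens)
```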

## Case, punctuation and stopwords removal
How much each step matters depends on the size of the corpus and on how sparse its vocabulary is.
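One way to get a feeling for this is to count how many distinct tokens survive each step. The sketch below does so on a made-up three-sentence corpus with a hypothetical three-word stop list (`TOY_STOPWORDS`); on a real corpus the proportions will differ.

```python
import re
from string import punctuation

toy_corpus = [
    "The King met the prince.",
    "the prince met The king",
    "A king and a Prince met.",
]
TOY_STOPWORDS = {'the', 'a', 'and'}  # hypothetical mini stop list

def vocabulary(token_lists):
    # Set of distinct tokens over all documents.
    return {tok for tokens in token_lists for tok in tokens}

steps = [
    ('raw', lambda toks: toks),
    ('lower', lambda toks: [t.lower() for t in toks]),
    ('punctuation', lambda toks: [t for t in toks if t not in punctuation]),
    ('stopwords', lambda toks: [t for t in toks if t not in TOY_STOPWORDS]),
]

state = [re.findall(r'\w+|\S+', d) for d in toy_corpus]
sizes = {}
for name, f in steps:
    state = [f(toks) for toks in state]
    sizes[name] = len(vocabulary(state))
    print(name, sizes[name])
```

Here lowercasing alone removes four of the eleven raw vocabulary entries, which is the kind of sparseness reduction these steps aim for.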

In [14]:
from string import punctuation
from nltk.corpus import stopwords

In [15]:
stop_words = set(stopwords.words('english'))  # separate name, so the nltk `stopwords` module is not shadowed

In [18]:
lower_tokens = lambda data: [x.lower() for x in data]
punct_tokens = lambda data: [x for x in data if x not in punctuation]
stop_tokens = lambda data: [x for x in data if x not in stop_words]

In [23]:
pipeline = [('lower', lower_tokens), ('punctuation', punct_tokens), ('stopwords', stop_tokens)]
current = tokens
print(T)
print(current, '\n')
for operation, f in pipeline:
    print(operation)
    current = f(current)
    print(current, '\n')

The group was presented to the Prince of Wales, later King Charles I, in 1623 while he was in Spain negotiating a marriage contract, and it soon became the most famous Italian sculpture in England.

['The', 'group', 'was', 'presented', 'to', 'the', 'Prince', 'of', 'Wales', ',', 'later', 'King', 'Charles', 'I', ',', 'in', '1623', 'while', 'he', 'was', 'in', 'Spain', 'negotiating', 'a', 'marriage', 'contract', ',', 'and', 'it', 'soon', 'became', 'the', 'most', 'famous', 'Italian', 'sculpture', 'in', 'England', '.'] 

lower
['the', 'group', 'was', 'presented', 'to', 'the', 'prince', 'of', 'wales', ',', 'later', 'king', 'charles', 'i', ',', 'in', '1623', 'while', 'he', 'was', 'in', 'spain', 'negotiating', 'a', 'marriage', 'contract', ',', 'and', 'it', 'soon', 'became', 'the', 'most', 'famous', 'italian', 'sculpture', 'in', 'england', '.'] 

punctuation
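['the', 'group', 'was', 'presented', 'to', 'the', 'prince', 'of', 'wales', 'later', 'king', 'charles', 'i', 'in', '1623', 'while', 'he', 'was', 'in', 'spain', 'negotiating', 'a', 'marriage', 'contract', 'and', 'it', 'soon', 'became', 'the', 'most', 'famous', 'italian', 'sculpture', 'in', 'england'] 

stopwords
['group', 'presented', 'prince', 'wales', 'later', 'king', 'charles', '1623', 'spain', 'negotiating', 'marriage', 'contract', 'soon', 'became', 'famous', 'italian', 'sculpture', 'england'] 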
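A caveat on the punctuation step: `punctuation` is a string, so `x not in punctuation` is a substring test, and tokens such as `'...'` or `'?!'` are kept because they do not occur as substrings of it. The sketch below contrasts that behaviour with a stricter, character-based filter; `strict_filter` is a hypothetical helper, not part of the notebook.

```python
from string import punctuation

PUNCT_CHARS = set(punctuation)

def substring_filter(tokens):
    # Same test as punct_tokens above: substring membership in a string.
    return [t for t in tokens if t not in punctuation]

def strict_filter(tokens):
    # Drop tokens made up only of punctuation characters.
    return [t for t in tokens if not all(ch in PUNCT_CHARS for ch in t)]

sample = ['wait', '...', 'what', '?!', 'u.s.', ',']
print(substring_filter(sample))  # → ['wait', '...', 'what', '?!', 'u.s.']
print(strict_filter(sample))     # → ['wait', 'what', 'u.s.']
```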

## Normalization

### Stemming

In [24]:
from nltk.stem.snowball import SnowballStemmer

In [25]:
stemmer = SnowballStemmer('english')

In [26]:
print([stemmer.stem(x) for x in current])

['group', 'present', 'princ', 'wale', 'later', 'king', 'charl', '1623', 'spain', 'negoti', 'marriag', 'contract', 'soon', 'becam', 'famous', 'italian', 'sculptur', 'england']
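The point of stemming is that inflected forms collapse onto one vocabulary entry. The sketch below groups words by stem; the stemmer is passed in as a function, so `SnowballStemmer('english').stem` could be used in the notebook. The `toy_stem` shown here is a hypothetical, deliberately crude suffix stripper used only to keep the example self-contained; it is not the Snowball algorithm.

```python
from collections import defaultdict

def group_by_stem(words, stem):
    """Map each stem to the sorted list of surface forms that share it."""
    groups = defaultdict(set)
    for w in words:
        groups[stem(w)].add(w)
    return {s: sorted(ws) for s, ws in groups.items()}

def toy_stem(word):
    # Crude illustration only: strip one common suffix if enough letters remain.
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ['present', 'presented', 'presenting', 'presents', 'group', 'groups']
stem_groups = group_by_stem(words, toy_stem)
print(stem_groups)
```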


### Lemmatization with WordNet

In [27]:
from nltk.corpus import wordnet as wn

In [30]:
syns = wn.synsets('group')

#### Problem 1: word sense disambiguation

In [33]:
for syn in syns:
    print(syn, syn.definition())

Synset('group.n.01') any number of entities (members) considered as a unit
Synset('group.n.02') (chemistry) two or more atoms bound together as a single unit and forming part of a molecule
Synset('group.n.03') a set that is closed, associative, has an identity element and every element has an inverse
Synset('group.v.01') arrange into a group or groups
Synset('group.v.02') form a group or group together


#### Problem 2: choice of lemma

In [35]:
for syn in syns:
    print(syn, [lemma.name() for lemma in syn.lemmas()])

Synset('group.n.01') ['group', 'grouping']
Synset('group.n.02') ['group', 'radical', 'chemical_group']
Synset('group.n.03') ['group', 'mathematical_group']
Synset('group.v.01') ['group']
Synset('group.v.02') ['group', 'aggroup']


### Naive strategy

In [36]:
def wnlemma(word):
    # Naive choice: first lemma of the first synset; fall back to the word itself.
    synsets = wn.synsets(word)
    if not synsets:
        return word
    lemmas = synsets[0].lemmas()
    return lemmas[0].name() if lemmas else word

In [37]:
lemma_tokens = lambda data: [wnlemma(x) for x in data]

In [39]:
print(current)
print(lemma_tokens(current))

['group', 'presented', 'prince', 'wales', 'later', 'king', 'charles', '1623', 'spain', 'negotiating', 'marriage', 'contract', 'soon', 'became', 'famous', 'italian', 'sculpture', 'england']
['group', 'show', 'prince', 'Wales', 'later', 'king', 'Charles', '1623', 'Spain', 'negociate', 'marriage', 'contract', 'soon', 'become', 'celebrated', 'Italian', 'sculpture', 'England']
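Comparing the two lists makes the cost of the naive strategy visible: some tokens are replaced by a synonym rather than by their own base form. The short sketch below lists the tokens that changed beyond capitalization, using the two outputs printed above.

```python
before = ['group', 'presented', 'prince', 'wales', 'later', 'king', 'charles',
          '1623', 'spain', 'negotiating', 'marriage', 'contract', 'soon',
          'became', 'famous', 'italian', 'sculpture', 'england']
after = ['group', 'show', 'prince', 'Wales', 'later', 'king', 'Charles',
         '1623', 'Spain', 'negociate', 'marriage', 'contract', 'soon',
         'become', 'celebrated', 'Italian', 'sculpture', 'England']

# Keep only the pairs that differ by more than letter case.
changed = [(b, a) for b, a in zip(before, after) if b != a.lower()]
print(changed)
# → [('presented', 'show'), ('negotiating', 'negociate'), ('became', 'become'), ('famous', 'celebrated')]
```

Only 'became' → 'become' is a plain lemmatization; the others depend on which synset and which lemma happen to come first.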


### Exercise: find a better strategy for word sense disambiguation using WordNet
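One possible starting point, not the only answer, is a simplified Lesk approach: pick the sense whose definition shares the most words with the context of the target word. The sketch below is self-contained and works on the 'group' definitions printed above, copied into a dictionary; the stop list and the function names are assumptions made for illustration. NLTK also ships its own implementation in `nltk.wsd.lesk`.

```python
import re

# Glosses copied from the synset definitions printed above.
GLOSSES = {
    'group.n.01': 'any number of entities (members) considered as a unit',
    'group.n.02': '(chemistry) two or more atoms bound together as a single unit '
                  'and forming part of a molecule',
    'group.n.03': 'a set that is closed, associative, has an identity element '
                  'and every element has an inverse',
    'group.v.01': 'arrange into a group or groups',
    'group.v.02': 'form a group or group together',
}
# Hypothetical mini stop list, just enough for this example.
STOP = {'a', 'an', 'the', 'of', 'or', 'and', 'in', 'into', 'as', 'is',
        'that', 'this', 'are', 'has'}

def content_words(text):
    return {w for w in re.findall(r'\w+', text.lower()) if w not in STOP}

def simplified_lesk(context, target, glosses):
    """Return the sense whose gloss shares the most content words with the context."""
    ctx = content_words(context) - {target}
    return max(glosses, key=lambda sense: len(ctx & content_words(glosses[sense])))

sense_chem = simplified_lesk('The atoms in this group are bound into a molecule', 'group', GLOSSES)
sense_gen = simplified_lesk('The king presented the group of members as one unit', 'group', GLOSSES)
print(sense_chem)  # → group.n.02
print(sense_gen)   # → group.n.01
```

Ties are resolved in favour of the first sense listed, which mirrors the naive strategy when the context gives no evidence.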

# Approaches based on language modeling: spaCy

In [40]:
import spacy

In [41]:
nlp = spacy.load("en_core_web_sm")

In [42]:
doc = nlp(T)

### Sentence parsing

In [43]:
for s in doc.sents:
    print(s)

The group was presented to the Prince of Wales, later King Charles I, in 1623 while he was in Spain negotiating a marriage contract, and it soon became the most famous Italian sculpture in England.



## Tokenization

In [45]:
fields = ['text', 'lemma', 'pos', 'tag', 'dep', 'shape', 'alpha', 'stopwords']
tks = []
for token in doc:
    data = [token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop]
    tks.append(dict(zip(fields, data)))

In [47]:
df = pd.DataFrame(tks)

In [48]:
df

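| | text | lemma | pos | tag | dep | shape | alpha | stopwords |
|---|---|---|---|---|---|---|---|---|
| 0 | The | the | DET | DT | det | Xxx | True | True |
| 1 | group | group | NOUN | NN | nsubjpass | xxxx | True | False |
| 2 | was | be | VERB | VBD | auxpass | xxx | True | True |
| 3 | presented | present | VERB | VBN | ROOT | xxxx | True | False |
| 4 | to | to | ADP | IN | prep | xx | True | True |
| 5 | the | the | DET | DT | det | xxx | True | True |
| 6 | Prince | Prince | PROPN | NNP | pobj | Xxxxx | True | False |
| 7 | of | of | ADP | IN | prep | xx | True | True |
| 8 | Wales | Wales | PROPN | NNP | pobj | Xxxxx | True | False |
| 9 | , | , | PUNCT | , | punct | , | False | False |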


## Case, punctuation and stopwords removal

In [49]:
punct_tokens = lambda data: [x for x in data if x.pos_ not in ['PUNCT', 'SPACE']]
stop_tokens = lambda data: [x for x in data if not x.is_stop]

In [50]:
spacy_tokens = stop_tokens(punct_tokens(nlp(T)))

In [53]:
print(spacy_tokens)

[group, presented, Prince, Wales, later, King, Charles, 1623, Spain, negotiating, marriage, contract, soon, famous, Italian, sculpture, England]
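Note that 'became' is missing here, while it survived the NLTK stop list earlier: the two libraries ship different stop-word lists. The sketch below makes the difference explicit by comparing the two outputs printed in this notebook; the result holds for this sentence and the library versions used here, and may differ with others.

```python
nltk_tokens = ['group', 'presented', 'prince', 'wales', 'later', 'king', 'charles',
               '1623', 'spain', 'negotiating', 'marriage', 'contract', 'soon',
               'became', 'famous', 'italian', 'sculpture', 'england']
spacy_texts = ['group', 'presented', 'Prince', 'Wales', 'later', 'King', 'Charles',
               '1623', 'Spain', 'negotiating', 'marriage', 'contract', 'soon',
               'famous', 'Italian', 'sculpture', 'England']

spacy_lower = {t.lower() for t in spacy_texts}
only_nltk = set(nltk_tokens) - spacy_lower   # kept by NLTK, dropped by spaCy
only_spacy = spacy_lower - set(nltk_tokens)  # kept by spaCy, dropped by NLTK
print(only_nltk, only_spacy)
```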


## Normalization

In [54]:
spacy_lemma = lambda data: [x.lemma_ for x in data]

In [55]:
print(spacy_lemma(spacy_tokens))

['group', 'present', 'Prince', 'Wales', 'later', 'King', 'Charles', '1623', 'Spain', 'negotiate', 'marriage', 'contract', 'soon', 'famous', 'italian', 'sculpture', 'England']


### A look into dependencies and entities (more on this later in the course)

In [59]:
from spacy import displacy

In [82]:
T = docs[1].strip()

In [83]:
displacy.render(nlp(T), style='ent')

In [84]:
displacy.render(nlp(T), style='dep', options={'compact': True,
                                              'collapse_phrases': True,
                                              'add_lemma': True})

In [85]:
table = {'token': [], 'token dep': [], 'head': [], 'head pos': [], 'children': [], 'ancestors': []}
for token in nlp(T):
    table['token'].append(token.text)
    table['token dep'].append(token.dep_)
    table['head'].append(token.head.text)
    table['head pos'].append(token.head.pos_)
    table['children'].append(", ".join([child.text for child in token.children]))
    table['ancestors'].append(", ".join([a.text for a in token.ancestors]))
S = pd.DataFrame(table)

In [86]:
S

| | token | token dep | head | head pos | children | ancestors |
|---|---|---|---|---|---|---|
| 0 | Zalog | nsubj | is | VERB | | is |
| 1 | is | ROOT | is | VERB | Zalog, settlement, . | |
| 2 | a | det | settlement | NOUN | | settlement, is |
| 3 | formerly | advmod | independent | ADJ | | independent, settlement, is |
| 4 | independent | amod | settlement | NOUN | formerly | settlement, is |
| 5 | settlement | attr | is | VERB | a, independent, in, in | is |
| 6 | in | prep | settlement | NOUN | part | settlement, is |
| 7 | the | det | part | NOUN | | part, in, settlement, is |
| 8 | eastern | amod | part | NOUN | | part, in, settlement, is |
| 9 | part | pobj | in | ADP | the, eastern, of | in, settlement, is |
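The ancestors column follows directly from the head column: starting from a token, keep moving to its head until the root is reached (in spaCy the root is its own head, as the row for 'is' shows). The sketch below reproduces that walk without spaCy; the `heads` indices are my reading of the ten rows shown above and are an assumption of this example.

```python
words = ['Zalog', 'is', 'a', 'formerly', 'independent',
         'settlement', 'in', 'the', 'eastern', 'part']
# heads[i] is the index of the head of token i; the root points to itself.
heads = [1, 1, 5, 4, 5, 1, 5, 9, 9, 6]

def ancestors(i, heads):
    """Follow head pointers from token i up to the root."""
    path = []
    while heads[i] != i:
        i = heads[i]
        path.append(i)
    return path

for i, word in enumerate(words):
    print(word, [words[j] for j in ancestors(i, heads)])
```

For 'the' this prints `['part', 'in', 'settlement', 'is']`, matching the table.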
