# IHLT Lab 4: Part of Speech

**Authors:** *Zachary Parent ([zachary.parent](mailto:zachary.parent@estudiantat.upc.edu)), Carlos Jiménez ([carlos.humberto.jimenez](mailto:carlos.humberto.jimenez@estudiantat.upc.edu))*

### 2024-10-10

**Instructions:**

1. Consider the Treebank corpus.

    - Train HMM, TnT, perceptron and CRF models using the first 500, 1000, 1500, 2000, 2500 and 3000 sentences.

    - Evaluate the resulting 24 models using the sentences from 3001 onward.

2. Provide a figure with four learning curves, one per model type (X=training set size; Y=accuracy).

    - Which model would you select? Justify the answer.


## Notes

- We should measure how long each model takes to train.

- We could also measure how long inference on the test set takes.

- We should plot accuracy against the number of training sentences.

- We could compute a ratio of accuracy to training time for each model.
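
A minimal sketch of how the timing ideas above might be implemented, using only the standard library. The names `timed`, `accuracy_per_second` and `dummy_train_and_score` are hypothetical helpers introduced for illustration; they are not part of NLTK or of this notebook.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def accuracy_per_second(accuracy, seconds):
    """Accuracy divided by wall-clock time; infinite if the time is zero."""
    return accuracy / seconds if seconds > 0 else float("inf")


def dummy_train_and_score(n_steps):
    """Hypothetical stand-in for a train-and-evaluate function."""
    total = 0
    for i in range(n_steps):  # burn a little CPU time
        total += i
    return 0.9  # made-up accuracy, for illustration only


acc, secs = timed(dummy_train_and_score, 100_000)
ratio = accuracy_per_second(acc, secs)
print(f"accuracy={acc:.3f}, seconds={secs:.4f}, accuracy/second={ratio:.1f}")
```

In this notebook the same wrapper could be applied as `hmm_acc, hmm_secs = timed(hidden_markov, train, test)`. Because each model function trains and evaluates in a single call, that figure would cover both steps; measuring training and inference separately would require splitting those functions.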


In [17]:
import pandas as pd
import nltk
nltk.download('treebank')

[nltk_data] Downloading package treebank to
[nltk_data]     /Users/zachparent/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


True

In [18]:
len(nltk.corpus.treebank.tagged_sents())

3914

In [19]:
nltk.corpus.treebank.tagged_sents()[1]

[('Mr.', 'NNP'),
 ('Vinken', 'NNP'),
 ('is', 'VBZ'),
 ('chairman', 'NN'),
 ('of', 'IN'),
 ('Elsevier', 'NNP'),
 ('N.V.', 'NNP'),
 (',', ','),
 ('the', 'DT'),
 ('Dutch', 'NNP'),
 ('publishing', 'VBG'),
 ('group', 'NN'),
 ('.', '.')]

In [20]:
## Learning the model

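def hidden_markov(train, test):
    """Train a supervised HMM tagger on `train` and return its accuracy on `test`."""
    def LID(fd, bins):
        # Lidstone smoothing (gamma=0.1) so unseen events get non-zero probability
        return nltk.probability.LidstoneProbDist(fd, 0.1, bins)

    trainer = nltk.tag.hmm.HiddenMarkovModelTrainer()
    HMM = trainer.train_supervised(train, estimator=LID)
    acc = HMM.accuracy(test)

    return acc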

# set(test)
# len(set(test).difference(train))
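
The `LID` estimator in the cell above applies Lidstone smoothing with gamma = 0.1. As a rough standalone illustration of the formula P(x) = (count(x) + gamma) / (N + gamma * B), where N is the total count and B the number of bins, the sketch below recomputes it by hand. It is not NLTK's implementation; `lidstone_prob` and the toy tag counts are assumptions made up for this example.

```python
from collections import Counter


def lidstone_prob(counts, sample, gamma=0.1, bins=None):
    """Lidstone estimate: (count + gamma) / (total + gamma * bins)."""
    total = sum(counts.values())
    if bins is None:
        bins = len(counts)  # assume no unseen outcomes unless told otherwise
    return (counts[sample] + gamma) / (total + gamma * bins)


# Toy tag counts (made up): 4 observations over 3 seen tags, 5 possible tags
counts = Counter(["NN", "NN", "DT", "VBZ"])

p_seen = lidstone_prob(counts, "NN", bins=5)    # (2 + 0.1) / (4 + 0.5)
p_unseen = lidstone_prob(counts, "JJ", bins=5)  # (0 + 0.1) / (4 + 0.5)

print(f"P(NN)={p_seen:.4f}, P(JJ)={p_unseen:.4f}")
```

The unseen tag `JJ` receives a small but non-zero probability, which is the reason for smoothing here: without it, a single unseen event would give a whole tag sequence zero probability.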

In [21]:
# HMM = nltk.HiddenMarkovModelTagger.train(train)
# HMM.accuracy(test)

## TnT

In [22]:

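def TnT(train, test):
    """Train a TnT tagger on `train` and return its accuracy on `test`."""
    # Use a distinct local name so the function name TnT is not shadowed
    tagger = nltk.tag.tnt.TnT()
    tagger.train(train)
    acc = tagger.accuracy(test)

    return acc
# Example (inside the function): tagger.tag(['the', 'men', 'attended', 'to', 'the', 'meetings'])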

## Perceptron

In [23]:
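def perceptron(train, test):
    """Train an averaged perceptron tagger on `train` and return its accuracy on `test`."""
    # load=False starts from an untrained model instead of the pretrained one
    PER = nltk.tag.perceptron.PerceptronTagger(load=False)
    PER.train(train)
    acc = PER.accuracy(test)

    return acc
# Example (inside the function): PER.tag(['the', 'men', 'attended', 'to', 'the', 'meetings'])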

## CRF

In [24]:
# !pip install python-crfsuite

In [25]:
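def CRF(train, test):
    """Train a CRF tagger on `train` and return its accuracy on `test`."""
    # Use a distinct local name so the function name CRF is not shadowed
    tagger = nltk.tag.CRFTagger()
    # The trained model is written to the file 'crf_tagger_model'
    tagger.train(train, 'crf_tagger_model')
    acc = tagger.accuracy(test)

    return acc
# Example (inside the function): tagger.tag(['the', 'men', 'attended', 'to', 'the', 'meetings'])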

## Train all models with different sentences number

In [26]:
# Sentences number list
sentences_n = [500, 1000, 1500, 2000, 2500, 3000]

In [28]:
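# Held-out test set: every sentence from the 3001st onward (index 3000+)
test = nltk.corpus.treebank.tagged_sents()[3000:]

# One row per training size, one column per model
results = pd.DataFrame(columns=['HMM', 'TnT', 'PER', 'CRF'], dtype=float)

for n in sentences_n:
    print(f'Training with {n} sentences...')
    train = nltk.corpus.treebank.tagged_sents()[:n]

    # TODO: remove this debug shortcut. It keeps only the first 10% of each
    # subset (50 to 300 sentences), so the accuracies printed below come from
    # these reduced training sets rather than the full n sentences.
    train = train[:len(train)//10]

    hmm_acc = hidden_markov(train, test)
    tnt_acc = TnT(train, test)
    per_acc = perceptron(train, test)
    crf_acc = CRF(train, test)

    # Append this training size's accuracies as a new row
    new_row = pd.DataFrame({'HMM': [hmm_acc], 'TnT': [tnt_acc], 'PER': [per_acc], 'CRF': [crf_acc]})
    results = pd.concat([results, new_row], ignore_index=True)
    print(results.iloc[-1, :])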


Training with 500 sentences...
HMM    0.595683
TnT    0.520181
PER    0.716771
CRF    0.749320
Name: 0, dtype: float64
Training with 1000 sentences...
HMM    0.673818
TnT    0.587222
PER    0.800863
CRF    0.830175
Name: 1, dtype: float64
Training with 1500 sentences...
HMM    0.715001
TnT    0.640276
PER    0.851068
CRF    0.863976
Name: 2, dtype: float64
Training with 2000 sentences...
HMM    0.732225
TnT    0.668250
PER    0.863933
CRF    0.877013
Name: 3, dtype: float64
Training with 2500 sentences...
HMM    0.745737
TnT    0.686682
PER    0.878351
CRF    0.885733
Name: 4, dtype: float64
Training with 3000 sentences...
HMM    0.755752
TnT    0.700842
PER    0.883877
CRF    0.890222
Name: 5, dtype: float64


## Plots (TODO: Zach)
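
One possible way to draw the four learning curves is sketched below. The accuracies are copied from the output printed above, so they reflect that run, including the 10% debug truncation still present in the training loop; they should be regenerated before drawing conclusions. `best_model_per_size` and `plot_learning_curves` are hypothetical helpers, and the sketch assumes matplotlib is installed.

```python
sentences_n = [500, 1000, 1500, 2000, 2500, 3000]

# Accuracies copied from the printed output of the training loop above
recorded = {
    "HMM": [0.595683, 0.673818, 0.715001, 0.732225, 0.745737, 0.755752],
    "TnT": [0.520181, 0.587222, 0.640276, 0.668250, 0.686682, 0.700842],
    "PER": [0.716771, 0.800863, 0.851068, 0.863933, 0.878351, 0.883877],
    "CRF": [0.749320, 0.830175, 0.863976, 0.877013, 0.885733, 0.890222],
}


def best_model_per_size(sizes, accuracies):
    """Map each training size to the model with the highest accuracy."""
    return {
        n: max(accuracies, key=lambda model: accuracies[model][i])
        for i, n in enumerate(sizes)
    }


def plot_learning_curves(sizes, accuracies):
    """Draw one accuracy curve per model; returns the matplotlib figure."""
    # Imported here so the helpers above also work without matplotlib
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(7, 4))
    for model, accs in accuracies.items():
        ax.plot(sizes, accs, marker="o", label=model)
    ax.set_xlabel("Training set size (sentences)")
    ax.set_ylabel("Accuracy on test sentences")
    ax.set_xticks(sizes)
    ax.grid(True, alpha=0.3)
    ax.legend(title="Model")
    return fig


best = best_model_per_size(sentences_n, recorded)
print(best)
```

Calling `plot_learning_curves(sentences_n, recorded)` in a notebook cell would render the figure. With the `results` DataFrame from the training loop, `results.to_dict("list")` could be passed in place of `recorded`.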

## TODO

- Provide a figure with four learning curves, one per model type (X=training set size; Y=accuracy).

- Which model would you select? Justify the answer.