<a href="https://colab.research.google.com/github/Yayawak/ML-projects/blob/main/POS_tagger_demo_with_python_crfsuite.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CRF Tutorial using python-crfsuite

In this tutorial, we use a conditional random field (CRF) for part-of-speech (POS) tagging. The tutorial has six main parts:
1. Setup and preprocessing
2. Designing feature functions
3. Training
4. Making predictions
5. Evaluation
6. Use pretrained word embedding

## 1. Setup and preprocessing

In this demo, we will use [python-crfsuite](https://github.com/scrapinghub/python-crfsuite).



In [1]:
!wget https://www.dropbox.com/s/tuvrbsby4a5axe0/resources.zip
!unzip resources.zip


In [2]:
!pip install python-crfsuite

Collecting python-crfsuite
  Downloading python_crfsuite-0.9.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.3 kB)
Downloading python_crfsuite-0.9.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: python-crfsuite
Successfully installed python-crfsuite-0.9.11


In [3]:
import pycrfsuite
import numpy

We use POS data from the [ORCHID corpus](https://www.researchgate.net/profile/Virach-Sornlertlamvanich/publication/2630580_Building_a_Thai_part-of-speech_tagged_corpus_ORCHID/links/02e7e514db19a98619000000/Building-a-Thai-part-of-speech-tagged-corpus-ORCHID.pdf), a POS-tagged corpus for the Thai language.
A method that reads the corpus into a list of sentences of (word, POS) pairs has already been implemented. Example usage is shown below.

In [4]:
from data.orchid_corpus import get_sentences
train_data = get_sentences('train')
test_data = get_sentences('test')
train_data[0]

[('การ', 'FIXN'),
 ('ประชุม', 'VACT'),
 ('ทาง', 'NCMN'),
 ('วิชาการ', 'NCMN'),
 ('<space>', 'PUNC'),
 ('ครั้ง', 'CFQC'),
 ('ที่ 1', 'DONM')]

## 2. Designing feature functions

- __word2features()__: Returns the features for time step _i_ of an input sequence, so this is where all feature functions are defined. You only need to define features over the input sequence (words, in this example); the library builds the transition functions ($y_{t-1}$ -> $y_t$) and the state functions ($y_t$ -> $X$, for every feature $X$ you define here) for you.
- __sent2features()__: Calls word2features() for every position in the input sequence.
- __sent2labels()__: Gets the output labels from a train/test sequence.
- __sent2tokens()__: Gets the words from a train/test sequence (used in the prediction part only to display the full result).
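
For reference, the split between the two kinds of feature functions can be written out with the standard linear-chain CRF formulation. The symbols $\lambda_k$, $\mu_l$, $f_k$, $g_l$ and $Z(X)$ are notation introduced here for illustration; they are not names used by python-crfsuite:

```latex
p(y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{T} \left[ \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t) + \sum_{l} \mu_l \, g_l(y_t, X, t) \right] \right)
```

Here the $f_k$ play the role of the transition functions over label pairs, and the $g_l$ play the role of the state functions that pair a label with one of the features returned by word2features(). $Z(X)$ normalizes over all possible label sequences.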

In [6]:
def word2features(sent, i):
    word = sent[i][0]

    features = {
        'word': word,
        'word.isdigit': word.isdigit(),
        'word.length': len(word),
    }

    features['BOS'] = (i == 0)  # beginning of sentence
    features['EOS'] = (i == len(sent)-1)  # end of sentence

    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for (word, label) in sent]

def sent2tokens(sent):
    return [word for (word, label) in sent]

In [7]:
train_data[0]

[('การ', 'FIXN'),
 ('ประชุม', 'VACT'),
 ('ทาง', 'NCMN'),
 ('วิชาการ', 'NCMN'),
 ('<space>', 'PUNC'),
 ('ครั้ง', 'CFQC'),
 ('ที่ 1', 'DONM')]

In [8]:
sent2features(train_data[0])[0]

{'word': 'การ',
 'word.isdigit': False,
 'word.length': 3,
 'BOS': True,
 'EOS': False}

In [9]:
sent2features(train_data[0])

[{'word': 'การ',
  'word.isdigit': False,
  'word.length': 3,
  'BOS': True,
  'EOS': False},
 {'word': 'ประชุม',
  'word.isdigit': False,
  'word.length': 6,
  'BOS': False,
  'EOS': False},
 {'word': 'ทาง',
  'word.isdigit': False,
  'word.length': 3,
  'BOS': False,
  'EOS': False},
 {'word': 'วิชาการ',
  'word.isdigit': False,
  'word.length': 7,
  'BOS': False,
  'EOS': False},
 {'word': '<space>',
  'word.isdigit': False,
  'word.length': 7,
  'BOS': False,
  'EOS': False},
 {'word': 'ครั้ง',
  'word.isdigit': False,
  'word.length': 5,
  'BOS': False,
  'EOS': False},
 {'word': 'ที่ 1',
  'word.isdigit': False,
  'word.length': 5,
  'BOS': False,
  'EOS': True}]
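
The features above look only at the current word. One common way to give the model more context is to add the neighbouring words as features too. The sketch below assumes the same (word, label) sentence format; the function name `word2features_context` and the feature keys `-1:word` and `+1:word` are hypothetical and are not used elsewhere in this notebook.

```python
def word2features_context(sent, i):
    """Features for position i, including the neighbouring words (hypothetical extension)."""
    word = sent[i][0]
    features = {
        'word': word,
        'word.isdigit': word.isdigit(),
        'word.length': len(word),
        'BOS': i == 0,              # beginning of sentence
        'EOS': i == len(sent) - 1,  # end of sentence
    }
    # previous word, if there is one
    if i > 0:
        features['-1:word'] = sent[i - 1][0]
    # next word, if there is one
    if i < len(sent) - 1:
        features['+1:word'] = sent[i + 1][0]
    return features


# first three tokens of train_data[0], copied from the output above
sent = [('การ', 'FIXN'), ('ประชุม', 'VACT'), ('ทาง', 'NCMN')]
print(word2features_context(sent, 1))
```

Sentence-initial and sentence-final positions simply omit the missing neighbour, so the BOS and EOS flags still carry that information.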

In [10]:
%%time
x_train = [sent2features(sent) for sent in train_data]
y_train = [sent2labels(sent) for sent in train_data]
x_test = [sent2features(sent) for sent in test_data]
y_test = [sent2labels(sent) for sent in test_data]

CPU times: user 289 ms, sys: 41.1 ms, total: 330 ms
Wall time: 332 ms


## 3. Training

To train a CRF model in python-crfsuite, we have to create a trainer and load training data (pairs of __generated features__ and __labels__) to the trainer first.

In [11]:
trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in zip(x_train, y_train):
    trainer.append(xseq, yseq)

There are several parameters you can set for the training process. You can list all of them with this method.

In [12]:
trainer.params()

['feature.minfreq',
 'feature.possible_states',
 'feature.possible_transitions',
 'c1',
 'c2',
 'max_iterations',
 'num_memories',
 'epsilon',
 'period',
 'delta',
 'linesearch',
 'max_linesearch']

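In this tutorial, we will set three parameters:

- __max_iterations__: The maximum number of iterations the optimizer runs over the training data.
- __feature.possible_transitions__: When enabled, the library also generates transition feature functions (see section 2) for label pairs that never occur in the training data.
- __feature.possible_states__: The same for state feature functions: when enabled, every combination of feature and label gets a state function, including combinations not seen in training.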

In [13]:
trainer.set_params({
    'max_iterations': 100,
    'feature.possible_transitions': True,
    'feature.possible_states': True,
})

Finally, call the trainer to train with the specified model path.
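
The path used in the next cell points into a `model/` directory. Whether that directory already exists depends on the contents of resources.zip, which this notebook does not show. A defensive sketch (an assumption, not part of the original notebook) is to create the directory before training:

```python
import os

model_path = 'model/crf_basic.model'
# create the parent directory if it is missing; no error if it already exists
os.makedirs(os.path.dirname(model_path), exist_ok=True)
```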

In [14]:
%%time
model_path = 'model/crf_basic.model'
trainer.train(model_path)

CPU times: user 3min 55s, sys: 863 ms, total: 3min 56s
Wall time: 3min 56s


## 4. Making predictions

Once we have finished training a model, we can use it to predict tags for any sequence of words.
To do this, create a tagger and open the saved model by its path. Then generate features for the sequence we want to tag and pass them to the _tag_ method.

In [15]:
tagger = pycrfsuite.Tagger()
tagger.open(model_path)

<contextlib.closing at 0x79c9b451f050>

In [16]:
example_sent = test_data[20]
print(' '.join(sent2tokens(example_sent)))

print('Predicted: ', ' '.join(tagger.tag(sent2features(example_sent))))
print('Correct: ', ' '.join(sent2labels(example_sent)))

<minus> <space> ระบบ การ บันทึก รหัส ไว้ ใน แฟ้มข้อมูล
Predicted:  PUNC PUNC NCMN FIXN VACT NCMN XVAE RPRE NCMN
Correct:  PUNC PUNC NCMN FIXN VACT NCMN XVAE RPRE NCMN
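
Space-joined tags are hard to align with their tokens by eye. A minimal sketch that prints one token per line next to its predicted and gold tag is shown below; `print_aligned` is a hypothetical helper, and the lists are copied from the example output above rather than produced by the tagger.

```python
def print_aligned(tokens, predicted, gold):
    """Print one token per line with its predicted and gold tag; return the rows."""
    rows = []
    for token, pred, true in zip(tokens, predicted, gold):
        # flag positions where the prediction disagrees with the gold tag
        flag = '' if pred == true else '  <-- mismatch'
        rows.append(f'{token}\t{pred}\t{true}{flag}')
    print('\n'.join(rows))
    return rows


# values copied from the example output above
tokens = ['<minus>', '<space>', 'ระบบ', 'การ', 'บันทึก', 'รหัส', 'ไว้', 'ใน', 'แฟ้มข้อมูล']
tags = ['PUNC', 'PUNC', 'NCMN', 'FIXN', 'VACT', 'NCMN', 'XVAE', 'RPRE', 'NCMN']
rows = print_aligned(tokens, tags, tags)
```

In the notebook, the three arguments would be `sent2tokens(example_sent)`, `tagger.tag(sent2features(example_sent))` and `sent2labels(example_sent)`.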


## 5. Evaluation

To measure how well the model performs, we evaluate it on the _test data_. For sequence labeling tasks, __accuracy__ is the usual measure of model quality. We can analyze further by looking at each tag's
- __precision__: Of the tokens predicted as tag _x_, how many are really tag _x_ in the test data.
- __recall__: Of the tokens that are really tag _x_ in the test data, how many the model predicted as tag _x_.
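
As a small worked example of these two per-tag measures, the sketch below computes precision and recall for a single tag over flat label lists. The function name `precision_recall` and the toy labels are invented for illustration; results are returned as fractions rather than percentages, and `None` marks an undefined value.

```python
def precision_recall(y_true, y_pred, tag):
    """Per-tag precision and recall (as fractions) over flat label lists."""
    # tokens where both the gold label and the prediction are `tag`
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == tag)
    n_pred = sum(1 for p in y_pred if p == tag)  # times the model predicted `tag`
    n_true = sum(1 for t in y_true if t == tag)  # times `tag` is the gold label
    precision = correct / n_pred if n_pred else None
    recall = correct / n_true if n_true else None
    return precision, recall


# toy labels, invented for illustration only
y_true = ['NCMN', 'NCMN', 'VACT', 'NCMN', 'PUNC']
y_pred = ['NCMN', 'VACT', 'VACT', 'NCMN', 'NCMN']
print(precision_recall(y_true, y_pred, 'NCMN'))
print(precision_recall(y_true, y_pred, 'PUNC'))
```

For `PUNC`, the model never predicts the tag, so precision is undefined while recall is zero. This is the same situation as the rows that show `-` in the report further down.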

The method below, evaluation_report(), computes all of the metrics described above and displays them in a DataFrame. It is fine to just use this method without reading through its implementation.

In [17]:
import pandas as pd
from IPython.display import display

def evaluation_report(y_true, y_pred):
    # collect every tag seen in either the gold labels or the predictions,
    # so a tag that is predicted but absent from y_true cannot raise a KeyError below
    tag_set = set()
    for sent in list(y_true) + list(y_pred):
        tag_set.update(sent)
    tag_list = sorted(tag_set)

    # count correct points
    tag_info = dict()
    for tag in tag_list:
        tag_info[tag] = {'correct_tagged': 0, 'y_true': 0, 'y_pred': 0}

    all_correct = 0
    all_count = sum([len(sent) for sent in y_true])
    for sent_true, sent_pred in zip(y_true, y_pred):
        for tag_true, tag_pred in zip(sent_true, sent_pred):
            if tag_true == tag_pred:
                tag_info[tag_true]['correct_tagged'] += 1
                all_correct += 1
            tag_info[tag_true]['y_true'] += 1
            tag_info[tag_pred]['y_pred'] += 1
    accuracy = (all_correct / all_count) * 100

    # summarize and make evaluation result
    eval_list = list()
    for tag in tag_list:
        eval_result = dict()
        eval_result['tag'] = tag
        eval_result['correct_count'] = tag_info[tag]['correct_tagged']
        # '-' marks a metric that is undefined because its denominator is zero
        precision = (tag_info[tag]['correct_tagged']/tag_info[tag]['y_pred'])*100 if tag_info[tag]['y_pred'] else '-'
        recall = (tag_info[tag]['correct_tagged']/tag_info[tag]['y_true'])*100 if tag_info[tag]['y_true'] else '-'
        eval_result['precision'] = precision
        eval_result['recall'] = recall
        both_defined = isinstance(precision, float) and isinstance(recall, float)
        eval_result['f_score'] = (2*precision*recall)/(precision+recall) if (both_defined and recall > 0) else '-'

        eval_list.append(eval_result)

    eval_list.append({'tag': 'accuracy=%.2f' % accuracy, 'correct_count': '', 'precision': '', 'recall': '', 'f_score': ''})

    df = pd.DataFrame.from_dict(eval_list)
    df = df[['tag', 'precision', 'recall', 'f_score', 'correct_count']]
    display(df)

Make predictions on the test set (y_pred) and evaluate them against the real labels (y_test).

In [18]:
y_pred = [tagger.tag(x_sent) for x_sent in x_test]

In [19]:
evaluation_report(y_test, y_pred)

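| | tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|---|
| 0 | ADVI | - | 0.0 | - | 0.0 |
| 1 | ADVN | 69.461078 | 20.677362 | 31.868132 | 232.0 |
| 2 | ADVP | - | 0.0 | - | 0.0 |
| 3 | ADVS | - | 0.0 | - | 0.0 |
| 4 | CFQC | - | 0.0 | - | 0.0 |
| 5 | CLTV | - | 0.0 | - | 0.0 |
| 6 | CMTR | 13.043478 | 1.452785 | 2.614379 | 6.0 |
| 7 | CMTR@PUNC | - | 0.0 | - | 0.0 |
| 8 | CNIT | 100.0 | 3.532609 | 6.824147 | 13.0 |
| 9 | DCNM | 69.978858 | 72.349727 | 71.144546 | 662.0 |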
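
Low recall for a tag says that the model misses it, but not what it predicts instead. One way to find out is to count the most frequent (gold tag, predicted tag) disagreements. The sketch below assumes the same list-of-lists layout as `y_test` and `y_pred`; `top_confusions` is a hypothetical helper and the toy sequences are invented for illustration.

```python
from collections import Counter


def top_confusions(y_true, y_pred, n=5):
    """Return the n most frequent (gold tag, predicted tag) pairs that disagree."""
    confusions = Counter()
    for sent_true, sent_pred in zip(y_true, y_pred):
        for tag_true, tag_pred in zip(sent_true, sent_pred):
            if tag_true != tag_pred:
                confusions[(tag_true, tag_pred)] += 1
    return confusions.most_common(n)


# toy sequences, invented for illustration only
gold = [['ADVN', 'NCMN'], ['ADVN', 'VACT']]
pred = [['NCMN', 'NCMN'], ['NCMN', 'VACT']]
print(top_confusions(gold, pred))
```

In the notebook, this would be called as `top_confusions(y_test, y_pred)`.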


## 6. Use pretrained word embedding

In this exercise, we use the pretrained word embeddings from the previous homework as word features in pycrfsuite. We load them with pickle. The pretrained weights are a dictionary that maps each word to its embedding vector.

In [20]:
import pickle
# the with-block closes the file even if unpickling fails
with open('basic_ff_embedding.pt', 'rb') as fp:
    embeddings = pickle.load(fp)
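
Words missing from the dictionary fall back to a zero vector in the next cell, so it can help to check how much of the corpus the pretrained dictionary covers. The sketch below assumes only that the embeddings object supports `in` lookups by word; `embedding_coverage` is a hypothetical helper and the toy data is invented for illustration.

```python
def embedding_coverage(sentences, emb):
    """Fraction of tokens in `sentences` that have an entry in `emb`."""
    total = 0
    found = 0
    for sent in sentences:
        for word, _label in sent:
            total += 1
            if word in emb:
                found += 1
    # an empty corpus has no tokens to cover
    return found / total if total else 0.0


# toy data, invented for illustration; real use would pass train_data and embeddings
toy_sents = [[('a', 'X'), ('b', 'Y')], [('c', 'X'), ('a', 'Y')]]
toy_emb = {'a': [0.1, 0.2], 'b': [0.3, 0.4]}
print(embedding_coverage(toy_sents, toy_emb))
```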

In [21]:
def word2features(sent, i, emb):
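    def add_embedding_features(feat, prefix, query_word):
        # look up the word vector; out-of-vocabulary words fall back to a zero vector
        # (the size 32 must match the dimensionality of the pretrained vectors)
        if query_word in emb:
            vec = emb[query_word]
        else:
            vec = numpy.zeros(32)

        # one real-valued feature per embedding dimension;
        # `dim` avoids reusing the name of the outer position index `i`
        for dim in range(vec.shape[0]):
            feat[prefix + str(dim)] = vec[dim]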

    features = dict()
    word = sent[i][0]
    add_embedding_features(features, 'word.embd', word)
    features.update({
        'word.word' : word,
        'word.isdigit': word.isdigit(),
        'word.length': len(word),
    })

    features['BOS'] = (i == 0)  # beginning of sentence
    features['EOS'] = (i == len(sent)-1)  # end of sentence

    return features

def sent2features(sent, emb_dict):
    return [word2features(sent, i, emb_dict) for i in range(len(sent))]

def sent2labels(sent):
    return [label for (word, label) in sent]

def sent2tokens(sent):
    return [word for (word, label) in sent]

In [22]:
%%time
x_train = [sent2features(sent, embeddings) for sent in train_data]
y_train = [sent2labels(sent) for sent in train_data]
x_test = [sent2features(sent, embeddings) for sent in test_data]
y_test = [sent2labels(sent) for sent in test_data]

CPU times: user 5.72 s, sys: 762 ms, total: 6.48 s
Wall time: 6.58 s


In [40]:
train_data[0]

[('การ', 'FIXN'),
 ('ประชุม', 'VACT'),
 ('ทาง', 'NCMN'),
 ('วิชาการ', 'NCMN'),
 ('<space>', 'PUNC'),
 ('ครั้ง', 'CFQC'),
 ('ที่ 1', 'DONM')]

In [23]:
sent2features(train_data[0], embeddings)[0]

{'word.embd0': np.float32(0.63079655),
 'word.embd1': np.float32(0.55423963),
 'word.embd2': np.float32(-0.69944656),
 'word.embd3': np.float32(0.66754633),
 'word.embd4': np.float32(0.71997637),
 'word.embd5': np.float32(0.5652285),
 'word.embd6': np.float32(-0.5982634),
 'word.embd7': np.float32(0.5873137),
 'word.embd8': np.float32(0.6438087),
 'word.embd9': np.float32(0.5209912),
 'word.embd10': np.float32(-0.5298235),
 'word.embd11': np.float32(0.7447553),
 'word.embd12': np.float32(0.6827823),
 'word.embd13': np.float32(-0.5775221),
 'word.embd14': np.float32(-0.66996753),
 'word.embd15': np.float32(0.6653535),
 'word.embd16': np.float32(-0.6439402),
 'word.embd17': np.float32(0.6294213),
 'word.embd18': np.float32(-0.68831235),
 'word.embd19': np.float32(-0.6622428),
 'word.embd20': np.float32(-0.8227441),
 'word.embd21': np.float32(-0.59909046),
 'word.embd22': np.float32(0.6666846),
 'word.embd23': np.float32(0.656023),
 'word.embd24': np.float32(0.68236977),
 'word.embd25': n

In [24]:
%%time
trainer = pycrfsuite.Trainer(verbose=True)
trainer.set_params({
    'max_iterations': 100,
    'feature.possible_transitions': True,
    'feature.possible_states': True,
})

for xseq, yseq in zip(x_train, y_train):
    trainer.append(xseq, yseq)

CPU times: user 6.95 s, sys: 109 ms, total: 7.06 s
Wall time: 7.13 s


In [25]:
%%time
model_path = 'model/crf_neural.model'
trainer.train(model_path)

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 1
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 709136
Seconds required: 78.075

L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 776867.117349
Feature norm: 1.000000
Error norm: 282725.103522
Active features: 709136
Line search trials: 1
Line search step: 0.000001
Seconds required for this iteration: 5.366

***** Iteration #2 *****
Loss: 741788.669404
Feature norm: 0.979262
Error norm: 413664.190556
Active features: 709136
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 3.158

***** Iteration #3 *****
Loss: 709224.554318
Feature norm: 0.996845
Error norm: 263743.801395
Active features: 709136
Line search trials: 1
Line search step: 1.000000
Seconds r

KeyboardInterrupt: 

In [None]:
%%time
model_path = 'model/crf_neural.model'
tagger = pycrfsuite.Tagger()
tagger.open(model_path)
y_pred = [tagger.tag(x_sent) for x_sent in x_test]

CPU times: user 2.05 s, sys: 39.1 ms, total: 2.09 s
Wall time: 2.09 s


In [None]:
evaluation_report(y_test, y_pred)

| | tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|---|
| 0 | ADVI | - | 0.0 | - | 0.0 |
| 1 | ADVN | 53.265045 | 37.076649 | 43.720441 | 416.0 |
| 2 | ADVP | - | 0.0 | - | 0.0 |
| 3 | ADVS | - | 0.0 | - | 0.0 |
| 4 | CFQC | - | 0.0 | - | 0.0 |
| 5 | CLTV | - | 0.0 | - | 0.0 |
| 6 | CMTR | 47.663551 | 12.348668 | 19.615385 | 51.0 |
| 7 | CMTR@PUNC | - | 0.0 | - | 0.0 |
| 8 | CNIT | 40.437158 | 20.108696 | 26.860254 | 74.0 |
| 9 | DCNM | 69.714964 | 64.153005 | 66.818441 | 587.0 |
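
To compare this report with the one from section 5, the per-tag f_score values can be subtracted. The sketch below uses only the rows visible in the two reports in this notebook and skips tags whose f_score is `-`; `fscore_delta` is a hypothetical helper. Only the first ten rows of each report are shown here, so this is not a full comparison of the two models.

```python
# f_score values copied from the two reports in this notebook (rows with '-' omitted)
basic = {'ADVN': 31.868132, 'CMTR': 2.614379, 'CNIT': 6.824147, 'DCNM': 71.144546}
neural = {'ADVN': 43.720441, 'CMTR': 19.615385, 'CNIT': 26.860254, 'DCNM': 66.818441}


def fscore_delta(before, after):
    """Per-tag change in f_score (after minus before) for tags present in both."""
    return {tag: round(after[tag] - before[tag], 2) for tag in before if tag in after}


for tag, delta in fscore_delta(basic, neural).items():
    print(f'{tag}: {delta:+.2f}')
```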


# TODO
# Have students build word-embedding features for two short sentences, for example:
# -> ฉันชอบกินพิซซ่า ("I like eating pizza")
# -> ฉันเดินไปตลาด ("I walk to the market")

In [None]:
# INSERT YOUR CODE HERE

In [39]:
# Example sentences (the label 'O' is used as a placeholder)
sent1 = [("ฉัน", "O"), ("ชอบ", "O"), ("กิน", "O"), ("พิซซ่า", "O")]
sent2 = [("ฉัน", "O"), ("เดิน", "O"), ("ไป", "O"), ("ตลาด", "O")]

# Build the features for both sentences
features_sent1 = sent2features(sent1, embeddings)
features_sent2 = sent2features(sent2, embeddings)

# Show the features of the first word in each sentence
print("Features for first word in sentence 1:")
print(features_sent1[0])

print("\nFeatures for first word in sentence 2:")
print(features_sent2[0])


# print(' '.join(sent2tokens(features_sent1[0])))

# print('Predicted: ', ' '.join(tagger.tag(sent2features(example_sent))))
# print('Correct: ', ' '.join(sent2labels(example_sent)))

Features for first word in sentence 1:
{'word.embd0': np.float32(0.4734008), 'word.embd1': np.float32(0.7674905), 'word.embd2': np.float32(-0.74963117), 'word.embd3': np.float32(0.5563736), 'word.embd4': np.float32(0.5993332), 'word.embd5': np.float32(0.7086966), 'word.embd6': np.float32(-0.61197686), 'word.embd7': np.float32(0.7255628), 'word.embd8': np.float32(0.7469505), 'word.embd9': np.float32(0.4603887), 'word.embd10': np.float32(0.0606259), 'word.embd11': np.float32(0.4647168), 'word.embd12': np.float32(0.42943838), 'word.embd13': np.float32(-0.5901407), 'word.embd14': np.float32(-0.627834), 'word.embd15': np.float32(0.79420143), 'word.embd16': np.float32(-0.75600916), 'word.embd17': np.float32(0.62876666), 'word.embd18': np.float32(-0.6262079), 'word.embd19': np.float32(-0.61170936), 'word.embd20': np.float32(-0.55074245), 'word.embd21': np.float32(-0.81159306), 'word.embd22': np.float32(0.57957774), 'word.embd23': np.float32(0.6350854), 'word.embd24': np.float32(0.55458415), '
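
The sentences above are written out by hand as (word, label) pairs. For new, already-segmented text, a small wrapper can add the placeholder label automatically. `words_to_sent` is a hypothetical helper, and the default `'O'` simply mirrors the placeholder used above; the label is ignored when features are generated.

```python
def words_to_sent(words, placeholder='O'):
    """Wrap raw tokens as (word, label) pairs so sent2features() can consume them."""
    return [(word, placeholder) for word in words]


sent1 = words_to_sent(['ฉัน', 'ชอบ', 'กิน', 'พิซซ่า'])
sent2 = words_to_sent(['ฉัน', 'เดิน', 'ไป', 'ตลาด'])
print(sent1)
```

The result can then be passed to `sent2features(sent1, embeddings)` exactly as in the cell above.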