# Named Entity Recognition with MIT Restaurant Dataset

Your name: Nguyen Son

Student ID: BI12-389

**Not overdue yet**

## Task Description

In this assignment, you will train an NER model using Conditional Random Fields (CRF) on the training data and report the accuracy of your model on the test dataset.

You will use the [MIT Restaurant Dataset](https://groups.csail.mit.edu/sls/downloads/restaurant/) to do the task.

## How to submit

- Attach the notebook file (.ipynb) and submit your work to Google Classroom
- Name your file YourName_StudentID_Assignment4.ipynb, e.g., Nguyen_Van_A_ST099834_Assignment4.ipynb
- Write your name and student ID into this notebook
- Copying others' assignments is strictly prohibited.


## Install python-crfsuite

In [None]:
!pip install -q python-crfsuite


## Imports

In [None]:
from itertools import chain
import pycrfsuite

## Dataset

We will use the [MIT Restaurant Dataset](https://groups.csail.mit.edu/sls/downloads/restaurant/).

The dataset is already in a CoNLL-style BIO format, with one tab-separated tag and word per line. We will use the [train](https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttrain.bio) data to create the NER model and evaluate the model on the [test](https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttest.bio) data.

### Download data

In [None]:
%%capture
!rm -f restauranttrain.bio
!rm -f restauranttest.bio

!wget https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttest.bio
!wget https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttrain.bio

## Loading data (30 points)

In this part, you will load a data file into a list of sentences. Each sentence is a list of (word, tag) tuples.

**Note: Blank lines are used to separate sentences.**

For instance, the lines below, each holding a tag and a word separated by a tab, will be loaded into a single list of (word, tag) tuples:

```
O	a
B-Rating	four
I-Rating	star
O	restaurant
B-Location	with
I-Location	a
B-Amenity	bar
```
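As a rough illustration of the expected result, the sample lines above can be parsed by hand as follows. This is only a sketch: `parse_sentence` and `sample` are hypothetical names introduced here, and the sketch assumes every non-blank line has exactly one tab.

```python
# Sample lines in the dataset's "tag<TAB>word" layout (copied from the example above).
sample = (
    "O\ta\n"
    "B-Rating\tfour\n"
    "I-Rating\tstar\n"
    "O\trestaurant\n"
    "B-Location\twith\n"
    "I-Location\ta\n"
    "B-Amenity\tbar\n"
)


def parse_sentence(block):
    """Turn one blank-line-delimited block into a list of (word, tag) tuples."""
    pairs = []
    for line in block.splitlines():
        if not line.strip():
            continue  # skip stray blank lines inside the block
        tag, word = line.split("\t")  # the file stores the tag first
        pairs.append((word, tag))     # swap to (word, tag)
    return pairs


sentence = parse_sentence(sample)
print(sentence)
```

Note that the tuple order is the reverse of the column order in the file.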

You will complete the function below

In [None]:
## Add necessary import here

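def load_data(file_path):
    """Load data into a list of sentences.

    Each non-blank line of the file holds a tab-separated "tag<TAB>word"
    pair, and blank lines separate sentences.

    Args:
        file_path (str): Path to data

    Returns:
        sentences: list of sentences, each a list of (word, tag) tuples
    """
    sentences = []

    current_sentence = []

    with open(file_path, 'r') as file:
        for line in file:
            if line == '\n':
                # A blank line closes the current sentence
                if current_sentence:
                    sentences.append(current_sentence)
                    current_sentence = []
            else:
                parts = line.strip().split('\t')
                if len(parts) == 2:
                    # The file stores the tag first; swap to (word, tag)
                    tag, word = parts
                    current_sentence.append((word, tag))

    # Flush the last sentence if the file does not end with a blank line
    if current_sentence:
        sentences.append(current_sentence)

    return sentences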

In [None]:
train_sents = load_data('restauranttrain.bio')
test_sents = load_data('restauranttest.bio')

Let's check the number of sentences in train and test data

In [None]:
len(train_sents)

7660

In [None]:
len(test_sents)

1521

In [None]:
train_sents[:3]

[[('2', 'B-Rating'),
  ('start', 'I-Rating'),
  ('restaurants', 'O'),
  ('with', 'O'),
  ('inside', 'B-Amenity'),
  ('dining', 'I-Amenity')],
 [('34', 'O')],
 [('5', 'B-Rating'),
  ('star', 'I-Rating'),
  ('resturants', 'O'),
  ('in', 'B-Location'),
  ('my', 'I-Location'),
  ('town', 'I-Location')]]

## Features (50 points)

You can extract as many features as you want. At a minimum, implement the following basic features.

※ Of course, you can add more features.

*Word identity (lowercase)*

- Previous word identity
- Current word identity
- Next word identity
- Previous word and current word combination. Concatenate the previous word and the current word with '||'
- Current word and next word combination. Concatenate the two words with '||'

*Word shapes*

- Word prefix and suffix (4 characters)
- Whether the first character of the current word is a capital letter

**All you need to do is to complete the function `word2features`.**
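The basic features listed above could be sketched roughly as follows. This is an illustrative sketch, not the required solution: `basic_features` and its feature key names are hypothetical, and it assumes the sentence is passed as a plain list of word strings.

```python
def basic_features(words, i):
    """Hypothetical helper: the basic features listed above for words[i]."""
    word = words[i]
    feats = {
        "word.lower": word.lower(),            # current word identity
        "prefix4": word[:4],                   # first 4 characters
        "suffix4": word[-4:],                  # last 4 characters
        "is_capitalized": word[:1].isupper(),  # first character is a capital letter
    }
    if i > 0:
        prev = words[i - 1].lower()
        feats["prev.lower"] = prev                        # previous word identity
        feats["prev||word"] = prev + "||" + word.lower()  # previous + current
    if i < len(words) - 1:
        nxt = words[i + 1].lower()
        feats["next.lower"] = nxt                         # next word identity
        feats["word||next"] = word.lower() + "||" + nxt   # current + next
    return feats


feats = basic_features(["five", "star", "restaurants"], 1)
print(feats)
```

Slicing with `word[:4]` and `word[-4:]` is safe for words shorter than four characters because Python slices simply return the whole word in that case.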

In [None]:
def word2features(sentence, i):
    """
    Extract features for the word at position i in the sentence.

    sentence is a list of words [w1, w2,...,w_n] (strings, not tuples).
    """
    # sentence[i] is already the word; indexing it again with [0] would
    # return only its first character
    word = sentence[i]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
    }

    if i > 0:
        prev_word = sentence[i-1]
        features.update({
            'prev_word.lower()': prev_word.lower(),
            'prev_word.istitle()': prev_word.istitle(),
            'prev_word.isupper()': prev_word.isupper(),
            # Previous word first, then current word, joined by '||'
            'prev_word+word': prev_word.lower() + '||' + word.lower(),
        })
    else:
        features['BOS'] = True  # Beginning of Sentence

    if i < len(sentence)-1:
        next_word = sentence[i+1]
        features.update({
            'next_word.lower()': next_word.lower(),
            'next_word.istitle()': next_word.istitle(),
            'next_word.isupper()': next_word.isupper(),
            'word+next_word': word.lower() + '||' + next_word.lower(),
        })
    else:
        features['EOS'] = True  # End of Sentence

    return features


def sent2features(sentence):
    """
    sentence is a list of words [w1, w2,...,w_n]
    """
    return [word2features(sentence, i) for i in range(len(sentence))]


def sent2labels(sentence):
    """
    sentence is a list of tuples (word, tag); returns the list of tags
    """
    return [tag for token, tag in sentence]

def untag(sentence):
    """
    sentence is a list of tuples (word, tag); returns the list of words
    """
    return [token for token, _ in sentence]

Let's try to extract features for the first sentence

In [None]:
train_sents[0]

[('2', 'B-Rating'),
 ('start', 'I-Rating'),
 ('restaurants', 'O'),
 ('with', 'O'),
 ('inside', 'B-Amenity'),
 ('dining', 'I-Amenity')]

In [None]:
sent2features(untag(train_sents[0]))[0]

{'bias': 1.0,
 'word.lower()': '2',
 'word[-3:]': '2',
 'word[-2:]': '2',
 'word.isupper()': False,
 'word.istitle()': False,
 'word.isdigit()': True,
 'BOS': True,
 'next_word.lower()': 'start',
 'next_word.istitle()': False,
 'next_word.isupper()': False,
 'word+next_word': '2||start'}

### Create train/test data

In [None]:
X_train = [sent2features(untag(s)) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(untag(s)) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
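
Before training, it can help to confirm that every sentence has exactly one label per token and to look at the label distribution. The sketch below uses toy data in place of `X_train` and `y_train`; `check_alignment` and `label_counts` are hypothetical helper names introduced here.

```python
from collections import Counter


def check_alignment(X, y):
    """Return indices of sentences whose feature and label sequences differ in length."""
    return [idx for idx, (xseq, yseq) in enumerate(zip(X, y)) if len(xseq) != len(yseq)]


def label_counts(y):
    """Count how often each tag occurs across all sentences."""
    return Counter(tag for yseq in y for tag in yseq)


# Toy data standing in for X_train / y_train.
toy_X = [[{"bias": 1.0}, {"bias": 1.0}], [{"bias": 1.0}]]
toy_y = [["B-Rating", "I-Rating"], ["O"]]

mismatched = check_alignment(toy_X, toy_y)
print("Misaligned sentences:", mismatched)
print("Label counts:", label_counts(toy_y))
```

An empty list of misaligned sentences means the features and labels line up token by token.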

## Training

In [None]:
%%time
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

CPU times: user 704 ms, sys: 21.1 ms, total: 725 ms
Wall time: 756 ms


In [None]:
import time
import pycrfsuite

# Start timing
start_time = time.time()

# Initialize trainer with verbose output enabled
trainer = pycrfsuite.Trainer(verbose=True)

print("Appending training data...")

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

print("Training data appended. Setting parameters...")

max_iterations = 20  # Set your desired number of max iterations

trainer.set_params({
    'c1': 1.0,  # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': max_iterations,
    'feature.possible_transitions': True
})

print("Parameters set. Starting training...")

trainer.train('mitrestaurant.crfsuite')

print("Training completed.")

# Training finished, stop timing
end_time = time.time()

# Calculate and print the elapsed time
elapsed_time = end_time - start_time
print(f"Total training time: {elapsed_time:.2f} seconds")


Appending training data...
Training data appended. Setting parameters...
Parameters set. Starting training...
Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10

## Evaluation (20 points)

We will use the [seqeval](https://github.com/chakki-works/seqeval) package to evaluate the NER results.

In [None]:
!pip install -q seqeval[cpu]

### Make Predictions

In [None]:
tagger = pycrfsuite.Tagger()
tagger.open('mitrestaurant.crfsuite')

In [None]:
example_sent = test_sents[0]
example_sent

In [None]:
print("Predicted:", ' '.join(tagger.tag(sent2features(untag(example_sent)))))
print("Correct:  ", ' '.join(sent2labels(example_sent)))
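
When the two tag sequences are long, a token-by-token table can be easier to read than two space-joined strings. The following is a small sketch with toy inputs; `format_comparison` is a hypothetical helper, and the toy lists stand in for `untag(example_sent)`, `sent2labels(example_sent)` and the tagger output.

```python
def format_comparison(words, gold, pred):
    """Return one line per token: word, gold tag, predicted tag, mismatch marker."""
    rows = []
    for word, g, p in zip(words, gold, pred):
        marker = "<-- mismatch" if g != p else ""
        # Left-align each column; rstrip removes padding when there is no marker
        rows.append(f"{word:<14}{g:<14}{p:<14}{marker}".rstrip())
    return "\n".join(rows)


# Toy inputs; replace with the real word, gold-tag and predicted-tag lists.
toy_words = ["4", "star", "sushi"]
toy_gold = ["B-Rating", "I-Rating", "B-Cuisine"]
toy_pred = ["B-Rating", "I-Rating", "B-Dish"]

report = format_comparison(toy_words, toy_gold, toy_pred)
print(report)
```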

In [None]:
%%time
y_pred = [tagger.tag(xseq) for xseq in X_test]

In [None]:
from seqeval.metrics import classification_report

print(classification_report(y_test, y_pred))
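
To make the difference between token-level accuracy and entity-level scores concrete, here is a standard-library sketch of both. It is an illustration under simplifying assumptions, not seqeval's implementation: `extract_spans`, `entity_prf` and `token_accuracy` are hypothetical helpers, scores are micro-averaged over all entity types, and results may differ from seqeval on malformed tag sequences.

```python
def extract_spans(tags):
    """Return a set of (entity_type, start, end) spans from one BIO tag sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((etype, start, i))  # end index is exclusive
            start, etype = None, None
        # Open a span on B-, or leniently on an I- that has no open span
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def entity_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over exact span matches."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(y_true, y_pred):
        gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def token_accuracy(y_true, y_pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    pairs = [(g, p) for gs, ps in zip(y_true, y_pred) for g, p in zip(gs, ps)]
    return sum(g == p for g, p in pairs) / len(pairs) if pairs else 0.0


# Toy data standing in for y_test / y_pred.
toy_true = [["B-Rating", "I-Rating", "O", "B-Amenity", "I-Amenity"]]
toy_pred = [["B-Rating", "I-Rating", "O", "B-Amenity", "O"]]

scores = entity_prf(toy_true, toy_pred)
accuracy = token_accuracy(toy_true, toy_pred)
print(scores, accuracy)
```

In the toy example only one tag is wrong, yet the truncated Amenity span counts as both a missed entity and a spurious one, so the entity-level scores drop much further than the token accuracy does.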

## References

1. [Datasets for Entity Recognition](https://github.com/juand-r/entity-recognition-datasets)
2. [sklearn-crfsuite tutorial](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#let-s-use-conll-2002-data-to-build-a-ner-system)
3. [Quick Recipe: Build a POS tagger using a Conditional Random Field](https://nlpforhackers.io/crf-pos-tagger/)
4. [NLP Guide: Identifying Part of Speech Tags using Conditional Random Fields](https://medium.com/analytics-vidhya/pos-tagging-using-conditional-random-fields-92077e5eaa31)
5. [CRFsuite - Tutorial on Chunking Task](http://www.chokkan.org/software/crfsuite/tutorial.html)