# Named Entity Recognition with MIT Restaurant Dataset

Your name: Nguyen Son

Student ID: BI12-389

**Not overdued yet**

## Task Description

In this assignment, you will train a NER Model using Conditional Random Fields (CRF) on and report the accuracy of your model on the test dataset.

You will use the [MIT Restaurant Dataset](https://groups.csail.mit.edu/sls/downloads/restaurant/) dataset to do the task.

## How to submit

- Attach notebook file (.ipynb) and submit your work to Google Class Room
- Name your file as YourName_StudentID_Assignment4.ibynb. E.g., Nguyen_Van_A_ST099834_Assignment4.ipynb
- Write your name and student ID into this notebook
- Copying others' assignments is strictly prohibited.


## Install python-crfsuite

In [1]:
!pip install -q python-crfsuite

## Imports

In [2]:
from itertools import chain
import pycrfsuite

## Dataset

We will use [MIT Restaurant Dataset](https://groups.csail.mit.edu/sls/downloads/restaurant/) dataset.

The data set is already in CoNLL format. We will use the [train](https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttrain.bio) data to create the NER model and evaluate the model on the [test](https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttest.bio) data.

### Download data

In [3]:
%%capture
!rm -f restauranttrain.bio
!rm -f restauranttest.bio

!wget https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttest.bio
!wget https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttrain.bio

## Loading data (30 points)

In this part, you will load a data file into a list of sentences. Each sentence is a list of (word, tag) tuples.

**Note: Blank lines are used to seperate sentences.**

For instance, the sentence below will be loaded into a list

```
O	a
B-Rating	four
I-Rating	star
O	restaurant
B-Location	with
I-Location	a
B-Amenity	bar
```

You will complete the function below

In [4]:
## Add necessary import here

def load_data(file_path):
    """Load data into a list of list of (word, tag) tuples

    Args:
        file_path (str): Path to data

    Returns:
        sentences: list of (word, tag) tuples
    """
    sentences = []

    current_sentence = []

    with open(file_path, 'r') as file:
        for line in file:
            if line == '\n':
                if current_sentence:
                    sentences.append(current_sentence)
                    current_sentence = []
            else:
                parts = line.strip().split('\t')
                if len(parts) == 2:
                    current_sentence.append((parts[1], parts[0]))

    if current_sentence:
        sentences.append(current_sentence)

    return sentences

In [5]:
train_sents = load_data('restauranttrain.bio')
test_sents = load_data('restauranttest.bio')

Let's check the number of sentences in train and test data

In [6]:
len(train_sents)

7660

In [7]:
len(test_sents)

1521

In [8]:
train_sents[0]

[('2', 'B-Rating'),
 ('start', 'I-Rating'),
 ('restaurants', 'O'),
 ('with', 'O'),
 ('inside', 'B-Amenity'),
 ('dining', 'I-Amenity')]

## Features (50 points)

We can extract as many features as you want. You will implement following basic features.

※ Of course, you can add more features.

*Word identity (lowercase)*

- Previous word identity
- Current word identity
- Next word
- Previous word and current word combination. Concat the previous word the current word by '||'
- Current word and next word combination. Concat two words by '||'

*Word shapes*

- Word prefix and suffix (4 characters)
- The first character of the current word is the capital letter

**All you need to do is to complete the function `word2feature`.**

In [9]:
def word2features(sentence, i):
    """
    Extract features for the word at position i in the sentence.
    """
    word = sentence[i][0]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
    }

    if i > 0:
        prev_word = sentence[i-1][0]
        features.update({
            'prev_word.lower()': prev_word.lower(),
            'prev_word.istitle()': prev_word.istitle(),
            'prev_word.isupper()': prev_word.isupper(),
            'word+prev_word': word.lower() + '||' + prev_word.lower(),
        })
    else:
        features['BOS'] = True  # Beginning of Sentence

    if i < len(sentence)-1:
        next_word = sentence[i+1][0]
        features.update({
            'next_word.lower()': next_word.lower(),
            'next_word.istitle()': next_word.istitle(),
            'next_word.isupper()': next_word.isupper(),
            'word+next_word': word.lower() + '||' + next_word.lower(),
        })
    else:
        features['EOS'] = True  # End of Sentence

    return features


def sent2features(sentence):
    """
    sentence is a list of words [w1, w2,...,w_n]
    """
    return [word2features(sentence, i) for i in range(len(sentence))]


def sent2labels(sentence):
    """
    sentence is a list of tuples (word, postag)
    """
    return [tag for token, tag in sentence]

def untag(sentence):
    """
    sentence is a list of tuples (word, postag)
    """
    return [token for token, _ in sentence]

Let's try to extract features for the first sentence

In [10]:
train_sents[0]

[('2', 'B-Rating'),
 ('start', 'I-Rating'),
 ('restaurants', 'O'),
 ('with', 'O'),
 ('inside', 'B-Amenity'),
 ('dining', 'I-Amenity')]

In [11]:
sent2features(untag(train_sents[0]))[0]

{'bias': 1.0,
 'word.lower()': '2',
 'word[-3:]': '2',
 'word[-2:]': '2',
 'word.isupper()': False,
 'word.istitle()': False,
 'word.isdigit()': True,
 'BOS': True,
 'next_word.lower()': 's',
 'next_word.istitle()': False,
 'next_word.isupper()': False,
 'word+next_word': '2||s'}

### Create train/test data

In [12]:
X_train = [sent2features(untag(s)) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(untag(s)) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

## Training

In [13]:
%%time
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

CPU times: user 794 ms, sys: 17.1 ms, total: 812 ms
Wall time: 1.03 s


In [14]:
#@title Set model parameters

max_iterations = "100" #@param[50, 20, 100]

trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-4,  # coefficient for L2 penalty
    'max_iterations': max_iterations,

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})

In [15]:
%%time
trainer.train('mitrestaurant.crfsuite')

CPU times: user 16.7 s, sys: 86.4 ms, total: 16.7 s
Wall time: 17.9 s


## Evaluation (20 points)

We will use [seqeval](https://github.com/chakki-works/seqeval) package for evaluation NER result.

In [16]:
!pip install -q seqeval[cpu]

### Make Predictions

In [17]:
tagger = pycrfsuite.Tagger()
tagger.open('mitrestaurant.crfsuite')

<contextlib.closing at 0x7c42dc10ba90>

In [18]:
example_sent = test_sents[0]
example_sent

[('a', 'O'),
 ('four', 'B-Rating'),
 ('star', 'I-Rating'),
 ('restaurant', 'O'),
 ('with', 'B-Location'),
 ('a', 'I-Location'),
 ('bar', 'B-Amenity')]

In [19]:
print("Predicted:", ' '.join(tagger.tag(sent2features(untag(example_sent)))))
print("Correct:  ", ' '.join(sent2labels(example_sent)))

Predicted: O B-Rating I-Rating O O O B-Amenity
Correct:   O B-Rating I-Rating O B-Location I-Location B-Amenity


In [20]:
%%time
y_pred = [tagger.tag(xseq) for xseq in X_test]

CPU times: user 130 ms, sys: 2.92 ms, total: 133 ms
Wall time: 139 ms


In [21]:
from seqeval.metrics import classification_report

print(classification_report(y_test, y_pred))

                 precision    recall  f1-score   support

        Amenity       0.42      0.30      0.35       533
        Cuisine       0.49      0.46      0.48       532
           Dish       0.32      0.21      0.25       288
          Hours       0.59      0.42      0.49       212
       Location       0.55      0.51      0.53       812
          Price       0.49      0.35      0.40       171
         Rating       0.52      0.51      0.52       201
Restaurant_Name       0.39      0.24      0.30       402

      micro avg       0.48      0.39      0.43      3151
      macro avg       0.47      0.37      0.41      3151
   weighted avg       0.48      0.39      0.42      3151



# References

1. Datasets for Entity Recognition: https://github.com/juand-r/entity-recognition-datasets
2. [sklearn-crfsuite tutorial](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#let-s-use-conll-2002-data-to-build-a-ner-system).
3. [Quick Recipe: Build a POS tagger using a Conditional Random Field](https://nlpforhackers.io/crf-pos-tagger/)
4. [NLP Guide: Identifying Part of Speech Tags using Conditional Random Fields](https://medium.com/analytics-vidhya/pos-tagging-using-conditional-random-fields-92077e5eaa31)
5. [CRFsuite - Tutorial on Chunking Task](http://www.chokkan.org/software/crfsuite/tutorial.html)