For this demo, we will use the [MIT Restaurant Corpus](https://groups.csail.mit.edu/sls/downloads/restaurant/) -- a dataset of transcriptions of spoken utterances about restaurants.

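The dataset uses the following entity labels in BIO format, where a 'B-' label marks the first token of an entity and an 'I-' label marks a continuation token (tokens outside any entity are labelled 'O'):

* 'B-Rating'
* 'I-Rating'
* 'B-Amenity'
* 'I-Amenity'
* 'B-Location'
* 'I-Location'
* 'B-Restaurant_Name'
* 'I-Restaurant_Name'
* 'B-Price'
* 'I-Price'
* 'B-Hours'
* 'I-Hours'
* 'B-Dish'
* 'I-Dish'
* 'B-Cuisine'
* 'I-Cuisine'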

Let us load the dataset and see what we are working with.

In [1]:
with open('sent_train', 'r') as train_sent_file:
  train_sentences = train_sent_file.readlines()

with open('label_train', 'r') as train_labels_file:
  train_labels = train_labels_file.readlines()

with open('sent_test', 'r') as test_sent_file:
  test_sentences = test_sent_file.readlines()

with open('label_test', 'r') as test_labels_file:
  test_labels = test_labels_file.readlines()


In [3]:
train_sentences[:10]

['2 start restaurants with inside dining \n',
 '34 \n',
 '5 star resturants in my town \n',
 '98 hong kong restaurant reasonable prices \n',
 'a great lunch spot but open till 2 a m passims kitchen \n',
 'a place that serves soft serve ice cream \n',
 'a restaurant that is good for groups \n',
 'a salad would make my day \n',
 'a smoothie would hit the spot \n',
 'a steak would be nice \n']

In [4]:
train_labels[:10]

['B-Rating I-Rating O O B-Amenity I-Amenity \n',
 'O \n',
 'B-Rating I-Rating O B-Location I-Location I-Location \n',
 'O B-Restaurant_Name I-Restaurant_Name O B-Price O \n',
 'O O O O O B-Hours I-Hours I-Hours I-Hours I-Hours B-Restaurant_Name I-Restaurant_Name \n',
 'O O O O B-Dish I-Dish I-Dish I-Dish \n',
 'O O O O B-Rating B-Amenity I-Amenity \n',
 'O B-Dish O O O O \n',
 'O B-Cuisine O O O O \n',
 'O B-Dish O O O \n']
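
To make the relationship between a sentence and its label line concrete, here is a minimal sketch that groups 'B-'/'I-' tagged tokens into entities. The helper name `bio_to_entities` is hypothetical and not part of the notebook; it assumes whitespace-separated tokens and labels, as in the files loaded above.

```python
def bio_to_entities(tokens, tags):
    """Group B-/I- tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # 'B-Dish' -> ('B', '-', 'Dish'); 'O' -> ('O', '', '')
        prefix, _, entity_type = tag.partition('-')
        if prefix == 'B' or (prefix == 'I' and entity_type != current_type):
            # Start a new entity (a stray I- tag without a B- is treated as a start).
            if current_tokens:
                entities.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = entity_type, [token]
        elif prefix == 'I':
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # 'O' closes any open entity.
            if current_tokens:
                entities.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, ' '.join(current_tokens)))
    return entities


sentence = 'a place that serves soft serve ice cream \n'
labels = 'O O O O B-Dish I-Dish I-Dish I-Dish \n'
print(bio_to_entities(sentence.split(), labels.split()))
# [('Dish', 'soft serve ice cream')]
```
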

In [5]:
import spacy

Let us see some example data points.

In [None]:
# Print the 6th sentence in the test set, i.e. index value 5.
print(test_sentences[5])

# Print the labels of this sentence.
print(test_labels[5])


#Defining Features for Custom NER

First, let us install the required modules.

In [6]:
# Install the pycrf and sklearn-crfsuite packages using the pip command
!pip install pycrf
!pip install sklearn-crfsuite



Collecting pycrf
  Downloading pycrf-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pycrf
  Building wheel for pycrf (setup.py) ... done
  Created wheel for pycrf: filename=pycrf-0.0.1-py3-none-any.whl size=1871 sha256=0d1853b00d9a281dc51cc191708011cbd2eff8324e63a7591081ca49fdc8cbca
  Stored in directory: /root/.cache/pip/wheels/e3/d2/c9/ba15b05ba596e2eafeb83c2903e79d634207367555aae8c7d2
Successfully built pycrf
Installing collected packages: pycrf
Successfully installed pycrf-0.0.1
Collecting sklearn-crfsuite
  Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl.metadata (4.9 kB)
Collecting python-crfsuite>=0.9.7 (from sklearn-crfsuite)
  Downloading python_crfsuite-0.9.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.3 kB)
Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl (10 kB)
Downloading python_crfsuite-0.9.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.wh



We will now compute features for our input sequences.

We define the following features for each word when building the CRF model:

- f1 = the word in lower case;
- f2 = the last 3 characters of the word;
- f3 = the last 2 characters of the word;
- f4 = 1 if the word is in uppercase; otherwise, 0;
- f5 = 1 if the word is a number; otherwise, 0;
- f6 = 1 if the word starts with a capital letter; otherwise, 0.

The function below also adds f1 and f4 to f6 for the previous word, a 'BEG' marker for the first word of a sentence, and an 'END' marker for the last word.


In [15]:
# Define a function to get the features defined above for a word.
def get_features_for_one_word(sentence, pos):
  word = sentence[pos]

  # Features of the current word
  features = [
      f'word.lower={word.lower()}',
      f'word[-3:]={word[-3:]}',
      f'word[-2:]={word[-2:]}',
      f'word.isupper={word.isupper()}',
      f'word.isdigit={word.isdigit()}',
      f'word.startsWithCapital={word[0].isupper()}'
  ]

  if pos > 0:
    # Features of the previous word
    prev_word = sentence[pos - 1]
    features.extend([
      'prev_word.lower=' + prev_word.lower(),
      'prev_word.isupper=%s' % prev_word.isupper(),
      'prev_word.isdigit=%s' % prev_word.isdigit(),
      'prev_words.startsWithCapital=%s' % prev_word[0].isupper()
    ])
  else:
    # Marker for the first word of the sentence
    features.append('BEG')

  if pos == len(sentence) - 1:
    # Marker for the last word of the sentence
    features.append('END')

  return features

#Computing Features

Define a function to get the features for a sentence using the 'get_features_for_one_word' function defined above.

In [None]:
# Define a function to get features for a sentence
# using the 'get_features_for_one_word' function.


Define a function to get the labels for a sentence.

In [18]:
# Get the features for every word in a sentence
def get_features_for_one_sentence(sentence):
  words = sentence.split()
  return [get_features_for_one_word(words, pos) for pos in range(len(words))]

# Define a function to get the labels for a sentence
def getLabelsInListForOneSentence(labels):
  return labels.split()


Example features for a sentence


In [17]:
# Apply the function 'get_features_for_one_sentence' to the sentence at index 5 in train_sentences
get_features_for_one_sentence(train_sentences[5])



[['word.lower=a',
  'word[-3:]=a',
  'word[-2:]=a',
  'word.isupper=False',
  'word.isdigit=False',
  'word.startsWithCapital=False',
  'BEG'],
 ['word.lower=place',
  'word[-3:]=ace',
  'word[-2:]=ce',
  'word.isupper=False',
  'word.isdigit=False',
  'word.startsWithCapital=False',
  'prev_word.lower=a',
  'prev_word.isupper=False',
  'prev_word.isdigit=False',
  'prev_words.startsWithCapital=False'],
 ['word.lower=that',
  'word[-3:]=hat',
  'word[-2:]=at',
  'word.isupper=False',
  'word.isdigit=False',
  'word.startsWithCapital=False',
  'prev_word.lower=place',
  'prev_word.isupper=False',
  'prev_word.isdigit=False',
  'prev_words.startsWithCapital=False'],
 ['word.lower=serves',
  'word[-3:]=ves',
  'word[-2:]=es',
  'word.isupper=False',
  'word.isdigit=False',
  'word.startsWithCapital=False',
  'prev_word.lower=that',
  'prev_word.isupper=False',
  'prev_word.isdigit=False',
  'prev_words.startsWithCapital=False'],
 ['word.lower=soft',
  'word[-3:]=oft',
  'word[-2:]=ft',
  

Get the features for the sentences in the training and test sets (X_train and X_test) and the corresponding labels (y_train and y_test).

In [19]:
X_train = [get_features_for_one_sentence(sentence) for sentence in train_sentences]
y_train = [getLabelsInListForOneSentence(labels) for labels in train_labels]

X_test = [get_features_for_one_sentence(sentence) for sentence in test_sentences]
y_test = [getLabelsInListForOneSentence(labels) for labels in test_labels]
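
A CRF expects one label per token, so it can be worth checking that every sentence and its label line have the same length before training. The following is a small sketch of such a check; `find_misaligned` is a hypothetical helper, and the sample data simply reuses two lines from the output shown earlier.

```python
def find_misaligned(sentences, label_lines):
    """Return the indices where the token count and the label count differ."""
    return [
        i
        for i, (sentence, labels) in enumerate(zip(sentences, label_lines))
        if len(sentence.split()) != len(labels.split())
    ]


sample_sentences = ['a steak would be nice \n', '34 \n']
sample_labels = ['O B-Dish O O O \n', 'O \n']
print(find_misaligned(sample_sentences, sample_labels))  # []
```

In the notebook, the same call could be applied to `train_sentences` and `train_labels`; an empty list would mean every sentence lines up with its labels.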

#CRF Model Training

Now we have all the information we need to train our CRF. Let us see how we can do that.

In [None]:
import sklearn_crfsuite

from sklearn_crfsuite import metrics

We create a CRF object and pass the training data to it. The model then trains, learning the weights for the feature functions.
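
To give a rough idea of what those weights do, here is a minimal sketch of how a linear-chain CRF scores one candidate label sequence: it adds up the weights of the (feature, label) pairs and of the (previous label, label) transitions. All weights below are invented for illustration and are not learned from this dataset; the real model also normalizes these scores over all possible label sequences, which this sketch leaves out.

```python
# Hypothetical weights, made up for illustration (a trained model learns these).
state_weights = {
    ('word.lower=steak', 'B-Dish'): 2.0,
    ('word.lower=steak', 'O'): -1.0,
    ('word.lower=a', 'O'): 1.5,
}
transition_weights = {
    ('O', 'B-Dish'): 0.5,
    ('O', 'O'): 0.25,
}


def sequence_score(features_per_token, tags):
    """Unnormalized score of one label sequence for one sentence."""
    score = 0.0
    prev_tag = None
    for features, tag in zip(features_per_token, tags):
        # Weight of each (feature, label) pair that fires for this token.
        score += sum(state_weights.get((f, tag), 0.0) for f in features)
        # Weight of moving from the previous label to this one.
        if prev_tag is not None:
            score += transition_weights.get((prev_tag, tag), 0.0)
        prev_tag = tag
    return score


x = [['word.lower=a'], ['word.lower=steak']]
print(sequence_score(x, ['O', 'B-Dish']))  # 1.5 + 2.0 + 0.5 = 4.0
print(sequence_score(x, ['O', 'O']))       # 1.5 - 1.0 + 0.25 = 0.75
```

With these made-up weights, labelling "steak" as 'B-Dish' scores higher than labelling it 'O', which is the kind of preference training is meant to produce.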

In [None]:
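# Build the CRF model.
# Note: max_iterations=100 is an illustrative value, not a tuned setting.
crf = sklearn_crfsuite.CRF(max_iterations=100)
crf.fit(X_train, y_train)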


#Model Testing and Evaluation
The model is now trained. Let us see how well it performs on the test data.

In [None]:
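# Calculate the F1 score using the test data
# (assumes the trained model is named `crf`; the weighted average is one possible choice).
y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred, average='weighted')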
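
For intuition about what a "flat" F1 score measures, the sketch below flattens all label sequences into one list of tokens, computes F1 per label, and averages the results weighted by how often each label occurs in the true data. It is a plain-Python illustration with a hypothetical function name, not the library's implementation.

```python
from collections import Counter


def flat_weighted_f1(y_true, y_pred):
    """Support-weighted token-level F1 over flattened label sequences."""
    true_flat = [t for seq in y_true for t in seq]
    pred_flat = [p for seq in y_pred for p in seq]
    if not true_flat:
        return 0.0

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_flat, pred_flat):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[t] += 1  # missed the true label t

    support = Counter(true_flat)
    total = 0.0
    for label, n in support.items():
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        total += f1 * n
    return total / len(true_flat)


y_true_demo = [['O', 'B-Dish', 'O']]
y_pred_demo = [['O', 'O', 'O']]
print(round(flat_weighted_f1(y_true_demo, y_pred_demo), 3))  # 0.533
```

In this toy example the model finds both 'O' tokens but misses the dish, so 'O' gets an F1 of 0.8, 'B-Dish' gets 0, and the weighted average is (0.8 * 2 + 0 * 1) / 3.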


In [None]:
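# Print the original labels and predicted labels for the sentence at index 10 in the test data.
# (assumes the trained model is named `crf`)
print("Sentence:", test_sentences[10])
print("Original labels: ", y_test[10])
print("Predicted labels:", crf.predict_single(X_test[10]))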


#Transitions Learned by CRF

In [None]:
from util import print_top_likely_transitions
from util import print_top_unlikely_transitions

In [None]:
print_top_likely_transitions(crf.transition_features_)

In [None]:
print_top_unlikely_transitions(crf.transition_features_)
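
The `util` module is not shown in this notebook, so the following is only a sketch of what such helpers might do. It assumes `crf.transition_features_` is a dictionary mapping (from_label, to_label) pairs to learned weights, and ranks those pairs by weight; the function name and the example weights are hypothetical.

```python
def top_transitions(transition_features, n=3, most_likely=True):
    """Return the n transitions with the highest (or lowest) weights."""
    ranked = sorted(
        transition_features.items(),
        key=lambda item: item[1],
        reverse=most_likely,
    )
    return ranked[:n]


# Hypothetical weights, made up for illustration.
example_transitions = {
    ('B-Dish', 'I-Dish'): 6.1,
    ('O', 'I-Dish'): -4.2,
    ('O', 'O'): 1.3,
    ('B-Price', 'I-Hours'): -2.5,
}

for (label_from, label_to), weight in top_transitions(example_transitions, n=2):
    print(f'{label_from:10} -> {label_to:10} {weight:+.2f}')
```

A high positive weight means the model considers that transition likely (for example 'B-Dish' followed by 'I-Dish'), while a strongly negative weight marks a transition it has learned to avoid (for example 'O' followed directly by an 'I-' label).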