## Natural Language Processing COMP3225

CRF Named Entity Recognition (NER) lab
Stuart Middleton, 18/09/2020

This lab will provide practical experience with named entity recognition (NER) software trained to label named entities (NEs) within English sentences using a Conditional Random Field (CRF) model. You will learn how to use the CRF model to label NEs and adjust features to deliver better performance. You will explore how changing the L1 regularization and using all possible transitions changes the learnt transition weights and thus the type of patterns learnt. Finally, you will use a randomized hyperparameter search to find a good set of hyperparameters for your CRF NER model.

# Part 1

# Pre-requisites
You will need python3. The code below will work OK on a CPU only machine. Increasing the number of training files and iterations will significantly improve the quality of NER performance if you are prepared to wait for the longer compute to complete.

# Task 1 - Install pre-requisites and run the baseline NER

Further reading: [Scikit Learn CRF model](https://sklearn-crfsuite.readthedocs.io/en/latest/api.html#module-sklearn_crfsuite)

Further reading: [CRF paper](https://repository.upenn.edu/cis_papers/159/) 

Further reading: Course Text - Speech and Language Processing >> Information Extraction >> Named Entity Recognition 

First install python3 and the pre-requisite libraries needed for this tutorial.

```
python3 -m pip install numpy
python3 -m pip install tensorflow
python3 -m pip install scikit-learn
python3 -m pip install nltk
python3 -m pip install sklearn_crfsuite
python3 -m pip install eli5
python3 -m pip install matplotlib
python3 -m pip install notebook

unzip package for lab
jupyter notebook
==> will open browser windows from localhost:8888
==> load the lab .ipynb file
```

Create a new python code file for your work.

Import the python3 libraries required for this lab.

In [None]:
import sys, codecs, json, math, time, warnings
warnings.simplefilter( action='ignore', category=FutureWarning )

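# scipy.stats and sklearn.model_selection are imported explicitly because Task 5 uses them
import nltk, scipy, scipy.stats, sklearn, sklearn.model_selection, sklearn_crfsuite, sklearn_crfsuite.metrics, eli5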
from sklearn.metrics import make_scorer
from collections import Counter
import matplotlib.pyplot as plt
from IPython.display import display    

import logging
import tensorflow as tf
import absl.logging
formatter = logging.Formatter('[%(levelname)s|%(filename)s:%(lineno)s %(asctime)s] %(message)s')
absl.logging.get_absl_handler().setFormatter(formatter)
absl.logging._warn_preinit_stderr = False
logger = tf.get_logger()
logger.setLevel(logging.INFO)

We will now define the hyperparameters used for the model.

In [None]:
# number of CRF iterations to train for. Using 150 will provide much better results, but take a lot longer to compute.
max_iter = 20

# number of ontonotes training files to load. Using a value of None will load the entire dataset, taking the longest
# to train but providing a much larger sentence corpus to train over and thus is able to learn a larger vocabulary.
max_files = 50

# set of NE label types to display in results. this is simply to limit the amount of logging that is performed later
# when displaying details such as state transitions and top N features per state.
display_label_subset = [ 'B-DATE', 'I-DATE', 'B-GPE', 'I-GPE', 'B-PERSON', 'I-PERSON', 'O' ]

Next we define a function to load a parsed, JSON-formatted file containing the Ontonotes 5.0 dataset. The dataset is split into a training set and a test set, each a list of sentences consisting of lists of (token, POS_tag, NER_IOB_tag) tuples. IOB tagging is a scheme that marks each token as the Beginning of, Inside, or Outside a labelled span.

For example "I like New York in the spring" might be tagged "O O B-LOC I-LOC O O O" for the named entity "New York".
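
The IOB example above can be reproduced with a few lines of Python. This is a minimal sketch only: the helper name `iob_encode` and the `(start, end, type)` span format are assumptions made for illustration, and are not the format used by the parsed Ontonotes file.

```python
def iob_encode(tokens, entities):
    """Return one IOB tag per token.

    entities is a list of (start, end, ne_type) spans with an exclusive end
    index. This span format is hypothetical, chosen for the sketch.
    """
    tags = ['O'] * len(tokens)
    for start, end, ne_type in entities:
        # first token of the entity gets the Begin tag
        tags[start] = 'B-' + ne_type
        # remaining tokens of the entity get the Inside tag
        for i in range(start + 1, end):
            tags[i] = 'I-' + ne_type
    return tags

tokens = 'I like New York in the spring'.split()
tags = iob_encode(tokens, [(2, 4, 'LOC')])
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Every token that is not covered by an entity span keeps the 'O' tag, which is why 'O' is by far the most frequent label in the dataset.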

Ontonotes is an annotated dataset created from various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic). Annotations include structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference). We will only use a parsed version here with the words, POS tags and NER tags.

Further reading: [Ontonotes 5.0 dataset](https://catalog.ldc.upenn.edu/LDC2013T19)

In [None]:
def create_dataset( max_files = None ) :
	dataset_file = '../corpus/ontonotes_parsed.json'
    
	# load parsed ontonotes dataset
	readHandle = codecs.open( dataset_file, 'r', 'utf-8', errors = 'replace' )
	str_json = readHandle.read()
	readHandle.close()
	dict_ontonotes = json.loads( str_json )

	# make a training and test split
	list_files = list( dict_ontonotes.keys() )
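	# max_files = None means load every file, so only truncate when a limit is given
	if max_files is not None and len(list_files) > max_files :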
		list_files = list_files[ :max_files ]
	nSplit = math.floor( len(list_files)*0.9 )
	list_train_files = list_files[ : nSplit ]
	list_test_files = list_files[ nSplit : ]

	# sent = (tokens, pos, IOB_label)
	list_train = []
	for str_file in list_train_files :
		for str_sent_index in dict_ontonotes[str_file] :
			# ignore sents with non-PENN POS tags
			if 'XX' in dict_ontonotes[str_file][str_sent_index]['pos'] :
				continue
			if 'VERB' in dict_ontonotes[str_file][str_sent_index]['pos'] :
				continue

			list_entry = []

			# compute IOB tags for named entities (if any)
			ne_type_last = None
			for nTokenIndex in range(len(dict_ontonotes[str_file][str_sent_index]['tokens'])) :
				strToken = dict_ontonotes[str_file][str_sent_index]['tokens'][nTokenIndex]
				strPOS = dict_ontonotes[str_file][str_sent_index]['pos'][nTokenIndex]
				ne_type = None
				if 'ne' in dict_ontonotes[str_file][str_sent_index] :
					dict_ne = dict_ontonotes[str_file][str_sent_index]['ne']
					if not 'parse_error' in dict_ne :
						for str_NEIndex in dict_ne :
							if nTokenIndex in dict_ne[str_NEIndex]['tokens'] :
								ne_type = dict_ne[str_NEIndex]['type']
								break
				if ne_type != None :
					if ne_type == ne_type_last :
						strIOB = 'I-' + ne_type
					else :
						strIOB = 'B-' + ne_type
				else :
					strIOB = 'O'
				ne_type_last = ne_type

				list_entry.append( ( strToken, strPOS, strIOB ) )

			list_train.append( list_entry )

	list_test = []
	for str_file in list_test_files :
		for str_sent_index in dict_ontonotes[str_file] :
			# ignore sents with non-PENN POS tags
			if 'XX' in dict_ontonotes[str_file][str_sent_index]['pos'] :
				continue
			if 'VERB' in dict_ontonotes[str_file][str_sent_index]['pos'] :
				continue

			list_entry = []

			# compute IOB tags for named entities (if any)
			ne_type_last = None
			for nTokenIndex in range(len(dict_ontonotes[str_file][str_sent_index]['tokens'])) :
				strToken = dict_ontonotes[str_file][str_sent_index]['tokens'][nTokenIndex]
				strPOS = dict_ontonotes[str_file][str_sent_index]['pos'][nTokenIndex]
				ne_type = None
				if 'ne' in dict_ontonotes[str_file][str_sent_index] :
					dict_ne = dict_ontonotes[str_file][str_sent_index]['ne']
					if not 'parse_error' in dict_ne :
						for str_NEIndex in dict_ne :
							if nTokenIndex in dict_ne[str_NEIndex]['tokens'] :
								ne_type = dict_ne[str_NEIndex]['type']
								break
				if ne_type != None :
					if ne_type == ne_type_last :
						strIOB = 'I-' + ne_type
					else :
						strIOB = 'B-' + ne_type
				else :
					strIOB = 'O'
				ne_type_last = ne_type

				list_entry.append( ( strToken, strPOS, strIOB ) )

			list_test.append( list_entry )

	return list_train, list_test

Now we define some helper functions to generate feature sets for each sentence, which the CRF model will train on. The word2features_func() function provided as an argument does all the work, and we will define several versions of it later.

In [None]:
def sent2features(sent, word2features_func = None):
	return [word2features_func(sent, i) for i in range(len(sent))]

def sent2labels(sent):
	return [label for token, postag, label in sent]

def sent2tokens(sent):
	return [token for token, postag, label in sent]

def print_F1_scores( micro_F1 ) :
	for label in micro_F1 :
		logger.info( "%-15s -> f1 %0.2f ; prec %0.2f ; recall %0.2f" % ( label, micro_F1[label]['f1-score'], micro_F1[label]['precision'], micro_F1[label]['recall'] ) )

def print_transitions(trans_features):
	for (label_from, label_to), weight in trans_features:
		logger.info( "%-15s -> %-15s %0.6f" % (label_from, label_to, weight) )

def print_state_features(state_features):
	for (attr, label), weight in state_features:
		logger.info( "%0.6f %-15s %s" % (weight, label, attr) )

Now we can write a function to train the CRF model on the ontonotes corpus, and then run the trained model to compute a macro F1 score on the testset.

First we load the corpus and then use the helper functions to generate lists of features for every token in the dataset. We curate a set of NE labels and remove the 'O' label. This is done because the majority of words are not named entities, and so 'O' tags severely imbalance the dataset. We want a CRF model that has a good F1 score across non-O tags. If we left the 'O' tag in, the F1 score would be dominated by the 'O' tag performance.

Next we train the CRF model and log the weights it has learnt. We then run the trained model on the testset and report the macro F1 score results. We also log information about the state transitions and the positively/negatively weighted features, as this can reveal what has really been learnt by the CRF model.

In [None]:
def exec_task( max_files = 10, max_iter = 20, display_label_subset = [], word2features_func = None, train_crf_model_func = None ) :
	logger.info( 'max iterations = ' + repr(max_iter) )
	logger.info( 'word2features_func = ' + word2features_func.__name__  )
	logger.info( 'train_crf_model_func = ' + train_crf_model_func.__name__  )

	# make a dataset from english NE labelled ontonotes sents
	train_sents, test_sents = create_dataset( max_files = max_files )
	logger.info( '# training sents = ' + str(len(train_sents)) )
	logger.info( '# test sents = ' + str(len(test_sents)) )

	# print example sent (1st sent)
	logger.info( '' )
	logger.info( 'Example training sent annotated with IOB tags  = ' + repr(train_sents[0]) )

	# create feature vectors for every sent
	X_train = [sent2features(s, word2features_func = word2features_func) for s in train_sents]
	Y_train = [sent2labels(s) for s in train_sents]

	X_test = [sent2features(s, word2features_func = word2features_func) for s in test_sents]
	Y_test = [sent2labels(s) for s in test_sents]

	# get the label set
	set_labels = set([])
	for data in [Y_train,Y_test] :
		for n_sent in range(len(data)) :
			for str_label in data[n_sent] :
				set_labels.add( str_label )
	labels = list( set_labels )
	logger.info( '' )
	logger.info( 'labels = ' + repr(labels) )

	# remove 'O' label as we are not usually interested in how well 'O' is predicted
	#labels = list( crf.classes_ )
	labels.remove('O')

	# print example feature vector (11th word of 1st sent)
	logger.info( '' )
	logger.info( 'Example training feature = ' + repr(X_train[0][10]) )

	# Train CRF model
	crf = train_crf_model_func( X_train, Y_train, max_iter, labels )

	logger.info('Label transition weights learnt from dataset (for a subset of labels)')
	display( eli5.show_weights(crf, top=10, targets = display_label_subset, show=['transition_features']) )

	logger.info('Top 20 features per-target (for a subset of labels)')
	display( eli5.show_weights(crf, top=20, targets = display_label_subset, show=['targets']) )

	# compute the macro F1 score (F1 for instances of each label class averaged) in the test set
	Y_pred = crf.predict( X_test )
	sorted_labels = sorted(
		labels, 
		key=lambda name: (name[1:], name[0])
	)
	macro_scores = sklearn_crfsuite.metrics.flat_classification_report( Y_test, Y_pred, labels=sorted_labels, digits=3, output_dict = True )
	logger.info( '' )
	logger.info( 'macro F1 scores'  )
	print_F1_scores( macro_scores )

	# inspect the transitions
	logger.info( '' )
	logger.info("Top 10 likely state transitions")
	print_transitions( Counter(crf.transition_features_).most_common(10) )

	logger.info( '' )
	logger.info("Top 10 unlikely state transitions")
	print_transitions( Counter(crf.transition_features_).most_common()[-10:] )

	# inspect the states
	logger.info( '' )
	logger.info("Top 10 positive states")
	print_state_features(Counter(crf.state_features_).most_common(10))

	logger.info( '' )
	logger.info("Top 10 negative states")
	print_state_features(Counter(crf.state_features_).most_common()[-10:])
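
The reason for dropping 'O' before scoring can be seen in a small pure-Python sketch. The toy labels, the helper names `f1_per_label` and `weighted_f1`, and the tagger that predicts 'O' everywhere are all assumptions made up for this illustration. The lab itself uses sklearn_crfsuite.metrics on flattened label sequences.

```python
def f1_per_label(y_true, y_pred, label):
    # count true positives, false positives and false negatives for one label
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred, labels):
    # average the per-label F1 scores, weighted by how often each label occurs
    support = {label: y_true.count(label) for label in labels}
    total = sum(support.values())
    return sum(f1_per_label(y_true, y_pred, label) * support[label] for label in labels) / total

# flattened toy test set: 18 non-entity tokens and one two-token entity
y_true = ['O'] * 18 + ['B-GPE', 'I-GPE']
# a useless tagger that never predicts an entity
y_pred = ['O'] * 20

with_o = weighted_f1(y_true, y_pred, ['O', 'B-GPE', 'I-GPE'])
without_o = weighted_f1(y_true, y_pred, ['B-GPE', 'I-GPE'])
print('weighted F1 with O    = %0.3f' % with_o)
print('weighted F1 without O = %0.3f' % without_o)
```

With 'O' included, the tagger appears to score well even though it finds no entities at all. With 'O' removed, the score correctly drops to zero.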


Now we will define a basic function to create a set of features for a token position within a sentence. This function uses only the word and POS tag, and will look ahead and behind by one token index position.

In [None]:
def task1_word2features(sent, i):

	word = sent[i][0]
	postag = sent[i][1]

	features = {
		# basic features - token and POS tag
		'word' : word,
		'postag': postag,
	}
	if i > 0:
		# features for previous word (context)
		word_prev = sent[i-1][0]
		postag_prev = sent[i-1][1]
		features.update({
			'-1:word.lower()': word_prev.lower(),
			'-1:postag': postag_prev,
		})
	else:
		features['BOS'] = True

	if i < len(sent)-1:
		# features for next word (context)
		word_next = sent[i+1][0]
		postag_next = sent[i+1][1]
		features.update({
			'+1:word.lower()': word_next.lower(),
			'+1:postag': postag_next,
		})
	else:
		features['EOS'] = True

	return features
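
To make the shape of the feature dictionaries concrete, here is a short sketch run on a toy sentence. The function `window_features` is a hypothetical, trimmed-down stand-in for task1_word2features so that the example is self-contained, and the toy sentence and its tags are invented for illustration.

```python
def window_features(sent, i):
    # hypothetical stand-in for task1_word2features: word, POS tag and a +/-1 token window
    features = {'word': sent[i][0], 'postag': sent[i][1]}
    if i > 0:
        features['-1:word.lower()'] = sent[i-1][0].lower()
        features['-1:postag'] = sent[i-1][1]
    else:
        features['BOS'] = True   # beginning of sentence
    if i < len(sent) - 1:
        features['+1:word.lower()'] = sent[i+1][0].lower()
        features['+1:postag'] = sent[i+1][1]
    else:
        features['EOS'] = True   # end of sentence
    return features

# toy sentence as (token, POS_tag, NER_IOB_tag) tuples
sent = [('I', 'PRP', 'O'), ('like', 'VBP', 'O'), ('New', 'NNP', 'B-GPE'), ('York', 'NNP', 'I-GPE')]

X = [window_features(sent, i) for i in range(len(sent))]   # one dict per token
y = [label for token, postag, label in sent]               # one label per token

for features, label in zip(X, y):
    print(label, features)
```

Each sentence becomes a list of dictionaries (one per token) paired with a list of labels of the same length, which is the input format that the CRF's fit() call receives in this lab.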

Below is the function to train the CRF model using the sklearn_crfsuite toolkit.

In [None]:
def task1_train_crf_model( X_train, Y_train, max_iter, labels ) :
	# train the basic CRF model
	crf = sklearn_crfsuite.CRF(
		algorithm='lbfgs',
		c1=0.1,
		c2=0.1,
		max_iterations=max_iter,
		all_possible_transitions=False,
	)
	crf.fit(X_train, Y_train)
	return crf

** EXERCISE Task 1 **

Run exec_task() below to build the CRF model and look at the baseline F1 scores using just words and POS tags as features. Notice how the top 10 features in the baseline are simple words or POS tags.

** END EXERCISE Task 1 **

In [None]:
exec_task( word2features_func = task1_word2features, train_crf_model_func = task1_train_crf_model, max_files = max_files, max_iter = max_iter, display_label_subset = display_label_subset )

# Task 2 - Add word shape and morpheme features

Further reading: Course Text - Speech and Language Processing >> Information Extraction >> Named Entity Recognition 

It is easy to overfit CRF models if the features provided are too specific to the corpus. Word shapes and morphemes are great ways to provide more generic features, which in turn allows the CRF model to learn patterns containing morphological features beyond the surface form of the sentence words.

Think about how many ways you can write a sentence containing the named entity 'New York'. If you only used the surface form words in each sentence you would need an unbounded training set covering all possible ways to talk about 'New York'. Adding more generic morphological features allows the model to handle unseen surface forms much better.

** EXERCISE Task 2 **

Write your own word2features function that adds extra word shape features (uppercase, title, digits) and morphemes such as a word affix (suffix) and a POS affix (prefix). Build the model and look at how the F1 score improves, and how the top 10 features now include shape and suffix information.

A model answer is below in task2_word2features(). Look at this once you have had a go at your own function!

** END EXERCISE Task 2 **

In [None]:
def task2_word2features(sent, i):

	word = sent[i][0]
	postag = sent[i][1]

	features = {
		'word' : word,
		'postag': postag,

		# token shape
		'word.lower()': word.lower(),
		'word.isupper()': word.isupper(),
		'word.istitle()': word.istitle(),
		'word.isdigit()': word.isdigit(),

		# token suffix
		'word.suffix': word.lower()[-3:],

		# POS prefix
		'postag[:2]': postag[:2],
	}
	if i > 0:
		word_prev = sent[i-1][0]
		postag_prev = sent[i-1][1]
		features.update({
			'-1:word.lower()': word_prev.lower(),
			'-1:postag': postag_prev,
			'-1:word.isupper()': word_prev.isupper(),
			'-1:word.istitle()': word_prev.istitle(),
			'-1:word.isdigit()': word_prev.isdigit(),
			'-1:word.suffix': word_prev.lower()[-3:],
			'-1:postag[:2]': postag_prev[:2],
		})
	else:
		features['BOS'] = True

	if i < len(sent)-1:
		word_next = sent[i+1][0]
		postag_next = sent[i+1][1]
		features.update({
			'+1:word.lower()': word_next.lower(),
			'+1:postag': postag_next,
			'+1:word.isupper()': word_next.isupper(),
			'+1:word.istitle()': word_next.istitle(),
			'+1:word.isdigit()': word_next.isdigit(),
			'+1:word.suffix': word_next.lower()[-3:],
			'+1:postag[:2]': postag_next[:2],
		})
	else:
		features['EOS'] = True

	return features
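
The boolean shape features above (isupper, istitle, isdigit) can be generalised into a single word shape string. The sketch below is one possible variation, not part of the model answer: the helper names `word_shape` and `short_word_shape` are hypothetical, and the mapping (uppercase to 'X', lowercase to 'x', digit to 'd') is an assumption about how such a feature might be built.

```python
import re

def word_shape(word):
    # map every uppercase letter to X, lowercase letter to x and digit to d
    shape = re.sub(r'[A-Z]', 'X', word)
    shape = re.sub(r'[a-z]', 'x', shape)
    shape = re.sub(r'\d', 'd', shape)
    return shape

def short_word_shape(word):
    # collapse runs of the same shape symbol into a single symbol
    return re.sub(r'(.)\1+', r'\1', word_shape(word))

for word in ['York', 'DC10-30', 'I.M.F', '2020']:
    print(word, word_shape(word), short_word_shape(word))
```

A feature such as `'word.shape': short_word_shape(word)` could then be added to the feature dictionary. Whether it improves the F1 score on this dataset would need to be checked by running exec_task().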

In [None]:
exec_task( word2features_func = task2_word2features, train_crf_model_func = task1_train_crf_model, max_files = max_files, max_iter = max_iter, display_label_subset = display_label_subset )

# Task 3 - Change L1 regularization

Further reading: [Article on L1 and L2 regularization](https://explained.ai/regularization/L1vsL2.html) 

Increasing the CRF model's L1 regularization (c1 parameter) will leave only the more generic features. This should remove instance names such as 'Korea' and 'Iraq' from the feature set. With strong L1 regularization the coefficients of most features should be driven to zero, so the learnt patterns rely on POS tags and word shape.

** EXERCISE Task 3 **

Write your own train_crf_model function to build a CRF model with a c1 of 200 and look at the top 10 features being chosen for labels. See how the features are less reliant on particular words and more on word shape or POS tag.

A model answer is below in task3_train_crf_model(). Look at this once you have had a go at your own function!

** END EXERCISE Task 3 **

In [None]:
def task3_train_crf_model( X_train, Y_train, max_iter, labels ) :
	# train CRF model using L1 reg of 200 (high value)
	crf = sklearn_crfsuite.CRF(
		algorithm='lbfgs',
		c1=200,
		c2=0.1,
		max_iterations=max_iter,
		all_possible_transitions=False,
	)
	crf.fit(X_train, Y_train)
	return crf
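
Why L1 regularization removes features outright, while L2 only shrinks them, can be illustrated with a single shrinkage step applied to a handful of weights. This is a conceptual sketch only: the feature names and weight values are invented, the helper names are hypothetical, and this is not how CRFsuite optimizes its objective internally.

```python
def l1_shrink(weight, penalty):
    # minimizer of 0.5*(w - weight)**2 + penalty*abs(w): small weights become exactly zero
    if weight > penalty:
        return weight - penalty
    if weight < -penalty:
        return weight + penalty
    return 0.0

def l2_shrink(weight, penalty):
    # minimizer of 0.5*(w - weight)**2 + penalty*w**2: weights shrink but stay non-zero
    return weight / (1.0 + 2.0 * penalty)

# invented weights: specific words are weak evidence, generic features are strong
weights = {'word=Korea': 0.3, 'word=Iraq': -0.2, 'word.istitle()': 2.5, 'postag=NNP': 1.8}

after_l1 = {name: l1_shrink(w, 0.5) for name, w in weights.items()}
after_l2 = {name: l2_shrink(w, 0.5) for name, w in weights.items()}

surviving_l1 = sorted(name for name, w in after_l1.items() if w != 0.0)
surviving_l2 = sorted(name for name, w in after_l2.items() if w != 0.0)
print('non-zero after L1:', surviving_l1)
print('non-zero after L2:', surviving_l2)
```

In this toy setting the word-specific features are zeroed by the L1 step while the generic shape and POS features survive, which mirrors the behaviour to look for when c1 is raised to 200.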


In [None]:
exec_task( word2features_func = task2_word2features, train_crf_model_func = task3_train_crf_model, max_files = max_files, max_iter = max_iter, display_label_subset = display_label_subset )

# Task 4 - Use all possible transitions

Further reading: [Scikit Learn CRF model](https://sklearn-crfsuite.readthedocs.io/en/latest/api.html#module-sklearn_crfsuite)

Transitions like O -> I-PERSON should have large negative weights because they are impossible. However, these transitions have zero weights, not negative weights, in both the heavily regularized model and the initial model. The reason they are zero is that crfsuite has not seen these transitions in the training data and assumes there is no need to learn weights for them, which saves some computation time.

This is the default behavior. It can be changed using the all_possible_transitions option of sklearn_crfsuite.CRF.
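
The gap between observed and possible transitions is easy to count. The sketch below uses two invented label sequences (not taken from Ontonotes) to list the label pairs that never occur in training, which are the pairs that only receive a weight when all possible transitions are generated.

```python
from itertools import product

# invented toy training label sequences
train_labels = [
    ['O', 'B-PERSON', 'I-PERSON', 'O'],
    ['B-GPE', 'I-GPE', 'O', 'O'],
]

# transitions that actually occur between adjacent tokens
seen = set()
for sent in train_labels:
    seen.update(zip(sent, sent[1:]))

# every ordered pair of labels that could in principle occur
labels = sorted({label for sent in train_labels for label in sent})
unseen = [pair for pair in product(labels, repeat=2) if pair not in seen]

print('%d seen, %d unseen out of %d possible transitions' % (len(seen), len(unseen), len(labels) ** 2))
print('O -> I-PERSON unseen:', ('O', 'I-PERSON') in unseen)
```

Even in this tiny example most label pairs are never observed, including invalid ones such as O -> I-PERSON.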

** EXERCISE Task 4 **

Change your train_crf_model function so it builds a CRF model with all_possible_transitions = True and look at the negative weighting of O -> I-xxx labels. See how these transitions are now explicitly negatively weighted.

A model answer is below in task4_train_crf_model(). Look at this once you have had a go at your own function!

** END EXERCISE Task 4 **

In [None]:
def task4_train_crf_model( X_train, Y_train, max_iter, labels ) :
	# train CRF model using all possible transitions
	crf = sklearn_crfsuite.CRF(
		algorithm='lbfgs',
		c1=0.1,
		c2=0.1,
		max_iterations=max_iter,
		all_possible_transitions=True,
	)
	crf.fit(X_train, Y_train)
	return crf

In [None]:
exec_task( word2features_func = task2_word2features, train_crf_model_func = task4_train_crf_model, max_files = max_files, max_iter = max_iter, display_label_subset = display_label_subset )

# Task 5 - Randomized search for hyperparameter tuning

Further reading: [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

It can be very hard to know the right hyperparameters at design time, and choosing the best ones usually requires some experimentation. Using a grid or randomized search strategy is a good way to automatically explore the hyperparameter space and identify the best hyperparameter settings. If you already have a hypothesis for what parameter ranges might work best, simply constrain the search space to focus on the areas you think should work best.

** EXERCISE Task 5 **

Create a new train_crf_model function to perform a randomized search to find the best hyperparameters (c1, c2) for the crf model. Return the best CRF model.

A model answer is below in task5_train_crf_model() which includes a visual display of the F1 scores in searched parameter space. Look at this once you have had a go at your own function!

** END EXERCISE Task 5 **

In [None]:
def task5_train_crf_model( X_train, Y_train, max_iter, labels ) :
	# randomized search to discover best parameters for CRF model
	crf = sklearn_crfsuite.CRF(
		algorithm='lbfgs', 
		max_iterations=max_iter, 
		all_possible_transitions=True
	)
	params_space = {
		'c1': scipy.stats.expon(scale=0.5),
		'c2': scipy.stats.expon(scale=0.05),
	}

	# optimize for the weighted F1 score (average='weighted' below), computed over the non-O labels
	f1_scorer = make_scorer( sklearn_crfsuite.metrics.flat_f1_score, average='weighted', labels=labels )

	logger.info( 'starting randomized search for hyperparameters' )
	n_folds = 2
	n_candidates = 10
	rs = sklearn.model_selection.RandomizedSearchCV(crf, params_space, cv=n_folds, verbose=1, n_jobs=-1, n_iter=n_candidates, scoring=f1_scorer)
	rs.fit(X_train, Y_train)

	# output the results
	logger.info( 'best params: {}'.format( rs.best_params_ ) )
	logger.info( 'best weighted F1 score: {}'.format( rs.best_score_ ) )
	logger.info( 'model size: {:0.2f}M'.format( rs.best_estimator_.size_ / 1000000 ) )
	logger.info( 'cv_results_ = ' + repr(rs.cv_results_) )

	# visualize the results in hyperparameter space
	_x = [s['c1'] for s in rs.cv_results_['params']]
	_y = [s['c2'] for s in rs.cv_results_['params']]
	_c = [s for s in rs.cv_results_['mean_test_score']]

	fig = plt.figure()
	fig.set_size_inches(12, 12)
	ax = plt.gca()
	ax.set_yscale('log')
	ax.set_xscale('log')
	ax.set_xlabel('C1')
	ax.set_ylabel('C2')
	ax.set_title("Randomized Hyperparameter Search - F1 scores (blue min={:0.2}, red max={:0.2})".format( min(_c), max(_c) ))
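	# use a blue-to-red colormap so the plot matches the title (blue = min F1, red = max F1)
	ax.scatter(_x, _y, c=_c, s=60, alpha=0.9, edgecolors=[0,0,0], cmap='coolwarm')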

	# return the best model
	crf = rs.best_estimator_
	return crf
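
To get a feel for the candidate values a randomized search draws from exponential distributions like those in params_space, the sampling can be imitated with the standard library. This is a sketch under stated assumptions: `sample_candidates` is a hypothetical helper, and `random.expovariate` (which takes a rate, the inverse of the scale) stands in for scipy.stats.expon. It does not reproduce the internals of RandomizedSearchCV.

```python
import random

def sample_candidates(n, c1_scale=0.5, c2_scale=0.05, seed=0):
    # draw n (c1, c2) pairs from exponential distributions with the given means
    rng = random.Random(seed)
    return [
        {'c1': rng.expovariate(1.0 / c1_scale), 'c2': rng.expovariate(1.0 / c2_scale)}
        for _ in range(n)
    ]

candidates = sample_candidates(10)
for params in candidates:
    print('c1 = %0.4f ; c2 = %0.4f' % (params['c1'], params['c2']))
```

An exponential distribution concentrates most draws near zero with occasional larger values, so the search mostly explores weak regularization. Changing the scale arguments is one way to focus the search on a different region of the space.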

In [None]:
exec_task( word2features_func = task2_word2features, train_crf_model_func = task5_train_crf_model, max_files = max_files, max_iter = max_iter, display_label_subset = display_label_subset )