**Overview of NLP (L90) Practical Session 2**

Welcome to the second practical accompanying the Overview of NLP (L90) lecture course. Overall, the purpose of the practicals is to build and evaluate an NLP system.

In this session we continue the 3-part practical task.

1.   Explore and annotate a named entity recognition dataset (last time, worth 10%) [link to colab](https://colab.research.google.com/drive/1J_jXBEFfxbDDuI_NcJpZPmubpNQCG6Kf?usp=sharing)
2.   **Attempt feature-based NER (this practical, assignment due 10 November 3pm, worth 10%)**
3.   Attempt NER with neural networks and write a report (practical on 10 November, due 3 December 3pm, worth 80%)

You might find it useful to watch the short video we recorded last year (on the [part II](https://www.cl.cam.ac.uk/teaching/2122/NLP/video/) and [ACS](https://www.cl.cam.ac.uk/teaching/2122/L90/video/) teaching pages). Also there will be an in-person discussion session on 27 October at 3pm in the Intel Lab: please come along to discuss the previous assignment and this new one!

Note that all submissions are made via the [course Moodle page](https://www.vle.cam.ac.uk/course/view.php?id=206751). This Colab notebook is view-only: you can run the code blocks but not edit them. If you do want to edit anything, make a copy of your own via the File menu, though there is no need to.

Ok, let's begin the second practical.

**Recap: Practical 1**

Recall that we're working on the [W-NUT 2017 shared task](https://noisy-text.github.io/2017/emerging-rare-entities.html) on novel, emerging NER.

In your first assignment you annotated 50 tweets for named entities. How did you find it? Tricky or easy, time-consuming or quick? We'll start to send you your agreement scores by email after the deadline (if you haven't received yours one week after the deadline, let us know: apc38). Were you surprised by your level of (dis)agreement with the original annotation and other students? Would you now revise or stand by your annotation decisions?

Hopefully the assignment made you think a little more closely about the annotation side of machine learning & NLP: how subjective it can be, how important it is to get this part of research projects right, and how challenging that is -- especially when multiple annotation layers are required, and when the annotation task is more complex than word token labelling (such as syntactic parsing, or fact checking, for instance). Also, you might consider how the annotation process might contribute to bias in machine learning applications, depending on the task, data, annotator demographics, and so on.

You may also have thought the task quite slow and laborious, but hopefully quite interesting as well. We only looked at 50 texts: bear in mind there are 5.5K in this dataset, so you'll realise why annotation is (a) hard work, (b) expensive -- because usually annotators need to be paid, (c) a little noisy (you probably noticed the non-English tweet, for instance). If you ever need to get some annotation done yourself, you might want to look into ways to distribute the task more widely (aka crowdsourcing), make the task more efficient with semi-supervised or active learning, or indeed explore some unsupervised machine learning methods (though you'll normally still want to annotate a test set).

**Today: Practical 2**

Onto today's task: we'll be constructing 'traditional' feature-based classifiers for named entity recognition (as opposed to neural networks which we'll look at next time).

Your assignment is to write the code for one such classifier, perhaps using the features we'll describe below, but also some of your own. The requirement for a tick is to submit a labelled version of the test file which will allow us to evaluate your model. More about this below.

**Good old-fashioned NLP**

Let's start with some good old-fashioned NLP on these texts (as opposed to modern NLP which often involves neural networks and word embeddings -- more on that next time). As mentioned in practical 1, the texts have already been word tokenized. 

Another common processing task is part-of-speech tagging. You'll have heard in the lectures that performance of PoS taggers for standard English text is very high. This means that we often use pre-trained taggers available in toolkits such as spaCy and Stanford NLP and trust them to work well out-of-the-box. Of course, this is a questionable assumption (for reasons presented in the lecture), but for the purpose of this assignment a pre-trained tagger is fine.

Now, why are we PoS-tagging the text? Well we have an intuition that proper nouns are often named entities, and so we expect that PoS information will be a helpful feature for named entity recognition. Indeed, in the training set we find that a whopping 68% of named entities are proper nouns. It looks like PoS-tags will be a useful feature, even if we only craft a binary proper_noun=True/False feature.

We already pre-processed the training, dev and test files with 'universal' PoS-tags, which are a much simplified set intended for cross-linguistic use as part of the Universal Dependencies project (see the list of all [17 universal tags](https://universaldependencies.org/u/pos/index.html)). Here's a preview:

In [None]:
import pandas as pd
import numpy as np
wnuttrain = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17train_clean_tagged.txt'
train = pd.read_table(wnuttrain, header=None, names=['token', 'label', 'bio_only', 'upos'])
train.head(n=20)

Unnamed: 0,token,label,bio_only,upos
0,@paulwalk,O,O,NOUN
1,It,O,O,PRON
2,'s,O,O,AUX
3,the,O,O,DET
4,view,O,O,NOUN
5,from,O,O,ADP
6,where,O,O,ADV
7,I,O,O,PRON
8,'m,O,O,X
9,living,O,O,NOUN


You'll see that as before we have the word tokens and named entity labels in the first and second columns, as obtained from the original W-NUT 2017 shared task dataset. Now in the final column we've added UPOS tags, and in the third column you'll see that we've created a 'BIO only' value, which is the B, I or O copied from the full named entity label in column 2. We'll use this today for a less strict classification task.
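To make the relationship between the second and third columns concrete, here is a minimal sketch of how a 'BIO only' value can be derived from a full label. The helper name `bio_only_from_label` is hypothetical and the labels are toy examples in the dataset's format; this is not necessarily how the files were actually prepared.

```python
def bio_only_from_label(label):
    """Keep just the B, I or O part of a label such as 'B-location'."""
    # 'B-location' -> 'B', 'I-group' -> 'I', 'O' -> 'O'
    return label.split('-', 1)[0]


# toy labels in the same style as the 'label' column
labels = ['B-location', 'I-location', 'O', 'B-group']
print([bio_only_from_label(lab) for lab in labels])
```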

_Note that this is the starting format of all the files from now on: so however you work with the data for your assignments, make sure it can handle this format as input._

**Baseline classifiers**

Let's load the dev data which we'll work with while developing our baseline classifiers.

In [None]:
wnutdev = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17dev_clean_tagged.txt'
dev = pd.read_table(wnutdev, header=None, names=['token', 'label', 'bio_only', 'upos'])
dev.head(n=10)

Unnamed: 0,token,label,bio_only,upos
0,Stabilized,O,O,PROPN
1,approach,O,O,NOUN
2,or,O,O,CCONJ
3,not,O,O,PART
4,?,O,O,PUNCT
5,That,O,O,PRON
6,´,O,O,SYM
7,s,O,O,PART
8,insane,O,O,ADJ
9,and,O,O,CCONJ


Note the tagging errors in rows 0 and 6: presumably because we haven't lower-cased all words (row 0) and because of the 'forward tick' rather than plain apostrophe (row 6). Re the former problem: what information would you lose if you do lower-case all words? We leave this as something for you to explore if you wish to.

Now, how well can we do with a very naive baseline classifier? Often, your most basic baseline will be a random classifier, or majority-class (always predicting the most common label). Our labels are not evenly distributed (3.1% B, 1.9% I, 95% O), so a random baseline doesn't make sense. And because of the way this task is evaluated, we have to positively make named entity predictions, so a majority-class baseline of labelling everything "O" isn't appropriate (there's more about evaluation below).

Right, so let's use what we found out about proper nouns and try a baseline approach of identifying all proper nouns as named entities. This is what that looks like, first making a copy of dev so we can label it:

In [None]:
dev.dropna(inplace=True)  # drop empty rows between texts
dev = dev.reset_index(drop=True)
dev_copy = dev.copy()  #  make a copy of original data frame
dev_copy['prediction'] = 'O'
in_entity = 0
for i in dev_copy.index:
  if dev_copy['upos'][i]=='PROPN':
    if in_entity==1:  # if a named entity in progress
      dev_copy.loc[i, 'prediction'] = 'I'  # .loc avoids pandas chained-assignment problems
    else:
      dev_copy.loc[i, 'prediction'] = 'B'
      in_entity = 1
  else:
    in_entity = 0

print(dev_copy.head())
print('Total number of rows = %i' % len(dev_copy))
dev_copy['prediction'].value_counts()

        token label bio_only   upos prediction
0  Stabilized     O        O  PROPN          B
1    approach     O        O   NOUN          O
2          or     O        O  CCONJ          O
3         not     O        O   PART          O
4           ?     O        O  PUNCT          O
Total number of rows = 15382


O    14432
B      837
I      113
Name: prediction, dtype: int64

Ok so that gives us 837 named entities, which seems reasonable. Let's see how many named entities were tagged in the dev set:

In [None]:
dev_copy['bio_only'].value_counts()

O    14144
B      826
I      412
Name: bio_only, dtype: int64

How does the 'gold-standard' list of named entities compare to the ones we've naively predicted? This calls for an evaluation function: note that we're using the 'entity' rather than 'surface' F1 score used in the shared task (but without predicting entity type). The latter rewards repeating entities only once, so it's more like a vocabulary test, and it requires an entity type too, whereas we've so far only predicted BIO labels. And evaluation is done on whole named entity strings rather than token labels (hence it's a bit more involved than normal per-token evaluation).

_Refer to the [shared task overview paper](https://aclanthology.org/W17-4418.pdf), §4, for more about the evaluation methods used._
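As a way of seeing what 'evaluating by whole entities' means, here is a minimal sketch that works on plain lists of BIO labels rather than a data frame. The function names `bio_spans` and `entity_prf` are hypothetical, the labels are invented, and the sketch simply ignores any 'I' that does not follow a 'B' or 'I'. It is an illustration of the idea, not the official shared task scorer.

```python
def bio_spans(labels):
    """Return the set of (start, end) index pairs for entities in a BIO sequence.

    An entity starts at a 'B' and runs until the next 'O' or 'B' (end exclusive).
    """
    spans = set()
    start = None
    for i, lab in enumerate(labels):
        if lab == 'B':
            if start is not None:
                spans.add((start, i))  # a new B closes the previous entity
            start = i
        elif lab == 'O':
            if start is not None:
                spans.add((start, i))  # an O closes the current entity
            start = None
        # lab == 'I': the current entity (if any) simply continues
    if start is not None:
        spans.add((start, len(labels)))  # entity runs to the end of the sequence
    return spans


def entity_prf(gold, pred):
    """Precision, recall and F1 over exactly matching entity spans."""
    gold_spans, pred_spans = bio_spans(gold), bio_spans(pred)
    tp = len(gold_spans & pred_spans)
    prec = tp / len(pred_spans) if pred_spans else 0.0
    rec = tp / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1


gold = ['O', 'B', 'I', 'O', 'B', 'O']
pred = ['O', 'B', 'O', 'O', 'B', 'O']  # first entity is cut short, second matches
print(entity_prf(gold, pred))
```

Note that the first predicted entity gets no credit even though its 'B' is in the right place: only the whole span counts. An all-'O' prediction yields no predicted entities at all, which is why it makes a poor baseline under this metric.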

In [None]:
def wnut_evaluate(txt):
  '''entity evaluation: we evaluate by whole named entities'''
  npred = 0; ngold = 0; tp = 0
  nrows = len(txt)
  for i in txt.index:
    if txt['prediction'][i]=='B' and txt['bio_only'][i]=='B':
      npred += 1
      ngold += 1
      for predfindbo in range((i+1),nrows):
        if txt['prediction'][predfindbo]=='O' or txt['prediction'][predfindbo]=='B':
          break  # find index of first O (end of entity) or B (new entity)
      else:
        predfindbo = nrows  # no O or B found: the entity runs to the end of the file
      for goldfindbo in range((i+1),nrows):
        if txt['bio_only'][goldfindbo]=='O' or txt['bio_only'][goldfindbo]=='B':
          break  # find index of first O (end of entity) or B (new entity)
      else:
        goldfindbo = nrows  # no O or B found: the entity runs to the end of the file
      if predfindbo==goldfindbo:  # only count a true positive if the whole entity phrase matches
        tp += 1
    elif txt['prediction'][i]=='B':
      npred += 1
    elif txt['bio_only'][i]=='B':
      ngold += 1
  
  fp = npred - tp  # n false predictions
  fn = ngold - tp  # n missing gold entities
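  prec = tp / (tp+fp) if (tp+fp) > 0 else 0.0  # avoid dividing by zero when nothing is predicted
  rec = tp / (tp+fn) if (tp+fn) > 0 else 0.0
  f1 = (2*(prec*rec)) / (prec+rec) if (prec+rec) > 0 else 0.0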
  print('Sum of TP and FP = %i' % (tp+fp))
  print('Sum of TP and FN = %i' % (tp+fn))
  print('True positives = %i, False positives = %i, False negatives = %i' % (tp, fp, fn))
  print('Precision = %.3f, Recall = %.3f, F1 = %.3f' % (prec, rec, f1))
 
wnut_evaluate(dev_copy)

Sum of TP and FP = 837
Sum of TP and FN = 826
True positives = 371, False positives = 466, False negatives = 455
Precision = 0.443, Recall = 0.449, F1 = 0.446


Ok so we've got precision of .443, recall of .449 and an F1-measure (harmonic mean of precision and recall) of .446 on the dev set. That's actually not too bad given that the leading score (**on the test set**) in the 2017 shared task was .419 (we have checked and this approach _does not_ do so well on the test set, with <.4 entity F1).
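As a quick check on the arithmetic, the three scores can be recomputed from the counts printed above. This is just a worked example of the standard definitions, using the dev-set counts for the proper noun baseline.

```python
tp, fp, fn = 371, 466, 455  # counts printed by the evaluation of the PROPN baseline

precision = tp / (tp + fp)  # 371 / 837
recall = tp / (tp + fn)     # 371 / 826
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# the same F1 written directly in terms of the raw counts
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print('Precision = %.3f, Recall = %.3f, F1 = %.3f' % (precision, recall, f1))
```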

Why are there some false positives, you might wonder? Well recall that we noticed errors in the PoS-tagging, including incorrect PROPN tags. Also we recognise better than ever, after assignment 1, that human annotators can make mistakes or disagree about named entity labels. And finally, not all PROPN are named entities according to the annotation guidelines: we can for example find a name in text which does not unambiguously identify a person (i.e. it's one of their names only, a nickname, or a general comment about all Janes, all Harrys, etc).

We know that the PROPN baseline is a really crude approach, and won't adequately capture the _novel_ and _emerging_ entities which are the target of this shared task. Note also that it wouldn't be an acceptable entry to the shared task because it makes no attempt to properly engage with the full range of entities, instead targeting proper nouns only. The errors and complications even with this simple baseline approach do reinforce that the task is hard. Bear in mind that participating in a shared task eventually requires writing a peer-reviewed paper, and that trivial entries would be rejected at this stage.

This approach doesn't contribute anything new to our shared knowledge: it only makes use of existing technology (a pre-trained PoS-tagger). Besides, we've only made very basic use of the training data so far. So let's try to train a classifier based on that data. We'll augment the training data with new features ready for classifier training: these should be integer or logical values, which are suitable feature types for machine learning.


In [None]:
# reload training file
wnuttrain = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17train_clean_tagged.txt'
train = pd.read_table(wnuttrain, header=None, names=['token', 'label', 'bio_only', 'upos']).dropna()  # drop empty rows

# in order to convert POS tags to integers: get the UPOS tagset
pos_vocab = train.upos.unique().tolist()

# feature 1: convert POS-tags to integers
def pos_index(pos):
  ind = pos_vocab.index(pos)
  return ind

# feature 2: is this a proper noun?
def is_propn(pos):
  resp = False
  if pos=='PROPN':
    resp = True
  return resp

# feature 3: is the first character a capital letter?
def title_case(tok):
  resp = False
  if tok[0:1].isupper():
    resp = True  # thanks Archie Barrett for spotting a typo here!
  return resp

# training labels: convert BIO to integers
def bio_index(bio):
  if bio=='B':
    ind = 0
  elif bio=='I':
    ind = 1
  elif bio=='O':
    ind = 2
  return ind

# pass a data frame through our feature extractor
def extract_features(txt):
  txt.dropna(inplace=True)  # drop empty rows between texts
  txt_copy = txt.reset_index(drop=True)
  posinds = [pos_index(u) for u in txt_copy['upos']]
  txt_copy['pos_indices'] = posinds
  isprop = [is_propn(u) for u in txt_copy['upos']]
  txt_copy['is_propn'] = isprop
  tcase = [title_case(t) for t in txt_copy['token']]
  txt_copy['title_case'] = tcase
  bioints = [bio_index(b) for b in txt_copy['bio_only']]
  txt_copy['bio_only'] = bioints
  return txt_copy

train_copy = extract_features(train)
train_copy.head()

Unnamed: 0,token,label,bio_only,upos,pos_indices,is_propn,title_case
0,@paulwalk,O,2,NOUN,0,False,False
1,It,O,2,PRON,1,False,True
2,'s,O,2,AUX,2,False,False
3,the,O,2,DET,3,False,False
4,view,O,2,NOUN,0,False,False


Ok what we've done here is: (a) converted our PoS-tags to integers ('pos_indices'), (b) produced some logical features based on properties we associate with named entities, namely being a proper noun and having a capital first character ('is_propn', 'title_case'), and (c) converted the 'bio_only' values to integers. These are all suitable feature types for machine learning.
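One thing you may want to think about: an integer code such as 'pos_indices' implies an ordering of the tags (NOUN < PRON < AUX, and so on) which has no linguistic meaning, and a linear model will treat it as a single numeric scale. A common alternative is a one-hot encoding, with one 0/1 feature per tag. Below is a minimal sketch, assuming a hypothetical helper `one_hot_pos` and a toy tagset; it is offered as an option to explore rather than as part of the notebook's method.

```python
def one_hot_pos(pos, vocab):
    """Return a list with a 1 in the position of `pos` and 0 everywhere else.

    Tags missing from `vocab` give an all-zero vector rather than an error.
    """
    return [1 if pos == v else 0 for v in vocab]


pos_vocab = ['NOUN', 'PRON', 'AUX', 'DET']  # toy tagset for illustration
print(one_hot_pos('AUX', pos_vocab))
print(one_hot_pos('INTJ', pos_vocab))  # a tag that is not in the toy tagset
```

If you are working with data frames, `pd.get_dummies` offers a similar encoding for a whole column.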

We need to prepare the development data in the exact same way:

In [None]:
wnutdev = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17dev_clean_tagged.txt'
dev = pd.read_table(wnutdev, header=None, names=['token', 'label', 'bio_only', 'upos']).dropna()

dev_copy = extract_features(dev)
dev_copy.head()

Unnamed: 0,token,label,bio_only,upos,pos_indices,is_propn,title_case
0,Stabilized,O,2,PROPN,9,True,True
1,approach,O,2,NOUN,0,False,False
2,or,O,2,CCONJ,15,False,False
3,not,O,2,PART,14,False,False
4,?,O,2,PUNCT,8,False,False


Now we can drop the columns that aren't features ('token', 'label', 'upos', and 'bio_only' itself) and split our training file into _X_ and _y_: where _X_ are the features, and _y_ are the 'bio_only' labels.

In [None]:
X_train = train_copy.drop(['token', 'label', 'bio_only', 'upos'], axis=1)
y_train = train_copy['bio_only']
print(X_train.head())
y_train.head()

   pos_indices  is_propn  title_case
0            0     False       False
1            1     False        True
2            2     False       False
3            3     False       False
4            0     False       False


0    2
1    2
2    2
3    2
4    2
Name: bio_only, dtype: int64

With _X_ and _y_ ready to go, we can use scikit-learn to load and fit a logistic regression model in 'multinomial' mode (as opposed to 'binomial' when there are only two classes): logistic regression being a relatively simple but reliably good classifier which can be hard to beat.

In [None]:
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

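# on scikit-learn 1.2+ pass penalty=None instead: the string 'none' was deprecated in 1.2 and later removed
logreg = LogisticRegression(random_state=0, multi_class='multinomial', penalty='none', solver='newton-cg').fit(X_train, y_train)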

logreg

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='multinomial', n_jobs=None, penalty='none',
                   random_state=0, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)
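For intuition about what 'multinomial' means here: the model computes one linear score per class for each token, and a softmax turns those scores into probabilities, with the highest-probability class becoming the prediction. The sketch below shows only that last step; the scores are invented for illustration and are not taken from the fitted model.

```python
import math


def softmax(scores):
    """Turn a list of real-valued class scores into probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# made-up scores for a single token, in the label order B=0, I=1, O=2
scores = [1.2, -0.3, 2.5]
probs = softmax(scores)
predicted = probs.index(max(probs))  # index of the most probable class
print(probs, predicted)
```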

We can evaluate this model on the development set:

In [None]:
X_dev = dev_copy.drop(['token', 'label', 'bio_only', 'upos'], axis=1)
y_dev = dev_copy['bio_only']
preds = logreg.predict(X_dev)

(unique, counts) = np.unique(preds, return_counts=True)
print('Predicted label, Count of labels')
print(np.asarray((unique, counts)).T)

Predicted label, Count of labels
[[    2 15382]]


Oh. But here's a problem: our classifier has only predicted outside=2 for all tokens in the dev file! No 'begin' or 'inside' predictions. Hmm, we expect this is because the training classes are so imbalanced. Recall that named entities are relatively rare in our training texts:

In [None]:
train.bio_only.value_counts()

O    59095
B     1964
I     1177
Name: bio_only, dtype: int64

The 'outside' class (label 2) far outnumbers the others, meaning that in training the obvious strategy for the model is to predict that no token belongs to a named entity. One alternative is to randomly down-sample the 'outside' tokens. This means we lose access to context, since the remaining tokens are no longer next to their original neighbours (so don't do this if you plan to use n-gram features), but it might encourage the model to predict all possible classes and not just one.
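Here is the down-sampling idea in miniature, on a toy list of (token, BIO) pairs rather than the real data. The helper `downsample_outside` and its `seed` argument are hypothetical additions for illustration: fixing a seed makes the random sample repeatable from run to run.

```python
import random


def downsample_outside(rows, n_outside, seed=0):
    """Keep every B/I row plus a random sample of n_outside 'O' rows."""
    rng = random.Random(seed)  # fixed seed so the same rows are drawn each time
    entity_rows = [r for r in rows if r[1] != 'O']
    outside_rows = [r for r in rows if r[1] == 'O']
    kept = rng.sample(outside_rows, min(n_outside, len(outside_rows)))
    return entity_rows + kept


# invented example rows: (token, bio_only)
rows = [('Empire', 'B'), ('State', 'I'), ('Building', 'I'),
        ('is', 'O'), ('very', 'O'), ('tall', 'O'), ('indeed', 'O'), ('!', 'O')]
sample = downsample_outside(rows, n_outside=3)
print(sample)
```

In pandas, `DataFrame.sample` accepts a `random_state` argument that serves the same purpose.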

Ok let's start again, and this time take just 3000 'O' tokens at random (approx the sum of 'B' and 'I' tokens in the training data).

In [None]:
# reload
wnuttrain = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17train_clean_tagged.txt'
train = pd.read_table(wnuttrain, header=None, names=['token', 'label', 'bio_only', 'upos']).dropna()

# split into B&I versus O subsets
is_inside = train['bio_only']!='O'
is_outside = train['bio_only']=='O'
bi = train[is_inside]
outside = train[is_outside]
outside = outside.sample(n=3000)  # approx the sum of B and I labels in train

# recombine
train = pd.concat([bi, outside])
print('Down-sampled data:')
train.bio_only.value_counts()

Down-sampled data:


O    3000
B    1964
I    1177
Name: bio_only, dtype: int64

Ok that's our new downsampled training set. We can run feature extraction and do the same for the dev set.

In [None]:
train_copy = extract_features(train)
print('Training file preview:')
print(train_copy.head())

# reload
wnutdev = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17dev_clean_tagged.txt'
dev = pd.read_table(wnutdev, header=None, names=['token', 'label', 'bio_only', 'upos']).dropna()

dev_copy = extract_features(dev)
print('Dev file preview:')
print(dev_copy.head())

Training file preview:
      token       label  bio_only   upos  pos_indices  is_propn  title_case
0    Empire  B-location         0  PROPN            9      True        True
1     State  I-location         1  PROPN            9      True        True
2  Building  I-location         1  PROPN            9      True        True
3       ESB  B-location         0  PROPN            9      True        True
4      AHFA     B-group         0  PROPN            9      True        True
Dev file preview:
        token label  bio_only   upos  pos_indices  is_propn  title_case
0  Stabilized     O         2  PROPN            9      True        True
1    approach     O         2   NOUN            0     False       False
2          or     O         2  CCONJ           15     False       False
3         not     O         2   PART           14     False       False
4           ?     O         2  PUNCT            8     False       False


Then fit the model again:

In [None]:
X_train = train_copy.drop(['token', 'label', 'bio_only', 'upos'], axis=1)
y_train = train_copy['bio_only']
logreg = LogisticRegression(random_state=0, multi_class='multinomial', penalty='none', solver='newton-cg').fit(X_train, y_train)
logreg

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='multinomial', n_jobs=None, penalty='none',
                   random_state=0, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)

And evaluate on dev:

In [None]:
X_dev = dev_copy.drop(['token', 'label', 'bio_only', 'upos'], axis=1)
y_dev = dev_copy['bio_only']
preds = logreg.predict(X_dev)

(unique, counts) = np.unique(preds, return_counts=True)
print('Predicted label, Count of labels')
print(np.asarray((unique, counts)).T)

Predicted label, Count of labels
[[    0  1983]
 [    2 13399]]


This time the model has attempted to predict more than just 'outside' labels. And so we can get a measure of performance, having converted the BIO integers back to character values:

In [None]:
def reverse_bio(ind):
  if ind==0:
    bio = 'B'
  elif ind==1:
    bio = 'I'
  elif ind==2:
    bio = 'O'
  return bio

bio_labs = [reverse_bio(b) for b in dev_copy['bio_only']]
dev_copy['bio_only'] = bio_labs
bio_preds = [reverse_bio(p) for p in preds]
dev_copy['prediction'] = bio_preds
print(dev_copy.head())

print('New evaluation:')
wnut_evaluate(dev_copy)

        token label bio_only  ... is_propn  title_case  prediction
0  Stabilized     O        O  ...     True        True           B
1    approach     O        O  ...    False       False           O
2          or     O        O  ...    False       False           O
3         not     O        O  ...    False       False           O
4           ?     O        O  ...    False       False           O

[5 rows x 8 columns]
New evaluation:
Sum of TP and FP = 1983
Sum of TP and FN = 826
True positives = 371, False positives = 1612, False negatives = 455
Precision = 0.187, Recall = 0.449, F1 = 0.264


And there we are: a proper attempt at classification but this particular approach is worse than our baseline of predicting that all proper nouns are named entities! Clearly there's room for improvement: so far we only made use of 3 fairly basic features.

As well as, or instead of these, we could try making proper use of the word tokens, context, or other properties of named entities you may be aware of or have noticed during annotation. This is where we pass the task over to you!
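To give a flavour of what context features might look like, here is a minimal sketch that builds a feature dictionary for one token from the token itself and its immediate neighbours. Everything in it is an assumption made for illustration: the helper name `token_features`, the feature names, and the toy tweet. It also assumes the tokens have been grouped into one list per text, which means splitting on the empty rows between texts before they are dropped.

```python
def token_features(tokens, tags, i):
    """Features for token i drawn from the token itself and its neighbours."""
    feats = {
        'lower': tokens[i].lower(),
        'upos': tags[i],
        'is_first': i == 0,
        'is_last': i == len(tokens) - 1,
    }
    if i > 0:  # previous token, if there is one
        feats['prev_lower'] = tokens[i - 1].lower()
        feats['prev_upos'] = tags[i - 1]
    if i < len(tokens) - 1:  # next token, if there is one
        feats['next_lower'] = tokens[i + 1].lower()
        feats['next_upos'] = tags[i + 1]
    return feats


# one invented text, already tokenized and tagged
tokens = ['view', 'from', 'Empire', 'State', 'Building']
tags = ['NOUN', 'ADP', 'PROPN', 'PROPN', 'PROPN']
print(token_features(tokens, tags, 2))
```

Dictionaries like this can be converted to a numeric feature matrix with scikit-learn's `DictVectorizer`, which one-hot encodes the string-valued features.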

**Assignment 2**

Your assignment involves writing a Python script which trains a feature-based named entity classifier using the W-NUT 2017 training file available [here](https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17train_clean_tagged.txt). Note that the development file is also available to you [here](https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17dev_clean_tagged.txt). You can use the code in this notebook as your starting point, or start from scratch. Once you're happy with your classifier, you can apply it to predict labels on the test set, available [here](https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17test_clean_tagged.txt). 

To be clear, you'll need to read in the test file with something like the following, replacing the values in the 'prediction' column with those output by your classifier:

In [None]:
wnuttest = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17test_clean_tagged.txt'
testset = pd.read_table(wnuttest, header=None, names=['token', 'upos']).dropna()
testset['prediction'] = 'O'
testset.head()

Unnamed: 0,token,upos,prediction
0,&,CCONJ,O
1,gt,X,O
2,;,PUNCT,O
3,*,PUNCT,O
4,The,DET,O


Once you've done that, you need to write the table as a **tab-delimited file** which you can submit to us via Moodle. **Please do include the column headings in your output file, but not row names / indices**. We can then let you know your precision, recall and F-measure on the test set, which will not affect your tick but you might be interested to know these metrics all the same.
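A minimal sketch of the output step, assuming your predictions are already in the 'prediction' column. The three rows are a toy stand-in for the real test set, and the sketch writes to an in-memory buffer so that the result can be inspected; in practice you would pass a file path (a hypothetical example would be something like 'abc123_test_predictions.txt') in place of the buffer.

```python
import io

import pandas as pd

# toy stand-in for the labelled test set
testset = pd.DataFrame({
    'token': ['&', 'gt', ';'],
    'upos': ['CCONJ', 'X', 'PUNCT'],
    'prediction': ['O', 'O', 'O'],
})

buffer = io.StringIO()
# sep='\t' gives a tab-delimited file; index=False leaves out the row names;
# column headings are written by default
testset.to_csv(buffer, sep='\t', index=False)
print(buffer.getvalue())
```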

You'll need to write more code than is given here, for a tick: it shouldn't just be a copy of the code found in this notebook. You might try: a different PoS-tagger which outputs a more fine-grained PoS-tagset, making use of the word tokens, other character casing patterns, context from n-grams, other sklearn classifiers, and so on. (But bear in mind this is only 10% of your grade, so don't get too carried away: the idea is just to get you thinking about feature extraction and the NER task some more).
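As one example of a 'character casing pattern', here is a minimal sketch of a word shape feature, which abstracts a token into its pattern of upper-case letters, lower-case letters and digits. The helper name `word_shape` and the choice to collapse repeated symbols are assumptions made for illustration; there are many variants of this idea.

```python
def word_shape(tok):
    """Map a token to a compact shape string: X = upper, x = lower, d = digit.

    Other characters are kept as they are, and runs of the same symbol are collapsed.
    """
    shape = []
    for ch in tok:
        if ch.isupper():
            sym = 'X'
        elif ch.islower():
            sym = 'x'
        elif ch.isdigit():
            sym = 'd'
        else:
            sym = ch
        if not shape or shape[-1] != sym:  # collapse runs such as 'xxxx' to 'x'
            shape.append(sym)
    return ''.join(shape)


for tok in ['ESB', 'Empire', 'iPhone7', '@paulwalk']:
    print(tok, word_shape(tok))
```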

Before the deadline in 2 weeks, please submit 2 files **with your CRSid in the filenames** in the appropriate place on [Moodle](https://www.vle.cam.ac.uk/course/view.php?id=206751): (1) your Python script, (2) the test set with your predicted labels in the 'prediction' column (**please check the format of this file: any formatting problems and we won't be able to evaluate your predictions --> no tick!**). Any queries please get in touch (apc38).

In [2]:
import pandas as pd
import numpy as np

# reload training file
wnuttrain = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17train_clean_tagged.txt'
train = pd.read_table(wnuttrain, header=None, names=['token', 'label', 'bio_only', 'upos']).dropna()  # drop empty rows

# in order to convert POS tags to integers: get the UPOS tagset
pos_vocab = train.upos.unique().tolist()

# feature 1: convert POS-tags to integers
def pos_index(pos):
  ind = pos_vocab.index(pos)
  return ind

# feature 2: is this a proper noun?
def is_propn(pos):
  resp = False
  if pos=='PROPN':
    resp = True
  return resp

# feature 3: is this a noun?
def is_noun(pos):
  resp = False
  if pos=='NOUN':
    resp = True
  return resp

# feature 4: is the first character a capital letter?
def title_case(tok):
  resp = False
  if tok[0:1].isupper():
    resp = True 
  return resp

# feature 5: does the word contain a capital letter?
def has_capital_letter(tok):
  resp = False
  for letter in tok:
    if letter.isupper():
        resp = True
  return resp

# training labels: convert BIO to integers
def bio_index(bio):
  if bio=='B':
    ind = 0
  elif bio=='I':
    ind = 1
  elif bio=='O':
    ind = 2
  return ind

# pass a data frame through my new feature extractor
def extract_features(txt):
  txt.dropna(inplace=True)  # drop empty rows between texts
  txt_copy = txt.reset_index(drop=True)
  posinds = [pos_index(u) for u in txt_copy['upos']]
  txt_copy['pos_indices'] = posinds
  isprop = [is_propn(u) for u in txt_copy['upos']]
  txt_copy['is_propn'] = isprop
  isnoun = [is_noun(u) for u in txt_copy['upos']]
  txt_copy['is_noun'] = isnoun
  tcase = [title_case(t) for t in txt_copy['token']]
  txt_copy['title_case'] = tcase
  hascaps = [has_capital_letter(t) for t in txt_copy['token']]
  txt_copy['has_capital_letter'] = hascaps
  bioints = [bio_index(b) for b in txt_copy['bio_only']]
  txt_copy['bio_only'] = bioints
  return txt_copy

train_copy = extract_features(train)
train_copy.head()

Unnamed: 0,token,label,bio_only,upos,pos_indices,is_propn,is_noun,title_case,has_capital_letter
0,@paulwalk,O,2,NOUN,0,False,True,False,False
1,It,O,2,PRON,1,False,False,True,True
2,'s,O,2,AUX,2,False,False,False,False
3,the,O,2,DET,3,False,False,False,False
4,view,O,2,NOUN,0,False,True,False,False


In [3]:
wnutdev = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17dev_clean_tagged.txt'
dev = pd.read_table(wnutdev, header=None, names=['token', 'label', 'bio_only', 'upos']).dropna()

dev_copy = extract_features(dev)
dev_copy.head()

Unnamed: 0,token,label,bio_only,upos,pos_indices,is_propn,is_noun,title_case,has_capital_letter
0,Stabilized,O,2,PROPN,9,True,False,True,True
1,approach,O,2,NOUN,0,False,True,False,False
2,or,O,2,CCONJ,15,False,False,False,False
3,not,O,2,PART,14,False,False,False,False
4,?,O,2,PUNCT,8,False,False,False,False


In [11]:
# reload training file
wnuttrain = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17train_clean_tagged.txt'
train = pd.read_table(wnuttrain, header=None, names=['token', 'label', 'bio_only', 'upos']).dropna()

# split into B&I versus O subsets
is_inside = train['bio_only']!='O'
is_outside = train['bio_only']=='O'
bi = train[is_inside]
outside = train[is_outside]
outside = outside.sample(n=3000)  # approx the sum of B and I labels in train

# recombine
train = pd.concat([bi, outside])
print('Down-sampled data:')
train.bio_only.value_counts()

Down-sampled data:


O    3000
B    1964
I    1177
Name: bio_only, dtype: int64

In [12]:
train_copy = extract_features(train)
print('Training file preview:')
print(train_copy.head())

# reload dev file
wnutdev = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17dev_clean_tagged.txt'
dev = pd.read_table(wnutdev, header=None, names=['token', 'label', 'bio_only', 'upos']).dropna()

dev_copy = extract_features(dev)
print('Dev file preview:')
print(dev_copy.head())

Training file preview:
      token       label  bio_only  ... is_noun  title_case  has_capital_letter
0    Empire  B-location         0  ...   False        True                True
1     State  I-location         1  ...   False        True                True
2  Building  I-location         1  ...   False        True                True
3       ESB  B-location         0  ...   False        True                True
4      AHFA     B-group         0  ...   False        True                True

[5 rows x 9 columns]
Dev file preview:
        token label  bio_only  ... is_noun  title_case  has_capital_letter
0  Stabilized     O         2  ...   False        True                True
1    approach     O         2  ...    True       False               False
2          or     O         2  ...   False       False               False
3         not     O         2  ...   False       False               False
4           ?     O         2  ...   False       False               False

[5 rows x 9 columns]

In [13]:
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

X_train = train_copy.drop(['token', 'label', 'bio_only', 'upos'], axis=1)
y_train = train_copy['bio_only']
logreg = LogisticRegression(random_state=0, multi_class='multinomial', penalty='none', solver='newton-cg').fit(X_train, y_train)

logreg

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='multinomial', n_jobs=None, penalty='none',
                   random_state=0, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)

In [14]:
X_dev = dev_copy.drop(['token', 'label', 'bio_only', 'upos'], axis=1)
y_dev = dev_copy['bio_only']
preds = logreg.predict(X_dev)

(unique, counts) = np.unique(preds, return_counts=True)
print('Predicted label, Count of labels')
print(np.asarray((unique, counts)).T)

Predicted label, Count of labels
[[    0  1330]
 [    2 14052]]


In [15]:
def wnut_evaluate(txt):
  '''entity evaluation: we evaluate by whole named entities'''
  npred = 0; ngold = 0; tp = 0
  nrows = len(txt)
  for i in txt.index:
    if txt['prediction'][i]=='B' and txt['bio_only'][i]=='B':
      npred += 1
      ngold += 1
      for predfindbo in range((i+1),nrows):
        if txt['prediction'][predfindbo]=='O' or txt['prediction'][predfindbo]=='B':
          break  # find index of first O (end of entity) or B (new entity)
      else:
        predfindbo = nrows  # no O or B found: the predicted entity runs to the last row
      for goldfindbo in range((i+1),nrows):
        if txt['bio_only'][goldfindbo]=='O' or txt['bio_only'][goldfindbo]=='B':
          break  # find index of first O (end of entity) or B (new entity)
      else:
        goldfindbo = nrows  # no O or B found: the gold entity runs to the last row
      if predfindbo==goldfindbo:  # only count a true positive if the whole entity phrase matches
        tp += 1
    elif txt['prediction'][i]=='B':
      npred += 1
    elif txt['bio_only'][i]=='B':
      ngold += 1
  
  fp = npred - tp  # n false predictions
  fn = ngold - tp  # n missing gold entities
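  prec = tp / (tp+fp) if (tp+fp) > 0 else 0.0  # avoid dividing by zero when nothing is predicted
  rec = tp / (tp+fn) if (tp+fn) > 0 else 0.0  # avoid dividing by zero when there are no gold entities
  f1 = (2*(prec*rec)) / (prec+rec) if (prec+rec) > 0 else 0.0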
  print('Sum of TP and FP = %i' % (tp+fp))
  print('Sum of TP and FN = %i' % (tp+fn))
  print('True positives = %i, False positives = %i, False negatives = %i' % (tp, fp, fn))
  print('Precision = %.3f, Recall = %.3f, F1 = %.3f' % (prec, rec, f1))
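
The function above counts a true positive only when a predicted entity starts and ends at exactly the same positions as a gold entity. The same idea can be sketched by collecting entity spans and intersecting the two sets. This is an alternative formulation for illustration, not the notebook's implementation: `bio_spans` is a hypothetical helper and the tag sequences are made up.

```python
def bio_spans(tags):
    """Return the set of (start, end) spans, end exclusive, for entities opened by 'B'."""
    spans = set()
    start = None
    for i, tag in enumerate(tags):
        if tag in ('B', 'O') and start is not None:
            spans.add((start, i))  # the entity ends at the first O or the next B
            start = None
        if tag == 'B':
            start = i
    if start is not None:
        spans.add((start, len(tags)))  # the entity runs to the end of the sequence
    return spans

# Hypothetical toy sequences: the first predicted entity is one token too short
gold = ['B', 'I', 'I', 'O', 'B', 'O', 'B', 'I']
pred = ['B', 'I', 'O', 'O', 'B', 'O', 'B', 'I']

gold_spans = bio_spans(gold)
pred_spans = bio_spans(pred)
tp = len(gold_spans & pred_spans)  # only exact span matches count
precision = tp / len(pred_spans) if pred_spans else 0.0
recall = tp / len(gold_spans) if gold_spans else 0.0
print('TP = %i, Precision = %.3f, Recall = %.3f' % (tp, precision, recall))
```

Here the truncated first entity counts as both a false positive and a false negative, so two of the three entities match.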

In [16]:
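def reverse_bio(ind):
  '''map an integer class index back to its BIO tag: 0=B, 1=I, 2=O'''
  if ind==0:
    bio = 'B'
  elif ind==1:
    bio = 'I'
  elif ind==2:
    bio = 'O'
  else:
    raise ValueError('unexpected BIO index: %r' % ind)
  return bio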

bio_labs = [reverse_bio(b) for b in dev_copy['bio_only']]
dev_copy['bio_only'] = bio_labs
bio_preds = [reverse_bio(p) for p in preds]
dev_copy['prediction'] = bio_preds
print(dev_copy.head())

print('New evaluation:')
wnut_evaluate(dev_copy)

        token label bio_only  ... title_case  has_capital_letter  prediction
0  Stabilized     O        O  ...       True                True           B
1    approach     O        O  ...      False               False           O
2          or     O        O  ...      False               False           O
3         not     O        O  ...      False               False           O
4           ?     O        O  ...      False               False           O

[5 rows x 10 columns]
New evaluation:
Sum of TP and FP = 1330
Sum of TP and FN = 826
True positives = 366, False positives = 964, False negatives = 460
Precision = 0.275, Recall = 0.443, F1 = 0.340
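
As a quick check, the scores above can be reproduced from the three printed counts alone. The sketch below just redoes the arithmetic from `wnut_evaluate` with those counts hard-coded.

```python
# Counts copied from the evaluation output above
tp, fp, fn = 366, 964, 460

precision = tp / (tp + fp)  # 366 / 1330
recall = tp / (tp + fn)     # 366 / 826
f1 = 2 * precision * recall / (precision + recall)

print('Precision = %.3f, Recall = %.3f, F1 = %.3f' % (precision, recall, f1))
# prints: Precision = 0.275, Recall = 0.443, F1 = 0.340
```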


In [17]:
# pass the test data frame through my new feature extractor
def extract_test_features(txt):
  # drop empty rows between texts without modifying the caller's data frame
  txt_copy = txt.dropna().reset_index(drop=True)
  posinds = [pos_index(u) for u in txt_copy['upos']]
  txt_copy['pos_indices'] = posinds
  isprop = [is_propn(u) for u in txt_copy['upos']]
  txt_copy['is_propn'] = isprop
  isnoun = [is_noun(u) for u in txt_copy['upos']]
  txt_copy['is_noun'] = isnoun
  tcase = [title_case(t) for t in txt_copy['token']]
  txt_copy['title_case'] = tcase
  hascaps = [has_capital_letter(t) for t in txt_copy['token']]
  txt_copy['has_capital_letter'] = hascaps
  return txt_copy

In [18]:
# load test file
wnuttest = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17test_clean_tagged.txt'
test = pd.read_table(wnuttest, header=None, names=['token', 'upos']).dropna()

test_copy = extract_test_features(test)
print('Test file preview:')
print(test_copy.head())

Test file preview:
  token   upos  pos_indices  is_propn  is_noun  title_case  has_capital_letter
0     &  CCONJ           15     False    False       False               False
1    gt      X            6     False    False       False               False
2     ;  PUNCT            8     False    False       False               False
3     *  PUNCT            8     False    False       False               False
4   The    DET            3     False    False        True                True


In [19]:
X_test = test_copy.drop(['token', 'upos'], axis=1)
preds = logreg.predict(X_test)

(unique, counts) = np.unique(preds, return_counts=True)
print('Predicted label, Count of labels')
print(np.asarray((unique, counts)).T)

Predicted label, Count of labels
[[    0  2355]
 [    2 20968]]


In [20]:
bio_preds = [reverse_bio(p) for p in preds]
test_copy['prediction'] = bio_preds
print(test_copy[30:40])

        token   upos  pos_indices  ...  title_case  has_capital_letter  prediction
30          *  PUNCT            8  ...       False               False           O
31     Police   NOUN            0  ...        True                True           B
32       last    ADJ           11  ...       False               False           O
33       week   NOUN            0  ...       False               False           O
34  evacuated   VERB           13  ...       False               False           O
35         80    NUM            7  ...       False               False           O
36  villagers   NOUN            0  ...       False               False           O
37       from    ADP            4  ...       False               False           O
38  Waltengoo  PROPN            9  ...        True                True           B
39        Nar  PROPN            9  ...        True                True           B

[10 rows x 8 columns]
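
In the preview above, `Waltengoo` and `Nar` are both tagged B, so the evaluation would treat them as two single-token entities rather than one two-token entity. One possible post-processing step, offered purely as a hypothetical heuristic and not as part of the practical's method, is to relabel a B that directly follows another entity token as I. The helper name `merge_adjacent_b` is made up for this sketch.

```python
def merge_adjacent_b(tags):
    """Hypothetical heuristic: relabel a 'B' that directly follows 'B' or 'I' as 'I'."""
    merged = []
    for tag in tags:
        if tag == 'B' and merged and merged[-1] in ('B', 'I'):
            merged.append('I')  # treat as a continuation of the previous entity
        else:
            merged.append(tag)
    return merged

# Predictions for rows 30-39 in the preview above
tags = ['O', 'B', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'B']
print(merge_adjacent_b(tags))
# prints: ['O', 'B', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I']
```

The trade-off is that two genuinely separate entities that happen to be adjacent would be merged, and because the empty rows between texts have been dropped, the heuristic could also join entities across a text boundary.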


In [21]:
test_output = test_copy.drop(['pos_indices', 'is_propn', 'is_noun', 'title_case', 'has_capital_letter'], axis=1)
test_output.to_csv('wnut17test_clean_tagged_prediction_xz398.txt', sep='\t', index=False, header=True)

In [23]:
test_output_check = pd.read_table('wnut17test_clean_tagged_prediction_xz398.txt').dropna()
test_output_check[30:40]

        token   upos prediction
30          *  PUNCT          O
31     Police   NOUN          B
32       last    ADJ          O
33       week   NOUN          O
34  evacuated   VERB          O
35         80    NUM          O
36  villagers   NOUN          O
37       from    ADP          O
38  Waltengoo  PROPN          B
39        Nar  PROPN          B
