## Introduction

In this assignment, you are going to build a classifier for named entities from the Groningen Meaning Bank corpus.  Named entity recognition (NER) takes noun phrases from a text and identifies whether they are persons, organizations, and so on.  You will be using the Groningen Meaning Bank named entity corpus available on mltgpu at `/scratch/lt2222-v21-resources/GMB_dataset.txt`.  In this version of the task, you will assume we know *that* something is a named entity, and instead use multi-class classification to identify its type.  So you will be doing named entity classification but *not* recognition.

The data looks like this: 

```
3996    182.0   Nicole  NNP     B-per
3997    182.0   Ritchie NNP     I-per
3998    182.0   is      VBZ     O
3999    182.0   pregnant        JJ      O
4000    182.0   .       .       O
4001    183.0   Speaking        VBG     O
4002    183.0   to      TO      O
4003    183.0   ABC     NNP     B-org
4004    183.0   News    NNP     I-org
4005    183.0   interviewer     NN      O
4006    183.0   Dianne  NNP     B-per
4007    183.0   Sawyer  NNP     I-per
4008    183.0   ,       ,       O
4009    183.0   the     DT      O
4010    183.0   25-year-old     JJ      O
4011    183.0   co-star NN      O
4012    183.0   of      IN      O
4013    183.0   TV      NN      O
4014    183.0   's      POS     O
4015    183.0   The     DT      B-art
4016    183.0   Simple  NNP     I-art
4017    183.0   Life    NNP     I-art
4018    183.0   said    VBD     O
4019    183.0   she     PRP     O
4020    183.0   is      VBZ     O
4021    183.0   almost  RB      O
4022    183.0   four    CD      O
4023    183.0   months  NNS     O
4024    183.0   along   IN      O
4025    183.0   in      IN      O
4026    183.0   her     PRP$    O
4027    183.0   pregnancy       NN      O
4028    183.0   .       .       O
```

The first column is the line number.  The second column is a sentence number (for some reason given as a float; ignore it).  The third column is the word.  The fourth column is a part of speech (POS) tag in Penn Treebank format.  The last column contains the named entity annotation. 

The annotation works like this.  Every `O` just means that the row does not represent a named entity.  `B-xyz` means the first word in a named entity with type `xyz`.  `I-xyz` means the second and later words of an `xyz` entity, if there are any.  That means that every time there's a `B` or an `I`, there's a named entity.
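
As a rough illustration of the scheme, the sketch below groups tagged rows into entity mentions. The function name `group_entities` and the `(word, tag)` row format are assumptions made for this example, not part of the assignment code, and the sketch does not check that an `I-` tag has the same type as the `B-` tag before it.

```python
def group_entities(rows):
    """Group (word, tag) rows into (entity_type, [words]) mentions using B-/I-/O tags."""
    mentions = []
    current = None  # the mention being built, or None when outside an entity
    for word, tag in rows:
        if tag.startswith("B-"):
            # A B- tag always opens a new mention
            current = (tag[2:], [word])
            mentions.append(current)
        elif tag.startswith("I-") and current is not None:
            # An I- tag continues the mention that is currently open
            current[1].append(word)
        else:
            # An O tag closes any open mention
            current = None
    return mentions


rows = [("Nicole", "B-per"), ("Ritchie", "I-per"), ("is", "O"),
        ("ABC", "B-org"), ("News", "I-org"), ("interviewer", "O")]
print(group_entities(rows))
# [('per', ['Nicole', 'Ritchie']), ('org', ['ABC', 'News'])]
```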

The entity types in the corpus are `art`, `eve`, `geo`, `gpe`, `nat`, `org`, `per`, and `tim`.

Your task is the following.
1. To preprocess the text (lowercase and lemmatize; punctuation can be preserved as it gets its own rows).
2. To create instances from every identified named entity in the text, with the type of the NE as the class and a surrounding context of five words on either side as the features.
3. To generate vectors and split the instances into training and testing datasets at random.
4. To train a support vector machine (via `sklearn.svm.LinearSVC`) for classifying the NEs.
5. To evaluate the performance of the classifier.

You will do this by modifying a separate file containing functions that will be called from this notebook as a module.  You can modify this notebook for testing purposes but please only submit the original.  You will document everything in Markdown in README.md and submit a GitHub repository URL.

This assignment is due on **Tuesday, 2021 March 9 at 23:59**.  It has **25 points** and **7 bonus points**.

In [47]:
import a2
from sklearn.svm import LinearSVC

In [61]:
gmbfile = open('GMB_dataset.txt', "r")

In [62]:
from nltk.stem import WordNetLemmatizer
import pandas as pd
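def preprocesstest(inputfile):
    # Lowercase and lemmatise the word column of the tab-separated GMB file
    lemmatiser=WordNetLemmatizer()
    rows=[]
    # The header line is not skipped, so it ends up as row 0 of the table
    lines=[x.strip().split('\t') for x in inputfile.readlines()]
    for l in lines:
        l[2]= lemmatiser.lemmatize(l[2].lower())  # column 2 holds the word
        rows.append(l)
    cols=['wordN','sentN','word','pos','type']
    return pd.DataFrame(rows, columns=cols)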

preprocesstest(gmbfile)

Unnamed: 0,wordN,sentN,word,pos,type
0,Sentence #,Word,po,Tag,
1,0,1.0,thousand,NNS,O
2,1,1.0,of,IN,O
3,2,1.0,demonstrator,NNS,O
4,3,1.0,have,VBP,O
...,...,...,...,...,...
66157,66156,2999.0,be,VB,O
66158,66157,2999.0,announced,VBN,O
66159,66158,2999.0,within,IN,B-tim
66160,66159,2999.0,day,NNS,O


## Part 1 - preprocessing (3 points)

See step 1 above.  The data is coming to you as an unused file handle object.  You can return the data in any indexable form you like.  You can also choose to remove infrequent or uninformative words to reduce the size of the feature space. (Document this in README.md.)
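
One possible way to shrink the feature space is to replace rare words with a placeholder. This is only a sketch: the name `drop_rare_words`, the threshold `min_count`, and the `<unk>` placeholder are hypothetical choices for illustration.

```python
from collections import Counter


def drop_rare_words(words, min_count=2, placeholder="<unk>"):
    """Replace every word seen fewer than `min_count` times with `placeholder`."""
    counts = Counter(words)
    return [w if counts[w] >= min_count else placeholder for w in words]


words = ["the", "war", "in", "iraq", "the", "war"]
filtered = drop_rare_words(words)
print(filtered)
# ['the', 'war', '<unk>', '<unk>', 'the', 'war']
```

Replacing rather than deleting keeps the rows aligned with their POS and named entity tags.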

In [3]:
inputdata = a2.preprocess(gmbfile)
gmbfile.close()
inputdata[20:40]

Unnamed: 0,sentenceNr,word,POS,NEtag
20,1,from,IN,O
21,1,that,DT,O
22,1,country,NN,O
23,1,.,.,O
24,2,family,NNS,O
25,2,of,IN,O
26,2,soldier,NNS,O
27,2,killed,VBN,O
28,2,in,IN,O
29,2,the,DT,O


## Part 2 - Creating instances (7 points)

Do step 2 above.  You will create a collection of Instance objects.  Remember to consider the case where the NE is at the beginning of a sentence or at the end, or close to either (you can create a special start token for that).  Count the context outward from the NE mention: the words before its `B` token and the words after its last `I` token.  That means that the instances should include things before and after the named entity mention, but not the named entity text itself.
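
A minimal sketch of the context extraction with padding at sentence boundaries follows. The helper `context_window` and the tokens `<s>` and `</s>` are assumptions for illustration; padding is one option among several, and the sample output shown for this part leaves short contexts unpadded.

```python
def context_window(words, start, end, size=5, pad_start="<s>", pad_end="</s>"):
    """Return `size` tokens before words[start] and `size` tokens after words[end - 1].

    `start` and `end` delimit the named entity mention (end is exclusive),
    so the mention itself is left out of the features.
    """
    before = words[max(0, start - size):start]
    after = words[end:end + size]
    # Pad with special tokens when the mention is close to a sentence boundary
    before = [pad_start] * (size - len(before)) + before
    after = after + [pad_end] * (size - len(after))
    return before + after


sentence = ["speaking", "to", "abc", "news", "interviewer", "dianne", "sawyer", ","]
features = context_window(sentence, 2, 4, size=3)  # the mention is "abc news"
print(features)
# ['<s>', 'speaking', 'to', 'interviewer', 'dianne', 'sawyer']
```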

In [4]:
instances = a2.create_instances(inputdata)

In [5]:
instances[20:40]

[Class: gpe Features: ['this', 'week', 'restarted', 'part', 'of'],
 Class: geo Features: ['the', 'conversion', 'process', 'at', 'it', 'nuclear', 'plant', '.'],
 Class: gpe Features: ['official', 'say', 'they', 'expect', 'to'],
 Class: tim Features: ['sensitive', 'part', 'of', 'the', 'plant', ',', 'after', 'an', 'iaea', 'surveillance'],
 Class: org Features: ['plant', 'wednesday', ',', 'after', 'an', 'surveillance', 'system', 'begin', 'functioning', '.'],
 Class: org Features: ['the', 'surveillance', 'system', 'begin', 'functioning', '.'],
 Class: gpe Features: ['the', 'european', 'union', ',', 'with', 'backing', ',', 'ha', 'threatened', 'to'],
 Class: gpe Features: [',', 'ha', 'threatened', 'to', 'refer', 'to', 'the', 'u.n.', 'security', 'council'],
 Class: org Features: ['to', 'refer', 'iran', 'to', 'the', 'to', 'the', 'u.n.', 'security', 'council'],
 Class: gpe Features: ['impose', 'sanction', 'if', 'it', 'find', 'ha', 'violated', 'the', 'nuclear', 'non-proliferation'],
 Class: art F

## Part 3 - Creating the table and splitting (10 points)

Here you're going to write the functions that create a data table with "document" vectors representing each instance and split the table at random into training and testing sets, with an 80%/20% train/test split.
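
The two steps can be sketched with the standard library alone, assuming plain word counts as the vector representation. The names `count_vectors` and `random_split` are hypothetical; the assignment's own `a2.create_table` and `a2.ttsplit` may work quite differently.

```python
import random


def count_vectors(feature_lists):
    """Build one bag-of-words count vector per instance over a shared vocabulary."""
    vocab = sorted({w for feats in feature_lists for w in feats})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for feats in feature_lists:
        vec = [0] * len(vocab)
        for w in feats:
            vec[index[w]] += 1
        vectors.append(vec)
    return vocab, vectors


def random_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and return (train, test) with `test_fraction` of them in test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


vocab, vectors = count_vectors([["the", "war", "in"], ["in", "the", "the"]])
print(vocab, vectors)
# ['in', 'the', 'war'] [[1, 1, 1], [1, 2, 0]]

train, test = random_split(range(100))
print(len(train), len(test))
# 80 20
```

Shuffling before slicing matters here because neighbouring instances come from the same sentences and would otherwise end up in the same partition.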

In [24]:
bigdf = a2.create_table(instances)
bigdf[20:40]

Unnamed: 0,class,0,1,2,3,4,5,6,7,8,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
20,gpe,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
21,geo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,gpe,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,tim,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24,org,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25,org,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26,gpe,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
27,gpe,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
28,org,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
29,gpe,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [62]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

def create_table(instances):
    # Each instance exposes .features (list of context words) and .neclass (entity type)
    docs = [obj.features for obj in instances]
    neclasses = [obj.neclass for obj in instances]

    # Each document is a list of tokens, so the preprocessor joins it into one string
    # before the vectorizer's default tokenizer runs
    tfidf = TfidfVectorizer(preprocessor=' '.join)
    matrix = tfidf.fit_transform(docs)  # Vectors are in compressed sparse format

    df = pd.DataFrame.sparse.from_spmatrix(matrix)
    df.insert(0, 'class', neclasses)  # Put the class label in the leftmost column
    return df

create_table(instances).iloc[[189]]
# Most entries are zero because each instance contains only a handful of the vocabulary's words.

Unnamed: 0,class,0,1,2,3,4,5,6,7,8,...,4744,4745,4746,4747,4748,4749,4750,4751,4752,4753
189,gpe,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
train_X, train_y, test_X, test_y = a2.ttsplit(bigdf)

# X and y mean feature matrix and class respectively.
train_X, train_y, test_X, test_y

(array([[0.70484028, 0.77705168, 0.90626172, ..., 0.42905333, 0.24064187,
         0.13284756],
        [0.04123746, 0.75263423, 0.82603426, ..., 0.19251307, 0.38836701,
         0.79849862],
        [0.97321444, 0.05596415, 0.36369891, ..., 0.90343845, 0.76148659,
         0.83836349],
        ...,
        [0.49099448, 0.80228978, 0.91855497, ..., 0.22532328, 0.71194009,
         0.19050029],
        [0.56310329, 0.72478209, 0.40060363, ..., 0.59340161, 0.47651113,
         0.85589867],
        [0.84887086, 0.70425809, 0.64863831, ..., 0.15375111, 0.52449518,
         0.92194385]]),
 0     tim
 1     nat
 2     per
 3     gpe
 4     nat
      ... 
 75    geo
 76    tim
 77    per
 78    geo
 79    art
 Name: class, Length: 80, dtype: object,
 array([[0.41216972, 0.4682734 , 0.57299347, ..., 0.9179704 , 0.08185916,
         0.00494053],
        [0.5593475 , 0.85286236, 0.87651353, ..., 0.08948916, 0.68671652,
         0.57111886],
        [0.95088606, 0.89561296, 0.50951007, ..., 0.201

In [9]:
len(test_y) / (len(test_y) + len(train_y))

0.2

In [10]:
len(test_X) / (len(test_X) + len(train_X))

0.2

In [11]:
test_y[0]

'per'

## Part 4 - Training the model (0 points)

This part you won't do yourself.

In [12]:
model = LinearSVC()
model.fit(train_X, train_y)
train_predictions = model.predict(train_X)
test_predictions = model.predict(test_X)

In [13]:
train_predictions

array(['tim', 'nat', 'per', 'gpe', 'nat', 'art', 'gpe', 'gpe', 'per',
       'gpe', 'art', 'gpe', 'org', 'gpe', 'nat', 'geo', 'nat', 'geo',
       'nat', 'eve', 'org', 'geo', 'geo', 'org', 'geo', 'art', 'gpe',
       'nat', 'gpe', 'eve', 'gpe', 'gpe', 'org', 'per', 'art', 'geo',
       'eve', 'tim', 'gpe', 'geo', 'per', 'geo', 'per', 'per', 'art',
       'gpe', 'gpe', 'per', 'eve', 'nat', 'tim', 'org', 'art', 'gpe',
       'geo', 'art', 'art', 'geo', 'gpe', 'nat', 'eve', 'eve', 'eve',
       'geo', 'nat', 'org', 'gpe', 'art', 'nat', 'art', 'nat', 'tim',
       'org', 'nat', 'org', 'geo', 'tim', 'per', 'geo', 'art'],
      dtype=object)

In [14]:
train_y

0     tim
1     nat
2     per
3     gpe
4     nat
     ... 
75    geo
76    tim
77    per
78    geo
79    art
Name: class, Length: 80, dtype: object

In [15]:
test_predictions

array(['gpe', 'geo', 'gpe', 'nat', 'per', 'gpe', 'gpe', 'gpe', 'nat',
       'eve', 'org', 'geo', 'gpe', 'nat', 'gpe', 'nat', 'geo', 'gpe',
       'gpe', 'gpe'], dtype=object)

In [16]:
test_y

0     per
1     eve
2     per
3     tim
4     art
5     nat
6     geo
7     org
8     geo
9     org
10    eve
11    per
12    gpe
13    nat
14    eve
15    per
16    gpe
17    per
18    gpe
19    art
Name: class, dtype: object

## Part 5 - Evaluation (5 points)

Investigate for yourself what a "confusion matrix" is.  Then implement a function that takes the data and produces a confusion matrix in any readable form that allows us to compare the performance of the model by class.
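
For orientation, a confusion matrix counts how often each true class was predicted as each class. The sketch below is one possible layout (rows are true classes, columns are predicted classes); the name `confusion_counts` and the toy labels are made up for the example.

```python
from collections import Counter


def confusion_counts(truth, predictions):
    """Return (labels, matrix) where matrix[i][j] counts items of true class
    labels[i] that were predicted as labels[j]."""
    pairs = Counter(zip(truth, predictions))
    labels = sorted(set(truth) | set(predictions))
    return labels, [[pairs[(t, p)] for p in labels] for t in labels]


truth = ["per", "per", "geo", "org"]
predicted = ["per", "geo", "geo", "per"]
labels, matrix = confusion_counts(truth, predicted)
for label, row in zip(labels, matrix):
    print(label, row)
# geo [1, 0, 0]
# org [0, 0, 1]
# per [1, 0, 1]
```

Correct predictions sit on the diagonal, so off-diagonal cells show which classes are mistaken for which.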

In [17]:
a2.confusion_matrix(test_y, test_predictions)

I'm confusing.


"I'm confused."

In [18]:
a2.confusion_matrix(train_y, train_predictions)

I'm confusing.


"I'm confused."

Examine the matrix and describe your observations in README.md.  In particular, what do you notice about the predictions on the training data compared to those on the test data?

## Bonus Part A - Error analysis (2 points)

Look at the weakest-performing classes in the confusion matrix (or any, if they all perform poorly to the same extent).  Find some examples in the test data that the classifier misclassified for those classes.  Why do you think those are hard?  Consider linguistic factors and statistical factors, if applicable.  Write your answer in README.md.

## Bonus Part B - Expanding the feature space (7 points)

Run the entire process above, but incorporate part-of-speech tag information into the feature vectors.  It's your choice as to how to do this, but document it in README.md.  Your new process should run from the single call below:

In [19]:
a2.bonusb('GMB_dataset.txt')
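
One simple way to bring part-of-speech information into the features is to add one extra token per context tag. This is only a sketch of that idea: the helper `with_pos` and the `POS_` prefix are hypothetical, and whatever vectorizer is used afterwards would need to treat each prefixed tag as a single token.

```python
def with_pos(tokens):
    """Turn (word, pos) pairs into the words followed by one 'POS_<tag>' feature per token."""
    words = [word for word, _ in tokens]
    tags = ["POS_" + pos for _, pos in tokens]
    return words + tags


context = [("speaking", "VBG"), ("to", "TO"), ("interviewer", "NN")]
features = with_pos(context)
print(features)
# ['speaking', 'to', 'interviewer', 'POS_VBG', 'POS_TO', 'POS_NN']
```

The prefix keeps tag features from colliding with ordinary words, such as the tag `TO` and the word `to`.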

In [34]:
import spacy
from tqdm import tqdm

# Load the small English spaCy model without the parser and NER components,
# which lemmatization does not need
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

sentences_dict = {}  # sentenceNr -> list of (word, POS, NEclass) tuples
with open('GMB_dataset.txt', "r") as file:
    for line in tqdm(file.readlines()[1:], desc='Adding each word: '):  # [1:] skips the header row
        items = line.strip('\n').split('\t')  # [indexNr, sentenceNr, Word, POS, NEtag]
        items[1] = int(float(items[1]))  # Turn sentenceNr "NNN.0" into an integer
        items[2] = ' '.join(w.lemma_ for w in nlp(items[2])).lower()  # Lemmatise with spaCy, then lowercase
        items[4] = False if items[4] == 'O' else items[4]  # False marks "not part of a named entity"
        sentNr, word, POS, NEclass = items[1:]
        sentences_dict.setdefault(sentNr, []).append((word, POS, NEclass))

sentences_dict

Adding each word: 100%|█████████████████████████████████████████████████████████| 66161/66161 [07:44<00:00, 142.58it/s]


{1: [('thousand', 'NNS', False),
  ('of', 'IN', False),
  ('demonstrator', 'NNS', False),
  ('have', 'VBP', False),
  ('march', 'VBN', False),
  ('through', 'IN', False),
  ('london', 'NNP', 'B-geo'),
  ('to', 'TO', False),
  ('protest', 'VB', False),
  ('the', 'DT', False),
  ('war', 'NN', False),
  ('in', 'IN', False),
  ('iraq', 'NNP', 'B-geo'),
  ('and', 'CC', False),
  ('demand', 'VB', False),
  ('the', 'DT', False),
  ('withdrawal', 'NN', False),
  ('of', 'IN', False),
  ('british', 'JJ', 'B-gpe'),
  ('troop', 'NNS', False),
  ('from', 'IN', False),
  ('that', 'DT', False),
  ('country', 'NN', False),
  ('.', '.', False)],
 2: [('family', 'NNS', False),
  ('of', 'IN', False),
  ('soldier', 'NNS', False),
  ('kill', 'VBN', False),
  ('in', 'IN', False),
  ('the', 'DT', False),
  ('conflict', 'NN', False),
  ('join', 'VBD', False),
  ('the', 'DT', False),
  ('protester', 'NNS', False),
  ('who', 'WP', False),
  ('carry', 'VBD', False),
  ('banner', 'NNS', False),
  ('with', 'IN', F

In [43]:
sentences_dict[189]

[('u.s.', 'NNP', 'B-gpe'),
 ('senator', 'NNP', 'B-per'),
 ('john', 'NNP', 'I-per'),
 ('warner', 'NNP', 'I-per'),
 ('of', 'IN', False),
 ('the', 'DT', False),
 ('southeastern', 'JJ', False),
 ('state', 'NN', False),
 ('of', 'IN', False),
 ('virginia', 'NNP', 'B-geo'),
 (',', ',', False),
 ('a', 'DT', False),
 ('prominent', 'JJ', False),
 ('republican', 'JJ', False),
 ('figure', 'NN', False),
 ('in', 'IN', False),
 ('the', 'DT', False),
 ('debate', 'NN', False),
 ('over', 'IN', False),
 ('the', 'DT', False),
 ('war', 'NN', False),
 ('in', 'IN', False),
 ('iraq', 'NNP', 'B-geo'),
 (',', ',', False),
 ('say', 'VBZ', False),
 ('he', 'PRP', False),
 ('will', 'MD', False),
 ('retire', 'VB', False),
 ('after', 'IN', False),
 ('finish', 'VBG', False),
 ('his', 'PRP$', False),
 ('term', 'NN', False),
 ('in', 'IN', False),
 ('2009', 'CD', 'B-tim'),
 ('.', '.', False)]

In [45]:
NEs_list = []
for nr, sent in sentences_dict.items():
    for i, (word, POS, NEclass) in enumerate(sent):
        # Only a B- tag starts a new named entity mention
        if not (NEclass and NEclass[0] == 'B'):
            continue

        # Collect the following I- tokens that continue this mention
        words = [word]
        end = i + 1
        while end < len(sent) and sent[end][2] and sent[end][2][0] == 'I':
            words.append(sent[end][0])
            end += 1

        # Up to five words of context on each side, cut off at the sentence boundaries
        prev5 = [w for w, _, _ in sent[max(0, i - 5):i]]
        next5 = [w for w, _, _ in sent[end:end + 5]]

        NE = ' '.join(words)
        NEs_list.append((NE, NEclass[2:], prev5, next5))

NEs_list


[('london',
  'geo',
  ['of', 'demonstrator', 'have', 'march', 'through'],
  ['to', 'protest', 'the', 'war', 'in']),
 ('iraq',
  'geo',
  ['to', 'protest', 'the', 'war', 'in'],
  ['and', 'demand', 'the', 'withdrawal', 'of']),
 ('british',
  'gpe',
  ['and', 'demand', 'the', 'withdrawal', 'of'],
  ['troop', 'from', 'that', 'country', '.']),
 ('bush',
  'per',
  ['with', 'such', 'slogan', 'as', '" " " "'],
  ['number', 'one', 'terrorist', '" " " "', 'and']),
 ('hyde park', 'geo', ['parliament', 'to', 'a', 'rally', 'in'], ['.']),
 ('britain',
  'geo',
  ['of', 'the', 'annual', 'conference', 'of'],
  ['be', 'rule', 'labor', 'party', 'in']),
 ('labor party',
  'org',
  ['conference', 'of', 'britain', 'be', 'rule'],
  ['in', 'the', 'southern', 'english', 'seaside']),
 ('english',
  'gpe',
  ['labor', 'party', 'in', 'the', 'southern'],
  ['seaside', 'resort', 'of', 'brighton', '.']),
 ('brighton',
  'geo',
  ['southern', 'english', 'seaside', 'resort', 'of'],
  ['.']),
 ('britain',
  'gpe',
 

In [33]:
len(NEs_list)

6922