# Assignment 3 Part 2: IE

## Overview

In this assignment, the task is to code a Named Entity Recognizer (NER) application in Python using the CRFsuite library.

It is recommended you complete the Named_Entity_Extraction_Tutorial.ipynb tutorial before attemping this.

Your tasks for this assignment are to:
1. Build a NER classifier following the tutorial.
2. Improve the performance of your NER classifier.
3. Answer three written assignments.

* Write answers in this notebook file, and upload the file to Wattle submission site. **Please rename and submit jupyter notebook file (Assignment5.ipynb) to your_uid.ipynb (e.g. u6000001.ipynb) with your written answers therein**. Do not upload any other files to Wattle except this notebook file.

### <span style="color:blue"> Question 1 (2 points) Build a NER model <a id='Task1'></a> </span>
### Part A (1.5 marks)

* Build a NER model using the train and test data files.
* You can use the code provided in [tutorial sheet](Named_Entity_Extraction_Tutorial.ipynb) 
* Try changing the feature extraction, model hyper parameters, or other settings in order to improve your model performance.
* Marks will be awarded based on how well your model performs.


In [1]:
### YOUR CODE HERE
from __future__ import print_function
from sklearn.metrics import confusion_matrix
import io
import nltk
import scipy
import codecs
import sklearn
import pycrfsuite
import pandas as pd
from itertools import chain
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report

print('sklearn version:', sklearn.__version__)
print('Libraries succesfully loaded!')
def sent2features(sent, feature_func):
    return [feature_func(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [s[-1] for s in sent]

def sent2tokens(sent):
    return [s[0] for s in sent]

def bio_classification_report(y_true, y_pred):
    """
    Classification report for a list of BIO-encoded sequences.
    It computes token-level metrics and discards "O" labels.
    
    Note that it requires scikit-learn 0.15+ (or a version from github master)
    to calculate averages properly!
    """
    lb = LabelBinarizer()
    y_true_combined = lb.fit_transform(y_true)
    y_pred_combined = lb.transform(y_pred)
        
    tagset = set(lb.classes_) - {'O'}
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}
    
    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels = [class_indices[cls] for cls in tagset],
        target_names = tagset,
    )
            
def word2simple_features(sent, i):
    '''
    This makes a simple baseline.  
    You can add and/or remove features to get (much?) better results.
    Experiment with it as you will need to do this for assignment.
    '''
    word = sent[i][0]
    
    features = {
        'bias': 1.0, 
        'word.lower()': word.lower(), 
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
    }
    if i > 0:
        word1 = sent[i-1][0]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
        })
    else:
        features['BOS'] = True
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
        })
    else:
        features['EOS'] = True
    return features

# load data and preprocess
def extract_data(path):
    """
    Extracting data from train file or test file. 
    path - the path of the file to extract
    
    return:
        res - a list of sentences, each sentence is a
              a list of tuples. For train file, each tuple
              contains token and label. For test file, each
              tuple only contains token.
        ids - a list of ids for the corresponding token. This
              is mainly for Kaggle submission.
    """
    file = io.open(path, mode="r", encoding="utf-8")
    next(file)
    res = []
    ids = []
    sent = []
    for line in file:
        if line != '\n':
            # Each line contains the position ID, the token, and (for the training set) the label.
            parts = line.strip().split(' ')
            sent.append(tuple(parts[1:]))
            ids.append(parts[0])
        else:
            res.append(sent)
            sent = []
                
    return res, ids
# Load train and test data
train_data, train_ids = extract_data('train')
test_data, test_ids = extract_data('test')

# Load true labels for test data
test_labels = list(pd.read_csv('test_ground_truth').loc[:, 'label'])

print('Train and Test data loaded succesfully!')

# Feature extraction using the word2simple_features function
train_features = [sent2features(s, feature_func=word2simple_features) for s in train_data]
train_labels = [sent2labels(s) for s in train_data]
test_features = [sent2features(s, feature_func=word2simple_features) for s in test_data]

trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in zip(train_features, train_labels):
    trainer.append(xseq, yseq)
print('Feature Extraction done!')    

# Explore the extracted features    
sent2features(train_data[0], word2simple_features)
trainer.params()
trainer.set_params({
    'c1': 0.01,   # coefficient for L1 penalty
    'c2': 1e-2,  # coefficient for L2 penalty
    'max_iterations': 100,  # stop earlier

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})


sklearn version: 0.21.3
Libraries succesfully loaded!
Train and Test data loaded succesfully!
Feature Extraction done!


In [2]:
%%time
trainer.train('ner-esp.model')

print('Training done :)')
# Make predictions
tagger = pycrfsuite.Tagger()
tagger.open('ner-esp.model')
test_pred = [tagger.tag(xseq) for xseq in test_features]
test_pred = [s for w in test_pred for s in w]

## Print evaluation
print(bio_classification_report(test_pred, test_labels))

Training done :)
              precision    recall  f1-score   support

       B-LOC       0.83      0.82      0.82      2067
       I-LOC       0.75      0.75      0.75       759
      B-MISC       0.61      0.74      0.67       715
      I-MISC       0.60      0.61      0.61      1228
       B-ORG       0.84      0.87      0.86      3090
       I-ORG       0.82      0.81      0.81      2237
       B-PER       0.90      0.92      0.91      1850
       I-PER       0.94      0.94      0.94      1634

   micro avg       0.81      0.83      0.82     13580
   macro avg       0.79      0.81      0.80     13580
weighted avg       0.82      0.83      0.82     13580
 samples avg       0.10      0.10      0.10     13580

CPU times: user 20.5 s, sys: 163 ms, total: 20.7 s
Wall time: 20 s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


The output of the above cell should look something like this (but with different numbers)

                precision    recall  f1-score   support

      B-LOC       0.68      0.47      0.55      1084
      I-LOC       0.52      0.25      0.34       325
     B-MISC       0.54      0.11      0.19       339
     I-MISC       0.54      0.22      0.32       557
      B-ORG       0.76      0.51      0.61      1400
      I-ORG       0.67      0.44      0.53      1104
      B-PER       0.73      0.68      0.71       735
      I-PER       0.78      0.82      0.80       634

avg / total       0.68      0.48      0.55      6178



### Part B (0.5 marks)

Briefly explain what changes to your model you tried and how these changes affected the model's performance.

- I change the implementation for 'word2simple_features' function which considers defines more features based on word identify, suffix and shape.

- I change the coefficient for L1 penalty from 100 to 0.01. Since the L1 adds the 'absolute value of magnitude' of coefficient as penalty term to the loss function. If the coefficient is set to a large value, the model will be under-fit (as output shown: precision is low). Test with 0,0.001,0.01,0.1,0.5. 0.01 gives the best performance.

- I change the coefficient for Le penalty from 'le-3' to 'le-2' which aims to avoid over-fitting. Test with 'le-2','le-3','le-4'. le-2 gives the best performance.

- I change the 'max_iterations' from 50 to 100. The iterations affects how well the model is trained. For larger iterations, the model is trained better for the training data. However, if the iteration is set to a extremely large value, the model could over-fit the training data. So, I only increase the iteration 50 rounds. 100 and 300 have the similiar performance.

- After apply the above changes, the model's overall performance has huge increase. The 'support' also have huge increases.

### <span style="color:blue"> Written Part (3 points) </span>

Answer briefly and concisely the following questions.
Check [this](https://sourceforge.net/p/jupiter/wiki/markdown_syntax/#md_ex_lists) if you are not familiar with markdown syntax.

### Question 2 (0.5 point)
Think of three relevant baselines for the Named Entity Classification task.
Provide answers using bullet list with 3 items. Give a short description of each of them.

- Random Assignment: ramdomly assign a label to an entity. By testing the random assignment performance and your model's performance.
- Simple heuristic: define simple rule for quick building. For example, set a rule that the word with a captical character is a person name. By testing the entity with simple heuristic and your classficiation to test your model's performance.
- Simple machine learning techniques: using a defined machine learning skill like Navie Bayes. By testing the navie bayes' performance and your model's performance.

### Question 3 (1.5 point)
How does Maximal Marginal Relevance (MMR) address redundancy issues? (0.5 point)

How can you tell MMR that "Sydney" and "Melbourne" are cities? (0.5 points)

How can you tell MMR that "solar panels" and "photovoltaic cells" have similar meaning? (0.5 points)

- The MMR address the redundancy issues by using Sim2 to measure the similarity between the two sentences. In here, it measures the similarity between the current sentence and the sentences already selected to be summary. For example, the current sentence s1 is similar to a sentence s2 in selected set R. The value of $Sim_{1}$(s1,s2) is large. By minusing (1-$\lambda$)$Sim_{1}$(s1,s2), the score for sentence s1 is very low. So the sentence s1 will not be selected to be a part of summary which avoids the redundancy.
- To tell MMR 'Sydney' and 'Melbourne' are cities, we could set Q to 'city'. By setting the $\lambda$ to 1, we could get a relevance between 'City' and D. When D is 'Sydney', the relevance between 'City' and 'Sydney' is calculated. When D is 'Melbourne', the relevance between 'City' and 'Melbourne' is calculated. Since these two values are high (Sydney and Melbourne has label 'City'), the MMR knows the 'Sydney' and 'Melbourne' are cities.
- To tell MMR that 'solar panels' and 'photovolatic cells' have similar meaning, we could directly calculate the similarity between these two. We could set $\lambda$ to 1, D to 'solar panels' and Q to 'photovolatic cells'.

### Question 4 (1 point)

Imagine you are developing an extractive text summarization tool using HMM.

What are the hidden states and the observations of the HMM model? (0.5 point)

Which algorithm is used to compute the probability of a particular observation sequence? (0.5 point)

- In this case, the hidden states are 'summary state' and 'non summary state'. The observations are 3 features: Position of sentence in the document, number of terms in the sentence and likelihood of the sentences given the document terms.
- Forward algorithm: $$\alpha_{t}(j)=\sum_{i=1}^N\alpha_{t-1}(i)\alpha_{ij}b{j}b_{j}(O_{t})$$
where $\alpha_{t-1}$ is the previous path probability, $\alpha_{ij}$ is the transition probability from the previous state to the current state, $b_{j}(O_{t})$ is the state observation likelihood of the oberservation $O_{t}$ given the current state j
