In [1]:
# write the list of necessary packages here:
!pip install pandas
!pip install nltk
!pip install spacy
!pip install scikit-learn
!pip install sklearn-crfsuite

Collecting sklearn-crfsuite
  Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl.metadata (4.9 kB)
Collecting python-crfsuite>=0.9.7 (from sklearn-crfsuite)
  Downloading python_crfsuite-0.9.11-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.3 kB)
Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl (10 kB)
Downloading python_crfsuite-0.9.11-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: python-crfsuite, sklearn-crfsuite
Successfully installed python-crfsuite-0.9.11 sklearn-crfsuite-0.5.0


## Training a model on Named Entity Recognition task

Token classification refers to the task of classifying individual tokens in a sentence. One of the most common token
classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence,
such as a person, location, or organization. In this assignment, you will learn how to train a model on the [CoNLL 2003 NER dataset](https://www.clips.uantwerpen.be/conll2003/ner/) to detect new entities.

### Loading the dataset

In [2]:
# import your packages here:
import pandas as pd
import nltk


In [4]:
import os

# Change directory to Google Drive (My Drive)
os.chdir('/content/drive/MyDrive')



In [5]:
train_df = pd.read_csv("ner_data/train.txt", header=0, sep=" ")
val_df = pd.read_csv("ner_data/val.txt", header=0, sep=" ")
test_df = pd.read_csv("ner_data/test.txt", header=0, sep=" ")

print(f"{train_df.shape}, {val_df.shape}, {test_df.shape}")

(204566, 4), (51577, 4), (46665, 4)


The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word is on a separate line, and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag, and the fourth the named entity tag. In the original release, the chunk tags and the named entity tags have the format I-TYPE, which means that the word is inside a phrase of type TYPE; only when two phrases of the same type immediately follow each other does the first word of the second phrase get the tag B-TYPE to show that it starts a new phrase. The files used in this notebook instead mark the first word of every phrase with B-TYPE and the remaining words with I-TYPE, as the sample shows (for example, `EU` is tagged B-ORG). A word with tag O is not part of a phrase. Here is an example:
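The I-TYPE/B-TYPE convention described above is often called IOB1, while the sample rows in this notebook start every phrase with B-TYPE (IOB2). The following is a minimal sketch of converting one into the other; the function name `iob1_to_iob2` is an assumption introduced for illustration and is not part of the assignment code.

```python
def iob1_to_iob2(tags):
    """Rewrite a tag sequence so that every phrase starts with B-TYPE."""
    converted = []
    prev_type = None  # entity type of the previous token, None after 'O'
    for tag in tags:
        if tag == "O":
            converted.append(tag)
            prev_type = None
            continue
        prefix, ent_type = tag.split("-", 1)
        # An I- tag that does not continue the previous phrase opens a new one.
        if prefix == "I" and prev_type != ent_type:
            tag = "B-" + ent_type
        converted.append(tag)
        prev_type = ent_type
    return converted


iob1 = ["I-ORG", "O", "I-MISC", "I-MISC", "B-MISC", "O"]
print(iob1_to_iob2(iob1))
```

Sequences that already follow IOB2 pass through unchanged, so the helper is safe to apply to either variant.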

In [17]:
train_df.head(10)


Unnamed: 0,-DOCSTART-,-X-,-X-.1,O
0,EU,NNP,B-NP,B-ORG
1,rejects,VBZ,B-VP,O
2,German,JJ,B-NP,B-MISC
3,call,NN,I-NP,O
4,to,TO,B-VP,O
5,boycott,VB,I-VP,O
6,British,JJ,B-NP,B-MISC
7,lamb,NN,I-NP,O
8,.,.,O,O
9,Peter,NNP,B-NP,B-PER


In [6]:
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

labels_vocab = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
labels_vocab_reverse = {v:k for k,v in labels_vocab.items()}

### Feature Extraction

You need to extract features for each token. The features can be:
• Basic features: Token itself, token lowercase, prefix/suffix of the token.
• Context features: Neighboring tokens (previous/next token).
• Linguistic features: Part-of-speech (POS) tags or word shapes (capitalization, digits, etc.).
Note that you are expected to briefly mention which features you employ for training your model.
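To make the "word shape" feature concrete, here is a small sketch that maps uppercase letters to `X`, lowercase letters to `x`, and digits to `d`, and truncates runs of the same symbol after four characters. The function `word_shape` is a hypothetical stand-in written for illustration; it approximates the kind of shape string a library such as spaCy produces and may differ from it in edge cases.

```python
def word_shape(token, max_run=4):
    """Return a coarse shape string such as 'Xxxx' for 'John'."""
    shape = []
    last = None  # previously emitted shape symbol
    run = 0      # length of the current run of identical symbols
    for ch in token:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation is kept as-is
        run = run + 1 if mapped == last else 1
        last = mapped
        if run <= max_run:
            shape.append(mapped)
    return "".join(shape)


for word in ["John", "EU", "Germany", "1996-08-22"]:
    print(word, word_shape(word))
```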

In [7]:
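def separate_sentences(file_path):
    """Read a CoNLL-style file and return a list of sentences.

    Each sentence is a list of [word, POS, chunk, NER] rows; a blank line
    ends the current sentence. -DOCSTART- lines are kept as one-token sentences.
    """
    with open(file_path, 'r') as file:
        lines = file.readlines()

    sentences = []
    current_sentence = []

    for line in lines:
        line = line.strip()
        if line:
            current_sentence.append(line.split(" "))
        elif current_sentence:
            sentences.append(current_sentence)
            current_sentence = []

    # Flush the last sentence if the file does not end with a blank line
    if current_sentence:
        sentences.append(current_sentence)

    return sentences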

train_file_path = "ner_data/train.txt"
train_sentences = separate_sentences(train_file_path)

test_file_path = "ner_data/test.txt"
test_sentences = separate_sentences(test_file_path)
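The CoNLL files mark document boundaries with `-DOCSTART-` lines, and the parser above returns each of them as a one-token sentence labelled `O`. The sketch below shows one way to filter them out before feature extraction; `drop_docstart` is a hypothetical helper, and the numbers reported later in this notebook were produced without it.

```python
def drop_docstart(sentences, marker="-DOCSTART-"):
    """Remove one-token sentences that only contain the document marker."""
    return [
        sentence
        for sentence in sentences
        if not (len(sentence) == 1 and sentence[0][0] == marker)
    ]


sample = [
    [["-DOCSTART-", "-X-", "-X-", "O"]],
    [["EU", "NNP", "B-NP", "B-ORG"], ["rejects", "VBZ", "B-VP", "O"]],
]
print(len(drop_docstart(sample)))
```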


Word features are explained in the reporting part.

In [10]:
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.max_length = 1104033

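def extract_features_from_sentence(sentence):
    """Return (features, labels) for one sentence of [word, POS, chunk, NER] rows."""
    features = []
    labels = []
    prev_token = None

    for i in range(len(sentence)):

        token_text = sentence[i][0]
        # spaCy sees each word in isolation here, so shape_ and pos_ are
        # computed without sentence context; only the first spaCy token is used.
        token = nlp(token_text)[0]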

        word_features = {
            'word': token.text,
            'lower': token.text.lower(),
            'prefix': token.text[:3],  # Prefix: first 3 characters
            'suffix': token.text[-3:],  # Suffix: last 3 characters
            'shape': token.shape_,
            'is_digit': token.text.isdigit(),
            'POS': token.pos_,
        }

        # Previous token features
        if prev_token is None:
            word_features.update({
                '-1:word': "",
            })
        else:
            word_features.update({
                '-1:word': prev_token.text,
            })

        # Next token features
        if i + 1 < len(sentence):
            next_token_text = sentence[i + 1][0]
            next_token = nlp(next_token_text)[0]
            word_features.update({
                '+1:word': next_token.text,
            })
        else:
            word_features.update({
                '+1:word': "",
            })

        features.append(word_features)
        labels.append(sentence[i][3])

        prev_token = token

    return features, labels
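As an alternative sketch, the POS tag can be read from the second column of the data file instead of calling spaCy once per token, which avoids the per-token model call. This is not the extractor used for the results in this notebook: the names `token_features` and `sentence_to_features` are assumptions, and the POS values differ (Penn Treebank tags such as `NNP` from the file versus spaCy's coarse `pos_` tags).

```python
def token_features(sentence, i):
    """Build a feature dict for token i of a [word, POS, chunk, NER] sentence."""
    word, pos = sentence[i][0], sentence[i][1]
    return {
        "word": word,
        "lower": word.lower(),
        "prefix": word[:3],
        "suffix": word[-3:],
        "is_digit": word.isdigit(),
        "POS": pos,  # tag taken from column 2 of the data file
        "-1:word": sentence[i - 1][0] if i > 0 else "",
        "+1:word": sentence[i + 1][0] if i + 1 < len(sentence) else "",
    }


def sentence_to_features(sentence):
    """Return (features, labels) in the same layout as the extractor above."""
    features = [token_features(sentence, i) for i in range(len(sentence))]
    labels = [row[3] for row in sentence]
    return features, labels


example = [["EU", "NNP", "B-NP", "B-ORG"], ["rejects", "VBZ", "B-VP", "O"]]
feats, labels = sentence_to_features(example)
print(feats[0], labels)
```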


In [11]:
X = []
y = []
for sentence in train_sentences:
    train_features,train_labels = extract_features_from_sentence(sentence)
    X.append(train_features)
    y.append(train_labels)

### Train a NER Classifier Model

Implement one of the following classifiers for recognizing multiple entity types (e.g., person, organization, location): Conditional Random Field (CRF), biLSTM or multinomial logistic regression. Select only one and provide a brief explanation for your choice of model.

In [12]:
# write your code here:
from sklearn_crfsuite import CRF

crf = CRF(
    algorithm='lbfgs',            # Optimization algorithm
    c1=0.1,                       # Coefficient for L1 regularization
    c2=0.1,                       # Coefficient for L2 regularization
    max_iterations=100,           # Maximum number of iterations
    all_possible_transitions=True # Allow transitions between all labels
)

# Skip the first training sentence, which is the -DOCSTART- header line
crf.fit(X[1:], y[1:])

print("CRF model training complete!")

CRF model training complete!


### Evaluation

Evaluate the model on the test set using metrics such as precision, recall, and F1-score.

In [13]:
# write your code here:
test_X = []
test_y = []
for sentence in test_sentences:
    test_features, test_labels = extract_features_from_sentence(sentence)
    test_X.append(test_features)
    test_y.append(test_labels)

In [14]:
test_y_pred = crf.predict(test_X)

In [15]:
from sklearn_crfsuite import metrics

classification_report = metrics.flat_classification_report(
    test_y, test_y_pred, digits=3
)
print("Classification Report:")
print(classification_report)

f1_score = metrics.flat_f1_score(test_y, test_y_pred, average='weighted')
precision = metrics.flat_precision_score(test_y, test_y_pred, average='weighted')
recall = metrics.flat_recall_score(test_y, test_y_pred, average='weighted')

print(f"Weighted Precision: {precision:.3f}")
print(f"Weighted Recall: {recall:.3f}")
print(f"Weighted F1-Score: {f1_score:.3f}")

Classification Report:
              precision    recall  f1-score   support

       B-LOC      0.857     0.883     0.870      1668
      B-MISC      0.833     0.775     0.803       702
       B-ORG      0.808     0.710     0.756      1661
       B-PER      0.831     0.861     0.846      1617
       I-LOC      0.769     0.763     0.766       257
      I-MISC      0.615     0.681     0.646       216
       I-ORG      0.689     0.738     0.713       835
       I-PER      0.872     0.964     0.915      1156
           O      0.990     0.988     0.989     38554

    accuracy                          0.959     46666
   macro avg      0.807     0.818     0.811     46666
weighted avg      0.959     0.959     0.959     46666

Weighted Precision: 0.959
Weighted Recall: 0.959
Weighted F1-Score: 0.959
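The report above scores individual tokens. The official CoNLL-2003 evaluation instead works at the entity level, where a prediction counts only if both the span boundaries and the entity type match. Below is a minimal sketch of that idea, assuming BIO tags as in `label_list`; the helper names `spans` and `entity_prf` are hypothetical and this is not the official scoring script.

```python
def spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO sequence."""
    found = set()
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        prefix, _, cur = tag.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and cur != ent_type)
        if ent_type is not None and (tag == "O" or starts_new):
            found.add((ent_type, start, i))
            ent_type = None
        if starts_new:
            start, ent_type = i, cur
    return found


def entity_prf(y_true, y_pred):
    """Micro-averaged entity-level precision, recall and F1 over sentences."""
    tp = n_pred = n_true = 0
    for gold, pred in zip(y_true, y_pred):
        gold_spans, pred_spans = spans(gold), spans(pred)
        tp += len(gold_spans & pred_spans)
        n_pred += len(pred_spans)
        n_true += len(gold_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "O", "O", "B-LOC"]]
print(entity_prf(gold, pred))
```

In the toy example three of four tokens are correct, yet only one of the two entities is recovered exactly, which is why entity-level scores are usually lower than token-level ones. With the notebook's variables the call would be `entity_prf(test_y, test_y_pred)`.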


### Reporting

Summarize your findings and suggest potential improvements for future iterations of the NER system. Additionally, discuss whether your model encountered class imbalance issues and how you addressed them. Write your suggestions to the given markdown cells.

INTRODUCTION

Named entity recognition (NER) identifies predefined categories of objects in a body of text. This report outlines the development and evaluation of an NER model using the CoNLL 2003 NER Shared Task's English dataset. The objective is to accurately recognize multiple entity types by extracting relevant features and implementing a suitable classification model.

FEATURE EXTRACTION

Effective feature extraction is crucial for the performance of our NER model. The features used in this task are a combination of basic, contextual, and linguistic characteristics of each token in the text.

Basic Features

- Token Itself: The original word as it appears in the text.
- Lowercase Token: The token converted to lowercase to normalize case variations.
- Prefix: The first three characters of the token, capturing common prefixes.
- Suffix: The last three characters of the token, capturing common suffixes.
- Shape: The word shape, which encodes patterns of capitalization, numerals, and punctuation (e.g., 'Xxxx' for 'John').

Contextual Features

- Previous Token: The word immediately preceding the current token.
- Next Token: The word immediately following the current token.

Linguistic Features

- Part-of-Speech (POS) Tag: The grammatical role of the token (e.g., noun, verb), providing syntactic context.
- Is Digit: A boolean indicating whether the token consists of digits.

MODEL IMPLEMENTATION

For this task, a Conditional Random Field (CRF) model was implemented for named entity recognition.

CRF was chosen for its effectiveness in sequence labeling tasks like NER. Unlike models that make independent predictions for each token, CRF considers the context of the entire sequence, modeling the conditional probability of the label sequence given the input sequence. This allows the model to capture dependencies between neighboring labels, which is essential for correctly identifying multi-token entities and ensuring consistent labeling.
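For reference, a linear-chain CRF is commonly written in the following form. The notation is an assumption introduced here: $f_k$ are feature functions over adjacent labels and the input, $\lambda_k$ are their learned weights, $T$ is the sentence length, and $Z(\mathbf{x})$ normalizes over all possible label sequences.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Because each term couples $y_{t-1}$ with $y_t$, an implausible transition such as O followed by I-PER can receive a low weight, which is how the model discourages inconsistent label sequences.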

The model was trained using the extracted features from the training set. For each sentence, the CRF predicts the most likely sequence of entity labels given the input features of all its tokens.

The model's performance was evaluated on the test set using precision, recall, and F1-score metrics for each entity class. The classification report can be found in the previous part.

RESULTS

The model achieves high precision (0.990) and recall (0.988) for the 'O' class (non-entity tokens), which is expected given its large representation in the dataset.
Entity classes such as 'B-LOC' and 'B-PER' have high F1-scores (0.870 and 0.846), indicating good performance in recognizing locations and persons.
Lower performance is observed for 'I-MISC' (F1 0.646) and 'I-ORG' (F1 0.713), suggesting difficulty in correctly identifying multi-token miscellaneous entities and organizations.

The model reaches an accuracy of 95.9% and a weighted F1-score of 0.959, but both figures are dominated by the 'O' class. The macro-averaged F1-score of 0.811 gives a more balanced picture of per-class performance.
The combination of basic, contextual, and linguistic features contributes to the model's ability to distinguish between different entity types.

Class Imbalance: A significant imbalance exists between the 'O' class and the entity classes: 'O' accounts for 38,554 of the 46,666 test tokens (about 83%). The dominance of the 'O' class can bias the model toward predicting non-entity labels, potentially reducing the recall for actual entities.
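One way to see the effect of the imbalance is to average the per-class scores without the 'O' class. The sketch below recomputes macro and support-weighted F1 over the eight entity labels from the figures in the classification report; because the inputs are rounded to three decimals, the results are approximate. The helper names are assumptions made for this illustration.

```python
# (F1, support) per label, copied from the classification report above.
report = {
    "B-LOC": (0.870, 1668),
    "B-MISC": (0.803, 702),
    "B-ORG": (0.756, 1661),
    "B-PER": (0.846, 1617),
    "I-LOC": (0.766, 257),
    "I-MISC": (0.646, 216),
    "I-ORG": (0.713, 835),
    "I-PER": (0.915, 1156),
    "O": (0.989, 38554),
}


def macro_f1(rows):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1 for f1, _ in rows.values()) / len(rows)


def weighted_f1(rows):
    """Mean of the per-label F1 scores weighted by support."""
    total = sum(support for _, support in rows.values())
    return sum(f1 * support for f1, support in rows.values()) / total


entities_only = {label: row for label, row in report.items() if label != "O"}
print(f"macro F1 without 'O':    {macro_f1(entities_only):.3f}")
print(f"weighted F1 without 'O': {weighted_f1(entities_only):.3f}")
```

Both entity-only averages come out well below the 0.959 weighted score, which shows how much of the headline figure is carried by the 'O' class.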

The model was trained on the data as-is. To mitigate the problem, the transition weights between labels in the CRF could be adjusted to account for class imbalance. Training on more balanced subsets of the data is another option, although it is challenging because of sequence dependencies.
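A simple way to build a more balanced training subset without breaking sequence structure is to keep every sentence that contains at least one entity and only a fraction of the sentences that are entirely 'O'. The following is a sketch of that idea and was not used for the results above; `downsample_o_only`, `keep_fraction`, and `seed` are hypothetical names.

```python
import random


def downsample_o_only(X, y, keep_fraction=0.5, seed=0):
    """Keep all sentences with entities and a random share of all-'O' sentences."""
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    X_out, y_out = [], []
    for feats, labels in zip(X, y):
        has_entity = any(label != "O" for label in labels)
        if has_entity or rng.random() < keep_fraction:
            X_out.append(feats)
            y_out.append(labels)
    return X_out, y_out


# Toy data: feature dicts are replaced by plain words for brevity.
X_toy = [["EU", "rejects"], ["it", "rains"], ["Peter", "left"], ["so", "what"]]
y_toy = [["B-ORG", "O"], ["O", "O"], ["B-PER", "O"], ["O", "O"]]
X_sub, y_sub = downsample_o_only(X_toy, y_toy, keep_fraction=0.0)
print(y_sub)
```

Sentences are kept or dropped as a whole, so the label transitions inside each remaining sentence are unchanged. Such subsampling should only be applied to the training data, never to the test set.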

POTENTIAL IMPROVEMENTS

Enhanced Feature Engineering:

Character n-grams or character embeddings can be incorporated to capture morphological patterns within words, improving recognition of rare or out-of-vocabulary words. Pre-trained word embeddings (e.g., GloVe, Word2Vec) can also be used to provide semantic context beyond the token level.
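As an illustration of the character n-gram idea, the sketch below turns a token into boolean n-gram features that could be merged into a CRF feature dict. The function names, the `ng:` key prefix, and the `<`/`>` boundary markers are assumptions made for this example.

```python
def char_ngrams(token, n_min=2, n_max=3):
    """Return character n-grams of the lowercased token with boundary markers."""
    padded = f"<{token.lower()}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def ngram_features(token, n_min=2, n_max=3):
    """Map each n-gram to a boolean feature, ready to merge into a feature dict."""
    return {f"ng:{gram}": True for gram in char_ngrams(token, n_min, n_max)}


print(char_ngrams("EU"))
print(ngram_features("EU"))
```

The result can be merged into an existing feature dict with `word_features.update(ngram_features(word))`, so that unseen words still share features with known words that have similar spellings.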

Neural architectures can be considered to capture non-linear relationships in the data:
- Bidirectional Long Short-Term Memory (BiLSTM)
- BiLSTM with a CRF output layer (BiLSTM-CRF)

Using additional training examples for underrepresented classes can be helpful to balance the dataset.

Fine-tuning the L1 and L2 regularization coefficients (c1 and c2) can prevent overfitting while maintaining model capacity, and experimenting with different feature combinations can identify the most impactful features. Both choices can strongly affect model performance. K-fold cross-validation can also be used to obtain a more robust estimate of the model's performance.
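The cross-validation idea can be sketched with a small fold generator that splits sentence indices into k parts. This is a standard-library illustration under stated assumptions: the name `kfold_indices` and its parameters are hypothetical, and splitting is done over whole sentences so that no sentence is shared between training and validation folds.

```python
import random


def kfold_indices(n_items, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


for train_idx, val_idx in kfold_indices(10, k=5):
    print(len(train_idx), len(val_idx))
```

Each pair can index into `X` and `y`, so that a CRF with a candidate `c1`/`c2` setting is fitted on the training folds and scored on the held-out fold; averaging the k scores gives the estimate for that setting.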

CONCLUSION

The NER model developed in this project successfully identifies various entity types with high accuracy. While the model performs well overall, there is room for improvement, particularly in handling class imbalance and enhancing recognition of less frequent entity types. Future iterations can leverage advanced models and enriched feature sets to achieve better performance.