In [42]:
# write the list of necessary packages here:
!pip install pandas
!pip install nltk
!pip install spacy
!pip install scikit-learn
!pip install sklearn-crfsuite


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.10 -m pip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.10 -m pip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.10 -m pip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m T

## Training a model on a Named Entity Recognition task

Token classification is the task of assigning a label to each individual token in a sentence. One of the most common token
classification tasks is Named Entity Recognition (NER). NER assigns each token a label that marks whether it belongs to a named entity
such as a person, location, or organization. In this assignment, you will learn how to train a model on the [CoNLL-2003 NER dataset](https://www.clips.uantwerpen.be/conll2003/ner/) to detect named entities.

### Loading the dataset

In [None]:
# import your packages here:
import pandas as pd
from sklearn_crfsuite import CRF, metrics
from sklearn.model_selection import train_test_split

In [93]:
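# Each file has four space-separated columns: token, POS tag, chunk tag, NER tag.
# header=0 uses the first line of each file as column names; for train.txt
# that is the "-DOCSTART- -X- -X- O" marker line.
train_df = pd.read_csv("ner_data/train.txt", header=0, sep=" ")
val_df = pd.read_csv("ner_data/val.txt", header=0, sep=" ")
test_df = pd.read_csv("ner_data/test.txt", header=0, sep=" ")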

print(f"{train_df.shape}, {val_df.shape}, {test_df.shape}")

(204566, 4), (51577, 4), (46665, 4)


The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word is on its own line, and an empty line follows each sentence. The first item on each line is the word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag, and the fourth the named entity tag. In the original release, chunk tags and named entity tags have the format I-TYPE, meaning that the word is inside a phrase of type TYPE; the first word of a phrase is tagged B-TYPE only when it immediately follows another phrase of the same type. The files used here appear to follow the IOB2 variant instead, in which the first word of every phrase is tagged B-TYPE (for example, `EU` is tagged `B-ORG` in the output below). A word with tag O is not part of a phrase. Here is an example:

In [94]:
train_df.head()

Unnamed: 0,-DOCSTART-,-X-,-X-.1,O
0,EU,NNP,B-NP,B-ORG
1,rejects,VBZ,B-VP,O
2,German,JJ,B-NP,B-MISC
3,call,NN,I-NP,O
4,to,TO,B-VP,O
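
The layout described above (one token per line, a blank line between sentences) can also be read without pandas. This matters because `pandas.read_csv` skips blank lines by default (`skip_blank_lines=True`), so sentence boundaries may not survive the load. The following is a minimal sketch, assuming the files follow the standard four-column CoNLL-2003 layout; `read_conll` is a hypothetical helper, not part of the assignment code.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str, str, str]  # (word, POS tag, chunk tag, NER tag)


def read_conll(lines: Iterable[str]) -> List[List[Token]]:
    """Group CoNLL-2003 style lines into sentences (hypothetical helper)."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("-DOCSTART-"):
            # Document separator, not a real token.
            continue
        word, pos, chunk, ner = line.split()
        current.append((word, pos, chunk, ner))
    if current:
        # Flush the last sentence if the input does not end with a blank line.
        sentences.append(current)
    return sentences


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""
sentences = read_conll(sample.splitlines())
print(len(sentences))   # 2
print(sentences[0][0])  # ('EU', 'NNP', 'B-NP', 'B-ORG')
```

For a real file, the same function can be applied to an open file handle, for example `read_conll(open("ner_data/train.txt", encoding="utf-8"))`.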


In [95]:
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

labels_vocab = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
labels_vocab_reverse = {v:k for k,v in labels_vocab.items()}

### Feature Extraction
 
You need to extract features for each token. The features can be:
• Basic features: the token itself, its lowercase form, prefixes/suffixes of the token.
• Context features: neighboring tokens (previous/next token).
• Linguistic features: part-of-speech (POS) tags or word shapes (capitalization, digits, etc.).

Note that you are expected to briefly mention which features you employ for training your model.
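
The word-shape idea from the list above can be sketched as follows. This is only an illustration of one common encoding (uppercase letters become `X`, lowercase letters `x`, digits `d`, everything else is kept); `word_shape` and `collapse` are hypothetical helpers and are not among the features used later in this notebook.

```python
def word_shape(token: str) -> str:
    """Map a token to a coarse shape: X = upper, x = lower, d = digit."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            # Punctuation and other symbols are kept unchanged.
            shape.append(ch)
    return "".join(shape)


def collapse(shape: str) -> str:
    """Collapse runs of the same symbol, e.g. 'Xxxxx' -> 'Xx'."""
    out = []
    for ch in shape:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


print(word_shape("EU"))                   # XX
print(word_shape("1996-08-22"))           # dddd-dd-dd
print(collapse(word_shape("Blackburn")))  # Xx
```

The collapsed form generalizes better across word lengths, while the full form keeps more detail; either string could be added to a token's feature dictionary.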

In [179]:
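def prepare_sentences(df):
    """Group token rows into sentences of (word, pos, chunk, ner) tuples.

    A row whose first column is NaN is treated as a sentence boundary, so this
    relies on blank lines surviving as NaN rows (pandas only keeps them when
    read_csv is called with skip_blank_lines=False).
    """
    sentences = []
    sentence = []
    for _, row in df.iterrows():
        if pd.isna(row.iloc[0]):
            # Boundary row: close the current sentence, if any.
            if sentence:
                sentences.append(sentence)
                sentence = []
        else:
            word = str(row.iloc[0])
            pos = str(row.iloc[1])
            chunk = str(row.iloc[2])
            ner = str(row.iloc[3])
            sentence.append((word, pos, chunk, ner))
    # Flush the last sentence when the data does not end with a boundary row.
    if sentence:
        sentences.append(sentence)
    return sentences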
    
    
def extract_data(data):
    """Turn sentences into per-token feature dicts and NER label lists."""
    features = []
    labels = []

    for sentence in data:
        word_features = []
        word_label = []
        for i, (word, pos, chunk, ner) in enumerate(sentence):
            feature_dict = {
                'word': word,
                # Also True for digits and punctuation, which have no case.
                'is_all_lower': word.lower() == word,
                'prefix': word[:2],
                'suffix': word[-2:],
                # Empty string marks the sentence start / end.
                'prev_word': '' if i == 0 else sentence[i - 1][0],
                'next_word': '' if i == len(sentence) - 1 else sentence[i + 1][0],
                # Also True when the first character is a digit or punctuation.
                'is_capitalized': word[0].upper() == word[0],
                'is_numeric': word.isdigit(),
                'chunk': chunk,
                'pos': pos,
            }
            word_features.append(feature_dict)
            word_label.append(ner)
        features.append(word_features)
        labels.append(word_label)
    return features, labels


In [180]:
train_sentences = prepare_sentences(train_df)
val_sentences = prepare_sentences(val_df)
test_sentences = prepare_sentences(test_df)


train_features, train_labels = extract_data(train_sentences)
val_features, val_labels = extract_data(val_sentences)
test_features, test_labels = extract_data(test_sentences)


### Train a NER Classifier Model

Implement one of the following classifiers for recognizing multiple entity types (e.g., person, organization, location): Conditional Random Field (CRF), biLSTM, or multinomial logistic regression. Select only one and provide a brief explanation for your choice of model.

In [181]:
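# L-BFGS training; c1 and c2 are the L1 and L2 regularization coefficients.
crf = CRF(
    algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100, all_possible_transitions=True)

# Fit on the training split and predict tags for the test split.
crf.fit(train_features, train_labels)
label_predictions = crf.predict(test_features)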

### Evaluation

Evaluate the model on the test set using metrics such as precision, recall, and F1-score.

In [182]:
# Token-level scores: every token counts once, so the frequent O tag dominates accuracy.
print("Flat Accuracy Score:", metrics.flat_accuracy_score(test_labels, label_predictions))

report = metrics.flat_classification_report(test_labels, label_predictions, labels=label_list, digits=4)
print("Classification Report:")
print(report)

Flat Accuracy Score: 0.9540047581284695
Classification Report:
              precision    recall  f1-score   support

           O     0.9878    0.9852    0.9865     38124
       B-PER     0.8405    0.8472    0.8439      1617
       I-PER     0.8721    0.9498    0.9093      1156
       B-ORG     0.8065    0.7375    0.7704      1661
       I-ORG     0.6052    0.7509    0.6702       835
       B-LOC     0.8456    0.8046    0.8246      1668
       I-LOC     0.7696    0.6887    0.7269       257
      B-MISC     0.7816    0.7749    0.7783       702
      I-MISC     0.5748    0.6759    0.6213       216

   micro avg     0.9536    0.9536    0.9536     46236
   macro avg     0.7871    0.8016    0.7924     46236
weighted avg     0.9549    0.9536    0.9540     46236
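
The report above scores individual tokens. NER systems are also commonly scored at the entity level, where a prediction counts as correct only if both the span boundaries and the type match exactly. The following is a minimal sketch of that idea, assuming IOB2-style tags; `extract_spans` and `entity_prf` are hypothetical helpers and the tag sequences are made up for illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, int, str]  # (sentence index, start, end exclusive, type)


def extract_spans(tags: List[str], sent_idx: int = 0) -> Set[Span]:
    """Collect entity spans from one IOB2-tagged sentence."""
    spans: Set[Span] = set()
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and tag[2:] == ent_type
        if start is not None and not inside:
            spans.add((sent_idx, start, i, ent_type))
            start, ent_type = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # A stray I- tag without a matching open span starts a new span.
            start, ent_type = i, tag[2:]
    return spans


def entity_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over entity spans."""
    gold_spans: Set[Span] = set()
    pred_spans: Set[Span] = set()
    for idx, (g, p) in enumerate(zip(gold, pred)):
        gold_spans |= extract_spans(g, idx)
        pred_spans |= extract_spans(p, idx)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "O", "O", "B-LOC"]]
print(entity_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

In this toy example three of four tokens are tagged correctly, yet only one of the two entities is recovered, which is why entity-level scores tend to be lower than token-level ones.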


### Reporting

Summarize your findings and suggest potential improvements for future iterations of the NER system. Additionally, discuss whether your model encountered class imbalance issues and how you addressed them. Write your suggestions to the given markdown cells.

I first grouped the rows into sentences with the prepare_sentences function, representing each token as a (token, POS tag, syntactic chunk tag, NER tag) tuple. I then extracted features from these sentences with the extract_data function. I used ten features for each word:
    • word: the token itself,
    • is_all_lower: boolean representing whether the word is in lowercase or not,
    • prefix: first two letters of the word,
    • suffix: last two letters of the word,
    • prev_word: previous word in the sentence (if it exists),
    • next_word: next word in the sentence (if it exists),
    • is_capitalized: boolean representing if the first letter is capitalized or not,
    • is_numeric: boolean representing if the word is numeric or not,
    • chunk: syntactic chunk tag,
    • pos: POS tag.
    
After extracting the features, I fitted the CRF classifier on the training features and labels, then predicted labels for the test set with the crf.predict function.

I chose a Conditional Random Field (CRF) classifier because the task is sequence labeling. A CRF scores the whole label sequence jointly, so it models dependencies between neighboring labels (for example, that I-PER should follow B-PER or I-PER). A biLSTM with a per-token softmax output does use context from both directions of the input, but it still predicts each label independently of its neighbors' predicted labels unless a CRF layer is added on top. Multinomial logistic regression has the same limitation: it classifies each token separately.
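
To illustrate how modeling dependencies between neighboring labels can change a prediction, here is a minimal Viterbi decoding sketch over additive scores. All emission and transition scores are invented for the example; a trained CRF learns such weights from features, and sklearn-crfsuite performs its own decoding internally. `viterbi` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            labels: List[str]) -> List[str]:
    """Return the highest-scoring label sequence under additive scores."""
    # best[t][y] = best score of any path that ends in label y at position t
    best = [{y: emissions[0][y] for y in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, y), 0.0))
            scores[y] = best[-1][prev] + transitions.get((prev, y), 0.0) + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-PER", "I-PER"]
emissions = [
    {"O": 0.1, "B-PER": 1.0, "I-PER": 0.2},  # first token looks like a name start
    {"O": 0.6, "B-PER": 0.1, "I-PER": 0.5},  # second token is ambiguous
]
transitions = {("B-PER", "I-PER"): 1.0, ("O", "I-PER"): -5.0}

independent = [max(e, key=e.get) for e in emissions]
print(independent)                               # ['B-PER', 'O']
print(viterbi(emissions, transitions, labels))   # ['B-PER', 'I-PER']
```

Choosing each label on its own yields `O` for the ambiguous second token, while the transition score that rewards `B-PER` followed by `I-PER` tips the joint decoding toward a two-token person name.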

This NER system reached a token-level accuracy of 0.954 and a macro-averaged F1 of 0.792 on the test set. I did not take any explicit measures against class imbalance. The labels are nonetheless skewed: O accounts for 38,124 of the 46,236 evaluated tokens, and the rarest classes (I-MISC and I-ORG) have the lowest F1 scores (0.621 and 0.670). Possible remedies include adding more informative features, at the cost of extra computation, or oversampling, for example by duplicating training sentences that contain minority-class entities.
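
The duplication idea mentioned above can be sketched as follows. This is an untested suggestion rather than something that was run in this notebook: `label_shares` and `oversample` are hypothetical helpers, the toy data is made up, and such resampling should only ever be applied to the training split.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set, Tuple


def label_shares(labels: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Fraction of tokens carrying each tag."""
    counts = Counter(tag for sent in labels for tag in sent)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def oversample(features: List[list], labels: List[List[str]],
               minority: Set[str], factor: int = 2) -> Tuple[list, list]:
    """Keep every sentence; include those with a minority tag `factor` times."""
    out_x, out_y = [], []
    for x, y in zip(features, labels):
        repeats = factor if minority & set(y) else 1
        out_x.extend([x] * repeats)
        out_y.extend([y] * repeats)
    return out_x, out_y


# Toy training data: three sentences of three tokens each.
labels = [["O", "O", "B-LOC"], ["O", "O", "O"], ["B-MISC", "I-MISC", "O"]]
features = [[{"word": f"w{i}{j}"} for j in range(3)] for i in range(3)]

new_x, new_y = oversample(features, labels, {"B-MISC", "I-MISC"}, factor=3)
print(len(new_y))                                  # 5
print(round(label_shares(labels)["I-MISC"], 3))    # 0.111
print(round(label_shares(new_y)["I-MISC"], 3))     # 0.2
```

Because whole sentences are repeated, the O tokens inside them are duplicated as well, so the shift in label shares is smaller than the duplication factor suggests.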

