### PROBLEM 1 – Reading the data in CoNLL format (20 pts)

In [4]:
import requests

def read_conll_file(url):
    # Initialize - empty lists for tokens and tags
    tokens_list = []
    tags_list = []
    current_tokens = []
    current_tags = []

    # GET request - fetch the data from the provided URL
    response = requests.get(url)
    if response.status_code == 200:
        # Split - content into lines
        lines = response.text.split('\n')
        for line in lines:
            if line.strip() == '':
                # Blank line - indicates the start of a new sequence
                if current_tokens:
                    tokens_list.append(current_tokens)
                    tags_list.append(current_tags)
                current_tokens = []
                current_tags = []
            else:
                # Split - each line into token and tag
                parts = line.split('\t')
                if len(parts) == 2:
                    token, tag = parts
                    current_tokens.append(token)
                    current_tags.append(tag)
        if current_tokens:
            tokens_list.append(current_tokens)
            tags_list.append(current_tags)
    return tokens_list, tags_list

# URL - train data
train_url = "https://raw.githubusercontent.com/spyysalo/ncbi-disease/master/conll/train.tsv"

# URL - test data
test_url = "https://raw.githubusercontent.com/spyysalo/ncbi-disease/master/conll/test.tsv"

**Apply your function to train.tsv and test.tsv. To show you have read in the data correctly, show the following in your notebook output:**
- The number of sequences in train and test. (You should see 5432 sequences in train and
940 sequences in test.)
- The tokens and tags of the first sequence in the training dataset.

In [5]:
# Read - train and test data
train_tokens, train_tags = read_conll_file(train_url)
test_tokens, test_tags = read_conll_file(test_url)

# Display - requested information
print(f"Number of sequences in train: {len(train_tokens)}")
print(f"Number of sequences in test: {len(test_tokens)}")

# Display - tokens and tags of the first sequence in the training dataset
print("\nTrain Data - First Sequence Tokens:")
print(train_tokens[0])
print("\nTrain Data - First Sequence Tags:")
print(train_tags[0])

Number of sequences in train: 5432
Number of sequences in test: 940

Train Data - First Sequence Tokens:
['Identification', 'of', 'APC2', ',', 'a', 'homologue', 'of', 'the', 'adenomatous', 'polyposis', 'coli', 'tumour', 'suppressor', '.']

Train Data - First Sequence Tags:
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Disease', 'I-Disease', 'I-Disease', 'I-Disease', 'O', 'O']
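
The parsing rules used in `read_conll_file` can also be exercised without a network call. The following is a minimal sketch, assuming a hypothetical helper `parse_conll_text` (not part of the original notebook) that applies the same blank-line and tab-splitting logic to an in-memory string:

```python
def parse_conll_text(text):
    """Parse tab-separated token/tag lines; blank lines separate sequences.

    Hypothetical helper for illustration; it mirrors the loop in read_conll_file.
    """
    tokens_list, tags_list = [], []
    current_tokens, current_tags = [], []
    for line in text.splitlines():
        if line.strip() == "":
            # Blank line - closes the current sequence, if there is one
            if current_tokens:
                tokens_list.append(current_tokens)
                tags_list.append(current_tags)
            current_tokens, current_tags = [], []
        else:
            parts = line.split("\t")
            # Keep only well-formed "token<TAB>tag" lines
            if len(parts) == 2:
                current_tokens.append(parts[0])
                current_tags.append(parts[1])
    # Flush - last sequence when the text does not end with a blank line
    if current_tokens:
        tokens_list.append(current_tokens)
        tags_list.append(current_tags)
    return tokens_list, tags_list


# Toy input - two sequences separated by a blank line
sample = "breast\tB-Disease\ncancer\tI-Disease\n\nof\tO\n"
sample_tokens, sample_tags = parse_conll_text(sample)
print(sample_tokens)
print(sample_tags)
```

Separating the parsing from the download makes it easier to check edge cases, such as a file that does not end with a blank line.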


### PROBLEM 2 – Data Discovery (5 pts)

**In this problem you will examine the data that you read into memory in the previous problem. Using the
training dataset for analysis, show the following in your notebook output:** 

- The count of each of the 3 tags in the training data: “B-Disease”, “I-Disease”, and “O”. Note that
the most frequent tag is "O", since most words are not part of a disease mention.
- The 20 most common words/tokens that appear with the tags “B-Disease” or “I-Disease”. That
is, show words that often appear in disease mentions. (You may show frequent “B-Disease” and
“I-Disease” words separately, or you may combine them into a single list.)
- OPTIONAL: Any other data exploration you would like to perform. For example, you may want to
print and read a small sample of token sequences, to become familiar with the data.

In [3]:
# Count - occurrences of each tag in the training data
tag_counts = {"B-Disease": 0, "I-Disease": 0, "O": 0}

for sequence_tags in train_tags:
    for tag in sequence_tags:
        tag_counts[tag] += 1

# Print - tag counts
print("Tag Counts in Training Data:")
for tag, count in tag_counts.items():
    print(f"{tag}: {count}")

# Create - dictionary to count word frequencies for "B-Disease" and "I-Disease" tags
word_counts = {}

for i, sequence_tags in enumerate(train_tags):
    for j, tag in enumerate(sequence_tags):
        if tag in {"B-Disease", "I-Disease"}:
            word = train_tokens[i][j]
            if word in word_counts:
                word_counts[word] += 1
            else:
                word_counts[word] = 1

# Sort - words by frequency in descending order
sorted_words = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)

# Print - 20 most common words with "B-Disease" or "I-Disease" tags
print("\n20 Most Common Words with 'B-Disease' or 'I-Disease' Tags:")
for word, count in sorted_words[:20]:
    print(f"{word}: {count}")

Tag Counts in Training Data:
B-Disease: 5145
I-Disease: 6122
O: 124819

20 Most Common Words with 'B-Disease' or 'I-Disease' Tags:
-: 636
deficiency: 322
syndrome: 281
cancer: 269
disease: 256
of: 178
dystrophy: 176
breast: 151
ovarian: 132
X: 122
and: 120
DM: 120
ALD: 114
DMD: 110
APC: 100
disorder: 94
muscular: 94
G6PD: 92
linked: 81
the: 78
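
The same word counts can be collected more compactly with `collections.Counter`. This is a minimal sketch on toy data, assuming a hypothetical helper `count_disease_words` that is not part of the original notebook:

```python
from collections import Counter


def count_disease_words(token_sequences, tag_sequences):
    """Count tokens tagged "B-Disease" or "I-Disease" (hypothetical helper)."""
    counts = Counter()
    for tokens, tags in zip(token_sequences, tag_sequences):
        for token, tag in zip(tokens, tags):
            if tag in {"B-Disease", "I-Disease"}:
                counts[token] += 1
    return counts


# Toy data - stands in for train_tokens and train_tags
toy_tokens = [["breast", "cancer", "risk"], ["ovarian", "cancer"]]
toy_tags = [["B-Disease", "I-Disease", "O"], ["B-Disease", "I-Disease"]]
toy_counts = count_disease_words(toy_tokens, toy_tags)
print(toy_counts.most_common(1))  # [('cancer', 2)]
```

On the real data, `count_disease_words(train_tokens, train_tags).most_common(20)` would give the top 20 list directly, without a separate sort step.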


### PROBLEM 3 – Building features (20 pts)
**In this problem, you will build the features that you will use in your CRF model. You may find it helpful to
refer to this demo notebook, to understand how to work with the python-crfsuite library.**

- Write a function that takes two inputs:
    - A sequence of tokens
    - An integer position, pointing to one token in that sequence.
    
    and returns a list of features, represented as a list of strings. At minimum, include these features:
    - The current word/token in lower case
    - The suffix (last 3 characters) of the current word
    - The previous word/token (position i-1) or “BOS” if at the beginning of the sequence
    - The next word/token (position i+1), or “EOS” if at the end of the sequence
    - At least one other feature of your choice


- Apply your function to your train and test token sequences (from output of Problem 1).
- To show that you have completed this step, apply your output to the first 3 words in the first
sequence of the training set.

In [4]:
# Define - function to extract features
def extract_features(tokens, position):
    features = []
    word = tokens[position]
    
    # Add - features for the current word
    features.append(f'w0.lower={word.lower()}')
    features.append(f'w0.suffix3={word[-3:]}')
    
    # Add - features for the previous word (position i-1) or "BOS" if at the beginning of the sequence
    if position == 0:
        features.append('BOS')
    else:
        prev_word = tokens[position - 1]
        features.append(f'w-1.lower={prev_word.lower()}')
        features.append(f'w-1.suffix3={prev_word[-3:]}')
    
    # Add - features for the next word (position i+1) or "EOS" if at the end of the sequence
    if position == len(tokens) - 1:
        features.append('EOS')
    else:
        next_word = tokens[position + 1]
        features.append(f'w+1.lower={next_word.lower()}')
        features.append(f'w+1.suffix3={next_word[-3:]}')
    
    # Add - at least one other feature of my choice 
    features.append(f'w0.length={len(word)}')
    
    return features

# Test - function with the first 5 words in the first sequence of the training set
for i in range(5):
    features = extract_features(train_tokens[0], i)
    print(features)

['w0.lower=identification', 'w0.suffix3=ion', 'BOS', 'w+1.lower=of', 'w+1.suffix3=of', 'w0.length=14']
['w0.lower=of', 'w0.suffix3=of', 'w-1.lower=identification', 'w-1.suffix3=ion', 'w+1.lower=apc2', 'w+1.suffix3=PC2', 'w0.length=2']
['w0.lower=apc2', 'w0.suffix3=PC2', 'w-1.lower=of', 'w-1.suffix3=of', 'w+1.lower=,', 'w+1.suffix3=,', 'w0.length=4']
['w0.lower=,', 'w0.suffix3=,', 'w-1.lower=apc2', 'w-1.suffix3=PC2', 'w+1.lower=a', 'w+1.suffix3=a', 'w0.length=1']
['w0.lower=a', 'w0.suffix3=a', 'w-1.lower=,', 'w-1.suffix3=,', 'w+1.lower=homologue', 'w+1.suffix3=gue', 'w0.length=1']
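
Beyond token length, simple orthographic cues are a common choice for the "other feature" slot, since many frequent disease tokens in this dataset are abbreviations such as DMD or G6PD. The sketch below assumes a hypothetical helper `shape_features`, which is not part of the original notebook, and whose feature names are made up for illustration:

```python
def shape_features(word):
    """Return orthographic features for one token (hypothetical helper)."""
    features = []
    # All cased characters are upper case, e.g. "DMD" or "G6PD"
    if word.isupper():
        features.append("w0.isupper")
    # Capitalized word, e.g. "Identification"
    if word.istitle():
        features.append("w0.istitle")
    # Token contains at least one digit, e.g. "APC2"
    if any(ch.isdigit() for ch in word):
        features.append("w0.hasdigit")
    return features


for word in ["Identification", "of", "APC2"]:
    print(word, shape_features(word))
```

The returned list could be appended to the output of `extract_features` with `features.extend(shape_features(word))`. Whether these features help would need to be checked by retraining the model.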


### PROBLEM 4 – Training a CRF model (20 pts)


**In this problem, you will train a CRF model and evaluate it using metrics computed over individual tags.**

- Using the python-crfsuite library, train a CRF sequential tagging model on the feature sequences
that you built in the previous step, using your training data as input.
- Apply your model to your test dataset to generate predicted tag sequences.
- For each of the 3 labels ("B-Disease","I-Disease", and “O") show precision, recall, f1-score.

In [5]:
from sklearn.metrics import classification_report
import pycrfsuite

# Define - function to extract features for the entire dataset
def extract_features_for_dataset(token_sequences):
    features_for_dataset = []
    for tokens in token_sequences:
        features_for_sequence = []
        for i in range(len(tokens)):
            features = extract_features(tokens, i)
            features_for_sequence.append(features)
        features_for_dataset.append(features_for_sequence)
    return features_for_dataset

# Extract - features for the training and test datasets
train_features = extract_features_for_dataset(train_tokens)
test_features = extract_features_for_dataset(test_tokens)

# Define - function to flatten the true and predicted tag sequences
def flatten_tags(tag_sequences):
    return [tag for sequence_tags in tag_sequences for tag in sequence_tags]

# Train - CRF model
trainer = pycrfsuite.Trainer(verbose=False)
for x_seq, y_seq in zip(train_features, train_tags):
    trainer.append(x_seq, y_seq)
trainer.set_params({
    'c1': 1.0,  # Coefficient for L1 penalty
    'c2': 1e-3,  # Coefficient for L2 penalty
    'max_iterations': 50,  # Maximum number of iterations
})

model_file = 'disease_crf_model.crfsuite'
trainer.train(model_file)

# Apply - trained model to the test data
tagger = pycrfsuite.Tagger()
tagger.open(model_file)
test_pred_tags = [tagger.tag(features) for features in test_features]

# Flatten - true and predicted tag sequences
true_tags = flatten_tags(test_tags)
pred_tags = flatten_tags(test_pred_tags)

# Compute - precision, recall, and f1-score for each label
# (labels= pins the row order so that it always matches target_names)
target_names = ["B-Disease", "I-Disease", "O"]
report = classification_report(true_tags, pred_tags, labels=target_names, target_names=target_names)

# Print - classification report
print(report)

              precision    recall  f1-score   support

   B-Disease       0.86      0.72      0.78       960
   I-Disease       0.85      0.76      0.80      1087
           O       0.98      0.99      0.99     22450

    accuracy                           0.97     24497
   macro avg       0.90      0.82      0.86     24497
weighted avg       0.97      0.97      0.97     24497
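
To make the per-label numbers in the report concrete, here is a minimal sketch of how precision, recall, and f1-score follow from true positive, false positive, and false negative counts for one label. The helper `label_scores` and the toy tags are assumptions for illustration and are not part of the original notebook:

```python
def label_scores(true_tags, pred_tags, label):
    """Compute (precision, recall, f1) for one label from flat tag lists."""
    pairs = list(zip(true_tags, pred_tags))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    # Guard against division by zero when a label is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy data - the model finds one of the two "B-Disease" tokens
toy_true = ["O", "B-Disease", "I-Disease", "O", "B-Disease"]
toy_pred = ["O", "B-Disease", "O", "O", "O"]
precision, recall, f1 = label_scores(toy_true, toy_pred, "B-Disease")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In the toy example every predicted "B-Disease" is correct (precision 1.00), but only half of the true ones are found (recall 0.50). This is the same pattern, in milder form, as the B-Disease row in the report above.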



### PROBLEM 5 – Inspecting the trained model (10 pts)


**In this problem you will examine parameter weights assigned by your model. You can do this by calling
“tagger.info().transitions” and “tagger.info().state_features” on your trained model object.**

- In your notebook, show parameter weights given to transitions between the 3 tag types ("B-Disease","I-Disease", and "O").
- Refer back to the feature you designed in Problem 3 (the feature "of your choice"). Show the parameter weights assigned to this feature. You may truncate this list if it is very long.
- *IF* your feature was dropped during model training (that is, there is nothing to show in the
previous step) then return to Problem 4 and design a new feature that is used in your model.

In [6]:
get_info = tagger.info()
tags = set(get_info.transitions.keys())
train_data_tags = set()

# Create - table of transition weights (unnormalized scores, not probabilities)
print("Transition Probabilities:")
print("{:<12} {:<12} {:<8}".format("From Tag", "To Tag", "Weight"))
for key, value in get_info.transitions.items():
    label_from, label_to = key
    train_data_tags.add(label_from)
    train_data_tags.add(label_to)
    print("{:<12} {:<12} {:<8}".format(label_from, label_to, f"{value:.4f}"))

# Print - feature's weights
print("\nFeature Weights:")
# Print - header, then only the weights of the custom "w0.length" feature from Problem 3
print("{:<12} {:<12} {:<8}".format("Label", "Feature", "Weight"))
for key, value in get_info.state_features.items():
    attr, label = key
    if attr.startswith("w0.length="):  # Feature of my choice - token length
        print("{:<12} {:<12} {:<8}".format(label, attr, f"{value:.4f}"))

Transition Probabilities:
From Tag     To Tag       Weight  
O            O            3.8236  
O            B-Disease    2.3919  
B-Disease    O            -1.3448 
B-Disease    B-Disease    -4.8753 
B-Disease    I-Disease    5.8435  
I-Disease    O            -3.9211 
I-Disease    B-Disease    -5.6483 
I-Disease    I-Disease    3.4077  

Feature Weights:
Label        Feature      Weight  
O            w0.length=14 -0.2419 
B-Disease    w0.length=14 0.4438  
I-Disease    w0.length=14 0.0008  
O            w0.length=2  0.5359  
B-Disease    w0.length=2  -0.2939 
I-Disease    w0.length=2  -0.2939 
O            w0.length=4  0.6099  
B-Disease    w0.length=4  0.1572  
I-Disease    w0.length=4  -0.2709 
O            w0.length=1  1.6596  
B-Disease    w0.length=1  -1.9769 
I-Disease    w0.length=1  -0.2225 
O            w0.length=9  0.0075  
B-Disease    w0.length=9  0.1911  
I-Disease    w0.length=9  -0.0151 
O            w0.length=3  0.3969  
B-Disease    w0.length=3  0.9805  
I-Disease  
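
The transition table printed above lists eight tag pairs, while three tags allow nine. A minimal sketch for finding the missing pair is shown below. The weights are copied from the printed output into a plain dictionary, so the sketch runs without the trained model:

```python
from itertools import product

# Transition weights copied from the notebook output above
transitions = {
    ("O", "O"): 3.8236,
    ("O", "B-Disease"): 2.3919,
    ("B-Disease", "O"): -1.3448,
    ("B-Disease", "B-Disease"): -4.8753,
    ("B-Disease", "I-Disease"): 5.8435,
    ("I-Disease", "O"): -3.9211,
    ("I-Disease", "B-Disease"): -5.6483,
    ("I-Disease", "I-Disease"): 3.4077,
}

labels = ["B-Disease", "I-Disease", "O"]

# Collect - every (from, to) pair that has no learned transition weight
missing = [pair for pair in product(labels, repeat=2) if pair not in transitions]
print(missing)  # [('O', 'I-Disease')]
```

With the trained model loaded, the same check could be run on `tagger.info().transitions` instead of the copied dictionary.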

### PROBLEM 6 – Document level performance (10 pts)

**Tag-level accuracy is easy to compute, but it is not very easy to understand. In particular, one disease
reference may cover both "B-Disease" and "I-Disease" tokens. To give another view of model
performance, compute document-level precision and recall on your experiment output. To do this:** 

- Write a function that aggregates token-level tags to a document-level label. For example, convert a tag sequence like ["O", "B-Disease", "I-Disease", "O", "O"] to a single label y=1. Your function should assign y=1 to a sequence with one or more disease mentions (at least one "B-Disease" tag) and y=0 to a sequence with no disease mentions.
- Apply your function to both the true and the predicted tag sequences from your test set. Use
the output to compute document-level precision and recall of your model. Show your results in
your notebook.

In [7]:
# Define - function to agg token level tags
def agg_doc_labels(token_sequences):
    doc_labels = []
    for sequence_tags in token_sequences:
        if any(tag == "B-Disease" for tag in sequence_tags):
            doc_labels.append(1)   
        else:
            doc_labels.append(0)   
    return doc_labels

# Aggregate - true and predicted labels to document-level
true_doc_labels = agg_doc_labels(test_tags)
predicted_doc_labels = agg_doc_labels(test_pred_tags)

# Calculate - document-level precision and recall
from sklearn.metrics import precision_score, recall_score

doc_precision = precision_score(true_doc_labels, predicted_doc_labels)
doc_recall = recall_score(true_doc_labels, predicted_doc_labels)

# Print - document-level precision and recall
print(f"Document-Level Precision: {doc_precision:.4f}")
print(f"Document-Level Recall: {doc_recall:.4f}")


Document-Level Precision: 0.9671
Document-Level Recall: 0.8738


In [8]:
# Define - variant of the function above that also counts a stray "I-Disease" tag as a disease mention
def agg_doc_labels(token_sequences):
    doc_labels = []
    for sequence_tags in token_sequences:
        if any(tag in {"B-Disease", "I-Disease"} for tag in sequence_tags):
            doc_labels.append(1)  
        else:
            doc_labels.append(0)   
    return doc_labels

# Aggregate - true and predicted labels to document-level
true_doc_labels = agg_doc_labels(test_tags)
predicted_doc_labels = agg_doc_labels(test_pred_tags)

# Calculate - document-level precision and recall
from sklearn.metrics import precision_score, recall_score

doc_precision = precision_score(true_doc_labels, predicted_doc_labels)
doc_recall = recall_score(true_doc_labels, predicted_doc_labels)

# Print - document-level precision and recall
print(f"Document-Level Precision: {doc_precision:.4f}")
print(f"Document-Level Recall: {doc_recall:.4f}")

Document-Level Precision: 0.9671
Document-Level Recall: 0.8738
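
Because one disease reference can span a "B-Disease" token and several "I-Disease" tokens, it can also help to look at whole mentions rather than individual tags. The following is a minimal sketch, assuming a hypothetical helper `extract_mentions` that is not part of the original notebook. It is applied here to the first training sequence shown in Problem 1:

```python
def extract_mentions(tokens, tags):
    """Group BIO-tagged tokens into disease mention strings (hypothetical helper)."""
    mentions, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-Disease":
            # A new mention starts; close the previous one if it is still open
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif tag == "I-Disease" and current:
            # Continue the open mention
            current.append(token)
        else:
            # "O", or a stray "I-Disease" with no open mention, which is ignored
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


first_tokens = ['Identification', 'of', 'APC2', ',', 'a', 'homologue', 'of', 'the',
                'adenomatous', 'polyposis', 'coli', 'tumour', 'suppressor', '.']
first_tags = ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
              'B-Disease', 'I-Disease', 'I-Disease', 'I-Disease', 'O', 'O']
print(extract_mentions(first_tokens, first_tags))  # ['adenomatous polyposis coli tumour']
```

A sequence would then get the document-level label y=1 exactly when this list is non-empty, which matches the "at least one B-Disease tag" rule.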


### PROBLEM 7 – State Transitions (10 pts – Answer in Blackboard)

**The python-crfsuite library allows you to set a Boolean hyper-parameter called
“feature.possible_transitions”. If this parameter is True, then the model may output tag-to-tag
transitions that were never seen in training data. [You do not need to apply this parameter in your code
to answer this question]**

**What is an example of one tag-to-tag transition that never occurred in the training data?**
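- An example of a tag-to-tag transition that never occurred in the training data is "O" followed directly by "I-Disease". Under the BIO scheme every disease mention starts with "B-Disease", so "I-Disease" can only follow "B-Disease" or another "I-Disease". Consistent with this, "O" to "I-Disease" is the only one of the nine possible tag pairs missing from the transition weights shown in Problem 5.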

**For this particular experiment, do you think it makes sense to set this parameter to True or False? That is, should you allow transitions that never occurred in the training data? Explain your answer briefly.**
- I think it makes more sense to set "feature.possible_transitions" to False for this experiment. With only three tags and 5432 training sequences, the valid transitions are very likely all observed already, so a transition that is missing from the training data is more likely to be invalid under the BIO scheme than merely rare. Setting the parameter to True is more useful when the training data is limited and valid transitions may not have been observed, because it makes the model more flexible and able to handle unseen transitions.
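
Under the BIO scheme, an "I-Disease" tag is expected to continue a mention opened by "B-Disease". A predicted sequence can be checked against this rule with a few lines of code. The sketch below assumes a hypothetical helper `is_valid_bio` that is not part of the original notebook:

```python
def is_valid_bio(tags):
    """Return True if no "I-Disease" tag starts a sequence or directly follows "O"."""
    previous = "O"  # Treat the start of the sequence like an "O" tag
    for tag in tags:
        if tag == "I-Disease" and previous == "O":
            return False
        previous = tag
    return True


print(is_valid_bio(["O", "B-Disease", "I-Disease", "O"]))  # True
print(is_valid_bio(["O", "I-Disease", "O"]))               # False
```

Running a check like this over `test_pred_tags` would show whether the model ever outputs such a transition in practice.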