PROBLEM 1 – Reading the data in CoNLL format (20pts)
Note that the NCBI Disease Corpus (See section DATA above) is already split into train, development,
and test datasets. You will use the train and test datasets in this homework.
As noted above, you should use files in the "ncbi-disease/conll" subfolder. In this file format, a blank line
indicates the start of a new sequence.
• Write a function that reads a .tsv files in the CoNLL format and returns two “list of lists” as
output:
o A list of sequences of tokens, where a single token may be a word or punctuation.
o A list of sequences of tags, representing token-level annotation. You should see these 3
tags in your data (“B-Disease”, “I-Disease”, “O”)
• Apply your function to train.tsv and test.tsv. To show you have read in the data correctly, show
the following in your notebook output:
o The number of sequences in train and test. (You should see 5432 sequences in train and
940 sequences in test.)
o The tokens and tags of the first sequence in the training dataset. Your output should
look like this:
['Identification', 'of', 'APC2', ',', 'a', 'homologue', 'of', 'the', 'adenomatous',
'polyposis', 'coli', 'tumour', 'suppressor', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Disease', 'I-Disease', 'I-Disease', 'IDisease', 'O', 'O']


In [3]:
def read_conll_file(file_path):
    # Initialize lists to hold tokens and tags
    tokens, tags = [], []

    # Open the CoNLL file for reading
    with open(file_path, 'r', encoding='utf-8') as file:
        # Initialize current sequence lists
        current_tokens, current_tags = [], []

        # Read lines from the file
        for line in file:
            line = line.strip()
            if line:  # If the line is not empty
                parts = line.split('\t')
                token, tag = parts[0], parts[1]
                current_tokens.append(token)
                current_tags.append(tag)
            else:  # Empty line indicates the end of a sequence
                tokens.append(current_tokens)
                tags.append(current_tags)
                current_tokens, current_tags = [], []

    return tokens, tags

# Apply the function to train.tsv and test.tsv
train_tokens, train_tags = read_conll_file('/content/conll/train.tsv')
test_tokens, test_tags = read_conll_file('/content/conll/test.tsv')

# Display the number of sequences in train and test
print("Number of sequences in train:", len(train_tokens))
print("Number of sequences in test:", len(test_tokens))

# Display the tokens and tags of the first sequence in the training dataset
print("Tokens of the first sequence in training dataset:")
print(train_tokens[0])
print("Tags of the first sequence in training dataset:")
print(train_tags[0])


Number of sequences in train: 5432
Number of sequences in test: 940
Tokens of the first sequence in training dataset:
['Identification', 'of', 'APC2', ',', 'a', 'homologue', 'of', 'the', 'adenomatous', 'polyposis', 'coli', 'tumour', 'suppressor', '.']
Tags of the first sequence in training dataset:
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Disease', 'I-Disease', 'I-Disease', 'I-Disease', 'O', 'O']


#PROBLEM 2 – Data Discovery (5 pts)
In this problem you will examine the data that you read into memory in the previous problem. Using the
training dataset for analysis, show the following in your notebook output:
• The count of each of the 3 tags in the training data: “B-Disease”, “I-Disease”, and “O”. Note that
the most frequent token is "O", since most words are not part of a disease mention.
• The 20 most common words/tokens that appear with the tags “B-Disease” or “I-Disease”. That
is, show words that often appear disease mentions. (You may show frequent “B-Disease” and “IDisease” words separately, or you may combine them into a single list.)
• OPTIONAL: Any other data exploration you would like to perform. For example, you may want to
print and read a small sample of token sequences, to become familiar with the data.
Review the list of words that commonly appear in disease mentions. Do you see any patterns? (You do
not need to answer in writing, but it may be helpful in Problem 3 where you design a feature.)


In [5]:
# In the realm of data, let's explore and see,
# What insights the training dataset brings to be.
# With Python's aid, we'll unveil the hidden lore,
# The count of tags and common words galore.

# First, we shall count the tags three,
# "B-Disease," "I-Disease," and "O" you'll agree.
# Among them, "O" shall reign supreme,
# For non-disease words, it's like a dream.


from collections import Counter

# Combine tokens and tags from training data
combined_data = list(zip(train_tokens, train_tags))

# Count of each tag in the training data
tag_counts = Counter(tag for tags in train_tags for tag in tags)

# Display the count of each tag
print("Count of 'B-Disease' tag:", tag_counts['B-Disease'])
print("Count of 'I-Disease' tag:", tag_counts['I-Disease'])
print("Count of 'O' tag:", tag_counts['O'])
# With these lines of code, the tags' count we found,
# To "B-Disease," "I-Disease," and "O," we're bound.
# Now, let's seek the common words that dwell,
# Within the realm of disease, as stories tell.


from collections import Counter

# Create a list of words associated with "B-Disease" and "I-Disease"
disease_words = [token for tokens, tags in combined_data for token, tag in zip(tokens, tags) if tag in ['B-Disease', 'I-Disease']]

# Count the frequency of each word
word_counts = Counter(disease_words)

# Display the 20 most common disease words
most_common_disease_words = word_counts.most_common(20)
print("The 20 most common disease words:")
for word, count in most_common_disease_words:
    print(word, ":", count)


# With this code, the words are revealed,
# That in disease mentions, are often sealed.
# The 20 most common, they grace our sight,
# In this poetic exploration, a data delight.

# Optionally, we may journey further to explore,
# Sample token sequences, data's core.
# To discern patterns, to understand,
# The secrets of the data, like grains of sand.

# With these revelations, we set the stage,
# To craft features in the next coding page.
# In the world of data, our understanding grows,
# As we navigate the rivers where knowledge flows.

Count of 'B-Disease' tag: 5145
Count of 'I-Disease' tag: 6122
Count of 'O' tag: 124819
The 20 most common disease words:
- : 636
deficiency : 322
syndrome : 281
cancer : 269
disease : 256
of : 178
dystrophy : 176
breast : 151
ovarian : 132
X : 122
and : 120
DM : 120
ALD : 114
DMD : 110
APC : 100
disorder : 94
muscular : 94
G6PD : 92
linked : 81
the : 78


PROBLEM 3 – Building features (20 pts)
In this problem, you will build the features that you will use in your CRF model. You may find it helpful to
refer to this demo notebook, to understand how to work with the python-crfsuite library.
• Write a function that takes two inputs:
o A sequence of tokens
o An integer position, pointing to one token in that sequence.
and returns a list of features, represented as a list of strings. At minimum, include these
features:
o The current word/token in lower case
o The suffix (last 3 characters) of the current word
o The previous word/token (position i-1) or “BOS” if at the beginning of the sequence
o The next word/token (position i+1), or “EOS” if at the beginning of the sequence
o At least one other feature of your choice
• Apply your function your train and test token sequences (from output of Problem 1).
• To show that you have completed this step, apply your output to the first 3 words in the first
sequence of the training set. Your output should look something like this (note the names and
order of your features in your notebook do not need to match this output):
['w0.lower=identification', 'w0.suffix3=ion', <other features not shown...>, BOS ]
['w0.lower=of', 'w0.suffix3=of', <other features not shown...>]
['w0.lower=apc2', 'w0.suffix3=PC2', <other features not shown...>]


In [25]:
# In the land of feature crafting, we embark today,
# To pave the way for a robust model, we say.
# With a function's grace, our tools we shall employ,
# Creating features for tokens to bring us joy.

# A function we script, to shape our features right,
# Given tokens and a position, in the data's light.
# From these, we extract knowledge with delight,
# Features to empower our model's might.

# Here's the code, let's unveil the art,
# Of crafting features, each playing its part.

def extract_features(tokens, position):
    # Get the current word and its last 3 characters as features
    current_word = tokens[position].lower()
    word_suffix = tokens[position][-3:] if len(tokens[position]) > 2 else tokens[position]

    # Get the previous and next words (or "BOS" and "EOS" if at the sequence boundaries)
    previous_word = tokens[position - 1] if position > 0 else "BOS"
    next_word = tokens[position + 1] if position < len(tokens) - 1 else "EOS"

    # Include one more feature of your choice (e.g., word shape, capitalization)
    # For example, let's add a feature that represents whether the current word is capitalized
    is_capitalized = "Capitalized" if current_word.istitle() else "NotCapitalized"

    # Return the features as a list of strings
    features = [
        f'w0.lower={current_word}',
        f'w0.suffix3={word_suffix}',
        f'w-1={previous_word}',
        f'w+1={next_word}',
        f'w0.capitalization={is_capitalized}'
    ]

    return features

# Apply the function to the first 3 words in the first sequence of the training set
for i in range(3):
    features = extract_features(train_tokens[0], i)
    print(features)


# With this code, we take a token's hand,
# And from it, features we gently expand.
# Lowercased words, suffixes that guide,
# Contextual words on either side.

# Our chosen feature, capitalization's might,
# Sheds light on tokens, be they bold or light.
# With these crafted gems, our journey's begun,
# To empower the model, we craft features, one by one.

['w0.lower=identification', 'w0.suffix3=ion', 'w-1=BOS', 'w+1=of', 'w0.capitalization=NotCapitalized']
['w0.lower=of', 'w0.suffix3=of', 'w-1=Identification', 'w+1=APC2', 'w0.capitalization=NotCapitalized']
['w0.lower=apc2', 'w0.suffix3=PC2', 'w-1=of', 'w+1=,', 'w0.capitalization=NotCapitalized']


In [26]:
# Problem 3
def get_features(tokens, i):

  features = []

  # Baseline features
  features.append(f"w{i}.lower={tokens[i].lower()}")
  features.append(f"w{i}.suffix3={tokens[i][-3:]}")

  if i > 0:
    features.append(f"w{i-1}={tokens[i-1]}")
  else:
    features.append("BOS")

  if i < len(tokens)-1:
    features.append(f"w{i+1}={tokens[i+1]}")
  else:
    features.append("EOS")

  # Additional feature
  features.append(f"w{i}.isdigit={str(tokens[i].isdigit()).lower()}")

  return features

train_features = [get_features(seq, i) for seq in train_tokens for i in range(len(seq))]
test_features = [get_features(seq, i) for seq in test_tokens for i in range(len(seq))]

print(train_features[0][:3])
print(test_features[0][:3])

['w0.lower=identification', 'w0.suffix3=ion', 'BOS']
['w0.lower=clustering', 'w0.suffix3=ing', 'BOS']


PROBLEM 4 – Training a CRF model (20 pts)
In this problem, you will train a CRF model and evaluate it using metrics computed over individual tags.
• Using the python-crfsuite library, train a CRF sequential tagging model using feature sequences
that you built in the previous step. Using your training data as input.
• Apply your model to your test dataset to generate predicted tag sequences.
• For each of the 3 labels ("B-Disease", "I-Disease", and “O") show precision, recall, f1-score. [You
may use the sckit-learn function classification_report to complete this step. You may also want
to “flatten” both the true and predicted tags into a single list of tags to apply this function.]


In [10]:
!pip install python-crfsuite

Collecting python-crfsuite
  Downloading python_crfsuite-0.9.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (993 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m993.5/993.5 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: python-crfsuite
Successfully installed python-crfsuite-0.9.9


In [22]:
import pycrfsuite
from sklearn.metrics import classification_report

# Prepare training data
X_train = [extract_features(tokens, i) for tokens in train_tokens for i in range(len(tokens))]
y_train = [tag for tags in train_tags for tag in tags]

# Prepare test data
X_test = [extract_features(tokens, i) for tokens in test_tokens for i in range(len(tokens))]
y_test = [tag for tags in test_tags for tag in tags]

# Initialize and train the CRF model
trainer = pycrfsuite.Trainer()
for x, y in zip(X_train, y_train):
    trainer.append(x, y)

trainer.set_params({
    'c1': 1.0,
    'c2': 1e-3,
    'max_iterations': 50,
    'feature.possible_transitions': True
})

trainer.train('disease_crf_model.crfsuite')

# Load the trained model
tagger = pycrfsuite.Tagger()
tagger.open('disease_crf_model.crfsuite')

# Predict tags for the test dataset
y_pred = [tagger.tag(xseq) for xseq in X_test]

# Flatten the true and predicted tags
y_true_flat = [tag for tags in y_test for tag in tags]
y_pred_flat = [tag for tags in y_pred for tag in tags]

# Calculate precision, recall, and f1-score
report = classification_report(y_true_flat, y_pred_flat, target_names=["B-Disease", "I-Disease", "O"])
print(report)


AttributeError: ignored

In [23]:
import pycrfsuite
from sklearn.metrics import classification_report

# Prepare training data
X_train = [extract_features(tokens, i) for tokens in train_tokens for i in range(len(tokens))]
y_train = [tag for tags in train_tags for tag in tags]

# Prepare test data
X_test = [extract_features(tokens, i) for tokens in test_tokens for i in range(len(tokens))]
y_test = [tag for tags in test_tags for tag in tags]

print (type(X_train),type(y_train),type(X_test),type(y_test))
print (len(X_train),len(y_train),len(X_test),len(y_test))
print (len(X_train[0]),len(y_train[0]),len(X_test[0]),len(y_test[0]))
print (X_train[0],y_train[0])

# # Initialize the CRF model trainer
# trainer = pycrfsuite.Trainer()

# # Prepare your training data
# for xseq, yseq in zip(X_train, y_train):
#     trainer.append(xseq, yseq)

# # Set training parameters
# trainer.set_params({
#     'c1': 1.0,
#     'c2': 1e-3,
#     'max_iterations': 50,
#     'feature.possible_transitions': True
# })

# # Train the CRF model and save it
# trainer.train('disease_crf_model.crfsuite')

# # Load the trained model
# tagger = pycrfsuite.Tagger()
# tagger.open('disease_crf_model.crfsuite')

# # Make predictions on the test data
# y_pred = [tagger.tag(xseq) for xseq in X_test]

# # Flatten the true and predicted labels for evaluation
# y_true_flat = [label for labels in y_test for label in labels]
# y_pred_flat = [label for labels in y_pred for label in labels]

# # Calculate precision, recall, and F1-score
# report = classification_report(y_true_flat, y_pred_flat, target_names=["B-Disease", "I-Disease", "O"])
# print(report)


<class 'list'> <class 'list'> <class 'list'> <class 'list'>
136086 136086 24497 24497
5 1 5 1
['w0.lower=identification', 'w0.suffix3=ion', 'w-1=BOS', 'w+1=of', 'w0.capitalization=NotCapitalized'] O


In [28]:
import pycrfsuite
from sklearn.metrics import classification_report

# Load training data
train_tokens, train_tags = load_train_data(train_file)

# Load test data
test_tokens, test_tags = load_test_data(test_file)

# Extract features for train and test
train_features = extract_features(train_tokens)
test_features = extract_features(test_tokens)

# Initialize CRF model
crf = pycrfsuite.Trainer()

# Train CRF
crf.train(train_features, train_tags)

# Predict tags on test features
pred_tags = crf.tag(test_features)

# Flatten tags
flat_test_tags = [tag for seq in test_tags for tag in seq]
flat_pred_tags = [tag for seq in pred_tags for tag in seq]

# Compute metrics
report = classification_report(flat_test_tags, flat_pred_tags)

print(report)

NameError: ignored

PROBLEM 5 – Inspecting the trained model (10 pts)
In this problem you will examine parameter weights assigned by your model. You can do this by calling
“tagger.info().transitions” and “tagger.info().state_features” on your trained model object.
• In your notebook, show parameter weights given to transitions between the 3 tag types ("BDisease", "I-Disease", and "O").
• Refer back to the feature you designed in Problem 3 (the feature "of your choice"). Show the
parameter weights assigned to this feature. You may truncate this list if it is very long. [This may
happen if you included a word from the sequence in the feature name, so your feature was
expanded to become a larger set of features that grows with your vocabulary]
• *IF* your feature was dropped during model training (that is, there is nothing to show in the
previous step) then return to Problem 4 and design a new feature that is used in your model.
PROBLEM 6 – Document level performance (10 pts)
Tag-level accuracy is easy to compute, but it is not very easy to understand. In particular, one disease
reference may cover both "B-Disease" and "I-Disease" tokens. To give another view of model
performance, compute document-level precision and recall on your experiment output. To do this:
• Write a function that aggregates token-level tags to a document-level label. For example,
convert a tag sequence like ["O", "B-Disease", "I-Disease", "O", "O"] to a single label y=1. Your
function should assign y=1 to a sequence with one or more disease mentions (at least one "BDisease" tag) and y=0 to a sequence with no disease mentions.
• Apply your function to both true and predicted document-level labels from your test set. Use
the output to compute document level precision and recall of your model. Show your results in
your notebook.
PROBLEM 7 – State Transitions (10 pts – Answer in Blackboard)
The python-crfsuite library allows you to set a Boolean hyper-parameter called
“feature.possible_transitions”. If this parameter is True, then the model may output tag-to-tag
transitions that were never seen in training data. [You do not need to apply this parameter in your code
to answer this question]
• What is an example of one tag-to-tag transition that never occurred in the training data?
• For this particular experiment, do you think it makes sense to set this parameter to True or
False? That is, should you allow transitions that never occurred in the training data? Explain your
answer briefly.

In [30]:
!pip install sklearn-crfsuite

Collecting sklearn-crfsuite
  Downloading sklearn_crfsuite-0.3.6-py2.py3-none-any.whl (12 kB)
Installing collected packages: sklearn-crfsuite
Successfully installed sklearn-crfsuite-0.3.6


In [34]:
def read_conll(file_path):
    sequences = []
    tags = []
    with open(file_path, 'r') as file:
        sequence = []
        tag_sequence = []
        for line in file:
            line = line.strip()
            if line:
                token, tag = line.split('\t')
                sequence.append(token)
                tag_sequence.append(tag)
            else:
                sequences.append(sequence)
                tags.append(tag_sequence)
                sequence = []
                tag_sequence = []
        if sequence:
            sequences.append(sequence)
            tags.append(tag_sequence)
    return sequences, tags

# Read train.tsv and test.tsv files
train_sequences, train_tags = read_conll('/content/conll/train.tsv')
test_sequences, test_tags = read_conll('/content/conll/test.tsv')

# Print the number of sequences in train and test
print("Number of sequences in train:", len(train_sequences))
print("Number of sequences in test:", len(test_sequences))

# Print the tokens and tags of the first sequence in the training dataset
print("Tokens of the first sequence in train:")
print(train_sequences[0])
print("Tags of the first sequence in train:")
print(train_tags[0])

print('Problem 2 Solution------>>>>>')
from collections import Counter

# Count of each tag in the training data
tag_counts = Counter(tag for sequence in train_tags for tag in sequence)
print("Tag counts in train:", tag_counts)

# Word counts for B-Disease and I-Disease tags
bdisease_words = [token for sequence, tag_sequence in zip(train_sequences, train_tags) for token, tag in zip(sequence, tag_sequence) if tag == 'B-Disease']
idisease_words = [token for sequence, tag_sequence in zip(train_sequences, train_tags) for token, tag in zip(sequence, tag_sequence) if tag == 'I-Disease']
bdisease_counts = Counter(bdisease_words)
idisease_counts = Counter(idisease_words)
print("20 most common words/tokens with B-Disease tag:")
print(bdisease_counts.most_common(20))
print("20 most common words/tokens with I-Disease tag:")
print(idisease_counts.most_common(20))


print('Problem 3 Solution------>>>>>')
def get_features(tokens, position):
    features = []
    current_word = tokens[position].lower()
    current_suffix = tokens[position][-3:]
    previous_word = tokens[position - 1] if position > 0 else 'BOS'
    next_word = tokens[position + 1] if position < len(tokens) - 1 else 'EOS'

    features.append('w0.lower=' + current_word)
    features.append('w0.suffix3=' + current_suffix)
    features.append('w-1=' + previous_word)
    features.append('w+1=' + next_word)

    # Add your own additional feature(s) here

    return features

# Test the get_features function on the first 3 words of the first sequence in the training set
for i in range(3):
    features = get_features(train_sequences[0], i)
    print(features)

print('Problem 4 Solution------>>>>>')
import sklearn_crfsuite
from sklearn.metrics import classification_report

# Initialize CRF model and fit it on training data
crf_model = sklearn_crfsuite.CRF()
crf_model.fit(train_sequences, train_tags)

# Use the fitted model to predict tags on the test data
predicted_tags = crf_model.predict(test_sequences)

# Flatten the true and predicted tags into single lists
true_tags = [tag for sequence in test_tags for tag in sequence]
predicted_tags = [tag for sequence in predicted_tags for tag in sequence]

# Compute precision, recall, and f1-score for each label
report = classification_report(true_tags, predicted_tags)
print("Precision, Recall, and F1-score:")
print(report)


Number of sequences in train: 5432
Number of sequences in test: 940
Tokens of the first sequence in train:
['Identification', 'of', 'APC2', ',', 'a', 'homologue', 'of', 'the', 'adenomatous', 'polyposis', 'coli', 'tumour', 'suppressor', '.']
Tags of the first sequence in train:
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Disease', 'I-Disease', 'I-Disease', 'I-Disease', 'O', 'O']
Problem 2 Solution------>>>>>
Tag counts in train: Counter({'O': 124819, 'I-Disease': 6122, 'B-Disease': 5145})
20 most common words/tokens with B-Disease tag:
[('DM', 120), ('breast', 115), ('DMD', 110), ('APC', 94), ('X', 92), ('ALD', 86), ('PWS', 75), ('G6PD', 68), ('WAS', 63), ('autosomal', 58), ('familial', 58), ('myotonic', 57), ('Duchenne', 56), ('HD', 55), ('PKU', 52), ('aniridia', 50), ('deficiency', 47), ('ovarian', 46), ('hereditary', 45), ('VHL', 45)]
20 most common words/tokens with I-Disease tag:
[('-', 636), ('syndrome', 281), ('deficiency', 275), ('disease', 256), ('cancer', 230), ('of', 178), ('

In [39]:
import pycrfsuite
from sklearn.metrics import classification_report

# Initialize CRF model
crf_model = pycrfsuite.Trainer()

# Add training data to the model
for sequence, tag_sequence in zip(train_sequences, train_tags):
    crf_model.append(sequence, tag_sequence)

# Set algorithm parameters (optional)
crf_model.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 100,  # maximum number of iterations
    'feature.possible_transitions': True  # allow tag transitions not present in the training data
})

# Train the CRF model
crf_model.train('model.crfsuite')

# Initialize a new CRF model for prediction
tagger = pycrfsuite.Tagger()
tagger.open('model.crfsuite')

# Use the model to predict tags on the test data
predicted_tags = [tagger.tag(sequence) for sequence in test_sequences]

# Flatten the true and predicted tags into single lists
true_tags = [tag for sequence in test_tags for tag in sequence]
predicted_tags = [tag for sequence in predicted_tags for tag in sequence]

# Compute precision, recall, and f1-score for each label
report = classification_report(true_tags, predicted_tags)
print("Precision, Recall, and F1-score:")
print(report)


Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 224
Seconds required: 0.054

L-BFGS optimization
c1: 1.000000
c2: 0.001000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 60423.946806
Feature norm: 1.000000
Error norm: 39234.889966
Active features: 222
Line search trials: 1
Line search step: 0.000007
Seconds required for this iteration: 0.070

***** Iteration #2 *****
Loss: 49982.906930
Feature norm: 1.336998
Error norm: 17901.896707
Active features: 216
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.038

***** Iteration #3 *****
Loss: 45605.097285
Feature norm: 1.540933
Error norm: 14719.544187
Active features: 210
Line search trials: 1
Line search step: 1.000000
Seconds required for this it

In [40]:
# To inspect the parameter weights assigned by the trained model, you can use the tagger.info().transitions and tagger.info().state_features attributes.
# Assuming you have trained a model and stored it in the 'tagger' variable
transitions = tagger.info().transitions
state_features = tagger.info().state_features

# Parameter weights for transitions between tag types
print("Parameter weights for transitions:")
for transition in transitions:
    print(f"From '{transition[0]}' to '{transition[1]}': {transitions[transition]}")

# Parameter weights for the feature designed in Problem 3
print("\nParameter weights for the feature of your choice:")
for feature in state_features:
    if 'your_feature' in feature:
        print(f"{feature}: {state_features[feature]}")

# This code will print out the parameter weights for the transitions between the tag types ("B-Disease", "I-Disease", "O"),
#  as well as the parameter weights for the feature designed in Problem 3. Note that if the feature was dropped during model training,
#   there might be nothing to show for that feature.

Parameter weights for transitions:
From 'O' to 'O': 3.682413
From 'O' to 'B-Disease': 4.620523
From 'O' to 'I-Disease': -4.070437
From 'B-Disease' to 'O': -3.064544
From 'B-Disease' to 'B-Disease': -5.953808
From 'B-Disease' to 'I-Disease': 2.849245
From 'I-Disease' to 'O': -3.948628
From 'I-Disease' to 'B-Disease': -4.930598
From 'I-Disease' to 'I-Disease': 1.840045

Parameter weights for the feature of your choice:


In [44]:
from sklearn import metrics
# Compute document-level precision and recall
def get_document_level_labels(tags):
    document_labels = []
    in_disease_mention = False
    for tag in tags:
        if tag == 'O':
            if in_disease_mention:
                document_labels.append(1)
                in_disease_mention = False
            else:
                document_labels.append(0)
        elif tag == 'B-Disease':
            if in_disease_mention:
                document_labels.append(1)
            else:
                in_disease_mention = True
                document_labels.append(1)
        elif tag == 'I-Disease':
            if in_disease_mention:
                document_labels.append(1)
            else:
                document_labels.append(0)
    return document_labels

true_labels = get_document_level_labels(true_tags)
predicted_labels = get_document_level_labels(predicted_tags)

# Compute document-level precision and recall
document_precision = metrics.precision_score(true_labels, predicted_labels)
document_recall = metrics.recall_score(true_labels, predicted_labels)

print("Document-level Precision:", document_precision)
print("Document-level Recall:", document_recall)

Document-level Precision: 0.6311787072243346
Document-level Recall: 0.166
