<a href="https://colab.research.google.com/github/ganeevsingh18/Drug_prediction/blob/main/Copy_of_fine_tuning_drugs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SciBERT Fine-Tuning on Drug/ADE Corpus

In [1]:
! pip install datasets transformers seqeval



In [2]:
! pip install spacy



In [3]:
from datasets import Dataset, ClassLabel, Sequence, load_dataset, load_metric
import numpy as np
import pandas as pd
from spacy import displacy
import transformers
from transformers import (AutoModelForTokenClassification,
                          AutoTokenizer,
                          DataCollatorForTokenClassification,
                          pipeline,
                          TrainingArguments,
                          Trainer)

---
## Dataset Exploration

We use the `Ade_corpus_v2_drug_ade_relation` subset of the `ade_corpus_v2` dataset, which provides labeled spans for drug names and adverse effects.

See dataset page here: https://huggingface.co/datasets/ade_corpus_v2

In [4]:
datasets = load_dataset("ade_corpus_v2", "Ade_corpus_v2_drug_ade_relation")

In [5]:
datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'drug', 'effect', 'indexes'],
        num_rows: 6821
    })
})

In [6]:
datasets["train"][0]

{'text': 'Intravenous azithromycin-induced ototoxicity.',
 'drug': 'azithromycin',
 'effect': 'ototoxicity',
 'indexes': {'drug': {'start_char': [12], 'end_char': [24]},
  'effect': {'start_char': [33], 'end_char': [44]}}}

## Dataset Consolidation
----
Upon further examination of the dataset, we can see that sentences are often repeated to identify different pairs of drugs and adverse reactions. For example, see this sentence from the dataset:
```
{'text': 'After therapy for diabetic coma with insulin (containing the preservative cresol) and electrolyte solutions was started, the patient complained of increasing myalgia, developed a high fever and respiratory and metabolic acidosis and lost consciousness.', 'drug': 'insulin', 'effect': 'increasing myalgia', 'indexes': {'drug': {'start_char': [37], 'end_char': [44]}, 'effect': {'start_char': [147], 'end_char': [165]}}}
{'text': 'After therapy for diabetic coma with insulin (containing the preservative cresol) and electrolyte solutions was started, the patient complained of increasing myalgia, developed a high fever and respiratory and metabolic acidosis and lost consciousness.', 'drug': 'cresol', 'effect': 'lost consciousness', 'indexes': {'drug': {'start_char': [74], 'end_char': [80]}, 'effect': {'start_char': [233], 'end_char': [251]}}}
{'text': 'After therapy for diabetic coma with insulin (containing the preservative cresol) and electrolyte solutions was started, the patient complained of increasing myalgia, developed a high fever and respiratory and metabolic acidosis and lost consciousness.', 'drug': 'cresol', 'effect': 'high fever', 'indexes': {'drug': {'start_char': [74], 'end_char': [80]}, 'effect': {'start_char': [179], 'end_char': [189]}}}
{'text': 'After therapy for diabetic coma with insulin (containing the preservative cresol) and electrolyte solutions was started, the patient complained of increasing myalgia, developed a high fever and respiratory and metabolic acidosis and lost consciousness.', 'drug': 'insulin', 'effect': 'high fever', 'indexes': {'drug': {'start_char': [37], 'end_char': [44]}, 'effect': {'start_char': [179], 'end_char': [189]}}}
{'text': 'After therapy for diabetic coma with insulin (containing the preservative cresol) and electrolyte solutions was started, the patient complained of increasing myalgia, developed a high fever and respiratory and metabolic acidosis and lost consciousness.', 'drug': 'insulin', 'effect': 'lost consciousness', 'indexes': {'drug': {'start_char': [37], 'end_char': [44]}, 'effect': {'start_char': [233], 'end_char': [251]}}}
{'text': 'After therapy for diabetic coma with insulin (containing the preservative cresol) and electrolyte solutions was started, the patient complained of increasing myalgia, developed a high fever and respiratory and metabolic acidosis and lost consciousness.', 'drug': 'insulin', 'effect': 'respiratory and metabolic acidosis', 'indexes': {'drug': {'start_char': [37], 'end_char': [44]}, 'effect': {'start_char': [194], 'end_char': [228]}}}
{'text': 'After therapy for diabetic coma with insulin (containing the preservative cresol) and electrolyte solutions was started, the patient complained of increasing myalgia, developed a high fever and respiratory and metabolic acidosis and lost consciousness.', 'drug': 'cresol', 'effect': 'respiratory and metabolic acidosis', 'indexes': {'drug': {'start_char': [74], 'end_char': [80]}, 'effect': {'start_char': [194], 'end_char': [228]}}}
```

This is a problem in an NER setting: if we assigned one set of token labels per row as-is, the same tokens in the same sentence would receive different labels from one row to the next. That would confuse the model during fine-tuning, so we first consolidate all of the spans provided for each unique sentence, then label every known entity in a single pass.

In [7]:
consolidated_dataset = {}

for row in datasets["train"]:
    if row["text"] in consolidated_dataset:
        consolidated_dataset[row["text"]]["drug_indices_start"].update(row["indexes"]["drug"]["start_char"])
        consolidated_dataset[row["text"]]["drug_indices_end"].update(row["indexes"]["drug"]["end_char"])
        consolidated_dataset[row["text"]]["effect_indices_start"].update(row["indexes"]["effect"]["start_char"])
        consolidated_dataset[row["text"]]["effect_indices_end"].update(row["indexes"]["effect"]["end_char"])
        consolidated_dataset[row["text"]]["drug"].append(row["drug"])
        consolidated_dataset[row["text"]]["effect"].append(row["effect"])

    else:
        consolidated_dataset[row["text"]] = {
            "text": row["text"],
            "drug": [row["drug"]],
            "effect": [row["effect"]],
            # use sets because the indices can repeat for various reasons
            "drug_indices_start": set(row["indexes"]["drug"]["start_char"]),
            "drug_indices_end": set(row["indexes"]["drug"]["end_char"]),
            "effect_indices_start": set(row["indexes"]["effect"]["start_char"]),
            "effect_indices_end": set(row["indexes"]["effect"]["end_char"])
        }
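
The merge can also be tried on a couple of rows in isolation. The following is a minimal sketch rather than the cell used in this notebook: `raw_rows` is a hypothetical two-row input reconstructed from one sentence of the corpus, and `consolidate` is an illustrative helper name.

```python
from collections import defaultdict

# Hypothetical raw rows: one sentence annotated twice, once per (drug, effect) pair.
TEXT = ("Cutaneous rashes and eruptions can be caused by many medications, "
        "including carbamazepine.")
raw_rows = [
    {"text": TEXT, "drug": "carbamazepine", "effect": "Cutaneous rashes",
     "indexes": {"drug": {"start_char": [76], "end_char": [89]},
                 "effect": {"start_char": [0], "end_char": [16]}}},
    {"text": TEXT, "drug": "carbamazepine", "effect": "eruptions",
     "indexes": {"drug": {"start_char": [76], "end_char": [89]},
                 "effect": {"start_char": [21], "end_char": [30]}}},
]


def consolidate(rows):
    """Group rows by sentence and union the character offsets of every span."""
    merged = defaultdict(lambda: {"drug_start": set(), "drug_end": set(),
                                  "effect_start": set(), "effect_end": set()})
    for row in rows:
        entry = merged[row["text"]]
        for kind in ("drug", "effect"):
            entry[f"{kind}_start"].update(row["indexes"][kind]["start_char"])
            entry[f"{kind}_end"].update(row["indexes"][kind]["end_char"])
    return dict(merged)


merged = consolidate(raw_rows)
print({key: sorted(value) for key, value in merged[TEXT].items()})
```

Because the offsets are stored in sets, the drug span that appears in both rows is kept only once.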

---
With the dataset consolidated, we need to assign per-token labels to each sentence. First, we re-define our Python data structure as a Hugging Face Dataset object.

In [8]:
pd.DataFrame(datasets["train"])

Unnamed: 0,text,drug,effect,indexes
0,Intravenous azithromycin-induced ototoxicity.,azithromycin,ototoxicity,"{'drug': {'start_char': [12], 'end_char': [24]..."
1,"Immobilization, while Paget's bone disease was...",dihydrotachysterol,increased calcium-release,"{'drug': {'start_char': [91], 'end_char': [109..."
2,Unaccountable severe hypercalcemia in a patien...,dihydrotachysterol,hypercalcemia,"{'drug': {'start_char': [84], 'end_char': [102..."
3,METHODS: We report two cases of pseudoporphyri...,naproxen,pseudoporphyria,"{'drug': {'start_char': [58], 'end_char': [66]..."
4,METHODS: We report two cases of pseudoporphyri...,oxaprozin,pseudoporphyria,"{'drug': {'start_char': [71], 'end_char': [80]..."
...,...,...,...,...
6816,Lithium treatment was terminated in 1975 becau...,Lithium,lithium intoxication,"{'drug': {'start_char': [0], 'end_char': [7]},..."
6817,Lithium treatment was terminated in 1975 becau...,lithium,lithium intoxication,"{'drug': {'start_char': [52], 'end_char': [59]..."
6818,Eosinophilia caused by clozapine was observed ...,clozapine,Eosinophilia,"{'drug': {'start_char': [23], 'end_char': [32]..."
6819,Eosinophilia has been encountered from 0.2 to ...,clozapine,Eosinophilia,"{'drug': {'start_char': [55], 'end_char': [64]..."


In [9]:
datasets["train"]["indexes"][0]

{'drug': {'start_char': [12], 'end_char': [24]},
 'effect': {'start_char': [33], 'end_char': [44]}}

In [10]:
df = pd.DataFrame(list(consolidated_dataset.values()))

In [11]:
df.head()

Unnamed: 0,text,drug,effect,drug_indices_start,drug_indices_end,effect_indices_start,effect_indices_end
0,Intravenous azithromycin-induced ototoxicity.,[azithromycin],[ototoxicity],{12},{24},{33},{44}
1,"Immobilization, while Paget's bone disease was...",[dihydrotachysterol],[increased calcium-release],{91},{109},{143},{168}
2,Unaccountable severe hypercalcemia in a patien...,[dihydrotachysterol],[hypercalcemia],{84},{102},{21},{34}
3,METHODS: We report two cases of pseudoporphyri...,"[naproxen, oxaprozin]","[pseudoporphyria, pseudoporphyria]","{58, 71}","{80, 66}",{32},{47}
4,"Naproxen, the most common offender, has been a...",[Naproxen],[erythropoietic protoporphyria],{0},{8},{134},{163}


In [12]:
# since no spans overlap, we can sort to get 1:1 matched index spans
# note that sets don't preserve insertion order

# sorted() accepts a set and already returns a list
df["drug_indices_start"] = df["drug_indices_start"].apply(sorted)
df["drug_indices_end"] = df["drug_indices_end"].apply(sorted)
df["effect_indices_start"] = df["effect_indices_start"].apply(sorted)
df["effect_indices_end"] = df["effect_indices_end"].apply(sorted)
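
To see why sorting is enough to re-pair the offsets, here is a small sketch under the same no-overlap assumption. `pair_spans` is a hypothetical helper, not part of the notebook; it also raises an error if the assumption does not hold.

```python
def pair_spans(starts, ends):
    """Pair start and end offsets into (start, end) spans by sorted position."""
    starts, ends = sorted(starts), sorted(ends)
    if len(starts) != len(ends):
        raise ValueError("every start offset needs exactly one end offset")
    spans = list(zip(starts, ends))
    # Each span must be non-empty ...
    if any(start >= end for start, end in spans):
        raise ValueError("empty or inverted span")
    # ... and must end before the next one begins.
    for (_, end), (next_start, _) in zip(spans, spans[1:]):
        if end > next_start:
            raise ValueError("overlapping spans break the 1:1 pairing")
    return spans


# Offsets taken from rows shown in this notebook (sets are unordered).
drug_spans = pair_spans({58, 71}, {80, 66})
effect_spans = pair_spans({0, 21}, {16, 30})
print(drug_spans, effect_spans)
```

If two spans overlapped or were nested, sorted starts and sorted ends could pair up incorrectly, which is why the check is worth keeping.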

In [13]:
# save to JSON to then import into Dataset object
df.to_json("dataset.jsonl", orient="records", lines=True)

In [14]:
cons_dataset = load_dataset("json", data_files="dataset.jsonl")


In [15]:
# no train-test provided, so we create our own
cons_dataset = cons_dataset["train"].train_test_split()

In [16]:
cons_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'drug', 'effect', 'drug_indices_start', 'drug_indices_end', 'effect_indices_start', 'effect_indices_end'],
        num_rows: 3203
    })
    test: Dataset({
        features: ['text', 'drug', 'effect', 'drug_indices_start', 'drug_indices_end', 'effect_indices_start', 'effect_indices_end'],
        num_rows: 1068
    })
})
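
The split sizes above can be checked with a little arithmetic. This sketch assumes the default test fraction is 25% and that the test size is rounded up; both are assumptions about `train_test_split` defaults rather than something stated in this notebook, and `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_rows, test_fraction=0.25):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(test_fraction * n_rows)
    return n_rows - n_test, n_test


# 3203 + 1068 = 4271 consolidated sentences in total.
n_train, n_test = split_sizes(3203 + 1068)
print(n_train, n_test)
```

The split is shuffled, so passing a fixed `seed` to `train_test_split` is one way to make the same rows land in each split on every run.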

---
## Token Labeling

Finally, we can label each token with its entity. We use BIO tagging on two entities, `DRUG` and `EFFECT`. This results in five possible classes for each token:

* `O` - outside any entity we care about
* `B-DRUG` - the beginning of a `DRUG` entity
* `I-DRUG` - inside a `DRUG` entity
* `B-EFFECT` - the beginning of an `EFFECT` entity
* `I-EFFECT` - inside an `EFFECT` entity
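
As a quick illustration of the scheme, the sketch below builds the id mappings for these five classes and checks whether a tag sequence is well formed. `is_valid_bio` is a hypothetical helper added for illustration only.

```python
label_list = ["O", "B-DRUG", "I-DRUG", "B-EFFECT", "I-EFFECT"]
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for label, i in label2id.items()}


def is_valid_bio(tags):
    """Return True if every I- tag continues an entity of the same type."""
    previous = "O"
    for tag in tags:
        # "O"[2:] is empty, so an I- tag directly after "O" is rejected.
        if tag.startswith("I-") and previous[2:] != tag[2:]:
            return False
        previous = tag
    return True


good = ["B-EFFECT", "I-EFFECT", "O", "B-DRUG", "I-DRUG", "I-DRUG", "O"]
bad = ["O", "I-DRUG"]
print(is_valid_bio(good), is_valid_bio(bad))
```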

In [17]:
label_list = ['O', 'B-DRUG', 'I-DRUG', 'B-EFFECT', 'I-EFFECT']

custom_seq = Sequence(feature=ClassLabel(num_classes=5,
                                         names=label_list,
                                         names_file=None, id=None), length=-1, id=None)

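# Note: the dataset summaries printed below show no `ner_tags` column, so these
# assignments do not change the datasets; the labels are added later by `map`.
cons_dataset["train"].features["ner_tags"] = custom_seq
cons_dataset["test"].features["ner_tags"] = custom_seq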

In [18]:
custom_seq

Sequence(feature=ClassLabel(names=['O', 'B-DRUG', 'I-DRUG', 'B-EFFECT', 'I-EFFECT'], id=None), length=-1, id=None)

In [19]:
# cons_dataset["train"].features["ner_tags"]

In [20]:
cons_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'drug', 'effect', 'drug_indices_start', 'drug_indices_end', 'effect_indices_start', 'effect_indices_end'],
        num_rows: 3203
    })
    test: Dataset({
        features: ['text', 'drug', 'effect', 'drug_indices_start', 'drug_indices_end', 'effect_indices_start', 'effect_indices_end'],
        num_rows: 1068
    })
})

In [21]:
model_checkpoint = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [22]:
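def generate_row_labels(row, verbose=False):
    """Given a row from the consolidated `Ade_corpus_v2_drug_ade_relation` dataset,
    generate BIO tag ids for drug and effect entities.

    Returns the tokenizer output with an added `labels` list, in which
    special tokens ([CLS], [SEP]) are labeled -100.
    """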

    text = row["text"]

    labels = []
    label = "O"
    prefix = ""

    # while iterating through tokens, increment to traverse all drug and effect spans
    drug_index = 0
    effect_index = 0

    tokens = tokenizer(text, return_offsets_mapping=True)

    for n in range(len(tokens["input_ids"])):
        offset_start, offset_end = tokens["offset_mapping"][n]

        # should only happen for [CLS] and [SEP]
        if offset_end - offset_start == 0:
            labels.append(-100)
            continue

        if drug_index < len(row["drug_indices_start"]) and offset_start == row["drug_indices_start"][drug_index]:
            label = "DRUG"
            prefix = "B-"

        elif effect_index < len(row["effect_indices_start"]) and offset_start == row["effect_indices_start"][effect_index]:
            label = "EFFECT"
            prefix = "B-"

        labels.append(label_list.index(f"{prefix}{label}"))

        if drug_index < len(row["drug_indices_end"]) and offset_end == row["drug_indices_end"][drug_index]:
            label = "O"
            prefix = ""
            drug_index += 1

        elif effect_index < len(row["effect_indices_end"]) and offset_end == row["effect_indices_end"][effect_index]:
            label = "O"
            prefix = ""
            effect_index += 1

        # need to transition "inside" if we just entered an entity
        if prefix == "B-":
            prefix = "I-"

    if verbose:
        print(f"{row}\n")
        orig = tokenizer.convert_ids_to_tokens(tokens["input_ids"])
        for n in range(len(labels)):
            print(orig[n], labels[n])
    tokens["labels"] = labels

    return tokens


In [23]:
cons_dataset["train"][56]

{'text': 'Cutaneous rashes and eruptions can be caused by many medications, including carbamazepine.',
 'drug': ['carbamazepine', 'carbamazepine'],
 'effect': ['Cutaneous rashes', 'eruptions'],
 'drug_indices_start': [76],
 'drug_indices_end': [89],
 'effect_indices_start': [0, 21],
 'effect_indices_end': [16, 30]}

In [24]:
cons_dataset["train"][2]["text"][11:31]

'cute cardiomyopathy '

In [25]:
# testing out...

generate_row_labels(cons_dataset["train"][3], verbose=True)

{'text': 'Multiple seizures after bupropion overdose in a small child.', 'drug': ['bupropion'], 'effect': ['Multiple seizures'], 'drug_indices_start': [24], 'drug_indices_end': [33], 'effect_indices_start': [0], 'effect_indices_end': [17]}

[CLS] -100
multiple 3
seizures 4
after 0
bup 1
##rop 2
##ion 2
over 0
##dose 0
in 0
a 0
small 0
child 0
. 0
[SEP] -100


{'input_ids': [102, 1624, 12787, 647, 22756, 1036, 329, 573, 9579, 121, 106, 952, 1326, 205, 103], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'offset_mapping': [(0, 0), (0, 8), (9, 17), (18, 23), (24, 27), (27, 30), (30, 33), (34, 38), (38, 42), (43, 45), (46, 47), (48, 53), (54, 59), (59, 60), (0, 0)], 'labels': [-100, 3, 4, 0, 1, 2, 2, 0, 0, 0, 0, 0, 0, 0, -100]}
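
The labels printed above can be reproduced from the offset mapping alone, without loading a tokenizer. The sketch below is a containment-based variant, not the state machine used in `generate_row_labels`: a token gets an entity tag when its character range falls inside a span. `bio_from_offsets` is a hypothetical helper name.

```python
LABELS = ["O", "B-DRUG", "I-DRUG", "B-EFFECT", "I-EFFECT"]


def bio_from_offsets(offsets, spans):
    """Assign a BIO label id to each token.

    offsets: (start, end) character offsets per token; (0, 0) marks special tokens.
    spans:   (start, end, entity_type) character spans of the annotated entities.
    """
    labels = []
    for tok_start, tok_end in offsets:
        if tok_start == tok_end:  # [CLS] / [SEP]
            labels.append(-100)
            continue
        tag = "O"
        for span_start, span_end, entity in spans:
            if tok_start >= span_start and tok_end <= span_end:
                prefix = "B-" if tok_start == span_start else "I-"
                tag = prefix + entity
                break
        labels.append(LABELS.index(tag))
    return labels


# Offsets and spans copied from the bupropion example above.
offsets = [(0, 0), (0, 8), (9, 17), (18, 23), (24, 27), (27, 30), (30, 33),
           (34, 38), (38, 42), (43, 45), (46, 47), (48, 53), (54, 59), (59, 60), (0, 0)]
spans = [(0, 17, "EFFECT"), (24, 33, "DRUG")]
labels = bio_from_offsets(offsets, spans)
print(labels)
```

Unlike the start/end matching in the notebook's function, this variant does not depend on a token boundary coinciding exactly with the end of a span before it returns to `O`.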

In [26]:
labeled_dataset = cons_dataset.map(generate_row_labels)


In [27]:
task = "ner" # Should be one of "ner", "pos" or "chunk"
model_checkpoint = "allenai/scibert_scivocab_uncased"
batch_size = 16

In [28]:
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list))

Some weights of BertForTokenClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [29]:
!pip install transformers[torch]
!pip install accelerate -U




In [30]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.05,
    logging_steps=1
)


In [31]:
data_collator = DataCollatorForTokenClassification(tokenizer)

In [32]:
import torch
from torch.utils.data import DataLoader

In [33]:
metric = load_metric("seqeval")


In [34]:
def compute_metrics(eval_pred):
    """Compute seqeval precision, recall, F1 and accuracy, ignoring positions labeled -100."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=2)

    # Remove ignored index (special tokens) and map ids to tag names
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
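
The filtering step is easier to follow on a toy batch. The sketch below uses plain Python lists in place of NumPy arrays and made-up logits; `align` and `argmax` are hypothetical helpers and no metric library is involved.

```python
label_list = ["O", "B-DRUG", "I-DRUG", "B-EFFECT", "I-EFFECT"]


def argmax(scores):
    """Index of the highest score in a list."""
    return max(range(len(scores)), key=scores.__getitem__)


def align(logits, label_ids):
    """Drop positions labeled -100 and convert ids to tag names."""
    predictions, references = [], []
    for sent_logits, sent_labels in zip(logits, label_ids):
        kept = [(argmax(tok), lab)
                for tok, lab in zip(sent_logits, sent_labels) if lab != -100]
        predictions.append([label_list[p] for p, _ in kept])
        references.append([label_list[l] for _, l in kept])
    return predictions, references


# One made-up sentence: [CLS], two real tokens, [SEP].
toy_logits = [[[9.0, 0.0, 0.0, 0.0, 0.0],
               [0.1, 2.0, 0.0, 0.0, 0.0],
               [0.2, 0.1, 0.0, 3.0, 0.0],
               [0.0, 0.0, 0.0, 0.0, 9.0]]]
toy_labels = [[-100, 1, 0, -100]]
preds, refs = align(toy_logits, toy_labels)
print(preds, refs)
```

The special-token positions vanish from both lists, so whatever the model predicts there never affects the score.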

1. Random seed initialization

The seed is fixed to a constant value (42) so that runs are repeatable, and the same value is passed to the `Trainer` through its `TrainingArguments`.

In [35]:
from transformers import set_seed
import random

In [36]:
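seed_value = 42
random.seed(seed_value)
set_seed(seed_value)  # also seeds NumPy and PyTorch, not only Python's `random`
# Note: this is a fresh TrainingArguments with library defaults, so the `args`
# configured earlier (learning rate, batch size, epochs) is not used by the Trainer.
training_args = TrainingArguments(output_dir="./output", seed=seed_value)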
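
What a fixed seed buys can be shown with the standard library alone. This is a minimal sketch using only Python's `random` module, with `sample` as a hypothetical helper; `set_seed` from `transformers` applies the same idea to the other random number generators involved in training.

```python
import random


def sample(seed, n=5):
    """Re-seed the generator, then draw n pseudo-random numbers."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = sample(42)
second = sample(42)
other = sample(43)
print(first == second, first == other)
```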

2. Using a PyTorch DataLoader

In [37]:
train_dataset=labeled_dataset["train"]

train_dataloader = DataLoader(
    train_dataset,
    batch_size=training_args.per_device_train_batch_size,
    collate_fn=data_collator,
    shuffle=True,
)

In [38]:
trainer = Trainer(
    model,
    args=training_args,
    train_dataset=labeled_dataset["train"],
    eval_dataset=labeled_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,


)

In [39]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.166
1000,0.0567


TrainOutput(global_step=1203, training_loss=0.09704284002061497, metrics={'train_runtime': 151.637, 'train_samples_per_second': 63.368, 'train_steps_per_second': 7.933, 'total_flos': 231812234620980.0, 'train_loss': 0.09704284002061497, 'epoch': 3.0})

In [40]:
predictions, labels, _ = trainer.predict(labeled_dataset["test"])
predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
true_predictions = [
    [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

results = metric.compute(predictions=true_predictions, references=true_labels)
results

{'DRUG': {'precision': 0.9411764705882353,
  'recall': 0.9640062597809077,
  'f1': 0.9524545805952842,
  'number': 1278},
 'EFFECT': {'precision': 0.829477858559154,
  'recall': 0.8709229701596114,
  'f1': 0.8496953283683143,
  'number': 1441},
 'overall_precision': 0.8812898653437279,
 'overall_recall': 0.9146745126884884,
 'overall_f1': 0.897671900378993,
 'overall_accuracy': 0.9670778915379328}
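
As a sanity check on the numbers above, each reported F1 should be the harmonic mean of the precision and recall on the same row. The sketch below recomputes it from the printed values; `f1_score` is a hypothetical helper, not the metric library's function.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results above.
reported = {
    "DRUG": (0.9411764705882353, 0.9640062597809077, 0.9524545805952842),
    "EFFECT": (0.829477858559154, 0.8709229701596114, 0.8496953283683143),
    "overall": (0.8812898653437279, 0.9146745126884884, 0.897671900378993),
}

recomputed = {name: f1_score(p, r) for name, (p, r, _) in reported.items()}
for name, (_, _, f1) in reported.items():
    print(f"{name}: recomputed={recomputed[name]:.4f} reported={f1:.4f}")
```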

In [41]:
trainer.save_model("./model")

In [42]:
tokenizer = AutoTokenizer.from_pretrained("./model")
model = AutoModelForTokenClassification.from_pretrained("./model", id2label={0: 'O',1:'B-DRUG',2:'I-DRUG',3:'B-EFFECT',4:'I-EFFECT'})

In [43]:
effect_ner_model = pipeline(task="ner", model=model, tokenizer=tokenizer, device=-1)

In [44]:
label_list = ['O', 'B-DRUG', 'I-DRUG', 'B-EFFECT', 'I-EFFECT']

In [45]:
def visualize_entities(sentence):
    print(sentence)
    tokens = effect_ner_model(sentence)
    entities = []

    for token in tokens:
        print(token)
        # `entity` already holds the tag name (e.g. "B-DRUG"), so use it directly
        label = token["entity"]
        if label != "O":
            token["label"] = label
            entities.append(token)

    params = [{"text": sentence,
               "ents": entities,
               "title": None}]

    html = displacy.render(params, style="ent", manual=True, options={
        "colors": {
                   "B-DRUG": "#f08080",
                   "I-DRUG": "#f08080",
                   "B-EFFECT": "#9bddff",
                   "I-EFFECT": "#9bddff",
               },
    })

In [47]:
examples = [
    "Abortion, miscarriage or uterine hemorrhage associated with misoprostol (Cytotec), a labor-inducing drug.",
    "Addiction to many sedatives and analgesics, such as diazepam, morphine, etc.",
    "Birth defects associated with thalidomide",
    "Bleeding of the intestine associated with aspirin therapy",
    "Cardiovascular disease associated with COX-2 inhibitors (i.e. Vioxx)",
    "Deafness and kidney failure associated with gentamicin (an antibiotic)"
]

# for example in examples:
#     visualize_entities(example)
#     print(f"{'*' * 50}\n")




In [48]:
text=examples

data = effect_ner_model(text)
new = pd.DataFrame.from_dict(data)

In [49]:
new

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,"{'entity': 'B-EFFECT', 'score': 0.99880075, 'i...","{'entity': 'B-EFFECT', 'score': 0.9995122, 'in...","{'entity': 'I-EFFECT', 'score': 0.99945575, 'i...","{'entity': 'I-EFFECT', 'score': 0.9992884, 'in...","{'entity': 'B-EFFECT', 'score': 0.99929285, 'i...","{'entity': 'I-EFFECT', 'score': 0.99964595, 'i...","{'entity': 'B-DRUG', 'score': 0.9994935, 'inde...","{'entity': 'I-DRUG', 'score': 0.9992181, 'inde...","{'entity': 'I-DRUG', 'score': 0.99927825, 'ind..."
1,"{'entity': 'B-EFFECT', 'score': 0.8873475, 'in...","{'entity': 'B-DRUG', 'score': 0.99756444, 'ind...","{'entity': 'I-DRUG', 'score': 0.9975787, 'inde...","{'entity': 'I-DRUG', 'score': 0.997563, 'index...","{'entity': 'B-DRUG', 'score': 0.9977234, 'inde...",,,,
2,"{'entity': 'B-EFFECT', 'score': 0.988674, 'ind...","{'entity': 'I-EFFECT', 'score': 0.99809664, 'i...","{'entity': 'B-DRUG', 'score': 0.9972451, 'inde...","{'entity': 'I-DRUG', 'score': 0.9988702, 'inde...","{'entity': 'I-DRUG', 'score': 0.998831, 'index...","{'entity': 'I-DRUG', 'score': 0.9984609, 'inde...",,,
3,"{'entity': 'B-EFFECT', 'score': 0.99645764, 'i...","{'entity': 'I-EFFECT', 'score': 0.99839526, 'i...","{'entity': 'I-EFFECT', 'score': 0.9986589, 'in...","{'entity': 'I-EFFECT', 'score': 0.99819404, 'i...","{'entity': 'B-DRUG', 'score': 0.99521047, 'ind...",,,,
4,"{'entity': 'B-EFFECT', 'score': 0.99902606, 'i...","{'entity': 'I-EFFECT', 'score': 0.9996542, 'in...","{'entity': 'B-DRUG', 'score': 0.99911886, 'ind...","{'entity': 'I-DRUG', 'score': 0.9989631, 'inde...",,,,,
5,"{'entity': 'B-EFFECT', 'score': 0.996176, 'ind...","{'entity': 'I-EFFECT', 'score': 0.9957082, 'in...","{'entity': 'B-EFFECT', 'score': 0.99125624, 'i...","{'entity': 'I-EFFECT', 'score': 0.9948257, 'in...","{'entity': 'B-DRUG', 'score': 0.9965502, 'inde...",,,,


In [50]:
tokens = effect_ner_model(text)
formatted_dataset = {'drugs':[], 'effect':[], 'drug_indices_start': [],
 'drug_indices_end': [],
 'effect_indices_start': [],
 'effect_indices_end': []}

list2 = [2,1]
formatted_dataset["drugs"].append(list2[0])
formatted_dataset

{'drugs': [2],
 'effect': [],
 'drug_indices_start': [],
 'drug_indices_end': [],
 'effect_indices_start': [],
 'effect_indices_end': []}

In [51]:
effect_ner_model(text)[0]

[{'entity': 'B-EFFECT',
  'score': 0.99880075,
  'index': 1,
  'word': 'abortion',
  'start': 0,
  'end': 8},
 {'entity': 'B-EFFECT',
  'score': 0.9995122,
  'index': 3,
  'word': 'misc',
  'start': 10,
  'end': 14},
 {'entity': 'I-EFFECT',
  'score': 0.99945575,
  'index': 4,
  'word': '##arri',
  'start': 14,
  'end': 18},
 {'entity': 'I-EFFECT',
  'score': 0.9992884,
  'index': 5,
  'word': '##age',
  'start': 18,
  'end': 21},
 {'entity': 'B-EFFECT',
  'score': 0.99929285,
  'index': 7,
  'word': 'uterine',
  'start': 25,
  'end': 32},
 {'entity': 'I-EFFECT',
  'score': 0.99964595,
  'index': 8,
  'word': 'hemorrhage',
  'start': 33,
  'end': 43},
 {'entity': 'B-DRUG',
  'score': 0.9994935,
  'index': 11,
  'word': 'mis',
  'start': 60,
  'end': 63},
 {'entity': 'I-DRUG',
  'score': 0.9992181,
  'index': 12,
  'word': '##oprost',
  'start': 63,
  'end': 69},
 {'entity': 'I-DRUG',
  'score': 0.99927825,
  'index': 13,
  'word': '##ol',
  'start': 69,
  'end': 71}]
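
The `B-`/`I-` prefixes in the token-level output above are enough to rebuild whole entities. The following is a rough sketch of that grouping, not the pipeline's own aggregation code: `group_entities` is a hypothetical helper, and it assumes an `I-` token continues the previous group only when it has the same type and the next token index.

```python
def group_entities(tokens, text):
    """Merge consecutive B-/I- tokens of the same type into character spans."""
    groups = []
    last_index = None
    for tok in tokens:
        prefix, entity = tok["entity"].split("-", 1)
        continues = (
            prefix == "I"
            and groups
            and groups[-1]["entity_group"] == entity
            and tok["index"] == last_index + 1
        )
        if continues:
            groups[-1]["end"] = tok["end"]
        else:
            groups.append({"entity_group": entity,
                           "start": tok["start"], "end": tok["end"]})
        last_index = tok["index"]
    for group in groups:
        group["word"] = text[group["start"]:group["end"]]
    return groups


text = ("Abortion, miscarriage or uterine hemorrhage associated with "
        "misoprostol (Cytotec), a labor-inducing drug.")
# (entity, index, start, end) copied from the pipeline output above.
raw = [("B-EFFECT", 1, 0, 8), ("B-EFFECT", 3, 10, 14), ("I-EFFECT", 4, 14, 18),
       ("I-EFFECT", 5, 18, 21), ("B-EFFECT", 7, 25, 32), ("I-EFFECT", 8, 33, 43),
       ("B-DRUG", 11, 60, 63), ("I-DRUG", 12, 63, 69), ("I-DRUG", 13, 69, 71)]
tokens = [dict(zip(("entity", "index", "start", "end"), row)) for row in raw]

entities = group_entities(tokens, text)
for entity in entities:
    print(entity)
```

Taking the word from the original text by character offsets avoids having to strip the `##` word-piece markers by hand.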

In [52]:
ner = pipeline(task="ner", model=model, tokenizer=tokenizer, device=-1, aggregation_strategy='simple')
temps = ner(text)
for temp in temps:
    print(temp)

[{'entity_group': 'EFFECT', 'score': 0.99880075, 'word': 'abortion', 'start': 0, 'end': 8}, {'entity_group': 'EFFECT', 'score': 0.99941874, 'word': 'miscarriage', 'start': 10, 'end': 21}, {'entity_group': 'EFFECT', 'score': 0.9994694, 'word': 'uterine hemorrhage', 'start': 25, 'end': 43}, {'entity_group': 'DRUG', 'score': 0.99933, 'word': 'misoprostol', 'start': 60, 'end': 71}]
[{'entity_group': 'EFFECT', 'score': 0.8873475, 'word': 'addiction', 'start': 0, 'end': 9}, {'entity_group': 'DRUG', 'score': 0.9975688, 'word': 'diazepam', 'start': 52, 'end': 60}, {'entity_group': 'DRUG', 'score': 0.9977234, 'word': 'morphine', 'start': 62, 'end': 70}]
[{'entity_group': 'EFFECT', 'score': 0.9933853, 'word': 'birth defects', 'start': 0, 'end': 13}, {'entity_group': 'DRUG', 'score': 0.9983518, 'word': 'thalidomide', 'start': 30, 'end': 41}]
[{'entity_group': 'EFFECT', 'score': 0.9979264, 'word': 'bleeding of the intestine', 'start': 0, 'end': 25}, {'entity_group': 'DRUG', 'score': 0.99521047, 'w