# Installing Packages
Install the packages this project depends on:
- `datasets`: A library for easily accessing and sharing datasets.
- `openai`: The official Python library for the OpenAI API.
- `python-dotenv`: A Python library for loading environment variables from a .env file.
- `tqdm`: A fast, extensible progress bar for Python and CLI.

In [1]:
%pip install datasets
%pip install openai  
%pip install python-dotenv
%pip install tqdm

Note: you may need to restart the kernel to use updated packages.

# Imports
Import the modules and names used throughout the notebook, then load environment variables from the `.env` file with `load_dotenv()`:
- `load_dataset`: A function from the `datasets` library for loading datasets.
- `os`: A module for interacting with the operating system.
- `openai`: The official Python library for the OpenAI API.
- `json`: A module for working with JSON data.
- `random`: A module for generating random numbers.
- `OpenAI`: A class from the `openai` library for interacting with the OpenAI API.
- `load_dotenv`: A function from the `dotenv` library for loading environment variables from a .env file.
- `tqdm`: A progress-bar wrapper for iterables, imported from the `tqdm` library.
- `Pool`: A class from the standard-library `multiprocessing` module for creating a pool of worker processes.

In [10]:
from datasets import load_dataset
import os
import openai
import json
import random
from openai import OpenAI
from dotenv import load_dotenv
from tqdm import tqdm
from multiprocessing import Pool
load_dotenv()

True

# Loading the Dataset
Load the BC5CDR dataset with the `load_dataset` function from the `datasets` library. BC5CDR is a collection of biomedical texts annotated with chemical and disease mentions. The call returns a `DatasetDict` object containing the training, validation, and test splits.

Source: https://huggingface.co/datasets/tner/bc5cdr


In [11]:

openai.api_key = os.getenv("OPENAI_API_KEY")

ds = load_dataset("tner/bc5cdr")

print(ds)
# Save the training split to JSON (the validation and test splits are not written)
with open("bc5cdr_full.json", "w", encoding="utf-8") as f:
    json.dump(
        {
            "train": [item for item in ds["train"]],
        },
        f,
        ensure_ascii=False,
        indent=2
    )
"""
First three items.:
{'tokens': ['Naloxone', 'reverses', 'the', 'antihypertensive', 'effect', 'of', 'clonidine', '.'], 'tags': [1, 0, 0, 0, 0, 0, 1, 0]}
{'tokens': ['In', 'unanesthetized', ',', 'spontaneously', 'hypertensive', 'rats', 'the', 'decrease', 'in', 'blood', 'pressure', 'and', 'heart', 'rate', 'produced', 'by', 'intravenous', 'clonidine', ',', '5', 'to', '20', 'micrograms', '/', 'kg', ',', 'was', 'inhibited', 'or', 'reversed', 'by', 'nalozone', ',', '0', '.'], 'tags': [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]}
{'tokens': ['2', 'to', '2', 'mg', '/', 'kg', '.'], 'tags': [0, 0, 0, 0, 0, 0, 0]}
"""

DatasetDict({
    train: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 5228
    })
    validation: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 5330
    })
    test: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 5865
    })
})

In [12]:
# print first 3 examples
for i in range(3):
    print(ds["train"][i])


# print length of ds
print(len(ds["train"]))
print(len(ds["test"]))
print(len(ds["validation"]))

{'tokens': ['Naloxone', 'reverses', 'the', 'antihypertensive', 'effect', 'of', 'clonidine', '.'], 'tags': [1, 0, 0, 0, 0, 0, 1, 0]}
{'tokens': ['In', 'unanesthetized', ',', 'spontaneously', 'hypertensive', 'rats', 'the', 'decrease', 'in', 'blood', 'pressure', 'and', 'heart', 'rate', 'produced', 'by', 'intravenous', 'clonidine', ',', '5', 'to', '20', 'micrograms', '/', 'kg', ',', 'was', 'inhibited', 'or', 'reversed', 'by', 'nalozone', ',', '0', '.'], 'tags': [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]}
{'tokens': ['2', 'to', '2', 'mg', '/', 'kg', '.'], 'tags': [0, 0, 0, 0, 0, 0, 0]}
5228
5865
5330


# Filtering Data
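Keep only the sentences whose `tags` contain both 1 and 2. Judging from the examples printed above, 1 marks the first token of a chemical mention and 2 the first token of a disease mention, so every retained sentence mentions at least one chemical and one disease.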

In [13]:
# Filter data such that 'tags' include both 1 and 2.
filtered_ds = {}
filtered_ds["train"] = [example for example in ds["train"] if 1 in example["tags"] and 2 in example["tags"]]
filtered_ds["test"] = [example for example in ds["test"] if 1 in example["tags"] and 2 in example["tags"]]
filtered_ds["validation"] = [example for example in ds["validation"] if 1 in example["tags"] and 2 in example["tags"]]
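The three comprehensions above repeat the same condition. A minimal sketch of factoring it into one helper follows. The names `has_both_entity_types` and `filter_split` are hypothetical, the meaning of tags 1 and 2 is an assumption inferred from the printed examples, and the two sample sentences are toy data, not dataset rows.

```python
# Assumption: tag 1 opens a chemical mention and tag 2 opens a disease mention,
# as the printed examples suggest (e.g. 'clonidine' -> 1, 'hypertensive' -> 2).
CHEMICAL_START = 1
DISEASE_START = 2


def has_both_entity_types(example):
    """Return True if the sentence has at least one chemical and one disease."""
    tags = set(example["tags"])
    return CHEMICAL_START in tags and DISEASE_START in tags


def filter_split(split):
    """Keep only the examples that mention both entity types."""
    return [example for example in split if has_both_entity_types(example)]


# Toy data for illustration only.
sample = [
    {"tokens": ["Naloxone", "reverses", "clonidine", "."], "tags": [1, 0, 1, 0]},
    {"tokens": ["cocaine", "and", "myocardial", "injury"], "tags": [1, 0, 2, 3]},
]
kept = filter_split(sample)
print(len(kept))
```

With this helper, each split could be filtered with a call such as `filter_split(ds["train"])`, which keeps the condition in one place.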



In [14]:
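# Print the tags of 15 random training examples.
# random.choice avoids the IndexError that random.randint(0, len(...)) can raise,
# because randint includes its upper bound.
for _ in range(15):
    print(random.choice(filtered_ds["train"])["tags"])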

# print length of filtered_ds
print(len(filtered_ds["train"]))
print(len(filtered_ds["test"]))


print(filtered_ds["train"][15])



[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 0, 1, 4, 0, 0, 2, 0, 0, 0, 0, 0]
[0, 0, 1, 0, 0, 0, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 1]
[0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 0, 2, 3, 3, 0, 0, 1, 0, 1, 0, 2, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 0, 0, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [15]:
data = filtered_ds["train"][15]
tokens = data["tokens"]
tags = data["tags"]

print(tokens)
print(tags)

['Because', 'of', 'the', 'need', 'for', 'the', 'development', 'of', 'new', 'treatments', 'for', "Crohn's", 'disease', ',', 'a', 'pilot', 'study', 'was', 'undertaken', 'to', 'estimate', 'the', 'pharmacodynamics', 'and', 'tolerability', 'of', 'fusidic', 'acid', 'treatment', 'in', 'chronic', 'active', ',', 'therapy', '-', 'resistant', 'patients', '.']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
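To make the tag sequence easier to read, the tagged tokens can be grouped into entity spans. The sketch below assumes a label scheme inferred from this example ("Crohn's disease" is tagged 2, 3 and "fusidic acid" is tagged 1, 4); the mapping and the function name `extract_spans` are assumptions, not something the dataset card is quoted on here.

```python
# Assumed label scheme, inferred from the example above:
# 0 = outside, 1 = chemical start, 4 = chemical continuation,
# 2 = disease start, 3 = disease continuation.
START = {1: "Chemical", 2: "Disease"}
CONTINUE = {4: "Chemical", 3: "Disease"}


def extract_spans(tokens, tags):
    """Group tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag in START:
            # A start tag closes any open span and begins a new one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = START[tag], [token]
        elif tag in CONTINUE and CONTINUE[tag] == current_type:
            current_tokens.append(token)
        else:
            # An outside tag (or a continuation without a matching start) ends the span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# A shortened excerpt of the sentence printed above.
tokens = ["treatments", "for", "Crohn's", "disease", ",", "fusidic", "acid", "treatment"]
tags = [0, 0, 2, 3, 0, 1, 4, 0]
spans = extract_spans(tokens, tags)
print(spans)
```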


# Translating Text

In [16]:

def translate_text(text, target_language):
    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are an expert in translating medicine-related scientific literature. Only provide the translated sentence."
            },
            {
                "role": "user",
                "content": f"Translate to {target_language}: {text}"
            }
        ],
        temperature=0.2,
        max_tokens=2048,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )

    return response.choices[0].message.content.strip()


    



In [17]:
def tokenize_text(text):
    return text.split(" ")
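Splitting on spaces leaves punctuation attached to the preceding word, whereas the dataset stores punctuation such as `'.'` as separate tokens (the translated output further down contains `'sübutları.'` as a single token). A possible alternative, offered as a sketch rather than the notebook's method, is a regex tokenizer; `tokenize_with_punctuation` is a hypothetical name.

```python
import re


def tokenize_with_punctuation(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+(?:'\w+)* keeps words with internal apostrophes (e.g. "Crohn's") together;
    # [^\w\s] emits every other non-space character as its own token.
    # In Python 3, \w is Unicode-aware, so letters such as 'ü' or 'ı' are matched.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


example = "Naloxone reverses the effect of clonidine."
tokens = tokenize_with_punctuation(example)
print(tokens)
```

Whether this matches the dataset's original tokenization in every case (hyphens, slashes, decimal numbers) is not verified here, so spot-check a few sentences before relying on it.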
    


In [18]:
# Build the input string from the tokens of the example selected in In [15]
token_string = " ".join(tokens)
translate_text(token_string, "Turkish")

In [19]:
tokenized_translation = tokenize_text(translate_text(token_string, "Turkish"))
print(tokenized_translation)

In [20]:
# Read the system prompt for the tag-translation step from its text file
with open("tag_translator_system_prompt.txt", "r", encoding="utf-8") as f:
    system_prompt = f.read()

def translate_tags(system_prompt, original_tokens, original_tags, tokenized_translation):
    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": system_prompt
                    }
                ]
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"""Based on information provided to you evaluate corresponding tags of words in the new translated set of words.\nOriginal Tokens: {str(original_tokens)}\n\nOriginal Tags: {str(original_tags)}\n\nTranslated Tokens: {str(tokenized_translation)}\n\n\nGive translated tags as a list of integers given in a json object.\n\n
            json content => translated_tags: [integers separated by a comma]"""
                    }
                ]
            }
        ],
        # JSON mode: the model is asked to return a single JSON object
        response_format={
            "type": "json_object"
        },
        temperature=0.2,
        max_tokens=2048,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )

    # Extract the translated tags from the response
    translated_tags = json.loads(response.choices[0].message.content)["translated_tags"]
    return translated_tags
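`translate_tags` returns whatever list the model produced, so a cheap sanity check before accepting the result may be worthwhile. The sketch below is an assumption about what "usable" means here: one tag per translated token, and only the tag values 0 to 4 that appear in this notebook. `validate_translated_tags` is a hypothetical helper.

```python
ALLOWED_TAGS = {0, 1, 2, 3, 4}  # assumption: the only tag values seen in this notebook


def validate_translated_tags(translated_tokens, translated_tags):
    """Return a list of problems; an empty list means the tags look usable."""
    if not isinstance(translated_tags, list):
        return ["translated_tags is not a list"]
    problems = []
    if len(translated_tags) != len(translated_tokens):
        problems.append(
            f"expected {len(translated_tokens)} tags, got {len(translated_tags)}"
        )
    # Check the type first so unhashable values never reach the set lookup.
    unexpected = [
        tag for tag in translated_tags
        if not isinstance(tag, int) or tag not in ALLOWED_TAGS
    ]
    if unexpected:
        problems.append(f"unexpected tag values: {unexpected}")
    return problems


print(validate_translated_tags(["kokain", "."], [1, 0]))
print(validate_translated_tags(["kokain", "."], [1]))
```

A length check only confirms that the shapes agree; it cannot tell whether each tag landed on the right translated word.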




In [21]:
def translate_tokens(element, target_language):
    tokens = element["tokens"]
    token_string = " ".join(tokens)
    translated_string = translate_text(token_string, target_language)
    translated_tokens = tokenize_text(translated_string)
    return translated_tokens

In [28]:
def translate_element(element, target_language):
    translated_tokens = translate_tokens(element, target_language)
    translated_tags = translate_tags(system_prompt, element["tokens"], element["tags"], translated_tokens)
    original_hash = hash(tuple(element["tokens"]))
    return {"hash": original_hash,"original_tokens": element["tokens"], "tokens": translated_tokens, "original_tags": element["tags"], "tags": translated_tags}
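One caveat about the `hash` field: Python salts the built-in `hash()` of strings per interpreter process unless `PYTHONHASHSEED` is fixed, so the value generally differs between runs and cannot be used to match records across sessions. If a reproducible identifier is wanted, a digest from `hashlib` is one option. This is a sketch of an alternative, not what the notebook does, and `stable_hash` is a hypothetical name.

```python
import hashlib
import json


def stable_hash(tokens):
    """Return a SHA-256 hex digest that is identical across runs and processes."""
    # Serialize the token list deterministically before hashing.
    payload = json.dumps(list(tokens), ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = stable_hash(["cocaine", "abusers", "."])
print(digest)
```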

In [30]:
random_element = filtered_ds["train"][18]

print(random_element)

translated_element = translate_element(random_element, "Azerbaijani")
print(translated_element)

{'tokens': ['Electrocardiographic', 'evidence', 'of', 'myocardial', 'injury', 'in', 'psychiatrically', 'hospitalized', 'cocaine', 'abusers', '.'], 'tags': [0, 0, 0, 2, 3, 0, 0, 0, 1, 0, 0]}
{'hash': 3656629275778296211, 'original_tokens': ['Electrocardiographic', 'evidence', 'of', 'myocardial', 'injury', 'in', 'psychiatrically', 'hospitalized', 'cocaine', 'abusers', '.'], 'tokens': ['Psixiatrik', 'xəstəxanada', 'müalicə', 'olunan', 'kokain', 'istifadəçilərində', 'miyokard', 'zədələnməsinin', 'elektrokarioqrafik', 'sübutları.'], 'original_tags': [0, 0, 0, 2, 3, 0, 0, 0, 1, 0, 0], 'tags': [0, 0, 0, 2, 3, 0, 1, 0, 4, 0]}


In [17]:
# TEST: make sure filtered_ds now contains only first 30 elements of each split
# filtered_ds = {split: data[:30] for split, data in filtered_ds.items()}
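The parallel translation cell that follows divides each split into `core_count` contiguous chunks, with the last chunk absorbing the remainder. The arithmetic can be checked in isolation. In this sketch `chunk_bounds` is a hypothetical helper, and the total of 1,756 is an assumption chosen to be consistent with the 117 and 118 item totals shown in the train progress bars.

```python
def chunk_bounds(total, workers):
    """Return (start, end) index pairs; the last chunk absorbs the remainder."""
    size = total // workers
    bounds = []
    for index in range(workers):
        start = index * size
        end = start + size if index < workers - 1 else total
        bounds.append((start, end))
    return bounds


bounds = chunk_bounds(1756, 15)
sizes = [end - start for start, end in bounds]
print(sizes[0], sizes[-1])
```

Note that when `total` is smaller than `workers`, `size` is 0 and the last worker receives every item, so very small test subsets are not spread across workers.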


In [32]:


# Define the translation function for parallel processing
def process_element(args):
    element, language = args
    try:
        return translate_element(element, language)
    except Exception as e:
        print(f"Error processing element: {e}")
        return None

def save_progress(filename, data):
    try:
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False)
    except IOError as e:
        print(f"Error saving progress: {e}")

def worker_function(args):
    core_index, chunk, split_name = args
    translated_chunk = []
    for element in tqdm(chunk, desc=f"Core {core_index} processing split '{split_name}'"):
        translated = process_element((element, "Turkish"))
        if translated is not None:
            translated_chunk.append(translated)
    return translated_chunk

def translate_split(split_name, split_data, core_count):
    chunk_size = len(split_data) // core_count
    
    progress_files = []
    tasks = []
    for core_index in range(core_count):
        start_index = core_index * chunk_size
        end_index = start_index + chunk_size if core_index < core_count - 1 else len(split_data)
        core_chunk = split_data[start_index:end_index]

        progress_file = f"progress_core_{core_index}_{split_name}.json"
        progress_files.append(progress_file)
        tasks.append((core_index, core_chunk, split_name))

    # Process tasks in parallel
    with Pool(core_count) as pool:
        results = pool.map(worker_function, tasks)

    # Save progress for each core
    for core_index, result in enumerate(results):
        if result:
            save_progress(progress_files[core_index], result)

    # Combine all core results for the split
    translated_split = []
    for file in progress_files:
        if os.path.exists(file):
            try:
                with open(file, "r", encoding="utf-8") as f:
                    translated_split.extend(json.load(f))
            except Exception as e:
                print(f"Error reading progress file {file}: {e}")

    return translated_split

# Main translation logic
translated_ds = {}
output_file = "translated_ds_tr.json"
core_count = 15

for split_name, split_data in filtered_ds.items():
    print(f"Processing split '{split_name}'...")
    translated_ds[split_name] = translate_split(split_name, split_data, core_count)

# Save the final translated dataset
try:
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(translated_ds, f, ensure_ascii=False)
    print(f"Final dataset saved to {output_file}")

    # Clean up progress files
    for split_name in filtered_ds.keys():
        for core_index in range(core_count):
            progress_file = f"progress_core_{core_index}_{split_name}.json"
            if os.path.exists(progress_file):
                os.remove(progress_file)
except IOError as e:
    print(f"Error saving the final dataset: {e}")


Processing split 'train'...


Core 10 processing split 'train':  53%|█████▎    | 62/117 [04:03<02:44,  2.98s/it]

Error processing element: Expecting ',' delimiter: line 2 column 2066 (char 2067)


Core 9 processing split 'train': 100%|██████████| 117/117 [06:54<00:00,  3.54s/it]]
Core 4 processing split 'train': 100%|██████████| 117/117 [07:06<00:00,  3.64s/it]]
Core 13 processing split 'train': 100%|██████████| 117/117 [07:09<00:00,  3.67s/it]
Core 3 processing split 'train': 100%|██████████| 117/117 [07:11<00:00,  3.69s/it]]
Core 5 processing split 'train': 100%|██████████| 117/117 [07:27<00:00,  3.82s/it]]
Core 7 processing split 'train': 100%|██████████| 117/117 [07:27<00:00,  3.82s/it]
Core 8 processing split 'train': 100%|██████████| 117/117 [07:27<00:00,  3.83s/it]
Core 12 processing split 'train': 100%|██████████| 117/117 [07:27<00:00,  3.83s/it]
Core 10 processing split 'train': 100%|██████████| 117/117 [07:31<00:00,  3.86s/it]
Core 2 processing split 'train': 100%|██████████| 117/117 [07:32<00:00,  3.87s/it]
Core 14 processing split 'train': 100%|██████████| 118/118 [07:48<00:00,  3.97s/it]
Core 1 processing split 'train': 100%|██████████| 117/117 [07:59<00:00,  4.10s/

Processing split 'test'...


Core 11 processing split 'test': 100%|██████████| 127/127 [07:17<00:00,  3.44s/it]
Core 1 processing split 'test': 100%|██████████| 127/127 [07:18<00:00,  3.45s/it]]
Core 0 processing split 'test': 100%|██████████| 127/127 [07:22<00:00,  3.48s/it]]
Core 4 processing split 'test': 100%|██████████| 127/127 [07:27<00:00,  3.52s/it]]
Core 6 processing split 'test': 100%|██████████| 127/127 [07:27<00:00,  3.53s/it]
Core 5 processing split 'test': 100%|██████████| 127/127 [07:32<00:00,  3.56s/it]]
Core 2 processing split 'test': 100%|██████████| 127/127 [07:33<00:00,  3.57s/it]
Core 7 processing split 'test': 100%|██████████| 127/127 [07:34<00:00,  3.58s/it]]
Core 3 processing split 'test': 100%|██████████| 127/127 [07:44<00:00,  3.66s/it]]
Core 10 processing split 'test': 100%|██████████| 127/127 [07:45<00:00,  3.67s/it]
Core 12 processing split 'test': 100%|██████████| 127/127 [07:48<00:00,  3.69s/it]
Core 13 processing split 'test': 100%|██████████| 127/127 [07:56<00:00,  3.75s/it]
Core 9

Processing split 'validation'...


Core 14 processing split 'validation':  75%|███████▌  | 95/126 [05:08<01:40,  3.23s/it]

Error processing element: Expecting ',' delimiter: line 2 column 2064 (char 2065)


Core 14 processing split 'validation': 100%|██████████| 126/126 [06:47<00:00,  3.24s/it]
Core 8 processing split 'validation': 100%|██████████| 125/125 [07:11<00:00,  3.45s/it]]
Core 9 processing split 'validation': 100%|██████████| 125/125 [07:20<00:00,  3.53s/it]]
Core 10 processing split 'validation': 100%|██████████| 125/125 [07:21<00:00,  3.53s/it]
Core 11 processing split 'validation': 100%|██████████| 125/125 [07:23<00:00,  3.55s/it]
Core 3 processing split 'validation': 100%|██████████| 125/125 [07:33<00:00,  3.63s/it]]
Core 4 processing split 'validation': 100%|██████████| 125/125 [07:52<00:00,  3.78s/it]]
Core 2 processing split 'validation': 100%|██████████| 125/125 [07:54<00:00,  3.80s/it]]
Core 5 processing split 'validation': 100%|██████████| 125/125 [08:04<00:00,  3.88s/it]]
Core 1 processing split 'validation': 100%|██████████| 125/125 [08:08<00:00,  3.91s/it]]
Core 7 processing split 'validation': 100%|██████████| 125/125 [08:15<00:00,  3.97s/it]]
Core 13 processing sp

Final dataset saved to translated_ds_tr.json
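The log above shows two elements failing with `Expecting ',' delimiter`, an error raised by `json.loads` when the model's JSON is malformed or cut off. Because `process_element` returns `None` in that case, those elements are silently missing from the saved dataset. One possible mitigation is to retry the call a few times. The sketch below uses a stand-in function in place of `translate_element` so that it runs without API access; `call_with_retries` and `flaky_parse` are hypothetical names.

```python
import json
import time


def call_with_retries(func, *args, attempts=3, delay_seconds=0.0):
    """Call func(*args); retry on ValueError (json.JSONDecodeError is a subclass)."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except ValueError as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Every attempt failed: surface the last error to the caller.
    raise last_error


# Stand-in for translate_element: fails twice with truncated JSON, then succeeds.
calls = {"count": 0}


def flaky_parse(_element):
    calls["count"] += 1
    if calls["count"] < 3:
        payload = '{"translated_tags": [0, 1'
    else:
        payload = '{"translated_tags": [0, 1]}'
    return json.loads(payload)["translated_tags"]


result = call_with_retries(flaky_parse, {"tokens": ["a", "b"]})
print(result, calls["count"])
```

Retrying is not guaranteed to help: if the response is truncated by the `max_tokens` limit, the same input may fail again, so recording the failed elements for later inspection is a reasonable complement.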
