# Intelligent algorithm - improvements

## Improvement 1:

The previous data preprocessing method (removing words with irrelevant semantics and keeping only alphanumeric characters) had some critical problems:
- the list of semantically irrelevant words contained words with a negative connotation (negations such as "no" or "not"), so they were removed and the meaning of the text was distorted
- initially, words of at most 2 characters were removed as irrelevant, but in the medical domain many abbreviations are essential for understanding the clinical context

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import re
import pandas as pd

negation_words = {'no', 'not', 'nor', 'neither', 'never', 'denies', 'without'}
SKLEARN_STOPWORDS = set(ENGLISH_STOP_WORDS) - negation_words

medical_stopwords = {
    'pt', 'patient', 'hx', 'ros', 'admitted', 'hospital',  # 'denies' removed from this list
    'history', 'illness', 'presenting', 'states', 'reportedly', 'due',
    'mg', 'capsule', 'tablet', 'solution', 'drops', 'daily', 'tid', 'bid',
    'po', 'iv', 'unit', 'vitals', 'flow',
    'rr', 'bp', 'temp', 'pulse', 'w', 'h', 'm', 'f', 'o', 'r', 'q', 'g',
    'cc', 'cxr', 'us', 'chief'
}
ALL_STOPWORDS = SKLEARN_STOPWORDS.union(medical_stopwords)

INPUT_COLUMNS = ['chief_complaint', 'history_of_present_illness', 'past_medical_history']

def clean_text(text, stop_words):
    if pd.isna(text) or text is None:
        return ""

    text = str(text).lower()
    text = re.sub(r'(\s+in\s+)|(\s+at\s+)|(sometime\s+in\s+)|(\s*\d{4}\s*)|(\s*ms\.\s*,\s*)', ' ', text)
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)

    tokens = text.split()

    # The len(token) > 1 condition keeps 2-letter acronyms
    filtered_tokens = [token for token in tokens if token not in stop_words and len(token) > 1]

    return " ".join(filtered_tokens)

def generate_processed_input(row):
    input_parts = []
    for col in INPUT_COLUMNS:
        # Check that the column exists in the dataframe to avoid errors
        if col in row:
            cleaned_data = clean_text(row[col], ALL_STOPWORDS)
            if cleaned_data:
                label = col.upper().replace("_", " ")
                input_parts.append(f'{label}: {cleaned_data}')

    # Return the string joined with the separator.
    return ' | '.join(input_parts)
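
The effect of the two fixes can be seen on a single made-up note. The sketch below is only an illustration: it uses a tiny hand-written stopword set (`GENERIC_STOPWORDS`, an assumption standing in for the real scikit-learn list) and a hypothetical helper `filter_tokens`, and it compares the old behaviour (negations dropped, tokens shorter than 3 characters dropped) with the new one.

```python
import re

# Assumption: a tiny stand-in for a generic stopword list, not the real sklearn one.
GENERIC_STOPWORDS = {"the", "of", "no", "not", "and", "is", "without"}
NEGATION_WORDS = {"no", "not", "nor", "neither", "never", "denies", "without"}

def tokenize(text):
    # Lowercase, replace non-alphanumeric characters with spaces, split on whitespace.
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()

def filter_tokens(text, stop_words, min_len):
    # Hypothetical helper: keep tokens that are not stopwords and are long enough.
    return [t for t in tokenize(text) if t not in stop_words and len(t) >= min_len]

note = "Denies chest pain, no fever. Hx of MI."

# Old behaviour: negations treated as stopwords, tokens of length <= 2 dropped.
old_tokens = filter_tokens(note, GENERIC_STOPWORDS | {"denies"}, min_len=3)

# New behaviour: negations preserved, 2-letter abbreviations kept.
new_tokens = filter_tokens(note, GENERIC_STOPWORDS - NEGATION_WORDS, min_len=2)

print(old_tokens)
print(new_tokens)
```

With the old filtering, the note reads as if the patient had chest pain and fever; with the new filtering, the negations and the abbreviations `hx` and `mi` survive.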

## Improvement 2:

The previous method of labeling the dataset for **weakly supervised learning** has limitations that can significantly hurt model performance and reduce the scalability of the system: choosing the specialist based on the domain of the first relevant medical term found can lead to wrong labels, and the need to manually maintain the dictionary that maps each specialist to its associated medical terms limits scalability.

To improve the model's results, I will use **BioBERT Similarity**:
- I will use a concise description for each specialty;
- I will compute embeddings for both the patient's input and the description associated with each specialty;
- I will choose the specialist whose embedding is closest to the embedding of the patient's text (cosine similarity).

In [1]:
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-v1.1")

# specialist definitions for embeddings generation
specialist_definitions = {
    "Infectious_Disease": (
        "Diagnosis and management of infections caused by bacteria, viruses, fungi, and parasites. "
        "Conditions include sepsis, septic shock, bacteremia, pneumonia, meningitis, and abscesses. "
        "Treatment involves broad-spectrum antibiotics (vancomycin, zosyn, cefepime) and antivirals. "
        "Findings include fevers, chills, leukocytosis, and positive cultures."
    ),

    "Cardiology": (
        "Diseases of the heart and vascular system. Conditions include coronary artery disease (CAD), "
        "myocardial infarction (MI), congestive heart failure (CHF), atrial fibrillation (Afib), "
        "arrhythmia, hypertension, and angina. Treatments involving anticoagulation (heparin, warfarin), "
        "diuretics (lasix), beta-blockers, ACE inhibitors, cardiac catheterization, and stents."
    ),

    "Orthopedics_Surgery": (
        "Surgical and non-surgical treatment of the musculoskeletal system, bones, joints, and muscles. "
        "Conditions include bone fractures, trauma, osteoarthritis, hip and knee pain. "
        "Procedures include Open Reduction Internal Fixation (ORIF), arthroplasty, debridement, "
        "amputation, and casting. Post-operative care for orthopedic injuries."
    ),

    "Oncology": (
        "Diagnosis and treatment of cancer, malignancies, neoplasms, and tumors. "
        "Includes management of metastasis, lymphoma, leukemia, carcinoma, and masses. "
        "Therapies involve chemotherapy, immunotherapy, radiation, and palliative care for terminal illness."
    ),

    "Pulmonology": (
        "Diseases of the respiratory system, lungs, and airways. "
        "Conditions include pneumonia, chronic obstructive pulmonary disease (COPD), asthma, "
        "emphysema, bronchitis, pulmonary embolism (PE), and pleural effusion. "
        "Symptoms include dyspnea, shortness of breath, hypoxia, respiratory failure, and cough. "
        "Use of bronchodilators and oxygen therapy."
    ),

    "Endocrinology": (
        "Disorders of the endocrine system, hormones, and metabolism. "
        "Primary focus on Diabetes Mellitus (Type 1 and Type 2), diabetic ketoacidosis (DKA), "
        "hyperglycemia, and hypoglycemia. Also thyroid disorders (hypothyroidism, hyperthyroidism), "
        "adrenal insufficiency, and lipid disorders. Insulin management and glucose control."
    ),

    "Gastroenterology": (
        "Diseases of the digestive tract, stomach, intestines, liver, pancreas, and gallbladder. "
        "Conditions include gastrointestinal bleeding (GI bleed), cirrhosis, hepatitis, pancreatitis, "
        "bowel obstruction, melena, and hematochezia. Symptoms of nausea, vomiting, diarrhea, "
        "and abdominal pain. Procedures like endoscopy and colonoscopy."
    ),

    "Neurology": (
        "Disorders of the nervous system, brain, spinal cord, and nerves. "
        "Conditions include cerebrovascular accidents (CVA), stroke, transient ischemic attack (TIA), "
        "seizures, epilepsy, encephalopathy, and neuropathy. "
        "Symptoms include altered mental status, confusion, headache, weakness, numbness, "
        "syncope, and dizziness."
    )
}

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].numpy()

# generate the embeddings
specialist_vectors = {}
for spec, desc in specialist_definitions.items():
    specialist_vectors[spec] = get_embedding(desc)

# pick the best specialist for a given patient based on the embeddings
def get_best_specialist_bert(patient_text):
    if not patient_text.strip():
        return None

    patient_vector = get_embedding(patient_text)

    best_score = -1
    best_specialist = None

    # Compare the patient vector with each specialist vector
    for spec, spec_vector in specialist_vectors.items():
        score = cosine_similarity(patient_vector, spec_vector)[0][0]

        if score > best_score:
            best_score = score
            best_specialist = spec

    return best_specialist
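
The selection rule itself is independent of BioBERT and can be checked on toy data. The sketch below assumes made-up 3-dimensional vectors in place of the real 768-dimensional embeddings; the names `toy_specialists` and `patient` are hypothetical, and the cosine similarity is written out by hand instead of calling scikit-learn.

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Assumption: toy vectors standing in for the specialist description embeddings.
toy_specialists = {
    "Cardiology": [0.9, 0.1, 0.0],
    "Neurology": [0.1, 0.9, 0.2],
    "Pulmonology": [0.2, 0.1, 0.9],
}

# Assumption: toy vector standing in for the patient text embedding.
patient = [0.8, 0.3, 0.1]

scores = {name: cosine(patient, vec) for name, vec in toy_specialists.items()}
best = max(scores, key=scores.get)  # label with the highest similarity

print(best, round(scores[best], 3))
```

Because only the direction of the vectors matters for cosine similarity, the label is the specialty whose description points most nearly the same way as the patient text, regardless of vector length.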

## Improvement 3:

In the previous iteration, the loss function was cross-entropy loss (a standard choice for multi-class classification problems). However, to compensate for the class imbalance in the training set (typical in the medical domain), weighted cross-entropy is a better fit, because it penalizes wrong predictions on rare classes more heavily.

In [None]:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train_labels_np = train_dataset['labels'].numpy()

class_weights_np = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train_labels_np),
    y=y_train_labels_np
)
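
To make the effect of the weights concrete, here is a rough sketch in plain Python on toy labels (not the real dataset). It assumes the documented `'balanced'` heuristic, `n_samples / (n_classes * count)`, and scales the negative log-likelihood of a sample by the weight of its true class. The helper names `balanced_class_weights` and `weighted_nll` are hypothetical.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    # 'balanced' heuristic: n_samples / (n_classes * number of samples in the class).
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

def weighted_nll(prob_true_class, true_class, weights):
    # Per-sample weighted cross-entropy: the class weight scales -log(p_true).
    return -weights[true_class] * math.log(prob_true_class)

# Assumption: toy labels with a 6 / 3 / 1 imbalance across three classes.
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(toy_labels)

# The same mistake (probability 0.5 on the true class) costs more on the rare class.
penalty_common = weighted_nll(0.5, 0, weights)
penalty_rare = weighted_nll(0.5, 2, weights)

print(weights)
print(penalty_common, penalty_rare)
```

In the training loop, `class_weights_np` would typically be converted to a float tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`; that wiring is an assumption here, since the cell above only computes the weights.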

## Improvement 4:

To make fine-tuning less erratic, this time I will use a variable learning rate: it increases at the start (warmup), then decreases linearly to provide stability.

In [None]:
from transformers import get_linear_schedule_with_warmup
from torch.optim import AdamW

LEARNING_RATE = 2e-5
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps), # 10% warmup
    num_training_steps=total_steps
)
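
The shape of this schedule can be reproduced without any library. The sketch below assumes `total_steps = 1000` purely for illustration (in the notebook it would presumably be the number of batches per epoch times the number of epochs) and uses a hypothetical function, `linear_warmup_decay`, that returns the multiplier applied to the base learning rate at a given step.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    # Multiplier in [0, 1]: linear ramp-up during warmup, then linear decay to 0.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

BASE_LR = 2e-5
total_steps = 1000                     # assumption: illustrative value only
warmup_steps = int(0.1 * total_steps)  # 10% warmup, as in the cell above

checkpoints = (0, 50, 100, 550, 1000)
lrs = [BASE_LR * linear_warmup_decay(s, warmup_steps, total_steps) for s in checkpoints]

for step, lr in zip(checkpoints, lrs):
    print(step, lr)
```

The learning rate starts at 0, reaches its peak of `2e-5` at the end of warmup, and returns to 0 at the last step. For the real scheduler to follow this curve, `scheduler.step()` has to be called once per optimizer step.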