In this notebook, you will find:

- The vectorization of the manual for use in a RAG (retrieval-augmented generation) pipeline.
- Tests of standalone LLMs.
- Code generation tests.

This notebook contains only a small sample of our trials; the full set is in the 'Archive' folder.

# Modules

In [1]:
import sys
sys.path.append('../')

In [2]:
%load_ext autoreload
%autoreload 2

In [None]:
import os
import torch
import ollama
from src.config import PATH_PROJECT, PROMPT_SYSTEM, PROMPT_MODEL_EXPERT

# Use the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)


cpu


# RAG

In [None]:
from langchain.document_loaders import PyMuPDFLoader
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings

In [None]:
# Import the project path and the system prompt
from src.config import PATH_PROJECT, PROMPT_SYSTEM

# Override PATH_PROJECT with the absolute path on this machine (adjust to your own setup)
PATH_PROJECT = r'C:\Users\FX506\Desktop\CS\3A\MVA\LLM\mva-llm-project'

# Define the directory where the Chroma vector database will be saved
CHROMA_PERSIST_DIR = os.path.join(PATH_PROJECT, "data/chroma_db/")

# Check if the Chroma database already exists and is not empty
if not os.path.exists(CHROMA_PERSIST_DIR) or not os.listdir(CHROMA_PERSIST_DIR):
    print("Creating Chroma database and indexing documents...")

    # Load PDF files from the 'data/' directory
    data_dir = os.path.join(PATH_PROJECT, "data/")
    documents = []

    for file in os.listdir(data_dir):
        if file.endswith(".pdf"):
            pdf_path = os.path.join(data_dir, file)
            pdf_loader = PyMuPDFLoader(pdf_path)
            documents.extend(pdf_loader.load())

    # Split the documents into smaller chunks for embedding
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    docs = text_splitter.split_documents(documents)

    # Initialize the embedding model (MiniLM) with device configuration
    embedding_model = HuggingFaceEmbeddings(
        model_name="all-MiniLM-L6-v2",
        model_kwargs={'device': device},
        encode_kwargs={'device': device}
    )

    # Create a new Chroma vector store using the embedded document chunks
    vectorstore = Chroma.from_documents(docs, embedding_model, persist_directory=CHROMA_PERSIST_DIR)

    # Save the Chroma vector store to disk
    vectorstore.persist()
    print("Chroma database saved successfully.")

else:
    # If the Chroma database already exists, simply load it
    print("Loading existing Chroma database...")
    embedding_model = HuggingFaceEmbeddings(
        model_name="all-MiniLM-L6-v2",
        model_kwargs={'device': device},
        encode_kwargs={'device': device}
    )
    vectorstore = Chroma(persist_directory=CHROMA_PERSIST_DIR, embedding_function=embedding_model)

print("Chroma database ready to use.")
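To make the `chunk_size=500, chunk_overlap=50` setting concrete, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to cut on separators such as paragraph and line breaks); `sliding_chunks` is a hypothetical helper used for illustration only.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between two window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# Synthetic 1200-character text (hypothetical input, not taken from the manual)
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])  # [500, 500, 300]
```

The overlap means that a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.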


In [None]:
query = "What are the symptoms of depression?"
retrieved_docs = vectorstore.similarity_search(query, k=3)

# Print the text of the three most similar chunks
for doc in retrieved_docs:
    print(doc.page_content)
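The retrieved chunks are presumably what ends up in the `retrieved_context` placeholder of the user prompt. As a rough sketch of that step, the snippet below joins chunks into a single context string with a size cap. `format_context` and the `Doc` stand-in class are hypothetical (they only mimic the `page_content` and `metadata` attributes of a LangChain document), the `page` metadata key is an assumption, and the two sample texts are made up.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for a LangChain Document (only the two attributes used here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_context(docs, max_chars=1500):
    """Join retrieved chunks into one string, stopping before max_chars is exceeded."""
    parts, used = [], 0
    for rank, doc in enumerate(docs, start=1):
        page = doc.metadata.get("page", "?")  # assumed metadata key
        block = f"[chunk {rank}, page {page}]\n{doc.page_content.strip()}"
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)

# Made-up chunks standing in for the output of vectorstore.similarity_search
docs = [
    Doc("Depressed mood most of the day.", {"page": 12}),
    Doc("Reduced energy or fatigue.", {"page": 13}),
]
print(format_context(docs))
```

Capping the context length keeps the prompt within the window of the smaller local models; the right limit depends on the model and is not something the notebook specifies.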

# LLMs alone

In this section, we ask the models for a diagnosis directly, without logic programming or expert intervention.

In [None]:
from src.inference import get_answer

get_answer(patient_id=10, model='mistral')

' Based on the provided information, the diagnosis for this patient is Recurrent Depressive Disorder (RDD). The reason being that there is no history of manic, mixed or hypomanic episodes, and the observed symptoms align with depressive episodes (depressed_mood and reduced_energy), making a Bipolar disorder less likely. However, it is important to note that increased self-esteem can be associated with hypomania in some cases, but without a history of manic or hypomanic episodes, it is unlikely to be indicative of a bipolar disorder. A thorough evaluation and assessment by a mental health professional are necessary for an accurate diagnosis.'

In [None]:
for i in range(1, 31):
    print(f"patient {i}", get_answer(patient_id=i, model='deepseek-r1:8b'))

In [None]:
for i in range(1, 31):
    print(f"patient {i}", get_answer(patient_id=i, model='mistral'))

In [None]:
for i in range(1, 31):
    print(f"patient {i}", get_answer(patient_id=i, model='llama3.1'))
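Because the answers are free text, comparing them with the `Disorder` column requires mapping each answer to a label first. The following is a minimal sketch of such a mapping, not the evaluation actually used in this project. `extract_label` is a hypothetical helper; the codes `BPD2` and `RDD` appear in the dataset, while `BPD1` and `SEDD` are assumed by analogy.

```python
import re

# (label, pattern) pairs; BPD1 and SEDD are assumed codes, BPD2 and RDD occur in the dataset
LABEL_PATTERNS = [
    ("BPD2", r"bipolar\s+(?:type\s+)?ii\b"),
    ("BPD1", r"bipolar\s+(?:type\s+)?i\b"),
    ("RDD", r"recurrent\s+depressive\s+disorder"),
    ("SEDD", r"single\s+episode\s+depressive\s+disorder"),
]

def extract_label(answer):
    """Return the label of the disorder mentioned first in the answer, or None."""
    text = answer.lower()
    best = None  # (position, label) of the earliest match so far
    for label, pattern in LABEL_PATTERNS:
        match = re.search(pattern, text)
        if match and (best is None or match.start() < best[0]):
            best = (match.start(), label)
    return best[1] if best else None

# Abridged from the mistral answer for patient 10 shown above
answer = ("Based on the provided information, the diagnosis for this patient is "
          "Recurrent Depressive Disorder (RDD). [...] making a Bipolar disorder less likely.")
print(extract_label(answer))  # RDD
```

Taking the first mention is a heuristic: an answer that opens by ruling out a disorder would be mislabeled, so a sample of the extracted labels should be checked by hand.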

# LLMs and logic programming

In this section, we test whether the models can generate logic programs, with and without the intervention of an expert model.

In [4]:
import pandas as pd
df = pd.read_excel(os.path.join(PATH_PROJECT, "data", "dataset.xlsx")).fillna("Non renseigné")

We first retrieve the distinct symptom names and history conditions. Missing values are filled with the French placeholder "Non renseigné" ("not provided").

In [5]:
df['Observed_Symptom'].unique(), df['History_Condition'].unique()

(array(['depressed_mood', 'reduced_concentration', 'reduced_energy',
        'increased_talkativeness', 'diminished_interest_pleasure',
        'low_self_worth', 'psychomotor_disturbances',
        'increased_activity_energy', 'euphoria_irritability_expansiveness',
        'racing_thoughts', 'increased_self_esteem',
        'disrupted_excessive_sleep', 'change_in_appetite_weight',
        'decreased_need_for_sleep', 'distractibility',
        'recurrent_thoughts_death_suicide', 'impulsive_reckless_behavior',
        'increased_sexual_sociability_goal_directed_activity',
        'hopelessness', 'delusions', 'passivity_experiences',
        'disorganized_behavior'], dtype=object),
 array(['depressive', 'hypomanic', 'Non renseigné', 'mixed', 'manic'],
       dtype=object))

Quick check that the Ollama server is reachable.

In [6]:
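import requests

def test_ollama(url="http://localhost:11434", timeout=5):
    """Print whether the Ollama server answers at the given URL."""
    try:
        # A timeout prevents the cell from hanging if the server does not respond
        response = requests.get(url, timeout=timeout)
        if response.status_code == 200:
            print("Ollama server is UP!")
        else:
            print(f"Ollama server responded with status code: {response.status_code}")
    except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
        print(f"Ollama server is DOWN or unreachable at {url}")

test_ollama()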

Ollama server is UP!


## LLMs

In [7]:
from src.inference import get_code, get_code_with_rag

models = ["mistral", "deepseek-coder-v2", "deepseek-r1", "llama3.2"]
models_expert = ["mistral", "deepseek-coder-v2", "deepseek-r1", "llama3.2"]


user = (
    "Now, translate the following criteria into python code between <code> and <\\code> "
    "and add your explanations for Bipolar I, Bipolar II, Single Episode Depressive Disorder, "
    "and Recurrent Depressive Disorder. IT MUST BE IN PYTHON WITH THE PACKAGE pyDatalog. "
    f"{retrieved_context} "
    f"Relevant symptom names for Observed relation: {df['Observed_Symptom'].unique()}  "
    f"Relevant condition names for History relation: {df['History_Condition'].unique()}"
)
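The prompt asks the model to wrap its program between `<code>` and `<\code>`. Here is a minimal sketch of how that block could be pulled out of a reply before running it. `extract_code` is a hypothetical helper, not a function from `src.inference`, and it accepts both `<\code>` and the more conventional `</code>` as the closing tag, since models do not always echo the tag exactly.

```python
import re

# Matches <code> ... <\code> or <code> ... </code>, across several lines
CODE_BLOCK = re.compile(r"<code>(.*?)<[\\/]code>", re.DOTALL | re.IGNORECASE)

def extract_code(response):
    """Return the text inside the first complete <code> tag pair, or None if there is none."""
    match = CODE_BLOCK.search(response)
    return match.group(1).strip() if match else None

# Made-up reply in the format requested by the prompt
reply = (
    "Here are the rules.\n"
    "<code>\nfrom pyDatalog import pyDatalog\n</code>\n"
    "Explanation follows."
)
print(extract_code(reply))  # from pyDatalog import pyDatalog
```

Returning `None` when the tags are missing makes it easy to count how often a model ignores the output format, which is separate from whether its code is correct.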


## Testing phase

In [8]:
get_code(
    "mistral",
    model_expert="deepseek-coder-v2",
    w_model_expert=True,
    prompt_system=PROMPT_SYSTEM,
    prompt_system_expert=PROMPT_MODEL_EXPERT,
    lan_logic='python',
)

 Here is a Python Pandas code to determine the diagnosis based on the provided criteria:

```python
import pandas as pd

# Assuming df is your DataFrame and it has the structure you've mentioned
def diagnose(df):
    # Define functions for each type of mood episode
    def depressive_episode(row):
        return (row['Mood_Episode'] == 'Depressive Episode') & \
               (row['Observed_Symptoms_Weeks'][0][0] == 'depressed_mood' and row['Observed_Symptoms_Weeks'][0][1] >= 2) & \
               row['History_Condition'].str.contains('|'.join(['Non renseigne', 'manic'])) & \
               (row['Observed_Symptoms_Weeks'][1:] != [(nan, nan)]).all()

    def manic_episode(row):
        return (row['Mood_Episode'] == 'Manic Episode') & \
               row['Observed_Symptoms_Weeks'][0][0] in ['euphoria_irritability_expansiveness', 'increased_talkativeness'] & \
               len(row[row['Observed_Symptoms_Weeks'][1:]]) >= 3 & \
               (row['History_Condition'].str.contains('|'.j
```

### Example of generated code

In [9]:
df.head()

Unnamed: 0,PatientID,Disorder,Observed_Symptom,Observed_Week,History_Condition,History_Count,Mood_Episode
0,1,BPD2,depressed_mood,1.5,depressive,1.0,Non renseigné
1,1,BPD2,reduced_concentration,1.2,hypomanic,1.0,Non renseigné
2,1,BPD2,reduced_energy,0.8,Non renseigné,Non renseigné,Non renseigné
3,1,BPD2,increased_talkativeness,0.6,Non renseigné,Non renseigné,Non renseigné
4,2,RDD,depressed_mood,5.7,depressive,1.0,depressive


In [10]:
def has_criteria(series, criteria):
    for sym in criteria:
        if not series[sym].any():
            return False
    return True

def is_depressive_episode(series):
    return (series['depressed_mood'].any() and
            has_criteria(series, ['reduced_concentration', 'feelings_of_worthlessness_or_excessive_guilt',
                                'hopelessness', 'thoughts_of_death_or_suicidal_ideation',
                                'insomnia_or_hypersomnia',
                                'significant_weight_change_or_appetite_disturbance',
                                'psychomotor_agitation_or_retardation',
                                'fatigue_or_loss_of_energy']))

def is_manic_episode(series):
    return (series['euphoria_irritability_expansiveness'].any() and
            series['increased_activity_energy'].any() and
            has_criteria(series, ['inflated_self_esteem_or_grandiosity',
                                'decreased_need_for_sleep',
                                'more_talkative_or_pressured_speech',
                                'flight_of_ideas_or_racing_thoughts',
                                'distractibility',
                                'increased_goal-directed_activity_or_psychomotor_agitation',
                                'risky_behaviors']))

def is_hypomanic_episode(series):
    return (series['euphoria_irritability_expansiveness'].any() and
            series['increased_activity_energy'].any() and
            has_criteria(series, ['inflated_self_esteem', 'decreased_need_for_sleep',
                                'more_talkative_or_pressured_speech',
                                'flight_of_ideas', 'distractibility',
                                'increased_goal-directed_activity']))

def is_mixed_episode(series):
    return (series['depressed_mood'].any() and series['euphoria_irritability_expansiveness'].any())

# Define helper functions for diagnosis
def get_episodes_duration(series):
    episodes = []
    for episode in df.loc[df['symptom'] == series['condition'], 'duration'].unique():
        episodes += [episode] * len(df[(df['symptom'] == series['condition']) & (df['duration'] == episode)])
    return pd.Series(episodes, name='duration')

def get_history(series):
    conditions = []
    for condition in df[df['symptom'].isin(series['condition'])]['condition'].unique():
        conditions += [condition] * len(df[(df['condition'] == condition) & (df['symptom'] == series['condition'])])
    return pd.Series(conditions, name='condition')

# Diagnose the patient based on their symptoms and history of mood episodes
def diagnose(series):
    if is_depressive_episode(series) and df.loc[df['symptom'] == 'Non renseigné', 'duration'].iloc[0] not in series['duration']:
        return 'Single Episode Depressive Disorder'
    elif is_depressive_episode(series) and len(df.loc[(df['symptom'] == 'Non renseigné') & (df['duration'] == df.loc[df['condition'] == 'depressive', 'duration'].iloc[0])]['condition']) > 1:
        return 'Recurrent Depressive Disorder'
    elif is_manic_episode(series):
        if series['condition'].iloc[0] == 'Non renseigné':
            conditions = get_history(series)
            if len(conditions) > 1 and 'hypomanic' in conditions:
                return 'Bipolar I Disorder'
            elif all([c != 'manic' for c in conditions]):
                return 'Unspecified Mood Disorder with Manic Episode'
        else:
            if len(df[(df['symptom'] == series['condition']) & (df['duration'] == df.loc[df['condition'] == 'manic', 'duration'].iloc[0])]) > 1:
                return 'Bipolar I Disorder'
    elif is_hypomanic_episode(series):
        if series['condition'].iloc[0] == 'Non renseigné':
            conditions = get_history(series)
            if len(conditions) > 1 and 'hypomanic' in conditions:
                return 'Bipolar II Disorder'
            elif all([c != 'hypomanic' for c in conditions]):
                return 'Unspecified Mood Disorder with Hypomanic Episode'
        else:
            if len(df[(df['symptom'] == series['condition']) & (df['duration'] == df.loc[df['condition'] == 'hypomanic', 'duration'].iloc[0])]) > 1:
                return 'Bipolar II Disorder'
    elif is_mixed_episode(series):
        if series['condition'].iloc[0] == 'Non renseigné':
            conditions = get_history(series)
            if len(conditions) > 1 and ('hypomanic' in conditions or 'manic' in conditions):
                return 'Bipolar I Disorder with Mixed Features'
        else:
            if len(df[(df['symptom'] == series['condition']) & (df['duration'] == df.loc[df['condition'] == 'mixed', 'duration'].iloc[0])]) > 1:
                return 'Bipolar I Disorder with Mixed Features'
    else:
        return 'Cannot determine diagnosis'

In [25]:
df['diagnosis'] = df.apply(diagnose, axis=1)

KeyError: 'depressed_mood'

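The generated code fails immediately. It assumes one column per symptom (for example `depressed_mood`) as well as columns named `symptom`, `condition`, and `duration`, none of which exist in `df`. Several of the symptom names it uses are also absent from the dataset.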
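One inexpensive check before running generated code is to compare the symptom names it mentions with the vocabulary printed earlier. The sketch below does this with a regular expression over quoted snake_case literals. `unknown_names` is a hypothetical helper, and the heuristic will miss names built dynamically or written without underscores.

```python
import re

# Symptom vocabulary, copied from df['Observed_Symptom'].unique() above
KNOWN_SYMPTOMS = {
    "depressed_mood", "reduced_concentration", "reduced_energy",
    "increased_talkativeness", "diminished_interest_pleasure",
    "low_self_worth", "psychomotor_disturbances",
    "increased_activity_energy", "euphoria_irritability_expansiveness",
    "racing_thoughts", "increased_self_esteem",
    "disrupted_excessive_sleep", "change_in_appetite_weight",
    "decreased_need_for_sleep", "distractibility",
    "recurrent_thoughts_death_suicide", "impulsive_reckless_behavior",
    "increased_sexual_sociability_goal_directed_activity",
    "hopelessness", "delusions", "passivity_experiences",
    "disorganized_behavior",
}

def unknown_names(code, known=KNOWN_SYMPTOMS):
    """Return quoted snake_case literals found in `code` that are not in the vocabulary."""
    literals = re.findall(r"['\"]([a-z]+(?:[_-][a-z]+)+)['\"]", code)
    return sorted(set(literals) - set(known))

# One line in the style of the generated code above
snippet = "has_criteria(series, ['depressed_mood', 'inflated_self_esteem', 'decreased_need_for_sleep'])"
print(unknown_names(snippet))  # ['inflated_self_esteem']
```

A non-empty result does not prove the code is wrong, but it flags programs that rely on symptoms the dataset can never contain.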

In [None]:
import numpy as np
def diagnose(df):
    df['Depressive Episode'] = df['Observed_Symptom'].apply(lambda x: 1 if all([sym in x for sym in ['depressed_mood', 'diminished_interest_pleasure']]) else 0)
    df['Hypomanic Episode'] = df['History_Count'] > 0
    df['Manic Episode'] = df['Hypomanic Episode'] & (df['Observed_Symptom'].apply(lambda x: len([sym in x for sym in ['euphoria_irritability_expansiveness', 'increased_activity_energy', 'impulsive_reckless_behavior']]) >= 3) == True)
    df['Mixed Episode'] = df['Depressive Episode'] & (df['Manic Episode'] | df['Hypomanic Episode'])

    df['Bipolar I Disorder'] = df['Manic Episode'].astype(int) + df['Mixed Episode'].astype(int) > 0
    df['Bipolar II Disorder'] = (df['Hypomanic Episode'].astype(int) == 1) & (df['Depressive Episode'].astype(int) >= 1)
    df['Single Episode Depressive Disorder'] = (df['Depressive Episode'].astype(int) == 1) & (df['Hypomanic Episode'].astype(int) == 0)
    df['Recurrent Depressive Disorder'] = (df['Depressive Episode'].cumsum() > 1) & (df['Hypomanic Episode'].eq(np.nan).all())
    return df


In [11]:
from src import dataprep
df_grouped = dataprep.preproc_df()
df_grouped[df_grouped['PatientID'] == 1]

Unnamed: 0,PatientID,Disorder,History_Condition,History_Count,Mood_Episode,Observed_Symptoms_Weeks
0,1,BPD2,"[depressive, hypomanic, Non renseigne, Non ren...","[1.0, 1.0, nan, nan]","[Non renseigne, Non renseigne, Non renseigne, ...","[(depressed_mood, 1.5), (reduced_concentration..."
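The internals of `dataprep.preproc_df` are not shown in this notebook. As an illustration of the shape it returns (one row per patient, with symptoms and durations paired as tuples), here is one plausible way to build such a table with pandas. `group_by_patient` is a hypothetical function and may differ from the real implementation; the three input rows are taken from `df.head()` above.

```python
import pandas as pd

# Three rows from df.head(), in the original long format (one row per symptom)
long_df = pd.DataFrame({
    "PatientID": [1, 1, 2],
    "Observed_Symptom": ["depressed_mood", "reduced_concentration", "depressed_mood"],
    "Observed_Week": [1.5, 1.2, 5.7],
    "History_Condition": ["depressive", "hypomanic", "depressive"],
})

def group_by_patient(df):
    """Collapse the long table to one row per patient, collecting values into lists."""
    grouped = (
        df.groupby("PatientID")
          .agg(
              symptoms=("Observed_Symptom", list),
              weeks=("Observed_Week", list),
              History_Condition=("History_Condition", list),
          )
          .reset_index()
    )
    # Pair each symptom with its duration in weeks
    grouped["Observed_Symptoms_Weeks"] = [
        list(zip(s, w)) for s, w in zip(grouped["symptoms"], grouped["weeks"])
    ]
    return grouped.drop(columns=["symptoms", "weeks"])

patients = group_by_patient(long_df)
print(patients[patients["PatientID"] == 1])
```

Working on this per-patient table lets a diagnosis function see all of a patient's symptoms at once, which the row-by-row generated code cannot do.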


In [None]:
diagnose(df)

Unnamed: 0,PatientID,Disorder,Observed_Symptom,Observed_Week,History_Condition,History_Count,Mood_Episode,Depressive Episode,Hypomanic Episode,Manic Episode,Mixed Episode,Bipolar I Disorder,Bipolar II Disorder,Single Episode Depressive Disorder,Recurrent Depressive Disorder
0,1,BPD2,depressed_mood,1.5,depressive,1.0,Non renseigné,0,True,True,False,True,False,False,False
1,1,BPD2,reduced_concentration,1.2,hypomanic,1.0,Non renseigné,0,True,True,False,True,False,False,False
2,1,BPD2,reduced_energy,0.8,Non renseigné,,Non renseigné,0,False,False,False,False,False,False,False
3,1,BPD2,increased_talkativeness,0.6,Non renseigné,,Non renseigné,0,False,False,False,False,False,False,False
4,2,RDD,depressed_mood,5.7,depressive,1.0,depressive,0,True,True,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
134,30,Non renseigné,euphoria_irritability_expansiveness,0.7,Non renseigné,,hypomanic,0,False,False,False,False,False,False,False
135,30,Non renseigné,increased_activity_energy,1.2,Non renseigné,,Non renseigné,0,False,False,False,False,False,False,False
136,30,Non renseigné,increased_talkativeness,1.8,Non renseigné,,Non renseigné,0,False,False,False,False,False,False,False
137,30,Non renseigné,racing_thoughts,0.7,Non renseigné,,Non renseigné,0,False,False,False,False,False,False,False


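The function does not match the structure of the dataframe, so its outputs are incorrect. It evaluates each row separately although a patient's symptoms are spread over several rows, so `Depressive Episode` is 0 everywhere. `Hypomanic Episode` only tests whether `History_Count` is positive, and `Manic Episode` merely copies it, because `len([...]) >= 3` is always true for a three-item list. Overall, we spent significant time fixing syntax errors. Providing more examples in the prompts might help the models generate correct, executable code. For now, however, the generated code has to be modified extensively, which makes the method impractical in its current state.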
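Part of the manual effort could be avoided by rejecting replies that do not even parse. A minimal sketch using the standard `ast` module follows; `syntax_error` is a hypothetical helper. It only detects syntax errors, so runtime failures such as the `KeyError` above would still have to be caught by executing the code on a sample patient.

```python
import ast

def syntax_error(code):
    """Return None if `code` parses as Python, else a short description of the error."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None

# A truncated reply, similar to the cut-off output shown earlier
truncated = "def manic_episode(row):\n    return (row['History_Condition'].str.contains('|'.j"
complete = "def is_mixed(flags):\n    return flags['depressed'] and flags['euphoric']\n"

print(syntax_error(complete))   # None
print(syntax_error(truncated))  # an error message; the exact wording depends on the Python version
```

Such a filter could drive an automatic retry loop in which the error message is sent back to the model, though we have not evaluated that approach here.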