In [150]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("documents/TWDpdf.pdf")
docs = loader.load()

#print(docs[0].page_content[:1000])


In [151]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,         # size of each chunk
    chunk_overlap=100       # how much each chunk overlaps with the previous one
)

# Split the document
split_docs = text_splitter.split_documents(docs)

# Inspect the result
print(f"✅ Total chunks created: {len(split_docs)}")
print(f"\n🧩 First chunk preview:\n{split_docs[0].page_content[:300]}...")

✅ Total chunks created: 102

🧩 First chunk preview:
1 
Towards Multi-Brain Decoding in Autism:  
A Self-Supervised Learning Approach 
 
Ghazaleh Ranjabaran1, Quentin Moreau1, Adrien Dubois1, Guillaume 
Dumas*1,2, 
 
1CHU Sainte-Justine Research Centre, Department of Psychiatry, Université de Montréal, 
Montréal, QC, Canada 
2Mila – Quebec AI Institut...
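
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and newlines). The helper name `sliding_chunks` and the sample string are assumptions made for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for c in chunks:
    print(c)
```

Each chunk repeats the last 3 characters of the previous one, which is the role the 100-character overlap plays in the cell above: a sentence cut at a chunk boundary still appears intact in at least one chunk.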


In [152]:
from dotenv import load_dotenv
import os
load_dotenv()
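api_key = os.getenv("OPENAI_API_KEY")
# Print only a short prefix so the full key never lands in the notebook output
print(api_key[:8] if api_key else "OPENAI_API_KEY is not set")  # Should show "sk-..."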


sk-proj-


In [153]:
import os
from langchain.embeddings import OpenAIEmbeddings
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Initialize the embedding model
embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))

# (You won’t see any output yet — embeddings will be used in the next step)
print("✅ Embedding model initialized.")


✅ Embedding model initialized.


ChromaDB is a lightweight, local vector database. It is used to:

- Store text embeddings (numeric vectors)

- Search for similar vectors based on user queries

- Power semantic search in RAG systems
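
As a rough illustration of what "search for similar vectors" means, the sketch below ranks a few toy 3-dimensional vectors by cosine similarity to a query vector. The vectors, texts, and the helper names `cosine_similarity` and `top_k` are made up for this example. Real embeddings have far more dimensions, and Chroma's actual index and distance metric may differ from this brute-force version.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "vector store": (text, embedding) pairs with hand-picked values
toy_store = [
    ("EEG self-supervised learning", [0.9, 0.1, 0.0]),
    ("Hyperparameter tuning", [0.1, 0.9, 0.0]),
    ("Reference list", [0.0, 0.1, 0.9]),
]
query_vec = [0.8, 0.2, 0.0]
nearest = top_k(query_vec, toy_store, k=2)
print(nearest)
```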


In [162]:
from langchain.vectorstores import Chroma

# Create the Chroma DB and store the embeddings
vectorstore = Chroma.from_documents(
    documents=split_docs,
    embedding=embeddings,
    persist_directory="chroma_db"
)

# Save the DB to disk
vectorstore.persist()

print("✅ Embeddings stored in ChromaDB!")


✅ Embeddings stored in ChromaDB!


In [163]:
query = "What is the main goal of the research?"
results = vectorstore.similarity_search(query, k=3)  # return top 3 relevant chunks

# Display the results
for i, res in enumerate(results):
    print(f"\n🔎 Result {i+1}:\n{res.page_content[:500]}...\n")



🔎 Result 1:
the purpose of acquiring meaningful representations from EEG data. They presented two specific 
self-supervised learning (SSL) tasks, namely relative positioning (RP) and temporal shuffling (TS) 
and adapted a third technique called contrastive predictive coding (CPC) (Oord et al., 2019) to be 
applicable to EEG data. 
 
This study aims to comprehensively analyze hyperscanning EEG data obtained from both autistic 
and neurotypical participants. The primary objective is to develop a DL model expl...


🔎 Result 2:
downstream tasks. In the initial pretext task phase, the model is presented with a set of self -
generated challe nges or auxiliary objectives. These challenges are carefully designed to 
encourage the model to extract meaningful and informative features from the unlabeled data. The 
model learns to uncover patterns, relationships, and representations within the data it self, 
effectively transforming it into a more structured and informative format....


🔎 Result 

In [168]:
query = "What is the goal of this research?"
docs = vectorstore.similarity_search(query, k=3)

for i, doc in enumerate(docs):
    print(f"\nChunk {i+1}:\n{doc.page_content[:300]}...")



Chunk 1:
the purpose of acquiring meaningful representations from EEG data. They presented two specific 
self-supervised learning (SSL) tasks, namely relative positioning (RP) and temporal shuffling (TS) 
and adapted a third technique called contrastive predictive coding (CPC) (Oord et al., 2019) to be 
appl...

Chunk 2:
methods. This research also emphasizes the broader role of computational models in precision 
psychiatry, paving the way for innovative, personalized diagnostic and therapeutic solutions. 
 
References  
 
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A Next-generation 
Hyp...

Chunk 3:
and further analyses of this parameter are essential. Specifically, investigating the upper bound 
for positive context length and comparing it to the total duration of EEG recordings could offer 
valuable insights into optimizing the pretext task model for this dataset and task....


In [169]:
from langchain.chat_models import ChatOpenAI


load_dotenv()

llm = ChatOpenAI(
    model_name="gpt-4",  # or "gpt-3.5-turbo"
    temperature=0.1,
    openai_api_key=os.getenv("OPENAI_API_KEY")
)


#### Customized prompt - strict answer

In [170]:
from langchain.prompts import PromptTemplate
from langchain.chains.question_answering import load_qa_chain

template = """
You are a helpful assistant analyzing a scientific research paper.

Use only the context provided below to answer the question. Do not rely on outside knowledge.

- If the answer is clearly present, respond with a concise and accurate explanation.
- If the exact term used in the question does not appear in the text, but an equivalent or related term is used instead, mention and use that term in your answer.
- If the answer is not present at all in the context, respond with: "Not found in the provided context."

Context:
{context}

Question: {question}
Answer:
"""

prompt = PromptTemplate(input_variables=["context", "question"], template=template)
qa_chain = load_qa_chain(llm, chain_type="stuff", prompt=prompt)

In [171]:
# Run it
answer = qa_chain.run(input_documents=docs, question=query)

print("\n🧠 Answer:\n", answer)


🧠 Answer:
 The goal of this research is to comprehensively analyze hyperscanning EEG data obtained from both autistic and neurotypical participants. The primary objective is to develop a deep learning model explicitly designed to extract and recognize patterns and relationships within individual EEG signals, using a self-supervised learning methodology. This research also aims to emphasize the broader role of computational models in precision psychiatry, paving the way for innovative, personalized diagnostic and therapeutic solutions.


### Default setting response

In [172]:
from langchain.chains.question_answering import load_qa_chain

# "stuff" chain just stuffs the documents into the prompt
qa_chain = load_qa_chain(llm, chain_type="stuff")

# Run it
answer = qa_chain.run(input_documents=docs, question=query)

print("\n🧠 Answer:\n", answer)


🧠 Answer:
 The goal of this research is to comprehensively analyze hyperscanning EEG data obtained from both autistic and neurotypical participants. The primary objective is to develop a deep learning model specifically designed to extract and recognize patterns and relationships within individual EEG signals, using a self-supervised learning methodology. This research also aims to emphasize the broader role of computational models in precision psychiatry, potentially paving the way for innovative, personalized diagnostic and therapeutic solutions.
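
The "stuff" chain type places all retrieved chunks into a single prompt. The sketch below shows that idea with plain string formatting. It is an approximation for intuition, not LangChain's implementation, and `stuff_prompt`, `toy_template`, and `toy_chunks` are hypothetical names.

```python
def stuff_prompt(chunks, question, template):
    """Join all chunks into one context string and fill the prompt template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


toy_template = "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
toy_chunks = ["Chunk A text.", "Chunk B text."]
prompt_text = stuff_prompt(toy_chunks, "What is the goal?", toy_template)
print(prompt_text)
```

Because every chunk goes into one prompt, the number and size of retrieved chunks has to fit within the model's context window.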


### Adding a fallback

Fallback QA is a common pattern in production RAG systems. It is a backup strategy used when retrieval fails: if the standard RAG chain cannot answer, run GPT on the document's introduction and conclusion instead, and print a warning.
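
The control flow of this pattern can be sketched without any LLM calls. In the example below, `answer_with_fallback`, `fake_primary`, and `fake_fallback` are hypothetical stand-ins for the strict chain and the fallback chain, and the 10-character minimum length is an assumed threshold for a "weak" answer.

```python
NOT_FOUND = "Not found in the provided context"


def answer_with_fallback(question, primary, fallback, min_length=10):
    """Try the primary answerer; switch to the fallback if the answer is missing or too short."""
    answer = primary(question)
    used_fallback = False
    if NOT_FOUND in answer or len(answer.strip()) < min_length:
        answer = fallback(question)
        used_fallback = True
    return answer, used_fallback


def fake_primary(question):
    # Stand-in for a strict QA chain that found nothing relevant
    return "Not found in the provided context."


def fake_fallback(question):
    # Stand-in for a chain run on the introduction and conclusion chunks
    return "Summary based on the introduction and conclusion."


result, used = answer_with_fallback("What is the hypothesis?", fake_primary, fake_fallback)
print(used, result)
```

Returning the `used_fallback` flag alongside the answer lets the caller show the warning and keep track of how often retrieval fails.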

In [173]:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Fallback prompt for when retrieval fails
fallback_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
You are a helpful assistant summarizing a scientific research paper.

The user has a question about the paper. Use the context below to answer it as clearly and accurately as possible. 
If the exact answer is not stated, provide a useful summary based on what you understand from the content.

Context:
{context}

Question: {question}
Answer:
"""
)

fallback_chain = LLMChain(llm=llm, prompt=fallback_prompt)


## Comparing final responses for testing the model

In [145]:
from langchain.prompts import PromptTemplate
from langchain.chains.question_answering import load_qa_chain

# Strict prompt
strict_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
You are a helpful assistant analyzing a scientific research paper.

Your job is to answer questions using **only the context provided below**. Do not rely on outside knowledge or assumptions.

Answer rules:
- If the answer is clearly stated, quote or summarize it briefly.
- If the question uses a term not found in the context (e.g., "hypothesis"), but a related term exists (e.g., "objective" or "research question"), use that instead and mention it.
- If the answer is not found at all, say exactly: "Not found in the provided context."

Context:
{context}

Question: {question}
Answer:
"""
)

# Strict: uses the custom prompt defined above
qa_chain_strict = load_qa_chain(llm, chain_type="stuff", prompt=strict_prompt)
# Flexible: no override = LangChain's default QA prompt
qa_chain_flexible = load_qa_chain(llm, chain_type="stuff")



# Questions to test
questions = [
    "What is the main goal of the research?",
    "What is the main hypothesis or research question?",
    "Summarize the findings of this study.",
    "What are the main conclusions drawn in the paper?",
    "What data or participants were used in the study?",
    "What methods or models were applied?",
    "What is Self-Supervised Learning in the context of this research?",
    "Did the SSL model outperform the baseline?",
    "How is this research relevant to autism?",
    "What limitations does the paper acknowledge?"
]

# Compare answers
for i, query in enumerate(questions, 1):
    print(f"\n{'='*80}")
    print(f"🔹 Question {i}: {query}")

    # Retrieve chunks
    docs = vectorstore.similarity_search(query, k=5)

    # Optional: print retrieved text for debugging
    # for doc in docs:
    #     print(doc.page_content[:300])
    #     print("---")

    # Flexible chain
    ans_flexible = qa_chain_flexible.run(input_documents=docs, question=query)

    # Strict chain: run the normal QA chain first
    answer = qa_chain_strict.run(input_documents=docs, question=query)

    # Trigger fallback if the answer is weak or missing
    if "Not found in the provided context" in answer or len(answer.strip()) < 10:
        print("⚠️ Using fallback context...")

        # Use only intro + conclusion chunks for fallback
        fallback_chunks = [
            doc for doc in split_docs
            if "introduction" in doc.page_content.lower() or "conclusion" in doc.page_content.lower()
        ]
        fallback_context = "\n\n".join(doc.page_content for doc in fallback_chunks)

        # Run the fallback chain
        answer = fallback_chain.run({
            "context": fallback_context,
            "question": query
        })

    print("🧠 Final Answer:\n", answer)

    # Output the flexible answer for comparison
    print("\n🧠 Flexible Answer:\n", ans_flexible)



🔹 Question 1: What is the main goal of the research?
🧠 Final Answer:
 The main goal of the research is "to comprehensively analyze hyperscanning EEG data obtained from both autistic and neurotypical participants." The primary objective is "to develop a DL model explicitly designed to extract and recognize patterns and relationships within individual EEG signals, employing a self-supervised learning methodology."

🧠 Flexible Answer:
 The main goal of the research is to comprehensively analyze hyperscanning EEG data obtained from both autistic and neurotypical participants. The primary objective is to develop a deep learning model specifically designed to extract and recognize patterns and relationships within individual EEG signals, using a self-supervised learning methodology.

🔒 Strict Answer:
 The paper acknowledges that due to the computational demands of training neural networks on large EEG datasets, they fixed the hyperparameters (e.g., learning rate, batch size, number of itera