# **Ask my PhD thesis a question**

Install the necessary packages

In [None]:
!pip install --upgrade langchain
!pip install pypdf pymupdf pdfplumber pypdf2
!pip install google-generativeai
!pip install langchain-google-genai
!pip install chromadb
!pip install python-dotenv
!pip install faiss-gpu

In [164]:
# Load necessary libraries
import os
from dotenv import load_dotenv
from PyPDF2 import PdfReader

# Langchain and Generative AI imports
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate

# Google Generative AI import
import google.generativeai as genai

Connect to Google Drive to read the thesis PDF and load the environment variables

In [16]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


If the environment variable holding the API key (GOOGLE_API_KEY) is not set up yet, set it up here.

In [165]:
env_path = '/content/drive/My Drive/.env'

# Load the environment variables from the .env file
load_dotenv(env_path)

# Access the environment variables
google_api_key = os.getenv('GOOGLE_API_KEY')

# Print the variable to verify
#print(f'GOOGLE_API_KEY: {google_api_key}')
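
If the `.env` file is missing or does not define the key, `os.getenv` returns `None` and the failure only shows up later, when the first API call is made. A minimal sketch of one way to handle that, assuming an interactive session: the helper name `resolve_api_key` and its `prompt` parameter are hypothetical and not part of the original notebook.

```python
import os
from getpass import getpass


def resolve_api_key(name="GOOGLE_API_KEY", prompt=getpass):
    """Return the key from the environment, prompting once if it is missing."""
    key = os.getenv(name)
    if not key:
        # Hypothetical fallback: ask interactively and keep the value
        # in the environment for the rest of this session only.
        key = prompt(f"Enter {name}: ")
        os.environ[name] = key
    return key
```

With this in place, `google_api_key = resolve_api_key()` could replace the direct `os.getenv` call; the key is never echoed or printed.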

In [18]:
genai.configure(api_key = google_api_key)

In [19]:
# for m in genai.list_models():
#     print(m)

Model(name='models/chat-bison-001',
      base_model_id='',
      version='001',
      display_name='PaLM 2 Chat (Legacy)',
      description='A legacy text-only model optimized for chat conversations',
      input_token_limit=4096,
      output_token_limit=1024,
      supported_generation_methods=['generateMessage', 'countMessageTokens'],
      temperature=0.25,
      max_temperature=None,
      top_p=0.95,
      top_k=40)
Model(name='models/text-bison-001',
      base_model_id='',
      version='001',
      display_name='PaLM 2 (Legacy)',
      description='A legacy model that understands text and generates text as an output',
      input_token_limit=8196,
      output_token_limit=1024,
      supported_generation_methods=['generateText', 'countTextTokens', 'createTunedTextModel'],
      temperature=0.7,
      max_temperature=None,
      top_p=0.95,
      top_k=40)
Model(name='models/embedding-gecko-001',
      base_model_id='',
      version='001',
      display_name='Embedding Gecko

Load my thesis document

In [33]:
from langchain.document_loaders import PyPDFLoader

# Load the PDF
pdf_loader = PyPDFLoader("/content/drive/My Drive/dissertation/Dalilian_Disseration_QA.pdf")
documents = pdf_loader.load()


In [22]:
documents[100:102]

[Document(metadata={'source': '/content/drive/My Drive/dissertation/Dalilian_Disseration_QA.pdf', 'page': 100}, page_content="   \n \n87   \nTable 8: The range of hyperparameters for the SVM model considered . \nHyperparameter  Range  \nFeature Calculation Window Length  2, 3, 4, 5, 6, 7, 8, 9 , 10 \nKernel Function  Linear, RBF  \nRegularization Parameter 'C'  0.001, 0.005,0.1,0.05,0.01,0.5,1  \nGamma (for RBF Kernels)  0.001, 0.01, 0.1, 1  \n \nWe tested linear and Radial Basis Function (RBF) kernels to identify a hyperplane \n(decision boundary) that effectively separates the data by class. Kernels transform the data into a \nhigher -dimensional space, where the linear kernel seeks a direct linear separation, and the RBF \nkernel, through its non -linear mapping, aims to find a more complex boundary (Han et al., 2012) . \nThe RBF kernel is characterized by a gamma parameter that controls the shape of the decision \nboundary. To find a trade -off between maximizing the margin between

In [23]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)
print(f"Number of chunks: {len(chunks)}")

Number of chunks: 517
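
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch using a plain fixed-size sliding window over characters. This is an illustration only, not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also tries to break on separators such as paragraphs and sentences); the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(demo)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each window repeats the last 4 characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.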


**Create embeddings of the thesis chunks to get ready for semantic search**

Here we prepare the chunks for similarity search by embedding each chunk of text, which yields one vector per chunk.
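
The idea behind similarity search can be sketched with toy vectors: embed the query, compare it with every chunk vector, and keep the closest ones. The 3-dimensional vectors below are made up for illustration (real embeddings have hundreds of dimensions), and cosine similarity is used here as an assumption; the FAISS index may rank by a different metric, such as Euclidean distance.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = sorted(
        enumerate(chunk_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [index for index, _ in scored[:k]]


# Toy data: chunks 0 and 1 point in nearly the same direction as the query.
toy_chunk_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
toy_query_vec = [1.0, 0.05, 0.0]
nearest = top_k(toy_query_vec, toy_chunk_vecs, k=2)
print(nearest)  # [0, 1]
```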


In [39]:
texts = [doc.page_content for doc in chunks]


In [32]:
embeddings = GoogleGenerativeAIEmbeddings(model = "models/embedding-001")

In [42]:
vector_store = FAISS.from_texts(texts, embedding=embeddings)

In [168]:
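def answer_question(query):
    """Answer a question using the thesis chunks most similar to the query."""
    # Retrieve the chunks closest to the query in embedding space
    similar_docs = vector_store.similarity_search(query)
    llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", temperature=0.3)
    # "stuff" places all retrieved chunks into a single prompt
    chain = load_qa_chain(llm, chain_type="stuff")
    return chain.run(input_documents=similar_docs, question=query)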
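
Conceptually, the `"stuff"` chain type concatenates all retrieved chunks into one prompt and sends it to the model in a single call. A rough sketch of that idea follows; the function name `build_stuffed_prompt` and the prompt wording are assumptions for illustration, not the template LangChain actually uses.

```python
def build_stuffed_prompt(chunks, question):
    """Join all retrieved chunks into one context block followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


toy_prompt = build_stuffed_prompt(["Chunk one.", "Chunk two."], "What is discussed?")
print(toy_prompt)
```

Because every retrieved chunk goes into the same prompt, this approach only works while the combined text fits in the model's input token limit.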

In [169]:
query1 = "how cognitive monitoring is useful for human-robot collaboration?"
answer_question(query1)

"Cognitive monitoring is useful for human-robot collaboration in two main ways:\n\n1. **Identifying Suboptimal States:** It helps identify when human collaborators are experiencing suboptimal states like stress, fatigue, or lack of attention. This allows for timely interventions, such as suggesting breaks, adjusting tasks, or prompting the human to re-engage. This ensures the well-being and safety of the human operator.\n\n2. **Enabling Adaptive Automation:**  It provides feedback to the robotic system about the human's cognitive state. This allows the robot to adapt its operation dynamically, such as adjusting the pace of work, modifying task assignments, or activating safety protocols. This leads to a more efficient and empathetic human-robot interaction where the machine is responsive to the human's condition. \n"