In [3]:
import logging
import sys

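# Send ERROR-level (and above) log records to stdout.
# basicConfig already attaches a stdout handler to the root logger, so adding a
# second StreamHandler on top of it would print every record twice.
logging.basicConfig(stream=sys.stdout, level=logging.ERROR)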

### Part 1: Make two knowledge bases. 

The first is specific to Tiara's skills and qualifications (resume and CV). The second is specific to roles that are good matches (job descriptions).

- Large language model: Llama 3.1 8B
- Embedding model: BAAI/bge-large-en-v1.5

In [5]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

In [7]:
from llama_index.llms.ollama import Ollama

In [9]:
# load documents

documents = SimpleDirectoryReader("data_tiara").load_data()

In [11]:
# set embedding model
# according to LangChain, "BGE models on the HuggingFace are the best open-source embedding models."

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")

In [13]:
# use Ollama to serve Llama 3.1 8B (instruct, 4-bit quantized) as the LLM

Settings.llm = Ollama(model="llama3.1:8b-instruct-q4_0", request_timeout=360.0)

In [15]:
print(Settings.chunk_size, Settings.chunk_overlap)

1024 200


In [17]:
# use smaller chunks than the defaults printed above
Settings.chunk_size = 256
Settings.chunk_overlap = 50
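
To make the two settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not LlamaIndex's splitter: the function name `chunk_tokens` is hypothetical, and the sketch works on a plain list of tokens and ignores sentence boundaries, which a real splitter may respect.

```python
def chunk_tokens(tokens, chunk_size=256, chunk_overlap=50):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical input: 600 placeholder tokens.
tokens = [f"tok{i}" for i in range(600)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Smaller chunks give the retriever more precise passages to match against, and the overlap keeps a sentence that straddles a boundary visible in both neighboring chunks.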

In [19]:
# build the vector index for Tiara's documents

index_tiara = VectorStoreIndex.from_documents(
    documents,
)
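
Conceptually, a vector index stores one embedding per chunk and answers a query by ranking chunks by similarity to the query embedding. The sketch below illustrates that idea only. The word-count `embed` function is a toy stand-in for the BGE model, the chunk texts are hypothetical placeholders, and none of the names come from LlamaIndex.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical chunks, not taken from the real documents.
chunks = [
    "education: doctoral degree in a biomedical field",
    "skills: python, sql, and statistics",
    "experience: built data pipelines for clinical studies",
]
best = top_k("which programming skills are listed", chunks, k=1)
print(best)
```

In the real pipeline the retrieved chunks are then passed to the LLM, which writes the answer from them.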

In [21]:
# query the index: retrieve the most relevant chunks and let the LLM answer from them

query_engine = index_tiara.as_query_engine()
response = query_engine.query("What did Tiara study?")
print(response)

Pathobiology and Molecular Medicine.


In [23]:
response = query_engine.query("Where did Tiara go for undergrad?")
print(response)

Michigan State University.


In [29]:
# load the job descriptions for the 2nd knowledge base
# and build its vector index

docs2 = SimpleDirectoryReader("data_jd").load_data()
index_jd = VectorStoreIndex.from_documents(docs2)

In [31]:
# check 2nd index

query_engine = index_jd.as_query_engine()
response = query_engine.query("What are five most important skills for a clinical data scientist?")
print(response)

Based on the provided job descriptions, here are five essential skills for a Clinical Data Scientist:

1. **Data wrangling and analysis**: The ability to transform clinical trial, observational study, and electronic health data into tidy datasets is crucial.
2. **Programming skills**: Proficiency in Python and its essential data science tools (numpy, pandas) is a must-have for effective data analysis and manipulation.
3. **Clinical knowledge**: Understanding outcome measures, biomarkers, and other data measured in clinical trials is vital for making informed decisions.
4. **Collaboration and communication**: The ability to work with multi-disciplinary scientists and engineers, as well as communicate insights to clinical and laboratory teams, is essential.
5. **Data quality control**: Ensuring the accuracy and consistency of clinical and laboratory data through automated checks and resolving inconsistencies is critical for maintaining high-quality data products.


### Part 2: Set up multi-document agent.

Reference: https://docs.llamaindex.ai/en/stable/examples/agent/multi_document_agents/#building-multi-document-agents

In [38]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.agent import ReActAgent

# a ReAct agent alternates between reasoning steps (Thought) and tool calls (Action)

In [40]:
# set up a different engine for each knowledge base

tiara_engine = index_tiara.as_query_engine()
jobs_engine = index_jd.as_query_engine()

In [42]:
# set up query engine tools

query_engine_tools = [
    QueryEngineTool(
        query_engine=tiara_engine,
        metadata=ToolMetadata(
            name="tiara_quals",
            description=(
                "Provides information about Tiara's professional skills and qualifications. "
                "Use a detailed plain text question as input to the tool."
            ),
        ),
    ),
    QueryEngineTool(
        query_engine=jobs_engine,
        metadata=ToolMetadata(
            name="job_descriptions",
            description=(
                "Provides information about jobs that match Tiara's skills, qualifications, and preferences based on industry and location. "
                "Use a detailed plain text question as input to the tool."
            ),
        ),
    ),
]
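
A note on the parenthesized descriptions above: Python joins adjacent string literals with no separator, so each sentence needs its own trailing (or leading) space. A small check with placeholder sentences:

```python
# Adjacent string literals are concatenated exactly as written.
joined = (
    "First sentence."
    "Second sentence."
)

# A trailing space on the first literal keeps the sentences apart.
spaced = (
    "First sentence. "
    "Second sentence."
)

print(repr(joined))
print(repr(spaced))
```

This matters here because the tool description is the text the LLM reads when deciding which tool to call.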

In [46]:
# extra context that tells the agent what role to play

context = (
    "You are a recruiter trying to find roles that match Tiara's skills, qualifications, and preferences. "
    "You MUST use at least one of the tools provided when answering a question."
)

# set up the agent

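# same model and timeout as Settings.llm above; agent steps on a local model can be slow
llm = Ollama(model="llama3.1:8b-instruct-q4_0", request_timeout=360.0)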

agent = ReActAgent.from_tools(
    query_engine_tools,
    llm=llm,
    verbose=True,
    context=context
)
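
For intuition about what the agent does with these tools, here is a minimal sketch of a ReAct-style loop. It is not LlamaIndex's implementation: `react_loop`, `scripted_llm`, and the `lookup` tool are hypothetical names, and the "LLM" is a scripted function so the example runs without a model. The loop asks the model for a reply, runs the named tool if the reply contains an action, appends the observation, and repeats until the model answers.

```python
import json
import re

# Matches "Action: <tool>" followed by "Action Input: {...}".
ACTION_RE = re.compile(r"Action:\s*(\w+)\s*Action Input:\s*(\{.*\})", re.DOTALL)


def react_loop(llm, tools, question, max_steps=5):
    """Alternate between model replies and tool calls until an answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += reply + "\n"
        match = ACTION_RE.search(reply)
        if match is None:
            # No tool call in the reply: treat it as the final answer.
            return reply.split("Answer:", 1)[-1].strip()
        name, raw_args = match.groups()
        try:
            args = json.loads(raw_args)
        except json.JSONDecodeError:
            args = {}
        if name not in tools or not isinstance(args.get("input"), str):
            # Feed the mistake back so the model can correct its next call.
            observation = 'Error: call a known tool with {"input": "<question>"}'
        else:
            observation = tools[name](args["input"])
        transcript += f"Observation: {observation}\n"
    raise RuntimeError("no final answer within max_steps")


def scripted_llm(transcript):
    """Stand-in for a real model: one tool call, then an answer."""
    if "Observation:" not in transcript:
        return 'Thought: I need a tool.\nAction: lookup\nAction Input: {"input": "degree"}'
    observation = transcript.split("Observation: ", 1)[1].strip()
    return "Thought: I can answer now.\nAnswer: The tool reported: " + observation


tools = {"lookup": lambda q: f"stub result for {q!r}"}
answer = react_loop(scripted_llm, tools, "What degree?")
print(answer)
```

The validation branch is a design choice for small local models, which sometimes pass something other than a plain `{"input": ...}` object as the action input.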

In [48]:
response = agent.chat("Is Tiara qualified for a job that requires a bachelor's degree in a quantitative field and proficiency in at least one programming language?")
print('*****')
print(str(response))

> Running step d2bea435-fb75-434c-8ae1-d00dd343fd2e. Step input: Is Tiara qualified for a job that requires a bachelor's degree in a quantitative field and proficiency in at least one programming language?
Thought: The current language of the user is English. I need to use a tool to help me answer the question.
Action: tiara_quals
Action Input: {'type': 'object', 'properties': AttributedDict([('input', AttributedDict([('title', 'Input'), ('type', 'string')]))]), 'required': ['input']}
Observation: A string.
> Running step 48068f59-8b8a-42ca-be78-50139d15d101. Step input: None
Thought: The observation suggests that the tool tiara_quals returned a string, but it didn't provide any specific information about Tiara's qualifications.
Action: tiara_quals
Action Input: {'type': 'object', 'properties': AttributedDict([('input', "What are Tiara's professional skills and qualifications?")]), 'required': ['input']}
Observation: Tiara has