# Setting up a query vector store

## Loading in the data

Below, we load the scraped pages into a dictionary whose keys are the condition names (taken from the folder names) and whose values are the text extracted from each scraped page.

In [1]:
import os
from tqdm.notebook import tqdm
from bs4 import BeautifulSoup

# path to the condition folder (download them from sharepoint or scrape them again)
conditions_folder = "../nhs-use-case/conditions/"

# set to True if you want to extract only the main element
main_only = True

# read all conditions and store their text in a dictionary keyed by condition name
conditions = {}
for condition in tqdm(os.listdir(conditions_folder)):
    try:
        with open(
            os.path.join(conditions_folder, condition, "index.html"), "r"
        ) as f:
            content = f.read()
        soup = BeautifulSoup(content, "html.parser")
        if main_only:
            # extract the main element
            main_element = soup.find("main", class_="nhsuk-main-wrapper")
            # extract the text from the main element
            text = main_element.get_text(separator="\n", strip=True)
        else:
            text = soup.get_text(separator="\n", strip=True)
        conditions[condition] = text
    except Exception as e:
        print(f"Error reading condition {condition}: {e}")
        continue

  0%|          | 0/913 [00:00<?, ?it/s]

Error reading condition index.html: [Errno 20] Not a directory: '../nhs-use-case/conditions/index.html/index.html'
Error reading condition .DS_Store: [Errno 20] Not a directory: '../nhs-use-case/conditions/.DS_Store/index.html'
Error reading condition README.txt: [Errno 20] Not a directory: '../nhs-use-case/conditions/README.txt/index.html'
Error reading condition mental-health: [Errno 2] No such file or directory: '../nhs-use-case/conditions/mental-health/index.html'
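
The four errors above come from entries that are not condition pages: three plain files and one directory (`mental-health`) without an `index.html`. One way to avoid them is to filter the folder listing before reading anything. The sketch below assumes that every valid condition is a subdirectory containing an `index.html`; `list_condition_dirs` is a hypothetical helper name, not part of the original notebook.

```python
from pathlib import Path


def list_condition_dirs(folder):
    """Return the names of subdirectories of `folder` that contain an index.html."""
    return sorted(
        entry.name
        for entry in Path(folder).iterdir()
        if entry.is_dir() and (entry / "index.html").is_file()
    )
```

The loading loop could then iterate over `list_condition_dirs(conditions_folder)` instead of `os.listdir(conditions_folder)`.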


In [2]:
print(conditions["malnutrition"])

Overview
-
Malnutrition
Contents
Overview
Symptoms
Causes
Treatment
Malnutrition is a serious condition that happens when your diet does not contain the right amount of nutrients.
It means "poor nutrition" and can refer to:
undernutrition – not getting enough nutrients
overnutrition – getting more nutrients than needed
These pages focus on undernutrition in adults or children. Read about
obesity
for more about the problems associated with overnutrition.
Signs and symptoms of malnutrition
Common signs of malnutrition include:
unintentional weight loss
– losing 5% to 10% or more of weight over 3 to 6 months is one of the main signs of malnutrition
a low body weight – people with a body mass index (BMI) under 18.5 are at risk of being malnourished (use the
BMI calculator
to work out your BMI)
a lack of interest in eating and drinking
feeling tired all the time
feeling weak
getting ill often and taking a long time to recover
in children, not growing or not putting on weight at the expected

## Using the `langchain` library

We use the `langchain` library to build a vector store. The embeddings come from a Hugging Face sentence-transformer model, and the `Chroma` vector store holds them and performs the similarity search.

In [3]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

In [4]:
sentence_transformer_model_name = "sentence-transformers/all-mpnet-base-v2"
embeddings = HuggingFaceEmbeddings(model_name=sentence_transformer_model_name)

We split the documents with `SentenceTransformersTokenTextSplitter`, a text splitter designed for use with sentence-transformer models. By default it splits the text into chunks that fit the token window of the chosen model, so we pass in the same model name that we used above.

In [5]:
splitter = SentenceTransformersTokenTextSplitter(
    model_name=sentence_transformer_model_name,
    chunk_overlap=256,
)
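
To make the effect of `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is only an illustration of the idea, not the splitter's actual implementation: whitespace-separated words stand in for the model's subword tokens, and `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of `chunk_size` sharing `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
        start += step
    return chunks


# Whitespace "tokens" stand in for the model's real subword tokens.
tokens = "a b c d e f g h i j".split()
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so neighbouring chunks repeat some text. A larger overlap gives more chunks per page but makes it less likely that a relevant passage is cut in half at a chunk boundary.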

We create `Document` objects with the `create_documents` method, passing the condition name as metadata. Note in the output below how a single page may be split across several documents.

In [6]:
documents = splitter.create_documents(
    list(conditions.values()),
    metadatas=[{"document": cond} for cond in conditions],
)

In [7]:
documents[:10]

[Document(metadata={'document': 'malnutrition'}, page_content='overview - malnutrition contents overview symptoms causes treatment malnutrition is a serious condition that happens when your diet does not contain the right amount of nutrients. it means " poor nutrition " and can refer to : undernutrition – not getting enough nutrients overnutrition – getting more nutrients than needed these pages focus on undernutrition in adults or children. read about obesity for more about the problems associated with overnutrition. signs and symptoms of malnutrition common signs of malnutrition include : unintentional weight loss – losing 5 % to 10 % or more of weight over 3 to 6 months is one of the main signs of malnutrition a low body weight – people with a body mass index ( bmi ) under 18. 5 are at risk of being malnourished ( use the bmi calculator to work out your bmi ) a lack of interest in eating and drinking feeling tired all the time feeling weak getting ill often and taking a long time to
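
To check how many chunks each page produced, the metadata can be tallied. This is a small sketch that assumes each item has a `metadata` dictionary with a `"document"` key, as created above; `chunks_per_page` is a hypothetical helper name.

```python
from collections import Counter


def chunks_per_page(docs):
    """Count how many chunks each source page produced, keyed by metadata["document"]."""
    return Counter(doc.metadata["document"] for doc in docs)
```

For example, `chunks_per_page(documents).most_common(5)` would list the five pages that were split into the most chunks.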

We can create a vector database by passing in the documents to `Chroma`. 

Note that you can set the `persist_directory` to a directory of your choice. This will save the vector store to disk so that you can load it later without having to recreate it.

In [8]:
from langchain_chroma import Chroma

db = Chroma.from_documents(documents, embeddings, persist_directory="./nhs_use_case_db")

We can now perform a similarity search. If we pass in a query about losing weight, chunks from pages related to malnutrition and obesity are returned, which seems reasonable.

In [9]:
query = "What should I do if I have lost a lot of weight over the last 3 to 6 months?"
docs = db.similarity_search(query, k=3)
docs

[Document(id='9e53b33b-cc82-4821-8c48-c24b3fdd0297', metadata={'document': 'malnutrition'}, page_content='weight over the last 3 to 6 months you have other symptoms of malnutrition you \' re worried someone in your care, such as a child or older person, may be malnourished if you \' re concerned about a friend or family member, try to encourage them to see a gp. a gp can check if you \' re at risk of malnutrition by measuring your weight and height, and asking about any medical problems you have or any recent changes in your weight or appetite. if they think you could be malnourished, they may refer you to a healthcare professional such as a dietitian to discuss treatment. who \' s at risk of malnutrition malnutrition is a common problem that affects millions of people in the uk. anyone can become malnourished, but it \' s more common in people who : have a long - term health conditions that affect appetite, weight and / or how well nutrients are absorbed by the gut, such as crohn \' s
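
Conceptually, a similarity search embeds the query and ranks the stored chunks by how close their vectors are to the query vector. The toy sketch below uses made-up 3-dimensional vectors and cosine similarity purely for illustration; the metric Chroma actually applies depends on how the collection is configured, and `top_k` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k):
    """Return the k (name, similarity) pairs closest to the query vector."""
    scored = [(name, cosine_similarity(query_vector, vec)) for name, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up vectors; real embeddings have hundreds of dimensions.
stored = {
    "malnutrition": [0.9, 0.1, 0.0],
    "obesity": [0.7, 0.3, 0.1],
    "broken-arm": [0.0, 0.1, 0.9],
}
query_vector = [1.0, 0.2, 0.0]
ranked = top_k(query_vector, stored, k=2)
```

Here `ranked` holds the two entries whose vectors point in the direction most similar to the query, which mirrors what `similarity_search(query, k=...)` does at a much larger scale.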

In [10]:
print(docs[0].metadata)

{'document': 'malnutrition'}


In [11]:
print(docs[0].page_content)

weight over the last 3 to 6 months you have other symptoms of malnutrition you ' re worried someone in your care, such as a child or older person, may be malnourished if you ' re concerned about a friend or family member, try to encourage them to see a gp. a gp can check if you ' re at risk of malnutrition by measuring your weight and height, and asking about any medical problems you have or any recent changes in your weight or appetite. if they think you could be malnourished, they may refer you to a healthcare professional such as a dietitian to discuss treatment. who ' s at risk of malnutrition malnutrition is a common problem that affects millions of people in the uk. anyone can become malnourished, but it ' s more common in people who : have a long - term health conditions that affect appetite, weight and / or how well nutrients are absorbed by the gut, such as crohn ' s disease have problems swallowing ( dysphagia ) are socially isolated, have limited mobility, or a low income ne

We can also use the `similarity_search_with_relevance_scores` method to get a relevance score alongside each result. Each item returned is a `(Document, score)` tuple:

In [12]:
docs = db.similarity_search_with_relevance_scores(query, k=10)
docs

[(Document(id='9e53b33b-cc82-4821-8c48-c24b3fdd0297', metadata={'document': 'malnutrition'}, page_content='weight over the last 3 to 6 months you have other symptoms of malnutrition you \' re worried someone in your care, such as a child or older person, may be malnourished if you \' re concerned about a friend or family member, try to encourage them to see a gp. a gp can check if you \' re at risk of malnutrition by measuring your weight and height, and asking about any medical problems you have or any recent changes in your weight or appetite. if they think you could be malnourished, they may refer you to a healthcare professional such as a dietitian to discuss treatment. who \' s at risk of malnutrition malnutrition is a common problem that affects millions of people in the uk. anyone can become malnourished, but it \' s more common in people who : have a long - term health conditions that affect appetite, weight and / or how well nutrients are absorbed by the gut, such as crohn \' 

In [13]:
type(docs[0]), len(docs[0])

(tuple, 2)
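
Because each result is a `(Document, score)` pair, weak matches can be dropped with a simple filter. This is a minimal sketch that assumes higher scores mean more relevant results; the helper name `filter_by_score` and any threshold value are illustrative choices, not taken from the original notebook.

```python
def filter_by_score(results, threshold):
    """Keep only the (document, score) pairs whose score is at least `threshold`."""
    return [(doc, score) for doc, score in results if score >= threshold]
```

For example, `filter_by_score(docs, threshold=0.5)` would keep only the results scoring 0.5 or above. A sensible threshold depends on the embedding model and the data, so it is worth inspecting the scores before choosing one.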

You can read in the vector store from disk by passing in the `persist_directory` to the `Chroma` constructor. This will load the vector store from disk and you can use it as before. 

In [14]:
db2 = Chroma(persist_directory="./nhs_use_case_db", embedding_function=embeddings)

In [15]:
docs = db2.similarity_search_with_relevance_scores(query, k=10)
docs

[(Document(id='9e53b33b-cc82-4821-8c48-c24b3fdd0297', metadata={'document': 'malnutrition'}, page_content='weight over the last 3 to 6 months you have other symptoms of malnutrition you \' re worried someone in your care, such as a child or older person, may be malnourished if you \' re concerned about a friend or family member, try to encourage them to see a gp. a gp can check if you \' re at risk of malnutrition by measuring your weight and height, and asking about any medical problems you have or any recent changes in your weight or appetite. if they think you could be malnourished, they may refer you to a healthcare professional such as a dietitian to discuss treatment. who \' s at risk of malnutrition malnutrition is a common problem that affects millions of people in the uk. anyone can become malnourished, but it \' s more common in people who : have a long - term health conditions that affect appetite, weight and / or how well nutrients are absorbed by the gut, such as crohn \' 

## Retrievers

Retrievers in LangChain are used to retrieve documents. These could come from a vector store or from other databases, such as graph databases or relational databases.

A retriever can be created from the vector store by calling its `as_retriever` method.

Below, we create a specific retriever which retrieves full documents rather than chunks of documents using the `ParentDocumentRetriever` class.

In [None]:
from langchain_core.documents import Document
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

# documents do not need to be split now as we want to retrieve full documents
documents = [
    Document(page_content=doc, metadata=meta)
    for doc, meta in zip(
        conditions.values(), [{"document": cond} for cond in list(conditions.keys())]
    )
]

# initialise the document store
store = InMemoryStore()
# uncomment if you want to save the document store to disk
# (LocalFileStore and create_kv_docstore can both be imported from langchain.storage)
# fs = LocalFileStore("./nhs_use_case_fs")
# store = create_kv_docstore(fs)

# initialise the vector store
vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=embeddings,
    persist_directory="./nhs_use_case_db",
)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=splitter,
    parent_splitter=None,
    search_kwargs={"k": 20},
)

In [17]:
len(documents)

909

In [18]:
retriever.add_documents(documents)

We can see that the number of documents in the store is equal to the number of raw documents.

In [19]:
len(list(store.yield_keys()))

909

In [20]:
docs = retriever.invoke(query)

In [21]:
docs

[Document(metadata={'document': 'malnutrition'}, page_content='Overview\n-\nMalnutrition\nContents\nOverview\nSymptoms\nCauses\nTreatment\nMalnutrition is a serious condition that happens when your diet does not contain the right amount of nutrients.\nIt means "poor nutrition" and can refer to:\nundernutrition – not getting enough nutrients\novernutrition – getting more nutrients than needed\nThese pages focus on undernutrition in adults or children. Read about\nobesity\nfor more about the problems associated with overnutrition.\nSigns\xa0and symptoms of malnutrition\nCommon signs of malnutrition include:\nunintentional weight loss\n– losing 5% to 10% or more of weight over 3 to 6 months is one of the main signs of malnutrition\na low body weight – people with a\xa0body mass index (BMI) under 18.5 are at risk of being malnourished (use the\nBMI calculator\nto work out your BMI)\na lack of interest in eating and drinking\nfeeling tired all the time\nfeeling weak\ngetting ill often and t

In [22]:
len(docs)

11
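
The retriever was asked for 20 chunks (`k=20`) but returned 11 documents. This is consistent with how `ParentDocumentRetriever` works: it searches the child chunks and then returns each parent document only once, so several matching chunks from the same page collapse into a single result. A simplified sketch of that de-duplication step, using made-up parent ids rather than the retriever's real internals:

```python
def unique_parents(chunk_parent_ids):
    """Return parent ids in first-seen order, dropping repeats."""
    seen = set()
    ordered = []
    for parent_id in chunk_parent_ids:
        if parent_id not in seen:
            seen.add(parent_id)
            ordered.append(parent_id)
    return ordered


# Hypothetical parent ids for six retrieved chunks, best match first.
hits = ["malnutrition", "malnutrition", "obesity", "malnutrition", "anorexia", "obesity"]
parents = unique_parents(hits)
```

In this toy case six chunk hits reduce to three parent documents, in the order in which each parent first appeared among the hits.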