# Create an AI-powered chatbot to answer HR-related questions

### Set up the environment

Import required libraries

In [1]:
#dynamically import "pysqlite3"
__import__("pysqlite3")

#then alias it to "sqlite3" in the sys.modules dictionary
import sys
sys.modules["sqlite3"] = sys.modules["pysqlite3"]
    #Any subsequent import of sqlite3 will now use the pysqlite3 module.
    #sqlite3 is the standard library module for SQLite in Python, while pysqlite3 is a third-party
    #alternative that may offer additional features or updates.

#other libraries
import os
import openai

Before running this notebook, set the environment variable OPENAI_API_KEY.
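
A quick check like the one below can make the notebook fail early, with a clear message, when the key is missing. This is only a minimal sketch: the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os

def require_env(name, environ=None):
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper, not part of the notebook.
    `environ` defaults to os.environ; a plain dict can be passed for testing.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value

# Usage in the notebook (uncomment once the key has been exported):
# openai_api_key = require_env("OPENAI_API_KEY")
```

Failing here is usually easier to diagnose than an authentication error raised later by the first embedding or chat request.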

### Define the document loader

We will use PyPDFLoader

In [2]:
from langchain.document_loaders import PyPDFLoader
    #pip install langchain langchain_core langchain_openai langchain_qdrant langchain_text_splitters
    #pip install --upgrade langchain_community pypdf

Load the information contained in the PDF

In [3]:
Doc_loader = PyPDFLoader("../data/1728286846_the_nestle_hr_policy_pdf_2012.pdf")
extracted_text = Doc_loader.load()
extracted_text

[Document(page_content='Policy\nMandatorySeptember\u2009 \u20092012\nThe Nestlé  \nHuman Resources Policy', metadata={'source': '../data/1728286846_the_nestle_hr_policy_pdf_2012.pdf', 'page': 0}),
 Document(page_content='Policy\nMandatorySeptember\u2009\n\u200920\n12Issuing \u2009departement\nHum\nan Resources\nTarget \u2009audience \u2009\nAll\n employees\nApprover\nExecutive Board, Nestlé S.A.\nRepository\nAll Nestlé Principles and Policies, Standards and  Guidelines can be found in the Centre online repository at:  http://intranet.nestle.com/nestledocs\nCopyright\n\u2009and\u2009confidentiality\nAl\nl rights belong to Nestec Ltd., Vevey, Switzerland.\n© 2012, Nestec Ltd.\nDesign\nNestec Ltd., Corporate Identity & Design,  Vevey, Switzerland\nProduction\nbrain’print GmbH, Switzerland\nPaper\nThis report is printed on BVS, a paper produced  from well-managed forests and other controlled sources  certified by the Forest Stewardship Council (FSC).', metadata={'source': '../data/17282868

Break down big pieces of text into smaller parts.

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter  = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]
)
    #each chunk will have a maximum of 150 characters
    #no characters will overlap between chunks
    #Multiple separators are tried in order. The splitter first tries to split the text at the first separator; if the resulting pieces still exceed chunk_size, it moves to the next separator, and so on.
        #"\n\n": Double newline, often used to separate paragraphs.
        #"\n": Single newline, often used to separate lines.
        #r"(?<=\. )": A regular expression that matches the position right after a period followed by a space, often used to separate sentences.
            #(?<=...) is a lookbehind: it asserts that what immediately precedes the current position in the text must match the pattern inside the parentheses.
            #\. matches a literal period (dot) character. The backslash \ escapes the dot, which is a special character in regular expressions that normally matches any character.
            #The space character matches a literal space.
            #Putting it all together, (?<=\. ) matches a position in the text that is immediately preceded by a period followed by a space.
            #Note: depending on the LangChain version, separators may be treated as literal strings unless is_separator_regex=True is passed, in which case this pattern never matches.
        #" ": A space character, used to separate words.
        #"": An empty string, which means that if no other separator works, the text will be split at any character to ensure the chunk size is respected.
splitted_text=text_splitter.split_documents(extracted_text)
splitted_text

[Document(page_content='Policy\nMandatorySeptember\u2009 \u20092012\nThe Nestlé  \nHuman Resources Policy', metadata={'source': '../data/1728286846_the_nestle_hr_policy_pdf_2012.pdf', 'page': 0}),
 Document(page_content='Policy\nMandatorySeptember\u2009\n\u200920\n12Issuing \u2009departement\nHum\nan Resources\nTarget \u2009audience \u2009\nAll\n employees\nApprover\nExecutive Board, Nestlé S.A.', metadata={'source': '../data/1728286846_the_nestle_hr_policy_pdf_2012.pdf', 'page': 1}),
 Document(page_content='Repository', metadata={'source': '../data/1728286846_the_nestle_hr_policy_pdf_2012.pdf', 'page': 1}),
 Document(page_content='All Nestlé Principles and Policies, Standards and  Guidelines can be found in the Centre online repository at:  http://intranet.nestle.com/nestledocs', metadata={'source': '../data/1728286846_the_nestle_hr_policy_pdf_2012.pdf', 'page': 1}),
 Document(page_content='Copyright\n\u2009and\u2009confidentiality\nAl\nl rights belong to Nestec Ltd., Vevey, Switzerla
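
The splitting strategy described in the comments above can be illustrated with a toy version in plain Python. This is a simplified sketch, not LangChain's actual implementation: `sentence_split` and `recursive_split` are hypothetical names, separators are treated as literal strings, and separators are dropped only where a split really happens.

```python
import re

def sentence_split(text):
    """Split after each ". " using a zero-width lookbehind (Python 3.7+)."""
    return [part for part in re.split(r"(?<=\. )", text) if part]

def hard_cut(text, chunk_size):
    """Last resort: cut the text every chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def recursive_split(text, chunk_size, separators):
    """Toy recursive splitter: try separators in order until pieces fit."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        return hard_cut(text, chunk_size)
    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece is still too long: fall back to the next separator.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
            current = ""
    if current:
        chunks.append(current)
    return chunks

print(sentence_split("One. Two. Three"))
print(recursive_split("aaa bbb\n\nccc ddd eee", 7, ["\n\n", " ", ""]))
```

The first call shows that the lookbehind keeps the period and the space with the preceding sentence. The second shows the fallback: the paragraph break is tried first, and only the paragraph that is still too long is split again on spaces.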

### Create a storage where the chunks are searchable

Create an instance of OpenAIEmbeddings to convert text chunks into numerical vectors

In [5]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

Create the storage with Chroma

In [6]:
from langchain.vectorstores import Chroma

vectordb = Chroma.from_documents(
    documents=splitted_text,
    embedding=embeddings,
    persist_directory="../data/chroma_vector_x"
)
    #documents: splitted_text is the list of text chunks
        #derived from splitting the PDF document.
    #embedding: the embedding model that converts text chunks into
        #numerical vectors. These vectors represent the semantic
        #meaning of the text and are used for similarity search.
    #persist_directory:
        #This specifies the directory where the vector database
        #will be stored. The database will be persisted to disk
        #in this directory, allowing it to be loaded and used
        #later without needing to recompute the embeddings.
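
To make the idea of similarity search over embedding vectors concrete, here is a small self-contained sketch. The three-dimensional vectors are made up for illustration (real embeddings have far more dimensions), `cosine_similarity` and `top_k` are hypothetical helpers, and the index and distance metric that Chroma actually uses may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Made-up (text, vector) pairs standing in for embedded chunks.
toy_store = [
    ("working conditions", [0.9, 0.1, 0.0]),
    ("training and learning", [0.1, 0.9, 0.0]),
    ("copyright notice", [0.0, 0.1, 0.9]),
]

print(top_k([1.0, 0.2, 0.0], toy_store, k=1))
```

A query vector that points in roughly the same direction as a chunk's vector ranks that chunk first, which is the property the retriever relies on later.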

### Create a chatbot and interact with it

Create an instance of the GPT model

In [7]:
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
    #initializes a language model using the ChatOpenAI class 
        #with the specified model name (gpt-3.5-turbo)
    #The temperature parameter controls the randomness of 
        #the model's output. A temperature of 0 makes the 
        #model's responses more deterministic and focused.
        #We want the model to focus on the data that is present 
        #in the PDF doc and not be "creative"
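
The effect of temperature can be sketched with the textbook softmax-with-temperature formula. This illustrates the general idea only; how the OpenAI API handles `temperature=0` internally is an assumption here, and `softmax_with_temperature` is a hypothetical helper.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Limit case: put all probability on the highest score (greedy choice).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (1.0, 0.5, 0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature decreases, more probability mass moves to the highest-scoring token, which is why a low temperature suits answers that should stay close to the retrieved text.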

Create an instance of the RetrievalQA class. When a query is made, the retriever searches the vector database for relevant documents. These documents are then used by the language model to generate a more accurate and contextually relevant answer.

In [8]:
from langchain.chains import RetrievalQA
Retriever_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True
)
    #llm: The language model instance (llm) created earlier.
    #retriever=vectordb.as_retriever():
        #vectordb.as_retriever() converts the previously created 
        #vector database (vectordb) into a retriever object. 
        #This retriever can be used to find relevant documents 
        #based on a query.
    #return_source_documents=True:
        #This parameter indicates that the source documents 
        #used to generate the answer should be returned along 
        #with the answer itself.
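
The retrieve-then-answer flow can be sketched without any external service. The code below is a toy stand-in, not the RetrievalQA implementation: `toy_retriever` ranks chunks by shared words instead of embeddings, the prompt wording is invented, and `llm` is any callable that maps a prompt string to an answer string. The returned dictionary mirrors the `result` and `source_documents` keys used in this notebook.

```python
def toy_retriever(query, chunks, k=2):
    """Rank chunks by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())

    def overlap(chunk):
        return len(query_words & set(chunk.lower().split()))

    # sorted() is stable, so ties keep their original order.
    return sorted(chunks, key=overlap, reverse=True)[:k]

def toy_retrieval_qa(query, chunks, llm):
    """Retrieve relevant chunks, put them into a prompt, and ask the model."""
    sources = toy_retriever(query, chunks)
    context = "\n\n".join(sources)
    prompt = (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return {"query": query, "result": llm(prompt), "source_documents": sources}

# A stub model that only reports the prompt size (no API call is made).
def stub_llm(prompt):
    return f"(stub answer from a prompt of {len(prompt)} characters)"

toy_chunks = [
    "The company offers training programmes",
    "working conditions are safe and healthy",
    "All rights belong to the publisher",
]
answer = toy_retrieval_qa("working conditions", toy_chunks, stub_llm)
print(answer["result"])
print(answer["source_documents"])
```

In the notebook, the same two keys are available on the value returned by `Retriever_chain`, so `res["source_documents"]` can be inspected to see which chunks supported an answer.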

Create the chatbot

In [9]:
#import time to measure processing time
import time

#infinite loop
while True:
    
    #ask the user for the query and save it in query
    query = input("\nEnter a query: ")

    #stop if exit, if empty continue asking for input
    if query == "exit":
        break
    if query.strip() == "":
        continue

    #get the answer from the chain and count time
    start = time.time()
    res=Retriever_chain(query)
    end = time.time()

    print("\n\n> Question:")
    print(query)

    print(f"\n> Answer (took {round(end - start, 2)} s.):")
    print(res['result'])
#query within scope
    #working conditions at Nestlé
#query outside scope
    #number of working hours in Nestlé
#query within scope about parental leave
    #parental leave policy in Nestlé


Enter a query:  working conditions at Nestlé




> Question:
working conditions at Nestlé

> Answer (took 2.58 s.):
Nestlé upholds the freedom of association of its employees and the effective recognition of the right to collective bargaining. This commitment suggests that Nestlé aims to provide fair and respectful working conditions for its employees.



Enter a query:  number of working hours in Nestlé




> Question:
number of working hours in Nestlé

> Answer (took 1.84 s.):
I don't have specific information on the number of working hours at Nestlé. It would be best to check directly with Nestlé or refer to their official policies for details on working hours.



Enter a query:  parental leave policy in Nestlé




> Question:
parental leave policy in Nestlé

> Answer (took 2.8 s.):
Based on the provided context, it seems that the focus is on the Nestlé Human Resources Policy in general, rather than specifically on parental leave. Unfortunately, there is no specific information provided about the parental leave policy at Nestlé. For accurate and detailed information on Nestlé's parental leave policy, it would be best to refer directly to Nestlé's official human resources documentation or contact their HR department.



Enter a query:  exit
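
The interactive loop above can also be wrapped in a function so that it is reusable and testable without a keyboard. This is a sketch under assumptions: `run_chat_loop` is a hypothetical helper, and `answer_fn` stands for any callable that turns a query string into an answer string.

```python
import time

def run_chat_loop(answer_fn, input_fn=input, output_fn=print):
    """Ask for queries until "exit"; skip blank input; return the answer count."""
    answered = 0
    while True:
        query = input_fn("\nEnter a query: ").strip()
        if query.lower() == "exit":
            break
        if not query:
            continue
        start = time.perf_counter()
        answer = answer_fn(query)
        elapsed = time.perf_counter() - start
        output_fn(f"\n> Question:\n{query}")
        output_fn(f"\n> Answer (took {elapsed:.2f} s):\n{answer}")
        answered += 1
    return answered

# Usage with the chain defined earlier in the notebook:
# run_chat_loop(lambda q: Retriever_chain(q)["result"])
```

Passing `input_fn` and `output_fn` as parameters lets the loop be driven by a scripted list of queries, which is convenient for re-running the three example queries.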


### Create a prompt template and add it to the chatbot

Create the template

In [10]:
template_content = """
    Answer the following question:
    
    Question: {{ question }}

    Additional context: {{ additional_context }}
"""
with open("../data/prompt_template.jinja2", "w") as file:
    file.write(template_content)

Load the prompt template

In [11]:
from jinja2 import Environment, FileSystemLoader
env = Environment(loader=FileSystemLoader("../data/"))
template = env.get_template("./prompt_template.jinja2")

Run the chatbot

In [12]:
#import time to measure processing time
import time

#infinite loop
while True:

    #ask the user for the query and save it in query
    query = input("\nEnter a query: ")

    #stop if exit, if empty continue asking for input
    if query == "exit":
        break
    if query.strip() == "":
        continue

    #add a link to redirect if the query is about parental leave
    if ("maternal leave" in query) or ("parental leave" in query):
        additional_context = """
            Redirect to the FAQ page:
            https://www.nestleusa.com/parents/parental-leave-frequently-asked-questions
        """
    else:
        additional_context = ""

    #render the template
    formatted_query = template.render(
        question=query,
        additional_context=additional_context
    )

    #get the answer from the chain and count time
    start = time.time()
    res=Retriever_chain(formatted_query)
    end = time.time()

    print("\n\n> Formatted Query:")
    print(formatted_query)

    print(f"\n> Answer (took {round(end - start, 2)} s.):")
    print(res['result'])
#query within scope
    #working conditions at Nestlé
#query outside scope
    #number of working hours in Nestlé
#query within scope about parental leave
    #parental leave policy in Nestlé


Enter a query:  working conditions at Nestlé




> Formatted Query:

    Answer the following question:
    
    Question: working conditions at Nestlé

    Additional context: 

> Answer (took 2.59 s.):
Based on the provided context, Nestlé emphasizes being a flexible and dynamic organization that upholds the freedom of association of its employees and the right to collective bargaining. This suggests that Nestlé likely strives to maintain fair and respectful working conditions for its employees.



Enter a query:  number of working hours in Nestlé




> Formatted Query:

    Answer the following question:
    
    Question: number of working hours in Nestlé

    Additional context: 

> Answer (took 1.79 s.):
I'm sorry, but without additional context or specific information provided, I am unable to answer the question about the number of working hours in Nestlé.



Enter a query:  parental leave policy in Nestlé




> Formatted Query:

    Answer the following question:
    
    Question: parental leave policy in Nestlé

    Additional context: 
            Redirect to the FAQ page:
            https://www.nestleusa.com/parents/parental-leave-frequently-asked-questions
        

> Answer (took 3.46 s.):
I don't know the specific details of the parental leave policy at Nestlé. For accurate and up-to-date information, I recommend visiting the FAQ page provided: https://www.nestleusa.com/parents/parental-leave-frequently-asked-questions



Enter a query:  exit
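
The keyword check used in the loop is case-sensitive, so a query such as "Parental Leave" would not trigger the redirect. A slightly more tolerant variant could look like the sketch below. The helper name `build_additional_context` and the extra keywords "maternity leave" and "paternity leave" are assumptions added for illustration; the URL is the one used in the notebook.

```python
# Keywords that should trigger the redirect. The last two are assumed additions.
PARENTAL_LEAVE_KEYWORDS = (
    "parental leave",
    "maternal leave",
    "maternity leave",
    "paternity leave",
)
FAQ_URL = "https://www.nestleusa.com/parents/parental-leave-frequently-asked-questions"

def build_additional_context(query):
    """Return the redirect text for leave-related queries, else an empty string."""
    normalized = query.lower()
    if any(keyword in normalized for keyword in PARENTAL_LEAVE_KEYWORDS):
        return f"Redirect to the FAQ page:\n{FAQ_URL}"
    return ""

print(build_additional_context("Parental Leave policy in Nestlé"))
print(repr(build_additional_context("number of working hours")))
```

Keeping the keywords in one tuple makes it easy to extend the routing later without touching the loop itself.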


### Create a user-friendly interface using Gradio

Define the function to be wrapped:

In [13]:
def chatbot(query):

    #add a link to redirect if the query is about parental leave
    if ("maternal leave" in query) or ("parental leave" in query):
        additional_context = """
            Redirect to the FAQ page:
            https://www.nestleusa.com/parents/parental-leave-frequently-asked-questions
        """
    else:
        additional_context = ""

    #render the template
    formatted_query = template.render(
        question=query,
        additional_context=additional_context
    )
    
    #get the answer from the chain
    res=Retriever_chain(formatted_query)
    
    #return the answer
    return res["result"]

Run it to check

In [14]:
chatbot("working conditions at Nestlé")

'Based on the provided context, Nestlé emphasizes being a flexible and dynamic organization that upholds the freedom of association of its employees and the right to collective bargaining. This suggests that Nestlé likely strives to maintain fair and positive working conditions for its employees.'

Wrap the function using Gradio

In [15]:
#!pip install --upgrade gradio
import gradio as gr

In [16]:
#define the Gradio interface
iface = gr.Interface(
    fn=chatbot,
    inputs="text",
    outputs="text",
    title="Nestlé HR Chatbot",
    description="Ask about the HR policy of Nestlé."
)

Launch the interface with share=True to create a public link

In [17]:
iface.launch(share=True)
#iface.close() #to close the server
    #query within scope
        #working conditions at Nestlé
    #query outside scope
        #number of working hours in Nestlé
    #query within scope about parental leave
        #parental leave policy in Nestlé

* Running on local URL:  http://127.0.0.1:7860
* Running on public URL: https://58c47740defecfccb1.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


