- OpenAI's embedding model (for example `text-embedding-ada-002`) produces vectors with 1536 dimensions.
- Once the data has been turned into embeddings, the vectors are stored in a vector store such as Pinecone, Chroma, or FAISS.
- When a query arrives, it is embedded the same way, and the chunks whose vectors are most similar to it are retrieved (semantic search).
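
To make the semantic search step concrete, here is a minimal sketch of ranking stored chunks by cosine similarity. The three-dimensional vectors and the names `toy_store` and `top_k` are made up for illustration; real embeddings come from the embedding model, and a vector store would normally use an optimized index rather than a linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-dimensional "embeddings" (assumption: toy values, not model output).
toy_store = [
    ("Nutrients provide energy.", [0.9, 0.1, 0.0]),
    ("The index lives in Pinecone.", [0.0, 0.2, 0.9]),
    ("Vitamins regulate body processes.", [0.8, 0.3, 0.1]),
]

query_vec = [1.0, 0.2, 0.0]  # pretend embedding of a nutrition question
results = top_k(query_vec, toy_store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The `k` argument plays the same role as the `k` passed to a retriever: it limits how many of the closest chunks are returned.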

In [1]:
import os 
from dotenv import load_dotenv

from langchain_text_splitters import CharacterTextSplitter
from langchain.vectorstores import Pinecone, Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI

In [2]:
load_dotenv()  # read variables from a local .env file, if one exists
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

In [39]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


In [4]:
pwd

'/Users/dhavalantala/Desktop/langchain/langchain'

In [5]:
from langchain.document_loaders import DirectoryLoader

pdf_loader = DirectoryLoader('./data', glob="**/*.pdf")
readme_loader = DirectoryLoader('./data', glob="**/*.md")
txt_loader = DirectoryLoader('./data', glob="**/*.txt")

In [6]:
os.listdir("./data")

['human-nutrition-text.pdf']

In [7]:
loaders = [pdf_loader, readme_loader, txt_loader]

# Load every file matched by the loaders into a single list of documents
documents = []
for loader in loaders:
    documents.extend(loader.load())

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/dhavalantala/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/dhavalantala/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [8]:
print(f'You have {len(documents)} document(s) in your data')
print(f'There are {len(documents[0].page_content)} characters in your document')

You have 1 document(s) in your data
There are 7823 characters in your document


In [9]:
documents[0]

Document(page_content='Introduction\n\nUNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM\n\nʻO ke kahua ma mua, ma hope ke kūkulu\n\nThe foundation comes first, then the building\n\nImage by Jim Hollyer / CC BY 4.0\n\nIntroduction | 3\n\nLearning Objectives\n\nBy the end of this chapter, you will be able to:\n\n\n\nDescribe basic concepts in nutrition\n\n\n\nDescribe factors that affect your nutritional needs\n\n\n\nDescribe the importance of research and scientific\n\nmethods to understanding nutrition\n\nWhat are Nutrients?\n\nThe foods we eat contain nutrients. Nutrients are substances\n\nrequired by the body to perform its basic functions. Nutrients must\n\nbe obtained from our diet, since the human body does not\n\nsynthesize or produce them. Nutrients have one or more of three\n\nbasic functions: they provide energy, contribute to body structure,\n\nand/or regulate chemical processes in the body. These basic\n\nfunctions allow us 

## Split the text from the documents

In [11]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=40)  # a small chunk overlap seems to work better
documents = text_splitter.split_documents(documents)
print(len(documents))

9
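
The idea behind `chunk_size` and `chunk_overlap` can be sketched with a plain sliding window over characters. This is a simplification for illustration only: `CharacterTextSplitter` splits on a separator and then merges the pieces, so its chunk boundaries will differ from the ones below. The helper name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=40):
    """Split text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# Synthetic 2500-character sample so the overlap is easy to inspect.
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample, chunk_size=1000, chunk_overlap=40)
print(len(chunks), [len(c) for c in chunks])
print(chunks[0][-40:] == chunks[1][:40])
```

The overlap means a sentence cut at a chunk boundary still appears in full, or nearly so, in one of the two neighbouring chunks.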


## Creating embeddings and storing them in a vector store

In [12]:
embeddings = OpenAIEmbeddings()


#### Using Chroma for storing vectors

In [14]:
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents, embeddings)

#### Using Pinecone for storing vectors

In [44]:
import getpass

PINECONE_API_KEY = getpass.getpass('Pinecone API Key:')

In [22]:
PINECONE_ENV = getpass.getpass('Pinecone Environment:')

In [49]:
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone

# Never hard-code the key in the notebook; reuse the value entered via getpass above
os.environ['PINECONE_API_KEY'] = PINECONE_API_KEY

index_name = "langchain-demo"


vectorstore_from_docs = PineconeVectorStore.from_documents(documents, embeddings, index_name=index_name)

#### [For Existing Index](https://docs.pinecone.io/integrations/langchain)

#### We had 9 document chunks, so 9 vectors are created in Pinecone.

In [50]:
query = "Who are the authors of gpt4all paper ?"
docs = vectorstore.similarity_search(query)

In [51]:
len(docs)

4

In [52]:
print(docs[0].page_content)

8 | Introduction


## Now the LangChain part: chaining with chat history in one line of code

In [53]:
from langchain.llms import OpenAI

In [54]:

retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k":2})
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0), retriever)


In [61]:
chat_history = []
query = "What is nutritious ?"
result = qa({"question": query, "chat_history": chat_history})
result["answer"]

' Nutritious means containing the necessary nutrients for a healthy diet.'

In [62]:
chat_history.append((query, result["answer"]))
chat_history

[('What is nutritious ?',
  ' Nutritious means containing the necessary nutrients for a healthy diet.')]
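
The list of `(question, answer)` tuples is what gives the chain its memory. Roughly, a conversational retrieval chain first uses the history to rewrite a follow-up into a standalone question and then runs retrieval on that rewritten question. The sketch below shows how such a prompt could be assembled; the prompt wording and the helper names `format_chat_history` and `build_condense_prompt` are assumptions for illustration, not LangChain's actual template or API.

```python
def format_chat_history(chat_history):
    """Render (question, answer) tuples as a plain-text transcript."""
    lines = []
    for human, ai in chat_history:
        lines.append(f"Human: {human}")
        lines.append(f"Assistant: {ai.strip()}")
    return "\n".join(lines)

def build_condense_prompt(chat_history, follow_up):
    """Build a prompt asking an LLM to rewrite follow_up as a standalone question."""
    return (
        "Given the following conversation and a follow-up question, "
        "rephrase the follow-up question to be a standalone question.\n\n"
        f"Chat history:\n{format_chat_history(chat_history)}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )

# Same shape as the chat_history list built in the cells above.
example_history = [
    ("What is nutritious ?",
     " Nutritious means containing the necessary nutrients for a healthy diet."),
]
prompt = build_condense_prompt(example_history, "Is an apple one?")
print(prompt)
```

Without the history, a follow-up such as "Is an apple one?" would be embedded on its own and would retrieve poorly matching chunks.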

In [64]:
query = "What is apple nutri score?"
result = qa({"question": query, "chat_history": chat_history})
result["answer"]

" I don't know. The nutri score of an apple would depend on its size, ripeness, and any added ingredients. It would also depend on the specific scoring system being used to determine the nutri score."

### Create a chatbot with memory with simple widgets

In [59]:
from IPython.display import display
import ipywidgets as widgets

In [60]:

chat_history = []

def on_submit(_):
    query = input_box.value
    input_box.value = ""
    
    if query.lower() == 'exit':
        print("Thanks for the chat!")
        return
    
    result = qa({"question": query, "chat_history": chat_history})
    chat_history.append((query, result['answer']))
    
    display(widgets.HTML(f'User: {query}'))
    display(widgets.HTML(f'Chatbot: {result["answer"]}'))

print("Chat with your data. Type 'exit' to stop")

input_box = widgets.Text(placeholder='Please enter your question:')
input_box.on_submit(on_submit)

display(input_box)

Chat with your data. Type 'exit' to stop



Text(value='', placeholder='Please enter your question:')

HTML(value='User: What is nutritious ?')

HTML(value='Chatbot:  Nutritious means containing the necessary nutrients for a healthy diet.')

## Gradio part (building a chatbot-like UI)
A sample Gradio example that replies with a random canned message:

In [65]:
import gradio as gr
import random

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.Button("Clear")

    def respond(message, chat_history):
        print(message)
        print(chat_history)
        bot_message = random.choice(["How are you?", "I love you", "I'm very hungry"])
        chat_history.append((message, bot_message))
        print(chat_history)
        return "", chat_history

    msg.submit(respond, [msg, chatbot], [msg, chatbot])
    clear.click(lambda: None, None, chatbot, queue=False)

demo.launch(debug=True, share=True)

Running on local URL:  http://127.0.0.1:7861
Running on public URL: https://ddc89e140b42315f64.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


What is nutritious ?
[]
[('What is nutritious ?', 'I love you')]
What is nutritions?
[['What is nutritious ?', 'I love you']]
[['What is nutritious ?', 'I love you'], ('What is nutritions?', 'How are you?')]
what is apple nutri score?
[['What is nutritious ?', 'I love you'], ['What is nutritions?', 'How are you?']]
[['What is nutritious ?', 'I love you'], ['What is nutritions?', 'How are you?'], ('what is apple nutri score?', 'I love you')]
Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7861 <> https://ddc89e140b42315f64.gradio.live
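
The sample above answers with random text. A minimal sketch of a callback that answers from a retrieval chain instead is shown below. It assumes the chain is callable with `{"question": ..., "chat_history": ...}` and returns a dict with an `"answer"` key, as `qa` does earlier in this notebook. `make_respond` and `FakeChain` are hypothetical names, and `FakeChain` only stands in for `qa` so the sketch runs without API keys.

```python
def make_respond(chain):
    """Return a Gradio-style callback that answers with `chain`."""
    def respond(message, chat_history):
        # Gradio hands the history back as lists (see the printed output above),
        # while the chain calls in this notebook used tuples, so convert first.
        history = [tuple(pair) for pair in chat_history]
        result = chain({"question": message, "chat_history": history})
        chat_history.append((message, result["answer"]))
        # An empty string clears the textbox; the history refreshes the chat.
        return "", chat_history
    return respond

class FakeChain:
    """Hypothetical stand-in for the `qa` chain."""
    def __call__(self, inputs):
        return {"answer": f"You asked: {inputs['question']}"}

demo_respond = make_respond(FakeChain())
cleared, history = demo_respond("What is nutritious ?", [])
print(history)
```

In the Gradio block, the callback would be registered the same way as before, for example `msg.submit(make_respond(qa), [msg, chatbot], [msg, chatbot])`.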


