# Building Your Own Search Engine Using Vector Databases


![img](https://www.analyticsvidhya.com/datahack-summit-2023/wp-content/uploads/2023/07/s-won_searchengin.jpg)

## Agenda 

**Part 0: The Beginning**

Welcome and Objectives : An introduction to the session's aims, a brief ice breaker activity, and setting the stage for the exploration ahead.

Context Setting: A quick overview of AI, Search Engines, the current landscape, potential use-cases, benefits, and challenges. Highlight the significance of building an AI search engine with your data.

**Part 1: Understanding the Basics**

NLP and Search Engines: Explore the components like Natural Language Processing (NLP), Machine Learning algorithms, and their roles in crafting an efficient AI search engine. Topics include


- What are vector embeddings?
- Legacy vectorizing techniques like CountVectorizer, bag of words
- Similarity measures and how do they work
- LLM and Transformers

Vector Databases
  - An Exploration:
  - Unveiling Vector Databases
  - Understanding their workings
  - Real-world use-cases
  - A comparative analysis of available options

**Part 2: Indexing**

Splitting the data : Why is it required and different kinds of Data splitting. Also cover why splitting is context dependent (i.e depends on data)

The next step is to insert the data into the database


**Part 3: Searching**

Performing Semantic Search on Indexed Data: Discuss and code the integration of NLP and Machine Learning algorithms into the search engine to comprehend, analyze, and generate precise search results from the given data. Employ Command line tools (or maybe a GUI) to execute the searches.

Discuss Different Retrieval algorithms such as
* MMR
* LLM Aided Retrieval
* Compression

**Part 4: Question and answering**

- Prompt Engineering and templates 
- Addressing a lot of windows and short context windows

**Part 5: Chat**

- Introducing memory
- Followup conversations


**Wrap-Up and Next Steps**

Conclusion and Future Directions: Discuss steps to enhance the solution and where to go from here, providing a clear path for continued exploration and development.

**References and Resources**


## Checkbox


### Demo Checkbox

- [X]  **Part 1: Understanding the Basics**
- [X]  **Part 2: Indexing**
- [X]  **Part 3: Semantic search with vector db**
- [X]  **Part 4: Question and answering**
- [X]  **Part 5: Chat**


### Theory material Checkbox

- [X]  **Part 1: Understanding the Basics**
- [X]  **Part 2: Indexing**
- [X]  **Part 3: Semantic search with vector db**
- [X]  **Part 4: Question and answering**
- [X]  **Part 5: Chat**

## Libraries and Technologies we will use

1) Pre Trained Large Language Model (LLM) like ChatGPT for vector Embedding
2) Langchain for Supporting our model application
3) Vector Database like Chroma
4) Gradio


## Generic Architecture

![arch](https://ghost.hacksoft.io/content/images/2023/04/answering_questions.png)

# Hands On Coding

## Dependencies Installation and Loading data

In [1]:
#!pip install langchain openai chromadb kaggle sentence_transformers datasets gradio elasticsearch tiktoken

In [2]:
import os.path
if not os.path.isfile("database.sqlite"):
    os.system("kaggle datasets download benhamner/nips-papersv")
    os.system("unzip -o nips-papers.zip")

In [3]:
import os
import numpy as np
import pandas as pd
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
import sqlite3
from PyPDF2 import PdfReader


In [4]:
con = sqlite3.connect("database.sqlite")

sql= """WITH paper_author_list AS (
    SELECT papers.id AS paper_id, Group_concat(authors.name) AS author_list
    FROM papers
    JOIN paper_authors ON papers.id = paper_authors.paper_id
    JOIN authors ON paper_authors.author_id = authors.id
    GROUP BY paper_id
)
SELECT papers.id AS paper_id, papers.year, papers.title, papers.abstract, papers.paper_text, paper_author_list.author_list AS authors
FROM papers
JOIN paper_author_list ON papers.id = paper_author_list.paper_id
WHERE abstract NOT LIKE '%Abstract Missing%'

""";

papers_df = pd.read_sql_query(sql, con)

In [5]:
papers_df.head()

Unnamed: 0,paper_id,year,title,abstract,paper_text,authors
0,1861,2000,Algorithms for Non-negative Matrix Factorization,Non-negative matrix factorization (NMF) has pr...,Algorithms for Non-negative Matrix\nFactorizat...,"Daniel D. Lee,H. Sebastian Seung"
1,1975,2001,Characterizing Neural Gain Control using Spike...,Spike-triggered averaging techniques are effec...,Characterizing neural gain control using\nspik...,"Odelia Schwartz,E.J. Chichilnisky,Eero P. Simo..."
2,3163,2007,Competition Adds Complexity,It is known that determinining whether a DEC-P...,Competition adds complexity\n\nJudy Goldsmith\...,"Judy Goldsmith,Martin Mundhenk"
3,3164,2007,Efficient Principled Learning of Thin Junction...,We present the first truly polynomial algorith...,Efficient Principled Learning of Thin Junction...,"Anton Chechetka,Carlos Guestrin"
4,3167,2007,Regularized Boost for Semi-Supervised Learning,Semi-supervised inductive learning concerns ho...,Regularized Boost for Semi-Supervised Learning...,"Ke Chen,Shihai Wang"


In [6]:
def pre_process_text(papers_df, column):
    
    # Load the regular expression library
    import re
    preprocessed_column = f"{column}_processed"

    # Print the titles of the first rows 
    print(papers_df[column].head())

    # remove punctuations
    #papers_df[preprocessed_column] = papers_df[column].map(lambda x: re.sub('[,!?]', '', x))
    
     # remove carriage return and end of line
    papers_df[preprocessed_column] = papers_df[column].map(lambda x: re.sub('[\r\n]', ' ', x))
    
     # remove double spaces
    papers_df[preprocessed_column] = papers_df[preprocessed_column].map(lambda x: re.sub('  ', ' ', x))

    
    # remove para continuation
    papers_df[preprocessed_column] = papers_df[preprocessed_column].map(lambda x: re.sub('- ', '', x))

    # Convert the titles to lowercase
    papers_df[preprocessed_column] = papers_df[preprocessed_column].map(lambda x: x.lower())
    return papers_df.head()

In [7]:
text_columns = ["title", "abstract", "paper_text"]
for column in text_columns:
    pre_process_text(papers_df, column)

0     Algorithms for Non-negative Matrix Factorization
1    Characterizing Neural Gain Control using Spike...
2                          Competition Adds Complexity
3    Efficient Principled Learning of Thin Junction...
4       Regularized Boost for Semi-Supervised Learning
Name: title, dtype: object
0    Non-negative matrix factorization (NMF) has pr...
1    Spike-triggered averaging techniques are effec...
2    It is known that determinining whether a DEC-P...
3    We present the first truly polynomial algorith...
4    Semi-supervised inductive learning concerns ho...
Name: abstract, dtype: object
0    Algorithms for Non-negative Matrix\nFactorizat...
1    Characterizing neural gain control using\nspik...
2    Competition adds complexity\n\nJudy Goldsmith\...
3    Efficient Principled Learning of Thin Junction...
4    Regularized Boost for Semi-Supervised Learning...
Name: paper_text, dtype: object


In [8]:
papers_df.abstract_processed[0]


'non-negative matrix factorization (nmf) has previously been shown to  be a useful decomposition for multivariate data. two different multi plicative algorithms for nmf are analyzed. they differ only slightly in  the multiplicative factor used in the update rules. one algorithm can be  shown to minimize the conventional least squares error while the other  minimizes the generalized kullback-leibler divergence. the monotonic  convergence of both algorithms can be proven using an auxiliary func tion analogous to that used for proving convergence of the expectation maximization algorithm. the algorithms can also be interpreted as diag onally rescaled gradient descent, where the rescaling factor is optimally  chosen to ensure convergence. '

In [9]:
papers_df = papers_df.drop(["title", "abstract", "paper_text"], axis=1)
papers_df.head()

Unnamed: 0,paper_id,year,authors,title_processed,abstract_processed,paper_text_processed
0,1861,2000,"Daniel D. Lee,H. Sebastian Seung",algorithms for non-negative matrix factorization,non-negative matrix factorization (nmf) has pr...,algorithms for non-negative matrix factorizati...
1,1975,2001,"Odelia Schwartz,E.J. Chichilnisky,Eero P. Simo...",characterizing neural gain control using spike...,spike-triggered averaging techniques are effec...,characterizing neural gain control using spike...
2,3163,2007,"Judy Goldsmith,Martin Mundhenk",competition adds complexity,it is known that determinining whether a dec-p...,competition adds complexity judy goldsmith dep...
3,3164,2007,"Anton Chechetka,Carlos Guestrin",efficient principled learning of thin junction...,we present the first truly polynomial algorith...,efficient principled learning of thin junction...
4,3167,2007,"Ke Chen,Shihai Wang",regularized boost for semi-supervised learning,semi-supervised inductive learning concerns ho...,regularized boost for semi-supervised learning...


In [10]:
papers_df = papers_df.rename(columns={'title_processed':'title', 'abstract_processed': 'abstract', 'paper_text_processed': 'paper_text'})

In [11]:
papers_df.head()

Unnamed: 0,paper_id,year,authors,title,abstract,paper_text
0,1861,2000,"Daniel D. Lee,H. Sebastian Seung",algorithms for non-negative matrix factorization,non-negative matrix factorization (nmf) has pr...,algorithms for non-negative matrix factorizati...
1,1975,2001,"Odelia Schwartz,E.J. Chichilnisky,Eero P. Simo...",characterizing neural gain control using spike...,spike-triggered averaging techniques are effec...,characterizing neural gain control using spike...
2,3163,2007,"Judy Goldsmith,Martin Mundhenk",competition adds complexity,it is known that determinining whether a dec-p...,competition adds complexity judy goldsmith dep...
3,3164,2007,"Anton Chechetka,Carlos Guestrin",efficient principled learning of thin junction...,we present the first truly polynomial algorith...,efficient principled learning of thin junction...
4,3167,2007,"Ke Chen,Shihai Wang",regularized boost for semi-supervised learning,semi-supervised inductive learning concerns ho...,regularized boost for semi-supervised learning...


In [12]:
papers_df = papers_df.drop(["paper_text"], axis=1)

In [13]:
with open("../secret/openai") as f:
    openai_secret = f.read().strip()
    
# PDF_FILE = "../data/GenericEmailMarketting/merged_file.pdf"

# use import getpass instead

os.environ["OPENAI_API_KEY"] = openai_secret 

In [14]:
llm = OpenAI(temperature=0)
llm.openai_api_key = os.environ["OPENAI_API_KEY"]

In [15]:
llm("tell me a joke")

'\n\nQ: What did the fish say when it hit the wall?\nA: Dam!'

In [16]:
llm("Who is the current prime minister of Britain")

'?\n\nThe current Prime Minister of the United Kingdom is Boris Johnson.'

In [17]:
from langchain.embeddings import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()

def get_openai_embedding(text):
   text_rep = text.replace("\n", " ")
   return embeddings_model.embed_documents([text_rep])

In [18]:
sentence1 = "i like summer"
sentence2 = "Brocholi on pizza is probably not a good idea"
sentence3 = "I love the warm weather outside"

In [19]:
embedding1 = embeddings_model.embed_query(sentence1)
embedding2 = embeddings_model.embed_query(sentence2)
embedding3 = embeddings_model.embed_query(sentence3)

In [20]:
print(np.dot(embedding1, embedding2))
print(np.dot(embedding2, embedding3))
print(np.dot(embedding1, embedding3))


0.7064328731718995
0.7129745537152843
0.8758719352747321


### Loading the data

In [21]:
from langchain.document_loaders import DataFrameLoader

In [22]:
loader = DataFrameLoader(papers_df, page_content_column="abstract")

In [23]:
# Use lazy load for larger table, which won't read the full table into memory 
page = loader.load()[0]

In [24]:
page.page_content

'non-negative matrix factorization (nmf) has previously been shown to  be a useful decomposition for multivariate data. two different multi plicative algorithms for nmf are analyzed. they differ only slightly in  the multiplicative factor used in the update rules. one algorithm can be  shown to minimize the conventional least squares error while the other  minimizes the generalized kullback-leibler divergence. the monotonic  convergence of both algorithms can be proven using an auxiliary func tion analogous to that used for proving convergence of the expectation maximization algorithm. the algorithms can also be interpreted as diag onally rescaled gradient descent, where the rescaling factor is optimally  chosen to ensure convergence. '

In [25]:
page.metadata

{'paper_id': 1861,
 'year': 2000,
 'authors': 'Daniel D. Lee,H. Sebastian Seung',
 'title': 'algorithms for non-negative matrix factorization'}

In [26]:
type(page)

langchain.schema.Document

### Splitting the data

### Why do we need to split the data
1) Chatgpt and LLM have limits



2) To allow for efficient search for vector spaces



In [27]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [28]:
r_spltter = RecursiveCharacterTextSplitter( # Set a really small chunk size, just to show.
    chunk_size = 26,
    chunk_overlap  = 4,
    length_function = len,
    #seperators=['\n\n", "\n", " ", ""]
)

c_splitter = CharacterTextSplitter(
    chunk_size = 26,
    chunk_overlap  = 4,
    length_function = len,

)

In [29]:
text_a2z = 'abcdefghijklmnopqrstuvwxyz'
text_a2z_plus = 'abcdefghijklmnopqrstuvwxyz 12345678'

In [30]:
r_spltter.split_text(text_a2z)

['abcdefghijklmnopqrstuvwxyz']

In [31]:
r_spltter.split_text(text_a2z_plus)

['abcdefghijklmnopqrstuvwxyz', '12345678']

In [32]:
c_splitter.split_text(text_a2z)

['abcdefghijklmnopqrstuvwxyz']

In [33]:
c_splitter.split_text(text_a2z_plus)

['abcdefghijklmnopqrstuvwxyz 12345678']

The issue is Character text splitter splits only on new lines

In [34]:
c_splitter = CharacterTextSplitter(
    chunk_size = 26,
    chunk_overlap  = 4,
    length_function = len,
    separator=" "
)

In [35]:
c_splitter.split_text(text_a2z_plus)

['abcdefghijklmnopqrstuvwxyz', '12345678']

In [36]:
abstract = page.page_content

abstract

'non-negative matrix factorization (nmf) has previously been shown to  be a useful decomposition for multivariate data. two different multi plicative algorithms for nmf are analyzed. they differ only slightly in  the multiplicative factor used in the update rules. one algorithm can be  shown to minimize the conventional least squares error while the other  minimizes the generalized kullback-leibler divergence. the monotonic  convergence of both algorithms can be proven using an auxiliary func tion analogous to that used for proving convergence of the expectation maximization algorithm. the algorithms can also be interpreted as diag onally rescaled gradient descent, where the rescaling factor is optimally  chosen to ensure convergence. '

In [37]:
r_spltter.split_text(abstract)

['non-negative matrix',
 'factorization (nmf) has',
 'has previously been shown',
 'to  be a useful',
 'decomposition for',
 'for multivariate data.',
 'two different multi',
 'plicative algorithms for',
 'for nmf are analyzed.',
 'they differ only slightly',
 'in  the multiplicative',
 'factor used in the update',
 'rules. one algorithm can',
 'can be  shown to minimize',
 'the conventional least',
 'squares error while the',
 'the other  minimizes the',
 'the generalized',
 'kullback-leibler',
 'divergence. the monotonic',
 'convergence of both',
 'algorithms can be proven',
 'using an auxiliary func',
 'tion analogous to that',
 'used for proving',
 'convergence of the',
 'the expectation',
 'maximization algorithm.',
 'the algorithms can also',
 'be interpreted as diag',
 'onally rescaled gradient',
 'descent, where the',
 'the rescaling factor is',
 'is optimally  chosen to',
 'to ensure convergence.']

In [38]:
c_splitter.split_text(abstract)

['non-negative matrix',
 'factorization (nmf) has',
 'has previously been shown',
 'to be a useful',
 'decomposition for',
 'for multivariate data. two',
 'two different multi',
 'plicative algorithms for',
 'for nmf are analyzed. they',
 'they differ only slightly',
 'in the multiplicative',
 'factor used in the update',
 'rules. one algorithm can',
 'can be shown to minimize',
 'the conventional least',
 'squares error while the',
 'the other minimizes the',
 'the generalized',
 'kullback-leibler',
 'divergence. the monotonic',
 'convergence of both',
 'both algorithms can be',
 'be proven using an',
 'an auxiliary func tion',
 'tion analogous to that',
 'that used for proving',
 'convergence of the',
 'the expectation',
 'maximization algorithm.',
 'the algorithms can also be',
 'be interpreted as diag',
 'diag onally rescaled',
 'gradient descent, where',
 'the rescaling factor is',
 'is optimally chosen to',
 'to ensure convergence.']

In [39]:
r_spltter = RecursiveCharacterTextSplitter( # Set a really small chunk size, just to show.
   
    length_function = len,
    separators=['\n\n', "\n", " ", ""],
     chunk_size = 150,
    chunk_overlap  = 0,
)


In [40]:
r_spltter.split_text(abstract)

['non-negative matrix factorization (nmf) has previously been shown to  be a useful decomposition for multivariate data. two different multi plicative',
 'algorithms for nmf are analyzed. they differ only slightly in  the multiplicative factor used in the update rules. one algorithm can be  shown to',
 'minimize the conventional least squares error while the other  minimizes the generalized kullback-leibler divergence. the monotonic  convergence of',
 'both algorithms can be proven using an auxiliary func tion analogous to that used for proving convergence of the expectation maximization algorithm.',
 'the algorithms can also be interpreted as diag onally rescaled gradient descent, where the rescaling factor is optimally  chosen to ensure',
 'convergence.']

In [41]:
r_spltter = RecursiveCharacterTextSplitter( # Set a really small chunk size, just to show.
    length_function = len,
    separators=['\n\n', "\n", " ","\.", ""],
     chunk_size = 1000,
    chunk_overlap  = 0,
)


In [42]:
r_spltter.split_text(abstract)

['non-negative matrix factorization (nmf) has previously been shown to  be a useful decomposition for multivariate data. two different multi plicative algorithms for nmf are analyzed. they differ only slightly in  the multiplicative factor used in the update rules. one algorithm can be  shown to minimize the conventional least squares error while the other  minimizes the generalized kullback-leibler divergence. the monotonic  convergence of both algorithms can be proven using an auxiliary func tion analogous to that used for proving convergence of the expectation maximization algorithm. the algorithms can also be interpreted as diag onally rescaled gradient descent, where the rescaling factor is optimally  chosen to ensure convergence.']

In [43]:
# what if we want to split by sentences
# regex with look behind
sentence_spltter = RecursiveCharacterTextSplitter( # Set a really small chunk size, just to show.
    length_function = len,
    separators=["(?<=\.)"],
     chunk_size = 150,
    chunk_overlap  = 0,
)


In [44]:
sentence_spltter.split_text(abstract)

['non-negative matrix factorization (nmf) has previously been shown to  be a useful decomposition for multivariate data.',
 'two different multi plicative algorithms for nmf are analyzed. they differ only slightly in  the multiplicative factor used in the update rules.',
 'one algorithm can be  shown to minimize the conventional least squares error while the other  minimizes the generalized kullback-leibler divergence.',
 ' the monotonic  convergence of both algorithms can be proven using an auxiliary func tion analogous to that used for proving convergence of the expectation maximization algorithm.',
 ' the algorithms can also be interpreted as diag onally rescaled gradient descent, where the rescaling factor is optimally  chosen to ensure convergence.']

In [45]:
from langchain.text_splitter import TokenTextSplitter

In [46]:
token_text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [47]:
token_text_splitter.split_text(text_a2z)

['abc',
 'def',
 'gh',
 'ij',
 'kl',
 'mn',
 'op',
 'q',
 'r',
 'st',
 'uv',
 'w',
 'xy',
 'z']

In [48]:
token_text_splitter.split_text(abstract)

['non',
 '-',
 'negative',
 ' matrix',
 ' factor',
 'ization',
 ' (',
 'nm',
 'f',
 ')',
 ' has',
 ' previously',
 ' been',
 ' shown',
 ' to',
 ' ',
 ' be',
 ' a',
 ' useful',
 ' decom',
 'position',
 ' for',
 ' mult',
 'ivariate',
 ' data',
 '.',
 ' two',
 ' different',
 ' multi',
 ' pl',
 'icative',
 ' algorithms',
 ' for',
 ' nm',
 'f',
 ' are',
 ' analyzed',
 '.',
 ' they',
 ' differ',
 ' only',
 ' slightly',
 ' in',
 ' ',
 ' the',
 ' multipl',
 'icative',
 ' factor',
 ' used',
 ' in',
 ' the',
 ' update',
 ' rules',
 '.',
 ' one',
 ' algorithm',
 ' can',
 ' be',
 ' ',
 ' shown',
 ' to',
 ' minimize',
 ' the',
 ' conventional',
 ' least',
 ' squares',
 ' error',
 ' while',
 ' the',
 ' other',
 ' ',
 ' minim',
 'izes',
 ' the',
 ' generalized',
 ' k',
 'ull',
 'back',
 '-',
 'le',
 'ib',
 'ler',
 ' divergence',
 '.',
 ' the',
 ' mon',
 'ot',
 'onic',
 ' ',
 ' convergence',
 ' of',
 ' both',
 ' algorithms',
 ' can',
 ' be',
 ' proven',
 ' using',
 ' an',
 ' auxiliary',
 ' func',
 ' t

In [49]:
docs = loader.load()

In [50]:
len(docs)

3921

In [51]:
docs[0]

Document(page_content='non-negative matrix factorization (nmf) has previously been shown to  be a useful decomposition for multivariate data. two different multi plicative algorithms for nmf are analyzed. they differ only slightly in  the multiplicative factor used in the update rules. one algorithm can be  shown to minimize the conventional least squares error while the other  minimizes the generalized kullback-leibler divergence. the monotonic  convergence of both algorithms can be proven using an auxiliary func tion analogous to that used for proving convergence of the expectation maximization algorithm. the algorithms can also be interpreted as diag onally rescaled gradient descent, where the rescaling factor is optimally  chosen to ensure convergence. ', metadata={'paper_id': 1861, 'year': 2000, 'authors': 'Daniel D. Lee,H. Sebastian Seung', 'title': 'algorithms for non-negative matrix factorization'})

In [53]:
splits = sentence_spltter.split_documents(docs)

In [54]:
splits[0]

Document(page_content='non-negative matrix factorization (nmf) has previously been shown to  be a useful decomposition for multivariate data.', metadata={'paper_id': 1861, 'year': 2000, 'authors': 'Daniel D. Lee,H. Sebastian Seung', 'title': 'algorithms for non-negative matrix factorization'})

In [55]:
random_splits = loader.load_and_split()

In [52]:
# For simplicity we will not use any splitting


## Indexing data in Vectorstores

### ChromaDb

In [56]:
persist_directory_random_split_gpt = './data/chroma/random_split/gpt'

In [57]:
os.system(f"rm -rf {persist_directory_random_split_gpt}")  # remove old database files if any # remove old database files if any

0

In [58]:
%%time
vectordb_gpt = Chroma.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),
    persist_directory=persist_directory_random_split_gpt
)

CPU times: user 10.7 s, sys: 1.18 s, total: 11.9 s
Wall time: 43.3 s


In [59]:
print(vectordb_gpt._collection.count())

3921


In [60]:
vectordb_gpt.persist()

FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

### Elasticsearch

In [61]:
!curl -X GET "http://localhost:9200"

{
  "name" : "a1b07c8dd9e0",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "TCUdzVJbQ4CZ1WorXbYDog",
  "version" : {
    "number" : "8.8.2",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "98e1271edf932a480e4262a471281f1ee295ce6b",
    "build_date" : "2023-06-26T05:16:16.196344851Z",
    "build_snapshot" : false,
    "lucene_version" : "9.6.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}


In [62]:
from langchain import ElasticVectorSearch
from langchain.vectorstores.elastic_vector_search import ElasticKnnSearch

from langchain.embeddings import OpenAIEmbeddings

embedding = OpenAIEmbeddings()

from elasticsearch import Elasticsearch


elastic = Elasticsearch(hosts=["http://localhost:9200"])
index_name = "test_index"

In [63]:
elastic.delete_by_query(index=[index_name], body={"query": {"match_all": {}}})

  elastic.delete_by_query(index=[index_name], body={"query": {"match_all": {}}})


ObjectApiResponse({'took': 1620, 'timed_out': False, 'total': 3921, 'deleted': 3921, 'batches': 4, 'version_conflicts': 0, 'noops': 0, 'retries': {'bulk': 0, 'search': 0}, 'throttled_millis': 0, 'requests_per_second': -1.0, 'throttled_until_millis': 0, 'failures': []})

In [64]:
%%time

elasticdb_gpt = ElasticVectorSearch.from_documents(
    docs,
    embedding,
    index_name=index_name,
    elasticsearch_url="http://localhost:9200"
)


CPU times: user 7.7 s, sys: 1.91 s, total: 9.61 s
Wall time: 1min 1s


go here http://localhost:5601/app/r/s/70gZ3 to see vector embeddings

In [65]:
elastic_vector_search = ElasticVectorSearch(
            elasticsearch_url="http://localhost:9200",
            index_name=index_name,
            embedding=embedding
        )

In [66]:
elastic_vector_search_knn = ElasticKnnSearch(

            es_connection=elasticdb_gpt.client,
            index_name=index_name,
            embedding=embedding
)

##  Search

When a query comes in we first convert it to a vector and then compare the vector to the elements in the database to get n most similar results.

**Similarity Search**
1. Similarity Search aims to find the most similar items to a given query in a dataset.
2. It uses metrics like cosine similarity, Jaccard similarity, or Euclidean distance to quantify similarity.
3. The search results are ranked based on their similarity scores, and the top-K items are returned.
4. Use-cases: When the user's intent is clear and specific, similarity search is efficient. It's commonly used in standard search engines, recommendation systems, or any scenario where the goal is to find items most similar to the query.

**Maximal Marginal Relevance (MMR) Search**
1. MMR aims to provide a diverse set of results that are relevant to the query.
2. Along with quantifying similarity to the query, MMR also considers similarity between items in the results set to ensure diversity.
3. It aims to maximize the relevance of the returned items to the query, but also minimize the similarity between the returned items.
4. Use-cases: When user intent is ambiguous or when there are multiple relevant responses, MMR can provide a more diverse set of results. It's useful in news article recommendation (to avoid recommending too many similar articles) or in conversational AI (to provide diverse responses).

The choice between similarity search and MMR depends on the specific use case and user needs. If the aim is to provide a diverse set of results, MMR would be more suitable. If the goal is to find items most similar to the query, a similarity search would be more efficient.

### Keyword search

In [67]:
#keyword search
resp = elasticdb_gpt.client.search(q="What is Natural language processing", query={"match_all": {}})

In [68]:
print("Got %d Hits:" % resp['hits']['total']['value'])
table = []
for hit in resp['hits']['hits']:
    table.append([hit['_source']['text'], hit['_score']])
table_df = pd.DataFrame(table, columns=["text", "score"])

table_df.head()

Got 10000 Hits:


Unnamed: 0,text,score
0,cross language text classi?cation is an import...,20.207375
1,cross language text classi?cation is an import...,20.207375
2,"Classically, tasks in natural language process...",18.87657
3,It obtains new state-of-the-art results on ele...,18.226519
4,our framework is inspired by state-of-the-art ...,17.344196


### Similarity Search

In [69]:
vectordb_gpt.similarity_search("What is Natural language processing")

[Document(page_content="a long-term goal of machine learning research is to build an intelligent dialog agent. most research in natural language understanding has focused on learning from fixed training sets of labeled data, with supervision either at the word level (tagging, parsing tasks) or sentence level (question answering, machine translation). this kind of supervision is not realistic of how humans learn, where language is both learned by, and used for, communication. in this work, we study dialog-based language learning, where supervision is given naturally and implicitly in the response of the dialog partner during the conversation. we study this setup in two domains: the babi dataset of (weston et al., 2015) and large-scale question answering from (dodge et al., 2015). we evaluate a set of baseline learning strategies on these tasks, and show that a novel model incorporating predictive lookahead is a promising approach for learning from a teacher's response. in particular, a 

In [70]:
vectordb_gpt.similarity_search_with_score("What is Natural language processing")

[(Document(page_content="a long-term goal of machine learning research is to build an intelligent dialog agent. most research in natural language understanding has focused on learning from fixed training sets of labeled data, with supervision either at the word level (tagging, parsing tasks) or sentence level (question answering, machine translation). this kind of supervision is not realistic of how humans learn, where language is both learned by, and used for, communication. in this work, we study dialog-based language learning, where supervision is given naturally and implicitly in the response of the dialog partner during the conversation. we study this setup in two domains: the babi dataset of (weston et al., 2015) and large-scale question answering from (dodge et al., 2015). we evaluate a set of baseline learning strategies on these tasks, and show that a novel model incorporating predictive lookahead is a promising approach for learning from a teacher's response. in particular, a

In [71]:
vectordb_gpt.similarity_search("What is linear regression")

[Document(page_content='when used to guide decisions, linear regression analysis typically involves estimation of regression coefficients via ordinary least squares and their subsequent use to make decisions. when there are multiple response variables and features do not perfectly capture their relationships, it is beneficial to account for the decision objective when computing regression coefficients. empirical optimization does so but sacrifices performance when features are well-chosen or training data are insufficient. we propose directed regression, an efficient algorithm that combines merits of ordinary least squares and empirical optimization. we demonstrate through a computational study that directed regression can generate significant performance gains over either alternative. we also develop a theory that motivates the algorithm.', metadata={'paper_id': 3686, 'year': 2009, 'authors': 'Yi-hao Kao,Benjamin V. Roy,Xiang Yan', 'title': 'directed regression'}),
 Document(page_cont

In [72]:
elasticdb_gpt.similarity_search_with_score("What is Natural language processing")

[(Document(page_content="a long-term goal of machine learning research is to build an intelligent dialog agent. most research in natural language understanding has focused on learning from fixed training sets of labeled data, with supervision either at the word level (tagging, parsing tasks) or sentence level (question answering, machine translation). this kind of supervision is not realistic of how humans learn, where language is both learned by, and used for, communication. in this work, we study dialog-based language learning, where supervision is given naturally and implicitly in the response of the dialog partner during the conversation. we study this setup in two domains: the babi dataset of (weston et al., 2015) and large-scale question answering from (dodge et al., 2015). we evaluate a set of baseline learning strategies on these tasks, and show that a novel model incorporating predictive lookahead is a promising approach for learning from a teacher's response. in particular, a

In [73]:
elastic_vector_search_knn.similarity_search_with_score("What is Natural language processing")

[(Document(page_content="a long-term goal of machine learning research is to build an intelligent dialog agent. most research in natural language understanding has focused on learning from fixed training sets of labeled data, with supervision either at the word level (tagging, parsing tasks) or sentence level (question answering, machine translation). this kind of supervision is not realistic of how humans learn, where language is both learned by, and used for, communication. in this work, we study dialog-based language learning, where supervision is given naturally and implicitly in the response of the dialog partner during the conversation. we study this setup in two domains: the babi dataset of (weston et al., 2015) and large-scale question answering from (dodge et al., 2015). we evaluate a set of baseline learning strategies on these tasks, and show that a novel model incorporating predictive lookahead is a promising approach for learning from a teacher's response. in particular, a

### Maximum marginal relevance

Maximum marginal relevance strives to achieve both relevance to the query and diversity among the results.

In [74]:
vectordb_gpt.max_marginal_relevance_search("What is linear regression")

[Document(page_content='when used to guide decisions, linear regression analysis typically involves estimation of regression coefficients via ordinary least squares and their subsequent use to make decisions. when there are multiple response variables and features do not perfectly capture their relationships, it is beneficial to account for the decision objective when computing regression coefficients. empirical optimization does so but sacrifices performance when features are well-chosen or training data are insufficient. we propose directed regression, an efficient algorithm that combines merits of ordinary least squares and empirical optimization. we demonstrate through a computational study that directed regression can generate significant performance gains over either alternative. we also develop a theory that motivates the algorithm.', metadata={'paper_id': 3686, 'year': 2009, 'authors': 'Yi-hao Kao,Benjamin V. Roy,Xiang Yan', 'title': 'directed regression'}),
 Document(page_cont

In [75]:
vectordb_gpt.max_marginal_relevance_search("What is natural language processing")

[Document(page_content="a long-term goal of machine learning research is to build an intelligent dialog agent. most research in natural language understanding has focused on learning from fixed training sets of labeled data, with supervision either at the word level (tagging, parsing tasks) or sentence level (question answering, machine translation). this kind of supervision is not realistic of how humans learn, where language is both learned by, and used for, communication. in this work, we study dialog-based language learning, where supervision is given naturally and implicitly in the response of the dialog partner during the conversation. we study this setup in two domains: the babi dataset of (weston et al., 2015) and large-scale question answering from (dodge et al., 2015). we evaluate a set of baseline learning strategies on these tasks, and show that a novel model incorporating predictive lookahead is a promising approach for learning from a teacher's response. in particular, a 

In [76]:
vectordb_gpt.max_marginal_relevance_search("What is natural language processing", filter={"year":1990})

[]

## Question and Answer


When a query comes in we first convert it to a vector and then compare the vector to the elements in the database to get n most similar results. These results are then passed into prompt as a context for LLM to process them


In [77]:
llm = OpenAI(temperature=0)
llm.openai_api_key = os.environ["OPENAI_API_KEY"]

In [78]:
from langchain.prompts import PromptTemplate
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context", "question"],template=template,)


In [79]:
from langchain.chains import RetrievalQA

question = "What is linear regression?"
qa_chain = RetrievalQA.from_chain_type(llm,
                                       retriever=elasticdb_gpt.as_retriever(search_type="similarity", search_kwargs={"k": 5}),
                                       return_source_documents=True,
                                       chain_type_kwargs={"prompt": QA_CHAIN_PROMPT})


result = qa_chain({"query": question})
result["result"]

' Linear regression is a statistical technique used to model the relationship between a response variable and one or more predictor variables. It involves estimating regression coefficients via ordinary least squares and using them to make decisions. It can also be used to predict the value of a response variable given a set of predictor variables.'

In [80]:
result['source_documents']

[Document(page_content='when used to guide decisions, linear regression analysis typically involves estimation of regression coefficients via ordinary least squares and their subsequent use to make decisions. when there are multiple response variables and features do not perfectly capture their relationships, it is beneficial to account for the decision objective when computing regression coefficients. empirical optimization does so but sacrifices performance when features are well-chosen or training data are insufficient. we propose directed regression, an efficient algorithm that combines merits of ordinary least squares and empirical optimization. we demonstrate through a computational study that directed regression can generate significant performance gains over either alternative. we also develop a theory that motivates the algorithm.', metadata={'paper_id': 3686, 'year': 2009, 'authors': 'Yi-hao Kao,Benjamin V. Roy,Xiang Yan', 'title': 'directed regression'}),
 Document(page_cont

In [81]:
## WARNING this cell may overshoot your quesry costs
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb_gpt.as_retriever(search_type="similarity", search_kwargs={"k": 20}),
    return_source_documents=True,
    chain_type="map_reduce"
)

result = qa_chain_mr({"query": question})
result

{'query': 'What is linear regression?',
 'result': ' Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It involves estimation of regression coefficients via ordinary least squares and their subsequent use to make decisions. It is also used for function estimation and model selection in high-dimensional data analysis.',
 'source_documents': [Document(page_content='when used to guide decisions, linear regression analysis typically involves estimation of regression coefficients via ordinary least squares and their subsequent use to make decisions. when there are multiple response variables and features do not perfectly capture their relationships, it is beneficial to account for the decision objective when computing regression coefficients. empirical optimization does so but sacrifices performance when features are well-chosen or training data are insufficient. we propose directed regression, an eff

In [82]:
## WARNING this cell may overshoot your quesry costs
qa_chain_refine = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb_gpt.as_retriever(search_type="similarity", search_kwargs={"k": 20}),
    return_source_documents=True,
    chain_type="refine"
)

result = qa_chain_refine({"query": question})
result

{'query': 'What is linear regression?',
 'result': '\n\nLinear regression is a statistical technique used to model the relationship between a dependent variable (the response variable) and one or more independent variables (the features). It is used to predict the value of the response variable based on the values of the features. It is typically used to guide decisions by estimating regression coefficients via ordinary least squares and using them to make decisions. Additionally, linear regression studies the problem of estimating a model parameter $\\beta^* \\in \\r^p$, from $n$ observations $\\{(y_i,x_i)\\}_{i=1}^n$ from linear model $y_i = \\langle \\x_i,\\beta^* \\rangle + \\epsilon_i$. It also considers a significant generalization in log-linear models, which are widely used probability models for statistical pattern recognition. These models are typically trained according to a convex criterion, and the optimization of log-linear model parameters is costly and therefore an impor

In [83]:
from langchain import PromptTemplate

def query_db(db, users_question, llm, k=10,filter={}):
  # define the prompt template
  template = """
  Given the following context sections, answer the
  question using only the given context. If you are unsure and the answer is not
  explicitly writting in the documentation, say "Sorry, I don't know how to help with that."

  Context sections:
  {context}

  Question:
  {users_question}

  Answer:
  """
  prompt = PromptTemplate(template=template, input_variables=["context", "users_question"])
    # use our vector store to find similar text chunks
  results = db.similarity_search(
      query=users_question,
      n_results=k,
      filter=filter
  )


  # fill the prompt template
  prompt_text = prompt.format(context = results, users_question = users_question)

  # ask the defined LLM
  return llm(prompt_text)


def query_db_relevance(db, users_question, llm, k=10, filter={}):
  # define the prompt template
  template = """
  Given the following context sections, answer the
  question using only the given context. If you are unsure and the answer is not
  explicitly writting in the documentation, say "Sorry, I don't know how to help with that."

  Context sections:
  {context}

  Question:
  {users_question}

  Answer:
  """
  prompt = PromptTemplate(template=template, input_variables=["context", "users_question"])
    # use our vector store to find similar text chunks
  results = db.max_marginal_relevance_search(
      query=users_question,
      n_results=k,
      filter=filter
  )

  # fill the prompt template
  prompt_text = prompt.format(context = results, users_question = users_question)

  # ask the defined LLM
  return llm(prompt_text)


In [84]:
query_db(vectordb_gpt,"What is linear regression?" , llm, k=10)

' Linear regression is a method of estimating a model parameter from observations from a linear model, where the relationship between the covariates and the responses is unknown. It typically involves estimation of regression coefficients via ordinary least squares and their subsequent use to make decisions.'

In [85]:
query_db_relevance(vectordb_gpt,"What is linear regression?" , llm)

' Linear regression is a method of estimating a model parameter from observations of a linear model, where the relationship between the model parameter and the observations is noisy, quantized to a single bit, potentially nonlinear, noninvertible, and unknown.'

In [86]:
query_db(elasticdb_gpt,"What is linear regression?" , llm)

'Linear regression is a method of estimating a model parameter from observations from a linear model, where the relationship between the covariates and the responses is unknown. It typically involves estimation of regression coefficients via ordinary least squares and their subsequent use to make decisions.'

In [87]:
query_db(vectordb_gpt,"What is Natural language processing?" , llm)

' Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. It is used to build systems that can understand, interpret, and manipulate human language, including speech recognition, natural language understanding, and natural language generation.'

In [88]:
query_db(elasticdb_gpt,"What is Natural language processing?" , llm)

' Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. It is used to build systems that can understand, interpret, and manipulate human language, including speech recognition, natural language understanding, and natural language generation.'

In [89]:
query_db(vectordb_gpt,"What is Natural language processing?" , llm, {"year":1990})

' Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. It is used to build applications such as voice recognition, natural language understanding, and machine translation.'

In [90]:
query_db(elasticdb_gpt,"What is Natural language processing?" , llm, {"year":1990})

' Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. It is used to build systems that can understand, interpret, and manipulate human language, including speech recognition, natural language understanding, and natural language generation.'

In [91]:
query_db(vectordb_gpt,"What is BERT?" , llm)

"Sorry, I don't know how to help with that."

In [92]:
query_db_relevance(vectordb_gpt,"What is BERT?" , llm)

"Sorry, I don't know how to help with that."

## Updates

In [93]:
new_papers = [{
    "title": "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding",
    "authors": "Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova",
    "year": 2018,
    "paper_id": 7301,
    "abstract": """We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
"""
},
  {
      "title" : "Evolution of transfer learning in natural language processing",
       "authors": "Aditya Malte, Pratik Ratadiya",
      "year": 2019,
       "paper_id": 7302,
      "abstract": """In this paper, we present a study of the recent advancements which have helped bring Transfer Learning to NLP through the use of semi-supervised training. We discuss cutting-edge methods and architectures such as BERT, GPT, ELMo, ULMFit among others. Classically, tasks in natural language processing have been performed through rule-based and statistical methodologies. However, owing to the vast nature of natural languages these methods do not generalise well and failed to learn the nuances of language. Thus machine learning algorithms such as Naive Bayes and decision trees coupled with traditional models such as Bag-of-Words and N-grams were used to usurp this problem. Eventually, with the advent of advanced recurrent neural network architectures such as the LSTM, we were able to achieve state-of-the-art performance in several natural language processing tasks such as text classification and machine translation. We talk about how Transfer Learning has brought about the well-known ImageNet moment for NLP. Several advanced architectures such as the Transformer and its variants have allowed practitioners to leverage knowledge gained from unrelated task to drastically fasten convergence and provide better performance on the target task. This survey represents an effort at providing a succinct yet complete understanding of the recent advances in natural language processing using deep learning in with a special focus on detailing transfer learning and its potential advantages.
"""    
  },
  {
       "title" : "BERTQA -- Attention on Steroids",
       "authors": "Ankit Chadha, Rewa Sood",
      "year": 2019,
        "paper_id": 7303,
      "abstract": """In this work, we extend the Bidirectional Encoder Representations from Transformers (BERT) with an emphasis on directed coattention to obtain an improved F1 performance on the SQUAD2.0 dataset. The Transformer architecture on which BERT is based places hierarchical global attention on the concatenation of the context and query. Our additions to the BERT architecture augment this attention with a more focused context to query (C2Q) and query to context (Q2C) attention via a set of modified Transformer encoder units. In addition, we explore adding convolution-based feature extraction within the coattention architecture to add localized information to self-attention. We found that coattention significantly improves the no answer F1 by 4 points in the base and 1 point in the large architecture. After adding skip connections the no answer F1 improved further without causing an additional loss in has answer F1. The addition of localized feature extraction added to attention produced an overall dev F1 of 77.03 in the base architecture. We applied our findings to the large BERT model which contains twice as many layers and further used our own augmented version of the SQUAD 2.0 dataset created by back translation, which we have named SQUAD 2.Q. Finally, we performed hyperparameter tuning and ensembled our best models for a final F1/EM of 82.317/79.442 (Attention on Steroids, PCE Test Leaderboard).
"""             
  },
    {
        "title": "BERT: A Review of Applications in Natural Language Processing and Understanding",
        "authors": "Mikhail Koroteev",
        "year": 2021,
        "paper_id": 7304,
        "abstract": "In this review, we describe the application of one of the most popular deep learning-based language models - BERT. The paper describes the mechanism of operation of this model, the main areas of its application to the tasks of text analytics, comparisons with similar models in each task, as well as a description of some proprietary models. In preparing this review, the data of several dozen original scientific articles published over the past few years, which attracted the most attention in the scientific community, were systematized. This survey will be useful to all students and researchers who want to get acquainted with the latest advances in the field of natural language text analysis."
    }
]

In [94]:
new_paper_df = pd.DataFrame(new_papers)

In [95]:
new_paper_df

Unnamed: 0,title,authors,year,paper_id,abstract
0,BERT: Pre-training of Deep Bidirectional Trans...,"Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kris...",2018,7301,We introduce a new language representation mod...
1,Evolution of transfer learning in natural lang...,"Aditya Malte, Pratik Ratadiya",2019,7302,"In this paper, we present a study of the recen..."
2,BERTQA -- Attention on Steroids,"Ankit Chadha, Rewa Sood",2019,7303,"In this work, we extend the Bidirectional Enco..."
3,BERT: A Review of Applications in Natural Lang...,Mikhail Koroteev,2021,7304,"In this review, we describe the application of..."


In [96]:
loader_new = DataFrameLoader(new_paper_df, page_content_column="abstract")

In [117]:
# new_splits = sentence_spltter.split_documents(loader_new.load())

In [98]:
vectordb_gpt.add_documents(loader_new.load())

['1af35606-30a7-11ee-a229-acde48001122',
 '1af357c8-30a7-11ee-a229-acde48001122',
 '1af3580e-30a7-11ee-a229-acde48001122',
 '1af35836-30a7-11ee-a229-acde48001122',
 '1af35854-30a7-11ee-a229-acde48001122',
 '1af3587c-30a7-11ee-a229-acde48001122',
 '1af3589a-30a7-11ee-a229-acde48001122',
 '1af358c2-30a7-11ee-a229-acde48001122',
 '1af358e0-30a7-11ee-a229-acde48001122',
 '1af358fe-30a7-11ee-a229-acde48001122',
 '1af3591c-30a7-11ee-a229-acde48001122',
 '1af35944-30a7-11ee-a229-acde48001122',
 '1af35962-30a7-11ee-a229-acde48001122',
 '1af35980-30a7-11ee-a229-acde48001122',
 '1af3599e-30a7-11ee-a229-acde48001122',
 '1af359bc-30a7-11ee-a229-acde48001122',
 '1af359da-30a7-11ee-a229-acde48001122',
 '1af359f8-30a7-11ee-a229-acde48001122',
 '1af35a16-30a7-11ee-a229-acde48001122',
 '1af35a3e-30a7-11ee-a229-acde48001122',
 '1af35a5c-30a7-11ee-a229-acde48001122',
 '1af35a84-30a7-11ee-a229-acde48001122',
 '1af35aa2-30a7-11ee-a229-acde48001122',
 '1af35ac0-30a7-11ee-a229-acde48001122',
 '1af35ade-30a7-

In [99]:
elasticdb_gpt.add_documents(new_splits)

['b8c84f10-7f39-44a8-9cb9-8b39bb174c4e',
 '2d97f9ef-fbcd-4072-8657-fc0ccdbb08da',
 '55291316-f867-4959-9d12-a04e1870743d',
 '2e2e85e4-2d69-4e22-956a-4057816bb017',
 '49cf8438-ad45-46ef-a1fa-00ecb35ee386',
 'f14253ac-150a-46f4-a0c0-faf1c693e6a4',
 'd857314b-e0e6-467e-9916-a87b826f60c5',
 '8541d068-7ef9-4042-b7d6-d226f4d6c01f',
 'b80a7ae8-bb1a-4701-9a6f-433bd8f7c38b',
 '5e69ac5e-50a4-4347-8f02-294f539e5acc',
 'cdbb8a1e-b4d4-44c6-a4e6-4f846539b0b7',
 '4465ca98-31ca-4ced-b8b4-daa4759d1bbd',
 'e3098a4b-dcfe-46f0-8d3d-e7f3985a2549',
 '00ae78b3-ba41-4716-a53f-6114da8212b1',
 '43ecb5f5-f644-4280-97f7-2095d1074836',
 'cdf28626-cbd6-4fa7-b4c1-9f2166bdd968',
 'fec1c2e6-7ddb-49df-bda9-43c780a83ef3',
 '2b3769d8-1a0b-47a3-bb9b-b777024b97ef',
 '773aa8cb-419f-4b41-a1c3-15d223e041c6',
 'c3084d0b-dd12-42f0-a10b-87f97f0d25b4',
 '91d94b96-b71d-4ae7-9220-b7111fcb1865',
 '0daf1011-59c6-42f0-9da7-678f253fa3ab',
 '5126536d-2a19-4949-b74a-ec7a84d281ee',
 '8ad5cb93-456b-4d5d-be9e-4d38575491d3',
 'a412459b-a77c-

In [100]:
query_db(vectordb_gpt,"What is BERT?" , llm)

' BERT is a new language representation model called Bidirectional Encoder Representations from Transformers.'

In [101]:
query_db_relevance(vectordb_gpt,"What is BERT?" , llm)

' BERT is a new language representation model called Bidirectional Encoder Representations from Transformers.'

In [102]:
query_db(elasticdb_gpt,"What is BERT?" , llm)

' BERT is a new language representation model called Bidirectional Encoder Representations from Transformers.'

In [103]:
query_db_relevance(vectordb_gpt,"How is bert an improvement?" , llm)

'BERT is an improvement because it achieved SQuAD v1.1 Test F1 to 88.4 (5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).'

## Generating Questions for Evaluations

In [104]:
from langchain.chains import QAGenerationChain

In [105]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import QAGenerationChain

chain = QAGenerationChain.from_llm(ChatOpenAI(temperature=0))

In [106]:
questions = [chain.run(doc.page_content) for doc in docs[:10]]

In [107]:
questions

[[{'question': 'What is the purpose of non-negative matrix factorization (NMF)?',
   'answer': 'NMF is a useful decomposition for multivariate data.'}],
 [{'question': 'What is the purpose of spike-triggered covariance method described in the text?',
   'answer': 'The purpose of the spike-triggered covariance method is to retrieve suppressive components of the gain control signal in a neuron.'}],
 [{'question': 'What is the complexity class for determining whether a competitive posg has a positive-expected-reward strategy?',
   'answer': 'The complexity class is nexp with an oracle for np.'}],
 [{'question': 'What is the main contribution of the presented algorithm?',
   'answer': 'The main contribution of the presented algorithm is a polynomial algorithm for learning the structure of bounded-treewidth junction trees.'}],
 [{'question': 'What is the purpose of the local smoothness regularizer introduced in this paper?',
   'answer': 'The purpose of the local smoothness regularizer is t

## Chatting with your data: Final Demo

In [108]:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

In [109]:
from langchain.chains import ConversationalRetrievalChain


retriever=vectordb_gpt.as_retriever()
qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=retriever,
    memory=memory
)

In [110]:
question = "What is Natural language processing?"
result = qa({"question": question})

In [111]:
result["answer"]

' Natural language processing is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. It involves tasks such as automatic speech recognition, natural language understanding, and natural language generation.'

In [112]:
question = "What are its real life applications?"
result2 = qa({"question": question})

In [113]:
 result2["answer"]

' Natural Language Processing has many real-life applications, such as text classification, machine translation, question answering, dialog-based language learning, and syntactic parsing.'

In [114]:
import datetime
current_date = datetime.datetime.now().date()
if current_date < datetime.date(2023, 9, 2):
    llm_name = "gpt-3.5-turbo-0301"
else:
    llm_name = "gpt-3.5-turbo"
print(llm_name)

gpt-3.5-turbo-0301


In [115]:
import gradio

memory = ConversationBufferMemory(
    memory_key="chat_history_ui",
    return_messages=True
)
retriever=vectordb_gpt.as_retriever()
    
qa = ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(model_name=llm_name, temperature=0), 
        chain_type="stuff", 
        retriever=retriever, 
        return_source_documents=True
    )

def chat(message, history=None):
    history = history or []
    response = qa({"question": message, "chat_history":history})['answer']
    history.append((message, response))
    return history, history

In [116]:
# collection.load()
chatbot = gradio.Chatbot(color_map=("green", "gray"))
interface = gradio.Interface(
    chat,
    ["text", "state"],
    [chatbot, "state"],
    allow_screenshot=False,
    allow_flagging="never",
)
interface.launch(inline=True, share=True)


  chatbot = gradio.Chatbot(color_map=("green", "gray"))
  interface = gradio.Interface(


Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://2bda0dc3ea562c7408.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




There are two arch types bert vs gpt : People would want to describe focus on architechtures

1)why aare we doing preprocessing?

2) MMR vs similarity searching

3) Run the notebooks in advance. 


4) What gradio is. Add Gradio in slide

5) COntext optimization big data approaches in slides
