# Building Your Own Search Engine Using Vector Databases


![img](https://www.analyticsvidhya.com/datahack-summit-2023/wp-content/uploads/2023/07/s-won_searchengin.jpg)

## Agenda 

**Part 0: The Beginning**

Welcome and Objectives : An introduction to the session's aims, a brief ice breaker activity, and setting the stage for the exploration ahead.

Context Setting: A quick overview of AI, Search Engines, the current landscape, potential use-cases, benefits, and challenges. Highlight the significance of building an AI search engine with your data.

**Part 1: Understanding the Basics**

NLP and Search Engines: Explore the components like Natural Language Processing (NLP), Machine Learning algorithms, and their roles in crafting an efficient AI search engine. Topics include


- What are vector embeddings?
- Legacy vectorizing techniques like CountVectorizer, bag of words
- Similarity measures and how do they work
- LLM and Transformers

Vector Databases
  - An Exploration:
  - Unveiling Vector Databases
  - Understanding their workings
  - Real-world use-cases
  - A comparative analysis of available options

**Part 2: Indexing**

Splitting the data : Why is it required and different kinds of Data splitting. Also cover why splitting is context dependent (i.e depends on data)

The next step is to insert the data into the database


**Part 3: Searching**

Performing Semantic Search on Indexed Data: Discuss and code the integration of NLP and Machine Learning algorithms into the search engine to comprehend, analyze, and generate precise search results from the given data. Employ Command line tools (or maybe a GUI) to execute the searches.

Discuss Different Retrieval algorithms such as
* MMR
* LLM Aided Retrieval
* Compression

**Part 4: Question and answering**

- Prompt Engineering and templates 
- Addressing a lot of windows and short context windows

**Part 5: Chat**

- Introducing memory
- Followup conversations


**Wrap-Up and Next Steps**

Conclusion and Future Directions: Discuss steps to enhance the solution and where to go from here, providing a clear path for continued exploration and development.

**References and Resources**


## Checkbox


### Demo Checkbox

- [X]  **Part 1: Understanding the Basics**
- [X]  **Part 2: Indexing**
- [X]  **Part 3: Semantic search with vector db**
- [X]  **Part 4: Question and answering**
- [X]  **Part 5: Chat**


### Theory material Checkbox

- [X]  **Part 1: Understanding the Basics**
- [X]  **Part 2: Indexing**
- [X]  **Part 3: Semantic search with vector db**
- [X]  **Part 4: Question and answering**
- [X]  **Part 5: Chat**

## Libraries and Technologies we will use

1) Pre Trained Large Language Model (LLM) like ChatGPT for vector Embedding
2) Langchain for Supporting our model application
3) Vector Database like Chroma
4) Gradio


## Generic Architecture

![arch](https://ghost.hacksoft.io/content/images/2023/04/answering_questions.png)

# Basics


### Search Engines and Evolution With AI


1. **Search Engines:** 
   - Search engines work as the librarians of the internet, scanning billions of pages to provide the most relevant results for your search queries. 
   - Traditional search engines operate primarily by scanning webpage text for matching keywords.

2. **Limitations of Traditional Search Engines:**
   - Traditional methods are limited, akin to finding a book in a library by only looking at the words on the covers without understanding the actual content of the book.

3. **AI Revolution in Search Engines:**
   - AI enables search engines to better understand the content and context of web pages and the user's intent.
   - AI can distinguish the context in which a word is used, enhancing the relevance of search results.
   - AI can personalize search results based on factors like previous searches and location.

4. **Advanced Capabilities of AI-powered Search Engines:**
   - AI-powered search engines can analyze and understand various data types including text, images, videos, and even voice commands.
   - As a result, AI has transformed search engines from simple keyword-matching tools into sophisticated systems that understand and cater to the nuanced needs of users.
   
   
 ![search_engine](https://aeroadmin.com/articles/en/wp-content/uploads/2020/11/search-engine-logo.png)

## Our Focus: Natural Language Search or Semantic Search

Everyone knows we can use Chatgpt for searching and asking questions using Chatgpt


![chatgpt](https://www.brookings.edu/wp-content/uploads/2023/01/shutterstock_2237655785.jpg)

However it has certain limitations:
1) It is trained on data before September 2021.
2) We can't search and ask questions on our custom data

But what if we want to augment LLM with our own data.

### Retrieval augmented generation
 
Retrieval-Augmented Generation (RAG) is a modern approach that can be used to enhance natural language search capabilities. It combines the best of retrieval-based and generative methods in NLP to answer questions or search queries.

Here's a simplified explanation of how this might work:

1. **Query Understanding:** Similar to traditional natural language search, the first step in RAG is understanding the user's query, which can involve Named Entity Recognition (NER), Part-of-Speech tagging (POS tagging), and other techniques to parse and interpret the query.

2. **Document Retrieval:** RAG first retrieves a subset of documents from a larger corpus that are most relevant to the user's query. This retrieval step is based on a similarity measure, such as cosine similarity in the case of vectorized representations of text. The result is a shortlist of documents that are likely to contain the answer to the user's query.

3. **Answer Generation:** Once the relevant documents are retrieved, a separate generative model takes the user's query and the retrieved documents as input and generates a response. The generative model can be a sequence-to-sequence model like a Transformer, which is trained to generate coherent and contextually appropriate text.


![imng](https://cdn.thenewstack.io/media/2023/06/34141141-vector-db-llm-1024x665.png)

RAG is a potent approach for natural language search, particularly for tasks that require understanding and generating language based on a large corpus of text, such as question answering or dialog systems. It combines the strengths of retrieval-based and generative models, allowing the model to access a vast amount of knowledge while generating detailed and contextually relevant responses.




## Embeddings

We can search on different kinds of data

![search](https://redis.com/wp-content/uploads/2023/03/vector-similarity-diagram-1.svg?&auto=webp&quality=85,75&width=500)


## Transforming Text In a Way Condusive for search

Many Machine Learning algorithms and almost all Deep Learning Architectures are not capable of processing strings or plain text in their raw form. In a broad sense, they require numerical numbers as inputs to perform any sort of task, such as classification, regression, clustering, etc.

Transforming text into vectors, known as vectorization, is a critical part of Natural Language Processing (NLP). This process allows us to convert human language into a format that machine learning algorithms can understand and work with. Here are a few common methods:



**Bag of Words (BoW)**: This approach treats each document as an unordered bag or "multiset" of its words, disregarding grammar and word order but keeping the frequency of each word.

![bow](https://miro.medium.com/v2/resize:fit:661/1*3K9GIOVLNu0cRvQap_KaRg.png)

**Term Frequency-Inverse Document Frequency (TF-IDF)**: This method reflects how important a word is to a document in a collection or corpus. It's often used as a weighting factor in text mining and information retrieval.

![tfidf](https://lh6.googleusercontent.com/GTmNOZ5DkSorxEoATI93xvrBDCCOn0XAGDav8ybPJ0hIkqlk4nimcY9P8SNleZV1Cf8vnGVAlwawdZ5Fe8kPykKRZbHVUixjSPu1BJdd9DoAdgAVr5VMwhK2oSkXpyDFXunuON-l)

**Word Embeddings**: These are dense vector representations in which similar words have similar vectors in the vector space. Word2Vec and GloVe are two popular methods of creating word embeddings. Word embeddings capture more nuanced semantic meanings and relationships between words compared to BoW and TF-IDF.

![vectors](https://www.researchgate.net/publication/340825443/figure/fig6/AS:882927785238529@1587517796128/Word-embeddings-map-words-in-a-corpus-of-text-to-vector-space-Linear-combinations-of.png)


**Word2Vec**
- It is a shallow neural network model that generates word embeddings – vectors that represent words in a high-dimensional space.
- Word2Vec models, like Skip-gram and Continuous Bag of Words (CBOW), are trained to reconstruct the linguistic context of words.
- Word2Vec can capture some semantic and syntactic relationships between words.




## Transformers And LLMs

### Limitations of Word2Vec
1) Word2Vec does not consider the order of words (e.g., "cat chases dog" and "dog chases cat" are treated the same).
2) Word2Vec generates a single vector representation for each word, regardless of the context (e.g., "bank" as a river bank and "bank" as a financial institution are treated the same).


### Transitioning into Transformers

To overcome these limitations, researchers developed new techniques and models that could understand context better.

**Contextual Word Embeddings (ELMo, etc.)**: ELMo (Embeddings from Language Models) assigns embeddings to words based on their context, addressing the polysemy issue present in Word2Vec.

**Transformers**: The Transformer model, introduced in the paper "Attention is All You Need," revolutionized NLP. It's based on the idea of self-attention, allowing the model to weigh and understand the impact of each word on others in a sentence.

The transformer model architecture underlies models like BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pretrained Transformer). Unlike Word2Vec, these models provide context-aware representations by considering words in their specific context.

**BERT** uses bidirectional training of Transformers, meaning it looks at the context from both left and right sides (words before and after the target word).

**GPT** uses a transformer decoder and is trained in a unidirectional manner. It has been used extensively for tasks that require the generation of text.

In summary, while we started with Word2Vec creating context-free embeddings, we've moved toward Transformer-based models that provide powerful context-dependent representations, leading to significant improvements in various NLP tasks.


### How are LLMs related to transformers


Large Language Models (LLMs) such as GPT-3 by OpenAI, BERT from Google, and RoBERTa from Facebook AI, all have their roots in the Transformer architecture and use vector embeddings as a fundamental part of their operation.

* These models use the attention mechanism to generate context-aware word embeddings, superior to traditional embeddings generated by Word2Vec or GloVe.
* They are trained on extensive amounts of text data and can understand and generate human-like text.
* LLMs have achieved state-of-the-art performance on a wide range of NLP tasks, such as text classification, question answering, and named entity recognition


![embedding](https://miro.medium.com/v2/resize:fit:1324/1*3XZgoCaZg1e9ySfXPTtfQg.png)

## Vector Databases

![db](https://miro.medium.com/v2/resize:fit:1400/1*VXrN-lBukxTqQVBrlS1lZg.png)

Vector databases, also known as vector search engines or similarity search engines, play a crucial role in managing and leveraging high-dimensional vector data produced by Machine Learning (ML) models, including the Large Language Models (LLMs) in NLP tasks. Here's how they can be helpful:

1. **Efficient Similarity Search**: Vector databases are designed to enable fast and efficient similarity search in high-dimensional vector spaces. They can help find the vectors that are most similar to a given vector, which is a common requirement in NLP tasks.

2. **Scalable Storage and Retrieval**: Vector databases provide a scalable solution for storing and retrieving large amounts of high-dimensional vector data.

3. **Integration with ML Workflows**: Vector databases can be easily integrated into ML workflows. They can serve as a bridge between the offline training of models and the online serving of results.

4. **Use Cases in NLP**: In the context of NLP and language models, vector databases can be used to store and retrieve word or sentence embeddings. They can be used in tasks like semantic search (finding documents with similar meanings), recommendation systems (finding similar items or content), and more.

5. **Real-time Applications**: With their ability to provide quick similarity search results, vector databases can support real-time applications such as chatbots, personalized recommendations, and more.

In summary, vector databases help to store, manage, and retrieve the high-dimensional vector data produced by models like transformers, thus enabling the effective application of these models in various NLP tasks.


![vectordb](https://cdn.sanity.io/images/vr8gru94/production/e88ebbacb848b09e477d11eedf4209d10ea4ac0a-1399x537.png)


### When to use a vector database

Using a vector database as opposed to a simple file for storing and retrieving vector data becomes essential in various situations. Here are some scenarios when you would want to opt for a vector database:

1. **Large-scale Data**: If you're dealing with large-scale vector data, simply reading from and writing to files can be inefficient. A vector database is designed to handle such large amounts of data efficiently and quickly.

2. **Efficient Searching**: If you need to perform similarity search queries on the vector data, a vector database provides sophisticated indexing mechanisms to allow efficient nearest-neighbor search, which would be highly inefficient with plain files.

3. **Concurrency and Real-time Access**: If you need concurrent access to your vector data or if your application requires real-time data access, a vector database is a better fit as it supports these requirements out of the box. File-based systems, on the other hand, may struggle with concurrent accesses and real-time data retrieval.

4. **Data Persistence and Reliability**: If you need your data to be reliably stored and persist across sessions or if you want built-in backup and restore capabilities, a vector database is a much safer bet than simple file storage.

5. **Scalability and Distributed Computing**: Vector databases are built to work in distributed computing environments and can easily scale up or down based on the needs of your application. This is not typically possible with file-based storage without significant manual intervention.

However, if your use case involves small-scale data, infrequent access, or does not require efficient search capabilities, a simple file-based storage might suffice. The choice between a vector database and a file really depends on the specifics of your use case and the scale and complexity of your data.


### How does it work

![vectordb](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83af13e6-8e84-4e37-84e9-244f3aa3e95b_2071x2470.png)

### Commercially available Vector Databases

1) Milvus
2) Kinetica
3) Chroma
4) Pinecone

# Hands On Coding

## Dependencies Installation and Loading data

In [1]:
!pip install langchain openai chromadb kaggle sentence_transformers datasets gradio elasticsearch







In [2]:
import os.path
if not os.path.isfile("database.sqlite"):
    os.system("kaggle datasets download benhamner/nips-papersv")
    os.system("unzip -o nips-papers.zip")

In [3]:
import os
import numpy as np
import pandas as pd
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
import sqlite3
from PyPDF2 import PdfReader


In [4]:
con = sqlite3.connect("database.sqlite")

sql= """WITH paper_author_list AS (
    SELECT papers.id AS paper_id, Group_concat(authors.name) AS author_list
    FROM papers
    JOIN paper_authors ON papers.id = paper_authors.paper_id
    JOIN authors ON paper_authors.author_id = authors.id
    GROUP BY paper_id
)
SELECT papers.id AS paper_id, papers.year, papers.title, papers.abstract, papers.paper_text, paper_author_list.author_list AS authors
FROM papers
JOIN paper_author_list ON papers.id = paper_author_list.paper_id
WHERE abstract NOT LIKE '%Abstract Missing%'

""";

papers_df = pd.read_sql_query(sql, con)

In [5]:
papers_df.head()

Unnamed: 0,paper_id,year,title,abstract,paper_text,authors
0,1861,2000,Algorithms for Non-negative Matrix Factorization,Non-negative matrix factorization (NMF) has pr...,Algorithms for Non-negative Matrix\nFactorizat...,"Daniel D. Lee,H. Sebastian Seung"
1,1975,2001,Characterizing Neural Gain Control using Spike...,Spike-triggered averaging techniques are effec...,Characterizing neural gain control using\nspik...,"Odelia Schwartz,E.J. Chichilnisky,Eero P. Simo..."
2,3163,2007,Competition Adds Complexity,It is known that determinining whether a DEC-P...,Competition adds complexity\n\nJudy Goldsmith\...,"Judy Goldsmith,Martin Mundhenk"
3,3164,2007,Efficient Principled Learning of Thin Junction...,We present the first truly polynomial algorith...,Efficient Principled Learning of Thin Junction...,"Anton Chechetka,Carlos Guestrin"
4,3167,2007,Regularized Boost for Semi-Supervised Learning,Semi-supervised inductive learning concerns ho...,Regularized Boost for Semi-Supervised Learning...,"Ke Chen,Shihai Wang"


In [6]:
def pre_process_text(papers_df, column):
    
    # Load the regular expression library
    import re
    preprocessed_column = f"{column}_processed"

    # Print the titles of the first rows 
    print(papers_df[column].head())

    # remove punctuations
    #papers_df[preprocessed_column] = papers_df[column].map(lambda x: re.sub('[,!?]', '', x))
    
     # remove carriage return and end of line
    papers_df[preprocessed_column] = papers_df[column].map(lambda x: re.sub('[\r\n]', ' ', x))
    
     # remove double spaces
    papers_df[preprocessed_column] = papers_df[preprocessed_column].map(lambda x: re.sub('  ', ' ', x))

    
    # remove para continuation
    papers_df[preprocessed_column] = papers_df[preprocessed_column].map(lambda x: re.sub('- ', '', x))

    # Convert the titles to lowercase
    papers_df[preprocessed_column] = papers_df[preprocessed_column].map(lambda x: x.lower())
    return papers_df.head()

In [7]:
text_columns = ["title", "abstract", "paper_text"]
for column in text_columns:
    pre_process_text(papers_df, column)

0     Algorithms for Non-negative Matrix Factorization
1    Characterizing Neural Gain Control using Spike...
2                          Competition Adds Complexity
3    Efficient Principled Learning of Thin Junction...
4       Regularized Boost for Semi-Supervised Learning
Name: title, dtype: object
0    Non-negative matrix factorization (NMF) has pr...
1    Spike-triggered averaging techniques are effec...
2    It is known that determinining whether a DEC-P...
3    We present the first truly polynomial algorith...
4    Semi-supervised inductive learning concerns ho...
Name: abstract, dtype: object
0    Algorithms for Non-negative Matrix\nFactorizat...
1    Characterizing neural gain control using\nspik...
2    Competition adds complexity\n\nJudy Goldsmith\...
3    Efficient Principled Learning of Thin Junction...
4    Regularized Boost for Semi-Supervised Learning...
Name: paper_text, dtype: object


In [8]:
papers_df.abstract_processed[0]


'non-negative matrix factorization (nmf) has previously been shown to  be a useful decomposition for multivariate data. two different multi plicative algorithms for nmf are analyzed. they differ only slightly in  the multiplicative factor used in the update rules. one algorithm can be  shown to minimize the conventional least squares error while the other  minimizes the generalized kullback-leibler divergence. the monotonic  convergence of both algorithms can be proven using an auxiliary func tion analogous to that used for proving convergence of the expectation maximization algorithm. the algorithms can also be interpreted as diag onally rescaled gradient descent, where the rescaling factor is optimally  chosen to ensure convergence. '

In [9]:
papers_df = papers_df.drop(["title", "abstract", "paper_text"], axis=1)
papers_df.head()

Unnamed: 0,paper_id,year,authors,title_processed,abstract_processed,paper_text_processed
0,1861,2000,"Daniel D. Lee,H. Sebastian Seung",algorithms for non-negative matrix factorization,non-negative matrix factorization (nmf) has pr...,algorithms for non-negative matrix factorizati...
1,1975,2001,"Odelia Schwartz,E.J. Chichilnisky,Eero P. Simo...",characterizing neural gain control using spike...,spike-triggered averaging techniques are effec...,characterizing neural gain control using spike...
2,3163,2007,"Judy Goldsmith,Martin Mundhenk",competition adds complexity,it is known that determinining whether a dec-p...,competition adds complexity judy goldsmith dep...
3,3164,2007,"Anton Chechetka,Carlos Guestrin",efficient principled learning of thin junction...,we present the first truly polynomial algorith...,efficient principled learning of thin junction...
4,3167,2007,"Ke Chen,Shihai Wang",regularized boost for semi-supervised learning,semi-supervised inductive learning concerns ho...,regularized boost for semi-supervised learning...


In [10]:
papers_df = papers_df.rename(columns={'title_processed':'title', 'abstract_processed': 'abstract', 'paper_text_processed': 'paper_text'})

In [11]:
papers_df.head()

Unnamed: 0,paper_id,year,authors,title,abstract,paper_text
0,1861,2000,"Daniel D. Lee,H. Sebastian Seung",algorithms for non-negative matrix factorization,non-negative matrix factorization (nmf) has pr...,algorithms for non-negative matrix factorizati...
1,1975,2001,"Odelia Schwartz,E.J. Chichilnisky,Eero P. Simo...",characterizing neural gain control using spike...,spike-triggered averaging techniques are effec...,characterizing neural gain control using spike...
2,3163,2007,"Judy Goldsmith,Martin Mundhenk",competition adds complexity,it is known that determinining whether a dec-p...,competition adds complexity judy goldsmith dep...
3,3164,2007,"Anton Chechetka,Carlos Guestrin",efficient principled learning of thin junction...,we present the first truly polynomial algorith...,efficient principled learning of thin junction...
4,3167,2007,"Ke Chen,Shihai Wang",regularized boost for semi-supervised learning,semi-supervised inductive learning concerns ho...,regularized boost for semi-supervised learning...


In [12]:
papers_df = papers_df.drop(["paper_text"], axis=1)

In [13]:
with open("../secret/openai") as f:
    openai_secret = f.read().strip()
    
# PDF_FILE = "../data/GenericEmailMarketting/merged_file.pdf"

# use import getpass instead

os.environ["OPENAI_API_KEY"] = openai_secret 

In [14]:
llm = OpenAI(temperature=0)
llm.openai_api_key = os.environ["OPENAI_API_KEY"]

In [15]:
llm("tell me a joke")

'\n\nQ: What did the fish say when it hit the wall?\nA: Dam!'

In [16]:
llm("Who is the current prime minister of Britain")

'?\n\nThe current Prime Minister of the United Kingdom is Boris Johnson.'

In [17]:
from langchain.embeddings import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()

def get_openai_embedding(text):
   text_rep = text.replace("\n", " ")
   return embeddings_model.embed_documents([text_rep])

In [18]:
sentence1 = "i like summer"
sentence2 = "Brocholi on pizza is probably not a good idea"
sentence3 = "I love the warm weather outside"

In [19]:
embedding1 = embeddings_model.embed_query(sentence1)
embedding2 = embeddings_model.embed_query(sentence2)
embedding3 = embeddings_model.embed_query(sentence3)

In [20]:
print(np.dot(embedding1, embedding2))
print(np.dot(embedding2, embedding3))
print(np.dot(embedding1, embedding3))


0.7065412380498463
0.7131438797298986
0.8758719352747321


### Loading the data

In [21]:
from langchain.document_loaders import DataFrameLoader

In [22]:
loader = DataFrameLoader(papers_df, page_content_column="abstract")

In [23]:
# Use lazy load for larger table, which won't read the full table into memory 
page = loader.load()[0]

In [24]:
page.page_content

'non-negative matrix factorization (nmf) has previously been shown to  be a useful decomposition for multivariate data. two different multi plicative algorithms for nmf are analyzed. they differ only slightly in  the multiplicative factor used in the update rules. one algorithm can be  shown to minimize the conventional least squares error while the other  minimizes the generalized kullback-leibler divergence. the monotonic  convergence of both algorithms can be proven using an auxiliary func tion analogous to that used for proving convergence of the expectation maximization algorithm. the algorithms can also be interpreted as diag onally rescaled gradient descent, where the rescaling factor is optimally  chosen to ensure convergence. '

In [25]:
page.metadata

{'paper_id': 1861,
 'year': 2000,
 'authors': 'Daniel D. Lee,H. Sebastian Seung',
 'title': 'algorithms for non-negative matrix factorization'}

In [26]:
type(page)

langchain.schema.Document

### Splitting the data

### Why do we need to split the data
1) Chatgpt and LLM have limits

![limits](https://miro.medium.com/v2/resize:fit:0/1*ihkkB2g7j9CMHBfvJtfLKw.png)

2) To allow for efficient search for vector spaces

![embeddings](https://res.cloudinary.com/lesswrong-2-0/image/upload/v1676309873/mirroredImages/pHPmMGEMYefk9jLeh/uvqfemlxskwrikptzeaq.png)

In [27]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [28]:
r_spltter = RecursiveCharacterTextSplitter( # Set a really small chunk size, just to show.
    chunk_size = 26,
    chunk_overlap  = 4,
    length_function = len,
    #seperators=['\n\n", "\n", " ", ""]
)

c_splitter = CharacterTextSplitter(
    chunk_size = 26,
    chunk_overlap  = 4,
    length_function = len,

)

In [29]:
text_a2z = 'abcdefghijklmnopqrstuvwxyz'
text_a2z_plus = 'abcdefghijklmnopqrstuvwxyz 12345678'

In [30]:
r_spltter.split_text(text_a2z)

['abcdefghijklmnopqrstuvwxyz']

In [31]:
r_spltter.split_text(text_a2z_plus)

['abcdefghijklmnopqrstuvwxyz', '12345678']

In [32]:
c_splitter.split_text(text_a2z)

['abcdefghijklmnopqrstuvwxyz']

In [33]:
c_splitter.split_text(text_a2z_plus)

['abcdefghijklmnopqrstuvwxyz 12345678']

The issue is Character text splitter splits only on new lines

In [34]:
c_splitter = CharacterTextSplitter(
    chunk_size = 26,
    chunk_overlap  = 4,
    length_function = len,
    separator=" "
)

In [35]:
c_splitter.split_text(text_a2z_plus)

['abcdefghijklmnopqrstuvwxyz', '12345678']

In [36]:
abstract = page.page_content

abstract

'non-negative matrix factorization (nmf) has previously been shown to  be a useful decomposition for multivariate data. two different multi plicative algorithms for nmf are analyzed. they differ only slightly in  the multiplicative factor used in the update rules. one algorithm can be  shown to minimize the conventional least squares error while the other  minimizes the generalized kullback-leibler divergence. the monotonic  convergence of both algorithms can be proven using an auxiliary func tion analogous to that used for proving convergence of the expectation maximization algorithm. the algorithms can also be interpreted as diag onally rescaled gradient descent, where the rescaling factor is optimally  chosen to ensure convergence. '

In [37]:
r_spltter.split_text(abstract)

['non-negative matrix',
 'factorization (nmf) has',
 'has previously been shown',
 'to  be a useful',
 'decomposition for',
 'for multivariate data.',
 'two different multi',
 'plicative algorithms for',
 'for nmf are analyzed.',
 'they differ only slightly',
 'in  the multiplicative',
 'factor used in the update',
 'rules. one algorithm can',
 'can be  shown to minimize',
 'the conventional least',
 'squares error while the',
 'the other  minimizes the',
 'the generalized',
 'kullback-leibler',
 'divergence. the monotonic',
 'convergence of both',
 'algorithms can be proven',
 'using an auxiliary func',
 'tion analogous to that',
 'used for proving',
 'convergence of the',
 'the expectation',
 'maximization algorithm.',
 'the algorithms can also',
 'be interpreted as diag',
 'onally rescaled gradient',
 'descent, where the',
 'the rescaling factor is',
 'is optimally  chosen to',
 'to ensure convergence.']

In [38]:
c_splitter.split_text(abstract)

['non-negative matrix',
 'factorization (nmf) has',
 'has previously been shown',
 'to be a useful',
 'decomposition for',
 'for multivariate data. two',
 'two different multi',
 'plicative algorithms for',
 'for nmf are analyzed. they',
 'they differ only slightly',
 'in the multiplicative',
 'factor used in the update',
 'rules. one algorithm can',
 'can be shown to minimize',
 'the conventional least',
 'squares error while the',
 'the other minimizes the',
 'the generalized',
 'kullback-leibler',
 'divergence. the monotonic',
 'convergence of both',
 'both algorithms can be',
 'be proven using an',
 'an auxiliary func tion',
 'tion analogous to that',
 'that used for proving',
 'convergence of the',
 'the expectation',
 'maximization algorithm.',
 'the algorithms can also be',
 'be interpreted as diag',
 'diag onally rescaled',
 'gradient descent, where',
 'the rescaling factor is',
 'is optimally chosen to',
 'to ensure convergence.']

In [39]:
r_spltter = RecursiveCharacterTextSplitter( # Set a really small chunk size, just to show.
   
    length_function = len,
    separators=['\n\n', "\n", " ", ""],
     chunk_size = 150,
    chunk_overlap  = 0,
)


In [40]:
r_spltter.split_text(abstract)

['non-negative matrix factorization (nmf) has previously been shown to  be a useful decomposition for multivariate data. two different multi plicative',
 'algorithms for nmf are analyzed. they differ only slightly in  the multiplicative factor used in the update rules. one algorithm can be  shown to',
 'minimize the conventional least squares error while the other  minimizes the generalized kullback-leibler divergence. the monotonic  convergence of',
 'both algorithms can be proven using an auxiliary func tion analogous to that used for proving convergence of the expectation maximization algorithm.',
 'the algorithms can also be interpreted as diag onally rescaled gradient descent, where the rescaling factor is optimally  chosen to ensure',
 'convergence.']

In [41]:
r_spltter = RecursiveCharacterTextSplitter( # Set a really small chunk size, just to show.
    length_function = len,
    separators=['\n\n', "\n", " ","\.", ""],
     chunk_size = 1000,
    chunk_overlap  = 0,
)


In [42]:
r_spltter.split_text(abstract)

['non-negative matrix factorization (nmf) has previously been shown to  be a useful decomposition for multivariate data. two different multi plicative algorithms for nmf are analyzed. they differ only slightly in  the multiplicative factor used in the update rules. one algorithm can be  shown to minimize the conventional least squares error while the other  minimizes the generalized kullback-leibler divergence. the monotonic  convergence of both algorithms can be proven using an auxiliary func tion analogous to that used for proving convergence of the expectation maximization algorithm. the algorithms can also be interpreted as diag onally rescaled gradient descent, where the rescaling factor is optimally  chosen to ensure convergence.']

In [43]:
# what if we want to split by sentences
# regex with look behind
sentence_spltter = RecursiveCharacterTextSplitter( # Set a really small chunk size, just to show.
    length_function = len,
    separators=["(?<=\.)"],
     chunk_size = 150,
    chunk_overlap  = 0,
)


In [44]:
sentence_spltter.split_text(abstract)

['non-negative matrix factorization (nmf) has previously been shown to  be a useful decomposition for multivariate data.',
 'two different multi plicative algorithms for nmf are analyzed. they differ only slightly in  the multiplicative factor used in the update rules.',
 'one algorithm can be  shown to minimize the conventional least squares error while the other  minimizes the generalized kullback-leibler divergence.',
 ' the monotonic  convergence of both algorithms can be proven using an auxiliary func tion analogous to that used for proving convergence of the expectation maximization algorithm.',
 ' the algorithms can also be interpreted as diag onally rescaled gradient descent, where the rescaling factor is optimally  chosen to ensure convergence.']

In [45]:
from langchain.text_splitter import TokenTextSplitter

In [46]:
token_text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [47]:
token_text_splitter.split_text(text_a2z)

['abc',
 'def',
 'gh',
 'ij',
 'kl',
 'mn',
 'op',
 'q',
 'r',
 'st',
 'uv',
 'w',
 'xy',
 'z']

In [48]:
token_text_splitter.split_text(abstract)

['non',
 '-',
 'negative',
 ' matrix',
 ' factor',
 'ization',
 ' (',
 'nm',
 'f',
 ')',
 ' has',
 ' previously',
 ' been',
 ' shown',
 ' to',
 ' ',
 ' be',
 ' a',
 ' useful',
 ' decom',
 'position',
 ' for',
 ' mult',
 'ivariate',
 ' data',
 '.',
 ' two',
 ' different',
 ' multi',
 ' pl',
 'icative',
 ' algorithms',
 ' for',
 ' nm',
 'f',
 ' are',
 ' analyzed',
 '.',
 ' they',
 ' differ',
 ' only',
 ' slightly',
 ' in',
 ' ',
 ' the',
 ' multipl',
 'icative',
 ' factor',
 ' used',
 ' in',
 ' the',
 ' update',
 ' rules',
 '.',
 ' one',
 ' algorithm',
 ' can',
 ' be',
 ' ',
 ' shown',
 ' to',
 ' minimize',
 ' the',
 ' conventional',
 ' least',
 ' squares',
 ' error',
 ' while',
 ' the',
 ' other',
 ' ',
 ' minim',
 'izes',
 ' the',
 ' generalized',
 ' k',
 'ull',
 'back',
 '-',
 'le',
 'ib',
 'ler',
 ' divergence',
 '.',
 ' the',
 ' mon',
 'ot',
 'onic',
 ' ',
 ' convergence',
 ' of',
 ' both',
 ' algorithms',
 ' can',
 ' be',
 ' proven',
 ' using',
 ' an',
 ' auxiliary',
 ' func',
 ' t

In [49]:
docs = loader.load()

In [50]:
len(docs)

3921

In [51]:
docs[0]

Document(page_content='non-negative matrix factorization (nmf) has previously been shown to  be a useful decomposition for multivariate data. two different multi plicative algorithms for nmf are analyzed. they differ only slightly in  the multiplicative factor used in the update rules. one algorithm can be  shown to minimize the conventional least squares error while the other  minimizes the generalized kullback-leibler divergence. the monotonic  convergence of both algorithms can be proven using an auxiliary func tion analogous to that used for proving convergence of the expectation maximization algorithm. the algorithms can also be interpreted as diag onally rescaled gradient descent, where the rescaling factor is optimally  chosen to ensure convergence. ', metadata={'paper_id': 1861, 'year': 2000, 'authors': 'Daniel D. Lee,H. Sebastian Seung', 'title': 'algorithms for non-negative matrix factorization'})

In [52]:
# For simplicity we will use sentence splitting

In [53]:
splits = sentence_spltter.split_documents(docs)

In [54]:
splits[0]

Document(page_content='non-negative matrix factorization (nmf) has previously been shown to  be a useful decomposition for multivariate data.', metadata={'paper_id': 1861, 'year': 2000, 'authors': 'Daniel D. Lee,H. Sebastian Seung', 'title': 'algorithms for non-negative matrix factorization'})

In [55]:
random_splits = loader.load_and_split()

## Indexing data in Vectorstores

In [56]:
from langchain.embeddings import HuggingFaceEmbeddings

In [57]:
persist_directory_random_split_gpt = 'data/chroma/random_split/gpt'

In [192]:
!rm -rf ./data/chroma  # remove old database files if any

In [193]:
%%time
vectordb_gpt = Chroma.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings(),
    persist_directory=persist_directory_random_split_gpt
)

CPU times: user 1min 21s, sys: 6.5 s, total: 1min 28s
Wall time: 2min 50s


In [60]:
print(vectordb_gpt._collection.count())

23526


In [61]:
vectordb_gpt.persist()

FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

In [244]:
from langchain import ElasticVectorSearch
from langchain.vectorstores.elastic_vector_search import ElasticKnnSearch

from langchain.embeddings import OpenAIEmbeddings

embedding = OpenAIEmbeddings()

from elasticsearch import Elasticsearch


elastic = Elasticsearch(hosts=["http://localhost:9200"])
index_name = "test_index"

In [246]:
elastic.delete_by_query(index=[index_name], body={"query": {"match_all": {}}})

  elastic.delete_by_query(index=[index_name], body={"query": {"match_all": {}}})


ObjectApiResponse({'took': 8344, 'timed_out': False, 'total': 25143, 'deleted': 25143, 'batches': 26, 'version_conflicts': 0, 'noops': 0, 'retries': {'bulk': 0, 'search': 0}, 'throttled_millis': 0, 'requests_per_second': -1.0, 'throttled_until_millis': 0, 'failures': []})

In [247]:
%%time

elasticdb_gpt = ElasticVectorSearch.from_documents(
    splits,
    embedding,
    index_name=index_name,
    elasticsearch_url="http://localhost:9200"
)


CPU times: user 41.8 s, sys: 10.9 s, total: 52.7 s
Wall time: 3min 46s


In [235]:
elastic_vector_search = ElasticVectorSearch(
            elasticsearch_url="http://localhost:9200",
            index_name=index_name,
            embedding=embedding
        )

In [236]:
elastic_vector_search_knn = ElasticKnnSearch(

            es_connection=elasticdb_gpt.client,
            index_name=index_name,
            embedding=embedding
)

##  Search

When a query comes in we first convert it to a vector and then compare the vector to the elements in the database to get n most similar results.

**Similarity Search**
1. Similarity Search aims to find the most similar items to a given query in a dataset.
2. It uses metrics like cosine similarity, Jaccard similarity, or Euclidean distance to quantify similarity.
3. The search results are ranked based on their similarity scores, and the top-K items are returned.
4. Use-cases: When the user's intent is clear and specific, similarity search is efficient. It's commonly used in standard search engines, recommendation systems, or any scenario where the goal is to find items most similar to the query.

**Maximal Marginal Relevance (MMR) Search**
1. MMR aims to provide a diverse set of results that are relevant to the query.
2. Along with quantifying similarity to the query, MMR also considers similarity between items in the results set to ensure diversity.
3. It aims to maximize the relevance of the returned items to the query, but also minimize the similarity between the returned items.
4. Use-cases: When user intent is ambiguous or when there are multiple relevant responses, MMR can provide a more diverse set of results. It's useful in news article recommendation (to avoid recommending too many similar articles) or in conversational AI (to provide diverse responses).

The choice between similarity search and MMR depends on the specific use case and user needs. If the aim is to provide a diverse set of results, MMR would be more suitable. If the goal is to find items most similar to the query, a similarity search would be more efficient.

### Keyword search

In [212]:
!pip install tabulate

Collecting tabulate
  Downloading tabulate-0.9.0-py3-none-any.whl (35 kB)
Installing collected packages: tabulate
Successfully installed tabulate-0.9.0


In [213]:
from tabulate import tabulate

In [198]:
#keyword search
resp = elasticdb_gpt.client.search(q="Natural language processing", query={"match_all": {}})

In [222]:
print("Got %d Hits:" % resp['hits']['total']['value'])
table = []
for hit in resp['hits']['hits']:
    table.append([hit['_source']['text'], hit['_score']])
print(tabulate(table, headers=["text", "score"], tablefmt="grid"))

Got 2345 Hits:
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+
| text                                                                                                                                                                                                                                                    |   score |
| cross language text classi?cation is an important learning task in natural language processing.                                                                                                                                                         | 18.6121 |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Similarity Search

In [223]:
vectordb_gpt.similarity_search("What is Natural language processing")

[Document(page_content='users want natural language processing (nlp) systems to be both fast and accurate, but quality often comes at the cost of speed.', metadata={'paper_id': 4556, 'year': 2012, 'authors': 'Jiarong Jiang,Adam Teichert,Jason Eisner,Hal Daume', 'title': 'learned prioritization for trading off accuracy and speed'}),
 Document(page_content='cross language text classi?cation is an important learning task in natural language processing.', metadata={'paper_id': 5164, 'year': 2013, 'authors': 'Min Xiao,Yuhong Guo', 'title': 'a novel two-step method for cross language representation learning'}),
 Document(page_content='our framework is inspired by state-of-the-art smoothing techniques used in natural language processing (nlp).', metadata={'paper_id': 5880, 'year': 2015, 'authors': 'Pinar Yanardag,S.V.N. Vishwanathan', 'title': 'a structural smoothing framework for robust graph comparison'}),
 Document(page_content='teaching machines to read natural language documents remains 

In [224]:
vectordb_gpt.similarity_search_with_score("What is Natural language processing")

[(Document(page_content='users want natural language processing (nlp) systems to be both fast and accurate, but quality often comes at the cost of speed.', metadata={'paper_id': 4556, 'year': 2012, 'authors': 'Jiarong Jiang,Adam Teichert,Jason Eisner,Hal Daume', 'title': 'learned prioritization for trading off accuracy and speed'}),
  0.284382700920105),
 (Document(page_content='cross language text classi?cation is an important learning task in natural language processing.', metadata={'paper_id': 5164, 'year': 2013, 'authors': 'Min Xiao,Yuhong Guo', 'title': 'a novel two-step method for cross language representation learning'}),
  0.28937631845474243),
 (Document(page_content='our framework is inspired by state-of-the-art smoothing techniques used in natural language processing (nlp).', metadata={'paper_id': 5880, 'year': 2015, 'authors': 'Pinar Yanardag,S.V.N. Vishwanathan', 'title': 'a structural smoothing framework for robust graph comparison'}),
  0.30891820788383484),
 (Document(p

In [157]:
vectordb_gpt.similarity_search("What is linear regression")

[Document(page_content='when used to guide decisions, linear regression analysis typically involves estimation of regression coefficients via ordinary least squares and their subsequent use to make decisions. when there are multiple response variables and features do not perfectly capture their relationships, it is beneficial to account for the decision objective when computing regression coefficients. empirical optimization does so but sacrifices performance when features are well-chosen or training data are insufficient. we propose directed regression, an efficient algorithm that combines merits of ordinary least squares and empirical optimization. we demonstrate through a computational study that directed regression can generate significant performance gains over either alternative. we also develop a theory that motivates the algorithm.', metadata={'paper_id': 3686, 'year': 2009, 'authors': 'Yi-hao Kao,Benjamin V. Roy,Xiang Yan', 'title': 'directed regression'})]

In [230]:
elasticdb_gpt.similarity_search_with_score("What is Natural language processing")

[(Document(page_content='Classically, tasks in natural language processing have been performed through rule-based and statistical methodologies.', metadata={'title': 'Evolution of transfer learning in natural language processing', 'authors': 'Aditya Malte, Pratik Ratadiya', 'year': 2019, 'paper_id': 7302}),
  1.8594966),
 (Document(page_content='users want natural language processing (nlp) systems to be both fast and accurate, but quality often comes at the cost of speed.', metadata={'paper_id': 4556, 'year': 2012, 'authors': 'Jiarong Jiang,Adam Teichert,Jason Eisner,Hal Daume', 'title': 'learned prioritization for trading off accuracy and speed'}),
  1.8578087),
 (Document(page_content='cross language text classi?cation is an important learning task in natural language processing.', metadata={'paper_id': 5164, 'year': 2013, 'authors': 'Min Xiao,Yuhong Guo', 'title': 'a novel two-step method for cross language representation learning'}),
  1.8553118),
 (Document(page_content='our frame

In [237]:
elastic_vector_search_knn.similarity_search_with_score("What is Natural language processing")

[(Document(page_content='users want natural language processing (nlp) systems to be both fast and accurate, but quality often comes at the cost of speed.', metadata={'paper_id': 4556, 'year': 2012, 'authors': 'Jiarong Jiang,Adam Teichert,Jason Eisner,Hal Daume', 'title': 'learned prioritization for trading off accuracy and speed'}),
  1.8570322),
 (Document(page_content='cross language text classi?cation is an important learning task in natural language processing.', metadata={'paper_id': 5164, 'year': 2013, 'authors': 'Min Xiao,Yuhong Guo', 'title': 'a novel two-step method for cross language representation learning'}),
  1.8553118),
 (Document(page_content='our framework is inspired by state-of-the-art smoothing techniques used in natural language processing (nlp).', metadata={'paper_id': 5880, 'year': 2015, 'authors': 'Pinar Yanardag,S.V.N. Vishwanathan', 'title': 'a structural smoothing framework for robust graph comparison'}),
  1.8455409),
 (Document(page_content='teaching machin

### Maximum marginal relevance

Maximum marginal relevance strives to achieve both relevance to the query and diversity among the results.

In [158]:
vectordb_gpt.max_marginal_relevance_search("What is linear regression")

[Document(page_content='when used to guide decisions, linear regression analysis typically involves estimation of regression coefficients via ordinary least squares and their subsequent use to make decisions. when there are multiple response variables and features do not perfectly capture their relationships, it is beneficial to account for the decision objective when computing regression coefficients. empirical optimization does so but sacrifices performance when features are well-chosen or training data are insufficient. we propose directed regression, an efficient algorithm that combines merits of ordinary least squares and empirical optimization. we demonstrate through a computational study that directed regression can generate significant performance gains over either alternative. we also develop a theory that motivates the algorithm.', metadata={'paper_id': 3686, 'year': 2009, 'authors': 'Yi-hao Kao,Benjamin V. Roy,Xiang Yan', 'title': 'directed regression'}),
 Document(page_cont

In [159]:
vectordb_gpt.max_marginal_relevance_search("What is natural language processing")

[Document(page_content="a long-term goal of machine learning research is to build an intelligent dialog agent. most research in natural language understanding has focused on learning from fixed training sets of labeled data, with supervision either at the word level (tagging, parsing tasks) or sentence level (question answering, machine translation). this kind of supervision is not realistic of how humans learn, where language is both learned by, and used for, communication. in this work, we study dialog-based language learning, where supervision is given naturally and implicitly in the response of the dialog partner during the conversation. we study this setup in two domains: the babi dataset of (weston et al., 2015) and large-scale question answering from (dodge et al., 2015). we evaluate a set of baseline learning strategies on these tasks, and show that a novel model incorporating predictive lookahead is a promising approach for learning from a teacher's response. in particular, a 

In [195]:
vectordb_gpt.max_marginal_relevance_search("What is natural language processing", filter={"year":1990})

[]

## Question and Answer


When a query comes in we first convert it to a vector and then compare the vector to the elements in the database to get n most similar results. These results are then passed into prompt as a context for LLM to process them


In [161]:
from langchain import PromptTemplate

def query_db(db, users_question, llm, filter={}):
  # define the prompt template
  template = """
  Given the following context sections, answer the
  question using only the given context. If you are unsure and the answer is not
  explicitly writting in the documentation, say "Sorry, I don't know how to help with that."

  Context sections:
  {context}

  Question:
  {users_question}

  Answer:
  """
  prompt = PromptTemplate(template=template, input_variables=["context", "users_question"])
    # use our vector store to find similar text chunks
  results = db.similarity_search(
      query=users_question,
      n_results=10,
      filter=filter
  )

  # fill the prompt template
  prompt_text = prompt.format(context = results, users_question = users_question)

  # ask the defined LLM
  return llm(prompt_text)


def query_db_relevance(db, users_question, llm, filter={}):
  # define the prompt template
  template = """
  Given the following context sections, answer the
  question using only the given context. If you are unsure and the answer is not
  explicitly writting in the documentation, say "Sorry, I don't know how to help with that."

  Context sections:
  {context}

  Question:
  {users_question}

  Answer:
  """
  prompt = PromptTemplate(template=template, input_variables=["context", "users_question"])
    # use our vector store to find similar text chunks
  results = db.max_marginal_relevance_search(
      query=users_question,
      n_results=10,
      filter=filter
  )

  # fill the prompt template
  prompt_text = prompt.format(context = results, users_question = users_question)

  # ask the defined LLM
  return llm(prompt_text)


In [162]:
query_db(vectordb_gpt,"What is linear regression?" , llm)

" Sorry, I don't know how to help with that."

In [163]:
query_db_relevance(vectordb_gpt,"What is linear regression?" , llm)

'Linear regression is a method of estimating a model parameter from observations from a linear model, where the relationship between the covariates and the responses is unknown. It typically involves estimation of regression coefficients via ordinary least squares and their subsequent use to make decisions.'

In [164]:
query_db(elasticdb_gpt,"What is linear regression?" , llm)

' Linear regression is a method of estimating a model parameter from observations from a linear model, typically involving estimation of regression coefficients via ordinary least squares and their subsequent use to make decisions.'

In [165]:
query_db(vectordb_gpt,"What is Natural language processing?" , llm)

" Sorry, I don't know how to help with that."

In [166]:
query_db(elasticdb_gpt,"What is Natural language processing?" , llm)

' Natural language processing (NLP) is an important learning task in natural language processing which involves teaching machines to read and comprehend natural language documents. It is used to build systems that are both fast and accurate, but quality often comes at the cost of speed.'

In [167]:
query_db(vectordb_gpt,"What is Natural language processing?" , llm, {"year":1990})

" Sorry, I don't know how to help with that."

In [168]:
query_db(elasticdb_gpt,"What is Natural language processing?" , llm, {"year":1990})

" Sorry, I don't know how to help with that."

In [169]:
query_db(vectordb_gpt,"What is BERT?" , llm)

" Sorry, I don't know how to help with that."

In [170]:
query_db_relevance(vectordb_gpt,"What is BERT?" , llm)

"Sorry, I don't know how to help with that."

## Updates

In [171]:
new_papers = [{
    "title": "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding",
    "authors": "Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova",
    "year": 2018,
    "paper_id": 7301,
    "abstract": """We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
"""
},
  {
      "title" : "Evolution of transfer learning in natural language processing",
       "authors": "Aditya Malte, Pratik Ratadiya",
      "year": 2019,
       "paper_id": 7302,
      "abstract": """In this paper, we present a study of the recent advancements which have helped bring Transfer Learning to NLP through the use of semi-supervised training. We discuss cutting-edge methods and architectures such as BERT, GPT, ELMo, ULMFit among others. Classically, tasks in natural language processing have been performed through rule-based and statistical methodologies. However, owing to the vast nature of natural languages these methods do not generalise well and failed to learn the nuances of language. Thus machine learning algorithms such as Naive Bayes and decision trees coupled with traditional models such as Bag-of-Words and N-grams were used to usurp this problem. Eventually, with the advent of advanced recurrent neural network architectures such as the LSTM, we were able to achieve state-of-the-art performance in several natural language processing tasks such as text classification and machine translation. We talk about how Transfer Learning has brought about the well-known ImageNet moment for NLP. Several advanced architectures such as the Transformer and its variants have allowed practitioners to leverage knowledge gained from unrelated task to drastically fasten convergence and provide better performance on the target task. This survey represents an effort at providing a succinct yet complete understanding of the recent advances in natural language processing using deep learning in with a special focus on detailing transfer learning and its potential advantages.
"""    
  },
  {
       "title" : "BERTQA -- Attention on Steroids",
       "authors": "Ankit Chadha, Rewa Sood",
      "year": 2019,
        "paper_id": 7303,
      "abstract": """In this work, we extend the Bidirectional Encoder Representations from Transformers (BERT) with an emphasis on directed coattention to obtain an improved F1 performance on the SQUAD2.0 dataset. The Transformer architecture on which BERT is based places hierarchical global attention on the concatenation of the context and query. Our additions to the BERT architecture augment this attention with a more focused context to query (C2Q) and query to context (Q2C) attention via a set of modified Transformer encoder units. In addition, we explore adding convolution-based feature extraction within the coattention architecture to add localized information to self-attention. We found that coattention significantly improves the no answer F1 by 4 points in the base and 1 point in the large architecture. After adding skip connections the no answer F1 improved further without causing an additional loss in has answer F1. The addition of localized feature extraction added to attention produced an overall dev F1 of 77.03 in the base architecture. We applied our findings to the large BERT model which contains twice as many layers and further used our own augmented version of the SQUAD 2.0 dataset created by back translation, which we have named SQUAD 2.Q. Finally, we performed hyperparameter tuning and ensembled our best models for a final F1/EM of 82.317/79.442 (Attention on Steroids, PCE Test Leaderboard).
"""             
  },
    {
        "title": "BERT: A Review of Applications in Natural Language Processing and Understanding",
        "authors": "Mikhail Koroteev",
        "year": 2021,
        "paper_id": 7304,
        "abstract": "In this review, we describe the application of one of the most popular deep learning-based language models - BERT. The paper describes the mechanism of operation of this model, the main areas of its application to the tasks of text analytics, comparisons with similar models in each task, as well as a description of some proprietary models. In preparing this review, the data of several dozen original scientific articles published over the past few years, which attracted the most attention in the scientific community, were systematized. This survey will be useful to all students and researchers who want to get acquainted with the latest advances in the field of natural language text analysis."
    }
]

In [172]:
new_paper_df = pd.DataFrame(new_papers)

In [173]:
new_paper_df

Unnamed: 0,title,authors,year,paper_id,abstract
0,BERT: Pre-training of Deep Bidirectional Trans...,"Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kris...",2018,7301,We introduce a new language representation mod...
1,Evolution of transfer learning in natural lang...,"Aditya Malte, Pratik Ratadiya",2019,7302,"In this paper, we present a study of the recen..."
2,BERTQA -- Attention on Steroids,"Ankit Chadha, Rewa Sood",2019,7303,"In this work, we extend the Bidirectional Enco..."
3,BERT: A Review of Applications in Natural Lang...,Mikhail Koroteev,2021,7304,"In this review, we describe the application of..."


In [174]:
loader_new = DataFrameLoader(new_paper_df, page_content_column="abstract")

In [175]:
new_splits = sentence_spltter.split_documents(loader_new.load())

In [176]:
vectordb_gpt.add_documents(new_splits)

['1c0d22a2-2336-11ee-8d97-82ad35693800',
 '1c0d24c8-2336-11ee-8d97-82ad35693800',
 '1c0d2518-2336-11ee-8d97-82ad35693800',
 '1c0d254a-2336-11ee-8d97-82ad35693800',
 '1c0d2572-2336-11ee-8d97-82ad35693800',
 '1c0d259a-2336-11ee-8d97-82ad35693800',
 '1c0d25cc-2336-11ee-8d97-82ad35693800',
 '1c0d25f4-2336-11ee-8d97-82ad35693800',
 '1c0d261c-2336-11ee-8d97-82ad35693800',
 '1c0d2644-2336-11ee-8d97-82ad35693800',
 '1c0d266c-2336-11ee-8d97-82ad35693800',
 '1c0d268a-2336-11ee-8d97-82ad35693800',
 '1c0d26b2-2336-11ee-8d97-82ad35693800',
 '1c0d26da-2336-11ee-8d97-82ad35693800',
 '1c0d270c-2336-11ee-8d97-82ad35693800',
 '1c0d272a-2336-11ee-8d97-82ad35693800',
 '1c0d2752-2336-11ee-8d97-82ad35693800',
 '1c0d277a-2336-11ee-8d97-82ad35693800',
 '1c0d27a2-2336-11ee-8d97-82ad35693800',
 '1c0d27ca-2336-11ee-8d97-82ad35693800',
 '1c0d27f2-2336-11ee-8d97-82ad35693800',
 '1c0d281a-2336-11ee-8d97-82ad35693800',
 '1c0d2838-2336-11ee-8d97-82ad35693800',
 '1c0d2860-2336-11ee-8d97-82ad35693800',
 '1c0d2888-2336-

In [177]:
elasticdb_gpt.add_documents(new_splits)

['592b159b-46e2-44a7-956f-eb6b075e333b',
 'd601b398-af58-41f9-aa0d-e4885274667b',
 'c0f99f2b-8911-4929-aa4d-da1344d8595f',
 'e47d86d7-b529-4c5f-b51a-3c288391c9a5',
 '4cca50ab-f515-4b39-857b-c7a932987221',
 '7c6b5aca-1485-4268-a498-48c9dda7bbd6',
 'be47a1cf-7855-4842-9a2c-10abd5d1f889',
 '52215b6d-ba56-4ba9-a082-fa47f09ab066',
 '9eb4aa2a-9504-45f3-a300-ec5690e68357',
 'f6a958fe-7138-4b83-9b76-c3705b5c0ef5',
 'ed164754-5a62-491c-b30d-c3aa411b91a2',
 '36c7e843-8e01-4dfb-8ffa-6621fe3d9450',
 '58f73c9b-34a7-4440-be86-546d785f28b8',
 '3571b58f-d433-49be-ac4c-90e70efcc499',
 'd07ccdef-600f-4341-a047-dff561dea519',
 '773d6339-a3f2-4725-9547-c08ebb8e126c',
 'fed326ac-f206-4247-b222-b0a308b8436b',
 '614ede5a-3acd-400b-a2aa-d72849d126b2',
 '43361fc1-c2ed-43b4-89fb-a6537e1c8c86',
 'ee21e7c9-4357-4b8a-b470-0ea09e4cbbbe',
 '1c16cc6a-145c-4746-b47e-99ec48ee36cd',
 '74368317-04df-47f4-bf28-fc00fd53aec2',
 'd80adf31-0f63-45e5-9d5e-1e4b78b56a2f',
 'e3a8a0c8-9972-4f1d-ba2a-9b664c5e1c41',
 '92fc48a9-bb3c-

In [178]:
query_db(vectordb_gpt,"What is BERT?" , llm)

' BERT is a new language representation model called Bidirectional Encoder Representations from Transformers.'

In [179]:
query_db_relevance(vectordb_gpt,"What is BERT?" , llm)

' BERT is a new language representation model called Bidirectional Encoder Representations from Transformers.'

In [180]:
query_db(elasticdb_gpt,"What is BERT?" , llm)

' BERT is a new language representation model called Bidirectional Encoder Representations from Transformers.'

In [181]:
query_db_relevance(vectordb_gpt,"How is bert an improvement?" , llm)

"Sorry, I don't know how to help with that."

## Chatting with your data: Final Demo

In [182]:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

In [183]:
from langchain.chains import ConversationalRetrievalChain


retriever=vectordb_gpt.as_retriever()
qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=retriever,
    memory=memory
)

In [184]:
question = "What is Natural language processing?"
result = qa({"question": question})

In [185]:
result["answer"]

' Natural language processing is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.'

In [186]:
question = "What are its real life applications?"
result2 = qa({"question": question})

In [187]:
 result2["answer"]

' Natural language processing can be used for tasks such as text classification, machine translation, sentiment analysis, and question answering.'

In [188]:
import datetime
current_date = datetime.datetime.now().date()
if current_date < datetime.date(2023, 9, 2):
    llm_name = "gpt-3.5-turbo-0301"
else:
    llm_name = "gpt-3.5-turbo"
print(llm_name)

gpt-3.5-turbo-0301


In [189]:
from langchain.chat_models import ChatOpenAI

def load_db(file, chain_type, k):
    # create vector database from data
    db = vectordb_gpt
    # define retriever
    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": k})
    # create a chatbot chain. Memory is managed externally.
    qa = ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(model_name=llm_name, temperature=0), 
        chain_type=chain_type, 
        retriever=retriever, 
        return_source_documents=True,
#         return_generated_question=True,
    )
    return qa 

In [190]:
import gradio

memory = ConversationBufferMemory(
    memory_key="chat_history_ui",
    return_messages=True
)
retriever=vectordb_gpt.as_retriever()
    
qa = ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(model_name=llm_name, temperature=0), 
        chain_type="stuff", 
        retriever=retriever, 
        return_source_documents=True
    )

def chat(message, history=None):
    history = history or []
    response = qa({"question": message, "chat_history":history})['answer']
    print(message, response, history)
    history.append((message, response))
    return history, history

In [191]:
# collection.load()
chatbot = gradio.Chatbot(color_map=("green", "gray"))
interface = gradio.Interface(
    chat,
    ["text", "state"],
    [chatbot, "state"],
    allow_screenshot=False,
    allow_flagging="never",
)
interface.launch(inline=True, share=True)


  chatbot = gradio.Chatbot(color_map=("green", "gray"))
  interface = gradio.Interface(


Running on local URL:  http://127.0.0.1:7861
Running on public URL: https://df2f31e9b157b3103d.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


