## ThirdAI's NeuralDB

NeuralDB, as the name suggests, combines a neural network with a database. It provides a high-level API for inserting different types of files and searching their contents with natural language queries. The neural network enables semantic search, while the database stores the paragraphs of the inserted files.

First, let's install the dependencies.

In [None]:
!pip3 install thirdai --upgrade
!pip3 install thirdai[neural_db]
!pip3 install langchain --upgrade
!pip3 install openai --upgrade
!pip3 install paper-qa --upgrade

In [None]:
from thirdai import licensing, neural_db as ndb
# licensing.deactivate()
licensing.activate("1FB7DD-CAC3EC-832A67-84208D-C4E39E-V3")

Now, let's create a NeuralDB instance.

In [None]:
db = ndb.NeuralDB(user_id="my_user")  # Any username works; in the future, it will let you push models to the model hub.

### You can even load a base DB from our Bazaar (optional but recommended)

We have a model Bazaar that provides domain-specific NeuralDBs to jumpstart search on your private documents. The Bazaar has two main types of DBs:

1. Base DBs: These come with models that have either general QnA capabilities or domain-specific capabilities, such as search on medical documents, financial documents, or contracts. They come with an empty data index into which users can insert their files.

2. Pre-Indexed DBs: These are ready-to-search DBs that come with pre-trained models and their corresponding datasets. They are meant for searching through large public datasets such as PubMed, Amazon 3MM Products, or Stack Overflow issues.

In [None]:
from pathlib import Path
from thirdai.neural_db import Bazaar

# Set up a cache directory for the DBs downloaded from the Bazaar
cache_dir = Path("bazaar_cache")
cache_dir.mkdir(exist_ok=True)

bazaar = Bazaar(cache_dir=cache_dir)


Call fetch() to refresh the list of available DBs.

In [None]:
bazaar.fetch() # Optional arg filter="model name" to filter by model name.


Below is the list of all DBs in the Bazaar.

In [None]:
print(bazaar.list_model_names())


['Contract Review', 'Finance QnA', 'General QnA']


Finally, load the DB.

In [None]:
db = bazaar.get_model("General QnA")

### Insert your files

Let's insert things into it!

Currently, we natively support adding CSV, PDF, and DOCX files. We also support automatically scraping and parsing URLs. All other file formats have to be converted into CSV files in which each row represents a paragraph or text chunk of the document.
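For formats that are not natively supported, the conversion to a one-paragraph-per-row CSV can be done with the Python standard library. The snippet below is a minimal sketch, assuming the document is already available as plain text with paragraphs separated by blank lines. The function names and the `id` and `text` column names are illustrative assumptions, not names required by NeuralDB.

```python
import csv
import io
import re


def paragraphs_to_csv_rows(text):
    """Split raw text on blank lines and return (id, paragraph) rows."""
    # Paragraphs are separated by one or more blank lines.
    chunks = re.split(r"\n\s*\n", text)
    # Collapse internal whitespace so each paragraph fits on one CSV row.
    cleaned = [" ".join(chunk.split()) for chunk in chunks]
    return [(i, para) for i, para in enumerate(p for p in cleaned if p)]


def write_paragraph_csv(text, out_file):
    """Write one paragraph per row; the column names are illustrative."""
    writer = csv.writer(out_file)
    writer.writerow(["id", "text"])  # hypothetical header names
    writer.writerows(paragraphs_to_csv_rows(text))


sample = "First paragraph,\nwrapped over two lines.\n\n\nSecond paragraph."
buffer = io.StringIO()  # swap for open("my_doc.csv", "w", newline="") to write a file
write_paragraph_csv(sample, buffer)
print(buffer.getvalue())
```

The `csv` module quotes any paragraph that contains commas, so each paragraph stays in a single cell.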

#### PDF files

In [None]:
insertable_docs = []
pdf_files = ['hdfc2.pdf']

for file in pdf_files:
    pdf_doc = ndb.PDF(file)
    insertable_docs.append(pdf_doc)

print(insertable_docs)

[<thirdai.neural_db.documents.PDF object at 0x7bfc006fdf30>]


### Insert into NeuralDB

If you wish to insert without unsupervised training, you can set 'train=False' in the insert() method.

In [None]:
source_ids = db.insert(insertable_docs, train=False)

The above command is intended to be used with a base DB which already has reasonable knowledge of the domain. In general, we always recommend using 'train=True' as shown below.

#### Insert and Train

In [None]:
source_ids = db.insert(insertable_docs, train=True)

loaded data | source 'Documents:
hdfc2.pdf' | vectors 1364 | batches 1 | time 0s | complete

train | epoch 0 | train_steps 2473 | train_hash_precision@5=0.404692  | train_batches 1 | time 7s

train | epoch 1 | train_steps 2474 | train_hash_precision@5=0.661437  | train_batches 1 | time 5s

train | epoch 2 | train_steps 2475 | train_hash_precision@5=0.87522  | train_batches 1 | time 4s

train | epoch 3 | train_steps 2476 | train_hash_precision@5=0.939883  | train_batches 1 | time 7s

train | epoch 4 | train_steps 2477 | train_hash_precision@5=0.949853  | train_batches 1 | time 4s

train | epoch 5 | train_steps 2478 | train_hash_precision@5=0.956598  | train_batches 1 | time 4s



If you call the insert() method multiple times, the documents will automatically be de-duplicated. However, if train=True, training will run again on every call.
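Since repeated insert() calls can trigger repeated training, it may be worth skipping files you have already inserted before calling insert() again. The snippet below is a user-side sketch of that idea using content fingerprints. It is not part of thirdai and says nothing about how NeuralDB de-duplicates internally; `content_key` and `select_new_files` are hypothetical helper names.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Return a stable fingerprint for a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def select_new_files(candidates, seen_keys):
    """Return names whose content has not been seen; update seen_keys in place.

    candidates: dict mapping file name -> raw bytes (read from disk by the caller).
    seen_keys: set of fingerprints for files already passed to insert().
    """
    new_names = []
    for name, data in candidates.items():
        key = content_key(data)
        if key not in seen_keys:
            seen_keys.add(key)
            new_names.append(name)
    return new_names


seen = set()
first = select_new_files({"a.pdf": b"alpha", "b.pdf": b"beta"}, seen)
second = select_new_files({"a_copy.pdf": b"alpha", "c.pdf": b"gamma"}, seen)
print(first)   # both files are new
print(second)  # only c.pdf is new; a_copy.pdf has the same bytes as a.pdf
```

Only the names returned by the helper would then be wrapped in document objects and passed to insert().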

### Search

Now let's start searching.

In [None]:
search_results = db.search(
    query="What is the railway ticket purchase fee?",
    top_k=2,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

for result in search_results:
    print(result.text)
    # print(result.context(radius=1))
    # print(result.source)
    # print(result.metadata)
    print('************')

a. Railway Ticket Purchase Fee: 1% of the transaction amount + GST except IRCTC HDFC Bank Credit Card. Charges will be reversed post transaction. (Refer IRCTC website for detailed fee structure). b. Conversion charges - Auto EMI: Applicable on all transactions (Excluding Fuel Gold and Jewellery) of Rs 10 000/- or more done on Easy EMI Credit Card.
************
This means that a customer who has a billing date of 4th of the month can spend on his Card from 5th April to 4th May his bill will be generated on 4th May and his Payment Due Date will be 24th May. Hence a purchase made on 14th April will have a credit period of 41 days while a purchase made on 2nd May will have a credit period of 23 days.
************
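When many results come back, a compact summary can be easier to scan than the full text. The helper below is a minimal sketch that relies only on the `.text` and `.source` attributes used in the search cell above. `format_results` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins so the snippet runs without a trained DB.

```python
from types import SimpleNamespace


def format_results(results, max_chars=80):
    """Return one numbered, truncated line per search result."""
    lines = []
    for rank, result in enumerate(results, start=1):
        text = result.text
        if len(text) > max_chars:
            # Keep the line short and mark the cut with an ellipsis.
            text = text[:max_chars].rstrip() + "..."
        lines.append(f"{rank}. [{result.source}] {text}")
    return lines


# Stand-in objects exposing only the .text and .source attributes used above.
fake_results = [
    SimpleNamespace(
        text="Railway Ticket Purchase Fee: 1% of the transaction amount + GST.",
        source="hdfc2.pdf",
    ),
    SimpleNamespace(text="x" * 200, source="hdfc2.pdf"),
]
for line in format_results(fake_results):
    print(line)
```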


### Get Answers from OpenAI using Langchain

In this section, we will show how to use LangChain to query an OpenAI chat model and generate an answer from the references retrieved from the above DB. You'll have to specify your own OpenAI key for this to work. You can replace this segment with any other generative model of your choice. For example, you can use an open-source model like MPT or Dolly for answer generation with the same prompt that you use with OpenAI.

In [None]:
import os
os.environ["OPENAI_API_KEY"] = ""

In [None]:
from langchain.chat_models import ChatOpenAI
from paperqa.prompts import qa_prompt
from paperqa.chains import make_chain

llm = ChatOpenAI(
    model_name='gpt-3.5-turbo',
    temperature=0.1,
)

qa_chain = make_chain(prompt=qa_prompt, llm=llm)

In [None]:
def get_references(query):
    # Retrieve the top 3 passages for the query and keep only their text.
    search_results = db.search(query, top_k=3)
    return [result.text for result in search_results]

def get_answer(query, references):
    # Join the top references into a single context string for the QnA chain.
    return qa_chain.run(
        question=query,
        context='\n\n'.join(references[:3]),
        answer_length="about 50 words",
    )
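Because the generation step only needs a question and a context string, swapping in a different generative model mostly comes down to changing one callable. The snippet below is a minimal sketch of that idea, assuming any model can be wrapped as a function from prompt string to answer string. The prompt template and the names `build_prompt`, `answer_with`, and `echo_model` are hypothetical; paperqa's qa_prompt has its own wording.

```python
def build_prompt(query, references, max_refs=3, answer_length="about 50 words"):
    """Assemble a QnA prompt from a question and retrieved passages."""
    # Same joining convention as get_answer(): top passages separated by blank lines.
    context = "\n\n".join(references[:max_refs])
    # Hypothetical template, written only for this illustration.
    return (
        f"Answer the question in {answer_length}, using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


def answer_with(generate, query, references):
    """Call any text-generation callable (str -> str) with the assembled prompt."""
    return generate(build_prompt(query, references))


def echo_model(prompt):
    # Placeholder generator: reports the prompt size instead of calling a real model.
    return f"[stub answer based on {len(prompt)} prompt characters]"


refs = ["Passage one.", "Passage two.", "Passage three.", "Passage four."]
print(answer_with(echo_model, "What is the fee?", refs))
```

Replacing `echo_model` with a function that calls a hosted or local model leaves the retrieval code untouched.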

In [None]:
query = "what is annual membership fee for Teacher's platinium?"

references = get_references(query)
print(references)

["Card variant Joining / Annual membership Fee Minimum Spend for waiver of annual membership fee / renewal charges for next year * Fuel Transaction Surcharge (Plus GST) ** Foreign Currency Transactions % Of transaction value *** Infinia (Metal Edition) Rs12 500 Rs10 Lakh / Annum Waiver capped at Rs1000 every billing cycle 2% Infinia Rs10 000 Rs8 Lakh / Annum Diners Black Rs5 Lakh / Annum HDFC Bank Times Rs500 Rs1.5 lakh / Annum Waiver capped at Rs250 every billing cycle Best Price Save Smart Rs50 000 / Annum Regalia Gold Rs2 500 Rs4 Lakh / Annum NA Regalia Business Regalia Diners Privilege Diners Club Premium Doctor's Regalia Rs3 Lakh / Annum Waiver capped at Rs500 every billing cycle For Regalia Doctor's Regalia Business Regalia - Waived for transaction between Rs400 to Rs5 000 (Diners Privilege Fuel surcharge waiver not applicable effective 1st Aug'23) Tata Neu Infinity HDFC Bank Rs1 499 Times Platinum Rs1 000 Rs2.5 lakh / Annum Regalia First Business Regalia First Rs1 Lakh / Annum D