The goal is to build a patent gap inference engine. Users must be able to query the entire uploaded patent corpus (currently just 10 patents in training_dataset) broadly or narrowly, and to focus on specific sections of the patent data, such as claims and descriptions. The system must always retain the association between each publication number and its claims and descriptions, and let users analyze those sections individually. It must map publication numbers by common and uncommon topics and keywords in the claims and descriptions, and by the common and uncommon CPC classes and definitions attached to each publication number, and it must identify both obvious and non-obvious connections between these features. It should compare claims and descriptions within a single publication number or across several, list the features it finds, distinguish similar features from different ones, contrast the claim features against the descriptions, and provide robust inferences based on these comparisons. The system must be flexible enough to handle any number of patents, and users should be able to analyze each part of the claims and descriptions with a focus on the features and their relationships.
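
The requirement to keep every claim and description tied to its publication number can be sketched with a small record type. This is only an illustration under stated assumptions: `PatentRecord`, `get_section`, `search`, and the two publication numbers below are hypothetical names and placeholders, not part of the dataset or of any library used in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class PatentRecord:
    """One patent; every section stays attached to its publication number."""
    publication_number: str
    claims: list = field(default_factory=list)
    description: str = ""
    cpc: list = field(default_factory=list)


def get_section(records, publication_number, section):
    """Return one section ('claims', 'description', 'cpc') of one patent."""
    return getattr(records[publication_number], section)


def search(records, keyword, sections=("claims", "description")):
    """Return (publication_number, section) pairs whose text mentions keyword."""
    hits = []
    for pub, rec in records.items():
        for section in sections:
            value = getattr(rec, section)
            text = " ".join(value) if isinstance(value, list) else value
            if keyword.lower() in text.lower():
                hits.append((pub, section))
    return hits


# Placeholder records (hypothetical publication numbers and text)
records = {
    "US-0000001-A1": PatentRecord(
        "US-0000001-A1",
        claims=["A method of pumping drilling fluid."],
        description="The fluid is pumped downhole.",
    ),
    "US-0000002-A1": PatentRecord(
        "US-0000002-A1",
        claims=["A solver for reservoir simulation."],
        description="A multiscale solver is described.",
    ),
}

broad = search(records, "pump")                          # all sections
narrow = search(records, "solver", sections=("claims",))  # claims only
```

A broad query scans every section, while a narrow query passes a shorter `sections` tuple; in both cases each hit still carries its publication number.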

In [None]:
!pip install --upgrade pip

In [None]:
!pip install sentence_transformers

In [None]:
!pip install --upgrade 'farm-haystack[all]'

In [None]:
!pip install pydantic

In [None]:
!pip install --upgrade langchain

In [1]:
# Importing libs 

# Data Handling
import pandas as pd
import numpy as np


# Torch and Transformers
import torch
from torch import bfloat16
import transformers
from transformers import AutoTokenizer

# LangChain
from langchain.llms import HuggingFacePipeline
from langchain_community.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings

from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack import Document

# Hiding warnings 
import warnings
warnings.filterwarnings("ignore")



In [2]:
# Define the file path
directory_path = '/kaggle/input/training-dataset'
file_name = 'training_dataset.csv'
file_path = f'{directory_path}/{file_name}'

# Load the data into a DataFrame
try:
    df = pd.read_csv(file_path, encoding='utf-8')
    print("File loaded successfully with utf-8 encoding.")
except UnicodeDecodeError:
    try:
        df = pd.read_csv(file_path, encoding='ISO-8859-1')  # latin1
        print("File loaded successfully with ISO-8859-1 encoding.")
    except UnicodeDecodeError:
        df = pd.read_csv(file_path, encoding='cp1252')  # Another common encoding
        print("File loaded successfully with cp1252 encoding.")

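# Create Documents from the DataFrame using DataFrameLoader.
# Note: page_content_column="publication_number" makes the publication number the
# page content; every other column (abstract, claim, description, ...) is stored as metadata.
patents = DataFrameLoader(df, page_content_column="publication_number")
document = patents.load()

# 'document' is a list of Documents, so this slice prints up to 1000 documents, not 1000 tokens
print("First 1000 tokens of the first document: ", document[:1000])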

File loaded successfully with ISO-8859-1 encoding.
First 1000 tokens of the first document:  [Document(page_content='US-2017114613-A1', metadata={'abstract': 'Method for well re-stimulation treatment using instantaneous shut-in pressure (ISIP) to guide the design and execution of refracturing stages. Pore pressure and optional cluster stresses are determined at a start of the treatment. Goal ISIPs for the refracturing correspond to undepleted regions of the formation, and target ISIPs versus treatment progression/stage range from about a lowest pore pressure corresponding to depleted regions of the formation up to within the goal range ISIPs. Diversion and proppant pumping schedules are designed, and the refracturing treatment is initiated in accordance with the design. ISIP is measured at stage end, and if it varies from the target ISIP, subsequent stages are modified from the design as needed to more closely match the ISIP schedule.', 'claim': 'What is claimed is: \n     \n         1
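
The nested `try`/`except` fallback in the loading cell can also be written as a loop over candidate encodings. The sketch below works on raw bytes rather than on `pd.read_csv`, and `decode_with_fallback` is a hypothetical helper name. One detail worth knowing: ISO-8859-1 assigns a character to every byte value, so decoding with it never raises `UnicodeDecodeError`. A cp1252 attempt placed after it is therefore never reached, which is why this sketch tries cp1252 first.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "ISO-8859-1")):
    """Return (text, encoding) for the first encoding that decodes raw cleanly."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the input")


utf8_result = decode_with_fallback("café".encode("utf-8"))
cp1252_result = decode_with_fallback(b"caf\xe9")   # invalid UTF-8, valid cp1252
latin1_result = decode_with_fallback(b"\x81")      # undefined in cp1252
```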

In [3]:

# 'df' holds the text columns 'abstract', 'claim', 'description', and 'definitions'.

# Calculate the total character count for each document by summing up the character counts of all text columns
df['total_character_count'] = df[['abstract', 'claim', 'description', 'definitions']].apply(lambda x: sum(len(str(val)) for val in x), axis=1)

# Find the maximum character count across all documents to find the longest document
max_character_count = df['total_character_count'].max()

print("The longest patent document contains", max_character_count, "characters.")


The longest patent document contains 57968 characters.


In [5]:
# Splitting document into smaller chunks
splitter = RecursiveCharacterTextSplitter(chunk_size = 4096,
                                chunk_overlap = 50)
splitted_texts = splitter.split_documents(document)
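
To get a feel for what `chunk_size = 4096` and `chunk_overlap = 50` mean for the longest patent (57968 characters), here is a minimal fixed-window chunker. It is a simplification, not the library's algorithm: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line boundaries, so its real chunk counts will differ. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size=4096, chunk_overlap=50):
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while True:
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # this window reached the end of the text
        start = end - chunk_overlap  # step back so neighbours share an overlap
    return chunks


longest = "x" * 57968            # same length as the longest patent above
chunks = chunk_text(longest)
small = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
```

Each window advances by 4096 - 50 = 4046 characters, so a 57968-character text needs 15 windows under this simplified scheme.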

In [6]:
# Loading model to create the embeddings
embedding_model = SentenceTransformer('sentence-transformers/bert-base-nli-mean-tokens')

In [10]:
!pip install chromadb

In [13]:
!pip install --upgrade sentence_transformers



In [14]:
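from langchain.vectorstores import Chroma

# Chroma expects a LangChain embeddings object that exposes embed_documents().
# Passing the raw SentenceTransformer raises:
#   AttributeError: 'SentenceTransformer' object has no attribute 'embed_documents'
# so the same model is wrapped in HuggingFaceEmbeddings (imported above) instead.
embedding_model = HuggingFaceEmbeddings(model_name='sentence-transformers/bert-base-nli-mean-tokens')

# Creating an indexed database
chroma_database = Chroma.from_documents(splitted_texts,
                                      embedding_model,
                                      persist_directory = 'chroma_db')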
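
LangChain vector stores call `embed_documents` and `embed_query` on whatever embedding object they receive, which is why an object that only offers `encode` is rejected. The sketch below shows that duck-typed interface with a thin adapter. `EncoderAdapter` and `toy_encode` are hypothetical; `toy_encode` is a stand-in for a real encoder and produces meaningless two-number vectors.

```python
class EncoderAdapter:
    """Expose embed_documents/embed_query on top of any encode(list_of_texts) callable."""

    def __init__(self, encode):
        self._encode = encode

    def embed_documents(self, texts):
        # One plain list of floats per input text
        return [[float(x) for x in vector] for vector in self._encode(list(texts))]

    def embed_query(self, text):
        # A query is embedded the same way as a single document
        return self.embed_documents([text])[0]


def toy_encode(texts):
    """Hypothetical stand-in for a model's encode(): [character count, word count]."""
    return [[len(t), t.count(" ") + 1] for t in texts]


adapter = EncoderAdapter(toy_encode)
doc_vectors = adapter.embed_documents(["a b", "abc"])
query_vector = adapter.embed_query("a b")
```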

In [None]:
# Visualizing the database
chroma_database

In [None]:
# Defining a retriever
retriever = chroma_database.as_retriever()

In [None]:
VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x7bc54e8b2ec0>)


In [None]:
# Configuring BitsAndBytesConfig for loading model in an optimal way
quantization_config = transformers.BitsAndBytesConfig(load_in_4bit = True,
                                        bnb_4bit_quant_type = 'nf4',
                                        bnb_4bit_use_double_quant = True,
                                        bnb_4bit_compute_dtype = bfloat16)


In [None]:
# Loading Mistral 7b model 
llm = HuggingFacePipeline.from_model_id(model_id='/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1',
                                       task = 'text-generation',
                                       model_kwargs={'temperature': .3,
                                                    'max_length': 1024,
                                                    'quantization_config': quantization_config},
                                       device_map = "auto")

In [None]:
# Defining a QnA chain
from langchain.chains import RetrievalQA  # not imported in the first import cell

QnA = RetrievalQA.from_chain_type(llm = llm,
                                 chain_type = 'stuff',
                                 retriever = retriever,
                                 verbose = False)

In [None]:
# Defining function to fetch documents according to a query
def get_answers(QnA, query):
    answer = QnA.run(query)
    print(f"\033[1mQuery:\033[0m {query}\n")
    print(f"\033[1mAnswer:\033[0m ", answer)

In [None]:
query = """The drilling fluid is pumped downhole through the drill pipe at a given rate and pressure, are there any examples of the rate and pressure?"""
get_answers(QnA, query)
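
Conceptually, a `'stuff'` retrieval chain fetches the most relevant chunks and places all of them into a single prompt for the model. The sketch below imitates that flow with a toy word-overlap score in place of vector similarity. `retrieve`, `stuff_prompt`, the prompt wording, and the sample chunks are all hypothetical and do not reproduce LangChain's internals.

```python
def retrieve(chunks, query, k=2):
    """Rank chunks by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def stuff_prompt(chunks, question):
    """Concatenate every retrieved chunk into one context block."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


sample_chunks = [
    "drilling fluid is pumped at a given rate",
    "the solver runs in parallel",
    "pressure and rate examples",
]
question = "rate and pressure of drilling fluid"
top_chunks = retrieve(sample_chunks, question)
prompt_text = stuff_prompt(top_chunks, question)
```

Because everything is stuffed into one prompt, the number and size of retrieved chunks must fit within the model's context window.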

In [None]:
# Importing Cleaning Libraries
import re
import nltk
import gensim
import unidecode
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim.models.phrases import Phrases, Phraser
from gensim.corpora import Dictionary
from gensim.models.ldamulticore import LdaMulticore
from nltk.corpus import wordnet
from nltk import pos_tag

nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')  # required by pos_tag

In [None]:
columns = ["publication_number", "abstract", "claim", "description", "top_terms", "cpc", "definitions"] 

In [None]:
# Cleaning the Training Dataset

# Stopwords and known terms setup
# (a missing comma after "pre" used to merge "pre" and "le" into the single entry "prele")
stop_words = set(stopwords.words('english')).union({
    "said", "state", "therein", "herein", "wherein", "includes", "included", "include", 
    "least", "wherein", "method", "claim", "claims", "claimed", "device", "equation",
    "comprising", "consisting", "comprises", "comprise", "consist", "first", "second", 
    "third", "invention", "thus", "means", "plurality", "introducing", "may", "new", 
    "novel", "according", "use", "using", "used", "consists", "step", "while", "whilst", 
    "based", "assembly", "one", "system", "unit", "member", "portion", "well-known", "however", 
    "detailed", "description", "example", "however", "reference", "made", "detail", "embodiment", 
    "example", "illustrated", "accompanying", "drawing", "figure", "following", "numerous", 
    "specific", "detail", "set", "forth", "order", "provide", "thorough", "understanding", 
    "disclosure", "apparent", "ordinary", "skill", "art", "certain", "embodiment", "disclosure", 
    "practiced", "without", "specific", "detail", "instance", "method", "procedure", "component", 
    "scope", "and/or", "particular", "shown", "way", "purpose", "illustrative", "discussion", 
    "subject", "presented", "cause", "providing", "believed", "useful", "readily", "understood", 
    "principle", "conceptual", "aspect", "regard", "attempt", "show", "necessary", "taken", 
    "making", "skilled", "form", "embodied", "practice", "furthermore", "like", "number", 
    "designation", "various", "indicate", "element", "purpose", "fracture", "continuous", 
    "void", "space", "defined", "construct", "details", "associated", "listed", "item", 
    "aim", "le", "Fig", "FIG", "FIG.", "Figure", "Fig.", "fig", "figure", "ic/", "ic", "pr", "al", "pre",
    "le", "se", "Drawing", "39", "process", "thereof"
}).union(set(str(i) for i in range(20)))

known_terms = {"Poisson's_ratio", "Young's_modulus"}

lemmatizer = WordNetLemmatizer()

# Function to convert nltk POS to wordnet POS
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Default to noun if POS tag is unclear

# Adding cleaning pattern to handle specific sequences and special characters
patterns = {
    'remove_brackets_numbers': re.compile(r'\[\d+?\]'),
    'remove_special_chars': re.compile(r'[“_“\[\]ℎ\ue89e-]'),
    'remove_Ntildea_Ntildean': re.compile(r'Ña|Ñan'),
    'remove_non_ascii': re.compile(r'[^\x00-\x7F]+'),
    'replace_underscores': re.compile(r'\s*_\s*'),  # replaces underscore sequences with a single space
    'remove_single_letters': re.compile(r'\b\w\b'),
    'replace_square_brackets': re.compile(r'[\[\]]'),
    'remove_sequence': re.compile(r'Equation_= \[ - - \]'),  # remove 'Equation_= [ - - ]'
    'remove_special_sequences': re.compile(r'\ue89e_\ue89e|“_“|,|\.|:|;|\(|\)|=|>|-|%|�|”|#|&|/|\+|\''), # Remove specific sequences and special characters; bare 'le' and 'ic' alternatives dropped because they also matched inside words (both are already stopwords)
    'reduce_spaces': re.compile(r'\s+')
}

def clean_text(text):
    # Convert unicode special characters into equivalent ASCII; str() guards against NaN cells
    text = unidecode.unidecode(str(text))
    text = patterns['remove_brackets_numbers'].sub('', text)
    text = patterns['remove_special_chars'].sub(' ', text)
    text = patterns['remove_Ntildea_Ntildean'].sub('', text)  # removing 'Ña' and 'Ñan' sequences
    text = patterns['replace_underscores'].sub(' ', text)
    text = patterns['remove_single_letters'].sub(' ', text)
    text = patterns['replace_square_brackets'].sub(' ', text)  # Replacing with space may be better than simply removing
    text = patterns['remove_sequence'].sub('', text)
    text = patterns['remove_special_sequences'].sub(' ', text)
    text = patterns['remove_non_ascii'].sub(' ', text)  # removing non-ascii characters
    # Collapse whitespace, trim, and lowercase (each substitution now runs exactly once)
    text = patterns['reduce_spaces'].sub(' ', text).strip().lower()
    return text

def clean_and_tokenize(text):
    cleaned_text = clean_text(text)  # Use your existing clean_text function
    tokens = word_tokenize(cleaned_text)
    
    # POS tagging
    tagged_tokens = pos_tag(tokens)
    
    # Lemmatization with POS tags
    lemmatized_tokens = []
    for word, tag in tagged_tokens:
        wn_tag = get_wordnet_pos(tag)  # Convert to WordNet POS tag
        lemmatized_tokens.append(lemmatizer.lemmatize(word, pos=wn_tag))
    
    # Filter out stopwords and known terms
    final_tokens = [token for token in lemmatized_tokens if token not in stop_words and token not in known_terms]
    
    return final_tokens

# Build one cleaned token list per patent from the DataFrame 'df' loaded above
# ('results' was never defined in this notebook)
all_data = []
for index, row in df.iterrows():
    combined_text = f"{row['abstract']} {row['claim']} {row['description']} {row['definitions']}"
    cleaned_combined_text = clean_and_tokenize(combined_text)
    # Display the first 1000 tokens of the first document
    if index == 0:
        print("First 1000 tokens of the first document: ", cleaned_combined_text[:1000])
    all_data.append({
        'publication_number': row['publication_number'],
        'tokens': cleaned_combined_text,
        'top_terms': row['top_terms'],
        'embedding': row.get('embedding')  # None until an 'embedding' column exists
    })
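
When a cleaning pattern lists short tokens such as `le` or `ic` as alternatives, it matters whether they are anchored to word boundaries. The short check below compares the two behaviours on a made-up sentence. The pattern names and the sample text are illustrative only.

```python
import re

# Unanchored: matches 'le' and 'ic' anywhere, including inside longer words
unanchored = re.compile(r"le|ic")
# Anchored: matches 'le' and 'ic' only when they stand alone as whole words
anchored = re.compile(r"\b(?:le|ic)\b")


def strip_tokens(pattern, text):
    """Replace matches with a space, then collapse repeated whitespace."""
    return " ".join(pattern.sub(" ", text).split())


sample = "the cycle of hydraulic pressure le ic"
damaged = strip_tokens(unanchored, sample)
preserved = strip_tokens(anchored, sample)
```

The unanchored version truncates "cycle" and "hydraulic", while the anchored version removes only the stray fragments.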


In [None]:
# Columns for which to generate the sentence embeddings
columns = ["publication_number", "abstract", "claim", "description", "top_terms", "cpc", "definitions"]

print("Initializing the SBERT model.")
# Initialize the SBERT model
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Function to get SBERT embeddings
def get_sbert_embeddings(text):
    print("Generating SBERT embeddings.")
    embeddings = model.encode([text])
    return embeddings[0]  # Encode returns a list of embeddings

print("Processing each row in the DataFrame and generating embeddings.")
# Process each row in the DataFrame
for index, row in df.iterrows():
    for column in columns:
        text = str(row[column])  # str() guards against NaN or other non-string cells
        print(f"Processing record {index + 1} of {len(df)}. Column: {column}")

        embeddings = get_sbert_embeddings(text)
        df.at[index, column + '_embedding'] = ','.join(map(str, embeddings.tolist()))  # Storing embeddings as string

        # Print the first embedding for review
        if index == 0 and column == columns[0]:
            print('First 100 elements of the first SBERT embeddings: ', embeddings[:100])

print("Processing completed. Embeddings generated.")
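
The cell above stores each embedding as a comma-joined string. A minimal round trip, plus a cosine similarity for comparing two stored vectors, might look like the sketch below. `to_string`, `from_string`, and `cosine` are hypothetical helpers, and the vector is a made-up three-element example rather than a real 768-dimensional embedding.

```python
import math


def to_string(vector):
    """Serialize a vector as comma-separated floats (repr keeps full precision)."""
    return ",".join(repr(float(x)) for x in vector)


def from_string(serialized):
    """Parse a comma-separated string back into a list of floats."""
    return [float(x) for x in serialized.split(",")]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


vector = [0.1, -2.5, 3.0]
restored = from_string(to_string(vector))
```

Storing vectors as text is convenient for CSV export, but every later comparison has to parse them back into numbers first.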

In [None]:
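# Prepare objects to be written to the document store.
# Assumption: the DataFrame has no 'text' or 'embedding' column, so this uses
# 'description' and the 'description_embedding' string built in the previous cell.
# Haystack 1.x expects the text under the key 'content'.
documents = [
    {
        'content': str(row['description']),
        'embedding': np.array(row['description_embedding'].split(','), dtype=np.float32),
        'meta': {'publication_number': row['publication_number']},
    }
    for idx, row in df.iterrows()
]

# Initialize the FAISS DocumentStore
doc_store = FAISSDocumentStore(
    sql_url="sqlite:///training_dataset_db.sqlite", 
    embedding_dim=768,  # dimension of embeddings ('vector_dim' is the deprecated name)
    faiss_index_factory_str="Flat",
    return_embedding=True
)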

# Write documents to the store
doc_store.write_documents(documents)

print(f"All {len(documents)} documents have been added to the DocumentStore.")
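
A `"Flat"` FAISS index performs exhaustive search: the query vector is compared against every stored vector and the best scores are returned. The pure-Python sketch below shows the idea with a dot-product score. `flat_search` and the tiny two-dimensional vectors are hypothetical; a real index runs the same comparison in optimized native code.

```python
def flat_search(index, query, top_k=3):
    """Score every (doc_id, vector) pair against query and return the top_k by dot product."""
    scored = [
        (doc_id, sum(q * v for q, v in zip(query, vector)))
        for doc_id, vector in index
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


toy_index = [
    ("a", [1.0, 0.0]),
    ("b", [0.0, 1.0]),
    ("c", [1.0, 1.0]),
]
best = flat_search(toy_index, [2.0, 1.0], top_k=2)
```

Exhaustive search is exact and perfectly adequate for 10 patents; approximate index types only become interesting for much larger collections.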

In [None]:
from haystack.nodes import DensePassageRetriever, FARMReader

# Instantiate the dense passage retriever
print("Initializing Dense Passage Retriever...")

retriever = DensePassageRetriever(
    document_store=doc_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    use_gpu=True,
    embed_title=True,
    top_k=10  # retrieves top 10 documents
)

# Vectors produced by SBERT are not comparable with DPR query vectors,
# so re-embed the stored documents with the DPR passage encoder.
doc_store.update_embeddings(retriever)

print("Retriever initialized.")

# Initialize the reader
print("Initializing the Reader...")
reader = FARMReader('deepset/roberta-base-squad2')
print("Reader initialized.")

In [None]:
from haystack.pipelines import ExtractiveQAPipeline

# Initialize the pipeline, which chains the retriever and the reader to answer our actual questions
print("Initializing the Extractive QA Pipeline...")
pipeline = ExtractiveQAPipeline(reader, retriever)
print("Extractive QA Pipeline initialized.")

In [None]:
# Example query
query = "What is the hybrid parallel strategy for a multiscale solver?"

print("Getting predictions from the pipeline...")
predictions = pipeline.run(query)

# Print top-k answers returned by pipeline
for i, ans in enumerate(predictions["answers"], 1):
    print(f"Answer {i}: {ans.answer}")
    print(f"Metadata keys for answer {i}: {ans.meta.keys()}")  # Print keys of metadata
    print("\n")

Loading Mistral 7b

In [None]:
# Configuring BitsAndBytesConfig for loading model in an optimal way
quantization_config = transformers.BitsAndBytesConfig(load_in_4bit = True,
                                                      bnb_4bit_quant_type = 'nf4',
                                                      bnb_4bit_use_double_quant = True,
                                                      bnb_4bit_compute_dtype = bfloat16)

In [None]:
# Loading the Mistral 7b model
llm = HuggingFacePipeline.from_model_id(model_id='mistralai/Mistral-7B-v0.1',
                                        task='text-generation',
                                        model_kwargs={'temperature': 0.3, 'max_length': 1024, 'quantization_config': quantization_config},
                                        device_map='auto')

In [None]:
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
    )

from langchain.chains import LLMChain, SimpleSequentialChain, RetrievalQA, ConversationalRetrievalChain

In [None]:
# Prepare the prompt template
template = """
[INST] <<SYS>>
Act as a patent expert. Use the following information to answer the question at the end.
<</SYS>>

{context}

{question} [/INST]
"""
prompt = PromptTemplate(template=template, input_variables=["context", "question"])
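
A prompt template of this kind is filled in by substituting the retrieved context and the user's question for the `{context}` and `{question}` placeholders. The sketch below does the same with plain `str.format`. `demo_template`, `render`, and the sample strings are hypothetical and simplified compared with the template defined in the notebook.

```python
demo_template = (
    "[INST] Act as a patent expert. Use the following information "
    "to answer the question at the end.\n\n{context}\n\n{question} [/INST]"
)


def render(template, context_chunks, question):
    """Join the retrieved chunks and substitute both placeholders."""
    return template.format(context="\n\n".join(context_chunks), question=question)


rendered = render(
    demo_template,
    ["Claim 1 recites a multiscale solver.", "The description mentions two strategies."],
    "What does claim 1 recite?",
)
```

Printing the rendered string before sending it to the model is a cheap way to confirm that the context really reaches the prompt.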

In [None]:
# RetrievalQA needs a LangChain retriever. The Haystack DensePassageRetriever is not
# compatible with it, so the Chroma index built earlier is reused here.
print("Initializing retriever...")
retriever = chroma_database.as_retriever(search_kwargs={"k": 10})
print("Retriever initialized.")

# Preparing the RetrievalQA chain with adjusted prompt
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)

In [None]:
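query = "How many possibilities are there for algorithm parallelization for the multiscale solver, and what are they?"
# qa_chain returns both 'result' and 'source_documents', so it is called directly instead of through run()
response = qa_chain({"query": query})
print(f"\033[1mQuery:\033[0m {query}\n")
print(f"\033[1mAnswer:\033[0m ", response["result"])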

Starting with New Model

In [None]:
# Data Handling
import pandas as pd
import numpy as np

# Torch and Transformers
import torch
from torch import bfloat16
import transformers
import haystack
from transformers import AutoTokenizer
from transformers import BertModel, BertTokenizer
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# LangChain
from langchain.llms import HuggingFacePipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma

# Document Store
from haystack.document_stores import FAISSDocumentStore

# Hiding warnings 
import warnings
warnings.filterwarnings("ignore")
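
One of the stated goals is to separate topics shared across publication numbers from those found in only one. A very small version of that idea counts, for each term, how many patents contain it. The sketch below uses plain document frequency as a stand-in for the TF-IDF weighting that `TfidfVectorizer` would provide. `document_frequency`, `common_and_unique_terms`, and the three toy documents are hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """Count in how many documents each lowercase term appears."""
    counts = Counter()
    for text in docs.values():
        counts.update(set(text.lower().split()))
    return counts


def common_and_unique_terms(docs):
    """Return (terms shared by 2+ documents, {publication: terms found only there})."""
    counts = document_frequency(docs)
    common = sorted(term for term, n in counts.items() if n > 1)
    unique = {
        pub: sorted(t for t in set(text.lower().split()) if counts[t] == 1)
        for pub, text in docs.items()
    }
    return common, unique


toy_docs = {
    "P1": "fluid pump pressure",
    "P2": "fluid solver grid",
    "P3": "pressure solver mesh",
}
common_terms, unique_terms = common_and_unique_terms(toy_docs)
```

Terms unique to one publication are a natural starting point for spotting features that the other patents do not cover.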

In [None]:
# Cleaning and tokenization function
import re
import unidecode

# Adding cleaning pattern to handle specific sequences and special characters
patterns = {
    'remove_special_chars': re.compile(r'[“_“\[\]ℎ\ue89e-]'),
    'remove_Ntildea_Ntildean': re.compile(r'Ña|Ñan'),
    #'remove_non_ascii': re.compile(r'[^\x00-\x7F]+'),
    'replace_underscores': re.compile(r'\s*_\s*'),  # replaces underscore sequences with a single space
    'remove_single_letters': re.compile(r'\b\w\b'),
    #'remove_brackets_numbers': re.compile(r'\[\d+?\]'),
    #'replace_square_brackets': re.compile(r'[\[\]]'),
    #'remove_sequence': re.compile(r'Equation_= \[ - - \]'),  # remove 'Equation_= [ - - ]'
    'remove_special_sequences': re.compile(r'\ue89e_\ue89e|“_“|\.|:|;|\(|\)|=|>|%|�|”|#|&|/|\+|\''), # Remove specific sequences and special characters
    'reduce_spaces': re.compile(r'\s+')
}

def clean_text(text):
    text = unidecode.unidecode(text)  # convert unicode special characters into equivalent ASCII
    #text = patterns['remove_brackets_numbers'].sub('', text)
    text = patterns['remove_special_chars'].sub(' ', text)
    text = patterns['remove_Ntildea_Ntildean'].sub('', text)  # removing 'Ña' and 'Ñan' sequences
    text = patterns['replace_underscores'].sub(' ', text)
    text = patterns['remove_single_letters'].sub(' ', text)
    #text = patterns['replace_square_brackets'].sub(' ', text)  # Replacing with space may be better than simply removing
    #text = patterns['remove_sequence'].sub('', text)
    text = patterns['remove_special_sequences'].sub(' ', text)
    #text = patterns['remove_non_ascii'].sub(' ', text) # removing non-ascii characters
    text = patterns['reduce_spaces'].sub(' ', text).strip().lower()  # Ensuring text lowercasing and no leading/trailing spaces.
    #text = patterns['reduce_spaces'].sub(' ', text).strip()
    return text

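def clean_and_tokenize(text):
    # Split into paragraphs first: clean_text collapses all whitespace (including
    # newlines), so splitting after cleaning would always yield a single paragraph.
    paragraphs = [clean_text(paragraph) for paragraph in str(text).split('\n')]
    # Drop paragraphs that are empty after cleaning
    return [paragraph for paragraph in paragraphs if paragraph]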

# Build one list of cleaned paragraphs per patent from the DataFrame 'df'
# ('results' was never defined in this notebook)
all_data = []
for index, row in df.iterrows():
    combined_text = f"{row['abstract']} {row['claim']} {row['description']} {row['definitions']}"
    cleaned_combined_text = clean_and_tokenize(combined_text)
    # Display the paragraphs of the first document only
    if index == 0:
        print("First 1000 paragraphs of the first document: ", cleaned_combined_text[:1000])

    all_data.append({
        'publication_number': row['publication_number'],
        'paragraphs': cleaned_combined_text,    # list of cleaned paragraphs
        'top_terms': row['top_terms'],
        'embedding': row.get('embedding')  # None until an 'embedding' column exists
    })

In [None]:
from tqdm.auto import tqdm
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings

# Splitter initialization
splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

# Function to split paragraphs into chunks while keeping the publication number as metadata
def process_documents(all_data):
    splitted_texts = []
    for doc in tqdm(all_data, desc="Splitting documents", leave=True):
        paragraphs = doc['paragraphs']
        metadatas = [{'publication_number': doc['publication_number']}] * len(paragraphs)
        # create_documents returns Document objects, as Chroma.from_documents expects
        splitted_texts.extend(splitter.create_documents(paragraphs, metadatas=metadatas))
    return splitted_texts

# Process all documents to get a list of text chunks
print("Processing documents to generate text chunks...")
splitted_texts = process_documents(all_data)

# Create the SentenceTransformerEmbeddings 
print("Creating SentenceTransformerEmbeddings...")
embeddings_model = SentenceTransformerEmbeddings(model_name='sentence-transformers/paraphrase-albert-small-v2')
print("SentenceTransformerEmbeddings created.")

In [None]:
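# Check the first embedding ('embeddings' was not defined for this model, so compute it here)
embeddings = embeddings_model.embed_documents([doc.page_content for doc in splitted_texts[:1]])
print('First 500 elements of first SBERT embedding: ', embeddings[0][:500])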

In [None]:
!pip install chromadb

In [None]:
import chromadb
from chromadb.config import Settings
from chromadb.utils import embedding_functions

# Creating an indexed database
# (uses 'embeddings_model' from the cell above and the name 'chroma_db' expected by the cells below)
chroma_db = Chroma.from_documents(splitted_texts,
                                  embeddings_model,
                                  persist_directory = 'chroma_db')



In [None]:
# Visualizing the database
chroma_db

In [None]:
# Defining a retriever
retriever = chroma_db.as_retriever()

In [None]:
VectorStoreRetriever(tags=['Chroma', 'SentenceTransformer'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x7b553d803340>)

In [None]:
# Initialize language model
print("Initializing language model...")
quantization_config = transformers.BitsAndBytesConfig(load_in_4bit=True, 
                                                      bnb_4bit_quant_type='nf4', 
                                                      bnb_4bit_use_double_quant=True, 
                                                      bnb_4bit_compute_dtype=torch.bfloat16)

print("Quant config loaded successfully.")

In [None]:
# Configuration for the model
model_id = '/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1'
# quantization_config from the previous cell is kept as-is (resetting it to {} here would discard it)

# Determine if CUDA (GPU) is available and set the device accordingly
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Device map configuration
device_map = "auto" if device == "cuda" else None  # Use 'auto' device mapping for GPU, otherwise None for CPU

In [None]:
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

In [None]:
# Loading Mistral 7b model 

print("Loading language model...")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
)

# Create the pipeline. Generation settings are passed directly to the pipeline;
# 'model_kwargs' only applies when the pipeline loads the model itself.
# do_sample=True is needed for the temperature to take effect.
llm_pipeline = pipeline(
    task='text-generation',
    model=model,
    tokenizer=tokenizer,
    device=0 if device == "cuda" else -1,
    do_sample=True,
    temperature=0.3,
    max_length=1024,
)

print("Mistral Model loaded successfully.")

In [None]:
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# Wrap the transformers pipeline in LangChain's own HuggingFacePipeline class.
# A hand-written BaseLanguageModel subclass whose methods are empty placeholders
# returns None from every call and cannot drive a chain.
llm = HuggingFacePipeline(pipeline=llm_pipeline)

# Now `llm` can be used with LangChain components such as RetrievalQA
# (assuming the retriever from the earlier cell is defined):
try:
    QnA = RetrievalQA.from_chain_type(llm=llm, chain_type='stuff', retriever=retriever, verbose=False)
    print("QnA chain defined successfully.")
except Exception as e:
    print(f"An error occurred: {e}")


In [None]:
# Defining function to fetch documents according to a query
def get_answers(QnA, query):
    answer = QnA.run(query)
    print(f"\033[1mQuery:\033[0m {query}\n")
    print(f"\033[1mAnswer:\033[0m ", answer)

In [None]:
query = """What is the scope of the claims, and are they sufficiently broad to deter competitors while being specific enough to be defensible?"""
get_answers(QnA, query)