### Cross encoder reranking 
Cross encoder reranking is a retreival optimization technique that is provided by chroma. In this section we will be diving into the technique in detail

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
from chromadb.api.client import Client
from pypdf import PdfReader

In [4]:
doc = PdfReader("MIV2 - LLM paper.pdf")
docs = [page.extract_text().strip() for page in doc.pages]

In [5]:
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["", " ", ".", "\n", "\n\n"], 
    chunk_size=1000,
    chunk_overlap=200
)

In [6]:
text = recursive_splitter.split_text("\n\n".join(docs))

In [7]:
sentence_splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=0, 
    tokens_per_chunk=256
)

  from .autonotebook import tqdm as notebook_tqdm
  return self.fget.__get__(instance, owner)()


In [9]:
split_text = []
for t in text:
    split_text += sentence_splitter.split_text(t)

In [10]:
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

In [12]:
embedding_function = SentenceTransformerEmbeddingFunction()

In [13]:
client = Client()
collection = client.create_collection(name="new_test", embedding_function=embedding_function)

In [14]:
ids = [ str(i) for i in range(len(split_text))]

In [15]:
collection.add(ids=ids, documents=split_text)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [16]:
collection.count()

83

In [31]:
relevant_docs = collection.query(query_texts="what is the general idea of the research paper", n_results=5)["documents"][0]

In [32]:
relevant_docs

['pp. 10608 - 10615 ). ieee. [ 31 ] bergerman, m., maeta, s. m., zhang, j., freitas, g. m., hamner, b., singh, s. and kantor,',
 'n behavior. in proceedings of the 36th annual acm symposium on user interface software and technology ( pp. 1 - 22 ). appendix a section title of first appendix an appendix contains supplementary information that is not an essential part of the text itself but which may be helpful in providing a more comprehensive understanding of the research problem or it is information that is too cumbersome to be included in the body of the paper. 17',
 'socratic models : composing zero - shot multimodal reasoning with language. arxiv preprint arxiv : 2204. 00598. [ 45 ] zhao, z., lee, w. s. and hsu, d., 2023. large language models as commonsense knowledge for large - scale task planning. arxiv preprint arxiv : 2305. 14078. [ 46 ] yao, s., yu, d., zhao, j., shafran, i., griffiths, t. l., cao, y. and narasimhan, k., 2023. tree of thoughts : deliberate probl',
 'interface 

In [33]:
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [34]:
query = "what is the general idea of the research paper"

In [35]:
pairs = [[query, doc] for doc in relevant_docs]
scores = cross_encoder.predict(pairs)
print("Scores:")
for score in scores:
    print(score)

Scores:
-10.947703
-8.723831
-11.054613
-11.399334
-9.394058


In [36]:
import numpy as np

In [37]:
print("New Ordering:")
for o in np.argsort(scores)[::-1]:
    print(o+1)

New Ordering:
2
5
1
3
4
