## You.com x LangChain

### YouRetriever
Today we are excited to announce the release of `YouRetriever`, the easiest way to get access to the You.com Search API. The You.com Search API is designed by LLMs for LLMs with an emphasis on Retrieval Augmented Generation (RAG) applications. We accomplish this by evaluating our API on a number of datasets to benchmark the performance of LLMs in the RAG-QA setting. In this blog post we will compare the You.com Search API with the Google Search API and give the reader the tools to evaluate LLMs in the RAG-QA setting.

We will evaluate retriever performance on [Hotpot QA](https://github.com/hotpotqa/hotpot) using the `RetrievalQA` chain. Hotpot is a dataset in which each example consists of a question, an answer, and a context. The context can vary in relevance to the question and answer, and there is a special "distractor" setting in which the LLM must not be distracted by misleading text within the context. In this experiment we remove the context from the dataset and replace it with the text snippets that come back from the search APIs. In this sense the entire internet is the distractor text, since the APIs are responsible for finding the answer to the question across the entire internet, not just within the list of snippets supplied in the dataset. We call this the "web distractor" setting for evaluating how well search APIs perform when used in conjunction with an LLM.

In [None]:
pip install -q langchain==0.0.314 datasets google-api-python-client

Note: you may need to restart the kernel to use updated packages.


### Initialization
All you need to initialize a `YouRetriever` is to set the environment variable `YDC_API_KEY`. We are currently in alpha and keys are available by invitation only. If you are interested in being an early access partner, please email api@you.com with your use case, background, and expected daily load. You will also need an OpenAI key along with credentials for the Google Search API to run the rest of the notebook.

In [None]:
import os


os.environ["YDC_API_KEY"] = ""
os.environ["OPENAI_API_KEY"] = ""
os.environ["GOOGLE_CSE_ID"] = ""
os.environ["GOOGLE_API_KEY"] = ""
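
Since the rest of the notebook fails in different places depending on which key is missing, it can help to check them all up front. The following is a minimal sketch, not part of the original notebook: `missing_keys` is a hypothetical helper name, and the list of variables is simply the four set in the cell above.

```python
import os

# The four environment variables set in the cell above.
REQUIRED_KEYS = ["YDC_API_KEY", "OPENAI_API_KEY", "GOOGLE_CSE_ID", "GOOGLE_API_KEY"]


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# A plain dict stands in for os.environ in this example.
example_env = {"YDC_API_KEY": "abc", "OPENAI_API_KEY": ""}
print(missing_keys(example_env))
# prints ['OPENAI_API_KEY', 'GOOGLE_CSE_ID', 'GOOGLE_API_KEY']
```

Calling `missing_keys()` with no arguments checks the real environment. With the empty placeholder strings in the cell above, it would report all four names.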

In [None]:
from langchain.retrievers.you import YouRetriever


yr = YouRetriever()

### Retrieval
The first thing you will notice about our text snippets is that we return larger snippets when we can. We will soon offer an option to specify how much text you want returned, from a single snippet up to the entire page. Let's ask it about the greatest pinball player ever, Keith Elwin.

In [None]:
yr.get_relevant_documents("keith elwin pinball designer")

[Document(page_content='You can help Pinball Wiki by expanding it. ... Keith Elwin is a game designer and professionally-ranked pinball player. With multiple PAPA victories and being ranked first several times in the IFPA World Pinball Rankings, he is often considered to be one of the best players in the world.'),
 Document(page_content='Keith Elwin is a game designer and professionally-ranked pinball player. With multiple PAPA victories and being ranked first several times in the IFPA World Pinball Rankings, he is often considered to be one of the best players in the world. In 2017, Elwin was hired by Stern Pinball as a game designer.'),
 Document(page_content="In 2017, Elwin was hired by Stern Pinball as a game designer. His first game, Iron Maiden: Legacy of the Beast, was released a year later with a design based on Elwin's homebrew game Archer."),
 Document(page_content='This page was printed from https://pinside.com/pinball/machine/keith-elwin and we tried optimising it for print

### Results
You can see that even with the default settings we return 27 text snippets about the great Keith, and some of the documents contain a decent amount of text. This makes our search API especially powerful for LLMs operating in the RAG-QA setting. But don't take my word for it. Let's try it out on Hotpot QA.

### Hotpot QA
Let's take a look at an example from Hotpot. We load it from the [Hugging Face dataset](https://huggingface.co/datasets/hotpot_qa) using the `datasets` library. We use the fullwiki setting here instead of the distractor setting but, as we said before, we'll be using our own context powered by the search APIs instead of what comes off the shelf.

In [None]:
from datasets import load_dataset


hotpot_ds = load_dataset("hotpot_qa", "fullwiki")["train"]

In [None]:
hotpot_ds[0]

{'id': '5a7a06935542990198eaf050',
 'question': "Which magazine was started first Arthur's Magazine or First for Women?",
 'answer': "Arthur's Magazine",
 'type': 'comparison',
 'level': 'medium',
 'supporting_facts': {'title': ["Arthur's Magazine", 'First for Women'],
  'sent_id': [0, 0]},
 'context': {'title': ['Radio City (Indian radio station)',
   'History of Albanian football',
   'Echosmith',
   "Women's colleges in the Southern United States",
   'First Arthur County Courthouse and Jail',
   "Arthur's Magazine",
   '2014–15 Ukrainian Hockey Championship',
   'First for Women',
   'Freeway Complex Fire',
   'William Rast'],
  'sentences': [["Radio City is India's first private FM radio station and was started on 3 July 2001.",
    ' It broadcasts on 91.1 (earlier 91.0 in most cities) megahertz from Mumbai (where it was started in 2004), Bengaluru (started first in 2001), Lucknow and New Delhi (since 2003).',
    ' It plays Hindi, English and regional songs.',
    ' It was launch

In [None]:
hotpot_ds[0]["question"]

"Which magazine was started first Arthur's Magazine or First for Women?"

The first question asks about two magazines, Arthur's Magazine and First for Women, and specifically which was started first. I have never heard of either of these, and indeed Hotpot is chock full of extremely niche questions which require knowledge across a large swath of time. Let's look at the context.

In [None]:
hotpot_ds[0]["context"]["sentences"]

[["Radio City is India's first private FM radio station and was started on 3 July 2001.",
  ' It broadcasts on 91.1 (earlier 91.0 in most cities) megahertz from Mumbai (where it was started in 2004), Bengaluru (started first in 2001), Lucknow and New Delhi (since 2003).',
  ' It plays Hindi, English and regional songs.',
  ' It was launched in Hyderabad in March 2006, in Chennai on 7 July 2006 and in Visakhapatnam October 2007.',
  ' Radio City recently forayed into New Media in May 2008 with the launch of a music portal - PlanetRadiocity.com that offers music related news, videos, songs, and other music-related features.',
  ' The Radio station currently plays a mix of Hindi and Regional music.',
  ' Abraham Thomas is the CEO of the company.'],
 ['Football in Albania existed before the Albanian Football Federation (FSHF) was created.',
  " This was evidenced by the team's registration at the Balkan Cup tournament during 1929-1931, which started in 1929 (although Albania eventually had

From the sentences we can see "Arthur's Magazine (1844–1846)" which we can assume means Arthur's ran from 1844-1846 while elsewhere it is mentioned "First for Women is a woman's magazine published by Bauer Media Group in the USA. The magazine was started in 1989." which means Arthur's came out first. Sure enough that is the answer.
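
Rather than scanning the context by eye, the relevant sentences can also be looked up through the `supporting_facts` field shown in the record above. The following is a minimal sketch under stated assumptions: `supporting_sentences` is a hypothetical helper, and `toy` is an abbreviated stand-in with the same shape as `hotpot_ds[0]`, with sentence text shortened to the fragments quoted above. The guard for a missing title is a precaution in case a supporting article is absent from the context.

```python
def supporting_sentences(example):
    """Return (title, sentence) pairs referenced by `supporting_facts`."""
    titles = example["context"]["title"]
    sentences = example["context"]["sentences"]
    facts = example["supporting_facts"]
    pairs = []
    for title, sent_id in zip(facts["title"], facts["sent_id"]):
        if title not in titles:
            continue  # supporting article is not part of this context
        article = sentences[titles.index(title)]
        if sent_id < len(article):
            pairs.append((title, article[sent_id].strip()))
    return pairs


# Abbreviated toy record in the same shape as hotpot_ds[0] (not the full data).
toy = {
    "supporting_facts": {
        "title": ["Arthur's Magazine", "First for Women"],
        "sent_id": [0, 0],
    },
    "context": {
        "title": ["Echosmith", "Arthur's Magazine", "First for Women"],
        "sentences": [
            ["(distractor sentence)"],
            ["Arthur's Magazine (1844–1846) ..."],
            [
                "First for Women is a woman's magazine published by Bauer Media Group in the USA.",
                " The magazine was started in 1989.",
            ],
        ],
    },
}

for title, sentence in supporting_sentences(toy):
    print(f"{title}: {sentence}")
```

The dataset's `answer` field holds the expected answer for the same record.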

In [None]:
hotpot_ds[0]["answer"]

"Arthur's Magazine"

Now let's use our `YouRetriever` in a `RetrievalQA` chain to see if we can answer this question using the You.com Search API and GPT 3.5 Turbo.

In [None]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI


model = "gpt-3.5-turbo-16k"
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(model=model), chain_type="stuff", retriever=yr)

In [None]:
qa.run(hotpot_ds[0]["question"])

"Arthur's Magazine was started first. It was first published in 1852. First for Women, on the other hand, was launched much later."

We got it! Hurray! A quick note here is that we are using the 16k context window with GPT 3.5 Turbo because our API returns so much text by default it can overwhelm models with smaller context windows. Let's see what happens if we try the same thing with normal GPT 3.5 Turbo.

In [None]:
from openai.error import InvalidRequestError


try:
    small_context_model = "gpt-3.5-turbo"
    small_context_qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(model=small_context_model), chain_type="stuff", retriever=yr)
    small_context_qa.run(hotpot_ds[0]["question"])
except InvalidRequestError:
    print("Boom! Too much text!")

Boom! Too much text!


There are a few options you can employ here if you want to use a model with a smaller context window. The first is to cap the number of documents you feed from our API to the LLM. The other is to use the [map_reduce chain](https://python.langchain.com/docs/modules/chains/document/map_reduce) type, which takes larger chunks of text and breaks them down to make them digestible by the LLM. This means making multiple calls to the LLM, and therefore a slower run time, but you'll be able to process all the data returned from the `YouRetriever`.

In [None]:
mr_small_context_qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(model=small_context_model), chain_type="map_reduce", retriever=yr)
mr_small_context_qa.run(hotpot_ds[0]["question"])

The chain is able to run, but it seems we can no longer answer the question without the full payload of documents in context at one time. This is what makes large context window models so exciting: LLMs are becoming extremely good at using a large amount of text to answer questions.
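
The other option mentioned above, capping how many documents reach the LLM, can be sketched as follows. This is an illustration under stated assumptions rather than a LangChain feature: `Snippet` is a hypothetical stand-in for a LangChain `Document`, `cap_documents` is a hypothetical helper, the retriever is assumed to return results in relevance order, and the character budget is only a rough proxy for tokens.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snippet:
    """Hypothetical stand-in for a LangChain Document (only page_content is used)."""
    page_content: str


def cap_documents(docs: List[Snippet], max_docs: int, max_chars: int) -> List[Snippet]:
    """Keep leading documents until the count or the character budget is reached."""
    kept, used = [], 0
    for doc in docs[:max_docs]:
        if used + len(doc.page_content) > max_chars:
            break  # adding this snippet would exceed the character budget
        kept.append(doc)
        used += len(doc.page_content)
    return kept


docs = [Snippet("a" * 400), Snippet("b" * 400), Snippet("c" * 400)]
print(len(cap_documents(docs, max_docs=10, max_chars=1000)))  # prints 2
print(len(cap_documents(docs, max_docs=1, max_chars=1000)))   # prints 1
```

In the notebook, the same function could be applied to the list returned by `yr.get_relevant_documents(...)`, since it only reads `page_content`. Wiring the capped list into `RetrievalQA` would need a small `BaseRetriever` subclass and is left out here.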

### Head-to-Head Evaluation
Let's take a sample from Hotpot QA and compare our search API with one of the current alternatives in LangChain, the `GoogleSearchAPIWrapper`. This isn't a retriever in LangChain, but it only takes a small amount of code to build an analogous retriever. All we need to do is implement the `_get_relevant_documents` method of the abstract base class `BaseRetriever`. Note that you could easily repeat this experiment and swap in another web search API such as Bing. First, let's create a small utility for the existing wrapper.

In [None]:
from langchain.utilities import GoogleSearchAPIWrapper


search = GoogleSearchAPIWrapper()

def top10_results(query):
    return search.results(query, 10)

Now implement the `GoogleRetriever`.

In [None]:
from typing import Any, List

from langchain.callbacks.manager import (
    AsyncCallbackManagerForRetrieverRun,
    CallbackManagerForRetrieverRun,
)
from langchain.schema.retriever import BaseRetriever, Document


class GoogleRetriever(BaseRetriever):
    # No custom __init__ is needed; BaseRetriever handles construction.

    def _get_relevant_documents(
            self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        return [Document(page_content=result.get("snippet", "")) for result in top10_results(query)]

    async def _aget_relevant_documents(
            self,
            query: str,
            *,
            run_manager: AsyncCallbackManagerForRetrieverRun,
            **kwargs: Any,
    ) -> List[Document]:
        raise NotImplementedError()

First, let's see the results of our new retriever on our test query.

In [None]:
gr = GoogleRetriever()
gr.get_relevant_documents("keith elwin pinball designer")

[Document(page_content='Keith Elwin was a member of the design team for the following pinball machines (ordered by date in descending order). Wishlist Collection Find on market\xa0...'),
 Document(page_content='Keith Elwin is a professional pinball player turned designer from California. He started as an operator and technician and made a name for himself as an\xa0...'),
 Document(page_content="“What is Keith Elwin's Top Pinball machines?” · Iron Maiden 16 votes. 20% · Jurassic Park 18 votes. 22% · Avengers infinity Quest 5 votes. 6% · Godzilla 37 votes."),
 Document(page_content="Aug 10, 2021 ... As I understand it Keith Elwin, Stern's hottest designer was brought on in the same way. ... Steve Ritchie got a job designing pinball by showing\xa0..."),
 Document(page_content='Keith Elwin is a game designer and professionally-ranked pinball player. With multiple PAPA victories and being ranked first several times in the IFPA World\xa0...'),
 Document(page_content='Apr 11, 2023 ... Thank y

As you can see, Google gives us much less information to feed into our LLM. While in both cases we requested 10 web results, the You.com Search API attempts to return multiple text snippets per web result. To demonstrate further, we now get predictions from the exact same LLM, so that as far as possible the only thing the experiment varies is the search API.

In [None]:
google_qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model=model), chain_type="stuff", retriever=gr
)
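
To put a number on the "much less information" observation above, the snippet lists from the two retrievers can be summarized by count and length. A minimal sketch follows; `snippet_stats` is a hypothetical helper, and the strings below are illustrative placeholders, not real API output.

```python
from statistics import mean


def snippet_stats(texts):
    """Summarize snippet strings: count, total characters, mean characters."""
    lengths = [len(t) for t in texts]
    return {
        "count": len(lengths),
        "total_chars": sum(lengths),
        "mean_chars": mean(lengths) if lengths else 0.0,
    }


# Illustrative placeholder strings only.
short_snippets = ["x" * 150, "y" * 150]
long_snippets = ["x" * 300, "y" * 500, "z" * 400]

print(snippet_stats(short_snippets))
print(snippet_stats(long_snippets))
```

In the notebook, the inputs would be `[d.page_content for d in yr.get_relevant_documents(query)]` and the same expression with `gr`, for whatever `query` is being compared.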

Let's get a sample from our dataset.

In [None]:
SAMPLE_SIZE = 20
hotpot_pds = hotpot_ds.to_pandas()
hotpot_pds_sample = hotpot_pds.sample(SAMPLE_SIZE, random_state=123).reset_index()

In [None]:
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm


# This function is a simple way to parallelize calls to OpenAI in our pandas apply
def parallel_progress_apply(column, callback, num_workers):
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        return list(tqdm(executor.map(callback, column), total=len(column)))


# This function is a utility for our parallel pandas apply
def get_run_chain_function(chain):
    def run_chain(example):
        try:
            return chain(example)["result"]
        except Exception:
            # Treat any failed call as an empty prediction
            return ""
    return run_chain
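
One property the helper above relies on is that `executor.map` yields results in the order of its inputs, not in the order the calls finish. That is what makes it safe to assign the returned list straight back to a DataFrame column. A small self-contained check, with `slow_echo` as a hypothetical stand-in for a chain call:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_echo(x):
    # Earlier items sleep longer, so they finish last.
    time.sleep(0.05 * (3 - x))
    return x


with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(slow_echo, [0, 1, 2]))

print(results)  # prints [0, 1, 2]
```

Note also that because `run_chain` returns an empty string when a call fails, a failed call is scored as an F1 of 0 later on rather than being dropped from the sample.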

Now we get predictions from the LLM using each search API's results. It's important to remember that at this point we have done everything we can to ensure the only thing we're testing is the quality of search results for use by an LLM to answer these questions.

In [None]:
hotpot_pds_sample["ydc_prediction"] = parallel_progress_apply(
    hotpot_pds_sample["question"], lambda x: get_run_chain_function(qa)(x), num_workers=4
)
hotpot_pds_sample["google_prediction"] = hotpot_pds_sample["question"].apply(get_run_chain_function(google_qa))

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:18<00:00,  1.06it/s]


### Calculating Scores
We use the F1 score function from the hotpot repository to stay as close as possible to the evaluation setting presented in the paper.

In [None]:
import re
import string
from collections import Counter


# This is all taken from hotpot_qa source code with minor modifications to only return the f1 instead of the (P,R,F1) tuple
# https://github.com/hotpotqa/hotpot/blob/master/hotpot_evaluate_v1.py#L26
def calculate_f1_score(prediction, ground_truth):
    normalized_prediction = normalize_answer(prediction)
    normalized_ground_truth = normalize_answer(ground_truth)

    ZERO_METRIC = 0

    if (
        normalized_prediction in ["yes", "no", "noanswer"]
        and normalized_prediction != normalized_ground_truth
    ):
        return ZERO_METRIC
    if (
        normalized_ground_truth in ["yes", "no", "noanswer"]
        and normalized_prediction != normalized_ground_truth
    ):
        return ZERO_METRIC

    prediction_tokens = normalized_prediction.split()
    ground_truth_tokens = normalized_ground_truth.split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


def normalize_answer(s):
    def remove_articles(text):
        return re.sub(r"\b(a|an|the)\b", " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def exact_match_score(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def filter_wiki_citation(snip):
    return not snip.startswith("- ^")
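
To make the metric concrete, here is a compact worked example of the same token-level F1. It is a simplified sketch, not a replacement for the functions above: `normalize` and `token_f1` are hypothetical names, and the yes/no/noanswer special case is deliberately omitted.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, ground_truth):
    """Token-overlap F1 between a prediction and a ground-truth answer."""
    pred = normalize(prediction).split()
    gold = normalize(ground_truth).split()
    num_same = sum((Counter(pred) & Counter(gold)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred)
    recall = num_same / len(gold)
    return 2 * precision * recall / (precision + recall)


# Exact answer: precision = recall = 1.
print(token_f1("Arthur's Magazine", "Arthur's Magazine"))  # prints 1.0

# Correct but wordy answer: 2 of 5 predicted tokens match, so
# precision = 0.4, recall = 1.0, F1 = 0.8 / 1.4.
print(round(token_f1("Arthur's Magazine was started first.", "Arthur's Magazine"), 3))  # prints 0.571
```

The second case shows that a correct answer phrased as a full sentence loses precision, which is worth keeping in mind when reading the absolute F1 values for chat-model output.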

In [None]:
hotpot_pds_sample["ydc_f1"] = parallel_progress_apply(
    list(hotpot_pds_sample.iterrows()),
    lambda x: calculate_f1_score(x[1]["ydc_prediction"], x[1]["answer"]),
    num_workers=8,
)
hotpot_pds_sample["google_f1"] = parallel_progress_apply(
    list(hotpot_pds_sample.iterrows()),
    lambda x: calculate_f1_score(x[1]["google_prediction"], x[1]["answer"]),
    num_workers=8,
)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 74631.74it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 139810.13it/s]


In [None]:
print("You.com F1")
print(hotpot_pds_sample["ydc_f1"].mean())
print("Google F1")
print(hotpot_pds_sample["google_f1"].mean())

You.com F1
0.10276981188745896
Google F1
0.05933277249066722


### In Conclusion
As you can see, the You.com Search API outperforms Google on this small subset of data, with a mean F1 of roughly 0.103 versus 0.059 over 20 sampled questions. Please stay tuned as You.com will be releasing a much larger search study in the weeks to come. If you would like to be an early access partner of ours, please email api@you.com with your background, use case, and expected daily call volume.