<a target="_blank" href="https://colab.research.google.com/github/cohere-ai/notebooks/blob/main/notebooks/llmu/End_To_End_Wikipedia_Search.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Comparing different search methods for Wikipedia

In this notebook we study several different ways to query a Wikipedia database, including:
- Keyword search 
- Dense retrieval
- Reranking

Furthermore, we combine search with Cohere's Chat endpoint to generate accurate, full-sentence answers to a query.

This notebook accompanies the [Semantic Search](https://docs.cohere.com/docs/intro-semantic-search) section of LLM University.

## Setup

We'll start by installing the tools we'll need and then importing them.

In [1]:
! pip install cohere weaviate-client==4.5.4 -q

In [2]:
import weaviate
import cohere

Fill in your Cohere API key in the next cell. To do this, begin by [signing up to Cohere](https://os.cohere.ai/) (for free!) if you haven't yet. Then get your API key [here](https://dashboard.cohere.com/api-keys).

In [3]:
# Fill in your API key here. Remember not to share it publicly.
co = cohere.ClientV2("COHERE_API_KEY") # Get your free API key: https://dashboard.cohere.com/api-keys
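
One way to avoid pasting the key into the notebook is to read it from an environment variable. The following is a minimal sketch of that approach; the helper name `load_api_key` and the variable name `COHERE_API_KEY` are assumptions made for illustration, not part of the original notebook.

```python
import os

def load_api_key(env_var="COHERE_API_KEY"):
    """Return the API key stored in the given environment variable.

    `env_var` is a hypothetical variable name; use whatever name you
    exported the key under.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key

# Usage (commented out so that this cell runs even when no key is set):
# co = cohere.ClientV2(load_api_key())
```

The same value could then be reused for the `X-Cohere-Api-Key` header when connecting to Weaviate.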

In [5]:
# Connect to the Weaviate demo database containing 10M Wikipedia vectors
auth_config = weaviate.auth.AuthApiKey(api_key="76320a90-53d8-42bc-b41d-678647c6672e")
client = weaviate.Client(
    url="https://cohere-demo.weaviate.network/",
    auth_client_secret=auth_config,
    additional_headers={
        "X-Cohere-Api-Key": "COHERE_API_KEY",
    }
)

client.is_ready() # check if True

True

# Keyword Search

This section accompanies the [Keyword Search](https://docs.cohere.com/docs/keyword-search) chapter of LLM University.

We'll search for two queries using keyword search.
- Simple query: "Who discovered penicillin?" (Answer: Alexander Fleming)
- Hard query: "Who was the first person to win two Nobel prizes?" (Answer: Marie Curie)

You will notice that keyword search performs very well with the simple query, and not so well with the hard one.
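
To see why word matching can go wrong on the hard query, here is a small, self-contained sketch of BM25 scoring over three toy sentences. It is an illustration only: the toy documents, the tokenizer, and the `k1` and `b` values (common textbook defaults) are assumptions, and Weaviate's own BM25 implementation and settings are not shown in this notebook.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document (higher means a better keyword match)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Number of documents that contain each term
    doc_freq = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))
    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq[term]
            if tf == 0:
                continue  # only exact word matches contribute
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

# Toy documents (assumed for illustration, loosely based on the results in this notebook)
toy_docs = [
    "Alexander Fleming discovered penicillin in 1928.",
    "The Homestake experiment made the first measurement of solar neutrinos.",
    "Marie Curie received the Physics Prize in 1903 and the Chemistry Prize in 1911.",
]

for q in ("Who discovered penicillin?",
          "Who was the first person to win two Nobel prizes?"):
    scores = bm25_scores(q, toy_docs)
    best = max(range(len(toy_docs)), key=scores.__getitem__)
    print(q, "->", toy_docs[best])
```

In this toy setup the simple query shares the words "discovered" and "penicillin" with the right sentence, while the hard query shares only "the" and "first" with the neutrino sentence and only "the" with the Marie Curie sentence ("prize" does not match "prizes"), so the neutrino sentence scores higher.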

In [6]:
def keyword_search(query, results_lang='en', num_results=10):
    """Run a BM25 keyword search over the articles, filtered by language."""
    properties = ["text", "title", "url", "views", "lang", "_additional {distance}"]

    where_filter = {
        "path": ["lang"],
        "operator": "Equal",
        "valueString": results_lang
    }

    response = (
        client.query.get("Articles", properties)
        .with_bm25(
            query=query
        )
        .with_where(where_filter)
        .with_limit(num_results)
        .do()
    )
    result = response['data']['Get']['Articles']
    return result

In [7]:
def print_result(result):
    """ Print results with colorful formatting """
    for item in result:
        print(f"\033[95m{item['title']} ({item['views']}) \033[0m")
        print(f"\033[4m{item['url']}\033[0m")
        print(item['text'])
        print()

In [8]:
simple_query = "Who discovered penicillin?"
keyword_search_results_simple = keyword_search(simple_query)
print_result(keyword_search_results_simple)

Penicillin (2000)
https://en.wikipedia.org/wiki?curid=23312
When Alexander Fleming discovered the crude penicillin in 1928, one important observation he made was that many bacteria were not affected by penicillin. This phenomenon was realised by Ernst Chain and Edward Abraham while trying to identify the exact of penicillin. In 1940, they discovered that unsusceptible bacteria like "Escherichia coli" produced specific enzymes that can break down penicillin molecules, thus making them resistant to the antibiotic. They named the enzyme penicillinase. Penicillinase is now classified as member of enzymes called β-lactamases. These β-lactamases are naturally present in many other bacteria, and many bacteria produce them upon constant exposure to antibiotics. In most bacteria, resistance can be through three different mechanisms: reduced permeability in bacteria, reduced binding affinity of the penicillin-binding proteins (PBPs) or destruction of the antibiotic through the 

In [9]:
hard_query = "Who was the first person to win two nobel prizes?"
keyword_search_results_hard = keyword_search(hard_query)
print_result(keyword_search_results_hard)

Neutrino (2000)
https://en.wikipedia.org/wiki?curid=21485
In the 1960s, the now-famous Homestake experiment made the first measurement of the flux of electron neutrinos arriving from the core of the Sun and found a value that was between one third and one half the number predicted by the Standard Solar Model. This discrepancy, which became known as the solar neutrino problem, remained unresolved for some thirty years, while possible problems with both the experiment and the solar model were investigated, but none could be found. Eventually, it was realized that both were actually correct and that the discrepancy between them was due to neutrinos being more complex than was previously assumed. It was postulated that the three neutrinos had nonzero and slightly different masses, and could therefore oscillate into undetectable flavors on their flight to the Earth. This hypothesis was investigated by a new series of experiments, thereby opening a new major field of resear

# Dense Retrieval

This section accompanies the [Dense Retrieval](https://docs.cohere.com/docs/dense-retrieval) chapter of LLM University.

Next, we use dense retrieval to search for answers to the same two queries. This time, the results are good for both.
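
Dense retrieval compares embedding vectors rather than words. As a rough sketch of the ranking step, the code below sorts a few documents by cosine similarity to a query vector. The three-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions and come from an embedding model), and the distance metric used by the demo index is not stated in this notebook, so cosine similarity is an assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (title, similarity) pairs sorted from most to least similar."""
    scored = [(title, cosine_similarity(query_vec, vec))
              for title, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings, invented for this example
toy_doc_vecs = {
    "Nobel Prize": [0.9, 0.1, 0.2],
    "Neutrino": [0.1, 0.9, 0.3],
    "Penicillin": [0.2, 0.2, 0.9],
}
toy_query_vec = [0.8, 0.2, 0.1]  # stands in for the embedded hard query

for title, sim in rank_by_similarity(toy_query_vec, toy_doc_vecs):
    print(f"{title}: {sim:.3f}")
```

Because the comparison happens in vector space, a document can rank first without sharing any exact words with the query.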

In [10]:
# This function performs dense retrieval
def dense_retrieval(query, results_lang='en', num_results=10):
    """
    Query the vector database and return the top results.

    Parameters
    ----------
        query: str
            The search query

        results_lang: str (optional)
            Retrieve results only in the specified language.
            The demo dataset has these languages:
            en, de, fr, es, it, ja, ar, zh, ko, hi

        num_results: int (optional)
            The maximum number of results to return

    """

    nearText = {"concepts": [query]}
    properties = ["text", "title", "url", "views", "lang", "_additional {distance}"]
    # To filter by language
    where_filter = {
        "path": ["lang"],
        "operator": "Equal",
        "valueString": results_lang
    }
    response = (
        client.query
        .get("Articles", properties)
        .with_near_text(nearText)
        .with_where(where_filter)
        .with_limit(num_results)
        .do()
    )

    result = response['data']['Get']['Articles']

    return result

In [11]:
simple_query = "Who discovered penicillin?"

dense_retrieval_results_simple = dense_retrieval(simple_query)
print_result(dense_retrieval_results_simple)

Alexander Fleming (2000)
https://en.wikipedia.org/wiki?curid=1937
Sir Alexander Fleming (6 August 1881 – 11 March 1955) was a Scottish physician and microbiologist, best known for discovering the world's first broadly effective antibiotic substance, which he named penicillin. His discovery in 1928 of what was later named benzylpenicillin (or penicillin G) from the mould "Penicillium rubens" is described as the "single greatest victory ever achieved over disease." For this discovery, he shared the Nobel Prize in Physiology or Medicine in 1945 with Howard Florey and Ernst Boris Chain.

Penicillin (2000)
https://en.wikipedia.org/wiki?curid=23312
Penicillin was discovered in 1928 by Scottish scientist Alexander Fleming as a crude extract of "P. rubens". Fleming's student Cecil George Paine was the first to successfully use penicillin to treat eye infection (Ophthalmia neonatorum) in 1930. The purified compound (penicillin F) was isolated in 1940 by a res

In [12]:
hard_query = "Who was the first person to win two Nobel prizes?"
dense_retrieval_results_hard = dense_retrieval(hard_query)
print_result(dense_retrieval_results_hard)

Nobel Prize (2000)
https://en.wikipedia.org/wiki?curid=21201
Five people have received two Nobel Prizes. Marie Curie received the Physics Prize in 1903 for her work on radioactivity and the Chemistry Prize in 1911 for the isolation of pure radium, making her the only person to be awarded a Nobel Prize in two different sciences. Linus Pauling was awarded the 1954 Chemistry Prize for his research into the chemical bond and its application to the structure of complex substances. Pauling was also awarded the Peace Prize in 1962 for his activism against nuclear weapons, making him the only laureate of two unshared prizes. John Bardeen received the Physics Prize twice: in 1956 for the invention of the transistor and in 1972 for the theory of superconductivity. Frederick Sanger received the prize twice in Chemistry: in 1958 for determining the structure of the insulin molecule and in 1980 for inventing a method of determining base sequences in DNA. Karl Barry Sharpless was a

### Searching in other languages
Set the `results_lang` parameter to any of the languages available in the demo (en, de, fr, es, it, ja, ar, zh, ko, hi) to get results in that language. For example, here are the Arabic results for the hard query.
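
Since only these ten language codes are available, it can help to validate the code before building the filter. Here is a small sketch; the helper `build_lang_filter` and the constant `SUPPORTED_LANGS` are hypothetical names introduced for this example, and the returned dictionary mirrors the `where_filter` used in the search functions in this notebook.

```python
# Language codes listed for the demo dataset
SUPPORTED_LANGS = {"en", "de", "fr", "es", "it", "ja", "ar", "zh", "ko", "hi"}

def build_lang_filter(results_lang="en"):
    """Return a language filter dict, rejecting codes the demo does not list."""
    if results_lang not in SUPPORTED_LANGS:
        raise ValueError(
            f"Unsupported language {results_lang!r}; "
            f"choose one of {sorted(SUPPORTED_LANGS)}"
        )
    return {
        "path": ["lang"],
        "operator": "Equal",
        "valueString": results_lang,
    }

print(build_lang_filter("ar"))
```

Failing early with a clear message is easier to debug than an empty result list from a mistyped code.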

In [13]:
arabic_results = dense_retrieval(hard_query, results_lang='ar')
print_result(arabic_results)

جائزة نوبل (1000)
https://ar.wikipedia.org/wiki?curid=1979
وقد حصل أربعة أشخاص على اثنتين من جوائز نوبل. حيث حصلت ماري كوري على جائزة نوبل في الفيزياء في عام 1903 بالمشاركة مع زوجها بيير كوري لعملهما على النشاط الإشعاعي، وحصلت وحدها كذلك على جائزة نوبل في الكيمياء عام 1911 لعزل الراديوم النقي، مما يجعلها المرأة الوحيدة التي تفوز بجائزة نوبل مرتين، والشخص الوحيد الذي فاز بجائزة نوبل في مجالين مختلفين في مجالات العلوم. وفاز لينوس باولنغ بجائزة الكيمياء لعام 1954 لأبحاثه في الروابط الكيميائية وتطبيقها على هيكل من المواد المعقدة، كما فاز باولنغ على جائزة نوبل للسلام في عام 1962 لنشاطه ضد الأسلحة النووية، مما يجعل منه الفائز الوحيد في جائزتين دون مشاركة الجائزة مع أحد. وحصل جون باردين على جائزة نوبل في الفيزياء مرتين: الأولى في عام 1956 لاختراع الترانزستور، والثانية في عام 1972 لنظرية التوصيل. وتلقى فردريك سانغر الجائزة مرتين في الكيمياء: الأولى في عام 1958 لتحديد بنية جزيء الأنسولين، والثانية في عام 1980 لاختراعه طريقة لتحديد تسلسل قاعدة في الحمض النووي.

قائمة الحاص

The query itself can also be in a different language from the results. Here are the French results for a query in Spanish.

In [14]:
spanish_query = "Quien descubrio la penicilina?"
french_results = dense_retrieval(spanish_query, results_lang='fr')
print_result(french_results)

Pénicilline (1000)
https://fr.wikipedia.org/wiki?curid=92634
La pénicilline (pénicilline G) fut découverte le , concentrée et surtout nommée par le Britannique Alexander Fleming. Elle a été introduite pour des thérapies à partir de 1941.

Pénicilline (1000)
https://fr.wikipedia.org/wiki?curid=92634
La pénicilline a été redécouverte accidentellement le par Alexander Fleming. Le chercheur écossais travailla ensuite plusieurs années à essayer de purifier cet antibiotique.

Alexander Fleming (800)
https://fr.wikipedia.org/wiki?curid=27093
Huit ans plus tard, il découvrit la pénicilline par accident, lors de l'observation d'une moisissure qui tua les bactéries d'une de ses expériences, et surtout il comprit et fit comprendre son intérêt médical.

Alexander Fleming (800)
https://fr.wikipedia.org/wiki?curid=27093
Sur sa découverte, Fleming publia en 1929 dans le "" un article qui attira peu l'attention. Il continua ses re

# ReRank

This section accompanies the [Reranking](https://docs.cohere.com/docs/reranking-2) chapter of LLM University.

Rerank is a powerful method that can improve the results of an existing search system. In short, Rerank takes a query and a set of candidate responses (or documents) and surfaces the ones that are most relevant to the query. We'll use Rerank to improve the keyword search results for the hard query.
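
Conceptually, reranking is a second pass that scores each candidate against the query and keeps the best ones. The sketch below shows only that control flow. The scorer `overlap_score` is a hypothetical word-overlap stand-in chosen so the example runs offline; it is not how the Cohere Rerank model scores documents.

```python
import re

def overlap_score(query, doc):
    """Toy relevance score: fraction of query words that appear in the document."""
    query_words = set(re.findall(r"\w+", query.lower()))
    doc_words = set(re.findall(r"\w+", doc.lower()))
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0

def rerank_with_scorer(query, documents, score_fn, top_n=3):
    """Score every candidate, sort by score (descending), and keep the top_n.

    Returns (original_index, score) pairs so that callers can map each
    result back to its position in the first-stage result list.
    """
    scored = [(idx, score_fn(query, doc)) for idx, doc in enumerate(documents)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy first-stage candidates (assumed for illustration)
candidates = [
    "Neutrinos oscillate between flavors.",
    "Five people have received two Nobel Prizes.",
    "The Nobel Peace Prize is awarded in Oslo.",
]

for idx, score in rerank_with_scorer("two Nobel prizes", candidates, overlap_score, top_n=2):
    print(idx, f"{score:.2f}", candidates[idx])
```

Keeping the original index alongside the score matches how the Rerank results are printed later in this section, where each result reports a document index and a relevance score.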

In [15]:
def rerank_responses(query, responses, num_responses=3):
    """Rerank `responses` against `query` and return the top `num_responses` results."""
    reranked_responses = co.rerank(
        query = query,
        documents = responses,
        top_n = num_responses,
        model = 'rerank-english-v3.0',
        return_documents=True
    )
    return reranked_responses

In [16]:
hard_query = "Who was the first person to win two nobel prizes?"
keyword_searches_to_improve = keyword_search(hard_query, num_results = 100)

In [17]:
for r in keyword_searches_to_improve[:20]:
    print(r['title'], ':', r['text'])

Neutrino : In the 1960s, the now-famous Homestake experiment made the first measurement of the flux of electron neutrinos arriving from the core of the Sun and found a value that was between one third and one half the number predicted by the Standard Solar Model. This discrepancy, which became known as the solar neutrino problem, remained unresolved for some thirty years, while possible problems with both the experiment and the solar model were investigated, but none could be found. Eventually, it was realized that both were actually correct and that the discrepancy between them was due to neutrinos being more complex than was previously assumed. It was postulated that the three neutrinos had nonzero and slightly different masses, and could therefore oscillate into undetectable flavors on their flight to the Earth. This hypothesis was investigated by a new series of experiments, thereby opening a new major field of research that still continues. Eventual confirmation of the phenomenon 

In [18]:
reranked_keyword_responses = rerank_responses(hard_query, keyword_searches_to_improve, num_responses=3)

In [19]:
for idx, r in enumerate(reranked_keyword_responses.results):
    print(f"Document Rank: {idx + 1}, Document Index: {r.index}")
    print(f"Title: {r.document.title}")
    print(f"URL: {r.document.url}")
    print(f"Document: {r.document.text}")
    print(f"Relevance Score: {r.relevance_score:.2f}")
    print("\n")

Document Rank: 1, Document Index: 30
Title: Nobel Prize
URL: https://en.wikipedia.org/wiki?curid=21201
Document: Five people have received two Nobel Prizes. Marie Curie received the Physics Prize in 1903 for her work on radioactivity and the Chemistry Prize in 1911 for the isolation of pure radium, making her the only person to be awarded a Nobel Prize in two different sciences. Linus Pauling was awarded the 1954 Chemistry Prize for his research into the chemical bond and its application to the structure of complex substances. Pauling was also awarded the Peace Prize in 1962 for his activism against nuclear weapons, making him the only laureate of two unshared prizes. John Bardeen received the Physics Prize twice: in 1956 for the invention of the transistor and in 1972 for the theory of superconductivity. Frederick Sanger received the prize twice in Chemistry: in 1958 for determining the structure of the insulin molecule and in 1980 for inventing a method of determining base sequences 

# Generating responses

This section accompanies the [Generating Answers](https://docs.cohere.com/docs/generating-answers) chapter of LLM University.

Generative models are fluent, but when it comes to answering factual questions they are prone to hallucinations; in other words, they can confidently give a wrong answer. To reduce this risk, we first search for the documents that are relevant to the query (here with dense retrieval, though any search method works). We then pass those documents to the generative model and instruct it to answer the question using only the information they contain.

The query is "How many people have won more than one Nobel prize?". You will notice that, without search, the model's answers vary between runs and one of them states a wrong count (eight). When combined with search, every run gives the correct count (five).

In [20]:
query = "How many people have won more than one Nobel prize?"

In [23]:
prediction_without_search = [
    co.chat(
        messages=[{"role": "user", "content": query}],
        model="command-r-plus-08-2024",
        max_tokens=50,
    ) for _ in range(5)
]

In [25]:
for p in prediction_without_search:
    print(p.message.content[0].text)

As of my information cutoff in January 2024, five people have been awarded the Nobel Prize more than once.

Marie Curie is the first person to win a second Nobel Prize, receiving the Physics Prize in 1903
As of my information cutoff in January 2024, five people have been awarded the Nobel Prize more than once:

- **Marie Curie**: She was the first person to win two Nobel Prizes, and the only person to win a
As of my information cutoff date in January 2024, five people have been awarded the Nobel Prize more than once:

- **Marie Curie**: She was awarded the Nobel Prize in Physics in 1903 for her work
As of my information cutoff in January 2024, five people have been awarded the Nobel Prize more than once:

- **Marie Curie**: She was the first person to win two Nobel Prizes, and the only person to win the
As of my information cutoff date of January 2024, eight people have won more than one Nobel Prize. They are:

1. Marie Curie: She won the Nobel Prize in Physics in 1903 for her work on
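
A quick way to compare sampled answers is to tally the count each one states. The sketch below does this with a simple heuristic that picks the first spelled-out number in each answer; the helper names are hypothetical, the heuristic is an assumption that happens to fit answers phrased like the ones above, and the sample strings are shortened copies of three of those outputs.

```python
import re
from collections import Counter

NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

def first_stated_count(text):
    """Return the first spelled-out number (one to ten) in the text, or None."""
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in NUMBER_WORDS:
            return NUMBER_WORDS[word]
    return None

def tally_counts(answers):
    """Count how many answers state each number."""
    return Counter(first_stated_count(answer) for answer in answers)

sample_answers = [
    "As of my information cutoff in January 2024, five people have been awarded the Nobel Prize more than once.",
    "As of my information cutoff date in January 2024, five people have been awarded the Nobel Prize more than once:",
    "As of my information cutoff date of January 2024, eight people have won more than one Nobel Prize. They are:",
]

print(tally_counts(sample_answers))
```

A tally with more than one key signals that the answers disagree across runs.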


In [26]:
responses = dense_retrieval(query, num_results=20)
print_result(responses)

Nobel Peace Prize (2000)
https://en.wikipedia.org/wiki?curid=26230922
, the Peace Prize has been awarded to 110 individuals and 27 organizations. 18 women have won the Nobel Peace Prize, more than any other Nobel Prize. Only two recipients have won multiple Prizes: the International Committee of the Red Cross has won three times (1917, 1944, and 1963) and the Office of the United Nations High Commissioner for Refugees has won twice (1954 and 1981). Lê Đức Thọ is the only person who refused to accept the Nobel Peace Prize.

Nobel Prize (2000)
https://en.wikipedia.org/wiki?curid=21201
The strict rule against awarding a prize to more than three people is also controversial. When a prize is awarded to recognise an achievement by a team of more than three collaborators, one or more will miss out. For example, in 2002, the prize was awarded to Koichi Tanaka and John Fenn for the development of mass spectrometry in protein chemistry, an award that did not r

In [27]:
context = [r['text'] for r in responses]
context[:10]

[', the Peace Prize has been awarded to 110 individuals and 27 organizations. 18 women have won the Nobel Peace Prize, more than any other Nobel Prize. Only two recipients have won multiple Prizes: the International Committee of the Red Cross has won three times (1917, 1944, and 1963) and the Office of the United Nations High Commissioner for Refugees has won twice (1954 and 1981). Lê Đức Thọ is the only person who refused to accept the Nobel Peace Prize.',
 'The strict rule against awarding a prize to more than three people is also controversial. When a prize is awarded to recognise an achievement by a team of more than three collaborators, one or more will miss out. For example, in 2002, the prize was awarded to Koichi Tanaka and John Fenn for the development of mass spectrometry in protein chemistry, an award that did not recognise the achievements of Franz Hillenkamp and Michael Karas of the Institute for Physical and Theoretical Chemistry at the University of Frankfurt.',
 'Candid

In [28]:
prompt = f"""
Use the information provided below to answer the questions at the end. If the answer to the question is not contained in the provided information, say "The answer is not in the context".
---
Context information:
{context}
---
Question: {query}
"""
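
The cell above interpolates the Python list directly, so the model sees the passages as a list literal with brackets and quotes. One possible alternative, sketched here under the assumption that numbered plain-text passages are easier to read, is to build the prompt with a helper. The function name `build_grounded_prompt` and the numbering scheme are hypothetical, and the instruction text is adapted from the prompt above.

```python
def build_grounded_prompt(question, passages):
    """Build a prompt that asks the model to answer only from the given passages."""
    numbered = "\n".join(f"[{i}] {passage}" for i, passage in enumerate(passages, start=1))
    return (
        "Use the information provided below to answer the question at the end. "
        "If the answer to the question is not contained in the provided information, "
        'say "The answer is not in the context".\n'
        "---\n"
        "Context information:\n"
        f"{numbered}\n"
        "---\n"
        f"Question: {question}\n"
    )

# Toy passages (shortened from the search results shown in this notebook)
toy_passages = [
    "Five people have received two Nobel Prizes.",
    "The Peace Prize has been awarded to 110 individuals and 27 organizations.",
]

print(build_grounded_prompt("How many people have won more than one Nobel prize?", toy_passages))
```

With the notebook's variables, the call would look like `build_grounded_prompt(query, context)`.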

In [31]:
prediction_with_search = [
    co.chat(
        messages=[{"role": "user", "content": prompt}],
        model="command-r-plus-08-2024",
        max_tokens=50,
    ) for _ in range(5)
]

In [32]:
for p in prediction_with_search:
    print(p.message.content[0].text)

Five individuals have received more than one Nobel Prize. Marie Curie, Linus Pauling, John Bardeen, Frederick Sanger, and Karl Barry Sharpless have all been awarded two Nobel Prizes.
Five individuals have received more than one Nobel Prize. Marie Curie received the Physics Prize in 1903 and the Chemistry Prize in 1911. Linus Pauling was awarded the Chemistry Prize in 1954 and the Peace
Five individuals have received more than one Nobel Prize. Marie Curie, Linus Pauling, John Bardeen, Frederick Sanger, and Karl Barry Sharpless have all won two Nobel Prizes.
Five individuals have received more than one Nobel Prize. Marie Curie received the Physics Prize in 1903 and the Chemistry Prize in 1911. Linus Pauling was awarded the 1954 Chemistry Prize and the 1
Five people have received two Nobel Prizes. Marie Curie received the Physics Prize in 1903 for her work on radioactivity and the Chemistry Prize in 1911 for the isolation of pure radium, making her the only person to
