<a href="https://colab.research.google.com/github/anthropics/anthropic-cookbook/blob/main/misc/using_citations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Having Claude Cite its Sources

This guide teaches you how to have Claude cite sources when referencing documents given to it in context. Adding citations can make your application more reliable and transparent: when a user can click through to a piece of supporting documentation, it helps them develop trust in the generated output.

# 🛠️ Setup
First, let's install the required libraries:

In [None]:
# Install requirements
%pip install requests beautifulsoup4 anthropic lxml

## Download the Anthropic Documentation
For this example, you'll test Claude's ability to answer questions about its own documentation. Let's start by writing a function to download everything at [docs.anthropic.com](https://docs.anthropic.com).

First, we'll define a function to download the text content from a website's sitemap.

In [1]:
import requests
from bs4 import BeautifulSoup

def download_sitemap_text(sitemap_url):
    # Download the sitemap content
    sitemap_response = requests.get(sitemap_url)
    sitemap_content = sitemap_response.text

    # Parse the sitemap XML
    soup = BeautifulSoup(sitemap_content, 'lxml-xml')

    # Extract the URLs from the sitemap
    urls = [loc.text for loc in soup.find_all('loc')]

    # Keep track of unique titles and content
    unique_titles = set()
    unique_content = set()

    # Download the text content of each page
    page_data = []
    for url in urls:
        try:
            # Download the page content
            page_response = requests.get(url)
            page_content = page_response.text

            # Parse the HTML content
            soup = BeautifulSoup(page_content, 'html.parser')

            # Extract the title
            title = soup.title.text.strip() if soup.title else ''

            # Extract the text content while preserving newlines
            text_elements = soup.select('p, code')
            text = '\n'.join([elem.get_text(strip=False) for elem in text_elements])

            # Check if the title or content is unique
            if title in unique_titles or text in unique_content:
                # Skip duplicates
                continue
            unique_titles.add(title)
            unique_content.add(text)

            # Create a dictionary with the page data
            page_dict = {
                'title': title,
                'url': url,
                'content': text
            }

            page_data.append(page_dict)
        except requests.exceptions.RequestException as e:
            print(f"Error downloading {url}: {e}")
            continue

    return page_data
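
To see the sitemap-parsing step in isolation, here is a minimal offline sketch. It swaps in the standard library's `xml.etree.ElementTree` for BeautifulSoup so that it runs without extra dependencies. The `SAMPLE_SITEMAP` string, the `example.com` URLs, and the `extract_sitemap_urls` helper are illustrative assumptions, not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-page sitemap, used only for illustration
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/docs/vision</loc></url>
</urlset>"""

# Standard sitemap namespace; ElementTree needs it to match <loc> tags
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_sitemap_urls(sitemap_xml):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]

sample_urls = extract_sitemap_urls(SAMPLE_SITEMAP)
print(sample_urls)
```

Each URL in the resulting list is what `download_sitemap_text` would then fetch and parse as an HTML page.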

Then we'll download the Anthropic documentation, including the URL, title, and content of each page:

In [2]:
# Download the docs
anthropic_docs_sitemap_url = 'https://docs.anthropic.com/sitemap.xml'
text_content = download_sitemap_text(anthropic_docs_sitemap_url)
print(text_content[0]['title'] + " ✅")

Welcome to Claude ✅


## Set up the Anthropic client

Next, we'll initialize the Anthropic client using your API key:

In [4]:
from anthropic import Anthropic
client = Anthropic()
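
When called with no arguments, `Anthropic()` reads the key from the `ANTHROPIC_API_KEY` environment variable. The sketch below shows one possible way to fail early with a readable message when that variable is missing. The `require_api_key` helper and the placeholder key are hypothetical and only for illustration.

```python
import os

def require_api_key(env=None, name="ANTHROPIC_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before creating the client.")
    return key

# Demonstrated on a stand-in mapping so the example runs without a real key
demo_key = require_api_key({"ANTHROPIC_API_KEY": "sk-placeholder"})
print(demo_key)
```

In the notebook you could call `require_api_key()` with no arguments just before `Anthropic()`.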

## Build a Question-Answering Prompt with Citations

Now we'll build a prompt that asks Claude to answer user questions about the Anthropic documentation while citing its sources. We'll start with the system prompt; read the whole prompt first, then we can break down the relevant sections:

In [5]:
# Format the docs into a useful format for Claude
# Each page is placed into XML: <page url="" title="">Page content</page>
website_content_string = ""
for page in text_content:
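    # Use a double-quoted f-string: reusing the outer quote type inside {...} is a syntax error before Python 3.12
    website_content_string += f"<page url=\"{page['url']}\" title=\"{page['title']}\">\n{page['content']}\n</page>\n"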

SYSTEM_PROMPT = f"""You are Anthropic's DocBot, a helpful assistant that is an expert at helping users with the Anthropic documentation.

Here is the Anthropic documentation:
<documentation>
{website_content_string}
</documentation>

When a user asks a question, perform the following tasks:
1. Find the quotes from the documentation that are the most relevant to answering the question. These quotes can be quite long if necessary (even multiple paragraphs). You may need to use many quotes to answer a single question, including code snippets and other examples.
2. Assign numbers to these quotes in the order they were found. Each page of the documentation should only be assigned a number once.
3. Based on the document and quotes, answer the question. Directly quote the documentation when possible, including examples. When relevant, code examples are preferred.
4. When answering the question, provide citation references in square brackets containing the number generated in step 2 (the number of the page where the citation was found).
5. Structure the output in the following format:
<citations>
{{
   "citations": [
      {{
         "page_title": "string",
         "url": "string",
         "number": "integer",
         "relevant_passages": ["string"] // A list of every relevant passage on a single documentation page
      }},
      ...
   ]
}}
</citations>

<answer>A plain text answer, formatted as Markdown[1]</answer>"""

### Breaking down the system prompt

#### Format the content of the documentation into a list of documents, in an XML format that Claude will understand well.
```
website_content_string = ""
for page in text_content:
    website_content_string += f"<page url=\"{page['url']}\" title=\"{page['title']}\">\n{page['content']}\n</page>\n"
```
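
Page titles and URLs can contain characters such as `&` or `"` that would break the attribute syntax of the `<page>` tag. As a hedged sketch, one option is to escape the attribute values with `xml.sax.saxutils.quoteattr` from the standard library, which also adds the surrounding quotes. The `format_page` helper and the sample page are assumptions for illustration, and the page content itself is left unescaped here.

```python
from xml.sax.saxutils import quoteattr

def format_page(page):
    """Wrap one page dict in a <page> tag, escaping the attribute values."""
    return (
        f"<page url={quoteattr(page['url'])} title={quoteattr(page['title'])}>\n"
        f"{page['content']}\n</page>\n"
    )

# Hypothetical page, used only for illustration
sample_page = {
    "title": "Q&A",
    "url": "https://example.com/docs/qa",
    "content": "Page content",
}
formatted = format_page(sample_page)
print(formatted)
```

With this helper, the loop body would become `website_content_string += format_page(page)`.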
#### Assign Claude the role of "DocBot", and give it the Anthropic documentation for reference:
```
You are Anthropic's DocBot, a helpful assistant that is an expert at helping users with the Anthropic documentation.

Here is the Anthropic documentation:
<documentation>
{website_content_string}
</documentation>
```

#### Clearly define how Claude should create citations by finding relevant quotes, numbering pages, and then answering the question.
```
When a user asks a question, perform the following tasks:
1. Find the quotes from the documentation that are the most relevant to answering the question. These quotes can be quite long if necessary (even multiple paragraphs). You may need to use many quotes to answer a single question, including code snippets and other examples.
2. Assign numbers to these quotes in the order they were found. Each page of the documentation should only be assigned a number once.
3. Based on the document and quotes, answer the question. Directly quote the documentation when possible, including examples. When relevant, code examples are preferred.
4. When answering the question, provide citation references in square brackets containing the number generated in step 2 (the number of the page where the citation was found).
```

#### Give a prescriptive description of the output format of the citations to ensure we capture all of the relevant citation information, and encourage the model to extract the types of information that would be helpful for answering the question.
```
5. Structure the output in the following format:
<citations>
{
   "citations": [
      {
         "page_title": "string",
         "url": "string",
         "number": "integer",
         "relevant_passages": ["string"] // A list of every relevant passage on a single documentation page
      },
      ...
   ]
}
</citations>
```
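
Because the format is prescriptive, the returned JSON can be checked before it is used. The following is a minimal sketch of such a check, not a full schema validator. The `find_citation_problems` helper, the `REQUIRED_FIELDS` tuple, and the sample payload are hypothetical names introduced for illustration.

```python
import json

# Field names taken from the output format in the system prompt
REQUIRED_FIELDS = ("page_title", "url", "number", "relevant_passages")

def find_citation_problems(citations_json):
    """Return a list of problems; an empty list means the payload looks well-formed."""
    try:
        payload = json.loads(citations_json)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e.msg}"]
    if not isinstance(payload, dict) or not isinstance(payload.get("citations"), list):
        return ["missing 'citations' list"]

    problems = []
    for i, citation in enumerate(payload["citations"]):
        if not isinstance(citation, dict):
            problems.append(f"citation {i}: not an object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in citation:
                problems.append(f"citation {i}: missing '{field}'")
        if not isinstance(citation.get("relevant_passages", []), list):
            problems.append(f"citation {i}: 'relevant_passages' is not a list")
    return problems

# Hypothetical payload: the second citation is deliberately malformed
sample = json.dumps({
    "citations": [
        {"page_title": "Vision", "url": "https://example.com/vision",
         "number": 1, "relevant_passages": ["A quoted passage"]},
        {"page_title": "Prefill", "number": 2, "relevant_passages": "not a list"},
    ]
})
problems = find_citation_problems(sample)
print(problems)
```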

#### Ask Claude to output the answer in XML tags for easier parsing, and give it an example of a citation:

```
<answer>A plain text answer, formatted as Markdown[1]</answer>"""
```

**Note:** Having Claude put the citations before the answer will make it much more likely that Claude directly quotes from the cited material!

## The User Prompt

Ask Claude a question about the docs!

In [6]:
# ideas = [
#     "How do I use a system prompt?",
#     "how to output json?",
#     "How do I use an image in python api",
#     "How do I use messages in sheets?"
# ]

QUERY = "How do I use an image in python api"

## Send it to Claude!

In [7]:

starter_stub = '<citations>\n{\n  "citations": ['

response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    system=SYSTEM_PROMPT,
    temperature=0.0,
    messages=[
        {
            "role": "user", 
            "content": QUERY
        },
        {
            "role": "assistant",
            "content": starter_stub
        }
    ]
)

full_response = starter_stub + response.content[0].text
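
Because the final assistant message is a prefill, the API returns only the continuation of that message, which is why `starter_stub` is prepended to the response text. The offline sketch below illustrates the effect with a hypothetical `continuation` string standing in for `response.content[0].text`. The `extract_citations_json` helper and the `example.com` URL are assumptions for illustration.

```python
import json
import re

starter_stub = '<citations>\n{\n  "citations": ['

# Hypothetical continuation, standing in for response.content[0].text
continuation = (
    '\n    {"page_title": "Vision", "url": "https://example.com/vision", '
    '"number": 1, "relevant_passages": ["A quoted passage"]}\n  ]\n}\n</citations>\n\n'
    '<answer>See the vision guide[1]</answer>'
)

def extract_citations_json(text):
    """Parse the JSON inside <citations> tags, or return None if the tags are incomplete."""
    match = re.search(r"<citations>(.*?)</citations>", text, re.DOTALL)
    return json.loads(match.group(1)) if match else None

# Without the stub, the opening tag is missing and nothing can be extracted
without_stub = extract_citations_json(continuation)
# With the stub prepended, the citations block is complete and parses as JSON
with_stub = extract_citations_json(starter_stub + continuation)
print(without_stub, with_stub["citations"][0]["number"])
```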

Let's look at the raw output:

In [8]:
print(full_response)

<citations>
{
  "citations": [
    {
      "page_title": "Vision",
      "url": "https://docs.anthropic.com/claude/docs/vision",
      "number": 1,
      "relevant_passages": [
        "Currently, you can utilize Claude's vision capabilities in three ways:\n\nUser\nFor this guide, we'll be using the Anthropic Python SDK, and the following example variables. We'll fetch sample images from Wikipedia using the httpx library, but you can use whatever image sources work for you.",
        "To utilize images when making an API request, you can provide images to Claude as a base64-encoded image in  image content blocks. Here is simple example in Python showing how to include a base64-encoded image in a Messages API request:"
      ]
    }
  ]
}
</citations>

<answer>To use an image in the Python API, you can follow these steps:

1. Fetch the image you want to use, for example from Wikipedia, and convert it to a base64-encoded string[1]:

```python
import anthropic
import base64
import httpx



# Parse and display the results

Now let's pull out the structured content from the result and display the answer as a user would see it.

In [9]:
from IPython.display import display, Markdown
import json
import re

def get_answer(completion):
    # Regex to extract the answer from the <answer> XML tags
    answer = re.search(r'<answer>(.*?)</answer>', completion, re.DOTALL)
    if answer is None:
        return ""
    return answer.group(1)

def get_citations(completion):
    # Regex to extract the citations JSON from the <citations> XML tags
    citations = re.search(r'<citations>(.*?)</citations>', completion, re.DOTALL)
    if citations is None:
        # Return the same shape as a parsed response so callers can index 'citations'
        return {"citations": []}
    return json.loads(citations.group(1))

def render_response(full_response):
    answer = get_answer(full_response)
    citations = get_citations(full_response)
    citations_list = citations['citations']
    
    # replace [number] in the answer with a link to the relevant citation url
    for citation in citations_list:
        answer = answer.replace(f"[{citation['number']}]", f"[\\[{citation['number']}\\]]({citation['url']})")

    display(Markdown(answer))

render_response(full_response)

To use an image in the Python API, you can follow these steps:

1. Fetch the image you want to use, for example from Wikipedia, and convert it to a base64-encoded string[\[1\]](https://docs.anthropic.com/claude/docs/vision):

```python
import anthropic
import base64
import httpx

image_url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"
image_media_type = "image/jpeg"
image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")
```

2. Then, include the base64-encoded image in the `messages` parameter when creating a new message using the Anthropic Python SDK[\[1\]](https://docs.anthropic.com/claude/docs/vision):

```python
message = anthropic.Anthropic().messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": image_media_type,
                        "data": image_data,
                    },
                }
            ],
        }
    ],
)
```

This will send the base64-encoded image to the API, allowing Claude to process and analyze the image as part of the request.
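
Since the prompt asks for direct quotes, one optional follow-up is to check that each quoted passage really appears on the page it cites. The sketch below is a simple substring check after whitespace normalization, under the assumption that pages are dicts with `url` and `content` keys as in `text_content`. The `normalize` and `find_unsupported_passages` helpers and the sample data are hypothetical.

```python
import re

def normalize(text):
    """Collapse runs of whitespace so line-break differences don't matter."""
    return re.sub(r"\s+", " ", text).strip()

def find_unsupported_passages(citations_list, pages):
    """Return (url, passage) pairs whose passage is not found in the cited page."""
    content_by_url = {page["url"]: normalize(page["content"]) for page in pages}
    unsupported = []
    for citation in citations_list:
        page_text = content_by_url.get(citation["url"], "")
        for passage in citation["relevant_passages"]:
            if normalize(passage) not in page_text:
                unsupported.append((citation["url"], passage))
    return unsupported

# Hypothetical data, used only for illustration
sample_pages = [
    {"url": "https://example.com/vision", "title": "Vision",
     "content": "You can provide images\nas base64-encoded data."},
]
sample_citations = [
    {"page_title": "Vision", "url": "https://example.com/vision", "number": 1,
     "relevant_passages": ["provide images as base64-encoded data",
                           "images must be PNG files"]},
]
unsupported = find_unsupported_passages(sample_citations, sample_pages)
print(unsupported)
```

In the notebook, this could be called as `find_unsupported_passages(get_citations(full_response)['citations'], text_content)`. An exact-match check like this is strict, so a lightly paraphrased quote would also be flagged.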


# Wrap Up!

That's it! To recap:
* Citations can help ground Claude's responses in the provided data and give additional context to the user
* To get Claude to generate citations:
  * Be clear about the structure of the citations
  * Ask it to create the citations first, and importantly to pull out specific quotes
  * Use structured sections to separate the citations from the answer
  * [Prefill Claude's response](https://docs.anthropic.com/claude/docs/prefill-claudes-response) to ensure it starts its response with citations
  