# Anthropic Citations
- https://www.anthropic.com/news/introducing-citations-api
- https://docs.anthropic.com/en/docs/build-with-claude/citations

In [None]:
! pip install anthropic keyring markitdown

In [118]:
import anthropic # pip install anthropic
import base64 # for converting pdf file
import httpx # to get pdf from the web
from IPython.display import Markdown, display
import keyring # pip install keyring
from markitdown import MarkItDown # pip install markitdown

In [2]:
# Run this once to store the key, then remove the real key again. Never save a visible API key in .py, .ipynb or similar files.
#keyring.set_password('Claude_API_key', 'Medium_API_test', '<API-KEY>')

In [4]:
client = anthropic.Anthropic(
    api_key=keyring.get_password('Claude_API_key', 'Medium_API_test'),  # must match the names used in set_password above
)
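`keyring.get_password` returns `None` when the service/username pair is not found, so a typo in either string tends to surface only later, when the request fails to authenticate. Below is a minimal sketch of a guard. The helper name `resolve_api_key` is hypothetical, and the fallback to the `ANTHROPIC_API_KEY` environment variable is an assumption about how you might want to configure things, not part of the original notebook.

```python
import os


def resolve_api_key(lookup, env_var="ANTHROPIC_API_KEY"):
    """Return the first non-empty key from `lookup()` or the environment.

    `lookup` is any zero-argument callable, for example
    lambda: keyring.get_password('Claude_API_key', 'Medium_API_test')
    """
    key = lookup() or os.environ.get(env_var)
    if not key:
        # Fail early with a readable message instead of a late auth error
        raise RuntimeError(f"No API key found in the keyring or in ${env_var}")
    return key
```

The result can then be passed as `api_key=` to `anthropic.Anthropic(...)`.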

In [5]:
MODEL = "claude-3-5-sonnet-20241022" # https://docs.anthropic.com/en/docs/about-claude/models

## Plain text example
_Simple text documents, prose_

In [21]:
# Get text (from the web)
url = 'https://likumi.lv/ta/en/en/id/26019-labour-law'
md = MarkItDown()
result = md.convert_url(url)

In [24]:
print(result.text_content[2000:3000])

īdz |  |  |  | līdz |  |  |  |

| Statuss: |  | spēkā esošs |  | vēl nav spēkā |  | zaudējis spēku |
| --- | --- | --- | --- | --- | --- | --- |

search

notīrīt

| The translation of this document is outdated.  Translation validity: 25.11.2022.–21.10.2024.  Amendments not included: [19.09.2024.](/ta/id/355472)      | Text consolidated by Valsts valodas centrs (State Language Centre) with amending laws of: 12 December 2002 [shall come into force on 1 January 2003]; 22 January 2004 [shall come into force on 25 February 2004]; 22 April 2004 [shall come into force on 8 May 2004]; 13 October 2005 [shall come into force on 16 November 2005]; 21 September 2006 [shall come into force on 25 October 2006]; 12 June 2009 [shall come into force on 29 June 2009]; 1 December 2009 [shall come into force on 1 January 2010]; 4 March 2010 [shall come into force on 25 March 2010]; 31 March 2011 [shall come into force on 4 May 2011]; 16 June 2011 [shall come into force on 20 July 2011]; 21 June 2012 [shal

In [26]:
user_prompt = "How many days of study leave can an employee get for a State exam or diploma preparation?"

In [None]:
# chat completion without streaming
response = client.messages.create(
    model=MODEL, # https://docs.anthropic.com/en/docs/about-claude/models
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "text",
                        "media_type": "text/plain",
                        "data": result.text_content[2000:]
                    },
                    "title": "Labour Law", # optional field that will be passed to the model but not used towards cited content.
                    "context": "This is a Labour Law from likumi.lv with changes until 21.10.2024.", # optional field that will be passed to the model but not used towards cited content.
                    "citations": {"enabled": True}
                },
                {
                    "type": "text",
                    "text": user_prompt
                }
            ]
        }
    ]
)

Response result:

In [110]:
text = ''
citation_num = 1
citations = ''

for i in response.content:
    # Take the citation branch only when the attribute exists and is non-empty
    if getattr(i, 'citations', None):
        citation_nums = []
        for citation in i.citations:
            citation_nums.append(citation_num)
            #print(citation)
            # Citations are plain dicts in the SDK version used here (see the dict_keys output below)
            citations += f'[{citation_num}] "{citation["document_title"]}" - {citation["cited_text"]}\n\n'
            #print(result.text_content[2000+citation['start_char_index']:2000+citation['end_char_index']]) # same as citation['cited_text']
            citation_num += 1
        text += i.text + ' [' + ','.join(str(item) for item in citation_nums) + ']'
    else:
        text += i.text
display(Markdown((text + '\n\n' + citations) if citations.strip() else text))

According to the law, an employee shall be granted a study leave of 20 working days for taking a State examination or preparing and defending a diploma paper, either with or without retaining their wage. For employees with piecework wage, the study leave shall be granted either with or without disbursing average earnings [1].

[1] "Labour Law" - (2) An employee shall be granted a study leave of 20 working days for the taking of a State examination or the preparation and defence of a diploma paper with or without retaining the wage. If a piecework wage has been specified for the employee, a study leave shall be granted with or without disbursing the average earnings.  
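The rendering loop above is repeated for every example in this notebook, so it may be worth wrapping it in a function. The sketch below is one way to do that. The name `render_with_citations` is hypothetical, the blocks are made-up stand-ins shaped like `response.content`, and citations are assumed to be plain dicts as in the outputs shown here.

```python
from types import SimpleNamespace


def render_with_citations(blocks):
    """Return (body, sources): text with [n] markers and the numbered source list."""
    body, sources, n = "", [], 0
    for block in blocks:
        body += block.text
        nums = []
        # Blocks without citations may lack the attribute or carry None
        for c in getattr(block, "citations", None) or []:
            n += 1
            nums.append(n)
            sources.append(f'[{n}] "{c["document_title"]}" - {c["cited_text"]}')
        if nums:
            body += " [" + ",".join(map(str, nums)) + "]"
    return body, sources


# Made-up stand-in data, not a real API response
blocks = [
    SimpleNamespace(text="Study leave is "),
    SimpleNamespace(
        text="20 working days.",
        citations=[{"document_title": "Labour Law",
                    "cited_text": "a study leave of 20 working days"}],
    ),
]
body, sources = render_with_citations(blocks)
print(body)
print("\n\n".join(sources))
```

In the notebook, the two return values could be joined and passed to `display(Markdown(...))` as before.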


In [116]:
# Possible dictionary keys from the citations
# see also: https://docs.anthropic.com/en/docs/build-with-claude/citations#example-plain-text-citation
response.content[1].citations[0].keys()

dict_keys(['type', 'cited_text', 'document_index', 'document_title', 'start_char_index', 'end_char_index'])
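For plain text documents, `start_char_index` and `end_char_index` point back into the text that was sent as the document, with the end index treated as exclusive (this matches the commented-out check in the rendering cell, where `2000` is added because the document was sent as `result.text_content[2000:]`). A small sketch with made-up sample data, where `cited_span` is a hypothetical helper:

```python
def cited_span(document_text, citation):
    """Slice the cited passage out of the text that was sent as the document."""
    return document_text[citation["start_char_index"]:citation["end_char_index"]]


# Made-up sample data shaped like a plain-text citation
document_text = "Intro. An employee shall be granted a study leave of 20 working days. Outro."
quote = "An employee shall be granted a study leave of 20 working days."
start = document_text.find(quote)
citation = {
    "type": "char_location",
    "cited_text": quote,
    "document_index": 0,
    "document_title": "Labour Law",
    "start_char_index": start,
    "end_char_index": start + len(quote),  # exclusive end
}

print(cited_span(document_text, citation) == citation["cited_text"])
```

This kind of check is useful when you want to highlight the cited passage in the original document rather than only print `cited_text`.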

Used tokens for this query:

In [36]:
print(f"""
Cache creation input tokens:\t{response.usage.cache_creation_input_tokens}
Cache read input tokens:\t{response.usage.cache_read_input_tokens}
Input tokens:\t\t\t{response.usage.input_tokens}
Output tokens:\t\t\t{response.usage.output_tokens}
Total tokens:\t\t\t{response.usage.input_tokens+response.usage.output_tokens}
""")


Cache creation input tokens:	0
Cache read input tokens:	0
Input tokens:			73869
Output tokens:			88
Total tokens:			73957



This query (73,957 tokens) cost me **0.22 EUR**. See the [Anthropic pricing page](https://www.anthropic.com/pricing#anthropic-api) for current rates.
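The cost can be roughly estimated from the token counts. The sketch below assumes prices of 3 USD per million input tokens and 15 USD per million output tokens for this model. These figures are assumptions that may be out of date, so check the pricing page. The result is in USD, and the conversion to EUR is not modelled.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_mtok=3.0, output_price_per_mtok=15.0):
    """Rough cost estimate in USD; the default prices are assumptions."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Token counts taken from the usage output above
cost = estimate_cost_usd(73869, 88)
print(f"{cost:.4f} USD")  # prints "0.2229 USD"
```

Almost all of the cost comes from the input side, because the whole document is sent with the request.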

## PDF example
_PDF files with text content. Citing images from PDFs is not currently supported._

In [125]:
# Load and encode the PDF
# https://docs.anthropic.com/en/docs/build-with-claude/pdf-support
pdf_url = "https://www.lza.lv/images/Zinatnes-vestnesis/2024/ZV_12_2024.pdf"
pdf_data = base64.standard_b64encode(httpx.get(pdf_url).content).decode("utf-8")
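The one-liner above encodes whatever the server returns, including an HTML error page. A minimal sketch of a safer variant follows. The helper `encode_pdf` is hypothetical, and the `%PDF-` header check is only a cheap sanity test, not a full validation.

```python
import base64


def encode_pdf(raw: bytes) -> str:
    """Base64-encode PDF bytes after a cheap check of the file header."""
    if not raw.startswith(b"%PDF-"):
        raise ValueError("Downloaded content does not look like a PDF")
    return base64.standard_b64encode(raw).decode("utf-8")
```

With `httpx`, this could be combined with a status check: `r = httpx.get(pdf_url); r.raise_for_status(); pdf_data = encode_pdf(r.content)`.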

In [127]:
# Prompt in English: "Which articles (give the article title and the author) are included in the latest issue of 'Zinātnes vēstnesis'?"
user_prompt = "Kādi raksti (nosauc raksta nosaukumu un autoru) ir iekļauti pēdējā izdevumā 'Zinātnes vēstnesis'?"

In [128]:
# chat completion without streaming
response = client.messages.create(
    model=MODEL, # https://docs.anthropic.com/en/docs/about-claude/models
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data
                    },
                    "title": "Zinātnes Vēstnesis", # optional field that will be passed to the model but not used towards cited content.
                    "context": "Latvijas Zinātņu akadēmijas, Latvijas Zinātnes padomes un Latvijas Zinātnieku savienības laikraksts, Nr.11 (649)", # optional field that will be passed to the model but not used towards cited content.
                    "citations": {"enabled": True}
                },
                {
                    "type": "text",
                    "text": user_prompt
                }
            ]
        }
    ]
)

Response result:

In [150]:
text = ''
citation_num = 1
citations = ''

for i in response.content:
    # Take the citation branch only when the attribute exists and is non-empty
    if getattr(i, 'citations', None):
        citation_nums = []
        for citation in i.citations:
            citation_nums.append(citation_num)
            # end_page_number is exclusive, so the last cited page is end_page_number - 1
            first_page, last_page = citation['start_page_number'], citation['end_page_number'] - 1
            # 'lpp.' is the Latvian abbreviation for 'p.'
            pages = f"({first_page}.lpp.)" if first_page == last_page else f"({first_page}.-{last_page}.lpp.)"
            citations += f'[{citation_num}] "{citation["document_title"]}" - {citation["cited_text"]} {pages}\n\n'
            citation_num += 1
        text += i.text + ' [' + ','.join(str(item) for item in citation_nums) + ']'
    else:
        text += i.text
display(Markdown((text + '\n\n**Avoti:**\n\n' + citations) if citations.strip() else text))

Balstoties uz dokumentu, pēdējā "Zinātnes Vēstnesis" izdevumā (Nr. 11 (649), 2024. gada 30. decembris) ir iekļauti šādi galvenie raksti:

1. Latvijas Jauno zinātnieku apvienības valdes locekles Lienes Spruženieces viedokļraksts "Kāpēc zinātniekiem nevajadzētu "mērīties" ar Hirša indeksiem" [1,2]

2. "Draudzība un jauniešu morālā izaugsme: kā tās mijiedarbojas un kāpēc būtiski šo pētīt teorijā un praksē" - intervija ar Dr.paed. Manuelu Hoakinu Fernadesu-Gonsalesu, kas ir viens no trīspadsmit Latvijas Universitātes un Banku augstskolas zinātnieku grantu saņēmējiem. Līdz 2026. gada februārim viņš pētīs saistību starp draudzību un jauniešu tikumisko izaugsmi [3,4]

3. "Ievads cietumzinātnē" - raksts par soda izpildes jomu un to, kā ieslodzījuma vietu amatpersonas var radīt apstākļus likumpārkāpēja uzvedības pozitīvām pārmaiņām [5]

4. Māra Lustes raksts "Penitenciārā zinātne un soda filozofijas evolūcija" [6]

Turklāt izdevumā iekļauta arī informācija par zinātnisko grādu aizstāvēšanām un piešķiršanām.

**Avoti:**

[1] "Zinātnes Vēstnesis" - Latvijas Zinātņu akadēmija
11 (649) ISSN 1407-6748 2024. gada 30. decembris
Kāpēc zinātniekiem nevajadzētu “mērīties” ar Hirša indeksiem
Laikraksta “Zinātnes Vēstnesis” septembra numurā 
(https://www.lza.lv/images/Zinatnes-vestnesis/2024/ZV_09_2024.pdf) akadēmiķa Roberta Eglīša publicētais 
Latvijas zinātnieku rangs pēc Hirša indeksa un apgalvojumi, ka Hirša indekss (H-indekss) objektīvi atspoguļo 
zinātnieku darba izcilību izraisīja plašas diskusijas, īpaši 
jauno zinātnieku vidū.  (1.lpp.)

[2] "Zinātnes Vēstnesis" - Laikraksta “Zinātnes Vēstnesis” redakcija
Latvijas Jauno zinātnieku apvienības valdes locekles Lienes Spruženieces viedokļraksts
Raksta autors Māris Luste.  (1.lpp.)

[3] "Zinātnes Vēstnesis" - Draudzība un jauniešu morālā izaugsme: 
kā tās mijiedarbojas un kāpēc būtiski šo pētīt teorijā un praksē
Dr.paed.  (1.lpp.)

[4] "Zinātnes Vēstnesis" - Izglītības zinātņu un psiholoģijas Pedagoģijas 
zinātniskā institūta vadošais pētnieks Manuels Hoakins 
Fernandess-Gonsaless ir viens no trīspadsmit Latvijas 
Universitātes un Banku augstskolas zinātnieku (profesoru) 
grantu saņēmējiem. Līdz 2026. gada februārim viņš pētīs 
saistību starp draudzību un jauniešu tikumisko izaugsmi, kas 
ir viņa pētniecības lauks jau vairāk nekā septiņus gadus.  (1.lpp.)

[5] "Zinātnes Vēstnesis" - Ievads cietumzinātnē
Ārsta profesija nepastāvētu bez medicīnas un 
skolotāja – bez pedagoģijas zinātnes. Kā šis princips darbojas soda izpildes jomā? Vai ieslodzījuma vietu amatpersona, 
nesekojot fundamentāliem pierādījumiem, spētu radīt 
apstākļus, kuros tai uzticētā likumpārkāpēja uzvedība 
piedzīvo pozitīvas pārmaiņas?
 (1.lpp.)

[6] "Zinātnes Vēstnesis" - Laikrakstam “Zinātnes Vēstnesis”
sagatavoja ESF projekta 
“Resocializācijas sistēmas efektivitātes paaugstināšana” 
vadītājs (2017–2023) Māris Luste
“Zinātnes Vēstnesis” 2024. gada 30. decembris (6.lpp.)



In [137]:
# Possible dictionary keys from the citations
# see also: https://docs.anthropic.com/en/docs/build-with-claude/citations#example-pdf-citation
response.content[1].citations[0].keys()

dict_keys(['type', 'cited_text', 'document_index', 'document_title', 'start_page_number', 'end_page_number'])
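For PDF documents the location is given as page numbers, and `end_page_number` is exclusive (as noted in the rendering cell). A small sketch of a page-range formatter follows. The helper `format_pages` is hypothetical and uses English "p."/"pp." instead of the Latvian "lpp." used in this notebook.

```python
def format_pages(start_page, end_page_exclusive):
    """Format a page range where the end page is exclusive."""
    last_page = end_page_exclusive - 1
    if last_page <= start_page:
        return f"(p. {start_page})"
    return f"(pp. {start_page}-{last_page})"


print(format_pages(1, 2))  # prints "(p. 1)"
print(format_pages(3, 6))  # prints "(pp. 3-5)"
```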

Used tokens for this query:

In [134]:
print(f"""
Cache creation input tokens:\t{response.usage.cache_creation_input_tokens}
Cache read input tokens:\t{response.usage.cache_read_input_tokens}
Input tokens:\t\t\t{response.usage.input_tokens}
Output tokens:\t\t\t{response.usage.output_tokens}
Total tokens:\t\t\t{response.usage.input_tokens+response.usage.output_tokens}
""")


Cache creation input tokens:	0
Cache read input tokens:	0
Input tokens:			54353
Output tokens:			541
Total tokens:			54894



This query (54,894 tokens) cost me **0.17 EUR**. See the [Anthropic pricing page](https://www.anthropic.com/pricing#anthropic-api) for current rates.

## Custom content example
_Lists, transcripts, special formatting, more granular citations_

In [160]:
qa_url = "https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/refs/heads/main/03-vector-search/eval/documents-with-ids.json"
qa_raw = httpx.get(qa_url).json()

Create a separate Q&A list for each course, with every record turned into one text content block:

In [169]:
dataeng_list = []
ml_list = []
mlops_list = []
for i in qa_raw:
    if i['course'] == 'data-engineering-zoomcamp':
        dataeng_list.append({"type": "text", "text": f"QUESTION: {i['question']}; ANSWER: {i['text']}"})
    if i['course'] == 'machine-learning-zoomcamp':
        ml_list.append({"type": "text", "text": f"QUESTION: {i['question']}; ANSWER: {i['text']}"})
    if i['course'] == 'mlops-zoomcamp':
        mlops_list.append({"type": "text", "text": f"QUESTION: {i['question']}; ANSWER: {i['text']}"})
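The three `if` branches above differ only in the course name, so the same grouping can be written once with a dictionary keyed by course. This is a sketch of an alternative, not the notebook's code: `group_by_course` is a hypothetical helper and the sample records are made up, shaped like the entries in `qa_raw`.

```python
from collections import defaultdict


def group_by_course(records):
    """Group Q&A records into lists of text content blocks, keyed by course."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["course"]].append(
            {"type": "text",
             "text": f"QUESTION: {rec['question']}; ANSWER: {rec['text']}"}
        )
    return grouped


# Made-up sample records with the same fields as the downloaded JSON
sample = [
    {"course": "mlops-zoomcamp", "question": "Q1?", "text": "A1"},
    {"course": "machine-learning-zoomcamp", "question": "Q2?", "text": "A2"},
    {"course": "mlops-zoomcamp", "question": "Q3?", "text": "A3"},
]
grouped = group_by_course(sample)
print(len(grouped["mlops-zoomcamp"]))  # prints 2
```

With the real data, `group_by_course(qa_raw)["data-engineering-zoomcamp"]` would play the role of `dataeng_list`.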

In [178]:
user_prompt = "Will I get a certificate for ML Zoomcamp course and data engineering zoomcamp?"

To reduce the number of input tokens, I use only the first 10 or 15 records from each list:

In [181]:
# chat completion without streaming
response = client.messages.create(
    model=MODEL, # https://docs.anthropic.com/en/docs/about-claude/models
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "content",
                        "content": dataeng_list[:15]
                    },
                    "title": "Data engineering zoomcamp", # optional field that will be passed to the model but not used towards cited content.
                    "context": "Q&A from Data engineering zoomcamp", # optional field that will be passed to the model but not used towards cited content.
                    "citations": {"enabled": True}
                },
                {
                    "type": "document",
                    "source": {
                        "type": "content",
                        "content": ml_list[:10]
                    },
                    "title": "Machine Learning zoomcamp", # optional field that will be passed to the model but not used towards cited content.
                    "context": "Q&A from Machine Learning zoomcamp", # optional field that will be passed to the model but not used towards cited content.
                    "citations": {"enabled": True}
                },
                {
                    "type": "document",
                    "source": {
                        "type": "content",
                        "content": mlops_list[:10]
                    },
                    "title": "MLOps zoomcamp", # optional field that will be passed to the model but not used towards cited content.
                    "context": "Q&A from MLOps zoomcamp", # optional field that will be passed to the model but not used towards cited content.
                    "citations": {"enabled": True}
                },
                {
                    "type": "text",
                    "text": user_prompt
                }
            ]
        }
    ]
)

Response result:

In [187]:
text = ''
citation_num = 1
citations = ''

for i in response.content:
    # Take the citation branch only when the attribute exists and is non-empty
    if getattr(i, 'citations', None):
        citation_nums = []
        for citation in i.citations:
            citation_nums.append(citation_num)
            citations += f'[{citation_num}] "{citation["document_title"]}" - {citation["cited_text"]}\n\n'
            citation_num += 1
        text += i.text + ' [' + ','.join(str(item) for item in citation_nums) + ']'
    else:
        text += i.text
display(Markdown((text + '\n\n**Source:**\n\n' + citations) if citations.strip() else text))

Based on the documents, here are the certificate requirements for both courses:

For ML Zoomcamp:
Yes, you will get a certificate if you complete at least 2 out of 3 projects and review 3 peers' Projects by the deadline. [1]

For Data Engineering Zoomcamp:
You can only get a certificate if you finish the course with a "live" cohort. Certificates are not awarded for the self-paced mode. This is because you need to peer-review capstone projects after submitting your project, which can only be done when the course is running. [2]

Important points about the courses:

Data Engineering Zoomcamp:
- There's only one "live" cohort per year for the Data Engineering certification, which generally runs from January to April. [3]

While all materials remain available after the course finishes and you can follow the course at your own pace, you'll need to participate in the live cohort if you want to obtain a certificate. [4]

ML Zoomcamp:
- The course videos are pre-recorded, and you can start watching them right away. There are occasional office hours (live sessions) which are also recorded. If you miss any session, you won't miss anything as everything is recorded. [5]
- The course duration is approximately 4 months [6].

**Source:**

[1] "Machine Learning zoomcamp" - QUESTION: Will I get a certificate?; ANSWER: Yes, if you finish at least 2 out of 3 projects and review 3 peers’ Projects by the deadline, you will get a certificate. This is what it looks like: link. There’s also a version without a robot: link.

[2] "Data engineering zoomcamp" - QUESTION: Certificate - Can I follow the course in a self-paced mode and get a certificate?; ANSWER: No, you can only get a certificate if you finish the course with a “live” cohort. We don't award certificates for the self-paced mode. The reason is you need to peer-review capstone(s) after submitting a project. You can only peer-review projects at the time the course is running.

[3] "Data engineering zoomcamp" - QUESTION: Course - how many Zoomcamps in a year?; ANSWER: There are 3 Zoom Camps in a year, as of 2024. However, they are for separate courses:
Data-Engineering (Jan - Apr)
MLOps (May - Aug)
Machine Learning (Sep - Jan)
There's only one Data-Engineering Zoomcamp “live” cohort per year, for the certification. Same as for the other Zoomcamps.
They follow pretty much the same schedule for each cohort per zoomcamp. For Data-Engineering it is (generally) from Jan-Apr of the year. If you’re not interested in the Certificate, you can take any zoom camps at any time, at your own pace, out of sync with any “live” cohort.

[4] "Data engineering zoomcamp" - QUESTION: Course - Can I follow the course after it finishes?; ANSWER: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.

[5] "Machine Learning zoomcamp" - QUESTION: Is it going to be live? When?; ANSWER: The course videos are pre-recorded, you can start watching the course right now.
We will also occasionally have office hours - live sessions where we will answer your questions. The office hours sessions are recorded too.
You can see the office hours as well as the pre-recorded course videos in the course playlist on YouTube.QUESTION: What if I miss a session?; ANSWER: Everything is recorded, so you won’t miss anything. You will be able to ask your questions for office hours in advance and we will cover them during the live stream. Also, you can always ask questions in Slack.

[6] "Machine Learning zoomcamp" - QUESTION: How long is the course?; ANSWER: Approximately 4 months, but may take more if you want to do some extra activities (an extra project, an article, etc)



In [183]:
# Possible dictionary keys from the citations
# see also: https://docs.anthropic.com/en/docs/build-with-claude/citations#example-citation
response.content[1].citations[0].keys()

dict_keys(['type', 'cited_text', 'document_index', 'document_title', 'start_block_index', 'end_block_index'])
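For custom content documents the location is a range of content blocks, with `end_block_index` treated as exclusive. One citation can span several blocks, and their texts then appear concatenated in `cited_text` (as in source [5] above, where two Q&A records run together). A sketch with made-up blocks, where `cited_blocks` is a hypothetical helper:

```python
def cited_blocks(content_blocks, citation):
    """Return the content blocks that a citation covers."""
    return content_blocks[citation["start_block_index"]:citation["end_block_index"]]


# Made-up content blocks, shaped like the lists sent as "content"
content_blocks = [
    {"type": "text", "text": "QUESTION: A?; ANSWER: one"},
    {"type": "text", "text": "QUESTION: B?; ANSWER: two"},
    {"type": "text", "text": "QUESTION: C?; ANSWER: three"},
]
citation = {
    "type": "content_block_location",
    "cited_text": "QUESTION: B?; ANSWER: twoQUESTION: C?; ANSWER: three",
    "document_index": 0,
    "document_title": "Sample Q&A",
    "start_block_index": 1,
    "end_block_index": 3,  # exclusive end
}

covered = cited_blocks(content_blocks, citation)
print("".join(b["text"] for b in covered) == citation["cited_text"])
```

Mapping citations back to blocks like this makes it possible to link each source to the original record, for example by an ID kept alongside the block list.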

Used tokens for this query:

In [184]:
print(f"""
Cache creation input tokens:\t{response.usage.cache_creation_input_tokens}
Cache read input tokens:\t{response.usage.cache_read_input_tokens}
Input tokens:\t\t\t{response.usage.input_tokens}
Output tokens:\t\t\t{response.usage.output_tokens}
Total tokens:\t\t\t{response.usage.input_tokens+response.usage.output_tokens}
""")


Cache creation input tokens:	0
Cache read input tokens:	0
Input tokens:			4272
Output tokens:			380
Total tokens:			4652



This query (4,652 tokens) cost me **0.02 EUR**. See the [Anthropic pricing page](https://www.anthropic.com/pricing#anthropic-api) for current rates.

---

This notebook was run with the following Python and package versions:

In [146]:
from importlib.metadata import version
import sys

packages = ['anthropic', 'keyring', 'markitdown']

text = f"Python version: {sys.version}\n\n"
for i in packages:
    text += f"[{i}](https://pypi.org/project/{i}/) version: {version(i)}\n\n"
display(Markdown(text))

Python version: 3.12.4 (tags/v3.12.4:8e8a4ba, Jun  6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)]

[anthropic](https://pypi.org/project/anthropic/) version: 0.43.0

[keyring](https://pypi.org/project/keyring/) version: 25.6.0

[markitdown](https://pypi.org/project/markitdown/) version: 0.0.1a3

