# Claude PDF Data Extraction Workshop

This notebook provides a step-by-step guide on how to use Claude to extract structured data from PDFs. We'll start with the basics of API usage and progress to more complex techniques like structured responses and data extraction.

## Table of Contents
1. [Introduction to APIs and Anthropic API Setup](#section1)
2. [Understanding the Messages API](#section2)
3. [Using XML Tags for Message Organization](#section3)
4. [Structured JSON Responses with XML Tags](#section4)
5. [Reading PDFs](#section5)
6. [Sending PDFs to Anthropic](#section6)
7. [Explicit Analysis Prompts](#section7)
8. [Practical Exercise: Extract Data from Your PDFs](#section8)

Let's dive in!

<a id='section1'></a>
## 1. Introduction to APIs and Anthropic API Setup

### What is an API?

An API (Application Programming Interface) is a set of rules and protocols that allows different software applications to communicate with each other. In simpler terms, it's like a waiter in a restaurant:

1. You (the client) make a request to the waiter (the API)
2. The waiter takes your request to the kitchen (the server)
3. The kitchen prepares your order (processes the request)
4. The waiter brings back your food (the response)

In our case, we'll be using Anthropic's API to send requests to Claude and receive its responses programmatically.

### Signing Up for an Anthropic API Key

To use the Anthropic API, you need an API key. Here's how to get one:

1. Go to the [Anthropic Console](https://console.anthropic.com/)
2. Sign up for an account or log in if you already have one
3. Navigate to the API section in your dashboard
4. Create a new API key
5. Copy your API key and keep it secure (never share it publicly or commit it to version control)


### Setting Up Your API Key Securely

It's important to handle your API key securely. Let's use environment variables to store it:

In [1]:
import os
import getpass
import json
import re
import pandas as pd
import base64
from anthropic import Anthropic

# API client setup
# - os, getpass: For secure environment variable handling
# - json, re: For parsing JSON responses from Claude
# - pandas: For data manipulation and visualization
# - base64: For encoding PDFs to send to Claude
# - anthropic: The official Python client for Anthropic API

# Set up API key securely
api_key = os.environ.get("ANTHROPIC_API_KEY")

# If API key is not found in environment variables, prompt user to enter it
if not api_key:
    print("ANTHROPIC_API_KEY not found in environment variables.")
    api_key = getpass.getpass("Please enter your Anthropic API key: ")
    
# Initialize the client with the API key
client = Anthropic(api_key=api_key)

ANTHROPIC_API_KEY not found in environment variables.


Please enter your Anthropic API key:  ········


<a id='section2'></a>
## 2. Understanding the Messages API

Anthropic's Messages API is designed for conversational interactions with Claude. Let's create a simple example to understand how it works.

In [4]:
from anthropic import Anthropic

# Send a simple message to Claude

# Display Claude's response
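
The cell above is left blank to be filled in during the workshop. As a minimal sketch of one way to complete it, assuming the `client` object created in the setup cell, the helper below sends a single user message and returns the text of the first content block. The name `ask_claude` is our own and not part of the SDK; the model name is the one used later in this notebook.

```python
def ask_claude(client, prompt, model="claude-3-7-sonnet-20250219",
               max_tokens=1000, temperature=0):
    """Send one user message and return the text of Claude's reply.

    `ask_claude` is a hypothetical helper for this workshop; `client` is
    expected to be the Anthropic client created in the setup cell.
    """
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    # The reply is a list of content blocks; the first one holds the text.
    return response.content[0].text


# Example usage (needs a valid API key, so it is left commented out):
# print(ask_claude(client, "In one sentence, what is a weevil?"))
```

Passing `client` as an argument keeps the helper easy to reuse and to test with a stand-in object.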

### Understanding the API Parameters

Let's break down the key parameters in our API call:

- `model`: Specifies which Claude model to use
- `max_tokens`: Maximum number of tokens (roughly words/subwords) in Claude's response
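- `temperature`: Controls randomness (values near 0 give the most consistent output, values near 1 give more varied output; even at 0, responses are not guaranteed to be identical between runs)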
- `messages`: The conversation history in a specific format

### Multi-turn Conversations

The Messages API can handle multi-turn conversations by including the full conversation history:

In [5]:
# Continue the conversation by adding both the user's message and Claude's response 

# Send the continued conversation 

# Display Claude's response 


<a id='section3'></a>
## 3. Using XML Tags for Message Organization

XML tags help organize your prompts and make them more structured. This improves Claude's understanding of the task and leads to more consistent responses.

Let's see how to use XML tags to structure our messages:

In [4]:
xml_structured_prompt = """
I'm going to provide information about a research paper, and I'd like you to analyze it.

<paper_details>
Title: Effects of Climate Change on Pollinator Behavior
Authors: Smith, J., Johnson, A., & Williams, B.
Year: 2023
Journal: Journal of Ecological Research
</paper_details>

<key_findings>
- Temperature increases of 2°C reduced pollinator activity by 15%
- Bee species showed adaptive behavior in urban heat islands
- Flowering times shifted an average of 5.2 days earlier per decade
</key_findings>

Based on the information above, please provide:
<questions>
1. A brief summary of the paper
2. Potential implications for agriculture
3. Suggestions for future research
</questions>
"""

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1000,
    temperature=0,
    messages=[
        {"role": "user", "content": xml_structured_prompt}
    ]
)

print(response.content[0].text)

# Analysis of "Effects of Climate Change on Pollinator Behavior"

## 1. Brief Summary

This 2023 paper by Smith, Johnson, and Williams published in the Journal of Ecological Research examines how climate change affects pollinator behavior. The researchers found that temperature increases of 2°C resulted in a 15% reduction in pollinator activity. However, they also observed that bee species demonstrated adaptive behaviors in urban heat islands, suggesting some capacity for adjustment to warming conditions. Additionally, the study documented that flowering times have shifted approximately 5.2 days earlier per decade, indicating phenological changes in plant-pollinator relationships.

## 2. Potential Implications for Agriculture

The findings have several significant implications for agriculture:

- **Reduced Crop Yields**: A 15% reduction in pollinator activity could translate to substantial decreases in crop yields for pollinator-dependent crops, which represent approximately 75% of glo

### Benefits of Using XML Tags

1. **Clarity**: XML tags make it clear to Claude which parts of your message serve what purpose
2. **Organization**: They help divide complex prompts into logical sections
3. **Consistency**: They lead to more consistent responses, as Claude has a clearer understanding of the request
4. **Focus**: They help Claude focus on specific aspects of a prompt

You can create custom XML tags that make sense for your specific use case. Some commonly used tags include:

- `<instructions>`: For providing overall instructions
- `<example>`: For showing examples of expected output
- `<context>`: For providing background information
- `<question>`: For specifying the exact question or task
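
When prompts are assembled from several pieces, it can help to generate the tags in code instead of typing them by hand. The following is a small sketch under that assumption; `tag` and `build_prompt` are hypothetical helper names, not part of any library.

```python
def tag(name, content):
    """Wrap `content` in <name>...</name>, with each tag on its own line."""
    return f"<{name}>\n{content.strip()}\n</{name}>"


def build_prompt(sections):
    """Join (tag_name, content) pairs into a single prompt string."""
    return "\n\n".join(tag(name, content) for name, content in sections)


prompt = build_prompt([
    ("instructions", "Summarize the context in one sentence."),
    ("context", "Flowering times shifted an average of 5.2 days earlier per decade."),
])
print(prompt)
```

Because every section goes through the same function, opening and closing tags always match.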

Let's try another example with a different set of tags:

In [5]:
analysis_prompt = """
<instructions>
Analyze the following data from a hypothetical PDF about butterfly migration patterns.
</instructions>

<data>
Species: Monarch (Danaus plexippus)
Migration Distance: 3000 miles
Winter Location: Central Mexico
Population Trend: Declining (-80% since 1990s)
Main Threats: Habitat loss, pesticides, climate change
</data>

<request>
Please provide a concise analysis of the conservation challenges facing this species based on the data provided.
</request>
"""

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1000,
    temperature=0,
    messages=[
        {"role": "user", "content": analysis_prompt}
    ]
)

print(response.content[0].text)

# Analysis of Monarch Butterfly Conservation Challenges

Based on the provided data, the Monarch butterfly (Danaus plexippus) faces several significant conservation challenges:

1. **Severe Population Decline**: The species has experienced an alarming 80% population reduction since the 1990s, indicating a critical conservation situation.

2. **Extensive Migration Requirements**: Monarchs travel approximately 3,000 miles to reach their winter location in Central Mexico, making them vulnerable to disruptions along this extensive migration route.

3. **Multiple Concurrent Threats**:
   - **Habitat Loss**: Likely affecting both breeding grounds and overwintering sites
   - **Pesticide Exposure**: Particularly concerning for a species that relies on specific host plants
   - **Climate Change**: Potentially disrupting migration timing, breeding patterns, and habitat suitability

The combination of these factors creates a complex conservation challenge requiring coordinated efforts across the

### The effect of system prompts

Let's now see how a system prompt can tell Claude how it should act.

In [1]:
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1000,
    temperature=0,
    system = "",
    messages=[
        {"role": "user", "content": analysis_prompt}
    ]
)

print(response.content[0].text)

In [2]:
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1000,
    temperature=0,
    system = "",
    messages=[
        {"role": "user", "content": analysis_prompt}
    ]
)

print(response.content[0].text)

In [3]:
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1000,
    temperature=0,
    system = "",
    messages=[
        {"role": "user", "content": analysis_prompt}
    ]
)

print(response.content[0].text)
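
The three cells above leave `system` empty so that different system prompts can be tried on the same input. As one possible way to run that comparison, the sketch below wraps the call in a helper and defines two example system prompts. The helper name `ask_with_system` and the wording of both prompts are our own illustrations, not text from the workshop.

```python
def ask_with_system(client, system_prompt, user_prompt,
                    model="claude-3-7-sonnet-20250219", max_tokens=1000):
    """Send `user_prompt` together with a system prompt that sets Claude's role."""
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        temperature=0,
        system=system_prompt,  # top-level parameter, not an entry in `messages`
        messages=[{"role": "user", "content": user_prompt}],
    )
    return response.content[0].text


# Illustrative system prompts (our own wording) to compare on the same input.
SYSTEM_PROMPTS = {
    "scientist": "You are a conservation biologist. Answer precisely and technically.",
    "teacher": "You are explaining to ten-year-olds. Use short, simple sentences.",
}

# Example usage (needs API access, so it is left commented out):
# for name, system_prompt in SYSTEM_PROMPTS.items():
#     print(name, "->", ask_with_system(client, system_prompt, analysis_prompt))
```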

<a id='section4'></a>
## 4. Structured JSON Responses with XML Tags

When extracting data from PDFs, you often want standardized, structured responses that you can easily parse and work with. JSON is perfect for this purpose, and Claude can generate JSON responses when properly instructed.

Let's create a prompt that asks Claude to extract specific information and return it in JSON format:

In [6]:
json_extraction_prompt = """
<instructions>
Extract the following information from this hypothetical scientific paper excerpt and provide it in a structured JSON format.
</instructions>

<paper_excerpt>
Title: Ecological Impact of Native Pollinators on Agricultural Yields
Authors: Garcia, M., Zhang, L., & Patel, K. (2023)

Abstract: This study examined the relationship between native pollinator diversity and crop yields across 25 farms in the Midwest United States. We found that farms with higher native bee diversity (>15 species) experienced 23% higher yields in squash, 17% higher yields in apple orchards, and 14% higher yields in almond groves compared to farms with low diversity (<5 species). Additionally, farms implementing pollinator-friendly practices such as hedgerows and cover crops showed a 35% increase in wild pollinator visitation rates. Our economic analysis indicates that investing in native pollinator habitat can provide a return on investment of 3:1 over a five-year period through increased yields and reduced honeybee rental costs.

Keywords: agroecology, ecosystem services, native bees, crop pollination, biodiversity
</paper_excerpt>

<output_format>
Please return a JSON object with the following fields:
- title: The paper title
- authors: An array of author names
- year: Publication year
- study_location: Where the study was conducted
- sample_size: Number of farms studied
- key_findings: Array of the main numerical findings
- crops_studied: Array of crops mentioned
- keywords: Array of keywords
- economic_benefits: Any economic benefits mentioned
</output_format>

Your response should be valid JSON wrapped in <json></json> tags.
"""

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1000,
    temperature=0,
    messages=[
        {"role": "user", "content": json_extraction_prompt}
    ]
)

print(response.content[0].text)

I'll extract the requested information from the scientific paper excerpt and provide it in a structured JSON format.

<json>
{
  "title": "Ecological Impact of Native Pollinators on Agricultural Yields",
  "authors": ["Garcia, M.", "Zhang, L.", "Patel, K."],
  "year": 2023,
  "study_location": "Midwest United States",
  "sample_size": 25,
  "key_findings": [
    "Farms with higher native bee diversity (>15 species) experienced 23% higher yields in squash",
    "Farms with higher native bee diversity (>15 species) experienced 17% higher yields in apple orchards",
    "Farms with higher native bee diversity (>15 species) experienced 14% higher yields in almond groves",
    "Farms implementing pollinator-friendly practices showed a 35% increase in wild pollinator visitation rates"
  ],
  "crops_studied": ["squash", "apple", "almond"],
  "keywords": ["agroecology", "ecosystem services", "native bees", "crop pollination", "biodiversity"],
  "economic_benefits": "Investing in native pollinat

Now let's parse the JSON from Claude's response to work with it programmatically:

In [9]:
import json
import re

# Extract JSON from Claude's response using regex
def extract_json(text):
    # Look for JSON inside <json> tags
    match = re.search(r'<json>(.*?)</json>', text, re.DOTALL)
    if match:
        json_str = match.group(1).strip()
        try:
            return json.loads(json_str)
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON: {e}")
            return None
    return None

# Parse the JSON response
extracted_data = extract_json(response.content[0].text)

if extracted_data:
    print("Successfully extracted structured data:")
    print(extracted_data,'\n\n\n\n')
    print(f"Title: {extracted_data.get('title')}")
    print(f"Authors: {', '.join(extracted_data.get('authors', []))}")
    print(f"Crops studied: {', '.join(extracted_data.get('crops_studied', []))}")
    print(f"Economic benefits: {extracted_data.get('economic_benefits')}")
else:
    print("Failed to extract JSON from Claude's response")

Successfully extracted structured data:
{'title': 'Ecological Impact of Native Pollinators on Agricultural Yields', 'authors': ['Garcia, M.', 'Zhang, L.', 'Patel, K.'], 'year': 2023, 'study_location': 'Midwest United States', 'sample_size': 25, 'key_findings': ['Farms with higher native bee diversity (>15 species) experienced 23% higher yields in squash', 'Farms with higher native bee diversity (>15 species) experienced 17% higher yields in apple orchards', 'Farms with higher native bee diversity (>15 species) experienced 14% higher yields in almond groves', 'Farms implementing pollinator-friendly practices showed a 35% increase in wild pollinator visitation rates'], 'crops_studied': ['squash', 'apple', 'almond'], 'keywords': ['agroecology', 'ecosystem services', 'native bees', 'crop pollination', 'biodiversity'], 'economic_benefits': 'Investing in native pollinator habitat can provide a return on investment of 3:1 over a five-year period through increased yields and reduced honeybee r

### Tips for Getting Valid JSON Responses

1. **Be explicit about the format**: Clearly describe all fields you want and their expected types
2. **Provide examples**: When possible, show Claude an example of the expected output
3. **Request validation**: Ask Claude to ensure the JSON is valid before returning it
4. **Use XML tags**: Wrap the JSON request and response in XML tags to help Claude understand what you want
5. **Set temperature to 0**: This makes Claude's responses more deterministic and reduces JSON formatting errors
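
Even with these tips, it is worth checking the parsed result before using it. The sketch below compares a parsed dictionary against the fields and types we asked for. The field list is a subset of the output format requested earlier; the helper name `find_field_problems` and the sample data are assumptions for illustration.

```python
# Expected Python types for a subset of the requested fields.
EXPECTED_FIELDS = {
    "title": str,
    "authors": list,
    "year": int,
    "sample_size": int,
    "keywords": list,
}


def find_field_problems(data, expected=EXPECTED_FIELDS):
    """Return a list of readable problems; an empty list means the data passed."""
    problems = []
    for field, expected_type in expected.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(data[field]).__name__}"
            )
    return problems


# Made-up sample with a wrong type (year as text) and a missing field.
sample = {"title": "Example", "authors": ["Garcia, M."], "year": "2023", "keywords": []}
print(find_field_problems(sample))
# → ['year: expected int, got str', 'missing field: sample_size']
```

A non-empty list could be used to decide whether to retry the request or flag the document for manual review.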

<a id='section5'></a>
## 5. Reading PDFs

We will now learn how to read a PDF file and prepare it for sending to Anthropic. We read the PDF as a binary file, encode its bytes as Base64, and decode the result into a UTF-8 string:

In [10]:
import base64

def read_pdf_file(file_path):
    """Prepare a file for sending to the Anthropic API."""
    # Read the file as bytes
    with open(file_path, "rb") as file:
        pdf_data = base64.b64encode(file.read()).decode('utf-8')
    
    return pdf_data

Now let's use this function to read an example PDF:

In [11]:
pdf_data = read_pdf_file("example_pdf/deMedeiros2013Zootaxa.pdf")

In [12]:
pdf_data[:1000]

'JVBERi0xLjYNJeLjz9MNCjQxNyAwIG9iag08PC9MaW5lYXJpemVkIDEvTCA1OTY3MzQvTyA0MTkvRSAxMzI2Ny9OIDcvVCA1OTYyNTYvSCBbIDUzNCAyOTRdPj4NZW5kb2JqDSAgICAgICAgICAgICAgDQo0MzQgMCBvYmoNPDwvRGVjb2RlUGFybXM8PC9Db2x1bW5zIDQvUHJlZGljdG9yIDEyPj4vRmlsdGVyL0ZsYXRlRGVjb2RlL0lEWzw2N0IxQTk5OUYxMUZDQjBDMUQ0RDdDQjQ5RTRGMUY4OD48RTQ5M0E4MUI0MjI0QTU0RTk5M0Y4Rjc1RjA3NTIzMTY+XS9JbmRleFs0MTcgNTRdL0luZm8gNDE2IDAgUi9MZW5ndGggOTAvUHJldiA1OTYyNTcvUm9vdCA0MTggMCBSL1NpemUgNDcxL1R5cGUvWFJlZi9XWzEgMiAxXT4+c3RyZWFtDQpo3mJiZBBgYGJg1gESjLeBBMMtIMH2DEiwrgcSLEogQhHErQaxukDquEGELEhsOYhYBTJgH0gvSJYpCUicb2ZgYmQ0AIkxMA4MwfuCNB3/GVO/AwQYAMNTDYYNCmVuZHN0cmVhbQ1lbmRvYmoNc3RhcnR4cmVmDQowDQolJUVPRg0KICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIA0KNDcwIDAgb2JqDTw8L0MgMjAwL0UgMTg0L0ZpbHRlci9GbGF0ZURlY29kZS9JIDIyMi9MZW5ndGggMTkzL08gMTY4L1MgMTEzPj5zdHJlYW0NCmjeYmBgYGZgYPrIwMLAoLyRQZgBAYSBMqxAcY4NQgoFjI0JDEzSBvPMlRtFGAobpDjKIh8wMEjuvcDzWOHsu4ZkS5YX1csZjhWxBhgbA/UaGzAwKCkpMTBg6ANKyjNwvDAH0nxALAK2qoOBmyu0SnjncqM/DwSahTbUCSonsJS1O0wXXsD
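
The string above starts with `JVBERi0`, which is what the PDF header bytes `%PDF-` look like after Base64 encoding. That gives a cheap sanity check before sending a file. The sketch below assumes the input is a valid Base64 string of at least eight characters; `looks_like_pdf` is a hypothetical helper name, and the demonstration uses made-up bytes instead of a real file.

```python
import base64


def looks_like_pdf(b64_data):
    """Return True if the Base64 string decodes to bytes starting with '%PDF-'."""
    # 8 Base64 characters decode to exactly 6 bytes, enough to cover the header.
    head = base64.b64decode(b64_data[:8])
    return head.startswith(b"%PDF-")


# Self-contained demonstration with made-up bytes instead of a real file.
fake_pdf = base64.b64encode(b"%PDF-1.6\r\n(not a real document)").decode("utf-8")
not_pdf = base64.b64encode(b"hello, world").decode("utf-8")

print(fake_pdf[:7], looks_like_pdf(fake_pdf), looks_like_pdf(not_pdf))
# → JVBERi0 True False
```

With the notebook's own data, the call would be `looks_like_pdf(pdf_data)`.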

<a id='section6'></a>
## 6. Sending PDFs to Anthropic

Anthropic's documentation suggests placing the file before the instructions. We also have to declare that we are sending a PDF, which is done with a `document` content block whose `media_type` is `application/pdf`:

In [13]:
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1000,
    temperature=0,
    messages=[
        {"role": "user", "content": [{"type":"document",
                                     "source":{
                                         "type":"base64",
                                         "media_type":"application/pdf",
                                         "data":pdf_data
                                     }},{
            "type":"text","text":"What is this paper about?"
                                     }]}
    ]
)

print(response.content[0].text)

This paper describes three new species of weevils in the genus Anchylorhynchus from Colombia. The authors, Bruno A. S. de Medeiros and Luis A. Núñez-Avellaneda, provide detailed descriptions of:

1. Anchylorhynchus pinocchio sp. nov. - Named after the character Pinocchio due to its extremely elongated rostrum (snout)
2. Anchylorhynchus centrosquamatus sp. nov. - Named for its centrally directed pronotal scales
3. Anchylorhynchus luteobrunneus sp. nov. - Named for its yellow and brown coloration

The paper includes detailed morphological descriptions of each species, including measurements, distinctive features, and male genitalia. It also provides information about their biology - all three species are found in inflorescences (flower clusters) of Syagrus palm species, where they act as pollinators. The adults feed on pollen and the larvae develop inside fruits, consuming the endosperm and causing fruit abortion.

The authors note that these weevils were discovered during complementary 
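
The nested dictionaries in the request are easy to mistype, so one option is to build them in a small function that also fixes the order: document first, instructions second. This is a sketch that mirrors the structure of the request above; `build_pdf_content` is a hypothetical name, and the short string passed in the demonstration is a placeholder, not real PDF data.

```python
def build_pdf_content(pdf_data, prompt):
    """Return the content blocks for one user turn: the PDF first, then the text."""
    return [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": pdf_data,
            },
        },
        {"type": "text", "text": prompt},
    ]


# Placeholder string used only to show the resulting structure.
content = build_pdf_content("JVBERi0...", "What is this paper about?")
print([block["type"] for block in content])
# → ['document', 'text']
```

The result would be sent as `messages=[{"role": "user", "content": content}]`.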

<a id='section7'></a>
## 7. Explicit Analysis Prompts

For complex PDF data extraction, it's helpful to guide Claude's analysis with explicit, step-by-step instructions. This is especially important for documents with specialized content or complex structure.

Let's create a prompt template for detailed scientific paper analysis:

In [14]:
scientific_paper_prompt = """
<instructions>
I'm sharing a scientific paper as a PDF. Please perform a structured analysis following these steps:

1. First, identify the paper's metadata (title, authors, journal, publication date, DOI)

2. Next, identify and summarize each major section in the paper:
   - Abstract
   - Introduction
   - Methods
   - Results
   - Discussion
   - Conclusion

3. For the Methods section, extract:
   - Experimental design
   - Sample size and characteristics
   - Equipment or techniques used
   - Statistical methods

4. For the Results section, extract:
   - Key numerical findings and statistics (p-values, effect sizes)
   - Data from tables and figures (describe what each shows)
   - Any unexpected or highlighted findings

5. For the Discussion/Conclusion, extract:
   - Main interpretations of the results
   - Limitations acknowledged by the authors
   - Implications for the field
   - Suggestions for future research
</instructions>

<format>
Provide your analysis in structured JSON format wrapped in <json></json> tags.
</format>
"""

Let's try it out now:

In [15]:
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1000,
    temperature=0,
    messages=[
        {"role": "user", "content": [{"type":"document",
                                     "source":{
                                         "type":"base64",
                                         "media_type":"application/pdf",
                                         "data":pdf_data
                                     }},{
            "type":"text","text":scientific_paper_prompt
                                     }]}
    ]
)

print(response.content[0].text)

I'll analyze this scientific paper for you in a structured format.

<json>
{
  "metadata": {
    "title": "Three new species of Anchylorhynchus Schoenherr, 1836 from Colombia (Coleoptera: Curculionidae; Curculioninae; Acalyptini)",
    "authors": ["Bruno A. S. de Medeiros", "Luis A. Núñez-Avellaneda"],
    "journal": "Zootaxa",
    "publication_date": "4 April 2013",
    "volume": "3636 (2)",
    "pages": "394-400",
    "doi": "10.11646/zootaxa.3636.2.10"
  },
  "sections": {
    "abstract": {
      "summary": "The paper describes three new species of the genus Anchylorhynchus from Colombia: A. pinocchio sp. nov., A. centrosquamatus sp. nov., and A. luteobrunneus sp. nov. The authors provide morphological descriptions including male genitalia for each species and compare them with similar species in the genus. All three species are found in inflorescences of Syagrus Mart. (Arecaceae). The adults are pollinators and the larvae develop inside fruits, feeding on the endosperm, which inter
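
The printed output above stops partway through the JSON. From the text alone we cannot tell whether that is display truncation or the `max_tokens` limit being reached. The Messages API reports why generation stopped in the response's `stop_reason` field, with the value `"max_tokens"` when the limit was hit. The sketch below checks that field; the helper name `hit_token_limit` is our own.

```python
def hit_token_limit(response):
    """Return True if the reply stopped because it reached max_tokens."""
    return response.stop_reason == "max_tokens"


# Example usage with a response from client.messages.create(...):
# if hit_token_limit(response):
#     print("Reply was cut off; any JSON in it is probably incomplete. "
#           "Consider raising max_tokens or asking for less output.")
```

Checking this before parsing explains why a `</json>` closing tag might be missing from an otherwise well-formed reply.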

Let's try a different prompt to get the data for each species in a structured form:

In [24]:
taxonomic_prompt = """
<instructions>
I'm sharing a scientific paper as a PDF. Please perform a structured analysis following these steps:

1. First, identify all of the species described or redescribed in the paper

2. Next, analyze the structure of the descriptions, including how formatting relates to the relationship between characters.

3. Next, list only five random characters described across all species. Use only five because we are testing currently. Use the following format for characters: [life stage],[sex],[major body area],[specific body area], [feature], [units]

4. Finally, construct a database in JSON format listing the character states for the three species, following this example. Include characters exactly as in the list you compiled initially, to make sure the same characters are described for all species. The output should be structured to facilitate building a matrix of morphological characters.
Example:
{"species A": {"adult, both sexes, head, rostrum, number of carinae": "3",
               "adult, female, abdomen, ventrite III shape": "curved",
               "adult, male, body, pronotum+elytra, length, mm": "7-8",
               ...},
 "species B": {..., ..., ...}}
  
5. For testing purposes, include only five characters and their respective traits for now.
</instructions>


<format>
The initial analysis can be unstructured. Wrap it in <analysis></analysis> tags.

Provide your final response in structured JSON format wrapped in <json></json> tags.
</format>
"""

Let's try this out now.

In [25]:
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=5000,
    temperature=0,
    messages=[
        {"role": "user", "content": [{"type":"document",
                                     "source":{
                                         "type":"base64",
                                         "media_type":"application/pdf",
                                         "data":pdf_data
                                     }},{
            "type":"text","text":taxonomic_prompt
                                     }]}
    ]
)

print(response.content[0].text)

<analysis>
# Analysis of the Scientific Paper

## 1. Species Described in the Paper

The paper describes three new species of the genus Anchylorhynchus from Colombia:
1. Anchylorhynchus pinocchio sp. nov.
2. Anchylorhynchus centrosquamatus sp. nov.
3. Anchylorhynchus luteobrunneus sp. nov.

## 2. Structure of the Descriptions

The descriptions follow a standard taxonomic format with:

- **Holotype information**: Includes collection data, location, collector, host plant, and repository information.
- **Paratype information**: Similar to holotype but for additional specimens.
- **Description**: Detailed morphological description with measurements and proportions, often separated by sex (♂/♀).
- **Etymology**: Explanation of the species name.
- **Remarks**: Comparison with similar species and distinguishing characteristics.
- **Biological information**: Behavioral and ecological data.
- **Host species**: Plant associations.
- **Type locality**: Specific collection location.
- **Geographic

### Parsing the output

In [26]:
import json
import pandas as pd
import re

def extract_json_from_response(response_text):
    """
    Extracts JSON content from between <json> and </json> tags in the response text.
    
    Args:
        response_text (str): The complete text response from the API
        
    Returns:
        dict: The parsed JSON object, or None if no valid JSON is found
    """
    # Find the JSON content between <json> and </json> tags
    json_pattern = re.compile(r'<json>(.*?)</json>', re.DOTALL)
    json_match = json_pattern.search(response_text)
    
    if not json_match:
        print("No JSON data found in the response.")
        return None
    
    # Extract and parse JSON
    json_text = json_match.group(1)
    try:
        data = json.loads(json_text)
        return data
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON: {e}")
        return None

def transform_json_to_dataframe(json_data):
    """
    Transforms a nested JSON object of taxonomic data into a pandas DataFrame.
    
    Args:
        json_data (dict): The parsed JSON object with species as keys
        
    Returns:
        pandas.DataFrame: A DataFrame with one row per species and columns for attributes
    """
    if json_data is None:
        return pd.DataFrame()
    
    # Transform the nested JSON into a flattened dataframe structure
    rows = []
    
    for species, attributes in json_data.items():
        species_dict = {"species": species}
        for attr_key, attr_value in attributes.items():
            species_dict[attr_key] = attr_value
        rows.append(species_dict)
    
    # Create DataFrame
    df = pd.DataFrame(rows)
    
    return df

In [27]:
json_out = extract_json_from_response(response.content[0].text)
print(json_out)

{'Anchylorhynchus pinocchio': {'adult, both sexes, head, rostrum, number of longitudinal carinae, count': '7', 'adult, male, body, pronotum+elytra, length, mm': '4.8-5.7', 'adult, both sexes, pronotum, median basal area, scale direction, description': 'directed toward base', 'adult, male, genitalia, aedeagus, length to width ratio, ratio': '2.3', 'adult, both sexes, antennae, second antennomere of funicle, length relative to first, description': 'longer than first'}, 'Anchylorhynchus centrosquamatus': {'adult, both sexes, head, rostrum, number of longitudinal carinae, count': '7', 'adult, male, body, pronotum+elytra, length, mm': '5.3-5.9', 'adult, both sexes, pronotum, median basal area, scale direction, description': 'directed either to the center or obliquely to center-base', 'adult, male, genitalia, aedeagus, length to width ratio, ratio': '1.9', 'adult, both sexes, antennae, second antennomere of funicle, length relative to first, description': 'longer than first'}, 'Anchylorhynch

In [28]:
from IPython.display import display
df = transform_json_to_dataframe(json_out)
display(df)

| | species | adult, both sexes, head, rostrum, number of longitudinal carinae, count | adult, male, body, pronotum+elytra, length, mm | adult, both sexes, pronotum, median basal area, scale direction, description | adult, male, genitalia, aedeagus, length to width ratio, ratio | adult, both sexes, antennae, second antennomere of funicle, length relative to first, description |
|---|---|---|---|---|---|---|
| 0 | Anchylorhynchus pinocchio | 7 | 4.8-5.7 | directed toward base | 2.3 | longer than first |
| 1 | Anchylorhynchus centrosquamatus | 7 | 5.3-5.9 | directed either to the center or obliquely to ... | 1.9 | longer than first |
| 2 | Anchylorhynchus luteobrunneus | 7 | 3.9-4.7 | directed toward base | 2.0 | longer than first |
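
All values in this table are strings, including measurements such as `4.8-5.7`. For numerical analysis they need to be converted. The sketch below is one possible approach, assuming ranges are written with a plain hyphen as in the output above; `parse_range` is a hypothetical helper, and descriptive text is returned as `None` so it can be left untouched.

```python
def parse_range(value):
    """Parse '4.8-5.7' into (4.8, 5.7) and a single number like '2.3' into (2.3, 2.3).

    Returns None for text that is not a number or a simple hyphenated range.
    """
    parts = [part.strip() for part in str(value).split("-")]
    try:
        numbers = [float(part) for part in parts]
    except ValueError:
        return None
    if len(numbers) == 1:
        return (numbers[0], numbers[0])
    if len(numbers) == 2:
        return (numbers[0], numbers[1])
    return None


print(parse_range("4.8-5.7"))               # → (4.8, 5.7)
print(parse_range("2.3"))                   # → (2.3, 2.3)
print(parse_range("directed toward base"))  # → None
```

On the DataFrame, a column could be converted with something like `df[column].map(parse_range)`.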


<a id='section8'></a>
## 8. Practical Exercise: Extract Data from Your PDFs

Now it's time to apply what you've learned to extract data from your own PDFs. In this exercise, we'll create a function that helps you easily extract data from any PDF using the patterns we've explored.

Let's implement a reusable function for PDF analysis:

In [None]:
def send_pdf_for_analysis(pdf_path, prompt, model="claude-3-7-sonnet-20250219"):
    """
    Send a PDF to Claude for analysis with a custom prompt.
    
    Args:
        pdf_path (str): Path to the PDF file
        prompt (str): The prompt to send to Claude
        model (str): The Claude model to use
        
    Returns:
        str: Claude's response
    """
    # Read and prepare the PDF
    pdf_data = read_pdf_file(pdf_path)
    
    # Create the message with both text and PDF content
    response = client.messages.create(
        model=model,
        max_tokens=4000,  # More tokens for PDF analysis
        temperature=0,
        messages=[
            {"role": "user", "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data
                    }
                },
                {
                    "type": "text",
                    "text": prompt
                }
            ]}
        ]
    )
    
    return response.content[0].text

# Example usage:
pdf_path = "example_pdf/deMedeiros2013Zootaxa.pdf"

# Create a custom prompt for taxonomic data extraction
custom_prompt = """
<instructions>
I'm sharing a scientific paper as a PDF. Please extract the following information:
- The names of all new species described in this paper
- For each species, extract the etymology (origin of the name)
- For each species, extract the host plant associations
</instructions>

<output_format>
Please return your analysis in a structured JSON format with the following schema:
{
  "species": [
    {
      "name": "Scientific name",
      "etymology": "Explanation of name origin",
      "host_plants": ["List of host plants"]
    }
  ]
}
Wrap the JSON in <json></json> tags.
</output_format>
"""

# Run the extraction
try:
    result = send_pdf_for_analysis(pdf_path, custom_prompt)
    print(result)
    
    # Parse the resulting JSON for further processing
    json_data = extract_json_from_response(result)
    if json_data:
        print("\nSuccessfully extracted species data:")
        for species in json_data.get("species", []):
            print(f"- {species['name']}")
            print(f"  Etymology: {species['etymology']}")
            print(f"  Host plants: {', '.join(species['host_plants'])}")
    
except Exception as e:
    print(f"Error processing PDF: {e}")

## Workshop Conclusion

In this workshop, we've covered several key techniques for extracting structured data from PDFs using Claude:

1. **API Fundamentals**: We learned how to set up and use the Anthropic API securely, managing API keys and making basic requests.

2. **Prompt Engineering**: We explored how to use XML tags to structure prompts, making them more effective and easier for Claude to understand.

3. **Structured Responses**: We implemented techniques to get JSON-formatted responses that can be easily parsed and integrated into data pipelines.

4. **PDF Processing**: We learned how to read and encode PDFs to send them to Claude through the API.

5. **Specialized Analysis**: We created detailed prompts that guide Claude through complex analytical tasks, such as extracting specific data from scientific papers.

6. **Data Transformation**: We built functions to parse and transform Claude's responses into usable data structures like pandas DataFrames.

These skills can be applied to a wide range of use cases including:

- Scientific literature review and data extraction
- Taxonomic database creation from published descriptions
- Legal document analysis
- Financial data extraction
- Metadata extraction from technical documents
- Content summarization and knowledge base creation

To continue building on what you've learned, try:

1. Creating specialized prompts for your specific domain or document types
2. Building workflows that process multiple PDFs in batches
3. Combining Claude's PDF extraction capabilities with other tools for downstream analysis
4. Refining your prompts to extract even more precise information
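
For the batch workflow suggested in the list above, a simple starting point is a loop that records failures without stopping. The sketch below is an assumption about how such a loop might look; `process_pdfs` is a hypothetical helper, and the demonstration uses a stand-in analyzer so it runs without API access.

```python
def process_pdfs(pdf_paths, analyze):
    """Run `analyze(path)` on each path, collecting results and errors separately.

    `analyze` is any callable that takes a path and returns parsed data, for
    example a wrapper around send_pdf_for_analysis and extract_json_from_response.
    """
    results, errors = {}, {}
    for path in pdf_paths:
        try:
            results[path] = analyze(path)
        except Exception as exc:  # keep going so one bad file does not stop the batch
            errors[path] = str(exc)
    return results, errors


# Demonstration with a stand-in analyzer instead of a real API call.
def fake_analyze(path):
    if not path.endswith(".pdf"):
        raise ValueError("not a PDF")
    return {"species": []}


results, errors = process_pdfs(["a.pdf", "notes.txt"], fake_analyze)
print(results)  # → {'a.pdf': {'species': []}}
print(errors)   # → {'notes.txt': 'not a PDF'}
```

With the notebook's functions, `analyze` could be `lambda p: extract_json_from_response(send_pdf_for_analysis(p, custom_prompt))`.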

Remember that the quality of your prompt significantly impacts the quality of data extraction. Be specific about what you want to extract, provide clear formatting instructions, and use XML tags to organize your requests.