# Trieve PDF Upload With Default Chunking Example

At Trieve, we have a default chunking algorithm that converts documents to HTML with Apache Tika, then splits them on headings, falling back to recursive splitting when a chunk gets too large. You can look at the code for our default chunking approach [here](https://github.com/devflowinc/trieve/blob/main/server/server-python/chunker.py). It serves as a sane default for the majority of use cases.
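
To make the idea concrete, here is a minimal sketch of heading-based splitting with a recursive fallback. It is not Trieve's actual chunker (see the linked source for that); the function names `chunk_html` and `split_recursively`, the regex-based tag handling, and the `max_chars` limit are all assumptions chosen for illustration.

```python
import re

# Zero-width match just before any opening heading tag (<h1> ... <h6>).
HEADING_START = re.compile(r"(?=<h[1-6][^>]*>)", re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def split_recursively(text, max_chars):
    """Halve text near its midpoint until every piece fits max_chars."""
    if len(text) <= max_chars:
        return [text]
    middle = len(text) // 2
    # Prefer the last space before the midpoint so words stay intact.
    cut = text.rfind(" ", 0, middle)
    if cut <= 0:
        cut = middle  # no space found: fall back to a hard cut
    left, right = text[:cut].strip(), text[cut:].strip()
    return split_recursively(left, max_chars) + split_recursively(right, max_chars)


def chunk_html(html, max_chars=500):
    """Split HTML into one section per heading, then cap each section's size."""
    chunks = []
    for section in HEADING_START.split(html):
        # Drop tags and collapse whitespace to get plain text.
        text = " ".join(TAG.sub(" ", section).split())
        if text:
            chunks.extend(split_recursively(text, max_chars))
    return chunks


html = "<h1>Intro</h1><p>Hello world.</p><h2>Details</h2><p>" + "word " * 50 + "</p>"
for chunk in chunk_html(html, max_chars=60):
    print(len(chunk), chunk[:30])
```

A real chunker would likely use an HTML parser and split on sentence boundaries rather than spaces; the regexes here only keep the sketch short.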

However, we do strongly encourage you to invest in chunking your own data. In our opinion, this is the most valuable thing you can do for increasing the performance of your RAG or search pipeline. Feel free to join the community on [Discord](https://discord.gg/E9sPRZqpDT) or [Matrix](https://matrix.to/#/#trieve-general:matrix.zerodao.gg) to get help from Trieve core developers directly. We are always happy to help! 

## 1. Download a PDF to upload to Trieve

For this example, we will use the famous "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" paper, which can be found [here](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922). It's a very interesting publication if you haven't already read it.

In [2]:
import requests
import os

def download_and_save_pdf(url, save_path):
    # Ensure the save_path directory exists, if not, create it.
    os.makedirs(os.path.dirname(save_path), exist_ok=True)
    
    # Get the PDF content from the URL
    response = requests.get(url)
    
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Open the target file in binary write mode and save the PDF content
        with open(save_path, 'wb') as file:
            file.write(response.content)
        print(f'PDF has been downloaded and saved to {save_path}')
    else:
        print(f'Failed to download PDF. Status code: {response.status_code}')

# Example usage
pdf_url = 'https://dl.acm.org/doi/pdf/10.1145/3442188.3445922'
save_path = './tmp/stochastic_parrots.pdf'
download_and_save_pdf(pdf_url, save_path)


PDF has been downloaded and saved to ./tmp/stochastic_parrots.pdf
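
A server can answer with status 200 and still return an HTML page (for example a login or bot-check page) instead of the PDF. As an optional sanity check, the sketch below tests whether a saved file begins with the `%PDF-` signature that PDF files start with. The helper name `looks_like_pdf` is hypothetical and not part of the original notebook.

```python
def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF signature."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```

Calling `looks_like_pdf(save_path)` after the download gives an early warning before the file is uploaded.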


## 2. Get an API Key For Trieve

1. Sign up for a free tier account at [dashboard.trieve.ai](https://dashboard.trieve.ai), which comes with 512 MB of free file storage
2. Create a new dataset
3. Create a `Read - Write` level API Key

Finally, copy and paste your API key and dataset ID into the relevant values below.

In [3]:

trieve_dataset_id = "2330f358-3ad5-485f-b83e-351496128859"
trieve_api_key = "tr-rtxQmkhSPe06hCDNynCe9DBbbRrSpVbA"
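
If you would rather not keep credentials in the notebook itself, one option is to read them from environment variables. This is only a sketch: the variable names `TRIEVE_DATASET_ID` and `TRIEVE_API_KEY` and the helper `load_trieve_credentials` are assumptions made for this example, not names that Trieve defines.

```python
import os


def load_trieve_credentials(env=os.environ):
    """Read the dataset ID and API key from environment variables.

    The variable names are hypothetical; pick whatever fits your setup.
    """
    names = ("TRIEVE_DATASET_ID", "TRIEVE_API_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["TRIEVE_DATASET_ID"], env["TRIEVE_API_KEY"]
```

With both variables set, `trieve_dataset_id, trieve_api_key = load_trieve_credentials()` yields the same two values the cell above assigns by hand.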

## 3. Install and import the Trieve client Python package

In [4]:
%pip install trieve_client_py

Defaulting to user installation because normal site-packages is not writeable
Collecting trieve_client_py
  Using cached trieve_client_py-0.3.21-py3-none-any.whl (171 kB)
Installing collected packages: trieve_client_py
Successfully installed trieve_client_py-0.3.21
Note: you may need to restart the kernel to use updated packages.


## 4. Initialize an authenticated trieve_client

In [5]:
from trieve_client import AuthenticatedClient

client = AuthenticatedClient(
    base_url="https://api.trieve.ai", prefix="", token=trieve_api_key
).with_headers({"TR-Dataset": trieve_dataset_id})

## 5. base64 encode your file to prepare it for upload

In [6]:
import base64

def to_base64(file):
    try:
        with open(file, 'rb') as f:
            return base64.b64encode(f.read()).decode('utf-8')  
    except Exception as error:
        print(f"An error occurred: {error}")
        return None

base64_file = to_base64(save_path or "")

if base64_file:
    base64_file = base64_file.replace("+", "-")  # Convert '+' to '-'
    base64_file = base64_file.replace("/", "_")  # Convert '/' to '_'
    base64_file = base64_file.rstrip('=')  # Remove ending '='
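
The three replacements above produce unpadded URL-safe base64. Python's standard library can do the same character substitution directly with `base64.urlsafe_b64encode`; the sketch below compares the two routes. The helper names `to_urlsafe_base64` and `manual_urlsafe_base64` are made up for this comparison.

```python
import base64


def to_urlsafe_base64(data: bytes) -> str:
    """Encode bytes as URL-safe base64 ('-' and '_') without '=' padding."""
    return base64.urlsafe_b64encode(data).decode("utf-8").rstrip("=")


def manual_urlsafe_base64(data: bytes) -> str:
    """Same result via standard base64 plus the manual replacements."""
    encoded = base64.b64encode(data).decode("utf-8")
    return encoded.replace("+", "-").replace("/", "_").rstrip("=")


sample = bytes(range(256))
print(to_urlsafe_base64(sample) == manual_urlsafe_base64(sample))  # True
```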


## 6. Send your file to Trieve

In [7]:
from trieve_client.models import UploadFileData, UploadFileResult
from trieve_client.api.file import upload_file_handler
from trieve_client.models.error_response_body import ErrorResponseBody

file_upload_response = upload_file_handler.sync(
    client=client,
    tr_dataset=trieve_dataset_id,
    body=UploadFileData(
      file_name="stochastic_parrots.pdf",
      file_mime_type="application/pdf",
      base64_file=base64_file
    )
)

if isinstance(file_upload_response, UploadFileResult):
  print("File has been chunked and chunks have been queued for indexing")
  uploaded_file_id = file_upload_response.file_metadata.id
  print(f"File ID: {uploaded_file_id}")
elif isinstance(file_upload_response, ErrorResponseBody):
  print(f"Failed to upload file: {file_upload_response.message}")

KeyError: 'user_id'

## 7. Get the chunk_group created with your file's chunks

#### NOTE: Soon we plan to revamp Trieve's [get_events](https://api.trieve.ai/redoc#tag/events/operation/get_events) route with webhook support and other niceties so this hacky approach is no longer necessary

When a file is uploaded all of its chunks are added to a `chunk_group` within our data schema. To check all of the files you have uploaded, you can paginate through the available `chunk_groups`. 

Below is an example of doing that to locate the `chunk_group` for the file you just uploaded. 

In [44]:
import time

from trieve_client.api.chunk_group import get_specific_dataset_chunk_groups
from trieve_client.models.group_data import GroupData

page = 1
file_group_id = None

while file_group_id is None:
  chunk_groups_response = get_specific_dataset_chunk_groups.sync(
      client=client,
      tr_dataset=trieve_dataset_id,
      dataset_id=trieve_dataset_id,
      page=page,
  )
  if not isinstance(chunk_groups_response, GroupData):
    print(f"Failed to fetch chunk groups: {chunk_groups_response.message}")
    break  # stop instead of retrying the failing request forever

  groups = chunk_groups_response.groups
  for group in groups:
    if group.file_id == uploaded_file_id:
      file_group_id = group.id
      break

  if file_group_id is None:
    if len(groups) == 10:
      page += 1  # full page: the group may be on the next one
    else:
      time.sleep(1)  # last page: the group may not exist yet, so wait before re-checking

file_group_id


'd97f9b44-748d-4a93-ac38-b5b9bdb58bae'
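
The pagination pattern used above can be sketched without the Trieve client. The example below assumes pages are 1-indexed and hold up to `page_size` items, so a shorter page is the last one. `find_in_pages`, `fetch_page`, and the in-memory `fake_groups` data are hypothetical stand-ins for illustration, not part of `trieve_client`.

```python
def find_in_pages(fetch_page, matches, page_size=10, max_pages=100):
    """Walk 1-indexed pages until an item matches or the pages run out."""
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        for item in items:
            if matches(item):
                return item
        if len(items) < page_size:
            return None  # a short page is the last one
    return None


# Fake data standing in for the groups an API would return.
fake_groups = [{"id": f"group-{i}", "file_id": f"file-{i}"} for i in range(25)]


def fetch_page(page):
    return fake_groups[(page - 1) * 10 : page * 10]


found = find_in_pages(fetch_page, lambda g: g["file_id"] == "file-23")
print(found["id"])
```

Unlike the cell above, this sketch returns `None` once the last page has been checked rather than waiting for the group to appear, and `max_pages` bounds the number of requests.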

## 8. Semantic Search Within the File's chunk_group

Beyond searching the whole dataset with the [search_chunk](https://api.trieve.ai/redoc#tag/chunk/operation/search_chunk) route, you also have the ability to search within only the subset of chunks inside of a `chunk_group`. 

In the context of this example, those will be the chunks belonging to the `stochastic_parrots.pdf` paper. 

Also, please be aware that all of this functionality is available through Trieve's admin UIs at [search.trieve.ai](https://search.trieve.ai) and [chat.trieve.ai](https://chat.trieve.ai).

In [45]:
from trieve_client.api.chunk_group import search_groups
from trieve_client.models.search_groups_data import SearchGroupsData
from trieve_client.models.search_groups_result import SearchGroupsResult

search_groups_response = search_groups.sync(
    client=client,
    tr_dataset=trieve_dataset_id,
    body=SearchGroupsData(
        group_id=file_group_id,
        search_type="semantic",
        query="the language models in this scenario are incapable of encapsulating any form of reasoning"
    )
)

if isinstance(search_groups_response, SearchGroupsResult):
  print(search_groups_response.bookmarks)
else:
  print(f"Failed to search groups: {search_groups_response.message}")


[ScoreChunkDTO(metadata=[ChunkMetadataWithFileData(content=' FAccT ’21, March 3–10, 2021, Virtual Event, Canada Bender and Gebru, et al. in [21, 93] and direct resources away from efforts that would facili- tate long-term progress towards natural language understanding, without using unfathomable training data. Furthermore, the tendency of human interlocutors to impute meaning where there is none can mislead both NLP researchers and the general public into taking synthetic text as meaningful. Combined with the ability of LMs to pick up on both subtle biases and overtly abusive language patterns in training data, this leads to risks of harms, including encountering derogatory language and experiencing discrimination at the hands of others who reproduce racist, sexist, ableist, extremist or other harmful ideologies rein- forced through interactions with synthetic language. We explore these potential harms in §6 and potential paths forward in §7. We hope that a critical overview of the ri