# All of the Above, but More

So far, we have covered two fundamental concepts: embeddings and similarity calculations. Together, they enable capabilities that are important in language tasks and that form the foundation of agentic capabilities ([OpenAI, 2025](https://platform.openai.com/docs/guides/embeddings)):

- Search
- Clustering
- Recommendations
- Anomaly detection
- Diversity measurement
- Classification

## Document RAG

Some of these tasks are related to Retrieval-Augmented Generation (RAG). The diagram below depicts how to split a document into chunks, compute each chunk's embedding, and store the embeddings in a vector DB.

<img src="./img/02_document_rag_embed.png" width = 800>

Once the embeddings are stored, we can take a query, embed it, and use proximity search to find the nearest chunk. The chunk (and other related data) becomes the context in the prompt sent to an LLM.


<img src="./img/02_document_rag_query.png" width = 800>
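
The query step can be sketched in a few lines of plain Python. This is a minimal illustration, not a production retriever: the three-dimensional vectors, the sample texts, and the helper names (`cosine_similarity`, `nearest_chunks`, `build_prompt`) are all made up for the example, and a real setup would use model-generated embeddings and a vector DB.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_chunks(query_vec, store, k=1):
    """Return the text of the k chunks most similar to the query.

    `store` is a list of (chunk_text, embedding) pairs standing in for a vector DB.
    """
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

def build_prompt(question, context_chunks):
    """Place the retrieved chunks in the prompt as context for the LLM."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Toy "embeddings" (hypothetical values, for illustration only)
store = [
    ("Review of a trip-hop album.", [0.9, 0.1, 0.0]),
    ("Review of a black metal album.", [0.1, 0.9, 0.0]),
    ("Review of a jazz album.", [0.0, 0.1, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # pretend this is the embedded query

top = nearest_chunks(query_vec, store, k=1)
prompt = build_prompt("Which album is trip-hop?", top)
print(prompt)
```

Sorting the whole store is fine for a toy example; vector databases use approximate nearest-neighbour indexes to avoid scanning every chunk.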

# Introducing LangChain

[LangChain](https://www.langchain.com/) is a set of tools that supports agent engineering across models. Among the many options available, it is a useful and popular library.

Some useful resources are:

- [LangChain Documentation](https://docs.langchain.com/).
- [Directory of LangChain Resources](https://www.langchain.com/resources).

## Document Splitting 

Document splitting, or chunking, is usually the first step in any RAG setup. The idea is that we want to split documents into smaller sections to:

- Comply with the model's context length constraints.
- Enhance search quality.
- Reduce latency.
- Control costs.

<img src="./img/02_document_rag_embed.png" width = 900>


LangChain contains a family of [document loaders](https://python.langchain.com/docs/integrations/document_loaders/). Each document loader has its own set of parameters, but they all implement the `.load()` method. A few examples include:

### Common File Types

- [CSVLoader](https://python.langchain.com/docs/integrations/document_loaders/csv): CSV files
- [DirectoryLoader](https://python.langchain.com/docs/how_to/document_loader_directory): All files in a given directory.
- [Unstructured](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file): Many file types (see https://docs.unstructured.io/platform/supported-file-types)
- [JSONLoader](https://python.langchain.com/docs/integrations/document_loaders/json): JSON files

### PDF

- [PyPDF](https://python.langchain.com/docs/integrations/document_loaders/pypdfloader): Uses [pypdf](https://pypi.org/project/pypdf/) to load and parse PDFs (Package).
- [Unstructured](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file): Uses [Unstructured's](https://pypi.org/project/unstructured/) open source library to load PDFs (Package).
- [PDFPlumber](https://python.langchain.com/docs/integrations/document_loaders/pdfplumber): Loads PDF files using [PDFPlumber](https://pypi.org/project/pdfplumber/) (Package).


### Web Pages

- [Web](https://python.langchain.com/docs/integrations/document_loaders/web_base): Uses urllib and BeautifulSoup to load and parse HTML web pages (Package).
- [Unstructured](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file): Uses [Unstructured](https://pypi.org/project/unstructured/) to load and parse web pages (Package).
- [RecursiveURL](https://python.langchain.com/docs/integrations/document_loaders/recursive_url): Recursively scrapes all child links from a root URL (Package).



## JSONLoader

[JSONLoader](https://python.langchain.com/docs/integrations/document_loaders/json/) implements a JSON (including JSON Lines) document loader. JSONLoader uses [`jq`](https://jqlang.org/) to specify how to select the data from each document.

A few notes on the code below:

- `jq_schema="."` indicates that we will read all keys from each JSON line. The [`jq specification`](https://jqlang.org/manual/#basic-filters) affords flexible filtering. 
- `content_key="content"` is required when more than one key is included in `jq_schema`.
- `json_lines=True` means that the file is a [JSON Lines file](https://jsonlines.org/). Each line of a JSON Lines file is a complete, valid JSON value.
- `metadata_func=get_metadata` indicates that we want to use the function `get_metadata()` to extract metadata from the filtered JSON line.
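
To make these parameters concrete, here is a rough, stdlib-only approximation of what they ask for. It is not how `JSONLoader` is implemented: the two records are made up, the output is a plain dict instead of a `Document`, and `jq_schema="."` is mimicked by simply keeping the whole parsed record.

```python
import json

# Two made-up JSON Lines records with the keys used in this notebook
raw = (
    '{"reviewid": 1, "content": "First review."}\n'
    '{"reviewid": 2, "content": "Second review."}\n'
)

def get_metadata(record: dict, metadata: dict) -> dict:
    # Same idea as metadata_func: copy selected keys into the metadata
    metadata["reviewid"] = record.get("reviewid")
    return metadata

docs = []
for seq_num, line in enumerate(raw.splitlines(), start=1):
    record = json.loads(line)  # json_lines=True: one JSON value per line
    metadata = get_metadata(record, {"seq_num": seq_num})
    # content_key="content": this key becomes the document text
    docs.append({"page_content": record["content"], "metadata": metadata})

print(docs[0])
```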

In [1]:
%load_ext dotenv
%dotenv ../../05_src/.secrets

In [2]:
from langchain_community.document_loaders import JSONLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import os

In [3]:
def get_metadata(record:dict, metadata: dict) -> dict:
    metadata['reviewid'] = record.get('reviewid')
    return metadata

loader = JSONLoader("../../05_src/documents/pitchfork_content.jsonl", 
                    jq_schema=".",
                    content_key="content",
                    json_lines=True,
                    text_content=True,
                    metadata_func=get_metadata)

In [4]:
data = loader.load()
data

[Document(metadata={'source': 'C:\\Users\\JesusCalderon\\work\\dsi_deploying_ai\\05_src\\documents\\pitchfork_content.jsonl', 'seq_num': 1, 'reviewid': 22703}, page_content="Trip-hop eventually became a 90s punchline, a music-press shorthand for overhyped hotel lounge music. But today, the much-maligned subgenre almost feels like a secret precedent. Listen to any of the canonical Bristol-scene albums of the mid-late 90s, when the genre was starting to chafe against its boundaries, and youd think the claustrophobic, anxious 21st century started a few years ahead of schedule. Looked at from the right angle, trip-hopis part of an unbroken chain that runs from the abrasion of 80s post-punk to the ruminative pop-R&B-dance fusion of the moment.The best of it has aged far more gracefully (and forcefully) than anything recorded in the waning days of the record industrys pre-filesharing monomania has any right to. Tricky rebelled against being attached at the hip to a scene he was already looki

In [23]:
data[1].model_dump()

{'id': None,
 'metadata': {'source': 'C:\\Users\\JesusCalderon\\work\\dsi_deploying_ai\\05_src\\documents\\pitchfork_content.jsonl',
  'seq_num': 2,
  'reviewid': 22721},
 'page_content': 'Eight years, five albums, and two EPs in, the New York-based outfit Krallice have long since shut up purists about their hipster black metal. Their four-man, post-structural assembly line runs at a breakneck pace, taking great care to balance the intricate (Colin Marston and Mick Barrs interlocking riffs, Lev Weinsteins head-spinning polyrhythms) with the incendiary (best exemplified by Barr and Nick McMasters shared, animalistic vocal duties; the formers a screaming eagle, the latter a growling hellhound). The quartet frequently capitalize on the element of surprise; Krallices last two releases2015s Ygg Huurand last winters HyperionEPdropped spontaneously, a pair of inter-dimensional rifts masquerading as albums, far from the hum of the hype machine. Early last month, the band opened the portal once

## Splitting Documents

There are good reasons to split documents. As explained in [LangChain's Documentation](https://python.langchain.com/docs/concepts/text_splitters/#why-split-documents):


- Handling non-uniform document lengths: Real-world document collections often contain texts of varying sizes. Splitting ensures consistent processing across all documents.
- Overcoming model limitations: Many embedding models and language models have maximum input size constraints. Splitting allows us to process documents that would otherwise exceed these limits.
- Improving representation quality: For longer documents, the quality of embeddings or other representations may degrade as they try to capture too much information. Splitting can lead to more focused and accurate representations of each section.
- Enhancing retrieval precision: In information retrieval systems, splitting can improve the granularity of search results, allowing for more precise matching of queries to relevant document sections.
- Optimizing computational resources: Working with smaller chunks of text can be more memory-efficient and allow for better parallelization of processing tasks.

## Text Splitters in LangChain

LangChain contains a family of [document splitters](https://docs.langchain.com/oss/python/integrations/splitters/index):

- Length-based: simple and intuitive approach that ensures a specific text length. Can be based on [characters](https://python.langchain.com/docs/how_to/character_text_splitter/) or [tokens](https://python.langchain.com/docs/how_to/split_by_token/).
- Text structure-based: tries to use the natural structure of text, including paragraphs, sentences, and words. More specifically:

    + The [RecursiveCharacterTextSplitter](https://docs.langchain.com/oss/python/integrations/splitters/recursive_text_splitter) attempts to keep larger units (e.g., paragraphs) intact.
    + If a unit exceeds the chunk size, it moves to the next level (e.g., sentences).
    + This process continues down to the word level if necessary.

- Document Structure-based: Uses the structure of documents in specific formats, including Markdown, HTML, and JSON.
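
The text structure-based strategy can be illustrated with a simplified sketch. The function below is hypothetical and written for this example only; it is not LangChain's implementation, which also handles overlap and other options. It tries the coarsest separator first, merges neighbouring units while they fit, and only falls back to a finer separator for units that are still too large.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    preferring paragraph breaks, then line breaks, then spaces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No natural boundary left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging units while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This unit is too large on its own: retry with a finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "aaa bbb\n\nccc ddd eee"
chunks_demo = recursive_split(sample, chunk_size=8)
print(chunks_demo)
```

In this toy input, the first paragraph fits within the limit and stays intact, while the second paragraph is too long and is split at the word level.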

In [6]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000, 
    chunk_overlap=200, 
    length_function = len, 
    add_start_index = True
)

In [7]:
chunks = text_splitter.split_documents(data)
print(f'Split {len(data)} reviews (documents) into {len(chunks)} chunks.' )

Split 18393 reviews (documents) into 48346 chunks.


Notice that each output document (a "chunk") carries the following fields; `seq_num` and `start_index` are stored in the chunk's `metadata`:

- `seq_num`: a sequential number identifying each of the original documents. 
- `start_index`: the starting index for the chunk.
- `page_content`: text of the document chunk.

In [8]:
chunks

[Document(metadata={'source': 'C:\\Users\\JesusCalderon\\work\\dsi_deploying_ai\\05_src\\documents\\pitchfork_content.jsonl', 'seq_num': 1, 'reviewid': 22703, 'start_index': 0}, page_content="Trip-hop eventually became a 90s punchline, a music-press shorthand for overhyped hotel lounge music. But today, the much-maligned subgenre almost feels like a secret precedent. Listen to any of the canonical Bristol-scene albums of the mid-late 90s, when the genre was starting to chafe against its boundaries, and youd think the claustrophobic, anxious 21st century started a few years ahead of schedule. Looked at from the right angle, trip-hopis part of an unbroken chain that runs from the abrasion of 80s post-punk to the ruminative pop-R&B-dance fusion of the moment.The best of it has aged far more gracefully (and forcefully) than anything recorded in the waning days of the record industrys pre-filesharing monomania has any right to. Tricky rebelled against being attached at the hip to a scene he

## Batch Embeddings

We now have a large number of documents for which we need embeddings. We could use a direct call to the Embeddings API. However, here we demonstrate how to request embeddings using the [Batch API](https://platform.openai.com/docs/api-reference/batch). From the documentation:

> The Batch API is used to send asynchronous groups of requests. This API offers lower costs, a separate pool of significantly higher rate limits, and a clear 24-hour turnaround time. The service is ideal for processing jobs that don't require immediate responses.

A couple of useful references are: 

- [Batch API Guide](https://platform.openai.com/docs/guides/batch)
- [API Reference](https://platform.openai.com/docs/api-reference/batch)

## Creating Batches

The batch process works as follows:

1. Prepare the batch file. Batches start with a .jsonl file where each line contains the details of an individual request to the API.
2. Upload the batch file. We must first upload the input file so that we can reference it when we create the batch.
3. Create the batch.    
4. Check status of the batch.
5. Retrieve the results.

In addition to the steps above, the API allows us to list all batches and to cancel a batch.
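
As a rough sketch, steps 2 to 5 could be wrapped in a single helper. The function name `run_embedding_batch` and the polling interval are assumptions made for this example; `client` is expected to be an `openai.OpenAI` instance, and the code assumes the batch file from step 1 already exists.

```python
import time

# Statuses after which a batch will no longer change
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def run_embedding_batch(client, input_path, poll_seconds=60):
    """Upload a batch file, create the batch, wait for it, and return the raw results."""
    # Step 2: upload the input file
    with open(input_path, "rb") as f:
        input_file = client.files.create(file=f, purpose="batch")

    # Step 3: create the batch against the embeddings endpoint
    batch = client.batches.create(
        input_file_id=input_file.id,
        endpoint="/v1/embeddings",
        completion_window="24h",
    )

    # Step 4: poll until the batch reaches a terminal status
    while batch.status not in TERMINAL_STATUSES:
        time.sleep(poll_seconds)
        batch = client.batches.retrieve(batch.id)

    if batch.status != "completed":
        raise RuntimeError(f"Batch {batch.id} ended with status {batch.status}")

    # Step 5: retrieve the results (JSON Lines text, one line per request)
    return client.files.content(batch.output_file_id).text
```

Blocking in a loop keeps the sketch short; for long-running batches it is usually more practical to store the batch id and check the status later.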

### 1. Prepare the Batch File

Batch processing using the API requires input files to follow a specific format. 

A few notes from the [documentation](https://platform.openai.com/docs/guides/batch#1-prepare-your-batch-file):

+ Batches start with a .jsonl file where each line contains the details of an individual request to the API. 
+ The available endpoints are:

    - Responses API: /v1/responses
    - Chat Completions API: /v1/chat/completions 
    - Embeddings API: /v1/embeddings 
    - Completions API: /v1/completions 
    - Moderations API: /v1/moderations 

+ For a given input file, the parameters in each line's body field are the same as the parameters for the underlying endpoint. 
+ Each request **must include a unique custom_id value**, which you can use to reference results after completion. 
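
Because a malformed line or a repeated `custom_id` can cause trouble later, it may help to check the lines before uploading. The helper below is a hypothetical local check written for this example, not part of the OpenAI SDK; the required keys follow the request format described in this section.

```python
import json

REQUIRED_KEYS = {"custom_id", "method", "url", "body"}

def validate_batch_lines(lines):
    """Return a list of problems; an empty list means the lines look well-formed."""
    problems, seen_ids = [], set()
    for n, line in enumerate(lines, start=1):
        try:
            request = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {n}: not valid JSON")
            continue
        if not isinstance(request, dict):
            problems.append(f"line {n}: expected a JSON object")
            continue
        missing = sorted(REQUIRED_KEYS - request.keys())
        if missing:
            problems.append(f"line {n}: missing keys {missing}")
        custom_id = request.get("custom_id")
        if custom_id is not None:
            if custom_id in seen_ids:
                problems.append(f"line {n}: duplicate custom_id {custom_id!r}")
            seen_ids.add(custom_id)
    return problems

body = {"model": "text-embedding-3-small", "input": "some text"}
sample_lines = [
    json.dumps({"custom_id": "a", "method": "POST", "url": "/v1/embeddings", "body": body}),
    json.dumps({"custom_id": "a", "method": "POST", "url": "/v1/embeddings", "body": body}),
    json.dumps({"custom_id": "b", "method": "POST", "url": "/v1/embeddings"}),
]
problems = validate_batch_lines(sample_lines)
print(problems)
```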

#### Rate Limits

It is important to keep in mind the [API's rate limits](https://platform.openai.com/docs/guides/batch#rate-limits):


+ **Per-batch limits**: A single batch may include up to 50,000 requests, and a batch input file can be up to 200 MB in size. Note that /v1/embeddings batches are also restricted to a maximum of 50,000 embedding inputs across all requests in the batch.
+ **Enqueued prompt tokens per model**: Each model has a maximum number of enqueued prompt tokens allowed for batch processing. You can find these limits on the [Platform Settings](https://platform.openai.com/settings/organization/limits) page.

It is important to note: 

> There are no limits for output tokens or number of submitted requests for the Batch API today. Because Batch API rate limits are a new, separate pool, using the Batch API will not consume tokens from your standard per-model rate limits, thereby offering you a convenient way to increase the number of requests and processed tokens you can use when querying our API 



We must create files that contain each chunk's `page_content` together with an identifier built from useful metadata (such as `reviewid` and a chunk identifier). We also want each file to stay within the rate limits (i.e., at most 50,000 requests per batch).
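
A small local check against the per-batch limits quoted above could look like the sketch below. The helper name `check_batch_file` is made up for this example, and reading "200 MB" as 200,000,000 bytes is a conservative assumption. With one `input` string per request, as in this notebook, the number of requests also equals the number of embedding inputs.

```python
import os

MAX_REQUESTS_PER_BATCH = 50_000
MAX_FILE_BYTES = 200 * 1000 * 1000  # assumption: "200 MB" read as decimal megabytes

def check_batch_file(path):
    """Count requests (non-empty lines) and bytes, and compare them with the limits."""
    size_bytes = os.path.getsize(path)
    with open(path, "rb") as f:
        n_requests = sum(1 for line in f if line.strip())
    return {
        "requests": n_requests,
        "bytes": size_bytes,
        "ok": n_requests <= MAX_REQUESTS_PER_BATCH and size_bytes <= MAX_FILE_BYTES,
    }
```

The function only reads the file, so it can be run on every generated batch file before uploading.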

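The batch definition `.jsonl` file should contain one line per request. [Each request is defined as](https://cookbook.openai.com/examples/batch_processing#creating-the-batch-file) shown below. The cookbook example targets `/v1/chat/completions`; for embeddings, `url` becomes `/v1/embeddings` and `body` carries `model` and `input` instead of `messages`: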

```
{
    "custom_id": <REQUEST_ID>,
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": <MODEL>,
        "messages": <MESSAGES>,
        // other parameters
    }
}
```


In [9]:
chunks[0].page_content

"Trip-hop eventually became a 90s punchline, a music-press shorthand for overhyped hotel lounge music. But today, the much-maligned subgenre almost feels like a secret precedent. Listen to any of the canonical Bristol-scene albums of the mid-late 90s, when the genre was starting to chafe against its boundaries, and youd think the claustrophobic, anxious 21st century started a few years ahead of schedule. Looked at from the right angle, trip-hopis part of an unbroken chain that runs from the abrasion of 80s post-punk to the ruminative pop-R&B-dance fusion of the moment.The best of it has aged far more gracefully (and forcefully) than anything recorded in the waning days of the record industrys pre-filesharing monomania has any right to. Tricky rebelled against being attached at the hip to a scene he was already looking to shed and decamped for Jamaica to record a more aggressive, bristling-energy mutation of his style in 96; the namePre-Millennium Tension is the only obvious thing that 

In [10]:
chunks[0].metadata['reviewid']

22703

In [11]:

import json
import os

def prep_batch_file_for_embedding(input:list, output_path:str, max_lines_per_file:int=1000):
    total_lines = len(input)
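    # Ceiling division: avoids creating an empty extra file when total_lines is an exact multiple
    num_files = (total_lines + max_lines_per_file - 1) // max_lines_per_file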
    print(f'Total lines: {total_lines}, Number of files to create: {num_files}')

    for num_file in range(num_files):
        start_index = num_file * max_lines_per_file
        end_index = min(start_index + max_lines_per_file, total_lines)
        output_file = os.path.join(output_path, f"pitchfork_reviews_batch_{num_file+1}.jsonl")
        print(f'Creating file: {output_file} with lines from {start_index} to {end_index-1}')
        create_single_batch_file(input, start_index, end_index, output_file)

def create_single_batch_file(input, start_index, end_index, output_file):
    with open(output_file, 'w') as outfile:
        for line in input[start_index:end_index]:
            custom_id = (
                    str(line.metadata['reviewid']) + "_" + 
                    str(line.metadata['seq_num']) + "_" + 
                    str(line.metadata['start_index'])
                )
            content = line.page_content
            out_dict = {
                    "custom_id": custom_id, 
                    "method": "POST", 
                    "url": "/v1/embeddings", 
                    "body": {
                        "model": "text-embedding-3-small", 
                        "input": content
                    }
                }
            outfile.write(json.dumps(out_dict) + '\n')
        
            

In [12]:
prep_batch_file_for_embedding(
    input=chunks, 
    output_path='../../05_src/documents/'
)

Total lines: 48346, Number of files to create: 49
Creating file: ../../05_src/documents/pitchfork_reviews_batch_1.jsonl with lines from 0 to 999
Creating file: ../../05_src/documents/pitchfork_reviews_batch_2.jsonl with lines from 1000 to 1999
Creating file: ../../05_src/documents/pitchfork_reviews_batch_3.jsonl with lines from 2000 to 2999
Creating file: ../../05_src/documents/pitchfork_reviews_batch_4.jsonl with lines from 3000 to 3999
Creating file: ../../05_src/documents/pitchfork_reviews_batch_5.jsonl with lines from 4000 to 4999
Creating file: ../../05_src/documents/pitchfork_reviews_batch_6.jsonl with lines from 5000 to 5999
Creating file: ../../05_src/documents/pitchfork_reviews_batch_7.jsonl with lines from 6000 to 6999
Creating file: ../../05_src/documents/pitchfork_reviews_batch_8.jsonl with lines from 7000 to 7999
Creating file: ../../05_src/documents/pitchfork_reviews_batch_9.jsonl with lines from 8000 to 8999
Creating file: ../../05_src/documents/pitchfork_reviews_batch_1

### 2. Upload the Input File

Before running the batch process, we will upload the files to the API. The Files API also provides useful file management functions, such as listing and deleting files.

#### List available files

In [13]:
from openai import OpenAI

import os
client = OpenAI(base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1', 
                api_key='any value',
                default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')})
files = client.files.list()


In [14]:
files.to_dict()['data']

[{'id': 'file-3MAVMYnYdsjH4QXYuFeKuV',
  'bytes': 46,
  'created_at': 1770922564,
  'filename': 'test.md',
  'object': 'file',
  'purpose': 'assistants',
  'status': 'processed',
  'expires_at': None,
  'status_details': None},
 {'id': 'file-EehnYoYC8jHacYXhYMD3ry',
  'bytes': 46,
  'created_at': 1770922531,
  'filename': 'test.md',
  'object': 'file',
  'purpose': 'assistants',
  'status': 'processed',
  'expires_at': None,
  'status_details': None},
 {'id': 'file-3vdqgefYXvgkNx4ot8ZtG6',
  'bytes': 46,
  'created_at': 1770922512,
  'filename': 'test.md',
  'object': 'file',
  'purpose': 'assistants',
  'status': 'processed',
  'expires_at': None,
  'status_details': None},
 {'id': 'file-FPyMffTF4ZBp5nXz4EekKA',
  'bytes': 46,
  'created_at': 1770921689,
  'filename': 'test.md',
  'object': 'file',
  'purpose': 'assistants',
  'status': 'processed',
  'expires_at': None,
  'status_details': None},
 {'id': 'file-XHYHSLy1XahNJYDof6GmcK',
  'bytes': 26135557,
  'created_at': 1770917566,


#### Remove Files

You can remove files from storage using code like the one below, which deletes all files in the account. 
Note: this is a destructive action that cannot be undone.

In [None]:
# for file in files.to_dict()['data']:
#     print(f'Deleting file: {file["filename"]}')
#     resp = client.files.delete(file["id"])
#     print(resp)

#### Search and Upload Files

We search for the batch files that we created and upload them.

In [15]:
from glob import glob

batch_files = glob('../../05_src/documents/pitchfork_reviews_batch_*.jsonl')
batch_files

['../../05_src/documents\\pitchfork_reviews_batch_1.jsonl',
 '../../05_src/documents\\pitchfork_reviews_batch_10.jsonl',
 '../../05_src/documents\\pitchfork_reviews_batch_11.jsonl',
 '../../05_src/documents\\pitchfork_reviews_batch_12.jsonl',
 '../../05_src/documents\\pitchfork_reviews_batch_13.jsonl',
 '../../05_src/documents\\pitchfork_reviews_batch_14.jsonl',
 '../../05_src/documents\\pitchfork_reviews_batch_15.jsonl',
 '../../05_src/documents\\pitchfork_reviews_batch_16.jsonl',
 '../../05_src/documents\\pitchfork_reviews_batch_17.jsonl',
 '../../05_src/documents\\pitchfork_reviews_batch_18.jsonl',
 '../../05_src/documents\\pitchfork_reviews_batch_19.jsonl',
 '../../05_src/documents\\pitchfork_reviews_batch_2.jsonl',
 '../../05_src/documents\\pitchfork_reviews_batch_20.jsonl',
 '../../05_src/documents\\pitchfork_reviews_batch_21.jsonl',
 '../../05_src/documents\\pitchfork_reviews_batch_22.jsonl',
 '../../05_src/documents\\pitchfork_reviews_batch_23.jsonl',
 '../../05_src/documents\\

In [16]:
from openai import OpenAI
from tqdm import tqdm
# client = OpenAI(base_url='https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1', 
#                 default_headers={"x-api-key": os.getenv('API_GATEWAY_KEY')})
client = OpenAI()


for b_file in tqdm(batch_files):
    # Open each file in a context manager so the handle is closed after the upload
    with open(b_file, "rb") as f:
        batch_input_file = client.files.create(
            file=f,
            purpose='batch'
        )
    print(batch_input_file)

  2%|▏         | 1/49 [00:01<00:51,  1.07s/it]

FileObject(id='file-EyaeFndN776oxeBL8ABimb', bytes=1837973, created_at=1770924130, filename='pitchfork_reviews_batch_1.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516130, status_details=None)


  4%|▍         | 2/49 [00:02<00:49,  1.06s/it]

FileObject(id='file-EymyPGMirSj8vseFdAnMfM', bytes=1840801, created_at=1770924131, filename='pitchfork_reviews_batch_10.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516131, status_details=None)


  6%|▌         | 3/49 [00:03<00:48,  1.06s/it]

FileObject(id='file-HvUWzXtRi5AZTRZ84ZYnnP', bytes=1847247, created_at=1770924132, filename='pitchfork_reviews_batch_11.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516132, status_details=None)


  8%|▊         | 4/49 [00:04<00:47,  1.06s/it]

FileObject(id='file-SZ6bdoXvByfvGBmgp5QQg7', bytes=1842549, created_at=1770924133, filename='pitchfork_reviews_batch_12.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516133, status_details=None)


 10%|█         | 5/49 [00:05<00:42,  1.04it/s]

FileObject(id='file-NtqCMjJ2sBu52FMUzJtCM2', bytes=1819084, created_at=1770924134, filename='pitchfork_reviews_batch_13.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516134, status_details=None)


 12%|█▏        | 6/49 [00:05<00:38,  1.13it/s]

FileObject(id='file-DHizMUMrhJVgKJNEv61N8g', bytes=1826662, created_at=1770924135, filename='pitchfork_reviews_batch_14.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516135, status_details=None)


 14%|█▍        | 7/49 [00:06<00:38,  1.08it/s]

FileObject(id='file-1Pqh7eM8LnKoJypDZTeKwp', bytes=1833509, created_at=1770924135, filename='pitchfork_reviews_batch_15.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516135, status_details=None)


 16%|█▋        | 8/49 [00:07<00:36,  1.13it/s]

FileObject(id='file-RzsyiFSchBTR3TXYZ4PSBF', bytes=1841699, created_at=1770924136, filename='pitchfork_reviews_batch_16.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516136, status_details=None)


 18%|█▊        | 9/49 [00:08<00:34,  1.15it/s]

FileObject(id='file-EkJG7ftaFTasLXMdixA4Xs', bytes=1818279, created_at=1770924137, filename='pitchfork_reviews_batch_17.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516137, status_details=None)


 20%|██        | 10/49 [00:09<00:36,  1.06it/s]

FileObject(id='file-K1kBWXVWabn92ioaU5MAyz', bytes=1779547, created_at=1770924138, filename='pitchfork_reviews_batch_18.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516138, status_details=None)


 22%|██▏       | 11/49 [00:10<00:33,  1.14it/s]

FileObject(id='file-5qyrYhoWHBcUyzx6kaJgzP', bytes=1738280, created_at=1770924139, filename='pitchfork_reviews_batch_19.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516139, status_details=None)


 24%|██▍       | 12/49 [00:10<00:30,  1.21it/s]

FileObject(id='file-ScwcvdCzgNgHcKKGJfhe6Q', bytes=1842602, created_at=1770924140, filename='pitchfork_reviews_batch_2.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516140, status_details=None)


 27%|██▋       | 13/49 [00:11<00:29,  1.22it/s]

FileObject(id='file-XQg6tPEN9bA7ckjd7nKhmJ', bytes=1739908, created_at=1770924141, filename='pitchfork_reviews_batch_20.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516141, status_details=None)


 29%|██▊       | 14/49 [00:12<00:31,  1.10it/s]

FileObject(id='file-AzABL6EC8MFzRiiDxhbm9q', bytes=1721782, created_at=1770924141, filename='pitchfork_reviews_batch_21.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516141, status_details=None)


 31%|███       | 15/49 [00:13<00:30,  1.12it/s]

FileObject(id='file-3AwioGVHCm135ds4gP4pGx', bytes=1700138, created_at=1770924142, filename='pitchfork_reviews_batch_22.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516142, status_details=None)


 33%|███▎      | 16/49 [00:14<00:27,  1.18it/s]

FileObject(id='file-96VGjbK1qH9GNg7zQn2jzN', bytes=1792623, created_at=1770924143, filename='pitchfork_reviews_batch_23.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516143, status_details=None)


 35%|███▍      | 17/49 [00:15<00:26,  1.21it/s]

FileObject(id='file-SiqMfWP6uHERXXyxZFa35Y', bytes=1816709, created_at=1770924144, filename='pitchfork_reviews_batch_24.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516144, status_details=None)


 37%|███▋      | 18/49 [00:16<00:26,  1.15it/s]

FileObject(id='file-6dUYTPcwMdQKViBNNdceRN', bytes=1802925, created_at=1770924145, filename='pitchfork_reviews_batch_25.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516145, status_details=None)


 39%|███▉      | 19/49 [00:16<00:25,  1.18it/s]

FileObject(id='file-Bh6m37MHrLcvmLqzqSkuH8', bytes=1798241, created_at=1770924146, filename='pitchfork_reviews_batch_26.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516146, status_details=None)


 41%|████      | 20/49 [00:17<00:24,  1.19it/s]

FileObject(id='file-1CzmE8y6t7q3w8YgVuVcEU', bytes=1793531, created_at=1770924147, filename='pitchfork_reviews_batch_27.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516147, status_details=None)


 43%|████▎     | 21/49 [00:18<00:25,  1.10it/s]

FileObject(id='file-X65qYsA47j1mw1QJzhsLEF', bytes=1806813, created_at=1770924147, filename='pitchfork_reviews_batch_28.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516147, status_details=None)


 45%|████▍     | 22/49 [00:19<00:23,  1.13it/s]

FileObject(id='file-1zCCVCaL8hAAtHfBkz6oVg', bytes=1807648, created_at=1770924148, filename='pitchfork_reviews_batch_29.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516148, status_details=None)


 47%|████▋     | 23/49 [00:20<00:23,  1.11it/s]

FileObject(id='file-4kKAv7V24KTvQMMxiFMAVk', bytes=1831754, created_at=1770924149, filename='pitchfork_reviews_batch_3.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516149, status_details=None)


 49%|████▉     | 24/49 [00:21<00:24,  1.02it/s]

FileObject(id='file-FyT5cCLuFPKAHykA7ZQfsm', bytes=1820406, created_at=1770924150, filename='pitchfork_reviews_batch_30.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516150, status_details=None)


 51%|█████     | 25/49 [00:22<00:21,  1.09it/s]

FileObject(id='file-WZmBjapQG5KXmj32ZACDaz', bytes=1800884, created_at=1770924151, filename='pitchfork_reviews_batch_31.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516151, status_details=None)


 53%|█████▎    | 26/49 [00:23<00:19,  1.15it/s]

FileObject(id='file-SUVUr9s7XDJTqhTep1zUSp', bytes=1804009, created_at=1770924152, filename='pitchfork_reviews_batch_32.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516152, status_details=None)


 55%|█████▌    | 27/49 [00:24<00:19,  1.11it/s]

FileObject(id='file-DJL2R73nM5ZYa7x3ypduJe', bytes=1764596, created_at=1770924153, filename='pitchfork_reviews_batch_33.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516153, status_details=None)


 57%|█████▋    | 28/49 [00:25<00:20,  1.01it/s]

FileObject(id='file-Nyq1HdiBKXm2X3KGrtqF7U', bytes=1763908, created_at=1770924154, filename='pitchfork_reviews_batch_34.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516154, status_details=None)


 59%|█████▉    | 29/49 [00:26<00:21,  1.09s/it]

FileObject(id='file-2oDejLcn1ojZY1ZXYGv1xX', bytes=1751975, created_at=1770924155, filename='pitchfork_reviews_batch_35.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516155, status_details=None)


 61%|██████    | 30/49 [00:28<00:21,  1.11s/it]

FileObject(id='file-HM1NyW2gUHEFvHX6Hi8Qb3', bytes=1753136, created_at=1770924156, filename='pitchfork_reviews_batch_36.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516156, status_details=None)


 63%|██████▎   | 31/49 [00:29<00:20,  1.11s/it]

FileObject(id='file-Wf7FQtqTnAahPP6f9pGAHe', bytes=1703495, created_at=1770924158, filename='pitchfork_reviews_batch_37.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516158, status_details=None)


 65%|██████▌   | 32/49 [00:30<00:21,  1.24s/it]

FileObject(id='file-7oPdhLFUdZFD3DUZ3HTjNR', bytes=1816755, created_at=1770924159, filename='pitchfork_reviews_batch_38.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516159, status_details=None)


 67%|██████▋   | 33/49 [00:31<00:17,  1.09s/it]

FileObject(id='file-ByRmtyzqZKBAeYZ1wZ2XaT', bytes=1828602, created_at=1770924160, filename='pitchfork_reviews_batch_39.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516160, status_details=None)


 69%|██████▉   | 34/49 [00:32<00:15,  1.02s/it]

FileObject(id='file-AhHQKPq6pJcEv6qzBGdLEZ', bytes=1799466, created_at=1770924161, filename='pitchfork_reviews_batch_4.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516161, status_details=None)


 71%|███████▏  | 35/49 [00:33<00:13,  1.01it/s]

FileObject(id='file-FKg33gaysUCaJzpwHwntji', bytes=1828017, created_at=1770924162, filename='pitchfork_reviews_batch_40.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516162, status_details=None)


 73%|███████▎  | 36/49 [00:33<00:12,  1.08it/s]

FileObject(id='file-KdHmeco2WqwoMJPZNtmtTm', bytes=1811672, created_at=1770924163, filename='pitchfork_reviews_batch_41.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516163, status_details=None)


 76%|███████▌  | 37/49 [00:35<00:13,  1.09s/it]

FileObject(id='file-Lh2f5XKoRMj9NN4Yiu9rPR', bytes=1834390, created_at=1770924164, filename='pitchfork_reviews_batch_42.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516164, status_details=None)


 78%|███████▊  | 38/49 [00:36<00:12,  1.11s/it]

FileObject(id='file-1YEbRTdsqk8ysGycPoc1oi', bytes=1821820, created_at=1770924165, filename='pitchfork_reviews_batch_43.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516165, status_details=None)


 80%|███████▉  | 39/49 [00:37<00:10,  1.00s/it]

FileObject(id='file-BpdhbPTGqb1rgX2m3EfkoG', bytes=1823231, created_at=1770924166, filename='pitchfork_reviews_batch_44.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516166, status_details=None)


 82%|████████▏ | 40/49 [00:38<00:08,  1.04it/s]

FileObject(id='file-T6Ye1QS5GLwB7WVn1S9fRP', bytes=1841492, created_at=1770924167, filename='pitchfork_reviews_batch_45.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516167, status_details=None)


 84%|████████▎ | 41/49 [00:39<00:07,  1.10it/s]

FileObject(id='file-YDb6JK5VYmkpLGRCSy7JsL', bytes=1835970, created_at=1770924168, filename='pitchfork_reviews_batch_46.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516168, status_details=None)


 86%|████████▌ | 42/49 [00:39<00:06,  1.12it/s]

FileObject(id='file-JoULMLT6UWri2B67cRRmMo', bytes=1842522, created_at=1770924169, filename='pitchfork_reviews_batch_47.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516169, status_details=None)


 88%|████████▊ | 43/49 [00:40<00:05,  1.15it/s]

FileObject(id='file-1NhhgYaoWu2iexgNmsBhCD', bytes=1758853, created_at=1770924169, filename='pitchfork_reviews_batch_48.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516169, status_details=None)


 90%|████████▉ | 44/49 [00:41<00:03,  1.34it/s]

FileObject(id='file-R9nVUBeHKpSaTPg8hUN3LC', bytes=597852, created_at=1770924170, filename='pitchfork_reviews_batch_49.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516170, status_details=None)


 92%|█████████▏| 45/49 [00:42<00:03,  1.20it/s]

FileObject(id='file-3AgobaBddnQstnvy7u65Hv', bytes=1803712, created_at=1770924171, filename='pitchfork_reviews_batch_5.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516171, status_details=None)


 94%|█████████▍| 46/49 [00:43<00:02,  1.09it/s]

FileObject(id='file-Q72yuafiq76nQU1yQm6wRo', bytes=1808931, created_at=1770924172, filename='pitchfork_reviews_batch_6.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516172, status_details=None)


 96%|█████████▌| 47/49 [00:44<00:01,  1.03it/s]

FileObject(id='file-ECzZ2QZCd1tmE8ecZJtB8t', bytes=1833888, created_at=1770924173, filename='pitchfork_reviews_batch_7.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516173, status_details=None)


 98%|█████████▊| 48/49 [00:45<00:01,  1.01s/it]

FileObject(id='file-1EiWYszPmEoUNm9AfTfEAA', bytes=1841734, created_at=1770924174, filename='pitchfork_reviews_batch_8.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516174, status_details=None)


100%|██████████| 49/49 [00:46<00:00,  1.06it/s]

FileObject(id='file-MjzJVjeK4guNZ7fWSiiAby', bytes=1837811, created_at=1770924175, filename='pitchfork_reviews_batch_9.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516175, status_details=None)





### 3. Create Batches

As before, we can list the files that we have in storage:

In [17]:
batch_files = client.files.list().to_dict()
batch_file_ids = [file['id'] for file in batch_files['data']]
batch_file_ids

[FileObject(id='file-EyaeFndN776oxeBL8ABimb', bytes=1837973, created_at=1770924130, filename='pitchfork_reviews_batch_1.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516130, status_details=None),
 FileObject(id='file-EymyPGMirSj8vseFdAnMfM', bytes=1840801, created_at=1770924131, filename='pitchfork_reviews_batch_10.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516131, status_details=None),
 FileObject(id='file-HvUWzXtRi5AZTRZ84ZYnnP', bytes=1847247, created_at=1770924132, filename='pitchfork_reviews_batch_11.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516132, status_details=None),
 FileObject(id='file-SZ6bdoXvByfvGBmgp5QQg7', bytes=1842549, created_at=1770924133, filename='pitchfork_reviews_batch_12.jsonl', object='file', purpose='batch', status='processed', expires_at=1773516133, status_details=None),
 FileObject(id='file-NtqCMjJ2sBu52FMUzJtCM2', bytes=1819084, created_at=1770924134, filename='pit
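
The file listing may also contain files from other runs or with other purposes, so it can help to narrow it down before creating batches. The sketch below is a hedged illustration that works on plain dictionaries shaped like the `FileObject` entries above. The helper name `select_batch_input_ids`, its default prefix, and the sample IDs are assumptions made for this example and are not part of the OpenAI SDK.

```python
def select_batch_input_ids(files, prefix="pitchfork_reviews_batch_"):
    """Return the IDs of processed batch-input files whose filename starts with `prefix`.

    `files` is a list of dicts, e.g. `client.files.list().to_dict()['data']`.
    The helper name and the default prefix are hypothetical.
    """
    return [
        f["id"]
        for f in files
        if f.get("purpose") == "batch"          # skip outputs and other uploads
        and f.get("status") == "processed"      # skip files that are not ready
        and f.get("filename", "").startswith(prefix)
    ]


# Hand-written sample entries; the IDs are made up for the example.
sample_files = [
    {"id": "file-a", "filename": "pitchfork_reviews_batch_1.jsonl",
     "purpose": "batch", "status": "processed"},
    {"id": "file-b", "filename": "other_upload.jsonl",
     "purpose": "batch", "status": "processed"},
    {"id": "file-c", "filename": "pitchfork_reviews_batch_2.jsonl",
     "purpose": "batch_output", "status": "processed"},
]
print(select_batch_input_ids(sample_files))
```

Filtering on the filename prefix is only one possible design choice; any field that distinguishes the current run's uploads would serve the same purpose.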

Unlike the Files API, the Batch API offers no easy way of removing batches that are in a completed or failed state, so a distinctive description and the status field are important for identifying the batches that belong to the current run.

Now we can create the batches. For each uploaded file, we create one batch with the call below:

In [None]:
my_id = "<add your id here>"  # replace the placeholder with your own identifier

In [19]:
from datetime import datetime

from tqdm import tqdm

# One shared description and timestamp identify every batch created in this run.
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
batch_description = f"Pitchfork reviews content embeddings ({my_id}) {timestamp}"

# Create one embeddings batch per uploaded input file.
for file_id in tqdm(batch_file_ids):
    client.batches.create(
        input_file_id=file_id,
        endpoint="/v1/embeddings",
        completion_window="24h",
        metadata={
            "description": batch_description,
            "timestamp": timestamp,
        },
    )

100%|██████████| 49/49 [00:10<00:00,  4.62it/s]


In [20]:
batch_description

'Pitchfork reviews content embeddings (jcalderon_20260212) 2026-02-12 14:22:54'

In [21]:
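# The list endpoint is paginated (20 batches per page by default), so request
# the maximum page size of 100 to cover the 49 batches created above.
batch_processes = client.batches.list(limit=100).to_dict()
batch_info = [
    {
        'batch_id': batch['id'],
        'description': batch['metadata']['description'],
        'status': batch['status'],
        'request_counts': batch['request_counts'],
        'output_file_id': batch['output_file_id'],
    }
    for batch in batch_processes['data']
    # Batches created without metadata would otherwise raise an error here.
    if (batch.get('metadata') or {}).get('description') == batch_description
]
batch_info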

[{'batch_id': 'batch_698e289a55f881908160dc50245a838a',
  'description': 'Pitchfork reviews content embeddings (jcalderon_20260212) 2026-02-12 14:22:54',
  'status': 'validating',
  'request_counts': {'completed': 0, 'failed': 0, 'total': 0},
  'output_file_id': None},
 {'batch_id': 'batch_698e289a1db881909344f9c3abd0f6d6',
  'description': 'Pitchfork reviews content embeddings (jcalderon_20260212) 2026-02-12 14:22:54',
  'status': 'validating',
  'request_counts': {'completed': 0, 'failed': 0, 'total': 0},
  'output_file_id': None},
 {'batch_id': 'batch_698e2899f0608190b8852c482cf342aa',
  'description': 'Pitchfork reviews content embeddings (jcalderon_20260212) 2026-02-12 14:22:54',
  'status': 'validating',
  'request_counts': {'completed': 0, 'failed': 0, 'total': 0},
  'output_file_id': None},
 {'batch_id': 'batch_698e28999c30819092543b7b214b8bbc',
  'description': 'Pitchfork reviews content embeddings (jcalderon_20260212) 2026-02-12 14:22:54',
  'status': 'validating',
  'request
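
With dozens of batches, reading the raw list is tedious. As a minimal sketch, the statuses and request counts can be aggregated into a short summary. The helper name `summarize_batches` and the sample entries are assumptions for illustration; the dictionaries follow the shape of `batch_info` above.

```python
from collections import Counter


def summarize_batches(batch_info):
    """Count batches per status and add up their request counts.

    `batch_info` is a list of dicts with 'status' and 'request_counts' keys,
    as built in the cell above. The helper name is hypothetical.
    """
    statuses = Counter(batch["status"] for batch in batch_info)
    totals = {"completed": 0, "failed": 0, "total": 0}
    for batch in batch_info:
        counts = batch.get("request_counts") or {}
        for key in totals:
            totals[key] += counts.get(key, 0)
    return dict(statuses), totals


# Hand-written sample entries; the IDs and counts are made up for the example.
sample_info = [
    {"batch_id": "batch_1", "status": "validating",
     "request_counts": {"completed": 0, "failed": 0, "total": 0}},
    {"batch_id": "batch_2", "status": "completed",
     "request_counts": {"completed": 9, "failed": 1, "total": 10}},
    {"batch_id": "batch_3", "status": "completed",
     "request_counts": {"completed": 5, "failed": 0, "total": 5}},
]
status_counts, request_totals = summarize_batches(sample_info)
print(status_counts)
print(request_totals)
```

Rerunning the listing cell and this summary a few times gives a quick view of how far the batches have progressed.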

If you need to cancel a batch, you can use the code below:

In [22]:
# for batch in batch_info:
#     client.batches.cancel(batch['batch_id'])
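
Cancelling is only meaningful for batches that are still running. The sketch below selects the IDs of batches that have not yet finished, under the assumption that `completed`, `failed`, `expired`, `cancelled`, and `cancelling` are the statuses to skip. The helper name `pending_batch_ids` and the sample entries are hypothetical.

```python
# Statuses treated as "nothing left to cancel" (an assumption for this sketch).
SKIP_STATUSES = {"completed", "failed", "expired", "cancelled", "cancelling"}


def pending_batch_ids(batch_info):
    """Return the IDs of batches whose status is not in SKIP_STATUSES."""
    return [
        batch["batch_id"]
        for batch in batch_info
        if batch["status"] not in SKIP_STATUSES
    ]


# Hand-written sample entries; the IDs are made up for the example.
sample_batches = [
    {"batch_id": "batch_1", "status": "validating"},
    {"batch_id": "batch_2", "status": "completed"},
    {"batch_id": "batch_3", "status": "in_progress"},
    {"batch_id": "batch_4", "status": "cancelling"},
]
print(pending_batch_ids(sample_batches))
```

The returned IDs could then be passed one at a time to `client.batches.cancel`, as in the commented-out loop above.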