A Python application that processes documents in batches using Google Gemini, Google Cloud Storage, and the Batch API.
- Create a Google Cloud Project.
- Create a Google Cloud Storage Bucket.
- Select a Location Type (e.g. "Region") and a Location (e.g. "northamerica-northeast2 (Toronto)").
- Select a Default Storage Class.
- Use the default settings for the rest of the prompts.
- Create 3 folders inside your bucket:
  - Input Folder (this is where you will upload all the original data)
  - JSONL Folder (this is where the script will upload the JSONL files it generates)
  - Output Folder (this is where the resulting data will be stored)
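The scripts will address these folders by their `gs://` URIs. As a minimal sketch (the bucket and folder names below are hypothetical placeholders, not the names the project requires):

```python
# Hypothetical bucket and folder names; substitute your own.
BUCKET = "my-gemini-batch-bucket"

def folder_uri(bucket: str, folder: str) -> str:
    """Build the gs:// URI for a folder inside a bucket."""
    return f"gs://{bucket}/{folder}/"

INPUT_URI = folder_uri(BUCKET, "input")    # gs://my-gemini-batch-bucket/input/
JSONL_URI = folder_uri(BUCKET, "jsonl")    # gs://my-gemini-batch-bucket/jsonl/
OUTPUT_URI = folder_uri(BUCKET, "output")  # gs://my-gemini-batch-bucket/output/
```

Note that Cloud Storage has no real folders; a "folder" is just a shared prefix in object names, which the console displays as a directory.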
- Install Google Cloud CLI.
- Initialize and authorize the gcloud CLI by running `gcloud init`. This will open a web browser to authorize access. Follow the instructions to complete the process. For more information and troubleshooting, refer to the documentation here.
- Run `gcloud auth application-default login` to establish a connection between the code and the Google Cloud API. You will be redirected to the browser. Follow the instructions to complete the process. If your setup completed successfully, you will see the notice "You are now authenticated with the gcloud CLI!"
- Create a Python virtual environment: `python3 -m venv venv`.
- Activate the virtual environment: `source venv/bin/activate`.
- Install dependencies and packages: `pip install -r requirements.txt`.
- Fill in `.env.template` with your credentials and save it as `.env`. Follow the instructions in docs.
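The project presumably loads `.env` with a helper such as python-dotenv; as a sketch of what that step amounts to, here is a minimal loader (the variable names in the docstring are illustrative, not the project's actual keys):

```python
import os

def load_env(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines from a .env file into a dict,
    skipping blanks and comments, and export them into os.environ
    (values already set in the environment win)."""
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"')
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

Keeping credentials in `.env` (and out of version control) is the point of the template file: the code reads configuration from the environment rather than from hard-coded strings.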
There is some sample data provided in the sample_data directory.
It contains 10 text documents, each containing 1 paragraph of text. Our goal is to process these documents in a batch and generate, for each input document, an output containing its 'topic' and 'summary' in JSON format.
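The exact request schema the script emits is not shown here, but Gemini batch jobs on Vertex AI commonly take a JSONL file with one `{"request": ...}` object per line. A sketch of building such a file (the prompt wording and schema are assumptions to verify against the script's actual output):

```python
import json

def make_request_line(text: str) -> str:
    """Build one JSONL line asking Gemini for a topic and summary.
    The {"request": {"contents": [...]}} shape follows the common
    Vertex AI batch-prediction format; confirm it matches what the
    project's script actually generates."""
    request = {
        "request": {
            "contents": [
                {
                    "role": "user",
                    "parts": [
                        {"text": "Return JSON with 'topic' and 'summary' "
                                 "for this document:\n" + text}
                    ],
                }
            ]
        }
    }
    return json.dumps(request)

documents = ["First sample paragraph.", "Second sample paragraph."]
jsonl = "\n".join(make_request_line(doc) for doc in documents)
```

One line per document is what makes the batch job parallelizable: each line is an independent request that the service can process on its own.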
- Upload the sample data to the Input Folder of your bucket.
- Run `make batch`. This will create an input JSONL file, upload it to the JSONL Folder, and then create a batch job using that JSONL file.
- After the batch job is finished, update the `.env` file: set `OUTPUT_JSONL_PATH` to the folder name of your output JSONL file from your bucket.
- Run `make post_process`. This will process our documents using the output JSONL file, and our Output Folder in the bucket will be populated with our desired data.
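The post-processing step amounts to reading the output JSONL line by line and pulling the model's JSON answer out of each record. A minimal sketch, assuming the common batch-output shape `{"response": {"candidates": [{"content": {"parts": [{"text": ...}]}}]}}` (check this against your actual output file):

```python
import json

def extract_result(line: str) -> dict:
    """Extract the model's JSON answer from one output JSONL line.
    Assumes the typical Vertex AI batch-output record shape; the
    inner text is itself JSON because the prompt asked for a JSON
    reply with 'topic' and 'summary'."""
    record = json.loads(line)
    text = record["response"]["candidates"][0]["content"]["parts"][0]["text"]
    return json.loads(text)

# Synthetic example of one output line, for illustration only.
sample = json.dumps({
    "response": {"candidates": [{"content": {"parts": [
        {"text": '{"topic": "demo", "summary": "A short summary."}'}
    ]}}]}
})
result = extract_result(sample)
```

In practice the script would also need to handle lines where the model returned malformed JSON or an error record instead of a candidate.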
See more detailed documentation here.