A Python application that processes documents in batches using Google Gemini, Google Cloud Storage, and the Batch API.
- Create a Google Cloud Project.
- Create a Google Cloud Storage Bucket.
- Select a Location Type (e.g. "Region") and a Location (e.g. "northamerica-northeast2 (Toronto)").
- Select a Default Storage Class.
- Use the default settings for the rest of the prompts.
- Create 3 folders inside your bucket:
  - Input Folder (this is where you will upload all the original data)
  - JSONL Folder (this is where the script will upload the JSONL files it generates)
  - Output Folder (this is where the resulting data will be stored)
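The scripts will address these folders by their `gs://` URIs. As a minimal sketch (the bucket and folder names below are hypothetical placeholders, not the names the project requires):

```python
# Hypothetical bucket and folder names; substitute your own.
BUCKET = "my-gemini-batch-bucket"

def folder_uri(bucket: str, folder: str) -> str:
    """Build the gs:// URI for a folder inside a bucket."""
    return f"gs://{bucket}/{folder}/"

INPUT_URI = folder_uri(BUCKET, "input")    # gs://my-gemini-batch-bucket/input/
JSONL_URI = folder_uri(BUCKET, "jsonl")    # gs://my-gemini-batch-bucket/jsonl/
OUTPUT_URI = folder_uri(BUCKET, "output")  # gs://my-gemini-batch-bucket/output/
```

Note that Cloud Storage has no real folders; a "folder" is just a shared prefix in object names, which the console displays as a directory.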
- Install Google Cloud CLI.
- Initialize and authorize the gcloud CLI by running `gcloud init`. This will open a web browser to authorize access. Follow the instructions to complete the process. For more information and troubleshooting, refer to the documentation here.
- Run `gcloud auth application-default login` to establish a connection between the code and the Google Cloud API. You will be redirected to the browser. Follow the instructions to complete the process. If your setup completed successfully, you will see the notice "You are now authenticated with the gcloud CLI!"
- Create a Python virtual environment: `python3 -m venv venv`.
- Activate the virtual environment: `source venv/bin/activate`.
- Install dependencies and packages: `pip install -r requirements.txt`.
- Fill in `.env.template` with your credentials and save it as `.env`. Follow the instructions in docs.
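The project presumably loads `.env` with a helper such as python-dotenv; as a sketch of what that step amounts to, here is a minimal loader (the variable names in the docstring are illustrative, not the project's actual keys):

```python
import os

def load_env(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines from a .env file into a dict,
    skipping blanks and comments, and export them into os.environ
    (values already set in the environment win)."""
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"')
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

Keeping credentials in `.env` (and out of version control) is the point of the template file: the code reads configuration from the environment rather than from hard-coded strings.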
There is some sample data provided in the sample_data directory.
It contains 10 text documents, each containing 1 paragraph of text. Our goal is to process these documents in a batch and generate, for each input document, an output containing its 'topic' and 'summary' in JSON format.
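The exact request schema the script emits is not shown here, but Gemini batch jobs on Vertex AI commonly take a JSONL file with one `{"request": ...}` object per line. A sketch of building such a file (the prompt wording and schema are assumptions to verify against the script's actual output):

```python
import json

def make_request_line(text: str) -> str:
    """Build one JSONL line asking Gemini for a topic and summary.
    The {"request": {"contents": [...]}} shape follows the common
    Vertex AI batch-prediction format; confirm it matches what the
    project's script actually generates."""
    request = {
        "request": {
            "contents": [
                {
                    "role": "user",
                    "parts": [
                        {"text": "Return JSON with 'topic' and 'summary' "
                                 "for this document:\n" + text}
                    ],
                }
            ]
        }
    }
    return json.dumps(request)

documents = ["First sample paragraph.", "Second sample paragraph."]
jsonl = "\n".join(make_request_line(doc) for doc in documents)
```

One line per document is what makes the batch job parallelizable: each line is an independent request that the service can process on its own.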
- Upload the sample data to the Input Folder of your bucket.
- Run `make batch`. This will create an input JSONL file, upload it to the JSONL Folder, and then create a batch job using that JSONL file.
- After the batch job is finished, update the `.env` file: set `OUTPUT_JSONL_PATH` to the folder name of your output JSONL file from your bucket.
- Run `make post_process`. This will process our documents using the output JSONL file, and our Output Folder in the bucket will be populated with our desired data.
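The post-processing step amounts to reading the output JSONL line by line and pulling the model's JSON answer out of each record. A minimal sketch, assuming the common batch-output shape `{"response": {"candidates": [{"content": {"parts": [{"text": ...}]}}]}}` (check this against your actual output file):

```python
import json

def extract_result(line: str) -> dict:
    """Extract the model's JSON answer from one output JSONL line.
    Assumes the typical Vertex AI batch-output record shape; the
    inner text is itself JSON because the prompt asked for a JSON
    reply with 'topic' and 'summary'."""
    record = json.loads(line)
    text = record["response"]["candidates"][0]["content"]["parts"][0]["text"]
    return json.loads(text)

# Synthetic example of one output line, for illustration only.
sample = json.dumps({
    "response": {"candidates": [{"content": {"parts": [
        {"text": '{"topic": "demo", "summary": "A short summary."}'}
    ]}}]}
})
result = extract_result(sample)
```

In practice the script would also need to handle lines where the model returned malformed JSON or an error record instead of a candidate.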
See more detailed documentation here.