AI CSV Data Extractor

This web application, built with Python and Flask, cleans and standardizes data from CSV or XLSX files using the Google Gemini API. It serves as a Python-based re-implementation of the concepts from the original AI Employee CSV Cleaner (a React/Node.js application).

Features

  • File Upload: Supports both .csv and .xlsx file formats.
  • AI-Powered Cleaning: Leverages the Gemini model to process each row, correcting typos, standardizing formats, and handling inconsistencies.
  • Download Results: Allows you to download the fully cleaned dataset as a new CSV file.
  • Containerized: Includes a Dockerfile for easy containerization.
  • Cloud Run Deployment: Comes with a deployment script (deploy.sh) for simple, one-command deploys to Google Cloud Run.
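The per-row AI cleaning step could be sketched as follows. This is illustrative only: the prompt wording and model name are assumptions, and `build_prompt` is a hypothetical helper, not the project's actual API.

```python
import json

PROMPT_TEMPLATE = (
    "Clean this CSV row: fix typos, standardize formats, and resolve "
    "inconsistencies. Return the corrected row as JSON with the same keys.\n{row}"
)

def build_prompt(row: dict) -> str:
    """Render one row of the spreadsheet into a cleaning prompt."""
    return PROMPT_TEMPLATE.format(row=json.dumps(row))

# The actual Gemini call would look roughly like (requires google-generativeai):
#   import os
#   import google.generativeai as genai
#   genai.configure(api_key=os.environ["GEMINI_API_KEY"])
#   model = genai.GenerativeModel("gemini-1.5-flash")
#   cleaned = json.loads(model.generate_content(build_prompt(row)).text)
```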

Architecture

This application is designed to be memory-efficient, especially when processing large files. It achieves this through the following mechanisms:

  • Chunk-Based Processing: Instead of loading the entire file into memory, the application reads and processes the file in smaller chunks.
  • Streaming to a Temporary File: The cleaned data is streamed to a temporary file on disk instead of being held in memory. This significantly reduces the application's memory footprint.
  • No Data Preview: To further conserve memory, the data preview feature has been disabled. The full cleaned data can be downloaded as a CSV file.
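The chunk-and-stream approach above can be sketched with pandas. This is a minimal illustration, not the app's actual code: `clean_chunk` stands in for the real per-chunk AI cleaning step, and the chunk size is arbitrary.

```python
import tempfile

import pandas as pd

def clean_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Placeholder for the cleaning step (the real app calls Gemini here)."""
    return chunk.apply(lambda col: col.str.strip() if col.dtype == object else col)

def process_file(input_path: str, chunksize: int = 500) -> str:
    """Read the CSV in chunks and stream cleaned rows to a temp file on disk."""
    out = tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", delete=False, newline=""
    )
    with out:
        for i, chunk in enumerate(pd.read_csv(input_path, chunksize=chunksize)):
            cleaned = clean_chunk(chunk)
            # Write the header only once, for the first chunk.
            cleaned.to_csv(out, index=False, header=(i == 0))
    return out.name
```

Because only one chunk is resident at a time and output goes straight to disk, peak memory stays roughly constant regardless of file size.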

Tech Stack

  • Backend: Python, Flask, Gunicorn
  • Data Handling: Pandas
  • AI: Google Gemini
  • Testing: pytest, pytest-cov, pytest-mock
  • Containerization: Docker
  • Deployment: Google Cloud Run, Artifact Registry, Secret Manager

Local Development

Follow these steps to run the application on your local machine.

1. Prerequisites

  • Python 3.9+
  • A valid Google Gemini API Key.

2. Setup

  1. Navigate to the project directory:

    cd csv-data-extractor
  2. Create and activate a virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate
    # On Windows, use: .venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Configure your API Key: Create a .env file by copying the example file, then add your API key.

    cp .env.example .env

    Now, open the .env file and replace YOUR_API_KEY_HERE with your actual Gemini API key.
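For reference, reading the key from `.env` amounts to something like the stdlib-only sketch below (the app more likely uses the python-dotenv package; the variable name `GEMINI_API_KEY` comes from the deploy script).

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: put KEY=VALUE lines into os.environ."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and malformed lines.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"'))
```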

3. Running the Application

  1. Start the Flask development server:

    python main.py
  2. Open your web browser and navigate to http://127.0.0.1:5000.
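For orientation, the entry point in `main.py` presumably looks something like this minimal Flask sketch (route and response are placeholders, not the app's actual handlers):

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    # The real app serves an upload form here; this is a placeholder.
    return "AI CSV Data Extractor"

# main.py presumably ends with something like:
#   if __name__ == "__main__":
#       app.run(host="127.0.0.1", port=5000)
```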


Running the Tests

This project uses pytest for unit testing.

  1. Install test dependencies:

    pip install pytest pytest-cov pytest-mock
  2. Run the tests with coverage:

    pytest --cov=main --cov=services --cov=utils
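Since the cleaning pipeline calls an external API, the tests presumably mock out the Gemini call (hence pytest-mock). A minimal sketch of that pattern, with illustrative names (`clean_row` and `ai_fn` are not the project's actual functions):

```python
def clean_row(row: dict, ai_fn) -> dict:
    """Stand-in for a services-layer function that delegates to the AI."""
    return ai_fn(row)

def test_clean_row_with_fake_ai():
    # Replace the AI call with a deterministic fake so the test runs offline.
    fake_ai = lambda row: {k: v.strip() for k, v in row.items()}
    assert clean_row({"name": " Alice "}, fake_ai) == {"name": "Alice"}
```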

Deployment to Google Cloud Run

The included deploy.sh script automates the entire deployment process.

1. Prerequisites

  • The gcloud CLI installed and authenticated.
  • Docker installed (the script builds and pushes a container image).
  • A Google Cloud project with the Cloud Run, Artifact Registry, and Secret Manager APIs enabled.

2. Configuration

The script uses the following environment variables, which you can override:

  • PROJECT_ID: Your Google Cloud project ID (default: awanmasterpiece).
  • REGION: The Cloud Run region (default: asia-southeast2).
  • SERVICE: The Cloud Run service name (default: csv-data-extractor).
  • REPO: The Artifact Registry repository name (default: csv-data-extractor).
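Overridable defaults like these are typically implemented in deploy.sh with shell parameter expansion; a sketch of the likely pattern:

```shell
# Each variable falls back to its default unless set in the environment.
PROJECT_ID="${PROJECT_ID:-awanmasterpiece}"
REGION="${REGION:-asia-southeast2}"
SERVICE="${SERVICE:-csv-data-extractor}"
REPO="${REPO:-csv-data-extractor}"
echo "Deploying ${SERVICE} to ${PROJECT_ID} (${REGION})"
```

To override, set the variable when invoking the script, e.g. `REGION=us-central1 ./deploy.sh`.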

3. Running the Deployment Script

  1. Make the script executable:

    chmod +x deploy.sh
  2. Run the script: You can provide your Gemini API key as an argument. The script will automatically create or update a secret in Google Secret Manager.

    ./deploy.sh "YOUR_GEMINI_API_KEY"

    Alternatively, you can set it as an environment variable first:

    export GEMINI_API_KEY="YOUR_GEMINI_API_KEY"
    ./deploy.sh

The script will build the Docker image, push it to Artifact Registry, and deploy the service to Cloud Run. The GEMINI_API_KEY will be securely mounted into the service from Secret Manager.
