This web application, built with Python and Flask, cleans and standardizes data from CSV or XLSX files using the Google Gemini API. It serves as a Python-based re-implementation of the concepts from the original AI Employee CSV Cleaner (a React/Node.js application).
- File Upload: Supports both `.csv` and `.xlsx` file formats.
- AI-Powered Cleaning: Leverages the Gemini model to process each row, correcting typos, standardizing formats, and handling inconsistencies.
- Download Results: Allows you to download the fully cleaned dataset as a new CSV file.
- Containerized: Includes a `Dockerfile` for easy containerization.
- Cloud Run Deployment: Comes with a deployment script (`deploy.sh`) for simple, one-command deploys to Google Cloud Run.
This application is designed to be memory-efficient, especially when processing large files. It achieves this through the following mechanisms:
- Chunk-Based Processing: Instead of loading the entire file into memory, the application reads and processes the file in smaller chunks.
- Streaming to a Temporary File: The cleaned data is streamed to a temporary file on disk instead of being held in memory. This significantly reduces the application's memory footprint.
- No Data Preview: To further conserve memory, the data preview feature has been disabled. The full cleaned data can be downloaded as a CSV file.
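The chunk-and-stream approach described above can be sketched roughly as follows. This is an illustration only, not the application's actual code: it uses the standard-library `csv` module in place of Pandas so that it runs without extra dependencies, and the names `iter_chunks`, `clean_to_temp_file`, and `clean_row` are hypothetical. The `clean_row` callable stands in for whatever per-row cleaning the Gemini call performs.

```python
import csv
import itertools
import tempfile
from typing import Callable, Dict, Iterator, List

Row = Dict[str, str]


def iter_chunks(reader: Iterator[Row], chunk_size: int) -> Iterator[List[Row]]:
    """Yield lists of at most chunk_size rows until the reader is exhausted."""
    while True:
        chunk = list(itertools.islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def clean_to_temp_file(
    input_path: str,
    clean_row: Callable[[Row], Row],  # hypothetical stand-in for the AI cleaning step
    chunk_size: int = 100,
) -> str:
    """Clean input_path chunk by chunk and return the path of the temp output file."""
    with open(input_path, newline="", encoding="utf-8") as src, \
            tempfile.NamedTemporaryFile(
                "w", newline="", encoding="utf-8", suffix=".csv", delete=False
            ) as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames or [])
        writer.writeheader()
        for chunk in iter_chunks(reader, chunk_size):
            # Only one chunk of rows is held in memory at a time;
            # cleaned rows go straight to disk.
            writer.writerows(clean_row(row) for row in chunk)
        return dst.name
```

Peak memory is bounded by `chunk_size` rather than by the file size, which is the property the list above relies on. The caller is responsible for deleting the temporary file once the download has been served.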
- Backend: Python, Flask, Gunicorn
- Data Handling: Pandas
- AI: Google Gemini
- Testing: pytest, pytest-cov, pytest-mock
- Containerization: Docker
- Deployment: Google Cloud Run, Artifact Registry, Secret Manager
Follow these steps to run the application on your local machine.
- Python 3.9+
- A valid Google Gemini API Key.
- Navigate to the project directory:

      cd csv-data-extractor

- Create and activate a virtual environment:

      python3 -m venv .venv
      source .venv/bin/activate  # On Windows, use: .venv\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt

- Configure your API Key: Create a `.env` file by copying the example file, then add your API key.

      cp .env.example .env

  Now, open the `.env` file and replace `YOUR_API_KEY_HERE` with your actual Gemini API key.

- Start the Flask development server:

      python main.py

- Open your web browser and navigate to http://127.0.0.1:5000.
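If the server fails to start because of the API key, it can help to check how the key is being resolved. The sketch below is a minimal, standard-library illustration of that lookup, not the application's actual loader. It assumes the variable in `.env` is named `GEMINI_API_KEY` (the name used by the deployment script), and the helper names `parse_env_file` and `get_api_key` are hypothetical.

```python
import os
from typing import Dict, Mapping


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def get_api_key(env: Mapping[str, str], file_values: Mapping[str, str]) -> str:
    """Prefer the process environment, fall back to .env, reject the placeholder."""
    # Assumption: the key is stored under the name GEMINI_API_KEY.
    key = env.get("GEMINI_API_KEY") or file_values.get("GEMINI_API_KEY", "")
    if not key or key == "YOUR_API_KEY_HERE":
        raise RuntimeError("GEMINI_API_KEY is not set; edit your .env file.")
    return key
```

A typical call would be `get_api_key(os.environ, parse_env_file(open(".env").read()))`. Failing fast on the unedited placeholder gives a clearer error than a rejected request to the API later on.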
This project uses pytest for unit testing.
- Install test dependencies:

      pip install pytest pytest-cov pytest-mock

- Run the tests with coverage:

      pytest --cov=main --cov=services --cov=utils
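Because the cleaning step calls an external API, unit tests usually replace that call with a mock. The example below is a hedged sketch of what such a test could look like. The function `clean_rows` is a hypothetical unit under test, not a function from this project, and the mock comes from the standard-library `unittest.mock`, which is what the `mocker` fixture from pytest-mock wraps.

```python
from typing import Callable, Dict, List
from unittest.mock import Mock

Row = Dict[str, str]


def clean_rows(rows: List[Row], call_model: Callable[[Row], Row]) -> List[Row]:
    """Hypothetical unit under test: clean each row, keep the original on failure."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append(call_model(row))
        except Exception:
            # Fall back to the untouched row if the model call fails.
            cleaned.append(row)
    return cleaned


def test_clean_rows_calls_model_once_per_row():
    # The mock stands in for the Gemini call, so no network access is needed.
    fake_model = Mock(side_effect=lambda row: {k: v.strip() for k, v in row.items()})
    result = clean_rows([{"name": " Ann "}, {"name": "Bo"}], fake_model)
    assert result == [{"name": "Ann"}, {"name": "Bo"}]
    assert fake_model.call_count == 2


def test_clean_rows_keeps_row_when_model_fails():
    failing_model = Mock(side_effect=RuntimeError("API unavailable"))
    row = {"name": " Ann "}
    assert clean_rows([row], failing_model) == [row]
```

Saved in a file whose name starts with `test_`, these functions are collected by the `pytest` command shown above without any further configuration.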
The included deploy.sh script automates the entire deployment process.
- Google Cloud SDK (`gcloud`) installed and authenticated.
- Docker installed and running.
- A Google Cloud Project with billing enabled.
The script uses the following environment variables, which you can override:
- `PROJECT_ID`: Your Google Cloud project ID (default: `awanmasterpiece`).
- `REGION`: The Cloud Run region (default: `asia-southeast2`).
- `SERVICE`: The Cloud Run service name (default: `csv-data-extractor`).
- `REPO`: The Artifact Registry repository name (default: `csv-data-extractor`).
- Make the script executable:

      chmod +x deploy.sh

- Run the script: You can provide your Gemini API key as an argument. The script will automatically create or update a secret in Google Secret Manager.

      ./deploy.sh "YOUR_GEMINI_API_KEY"

  Alternatively, you can set it as an environment variable first:

      export GEMINI_API_KEY="YOUR_GEMINI_API_KEY"
      ./deploy.sh
The script will build the Docker image, push it to Artifact Registry, and deploy the service to Cloud Run. The GEMINI_API_KEY will be securely mounted into the service from Secret Manager.