AI CSV Data Extractor

This web application, built with Python and Flask, cleans and standardizes data from CSV or XLSX files using the Google Gemini API. It serves as a Python-based re-implementation of the concepts from the original AI Employee CSV Cleaner (a React/Node.js application).

Features

  • File Upload: Supports both .csv and .xlsx file formats.
  • AI-Powered Cleaning: Leverages the Gemini model to process each row, correcting typos, standardizing formats, and handling inconsistencies.
  • Download Results: Allows you to download the fully cleaned dataset as a new CSV file.
  • Containerized: Includes a Dockerfile for easy containerization.
  • Cloud Run Deployment: Comes with a deployment script (deploy.sh) for simple, one-command deploys to Google Cloud Run.
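The per-row AI cleaning step could be sketched as follows. This is illustrative only: the prompt wording and model name are assumptions, and `build_prompt` is a hypothetical helper, not the project's actual API.

```python
import json

PROMPT_TEMPLATE = (
    "Clean this CSV row: fix typos, standardize formats, and resolve "
    "inconsistencies. Return the corrected row as JSON with the same keys.\n{row}"
)

def build_prompt(row: dict) -> str:
    """Render one row of the spreadsheet into a cleaning prompt."""
    return PROMPT_TEMPLATE.format(row=json.dumps(row))

# The actual Gemini call would look roughly like (requires google-generativeai):
#   import os
#   import google.generativeai as genai
#   genai.configure(api_key=os.environ["GEMINI_API_KEY"])
#   model = genai.GenerativeModel("gemini-1.5-flash")
#   cleaned = json.loads(model.generate_content(build_prompt(row)).text)
```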

Architecture

This application is designed to be memory-efficient, especially when processing large files. It achieves this through the following mechanisms:

  • Chunk-Based Processing: Instead of loading the entire file into memory, the application reads and processes the file in smaller chunks.
  • Streaming to a Temporary File: The cleaned data is streamed to a temporary file on disk instead of being held in memory. This significantly reduces the application's memory footprint.
  • No Data Preview: To further conserve memory, the data preview feature has been disabled. The full cleaned data can be downloaded as a CSV file.
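The chunk-and-stream approach above can be sketched with pandas. This is a minimal illustration, not the app's actual code: `clean_chunk` stands in for the real per-chunk AI cleaning step, and the chunk size is arbitrary.

```python
import tempfile

import pandas as pd

def clean_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Placeholder for the cleaning step (the real app calls Gemini here)."""
    return chunk.apply(lambda col: col.str.strip() if col.dtype == object else col)

def process_file(input_path: str, chunksize: int = 500) -> str:
    """Read the CSV in chunks and stream cleaned rows to a temp file on disk."""
    out = tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", delete=False, newline=""
    )
    with out:
        for i, chunk in enumerate(pd.read_csv(input_path, chunksize=chunksize)):
            cleaned = clean_chunk(chunk)
            # Write the header only once, for the first chunk.
            cleaned.to_csv(out, index=False, header=(i == 0))
    return out.name
```

Because only one chunk is resident at a time and output goes straight to disk, peak memory stays roughly constant regardless of file size.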

Tech Stack

  • Backend: Python, Flask, Gunicorn
  • Data Handling: Pandas
  • AI: Google Gemini
  • Testing: pytest, pytest-cov, pytest-mock
  • Containerization: Docker
  • Deployment: Google Cloud Run, Artifact Registry, Secret Manager

Local Development

Follow these steps to run the application on your local machine.

1. Prerequisites

  • Python 3.9+
  • A valid Google Gemini API Key.

2. Setup

  1. Navigate to the project directory:

    cd csv-data-extractor
  2. Create and activate a virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate
    # On Windows, use: .venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Configure your API Key: Create a .env file by copying the example file, then add your API key.

    cp .env.example .env

    Now, open the .env file and replace YOUR_API_KEY_HERE with your actual Gemini API key.
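For reference, reading the key from `.env` amounts to something like the stdlib-only sketch below (the app more likely uses the python-dotenv package; the variable name `GEMINI_API_KEY` comes from the deploy script).

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: put KEY=VALUE lines into os.environ."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and malformed lines.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"'))
```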

3. Running the Application

  1. Start the Flask development server:

    python main.py
  2. Open your web browser and navigate to http://127.0.0.1:5000.
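For orientation, the entry point in `main.py` presumably looks something like this minimal Flask sketch (route and response are placeholders, not the app's actual handlers):

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    # The real app serves an upload form here; this is a placeholder.
    return "AI CSV Data Extractor"

# main.py presumably ends with something like:
#   if __name__ == "__main__":
#       app.run(host="127.0.0.1", port=5000)
```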


Running the Tests

This project uses pytest for unit testing.

  1. Install test dependencies:

    pip install pytest pytest-cov pytest-mock
  2. Run the tests with coverage:

    pytest --cov=main --cov=services --cov=utils
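Since the cleaning pipeline calls an external API, the tests presumably mock out the Gemini call (hence pytest-mock). A minimal sketch of that pattern, with illustrative names (`clean_row` and `ai_fn` are not the project's actual functions):

```python
def clean_row(row: dict, ai_fn) -> dict:
    """Stand-in for a services-layer function that delegates to the AI."""
    return ai_fn(row)

def test_clean_row_with_fake_ai():
    # Replace the AI call with a deterministic fake so the test runs offline.
    fake_ai = lambda row: {k: v.strip() for k, v in row.items()}
    assert clean_row({"name": " Alice "}, fake_ai) == {"name": "Alice"}
```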

Deployment to Google Cloud Run

The included deploy.sh script automates the entire deployment process.

1. Prerequisites

  • The gcloud CLI installed and authenticated.
  • Docker installed (the script builds and pushes a container image).
  • A Google Cloud project with the Cloud Run, Artifact Registry, and Secret Manager APIs enabled.

2. Configuration

The script uses the following environment variables, which you can override:

  • PROJECT_ID: Your Google Cloud project ID (default: awanmasterpiece).
  • REGION: The Cloud Run region (default: asia-southeast2).
  • SERVICE: The Cloud Run service name (default: csv-data-extractor).
  • REPO: The Artifact Registry repository name (default: csv-data-extractor).
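Overridable defaults like these are typically implemented in deploy.sh with shell parameter expansion; a sketch of the likely pattern:

```shell
# Each variable falls back to its default unless set in the environment.
PROJECT_ID="${PROJECT_ID:-awanmasterpiece}"
REGION="${REGION:-asia-southeast2}"
SERVICE="${SERVICE:-csv-data-extractor}"
REPO="${REPO:-csv-data-extractor}"
echo "Deploying ${SERVICE} to ${PROJECT_ID} (${REGION})"
```

To override, set the variable when invoking the script, e.g. `REGION=us-central1 ./deploy.sh`.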

3. Running the Deployment Script

  1. Make the script executable:

    chmod +x deploy.sh
  2. Run the script: You can provide your Gemini API key as an argument. The script will automatically create or update a secret in Google Secret Manager.

    ./deploy.sh "YOUR_GEMINI_API_KEY"

    Alternatively, you can set it as an environment variable first:

    export GEMINI_API_KEY="YOUR_GEMINI_API_KEY"
    ./deploy.sh

The script will build the Docker image, push it to Artifact Registry, and deploy the service to Cloud Run. The GEMINI_API_KEY will be securely mounted into the service from Secret Manager.
