Command-line utility that downloads a dataset from any HTTP(S) endpoint and runs a lightweight preprocessing pass before handing you a clean CSV. The workflow is split into two stages: `downloader.py` for retrieval and `processor.py` for data cleanup.
- Downloads arbitrary files via `requests`, with explicit error handling for connection, HTTP, and generic request failures.
- Cleans CSV files with `pandas`: currently drops rows with missing values, but can be extended with your own transformations.
- Clear CLI defined in `main.py`, wiring both stages together so the clean dataset is produced in a single command.
- Python 3.9+
- Dependencies: `requests`, `pandas`
Install them globally or inside a virtual environment:
```shell
python -m venv .venv
.venv\Scripts\activate   # Windows; on macOS/Linux use: source .venv/bin/activate
pip install requests pandas
```
```shell
python main.py --url https://example.com/data.csv --output data/raw.csv --clean_output data/clean.csv
```
| Argument | Description |
|---|---|
| `--url` | HTTP(S) link to the dataset you want to download. |
| `--output` | Path where the raw response will be saved (directories must exist). |
| `--clean_output` | Path for the cleaned CSV produced by `processor.py`. |
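The arguments above suggest `main.py` parses them with `argparse`; a minimal sketch under that assumption (the function name `build_parser` and the help strings are illustrative, not part of the project):

```python
# Hypothetical sketch of the CLI argument wiring in main.py.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Download a dataset and produce a cleaned CSV."
    )
    parser.add_argument("--url", required=True,
                        help="HTTP(S) link to the dataset.")
    parser.add_argument("--output", required=True,
                        help="Path for the raw download.")
    parser.add_argument("--clean_output", required=True,
                        help="Path for the cleaned CSV.")
    return parser


# Example: args = build_parser().parse_args(["--url", "https://example.com/data.csv",
#                                            "--output", "raw.csv",
#                                            "--clean_output", "clean.csv"])
```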
`downloader.py` fetches the remote file and saves it to `--output`. `processor.py` reads the downloaded CSV, drops rows with missing values (extend this step as needed), and writes the result to `--clean_output`.
- Network issues (connection errors, bad status codes, malformed responses) terminate early with actionable messages.
- File handling issues (missing paths, permission problems, empty or malformed CSVs) are caught and reported before exiting.
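The network-side handling described above maps onto the exception hierarchy that `requests` actually exposes; a sketch, assuming a `download_file` helper (the function name and messages are illustrative):

```python
# Sketch of the downloader's error handling. The exception classes are real
# requests APIs; download_file itself is a hypothetical helper.
import sys

import requests


def download_file(url: str, output_path: str, timeout: float = 30.0) -> None:
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()        # turn 4xx/5xx into HTTPError
    except requests.exceptions.ConnectionError:
        sys.exit(f"Could not connect to {url}; check the host and your network.")
    except requests.exceptions.HTTPError as exc:
        sys.exit(f"Server returned an error status: {exc}")
    except requests.exceptions.RequestException as exc:
        sys.exit(f"Download failed: {exc}")
    with open(output_path, "wb") as fh:
        fh.write(response.content)
```

`sys.exit` with a string prints the message to stderr and terminates with a non-zero status, which matches the "terminate early with actionable messages" behavior.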
```
dataset_cli/
├── downloader.py   # Handles HTTP download logic
├── processor.py    # Cleans and exports the CSV
└── main.py         # CLI entry point composing the workflow
```
- Add new preprocessing logic inside `preprocess_data` (e.g., casting columns, filtering, feature engineering).
- Replace the downloader if you need authenticated requests or alternate protocols.
- Wrap the CLI with your own orchestration or scheduling system for recurring dataset refreshes.
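As an example of the first extension point, `preprocess_data` could grow beyond `dropna()` like this; the column names (`price`, `quantity`) and the DataFrame-in/DataFrame-out shape are hypothetical, chosen only to illustrate casting, filtering, and feature engineering:

```python
# Hypothetical extended preprocess_data; column names are illustrative only.
import pandas as pd


def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    cleaned = df.dropna()                              # existing step: drop incomplete rows
    cleaned = cleaned[cleaned["price"] > 0].copy()     # filtering: discard invalid prices
    cleaned["price"] = cleaned["price"].astype(float)  # casting: normalize dtype
    # feature engineering: derive a new column from existing ones
    cleaned["price_per_unit"] = cleaned["price"] / cleaned["quantity"]
    return cleaned
```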