Command-line utility that downloads a dataset from any HTTP(S) endpoint and runs a lightweight preprocessing pass before handing you a clean CSV. The workflow is split into two stages: `downloader.py` for retrieval and `processor.py` for data cleanup.
- Downloads arbitrary files via `requests`, with explicit error handling for connection, HTTP, and generic request failures.
- Cleans CSV files with `pandas`: currently drops rows with missing values, but can be extended with your own transformations.
- Clear CLI defined in `main.py`, wiring both stages together so the clean dataset is produced in a single command.
- Python 3.9+
- Dependencies: `requests`, `pandas`
Install them globally or inside a virtual environment:
```shell
python -m venv .venv
.venv\Scripts\activate   # Windows; on macOS/Linux use: source .venv/bin/activate
pip install requests pandas
```
```shell
python main.py --url https://example.com/data.csv --output data/raw.csv --clean_output data/clean.csv
```
| Argument | Description |
|---|---|
| `--url` | HTTP(S) link to the dataset you want to download. |
| `--output` | Path where the raw response will be saved (directories must exist). |
| `--clean_output` | Path for the cleaned CSV produced by `processor.py`. |
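The arguments above suggest `main.py` parses them with `argparse`; a minimal sketch under that assumption (the function name `build_parser` and the help strings are illustrative, not part of the project):

```python
# Hypothetical sketch of the CLI argument wiring in main.py.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Download a dataset and produce a cleaned CSV."
    )
    parser.add_argument("--url", required=True,
                        help="HTTP(S) link to the dataset.")
    parser.add_argument("--output", required=True,
                        help="Path for the raw download.")
    parser.add_argument("--clean_output", required=True,
                        help="Path for the cleaned CSV.")
    return parser


# Example: args = build_parser().parse_args(["--url", "https://example.com/data.csv",
#                                            "--output", "raw.csv",
#                                            "--clean_output", "clean.csv"])
```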
`downloader.py` fetches the remote file and saves it to `--output`. `processor.py` reads the downloaded CSV, drops rows with missing values (extend this step as needed), and writes the result to `--clean_output`.
- Network issues (connection errors, bad status codes, malformed responses) terminate early with actionable messages.
- File handling issues (missing paths, permission problems, empty or malformed CSVs) are caught and reported before exiting.
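The network-side handling described above maps onto the exception hierarchy that `requests` actually exposes; a sketch, assuming a `download_file` helper (the function name and messages are illustrative):

```python
# Sketch of the downloader's error handling. The exception classes are real
# requests APIs; download_file itself is a hypothetical helper.
import sys

import requests


def download_file(url: str, output_path: str, timeout: float = 30.0) -> None:
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()        # turn 4xx/5xx into HTTPError
    except requests.exceptions.ConnectionError:
        sys.exit(f"Could not connect to {url}; check the host and your network.")
    except requests.exceptions.HTTPError as exc:
        sys.exit(f"Server returned an error status: {exc}")
    except requests.exceptions.RequestException as exc:
        sys.exit(f"Download failed: {exc}")
    with open(output_path, "wb") as fh:
        fh.write(response.content)
```

`sys.exit` with a string prints the message to stderr and terminates with a non-zero status, which matches the "terminate early with actionable messages" behavior.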
```
dataset_cli/
├── downloader.py   # Handles HTTP download logic
├── processor.py    # Cleans and exports the CSV
└── main.py         # CLI entry point composing the workflow
```
- Add new preprocessing logic inside `preprocess_data` (e.g., casting columns, filtering, feature engineering).
- Replace the downloader if you need authenticated requests or alternate protocols.
- Wrap the CLI with your own orchestration or scheduling system for recurring dataset refreshes.
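As an example of the first extension point, `preprocess_data` could grow beyond `dropna()` like this; the column names (`price`, `quantity`) and the DataFrame-in/DataFrame-out shape are hypothetical, chosen only to illustrate casting, filtering, and feature engineering:

```python
# Hypothetical extended preprocess_data; column names are illustrative only.
import pandas as pd


def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    cleaned = df.dropna()                              # existing step: drop incomplete rows
    cleaned = cleaned[cleaned["price"] > 0].copy()     # filtering: discard invalid prices
    cleaned["price"] = cleaned["price"].astype(float)  # casting: normalize dtype
    # feature engineering: derive a new column from existing ones
    cleaned["price_per_unit"] = cleaned["price"] / cleaned["quantity"]
    return cleaned
```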