XTejeshX/dataset_cli

dataset_cli

A command-line utility that downloads a dataset from any HTTP(S) endpoint and runs a lightweight preprocessing pass to produce a clean CSV. The workflow is split into two stages: downloader.py for retrieval and processor.py for data cleanup.

Features

  • Downloads arbitrary files via requests with explicit error handling for connection, HTTP, and generic request failures.
  • Cleans CSV files with pandas: currently drops rows with missing values but can be extended with your own transformations.
  • Clear CLI defined in main.py, wiring both stages together so the clean dataset is produced in a single command.

Requirements

  • Python 3.9+
  • Dependencies: requests, pandas

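Install them globally or inside a virtual environment. The activation command shown is for Windows; on macOS or Linux, run source .venv/bin/activate instead: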

python -m venv .venv
.venv\Scripts\activate
pip install requests pandas

Usage

python main.py --url https://example.com/data.csv --output data/raw.csv --clean_output data/clean.csv

Argument         Description
--url            HTTP(S) link to the dataset you want to download.
--output         Path where the raw response will be saved (directories must exist).
--clean_output   Path for the cleaned CSV produced by processor.py.
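
The three flags above could be declared with argparse roughly as follows. This is only a sketch: the function name build_parser, the help strings, and the choice to mark every flag as required are assumptions, not taken from main.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the three documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Download a dataset and write a cleaned CSV.",
    )
    # Assumption: all three flags are mandatory; the real CLI may define defaults.
    parser.add_argument("--url", required=True,
                        help="HTTP(S) link to the dataset to download.")
    parser.add_argument("--output", required=True,
                        help="Path where the raw response is saved.")
    parser.add_argument("--clean_output", required=True,
                        help="Path for the cleaned CSV.")
    return parser
```

Calling build_parser().parse_args() with no arguments reads sys.argv, and the parsed values are then available as args.url, args.output, and args.clean_output.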

Typical flow

  1. downloader.py fetches the remote file and saves it to --output.
  2. processor.py reads the downloaded CSV, drops rows with missing values (extend this step as needed), and writes it to --clean_output.
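
The cleanup step in the flow above might look like the sketch below. The name preprocess_data comes from this README, but its signature here is an assumption, and clean_csv is a hypothetical wrapper; the actual processor.py may be organised differently.

```python
import pandas as pd


def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without any row that has a missing value."""
    # Assumed signature: takes and returns a DataFrame.
    return df.dropna()


def clean_csv(input_path: str, output_path: str) -> int:
    """Read a CSV, clean it, write the result, and return the row count.

    Hypothetical wrapper tying the read, clean, and write steps together.
    """
    df = pd.read_csv(input_path)
    cleaned = preprocess_data(df)
    # index=False keeps the pandas row index out of the output file.
    cleaned.to_csv(output_path, index=False)
    return len(cleaned)
```

Keeping the transformation in its own function means new cleanup rules can be added without touching the file I/O.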

Error handling

  • Network issues (connection errors, bad status codes, malformed responses) terminate early with actionable messages.
  • File handling issues (missing paths, permission problems, empty or malformed CSVs) are caught and reported before exiting.
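
One way to get this behaviour with requests is sketched below. The function name download_file, the timeout default, and the message wording are assumptions for illustration; only the exception classes and calls are real requests and standard-library APIs.

```python
import sys

import requests


def download_file(url: str, output_path: str, timeout: float = 30.0) -> None:
    """Fetch url and write the body to output_path, exiting on failure.

    Hypothetical helper; the real downloader.py may differ.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError on 4xx/5xx statuses
    # The specific handlers come first: both subclass RequestException.
    except requests.exceptions.ConnectionError as exc:
        sys.exit(f"Connection failed for {url}: {exc}")
    except requests.exceptions.HTTPError as exc:
        sys.exit(f"Server returned an error status for {url}: {exc}")
    except requests.exceptions.RequestException as exc:
        sys.exit(f"Request failed for {url}: {exc}")

    try:
        with open(output_path, "wb") as fh:
            fh.write(response.content)
    except OSError as exc:  # covers FileNotFoundError and PermissionError
        sys.exit(f"Could not write {output_path}: {exc}")
```

Passing a string to sys.exit prints it to stderr and exits with status 1, so a failed run is visible to shell scripts and schedulers.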

Project structure

dataset_cli/
├── downloader.py   # Handles HTTP download logic
├── processor.py    # Cleans and exports the CSV
└── main.py         # CLI entry point composing the workflow

Extending

  • Add new preprocessing logic inside preprocess_data (e.g., casting columns, filtering, feature engineering).
  • Replace the downloader if you need authenticated requests or alternate protocols.
  • Wrap the CLI with your own orchestration or scheduling system for recurring dataset refreshes.
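
As an illustration of the first point, an extended preprocess_data could cast columns, filter rows, and derive a feature. The column names price and date are hypothetical, as is the function signature; adapt both to your dataset.

```python
import pandas as pd


def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows, then apply example casts, a filter, and a feature."""
    cleaned = df.dropna().copy()
    # Hypothetical columns: "price" (numeric text) and "date" (ISO dates).
    cleaned["price"] = pd.to_numeric(cleaned["price"], errors="coerce")
    cleaned["date"] = pd.to_datetime(cleaned["date"], errors="coerce")
    # Failed casts become NaN/NaT, so drop those rows as well.
    cleaned = cleaned.dropna(subset=["price", "date"])
    # Example filter: keep strictly positive prices.
    cleaned = cleaned[cleaned["price"] > 0].copy()
    # Example derived feature: calendar year of each record.
    cleaned["year"] = cleaned["date"].dt.year
    return cleaned.reset_index(drop=True)
```

Using errors="coerce" turns unparseable values into missing values instead of raising, which fits the existing drop-missing-rows approach.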
