This repo is the first milestone of a full VC-focused data engineering portfolio: a Python ETL that ingests startup data, cleans & standardises it, and loads it to analytics-friendly storage (Parquet + SQLite).
- Python, Pandas
- Parquet (pyarrow)
- SQLite (SQLAlchemy)
- Pytest + GitHub Actions CI
- (Next steps: Airflow, dbt, Docker, Azure)
startup-analytics-pipeline-python-etl/
├─ src/
│ ├─ extract.py # Extract raw CSV/JSON → Parquet
│ ├─ transform.py # Clean/standardise → analytics-ready features
│ └─ load.py # Load → SQLite demo warehouse
├─ data/
│ ├─ raw/ # put raw files here (sample provided)
│ └─ processed/ # parquet + sqlite outputs
├─ tests/ # pytest
├─ .github/workflows/ci.yml # CI pipeline
├─ requirements.txt
└─ README.md
# 1) Create & activate venv (recommended)
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# 2) Install deps
pip install -r requirements.txt
# 3) Extract (CSV → Parquet)
python src/extract.py --input data/raw/sample_startups.csv --output data/processed/raw_extracted.parquet
# 4) Transform (clean + features)
python src/transform.py --input data/processed/raw_extracted.parquet --output data/processed/startups_clean.parquet
# 5) Load to SQLite (optional analytics)
python src/load.py --input data/processed/startups_clean.parquet --sqlite data/processed/startups.db
# 6) Run tests
pytest -q- Normalised text, parsed dates & currency (USD)
- Derived features:
company_age_years,investor_count,has_recent_round,is_uk - Basic quality checks (missing columns → error)
Title: Building a Python ETL for Startup Investment Analytics (Step 1 of a VC Data Stack)
Sections:
- Problem & goal (why VC firms need clean startup data)
- Dataset choice (sample CSV / API, schema)
- ETL steps (Extract → Transform → Load) with code snippets
- Derived features that matter for investors
- Storing results (Parquet, SQLite) + simple query example
- CI with pytest & GitHub Actions
- What’s next (Airflow, dbt, Docker, Azure)
Once loaded:
-- Top funded UK startups with recent rounds
SELECT company_name, sector, last_funding_type, last_funding_date, last_funding_amount_usd
FROM startups
WHERE is_uk = 1 AND has_recent_round = 1
ORDER BY last_funding_amount_usd DESC
LIMIT 10;© 2025 Daniel Jude Chigozie — MIT License