Skip to content

dajuctech/startup-analytics-pipeline-python-etl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Startup Investment Analytics — Step 1 (Python ETL)

This repo is the first milestone of a full VC-focused data engineering portfolio: a Python ETL that ingests startup data, cleans & standardises it, and loads it to analytics-friendly storage (Parquet + SQLite).

Stack (Step 1)

  • Python, Pandas
  • Parquet (pyarrow)
  • SQLite (SQLAlchemy)
  • Pytest + GitHub Actions CI
  • (Next steps: Airflow, dbt, Docker, Azure)

📦 Project Structure

startup-analytics-pipeline-python-etl/
├─ src/
│  ├─ extract.py        # Extract raw CSV/JSON → Parquet
│  ├─ transform.py      # Clean/standardise → analytics-ready features
│  └─ load.py           # Load → SQLite demo warehouse
├─ data/
│  ├─ raw/              # put raw files here (sample provided)
│  └─ processed/        # parquet + sqlite outputs
├─ tests/               # pytest
├─ .github/workflows/ci.yml  # CI pipeline
├─ requirements.txt
└─ README.md

🚀 Quickstart

# 1) Create & activate venv (recommended)
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# 2) Install deps
pip install -r requirements.txt

# 3) Extract (CSV → Parquet)
python src/extract.py --input data/raw/sample_startups.csv --output data/processed/raw_extracted.parquet

# 4) Transform (clean + features)
python src/transform.py --input data/processed/raw_extracted.parquet --output data/processed/startups_clean.parquet

# 5) Load to SQLite (optional analytics)
python src/load.py --input data/processed/startups_clean.parquet --sqlite data/processed/startups.db

# 6) Run tests
pytest -q

🧪 What the Transform Adds

  • Normalised text, parsed dates & currency (USD)
  • Derived features: company_age_years, investor_count, has_recent_round, is_uk
  • Basic quality checks (missing columns → error)

📝 Medium Article Template

Title: Building a Python ETL for Startup Investment Analytics (Step 1 of a VC Data Stack)
Sections:

  1. Problem & goal (why VC firms need clean startup data)
  2. Dataset choice (sample CSV / API, schema)
  3. ETL steps (Extract → Transform → Load) with code snippets
  4. Derived features that matter for investors
  5. Storing results (Parquet, SQLite) + simple query example
  6. CI with pytest & GitHub Actions
  7. What’s next (Airflow, dbt, Docker, Azure)

🔗 Example Queries (SQLite)

Once loaded:

-- Top funded UK startups with recent rounds
SELECT company_name, sector, last_funding_type, last_funding_date, last_funding_amount_usd
FROM startups
WHERE is_uk = 1 AND has_recent_round = 1
ORDER BY last_funding_amount_usd DESC
LIMIT 10;

© 2025 Daniel Jude Chigozie — MIT License

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages