Startup Investment Analytics — Step 1 (Python ETL)

This repo is the first milestone of a full VC-focused data engineering portfolio: a Python ETL that ingests startup data, cleans & standardises it, and loads it to analytics-friendly storage (Parquet + SQLite).

Stack (Step 1)

Python, Pandas
Parquet (pyarrow)
SQLite (SQLAlchemy)
Pytest + GitHub Actions CI
(Next steps: Airflow, dbt, Docker, Azure)

📦 Project Structure

startup-analytics-pipeline-python-etl/
├─ src/
│  ├─ extract.py        # Extract raw CSV/JSON → Parquet
│  ├─ transform.py      # Clean/standardise → analytics-ready features
│  └─ load.py           # Load → SQLite demo warehouse
├─ data/
│  ├─ raw/              # put raw files here (sample provided)
│  └─ processed/        # parquet + sqlite outputs
├─ tests/               # pytest
├─ .github/workflows/ci.yml  # CI pipeline
├─ requirements.txt
└─ README.md

🚀 Quickstart

# 1) Create & activate venv (recommended)
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# 2) Install deps
pip install -r requirements.txt

# 3) Extract (CSV → Parquet)
python src/extract.py --input data/raw/sample_startups.csv --output data/processed/raw_extracted.parquet

# 4) Transform (clean + features)
python src/transform.py --input data/processed/raw_extracted.parquet --output data/processed/startups_clean.parquet

# 5) Load to SQLite (optional analytics)
python src/load.py --input data/processed/startups_clean.parquet --sqlite data/processed/startups.db

# 6) Run tests
pytest -q

🧪 What the Transform Adds

Normalised text, parsed dates & currency (USD)
Derived features: company_age_years, investor_count, has_recent_round, is_uk
Basic quality checks (missing columns → error)

📝 Medium Article Template

Title: Building a Python ETL for Startup Investment Analytics (Step 1 of a VC Data Stack)
Sections:

Problem & goal (why VC firms need clean startup data)
Dataset choice (sample CSV / API, schema)
ETL steps (Extract → Transform → Load) with code snippets
Derived features that matter for investors
Storing results (Parquet, SQLite) + simple query example
CI with pytest & GitHub Actions
What’s next (Airflow, dbt, Docker, Azure)

🔗 Example Queries (SQLite)

Once loaded:

-- Top funded UK startups with recent rounds
SELECT company_name, sector, last_funding_type, last_funding_date, last_funding_amount_usd
FROM startups
WHERE is_uk = 1 AND has_recent_round = 1
ORDER BY last_funding_amount_usd DESC
LIMIT 10;

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Startup Investment Analytics — Step 1 (Python ETL)

Stack (Step 1)

📦 Project Structure

🚀 Quickstart

🧪 What the Transform Adds

📝 Medium Article Template

🔗 Example Queries (SQLite)

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github/workflows		.github/workflows
data		data
src		src
tests		tests
.gitignore		.gitignore
BLOG_DRAFT.md		BLOG_DRAFT.md
README.md		README.md
requirements.txt		requirements.txt

dajuctech/startup-analytics-pipeline-python-etl

Folders and files

Latest commit

History

Repository files navigation

Startup Investment Analytics — Step 1 (Python ETL)

Stack (Step 1)

📦 Project Structure

🚀 Quickstart

🧪 What the Transform Adds

📝 Medium Article Template

🔗 Example Queries (SQLite)

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages