diff --git a/contributing/DEVELOPMENT.md b/contributing/DEVELOPMENT.md
index c5f2a6c..50fb372 100644
--- a/contributing/DEVELOPMENT.md
+++ b/contributing/DEVELOPMENT.md
@@ -111,10 +111,27 @@ pre-commit run --all-files
 
 ### Building Documentation
 
-The project previously used MkDocs for documentation. Documentation now lives primarily in:
-- README.md - Main documentation
-- Docstrings in source code
-- Contributing guides in /contributing
+MkDocs (Material theme) powers the public documentation site hosted at `https://allisonwang-db.github.io/pyspark-data-sources/`.
+
+#### Preview Locally
+
+Run the live preview server (rebuilds on save):
+
+```bash
+poetry run mkdocs serve
+```
+
+The site is served at `http://127.0.0.1:8000/` by default.
+
+#### Build for Verification
+
+Before sending a PR, ensure the static build succeeds and address any warnings:
+
+```bash
+poetry run mkdocs build
+```
+
+Common warnings include missing navigation entries or broken links; update `mkdocs.yml` or the relevant Markdown files to resolve them.
 
 ### Writing Docstrings
 
diff --git a/contributing/RELEASE.md b/contributing/RELEASE.md
index 6fe91c1..a2ac3ed 100644
--- a/contributing/RELEASE.md
+++ b/contributing/RELEASE.md
@@ -173,6 +173,21 @@ gh workflow run docs.yml
 # Go to Actions tab → Deploy MkDocs to GitHub Pages → Run workflow
 ```
 
+### Releasing the Documentation Site
+
+Follow these steps to publish documentation updates:
+
+1. Verify the docs build locally:
+   ```bash
+   poetry run mkdocs build
+   ```
+2. Commit any updated Markdown or configuration files and push to the default branch. This triggers the `docs.yml` workflow, which rebuilds and publishes the site to GitHub Pages.
+3. (Optional) If you need to deploy immediately without waiting for CI, run:
+   ```bash
+   poetry run mkdocs gh-deploy
+   ```
+   This command builds the site and pushes it to the `gh-pages` branch directly.
+
 ### Documentation URLs
 
 - **Live Docs**: https://allisonwang-db.github.io/pyspark-data-sources
diff --git a/docs/datasources/arrow.md b/docs/datasources/arrow.md
new file mode 100644
index 0000000..c64d848
--- /dev/null
+++ b/docs/datasources/arrow.md
@@ -0,0 +1,6 @@
+# ArrowDataSource
+
+> Requires the [`PyArrow`](https://arrow.apache.org/docs/python/) library. You can install it manually: `pip install pyarrow`
+> or use `pip install pyspark-data-sources[arrow]`.
+
+::: pyspark_datasources.arrow.ArrowDataSource
diff --git a/docs/datasources/fake.md b/docs/datasources/fake.md
new file mode 100644
index 0000000..acb3ddc
--- /dev/null
+++ b/docs/datasources/fake.md
@@ -0,0 +1,6 @@
+# FakeDataSource
+
+> Requires the [`Faker`](https://github.com/joke2k/faker) library. You can install it manually: `pip install faker`
+> or use `pip install pyspark-data-sources[faker]`.
+
+::: pyspark_datasources.fake.FakeDataSource
diff --git a/docs/datasources/github.md b/docs/datasources/github.md
new file mode 100644
index 0000000..22daa7f
--- /dev/null
+++ b/docs/datasources/github.md
@@ -0,0 +1,3 @@
+# GithubDataSource
+
+::: pyspark_datasources.github.GithubDataSource
diff --git a/docs/datasources/googlesheets.md b/docs/datasources/googlesheets.md
new file mode 100644
index 0000000..084191b
--- /dev/null
+++ b/docs/datasources/googlesheets.md
@@ -0,0 +1,3 @@
+# GoogleSheetsDataSource
+
+::: pyspark_datasources.googlesheets.GoogleSheetsDataSource
diff --git a/docs/datasources/huggingface.md b/docs/datasources/huggingface.md
new file mode 100644
index 0000000..f4937ab
--- /dev/null
+++ b/docs/datasources/huggingface.md
@@ -0,0 +1,5 @@
+# HuggingFaceDatasets
+
+> Requires the [`datasets`](https://huggingface.co/docs/datasets/en/index) library.
+
+::: pyspark_datasources.huggingface.HuggingFaceDatasets
diff --git a/docs/datasources/jsonplaceholder.md b/docs/datasources/jsonplaceholder.md
new file mode 100644
index 0000000..a175dd9
--- /dev/null
+++ b/docs/datasources/jsonplaceholder.md
@@ -0,0 +1,3 @@
+# JSONPlaceholderDataSource
+
+::: pyspark_datasources.jsonplaceholder.JSONPlaceholderDataSource
\ No newline at end of file
diff --git a/docs/datasources/kaggle.md b/docs/datasources/kaggle.md
new file mode 100644
index 0000000..b031ad0
--- /dev/null
+++ b/docs/datasources/kaggle.md
@@ -0,0 +1,5 @@
+# KaggleDataSource
+
+> Requires the [`kagglehub`](https://github.com/Kaggle/kagglehub) library.
+
+::: pyspark_datasources.kaggle.KaggleDataSource
diff --git a/docs/datasources/lance.md b/docs/datasources/lance.md
new file mode 100644
index 0000000..e6c7848
--- /dev/null
+++ b/docs/datasources/lance.md
@@ -0,0 +1,6 @@
+# LanceSink
+
+> Requires the [`Lance`](https://lancedb.github.io/lance/) library. You can install it manually: `pip install lance`
+> or use `pip install pyspark-data-sources[lance]`.
+
+::: pyspark_datasources.lance.LanceSink
diff --git a/docs/datasources/opensky.md b/docs/datasources/opensky.md
new file mode 100644
index 0000000..f611186
--- /dev/null
+++ b/docs/datasources/opensky.md
@@ -0,0 +1,5 @@
+# OpenSkyDataSource
+
+> No additional dependencies required. Uses the OpenSky Network REST API for real-time aircraft tracking data.
+
+::: pyspark_datasources.opensky.OpenSkyDataSource
diff --git a/docs/datasources/robinhood.md b/docs/datasources/robinhood.md
new file mode 100644
index 0000000..ccfb33e
--- /dev/null
+++ b/docs/datasources/robinhood.md
@@ -0,0 +1,6 @@
+# RobinhoodDataSource
+
+> Requires the [`pynacl`](https://github.com/pyca/pynacl) library for cryptographic signing. You can install it manually: `pip install pynacl`
+> or use `pip install pyspark-data-sources[robinhood]`.
+
+::: pyspark_datasources.robinhood.RobinhoodDataSource
diff --git a/docs/datasources/salesforce.md b/docs/datasources/salesforce.md
new file mode 100644
index 0000000..688b61c
--- /dev/null
+++ b/docs/datasources/salesforce.md
@@ -0,0 +1,6 @@
+# SalesforceDataSource
+
+> Requires the [`simple-salesforce`](https://github.com/simple-salesforce/simple-salesforce) library. You can install it manually: `pip install simple-salesforce`
+> or use `pip install pyspark-data-sources[salesforce]`.
+
+::: pyspark_datasources.salesforce.SalesforceDataSource
\ No newline at end of file
diff --git a/docs/datasources/simplejson.md b/docs/datasources/simplejson.md
new file mode 100644
index 0000000..c72e846
--- /dev/null
+++ b/docs/datasources/simplejson.md
@@ -0,0 +1,3 @@
+# SimpleJsonDataSource
+
+::: pyspark_datasources.simplejson.SimpleJsonDataSource
diff --git a/docs/datasources/stock.md b/docs/datasources/stock.md
new file mode 100644
index 0000000..b6b506e
--- /dev/null
+++ b/docs/datasources/stock.md
@@ -0,0 +1,3 @@
+# StockDataSource
+
+::: pyspark_datasources.stock.StockDataSource
\ No newline at end of file
diff --git a/docs/datasources/weather.md b/docs/datasources/weather.md
new file mode 100644
index 0000000..f7f5258
--- /dev/null
+++ b/docs/datasources/weather.md
@@ -0,0 +1,5 @@
+# WeatherDataSource
+
+> No additional dependencies required. Uses the Tomorrow.io API for weather data. Requires an API key.
+
+::: pyspark_datasources.weather.WeatherDataSource
diff --git a/docs/index.md b/docs/index.md
new file mode 100644
index 0000000..ed53cd7
--- /dev/null
+++ b/docs/index.md
@@ -0,0 +1,44 @@
+# PySpark Data Sources
+
+Custom Spark data sources for reading and writing data in Apache Spark, using the Python Data Source API.
+
+## Installation
+
+```bash
+pip install pyspark-data-sources
+```
+
+If you want to install all extra dependencies, use:
+
+```bash
+pip install pyspark-data-sources[all]
+```
+
+## Usage
+
+```python
+from pyspark_datasources.fake import FakeDataSource
+
+# Register the data source
+spark.dataSource.register(FakeDataSource)
+
+spark.read.format("fake").load().show()
+
+# For streaming data generation
+spark.readStream.format("fake").load().writeStream.format("console").start()
+```
+
+
+## Data Sources
+
+| Data Source                                                   | Short Name                      | Description                                        | Dependencies          |
+| ------------------------------------------------------------- | ------------------------------- | -------------------------------------------------- | --------------------- |
+| [GithubDataSource](./datasources/github.md)                   | `github`                        | Read pull requests from a GitHub repository        | None                  |
+| [FakeDataSource](./datasources/fake.md)                       | `fake`                          | Generate fake data using the `Faker` library       | `faker`               |
+| [HuggingFaceDatasets](./datasources/huggingface.md)           | `huggingface`                   | Read datasets from the HuggingFace Hub             | `datasets`            |
+| [StockDataSource](./datasources/stock.md)                     | `stock`                         | Read stock data from Alpha Vantage                 | None                  |
+| [SalesforceDataSource](./datasources/salesforce.md)           | `pyspark.datasource.salesforce` | Write streaming data to Salesforce objects         | `simple-salesforce`   |
+| [GoogleSheetsDataSource](./datasources/googlesheets.md)       | `googlesheets`                  | Read a table from a public Google Sheets document  | None                  |
+| [KaggleDataSource](./datasources/kaggle.md)                   | `kaggle`                        | Read datasets from Kaggle                          | `kagglehub`, `pandas` |
+| [JSONPlaceholderDataSource](./datasources/jsonplaceholder.md) | `jsonplaceholder`               | Read JSON data for testing and prototyping         | None                  |
+| [RobinhoodDataSource](./datasources/robinhood.md)             | `robinhood`                     | Read cryptocurrency market data from Robinhood API | `pynacl`              |
\ No newline at end of file
diff --git a/mkdocs.yml b/mkdocs.yml
new file mode 100644
index 0000000..cdace9f
--- /dev/null
+++ b/mkdocs.yml
@@ -0,0 +1,53 @@
+# yaml-language-server: $schema=https://squidfunk.github.io/mkdocs-material/schema.json
+
+site_name: PySpark Data Sources
+site_url: https://allisonwang-db.github.io/pyspark-data-sources
+repo_url: https://github.com/allisonwang-db/pyspark-data-sources
+theme:
+  name: material
+
+plugins:
+  - mkdocstrings:
+      default_handler: python
+      handlers:
+        python:
+          options:
+            docstring_style: numpy
+  - search
+
+nav:
+  - Index: index.md
+  - Guides:
+      - API Reference: api-reference.md
+      - Building Data Sources: building-data-sources.md
+      - Data Sources Guide: data-sources-guide.md
+      - Simple Stream Reader Architecture: simple-stream-reader-architecture.md
+  - Data Sources:
+      - datasources/arrow.md
+      - datasources/github.md
+      - datasources/fake.md
+      - datasources/huggingface.md
+      - datasources/stock.md
+      - datasources/simplejson.md
+      - datasources/salesforce.md
+      - datasources/googlesheets.md
+      - datasources/kaggle.md
+      - datasources/jsonplaceholder.md
+      - datasources/robinhood.md
+      - datasources/lance.md
+      - datasources/opensky.md
+      - datasources/weather.md
+
+markdown_extensions:
+  - pymdownx.highlight:
+      anchor_linenums: true
+  - pymdownx.inlinehilite
+  - pymdownx.snippets
+  - admonition
+  - pymdownx.arithmatex:
+      generic: true
+  - footnotes
+  - pymdownx.details
+  - pymdownx.superfences
+  - pymdownx.mark
+  - attr_list
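
The `mkdocs.yml` above adds one nav entry per `docs/datasources/*.md` page, and the DEVELOPMENT.md changes warn that a missing navigation entry or broken link only surfaces as a warning during `mkdocs build`. A small pre-flight script can catch the simplest case, a nav entry pointing at a file that does not exist, before running the build. This is a hypothetical helper sketch, not part of this PR; `missing_nav_pages` and its naive path-scanning approach are illustrative assumptions:

```python
import re
from pathlib import Path


def missing_nav_pages(mkdocs_yml: str, docs_dir: str = "docs") -> list[str]:
    """Return nav-referenced Markdown paths that do not exist under docs_dir.

    Naive scan (an assumption of this sketch): collect every `*.md` path
    mentioned in the config text instead of parsing the YAML nav tree,
    which is good enough for a flat nav like the one in this PR.
    """
    paths = re.findall(r"[\w./-]+\.md", mkdocs_yml)
    root = Path(docs_dir)
    return [p for p in paths if not (root / p).is_file()]
```

Running this against the repository's `mkdocs.yml` before `poetry run mkdocs build` gives a fast local signal; the build itself remains the source of truth for broken intra-page links.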