Merged
25 changes: 21 additions & 4 deletions contributing/DEVELOPMENT.md
@@ -111,10 +111,27 @@ pre-commit run --all-files

### Building Documentation

The project previously used MkDocs for documentation. Documentation now lives primarily in:
- README.md - Main documentation
- Docstrings in source code
- Contributing guides in /contributing
MkDocs (Material theme) powers the public documentation site hosted at `https://allisonwang-db.github.io/pyspark-data-sources/`.

#### Preview Locally

Run the live preview server (it rebuilds and reloads the site on save):

```bash
poetry run mkdocs serve
```

The site is served at `http://127.0.0.1:8000/` by default.
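If port 8000 is already taken, you can point the dev server at another address via the `dev_addr` key in `mkdocs.yml` (or the equivalent `-a` flag on `mkdocs serve`); a config sketch, assuming port 8001 is free:

```yaml
# mkdocs.yml — optional: change the local preview address
dev_addr: 127.0.0.1:8001
```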

#### Build for Verification

Before sending a PR, ensure the static build succeeds and address any warnings:

```bash
poetry run mkdocs build
```

Common warnings include missing navigation entries and broken links; update `mkdocs.yml` or the relevant Markdown files to resolve them.
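To catch these warnings automatically, MkDocs supports strict mode, which turns warnings into build failures (useful in CI). You can pass `--strict` to `mkdocs build`, or enable it permanently; a config sketch:

```yaml
# mkdocs.yml — treat warnings as errors during `mkdocs build`
strict: true
```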

### Writing Docstrings

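Since `mkdocs.yml` configures mkdocstrings with `docstring_style: numpy`, docstrings should follow the NumPy convention so they render correctly on the docs site. A minimal sketch (the function itself is hypothetical, not part of the library):

```python
def to_celsius(fahrenheit: float) -> float:
    """Convert a temperature from Fahrenheit to Celsius.

    Parameters
    ----------
    fahrenheit : float
        Temperature in degrees Fahrenheit.

    Returns
    -------
    float
        Temperature in degrees Celsius.

    Examples
    --------
    >>> to_celsius(212.0)
    100.0
    """
    return (fahrenheit - 32.0) * 5.0 / 9.0
```

The section headers (`Parameters`, `Returns`, `Examples`) and the `name : type` parameter lines are what the NumPy handler keys on when rendering the API pages.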
15 changes: 15 additions & 0 deletions contributing/RELEASE.md
@@ -173,6 +173,21 @@ gh workflow run docs.yml
# Go to Actions tab → Deploy MkDocs to GitHub Pages → Run workflow
```

### Releasing the Documentation Site

Follow these steps when you want to publish documentation updates:

1. Verify the docs build locally:
```bash
poetry run mkdocs build
```
2. Commit any updated Markdown or configuration files and push to the default branch. This triggers the `docs.yml` workflow, which rebuilds and publishes the site to GitHub Pages.
3. (Optional) If you need to deploy immediately without waiting for CI, run:
```bash
poetry run mkdocs gh-deploy
```
This command builds the site and pushes it to the `gh-pages` branch directly.

### Documentation URLs

- **Live Docs**: https://allisonwang-db.github.io/pyspark-data-sources
6 changes: 6 additions & 0 deletions docs/datasources/arrow.md
@@ -0,0 +1,6 @@
# ArrowDataSource

> Requires the [`PyArrow`](https://arrow.apache.org/docs/python/) library. You can install it manually: `pip install pyarrow`
> or use `pip install pyspark-data-sources[arrow]`.

::: pyspark_datasources.arrow.ArrowDataSource
6 changes: 6 additions & 0 deletions docs/datasources/fake.md
@@ -0,0 +1,6 @@
# FakeDataSource

> Requires the [`Faker`](https://github.com/joke2k/faker) library. You can install it manually: `pip install faker`
> or use `pip install pyspark-data-sources[faker]`.

::: pyspark_datasources.fake.FakeDataSource
3 changes: 3 additions & 0 deletions docs/datasources/github.md
@@ -0,0 +1,3 @@
# GithubDataSource

::: pyspark_datasources.github.GithubDataSource
3 changes: 3 additions & 0 deletions docs/datasources/googlesheets.md
@@ -0,0 +1,3 @@
# GoogleSheetsDataSource

::: pyspark_datasources.googlesheets.GoogleSheetsDataSource
5 changes: 5 additions & 0 deletions docs/datasources/huggingface.md
@@ -0,0 +1,5 @@
# HuggingFaceDatasets

> Requires the [`datasets`](https://huggingface.co/docs/datasets/en/index) library.

::: pyspark_datasources.huggingface.HuggingFaceDatasets
3 changes: 3 additions & 0 deletions docs/datasources/jsonplaceholder.md
@@ -0,0 +1,3 @@
# JSONPlaceholderDataSource

::: pyspark_datasources.jsonplaceholder.JSONPlaceholderDataSource
5 changes: 5 additions & 0 deletions docs/datasources/kaggle.md
@@ -0,0 +1,5 @@
# KaggleDataSource

> Requires the [`kagglehub`](https://github.com/Kaggle/kagglehub) library.

::: pyspark_datasources.kaggle.KaggleDataSource
6 changes: 6 additions & 0 deletions docs/datasources/lance.md
@@ -0,0 +1,6 @@
# LanceSink

> Requires the [`Lance`](https://lancedb.github.io/lance/) library. You can install it manually: `pip install lance`
> or use `pip install pyspark-data-sources[lance]`.

::: pyspark_datasources.lance.LanceSink
5 changes: 5 additions & 0 deletions docs/datasources/opensky.md
@@ -0,0 +1,5 @@
# OpenSkyDataSource

> No additional dependencies required. Uses the OpenSky Network REST API for real-time aircraft tracking data.

::: pyspark_datasources.opensky.OpenSkyDataSource
6 changes: 6 additions & 0 deletions docs/datasources/robinhood.md
@@ -0,0 +1,6 @@
# RobinhoodDataSource

> Requires the [`pynacl`](https://github.com/pyca/pynacl) library for cryptographic signing. You can install it manually: `pip install pynacl`
> or use `pip install pyspark-data-sources[robinhood]`.

::: pyspark_datasources.robinhood.RobinhoodDataSource
6 changes: 6 additions & 0 deletions docs/datasources/salesforce.md
@@ -0,0 +1,6 @@
# SalesforceDataSource

> Requires the [`simple-salesforce`](https://github.com/simple-salesforce/simple-salesforce) library. You can install it manually: `pip install simple-salesforce`
> or use `pip install pyspark-data-sources[salesforce]`.

::: pyspark_datasources.salesforce.SalesforceDataSource
3 changes: 3 additions & 0 deletions docs/datasources/simplejson.md
@@ -0,0 +1,3 @@
# SimpleJsonDataSource

::: pyspark_datasources.simplejson.SimpleJsonDataSource
3 changes: 3 additions & 0 deletions docs/datasources/stock.md
@@ -0,0 +1,3 @@
# StockDataSource

::: pyspark_datasources.stock.StockDataSource
5 changes: 5 additions & 0 deletions docs/datasources/weather.md
@@ -0,0 +1,5 @@
# WeatherDataSource

> No additional dependencies required. Uses the Tomorrow.io API for weather data. Requires an API key.

::: pyspark_datasources.weather.WeatherDataSource
45 changes: 45 additions & 0 deletions docs/index.md
@@ -0,0 +1,45 @@
# PySpark Data Sources

Custom data sources for reading and writing data in Apache Spark, built with the Python Data Source API.

## Installation

```bash
pip install pyspark-data-sources
```

If you want to install all extra dependencies, use:

```bash
pip install pyspark-data-sources[all]
```

## Usage

```python
from pyspark_datasources.fake import FakeDataSource

# Register the data source
spark.dataSource.register(FakeDataSource)

spark.read.format("fake").load().show()

# For streaming data generation
spark.readStream.format("fake").load().writeStream.format("console").start()
```


## Data Sources

| Data Source                                                   | Short Name                      | Description                                            | Dependencies          |
| ------------------------------------------------------------- | ------------------------------- | ------------------------------------------------------ | --------------------- |
| [GithubDataSource](./datasources/github.md)                   | `github`                        | Read pull requests from a GitHub repository            | None                  |
| [FakeDataSource](./datasources/fake.md)                       | `fake`                          | Generate fake data using the `Faker` library           | `faker`               |
| [HuggingFaceDatasets](./datasources/huggingface.md)           | `huggingface`                   | Read datasets from the HuggingFace Hub                 | `datasets`            |
| [StockDataSource](./datasources/stock.md)                     | `stock`                         | Read stock data from Alpha Vantage                     | None                  |
| [SalesforceDataSource](./datasources/salesforce.md)           | `pyspark.datasource.salesforce` | Write streaming data to Salesforce objects             | `simple-salesforce`   |
| [GoogleSheetsDataSource](./datasources/googlesheets.md)       | `googlesheets`                  | Read a table from a public Google Sheets document      | None                  |
| [KaggleDataSource](./datasources/kaggle.md)                   | `kaggle`                        | Read datasets from Kaggle                              | `kagglehub`, `pandas` |
| [JSONPlaceholderDataSource](./datasources/jsonplaceholder.md) | `jsonplaceholder`               | Read JSON data for testing and prototyping             | None                  |
| [RobinhoodDataSource](./datasources/robinhood.md)             | `robinhood`                     | Read cryptocurrency market data from the Robinhood API | `pynacl`              |
53 changes: 53 additions & 0 deletions mkdocs.yml
@@ -0,0 +1,53 @@
# yaml-language-server: $schema=https://squidfunk.github.io/mkdocs-material/schema.json

site_name: PySpark Data Sources
site_url: https://allisonwang-db.github.io/pyspark-data-sources
repo_url: https://github.com/allisonwang-db/pyspark-data-sources
theme:
  name: material

plugins:
  - mkdocstrings:
      default_handler: python
      handlers:
        python:
          options:
            docstring_style: numpy
  - search

nav:
  - Index: index.md
  - Guides:
      - API Reference: api-reference.md
      - Building Data Sources: building-data-sources.md
      - Data Sources Guide: data-sources-guide.md
      - Simple Stream Reader Architecture: simple-stream-reader-architecture.md
  - Data Sources:
      - datasources/arrow.md
      - datasources/github.md
      - datasources/fake.md
      - datasources/huggingface.md
      - datasources/stock.md
      - datasources/simplejson.md
      - datasources/salesforce.md
      - datasources/googlesheets.md
      - datasources/kaggle.md
      - datasources/jsonplaceholder.md
      - datasources/robinhood.md
      - datasources/lance.md
      - datasources/opensky.md
      - datasources/weather.md

markdown_extensions:
  - pymdownx.highlight:
      anchor_linenums: true
  - pymdownx.inlinehilite
  - pymdownx.snippets
  - admonition
  - pymdownx.arithmatex:
      generic: true
  - footnotes
  - pymdownx.details
  - pymdownx.superfences
  - pymdownx.mark
  - attr_list