From eff49555b4508c2cac4f5dfd0947c62471a98a81 Mon Sep 17 00:00:00 2001
From: Allison Wang
Date: Thu, 14 Aug 2025 12:47:02 -0700
Subject: [PATCH 1/2] update README

---
 README.md | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 94a9c66..87cc4fc 100644
--- a/README.md
+++ b/README.md
@@ -40,14 +40,18 @@ spark.readStream.format("fake").load().writeStream.format("console").start()
 
 | Data Source                                                             | Short Name     | Description                                   | Dependencies          |
 |-------------------------------------------------------------------------|----------------|-----------------------------------------------|-----------------------|
-| [GithubDataSource](pyspark_datasources/github.py)                       | `github`       | Read pull requests from a Github repository   | None                  |
+| [ArrowDataSource](pyspark_datasources/arrow.py)                         | `arrow`        | Read Apache Arrow files (.arrow)              | `pyarrow`             |
 | [FakeDataSource](pyspark_datasources/fake.py)                           | `fake`         | Generate fake data using the `Faker` library  | `faker`               |
-| [StockDataSource](pyspark_datasources/stock.py)                         | `stock`        | Read stock data from Alpha Vantage            | None                  |
+| [GithubDataSource](pyspark_datasources/github.py)                       | `github`       | Read pull requests from a Github repository   | None                  |
 | [GoogleSheetsDataSource](pyspark_datasources/googlesheets.py)           | `googlesheets` | Read table from public Google Sheets          | None                  |
+| [HuggingFaceDatasets](pyspark_datasources/huggingface.py)               | `huggingface`  | Read datasets from HuggingFace Hub            | `datasets`            |
 | [KaggleDataSource](pyspark_datasources/kaggle.py)                       | `kaggle`       | Read datasets from Kaggle                     | `kagglehub`, `pandas` |
-| [SimpleJsonDataSource](pyspark_datasources/simplejson.py)               | `simplejson`   | Write JSON data to Databricks DBFS            | `databricks-sdk`      |
+| [LanceSink](pyspark_datasources/lance.py)                               | `lance`        | Write data in Lance format                    | `lance`               |
 | [OpenSkyDataSource](pyspark_datasources/opensky.py)                     | `opensky`      | Read from OpenSky Network.                    | None                  |
 | [SalesforceDataSource](pyspark_datasources/salesforce.py)               | `pyspark.datasource.salesforce` | Streaming datasource for writing data to Salesforce | `simple-salesforce` |
+| [SimpleJsonDataSource](pyspark_datasources/simplejson.py)               | `simplejson`   | Write JSON data to Databricks DBFS            | `databricks-sdk`      |
+| [StockDataSource](pyspark_datasources/stock.py)                         | `stock`        | Read stock data from Alpha Vantage            | None                  |
+| [WeatherDataSource](pyspark_datasources/weather.py)                     | `weather`      | Fetch weather data from tomorrow.io           | None                  |
 
 See more here: https://allisonwang-db.github.io/pyspark-data-sources/.

From 0b5e9b1fe8de5300504ca8be5ccd89882deb18ba Mon Sep 17 00:00:00 2001
From: Allison Wang
Date: Thu, 14 Aug 2025 12:57:17 -0700
Subject: [PATCH 2/2] more update

---
 README.md                   | 31 +++++++++++++++++--------------
 docs/datasources/arrow.md   |  6 ++++++
 docs/datasources/lance.md   |  6 ++++++
 docs/datasources/opensky.md |  5 +++++
 docs/datasources/weather.md |  5 +++++
 5 files changed, 39 insertions(+), 14 deletions(-)
 create mode 100644 docs/datasources/arrow.md
 create mode 100644 docs/datasources/lance.md
 create mode 100644 docs/datasources/opensky.md
 create mode 100644 docs/datasources/weather.md

diff --git a/README.md b/README.md
index 87cc4fc..c8120f7 100644
--- a/README.md
+++ b/README.md
@@ -38,20 +38,23 @@ spark.readStream.format("fake").load().writeStream.format("console").start()
 
 ## Example Data Sources
 
-| Data Source                                                             | Short Name     | Description                                   | Dependencies          |
-|-------------------------------------------------------------------------|----------------|-----------------------------------------------|-----------------------|
-| [ArrowDataSource](pyspark_datasources/arrow.py)                         | `arrow`        | Read Apache Arrow files (.arrow)              | `pyarrow`             |
-| [FakeDataSource](pyspark_datasources/fake.py)                           | `fake`         | Generate fake data using the `Faker` library  | `faker`               |
-| [GithubDataSource](pyspark_datasources/github.py)                       | `github`       | Read pull requests from a Github repository   | None                  |
-| [GoogleSheetsDataSource](pyspark_datasources/googlesheets.py)           | `googlesheets` | Read table from public Google Sheets          | None                  |
-| [HuggingFaceDatasets](pyspark_datasources/huggingface.py)               | `huggingface`  | Read datasets from HuggingFace Hub            | `datasets`            |
-| [KaggleDataSource](pyspark_datasources/kaggle.py)                       | `kaggle`       | Read datasets from Kaggle                     | `kagglehub`, `pandas` |
-| [LanceSink](pyspark_datasources/lance.py)                               | `lance`        | Write data in Lance format                    | `lance`               |
-| [OpenSkyDataSource](pyspark_datasources/opensky.py)                     | `opensky`      | Read from OpenSky Network.                    | None                  |
-| [SalesforceDataSource](pyspark_datasources/salesforce.py)               | `pyspark.datasource.salesforce` | Streaming datasource for writing data to Salesforce | `simple-salesforce` |
-| [SimpleJsonDataSource](pyspark_datasources/simplejson.py)               | `simplejson`   | Write JSON data to Databricks DBFS            | `databricks-sdk`      |
-| [StockDataSource](pyspark_datasources/stock.py)                         | `stock`        | Read stock data from Alpha Vantage            | None                  |
-| [WeatherDataSource](pyspark_datasources/weather.py)                     | `weather`      | Fetch weather data from tomorrow.io           | None                  |
+| Data Source                                                             | Short Name     | Type           | Description                                   | Dependencies          | Example                                                                                                                                                                        |
+|-------------------------------------------------------------------------|----------------|----------------|-----------------------------------------------|-----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| **Batch Read**                                                          |                |                |                                               |                       |                                                                                                                                                                                |
+| [ArrowDataSource](pyspark_datasources/arrow.py)                         | `arrow`        | Batch Read     | Read Apache Arrow files (.arrow)              | `pyarrow`             | `pip install pyspark-data-sources[arrow]`<br>`spark.read.format("arrow").load("/path/to/file.arrow")`                                                                          |
+| [FakeDataSource](pyspark_datasources/fake.py)                           | `fake`         | Batch/Streaming Read | Generate fake data using the `Faker` library | `faker`          | `pip install pyspark-data-sources[fake]`<br>`spark.read.format("fake").load()` or `spark.readStream.format("fake").load()`                                                     |
+| [GithubDataSource](pyspark_datasources/github.py)                       | `github`       | Batch Read     | Read pull requests from a Github repository   | None                  | `pip install pyspark-data-sources`<br>`spark.read.format("github").load("apache/spark")`                                                                                       |
+| [GoogleSheetsDataSource](pyspark_datasources/googlesheets.py)           | `googlesheets` | Batch Read     | Read table from public Google Sheets          | None                  | `pip install pyspark-data-sources`<br>`spark.read.format("googlesheets").load("https://docs.google.com/spreadsheets/d/...")`                                                   |
+| [HuggingFaceDatasets](pyspark_datasources/huggingface.py)               | `huggingface`  | Batch Read     | Read datasets from HuggingFace Hub            | `datasets`            | `pip install pyspark-data-sources[huggingface]`<br>`spark.read.format("huggingface").load("imdb")`                                                                             |
+| [KaggleDataSource](pyspark_datasources/kaggle.py)                       | `kaggle`       | Batch Read     | Read datasets from Kaggle                     | `kagglehub`, `pandas` | `pip install pyspark-data-sources[kaggle]`<br>`spark.read.format("kaggle").load("titanic")`                                                                                    |
+| [StockDataSource](pyspark_datasources/stock.py)                         | `stock`        | Batch Read     | Read stock data from Alpha Vantage            | None                  | `pip install pyspark-data-sources`<br>`spark.read.format("stock").option("symbols", "AAPL,GOOGL").option("api_key", "key").load()`                                             |
+| **Batch Write**                                                         |                |                |                                               |                       |                                                                                                                                                                                |
+| [LanceSink](pyspark_datasources/lance.py)                               | `lance`        | Batch Write    | Write data in Lance format                    | `lance`               | `pip install pyspark-data-sources[lance]`<br>`df.write.format("lance").mode("append").save("/tmp/lance_data")`                                                                 |
+| **Streaming Read**                                                      |                |                |                                               |                       |                                                                                                                                                                                |
+| [OpenSkyDataSource](pyspark_datasources/opensky.py)                     | `opensky`      | Streaming Read | Read from OpenSky Network.                    | None                  | `pip install pyspark-data-sources`<br>`spark.readStream.format("opensky").option("region", "EUROPE").load()`                                                                   |
+| [WeatherDataSource](pyspark_datasources/weather.py)                     | `weather`      | Streaming Read | Fetch weather data from tomorrow.io           | None                  | `pip install pyspark-data-sources`<br>`spark.readStream.format("weather").option("locations", "[(37.7749, -122.4194)]").option("apikey", "key").load()`                        |
+| **Streaming Write**                                                     |                |                |                                               |                       |                                                                                                                                                                                |
+| [SalesforceDataSource](pyspark_datasources/salesforce.py)               | `pyspark.datasource.salesforce` | Streaming Write | Streaming datasource for writing data to Salesforce | `simple-salesforce` | `pip install pyspark-data-sources[salesforce]`<br>`df.writeStream.format("pyspark.datasource.salesforce").option("username", "user").start()`                 |
 
 See more here: https://allisonwang-db.github.io/pyspark-data-sources/.
diff --git a/docs/datasources/arrow.md b/docs/datasources/arrow.md
new file mode 100644
index 0000000..c64d848
--- /dev/null
+++ b/docs/datasources/arrow.md
@@ -0,0 +1,6 @@
+# ArrowDataSource
+
+> Requires the [`PyArrow`](https://arrow.apache.org/docs/python/) library. You can install it manually: `pip install pyarrow`
+> or use `pip install pyspark-data-sources[arrow]`.
+
+::: pyspark_datasources.arrow.ArrowDataSource
diff --git a/docs/datasources/lance.md b/docs/datasources/lance.md
new file mode 100644
index 0000000..e6c7848
--- /dev/null
+++ b/docs/datasources/lance.md
@@ -0,0 +1,6 @@
+# LanceSink
+
+> Requires the [`Lance`](https://lancedb.github.io/lance/) library. You can install it manually: `pip install lance`
+> or use `pip install pyspark-data-sources[lance]`.
+
+::: pyspark_datasources.lance.LanceSink
diff --git a/docs/datasources/opensky.md b/docs/datasources/opensky.md
new file mode 100644
index 0000000..f611186
--- /dev/null
+++ b/docs/datasources/opensky.md
@@ -0,0 +1,5 @@
+# OpenSkyDataSource
+
+> No additional dependencies required. Uses the OpenSky Network REST API for real-time aircraft tracking data.
+
+::: pyspark_datasources.opensky.OpenSkyDataSource
diff --git a/docs/datasources/weather.md b/docs/datasources/weather.md
new file mode 100644
index 0000000..f7f5258
--- /dev/null
+++ b/docs/datasources/weather.md
@@ -0,0 +1,5 @@
+# WeatherDataSource
+
+> No additional dependencies required. Uses the Tomorrow.io API for weather data. Requires an API key.
+
+::: pyspark_datasources.weather.WeatherDataSource