40 changes: 40 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,40 @@
name: Publish to Test PyPI

on:
  push:
    branches:
      - 'feature*'

jobs:
  test-and-publish:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Install dependencies
        run: poetry install

      - name: Run tests
        run: poetry run pytest

      - name: Build the package
        run: poetry build

      - name: Publish to Test PyPI
        env:
          POETRY_PYPI_TOKEN_TESTPYPI: ${{ secrets.TEST_PYPI_TOKEN }}
        run: |
          poetry config repositories.testpypi https://test.pypi.org/legacy/
          poetry publish -r testpypi --build
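
Once the workflow has published a build, a quick local sanity check might look like the sketch below. It assumes you have already installed the package from Test PyPI, e.g. with `pip install --index-url https://test.pypi.org/simple/ pyspark-msgraph-source`; this step is not part of the workflow itself.

```python
# Hypothetical post-publish check: confirm which version was installed from Test PyPI.
from importlib.metadata import version

print(version("pyspark-msgraph-source"))
```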
204 changes: 89 additions & 115 deletions README.md
@@ -1,163 +1,137 @@
# pyspark-msgraph-source

A **PySpark DataSource** to seamlessly integrate and read data from the **Microsoft Graph API**, enabling easy access to resources like **SharePoint List Items** and more.

## Motivation

When developing custom PySpark data sources, I encountered several challenges that made the development process frustrating:

1. **Environment Setup Complexity**: Setting up a development environment for PySpark data source development was unnecessarily complex, with multiple dependencies and version conflicts.

2. **Test Data Management**: Managing test data and maintaining consistent test environments across different machines was challenging.

3. **Debugging Issues**: The default setup made it difficult to debug custom data source code effectively, especially when dealing with Spark's distributed nature.

4. **Documentation Gaps**: Existing documentation for custom data source development was scattered and often incomplete.

This project aims to solve those pain points and provide a streamlined development experience.

---

## Features
- Entra ID Authentication
  Securely authenticate with Microsoft Graph using DefaultAzureCredential, supporting local development and production seamlessly.

- Automatic Pagination Handling
  Fetches all paginated data from Microsoft Graph without manual intervention.

- Dynamic Schema Inference
  Automatically detects the schema of the resource by sampling data, so you don't need to define it manually.

- Simple Configuration with `.option()`
  Easily configure resources and query parameters directly in your Spark read options, making it flexible and intuitive.

- Zero External Ingestion Services
  No additional services like Azure Data Factory or Logic Apps are needed; ingest data into Spark directly from Microsoft Graph.

- Extensible Resource Providers
  Add custom resource providers to support more Microsoft Graph endpoints as needed.

- Pluggable Architecture
  Dynamically load resource providers without modifying core logic.

- Optimized for PySpark
  Designed to work natively with Spark's DataFrame API for big data processing.

- Secure by Design
  Credentials and secrets are handled using Azure Identity best practices, avoiding hardcoding sensitive data.
---

## Installation

```bash
pip install pyspark-msgraph-source
```

---

## ⚡ Quickstart

### 1. Authentication

This package uses [DefaultAzureCredential](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#defaultazurecredential).
Ensure you're authenticated:

```bash
az login
```

Or set environment variables:

```bash
export AZURE_CLIENT_ID=<your-client-id>
export AZURE_TENANT_ID=<your-tenant-id>
export AZURE_CLIENT_SECRET=<your-client-secret>
```

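If authentication is unclear, a minimal diagnostic like the one below can confirm that `DefaultAzureCredential` is able to obtain a Microsoft Graph token locally. This is only a sketch for debugging; the connector acquires tokens internally and does not require this step.

```python
# Diagnostic only: verify DefaultAzureCredential can obtain a Graph token.
# Requires the azure-identity package.
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
token = credential.get_token("https://graph.microsoft.com/.default")
print("Token acquired; expires at (epoch seconds):", token.expires_on)
```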

### 2. Example Usage

```python
from pyspark.sql import SparkSession
from pyspark_msgraph_source.core.source import MSGraphDataSource

spark = SparkSession.builder \
    .appName("MSGraphExample") \
    .getOrCreate()

spark.dataSource.register(MSGraphDataSource)

df = spark.read.format("msgraph") \
    .option("resource", "list_items") \
    .option("site-id", "<YOUR_SITE_ID>") \
    .option("list-id", "<YOUR_LIST_ID>") \
    .option("top", 100) \
    .option("expand", "fields") \
    .load()

df.show()

# with schema
df = spark.read.format("msgraph") \
    .option("resource", "list_items") \
    .option("site-id", "<YOUR_SITE_ID>") \
    .option("list-id", "<YOUR_LIST_ID>") \
    .option("top", 100) \
    .option("expand", "fields") \
    .schema("id string, Title string") \
    .load()

df.show()
```

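To see what schema the connector inferred when none is supplied, and to persist the result, something along these lines works (the column names and output path below are illustrative):

```python
# Inspect the inferred schema of the DataFrame from the example above.
df.printSchema()

# Persist a couple of columns as Parquet (path is illustrative).
df.select("id", "Title") \
    .write.mode("overwrite") \
    .parquet("/tmp/sharepoint_list_items")
```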

---

## Supported Resources

| Resource | Description |
|--------------|-----------------------------|
| `list_items` | SharePoint List Items |
| *(more coming soon...)* | |

---

## Development

Coming soon...

---

## Troubleshooting

| Issue | Solution |
|---------------------------------|----------------------------------------------|
| `ValueError: resource missing` | Add `.option("resource", "list_items")` |
| Empty dataframe | Verify IDs, permissions, and access |
| Authentication failures | Check Azure credentials and login status |

---

## 📄 License

[MIT License](LICENSE)

---

## 📚 Resources

- [Microsoft Graph API](https://learn.microsoft.com/en-us/graph/overview)
- [DefaultAzureCredential](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#defaultazurecredential)
3 changes: 3 additions & 0 deletions docs/api/core/async-iterator.md
@@ -0,0 +1,3 @@
# Async To Sync Iterator

::: pyspark_msgraph_source.core.async_iterator
3 changes: 3 additions & 0 deletions docs/api/core/client.md
@@ -0,0 +1,3 @@
# Base Client

::: pyspark_msgraph_source.core.base_client
3 changes: 3 additions & 0 deletions docs/api/core/models.md
@@ -0,0 +1,3 @@
# Core Models

::: pyspark_msgraph_source.core.models
3 changes: 3 additions & 0 deletions docs/api/core/resource-provider.md
@@ -0,0 +1,3 @@
# Resource Provider

::: pyspark_msgraph_source.core.resource_provider
3 changes: 3 additions & 0 deletions docs/api/core/source.md
@@ -0,0 +1,3 @@
# Source

::: pyspark_msgraph_source.core.source
3 changes: 3 additions & 0 deletions docs/api/core/utils.md
@@ -0,0 +1,3 @@
# Utils

::: pyspark_msgraph_source.core.utils
14 changes: 14 additions & 0 deletions docs/api/index.md
@@ -0,0 +1,14 @@
# API Reference

Welcome to the API Reference of `pyspark_msgraph_source`.

Below are the available modules and submodules:

## Core
- [Source](core/source.md)
- [Base Client](core/client.md)
- [Core Models](core/models.md)
- [Resource Provider](core/resource-provider.md)
- [Async To Sync Iterator](core/async-iterator.md)
- [Utils](core/utils.md)

## Resources
- [Available Resources](resources/index.md)
- [List Items](resources/list-items.md)
33 changes: 33 additions & 0 deletions docs/api/resources/index.md
@@ -0,0 +1,33 @@

# Available Resources

This page lists the Microsoft Graph resources currently supported by the `pyspark-msgraph-source` connector.

---

## Supported Resources

| Resource Name | Description | Read more |
|---------------|-------------|------------------|
| `list_items` | Retrieves items from a SharePoint List | [Configuration](list-items.md) |

---

## Adding New Resources

Want to add support for more resources?
Check out the [Contributing Guide](contributing.md) to learn how to extend the connector!
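
As a rough illustration of what a provider involves, the sketch below models a hypothetical `drive_items` resource. The class name, fields, and methods are illustrative only; the real base class and registration hook live in `pyspark_msgraph_source.core.resource_provider` and may differ.

```python
# Purely illustrative sketch of the shape a resource provider might take.
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class DriveItemsProvider:  # hypothetical provider for a "drive_items" resource
    site_id: str
    drive_id: str

    @property
    def endpoint(self) -> str:
        # Relative Microsoft Graph endpoint the connector would page through
        return f"/sites/{self.site_id}/drives/{self.drive_id}/items"

    def parse(self, item: Dict[str, Any]) -> Dict[str, Any]:
        # Flatten one raw Graph item into a flat row
        return {"id": item.get("id"), "name": item.get("name")}


provider = DriveItemsProvider(site_id="<SITE_ID>", drive_id="<DRIVE_ID>")
print(provider.endpoint)
```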

---

## Notes
- Resources may require specific Microsoft Graph API permissions.
- Pagination, authentication, and schema inference are handled automatically.

---

## Request New Resources

Is your desired resource not listed here?
Open an [issue](https://github.com/geekwhocodes/pyspark-msgraph-source/issues) to request it!

4 changes: 4 additions & 0 deletions docs/api/resources/list-items.md
@@ -0,0 +1,4 @@
# Resource - List Items


::: pyspark_msgraph_source.resources.list_items