40 changes: 40 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,40 @@
name: Publish to Test PyPI

on:
  push:
    branches:
      - 'feature*'

jobs:
  test-and-publish:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Install dependencies
        run: poetry install

      - name: Run tests
        run: poetry run pytest

      - name: Build the package
        run: poetry build

      - name: Publish to Test PyPI
        env:
          POETRY_PYPI_TOKEN_TESTPYPI: ${{ secrets.TEST_PYPI_TOKEN }}
        run: |
          poetry config repositories.testpypi https://test.pypi.org/legacy/
          poetry publish -r testpypi --build
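
Once the workflow has published a build, a quick local sanity check might look like the sketch below. It assumes you have already installed the package from Test PyPI, e.g. with `pip install --index-url https://test.pypi.org/simple/ pyspark-msgraph-source`; this step is not part of the workflow itself.

```python
# Hypothetical post-publish check: confirm which version was installed from Test PyPI.
from importlib.metadata import version

print(version("pyspark-msgraph-source"))
```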
204 changes: 89 additions & 115 deletions README.md
@@ -1,163 +1,137 @@
# pyspark-msgraph-source

A **PySpark DataSource** to seamlessly integrate and read data from the **Microsoft Graph API**, enabling easy access to resources like **SharePoint List Items** and more.

## Motivation

When developing custom PySpark data sources, I encountered several challenges that made the development process frustrating:

1. **Environment Setup Complexity**: Setting up a development environment for PySpark data source development was unnecessarily complex, with multiple dependencies and version conflicts.

2. **Test Data Management**: Managing test data and maintaining consistent test environments across different machines was challenging.

3. **Debugging Issues**: The default setup made it difficult to debug custom data source code effectively, especially when dealing with Spark's distributed nature.

4. **Documentation Gaps**: Existing documentation for custom data source development was scattered and often incomplete.

This project aims to solve those pain points and provide a streamlined development experience.

---

## Features
- Entra ID Authentication
  Securely authenticate with Microsoft Graph using DefaultAzureCredential, supporting local development and production seamlessly.

- Automatic Pagination Handling
  Fetches all paginated data from Microsoft Graph without manual intervention.

- Dynamic Schema Inference
  Automatically detects the schema of the resource by sampling data, so you don't need to define it manually.

- Simple Configuration with `.option()`
  Easily configure resources and query parameters directly in your Spark read options, making it flexible and intuitive.

- Zero External Ingestion Services
  No additional services like Azure Data Factory or Logic Apps are needed; ingest data into Spark directly from Microsoft Graph.

- Extensible Resource Providers
  Add custom resource providers to support more Microsoft Graph endpoints as needed.

- Pluggable Architecture
  Dynamically load resource providers without modifying core logic.

- Optimized for PySpark
  Designed to work natively with Spark's DataFrame API for big data processing.

- Secure by Design
  Credentials and secrets are handled using Azure Identity best practices, avoiding hardcoding sensitive data.
---

## Installation

```bash
pip install pyspark-msgraph-source
```

---

## ⚡ Quickstart

### 1. Authentication

This package uses [DefaultAzureCredential](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#defaultazurecredential).
Ensure you're authenticated:

```bash
az login
```

Or set environment variables:

```bash
export AZURE_CLIENT_ID=<your-client-id>
export AZURE_TENANT_ID=<your-tenant-id>
export AZURE_CLIENT_SECRET=<your-client-secret>
```

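If authentication is unclear, a minimal diagnostic like the one below can confirm that `DefaultAzureCredential` is able to obtain a Microsoft Graph token locally. This is only a sketch for debugging; the connector acquires tokens internally and does not require this step.

```python
# Diagnostic only: verify DefaultAzureCredential can obtain a Graph token.
# Requires the azure-identity package.
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
token = credential.get_token("https://graph.microsoft.com/.default")
print("Token acquired; expires at (epoch seconds):", token.expires_on)
```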

### 2. Example Usage

```python
from pyspark.sql import SparkSession
from pyspark_msgraph_source.core.source import MSGraphDataSource

spark = SparkSession.builder \
    .appName("MSGraphExample") \
    .getOrCreate()

spark.dataSource.register(MSGraphDataSource)

df = spark.read.format("msgraph") \
    .option("resource", "list_items") \
    .option("site-id", "<YOUR_SITE_ID>") \
    .option("list-id", "<YOUR_LIST_ID>") \
    .option("top", 100) \
    .option("expand", "fields") \
    .load()

df.show()

# with schema
df = spark.read.format("msgraph") \
    .option("resource", "list_items") \
    .option("site-id", "<YOUR_SITE_ID>") \
    .option("list-id", "<YOUR_LIST_ID>") \
    .option("top", 100) \
    .option("expand", "fields") \
    .schema("id string, Title string") \
    .load()

df.show()
```

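To see what schema the connector inferred when none is supplied, and to persist the result, something along these lines works (the column names and output path below are illustrative):

```python
# Inspect the inferred schema of the DataFrame from the example above.
df.printSchema()

# Persist a couple of columns as Parquet (path is illustrative).
df.select("id", "Title") \
    .write.mode("overwrite") \
    .parquet("/tmp/sharepoint_list_items")
```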

---

## Supported Resources

| Resource | Description |
|--------------|-----------------------------|
| `list_items` | SharePoint List Items |
| *(more coming soon...)* | |

---

## Development

Coming soon...

---

## Troubleshooting

| Issue | Solution |
|---------------------------------|----------------------------------------------|
| `ValueError: resource missing` | Add `.option("resource", "list_items")` |
| Empty dataframe | Verify IDs, permissions, and access |
| Authentication failures | Check Azure credentials and login status |

---

## 📄 License

[MIT License](LICENSE)

---

## 📚 Resources

- [Microsoft Graph API](https://learn.microsoft.com/en-us/graph/overview)
- [DefaultAzureCredential](https://learn.microsoft.com/en-us/python/api/overview/azure/identity-readme?view=azure-python#defaultazurecredential)
3 changes: 3 additions & 0 deletions docs/api/core/async-iterator.md
@@ -0,0 +1,3 @@
# Async To Sync Iterator

::: pyspark_msgraph_source.core.async_iterator
3 changes: 3 additions & 0 deletions docs/api/core/client.md
@@ -0,0 +1,3 @@
# Base Client

::: pyspark_msgraph_source.core.base_client
3 changes: 3 additions & 0 deletions docs/api/core/models.md
@@ -0,0 +1,3 @@
# Core Models

::: pyspark_msgraph_source.core.models
3 changes: 3 additions & 0 deletions docs/api/core/resource-provider.md
@@ -0,0 +1,3 @@
# Resource Provider

::: pyspark_msgraph_source.core.resource_provider
3 changes: 3 additions & 0 deletions docs/api/core/source.md
@@ -0,0 +1,3 @@
# Source

::: pyspark_msgraph_source.core.source
3 changes: 3 additions & 0 deletions docs/api/core/utils.md
@@ -0,0 +1,3 @@
# Utils

::: pyspark_msgraph_source.core.utils
14 changes: 14 additions & 0 deletions docs/api/index.md
@@ -0,0 +1,14 @@
# API Reference

Welcome to the API Reference of `pyspark_msgraph_source`.

Below are the available modules and submodules:

## Core
- [Source](core/source.md)
- [Base Client](core/client.md)
- [Core Models](core/models.md)
- [Resource Provider](core/resource-provider.md)
- [Async To Sync Iterator](core/async-iterator.md)
- [Utils](core/utils.md)

## Resources
- [Available Resources](resources/index.md)
- [List Items](resources/list-items.md)
33 changes: 33 additions & 0 deletions docs/api/resources/index.md
@@ -0,0 +1,33 @@

# Available Resources

This page lists the Microsoft Graph resources currently supported by the `pyspark-msgraph-source` connector.

---

## Supported Resources

| Resource Name | Description | Read more |
|---------------|-------------|------------------|
| `list_items` | Retrieves items from a SharePoint List | [Configuration](list-items.md) |

---

## Adding New Resources

Want to add support for more resources?
Check out the [Contributing Guide](contributing.md) to learn how to extend the connector!
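
As a rough illustration of what a provider involves, the sketch below models a hypothetical `drive_items` resource. The class name, fields, and methods are illustrative only; the real base class and registration hook live in `pyspark_msgraph_source.core.resource_provider` and may differ.

```python
# Purely illustrative sketch of the shape a resource provider might take.
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class DriveItemsProvider:  # hypothetical provider for a "drive_items" resource
    site_id: str
    drive_id: str

    @property
    def endpoint(self) -> str:
        # Relative Microsoft Graph endpoint the connector would page through
        return f"/sites/{self.site_id}/drives/{self.drive_id}/items"

    def parse(self, item: Dict[str, Any]) -> Dict[str, Any]:
        # Flatten one raw Graph item into a flat row
        return {"id": item.get("id"), "name": item.get("name")}


provider = DriveItemsProvider(site_id="<SITE_ID>", drive_id="<DRIVE_ID>")
print(provider.endpoint)
```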

---

## Notes
- Resources may require specific Microsoft Graph API permissions.
- Pagination, authentication, and schema inference are handled automatically.

---

## Request New Resources

Is your desired resource not listed here?
Open an [issue](https://github.com/geekwhocodes/pyspark-msgraph-source/issues) to request it!

4 changes: 4 additions & 0 deletions docs/api/resources/list-items.md
@@ -0,0 +1,4 @@
# Resource - List Items


::: pyspark_msgraph_source.resources.list_items