Commit

Add Microsoft Fabric tutorial for OneTable (#237)
Addresses #236
Adds step by step example tutorial for querying tables translated by OneTable with Microsoft Fabric.
Document tested manually on a local setup.
ashvina committed Nov 17, 2023
1 parent a7c3a79 commit 001f5e7
Showing 10 changed files with 133 additions and 4 deletions.
128 changes: 128 additions & 0 deletions website/docs/fabric.md
@@ -0,0 +1,128 @@
---
sidebar_position: 5
title: "Microsoft Fabric"
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Querying from Microsoft Fabric
This guide offers a short tutorial on how to query Apache Iceberg and Apache Hudi tables in Microsoft Fabric using
the translation capabilities of OneTable. It is intended solely for demonstration and to verify that OneTable's
output is compatible with Fabric. The tutorial relies on features currently[^1] available in Fabric, such as
`Shortcuts`.


## What is Microsoft Fabric
Microsoft Fabric is a unified platform for data analytics. It offers a wide range of capabilities,
including data engineering, data science, real-time analytics, and business intelligence. At its core is the data
lake, OneLake, which stores tabular data in Delta Parquet format, allowing a single copy of the data to be used
across all Fabric analytical services, such as T-SQL, Spark, and Power BI. OneLake is designed to eliminate data
duplication and the need for data migration or transfers. Whether a data engineer is adding data to a table using
Spark or a SQL developer is analyzing data with T-SQL in a data warehouse, both access and work on the same copy of
the data stored in OneLake. Additionally, OneLake can integrate existing storage accounts via its `Shortcut`
feature, which acts as a link to data in other file systems.

## Tutorial
The objective of this tutorial is to translate an Iceberg or Hudi table stored in an ADLS storage account into
Delta Lake format using OneTable. After translation, the table can be queried from various Fabric engines,
including T-SQL, Spark, and Power BI.

### Prerequisites
* An active [Microsoft Fabric Workspace](https://learn.microsoft.com/en-us/fabric/get-started/workspaces).
* A storage account with a container in [Azure Data Lake Storage Gen2](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction) (ADLS); a minimal Azure CLI sketch for creating one follows.
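
If you do not already have an ADLS Gen2 account, one way to create the account and container is with the Azure CLI. The sketch below is not part of the tutorial's required steps; all names, the resource group, and the region are placeholders:

```shell md title="shell"
# Sketch: create an ADLS Gen2 (hierarchical namespace) storage account and a
# container. Assumes `az login` has been run and the resource group exists.
# All names below are placeholders.
az storage account create --name mystorageaccount --resource-group my-rg \
  --location eastus --sku Standard_LRS --kind StorageV2 --hns true
az storage container create --name mycontainer --account-name mystorageaccount
```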

### Step 1. Create a source table in ADLS
This step creates a source table in Iceberg or Hudi format in the ADLS storage account. The steps to create a
table `people` in Iceberg or Hudi format are documented in the
[Creating your first interoperable table - Create Dataset](/docs/how-to#create-dataset) tutorial section. However, instead of creating the
`people` table locally, configure `local_base_path` to point to the ADLS storage account.

Assuming the container name is `mycontainer` and the storage account name is `mystorageaccount`, the `local_base_path`
should be set to `abfs://mycontainer@mystorageaccount.dfs.core.windows.net/`.

An example Spark configuration for authenticating to the ADLS storage account is as follows:
```properties
spark.hadoop.fs.azure.account.auth.type=OAuth
spark.hadoop.fs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/<tenant-id>/oauth2/token
spark.hadoop.fs.azure.account.oauth2.client.id=<client-id>
spark.hadoop.fs.azure.account.oauth2.client.secret=<client-secret>
```
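
These settings can also be passed on the command line when launching the Spark shell used in the Create Dataset step. The sketch below is an assumption about your setup; the `hadoop-azure` package coordinates and version are placeholders to be matched to your Spark/Hadoop environment:

```shell md title="shell"
# Sketch: launch spark-shell with the ADLS OAuth settings from above.
# The hadoop-azure version is an assumption; align it with your Hadoop version.
spark-shell \
  --packages org.apache.hadoop:hadoop-azure:3.3.4 \
  --conf "spark.hadoop.fs.azure.account.auth.type=OAuth" \
  --conf "spark.hadoop.fs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider" \
  --conf "spark.hadoop.fs.azure.account.oauth2.client.endpoint=https://login.microsoftonline.com/<tenant-id>/oauth2/token" \
  --conf "spark.hadoop.fs.azure.account.oauth2.client.id=<client-id>" \
  --conf "spark.hadoop.fs.azure.account.oauth2.client.secret=<client-secret>"
```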

### Step 2. Translate source table to Delta Lake format using OneTable
This step translates the table `people` originally in Iceberg or Hudi format to Delta Lake format using OneTable.
The primary actions for the translation are documented in
the [Creating your first interoperable table - Running Sync](/docs/how-to#running-sync) tutorial section.
However, since the table is in ADLS, you need to update the dataset paths and Hadoop configurations.

For example, if the source table is in Iceberg format, the configuration file should look like:

```yaml md title="my_config.yaml"
sourceFormat: ICEBERG
targetFormats:
  - DELTA
datasets:
  -
    tableBasePath: abfs://mycontainer@mystorageaccount.dfs.core.windows.net/default/people
    tableDataPath: abfs://mycontainer@mystorageaccount.dfs.core.windows.net/default/people/data
    tableName: people
# In the configuration above, `default` refers to the default Spark database.
```

An example Hadoop configuration for authenticating to the ADLS storage account is as follows:
```xml md title="hadoop.xml"
<configuration>
  <property>
    <name>fs.azure.account.auth.type</name>
    <value>OAuth</value>
  </property>
  <property>
    <name>fs.azure.account.oauth.provider.type</name>
    <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.endpoint</name>
    <value>https://login.microsoftonline.com/[tenant-id]/oauth2/token</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.id</name>
    <value>[client-id]</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.secret</name>
    <value>[client-secret]</value>
  </property>
</configuration>
```

```shell md title="shell"
java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml --hadoopConfig hadoop.xml
```

Running the above command translates the table `people` from Iceberg or Hudi format to Delta Lake format. To validate
the translation, you can list the directories in the table's data path. For this tutorial, the `_delta_log` directory
should be present in the `people` table's data path at `default/people/data/_delta_log`.
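
One way to check is with the Azure CLI; the command below is a sketch, assuming `az login` has been run and you have data-plane read access on the account:

```shell md title="shell"
# Sketch: list the Delta log files to confirm the translation succeeded.
# Assumes `az login` and Storage Blob Data Reader access on the account.
az storage fs file list \
  --file-system mycontainer \
  --account-name mystorageaccount \
  --path default/people/data/_delta_log \
  --auth-mode login \
  --output table
```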

### Step 3. Create a Shortcut in Fabric and Analyze
This step creates a shortcut in Fabric to the Delta Lake table created in the previous step. The shortcut is a link to
the table's data path in the ADLS storage account.

1. Navigate to your Fabric Lakehouse, click on `Tables`, and select `New Shortcut`.
> ![Invoke new shortcut](/images/fabric/shortcut_1_1.png)
2. In the `New Shortcut` dialog, select `Azure Data Lake Storage Gen2` as the `External Source`.
> ![Select external source](/images/fabric/shortcut_1_2.png)
3. In the `New Shortcut` dialog, enter the `Connection settings` and authorize Fabric to access the storage account.
> ![Enter connection settings](/images/fabric/shortcut_1_3.png)
4. In the `New Shortcut` dialog, enter the `Shortcut settings` and click `Create`.
> ![Enter shortcut settings](/images/fabric/shortcut_1_4.png)
5. The shortcut is now created and the table is available for querying from Fabric.
> ![Shortcut created](/images/fabric/shortcut_1_5.png)
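
As a quick check, the table can now be queried, for example from the Lakehouse SQL analytics endpoint. The query below is a sketch; it assumes the shortcut was created under `Tables` with the name `people`:

```sql md title="T-SQL"
-- Sketch: query the shortcut from the SQL analytics endpoint (T-SQL).
-- Assumes the shortcut under Tables is named `people`.
SELECT TOP 10 * FROM people;
```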

[^1]: Updated on 2023-11-16
4 changes: 2 additions & 2 deletions website/docs/presto.md
@@ -1,5 +1,5 @@
---
-sidebar_position: 5
+sidebar_position: 6
title: "Presto"
---

@@ -23,7 +23,7 @@ over other columns in the Delta table. During sync, OneTable uses the same logic
Currently, the generated columns from OneTable sync show `NULL` when queried from Presto CLI.
:::

-For hands on experimentation, please follow [Creating your first interoperable table](/docs/setup) tutorial
+For hands on experimentation, please follow [Creating your first interoperable table](/docs/how-to) tutorial
to create OneTable synced tables followed by [Hive Metastore](/docs/hms) tutorial to register the target table
in Hive Metastore. Once done, follow the below high level steps:
1. If you are working with a self-managed Presto service, from the presto-server directory run `./bin/launcher run`
2 changes: 1 addition & 1 deletion website/docs/snowflake.md
@@ -1,5 +1,5 @@
---
-sidebar_position: 6
+sidebar_position: 7
title: "Snowflake"
---

2 changes: 1 addition & 1 deletion website/docs/trino.md
@@ -1,5 +1,5 @@
---
-sidebar_position: 7
+sidebar_position: 8
title: "Trino"
---

1 change: 1 addition & 0 deletions website/sidebars.js
@@ -54,6 +54,7 @@ module.exports = {
'redshift',
'spark',
'bigquery',
+'fabric',
'presto',
'snowflake',
'trino',
Binary file added website/static/images/fabric/shortcut_1_1.png
Binary file added website/static/images/fabric/shortcut_1_2.png
Binary file added website/static/images/fabric/shortcut_1_3.png
Binary file added website/static/images/fabric/shortcut_1_4.png
Binary file added website/static/images/fabric/shortcut_1_5.png
