feat(ingestion): powerbi # Configurable Admin API (#7055)
Co-authored-by: MohdSiddique Bagwan <mohdsiddique.bagwan@gslab.com>
siddiquebagwan and siddiquebagwan-gslab committed Feb 14, 2023
1 parent 24dcdd0 commit 3a095f9
Showing 27 changed files with 4,619 additions and 1,331 deletions.
49 changes: 44 additions & 5 deletions metadata-ingestion/docs/sources/powerbi/powerbi_pre.md
@@ -1,15 +1,16 @@
## Configuration Notes
See the
1. [Microsoft AD App Creation doc](https://docs.microsoft.com/en-us/power-bi/developer/embedded/embed-service-principal) for the steps to create an app client ID and secret and allow service principals to use Power BI APIs
2. Login to Power BI as Admin and from `Admin API settings` allow below permissions
1. Refer to the [Microsoft AD App Creation doc](https://docs.microsoft.com/en-us/power-bi/developer/embedded/embed-service-principal) to create a Microsoft AD application. Once the application is created, you can configure its client credentials, i.e. `client_id` and `client_secret`, in the ingestion recipe.
2. Enable admin access only if you want to ingest datasets, lineage, and endorsement tags. Refer to the section [Admin Ingestion vs. Basic Ingestion](#admin-ingestion-vs-basic-ingestion) for more detail.

Log in to PowerBI as an admin and, from `Admin API settings`, allow the permissions below:

- Allow service principals to use read-only admin APIs
- Enhance admin APIs responses with detailed metadata
- Enhance admin APIs responses with DAX and mashup expressions

## Concept mapping

| Power BI | Datahub |
| PowerBI | Datahub |
|-----------------------|---------------------|
| `Dashboard` | `Dashboard` |
| `Dataset's Table` | `Dataset` |
@@ -23,7 +24,7 @@ If Tile is created from report then Chart.externalUrl is set to Report.webUrl.

## Lineage

This source extract table lineage for tables present in Power BI Datasets. Lets consider a PowerBI Dataset `SALES_REPORT` and a PostgreSQL database is configured as data-source in `SALES_REPORT` dataset.
This source extracts table lineage for tables present in PowerBI Datasets. Consider a PowerBI Dataset `SALES_REPORT` with a PostgreSQL database configured as its data source.

Suppose the `SALES_REPORT` PowerBI Dataset has a table `SALES_ANALYSIS` that is backed by `SALES_ANALYSIS_VIEW` of the PostgreSQL database. In this case, `SALES_ANALYSIS_VIEW` will appear as an upstream dataset for the `SALES_ANALYSIS` table.

@@ -103,3 +104,41 @@ combine_result
By default, extracting endorsement information to tags is disabled. The feature may be useful if your organization uses [endorsements](https://learn.microsoft.com/en-us/power-bi/collaborate-share/service-endorse-content) to identify content quality.
Please note that the default implementation overwrites tags for the ingested entities. If you need to preserve existing tags, consider using a [transformer](../../../../metadata-ingestion/docs/transformer/dataset_transformer.md#simple-add-dataset-globaltags) with `semantics: PATCH` instead of `OVERWRITE`.
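
As a minimal sketch of the transformer option mentioned above, the fragment below writes the `transformers` section of a recipe as a Python dict. The transformer type `simple_add_dataset_tags`, the `tag_urns` key, and the example tag URN are assumptions based on the linked transformer doc rather than anything in this commit; check that doc before relying on them.

```python
from typing import Any, Dict, List

# Illustrative recipe fragment (assumed names, see the note above).
# `semantics: PATCH` asks the transformer to merge tags with the ones
# already present instead of replacing them.
transformers: List[Dict[str, Any]] = [
    {
        "type": "simple_add_dataset_tags",  # assumed transformer type
        "config": {
            "semantics": "PATCH",  # as opposed to "OVERWRITE"
            "tag_urns": ["urn:li:tag:ExampleTag"],  # hypothetical tag
        },
    }
]
```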
## Admin Ingestion vs. Basic Ingestion
PowerBI provides two sets of APIs, i.e. the [Basic API and Admin API](https://learn.microsoft.com/en-us/rest/api/power-bi/).
The Basic API returns metadata only for PowerBI resources on which the service principal has been explicitly granted access, whereas the Admin API returns metadata for all PowerBI resources, regardless of whether the service principal has been explicitly granted access to them.
Admin Ingestion (explained below) is the recommended way to run PowerBI ingestion, as it can extract the most metadata.
### Admin Ingestion: Service Principal As Admin in Tenant Setting and Added as Member In Workspace
To grant admin access to the service principal, visit your PowerBI tenant settings.
If you have added the service principal as a `member` in the workspace and have also allowed the permissions below in the PowerBI tenant settings:
- Allow service principal to use read-only PowerBI Admin APIs
- Enhance admin APIs responses with detailed metadata
- Enhance admin APIs responses with DAX and mashup expressions
then the PowerBI source will be able to ingest the following metadata of that workspace:
- Lineage
- PowerBI Dataset
- Endorsement as tag
- Dashboards
- Reports
- Dashboard's Tiles
- Report's Pages
If you don't want to add the service principal as a member in the workspace (or don't have the access to do so), you can set `admin_apis_only: true` in the recipe to use the PowerBI Admin API only. If `admin_apis_only` is set to `true`, report pages will not be ingested, because the page API is not available in the PowerBI Admin API.
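
To illustrate the `admin_apis_only` option described above, here is a rough sketch that builds the `source` section of a recipe as a Python dict. The helper name `build_powerbi_source` and the placeholder credential values are hypothetical. The config keys (`tenant_id`, `client_id`, `client_secret`, `admin_apis_only`) are the ones named in this document and in `config.py`; a real recipe may need further keys and a sink.

```python
from typing import Any, Dict


def build_powerbi_source(
    tenant_id: str,
    client_id: str,
    client_secret: str,
    admin_apis_only: bool = False,
) -> Dict[str, Any]:
    """Return the `source` section of an ingestion recipe (illustrative only)."""
    return {
        "type": "powerbi",
        "config": {
            "tenant_id": tenant_id,
            "client_id": client_id,
            "client_secret": client_secret,
            # When True, only the Admin API is used, so report pages are skipped.
            "admin_apis_only": admin_apis_only,
        },
    }


# Placeholder values; real ones come from the Microsoft AD application.
admin_only_source = build_powerbi_source(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
    admin_apis_only=True,
)
```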
### Basic Ingestion: Service Principal As Member In Workspace
If you have added the service principal as a `member` in the workspace, the PowerBI source will be able to ingest the following metadata of that workspace:
- Dashboards
- Reports
- Dashboard's Tiles
- Report's Pages
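
The difference between the ingestion modes described in this section can be summarized in a small sketch. The enum `IngestionMode` and the function `ingestible_metadata` are hypothetical names used only for illustration and are not part of the PowerBI source; the metadata lists are copied from the text above.

```python
from enum import Enum
from typing import FrozenSet


class IngestionMode(Enum):
    BASIC = "basic"  # service principal is a workspace member only
    ADMIN = "admin"  # workspace member plus Admin API permissions
    ADMIN_APIS_ONLY = "admin_apis_only"  # recipe sets admin_apis_only: true


BASIC_METADATA: FrozenSet[str] = frozenset(
    {"Dashboards", "Reports", "Dashboard's Tiles", "Report's Pages"}
)
ADMIN_EXTRA_METADATA: FrozenSet[str] = frozenset(
    {"Lineage", "PowerBI Dataset", "Endorsement as tag"}
)


def ingestible_metadata(mode: IngestionMode) -> FrozenSet[str]:
    """Return the metadata kinds this document lists for the given mode."""
    if mode is IngestionMode.BASIC:
        return BASIC_METADATA
    full = BASIC_METADATA | ADMIN_EXTRA_METADATA
    if mode is IngestionMode.ADMIN_APIS_ONLY:
        # The page API is not available in the Admin API.
        return full - {"Report's Pages"}
    return full
```

For example, `ingestible_metadata(IngestionMode.ADMIN_APIS_ONLY)` contains `"Lineage"` but not `"Report's Pages"`.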
99 changes: 79 additions & 20 deletions metadata-ingestion/src/datahub/ingestion/source/powerbi/config.py
@@ -8,8 +8,14 @@

import datahub.emitter.mce_builder as builder
from datahub.configuration.common import AllowDenyPattern
from datahub.configuration.source_common import DEFAULT_ENV, EnvBasedSourceConfigBase
from datahub.ingestion.api.source import SourceReport
from datahub.configuration.source_common import DEFAULT_ENV
from datahub.ingestion.source.state.stale_entity_removal_handler import (
StaleEntityRemovalSourceReport,
StatefulStaleMetadataRemovalConfig,
)
from datahub.ingestion.source.state.stateful_ingestion_base import (
StatefulIngestionConfigBase,
)

logger = logging.getLogger(__name__)

@@ -25,6 +31,7 @@ class Constant:
REPORT_LIST = "REPORT_LIST"
PAGE_BY_REPORT = "PAGE_BY_REPORT"
DATASET_GET = "DATASET_GET"
DATASET_LIST = "DATASET_LIST"
REPORT_GET = "REPORT_GET"
DATASOURCE_GET = "DATASOURCE_GET"
TILE_GET = "TILE_GET"
@@ -33,10 +40,10 @@ class Constant:
SCAN_GET = "SCAN_GET"
SCAN_RESULT_GET = "SCAN_RESULT_GET"
Authorization = "Authorization"
WorkspaceId = "WorkspaceId"
DashboardId = "DashboardId"
DatasetId = "DatasetId"
ReportId = "ReportId"
WORKSPACE_ID = "workspaceId"
DASHBOARD_ID = "powerbi.linkedin.com/dashboards/{}"
DATASET_ID = "datasetId"
REPORT_ID = "reportId"
SCAN_ID = "ScanId"
Dataset_URN = "DatasetURN"
CHART_URN = "ChartURN"
@@ -49,30 +56,60 @@ class Constant:
STATUS = "status"
CHART_ID = "powerbi.linkedin.com/charts/{}"
CHART_KEY = "chartKey"
DASHBOARD_ID = "powerbi.linkedin.com/dashboards/{}"
DASHBOARD = "dashboard"
DASHBOARDS = "dashboards"
DASHBOARD_KEY = "dashboardKey"
OWNERSHIP = "ownership"
BROWSERPATH = "browsePaths"
DASHBOARD_INFO = "dashboardInfo"
DATAPLATFORM_INSTANCE = "dataPlatformInstance"
DATASET = "dataset"
DATASET_ID = "powerbi.linkedin.com/datasets/{}"
DATASETS = "datasets"
DATASET_KEY = "datasetKey"
DATASET_PROPERTIES = "datasetProperties"
VALUE = "value"
ENTITY = "ENTITY"
ID = "ID"
ID = "id"
HTTP_RESPONSE_TEXT = "HttpResponseText"
HTTP_RESPONSE_STATUS_CODE = "HttpResponseStatusCode"
NAME = "name"
DISPLAY_NAME = "displayName"
ORDER = "order"
IDENTIFIER = "identifier"
EMAIL_ADDRESS = "emailAddress"
PRINCIPAL_TYPE = "principalType"
GRAPH_ID = "graphId"
WORKSPACES = "workspaces"
TITLE = "title"
EMBED_URL = "embedUrl"
ACCESS_TOKEN = "access_token"
IS_READ_ONLY = "isReadOnly"
WEB_URL = "webUrl"
ODATA_COUNT = "@odata.count"
DESCRIPTION = "description"
REPORT = "report"
REPORTS = "reports"
CREATED_FROM = "createdFrom"
SUCCEEDED = "SUCCEEDED"
ENDORSEMENT = "endorsement"
ENDORSEMENT_DETAIL = "endorsementDetails"
TABLES = "tables"
EXPRESSION = "expression"
SOURCE = "source"
PLATFORM_NAME = "powerbi"
REPORT_TYPE_NAME = "Report"
CHART_COUNT = "chartCount"
WORKSPACE_NAME = "workspaceName"
DATASET_WEB_URL = "datasetWebUrl"


@dataclass
class PowerBiDashboardSourceReport(SourceReport):
class PowerBiDashboardSourceReport(StaleEntityRemovalSourceReport):
dashboards_scanned: int = 0
charts_scanned: int = 0
filtered_dashboards: List[str] = dataclass_field(default_factory=list)
filtered_charts: List[str] = dataclass_field(default_factory=list)
number_of_workspaces: int = 0

def report_dashboards_scanned(self, count: int = 1) -> None:
self.dashboards_scanned += count
@@ -86,6 +123,9 @@ def report_dashboards_dropped(self, model: str) -> None:
def report_charts_dropped(self, view: str) -> None:
self.filtered_charts.append(view)

def report_number_of_workspaces(self, number_of_workspaces: int) -> None:
self.number_of_workspaces = number_of_workspaces


@dataclass
class PlatformDetail:
@@ -99,7 +139,16 @@ class PlatformDetail:
)


class PowerBiAPIConfig(EnvBasedSourceConfigBase):
class PowerBiDashboardSourceConfig(StatefulIngestionConfigBase):
platform_name: str = pydantic.Field(
default=Constant.PLATFORM_NAME, hidden_from_schema=True
)

platform_urn: str = pydantic.Field(
default=builder.make_data_platform_urn(platform=Constant.PLATFORM_NAME),
hidden_from_schema=True,
)

# Organisation Identifier
tenant_id: str = pydantic.Field(description="PowerBI tenant identifier")
# PowerBi workspace identifier
@@ -130,21 +179,25 @@ class PowerBiAPIConfig(EnvBasedSourceConfigBase):
)
# Enable/Disable extracting ownership information of Dashboard
extract_ownership: bool = pydantic.Field(
default=True, description="Whether ownership should be ingested"
default=False,
description="Whether ownership should be ingested. Admin API access is required if this setting is enabled. "
"Note that enabling this may overwrite owners that you've added inside DataHub's web application.",
)
# Enable/Disable extracting report information
extract_reports: bool = pydantic.Field(
default=True, description="Whether reports should be ingested"
)
# Enable/Disable extracting lineage information of PowerBI Dataset
extract_lineage: bool = pydantic.Field(
default=True, description="Whether lineage should be ingested"
default=True,
description="Whether lineage should be ingested between X and Y. Admin API access is required if this setting is enabled",
)
# Enable/Disable extracting endorsements to tags. Please notice this may overwrite
# any existing tags defined to those entitiies
# any existing tags defined to those entities
extract_endorsements_to_tags: bool = pydantic.Field(
default=False,
description="Whether to extract endorsements to tags, note that this may overwrite existing tags",
description="Whether to extract endorsements to tags, note that this may overwrite existing tags. Admin API "
"access is required if this setting is enabled",
)
# Enable/Disable extracting workspace information to DataHub containers
extract_workspaces_to_containers: bool = pydantic.Field(
@@ -161,11 +214,22 @@ class PowerBiAPIConfig(EnvBasedSourceConfigBase):
default=False,
description="Whether to convert the PowerBI assets urns to lowercase",
)

# convert lineage dataset's urns to lowercase
convert_lineage_urns_to_lowercase: bool = pydantic.Field(
default=True,
description="Whether to convert the urns of ingested lineage dataset to lowercase",
)
# Configuration for stateful ingestion
stateful_ingestion: Optional[StatefulStaleMetadataRemovalConfig] = pydantic.Field(
default=None, description="PowerBI Stateful Ingestion Config."
)
# Retrieve PowerBI Metadata using Admin API only
admin_apis_only: bool = pydantic.Field(
default=False,
description="Retrieve metadata using PowerBI Admin API only. If this is enabled, then Report Pages will not "
"be extracted. Admin API access is required if this setting is enabled",
)

@validator("dataset_type_mapping")
@classmethod
@@ -197,8 +261,3 @@ def workspace_id_backward_compatibility(cls, values: Dict) -> Dict:
)
values.pop("workspace_id")
return values


class PowerBiDashboardSourceConfig(PowerBiAPIConfig):
platform_name: str = "powerbi"
platform_urn: str = builder.make_data_platform_urn(platform=platform_name)
Original file line number Diff line number Diff line change
@@ -17,14 +17,14 @@ def remove_special_characters(native_query: str) -> str:

def get_tables(native_query: str) -> List[str]:
native_query = remove_special_characters(native_query)
logger.debug("Processing query = %s", native_query)
logger.debug(f"Processing query = {native_query}")
tables: List[str] = []
parsed = sqlparse.parse(native_query)[0]
tokens: List[sqlparse.sql.Token] = list(parsed.tokens)
length: int = len(tokens)
from_index: int = -1
for index, token in enumerate(tokens):
logger.debug("%s=%s", token.value, token.ttype)
logger.debug(f"{token.value}={token.ttype}")
if (
token.value.lower().strip() == "from"
and str(token.ttype) == "Token.Keyword"
@@ -37,8 +37,8 @@ def get_tables(native_query: str) -> List[str]:
from_index < length
and isinstance(tokens[from_index], sqlparse.sql.Where) is not True
):
logger.debug("%s=%s", tokens[from_index].value, tokens[from_index].ttype)
logger.debug("Type=%s", type(tokens[from_index]))
logger.debug(f"{tokens[from_index].value}={tokens[from_index].ttype}")
logger.debug(f"Type={type(tokens[from_index])}")
if isinstance(tokens[from_index], sqlparse.sql.Identifier):
# Split on as keyword and collect the table name from 0th position. strip any spaces
tables.append(tokens[from_index].value.split("as")[0].strip())
Original file line number Diff line number Diff line change
@@ -7,7 +7,7 @@

from datahub.ingestion.source.powerbi.config import PowerBiDashboardSourceReport
from datahub.ingestion.source.powerbi.m_query import resolver, validator
from datahub.ingestion.source.powerbi.proxy import PowerBiAPI
from datahub.ingestion.source.powerbi.rest_api_wrapper.data_classes import Table

logger = logging.getLogger(__name__)

@@ -32,7 +32,7 @@ def _parse_expression(expression: str) -> Tree:

parse_tree: Tree = lark_parser.parse(expression)

logger.debug("Parsing expression = %s", expression)
logger.debug(f"Parsing expression = {expression}")

if (
logger.level == logging.DEBUG
@@ -43,12 +43,12 @@ def _parse_expression(expression: str) -> Tree:


def get_upstream_tables(
table: PowerBiAPI.Table,
table: Table,
reporter: PowerBiDashboardSourceReport,
native_query_enabled: bool = True,
) -> List[resolver.DataPlatformTable]:
if table.expression is None:
logger.debug("Expression is none for table %s", table.full_name)
logger.debug(f"Expression is none for table {table.full_name}")
return []

try:
@@ -57,7 +57,7 @@ def get_upstream_tables(
parse_tree, native_query_enabled=native_query_enabled
)
if valid is False:
logger.debug("Validation failed: %s", cast(str, message))
logger.debug(f"Validation failed: {cast(str, message)}")
reporter.report_warning(table.full_name, cast(str, message))
return []
except lark.exceptions.UnexpectedCharacters as e: