feat(ingestion): powerbi # Configurable Admin API #7055

Merged

Commits (75)
5660959
initial code commit
siddiquebagwan-gslab Jan 16, 2023
eca1650
rename proxy.py to rest_api_wrapper.py
siddiquebagwan-gslab Jan 17, 2023
ad66a6a
Admin and Regular API path
siddiquebagwan-gslab Jan 18, 2023
041aa45
Integration Test
siddiquebagwan-gslab Jan 19, 2023
6e0c02b
lint fix
siddiquebagwan-gslab Jan 19, 2023
18ea4b0
resolve merge conflict
siddiquebagwan-gslab Jan 19, 2023
bef57ad
lint fix
siddiquebagwan-gslab Jan 19, 2023
764d08f
Merge branch 'master' into master+acr-5098-stateful-ingestion
siddiquebagwan-gslab Jan 19, 2023
10a5299
Merge branch 'master+acr-5097-admin-api-path' of github.com:acryldata…
siddiquebagwan-gslab Jan 19, 2023
0a236f8
Merge branch 'master' into master+acr-5097-admin-api-path
siddiquebagwan Jan 19, 2023
b5a3b6c
doc update
siddiquebagwan-gslab Jan 19, 2023
2361e25
Merge branch 'master+acr-5097-admin-api-path' of github.com:acryldata…
siddiquebagwan-gslab Jan 19, 2023
a283400
Merge branch 'master+acr-5097-admin-api-path' of github.com:mohdsiddi…
siddiquebagwan-gslab Jan 19, 2023
4d76a76
stateful ingestion
siddiquebagwan-gslab Jan 20, 2023
26b29b3
stateful test file
siddiquebagwan-gslab Jan 23, 2023
bc0bc33
Merge branch 'master' into master+acr-5097-admin-api-path
siddiquebagwan-gslab Jan 23, 2023
1cdb853
resovle merge conflict
siddiquebagwan-gslab Jan 23, 2023
dafdd09
stateful ingestion test
siddiquebagwan-gslab Jan 23, 2023
eee5cc1
lint fix
siddiquebagwan-gslab Jan 23, 2023
3986949
stateful test
siddiquebagwan-gslab Jan 24, 2023
5cce476
Merge branch 'master' into master+acr-5098-stateful-ingestion
siddiquebagwan-gslab Jan 24, 2023
c91f420
lint fix
siddiquebagwan-gslab Jan 24, 2023
db87e4f
Merge branch 'master' into master+acr-5097-admin-api-path
siddiquebagwan-gslab Jan 24, 2023
a1be1ac
golden files updated
siddiquebagwan-gslab Jan 25, 2023
9a91bbd
lint fix
siddiquebagwan-gslab Jan 25, 2023
e272085
Merge branch 'master+acr-5098-stateful-ingestion' into master+acr-509…
siddiquebagwan-gslab Jan 25, 2023
e619d27
Merge branch 'master+acr-5097-admin-api-path' of github.com:acryldata…
siddiquebagwan-gslab Jan 25, 2023
e0467ff
resolve merge conflict
siddiquebagwan-gslab Jan 27, 2023
8972e1e
test fix
siddiquebagwan-gslab Jan 27, 2023
fbe94ed
Merge branch 'master+acr-5097-admin-api-path' of github.com:acryldata…
siddiquebagwan-gslab Jan 27, 2023
6cf4288
rename data_fetcher to data_resolver
siddiquebagwan-gslab Jan 27, 2023
5dbec9c
lint fix
siddiquebagwan-gslab Jan 27, 2023
eb266f7
Merge branch 'master+acr-5097-admin-api-path' of github.com:acryldata…
siddiquebagwan-gslab Jan 27, 2023
8cca537
Merge branch 'master' into master+acr-5097-admin-api-path
siddiquebagwan-gslab Jan 29, 2023
97182c8
lint fix
siddiquebagwan-gslab Jan 30, 2023
54f38b8
handle 401 and 403
siddiquebagwan-gslab Jan 30, 2023
238f0b6
403 test case
siddiquebagwan-gslab Jan 30, 2023
bd6e735
Merge branch 'master' into master+acr-5097-admin-api-path
siddiquebagwan-gslab Jan 30, 2023
1d3e626
Merge branch 'master+acr-5097-admin-api-path' into acr-5097+admin-onl…
siddiquebagwan-gslab Jan 30, 2023
3613e62
test with logging enabled
siddiquebagwan-gslab Jan 30, 2023
4bf0347
admin-only test
siddiquebagwan-gslab Jan 30, 2023
955bb5b
lint fix
siddiquebagwan-gslab Jan 30, 2023
1100652
Merge branch 'master+acr-5097-admin-api-path' of github.com:acryldata…
siddiquebagwan-gslab Jan 30, 2023
fac988b
remove empty quotes
siddiquebagwan-gslab Jan 30, 2023
23c9923
hidden schema
siddiquebagwan-gslab Jan 30, 2023
28f79a5
review comments
siddiquebagwan-gslab Jan 30, 2023
8d047eb
Merge branch 'master+acr-5097-admin-api-path' of github.com:acryldata…
siddiquebagwan-gslab Jan 30, 2023
878efd0
update doc in config
siddiquebagwan-gslab Jan 30, 2023
1faf198
Merge branch 'master+acr-5097-admin-api-path' of github.com:acryldata…
siddiquebagwan-gslab Jan 30, 2023
191c770
Merge branch 'master' into master+acr-5097-admin-api-path
siddiquebagwan-gslab Jan 31, 2023
5f0b8cb
Merge branch 'master+acr-5097-admin-api-path' of github.com:acryldata…
siddiquebagwan-gslab Jan 31, 2023
4459c18
doc update
siddiquebagwan-gslab Jan 31, 2023
a90048b
Merge branch 'master+acr-5097-admin-api-path' of github.com:acryldata…
siddiquebagwan-gslab Jan 31, 2023
de86b97
resolve merge conflict
siddiquebagwan-gslab Feb 6, 2023
8e260a0
review comments
siddiquebagwan-gslab Feb 6, 2023
345ad67
Merge branch 'master+acr-5097-admin-api-path' of github.com:acryldata…
siddiquebagwan-gslab Feb 6, 2023
9b34c96
constant in file
siddiquebagwan-gslab Feb 8, 2023
5161b7b
WIP
siddiquebagwan-gslab Feb 8, 2023
f906c8c
all constant in config file
siddiquebagwan-gslab Feb 8, 2023
97f475c
code review comments
siddiquebagwan-gslab Feb 9, 2023
758bcb7
doc review comment
siddiquebagwan-gslab Feb 9, 2023
aeed12d
Merge branch 'master' into master+acr-5097-admin-api-path
siddiquebagwan-gslab Feb 9, 2023
d0a9603
Merge branch 'master+acr-5097-admin-api-path' of github.com:acryldata…
siddiquebagwan-gslab Feb 9, 2023
3d8e8fc
second round of review comments
siddiquebagwan-gslab Feb 10, 2023
1209742
Merge branch 'master' into master+acr-5097-admin-api-path
siddiquebagwan-gslab Feb 10, 2023
707c6e3
correct tableau py file
siddiquebagwan-gslab Feb 10, 2023
548671e
lint fix
siddiquebagwan-gslab Feb 10, 2023
e9d2f5b
doc update
siddiquebagwan-gslab Feb 10, 2023
a589193
Merge branch 'master+acr-5097-admin-api-path' of github.com:acryldata…
siddiquebagwan-gslab Feb 10, 2023
ae5dc82
doc update
siddiquebagwan-gslab Feb 10, 2023
34390fe
Merge branch 'master' into master+acr-5097-admin-api-path
siddiquebagwan-gslab Feb 10, 2023
0bbfd8f
Merge branch 'master+acr-5097-admin-api-path' of github.com:acryldata…
siddiquebagwan-gslab Feb 10, 2023
d7910b9
correct the build.gradle
siddiquebagwan-gslab Feb 11, 2023
5efea89
Merge branch 'master+acr-5097-admin-api-path' of github.com:acryldata…
siddiquebagwan-gslab Feb 11, 2023
fa5dc61
Merge branch 'master' into master+acr-5097-admin-api-path
siddiquebagwan Feb 14, 2023
49 changes: 44 additions & 5 deletions metadata-ingestion/docs/sources/powerbi/powerbi_pre.md
@@ -1,15 +1,16 @@
## Configuration Notes
See the
1. [Microsoft AD App Creation doc](https://docs.microsoft.com/en-us/power-bi/developer/embedded/embed-service-principal) for the steps to create an app client ID and secret and allow service principals to use Power BI APIs
2. Login to Power BI as Admin and from `Admin API settings` allow below permissions
1. Refer to the [Microsoft AD App Creation doc](https://docs.microsoft.com/en-us/power-bi/developer/embedded/embed-service-principal) to create a Microsoft AD application. Once the application is created, you can configure its client credentials, i.e. `client_id` and `client_secret`, in the ingestion recipe (see the sketch after the permissions list below).
2. Enable admin access only if you want to ingest datasets, lineage, and endorsement tags. Refer to the section [Admin Ingestion vs. Basic Ingestion](#admin-ingestion-vs-basic-ingestion) for more detail.

Log in to PowerBI as Admin and, from `Admin API settings`, allow the permissions below

- Allow service principals to use read-only admin APIs
- Enhance admin APIs responses with detailed metadata
- Enhance admin APIs responses with DAX and mashup expressions
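
For orientation, the sketch below shows a minimal programmatic recipe run using DataHub's Python `Pipeline` API with the credentials created above. The tenant/client values and the REST sink URL are placeholders, and additional options (for example a workspace filter) may be needed for your setup.

```python
# A minimal sketch of a programmatic PowerBI ingestion run. The tenant/client values
# and the sink URL are placeholders; adjust them for your environment.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "powerbi",
            "config": {
                "tenant_id": "00000000-0000-0000-0000-000000000000",
                "client_id": "<azure-ad-app-client-id>",
                "client_secret": "<azure-ad-app-client-secret>",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```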

## Concept mapping

| Power BI | Datahub |
| PowerBI | Datahub |
|-----------------------|---------------------|
| `Dashboard` | `Dashboard` |
| `Dataset's Table` | `Dataset` |
@@ -23,7 +24,7 @@ If Tile is created from report then Chart.externalUrl is set to Report.webUrl.

## Lineage

This source extract table lineage for tables present in Power BI Datasets. Lets consider a PowerBI Dataset `SALES_REPORT` and a PostgreSQL database is configured as data-source in `SALES_REPORT` dataset.
This source extracts table lineage for tables present in PowerBI Datasets. Let's consider a PowerBI Dataset `SALES_REPORT` in which a PostgreSQL database is configured as the data source.

Suppose the `SALES_REPORT` PowerBI Dataset has a table `SALES_ANALYSIS` that is backed by the `SALES_ANALYSIS_VIEW` view of the PostgreSQL database; in that case `SALES_ANALYSIS_VIEW` will appear as the upstream dataset of the `SALES_ANALYSIS` table.
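
For illustration, the sketch below shows the kind of upstream dataset URN such lineage resolves to, assuming the PowerBI `PostgreSql` data-source kind is mapped to DataHub's `postgres` platform via `dataset_type_mapping`; the fully qualified view name is hypothetical.

```python
# A sketch of the upstream URN that such lineage points at, under the assumption that
# dataset_type_mapping maps the PowerBI "PostgreSql" data-source kind to "postgres".
# The fully qualified view name below is made up.
from datahub.emitter.mce_builder import make_dataset_urn

dataset_type_mapping = {"PostgreSql": "postgres"}

upstream_urn = make_dataset_urn(
    platform=dataset_type_mapping["PostgreSql"],
    name="sales_db.public.sales_analysis_view",
    env="PROD",
)
print(upstream_urn)
# urn:li:dataset:(urn:li:dataPlatform:postgres,sales_db.public.sales_analysis_view,PROD)
```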

@@ -103,3 +104,41 @@ combine_result
By default, extracting endorsement information to tags is disabled. The feature may be useful if your organization uses [endorsements](https://learn.microsoft.com/en-us/power-bi/collaborate-share/service-endorse-content) to identify content quality.

Please note that the default implementation overwrites tags for the ingested entities. If you need to preserve existing tags, consider using a [transformer](../../../../metadata-ingestion/docs/transformer/dataset_transformer.md#simple-add-dataset-globaltags) with `semantics: PATCH` instead of `OVERWRITE`.
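
As an illustration, a recipe fragment that enables endorsement extraction and adds tags with `PATCH` semantics might look like the sketch below. The transformer type name and its config keys are assumptions taken from the linked transformer documentation; verify them there before use.

```python
# Hedged sketch of a recipe fragment (as a Python dict). The transformer type name
# ("simple_add_dataset_tags") and its config keys are assumptions based on the linked doc.
recipe_fragment = {
    "source": {
        "type": "powerbi",
        "config": {
            "extract_endorsements_to_tags": True,  # disabled by default
            # ...tenant_id / client_id / client_secret as shown earlier...
        },
    },
    "transformers": [
        {
            "type": "simple_add_dataset_tags",
            "config": {
                "semantics": "PATCH",  # merge with existing tags instead of OVERWRITE
                "tag_urns": ["urn:li:tag:Certified"],  # hypothetical tag
            },
        }
    ],
}
```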

## Admin Ingestion vs. Basic Ingestion
PowerBI provides two sets of APIs, i.e. the [Basic API and the Admin API](https://learn.microsoft.com/en-us/rest/api/power-bi/).

The Basic API returns metadata only for the PowerBI resources on which the service principal has been explicitly granted access, whereas the Admin API returns metadata for all PowerBI resources regardless of whether the service principal has been explicitly granted access to them.

Admin Ingestion (explained below) is the recommended way to run PowerBI ingestion, as it can extract the most metadata.


### Admin Ingestion: Service Principal As Admin in Tenant Setting and Added as Member In Workspace
To grant admin access to the service principal, visit your PowerBI tenant Settings.

If you have added the service principal as a `member` in the workspace and have also allowed the below permissions from the PowerBI tenant Settings

- Allow service principal to use read-only PowerBI Admin APIs
- Enhance admin APIs responses with detailed metadata
- Enhance admin APIs responses with DAX and mashup expressions

then the PowerBI source will be able to ingest the below-listed metadata of that particular workspace

- Lineage
- PowerBI Dataset
- Endorsement as tag
- Dashboards
- Reports
- Dashboard's Tiles
- Report's Pages

If you don't want (or don't have access) to add the service principal as a member in the workspace, you can set `admin_apis_only: true` in the recipe to use the PowerBI Admin API only, as in the sketch below. If `admin_apis_only` is set to `true`, report pages will not be ingested, as the Pages API is not available through the PowerBI Admin API.
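
A minimal sketch of the relevant source fragment, expressed as a Python dict with placeholder credentials:

```python
# Admin-API-only ingestion: workspace membership is not required, but report pages
# are skipped. Credential values are placeholders.
admin_only_source = {
    "type": "powerbi",
    "config": {
        "tenant_id": "<tenant-id>",
        "client_id": "<client-id>",
        "client_secret": "<client-secret>",
        "admin_apis_only": True,
    },
}
```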


### Basic Ingestion: Service Principal As Member In Workspace
If you have added the service principal as a `member` in the workspace, then the PowerBI source will be able to ingest the below metadata of that particular workspace

- Dashboards
- Reports
- Dashboard's Tiles
- Report's Pages
99 changes: 79 additions & 20 deletions metadata-ingestion/src/datahub/ingestion/source/powerbi/config.py
@@ -8,8 +8,14 @@

import datahub.emitter.mce_builder as builder
from datahub.configuration.common import AllowDenyPattern
from datahub.configuration.source_common import DEFAULT_ENV, EnvBasedSourceConfigBase
from datahub.ingestion.api.source import SourceReport
from datahub.configuration.source_common import DEFAULT_ENV
from datahub.ingestion.source.state.stale_entity_removal_handler import (
StaleEntityRemovalSourceReport,
StatefulStaleMetadataRemovalConfig,
)
from datahub.ingestion.source.state.stateful_ingestion_base import (
StatefulIngestionConfigBase,
)

logger = logging.getLogger(__name__)

@@ -25,6 +31,7 @@ class Constant:
REPORT_LIST = "REPORT_LIST"
PAGE_BY_REPORT = "PAGE_BY_REPORT"
DATASET_GET = "DATASET_GET"
DATASET_LIST = "DATASET_LIST"
REPORT_GET = "REPORT_GET"
DATASOURCE_GET = "DATASOURCE_GET"
TILE_GET = "TILE_GET"
@@ -33,10 +40,10 @@
SCAN_GET = "SCAN_GET"
SCAN_RESULT_GET = "SCAN_RESULT_GET"
Authorization = "Authorization"
WorkspaceId = "WorkspaceId"
DashboardId = "DashboardId"
DatasetId = "DatasetId"
ReportId = "ReportId"
WORKSPACE_ID = "workspaceId"
DASHBOARD_ID = "powerbi.linkedin.com/dashboards/{}"
DATASET_ID = "datasetId"
REPORT_ID = "reportId"
SCAN_ID = "ScanId"
Dataset_URN = "DatasetURN"
CHART_URN = "ChartURN"
@@ -49,30 +56,60 @@
STATUS = "status"
CHART_ID = "powerbi.linkedin.com/charts/{}"
CHART_KEY = "chartKey"
DASHBOARD_ID = "powerbi.linkedin.com/dashboards/{}"
DASHBOARD = "dashboard"
DASHBOARDS = "dashboards"
DASHBOARD_KEY = "dashboardKey"
OWNERSHIP = "ownership"
BROWSERPATH = "browsePaths"
DASHBOARD_INFO = "dashboardInfo"
DATAPLATFORM_INSTANCE = "dataPlatformInstance"
DATASET = "dataset"
DATASET_ID = "powerbi.linkedin.com/datasets/{}"
DATASETS = "datasets"
DATASET_KEY = "datasetKey"
DATASET_PROPERTIES = "datasetProperties"
VALUE = "value"
ENTITY = "ENTITY"
ID = "ID"
ID = "id"
HTTP_RESPONSE_TEXT = "HttpResponseText"
HTTP_RESPONSE_STATUS_CODE = "HttpResponseStatusCode"
NAME = "name"
DISPLAY_NAME = "displayName"
ORDER = "order"
IDENTIFIER = "identifier"
EMAIL_ADDRESS = "emailAddress"
PRINCIPAL_TYPE = "principalType"
GRAPH_ID = "graphId"
WORKSPACES = "workspaces"
TITLE = "title"
EMBED_URL = "embedUrl"
ACCESS_TOKEN = "access_token"
IS_READ_ONLY = "isReadOnly"
WEB_URL = "webUrl"
ODATA_COUNT = "@odata.count"
DESCRIPTION = "description"
REPORT = "report"
REPORTS = "reports"
CREATED_FROM = "createdFrom"
SUCCEEDED = "SUCCEEDED"
ENDORSEMENT = "endorsement"
ENDORSEMENT_DETAIL = "endorsementDetails"
TABLES = "tables"
EXPRESSION = "expression"
SOURCE = "source"
PLATFORM_NAME = "powerbi"
REPORT_TYPE_NAME = "Report"
CHART_COUNT = "chartCount"
WORKSPACE_NAME = "workspaceName"
DATASET_WEB_URL = "datasetWebUrl"


@dataclass
class PowerBiDashboardSourceReport(SourceReport):
class PowerBiDashboardSourceReport(StaleEntityRemovalSourceReport):
dashboards_scanned: int = 0
charts_scanned: int = 0
filtered_dashboards: List[str] = dataclass_field(default_factory=list)
filtered_charts: List[str] = dataclass_field(default_factory=list)
number_of_workspaces: int = 0

def report_dashboards_scanned(self, count: int = 1) -> None:
self.dashboards_scanned += count
@@ -86,6 +123,9 @@ def report_dashboards_dropped(self, model: str) -> None:
def report_charts_dropped(self, view: str) -> None:
self.filtered_charts.append(view)

def report_number_of_workspaces(self, number_of_workspaces: int) -> None:
self.number_of_workspaces = number_of_workspaces


@dataclass
class PlatformDetail:
@@ -99,7 +139,16 @@ class PlatformDetail:
)


class PowerBiAPIConfig(EnvBasedSourceConfigBase):
class PowerBiDashboardSourceConfig(StatefulIngestionConfigBase):
platform_name: str = pydantic.Field(
default=Constant.PLATFORM_NAME, hidden_from_schema=True
)

platform_urn: str = pydantic.Field(
default=builder.make_data_platform_urn(platform=Constant.PLATFORM_NAME),
hidden_from_schema=True,
)

# Organisation Identifier
tenant_id: str = pydantic.Field(description="PowerBI tenant identifier")
# PowerBi workspace identifier
@@ -130,21 +179,25 @@ class PowerBiAPIConfig(EnvBasedSourceConfigBase):
)
# Enable/Disable extracting ownership information of Dashboard
extract_ownership: bool = pydantic.Field(
default=True, description="Whether ownership should be ingested"
default=False,
description="Whether ownership should be ingested. Admin API access is required if this setting is enabled. "
"Note that enabling this may overwrite owners that you've added inside DataHub's web application.",
)
# Enable/Disable extracting report information
extract_reports: bool = pydantic.Field(
default=True, description="Whether reports should be ingested"
)
# Enable/Disable extracting lineage information of PowerBI Dataset
extract_lineage: bool = pydantic.Field(
default=True, description="Whether lineage should be ingested"
default=True,
description="Whether lineage should be ingested between X and Y. Admin API access is required if this setting is enabled",
)
# Enable/Disable extracting endorsements to tags. Please notice this may overwrite
# any existing tags defined to those entitiies
# any existing tags defined to those entities
extract_endorsements_to_tags: bool = pydantic.Field(
default=False,
description="Whether to extract endorsements to tags, note that this may overwrite existing tags",
description="Whether to extract endorsements to tags, note that this may overwrite existing tags. Admin API "
"access is required is this setting is enabled",
)
# Enable/Disable extracting workspace information to DataHub containers
extract_workspaces_to_containers: bool = pydantic.Field(
@@ -161,11 +214,22 @@ class PowerBiAPIConfig(EnvBasedSourceConfigBase):
default=False,
description="Whether to convert the PowerBI assets urns to lowercase",
)

# convert lineage dataset's urns to lowercase
convert_lineage_urns_to_lowercase: bool = pydantic.Field(
default=True,
description="Whether to convert the urns of ingested lineage dataset to lowercase",
)
# Configuration for stateful ingestion
stateful_ingestion: Optional[StatefulStaleMetadataRemovalConfig] = pydantic.Field(
default=None, description="PowerBI Stateful Ingestion Config."
)
# Retrieve PowerBI Metadata using Admin API only
admin_apis_only: bool = pydantic.Field(
default=False,
description="Retrieve metadata using PowerBI Admin API only. If this is enabled, then Report Pages will not "
"be extracted. Admin API access is required if this setting is enabled",
)

@validator("dataset_type_mapping")
@classmethod
@@ -197,8 +261,3 @@ def workspace_id_backward_compatibility(cls, values: Dict) -> Dict:
)
values.pop("workspace_id")
return values


class PowerBiDashboardSourceConfig(PowerBiAPIConfig):
platform_name: str = "powerbi"
platform_urn: str = builder.make_data_platform_urn(platform=platform_name)
@@ -17,14 +17,14 @@

def get_tables(native_query: str) -> List[str]:
native_query = remove_special_characters(native_query)
logger.debug("Processing query = %s", native_query)
logger.debug(f"Processing query = {native_query}")
tables: List[str] = []
parsed = sqlparse.parse(native_query)[0]
tokens: List[sqlparse.sql.Token] = list(parsed.tokens)
length: int = len(tokens)
from_index: int = -1
for index, token in enumerate(tokens):
logger.debug("%s=%s", token.value, token.ttype)
logger.debug(f"{token.value}={token.ttype}")
if (
token.value.lower().strip() == "from"
and str(token.ttype) == "Token.Keyword"
@@ -37,8 +37,8 @@ def get_tables(native_query: str) -> List[str]:
from_index < length
and isinstance(tokens[from_index], sqlparse.sql.Where) is not True
):
logger.debug("%s=%s", tokens[from_index].value, tokens[from_index].ttype)
logger.debug("Type=%s", type(tokens[from_index]))
logger.debug(f"{tokens[from_index].value}={tokens[from_index].ttype}")
logger.debug(f"Type={type(tokens[from_index])}")
if isinstance(tokens[from_index], sqlparse.sql.Identifier):
# Split on as keyword and collect the table name from 0th position. strip any spaces
tables.append(tokens[from_index].value.split("as")[0].strip())
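
The hunks above only switch the native-SQL table extractor's debug logging to f-strings. For orientation, here is a small self-contained sketch of the same FROM-clause scan using `sqlparse`; it is illustrative only and not the source module's code.

```python
# Self-contained sketch of the FROM-clause scan shown in the hunk above.
from typing import List

import sqlparse
from sqlparse.sql import Identifier, IdentifierList
from sqlparse.tokens import Keyword


def tables_after_from(query: str) -> List[str]:
    tables: List[str] = []
    statement = sqlparse.parse(query)[0]
    tokens = list(statement.tokens)
    for index, token in enumerate(tokens):
        if token.ttype is Keyword and token.value.lower() == "from":
            # Take the first identifier (or identifier list) that follows FROM.
            for following in tokens[index + 1:]:
                if isinstance(following, Identifier):
                    # Strip an optional "as <alias>" suffix, mirroring the extractor.
                    tables.append(following.value.split("as")[0].strip())
                    break
                if isinstance(following, IdentifierList):
                    tables.extend(
                        item.value.split("as")[0].strip()
                        for item in following.get_identifiers()
                    )
                    break
    return tables


print(tables_after_from("SELECT a, b FROM public.sales_analysis_view WHERE region = 'EU'"))
# ['public.sales_analysis_view']
```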
@@ -7,7 +7,7 @@

from datahub.ingestion.source.powerbi.config import PowerBiDashboardSourceReport
from datahub.ingestion.source.powerbi.m_query import resolver, validator
from datahub.ingestion.source.powerbi.proxy import PowerBiAPI
from datahub.ingestion.source.powerbi.rest_api_wrapper.data_classes import Table

logger = logging.getLogger(__name__)

@@ -32,7 +32,7 @@ def _parse_expression(expression: str) -> Tree:

parse_tree: Tree = lark_parser.parse(expression)

logger.debug("Parsing expression = %s", expression)
logger.debug(f"Parsing expression = {expression}")

if (
logger.level == logging.DEBUG
@@ -43,12 +43,12 @@


def get_upstream_tables(
table: PowerBiAPI.Table,
table: Table,
reporter: PowerBiDashboardSourceReport,
native_query_enabled: bool = True,
) -> List[resolver.DataPlatformTable]:
if table.expression is None:
logger.debug("Expression is none for table %s", table.full_name)
logger.debug(f"Expression is none for table {table.full_name}")
return []

try:
@@ -57,7 +57,7 @@
parse_tree, native_query_enabled=native_query_enabled
)
if valid is False:
logger.debug("Validation failed: %s", cast(str, message))
logger.debug(f"Validation failed: {cast(str, message)}")
reporter.report_warning(table.full_name, cast(str, message))
return []
except lark.exceptions.UnexpectedCharacters as e:
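
The resolver above parses a PowerBI M expression into a lark `Tree` before extracting upstream tables. The toy sketch below shows the same lark pattern with an invented mini-grammar; it is not the M-language grammar that ships with the source.

```python
# Toy illustration of the lark parsing pattern: build a parser from a small grammar
# and inspect the resulting Tree. The grammar is invented for this example only.
from lark import Lark

toy_grammar = r"""
    start: "let" NAME "=" NAME "(" STRING ")"
    NAME: /[A-Za-z_][A-Za-z0-9_.]*/
    STRING: /"[^"]*"/
    %import common.WS
    %ignore WS
"""

parser = Lark(toy_grammar)
tree = parser.parse('let Source = PostgreSQL.Database("sales_db")')
print(tree.pretty())  # dumps the parse tree, similar to what debug logging shows above
```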