# Graph API Module Example 
This example demonstrates how to use the Graph API module to process incoming Microsoft Graph data, perform data prep, and view the data in an example Power BI dashboard.
## Preface
a.) Before running this notebook, change the storage account URL in every cell to match your own storage account.

b.) Each step's data processing cell is preceded by a commented-out "file viewer" cell. Uncomment it for debugging, or to inspect the contents of the raw file.

c.) Directly below the data processing cells for steps 2 and 3 (M365 and Teams data), there is code that randomizes the data in each table. **If you wish to preserve any original data that you integrated via the Graph pipeline, you must comment these cells out.**

d.) Step 2 (processing the M365 apps user details data) has one additional cell that splits the data into the correct columns. Do not comment this out.
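Editing the storage account in every cell is easy to get wrong. One option, shown here only as a sketch, is to keep the account name in a single variable and build each URL from it. The helper name `abfss_url` and the constant `STORAGE_ACCOUNT` are hypothetical and are not part of the original notebook.

```python
# Hypothetical helper (not in the original notebook): build abfss:// URLs
# from a single storage-account setting so only one line needs editing.
STORAGE_ACCOUNT = "stoeahybriddev2"  # replace with your own storage account name


def abfss_url(container: str, path: str, account: str = STORAGE_ACCOUNT) -> str:
    """Return the abfss:// URL for a path inside a data lake container."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


users_raw = abfss_url("stage1np", "GraphAPI/users.json")
users_clean = abfss_url("stage2np", "GraphAPI/users")
print(users_raw)
print(users_clean)
```

The resulting strings can be passed wherever the cells below use a literal URL.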
## Running the example (LAST STEP NEEDS TO BE EDITED)
1.) Select your Spark pool in the "Attach to" dropdown list above.

2.) Click on "Publish" in the top nav bar (and wait a few seconds for the notification that says "Publishing completed").

3.) Click on "Run all" at the top of this tab (and wait for the processing to complete - which can take around 5 to 10 minutes).

4.) Open the dashboard in Power BI Desktop and point it to your newly set up data lake (you can download the pbix from here: [techInequityDashboardContoso v2.pbix](https://github.com/microsoft/OpenEduAnalytics/blob/main/packages/ContosoISD/power_bi/techInequityDashboardContoso%20v2.pbix)).

## 1.) Processing "users" raw data from Graph API 
The raw "users" data in stage1np of the data lake is cleaned, written as Parquet, and landed in stage2np.
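The cleaning step relies on the Graph response wrapping its records in a top-level "value" array, which the Spark cell below flattens with `explode`. The following plain-Python sketch illustrates the same flattening on a made-up sample. The sample records and the function name `flatten_users` are illustrative assumptions and are not real data.

```python
import json

# Made-up sample shaped like the "users" file: records sit under a "value" array.
raw = json.loads("""
{
  "value": [
    {"surname": "Doe", "givenName": "Jane", "userPrincipalName": "jane@contoso.example", "id": "1", "extra": "ignored"},
    {"surname": "Roe", "givenName": "Rick", "userPrincipalName": "rick@contoso.example", "id": "2"}
  ]
}
""")

# Field names taken from user_schema in the processing cell.
USER_FIELDS = ["surname", "givenName", "userPrincipalName", "id"]


def flatten_users(payload):
    """Return one flat dict per element of payload['value'], keeping only schema fields."""
    return [{field: record.get(field) for field in USER_FIELDS}
            for record in payload.get("value", [])]


rows = flatten_users(raw)
for row in rows:
    print(row)
```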


In [1]:
# View "users" JSON
#%%pyspark
#df = spark.read.load('abfss://stage1np@stoeahybriddev2.dfs.core.windows.net/GraphAPI/users.json', format='json')
#display(df.limit(10))


In [56]:
#%%pyspark
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, ArrayType
from pyspark.sql.functions import explode

user_schema = StructType(fields=[
    StructField('value', ArrayType(
        StructType([
            StructField('surname', StringType(), False),
            StructField('givenName', StringType(), False),
            StructField('userPrincipalName', StringType(), False),
            StructField('id', StringType(), False)
        ])
    ))
])

df = spark.read.load('abfss://stage1np@stoeahybriddev2.dfs.core.windows.net/GraphAPI/users.json', format='json', schema=user_schema)
df = df.select(explode('value').alias('exploded_values')).select("exploded_values.*")
display(df.limit(10))
#display(df)

# Convert the JSON data above to Parquet and land it in stage2np
df.write.format('parquet').mode('overwrite').save('abfss://stage2np@stoeahybriddev2.dfs.core.windows.net/GraphAPI/users')

# Create spark db graphapi to allow for access to the data in the data lake via SQL on-demand, and create the table "users".
spark.sql('CREATE DATABASE IF NOT EXISTS GRAPHAPI')
spark.sql("create table if not exists GraphAPI.users using PARQUET location 'abfss://stage2np@stoeahybriddev2.dfs.core.windows.net/GraphAPI/users'")


## 2.) Processing "Microsoft 365 app user detail" raw data from Graph API
The raw "m365_app_user_detail" data in stage1np of the data lake is cleaned, written as Parquet, and landed in stage2np.

In [1]:
# View "m365_app_user_detail" JSON
#%%pyspark
#df = spark.read.load('abfss://stage1np@stoeahybriddev2.dfs.core.windows.net/GraphAPI/m365_app_user_detail.json', format='json')
#display(df.limit(10))


In [63]:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, ArrayType, BooleanType
from pyspark.sql.functions import explode, split, col, regexp_replace, rand, when

m365_app_user_details_schema = StructType(fields=[
    StructField('value', ArrayType(
        StructType([
            StructField('reportRefreshDate', StringType(), False),
            StructField('userPrincipalName', StringType(), False),
            StructField('lastActivityDate', StringType(), False),
            StructField('reportPeriod', StringType(), False),
            StructField('excel', StringType(), False),
            StructField('excelWeb', StringType(), False),
            StructField('outlook', StringType(), False),
            StructField('outlookWeb', StringType(), False),
            StructField('powerPoint', StringType(), False),
            StructField('powerPointWeb', StringType(), False),
            StructField('teams', StringType(), False),
            StructField('teamsWeb', StringType(), False),
            StructField('word', StringType(), False),
            StructField('wordWeb', StringType(), False),
            StructField('details', StringType(
                #StructType([
                #    StructField('reportPeriod', StringType(), False),
                #    StructField('excel', StringType(), False),
                #    StructField('excelWeb', StringType(), False),
                #    StructField('outlook', StringType(), False),
                #    StructField('outlookWeb', StringType(), False),
                #    StructField('powerPoint', StringType(), False),
                #    StructField('powerPointWeb', StringType(), False),
                #    StructField('teams', StringType(), False),
                #    StructField('teamsWeb', StringType(), False),
                #    StructField('word', StringType(), False),
                #    StructField('wordWeb', StringType(), False),
                #    StructField('userPrincipalName', StringType(), False),
                #])
            ))
        ])
    ))
])

df = spark.read.load('abfss://stage1np@stoeahybriddev2.dfs.core.windows.net/GraphAPI/m365_app_user_detail.json', format='json', schema=m365_app_user_details_schema)

df = df.select(explode('value').alias('exploded_values')).select("exploded_values.*")
#df = df.select(explode('details').alias('exploded_details')).select("exploded_details.*")

display(df.limit(10))


In [64]:
# This is a roundabout method (reading "details" as a string rather than an array), but it works for now.

# Isolate and allocate data from "details" to their respective columns
# Map each target column to the position of its item in the comma-split "details" string
detail_positions = {'reportPeriod': 0, 'outlook': 5, 'word': 6, 'excel': 7, 'powerPoint': 8, 'teams': 10,
                    'outlookWeb': 29, 'wordWeb': 30, 'excelWeb': 31, 'powerPointWeb': 32, 'teamsWeb': 34}
details_parts = split(col('details'), ',')
splitDF = df
for column_name, position in detail_positions.items():
    splitDF = splitDF.withColumn(column_name, details_parts.getItem(position))
splitDF = splitDF.drop('details')

# Clean the data within each column, to remove excess string pieces
for column_name in ['reportPeriod', 'excel', 'excelWeb', 'outlook', 'outlookWeb', 'powerPoint', 'powerPointWeb', 'teams', 'teamsWeb', 'word', 'wordWeb']:
    # Strip the '"<name>":' key prefix so that only the value remains
    splitDF = splitDF.withColumn(column_name, regexp_replace(column_name, f'"{column_name}":', ''))
# Remove any remaining non-word characters (raw string avoids an invalid escape sequence warning)
for column_name in ['reportPeriod', 'teamsWeb']:
    splitDF = splitDF.withColumn(column_name, regexp_replace(column_name, r'\W+', ''))

# Transform the datatypes of the columns to be accurate to what they represent
splitDF = splitDF.withColumn('reportPeriod', col('reportPeriod').cast("int"))
for column_name in ['excel', 'excelWeb', 'outlook', 'outlookWeb', 'powerPoint', 'powerPointWeb', 'teams', 'teamsWeb', 'word', 'wordWeb']:
    splitDF = splitDF.withColumn(column_name, col(column_name).cast("boolean"))

#display(splitDF.limit(10))
#splitDF.printSchema()
#display(splitDF)
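Splitting "details" on commas and picking items by position depends on the exact field order in the raw file. The commented-out schema in the previous cell suggests that "details" is an array of records with named fields. If that assumption holds, looking fields up by name is less fragile. The sketch below uses plain Python on a made-up sample string. The function name `parse_details` and the sample are illustrative assumptions, and this is not the method used in the notebook.

```python
import json

# Column names taken from the schema in the processing cell.
APP_FIELDS = ["reportPeriod", "excel", "excelWeb", "outlook", "outlookWeb",
              "powerPoint", "powerPointWeb", "teams", "teamsWeb", "word", "wordWeb"]


def parse_details(details: str) -> dict:
    """Parse a "details" JSON string and return the app-usage fields by name.

    Assumes "details" is a JSON object or a list holding one object;
    fields missing from the input come back as None.
    """
    parsed = json.loads(details)
    if isinstance(parsed, list):
        parsed = parsed[0] if parsed else {}
    row = {field: parsed.get(field) for field in APP_FIELDS}
    if row["reportPeriod"] is not None:
        row["reportPeriod"] = int(row["reportPeriod"])  # mirrors the cast to int above
    return row


# Made-up sample; a real file may contain more fields per record.
sample = '[{"reportPeriod": "7", "outlook": true, "word": false, "excel": true, "teams": true, "teamsWeb": false}]'
row = parse_details(sample)
print(row)
```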



In [61]:
# Randomization for sample data
# Each column is set to "true" when a uniform random draw exceeds its threshold
true_thresholds = {'excel': 0.3, 'excelWeb': 0.7, 'outlook': 0.9, 'outlookWeb': 0.1, 'powerPoint': 0.7,
                   'powerPointWeb': 0.9, 'teams': 0.1, 'teamsWeb': 0.3, 'word': 0.4, 'wordWeb': 0.7}
for column_name, threshold in true_thresholds.items():
    splitDF = splitDF.withColumn(column_name, when(rand() > threshold, "true").otherwise("false"))

display(splitDF.limit(10))
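Because `rand()` draws uniformly from [0, 1), a condition such as `rand() > 0.3` is true for roughly 70% of rows, so each threshold above sets the share of "true" values to about one minus the threshold. The plain-Python sketch below checks this on simulated rows. It uses Python's `random` module rather than Spark and booleans rather than the "true"/"false" strings written by the cell above. The seed and row count are arbitrary choices.

```python
import random

# Thresholds copied from the randomization cell; expected share of True is 1 - threshold.
THRESHOLDS = {"excel": 0.3, "excelWeb": 0.7, "outlook": 0.9, "outlookWeb": 0.1,
              "powerPoint": 0.7, "powerPointWeb": 0.9, "teams": 0.1, "teamsWeb": 0.3,
              "word": 0.4, "wordWeb": 0.7}


def random_row(rng: random.Random) -> dict:
    """Draw one uniform number per app and compare it to that app's threshold."""
    return {app: rng.random() > threshold for app, threshold in THRESHOLDS.items()}


rng = random.Random(42)  # arbitrary seed so the simulation is repeatable
rows = [random_row(rng) for _ in range(10_000)]
share = {app: sum(r[app] for r in rows) / len(rows) for app in THRESHOLDS}

for app, threshold in THRESHOLDS.items():
    print(f"{app}: expected about {1 - threshold:.2f}, simulated {share[app]:.3f}")
```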


In [None]:
# Convert the JSON data above to Parquet and land it in stage2np
#splitDF.write.format('parquet').mode('overwrite').save('abfss://stage2np@stoeahybriddev2.dfs.core.windows.net/GraphAPI/m365_app_user_detail')

# Create table "m365_app_user_detail" in spark db "graphapi" to allow for access to the data in the data lake.
#spark.sql("create table if not exists GraphAPI.m365_app_user_detail using PARQUET location 'abfss://stage2np@stoeahybriddev2.dfs.core.windows.net/GraphAPI/m365_app_user_detail'")

## 3.) Processing "Teams activity user details" raw data from Graph API
The raw "teams_activity_user_details" data in stage1np of the data lake is cleaned, written as Parquet, and landed in stage2np.

In [None]:
# View "teams_activity_user_details" JSON
#%%pyspark
#df = spark.read.load('abfss://stage1np@stoeahybriddev2.dfs.core.windows.net/GraphAPI/teams_activity_user_details.json', format='json')
#display(df.limit(10))


In [57]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, ArrayType, BooleanType
from pyspark.sql.functions import explode, split, col, regexp_replace, rand, when, round

teams_activity_user_details_schema = StructType(fields=[
    StructField('value', ArrayType(
        StructType([
            StructField('reportRefreshDate', StringType(), False),
            StructField('reportPeriod', StringType(), False),
            StructField('userPrincipalName', StringType(), False),
            StructField('privateChatMessageCount', IntegerType(), False),
            StructField('teamChatMessageCount', IntegerType(), False),
            StructField('meetingsAttendedCount', IntegerType(), False),
            StructField('meetingCount', IntegerType(), False),
            StructField('audioDuration', StringType(), False),
        ])
    ))
])

df = spark.read.load('abfss://stage1np@stoeahybriddev2.dfs.core.windows.net/GraphAPI/teams_activity_user_details.json', format='json', schema=teams_activity_user_details_schema)
df = df.select(explode('value').alias('exploded_values')).select("exploded_values.*")

display(df.limit(10))
#display(df)



In [58]:
# Randomization for sample data (meetingCount = meetingsAttendedCount)
df = (
    df.withColumn('privateChatMessageCount', round(rand(seed=10) * 100).cast("int"))
      .withColumn('teamChatMessageCount', round(rand() * 50).cast("int"))
      # The same seed is used twice so that meetingCount equals meetingsAttendedCount
      .withColumn('meetingsAttendedCount', round(rand(seed=20) * 50).cast("int"))
      .withColumn('meetingCount', round(rand(seed=20) * 50).cast("int"))
)
df = df.withColumn(
    'audioDuration',
    when(rand() > 0.9, '2:00')
    .when(rand() > 0.7, '1:30')
    .when(rand() > 0.5, '1:00')
    .when(rand() > 0.3, '0:30')
    .when(rand() > 0.1, '0:15')
    .otherwise('0:00')
)

display(df.limit(10))
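The chained `when` conditions for "audioDuration" each call `rand()` separately. Assuming each unseeded call is an independent uniform draw, a branch is only reached when all earlier branches failed, so the thresholds do not translate directly into the share of each label. The sketch below works out the resulting probabilities in plain Python. The function name `branch_probabilities` is hypothetical.

```python
# (threshold, label) pairs in the order they appear in the when(...) chain.
BRANCHES = [(0.9, "2:00"), (0.7, "1:30"), (0.5, "1:00"), (0.3, "0:30"), (0.1, "0:15")]


def branch_probabilities(branches, default="0:00"):
    """Probability of each label when every branch uses its own independent uniform draw."""
    probs = {}
    p_reach = 1.0  # probability that all earlier branches failed
    for threshold, label in branches:
        p_hit = 1.0 - threshold       # P(rand() > threshold)
        probs[label] = p_reach * p_hit
        p_reach *= threshold          # this branch failed, move on to the next one
    probs[default] = p_reach          # nothing matched: otherwise(...)
    return probs


probs = branch_probabilities(BRANCHES)
for label, p in probs.items():
    print(f"{label}: {p:.5f}")
```

Under this assumption the most common label is '1:00', and the '0:00' default is reached in fewer than 1% of rows.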


In [None]:
# Convert the JSON data above to Parquet and land it in stage2np
#df.write.format('parquet').mode('overwrite').save('abfss://stage2np@stoeahybriddev2.dfs.core.windows.net/GraphAPI/teams_activity_user_details')

# Create table "teams_activity_user_details" in spark db "graphapi" to allow for access to the data in the data lake.
#spark.sql("create table if not exists GraphAPI.teams_activity_user_details using PARQUET location 'abfss://stage2np@stoeahybriddev2.dfs.core.windows.net/GraphAPI/teams_activity_user_details'")

## Reset everything
To reset everything and walk through the process again, uncomment the reset_all_processing() call at the bottom of the last cell below and run that cell.

Note: remember to comment the call out again afterwards to prevent accidentally resetting the example.
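Commenting and uncommenting a destructive call is easy to forget. As an alternative, the reset function could require an explicit confirmation argument. The sketch below is one possible shape, not the notebook's implementation. `FakeOEA` is a stand-in so the example runs outside Synapse, and the `confirm` and `remove_files` parameters are hypothetical. Only `rm_if_exists`, `drop_db`, and `stage2np` appear in the cell below.

```python
class FakeOEA:
    """Stand-in for the OEA helper object; it only records the calls it receives."""

    stage2np = "abfss://stage2np@<account>.dfs.core.windows.net"  # placeholder URL

    def __init__(self):
        self.calls = []

    def rm_if_exists(self, path):
        self.calls.append(("rm_if_exists", path))

    def drop_db(self, name):
        self.calls.append(("drop_db", name))


def reset_all_processing(oea, confirm=False, remove_files=False):
    """Drop the graphapi database only when the caller passes confirm=True."""
    if not confirm:
        print("Reset skipped: pass confirm=True to drop the graphapi database.")
        return False
    if remove_files:
        # Optional step, commented out in the original cell.
        oea.rm_if_exists(oea.stage2np + "/GraphAPI")
    oea.drop_db("graphapi")
    return True


oea = FakeOEA()
skipped = reset_all_processing(oea)              # does nothing
done = reset_all_processing(oea, confirm=True)   # drops the database
print(oea.calls)
```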

In [1]:
%run /OEA_py_updated


In [2]:
oea = OEA()


2021-09-27 14:18:15,619 - OEA - DEBUG - OEA initialized.
OEA initialized.

In [19]:
def reset_all_processing():
    #oea.rm_if_exists(oea.stage2np + '/GraphAPI')

    oea.drop_db('graphapi')

# Uncomment the following line and run this cell to reset everything if you want to walk through the process again, INCLUDING THE PIPELINE INTEGRATION TRIGGER.
#reset_all_processing()


Database dropped: graphapi