# Graph Module Ingestion - Pre-Processing

This notebook demonstrates the utility of the OEA_py class notebook by removing the '@odata.context' column from the meeting_attendance_report table before ingestion. Once the column is removed, the corrected table replaces the original meeting_attendance_report data, and ingestion then proceeds as normal.

The steps below describe how this notebook corrects the meeting_attendance_report table of the Microsoft Graph Reports API module:
- Set the workspace in which the meeting_attendance_report table is to be corrected.
- Read in the original JSON landed in ```stage1/Transactional/graph_api/v1.0/meeting_attendance_report``` and remove the @odata.context column. Write the corrected JSON to a new rundate folder, then remove any older rundate folders (using the function described below).
- One function is defined and used:
   1. **clean_data_lake_latest**: removes any additional folders in the data lake for a location, keeping only the latest rundate folder.
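
As a rough illustration of the column removal described above, the following plain-Python sketch drops the `@odata.context` key from records shaped like the report. The sample record (including all of its values) and the helper name `drop_odata_context` are hypothetical and exist only for this example; the notebook itself performs the step with Spark.

```python
import json

# Hypothetical sample shaped like a meeting_attendance_report record.
# All values are invented for illustration only.
sample = json.loads('''
[
  {
    "@odata.context": "placeholder-context-value",
    "id": "report-1",
    "totalParticipantCount": 2,
    "meetingStartDateTime": "2022-01-01T10:00:00Z",
    "meetingEndDateTime": "2022-01-01T11:00:00Z",
    "attendanceRecords": []
  }
]
''')


def drop_odata_context(records):
    """Return copies of the records without the '@odata.context' key."""
    return [
        {key: value for key, value in record.items() if key != '@odata.context'}
        for record in records
    ]


cleaned = drop_odata_context(sample)
print(sorted(cleaned[0].keys()))
```

The helper returns new dictionaries rather than mutating its input, so the original records stay available for comparison.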

In [9]:
workspace = 'dev'

In [10]:
%run OEA_py

In [11]:
# 1) set the workspace (this determines where in the data lake you'll be writing to and reading from).
# You can work in 'dev', 'prod', or a sandbox with any name you choose.
# For example, Sam the developer can create a 'sam' workspace and expect to find his datasets in the data lake under oea/sandboxes/sam
oea.set_workspace(workspace)

In [29]:
# 2) read in the original meeting_attendance_report table, remove the '@odata.context' column and confirm it has been removed.
df = spark.read.format('json').load(oea.to_url('stage1/Transactional/graph_api/v1.0/meeting_attendance_report'), multiline='true')
df_corrected = df.select('id', 'totalParticipantCount', 'meetingStartDateTime', 'meetingEndDateTime', 'attendanceRecords')
display(df_corrected.limit(10))

In [32]:
print('Number of rows/meeting reports:')
print(df_corrected.count())
df_corrected.printSchema()

In [34]:
# 2.5) set the current date and time (using the correct format), and write out to the same relative location, with a new rundate partition folder.
import datetime
currentDate = datetime.datetime.now()
currentDateTime = currentDate.strftime("%Y-%m-%d %H-%M-%S")
table_path = 'stage1/Transactional/graph_api/v1.0/meeting_attendance_report/additive_batch_data/rundate=' + currentDateTime
df_corrected.write.save(oea.to_url(table_path), format='json', mode='overwrite', overwriteSchema='true')
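
The path construction in the cell above can be isolated into a small helper to make the folder naming easier to inspect. This is only a sketch: `rundate_path` is a hypothetical name, not part of OEA_py, and it simply reuses the `strftime` pattern from the cell above with a fixed example timestamp.

```python
import datetime


def rundate_path(base_path, when):
    """Build '<base_path>/rundate=<YYYY-MM-DD HH-MM-SS>' for the given datetime."""
    return base_path + '/rundate=' + when.strftime("%Y-%m-%d %H-%M-%S")


# Fixed timestamp chosen purely for illustration.
example = rundate_path(
    'stage1/Transactional/graph_api/v1.0/meeting_attendance_report/additive_batch_data',
    datetime.datetime(2022, 1, 28, 13, 5, 9),
)
print(example)
```

Note that the resulting folder name contains a space between the date and the time. Because every field is zero-padded, names built this way sort lexicographically in chronological order.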

In [36]:
# 3) keep only the latest rundate folder, removing the older data (which had the '@odata.context' column).
def clean_data_lake_latest(source_path):
    latest_folder = oea.get_latest_folder(source_path)
    items = mssparkutils.fs.ls(oea.to_url(source_path))
    for item in items:
        if item.name != latest_folder:
            logger.info('file removal path: ' + item.path + ' with item: ' + item.name)
            oea.rm_if_exists(source_path + '/' + item.name)
            logger.info('Successfully removed folder: ' + item.name + ' from path: ' + item.path)
        else:
            logger.info('Kept folder: ' + item.name + ' from path: ' + item.path)
    logger.info('Finished cleaning data lake to house only the latest folder')

In [37]:
clean_data_lake_latest('stage1/Transactional/graph_api/v1.0/meeting_attendance_report/additive_batch_data')
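
To try the keep-only-the-latest logic outside of Synapse, the sketch below applies the same idea to a temporary local directory. It rests on assumptions: `keep_latest_folder` is a hypothetical stand-in that treats the folder whose name sorts last as the latest, whereas how `oea.get_latest_folder` selects a folder is not shown in this notebook. The rundate folder names are made up for the demonstration.

```python
import shutil
import tempfile
from pathlib import Path


def keep_latest_folder(root):
    """Delete every sub-folder of root except the one whose name sorts last.

    Returns the name of the folder that was kept, or None if there were none.
    """
    folders = sorted(
        (p for p in Path(root).iterdir() if p.is_dir()),
        key=lambda p: p.name,
    )
    if not folders:
        return None
    for folder in folders[:-1]:
        shutil.rmtree(folder)
    return folders[-1].name


# Demonstration on a throwaway directory with made-up rundate folders.
demo_root = Path(tempfile.mkdtemp())
for name in ('rundate=2022-01-01 00-00-00', 'rundate=2022-02-01 00-00-00'):
    (demo_root / name).mkdir()

kept = keep_latest_folder(demo_root)
remaining = sorted(p.name for p in demo_root.iterdir())
print(kept, remaining)

# Clean up the temporary directory.
shutil.rmtree(demo_root)
```

Unlike `clean_data_lake_latest`, this sketch only considers directories, so any stray files at the top level are left in place.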

In [38]:
# 4) ad hoc work - remove the _SUCCESS file; otherwise, ingesting the table will throw an error.
table_path = 'stage1/Transactional/graph_api/v1.0/meeting_attendance_report/additive_batch_data/rundate=' + currentDateTime
oea.rm_if_exists(table_path + '/_SUCCESS', False)