# Test for processing Microsoft Education Insights data

This notebook demonstrates possible data processing and exploration of the Microsoft Education Insights data, using the OEA_py class notebook. 

Most of the data processing in this notebook is also performed by the Insights module's main pipeline. This notebook offers an alternate approach to the same processing, and also supports exploration and visualization of the module data.

The notebook follows these steps:
1. Set the workspace,
2. Land Insights Module (K-12 or Higher Ed.) Test Data,
3. Ingest the Insights Module Test Data,
4. Insights Schema Correction,
5. Refine the Insights Module Test Data, 
6. Demonstrate Lake Database Queries/Final Remarks, and
7. Appendix

In [1]:
%run OEA_py

StatementMeta(, 8, -1, Finished, Available)

2022-12-27 22:29:21,428 - OEA - INFO - Now using workspace: dev
2022-12-27 22:29:21,430 - OEA - INFO - OEA initialized.


In [2]:
# 1) set the workspace (this determines where in the data lake you'll be writing to and reading from).
# You can work in 'dev', 'prod', or a sandbox with any name you choose.
# For example, Sam the developer can create a 'sam' workspace and expect to find his datasets in the data lake under oea/sandboxes/sam
oea.set_workspace('sam')

StatementMeta(spark3p2med, 8, 2, Finished, Available)

2022-12-27 22:29:23,059 - OEA - INFO - Now using workspace: sam


## 2.) Land Insights Module (K-12 or Higher Ed.) Test Data
Below are 2 code blocks - the first lands test data formatted as K-12 data in your data lake, whereas the second lands Higher Ed. test data in your lake. Choose and run *one* of these blocks.

**To-Do's:**
 - Confirm that files are being landed "correctly" in their proper folder partitions 
    * Some liberties were taken with the test data directory because of framework updates. Note that the test data directory will look significantly different from the production data directory.
 - Confirm the **activity** table is considered additive batch data.

In [None]:
# 2.1) Land batch data files into stage1 of the data lake.
# In this example we pull the Insights K-12 test CSV files from GitHub and land them in oea/sandboxes/sam/stage1/Transactional/M365/v1.14
# NOTE: You can choose to ingest Insights Higher Ed. test data instead, by running the code block below rather than this one.
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/activity/2022-01-28/ApplicationUsage.csv').text
oea.land(data, 'M365/v1.14/activity', 'activity_k12_test_data.csv', oea.ADDITIVE_BATCH_DATA)

data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/AadGroup/aadgroup.csv').text
oea.land(data, 'M365/v1.14/AadGroup', 'aadgroup_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/AadGroupMembership/aadgroupmembership.csv').text
oea.land(data, 'M365/v1.14/AadGroupMembership', 'aadgroupmembership_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/AadUser/aaduser.csv').text
oea.land(data, 'M365/v1.14/AadUser', 'aaduser_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/AadUserPersonMapping/aaduserpersonmapping.csv').text
oea.land(data, 'M365/v1.14/AadUserPersonMapping', 'aaduserpersonmapping_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/Course/course.csv').text
oea.land(data, 'M365/v1.14/Course', 'course_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/CourseGradeLevel/coursegradelevel.csv').text
oea.land(data, 'M365/v1.14/CourseGradeLevel', 'coursegradelevel_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/CourseSubject/coursesubject.csv').text
oea.land(data, 'M365/v1.14/CourseSubject', 'coursesubject_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/Enrollment/enrollment.csv').text
oea.land(data, 'M365/v1.14/Enrollment', 'enrollment_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/Organization/organization.csv').text
oea.land(data, 'M365/v1.14/Organization', 'organization_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/Person/person.csv').text
oea.land(data, 'M365/v1.14/Person', 'person_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/PersonDemographic/persondemographic.csv').text
oea.land(data, 'M365/v1.14/PersonDemographic', 'persondemographic_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/PersonDemographicEthnicity/persondemographicethnicity.csv').text
oea.land(data, 'M365/v1.14/PersonDemographicEthnicity', 'persondemographicethnicity_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/PersonDemographicPersonFlag/persondemographicpersonflag.csv').text
oea.land(data, 'M365/v1.14/PersonDemographicPersonFlag', 'persondemographicpersonflag_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/PersonDemographicRace/persondemographicrace.csv').text
oea.land(data, 'M365/v1.14/PersonDemographicRace', 'persondemographicrace_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/PersonEmailAddress/personemailaddress.csv').text
oea.land(data, 'M365/v1.14/PersonEmailAddress', 'personemailaddress_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/PersonIdentifier/personidentifier.csv').text
oea.land(data, 'M365/v1.14/PersonIdentifier', 'personidentifier_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/PersonOrganizationRole/personorganizationrole.csv').text
oea.land(data, 'M365/v1.14/PersonOrganizationRole', 'personorganizationrole_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/PersonPhoneNumber/personphonenumber.csv').text
oea.land(data, 'M365/v1.14/PersonPhoneNumber', 'personphonenumber_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/PersonRelationship/personrelationship.csv').text
oea.land(data, 'M365/v1.14/PersonRelationship', 'personrelationship_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/RefDefinition/refdefinition.csv').text
oea.land(data, 'M365/v1.14/RefDefinition', 'refdefinition_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/RefTranslation/reftranslation.csv').text
oea.land(data, 'M365/v1.14/RefTranslation', 'reftranslation_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/Section/section.csv').text
oea.land(data, 'M365/v1.14/Section', 'section_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/SectionGradeLevel/sectiongradelevel.csv').text
oea.land(data, 'M365/v1.14/SectionGradeLevel', 'sectiongradelevel_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/SectionSession/sectionsession.csv').text
oea.land(data, 'M365/v1.14/SectionSession', 'sectionsession_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/SectionSubject/sectionsubject.csv').text
oea.land(data, 'M365/v1.14/SectionSubject', 'sectionsubject_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/Session/session.csv').text
oea.land(data, 'M365/v1.14/Session', 'session_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/SourceSystem/sourcesystem.csv').text
oea.land(data, 'M365/v1.14/SourceSystem', 'sourcesystem_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)

In [3]:
# 2.2) Land batch data files into stage1 of the data lake.
# In this example we pull the Insights Higher Ed. test CSV files from GitHub and land them in oea/sandboxes/sam/stage1/Transactional/M365/v1.14
# NOTE: You can choose to ingest Insights K-12 test data instead, by running the code block above rather than this one.
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/activity/2022-01-28/ApplicationUsage.csv').text
oea.land(data, 'M365/v1.14/activity', 'activity_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)

data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/AadGroup/aadgroup.csv').text
oea.land(data, 'M365/v1.14/AadGroup', 'aadgroup_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/AadGroupMembership/aadgroupmembership.csv').text
oea.land(data, 'M365/v1.14/AadGroupMembership', 'aadgroupmembership_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/AadUser/aaduser.csv').text
oea.land(data, 'M365/v1.14/AadUser', 'aaduser_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/AadUserPersonMapping/aaduserpersonmapping.csv').text
oea.land(data, 'M365/v1.14/AadUserPersonMapping', 'aaduserpersonmapping_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/Course/course.csv').text
oea.land(data, 'M365/v1.14/Course', 'course_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/CourseGradeLevel/coursegradelevel.csv').text
oea.land(data, 'M365/v1.14/CourseGradeLevel', 'coursegradelevel_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/CourseSubject/coursesubject.csv').text
oea.land(data, 'M365/v1.14/CourseSubject', 'coursesubject_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/Enrollment/enrollment.csv').text
oea.land(data, 'M365/v1.14/Enrollment', 'enrollment_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/Organization/organization.csv').text
oea.land(data, 'M365/v1.14/Organization', 'organization_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/Person/person.csv').text
oea.land(data, 'M365/v1.14/Person', 'person_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/PersonDemographic/persondemographic.csv').text
oea.land(data, 'M365/v1.14/PersonDemographic', 'persondemographic_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/PersonDemographicEthnicity/persondemographicethnicity.csv').text
oea.land(data, 'M365/v1.14/PersonDemographicEthnicity', 'persondemographicethnicity_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/PersonDemographicPersonFlag/persondemographicpersonflag.csv').text
oea.land(data, 'M365/v1.14/PersonDemographicPersonFlag', 'persondemographicpersonflag_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/PersonDemographicRace/persondemographicrace.csv').text
oea.land(data, 'M365/v1.14/PersonDemographicRace', 'persondemographicrace_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/PersonEmailAddress/personemailaddress.csv').text
oea.land(data, 'M365/v1.14/PersonEmailAddress', 'personemailaddress_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/PersonIdentifier/personidentifier.csv').text
oea.land(data, 'M365/v1.14/PersonIdentifier', 'personidentifier_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/PersonOrganizationRole/personorganizationrole.csv').text
oea.land(data, 'M365/v1.14/PersonOrganizationRole', 'personorganizationrole_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/PersonPhoneNumber/personphonenumber.csv').text
oea.land(data, 'M365/v1.14/PersonPhoneNumber', 'personphonenumber_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/PersonRelationship/personrelationship.csv').text
oea.land(data, 'M365/v1.14/PersonRelationship', 'personrelationship_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/RefDefinition/refdefinition.csv').text
oea.land(data, 'M365/v1.14/RefDefinition', 'refdefinition_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/RefTranslation/reftranslation.csv').text
oea.land(data, 'M365/v1.14/RefTranslation', 'reftranslation_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/Section/section.csv').text
oea.land(data, 'M365/v1.14/Section', 'section_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/SectionGradeLevel/sectiongradelevel.csv').text
oea.land(data, 'M365/v1.14/SectionGradeLevel', 'sectiongradelevel_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/SectionSession/sectionsession.csv').text
oea.land(data, 'M365/v1.14/SectionSession', 'sectionsession_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/SectionSubject/sectionsubject.csv').text
oea.land(data, 'M365/v1.14/SectionSubject', 'sectionsubject_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/Session/session.csv').text
oea.land(data, 'M365/v1.14/Session', 'session_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/hed_test_data/roster/2022-01-28T06-16-22/SourceSystem/sourcesystem.csv').text
oea.land(data, 'M365/v1.14/SourceSystem', 'sourcesystem_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)

StatementMeta(spark3p2med, 8, 3, Finished, Available)

'stage1/Transactional/M365/v1.14/SourceSystem/snapshot_batch_data/rundate=2022-12-27 22:29:43/sourcesystem_hed_test_data.csv'
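
The two landing cells above repeat the same `requests.get` / `oea.land` pair for every table. As a minimal sketch of how that list of calls could be generated instead, the helper below builds the URL, target folder, and file name for each table. `build_landing_jobs` and the abbreviated `ROSTER_TABLES` list are hypothetical names introduced here for illustration and are not part of OEA_py; the URL layout and file naming are copied from the cells above.

```python
BASE_URL = ('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/'
            'modules/module_catalog/Microsoft_Education_Insights/test_data')

# Abbreviated for illustration; the cells above land the full set of roster tables.
ROSTER_TABLES = ['AadGroup', 'AadUser', 'Course']

def build_landing_jobs(dataset, roster_tables=ROSTER_TABLES):
    """Return (url, target_folder, filename, batch_type_name) tuples for one test dataset.

    dataset is 'k12' or 'hed', matching the k12_test_data and hed_test_data folders.
    """
    root = f'{BASE_URL}/{dataset}_test_data'
    # The activity table is the only additive one; everything else is a snapshot.
    jobs = [(f'{root}/activity/2022-01-28/ApplicationUsage.csv',
             'M365/v1.14/activity',
             f'activity_{dataset}_test_data.csv',
             'ADDITIVE_BATCH_DATA')]
    for table in roster_tables:
        name = table.lower()
        jobs.append((f'{root}/roster/2022-01-28T06-16-22/{table}/{name}.csv',
                     f'M365/v1.14/{table}',
                     f'{name}_{dataset}_test_data.csv',
                     'SNAPSHOT_BATCH_DATA'))
    return jobs

# In the notebook, each job could then be landed with the same calls used above:
# for url, folder, filename, batch_type in build_landing_jobs('hed'):
#     oea.land(requests.get(url).text, folder, filename, getattr(oea, batch_type))
```

Keeping the table list in one place would make it harder for the K-12 and Higher Ed. cells to drift apart, at the cost of hiding the individual URLs that the explicit version shows.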

## 3.) Ingest the Insights Module Test Data

This step ingests the Insights module test data from stage1 to stage2/Ingested. It ingests whichever dataset (K-12 or Higher Ed.) you landed in the previous step.

Both test datasets are formatted exactly like the Insights data, so they initially have no column names and no correct dtypes. Ingest the data using the ```oea.ingest()``` function as normal; the next step will correct the table schemas.

**To-Do's:**
 - Find solution to ingesting the AadGroupMembership table.
    * This table does not have a unique primary key per row, so ```oea.ingest()``` would drop most of the rows, since every column has intentional duplicates.
 - Add test data for the PersonRelationship and RefTranslation tables.
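
One possible direction for the AadGroupMembership to-do, offered as an untested assumption rather than a confirmed fix, is to derive a surrogate key from all of the column values in a row, so that rows with distinct value combinations get distinct keys even though each column repeats. The sketch below shows the idea in plain Python; `surrogate_key` and the sample rows are hypothetical. In Spark the same idea could be expressed with `sha2(concat_ws(...), 256)` from `pyspark.sql.functions`, but whether such a derived column works with ```oea.ingest()``` has not been verified here.

```python
import hashlib

def surrogate_key(row, sep='|'):
    """Hash every value in the row into a single key.

    Rows that differ in any column get different keys, barring hash collisions
    and assuming the values themselves do not contain the separator.
    """
    joined = sep.join('' if value is None else str(value) for value in row)
    return hashlib.sha256(joined.encode('utf-8')).hexdigest()

# Hypothetical membership rows: each column repeats values, but each combination is distinct.
rows = [
    ('group-1', 'user-1'),
    ('group-1', 'user-2'),
    ('group-2', 'user-1'),
]

# Prepend the derived key to each row.
keyed_rows = [(surrogate_key(row),) + row for row in rows]
```

Rows that are exact duplicates in every column would still collapse to one key, which may or may not be the desired behavior for this table.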

In [4]:
# 3) The next step is to ingest the batch data into stage2
# Note that when you run this the first time, you'll see an info message like "Number of new inbound rows processed: 2".
# If you run this a second time, the number of inbound rows processed will be 0 because the ingestion uses spark structured streaming to keep track of what data has already been processed.
options = {'header':False}
oea.ingest(f'M365/v1.14/activity', '_c3', options)
oea.ingest(f'M365/v1.14/AadGroup', '_c0', options)
#oea.ingest(f'M365/v1.14/AadGroupMembership', options) # <- no solution (at the moment) for ingesting this table, since there isn't a unique-primary key
oea.ingest(f'M365/v1.14/AadUser', '_c0', options)
oea.ingest(f'M365/v1.14/AadUserPersonMapping', '_c0', options)
oea.ingest(f'M365/v1.14/Course', '_c0', options)
oea.ingest(f'M365/v1.14/CourseGradeLevel', '_c0', options)
oea.ingest(f'M365/v1.14/CourseSubject', '_c0', options)
oea.ingest(f'M365/v1.14/Enrollment', '_c0', options)
oea.ingest(f'M365/v1.14/Organization', '_c0', options)
oea.ingest(f'M365/v1.14/Person', '_c0', options)
oea.ingest(f'M365/v1.14/PersonDemographic', '_c0', options)
oea.ingest(f'M365/v1.14/PersonDemographicEthnicity', '_c0', options)
oea.ingest(f'M365/v1.14/PersonDemographicPersonFlag', '_c0', options)
oea.ingest(f'M365/v1.14/PersonDemographicRace', '_c0', options)
oea.ingest(f'M365/v1.14/PersonEmailAddress', '_c0', options)
oea.ingest(f'M365/v1.14/PersonIdentifier', '_c0', options)
oea.ingest(f'M365/v1.14/PersonOrganizationRole', '_c0', options)
oea.ingest(f'M365/v1.14/PersonPhoneNumber', '_c0', options)
#oea.ingest(f'M365/v1.14/PersonRelationship', '_c0', options) # <- no test data currently
oea.ingest(f'M365/v1.14/RefDefinition', '_c0', options)
#oea.ingest(f'M365/v1.14/RefTranslation', '_c0', options) # <- no test data currently
oea.ingest(f'M365/v1.14/Section', '_c0', options)
oea.ingest(f'M365/v1.14/SectionGradeLevel', '_c0', options)
oea.ingest(f'M365/v1.14/SectionSession', '_c0', options)
oea.ingest(f'M365/v1.14/SectionSubject', '_c0', options)
oea.ingest(f'M365/v1.14/Session', '_c0', options)
oea.ingest(f'M365/v1.14/SourceSystem', '_c0', options)

StatementMeta(spark3p2med, 8, 4, Finished, Available)

2022-12-27 22:29:45,032 - OEA - INFO - Ingesting from: stage1/Transactional/M365/v1.14/activity, batch type of: additive, source data format of: csv
2022-12-27 22:30:02,067 - py4j.java_gateway - INFO - Callback Server Starting
2022-12-27 22:30:02,068 - py4j.java_gateway - INFO - Socket listening on ('127.0.0.1', 41219)
2022-12-27 22:30:04,213 - py4j.java_gateway - INFO - Callback Connection ready to receive messages
2022-12-27 22:30:04,238 - py4j.java_gateway - INFO - Received command c on object id p0
2022-12-27 22:30:40,506 - OEA - INFO - Number of new inbound rows processed: 34379
2022-12-27 22:30:57,185 - OEA - INFO - Ingesting from: stage1/Transactional/M365/v1.14/AadGroup, batch type of: snapshot, source data format of: csv
2022-12-27 22:31:00,836 - py4j.java_gateway - INFO - Received command c on object id p1
2022-12-27 22:31:05,492 - OEA - INFO - Number of new inbound rows processed: 87
2022-12-27 22:31:06,892 - OEA - INFO - Ingesting from: stage1/Transactional/M365/v1.14/AadUs

1

In [5]:
# 3.5) Now you can run queries against the auto-generated "lake database" with the ingested Insights data.
df = spark.sql("select * from ldb_sam_s2i_m365_v1p14.activity")
display(df.limit(10))

StatementMeta(spark3p2med, 8, 5, Finished, Available)

SynapseWidget(Synapse.DataFrame, 7fc86b72-01f0-4dfc-a794-afd160450586)

## 4.) Insights Schema Corrections

This step uses the same four functions from the "Insights_schema_correction" notebook, where the metadata.csv is used to correct each table's schema. Each table's schema is updated with the corrected column names and dtypes.

After the schema is corrected, each table is overwritten in stage2/Ingested.
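
As a rough illustration of the renaming logic, the sketch below pairs each positional column with its target name and Spark type label, without needing a Spark session. `dtype_label`, `plan_schema_correction`, and the sample metadata are hypothetical and only mirror the shape that the functions in the next cell index into (column name first, dtype second). Unlike the notebook code, this sketch filters out `rundate` before pairing rather than tracking a counter.

```python
def dtype_label(dtype_name):
    """Mirror of the notebook's _dtype_config for a single value, e.g. 'integer' -> 'IntegerType()'."""
    return dtype_name.capitalize() + 'Type()'

def plan_schema_correction(columns, table_metadata):
    """Pair each positional column (e.g. '_c0') with its target name and Spark type label.

    The 'rundate' column is skipped, since it is added by the framework and has no metadata row.
    """
    data_columns = [c for c in columns if c != 'rundate']
    return [(old, row[0], dtype_label(row[1]))
            for old, row in zip(data_columns, table_metadata)]

# Hypothetical metadata rows: [column_name, dtype].
sample_metadata = [['Id', 'string'], ['StartTime', 'timestamp'], ['Grade', 'double']]
plan = plan_schema_correction(['_c0', '_c1', '_c2', 'rundate'], sample_metadata)
```

Each tuple in `plan` then corresponds to one `withColumnRenamed` call, plus one `cast` whenever the label is not `StringType()`.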

In [6]:
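# 4) Schema correction: the Insights test data is initially landed without column headers or correct dtypes.
# The Spark types are imported explicitly so this cell does not depend on what %run OEA_py brings into scope.
from pyspark.sql.types import IntegerType, TimestampType, ShortType, LongType, DoubleType, DateType, BooleanType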

def _extract_element(lst, element_num=0):
    return [item[element_num] for item in lst]

def _dtype_config(dtype_lst):
    return [item.capitalize() + 'Type()' for item in dtype_lst]

def correct_insights_table_schema(df, table_name):
    list_of_column_names = _extract_element(metadata[table_name])
    list_of_column_dtypes = _extract_element(metadata[table_name], 1)
    list_of_column_dtypes = _dtype_config(list_of_column_dtypes)

    n = 0
    df_updatedColumns = df
    for c in df.columns:
        if c != 'rundate':
            new_col_name = list_of_column_names[n]
            df_updatedColumns = df_updatedColumns.withColumnRenamed(c, new_col_name)
            if list_of_column_dtypes[n] != 'StringType()':
                if list_of_column_dtypes[n] == 'IntegerType()':
                    df_updatedColumns = df_updatedColumns.withColumn(new_col_name, df_updatedColumns[new_col_name].cast(IntegerType()))
                elif list_of_column_dtypes[n] == 'TimestampType()':
                    df_updatedColumns = df_updatedColumns.withColumn(new_col_name, df_updatedColumns[new_col_name].cast(TimestampType()))
                elif list_of_column_dtypes[n] == 'ShortType()':  # compare this column's dtype, not the whole list
                    df_updatedColumns = df_updatedColumns.withColumn(new_col_name, df_updatedColumns[new_col_name].cast(ShortType()))
                elif list_of_column_dtypes[n] == 'LongType()':
                    df_updatedColumns = df_updatedColumns.withColumn(new_col_name, df_updatedColumns[new_col_name].cast(LongType()))
                elif list_of_column_dtypes[n] == 'DoubleType()':
                    df_updatedColumns = df_updatedColumns.withColumn(new_col_name, df_updatedColumns[new_col_name].cast(DoubleType()))
                elif list_of_column_dtypes[n] == 'DateType()':
                    df_updatedColumns = df_updatedColumns.withColumn(new_col_name, df_updatedColumns[new_col_name].cast(DateType()))
                elif list_of_column_dtypes[n] == 'BooleanType()':
                    df_updatedColumns = df_updatedColumns.withColumn(new_col_name, df_updatedColumns[new_col_name].cast(BooleanType()))
        else:
            df_updatedColumns = df_updatedColumns
        n = n + 1
    return df_updatedColumns

def correct_insights_dataset(tables_source):
    items = oea.get_folders(tables_source)
    for item in items: 
        if item == 'metadata.csv':
            logger.info('ignore metadata processing, since this is not a table to be ingested')
        else:
            table_path = tables_source +'/'+ item
            df = spark.read.format('delta').load(oea.to_url(table_path), header='false')
            df_correctedSchema = correct_insights_table_schema(df, table_name=item)
            df_correctedSchema.write.save(oea.to_url(table_path), format='delta', mode='overwrite', header='true', overwriteSchema='true')
            logger.info('Successfully corrected the schema for table: ' + item + ' from: ' + table_path)

StatementMeta(spark3p2med, 8, 6, Finished, Available)

In [7]:
#metadata = oea.get_metadata_from_url('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/metadata.csv')
metadata = oea.get_metadata_from_url('https://raw.githubusercontent.com/cstohlmann/oea-insights-module/main/test_data/metadata.csv')
correct_insights_dataset('stage2/Ingested/M365/v1.14')

StatementMeta(spark3p2med, 8, 7, Finished, Available)

2022-12-27 22:34:14,360 - OEA - INFO - Successfully corrected the schema for table: AadGroup from: stage2/Ingested/M365/v1.14/AadGroup
2022-12-27 22:34:17,013 - OEA - INFO - Successfully corrected the schema for table: AadUser from: stage2/Ingested/M365/v1.14/AadUser
2022-12-27 22:34:20,231 - OEA - INFO - Successfully corrected the schema for table: AadUserPersonMapping from: stage2/Ingested/M365/v1.14/AadUserPersonMapping
2022-12-27 22:34:23,630 - OEA - INFO - Successfully corrected the schema for table: Course from: stage2/Ingested/M365/v1.14/Course
2022-12-27 22:34:26,932 - OEA - INFO - Successfully corrected the schema for table: CourseGradeLevel from: stage2/Ingested/M365/v1.14/CourseGradeLevel
2022-12-27 22:34:30,568 - OEA - INFO - Successfully corrected the schema for table: CourseSubject from: stage2/Ingested/M365/v1.14/CourseSubject
2022-12-27 22:34:34,584 - OEA - INFO - Successfully corrected the schema for table: Enrollment from: stage2/Ingested/M365/v1.14/Enrollment
2022-12

In [8]:
df = spark.read.format('delta').load(oea.to_url('stage2/Ingested/M365/v1.14/activity'), header='true')
display(df.limit(10))

StatementMeta(spark3p2med, 8, 8, Finished, Available)

SynapseWidget(Synapse.DataFrame, 7d0ac717-5219-4e4f-b059-bfde364e2057)

In [9]:
df.printSchema()

StatementMeta(spark3p2med, 8, 9, Finished, Available)

root
 |-- SignalType: string (nullable = true)
 |-- StartTime: timestamp (nullable = true)
 |-- UserAgent: string (nullable = true)
 |-- SignalId: string (nullable = true)
 |-- SisClassId: string (nullable = true)
 |-- ClassId: string (nullable = true)
 |-- ChannelId: string (nullable = true)
 |-- AppName: string (nullable = true)
 |-- ActorId: string (nullable = true)
 |-- ActorRole: string (nullable = true)
 |-- SchemaVersion: string (nullable = true)
 |-- AssignmentId: string (nullable = true)
 |-- SubmissionId: string (nullable = true)
 |-- SubmissionCreatedTime: timestamp (nullable = true)
 |-- Action: string (nullable = true)
 |-- DueDate: timestamp (nullable = true)
 |-- ClassCreationDate: timestamp (nullable = true)
 |-- Grade: double (nullable = true)
 |-- SourceFileExtension: string (nullable = true)
 |-- MeetingDuration: string (nullable = true)
 |-- MeetingSessionId: string (nullable = true)
 |-- MeetingType: string (nullable = true)
 |-- ReadingSubmissionWordsPerMinute: in

## 5.) Refine the Insights Module Test Data
This step refines the Insights test data from stage2/Ingested to stage2/Refined, using metadata.csv. It is also where pseudonymization happens: sensitive student information is protected by either hashing or masking the sensitive columns.

Tables are written to either ```stage2/Refined/M365/v1.14/general``` or ```stage2/Refined/M365/v1.14/sensitive```: the pseudonymized tables go to ```general```, and the tables that map hashed or masked values back to the original sensitive columns go to ```sensitive```.
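
As a rough illustration of the hash/mask split described above, here is a minimal pure-Python sketch. It is not the OEA implementation: the salt, the `'*'` mask value, the `pseudonymize` helper, and the sample row are all assumptions made for the example. Only the `_pseudonym` column suffix and the idea of a separate lookup copy come from this notebook.

```python
import hashlib

# Hypothetical salt, for illustration only; a real deployment would keep it secret.
SALT = "example-salt"

def hash_value(value, salt=SALT):
    """Return a deterministic SHA-256 pseudonym for a sensitive value."""
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()

def pseudonymize(rows, operations):
    """Split rows into a 'general' copy and a 'sensitive' lookup copy.

    operations maps a column name to 'hash' or 'mask'; other columns pass through.
    """
    general, lookup = [], []
    for row in rows:
        general_row = {}
        lookup_row = dict(row)  # the lookup copy keeps every original value
        for column, value in row.items():
            operation = operations.get(column)
            if operation == "hash":
                pseudonym = hash_value(value)
                general_row[column + "_pseudonym"] = pseudonym
                lookup_row[column + "_pseudonym"] = pseudonym
            elif operation == "mask":
                general_row[column] = "*"  # assumed mask value
            else:
                general_row[column] = value
        general.append(general_row)
        lookup.append(lookup_row)
    return general, lookup

rows = [{"ObjectId": "group-1", "DisplayName": "Algebra I", "MailNickname": "alg1"}]
operations = {"ObjectId": "hash", "DisplayName": "mask"}
general, lookup = pseudonymize(rows, operations)
print(general[0])
print(lookup[0])
```

The general copy never contains the original `ObjectId`, while the lookup copy carries both the original value and its pseudonym, which is what makes a later join back to the readable name possible.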

This step has some known minor bugs: some of the lookup tables are not created when they are needed. This will be addressed in a future update.

**To-Do's:**
 - Find a workaround for creating lookup tables when the primary key is left un-hashed after pseudonymization 
    * (*affected tables*: PersonDemographicEthnicity, PersonDemographicPersonFlag, PersonDemographicRace, PersonEmailAddress, PersonIdentifier, PersonOrganizationRole, and PersonPhoneNumber). 
 - Resolve ingesting and refining the AadGroupMembership table.
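
One possible direction for a table that lacks a single unique key column (the commented-out `AadGroupMembership` call later in this step names this as the blocker) is to derive a surrogate key from several columns. The sketch below is only an idea, not a tested fix for OEA: the `add_composite_key` helper, the `CompositeId` name, and the `GroupId`/`UserId` column names are hypothetical and do not reflect the real table schema.

```python
import hashlib

def add_composite_key(rows, key_columns, key_name="CompositeId"):
    """Return copies of rows with a surrogate key derived from several columns."""
    keyed = []
    for row in rows:
        # Join the parts with a separator that should not occur in the values,
        # so ('ab', 'c') and ('a', 'bc') do not produce the same key.
        raw = "\x1f".join(str(row[column]) for column in key_columns)
        digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        keyed.append({**row, key_name: digest})
    return keyed

# Hypothetical membership rows: neither column is unique on its own.
memberships = [
    {"GroupId": "g1", "UserId": "u1"},
    {"GroupId": "g1", "UserId": "u2"},
    {"GroupId": "g2", "UserId": "u1"},
]
keyed = add_composite_key(memberships, ["GroupId", "UserId"])
keys = [row["CompositeId"] for row in keyed]
print(len(set(keys)) == len(keys))
```

Whether such a derived column could be passed to `oea.refine` as the primary key would still need to be verified against the framework.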

In [13]:
# 5) After ingesting data, the next step is to refine the data through the use of metadata (this is where the pseudonymization of the data occurs).
from pyspark.sql.utils import AnalysisException

# Primary key used when refining each table; any table not listed here uses 'Id'.
primary_keys = {
    'activity': 'SignalId',
    'AadGroup': 'ObjectId_pseudonym',
    'AadUser': 'ObjectId_pseudonym',
    'AadUserPersonMapping': 'ObjectId_pseudonym',
    'Person': 'Id_pseudonym',
    'PersonDemographic': 'PersonId_pseudonym',
}

def refine_insights_dataset(tables_source):
    items = oea.get_folders(tables_source)
    for item in items:
        table_path = tables_source + '/' + item
        if item == 'metadata.csv':
            logger.info('ignore metadata processing, since this is not a table to be ingested')
        else:
            try:
                oea.refine('M365/v1.14/' + item, metadata[item], primary_keys.get(item, 'Id'))
            except AnalysisException:
                # The table may not have been refined properly, because its primary key does not align with the columns expected in the lookup table.
                pass

            # Note: this message is logged even when the exception above was swallowed.
            logger.info('Refined table: ' + item + ' from: ' + table_path)

StatementMeta(spark3p2med, 8, 13, Finished, Available)
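
Because the function above logs "Refined table" even when an `AnalysisException` was swallowed, it can be hard to tell from the log which tables actually failed. A minimal sketch of one way to keep track is shown below. It is self-contained and uses a stand-in `fake_refine` function instead of `oea.refine`; the `refine_all` helper, the `BrokenTable` name, and the simulated error are hypothetical.

```python
def refine_all(tables, refine_fn, primary_keys, default_key="Id", expected_errors=(Exception,)):
    """Call refine_fn(table, key) for each table and report which calls failed."""
    refined, failed = [], {}
    for table in tables:
        key = primary_keys.get(table, default_key)
        try:
            refine_fn(table, key)
        except expected_errors as error:
            # Record the failure instead of silently passing.
            failed[table] = str(error)
        else:
            refined.append(table)
    return refined, failed

def fake_refine(table, key):
    """Stand-in for the real refine call; fails for one arbitrary table."""
    if table == "BrokenTable":
        raise ValueError("primary key column '" + key + "' not found")

refined, failed = refine_all(
    ["activity", "BrokenTable", "Course"],
    fake_refine,
    {"activity": "SignalId"},
)
print("refined:", refined)
print("failed:", failed)
```

In the notebook itself, `expected_errors` would be narrowed to `AnalysisException`, and the two results could be logged once at the end of the run.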

In [14]:
#metadata = oea.get_metadata_from_url('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/metadata.csv')
metadata = oea.get_metadata_from_url('https://raw.githubusercontent.com/cstohlmann/oea-ms_insights-module/main/test_data/metadata.csv')
refine_insights_dataset('stage2/Ingested/M365/v1.14')

StatementMeta(spark3p2med, 8, 14, Finished, Available)

2022-12-27 22:56:28,017 - OEA - INFO - Processed 87 updated rows from stage2/Ingested/M365/v1.14/AadGroup into stage2/Refined
2022-12-27 22:56:28,419 - OEA - INFO - Refined table: AadGroup from: stage2/Ingested/M365/v1.14/AadGroup
2022-12-27 22:56:38,839 - OEA - INFO - Processed 640 updated rows from stage2/Ingested/M365/v1.14/AadUser into stage2/Refined
2022-12-27 22:56:39,184 - OEA - INFO - Refined table: AadUser from: stage2/Ingested/M365/v1.14/AadUser
2022-12-27 22:56:42,772 - OEA - INFO - Refined table: AadUserPersonMapping from: stage2/Ingested/M365/v1.14/AadUserPersonMapping
2022-12-27 22:56:48,513 - OEA - INFO - Refined table: Course from: stage2/Ingested/M365/v1.14/Course
2022-12-27 22:56:51,688 - OEA - INFO - Refined table: CourseGradeLevel from: stage2/Ingested/M365/v1.14/CourseGradeLevel
2022-12-27 22:56:54,476 - OEA - INFO - Refined table: CourseSubject from: stage2/Ingested/M365/v1.14/CourseSubject
2022-12-27 22:56:58,322 - OEA - INFO - Refined table: Enrollment from: sta

In [11]:
# This block shows, table by table, what the cells above (in this step) accomplish
#metadata = oea.get_metadata_from_url('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/metadata.csv')

#oea.refine('M365/v1.14/activity', metadata['activity'], 'SignalId')
#oea.refine('M365/v1.14/AadGroup', metadata['AadGroup'], 'ObjectId_pseudonym')
#oea.refine('M365/v1.14/AadGroupMembership', metadata['AadGroupMembership']) # <- no solution (at the moment) for refining this table, since there isn't a unique primary key
#oea.refine('M365/v1.14/AadUser', metadata['AadUser'], 'ObjectId_pseudonym')
#oea.refine('M365/v1.14/AadUserPersonMapping', metadata['AadUserPersonMapping'], 'ObjectId_pseudonym')
#oea.refine('M365/v1.14/Course', metadata['Course'], 'Id')
#oea.refine('M365/v1.14/CourseGradeLevel', metadata['CourseGradeLevel'], 'Id')
#oea.refine('M365/v1.14/CourseSubject', metadata['CourseSubject'], 'Id')
#oea.refine('M365/v1.14/Enrollment', metadata['Enrollment'], 'Id')
#oea.refine('M365/v1.14/Organization', metadata['Organization'], 'Id')
#oea.refine('M365/v1.14/Person', metadata['Person'], 'Id_pseudonym')
#oea.refine('M365/v1.14/PersonDemographic', metadata['PersonDemographic'], 'PersonId_pseudonym')
#oea.refine('M365/v1.14/PersonDemographicEthnicity', metadata['PersonDemographicEthnicity'], 'Id')
#oea.refine('M365/v1.14/PersonDemographicPersonFlag', metadata['PersonDemographicPersonFlag'], 'Id')
#oea.refine('M365/v1.14/PersonDemographicRace', metadata['PersonDemographicRace'], 'Id')
#oea.refine('M365/v1.14/PersonEmailAddress', metadata['PersonEmailAddress'], 'Id')
#oea.refine('M365/v1.14/PersonIdentifier', metadata['PersonIdentifier'], 'Id')
#oea.refine('M365/v1.14/PersonOrganizationRole', metadata['PersonOrganizationRole'], 'Id')
#oea.refine('M365/v1.14/PersonPhoneNumber', metadata['PersonPhoneNumber'], 'Id')
#oea.refine('M365/v1.14/PersonRelationship', metadata['PersonRelationship'], 'Id') # <- no test data currently
#oea.refine('M365/v1.14/RefDefinition', metadata['RefDefinition'], 'Id')
#oea.refine('M365/v1.14/RefTranslation', metadata['RefTranslation'], 'Id') # <- no test data currently
#oea.refine('M365/v1.14/Section', metadata['Section'], 'Id')
#oea.refine('M365/v1.14/SectionGradeLevel', metadata['SectionGradeLevel'], 'Id')
#oea.refine('M365/v1.14/SectionSession', metadata['SectionSession'], 'Id')
#oea.refine('M365/v1.14/SectionSubject', metadata['SectionSubject'], 'Id')
#oea.refine('M365/v1.14/Session', metadata['Session'], 'Id')
#oea.refine('M365/v1.14/SourceSystem', metadata['SourceSystem'], 'Id')

StatementMeta(spark3p2med, 8, 11, Finished, Available)

## 6.) Demonstrate Lake Database Queries/Final Remarks

In [17]:
oea.add_to_lake_db('stage2/Refined/M365/v1.14/general/activity')

StatementMeta(spark3p2med, 8, 17, Finished, Available)
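
The lake database names used in the following cells (`ldb_sam_s2r_m365_v1p14` and `ldb_sam_s2i_m365_v1p14`) appear to follow a pattern built from the workspace, the stage, the data source, and the version. The sketch below reproduces that pattern as inferred from the names in this notebook only; the `lake_db_name` helper and the abbreviation table are assumptions, not the OEA framework's own code.

```python
# Stage abbreviations inferred from the database names used in this notebook.
STAGE_ABBREVIATIONS = {"stage2/Ingested": "s2i", "stage2/Refined": "s2r"}

def lake_db_name(workspace, stage, source, version):
    """Build a lake database name such as 'ldb_sam_s2r_m365_v1p14'."""
    stage_code = STAGE_ABBREVIATIONS[stage]
    # Dots are not used in the names seen here, so 'v1.14' becomes 'v1p14'.
    return "_".join(["ldb", workspace, stage_code, source.lower(), version.replace(".", "p")])

print(lake_db_name("sam", "stage2/Refined", "M365", "v1.14"))   # ldb_sam_s2r_m365_v1p14
print(lake_db_name("sam", "stage2/Ingested", "M365", "v1.14"))  # ldb_sam_s2i_m365_v1p14
```

If you work in a workspace other than `sam`, the database names in the queries and in the reset cell need to change accordingly.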

In [18]:
# 6) Now you can query the refined data tables in the lake db
df = spark.sql("select * from ldb_sam_s2r_m365_v1p14.activity")
display(df)
df.printSchema()
df = spark.sql("select * from ldb_sam_s2r_m365_v1p14.AadGroup_lookup")
display(df)
df.printSchema()
# You can use the "lookup" table for joins (people with restricted access won't be able to perform this query because they won't have access to data in the "sensitive" folder in the data lake)
df = spark.sql("""
    select a.SignalType, a.StartTime, a.SignalId, a.AppName, a.ActorId_pseudonym, a.ActorRole, agl.DisplayName
    from ldb_sam_s2r_m365_v1p14.activity a
    join ldb_sam_s2r_m365_v1p14.AadGroup_lookup agl on agl.ObjectId = a.ClassId
""")
display(df)

StatementMeta(spark3p2med, 8, 18, Finished, Available)

SynapseWidget(Synapse.DataFrame, f199d3a9-15ac-49e4-bb47-d1fd633d8ba6)

root
 |-- SignalType: string (nullable = true)
 |-- StartTime: timestamp (nullable = true)
 |-- UserAgent: string (nullable = true)
 |-- SignalId: string (nullable = true)
 |-- SisClassId: string (nullable = true)
 |-- ClassId: string (nullable = true)
 |-- ChannelId: string (nullable = true)
 |-- AppName: string (nullable = true)
 |-- ActorId_pseudonym: string (nullable = true)
 |-- ActorRole: string (nullable = true)
 |-- SchemaVersion: string (nullable = true)
 |-- AssignmentId: string (nullable = true)
 |-- SubmissionId: string (nullable = true)
 |-- SubmissionCreatedTime: timestamp (nullable = true)
 |-- Action: string (nullable = true)
 |-- DueDate: timestamp (nullable = true)
 |-- ClassCreationDate: timestamp (nullable = true)
 |-- Grade: double (nullable = true)
 |-- SourceFileExtension: string (nullable = true)
 |-- MeetingDuration: string (nullable = true)
 |-- MeetingSessionId: string (nullable = true)
 |-- MeetingType: string (nullable = true)
 |-- ReadingSubmissionWordsPer

SynapseWidget(Synapse.DataFrame, aaa79174-1763-43fe-a9e9-a0a44959291c)

root
 |-- ObjectId: string (nullable = true)
 |-- DisplayName: string (nullable = true)
 |-- Mail: string (nullable = true)
 |-- MailNickname: string (nullable = true)
 |-- AnchorId: string (nullable = true)
 |-- ObjectId_pseudonym: string (nullable = true)
 |-- AnchorId_pseudonym: string (nullable = true)



SynapseWidget(Synapse.DataFrame, 119c1540-f5f8-40df-8b1e-686d8a104186)
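
To see what the lookup join in the cell above does without a Spark session, the same join shape can be tried on a tiny in-memory SQLite database. This is only an illustration: the two tables are cut down to a few columns, and every row value (signal types, ids, display names) is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activity (SignalId TEXT, SignalType TEXT, ClassId TEXT, ActorId_pseudonym TEXT);
    CREATE TABLE AadGroup_lookup (ObjectId TEXT, DisplayName TEXT, ObjectId_pseudonym TEXT);
    -- Made-up sample rows
    INSERT INTO activity VALUES ('s1', 'VisitTeamChannel', 'class-1', 'abc123');
    INSERT INTO activity VALUES ('s2', 'SubmitAssignment', 'class-2', 'def456');
    INSERT INTO AadGroup_lookup VALUES ('class-1', 'Algebra I', 'hash-1');
""")

# Inner join: only activity rows whose ClassId has a lookup entry are returned.
rows = conn.execute("""
    SELECT a.SignalId, a.SignalType, agl.DisplayName
    FROM activity AS a
    JOIN AadGroup_lookup AS agl ON agl.ObjectId = a.ClassId
    ORDER BY a.SignalId
""").fetchall()
conn.close()
print(rows)
```

The second activity row is dropped because its `ClassId` has no matching `ObjectId` in the lookup table, which is also how the Spark query behaves for classes without a lookup entry.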

In [19]:
# Run this cell to reset this example (deleting all the example Insights data in your workspace)
oea.rm_if_exists('stage1/Transactional/M365')
oea.rm_if_exists('stage2/Ingested/M365')
oea.rm_if_exists('stage2/Refined/M365')
oea.drop_lake_db('ldb_sam_s2i_m365_v1p14')
oea.drop_lake_db('ldb_sam_s2r_m365_v1p14')

StatementMeta(spark3p2med, 8, 19, Finished, Available)

2022-12-27 23:16:17,574 - OEA - INFO - Database dropped: ldb_sam_s2i_m365_v1p14
2022-12-27 23:16:18,749 - OEA - INFO - Database dropped: ldb_sam_s2r_m365_v1p14


'Database dropped: ldb_sam_s2r_m365_v1p14'

## Appendix

In [None]:
# generate an initial metadata file for manual modification
metadata = oea.create_metadata_from_lake_db('ldb_sam_s2i_m365_v1p14')
dlw = DataLakeWriter(oea.to_url('stage1/Transactional/M365'))
dlw.write('metadata.csv', metadata)

In [None]:
# Create a sql db for the ingested Insights data
oea.create_sql_db('stage2/Ingested/M365')