# Canvas Module - Ingest

This notebook demonstrates how the OEA_py class notebook can be used to speed up the ingestion of Canvas data.

The steps below describe how this notebook ingests the Canvas module tables:

- Set the workspace where the tables are located.
- Two functions are defined and used:
  - ```should_ingest```: checks whether the batch folder for a table (the latest one, for snapshot and additive batches) contains any data.
  - ```ingest_canvas_dataset```: ingests each of the Canvas tables listed below, using `id` as the primary key:
    1. **accounts**
    2. **assignments**
    3. **assignment_submissions**
    4. **assignment_submission_summary**
    5. **courses**
    6. **course_sections**
    7. **enrollments**
    8. **enrollment_terms**
    9. **roles**
    10. **users**

In [None]:
workspace = 'dev'
version = '2.0'

StatementMeta(spark3p3sm, 75, 1, Finished, Available)

In [None]:
%run OEA_py

StatementMeta(, 75, -1, Finished, Available)

2023-01-13 17:28:45,990 - OEA - INFO - Now using workspace: dev
2023-01-13 17:28:45,991 - OEA - INFO - OEA initialized.


In [None]:
# 1) set the workspace (this determines where in the data lake you'll be writing to and reading from).
# You can work in 'dev', 'prod', or a sandbox with any name you choose.
# For example, Sam the developer can create a 'sam' workspace and expect to find his datasets in the data lake under oea/sandboxes/sam
oea.set_workspace(workspace)

StatementMeta(spark3p3sm, 75, 3, Finished, Available)

2023-01-13 17:28:47,566 - OEA - INFO - Now using workspace: dev


In [None]:
items = oea.get_folders('stage1/Transactional/Canvas/v' + version)

StatementMeta(spark3p3sm, 75, 5, Finished, Available)

In [None]:
print(items)

StatementMeta(spark3p3sm, 75, 6, Finished, Available)

['AadGroup', 'AadGroupMembership', 'AadUser', 'AadUserPersonMapping', 'Course', 'CourseGradeLevel', 'CourseSubject', 'Enrollment', 'Organization', 'Person', 'PersonDemographic', 'PersonDemographicEthnicity', 'PersonDemographicPersonFlag', 'PersonDemographicRace', 'PersonEmailAddress', 'PersonIdentifier', 'PersonOrganizationRole', 'PersonPhoneNumber', 'PersonRelationship', 'RefDefinition', 'RefTranslation', 'Section', 'SectionGradeLevel', 'SectionSession', 'SectionSubject', 'Session', 'SourceSystem', 'activity']
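
The folder names returned by `oea.get_folders` can be compared against the ten Canvas tables listed at the top of this notebook before ingesting anything. The following is a minimal sketch of such a check, not part of OEA_py: `EXPECTED_CANVAS_TABLES` and `missing_tables` are hypothetical names introduced here, and the comparison assumes folder names match the table names exactly (case-sensitive).

```python
# The ten Canvas tables named at the top of this notebook.
EXPECTED_CANVAS_TABLES = [
    'accounts', 'assignments', 'assignment_submissions',
    'assignment_submission_summary', 'courses', 'course_sections',
    'enrollments', 'enrollment_terms', 'roles', 'users',
]

def missing_tables(found_folders, expected=EXPECTED_CANVAS_TABLES):
    """Return the expected table names that are absent from found_folders.

    Hypothetical helper: the match is exact and case-sensitive, which is an
    assumption about how the folders in the data lake are named.
    """
    found = set(found_folders)
    # Preserve the order of the expected list so the report is stable.
    return [name for name in expected if name not in found]

# Example with a hand-written folder list; in the notebook, pass `items` instead.
example_missing = missing_tables(['accounts', 'courses', 'users'])
print(example_missing)
```

If the returned list is not empty, the path passed to `oea.get_folders` may point at a different module's data or at a differently cased folder.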


In [None]:
def should_ingest(entity_path):
    """Return True if the batch folder that would be ingested for entity_path contains data."""
    raw_path = f'stage1/Transactional/{entity_path}'
    batch_type, source_data_format = oea.get_batch_info(raw_path)
    logger.info(f'Ingesting from: {raw_path}, batch type of: {batch_type}, source data format of: {source_data_format}')
    source_url = oea.to_url(f'{raw_path}/{batch_type}_batch_data')

    # Snapshot and additive batches are stored in subfolders; only the latest one is checked.
    if batch_type in ('snapshot', 'additive'):
        source_url = f'{source_url}/{oea.get_latest_folder(source_url)}'

    return oea.get_folder_size(source_url) > 0

# AnalysisException is raised by Spark; imported explicitly in case OEA_py does not expose it.
from pyspark.sql.utils import AnalysisException

# 2) this step ingests each Canvas table found under tables_source, using 'id' as the primary key.
# The metadata loaded in the next cell is not used here; pseudonymization happens later, when the data is refined.
def ingest_canvas_dataset(tables_source):
    items = oea.get_folders(tables_source)
    options = {'multiline':True}
    for item in items:
        if item == 'metadata.csv':
            logger.info('ignore metadata csv - not a table to be ingested')
            continue
        entity_path = 'canvas/v' + version + '/' + item
        try:
            if should_ingest(entity_path):
                oea.ingest(entity_path, 'id', options)
        except AnalysisException as e:
            # The table may not have been ingested properly, e.g. because the primary key does not match the columns in the source data.
            logger.warning(f'Skipping {item}: {e}')

StatementMeta(spark3p3sm, 75, 17, Finished, Available)
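
The decision logic in the two functions above can be exercised without a Spark session or access to the data lake. The sketch below mirrors it in plain Python under stated assumptions: `resolve_source_path` and `plan_ingest` are hypothetical helpers that are not part of OEA_py, paths are kept relative (the notebook converts them with `oea.to_url`), and `has_data` is a caller-supplied stand-in for `should_ingest`.

```python
def resolve_source_path(raw_path, batch_type, latest_folder):
    """Build the folder whose size decides whether a table is ingested.

    Hypothetical helper mirroring should_ingest: snapshot and additive
    batches point at their latest subfolder, any other batch type points
    at the batch folder itself.
    """
    base = f'{raw_path}/{batch_type}_batch_data'
    if batch_type in ('snapshot', 'additive'):
        return f'{base}/{latest_folder}'
    return base

def plan_ingest(items, version, has_data):
    """Return the entity paths that ingest_canvas_dataset would pass to oea.ingest.

    items:    folder names, as returned by oea.get_folders in the notebook
    has_data: callable(entity_path) -> bool, standing in for should_ingest
    """
    planned = []
    for item in items:
        if item == 'metadata.csv':
            continue  # not a table to be ingested
        entity_path = f'canvas/v{version}/{item}'
        if has_data(entity_path):
            planned.append(entity_path)
    return planned

# Dry run with hand-written inputs: 'users' is treated as an empty folder.
planned = plan_ingest(
    ['accounts', 'metadata.csv', 'users'],
    '2.0',
    has_data=lambda path: not path.endswith('/users'),
)
print(planned)
```

A dry run like this lists what would be ingested before any data is written, which can help when a table is unexpectedly skipped.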

In [None]:
metadata = oea.get_metadata_from_url('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Canvas/test_data/metadata.csv')
ingest_canvas_dataset('stage1/Transactional/canvas/v' + version)

StatementMeta(spark3p3sm, 75, 18, Finished, Available)

2023-01-13 18:20:23,112 - OEA - INFO - Ingesting from: stage1/Transactional/M365/v1.14/AadGroup, batch type of: snapshot, source data format of: csv
2023-01-13 18:20:37,659 - py4j.java_gateway - INFO - Callback Connection ready to receive messages
2023-01-13 18:20:37,660 - py4j.java_gateway - INFO - Received command c on object id p1
2023-01-13 18:20:55,302 - OEA - INFO - Number of new inbound rows processed: 87
2023-01-13 18:21:03,391 - OEA - INFO - Ingesting from: stage1/Transactional/M365/v1.14/AadGroupMembership, batch type of: snapshot, source data format of: csv
2023-01-13 18:21:05,456 - py4j.java_gateway - INFO - Received command c on object id p2
2023-01-13 18:21:08,940 - OEA - INFO - Number of new inbound rows processed: 3640
2023-01-13 18:21:11,185 - OEA - INFO - Ingesting from: stage1/Transactional/M365/v1.14/AadUser, batch type of: snapshot, source data format of: csv
2023-01-13 18:21:13,000 - py4j.java_gateway - INFO - Received command c on object id p3
2023-01-13 18:21:17