# Test for processing Moodle data

This notebook demonstrates possible data processing and exploration of the Moodle data, using the OEA_py class notebook. 

Most of the data processing in this notebook is also performed by the Moodle module main pipeline. This notebook offers an alternative route to the same processing, along with exploration and visualization of the module data.

The steps are clearly outlined below:
1. Set the workspace,
2. Land Moodle Module Higher Ed. Test Data,
3. Ingest the Moodle Module Test Data,
4. Refine the Moodle Module Test Data, 
5. Demonstrate Lake Database Queries/Final Remarks, and
6. Appendix

In [1]:
%run OEA_py

In [2]:
# 1) set the workspace (this determines where in the data lake you'll be writing to and reading from).
# You can work in 'dev', 'prod', or a sandbox with any name you choose.
# For example, Sam the developer can create a 'sam' workspace and expect to find his datasets in the data lake under oea/sandboxes/sam
oea.set_workspace('sam')

## 2.) Land Moodle Module Higher Ed. Test Data

Directory: ```GitHub.com (raw data) -> stage1/Transactional/moodle```

The code block below lands 26 Moodle module test data tables, formatted as Moodle Higher Ed. data, in stage 1 of your data lake.

Moodle test data tables landed in stage 1:
 1. **assign**
 2. **assign_grades**
 3. **assign_submission**
 4. **assignsubmission_file**
 5. **assign_user_mapping**
 6. **context**
 7. **course**
 8. **course_categories**
 9. **enrol**
 10. **forum**
 11. **forum_discussions**
 12. **forum_grades**
 13. **lesson**
 14. **lesson_answers**
 15. **lesson_attempts**
 16. **lesson_grades**
 17. **messages**
 18. **message_conversations**
 19. **message_conversation_members**
 20. **quiz**
 21. **quiz_attempts**
 22. **quiz_grades**
 23. **role**
 24. **role_assignments**
 25. **user** 
 26. **user_enrolments**


**To-Do's:**
 - Confirm that files are being landed "correctly" in their proper folder partitions.
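
The landing cell repeats the same two calls for each of the 26 tables, so the pattern can be written down once. The following is a minimal sketch of that pattern, not part of the OEA framework: the helper names (`LandingStep`, `landing_step`, `landing_plan`) are hypothetical, while the base URL, batch date, file-name convention, and batch types are copied from the landing cell. It only builds the plan and does not touch the network or the data lake.

```python
from typing import List, NamedTuple

# Values copied from the landing cell in this notebook.
BASE_URL = ('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/'
            'modules/module_catalog/Moodle/test_data/hed_test_data')
BATCH_DATE = '2023-03-21'
SNAPSHOT_TABLES = {'context', 'course', 'course_categories', 'enrol',
                   'role', 'role_assignments', 'user'}
DELTA_TABLES = {'user_enrolments'}
# The landing cell uses a differently named source file for the user table.
SOURCE_FILE_OVERRIDES = {'user': 'moodle_user_assignments_hed_test_data.csv'}


class LandingStep(NamedTuple):
    """Hypothetical record describing one fetch-and-land operation."""
    url: str
    lake_path: str
    file_name: str
    batch_type: str  # name of the matching constant on the oea object


def landing_step(table: str) -> LandingStep:
    """Derive the source URL, lake path, file name and batch type for a table."""
    source_file = SOURCE_FILE_OVERRIDES.get(table, f'moodle_{table}_hed_test_data.csv')
    if table in SNAPSHOT_TABLES:
        batch_type = 'SNAPSHOT_BATCH_DATA'
    elif table in DELTA_TABLES:
        batch_type = 'DELTA_BATCH_DATA'
    else:
        batch_type = 'ADDITIVE_BATCH_DATA'
    return LandingStep(
        url=f'{BASE_URL}/{table}/{BATCH_DATE}/{source_file}',
        lake_path=f'moodle/v0.1/{table}',
        # Landed file names drop the underscores from the table name.
        file_name=table.replace('_', '') + '_hed_test_data.csv',
        batch_type=batch_type,
    )


def landing_plan(tables: List[str]) -> List[LandingStep]:
    """Build the landing steps for a list of table names."""
    return [landing_step(table) for table in tables]


for step in landing_plan(['context', 'user_enrolments', 'quiz']):
    print(step.lake_path, step.file_name, step.batch_type)
```

Inside the notebook, each step could then be executed with `oea.land(requests.get(step.url).text, step.lake_path, step.file_name, getattr(oea, step.batch_type))`, which mirrors the calls in the landing cell.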

In [None]:
# 2.1) Land batch data files into stage1 of the data lake.
# In this example we pull the Moodle HEd test CSV data files from GitHub and land them in oea/sandboxes/sam/stage1/Transactional/moodle/v0.1
import requests  # harmless if the OEA_py notebook already imported it
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/context/2023-03-21/moodle_context_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/context', 'context_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/course/2023-03-21/moodle_course_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/course', 'course_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/course_categories/2023-03-21/moodle_course_categories_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/course_categories', 'coursecategories_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/enrol/2023-03-21/moodle_enrol_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/enrol', 'enrol_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/role/2023-03-21/moodle_role_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/role', 'role_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/role_assignments/2023-03-21/moodle_role_assignments_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/role_assignments', 'roleassignments_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/user/2023-03-21/moodle_user_assignments_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/user', 'user_hed_test_data.csv', oea.SNAPSHOT_BATCH_DATA)

data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/user_enrolments/2023-03-21/moodle_user_enrolments_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/user_enrolments', 'userenrolments_hed_test_data.csv', oea.DELTA_BATCH_DATA)

data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/assign/2023-03-21/moodle_assign_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/assign', 'assign_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/assign_grades/2023-03-21/moodle_assign_grades_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/assign_grades', 'assigngrades_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/assign_submission/2023-03-21/moodle_assign_submission_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/assign_submission', 'assignsubmission_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/assignsubmission_file/2023-03-21/moodle_assignsubmission_file_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/assignsubmission_file', 'assignsubmissionfile_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/assign_user_mapping/2023-03-21/moodle_assign_user_mapping_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/assign_user_mapping', 'assignusermapping_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/forum/2023-03-21/moodle_forum_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/forum', 'forum_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/forum_discussions/2023-03-21/moodle_forum_discussions_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/forum_discussions', 'forumdiscussions_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/forum_grades/2023-03-21/moodle_forum_grades_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/forum_grades', 'forumgrades_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/lesson/2023-03-21/moodle_lesson_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/lesson', 'lesson_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/lesson_answers/2023-03-21/moodle_lesson_answers_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/lesson_answers', 'lessonanswers_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/lesson_attempts/2023-03-21/moodle_lesson_attempts_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/lesson_attempts', 'lessonattempts_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/lesson_grades/2023-03-21/moodle_lesson_grades_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/lesson_grades', 'lessongrades_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/messages/2023-03-21/moodle_messages_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/messages', 'messages_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/message_conversations/2023-03-21/moodle_message_conversations_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/message_conversations', 'messageconversations_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/message_conversation_members/2023-03-21/moodle_message_conversation_members_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/message_conversation_members', 'messageconversationmembers_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/quiz/2023-03-21/moodle_quiz_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/quiz', 'quiz_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/quiz_attempts/2023-03-21/moodle_quiz_attempts_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/quiz_attempts', 'quizattempts_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/hed_test_data/quiz_grades/2023-03-21/moodle_quiz_grades_hed_test_data.csv').text
oea.land(data, 'moodle/v0.1/quiz_grades', 'quizgrades_hed_test_data.csv', oea.ADDITIVE_BATCH_DATA)

## 3.) Ingest the Moodle Module Test Data

Directory: ```stage1/Transactional/moodle -> stage2/Ingested/moodle```

This step ingests the Moodle module test data from stage1 to stage2/Ingested.

The code blocks in this step ingest the data using the ```oea.ingest()``` function as normal.

**To-Do's:**
 - Check if Moodle test data accurately reflects actual (production) Moodle data.
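
The ingestion cell notes that a second run reports 0 new inbound rows, because Spark Structured Streaming keeps track of what has already been processed. The following pure-Python sketch is only an analogy for that behaviour, not how `oea.ingest()` is implemented: the function `ingest_new_files`, the in-memory checkpoint set, and the two sample rows are all made up for illustration. It assumes the second argument passed to `oea.ingest()` acts as the primary key on which rows are upserted.

```python
from typing import Dict, List, Set


def ingest_new_files(landed: Dict[str, List[dict]],
                     checkpoint: Set[str],
                     table: Dict[object, dict],
                     primary_key: str = 'id') -> int:
    """Upsert rows from files not yet in the checkpoint; return the rows processed."""
    processed = 0
    for file_name in sorted(landed):
        if file_name in checkpoint:
            continue  # already ingested on an earlier run
        for row in landed[file_name]:
            table[row[primary_key]] = row  # upsert keyed on the primary key
            processed += 1
        checkpoint.add(file_name)
    return processed


# Made-up sample rows standing in for a landed CSV file.
landed = {'course_hed_test_data.csv': [{'id': 1, 'fullname': 'Biology 101'},
                                       {'id': 2, 'fullname': 'Chemistry 101'}]}
checkpoint: Set[str] = set()
table: Dict[object, dict] = {}

first_run = ingest_new_files(landed, checkpoint, table)
second_run = ingest_new_files(landed, checkpoint, table)
print(first_run, second_run)  # 2 0
```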

In [8]:
# 3) The next step is to ingest the batch data into stage2
# Note that when you run this the first time, you'll see an info message like "Number of new inbound rows processed: 2".
# If you run this a second time, the number of inbound rows processed will be 0 because the ingestion uses spark structured streaming to keep track of what data has already been processed.
#options = {'header':True}
oea.ingest('moodle/v0.1/assign', 'id')
oea.ingest('moodle/v0.1/assign_grades', 'id')
oea.ingest('moodle/v0.1/assign_submission', 'id')
oea.ingest('moodle/v0.1/assign_user_mapping', 'id')
oea.ingest('moodle/v0.1/assignsubmission_file', 'id')
oea.ingest('moodle/v0.1/context', 'id')
oea.ingest('moodle/v0.1/course', 'id')
oea.ingest('moodle/v0.1/course_categories', 'id')
oea.ingest('moodle/v0.1/enrol', 'id')
oea.ingest('moodle/v0.1/forum', 'id')
oea.ingest('moodle/v0.1/forum_discussions', 'id')
oea.ingest('moodle/v0.1/forum_grades', 'id')
oea.ingest('moodle/v0.1/lesson', 'id')
oea.ingest('moodle/v0.1/lesson_answers', 'id')
oea.ingest('moodle/v0.1/lesson_attempts', 'id')
oea.ingest('moodle/v0.1/lesson_grades', 'id')
oea.ingest('moodle/v0.1/messages', 'id')
oea.ingest('moodle/v0.1/message_conversations', 'id')
oea.ingest('moodle/v0.1/message_conversation_members', 'id')
oea.ingest('moodle/v0.1/quiz', 'id')
oea.ingest('moodle/v0.1/quiz_attempts', 'id')
oea.ingest('moodle/v0.1/quiz_grades', 'id')
oea.ingest('moodle/v0.1/role', 'id')
oea.ingest('moodle/v0.1/role_assignments', 'id')
oea.ingest('moodle/v0.1/user', 'id')
oea.ingest('moodle/v0.1/user_enrolments', 'id')

In [9]:
# 3.5) Now you can run queries against the auto-generated "lake database" with the ingested Moodle data.
# The database name includes the workspace set in step 1 ('sam' here); use ldb_dev_s2i_moodle_v0p1 if you are working in the 'dev' workspace.
df = spark.sql("select * from ldb_sam_s2i_moodle_v0p1.course")
display(df.limit(10))

## 4.) Refine the Moodle Module Test Data

Directory: ```stage2/Ingested/moodle -> stage2/Refined/moodle```

This step refines the Moodle test data from stage2/Ingested to stage2/Refined, using the metadata.csv file. Refinement is where pseudonymization happens: sensitive student information is protected by hashing or masking the sensitive columns.

Refined output is written to either ```stage2/Refined/moodle/v0.1/general``` or ```stage2/Refined/moodle/v0.1/sensitive```: the pseudonymized tables go to the former, and the mappings for hashed or masked sensitive columns go to the latter.
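
To make the hashing and masking described above concrete, here is a minimal, self-contained sketch. It is not the code behind `oea.refine()`: the rule names (`'hash'`, `'mask'`), the salted SHA-256 scheme, the `'*'` mask value, and the sample row are all assumptions chosen for illustration. The only detail taken from this notebook is that a hashed `id` column ends up as `id_pseudonym`.

```python
import hashlib
from typing import Dict, List, Tuple


def hash_value(value: object, salt: str) -> str:
    """Return a salted SHA-256 hex digest of the value (assumed scheme)."""
    return hashlib.sha256((salt + str(value)).encode('utf-8')).hexdigest()


def pseudonymize(rows: List[dict], rules: Dict[str, str],
                 salt: str) -> Tuple[List[dict], List[dict]]:
    """Split rows into pseudonymized 'general' rows and 'sensitive' lookup rows."""
    general, lookup = [], []
    for row in rows:
        out = {}
        for column, value in row.items():
            rule = rules.get(column)  # columns without a rule pass through unchanged
            if rule == 'hash':
                pseudonym = hash_value(value, salt)
                out[column + '_pseudonym'] = pseudonym
                # The lookup row keeps the original value next to its pseudonym.
                lookup.append({column: value, column + '_pseudonym': pseudonym})
            elif rule == 'mask':
                out[column] = '*'
            else:
                out[column] = value
        general.append(out)
    return general, lookup


# Made-up sample row and rules.
rows = [{'id': 7, 'email': 'sam@example.edu', 'city': 'Seattle'}]
rules = {'id': 'hash', 'email': 'mask'}
general, lookup = pseudonymize(rows, rules, salt='demo-salt')
print(general[0])
print(lookup[0])
```

Keeping the lookup rows separate from the general rows is what allows access to the original identifiers to be restricted independently of access to the pseudonymized tables.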


In [10]:
# 4) this step refines the data through the use of metadata (this is where the pseudonymization of the data occurs).
from pyspark.sql.utils import AnalysisException  # harmless if the OEA_py notebook already imported it

def refine_moodle_dataset(tables_source):
    # Relies on the global `metadata` loaded in the next cell.
    items = oea.get_folders(tables_source)
    for item in items:
        table_path = tables_source + '/' + item
        if item == 'metadata.csv':
            logger.info('Skipping metadata.csv, since it is not a table to be refined')
            continue
        # assign and user are keyed on id_pseudonym; every other table is keyed on id.
        primary_key = 'id_pseudonym' if item in ('assign', 'user') else 'id'
        try:
            oea.refine('moodle/v0.1/' + item, metadata[item], primary_key)
        except AnalysisException as e:
            # The table may not have been refined properly, e.g. because the primary key does not align with the columns expected in the lookup table.
            logger.warning('Could not refine table: ' + item + ' from: ' + table_path + ' (' + str(e) + ')')
        else:
            logger.info('Refined table: ' + item + ' from: ' + table_path)
    logger.info('Finished refining Moodle tables')

In [11]:
#metadata = oea.get_metadata_from_url('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/metadata.csv')
metadata = oea.get_metadata_from_url('https://raw.githubusercontent.com/cstohlmann/oea-moodle-module/main/test_data/metadata.csv')
refine_moodle_dataset('stage2/Ingested/moodle/v0.1')

In [12]:
# This block represents what the blocks above (in this step) accomplish
#metadata = oea.get_metadata_from_url('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Moodle/test_data/metadata.csv')

#oea.refine('moodle/v0.1/assign', metadata['assign'], 'id_pseudonym')
#oea.refine('moodle/v0.1/assign_grades', metadata['assign_grades'], 'id')
#oea.refine('moodle/v0.1/assign_submission', metadata['assign_submission'], 'id')
#oea.refine('moodle/v0.1/assign_user_mapping', metadata['assign_user_mapping'], 'id')
#oea.refine('moodle/v0.1/assignsubmission_file', metadata['assignsubmission_file'], 'id')
#oea.refine('moodle/v0.1/context', metadata['context'], 'id')
#oea.refine('moodle/v0.1/course', metadata['course'], 'id')
#oea.refine('moodle/v0.1/course_categories', metadata['course_categories'], 'id')
#oea.refine('moodle/v0.1/enrol', metadata['enrol'], 'id')
#oea.refine('moodle/v0.1/forum', metadata['forum'], 'id')
#oea.refine('moodle/v0.1/forum_discussions', metadata['forum_discussions'], 'id')
#oea.refine('moodle/v0.1/forum_grades', metadata['forum_grades'], 'id')
#oea.refine('moodle/v0.1/lesson', metadata['lesson'], 'id')
#oea.refine('moodle/v0.1/lesson_answers', metadata['lesson_answers'], 'id')
#oea.refine('moodle/v0.1/lesson_attempts', metadata['lesson_attempts'], 'id')
#oea.refine('moodle/v0.1/lesson_grades', metadata['lesson_grades'], 'id')
#oea.refine('moodle/v0.1/messages', metadata['messages'], 'id')
#oea.refine('moodle/v0.1/message_conversations', metadata['message_conversations'], 'id')
#oea.refine('moodle/v0.1/message_conversation_members', metadata['message_conversation_members'], 'id')
#oea.refine('moodle/v0.1/quiz', metadata['quiz'], 'id') 
#oea.refine('moodle/v0.1/quiz_attempts', metadata['quiz_attempts'], 'id')
#oea.refine('moodle/v0.1/quiz_grades', metadata['quiz_grades'], 'id')
#oea.refine('moodle/v0.1/role', metadata['role'], 'id')
#oea.refine('moodle/v0.1/role_assignments', metadata['role_assignments'], 'id')
#oea.refine('moodle/v0.1/user', metadata['user'], 'id_pseudonym')
#oea.refine('moodle/v0.1/user_enrolments', metadata['user_enrolments'], 'id')


## 5.) Demonstrate Lake Database Queries/Final Remarks
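
The lake database names used in this notebook (for example `ldb_sam_s2i_moodle_v0p1` and `ldb_dev_s2r_moodle_v0p1`) appear to follow the pattern `ldb_<workspace>_<stage>_<module>_<version>`, with the dot in the version replaced by `p`. The sketch below builds names from that inferred pattern; the helpers `lake_db_name` and `select_all` are hypothetical and not part of OEA, and the pattern itself is an assumption drawn only from the names in this notebook.

```python
def lake_db_name(workspace: str, stage: str, version: str, module: str = 'moodle') -> str:
    """Build a lake database name following the pattern seen in this notebook."""
    return f"ldb_{workspace}_{stage}_{module}_{version.replace('.', 'p')}"


def select_all(workspace: str, stage: str, table: str, version: str = 'v0.1') -> str:
    """Build a 'select *' query against a table in the matching lake database."""
    return f"select * from {lake_db_name(workspace, stage, version)}.{table}"


print(lake_db_name('sam', 's2i', 'v0.1'))   # ldb_sam_s2i_moodle_v0p1
print(select_all('dev', 's2r', 'assign'))   # select * from ldb_dev_s2r_moodle_v0p1.assign
```

Deriving the name from the workspace in one place helps keep the queries and the reset cell pointed at the same databases.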

In [13]:
oea.add_to_lake_db('stage2/Refined/moodle/v0.1/general/assign')

In [14]:
oea.add_to_lake_db('stage2/Refined/moodle/v0.1/general/course')

In [15]:
# 5) Now you can query the refined data tables in the lake db
# The database name includes the workspace set in step 1 ('sam' here); use ldb_dev_s2r_moodle_v0p1 if you are working in the 'dev' workspace.
df = spark.sql("select * from ldb_sam_s2r_moodle_v0p1.assign")
display(df)
df.printSchema()
df = spark.sql("select * from ldb_sam_s2r_moodle_v0p1.course")
display(df)
df.printSchema()
# You can also join refined tables, here course and assign. (Joins against the "lookup" tables are not available to people with restricted access, because they won't have access to data in the "sensitive" folder in the data lake.)
df = spark.sql("select c.fullname, c.id, a.id_pseudonym, a.name, a.nosubmissions, a.maxattempts, a.grade from ldb_sam_s2r_moodle_v0p1.course c, ldb_sam_s2r_moodle_v0p1.assign a where c.id = a.course")
display(df)

In [19]:
# Run this cell to reset this example (deleting all the example Moodle data in your workspace)
oea.rm_if_exists('stage1/Transactional/moodle')
oea.rm_if_exists('stage2/Ingested/moodle')
oea.rm_if_exists('stage2/Refined/moodle')
oea.drop_lake_db('ldb_sam_s2i_moodle_v0p1')
oea.drop_lake_db('ldb_sam_s2r_moodle_v0p1')

## 6.) Appendix

In [None]:
# generate an initial metadata file for manual modification
metadata = oea.create_metadata_from_lake_db('ldb_sam_s2i_moodle_v0p1')
dlw = DataLakeWriter(oea.to_url('stage1/Transactional/moodle'))
dlw.write('metadata.csv', metadata)

In [None]:
# Create a sql db for the ingested Moodle data
oea.create_sql_db('stage2/Ingested/moodle')

In [16]:
oea.create_sql_db('stage2/Refined/moodle')