The Open University Learning Analytics Dataset (OULAD) is an anonymized, tabular dataset containing student demographics, course registrations, assessment results, and clickstream interactions with the Virtual Learning Environment (VLE) for seven selected modules over multiple presentations (runs) in 2013–2014 (nature.com; analyse.kmi.open.ac.uk). It comprises several CSV files (e.g., studentInfo.csv, studentRegistration.csv, studentVle.csv, vle.csv, studentAssessment.csv) that can be linked via shared identifiers such as code_module, code_presentation, id_student, and id_site (archive.ics.uci.edu; github.com). The dataset has been widely used for learning analytics tasks such as predicting student performance, identifying at-risk students early, and clustering student engagement patterns.

Data Understanding and Loading

Dataset files and schema:
studentInfo.csv: one row per student and module presentation, with demographic attributes (gender, region, highest_education, imd_band, age_band, disability), num_of_prev_attempts, studied_credits, and the final_result (e.g., "Pass", "Withdrawn").
studentRegistration.csv: registration records per student and module presentation: date_registration and date_unregistration, both given as day offsets relative to the presentation start (negative values fall before the start).
studentVle.csv: click records per student, VLE item (identified by id_site), and day: date and sum_click.
vle.csv: metadata per VLE item: activity_type (resource, oucontent, url, etc.) and the planned weeks of use (week_from, week_to), which are missing for most items.
studentAssessment.csv: one row per student and assessment: date_submitted, is_banked, and score.
assessments.csv: metadata per assessment: assessment_type (TMA, CMA, or Exam), due date, and weight.
courses.csv: the length in days of each module presentation (module_presentation_length).
Each table can be loaded into a pandas DataFrame in Python and joined with the others on shared identifiers such as id_student, code_module, code_presentation, id_site, and id_assessment.
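
As a minimal sketch of how such joins might look, the example below builds tiny hand-made stand-ins for studentVle.csv, vle.csv, and studentInfo.csv, attaches activity_type to each click record, and adds each student's total clicks to the student table. The column names follow the schema printed further down; the rows, the id_site-to-activity_type mapping, and all variable names are assumptions made for illustration, not values taken from the real files.

```python
import pandas as pd

# Illustrative stand-ins for three OULAD tables (rows are made up).
student_vle = pd.DataFrame({
    "code_module": ["AAA", "AAA", "AAA"],
    "code_presentation": ["2013J", "2013J", "2013J"],
    "id_student": [28400, 28400, 11391],
    "id_site": [546652, 546614, 546652],
    "date": [-10, -10, 3],
    "sum_click": [4, 11, 2],
})
vle = pd.DataFrame({
    "id_site": [546652, 546614],
    "code_module": ["AAA", "AAA"],
    "code_presentation": ["2013J", "2013J"],
    "activity_type": ["forumng", "homepage"],  # assumed mapping
})
student_info = pd.DataFrame({
    "code_module": ["AAA", "AAA", "AAA"],
    "code_presentation": ["2013J", "2013J", "2013J"],
    "id_student": [11391, 28400, 30268],
    "final_result": ["Pass", "Pass", "Withdrawn"],
})

course_keys = ["code_module", "code_presentation"]

# Attach the activity type of each VLE item to every click record.
# validate= raises if a VLE item appears more than once in `vle`.
clicks = student_vle.merge(
    vle, on=["id_site"] + course_keys, how="left", validate="many_to_one"
)

# Total clicks per student within each module presentation.
clicks_per_student = clicks.groupby(
    course_keys + ["id_student"], as_index=False
)["sum_click"].sum()

# Add the totals to the student table; students without any clicks get 0.
features = student_info.merge(
    clicks_per_student,
    on=course_keys + ["id_student"],
    how="left",
    validate="one_to_one",
)
features["sum_click"] = features["sum_click"].fillna(0).astype(int)

print(features)
```

Joining on code_module and code_presentation together with the student or site identifier keeps records from different presentations apart, since a student may appear in more than one module presentation. The left joins keep every row of the left-hand table, so students with no recorded clicks remain in the result.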

In [2]:
import os

import pandas as pd
from IPython.display import display  # implicit in notebooks; needed when run as a script

# List the CSV files in data/raw (sorted for a stable load order)
raw_dir = "../data/raw/OULAD"
files = sorted(f for f in os.listdir(raw_dir) if f.endswith(".csv"))
print("Raw CSV files:", files)

# Load each CSV into a dict keyed by file name
dataframes = {}
for fname in files:
    path = os.path.join(raw_dir, fname)
    try:
        df = pd.read_csv(path)
        dataframes[fname] = df
        print(f"\nLoaded {fname}: shape={df.shape}")
        display(df.head())
        # df.info() prints its summary and returns None, so print() adds a trailing "None"
        print(df.info())
    except Exception as e:
        print(f"Failed to load {fname}: {e}")

# Inspect column names and missing values per table
for name, df in dataframes.items():
    print(f"\n{name}:")
    print(df.columns.tolist())
    print("Missing values per column:")
    print(df.isnull().sum())


Raw CSV files: ['assessments.csv', 'courses.csv', 'studentAssessment.csv', 'studentInfo.csv', 'studentRegistration.csv', 'studentVle.csv', 'vle.csv']

Loaded assessments.csv: shape=(206, 6)


   code_module  code_presentation  id_assessment  assessment_type   date  weight
0          AAA              2013J           1752              TMA   19.0    10.0
1          AAA              2013J           1753              TMA   54.0    20.0
2          AAA              2013J           1754              TMA  117.0    20.0
3          AAA              2013J           1755              TMA  166.0    20.0
4          AAA              2013J           1756              TMA  215.0    30.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206 entries, 0 to 205
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   code_module        206 non-null    object 
 1   code_presentation  206 non-null    object 
 2   id_assessment      206 non-null    int64  
 3   assessment_type    206 non-null    object 
 4   date               195 non-null    float64
 5   weight             206 non-null    float64
dtypes: float64(2), int64(1), object(3)
memory usage: 9.8+ KB
None

Loaded courses.csv: shape=(22, 3)


   code_module  code_presentation  module_presentation_length
0          AAA              2013J                         268
1          AAA              2014J                         269
2          BBB              2013J                         268
3          BBB              2014J                         262
4          BBB              2013B                         240


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 3 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   code_module                 22 non-null     object
 1   code_presentation           22 non-null     object
 2   module_presentation_length  22 non-null     int64 
dtypes: int64(1), object(2)
memory usage: 656.0+ bytes
None

Loaded studentAssessment.csv: shape=(173912, 5)


   id_assessment  id_student  date_submitted  is_banked  score
0           1752       11391              18          0   78.0
1           1752       28400              22          0   70.0
2           1752       31604              17          0   72.0
3           1752       32885              26          0   69.0
4           1752       38053              19          0   79.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173912 entries, 0 to 173911
Data columns (total 5 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   id_assessment   173912 non-null  int64  
 1   id_student      173912 non-null  int64  
 2   date_submitted  173912 non-null  int64  
 3   is_banked       173912 non-null  int64  
 4   score           173739 non-null  float64
dtypes: float64(1), int64(4)
memory usage: 6.6 MB
None

Loaded studentInfo.csv: shape=(32593, 12)


   code_module  code_presentation  id_student  gender                region      highest_education  imd_band  age_band  num_of_prev_attempts  studied_credits  disability  final_result
0          AAA              2013J       11391       M   East Anglian Region       HE Qualification   90-100%      55<=                     0              240           N          Pass
1          AAA              2013J       28400       F              Scotland       HE Qualification    20-30%     35-55                     0               60           N          Pass
2          AAA              2013J       30268       F  North Western Region  A Level or Equivalent    30-40%     35-55                     0               60           Y     Withdrawn
3          AAA              2013J       31604       F     South East Region  A Level or Equivalent    50-60%     35-55                     0               60           N          Pass
4          AAA              2013J       32885       F  West Midlands Region     Lower Than A Level    50-60%      0-35                     0               60           N          Pass


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32593 entries, 0 to 32592
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   code_module           32593 non-null  object
 1   code_presentation     32593 non-null  object
 2   id_student            32593 non-null  int64 
 3   gender                32593 non-null  object
 4   region                32593 non-null  object
 5   highest_education     32593 non-null  object
 6   imd_band              31482 non-null  object
 7   age_band              32593 non-null  object
 8   num_of_prev_attempts  32593 non-null  int64 
 9   studied_credits       32593 non-null  int64 
 10  disability            32593 non-null  object
 11  final_result          32593 non-null  object
dtypes: int64(3), object(9)
memory usage: 3.0+ MB
None

Loaded studentRegistration.csv: shape=(32593, 5)


   code_module  code_presentation  id_student  date_registration  date_unregistration
0          AAA              2013J       11391             -159.0                  NaN
1          AAA              2013J       28400              -53.0                  NaN
2          AAA              2013J       30268              -92.0                 12.0
3          AAA              2013J       31604              -52.0                  NaN
4          AAA              2013J       32885             -176.0                  NaN


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32593 entries, 0 to 32592
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   code_module          32593 non-null  object 
 1   code_presentation    32593 non-null  object 
 2   id_student           32593 non-null  int64  
 3   date_registration    32548 non-null  float64
 4   date_unregistration  10072 non-null  float64
dtypes: float64(2), int64(1), object(2)
memory usage: 1.2+ MB
None

Loaded studentVle.csv: shape=(10655280, 6)


   code_module  code_presentation  id_student  id_site  date  sum_click
0          AAA              2013J       28400   546652   -10          4
1          AAA              2013J       28400   546652   -10          1
2          AAA              2013J       28400   546652   -10          1
3          AAA              2013J       28400   546614   -10         11
4          AAA              2013J       28400   546714   -10          1


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10655280 entries, 0 to 10655279
Data columns (total 6 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   code_module        object
 1   code_presentation  object
 2   id_student         int64 
 3   id_site            int64 
 4   date               int64 
 5   sum_click          int64 
dtypes: int64(4), object(2)
memory usage: 487.8+ MB
None

Loaded vle.csv: shape=(6364, 6)


   id_site  code_module  code_presentation  activity_type  week_from  week_to
0   546943          AAA              2013J       resource        NaN      NaN
1   546712          AAA              2013J      oucontent        NaN      NaN
2   546998          AAA              2013J       resource        NaN      NaN
3   546888          AAA              2013J            url        NaN      NaN
4   547035          AAA              2013J       resource        NaN      NaN


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6364 entries, 0 to 6363
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id_site            6364 non-null   int64  
 1   code_module        6364 non-null   object 
 2   code_presentation  6364 non-null   object 
 3   activity_type      6364 non-null   object 
 4   week_from          1121 non-null   float64
 5   week_to            1121 non-null   float64
dtypes: float64(2), int64(1), object(3)
memory usage: 298.4+ KB
None

assessments.csv:
['code_module', 'code_presentation', 'id_assessment', 'assessment_type', 'date', 'weight']
Missing values per column:
code_module           0
code_presentation     0
id_assessment         0
assessment_type       0
date                 11
weight                0
dtype: int64

courses.csv:
['code_module', 'code_presentation', 'module_presentation_length']
Missing values per column:
code_module                   0
code_pre