Below is a Python script that loads the core OULAD tables, extracts “content-based” features from the VLE interactions (resource types and click counts) plus a few auxiliary signals (active days, assignment performance), and assembles a per-student feature matrix suitable for content-based profiling in a recommender.

In [23]:
import pandas as pd
import numpy as np


import os
import sys

In [18]:

def load_oulad(path="../data/raw/OULAD"):
    
    """Load the main OULAD CSVs into pandas DataFrames."""
    student_info       = pd.read_csv(f"{path}/studentInfo.csv")
    vle                = pd.read_csv(f"{path}/vle.csv")
    student_vle        = pd.read_csv(f"{path}/studentVle.csv")
    student_assess     = pd.read_csv(f"{path}/studentAssessment.csv")
    assessments        = pd.read_csv(f"{path}/assessments.csv")
    return student_info, vle, student_vle, student_assess, assessments

**Clicks by type**


Merge student_vle (which holds, for each student and VLE interaction record, the number of clicks) with vle[['id_site','activity_type']] so that every row knows its activity_type. Then group by student and activity_type, sum the clicks, and pivot so that each activity type becomes a column. The function also renames the columns (e.g. "forumng" → "click_forumng") and brings id_student back out of the index into its own column.

Returns: a pandas DataFrame where each row corresponds to one student (id_student) and each column after id_student is named click_<activity_type>, e.g. click_forumng, click_resource, click_url.
Each cell in column click_X holds the total number of clicks that the student made on VLE resources of type X.

In [24]:
def build_content_features(vle, student_vle):
    """
    Join student_vle to vle to get activity_type, then pivot to get total clicks per resource type per student (content profile).
    """
    # join in the resource type
    sv = student_vle.merge(
        vle[['id_site', 'activity_type']],
        on='id_site',
        how='left'
    )
    # sum clicks per student × activity_type
    clicks_by_type = (sv
        .groupby(['id_student', 'activity_type'])['sum_click']
        .sum()
        .unstack(fill_value=0)
        .add_prefix("click_") 
        .reset_index()
    )
    return clicks_by_type
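
To make the merge, group and pivot steps concrete, here is a minimal sketch on toy data. The two small frames are made-up stand-ins for vle.csv and studentVle.csv, and the variable names (toy_vle, toy_student_vle, toy_profile) are illustrative only.

```python
import pandas as pd

# Toy stand-ins for vle.csv and studentVle.csv (values are made up).
toy_vle = pd.DataFrame({
    "id_site": [1, 2, 3],
    "activity_type": ["forumng", "resource", "quiz"],
})
toy_student_vle = pd.DataFrame({
    "id_student": [10, 10, 10, 20],
    "id_site":    [1, 1, 2, 3],
    "date":       [0, 1, 1, 2],
    "sum_click":  [3, 2, 5, 4],
})

# Same steps as build_content_features: attach the type, sum, pivot wide.
sv = toy_student_vle.merge(toy_vle, on="id_site", how="left")
toy_profile = (
    sv.groupby(["id_student", "activity_type"])["sum_click"]
      .sum()
      .unstack(fill_value=0)   # activity types become columns, gaps become 0
      .add_prefix("click_")
      .reset_index()
)
print(toy_profile)
```

Student 10 ends up with 5 forum clicks (3 + 2) and 5 resource clicks, while the quiz column is filled with 0 because that student never opened a quiz.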

**Engagement features**

The build_engagement_features function condenses each student’s raw VLE click logs into a small set of interpretable engagement statistics.

The function returns a pandas DataFrame with one row per student and these columns:
 - id_student (int): unique student identifier
 - n_active_days (int): count of distinct days on which the student accessed any VLE content
 - n_sessions (int): total number of VLE interaction records (i.e. sessions)
 - total_clicks (int): sum of all clicks the student made across the VLE
 - mean_clicks_per_session (float): average clicks per session

How we use these features:
 - Normalize click counts: a student with more sessions naturally accumulates more clicks, so use mean_clicks_per_session to compare students fairly.
 - Time-based analysis: n_active_days shows how consistently a student engages.
 - Early warning signals: very low values (e.g. 1–2 active days) could trigger a “re-engagement” notification.
 - Feature fusion: concatenate these with the content-based click vectors to give the recommender a sense of what students click, how often, and how densely.

In [20]:
def build_engagement_features(student_vle):
    """
    Simple engagement features: number of distinct active days,  total sessions, mean clicks per session.
    """
    # count of distinct days active
    days_active = (student_vle
        .groupby('id_student')['date']
        .nunique()
        .rename("n_active_days")
        .reset_index()
    )
    # total number of VLE records (sessions)
    sessions = (student_vle
        .groupby('id_student')
        .size()
        .rename("n_sessions")
        .reset_index()
    )
    # total clicks per student
    total_clicks = (student_vle
        .groupby('id_student')['sum_click']
        .sum()
        .rename("total_clicks")
        .reset_index()
    )
    # mean clicks per session
    df = days_active.merge(sessions, on='id_student')
    df = df.merge(total_clicks, on='id_student')
    df['mean_clicks_per_session'] = df['total_clicks'] / df['n_sessions']
    return df
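
As an illustration of the early-warning use described earlier, the sketch below flags students with very few active days. The toy table mirrors the columns returned by build_engagement_features; the threshold MIN_ACTIVE_DAYS and the needs_nudge column are assumptions for the example, not part of the pipeline.

```python
import pandas as pd

# Toy engagement table with the same columns build_engagement_features returns.
toy_engage = pd.DataFrame({
    "id_student": [10, 20, 30],
    "n_active_days": [1, 14, 2],
    "n_sessions": [4, 120, 3],
    "total_clicks": [9, 480, 3],
})
toy_engage["mean_clicks_per_session"] = (
    toy_engage["total_clicks"] / toy_engage["n_sessions"]
)

# Hypothetical threshold: 2 or fewer active days counts as "at risk".
MIN_ACTIVE_DAYS = 2
toy_engage["needs_nudge"] = toy_engage["n_active_days"] <= MIN_ACTIVE_DAYS

at_risk_ids = toy_engage.loc[toy_engage["needs_nudge"], "id_student"].tolist()
print(at_risk_ids)
```

A sensible cut-off depends on how long the course has been running, so in practice the threshold would probably need tuning per module.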

**Assignment-related features**

We don’t have to include the assignment‐related features in our content-based user model — if all we care about is how a student interacts with different types of learning objects (videos, quizzes, forums, etc.), then the click-based profile alone is sufficient.
However, assignment features can add a different—and often valuable—signal about the learner’s mastery and engagement:

 - Average score ratio (avg_score_ratio) tells you how well they’re actually performing on graded work, not just clicking around.

 - % on-time submissions (pct_on_time) is a proxy for their self-regulation and commitment.

 - Number of submissions (n_submissions) can indicate how many assessments they attempted—another engagement metric.


Pure content-based filtering: if we only want to match students to resources by their click profiles (e.g. cosine similarity on click_forumng, click_quiz, …), then we can omit build_assignment_features.

Cold-start or limited data: if many students haven’t done any graded work yet, those features will be mostly zeros and add noise.
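
For the pure content-based case, a cosine similarity over click profiles could look roughly like the sketch below. It compares students with each other as an illustration; the helper cosine_similarity_matrix and the toy click counts are assumptions, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy click profiles (made-up counts), indexed by student.
toy_clicks = pd.DataFrame({
    "id_student": [10, 20, 30],
    "click_forumng": [10, 20, 0],
    "click_quiz": [0, 0, 5],
    "click_resource": [5, 10, 5],
}).set_index("id_student")

def cosine_similarity_matrix(profiles: pd.DataFrame) -> pd.DataFrame:
    """Pairwise cosine similarity between the rows of `profiles`."""
    X = profiles.to_numpy(dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid 0/0 for students with no clicks
    unit = X / norms                 # each row scaled to unit length
    return pd.DataFrame(unit @ unit.T, index=profiles.index, columns=profiles.index)

sim = cosine_similarity_matrix(toy_clicks)
print(sim.round(3))
```

Students 10 and 20 have the same mix of activity types at different volumes, so their similarity is 1; cosine similarity ignores how much a student clicks and looks only at the proportions.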

When we should include it:

Hybrid recommendation: a hybrid recommender can combine “what they click” (interest) with “how well they score” (ability). For instance, only recommend advanced material to students whose avg_score_ratio > 0.8.

Difficulty adaptation: use avg_score_ratio to infer their proficiency in a topic and personalize the difficulty of recommended items.

Engagement alerts: a low pct_on_time might trigger nudges (“Hey, you’re falling behind!”) or surface easier “catch-up” resources.
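
The gating rules above can be sketched as a simple filter over candidate items. This assumes avg_score_ratio lies on a 0 to 1 scale, as the 0.8 threshold suggests. The item catalogue, its level column, and the LATE_MAX_ON_TIME cut-off are hypothetical and do not come from OULAD.

```python
import pandas as pd

# Toy assignment features (made-up values).
toy_students = pd.DataFrame({
    "id_student": [10, 20, 30],
    "avg_score_ratio": [0.92, 0.55, 0.81],
    "pct_on_time": [1.0, 0.4, 0.9],
})

# Hypothetical catalogue; "level" is not an OULAD column.
toy_items = pd.DataFrame({
    "item": ["intro_quiz", "advanced_reading", "catch_up_notes"],
    "level": ["basic", "advanced", "catch_up"],
})

ADVANCED_MIN_SCORE = 0.8   # threshold quoted in the text
LATE_MAX_ON_TIME = 0.5     # assumed cut-off for a "falling behind" nudge

def allowed_levels(row):
    """Return the item levels a student may be shown."""
    levels = {"basic"}
    if row["avg_score_ratio"] > ADVANCED_MIN_SCORE:
        levels.add("advanced")
    if row["pct_on_time"] < LATE_MAX_ON_TIME:
        levels.add("catch_up")
    return levels

candidates = {
    int(row["id_student"]): toy_items.loc[
        toy_items["level"].isin(list(allowed_levels(row))), "item"
    ].tolist()
    for _, row in toy_students.iterrows()
}
print(candidates)
```

A click-based ranker would then order the items inside each student's allowed set, so ability acts as a filter and interest as the ranking signal.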



In [21]:
def build_assignment_features(student_assess, assessments):
    """
    Merge studentAssessment with assessments to get deadlines, then compute per-student average score ratio and on-time rate.
    """
    sa = student_assess.merge(
        assessments[['id_assessment', 'weight', 'date', 'assessment_type']],
        on='id_assessment',
        how='left'
    )
    # score as a fraction of the maximum mark (OULAD scores run from 0 to 100);
    # dividing by 'weight' would give inf for assessments whose weight is 0
    sa['score_ratio'] = sa['score'] / 100.0
    # on-time submission: date_submitted <= deadline
    sa['on_time'] = (sa['date_submitted'] <= sa['date']).astype(int)
    # aggregate per student
    agg = sa.groupby('id_student').agg({
        'score_ratio': 'mean',
        'on_time': 'mean',
        'id_assessment': 'count'
    }).rename(columns={
        'score_ratio': 'avg_score_ratio',
        'on_time': 'pct_on_time',
        'id_assessment': 'n_submissions'
    }).reset_index()
    return agg
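
One subtlety in the on-time flag: if assessments.csv contains rows with no recorded deadline, the comparison against a missing date evaluates to False, so those submissions count as late. The sketch below shows the effect on toy data and one possible alternative (leaving the flag undefined so that the mean skips it). The frame and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy merged table: the second submission has no recorded deadline.
toy_sa = pd.DataFrame({
    "id_student": [10, 10, 20],
    "date_submitted": [18, 240, 25],
    "date": [19.0, np.nan, 19.0],   # deadline; NaN = none recorded
})

# Plain comparison: a NaN deadline compares as False, i.e. "late".
naive_on_time = (toy_sa["date_submitted"] <= toy_sa["date"]).astype(int)

# Alternative: keep on_time as NaN when no deadline is known,
# so that mean() ignores those rows instead of counting them as late.
on_time = (toy_sa["date_submitted"] <= toy_sa["date"]).astype(float)
on_time = on_time.where(toy_sa["date"].notna())

naive_pct = naive_on_time.groupby(toy_sa["id_student"]).mean()
masked_pct = on_time.groupby(toy_sa["id_student"]).mean()
print(pd.DataFrame({"naive": naive_pct, "masked": masked_pct}))
```

Which behaviour is preferable depends on what a missing deadline means in the data; the masked variant simply avoids penalizing students for it.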

The assemble_features function turns the raw OULAD tables into a single per-student feature matrix that we can feed straight into any recommender or modeling pipeline.

1. Load all the raw tables: student_info, vle, student_vle, student_assess, assessments = load_oulad(path)

 - student_info: demographic and enrollment data, one row per student registration (the same id_student can appear more than once, hence the drop_duplicates in step 3)
 - vle: metadata about each Virtual Learning Environment (VLE) resource (type, module, etc.)
 - student_vle: every click/session record (student × resource × date × click count)
 - student_assess: every assessment submission (student × assessment × score × submission date)
 - assessments: metadata about each assessment (weight, deadline, type)

2. Build three “feature blocks”: 

content_feats    = build_content_features(vle, student_vle)

engage_feats     = build_engagement_features(student_vle)

assignment_feats = build_assignment_features(student_assess, assessments)


3. Merge into one DataFrame

df = student_info[['id_student']].drop_duplicates()

df = df.merge(content_feats,    on='id_student', how='left')

df = df.merge(engage_feats,     on='id_student', how='left')

df = df.merge(assignment_feats, on='id_student', how='left')

This starts with a master list of every id_student, then left-merges each feature block so that students with no clicks or no submissions still appear.

4. Clean up missing values

df.fillna(0, inplace=True): any student who never clicked a given activity type or never submitted an assessment now has a 0 in those columns.

5. Return value
The function returns a pandas DataFrame df with one row per student (id_student) and columns:

- Content features: click_forumng, click_resource, click_quiz, … (one per activity_type)
- Engagement features: n_active_days; n_sessions; total_clicks; mean_clicks_per_session
- Assignment features: avg_score_ratio; pct_on_time; n_submissions

In [25]:

def assemble_features(path):
    # load raw tables
    student_info, vle, student_vle, student_assess, assessments = load_oulad(path)

    # engineer feature blocks
    content_feats    = build_content_features(vle, student_vle)
    engage_feats     = build_engagement_features(student_vle)
    assignment_feats = build_assignment_features(student_assess, assessments)

    # join them all
    df = student_info[['id_student']].drop_duplicates()
    df = df.merge(content_feats,    on='id_student', how='left')
    df = df.merge(engage_feats,     on='id_student', how='left')
    df = df.merge(assignment_feats, on='id_student', how='left')

    # fill missing (students with no interactions or submissions)
    df.fillna(0, inplace=True)

    return df
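
A side effect of filling every gap with 0 is that "no submissions at all" becomes indistinguishable from "submitted and scored 0". If that distinction matters downstream, one option is to record indicator columns before filling. This is a minimal sketch on toy data; the has_vle_activity and has_submissions columns are hypothetical additions, not part of assemble_features.

```python
import numpy as np
import pandas as pd

# Toy result of the left-merges: NaN marks a student missing from a block.
toy_merged = pd.DataFrame({
    "id_student": [10, 20],
    "total_clicks": [42.0, np.nan],      # student 20 has no VLE records
    "avg_score_ratio": [np.nan, 0.7],    # student 10 has no submissions
    "n_submissions": [np.nan, 3.0],
})

# Hypothetical indicator columns, computed before the gaps are filled.
toy_merged["has_vle_activity"] = toy_merged["total_clicks"].notna().astype(int)
toy_merged["has_submissions"] = toy_merged["n_submissions"].notna().astype(int)

toy_filled = toy_merged.fillna(0)
print(toy_filled)
```

A model can then treat a 0 in avg_score_ratio differently depending on whether has_submissions is 0 or 1.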

In [26]:
features = assemble_features(path="../data/raw/OULAD/")
print(features.shape)          # (n_students, n_columns), one row per unique id_student
print(features.columns.tolist())

(28785, 28)
['id_student', 'click_dataplus', 'click_dualpane', 'click_externalquiz', 'click_folder', 'click_forumng', 'click_glossary', 'click_homepage', 'click_htmlactivity', 'click_oucollaborate', 'click_oucontent', 'click_ouelluminate', 'click_ouwiki', 'click_page', 'click_questionnaire', 'click_quiz', 'click_repeatactivity', 'click_resource', 'click_sharedsubpage', 'click_subpage', 'click_url', 'n_active_days', 'n_sessions', 'total_clicks', 'mean_clicks_per_session', 'avg_score_ratio', 'pct_on_time', 'n_submissions']
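
Raw click counts mix interest with overall activity level. One optional way to separate the two, sketched here on toy data, is to convert the click_ columns into per-student shares. The frame below is made up and only mimics a few of the columns listed above; row-normalization is one possible choice, not a step the pipeline performs.

```python
import pandas as pd

# Toy slice of the feature matrix (made-up values).
toy_features = pd.DataFrame({
    "id_student": [10, 20, 30],
    "click_forumng": [5, 0, 0],
    "click_quiz": [0, 4, 0],
    "click_resource": [5, 12, 0],
    "total_clicks": [10, 16, 0],
})

click_cols = [c for c in toy_features.columns if c.startswith("click_")]
row_totals = toy_features[click_cols].sum(axis=1)

# Divide each row by its total; students with zero clicks stay all-zero.
shares = (
    toy_features[click_cols]
    .div(row_totals.where(row_totals > 0), axis=0)
    .fillna(0.0)
)
shares.insert(0, "id_student", toy_features["id_student"])
print(shares)
```

Each non-empty row now sums to 1, so two students with the same mix of activity types look alike regardless of how many clicks they logged.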


In [28]:

#  Write out to CSV, creating the output directory first if needed
out_path = "./data/processed/oulad_user_features.csv"
os.makedirs(os.path.dirname(out_path), exist_ok=True)
features.to_csv(out_path, index=False)
print(f"Saved {features.shape[0]} rows × {features.shape[1]} cols to {out_path}")


Saved 28785 rows × 28 cols to ./data/processed/oulad_user_features.csv
