**OULAD activity types**


Here I collapse OULAD’s 19 VLE activity types into three high-level media buckets. In vle.csv, each id_site maps to an activity_type; these are typically HTML pages, PDFs, external links, quizzes, or collaborative tools. Here’s a sensible mapping:

| High-level bucket | OULAD activity types | Notes |
| --- | --- | --- |
| Course | resource, oucontent, page, subpage, htmlactivity, folder, dualpane, glossary | Internal learning pages (text, multimedia embeds, structured content) that you’d treat as “course” interactions. |
| Reading | url (when the target URL ends in .pdf or points to an online book), ouwiki | Any external link to a PDF or book-style content, plus wiki/glossary pages, can be bucketed as “reading.” You’ll need to fetch/inspect the actual URL string. |
| Video | url (when the target domain is a video host like youtube.com/vimeo.com), ouelluminate, oucollaborate | Live-session tools (ouelluminate, oucollaborate) and any external links whose domains resolve to known video platforms get labeled “video.” |

To implement this mapping, I will join studentVle.csv with vle.csv on id_site to pull in activity_type.
For every url-type row, retrieve its actual URL from scraped VLE metadata or logs. If it ends in .pdf or matches an online-book domain, count it as reading. If the domain is YouTube, Vimeo, etc., count it as video.

Otherwise, either drop the row or assign it to a “misc” bucket.

Finally, aggregate per user to get the material_counts and material_prefs fields, now aligned to exactly three modalities: course, reading, video.

This lets us seed the tri-modal profiles from OULAD even though it doesn’t natively record “books” or “YouTube” events—by pattern-matching external links and grouping internal content pages.
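
The URL rule described above can be sketched as follows. This is a minimal illustration, assuming a URL string is available for each url-type id_site (the vle.csv columns used in this notebook only supply activity_type, so that lookup is an assumption). The host lists and the `classify_url` name are hypothetical placeholders, not part of OULAD.

```python
from urllib.parse import urlparse

# Hypothetical host lists: extend them to match the links actually seen.
VIDEO_HOSTS = {"youtube.com", "youtu.be", "vimeo.com"}
BOOK_HOSTS = {"books.google.com"}  # placeholder for "online book" domains


def classify_url(url):
    """Return 'video', 'reading', or 'misc' for an external link."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parsed.path.lower()

    if host in VIDEO_HOSTS:
        return "video"
    if path.endswith(".pdf") or host in BOOK_HOSTS:
        return "reading"
    return "misc"


for link in (
    "https://www.youtube.com/watch?v=abc",
    "http://example.org/notes/week1.PDF",
    "https://example.org/",
):
    print(link, "->", classify_url(link))
```

Exact host matching is deliberately strict: a subdomain such as m.youtube.com would fall into “misc” unless it is added to the set.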

In [14]:
import pandas as pd

In [None]:
# 1. Load OULAD data

path = "../data/raw/OULAD"
vle_df = pd.read_csv(f"{path}/vle.csv")
student_vle_df = pd.read_csv(f"{path}/studentVle.csv")
student_info_df = pd.read_csv(f"{path}/studentInfo.csv")

In [15]:
# 2. Define a mapping from detailed activity_type → media bucket

def map_media_bucket(activity_type):
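    """
    Collapse a detailed OULAD activity_type into a media bucket:
    'video', 'reading', 'course', or 'other' for unmapped types.

    Note: every 'url' click is counted as reading here; no URL
    strings are inspected.
    """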
    video_types = {'ouelluminate', 'oucollaborate'}
    course_types = {
        'resource', 'oucontent', 'quiz', 'forumng',
        'htmlactivity', 'dataplus', 'sharedsubpage',
        'questionnaire'
    }
    reading_types = {'url', 'ouwiki', 'glossary',
                     'folder', 'page', 'subpage'}
    if activity_type in video_types:
        return 'video'
    elif activity_type in reading_types:
        return 'reading'
    elif activity_type in course_types:
        return 'course'
    else:
        return 'other'

In [16]:
# 3. Merge click logs with activity types and map to media buckets
df = student_vle_df.merge(
    vle_df[['id_site', 'activity_type']],
    on='id_site', how='left'
)
df['media_type'] = df['activity_type'].apply(map_media_bucket)
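
Because map_media_bucket returns 'other' for anything outside its three sets (including the NaN activity_type left by an unmatched id_site in the left merge), it can help to see what ends up there. The sketch below uses a small made-up frame in place of df; the same groupby could be run on the real merged log.

```python
import pandas as pd

# Toy stand-in for the merged click log (values are made up).
toy = pd.DataFrame({
    "id_site": [1, 2, 3, 4],
    "activity_type": ["oucontent", "homepage", None, "url"],
    "media_type": ["course", "other", "other", "reading"],
    "sum_click": [5, 2, 1, 3],
})

# Clicks per activity_type that landed in 'other'; dropna=False keeps
# rows whose id_site had no match in vle.csv (activity_type is missing).
other_breakdown = (
    toy.loc[toy["media_type"] == "other"]
    .groupby("activity_type", dropna=False)["sum_click"]
    .sum()
    .sort_values(ascending=False)
)
print(other_breakdown)
```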

In [17]:
# 4. Aggregate raw counts per user per media_type
counts = (
    df.groupby(['id_student', 'media_type'])['sum_click']
    .sum()
    .unstack(fill_value=0)
)

In [18]:
# 5. Compute normalized proportions
props = counts.div(counts.sum(axis=1), axis=0)

In [19]:
# 6. Build user profile DataFrame with counts and proportions
demo_profiles = (
    counts.add_suffix('_count')
    .merge(props.add_suffix('_prop'),
           left_index=True, right_index=True)
    .reset_index()
)

In [20]:
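# 7. Merge demographics into profiles
#    Include gender, region, highest_education, imd_band, age_band.
#    Note: studentInfo.csv has one row per student per module presentation,
#    so a student registered more than once (e.g. 8462 in the preview)
#    gets repeated rows after this merge.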
merged = demo_profiles.merge(
    student_info_df[['id_student', 'gender', 'region',
                     'highest_education', 'imd_band', 'age_band']],
    on='id_student', how='left'
)
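
If one row per learner is wanted, one option is to reduce the demographics to a single row per id_student before merging. The sketch below uses toy frames with made-up values; keeping the first registration is an arbitrary choice made here for illustration, and demographics such as age_band could differ between a student's registrations.

```python
import pandas as pd

# Toy frames with made-up values; the real inputs would be
# demo_profiles and student_info_df.
toy_profiles = pd.DataFrame({
    "id_student": [8462, 6516],
    "course_count": [172, 2008],
})
toy_info = pd.DataFrame({
    "id_student": [8462, 8462, 6516],
    "code_presentation": ["2013J", "2014J", "2014J"],
    "age_band": ["55<=", "55<=", "55<="],
})

demo_cols = ["id_student", "age_band"]
# Keep the first registration per student (arbitrary tie-break).
one_row_per_student = toy_info[demo_cols].drop_duplicates(
    subset="id_student", keep="first"
)

# validate raises MergeError if either side still repeats id_student.
toy_merged = toy_profiles.merge(
    one_row_per_student, on="id_student", how="left",
    validate="one_to_one",
)
print(toy_merged)
```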


In [24]:
# 8. Save final profiles to CSV
out_path = "../data/processed/oulad_media_profiles_full.csv"
merged.to_csv(out_path, index=False)

In [23]:
# 9. Preview the result
merged.head()


| | id_student | course_count | other_count | reading_count | video_count | course_prop | other_prop | reading_prop | video_prop | gender | region | highest_education | imd_band | age_band |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 6516 | 2008 | 497 | 286 | 0 | 0.719455 | 0.178072 | 0.102472 | 0.0 | M | Scotland | HE Qualification | 80-90% | 55<= |
| 1 | 8462 | 172 | 203 | 268 | 13 | 0.262195 | 0.309451 | 0.408537 | 0.019817 | M | London Region | HE Qualification | 30-40% | 55<= |
| 2 | 8462 | 172 | 203 | 268 | 13 | 0.262195 | 0.309451 | 0.408537 | 0.019817 | M | London Region | HE Qualification | 30-40% | 55<= |
| 3 | 11391 | 759 | 138 | 37 | 0 | 0.812634 | 0.147752 | 0.039615 | 0.0 | M | East Anglian Region | HE Qualification | 90-100% | 55<= |
| 4 | 23629 | 120 | 36 | 5 | 0 | 0.745342 | 0.223602 | 0.031056 | 0.0 | F | East Anglian Region | Lower Than A Level | 20-30% | 0-35 |


This table is the “media-type profile” for each learner. For every id_student we have aggregated:

Raw counts:
course_count, reading_count, video_count, other_count
These are the total clicks (or interactions) the student made on each high-level bucket.

Normalized preferences:
course_prop, reading_prop, video_prop, other_prop
These are the fractions of the total clicks that each bucket represents (so each row sums to 1).
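
As a worked check of the normalization, the snippet below recomputes the proportions for student 6516 from the counts shown in the preview and confirms that the row sums to 1. Only the toy frame name is new.

```python
import pandas as pd

# Counts for student 6516, copied from the preview above.
toy_counts = pd.DataFrame(
    {"course": [2008], "other": [497], "reading": [286], "video": [0]},
    index=pd.Index([6516], name="id_student"),
)

# Divide each row by its own total (2008 + 497 + 286 + 0 = 2791).
toy_props = toy_counts.div(toy_counts.sum(axis=1), axis=0)

print(toy_props.round(6))
print("row sum:", float(toy_props.sum(axis=1).iloc[0]))
```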

We’ll load these counts and proportions into the Django UserProfile’s material_counts and material_prefs JSON fields, so every profile records both absolute and relative format preferences.
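
A minimal sketch of how one profile row could be turned into the two JSON payloads. The dict layout (bucket name to value) and the `row_to_profile_fields` helper are assumptions for illustration; the actual UserProfile model is not shown here, so no Django calls are made.

```python
import json

BUCKETS = ("course", "reading", "video", "other")


def row_to_profile_fields(row):
    """Split one profile row into (material_counts, material_prefs) dicts."""
    material_counts = {b: int(row[f"{b}_count"]) for b in BUCKETS}
    material_prefs = {b: float(row[f"{b}_prop"]) for b in BUCKETS}
    return material_counts, material_prefs


# Values for student 8462, copied from the preview above.
example_row = {
    "id_student": 8462,
    "course_count": 172, "other_count": 203,
    "reading_count": 268, "video_count": 13,
    "course_prop": 0.262195, "other_prop": 0.309451,
    "reading_prop": 0.408537, "video_prop": 0.019817,
}

counts_field, prefs_field = row_to_profile_fields(example_row)
# Plain ints and floats serialize cleanly, unlike NumPy scalar types.
print(json.dumps(counts_field))
print(json.dumps(prefs_field))
```

Casting to int and float matters when the row comes from a DataFrame, because NumPy integer scalars are not JSON-serializable by the standard json module.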

Cold-start & personalization
- For a new user, we can seed their profile with demographic-based averages of these proportions.
- As they interact, we update their counts/props in real time to reflect shifting tastes.

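The two cold-start steps can be sketched as below, assuming age_band is the demographic used for grouping (any of the merged demographic columns would work the same way). The toy proportions are made up, and `seed_prefs` and `record_click` are hypothetical helper names. The seed is renormalized because the 'other' bucket is left out of the three modalities.

```python
import pandas as pd

PROP_COLS = ["course_prop", "reading_prop", "video_prop"]

# Made-up profile rows standing in for the merged table.
toy = pd.DataFrame({
    "age_band": ["0-35", "0-35", "55<="],
    "course_prop": [0.6, 0.8, 0.5],
    "reading_prop": [0.3, 0.2, 0.3],
    "video_prop": [0.1, 0.0, 0.2],
})

group_means = toy.groupby("age_band")[PROP_COLS].mean()
overall_mean = toy[PROP_COLS].mean()


def seed_prefs(age_band):
    """Average proportions for the group, or the global mean if unseen."""
    if age_band in group_means.index:
        row = group_means.loc[age_band]
    else:
        row = overall_mean
    total = row.sum()
    return {col.replace("_prop", ""): float(v / total) for col, v in row.items()}


def record_click(counts, bucket, n=1):
    """Return updated counts and recomputed proportions after n clicks."""
    updated = dict(counts)
    updated[bucket] = updated.get(bucket, 0) + n
    total = sum(updated.values())
    prefs = {b: c / total for b, c in updated.items()}
    return updated, prefs


print(seed_prefs("0-35"))
print(record_click({"course": 3, "reading": 1, "video": 0}, "video"))
```
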
Recommender signals
- In the ML model (e.g. logistic regression or a bandit), these props become input features that help predict or rank which format (course, reading, or video) a learner is most likely to engage with next.
- The raw counts can also be used to gauge “confidence” (heavy-click users vs. light-click users) or to weight their topic/embedding centroids when blending signals.

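One way to turn raw counts into a confidence signal, not specified above and offered only as an assumption, is additive smoothing toward a prior: learners with few clicks stay close to the prior, while heavy-click learners are dominated by their own behaviour. The `strength` parameter and the `shrunk_prefs` name are hypothetical.

```python
def shrunk_prefs(counts, prior, strength=50.0):
    """Blend observed counts with a prior worth `strength` pseudo-clicks."""
    total = sum(counts.values())
    return {
        b: (counts.get(b, 0) + strength * prior[b]) / (total + strength)
        for b in prior
    }


# Uniform prior over the three modalities (an illustrative choice;
# demographic averages could be used instead).
uniform_prior = {"course": 1 / 3, "reading": 1 / 3, "video": 1 / 3}

heavy = shrunk_prefs({"course": 120, "reading": 5, "video": 0}, uniform_prior)
light = shrunk_prefs({"course": 2, "reading": 0, "video": 0}, uniform_prior)

print("heavy:", heavy)
print("light:", light)
```

Both learners clicked mostly on course material, but the light-click learner's smoothed course share stays much nearer to the prior than the heavy-click learner's does.
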
Together with their topic and embedding vectors, these media-type features let your system tailor not just what content to recommend, but in which format—ensuring that each learner sees more of the kinds of materials they actually prefer.
