# Data validation
This notebook checks whether any students need to be removed as duplicates in the CG and EE survey data, and whether each student ID exists in the class list of the course the student selected. No changes are made to the dataset here; any changes should be made in the data_cleaning file. This notebook is for inspection only.

In [38]:
import pandas as pd
import os
import sys

project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(project_root)

from scripts import utils

In [39]:
courses_cg = ["COMP202", "COMP250"]
courses_ee = ["COMP251", "COMP424", "COMP551"]
courses = courses_cg + courses_ee
semester = "F2025"

Build the paths to all relevant files

In [40]:
data_dir = os.path.join(project_root, 'data')
course_data_dir = os.path.join(data_dir, 'student_emails_and_ids')

ee_file = f"{semester}_ee_clean.csv"  # reuse `semester` instead of hard-coding it
cg_file = f"{semester}_cg_clean.csv"
courses_files = {}
for course in courses:
    courses_files[course] = f"{semester}_{course}.csv"

ee_data_path = os.path.join(data_dir, 'clean', ee_file)
cg_data_path = os.path.join(data_dir, 'clean', cg_file)
courses_data_path = {}
for course in courses:
    courses_data_path[course] = os.path.join(course_data_dir, courses_files[course])

Open the cleaned data files

In [41]:
ee_data = pd.read_csv(ee_data_path, header=[0,1], index_col=0)
cg_data = pd.read_csv(cg_data_path, header=[0,1], index_col=0)
data_all = [ee_data, cg_data]

for d in data_all:
    utils.rebuild_multiindex(d)

Open the course data file and extract IDs

In [42]:
student_ids = {}
for course in courses:
    course_data = pd.read_csv(courses_data_path[course])
    ids = course_data.OrgDefinedId.str.strip("#")
    student_ids[course] = list(pd.to_numeric(ids))
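
To illustrate what the extraction step does to each ID, here is a minimal sketch on made-up data. It assumes the class-list export stores IDs as strings with a leading `#`, which is what the `str.strip("#")` call suggests; the DataFrame contents and the `id_set` name are hypothetical.

```python
import pandas as pd

# Hypothetical class-list excerpt; the real files are read from disk.
course_data = pd.DataFrame({"OrgDefinedId": ["#1001", "#1002"]})

ids = course_data["OrgDefinedId"].str.strip("#")  # drop the '#' characters
numeric_ids = pd.to_numeric(ids)                  # strings -> integers
id_set = set(numeric_ids.tolist())                # plain Python ints in a set

print(1001 in id_set)
```

Storing the IDs in a set rather than a list is optional, but it makes the later membership checks cheaper when class lists are large.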

### Duplicates
1. Check if we find the same student within each data file

In [43]:
no_duplicates = True
for d in data_all:
    students = set(d["StudentID"])
    if len(students) != len(d):
        print("One or more students responded in the survey multiple times within the same group.")
        no_duplicates = False

if no_duplicates:
    print("No duplicate students found within the EE or CG groups.")

No duplicate students found within the EE or CG groups.
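
The check above only says whether duplicates exist. If it ever fails, something like the following sketch could list which IDs are repeated. It runs on a made-up Series, not the survey data, and the names `responses` and `duplicated_ids` are hypothetical.

```python
import pandas as pd

# Hypothetical survey responses; the IDs are made up.
responses = pd.Series([1001, 1002, 1001, 1003], name="StudentID")

# keep=False marks every occurrence of a repeated ID, not only the later ones.
dup_mask = responses.duplicated(keep=False)
duplicated_ids = sorted(responses[dup_mask].unique().tolist())

print(duplicated_ids)
```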


2. Check if we find the same student across both files. 

In [44]:
no_duplicates = True
all_students = set(pd.concat([ee_data["StudentID"], cg_data["StudentID"]]))
if len(all_students) != len(ee_data) + len(cg_data):
    print("One or more students responded in both EE and CG surveys")
    no_duplicates = False

if no_duplicates:
    print("No duplicate students found across the EE and CG groups.")

No duplicate students found across the EE and CG groups.


TODO: handle the case where there are duplicate students 
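
As a possible starting point for this TODO, the sketch below identifies which students appear in both groups by intersecting the two sets of IDs. The data and the names `ee_ids`, `cg_ids` and `shared_ids` are hypothetical; how such students should be handled is left to the data_cleaning file.

```python
import pandas as pd

# Hypothetical ID columns for the two groups.
ee_ids = pd.Series([1001, 1002, 1003])
cg_ids = pd.Series([1003, 1004])

# IDs present in both groups.
shared_ids = sorted(set(ee_ids.tolist()) & set(cg_ids.tolist()))

print(shared_ids)
```

Unlike the length comparison above, which would also trip if a student appeared twice within a single file, the intersection isolates students who responded in both surveys.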

### Student validation
Verify that students who answered the survey are enrolled in the courses they selected in the survey.

1. EE and CG data

In [None]:
for i, d in enumerate(data_all):
    for _, row in d.iterrows():
        student_id = row["StudentID"][""]  # renamed to avoid shadowing the built-in id()
        for course in courses:

            # data_all[0] is the EE data and data_all[1] is the CG data,
            # so pick the course column that matches the dataset
            course_col_label = None
            if i == 0 and course in courses_ee:
                course_col_label = "EE course"
            elif i == 1 and course in courses_cg:
                course_col_label = "CG course"

            if course_col_label is not None and row[course_col_label][course] == 1:
                if student_id not in student_ids[course]:
                    print(f"Student doesn't exist in {course}") # modify to add student ID

Student doesn't exist in COMP202
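
The message above does not say which student is missing. One way to collect the IDs is sketched below with `Series.isin`. It assumes a flattened table with one 0/1 column per course, which is simpler than the two-level columns used in this notebook; the data and the names `survey`, `class_lists` and `missing` are hypothetical.

```python
import pandas as pd

# Hypothetical flattened survey: one 0/1 column per selected course.
survey = pd.DataFrame({
    "StudentID": [1001, 1002, 1003],
    "COMP202": [1, 1, 0],
    "COMP250": [0, 1, 1],
})

# Hypothetical class lists: course -> enrolled student IDs.
class_lists = {"COMP202": [1001], "COMP250": [1002, 1003]}

missing = {}
for course, enrolled in class_lists.items():
    # IDs of students who said they took this course
    selected = survey.loc[survey[course] == 1, "StudentID"]
    # keep only those who are absent from the class list
    missing[course] = selected[~selected.isin(enrolled)].tolist()

print(missing)
```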


TODO: if there are students who attended an EE lecture but also filled out the CG survey, we'll need to check their enrollment as well.

### Check how many students attended more than one lecture

In [57]:
sum_of_courses_ee = ee_data["EE course"].sum(axis=1)
print(f"Number of students who attended more than 1 EE lecture: {(sum_of_courses_ee > 1).sum()}")

Number of students who attended more than 1 EE lecture: 1
