# Data Wrangling using Pandas

## Dataset Description: Student Assessment Questionnaires

The dataset `assessment_generated.csv` contains information derived from student assessment questionnaires.

Each record represents an individual student's response and includes demographic, academic, and self-assessment information. The dataset comprises the following attributes:

- **`timestamp`**  
  The date and time when the assessment was submitted, formatted as `yyyy-mm-dd hh:mm:ss ±hhmm`, where the last field is the UTC offset (for example, `2025-09-04 01:21:03 +0300`).

- **`netid`**  
  The encoded NetID of the student. Valid NetIDs must have a string length between 8 and 14 characters (inclusive). Entries falling outside this range are considered invalid.

- **`ruid`**  
  The encoded RUID of the student. A valid RUID is expected to contain exactly 18 characters. Any deviation from this length is considered invalid.

- **`section`**  
  The course section number as reported by the student. This field may contain inaccuracies, as some students provided incorrect section information.

- **`role`**  
  The academic standing of the student. Possible values include:
  - `Freshman`
  - `Sophomore`
  - `Junior`
  - `Senior`
  - `Graduate`
  - `Other`

- **`major`**  
  The declared major of the student. Accepted categories are:
  - `Computer Science`
  - `Electrical and Computer Engineering`
  - `Mathematics`
  - `Other`

- **Skill Proficiency Columns**  
  The following columns record students’ self-assessed proficiency levels in specific skills, rated on scales ranging from 0 up to a multiple of 5 (depending on the number of questions per topic). Missing values are present in some entries.

  - `data_structures`  
  - `calculus_and_linear_algebra`  
  - `probability_and_statistics`  
  - `data_visualization`  
  - `python_libraries`  
  - `shell_scripting`  
  - `sql`  
  - `python_scripting`  
  - `jupyter_notebook`  
  - `regression`  
  - `programming_languages`  
  - `algorithms`  
  - `complexity_measures`  
  - `visualization_tools`  
  - `massive_data_processing`
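
The length rules stated for `netid` and `ruid` can be checked mechanically. The following is a minimal sketch on a small toy frame; the IDs are adapted from the sample rows shown later (one RUID is lengthened on purpose), and the `netid_valid` / `ruid_valid` column names are illustrative assumptions rather than part of the dataset.

```python
import pandas as pd

# Toy records adapted from the sample output; not the real dataset.
toy = pd.DataFrame({
    "netid": ["d2dbd3d0d5", "87", "c7dd9ac7c494"],
    "ruid": ["786a2021217c6e2022", "6071", "60703e393965793e3d3d"],
})

netid_len = toy["netid"].astype(str).str.len()
ruid_len = toy["ruid"].astype(str).str.len()

# NetID: 8 to 14 characters inclusive; RUID: exactly 18 characters.
toy["netid_valid"] = netid_len.between(8, 14)
toy["ruid_valid"] = ruid_len.eq(18)
print(toy)
```

`Series.between` is inclusive on both ends by default, which matches the "8 and 14 characters (inclusive)" wording.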


## Tasks

- Import Data
  - Load `assessment_generated.xlsx` as a DataFrame named `student_assessment_xlsx`.
- Verify NetIDs
  - Ensure that `student_assessment_xlsx` contains the same set of `netid` values as `student_assessment`.
- Analyze RUID Lengths
  - Display the frequency of each `ruid` length.
  - Display records where the `ruid` length exceeds 20 characters.
- Compute Total Score
  - Create a new column `total_score` as the sum of all skill proficiency columns.
  - Sort records by `total_score` in descending order.
- Group Statistics by Section
  - Group students by `section` and compute the mean, median, and standard deviation for each skill proficiency column.
- Pivot Table by Role and Section
  - Create a pivot table where:
    - Rows correspond to `role`
    - Columns correspond to `section` numbers
    - Entries contain the average `total_score`
- Format Timestamp
  - Convert `timestamp` values to the EST timezone instead of UTC.
- Normalize Skill Proficiency
  - For each proficiency column, apply z-score normalization rather than min-max scaling.
- Handle Missing Values
  - Fill missing values in each skill proficiency column with the column mean.
- Remove Duplicate Records
  - Drop duplicates while keeping only the record with the highest `total_score`.
- Resolve Swapped IDs
  - Identify records where students may have swapped `netid` and `ruid`. (Hint: `netid` should be shorter than `ruid`.)
  - For records with swapped `netid` and `ruid`, attempt a join using:
    - `student_assessment.ruid = student_list.netid`


### Setup Code (Please run this first to set up the environment)

In [14]:
import numpy as np
import pandas as pd
from sklearn import preprocessing

#%pip install openpyxl


In [15]:
if __name__ == "__main__":
    skill_columns = [
        'data_structures', 'calculus_and_linear_algebra', 'probability_and_statistics',
        'data_visualization', 'python_libraries', 'shell_scripting', 'sql',
        'python_scripting', 'jupyter_notebook', 'regression', 'programming_languages',
        'algorithms', 'complexity_measures', 'visualization_tools', 'massive_data_processing'
    ]

    CSV_PATH = 'assessment_generated.csv'
    # load csv file to a Pandas dataframe named student_assessments_csv
    student_assessments_csv = pd.read_csv(CSV_PATH)
    display(student_assessments_csv)

    STUDENT_LIST_PATH = 'student_list_generated.csv'
    student_list_df = pd.read_csv(STUDENT_LIST_PATH)

Unnamed: 0,timestamp,netid,ruid,section,role,major,data_structures,calculus_and_linear_algebra,probability_and_statistics,data_visualization,...,shell_scripting,sql,python_scripting,jupyter_notebook,regression,programming_languages,algorithms,complexity_measures,visualization_tools,massive_data_processing
0,2025-09-04 01:21:03 +0300,d2dbd3d0d5,786a2021217c6e2022,1,Junior,Computer Science,28.0,20.0,41.0,28.0,...,5.0,5.0,0.0,5.0,12.0,22.0,20.0,4.0,14.0,9.0
1,2025-09-04 00:28:39 +0200,c7dd9ac7c494,60703e393965793e3d,1,Junior,Computer Science,7.0,11.0,15.0,11.0,...,0.0,5.0,1.0,,0.0,15.0,9.0,4.0,1.0,2.0
2,2025-09-03 18:22:47 -0400,5d504543461b,0f1d55565609195250,1,Senior,Mathematics,22.0,15.0,22.0,22.0,...,,5.0,0.0,0.0,4.0,15.0,,,9.0,1.0
3,2025-09-04 06:29:53 +0800,021b4e0503,5145080b0b52450c0a,3,Senior,Computer Science,29.0,19.0,55.0,28.0,...,1.0,5.0,5.0,1.0,,27.0,32.0,10.0,,8.0
4,2025-09-03 16:31:34 -0600,8b8cc28089de,5d4c0104045b490005,1,Junior,Computer Science,25.0,14.0,43.0,23.0,...,1.0,5.0,1.0,1.0,12.0,13.0,7.0,0.0,6.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,2025-09-04 00:21:28 +0200,a89a979d95,8b9bdad2d28e92d2da,2,Sophomore,Mathematics,18.0,,26.0,16.0,...,1.0,5.0,0.0,0.0,4.0,17.0,10.0,3.0,19.0,4.0
153,2025-09-03 22:22:08 +0000,eaecb4edea,8d9dd4d4d48f99d0d0,2,Senior,Computer Science,14.0,5.0,5.0,13.0,...,1.0,5.0,1.0,1.0,0.0,8.0,7.0,0.0,0.0,0.0
154,2025-09-03 18:29:26 -0400,e7d58dcecf9e,bbaae7e2e2b1aceae7,1,Senior,Computer Science,11.0,9.0,14.0,10.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
155,2025-09-03 17:26:09 -0500,eef7a3e5e6,03125d5a5a06105b5d,3,Senior,Electrical and Computer Engineering,35.0,19.0,39.0,34.0,...,1.0,0.0,0.0,5.0,6.0,25.0,25.0,6.0,12.0,0.0


### Import Data

In [16]:
def import_data(path):
    """
    Load assessment_generated.xlsx as a DataFrame 
    IN: path, str, path to the file
    OUT: student_assessment_xlsx, pd.DataFrame
    """
    # Your_Code_Here
    student_assessment_xlsx = pd.read_excel(path)
    return student_assessment_xlsx

In [17]:
if __name__ == "__main__":
    XLSX_PATH = 'assessment_generated.xlsx'
    # load xlsx file to a Pandas dataframe named student_assessments_xlsx
    student_assessments_xlsx = import_data(XLSX_PATH)

### Verify NetIDs

In [18]:
def verify_netids(df_csv, df_xlsx):
    """
    Verify that the NetIDs in the CSV and XLSX files match
    IN: df_csv, pd.DataFrame, dataframe from CSV file
        df_xlsx, pd.DataFrame, dataframe from XLSX file
    OUT: flag, bool, True if NetIDs match, False otherwise
    """
    # Your_Code_Here
    for name, df in [("CSV", df_csv), ("XLSX", df_xlsx)]:
        if "netid" not in df.columns:
            raise KeyError(f"'netid' column missing in {name} dataframe")

    csv_netids  = set(df_csv["netid"].astype(str).str.strip().str.lower())
    xlsx_netids = set(df_xlsx["netid"].astype(str).str.strip().str.lower())

    # set equality already yields a bool, so it can be returned directly
    flag = (csv_netids == xlsx_netids)
    return flag

In [19]:
if __name__ == "__main__":
    netid_same_flag = verify_netids(student_assessments_csv, student_assessments_xlsx)
    print(f"NetIDs match: {netid_same_flag}")

NetIDs match: True
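
When the two sets do not match, it is usually more helpful to see which NetIDs differ than to get a bare `False`. The sketch below uses a hypothetical helper, `diff_netids`, and two made-up frames. It drops missing values before converting to strings, because `astype(str)` would otherwise turn a missing entry into a literal string.

```python
import pandas as pd

def diff_netids(df_a, df_b, col="netid"):
    """Return (only_in_a, only_in_b) as sorted lists of normalized IDs."""
    a = set(df_a[col].dropna().astype(str).str.strip().str.lower())
    b = set(df_b[col].dropna().astype(str).str.strip().str.lower())
    return sorted(a - b), sorted(b - a)

# Made-up frames: one ID differs only by case/whitespace, one is extra.
csv_like = pd.DataFrame({"netid": ["abc12345", "Def67890 ", None]})
xlsx_like = pd.DataFrame({"netid": ["abc12345", "def67890", "zzz00000"]})

only_csv, only_xlsx = diff_netids(csv_like, xlsx_like)
print("only in CSV:", only_csv)
print("only in XLSX:", only_xlsx)
```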


### Analyze RUID Lengths

In [20]:
def get_ruid_length_freq(df):
    """
    Get the frequency of RUID lengths in the dataframe
    IN: df, pd.DataFrame, dataframe containing RUIDs
    OUT: ruid_length_freq, dict, dictionary with RUID lengths as keys and their frequencies as values
    """
    # Your_Code_Here
    if "ruid" not in df.columns:
        raise KeyError("Expected RUID not found")
    
    ruid_length_freq = (
        df["ruid"]
        .dropna()
        .astype(str)
        .map(len)
        .value_counts()
        .sort_index()
        .to_dict()
    )
    return ruid_length_freq

In [21]:
if __name__ == "__main__":
    ruid_length_freq = get_ruid_length_freq(student_assessments_xlsx)
    for length, freq in sorted(ruid_length_freq.items()):
        print(f"RUID Length: {length:<2}, Frequency: {freq:<3}")

RUID Length: 4 , Frequency: 1  
RUID Length: 6 , Frequency: 1  
RUID Length: 10, Frequency: 2  
RUID Length: 16, Frequency: 2  
RUID Length: 18, Frequency: 147
RUID Length: 20, Frequency: 3  
RUID Length: 28, Frequency: 1  


In [22]:
def get_ruid_length_outliers(df):
    """
    Identify RUIDs with lengths that are > 20 characters
    IN: df, pd.DataFrame, dataframe containing RUIDs
    OUT: outliers, pd.DataFrame, dataframe containing outlier RUIDs
    """
    # Your_Code_Here
    if "ruid" not in df.columns:
        raise KeyError("Expected RUID not found")
    
    threshold = 20
    ruid_len = df["ruid"].astype(str).str.len()
    outliers = df.loc[ruid_len > threshold].copy()
    outliers["ruid_length"] = ruid_len[ruid_len > threshold]

    return outliers


In [23]:
if __name__ == "__main__":
    outliers = get_ruid_length_outliers(student_assessments_xlsx)
    display(outliers)

Unnamed: 0,timestamp,netid,ruid,section,role,major,data_structures,calculus_and_linear_algebra,probability_and_statistics,data_visualization,...,sql,python_scripting,jupyter_notebook,regression,programming_languages,algorithms,complexity_measures,visualization_tools,massive_data_processing,ruid_length
33,2025-09-04 06:38:55 +0800,534b5f535e,e3fabab9bae1f7b2bcbee8f1beb8,4,Junior,Computer Science,13.0,13.0,44.0,28.0,...,,1.0,1.0,0.0,,1.0,0.0,4.0,0.0,28


### Compute Total Score

Note: Skip `nan` entries when computing sum.

In [24]:
def compute_total_score(df, skill_columns):
    """
    Compute the total score for each student
    IN: df, pd.DataFrame, dataframe containing student assessments
    IN: skill_columns, list of str, list of columns to sum for total score
    OUT: df, pd.DataFrame, dataframe with total_score column
    """
    # Your_Code_Here
    missing = [c for c in skill_columns if c not in df.columns]
    if missing:
        raise KeyError(f"Missing columns: {missing}")

    nums = df[skill_columns].apply(pd.to_numeric, errors="coerce").fillna(0)
    df = df.copy()
    df["total_score"] = nums.sum(axis=1)
    return df

In [25]:
def sort_by_total_score(df):
    """
    Sort the dataframe by total score in descending order
    IN: df, pd.DataFrame, dataframe containing total_score column
    OUT: df_sorted, pd.DataFrame, sorted dataframe
    """
    # Your_Code_Here
    if "total_score" not in df.columns:
        raise KeyError("Expected a 'total_score' column")
    df_sorted = df.sort_values("total_score", ascending=False).reset_index(drop=True)
    return df_sorted

In [26]:
if __name__ == "__main__":
    student_assessments_with_total_score = compute_total_score(student_assessments_xlsx.copy(), skill_columns)
    student_assessments_with_total_score = sort_by_total_score(student_assessments_with_total_score)
    display(student_assessments_with_total_score)

Unnamed: 0,timestamp,netid,ruid,section,role,major,data_structures,calculus_and_linear_algebra,probability_and_statistics,data_visualization,...,sql,python_scripting,jupyter_notebook,regression,programming_languages,algorithms,complexity_measures,visualization_tools,massive_data_processing,total_score
0,2025-09-03 12:39:05 -1000,54490a53500a,fceeaca5a5fbeea3ac,3,Senior,Computer Science,35.0,15.0,44.0,35.0,...,5.0,5.0,5.0,15.0,22.0,26.0,10.0,14.0,6.0,258.0
1,2025-09-03 13:29:53 -0900,f3edfaf9faa5,b8a9e2e1e1baaee7e4,1,Sophomore,Computer Science,24.0,20.0,43.0,35.0,...,5.0,5.0,5.0,9.0,25.0,22.0,9.0,22.0,8.0,252.0
2,2025-09-04 04:20:41 +0600,f9e1fefbff,7766292e2e7161292d,1,Junior,Computer Science,35.0,15.0,49.0,35.0,...,0.0,5.0,5.0,13.0,22.0,22.0,5.0,15.0,5.0,247.0
3,2025-09-03 22:09:39 +0000,6668646765,05175f5c5c03145d5c,3,Junior,Computer Science,27.0,17.0,54.0,29.0,...,5.0,5.0,5.0,,28.0,25.0,8.0,23.0,,246.0
4,2025-09-04 01:13:32 +0300,1b0f13121349,2d3f7274742f387074,1,Senior,Computer Science,34.0,25.0,36.0,35.0,...,0.0,5.0,5.0,12.0,27.0,28.0,5.0,,12.0,243.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,2025-09-03 18:29:26 -0400,e7d58dcecf9e,bbaae7e2e2b1aceae7,1,Senior,Computer Science,11.0,9.0,14.0,10.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,45.0
153,2025-09-03 12:14:09 -1000,9491999b9dc3,ccde919595cbdb9095,1,Junior,Computer Science,0.0,,0.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
154,2025-09-03 11:36:30 -1100,203e2a2928,2e3a7e777729377070,1,Senior,Computer Science,0.0,0.0,0.0,,...,0.0,,0.0,0.0,,0.0,0.0,0.0,0.0,0.0
155,2025-09-04 01:33:27 +0300,87,6071,1,Junior,Computer Science,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0
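
As a small illustration of the "skip `nan`" note: `DataFrame.sum` ignores missing values by default (`skipna=True`), so a row with no valid entries sums to 0. If such rows should stay missing instead, `min_count=1` is one option. The toy frame below is made up for this sketch.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "sql": [5.0, np.nan, np.nan],
    "regression": [12.0, 4.0, np.nan],
})

# skipna=True is the default, so NaN entries are simply ignored.
total_default = scores.sum(axis=1)

# min_count=1 returns NaN for a row that has no valid entries at all.
total_min_count = scores.sum(axis=1, min_count=1)

print(pd.DataFrame({"default": total_default, "min_count=1": total_min_count}))
```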


### Group Statistics by Section

In [27]:
def get_section_statistics(df, skill_columns):
    """
    Group students by section and compute the mean, median, and standard deviation for each skill proficiency column
    IN: df, pd.DataFrame, dataframe containing student assessments
    IN: skill_columns, list of str, list of skill proficiency columns
    OUT: section_stats, pd.DataFrame, dataframe with section statistics
    """
    # Your_Code_Here
    if "section" not in df.columns:
        raise KeyError("Expected a 'section' column in the dataframe.")
    missing = [c for c in skill_columns if c not in df.columns]
    if missing:
        raise KeyError(f"Missing skill columns: {missing}")

    # Coerce skills to numeric for robust aggregation
    df_num = df.copy()
    df_num[skill_columns] = df_num[skill_columns].apply(pd.to_numeric, errors="coerce")

    section_stats = (
        df_num
        .groupby("section")[skill_columns]
        .agg(["mean", "median", "std"])
        .sort_index()
    )
    return section_stats

In [28]:
if __name__ == "__main__":
    section_statistics = get_section_statistics(student_assessments_with_total_score.copy(), skill_columns)
    display(section_statistics)

skill,data_structures,data_structures,data_structures,calculus_and_linear_algebra,calculus_and_linear_algebra,calculus_and_linear_algebra,probability_and_statistics,probability_and_statistics,probability_and_statistics,data_visualization,...,algorithms,complexity_measures,complexity_measures,complexity_measures,visualization_tools,visualization_tools,visualization_tools,massive_data_processing,massive_data_processing,massive_data_processing
section,mean,median,std,mean,median,std,mean,median,std,mean,...,std,mean,median,std,mean,median,std,mean,median,std
1,19.608696,22.0,8.740066,13.521127,14.0,5.492486,27.875,26.0,13.673185,20.557143,...,7.705543,3.5,3.0,3.133503,3.787879,1.0,5.636461,1.549296,0.0,2.708657
2,19.0,18.0,5.244044,12.75,13.5,4.290442,30.15,26.5,15.971274,20.7,...,7.778738,3.47619,3.0,2.926073,3.619048,2.0,4.565919,1.0,0.0,1.449138
3,22.885714,22.0,6.807312,16.361111,15.0,4.691346,34.810811,35.0,12.939685,22.157895,...,8.343165,4.315789,4.0,3.5191,4.111111,1.0,5.873805,1.514286,1.0,2.105615
4,23.565217,25.0,7.650467,15.5,15.0,4.491538,36.478261,39.0,13.044921,25.608696,...,8.407836,3.875,4.5,3.221025,4.173913,3.0,3.459535,1.541667,0.0,2.781604
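
Aggregating with a list of functions produces two-level column labels of the form `(skill, statistic)`. A brief sketch, on made-up data, of two common ways to work with that result: selecting one statistic across all skills, and flattening the labels for export.

```python
import pandas as pd

toy = pd.DataFrame({
    "section": [1, 1, 2, 2],
    "sql": [5.0, 3.0, 0.0, 4.0],
    "regression": [12.0, 8.0, 6.0, 10.0],
})

stats = toy.groupby("section")[["sql", "regression"]].agg(["mean", "median", "std"])

# Select one statistic for every skill via the second column level.
means = stats.xs("mean", axis=1, level=1)

# Flatten ("sql", "mean") -> "sql_mean" for easier export.
flat = stats.copy()
flat.columns = [f"{skill}_{stat}" for skill, stat in flat.columns]

print(means)
print(flat.columns.tolist())
```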


### Pivot Table by Role and Section

In [29]:
def create_pivot_table(df):
    """
    Create a pivot table where rows correspond to role, columns correspond to section numbers, and entries contain the average total_score
    IN: df, pd.DataFrame, dataframe containing student assessments
    OUT: pivot_table, pd.DataFrame, pivot table
    """
    # Your_Code_Here
    # Roles are sorted alphabetically below; no categorical ordering is applied.
    # validate required columns
    for c in ("role", "section", "total_score"):
        if c not in df.columns:
            raise KeyError(f"Missing required column: {c}")

    # work on a copy of df only
    df = df.copy()
    df["role"] = df["role"].astype(str).str.strip()
    df["total_score"] = pd.to_numeric(df["total_score"], errors="coerce")

    # make 'section' numeric if possible (keeps original if truly non-numeric)
    try:
        df["section"] = pd.to_numeric(df["section"], errors="raise")
    except Exception:
        pass

    pivot_table = (
        pd.pivot_table(df, index="role", columns="section",
                       values="total_score", aggfunc="mean")
        .sort_index()           # roles
        .sort_index(axis=1)     # sections
        
    )
    return pivot_table

In [30]:
if __name__ == "__main__":
    pivot_table = create_pivot_table(student_assessments_with_total_score.copy())
    display(pivot_table)

role \ section,1,2,3,4
Junior,128.944444,145.888889,147.75,171.777778
Senior,127.0,109.555556,161.521739,143.571429
Sophomore,156.25,134.666667,99.333333,143.0
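
The role values have a natural academic order that alphabetical sorting does not respect (`Sophomore` sorts after `Senior`). One way to get rows in that order is an ordered categorical. The sketch below assumes the order given in `ROLE_ORDER`, which mirrors the list in the dataset description, and uses made-up scores.

```python
import pandas as pd

# Assumed ordering; adjust to whatever order the course expects.
ROLE_ORDER = ["Freshman", "Sophomore", "Junior", "Senior", "Graduate", "Other"]

toy = pd.DataFrame({
    "role": ["Senior", "Junior", "Sophomore", "Junior"],
    "section": [1, 1, 2, 2],
    "total_score": [120.0, 100.0, 90.0, 110.0],
})
toy["role"] = pd.Categorical(toy["role"], categories=ROLE_ORDER, ordered=True)

pivot = pd.pivot_table(
    toy, index="role", columns="section",
    values="total_score", aggfunc="mean",
    observed=True,  # keep only roles that actually occur in the data
)
print(pivot)
```

With `observed=True`, roles that never appear (for example `Freshman` in this toy frame) are left out instead of showing up as all-NaN rows.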


### Format Timestamp

In [33]:
def format_timestamp(df):
    """
    Convert timestamp values to the EST timezone instead of UTC
    IN: df, pd.DataFrame, dataframe containing student assessments
    OUT: df, pd.DataFrame, dataframe with formatted timestamp
    """
    # Your_Code_Here
    if "timestamp" not in df.columns:
        raise KeyError("Expected a 'timestamp' column in the dataframe.")

    out = df.copy()
    # The raw strings carry different UTC offsets. Parsing with utc=True
    # normalizes them to a single tz-aware datetime64 column; mixed offsets
    # cannot share one datetime64 dtype otherwise, and every value ends up NaT.
    s = pd.to_datetime(out["timestamp"], errors="coerce", utc=True)

    # America/New_York observes daylight saving time, so these September
    # timestamps are rendered at UTC-04:00 (EDT).
    out["timestamp_est"] = s.dt.tz_convert("America/New_York")
    return out

In [34]:
if __name__ == "__main__":
    student_assessments_formatted_timestamp = format_timestamp(student_assessments_with_total_score.copy())
    display(student_assessments_formatted_timestamp)

Unnamed: 0,timestamp,netid,ruid,section,role,major,data_structures,calculus_and_linear_algebra,probability_and_statistics,data_visualization,...,python_scripting,jupyter_notebook,regression,programming_languages,algorithms,complexity_measures,visualization_tools,massive_data_processing,total_score,timestamp_est
0,2025-09-03 12:39:05 -1000,54490a53500a,fceeaca5a5fbeea3ac,3,Senior,Computer Science,35.0,15.0,44.0,35.0,...,5.0,5.0,15.0,22.0,26.0,10.0,14.0,6.0,258.0,2025-09-03 18:39:05-04:00
1,2025-09-03 13:29:53 -0900,f3edfaf9faa5,b8a9e2e1e1baaee7e4,1,Sophomore,Computer Science,24.0,20.0,43.0,35.0,...,5.0,5.0,9.0,25.0,22.0,9.0,22.0,8.0,252.0,2025-09-03 18:29:53-04:00
2,2025-09-04 04:20:41 +0600,f9e1fefbff,7766292e2e7161292d,1,Junior,Computer Science,35.0,15.0,49.0,35.0,...,5.0,5.0,13.0,22.0,22.0,5.0,15.0,5.0,247.0,2025-09-03 18:20:41-04:00
3,2025-09-03 22:09:39 +0000,6668646765,05175f5c5c03145d5c,3,Junior,Computer Science,27.0,17.0,54.0,29.0,...,5.0,5.0,,28.0,25.0,8.0,23.0,,246.0,2025-09-03 18:09:39-04:00
4,2025-09-04 01:13:32 +0300,1b0f13121349,2d3f7274742f387074,1,Senior,Computer Science,34.0,25.0,36.0,35.0,...,5.0,5.0,12.0,27.0,28.0,5.0,,12.0,243.0,2025-09-03 18:13:32-04:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,2025-09-03 18:29:26 -0400,e7d58dcecf9e,bbaae7e2e2b1aceae7,1,Senior,Computer Science,11.0,9.0,14.0,10.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,45.0,2025-09-03 18:29:26-04:00
153,2025-09-03 12:14:09 -1000,9491999b9dc3,ccde919595cbdb9095,1,Junior,Computer Science,0.0,,0.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2025-09-03 18:14:09-04:00
154,2025-09-03 11:36:30 -1100,203e2a2928,2e3a7e777729377070,1,Senior,Computer Science,0.0,0.0,0.0,,...,,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,2025-09-03 18:36:30-04:00
155,2025-09-04 01:33:27 +0300,87,6071,1,Junior,Computer Science,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,2025-09-03 18:33:27-04:00
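
Timestamps that carry different UTC offsets have to be normalized to UTC before they can be converted as one column. The sketch below, on three sample strings, also contrasts two readings of "EST": the `America/New_York` zone, which switches to daylight time in summer, and the fixed-offset zone `Etc/GMT+5`, which stays at UTC-05:00 all year. Which one the task intends is an assumption the reader should confirm.

```python
import pandas as pd

raw = pd.Series([
    "2025-09-03 12:39:05 -1000",
    "2025-09-04 04:20:41 +0600",
    "2025-09-03 22:09:39 +0000",
])

# utc=True normalizes every offset to UTC and yields a tz-aware Series.
utc = pd.to_datetime(raw, utc=True, errors="coerce")

# America/New_York follows daylight saving time (UTC-04:00 in September).
new_york = utc.dt.tz_convert("America/New_York")

# Etc/GMT+5 is a fixed UTC-05:00 offset (note the inverted sign in the name).
fixed_est = utc.dt.tz_convert("Etc/GMT+5")

print(pd.DataFrame({"raw": raw, "new_york": new_york, "fixed_est": fixed_est}))
```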


### Normalize Skill Proficiency

In [None]:
def normalize_skills(df, skill_columns):
    """
    For each proficiency column, apply z-score normalization
    IN: df, pd.DataFrame, dataframe containing student assessments
    IN: skill_columns, list of str, list of skill proficiency columns
    OUT: df, pd.DataFrame, dataframe with normalized skills
    """
    # Your_Code_Here
    # validate inputs
    missing = [c for c in skill_columns if c not in df.columns]
    if missing:
        raise KeyError(f"Missing skill columns: {missing}")

    out = df.copy()
    num = out[skill_columns].apply(pd.to_numeric, errors="coerce")
    for c in skill_columns:
        mu = num[c].mean()
        sd = num[c].std(ddof=0)
        out[f"{c}_z"] = 0.0 if (pd.isna(sd) or sd == 0) else (num[c] - mu) / sd
    return out

    


In [36]:
if __name__ == "__main__":
    student_assessments_normalized = normalize_skills(student_assessments_formatted_timestamp.copy(), skill_columns)
    display(student_assessments_normalized)

Unnamed: 0,timestamp,netid,ruid,section,role,major,data_structures,calculus_and_linear_algebra,probability_and_statistics,data_visualization,...,shell_scripting_z,sql_z,python_scripting_z,jupyter_notebook_z,regression_z,programming_languages_z,algorithms_z,complexity_measures_z,visualization_tools_z,massive_data_processing_z
0,2025-09-03 12:39:05 -1000,54490a53500a,fceeaca5a5fbeea3ac,3,Senior,Computer Science,35.0,15.0,44.0,35.0,...,3.108471,1.111570,1.597242,1.183173,2.875648,1.132474,1.656349,1.952235,1.938863,1.869033
1,2025-09-03 13:29:53 -0900,f3edfaf9faa5,b8a9e2e1e1baaee7e4,1,Sophomore,Computer Science,24.0,20.0,43.0,35.0,...,-0.887393,1.111570,1.597242,1.183173,1.298273,1.627725,1.151396,1.639468,3.475221,2.693045
2,2025-09-04 04:20:41 +0600,f9e1fefbff,7766292e2e7161292d,1,Junior,Computer Science,35.0,15.0,49.0,35.0,...,-0.088220,-1.065732,1.597242,1.183173,2.349857,1.132474,1.151396,0.388403,2.130908,1.457027
3,2025-09-03 22:09:39 +0000,6668646765,05175f5c5c03145d5c,3,Junior,Computer Science,27.0,17.0,54.0,29.0,...,3.108471,1.111570,1.597242,1.183173,,2.122976,1.530111,1.326702,3.667266,
4,2025-09-04 01:13:32 +0300,1b0f13121349,2d3f7274742f387074,1,Senior,Computer Science,34.0,25.0,36.0,35.0,...,-0.887393,-1.065732,1.597242,1.183173,2.086961,1.957892,1.908826,0.388403,,4.341068
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,2025-09-03 18:29:26 -0400,e7d58dcecf9e,bbaae7e2e2b1aceae7,1,Senior,Computer Science,11.0,9.0,14.0,10.0,...,-0.887393,-1.065732,-0.842989,-1.065459,-1.067789,-2.499366,-1.625849,-1.175429,-0.749764,-0.603002
153,2025-09-03 12:14:09 -1000,9491999b9dc3,ccde919595cbdb9095,1,Junior,Computer Science,0.0,,0.0,,...,-0.887393,-1.065732,-0.842989,-1.065459,-1.067789,-2.499366,-1.625849,-1.175429,-0.749764,-0.603002
154,2025-09-03 11:36:30 -1100,203e2a2928,2e3a7e777729377070,1,Senior,Computer Science,0.0,0.0,0.0,,...,-0.887393,-1.065732,,-1.065459,-1.067789,,-1.625849,-1.175429,-0.749764,-0.603002
155,2025-09-04 01:33:27 +0300,87,6071,1,Junior,Computer Science,0.0,0.0,0.0,0.0,...,-0.887393,-1.065732,-0.842989,-1.065459,-1.067789,-2.499366,-1.625849,-1.175429,-0.749764,
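
To make the difference between the two scalings concrete, here is a short sketch on a made-up column. The z-score uses the population standard deviation (`ddof=0`), matching the function above; missing values stay missing in both results.

```python
import numpy as np
import pandas as pd

col = pd.Series([0.0, 5.0, np.nan, 10.0, 5.0])

# z-score: subtract the mean, divide by the population std (ddof=0).
z = (col - col.mean()) / col.std(ddof=0)

# min-max: rescale to the [0, 1] interval.
minmax = (col - col.min()) / (col.max() - col.min())

print(pd.DataFrame({"raw": col, "z": z, "minmax": minmax}))
```

After z-scoring, the column has mean 0 and (population) standard deviation 1, whereas min-max scaling pins the smallest value to 0 and the largest to 1.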


### Handle Missing Values

In [37]:
def handle_missing_values(df, skill_columns):
    """
    Fill missing values in each skill proficiency column with the column mean
    IN: df, pd.DataFrame, dataframe containing student assessments
    IN: skill_columns, list of str, list of skill proficiency columns
    OUT: df, pd.DataFrame, dataframe with missing values handled
    """
    # Your_Code_Here
    # validate inputs
    missing = [c for c in skill_columns if c not in df.columns]
    if missing:
        raise KeyError(f"Missing skill columns: {missing}")

    out = df.copy()

    # coerce to numeric for the target columns (non-numeric -> NaN), then fill with column mean
    for c in skill_columns:
        col = pd.to_numeric(out[c], errors="coerce")
        mu = col.mean()
        if pd.isna(mu):
            # if the entire column is NaN, fall back to 0
            mu = 0.0
        out[c] = col.fillna(mu)
    return out

In [38]:
if __name__ == "__main__":
    student_assessments_no_missing = handle_missing_values(student_assessments_normalized.copy(), skill_columns)
    display(student_assessments_no_missing)

Unnamed: 0,timestamp,netid,ruid,section,role,major,data_structures,calculus_and_linear_algebra,probability_and_statistics,data_visualization,...,shell_scripting_z,sql_z,python_scripting_z,jupyter_notebook_z,regression_z,programming_languages_z,algorithms_z,complexity_measures_z,visualization_tools_z,massive_data_processing_z
0,2025-09-03 12:39:05 -1000,54490a53500a,fceeaca5a5fbeea3ac,3,Senior,Computer Science,35.0,15.000000,44.0,35.000000,...,3.108471,1.111570,1.597242,1.183173,2.875648,1.132474,1.656349,1.952235,1.938863,1.869033
1,2025-09-03 13:29:53 -0900,f3edfaf9faa5,b8a9e2e1e1baaee7e4,1,Sophomore,Computer Science,24.0,20.000000,43.0,35.000000,...,-0.887393,1.111570,1.597242,1.183173,1.298273,1.627725,1.151396,1.639468,3.475221,2.693045
2,2025-09-04 04:20:41 +0600,f9e1fefbff,7766292e2e7161292d,1,Junior,Computer Science,35.0,15.000000,49.0,35.000000,...,-0.088220,-1.065732,1.597242,1.183173,2.349857,1.132474,1.151396,0.388403,2.130908,1.457027
3,2025-09-03 22:09:39 +0000,6668646765,05175f5c5c03145d5c,3,Junior,Computer Science,27.0,17.000000,54.0,29.000000,...,3.108471,1.111570,1.597242,1.183173,,2.122976,1.530111,1.326702,3.667266,
4,2025-09-04 01:13:32 +0300,1b0f13121349,2d3f7274742f387074,1,Senior,Computer Science,34.0,25.000000,36.0,35.000000,...,-0.887393,-1.065732,1.597242,1.183173,2.086961,1.957892,1.908826,0.388403,,4.341068
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,2025-09-03 18:29:26 -0400,e7d58dcecf9e,bbaae7e2e2b1aceae7,1,Senior,Computer Science,11.0,9.000000,14.0,10.000000,...,-0.887393,-1.065732,-0.842989,-1.065459,-1.067789,-2.499366,-1.625849,-1.175429,-0.749764,-0.603002
153,2025-09-03 12:14:09 -1000,9491999b9dc3,ccde919595cbdb9095,1,Junior,Computer Science,0.0,14.410596,0.0,21.748344,...,-0.887393,-1.065732,-0.842989,-1.065459,-1.067789,-2.499366,-1.625849,-1.175429,-0.749764,-0.603002
154,2025-09-03 11:36:30 -1100,203e2a2928,2e3a7e777729377070,1,Senior,Computer Science,0.0,0.000000,0.0,21.748344,...,-0.887393,-1.065732,,-1.065459,-1.067789,,-1.625849,-1.175429,-0.749764,-0.603002
155,2025-09-04 01:33:27 +0300,87,6071,1,Junior,Computer Science,0.0,0.000000,0.0,0.000000,...,-0.887393,-1.065732,-0.842989,-1.065459,-1.067789,-2.499366,-1.625849,-1.175429,-0.749764,
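
For columns that are already numeric, the same mean imputation can be written without an explicit loop, since `fillna` accepts a Series of per-column fill values and aligns it on column labels. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "sql": [5.0, np.nan, 1.0],
    "regression": [np.nan, 4.0, 8.0],
})
cols = ["sql", "regression"]

# DataFrame.mean() returns one mean per column; fillna aligns on column labels.
filled = toy.copy()
filled[cols] = toy[cols].fillna(toy[cols].mean())

print(filled)
```

Note that the means are computed from the observed values only, so imputing does not change each column's mean.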


### Remove Duplicate Records

Note: Drop duplicates based on `ruid`. Keep the record with the highest `total_score`. If there is a tie, keep the one whose `timestamp` is latest.

In [39]:
def remove_duplicates(df):
    """
    Drop duplicates while keeping only the record with the highest total_score
    IN: df, pd.DataFrame, dataframe containing student assessments
    OUT: df, pd.DataFrame, dataframe with duplicates removed
    """
    if "total_score" not in df.columns:
        raise KeyError("Expected a 'total_score' column.")
    # deduplicate on ruid, as the task note specifies
    key = "ruid"
    if key not in df.columns:
        raise KeyError("Expected an identifier column 'ruid'.")

    out = df.copy()

    # sort by total_score desc, then by timestamp desc (if present)
    if "timestamp" in out.columns:
        # utc=True puts the mixed-offset timestamps on one comparable UTC timeline
        ts = pd.to_datetime(out["timestamp"], errors="coerce", utc=True)
        out = out.assign(_ts=ts).sort_values(
            ["total_score", "_ts"], ascending=[False, False]
        ).drop(columns="_ts")
    else:
        out = out.sort_values("total_score", ascending=False)

    # drop duplicates by identifier, keep the first (highest score / newest)
    out = out.drop_duplicates(subset=[key], keep="first")

    # sort back by total_score desc for readability
    out = out.sort_values("total_score", ascending=False).reset_index(drop=True)
    return out

In [40]:
if __name__ == "__main__":
    student_assessments_no_duplicates = remove_duplicates(student_assessments_no_missing.copy())
    display(student_assessments_no_duplicates)

Unnamed: 0,timestamp,netid,ruid,section,role,major,data_structures,calculus_and_linear_algebra,probability_and_statistics,data_visualization,...,shell_scripting_z,sql_z,python_scripting_z,jupyter_notebook_z,regression_z,programming_languages_z,algorithms_z,complexity_measures_z,visualization_tools_z,massive_data_processing_z
0,2025-09-03 12:39:05 -1000,54490a53500a,fceeaca5a5fbeea3ac,3,Senior,Computer Science,35.0,15.0,44.0,35.0,...,3.108471,1.111570,1.597242,1.183173,2.875648,1.132474,1.656349,1.952235,1.938863,1.869033
1,2025-09-03 13:29:53 -0900,f3edfaf9faa5,b8a9e2e1e1baaee7e4,1,Sophomore,Computer Science,24.0,20.0,43.0,35.0,...,-0.887393,1.111570,1.597242,1.183173,1.298273,1.627725,1.151396,1.639468,3.475221,2.693045
2,2025-09-04 04:20:41 +0600,f9e1fefbff,7766292e2e7161292d,1,Junior,Computer Science,35.0,15.0,49.0,35.0,...,-0.088220,-1.065732,1.597242,1.183173,2.349857,1.132474,1.151396,0.388403,2.130908,1.457027
3,2025-09-03 22:09:39 +0000,6668646765,05175f5c5c03145d5c,3,Junior,Computer Science,27.0,17.0,54.0,29.0,...,3.108471,1.111570,1.597242,1.183173,,2.122976,1.530111,1.326702,3.667266,
4,2025-09-04 01:13:32 +0300,1b0f13121349,2d3f7274742f387074,1,Senior,Computer Science,34.0,25.0,36.0,35.0,...,-0.887393,-1.065732,1.597242,1.183173,2.086961,1.957892,1.908826,0.388403,,4.341068
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
147,2025-09-04 04:35:08 +0600,f0c08cd0d58b,fceca6a5a5f8e4a3a6,1,Senior,Computer Science,6.0,7.0,12.0,11.0,...,-0.088220,-0.630272,-0.354943,-0.615733,-1.067789,-1.343781,-1.120895,-0.549896,-0.749764,-0.603002
148,2025-09-03 19:17:21 -0300,464a4347411c,3e2c62676738296460,1,Senior,Electrical and Computer Engineering,9.0,7.0,11.0,13.0,...,-0.088220,-0.630272,-0.354943,-0.615733,-1.067789,-1.673948,-1.120895,-1.175429,-0.749764,-0.603002
149,2025-09-03 18:29:26 -0400,e7d58dcecf9e,bbaae7e2e2b1aceae7,1,Senior,Computer Science,11.0,9.0,14.0,10.0,...,-0.887393,-1.065732,-0.842989,-1.065459,-1.067789,-2.499366,-1.625849,-1.175429,-0.749764,-0.603002
150,2025-09-04 01:33:27 +0300,87,6071,1,Junior,Computer Science,0.0,0.0,0.0,0.0,...,-0.887393,-1.065732,-0.842989,-1.065459,-1.067789,-2.499366,-1.625849,-1.175429,-0.749764,
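
The "highest score, then latest timestamp" rule can be seen on a four-row toy frame. This sketch deduplicates on `ruid` as the note requires; the IDs, scores, and timestamps are made up, and `_ts` is a temporary helper column.

```python
import pandas as pd

toy = pd.DataFrame({
    "ruid": ["A", "A", "B", "B"],
    "total_score": [90.0, 120.0, 75.0, 75.0],
    "timestamp": [
        "2025-09-03 18:00:00 -0400",
        "2025-09-03 17:00:00 -0400",
        "2025-09-03 18:00:00 -0400",  # 22:00 UTC
        "2025-09-04 01:00:00 +0200",  # 23:00 UTC on 2025-09-03, i.e. later
    ],
})

deduped = (
    toy.assign(_ts=pd.to_datetime(toy["timestamp"], utc=True))
       .sort_values(["total_score", "_ts"], ascending=[False, False])
       .drop_duplicates(subset="ruid", keep="first")
       .drop(columns="_ts")
)
print(deduped)
```

For `A` the higher score wins regardless of time; for `B` the scores tie, so the later submission (compared in UTC, not by the local clock time in the string) is kept.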


### Resolve Swapped IDs

In [41]:
def get_swapped_records(df):
    """
    Identify records where students may have swapped netid and ruid
    IN: df, pd.DataFrame, dataframe containing student assessments
    OUT: swapped_df, pd.DataFrame, dataframe containing swapped records
    """
    # Your_Code_Here
    for col in ("netid", "ruid"):
        if col not in df.columns:
            raise KeyError(f"Missing required column: '{col}'")

    # calculate lengths
    netid_len = df["netid"].astype(str).str.len()
    ruid_len  = df["ruid"].astype(str).str.len()

    # find netid with exactly 18 characters and ruid length between 8 and 14
    mask = (netid_len == 18) & (ruid_len.between(8, 14, inclusive="both"))

    swapped_df = df.loc[mask].copy()
    swapped_df["netid_length"] = netid_len[mask]
    swapped_df["ruid_length"]  = ruid_len[mask]
    return swapped_df

In [42]:
if __name__ == "__main__":
    swapped_records = get_swapped_records(student_assessments_no_duplicates.copy())
    display(swapped_records)

Unnamed: 0,timestamp,netid,ruid,section,role,major,data_structures,calculus_and_linear_algebra,probability_and_statistics,data_visualization,...,python_scripting_z,jupyter_notebook_z,regression_z,programming_languages_z,algorithms_z,complexity_measures_z,visualization_tools_z,massive_data_processing_z,netid_length,ruid_length
96,2025-09-04 00:29:54 +0200,8392dadada8891d3dd,7163257279,2,Senior,Computer Science,15.0,14.0,32.0,21.0,...,-0.354943,,-1.067789,-0.188195,-0.615942,1.013935,0.21046,-0.603002,18,10


In [43]:
def join_swapped_records(df_swapped, student_list_df):
    """
    Join the swapped records with the student list to correct netid and ruid
    IN: df_swapped, pd.DataFrame, dataframe containing swapped records
    IN: student_list_df, pd.DataFrame, dataframe containing student list
    OUT: joined_df, pd.DataFrame, dataframe with corrected netid and ruid
    """
    # validate the columns needed for the join
    for c in ("netid", "ruid"):
        if c not in df_swapped.columns:
            raise KeyError(f"df_swapped is missing required column: {c}")
    if "netid" not in student_list_df.columns:
        raise KeyError("student_list_df must contain a 'netid' column")
    has_ruid_in_student_list = "ruid" in student_list_df.columns

    # Merge on: left 'ruid' (which we suspect is actually a NetID) to student_list 'netid'
    sl = student_list_df.rename(columns={"netid": "student_list_netid"})
    joined_df = df_swapped.merge(
        sl,
        left_on="ruid",
        right_on="student_list_netid",
        how="left",
        suffixes=("", "_sl"),
    )

    # Compute corrected fields
    # The swapped 'ruid' column holds the real NetID (its length was already checked in get_swapped_records)
    joined_df["netid_corrected"] = joined_df["ruid"]

    # Correct RUID: prefer the authoritative one from the student list if available
    if has_ruid_in_student_list:
        # Use student list ruid when present; otherwise, fall back to original 'netid'
        joined_df["ruid_corrected"] = joined_df["ruid_sl"].where(
            joined_df["ruid_sl"].notna(),
            joined_df["netid"]  # fallback: simple swap if no match in student list
        )
    else:
        # No ruid in student list -> simple swap fallback
        joined_df["ruid_corrected"] = joined_df["netid"]

    # Keep useful columns only (and preserve originals for auditing)
    keep_cols = [c for c in joined_df.columns if c in df_swapped.columns] + [
        "netid_corrected",
        "ruid_corrected",
    ]
    return joined_df[keep_cols]
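
A left join like the one above silently keeps records that have no match in the student list, and it duplicates rows if the list repeats a NetID. One possible way to surface both cases is to use the `validate` and `indicator` options of `DataFrame.merge`. The frames `swapped` and `roster` below are hypothetical stand-ins, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical stand-ins for the swapped records and the student list
swapped = pd.DataFrame(
    {"netid": ["r" * 18, "s" * 18], "ruid": ["netid00001", "netid00002"]}
)
roster = pd.DataFrame({"netid": ["netid00001"], "ruid": ["r" * 18]})

checked = swapped.merge(
    roster.rename(columns={"netid": "student_list_netid"}),
    left_on="ruid",
    right_on="student_list_netid",
    how="left",
    suffixes=("", "_sl"),
    validate="many_to_one",  # raises MergeError if the roster repeats a NetID
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Records whose suspected NetID was not found in the roster
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["netid", "ruid"]])
```

Rows marked `left_only` would fall back to the simple swap in `join_swapped_records`, so they are the ones worth reviewing by hand.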

In [44]:
if __name__ == "__main__":
    joined_swapped = join_swapped_records(swapped_records.copy(), student_list_df)
    display(joined_swapped)

Unnamed: 0,timestamp,netid,ruid,section,role,major,data_structures,calculus_and_linear_algebra,probability_and_statistics,data_visualization,...,regression_z,programming_languages_z,algorithms_z,complexity_measures_z,visualization_tools_z,massive_data_processing_z,netid_length,ruid_length,netid_corrected,ruid_corrected
0,2025-09-04 00:29:54 +0200,8392dadada8891d3dd,7163257279,2,Senior,Computer Science,15.0,14.0,32.0,21.0,...,-1.067789,-0.188195,-0.615942,1.013935,0.21046,-0.603002,18,10,7163257279,8392dadada8891d3dd
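
Note that the swapped record carried row label 96 before the join and label 0 afterwards, because `merge` returns a fresh `RangeIndex`. If the corrected IDs are to be written back into the full assessments frame, the original label has to be carried along somehow, for example by calling `reset_index()` before merging. The sketch below assumes that has been done and that the label is stored in a hypothetical `orig_index` column; all values are made up.

```python
import pandas as pd

# Hypothetical miniature of the assessments frame; label 96 mimics the output above
assessments = pd.DataFrame(
    {"netid": ["netid00001", "r" * 18], "ruid": ["q" * 18, "netid00002"]},
    index=[5, 96],
)

# Corrected values keyed by the original row label (assumed to have been preserved)
corrections = pd.DataFrame(
    {
        "orig_index": [96],
        "netid_corrected": ["netid00002"],
        "ruid_corrected": ["r" * 18],
    }
).set_index("orig_index")

# Overwrite only the affected rows; assignment aligns on the row labels
fixed = assessments.copy()
fixed.loc[corrections.index, "netid"] = corrections["netid_corrected"]
fixed.loc[corrections.index, "ruid"] = corrections["ruid_corrected"]

print(fixed)
```

Working on a copy leaves the original frame untouched, which makes it easy to compare the rows before and after the correction.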


### DataFrame Schema

In [45]:
if __name__ == "__main__":
    section_statistics.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 45 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   (data_structures, mean)                4 non-null      float64
 1   (data_structures, median)              4 non-null      float64
 2   (data_structures, std)                 4 non-null      float64
 3   (calculus_and_linear_algebra, mean)    4 non-null      float64
 4   (calculus_and_linear_algebra, median)  4 non-null      float64
 5   (calculus_and_linear_algebra, std)     4 non-null      float64
 6   (probability_and_statistics, mean)     4 non-null      float64
 7   (probability_and_statistics, median)   4 non-null      float64
 8   (probability_and_statistics, std)      4 non-null      float64
 9   (data_visualization, mean)             4 non-null      float64
 10  (data_visualization, median)           4 non-null      float64
 11  (data_visualiza

In [46]:
if __name__ == "__main__":
    pivot_table.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, Junior to Sophomore
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   1       3 non-null      float64
 1   2       3 non-null      float64
 2   3       3 non-null      float64
 3   4       3 non-null      float64
dtypes: float64(4)
memory usage: 120.0+ bytes


In [47]:
if __name__ == "__main__":
    student_assessments_no_duplicates.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152 entries, 0 to 151
Data columns (total 38 columns):
 #   Column                         Non-Null Count  Dtype                           
---  ------                         --------------  -----                           
 0   timestamp                      152 non-null    object                          
 1   netid                          152 non-null    object                          
 2   ruid                           152 non-null    object                          
 3   section                        152 non-null    int64                           
 4   role                           152 non-null    object                          
 5   major                          152 non-null    object                          
 6   data_structures                152 non-null    float64                         
 7   calculus_and_linear_algebra    152 non-null    float64                         
 8   probability_and_statistics     152 non-n

In [48]:
if __name__ == "__main__":
    joined_swapped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 42 columns):
 #   Column                         Non-Null Count  Dtype                           
---  ------                         --------------  -----                           
 0   timestamp                      1 non-null      object                          
 1   netid                          1 non-null      object                          
 2   ruid                           1 non-null      object                          
 3   section                        1 non-null      int64                           
 4   role                           1 non-null      object                          
 5   major                          1 non-null      object                          
 6   data_structures                1 non-null      float64                         
 7   calculus_and_linear_algebra    1 non-null      float64                         
 8   probability_and_statistics     1 non-null   