# Data Wrangling using Pandas

## Tasks <a id="tasks"></a>

The data wrangling process for this dataset involves several key stages, including data manipulation, structuring, cleaning, and enrichment. Each stage is described below with the specific tasks to be performed.

---

#### 1. [Manipulating Data](#manipulating)
*(Handling missing values, examining distributions, grouping, frequency analysis, and correlation)*

- **Inspect missing values** in each column to determine appropriate handling strategies, such as imputation or removal.  
- **Examine the distribution** of values across numerical and categorical columns to identify potential anomalies or outliers.  
- **Analyze frequency counts** for selected categorical columns (e.g., `role`, `major`) to obtain an overview of the dataset composition.  
- **Group the data** by a categorical column, e.g., `role`, and compute summary statistics — mean, median, and standard deviation — for each skill proficiency column.  
- **Use a pivot table** to explore potential relationships or correlations between categorical variables, such as `role` and `major`.

---

#### 2. [Structuring Data](#structuring)
*(Formatting, type conversion, and scaling)*

- **Assign appropriate data types** to each column based on the dataset description. For example, convert string entries in the `timestamp` column into timezone-aware `datetime` objects expressed in a single timezone (UTC).  
- **Normalize skill proficiency columns** by scaling their values to a 0–1 range to facilitate comparison across skills.

---

#### 3. [Cleaning Data](#cleaning)
*(Managing missing values, invalid entries, and duplicates)*

- **Impute missing proficiency scores** by replacing them with 0, assuming no reported proficiency.  
- **Remove invalid entries** for `netid` and `ruid` that do not conform to the specified length criteria.  
- **Handle duplicate submissions** by retaining only the most recent record for each student, determined by the `timestamp`.

---

#### 4. [Enriching Data](#enriching)
*(Merging, correcting, and deriving additional information)*

- **Validate and correct `section` data** by cross-referencing with an external dataset containing verified section information.  
- **Derive new categorical features** by grouping related skills into broader categories to enable higher-level analysis and visualization.

---


## Dataset Description: Student Assessment Questionnaires

The dataset `assessment_generated.csv` contains information derived from student assessment questionnaires.

Each record represents an individual student's response and includes demographic, academic, and self-assessment information. The dataset comprises the following attributes:

- **`timestamp`**  
  The date and time when the assessment was submitted, formatted as `yyyy-mm-dd hh:mm:ss timezone`.

- **`netid`**  
  The encoded NetID of the student. Valid NetIDs must have a string length between 8 and 14 characters (inclusive). Entries falling outside this range are considered invalid.

- **`ruid`**  
  The encoded RUID of the student. A valid RUID is expected to contain exactly 18 characters. Any deviation from this length is considered invalid.

- **`section`**  
  The course section number as reported by the student. This field may contain inaccuracies, as some students provided incorrect section information.

- **`role`**  
  The academic standing of the student. Possible values include:
  - `Freshman`
  - `Sophomore`
  - `Junior`
  - `Senior`
  - `Graduate`
  - `Other`

- **`major`**  
  The declared major of the student. Accepted categories are:
  - `Computer Science`
  - `Electrical and Computer Engineering`
  - `Mathematics`
  - `Other` (recorded as `Others` in the data)

- **Skill Proficiency Columns**  
  The following columns record students’ self-assessed proficiency levels in specific skills, rated on scales ranging from 0 up to a multiple of 5 (depending on the number of questions per topic). Missing values are present in some entries.

  - `data_structures`  
  - `calculus_and_linear_algebra`  
  - `probability_and_statistics`  
  - `data_visualization`  
  - `python_libraries`  
  - `shell_scripting`  
  - `sql`  
  - `python_scripting`  
  - `jupyter_notebook`  
  - `regression`  
  - `programming_languages`  
  - `algorithms`  
  - `complexity_measures`  
  - `visualization_tools`  
  - `massive_data_processing`


In [21]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing

In [22]:
# load csv file to a Pandas dataframe named student_assessments
student_assessments = pd.read_csv('assessment_generated.csv')
student_assessments

Unnamed: 0,timestamp,netid,ruid,section,role,major,data_structures,calculus_and_linear_algebra,probability_and_statistics,data_visualization,...,shell_scripting,sql,python_scripting,jupyter_notebook,regression,programming_languages,algorithms,complexity_measures,visualization_tools,massive_data_processing
0,2025-09-04 01:21:03 +0300,d2dbd3d0d5,786a2021217c6e2022,1,Junior,Computer Science,28.0,20.0,41.0,28.0,...,5.0,5.0,0.0,5.0,12.0,22.0,20.0,4.0,14.0,9.0
1,2025-09-04 00:28:39 +0200,c7dd9ac7c494,60703e393965793e3d,1,Junior,Computer Science,7.0,11.0,15.0,11.0,...,0.0,5.0,1.0,,0.0,15.0,9.0,4.0,1.0,2.0
2,2025-09-03 18:22:47 -0400,5d504543461b,0f1d55565609195250,1,Senior,Mathematics,22.0,15.0,22.0,22.0,...,,5.0,0.0,0.0,4.0,15.0,,,9.0,1.0
3,2025-09-04 06:29:53 +0800,021b4e0503,5145080b0b52450c0a,3,Senior,Computer Science,29.0,19.0,55.0,28.0,...,1.0,5.0,5.0,1.0,,27.0,32.0,10.0,,8.0
4,2025-09-03 16:31:34 -0600,8b8cc28089de,5d4c0104045b490005,1,Junior,Computer Science,25.0,14.0,43.0,23.0,...,1.0,5.0,1.0,1.0,12.0,13.0,7.0,0.0,6.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,2025-09-04 00:21:28 +0200,a89a979d95,8b9bdad2d28e92d2da,2,Sophomore,Mathematics,18.0,,26.0,16.0,...,1.0,5.0,0.0,0.0,4.0,17.0,10.0,3.0,19.0,4.0
153,2025-09-03 22:22:08 +0000,eaecb4edea,8d9dd4d4d48f99d0d0,2,Senior,Computer Science,14.0,5.0,5.0,13.0,...,1.0,5.0,1.0,1.0,0.0,8.0,7.0,0.0,0.0,0.0
154,2025-09-03 18:29:26 -0400,e7d58dcecf9e,bbaae7e2e2b1aceae7,1,Senior,Computer Science,11.0,9.0,14.0,10.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
155,2025-09-03 17:26:09 -0500,eef7a3e5e6,03125d5a5a06105b5d,3,Senior,Electrical and Computer Engineering,35.0,19.0,39.0,34.0,...,1.0,0.0,0.0,5.0,6.0,25.0,25.0,6.0,12.0,0.0


[Back to top](#tasks)

### Manipulating Data<a id="manipulating"></a>

##### Inspect Missing Values

`df.info()` provides a concise summary of the DataFrame, including the number of non-null entries, data types of each column, and memory usage. This function is particularly useful for obtaining an overview of the dataset’s completeness and structure, allowing for quick identification of missing values and incorrect data types.

In [23]:
student_assessments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   timestamp                    157 non-null    object 
 1   netid                        157 non-null    object 
 2   ruid                         157 non-null    object 
 3   section                      157 non-null    int64  
 4   role                         157 non-null    object 
 5   major                        157 non-null    object 
 6   data_structures              148 non-null    float64
 7   calculus_and_linear_algebra  151 non-null    float64
 8   probability_and_statistics   152 non-null    float64
 9   data_visualization           151 non-null    float64
 10  python_libraries             147 non-null    float64
 11  shell_scripting              154 non-null    float64
 12  sql                          152 non-null    float64
 13  python_scripting    

According to the `Non-Null Count` information obtained from `df.info()`, the columns `timestamp`, `netid`, `ruid`, `section`, `role`, and `major` contain no missing values. In contrast, the skill proficiency columns exhibit missing entries, which may require additional handling through imputation or other appropriate data cleaning techniques.
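
When only the missing-value counts are needed, `isna()` combined with `sum()` or `mean()` gives them directly. The sketch below uses a small made-up frame (the name `toy` and its values are illustrative, not taken from the dataset) to show the count and the share of missing entries per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for student_assessments (values are made up).
toy = pd.DataFrame(
    {
        "netid": ["aaaaaaaa", "bbbbbbbb", "cccccccc", "dddddddd"],
        "sql": [5.0, np.nan, 0.0, 5.0],
        "regression": [np.nan, np.nan, 4.0, 12.0],
    }
)

# Number of missing entries per column, and the share of rows they represent.
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()

# Show only the columns that actually contain missing values.
missing_summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(missing_summary[missing_summary["missing"] > 0])
```

The same two calls can be applied to `student_assessments` to rank the skill columns by how incomplete they are.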

##### Distribution of Values

`df.describe(include='all')` can be used to generate a comprehensive overview of the dataset. It provides summary statistics for both numerical and categorical columns — including count, unique values, most frequent (top) values, and their frequencies for categorical data, as well as measures such as mean, standard deviation, minimum, maximum, and quartiles for numerical data. This function is particularly useful for gaining an initial understanding of data distributions and detecting potential anomalies.


In [24]:
# summary statistics for all columns (missing values are excluded from the calculations)
student_assessments.describe(include="all")

Unnamed: 0,timestamp,netid,ruid,section,role,major,data_structures,calculus_and_linear_algebra,probability_and_statistics,data_visualization,...,shell_scripting,sql,python_scripting,jupyter_notebook,regression,programming_languages,algorithms,complexity_measures,visualization_tools,massive_data_processing
count,157,157,157,157.0,157,157,148.0,151.0,152.0,151.0,...,154.0,152.0,154.0,149.0,146.0,150.0,149.0,153.0,146.0,151.0
unique,157,152,151,,3,4,,,,,...,,,,,,,,,,
top,2025-09-04 01:21:03 +0300,8b8cc28089de,bbaae7e2e2b1aceae7,,Senior,Computer Science,,,,,...,,,,,,,,,,
freq,1,2,2,,80,135,,,,,...,,,,,,,,,,
mean,,,,2.076433,,,20.912162,14.410596,31.164474,21.748344,...,1.11039,2.447368,1.727273,2.369128,4.061644,15.14,12.879195,3.75817,3.90411,1.463576
std,,,,1.152152,,,7.867897,5.146224,14.055917,8.19123,...,1.255376,2.304012,2.055672,2.231074,3.816882,6.077829,7.948239,3.207775,5.225044,2.435227
min,,,,1.0,,,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,,,,1.0,,,16.0,11.0,22.0,16.0,...,1.0,0.0,0.0,0.0,0.25,12.0,7.0,1.0,0.0,0.0
50%,,,,2.0,,,22.0,15.0,31.0,22.0,...,1.0,1.0,1.0,1.0,3.0,16.0,13.0,4.0,2.0,0.0
75%,,,,3.0,,,27.0,18.0,42.0,28.0,...,1.0,5.0,5.0,5.0,6.0,19.0,18.0,6.0,6.0,2.0


The `unique` and `freq` rows in the output of `df.describe(include='all')` show that some students submitted the assessment questionnaire more than once: the 157 records contain only 152 distinct `netid` values and 151 distinct `ruid` values, and the most frequent `netid` and `ruid` each occur twice. These duplicate records should be addressed during the data cleaning process so that only the most recent submission from each student is retained.
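
To look at the repeated submissions themselves, `DataFrame.duplicated()` can be used as a row filter. This is a minimal sketch on made-up data (the frame `toy` and its placeholder values are assumptions for illustration).

```python
import pandas as pd

# Toy frame: the first and third rows come from the same student.
toy = pd.DataFrame(
    {
        "netid": ["aaaaaaaa", "bbbbbbbb", "aaaaaaaa", "cccccccc"],
        "ruid": ["r1", "r2", "r1", "r3"],
        "timestamp": ["t1", "t2", "t3", "t4"],
    }
)

# keep=False marks every member of a duplicated group, not only the repeats.
repeat_mask = toy.duplicated(subset="netid", keep=False)

# Sorting by netid places the submissions of each student next to each other.
repeated_rows = toy[repeat_mask].sort_values("netid")
print(repeated_rows)
```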


##### Frequency of Values

To examine the distribution of values within a specific column, the function `df['column_name'].value_counts()` can be used. This function returns a Series containing the counts of all unique values in the specified column, sorted in descending order by default. It is particularly useful for analyzing categorical variables and identifying dominant or infrequent categories.


In [25]:
# The frequency of roles in the 'role' column
print(student_assessments["role"].value_counts())

print()

# The frequency of majors in the 'major' column
print(student_assessments["major"].value_counts())

role
Senior       80
Junior       66
Sophomore    11
Name: count, dtype: int64

major
Computer Science                       135
Mathematics                             13
Electrical and Computer Engineering      8
Others                                   1
Name: count, dtype: int64
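
Two optional parameters of `value_counts()` are often helpful at this stage: `normalize=True` returns proportions instead of counts, and `dropna=False` includes missing values as a separate category. The sketch below uses a made-up Series rather than the real `role` column.

```python
import pandas as pd

# Made-up roles, including one missing entry.
roles = pd.Series(
    ["Senior", "Junior", "Senior", None, "Sophomore", "Senior"], name="role"
)

# Proportions among the non-missing entries.
shares = roles.value_counts(normalize=True)

# Raw counts with the missing entry reported as its own category.
counts_with_missing = roles.value_counts(dropna=False)

print(shares)
print(counts_with_missing)
```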


##### Grouping Data

The function `df.groupby()` is used to group a DataFrame by one or more columns and to perform aggregate operations on the grouped data. This method is particularly useful for summarizing and analyzing patterns across categorical variables, such as calculating mean, median, or standard deviation values for different groups within the dataset.

In [26]:
score_columns = [
    "data_structures",
    "calculus_and_linear_algebra",
    "probability_and_statistics",
    "data_visualization",
    "python_libraries",
    "shell_scripting",
    "sql",
    "python_scripting",
    "jupyter_notebook",
    "regression",
    "programming_languages",
    "algorithms",
    "complexity_measures",
    "visualization_tools",
    "massive_data_processing",
]

student_assessments[['role', *score_columns]].groupby('role', observed=True).agg(['mean', 'median', 'std'])

Unnamed: 0_level_0,data_structures,data_structures,data_structures,calculus_and_linear_algebra,calculus_and_linear_algebra,calculus_and_linear_algebra,probability_and_statistics,probability_and_statistics,probability_and_statistics,data_visualization,...,algorithms,complexity_measures,complexity_measures,complexity_measures,visualization_tools,visualization_tools,visualization_tools,massive_data_processing,massive_data_processing,massive_data_processing
Unnamed: 0_level_1,mean,median,std,mean,median,std,mean,median,std,mean,...,std,mean,median,std,mean,median,std,mean,median,std
role,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
Junior,20.555556,21.0,7.908301,14.0,14.0,4.90556,31.954545,31.5,14.163259,21.569231,...,8.558332,3.681818,3.5,3.220889,3.967213,2.0,5.486247,1.451613,0.0,2.708078
Senior,21.391892,22.5,8.275757,14.487179,15.0,5.33974,30.266667,31.0,14.358662,22.381579,...,7.626932,3.881579,4.0,3.195693,3.743243,2.0,4.536189,1.487179,0.5,2.214143
Sophomore,19.727273,20.0,4.268276,16.4,16.5,5.103376,32.545455,37.0,11.894231,18.1,...,5.985167,3.363636,3.0,3.471966,4.636364,1.0,8.015893,1.363636,0.0,2.54058
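
Because aggregations such as `mean` and `median` skip missing values, it can help to report how many observations each statistic is based on. The following sketch uses named aggregation on a made-up frame; the output column names (`sql_mean`, `sql_median`, `sql_n`) are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up scores with one missing value in the Senior group.
toy = pd.DataFrame(
    {
        "role": ["Junior", "Junior", "Senior", "Senior", "Senior"],
        "sql": [5.0, 1.0, 0.0, np.nan, 4.0],
    }
)

# Named aggregation yields flat, self-describing column names.
summary = toy.groupby("role").agg(
    sql_mean=("sql", "mean"),
    sql_median=("sql", "median"),
    sql_n=("sql", "count"),  # number of non-missing values per group
)
print(summary)
```

Here the Senior statistics are computed from two values, not three, which `sql_n` makes visible.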


##### Correlation of Categorical Columns

The function `df.pivot_table()` can be used to create a pivot table that summarizes and aggregates data based on specified index and column variables. It allows for flexible computation of summary statistics, such as mean or count, across combinations of categorical variables. This function is particularly useful for exploring relationships and patterns between different categories within the dataset.


In [27]:
# count the number of students for each (major, role) combination
student_assessments["major"] = student_assessments["major"].astype("category")
# treat role as an ordered category so the columns follow academic standing
student_assessments["role"] = (
    student_assessments["role"]
    .astype("category")
    .cat.reorder_categories(["Sophomore", "Junior", "Senior"], ordered=True)
)
student_assessments.pivot_table(
    index="major", columns="role", aggfunc="count", observed=True, values="netid"
)

role,Sophomore,Junior,Senior
major,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Computer Science,10.0,55.0,70.0
Electrical and Computer Engineering,,4.0,4.0
Mathematics,1.0,6.0,6.0
Others,,1.0,
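
For plain counts of two categorical columns, `pd.crosstab()` is a compact alternative: it returns integer counts with `0` for empty combinations, and `normalize="index"` converts each row to proportions. The sketch below runs on made-up data and is not the notebook's own method.

```python
import pandas as pd

# Made-up major/role pairs.
toy = pd.DataFrame(
    {
        "major": ["Computer Science", "Computer Science", "Mathematics", "Computer Science"],
        "role": ["Junior", "Senior", "Senior", "Senior"],
    }
)

# Contingency table of counts (empty combinations become 0).
counts = pd.crosstab(toy["major"], toy["role"])

# Share of each role within a major (every row sums to 1).
row_shares = pd.crosstab(toy["major"], toy["role"], normalize="index")

print(counts)
print(row_shares)
```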


[Back to top](#tasks)

### Structuring Data<a id="structuring"></a>

##### Formatting Timestamp Column

The `timestamp` column should be parsed and converted to a standardized `datetime` format, ensuring that all entries are represented in the same timezone for consistency. This step facilitates accurate temporal analysis and comparison across records.

In [28]:
# parse timestamp and convert to utc
student_assessments["timestamp"] = pd.to_datetime(
    student_assessments["timestamp"],
    format="%Y-%m-%d %H:%M:%S %z",
    utc=True
)
student_assessments

Unnamed: 0,timestamp,netid,ruid,section,role,major,data_structures,calculus_and_linear_algebra,probability_and_statistics,data_visualization,...,shell_scripting,sql,python_scripting,jupyter_notebook,regression,programming_languages,algorithms,complexity_measures,visualization_tools,massive_data_processing
0,2025-09-03 22:21:03+00:00,d2dbd3d0d5,786a2021217c6e2022,1,Junior,Computer Science,28.0,20.0,41.0,28.0,...,5.0,5.0,0.0,5.0,12.0,22.0,20.0,4.0,14.0,9.0
1,2025-09-03 22:28:39+00:00,c7dd9ac7c494,60703e393965793e3d,1,Junior,Computer Science,7.0,11.0,15.0,11.0,...,0.0,5.0,1.0,,0.0,15.0,9.0,4.0,1.0,2.0
2,2025-09-03 22:22:47+00:00,5d504543461b,0f1d55565609195250,1,Senior,Mathematics,22.0,15.0,22.0,22.0,...,,5.0,0.0,0.0,4.0,15.0,,,9.0,1.0
3,2025-09-03 22:29:53+00:00,021b4e0503,5145080b0b52450c0a,3,Senior,Computer Science,29.0,19.0,55.0,28.0,...,1.0,5.0,5.0,1.0,,27.0,32.0,10.0,,8.0
4,2025-09-03 22:31:34+00:00,8b8cc28089de,5d4c0104045b490005,1,Junior,Computer Science,25.0,14.0,43.0,23.0,...,1.0,5.0,1.0,1.0,12.0,13.0,7.0,0.0,6.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,2025-09-03 22:21:28+00:00,a89a979d95,8b9bdad2d28e92d2da,2,Sophomore,Mathematics,18.0,,26.0,16.0,...,1.0,5.0,0.0,0.0,4.0,17.0,10.0,3.0,19.0,4.0
153,2025-09-03 22:22:08+00:00,eaecb4edea,8d9dd4d4d48f99d0d0,2,Senior,Computer Science,14.0,5.0,5.0,13.0,...,1.0,5.0,1.0,1.0,0.0,8.0,7.0,0.0,0.0,0.0
154,2025-09-03 22:29:26+00:00,e7d58dcecf9e,bbaae7e2e2b1aceae7,1,Senior,Computer Science,11.0,9.0,14.0,10.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
155,2025-09-03 22:26:09+00:00,eef7a3e5e6,03125d5a5a06105b5d,3,Senior,Electrical and Computer Engineering,35.0,19.0,39.0,34.0,...,1.0,0.0,0.0,5.0,6.0,25.0,25.0,6.0,12.0,0.0
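
The effect of `utc=True` can be checked on two timestamps copied from the first rows of the dataset. Both strings carry different UTC offsets, and after parsing they share one timezone-aware dtype. The variable names `raw` and `parsed` are illustrative.

```python
import pandas as pd

# Two raw timestamps with different UTC offsets.
raw = pd.Series(["2025-09-04 01:21:03 +0300", "2025-09-03 18:22:47 -0400"])

# %z consumes the offset; utc=True converts every value to UTC.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S %z", utc=True)

print(parsed)
print(parsed.dtype)
```

Once all values are in UTC, comparisons such as sorting by submission time are well defined even though the students submitted from different timezones.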


##### Scaling Skill Proficiency Columns

For the skill proficiency columns, scaling may be necessary to ensure that all skills are evaluated on a comparable basis, particularly if the original assessments have different maximum scores. Normalizing these values to a common range (e.g., 0 to 1) facilitates fair comparison and improves the interpretability of summary statistics and visualizations.

In [29]:
# min-max normalization for score columns
scaler = preprocessing.MinMaxScaler()
student_assessments[score_columns] = scaler.fit_transform(student_assessments[score_columns])
student_assessments

Unnamed: 0,timestamp,netid,ruid,section,role,major,data_structures,calculus_and_linear_algebra,probability_and_statistics,data_visualization,...,shell_scripting,sql,python_scripting,jupyter_notebook,regression,programming_languages,algorithms,complexity_measures,visualization_tools,massive_data_processing
0,2025-09-03 22:21:03+00:00,d2dbd3d0d5,786a2021217c6e2022,1,Junior,Computer Science,0.800000,0.80,0.672131,0.800000,...,1.0,1.0,0.0,1.0,0.800000,0.785714,0.571429,0.4,0.608696,0.750000
1,2025-09-03 22:28:39+00:00,c7dd9ac7c494,60703e393965793e3d,1,Junior,Computer Science,0.200000,0.44,0.245902,0.314286,...,0.0,1.0,0.2,,0.000000,0.535714,0.257143,0.4,0.043478,0.166667
2,2025-09-03 22:22:47+00:00,5d504543461b,0f1d55565609195250,1,Senior,Mathematics,0.628571,0.60,0.360656,0.628571,...,,1.0,0.0,0.0,0.266667,0.535714,,,0.391304,0.083333
3,2025-09-03 22:29:53+00:00,021b4e0503,5145080b0b52450c0a,3,Senior,Computer Science,0.828571,0.76,0.901639,0.800000,...,0.2,1.0,1.0,0.2,,0.964286,0.914286,1.0,,0.666667
4,2025-09-03 22:31:34+00:00,8b8cc28089de,5d4c0104045b490005,1,Junior,Computer Science,0.714286,0.56,0.704918,0.657143,...,0.2,1.0,0.2,0.2,0.800000,0.464286,0.200000,0.0,0.260870,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,2025-09-03 22:21:28+00:00,a89a979d95,8b9bdad2d28e92d2da,2,Sophomore,Mathematics,0.514286,,0.426230,0.457143,...,0.2,1.0,0.0,0.0,0.266667,0.607143,0.285714,0.3,0.826087,0.333333
153,2025-09-03 22:22:08+00:00,eaecb4edea,8d9dd4d4d48f99d0d0,2,Senior,Computer Science,0.400000,0.20,0.081967,0.371429,...,0.2,1.0,0.2,0.2,0.000000,0.285714,0.200000,0.0,0.000000,0.000000
154,2025-09-03 22:29:26+00:00,e7d58dcecf9e,bbaae7e2e2b1aceae7,1,Senior,Computer Science,0.314286,0.36,0.229508,0.285714,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
155,2025-09-03 22:26:09+00:00,eef7a3e5e6,03125d5a5a06105b5d,3,Senior,Electrical and Computer Engineering,1.000000,0.76,0.639344,0.971429,...,0.2,0.0,0.0,1.0,0.400000,0.892857,0.714286,0.6,0.521739,0.000000
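
With its default settings, `MinMaxScaler` maps each column through `(x - min) / (max - min)`, where `min` and `max` are the observed values in that column; missing values are ignored when fitting and left as `NaN` in the result. The sketch below reproduces that formula in plain pandas on made-up scores, as a way to check the arithmetic.

```python
import numpy as np
import pandas as pd

# Made-up raw scores with different maxima and some missing values.
scores = pd.DataFrame(
    {
        "sql": [0.0, 5.0, np.nan, 1.0],
        "regression": [0.0, 12.0, 15.0, np.nan],
    }
)

# Column-wise minimum and maximum (NaN values are skipped by default).
col_min = scores.min()
col_max = scores.max()

# Min-max formula applied column by column; NaN stays NaN.
scaled = (scores - col_min) / (col_max - col_min)
print(scaled)
```

Note that the scaling is relative to the highest score observed in each column, which may be lower than the highest score attainable on the questionnaire.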


[Back to top](#tasks)

### Cleaning Data<a id="cleaning"></a>

##### Missing Values

Some of the skill proficiency columns contain missing values. For the purposes of this analysis, these missing values will be imputed with `0`, under the assumption that a missing entry indicates no proficiency in the corresponding skill.

In [30]:
# fill missing skill proficiency with 0
# (the skill columns are the same ones listed in score_columns above)
skill_cols = score_columns

student_assessments[skill_cols] = student_assessments[skill_cols].fillna(0)
student_assessments

Unnamed: 0,timestamp,netid,ruid,section,role,major,data_structures,calculus_and_linear_algebra,probability_and_statistics,data_visualization,...,shell_scripting,sql,python_scripting,jupyter_notebook,regression,programming_languages,algorithms,complexity_measures,visualization_tools,massive_data_processing
0,2025-09-03 22:21:03+00:00,d2dbd3d0d5,786a2021217c6e2022,1,Junior,Computer Science,0.800000,0.80,0.672131,0.800000,...,1.0,1.0,0.0,1.0,0.800000,0.785714,0.571429,0.4,0.608696,0.750000
1,2025-09-03 22:28:39+00:00,c7dd9ac7c494,60703e393965793e3d,1,Junior,Computer Science,0.200000,0.44,0.245902,0.314286,...,0.0,1.0,0.2,0.0,0.000000,0.535714,0.257143,0.4,0.043478,0.166667
2,2025-09-03 22:22:47+00:00,5d504543461b,0f1d55565609195250,1,Senior,Mathematics,0.628571,0.60,0.360656,0.628571,...,0.0,1.0,0.0,0.0,0.266667,0.535714,0.000000,0.0,0.391304,0.083333
3,2025-09-03 22:29:53+00:00,021b4e0503,5145080b0b52450c0a,3,Senior,Computer Science,0.828571,0.76,0.901639,0.800000,...,0.2,1.0,1.0,0.2,0.000000,0.964286,0.914286,1.0,0.000000,0.666667
4,2025-09-03 22:31:34+00:00,8b8cc28089de,5d4c0104045b490005,1,Junior,Computer Science,0.714286,0.56,0.704918,0.657143,...,0.2,1.0,0.2,0.2,0.800000,0.464286,0.200000,0.0,0.260870,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,2025-09-03 22:21:28+00:00,a89a979d95,8b9bdad2d28e92d2da,2,Sophomore,Mathematics,0.514286,0.00,0.426230,0.457143,...,0.2,1.0,0.0,0.0,0.266667,0.607143,0.285714,0.3,0.826087,0.333333
153,2025-09-03 22:22:08+00:00,eaecb4edea,8d9dd4d4d48f99d0d0,2,Senior,Computer Science,0.400000,0.20,0.081967,0.371429,...,0.2,1.0,0.2,0.2,0.000000,0.285714,0.200000,0.0,0.000000,0.000000
154,2025-09-03 22:29:26+00:00,e7d58dcecf9e,bbaae7e2e2b1aceae7,1,Senior,Computer Science,0.314286,0.36,0.229508,0.285714,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
155,2025-09-03 22:26:09+00:00,eef7a3e5e6,03125d5a5a06105b5d,3,Senior,Electrical and Computer Engineering,1.000000,0.76,0.639344,0.971429,...,0.2,0.0,0.0,1.0,0.400000,0.892857,0.714286,0.6,0.521739,0.000000
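
Imputing with `0` is an assumption that affects later statistics, because values that were previously skipped now count as the lowest possible score. The sketch below, on a made-up column, shows how the mean shifts and how the number of imputed entries can be recorded beforehand.

```python
import numpy as np
import pandas as pd

# Made-up normalized scores with two missing entries.
sql = pd.Series([1.0, np.nan, 0.5, np.nan], name="sql")

# Record how many values will be imputed before overwriting them.
n_imputed = int(sql.isna().sum())

# The mean skips NaN, so it is based on the two reported scores only.
mean_before = sql.mean()

# After imputation the two zeros pull the mean down.
filled = sql.fillna(0)
mean_after = filled.mean()

print(n_imputed, mean_before, mean_after)
```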


##### Invalid or Outlier Values

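Entries in the `netid` and `ruid` columns that violate the length rules in the dataset description (8 to 14 characters for `netid`, exactly 18 characters for `ruid`) are considered invalid. These entries will be removed to maintain data integrity and ensure the accuracy of subsequent analyses.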

In [31]:
# drop rows with len(netid) < 8 or len(netid) > 14
student_assessments = student_assessments[student_assessments['netid'].apply(lambda x: 8 <= len(x) <= 14)]
# drop rows with len(ruid) != 18
student_assessments = student_assessments[student_assessments['ruid'].apply(lambda x: len(x) == 18)]

student_assessments

Unnamed: 0,timestamp,netid,ruid,section,role,major,data_structures,calculus_and_linear_algebra,probability_and_statistics,data_visualization,...,shell_scripting,sql,python_scripting,jupyter_notebook,regression,programming_languages,algorithms,complexity_measures,visualization_tools,massive_data_processing
0,2025-09-03 22:21:03+00:00,d2dbd3d0d5,786a2021217c6e2022,1,Junior,Computer Science,0.800000,0.80,0.672131,0.800000,...,1.0,1.0,0.0,1.0,0.800000,0.785714,0.571429,0.4,0.608696,0.750000
1,2025-09-03 22:28:39+00:00,c7dd9ac7c494,60703e393965793e3d,1,Junior,Computer Science,0.200000,0.44,0.245902,0.314286,...,0.0,1.0,0.2,0.0,0.000000,0.535714,0.257143,0.4,0.043478,0.166667
2,2025-09-03 22:22:47+00:00,5d504543461b,0f1d55565609195250,1,Senior,Mathematics,0.628571,0.60,0.360656,0.628571,...,0.0,1.0,0.0,0.0,0.266667,0.535714,0.000000,0.0,0.391304,0.083333
3,2025-09-03 22:29:53+00:00,021b4e0503,5145080b0b52450c0a,3,Senior,Computer Science,0.828571,0.76,0.901639,0.800000,...,0.2,1.0,1.0,0.2,0.000000,0.964286,0.914286,1.0,0.000000,0.666667
4,2025-09-03 22:31:34+00:00,8b8cc28089de,5d4c0104045b490005,1,Junior,Computer Science,0.714286,0.56,0.704918,0.657143,...,0.2,1.0,0.2,0.2,0.800000,0.464286,0.200000,0.0,0.260870,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,2025-09-03 22:21:28+00:00,a89a979d95,8b9bdad2d28e92d2da,2,Sophomore,Mathematics,0.514286,0.00,0.426230,0.457143,...,0.2,1.0,0.0,0.0,0.266667,0.607143,0.285714,0.3,0.826087,0.333333
153,2025-09-03 22:22:08+00:00,eaecb4edea,8d9dd4d4d48f99d0d0,2,Senior,Computer Science,0.400000,0.20,0.081967,0.371429,...,0.2,1.0,0.2,0.2,0.000000,0.285714,0.200000,0.0,0.000000,0.000000
154,2025-09-03 22:29:26+00:00,e7d58dcecf9e,bbaae7e2e2b1aceae7,1,Senior,Computer Science,0.314286,0.36,0.229508,0.285714,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
155,2025-09-03 22:26:09+00:00,eef7a3e5e6,03125d5a5a06105b5d,3,Senior,Electrical and Computer Engineering,1.000000,0.76,0.639344,0.971429,...,0.2,0.0,0.0,1.0,0.400000,0.892857,0.714286,0.6,0.521739,0.000000
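
The same length checks can also be written with the vectorized `.str` accessor, which makes it easy to count the invalid rows before dropping them. This is a sketch on made-up identifiers, not a replacement for the cell above.

```python
import pandas as pd

# Made-up identifiers: the second netid is too short, the fourth too long,
# and the third ruid does not have 18 characters.
toy = pd.DataFrame(
    {
        "netid": ["c1d4dedb", "abc", "d2dbd3d0d5", "0123456789abcdef"],
        "ruid": [
            "786a2021217c6e2022",
            "60703e393965793e3d",
            "short",
            "0f1d55565609195250",
        ],
    }
)

# Boolean masks for the two validity rules (between() is inclusive).
valid_netid = toy["netid"].str.len().between(8, 14)
valid_ruid = toy["ruid"].str.len().eq(18)

print("invalid netid rows:", int((~valid_netid).sum()))
print("invalid ruid rows:", int((~valid_ruid).sum()))

# Keep only the rows that satisfy both rules.
cleaned = toy[valid_netid & valid_ruid]
print(cleaned)
```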


##### Duplicates

For students who submitted the assessment multiple times, only the most recent submission will be retained, as determined by the `timestamp` column. This approach ensures that each student is represented by a single, up-to-date record in the dataset.


In [32]:
# remove duplicate student info (keep the latest submission)
# sort by timestamp first
student_assessments = student_assessments.sort_values("timestamp")
# then drop duplicates
student_assessments = student_assessments.drop_duplicates(subset=["netid"], keep="last")
student_assessments = student_assessments.drop_duplicates(subset=["ruid"], keep="last")
student_assessments

Unnamed: 0,timestamp,netid,ruid,section,role,major,data_structures,calculus_and_linear_algebra,probability_and_statistics,data_visualization,...,shell_scripting,sql,python_scripting,jupyter_notebook,regression,programming_languages,algorithms,complexity_measures,visualization_tools,massive_data_processing
90,2025-09-03 22:05:22+00:00,c1d4dedb,8591dadfdf8c9fded6,3,Senior,Computer Science,0.371429,0.00,0.229508,1.000000,...,0.2,0.0,0.0,0.0,0.200000,0.678571,0.857143,0.2,0.304348,0.166667
31,2025-09-03 22:05:49+00:00,fce8fefdf5a0,33226d6a6a3425686a,1,Junior,Computer Science,0.114286,0.52,0.213115,0.400000,...,0.2,0.2,0.2,0.2,0.000000,0.214286,0.114286,0.1,0.000000,0.000000
102,2025-09-03 22:06:05+00:00,8a90d1898fdf,acbef4f5f5acbcf2fd,3,Senior,Mathematics,0.571429,0.96,0.852459,0.828571,...,0.2,0.2,0.2,0.2,0.600000,0.642857,0.571429,0.5,0.000000,0.000000
156,2025-09-03 22:06:40+00:00,d6da8cd2d28d,11024f48481a064940,3,Junior,Computer Science,0.257143,0.24,0.147541,0.342857,...,0.2,0.2,0.2,0.2,0.000000,0.250000,0.142857,0.2,0.000000,0.000000
22,2025-09-03 22:06:41+00:00,d3ca9bdadc85,f5e7a8acacf3ecaeaa,3,Senior,Computer Science,0.828571,0.64,0.557377,0.371429,...,0.2,0.2,0.0,0.2,0.066667,0.678571,0.314286,0.6,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13,2025-09-03 22:38:31+00:00,4d4c05454418,a8baf3f1f1abb1f5f9,1,Junior,Computer Science,0.800000,0.72,0.213115,0.685714,...,0.2,0.2,0.0,1.0,0.000000,0.428571,0.228571,0.0,0.000000,0.000000
128,2025-09-03 22:38:35+00:00,6e7832636338,cad8919393c0d89a93,1,Junior,Computer Science,0.314286,0.32,0.508197,0.514286,...,0.2,0.2,0.2,1.0,0.200000,0.178571,0.342857,0.6,0.000000,0.000000
15,2025-09-03 22:38:50+00:00,697760696d,7f6d252626786c2f25,4,Junior,Computer Science,0.857143,0.72,0.770492,0.885714,...,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
17,2025-09-03 22:38:56+00:00,c1c7c4cbc8,a7b6f8fefea1b2fcf7,4,Senior,Computer Science,0.285714,0.28,0.213115,0.285714,...,0.2,1.0,0.2,0.2,0.000000,0.214286,0.142857,0.4,0.000000,0.000000
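
The sort-then-drop pattern can be verified on a small made-up example in which one student submits twice. After sorting by `timestamp`, `keep="last"` retains the later submission.

```python
import pandas as pd

# Made-up submissions: student "aaaaaaaa" appears twice.
toy = pd.DataFrame(
    {
        "netid": ["aaaaaaaa", "bbbbbbbb", "aaaaaaaa"],
        "timestamp": pd.to_datetime(
            ["2025-09-03 22:30:00", "2025-09-03 22:10:00", "2025-09-03 22:05:00"],
            utc=True,
        ),
        "sql": [1.0, 0.2, 0.0],
    }
)

# Sort chronologically, then keep the last (most recent) row per netid.
latest = toy.sort_values("timestamp").drop_duplicates(subset="netid", keep="last")
print(latest)
```

The row with `sql == 1.0` survives for `aaaaaaaa` because it has the later timestamp, even though it appears first in the original frame.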


[Back to top](#tasks)

### Enriching Data<a id="enriching"></a>

##### Merging Data

Since self-reported section numbers may be inaccurate, we can cross-reference the `netid` with an external dataset containing accurate section information to correct any discrepancies.

`student_list_generated.csv` contains the following columns:
- `netid`: The encoded NetID of the student.
- `section`: The accurate course section number for the student.

In [33]:
student_list = pd.read_csv('student_list_generated.csv')
student_list

Unnamed: 0,netid,section
0,d6cf94d6d68e9b,4
1,223c672524786d,4
2,121f0a0d0854,4
3,8d9085848c,4
4,5f42075a58,4
...,...,...
117,4d4c05454418,3
118,7c6022727a21,3
119,58480d5351,1
120,6370237a7325,3


To merge the two datasets, we can use the `pd.merge()` function in pandas, which performs a SQL-style join. We will perform an inner join on the `netid` column to combine the datasets, ensuring that we only retain records with matching `netid` values in both datasets. This will help us update the `section` information in our main dataset with the accurate data from the external file.

Why an inner join? Some students in `assessment_generated.csv` may not be present in `student_list_generated.csv` because they dropped the course. Also, some students in `student_list_generated.csv` may not have submitted the assessment. An inner join ensures that we only keep records for students who are present in both datasets, which is essential for maintaining data integrity and relevance for our analysis.

In [34]:
# inner join to enrich data
student_assessment_section = pd.merge(
    student_assessments,
    student_list,
    on='netid',
    how='inner',
    suffixes=('_assessment', '_list')
)

# drop the old section column
student_assessment_section = student_assessment_section.drop(columns=['section_assessment'])
# rename the new section column
student_assessment_section = student_assessment_section.rename(columns={'section_list': 'section'})

student_assessment_section

Unnamed: 0,timestamp,netid,ruid,role,major,data_structures,calculus_and_linear_algebra,probability_and_statistics,data_visualization,python_libraries,...,sql,python_scripting,jupyter_notebook,regression,programming_languages,algorithms,complexity_measures,visualization_tools,massive_data_processing,section
0,2025-09-03 22:05:22+00:00,c1d4dedb,8591dadfdf8c9fded6,Senior,Computer Science,0.371429,0.00,0.229508,1.000000,1.00,...,0.0,0.0,0.0,0.200000,0.678571,0.857143,0.2,0.304348,0.166667,1
1,2025-09-03 22:05:49+00:00,fce8fefdf5a0,33226d6a6a3425686a,Junior,Computer Science,0.114286,0.52,0.213115,0.400000,0.20,...,0.2,0.2,0.2,0.000000,0.214286,0.114286,0.1,0.000000,0.000000,2
2,2025-09-03 22:06:41+00:00,d3ca9bdadc85,f5e7a8acacf3ecaeaa,Senior,Computer Science,0.828571,0.64,0.557377,0.371429,0.10,...,0.2,0.0,0.2,0.066667,0.678571,0.314286,0.6,0.000000,0.000000,2
3,2025-09-03 22:06:50+00:00,e8fee3ece3,8193dbd8d88390dfd9,Senior,Computer Science,0.828571,0.60,0.426230,0.742857,0.65,...,0.0,0.0,0.0,0.266667,0.750000,0.228571,0.5,0.260870,0.250000,3
4,2025-09-03 22:06:53+00:00,a5aae0bcbbef,d0c2888989d0c78b8d,Senior,Computer Science,0.457143,0.52,0.229508,0.314286,0.10,...,0.0,0.2,1.0,0.133333,0.607143,0.000000,0.0,0.000000,0.083333,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99,2025-09-03 22:37:44+00:00,2f2363343e6e,dccd808585daca8080,Senior,Computer Science,0.571429,0.60,0.344262,0.200000,0.00,...,0.0,0.2,0.2,0.200000,0.321429,0.200000,0.3,0.000000,0.000000,3
100,2025-09-03 22:38:18+00:00,636c6760613e,7c6e2c25257b652227,Senior,Computer Science,0.657143,0.60,0.524590,0.742857,0.15,...,0.2,1.0,1.0,0.466667,0.535714,0.314286,0.5,0.086957,0.083333,4
101,2025-09-03 22:38:31+00:00,4d4c05454418,a8baf3f1f1abb1f5f9,Junior,Computer Science,0.800000,0.72,0.213115,0.685714,0.00,...,0.2,0.0,1.0,0.000000,0.428571,0.228571,0.0,0.000000,0.000000,3
102,2025-09-03 22:38:35+00:00,6e7832636338,cad8919393c0d89a93,Junior,Computer Science,0.314286,0.32,0.508197,0.514286,0.20,...,0.2,0.2,1.0,0.200000,0.178571,0.342857,0.6,0.000000,0.000000,4
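
Before committing to the inner join, it can be useful to audit how the two tables line up. The sketch below is an optional check on made-up data: an outer join with `indicator=True` labels each row as `both`, `left_only`, or `right_only`, and `validate="one_to_one"` raises an error if a `netid` is repeated in either table. The frames `assessments` and `roster` are hypothetical stand-ins for the two datasets.

```python
import pandas as pd

# Made-up stand-ins for the assessment data and the verified student list.
assessments = pd.DataFrame(
    {"netid": ["aaaaaaaa", "bbbbbbbb", "cccccccc"], "section": [1, 2, 3]}
)
roster = pd.DataFrame(
    {"netid": ["bbbbbbbb", "cccccccc", "dddddddd"], "section": [2, 4, 1]}
)

# Outer join keeps unmatched rows so they can be counted.
audit = pd.merge(
    assessments,
    roster,
    on="netid",
    how="outer",
    suffixes=("_assessment", "_list"),
    indicator=True,
    validate="one_to_one",
)
match_counts = audit["_merge"].value_counts()
print(match_counts)

# Among matched students, list those whose self-reported section was wrong.
both = audit[audit["_merge"] == "both"]
mismatched = both[both["section_assessment"] != both["section_list"]]
print(mismatched[["netid", "section_assessment", "section_list"]])
```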


##### Creating New Columns

Skill proficiency columns can be grouped into two main categories, and a summary score for each category is derived as the row-wise sum of its normalized columns:
- **Background Requirements**: `data_structures`, `calculus_and_linear_algebra`, `probability_and_statistics`, `data_visualization`, `python_libraries`, `shell_scripting`, `sql`, `python_scripting`, `jupyter_notebook`, `visualization_tools`, `programming_languages`
- **Core Competencies**: `regression`, `algorithms`, `complexity_measures`, `massive_data_processing`

In [35]:
background_columns = [
    "data_structures",
    "calculus_and_linear_algebra",
    "probability_and_statistics",
    "data_visualization",
    "python_libraries",
    "shell_scripting",
    "sql",
    "python_scripting",
    "jupyter_notebook",
    "visualization_tools",
    "programming_languages",
]
core_columns = [
    "regression",
    "algorithms",
    "complexity_measures",
    "massive_data_processing",
]


student_assessment_section["background"] = student_assessment_section[background_columns].sum(axis=1)
student_assessment_section["core"] = student_assessment_section[core_columns].sum(axis=1)

student_assessment_section

Unnamed: 0,timestamp,netid,ruid,role,major,data_structures,calculus_and_linear_algebra,probability_and_statistics,data_visualization,python_libraries,...,jupyter_notebook,regression,programming_languages,algorithms,complexity_measures,visualization_tools,massive_data_processing,section,background,core
0,2025-09-03 22:05:22+00:00,c1d4dedb,8591dadfdf8c9fded6,Senior,Computer Science,0.371429,0.00,0.229508,1.000000,1.00,...,0.0,0.200000,0.678571,0.857143,0.2,0.304348,0.166667,1,3.783856,1.423810
1,2025-09-03 22:05:49+00:00,fce8fefdf5a0,33226d6a6a3425686a,Junior,Computer Science,0.114286,0.52,0.213115,0.400000,0.20,...,0.2,0.000000,0.214286,0.114286,0.1,0.000000,0.000000,2,2.461686,0.214286
2,2025-09-03 22:06:41+00:00,d3ca9bdadc85,f5e7a8acacf3ecaeaa,Senior,Computer Science,0.828571,0.64,0.557377,0.371429,0.10,...,0.2,0.066667,0.678571,0.314286,0.6,0.000000,0.000000,2,3.775948,0.980952
3,2025-09-03 22:06:50+00:00,e8fee3ece3,8193dbd8d88390dfd9,Senior,Computer Science,0.828571,0.60,0.426230,0.742857,0.65,...,0.0,0.266667,0.750000,0.228571,0.5,0.260870,0.250000,3,4.458528,1.245238
4,2025-09-03 22:06:53+00:00,a5aae0bcbbef,d0c2888989d0c78b8d,Senior,Computer Science,0.457143,0.52,0.229508,0.314286,0.10,...,1.0,0.133333,0.607143,0.000000,0.0,0.000000,0.083333,4,3.628080,0.216667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99,2025-09-03 22:37:44+00:00,2f2363343e6e,dccd808585daca8080,Senior,Computer Science,0.571429,0.60,0.344262,0.200000,0.00,...,0.2,0.200000,0.321429,0.200000,0.3,0.000000,0.000000,3,2.637119,0.700000
100,2025-09-03 22:38:18+00:00,636c6760613e,7c6e2c25257b652227,Senior,Computer Science,0.657143,0.60,0.524590,0.742857,0.15,...,1.0,0.466667,0.535714,0.314286,0.5,0.086957,0.083333,4,5.697261,1.364286
101,2025-09-03 22:38:31+00:00,4d4c05454418,a8baf3f1f1abb1f5f9,Junior,Computer Science,0.800000,0.72,0.213115,0.685714,0.00,...,1.0,0.000000,0.428571,0.228571,0.0,0.000000,0.000000,3,4.247400,0.228571
102,2025-09-03 22:38:35+00:00,6e7832636338,cad8919393c0d89a93,Junior,Computer Science,0.314286,0.32,0.508197,0.514286,0.20,...,1.0,0.200000,0.178571,0.342857,0.6,0.000000,0.000000,4,3.635340,1.142857
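
Because the two categories contain different numbers of columns (11 and 4), their sums lie on different ranges. If the two derived scores need to be compared directly, one possible variant is to average instead of sum, which keeps both on the 0 to 1 scale. The sketch below illustrates this on made-up data with a reduced set of columns; the names `background_mean` and `core_mean` are hypothetical.

```python
import pandas as pd

# Made-up normalized scores for two students.
toy = pd.DataFrame(
    {
        "sql": [1.0, 0.2],
        "python_scripting": [1.0, 0.0],
        "jupyter_notebook": [0.5, 0.4],
        "regression": [0.8, 0.0],
        "algorithms": [0.4, 0.2],
    }
)

# Reduced column groups, for illustration only.
background_cols = ["sql", "python_scripting", "jupyter_notebook"]
core_cols = ["regression", "algorithms"]

# Row-wise means stay within [0, 1] regardless of group size.
toy["background_mean"] = toy[background_cols].mean(axis=1)
toy["core_mean"] = toy[core_cols].mean(axis=1)
print(toy[["background_mean", "core_mean"]])
```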


[Back to top](#tasks)