# UC San Diego: Data Science in Practice - Data Checkpoint
### Summer Session I 2023 | Instructor : C. Alex Simpkins Ph.D.

## UCSD Grade Inflation

# Names

- Emi Lee
- Cindy Luu
- Erwin Miguel Olimpo
- Neda Emdad
- Calvin Nguyen
- Diya Lakhani

<a id='research_question'></a>
# Research Question

How have UCSD students' grade point averages (GPAs) changed between 2000 and 2022? What factors have influenced these changes, considering the effects of:

* (1) the recent COVID-19 pandemic on student GPAs
* (2) the 2022 UCSD Academic Worker Strike on student GPAs
* (3) the emergence of ChatGPT


# Dataset(s)


- Dataset Name: UCSD CAPEs Data
- Link to the dataset: https://www.kaggle.com/datasets/sanbornpnguyen/ucsdcapes?utm_medium=social&utm_campaign=kaggle-dataset-share
- Number of observations: 63363 rows

The UCSD CAPEs Data is our primary dataset. It contains all the CAPE evaluations from 2007 to 2023, scraped from the CAPEs website (https://capes.ucsd.edu).

Note: We may add more datasets later, depending on what the analysis requires.

# Data Wrangling

The first step is to import the appropriate modules and convert the data from the CSV file to a dataframe.

In [70]:
import pandas as pd
import numpy as np
capes = pd.read_csv('raw_data/capes_data.csv')

In [71]:
capes.head()

| | Instructor | Course | Quarter | Total Enrolled in Course | Total CAPEs Given | Percentage Recommended Class | Percentage Recommended Professor | Study Hours per Week | Average Grade Expected | Average Grade Received | Evalulation URL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Butler Elizabeth Annette | AAS 10 - Intro/African-American Studies (A) | SP23 | 66 | 48 | 93.5% | 100.0% | 2.8 | A- (3.84) | B+ (3.67) | https://cape.ucsd.edu/CAPEReport.aspx?sectioni... |
| 1 | Butler Elizabeth Annette | AAS 170 - Legacies of Research (A) | SP23 | 20 | 7 | 100.0% | 100.0% | 2.5 | A- (3.86) | A- (3.92) | https://cape.ucsd.edu/CAPEReport.aspx?sectioni... |
| 2 | Jones Ian William Nasser | ANAR 111 - Foundations of Archaeology (A) | SP23 | 16 | 3 | 100.0% | 100.0% | 3.83 | B+ (3.67) | NaN | https://cape.ucsd.edu/CAPEReport.aspx?sectioni... |
| 3 | Shtienberg Gilad | ANAR 115 - Coastal Geomorphology/Environ (A) | SP23 | 26 | 6 | 100.0% | 83.3% | 3.83 | B+ (3.50) | B (3.07) | https://cape.ucsd.edu/CAPEReport.aspx?sectioni... |
| 4 | Braswell Geoffrey E. | ANAR 155 - Stdy Abrd: Ancient Mesoamerica (A) | SP23 | 22 | 9 | 100.0% | 100.0% | 5.17 | A (4.00) | A (4.00) | https://cape.ucsd.edu/CAPEReport.aspx?sectioni... |


Looking at the first few rows of our capes dataframe, no further data wrangling appears to be required, so we can move on to data cleaning.

# Data Cleaning

Let's begin by checking the dimensions of our dataset and the column names.

In [72]:
capes.shape

(63363, 11)

In [73]:
capes.columns

Index(['Instructor', 'Course', 'Quarter', 'Total Enrolled in Course',
       'Total CAPEs Given', 'Percentage Recommended Class',
       'Percentage Recommended Professor', 'Study Hours per Week',
       'Average Grade Expected', 'Average Grade Received', 'Evalulation URL'],
      dtype='object')

Since our research question focuses on courses and course grades, we can drop all unnecessary columns.

In [74]:
capes = capes.drop(columns = ['Instructor', 'Percentage Recommended Class',
       'Percentage Recommended Professor', 'Study Hours per Week',
       'Average Grade Expected', 'Evalulation URL', 'Total Enrolled in Course',
       'Total CAPEs Given'])
capes.head()

| | Course | Quarter | Average Grade Received |
|---|---|---|---|
| 0 | AAS 10 - Intro/African-American Studies (A) | SP23 | B+ (3.67) |
| 1 | AAS 170 - Legacies of Research (A) | SP23 | A- (3.92) |
| 2 | ANAR 111 - Foundations of Archaeology (A) | SP23 | NaN |
| 3 | ANAR 115 - Coastal Geomorphology/Environ (A) | SP23 | B (3.07) |
| 4 | ANAR 155 - Stdy Abrd: Ancient Mesoamerica (A) | SP23 | A (4.00) |


In [75]:
#checking for NaN values
capes.isnull().sum().any()

True

Since our dataframe contains NaN values, we drop every row that has a NaN value.

In [76]:
#dropping NaN rows
capes = capes.dropna()
capes.isnull().sum().any()

False
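
The single boolean above does not show which column holds the missing values or how many rows are lost. As a minimal sketch, assuming the column names of the trimmed dataframe and using a small made-up frame (`demo` and its values are hypothetical, not taken from the dataset), the check could be broken down per column like this:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the trimmed frame; the values are made up for illustration.
demo = pd.DataFrame({
    "Course": ["AAS 10 - Intro (A)", "ANAR 111 - Foundations (A)", "ANBI 111 - Human Evolution (A)"],
    "Quarter": ["SP23", "SP23", "SP23"],
    "Average Grade Received": ["B+ (3.67)", np.nan, "B- (2.95)"],
})

# Count missing values per column instead of collapsing to one boolean.
missing_per_column = demo.isnull().sum()

# Drop only the rows that lack a received grade and record how many were removed.
cleaned = demo.dropna(subset=["Average Grade Received"])
rows_dropped = len(demo) - len(cleaned)

print(missing_per_column)
print("rows dropped:", rows_dropped)
```

Restricting `dropna` to the grade column makes the intent explicit: a row is unusable for this analysis only when the received grade is missing.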

Grades in the 'Average Grade Received' column have the form `letter_grade (GPA)`, for example `B+ (3.67)`. To take a mean of the GPAs, we need to split the column into a letter grade and a numeric GPA.

In [77]:
#defining function to split the grade
def split_grade_gpa(string):
    lst1 = string.split("(")
    str1 = lst1[1].strip(")")
    return float(str1)

def split_grade(string):
    lst1 = string.split("(")
    return lst1[0].strip(" ")

cape_GPAs = capes['Average Grade Received'].apply(split_grade_gpa)
cape_grades = capes['Average Grade Received'].apply(split_grade)
capes['GPA'] = cape_GPAs
capes['Grade'] = cape_grades
capes = capes.drop(columns=['Average Grade Received'])
capes.head()

| | Course | Quarter | GPA | Grade |
|---|---|---|---|---|
| 0 | AAS 10 - Intro/African-American Studies (A) | SP23 | 3.67 | B+ |
| 1 | AAS 170 - Legacies of Research (A) | SP23 | 3.92 | A- |
| 3 | ANAR 115 - Coastal Geomorphology/Environ (A) | SP23 | 3.07 | B |
| 4 | ANAR 155 - Stdy Abrd: Ancient Mesoamerica (A) | SP23 | 4.0 | A |
| 5 | ANBI 111 - Human Evolution (A) | SP23 | 2.95 | B- |
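
The same split can also be written as one vectorized step. The sketch below is an alternative to the two helper functions, not the approach used in this notebook. It assumes every entry looks like `B+ (3.67)`; the series `demo_grades` and the regular expression are illustrative assumptions.

```python
import pandas as pd

# Sample values in the same "letter (GPA)" shape as the column above.
demo_grades = pd.Series(["B+ (3.67)", "A- (3.92)", "A (4.00)"],
                        name="Average Grade Received")

# One named group for the letter grade, one for the number in parentheses.
# Entries that do not match the pattern come back as NaN instead of raising.
pattern = r"^\s*(?P<Grade>[A-F][+-]?)\s*\((?P<GPA>\d+\.\d+)\)\s*$"
parts = demo_grades.str.extract(pattern)
parts["GPA"] = parts["GPA"].astype(float)

print(parts)
```

A pattern-based split surfaces malformed entries as NaN, which can then be inspected, whereas indexing into the result of `split("(")` raises an error on the first entry that lacks a parenthesis.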


Similarly, we need columns for the department, for whether a course is lower division or upper division, and for the year.

In [78]:
#function to obtain department code
def dept_strip(string):
    lst1 = string.split(" ")
    return lst1[0]

#function to check for lower division (course number <= 99) / upper division
def ldud(string):
    #the second token of the course string is the course number, such as "10" or "170"
    course_num = string.split(" ")[1]
    #keep only the digits so that any letters attached to the number are ignored
    digits = "".join(filter(str.isdigit, course_num))
    if int(digits) <= 99:
        return "LD"
    else:
        return "UD"
    
dept_names = capes['Course'].apply(dept_strip)
upper_lower = capes['Course'].apply(ldud)

capes['Division'] = upper_lower
capes['Dept'] = dept_names
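
As a hedged sketch of the same idea in vectorized form, the department code and the numeric part of the course number can be pulled out with a single regular expression. The sample series `demo_courses` is illustrative, the `HYPO 20A` entry is made up to show a letter suffix, and the pattern assumes the course number starts with digits.

```python
import pandas as pd

demo_courses = pd.Series([
    "AAS 10 - Intro/African-American Studies (A)",
    "ANAR 115 - Coastal Geomorphology/Environ (A)",
    "HYPO 20A - Made-Up Course (A)",  # hypothetical entry with a letter suffix
])

# First token is the department code; the leading digits of the second token
# are the course number (any letter suffix is left out of the match).
parts = demo_courses.str.extract(r"^(?P<Dept>\S+)\s+(?P<Number>\d+)")
course_number = parts["Number"].astype(int)

# Numbers up to 99 are lower division, everything else is upper division.
division = course_number.le(99).map({True: "LD", False: "UD"})

print(parts["Dept"].tolist())
print(division.tolist())
```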

In [79]:
def extract_year(string):
    ret_str = "20" + string[2:4]
    return int(ret_str)

years = capes['Quarter'].apply(extract_year)
capes['Year'] = years
capes.head()

| | Course | Quarter | GPA | Grade | Division | Dept | Year |
|---|---|---|---|---|---|---|---|
| 0 | AAS 10 - Intro/African-American Studies (A) | SP23 | 3.67 | B+ | LD | AAS | 2023 |
| 1 | AAS 170 - Legacies of Research (A) | SP23 | 3.92 | A- | UD | AAS | 2023 |
| 3 | ANAR 115 - Coastal Geomorphology/Environ (A) | SP23 | 3.07 | B | UD | ANAR | 2023 |
| 4 | ANAR 155 - Stdy Abrd: Ancient Mesoamerica (A) | SP23 | 4.0 | A | UD | ANAR | 2023 |
| 5 | ANBI 111 - Human Evolution (A) | SP23 | 2.95 | B- | UD | ANBI | 2023 |
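
`extract_year` prefixes the last two characters of the quarter code with "20", which holds for this dataset because it only covers 2007 to 2023. A minimal sketch of a stricter parse is shown below, assuming quarter codes are a two-character term followed by a two-digit year; the pattern and the names `demo_quarters`, `Term`, and `YY` are illustrative assumptions.

```python
import pandas as pd

demo_quarters = pd.Series(["SP23", "FA07", "WI19"])

# Two-character term code followed by a two-digit year.
# Codes that do not fit the pattern yield NaN, so they can be caught early.
parts = demo_quarters.str.extract(r"^(?P<Term>[A-Z0-9]{2})(?P<YY>\d{2})$")

# Every quarter in this dataset falls in the 2000s, so adding 2000 is safe here.
year = 2000 + parts["YY"].astype(int)

print(parts["Term"].tolist())
print(year.tolist())
```

Keeping the term code as its own column also makes it possible to compare the same term across years later on.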


Let's rearrange the columns.

In [80]:
capes = capes.get(['Quarter', 'Dept', 'Course', 'Division', 'Year', 'GPA', 'Grade'])
capes.head()

| | Quarter | Dept | Course | Division | Year | GPA | Grade |
|---|---|---|---|---|---|---|---|
| 0 | SP23 | AAS | AAS 10 - Intro/African-American Studies (A) | LD | 2023 | 3.67 | B+ |
| 1 | SP23 | AAS | AAS 170 - Legacies of Research (A) | UD | 2023 | 3.92 | A- |
| 3 | SP23 | ANAR | ANAR 115 - Coastal Geomorphology/Environ (A) | UD | 2023 | 3.07 | B |
| 4 | SP23 | ANAR | ANAR 155 - Stdy Abrd: Ancient Mesoamerica (A) | UD | 2023 | 4.0 | A |
| 5 | SP23 | ANBI | ANBI 111 - Human Evolution (A) | UD | 2023 | 2.95 | B- |


In [81]:
#Grouping by quarter, department, division, and GPA (the mean is taken over the remaining column, Year)
capes_sub = capes.get(['Quarter', 'Dept', 'Division', 'Year', 'GPA'])
capes_sub = capes_sub.groupby(['Quarter', 'Dept', 'Division','GPA']).mean().reset_index()
capes_sub

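| | Quarter | Dept | Division | GPA | Year |
|---|---|---|---|---|---|
| 0 | FA07 | ANAR | UD | 2.61 | 2007.0 |
| 1 | FA07 | ANBI | UD | 2.64 | 2007.0 |
| 2 | FA07 | ANBI | UD | 3.44 | 2007.0 |
| 3 | FA07 | ANSC | UD | 2.71 | 2007.0 |
| 4 | FA07 | ANSC | UD | 3.47 | 2007.0 |
| ... | ... | ... | ... | ... | ... |
| 39651 | WI23 | WCWP | LD | 3.93 | 2023.0 |
| 39652 | WI23 | WCWP | UD | 3.47 | 2023.0 |
| 39653 | WI23 | WCWP | UD | 3.57 | 2023.0 |
| 39654 | WI23 | WCWP | UD | 3.82 | 2023.0 |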
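
Because `GPA` is one of the grouping keys in the cell above, each output row is a distinct GPA value within a quarter, department, and division, and the mean is computed over `Year`. A minimal sketch of averaging GPA instead is shown below; it assumes the same column names and uses a small frame with made-up values (`demo` is hypothetical, not part of the dataset).

```python
import pandas as pd

# Hypothetical rows with made-up GPAs, using the same column names as capes_sub.
demo = pd.DataFrame({
    "Quarter": ["FA07", "FA07", "FA07", "WI23"],
    "Dept": ["ANBI", "ANBI", "ANSC", "WCWP"],
    "Division": ["UD", "UD", "UD", "LD"],
    "Year": [2007, 2007, 2007, 2023],
    "GPA": [2.5, 3.5, 3.0, 4.0],
})

# Group by everything except GPA, then average GPA within each group.
mean_gpa = (
    demo.groupby(["Quarter", "Dept", "Division", "Year"], as_index=False)["GPA"]
        .mean()
)

print(mean_gpa)
```

Note that this is an unweighted mean of section-level averages: a small seminar counts as much as a large lecture. Weighting by enrollment would require keeping the 'Total Enrolled in Course' column that was dropped earlier.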
