# UC San Diego: Data Science in Practice - Data Checkpoint
### Summer Session I 2023 | Instructor : C. Alex Simpkins Ph.D.

## UCSD Grade Inflation

# Names

- Emi Lee
- Cindy Luu
- Erwin Miguel Olimpo
- Neda Emdad
- Calvin Nguyen
- Diya Lakhani

<a id='research_question'></a>
# Research Question

How have UCSD students' grade point averages (GPAs) changed across various upper division and lower division courses between the years 2007-2022? What factors have influenced these changes, considering the effects of:

* (1) the recent COVID-19 pandemic on student GPAs
* (2) the 2022 UCSD Academic Worker Strike on student GPAs
* (3) the emergence of ChatGPT on student GPAs


# Dataset(s)


- Dataset Name: UCSD CAPEs Data
- Link to the dataset: https://www.kaggle.com/datasets/sanbornpnguyen/ucsdcapes?utm_medium=social&utm_campaign=kaggle-dataset-share
- Number of observations: 63363 rows

The UCSD CAPEs Data is our primary dataset. It contains all the CAPE evaluations from 2007 to 2023, scraped from the CAPEs website https://capes.ucsd.edu. The dataset contains information about instructors, recommendation rates, average grade received, etc. We plan on extracting the columns that detail the course name, the average grade received, and the quarter.

Note: As of this first checkpoint, this one dataset is sufficient to answer the question. However, we may add more datasets later if the analysis requires them.

# Data Wrangling

The first step is to import the appropriate modules and convert the data from the CSV file to a dataframe.

In [7]:
#importing relevant data libraries
import pandas as pd
import numpy as np

#converting data from csv file to pandas dataframe
capes = pd.read_csv('raw_data/capes_data.csv')

In [8]:
capes.head()

Unnamed: 0,Instructor,Course,Quarter,Total Enrolled in Course,Total CAPEs Given,Percentage Recommended Class,Percentage Recommended Professor,Study Hours per Week,Average Grade Expected,Average Grade Received,Evalulation URL
0,Butler Elizabeth Annette,AAS 10 - Intro/African-American Studies (A),SP23,66,48,93.5%,100.0%,2.8,A- (3.84),B+ (3.67),https://cape.ucsd.edu/CAPEReport.aspx?sectioni...
1,Butler Elizabeth Annette,AAS 170 - Legacies of Research (A),SP23,20,7,100.0%,100.0%,2.5,A- (3.86),A- (3.92),https://cape.ucsd.edu/CAPEReport.aspx?sectioni...
2,Jones Ian William Nasser,ANAR 111 - Foundations of Archaeology (A),SP23,16,3,100.0%,100.0%,3.83,B+ (3.67),,https://cape.ucsd.edu/CAPEReport.aspx?sectioni...
3,Shtienberg Gilad,ANAR 115 - Coastal Geomorphology/Environ (A),SP23,26,6,100.0%,83.3%,3.83,B+ (3.50),B (3.07),https://cape.ucsd.edu/CAPEReport.aspx?sectioni...
4,Braswell Geoffrey E.,ANAR 155 - Stdy Abrd: Ancient Mesoamerica (A),SP23,22,9,100.0%,100.0%,5.17,A (4.00),A (4.00),https://cape.ucsd.edu/CAPEReport.aspx?sectioni...


Looking at the first few rows of our capes dataframe, there does not seem to be further data wrangling work required. We can shift to data cleaning.

# Data Cleaning

Let's begin with checking the dimensions of our data set and the column names.

In [9]:
#returns dimensions of the dataframe
capes.shape

(63363, 11)

In [10]:
#returns column names
capes.columns

Index(['Instructor', 'Course', 'Quarter', 'Total Enrolled in Course',
       'Total CAPEs Given', 'Percentage Recommended Class',
       'Percentage Recommended Professor', 'Study Hours per Week',
       'Average Grade Expected', 'Average Grade Received', 'Evalulation URL'],
      dtype='object')

Our research question focuses on how GPA changes across courses between the years 2007-2023. Columns detailing the instructor, recommendation rates, study hours, expected grades, evaluation URL, total enrollment, and total CAPEs given can be dropped from our dataset.

In [11]:
#dropping unnecessary columns
capes = capes.drop(columns = ['Instructor', 'Percentage Recommended Class',
       'Percentage Recommended Professor', 'Study Hours per Week',
       'Average Grade Expected', 'Evalulation URL', 'Total Enrolled in Course',
       'Total CAPEs Given'])
capes.head()

Unnamed: 0,Course,Quarter,Average Grade Received
0,AAS 10 - Intro/African-American Studies (A),SP23,B+ (3.67)
1,AAS 170 - Legacies of Research (A),SP23,A- (3.92)
2,ANAR 111 - Foundations of Archaeology (A),SP23,
3,ANAR 115 - Coastal Geomorphology/Environ (A),SP23,B (3.07)
4,ANAR 155 - Stdy Abrd: Ancient Mesoamerica (A),SP23,A (4.00)


In some cases the professor for a course does not report the average grade received by the class. These entries appear as NaN values, and those observations need to be dropped. Removing NaN values is usually part of data wrangling, but because we are only concerned with missing values in the three columns we keep, we needed to drop the unneeded columns first.

In [12]:
#checking for NaN values
capes.isnull().sum().any()

True

In [13]:
#dropping NaN rows
capes = capes.dropna()

#verifying that our dataframe is free of any missing values
capes.isnull().sum().any()

False
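
The check above only reports whether any value is missing. As a minimal sketch on a toy frame (the rows below are illustrative, not taken from the dataset), counting missing values per column shows where the gaps are, and `dropna(subset=...)` limits the drop to rows that lack a received grade. The names `toy`, `missing_per_column`, and `toy_clean` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the trimmed CAPEs frame (values are illustrative only)
toy = pd.DataFrame({
    "Course": ["AAS 10 - Intro (A)", "ANAR 111 - Foundations (A)", "ANBI 111 - Human Evolution (A)"],
    "Quarter": ["SP23", "SP23", "SP23"],
    "Average Grade Received": ["B+ (3.67)", np.nan, "B- (2.95)"],
})

# Count missing values per column to see which columns have gaps
missing_per_column = toy.isnull().sum()

# Drop only the rows that are missing the received grade
toy_clean = toy.dropna(subset=["Average Grade Received"])
```

Restricting the drop to one column keeps a row from being removed because of a gap in a column that the analysis does not use.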

Grades in the 'Average Grade Received' column are of the form <letter_grade (GPA)>. In order to analyze how the GPA changes over time and take a mean of all the GPAs, we need to split the column into grade (as a string) and GPA (as a float).

In [14]:
#defining function to split the grade
def split_grade_gpa(string):
    lst1 = string.split("(")
    str1 = lst1[1].strip(")")
    return float(str1)

#helper function to extract grade
def split_grade(string):
    lst1 = string.split("(")
    return lst1[0].strip(" ")

cape_GPAs = capes['Average Grade Received'].apply(split_grade_gpa)
cape_grades = capes['Average Grade Received'].apply(split_grade)

#adding GPA column
capes['GPA'] = cape_GPAs
#adding grade column
capes['Grade'] = cape_grades

capes = capes.drop(columns=['Average Grade Received'])
capes.head()

Unnamed: 0,Course,Quarter,GPA,Grade
0,AAS 10 - Intro/African-American Studies (A),SP23,3.67,B+
1,AAS 170 - Legacies of Research (A),SP23,3.92,A-
3,ANAR 115 - Coastal Geomorphology/Environ (A),SP23,3.07,B
4,ANAR 155 - Stdy Abrd: Ancient Mesoamerica (A),SP23,4.0,A
5,ANBI 111 - Human Evolution (A),SP23,2.95,B-
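
As an alternative sketch of the same split, pandas can pull both parts out in one pass with a regular expression. This assumes every non-missing value looks like `B+ (3.67)`; the sample series and the pattern below are illustrative, and values that do not match would come back as NaN rather than raising an error.

```python
import pandas as pd

# Sample values in the "<letter_grade> (<GPA>)" format described above
grades = pd.Series(["B+ (3.67)", "A- (3.92)", "A (4.00)"], name="Average Grade Received")

# Named groups become the column names of the resulting dataframe
pattern = r"^\s*(?P<Grade>[A-F][+-]?)\s*\((?P<GPA>\d+\.\d+)\)\s*$"
parts = grades.str.extract(pattern)

# The captured GPA is still text, so convert it to a float
parts["GPA"] = parts["GPA"].astype(float)
```

Checking `parts.isnull().sum()` afterwards is a quick way to find entries that do not follow the expected format.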


Similarly, we need columns for the department, the division (lower or upper), and the year. This lets us group by quarter, department, and division and take the mean GPA of each group. In the analysis we will then compare the lower and upper division means of every department across every quarter since FA07.

In [15]:
#function to obtain department code
def dept_strip(string):
    lst1 = string.split(" ")
    return lst1[0]

#function to check for lower division / upper division
def ldud(string):
    #the course number is the second token, e.g. "10" in "AAS 10 - ..."
    course_num = string.split(" ")[1]
    #keeping only the digits so that letter suffixes are ignored
    str_num = "".join(filter(str.isdigit, course_num))
    #course numbers up to 99 are lower division, the rest upper division
    if int(str_num) <= 99:
        return "LD"
    else:
        return "UD"
    
dept_names = capes['Course'].apply(dept_strip)
upper_lower = capes['Course'].apply(ldud)

#creating a column for upper / lower division
capes['Division'] = upper_lower
#creating a column for department
capes['Dept'] = dept_names
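
To make the division rule concrete, here is a small standalone sketch that applies it to a few course strings. It assumes the course number is the second space-separated token and begins with digits. The helper name `course_division` and the `MATH 20A` title are hypothetical and used only for illustration.

```python
import re

def course_division(course):
    """Return "LD" for course numbers up to 99 and "UD" otherwise."""
    number_token = course.split(" ")[1]      # e.g. "20A" from "MATH 20A - ..."
    match = re.match(r"\d+", number_token)   # leading digits only, suffix ignored
    if match is None:
        raise ValueError(f"no course number found in {course!r}")
    return "LD" if int(match.group()) <= 99 else "UD"

samples = [
    "AAS 10 - Intro/African-American Studies (A)",
    "ANAR 115 - Coastal Geomorphology/Environ (A)",
    "MATH 20A - Hypothetical Title (A)",
]
divisions = [course_division(s) for s in samples]
```

Raising an explicit error for a token without digits makes unexpected course names easier to spot than a bare conversion failure.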

In [16]:
# helper function to extract year from quarter column
def extract_year(string):
    ret_str = "20" + string[2:4]
    return int(ret_str)

years = capes['Quarter'].apply(extract_year)

#creating year column
capes['Year'] = years
capes.head()

Unnamed: 0,Course,Quarter,GPA,Grade,Division,Dept,Year
0,AAS 10 - Intro/African-American Studies (A),SP23,3.67,B+,LD,AAS,2023
1,AAS 170 - Legacies of Research (A),SP23,3.92,A-,UD,AAS,2023
3,ANAR 115 - Coastal Geomorphology/Environ (A),SP23,3.07,B,UD,ANAR,2023
4,ANAR 155 - Stdy Abrd: Ancient Mesoamerica (A),SP23,4.0,A,UD,ANAR,2023
5,ANBI 111 - Human Evolution (A),SP23,2.95,B-,UD,ANBI,2023
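
One thing to keep in mind for later plots is that quarter codes such as `FA07` sort alphabetically, not chronologically. The sketch below builds a sortable key from the term and the year. The term order in `TERM_ORDER` is an assumption about which codes appear in the data (`WI`, `SP`, `S1`, `S2`, `S3`, `FA`) and should be checked against the actual `Quarter` values.

```python
import pandas as pd

# Assumed order of terms within a calendar year; adjust if other codes appear
TERM_ORDER = {"WI": 0, "SP": 1, "S1": 2, "S2": 3, "S3": 4, "FA": 5}

quarters = pd.DataFrame({"Quarter": ["FA07", "SP23", "WI08", "S122", "FA22"]})

# Split each code into its two-letter term and its four-digit year
quarters["Term"] = quarters["Quarter"].str[:2]
quarters["Year"] = 2000 + quarters["Quarter"].str[2:4].astype(int)
quarters["TermRank"] = quarters["Term"].map(TERM_ORDER)

# Sort by year first, then by the position of the term within the year
chronological = quarters.sort_values(["Year", "TermRank"])["Quarter"].tolist()
```

Any code missing from `TERM_ORDER` would get a NaN rank, so checking `quarters["TermRank"].isnull().any()` is a cheap guard.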


Let's rearrange the columns in an orderly manner so the quarter is the first column.

In [17]:
#rearranging the columns
capes = capes.get(['Quarter', 'Dept', 'Course', 'Division', 'Year', 'GPA', 'Grade'])
capes.head()

Unnamed: 0,Quarter,Dept,Course,Division,Year,GPA,Grade
0,SP23,AAS,AAS 10 - Intro/African-American Studies (A),LD,2023,3.67,B+
1,SP23,AAS,AAS 170 - Legacies of Research (A),UD,2023,3.92,A-
3,SP23,ANAR,ANAR 115 - Coastal Geomorphology/Environ (A),UD,2023,3.07,B
4,SP23,ANAR,ANAR 155 - Stdy Abrd: Ancient Mesoamerica (A),UD,2023,4.0,A
5,SP23,ANBI,ANBI 111 - Human Evolution (A),UD,2023,2.95,B-


To compare the means for every department across lower division and upper division courses, we need to group by quarter, department, and division. Before that, we create a capes_sub dataframe that leaves out the letter grade column, since we cannot take a mean of that column.

In [20]:
#computing the mean GPA per department per quarter, split by division
capes_sub = capes.get(['Quarter', 'Dept', 'Division', 'Year', 'GPA'])
capes_sub = capes_sub.groupby(['Quarter', 'Dept', 'Division']).mean().reset_index()
capes_sub.head()


Unnamed: 0,Quarter,Dept,Division,Year,GPA
0,FA07,ANAR,UD,2007.0,2.61
1,FA07,ANBI,UD,2007.0,3.04
2,FA07,ANSC,UD,2007.0,3.09
3,FA07,ANTH,LD,2007.0,2.725
4,FA07,ANTH,UD,2007.0,3.27
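
A group mean is easier to interpret when the number of rows behind it is known. As a hedged sketch on made-up rows, named aggregation returns the mean GPA together with the number of course sections in each group. The column names `mean_gpa` and `n_sections` are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Made-up rows in the same shape as the cleaned data
toy = pd.DataFrame({
    "Quarter": ["FA07", "FA07", "FA07", "FA07"],
    "Dept": ["ANTH", "ANTH", "ANTH", "ANAR"],
    "Division": ["LD", "LD", "UD", "UD"],
    "GPA": [2.5, 3.0, 3.3, 2.6],
})

# One row per (quarter, department, division) with its mean and group size
summary = (
    toy.groupby(["Quarter", "Dept", "Division"])
       .agg(mean_gpa=("GPA", "mean"), n_sections=("GPA", "size"))
       .reset_index()
)
```

Groups with only one or two sections can then be flagged or filtered before the means are compared.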


For the analysis, we will begin with line plots to visualize how GPA changes across the quarters, then proceed with an ANOVA test to compare group means for our three subquestions.
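
For reference, a one-way ANOVA compares the variation between group means with the variation inside the groups. The sketch below computes the F statistic by hand with NumPy on made-up GPA values. The group names `before`, `during`, and `after` are hypothetical and do not come from the dataset. In the actual analysis, a library routine such as `scipy.stats.f_oneway` would also return the p-value.

```python
import numpy as np

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)                                # number of groups
    n_total = sum(len(g) for g in groups)          # total number of observations
    grand_mean = np.concatenate(groups).mean()
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Made-up mean GPAs for three hypothetical periods
before = [3.0, 3.1, 3.2]
during = [3.4, 3.5, 3.6]
after = [3.3, 3.4, 3.5]
f_stat = one_way_anova_f([before, during, after])
```

A large F value means the group means differ by more than the spread within the groups would suggest, which is the pattern each of the three subquestions is looking for.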

# Data Analysis & Results (EDA)