<center>
    <h1 style="width: 90%">Understanding Relationships Between Professor Info and Course Enrollment Rates in the Computer Science Department at the University of Maryland, College Park</h1>
    <h2 style="width: 90%">By Wesley Smith, Franco Edah, and Ohsun Kwon</h2>
</center>

When given the ability of choice, people nearly always seeks out the best possible selection to ensure their own success. So, it is not surprising that students at the University of Maryland, College Park commonly compare course ratings and grade data, amongst many other factors, in their choice of which courses and course sections to select each semester. In this project, our group seeks to better understand the relationships between professor rating and professor average GPA with how students seek to enroll in courses within the Department of Computer Science at the UMCP.

# Scraping Code

Before we are able to begin our analysis, we first need to retrieve the publicly available data about professors and courses within the Computer Science Department. Using APIs developed by our peer students at the University hosted on `umd.io` and `planetterp.com`, we are able to secure the necessary data to conduct our analyses.

Fetching large amounts of data from the `umd.io` and `planetterp.com` APIs consistently as we were working on our project turned out to be costly both in terms of the amount of time we were waiting for data to process, and costly for the APIs themselves, with us sometimes being rate-limited when fixing small mistakes. For reference, being rate-limited is when too many requests are sent to a public API, and the API limits requests from your address in an effort to conserve resources for others. 

In order to solve the costliness of fetching and re-fetching the data from these publicly available APIs, we instead fetch all of the data about CMSC (Dept. of Computer Science) courses at UMCP at the beginning of our project, store them in an SQLite database, and use this database to do necessary queries throughout our data analysis.

We chose the SQLite database over other initial solutions like MongoDB due to SQLite having native support in the Pandas library, along with the fact that storing our data in a table-like fashion made the most sense considering Pandas needs to represent it as a table as well. If we were to use MongoDB for example, we would need to query the database and then flatten our data to be placed into a Pandas dataframe for analysis anyways. By representing the data as a table from the beginning, we remove this rather redundant step when importing our data. Furthermore, keeping the data local in our own repository was also another plus, as needing to connect to a remote service for hosting our data could involve extra costs and complexity in our analysis.

Below is the code that scrapes the `umd.io` and `planetterp.com` APIs and stores the necessary data that we recieve in our local SQLite database. Comments are present throughout the code for easy understanding.

In [6]:
import requests
import sqlite3
import os
from operator import itemgetter

# a map of grade value to the grade point value
GRADE_POINTS = {
    "A+": 4.0, "A": 4.0, "A-": 3.7, 
    "B+": 3.3, "B": 3.0, "B-": 2.7,
    "C+": 2.3, "C": 2.0, "C-": 1.7,
    "D+": 1.3, "D": 1.0, "D-": 0.5,
    "F": 0.0
}

# using the provided connection, ensure that all of our tables exist
def db_ensure_tables_exist(conn):
    # courses stores the overall info for a course
    conn.cursor().execute("""
    CREATE TABLE IF NOT EXISTS courses (
        id CHAR(8) NOT NULL PRIMARY KEY,
        dept CHAR(4) NOT NULL,
        number INT(4) NOT NULL,
        credits INT(1) NOT NULL
    )
    """)
    # course_professors stores the professors for a course
    conn.cursor().execute("""
    CREATE TABLE IF NOT EXISTS course_professors (
        course CHAR(8) NOT NULL,
        name VARCHAR(100) NOT NULL,
        CONSTRAINT primary_key PRIMARY KEY (course, name),
        FOREIGN KEY (course) REFERENCES courses(id)
    )
    """)
    # You cannot create a primary key containing a TEXT field, 
    # so we don't have one for this table.
    conn.cursor().execute("""
    CREATE TABLE IF NOT EXISTS professor_reviews (
        course CHAR(8) NOT NULL,
        prof_name VARCHAR(100) NOT NULL,
        rating INT(1) NOT NULL,
        review TEXT NOT NULL,
        FOREIGN KEY (course) REFERENCES courses(id),
        FOREIGN KEY (prof_name) REFERENCES course_professors(name)
    )
    """)
    # course_grades stores the gpa and number of drops for a course
    # in a given year and semester. the semester column is either
    # "FALL" or "SPRING".
    conn.cursor().execute("""
    CREATE TABLE IF NOT EXISTS course_grades (
        course CHAR(8) NOT NULL,
        prof_name VARCHAR(100),
        year INT(4) NOT NULL,
        semester VARCHAR(10) NOT NULL,
        gpa DOUBLE(2,2) NOT NULL,
        num_drops INT(5) NOT NULL,
        CONSTRAINT primary_key PRIMARY KEY (course, prof_name, year, semester),
        FOREIGN KEY (course) REFERENCES courses(id),
        FOREIGN KEY (prof_name) REFERENCES course_professors(name)
    )
    """)

# create the database connection and ensure all of the tables exist
# setting wipe to true will wipe the entire database and recreate it
def db_setup(wipe=False):
    file_name = "project.db"
    if wipe:
        os.remove(file_name)
    conn = sqlite3.connect(file_name)
    db_ensure_tables_exist(conn)
    return conn

# write the information about a course to the provided connection
def db_write_course_info(conn, plt_terp_info):
    dept, number, credits, prof_names, reviews = itemgetter(
        "department",
        "course_number", 
        "credits",
        "professors",
        "reviews"
    )(plt_terp_info)
    cur = conn.cursor()

    id = f"{dept}{number}"
    cur.execute(
        "INSERT INTO courses (id, dept, number, credits) VALUES (?, ?, ?, ?)", 
        (id, dept, number, credits)
    )
    
    for prof_name in prof_names:
        cur.execute(
            "INSERT OR IGNORE INTO course_professors (course, name) VALUES (?, ?)",
            (id, prof_name)
        )
    
    for review in reviews:
        # perhaps do sentiment analysis here on the review text?
        cur.execute(
            "INSERT INTO professor_reviews (course, prof_name, rating, review) VALUES (?, ?, ?, ?)",
            (id, review["professor"], review["rating"], review["review"])
        )

# write the information about a courses grades to the provided connection
def db_write_course_grades(conn, course_id, plt_terp_grades):
    cur = conn.cursor()
    for entry in plt_terp_grades:
        num_grade_w = int(entry["W"])
        prof_name = entry["professor"]
        semester_raw = entry["semester"]
        year = semester_raw[0:4]
        semester_id = int(semester_raw[4:])
        semester_enum = "FALL" if semester_id == 1 else "SPRING" if semester_id == 8 else None
            
        # if we can't identify fall or spring semester,
        # then the course isnt relevant to our data
        if semester_enum == None:
            continue
            
        # loop over keys and values and check if key is a grade
        # name. if so, add it to our grade point sum and total
        # amount of grades.
        grade_point_sum = 0
        total_grades = 0
        for key, value in entry.items():
            if key not in GRADE_POINTS:
                continue
            
            grade_points = GRADE_POINTS[key]
            amt = int(value)
            total_grades += amt
            grade_point_sum += amt * grade_points
        
        gpa = grade_point_sum / total_grades
        cur.execute(
            """
            INSERT INTO course_grades (course, prof_name, year, semester, gpa, num_drops)
            VALUES (?, ?, ?, ?, ?, ?)
            ON CONFLICT(course, prof_name, year, semester) 
                DO UPDATE SET gpa = (gpa + excluded.gpa) / 2 AND num_drops = num_drops + excluded.num_drops
            """,
            (course_id, prof_name, year, semester_enum, gpa, num_grade_w)
        )

# call the umd.io api to get a list of courses in the CMSC department
def get_cmsc_courses():
    result = requests.get("https://api.umd.io/v1/courses", params={"dept_id": "CMSC"})
    return list(map(lambda x: x["course_id"], result.json()))

# call the planet terp api to get the information about a course, including its reviews
def get_course_info(course_id):
    return requests.get("https://api.planetterp.com/v1/course", params={"name": course_id, "reviews": "true"}).json()

# call the planet terp api to get information about a course's grades
def get_course_grades(course_id):
    return requests.get("https://api.planetterp.com/v1/grades", params={"course": course_id}).json()
    
# setup the database. we wipe the database since we want to regenerate all of the information for now.
with db_setup(wipe=True) as conn:
    for course_id in get_cmsc_courses():
        db_write_course_info(conn, get_course_info(course_id))
        db_write_course_grades(conn, course_id, get_course_grades(course_id))
        print(f"processed {course_id}")

processed CMSC100
processed CMSC106
processed CMSC122
processed CMSC125
processed CMSC131
processed CMSC132
processed CMSC133
processed CMSC216
processed CMSC250
processed CMSC298A
processed CMSC320
processed CMSC330
processed CMSC335
processed CMSC351
processed CMSC351H
processed CMSC396H
processed CMSC411
processed CMSC412
processed CMSC414
processed CMSC416
processed CMSC417
processed CMSC420
processed CMSC421
processed CMSC422
processed CMSC423
processed CMSC424
processed CMSC425
processed CMSC426
processed CMSC427
processed CMSC430


# Basic Analysis

In [5]:
import sqlite3
import pandas as pd

conn = sqlite3.connect("project.db")
query = """
SELECT
    courses.id AS course_id,
    courses.dept,
    courses.number,
    courses.credits,
    course_grades.prof_name,
    course_grades.year,
    course_grades.semester,
    course_grades.gpa,
    course_grades.num_drops,
    professor_reviews.rating AS prof_overall_rating
FROM courses
LEFT JOIN course_grades 
    ON course_grades.course = courses.id
LEFT JOIN professor_reviews 
    ON professor_reviews.course = courses.id
    AND professor_reviews.prof_name = course_grades.prof_name
GROUP BY 
    courses.id, 
    course_grades.prof_name, 
    course_grades.semester
"""
df = pd.read_sql(sql=query, con=conn, index_col="course_id")
display(df)

Unnamed: 0_level_0,dept,number,credits,prof_name,year,semester,gpa,num_drops,prof_overall_rating
course_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
CMSC100,CMSC,100,1,Alyssa Neuner,2017.0,SPRING,3.758824,1.0,
CMSC100,CMSC,100,1,Amy Vaillancourt,2015.0,SPRING,3.382609,1.0,
CMSC100,CMSC,100,1,Andrew Nolan,2012.0,SPRING,3.826087,0.0,
CMSC100,CMSC,100,1,Charles Kassir,2014.0,SPRING,3.669444,1.0,
CMSC100,CMSC,100,1,Corie Brown,2021.0,SPRING,3.805882,0.0,5.0
...,...,...,...,...,...,...,...,...,...
CMSC430,CMSC,430,3,Jose Calderon,2020.0,FALL,3.671642,2.0,5.0
CMSC430,CMSC,430,3,Jose Calderon,2020.0,SPRING,3.155814,4.0,5.0
CMSC430,CMSC,430,3,Leonidas Lampropoulos,2021.0,FALL,3.604854,6.0,
CMSC430,CMSC,430,3,Nick Petroni,2018.0,SPRING,3.105769,3.0,5.0
