<center>
    <h1 style="width: 90%">Understanding Relationships Between Professor Info and Course Enrollment Rates in the Computer Science Department at the University of Maryland, College Park</h1>
    <h2 style="width: 90%">By Wesley Smith, Franco Edah, and Ohsun Kwon</h2>
</center>

When given a choice, people nearly always seek out the best available option to ensure their own success. So, it is not surprising that students at the University of Maryland, College Park (UMCP) commonly compare course ratings and grade data, among many other factors, when choosing which courses and course sections to take each semester. In this project, our group seeks to better understand how professor ratings and professors' average GPAs relate to how students enroll in courses within the Department of Computer Science at UMCP.

# Fetching Our Data

Before we are able to begin our analysis, we first need to retrieve the publicly available data about professors and courses within the Computer Science Department. Using APIs developed by our peer students at the University hosted on `umd.io` and `planetterp.com`, we are able to secure the necessary data to conduct our analyses.

Repeatedly fetching large amounts of data from the `umd.io` and `planetterp.com` APIs while we worked on our project turned out to be costly, both for us, in the time we spent waiting for data to process, and for the APIs themselves: we were sometimes rate-limited while fixing small mistakes. For reference, rate limiting occurs when too many requests are sent to a public API and the API restricts requests from your address in an effort to conserve resources for others.

In order to solve the costliness of fetching and re-fetching the data from these publicly available APIs, we instead fetch all of the data about CMSC (Dept. of Computer Science) courses at UMCP at the beginning of our project, store them in an SQLite database, and use this database to do necessary queries throughout our data analysis.

We chose SQLite over other candidate solutions like MongoDB because the Pandas library supports SQLite natively, and because storing our data in tables made the most sense given that Pandas represents it as a table as well. If we were to use MongoDB, for example, we would need to query the database and then flatten the results before placing them into a Pandas dataframe for analysis anyway. By representing the data as tables from the beginning, we remove this redundant step when importing our data. Keeping the data local in our own repository was another plus, as connecting to a remote service to host our data could add cost and complexity to our analysis.
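
To illustrate the point about Pandas reading SQLite tables directly, here is a minimal sketch. It uses an in-memory database and a hypothetical `demo_courses` table rather than our real `project.db`, so the table name and the rows are assumptions made only for illustration.

```python
import sqlite3

import pandas as pd

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_courses (id TEXT, year INTEGER, credits INTEGER)")
conn.executemany(
    "INSERT INTO demo_courses VALUES (?, ?, ?)",
    [("CMSC131", 2021, 4), ("CMSC132", 2021, 4), ("CMSC216", 2022, 4)],
)

# A SQL result set is already tabular, so it maps onto a DataFrame
# without any flattening step.
demo_df = pd.read_sql("SELECT * FROM demo_courses ORDER BY id", conn)
conn.close()

print(demo_df.shape)
```

The same `pd.read_sql` call is what the analysis cells later in this notebook rely on, only with longer queries.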

Below is the code that queries the `umd.io` and `planetterp.com` APIs and stores the data that we receive in our local SQLite database.

## Connect to the Database

In [276]:
import sqlite3
def open_conn():
    return sqlite3.connect("project.db")

with open_conn() as conn:
    print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())

[('instructor_reviews',), ('course_grades',), ('courses',), ('course_sections',), ('course_section_instructors',)]


## Fetch and Write Basic Course Info

First, we need to get a list of all of the courses that are available within the `umd.io` API. However, we cannot get all of our data in one API request because:
1. The API returns results in paginated form, and we can request at most 100 entries per page. So, if a single set of request parameters matches more than 100 courses, we need to keep fetching the next page until no results are returned.
2. The API is supposed to support requesting all courses from semesters less than or equal to (`leq`) a given semester, but this appears to be broken: the API does not accept semester strings longer than six characters. To use the `leq` option, we would need to pass `202008|leq` as the semester parameter, which is 10 characters long. To work around this issue, we iterate over each year from `2017` through `2022` and make a separate call for the spring (`01`) and fall (`08`) semesters of each year.
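
The two workarounds above can be sketched without touching the network. In this sketch `fake_get_page` is a hypothetical stand-in for the real API call, and `PER_PAGE` is shrunk to 3 so the behavior is easy to follow; only the loop structure and the semester-code format mirror what we do against `umd.io`.

```python
PER_PAGE = 3  # the real API allows up to 100; kept small for illustration
FAKE_COURSES = [f"CMSC{n}" for n in (131, 132, 216, 250, 330, 351, 420)]

def fake_get_page(page):
    """Hypothetical stand-in for one paginated API request (pages start at 1)."""
    start = (page - 1) * PER_PAGE
    return FAKE_COURSES[start:start + PER_PAGE]

def fetch_all_pages(get_page):
    """Request page 1, 2, 3, ... until a page comes back empty."""
    results = []
    page = 1
    while True:
        batch = get_page(page)
        if not batch:
            break
        results.extend(batch)
        page += 1
    return results

all_courses = fetch_all_pages(fake_get_page)

# One six-character semester code per spring (01) and fall (08) term,
# used instead of the longer "<semester>|leq" form.
semester_codes = [f"{year}{term}" for year in range(2017, 2023) for term in ("01", "08")]

print(len(all_courses), len(semester_codes))
```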

In [277]:
# YOU SHOULD NOT RUN THIS CODE BLOCK UNLESS YOU ARE WANTING TO LOAD NEW DATA
import requests

def get_courses(page, semester):
    params = {"dept_id": "CMSC", "per_page": 100, "page": page, "semester": semester}
    return requests.get("https://api.umd.io/v1/courses", params=params).json()

def fetch_for_semester(semester):
    result = []
    page = 1
    while True:
        response = get_courses(page, semester)
        if "error_code" in response:
            if response["message"] == "We don't have data for this semester!":
                # theres no data to fetch! stop now
                break
            raise Exception(f"unknown error response: {response}")
                
        if len(response) == 0:
            # if we got no result, then we reached the last page
            break
            
        result += response
        page += 1
    return result

def fetch_for_year(year):
    result = []
    for semester_id in ["01", "08"]:
        result += fetch_for_semester(f"{year}{semester_id}")
    return result

courses = []
for year in range(2017, 2023):
    courses += fetch_for_year(year)

len_courses = len(courses)
print(f"amount of courses fetched: {len_courses}")

amount of courses fetched: 880


Now, we will write this information to our database so that we can use it later without re-fetching from the `umd.io` API. When we encounter a duplicate course, it means we have already placed it in our database, so we skip it by using `INSERT OR IGNORE`.
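
The duplicate handling can be seen in isolation with an in-memory database. This is a sketch only: the `demo_courses` table is a hypothetical, trimmed-down version of our `courses` table, and it relies on SQLite's `INSERT OR IGNORE`, which skips a row that would violate the primary key instead of raising an error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE demo_courses (
        id TEXT NOT NULL,
        year INTEGER NOT NULL,
        semester TEXT NOT NULL,
        PRIMARY KEY (id, year, semester)
    )
""")

rows = [
    ("CMSC131", 2021, "FALL"),
    ("CMSC131", 2021, "FALL"),    # exact duplicate: silently skipped
    ("CMSC131", 2022, "SPRING"),  # same course, different semester: kept
]
for row in rows:
    conn.execute("INSERT OR IGNORE INTO demo_courses VALUES (?, ?, ?)", row)

row_count = conn.execute("SELECT COUNT(*) FROM demo_courses").fetchone()[0]
conn.close()
print(row_count)
```

Because the primary key spans the course id, year, and semester, the same course offered in a different semester is stored as its own row.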

As we are writing data to the database, we are parsing out the semester field of the API result to split it into the year and semester category that it represents. In the database, we represent the fall and spring semesters as the enum values `FALL` and `SPRING` respectively.

In [278]:
# helper function for parsing a semester in the format string "{year}{01|08}", where 01 is spring and 08 is fall
def parse_semester(semester_raw):
    semester_year = semester_raw[0:4]
    semester_id = int(semester_raw[4:])
    semester_enum = "SPRING" if semester_id == 1 else "FALL" if semester_id == 8 else None
    return semester_year, semester_enum

In [279]:
# YOU SHOULD NOT RUN THIS CODE BLOCK UNLESS YOU ARE WANTING TO LOAD NEW DATA
with open_conn() as conn:
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS courses")  # drops the old table to start over; comment this out to keep existing rows
    cur.execute("""
    CREATE TABLE IF NOT EXISTS courses (
        id CHAR(7) NOT NULL,
        dept CHAR(4) NOT NULL,
        number VARCHAR(4) NOT NULL,
        year INT(4) NOT NULL,
        semester VARCHAR(10) NOT NULL,
        credits INT(1) NOT NULL,
        CONSTRAINT primary_key PRIMARY KEY (id, year, semester)
    )
    """)

    for course in courses:
        course_id = course["course_id"]
        dept = course["dept_id"]
        number = course_id[len(dept):]
        # ignore classes higher than the 400 level
        if int(number[0]) > 4:
            continue 

        year, semester = parse_semester(course["semester"])
        credits = int(course["credits"])
        cur.execute(
            "INSERT OR IGNORE INTO courses (id, dept, number, year, semester, credits) VALUES (?, ?, ?, ?, ?, ?)",
            (course_id, dept, number, year, semester, credits)
        )

    print(list(cur.execute("SELECT * FROM courses LIMIT 5")))

[('CMSC100', 'CMSC', '100', 2017, 'FALL', 1), ('CMSC106', 'CMSC', '106', 2017, 'FALL', 4), ('CMSC122', 'CMSC', '122', 2017, 'FALL', 3), ('CMSC131', 'CMSC', '131', 2017, 'FALL', 4), ('CMSC131A', 'CMSC', '131A', 2017, 'FALL', 4)]


## Fetching and Writing Course Section Info

Now that we have our course data, we need to get more fine-grained data about who teaches the course and the seat data for each of their sections, permitting us to consider course registration rates in our data analysis. 
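
As a small worked example of how seat counts can feed into an enrollment rate, the sketch below assumes the rate is seats taken divided by total seats. That definition, the function name, and the sample numbers are assumptions of this sketch rather than something the API provides.

```python
def enrollment_rate(seats, open_seats):
    """Fraction of a section's seats that are filled (assumed definition)."""
    if seats == 0:
        return None  # avoid dividing by zero for sections with no seats
    seats_taken = seats - open_seats
    return seats_taken / seats

# Hypothetical section with 40 seats, 10 of them still open.
rate = enrollment_rate(seats=40, open_seats=10)
print(rate)
```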

In [280]:
# YOU SHOULD NOT RUN THIS CODE BLOCK UNLESS YOU ARE WANTING TO LOAD NEW DATA
import math

SECTIONS_PER_PAGE = 100

def get_sections(course_id, semester, page):
    params = {
        "course_id": course_id, 
        "semester": semester,
        "page": page, 
        "per_page": SECTIONS_PER_PAGE
    }
    return requests.get("https://api.umd.io/v1/courses/sections", params=params).json()

def get_all_sections(course_id, semester, num_sections):
    sections = []
    num_pages = int(math.ceil(num_sections / SECTIONS_PER_PAGE))
    for page_idx in range(num_pages):
        result = get_sections(course_id, semester, page_idx + 1)
        sections += result
    return sections

def insert_section(cur, section):
    course_id = section["course"]
    year, semester = parse_semester(str(section["semester"]))
    number = section["number"]
    seats_open = int(section["open_seats"])
    seats = int(section["seats"])
    seats_taken = seats - seats_open
    waitlist = int(section["waitlist"])
    instructors = section["instructors"]
    print(f"processing section {number} of {course_id} for year {year}, semester {semester} with instructors {instructors}")
    cur.execute(
        """
        INSERT INTO course_sections (course, year, semester, number, seats_open, seats_taken, waitlist_size)
        VALUES (?, ?, ?, ?, ?, ?, ?)
        """,
        (course_id, year, semester, number, seats_open, seats_taken, waitlist)
    )
    
    for instructor_name in section["instructors"]:
        cur.execute(
            """
            INSERT INTO course_section_instructors (course, year, semester, number, name)
            VALUES (?, ?, ?, ?, ?)
            """,
            (course_id, year, semester, number, instructor_name)
        )

with open_conn() as conn:
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS course_sections")  # drops the old table to start over; comment this out to keep existing rows
    cur.execute("""
    CREATE TABLE IF NOT EXISTS course_sections (
        course CHAR(7) NOT NULL,
        year INT(4) NOT NULL,
        semester VARCHAR(10) NOT NULL,
        number VARCHAR(4) NOT NULL,
        seats_open INT(4) NOT NULL,
        seats_taken INT(4) NOT NULL,
        waitlist_size INT(4) NOT NULL,
        CONSTRAINT primary_key PRIMARY KEY (course, year, semester, number),
        FOREIGN KEY (course, year, semester) REFERENCES courses(id, year, semester)
    )
    """)
    cur.execute("DROP TABLE IF EXISTS course_section_instructors")  # drops the old table to start over; comment this out to keep existing rows
    cur.execute("""
    CREATE TABLE IF NOT EXISTS course_section_instructors (
        course CHAR(7) NOT NULL,
        year INT(4) NOT NULL,
        semester VARCHAR(10) NOT NULL,
        number VARCHAR(4) NOT NULL,
        name VARCHAR(100) NOT NULL,
        CONSTRAINT primary_key PRIMARY KEY (course, year, semester, number, name),
        FOREIGN KEY (course, year, semester, number) REFERENCES course_sections(course, year, semester, number)
    ) 
    """)

    for course in courses:
        course_id = course["course_id"]
        semester = course["semester"]
        num_sections = len(course["sections"])
        sections = get_all_sections(course_id, semester, num_sections)
        for section in sections:
            insert_section(cur, section)

    print(list(cur.execute("SELECT * FROM course_sections LIMIT 5")))
    print(list(cur.execute("SELECT * FROM course_section_instructors LIMIT 5")))

processing section 0101 of CMSC122 for year 2019, semester SPRING with instructors ['Brian Brubach']
processing section 0201 of CMSC122 for year 2019, semester SPRING with instructors ['Pedram Sadeghian']
processing section 0101 of CMSC131 for year 2019, semester SPRING with instructors ['Ilchul Yoon']
processing section 0102 of CMSC131 for year 2019, semester SPRING with instructors ['Ilchul Yoon']
processing section 0103 of CMSC131 for year 2019, semester SPRING with instructors ['Ilchul Yoon']
processing section 0104 of CMSC131 for year 2019, semester SPRING with instructors ['Ilchul Yoon']
processing section 0201 of CMSC131 for year 2019, semester SPRING with instructors ['Pedram Sadeghian']
processing section 0202 of CMSC131 for year 2019, semester SPRING with instructors ['Pedram Sadeghian']
processing section 0203 of CMSC131 for year 2019, semester SPRING with instructors ['Pedram Sadeghian']
processing section 0204 of CMSC131 for year 2019, semester SPRING with instructors ['Pe

## Fetching Professor Ratings

In our analysis, we also seek to consider how professor ratings affect enrollment rates. Using the PlanetTerp API, we fetch the average rating of each professor who has taught a course, along with the course's average GPA.

In [281]:
def get_prof_avgRating(prof):
    r = requests.get('https://api.planetterp.com/v1/professor', params={'name': prof})
    r = r.json()

    if "average_rating" in r:
        return r['average_rating']
    else:
        return None

In [282]:
with open_conn() as conn:
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS instructor_reviews")  # drops the old table to start over; comment this out to keep existing rows
    cur.execute("""
    CREATE TABLE IF NOT EXISTS instructor_reviews (
        course CHAR(7) NOT NULL,
        instructor_name VARCHAR(100) NOT NULL,
        rating INT(1) NOT NULL,
        avg_class_gpa DOUBLE(2,2),
        CONSTRAINT primary_key PRIMARY KEY (course, instructor_name),
        FOREIGN KEY (course) REFERENCES courses(id)
    )
    """)

    unique_course_ids = set(map(lambda x: x["course_id"], courses))
    for course_id in unique_course_ids:
        response = requests.get("https://api.planetterp.com/v1/course", params={"name": course_id}).json()
        if "error" in response:
            print(f"failed resp {response}")
            continue

        avg_class_gpa = response["average_gpa"]


        for professor in response["professors"]:
            rating = get_prof_avgRating(professor)

            cur.execute(
                "INSERT OR IGNORE INTO instructor_reviews (course, instructor_name, rating, avg_class_gpa) VALUES (?, ?, ?, ?)",
                (course_id, professor, rating, avg_class_gpa)
            )

        # for review in response["reviews"]:
        #     cur.execute(
        #         "INSERT OR IGNORE INTO instructor_reviews (course, instructor_name, rating) VALUES (?, ?, ?)",
        #         (course_id, review["professor"], review["rating"])
        #     )

    print(list(cur.execute("SELECT * FROM instructor_reviews LIMIT 5")))

failed resp {'error': 'course not found'}
failed resp {'error': 'course not found'}
failed resp {'error': 'course not found'}
[('CMSC858E', 'Samir Khuller', 4, 3.44348), ('CMSC838G', 'Michael Hicks', 4.0909, 3.60833), ('CMSC838G', 'Leonidas Lampropoulos', 4, 3.60833), ('CMSC389V', 'John Dickerson', 3.5714, 3.80482), ('CMSC818N', 'Dinesh Manocha', 1, 3.79516)]


## Fetching Grade Data

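Finally, we fetch the grade distributions for each course from the PlanetTerp `grades` endpoint. Each entry gives the number of students who received each letter grade for a given professor and semester. We convert these counts into an average GPA using the grade-point table below, record the number of withdrawals (`W`) as drops, and write the result to the `course_grades` table. Entries that are not from a spring or fall semester are skipped. When several entries share the same course, professor, year, and semester, we average their GPAs and add up their drops.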
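
To make the GPA calculation in the next cell concrete, here is a sketch on a made-up grade distribution. The entry below is hypothetical and only mimics the shape of a PlanetTerp grades record, and the grade-point values are a subset of the table used in the cell that follows. Withdrawals (`W`) are not in the grade-point table, so they do not count toward the GPA.

```python
# Subset of the grade-point table, enough for this example.
DEMO_GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "F": 0.0}

# Hypothetical record: counts of each letter grade plus withdrawals.
entry = {"professor": "Example Instructor", "A": 10, "B": 5, "C": 5, "F": 0, "W": 2}

grade_point_sum = 0.0
total_grades = 0
for key, value in entry.items():
    if key not in DEMO_GRADE_POINTS:
        continue  # skips "professor" and "W"
    total_grades += int(value)
    grade_point_sum += int(value) * DEMO_GRADE_POINTS[key]

gpa = grade_point_sum / total_grades if total_grades else None
print(gpa)  # (10*4.0 + 5*3.0 + 5*2.0) / 20 = 3.25
```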

In [283]:
# grade points per letter grade on the University's 4.0 scale
GRADE_POINTS = {
    "A+": 4.0, "A": 4.0, "A-": 3.7, 
    "B+": 3.3, "B": 3.0, "B-": 2.7,
    "C+": 2.3, "C": 2.0, "C-": 1.7,
    "D+": 1.3, "D": 1.0, "D-": 0.7,
    "F": 0.0
}

def db_write_course_grades(conn, course_id, plt_terp_grades):
    cur = conn.cursor()
    for entry in plt_terp_grades:
        num_grade_w = int(entry["W"])
        prof_name = entry["professor"]
        semester_raw = entry["semester"]
        year, semester = parse_semester(semester_raw)
            
        # if we can't identify fall or spring semester,
        # then the course isnt relevant to our data
        if semester is None:
            continue
            
        # loop over keys and values and check if key is a grade
        # name. if so, add it to our grade point sum and total
        # amount of grades.
        grade_point_sum = 0
        total_grades = 0
        for key, value in entry.items():
            if key not in GRADE_POINTS:
                continue
            
            grade_points = GRADE_POINTS[key]
            amt = int(value)
            total_grades += amt
            grade_point_sum += amt * grade_points
        
        if total_grades == 0:
            gpa = None
        else:
            gpa = grade_point_sum / total_grades

        cur.execute(
            """
            INSERT INTO course_grades (course, instructor_name, year, semester, gpa, num_drops)
            VALUES (?, ?, ?, ?, ?, ?)
            ON CONFLICT(course, instructor_name, year, semester) 
                DO UPDATE SET gpa = (gpa + excluded.gpa) / 2, num_drops = num_drops + excluded.num_drops
            """,
            (course_id, prof_name, year, semester, gpa, num_grade_w)
        )

with open_conn() as conn:
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS course_grades")  # drops the old table to start over; comment this out to keep existing rows
    cur.execute("""
    CREATE TABLE IF NOT EXISTS course_grades (
        course CHAR(7) NOT NULL,
        instructor_name VARCHAR(100),
        year INT(4) NOT NULL,
        semester VARCHAR(10) NOT NULL,
        gpa DOUBLE(2,2),
        num_drops INT(5) NOT NULL,
        CONSTRAINT primary_key PRIMARY KEY (course, instructor_name, year, semester),
        FOREIGN KEY (course) REFERENCES courses(id)
    )
    """)

    unique_course_ids = set(map(lambda x: x["course_id"], courses))
    for course_id in unique_course_ids:
        response = requests.get("https://api.planetterp.com/v1/grades", params={"course": course_id}).json()
        if "error" in response:
            error_msg = response["error"]
            print(f"course {course_id} got error {error_msg}")
            continue

        db_write_course_grades(conn, course_id, response)

    print(list(cur.execute("SELECT * FROM course_grades LIMIT 5")))

course CMSC838C got error course not found
course CMSC388X got error course not found
course CMSC848C got error course not found
[('CMSC858E', 'Samir Khuller', 2018, 'FALL', 3.6, 1), ('CMSC838G', 'Leonidas Lampropoulos', 2021, 'FALL', 4.0, 0), ('CMSC838G', 'Michael Hicks', 2016, 'SPRING', 3.5999999999999996, 1), ('CMSC838G', 'Michael Hicks', 2014, 'SPRING', 3.5, 0), ('CMSC389V', 'John Dickerson', 2021, 'SPRING', 3.9037037037037035, 1)]
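
The `ON CONFLICT` clause used above merges rows that share the same course, instructor, year, and semester. The sketch below reproduces that behavior on an in-memory database with a hypothetical `demo_grades` table and a shortened key; it assumes an SQLite build that supports upserts (3.24 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE demo_grades (
        course TEXT NOT NULL,
        instructor_name TEXT NOT NULL,
        gpa REAL,
        num_drops INTEGER NOT NULL,
        PRIMARY KEY (course, instructor_name)
    )
""")

# On a key collision, average the stored and incoming GPA and add the drops.
upsert = """
    INSERT INTO demo_grades (course, instructor_name, gpa, num_drops)
    VALUES (?, ?, ?, ?)
    ON CONFLICT(course, instructor_name)
        DO UPDATE SET gpa = (gpa + excluded.gpa) / 2,
                      num_drops = num_drops + excluded.num_drops
"""
conn.execute(upsert, ("CMSC131", "Example Instructor", 3.0, 1))
conn.execute(upsert, ("CMSC131", "Example Instructor", 4.0, 2))

merged = conn.execute("SELECT gpa, num_drops FROM demo_grades").fetchone()
conn.close()
print(merged)
```

Note that averaging two values at a time matches the arithmetic mean only when exactly two rows collide; with three or more, later rows carry more weight than earlier ones.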


# Basic Analysis

In [284]:
import pandas as pd

query = """
SELECT
    courses.id AS course_id,
    courses.year,
    courses.semester,
    courses.credits,
    course_grades.instructor_name,
    instructor_reviews.avg_class_gpa,
    AVG(instructor_reviews.rating) AS avg_rating,
    AVG(course_grades.gpa) AS avg_gpa,
    SUM(course_grades.num_drops) AS total_num_drops
FROM courses
LEFT JOIN course_grades
    ON course_grades.course = courses.id
    AND course_grades.year = courses.year
    AND course_grades.semester = courses.semester
LEFT JOIN instructor_reviews
    ON instructor_reviews.course = courses.id
    AND instructor_reviews.instructor_name = course_grades.instructor_name
/*WHERE course_grades.course = 'CMSC132'*/
GROUP BY
    course_grades.course,
    course_grades.instructor_name,
    course_grades.year,
    course_grades.semester
"""
with open_conn() as conn:
    df = pd.read_sql(sql=query, con=conn, index_col="course_id")
    df = df.reset_index(0)
    display(df)

| | course_id | year | semester | credits | instructor_name | avg_class_gpa | avg_rating | avg_gpa | total_num_drops |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CMSC298A | 2017 | FALL | 1 | | | | | |
| 1 | CMSC100 | 2017 | FALL | 1 | Alyssa Neuner | | | 3.758824 | 1.0 |
| 2 | CMSC100 | 2018 | FALL | 1 | Alyssa Neuner | | | 3.900000 | 0.0 |
| 3 | CMSC100 | 2017 | FALL | 1 | Amy Vaillancourt | | | 3.960000 | 2.0 |
| 4 | CMSC100 | 2018 | FALL | 1 | Amy Vaillancourt | | | 3.937500 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 502 | CMSC498X | 2019 | FALL | 3 | Marc Lichtman | | | 3.577419 | 0.0 |
| 503 | CMSC499A | 2019 | FALL | 1 | | | | 4.000000 | 0.0 |
| 504 | CMSC499A | 2019 | SPRING | 1 | | | | 4.000000 | 0.0 |
| 505 | CMSC499A | 2020 | FALL | 1 | | | | 4.000000 | 0.0 |


In [285]:
pd.reset_option('^display.', silent=True)

# rows that are missing a value for each of the three measures
nan_prof_rating = df[df.avg_rating.isnull()]
nan_class_gpa = df[df.avg_class_gpa.isnull()]
nan_prof_gpa = df[df.avg_gpa.isnull()]

# keep only the rows that have a professor rating
tidyDf = df.drop(nan_prof_rating.index)
# last whitespace-separated token of the name, used later as a join key
tidyDf["instructor_last_name"] = tidyDf.instructor_name.apply(lambda x: x.split()[-1])
tidyDf

# profTidyDf = tidyDf.groupby('instructor_last_name').mean()
# profTidyDf = profTidyDf.reset_index(0)
# profTidyDf

# courseTidyDf = tidyDf.groupby('course_id').mean()
# courseTidyDf = courseTidyDf.reset_index(0)
# # courseTidyDf = courseTidyDf[["a"]]
# courseTidyDf




| | course_id | year | semester | credits | instructor_name | avg_class_gpa | avg_rating | avg_gpa | total_num_drops | instructor_last_name |
|---|---|---|---|---|---|---|---|---|---|---|
| 5 | CMSC101 | 2020 | FALL | 3 | Mollye Bendell | 3.80000 | 5.0000 | 3.800000 | 0.0 | Bendell |
| 6 | CMSC106 | 2019 | FALL | 4 | Anthony Banes | 2.34232 | 2.7500 | 3.171053 | 5.0 | Banes |
| 7 | CMSC106 | 2017 | FALL | 4 | Ilchul Yoon | 2.34232 | 2.8636 | 2.588235 | 7.0 | Yoon |
| 8 | CMSC106 | 2018 | FALL | 4 | Ilchul Yoon | 2.34232 | 2.8636 | 2.805882 | 6.0 | Yoon |
| 9 | CMSC106 | 2020 | FALL | 4 | Ilchul Yoon | 2.34232 | 2.8636 | 2.736667 | 12.0 | Yoon |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 491 | CMSC474 | 2019 | SPRING | 3 | Mohammad Hajiaghayi | 2.68902 | 4.7500 | 2.990625 | 1.0 | Hajiaghayi |
| 497 | CMSC498P | 2020 | FALL | 3 | Thomas Goldstein | 2.94615 | 5.0000 | 3.481818 | 2.0 | Goldstein |
| 498 | CMSC498V | 2018 | FALL | 3 | Furong Huang | 3.51864 | 3.0000 | 3.952632 | 0.0 | Huang |
| 499 | CMSC498V | 2017 | FALL | 3 | Niki Vazou | 3.51864 | 3.0000 | 3.680556 | 4.0 | Vazou |


# Professor average GPA to average enrollment rate

In [291]:
import pandas as pd
import numpy as np
query = """
SELECT 
    course_section_instructors.name AS instructor_name,
    SUM(course_sections.seats_taken) AS seats_taken,
    SUM(course_sections.seats_open) AS seats_open
FROM course_section_instructors
LEFT JOIN course_sections
    ON course_sections.course = course_section_instructors.course
    AND course_sections.year = course_section_instructors.year
    AND course_sections.semester = course_section_instructors.semester
    AND course_sections.number = course_section_instructors.number
LEFT JOIN course_grades
    ON course_grades.course = course_section_instructors.course
    AND course_grades.year = course_section_instructors.year
    AND course_grades.semester = course_section_instructors.semester
    AND course_grades.instructor_name = course_section_instructors.name
GROUP BY course_section_instructors.name
"""
with open_conn() as conn:
    enroll_rateDf = pd.read_sql(sql=query, con=conn, index_col="instructor_name")
    enroll_rateDf = enroll_rateDf.reset_index(0)
    display(enroll_rateDf)

| | instructor_name | seats_taken | seats_open |
|---|---|---|---|
| 0 | A Shankar | 635 | 386 |
| 1 | Abhinav Bhatele | 191 | 144 |
| 2 | Abhinav Shrivastava | 400 | 65 |
| 3 | Adam Porter | 542 | 83 |
| 4 | Akwum Onwunta | 8 | 51 |
| ... | ... | ... | ... |
| 153 | William Goldman | 11 | 24 |
| 154 | William Regli | 235 | 53 |
| 155 | Wiseley Wong | 153 | 26 |
| 156 | Xiaodi Wu | 138 | 132 |


In [307]:
enroll_rateDf["total_seats"] = enroll_rateDf.seats_open + enroll_rateDf.seats_taken

# group full instructor names by last name
temp = {}
for prof in enroll_rateDf['instructor_name']:
    lastname = prof.split()[-1]
    temp.setdefault(lastname, []).append(prof)

# last names that map to more than one full name, e.g. 'Dave Levin' / 'David Levin'
duplicates = [names for names in temp.values() if len(names) > 1]

# display every group so the multi-name entries can be checked by eye
list(temp.values())


[['A Shankar'],
 ['Abhinav Bhatele'],
 ['Abhinav Shrivastava'],
 ['Adam Porter'],
 ['Akwum Onwunta'],
 ['Alan Sussman'],
 ['Alexander Barg'],
 ['Alexander Brassel'],
 ['Amol Deshpande'],
 ['Amy Vaillancourt'],
 ['Andrew Childs'],
 ['Anthony Banes'],
 ['Anthony Ostuni'],
 ['Anwar Mamat'],
 ['Aravind Srinivasan'],
 ['Ashok Agrawala'],
 ['Bahar Asgari'],
 ['Behtash Babadi'],
 ['Brian Brubach'],
 ['C Rytting', 'Christopher Rytting'],
 ['Carlos Castillo'],
 ['Charalampos Papamanthou'],
 ['Charles Clark'],
 ['Charlotte Avery'],
 ['Christopher Metzler'],
 ['Christopher Moakler'],
 ['Cliff Bakalian', 'Clifford Bakalian'],
 ['Clyde Kruskal'],
 ['Cornelia Fermuller'],
 ['Dana Dachman-Soled'],
 ['Dana Nau'],
 ['Daniel Abadi'],
 ['Daniel Gottesman'],
 ['Dave Levin', 'David Levin'],
 ['Dave Mount', 'David Mount'],
 ['David Harris'],
 ['David Jacobs'],
 ['David Sekora'],
 ['David Van Horn'],
 ['Dinesh Manocha'],
 ['Dionisios Margetis'],
 ['Donald Perlis'],
 ['Eitan Tadmor'],
 ['Elias Gonzalez'],
 ['

In [293]:
enroll_rateDf["instructor_last_name"] = enroll_rateDf.instructor_name.apply(lambda x: x.split()[-1])
bad_enrollDf = enroll_rateDf.groupby("instructor_last_name").sum(numeric_only=True)
bad_enrollDf = bad_enrollDf.reset_index(0)
bad_enrollDf

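| | instructor_last_name | seats_taken | seats_open | total_seats |
|---|---|---|---|---|
| 0 | Abadi | 173 | 62 | 235 |
| 1 | Adams | 158 | 6 | 164 |
| 2 | Agrawala | 394 | 188 | 582 |
| 3 | Alagic | 88 | 2 | 90 |
| 4 | Albert | 31 | 5 | 36 |
| ... | ... | ... | ... | ... |
| 145 | Yang | 113 | 7 | 120 |
| 146 | Yoon | 1702 | 1221 | 2923 |
| 147 | Yushutin | 236 | 15 | 251 |
| 148 | Zhou | 145 | 0 | 145 |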


In [None]:
# pd.reset_option()
tidyDf.merge(enroll_rateDf, left_on="instructor_last_name", right_on="instructor_last_name", how="inner")

# newDf = tidyDf[['course_id', 'instructor_name_x']].copy()

In [None]:
dfNames = list(df.instructor_name.unique())
dfNames.remove(None)
dfNames = sorted(dfNames)
dfNames

enrollRateNames = sorted(list(enroll_rateDf.instructor_name.unique()))
tiDyNames = list(tidyDf.instructor_name.unique())

# enrollRateNames'

In [None]:
# use the last token of each name, matching how instructor_last_name is built above
lastTidy = sorted(list(map(lambda x: x.split()[-1], tiDyNames)))
lastEnroll = sorted(list(map(lambda x: x.split()[-1], enrollRateNames)))

diff =  set(tiDyNames) - set(enrollRateNames)
# diff0 = set(lastTidy) - set(lastEnroll)

# print(diff)
# print()
# print(diff0)

diff

In [None]:
fooBar = 