# Title
Collaborators: Albert Chen, Alex Chen, Aaron Lin, Srujan Penikelapati

# Introduction

Faculty are the lifeblood of any university. They conduct cutting-edge research to discover new insights that push the boundary of the field they are studying. They also teach classes, helping students learn new skills and knowledge that will benefit them in their professional lives. As in any occupation, professors vary in the quality of their research and their teaching ability, which directly impacts the quality of education that students who take their classes receive.

Students understand that their choice of professor for a course can be the difference between an engaging, informative semester and a disappointing one. Prior to registration, students often check professor reviews, online posts about the class, grade distributions, and other sources of information to decide whose class to take.

The goal of this project is to examine how a professor's teaching effectiveness relates to factors that might determine it. Using data science, we will review salary data, years of experience, and average student ratings for professors at the University of Maryland (UMD) to see how these correlate with the average Grade Point Average (GPA) of students across each professor's classes. For example, do professors with high salaries and many years of experience at UMD tend to have students with higher GPAs? Once we build a correlative model, we can also ask the question in reverse, such as whether the University is getting the best 'bang for its buck' given a professor's salary and average GPA compared to other professors in the same field.








# Data Collection

We first need to collect data that is relevant to our study. To get grade distributions and student reviews, we made GET requests to the [PlanetTerp API](https://planetterp.com/api/) and stored the data in dataframes. To get information on professors' salaries and how long they've been teaching, we made GET requests to the [Diamondback API](https://api.dbknews.com/docs/#/salary) and stored that data in a dataframe as well. However, performing all of these requests takes a long time, which makes it impractical to rerun them on every execution. As such, we saved our dataframes to .csv files. This also means that our data is current only as of April 28th, 2023, since from that point on we read from our .csv files instead of requesting from the websites themselves. The code used for that is below.

In [30]:
import pandas as pd
import requests
import os
from pathlib import Path
from statsmodels.formula.api import *
import matplotlib.pyplot as plt
from sklearn import linear_model
from ggplot import *

In [31]:
grades_exist = os.path.exists("src/1_collect_data/planet_terp_data/PT_grade_data.csv")
reviews_exist = os.path.exists("src/1_collect_data/planet_terp_data/PT_review_data.csv")
salaries_exist = os.path.exists("src/1_collect_data/salary_data/DB_combined_data.csv")
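
The existence checks above gate each download, so the slow API calls run only when a cached file is missing. As a minimal sketch of the same cache-or-fetch pattern, the helper below wraps the check, the download, and the write in one place. The names `load_or_fetch`, `fetch`, and `fake_fetch` are hypothetical and not part of the notebook, and the demo writes to a temporary directory rather than the project's `src/` folders.

```python
import os
import tempfile

import pandas as pd


def load_or_fetch(csv_path, fetch):
    """Return the cached CSV at csv_path, or call fetch() once and cache its result.

    Hypothetical helper: `fetch` is any zero-argument function returning a DataFrame.
    """
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    frame = fetch()
    # Create the target folder if needed so to_csv does not fail on a fresh checkout.
    os.makedirs(os.path.dirname(csv_path) or ".", exist_ok=True)
    frame.to_csv(csv_path, encoding="utf-8", index=False)
    return frame


# Stand-in for a slow API download; it counts how often it is invoked.
calls = {"n": 0}


def fake_fetch():
    calls["n"] += 1
    return pd.DataFrame({"name": ["A", "B"], "rating": [4, 5]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache", "demo.csv")
    first = load_or_fetch(path, fake_fetch)   # downloads and writes the CSV
    second = load_or_fetch(path, fake_fetch)  # reads the CSV, no second download
```

Keeping the path in a single argument also avoids checking for a file in one location while writing it to another.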

In [32]:
if not reviews_exist:
    # Page through the professors endpoint 100 records at a time until an empty page is returned
    reviews = []
    offset = 0
    while True:
        r = requests.get("https://planetterp.com/api/v1/professors", params={"offset": offset, "reviews": "true", "limit": 100})
        page = r.json()
        if page == []:
            break
        reviews.append(page)
        offset = offset + 100

    # Build one row per review, keeping only professors and reviews tied to a course
    rows = []
    for i in reviews:
        for j in i:
            if j.get("type") != "professor":
                continue
            for k in j.get("reviews") or []:
                if k.get("course") is not None:
                    rows.append({
                        "name": j.get("name"),
                        "slug": j.get("slug"),
                        "type": j.get("type"),
                        "course": k.get("course"),
                        "rating": k.get("rating"),
                        "review": k.get("review"),
                        "date": k.get("created")[:10],  # first 10 characters: the date part of the timestamp
                    })
    df = pd.DataFrame(rows, columns=["name", "slug", "type", "course", "rating", "review", "date"])

    # Write to the same path that the existence check and the later read_csv use
    df = df.sort_values(by=["name","course"])
    df.to_csv("src/1_collect_data/planet_terp_data/PT_review_data.csv", encoding='utf-8', index=False)
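
The review cell turns nested JSON (professors, each with a list of reviews) into one row per review. As a hedged alternative sketch, `pd.json_normalize` can do the same flattening in one call. The records in `sample_page` are made up to mirror only the fields the notebook reads; they are not real PlanetTerp output, and the real responses may carry additional fields.

```python
import pandas as pd

# Made-up records shaped like the fields the notebook reads from the
# professors endpoint; the values are illustrative, not real API output.
sample_page = [
    {"name": "Jane Doe", "slug": "doe", "type": "professor",
     "reviews": [
         {"course": "CMSC131", "rating": 5, "review": "Clear lectures.",
          "created": "2022-05-01T12:00:00"},
         {"course": None, "rating": 3, "review": "No course listed.",
          "created": "2021-12-15T09:30:00"},
     ]},
    {"name": "John Roe", "slug": "roe", "type": "ta",
     "reviews": [
         {"course": "MATH140", "rating": 4, "review": "Helpful.",
          "created": "2020-10-10T08:00:00"},
     ]},
    {"name": "No Reviews", "slug": "none", "type": "professor", "reviews": []},
]

# One row per review, with the professor-level fields repeated on each row.
flat = pd.json_normalize(sample_page, record_path="reviews", meta=["name", "slug", "type"])

# Same filters as the notebook: professors only, and reviews tied to a course.
flat = flat[(flat["type"] == "professor") & flat["course"].notna()].copy()

# Keep only the date part of the timestamp, then order the columns.
flat["date"] = flat["created"].str[:10]
flat = flat[["name", "slug", "type", "course", "rating", "review", "date"]]
```

Professors with an empty review list contribute no rows, which matches the explicit empty-list check in the loop version.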

In [33]:
if not grades_exist:
    # Take professor names from the reviews built above, or from the cached CSV if that cell was skipped
    review_source = pd.read_csv("src/1_collect_data/planet_terp_data/PT_review_data.csv") if reviews_exist else df
    professors = review_source["name"].drop_duplicates()

    # Request the grade records for each professor (one record per course section)
    grades = []
    for prof in professors:
        r = requests.get("https://planetterp.com/api/v1/grades", params={"professor": prof})
        grades.append(r.json())
        print(prof)

    # Keep the section identifiers and the count of each letter grade
    columns = ["professor", "course", "semester", "section",
               "A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-",
               "D+", "D", "D-", "F", "W", "Other"]
    # Skip any response that is not a list of records (for example, an error object)
    rows = [{col: j.get(col) for col in columns} for i in grades if isinstance(i, list) for j in i]
    grade_df = pd.DataFrame(rows, columns=columns)

    # Write to the same path that the existence check and the later read_csv use
    grade_df = grade_df.sort_values(by=["professor","course"])
    grade_df.to_csv("src/1_collect_data/planet_terp_data/PT_grade_data.csv", encoding = "utf-8", index = False)
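
The grade data stores counts of each letter grade per section, while the study is framed in terms of average GPA. A minimal sketch of that conversion follows, assuming a conventional 4.0 scale in which A+ and A are both worth 4.0 and withdrawals (`W`) and `Other` are excluded. Both the weights in `GRADE_POINTS` and the helper `section_gpa` are assumptions for illustration, not the mapping the notebook necessarily uses.

```python
import pandas as pd

# Assumed 4.0-scale weights; the notebook may use a different mapping.
GRADE_POINTS = {
    "A+": 4.0, "A": 4.0, "A-": 3.7,
    "B+": 3.3, "B": 3.0, "B-": 2.7,
    "C+": 2.3, "C": 2.0, "C-": 1.7,
    "D+": 1.3, "D": 1.0, "D-": 0.7,
    "F": 0.0,
}


def section_gpa(grades):
    """Average GPA per row of letter-grade counts (hypothetical helper).

    W and Other are ignored; a row with no graded students yields NaN.
    """
    counts = grades[list(GRADE_POINTS)]
    # Multiply each letter's count by its weight, then sum across letters.
    points = counts.mul(pd.Series(GRADE_POINTS), axis=1).sum(axis=1)
    return points / counts.sum(axis=1)


# Two made-up sections: (2 A, 2 B, 1 W) and (1 A+, 1 C).
zeros = dict.fromkeys(GRADE_POINTS, 0)
demo = pd.DataFrame([
    {**zeros, "A": 2, "B": 2, "W": 1},
    {**zeros, "A+": 1, "C": 1, "W": 0},
])
gpa = section_gpa(demo)
```

In the first section the withdrawal is left out of the denominator, so the GPA is (2 × 4.0 + 2 × 3.0) / 4 = 3.5.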

In [34]:
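def combine(group):
    # Collapse all salary rows for one employee into parallel per-year lists
    years = group['year'].tolist()
    salaries = group['Salary'].tolist()
    # Salaries arrive as strings: drop the thousands separators and the leading currency symbol
    salaries = [float(s.replace(",","")[1:]) for s in salaries]
    departments = group['Department'].tolist()
    
    # Merge consecutive rows that share a year: differing salaries are summed,
    # identical ones are treated as duplicates, and the first department is kept
    i = 0
    while i < len(years) - 1:
        if (years[i] == years[i+1]):
            if salaries[i] != salaries[i + 1]:
                salaries[i] = salaries[i] + salaries[i + 1]
            years.pop(i+1)
            salaries.pop(i+1)
            departments.pop(i+1)
        else:
            i += 1

    return pd.Series({
        'years_taught': years,
        'salaries': salaries,
        'departments': departments,
    })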

if not salaries_exist:
    # get years that api is valid for
    r_years = requests.get("https://api.dbknews.com/salary/years")
    years = r_years.json()["data"]

    df = pd.DataFrame()

    # for each year, get salary data
    for year in years:
        r = requests.get(f"https://api.dbknews.com/salary/year/{year}")

        # number of faculty
        count = r.json()["count"]
        year_df = pd.DataFrame()
        page = 0

        while page * 10 < count:
            page += 1

            # get salary data for 1 page
            r = requests.get(f"https://api.dbknews.com/salary/year/{year}?page={page}")
            page_df = pd.DataFrame.from_dict(r.json()["data"])

            # remove division and title columns, modify department col, and add year
            page_df = page_df.drop(['Division', "Title"], axis=1)
            page_df["Department"] = page_df["Department"].str.slice(stop=4)
            page_df["year"] = str(year)
            year_df = pd.concat([year_df, page_df], axis=0)

        print(f"year {year} finished")
        year_df.to_csv(f'src/1_collect_data/salary_data/{year}data.csv', index=False)

    for year in years:
        year_df = pd.read_csv(f'src/1_collect_data/salary_data/{year}data.csv')
        year_df['Employee'] = year_df['Employee'].str.replace('\n', ' ')
        df = pd.concat([df, year_df], axis=0)

    df_grouped = df.groupby(['Employee']).apply(combine).reset_index()

    # Convert "Last, First Middle" into an upper-case "FIRST LAST" key (first given name, final surname word)
    df_grouped['name'] = df_grouped['Employee'].apply(lambda x: (x.split(', ')[1].split(" ")[0]+ ' ' + x.split(', ')[0].split(" ")[-1]).upper())

    print(df_grouped.to_string())
    df_grouped.to_csv('src/1_collect_data/salary_data/DB_combined_data.csv', index=False)
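
The `name` column built in the salary cell reduces each `Employee` string to an upper-case "FIRST LAST" key, presumably so salary rows can later be matched against professor names from the other dataset. The sketch below restates that lambda as a named function to make its behavior on edge cases visible. `salary_name_key` is a hypothetical name, and returning `None` for strings without a ", " separator is an added assumption; the original lambda would raise an `IndexError` on such input.

```python
def salary_name_key(employee):
    """Turn "Last, First Middle" into "FIRST LAST" (hypothetical helper).

    Only the first given name and the final word of the surname are kept,
    mirroring the lambda in the salary cell.
    """
    parts = employee.split(", ")
    if len(parts) < 2:
        # Assumption: treat names without a ", " separator as unmatchable.
        return None
    first = parts[1].split(" ")[0]
    last = parts[0].split(" ")[-1]
    return (first + " " + last).upper()


simple = salary_name_key("Smith, John A")
compound = salary_name_key("Van Der Berg, Anna")
no_comma = salary_name_key("Cher")
```

Here `simple` is `"JOHN SMITH"` and `compound` is `"ANNA BERG"`. Multi-word surnames are truncated to their last word, so two different employees can end up with the same key.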

In [35]:
reviews_df = pd.read_csv("src/1_collect_data/planet_terp_data/PT_review_data.csv")
grades_df = pd.read_csv("src/1_collect_data/planet_terp_data/PT_grade_data.csv")
salaries_df = pd.read_csv("src/1_collect_data/salary_data/DB_combined_data.csv")