# CMSC320 Final Tutorial: poopy professors paid plenty
### Amelia Hsu, Jason Liu, and Brian Xiang

## Table of contents
1. [Motivation](#motivation)
2. [Hypothesis](#hypothesis)
3. [Data Munging](#data-munging)
    1. [Data Collection](#data-collection)
    2. [Tidying the Data](#tidying-the-data)
    3. [Name Matching](#name-matching)
4. [Data Representation](#data-representation)
    1. [Initial Graphing](#initial-graphing)

## Motivation

Let's be honest: professors vary in teaching style (and quality). Students tend to be extremely vocal about their opinions of professors. Everyone is looking out for one another. Friends want each other to get the best professors they can and to avoid those they may not learn well from. As a result, online platforms have been created to house student reviews of professors, the most commonly known being [Rate My Professors](https://ratemyprofessors.com), which has data on over 1.3 million professors, 7,000 schools and 15 million ratings. Three students at the University of Maryland, College Park even took the initiative to create their own platform specifically for UMD professor ratings, [PlanetTerp](https://planetterp.com/), which includes over 11,000 professors and 16,000 reviews. PlanetTerp has the additional feature of including course grades for each UMD course; as of right now there are nearly 300,000 course grades stored on the site.

Starting in 2013, The Diamondback, UMD's premier independent student newspaper, began publishing an annual salary guide: [The Diamondback Salary Guide](https://salaryguide.dbknews.com/). The Diamondback Salary Guide displays every university employee's yearly pay in an easily digestible format for all to view. This information is public data provided to The Diamondback by the university itself; The Diamondback simply made this data more accessible to all by posting it on a single website.

The Diamondback Salary Guide states, "[w]e won't tell you what conclusions to draw from these numbers, but it's our hope that they'll give you the information you need to reflect on your own." In this final tutorial, we plan to do just that: compare the salaries and ratings of UMD professors and reflect on our findings. From our own past experiences, we have observed that our favorite professors are not always the ones being paid the highest salaries. We are interested in whether these two attributes are correlated. If there is a correlation between professor salary and rating, what is it? If a correlation exists, can we use it to predict professor salary based on student reviews and vice versa?

## Hypothesis

We hypothesize that a professor's salary and rating will be correlated, because some departments have tenured professors whose standard of teaching has dropped (to say the least).

## Data Munging

### Data Collection

In order to observe the relationship between professor salary and rating, we collected data from three sources: Diamondback Salary Guide (DBK), Rate My Professors (RMP), and PlanetTerp (PT). DBK was our source of professor salary data, and a combination of RMP and PT was used as our source of professor rating data.

The Diamondback Salary Guide has an undocumented API. However, we were able to learn how it works by watching the network requests as we modified parameters on the site. This let us programmatically go through all of the pages and pull the full data for every year that the Salary Guide tracks (2013-2022).
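As a rough illustration of walking through every page of a paginated API, here is a minimal sketch. The function name `iter_all_records`, the page-numbering scheme, and the "empty page means done" stopping rule are all assumptions made for illustration; the real Salary Guide endpoint and its parameters are not shown here. The page fetcher is passed in as a function so the loop can be tried without any network access.

```python
from typing import Callable, Iterator


def iter_all_records(
    fetch_page: Callable[[int], list[dict]], start_page: int = 1
) -> Iterator[dict]:
    """Yield records page by page until a page comes back empty."""
    page = start_page
    while True:
        records = fetch_page(page)
        if not records:  # assumption: an empty page means there is no more data
            break
        yield from records
        page += 1


# Hypothetical stand-in for a real HTTP call (for example, one that
# wraps requests.get(...).json() against the endpoint you observed).
fake_pages = {1: [{"name": "A"}, {"name": "B"}], 2: [{"name": "C"}]}


def fake_fetch(page: int) -> list[dict]:
    return fake_pages.get(page, [])


all_records = list(iter_all_records(fake_fetch))
print(len(all_records))  # 3
```

Keeping the HTTP call separate from the paging loop also makes it easy to add delays or retries in one place if the site starts rejecting rapid requests.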

<img src="img/dbk_request.png" width="1000"/>

Rate My Professors also has an undocumented API. We discovered it by inspecting the network requests made when loading a page of professor reviews and noticing a GraphQL query. We then copied the query, the authentication header, and the necessary variables so that we could emulate the request locally.

Interestingly, although their API technically requires authentication, the request from the website leaks the necessary [Basic authentication](https://en.wikipedia.org/wiki/Basic_access_authentication) header, which is `test:test` encoded in base64.
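To make the Basic authentication scheme concrete, the short sketch below builds the header by hand. The helper name `basic_auth_header` is our own for illustration; the only facts taken from the text above are the `test:test` credentials and the base64 encoding.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build an HTTP Basic authentication header from a username and password."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


headers = basic_auth_header("test", "test")
print(headers)  # {'Authorization': 'Basic dGVzdDp0ZXN0'}
```

Because base64 is an encoding rather than encryption, anyone who sees the header can recover the credentials, which is why the value is so easy to spot in the network tab.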

<img src="img/rmp_graphql.png" width="1000"/>


PlanetTerp was created by UMD students and thus the creators were generous enough to document an [API](https://planetterp.com/api/) to help fellow students use the data available on their website. 

Using this API, we retrieved the full list of professors who have taught at UMD and appear on PlanetTerp (over 11,000!). For each professor, we collected the courses they have taught, their average rating across all of their courses, and all of their reviews. Each review includes its text content, rating, expected grade, and creation time.
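Once professor records are downloaded, it can help to flatten the nested reviews into one row per review. The sketch below shows one way to do that. The field names (`name`, `reviews`, `course`, `rating`, `expected_grade`, `created`, `review`) are assumptions modeled on the description above, and the sample records are made up; check the PlanetTerp API documentation for the actual response shape.

```python
def flatten_reviews(professors: list[dict]) -> list[dict]:
    """Turn a list of professor records into one flat row per review."""
    rows = []
    for prof in professors:
        for review in prof.get("reviews", []):
            rows.append(
                {
                    "professor": prof["name"],
                    "course": review.get("course"),
                    "rating": review["rating"],
                    "expected_grade": review.get("expected_grade"),
                    "created": review.get("created"),
                    "text": review.get("review"),
                }
            )
    return rows


# Made-up sample data in the assumed shape.
sample = [
    {
        "name": "Example Professor",
        "reviews": [
            {"course": "CMSC000", "rating": 5, "expected_grade": "A",
             "created": "2020-01-01T00:00:00", "review": "Great."},
            {"course": "CMSC000", "rating": 4, "expected_grade": "B",
             "created": "2021-01-01T00:00:00", "review": "Good."},
        ],
    },
    {"name": "Unreviewed Professor", "reviews": []},
]

review_rows = flatten_reviews(sample)
print(len(review_rows))  # 2
```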

### Tidying the Data

Before we began doing anything with our data, we first needed to clean it up. PlanetTerp has many listings for professors that have zero ratings, which is not helpful in our data exploration. For this reason, we removed all professors from our PT dataset who had no reviews. We also noticed that it was possible for PT to have multiple listings for the same professor (see [Madeleine Goh](https://planetterp.com/professor/goh) and [Madeleine Goh](https://planetterp.com/professor/goh_madeleine)). For professors with multiple listings, we merged the listings and combined all of their reviews, recalculating their average rating.
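The two tidying steps above (merging duplicate listings and dropping professors with no reviews) can be sketched as follows. This is an illustration under assumptions, not our exact code: the `name`, `slug`, and `ratings` fields are hypothetical, the sample data is made up, and merging on an exact name match is a simplification that could wrongly combine two different people who share a name.

```python
from collections import defaultdict


def tidy_professors(listings: list[dict]) -> dict[str, dict]:
    """Merge listings that share a name, drop professors with no ratings,
    and recompute each professor's average rating from the merged ratings."""
    merged: dict[str, list[float]] = defaultdict(list)
    for listing in listings:
        merged[listing["name"]].extend(listing["ratings"])

    return {
        name: {"ratings": ratings, "average_rating": sum(ratings) / len(ratings)}
        for name, ratings in merged.items()
        if ratings  # skip professors with zero reviews
    }


# Made-up listings: one professor listed twice, one with no reviews.
listings = [
    {"name": "Pat Example", "slug": "example", "ratings": [5, 3]},
    {"name": "Pat Example", "slug": "example_pat", "ratings": [4]},
    {"name": "No Reviews", "slug": "reviews", "ratings": []},
]

tidy = tidy_professors(listings)
print(tidy["Pat Example"]["average_rating"])  # 4.0
```

Recomputing the average from the pooled ratings, rather than averaging the two listing averages, keeps each review weighted equally.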

While collecting data from RMP, we noticed something odd about each professor's Overall Quality score: it did not equal the average of that professor's individual quality ratings.

When students create new reviews on RMP, they are asked to score the professor's helpfulness and clarity, and we can see each review's helpfulRating and clarityRating in the API data we collected. The RMP website, however, displays only a single "Quality" score per review. In the vast majority of cases, we found that this Quality score is the average of the two: (helpfulRating + clarityRating) / 2.

A professor's Overall Quality, however, is not the average of each review's Quality score. Take Clyde Kruskal as an example: at the time of our calculations, RMP gave Kruskal an Overall Quality score of 2.30. However, the average of each review's Quality was 2.14, the average of each review's helpfulRating was 2.11, and the average of each review's clarityRating was 2.16, none of which equal 2.30. It is unclear what causes this discrepancy. Is RMP factoring in the professors' difficulty ratings? How recent each review is? The Overall Quality score is a black box to us. Since we do not know how RMP calculates it, we chose to average each review's Quality score ourselves and use that value as the professor's average rating, since we know exactly how it is calculated.
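The calculation we settled on can be written out in a few lines. The `helpfulRating` and `clarityRating` keys come from the API data described above; the function names and the toy reviews are made up for illustration and are not real RMP data.

```python
from statistics import mean


def review_quality(review: dict) -> float:
    """Per-review Quality: the average of the helpful and clarity ratings."""
    return (review["helpfulRating"] + review["clarityRating"]) / 2


def average_quality(reviews: list[dict]) -> float:
    """A professor's average rating: the mean of the per-review Quality scores."""
    return mean(review_quality(r) for r in reviews)


# Toy reviews, not taken from any real professor.
toy_reviews = [
    {"helpfulRating": 1, "clarityRating": 2},
    {"helpfulRating": 3, "clarityRating": 3},
    {"helpfulRating": 5, "clarityRating": 4},
]

print(average_quality(toy_reviews))  # 3.0
```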

### Name Matching

To connect a professor's salary to their ratings, we needed a way to match the names in each dataset to each other. This proved more difficult than we expected, because professor names were not standardized across the three platforms. Sometimes professor names included middle names, sometimes they included a middle initial, and sometimes no middle name was provided at all. Occasionally, professor nicknames were listed instead of their full names. With over three thousand different professors, we could not possibly match professor names by hand, so we needed a method to find the best professor matches between the three datasets. We used fuzzy name matching to accomplish this task. Fuzzy matching (also known as approximate string matching) is a technique for identifying strings that are approximately, but not exactly, the same.

We explored two different approaches for matching professor names from PlanetTerp to the Diamondback Salary Guide. The first option we considered was [Hello My Name Is (HMNI)](https://github.com/Christopher-Thornton/hmni), a form of fuzzy name matching using machine learning. However, we decided against HMNI because it had not been updated in two years and did not work with our current version of Python 3. The next method we tried was using [fuzzywuzzy](https://github.com/seatgeek/thefuzz) or [fuzzyset](https://github.com/Glench/fuzzyset.js/), which also perform fuzzy name matching but use the Levenshtein distance to calculate the similarity between names. The Levenshtein distance is the minimum number of single-character deletions, insertions, or substitutions required to transform one string into another. We ultimately chose fuzzyset to match professor names from PT to DBK because it was faster than fuzzywuzzy and gave us more successful, correct matches than HMNI.
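To show what the Levenshtein distance measures, here is a small pure-Python sketch with a simple best-match helper built on top of it. This is only an illustration of the idea, not how fuzzyset or fuzzywuzzy are implemented internally; the `similarity` normalization, the `best_match` helper, and the example names are our own assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive score in [0, 1]; 1.0 means identical strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def best_match(name: str, candidates: list[str]) -> tuple[str, float]:
    """Return the candidate most similar to `name`, with its score."""
    return max(((c, similarity(name, c)) for c in candidates), key=lambda p: p[1])


print(levenshtein("kitten", "sitting"))  # 3
print(best_match("Jon Smith", ["John Smith", "Jane Doe"])[0])  # John Smith
```

In practice a minimum score cutoff is also useful, so that a professor with no true counterpart in the other dataset is left unmatched rather than paired with the least-bad candidate.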

## Data Representation

### Initial Graphing

After matching DBK salaries to PlanetTerp ratings, we created a preliminary graph to visualize the data that we had tirelessly toiled to collect, tidy, and match.

```python
# graph without removing very low review count
```

Looking at this preliminary graph, we noticed a large concentration of points on the lines x = 1.0, 2.0, 3.0, 4.0, and 5.0.
These concentrations come from the large number of professors on PlanetTerp whose students generally don't hold strong positive or negative opinions, and who therefore have only 1 review. With a single review, a professor's average rating is exactly that one whole-number rating.
After seeing this, we decided to filter out any professors with very few reviews. This reduces the size of our dataset, but it also removes the one-off, very high or very low reviews that might otherwise skew our data.
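A filter like the one described above might look like the sketch below. The threshold of 5 reviews, the `num_reviews` field, and the toy data are all hypothetical and chosen only for illustration; pick a cutoff that suits your own dataset.

```python
def filter_by_review_count(professors: list[dict], min_reviews: int = 5) -> list[dict]:
    """Keep only professors with at least `min_reviews` reviews."""
    return [p for p in professors if p["num_reviews"] >= min_reviews]


# Toy data: two single-review professors sit exactly on whole-number ratings.
toy_professors = [
    {"name": "Professor A", "average_rating": 5.0, "num_reviews": 1},
    {"name": "Professor B", "average_rating": 3.4, "num_reviews": 12},
    {"name": "Professor C", "average_rating": 1.0, "num_reviews": 1},
    {"name": "Professor D", "average_rating": 4.1, "num_reviews": 5},
]

filtered = filter_by_review_count(toy_professors)
print([p["name"] for p in filtered])  # ['Professor B', 'Professor D']
```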

```python
# graph with removing very low review count
```