## Intro 

This is part one of a two-part project on scraping and analyzing data from https://www.ratemyprofessors.com.

This part is devoted to scraping, cleaning, and converting the data into a more manageable format.

Part two will be some analysis and visualization.

In [83]:
import requests
import json

js_request = "https://search-a.akamaihd.net/typeahead/suggest/?solrformat=true&rows=20&callback=noCB&q=*:*+AND+schoolid_s:1606&defType=edismax&qf=teacherfirstname_t^2000+teacherlastname_t^2000+teacherfullname_t^2000+autosuggest&bf=pow(total_number_of_ratings_i,2.1)&sort=total_number_of_ratings_i+desc&siteName=rmp&rows=20&start=0&fl=pk_id+teacherfirstname_t+teacherlastname_t+total_number_of_ratings_i+averageratingscore_rf+schoolid_s&fq=&prefix=schoolname_t:\"University+of+Utah\""       

js_response = requests.get(js_request).text
print(js_response)


noCB({
  "responseHeader":{
    "status":0,
    "QTime":10},
  "response":{"numFound":2584,"start":0,"docs":[
      {
        "averageratingscore_rf":3.5,
        "pk_id":261261,
        "total_number_of_ratings_i":174,
        "schoolid_s":"1606",
        "teacherfirstname_t":"David",
        "teacherlastname_t":"Temme"},
      {
        "averageratingscore_rf":4.2,
        "pk_id":335313,
        "total_number_of_ratings_i":169,
        "schoolid_s":"1606",
        "teacherfirstname_t":"Alan",
        "teacherlastname_t":"Sandomir"},
      {
        "averageratingscore_rf":3.8,
        "pk_id":566286,
        "total_number_of_ratings_i":150,
        "schoolid_s":"1606",
        "teacherfirstname_t":"Renee",
        "teacherlastname_t":"Dawson"},
      {
        "averageratingscore_rf":4.3,
        "pk_id":261621,
        "total_number_of_ratings_i":134,
        "schoolid_s":"1606",
        "teacherfirstname_t":"Matthew",
        "teacherlastname_t":"Linton"},
      {
        "average

After some digging around on the site, I discovered that the list items containing the professors are actually generated by JavaScript and are not in the HTML source code.

The "pk_id" value is used to get the individual page of each professor, so it will be useful to grab that value and request the professors' actual pages to get more of the data I'm looking for.

The site also seems to load only 20 professors at a time, making the user press a 'load more' button to get the next batch of 20.
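The response printed above is JSONP: a JSON object wrapped in a `noCB(...)` callback. Here is a minimal sketch of unwrapping it and of computing the `start` offsets needed to page through every result 20 at a time. The helper names `strip_jsonp` and `page_starts` are my own for illustration, and the sample payload is a shortened, hand-written stand-in for the real response.

```python
import json
import math
import re


def strip_jsonp(text, callback="noCB"):
    """Return the JSON payload wrapped in callback(...), parsed into Python objects."""
    # Match callback( ... ) with an optional trailing semicolon.
    # DOTALL lets '.' span the newlines inside the payload.
    pattern = r"^\s*" + re.escape(callback) + r"\((.*)\)\s*;?\s*$"
    match = re.match(pattern, text, flags=re.DOTALL)
    if match is None:
        raise ValueError("response is not wrapped in " + callback + "(...)")
    return json.loads(match.group(1))


def page_starts(total, page_size=20):
    """Offsets to pass as the 'start' parameter so every result is fetched once."""
    return [page * page_size for page in range(math.ceil(total / page_size))]


# Shortened sample in the same shape as the response shown above (not the full data).
sample = 'noCB({"response": {"numFound": 2584, "start": 0, "docs": [{"pk_id": 261261}]}});'
payload = strip_jsonp(sample)
starts = page_starts(payload["response"]["numFound"])
print(len(starts), starts[:3], starts[-1])
```

Anchoring the pattern to the whole string, rather than deleting `);` wherever it occurs, avoids damaging a payload that happens to contain those characters inside a value.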

In [2]:
import re
import json
import time
import math

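# Strip the JSONP wrapper noCB( ... ); so the payload parses as plain JSON.
# Raw strings keep the regex escapes from being read as string escapes.
clean = re.sub(r'noCB\(', '', js_response)
clean = re.sub(r'\);', '', clean)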

json_response = json.loads(clean)['response']

total_prof = json_response['numFound']
num_req_needed = math.ceil(total_prof/20)

pk_ids = list()

for i in range(0,num_req_needed):
    next_start = str(i*20)
    js_request = "https://search-a.akamaihd.net/typeahead/suggest/?solrformat=true&rows=20&callback=noCB&q=*:*+AND+schoolid_s:1606&defType=edismax&qf=teacherfirstname_t^2000+teacherlastname_t^2000+teacherfullname_t^2000+autosuggest&bf=pow(total_number_of_ratings_i,2.1)&sort=total_number_of_ratings_i+desc&siteName=rmp&rows=20&start=" + next_start + "&fl=pk_id+teacherfirstname_t+teacherlastname_t+total_number_of_ratings_i+averageratingscore_rf+schoolid_s&fq=&prefix=schoolname_t:\"University+of+Utah\""    
    js_response = requests.get(js_request).text
    
    clean = re.sub(r'noCB\(', '', js_response)
    clean = re.sub(r'\);', '', clean)
    
    json_response = json.loads(clean)['response']['docs']
    
    for item in json_response:
        pk_ids.append(item['pk_id'])
        
print(len(pk_ids))

2584


In [3]:
print(pk_ids[0:10])

[261261, 335313, 566286, 261621, 261242, 261467, 470792, 673272, 221339, 651891]


Got them all!  

Now to scrape some data off of one professor.

In [114]:
from bs4 import BeautifulSoup
import re

def get_prof_data(pk_id):
    url = "https://www.ratemyprofessors.com/ShowRatings.jsp?tid=" + str(pk_id)

    response = requests.get(url).content
    soup = BeautifulSoup(response, 'html.parser')

    rating_breakdown = soup(class_="rating-breakdown")
    
    overall_score = soup.find(class_="breakdown-container quality").div.div.text.strip()
    take_again = soup.find(class_=re.compile("breakdown-section takeAgain")).div.text.strip()
    difficulty = soup.find(class_="breakdown-section difficulty").div.text.strip()
    hot_string = soup.find('figure').img['src']
    department = soup.find(class_='result-title').text.strip().split(' ')[3]
    ratings = soup.find(class_="table-toggle rating-count active").text.strip().split(' ')[0]
    
    if 'hot' in hot_string:
        hot = 1
    else:
        hot = 0
            
    most_common_tag = ""
    max_tag_count = 0
    tags = soup.find(class_="tag-box")

    for child in tags.findChildren():
        split_child = child.text.split(" ")
        tag_count_str = split_child[-1]
        # The last token holds the count in parentheses; strip them before converting.
        tag_count = re.sub(r'\(', '', tag_count_str)
        tag_count = int(re.sub(r'\)', '', tag_count))
    
        if tag_count > max_tag_count:
            max_tag_count = tag_count
            most_common_tag = re.sub('[^A-Za-z ]+', '', child.text).strip()
        

    return department, overall_score, take_again, difficulty, hot, most_common_tag, ratings

In [115]:
department, overall_score, take_again, difficulty, hot, most_common_tag, ratings = get_prof_data(261261)
print("Department ", department, " Score ", overall_score, " Take again ", take_again, " Difficulty ", difficulty, " Hot", hot, " Most Common Tag ", most_common_tag)
print("Number of Ratings: ", ratings)

Department  Biology  Score  3.5  Take again  63%  Difficulty  3.8  Hot 0  Most Common Tag  Tough Grader
Number of Ratings:  174
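The tag parsing inside `get_prof_data` can also be written with one regular expression. This is only a sketch: it assumes each tag's text is a label followed by a count in parentheses, which is what the splitting code above implies, and the name `most_common_tag` for a standalone function and the sample strings are made up for illustration. Unlike the original, it keeps punctuation in the label.

```python
import re

# Assumed tag format: a label followed by a count in parentheses, e.g. "Tough Grader (25)".
TAG_PATTERN = re.compile(r"^\s*(?P<label>.*?)\s*\((?P<count>\d+)\)\s*$")


def most_common_tag(tag_texts):
    """Return (label, count) for the tag with the highest count, or ("", 0) if none parse."""
    best_label, best_count = "", 0
    for text in tag_texts:
        match = TAG_PATTERN.match(text)
        if match is None:
            continue  # skip anything that does not look like "label (count)"
        count = int(match.group("count"))
        if count > best_count:
            best_label, best_count = match.group("label"), count
    return best_label, best_count


# Made-up tag strings for illustration; the counts are not real data.
example = ["Tough Grader (25)", "Get ready to read (12)", "Hilarious (3)"]
print(most_common_tag(example))
```

Skipping strings that do not match means one oddly formatted tag no longer raises an exception and discards the whole professor.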


In addition to those basic scores, each commenter can specify what grade they got in the class.

The way it's loaded into the web page is a bit strange: the first 20 are in the HTML source code, and the rest are populated by JavaScript.

The values will also take a bit of cleaning.

In [116]:
import numpy as np

def avg_grade_float(pk_id):
    avg_grade = 0.0
    grades_list = list()
    
    first_req = "https://www.ratemyprofessors.com/ShowRatings.jsp?tid=" + str(pk_id)
    first_res = requests.get(first_req).content
    soup = BeautifulSoup(first_res, 'html.parser')
    
    responses = soup.find_all('span', class_="grade")
    for item in responses:
        grades_list.append(item.span.text)
        
    first_js_req = "https://www.ratemyprofessors.com/paginate/professors/ratings?tid="+str(pk_id)+"&page=2"
    first_json = json.loads(requests.get(first_js_req).text)
    
    for obj in first_json['ratings']:
        grades_list.append(obj['teacherGrade'])
    
    num_req_needed = math.ceil(first_json['remaining']/20)
    for i in range(0,num_req_needed):
        next_page = i + 3
        next_req = "https://www.ratemyprofessors.com/paginate/professors/ratings?tid="+str(pk_id)+"&page=" + str(next_page)
        next_json = json.loads(requests.get(next_req).text)
        
        for obj in next_json['ratings']:
            grades_list.append(obj['teacherGrade'])
    
    clean_grades_list = [i for i in grades_list if i != 'Not sure yet' and i != 'INC' and i != 'N/A' ]
    
    grade_dict = {
        'A+' :4.0,
        'A' : 4.0,
        'A-' : 3.7,
        'B+' : 3.3,
        'B' : 3.0,
        'B-' : 2.7,
        'C+' : 2.3,
        'C' : 2.0,
        'C-' : 1.7,
        'D+' : 1.3,
        'D' : 1.0,
        'D-' : .7,
        'F' : 0,
        'P' : 3.0
    }
    
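    # Skip any label missing from grade_dict so np.mean never receives None.
    number_grades = [grade_dict[g] for g in clean_grades_list if g in grade_dict]
    return np.mean(number_grades) if number_grades else float('nan')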
    

In [117]:
print(avg_grade_float(261261))

3.50571428571


Perfect.  Now time to get data from all the professors at the university.  

Converting to a pandas DataFrame will make it much easier to plot and compute statistics on the dataset.

In [119]:
import pandas as pd

# Fields collected per professor: department, overall_score, take_again,
# difficulty, hot, most_common_tag, ratings

df_list = list()

for pk_id in pk_ids:
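    try:
        department, overall_score, take_again, difficulty, hot, most_common_tag, ratings = get_prof_data(pk_id)
    except Exception:
        # Scraping failed (e.g. the page lacks an expected element): record missing
        # values instead of silently reusing the previous professor's data.
        department = overall_score = take_again = difficulty = hot = most_common_tag = ratings = None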
    #grade = avg_grade_float(pk_id)
    d = {
        'pk_id' : pk_id,
        'Department' : department,
        'Score' : overall_score,
        'Take_Again' : take_again,
        'Difficulty' : difficulty,
        'Hot' : hot,
        'Tag' : most_common_tag,
        'Number_Ratings' : ratings
        #'Grade' : grade
    }
    df_list.append(d)
    if len(df_list)%500 == 0:
        print(len(df_list))

500
1000
1500
2000
2500


In [122]:
u_of_u_data = pd.DataFrame(df_list)
u_of_u_data.head(10)

|   | Department | Difficulty | Hot | Number_Ratings | Score | Tag | Take_Again | pk_id |
|---|---|---|---|---|---|---|---|---|
| 0 | Biology | 3.8 | 0 | 174 | 3.5 | Tough Grader | 63% | 261261 |
| 1 | Business | 3.8 | 1 | 169 | 4.2 | Get ready to read | 77% | 335313 |
| 2 | Biology | 3.8 | 0 | 150 | 3.8 | Tough Grader | 65% | 566286 |
| 3 | Biology | 3.8 | 1 | 134 | 4.3 | Skip class You wont pass | 100% | 261621 |
| 4 | Music | 2.4 | 0 | 92 | 3.8 | Hilarious |  | 261242 |
| 5 | Chemistry | 3.9 | 1 | 91 | 4.3 | Caring |  | 261467 |
| 6 | Accounting | 3.7 | 0 | 86 | 3.4 | Skip class You wont pass | 71% | 470792 |
| 7 | Biology | 4.3 | 0 | 85 | 4.5 | Skip class You wont pass | 90% | 673272 |
| 8 | Mathematics | 3.1 | 1 | 78 | 4.2 | Skip class You wont pass | 67% | 221339 |
| 9 | Accounting | 3.7 | 0 | 75 | 3.9 | Respected | 89% | 651891 |


In [126]:
u_of_u_data.to_csv("Utah_Professor_Data.csv")

I decided not to calculate the average grade for each professor after all, since it drastically increased the number of HTTP requests needed, and most students don't submit what grade they got anyway.
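To put a rough number on that cost, here is a back-of-the-envelope sketch. It assumes `avg_grade_float` behaves as written above (one request for the HTML page, one for page 2, then one per further page of 20 ratings) and that `remaining` counts the ratings left after page 2. The function name `grade_requests` is hypothetical, and the rating counts are the ten shown in the table above.

```python
import math


def grade_requests(num_ratings, page_size=20):
    """Estimate the HTTP requests avg_grade_float makes for one professor."""
    # One request for the HTML page, one for page 2 (always issued),
    # then one per further page of ratings.
    return max(2, math.ceil(num_ratings / page_size))


# Rating counts for the ten most-rated professors, taken from the table above.
top_ten_counts = [174, 169, 150, 134, 92, 91, 86, 85, 78, 75]
extra = sum(grade_requests(n) for n in top_ten_counts)
print(extra, "requests for grades vs", len(top_ten_counts), "for the basic data")
```

Most professors have far fewer ratings than these ten, but each would still cost at least two additional requests on top of the one made by `get_prof_data`.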

In the Jupyter notebook for part 2, I'll get into the actual analysis and visualization.