<a href="https://colab.research.google.com/github/drewscottt/UArizona-RMP/blob/main/UArizona_RMP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is my final project for ISTA322. In it, I extract information about professors at the University of Arizona from two sources: a salary sheet and the RateMyProfessor API. Then, I transform the data by keeping only the useful fields, using libraries to infer demographic information about the professors, and aggregating the data into datasets for each department and course at the UofA. Finally, I load all of this information into a database. I include example queries for each of the tables loaded into the DB, showing what sorts of questions can be asked about the data.

[My original plan](https://docs.google.com/document/d/1RWUIrtQrgzBvJRRMYXLDIr0429hGngkLyFeovoPUHlE/edit?usp=sharing)

##Installations
Running this cell automatically restarts the runtime (otherwise errors occur later in the project). Colab will show a message saying the session crashed, but that's expected.

In [None]:
!pip install -U -q jsonmerge
!pip install gender-guesser
!pip install ethnicolr
!pip3 install mysql-connector-python

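# restart the runtime so the newly installed package versions are picked up
# (most likely needed because pip replaced packages Colab had already loaded)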
import os
os.kill(os.getpid(), 9)




##Import modules

In [None]:
# used to read from Google Drive 
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# used to request from APIs and manage their json responses
import urllib.request
import json
from jsonmerge import merge, Merger

import pandas as pd
import numpy as np
import math

# used to infer demographic information about professors
import gender_guesser.detector as gender
from ethnicolr import census_ln


##Google Drive Authentication


In [None]:
# Authenticate access to Google Drive
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

#Extract
Data sources:
* [UArizona Employee Salaries (2019-20)](https://docs.google.com/spreadsheets/d/e/2PACX-1vTaAWak0pN6Jnulm95eTM7kIubvNNMPgYh3d6sCHN5W1tekpIktoMBoDKJeZhmAyI7ZzH1BAytEp_bV/pubhtml?lsrp=1) (Google Sheet), [My Version](https://drive.google.com/file/d/1NrJKzMENRlp8Y-G4ATDq733imXs61b70/view?usp=sharing) (CSV)
* [RateMyProfessor UArizona Professors](https://www.ratemyprofessors.com/filter/professor/?&page=1&filter=teacherlastname_sort_s+asc&query=*%3A*&queryoption=TEACHER&queryBy=schoolId&sid=1402) (JSON), [All pages merged to one](https://drive.google.com/file/d/1f8A51h7dCGqdNe4sZKekCi4aZOht23-n/view?usp=sharing) (JSON)
* [RateMyProfessor Reviews](https://www.ratemyprofessors.com/paginate/professors/ratings?tid=180399) (JSON) [All merged to one](https://drive.google.com/file/d/13QDkDkIZDkmzEon0eZYJ3QLVWlrDPtoQ/view?usp=sharing) (JSON)

The goals of this section are:

1. Extract the information from UArizona Salaries Sheet and load it into a csv file on my Google Drive, then load the Google Drive csv into a DataFrame
2. Extract the Professors from the RMP Professors API and load it into a json file on my Google Drive, then load it into a DataFrame
3. Find the Professors who exist in both datasets, and join these sets together into one DataFrame
4. Extract the reviews for each Professor in the merged DF from the RMP API, and load it into a json file on my Google Drive, then load it into a DataFrame

End Result:
* Have 2 DataFrames: merged_df and reviews_df
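
Goal 2 comes down to paging through the API and concatenating each page's list of professors. Here is a minimal sketch of that idea on made-up pages, with no network access. `pages_needed` and `merge_pages` are hypothetical helper names, and the page layout (a `searchResultsTotal` count plus a `professors` list) is assumed from the fields used in the cells below. The notebook itself uses the `jsonmerge` module for the merge step.

```python
import math


def pages_needed(total_results, page_size=20):
    """Number of pages required to list total_results items."""
    return math.ceil(total_results / page_size)


def merge_pages(pages, list_key):
    """Combine page dicts: extend the list under list_key, overwrite everything else."""
    merged = {}
    for page in pages:
        for key, value in page.items():
            if key == list_key:
                merged.setdefault(key, []).extend(value)
            else:
                merged[key] = value
    return merged


# made-up pages standing in for API responses
fake_pages = [
    {"searchResultsTotal": 3, "professors": [{"tid": 1}, {"tid": 2}]},
    {"searchResultsTotal": 3, "professors": [{"tid": 3}]},
]
merged = merge_pages(fake_pages, "professors")
print(len(merged["professors"]), "professors across", pages_needed(3, page_size=2), "pages")
```

Extending one list in a loop avoids needing a merge schema, at the cost of only handling this one flat layout.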

In [None]:
# Extract Employees from UArizona Salaries Sheet

# get the salary file from Google Drive, name it salaries.csv
salaries_url = 'https://drive.google.com/file/d/1NrJKzMENRlp8Y-G4ATDq733imXs61b70/view?usp=sharing'
salaries_csv_drive = drive.CreateFile({'id':'1NrJKzMENRlp8Y-G4ATDq733imXs61b70'})
salaries_csv_drive.GetContentFile('salaries.csv')

# read salaries.csv into a df
salaries_df = pd.read_csv('salaries.csv')

# we need to transform the name column to FirstName and LastName so we can use it to get professor reviews from RMP

# split the Name column into FirstName and LastName
salaries_df[['LastName', 'FirstName']] = salaries_df['Name'].str.split(',', expand=True)

# remove middle names/initials from FirstName column
salaries_df['FirstName'] = salaries_df['FirstName'].str.split(expand=True)[0]

salaries_df

Unnamed: 0,Name,Primary Title,Annual at Actual FTE,FTE,Annual at Full FTE,State Fund Ratio,College Location,College Name,Department,LastName,FirstName
0,"Miller,Sean E","Head Coach, Men's Basketball",2400000,1.00,2400000,0.00,Main Campus,Intercollegiate Athletics Div,Administration and Athletics,Miller,Sean
1,"Nakaji,Peter","Chair, Neurosurgery - BUMCP",1550000,1.00,1550000,0.00,Arizona Health Sciences,College of Medicine - Phoenix,COM Phx Neurosurgery,Nakaji,Peter
2,"Sumlin,Kevin","Head Coach, Football",1100000,1.00,1100000,0.00,Main Campus,Intercollegiate Athletics Div,Administration and Athletics,Sumlin,Kevin
3,"Robbins,Robert",President of the University,898625,1.00,898625,0.75,Main Campus,Exec Ofc of the President Div,Executive Ofc of the President,Robbins,Robert
4,"Dake,Michael","Senior Vice President, Health Sciences",875000,1.00,875000,0.75,Arizona Health Sciences,Az Health Sciences Division,Senior VP Health Sciences,Dake,Michael
...,...,...,...,...,...,...,...,...,...,...,...
12885,"Mast,Ian MacArthur",Accompanist,1572,0.05,31440,0.00,Main Campus,College of Fine Arts,School of Dance,Mast,Ian
12886,"Andreacola,Darla Rachelle Kuhn",Sales Assistant,1383,0.06,23050,0.00,Main Campus,UA Assoc Students Bookstore,The UofA BookStores,Andreacola,Darla
12887,"Bee,Victoria G",Sales Assistant,1383,0.06,23050,0.00,Main Campus,UA Assoc Students Bookstore,The UofA BookStores,Bee,Victoria
12888,"Nett,Mary A",Sales Assistant,1153,0.05,23060,0.00,Main Campus,UA Assoc Students Bookstore,The UofA BookStores,Nett,Mary
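
The name handling above can be sketched in plain Python to make the rule explicit: everything before the first comma is the last name, and the first whitespace-separated token after it is the first name. `split_name` is a hypothetical helper written for illustration, not part of the notebook. It also tolerates names without a comma, which is an assumption about how such rows should be treated.

```python
def split_name(full_name):
    """Split 'Last,First Middle' into (last, first), dropping middle names/initials."""
    last, _, rest = full_name.partition(",")
    given = rest.split()
    first = given[0] if given else ""
    return last.strip(), first


for name in ["Miller,Sean E", "Andreacola,Darla Rachelle Kuhn", "Zuniga Teran,Adriana Alejandra"]:
    print(split_name(name))
```

Splitting on the first comma only keeps multi-word last names such as "Zuniga Teran" intact.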


In [None]:
# Extract the Professors from the RMP Professors API
# *** IMPORTANT: This cell shouldn't be run again because the results of this cell
#                 have been cached in a Google Drive file. Rerunning will result in
#                 unnecessary network traffic.
'''
rmp_professor_base = 'https://www.ratemyprofessors.com/filter/professor/?'
rmp_parameters = '&filter=teacherlastname_sort_s+asc&query=*%3A*&queryoption=TEACHER&queryBy=schoolId&sid=1402'

rmp_url = rmp_professor_base + 'page=1' + rmp_parameters
# get the total number of UArizona Professors on RMP
with urllib.request.urlopen(rmp_url) as url:
    data = json.loads(url.read().decode())
    count_profs = data['searchResultsTotal']

# there are 20 professors listed on each page
pages_needed = math.ceil(count_profs / 20)

# get all of the professors from each page, save them in rmp_professors list
rmp_professors = []
for page_num in range(1, pages_needed + 1):
  rmp_url = rmp_professor_base + f'page={page_num}' + rmp_parameters
  with urllib.request.urlopen(rmp_url) as url:
    data = json.loads(url.read().decode())
    rmp_professors.append(data)

# now, merge each of these json pages into one json file object using the jsonmerge module

# this schema specifies to append each professor list to the already merged file
# instead of overwriting the previous one (which is the default)
schema = {
          "properties": {
              "professors": {
                  "mergeStrategy": "append"
              }             
          }
         }

# merge all of the json objects together, via a chaining method
merger = Merger(schema)
merged_professors = merger.merge(rmp_professors[0], rmp_professors[1])
for i in range(2, len(rmp_professors)):
  merged_professors = merger.merge(merged_professors, rmp_professors[i])

# dump the merged_professors json object into rmp_professors.json file
with open("rmp_professors.json", "w") as rmp_prof_json:
     json.dump(merged_professors, rmp_prof_json)

# I manually saved this dumped file to my Google Drive for future use
'''


In [None]:
# Continue extracting the RMP Professors, now that we have the json files all merged into one

# load the json file from Google Drive
professors_url = 'https://drive.google.com/file/d/1f8A51h7dCGqdNe4sZKekCi4aZOht23-n/view?usp=sharing'
professors_json_drive = drive.CreateFile({'id':'1f8A51h7dCGqdNe4sZKekCi4aZOht23-n'})
professors_json_drive.GetContentFile('rmp_professors.json')
with open('rmp_professors.json', 'r') as rmp_json:
  rmp_professors = json.load(rmp_json)['professors']

# convert json to df
rmp_professors_df = pd.json_normalize(rmp_professors)

# rename first and last name columns so we can easily merge it with salaries_df

rmp_professors_df.rename(columns={'tFname' : 'FirstName', 'tLname' : 'LastName'}, inplace=True)

rmp_professors_df

Unnamed: 0,tDept,tSid,institution_name,FirstName,tMiddlename,LastName,tid,tNumRatings,rating_class,contentType,categoryType,overall_rating
0,Law,1402,University of Arizona,Harry,,Aaron,1894503,5,average,TEACHER,PROFESSOR,3.2
1,Literature,1402,University of Arizona,Yuxuf,,Abana,523510,116,poor,TEACHER,PROFESSOR,2.1
2,Architecture,1402,University of Arizona,Dominick,,Abbott,2692250,1,poor,TEACHER,PROFESSOR,2.0
3,Engineering,1402,University of Arizona,Ahmed,,Abdelmawgoud,2679238,3,poor,TEACHER,PROFESSOR,1.3
4,Mathematics,1402,University of Arizona,Houssam,,Abdul-Rahman,2486872,2,good,TEACHER,PROFESSOR,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...
5283,Architecture,1402,University of Arizona,Adriana,,Zuniga Teran,2569096,0,zero,TEACHER,PROFESSOR,
5284,Business,1402,University of Arizona,Na,,Zuo,2359051,8,average,TEACHER,PROFESSOR,3.3
5285,English,1402,University of Arizona,Arianne,,Zwartjes,752837,12,good,TEACHER,PROFESSOR,4.2
5286,English,1402,University of Arizona,Lynda,,Zwinger,530616,12,good,TEACHER,PROFESSOR,4.3


In [None]:
# Merge rmp_professors_df with salaries_df on FirstName and LastName
merged_df = pd.merge(rmp_professors_df, salaries_df, how='inner', on=['FirstName', 'LastName'])

merged_df

Unnamed: 0,tDept,tSid,institution_name,FirstName,tMiddlename,LastName,tid,tNumRatings,rating_class,contentType,categoryType,overall_rating,Name,Primary Title,Annual at Actual FTE,FTE,Annual at Full FTE,State Fund Ratio,College Location,College Name,Department
0,Literature,1402,University of Arizona,Yuxuf,,Abana,523510,116,poor,TEACHER,PROFESSOR,2.1,"Abana,Yuxuf A","Lecturer, Africana Studies",56500,1.0,56500,0.800,Main Campus,College of Humanities,Africana Studies
1,English,1402,University of Arizona,Matthew,,Abraham,1868164,12,good,TEACHER,PROFESSOR,3.5,"Abraham,Matthew","Professor, English",88377,1.0,88377,1.000,Main Campus,College of Social & Behav Sci,English
2,Sociology,1402,University of Arizona,Corey,,Abramson,2685728,1,good,TEACHER,PROFESSOR,4.0,"Abramson,Corey","Associate Professor, Sociology",88777,1.0,88777,1.000,Main Campus,College of Social & Behav Sci,Sociology
3,Economics,1402,University of Arizona,Charity-Joy,,Acchiardo,2020486,41,good,TEACHER,PROFESSOR,3.8,"Acchiardo,Charity-Joy Revere","Lecturer, Economics",52800,0.6,88000,1.000,Main Campus,Eller College of Management,Economics
4,Spanish & Portuguese,1402,University of Arizona,Abraham,,Acosta,1108598,15,good,TEACHER,PROFESSOR,3.5,"Acosta,Abraham I","Associate Professor, Spanish and Portuguese",77733,1.0,77733,1.000,Main Campus,College of Humanities,Spanish and Portuguese
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1683,Engineering,1402,University of Arizona,Yitshak,,Zohar,1127140,22,poor,TEACHER,PROFESSOR,2.0,"Zohar,Yitshak","Professor, Aerospace-Mechanical Engineering",114183,1.0,114183,1.000,Main Campus,College of Engineering,Aerospace & Mechanical Engr
1684,Natural Science,1402,University of Arizona,Marek,,Zreda,2154159,0,zero,TEACHER,PROFESSOR,,"Zreda,Marek G","Professor, Hydrology / Atmospheric Sciences",120240,1.0,120240,1.000,Main Campus,College of Science,Hydrology & Atmospheric Sci
1685,Architecture,1402,University of Arizona,Adriana,,Zuniga Teran,2569096,0,zero,TEACHER,PROFESSOR,,"Zuniga Teran,Adriana Alejandra",Assistant Research Scientist,70000,1.0,70000,0.150,Main Campus,Col Arch Plan & Landscape Arch,Planning Degree Program
1686,Business,1402,University of Arizona,Na,,Zuo,2359051,8,average,TEACHER,PROFESSOR,3.3,"Zuo,Na",Assistant Professor of Practice,95000,1.0,95000,0.789,Main Campus,College of Agric and Life Sci,Agric & Resource Econ-Ins
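
Because the join key is just first and last name, it is worth checking how many professors fail to match and whether any name matches more than one salary row. Below is a rough sketch on made-up rows (the names, ids, and departments are invented for illustration), using the `indicator` option of `DataFrame.merge`. The variable names are my own and do not appear elsewhere in the notebook.

```python
import pandas as pd

# toy stand-ins for rmp_professors_df and salaries_df
rmp = pd.DataFrame({
    "FirstName": ["Na", "Ana", "John"],
    "LastName": ["Zuo", "Lopez", "Smith"],
    "tid": [1, 2, 3],
})
salaries = pd.DataFrame({
    "FirstName": ["Na", "John", "John"],
    "LastName": ["Zuo", "Smith", "Smith"],
    "Department": ["DeptA", "DeptB", "DeptC"],
})

# a left join with indicator=True adds a _merge column: 'both' or 'left_only'
joined = rmp.merge(salaries, how="left", on=["FirstName", "LastName"], indicator=True)

unmatched = joined[joined["_merge"] == "left_only"]  # on RMP but not in the salary sheet
matched = joined[joined["_merge"] == "both"]
# a tid that appears more than once matched several salary rows (shared name)
ambiguous = matched[matched.duplicated("tid", keep=False)]

print(len(unmatched), "unmatched;", ambiguous["tid"].nunique(), "ambiguous")
```

An inner join silently drops the unmatched rows and keeps the ambiguous ones, so counts like these give a quick sense of how much the name-based key can be trusted.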


In [None]:
# Now that we have the Professors who were found in both datasets, we can extract our last data set: reviews
# *** IMPORTANT: This cell shouldn't be run again because the results of this cell
#                 have been cached in a Google Drive file. Rerunning will result in
#                 unnecessary network traffic (took me about 4m 30s to send all of these requests)

'''
# get the first 20 reviews for each of the merged professors by using the RMP Reviews API
reviews_base_url = 'https://www.ratemyprofessors.com/paginate/professors/ratings?tid='
reviews = []
for i, tid in enumerate(merged_df['tid']):
  review_url = reviews_base_url + str(tid)
  with urllib.request.urlopen(review_url) as url:
    prof_reviews = json.loads(url.read().decode())

    # the retrieved reviews don't include any information about the professor, so add the tid to all
    for review in prof_reviews['ratings']:
      review['tid'] = tid

    reviews.append(prof_reviews)

# this schema specifies to append each rating list to the already merged file
# instead of overwriting the previous one (which is the default)
schema = {
          "properties": {
              "ratings": {
                  "mergeStrategy": "append"
              }             
          }
         }

# merge all of the json objects together, via a chaining method
merger = Merger(schema)
merged_reviews = merger.merge(reviews[0], reviews[1])
for i in range(2, len(reviews)):
  merged_reviews = merger.merge(merged_reviews, reviews[i])

# dump the merged_reviews json object into rmp_reviews.json file
with open("rmp_reviews.json", "w") as rmp_reviews_json:
     json.dump(merged_reviews, rmp_reviews_json)

# I manually uploaded rmp_reviews.json to my Google Drive, so it can be used in the future
'''


In [None]:
# get the reviews file from Google Drive, name it rmp_reviews.json
reviews_url = 'https://drive.google.com/file/d/13QDkDkIZDkmzEon0eZYJ3QLVWlrDPtoQ/view?usp=sharing'
reviews_json_drive = drive.CreateFile({'id':'13QDkDkIZDkmzEon0eZYJ3QLVWlrDPtoQ'})
reviews_json_drive.GetContentFile('rmp_reviews.json')

with open('rmp_reviews.json', 'r') as rmp_reviews_json:
  reviews_json = json.load(rmp_reviews_json)['ratings']

reviews_df = pd.json_normalize(reviews_json)

reviews_df

Unnamed: 0,attendance,clarityColor,easyColor,helpColor,helpCount,id,notHelpCount,onlineClass,quality,rClarity,rClass,rComments,rDate,rEasy,rEasyString,rErrorMsg,rHelpful,rInterest,rOverall,rOverallString,rStatus,rTextBookUse,rTimestamp,rWouldTakeAgain,sId,takenForCredit,teacher,teacherGrade,teacherRatingTags,unUsefulGrouping,usefulGrouping,tid
0,,average,good,average,0,35221583,0,online,poor,2,HUMS376,I took this course 7 week online. It was wayyy...,11/10/2021,4.0,4.0,,2,,2.0,2.0,1,Yes,1636583242000,No,1402,Yes,,A,"[Get ready to read, Tough grader, So many papers]",people,people,523510
1,Not Mandatory,poor,good,poor,0,35087586,0,,awful,1,AFAS306,I would recommend avoiding this professor all ...,09/24/2021,5.0,5.0,,1,,1.0,1.0,1,Yes,1632512201000,No,1402,Yes,,WD,"[Graded by few things, Get ready to read, Toug...",people,people,523510
2,Not Mandatory,poor,good,poor,0,34786511,0,online,awful,1,AFAS303,"He is not accessible, won't give you feedback,...",05/13/2021,4.0,4.0,,1,,1.0,1.0,1,Yes,1620933549000,No,1402,Yes,,C,"[Tough grader, Lots of homework, Get ready to ...",people,people,523510
3,Not Mandatory,poor,good,poor,0,34777281,0,online,awful,1,AFASENGL303,I am an English major in several high level En...,05/12/2021,4.0,4.0,,1,,1.0,1.0,1,No,1620849716000,No,1402,Yes,,B-,"[Tough grader, So many papers, Graded by few t...",people,people,523510
4,Not Mandatory,average,good,average,0,34593307,0,,average,3,GWS312,Im a pretty good writer and I often get very h...,04/21/2021,4.0,4.0,,3,,3.0,3.0,1,No,1619033788000,No,1402,Yes,,B,"[Graded by few things, Tough grader, Get ready...",people,people,523510
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14838,,good,good,good,0,14918302,0,,awesome,5,ENGL300,I had Dr. Zwinger for a lit. of film presessio...,10/30/2008,4.0,4.0,,4,Really into it,4.5,4.5,1,No,1225338608000,,1402,,,,[],people,people,530616
14839,,good,good,good,0,14857211,0,,good,4,ENGL483,prof zwinger is a fantastic teacher. classes ...,10/08/2008,4.0,4.0,,4,Really into it,4.0,4.0,1,Yes,1223482431000,,1402,,,,[],people,people,530616
14840,,good,good,good,1,14439152,0,,awesome,5,ENG484,"Dr. Z is fantastic. She's always interesting,...",05/09/2008,4.0,4.0,,4,It's my life,4.5,4.5,1,Yes,1210353125000,,1402,,,,[],people,person,530616
14841,,average,average,average,1,14372588,0,,average,3,ENGL486,Zwinger is fun. She is always joking with stu...,04/30/2008,2.0,2.0,,3,Sorta interested,3.0,3.0,1,Yes,1209524978000,,1402,,,,[],people,person,530616


#Transform
Now we have all of the data loaded into two DataFrames: merged_df and reviews_df

We already did some transforming of the professor data, which made it easier to load the reviews data.

The goals of this section are:

1. Extract the useful columns and rename columns from reviews_df
2. Do the same for merged_df
3. Infer the genders and ethnicities for each professor
4. Create a department DataFrame by aggregating information from the merged_df
5. Create a course DataFrame by aggregating information from the reviews_df
6. Get rid of numpy types

End result:
* Have 4 DataFrames: merged_df, reviews_df, department_df and courses_df

In [None]:
# make copies of the dfs
reviews_copy = reviews_df.copy()
merged_copy = merged_df.copy()

In [None]:
# Transform reviews_df by dropping and renaming certain columns
reviews_df = reviews_copy.copy()

# drop the columns that aren't useful/are redundant
unnec_cols = ['unUsefulGrouping', 'usefulGrouping', 'teacher', 'sId', 'rTimestamp', 'rStatus', 'rErrorMsg', 'attendance', 'onlineClass',
              'helpCount', 'notHelpCount', 'rTextBookUse', 'rWouldTakeAgain', 'takenForCredit', 'rInterest', 'rEasyString', 'rOverallString',
              'clarityColor', 'easyColor', 'helpColor', 'quality', 'teacherRatingTags']
reviews_df.drop(unnec_cols, axis='columns', inplace=True)

# rename the columns that are left to better names
reviews_df.rename(columns={'id':'ReviewID', 'rClarity':'ClarityRating', 'rClass':'CourseID', 'rComments':'Comment', 
                            'rDate':'ReviewDate', 'rEasy' : 'EasyRating', 'rHelpful':'HelpfulRating', 'rOverall':'OverallRating',
                            'teacherGrade':'CourseGrade', 'tid':'ProfID'}, inplace=True)

# convert CourseGrade to a numerical value
letter_num = {'A+':97,'A':93, 'A-':90, 'B+':87,'B':83, 'B-':80, 'C+':77,'C':73, 'C-':70, 'D+':67,'D':63, 'D-':60, 'F':50, 'E':50, 'WD':None, 'N/A':None,
              'Not sure yet':None, 'INC':None, 'Audit/No Grade':None, 'P':70}
# look up each grade with Series.map; grades that are missing from letter_num become NaN instead of raising a KeyError
reviews_df['CoursePercent'] = reviews_df['CourseGrade'].map(letter_num)

reviews_df

Unnamed: 0,ReviewID,ClarityRating,CourseID,Comment,ReviewDate,EasyRating,HelpfulRating,OverallRating,CourseGrade,ProfID,CoursePercent
0,35221583,2,HUMS376,I took this course 7 week online. It was wayyy...,11/10/2021,4.0,2,2.0,A,523510,93.0
1,35087586,1,AFAS306,I would recommend avoiding this professor all ...,09/24/2021,5.0,1,1.0,WD,523510,
2,34786511,1,AFAS303,"He is not accessible, won't give you feedback,...",05/13/2021,4.0,1,1.0,C,523510,73.0
3,34777281,1,AFASENGL303,I am an English major in several high level En...,05/12/2021,4.0,1,1.0,B-,523510,80.0
4,34593307,3,GWS312,Im a pretty good writer and I often get very h...,04/21/2021,4.0,3,3.0,B,523510,83.0
...,...,...,...,...,...,...,...,...,...,...,...
14838,14918302,5,ENGL300,I had Dr. Zwinger for a lit. of film presessio...,10/30/2008,4.0,4,4.5,,530616,
14839,14857211,4,ENGL483,prof zwinger is a fantastic teacher. classes ...,10/08/2008,4.0,4,4.0,,530616,
14840,14439152,5,ENG484,"Dr. Z is fantastic. She's always interesting,...",05/09/2008,4.0,4,4.5,,530616,
14841,14372588,3,ENGL486,Zwinger is fun. She is always joking with stu...,04/30/2008,2.0,3,3.0,,530616,
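
The grade conversion is just a dictionary lookup in which some grades deliberately map to nothing. Here is a small sketch of that lookup in plain Python, using a subset of the `letter_num` table from the cell above. `grade_to_percent` and `mean_percent` are hypothetical helpers for illustration; treating unlisted grades as missing is my assumption.

```python
# subset of the notebook's letter_num table
LETTER_NUM = {'A+': 97, 'A': 93, 'A-': 90, 'B+': 87, 'B': 83, 'B-': 80,
              'C': 73, 'F': 50, 'WD': None, 'N/A': None}


def grade_to_percent(grade):
    """Return the numeric value for a letter grade, or None if it has no numeric meaning."""
    return LETTER_NUM.get(grade)


def mean_percent(grades):
    """Average the grades that have a numeric value; None if there are none."""
    known = [p for p in map(grade_to_percent, grades) if p is not None]
    return sum(known) / len(known) if known else None


print(mean_percent(['A', 'B', 'WD']))
```

Skipping the `None` entries mirrors how the `'mean'` aggregation later ignores missing `CoursePercent` values.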


In [None]:
# Transform merged_df
merged_df = merged_copy.copy()

# drop unused or redundant columns in merged_df
unnec_cols = ['tSid', 'institution_name', 'contentType', 'categoryType', 'tMiddlename', 'Name', 'College Location', 
              'State Fund Ratio', 'FTE', 'College Name', 'rating_class', 'tDept']
merged_df.drop(unnec_cols, axis='columns', inplace=True)

# rename columns in merged_df
merged_df.rename(columns={'tid':'ProfID','tNumRatings':'NumRatings', 'overall_rating':'AverageRating',
                          'Primary Title':'Title',
                          'Annual at Actual FTE':'ActualSalary', 'Annual at Full FTE':'FTESalary'}, inplace=True)

# remove the department name from the end of each professor's Title
merged_df['Title'] = merged_df['Title'].str.split(',', expand=True)[0]

# convert to numeric columns
merged_df.loc[merged_df['AverageRating'] == 'N/A', 'AverageRating'] = 0
merged_df['ActualSalary'] = merged_df['ActualSalary'].str.replace(',', '')
merged_df['FTESalary'] = merged_df['FTESalary'].str.replace(',', '')
merged_df = merged_df.astype({'NumRatings':'int64', 'AverageRating':'float64', 'ActualSalary':'int64', 'FTESalary':'int64'})

merged_copy = merged_df.copy()

merged_df

Unnamed: 0,FirstName,LastName,ProfID,NumRatings,AverageRating,Title,ActualSalary,FTESalary,Department
0,Yuxuf,Abana,523510,116,2.1,Lecturer,56500,56500,Africana Studies
1,Matthew,Abraham,1868164,12,3.5,Professor,88377,88377,English
2,Corey,Abramson,2685728,1,4.0,Associate Professor,88777,88777,Sociology
3,Charity-Joy,Acchiardo,2020486,41,3.8,Lecturer,52800,88000,Economics
4,Abraham,Acosta,1108598,15,3.5,Associate Professor,77733,77733,Spanish and Portuguese
...,...,...,...,...,...,...,...,...,...
1683,Yitshak,Zohar,1127140,22,2.0,Professor,114183,114183,Aerospace & Mechanical Engr
1684,Marek,Zreda,2154159,0,0.0,Professor,120240,120240,Hydrology & Atmospheric Sci
1685,Adriana,Zuniga Teran,2569096,0,0.0,Assistant Research Scientist,70000,70000,Planning Degree Program
1686,Na,Zuo,2359051,8,3.3,Assistant Professor of Practice,95000,95000,Agric & Resource Econ-Ins


In [None]:
# infer the gender for each professor
gender_detector = gender.Detector()
merged_df['InferredGender'] = merged_df.apply(lambda row: gender_detector.get_gender(row.FirstName), axis='columns')
# collapse the 'mostly_*' estimates into male/female, and treat 'unknown' and 'andy' (androgynous) as missing (None)
merged_df.loc[merged_df.InferredGender == 'mostly_male', 'InferredGender'] = 'male'
merged_df.loc[merged_df.InferredGender == 'mostly_female', 'InferredGender'] = 'female'
merged_df.loc[merged_df.InferredGender == 'unknown', 'InferredGender'] = None
merged_df.loc[merged_df.InferredGender == 'andy', 'InferredGender'] = None

print(f"{merged_df['InferredGender'].isnull().sum()/len(merged_df.index) * 100:.2f}% of professors are missing their gender")

merged_df

10.72% of professors are missing their gender


Unnamed: 0,FirstName,LastName,ProfID,NumRatings,AverageRating,Title,ActualSalary,FTESalary,Department,InferredGender
0,Yuxuf,Abana,523510,116,2.1,Lecturer,56500,56500,Africana Studies,
1,Matthew,Abraham,1868164,12,3.5,Professor,88377,88377,English,male
2,Corey,Abramson,2685728,1,4.0,Associate Professor,88777,88777,Sociology,male
3,Charity-Joy,Acchiardo,2020486,41,3.8,Lecturer,52800,88000,Economics,
4,Abraham,Acosta,1108598,15,3.5,Associate Professor,77733,77733,Spanish and Portuguese,male
...,...,...,...,...,...,...,...,...,...,...
1683,Yitshak,Zohar,1127140,22,2.0,Professor,114183,114183,Aerospace & Mechanical Engr,
1684,Marek,Zreda,2154159,0,0.0,Professor,120240,120240,Hydrology & Atmospheric Sci,male
1685,Adriana,Zuniga Teran,2569096,0,0.0,Assistant Research Scientist,70000,70000,Planning Degree Program,female
1686,Na,Zuo,2359051,8,3.3,Assistant Professor of Practice,95000,95000,Agric & Resource Econ-Ins,
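
The label clean-up above can be written as a single lookup table. This is a sketch only: the labels are the ones handled in the cell above, the input list is made up, and `collapse_gender` and `percent_missing` are hypothetical helper names.

```python
# labels returned by the detector, collapsed to male / female / missing
COLLAPSE = {
    'male': 'male',
    'mostly_male': 'male',
    'female': 'female',
    'mostly_female': 'female',
    # 'unknown' and 'andy' are intentionally absent, so they collapse to None
}


def collapse_gender(label):
    """Map a detector label to 'male', 'female', or None."""
    return COLLAPSE.get(label)


def percent_missing(labels):
    """Percentage of labels that collapse to None."""
    collapsed = [collapse_gender(label) for label in labels]
    return 100 * collapsed.count(None) / len(collapsed)


example_labels = ['male', 'mostly_female', 'andy', 'unknown']
print(f"{percent_missing(example_labels):.2f}% missing")
```

Keeping the mapping in one dictionary makes it easy to see, and to change, how confident a guess has to be before it is kept.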


In [None]:
# infer ethnicity for each professor
ethnicities = ['pctwhite', 'pctblack', 'pctapi', 'pctaian', 'pct2prace', 'pcthispanic']

# put the percent of each ethnicity for each professor into merged_df
merged_df = census_ln(merged_df, 'LastName', 2010)

# convert each pct column to a str because NaN values aren't being replaced using normal means (these are object typed fields)
merged_df = merged_df.astype({'pctwhite':'str', 'pctblack':'str', 'pctapi':'str', 'pctaian':'str', 'pct2prace':'str', 'pcthispanic':'str'})

# convert each invalid column value to a 0
for eth in ethnicities:
  merged_df.loc[merged_df[eth] == 'nan', eth] = 0
  merged_df.loc[merged_df[eth] == '(S)', eth] = 0

# convert the pct columns to floats now that we have removed NaNs and (S)s
merged_df = merged_df.astype({'pctwhite':'float64', 'pctblack':'float64', 'pctapi':'float64', 'pctaian':'float64', 'pct2prace':'float64', 'pcthispanic':'float64'})

# find the maximum confidence for each professor's ethnicity
merged_df['EthnicityConfidence'] = merged_df[['pctwhite', 'pctblack', 'pctapi', 'pctaian', 'pct2prace', 'pcthispanic']].max(axis='columns')

# assign each professor the ethnicity with the highest percentage; looking only at the pct columns
# avoids matching an unrelated column (e.g. NumRatings) that happens to hold the same value
merged_df['InferredEthnicity'] = merged_df[ethnicities].idxmax(axis='columns')

# remove the 'pct' from the beginning of the ethnicities
for eth in ethnicities:
  merged_df.loc[merged_df['InferredEthnicity'] == eth, 'InferredEthnicity'] = eth[3:]

# label professors with no match in the census data (confidence of 0) as nonwhite (a new ethnicity) because I am assuming
# the data is pulling mostly from white surnames, so most of the misses will be non-white
merged_df.loc[merged_df['EthnicityConfidence'] == 0, 'InferredEthnicity'] = 'nonwhite'

# drop the pct columns since they are no longer needed
merged_df.drop(ethnicities, axis='columns', inplace=True)

merged_df

Unnamed: 0,FirstName,LastName,ProfID,NumRatings,AverageRating,Title,ActualSalary,FTESalary,Department,InferredGender,EthnicityConfidence,InferredEthnicity
0,Yuxuf,Abana,523510,116,2.1,Lecturer,56500,56500,Africana Studies,,0.00,nonwhite
1,Matthew,Abraham,1868164,12,3.5,Professor,88377,88377,English,male,45.00,white
2,Corey,Abramson,2685728,1,4.0,Associate Professor,88777,88777,Sociology,male,91.14,white
3,Charity-Joy,Acchiardo,2020486,41,3.8,Lecturer,52800,88000,Economics,,0.00,nonwhite
4,Abraham,Acosta,1108598,15,3.5,Associate Professor,77733,77733,Spanish and Portuguese,male,89.48,hispanic
...,...,...,...,...,...,...,...,...,...,...,...,...
1683,Yitshak,Zohar,1127140,22,2.0,Professor,114183,114183,Aerospace & Mechanical Engr,,94.37,white
1684,Marek,Zreda,2154159,0,0.0,Professor,120240,120240,Hydrology & Atmospheric Sci,male,0.00,nonwhite
1685,Adriana,Zuniga Teran,2569096,0,0.0,Assistant Research Scientist,70000,70000,Planning Degree Program,female,0.00,nonwhite
1686,Na,Zuo,2359051,8,3.3,Assistant Professor of Practice,95000,95000,Agric & Resource Econ-Ins,,90.67,api
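
Picking the highest-percentage category is a row-wise arg-max, and `DataFrame.idxmax` expresses that directly when restricted to the pct columns. Below is a sketch on made-up percentages (none of these numbers come from the census data). The first row has `NumRatings` equal to its top percentage, which is the kind of coincidence that a search across all columns could trip over.

```python
import pandas as pd

pct_cols = ['pctwhite', 'pctblack', 'pctapi', 'pctaian', 'pct2prace', 'pcthispanic']

# invented values for illustration only
toy = pd.DataFrame({
    'NumRatings':  [45,   3,   8],
    'pctwhite':    [45.0, 0.0, 2.0],
    'pctblack':    [20.0, 0.0, 1.0],
    'pctapi':      [5.0,  0.0, 90.0],
    'pctaian':     [0.0,  0.0, 0.0],
    'pct2prace':   [5.0,  0.0, 2.0],
    'pcthispanic': [25.0, 0.0, 5.0],
})

# highest percentage per row, and the name of the column that holds it
toy['EthnicityConfidence'] = toy[pct_cols].max(axis='columns')
toy['InferredEthnicity'] = toy[pct_cols].idxmax(axis='columns').str[3:]  # strip 'pct'

# rows with no census match carry all zeros; label them separately
toy.loc[toy['EthnicityConfidence'] == 0, 'InferredEthnicity'] = 'nonwhite'

print(toy[['EthnicityConfidence', 'InferredEthnicity']])
```

On ties, `idxmax` returns the first of the tied columns, so the column order in `pct_cols` acts as the tie-breaker.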


In [None]:
# create the department dataframe
department_df = pd.DataFrame({'Department':merged_df['Department'].unique()})

# compute per-department aggregates: number of professors, average rating, average full-time salary, and total pay
aggs = pd.merge(department_df, merged_df, how='inner', on='Department').groupby('Department').agg(NumProfessors=('Department', 'count'), 
                                                                                                  AvgRating=('AverageRating', 'mean'),
                                                                                                  AvgFTESalary=('FTESalary', 'mean'),
                                                                                                  TotalProfPay=('ActualSalary', 'sum'))
department_df = pd.merge(department_df, aggs, how='inner', on='Department')
ethnicities = ['white', 'black', 'api', 'aian', '2prace', 'hispanic', 'nonwhite']
for eth in ethnicities:
  eth_aggs = pd.merge(department_df, merged_df[merged_df['InferredEthnicity'] == eth], 
                    how='inner', on='Department').groupby('Department').agg(
                        count=('InferredEthnicity', 'count')
                    )
  eth_aggs.columns = ['Count'+eth]
  department_df = pd.merge(department_df, eth_aggs, how='left', on='Department').fillna(0)

  department_df['Percent'+eth] = department_df.apply(lambda row: row['Count'+eth] / row['NumProfessors'], axis='columns')
  department_df.drop(['Count'+eth], axis='columns', inplace=True)

department_df

Unnamed: 0,Department,NumProfessors,AvgRating,AvgFTESalary,TotalProfPay,Percentwhite,Percentblack,Percentapi,Percentaian,Percent2prace,Percenthispanic,Percentnonwhite
0,Africana Studies,13,2.892308,67658.923077,864780,0.461538,0.230769,0.000000,0.0,0.0,0.076923,0.230769
1,English,88,3.836364,64414.409091,5413256,0.659091,0.000000,0.045455,0.0,0.0,0.102273,0.193182
2,Sociology,18,3.316667,94438.833333,1524162,0.833333,0.055556,0.000000,0.0,0.0,0.055556,0.055556
3,Economics,24,3.012500,159175.833333,3656270,0.541667,0.000000,0.125000,0.0,0.0,0.083333,0.250000
4,Spanish and Portuguese,34,3.923529,69200.852941,2298773,0.411765,0.000000,0.000000,0.0,0.0,0.382353,0.205882
...,...,...,...,...,...,...,...,...,...,...,...,...
193,Environmental Science-Ext,1,3.900000,101489.000000,101489,1.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000
194,Otolaryngology,1,4.100000,577500.000000,577500,0.000000,0.000000,1.000000,0.0,0.0,0.000000,0.000000
195,Facilities Mgmt-Maint Services,1,4.500000,51394.000000,51394,0.000000,1.000000,0.000000,0.0,0.0,0.000000,0.000000
196,Asthma/Airway Disease Rsch Ctr,1,0.000000,187000.000000,37400,1.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000
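
The per-ethnicity loop above can also be written as a single cross-tabulation. This is only a sketch on a made-up toy frame (the `profs` rows are not from the real data), and it differs from the loop in two ways: professors with a missing `InferredEthnicity` are left out of the denominator, and a category that never occurs gets no column.

```python
import pandas as pd

# Toy stand-in for merged_df; these rows are made up for illustration.
profs = pd.DataFrame({
    'Department': ['Math', 'Math', 'Math', 'History'],
    'InferredEthnicity': ['white', 'white', 'api', 'hispanic'],
})

# normalize='index' divides each row by its row total,
# which gives the share of each ethnicity within a department.
shares = pd.crosstab(profs['Department'], profs['InferredEthnicity'], normalize='index')

# Match the PercentX column naming used in department_df.
shares = shares.add_prefix('Percent')
print(shares)
```

If the denominators need to match `NumProfessors` exactly, the loop version is the safer choice.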


In [None]:
# create courses df: one row per course, aggregated over its reviews
courses_df = pd.DataFrame({'CourseID' : reviews_df['CourseID'].unique()})

courses_df = (
    pd.merge(reviews_df, courses_df, how='inner', on='CourseID')
    .groupby('CourseID')
    .agg(AvgEasy=('EasyRating', 'mean'),
         AvgHelpful=('HelpfulRating', 'mean'),
         AvgClarity=('ClarityRating', 'mean'),
         AvgOverall=('OverallRating', 'mean'),
         # professor with the most reviews for the course (lowest ProfID on ties)
         MostCommonProf=('ProfID', lambda x: pd.Series.mode(x)[0]),
         AvgGrade=('CoursePercent', 'mean'),
         NumRatings=('ReviewID', 'count'))
)

# get the department of the most common professor for each course
courses_df.reset_index(level=0, inplace=True)
prof_dept = merged_df[['ProfID', 'Department']]
courses_df = pd.merge(courses_df, prof_dept, how='inner', left_on='MostCommonProf', right_on='ProfID')

courses_df

Unnamed: 0,CourseID,AvgEasy,AvgHelpful,AvgClarity,AvgOverall,MostCommonProf,AvgGrade,NumRatings,ProfID,Department
0,100A,3.000000,5.000000,5.000000,5.000000,2297258,97.000000,1,2297258,Sch Theatre Film & Television
1,FTV150B,2.000000,4.666667,4.666667,4.666667,2297258,95.666667,3,2297258,Sch Theatre Film & Television
2,100B,4.000000,3.000000,3.000000,3.000000,2220229,,1,2220229,Sch Theatre Film & Television
3,FTV100,4.666667,1.000000,1.000000,1.000000,2220229,83.000000,3,2220229,Sch Theatre Film & Television
4,FTV100A,4.000000,1.250000,1.250000,1.250000,2220229,73.000000,4,2220229,Sch Theatre Film & Television
...,...,...,...,...,...,...,...,...,...,...
3702,TTE596C,4.000000,5.000000,5.000000,5.000000,1674761,,1,1674761,Teachg Learning Sociocult Stds
3703,WFSC385,3.000000,5.000000,5.000000,5.000000,2590803,93.000000,1,2590803,Sch of Nat Resource&Enviro-Res
3704,WFSC444,3.500000,4.500000,4.000000,4.250000,1279959,,2,1279959,Sch of Nat Resource&Enviro-Res
3705,WFSC447,3.000000,2.500000,2.500000,2.500000,1279959,93.000000,2,1279959,Sch of Nat Resource&Enviro-Res


In [None]:
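# the MySQL connector cannot convert numpy scalar types (e.g. numpy.int64),
# so cast numeric columns to object dtype, which holds plain Python ints and floats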
import datetime as dt
def numpy_to_python_types(df):
  for col in df.columns:
    t = df[col].dtype

    if t == 'int64' or t == 'float64':
      df = df.astype({col: 'object'})

  return df

merged_df = numpy_to_python_types(merged_df)

reviews_df = reviews_df.where(pd.notnull(reviews_df), None)
reviews_df = numpy_to_python_types(reviews_df)

department_df = numpy_to_python_types(department_df)

courses_df = courses_df.replace({np.nan: None})
courses_df = numpy_to_python_types(courses_df)

       ReviewID ClarityRating     CourseID  ... CourseGrade  ProfID CoursePercent
0      35221583             2      HUMS376  ...           A  523510          93.0
1      35087586             1      AFAS306  ...          WD  523510          None
2      34786511             1      AFAS303  ...           C  523510          73.0
3      34777281             1  AFASENGL303  ...          B-  523510          80.0
4      34593307             3       GWS312  ...           B  523510          83.0
...         ...           ...          ...  ...         ...     ...           ...
14838  14918302             5      ENGL300  ...         N/A  530616          None
14839  14857211             4      ENGL483  ...         N/A  530616          None
14840  14439152             5       ENG484  ...         N/A  530616          None
14841  14372588             3      ENGL486  ...         N/A  530616          None
14842   2990505             5     ENGL458B  ...         N/A  530616          None

[14661 rows x 1
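
To see what the conversion above does, here is a small sketch on a made-up frame (the column values are illustrative only). It casts the numeric columns to `object` and checks which Python types come out of `to_records`, which is the form the rows are in when they are later passed to `cur.execute`.

```python
import pandas as pd

# Toy frame with an int64, a float64 and a string column (made-up values).
df = pd.DataFrame({
    'ProfID': [101, 102],
    'AverageRating': [3.5, 4.0],
    'LastName': ['Smith', 'Jones'],
})

# Cast only the integer and float columns, as numpy_to_python_types does.
numeric_cols = [c for c in df.columns if df[c].dtype.kind in 'if']
converted = df.astype({c: 'object' for c in numeric_cols})

# Pull out the first row the same way the insert cells do.
first_row = tuple(converted.to_records(index=False)[0])
row_types = [type(v).__name__ for v in first_row]
print(row_types)
```

Without the cast, the first two entries would be numpy scalar types rather than built-in `int` and `float`.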

#Load
At this point we have four dataframes: merged_df (professors), reviews_df, department_df, and courses_df.

The next step is to create a table for each one and insert its rows.

End result:
* 4 tables: Professor, Review, Department, Course
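
The create-then-insert pattern can be sketched without a database server. The example below is an illustration only: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL, a cut-down Department table, and made-up rows. It also uses `executemany` to insert all rows in one call, which is an alternative to looping over the rows one at a time.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server; nothing here touches the real DB.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Cut-down version of the Department table (illustrative columns only).
cur.execute("""CREATE TABLE Department(
  Department VARCHAR(100) PRIMARY KEY,
  NumProfessors INT,
  AvgRating REAL
)""")

# Made-up rows, one tuple per table row.
rows = [('Example Dept A', 3, 4.0), ('Example Dept B', 1, 2.5)]

# sqlite3 uses ? placeholders; mysql.connector uses %s instead.
cur.executemany("INSERT INTO Department VALUES(?, ?, ?)", rows)
conn.commit()

loaded = cur.execute(
    "SELECT Department, NumProfessors FROM Department ORDER BY Department"
).fetchall()
print(loaded)

cur.close()
conn.close()
```

Either way, the values are passed separately from the SQL string, so quoting in names or review comments is handled by the driver.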

In [None]:
import mysql.connector as mysql
import pandas as pd
# SQL function definitions
def get_conn_cur(): # define function name and arguments (there aren't any)
  # Make a connection
  conn = mysql.connect(host="54.184.86.6", 
                       database="rmp_ista",
                       user="select_rmp",
                       password="select_rmp")

  cur = conn.cursor()   # Make a cursor after

  return(conn, cur)   # Return both the connection and the cursor

def run_query(query_string):
  conn, cur = get_conn_cur() # get connection and cursor

  cur.execute(query_string) # executing string as before

  my_data = cur.fetchall() # fetch query data as before

  # here we're extracting the 0th element for each item in cur.description
  colnames = [desc[0] for desc in cur.description]

  cur.close() # close
  conn.close() # close

  return pd.DataFrame(data=my_data, columns=colnames)
  #return(colnames, my_data) # return column names AND data

# make sql_head function
def sql_head(table_name):
  return run_query(f"SELECT * FROM {table_name} LIMIT 5")

# Check table_names
def get_table_names():
  conn, cur = get_conn_cur() # get connection and cursor

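  # query to get table names; in MySQL the schema is the database name,
  # so filter on the current database (Postgres's 'public' schema does not exist here)
  table_name_query = """SELECT table_name FROM information_schema.tables
       WHERE table_schema = DATABASE() """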

  cur.execute(table_name_query) # execute
  my_data = cur.fetchall() # fetch results

  cur.close() #close cursor
  conn.close() # close connection

  return(my_data) # return your fetched results

In [None]:
# create the tables

conn, cur = get_conn_cur()

#cur.execute("DROP TABLE Professor")
#cur.execute("DROP TABLE Course")
#cur.execute("DROP TABLE Review")
#cur.execute("DROP TABLE Department")

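# professor table
# note: the salary column is created as AcutalSalary (a misspelling of ActualSalary),
# and the example queries use that spelling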
print(merged_df.columns)
create_professor = """CREATE TABLE Professor(
  FirstName VARCHAR(50),
  LastName VARCHAR(50),
  ProfID INT PRIMARY KEY,
  NumRatings INT,
  AverageRating REAL,
  Title VARCHAR(100),
  AcutalSalary INT,
  FTESalary INT,
  Department VARCHAR(100),
  InferredGender VARCHAR(10),
  EthnicityConfidence REAL,
  InferredEthnicity VARCHAR(10)
)"""

cur.execute(create_professor)

# create courses
print(courses_df.columns)
create_course = """CREATE TABLE Course(
  CourseID VARCHAR(20) PRIMARY KEY,
  AvgEasy REAL,
  AvgHelpful REAL,
  AvgClarity REAL,
  AvgOverall REAL,
  MostCommonProf INT,
  AvgGrade REAL,
  NumRatings INT,
  ProfID INT,
  Department VARCHAR(100)
)
"""
cur.execute(create_course)

# create reviews
print(reviews_df.columns)
create_review = """CREATE TABLE Review(
  ReviewID INT PRIMARY KEY,
  ClarityRating REAL,
  CourseID VARCHAR(20),
  Comment VARCHAR(5000),
  ReviewDate DATE,
  EasyRating REAL,
  HelpfulRating REAL,
  OverallRating REAL,
  CourseGrade VARCHAR(20),
  ProfID INT,
  CoursePercent REAL
)
"""
cur.execute(create_review)

# create departments
print(department_df.columns)
create_department = """CREATE TABLE Department(
  Department VARCHAR(100) PRIMARY KEY,
  NumProfessors INT,
  AvgRating REAL,
  AvgFTESalary REAL,
  TotalProfPay INT,
  Percentwhite REAL,
  Percentblack REAL,
  Percentapi REAL,
  Percentaian REAL,
  Percent2prace REAL,
  Percenthispanic REAL,
  Percentnonwhite REAL
)
"""

cur.execute(create_department)

cur.close()
conn.commit()
conn.close()

Index(['CourseID', 'AvgEasy', 'AvgHelpful', 'AvgClarity', 'AvgOverall',
       'MostCommonProf', 'AvgGrade', 'NumRatings', 'ProfID', 'Department'],
      dtype='object')
Index(['ReviewID', 'ClarityRating', 'CourseID', 'Comment', 'ReviewDate',
       'EasyRating', 'HelpfulRating', 'OverallRating', 'CourseGrade', 'ProfID',
       'CoursePercent'],
      dtype='object')
Index(['Department', 'NumProfessors', 'AvgRating', 'AvgFTESalary',
       'TotalProfPay', 'Percentwhite', 'Percentblack', 'Percentapi',
       'Percentaian', 'Percent2prace', 'Percenthispanic', 'Percentnonwhite'],
      dtype='object')


In [None]:
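# numpy types cannot be passed to the database driver directly
# with psycopg2 (Postgres) the adapter below would handle this; it is commented out
# because the database is MySQL and numpy_to_python_types does the conversion instead
# source: https://stackoverflow.com/questions/50626058/psycopg2-cant-adapt-type-numpy-int64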
import numpy as np
# from psycopg2.extensions import register_adapter, AsIs
# register_adapter(np.int64, psycopg2._psycopg.AsIs)

In [None]:
# insert into professor table
conn, cur = get_conn_cur()

merged_df = merged_df.drop_duplicates('ProfID', keep='first')
#cur.execute("DELETE FROM Professor")

prof_tuples = merged_df.to_records(index=False)

for prof in prof_tuples:
  query = "INSERT INTO Professor VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"

  cur.execute(query, tuple(prof))

cur.close()
conn.commit()
conn.close()


FirstName              object
LastName               object
ProfID                 object
NumRatings             object
AverageRating          object
Title                  object
ActualSalary           object
FTESalary              object
Department             object
InferredGender         object
EthnicityConfidence    object
InferredEthnicity      object
dtype: object


In [None]:
sql_head('Professor')

Unnamed: 0,FirstName,LastName,ProfID,NumRatings,AverageRating,Title,AcutalSalary,FTESalary,Department,InferredGender,EthnicityConfidence,InferredEthnicity
0,Joseph,Tolliver,8381,22,3.1,Associate Professor,69113,69113,Philosophy,male,48.87,black
1,Suzanne,Delaney,14962,304,4.2,Senior Lecturer,94145,94145,Management and Organizations,female,83.59,white
2,Sandra,Barr,32389,110,4.9,Lecturer,33731,33731,School of Art,female,82.69,white
3,Patrick,McDonald,43130,21,4.2,Adjunct Lecturer,24000,48000,Management Information Systems,male,76.65,white
4,Thomas,Wilson,49563,31,4.3,Associate Professor,64296,64296,The Honors College,male,67.36,white


In [None]:
# insert into course table

conn, cur = get_conn_cur()

courses_df = courses_df.drop_duplicates('CourseID', keep='first')
#cur.execute("DELETE FROM Course")

course_tuples = courses_df.to_records(index=False)
for course in course_tuples:
  query = "INSERT INTO Course VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
  cur.execute(query, tuple(course))

cur.close()
conn.commit()
conn.close()


In [None]:
sql_head('Course')

Unnamed: 0,CourseID,AvgEasy,AvgHelpful,AvgClarity,AvgOverall,MostCommonProf,AvgGrade,NumRatings,ProfID,Department
0,100A,3.0,5.0,5.0,5.0,2297258,97.0,1,2297258,Sch Theatre Film & Television
1,100B,4.0,3.0,3.0,3.0,2220229,,1,2220229,Sch Theatre Film & Television
2,101,2.7,4.7,4.4,4.55,1670556,85.75,10,1670556,Spanish and Portuguese
3,101A,2.923077,3.692308,3.692308,3.692308,1491674,86.5,13,1491674,English
4,101A101B,4.0,5.0,5.0,5.0,1154235,97.0,1,1154235,Chemistry & Biochemistry - Sci


In [None]:
# insert into department table

conn, cur = get_conn_cur()

department_df = department_df.drop_duplicates('Department', keep='first')
#cur.execute("DELETE FROM Department")

department_tuples = department_df.to_records(index=False)
for dept in department_tuples:
  query = "INSERT INTO Department VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
  cur.execute(query, tuple(dept))

cur.close()
conn.commit()
conn.close()


In [None]:
sql_head('Department')

Unnamed: 0,Department,NumProfessors,AvgRating,AvgFTESalary,TotalProfPay,Percentwhite,Percentblack,Percentapi,Percentaian,Percent2prace,Percenthispanic,Percentnonwhite
0,Academic Administration,1,4.9,110000.0,110000,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Academic Inits & Stdnt Success,4,0.625,84375.0,337500,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Access & Information Services,1,0.0,112585.0,112585,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,Administration and Athletics,3,4.766667,57891.0,162755,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Aerospace & Mechanical Engr,29,2.403448,108132.758621,2872088,0.413793,0.0,0.103448,0.0,0.0,0.0,0.482759


In [None]:
# insert into review table

conn, cur = get_conn_cur()

reviews_df = reviews_df.drop_duplicates('ReviewID', keep='first')
#cur.execute("DELETE FROM Review")

review_tuples = reviews_df.to_records(index=False)
for rev in review_tuples:
  query = "INSERT INTO Review VALUES(%s, %s, %s, %s, STR_TO_DATE(%s, '%m/%d/%Y'), %s, %s, %s, %s, %s, %s)"

  cur.execute(query, tuple(rev))

cur.close()
conn.commit()
conn.close()


Streaming output truncated to the last 5000 lines.
(25930715, 5, 'CE214', "Dean is the fricken man. Ever since I took statics I see statics everywhere. He really gets into statics and you can tell he is genuinely excited to teach the subject. It's a tough class, but he is fair and really wants you to leave the course understanding the material so you can apply it in future classes and the professional world.", '01/12/2016', 4.0, 5, 5.0, 'B+', 2097441, 87.0)
(25874183, 4, 'CE214', "He loves the material that he teaches. I have learned more in this class than I have in any other class. He doesn't make the class ridiculously easy but he is helpful and open to questions.  He is more than fair and allows cheat sheets for all exams. There is also weekly homework. I would recomend him to someone who wants to be taught the material.", '01/04/2016', 3.0, 5, 4.5, 'B', 2097441, 83.0)
(34959446, 4, 'BCOM214', 'touch on grading,but nice teacher', '07/08/2021', 4.0, 4, 4.0, 'B+', 25581

In [None]:
sql_head("Review")

Unnamed: 0,reviewid,clarityrating,courseid,comment,reviewdate,easyrating,helpfulrating,overallrating,coursegrade,profid,coursepercent
0,35221583,2.0,HUMS376,I took this course 7 week online. It was wayyy...,2021-11-10,4.0,2.0,2.0,A,523510,93.0
1,35087586,1.0,AFAS306,I would recommend avoiding this professor all ...,2021-09-24,5.0,1.0,1.0,WD,523510,
2,34786511,1.0,AFAS303,"He is not accessible, won't give you feedback,...",2021-05-13,4.0,1.0,1.0,C,523510,73.0
3,34777281,1.0,AFASENGL303,I am an English major in several high level En...,2021-05-12,4.0,1.0,1.0,B-,523510,80.0
4,34593307,3.0,GWS312,Im a pretty good writer and I often get very h...,2021-04-21,4.0,3.0,3.0,B,523510,83.0
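
The review insert above leaves date conversion to MySQL's `STR_TO_DATE`. As a sketch of an alternative, the MM/DD/YYYY strings could be parsed in Python before inserting. `parse_review_date` is a hypothetical helper, not part of the notebook, and I have not tested binding the resulting `date` objects against this database.

```python
from datetime import datetime

def parse_review_date(text):
    """Parse an MM/DD/YYYY string into a datetime.date."""
    return datetime.strptime(text, '%m/%d/%Y').date()

# Same format as the ReviewDate strings in reviews_df.
print(parse_review_date('01/12/2016'))  # prints 2016-01-12
```

Parsing on the Python side also surfaces malformed dates as a `ValueError` before anything is sent to the database.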


#Example Queries
This section contains example queries based on some of the questions I outlined in HW7.

In [None]:
# average salary by gender
query = "SELECT InferredGender, AVG(FTESalary) AS AVGFTE, AVG(AcutalSalary) AS AVGActual FROM Professor GROUP BY InferredGender"

run_query(query)

Unnamed: 0,InferredGender,AVGFTE,AVGActual
0,male,102911.1816,97477.1517
1,female,88994.7819,83164.4977
2,,94076.1326,88954.9724


In [None]:
# average salary by gender within each (department, title) pair that has at least two distinct inferred genders,
# restricted to departments whose name contains 'Math'
depts_titles_2_genders = '''SELECT Department, Title
              FROM Professor 
              GROUP BY Department, Title
              HAVING COUNT(DISTINCT InferredGender) >= 2'''

other_restriction = "Department LIKE '%Math%'"

query = '''SELECT Department, Title, InferredGender, COUNT(InferredGender), AVG(FTESalary) AS AVGFTE, AVG(AcutalSalary) AS AVGActual 
            FROM Professor
            WHERE (Department, Title) IN (''' + depts_titles_2_genders + '''
            ) AND ''' + other_restriction + '''
            GROUP BY Department, Title, InferredGender
            ORDER BY Department, Title, InferredGender'''

run_query(query)

Unnamed: 0,Department,Title,InferredGender,COUNT(InferredGender),AVGFTE,AVGActual
0,Mathematics,Associate Professor,,0,86964.25,69764.25
1,Mathematics,Associate Professor,female,2,84312.0,84312.0
2,Mathematics,Associate Professor,male,10,88238.3,88238.3
3,Mathematics,Instructor,,0,42000.0,42000.0
4,Mathematics,Instructor,female,9,43444.4444,43444.4444
5,Mathematics,Instructor,male,8,42750.0,42750.0
6,Mathematics,Lecturer,,0,48000.0,48000.0
7,Mathematics,Lecturer,female,2,48750.0,42562.5
8,Mathematics,Lecturer,male,1,48500.0,48500.0
9,Mathematics,Postdoctoral Research Associate I,,0,47659.0,47659.0
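
The department filter in the query above is spliced into the SQL text as a string. A sketch of passing the pattern as a query parameter instead is shown below. It is an illustration only: it runs against an in-memory `sqlite3` database with made-up rows, and `avg_salary_like` is a hypothetical helper, not something defined in this notebook.

```python
import sqlite3

# In-memory SQLite stand-in with a cut-down Professor table and made-up rows.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("CREATE TABLE Professor(Department TEXT, FTESalary INT)")
cur.executemany(
    "INSERT INTO Professor VALUES(?, ?)",
    [('Mathematics', 80000), ('Mathematics', 90000), ('English', 60000)],
)

def avg_salary_like(cur, pattern):
    """Average FTE salary per department whose name matches a LIKE pattern."""
    # The pattern is bound as a parameter rather than concatenated into the SQL.
    # sqlite3 uses ? here; mysql.connector would use %s.
    cur.execute(
        "SELECT Department, AVG(FTESalary) FROM Professor "
        "WHERE Department LIKE ? GROUP BY Department",
        (pattern,),
    )
    return cur.fetchall()

result = avg_salary_like(cur, '%Math%')
print(result)
conn.close()
```

To use this with `run_query`, the function would need an extra argument for the parameter tuple, which it does not currently have.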


In [None]:
# departments with more than 10 professors, ordered from highest to lowest minority ethnicity percentage
query = """SELECT Department, NumProfessors, PercentWhite, 
            PercentBlack + PercentAPI + PercentAIAN + Percent2pRace + PercentHispanic + PercentNonWhite AS PercentMinority
            FROM Department
            WHERE NumProfessors > 10
            ORDER BY PercentMinority DESC"""
run_query(query)

Unnamed: 0,Department,NumProfessors,PercentWhite,PercentMinority
0,East Asian Studies,20,0.25,0.75
1,Finance,14,0.357143,0.642857
2,Systems and Industrial Engr,17,0.411765,0.588235
3,Spanish and Portuguese,34,0.411765,0.588235
4,Aerospace & Mechanical Engr,29,0.413793,0.586207
5,Africana Studies,13,0.461538,0.538462
6,Sch Middle E/N African Studies,13,0.461538,0.538462
7,Electrical and Computer Engr,29,0.482759,0.517241
8,Materials Science & Engr,12,0.5,0.5
9,Mathematics,97,0.525773,0.474227


In [None]:
# most to least difficult courses with more than 10 ratings, in departments whose name contains 'Information'
other_restriction = "Department LIKE '%Information%'"
query = """SELECT * FROM 
          Course
          WHERE NumRatings > 10 AND """ + other_restriction + """
          ORDER BY AvgEasy ASC"""
run_query(query)

Unnamed: 0,courseid,avgeasy,avghelpful,avgclarity,avgoverall,mostcommonprof,avggrade,numratings,profid,department
0,ESOC150,2.153846,3.538461,3.461539,3.5,1891506,93.22222,13,1891506,School of Information
1,MIS304,2.8,4.04,4.0,4.02,1705443,90.95,25,1705443,Management Information Systems
2,MIS373,2.842105,4.105263,4.210527,4.157895,1786909,93.818184,19,1786909,Management Information Systems
3,MIS111,2.918919,3.162162,3.162162,3.162162,880998,89.0,37,880998,Management Information Systems
4,MIS441,3.0,1.4375,1.4375,1.4375,996649,90.0,16,996649,Management Information Systems
5,ISTA130,3.25,3.333333,3.333333,3.333333,980523,89.333336,12,980523,School of Information
6,ISTA116,3.476191,3.238095,3.095238,3.166667,2284722,85.166664,21,2284722,School of Information
7,MIS531,4.142857,2.357143,2.357143,2.357143,1667601,82.8,14,1667601,Management Information Systems


In [None]:
# view all the ratings for professor named Jonathan Misurda
query = """SELECT Comment, OverallRating, CourseID, ReviewDate FROM 
          Review JOIN Professor ON (Review.ProfID=Professor.ProfID)
          WHERE FirstName='Jonathan' AND LastName = 'Misurda'"""

pd.set_option('display.max_colwidth', None)
run_query(query)

Unnamed: 0,comment,overallrating,courseid,reviewdate
0,"Boring speaker, monotone, with no passion. I have Aspergers and have difficulty with social situations. He did not accommodate my disability, just scoffed. Group projects would have been much better if there was a discussion board, forum, or chatsomething to get to know classmates and discuss problems. If I could ask for a refund I would",1.0,CSC335,2021-04-11
1,"Not understanding the negativity in other reviews. Dude's really enthusiastic about his field and knows what he's teaching backwards and forwards. Classes are very lecture-heavy, but I found that alright because he's a very good speaker. Take good notes and get a good mental model of the material, because it's hard to fake it on his tests/projects",5.0,CSC252,2021-03-17
2,"I was really disappointed after taking this class. Dr. Misruda makes this class unnecessarily difficult for no reason. There's absolutely no rubric for assignments. He talks the whole lecture time without any participation, so it's really hard to pay attention. He will cover a needed topic one or two days before an assignment is due.",2.0,CS335,2021-02-15
3,"Dr. Mirsurda is really passionate about the material, but I don't think he relates it to the students. He doesn't give a lot of homework so I you don't do well there isn't much of a buffer. He often speaks the whole lecture, so it's hard to follow for 75 minutes. He often gives snarky responses to the students when they ask questions.",1.0,CS252,2021-02-15
4,"Took two classes (335 and 352) with him and I respect him quite a bit. He got a PhD in JUnit testing, what a mad lad. He's very knowledgeable about programming and is capable of engaging in good discussions about the material. Sometimes, it seems like he's condescending, yeah, but I feel like that's because he thinks this is easier material",3.0,CSC352,2020-11-22
5,"I avoid Misurda at every chance I get, you will not find me willingly taking his class. All his lectures are monotoned, without use of examples or direction. Projects are extremely hard, no structure to the course and lack of examples in class. TA's are more helpful at explaining concepts then the teacher. No piazza even tho class is online.",1.0,CSC252,2020-11-20
6,"Misurda is easily the worst computer science professor in the department. Lectures are monotonous, no online discussion board for help, and seems to not care about feedback regarding his course. I have a high B in this course and have had to do all studying outside of the lecture. Avoid all of his classes if possible",1.0,CSC252,2020-11-12
7,The class wasn't all that difficult and it is my fault for slacking off towards the end. But he was easily the worst CS professor I have ever had. I'm not sure if it was just me but the way he taught the class and interacted with students made me feel very unmotivated to participate. I would avoid taking a class with him at all costs.,1.0,CSC335,2020-11-06
8,"He is nice professor but he really need to change his teach style. He is a kind of professor who hope students to go to his office hour. If you go to, he love talking with you and answer clearly. And his exam has many works. So, good luck on it.",3.0,CSC252,2020-10-06
9,"This teacher is not transparent to his students. Does not use piazza, barely uses D2L. Extremely lazy when posting information about hw and exams. Does not post announcements at all. Joke of a professor at an academic institution.",1.0,CSC252,2020-10-01
