# Names
Diego Piraquive 
James Firpo
Trevor Teerlink

### Project:  Professor Research vs Student Success

As a refresher, the purpose of our project is to look for correlations between how much time a professor spends on research and how well their students perform.  We suspect that professors who are heavily involved in research are less committed to their students, which may show up as poorer student performance.

We are scraping from two sources: the OBIA (Office of Budget and Institutional Analysis) and a faculty directory website, which links to professors' Google Scholar profiles.

From the OBIA we will scrape the classes, the professor for each class, and the class grades.
From the faculty directory and Google Scholar links, we will obtain an "h-index" for each professor.
The h-index is an author-level metric that attempts to measure both the productivity and the citation impact of a scientist's or scholar's publications. It is based on the author's most cited papers and the number of citations those papers have received in other publications: it is the largest number h such that h of the author's papers have each been cited at least h times.
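
To make the definition concrete, here is a minimal sketch of how an h-index could be computed from a list of per-paper citation counts. In our project we read the value directly from Google Scholar instead; the function name `h_index` and the citation counts below are hypothetical and only for illustration.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation counts for one author's papers (not real data).
example_citations = [25, 8, 5, 3, 3, 1, 0]
print(h_index(example_citations))
```

With these counts the three most cited papers each have at least 3 citations, but the fourth has only 3 (fewer than 4), so the function returns 3.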


In [None]:
# Here we demonstrate that we can scrape a common faculty directory, https://mech.utah.edu/faculty/.
# The code builds two functions: get_hindex reads the h-index from a Google Scholar page, and get_prof_scores
# walks the faculty page, grabs each professor's Google Scholar URL, and looks up their h-index.


import urllib.request
from bs4 import BeautifulSoup



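def get_hindex(url):
    scholar_soup = BeautifulSoup(urllib.request.urlopen(url), 'lxml')
    # Assumes the stats table lists the two citation counts first, so cell index 2 holds the all-time h-index.
    return int(scholar_soup.findAll("td", {"class": "gsc_rsb_std"})[2].text)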



def get_prof_scores():
    soup = BeautifulSoup(urllib.request.urlopen("https://mech.utah.edu/faculty/"), 'lxml')

    for row in soup.find("table", {"id": "tablepress-7"}).findAll("tr"):
        tds = row.findAll("td")
        if len(tds) == 0:
            continue

        _, name, body, _ = tds
        link = body.find("a")
        if not link:
            continue
        url = link["href"]
      

        yield (name.find("strong").text.strip(), get_hindex(url))

prof_scores = list(get_prof_scores())

print(prof_scores)

In [None]:
# Now we have a list of the Mechanical Engineering professors (those with a Google Scholar link) and their
# h-index scores.  This is a handy metric for how "dedicated" a professor is to research.

In [None]:
# Here we have code to load and tidy the class grade data (science and engineering) from the OBIA site.


from bs4 import BeautifulSoup
import requests
import urllib.request

import time
import pandas as pd
import scipy as sc
import numpy as np

import statsmodels.formula.api as sm

import matplotlib.pyplot as plt 
plt.style.use('ggplot')
%matplotlib inline  
plt.rcParams['figure.figsize'] = (10, 6) 
from bs4 import UnicodeDammit
In [2]:
def getgrades(csv1, csv2):
    # Load both CSV files and stack them into a single DataFrame.
    grades1 = pd.read_csv(csv1, encoding="utf8")
    grades2 = pd.read_csv(csv2, encoding="utf8")
    grades = pd.concat([grades1, grades2], ignore_index=True)
    # Drop rows whose headcount is the placeholder 'ds' instead of a number.
    grades = grades[grades['sumHeadcount'] != 'ds'].reset_index(drop=True)
    return grades
In [3]:
def combinegrades(grades):
    i=1
    n=2
    gradesclean=pd.DataFrame([],columns=['Num', 'Section', 'Subject', 'A', 'B', 'C', 'D', 'E', 'W', 'Other'])
    gradesclean=gradesclean.append({'Num':1050, 'Section':1, 'Subject':'ASTR - Astronomy', 'A':0, 'B':0, 'C':0, 'D':0,
                                    'E':0, 'W':0, 'Other':2}, ignore_index=True)
    while i<len(grades['sumHeadcount']):
        if grades['Subject'].iloc[i]==gradesclean['Subject'].iloc[-1]:
            if grades['Catalog Num'].iloc[i]==gradesclean['Num'].iloc[-1]:
                fail=0
                for m in range(1,n):
                    if grades['Section'].iloc[i]==gradesclean['Section'].iloc[-m]:
                        gradesclean[grades['Grade Group'].iloc[i]].iat[-m]=grades['sumHeadcount'].iloc[i]
                        break
                    fail+=1
                if fail==n-1:
                    n+=1
                    gradesclean=gradesclean.append({'Num':grades['Catalog Num'].iloc[i],
                                                    'Section':grades['Section'].iloc[i],
                                                    'Subject':grades['Subject'].iloc[i], 'A':0, 'B':0,
                                                    'C':0,'D':0, 'E':0, 'W':0, 'Other':0}, ignore_index=True)
                    gradesclean[grades['Grade Group'].iloc[i]].iat[-1]=grades['sumHeadcount'].iloc[i]
            else:
                n=2
                gradesclean=gradesclean.append({'Num':grades['Catalog Num'].iloc[i], 'Section':grades['Section'].iloc[i],
                                                'Subject':grades['Subject'].iloc[i], 'A':0, 'B':0,
                                                'C':0,'D':0, 'E':0, 'W':0, 'Other':0}, ignore_index=True)
                gradesclean[grades['Grade Group'].iloc[i]].iat[-1]=grades['sumHeadcount'].iloc[i]
        else:
            n=2
            gradesclean=gradesclean.append({'Num':grades['Catalog Num'].iloc[i], 'Section':grades['Section'].iloc[i],
                                            'Subject':grades['Subject'].iloc[i], 'A':0, 'B':0,
                                            'C':0,'D':0, 'E':0, 'W':0, 'Other':0}, ignore_index=True)
            gradesclean[grades['Grade Group'].iloc[i]].iat[-1]=grades['sumHeadcount'].iloc[i]
        i+=1
    return(gradesclean)
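
The function above turns one row per grade group into one row per class section. `DataFrame.append` was removed in pandas 2.0, so on newer versions the same reshaping may be easier with `pivot_table`. The sketch below is one possible alternative, not the code we ran: `pivot_grades` and the sample rows are hypothetical, and unlike `combinegrades` it sums duplicate (section, grade) rows instead of overwriting them.

```python
import pandas as pd

GRADE_COLUMNS = ['A', 'B', 'C', 'D', 'E', 'W', 'Other']


def pivot_grades(grades):
    """Reshape one-row-per-grade-group data into one row per class section."""
    counts = grades.copy()
    # The raw headcount column can hold strings, so coerce it to numbers first.
    counts['sumHeadcount'] = pd.to_numeric(counts['sumHeadcount'], errors='coerce').fillna(0)
    wide = counts.pivot_table(
        index=['Subject', 'Catalog Num', 'Section'],
        columns='Grade Group',
        values='sumHeadcount',
        aggfunc='sum',
        fill_value=0,
    )
    # Make sure every grade column exists, in a fixed order.
    wide = wide.reindex(columns=GRADE_COLUMNS, fill_value=0)
    wide.columns.name = None
    return wide.reset_index().rename(columns={'Catalog Num': 'Num'})


# Hypothetical rows in the same layout as the OBIA CSVs (not real data).
sample = pd.DataFrame({
    'Subject': ['MATH', 'MATH', 'MATH'],
    'Catalog Num': [1210, 1210, 1210],
    'Section': [1, 1, 2],
    'Grade Group': ['A', 'B', 'A'],
    'sumHeadcount': ['12', '7', '5'],
})
wide_sample = pivot_grades(sample)
print(wide_sample)
```

The result has one row per section, with a count column for each grade group and zeros where a grade group was missing.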
In [5]:
sgrades=['fall17sci.csv','spring17sci.csv']
egrades=['fall17eng.csv','spring17eng.csv']
# Pair each science file with the engineering file from the same term,
# then load and reshape each pair.
cleaned=[combinegrades(getgrades(s, e)) for s, e in zip(sgrades, egrades)]
In [7]:
cleaned[1]

## Relating professor names to class numbers
As we did in lecture, we used the class catalog to relate class numbers to professors. For now we manually edited the saved HTML for the MATH and ME EN subjects. When we scrape all subjects, we plan to explore Selenium so that more of this step is automated.

In [None]:
class_soup = BeautifulSoup(open("S17_MATH_class_list.html"), "html.parser")

In [None]:
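# Assumes the class list is the first <table> in the saved page
class_table = class_soup.find("table")
classes = pd.read_html(str(class_table))[0]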
classes.head(50)

In [None]:
# Clean up the data: keep only rows with a section number and drop unused columns
mask = pd.notnull(classes["Sec."])
classes = classes[mask]

# Keep only lecture-style components
classes = classes[classes["Component"].isin(["Lecture", "Seminar", "Special Topics"])]

classes = classes.drop(['Units','Location',"Class Attrs","Feed back",'Pre Req','Fees'],axis=1)
In [148]:
classes = classes.drop(['Days/Time & Session'],axis=1)
In [149]:
classes = classes.reset_index()
classes

In [None]:
# This code exports the DataFrame to an Excel file so we can integrate it more easily with other sections of our code.
writer = pd.ExcelWriter('S17_math_classes.xlsx')
In [141]:
S17_math_classes = classes
S17_math_classes.to_excel(writer)
writer.close()  # writes the file; ExcelWriter.save() was removed in pandas 2.0

#### CONCLUSION

We can now begin to analyze our data and search for correlations.  We plan to use correlation coefficients to determine whether there is a positive or negative correlation between a professor's h-index and the average grade in their classes.  One problem we are fixing is our inability to scrape content that only loads as you scroll down a page.  Just today we found out that we need a browser automation tool called Selenium, which can scroll to the bottom of a page and grab all of its HTML.  As a backup, we have also considered looking at the student feedback scores given to professors.
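
As a rough sketch of the analysis we have in mind, the code below computes an average grade per class from letter-grade counts and then a Pearson correlation coefficient against h-index. Several things here are assumptions: the grade-point values (A=4 through E=0, with W and Other excluded), the helper names `mean_grade` and `pearson_r`, and all of the numbers, which are made up for illustration and are not results.

```python
from math import sqrt

# Assumed grade-point values; W and Other are left out of the average.
GRADE_POINTS = {'A': 4.0, 'B': 3.0, 'C': 2.0, 'D': 1.0, 'E': 0.0}


def mean_grade(counts):
    """Average grade points for one class, given a dict of letter -> headcount."""
    graded = sum(counts.get(letter, 0) for letter in GRADE_POINTS)
    if graded == 0:
        return None  # no letter grades recorded for this class
    points = sum(GRADE_POINTS[letter] * counts.get(letter, 0) for letter in GRADE_POINTS)
    return points / graded


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values purely for illustration (not real professors or classes).
h_indexes = [5, 12, 20, 31]
class_counts = [
    {'A': 20, 'B': 10, 'C': 5},
    {'A': 15, 'B': 12, 'C': 8, 'D': 2},
    {'A': 10, 'B': 15, 'C': 10, 'E': 3},
    {'A': 8, 'B': 14, 'C': 12, 'D': 4, 'W': 2},
]
avg_grades = [mean_grade(counts) for counts in class_counts]
print(pearson_r(h_indexes, avg_grades))
```

A value near -1 would be consistent with our suspicion, a value near +1 would contradict it, and a value near 0 would suggest no linear relationship. Since the notebook already imports SciPy, `scipy.stats.pearsonr` could be used instead to get a p-value as well.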