# Web Scraping with Jupyter!

### Outline:

- What is web scraping?

- What is Jupyter?

## Let's start with web scraping: what is it?

![title](imgs/diagram1.png)

- When you load a website, you are basically requesting files from a server, which are then downloaded to your computer.
- Your browser parses them and renders a nice, pretty page ;)
- Loading a site like Wikipedia may give you something like this:

![title](imgs/wiki1.png)

- So what happened?
    - We asked for some data, the server processed that request and gave us a page with all the data displayed for us to read.
    - But what if we just want the data itself, not the page?

#### `Web scraping` is the process of loading web pages through scripts, parsing through the files, and extracting useful information 

## Why use Python and Jupyter?

- Python has a lot of amazing libraries for data processing and is quickly becoming the dominant language in data science.
- Python is also at the heart of many machine learning libraries.
- Python is data-friendly in general, which is what we want for our web scraper.

- Jupyter is the program you are using right now.
- It provides a very simple cell-based design that allows rapid construction of data pipelines.
- Jupyter keeps variables in memory between cell runs, so you don't have to re-download data every time you rerun part of a script.

# Let's try it out!

[Diamondback salaries](https://salaryguide.dbknews.com)
![Preview:](imgs/sal1.png)

- This site from The Diamondback shows salary information for all UMD faculty!
- Load the page and open the network tab in your browser's developer tools; you can see the API requests the site is making!
- Let's get some data!

In [41]:
# API request modules
import requests
import pandas as pd
import time
from IPython.display import clear_output, display

In [29]:
# Request url from network information
request_url = "https://api.dbknews.com/salary/year/2019?search=&sortby=employee&order=desc&page=1"

# Send the request yourself to get the info
website_response = requests.get(url = request_url)

# Print response out
website_response

<Response [200]>
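
The request URL above is just a base address plus a query string, so it can also be assembled from its parts. Here is a minimal sketch using only the standard library; `build_salary_url` is a hypothetical helper, and the parameter names are copied from the URL seen in the network tab.

```python
from urllib.parse import urlencode

# Base endpoint copied from the request URL above.
BASE_URL = "https://api.dbknews.com/salary/year/2019"

def build_salary_url(page, search="", sortby="employee", order="desc"):
    """Assemble the request URL for one page of results (hypothetical helper)."""
    query = urlencode({"search": search, "sortby": sortby, "order": order, "page": page})
    return BASE_URL + "?" + query

print(build_salary_url(1))
```

`requests.get` also accepts a `params` dictionary and builds the query string for you, which would be another way to do the same thing.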

In [30]:
# Ask for the json data
response_data = website_response.json()
response_data

{'data': [{'Division': 'VP Research',
   'Department': 'VPR-UM Ctr Applied Pol Studies (UMCAPS)',
   'Title': 'Fac Res Asst',
   'Employee': 'Zwiesler, Theodore W',
   'Salary': '$49,989.28'},
  {'Division': 'College of Computer, Math & Natural Sciences',
   'Department': 'CMNS-Computer Science',
   'Title': 'Prof',
   'Employee': 'Zwicker, Matthias',
   'Salary': '$231,199.32'},
  {'Division': 'College of Computer, Math & Natural Sciences',
   'Department': 'CMNS-Earth System Science Interdisciplinary Center',
   'Title': 'Res Sci',
   'Employee': 'Zwally, H Jay',
   'Salary': '$83,232.00'},
  {'Division': 'Robert H. Smith School of Business',
   'Department': 'BMGT-Accounting',
   'Title': 'Asst Prof',
   'Employee': 'Zur, Emanuel',
   'Salary': '$220,700.05'},
  {'Division': 'VP Student Affairs',
   'Department': 'VPSA-Transportation Services',
   'Title': 'Driver, Bus',
   'Employee': 'Zuniga, Ruben Oswaldo',
   'Salary': '$35,702.19'},
  {'Division': 'VP Student Affairs',
   'Depa

In [31]:
# Get the 'data' entry and see what one record looks like
request_raw_data = response_data['data']
request_raw_data[0]

{'Division': 'VP Research',
 'Department': 'VPR-UM Ctr Applied Pol Studies (UMCAPS)',
 'Title': 'Fac Res Asst',
 'Employee': 'Zwiesler, Theodore W',
 'Salary': '$49,989.28'}

In [32]:
# Pandas is an amazing package that takes in data like this and makes nice charts and tables!
data_frame = pd.DataFrame(request_raw_data, columns = ['Division', 'Department', 'Title', 'Employee', 'Salary']) 
data_frame

| | Division | Department | Title | Employee | Salary |
|---|---|---|---|---|---|
| 0 | VP Research | VPR-UM Ctr Applied Pol Studies (UMCAPS) | Fac Res Asst | Zwiesler, Theodore W | $49,989.28 |
| 1 | College of Computer, Math & Natural Sciences | CMNS-Computer Science | Prof | Zwicker, Matthias | $231,199.32 |
| 2 | College of Computer, Math & Natural Sciences | CMNS-Earth System Science Interdisciplinary Ce... | Res Sci | Zwally, H Jay | $83,232.00 |
| 3 | Robert H. Smith School of Business | BMGT-Accounting | Asst Prof | Zur, Emanuel | $220,700.05 |
| 4 | VP Student Affairs | VPSA-Transportation Services | Driver, Bus | Zuniga, Ruben Oswaldo | $35,702.19 |
| 5 | VP Student Affairs | VPSA-Transportation Services | Driver, Bus | Zuniga, Dina B | $35,702.19 |
| 6 | College of Arts & Humanities | ARHU-Linguistics | Asst Res Sci | Zukowski, Andrea L. | $66,911.62 |
| 7 | Universities at Shady Grove | USG-Shady Grove Center | Director | Zuknick, John | $107,100.00 |
| 8 | College of Agriculture & Natural Resources | AGNR-Environmental Science & Technology | Lecturer | Zucchetto, James John | $10,392.00 |
| 9 | Philip Merrill College of Journalism | JOUR-Philip Merrill College of Journalism | Lecturer | Zremski, Jerry | $7,220.96 |


In [132]:
# But we don't want just one response! We want them all!!
all_site_data = []

for i in range(1, 1062):
    # Request each page in turn and collect its rows
    url = 'https://api.dbknews.com/salary/year/2019?search=&sortby=employee&order=desc&page=' + str(i)
    data = requests.get(url = url).json()['data']
    all_site_data.extend(data)
    clear_output(wait=True)
    print('Request #', i, 'values', len(all_site_data),'/','1060')

Request # 1061 values 10604 / 10601
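
The loop above hard-codes the number of pages. A hedged alternative is to keep requesting pages until one comes back empty. The sketch below assumes the API returns an empty `data` list past the last page, which has not been verified here; `fetch_all_pages` and `fake_fetch` are made-up names, and the fake fetcher stands in for the real request so the example runs offline.

```python
import time

def fetch_all_pages(fetch_page, first_page=1, delay_seconds=0.0):
    """Collect rows page by page until a page comes back empty.

    fetch_page is any function that takes a page number and returns a list of rows.
    """
    rows = []
    page = first_page
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        rows.extend(batch)
        page += 1
        # A short pause between requests is kinder to the server
        time.sleep(delay_seconds)
    return rows

# Offline stand-in for the API: three pages of data, then empty pages.
fake_pages = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}

def fake_fetch(page):
    return fake_pages.get(page, [])

all_rows = fetch_all_pages(fake_fetch)
print(len(all_rows))
```

With the real site, `fetch_page` could be a small function wrapping the `requests.get(...).json()['data']` call from the cell above.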


In [133]:
# So we just got salary info for over 10 thousand faculty members!
# Let's store it in a table like before!
faculty_info = pd.DataFrame( all_site_data, columns = ['Division', 'Department', 'Title', 'Employee', 'Salary'])
faculty_info

| | Division | Department | Title | Employee | Salary |
|---|---|---|---|---|---|
| 0 | VP Research | VPR-UM Ctr Applied Pol Studies (UMCAPS) | Fac Res Asst | Zwiesler, Theodore W | $49,989.28 |
| 1 | College of Computer, Math & Natural Sciences | CMNS-Computer Science | Prof | Zwicker, Matthias | $231,199.32 |
| 2 | College of Computer, Math & Natural Sciences | CMNS-Earth System Science Interdisciplinary Ce... | Res Sci | Zwally, H Jay | $83,232.00 |
| 3 | Robert H. Smith School of Business | BMGT-Accounting | Asst Prof | Zur, Emanuel | $220,700.05 |
| 4 | VP Student Affairs | VPSA-Transportation Services | Driver, Bus | Zuniga, Ruben Oswaldo | $35,702.19 |
| 5 | VP Student Affairs | VPSA-Transportation Services | Driver, Bus | Zuniga, Dina B | $35,702.19 |
| 6 | College of Arts & Humanities | ARHU-Linguistics | Asst Res Sci | Zukowski, Andrea L. | $66,911.62 |
| 7 | Universities at Shady Grove | USG-Shady Grove Center | Director | Zuknick, John | $107,100.00 |
| 8 | College of Agriculture & Natural Resources | AGNR-Environmental Science & Technology | Lecturer | Zucchetto, James John | $10,392.00 |
| 9 | Philip Merrill College of Journalism | JOUR-Philip Merrill College of Journalism | Lecturer | Zremski, Jerry | $7,220.96 |


In [222]:
# If the above takes too long, try loading saved data:
faculty_info = pd.read_csv('faculty_data_output.csv')

In [None]:
# This is a lot of data, so let's save it to make sure we have it in the future
faculty_info.to_csv('faculty_data_output.csv')

- What do we do now? Let's try to get some faculty reviews to go along with that data!
![rate](imgs/rate1.png)

### Looking at the network info, this won't be as easy
- Whereas The Diamondback had a simple API that their site called to fill in data, this site sends us an HTML file with the data already filled in.

### What's the solution?
- Parse the HTML file!

In [135]:
# Let's import some actual scrapers now!
from bs4 import BeautifulSoup
import json

In [136]:
# Let's try with Kruskal first

target_url = 'https://www.ratemyprofessors.com/ShowRatings.jsp?tid=544717'
content = requests.get(target_url).content

# What does the site look like 
content

# OOF

b'\n\n\n\n\n\n\n<!DOCTYPE html>\n<!--[if lt IE 7]>      <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->\n<!--[if IE 7]>         <html class="no-js lt-ie9 lt-ie8"> <![endif]-->\n<!--[if IE 8]>         <html class="no-js lt-ie9"> <![endif]-->\n<!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]-->\n  <head>\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="google-site-verification" content="F8gUuqzRvOrAlhaGyP7aAuMs_Se8zK-98Ai2sNsIZEo"/>\n    <meta name="google-site-verification" content="hk1NnSbYuDC0Sppgbf7YIT-VxUiRbOVRdtqA4AmkGzM"/>\n    <meta name="google-site-verification" content="CKMC_IwvKoVbX1U8x1A9yzKABsSlop6qxfuDzwfV7Qs"/> <!--  relaunch qa -->\n    <meta name="google-site-verification" content="1D_3ZfAdMu4Tki8pFRj68YAqYot-paTOoDVzCTIJZZI"/> <!--  relaunch rmp -->\n    <meta name="google-signin-client_id" content="14147781149-i5ph5oqooelp3k0qfrb3jkk6vs4528ha.apps.googleusercontent.com">\n    \n      <script type="text/javascript">\n        \

In [137]:
# But Beautiful Soup can handle it!
soup = BeautifulSoup(content, 'html5lib')

In [138]:
# AND THERE WE HAVE IT!
# We just read in the page and searched for the specific number we needed!
element = soup.find('div', attrs = {'class':'grade'})
element.text

'2.7'

In [139]:
# Let's get some more!
elements = soup.find_all('div', attrs = {'class':'grade'})

# Overall quality | Would take again | Level of difficulty
[e.text.strip() for e in elements]

['2.7', '21%', '4.0']
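
To see roughly what the search for `div` elements with class `grade` is doing, here is a small sketch that uses only Python's built-in `html.parser` instead of Beautiful Soup. The `sample_html` string is made up for illustration (it is not the real page), `GradeParser` is a hypothetical name, and the parser is deliberately simplified: it does not handle `div`s nested inside a grade element.

```python
from html.parser import HTMLParser

class GradeParser(HTMLParser):
    """Collect the text inside every <div class="grade"> element."""

    def __init__(self):
        super().__init__()
        self.grades = []
        self._in_grade = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "grade" in classes:
            self._in_grade = True
            self.grades.append("")

    def handle_data(self, data):
        if self._in_grade:
            self.grades[-1] += data

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_grade = False

# Made-up HTML that mimics the structure searched for above
sample_html = """
<div class="rating"><div class="grade"> 2.7 </div></div>
<div class="rating"><div class="grade"> 21% </div></div>
<div class="rating"><div class="grade"> 4.0 </div></div>
"""

parser = GradeParser()
parser.feed(sample_html)
grades = [g.strip() for g in parser.grades]
print(grades)
```

Beautiful Soup does far more than this (it builds a full tree and copes with messy markup), which is why the notebook uses it for the real page.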

In [140]:
# Now let's use some function magic to get this result for any faculty member we want!

def get_query( name ):
    # You can dissect this function if you want, but just assume it works correctly and move on
    # It's basically just building an API path and doing some parsing
    name = (name.lower() + '+university of maryland').replace(' ','+')
    base_url = 'https://solr-aws-elb-production.ratemyprofessors.com//solr/rmp/select/?solrformat=true&rows=20&wt=json&json.wrf=noCB&callback=noCB&q=' + name + '&defType=edismax&qf=teacherfirstname_t%5E2000+teacherlastname_t%5E2000+teacherfullname_t%5E2000+autosuggest&bf=pow(total_number_of_ratings_i%2C1.7)&sort=score+desc&siteName=rmp&group=on&group.field=content_type_s&group.limit=20'

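    data = requests.get(url = base_url)
    # The response is wrapped as noCB(...), so [5:-1] strips the 'noCB(' prefix and ')' suffix before parsing
    raw_json_data =  json.loads(data.content.decode('utf-8')[5:-1])['grouped']['content_type_s']['groups'][0]['doclist']['docs'][0]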
    return (    raw_json_data['averageratingscore_rf'], 
                raw_json_data['averagehelpfulscore_rf'], 
                raw_json_data['averageeasyscore_rf'], 
                raw_json_data['schoolname_s'],
           )


In [141]:
    
# Gets query data directly!
rating, helpful, easy, school = get_query('Kruskal')

print('Faculty from', school, '. Rating of', rating, '. Helpfulness of', helpful, '. Difficulty of', easy)

Faculty from University of Maryland . Rating of 2.7 . Helpfulness of 2.6 . Difficulty of 4.0
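
The `[5:-1]` slice in `get_query` works because the response is JSON wrapped in a `noCB(...)` callback. A slightly more defensive version of that step could look like the sketch below. `strip_jsonp` is a hypothetical helper and the `sample` string is made up; only the `noCB` callback name comes from the URL above.

```python
import json

def strip_jsonp(text, callback="noCB"):
    """Remove a callback(...) wrapper, if present, and parse the JSON inside it."""
    text = text.strip()
    prefix = callback + "("
    if text.startswith(prefix) and text.endswith(")"):
        text = text[len(prefix):-1]
    return json.loads(text)

# Made-up response body with the same wrapper shape
sample = 'noCB({"grouped": {"count": 1}})'
parsed = strip_jsonp(sample)
print(parsed["grouped"]["count"])
```

Checking for the wrapper first means the function still works if the server ever returns plain JSON.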


## So what have we got now?

- All the professors and their pay
- The ability to look up ratings for most of them

## What now?

- Give each teacher a value score! (I do not endorse actually thinking of any faculty in this way, but let's just try it for fun.)

In [174]:
# Let's pull it together!
import math

In [181]:
# Searching by name, get results
names = faculty_info['Employee'].values
def search( name ):
    
    name = name.lower()
    options = [ a for a in names if name in a.lower()]
    
    print('Did search for', name, 'Results :', options)
    if len(options) == 0: return (False, None)
    
    print('Taking first entry info:')
    
    return (True, faculty_info.loc[faculty_info['Employee'] == options[0]])

found, data = search('Kruskal')
data

Did search for kruskal Results : ['Kruskal, Clyde P.']
Taking first entry info:


| | Division | Department | Title | Employee | Salary |
|---|---|---|---|---|---|
| 5490 | College of Computer, Math & Natural Sciences | CMNS-Computer Science | Assoc Prof | Kruskal, Clyde P. | $84,033.89 |


In [218]:
# Add in API Calls and print some information

def unified_info( name ):
    
    found, search_data = search( name )
    if not found: return
    
    rating, helpful, easy, school = get_query(name)
    
    # Salary saved as a string, converting it over to float
    money = float( search_data['Salary'].values[0][1:].replace(',',''))
    score = round(  money  / rating,2)
    
    print('-----')
    print(search_data['Employee'].values[0], 'makes', search_data['Salary'].values[0], 'dollars per year')
    print(search_data['Employee'].values[0], 'has a rating of', rating)
    print('That is a score of $', score, 'dollars / rating point')


In [219]:
unified_info('kruskal')

Did search for kruskal Results : ['Kruskal, Clyde P.']
Taking first entry info:
-----
Kruskal, Clyde P. makes $84,033.89 dollars per year
Kruskal, Clyde P. has a rating of 2.7
That is a score of $ 31123.66 dollars / rating point
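
The score calculation inside `unified_info` can also be pulled out into two small functions, which makes it easier to test on its own. This is a sketch, not a replacement for the cell above: `parse_salary` and `value_score` are hypothetical names, and returning `None` for a missing or zero rating is an added assumption to avoid dividing by zero.

```python
def parse_salary(text):
    """Convert a salary string such as '$84,033.89' to a float."""
    return float(text.replace("$", "").replace(",", ""))

def value_score(salary_text, rating):
    """Dollars per rating point, or None when the rating is missing or zero."""
    if not rating:
        return None
    return round(parse_salary(salary_text) / rating, 2)

print(value_score("$84,033.89", 2.7))
```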
