# First Steps in Data Scraping with BeautifulSoup

In this lab session, we will discuss how to get data from websites with BeautifulSoup. 

  * You will find BeautifulSoup [here](https://www.crummy.com/software/BeautifulSoup/). 
  * The documentation for BeautifulSoup is [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

## Installation

Install BeautifulSoup using your package manager, e.g., `pip install beautifulsoup4`

## School rankings

School rankings are typically published on websites that offer little or no way to download the underlying data.

We will scrape three webpages, which have already been downloaded to the working directory of this notebook.

Let's have a look at the files (please adapt the command to your operating system).

In [193]:
!ls *.html

payscale.html ruw.html      scu.html


## Preparation

We need to load a set of libraries:
  * Pandas for handling dataframes
  * BeautifulSoup to scrape webpages
  * `urllib` to handle URLs
  * `json` to handle JSON data
  * `re` to handle regular expressions

In [194]:
import pandas as pd
from bs4 import BeautifulSoup as bs
import urllib as url
import json
import re

## Scraping of ROI numbers from PayScale

We open the downloaded webpage that contains the ROI information 

In [195]:
soup = bs(open('payscale.html'), 'lxml')

Let's have a look at the 'soup' that we just created.

In [197]:
print(soup)

<!DOCTYPE html>
<html lang="en">
<head id="ctl00_m_htmlHeader"><title>
	PayScale College ROI Report: Best Value Colleges
</title><meta content="IE=edge" http-equiv="x-ua-compatible"/>
<link href="//cdn-payscale.com" rel="dns-prefetch"/>
<link href="//cdn-payscale.com" rel="preconnect"/>
<link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
<meta content="" property="fb:app_id"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/><link href="https://cdn-payscale.com/images/favicon/apple-touch-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
<link href="https://cdn-payscale.com/images/favicon/apple-touch-icon-60x60.png" rel="apple-touch-icon" sizes="60x60"/>
<link href="https://cdn-payscale.com/images/favicon/apple-touch-icon-72x72.png" rel="apple-touch-icon" sizes="72x72"/>
<link href="https://cdn-payscale.com/images/favicon/apple-touch-icon-76x76.png" rel="apple-touch-icon" sizes="76x76"/>
<link href="https://cdn-payscale.com/images/favicon/

Let's find the ROI data.

In [202]:
# 'string=' is the current name of the older 'text=' argument
roi_data = soup.find('script', string=re.compile('collegeRoiData'))

`soup.find` returns the first `<script>` element whose text matches the pattern `collegeRoiData`, or `None` if there is no match.
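
To see this behaviour in isolation, here is a minimal sketch on a toy page. The HTML string and the names `soup_demo`, `match`, and `no_match` are made up for illustration, and we use Python's built-in `html.parser` so that the snippet runs without `lxml`.

```python
import re
from bs4 import BeautifulSoup

# Toy page (hypothetical): two <script> tags, only one mentions collegeRoiData.
html = """
<html><head>
<script>window.pageType = 1;</script>
<script>var collegeRoiData = [{"Name": "Example College"}];</script>
</head><body></body></html>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# find() returns the first <script> whose text matches the pattern ...
match = soup_demo.find("script", string=re.compile("collegeRoiData"))
# ... and None when no <script> matches.
no_match = soup_demo.find("script", string=re.compile("doesNotExist"))

print(match.string)
print(no_match)
```

Checking the result for `None` before using it avoids a confusing `AttributeError` when a page layout changes.
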
Let's have a look.

In [206]:
print(roi_data)

<script>
        
    window.pageType = 1;
    
    window.fbAsyncInit = function() {
        FB.init({
            appId                 : '355988364493140',
            status                : true,
            cookie                : true,
            xfbml                 : true,
            oauth                 : true,
            frictionlessRequests  : true,
            version               : 'v2.7'
        });
    };

    (function(d, s, id){
        var js, fjs = d.getElementsByTagName(s)[0];
        if (d.getElementById(id)) {return;}
        js = d.createElement(s); js.id = id;
        js.src = "//connect.facebook.net/en_US/sdk.js";
        fjs.parentNode.insertBefore(js, fjs);
    }(document, 'script', 'facebook-jssdk'));
    
        var metadata = {"key":["Name","Tuition"],"thenSortBy":"Name","columns":[{"text":"Rank","data":"RankText","sortData":"Rank","defaultSortOrder":"a","mobile":"hide"},{"text":"School Name","data":"Name","defaultSortOrder":"a","preUrl":"/research

We see that the ROI data is stored as a JSON array inside the script. We write a small regular expression that extracts this array.
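
The idea can be tried on a toy string first. The sketch below uses a made-up `script_text` that only imitates the shape of the PayScale script, and the names `records` and `records_alt` are ours. It also shows one possible alternative, `json.JSONDecoder().raw_decode`, which lets the JSON parser find the end of the array instead of relying on the first `]`.

```python
import json
import re

# Toy script text (hypothetical), shaped like the PayScale snippet.
script_text = '''
    var metadata = {"key": ["Name", "Tuition"]};
    var collegeRoiData = [{"Name": "A", "RoiOn": "$93,000"},
                          {"Name": "B", "RoiOn": "($56,200)"}];
'''

# Capture from the '[' after "collegeRoiData =" up to the first ']'.
# The non-greedy .*? stops at the FIRST closing bracket, so this only works
# when the records contain no nested arrays and no ']' characters.
m = re.search(r'collegeRoiData\s*=\s*(\[.*?\])', script_text, flags=re.DOTALL)
records = json.loads(m.group(1))

# Alternative that tolerates nested brackets: start at the '[' that follows
# the variable name and let the JSON decoder determine where the array ends.
start = script_text.index('[', script_text.index('collegeRoiData'))
records_alt, end = json.JSONDecoder().raw_decode(script_text, start)

print(len(records), records[0]["Name"])
```

Both approaches give the same list of dictionaries here; the regular expression is shorter, while `raw_decode` is the safer choice if a record ever contains a list.
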

In [207]:
# Capture the JSON array assigned to collegeRoiData; the name matches the cells below
json_text = re.search(r'collegeRoiData\s*=\s*(\[.*?\])', roi_data.string, flags=re.DOTALL | re.MULTILINE)

Let's have a look at the results.

In [208]:
print(json_text.group(1))

[{"Name":"Adams State College","CostOn":"$80,400","CostOff":"$83,400","RoiOn":"$93,000","RoiOff":"$90,000","RoiAidOn":"$126,000","RoiAidOff":"$123,000","AnnRoiOn":"3.9%","AnnRoiOff":"3.7%","AnnRoiAidOn":"6.7%","AnnRoiAidOff":"6.3%","GradRate":"21%","GradTime":"5 Years","Grant":"92%","LoanAmt":"$25,500","Url":"Adams_State_College/Salary","Img":"Adams State College_50px.jpg","Tuition":"(In-State)"},{"Name":"Adams State College","CostOn":"$123,000","CostOff":"$126,000","RoiOn":"$50,100","RoiOff":"$47,100","RoiAidOn":"$82,700","RoiAidOff":"$79,700","AnnRoiOn":"1.7%","AnnRoiOff":"1.6%","AnnRoiAidOn":"3.3%","AnnRoiAidOff":"3.1%","GradRate":"21%","GradTime":"5 Years","Grant":"92%","LoanAmt":"$25,500","Url":"Adams_State_College/Salary","Img":"Adams State College_50px.jpg","Tuition":"(Out-of-State)"},{"Name":"Adelphi University","CostOn":"$194,000","CostOff":"$194,000","RoiOn":"$292,000","RoiOff":"$293,000","RoiAidOn":"$362,000","RoiAidOff":"$362,000","AnnRoiOn":"4.7%","AnnRoiOff":"4.7%","AnnRo

We can now attempt to read the JSON text into a table.

In [209]:
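from io import StringIO

# Recent pandas versions expect a file-like object rather than a literal JSON string
data = pd.read_json(StringIO(json_text.group(1)))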

In [211]:
data.head()

Unnamed: 0,AnnRoiAidOff,AnnRoiAidOn,AnnRoiOff,AnnRoiOn,CostOff,CostOn,GradRate,GradTime,Grant,Img,LoanAmt,Name,RoiAidOff,RoiAidOn,RoiOff,RoiOn,Tuition,Url
0,6.3%,6.7%,3.7%,3.9%,"$83,400","$80,400",21%,5 Years,92%,Adams State College_50px.jpg,"$25,500",Adams State College,"$123,000","$126,000","$90,000","$93,000",(In-State),Adams_State_College/Salary
1,3.1%,3.3%,1.6%,1.7%,"$126,000","$123,000",21%,5 Years,92%,Adams State College_50px.jpg,"$25,500",Adams State College,"$79,700","$82,700","$47,100","$50,100",(Out-of-State),Adams_State_College/Salary
2,7.1%,7.1%,4.7%,4.7%,"$194,000","$194,000",64%,4 Years,93%,Adelphi University_50px.png,"$33,300",Adelphi University,"$362,000","$362,000","$293,000","$292,000",(Private),Adelphi_University/Salary
3,2.2%,1.4%,-1.5%,-1.9%,"$164,000","$177,000",57%,4 Years,98%,no_logo.png,"$30,800",Adrian College,"$42,700","$29,300","($42,700)","($56,200)",(Private),Adrian_College/Salary
4,5.7%,3.3%,0.3%,-0.6%,"$159,000","$192,000",73%,4 Years,100%,no_logo.png,"$28,300",Agnes Scott College,"$114,000","$81,000","$9,700","($23,100)",(Private),Agnes_Scott_College/Salary


**YOUR TASK:** Let's have a look at Santa Clara University.

In [12]:
print(data[data.Name == 'Santa Clara University'])

     AnnRoiAidOff AnnRoiAidOn AnnRoiOff AnnRoiOn   CostOff    CostOn GradRate  \
1117         8.9%        8.8%      6.4%     6.4%  $241,000  $243,000      85%   

     GradTime Grant                              Img  LoanAmt  \
1117  4 Years   71%  Santa Clara University_50px.png  $29,400   

                        Name RoiAidOff  RoiAidOn    RoiOff     RoiOn  \
1117  Santa Clara University  $680,000  $679,000  $590,000  $589,000   

        Tuition                            Url  
1117  (Private)  Santa_Clara_University/Salary  


Let's save everything as a CSV file.

In [212]:
data.to_csv('roi.csv', sep=';')

## The US News Rankings for Santa Clara University

Let's make a soup from the specific file.

In [213]:
# The directory listing above shows the file as 'scu.html'
soup = bs(open('scu.html'), 'lxml')

Let's have a look.

In [215]:
print(soup)

<!DOCTYPE html>
<html class="no-js" lang="">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="Find everything you need to know about Santa Clara, including tuition &amp; financial aid, student life, application info, academics &amp; more." name="description"/>
<meta content="Santa Clara, bestcolleges, directory, nationaluniversities, profile" name="keywords"/>
<meta content="origin" name="referrer"/>
<meta content="Best Colleges" name="site"/>
<meta content="Directory/Regional Universities West" name="zone"/>
<meta content="" name="robots"/>
<link data-test-id="relCanonical" href="https://www.usnews.com/best-colleges/santa-clara-1326" rel="canonical"/>
<title>Santa Clara University | Santa Clara University - Profile, Rankings and Data | Santa Clara | US News Best Colleges</title>
<script type="text/javascript">
            window.usnFirstByteTime = new Date();
        </script>
<script id="usn-js-gpt" src="//www.googlet

We need the `<strong>` elements, which hold the ranks, and the links, which hold the name and URL of each ranking.
We store each ranking in a dictionary.
We also create a dataframe to hold the various rankings.

In [216]:
ranks = dict()
scu_ranks = pd.DataFrame()

We want four elements:
  * the rank
  * the URL
  * the Name of the Ranking
  * the source

In [217]:
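for ranking in soup.find_all('div', {"style": "margin-left: 2rem;"}):
    # The rank is the <strong> element whose text contains '#'
    for r in ranking.find_all('strong', string=re.compile('#')):
        ranks['Rank'] = r.text
    # 'link' avoids shadowing the urllib alias 'url' imported earlier
    for link in ranking.find_all('a'):
        ranks['URL'] = link.get('href')
        ranks['Name'] = link.contents  # a list of child nodes, hence the brackets in the output
        ranks['Source'] = 'US News'
    # DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
    scu_ranks = pd.concat([scu_ranks, pd.DataFrame([ranks])], ignore_index=True)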

Let's have a look at the results.

In [218]:
print(scu_ranks)

                           Name Rank   Source  \
0  [Regional Universities West]   #2  US News   
1  [Best Colleges for Veterans]   #2  US News   
2           [Business Programs]  #63  US News   

                                                 URL  
0  https://www.usnews.com/best-colleges/rankings/...  
1  https://www.usnews.com/best-colleges/rankings/...  
2  https://www.usnews.com/best-colleges/rankings/...  
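
The same pattern can be tried on a self-contained snippet. The sketch below is a variation rather than the cell above: the HTML and the `example.com` URLs are invented, the rows are collected in a plain list and turned into a dataframe once at the end, and `get_text` is used so that `Name` holds a string instead of a list.

```python
import re
import pandas as pd
from bs4 import BeautifulSoup

# Toy markup (hypothetical) imitating the blocks matched by the style attribute.
html = """
<div style="margin-left: 2rem;">
  <strong>#2</strong> in <a href="https://example.com/west">Regional Universities West</a>
</div>
<div style="margin-left: 2rem;">
  <strong>#63</strong> in <a href="https://example.com/business">Business Programs</a>
</div>
"""

soup_demo = BeautifulSoup(html, "html.parser")
rows = []
for block in soup_demo.find_all("div", {"style": "margin-left: 2rem;"}):
    rank_tag = block.find("strong", string=re.compile("#"))
    link = block.find("a")
    if rank_tag is None or link is None:
        continue  # skip blocks that do not look like a ranking entry
    rows.append({
        "Rank": rank_tag.get_text(strip=True),
        "URL": link.get("href"),
        "Name": link.get_text(strip=True),  # plain string instead of a list
        "Source": "US News",
    })

# Build the dataframe in one step from the collected rows.
ranks_demo = pd.DataFrame(rows)
print(ranks_demo)
```

Creating a fresh dictionary per block also avoids carrying a stale value over from the previous ranking when one of the elements is missing.
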


## Schools in the 'Regional Universities West Ranking'

We get the data on the 'Regional Universities West' Ranking.

We want the following elements:
  * the rank of the school
  * the name of the school
  * the URL of the school
  * the location of the school
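
As a starting point, here is a rough sketch of how these four elements could be collected. The markup, the class names (`school`, `rank`, `name`, `location`), and the `example.com` URLs are pure assumptions; the real `ruw.html` will almost certainly look different, so inspect the file and adapt the selectors.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: tags and class names are invented for this sketch.
html = """
<ol>
  <li class="school">
    <span class="rank">#1</span>
    <a class="name" href="https://example.com/school-a">School A</a>
    <span class="location">City A, CA</span>
  </li>
  <li class="school">
    <span class="rank">#2</span>
    <a class="name" href="https://example.com/school-b">School B</a>
    <span class="location">City B, WA</span>
  </li>
</ol>
"""

soup_demo = BeautifulSoup(html, "html.parser")
rows = []
for item in soup_demo.find_all("li", class_="school"):
    link = item.find("a", class_="name")
    rows.append({
        "Rank": item.find("span", class_="rank").get_text(strip=True),
        "Name": link.get_text(strip=True),
        "URL": link.get("href"),
        "Location": item.find("span", class_="location").get_text(strip=True),
    })

# One row per school, one column per requested element.
ruw_demo = pd.DataFrame(rows)
print(ruw_demo)
```
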

Let's have a look at the results.

Clean the results.
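
What needs cleaning depends on what the scrape returns. The sketch below uses an invented dataframe `raw` with blemishes that scraped text often has (stray whitespace, a `#` in front of the rank, city and state in one field); the column names are assumptions, not the actual columns of the ranking.

```python
import pandas as pd

# Toy data (hypothetical) with typical blemishes of scraped text.
raw = pd.DataFrame({
    "Rank": ["#1", " #2 ", "#2"],
    "Name": ["School A\n", "  School B", "School C"],
    "Location": ["City A, CA", "City B, WA", "City C, OR"],
})

clean = raw.copy()

# Strip surrounding whitespace from every text column.
for col in ["Rank", "Name", "Location"]:
    clean[col] = clean[col].str.strip()

# Turn "#2" into the integer 2 so that the column sorts numerically.
clean["Rank"] = clean["Rank"].str.lstrip("#").astype(int)

# Split "City, ST" into two separate columns.
clean[["City", "State"]] = clean["Location"].str.split(", ", n=1, expand=True)

print(clean)
```
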

Save the results in a CSV file.

In [236]:
ruw.to_csv('ruw.csv', sep=';')