# First Steps in Data Scraping with BeautifulSoup

In this lab session, we will discuss how to get data from websites with BeautifulSoup. 

  * You will find BeautifulSoup [here](https://www.crummy.com/software/BeautifulSoup/). 
  * The documentation for BeautifulSoup is [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

## Installation

Install BeautifulSoup using your package manager, e.g., `pip install beautifulsoup4`. The cells below use the `lxml` parser, which is a separate package (`pip install lxml`).

## School rankings

School rankings are typically published on websites that offer little or no option to download the data.

We will scrape three webpages, which have already been downloaded to the working directory of this notebook.

Let's have a look at the files (please adapt the command to your operating system).

In [None]:
!ls *.html
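
If `ls` is not available on your system, the same listing can be produced from Python. The following is a minimal sketch; the helper name `list_html_files` is hypothetical and not part of the lab material.

```python
from pathlib import Path


def list_html_files(directory="."):
    """Return the sorted names of all .html files in `directory`."""
    return sorted(p.name for p in Path(directory).glob("*.html"))


# List the HTML files in the current working directory.
print(list_html_files())
```

Because it relies only on `pathlib`, this should behave the same on Windows, macOS, and Linux.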

## Preparation

We need to load a set of libraries.
  * Pandas for handling dataframes
  * BeautifulSoup to scrape webpages
  * `urllib` to handle URLs
  * `json` to handle JSON files
  * `re` to handle regular expressions

In [None]:
import pandas as pd
from bs4 import BeautifulSoup as bs
import urllib as url
import json
import re

## Scraping of ROI numbers from PayScale

We open the downloaded webpage that contains the ROI information. [Look here for the webpage online.](http://www.payscale.com/college-roi?page=130) 

In [None]:
soup = bs(open('payscale.html'), 'lxml')

Let's have a look at the 'soup' that we just created.

In [None]:
print(soup)

Let's find the ROI data.

In [None]:
roi_data = soup.find('script', string=re.compile('collegeRoiData'))  # 'string' replaces the deprecated 'text' argument

`soup.find` returns the first `<script>` element whose text contains the string `collegeRoiData`.
Let's have a look.

In [None]:
print(roi_data)
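
To see how this kind of search behaves in isolation, here is a small sketch on made-up HTML. The markup and the name `toy_soup` are assumptions for illustration, and the built-in `html.parser` is used so that the example does not depend on `lxml`.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical page with two script elements; only one mentions collegeRoiData.
html = """
<html><head>
<script>var unrelated = 1;</script>
<script>var collegeRoiData = [{"Name": "Example College"}];</script>
</head><body><p>Hello</p></body></html>
"""

toy_soup = BeautifulSoup(html, "html.parser")

# find() returns the first <script> whose text matches the pattern, or None.
script_tag = toy_soup.find("script", string=re.compile("collegeRoiData"))
print(script_tag.string)
```

If no element matches, `find` returns `None`, so it is worth checking the result before using `.string` on a real page.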

We see that the ROI data is embedded in the script as a JSON array. We write a small regular expression that extracts this JSON text.

In [None]:
json_text = re.search(r'collegeRoiData\s*=\s*(\[.*?\])', roi_data.string, flags=re.DOTALL | re.MULTILINE)

Let's have a look at the results.

In [None]:
print(json_text.group(1))
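
The regular expression can be tried on a short, made-up script string. This is only a sketch: the `ROI` field and the values are invented, and only the variable name `collegeRoiData` and the `Name` field are taken from the lab.

```python
import json
import re

# Hypothetical script content in the same shape as the PayScale page.
script_text = (
    'var collegeRoiData = [{"Name": "Example College", "ROI": 100000}, '
    '{"Name": "Sample University", "ROI": 250000}];\n'
    'var other = [1, 2];'
)

# The group captures everything from the first "[" to the first "]" after it.
match = re.search(r'collegeRoiData\s*=\s*(\[.*?\])', script_text, flags=re.DOTALL)
records = json.loads(match.group(1))
print(records)
```

Note that the non-greedy `.*?` stops at the first closing bracket. That is fine for a flat list of objects like this one, but it would cut the text short if a record itself contained a nested array.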

We can now attempt to read the JSON text as a table.

In [None]:
data = pd.read_json(json_text.group(1))

In [None]:
data.head()
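
Recent pandas releases warn when a literal JSON string is passed to `pd.read_json`. A possible workaround, sketched here on invented records, is to wrap the text in `io.StringIO` so that pandas treats it as a file-like object.

```python
import io

import pandas as pd

# Hypothetical JSON text; the real one comes from the regular expression match.
toy_json = (
    '[{"Name": "Example College", "ROI": 100000}, '
    '{"Name": "Sample University", "ROI": 250000}]'
)

# StringIO makes the string look like an open file to read_json.
toy_data = pd.read_json(io.StringIO(toy_json))
print(toy_data)
```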

**YOUR TASK:** Let's have a look at Santa Clara University.

In [None]:
print(data[data.Name =='Santa Clara University'])

Let's save everything as a CSV file.

In [None]:
data.to_csv('roi.csv', sep=';')
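
Since the file is written with `;` as the separator, the same separator has to be given when reading it back. The round trip below is a sketch on invented data; it writes to an in-memory buffer instead of `roi.csv` so that it leaves no file behind.

```python
import io

import pandas as pd

# Hypothetical table standing in for the ROI data.
toy = pd.DataFrame(
    {"Name": ["Example College", "Sample University"], "ROI": [100000, 250000]}
)

buffer = io.StringIO()
toy.to_csv(buffer, sep=";")  # same separator as in the lab
buffer.seek(0)

# index_col=0 restores the row index that to_csv wrote as the first column.
restored = pd.read_csv(buffer, sep=";", index_col=0)
print(restored)
```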

## The US News Rankings for Santa Clara University

Let's make a soup from the specific file. [Look here for the file online.](https://www.usnews.com/best-colleges/santa-clara-1326)

In [None]:
soup = bs(open('usnews_scu.html'), 'lxml')

Let's have a look.

In [None]:
print(soup)

We need to get the `<strong>` elements and the links.
We store each ranking in a dictionary.
We also create a dataframe to hold the various rankings.

In [None]:
ranks = dict()
scu_ranks = pd.DataFrame()

We want four elements:
  * the rank
  * the URL
  * the name of the ranking
  * the source

In [None]:
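for ranking in soup.find_all('div', {"style": "margin-left: 2rem;"}):
    ranks = dict()  # start from an empty dict so values do not leak between rankings
    for r in ranking.find_all('strong', string=re.compile('#')):
        ranks['Rank'] = r.text
    for link in ranking.find_all('a'):  # 'link' avoids shadowing the urllib alias 'url'
        ranks['URL'] = link.get('href')
        ranks['Name'] = link.get_text()  # the link text rather than a list of child nodes
        ranks['Source'] = 'US News'
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    scu_ranks = pd.concat([scu_ranks, pd.DataFrame([ranks])], ignore_index=True)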

Let's have a look at the results.

In [None]:
print(scu_ranks)
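
The extraction pattern can be checked on a tiny, made-up snippet. In this sketch the HTML, the example.com URLs, and the names `toy_soup` and `toy_ranks` are all assumptions; only the `style` value and the `#` pattern follow the loop above. The rows are collected in a list and turned into a dataframe once at the end.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup with two ranking blocks.
html = """
<div style="margin-left: 2rem;">
  <strong>#1</strong> in <a href="https://example.com/ranking-a">Hypothetical Ranking A</a>
</div>
<div style="margin-left: 2rem;">
  <strong>#12</strong> in <a href="https://example.com/ranking-b">Hypothetical Ranking B</a>
</div>
"""

toy_soup = BeautifulSoup(html, "html.parser")
rows = []
for block in toy_soup.find_all("div", {"style": "margin-left: 2rem;"}):
    row = {"Source": "Toy data"}
    rank_tag = block.find("strong", string=re.compile("#"))
    link_tag = block.find("a")
    if rank_tag is not None:
        row["Rank"] = rank_tag.get_text(strip=True)
    if link_tag is not None:
        row["URL"] = link_tag.get("href")
        row["Name"] = link_tag.get_text(strip=True)
    rows.append(row)

toy_ranks = pd.DataFrame(rows)
print(toy_ranks)
```

Building the dataframe from a list of dictionaries avoids growing it row by row, which tends to be slow for larger pages.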

## Schools in the 'Regional Universities West Ranking'

We get the data on the 'Regional Universities West' Ranking. [Look here for the file online.](https://www.usnews.com/best-colleges/rankings/regional-universities-west?_mode=table)

In [None]:
soup = bs(open('ruw.html'), 'lxml')

In [None]:
print(soup)

We want the following elements:
  * the rank of the school
  * the name of the school
  * the URL of the school
  * the location of the school

In [None]:
rows = []
for school in soup.find_all('tr', attrs={'data-child-index': True}):
    schools = dict()  # one fresh dict per table row
    schools['Rank'] = school.get('data-child-index')
    for school_name in school.find_all('a', attrs={'class': False}):
        schools['Name'] = school_name.get_text()
        schools['URL'] = 'https://www.usnews.com' + school_name.get('href')
    for location in school.find_all('div', string=re.compile(r'\w'), attrs={'class': 'text-small block-tight'}):
        schools['Location'] = location.get_text()
    rows.append(schools)
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
ruw = pd.DataFrame(rows)

Let's have a look at the results.

In [None]:
print(ruw)
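
Passing `True` as an attribute value selects every element that has the attribute, whatever its value. The sketch below shows this on an invented table; the school names, paths, and locations are hypothetical, and only the attribute and class names follow the loop above.

```python
from bs4 import BeautifulSoup

# Hypothetical table; the middle row has no data-child-index attribute.
html = """
<table>
  <tr data-child-index="1">
    <td>
      <a href="/best-colleges/example-college-0001">Example College</a>
      <div class="text-small block-tight">Example City, CA</div>
    </td>
  </tr>
  <tr><td>A row without data-child-index is skipped</td></tr>
  <tr data-child-index="2">
    <td>
      <a href="/best-colleges/sample-university-0002">Sample University</a>
      <div class="text-small block-tight">Sample Town, OR</div>
    </td>
  </tr>
</table>
"""

toy_soup = BeautifulSoup(html, "html.parser")
toy_rows = []
for row in toy_soup.find_all("tr", attrs={"data-child-index": True}):
    link = row.find("a")
    place = row.find("div", attrs={"class": "text-small block-tight"})
    toy_rows.append({
        "Rank": row.get("data-child-index"),
        "Name": link.get_text(strip=True),
        "URL": "https://www.usnews.com" + link.get("href"),
        "Location": place.get_text(strip=True),
    })

print(toy_rows)
```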

In [None]:
# n=1 splits at the first comma only; expand=True returns the parts as two columns
ruw[['City', 'State']] = ruw['Location'].str.split(',', n=1, expand=True)
ruw['City'] = ruw['City'].str.replace('\n', '')
ruw['State'] = ruw['State'].str.replace('\n', '')

In [None]:
ruw.head()
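
The split into city and state can be tried on a couple of invented locations. This sketch assumes each location contains one comma and uses `str.strip()` to remove surrounding whitespace, including newlines; the place names are hypothetical.

```python
import pandas as pd

# Hypothetical locations with the stray newlines typical of scraped text.
toy = pd.DataFrame({"Location": ["\nExample City, CA\n", "\nSample Town, OR\n"]})

# Split once at the first comma and spread the two parts over two columns.
toy[["City", "State"]] = toy["Location"].str.split(",", n=1, expand=True)
toy["City"] = toy["City"].str.strip()
toy["State"] = toy["State"].str.strip()
print(toy[["City", "State"]])
```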

In [None]:
ruw.to_csv('ruw.csv', sep=';')