<center> <img src="https://github.ccs.neu.edu/caglar/DS3000/blob/master/img/ds3000.png?raw=true"> </center>

## Outline
1. <a href='#1'>Reading Web Resources from URLs</a>
2. <a href='#2'>Web Scraping using BeautifulSoup</a>
3. <a href='#3'>Scraping Specific Tags from Webpages</a>
4. <a href='#4'>Scraping Web Pages by Tags and Attributes</a>
5. <a href='#5'>Scraping Child Tags under a Parent Tag</a>
6. <a href='#6'>Storing the Scraped Data</a>
7. <a href='#7'>Web Crawling</a>
8. <a href='#8'>More on Web Scraping</a>


## Web Scraping
* Using programs or scripts that browse websites the way a person would, examine their content, and extract data from them
* Also called web spidering, web crawling, web harvesting, or web data extraction

## Why Scrape?
* Pull your dataset from a website when your data is not readily available
* Extract different pieces of information from online resources when working with non-traditional datasets (e.g., social media posts)

<a id="1"></a>

## 1. Reading Web Resources from URLs
* The **`urllib`** module allows you to read data from URLs

In [1]:
import urllib.request as urllib
html = urllib.urlopen("https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/")
print(html.read())

b'<!DOCTYPE html>\n<!--[if (lte IE 9) ]><html lang="en" class="no-js oldie"><![endif]-->\n<!--[if (gt IE 9)|!(IE)]><!--><html lang="en" class="no-js"><!--<![endif]-->\n<head>\n\t<meta http-equiv="X-UA-Compatible" content="IE=edge">\n\t<meta name="viewport" content="width=device-width, initial-scale=1">\n\t<script src="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/lib/modernizr.min.js"></script>\n\t<link rel="shortcut icon" href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/favicon.ico" type="image/x-icon" />\n\t<meta name="apple-mobile-web-app-title" content="NU Khoury">\n\t<link rel="apple-touch-icon" href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon.png" />\n\t<link rel="apple-touch-icon-precomposed" href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon.png" />\n\t<link rel="apple-touch-icon" sizes="180x180" href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_them

### 1.1. urlopen() function
* Opens a URL, which can be a string or Request object
* Returns a file-like object containing the contents of the URL
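A minimal sketch of the two call styles: `urlopen()` is given a plain string first and then a `Request` object carrying a custom header. To stay runnable offline it reads from a `data:` URL built from a made-up HTML string; the page content and the `ds3000-demo/0.1` User-Agent value are assumptions for illustration only.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Hypothetical page content, embedded in a data: URL so no network is needed
page = "<html><body><h1>Hello, DS3000</h1></body></html>"
url = "data:text/html;charset=utf-8," + quote(page)

# 1) Pass the URL as a plain string
with urlopen(url) as response:
    from_string = response.read()

# 2) Pass a Request object, which can also carry headers
request = Request(url, headers={"User-Agent": "ds3000-demo/0.1"})
with urlopen(request) as response:
    from_request = response.read()

# Both calls return the same bytes
print(from_string == from_request)
```

Some sites reject requests that carry no browser-like User-Agent, which is a common reason to build a `Request` object instead of passing the bare string.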

In [2]:
import urllib.request as urllib
html = urllib.urlopen("https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/")
print(html.read())


### 1.2. read() method
* Reads the entire content of the response object returned by `urlopen()` and returns it as bytes (hence the `b'...'` prefix in the output)
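A small sketch of what `read()` hands back, assuming a made-up page served from a `data:` URL so that it runs without a network connection. It shows that the result is `bytes`, that it can be decoded into a `str` using the charset announced in the response headers, and that a second `read()` on the same response returns nothing.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical content with a non-ASCII character
page = "<p>Café menu</p>"
url = "data:text/html;charset=utf-8," + quote(page)

with urlopen(url) as response:
    raw = response.read()                                         # bytes
    charset = response.headers.get_content_charset() or "utf-8"  # from Content-Type
    leftover = response.read()                                    # already consumed

text = raw.decode(charset)                                        # bytes -> str
print(type(raw).__name__, type(text).__name__)
print(text)
```

Because the response behaves like a file, store the result of `read()` in a variable if you need it more than once.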

In [3]:
import urllib.request as urllib
html = urllib.urlopen("https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/")
print(html.read())


<a id="2"></a>

## 2. Web Scraping using BeautifulSoup
* A common web scraping library
* Helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects
* Need to install the library before you can use it in Python
    * pip install beautifulsoup4 (run this in Anaconda prompt)
    * Full documentation available at https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [4]:
#imports the BeautifulSoup class from the bs4 library
from bs4 import BeautifulSoup

In [5]:
import urllib.request as urllib
from bs4 import BeautifulSoup

html = urllib.urlopen("https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/")
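#"html.parser" selects Python's built-in parser; without it bs4 guesses one and prints a warning
soup = BeautifulSoup(html.read(), "html.parser")

### soup = BeautifulSoup(html.read(), "html.parser")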
* Transforms the HTML content into a BeautifulSoup object, called soup
* The BeautifulSoup object retains the general structure of a web page:
    * `<html></html>`
    * `<head></head>`
    * `<body></body>`
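As a sketch of how that structure can be walked, the snippet below parses a tiny hand-written page (an assumption for illustration, not the Khoury page) and reaches nested tags with dotted attribute access.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page used only for illustration
html_doc = """
<html>
  <head><title>Faculty</title></head>
  <body><main><h1>Tenured Faculty</h1></main></body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

print(soup.head.title)         # the <title> tag inside <head>
print(soup.body.main.h1)       # dotted access walks down the tree
print(soup.body.main.h1.name)  # the tag's name as a string
```

Dotted access always returns the first matching tag at each level, so it works best for elements that appear once, such as `<head>`, `<body>`, or a page's single `<h1>`.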

In [6]:
soup.html

<html class="no-js" lang="en"><!--<![endif]-->
<head>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<script src="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/lib/modernizr.min.js"></script>
<link href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<meta content="NU Khoury" name="apple-mobile-web-app-title"/>
<link href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon.png" rel="apple-touch-icon"/>
<link href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon.png" rel="apple-touch-icon-precomposed"/>
<link href="https://www.khoury.northeastern.edu/wp-content/themes/ccis_theme/img/touchicon-180.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="https://fast.fonts.com/cssapi/cac43e8c-6965-44df-b8ca-9784607a3b53.css" rel="stylesheet" type="text/css"/>

In [None]:
soup.head

In [None]:
soup.body

In [None]:
soup.body.main

In [None]:
soup.body.main.h1

* What if you wanted to retrieve the text contained in `<h1> </h1>`?

### 2.1. get_text() method
* **`get_text()`** strips all tags and returns a single string containing only the text in between them


In [None]:
soup.body.main.h1.get_text()

### 2.1. get_text() method cont'd
* Call .get_text() immediately before you print, store, or manipulate your final data
* A lot easier to find what you’re looking for in a BeautifulSoup object than in a block of text
* Try to preserve the tag structure of a document as long as possible
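A short sketch of why the tag structure is worth keeping until the last step, using a made-up fragment. While `heading` is still a tag it can be searched and navigated; once `get_text()` has run, only a plain string is left. The `separator` and `strip` arguments are optional parameters of `get_text()` that tidy the whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with stray whitespace and a nested tag
html_doc = "<main><h1> Faculty </h1><p>Carla <b>Brodley</b></p></main>"
soup = BeautifulSoup(html_doc, "html.parser")

main = soup.main
heading = main.h1                                  # still a Tag: can be searched further
plain = main.get_text()                            # plain str, whitespace kept as-is
tidy = main.get_text(separator=" ", strip=True)    # pieces stripped, joined by spaces

print(repr(plain))
print(repr(tidy))
```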

In [None]:
soup.body.main

In [None]:
soup.body.main.get_text()

### 2.2. get() method
* Retrieves an attribute of a tag
* **tagName.get("attributeName")**

In [None]:
soup.body.main.a

In [None]:
soup.body.main.a.get("href")

### 2.3. attrs attribute
* Returns a dictionary of the attributes of a tag
* **tagName.attrs**

In [None]:
soup.body.main.a.attrs

* **attrs** can also be used to retrieve attributes of a tag:

In [None]:
soup.body.main.a.attrs["href"]

In [None]:
soup.body.main.a.get_text()
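The two ways of reading attributes differ when the attribute is missing. The sketch below uses a made-up link tag: `get()` returns `None` (or a default you supply), while indexing `attrs` raises `KeyError`. It also shows that bs4 stores the multi-valued `class` attribute as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical link tag for illustration
html_doc = '<a href="/people/jane-doe/" class="btn primary">Jane Doe</a>'
link = BeautifulSoup(html_doc, "html.parser").a

print(link.attrs)                  # dictionary of all attributes
print(link.get("href"))            # value of an existing attribute
print(link.get("title"))           # missing attribute -> None
print(link.get("title", "n/a"))    # missing attribute with a default

try:
    link.attrs["title"]
except KeyError:
    print("attrs[...] raises KeyError for a missing attribute")
```

Prefer `get()` when scraping pages you do not control, since a missing attribute then does not stop the loop.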

### 2.4. find() method
* Allows you to search through an HTML page and find a specific tag
* soup_name.find("tagName")
* Returns the first occurrence of the tag
* Returns None if the tag/attribute does not exist

In [None]:
soup.find("title")

In [None]:
soup.find("title").get_text()
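Because `find()` returns `None` when nothing matches, chaining `.get_text()` directly onto it fails with an `AttributeError` on pages that lack the tag. A minimal sketch of a guarded lookup, using a made-up document with no `<footer>`:

```python
from bs4 import BeautifulSoup

# Hypothetical document that has a <title> but no <footer>
soup = BeautifulSoup("<html><head><title>Faculty</title></head></html>", "html.parser")

title_tag = soup.find("title")
footer_tag = soup.find("footer")   # no match -> None

# Check for None before calling methods on the result
title = title_tag.get_text() if title_tag is not None else ""
footer = footer_tag.get_text() if footer_tag is not None else ""

print(repr(title), repr(footer))
```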

<a id="3"></a>

## 3. Scraping Specific Tags from Webpages
* **`find_all(tagName)`** returns **a list of all the tags** found within the page.

#### Let's extract all the links found on the Khoury Faculty page
* Links are placed in `<a>` tags 
* A typical link in HTML looks like this:
    * Backend: `<a href="https://www.khoury.northeastern.edu/people/carla-brodley/"> Carla E. Brodley </a>`
    * Frontend: <a href="https://www.khoury.northeastern.edu/people/carla-brodley/"> Carla E. Brodley </a>

In [None]:
import urllib.request as urllib
from bs4 import BeautifulSoup

html = urllib.urlopen("https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/")
soup = BeautifulSoup(html.read(), "html.parser")

a_tags = soup.find_all("a")

In [None]:
a_tags

In [None]:
for link in a_tags:
    href = link.get('href')
    #some <a> tags have no href attribute, in which case get() returns None
    if href:
        print(href)

* What if you just wanted to scrape the hyperlinks, not email addresses or phone numbers?

In [None]:
for link in a_tags:
    href = link.get("href")
    #keeps http:// and https:// links; skips mailto:, tel:, and missing hrefs
    if href and href.startswith("http"):
        print(href)
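Pages often mix absolute links, relative links, `mailto:` and `tel:` links, and anchors with no `href` at all. The sketch below, built on a made-up fragment and a hypothetical base URL (`https://example.edu/faculty/`), resolves relative paths with `urljoin()` and keeps only web links by checking the URL scheme.

```python
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

base_url = "https://example.edu/faculty/"   # hypothetical page address
html_doc = """
<a href="https://example.edu/people/a/">A</a>
<a href="/people/b/">B</a>
<a href="mailto:a@example.edu">Email</a>
<a href="tel:+15555550100">Phone</a>
<a name="top">No href</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

web_links = []
for link in soup.find_all("a", href=True):        # href=True skips <a> tags without href
    absolute = urljoin(base_url, link["href"])    # resolves relative paths against the page
    if urlparse(absolute).scheme in ("http", "https"):
        web_links.append(absolute)

print(web_links)
```

Checking the parsed scheme instead of a string prefix also catches relative links, which a `startswith("http")` test would drop.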

<a id="4"></a>

## 4. Scraping Web Pages by Tags and Attributes
* Web pages use tags and attributes to style and format pages.
* **`find_all()`** method allows you to search through a web page and extract useful information
* `find_all(tagName, tagAttributes)`
    * Looks through a tag’s descendants and retrieves all descendants that match your filters
    * Returns an empty list if no tag matches the filters


* Let's retrieve the faculty names from the page:

<img src="res/html_tree.png" />

In [None]:
import urllib.request as urllib
from bs4 import BeautifulSoup

html = urllib.urlopen("https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/")
soup = BeautifulSoup(html.read(), "html.parser")

#retrieves all h3 tags with class = "person-name"
faculty_list = soup.find_all("h3", {"class":"person-name"})

In [None]:
faculty_list

### find_all() method calls
* Both calls below retrieve all h3 tags with class = "person-name"

In [None]:
faculty_list = soup.find_all("h3", {"class":"person-name"})

In [None]:
faculty_list = soup.find_all("h3", class_="person-name")
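A quick check of that equivalence on a made-up fragment (the names and the extra `featured` class are assumptions for illustration). It also shows that a class filter matches tags carrying additional classes, and that a filter with no matches yields an empty list.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration
html_doc = """
<h3 class="person-name">Ada Lovelace</h3>
<h3 class="person-name featured">Alan Turing</h3>
<h3 class="section-title">Staff</h3>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_dict = soup.find_all("h3", {"class": "person-name"})     # attribute dictionary
by_keyword = soup.find_all("h3", class_="person-name")      # class_ keyword
no_match = soup.find_all("h3", class_="does-not-exist")     # nothing matches

names = [tag.get_text() for tag in by_dict]
print(names)
print(len(no_match))
```

The keyword is spelled `class_` because `class` is a reserved word in Python.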

* Now that we have all h3 tags stored in a list, we can extract the text content contained in `<h3></h3>`
* Use **get_text()**

In [None]:
import urllib.request as urllib
from bs4 import BeautifulSoup

html = urllib.urlopen("https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/")
soup = BeautifulSoup(html.read(), "html.parser")

#retrieves all h3 tags with class = "person-name"
faculty_list = soup.find_all("h3", {"class":"person-name"})

#retrieves the text contained in each prof's h3 tag
for prof in faculty_list:
    print(prof.get_text().strip())

<a id="5"></a>

  
## 5. Scraping Child Tags under a Parent Tag
* Let's create our own record of faculty names, titles, webpage links, and profile picture URLs

<img src = "res/parent_child_tags.png" />

<center><img src = "res/khoury_grid_item.png" /></center>

In [None]:
import urllib.request as urllib
from bs4 import BeautifulSoup

html = urllib.urlopen("https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/")
soup = BeautifulSoup(html.read(), "html.parser")

#retrieves all <div class = "grid-item"> tags
faculty_divs = soup.find_all("div", class_="grid-item")
faculty_divs

### 5.1. Let's extract faculty names
* All names are marked with `<h3 class = "person-name">`
* Because there is one `<h3>` tag under `<div class = "grid-item">` we can use the **find()** method to scrape it
* Call **get_text()** on it to get the content
* Also need to strip whitespace using **strip()**

In [None]:
faculty_divs = soup.find_all("div", class_="grid-item")

for prof in faculty_divs:
    fname = prof.find("h3").get_text().strip()
    print(fname)

### 5.2. Similarly we can get the title of each faculty member
* Retrieve the `<p class="position">` tag under `<div class = "grid-item">`

In [None]:
faculty_divs = soup.find_all("div", class_="grid-item")

for prof in faculty_divs:   
    fname = prof.find("h3").get_text().strip()
    ftitle= prof.find("p","position").get_text().strip()
    
    print(fname)
    print(ftitle)  


### 5.3. Can get the link to the faculty member's webpage
* Retrieve the **`href`** attribute of the **`<a>`** tag under `<div class = "grid-item">`

In [None]:
faculty_divs = soup.find_all("div", class_="grid-item")

for prof in faculty_divs:   
    fname = prof.find("h3").get_text().strip()
    ftitle= prof.find("p","position").get_text().strip()
    fpage = prof.find("a").get("href").strip()

    print(fname)
    print(ftitle)
    print(fpage)

### 5.4. Can get the profile picture too
* Retrieve the **`src`** attribute of the **`<img>`** tag under `<div class = "grid-item">`

In [None]:
faculty_divs = soup.find_all("div", class_="grid-item")

for prof in faculty_divs:   
    fname = prof.find("h3").get_text().strip()
    ftitle= prof.find("p","position").get_text().strip()
    fpage = prof.find("a").get("href").strip()
    fpic = prof.find("img").get("src").strip()

    print(fname)
    print(ftitle)
    print(fpage)
    print(fpic)
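On real pages some grid items lack a picture, a title, or a link, and chaining `.find(...).get_text()` then raises an `AttributeError`. The sketch below is one way to guard against that; the helper names `text_of` and `attr_of` and the two-item HTML fragment are hypothetical, written only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second grid item has no link, picture, or position
html_doc = """
<div class="grid-item">
  <a href="https://example.edu/people/ada/"><img src="https://example.edu/ada.png"></a>
  <h3 class="person-name"> Ada Lovelace </h3>
  <p class="position">Professor</p>
</div>
<div class="grid-item">
  <h3 class="person-name">Alan Turing</h3>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

def text_of(parent, name, **filters):
    """Return the stripped text of the first matching child tag, or None."""
    tag = parent.find(name, **filters)
    return tag.get_text().strip() if tag is not None else None

def attr_of(parent, name, attribute):
    """Return an attribute of the first matching child tag, or None."""
    tag = parent.find(name)
    return tag.get(attribute) if tag is not None else None

records = []
for item in soup.find_all("div", class_="grid-item"):
    records.append({
        "Name": text_of(item, "h3", class_="person-name"),
        "Title": text_of(item, "p", class_="position"),
        "Link": attr_of(item, "a", "href"),
        "Picture": attr_of(item, "img", "src"),
    })

for record in records:
    print(record)
```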

<a id="6"></a>

## 6. Storing the Scraped Data
* Most of the time, you'll want to store your scraped data in a file.
* Consider using DataFrames for tabular data.

In [None]:
import pandas as pd

df = pd.DataFrame(columns = ["Name", "Title", "Link", "Picture"])
df

### 6.1. Appending Rows to a DataFrame
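* `DataFrame.append()` was removed in pandas 2.0; use `pd.concat()` instead
* Wrap a dictionary of column names and values in a one-row DataFrame and concatenate it

In [None]:
new_row = pd.DataFrame([{"Name": "Dumbledore", "Title": "Headmaster", "Link":"hogwarts.edu", 
                         "Picture":"hogwarts.edu/dumby.png"}])
df = pd.concat([df, new_row], ignore_index=True)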
df

In [None]:
df = df.drop(0)

In [None]:
df

In [None]:
faculty_divs = soup.find_all("div", class_="grid-item")

for prof in faculty_divs:   
    fname = prof.find("h3").get_text().strip()
    ftitle= prof.find("p","position").get_text().strip()
    fpage = prof.find("a").get("href").strip()
    fpic = prof.find("img").get("src").strip()
    
    #appends the fields to their respective columns
    #note the curly braces: the new row is passed as a dictionary
    new_row = pd.DataFrame([{"Name":fname, "Title":ftitle, "Link":fpage, "Picture": fpic}])
    df = pd.concat([df, new_row], ignore_index=True)
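An alternative sketch that avoids growing the DataFrame one row at a time: collect each record as a dictionary in a plain list and build the DataFrame once at the end. The `scraped` tuples and the name `faculty_df` are made-up stand-ins for the values pulled out of each grid item.

```python
import pandas as pd

# Hypothetical scraped values: (name, title, link, picture)
scraped = [
    ("Ada Lovelace", "Professor", "https://example.edu/people/ada/", "https://example.edu/ada.png"),
    ("Alan Turing", "Associate Professor", "https://example.edu/people/alan/", "https://example.edu/alan.png"),
]

rows = []
for fname, ftitle, fpage, fpic in scraped:
    # one dictionary per row, keyed by column name
    rows.append({"Name": fname, "Title": ftitle, "Link": fpage, "Picture": fpic})

# builds the DataFrame in a single step
faculty_df = pd.DataFrame(rows, columns=["Name", "Title", "Link", "Picture"])
print(faculty_df.shape)
```

Building the DataFrame once is usually faster than concatenating inside the loop, because every concatenation copies all of the existing rows.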

In [None]:
df

In [None]:
#displays first 5 rows
df.head()

In [None]:
#displays last 5 rows
df.tail()

### 6.2. Writing the DataFrame to CSV

In [None]:
df.to_csv("khoury_faculty.csv")
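By default `to_csv()` also writes the row index as an unnamed first column. The sketch below, on a made-up one-row DataFrame, passes `index=False` to leave it out and reads the text back with `read_csv()`; it writes to an in-memory `StringIO` buffer instead of a file so that nothing is left on disk.

```python
import io
import pandas as pd

# Hypothetical one-row DataFrame for illustration
df_demo = pd.DataFrame({"Name": ["Ada Lovelace"], "Title": ["Professor"]})

buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)     # index=False omits the row-index column
csv_text = buffer.getvalue()
print(csv_text)

# reads the CSV text back into a DataFrame
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```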

<a id="7"></a>

## 7. Web Crawling

In [None]:
deanURL = "https://www.khoury.northeastern.edu/people/carla-brodley/"
    
dean_page = urllib.urlopen(deanURL)    
page_soup = BeautifulSoup(dean_page.read(), "html.parser")

In [None]:
page_soup

<img src = "res/email.png" />

In [None]:
deanURL = "https://www.khoury.northeastern.edu/people/carla-brodley/"

#let's open the page
dean_page = urllib.urlopen(deanURL)

#creates a BeautifulSoup object containing the content of the page
page_soup = BeautifulSoup(dean_page.read(), "html.parser")

#finds the p tag with class = "contact-email"
email_container = page_soup.find("p", class_="contact-email")

#finds the a tag within the contact-email p tag and extracts the text (email address)
email = email_container.find("a").get_text()

In [None]:
email

### Let's do this for everyone on the page!

### 7.1. Web Crawling
* Scrapers that traverse multiple pages and even multiple sites
* Web crawlers retrieve page contents for a URL, examine that page for another URL, and retrieve that page or some portions of it

In [None]:
import urllib.request as urllib
from bs4 import BeautifulSoup
import time
import random

#opens the faculty page
html = urllib.urlopen("https://www.khoury.northeastern.edu/role/tenured-and-tenure-track-faculty/")
#turns the page into a BeautifulSoup object
soup = BeautifulSoup(html.read(), "html.parser")

#get all the content under <div class = "grid-item">
faculty_divs = soup.find_all("div", class_="grid-item")

#defines an empty list that will contain the email addresses crawled from faculty webpages
emails = []

for prof in faculty_divs[:3]:
    
    #gets faculty name for each prof
    fname = prof.find("h3").get_text().strip()
    #gets the URL to their webpage
    fpage = prof.find("a").get("href").strip()
    
    #opens the URL for each prof
    fac_page = urllib.urlopen(fpage)    
    #turns the page for each prof into a BeautifulSoup object
    page_soup = BeautifulSoup(fac_page.read(), "html.parser")
    
    #closes the urllib connection so the website won't get mad at us
    fac_page.close()
    
    #on the new page, finds the email container, <p class = "contact-email">
    email_container = page_soup.find("p", class_="contact-email")
    #gets the text for the <a> tag, the email address
    email = email_container.find("a").get_text()
    
    #appends the email to a list and displays it
    emails.append(email)
    print(email)
    
    #waits for a random whole number of seconds (2-6, since randint includes both endpoints) before moving on to the next prof
    #done to avoid overwhelming the website and getting blocked as a bot
    time.sleep(random.randint(2,6))

print("\n\n\nDone scraping the addresses")
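The crawl above assumes every profile page has a `contact-email` paragraph; if one does not, `find()` returns `None` and the loop stops with an `AttributeError`. The sketch below shows one way to handle that. To stay runnable offline, the `PAGES` dictionary, the `fetch()` stand-in, and the `crawl_emails()` helper are all hypothetical; in a real crawl `fetch` would wrap `urlopen(url).read()` and `delay_seconds` would be a few seconds.

```python
import time
from bs4 import BeautifulSoup

# Hypothetical in-memory "website": URL -> HTML
PAGES = {
    "https://example.edu/faculty/": """
        <div class="grid-item"><a href="https://example.edu/people/ada/"><h3>Ada</h3></a></div>
        <div class="grid-item"><a href="https://example.edu/people/alan/"><h3>Alan</h3></a></div>
    """,
    "https://example.edu/people/ada/":
        '<p class="contact-email"><a href="mailto:ada@example.edu">ada@example.edu</a></p>',
    "https://example.edu/people/alan/": "<p>No contact details listed</p>",
}

def fetch(url):
    """Stand-in for urlopen(url).read(); returns the HTML stored for a URL."""
    return PAGES[url]

def crawl_emails(index_url, fetch, delay_seconds=0.0):
    """Follow each profile link on the index page; map name -> email or None."""
    soup = BeautifulSoup(fetch(index_url), "html.parser")
    found = {}
    for item in soup.find_all("div", class_="grid-item"):
        name = item.find("h3").get_text().strip()
        profile_url = item.find("a").get("href")
        profile = BeautifulSoup(fetch(profile_url), "html.parser")
        container = profile.find("p", class_="contact-email")
        link = container.find("a") if container is not None else None
        found[name] = link.get_text().strip() if link is not None else None
        time.sleep(delay_seconds)   # politeness delay between requests
    return found

emails_by_name = crawl_emails("https://example.edu/faculty/", fetch)
print(emails_by_name)
```

Keying the result by name rather than appending to a list keeps each address tied to its owner even when some pages have no email.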

In [None]:
#displays the list of email addresses
emails

#### Let's add these email addresses to our dataframe, df

In [None]:
df.head()

In [None]:
len(emails)

In [None]:
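#we can add a new column; the list needs exactly one entry per row of df,
#so crawl every entry of faculty_divs (not just faculty_divs[:3]) before running this
df["Email"] = emails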

In [None]:
df.head()
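Assigning a list as a column only works when its length equals the number of rows. When the crawl covered only some of the faculty, one option is to key the emails by name and let pandas align them with `Series.map()`. This is a sketch with made-up data; the names `faculty` and `email_by_name` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: three rows, but emails for only two of them
faculty = pd.DataFrame({"Name": ["Ada", "Alan", "Grace"]})
email_by_name = {"Ada": "ada@example.edu", "Grace": "grace@example.edu"}

# map() looks up each Name in the dictionary; names without an entry get NaN
faculty["Email"] = faculty["Name"].map(email_by_name)
missing = faculty["Email"].isna().sum()

print(faculty)
print(missing)
```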

<a id="8"></a>

## 8. More on Web Scraping
* **BeautifulSoup** Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* **Selenium:** https://selenium-python.readthedocs.io/index.html
    * Web automation and scraping; dynamic GET and POST requests; can interact with dynamic web pages, forms, etc.
* **Scrapy:** https://scrapy.org/
    * A framework optimized for web crawling tasks