<center><b>DIGHUM101</b></center>
<center>3-3: Web Scraping</center>

---

# Web scraping with BeautifulSoup

Web scraping is the practice of programmatically collecting information from websites. While many languages offer libraries and frameworks for extracting web data, Python has long been a popular choice because of its wide range of scraping tools.

# Ethical web scraping
Before choosing to engage in web scraping, there are a few things you should always consider:
1. Many websites have a Terms of Use which may not allow scraping. We must respect websites that do not want to be scraped.
2. Is there an API available already? If so, there's no need for us to write a scraper. APIs are created to provide access to data in a controlled way as defined by the owners of the data, so we prefer to use APIs if they're available.
3. Making requests to a website takes a toll on its performance. A web scraper that makes too many requests can be as debilitating as a denial-of-service attack. We must scrape responsibly so that we don't disrupt the regular functioning of the website.

If you have doubts about the ethics of scraping some website, please consult with me.
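
One place where a site states its crawling preferences is its `robots.txt` file, which Python's standard-library `urllib.robotparser` can read. The sketch below parses a made-up `robots.txt` so that it runs without a network connection. The rules, the `example.org` URLs, and the `dighum101-scraper` user agent name are all hypothetical. For a real site you would call `set_url()` and `read()` instead. Keep in mind that `robots.txt` complements the Terms of Use; it does not replace reading them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
]

parser = RobotFileParser()
parser.parse(robots_lines)

user_agent = "dighum101-scraper"  # hypothetical name for our scraper

allowed_public = parser.can_fetch(user_agent, "https://example.org/wiki/Data_science")
allowed_private = parser.can_fetch(user_agent, "https://example.org/private/report")
delay = parser.crawl_delay(user_agent)

print(allowed_public)   # True: no rule forbids this path
print(allowed_private)  # False: /private/ is disallowed
print(delay)            # 5: seconds the site asks us to wait between requests
```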


# Scraping from Wikipedia
We're going to scrape some information from Wikipedia, which has a simple page layout with a consistent template.

For web scraping we're going to need two libraries: [requests](https://requests.readthedocs.io/en/master/) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). BeautifulSoup is what we use to actually navigate and parse the page that we're scraping. We'll import the `time` library too. This will allow us to `time.sleep(5)` so that we don't overload anyone's servers. 
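
As a minimal sketch of that pause, the loop below waits between consecutive requests. The `fetch_politely` helper and the stand-in `fake_fetch` function are hypothetical names used only for illustration. In the notebook you could pass `requests.get` instead and keep the delay at a few seconds.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=5):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results

# Stand-in for requests.get so the example runs without a network connection.
def fake_fetch(url):
    return f"fetched {url}"

pages = fetch_politely(
    ["https://example.org/a", "https://example.org/b"],
    fake_fetch,
    delay_seconds=0.1,  # shortened for the demo; use several seconds on real sites
)
print(pages)
```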

We will talk a little about HTML and CSS. You will need to know more about both if you want to get good at web scraping. Here's a good place to start: [What are HTML and CSS?](https://html.com/)

If you're looking for a quick crash course in developer tools for HTML and CSS, check out this YouTube video.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('FQKvro1Wz-E', width=640, height=360)

In [None]:
# Uncomment to install; the html5lib parser used below needs its own package
# !pip install beautifulsoup4 html5lib

In [None]:
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd

### For this exercise, we will scrape all the citations on the Wikipedia "Data Science" page

First, have a look at what's on the [Data science](https://en.wikipedia.org/wiki/Data_science) Wikipedia page. Next, we'll access this page with a GET request using the `requests` library.

In [None]:
r = requests.get('https://en.wikipedia.org/wiki/Data_science')

We now have a `Response` object. Unlike JSON, which has the `.json()` method, there is no `.html()` method in the requests library, but BeautifulSoup will help us get there. First, extract the HTML string:

In [None]:
source = r.text
source

Neat! If you visit the Data Science Wikipedia page, right-click, and choose "View page source", you'll see the same thing!

<img src="../../Img/page_source.gif" alt="source" style="width: 400px;"/>

Now we convert it into a BeautifulSoup object that makes navigating the HTML tree much easier.

Note that Beautiful Soup offers a number of ways to customize how the parser treats incoming HTML and XML. We are using the `html5lib` parser here, but we could use [different ones](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers) as well. It all depends on the website you're trying to scrape.

In [None]:
soup = BeautifulSoup(source, 'html5lib')
print(type(soup))

Then, use the `.prettify()` method to look at the HTML, and even get a slice of it. Let's take a look at what we have:

In [None]:
print(soup.prettify()[:2000])

Let's use BeautifulSoup functions to find things on a page, such as:

1. HTML tags
2. HTML Attributes
3. CSS Selectors
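
To make these three kinds of search concrete, here is a small sketch on a hand-written HTML snippet rather than the live Wikipedia page. The snippet and the variable name `demo` are made up for illustration, and we use Python's built-in `html.parser` so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, small enough to read by eye
html = """
<div id="content">
  <h1 class="title">Data science</h1>
  <p>See <a href="/wiki/Statistics">statistics</a> and
     <a href="https://example.org/paper">a paper</a>.</p>
  <a name="anchor-only">no link here</a>
</div>
"""

demo = BeautifulSoup(html, "html.parser")  # built-in parser, no extra install

# 1. By HTML tag: every <a> element, whether or not it has an href
by_tag = demo.find_all("a")

# 2. By HTML attribute: the element with a given id, or any tag carrying an href
by_id = demo.find(id="content")
with_href = demo.find_all(href=True)

# 3. By CSS selector: <a> elements nested anywhere inside a <p>
by_selector = demo.select("p a")

print(len(by_tag))       # 3
print(by_id.name)        # div
print(len(with_href))    # 2
print(len(by_selector))  # 2
```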

Let's search first for **HTML tags**. 

The function `find_all` searches the `soup` tree to find all the elements with a particular HTML tag, and returns a list of all those elements. Let's search for all of the [`a` tags](https://www.w3schools.com/tags/tag_a.asp) (i.e., hyperlinks).

In [None]:
soup.find_all("a")

Since the `.find_all()` method is used so frequently, there is a shortcut for it. You can just treat the soup object itself as a function, and pass it the tag you're looking for as an argument.

So `soup.find_all('a')` is the same as `soup('a')`:

In [None]:
soup.find_all('a') == soup('a')

You probably noticed that `.find_all()` returned a lot of elements, most of which we might not want. One way to narrow down our search is to specify that we're only looking for elements that have a certain CSS class, using the `class_` argument of `.find_all()`. Alternatively, we can use the `.select()` method, passing it a CSS selector that consists of the tag and the CSS class separated by a period. For instance, we can grab the title with the following CSS selector:

In [None]:
soup.select("h1.firstHeading")
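
To compare the two ways of filtering by CSS class, the sketch below runs both on a small hand-written snippet. The HTML and the variable names are hypothetical. It also shows `.select_one()`, which returns only the first match, or `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with two headings that differ only in their class
html = """
<h1 class="firstHeading">Data science</h1>
<h1 class="other">Not the title</h1>
"""
demo = BeautifulSoup(html, "html.parser")

# Keyword form: class_ has a trailing underscore because class is a Python keyword
via_find_all = demo.find_all("h1", class_="firstHeading")

# Selector form: tag and class joined by a period
via_select = demo.select("h1.firstHeading")

# select_one returns the first match, or None when nothing matches
title = demo.select_one("h1.firstHeading")
missing = demo.select_one("h1.doesNotExist")

print(list(via_find_all) == list(via_select))  # True
print(title.get_text())                        # Data science
print(missing)                                 # None
```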

How are we getting all these tag and attribute names? Typically, you will want to go to a web page on your browser, right-click on an element you're interested in (such as the heading in the example above) and select "inspect" in order to see the HTML and CSS that makes up the web page. You can then also navigate to other elements in the HTML.

<img src="../../Img/inspect.gif" alt="inspect" style="width: 400px;"/>

# Scraping text

Inspecting the HTML, we can see there's a tag with the id `mw-content-text`, where all the main text of the article can be found. Let's retrieve it.

In [None]:
# `id` is an HTML attribute, passed to find() as a keyword argument - not a method :D
body = soup.find(id="mw-content-text")
body

In [None]:
type(body)

Once we identify elements, we want to access the information in a certain element. This usually means two things:

1. Text
2. Attributes

Our `body` variable is a BeautifulSoup `Tag` object, which means it has a `text` attribute. Let's grab all the `p` (paragraph) tags from it and print their `text` attributes.

In [None]:
for t in body.find_all("p"):
    print(t.text)
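
Real pages often contain empty paragraphs and stray whitespace. As a minimal sketch, the loop below collects cleaned paragraph text into a list instead of printing it, using `.get_text()` (the method that the `text` attribute is a shortcut for). The HTML snippet and the names `demo_body` and `paragraphs` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with padding whitespace and one empty paragraph
html = """
<div id="mw-content-text">
  <p>  Data science is an <b>interdisciplinary</b> field.  </p>
  <p></p>
  <p>It draws on statistics.</p>
</div>
"""
demo_body = BeautifulSoup(html, "html.parser").find(id="mw-content-text")

paragraphs = []
for p in demo_body.find_all("p"):
    text = p.get_text().strip()  # drop leading and trailing whitespace
    if text:                     # skip paragraphs that are empty after stripping
        paragraphs.append(text)

print(paragraphs)
# ['Data science is an interdisciplinary field.', 'It draws on statistics.']
```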

# Scraping links 

Next, let's find all the places in the text where there is a link to another website. Using the `.find_all()` method, we can find all the links on the page that are within the main text.

Note that our `body` variable is a BeautifulSoup `Tag` object, so we can call the same search methods on it that we used on `soup`. Let's use the `.attrs` attribute to see the attributes for the first `a` tag (i.e., the first hyperlink in this BeautifulSoup object). We can get that with indexing :)

In [None]:
first_link = body("a")[0].attrs
print(first_link)

You'll notice that it looks a lot like a dictionary, so we can index it as such. Since we want the link, we can use the `href` attribute like a dictionary key to get the corresponding value.

In [None]:
first_link['href']

Knowing this, we can now iterate over all `a` tags and access them as dictionaries to retrieve the ["href" attribute](https://www.w3schools.com/tags/att_a_href.asp), which specifies the URL of the page the link goes to.

In [None]:
# Loop over the links in the main text; href=True skips <a> tags without an href
for link in body.find_all('a', href=True):
    print(link['href'])
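
Many of these `href` values are relative (for example `/wiki/...`) or point to an anchor on the same page (`#...`). If we want full URLs, one option is `urljoin` from the standard library. This is a small sketch with example values. The three hrefs are illustrative, not taken from the live page.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Data_science"

# Example href values of the kinds found on a Wikipedia page
hrefs = ["/wiki/Statistics", "#cite_note-1", "https://example.org/paper"]

# Resolve each href against the URL of the page it came from
absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
# https://en.wikipedia.org/wiki/Statistics
# https://en.wikipedia.org/wiki/Data_science#cite_note-1
# https://example.org/paper
```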

# Scraping references
Next, let's get the references one can find at the bottom of a Wikipedia page. Let's `find` the references part of the website first and save that to a new variable.

In [None]:
refs = soup.find("div", class_="reflist")
# or, using find_all (which returns a list, so take the first match):
#refs = soup.find_all("div", class_="reflist")[0]

Next, we'll `select` the first `span` element with the class `reference-text`.

*Note that in this case, we could either use `find_all` or `select`. Usage often depends on the use case. See [here](https://stackoverflow.com/questions/38028384/beautifulsoup-difference-between-find-and-select) if you want to learn more.*

In [None]:
first_citation = refs.select("span.reference-text")[0]
# or, using find_all
#first_citation = refs.find_all("span", class_="reference-text")[0]

first_citation


In [None]:
# check out its type
print(type(first_citation))

If we want to get the link to this citation, we just have to navigate to it. We can again find whatever `a` elements are in this tag, just like we did before.

In [None]:
# Find the "a" elements
print(first_citation("a"))

As you can see, this returns a list. 
Each item in it is a BeautifulSoup `Tag` object, so let's use indexing to get the first `a` tag.

In [None]:
# Get the first one
print(first_citation("a")[0])

Since we want the link, we can use the `href` attribute again to get the corresponding value.

In [None]:
print(first_citation("a")[0]['href'])

Now, get all the links contained in the references and add them to a list:

In [None]:
# make accumulator list
refs_list = []

# start at the endnotes
references = soup.select("span.reference-text")

# loop through references
for ref in references:
    links = ref("a")
    if links:  # ignore the references without links
        a_element = links[0]
        link = a_element['href']
        refs_list.append(link)

# get rid of links to wiki articles
refs_list = [ref for ref in refs_list if not ref.startswith('/wiki')]

refs_list
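
The loop above keeps only the first link in each reference. If that first link happens to point to another Wikipedia article, any external link that follows it in the same reference is lost. As a hedged variant (not the approach used above), the sketch below checks every link in each reference and keeps those whose URL names a host. The HTML snippet and the variable names are hypothetical.

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

# Hypothetical references: internal link first, no link at all, external link only
html = """
<span class="reference-text">
  <a href="/wiki/Some_Author">Author</a>. <a href="https://example.org/talk.pdf">Talk</a>
</span>
<span class="reference-text">A book with no link.</span>
<span class="reference-text"><a href="https://example.org/article">Article</a></span>
"""
demo = BeautifulSoup(html, "html.parser")

external_links = []
for ref in demo.select("span.reference-text"):
    for a in ref.find_all("a", href=True):
        # Keep only URLs that name a host, which drops relative /wiki/... links
        if urlparse(a["href"]).netloc:
            external_links.append(a["href"])

print(external_links)
# ['https://example.org/talk.pdf', 'https://example.org/article']
```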

In [None]:
# Convert to data frame
citations_df = pd.DataFrame(refs_list, columns = ["Citation"])
citations_df.head()

In [None]:
# Export to .csv
citations_df.to_csv("citations.csv")
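
By default, `to_csv` also writes the row index as an unnamed first column. If you'd rather not have it in the file, one option is `index=False`. The sketch below uses a small made-up data frame and an in-memory buffer instead of a file on disk, then reads the CSV back to check the result.

```python
import io
import pandas as pd

# Hypothetical data standing in for citations_df
demo_df = pd.DataFrame(
    ["https://example.org/a", "https://example.org/b"], columns=["Citation"]
)

buffer = io.StringIO()               # in-memory stand-in for a file on disk
demo_df.to_csv(buffer, index=False)  # index=False leaves out the row-number column

buffer.seek(0)                       # rewind so we can read what we just wrote
round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))      # ['Citation']
```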