# Web scraping with BeautifulSoup

We're going to scrape some information from Wikipedia, which has a simple page layout with a consistent template.

For web scraping we're going to need two libraries: [requests](https://requests.readthedocs.io/en/master/) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). requests fetches the page over HTTP, and BeautifulSoup is what we use to navigate and parse the HTML that comes back. We'll import the `time` library too, so that we can call `time.sleep(5)` between requests and avoid overloading anyone's servers. Finally, we'll import `pandas` to store the results in a table.

We will talk a little about HTML and CSS - learn more here: [What are HTML and CSS?](https://html.com/) 

In [None]:
# The 'html5lib' parser used below is a separate package from beautifulsoup4
# !pip install beautifulsoup4 html5lib

In [None]:
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
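
The `time.sleep` pause matters most when fetching more than one page. As a minimal sketch, the hypothetical helper `fetch_politely` below (it is not part of requests or BeautifulSoup) waits between consecutive requests. The `fetch` and `sleep` parameters are assumptions added so the example runs offline; in this notebook you might pass `lambda url: requests.get(url).text` as `fetch` and leave `sleep` at its default.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # pause before every request after the first
        pages[url] = fetch(url)
    return pages

# Offline demonstration: a stand-in fetch function and a recorder instead of a real sleep.
fake_fetch = lambda url: f"<html>{url}</html>"
pauses = []
pages = fetch_politely(["page-a", "page-b", "page-c"], fake_fetch,
                       delay_seconds=5, sleep=pauses.append)

print(pauses)        # one pause between each pair of requests
print(list(pages))
```

Three URLs lead to two pauses, because there is nothing to wait for before the first request.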

### For this exercise, we will scrape all the citations on the Wikipedia "Data Science" page

First we use requests to make a `.get` request to the page. Let's see what's on the [Data science](https://en.wikipedia.org/wiki/Data_science) Wikipedia page:

In [None]:
r = requests.get('https://en.wikipedia.org/wiki/Data_science')

We now have a `Response` object. The requests library does not parse HTML for us, but BeautifulSoup will help us get there. First, extract the HTML as a string:

In [None]:
source = r.text
source

Neat! If you visit the Data Science Wikipedia page, right-click and choose "View page source", you'll see the same thing. Now we use BeautifulSoup to convert the string into a `BeautifulSoup` object, which makes navigating the HTML tree much easier.

In [None]:
soup = BeautifulSoup(source, 'html5lib')
print(type(soup))
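
The second argument to `BeautifulSoup` names the parser. `'html5lib'` is a separate package, and BeautifulSoup raises an error if it is not installed. As a small sketch on an invented HTML string, the example below uses `"html.parser"`, which ships with Python. Different parsers can build slightly different trees from malformed HTML, so treat this as an alternative to try rather than a drop-in replacement.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration only
demo_html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# "html.parser" is built into Python, so only beautifulsoup4 is required
demo_soup = BeautifulSoup(demo_html, "html.parser")

print(type(demo_soup))
print(demo_soup.title.string)  # text of the <title> tag
print(demo_soup.p.text)        # text of the first <p> tag
```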

The `.prettify()` method returns the HTML as a neatly indented string. Since it is a string, we can slice it and look at just the first 1,000 characters:

In [None]:
print(soup.prettify()[:1000])

Let's use BeautifulSoup functions to find things on a page, such as:

1. HTML tags
2. HTML Attributes
3. CSS Selectors

Let's search first for **HTML tags**. 

The function `find_all` searches the `soup` tree to find all the elements with a particular HTML tag, and returns a list of all those elements.

In [None]:
soup.find_all("a")
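
`find_all` can also filter on HTML attributes, the second item in the list above. The sketch below runs on an invented snippet (the markup and the example.org URL are assumptions, not taken from the Wikipedia page) and shows filtering by tag, by the presence of an attribute, and by CSS class.

```python
from bs4 import BeautifulSoup

demo_html = """
<p>See <a href="/wiki/Statistics">statistics</a> and
<a href="https://example.org/paper" class="external">a paper</a>.
<a name="anchor-only">No link here</a></p>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

all_links = demo_soup.find_all("a")                    # every <a> tag
with_href = demo_soup.find_all("a", href=True)         # only <a> tags that have an href
external = demo_soup.find_all("a", class_="external")  # filter on the CSS class

print(len(all_links), len(with_href), len(external))
```

The keyword is spelled `class_` because `class` is a reserved word in Python.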

Since the `.find_all()` method is used so frequently, there is a shortcut for it. You can just treat the soup object itself as a function, and pass it the tag you're looking for as an argument.

So `soup.find_all('a')` is the same as `soup('a')`:

In [None]:
soup.find_all('a') == soup('a')
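
The same shortcut can be checked on a tiny invented document, where the results are easy to read. This is only a sketch; the list markup is made up for the example.

```python
from bs4 import BeautifulSoup

demo_soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")

items_long = demo_soup.find_all("li")
items_short = demo_soup("li")  # calling the soup is shorthand for find_all

print(items_long == items_short)
print([li.text for li in items_short])

# Tags support the same shortcut, which searches only inside that tag
print(len(demo_soup.ul("li")))
```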

You probably noticed that `soup('a')` returned a lot of elements, most of which we don't want. One way to narrow the search is to look only for elements that have a certain CSS class, and the `.select()` method lets us do that with a CSS selector. We pass it a string consisting of the tag name and the CSS class, separated by a period. The following selector grabs the navigation box in the upper right of the page; the commented-out line uses a longer selector ending in ` a` to grab the links inside that box instead:

In [None]:
# soup.select("table.vertical-navbox.nowraplinks.plainlist a")
soup.select("table.vertical-navbox")
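
Wikipedia's class names change over time, so a selector that once matched may return an empty list on the live page. To see how the selector syntax behaves independently of the current markup, here is a sketch on an invented table (the `navbox` and `wide` class names are assumptions made for the example).

```python
from bs4 import BeautifulSoup

demo_html = """
<table class="navbox wide">
  <tr><td><a href="/wiki/A">A</a> <a href="/wiki/B">B</a></td></tr>
</table>
<p><a href="/wiki/C">C</a></p>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

boxes = demo_soup.select("table.navbox")        # <table> tags with class "navbox"
box_links = demo_soup.select("table.navbox a")  # <a> tags nested anywhere inside them
both = demo_soup.select("table.navbox.wide")    # tables carrying both classes
missing = demo_soup.select("table.no-such-class")

print(len(boxes), [a.text for a in box_links], len(both), missing)
```

A space in a selector means "descendant of", while chained periods mean "has all of these classes". A selector with no match gives back an empty list instead of raising an error.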

If you're looking for a quick crash course in developer tools, check out this [YouTube video](https://www.youtube.com/watch?v=FQKvro1Wz-E).

[![Developer tools crash course video](https://img.youtube.com/vi/FQKvro1Wz-E/0.jpg)](https://www.youtube.com/watch?v=FQKvro1Wz-E)

# Find the first citation

Let's find all the places in the text where there is a citation, along with the references themselves. Using the `.select()` method, find all the elements in the page that belong to the "reference-text" class.

****

Once we identify elements, we want to access the information in a certain element. This usually means two things:

1. Text
2. Attributes

Getting the text inside an element is easy. All we have to do is use the `.text` attribute of a `Tag` object. Let's look at the first citation:

In [None]:
first_citation = soup.select("span.reference-text")[0]
first_citation

In [None]:
# check out its type
print(type(first_citation))

It's a tag! That means it has a `.text` attribute:

In [None]:
# This is an attribute - not a method :D
first_citation.text
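
`.text` joins the text of the tag and everything nested inside it. The sketch below shows this on an invented citation (the journal name, title, and URL are made up), and also shows that `.text` gives the same result as calling `.get_text()` with no arguments.

```python
from bs4 import BeautifulSoup

demo_html = ('<span class="reference-text"><i>A Journal</i>, '
             '<a href="https://example.org">Some Title</a>. 2020.</span>')
citation = BeautifulSoup(demo_html, "html.parser").select("span.reference-text")[0]

print(citation.text)                         # text of the span and all nested tags
print(citation.get_text() == citation.text)  # the two spellings agree
print(citation.a.text)                       # text of the nested <a> only
```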

That gives us the text of the citation. But we can also dig deeper into the tag to get other information that's contained there.

If we want to get the link to this citation, we just have to navigate to it. We can again find whatever `a` elements are in this tag, just like we did for the soup object as a whole.

In [None]:
# Find the "a" elements
print(first_citation("a"))

Again this returns a list. In this case the link is located in the first item. We can get that with indexing :)

In [None]:
# Get the first one
print(first_citation("a")[0])

This object is also a tag. Now let's use the `.attrs` attribute to see the tag's attributes.

In [None]:
first_citation("a")[0].attrs

You'll notice that it looks a lot like a dictionary, so we can index it as such. Since we want the link, we can use the `href` attribute like a dictionary key to get the corresponding value.

In [None]:
print(first_citation("a")[0]['href'])
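
Indexing a tag like a dictionary raises a `KeyError` when the attribute is missing, which can happen with `<a>` tags that have no `href`. As a sketch on an invented link, `.get()` is the more forgiving alternative.

```python
from bs4 import BeautifulSoup

link = BeautifulSoup(
    '<a class="external text" href="https://example.org/paper" rel="nofollow">paper</a>',
    "html.parser",
).a

print(link.attrs)          # dictionary of all attributes
print(link["href"])        # dictionary-style access
print(link["class"])       # multi-valued attributes such as class come back as lists
print(link.get("title"))   # None, because this tag has no title attribute

try:
    link["title"]
except KeyError:
    print("no title attribute")
```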

Now, get all the links contained in the references and add them to a list:

In [None]:
# make accumulator list
refs_list = []

# start at the endnotes
references = soup.select("span.reference-text")

# loop through references
for ref in references:
    links = ref("a")
    if links:  # ignore the references without links
        # .get returns None instead of raising if the tag has no href
        link = links[0].get('href')
        if link:
            refs_list.append(link)

# get rid of links to wiki articles
refs_list = [ref for ref in refs_list if not ref.startswith('/wiki')]

refs_list
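
The same steps can be wrapped in a function and tried on a small invented page, which makes the edge cases easy to check. This is a sketch rather than the notebook's own method: `first_external_links` is a hypothetical name, the sample markup is made up, and it uses `find("a", href=True)` so that the first link with an `href` is taken.

```python
from bs4 import BeautifulSoup

def first_external_links(html):
    """Return the first link of each reference, skipping internal /wiki links."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for ref in soup.select("span.reference-text"):
        a_element = ref.find("a", href=True)  # first <a> with an href, or None
        if a_element is not None and not a_element["href"].startswith("/wiki"):
            links.append(a_element["href"])
    return links

sample_html = """
<ol>
  <li><span class="reference-text"><a href="https://example.org/one">One</a></span></li>
  <li><span class="reference-text">No link in this reference.</span></li>
  <li><span class="reference-text"><a href="/wiki/Some_article">Internal</a></span></li>
  <li><span class="reference-text"><a href="https://example.org/two">Two</a>
      <a href="https://example.org/extra">Extra</a></span></li>
</ol>
"""

print(first_external_links(sample_html))
```

Only the first and last references contribute a link: the second has no link at all, and the third starts with an internal `/wiki` link.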

In [None]:
# Convert to data frame
citations_df = pd.DataFrame(refs_list, columns = ["Citation"])
citations_df.head()

In [None]:
# Export to .csv
citations_df.to_csv("citations.csv")
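
By default `to_csv` also writes the row index as an unnamed first column. If that column is not wanted, `index=False` leaves it out. The sketch below uses invented URLs and writes to in-memory buffers instead of a file so the difference is easy to see.

```python
import io
import pandas as pd

demo_df = pd.DataFrame(
    ["https://example.org/one", "https://example.org/two"], columns=["Citation"]
)

with_index = io.StringIO()
demo_df.to_csv(with_index)                  # default: index written as first column

without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)  # only the Citation column

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with), repr(header_without))

# Reading the index-free version back gives the original column
round_trip = pd.read_csv(io.StringIO(without_index.getvalue()))
print(round_trip["Citation"].tolist())
```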