# Web scraping with BeautifulSoup

We're going to scrape some information from Wikipedia, which has a simple page layout with a consistent template.

For web scraping we need two libraries: [requests](https://requests.readthedocs.io/en/master/) to download the page and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse and navigate it. We'll also import the `time` library so that we can pause between requests with `time.sleep(5)` and not overload anyone's servers, and `pandas` to store the results in a data frame at the end.
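
The pause between requests can be wrapped in a small helper. This is only a sketch: `fetch_politely` and `fake_fetch` are hypothetical names that do not appear in this notebook, and the stand-in fetch function is there so that the example runs without touching the network.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=5.0):
    """Call fetch(url) for each URL, pausing between consecutive calls."""
    pages = {}
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        pages[url] = fetch(url)
    return pages


def fake_fetch(url):
    """Stand-in for a real download (assumption: returns the page's HTML)."""
    return f"<html><title>{url}</title></html>"


demo_pages = fetch_politely(
    ["https://example.org/a", "https://example.org/b"],
    fake_fetch,
    delay_seconds=0.1,  # short delay for the demo only
)
```

With the real library, one could pass something like `lambda url: requests.get(url).text` as `fetch`.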

We will talk a little about HTML and CSS - learn more here: [What are HTML and CSS?](https://html.com/) 

In [None]:
# !pip install beautifulsoup4 html5lib

In [1]:
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd

### For this exercise, we will scrape all the citations on the Wikipedia "Data Science" page

First we use `requests.get` to make a GET request to the page we want to scrape, the [Data science](https://en.wikipedia.org/wiki/Data_science) Wikipedia page:

In [2]:
r = requests.get('https://en.wikipedia.org/wiki/Data_science')

We now have a `Response` object. Unlike `.json()`, the requests library has no method that parses HTML, but BeautifulSoup will help us get there. First, extract the HTML as a string:

In [3]:
source = r.text
source

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Data science - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"07086999-584c-4719-b56e-2546bbcd43c5","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Data_science","wgTitle":"Data science","wgCurRevisionId":1027202035,"wgRevisionId":1027202035,"wgArticleId":35458904,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: others","Articles with short description","Short description matches Wikidata","Use dmy dates from December 2012","Information science","Computer occupa

Neat! If you visit the Data Science Wikipedia page, right-click, and choose "View page source", you'll see the same thing! Now we convert it into a `BeautifulSoup` object, which makes navigating the HTML tree much easier.

In [4]:
soup = BeautifulSoup(source, 'html5lib')
print(type(soup))

<class 'bs4.BeautifulSoup'>
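
The `'html5lib'` parser used above is a separate package that has to be installed alongside `beautifulsoup4`. As a minimal sketch, the same kind of object can be built with Python's built-in `'html.parser'`; the HTML string and the `demo_` names below are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny made-up document instead of the downloaded Wikipedia page.
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# 'html.parser' ships with Python, so no extra install is needed.
demo_soup = BeautifulSoup(html, "html.parser")

page_title = demo_soup.title.string   # text inside the <title> element
first_paragraph = demo_soup.p.text    # text inside the first <p> element
```

Different parsers can build slightly different trees from messy HTML, so it is worth using the same parser throughout a project.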


The `.prettify()` method returns the HTML as an indented string, so we can slice it and look at just the first 1,000 characters:

In [5]:
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Data science - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"07086999-584c-4719-b56e-2546bbcd43c5","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Data_science","wgTitle":"Data science","wgCurRevisionId":1027202035,"wgRevisionId":1027202035,"wgArticleId":35458904,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: others","Articles with short description","Short description matches Wikidata","Use dmy dates from December 2012","Information science","Com

Let's use BeautifulSoup functions to find things on a page, such as:

1. HTML tags
2. HTML Attributes
3. CSS Selectors

Let's search first for **HTML tags**. 

The `find_all` method searches the `soup` tree for all the elements with a particular HTML tag and returns them in a list.

In [6]:
soup.find_all("a")

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a href="/wiki/Information_science" title="Information science">information science</a>,
 <a class="image" href="/wiki/File:PIA23792-1600x1200(1).jpg"><img alt="" class="thumbimage" data-file-height="1200" data-file-width="1600" decoding="async" height="188" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/45/PIA23792-1600x1200%281%29.jpg/250px-PIA23792-1600x1200%281%29.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/4/45/PIA23792-1600x1200%281%29.jpg/375px-PIA23792-1600x1200%281%29.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/4/45/PIA23792-1600x1200%281%29.jpg/500px-PIA23792-1600x1200%281%29.jpg 2x" width="250"/></a>,
 <a class="internal" href="/wiki/File:PIA23792-1600x1200(1).jpg" title="Enlarge"></a>,
 <a href="/wiki/Comet_NEOWISE" title="Comet NEOWISE">Comet NEOWISE</a>,
 <a href="/wiki/Astronomical_survey

Since the `.find_all()` method is used so frequently, there is a shortcut for it. You can just treat the soup object itself as a function, and pass it the tag you're looking for as an argument.

So `soup.find_all('a')` is the same as `soup('a')`:

In [None]:
soup.find_all('a') == soup('a')
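
To see the equivalence on something small, here is a sketch using a made-up HTML fragment (the `demo_soup` name and the markup are illustrative, not taken from the Wikipedia page). It also shows `.find()`, which returns only the first match.

```python
from bs4 import BeautifulSoup

html = '<p><a href="/a">A</a> and <a href="/b">B</a>, plus <b>no link</b></p>'
demo_soup = BeautifulSoup(html, "html.parser")

links = demo_soup.find_all("a")    # every <a> element, in document order
same_links = demo_soup("a")        # calling the object is shorthand for find_all
first_link = demo_soup.find("a")   # only the first match, or None if there is none

link_texts = [a.text for a in links]
```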

You probably noticed that `.find_all()` returned a lot of elements, most of which we might not want. One way to narrow down the search is to look only for elements that have a certain CSS class, by passing `class_` to `.find_all()`. Alternatively, we can use the `.select()` method, which takes a CSS selector: the tag and the CSS class separated by a period. We can grab the sidebar table with the following CSS selector. Adding ` a` to the end of a selector, as in the commented-out line, selects the links inside the matched table instead:

In [20]:
# soup.select("table.vertical-navbox.nowraplinks.plainlist a")
soup.select("table.sidebar")

[<table class="sidebar sidebar-collapse nomobile"><tbody><tr><td class="sidebar-pretitle">Part of a series on</td></tr><tr><th class="sidebar-title-with-pretitle"><a href="/wiki/Machine_learning" title="Machine learning">Machine learning</a><br/>and <a href="/wiki/Data_mining" title="Data mining">data mining</a></th></tr><tr><td class="sidebar-image"><a class="image" href="/wiki/File:Multi-Layer_Neural_Network-Vector-Blank.svg"><img alt="Multi-Layer Neural Network-Vector-Blank.svg" data-file-height="390" data-file-width="815" decoding="async" height="72" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/00/Multi-Layer_Neural_Network-Vector-Blank.svg/150px-Multi-Layer_Neural_Network-Vector-Blank.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/00/Multi-Layer_Neural_Network-Vector-Blank.svg/225px-Multi-Layer_Neural_Network-Vector-Blank.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/00/Multi-Layer_Neural_Network-Vector-Blank.svg/300px-Multi-Layer_Neura
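
As a hedged sketch of class-based filtering, the fragment below is made up (it is not the real Wikipedia markup) and compares `.find_all()` with `class_` against the equivalent `.select()` call, then uses a descendant selector to reach the links inside the table.

```python
from bs4 import BeautifulSoup

html = """
<table class="sidebar nomobile">
  <tr><td><a href="/wiki/Machine_learning">Machine learning</a></td></tr>
  <tr><td><a href="/wiki/Data_mining">Data mining</a></td></tr>
</table>
<p><a class="external" href="https://example.org">Elsewhere</a></p>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Two ways to ask for <table> elements that carry the "sidebar" class.
tables_by_class = demo_soup.find_all("table", class_="sidebar")
tables_by_selector = demo_soup.select("table.sidebar")

# A space in a CSS selector means "descendant of": links inside the sidebar only.
sidebar_links = [a["href"] for a in demo_soup.select("table.sidebar a")]
```

The link outside the table is not included in `sidebar_links`, which is the point of scoping the selector.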

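To work out which tags and classes to put in a selector, inspect the page with your browser's developer tools. If you're looking for a quick crash course in developer tools, check out this [YouTube video](https://www.youtube.com/watch?v=FQKvro1Wz-E).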

[![Developer tools crash course (YouTube video thumbnail)](https://img.youtube.com/vi/FQKvro1Wz-E/0.jpg)](https://www.youtube.com/watch?v=FQKvro1Wz-E)

### Find the first citation

Let's find all the places in the text where there is a citation, along with the references themselves. Using the `.select()` method, find all the elements in the page that belong to the "reference-text" class.

****

Once we identify elements, we want to access the information in a certain element. This usually means two things:

1. Text
2. Attributes

Getting the text inside an element is easy. All we have to do is use the `.text` attribute of a `Tag` object. Let's look at the first citation:

In [18]:
first_citation = soup.select("span.reference-text")[0]
first_citation

<span class="reference-text"><style data-mw-deduplicate="TemplateStyles:r999302996">.mw-parser-output cite.citation{font-style:inherit}.mw-parser-output .citation q{quotes:"\"""\"""'""'"}.mw-parser-output .id-lock-free a,.mw-parser-output .citation .cs1-lock-free a{background:linear-gradient(transparent,transparent),url("//upload.wikimedia.org/wikipedia/commons/6/65/Lock-green.svg")right 0.1em center/9px no-repeat}.mw-parser-output .id-lock-limited a,.mw-parser-output .id-lock-registration a,.mw-parser-output .citation .cs1-lock-limited a,.mw-parser-output .citation .cs1-lock-registration a{background:linear-gradient(transparent,transparent),url("//upload.wikimedia.org/wikipedia/commons/d/d6/Lock-gray-alt-2.svg")right 0.1em center/9px no-repeat}.mw-parser-output .id-lock-subscription a,.mw-parser-output .citation .cs1-lock-subscription a{background:linear-gradient(transparent,transparent),url("//upload.wikimedia.org/wikipedia/commons/a/aa/Lock-red-alt-2.svg")right 0.1em center/9px no-r

In [21]:
# check out its type
print(type(first_citation))

<class 'bs4.element.Tag'>


It's a `Tag`, which means it has a `.text` attribute:

In [22]:
# This is an attribute - not a method :D
first_citation.text

'.mw-parser-output cite.citation{font-style:inherit}.mw-parser-output .citation q{quotes:"\\"""\\"""\'""\'"}.mw-parser-output .id-lock-free a,.mw-parser-output .citation .cs1-lock-free a{background:linear-gradient(transparent,transparent),url("//upload.wikimedia.org/wikipedia/commons/6/65/Lock-green.svg")right 0.1em center/9px no-repeat}.mw-parser-output .id-lock-limited a,.mw-parser-output .id-lock-registration a,.mw-parser-output .citation .cs1-lock-limited a,.mw-parser-output .citation .cs1-lock-registration a{background:linear-gradient(transparent,transparent),url("//upload.wikimedia.org/wikipedia/commons/d/d6/Lock-gray-alt-2.svg")right 0.1em center/9px no-repeat}.mw-parser-output .id-lock-subscription a,.mw-parser-output .citation .cs1-lock-subscription a{background:linear-gradient(transparent,transparent),url("//upload.wikimedia.org/wikipedia/commons/a/aa/Lock-red-alt-2.svg")right 0.1em center/9px no-repeat}.mw-parser-output .cs1-subscription,.mw-parser-output .cs1-registration{c
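
The output above begins with CSS rules because the citation's `<span>` wraps a `<style>` element, and its contents are returned as text too. Whether that happens can depend on the parser and the BeautifulSoup version. One possible way to get cleaner text, sketched on a made-up citation rather than the real one, is to remove the `<style>` elements with `.decompose()` before reading the text.

```python
from bs4 import BeautifulSoup

# Made-up citation markup, loosely modelled on the structure shown above.
html = (
    '<span class="reference-text">'
    "<style>.citation{font-style:inherit}</style>"
    '<cite class="citation">Example, A. (2020). "An example title".</cite>'
    "</span>"
)
demo_citation = BeautifulSoup(html, "html.parser").select("span.reference-text")[0]

# Remove every <style> element from the tree so its CSS cannot leak into the text.
for style_tag in demo_citation.find_all("style"):
    style_tag.decompose()

# get_text() with strip=True trims whitespace around each piece of text.
clean_text = demo_citation.get_text(" ", strip=True)
```

Note that `.decompose()` modifies the tree in place, so the `<style>` element is gone from `demo_citation` afterwards.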

That gives us the text of the citation (here preceded by the CSS from a nested `<style>` element). But we can also dig deeper into the tag to get other information that's contained there.

If we want to get the link to this citation, we just have to navigate to it. We can again find whatever `a` elements are in this tag, just like we did for the soup object as a whole.

In [23]:
# Find the "a" elements
print(first_citation("a"))

[<a class="external text" href="http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext" rel="nofollow">"Data science and prediction"</a>, <a class="mw-redirect" href="/wiki/Doi_(identifier)" title="Doi (identifier)">doi</a>, <a class="external text" href="https://doi.org/10.1145%2F2500499" rel="nofollow">10.1145/2500499</a>, <a class="mw-redirect" href="/wiki/S2CID_(identifier)" title="S2CID (identifier)">S2CID</a>, <a class="external text" href="https://api.semanticscholar.org/CorpusID:6107147" rel="nofollow">6107147</a>, <a class="external text" href="https://web.archive.org/web/20141109113411/http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext" rel="nofollow">Archived</a>]


Again this returns a list. In this case the link is located in the first item. We can get that with indexing :)

In [24]:
# Get the first one
print(first_citation("a")[0])

<a class="external text" href="http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext" rel="nofollow">"Data science and prediction"</a>


This object is also a tag. Now let's use the `.attrs` attribute to see the tag's attributes.

In [25]:
first_citation("a")[0].attrs

{'rel': ['nofollow'],
 'class': ['external', 'text'],
 'href': 'http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext'}

You'll notice that `.attrs` is a dictionary, and the tag itself can be indexed the same way. Since we want the link, we can use `href` as a dictionary key to get the corresponding value.

In [26]:
print(first_citation("a")[0]['href'])

http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext
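
Not every `a` element has an `href` (the `<a id="top"></a>` anchor near the top of the page is one example), and indexing a missing attribute raises a `KeyError`. A minimal sketch on made-up markup of the difference between indexing and `.get()`:

```python
from bs4 import BeautifulSoup

html = (
    '<a class="external text" href="https://example.org/paper" rel="nofollow">Paper</a>'
    '<a id="top"></a>'
)
demo_soup = BeautifulSoup(html, "html.parser")
with_href, without_href = demo_soup.find_all("a")

link = with_href["href"]             # indexing raises KeyError if the attribute is missing
classes = with_href["class"]         # multi-valued attributes such as class come back as lists
missing = without_href.get("href")   # .get() returns None instead of raising
```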


Now, get all the links contained in the references and add them to a list:

In [None]:
# make accumulator list
refs_list = []

# start at the endnotes
references = soup.select("span.reference-text")

# loop through references
for ref in references:
    links = ref("a")  # all the <a> elements inside this reference
    if links:  # ignore the references without links
        # keep the href of the first link only
        refs_list.append(links[0]['href'])

# get rid of links to wiki articles
refs_list = [ref for ref in refs_list if not ref.startswith('/wiki')]

refs_list
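
To see what this kind of loop produces, here is a small offline sketch on made-up HTML. It is a variation rather than the cell above: `select_one("a[href]")` is swapped in for `ref("a")[0]` so that anchors without an `href` are skipped, and the `demo_` names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<ol>
  <li><span class="reference-text"><a href="https://example.org/paper">Paper</a></span></li>
  <li><span class="reference-text">A book with no link</span></li>
  <li><span class="reference-text"><a href="/wiki/Example">Wiki page</a></span></li>
</ol>
"""
demo_soup = BeautifulSoup(html, "html.parser")

demo_refs = []
for ref in demo_soup.select("span.reference-text"):
    first_link = ref.select_one("a[href]")  # None when the reference has no link
    if first_link is None:
        continue
    href = first_link["href"]
    if not href.startswith("/wiki"):  # drop links to other wiki articles
        demo_refs.append(href)
```

Only the external link survives: the second reference has no link at all, and the third points to another wiki article.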

In [None]:
# Convert to data frame
citations_df = pd.DataFrame(refs_list, columns = ["Citation"])
citations_df.head()

In [None]:
# Export to .csv
citations_df.to_csv("citations.csv")
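
By default `to_csv` also writes the data frame's index as an extra first column. A minimal sketch of passing `index=False` to leave it out, using made-up URLs and an in-memory buffer instead of a file:

```python
import io

import pandas as pd

demo_df = pd.DataFrame(
    ["https://example.org/a", "https://example.org/b"], columns=["Citation"]
)

# Write to an in-memory buffer so the sketch leaves no file behind.
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)  # index=False omits the row-number column
csv_text = buffer.getvalue()

# Reading the text back gives a single "Citation" column.
round_trip = pd.read_csv(io.StringIO(csv_text))
```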