<center><b>DIGHUM101</b></center>
<center>4-5: Web Scraping</center>

---

# Web scraping with BeautifulSoup

Web scraping is the practice of collecting information from websites programmatically. While there are many libraries and frameworks in various languages that can extract web data, Python has long been a popular choice because of its many options for web scraping.

# Ethical web scraping
Before choosing to engage in web scraping, you always have to consider some things:
1. Many websites have a Terms of Use which may not allow scraping. We must respect websites that do not want to be scraped.
2. Is there an API available already? If so, there's no need for us to write a scraper. APIs are created to provide access to data in a controlled way as defined by the owners of the data, so we prefer to use APIs if they're available.
3. Making requests to a website can take a toll on its performance. A web scraper that makes too many requests can be debilitating for the site. We must scrape responsibly so that we don't disrupt the regular functioning of the website.

If you have doubts about the ethics of scraping some website, please consult with me.


# Scraping from Wikipedia
We're going to scrape some information from Wikipedia, which has a simple page layout with a consistent template.

For web scraping we're going to need two libraries: [requests](https://requests.readthedocs.io/en/master/) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). BeautifulSoup is what we use to actually navigate and parse the page that we're scraping. We'll import the `time` library too. This will allow us to `time.sleep(5)` so that we don't overload anyone's servers. 
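
As a minimal sketch of the pause-between-requests idea, the delay can be wrapped around any fetching function. The helper name `fetch_politely` and its `fetch` and `sleep` parameters are assumptions made for this illustration and are not part of `requests` or `time`. The demonstration uses a stand-in fetch function, so no real request is sent.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    fetch_politely is a hypothetical helper written for this example.
    """
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Demonstration with a stand-in fetch function, so no real request is made.
pages = fetch_politely(
    ["page-1", "page-2", "page-3"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.01,
)
print(pages)
```

In the notebook, `fetch` could be `requests.get` and `delay_seconds` could stay at 5 to match the `time.sleep(5)` mentioned above.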

We will talk a little about HTML and CSS - you need to know more about these if you want to get good at web scraping. Here's a good place to start: [What are HTML and CSS?](https://html.com/)

If you're looking for a quick crash course in developer tools for HTML and CSS, check out this [YouTube video](https://www.youtube.com/watch?v=FQKvro1Wz-E).

In [None]:
# !pip install beautifulsoup4

In [1]:
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd

<img src="../img/html-image-tag.png" alt="source" style="width: 400px;"/>

### For this exercise, we will scrape all the citations on the Wikipedia "Data Science" page

First, have a look at what's on the [Data science](https://en.wikipedia.org/wiki/Data_science) Wikipedia page. Next, we'll access this page by making a GET request with the `.get` function of the `requests` library.

In [2]:
r = requests.get('https://en.wikipedia.org/wiki/Data_science')

We now have a `Response` object. Unlike JSON, which `requests` can decode with `.json()`, there is no `.html` method in the requests library, but BeautifulSoup will help us get there. First, extract the HTML string:

In [3]:
source = r.text
source

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Data science - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-

Neat! If you visit the Data Science Wikipedia page, right click with your mouse and click "View source" - it's the same thing! 

<img src="../img/page_source.gif" alt="source" style="width: 400px;"/>

Now we convert it into a BeautifulSoup object that makes navigating the HTML tree much easier.

Note that Beautiful Soup offers a number of ways to customize how the parser treats incoming HTML and XML. We are using the `html.parser` parser here, but we could use [different ones](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers) as well. It all depends on the website you're trying to scrape.

In [4]:
soup = BeautifulSoup(source, "html.parser")
print(type(soup))
print(soup)

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>

<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Data science - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width

Then, use the `.prettify()` method to look at the HTML, and even get a slice of it. Let's take a look at what we have:

In [5]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Data science - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-en

Let's use BeautifulSoup functions to find things on a page, such as:

1. HTML tags
2. HTML Attributes
3. CSS Selectors

Let's search first for **HTML tags**. 

The method `find_all` searches the `soup` tree to find all the elements with a particular HTML tag, and returns a list of all those elements. Let's search for all of the [`a` tags](https://www.w3schools.com/tags/tag_a.asp) (i.e., hyperlinks).

In [6]:
soup.find_all("a")

[<a class="mw-jump-link" href="#bodyContent">Jump to content</a>,
 <a accesskey="z" href="/wiki/Main_Page" title="Visit the main page [z]"><span>Main page</span></a>,
 <a href="/wiki/Wikipedia:Contents" title="Guides to browsing Wikipedia"><span>Contents</span></a>,
 <a href="/wiki/Portal:Current_events" title="Articles related to current events"><span>Current events</span></a>,
 <a accesskey="x" href="/wiki/Special:Random" title="Visit a randomly selected article [x]"><span>Random article</span></a>,
 <a href="/wiki/Wikipedia:About" title="Learn about Wikipedia and how it works"><span>About Wikipedia</span></a>,
 <a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us" title="How to contact Wikipedia"><span>Contact us</span></a>,
 <a href="/wiki/Help:Contents" title="Guidance on how to use and edit Wikipedia"><span>Help</span></a>,
 <a href="/wiki/Help:Introduction" title="Learn how to edit Wikipedia"><span>Learn to edit</span></a>,
 <a href="/wiki/Wikipedia:Community_portal" title="The

Since the `.find_all()` method is used so frequently, there is a shortcut for it. You can just treat the soup object itself as a function, and pass it the tag you're looking for as an argument.

So `soup.find_all('a')` is the same as `soup('a')`:

In [None]:
soup.find_all('a') == soup('a')
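
To see the shortcut on a page small enough to read at a glance, here is a minimal sketch that parses a hand-written HTML string. The HTML and the name `mini_soup` are assumptions for illustration and are not taken from Wikipedia.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (an assumption for illustration, not Wikipedia's HTML).
html = """
<html><body>
  <h1 class="firstHeading">Example page</h1>
  <p>See <a href="/wiki/Statistics">statistics</a> and
     <a href="/wiki/Algorithm">algorithms</a>.</p>
</body></html>
"""
mini_soup = BeautifulSoup(html, "html.parser")

links_long = mini_soup.find_all("a")   # explicit method call
links_short = mini_soup("a")           # shortcut: call the soup object directly

print(len(links_long))
print(links_long == links_short)
```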

You probably noticed that `.find_all()` returned a lot of elements, most of which we might not want. One way to narrow down our search is to keep only elements that have a certain CSS class, for instance with the `class_` argument of `.find_all()`. Alternatively, we can use the `.select()` method, which takes a CSS selector: the tag and the CSS class separated by a period. For instance, we can grab the title with the following CSS selector:

In [None]:
soup.select("h1.firstHeading")
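
The two ways of filtering by CSS class can be compared side by side. This is a minimal sketch on a hand-written HTML string (an assumption for illustration), where `mini_soup` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration only.
html = """
<div class="reflist">
  <span class="reference-text">First reference</span>
  <span class="note">Not a reference</span>
  <span class="reference-text">Second reference</span>
</div>
"""
mini_soup = BeautifulSoup(html, "html.parser")

# Filter by tag name plus the class_ keyword argument...
by_argument = mini_soup.find_all("span", class_="reference-text")
# ...or express the same filter as a CSS selector: tag, period, class.
by_selector = mini_soup.select("span.reference-text")

print([tag.text for tag in by_argument])
print(list(by_argument) == list(by_selector))
```

Both calls should find the same two `span` elements here. The selector form tends to be more compact once the filter involves nesting or several classes.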

How are we getting all these tag and attribute names? Typically, you will want to go to a web page on your browser, right-click on an element you're interested in (such as the heading in the example above) and select "inspect" in order to see the HTML and CSS that makes up the web page. You can then also navigate to other elements in the HTML.

<img src="../img/inspect.gif" alt="inspect" style="width: 800px;"/>

# Scraping text

Inspecting the HTML, we can see there's a tag with the id `mw-content-text`, where all the main text of the article can be found. Let's retrieve it.

In [7]:
# 'mw-content-text' is the value of the element's id attribute
body = soup.find(id="mw-content-text")
body

<div class="mw-body-content" id="mw-content-text"><div class="mw-content-ltr mw-parser-output" dir="ltr" lang="en"><div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">Field of study to extract knowledge from data</div>
<style data-mw-deduplicate="TemplateStyles:r1236090951">.mw-parser-output .hatnote{font-style:italic}.mw-parser-output div.hatnote{padding-left:1.6em;margin-bottom:0.5em}.mw-parser-output .hatnote i{font-style:normal}.mw-parser-output .hatnote+link+.hatnote{margin-top:-0.5em}@media print{body.ns-0 .mw-parser-output .hatnote{display:none!important}}</style><div class="hatnote navigation-not-searchable" role="note">Not to be confused with <a href="/wiki/Information_science" title="Information science">Information science</a> or <a href="/wiki/Computer_science" title="Computer science">Computer science</a>.</div>
<p class="mw-empty-elt">
</p>
<figure class="mw-default-size" typeof="mw:File/Thumb"><a class="mw-file-description" href="/wiki

In [8]:
type(body)

bs4.element.Tag

Once we identify elements, we want to access the information in a certain element. This usually means two things:

1. Text
2. Attributes

Our `body` variable is a BeautifulSoup `Tag` object. This means it has a `text` attribute. Let's grab all the `p` (paragraph) tags from our resulting BeautifulSoup object and print their `text` attributes.

In [9]:
for t in body.find_all("p"):
    print(t.text)



Data science is an interdisciplinary academic field[1] that uses statistics, scientific computing, scientific methods, processing, scientific visualization, algorithms and systems to extract or extrapolate knowledge from potentially noisy, structured, or unstructured data.[2]

Data science also integrates domain knowledge from the underlying application domain (e.g., natural sciences, information technology, and medicine).[3] Data science is multifaceted and can be described as a science, a research paradigm, a research method, a discipline, a workflow, and a profession.[4]

Data science is "a concept to unify statistics, data analysis, informatics, and their related methods" to "understand and analyze actual phenomena" with data.[5] It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge.[6] However, data science is different from computer science and information science. Turing
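
The `text` attribute gathers the text of nested tags as well, which is why footnote markers such as `[1]` show up in the output above. A minimal sketch on a hand-written HTML string (an assumption for illustration) makes this visible. The name `mini_body` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration only.
html = """
<div id="mw-content-text">
  <p>Data science is a field.<sup>[1]</sup></p>
  <p>   It uses <a href="/wiki/Statistics">statistics</a>.   </p>
</div>
"""
mini_body = BeautifulSoup(html, "html.parser").find(id="mw-content-text")

# .text includes the text of nested tags (<sup>, <a>); .strip() trims
# the leading and trailing whitespace of each paragraph.
paragraphs = [p.text.strip() for p in mini_body.find_all("p")]
print(paragraphs)
```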

# Scraping links 

Next, let's find all the places in the text where there is a link to another page. Using the `.find_all()` method (or its shortcut), we can find all the links on the page that are within the main text.

Note that `body` is a BeautifulSoup `Tag` object, so we can use the same search methods on it that we used on `soup`. Let's use the `.attrs` attribute to see the attributes of the first `a` tag (i.e., the first hyperlink in this BeautifulSoup object). We can get that with indexing :)

In [10]:
first_link = body("a")[0].attrs
print(first_link)

{'href': '/wiki/Information_science', 'title': 'Information science'}


You'll notice that `.attrs` gives us a dictionary, so we can index it as such. Since we want the link, we can use `href` as a dictionary key to get the corresponding value.

In [11]:
first_link['href']

'/wiki/Information_science'

In [12]:
# We can also use .get() to access attributes
first_link.get('href')

# This method is safer as it returns None if the attribute does not exist


'/wiki/Information_science'
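
The difference between indexing and `.get()` only shows up when an attribute is missing. Here is a minimal sketch on a hand-written HTML string (an assumption for illustration) with one `a` tag that has an `href` and one that does not.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration only: the second <a> has no href.
html = '<p><a href="/wiki/Statistics" title="Statistics">stats</a> <a name="top">anchor</a></p>'
mini_soup = BeautifulSoup(html, "html.parser")
with_href, without_href = mini_soup("a")

print(with_href.attrs)           # all attributes of the first tag
print(with_href["href"])         # indexing works when the attribute exists
print(without_href.get("href"))  # .get() returns None when it does not

try:
    without_href["href"]
except KeyError:
    print("indexing a missing attribute raises KeyError")
```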

Knowing this, we can now iterate over all `a` tags and access them as dictionaries to retrieve the ["href" attribute](https://www.w3schools.com/tags/att_a_href.asp), which specifies the URL of the page the link goes to.

In [13]:
for line in body.find_all('a'):
    href = line.get('href')  # ← returns None if 'href' doesn't exist
    if href:
        print(href)

/wiki/Information_science
/wiki/Computer_science
/wiki/File:PIA23792-1600x1200(1).jpg
/wiki/Comet_NEOWISE
/wiki/Astronomical_survey
/wiki/Space_telescope
/wiki/Wide-field_Infrared_Survey_Explorer
/wiki/Interdisciplinary
#cite_note-1
/wiki/Statistics
/wiki/Scientific_computing
/wiki/Scientific_method
/wiki/Scientific_visualization
/wiki/Algorithm
/wiki/Knowledge
/wiki/Data_model
/wiki/Unstructured_data
#cite_note-2
#cite_note-3
#cite_note-4
/wiki/Statistics
/wiki/Data_analysis
/wiki/Informatics
/wiki/Scientific_method
/wiki/Phenomena
/wiki/Data
#cite_note-5
/wiki/Mathematics
/wiki/Computer_science
/wiki/Information_science
/wiki/Domain_knowledge
#cite_note-:2-6
/wiki/Computer_science
/wiki/Turing_Award
/wiki/Jim_Gray_(computer_scientist)
/wiki/Empirical_research
/wiki/Basic_research
/wiki/Computational_science
/wiki/Information_technology
/wiki/Information_explosion
#cite_note-TansleyTolle2009-7
#cite_note-BellHey2009-8
#cite_note-9
/w/index.php?title=Data_science&action=edit&section=1
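
Many of the `href` values printed above are relative (they start with `/wiki/` or `#`), so they cannot be requested as they are. One possible way to turn them into full URLs, sketched here as a suggestion and not as part of the original notebook, is `urljoin` from Python's standard library. The names `base_url` and `absolute_links` are hypothetical.

```python
from urllib.parse import urljoin

# The page the links were scraped from.
base_url = "https://en.wikipedia.org/wiki/Data_science"

# A few href values of the kinds seen in the output above.
hrefs = [
    "/wiki/Statistics",                                  # relative to the site root
    "#cite_note-1",                                      # anchor on the same page
    "https://doi.org/10.1080%2F10618600.2017.1384734",   # already absolute
]

absolute_links = [urljoin(base_url, href) for href in hrefs]
for link in absolute_links:
    print(link)
```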


# Scraping references
Next, let's get the references one can find at the bottom of a Wikipedia page. Let's `find` the references part of the website first and save that to a new variable.

In [14]:
refs = soup.find("div", class_="reflist")
# or, using find_all, which returns a list of all matches (take the first one):
#refs = soup.find_all("div", class_="reflist")[0]
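
The return types of `find` and `find_all` differ, which matters for the next step. A minimal sketch on a hand-written HTML string (an assumption for illustration) shows the difference.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration only.
html = '<div class="reflist">A</div><div class="reflist">B</div>'
mini_soup = BeautifulSoup(html, "html.parser")

first_match = mini_soup.find("div", class_="reflist")      # a single Tag
all_matches = mini_soup.find_all("div", class_="reflist")  # a list of Tags
no_match = mini_soup.find("div", class_="missing")         # None when nothing matches

print(type(first_match).__name__)
print(len(all_matches))
print(no_match)
```

Because `find` returns `None` when nothing matches, it can be worth checking the result before calling further methods on it.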

Next, we'll `select` the first `span` element with the class `reference-text`.

*Note that in this case, we could either use `find_all` or `select`. Usage often depends on the use case. See [here](https://stackoverflow.com/questions/38028384/beautifulsoup-difference-between-find-and-select) if you want to learn more.*

In [15]:
first_citation = refs.select("span.reference-text")[0]
# or, using find_all
#first_citation = refs.find_all("span", class_="reference-text")[0]

first_citation


<span class="reference-text"><style data-mw-deduplicate="TemplateStyles:r1238218222">.mw-parser-output cite.citation{font-style:inherit;word-wrap:break-word}.mw-parser-output .citation q{quotes:"\"""\"""'""'"}.mw-parser-output .citation:target{background-color:rgba(0,127,255,0.133)}.mw-parser-output .id-lock-free.id-lock-free a{background:url("//upload.wikimedia.org/wikipedia/commons/6/65/Lock-green.svg")right 0.1em center/9px no-repeat}.mw-parser-output .id-lock-limited.id-lock-limited a,.mw-parser-output .id-lock-registration.id-lock-registration a{background:url("//upload.wikimedia.org/wikipedia/commons/d/d6/Lock-gray-alt-2.svg")right 0.1em center/9px no-repeat}.mw-parser-output .id-lock-subscription.id-lock-subscription a{background:url("//upload.wikimedia.org/wikipedia/commons/a/aa/Lock-red-alt-2.svg")right 0.1em center/9px no-repeat}.mw-parser-output .cs1-ws-icon a{background:url("//upload.wikimedia.org/wikipedia/commons/4/4c/Wikisource-logo.svg")right 0.1em center/12px no-repeat

In [16]:
# check out its type
print(type(first_citation))

<class 'bs4.element.Tag'>


If we want to get the link to this citation, we just have to navigate to it. We can again find whatever `a` elements are in this tag, just like we did before.

In [17]:
# Find the "a" elements
print(first_citation("a"))

[<a class="external text" href="https://doi.org/10.1080%2F10618600.2017.1384734" rel="nofollow">"50 Years of Data Science"</a>, <a href="/wiki/Journal_of_Computational_and_Graphical_Statistics" title="Journal of Computational and Graphical Statistics">Journal of Computational and Graphical Statistics</a>, <a class="mw-redirect" href="/wiki/Doi_(identifier)" title="Doi (identifier)">doi</a>, <a class="external text" href="https://doi.org/10.1080%2F10618600.2017.1384734" rel="nofollow">10.1080/10618600.2017.1384734</a>, <a class="mw-redirect" href="/wiki/S2CID_(identifier)" title="S2CID (identifier)">S2CID</a>, <a class="external text" href="https://api.semanticscholar.org/CorpusID:114558008" rel="nofollow">114558008</a>]


As you can see, this returns a list. Let's use indexing to get the first `a` tag.

In [18]:
# Get the first one
print(first_citation("a")[0])

<a class="external text" href="https://doi.org/10.1080%2F10618600.2017.1384734" rel="nofollow">"50 Years of Data Science"</a>


Since we want the link, we can use the `href` attribute again to get the corresponding value.

In [19]:
print(first_citation("a")[0]['href'])

https://doi.org/10.1080%2F10618600.2017.1384734


Now, get all the links contained in the references and add them to a list:

In [20]:
# make accumulator list
refs_list = []

# start at the endnotes
references = soup.select("span.reference-text")

# loop through references
for ref in references:
    links = ref("a")
    if links:  # ignore the references without links
        # use .get() so a first <a> without an href is skipped instead of raising KeyError
        link = links[0].get('href')
        if link:
            refs_list.append(link)

# get rid of links to wiki articles
refs_list = [ref for ref in refs_list if not ref.startswith('/wiki')]

refs_list

['https://doi.org/10.1080%2F10618600.2017.1384734',
 'http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext',
 'https://dstf.acm.org/DSTF_Final_Report.pdf',
 'https://doi.org/10.1145%2F3575663',
 'https://www.springer.com/book/9784431702085',
 'https://doi.org/10.1145%2F3076253',
 'https://books.google.com/books?id=oGs_AQAAIAAJ',
 'https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/',
 'https://doi.org/10.3390%2Fmake1010015',
 'https://www.oreilly.com/library/view/doing-data-science/9781449363871/ch01.html',
 'http://archive.nyu.edu/handle/2451/31553',
 'https://statmodeling.stat.columbia.edu/2013/11/14/statistics-least-important-part-data-science/',
 'http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf',
 'https://www2.isye.gatech.edu/~jeffwu/publications/fazhan.pdf',
 'https://doi.org/10.3390%2Fbdcc2020014',
 'http://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf',
 'https://hbr.org/2012/10/data-scientis
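
The same collect-then-filter logic can be tried on a page small enough to check by hand. This is a minimal sketch, not the notebook's exact loop: the HTML is hand-written, the `example.org` URL is a placeholder, and `find("a", href=True)` is used as a variation that picks the first `a` tag that actually has an `href`.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration only; example.org is a placeholder domain.
html = """
<ol>
  <li><span class="reference-text"><a href="https://example.org/paper">Paper</a>
      <a href="/wiki/Journal">Journal</a></span></li>
  <li><span class="reference-text">A reference without any link</span></li>
  <li><span class="reference-text"><a href="/wiki/Book">Book</a></span></li>
</ol>
"""
mini_soup = BeautifulSoup(html, "html.parser")

first_links = []
for ref in mini_soup.select("span.reference-text"):
    first_a = ref.find("a", href=True)  # first <a> that has an href, or None
    if first_a is not None:
        first_links.append(first_a["href"])

# Drop links that point to other wiki articles.
external_links = [link for link in first_links if not link.startswith("/wiki")]
print(external_links)
```

Only the first link of each reference is kept, so a reference whose first link is a wiki article is dropped even if a later link in it points elsewhere.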

In [21]:
# Convert to data frame
citations_df = pd.DataFrame(refs_list, columns = ["Citation"])
citations_df.head()

| | Citation |
|---|---|
| 0 | https://doi.org/10.1080%2F10618600.2017.1384734 |
| 1 | http://cacm.acm.org/magazines/2013/12/169933-d... |
| 2 | https://dstf.acm.org/DSTF_Final_Report.pdf |
| 3 | https://doi.org/10.1145%2F3575663 |
| 4 | https://www.springer.com/book/9784431702085 |


In [22]:
# Export to .csv
citations_df.to_csv("citations.csv")
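
By default, `to_csv` also writes the row index as an unnamed first column, which reappears as `Unnamed: 0` when the file is read back. As a minimal sketch with made-up placeholder URLs, the comparison below uses in-memory buffers in place of files. Passing `index=False` is one option if the index is not needed.

```python
import io

import pandas as pd

# Placeholder data for illustration only.
citations = pd.DataFrame(
    ["https://example.org/a", "https://example.org/b"], columns=["Citation"]
)

with_index = io.StringIO()
citations.to_csv(with_index)                  # default: the row index is written too
with_index.seek(0)

without_index = io.StringIO()
citations.to_csv(without_index, index=False)  # leave the row index out
without_index.seek(0)

columns_with_index = list(pd.read_csv(with_index).columns)
columns_without_index = list(pd.read_csv(without_index).columns)
print(columns_with_index)
print(columns_without_index)
```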