<img src='images/gesis.png' style='height: 50px; float: left'>
<img src='images/social_comquant.png' style='height: 50px; float: left; margin-left: 40px'>

## Introduction to Computational Social Science methods with Python

# Session 4: Web scraping

**Data collection** is the process of gathering and measuring information from all relevant sources, using techniques such as web scraping or API harvesting, so that it can be analyzed for research. Researchers evaluate their research questions and hypotheses on the basis of the collected data. In most cases, data collection is the first and most important step of a research project, irrespective of the field of study. The approach to data collection varies across fields, depending on the information required.

<img src="./images/webscrape.png"  width="350" height = "350" align="center"/>

Digital behavioral data from the internet is a massive source of data that we can access in various ways, such as connecting to APIs (see [Session 3: API harvesting](https://github.com/gesiscss/css_methods_python/blob/main/b_data_collection_methods/3_api_harvesting.ipynb)). In addition to APIs, web scraping opens up another way to access **Digital Behavioral Data** that is not available through convenient CSV exports or APIs. Not every website offers an API, and an API might not provide every piece of information we need. In such cases, scraping can be the only way to extract data from a website.

Websites are often valuable sources of data; for example, [weather forecasts](https://www.accuweather.com/), [articles on news sites](https://www.washingtonpost.com/), and [posts on forums](https://quora.com/). To access this sort of information on webpages, we can use web scraping. Data collected by web scraping has been used to understand hospitality through online reviews, compare food prices, aggregate news and bank accounts, and build datasets that are not available otherwise (Han et al., 2021; Hillen, 2019).

While web scraping is a common way of collecting data from websites, many websites offer APIs for accessing the public data that they host, partly to avoid unnecessary traffic on the site itself. Even when we have access to such APIs, as researchers we should respect the API access rules and always read the **Terms of Use** before collecting data. To access an API, you usually first need to create an account and apply for a developer account on the platform you want to work with. With this developer account, the platform provides you with keys (e.g., secret, public, or access keys) to authenticate against its system. More practical information and hands-on examples on API harvesting can be found in [Session 3: API harvesting](https://github.com/gesiscss/css_methods_python/blob/main/b_data_collection_methods/3_api_harvesting.ipynb).

<div class='alert alert-block alert-success'>
<b>In this session</b>, 

you will learn how to do web scraping with Python from scratch. In subsession **4.1**, we take a close look at the fundamentals of web scraping and at how Python libraries can handle different kinds of data collection. In subsubsession **4.1.1**, you will learn the basics of feed parsing with the feedparser library, and in subsubsession **4.1.2**, you will learn about HTML syntax. In subsession **4.2**, we move on to a basic web scraping tool, the Beautiful Soup library. In subsession **4.3**, we introduce the Selenium library to collect data from dynamic websites, which only load content once you interact with them (scrolling, clicking, etc.) and are difficult to scrape with more traditional approaches. We will work through actual web scraping projects throughout this session, focusing on online news sites, PDF pages, and Quora. Finally, in subsession **4.4**, we compare these libraries and discuss challenges and data privacy approaches.
</div>

<div class='alert alert-block alert-danger'>
<b>Caution</b>

This Jupyter Notebook demonstrates a workflow that consists of a **sequence of processing steps**. The notebook must be executed from top to bottom. Going back up from a certain code cell and trying to execute a cell that precedes it may not work.
</div>

## 4.1. Fundamentals of Web scraping



### 4.1.1. Feed Parsing [<a href='#destination1'>1, 2, 3</a>] <a id='destination1_'></a>

According to [Wikipedia](https://en.wikipedia.org/wiki/Web_feed), a web feed (or news feed) is a data format used for providing users with frequently updated content. Content distributors syndicate a web feed, thereby allowing users to subscribe to it by adding the feed's address to a news aggregator client (also called a feed reader or news reader). Users typically subscribe by manually entering the feed's URL, by clicking a link in a web browser, or by dragging the link from the browser to the aggregator. In short, "RSS and Atom files provide news updates from a website in a simple form for your computer."

Here we introduce [feedparser](https://pypi.org/project/feedparser/), a Python package for parsing RSS and Atom feeds. Given the link to a feed, it returns structured information in the form of Python lists and dictionaries, from which we can extract the desired information in a simple and efficient way.



#### Getting started with feedparser

<div class="alert alert-block alert-info">
<b>Hint:</b> 
    
Before importing the libraries, we need to have the necessary software packages and libraries installed. You can always go back to [Session 1: Setting up the computing environment](https://github.com/gesiscss/css_methods_python/blob/main/a_introduction/1_computing_environment.ipynb) to learn how to install the software packages and libraries that you need for this session.
    
</div>

As usual, we first import the package:

In [None]:
import feedparser

#### Parsing an RSS feed URL
To parse an RSS feed link, you can simply use the **parse()** function from the feedparser package. It takes a string as its argument, which can be a URL or the path to a file saved locally on the computer. Here we use the Voice of America (VOA) news feed as an example URL:

In [None]:
feed = feedparser.parse("https://www.voanews.com/api/zgvmqye_o_qv")

# You can try other news websites as well:

# feed = feedparser.parse("https://www.aljazeera.com/xml/rss/all.xml")
# feed = feedparser.parse("http://rss.cnn.com/rss/edition_europe.rss")

feed

<div class="alert alert-block alert-info">
<b>Hint:</b> 
You can try the following ways in order to get a website's RSS feed:

- If the website is powered by WordPress, you can add /feed/ to the end of its URL. Trying /rss/ is another option.
<img src='images/rss_logo.png' style='height: 50px; float: right; margin-left: 50px' >
- If you see the standard orange RSS logo, by simply clicking on it you will be taken to the website's RSS feed.
- You can also use the page source: right click on the page and choose page source. In the new window, use ctrl+f and type in RSS. You’ll find the feed’s URL between the quotes after **href=**.

The parse function fetches the feed from the provided URL, extracts the information in a systematic way, and stores each piece in a structured format. At the top level, it returns a Python dictionary with multiple keys and values, in which each value may contain Python lists or other dictionaries. You can access the keys using the **keys()** method:
</div>

In [None]:
feed.keys()

Using these keys, we can access the more specific information that we want. The most common keys that can be used for extracting information are **entries** and **feed**.

#### Extracting the contents from the feed
We will start with the **entries** key. It holds the list of all the posts, podcasts, or whatever other form of content the feed serves. More information on other possible keys in the returned dictionary can be found [here](https://feedparser.readthedocs.io/en/latest/reference.html).

In [None]:
entries = feed['entries']
entries

We can get the number of articles/entries using the **len()** function:

In [None]:
len(entries)

#### Getting details of the entries
We can iterate over the items of the entries list and print them to get more details on each article:

In [None]:
for entry in entries:
    print(entry)
    print("\n")

As we can see, each entry in the list is a dictionary again, which has different key-value pairs like **title**, **summary**, **link**, etc. We can again use the **keys()** method in order to explore the keys of the new dictionary: 

In [None]:
entries[0].keys()

Now that we have all the keys associated with the entries, we can extract specific information such as the title, author, and actual content of each entry.
The available keys are not the same for all RSS feeds, but they tend to be very similar, so it is mostly a matter of using the right key for the information you want.

Let's say we want to print out the titles of all the entries in the feed and save them to the `titles` list. We can do that by iterating over the entries list and fetching the title of each entry:

In [None]:
titles = []

for entry in entries:
    titles.append(entry.title)
    print(entry.title)

Similarly, we can get the links, summaries, publishing dates and tags of the entries using the corresponding keys in the dictionary. We will save them in lists:

In [None]:
links = []
summaries = []
published = []
tags = []

for entry in entries:
    links.append(entry.link)
    summaries.append(entry.summary)
    published.append(entry.published)
    tags.append(entry.tags)
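
Not every feed supplies every field. An entry without tags, for instance, can make `entry.tags` fail with an `AttributeError`. Because each entry behaves like a dictionary, one defensive option is to use `.get()` with a default value. The sketch below uses plain dictionaries with invented values as stand-ins for feed entries, so the variable names and data are illustrative only:

```python
# Plain dictionaries standing in for feed entries (hypothetical data)
sample_entries = [
    {"title": "First article", "link": "https://example.org/1",
     "tags": [{"term": "politics"}]},
    {"title": "Second article", "link": "https://example.org/2"},  # no tags
]

safe_tags = []
for entry in sample_entries:
    # .get() returns the default instead of failing when the key is absent
    safe_tags.append(entry.get("tags", []))

print(safe_tags)
```

The same pattern should carry over to the real `entries` list, for example `entry.get('summary', '')` for feeds that omit summaries.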


Now we put all of the results in a pandas dataframe and then save it to a csv file. You can find it in the `outputs` folder, which is located in our current directory.

<div class='alert alert-block alert-danger'>
<b>Caution</b>

Pandas dataframes are used for keeping data in a well-structured manner. If you are not familiar with the basics of dataframes, check out [Session 2: Data handling and visualization](https://github.com/gesiscss/css_methods_python/blob/main/a_introduction/2_data_handling_and_visualization.ipynb).
</div>

In [None]:
import pandas as pd

# Build the dataframe column by column from the lists collected above
feeds_df = pd.DataFrame({'title': titles, 'published': published,
                         'summary': summaries, 'link': links, 'tags': tags})

feeds_df.head()

In [None]:
feeds_df.to_csv('./outputs/feeds_dataframe.csv')
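
One detail worth noting is that the `tags` column holds Python lists, which become hard-to-parse text once written to CSV. A possible approach is to flatten each list into a comma-separated string first. The sketch below assumes that each tag is a dictionary with a `term` key, as described in the feedparser reference for entry tags. The sample rows are invented, and the CSV is written to an in-memory buffer instead of the `outputs` folder so that the example leaves no files behind:

```python
import io

import pandas as pd

# Invented rows that mimic the structure of the feed dataframe
sample_rows = [
    {"title": "First", "tags": [{"term": "politics"}, {"term": "europe"}]},
    {"title": "Second", "tags": []},
]
sample_df = pd.DataFrame(sample_rows)

# Turn each list of tag dictionaries into one readable string
sample_df["tags"] = sample_df["tags"].apply(
    lambda tag_list: ", ".join(tag["term"] for tag in tag_list)
)

# index=False leaves out the row numbers, which are rarely needed in the file
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```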

### 4.1.2. Scraping HTML content

<img src='images/html.png' style='height: 90px; float: right; margin-left: 50px' >

The **HyperText Markup Language** or **HTML** is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript. While the main content of a web page is written in HTML, CSS adds styling to make the page look nicer, and JavaScript adds interactivity.


HTML code consists of a series of **elements**, and these elements tell the browser how to display the content. For collecting data from HTML web pages, it's necessary to have an idea of how this element syntax works.

#### HTML Element Syntax [<a href='#destination2'>4</a>] <a id='destination2_'></a>

HTML language can be applied to pieces of text to give them different meanings in a document (Is it a paragraph? Is it a bulleted list? Is it part of a table?), structure a document into logical sections (Does it have a header? Three columns of content? A navigation menu?), and embed content such as images and videos into a page. In this section we will introduce the first two of these, together with the fundamental concepts and syntax you need to know to understand HTML.

To get started, we will begin with defining elements, attributes, and some other important terms. We will also explain where these fit into HTML. You will learn how HTML elements are structured, how a typical HTML page is structured, and other important basic language features.

As already mentioned, HTML is a markup language that tells web browsers how to structure the web pages you visit. It can be as complicated or as simple as the web developer wants it to be. HTML consists of a series of elements, which you use to enclose, wrap, or mark up different parts of content to make it appear or act in a certain way. The enclosing tags can make content into a hyperlink to connect to another page, italicize words, and so on. For example, consider the following line of text:

`My cat is very grumpy`

If we wanted the text to stand by itself, we could specify that it is a paragraph by enclosing it in a paragraph (`<p>`) element:

`<p>My cat is very grumpy</p>`

<div class='alert alert-block alert-info'>
<b>Insight</b>

Tags in HTML are not case-sensitive, but it's better to write all of them in lower case for the sake of consistency and readability.
    
</div>

####  Anatomy of an HTML element

Let's further explore our paragraph element mentioned above:

<img src='images/html4.png' width="500" height="400" align="center"/>

The anatomy of our element is:

- **The opening tag**: This consists of the name of the element (in this example, p for paragraph), wrapped in opening and closing angle brackets. This opening tag marks where the element begins or starts to take effect. In this example, it precedes the start of the paragraph text.


- **The content**: This is the content of the element. In this example, it is the paragraph text.


- **The closing tag**: This is the same as the opening tag, except that it includes a forward slash before the element name. This marks where the element ends. Failing to include a closing tag is a common beginner error that can produce peculiar results.

So, *the element* is the opening tag, followed by content, followed by the closing tag.
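
To see these three parts from a program's point of view, here is a minimal sketch using `html.parser` from Python's standard library. The class name `ElementParts` and the labels are our own choices for illustration. The parser simply reports each part of the element in the order it encounters it:

```python
from html.parser import HTMLParser


class ElementParts(HTMLParser):
    """Record the parts of each element in the order they are encountered."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(("opening tag", tag))

    def handle_data(self, data):
        self.parts.append(("content", data))

    def handle_endtag(self, tag):
        self.parts.append(("closing tag", tag))


parts_parser = ElementParts()
parts_parser.feed("<p>My cat is very grumpy</p>")
parts_parser.close()

for kind, value in parts_parser.parts:
    print(f"{kind}: {value}")
```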

**Create your first HTML element:**  Edit the `html` string below (it contains HTML code) and get the actual rendered HTML output from `HTML()`. You can wrap the text of your choice with the tags `<em>` and `</em>`. Doing this should give the text italic formatting.

In [None]:
html = "<em>This is my text.</em>"

In [None]:
from IPython.display import HTML

HTML(html)

#### Nesting elements

Elements can be placed within other elements. This is called *nesting*. If we wanted to state that our cat is **very** grumpy, we could wrap the word "very" in a `<strong>` element, which means that the word is to have strong(er) text formatting:

`<p>My cat is <strong>very</strong> grumpy.</p>`

There is a right and wrong way to do nesting. In the example above, we opened the `p` element first, then opened the `strong` element. For proper nesting, we should close the `strong` element first, before closing the `p`.
The following is an example of the *wrong* way to do nesting:

`<p>My cat is <strong>very grumpy.</p></strong>`

<u>The tags have to open and close in a way that they are inside or outside one another.</u> With the kind of overlap in the example above, the browser has to guess at your intent. This kind of guessing can lead to unexpected results.
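
The nesting rule can also be checked mechanically: every closing tag must match the most recently opened tag that is still open. Below is a minimal sketch of such a check built on the standard library's `html.parser`. The names `NestingChecker` and `is_properly_nested` are ours, and the set of elements without closing tags is deliberately incomplete, so treat it as an illustration and not as a validator:

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, for illustration)
VOID_ELEMENTS = {"br", "hr", "img", "meta", "link", "input"}


class NestingChecker(HTMLParser):
    """Track open tags on a stack and flag closing tags that do not match."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.well_nested = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.well_nested = False


def is_properly_nested(markup):
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    # Properly nested means no mismatch and nothing left open
    return checker.well_nested and not checker.open_tags


print(is_properly_nested("<p>My cat is <strong>very</strong> grumpy.</p>"))
print(is_properly_nested("<p>My cat is <strong>very grumpy.</p></strong>"))
```

The first call reports `True` and the second `False`, matching the right and wrong examples above.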

#### Block versus inline elements

There are two important categories of elements to know in HTML: block-level elements and inline elements.

- Block-level elements form a visible block on a page. A block-level element appears on a new line following the content that precedes it. Any content that follows a block-level element also appears on a new line. Block-level elements are usually structural elements on the page. For example, a block-level element might represent headings, paragraphs, lists, navigation menus, or footers. A block-level element wouldn't be nested inside an inline element, but it might be nested inside another block-level element.


- Inline elements are contained within block-level elements, and surround only small parts of the document's content (not entire paragraphs or groupings of content). An inline element will not cause a new line to appear in the document. It is typically used with text, for example an `<a>` element creates a hyperlink, and elements such as `<em>` or `<strong>` create emphasis.

Consider the following example:

`<em>first</em><em>second</em><em>third</em>`

`<p>fourth</p><p>fifth</p><p>sixth</p>`

`<em>` is an inline element. As you can see below, the first three elements sit on the same line, with no space in between. On the other hand, `<p>` is a block-level element. Each p element appears on a new line, with space above and below. (The spacing is due to default CSS styling that the browser applies to paragraphs.)

In [None]:
HTML("<em>first</em><em>second</em><em>third</em>")

In [None]:
HTML("<p>fourth</p><p>fifth</p><p>sixth</p>")

#### Empty elements

Not all elements follow the pattern of an opening tag, content, and a closing tag. Some elements consist of a single tag, which is typically used to insert/embed something in the document. For example, the `<img>` element embeds an image file onto a page:

`<img src="https://raw.githubusercontent.com/mdn/beginner-html-site/gh-pages/images/firefox-icon.png">`

This would output the following:

In [None]:
HTML('<img src="https://raw.githubusercontent.com/mdn/beginner-html-site/gh-pages/images/firefox-icon.png">')

#### Attributes

Elements can also have attributes. Attributes look like this:

<img src='images/html5.png' width="800" height="400" align="center"/>

Attributes contain extra information about the element that won't appear in the content. In this example, the `class` attribute is an identifying name used to target the element with style information.

An attribute should have:

- A space between it and the element name. (For an element with more than one attribute, the attributes should be separated by spaces too.)
- The attribute name, followed by an equal sign.
- An attribute value, wrapped with opening and closing quote marks.

**Adding attributes to an element**: Another example of an element is `<a>`. This stands for *anchor*. An anchor can make the text it encloses into a hyperlink. Anchors can take a number of attributes; some of the most common are:

- `href`: This attribute's value specifies the web address for the link. For example: `href="https://www.mozilla.org/"`
- `title`: The `title` attribute specifies extra information about the link, such as a description of the page that is being linked to. For example, `title="The Mozilla homepage"`. This appears as a tooltip when a cursor hovers over the element.
- `target`: The `target` attribute specifies the browsing context used to display the link. For example, `target="_blank"` will display the link in a new tab. If you want to display the linked content in the current tab, just omit this attribute.

You can edit the `html` string below to turn it into a link to your favorite website:

In [None]:
html = '<p>A link to my <a href="https://www.mozilla.org/" title="The Mozilla homepage" target="_blank">favorite website</a>.</p>'
HTML(html)

#### Anatomy of an HTML document

Individual HTML elements aren't very useful on their own. Next, let's examine how individual elements combine to form an entire HTML page:

```
<!DOCTYPE html>
<html lang="en-US">
  <head>
    <meta charset="utf-8">
    <title>My test page</title>
  </head>
  <body>
    <p>This is my page</p>
  </body>
</html>
```

Here we have:

1. `<!DOCTYPE html>`: The doctype. When HTML was young (1991-1992), doctypes were meant to act as links to a set of rules that the HTML page had to follow to be considered good HTML. More recently, the doctype is a historical artifact that needs to be included for everything else to work right. `<!DOCTYPE html>` is the shortest string of characters that counts as a valid doctype. That is all you need to know!


2. `<html></html>`: The `<html>` element. This element wraps all the content on the page. It is sometimes known as the root element.


3. `<head></head>`: The `<head>` element. This element acts as a container for everything you want to include on the HTML page **that isn't the content** the page will show to viewers. This includes keywords and a page description that would appear in search results, CSS to style content, character set declarations, and more.


4. `<meta charset="utf-8">`: The `<meta>` element. This element represents metadata that cannot be represented by other HTML meta-related elements, like `<base>`, `<link>`, `<script>`, `<style>` or `<title>`. The `charset` attribute sets the character set for your document to UTF-8, which includes most characters from the vast majority of human written languages. With this setting, the page can handle any textual content it might contain. There is no reason not to set this, and it can help avoid some problems later.


5. `<title></title>`: The `<title>` element. This sets the title of the page, which is the title that appears in the browser tab the page is loaded in. The page title is also used to describe the page when it is bookmarked.


6. `<body></body>`: The `<body>` element. This contains all the content that displays on the page, including text, images, videos, games, playable audio tracks, or whatever else.

Later in this notebook, you will get to explore HTML code in more detail.

#### HTML Tree Structure [<a href='#destination3'>5</a>] <a id='destination3_'></a>
 
Each HTML document can actually be referred to as a document tree. We describe the elements in the tree like we would describe a family tree. There are ancestors, descendants, parents, children and siblings.

Use the sample HTML document below for the following examples. The `<head>` section of the document is omitted for brevity.

```
<body>

  <div id="content">
    <h1>Heading here</h1>
    <p>Lorem ipsum dolor sit amet.</p>
    <p>Lorem ipsum dolor <em>sit</em> amet.</p>
    <hr>
  </div>
  
  <div id="nav">
    <ul>
      <li>item 1</li>
      <li>item 2</li>
      <li>item 3</li>
    </ul>
  </div>

</body>
```

A diagram of the above HTML document tree would look like this:

<img src='images/tree1.gif' width="435" height="400" align="center"/>

##### Ancestor

An ancestor refers to any element that is connected but further up the document tree - no matter how many levels higher.

In the diagram below, the `<body>` element is the ancestor of all other elements on the page.

<img src='images/tree_ancestor.gif' width="435" height="400" align="center"/>

##### Descendant

A descendant refers to any element that is connected but lower down the document tree - no matter how many levels lower.
In the diagram below, all elements that are connected below the `<div>` element are descendants of that `<div>`.

<img src='images/tree_descendant.gif' width="435" height="400" align="center"/>

##### Parent and Child

A parent is an element that is directly above and connected to an element in the document tree. In the diagram below, the `<div>` is a parent to the `<ul>`.

A child is an element that is directly below and connected to an element in the document tree. In the diagram above, the `<ul>` is a child to the `<div>`.

<img src='images/tree_parent.gif' width="435" height="400" align="center"/>

##### Sibling

A sibling is an element that shares the same parent with another element.

In the diagram below, the `<li>`s are siblings as they all share the same parent - the `<ul>`.

<img src='images/tree_siblings.gif' width="435" height="400" align="center"/>

## 4.2. Scraping static HTML content with Beautiful Soup
<img src='images/bs.png' style='height: 150px; float: right; margin-left: 0px' >

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

Now that you have an idea of how HTML webpages are structured, we can start working with Beautiful Soup. We will go through some of its most important methods, and then you will get to write your first scraping project.

<div class='alert alert-block alert-danger'>
<b>Caution</b>
    
If you still need to install the package on your system, use `pip` (check out [Session 1](https://github.com/gesiscss/css_methods_python/blob/main/a_introduction/1_computing_environment.ipynb) for installing packages), and then import the necessary packages as the upcoming cell shows.
</div>

In [None]:
from bs4 import BeautifulSoup

### 4.2.1. Learning basic functions

[<a href='#destination4'>6, 7</a>] <a id='destination4_'></a>

We will begin with an example page at http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html.

The HTML source code of the page is stored in the `content` string as follows:


In [None]:
content = """<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>"""

In [None]:
HTML(content)

#### requests

You can get the same content by fetching the page through `requests`. It is a simple and useful HTTP library:

In [None]:
import requests

page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
content = page.content

By printing `page`, you can check to see if fetching the contents has been successful. The status code of "200" means you are good to go:

In [None]:
print(page)
# print(page.status_code)

#### HTML parser

Using an HTML parser, Beautiful Soup transforms a complex HTML document into a tree of Python objects, which makes it much easier to work with.

In [None]:
soup = BeautifulSoup(content, 'html.parser')

In [None]:
soup

Using `soup.prettify()`, we can get a better overview of the tree structure of the code:

In [None]:
print(soup.prettify())

Each tag can now be accessed as an object. Using dot notation, we can navigate to a tag, and its `children` attribute gives access to all of its child objects:

In [None]:
list(soup.html.body.children)
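
The family relationships from the HTML tree structure section map directly onto Beautiful Soup attributes and methods. The following sketch works on a shortened copy of the sample document from that section. It uses its own variable names (`tree_html`, `tree_soup`) so that the `soup` object of this notebook stays untouched:

```python
from bs4 import BeautifulSoup

# A shortened copy of the sample document from the tree structure section
tree_html = """
<body>
  <div id="nav">
    <ul>
      <li>item 1</li>
      <li>item 2</li>
      <li>item 3</li>
    </ul>
  </div>
</body>
"""
tree_soup = BeautifulSoup(tree_html, "html.parser")
ul = tree_soup.find("ul")

# Parent: the element directly above
parent_name = ul.parent.name

# Ancestors: every element further up the tree
ancestor_names = [tag.name for tag in ul.parents]

# Children: only the direct <li> elements (recursive=False skips deeper levels)
child_texts = [li.get_text() for li in ul.find_all("li", recursive=False)]

# Siblings: elements that share the same parent
first_li = ul.find("li")
sibling_texts = [li.get_text() for li in first_li.find_next_siblings("li")]

print(parent_name, ancestor_names, child_texts, sibling_texts)
```

Note that `.children` also yields the whitespace between tags as text nodes, which is why `find_all(..., recursive=False)` is used here to keep only the `<li>` elements.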

#### find() & find_all()

Two of the most important methods of Beautiful Soup are `find()` and `find_all()`.

The `find()` method finds the first occurrence of a tag matching the given criteria. Its first argument is the tag name, so if we pass `p` as a string to it, it will return the first occurrence of the `p` tag:

In [None]:
soup.find('p')

As you can see, the output is the same as when we use a dot for accessing the `p` tag:

In [None]:
soup.p

With the `find_all()` method, we can get a list of all the occurrences of a tag matching the given criteria. Again, if we pass the "p" string to it, it will return all the occurrences of the `p` tag:

In [None]:
soup.find_all('p')

In [None]:
len(soup.find_all('p'))

In [None]:
soup.find_all('p')[0]

We can also specify attribute values and pass them to the method. The following line of code returns the list of all the `p` tags whose `class` attribute includes `"outer-text"`.

In [None]:
soup.find_all('p', {'class': "outer-text"})

This one returns the list of all tags whose `id` attributes equal `"first"`:

In [None]:
soup.find_all(id="first")
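
Once a tag has been found, its attributes can be read much like dictionary entries. This is how link targets are typically pulled out of `<a>` elements. The sketch below uses a small snippet of its own, with variable names such as `attr_soup` chosen so that the notebook's `soup` is not overwritten:

```python
from bs4 import BeautifulSoup

attr_html = (
    '<p class="inner-text first-item" id="first">'
    'A link to my <a href="https://www.mozilla.org/" '
    'title="The Mozilla homepage">favorite website</a>.</p>'
)
attr_soup = BeautifulSoup(attr_html, "html.parser")

paragraph = attr_soup.find("p")
anchor = attr_soup.find("a")

# Square brackets read an attribute that is known to exist
link_target = anchor["href"]

# .get() returns None instead of raising an error for a missing attribute
link_rel = anchor.get("rel")

# class can hold several values, so Beautiful Soup returns it as a list
paragraph_classes = paragraph["class"]
paragraph_id = paragraph["id"]

print(link_target, link_rel, paragraph_classes, paragraph_id)
```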

#### select()

Beautiful Soup has a `select()` method which uses the [SoupSieve](https://facelessuser.github.io/soupsieve/) package to run a CSS selector against a parsed document and return all the matching elements.

The SoupSieve documentation lists all the currently supported CSS selectors, but here are some of the basics.

You can find tags:

In [None]:
soup.select("p")

You can find tags beneath other tags:

In [None]:
soup.select("div p")

You can find tags with specific classes:

In [None]:
soup.select("p.first-item")

You can find tags by id:

In [None]:
soup.select("#second")

In [None]:
soup.select("p#second")

And you can also find tags by a combination of the above-mentioned criteria:

In [None]:
soup.select("div p.first-item#first")
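
When only the first match is needed, `select_one()` is a convenient companion to `select()`: it returns a single tag, or `None` if nothing matches. The sketch below runs on a condensed copy of the example page and also shows how the results of `select()` are usually turned into plain text. The variable names are ours:

```python
from bs4 import BeautifulSoup

select_html = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
select_soup = BeautifulSoup(select_html, "html.parser")

# select() always returns a list, which we can reduce to text
outer_texts = [p.get_text(strip=True) for p in select_soup.select("p.outer-text")]

# select_one() returns the first match only ("div > p" means direct child)
first_inner = select_soup.select_one("div > p.inner-text")

# No match: select_one() gives None, so check before using the result
missing = select_soup.select_one("table")

print(outer_texts)
print(first_inner.get_text())
print(missing)
```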

#### get_text()

If you only want the human-readable text inside a document or tag, you can use the `get_text()` method. It returns all the text in a document or beneath a tag, as a single Unicode string:

In [None]:
soup.get_text()

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

In [None]:
soup.get_text(strip=True)

You can also specify a string to be used to join the bits of text together:

In [None]:
soup.get_text("|", strip=True)

But at that point you might want to use the `stripped_strings` generator instead, and process the text yourself:

In [None]:
[text for text in soup.stripped_strings]

### 4.2.2. Extracting relevant information from static webpages

Depending on our research projects, we might need data from different sources. For example, if we want to investigate news exposure in specific countries and compare the topics of the news, the first step is to collect news articles from news websites. Or, if our project is about understanding the subjects discussed at European Union meetings, we first need to have the meeting minutes at hand. The data collection processes for these two example projects are explained in detail in this section. You can always think of another project that uses and combines techniques similar to those introduced here.

#### Example: Scraping news articles from Aljazeera [<a href='#destination5'>8</a>] <a id='destination5_'></a>

<img src='images/aljazeera.png' style='height: 150px; float: right; margin-left: 50px' >

Now that you are familiar with the basics of Beautiful Soup, we can work on a more practical scraping project: collecting news about Turkey from a particular news website, [aljazeera.com](https://www.aljazeera.com). This lets you practice what you have learnt so far.

To get a better idea of what exactly we are going to do, go to the [Aljazeera website](https://www.aljazeera.com), use the search bar, and search for "Turkey". On the results page, sort the retrieved news articles by date. As you can see, the 10 most recent news articles related to Turkey are now displayed. We are going to scrape these articles and store their relevant information (e.g., title, text, or URL) in a pandas dataframe.

First, we need to make sure we have all the necessary packages available:

In [None]:
# import these libraries if you have not done so
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd

Then we construct the URL from `address`, `searchterm`, and `parameters`:

In [None]:
address = "https://www.aljazeera.com/search/"
searchterm = "Turkey"
parameters = "?sort=date"
url = address + searchterm + parameters 

In [None]:
url
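
Plain string concatenation works for a single-word search term. For terms that contain spaces or special characters, a slightly more careful sketch uses the standard library's `urllib.parse`. The helper name `build_search_url` is hypothetical, and whether the website accepts percent-encoded search terms in this position is an assumption that you should check in your browser.

```python
from urllib.parse import quote, urlencode

def build_search_url(address, searchterm, **parameters):
    """Join a base address, a search term and optional query parameters."""
    # quote() percent-encodes spaces and other characters that are unsafe in a URL path
    url = address + quote(searchterm)
    if parameters:
        # urlencode() turns {'sort': 'date'} into 'sort=date'
        url += "?" + urlencode(parameters)
    return url

example_url = build_search_url("https://www.aljazeera.com/search/", "Turkey", sort="date")
two_word_url = build_search_url("https://www.aljazeera.com/search/", "climate change", sort="date")

print(example_url)
print(two_word_url)
```

For the single-word term used in this example, the helper produces exactly the same string as the concatenation above.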

The resulting URL is the same as that of the page you explored at the first stage. Now we fetch it using `requests`:

In [None]:
page = requests.get(url) 

Then we parse the webpage with Beautiful Soup:

In [None]:
soup = BeautifulSoup(page.content,'lxml')

Now that we have the page parsed, we need to select the right elements of it to extract our desired information from. In the simple webpage that we investigated in the Beautiful Soup Basics section, it was easy to pick the right elements to investigate from the few lines of code. In real HTML web pages it's a bit different.

In order to find the right elements, right-click somewhere on the page and click on *Inspect*. Then press Ctrl+Shift+C. Now you can inspect the page and see the HTML code for each part of the page that you hover the mouse over. Conversely, by hovering the mouse over a line of HTML code, you can see which part of the page that code creates.

On Google Chrome it would look like this:

<img src='images/inspect.png' style='height: 550px; float: right; margin-left: 50px' >

It turns out that the elements that we would like to work on are the ones with the `article` tags. We'll select them:

In [None]:
articles = soup.select('article')

Next, we will scrape different pieces of information from the articles. We put every article's title, text, URL, and date in a dictionary, and add all the dictionaries to the `results` list:

In [None]:
len(articles)

In [None]:
# Initialize empty list for results
results = []

for article in articles: 
    
    # Initialize empty dictionary
    item = {}

    # Extract title, text, URL and date of the current article
    item['title'] = article.select_one('span').text.strip()
    item['text'] = article.select_one('p').text.strip()
    item['url'] = article.select_one('a').get('href')
    # You can also get the URL with article.select_one('a')['href']
    # The date is in the third <span> of the article
    item['date'] = article.select('span')[2].text.strip()
    
    # Append items to result-list
    results.append(item)
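
Real pages are not always uniform: if one article lacks an element, `select_one` returns `None` and calling `.text` on it raises an error. The following is a minimal sketch of a more defensive extraction. The helper `text_or_none` and the HTML markup are made up for illustration and do not reproduce the actual structure of the Aljazeera page.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, selector):
    """Return the stripped text of the first match for `selector`, or None."""
    element = parent.select_one(selector)
    return element.get_text(strip=True) if element is not None else None

# Made-up markup: the second article has no <p> summary
html = """
<article><a href="/news/a"><span>First title</span></a><p>First summary.</p></article>
<article><a href="/news/b"><span>Second title</span></a></article>
"""
demo_articles = BeautifulSoup(html, "html.parser").select("article")

demo_results = []
for article in demo_articles:
    link = article.select_one("a")
    demo_results.append({
        "title": text_or_none(article, "span"),
        "text": text_or_none(article, "p"),
        # Only read the href if a link was actually found
        "url": link.get("href") if link is not None else None,
    })

print(demo_results)
```

Missing values end up as `None` in the dictionaries, and pandas will show them as missing values once the list is converted to a dataframe.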

Finally, we convert the results list to a dataframe:

In [None]:
results = pd.DataFrame(results)
results.head()
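
If the extracted `href` values turn out to be relative paths (for example, starting with `/news/`), they need to be combined with the site's address before they can be requested. A minimal sketch with the standard library's `urljoin` follows; the example paths are hypothetical.

```python
from urllib.parse import urljoin

base = "https://www.aljazeera.com"

# Hypothetical href values as they might come out of the scraping loop
hrefs = [
    "/news/2023/1/1/example-article",
    "https://www.aljazeera.com/features/another-example",
]

# urljoin() completes relative paths and leaves absolute URLs untouched
absolute_urls = [urljoin(base, href) for href in hrefs]

print(absolute_urls)
```

The same function could be applied to a dataframe column, for instance with `results['url'].apply(lambda href: urljoin(base, href))`.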

We can save the resulting dataframe to a CSV file, which you can then find in the `outputs` folder in the current directory.

In [None]:
os.makedirs('./outputs', exist_ok=True)  # create the folder if it does not exist yet
results.to_csv('./outputs/aljazeera.csv', mode='w')

#### Example: Collecting multiple PDF files from different pages of a particular website

<img src='images/eu_council.png' style='height: 90px; float: right; margin-left: 50px' >

In this example, we show how to automate downloading numerous PDF files from different pages of a particular website with the help of `requests` and `BeautifulSoup`. We will collect data from the minutes of European Council meetings; take a look at [this web page](https://www.consilium.europa.eu/en/documents-publications/public-register/council-minutes/?year=2023) and check out its structure.

The image below will be our starting web page for this task:

<img src='images/pdf_download_1.png' style='height: 550px; float: right; margin-left: 50px' >

As you can see, several elements are relevant to our work:

   - The first one (label `1`) shows how many contents there are for each year. Some of these contents may contain PDF files, and some may not.
   - The second one (label `2`) is a clickable link that takes you to the contents of each year.
   - The ones with label `3` are the contents.
   - The ones with label `4` are the links for downloading PDF files.
   - The ones with label `5` are the dates on which these contents were added to the website.

If you scroll down to the end of the page, you can see that the contents may be spread over many pages. In fact, 20 contents are listed on each page, so a year like 2001, which has 242 contents, spans 13 pages:


<img src='images/pdf_download_2.png' style='height: 550px; float: right; margin-left: 50px' >
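
The page count quoted above can be checked with a ceiling division: we divide the number of contents by 20 and round up. The function name `count_pages` is hypothetical.

```python
import math

def count_pages(number_of_contents, per_page=20):
    """Number of webpages needed to list all contents."""
    return math.ceil(number_of_contents / per_page)

print(count_pages(242))  # a year with 242 contents
print(count_pages(240))  # an exact multiple of 20
```

Rounding up matters at the boundaries: a plain floor division plus one would count one page too many whenever the total is an exact multiple of 20.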

We are going to automatically go through all these pages for **any number of years** of our choice, and download all the available PDFs into a `PDFs` folder in the `outputs` directory.

#### Getting started

After making sure that we have imported all the necessary libraries, we'll put the main url of the site in the `main_url` variable. We will be using it in our main function `get_PDFs`.

In [None]:
import requests
import os
from bs4 import BeautifulSoup

main_url = "https://www.consilium.europa.eu/en/documents-publications/public-register/council-minutes/"

Once again, go to the main webpage and click on a year other than 2023, for example 2021. Take a look at the URL of the new page:

`https://www.consilium.europa.eu/en/documents-publications/public-register/council-minutes/?year=2021`

As you can see, it is in the following form:

`main_url` + `?year=2021`

For each year that you click on, the resulting URL consists of `main_url` followed by the `?year=` string and the year number (here: `2021`).

If you click on the second page of 2021, you can see that a similar pattern applies to the different pages within each year. For the second page, the URL is

`https://www.consilium.europa.eu/en/documents-publications/public-register/council-minutes/?year=2021&Page=2`

which follows this structure: `main_url` + `?year=2021` + `&Page=2`

We will use these simple rules to loop through all the pages that we want, and get all the PDFs.
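
The two rules above can be sketched as a small helper function. The name `build_page_url` is hypothetical; the sketch only assembles strings and does not contact the website.

```python
from urllib.parse import urlencode

main_url = "https://www.consilium.europa.eu/en/documents-publications/public-register/council-minutes/"

def build_page_url(year, page=1):
    """URL of one result page for a given year, following the rules above."""
    query = {"year": year}
    # The first page carries no 'Page' parameter; later pages do
    if page > 1:
        query["Page"] = page
    return main_url + "?" + urlencode(query)

print(build_page_url(2021))
print(build_page_url(2021, page=2))
```

Building the query string with `urlencode` keeps the `?` and `&` separators in the right places, however many parameters are added.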

#### The main function: `get_PDFs`

Now that you know the principles needed for scraping the website, we can turn to the `get_PDFs` function. It takes two arguments, `start_year` and `end_year`, and downloads every PDF file available on the website for the years from `start_year` to `end_year`, inclusive. It saves the files in one folder per year inside the `PDFs` folder. Further explanations can be found in the comments in the code:

In [None]:
def get_PDFs(start_year, end_year):

    # Making a list of all the years between start_year and end_year (inclusive):
    years = list(range(start_year, end_year + 1))

    # Looping through all years, using the  years list:
    for year in years:

        # A counter for PDFs that will be used in naming the downloaded files:
        pdf_counter = 1

        # Making the url of the first page of the year:
        first_page_url = main_url + '?year=' + str(year)

        
        # Finding the number of contents in the page, to be used to determine the number of webpages in each year:
        first_page = requests.get(first_page_url)
        content = first_page.content
        soup = BeautifulSoup(content, 'html.parser')
        number_of_contents = int(soup.find('h2').text.split()[0])
        # Each webpage lists 20 contents, so we round up (ceiling division):
        number_of_pages = (number_of_contents + 19) // 20

        # Starting the download procedure:
        print(f'Downloading PDFs of {year} (from {number_of_pages} webpages):\n')

        # Looping through all webpages of the desired year:
        for page_number in range(number_of_pages):

            print(f'Getting page {page_number + 1}...')

            url = main_url + '?year=' + str(year)

            # For pages after page 1, we need to add the '&Page=n' element to the url to access other pages:
            if page_number != 0:
                url = url + '&Page=' + str(page_number + 1)

            # Getting the page:
            page = requests.get(url)
            content = page.content
            soup = BeautifulSoup(content, 'html.parser')

            # Finding the links and dates of PDF files:
            for j in soup.find_all('li', {'class': "margin-0"}):

                # Getting the download link element of the PDF file:
                pdf_link = j.find('a', {'class': 'link-pdf'})

                # Ignoring the contents which do not have a pdf file to download:
                if pdf_link is None:
                    continue

                # Getting the date of the content, for including in the PDF file name:
                date = j.find('span', {'class': 'pull-right'}).text.strip()

                # Getting the actual download URL of the PDF file:
                pdf_url = pdf_link.get('href')

                # The title of the download link, which will be used in the name of the downloaded file later:
                file_name = pdf_link.text.strip()

                # If the file name has got '/' character, we need to replace it, so that it won't make trouble
                # when we make the folders and directories:
                if '/' in file_name:
                    file_name = file_name.replace('/', '-')

                # We do the same thing for date:
                if '/' in date:
                    date = date.replace('/', '.')

                # Similarly, if it contains quotation character, it will get troublesome in windows, so we need
                # to replace it:
                if '"' in file_name:
                    file_name = file_name.replace('"', "'")

                # The PDFs folder contains other folders named after the year; the '.pdf' extension
                # lets the downloaded file be opened directly as a PDF:
                file_name = './outputs/PDFs/' + str(year) + '/' + str(pdf_counter) + '. ' + file_name + ' (' + date + ').pdf'

                # Making the right folder/directory for downloaded files:
                os.makedirs(os.path.dirname(file_name), exist_ok=True)          

                # Getting the pdf file:
                response = requests.get(pdf_url)

                # Writing the PDF files:
                with open(file_name, 'wb') as f:
                    f.write(response.content)

                pdf_counter = pdf_counter + 1

        print ('\nDone!')

Let's try it for 2021:

In [None]:
get_PDFs(2021, 2021)
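
File names are a common source of trouble when the same code runs on different operating systems, because Windows forbids more characters in file names than Linux or macOS do. The following sketch gathers the clean-up steps in one hypothetical helper, `safe_file_name`. The title and date passed to it below are made-up examples, and replacing every forbidden character with `-` is just one possible choice.

```python
import re

def safe_file_name(title, date, counter):
    """Build a file name that is valid on Windows, macOS and Linux."""
    # Characters that Windows does not allow in file names: < > : " / \ | ? *
    cleaned_title = re.sub(r'[<>:"/\\|?*]', "-", title).strip()
    # Dates such as 05/03/2021 contain slashes, which would be read as folder separators
    cleaned_date = date.strip().replace("/", ".")
    return f"{counter}. {cleaned_title} ({cleaned_date}).pdf"

example_name = safe_file_name('ST 1234/21 "INIT"', "05/03/2021", 1)
print(example_name)
```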

You can access the results in the `PDFs` folder under the `outputs` folder in your current directory!
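
When many files are downloaded in one run, a single failed request should not stop the whole procedure. The sketch below shows one way to skip failures and continue, while pausing between requests to avoid overloading the server. The names `download_all` and `fake_fetch` are hypothetical, and `fake_fetch` only imitates a download so that the example runs without an internet connection.

```python
import time

def download_all(urls, fetch, pause=0.0):
    """Fetch every URL in turn; collect failures instead of stopping.

    `fetch` is any callable that takes a URL and returns bytes,
    raising an exception when the download fails.
    """
    downloaded, failed = {}, []
    for url in urls:
        try:
            downloaded[url] = fetch(url)
        except Exception as error:  # deliberately broad: one bad file should not stop the run
            failed.append((url, str(error)))
        time.sleep(pause)  # wait between requests to be polite to the server
    return downloaded, failed

# Stand-in for a real download function, used here only to show the behaviour
def fake_fetch(url):
    if "broken" in url:
        raise ValueError("download failed")
    return b"%PDF-"

downloaded, failed = download_all(["a.pdf", "broken.pdf", "c.pdf"], fake_fetch)
print(list(downloaded))
print(failed)
```

With `requests`, the `fetch` argument could be a small function that calls `requests.get(url, timeout=30)`, then `raise_for_status()` on the response, and returns its `content`. The list of failures can be inspected afterwards to retry or to report which files are missing.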

## References

[<a href='#destination1_'>1</a>] https://pypi.org/project/feedparser/ <a id='destination1'></a>

[<a href='#destination1_'>2</a>] https://rss.com/blog/find-rss-feed/#:~:text=Right%20click%20on%20the%20website's,between%20the%20quotes%20after%20href%3D

[<a href='#destination1_'>3</a>] https://dev.to/mr_destructive/feedparser-python-package-for-reading-rss-feeds-5fnc

[<a href='#destination2_'>4</a>] https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML/Getting_started <a id='destination2'></a>

[<a href='#destination3_'>5</a>] http://web.simmons.edu/~grabiner/comm244/weekfour/document-tree.html <a id='destination3'></a>

[<a href='#destination4_'>6</a>] https://www.crummy.com/software/BeautifulSoup/bs4/doc/ <a id='destination4'></a>

[<a href='#destination4_'>7</a>] Fabian's notebook from GESIS fall seminar 2021: https://colab.research.google.com/drive/1uKxOc8mXTE2b05uUq-YlijJYzOTgi5DZ#scrollTo=ao_sLGiOSu7Y

[<a href='#destination5_'>8</a>] The Social Comquant Workshop 10 at https://github.com/strohne/autocol <a id='destination5'></a>


Han, S., & Anderson, C. K. (2021). Web scraping for hospitality research: Overview, opportunities, and implications. Cornell Hospitality Quarterly, 62(1), 89-104. https://doi.org/10.1177/1938965520973587

Singrodia, V., Mitra, A., & Paul, S. (2019). A review on web scrapping and its applications. In 2019 international conference on computer communication and informatics (ICCCI) (pp. 1-6). IEEE.

Hillen, J. (2019). Web scraping for food price research. British Food Journal. https://doi.org/10.1108/BFJ-02-2019-0081



<br/><br/>



https://realpython.com/beautiful-soup-web-scraper-python/#reasons-for-web-scraping

https://medium.com/pythoneers/the-fundamentals-of-web-scraping-using-python-its-libraries-6f146b91efb4

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/#tve-jump-1788432a71d

https://developer.mozilla.org/en-US/docs/Web/HTML/Element

https://medium.com/geekculture/web-scraping-cheat-sheet-2021-python-for-web-scraping-cad1540ce21c#b81d

https://trends.google.com/trends/yis/2021/DE/

https://blog.google/products/search/15-tips-getting-most-out-google-trends/

https://limeproxies.netlify.app/blog/selenium-vs-beautifulsoup


Do not miss checking out the Social Comquant Workshop 10 at: https://github.com/strohne/autocol


<div class='alert alert-block alert-success'>
<b>Document information</b>

Contact and main author: N. Gizem Bacaksizlar Turbic

Contributors: Pouria Mirelmi, Felix Soldner, Haiko Lietz & ..?

Acknowledgements: Fabian Floeck? ...

Version date: XX. January 2023

License: ...
</div>
