<img src='images/gesis.png' style='height: 50px; float: left'>
<img src='images/social_comquant.png' style='height: 50px; float: left; margin-left: 40px'>

## Introduction to Computational Social Science methods with Python

# Session 4: Web scraping

Data collection is the process of gathering and measuring information from relevant sources so that it can be analyzed to answer research questions. Researchers evaluate their research questions and hypotheses on the basis of the collected data. In most cases, data collection is the first and most important step of research, irrespective of the field of study. The approach to data collection varies across fields, depending on the information required.

Digital behavioral data from the internet is a massive source of data, which we can access in various ways, such as by connecting to APIs (see [Session 3: API harvesting](https://github.com/gesiscss/css_methods_python/blob/main/b_data_collection_methods/3_api_harvesting.ipynb)). In addition to APIs, web scraping is often the only way to access data that is not available through convenient CSV exports or APIs. Websites are often valuable sources of data; for example, weather forecasts, comments on news sites, and posts on forums. To access this sort of information on webpages, we use web scraping.


<div class='alert alert-block alert-success'>
<b>In this session</b>, 

you will learn how to do web scraping with Python from scratch. In subsession **4.1**, we will take a close look at the fundamentals of web scraping. You will see how Python libraries can be used for different kinds of data collection. In subsubsession **4.1.1**, you will learn the basics of feed parsing with the feedparser library, and in subsubsession **4.1.2**, you will learn about HTML syntax. In subsession **4.2**, we move on to the basic web scraping tool, the Beautiful Soup library. In subsession **4.3**, we will introduce the Selenium library to collect data from websites that only load content once you interact with them (scrolling, clicking, etc.) and are difficult to obtain through more traditional scraping approaches. We will work through actual web scraping projects throughout this session, focusing on Wikipedia pages, online news sites, PDF pages, and Quora. Finally, in subsession **4.4**, we will compare these libraries and talk about the challenges and data privacy approaches.
</div>


## 4.1. Fundamentals of Web scraping

<img src="./images/webscrape.png"  width="350" height = "350" align="center"/>

In order to access APIs, you first need to create an account and apply for a developer account on the platform that you want to work with. With this developer account, platforms provide you with KEYS (e.g., secret, public, or access) to authenticate with their system.

While web scraping is one of the common ways of collecting data from websites, many websites offer APIs to access the public data that they host. This is to avoid unnecessary traffic on the websites. However, even when we have access to these APIs, as researchers, we should respect the API access rules and always read the **Terms of Use** documents before collecting data.

### 4.1.1. Feed Parsing [<a href='#destination1'>1, 2, 3</a>] <a id='destination1_'></a>

According to [Wikipedia](https://en.wikipedia.org/wiki/Web_feed), a web feed (or news feed) is a data format used for providing users with frequently updated content. Content distributors syndicate a web feed, thereby allowing users to subscribe to it by adding the feed resource address to a news aggregator client (also called a feed reader or a news reader). Users typically subscribe to a feed by manually entering the URL of the feed, by clicking a link in a web browser, or by dragging the link from the web browser to the aggregator. In short, "RSS and Atom files provide news updates from a website in a simple form for your computer."

Here we introduce [feedparser](https://pypi.org/project/feedparser/), a powerful python package for parsing RSS feeds. By providing the RSS feed link, we can get structured information in the form of python lists and dictionaries, which could then be used to extract the desired information in a simple and efficient way.
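
To make "structured information in the form of lists and dictionaries" more concrete, the sketch below turns a tiny RSS document into a list of dictionaries using only the standard library. The feed content and the `rss_items` helper are made up for illustration, and this is not how feedparser works internally; feedparser handles many more formats and edge cases.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 document (hypothetical content, for illustration only).
RSS_SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.org/</link>
    <item>
      <title>First headline</title>
      <link>https://example.org/first</link>
      <description>Summary of the first story.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.org/second</link>
      <description>Summary of the second story.</description>
    </item>
  </channel>
</rss>"""


def rss_items(xml_text):
    """Return one dictionary per <item> element of an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
        })
    return items


entries = rss_items(RSS_SAMPLE)
for entry in entries:
    print(entry["title"], "->", entry["link"])
```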

#### Getting started with feedparser

<div class="alert alert-block alert-info">
<b>Hint:</b> 
    
Before importing the libraries, we need to have the necessary software packages and libraries installed. You can always go back to [Session 1: Setting up the computing environment](https://github.com/gesiscss/css_methods_python/blob/main/a_introduction/1_computing_environment.ipynb) to learn how to install the software packages and libraries that you need for this session.
    
</div>

As usual, we need to import the package in the first place:

In [None]:
import feedparser

#### Parsing an RSS feed URL
To parse an RSS feed link, you can simply use the **parse()** function from the feedparser package. It takes a string as its argument, which can be a URL or the path to a file saved locally on your computer. Here we use a Voice of America (VOA) RSS feed as an example URL:

In [None]:
feed = feedparser.parse("https://www.voanews.com/api/zgvmqye_o_qv")

# You can try other news websites as well:

# feed = feedparser.parse("https://www.aljazeera.com/xml/rss/all.xml")
# feed = feedparser.parse("http://rss.cnn.com/rss/edition_europe.rss")

feed

<div class="alert alert-block alert-info">
<b>Hint:</b> 
You can try the following ways in order to get a website's RSS feed:

- If the website is powered by Wordpress, you can do it by adding /feed/ at the end of its URL. Trying /rss/ is another option.
<img src='images/rss_logo.png' style='height: 50px; float: right; margin-left: 50px' >
- If you see the standard orange RSS logo, by simply clicking on it you will be taken to the website's RSS feed.
- You can also use the page source: right click on the page and choose page source. In the new window, use ctrl+f and type in RSS. You’ll find the feed’s URL between the quotes after **href=**.

The parse method fetches the feed from the provided URL, extracts the information in a systematic way and stores each piece in a structured format. At the high level, it returns a python dictionary with multiple keys and values, in which each value may contain python lists or other dictionaries. You can access the keys using the **keys()** method:
</div>

In [None]:
feed.keys()

Using these keys, we can access the more specific information that we want. The most common keys that can be used for extracting information are **entries** and **feed**.

#### Extracting the contents from the feed
We will start with the **entries** key. It holds the list of all the posts, podcasts, or whatever other form of content the feed serves. More information on other possible keys in the returned dictionary can be found [here](https://feedparser.readthedocs.io/en/latest/reference.html).

In [None]:
feed['entries']

We can get the number of articles/entries using the **len()** function:

In [None]:
len(feed['entries'])

#### Getting details of the entries
We can iterate over the items of the entries list and print them to get more details on each article:

In [None]:
for entry in feed['entries']:
    print (entry)
    print ("\n")

As we can see, each entry in the list is a dictionary again, which has different key-value pairs like **title**, **summary**, **link**, etc. We can again use the **keys()** method in order to explore the keys of the new dictionary: 

In [None]:
feed['entries'][0].keys()

Now that we have all the keys associated with the entries, we can extract the specific information like title, author, and actual contents of the feed.
Though this might not be the same for all RSS feeds, it might be very similar and a matter of using the right keyword for the associated keys in the list of dictionaries.
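
Because the available keys can differ between feeds (and even between entries), a defensive lookup may be useful. The following is a minimal sketch that uses plain dictionaries as stand-ins for feed entries; the data and the `pick_fields` helper are hypothetical. Since feedparser entries behave like dictionaries, the same `.get()` pattern should carry over.

```python
# Plain dictionaries standing in for feed entries (hypothetical data).
sample_entries = [
    {"title": "With author", "link": "https://example.org/a", "author": "A. Writer"},
    {"title": "Without author", "link": "https://example.org/b"},
]


def pick_fields(entry, fields=("title", "author", "link")):
    """Collect selected fields from an entry, using None for missing keys."""
    return {field: entry.get(field) for field in fields}


rows = [pick_fields(entry) for entry in sample_entries]
for row in rows:
    print(row)
```

Using `entry.get(field)` instead of `entry[field]` avoids a `KeyError` when a feed does not provide a given key.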

Let's say, we want to print out the titles of all the entries in the feed. We can do that by iterating over the entries list and fetching the title from the iterator:

In [None]:
for entry in feed.entries:
    print (entry.title)

Similarly, we can get the links or the summaries of the entries using the **link** or **summary** key of each entry:

In [None]:
for entry in feed.entries:
    print (entry.link)
    
# for entry in feed.entries:
#     print (entry.summary)

In [None]:
from IPython.display import HTML

### 4.1.2. HTML for Web Scraping

<img src='images/html.png' style='height: 90px; float: right; margin-left: 50px' >

The **HyperText Markup Language** or **HTML** is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript. While the main content of a web page is written in HTML, CSS adds styling to the page to make it look nicer, and JavaScript files add interactivity to it.


HTML code consists of a series of **elements**, and these elements tell the browser how to display the content. For collecting data from HTML web pages, it's necessary to have an idea of how this element syntax works.

#### HTML Element Syntax [<a href='#destination2'>4</a>] <a id='destination2_'></a>

HTML language can be applied to pieces of text to give them different meanings in a document (Is it a paragraph? Is it a bulleted list? Is it part of a table?), structure a document into logical sections (Does it have a header? Three columns of content? A navigation menu?), and embed content such as images and videos into a page. In this section we will introduce the first two of these, together with the fundamental concepts and syntax you need to know to understand HTML.

To get started, we will begin with defining elements, attributes, and some other important terms. We will also explain where these fit into HTML. You will learn how HTML elements are structured, how a typical HTML page is structured, and other important basic language features.

As already mentioned, HTML is a markup language that tells web browsers how to structure the web pages you visit. It can be as complicated or as simple as the web developer wants it to be. HTML consists of a series of elements, which you use to enclose, wrap, or mark up different parts of content to make it appear or act in a certain way. The enclosing tags can make content into a hyperlink to connect to another page, italicize words, and so on. For example, consider the following line of text:

`My cat is very grumpy`

If we wanted the text to stand by itself, we could specify that it is a paragraph by enclosing it in a paragraph (`<p>`) element:

`<p>My cat is very grumpy</p>`

<font color='darkblue'>**Note**: Tags in HTML are not case-sensitive, but it's better to write all of them in lower case for the sake of consistency and readability.
</font>

####  Anatomy of an HTML element

Let's further explore our paragraph element mentioned above:

<img src='images/html4.png' width="500" height="400" align="center"/>

The anatomy of our element is:

- **The opening tag**: This consists of the name of the element (in this example, p for paragraph), wrapped in opening and closing angle brackets. This opening tag marks where the element begins or starts to take effect. In this example, it precedes the start of the paragraph text.


- **The content**: This is the content of the element. In this example, it is the paragraph text.


- **The closing tag**: This is the same as the opening tag, except that it includes a forward slash before the element name. This marks where the element ends. Failing to include a closing tag is a common beginner error that can produce peculiar results.

So, *the element* is the opening tag, followed by content, followed by the closing tag.

**Create your first HTML element:**  Edit the `html` string below (it contains an HTML code) and get the actual rendered HTML output from `HTML()`. You can wrap the text of your choice with the tags `<em>` and `</em>`. Doing this should give the line italic text formatting.

In [None]:
html = "<em>This is my text.</em>"
HTML(html)

#### Nesting elements

Elements can be placed within other elements. This is called *nesting*. If we wanted to state that our cat is **very** grumpy, we could wrap the word "very" in a `<strong>` element, which means that the word is to have strong(er) text formatting:

`<p>My cat is <strong>very</strong> grumpy.</p>`

There is a right and wrong way to do nesting. In the example above, we opened the `p` element first, then opened the `strong` element. For proper nesting, we should close the `strong` element first, before closing the `p`.
The following is an example of the *wrong* way to do nesting:

`<p>My cat is <strong>very grumpy.</p></strong>`

<u>The tags have to open and close in a way that they are inside or outside one another.</u> With the kind of overlap in the example above, the browser has to guess at your intent. This kind of guessing can lead to unexpected results.
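
The nesting rule can be checked mechanically with a stack: every closing tag has to match the most recently opened element. Below is a rough sketch built on Python's standard `html.parser` module. The `NestingChecker` class and the `is_properly_nested` helper are illustrative names, and the list of void elements is deliberately incomplete.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a subset, enough for this sketch).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class NestingChecker(HTMLParser):
    """Track open elements on a stack and record mismatched closing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open elements
        self.problems = []    # descriptions of mismatched closing tags

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            expected = self.open_tags[-1] if self.open_tags else "nothing"
            self.problems.append(f"closed </{tag}> while <{expected}> was open")


def is_properly_nested(html_text):
    """Return True if every closing tag matches the innermost open element."""
    checker = NestingChecker()
    checker.feed(html_text)
    checker.close()
    return not checker.problems and not checker.open_tags


good = is_properly_nested("<p>My cat is <strong>very</strong> grumpy.</p>")
bad = is_properly_nested("<p>My cat is <strong>very grumpy.</p></strong>")
print(good, bad)
```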

#### Block versus inline elements

There are two important categories of elements to know in HTML: block-level elements and inline elements.

- Block-level elements form a visible block on a page. A block-level element appears on a new line following the content that precedes it. Any content that follows a block-level element also appears on a new line. Block-level elements are usually structural elements on the page. For example, a block-level element might represent headings, paragraphs, lists, navigation menus, or footers. A block-level element wouldn't be nested inside an inline element, but it might be nested inside another block-level element.


- Inline elements are contained within block-level elements, and surround only small parts of the document's content (not entire paragraphs or groupings of content). An inline element will not cause a new line to appear in the document. It is typically used with text, for example an `<a>` element creates a hyperlink, and elements such as `<em>` or `<strong>` create emphasis.

Consider the following example:

`<em>first</em><em>second</em><em>third</em>`

`<p>fourth</p><p>fifth</p><p>sixth</p>`

`<em>` is an inline element. As you can see below, the first three elements sit on the same line, with no space in between. On the other hand, `<p>` is a block-level element. Each p element appears on a new line, with space above and below. (The spacing is due to default CSS styling that the browser applies to paragraphs.)

In [None]:
HTML("<em>first</em><em>second</em><em>third</em>")

In [None]:
HTML("<p>fourth</p><p>fifth</p><p>sixth</p>")

#### Empty elements

Not all elements follow the pattern of an opening tag, content, and a closing tag. Some elements consist of a single tag, which is typically used to insert/embed something in the document. For example, the `<img>` element embeds an image file onto a page:

`<img src="https://raw.githubusercontent.com/mdn/beginner-html-site/gh-pages/images/firefox-icon.png">`

This would output the following:

In [None]:
HTML('<img src="https://raw.githubusercontent.com/mdn/beginner-html-site/gh-pages/images/firefox-icon.png">')

#### Attributes

Elements can also have attributes. Attributes look like this:

<img src='images/html5.png' width="800" height="400" align="center"/>

Attributes contain extra information about the element that won't appear in the content. In this example, the `class` attribute is an identifying name used to target the element with style information.

An attribute should have:

- A space between it and the element name. (For an element with more than one attribute, the attributes should be separated by spaces too.)
- The attribute name, followed by an equal sign.
- An attribute value, wrapped with opening and closing quote marks.

**Adding attributes to an element**: Another example of an element is `<a>`. This stands for *anchor*. An anchor can make the text it encloses into a hyperlink. Anchors can take a number of attributes, but several are as follows:

- `href`: This attribute's value specifies the web address for the link. For example: `href="https://www.mozilla.org/"`
- `title`: The `title` attribute specifies extra information about the link, such as a description of the page that is being linked to. For example, `title="The Mozilla homepage"`. This appears as a tooltip when a cursor hovers over the element.
- `target`: The `target` attribute specifies the browsing context used to display the link. For example, `target="_blank"` will display the link in a new tab. If you want to display the linked content in the current tab, just omit this attribute.

You can edit the `html` string below to turn it into a link to your favorite website:

In [None]:
html = '<p>A link to my <a href="https://www.mozilla.org/" title="The Mozilla homepage" target="_blank">favorite website</a>.</p>'
HTML(html)
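
For scraping, attributes matter because they often carry the data we are after, such as the target of a link. As a small sketch using only the standard library, the hypothetical `LinkCollector` class below gathers the attributes of every `<a>` element into a dictionary:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the attributes of every <a> element as a dictionary."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "a":
            self.links.append(dict(attrs))


snippet = ('<p>A link to my <a href="https://www.mozilla.org/" '
           'title="The Mozilla homepage" target="_blank">favorite website</a>.</p>')

collector = LinkCollector()
collector.feed(snippet)
print(collector.links)
```

Beautiful Soup, introduced later in this session, offers a more convenient interface for the same task.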

#### Anatomy of an HTML document

Individual HTML elements aren't very useful on their own. Next, let's examine how individual elements combine to form an entire HTML page:

```
<!DOCTYPE html>
<html lang="en-US">
  <head>
    <meta charset="utf-8">
    <title>My test page</title>
  </head>
  <body>
    <p>This is my page</p>
  </body>
</html>
```

Here we have:

1. `<!DOCTYPE html>`: The doctype. When HTML was young (1991-1992), doctypes were meant to act as links to a set of rules that the HTML page had to follow to be considered good HTML. More recently, the doctype is a historical artifact that needs to be included for everything else to work right. `<!DOCTYPE html>` is the shortest string of characters that counts as a valid doctype. That is all you need to know!


2. `<html></html>`: The `<html>` element. This element wraps all the content on the page. It is sometimes known as the root element.


3. `<head></head>`: The `<head>` element. This element acts as a container for everything you want to include on the HTML page, **that isn't the content** the page will show to viewers. This includes keywords and a page description that would appear in search results, CSS to style content, character set declarations, and more.


4. `<meta charset="utf-8">`: The `<meta>` element. This element represents metadata that cannot be represented by other HTML meta-related elements, like `<base>`, `<link>`, `<script>`, `<style>` or `<title>`. The `charset` attribute sets the character set for your document to UTF-8, which includes most characters from the vast majority of human written languages. With this setting, the page can now handle any textual content it might contain. There is no reason not to set this, and it can help avoid some problems later.


5. `<title></title>`: The `<title>` element. This sets the title of the page, which is the title that appears in the browser tab the page is loaded in. The page title is also used to describe the page when it is bookmarked.


6. `<body></body>`: The `<body>` element. This contains all the content that displays on the page, including text, images, videos, games, playable audio tracks, or whatever else.

Later in this notebook, you will get to explore HTML code in more detail.

#### HTML Tree Structure [<a href='#destination3'>5</a>] <a id='destination3_'></a>
 
Each HTML document can actually be referred to as a document tree. We describe the elements in the tree like we would describe a family tree. There are ancestors, descendants, parents, children and siblings.

Use the sample HTML document below for the following examples. The `<head>` section of the document is omitted for brevity.

```
<body>

  <div id="content">
    <h1>Heading here</h1>
    <p>Lorem ipsum dolor sit amet.</p>
    <p>Lorem ipsum dolor <em>sit</em> amet.</p>
    <hr>
  </div>
  
  <div id="nav">
    <ul>
      <li>item 1</li>
      <li>item 2</li>
      <li>item 3</li>
    </ul>
  </div>

</body>
```

A diagram of the above HTML document tree would look like this:

<img src='images/tree1.gif' width="435" height="400" align="center"/>

##### Ancestor

An ancestor refers to any element that is connected but further up the document tree - no matter how many levels higher.

In the diagram below, the `<body>` element is the ancestor of all other elements on the page.

<img src='images/tree_ancestor.gif' width="435" height="400" align="center"/>

##### Descendant

A descendant refers to any element that is connected but lower down the document tree - no matter how many levels lower.
In the diagram below, all elements that are connected below the `<div>` element are descendants of that `<div>`.

<img src='images/tree_descendant.gif' width="435" height="400" align="center"/>

##### Parent and Child

A parent is an element that is directly above and connected to an element in the document tree. In the diagram below, the `<div>` is a parent to the `<ul>`.

A child is an element that is directly below and connected to an element in the document tree. In the diagram below, the `<ul>` is a child to the `<div>`.

<img src='images/tree_parent.gif' width="435" height="400" align="center"/>

##### Sibling

A sibling is an element that shares the same parent with another element.

In the diagram below, the `<li>`s are siblings as they all share the same parent - the `<ul>`.

<img src='images/tree_siblings.gif' width="435" height="400" align="center"/>
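
These family relationships map directly onto code. As a preview of Beautiful Soup (introduced in the next section), the sketch below parses the sample document from above and walks the tree. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

sample = """<body>
  <div id="content">
    <h1>Heading here</h1>
    <p>Lorem ipsum dolor sit amet.</p>
    <p>Lorem ipsum dolor <em>sit</em> amet.</p>
    <hr>
  </div>
  <div id="nav">
    <ul>
      <li>item 1</li>
      <li>item 2</li>
      <li>item 3</li>
    </ul>
  </div>
</body>"""

tree = BeautifulSoup(sample, "html.parser")
ul = tree.find("ul")

# Parent: the element directly above the <ul>
parent_name = ul.parent.name

# Ancestors: every element further up (skipping the document object itself)
ancestor_names = [tag.name for tag in ul.parents if tag.name != "[document]"]

# Children: only the elements directly below the <ul>
child_texts = [li.get_text() for li in ul.find_all("li", recursive=False)]

# Siblings: elements that share a parent with the first <li>
sibling_texts = [li.get_text() for li in ul.li.find_next_siblings("li")]

# Descendants: all elements below the first <div>, at any depth
descendant_names = [tag.name for tag in tree.find(id="content").find_all(True)]

print(parent_name, ancestor_names, child_texts, sibling_texts, descendant_names)
```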

## 4.2. Beautiful Soup
<img src='images/bs.png' style='height: 150px; float: right; margin-left: 0px' >

Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing pythonic idioms for iterating, searching, and modifying the parse tree.


Now that you have an idea of how HTML webpages are structured, we can start working with Beautiful Soup. We will go through some of the most important methods of it, and then you will get to write your first scraping project.



If you don't have the package installed on your system, do it using `pip`, and then import it:

In [None]:
from bs4 import BeautifulSoup

### 4.2.1. Basics [<a href='#destination4'>6, 7</a>] <a id='destination4_'></a>

We will begin with an example page at http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html.

The HTML source code of the page is stored in the `content` string as follows:

In [None]:
content = """<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>"""

In [None]:
HTML (content)

#### requests 

You can get the same content by fetching the page through `requests`. It is a simple and useful HTTP library:

In [None]:
import requests

page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
content = page.content

By printing `page`, you can check to see if fetching the contents has been successful. The status code of "200" means you are good to go:

In [None]:
print(page)
# print(page.status_code)

#### html parser

By using an HTML parser, Beautiful Soup transforms a complex HTML document into a tree of Python objects, which makes it much easier to work with.

In [None]:
soup = BeautifulSoup(content, 'html.parser')

In [None]:
soup

Using `soup.prettify()`, we can get a better overview of the tree structure of the code:

In [None]:
print (soup.prettify())

Each tag can now be viewed as an object. We can also access all children objects of a tag using dots:

In [None]:
list(soup.html.body.children)

#### find() & find_all()

Two of the most important methods of Beautiful Soup are `find()` and `find_all()`.

The `find()` method finds the first occurrence of a certain tag matching the given criteria. Its first argument is the tag name, so if we pass `p` as a string to it, it will return the first occurrence of the `p` tag:

In [None]:
soup.find('p')

As you can see, the output is the same as when we use a dot for accessing the `p` tag:

In [None]:
soup.p

With the `find_all()` method, we can get a list of all the occurrences of a certain tag matching the given criteria. Again, if we pass the "p" string to it, it will return all the occurrences of the `p` tag:

In [None]:
soup.find_all('p')

In [None]:
len(soup.find_all('p'))

In [None]:
soup.find_all('p')[0]

We can also specify attribute values and pass them to the method. The following line of code returns the list of all the `p` tags whose `class` attribute includes the value `"outer-text"`.

In [None]:
soup.find_all('p', {'class': "outer-text"})

This one returns the list of all tags whose `id` attributes equal `"first"`:

In [None]:
soup.find_all(id="first")
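
One detail worth knowing: `class` can hold several values at once, and a search for one class matches every tag that has it among its classes. The short sketch below uses a made-up snippet (not the page fetched above) and the `class_` keyword argument of `find_all()` to illustrate the difference between matching a single class and matching the full attribute string.

```python
from bs4 import BeautifulSoup

snippet = """
<p class="outer-text first-item" id="second">First outer paragraph.</p>
<p class="outer-text">Second outer paragraph.</p>
<p class="inner-text">An inner paragraph.</p>
"""
demo = BeautifulSoup(snippet, "html.parser")

# Matches every <p> that has "outer-text" among its classes
any_outer = demo.find_all("p", class_="outer-text")

# Matches only a <p> whose class attribute is exactly this string
exact = demo.find_all("p", class_="outer-text first-item")

# A CSS selector requiring both classes, in any order
both = demo.select("p.outer-text.first-item")

print(len(any_outer), len(exact), len(both))
```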

#### select()

Beautiful Soup has a `select()` method which uses the [SoupSieve](https://facelessuser.github.io/soupsieve/) package to run a CSS selector against a parsed document and return all the matching elements.

The SoupSieve documentation lists all the currently supported CSS selectors, but here are some of the basics:

You can find tags:

In [None]:
soup.select("p")

You can find tags beneath other tags:

In [None]:
soup.select("div p")

You can find tags with specific classes:

In [None]:
soup.select("p.first-item")

You can find tags by id:

In [None]:
soup.select("#second")

In [None]:
soup.select("p#second")

And you can also find tags by a combination of the above-mentioned criteria:

In [None]:
soup.select("div p.first-item#first")
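
A few more selector patterns tend to be handy in practice. The sketch below works on a condensed copy of the example page and shows the child combinator (`>`), an attribute selector, and `select_one()`, which returns only the first match or `None` when nothing matches. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

page_source = """<html><body>
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>"""
demo = BeautifulSoup(page_source, "html.parser")

# <p> elements that are direct children of <body> (not those inside the <div>)
direct_children = demo.select("body > p")

# <p> elements that carry an id attribute, whatever its value
with_id = demo.select("p[id]")

# First match only; None if there is no match
first_match = demo.select_one("p.outer-text")
missing = demo.select_one("p.no-such-class")

print(len(direct_children), len(with_id), first_match.get_text(strip=True), missing)
```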

#### get_text()

If you only want the human-readable text inside a document or tag, you can use the `get_text()` method. It returns all the text in a document or beneath a tag, as a single Unicode string:

In [None]:
soup.get_text()

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

In [None]:
soup.get_text(strip = True)

You can also specify a string to be used to join the bits of text together:

In [None]:
soup.get_text("|", strip=True)

But at that point you might want to use the `stripped_strings` generator instead, and process the text yourself:

In [None]:
[text for text in soup.stripped_strings]
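
As a small illustration of processing the text yourself, the sketch below compares three ways of cleaning up whitespace on a made-up snippet. Note that `strip=True` only trims the ends of each piece of text; collapsing repeated spaces inside a string needs an extra step such as `split()` and `join()`.

```python
from bs4 import BeautifulSoup

snippet = """<div>
<p>
                First paragraph.
            </p>
<p>
                Second   paragraph.
            </p>
</div>"""
demo = BeautifulSoup(snippet, "html.parser")

# Trim each piece of text and join the pieces with a single space
joined = demo.get_text(" ", strip=True)

# The same result, built from the stripped_strings generator
from_generator = " ".join(demo.stripped_strings)

# Additionally collapse runs of whitespace inside the text
collapsed = " ".join(demo.get_text().split())

print(repr(joined))
print(repr(collapsed))
```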

### 4.2.2. Example: Scraping data from Aljazeera [<a href='#destination5'>8</a>] <a id='destination5_'></a>

<img src='images/aljazeera.png' style='height: 150px; float: right; margin-left: 50px' >

Now that you are familiar with the basics of Beautiful Soup, we can do a more practical scraping project. We will collect some news data from [aljazeera.com](https://www.aljazeera.com), and you will get to examine what you have learnt so far.

To have a better idea of what exactly we are going to do, open the website, use the search bar and search "Turkey". In the new page, sort the retrieved news articles by date. As you can see, the 10 most recent news articles related to Turkey are now displayed. We are going to scrape and store them in a pandas dataframe.

First, we need to make sure we have all the necessary packages available:

In [None]:
# import these libraries if you have not done so
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd

We make an *html* directory and an *articles.html* file for downloads:

In [None]:
# create directory if it doesn't exist yet:
directory = "html"
if not os.path.exists(directory):
    os.makedirs(directory)
    
filename = directory + "/articles.html"

Then we construct the right URL from `path` and `searchterm`:

In [None]:
path = "https://www.aljazeera.com/search/"
searchterm = "Turkey"
parameters = "?sort=date"
url = path + searchterm + parameters 
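
Plain string concatenation works for a single word like "Turkey", but search terms with spaces or special characters need to be percent-encoded first. The following is a minimal sketch using the standard library; the `build_search_url` helper is hypothetical, and whether a given website accepts encoded terms in exactly this form is an assumption you should check in the browser.

```python
from urllib.parse import quote, urlencode


def build_search_url(base, term, **params):
    """Append a percent-encoded search term and an optional query string to base."""
    url = base + quote(term, safe="")
    if params:
        url += "?" + urlencode(params)
    return url


simple_url = build_search_url("https://www.aljazeera.com/search/", "Turkey", sort="date")
spaced_url = build_search_url("https://www.aljazeera.com/search/", "climate change", sort="date")
print(simple_url)
print(spaced_url)
```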

The resulting URL is the same as that of the page you explored at the first stage. Now we fetch it using `requests`:

In [None]:
page = requests.get(url)  

We check to see if we get the right status code (200), then we save the contents of the page in *articles.html*:

In [None]:
if page.status_code == 200:
    with open(filename, 'wb') as file:
        file.write(page.content)

Then we parse the webpage with Beautiful Soup:

In [None]:
with open(filename, encoding="utf-8") as file:  # the file is closed after parsing
    soup = BeautifulSoup(file, 'lxml')  # the 'lxml' parser needs the lxml package

Now that we have the page parsed, we need to select the right elements of it to extract our desired information from. In the simple webpage that we investigated in the Beautiful Soup Basics section, it was easy to pick the right elements to investigate from the few lines of code. In real HTML web pages it's a bit different.

In order to find the right elements, right-click somewhere on the page and click on *inspect*. Then press Ctrl+Shift+C. Now you should be able to inspect the page and see the HTML code for each part of the page you hover the mouse over. Equivalently, by hovering the mouse over certain lines of HTML code, you can see what that code actually creates on the page.

On Google Chrome it would look like this:

In [None]:
url

<img src='images/inspect.png' style='height: 550px; float: right; margin-left: 50px' >

It turns out that the elements that we would like to work on are the ones with the `article` tags. We'll select them:

In [None]:
articles = soup.select('article')

Next, we will scrape different information from the articles. We do that by putting every article's title, text and URL in a corresponding dictionary, and will add all the dictionaries to the `results` list:

In [None]:
# Initialize empty list for results
results = []

for article in articles: 
    
    # Initialize empty dictionary
    # Extract title, text and URL of articles 
    item = {}
    item['title'] = article.select_one('span').text.strip()
    item['text'] = article.select_one('p').text.strip()    
    item['url'] = article.select_one('a').get('href')
    # You can also get the URLs with article.select_one('a')['href']
    
    # Append items to result-list
    results.append(item)
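
Real pages are rarely uniform: an article may lack a teaser paragraph, or a link may be relative. In such cases `select_one()` returns `None`, and calling `.text` on it raises an `AttributeError`. Below is a defensive variant of the loop as a sketch on made-up markup; the `listing` string, `BASE_URL`, and the `text_or_none` helper are hypothetical and only loosely modelled on a search-result page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on a search-result list.
listing = """
<article><a href="/news/1"><span>Headline one</span></a><p>Teaser one.</p></article>
<article><a href="https://example.org/news/2"><span>Headline two</span></a></article>
"""
BASE_URL = "https://example.org"


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


demo = BeautifulSoup(listing, "html.parser")
records = []
for article in demo.select("article"):
    link = article.select_one("a")
    has_href = link is not None and link.has_attr("href")
    records.append({
        "title": text_or_none(article, "span"),
        "text": text_or_none(article, "p"),
        # urljoin turns relative links into absolute ones and leaves absolute links alone
        "url": urljoin(BASE_URL, link["href"]) if has_href else None,
    })

print(records)
```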

At last, we convert the results list to a dataframe:

In [None]:
results = pd.DataFrame(results)
results

We can save the resulting dataframe in a csv file like this:

In [None]:
results.to_csv('aljazeera.csv', mode = 'w')

### 4.2.3. Toy example: Scraping Chief Seattle Speech

As another small example of working with BeautifulSoup, we will scrape the text of the Chief Seattle Speech from [this link](https://suquamish.nsn.us/home/about-us/chief-seattle-speech/). Try opening the page and see how it looks!

Like before, we first get the contents of the page using `requests` and `BeautifulSoup`:

In [None]:
page = requests.get("https://suquamish.nsn.us/home/about-us/chief-seattle-speech/")
print(page)
content = page.content

In [None]:
soup = BeautifulSoup(content, 'html.parser')
# soup

If you `inspect` the page, you will see that the main text of the whole page is structured under the `section` tag. To access the main text all at once, you can simply use the `get_text()` method on this tag:

In [None]:
soup.section.get_text()

There are also multiple `p` tags under the `section` tag, each containing a single paragraph of the page. So, to access a certain paragraph, you first get the corresponding `p` tag and then extract its text using `get_text()`.

Let's store the `p` tags found under the `section` tag in the `soup2` list first:

In [None]:
soup2 = soup.section.find_all('p')

As you can see, we have 9 different paragraphs in the text:

In [None]:
len(soup2)

Now, let's say we want to access the text of the 3rd paragraph. We can get that by using the `get_text()` method on the 3rd item of `soup2`: 

In [None]:
soup2[2].get_text()

## 4.3. Selenium 

<img src='images/selenium.png' style='height: 100px; float: right; margin-left: 100px' >

In this section, we will show you how to use Selenium (a browser automation software) with Python to collect data from websites that only load content once you interact with them (scrolling, clicking, etc.) and are difficult to scrape with more traditional approaches.

### Dynamic websites
Dynamic websites change their content (i.e., source code) due to various reasons, such as:
   - Clicking, scrolling, mouse hovering 
   - Screen sizes, languages (IP-based), devices, time of the day
   - Previous visits (user’s browsing history)
   - And more!

Changes can occur on the client side, such as JavaScript interactions that do not necessarily change the source code of the website but change its appearance, for example by expanding text boxes. In that case, classical scraping methods are sufficient for grabbing the entire text, because the full version of a text that is displayed in truncated form might already be stored in the source code.

However, changes can also occur on the server side, leading to changes in the website's appearance and content (i.e., source code). For example, scrolling to the bottom of a website leads to more content being loaded. Such effects can often be observed in social media feeds (e.g., Facebook, Twitter, Quora) or on online shopping platforms, enabling users to scroll endlessly. Content changes triggered by website interactions are often challenging or impossible to capture with classical scraping approaches, since they require JavaScript execution initiated by user interaction. Fortunately, we can use browser automation tools, such as Selenium, to imitate user interactions and make data collection possible.

### Getting started

Selenium is a [browser automation software](https://www.selenium.dev/) that can interface with many different browser types and programming languages. Thus, we can write scripts that control the browser and imitate our behavior, such as clicking or scrolling. Before we can start writing a script, we need to set up Selenium by downloading a driver. Depending on the browser we want to use (e.g., Firefox, Chrome), we need a different driver, which can be found [here](https://www.selenium.dev/downloads/). In this notebook, we will go through instructions for using both Google Chrome and Mozilla Firefox, for which you can find the drivers [here](https://chromedriver.chromium.org/downloads) and [here](https://github.com/mozilla/geckodriver/releases), respectively. To download the correct driver, you need to know which operating system (e.g., Windows, Linux, Mac) your machine runs on and which browser version you have. For Chrome, you can find the browser version under Settings > About Chrome (see screenshot below):

<img src='images/chrome_version.JPEG' width="1000" height="1000" align="center"/>

Download the correct driver, unpack the zip archive, and place the driver file (on Windows, an *.exe* file) in the same folder as this notebook.

The Selenium webpage contains documentation for all the programming languages, which you can find [here](https://www.selenium.dev/documentation/). However, that documentation is not very concise, and since we are using Python, we can also consult a separate, Python-specific documentation [here](https://selenium-python.readthedocs.io/).

Both sets of documentation are very handy and should be kept close when working with Selenium. When you read through them, you will notice that, besides sending specific behavioral commands to the browser, accessing web elements is very similar to other approaches, such as BeautifulSoup. You will need XPath, CSS selectors, and other properties of web elements to interact with them.

Next, we will showcase how you can use Selenium with Chrome/Firefox to collect data from the dynamic website Quora. Before collecting data, we need to check whether we are allowed to collect data from the website. Quora states that we are permitted to employ scrapers but must adhere to the [robots.txt](https://www.quora.com/robots.txt), which specifies the allowed and disallowed contents for scraping, and that we must make ourselves known to the website so that they can contact us if they want to. We can give Quora our contact information by adding it to the user-agent, the information the browser sends to the website. *Section 4-d: Permitted uses of Quora’s terms of service* specifies the rules for scraping (see the screenshot below):

<img src='images/quora.png' width="700" height="700" align="center"/>
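
Checking a URL against robots.txt rules can also be done programmatically. The following is a minimal sketch using Python's built-in `urllib.robotparser`. The rules, the user-agent name, and the `example.com` URLs are made up for illustration and are not taken from Quora's actual file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this sketch;
# it is NOT a copy of Quora's actual robots.txt.
sample_robots = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_robots)  # parse the rules from a list of lines

my_agent = "my-research-scraper"  # placeholder user-agent name

# can_fetch() tells us whether the given user-agent may access the URL
allowed = parser.can_fetch(my_agent, "https://www.example.com/topic/python")
blocked = parser.can_fetch(my_agent, "https://www.example.com/private/data")
print(allowed, blocked)
```

For a live website, `parser.set_url(...)` followed by `parser.read()` would download and parse the real file instead of the sample lines.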

#### Scraping Quora

Before scraping any information, we need to create an account on Quora. We would recommend creating a new account for your scraping project. Go to www.quora.com and create a new account.

After creating the account, make sure to import all the necessary libraries:

In [None]:
import pandas as pd # to work with data frames; you may have already imported it in this notebook
from time import sleep # to slow down our scraper

# all selenium specific packages:
from selenium import webdriver # to load the browser
from selenium.webdriver.common.keys import Keys # necessary to automate typings, like filling out the forms
from selenium.webdriver.common.by import By # necessary to search for web elements

If you are using Chrome, run the first import in the next cell; if you are using Firefox, comment it out and uncomment the second one instead. Note that if you run both imports, only the last one (Firefox) takes effect, because both classes are imported under the same name `Options`:

In [None]:
from selenium.webdriver.chrome.options import Options # necessary to change our user agent when working with Chrome

# from selenium.webdriver.firefox.options import Options # necessary to change our user agent when working with Firefox

We can now start the driver (i.e., the browser), which should appear as a separate window.


In [None]:
# starting the driver

# For Chrome:
driver = webdriver.Chrome()

# For Firefox:
# driver = webdriver.Firefox()


As you can see, a new browser window opens, which is *being controlled by automated test software:*

<img src='images/chrome.png' width="700" height="700" align="center"/>

For Firefox, it looks something like this:

<img src='images/firefox.png' width="700" height="700" align="center"/>

The *driver* instance is the browser we will use to navigate the website and find web elements. 
We can now check our browser's current user-agent with the following code:

In [None]:
agent = driver.execute_script("return navigator.userAgent")
print(agent)

We can change the user-agent information to make ourselves identifiable, so that Quora can contact us if they want to. To do so, we need to initiate a new driver with the changed information. Hence, we first need to quit our current session:

In [None]:
# quit current session
driver.quit()


Make sure to run the correct line of code when restarting the driver: of the two lines that start the driver, the first is for Chrome and the second (commented out by default) is for Firefox.

In [None]:
# adding our e-mail address to the user-agent
opts = Options()
opts.add_argument("user-agent=Getting news feed data; contact me through: [e-mail address]")


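driver = webdriver.Chrome(options=opts)  # initiate driver with new user-agent for Chrome
# driver = webdriver.Firefox(options=opts)  # initiate driver with new user-agent for Firefox
# Note: if Firefox ignores the argument above, setting the user-agent via a preference may be needed, e.g.:
# opts.set_preference("general.useragent.override", "Getting news feed data; contact me through: [e-mail address]")

# let's check if we changed our user-agent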
agent = driver.execute_script("return navigator.userAgent")
print(agent)

Looks good! Now we can start with our project and visit Quora.

In [None]:
# url to visit
url_search = "https://www.quora.com/"
# go to url
driver.get(url_search)
sleep(1.5) # set sleep time for 1.5 seconds

We will set some pauses occasionally to slow down the scraping process and give the browser some time to load the website. Next, we want to sign into the website. With Selenium, we can automate the step and fill in all the text fields. 

As with other scraping approaches, we need to find the web elements by inspecting the HTML structure of the website and locating them through their paths, classes, or names. Ideally, the elements have an ID we can identify them with, as in the case of the e-mail address and password fields.

In [None]:
# providing log-in credentials to the website

EMail_field = driver.find_element(By.XPATH, '//*[@id="email"]') # Find e-mail field
# specify your e-mail address
my_email = "ENTER YOUR E-MAIL ADDRESS" # replace the placeholder with your own address

EMail_field.send_keys(my_email) # sending the string to the e-mail field
sleep(1.5)

PW_field = driver.find_element(By.XPATH, '//*[@id="password"]') # find password field
# specify your password 
my_password = "ENTER YOUR PASSWORD" # replace the placeholder with your own password

PW_field.send_keys(my_password) # sending the string to the password field
sleep(1.5)
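
Typing credentials directly into a notebook makes it easy to share them by accident. One possible alternative, sketched here under the assumption that you have stored your log-in data in environment variables, is to read them at runtime. The variable names `QUORA_EMAIL` and `QUORA_PASSWORD` and the helper `load_credentials` are hypothetical and chosen only for this example:

```python
import os

def load_credentials(env=os.environ):
    """Read the log-in data from environment variables instead of the notebook.

    QUORA_EMAIL and QUORA_PASSWORD are hypothetical variable names chosen for
    this sketch; use whatever names you set on your own machine.
    """
    email = env.get("QUORA_EMAIL")
    password = env.get("QUORA_PASSWORD")
    if not email or not password:
        raise RuntimeError("Set QUORA_EMAIL and QUORA_PASSWORD before running the scraper.")
    return email, password

# Demonstration with a stand-in dictionary so the sketch runs anywhere;
# in practice, call load_credentials() without arguments.
demo_env = {"QUORA_EMAIL": "me@example.org", "QUORA_PASSWORD": "not-a-real-password"}
my_email, my_password = load_credentials(demo_env)
```

The returned strings can then be passed to `send_keys()` in the same way as the placeholder strings.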

After we fill in all our information, we can find the log-in button and click on it:


In [None]:
driver.find_element(By.CLASS_NAME, "cCiFZD").click()
sleep(10)

**Note: This does not happen every time, but in case you encounter a reCAPTCHA, you can click on the checkbox and then log in:**

In [None]:
# Clicking on the checkbox
driver.find_elements(By.CSS_SELECTOR, "div.qu-mb--medium")[3].click()

In [None]:
# Logging in
driver.find_elements(By.CLASS_NAME, "iyYUZT")[4].click()
sleep(10)

After we log into our account, we can see the cookie notification. We can also interact with pop-ups and accept or reject them. We will reject the cookies by finding the *Reject All* button and clicking on it:

In [None]:
# rejecting cookies
driver.find_element(By.ID, "onetrust-reject-all-handler").click()


Your website might not be in English, depending on the region you are accessing Quora from.

In our case, we are accessing the website from Germany. However, through the language settings at the top of the website, we can change the language to English. We can also automate that step:

In [None]:
# click on menu
driver.find_elements(By.CLASS_NAME, "puppeteer_popper_reference")[1].click()
# click to select "English"
driver.find_element(By.CSS_SELECTOR, "div.qu-dynamicFontSize--button").click()

Next, we want to collect some of the information present in our news feed. We need to find the container for the entire feed to collect individual posts, answers, or questions.

In [None]:
# access whole feed containing various forms of posts/questions/answers etc.
NewsFeed = driver.find_element(By.CLASS_NAME, "dom_annotate_multifeed_home")

When we inspect the structure of the news feed, we can see that posts, questions, answers, or advertisements have different classes. Thus, we can leverage the class names to access the information we are interested in. For this guide, we only want to collect data from answered questions, which are the predominant elements in our feed.

By inspecting the website, we know that one element with the class "dom_annotate_multifeed_bundle_AnswersBundle" contains the answers we are interested in.

Let's have a look at the first answer in our feed:

In [None]:
# find all answers
Answers = NewsFeed.find_elements(By.CLASS_NAME, "dom_annotate_multifeed_bundle_AnswersBundle")
# select first answer and print all containing texts
answer1 = Answers[0]
print(answer1.text)

The text shows that we have several different elements, which are separated by a new line (\n).
The first separated element seems to be the author's name. We can also spot the story title and the number of shares and comments at the end of the string.

Since newline characters separate all the information in the current string, we could just split the text by "\n" and deduce from the order which piece of information each segment corresponds to. However, it is likely that some answers will contain more or fewer data points, so the order of the split elements does not generalize. Thus, a better way to obtain each piece of information is to select the corresponding elements.

For example, we can find the title of the answer through the class name "qu-userSelect--text" with the following code:

In [None]:
title = answer1.find_element(By.CLASS_NAME, "qu-userSelect--text").text
print("the title is: ", title)

Similarly, we can find the number of shares or comments:

In [None]:
shares = answer1.find_element(By.CLASS_NAME, "dom_annotate_answer_action_bar_share").text
comments = answer1.find_element(By.CLASS_NAME, "dom_annotate_answer_action_bar_comment").text

print("number of shares: ", shares)
print("number of comments: ", comments)

The numbers of shares and comments are still strings, but we can convert them into integers when we save the data in a data frame.
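
As a rough sketch of such a conversion, the helper below turns a count string into an integer. The function name `parse_count` is hypothetical, and the handled formats (plain digits, thousands separators, and `K`/`M` abbreviations) are assumptions about how the counts might be displayed, not something confirmed by the website:

```python
def parse_count(text):
    """Convert a count string such as '12', '1,234', or '1.2K' to an integer."""
    cleaned = text.strip().replace(",", "").upper()
    if not cleaned:
        return 0  # assumption: an empty label means that there are no shares/comments
    multipliers = {"K": 1_000, "M": 1_000_000}
    suffix = cleaned[-1]
    if suffix in multipliers:
        # e.g. '1.2K' -> 1.2 * 1000; round() guards against floating-point noise
        return int(round(float(cleaned[:-1]) * multipliers[suffix]))
    return int(cleaned)

print(parse_count("12"), parse_count("1.2K"), parse_count(""))
```

Strings in any other format raise a `ValueError`, which makes unexpected labels visible instead of silently turning them into wrong numbers.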

Since we saved all answers of our feed in one variable, we can loop over it and extract all the information we are interested in. To make this process easier, we can define a function that extracts the information from an answer element and returns a list with the information we want:

In [None]:
# defining the function to extract information from each answer post
def get_post_info(PostInfo):
    AuthorInfo = PostInfo.find_element(By.CLASS_NAME, "qu-alignItems--flex-start") # Container for author information 
    authorName = AuthorInfo.find_element(By.CLASS_NAME, "qu-wordBreak--break-word").text # author name
    authorLink = AuthorInfo.find_element(By.TAG_NAME, "a").get_attribute("href") # author link
   
    title = PostInfo.find_element(By.CLASS_NAME, "qu-userSelect--text").text # answer title
    
    StoryLinkContainer = PostInfo.find_element(By.CLASS_NAME, "qu-mb--tiny").find_element(By.TAG_NAME, "a") # container for story link 
    StoryLink = StoryLinkContainer.get_attribute("href") # get story link
    
    upvotes = PostInfo.find_element(By.CLASS_NAME, "dom_annotate_answer_action_bar_upvote").text # number of upvotes
    shares = PostInfo.find_element(By.CLASS_NAME, "dom_annotate_answer_action_bar_share").text # number of shares
    comments = PostInfo.find_element(By.CLASS_NAME, "dom_annotate_answer_action_bar_comment").text # number of comments
    
    # aggregate the answer data into a list
    post_info = [authorName, authorLink, title, StoryLink, upvotes, shares, comments]
    return post_info

Now, we need a loop that applies the function to every answer and saves the results in another list:

In [None]:
AnswersInfo = [] # initiate list to save data
for post in Answers: # loop over all answers
    AnswersInfo.append(get_post_info(post)) # add answer info to the list
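
Since some answers may lack one of the elements, a single missing element would make `find_element` raise an error and stop the whole loop. One possible way around this, shown as a sketch, relies on `find_elements`, which returns an empty list when nothing matches. The helper `safe_text` and the small `FakeElement` class are hypothetical; the latter only stands in for a Selenium web element so that the example runs without a browser:

```python
def safe_text(parent, by, value, default=""):
    """Return the text of the first matching child element, or a default if none exists."""
    matches = parent.find_elements(by, value)
    return matches[0].text if matches else default

class FakeElement:
    """Stand-in for a Selenium web element, used only for this demonstration."""
    def __init__(self, text="", children=None):
        self.text = text
        self._children = children or {}

    def find_elements(self, by, value):
        # mimic Selenium: return a (possibly empty) list of matching elements
        return self._children.get((by, value), [])

# "class name" is the string behind By.CLASS_NAME
post = FakeElement(children={("class name", "qu-userSelect--text"): [FakeElement("A title")]})

title = safe_text(post, "class name", "qu-userSelect--text")
shares = safe_text(post, "class name", "dom_annotate_answer_action_bar_share", default=None)
print(title, shares)
```

In the notebook, you would pass a real answer element and `By.CLASS_NAME` instead of the stand-ins.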

In [None]:
# let's have a look at the number of answers we gathered:
len(AnswersInfo)

We collected only very few answers. Why is that?

Because the website only loaded very few answers, and we need to scroll down on the website to generate more content. 
Luckily, Selenium can help us!

There are several ways to imitate scrolling, two of which we will look at here:

   - scrolling by pixel 
   - scrolling to the bottom of the page

Let's start with the first one:

In [None]:
# 1. scroll down incrementally
driver.execute_script("window.scrollTo(0, 1000)")
sleep(1) 
driver.execute_script("window.scrollTo(0, 2000)")
sleep(1) 
driver.execute_script("window.scrollTo(0, 3000)")
sleep(1) 
driver.execute_script("window.scrollTo(0, 4000)")
sleep(1) 
driver.execute_script("window.scrollTo(0, 5000)")
sleep(1) 
driver.execute_script("window.scrollTo(0, 6000)")
sleep(1) 
driver.execute_script("window.scrollTo(0, 7000)")
sleep(1) 
driver.execute_script("window.scrollTo(0, 8000)")
sleep(1) 
driver.execute_script("window.scrollTo(0, 9000)")
sleep(1) 
driver.execute_script("window.scrollTo(0, 10000)")
sleep(1) 

Before we try the second approach, let's scroll back to the top of the page:

In [None]:
# to the top of the page
driver.execute_script("window.scrollTo(0, -document.body.scrollHeight);")
sleep(5)

For option two, we scroll to the bottom of the page, which is given by the document height of the website. As before, we can repeat that process to load more content:

In [None]:
# to the bottom of the page - here we give the browser a bit more time to load all the content
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2) 
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2) 
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2) 
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2) 
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2) 
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2) 
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2) 

Of course, we can also implement loops for each of those processes, but for now, we loaded enough content.
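
Such a loop could look like the following sketch. The function `scroll_to_bottom` and its parameters are hypothetical, and stopping once the page height no longer grows is an assumption about how to detect that no more content is being loaded. The `FakeDriver` class only stands in for the real browser so that the example runs on its own:

```python
from time import sleep

def scroll_to_bottom(driver, max_scrolls=10, pause=2.0):
    """Scroll down repeatedly; stop early once the page height stops growing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        sleep(pause)  # give the browser time to load new content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return scrolls

class FakeDriver:
    """Stand-in for a Selenium driver: the page grows twice, then stays the same."""
    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.position = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[self.position]
        # a scroll command "loads" the next chunk of content, if there is any
        self.position = min(self.position + 1, len(self.heights) - 1)

number_of_scrolls = scroll_to_bottom(FakeDriver(), max_scrolls=10, pause=0)
print(number_of_scrolls)
```

With the real browser, you would call `scroll_to_bottom(driver)` and keep a pause of a few seconds.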

To grab the newly loaded content, we need to find the newsfeed again and find all answers.

In [None]:
NewsFeed = driver.find_element(By.CLASS_NAME, "dom_annotate_multifeed_home")
Answers = NewsFeed.find_elements(By.CLASS_NAME, "dom_annotate_multifeed_bundle_AnswersBundle")

Finally, we can re-run our loop to grab all the info.

In [None]:
AnswersInfo = []
for post in Answers:
    AnswersInfo.append(get_post_info(post))

Let's check the number of answers we collected this time:

In [None]:
len(AnswersInfo)

Better! If we want to collect more data, we can implement more scrolling, but we collected enough information for demonstration purposes now.

Next, we can convert the list into a data frame, making it easier for us to work with the data.

In [None]:
AnswersInfo_df = pd.DataFrame(AnswersInfo, columns=["author", "author_link",
                                                    "title", "story_link",
                                                    "num_upvotes", "num_shares", 
                                                    "num_comments"])

In [None]:
AnswersInfo_df.head(4)

Once we are done working with the driver, we can close the current session:

In [None]:
# close driver
driver.quit()

<div class="alert alert-block alert-info">
<b>Hint:</b> 
    
We can use similar operations to gather information on advertisements, posts, or questions by accessing other classes, such as:
- "dom_annotate_multifeed_bundle_AdBundle" for Ads
- "dom_annotate_multifeed_bundle_PostBundle" for posts

However, our function for accessing author and post information might have to be adapted for those classes.

It is also worth noting that instead of watching the browser and its behavior, we can also use a headless browser that runs in the background without us seeing it (have a look at [this Stackoverflow link](https://stackoverflow.com/questions/53657215/running-selenium-with-headless-chrome-webdriver)). You can specify this setting before starting the driver with an option like `opts.add_argument("--headless")`.
</div>

## 4.4. Important notes and insights

Web scraping gives us access to web data that we need for analysis and decision making. Websites and scraping tasks come in many shapes and sizes, and the libraries we teach here can make these tasks easier and faster. Comparing BeautifulSoup and Selenium, BeautifulSoup is more user-friendly, quicker to learn, and an efficient choice for smaller scraping tasks. Selenium is more complicated to use, but it is the better fit when the target website relies heavily on JavaScript. We can use Selenium to control every major web browser, such as Chrome, Internet Explorer, and Firefox. Our actions are not limited to loading web pages: we can also interact with websites through mouse clicks and by filling in forms.

We have not introduced [Scrapy](https://scrapy.org/) here, as it is geared toward more complex scraping tasks. It is comparatively light on memory and CPU usage and also supports data extraction from HTML sources. Its functionality can be extended, and existing projects can easily be transferred into new ones, which makes it a great library for complex and larger projects.

Overall, we should know the available tools well enough to choose the best one for each task. All of the libraries discussed here are free and open source, and each is supported by a community of developers, which is a big plus as we develop our projects. Which one to choose, however, depends on the project at hand, since they all have their pros and cons.


## 4.5. References

[<a href='#destination1_'>1</a>] https://pypi.org/project/feedparser/ <a id='destination1'></a>

[<a href='#destination1_'>2</a>] https://rss.com/blog/find-rss-feed/#:~:text=Right%20click%20on%20the%20website's,between%20the%20quotes%20after%20href%3D

[<a href='#destination1_'>3</a>] https://dev.to/mr_destructive/feedparser-python-package-for-reading-rss-feeds-5fnc

[<a href='#destination2_'>4</a>] https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML/Getting_started <a id='destination2'></a>

[<a href='#destination3_'>5</a>] http://web.simmons.edu/~grabiner/comm244/weekfour/document-tree.html <a id='destination3'></a>

[<a href='#destination4_'>6</a>] https://www.crummy.com/software/BeautifulSoup/bs4/doc/ <a id='destination4'></a>

[<a href='#destination4_'>7</a>] Fabian's notebook from GESIS fall seminar 2021: https://colab.research.google.com/drive/1uKxOc8mXTE2b05uUq-YlijJYzOTgi5DZ#scrollTo=ao_sLGiOSu7Y

[<a href='#destination5_'>8</a>] The Social Comquant Workshop 10 at https://github.com/strohne/autocol <a id='destination5'></a>


<br/><br/>

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/#tve-jump-1788432a71d

https://developer.mozilla.org/en-US/docs/Web/HTML/Element

https://medium.com/geekculture/web-scraping-cheat-sheet-2021-python-for-web-scraping-cad1540ce21c#b81d

https://trends.google.com/trends/yis/2021/DE/

https://blog.google/products/search/15-tips-getting-most-out-google-trends/

https://limeproxies.netlify.app/blog/selenium-vs-beautifulsoup


Do not miss checking out the Social Comquant Workshop 10 at: https://github.com/strohne/autocol


<div class='alert alert-block alert-success'>
<b>Document information</b>

Contact and main author: N. Gizem Bacaksizlar Turbic & ..?

Contributors: Pouria Mirelmi, Haiko Lietz & Felix Soldner & ..?

Acknowledgements: Fabian Floeck? ...

Version date: XX. December 2022

License: ...
</div>