<img src='images/gesis.png' style='height: 50px; float: left'>
<img src='images/social_comquant.png' style='height: 50px; float: left; margin-left: 40px'>

## Introduction to Computational Social Science methods with Python

# Session B3: Dynamic web scraping

In the previous [Session B2](2_data_parsing_and_static_web_scraping.ipynb), we introduced [Data parsing and static web scraping](2_data_parsing_and_static_web_scraping.ipynb). To reiterate, the idea behind web scraping is to extract content from web pages by parsing their semi-structured HTML source code. This is fairly simple for pages whose content is generated statically, that is, whose content is entirely stored in the source code. The Beautiful Soup package is user-friendly and quick to learn, which makes it an efficient way to get started with static web scraping.

However, many websites change their content in a browsing session. Content changes can occur on the client side (your side) that do not necessarily change the website's source code but change its appearance, such as expanding text boxes. In that case, classical scraping methods are sufficient for grabbing the entire text because the displayed truncated version of a text might already be entirely stored in the source code. But changes can also occur on the server side. For example, when you scroll to the bottom of a page and more content is being displayed, the source code itself changes. Such content is generated dynamically by users interacting with websites and is hard to capture with data collection methods used for static websites. This is when you need **dynamic web scraping**.

Dynamic websites change their content (*i.e.*, source code) due to various reasons, such as:

- clicking, scrolling, mouse hovering  
- screen sizes, languages (IP-based), devices, time of day 
- previous visits (user's browsing history) 
- and more...

<img src='images/selenium.png' style='height: 100px; float: right; margin-left: 100px'>

Usually, user interactions are registered and the source code is updated by pieces of JavaScript code. Besides HTML, the programming language [JavaScript](https://en.wikipedia.org/wiki/JavaScript) is one of the core technologies of the web. Content changes triggered by website interactions are often difficult or impossible to obtain through classical scraping approaches because they require JavaScript to be executed in response to user interactions. Thus, other methods such as browser automation tools are needed to imitate user interactions and make dynamic web scraping possible. [Selenium](https://www.selenium.dev/) is one such browser automation tool. We can use Selenium to control every major web browser, such as Chrome, Firefox, or Edge. Actions are not limited to loading web pages: we can also perform mouse clicks, handle pop-up windows, or fill in forms.

Beautiful Soup made for a new experience in Jupyter Notebooks and Python: you had a webpage in another browser tab and had to inspect its source code to build a scraper. Now, with Selenium, we will go one step further: we will actually control the webpage from within the Jupyter Notebook. Beautiful Soup can then be used on top of Selenium. Dynamic web scraping is typically discussed in the literature as a complication of static web scraping, for example, by Bosse *et al.* (2022).

<div class='alert alert-block alert-success'>
<b>In this session</b>, 

you will learn how to collect webpage content that is dynamically generated. In subsession **B3.1**, we will get to know the Selenium package and learn how to automate web browsing with it. In subsession **B3.2**, we will work on practical examples: scraping questions and answers from the Quora platform.
</div>

<div class='alert alert-block alert-danger'>
<b>Caution</b>

This Jupyter Notebook demonstrates a workflow that consists of a **sequence of processing steps**. The notebook must be executed from top to bottom. Going back up from a certain code cell and trying to execute a cell that precedes it may not work.
</div>

<div class='alert alert-block alert-warning'>
<b>Additional resources</b>

If you would like to explore web scraping further, [Scrapy](https://scrapy.org/) is a good next step for more complex web tasks. It uses less memory and CPU and supports data extraction from HTML sources as well. Its functionality can also be extended. It is a great library for larger and more complex projects, and components written for one project can easily be reused in another.
</div>

"If the approaches we’ve covered thus far won’t work (which can happen when a website is dynamically generated or interactive, for instance), then we’ll have to call in the cavalry. In this case, the cavalry is a Python package called selenium. Since my editors at Sage feel it would be best if one could carry this book without the assistance of a hydraulic lift, we’re not going to have room to cover selenium in-text. If you want to read more about how to scrape the interactive web, we’ve prepared an **online supplement** that will guide you through the process." (McLevey, 2022)

Also Patel (2020)



## B3.1. Introducing Selenium 

### B3.1.1. Setting up basic configurations

Selenium is a [browser automation tool](https://www.selenium.dev/) that can interface with many different browser types and programming languages. Thus, we can write scripts that control the browser and imitate our behavior, such as clicking or scrolling. Before we can start writing a script, we need to set up Selenium by downloading a driver. Depending on the browser we want to use (e.g., Firefox, Chrome), we need a different driver, which can be found [here](https://www.selenium.dev/downloads/). In this notebook, we will go through instructions for using both Google Chrome and Mozilla Firefox, for which you can find the drivers [here](https://chromedriver.chromium.org/downloads) and [here](https://github.com/mozilla/geckodriver/releases), respectively. To download the correct driver, you need to know which operating system (e.g., Windows, Linux, Mac) your machine runs on and which browser version you have. For Chrome, you can find the browser version under Settings > About Chrome (see screenshot below):

<img src='images/chrome_version.JPEG' width="1000" height="1000" align="center"/>

Download the correct driver and, after unpacking the zip folder, place the driver file (an *.exe* file on Windows) in the same folder as this notebook.

The Selenium website contains documentation for all supported programming languages, which you can find [here](https://www.selenium.dev/documentation/). However, that documentation is not very concise, and since we are using Python, we can also consult the separate documentation for the Python bindings [here](https://selenium-python.readthedocs.io/).

Both sets of documentation are very handy and should be kept close when working with Selenium. When you inspect them, you will notice that, besides sending specific behavioral commands to the browser, accessing web elements is very similar to other approaches, such as Beautiful Soup. You will need XPath expressions, CSS selectors, and other properties of web elements to interact with them.


### B3.1.2. Extracting relevant information from dynamic webpages

<img src='images/quora_logo.png' style='height: 120px; float: right; margin-left: 10px' >

As we already discussed in the section on extracting relevant information from static webpages, a project might require data from different web sources to answer its research questions. Or we might simply want to make everyday tasks easier and faster through automated web scraping instead of manual browsing and scrolling, such as searching for jobs on the internet, finding the histories of our favorite bands, or reading different opinions on various questions.


## B3.2. Getting practical with Quora

As a practical example, let's explore scraping [Quora](https://www.quora.com/), a social question-and-answer website where users can collaborate by editing questions and commenting on answers that have been submitted by other users.

In this section, we will showcase how you can use Selenium with Chrome or Firefox to collect data from the dynamic website Quora. Before collecting data, we need to check whether we are allowed to collect data from the website. Quora states that we are permitted to employ scrapers but must adhere to the [robots.txt](https://www.quora.com/robots.txt), which specifies which content may and may not be scraped, and that we must make ourselves known to the website so that they can contact us if they want to. We can give Quora our contact information by adding it to the user-agent, the information the browser sends to the website. *Section 4-d: Permitted uses of Quora’s terms of service* specifies the rules for scraping (see the screenshot below):

<img src='images/quora.png' width="700" height="700" align="center"/>
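
Checking a URL against robots.txt rules can also be done programmatically. The following is a minimal sketch using `urllib.robotparser` from the Python standard library. The rules in `EXAMPLE_ROBOTS_TXT`, the user-agent name `my-research-scraper`, and the `example.com` URLs are made up for illustration and do not reflect Quora's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only; the real file at
# https://www.quora.com/robots.txt may contain different rules.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# "my-research-scraper" is a hypothetical user-agent name
print(parser.can_fetch("my-research-scraper", "https://www.example.com/some-question"))
print(parser.can_fetch("my-research-scraper", "https://www.example.com/private/page"))
```

To check a live website instead, the same class offers `set_url()` and `read()`, which download and parse the actual robots.txt file.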

### B3.2.1 Example: Scraping posts

Before scraping any information, we need to create an account on Quora. We would recommend creating a new account for your scraping project. Go to www.quora.com and create a new account.

After creating the account, make sure to import all the necessary libraries:

In [None]:
import pandas as pd # to work with data frames; you may have already imported it in this notebook
from time import sleep # to slow down our scraper

# all selenium specific packages:
from selenium import webdriver # to load the browser
from selenium.webdriver.common.keys import Keys # necessary to automate typings, like filling out the forms
from selenium.webdriver.common.by import By # necessary to search for web elements

If you are using Chrome, uncomment the first import in the next cell and comment out the second one; if you are using Firefox, keep the cell as it is. Both classes are called `Options`, so if you import both of them, the name will refer to whichever was imported last (here, Firefox):

In [None]:
#from selenium.webdriver.chrome.options import Options # necessary to change our user agent when working with Chrome

from selenium.webdriver.firefox.options import Options # necessary to change our user agent when working with Firefox

We can now start the driver (i.e., the browser), which should appear as a separate window.

In [None]:
# starting the driver

# For Chrome:
#driver = webdriver.Chrome()

# For Firefox:
driver = webdriver.Firefox()

As you can see, a new browser window opens, which is *being controlled by automated test software:*

<img src='images/chrome.png' width="700" height="700" align="center"/>

For Firefox, it looks something like this:

<img src='images/firefox.png' width="700" height="700" align="center"/>

The *driver* instance is the browser we will use to navigate the website and find web elements. 
We can now check what our user-agent for our browser is with the following code:

In [None]:
agent = driver.execute_script("return navigator.userAgent")
print(agent)

We can change the user-agent information to make ourselves identifiable so that Quora can contact us if they want. To do so, we need to initiate a new driver with the changed information. Hence, we first need to quit our current session:

In [None]:
# quit current session
driver.quit()

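Make sure to run the correct lines of code when restarting the driver. In the next cell, the lines for Chrome are commented out by default and the lines for Firefox are active. The two browsers take the user-agent in different ways: Chrome accepts it as a command-line argument, whereas Firefox reads it from the `general.useragent.override` preference.

In [None]:
# adding our e-mail address to the user-agent
user_agent = "Getting news feed data; contact me through: [e-mail address]"
opts = Options()

# For Chrome: pass the user-agent as a command-line argument
#opts.add_argument(f"user-agent={user_agent}")
#driver = webdriver.Chrome(options=opts) # initiate driver with new user-agent for Chrome

# For Firefox: set the user-agent through a browser preference
opts.set_preference("general.useragent.override", user_agent)
driver = webdriver.Firefox(options=opts) # initiate driver with new user-agent for Firefox

# let's check if we changed our user-agent
agent = driver.execute_script("return navigator.userAgent")
print(agent)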

If everything has worked up to this point, we can start scraping [Quora](https://www.quora.com/).

In [None]:
# url to visit
url_search = "https://www.quora.com/"
# go to url
driver.get(url_search)
sleep(1.5) # set sleep time for 1.5 seconds

We will set some pauses occasionally to slow down the scraping process and give the browser some time to load the website. Next, we want to sign into the website. With Selenium, we can automate the step and fill in all the text fields. 

As with other scraping approaches, we need to find the web elements by inspecting the HTML structure of the website and locating them through their paths, classes, or names. Ideally, the elements have an ID we can identify them with, as in the case of the e-mail address and password fields.

In [None]:
# providing log-in credentials to the website

EMail_field = driver.find_element(By.XPATH, '//*[@id="email"]') # Find e-mail field
# specify your e-mail address
my_email = "ENTER YOUR E-MAIL ADDRESS"

EMail_field.send_keys(my_email) # sending the string to the e-mail field
sleep(1.5)

PW_field = driver.find_element(By.XPATH, '//*[@id="password"]') # find password field
# specify your password 
my_password = "ENTER YOUR PASSWORD"

PW_field.send_keys(my_password) # sending the string to the password field
sleep(1.5)

After we fill in all our information, we can find the log-in button and click on it:

In [None]:
driver.find_elements(By.CLASS_NAME, "iyYUZT")[4].click()
sleep(10)
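
Fixed pauses such as `sleep(10)` either wait longer than necessary or not long enough. An alternative is to poll for a condition until it holds or a timeout expires. Selenium ships its own implementation of this idea (`WebDriverWait` in `selenium.webdriver.support.ui`); the sketch below only illustrates the underlying polling logic in plain Python. The helper name `wait_until` and the stand-in condition `element_appeared` are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if the condition is still falsy after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page on which an element only shows up at the third check
attempts = {"count": 0}


def element_appeared():
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until(element_appeared, timeout=5.0, poll_interval=0.01)
print(attempts["count"])
```

With a real driver, the condition could be something like `lambda: driver.find_elements(By.ID, "onetrust-reject-all-handler")`, which returns an empty (falsy) list until the element exists.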

<div class='alert alert-block alert-danger'>
<b>Caution</b>
    
In case you encounter a reCAPTCHA, you can click on the check box and then log in.
</div>


In [None]:
# Clicking on the checkbox
driver.find_elements(By.CSS_SELECTOR, "div.qu-mb--medium")[3].click()

In [None]:
# Logging in
driver.find_elements(By.CLASS_NAME, "iyYUZT")[4].click()
sleep(10)

After we log into our account, we can see the cookie notification. We can also interact with pop-ups and accept or reject them. We will reject the cookies by finding the *Reject All* button and clicking on it:

In [None]:
# rejecting cookies
driver.find_element(By.ID, "onetrust-reject-all-handler").click()

<div class='alert alert-block alert-info'>
<b>Insight</b>

Your website might not be in English, depending on the region you are accessing Quora from.

In our case, we are accessing the website from Germany. However, through the language settings at the top of the website, we can change the language to English. We also can automate that step as in the following cell.
    
</div>


In [None]:
# click on menu
driver.find_elements(By.CLASS_NAME, "puppeteer_popper_reference")[1].click()
# click to select "English"
driver.find_element(By.CSS_SELECTOR, "div.qu-dynamicFontSize--button").click()

Next, we want to collect some of the information present in our news feed. We need to find the container for the entire feed to collect individual posts, answers, or questions.

In [None]:
# access whole feed containing various forms of posts/questions/answers etc.
NewsFeed = driver.find_element(By.CLASS_NAME, "dom_annotate_multifeed_home")

When we inspect the structure of the news feed, we can see that posts, questions, answers, or advertisements have different classes. Thus, we can leverage the class names to access the information we are interested in. For this guide, we only want to collect data from answered questions, which are the predominant elements in our feed.

By inspecting the website, we know that one element with the class "dom_annotate_multifeed_bundle_AnswersBundle" contains the answers we are interested in.

Let's have a look at the first answer in our feed:

In [None]:
# find all answers
Answers = NewsFeed.find_elements(By.CLASS_NAME, "dom_annotate_multifeed_bundle_AnswersBundle")
# select first answer and print all containing texts
answer1 = Answers[0]
print(answer1.text)

The text shows that we have several different elements, which are separated by a new line (\n).
The first separated element seems to be the author's name. We can also spot the story title and the number of shares and comments at the end of the string.

Since newline characters separate all the information in the current string, we could just split the text by "\n" and deduce from the order which piece of text corresponds to which piece of information. However, it is likely that some answers contain more or fewer data points, so the order of the split elements does not generalize. Thus, a better way to obtain each piece of information is to select the corresponding element.

For example, we can find the title of the answer through the class name "qu-userSelect--text" with the following code:

In [None]:
title = answer1.find_element(By.CLASS_NAME, "qu-userSelect--text").text
print("the title is: ", title)

Similarly, we can find the number of shares or comments:

In [None]:
shares = answer1.find_element(By.CLASS_NAME, "dom_annotate_answer_action_bar_share").text
comments = answer1.find_element(By.CLASS_NAME, "dom_annotate_answer_action_bar_comment").text

print("number of shares: ", shares)
print("number of comments: ", comments)

The numbers of shares and comments are still strings, but we can convert them into integers when we save the data in a data frame.
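
The conversion from counter strings to integers can be sketched as follows. This is a minimal example that assumes counters appear as plain digits, with thousands separators, or abbreviated with a `K` or `M` suffix; whether Quora uses exactly these formats is an assumption, and the helper name `parse_count` is hypothetical.

```python
import re

# Assumed abbreviations for thousands and millions
_MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def parse_count(raw):
    """Convert strings such as '12', '1,234' or '1.2K' to an integer.

    Returns None if the string does not contain a number.
    """
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)([KM]?)\b", raw, flags=re.IGNORECASE)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    suffix = match.group(2).upper()
    return round(number * _MULTIPLIERS.get(suffix, 1))


for example in ["12", "1,234", "1.2K", "3.5M", ""]:
    print(repr(example), "->", parse_count(example))
```

Returning `None` for empty strings keeps missing counters distinguishable from a true zero once the values are stored in a data frame.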

Since we saved all answers of our feed in one variable, we can loop over it and extract all the information we are interested in. To make this process easier, we can define a function that extracts the information from an answer element and returns a list with the information we want:

In [None]:
# defining the function to extract information from each answer post
def get_post_info(PostInfo):
    AuthorInfo = PostInfo.find_element(By.CLASS_NAME, "qu-alignItems--flex-start") # Container for author information 
    authorName = AuthorInfo.find_element(By.CLASS_NAME, "qu-wordBreak--break-word").text # author name
    authorLink = AuthorInfo.find_element(By.TAG_NAME, "a").get_attribute("href") # author link
   
    title = PostInfo.find_element(By.CLASS_NAME, "qu-userSelect--text").text # answer title
    
    StoryLinkContainer = PostInfo.find_element(By.CLASS_NAME, "qu-mb--tiny").find_element(By.TAG_NAME, "a") # container for story link 
    StoryLink = StoryLinkContainer.get_attribute("href") # get story link
    
    upvotes = PostInfo.find_element(By.CLASS_NAME, "dom_annotate_answer_action_bar_upvote").text # number of upvotes
    shares = PostInfo.find_element(By.CLASS_NAME, "dom_annotate_answer_action_bar_share").text # number of shares
    comments = PostInfo.find_element(By.CLASS_NAME, "dom_annotate_answer_action_bar_comment").text # number of comments
    
    # aggregate the answer data into a list
    post_info = [authorName, authorLink, title, StoryLink, upvotes, shares, comments]
    return post_info
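
Note that `find_element` raises an exception when no matching element exists, so a single post without, say, an upvote counter would stop the whole loop. One hedged way around this is to use `find_elements`, which returns an empty list in that case. The helper `text_or_default` and the `FakeElement` class below are hypothetical; the fake class only stands in for a Selenium web element so that the sketch runs without a browser. The string `"class name"` is the value of `By.CLASS_NAME`.

```python
def text_or_default(parent, by, value, default=None):
    """Return the text of the first matching child element, or `default`."""
    matches = parent.find_elements(by, value)  # empty list if nothing is found
    return matches[0].text if matches else default


class FakeElement:
    """Minimal stand-in for a Selenium web element (not a Selenium class)."""

    def __init__(self, text="", children=None):
        self.text = text
        self._children = children or {}

    def find_elements(self, by, value):
        return self._children.get((by, value), [])


post = FakeElement(
    children={("class name", "qu-userSelect--text"): [FakeElement("A title")]}
)

print(text_or_default(post, "class name", "qu-userSelect--text"))
print(text_or_default(post, "class name", "dom_annotate_answer_action_bar_upvote", default="N/A"))
```

Inside `get_post_info`, a call such as `text_or_default(PostInfo, By.CLASS_NAME, "dom_annotate_answer_action_bar_upvote")` could replace the corresponding `find_element(...).text` line.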

Now, we need a loop to save the data into another list:

In [None]:
AnswersInfo = [] # initiate list to save data
for post in Answers: # loop over all answers
    AnswersInfo.append(get_post_info(post)) # add answer info to the list

In [None]:
# lets have a look at the number of answers we gathered:
len(AnswersInfo)

In [None]:
AnswersInfo[1]

We collected only very few answers. Why is that?

Because the website only loaded very few answers, and we need to scroll down on the website to generate more content. 
Luckily, Selenium can help us!

We have several methods to imitate scrolling:

   - scrolling by pixel 
   - scrolling to the bottom of the page

Let's start with the first one:

In [None]:
# 1. scroll down incrementally
driver.execute_script("window.scrollTo(0, 1000)")
sleep(1) 
driver.execute_script("window.scrollTo(0, 2000)")
sleep(1) 
driver.execute_script("window.scrollTo(0, 3000)")
sleep(1) 
driver.execute_script("window.scrollTo(0, 4000)")
sleep(1) 
driver.execute_script("window.scrollTo(0, 5000)")
sleep(1) 
driver.execute_script("window.scrollTo(0, 6000)")
sleep(1) 
driver.execute_script("window.scrollTo(0, 7000)")
sleep(1) 
driver.execute_script("window.scrollTo(0, 8000)")
sleep(1) 
driver.execute_script("window.scrollTo(0, 9000)")
sleep(1) 
driver.execute_script("window.scrollTo(0, 10000)")
sleep(1) 

We can also use another approach, but first, we can scroll to the top of the page:

In [None]:
# to the top of the page
driver.execute_script("window.scrollTo(0, -document.body.scrollHeight);")
sleep(5)

For option two, we scroll to the bottom of the page, indicated by the document height of the website. Similarly, as before, we can repeat that process to load more content:

In [None]:
# to the bottom of the page - here we give the browser a bit more time to load all the content
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2) 
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2) 
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2) 
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2) 
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2) 
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2) 
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(2) 

Of course, we can also implement loops for each of those processes, but for now, we loaded enough content.
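
As a minimal sketch of such a loop, the function below keeps scrolling to the bottom until the document height stops growing or a maximum number of rounds is reached. The function name `scroll_to_bottom_until_stable`, its parameters, and the `FakeDriver` class are hypothetical; the fake driver only mimics `execute_script` so that the example runs without a browser.

```python
from time import sleep


def scroll_to_bottom_until_stable(driver, pause=2.0, max_rounds=10):
    """Scroll to the page bottom until the document height stops growing.

    Returns the number of scrolls that were performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give the page time to load new content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number  # nothing new was loaded
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Stand-in for a Selenium driver: the page grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.scrolls = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[min(self.scrolls, len(self.heights) - 1)]
        self.scrolls += 1


fake_driver = FakeDriver()
print(scroll_to_bottom_until_stable(fake_driver, pause=0))
```

With the real `driver`, the call would simply be `scroll_to_bottom_until_stable(driver)`. The `max_rounds` limit matters for feeds that never run out of content.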

To grab the newly loaded content, we need to find the newsfeed again and find all answers.

In [None]:
NewsFeed = driver.find_element(By.CLASS_NAME, "dom_annotate_multifeed_home")
Answers = NewsFeed.find_elements(By.CLASS_NAME, "dom_annotate_multifeed_bundle_AnswersBundle")

Finally, we can re-run our loop to grab all the info.

In [None]:
AnswersInfo = []
for post in Answers:
    AnswersInfo.append(get_post_info(post))

Let's check the number of answers we collected this time:

In [None]:
len(Answers)

Better! If we want to collect more data, we can implement more scrolling, but we collected enough information for demonstration purposes now.

Next, we can convert the list into a data frame, making it easier for us to work with the data.

In [None]:
AnswersInfo_df = pd.DataFrame(AnswersInfo, columns=["author", "author_link",
                                                    "title", "story_link",
                                                    "num_upvotes", "num_shares", 
                                                    "num_comments"])

In [None]:
AnswersInfo_df.head(4)

Once we are done working with the driver, we can close the current session:

In [None]:
# close driver
driver.quit()

<div class="alert alert-block alert-info">
<b>Hint:</b> 
    
We can do similar operations to gather information on advertisements, posts, or questions by accessing other classes, such as:
- "dom_annotate_multifeed_bundle_AdBundle" for Ads
- "dom_annotate_multifeed_bundle_PostBundle" for posts

However, our function for accessing author and post information might have to be adapted for those classes.

It is also worth noting that instead of watching the browser and its behavior, we can also run a headless browser that works in the background without us seeing it (have a look at [this Stack Overflow thread](https://stackoverflow.com/questions/53657215/running-selenium-with-headless-chrome-webdriver)). You can specify this setting when creating the options, for example with `opts.add_argument("--headless")`.
</div>

### B3.2.2 Example: Searching for a specific question and gathering all answers to that question on Quora

In this example, we want to log into Quora, search for a specific question, click on it, filter the results to show only the answers, and then scrape them. As before, the code cells use Firefox by default, with the Chrome alternative commented out.

Like before, we begin with starting the driver:

In [None]:
# adding our e-mail address to the user-agent
user_agent = "Getting answers data; contact me through: [e-mail address]"
opts = Options()

# For Chrome: pass the user-agent as a command-line argument
#opts.add_argument(f"user-agent={user_agent}")
#driver = webdriver.Chrome(options=opts) # initiate driver with new user-agent for Chrome

# For Firefox: set the user-agent through a browser preference
opts.set_preference("general.useragent.override", user_agent)
driver = webdriver.Firefox(options=opts) # initiate driver with new user-agent for Firefox

# let's check if we changed our user-agent
agent = driver.execute_script("return navigator.userAgent")
print(agent)

Then we go to the website:

In [None]:
# url to visit
url_search = "https://www.quora.com/"
# go to url
driver.get(url_search)
sleep(1.5) # set sleep time for 1.5 seconds

We enter our credentials:

In [None]:
# providing log-in credentials to the website

EMail_field = driver.find_element(By.XPATH, '//*[@id="email"]') # Find e-mail field
# specify your e-mail address
my_email = "ENTER YOUR E-MAIL ADDRESS"

EMail_field.send_keys(my_email) # sending the string to the e-mail field
sleep(1.5)

PW_field = driver.find_element(By.XPATH, '//*[@id="password"]') # find password field
# specify your password 
my_password = "ENTER YOUR PASSWORD"

PW_field.send_keys(my_password) # sending the string to the password field
sleep(1.5)

And then we click on the login button:

In [None]:
driver.find_elements(By.CLASS_NAME, "iyYUZT")[4].click()
sleep(10)

As before, in case you encounter a reCAPTCHA, just click on the check box by hand and then log in.

Rejecting cookies:

In [None]:
# rejecting cookies
driver.find_element(By.ID, "onetrust-reject-all-handler").click()

Changing the language to English:

In [None]:
# click on menu
driver.find_elements(By.CLASS_NAME, "puppeteer_popper_reference")[1].click()
# click to select "English"
driver.find_element(By.CSS_SELECTOR, "div.qu-dynamicFontSize--button").click()
sleep(2)

Next, we find the search bar and pass our question to it.

In [None]:
search_field = driver.find_element(By.XPATH, '//*[@enterkeyhint="search"]') # finding search field

question = "Will AI kill the art industry?"

search_field.send_keys(question) # sending the question string to the search field
sleep(2)

It will look like this:

<img src='images/search_question.png' width="900" height="700" align="center"/>

As you can see, a list of related questions shows up. The first one is the one we are looking for, so we click on it:

In [None]:
# Clicking on the question link
driver.find_elements(By.CLASS_NAME, "iyYUZT")[4].click()

Now, a list of results is shown on the page. These results contain both the answers to the question and some related topics. We want the answers only, so we need to click on the "All related" button and then select the "Answers" option. You can see what it looks like here:

<img src='images/answers.png' width="900" height="700" align="center"/>

We first click on the "All related" button:

In [None]:
# Selecting answers to be shown
driver.find_elements(By.CLASS_NAME, "iyYUZT")[17].click()
sleep(2)

Now, before we click on the "Answers" option, let's find the number of answers to the question and save it in a variable. We will need this later for scrolling down to the last answer, so that we do not miss any of them:

In [None]:
text = driver.find_elements(By.CLASS_NAME, "iyYUZT")[19].text
text = text.split()[1]
text = text.replace("(", "")
text = text.replace(")", "")
answers_count = int(text)

answers_count
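
Splitting on whitespace and stripping parentheses works as long as the label has exactly the expected shape. A slightly more defensive variant is sketched below with a regular expression. It assumes the label looks like `Answers (57)`, which is inferred from the code above rather than confirmed, and the helper name `count_in_parentheses` is hypothetical.

```python
import re


def count_in_parentheses(label):
    """Extract the integer inside the first pair of parentheses.

    Returns None if the label contains no parenthesized number.
    """
    match = re.search(r"\(([\d,.]+)\)", label)
    if match is None:
        return None
    digits = re.sub(r"[^\d]", "", match.group(1))  # drop thousands separators
    return int(digits) if digits else None


print(count_in_parentheses("Answers (57)"))
print(count_in_parentheses("Answers (1,234)"))
print(count_in_parentheses("Answers"))
```

Returning `None` instead of raising an error makes it possible to fall back to a fixed number of scrolls when the count is not shown.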

Now we can click on the Answers option:

In [None]:
driver.find_elements(By.CLASS_NAME, "iyYUZT")[19].click()

If we scroll down to the end of the page, it will show us the first 15 answers to the question. If we want to scroll down to the last answer, we need to scroll down to the end of the page, and we need to repeat this process n times, with n being the number of answers divided by 15, plus 1:

In [None]:
for i in range(answers_count // 15 + 1):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(4)

Now that we have scrolled all the way down to the end of the page, we can extract the information from the answers. We use Beautiful Soup to ease the process, and we keep the author names, dates, numbers of upvotes, and answer texts in their corresponding lists:

In [None]:
from bs4 import BeautifulSoup

authors = []
dates = []
upvotes = []
texts = []

author_urls = []

for i in range(answers_count):
    
    # Finding the answers html code:
    answer = driver.find_element(By.CLASS_NAME, "dom_annotate_question_answer_item_" + str(i))
    soup = BeautifulSoup(answer.get_attribute('innerHTML'), 'lxml')
    
    # Finding the desired information and saving them to their lists:
    authors.append(soup.find_all('a', {'class': "dFkjrQ"})[1].text)
    dates.append(soup.find('a', {'class': "answer_timestamp"}).text)
    texts.append(soup.find_all('div', {'class': "iyYUZT"})[3].text)
    
    # We also keep the URLs of the authors in a list, we will need that later in this section.
    author_urls.append(soup.find_all('a', {'class': "dFkjrQ"})[1]['href'])
    
    # Some of the answers do not have upvotes; for them, we add None to the list:
    upvote = soup.find('span', {'class': "q-text qu-whiteSpace--nowrap qu-display--inline-flex qu-alignItems--center qu-justifyContent--center"})
    if upvote is None:
        upvotes.append(None)
    else:
        upvotes.append(upvote.text)

Now that we have all the information, we can make a dataframe to keep the data in a more structured way:

In [None]:
quora_search_question_df = pd.DataFrame([authors, dates, texts, upvotes]).transpose()
quora_search_question_df.columns = ['author', 'date', 'text', 'upvotes']

quora_search_question_df.head()

We will save the dataframe to the outputs folder as a csv file:

In [None]:
quora_search_question_df.to_csv('./outputs/quora search question dataframe.csv')
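
Writing the file only works if the `outputs` folder already exists. A minimal sketch for creating the folder on demand is shown below. The helper name `ensure_output_path` and the file name `quora_answers.csv` are hypothetical, and the demonstration writes into a temporary directory so that no files are left behind.

```python
from pathlib import Path
import tempfile


def ensure_output_path(folder, filename):
    """Create `folder` if needed and return the full path of `filename` in it."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / filename


# Demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    csv_path = ensure_output_path(Path(tmp) / "outputs", "quora_answers.csv")
    csv_path.write_text("author,date\n", encoding="utf-8")
    print(csv_path.name, csv_path.parent.is_dir())
```

In the notebook, the returned path could be passed directly to `to_csv`, for example `quora_search_question_df.to_csv(ensure_output_path("outputs", "quora_answers.csv"))`.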

#### Getting users' information in more detail

In [None]:
# Opening a new tab:
driver.execute_script("window.open('https://www.quora.com/profile/Mohammed-1745')")

In [None]:
# Current tab:
driver.current_window_handle

In [None]:
# Available tabs:
driver.window_handles

In [None]:
# Switching to a child tab:

child = driver.window_handles[-1]

driver.switch_to.window(child)

In [None]:
# Switching to parent tab:
parent = driver.window_handles[0]

driver.switch_to.window(parent)
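
The open, switch, close, and switch-back steps shown above can be bundled into a context manager so that the browser always returns to the original tab, even if an error occurs while scraping. This is a sketch under assumptions: the name `new_tab` is hypothetical, and `FakeDriver` and `FakeSwitchTo` merely imitate the driver attributes used in the cells above so that the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def new_tab(driver, url):
    """Open `url` in a new tab, switch to it, and always return to the original tab."""
    parent = driver.current_window_handle
    driver.execute_script("window.open(arguments[0])", url)
    driver.switch_to.window(driver.window_handles[-1])
    try:
        yield driver
    finally:
        driver.close()  # closes the tab that currently has focus
        driver.switch_to.window(parent)


class FakeSwitchTo:
    """Stand-in for driver.switch_to (not a Selenium class)."""

    def __init__(self, driver):
        self._driver = driver

    def window(self, handle):
        self._driver.current_window_handle = handle


class FakeDriver:
    """Stand-in for a Selenium driver that only tracks tab handles."""

    def __init__(self):
        self.window_handles = ["tab-0"]
        self.current_window_handle = "tab-0"
        self.switch_to = FakeSwitchTo(self)

    def execute_script(self, script, *args):
        self.window_handles.append(f"tab-{len(self.window_handles)}")

    def close(self):
        self.window_handles.remove(self.current_window_handle)


fake_driver = FakeDriver()
with new_tab(fake_driver, "https://www.example.com/") as tab:
    print(tab.current_window_handle)  # handle of the new tab
print(fake_driver.current_window_handle)  # back on the original tab
```

With a real browser, a short pause or wait after opening the tab may still be needed before the page content is available.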

In [None]:
# Getting all users' info:

no_of_followers = []
no_of_followings = []
no_of_answers = []
no_of_questions = []
no_of_posts = []
date_joined = []
total_content_views = []
this_month_content_views = []

parent_tab = driver.current_window_handle


for i in author_urls:
    
    if 'www.quora.com' in i:
        
        driver.execute_script("window.open('"+i+"')")        
        sleep(3)
        
        child_tab = driver.window_handles[-1]
        driver.switch_to.window(child_tab)
        
        iyYUZT = driver.find_elements(By.CLASS_NAME, "iyYUZT")
        
        if len(iyYUZT) < 22:
            no_of_followers.append('N/A')
            no_of_followings.append('N/A')
            no_of_answers.append('N/A')
            no_of_questions.append('N/A')
            no_of_posts.append('N/A')
            date_joined.append('N/A')
            total_content_views.append('N/A')
            this_month_content_views.append('N/A')
            
            driver.close()
            driver.switch_to.window(parent_tab)
            continue
            
        
        no_of_followers.append(iyYUZT[11].text.split()[0])
        no_of_followings.append(iyYUZT[12].text.split()[0])
        no_of_answers.append(iyYUZT[18].text.split()[0])
        no_of_questions.append(iyYUZT[19].text.split()[0])
        no_of_posts.append(iyYUZT[20].text.split()[0])

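        # reset per-profile values so that nothing carries over from the previous profile
        content_views = None
        joined = 'N/A'
        for element in driver.find_elements(By.CLASS_NAME, "qu-truncateLines--2"):
            if 'content views' in element.text:
                content_views = element.text
            if 'Joined' in element.text:
                joined = element.text.split('Joined ')[1]
        date_joined.append(joined) # always one entry per author, so all lists keep the same length

        if content_views is None:
            total_content_views.append('N/A')
            this_month_content_views.append('N/A')
        else:
            total_content_views.append(content_views.split()[0])
            this_month_content_views.append(content_views.split('views')[1].split()[0])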
        
        driver.close()
        driver.switch_to.window(parent_tab)


    else:        
        
        no_of_followers.append('N/A')
        no_of_followings.append('N/A')
        no_of_answers.append('N/A')
        no_of_questions.append('N/A')
        no_of_posts.append('N/A')
        date_joined.append('N/A')
        total_content_views.append('N/A')
        this_month_content_views.append('N/A')

In [None]:
# Making the dataframe:

users_info_df = pd.DataFrame([authors, no_of_followers, no_of_followings, no_of_answers, no_of_questions, no_of_posts,
                             date_joined, total_content_views, this_month_content_views]).transpose()
users_info_df.columns = ['authors', 'no_of_followers', 'no_of_followings', 'no_of_answers', 'no_of_questions', 'no_of_posts',
                        'date_joined', 'total_content_views', 'this_month_content_views']

users_info_df

In [None]:
users_info_df.to_csv('./outputs/users_info_df.csv')

<div class='alert alert-block alert-info'>
<b>Insight</b>
    
We can also start considering what else we can do after learning these practices, such as searching for questions with specific keywords or looking for multiple questions and their answers.
</div>

## References

### Recommended readings

Bosse, S., Dahlhaus, L., & Engel, U. (2022). "Web data mining: Collecting textual data from web pages using R." In: Engel, U. & Quan-Haase, A. (eds), *Handbook of Computational Social Science* 2 (p. 46–70). Abingdon: Routledge. https...

<a id='mclevey_doing_2022'></a>
McLevey, J. (2022). *Doing Computational Social Science: A Practical Introduction*. SAGE. https://us.sagepub.com/en-us/nam/doing-computational-social-science/book266031. *A rather complete introduction to the field with well-structured and insightful chapters also on using Pandas. The [website](https://github.com/UWNETLAB/dcss_supplementary) offers the code used in the book.*

___

https://realpython.com/beautiful-soup-web-scraper-python/#reasons-for-web-scraping

https://medium.com/pythoneers/the-fundamentals-of-web-scraping-using-python-its-libraries-6f146b91efb4

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/#tve-jump-1788432a71d

https://developer.mozilla.org/en-US/docs/Web/HTML/Element

https://medium.com/geekculture/web-scraping-cheat-sheet-2021-python-for-web-scraping-cad1540ce21c#b81d

https://trends.google.com/trends/yis/2021/DE/

https://blog.google/products/search/15-tips-getting-most-out-google-trends/

https://limeproxies.netlify.app/blog/selenium-vs-beautifulsoup

https://github.com/strohne/autocol

<div class='alert alert-block alert-success'>
<b>Document information</b>

Contact and main author: Pouria Mirelmi

Contributors: Felix Beck-Soldner, N. Gizem Bacaksizlar Turbic, & Haiko Lietz

Acknowledgements: Fabian Flöck

Version date: 18 August 2023

License: Creative Commons Attribution 4.0 International (CC BY 4.0)
</div>