# web-scraper-with-python-and-selenium

In this project, we will cover how to start extracting data with Selenium and Python. We will build a Python script that logs in to a website, scrapes some data, formats it nicely, and stores it in a CSV file.

## An overview of Selenium
Selenium is a suite of tools for automating web browsers that was first introduced as a tool for cross-browser testing.

The `API` built by the Selenium team uses the `WebDriver` protocol to take control of a web browser, like Chrome or Firefox, and perform different tasks, like:

- Filling forms
- Scrolling
- Taking screenshots
- Clicking buttons



Now you might be wondering how all this translates into web scraping. It’s simple, really.

Data extraction can be a real pain in the neck sometimes. Websites are being built as Single Page Applications nowadays even when there’s no need for that. They’re popping `CAPTCHAs` more frequently than needed and even `blocking regular users’ IPs`.
In short, `bot detection` is a very `frustrating` feature that feels like a bug.

`Selenium` can help in these cases by loading pages in a real browser, which executes their `JavaScript` code, and by automating many tedious parts of web scraping, like `scrolling` through the page, grabbing `HTML elements`, or `exporting fetched data`.


## Installation

To show the real power of Selenium and Python, we are going to scrape some information from the /r/learnprogramming subreddit. Besides scraping data, I’ll also show you how signing in can be implemented. Now that we have an understanding of the primary tool and the website we are going to use, let’s see what other prerequisites we need to have installed:

- Python
- Selenium package
- Pandas package
- BeautifulSoup package
- Google Chrome
- Chrome driver

In [2]:
! pip install selenium --upgrade --quiet
! pip install pandas --upgrade --quiet
! pip install bs4 --upgrade --quiet

## Starting the browser

In [8]:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


We can also run Google Chrome without a graphical user interface and `log the page’s HTML content` by adding a couple of lines of code. We will enable headless mode for the Chrome driver (to remove the graphical interface) and set a `window size of 1920x1080 pixels` (to get the correct HTML code for our use case).

In [10]:
options = Options()
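# Recent Selenium 4 releases no longer support the `options.headless` setter,
# so headless mode is requested with a Chrome command-line flag instead
options.add_argument("--headless=new")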
options.add_argument("--window-size=1920,1080")
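
The headless and window-size settings can also be wrapped in a small helper so they are easy to toggle while debugging. This is only a sketch: the names `chrome_arguments` and `make_options` are hypothetical, and `--headless=new` assumes a reasonably recent version of Chrome.

```python
def chrome_arguments(headless=True, width=1920, height=1080):
    """Build the list of Chrome command-line flags used in this tutorial."""
    arguments = []
    if headless:
        # Assumption: a recent Chrome that understands the "new" headless mode
        arguments.append("--headless=new")
    arguments.append(f"--window-size={width},{height}")
    return arguments


def make_options(arguments):
    """Turn a list of flags into a Selenium ChromeOptions object."""
    # Imported here so chrome_arguments() can be used without Selenium installed
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in arguments:
        options.add_argument(argument)
    return options


print(chrome_arguments())
```

While developing, calling `make_options(chrome_arguments(headless=False))` lets us watch the browser window; switching back to headless mode is a one-argument change.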

## Locating specific data

In [None]:
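from selenium.webdriver.chrome.service import Service

# Raw string keeps the backslashes in the Windows path literal
PATH = r"C:\Program Files (x86)\chromedriver.exe"

# Selenium 4 takes the driver path through a Service object;
# passing `options` applies the headless and window-size settings
driver = webdriver.Chrome(service=Service(PATH), options=options)
driver.get("https://www.reddit.com/r/learnprogramming/top/?t=month")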

In [None]:
print(driver.page_source)
driver.quit()

The combined code looks like this. The last two lines print the page’s HTML and then quit Chrome.

In [None]:
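from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
# Headless mode is requested with a Chrome flag in recent Selenium 4 releases
options.add_argument("--headless=new")
options.add_argument("--window-size=1920,1080")

# Raw string keeps the backslashes in the Windows path literal
PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(service=Service(PATH), options=options)
driver.get("https://www.reddit.com/r/learnprogramming/top/?t=month")

print(driver.page_source)
driver.quit()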
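
If `driver.get()` raises an error, the script above never reaches `driver.quit()` and a Chrome process is left running. One way to guard against that, sketched here with a hypothetical helper called `fetch_page_source`, is to wrap the work in `try`/`finally`:

```python
def fetch_page_source(make_driver, url):
    """Open `url` with a fresh driver and return the page's HTML.

    `make_driver` is any zero-argument callable that returns a Selenium
    WebDriver. The browser is closed even if loading the page fails.
    """
    driver = make_driver()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Runs on success and on error, so no browser is left behind
        driver.quit()


# Example usage (assumes `webdriver` and `options` are set up as in this tutorial):
# html = fetch_page_source(
#     lambda: webdriver.Chrome(options=options),
#     "https://www.reddit.com/r/learnprogramming/top/?t=month",
# )
```

Passing a factory instead of a driver keeps the creation and the cleanup of the browser in one place.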

## WebElement

A `WebElement` is a Selenium object that represents an HTML element. As you will see in the following tutorial, we can perform many actions on these elements. Some of them are:

- Clicking on it by using the .click() method
- Providing text to a specific input element by calling the .send_keys() method
- Reading the text of an element by using element.text
- Checking if an element is displayed on the page by calling .is_displayed() on it
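
The four actions listed above can be combined in a few lines. The sketch below is not tied to any particular page: `submit_query` is a hypothetical helper, and the locators in the usage comment are placeholders.

```python
def submit_query(field, button, query):
    """Type `query` into `field`, then click `button` if it is visible.

    Both arguments are expected to be Selenium WebElement objects.
    Returns the button's visible text, or None if it was not clicked.
    """
    field.clear()            # remove any text already in the input
    field.send_keys(query)   # type the new text
    if not button.is_displayed():
        return None
    label = button.text      # read the text first: the click may navigate away
    button.click()
    return label


# Example usage (placeholder locators, assuming `driver` and `By` are available):
# field = driver.find_element(By.NAME, "q")
# button = driver.find_element(By.CSS_SELECTOR, "button[type='submit']")
# submit_query(field, button, "selenium")
```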

## An example of Selenium in action
Now that we have our project set up, we can finally get to scraping.

### Logging in

We are going to showcase the power of Selenium by logging in to our Reddit account and scraping the previously presented data. Let’s start by making Selenium click on the login button at the top of the page. After inspecting the page’s HTML, we can see that the login button’s XPath is `//*[@id="SHORTCUT_FOCUSABLE_DIV"]/div[1]/header/div/div[2]/div/div[1]/a[1]`.

In [None]:
# Logging in
from selenium.webdriver.common.by import By

login_button = driver.find_element(By.XPATH, '//*[@id="SHORTCUT_FOCUSABLE_DIV"]/div[1]/header/div/div[2]/div/div[1]/a[1]')
login_button.click()

This will open up the login modal where we can see the user and password inputs we have to fill up. Let’s continue with the following lines:

In [None]:
import time

# switch to the login frame
login_frame = driver.find_element(By.XPATH, '/html/body/div/main/div[1]')
driver.switch_to.frame(login_frame)

# enter the user name (put your own between the quotes)
username = driver.find_element(By.XPATH, '//*[@id="loginUsername"]')
username.send_keys('')
time.sleep(3)

# enter the password (put your own between the quotes)
password = driver.find_element(By.XPATH, '//*[@id="loginPassword"]')
password.send_keys('')
time.sleep(3)

# click the submit button
submit = driver.find_element(By.XPATH, '/html/body/div/main/div[1]/div/div/form/fieldset[4]/button')
submit.click()

If we inspect the `modal` element, we can see that its container is an `iframe`. This is why we have to switch to that frame in the first part of the code: selecting the inputs without doing so results in an error.

Next, we get the input elements and provide them with the proper credentials before hitting the submit button. This will bring us back to the /r/learnprogramming page, but now we are logged in and ready to upvote!
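
The fixed `time.sleep(3)` pauses work, but they either waste time or are too short on a slow connection. A possible alternative is Selenium's explicit waits, which poll until an element is ready. The sketch below reuses the locators from the login code above. The helper names and the environment variable names `REDDIT_USERNAME` and `REDDIT_PASSWORD` are assumptions made for illustration, chosen so that credentials are not hard-coded in the script.

```python
import os


def read_credentials(env=None):
    """Return (username, password) from the environment.

    The variable names REDDIT_USERNAME and REDDIT_PASSWORD are an
    assumption of this sketch, not something Reddit or Selenium defines.
    """
    env = os.environ if env is None else env
    try:
        return env["REDDIT_USERNAME"], env["REDDIT_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"Set the {missing.args[0]} environment variable") from None


def log_in(driver, username, password, timeout=10):
    """Fill in the login form, waiting for each part instead of sleeping."""
    # Imported here so read_credentials() can be used without Selenium installed
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    # Wait until the login frame exists, then switch into it in one step
    wait.until(EC.frame_to_be_available_and_switch_to_it(
        (By.XPATH, "/html/body/div/main/div[1]")))
    # Wait for the username input to become visible before typing
    wait.until(EC.visibility_of_element_located(
        (By.ID, "loginUsername"))).send_keys(username)
    driver.find_element(By.ID, "loginPassword").send_keys(password)
    driver.find_element(
        By.XPATH, "/html/body/div/main/div[1]/div/div/form/fieldset[4]/button"
    ).click()
    # Return to the main document so later lookups see the subreddit page
    driver.switch_to.default_content()


# Example usage: log_in(driver, *read_credentials())
```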

## Taking a screenshot

Taking a screenshot using Selenium and Python is pretty easy. All you have to do is add the following command anywhere after declaring the web driver.

In [None]:
driver.save_screenshot('screenshot.png')
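
A fixed file name is overwritten on every run. If we want to keep a history of screenshots, one option is to put a timestamp in the name. This is a small sketch with hypothetical helper names. It relies on `save_screenshot()` returning `False` when the file cannot be written.

```python
from datetime import datetime
from pathlib import Path


def screenshot_path(directory, prefix="screenshot", when=None):
    """Build a timestamped file name such as screenshot_20240102_030405.png."""
    when = when or datetime.now()
    return Path(directory) / f"{prefix}_{when:%Y%m%d_%H%M%S}.png"


def take_screenshot(driver, directory="."):
    """Save a screenshot into `directory` and return the path that was written."""
    path = screenshot_path(directory)
    path.parent.mkdir(parents=True, exist_ok=True)
    if not driver.save_screenshot(str(path)):
        raise OSError(f"Could not write {path}")
    return path


# Example usage: take_screenshot(driver, "screenshots")
```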

It’s useful to know that you can set the Google Chrome window size, and therefore the size of the screenshot, by adding the following lines of code and passing `options` when you create the driver:

In [None]:
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--window-size=1920,1080")

## Extracting data

For every post, we need to get the title, the author, and the number of upvotes. Let’s start, with the help of the BeautifulSoup and Pandas packages, by creating three empty lists, one for each type of information we need.

In [None]:
titles = []
upvotes=[]
authors = []

We are going to use BeautifulSoup to parse the HTML document by writing the following lines:

In [None]:
content = driver.page_source
soup = BeautifulSoup(content, features="html.parser")

After successfully inspecting the HTML document and choosing the right selectors, we are now going to fetch the titles, upvotes, and authors and assign them to the right array:

In [None]:
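# `div` must hold one element per post. The original notebook never defines it,
# so the class name below is a placeholder: replace it after inspecting the page.
div = soup.find_all('div', attrs={'class': 'POST_CONTAINER_CLASS'})
for i in div:
    title = i.find('h3', attrs={'class':'_eYtD2XCVieq6emjKBH3m'})
    upvote = i.find('div', attrs={'class': '_1E9mcoVn4MYnuBQSVDt1gC'})
    author = i.find('a', attrs={'class': '_23wugcdiaj44hdfugIAlnX'})
    # find() returns None when nothing matches, so skip incomplete posts
    if title is None or upvote is None or author is None:
        continue
    titles.append(title.text)
    upvotes.append(upvote.text)
    authors.append(author.text)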
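
Reddit's generated class names change over time, so it helps to test the extraction logic on a small piece of static HTML first. The sketch below uses made-up markup and hypothetical class names (`post`, `title`, `votes`, `author`) purely to show how `find_all()` and `find()` fit together, and why a `None` check is needed.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the second post has no vote count
SAMPLE_HTML = """
<div class="post">
  <h3 class="title">First post</h3>
  <div class="votes">120</div>
  <a class="author">u/alice</a>
</div>
<div class="post">
  <h3 class="title">Second post</h3>
  <a class="author">u/bob</a>
</div>
"""


def extract_posts(html):
    """Return a list of (title, author, upvotes) tuples for complete posts."""
    soup = BeautifulSoup(html, features="html.parser")
    rows = []
    for post in soup.find_all("div", attrs={"class": "post"}):
        title = post.find("h3", attrs={"class": "title"})
        votes = post.find("div", attrs={"class": "votes"})
        author = post.find("a", attrs={"class": "author"})
        # find() returns None when nothing matches, so skip incomplete posts
        if title is None or votes is None or author is None:
            continue
        rows.append((title.text.strip(), author.text.strip(), votes.text.strip()))
    return rows


rows = extract_posts(SAMPLE_HTML)
print(rows)  # [('First post', 'u/alice', '120')]
```

Once the selectors work on a saved snippet, the same function can be pointed at `driver.page_source` with the real class names.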

Finally, we will store the information in a CSV file using the Pandas package.

In [None]:
import pandas as pd

# One row per post, one column per field
df = pd.DataFrame({'Post title': titles, 'Author': authors, 'Number of upvotes': upvotes})
df.to_csv('posts.csv', index=False, encoding='utf-8')
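
The upvote counts are scraped as text, so a value may look like `1.2k` rather than a number. If we want a numeric column in the CSV, a small conversion step helps. The sketch below assumes that counts are abbreviated with a `k` or `m` suffix and that anything else non-numeric means the score is not shown. The helper name `parse_upvotes` is hypothetical.

```python
def parse_upvotes(text):
    """Convert scraped upvote text such as '1.2k' into an integer.

    Returns None when the text is not a number (for example, when the
    score is hidden and a placeholder word is shown instead).
    """
    cleaned = text.strip().lower().replace(",", "")
    multiplier = 1
    if cleaned.endswith("k"):
        multiplier, cleaned = 1_000, cleaned[:-1]
    elif cleaned.endswith("m"):
        multiplier, cleaned = 1_000_000, cleaned[:-1]
    try:
        # round() avoids off-by-one results from floating-point multiplication
        return round(float(cleaned) * multiplier)
    except (ValueError, OverflowError):
        return None


print(parse_upvotes("987"), parse_upvotes("1.2k"), parse_upvotes("Vote"))
```

Applying it before building the DataFrame, for example `upvotes = [parse_upvotes(u) for u in upvotes]`, lets us sort or filter posts by score later.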

**Bonus tip**: Sometimes, we need more data than the website provides on the first load. Most of the time, more data is fetched when the user scrolls down. If you need to scroll down to get more data, you can use the .execute_script() method like this:

In [None]:
scrollDown = "window.scrollBy(0,2000);"
driver.execute_script(scrollDown)
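
A single scroll loads only one more batch of posts. To keep loading until nothing new arrives, we can compare the page height before and after each scroll. This is a sketch under stated assumptions: `scroll_until_stable` is a hypothetical helper, and the pause length and round limit are arbitrary defaults that may need tuning.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=10):
    """Scroll to the bottom repeatedly until the page stops growing.

    Returns the final page height in pixels. `max_rounds` caps the loop
    so that an endless feed cannot keep the script running forever.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to fetch and render new posts
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return last_height


# Example usage: scroll_until_stable(driver), then read driver.page_source
```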

