<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Web Scraping using Selenium

---


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Installation" data-toc-modified-id="Installation-1">Installation</a></span></li><li><span><a href="#First-example:-Scroll-down!" data-toc-modified-id="First-example:-Scroll-down!-2">First example: Scroll down!</a></span></li><li><span><a href="#Second-example:-Click-for-more!" data-toc-modified-id="Second-example:-Click-for-more!-3">Second example: Click for more!</a></span></li></ul></div>

## Installation

Selenium needs two things installed before you can use it:

1. Install a webdriver. This is the program Selenium uses to control the web browser.

   Go to: https://sites.google.com/a/chromium.org/chromedriver/downloads

   Download the ChromeDriver build that matches your installed Chrome version and your operating system. The examples below expect the executable to be saved as `./chromedriver`, next to this notebook.

2. Install Selenium by running `pip install selenium` in the terminal.

In [3]:
#!pip install selenium
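
Before launching a browser, it can help to confirm that the driver is where the later cells expect it. The sketch below is one way to do that using only the standard library. `locate_chromedriver` is a hypothetical helper name, and it assumes the executable is called `chromedriver` (on Windows the file is usually `chromedriver.exe`, so the path would need adjusting).

```python
import os
import shutil


def locate_chromedriver(local_path="./chromedriver"):
    """Return a usable chromedriver path, or None if none is found.

    Hypothetical helper: checks the local path first, then falls back
    to searching the directories on PATH.
    """
    # Prefer a driver saved next to the notebook, if it is executable.
    if os.path.isfile(local_path) and os.access(local_path, os.X_OK):
        return os.path.abspath(local_path)
    # Otherwise look for a 'chromedriver' executable on PATH.
    return shutil.which("chromedriver")


# Prints the resolved path, or None if no driver could be found.
print(locate_chromedriver())
```

If this prints `None`, the download step above has probably not been completed, or the file is missing its executable permission.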

In [4]:
import numpy as np
import pandas as pd
from selenium import webdriver
from time import time, sleep

## First example: Scroll down!

In [9]:
driver = webdriver.Chrome(executable_path='./chromedriver')
driver.get('https://www.next.co.uk/shop/gender-men/feat-newin')

# Keep scrolling to the bottom for about 5 seconds so the page
# has a chance to load more products before we read it.
start = time()
while time() - start < 5:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(0.5)  # pause between scrolls instead of busy-looping

# Collect every element with class 'Info'; the names and prices
# are extracted from these elements below.
item_tags = driver.find_elements_by_class_name('Info')

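item_names = []
item_prices = []
for item in item_tags:
    try:
        # Read both fields before appending so the two lists stay the same length.
        name = item.find_element_by_class_name('Title').text
        price = item.find_element_by_class_name('Price').text
    except Exception:
        # Skip elements that are missing a title or a price.
        continue
    item_names.append(name)
    item_prices.append(price)
items = pd.DataFrame({'Item': item_names,
                      'Price': item_prices})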
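
Scrolling for a fixed 5 seconds can either waste time or stop too early, depending on how fast the page loads. One alternative, sketched here as an assumption rather than what this notebook does, is to keep scrolling until the page height stops changing. `scroll_until_stable` is a hypothetical helper; it relies only on `driver.execute_script`, which the cell above already uses.

```python
from time import sleep


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed (at most max_rounds).
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give newly requested content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was added after this scroll, so stop.
            return rounds
        last_height = new_height
    return max_rounds


# Usage with the driver from the cell above (not run here):
# scroll_until_stable(driver, pause=1.0)
```

The `max_rounds` cap is there so that a page which never stops growing cannot keep the loop running forever.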

In [10]:
items.head()

|   | Item | Price |
|---|------|-------|
| 0 | Black Oxford Brogue Shoes | £35 |
| 1 | Nike Air Max Excee Trainers | £95 |
| 2 | Mid Blue Loose Fit Jeans With Stretch | £25 |
| 3 | Superdry Lightweight Leather Track Jacket | £200 |
| 4 | Black Bright Rainbow Waistband Hipsters Eight ... | £40 |


In [11]:
items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 2 columns):
Item     48 non-null object
Price    48 non-null object
dtypes: object(2)
memory usage: 848.0+ bytes


In [12]:
items.duplicated().sum()

0

## Second example: Click for more!

In [18]:
driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.edie.net/news/')

sleep(2)  # give the page time to load

# Scroll to the bottom and click the 'read-more' button ten times,
# waiting 3 seconds after each click for the extra articles to load.
for x in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    button = driver.find_element_by_class_name('read-more')
    button.click()
    sleep(3)

article_tags = driver.find_elements_by_class_name('story')
article_titles = []
article_urls = []

# Record NaN for a missing title or link so that both lists keep
# one entry per article and stay the same length.
for article in article_tags:
    try:
        article_titles.append(article.find_element_by_tag_name('h2').text)
    except Exception:
        article_titles.append(np.nan)
    try:
        article_urls.append(article.find_element_by_tag_name('a').get_attribute('href'))
    except Exception:
        article_urls.append(np.nan)
articles = pd.DataFrame({'Title': article_titles,
                         'URL': article_urls})
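
The click loop at the top of this cell assumes the 'read-more' button is present on all ten passes; `find_element_by_class_name` raises an exception if it is not. A more forgiving variant, sketched below under that assumption, uses the plural `find_elements_...` call, which returns an empty list when nothing matches, and stops as soon as the button disappears. `click_while_present` is a hypothetical helper that takes the lookup as a callable, so it does not depend on any particular Selenium version.

```python
from time import sleep


def click_while_present(find_buttons, max_clicks=10, pause=3.0):
    """Click the first element returned by find_buttons() until none is left.

    find_buttons: a zero-argument callable returning a list of clickable elements.
    Returns the number of clicks performed (at most max_clicks).
    """
    clicks = 0
    for _ in range(max_clicks):
        buttons = find_buttons()
        if not buttons:
            # No button left on the page, so there is nothing more to load.
            break
        buttons[0].click()
        clicks += 1
        sleep(pause)  # wait for the extra content to load
    return clicks


# Usage with the driver from this cell (not run here):
# click_while_present(lambda: driver.find_elements_by_class_name('read-more'))
```

This version does not scroll before each click; if the site only responds to clicks when the button is in view, the scroll call from the original loop would need to be added to the lambda.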

In [19]:
articles.head()

|   | Title | URL |
|---|-------|-----|
| 0 | Plastics Week: edie launches content campaign ... | |
| 1 | Green groups slam Government proposals to cut ... | https://www.edie.net/news/11/Green-groups-slam... |
| 2 | Report: One in five UK businesses not motivate... | https://www.edie.net/news/5/Report--One-in-fiv... |
| 3 | EU unveils €100bn fund for 'just' low-carbon t... | https://www.edie.net/news/11/EU-unveils-EUR100... |
| 4 | edie's next masterclass: how corporations are ... | https://www.edie.net/news/16/edie-s-next-maste... |


In [15]:
articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108 entries, 0 to 107
Data columns (total 2 columns):
Title    108 non-null object
URL      107 non-null object
dtypes: object(2)
memory usage: 1.8+ KB


In [16]:
articles.duplicated().sum()

0