# Selenium:

## Learning Goals:

- Able to install and set up Selenium
- Able to log in to a website
- Able to navigate through pages and pop-ups
- Able to write a scraper that is more "human-like"
- Able to choose the appropriate `find_element(s)_by...` method
- Able to acquire the desired data

---

## First, head over to [this page](https://chromedriver.chromium.org/downloads) and locate the chromedriver that matches your Chrome version.

**How to Find Your Internet Browser Version Number - Google Chrome.**

1) Click on the Menu icon in the upper right corner of the screen. 

2) Click on Help, and then About Google Chrome. 

3) Your Chrome browser version number can be found here.

## Next, download the appropriate driver that matches your version of Chrome

- After you have downloaded the driver, press `command` + `spacebar`
- In the Spotlight search you just opened, type `/usr/local/bin/` and open that folder
- Next, in a separate Finder window (`command` + `n`), navigate to where you downloaded the `chromedriver`
- Finally, move the `chromedriver` from wherever you downloaded it into `/usr/local/bin/`

*Technically, you can put the driver anywhere, but most tutorials I have read say to put it in `/usr/local/bin/`.*

After a bit of research, the reason appears to be that `/usr/local/bin/` is normally on your `PATH`, so Selenium can find `chromedriver` without you explicitly stating its path when you instantiate your driver 😎
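A quick way to check whether the driver is discoverable is to ask Python where it would be found on `PATH`. This is a minimal sketch using only the standard library; the helper name `find_chromedriver` is made up for illustration.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver if it is on PATH, otherwise None."""
    return shutil.which(name)


path = find_chromedriver()
if path is None:
    print("chromedriver is not on PATH; move it into a PATH folder or pass its location explicitly")
else:
    print("chromedriver found at:", path)
```

If this prints a path, `webdriver.Chrome()` with no arguments should be able to locate the driver.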

## Install Selenium if you have not already done so:

If you can, please use the conda install. I would suggest running it in a terminal rather than in Jupyter.

In [82]:
#!conda install -c conda-forge selenium
#or
#!pip install selenium

In [1]:
import re
import os
import time
import random
import requests
import numpy as np
import pandas as pd
from os import system
from math import floor
from copy import deepcopy
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [2]:
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_colwidth', 200)

---

**Next, I always like to label my driver with a bold title cell**


- I find it helps when we need to re-instantiate our driver and for general organization
- Also, when we copy and paste this code from a notebook to a .py file, we usually only need one driver

## DRIVER HERE:

In [3]:
driver = webdriver.Chrome()

---

### Note: Headless Browsers

**Headless Browser**
A headless browser is a web browser without a graphical user interface (GUI). It is controlled programmatically, which makes it well suited to automation, testing, and similar tasks.

**Why use a headless browser?**
A headless browser is not much use for browsing the web yourself, but it is a great fit for automated tasks and tests.

**Advantages of Headless Browsers**

Some of the advantages are as follows:

- Headless browsers are typically faster than browsers with a visible window.
    - The main saving is that no browser window is started and nothing is drawn to the screen. The page's HTML, CSS, and JavaScript are still loaded and processed.
- Speedups of roughly 2x to 15x are often quoted, but the actual gain depends on the page and the task.

*More info on headless browsers here:* https://stackoverflow.com/questions/53083952/difference-of-headless-browsers-for-automation
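As a rough sketch of how a headless Chrome driver could be started: the flags below are standard Chrome command-line switches, while the helper names `headless_flags` and `make_headless_driver` and the default window size are assumptions made for this example. Selenium is imported inside the function so the flag helper can be inspected without a browser.

```python
def headless_flags(width=1920, height=1080):
    """Chrome command-line flags for running without a visible window."""
    return ["--headless", f"--window-size={width},{height}"]


def make_headless_driver():
    """Start Chrome headlessly (needs selenium and chromedriver installed)."""
    # Imported here so headless_flags() works even without Selenium available.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for flag in headless_flags():
        options.add_argument(flag)
    return webdriver.Chrome(options=options)


print(headless_flags())
```

Calling `driver = make_headless_driver()` would then stand in for `webdriver.Chrome()`; the rest of the scraping code stays the same.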

## Time to scrape!

<img src = "https://media1.tenor.com/images/3fd84ba4b54f8d299f7732e63cdb3c00/tenor.gif?itemid=11903546" />

### Visiting a webpage

In [4]:
# Visit the website of your choice:

driver.get('https://www.espn.com')

#### Methods for finding a single element 

    This will return the FIRST instance of your desired "element"

* find_element_by_id
* find_element_by_name
* find_element_by_xpath  
* find_element_by_link_text
* find_element_by_partial_link_text
* find_element_by_tag_name
* find_element_by_class_name
* find_element_by_css_selector

---

#### Methods for finding multiple elements

    This will return a list of ALL instances of your desired "element"

* find_elements_by_name
* find_elements_by_xpath
* find_elements_by_link_text
* find_elements_by_partial_link_text
* find_elements_by_tag_name
* find_elements_by_class_name
* find_elements_by_css_selector

From the [Selenium Python Docs](https://selenium-python.readthedocs.io/locating-elements.html "Selenium Docs") 
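A version note: on Selenium 4.3 and later, the `find_element_by_*` and `find_elements_by_*` helpers listed above are no longer available. The same lookups go through `find_element(by, value)` and `find_elements(by, value)`, which also exist in Selenium 3. The sketch below uses the raw strategy strings so it can be read without a browser; the helper names `first_text` and `all_texts` are invented for this example.

```python
# Strategy strings behind selenium.webdriver.common.by.By:
#   By.ID == "id"                   By.NAME == "name"
#   By.XPATH == "xpath"             By.LINK_TEXT == "link text"
#   By.TAG_NAME == "tag name"       By.PARTIAL_LINK_TEXT == "partial link text"
#   By.CLASS_NAME == "class name"   By.CSS_SELECTOR == "css selector"
CSS = "css selector"


def first_text(driver, selector):
    """Text of the FIRST match (the job of find_element_by_css_selector)."""
    return driver.find_element(CSS, selector).text


def all_texts(driver, selector):
    """Texts of ALL matches (the job of find_elements_by_css_selector)."""
    return [element.text for element in driver.find_elements(CSS, selector)]
```

In real code you would normally import `By` and write `driver.find_element(By.CSS_SELECTOR, 'h1')` directly; the cells in this notebook keep the older spelling they were originally run with.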

### Selecting the FIRST instance of an "element"

First, we'll check out `.find_element_by_css_selector()`

In [5]:
driver.find_element_by_css_selector('h1')

<selenium.webdriver.remote.webelement.WebElement (session="2bc1033724426f806949cc33b1067c48", element="f24c0727-455f-4570-a2ad-4f74195b4bdf")>

In [6]:
driver.find_element_by_css_selector('h1').text

'ESPN'

### Selecting ALL instances of your desired "element"

In [7]:
listy = driver.find_elements_by_css_selector('h1')

In [8]:
for x in listy[:15]:
    if len(x.text) > 3:
        print(x.text)

ESPN
NFL draft mailbag: Is Mac Jones really the Niners' pick, and 10 other questions for Kiper and McShay
Riddick is taken aback by Lawrence's commitment comments
Doncic drills incredible buzzer-beater after Grizzlies' late collapse
Luka Legend: Mavs win on Doncic's 'lucky' shot
WNBA DRAFT
WNBA mock draft: Sparks-Wings trade shakes up first round
Charli Collier, projected No. 1 pick in draft, is 'not even where I'm supposed to be yet'
CUSTOMIZE ESPN
Yankees
Giants
Knicks
Mets


---

#### Closing the driver:

If you just close the driver's browser window, the Google Chrome instance will still appear open in your Mac's dock. `driver.quit()` ends the whole session: it closes the browser window and shuts down the Chrome instance behind it:

In [9]:
driver.quit()

#### Timing

Sometimes we will need to wait for the page to load. Other times, we may want to have our scraper act more like a human, in terms of "click rate."

Two possible ways to make this happen are by using `time.sleep()` or `WebDriverWait()`

If we just want to mimic the behavior of a human, we can use `time.sleep()`:

In [10]:
# Using a single "wait" time:

time.sleep(2)

In [11]:
# Using a randomized time:

sequence = [x/10 for x in range(8, 14)]
print(sequence)

time.sleep(random.choice(sequence))

[0.8, 0.9, 1.0, 1.1, 1.2, 1.3]
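The randomized pause can also be wrapped in a small helper so it is easy to reuse between clicks. This is a sketch, not part of the original notebook; the name `human_pause` is made up, and `random.uniform` is swapped in for `random.choice` so the delay is not limited to six fixed values.

```python
import random
import time


def human_pause(low=0.8, high=1.3):
    """Sleep for a random number of seconds between low and high; return the delay."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay


# Short bounds here only so the demo finishes quickly.
delay = human_pause(0.01, 0.02)
print(f"paused for {delay:.3f} s")
```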


If we explicitly want to wait for our page to load, we can use `WebDriverWait()`:

In [None]:
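# `driver` must be a live session (re-instantiate it if you ran driver.quit() above)
my_url = 'https://www.espn.com/'  # example value; replace with the URL you expect to land on

wait = WebDriverWait(driver, 5)  # give up after 5 seconds

try:
    page_loaded = wait.until(lambda d: d.current_url == my_url)
    print('The page loaded correctly')
except TimeoutException:
    print("Loading timeout expired")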
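Conceptually, `WebDriverWait(driver, 5).until(condition)` keeps calling the condition (every 0.5 seconds by default) until it returns something truthy, and raises `TimeoutException` if the time runs out. The sketch below imitates that polling loop with the standard library only. It is a simplified illustration, not Selenium's implementation, and `wait_until` is a hypothetical name.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call condition() every `poll` seconds until it is truthy.

    Raises TimeoutError if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Toy condition that becomes true on the third check.
calls = {"n": 0}


def ready():
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_until(ready, timeout=1.0, poll=0.01))
```

Unlike `time.sleep()`, this returns as soon as the condition holds, so a fast page does not pay for the full timeout.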

#### Wait a second... what is that `xpath` thing?

XPath (XML Path Language) is a syntax for locating elements in a document by describing a path through its structure. Selenium uses it to find elements in a page's HTML DOM. The basic format of an XPath expression is shown in the screenshot below.

<img src='https://www.guru99.com/images/3-2016/032816_0758_XPathinSele1.png' >

An XPath expression describes where an element sits on the web page. The standard syntax for creating an XPath is:

`Xpath=//tagname[@attribute='value']`

- `//` == Select matching nodes anywhere in the document, regardless of depth.
- `tagname` == Tag name of the node to match.
- `@` == Select an attribute.
- `attribute` == Name of the attribute.
- `value` == Value the attribute must have.
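To see the `//tagname[@attribute='value']` pattern in action without opening a browser, here is a small sketch using the standard library's `xml.etree.ElementTree`, which supports a subset of XPath. The tiny login form is invented for the example; with Selenium you would pass the same kind of expression to the driver instead.

```python
import xml.etree.ElementTree as ET

# A made-up page fragment, just for illustration.
page = ET.fromstring(
    "<html><body>"
    "<form>"
    "<input name='username' type='text' />"
    "<input name='password' type='password' />"
    "</form>"
    "</body></html>"
)

# ElementTree wants a leading "." so the search starts from the parsed root.
match = page.find(".//input[@name='password']")
print(match.tag, match.attrib["type"])
```

The expression reads: anywhere below the root (`//`), find an `input` tag whose `name` attribute equals `password`.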

<img src='https://media1.giphy.com/media/XBpEStoQ5rftPFA8rh/giphy.gif?cid=790b7611dbcd651cd785fb8382888f7b41666d5c8695755b&rid=giphy.gif'>

**We can perform the next operations a few different ways:**

Similar to above, we could use the `xpath`

Or... based on visual knowledge of inspecting html/css elements, we can see the css selector `input` and we could assume that the only 2 possible inputs are Username and Password

---

With that knowledge, we can define both variables in one line of code
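The cell that went with this step is not part of the notebook text, so what follows is only a sketch of the idea, assuming the login page really has exactly two `input` elements, in the order username then password. The helper name `login_fields` is hypothetical, and `"css selector"` is the strategy string behind `By.CSS_SELECTOR`.

```python
def login_fields(driver):
    """Return (username_input, password_input).

    Assumes the page has exactly two <input> elements, username first.
    """
    inputs = driver.find_elements("css selector", "input")
    if len(inputs) != 2:
        raise ValueError(f"expected 2 inputs, found {len(inputs)}")
    username, password = inputs  # both variables defined in one line
    return username, password
```

With a live driver on a login page, `username, password = login_fields(driver)` could be followed by `username.send_keys(...)` and `password.send_keys(...)`. The length check is there because the "only two inputs" assumption breaks as soon as the page adds a search box or a hidden field.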


# Modal buttons and scrolling:

In [13]:
driver = webdriver.Chrome()

In [14]:
# These websites have modal popups:

driver.get('https://www.nike.com')

# Other options:
# https://www.carbon38.com
# https://www.meundies.com

The following cell is an example of how you could write functions to scroll down the page (for dynamic loading) and for loading more content with "clicks"

In [15]:
# Example: Scroll down (with a test for a modal)

def scroll_down():
    for i in range(1, 10):
        try:
            # Dismiss the modal if it has appeared
            modal_button = driver.find_element_by_class_name("button2")
            webdriver.ActionChains(driver).move_to_element(modal_button).click(modal_button).perform()
            # modal_button.click() also works
        except Exception:
            # No modal found (or it was not clickable): wait briefly and carry on
            time.sleep(.5)

        # Scroll to the bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)

        
# Example: Load more content
# Code snippet for context purposes only. We will not run this function:

def get_more():
    for i in range(1, 5):
        try:
            next_b = driver.find_element_by_xpath("//*[contains(text(), 'Load next Politics story')]")
            webdriver.ActionChains(driver).move_to_element(next_b).click(next_b).perform()
            time.sleep(.5)
        except Exception:
            print("Page #" + str(i) + " has failed to load")

In [16]:
# Run this cell and watch the page scrollllllll

scroll_down()
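`scroll_down()` scrolls a fixed number of times. A common variation is to keep scrolling until the page height stops growing, which suggests that no more content is being loaded. The sketch below only relies on `driver.execute_script`, the same call used above; the function name `scroll_to_end` and its defaults are assumptions for this example.

```python
import time


def scroll_to_end(driver, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops changing.

    Returns the number of scrolls performed (at most max_rounds).
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give dynamically loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds
```

The `max_rounds` cap matters on pages with effectively endless feeds, where the height never settles.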

In [17]:
driver.quit()

## When to use BeautifulSoup vs. Selenium?

<img src='https://media.giphy.com/media/xTiN0IuPQxRqzxodZm/giphy.gif' width = 400>

<img src='https://media2.giphy.com/media/3o7TKAdOad9Y3eSMZG/giphy.gif?cid=790b761168b43f2be748800602251dce3cad91fcb4c972f9&rid=giphy.gif' width = 400>

<img src = "https://media1.giphy.com/media/8VLgtJqaxIlhu/giphy.gif?cid=790b7611df175494e219b99894f7e717b3ea7bfbf806f9c4&rid=giphy.gif" />

**Just kidding!**

Everything depends on the website and your data goals.

In general:
- If the data only appears after interaction (clicks, scrolling, logging in), go for Selenium.
- Selenium also suits more complex, JavaScript-heavy pages.

---

- If the data is already in the HTML structure (more static pages), Soup is the more lightweight tool.
- Soup gives you more control when navigating the HTML tree.

In [18]:
html = requests.get('https://www.skysports.com/premier-league-table')
bs = BeautifulSoup(html.content, 'lxml')
table = bs.table

# table = bs.find(lambda tag: tag.name=='table' ) 
# rows = table.findAll(lambda tag: tag.name=='tr')
table

<table class="standing-table__table callfn" data-fn="table-sorter-lite" data-lite="true">
<caption class="standing-table__caption">
    Premier League 2020/21  </caption>
<colgroup class="standing-table__cols">
<col class="standing-table__cell standing-table__col1"/>
<col class="standing-table__cell standing-table__col2 standing-table__cell--name"/>
<col class="standing-table__cell standing-table__col3"/>
<col class="standing-table__cell standing-table__col4 is-hidden--bp35"/>
<col class="standing-table__cell standing-table__col5 is-hidden--bp35"/>
<col class="standing-table__cell standing-table__col6 is-hidden--bp35"/>
<col class="standing-table__cell standing-table__col7 is-hidden--bp35"/>
<col class="standing-table__cell standing-table__col8 is-hidden--bp35"/>
<col class="standing-table__cell standing-table__col9"/>
<col class="standing-table__cell standing-table__col10"/>
<col class="standing-table__cell standing-table__col11 is-hidden--bp15 is-hidden--bp35"/>
</colgroup>
<thead>
<

In [19]:
table_rows = table.find_all('tr')

for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

[]
['1', '\nManchester City\n', '32', '23', '5', '4', '67', '23', '44', '74', '\n\n       \n']
['2', '\nManchester United\n', '31', '18', '9', '4', '61', '34', '27', '63', '\n\n       \n']
['3', '\nLeicester City\n', '31', '17', '5', '9', '55', '37', '18', '56', '\n\n       \n']
['4', '\nWest Ham United\n', '31', '16', '7', '8', '51', '39', '12', '55', '\n\n       \n']
['5', '\nChelsea\n', '31', '15', '9', '7', '50', '31', '19', '54', '\n\n       \n']
['6', '\nLiverpool\n', '31', '15', '7', '9', '53', '37', '16', '52', '\n\n       \n']
['7', '\nTottenham Hotspur\n', '31', '14', '7', '10', '52', '35', '17', '49', '\n\n       \n']
['8', '\nEverton\n', '30', '14', '6', '10', '41', '38', '3', '48', '\n\n       \n']
['9', '\nArsenal\n', '31', '13', '6', '12', '43', '35', '8', '45', '\n\n       \n']
['10', '\nLeeds United\n', '31', '14', '3', '14', '49', '49', '0', '45', '\n\n       \n']
['11', '\nAston Villa\n', '30', '13', '5', '12', '43', '33', '10', '44', '\n\n       \n']
['12', '\nWolve

In [20]:
html = requests.get('https://www.skysports.com/premier-league-table')
bs = BeautifulSoup(html.content, 'lxml')
table = bs.table

#If you know there is more than one table, you can edit the code to include the proper index:
# table = bs.find_all('table')[0] 

df = pd.read_html(str(table), index_col='Team')
df = df[0].dropna(axis=0, thresh=4)
df

Unnamed: 0_level_0,#,Pl,W,D,L,F,A,GD,Pts,Last 6
Team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Manchester City,1,32,23,5,4,67,23,44,74,
Manchester United,2,31,18,9,4,61,34,27,63,
Leicester City,3,31,17,5,9,55,37,18,56,
West Ham United,4,31,16,7,8,51,39,12,55,
Chelsea,5,31,15,9,7,50,31,19,54,
Liverpool,6,31,15,7,9,53,37,16,52,
Tottenham Hotspur,7,31,14,7,10,52,35,17,49,
Everton,8,30,14,6,10,41,38,3,48,
Arsenal,9,31,13,6,12,43,35,8,45,
Leeds United,10,31,14,3,14,49,49,0,45,


#### Adjusting the header and index:

- Caveat: this uses pandas, not Selenium or Soup

If there is more than one table, pandas reads the html as a list of tables:

In [21]:
df2 = pd.read_html('https://www.sportsmole.co.uk/football/premier-league/2018-19/')

df2

[      0                               1    2    3    4               5  \
 0   NaN                            Team    P    W    D               L   
 1     C         Manchester CityMan City   38   32    2               4   
 2     2                       Liverpool   38   30    7               1   
 3     3                         Chelsea   38   21    9               8   
 4     4          Tottenham HotspurSpurs   38   23    2              13   
 5     5                         Arsenal   38   21    7              10   
 6     6        Manchester UnitedMan Utd   38   19    9              10   
 7     7   Wolverhampton WanderersWolves   38   16    9              13   
 8     8                         Everton   38   15    9              14   
 9     9         Leicester CityLeicester   38   15    7              16   
 10   10         West Ham UnitedWest Ham   38   15    7              16   
 11   11                         Watford   38   14    8              16   
 12   12                 

In [22]:
# Let's check out one of our tables:

df2[0]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,,Team,P,W,D,L,F,A,GD,PTS
1,C,Manchester CityMan City,38,32,2,4,95,23,72,98
2,2,Liverpool,38,30,7,1,89,22,67,97
3,3,Chelsea,38,21,9,8,63,39,24,72
4,4,Tottenham HotspurSpurs,38,23,2,13,67,39,28,71
5,5,Arsenal,38,21,7,10,73,51,22,70
6,6,Manchester UnitedMan Utd,38,19,9,10,65,54,11,66
7,7,Wolverhampton WanderersWolves,38,16,9,13,47,46,1,57
8,8,Everton,38,15,9,14,54,46,8,54
9,9,Leicester CityLeicester,38,15,7,16,51,48,3,52


As we can see above, the table's formatting is slightly off...

So we can make adjustments like so:

In [23]:
df2 = pd.read_html('https://www.sportsmole.co.uk/football/premier-league/2018-19/',header=0, index_col=1)

df2[0].columns =  ['final_standings', 'P', 'W', 'D', 'L', 'F', 'A', 'GD', 'PTS']

df2[0]

Unnamed: 0_level_0,final_standings,P,W,D,L,F,A,GD,PTS
Team,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Manchester CityMan City,C,38.0,32.0,2.0,4,95,23,72,98
Liverpool,2,38.0,30.0,7.0,1,89,22,67,97
Chelsea,3,38.0,21.0,9.0,8,63,39,24,72
Tottenham HotspurSpurs,4,38.0,23.0,2.0,13,67,39,28,71
Arsenal,5,38.0,21.0,7.0,10,73,51,22,70
Manchester UnitedMan Utd,6,38.0,19.0,9.0,10,65,54,11,66
Wolverhampton WanderersWolves,7,38.0,16.0,9.0,13,47,46,1,57
Everton,8,38.0,15.0,9.0,14,54,46,8,54
Leicester CityLeicester,9,38.0,15.0,7.0,16,51,48,3,52
West Ham UnitedWest Ham,10,38.0,15.0,7.0,16,52,55,-3,52


---

### An example where formatting is an issue:

In [24]:
html = requests.get('http://www.nfl.com/stats/team')
nfl_soup = BeautifulSoup(html.content, 'lxml')
table = nfl_soup.table

In [25]:
table.prettify()

'<table class="d3-o-table d3-o-table--detailed d3-o-team-stats--detailed d3-o-table--sortable {sortlist: [[0,0]], sortinitialorder: \'asc\'}" data-require="modules/tableSortable">\n <thead>\n  <tr>\n   <th>\n    Team\n   </th>\n   <th scope="col">\n    Att\n   </th>\n   <th scope="col">\n    Cmp\n   </th>\n   <th scope="col">\n    Cmp %\n   </th>\n   <th scope="col">\n    Yds/Att\n   </th>\n   <th scope="col">\n    Pass Yds\n   </th>\n   <th scope="col">\n    TD\n   </th>\n   <th scope="col">\n    INT\n   </th>\n   <th scope="col">\n    Rate\n   </th>\n   <th scope="col">\n    1st\n   </th>\n   <th scope="col">\n    1st%\n   </th>\n   <th scope="col">\n    20+\n   </th>\n   <th scope="col">\n    40+\n   </th>\n   <th scope="col">\n    Lng\n   </th>\n   <th scope="col">\n    Sck\n   </th>\n   <th scope="col">\n    SckY\n   </th>\n  </tr>\n </thead>\n <tbody>\n  <tr>\n   <td scope="row" tabindex="0">\n    <div class="d3-o-club-info">\n     <div class="d3-o-club-logo">\n      <picture>\n 

In [26]:
nfl = pd.read_html('http://www.nfl.com/stats/team')

nfl

[                            Team  Att  Cmp  Cmp %  Yds/Att  Pass Yds  TD  INT  \
 0   Football Team  Football Team  601  389   64.7      6.3      3796  16   16   
 1         Buccaneers  Buccaneers  626  410   65.5      7.6      4776  42   12   
 2             Seahawks  Seahawks  563  388   68.9      7.5      4245  40   13   
 3                   49ers  49ers  570  371   65.1      7.6      4320  25   17   
 4             Chargers  Chargers  627  413   65.9      7.3      4548  31   10   
 5             Steelers  Steelers  656  428   65.2      6.3      4129  35   11   
 6           Cardinals  Cardinals  575  387   67.3      7.1      4102  27   13   
 7                 Eagles  Eagles  598  334   55.9      6.2      3728  22   20   
 8                     Jets  Jets  499  292   58.5      6.2      3115  16   14   
 9                 Giants  Giants  517  321   62.1      6.5      3336  12   11   
 10                Saints  Saints  522  370   70.9      7.6      3945  28    8   
 11            P

In [27]:
# PRO-TIP: if you want to instantiate a new df variable from a previous df or list of dfs, 
# making a copy of the df will save you from a headache

offense = deepcopy(nfl[0])
offense

Unnamed: 0,Team,Att,Cmp,Cmp %,Yds/Att,Pass Yds,TD,INT,Rate,1st,1st%,20+,40+,Lng,Sck,SckY
0,Football Team Football Team,601,389,64.7,6.3,3796,16,16,80.1,184,30.6,40,6,68T,50,331
1,Buccaneers Buccaneers,626,410,65.5,7.6,4776,42,12,102.8,238,38.0,67,12,50,22,150
2,Seahawks Seahawks,563,388,68.9,7.5,4245,40,13,105.0,216,38.4,45,11,62,48,304
3,49ers 49ers,570,371,65.1,7.6,4320,25,17,90.1,217,38.1,55,10,76T,39,287
4,Chargers Chargers,627,413,65.9,7.3,4548,31,10,97.0,226,36.0,54,10,72T,34,219
5,Steelers Steelers,656,428,65.2,6.3,4129,35,11,93.5,206,31.4,48,7,84T,14,126
6,Cardinals Cardinals,575,387,67.3,7.1,4102,27,13,94.1,211,36.7,45,14,80T,29,186
7,Eagles Eagles,598,334,55.9,6.2,3728,22,20,72.9,177,29.6,43,9,81T,65,401
8,Jets Jets,499,292,58.5,6.2,3115,16,14,75.9,146,29.3,39,6,69T,43,319
9,Giants Giants,517,321,62.1,6.5,3336,12,11,79.6,178,34.4,36,5,53,50,310
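The formatting issue here is that every team name shows up twice in the `Team` column ("Buccaneers  Buccaneers"). One possible cleanup is sketched below with a plain string function; `dedupe_name` is a made-up helper, and it only handles names repeated word for word, not cases like "Manchester CityMan City" from the earlier table.

```python
def dedupe_name(name):
    """Collapse 'Buccaneers  Buccaneers' to 'Buccaneers'.

    Strings that are not an exact word-for-word repeat are returned
    with their whitespace normalized but otherwise unchanged.
    """
    words = name.split()
    half = len(words) // 2
    if half > 0 and len(words) % 2 == 0 and words[:half] == words[half:]:
        return " ".join(words[:half])
    return " ".join(words)


print(dedupe_name("Football Team  Football Team"))
print(dedupe_name("49ers  49ers"))
```

On the DataFrame this could be applied with `offense['Team'] = offense['Team'].map(dedupe_name)`.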



### The best example of when Selenium is supreme:

When the page's content is generated by JavaScript after the initial HTML loads

In [33]:
html = requests.get('http://www.tennisabstract.com/cgi-bin/player.cgi?p=RogerFederer')
bs = BeautifulSoup(html.content, 'lxml')
table = bs.table

# table = bs.find(lambda tag: tag.name=='table' ) 
# rows = table.findAll(lambda tag: tag.name=='tr')

In [34]:
bs

<html><head>
<title>Tennis Abstract: Roger Federer ATP Match Results, Splits, and Analysis</title>
<link href="http://www.minorleaguesplits.com/tennisabstract/blue/style.css" rel="stylesheet" type="text/css"/>
<script src="http://www.minorleaguesplits.com/tennisabstract/jquery-1.7.1-min.js" type="text/javascript"></script>
<script src="http://www.minorleaguesplits.com/tennisabstract/jquery.tablesorter.js" type="text/javascript"></script>
<script src="http://www.minorleaguesplits.com/tennisabstract/cgi-bin/frags/RogerFederer.js" type="text/javascript"></script>
<script language="JavaScript">
var currentTime = new Date();
var month = currentTime.getMonth() + 1;
var day = currentTime.getDate();
var year = currentTime.getFullYear().toString();
var mm, dd;
if (month < 10) {mm = '0' + month.toString();}
else {mm = month.toString();}
if (day < 10) {dd = '0' + day.toString();}
else {dd = day.toString();}
var today = year + mm + dd;
var one_day=1000*60*60*24;
var nameparam = 'RogerFederer';





In [35]:
table

<table width="1240px">
<tr>
<td align="left"> </td>
<td align="right">
 </td>
</tr>
</table>
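The table that `requests` gets back is an empty shell because the rows are filled in by JavaScript later. A quick way to test for this before reaching for Selenium is to count the table cells that actually contain text in the raw HTML. The sketch below uses the standard library's `html.parser`; the class and function names are invented for the example, and the two HTML strings are simplified stand-ins for the pages above.

```python
from html.parser import HTMLParser


class CellCounter(HTMLParser):
    """Count <td> cells that contain visible text."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.has_text = False
        self.filled_cells = 0

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.has_text = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.has_text = True

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            if self.has_text:
                self.filled_cells += 1
            self.in_cell = False


def count_filled_cells(html):
    parser = CellCounter()
    parser.feed(html)
    return parser.filled_cells


empty = '<table width="1240px"><tr><td align="left"> </td><td align="right"> </td></tr></table>'
static = "<table><tr><td>1</td><td>Manchester City</td></tr></table>"
print(count_filled_cells(empty), count_filled_cells(static))
```

A count of zero on a page that clearly shows a table in the browser is a strong hint that the data arrives via JavaScript.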

In [36]:
url = "http://www.tennisabstract.com/cgi-bin/player.cgi?p=RogerFederer"
driver = webdriver.Chrome()

In [37]:
driver.get(url)

In [59]:
table = driver.find_element_by_id("recent-results")

In [60]:
table.text

'Date Tournament Surface Rd Rk vRk Score DR A% DF% 1stIn 1st% 2nd% BPSvd Time\n08-Mar-2021 Doha Hard QF 6 42 Nikoloz Basilashvili [GEO] d. (2)Federer 3-6 6-1 7-5 0.87 12.5% 0.0% 68.8% 66.7% 50.0% 7/10 1:50\n08-Mar-2021 Doha Hard R16 6 28 (2)Federer d. Daniel Evans [GBR] 7-6(8) 3-6 7-5 1.16 12.4% 0.0% 67.6% 78.9% 52.9% 3/4 2:24\n20-Jan-2020 Australian Open Hard SF 3 2 (2)Novak Djokovic [SRB] d. (3)Federer 7-6(1) 6-4 6-3 0.76 14.4% 2.9% 65.4% 66.2% 41.7% 7/11 2:18\n20-Jan-2020 Australian Open Hard QF 3 100 (3)Federer d. Tennys Sandgren [USA] 6-3 2-6 2-6 7-6(8) 6-3 0.93 2.9% 1.8% 65.5% 71.4% 52.5% 10/14 3:31\n20-Jan-2020 Australian Open Hard R16 3 67 (3)Federer d. Marton Fucsovics [HUN] 4-6 6-1 6-2 6-2 1.36 4.9% 0.0% 61.2% 76.2% 52.5% 7/9 2:11\n20-Jan-2020 Australian Open Hard R32 3 47 (3)Federer d. John Millman [AUS] 4-6 7-6(2) 6-4 4-6 7-6(8) 1.08 9.4% 3.5% 65.3% 76.6% 50.8% 4/8 4:03\n20-Jan-2020 Australian Open Hard R64 3 41 (3)Federer d. Filip Krajinovic [SRB] 6-1 6-4 6-1 1.86 20.6% 0.

In [61]:
body = table.find_element_by_css_selector('tbody')

In [62]:
# Table rows use the HTML tag 'tr'
rows = body.find_elements_by_css_selector('tr')

In [63]:
[x.text for x in rows]

['08-Mar-2021 Doha Hard QF 6 42 Nikoloz Basilashvili [GEO] d. (2)Federer 3-6 6-1 7-5 0.87 12.5% 0.0% 68.8% 66.7% 50.0% 7/10 1:50',
 '08-Mar-2021 Doha Hard R16 6 28 (2)Federer d. Daniel Evans [GBR] 7-6(8) 3-6 7-5 1.16 12.4% 0.0% 67.6% 78.9% 52.9% 3/4 2:24',
 '20-Jan-2020 Australian Open Hard SF 3 2 (2)Novak Djokovic [SRB] d. (3)Federer 7-6(1) 6-4 6-3 0.76 14.4% 2.9% 65.4% 66.2% 41.7% 7/11 2:18',
 '20-Jan-2020 Australian Open Hard QF 3 100 (3)Federer d. Tennys Sandgren [USA] 6-3 2-6 2-6 7-6(8) 6-3 0.93 2.9% 1.8% 65.5% 71.4% 52.5% 10/14 3:31',
 '20-Jan-2020 Australian Open Hard R16 3 67 (3)Federer d. Marton Fucsovics [HUN] 4-6 6-1 6-2 6-2 1.36 4.9% 0.0% 61.2% 76.2% 52.5% 7/9 2:11',
 '20-Jan-2020 Australian Open Hard R32 3 47 (3)Federer d. John Millman [AUS] 4-6 7-6(2) 6-4 4-6 7-6(8) 1.08 9.4% 3.5% 65.3% 76.6% 50.8% 4/8 4:03',
 '20-Jan-2020 Australian Open Hard R64 3 41 (3)Federer d. Filip Krajinovic [SRB] 6-1 6-4 6-1 1.86 20.6% 0.0% 69.1% 76.6% 61.9% 2/3 1:32',
 '20-Jan-2020 Australian Op

In [64]:
rows[0]

<selenium.webdriver.remote.webelement.WebElement (session="6df1a0ce8ac1cee033cfa3490971f0c6", element="01b1817d-ccd6-4e53-9e3c-1d0ac38faadf")>

In [65]:
# get_attribute('innerHTML') lets us look at the HTML inside the Selenium element
rows[0].get_attribute('innerHTML')

'<td>08-Mar-2021</td><td><a href="http://www.tennisabstract.com/cgi-bin/tourney.cgi?t=2021Doha">Doha</a></td><td>Hard</td><td>QF</td><td align="right">6</td><td align="right">42</td><td><a href="http://www.tennisabstract.com/cgi-bin/player.cgi?p=NikolozBasilashvili">Nikoloz Basilashvili</a> [GEO] d. (2)<b>Federer</b></td><td>3-6 6-1 7-5</td><td align="right">0.87</td><td align="right">12.5%</td><td align="right">0.0%</td><td align="right">68.8%</td><td align="right">66.7%</td><td align="right">50.0%</td><td align="right">7/10</td><td align="right">1:50</td>'

In [66]:
row_data = rows[0].find_elements_by_css_selector('td')

In [67]:
for e in row_data: 
    print(e.text)

08-Mar-2021
Doha
Hard
QF
6
42
Nikoloz Basilashvili [GEO] d. (2)Federer
3-6 6-1 7-5
0.87
12.5%
0.0%
68.8%
66.7%
50.0%
7/10
1:50


In [68]:
data_list = []
for r in rows: 
    row_list = []
    row_data = r.find_elements_by_css_selector('td')
    for d in row_data: 
        row_list.append(d.text)
    data_list.append(row_list)

In [69]:
data_list[10]

['11-Nov-2019',
 'Tour Finals',
 'Hard',
 'RR',
 '3',
 '5',
 '(5)Dominic Thiem [AUT] d. (3)Federer',
 '7-5 7-5',
 '0.97',
 '7.4%',
 '0.0%',
 '69.1%',
 '70.2%',
 '47.6%',
 '2/5',
 '1:40']

In [70]:
len(data_list[10])

16

In [71]:
headers = table.find_element_by_css_selector('thead')
headers.text

'Date Tournament Surface Rd Rk vRk Score DR A% DF% 1stIn 1st% 2nd% BPSvd Time'

In [72]:
columns = headers.text.split(' ')
print(columns)

['Date', 'Tournament', 'Surface', 'Rd', 'Rk', 'vRk', 'Score', 'DR', 'A%', 'DF%', '1stIn', '1st%', '2nd%', 'BPSvd', 'Time']


In [73]:
print('Number of columns:     '+ str(len(columns)))
print()
print('Number of data points: '+ str(len(data_list[0])))

Number of columns:     15

Number of data points: 16


In [74]:
columns = ['Date','Tournament','Surface','Rd','Rk','vRk', 
           'Opponent','Score','DR','A%','DF%','1stIn',
           '1st%','2nd%','BPSvd','Time']

In [75]:
print(data_list[0])

['08-Mar-2021', 'Doha', 'Hard', 'QF', '6', '42', 'Nikoloz Basilashvili [GEO] d. (2)Federer', '3-6 6-1 7-5', '0.87', '12.5%', '0.0%', '68.8%', '66.7%', '50.0%', '7/10', '1:50']


In [81]:
federer_h2h = pd.DataFrame(data_list, columns=columns)

In [82]:
federer_h2h.head()

Unnamed: 0,Date,Tournament,Surface,Rd,Rk,vRk,Opponent,Score,DR,A%,DF%,1stIn,1st%,2nd%,BPSvd,Time
0,08-Mar-2021,Doha,Hard,QF,6,42,Nikoloz Basilashvili [GEO] d. (2)Federer,3-6 6-1 7-5,0.87,12.5%,0.0%,68.8%,66.7%,50.0%,7/10,1:50
1,08-Mar-2021,Doha,Hard,R16,6,28,(2)Federer d. Daniel Evans [GBR],7-6(8) 3-6 7-5,1.16,12.4%,0.0%,67.6%,78.9%,52.9%,3/4,2:24
2,20-Jan-2020,Australian Open,Hard,SF,3,2,(2)Novak Djokovic [SRB] d. (3)Federer,7-6(1) 6-4 6-3,0.76,14.4%,2.9%,65.4%,66.2%,41.7%,7/11,2:18
3,20-Jan-2020,Australian Open,Hard,QF,3,100,(3)Federer d. Tennys Sandgren [USA],6-3 2-6 2-6 7-6(8) 6-3,0.93,2.9%,1.8%,65.5%,71.4%,52.5%,10/14,3:31
4,20-Jan-2020,Australian Open,Hard,R16,3,67,(3)Federer d. Marton Fucsovics [HUN],4-6 6-1 6-2 6-2,1.36,4.9%,0.0%,61.2%,76.2%,52.5%,7/9,2:11


- A slightly different approach:

In [83]:
header = table.find_element_by_css_selector('thead')

header_elements = header.find_elements_by_css_selector('th')

len(header_elements)

16

In [84]:
headers = []

for x in header_elements: 
    headers.append(x.text)
print(headers)

['Date', 'Tournament', 'Surface', 'Rd', 'Rk', 'vRk', '', 'Score', 'DR', 'A%', 'DF%', '1stIn', '1st%', '2nd%', 'BPSvd', 'Time']
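This second approach explains the earlier mismatch of 15 column names against 16 data points: the opponent column has a blank header, which disappears when the header text is split on spaces. Instead of retyping the whole list, the blank can be filled in programmatically. A minimal sketch, where `fill_blank_headers` is a hypothetical helper and `'Opponent'` is the name chosen earlier in this notebook:

```python
def fill_blank_headers(headers, fallbacks):
    """Replace empty header strings with names from `fallbacks`, in order."""
    remaining = iter(fallbacks)
    filled = []
    for header in headers:
        if header.strip():
            filled.append(header)
        else:
            filled.append(next(remaining))
    return filled


headers = ['Date', 'Tournament', 'Surface', 'Rd', 'Rk', 'vRk', '', 'Score',
           'DR', 'A%', 'DF%', '1stIn', '1st%', '2nd%', 'BPSvd', 'Time']
columns = fill_blank_headers(headers, ['Opponent'])
print(len(columns), columns[6])
```

The result has one name per `td` in each row, so it can be passed straight to `pd.DataFrame(data_list, columns=columns)`.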


## Some other neat stuff:

In [85]:
# Let's take a screenshot! 

driver.get('https://www.nytimes.com')

driver.get_screenshot_as_file('ny_times_front_pg.png')

driver.quit()

# IMDB
Example using click and send keys

In [125]:
url = 'https://www.imdb.com/'

In [126]:
driver = webdriver.Chrome()

In [127]:
driver.get(url)

In [129]:
search_bar = driver.find_element_by_xpath('//*[@id="suggestion-search"]')

In [130]:
search_bar.send_keys('King Kong')

In [131]:
search_bar.send_keys(Keys.ENTER)

In [132]:
rp1 = [x.find_element_by_tag_name('a').get_attribute('href') for x in driver.find_elements_by_class_name('result_text')]

In [133]:
rp1

['https://www.imdb.com/title/tt0360717/?ref_=fn_al_tt_1',
 'https://www.imdb.com/title/tt0074751/?ref_=fn_al_tt_2',
 'https://www.imdb.com/name/nm0426363/?ref_=fn_al_nm_1',
 'https://www.imdb.com/name/nm3289725/?ref_=fn_al_nm_2',
 'https://www.imdb.com/search/keyword?keywords=king-kong&ref_=fn_al_kw_1',
 'https://www.imdb.com/search/keyword?keywords=hong-kong&ref_=fn_al_kw_2',
 'https://www.imdb.com/company/co0178991/?ref_=fn_al_co_1',
 'https://www.imdb.com/company/co0120023/?ref_=fn_al_co_2']

In [134]:
driver.get(rp1[0])

In [135]:
f = driver.find_element_by_class_name('title_block').text.split('\n')

In [136]:
f

['7.2/10',
 '397,566',
 'Rate This',
 'King Kong (2005)',
 'PG-13 | 3h 7min | Action, Adventure, Drama | 14 December 2005 (USA)']

In [137]:
keys = ['star_rating','num_reviews','Title','rating','length','Genres','Release']

In [138]:
f = f[:-1] + f[-1].split(' | ')
f.pop(2)
f

['7.2/10',
 '397,566',
 'King Kong (2005)',
 'PG-13',
 '3h 7min',
 'Action, Adventure, Drama',
 '14 December 2005 (USA)']

In [139]:
dict(zip(keys,f))

{'star_rating': '7.2/10',
 'num_reviews': '397,566',
 'Title': 'King Kong (2005)',
 'rating': 'PG-13',
 'length': '3h 7min',
 'Genres': 'Action, Adventure, Drama',
 'Release': '14 December 2005 (USA)'}
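Since the section after this one talks about handling a list of movies, the three steps above (drop the "Rate This" label, split the last line on `' | '`, zip with the keys) can be bundled into one function. This is a sketch that assumes every title block has the same layout as the King Kong page; the function name `parse_title_block` is made up.

```python
def parse_title_block(lines):
    """Turn the title-block text lines into a dict of movie details."""
    keys = ['star_rating', 'num_reviews', 'Title', 'rating',
            'length', 'Genres', 'Release']
    # Drop the "Rate This" widget label, then split the last line on " | ".
    kept = [line for line in lines[:-1] if line != 'Rate This']
    values = kept + lines[-1].split(' | ')
    if len(values) != len(keys):
        raise ValueError(f"expected {len(keys)} fields, got {len(values)}")
    return dict(zip(keys, values))


f = ['7.2/10', '397,566', 'Rate This', 'King Kong (2005)',
     'PG-13 | 3h 7min | Action, Adventure, Drama | 14 December 2005 (USA)']
movie = parse_title_block(f)
print(movie['Title'], movie['length'])
```

The length check makes a page with a different layout fail loudly instead of silently pairing values with the wrong keys.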

## Find reviews for a list of movies

In [140]:
# Get the reviews section by class name; its link could then be passed to driver.get
reviews = driver.find_element_by_class_name('user-comments')
# Printing the link texts shows that the link we want is the last <a> tag
print([x.text for x in reviews.find_elements_by_tag_name('a')])
reviews.find_elements_by_tag_name('a')[-1].get_attribute('href')

['MikeWindgren', 'See all my reviews', 'Report this', 'Review this title', 'See all 2,736 user reviews']


'https://www.imdb.com/title/tt0360717/reviews?ref_=tt_urv'

In [141]:
#or find url by xpath and click on that button
driver.find_element_by_xpath('//*[@id="titleUserReviewsTeaser"]/div/a[2]').click()

In [142]:
# Let's pull out one review and get each item; then we can put it in a loop
x = driver.find_elements_by_class_name('review-container')[0]

In [143]:
#get rating
x.find_element_by_tag_name('span').text

'9/10'

In [144]:
#get title
x.find_element_by_class_name('title').text

'Terrific Adaptation'

In [145]:
#get date
x.find_element_by_class_name('review-date').text

'26 July 2020'

In [146]:
#get the review
x.find_element_by_class_name('content').text

"This is the best adaptation to date of King Kong. It's a lengthy and rewarding piece. The over three hour run time might seem standoffish at first but you won't regret investing the time in watching this movie. The star of the film, Kong, is wonderfully anthropomorphic and you can readily identify with his nonverbal communication thanks to the work of Andy Serkis. A great and satisfying film. I am always saddened at the tragic ending.\n26 out of 28 found this helpful. Was this review helpful? Sign in to vote.\nPermalink"

In [147]:
# Now let's put it in a loop
data = []
keys = ['rating', 'title', 'date', 'review']
reviews = driver.find_elements_by_class_name('review-container')
for review in reviews:
    lst = [review.find_element_by_tag_name('span').text,
           review.find_element_by_class_name('title').text,
           review.find_element_by_class_name('review-date').text,
           review.find_element_by_class_name('content').text]
    data.append(dict(zip(keys, lst)))

In [148]:
# What's wrong with the output of this data?
data

[{'rating': '9/10',
  'title': 'Terrific Adaptation',
  'date': '26 July 2020',
  'review': "This is the best adaptation to date of King Kong. It's a lengthy and rewarding piece. The over three hour run time might seem standoffish at first but you won't regret investing the time in watching this movie. The star of the film, Kong, is wonderfully anthropomorphic and you can readily identify with his nonverbal communication thanks to the work of Andy Serkis. A great and satisfying film. I am always saddened at the tragic ending.\n26 out of 28 found this helpful. Was this review helpful? Sign in to vote.\nPermalink"},
 {'rating': '10/10',
  'title': 'Not the recognition it deserves!!!',
  'date': '10 May 2020',
  'review': "Typical Peter Jackson, however gonna watch the even longer extended 3 hours 20 minutes edition in 4k, not watched in years but the picture is epic apart from it being too warm for my liking. The sound is the unusual DTS-X high def sound and already is gorgeous!\n\nHowev
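
One thing visible in the output above is that each `review` string ends with the page's voting footer (`"\n26 out of 28 found this helpful. Was this review helpful? ..."`). A minimal sketch of separating the review body from that footer, assuming the footer always has the "N out of M found this helpful." wording shown here. `split_review` is a hypothetical helper.

```python
import re

# Matches the footer from its leading newline to the end of the string
FOOTER = re.compile(r'\n(\d[\d,]*) out of (\d[\d,]*) found this helpful\..*\Z',
                    re.DOTALL)

def split_review(text):
    """Split scraped review text into body and helpfulness counts."""
    match = FOOTER.search(text)
    if match is None:
        # No footer found: return the text unchanged apart from outer whitespace
        return {'review': text.strip(), 'helpful': None, 'votes': None}
    return {'review': text[:match.start()].strip(),
            'helpful': int(match.group(1).replace(',', '')),
            'votes': int(match.group(2).replace(',', ''))}

sample = ("A great and satisfying film.\n26 out of 28 found this helpful. "
          "Was this review helpful? Sign in to vote.\nPermalink")
print(split_review(sample))
```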

In [149]:
# Let's see if we can expand the hidden reviews
for x in driver.find_elements_by_class_name('ipl-expander'):
    x.click()

ElementNotInteractableException: Message: element not interactable
  (Session info: chrome=89.0.4389.128)


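Nope, we can't: Selenium raises `ElementNotInteractableException` as soon as it reaches an expander it cannot click.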
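
The cell above stops at the first element that refuses the click. A more forgiving loop might skip hidden elements and carry on after a failure. This is only a sketch: `click_all` is a hypothetical helper, and whether it actually expands the reviews on this page has not been verified here. It relies only on the `is_displayed()` and `click()` methods that Selenium's `WebElement` provides.

```python
def click_all(elements, ignored=(Exception,)):
    """Click every visible element, skipping any that raise on click.

    `ignored` is the tuple of exception types to swallow. With Selenium you
    would likely narrow it to (ElementNotInteractableException,) imported
    from selenium.common.exceptions. Returns the number of successful clicks.
    """
    clicked = 0
    for element in elements:
        try:
            if not element.is_displayed():
                continue  # hidden elements cannot be clicked
            element.click()
            clicked += 1
        except ignored:
            continue  # move on to the next element instead of aborting
    return clicked
```

Usage would look like `click_all(driver.find_elements_by_class_name('ipl-expander'))`.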

## Where Beautiful Soup wins

In [150]:
resp = requests.get('https://www.imdb.com/title/tt0360717/reviews?ref_=tt_urv')

In [151]:
bs = BeautifulSoup(resp.content, 'html.parser')

In [152]:
[x.text for x in bs.findAll(class_='content')]

["\nThis is the best adaptation to date of King Kong. It's a lengthy and rewarding piece. The over three hour run time might seem standoffish at first but you won't regret investing the time in watching this movie. The star of the film, Kong, is wonderfully anthropomorphic and you can readily identify with his nonverbal communication thanks to the work of Andy Serkis. A great and satisfying film. I am always saddened at the tragic ending.\n\n                    26 out of 28 found this helpful.\n                        \n                            Was this review helpful?  Sign in to vote.\n                        \n\nPermalink\n\n",
 "\nTypical Peter Jackson, however gonna watch the even longer extended 3 hours 20 minutes edition in 4k, not watched in years but the picture is epic apart from it being too warm for my liking. The sound is the unusual DTS-X high def sound and already is gorgeous!However enough of the technical borefest hahaha... this film for me is stunningly shot... som

In [153]:
driver.quit()
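
The `.text` values that Beautiful Soup returned in cell 152 keep the page's indentation and blank lines. A small sketch of collapsing that whitespace with plain string methods (`normalize_whitespace` is a hypothetical name). Beautiful Soup's `get_text()` with `strip=True` is another option worth trying.

```python
def normalize_whitespace(text):
    """Collapse every run of spaces, tabs and newlines into a single space."""
    return ' '.join(text.split())

raw = ("\nA great and satisfying film.\n\n                    "
       "26 out of 28 found this helpful.\n                        \n\nPermalink\n\n")
print(normalize_whitespace(raw))
# A great and satisfying film. 26 out of 28 found this helpful. Permalink
```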

# YELP
Using Beautiful Soup to parse reviews and deal with different page layouts

In [271]:
resp = requests.get('https://www.yelp.com/biz/bushwick-grind-cafe-brooklyn-2?adjust_creative=XCA5Kc7RlIdeGhJ7qoZSYA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=XCA5Kc7RlIdeGhJ7qoZSYA')

In [272]:
bs = BeautifulSoup(resp.content, 'html.parser')

In [273]:
x = bs.findAll(class_='margin-b5__373c0__2ErL8')

In [274]:
len(x)

11

In [275]:
x[0].text,x[1].text

("UsernameLocation001 star ratingEek! Methinks not.2 star ratingMeh. I've experienced better.3 star ratingA-OK.4 star ratingYay! I'm a fan.5 star ratingWoohoo! As good as it gets!Start your review of Bushwick Grind Cafe.",
 "Janet L.Fremont, CA22281/4/2021Amazing customer service! Not sure if she was the owner or just the staff, but our barista was so incredibly friendly and courteous. Great selection of drinks and food. Definitely had my fix for the morning! Unfortunately was coming from out of town, but if I wasn't, would definitely become a regular!UsefulFunnyCool 1")

In [276]:
x = x[1]

In [277]:
x.find(class_='css-n6i4z7').text

'Fremont, CA'

In [278]:
x.find(class_='i-stars__373c0__1T6rz')['aria-label']

'5 star rating'

In [279]:
x.find(class_='css-e81eai').text

'1/4/2021'

In [280]:
x.find(class_='raw__373c0__3rcx7').text

"Amazing customer service! Not sure if she was the owner or just the staff, but our barista was so incredibly friendly and courteous. Great selection of drinks and food. Definitely had my fix for the morning! Unfortunately was coming from out of town, but if I wasn't, would definitely become a regular!"

Second Company

In [281]:
resp = requests.get('https://www.yelp.com/biz/grey-cafe-flushing?adjust_creative=XCA5Kc7RlIdeGhJ7qoZSYA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=XCA5Kc7RlIdeGhJ7qoZSYA')

In [282]:
resp

<Response [200]>

In [283]:
bs = BeautifulSoup(resp.content, 'html.parser')

In [284]:
x = bs.findAll(class_='margin-b5__373c0__2ErL8')
len(x)

12
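
The same class matched 11 blocks on the first page and 12 here, and on the first page `x[0]` was the "Start your review" widget rather than a review, so a fixed index is fragile. One possible approach, sketched below, is to treat a block as a review only if all four fields are found. This rests on the untested assumption that non-review blocks lack at least one of these classes. The class names are copied from the cells above (Yelp generates them, so they may change), and `extract_review` / `extract_reviews` are hypothetical helpers.

```python
# Class names copied from the earlier cells; treat them as assumptions
STARS_CLASS = 'i-stars__373c0__1T6rz'
FIELD_CLASSES = {
    'location': 'css-n6i4z7',
    'date': 'css-e81eai',
    'review': 'raw__373c0__3rcx7',
}

def extract_review(block):
    """Return a dict of review fields, or None if any field is missing."""
    stars = block.find(class_=STARS_CLASS)
    if stars is None or stars.get('aria-label') is None:
        return None
    record = {'rating': stars.get('aria-label')}
    for key, class_name in FIELD_CLASSES.items():
        tag = block.find(class_=class_name)
        if tag is None:
            return None  # a field is missing, so this is probably not a review
        record[key] = tag.text
    return record

def extract_reviews(blocks):
    """Keep only the blocks for which every field was found."""
    records = (extract_review(block) for block in blocks)
    return [record for record in records if record is not None]
```

Usage would look like `extract_reviews(bs.findAll(class_='margin-b5__373c0__2ErL8'))`, which works the same way on either page layout.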

In [285]:
# Are these reviews?
x[0].text, x[1].text

('Q:How is this business operating during COVID-19? Are they offering takeout, delivery, or both? Anything else to note about them right now?A:They are offering both takeout and delivery. They also have outdoor seating available for those who want to sit and enjoy their coffee/food. Kaitlin M.\xa09 months ago\xa01 person found this helpfulView question details',
 "UsernameLocation001 star ratingEek! Methinks not.2 star ratingMeh. I've experienced better.3 star ratingA-OK.4 star ratingYay! I'm a fan.5 star ratingWoohoo! As good as it gets!Start your review of GREY Cafe.")

In [289]:
# This sure looks like one!
x[2].text

"Fiona C.Queens, Queens, NY864963/2/2021The outdoor covered seating is so convenient! There's seating indoors and they limit the # of people and how long you can stay (1 hr I think).Every. Single. Pudding. Is. Amazing. Try the banana, tiramisu, or matcha pudding - they're all great quality and $6 each. Got the matcha latte from here and it's pretty good.Useful 1Funny 1Cool 1Business owner informationSunny C.Business Owner3/3/2021Hi Fiona, Thank you so much for your kind words. Glad you are enjoying our space. We look forward to the days when we can fully re-open. In the meantime stay safe and healthy!Read more"

In [290]:
x = x[2]

In [291]:
x.find(class_='css-n6i4z7').text

'Queens, Queens, NY'

In [294]:
x.find(class_='i-stars__373c0__1T6rz')['aria-label']

'5 star rating'

In [295]:
x.find(class_='css-e81eai').text

'3/2/2021'
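
The rating and date come back as strings (`'5 star rating'`, `'3/2/2021'`). A short sketch of converting them, assuming the label always reads "N star rating" and that dates are in US month/day/year order. Both are assumptions, and `parse_rating` / `parse_date` are hypothetical helpers.

```python
import re
from datetime import datetime

def parse_rating(label):
    """'5 star rating' -> 5.0"""
    match = re.match(r'(\d+(?:\.\d+)?) star rating', label)
    if match is None:
        raise ValueError(f'unexpected rating label: {label!r}')
    return float(match.group(1))

def parse_date(text):
    """'3/2/2021' -> datetime.date(2021, 3, 2), assuming month/day/year."""
    return datetime.strptime(text, '%m/%d/%Y').date()

print(parse_rating('5 star rating'))  # 5.0
print(parse_date('3/2/2021'))         # 2021-03-02
```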

In [296]:
x.find(class_='raw__373c0__3rcx7').text

"The outdoor covered seating is so convenient! There's seating indoors and they limit the # of people and how long you can stay (1 hr I think).Every. Single. Pudding. Is. Amazing. Try the banana, tiramisu, or matcha pudding - they're all great quality and $6 each. Got the matcha latte from here and it's pretty good."

In [None]:
# Then would come the loop to go through the pages
size=100
for page in range(0,101):