### This notebook demonstrates how to crawl a JavaScript-rendered website and covers several advanced topics in web crawling.

These are **advanced topics** in web crawling.

#### Topics covered in this tutorial:

- Crawling JavaScript-rendered websites
- Crawling websites that require a login
- Crawling websites with input forms
- Crawling websites that use infinite scrolling
- And more ...

### Javascript-rendered website

Go to **http://quotes.toscrape.com/js/** (a JavaScript-rendered website).

**Note:** Look up how to install and configure Selenium on Mac using Google Chrome.

In [25]:
import requests
from lxml import html
import pandas as pd
import csv

In [1]:
#storing response
response = requests.get('http://quotes.toscrape.com/js/')
data = html.fromstring(response.text)

print data.xpath("//span/text()")

[u'\u2192', u'\u2764']



The XPath above looks correct, but it does not return the quotes we expect, only two stray symbols. This is because the page is rendered by JavaScript: the quotes are added in the browser, so they are not in the HTML that `requests` downloads.
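To make the point concrete, here is a minimal sketch written for Python 3. It does not fetch the real page. `RAW_HTML` is a made-up snippet, and it assumes the page ships its quotes as a JSON array assigned to a variable named `data` inside a `<script>` tag. Both the snippet and the helper name `extract_embedded_quotes` are illustrative assumptions.

```python
import json
import re

# Hypothetical raw HTML, standing in for what requests.get() might return
# from a JavaScript-rendered page: the quotes live in a script, not in spans.
RAW_HTML = """
<html><body>
<div id="quotes"></div>
<script>
var data = [
  {"text": "Quote one.", "author": "Author A"},
  {"text": "Quote two.", "author": "Author B"}
];
// browser-side code would build the quote markup from data here
</script>
</body></html>
"""

def extract_embedded_quotes(raw_html):
    """Return the list embedded as `var data = [...];`, or [] if absent."""
    match = re.search(r"var data = (\[.*?\]);", raw_html, re.DOTALL)
    if match is None:
        return []
    return json.loads(match.group(1))

# The raw HTML contains no span elements, so an XPath on spans finds nothing.
span_count = len(re.findall(r"<span[\s>]", RAW_HTML))
quotes = extract_embedded_quotes(RAW_HTML)

print(span_count)
for q in quotes:
    print(q["author"], "-", q["text"])
```

When a page embeds its data like this, parsing the script can sometimes replace a browser entirely. Selenium remains the general approach because it works however the page builds its content.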


Crawling JavaScript-rendered pages requires a more advanced approach: **Python Selenium**

### Selenium

Install Selenium for Python: **pip install selenium**

Selenium requires a **driver** to interface with the chosen browser. Firefox, for example, requires **geckodriver**, which needs to be installed before the below examples can be run. 

Go to https://github.com/mozilla/geckodriver/releases (**Firefox** is used in this tutorial) and download **geckodriver** (and unzip the file). After unzipping, place **the exe file** in **/Anaconda/Library/bin** and **Make sure it’s in your PATH (environment variables).**

<img src="images\geckodriver.png">
<img src="images\path.png">

If you skip this step, Selenium raises the error `selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.`
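Before starting the browser, you can check from Python whether the driver is visible on your PATH. This is a small sketch written for Python 3, using `shutil.which` from the standard library. The helper name `find_driver` is made up for illustration.

```python
import shutil

def find_driver(name="geckodriver"):
    """Return the full path of the driver executable, or None if not on PATH."""
    return shutil.which(name)

path = find_driver()
if path is None:
    print("geckodriver was not found on PATH")
else:
    print("geckodriver found at", path)
```

If the result is `None`, fix the PATH entry (or move the executable) before running the Selenium examples.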

Other supported browsers will have their own drivers available. Links to some of the more popular browser drivers follow.

- Chrome: https://sites.google.com/a/chromium.org/chromedriver/downloads
- Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
- Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/

Source: Python selenium webpage

### Example: Crawling a JavaScript site using Selenium

In [1]:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://quotes.toscrape.com/js/")

title = driver.find_elements_by_xpath("//div[@class='quote']/span[@class='text']")

for i in title:
    print i.text

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”


### Locating Elements
http://selenium-python.readthedocs.io/locating-elements.html

There are various strategies to locate elements in a page. You can use the most appropriate one for your case. Selenium provides the following methods to locate elements in a page:

    find_element_by_id
    find_element_by_name
    find_element_by_xpath
    find_element_by_link_text
    find_element_by_partial_link_text
    find_element_by_tag_name
    find_element_by_class_name
    find_element_by_css_selector

To find multiple elements (these methods will return a **list**):

    find_elements_by_name
    find_elements_by_xpath
    find_elements_by_link_text
    find_elements_by_partial_link_text
    find_elements_by_tag_name
    find_elements_by_class_name
    find_elements_by_css_selector
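Note that Selenium 4 deprecated the `find_element_by_*` and `find_elements_by_*` helpers listed above and later removed them in favour of `find_element(by, value)` and `find_elements(by, value)`. The sketch below (Python 3) shows how the old names map onto the generic calls. The strategy strings are the values of the constants on `selenium.webdriver.common.by.By`. The `locate` helper and the `FakeDriver` class are hypothetical stand-ins, used so the example runs without a browser.

```python
# Locator strategy strings; these match the constants on
# selenium.webdriver.common.by.By (By.ID, By.XPATH, ...).
STRATEGIES = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}

def locate(driver, old_method, value):
    """Call the generic find_element/find_elements for an old-style method name."""
    prefix_many = "find_elements_by_"
    prefix_one = "find_element_by_"
    if old_method.startswith(prefix_many):
        return driver.find_elements(STRATEGIES[old_method[len(prefix_many):]], value)
    if old_method.startswith(prefix_one):
        return driver.find_element(STRATEGIES[old_method[len(prefix_one):]], value)
    raise ValueError("not a find_element(s)_by_* name: %s" % old_method)

class FakeDriver(object):
    """Hypothetical stand-in that records calls instead of driving a browser."""
    def find_element(self, by, value):
        return ("one", by, value)

    def find_elements(self, by, value):
        return [("many", by, value)]

driver = FakeDriver()
print(locate(driver, "find_element_by_class_name", "content"))
print(locate(driver, "find_elements_by_xpath", "//div[@class='quote']"))
```

With a real Selenium 4 driver, the equivalent of `driver.find_element_by_xpath(expr)` is `driver.find_element(By.XPATH, expr)`, after `from selenium.webdriver.common.by import By`.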

### Examples: Locating Elements
http://selenium-python.readthedocs.io/locating-elements.html

#### Locating Elements by Class Name

Use this when you want to locate an element by class attribute name. With this strategy, the first element with the matching class attribute name will be returned. If no element has a matching class attribute name, a NoSuchElementException will be raised.

For instance, consider this page source:

    <html>
     <body>
      <p class="content">Site content goes here.</p>
    </body>
    </html>

The “p” element can be located like this:

    content = driver.find_element_by_class_name('content')
    
    
#### Locating by XPath

For instance, consider this page source:

    <html>
     <body>
      <form id="loginForm">
       <input name="username" type="text" />
       <input name="password" type="password" />
       <input name="continue" type="submit" value="Login" />
       <input name="continue" type="button" value="Clear" />
      </form>
    </body>
    </html>
    
The form elements can be located like this:

    login_form = driver.find_element_by_xpath("/html/body/form[1]")
    login_form = driver.find_element_by_xpath("//form[1]")
    login_form = driver.find_element_by_xpath("//form[@id='loginForm']")
    
1. Absolute path (would break if the HTML was changed only slightly)
2. First form element in the HTML
3. The form element with attribute named id and the value loginForm

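The username element can be located like this:

    username = driver.find_element_by_xpath("//form/input[@name='username']")
    username = driver.find_element_by_xpath("//form[@id='loginForm']/input[1]")
    username = driver.find_element_by_xpath("//input[@name='username']")

1. An input child of a form element, with attribute named name and the value username
2. First input child of the form element with attribute named id and the value loginForm
3. First input element with attribute named name and the value username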
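You can try these expressions without a browser. The sketch below (Python 3) runs similar queries against the same login form using `xml.etree.ElementTree` from the standard library. ElementTree supports only a subset of XPath and needs well-formed markup, so this illustrates the expressions rather than replacing Selenium or lxml. Paths are written relative to the root `<html>` element.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
 </body>
</html>
"""

root = ET.fromstring(PAGE.strip())  # root is the <html> element

# ElementTree paths are relative to the element they are called on,
# so "./body/form[1]" plays the role of "/html/body/form[1]".
form_by_path = root.find("./body/form[1]")
form_by_id = root.find(".//form[@id='loginForm']")

username = root.find(".//input[@name='username']")
first_input = root.find(".//form[@id='loginForm']/input[1]")
continue_inputs = root.findall(".//input[@name='continue']")

print(form_by_path is form_by_id)
print(username.get("type"), first_input.get("name"))
print([el.get("value") for el in continue_inputs])
```

Two inputs share the name `continue`, so a query on that name alone returns both. To pick one, add a second condition such as the `type` attribute.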

### Example: Login page

In [3]:
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get("http://quotes.toscrape.com/login")

time.sleep(5)

username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")

username.send_keys("abc")
password.send_keys("abc")

login_attempt = driver.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()
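The fixed `time.sleep(5)` in the cell above either waits longer than needed or not long enough. A common alternative is to poll until a condition holds, which is what Selenium's `WebDriverWait` (in `selenium.webdriver.support.ui`) does for you. The sketch below (Python 3) shows only the polling idea in plain Python. `wait_until` and `element_is_ready` are hypothetical names, and the condition is simulated rather than tied to a real page.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises RuntimeError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise RuntimeError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)

# Hypothetical condition: pretend the element shows up on the third check.
attempts = {"count": 0}

def element_is_ready():
    attempts["count"] += 1
    return "username-field" if attempts["count"] >= 3 else None

found = wait_until(element_is_ready, timeout=5.0, poll_interval=0.01)
print(found, attempts["count"])
```

The wait ends as soon as the condition is met, so a fast page costs little time and a slow page still gets the full timeout.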

### Example: Login and Collect Data

In [4]:
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get("http://quotes.toscrape.com/login")

time.sleep(5)

username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")

username.send_keys("abc")
password.send_keys("abc")

login_attempt = driver.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()

#just wait a little for the browser to be ready
time.sleep(10)

for review in driver.find_elements_by_xpath("//div[@class='quote']"):
    name = review.find_element_by_xpath("span[2]/small[@class='author']").text
    url = review.find_element_by_xpath("span[2]/a[2]").get_attribute('href')
    print name, url

Albert Einstein http://goodreads.com/author/show/9810.Albert_Einstein
J.K. Rowling http://goodreads.com/author/show/1077326.J_K_Rowling
Albert Einstein http://goodreads.com/author/show/9810.Albert_Einstein
Jane Austen http://goodreads.com/author/show/1265.Jane_Austen
Marilyn Monroe http://goodreads.com/author/show/82952.Marilyn_Monroe
Albert Einstein http://goodreads.com/author/show/9810.Albert_Einstein
André Gide http://goodreads.com/author/show/7617.Andr_Gide
Thomas A. Edison http://goodreads.com/author/show/3091287.Thomas_A_Edison
Eleanor Roosevelt http://goodreads.com/author/show/44566.Eleanor_Roosevelt
Steve Martin http://goodreads.com/author/show/7103.Steve_Martin


In [5]:
# save data
import pandas as pd
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get("http://quotes.toscrape.com/login")

time.sleep(5)

username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")

username.send_keys("abc")
password.send_keys("abc")

login_attempt = driver.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()

#just wait a little for the browser to be ready
time.sleep(10)

data = []
for review in driver.find_elements_by_xpath("//div[@class='quote']"):
    name = review.find_element_by_xpath("span[2]/small[@class='author']").text
    url = review.find_element_by_xpath("span[2]/a[2]").get_attribute('href')
    print name, url
    data.append([name, url])

df = pd.DataFrame(data)
df.to_csv("quotes.csv", index=False, encoding='utf-8')    

Albert Einstein http://goodreads.com/author/show/9810.Albert_Einstein
J.K. Rowling http://goodreads.com/author/show/1077326.J_K_Rowling
Albert Einstein http://goodreads.com/author/show/9810.Albert_Einstein
Jane Austen http://goodreads.com/author/show/1265.Jane_Austen
Marilyn Monroe http://goodreads.com/author/show/82952.Marilyn_Monroe
Albert Einstein http://goodreads.com/author/show/9810.Albert_Einstein
André Gide http://goodreads.com/author/show/7617.Andr_Gide
Thomas A. Edison http://goodreads.com/author/show/3091287.Thomas_A_Edison
Eleanor Roosevelt http://goodreads.com/author/show/44566.Eleanor_Roosevelt
Steve Martin http://goodreads.com/author/show/7103.Steve_Martin


### Example: Form

In [6]:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time

driver = webdriver.Firefox()
driver.get("http://quotes.toscrape.com/search.aspx")

time.sleep(5)

#driver.find_element_by_xpath("//select[@name='author']/option[text()='Steve Martin']").click()
#driver.find_element_by_xpath("//select[@name='tag']/option[text()='humor']").click()

select_author = Select(driver.find_element_by_name('author'))
select_author.select_by_visible_text('Steve Martin')

select_tag = Select(driver.find_element_by_name('tag'))
select_tag.select_by_visible_text('humor')

login_attempt = driver.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()

time.sleep(5)

author = driver.find_element_by_xpath("//div[@class='quote']/span[@class='author']").text
quote = driver.find_element_by_xpath("//div[@class='quote']/span[@class='content']").text
                                     
print author, quote

Steve Martin “A day without sunshine is like, you know, night.”


### Example: Pagination

In [7]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

driver = webdriver.Firefox()
driver.get("http://quotes.toscrape.com/")

while True:
    for review in driver.find_elements_by_xpath("//div[@class='quote']"):
        name = review.find_element_by_xpath("span[2]/small[@class='author']").text
        url = review.find_element_by_xpath("span[2]/a[1]").get_attribute('href')
        print name, url

    try:
        # the last page has no "next" link, which ends the loop
        next_link = driver.find_element_by_xpath("//li[@class='next']/a")
    except NoSuchElementException:
        break
    next_link.click()
    time.sleep(5)

Albert Einstein http://quotes.toscrape.com/author/Albert-Einstein
J.K. Rowling http://quotes.toscrape.com/author/J-K-Rowling
Albert Einstein http://quotes.toscrape.com/author/Albert-Einstein
Jane Austen http://quotes.toscrape.com/author/Jane-Austen
Marilyn Monroe http://quotes.toscrape.com/author/Marilyn-Monroe
Albert Einstein http://quotes.toscrape.com/author/Albert-Einstein
André Gide http://quotes.toscrape.com/author/Andre-Gide
Thomas A. Edison http://quotes.toscrape.com/author/Thomas-A-Edison
Eleanor Roosevelt http://quotes.toscrape.com/author/Eleanor-Roosevelt
Steve Martin http://quotes.toscrape.com/author/Steve-Martin
Marilyn Monroe http://quotes.toscrape.com/author/Marilyn-Monroe
J.K. Rowling http://quotes.toscrape.com/author/J-K-Rowling
Albert Einstein http://quotes.toscrape.com/author/Albert-Einstein
Bob Marley http://quotes.toscrape.com/author/Bob-Marley
Dr. Seuss http://quotes.toscrape.com/author/Dr-Seuss
Douglas Adams http://quotes.toscrape.com/author/Douglas-Adams
Elie Wie

In [9]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import pandas as pd
import time

driver = webdriver.Firefox()
driver.get("http://quotes.toscrape.com/")

data = []
for i in range(1,4):   # first three pages only
    for review in driver.find_elements_by_xpath("//div[@class='quote']"):
        name = review.find_element_by_xpath("span[2]/small[@class='author']").text.encode('utf-8')
        url = review.find_element_by_xpath("span[2]/a[1]").get_attribute('href').encode('utf-8')
        print name, url
        data.append([name, url])

    try:
        # stop early if there is no "next" link
        next_link = driver.find_element_by_xpath("//li[@class='next']/a")
    except NoSuchElementException:
        break
    next_link.click()
    time.sleep(5)

df = pd.DataFrame(data)
df.to_csv("quotes_pagination.csv", index=False, encoding='utf-8')

Albert Einstein http://quotes.toscrape.com/author/Albert-Einstein
J.K. Rowling http://quotes.toscrape.com/author/J-K-Rowling
Albert Einstein http://quotes.toscrape.com/author/Albert-Einstein
Jane Austen http://quotes.toscrape.com/author/Jane-Austen
Marilyn Monroe http://quotes.toscrape.com/author/Marilyn-Monroe
Albert Einstein http://quotes.toscrape.com/author/Albert-Einstein
André Gide http://quotes.toscrape.com/author/Andre-Gide
Thomas A. Edison http://quotes.toscrape.com/author/Thomas-A-Edison
Eleanor Roosevelt http://quotes.toscrape.com/author/Eleanor-Roosevelt
Steve Martin http://quotes.toscrape.com/author/Steve-Martin
Marilyn Monroe http://quotes.toscrape.com/author/Marilyn-Monroe
J.K. Rowling http://quotes.toscrape.com/author/J-K-Rowling
Albert Einstein http://quotes.toscrape.com/author/Albert-Einstein
Bob Marley http://quotes.toscrape.com/author/Bob-Marley
Dr. Seuss http://quotes.toscrape.com/author/Dr-Seuss
Douglas Adams http://quotes.toscrape.com/author/Douglas-Adams
Elie Wie
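Both pagination cells follow the same pattern: read the rows on the current page, follow the "next" link, and stop when there is no link or a page limit is reached. The sketch below (Python 3) isolates that control flow from the browser. `crawl_pages` and `FAKE_SITE` are hypothetical, with a small dictionary standing in for the site.

```python
def crawl_pages(get_page, start, max_pages=None):
    """Follow 'next' links from `start`, collecting rows from each page.

    get_page(key) must return (rows, next_key); next_key is None on the last page.
    """
    rows, key, visited = [], start, 0
    while key is not None and (max_pages is None or visited < max_pages):
        page_rows, key = get_page(key)
        rows.extend(page_rows)
        visited += 1
    return rows

# Hypothetical three-page site standing in for the browser session.
FAKE_SITE = {
    "/page/1": (["Author A", "Author B"], "/page/2"),
    "/page/2": (["Author C"], "/page/3"),
    "/page/3": (["Author D"], None),
}

all_rows = crawl_pages(FAKE_SITE.get, "/page/1")
first_two = crawl_pages(FAKE_SITE.get, "/page/1", max_pages=2)
print(all_rows)
print(first_two)
```

Keeping the stop conditions in one place makes it easy to switch between crawling everything and crawling a few pages while testing.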

### Example: Infinite Scrolling

In [10]:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time

driver = webdriver.Firefox()
driver.get("http://spidyquotes.herokuapp.com/scroll")
#driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

SCROLL_PAUSE_TIME = 0.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    for row in driver.find_elements_by_xpath("//div[@class='quote']"):
        author = row.find_element_by_xpath("span[2]/small[@class='author']").text
        quote = row.find_element_by_xpath("span[@class='text']").text
        print author, quote
    
    try:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    except:
        break


Albert Einstein “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
J.K. Rowling “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Albert Einstein “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Jane Austen “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Marilyn Monroe “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Albert Einstein “Try not to become a man of success. Rather become a man of value.”
André Gide “It is better to be hated for what you are than to be loved for what you are not.”
Thomas A. Edison “I have not failed. I've just found 10,000 ways that won't work.”
Eleanor Roosevelt “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
Steve Martin

Albert Einstein “Life is like riding a bicycle. To keep your balance, you must keep moving.”
Marilyn Monroe “The real lover is the man who can thrill you by kissing your forehead or smiling into your eyes or just staring into space.”
Marilyn Monroe “A wise girl kisses but doesn't love, listens but doesn't believe, and leaves before she is left.”
Martin Luther King Jr. “Only in the darkness can you see the stars.”
J.K. Rowling “It matters not what someone is born, but what they grow to be.”
James Baldwin “Love does not begin and end the way we seem to think it does. Love is a battle, love is a war; love is a growing up.”
Jane Austen “There is nothing I would not do for those who are really my friends. I have no notion of loving people by halves, it is not my nature.”
Eleanor Roosevelt “Do one thing every day that scares you.”
Marilyn Monroe “I am good, but not an angel. I do sin, but I am not the devil. I am just a small girl in a big world trying to find someone to love.”
Albert Einste

J.K. Rowling “Of course it is happening inside your head, Harry, but why on earth should that mean that it is not real?”
Bob Marley “The truth is, everyone is going to hurt you. You just got to find the ones worth suffering for.”
Mother Teresa “Not all of us can do great things. But we can do small things with great love.”
J.K. Rowling “To the well-organized mind, death is but the next great adventure.”
Charles M. Schulz “All you need is love. But a little chocolate now and then doesn't hurt.”
William Nicholson “We read to know we're not alone.”
Albert Einstein “Any fool can know. The point is to understand.”
Jorge Luis Borges “I have always imagined that Paradise will be a kind of library.”
George Eliot “It is never too late to be what you might have been.”
George R.R. Martin “A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”
C.S. Lewis “You can never get a cup of tea large enough or a book long enough to suit me.”
Marilyn Monroe “You

Stephenie Meyer “He's like a drug for you, Bella.”
Ernest Hemingway “There is no friend as loyal as a book.”
Helen Keller “When one door of happiness closes, another opens; but often we look so long at the closed door that we do not see the one which has been opened for us.”
George Bernard Shaw “Life isn't about finding yourself. Life is about creating yourself.”
Charles Bukowski “That's the problem with drinking, I thought, as I poured myself a drink. If something bad happens you drink in an attempt to forget; if something good happens you drink in order to celebrate; and if nothing happens you drink to make something happen.”
Suzanne Collins “You don’t forget the face of the person who was your last hope.”
Suzanne Collins “Remember, we're madly in love, so it's all right to kiss me anytime you feel like it.”
C.S. Lewis “To love at all is to be vulnerable. Love anything and your heart will be wrung and possibly broken. If you want to make sure of keeping it intact you must give it to 

Ernest Hemingway “There is no friend as loyal as a book.”
Helen Keller “When one door of happiness closes, another opens; but often we look so long at the closed door that we do not see the one which has been opened for us.”
George Bernard Shaw “Life isn't about finding yourself. Life is about creating yourself.”
Charles Bukowski “That's the problem with drinking, I thought, as I poured myself a drink. If something bad happens you drink in an attempt to forget; if something good happens you drink in order to celebrate; and if nothing happens you drink to make something happen.”
Suzanne Collins “You don’t forget the face of the person who was your last hope.”
Suzanne Collins “Remember, we're madly in love, so it's all right to kiss me anytime you feel like it.”
C.S. Lewis “To love at all is to be vulnerable. Love anything and your heart will be wrung and possibly broken. If you want to make sure of keeping it intact you must give it to no one, not even an animal. Wrap it carefully round
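The loop above re-reads every quote currently on the page on each pass, so rows loaded earlier are printed again after every scroll. One way to avoid the repeats is to remember which rows have already been collected. The sketch below (Python 3) shows that idea on simulated data. `add_new_rows` and the `passes` list are hypothetical stand-ins for the rows Selenium would return after each scroll.

```python
def add_new_rows(collected, seen, visible_rows):
    """Append rows not seen before to `collected`; return how many were added."""
    added = 0
    for row in visible_rows:
        if row not in seen:
            seen.add(row)
            collected.append(row)
            added += 1
    return added

# Hypothetical snapshots of the page after each scroll: every pass sees
# all rows loaded so far, not just the newly loaded ones.
passes = [
    [("Author A", "Quote 1"), ("Author B", "Quote 2")],
    [("Author A", "Quote 1"), ("Author B", "Quote 2"), ("Author C", "Quote 3")],
    [("Author A", "Quote 1"), ("Author B", "Quote 2"), ("Author C", "Quote 3")],
]

collected, seen = [], set()
for visible in passes:
    new_count = add_new_rows(collected, seen, visible)
    print(new_count, "new rows")

print(len(collected))
```

A pass that adds no new rows can also serve as a stop signal, alongside the scroll-height comparison used in the cell above.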

# Example: Tripadvisor

## Get Reviews per User

In [27]:
# https://stackoverflow.com/questions/38788367/trying-to-scrape-tripadvisor-members-using-beautifulsoup
# https://stackoverflow.com/questions/45857311/r-selenium-tripadvisor-detailed-member-info

from selenium import webdriver

driver = webdriver.Firefox()

url = 'https://www.tripadvisor.com/members/387piyalim'

driver.get(url)

next_button = driver.find_element_by_xpath("//li[@data-filter='REVIEWS_RESTAURANTS']")
next_button.click()

results = []

for review in driver.find_elements_by_xpath("//div[@class='cs-content-container']/ul/li"):
    name = review.find_element_by_xpath("div[@class='cs-review-details']/div[@class='cs-review-location']/a").text.encode('utf-8')
    url = review.find_element_by_xpath("div[@class='cs-review-details']/div[@class='cs-review-location']/a").get_attribute('href').encode('utf-8')
    reviewtitle = review.find_element_by_xpath("div[@class='cs-review-details']/a[@class='cs-review-title']").text.encode('utf-8')
    reviewurl = review.find_element_by_xpath("div[@class='cs-review-details']/a[@class='cs-review-title']").get_attribute('href').encode('utf-8')
    reviewdate = review.find_element_by_xpath("div[@class='cs-review-details']/div[@class='cs-review-date']").text.encode('utf-8')
    rating = review.find_element_by_xpath("div[@class='cs-review-rating']/span").get_attribute('class').encode('utf-8')
    print name, url, reviewtitle, reviewurl, reviewdate, rating
    results.append([name, url, reviewtitle, reviewurl, reviewdate, rating])
    
len(results)

Dubai: Margherita https://www.tripadvisor.com/Restaurant_Review-g295424-d7745900-Reviews-Margherita-Dubai_Emirate_of_Dubai.html “Good Italian food” https://www.tripadvisor.com/ShowUserReviews-g295424-d7745900-r609115071-Margherita-Dubai_Emirate_of_Dubai.html Aug 22, 2018 ui_bubble_rating bubble_4
Dubai: Karachi Haleem and Biryani https://www.tripadvisor.com/Restaurant_Review-g295424-d8739641-Reviews-Karachi_Haleem_and_Biryani-Dubai_Emirate_of_Dubai.html “My favorite Chicken Haleem” https://www.tripadvisor.com/ShowUserReviews-g295424-d8739641-r608318208-Karachi_Haleem_and_Biryani-Dubai_Emirate_of_Dubai.html Aug 20, 2018 ui_bubble_rating bubble_4
Dubai: Pappa Roti https://www.tripadvisor.com/Restaurant_Review-g295424-d2628708-Reviews-Pappa_Roti-Dubai_Emirate_of_Dubai.html “I am totally bonkers over their Signature Buns” https://www.tripadvisor.com/ShowUserReviews-g295424-d2628708-r608076499-Pappa_Roti-Dubai_Emirate_of_Dubai.html Aug 19, 2018 ui_bubble_rating bubble_5
Dubai: Fistikzade Ca

23
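The `rating` column holds the raw class attribute, for example `ui_bubble_rating bubble_4`. A small sketch (Python 3) for turning it into a number follows. The helper name `parse_bubble_rating` is made up, and it assumes the digits after `bubble_` encode the rating, which matches the single-digit values in the output above but is otherwise unverified.

```python
import re

def parse_bubble_rating(class_attr):
    """Return the digits after 'bubble_' as an int, or None if not present."""
    match = re.search(r"\bbubble_(\d+)\b", class_attr)
    return int(match.group(1)) if match else None

print(parse_bubble_rating("ui_bubble_rating bubble_4"))
print(parse_bubble_rating("ui_bubble_rating bubble_5"))
print(parse_bubble_rating("ui_bubble_rating"))
```

Returning `None` for an unexpected class string keeps a layout change on the site from silently producing wrong ratings.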

## Get Reviews from Multiple User Profiles

In [32]:
# https://stackoverflow.com/questions/38788367/trying-to-scrape-tripadvisor-members-using-beautifulsoup
# https://stackoverflow.com/questions/45857311/r-selenium-tripadvisor-detailed-member-info

from selenium import webdriver
driver = webdriver.Firefox()

# multiple urls --> you can read multiple urls from csv file as well
urls = ['https://www.tripadvisor.com/members/387piyalim',
       'https://www.tripadvisor.com/members/CabanaBoyToronto'
      ]

#to write or create a new csv file
output = open('results.csv','wb')
w = csv.writer(output)

for url in urls:

    driver.get(url)

    next_button = driver.find_element_by_xpath("//li[@data-filter='REVIEWS_RESTAURANTS']")
    next_button.click()

    for review in driver.find_elements_by_xpath("//div[@class='cs-content-container']/ul/li"):
        name = review.find_element_by_xpath("div[@class='cs-review-details']/div[@class='cs-review-location']/a").text.encode('utf-8')
        url = review.find_element_by_xpath("div[@class='cs-review-details']/div[@class='cs-review-location']/a").get_attribute('href').encode('utf-8')
        reviewtitle = review.find_element_by_xpath("div[@class='cs-review-details']/a[@class='cs-review-title']").text.encode('utf-8')
        reviewurl = review.find_element_by_xpath("div[@class='cs-review-details']/a[@class='cs-review-title']").get_attribute('href').encode('utf-8')
        reviewdate = review.find_element_by_xpath("div[@class='cs-review-details']/div[@class='cs-review-date']").text.encode('utf-8')
        rating = review.find_element_by_xpath("div[@class='cs-review-rating']/span").get_attribute('class').encode('utf-8')
        print name, url, reviewtitle, reviewurl, reviewdate, rating
        w.writerow([name, url, reviewtitle, reviewurl, reviewdate, rating])
    
output.close()
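The cell above opens the file in `'wb'` mode and encodes every field by hand, which is the Python 2 idiom. Under Python 3, `csv.writer` expects a text-mode file, so the encoding goes on `open()` instead. The sketch below writes two of the rows shown in this notebook to a temporary file and reads them back. The header names and the temporary path are illustrative assumptions.

```python
import csv
import os
import tempfile

rows = [
    ["Dubai: Margherita", "“Good Italian food”", "Aug 22, 2018"],
    ["Dubai: Pappa Roti", "“I am totally bonkers over their Signature Buns”", "Aug 19, 2018"],
]
header = ["name", "reviewtitle", "reviewdate"]  # illustrative column names

out_path = os.path.join(tempfile.mkdtemp(), "results.csv")

# Text mode with newline="" lets the csv module control line endings;
# the encoding is set on the file, so rows are written as plain str values.
with open(out_path, "w", newline="", encoding="utf-8") as output:
    writer = csv.writer(output)
    writer.writerow(header)
    writer.writerows(rows)

with open(out_path, newline="", encoding="utf-8") as saved:
    read_back = list(csv.reader(saved))

print(read_back[0])
print(len(read_back) - 1, "data rows")
```

The `with` block closes the file even if a row fails to write. Writing a header row also means `pd.read_csv` can be called without `header=None`.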

Dubai: Margherita https://www.tripadvisor.com/Restaurant_Review-g295424-d7745900-Reviews-Margherita-Dubai_Emirate_of_Dubai.html “Good Italian food” https://www.tripadvisor.com/ShowUserReviews-g295424-d7745900-r609115071-Margherita-Dubai_Emirate_of_Dubai.html Aug 22, 2018 ui_bubble_rating bubble_4
Dubai: Karachi Haleem and Biryani https://www.tripadvisor.com/Restaurant_Review-g295424-d8739641-Reviews-Karachi_Haleem_and_Biryani-Dubai_Emirate_of_Dubai.html “My favorite Chicken Haleem” https://www.tripadvisor.com/ShowUserReviews-g295424-d8739641-r608318208-Karachi_Haleem_and_Biryani-Dubai_Emirate_of_Dubai.html Aug 20, 2018 ui_bubble_rating bubble_4
Dubai: Pappa Roti https://www.tripadvisor.com/Restaurant_Review-g295424-d2628708-Reviews-Pappa_Roti-Dubai_Emirate_of_Dubai.html “I am totally bonkers over their Signature Buns” https://www.tripadvisor.com/ShowUserReviews-g295424-d2628708-r608076499-Pappa_Roti-Dubai_Emirate_of_Dubai.html Aug 19, 2018 ui_bubble_rating bubble_5
Dubai: Fistikzade Ca

Palm - Eagle Beach: Texas de Brazil Aruba https://www.tripadvisor.com/Restaurant_Review-g147249-d2188478-Reviews-Texas_de_Brazil_Aruba-Palm_Eagle_Beach_Aruba.html “Come Hungry, Leave Satiated” https://www.tripadvisor.com/ShowUserReviews-g147249-d2188478-r598602635-Texas_de_Brazil_Aruba-Palm_Eagle_Beach_Aruba.html Jul 22, 2018 ui_bubble_rating bubble_5
Woodbridge: Ice Cream Patio Ltd https://www.tripadvisor.com/Restaurant_Review-g562671-d5033313-Reviews-Ice_Cream_Patio_Ltd-Woodbridge_Vaughan_Ontario.html “Don't Let the Name Fool You” https://www.tripadvisor.com/ShowUserReviews-g562671-d5033313-r519510741-Ice_Cream_Patio_Ltd-Woodbridge_Vaughan_Ontario.html Aug 30, 2017 ui_bubble_rating bubble_5
Woodbridge: Extreme Pita https://www.tripadvisor.com/Restaurant_Review-g562671-d5098264-Reviews-Extreme_Pita-Woodbridge_Vaughan_Ontario.html “Good Choice of Pitas” https://www.tripadvisor.com/ShowUserReviews-g562671-d5098264-r518973215-Extreme_Pita-Woodbridge_Vaughan_Ontario.html Aug 29, 2017 ui_b

In [34]:
df = pd.read_csv('results.csv', header=None)
df

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | Dubai: Margherita | https://www.tripadvisor.com/Restaurant_Review-... | “Good Italian food” | https://www.tripadvisor.com/ShowUserReviews-g2... | Aug 22, 2018 | ui_bubble_rating bubble_4 |
| 1 | Dubai: Karachi Haleem and Biryani | https://www.tripadvisor.com/Restaurant_Review-... | “My favorite Chicken Haleem” | https://www.tripadvisor.com/ShowUserReviews-g2... | Aug 20, 2018 | ui_bubble_rating bubble_4 |
| 2 | Dubai: Pappa Roti | https://www.tripadvisor.com/Restaurant_Review-... | “I am totally bonkers over their Signature Buns” | https://www.tripadvisor.com/ShowUserReviews-g2... | Aug 19, 2018 | ui_bubble_rating bubble_5 |
| 3 | Dubai: Fistikzade Cafe | https://www.tripadvisor.com/Restaurant_Review-... | “Delectable Turkish Baklava, not to be missed” | https://www.tripadvisor.com/ShowUserReviews-g2... | Aug 17, 2018 | ui_bubble_rating bubble_5 |
| 4 | Dubai: Fish Hut | https://www.tripadvisor.com/Restaurant_Review-... | “A must visit for sea food lovers” | https://www.tripadvisor.com/ShowUserReviews-g2... | Aug 15, 2018 | ui_bubble_rating bubble_4 |
| 5 | Dubai: My Shawarma | https://www.tripadvisor.com/Restaurant_Review-... | “Small funky restaurant serving scrummy shawar... | https://www.tripadvisor.com/ShowUserReviews-g2... | Aug 10, 2018 | ui_bubble_rating bubble_4 |
| 6 | Dubai: Asha's | https://www.tripadvisor.com/Restaurant_Review-... | “Serving Indian Cuisine in it's finest culinar... | https://www.tripadvisor.com/ShowUserReviews-g2... | Aug 7, 2018 | ui_bubble_rating bubble_4 |
| 7 | Dubai: Salt | https://www.tripadvisor.com/Restaurant_Review-... | “A food truck serving delumptious Sliders and ... | https://www.tripadvisor.com/ShowUserReviews-g2... | Aug 6, 2018 | ui_bubble_rating bubble_5 |
| 8 | Dubai: MTR - Mavalli Tiffin Rooms | https://www.tripadvisor.com/Restaurant_Review-... | “Best Rava Idli And Ragi Dosa in town” | https://www.tripadvisor.com/ShowUserReviews-g2... | Aug 6, 2018 | ui_bubble_rating bubble_4 |
| 9 | Dubai: Katrina Sweets & Confectionary | https://www.tripadvisor.com/Restaurant_Review-... | “Love their honey cake” | https://www.tripadvisor.com/ShowUserReviews-g2... | Aug 5, 2018 | ui_bubble_rating bubble_4 |


## User Profile

In [62]:
# https://stackoverflow.com/questions/14068119/python-web-crawling

from StringIO import StringIO
import requests
from lxml import etree

response = requests.get("http://www.tripadvisor.in/members/SomersetKeithers")

parser = etree.HTMLParser()
tree   = etree.parse(StringIO(response.text), parser)

def get_definition_description(tree, term):
  description = tree.xpath("//dl[dt/text()='%s']//dd/text()" % term)
  if len(description):
    return description[0].strip()

print get_definition_description(tree, "ageSince:")
print get_definition_description(tree, "Gender:")
print get_definition_description(tree, "Location:")

None
None
None


In [60]:
# https://stackoverflow.com/questions/38788367/trying-to-scrape-tripadvisor-members-using-beautifulsoup
# https://stackoverflow.com/questions/45857311/r-selenium-tripadvisor-detailed-member-info

from selenium import webdriver

driver = webdriver.Firefox()

url = 'https://www.tripadvisor.com/members/387piyalim'

driver.get(url)

for review in driver.find_elements_by_xpath("//div[@id='MODULES_MEMBER_CENTER']"):
    print review.find_element_by_xpath("div[@class='leftProfile']/div/div[@class='profileBlock']/div/div/span").text.encode('utf-8')
    print review.find_element_by_xpath("div[@class='leftProfile']/div/div[@class='profInfo']/div/p").text.encode('utf-8')    

Piyali M
Since Apr 2014


In [61]:
# https://stackoverflow.com/questions/38788367/trying-to-scrape-tripadvisor-members-using-beautifulsoup
# https://stackoverflow.com/questions/45857311/r-selenium-tripadvisor-detailed-member-info

from selenium import webdriver

driver = webdriver.Firefox()

urls = ['https://www.tripadvisor.com/members/387piyalim',
       'https://www.tripadvisor.com/members/CabanaBoyToronto'
      ]

#to write or create a new csv file
output = open('profiles.csv','wb')
w = csv.writer(output)

for url in urls:

    driver.get(url)

    for review in driver.find_elements_by_xpath("//div[@id='MODULES_MEMBER_CENTER']"):
        uid = review.find_element_by_xpath("div[@class='leftProfile']/div/div[@class='profileBlock']/div/div/span").text.encode('utf-8')
        since = review.find_element_by_xpath("div[@class='leftProfile']/div/div[@class='profInfo']/div/p").text.encode('utf-8')
        w.writerow([uid, since])
    
output.close()

# Lab:

**http://www.horsedeathwatch.com/** (Another Javascript-rendered Website)

Collect three columns: 
- horse name
- date
- course

In [18]:
import requests
from lxml import html

#storing response
response = requests.get('http://www.horsedeathwatch.com/')
data = html.fromstring(response.text)

print data.xpath('//tr/td[@data-th="Horse"]/a/text()')

[]


No data is returned because the table is filled in by JavaScript. You need to use **Selenium**.

In [None]:
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get("http://www.horsedeathwatch.com/")

# References

- http://selenium-python.readthedocs.io/
- https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python

# Appendix: Pagination & Downloading images

In [20]:
# https://hackernoon.com/30-minute-python-web-scraper-39d6d038e5da

import os
import requests
import time
from selenium import webdriver
from PIL import Image
from io import BytesIO

url = "https://unsplash.com"
output_dir = "./download_images"

# create the output folder if it does not exist yet
if not os.path.isdir(output_dir):
    os.makedirs(output_dir)

driver = webdriver.Firefox()
#driver = webdriver.Firefox(executable_path=r'geckodriver.exe')
driver.get(url)

driver.execute_script("window.scrollTo(0,1000);")
time.sleep(5)
image_elements = driver.find_elements_by_css_selector("#gridMulti img")

for i, image_element in enumerate(image_elements):
    image_url = image_element.get_attribute("src")
    if not image_url:
        continue  # skip images without a src attribute
    # Send an HTTP GET request, get and save the image from the response
    image_object = requests.get(image_url)
    image = Image.open(BytesIO(image_object.content))
    image.save(os.path.join(output_dir, "image" + str(i) + "." + image.format), image.format)