# Scraping the data

In [119]:
from __future__ import print_function

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import numpy as np

import os
import re

import pandas as pd

chromedriver = "/Applications/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)

I will analyse the Sci-Fi and Fantasy movies of all time on Rotten Tomatoes.

In [120]:
url = 'https://www.rottentomatoes.com/browse/dvd-streaming-all?minTomato=0&maxTomato=100&services=amazon;hbo_go\
;itunes;netflix_iw;vudu;amazon_prime;fandango_now&genres=14&sortBy=release'

In [123]:
driver.get(url)

# Part one: getting general information about the movies with Selenium

There are two reasons why I do this part with Selenium and the further data scraping (also) with Scrapy, instead of doing everything with one package:
* I didn't manage to find the necessary urls with Scrapy, not even when copying the exact XPath from the browser, so I wasn't able to do part one with Scrapy. Part two takes forever with Selenium (`driver.get(url)` is not time-efficient at all), so I decided to scrape the data for each movie with Scrapy.
* I wanted to learn the basics of both Selenium and Scrapy.

The main practical difference I noticed is that with Selenium you simulate a real person using a browser. For instance, you cannot click buttons that are scrolled out of view. I assume that this is because Selenium drives an actual browser through a webdriver. Scrapy works on the downloaded HTML rather than on a rendered page, so it can reach whatever is in the markup, regardless of whether it would be in view or not.

Two things will be extracted from the initial page (all Sci-Fi and Fantasy movies):

* title
* url to movie page

*__Important:__* the rest of the information will be extracted per url. In order to put all of the data together in a df, there has to be a variable on which to match the rating information with the general information, namely the url.  
As a first step, I will create two lists: one with urls, one with titles. Since lists are ordered, I assume that the elements at the same index in the two lists refer to the same movie. They can be added to a df, and afterwards the url column will be used as a reference to merge the rest of the data.

Only 32 out of 1528 movies are visible without clicking the 'show more' button. In order to see all movies from the chosen category, the button needs to be clicked 48 times.  
A problem that arose was that clicking the button is not possible if it is scrolled out of view (see https://github.com/seleniumhq/selenium-google-code-issue-archive/issues/4637 where this problem is discussed). The solution was to scroll down to the button each time it is found, and then click it.  
Another possibility: keep clicking the button (and scrolling) until you get an error that the button wasn't found. This allows the code to be used even if you do not know the total number of movies.
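The "click until the button disappears" variant could look roughly like the sketch below. It is an illustration, not the code that was run for this notebook: `click_until_gone`, `not_found` and `max_clicks` are made-up names, and `FakeDriver` is a stand-in so that the loop can be tried without a browser. With a real driver you would pass `selenium.common.exceptions.NoSuchElementException` as `not_found`.

```python
def click_until_gone(driver, selector, not_found=Exception, max_clicks=500):
    """Scroll to and click the element matching `selector` until it is gone.

    Returns the number of clicks. `max_clicks` is only a safety cap
    (an assumption, pick whatever suits the page).
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            button = driver.find_element_by_css_selector(selector)
        except not_found:
            break  # no button left: everything has been loaded
        driver.execute_script("arguments[0].scrollIntoView();", button)
        button.click()
        clicks += 1
    return clicks


# --- stand-ins so the loop can be run without a browser (hypothetical) ---
class FakeButton:
    def __init__(self, page):
        self.page = page

    def click(self):
        self.page.remaining -= 1  # each click "loads" one more batch


class FakeDriver:
    def __init__(self, remaining):
        self.remaining = remaining  # batches that are still hidden

    def find_element_by_css_selector(self, selector):
        if self.remaining <= 0:
            raise LookupError("button not found")
        return FakeButton(self)

    def execute_script(self, script, *args):
        pass  # scrolling is a no-op for the stand-in


clicks = click_until_gone(FakeDriver(remaining=3),
                          'button.btn.btn-secondary-rt.mb-load-btn',
                          not_found=LookupError)
print(clicks)
```

Depending on how quickly the page loads new movies, a short pause between clicks may also be needed; that is not shown here.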

In [None]:
for i in range(49):
    button = driver.find_element_by_css_selector('button.btn.btn-secondary-rt.mb-load-btn')
    driver.execute_script("arguments[0].scrollIntoView();", button)
    button.click()

__Get list of urls to all individual movies.__ 

In [134]:
divs=driver.find_elements_by_class_name('poster_container')
links = []

for d in divs:
    link=d.find_element_by_tag_name('a')
    links.append(link.get_attribute('href'))
links

ConnectionRefusedError: [Errno 61] Connection refused

__Get list of all titles.__

In [11]:
titles = driver.find_elements_by_class_name("movieTitle")
tits = []
for title in titles:
    tits.append(title.text)
len(tits)

1489

__Add the lists to a df and inspect whether the correspondences are correct.__

In [18]:
all_data = pd.DataFrame()

all_data['title'] = tits
all_data['url'] = links

In [19]:
all_data.head()

| | title | url |
|---|---|---|
| 0 | Resident Evil: The Final Chapter | https://www.rottentomatoes.com/m/resident_evil... |
| 1 | Passengers | https://www.rottentomatoes.com/m/passengers_2016 |
| 2 | Beauty And The Beast (La Belle Et La Bête) | https://www.rottentomatoes.com/m/beauty_and_th... |
| 3 | Fantastic Beasts And Where To Find Them | https://www.rottentomatoes.com/m/fantastic_bea... |
| 4 | Absolutely Anything | https://www.rottentomatoes.com/m/absolutely_an... |


In [125]:
all_data.tail()

| | title | url |
|---|---|---|
| 1484 | The Canterville Ghost | https://www.rottentomatoes.com/m/the_cantervil... |
| 1485 | Abbott And Costello Meet The Invisible Man | https://www.rottentomatoes.com/m/abbott_and_co... |
| 1486 | The Borrower | https://www.rottentomatoes.com/m/borrower |
| 1487 | Ghost Of Frankenstein | https://www.rottentomatoes.com/m/ghost_of_fran... |
| 1488 | Frankenstein And The Monster From Hell | https://www.rottentomatoes.com/m/frankenstein_... |


Correspondences are correct.  
One mystery: I only get 1489 movies, while the page supposedly lists 1528. No idea why that is. The last movie I have in the df is the last one shown in the driver.

Pickle it!

In [21]:
all_data.to_pickle('/Users/aleksandra/ds/metis/project_luther/all_data.pkl')

# Part two: get more detailed info on all movies.

I did the second part both with Selenium and with Scrapy. One problem was that the same XPaths didn't retrieve the same things in each package. Who knows why.  
Hypothesis: Scrapy can also deal with invisible things (Selenium's `.text` only returns text that is visible on the rendered page). For instance, in the Tomatometer, you can choose to view all critics or top critics. Selenium will only scrape all critics, Scrapy will scrape both.

## *Selenium*

## First step: get all the xpaths to the info I need.

These xpaths lead to the nodes that contain the text that I need.

In [126]:
score_c = ('score_c', '//a[@id="tomato_meter_link"]/span[2]/span')
rating_c = ('rating_c', '//*[@id="scoreStats"]/div[1]')
n_reviews_c = ('n_reviews_c', '//div[@id="scoreStats"]/div[2]/span[2]')
fresh_c = ('fresh_c','//*[@id="scoreStats"]/div[3]/span[2]')
rotten_c = ('rotten_c', '//*[@id="scoreStats"]/div[4]/span[2]')
score_u = ('score_u', '//*[@id="scorePanel"]/div[2]/div[1]/a/div/div[2]/div[1]/span')
rating_u = ('rating_u', '//*[@id="scorePanel"]/div[2]/div[2]/div[1]')
n_reviews_u = ('n_reviews_u', '//*[@id="scorePanel"]/div[2]/div[2]/div[2]')
director = ('director','//*[@id="mainColumn"]/section[3]/div/div/ul/li[3]/div[2]/a')
box_office = ('box_office', '//div[contains(text(),"Box Office:") and not(descendant::*)]//following-sibling::div')
runtime = ('runtime', '//*[@id="mainColumn"]/section[3]/div/div/ul/li[8]/div[2]/time')
in_theatre = ('in_theatre', '//*[@id="mainColumn"]/section[3]/div/div/ul/li[5]/div[2]/time')
on_disc = ('on_disc', '//*[@id="mainColumn"]/section[3]/div/div/ul/li[6]/div[2]/time')
audience = ('audience', '//*[@id="mainColumn"]/section[3]/div/div/ul/li[1]/div[2]')
studio = ('studio', '//*[@id="mainColumn"]/section[3]/div/div/ul/li[9]/div[2]/a')

In [127]:
variables = [score_c, rating_c, n_reviews_c, fresh_c, rotten_c, score_u, rating_u, n_reviews_u, director, 
            box_office, runtime, in_theatre, on_disc, audience, studio]

## Second step: go to each url in the driver and retrieve the info.

The info will be stored in a dictionary with the url as key and a list of (variable, value) in the values. This dictionary will be merged with the df containing titles and urls later on.
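To make the intended structure concrete, here is a small sketch of how such a dictionary can be flattened into one record per movie before merging. The helper name `ratings_to_rows` is made up for this illustration; the sample values are taken from the output shown further down.

```python
def ratings_to_rows(all_ratings):
    """Turn {url: [(name, value), ...]} into a list of flat dicts, one per movie."""
    rows = []
    for url, info in all_ratings.items():
        row = {"url": url}
        row.update(dict(info))  # each (name, value) pair becomes a column
        rows.append(row)
    return rows


sample = {
    "https://www.rottentomatoes.com/m/passengers_2016": [
        ("score_c", "31"),
        ("director", "Morten Tyldum"),
        ("box_office", "$100,014,092"),
    ],
    "https://www.rottentomatoes.com/m/beauty_and_the_beast_2014": [
        ("score_c", "32"),
        ("director", "Christophe Gans"),
        ("box_office", float("nan")),  # missing on the page
    ],
}

rows = ratings_to_rows(sample)
print(rows[0])
```

A list like `rows` can be passed to `pd.DataFrame(rows)`, and the result merged with the title/url df on the `url` column (for example `all_data.merge(ratings_df, on='url', how='left')`, where `ratings_df` is a hypothetical name).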

In [None]:
all_ratings = {}
# the loop runs over all links; while testing, slice all_data.url to try the code on a few movies first.
# sometimes not all variables are specified for a certain movie, so a try + except clause is needed.

for link in all_data.url:
    driver.get(link)
    info = []
    for (name, var) in variables:
        try:
            text = driver.find_element_by_xpath(var).text
            info.append((name, text))
        except Exception:  # e.g. the element is missing on this page
            info.append((name, np.nan))
    all_ratings[link] = info

In [118]:
all_ratings

{'https://www.rottentomatoes.com/m/beauty_and_the_beast_2014': [('score_c',
   '32'),
  ('rating_c', 'Average Rating: 4.6/10'),
  ('n_reviews_c', '19'),
  ('fresh_c', '6'),
  ('rotten_c', '13'),
  ('score_u', '53%'),
  ('rating_u', 'Average Rating: 3.3/5'),
  ('n_reviews_u', 'User Ratings: 3,682'),
  ('director', 'Christophe Gans'),
  ('box_office', nan),
  ('runtime', nan),
  ('in_theatre', 'Sep 23, 2016'),
  ('on_disc', 'Feb 21, 2017'),
  ('audience',
   'PG-13 (for some action violence, peril and frightening images)'),
  ('studio', nan)],
 'https://www.rottentomatoes.com/m/passengers_2016': [('score_c', '31'),
  ('rating_c', 'Average Rating: 4.9/10'),
  ('n_reviews_c', '229'),
  ('fresh_c', '70'),
  ('rotten_c', '159'),
  ('score_u', '63%'),
  ('rating_u', 'Average Rating: 3.5/5'),
  ('n_reviews_u', 'User Ratings: 52,572'),
  ('director', 'Morten Tyldum'),
  ('box_office', '$100,014,092'),
  ('runtime', '116 minutes'),
  ('in_theatre', 'Dec 21, 2016'),
  ('on_disc', 'Mar 14, 2017'),

In [132]:
len(all_ratings)
# Ok, before being kicked off I got 734 movies done. Not bad.

734

Pickle it! 

In [129]:
import pickle

In [130]:
# use a context manager so the file is closed (and flushed) after writing
with open("all_ratings.pkl", "wb") as f:
    pickle.dump(all_ratings, f)
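Since the Selenium run was cut off partway through, one way to avoid starting over is to save intermediate results and skip urls that are already done. The sketch below is only a suggestion and was not part of the original run; `load_checkpoint`, `save_checkpoint`, `pending_urls` and the checkpoint file name are all hypothetical.

```python
import os
import pickle
import tempfile


def load_checkpoint(path):
    """Return the saved {url: info} dict, or an empty dict if nothing was saved yet."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {}


def save_checkpoint(results, path):
    """Write the results gathered so far to disk."""
    with open(path, "wb") as f:
        pickle.dump(results, f)


def pending_urls(all_urls, results):
    """Urls that still need scraping, in their original order."""
    return [url for url in all_urls if url not in results]


# Small demonstration in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "all_ratings_checkpoint.pkl")
urls = ["https://www.rottentomatoes.com/m/passengers_2016",
        "https://www.rottentomatoes.com/m/borrower"]

results = load_checkpoint(path)          # empty on the first run
results[urls[0]] = [("score_c", "31")]   # pretend the first movie was scraped
save_checkpoint(results, path)

todo = pending_urls(urls, load_checkpoint(path))
print(todo)
```

In the scraping loop, one would iterate over `pending_urls(all_data.url, all_ratings)` and call `save_checkpoint` every so many movies.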

## *Scrapy*

## First step: get all the xpaths

Through the Scrapy shell, test how to get each necessary piece of information.

In [None]:
score_c = response.xpath('//a[@id="tomato_meter_link"]/span/span/text()').extract_first()
rating_c = response.xpath('//*[@id="scoreStats"]/div[1]/text()').extract()[1].strip()
n_reviews_c = response.xpath('//*[@id="scoreStats"]/div[2]/span[2]/text()').extract()[0]
fresh_c = response.xpath('//*[@id="scoreStats"]/div[3]/span[2]/text()').extract()[0]
rotten_c = response.xpath('//*[@id="scoreStats"]/div[4]/span[2]/text()').extract()[0]
score_u = response.xpath('//*[@id="scorePanel"]/div[2]/div[1]/a/div/div[2]/div[1]/span/text()').extract()[0]
rating_u = response.xpath('//*[@id="scorePanel"]/div[2]/div[2]/div[1]/text()').extract()[1].strip()
n_reviews_u = response.xpath('//*[@id="scorePanel"]/div[2]/div[2]/div[2]/text()').extract()[1].strip()
director = response.xpath('//*[@id="mainColumn"]/section[3]/div/div/ul/li[3]/div[2]/a/text()').extract()[0]
box_office = response.xpath('//div[contains(text(),"Box Office:") and not(descendant::*)]//following-sibling::div/text()').extract_first()
runtime = response.xpath('//*[@id="mainColumn"]/section[3]/div/div/ul/li[8]/div[2]/time/text()').extract()[0].strip()
in_theatre = response.xpath('//*[@id="mainColumn"]/section[3]/div/div/ul/li[5]/div[2]/time/text()').extract()[0]
on_disc = response.xpath('//*[@id="mainColumn"]/section[3]/div/div/ul/li[6]/div[2]/time/text()').extract()[0]
audience = response.xpath('//*[@id="mainColumn"]/section[3]/div/div/ul/li[1]/div[2]/text()').extract()[0]
studio = response.xpath('//*[@id="mainColumn"]/section[3]/div/div/ul/li[9]/div[2]/a/text()').extract()[0]

## Second step: create Spider

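When an xpath matches nothing, `extract_first()` simply returns `None`, so unlike with Selenium, no try-except clauses are needed for the lookups themselves. Note that chaining `.strip()` or an index onto a missing result would still raise an error for that page.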

In [None]:
import scrapy
import numpy as np
import pandas as pd


class ExampleSpider(scrapy.Spider):
    name = 'rotten_spider2'

    custom_settings = {
        "DOWNLOAD_DELAY": 3,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 3,
        "HTTPCACHE_ENABLED": True
    }

    df = pd.read_pickle('/Users/aleksandra/ds/metis/project_luther/all_data.pkl')
    start_urls = df.url.tolist()

    def parse(self, response):
        url = response.url
        score_c = response.xpath('//a[@id="tomato_meter_link"]/span/span/text()').extract_first()
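        # default='' keeps .strip() from failing when the node is missing; "or None" restores None
        rating_c = response.xpath('//*[contains(text(), "Average Rating")]/following-sibling::text()').extract_first(default='').strip() or None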
        n_reviews_c = response.xpath('//*[contains(text(), "Reviews Counted")]/following-sibling::span/text()').extract_first()
        fresh_c = response.xpath('//*[contains(text(), "Fresh")]/following-sibling::span/text()').extract_first()
        rotten_c = response.xpath('//*[contains(text(), "Rotten")]/following-sibling::span/text()').extract_first()
        score_u = response.xpath('//div[@class="meter-value"]/span/text()').extract_first()
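        # the second "Average Rating" match is the audience one; fall back to None when it is absent
        avg_ratings = response.xpath('//*[contains(text(), "Average Rating")]/following-sibling::text()').extract()
        rating_u = avg_ratings[1].strip() if len(avg_ratings) > 1 else None
        n_reviews_u = response.xpath('//*[contains(text(), "User Ratings")]/following-sibling::text()').extract_first(default='').strip() or None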
        director = response.xpath('//*[contains(text(), "Directed By")]/following-sibling::div/a/text()').extract_first()
        box_office = response.xpath('//*[contains(text(), "Box Office:")]/following-sibling::div/text()').extract_first()
        runtime = response.xpath('//*[contains(text(), "Runtime")]/following-sibling::div/time/@datetime').extract_first()
        in_theatre = response.xpath('//*[contains(text(), "In Theaters")]/following-sibling::div/time/@datetime').extract_first()
        on_disc = response.xpath('//*[contains(text(), "On Disc")]/following-sibling::div/time/@datetime').extract_first()
        audience = response.xpath('//*[contains(text(), "Rating")]/following-sibling::div/text()').extract_first()
        studio = response.xpath('//*[contains(text(), "Studio")]/following-sibling::div/a/text()').extract_first()

        yield {
            'url': url,
            'score_c': score_c,
            'rating_c': rating_c,
            'n_reviews_c' : n_reviews_c,
            'fresh_c': fresh_c,
            'rotten_c': rotten_c,
            'score_u':score_u,
            'rating_u': rating_u,
            'n_reviews_u': n_reviews_u,
            'director': director,
            'box_office': box_office,
            'runtime': runtime,
            'in_theatre': in_theatre,
            'on_disc': on_disc,
            'audience': audience,
            'studio': studio}



Run the script and save the output as csv (for instance with `scrapy runspider` and its `-o` option)!

**Debugging:** with the first script (rotten_spider.py), I used indexes in the xpaths to find the right node. However, Rotten Tomatoes pages do not have a uniform structure. When certain variables are missing, their nodes are simply absent from the html, and the indexing shifts. As a consequence, I had a lot of None in the data which shouldn't have been None. I solved this by finding nodes by their text (e.g. "Box Office") and specifying the path to the wanted variable from there. If the text node is not on the page (and hence the variable that I want isn't there either), Scrapy yields None and continues with the next response.  

**Lesson learned:** avoid using indexes in xpaths; anchor on things that are less likely to vary.
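The effect can be reproduced on a toy example. The sketch below uses the standard library's `xml.etree.ElementTree` (which supports only a small XPath subset) rather than Scrapy, and the markup is a simplified, invented stand-in for the Rotten Tomatoes info list; `by_index` and `by_label` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# A page where every row is present.
FULL = """<ul>
  <li><div>Directed By:</div><div>Morten Tyldum</div></li>
  <li><div>Box Office:</div><div>$100,014,092</div></li>
  <li><div>Runtime:</div><div>116 minutes</div></li>
</ul>"""

# Same layout, but the "Box Office:" row is absent, so later rows shift up.
PARTIAL = """<ul>
  <li><div>Directed By:</div><div>Morten Tyldum</div></li>
  <li><div>Runtime:</div><div>116 minutes</div></li>
</ul>"""


def by_index(root, position):
    """Positional lookup: value of the n-th row, whatever that row happens to be."""
    node = root.find("li[%d]/div[2]" % position)
    return node.text if node is not None else None


def by_label(root, label):
    """Label-based lookup: value of the row whose first div holds `label`."""
    node = root.find("li[div='%s']/div[2]" % label)
    return node.text if node is not None else None


full = ET.fromstring(FULL)
partial = ET.fromstring(PARTIAL)

# Box office is the 2nd row on the full page...
print(by_index(full, 2))
# ...but on the partial page the 2nd row is the runtime: silently wrong data.
print(by_index(partial, 2))
# The label-based lookup returns None instead of a wrong value.
print(by_label(partial, "Box Office:"))
print(by_label(partial, "Runtime:"))
```

The positional lookup returns the runtime where the box office was expected, while the label-based lookup either finds the right row or reports that it is missing.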