# Data

Let's begin by loading some data, as this is required for all data science and machine learning projects.


## Scraping Review URLs

To collect that data we need to build a spider for web scraping. I won't go into depth on this part, as it is outside the scope of our topic.

We use a little object-oriented programming to create a spider class that scrapes links to movie reviews, which we will later use to fetch and parse the reviews themselves.

In [3]:
import scrapy
from scrapy import Selector
from scrapy.crawler import CrawlerProcess


# Custom spider class
class MovieSpider(scrapy.Spider):
    name = "Horror_Movie_Spider"

    # start_requests method - requests two pages of horror search results
    def start_requests(self):
        urls = [
            'https://www.imdb.com/search/title/?genres=horror&explore=title_type,genres',
            'https://www.imdb.com/search/title/?genres=horror&start=51&explore=title_type,genres&ref_=adv_nxt',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.movie_links)

    # movie_links parse method - finds the link for each movie
    def movie_links(self, response):
        movie_links = response.xpath('//div[@id="main"]//div[@class="lister-item mode-advanced"]//div[@class="lister-item-content"]//h3[@class="lister-item-header"]//a/@href')
        # Extract each link as a string and follow it
        for link in movie_links.extract():
            yield response.follow(url=link, callback=self.review_links)

    # review_links parse method - finds the link to the reviews page
    def review_links(self, response):
        review_links = response.xpath('//div[@id="quicklinksBar"]//a[3]//@href')
        for link in review_links.extract():
            yield response.follow(url=link, callback=self.reviews)

    # reviews parse method - pulls the URL of the first review
    def reviews(self, response):
        relativeurl = response.xpath('//div[@class="lister-list"]/div[1]//a[@class="title"]/@href').extract()
        for sel in relativeurl:
            yield {'url': response.urljoin(sel)}

   

In [None]:
process = CrawlerProcess() # initiate the CrawlerProcess

process.crawl(MovieSpider) # tell the process which spider to use

process.start() # start the crawling process
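
The cells above do not show how the yielded items end up in a file. As a minimal sketch, assuming the items have been collected into a list of `{'url': ...}` dicts, they could be written to a one-column CSV with a header row. The helper name `write_review_urls` and the example URLs are hypothetical; Scrapy's built-in feed exports are another option.

```python
import csv


def write_review_urls(items, path):
    """Write items (dicts with a 'url' key) to a one-column CSV with a header row."""
    with open(path, "w", newline="") as f:
        # extrasaction="ignore" drops any keys other than 'url'
        writer = csv.DictWriter(f, fieldnames=["url"], extrasaction="ignore")
        writer.writeheader()
        writer.writerows(items)


if __name__ == "__main__":
    import os
    import tempfile

    # Placeholder items standing in for what the spider yields
    demo_items = [
        {"url": "https://example.com/review/1/"},
        {"url": "https://example.com/review/2/"},
    ]
    demo_path = os.path.join(tempfile.mkdtemp(), "ReviewUrls_demo.csv")
    write_review_urls(demo_items, demo_path)
    print(f"Wrote {len(demo_items)} URLs to {demo_path}")
```

A file written this way has a header followed by one URL per row, so a reader that skips the first row and flattens the rest gets back a plain list of URLs.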

## Scraping Reviews

I saved the review URLs to a CSV file so that the spider does not have to be rerun, which risks being blocked for web scraping. I also pickled the reviews and titles as lists.
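
On the risk of being blocked: Scrapy has settings that slow a crawl down. The sketch below is only an illustration of that idea. The name `POLITE_SETTINGS` is hypothetical, and the values are assumptions rather than the configuration used for this project.

```python
# Illustrative values only; tune them for the site being crawled.
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,               # respect the site's robots.txt
    "DOWNLOAD_DELAY": 2.0,                # seconds to wait between requests to the same site
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per domain
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server response times
}

# Usage with the spider defined earlier (not run here):
# process = CrawlerProcess(settings=POLITE_SETTINGS)
# process.crawl(MovieSpider)
# process.start()
```

Passing a settings dict to `CrawlerProcess` keeps the configuration next to the crawl itself, which suits a notebook better than a separate Scrapy project file.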

In [6]:
import csv

import requests
from bs4 import BeautifulSoup

# Load the review URLs collected by the spider
with open('../data/ReviewUrls.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader, None)  # Skip header row
    URLlist = [val for row in reader for val in row]  # Flatten rows into a list of URLs

titles_list = []
reviews_list = []

for url in URLlist:
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raises an exception for error codes (4xx or 5xx)
    except requests.HTTPError as e:
        print(f"Error ({e.response.status_code}): {url}")
        continue

    soup = BeautifulSoup(response.content, "html.parser")
    review = soup.find('div', {'class': 'text show-more__control'})
    title_block = soup.find('div', {'class': 'subpage_title_block'})

    # Skip pages missing either element so the two lists stay aligned
    if review is None or title_block is None:
        print(f"Skipped (missing review or title): {url}")
        continue

    reviews_list.append(review.get_text())
    titles_list.extend(title_block.h1.a.contents)


Error (404): https://www.imdb.com/review/rw6073713/
Error (404): https://www.imdb.com/review/rw4236449/
Error (404): https://www.imdb.com/review/rw6112553/
Error (404): https://www.imdb.com/review/rw5867239/
Error (404): https://www.imdb.com/review/rw5954517/
Error (404): https://www.imdb.com/review/rw1621039/
Error (404): https://www.imdb.com/review/rw5925550/
Error (404): https://www.imdb.com/review/rw2833730/
Error (404): https://www.imdb.com/review/rw5500399/
Error (404): https://www.imdb.com/review/rw5489629/
Error (404): https://www.imdb.com/review/rw6157062/
Error (404): https://www.imdb.com/review/rw6073667/
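
The pickling step is not shown above. A minimal sketch of it, assuming the standard `pickle` module; the helper names and file paths are hypothetical:

```python
import pickle


def save_pickle(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load a previously pickled object from path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Usage with the lists built above (paths are placeholders):
# save_pickle(titles_list, "../data/titles.pkl")
# save_pickle(reviews_list, "../data/reviews.pkl")
# titles_list = load_pickle("../data/titles.pkl")
```

Loading from these files in later sessions avoids repeating the HTTP requests. Converting the titles to plain `str` before saving is a reasonable precaution, since objects taken from a BeautifulSoup tree can carry references to the parsed document.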
