## Exercise sheet \#3

In this exercise, you will implement a web scraping engine that retrieves data from the well-known [Internet Movie Database](https://www.imdb.com/) (IMDB).
The data we are interested in are **movie ratings**.

Before starting, keep in mind that requesting web pages can be time-consuming. To build a large dataset, we need to pay attention to the number of requests sent by our scraping engine, and to their frequency, to avoid being blacklisted by IMDB's servers.
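
As an illustration of this point, here is a minimal sketch of a "polite" download loop. The function name `fetch_politely` and its parameters are hypothetical; the `fetch` callable stands for whatever actually downloads a page. Each distinct URL is requested only once, and a pause is inserted between consecutive requests.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    delay_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bytes]:
    """Fetch each distinct URL once, pausing between consecutive requests."""
    pages: Dict[str, bytes] = {}
    for url in urls:
        if url in pages:
            continue  # already retrieved: do not send a second request
        if pages:
            sleep(delay_seconds)  # pause before every request except the first
        pages[url] = fetch(url)
    return pages
```

With the `requests` library, `fetch` could for instance be `lambda url: requests.get(url).content`. Passing `sleep` as a parameter makes it possible to try the function out without actually waiting.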

### Question 1

Let us inspect the following IMDB entry: [Movie tt0211915](https://www.imdb.com/title/tt0211915/).

On this page, you can find at least two ratings: the one by IMDB and the one by Metacritic. Pressing Ctrl+U shows the page's source code, and in Chrome, pressing F12 opens the DevTools panel.

Where are these ratings located in the underlying HTML code?

_To be completed_

### Question 2
Write a Python program that retrieves the page located at the URL given above, parses it, and extracts the ratings of that movie.

In [2]:
import requests
from bs4 import BeautifulSoup

# Download the movie page and parse its HTML.
page = requests.get('https://www.imdb.com/title/tt0211915/')
soup = BeautifulSoup(page.content, 'html.parser')
# print(soup.prettify())

# Metacritic score: first element whose class attribute is exactly this string.
metacritic = soup.find_all(class_='metacriticScore score_favorable titleReviewBarSubItem')[0]
score = metacritic.get_text()

# IMDB user rating.
reviewer = soup.find_all(class_='ratingValue')[0]
score2 = reviewer.get_text()
print(score, score2)


69
 
8.3/10 
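
The two strings printed above carry surrounding whitespace, and the IMDB rating ends with `/10`. Below is a small sketch of how they could be turned into numbers. The helper names `parse_imdb_rating` and `parse_metascore` are hypothetical, and the sketch assumes the strings always have the shape shown in the output above.

```python
from typing import Optional


def parse_imdb_rating(text: str) -> Optional[float]:
    """Turn a string such as ' 8.3/10 ' into 8.3; return None if it does not fit."""
    value = text.strip().split('/')[0]
    try:
        return float(value)
    except ValueError:
        return None


def parse_metascore(text: str) -> Optional[int]:
    """Turn a string such as ' 69 ' into 69; return None if it is not a number."""
    value = text.strip()
    return int(value) if value.isdigit() else None
```

Returning `None` rather than raising an exception keeps a missing score from interrupting a longer scraping run.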


### Question 3

By using IMDB's advanced search feature, can you list the movies released in 2018 by decreasing number of ratings (not by decreasing rating!)? Describe the actions you performed on IMDB's web interface.

_To be completed_

The URL of the first page of the search results looks like the following (250 results per page to save time during future retrieval):

`https://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31&sort=num_votes,desc&count=250`

Note also the URL of the second page of the search results:

`https://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31&sort=num_votes,desc&count=250&start=251&ref_=adv_nxt`
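
The two URLs above suggest that pages are addressed through the `start` parameter. The following sketch builds the URL of any results page under that assumption: only the values for pages 1 and 2 come from the URLs above, and the pattern for later pages (501, 751, ...) is an extrapolation. The function name `search_page_url` is hypothetical, and the `ref_` parameter is left out on the assumption that it is not needed to obtain the results.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.imdb.com/search/title'
RESULTS_PER_PAGE = 250


def search_page_url(page_number: int) -> str:
    """Return the URL of the given results page (numbered from 1)."""
    params = {
        'release_date': '2018-01-01,2018-12-31',
        'sort': 'num_votes,desc',
        'count': RESULTS_PER_PAGE,
    }
    if page_number > 1:
        # Page 2 starts at result 251; later pages are assumed to follow suit.
        params['start'] = (page_number - 1) * RESULTS_PER_PAGE + 1
    # safe=',' keeps the commas as they are instead of encoding them as %2C.
    return BASE_URL + '?' + urlencode(params, safe=',')
```

For example, `[search_page_url(n) for n in range(1, 11)]` gives the list of the first ten page URLs.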

### Question 4

Write a Python program that retrieves the first ten pages of results and stores them locally (that is, saves a local copy of each of them). Apply a 3-second sleep between requests.

In [11]:
import scrapy

class movieSpider1(scrapy.Spider):
    name = 'MovieSpider1'
    start_urls = ['https://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31&sort=num_votes,desc&count=250']
    # Ask Scrapy to wait about 3 seconds between requests; calling time.sleep()
    # inside a callback would block Scrapy's event loop.
    custom_settings = {'DOWNLOAD_DELAY': 3}

    def __init__(self, *args, **kwargs):
        super(movieSpider1, self).__init__(*args, **kwargs)
        self.counter = 0  # number of pages saved so far

    def parse(self, response):
        # Save a local copy of the current results page.
        filename = 'movie_search-%s.html' % self.counter
        self.counter += 1
        with open(filename, 'wb') as f:
            f.write(response.body)
        # Follow the "Next" link until ten pages have been saved.
        next_page = response.css('a.lister-page-next.next-page::attr(href)').extract_first()
        if next_page is not None and self.counter < 10:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)



### Question 5
Describe the path where the movie information appears in the HTML tree of the retrieved web pages.

_To be completed_

### Question 6
Extend the Python program from Question 4 so that the following information is extracted for each movie:

- name of the movie
- year of release
- number of votes
- IMDB rating
- Metacritic score


In [12]:
import scrapy

class movieSpider2(scrapy.Spider):
    name = 'MovieSpider2'
    start_urls = ['https://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31&sort=num_votes,desc&count=250']
    custom_settings = {'DOWNLOAD_DELAY': 3}  # wait about 3 seconds between requests

    def __init__(self, *args, **kwargs):
        super(movieSpider2, self).__init__(*args, **kwargs)
        self.counter = 0  # number of pages parsed so far

    def parse(self, response):
        self.counter += 1
        # Each movie sits in its own 'lister-item' block: query every field
        # relative to that block so that all movies are extracted, not only the first.
        for movie in response.css('div.lister-item'):
            title = movie.css('img[class="loadlate"]::attr(alt)').extract_first()
            year = movie.css('span[class="lister-item-year text-muted unbold"]::text').extract_first()
            num_votes = movie.css('span[name="nv"]::attr(data-value)').extract_first()
            rating = movie.css('div[class="inline-block ratings-imdb-rating"]::attr(data-value)').extract_first()
            # Match any Metascore, not only the 'favorable' ones; None when absent.
            meta = movie.css('span.metascore::text').extract_first()
            print(title, year, num_votes, rating, meta)
        next_page = response.css('a[class="lister-page-next next-page"]::attr(href)').extract_first()
        if next_page is not None and self.counter < 10:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

### Question 7
Extend the Python program from Question 6 so that the results of extraction are stored in a [CSV file](https://en.wikipedia.org/wiki/Comma-separated_values).
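
As a reminder of how the standard `csv` module can be used here, the sketch below writes a header line once, followed by one line per movie. The function name `write_movies` and the column names in `FIELDNAMES` are illustrative choices, not imposed by the exercise.

```python
import csv
import io
from typing import Dict, Iterable, Optional, TextIO

FIELDNAMES = ['title', 'year', 'num_votes', 'imdb_rating', 'metascore']


def write_movies(rows: Iterable[Dict[str, Optional[str]]], out: TextIO) -> int:
    """Write a header line, then one CSV line per movie; return the row count."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)  # None values are written as empty fields
        count += 1
    return count
```

To write to disk, open the file once with `open('movie_ratings.csv', 'w', newline='')` and pass it as `out`. Opening the file in `'w'` mode for every page would overwrite the lines written for the previous pages.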

In [13]:
import scrapy, csv

class movieSpider3(scrapy.Spider):
    name = 'MovieSpider3'
    start_urls = ['https://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31&sort=num_votes,desc&count=250']
    custom_settings = {'DOWNLOAD_DELAY': 3}  # wait about 3 seconds between requests

    def __init__(self, *args, **kwargs):
        super(movieSpider3, self).__init__(*args, **kwargs)
        self.counter = 0  # number of pages parsed so far
        # Create the CSV file once and write the header line.
        with open('movie_ratings.csv', 'w', newline='', encoding='utf-8') as csvfile:
            csv.writer(csvfile).writerow(('title', 'year', 'num votes', 'IMDB rating', 'Metacritic score'))

    def parse(self, response):
        self.counter += 1
        # Append one line per movie found on this page ('a' keeps earlier pages).
        with open('movie_ratings.csv', 'a', newline='', encoding='utf-8') as csvfile:
            writer = csv.writer(csvfile)
            for movie in response.css('div.lister-item'):
                title = movie.css('img[class="loadlate"]::attr(alt)').extract_first()
                year = movie.css('span[class="lister-item-year text-muted unbold"]::text').extract_first()
                num_votes = movie.css('span[name="nv"]::attr(data-value)').extract_first()
                rating = movie.css('div[class="inline-block ratings-imdb-rating"]::attr(data-value)').extract_first()
                meta = movie.css('span.metascore::text').extract_first()
                writer.writerow((title, year, num_votes, rating, meta))

        next_page = response.css('a[class="lister-page-next next-page"]::attr(href)').extract_first()
        if next_page is not None and self.counter < 10:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

### Question 8
Manually inspect the results of your data extraction. Are there any errors (missing information, ill-formed data, etc.)? If so, extend your Python script to fix them.
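
One possible way to handle such errors is to normalise every extracted row before storing it. The sketch below is only an illustration: the function name `clean_row` and the dictionary keys are hypothetical, and it assumes that years may appear as `(2018)` or `(I) (2018)`, that vote counts may contain thousands separators, and that the Metascore may be missing.

```python
import re
from typing import Dict, Optional

YEAR_PATTERN = re.compile(r'(\d{4})')
DECIMAL_PATTERN = re.compile(r'\d+(\.\d+)?')


def clean_row(raw: Dict[str, Optional[str]]) -> Dict[str, object]:
    """Normalise one scraped row; fields that cannot be parsed become None."""
    def text(key: str) -> str:
        value = raw.get(key)
        return value.strip() if value else ''

    year_match = YEAR_PATTERN.search(text('year'))
    votes = text('num_votes').replace(',', '')
    rating = text('imdb_rating')
    meta = text('metascore')
    return {
        'title': text('title') or None,
        'year': int(year_match.group(1)) if year_match else None,
        'num_votes': int(votes) if votes.isdigit() else None,
        'imdb_rating': float(rating) if DECIMAL_PATTERN.fullmatch(rating) else None,
        'metascore': int(meta) if meta.isdigit() else None,
    }
```

Rows containing `None` can then be counted, logged, or dropped, depending on how the dataset will be used.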

In [None]:
if __name__ == '__main__':
    import scrapy.crawler
    process = scrapy.crawler.CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
    # Pass the spider classes: Scrapy creates the spider instances itself.
    process.crawl(movieSpider1)
    process.crawl(movieSpider2)
    process.crawl(movieSpider3)
    process.start()  # blocks until all three crawls have finished

2019-03-14 00:48:41 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: scrapybot)
2019-03-14 00:48:41 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.1a  20 Nov 2018), cryptography 2.4.2, Platform Windows-10-10.0.17134-SP0
2019-03-14 00:48:41 [scrapy.crawler] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2019-03-14 00:48:41 [scrapy.extensions.telnet] INFO: Telnet Password: 6902157c311110ad
2019-03-14 00:48:41 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-03-14 00:48:41 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeo

## References
On-line [tutorial](https://www.dataquest.io/blog/web-scraping-beautifulsoup/) entitled "Web Scraping with Python and BeautifulSoup" by Alex Olteanu.