# Scraping Yelp

The aim of this exercise is to let a user run an automated search on <a href="https://www.yelp.fr/" target="_blank">Yelp</a> and store the results in a `.json` file. You will be guided through the different steps: submitting a form request with search keywords, parsing the search results, crawling all the result pages, and storing the results in a file.

⚠ **As scrapy is not made to launch several crawler processes in the same script, you will have to restart your notebook's kernel before completing each question!**

1. Create a class `YelpSpider(scrapy.Spider)` with `start_urls = ['https://www.yelp.fr/']`. In this class, define a `parse(self, response)` method that automatically fills Yelp's homepage form with: "restaurant japonais" as search keywords and "Paris" as search location. Then, define another method `after_search(self, response)` that parses the first page of results, and yields the name and url of each search result. Finally, declare a `CrawlerProcess` that will store the results in a file named `"restaurant_japonais-paris.json"`.

In [3]:
!pip install scrapy

In [9]:
from urllib.parse import urlencode

# Ask the user for the search keywords (quoi) and the location (ou)
quoi = input('What do you want to eat?\n')
ou = input('Where do you want to eat?\n')
# urlencode escapes spaces and accents in the user's input
url = 'https://www.yelp.fr/search?' + urlencode({'find_desc': quoi, 'find_loc': ou, 'start': 10})
print('Starting a Yelp search for ' + quoi + ' in ' + ou + '\n' + url)


What do you want to eat?
pizza
Where do you want to eat?
lyon
Starting a Yelp search for pizza in lyon
https://www.yelp.fr/search?find_desc=pizza&find_loc=lyon&start=10
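Since the exercise also asks for every result page to be crawled, the same query string can be generated for several pages at once. The sketch below assumes that `start` is a zero-based result offset and that each page holds 10 results. Neither is confirmed here, so check both against the live site. `build_search_urls` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_search_urls(keywords, location, pages=3, page_size=10):
    """Return one Yelp search URL per result page.

    ASSUMPTION: `start` is a zero-based offset and a page holds
    `page_size` results.
    """
    base = "https://www.yelp.fr/search?"
    return [
        # urlencode escapes spaces and accents in the user's input
        base + urlencode({
            "find_desc": keywords,
            "find_loc": location,
            "start": page * page_size,
        })
        for page in range(pages)
    ]


urls = build_search_urls("restaurant japonais", "Paris", pages=2)
```

A list like this could be assigned to a spider's `start_urls` as an alternative to following the "next" link page by page.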


In [5]:
# Import os => Library used to interact with the operating system
## More info => https://docs.python.org/3/library/os.html
import os 

# Import logging => Library used to emit and configure log messages
## More info => https://docs.python.org/3/library/logging.html
import logging

# Import scrapy and scrapy.crawler 
import scrapy
from scrapy.crawler import CrawlerProcess

In [6]:
# Define your class YelpSpider(scrapy.Spider) with all methods needed

In [7]:
# from scrapy.http import FormRequest
from scrapy.http import Request
class YelpSpider(scrapy.Spider):
    name = 'yelp'
    allowed_domains = ['yelp.fr']
    start_urls = [url]

    def parse(self, response):
        for restau in response.xpath('//h4/span'):
            yield {
                'restau': restau.xpath('.//a/@name').extract(),
                'restau_link': restau.xpath('.//a/@href').extract()
            }

        # Select the NEXT button and store its link in next_page
        next_page = response.xpath('//a[contains(@class, "next-link")]/@href').get()

        if next_page is None:
            # On the last page there is no "next" link, so .get() returns None
            logging.info('No next page.')
        else:
            # If a next page is found, run the parse method on it
            yield response.follow(next_page, self.parse)
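Two details of the pagination step are worth spelling out. A selector's `.get()` returns `None` when nothing matches, and `response.follow` accepts a relative `href` and resolves it against the current page URL. The standard-library sketch below imitates that resolution with `urljoin` so it can be tried without launching a crawl. `resolve_next_page` is a hypothetical helper, not part of Scrapy.

```python
from urllib.parse import urljoin


def resolve_next_page(current_url, next_href):
    """Return the absolute URL of the next page, or None on the last page."""
    if next_href is None:
        # No "next" link was found, so there is nothing to follow
        return None
    # Relative links are resolved against the page they were found on
    return urljoin(current_url, next_href)


current = "https://www.yelp.fr/search?find_desc=pizza&find_loc=lyon&start=10"
following = resolve_next_page(current, "/search?find_desc=pizza&find_loc=lyon&start=20")
```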

In [8]:
# Name of the file where the results will be saved
filename = quoi.replace(' ', '_') + '_' + ou + '.csv'

## More info on built-in settings => https://docs.scrapy.org/en/latest/topics/settings.html?highlight=settings#settings
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': logging.INFO,
    # Scrapy ignores module-level variables here, so pass these as settings
    'DEFAULT_REQUEST_HEADERS': {
        'Referer': 'https://www.google.com'
    },
    # Do not obey robots.txt rules (set to True to respect them)
    'ROBOTSTXT_OBEY': False,
    "FEEDS": {
        'content/' + filename: {"format": "csv"},
    },
    "AUTOTHROTTLE_ENABLED": True  # AutoThrottle Here!
})

# Start the crawling using the spider you defined above
process.crawl(YelpSpider)
process.start()

2021-07-05 13:21:38 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: scrapybot)
2021-07-05 13:21:38 [scrapy.utils.log] INFO: Versions: lxml 4.2.6.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.7.10 (default, May  3 2021, 02:48:31) - [GCC 7.5.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
2021-07-05 13:21:38 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20,
 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2021-07-05 13:21:38 [scrapy.extensions.telnet] INFO: Telnet Password: 659889b5e1d12126
2021-07-05 13:21:38 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2021-07-05 13:21:39 [scrapy.middleware] INFO: Enabled downloader mid