This notebook provides some starter code to help you get going on the web scraping project.

In [10]:
# do our imports
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

In [3]:
# load the url into the scraper
url = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London&start=0'
req = requests.get(url)
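# name the parser explicitly so BeautifulSoup doesn't have to guess one (and warn about it)
scraper = BeautifulSoup(req.text, 'html.parser')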

In [4]:
# now we'll get the restaurant titles by first getting all of the links with the class used for titles
# (class names like this one are auto-generated, so the value may change over time)
links = scraper.find_all('a', {'class': 'css-166la90'})
# and here we are
links

[<a class="css-166la90" href="/biz/the-mayfair-chippy-london-2?osq=Restaurants" name="The Mayfair Chippy" rel="" target="">The Mayfair Chippy</a>,
 <a class="css-166la90" href="/biz/restaurant-gordon-ramsay-london-3?osq=Restaurants" name="Restaurant Gordon Ramsay" rel="" target="">Restaurant Gordon Ramsay</a>,
 <a class="css-166la90" href="/biz/the-fat-bear-london?osq=Restaurants" name="The Fat Bear" rel="" target="">The Fat Bear</a>,
 <a class="css-166la90" href="/biz/dishoom-london?osq=Restaurants" name="Dishoom" rel="" target="">Dishoom</a>,
 <a class="css-166la90" href="/biz/flat-iron-london-2?osq=Restaurants" name="Flat Iron" rel="" target="">Flat Iron</a>,
 <a class="css-166la90" href="/biz/padella-london-3?osq=Restaurants" name="Padella" rel="" target="">Padella</a>,
 <a class="css-166la90" href="/biz/ffionas-restaurant-london?osq=Restaurants" name="Ffiona’s Restaurant" rel="" target="">Ffiona’s Restaurant</a>,
 <a class="css-166la90" href="/biz/the-queens-arms-london?osq=Restau

In [6]:
# in class we converted the links to text, and then filtered the titles by their length and by whether they were all digits
link_txt = [link.text for link in links]
# which gives us these results
link_txt

['The Mayfair Chippy',
 'Restaurant Gordon Ramsay',
 'The Fat Bear',
 'Dishoom',
 'Flat Iron',
 'Padella',
 'Ffiona’s Restaurant',
 'The Queens Arms',
 'Abeno',
 'The Golden Chippy',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '']

In [7]:
# and then filter further
# keep the text value of each link
titles = [link for link in link_txt
          # if it is not only digits
          if not link.isdigit()
          # and it's more than one character long
          and len(link) > 1]
# and here we are
titles

['The Mayfair Chippy',
 'Restaurant Gordon Ramsay',
 'The Fat Bear',
 'Dishoom',
 'Flat Iron',
 'Padella',
 'Ffiona’s Restaurant',
 'The Queens Arms',
 'Abeno',
 'The Golden Chippy']

In [8]:
# REMEMBER THOUGH!  There are often different ways of doing things
# For example, if you look at the links, you can see that every href that
# points to a restaurant contains the text '/biz/', so if we wanted, we could also use that
# return the link text for each link in the list
titles = [link.text for link in links
          # if the characters '/biz/' are inside the href attribute
          # note the notation -- you can read attributes with link['href'] or link['class']
          # as long as the attribute is present in the source code
          if '/biz/' in link['href']]
# this gets everything directly
titles

['The Mayfair Chippy',
 'Restaurant Gordon Ramsay',
 'The Fat Bear',
 'Dishoom',
 'Flat Iron',
 'Padella',
 'Ffiona’s Restaurant',
 'The Queens Arms',
 'Abeno',
 'The Golden Chippy']

### Using a Loop For Multiple Pages

If you want to go beyond the first page, you'll need to request the other result pages that are linked from the tabs at the bottom of the page.

As we saw in class, the url has an argument called `start`, which controls which portion of the results you see.  The url looked like this:

`https://www.yelp.com/search?find_desc=Restaurants&find_loc=London&start=0`

If you change the value of `start` to 10, 20, 30, etc., you'll move on to the following pages of results.
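As a small sketch of that idea, the snippet below builds the url for the first three pages. The helper name `build_search_url` and the list `page_urls` are made up for illustration and are not part of the class code; the sketch only assumes the `find_desc`, `find_loc`, and `start` arguments shown in the url above.

```python
from urllib.parse import urlencode

# base address of the search page (taken from the url shown above)
BASE_URL = 'https://www.yelp.com/search'

def build_search_url(start, desc='Restaurants', loc='London'):
    """Hypothetical helper: return the search url for results beginning at `start`."""
    # the dict keeps its insertion order, so the arguments come out in this order
    params = {'find_desc': desc, 'find_loc': loc, 'start': start}
    return f'{BASE_URL}?{urlencode(params)}'

# start values 0, 10, and 20 (range stops before 30)
page_urls = [build_search_url(start) for start in range(0, 30, 10)]
for page_url in page_urls:
    print(page_url)
```

Building the url from a dict of arguments is optional; an f-string with `{i}` in the `start` position works just as well for a fixed search.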

This means you can take the single-page scraping code from the earlier cells and put it into a loop to go through as many pages as you need.

Here's an example:

In [15]:
# where we'll store all of our lists
all_titles = []

# start at 0, go to 10, 20, 30, etc, all the way to 90 (range stops before 100)
for i in range(0, 100, 10):
    # set the url with the appropriate version of i
    url = f'https://www.yelp.com/search?find_desc=Restaurants&find_loc=London&start={i}'
    # connect to the url
    req = requests.get(url)
    # load it into the scraper
    scraper = BeautifulSoup(req.text, 'html.parser')
    # select the <a> tags with the associated class
    links   = scraper.find_all('a', {'class': 'css-166la90'})
    # keep only the links that have '/biz/' in the href attribute
    titles  = [link.text for link in links if '/biz/' in link['href']]
    # add these values to the master list
    all_titles += titles
    print(f"Finished round for restaurants starting at: {i}")
    # NOTE:  slowing down the speed at which you connect to a website makes you less likely to appear as a bot!
    time.sleep(3)

Finished round for restaurants starting at: 0
Finished round for restaurants starting at: 10
Finished round for restaurants starting at: 20
Finished round for restaurants starting at: 30
Finished round for restaurants starting at: 40
Finished round for restaurants starting at: 50
Finished round for restaurants starting at: 60
Finished round for restaurants starting at: 70
Finished round for restaurants starting at: 80
Finished round for restaurants starting at: 90
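
When you combine several pages, the same restaurant can show up more than once (in the collected list, 'Kennington Lane Cafe' and 'The Ivy' each appear twice). One possible way to drop the repeats while keeping the original order is sketched below. The names `sample_titles` and `unique_titles` are placeholders for illustration, and the sample is a handful of titles copied from the results rather than the full list.

```python
# a few titles copied from the scraped results, including the repeats
sample_titles = ['Kennington Lane Cafe',
                 'Jinjuu',
                 'BAO - Soho',
                 'Kennington Lane Cafe',
                 'Regency Café',
                 'The Ivy',
                 'The Ivy']

# dict keys are unique and keep insertion order,
# so this keeps the first copy of each title and drops the later ones
unique_titles = list(dict.fromkeys(sample_titles))
unique_titles
```

In the notebook you could apply the same line to `all_titles`. Using `set(all_titles)` would also remove repeats, but it would not preserve the order of the results.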


In [16]:
# and here we go
all_titles

['The Mayfair Chippy',
 'Restaurant Gordon Ramsay',
 'The Fat Bear',
 'Dishoom',
 'Flat Iron',
 'Padella',
 'Ffiona’s Restaurant',
 'The Queens Arms',
 'Abeno',
 'The Golden Chippy',
 'Barrafina',
 'Duck & Waffle',
 'Hawksmoor Seven Dials',
 'Mestizo',
 'Savoir Faire',
 'The Alchemist',
 'The Alfred Tennyson Belgravia',
 'Tito’s',
 'Bibimbap',
 'Sketch',
 'Ngon Ngon',
 'The Breakfast Club',
 'HOOK',
 'Circolo Popolare',
 'Lanzhou Noodle Bar',
 'Seoul Bakery',
 'Mother Mash',
 'The Roof Deck Restaurant and Bar, Selfridges',
 'Da Mario Restaurant',
 'Kennington Lane Cafe',
 'Jinjuu',
 'BAO - Soho',
 'Kennington Lane Cafe',
 'Regency Café',
 'Honey & Co',
 'Ye Olde Cheshire Cheese',
 'NOPI',
 'The Palomar Restaurant',
 'Yauatcha',
 'The Ivy',
 'The Ivy',
 'Gloria',
 'The Victoria',
 'The Barge House',
 'Piccolino',
 'The Ledbury',
 'Naru',
 'Tayyabs',
 'Orsini',
 'Duck & Waffle Local',
 'The Pig and Butcher',
 'Tokyo Sukiyaki-Tei',
 'Wahaca',
 'Shoryu Ramen',
 'Nambu Tei',
 'Cirilo Filipi