# Module 5 Homework - Let's Get Lunch

Suppose we are looking for a good place for lunch on our next trip to La Crosse. Let's collect some data on nearby restaurants.

## <font color="red"> Problem 1 </font>

Go to yelp.com and perform a search with the following parameters:

* **Find** Restaurants
* **Near** La Crosse, WI

**Tasks**
1. Copy the resulting web address below and determine how the specified search terms relate to the resulting address.
2. Use requests and Beautiful Soup to download the content of the front page.

In [1]:
# Import modules here
import requests
from bs4 import BeautifulSoup
import re # relevant later

In [2]:
# Get and process the Yelp search
s = requests.Session()
r = s.get('https://www.yelp.com/search?find_desc=Restaurants&find_loc=La+Crosse%2C+WI')
soup = BeautifulSoup(r.content, "html.parser")

In the resulting address, `find_desc` (presumably "description") holds the category of item searched for, and `find_loc` holds the location. Spaces are encoded as `+` and the comma as `%2C`.
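As a small sketch of how the search terms map onto the address, the standard library can build and decode the same query string. The variable names `base`, `params`, `url`, and `decoded` are only illustrative.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base = "https://www.yelp.com/search"
params = {"find_desc": "Restaurants", "find_loc": "La Crosse, WI"}

# urlencode turns spaces into "+" and the comma into "%2C"
url = f"{base}?{urlencode(params)}"
print(url)

# parse_qs reverses the encoding and returns each value as a list
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

The printed address should match the one passed to `s.get` above, which suggests that changing `find_desc` or `find_loc` is enough to run a different search.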

In [3]:
# Helper function
def go_up(tag, n):
    """Eliminates need to call .parent so many times.
    
    Args: 
        tag: bs4.element.Tag
        n: Integer value
        
    Returns:
        A bs4.element.Tag that is the nth parent, or None if no parent could be found.
    """
    t = tag
    for _ in range(n):
        try:
            t = t.parent
        except AttributeError:
            return None
    return t

## <font color="red"> Problem 2 </font>

We want to grab the restaurant's name.

1. Use Inspect Element to determine the tags/classes for each of the elements listed above.  
2. Note that all the business information is contained in a `div` that contains a class similar to `<div class=" ... businessName__09f24__3Wql2 ...">`.  You will need to use a regular expression to match the `businessName` in the class (see lecture 5.3).
3. Write expressions/functions to pull out the name of each restaurant.  
    * Note: The business name is in an unnamed tag; you will need to navigate to it using searches and/or relationships.
 
**Confirm that there is an extra restaurant in the list (e.g. 11-12 instead of 10). This is due to advertisements/sponsored links, which we want to ignore.**
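As a minimal sketch of the class-matching idea in step 2, the fragment below is hand-written HTML that only imitates Yelp's class naming. Every class other than `businessName__09f24__3Wql2`, along with the business names and links, is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hand-written fragment imitating the class naming scheme; not real Yelp markup
html = """
<div class="container__x1 businessName__09f24__3Wql2 margin__y2">
  <h4><span>1.&nbsp;<a name="Cafe One" href="/biz/cafe-one">Cafe One</a></span></h4>
</div>
<div class="container__x1 businessName__09f24__3Wql2 margin__y2">
  <h4><span><a name="Sponsored Diner" href="/biz/sponsored-diner">Sponsored Diner</a></span></h4>
</div>
<div class="container__x1 priceRange__09f24__zz">$$</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# A compiled pattern passed as class_ is searched against each class on the tag
name_divs = demo_soup.find_all("div", class_=re.compile("businessName"))

# The name itself sits in a nested link, so search inside each matching div
demo_names = [div.find("a").get_text(strip=True) for div in name_divs]
print(demo_names)  # ['Cafe One', 'Sponsored Diner']
```

Note that the sponsored entry is picked up as well, which is the extra item the confirmation step asks about.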

Alternate approach: find the `a` tags (the restaurant name is a link) that have a `name` attribute that isn't just an empty string.

In [4]:
links = [t for t in soup.find_all("a") if (t.has_attr("name") and t["name"].strip())]
names = [t["name"] for t in links]
names

['Pappi’s Taqueria y Mas',
 'The Waterfront Restaurant & Tavern',
 'Restore Public House',
 'Lovechild Restaurant',
 'Buzzard Billy’s',
 'The Charmant',
 'The Freighthouse Restaurant',
 'Howie’s on La Crosse',
 'Piggy’s Restaurant',
 'Schuby’s Neighborhood Butcher',
 'River Rats Bar and Grill']

In [5]:
def get_name_links(soup: BeautifulSoup) -> list:
    return [t for t in soup.find_all("a") if (t.has_attr("name") and t["name"].strip())]

In [6]:
len(names)

11

In [7]:
def get_business_names(soup: BeautifulSoup) -> list:
    """Finds all business names on a Yelp page.
        
    Args:
        soup: A BeautifulSoup object of a Yelp search page
    
    Returns:
        A list of strings representing the names of all businesses
    """
    return [t["name"] for t in soup.find_all("a") if (t.has_attr("name") and t["name"].strip())]

## <font color="red"> Problem 3 </font>

Since we picked up extra information, we will need to be clever about identifying the information block for each restaurant.  Note that all of the actual search results (but not sponsored links) start with the ranking (e.g. `11.`).  Use the following steps to get a list that contains the information for each restaurant other than the ads.

1. Start by finding the ranking of the restaurant (1., 2., etc.). **Hint:** You will need to use a regular expression to match the text of the tag (see lecture 5.3).
2. Now search for a parent of the above tags that surrounds all of the restaurant information.  You will want to use the `find_parent` method on each of the tags from **1.**.  **Hint:** Look through each of the `div` tags that contain the ranking, looking for a meaningful tag name to match with a regular expression.

The resulting list will be the starting point for gathering all of the information.
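The two steps above can be sketched on a small hand-written fragment. The markup, the class names (`searchResult__09f24__abc`, `title__b2`, `info__c3`), and the businesses are all assumptions made for illustration; the real page will use different names that need to be found with Inspect Element.

```python
import re
from bs4 import BeautifulSoup

# Hand-written fragment: one sponsored card without a ranking, two ranked cards
html = """
<ul>
  <li><div class="card__a1 searchResult__09f24__abc">
    <div class="title__b2"><span>Sponsored Diner</span></div>
    <div class="info__c3"><span>12 Main St</span></div>
  </div></li>
  <li><div class="card__a1 searchResult__09f24__abc">
    <div class="title__b2"><span>1.&nbsp;<a href="/biz/cafe-one">Cafe One</a></span></div>
    <div class="info__c3"><span>34 Oak Ave</span></div>
  </div></li>
  <li><div class="card__a1 searchResult__09f24__abc">
    <div class="title__b2"><span>2.&nbsp;<a href="/biz/bistro-two">Bistro Two</a></span></div>
    <div class="info__c3"><span>56 Elm Rd</span></div>
  </div></li>
</ul>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Step 1: text nodes that consist only of a ranking such as "1." plus whitespace
rank_strings = demo_soup.find_all(string=re.compile(r"^\s*\d+\.\s*$"))

# Step 2: climb from each ranking to the enclosing result block
blocks = [s.find_parent("div", class_=re.compile("searchResult")) for s in rank_strings]

# Each block is now a starting point for the remaining fields
results = [
    (b.find("a").get_text(), b.find("div", class_=re.compile("info")).get_text(strip=True))
    for b in blocks
]
print(results)  # [('Cafe One', '34 Oak Ave'), ('Bistro Two', '56 Elm Rd')]
```

The sponsored card is skipped because it has no ranking text, even though its address begins with digits.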

In [8]:
[t.parent.text for t in links]

['Pappi’s Taqueria y Mas',
 '1.\xa0The Waterfront Restaurant & Tavern',
 '2.\xa0Restore Public House',
 '3.\xa0Lovechild Restaurant',
 '4.\xa0Buzzard Billy’s',
 '5.\xa0The Charmant',
 '6.\xa0The Freighthouse Restaurant',
 '7.\xa0Howie’s on La Crosse',
 '8.\xa0Piggy’s Restaurant',
 '9.\xa0Schuby’s Neighborhood Butcher',
 '10.\xa0River Rats Bar and Grill']

In [9]:
starts_with_num = re.compile(r"\d+")
starts_with_num.match('1.\xa0The Waterfront Restaurant & Tavern') # check against one case

<re.Match object; span=(0, 1), match='1'>

In [10]:
[t["name"] for t in links if starts_with_num.match(t.parent.text)]

['The Waterfront Restaurant & Tavern',
 'Restore Public House',
 'Lovechild Restaurant',
 'Buzzard Billy’s',
 'The Charmant',
 'The Freighthouse Restaurant',
 'Howie’s on La Crosse',
 'Piggy’s Restaurant',
 'Schuby’s Neighborhood Butcher',
 'River Rats Bar and Grill']

In [11]:
# filter previous link list to remove ads
of_interest = [t for t in links if starts_with_num.match(t.parent.text)]
of_interest

[<a class="link__09f24__1kwXV link-color--inherit__09f24__3PYlA link-size--inherit__09f24__2Uj95" href="/biz/the-waterfront-restaurant-and-tavern-la-crosse?osq=Restaurants" name="The Waterfront Restaurant &amp; Tavern" rel="" target="">The Waterfront Restaurant &amp; Tavern</a>,
 <a class="link__09f24__1kwXV link-color--inherit__09f24__3PYlA link-size--inherit__09f24__2Uj95" href="/biz/restore-public-house-la-crosse?osq=Restaurants" name="Restore Public House" rel="" target="">Restore Public House</a>,
 <a class="link__09f24__1kwXV link-color--inherit__09f24__3PYlA link-size--inherit__09f24__2Uj95" href="/biz/lovechild-restaurant-la-crosse?osq=Restaurants" name="Lovechild Restaurant" rel="" target="">Lovechild Restaurant</a>,
 <a class="link__09f24__1kwXV link-color--inherit__09f24__3PYlA link-size--inherit__09f24__2Uj95" href="/biz/buzzard-billys-la-crosse-3?osq=Restaurants" name="Buzzard Billy’s" rel="" target="">Buzzard Billy’s</a>,
 <a class="link__09f24__1kwXV link-color--inherit_

In [12]:
def remove_ads(tag_list: list) -> list:
    starts_with_num = re.compile(r"\d+")
    return [t for t in tag_list if starts_with_num.match(t.parent.text)]
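One caveat with the `\d+` pattern is that it also matches any text that merely begins with a digit. A slightly stricter pattern, sketched below, also requires the period and the whitespace that follow the ranking. The last sample is a business name with no ranking prefix, as it might appear in a sponsored entry; that case is hypothetical and was not observed in the output above.

```python
import re

loose = re.compile(r"\d+")       # pattern used in remove_ads
strict = re.compile(r"\d+\.\s")  # digits, a period, then whitespace (\s also matches \xa0)

samples = [
    "1.\xa0The Waterfront Restaurant & Tavern",
    "10.\xa0River Rats Bar and Grill",
    "Pappi’s Taqueria y Mas",
    "4 Sisters Wine Bar and Tapas",  # hypothetical unranked entry starting with a digit
]

loose_flags = [bool(loose.match(s)) for s in samples]
strict_flags = [bool(strict.match(s)) for s in samples]
print(loose_flags)   # [True, True, False, True]
print(strict_flags)  # [True, True, False, False]
```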

## <font color="red"> Problem 4 </font>

Write expressions/functions to gather each of the following pieces of information for each of the restaurants.

#### Restaurant Name

In [13]:
[t["name"] for t in of_interest]

['The Waterfront Restaurant & Tavern',
 'Restore Public House',
 'Lovechild Restaurant',
 'Buzzard Billy’s',
 'The Charmant',
 'The Freighthouse Restaurant',
 'Howie’s on La Crosse',
 'Piggy’s Restaurant',
 'Schuby’s Neighborhood Butcher',
 'River Rats Bar and Grill']

In [14]:
def get_names(tags: list) -> list:
    return [t['name'] for t in tags]

In [15]:
get_names(of_interest)

['The Waterfront Restaurant & Tavern',
 'Restore Public House',
 'Lovechild Restaurant',
 'Buzzard Billy’s',
 'The Charmant',
 'The Freighthouse Restaurant',
 'Howie’s on La Crosse',
 'Piggy’s Restaurant',
 'Schuby’s Neighborhood Butcher',
 'River Rats Bar and Grill']

#### Rating

In [16]:
[go_up(t,5).next_sibling.div.div.div.span.div['aria-label'].split()[0] for t in of_interest]

['4.5', '4.5', '4.5', '4', '4.5', '4', '4', '4', '4.5', '4.5']

In [17]:
def get_ratings(anchor_link_tags: list) -> list:
    """Finds business ratings using parent/child structure with the link tags of the business names passed in.
        Returns empty string if data cannot be found."""
    
    ratings = []
    for tag in anchor_link_tags:
        try:
            rating = go_up(tag,5).next_sibling.div.div.div.span.div['aria-label'].split()[0]
        except (AttributeError, KeyError, TypeError):
            rating = ""
        ratings.append(rating)
    return ratings

In [18]:
get_ratings(of_interest)

['4.5', '4.5', '4.5', '4', '4.5', '4', '4', '4', '4.5', '4.5']

#### Address

In [19]:
[go_up(tag, 7).next_sibling.find("span").text for tag in of_interest] # same as before but with custom function to clean up

['328 Front St S',
 '1810 State St',
 '300 3rd St S',
 '222 Pearl St',
 '101 State St',
 '107 Vine St',
 '1128 La Crosse St',
 '501 Front St S',
 '321 State St',
 '1311 La Crescent Pl']

In [20]:
def get_addresses(anchor_link_tags: list) -> list:
    """Finds addresses using parent/child structure with the link tags of the business names passed in. 
    If an address cannot be found, returns an empty string."""
    addresses = []
    for tag in anchor_link_tags:
        try:
            addr = go_up(tag, 7).next_sibling.find("span").text
        except AttributeError:
            addr = ""
        addresses.append(addr)
    return addresses

In [21]:
get_addresses(of_interest)

['328 Front St S',
 '1810 State St',
 '300 3rd St S',
 '222 Pearl St',
 '101 State St',
 '107 Vine St',
 '1128 La Crosse St',
 '501 Front St S',
 '321 State St',
 '1311 La Crescent Pl']

#### Review Count

In [22]:
[int(go_up(tag, 5).next_sibling.find_all("span")[1].text) for tag in of_interest]

[226, 17, 109, 276, 149, 125, 57, 131, 17, 28]

In [23]:
def get_rating_counts(anchor_link_tags: list) -> list:
    """Finds numbers of ratings using parent/child structure with the link tags of the business names passed in.
        Will yield 0 if a tag's value is unfindable."""

    counts = []
    for tag in anchor_link_tags:
        try:
            count = int(go_up(tag, 5).next_sibling.find_all("span")[1].text)
        except (AttributeError, IndexError, ValueError):
            count = 0
        counts.append(count)

    return counts

In [24]:
get_rating_counts(of_interest)

[226, 17, 109, 276, 149, 125, 57, 131, 17, 28]

#### Category

**Examples:** `['American (New)', 'Seafood', 'Steakhouses']` becomes `'American (New);Seafood;Steakhouses'`

In [25]:
categories = [go_up(tag, 5).next_sibling.next_sibling.find_all("a") for tag in of_interest]
categories[0]

[<a class="link__09f24__1kwXV link-color--inherit__09f24__3PYlA link-size--default__09f24__3xWLF" href="/search?cflt=newamerican&amp;find_desc=Restaurants&amp;find_loc=La+Crosse%2C+WI" name="" rel="" role="link" target="">American (New)</a>,
 <a class="link__09f24__1kwXV link-color--inherit__09f24__3PYlA link-size--default__09f24__3xWLF" href="/search?cflt=seafood&amp;find_desc=Restaurants&amp;find_loc=La+Crosse%2C+WI" name="" rel="" role="link" target="">Seafood</a>,
 <a class="link__09f24__1kwXV link-color--inherit__09f24__3PYlA link-size--default__09f24__3xWLF" href="/search?cflt=steak&amp;find_desc=Restaurants&amp;find_loc=La+Crosse%2C+WI" name="" rel="" role="link" target="">Steakhouses</a>]

In [26]:
cleaned_categories = [[a.text for a in go_up(tag, 5).next_sibling.next_sibling.find_all("a")] for tag in of_interest]
cleaned_categories[0]

['American (New)', 'Seafood', 'Steakhouses']

In [27]:
joined_categories = [";".join(categs) for categs in cleaned_categories]
joined_categories

['American (New);Seafood;Steakhouses',
 'American (Traditional)',
 'American (New)',
 'American (Traditional);Cajun/Creole',
 'French;Cocktail Bars',
 'Seafood;Steakhouses;Desserts',
 'American (New);Pubs',
 'Steakhouses;Seafood;Sandwiches',
 'Butcher;Delis;Caterers',
 'American (New);Burgers;Cocktail Bars']

In [28]:
def get_categories(anchor_link_tags: list) -> list:
    """Returns list of strings - business categories joined with semicolon.
        Uses parent/child structure with the link tags of the business names passed in
        Empty string for unfindable value"""
    cat_strs = []
    for tag in anchor_link_tags:
        try:
            categories = ";".join([a.text for a in go_up(tag, 5).next_sibling.next_sibling.find_all("a")])
        except (AttributeError, TypeError):
            categories = ""
        cat_strs.append(categories)
    return cat_strs

In [29]:
get_categories(of_interest)

['American (New);Seafood;Steakhouses',
 'American (Traditional)',
 'American (New)',
 'American (Traditional);Cajun/Creole',
 'French;Cocktail Bars',
 'Seafood;Steakhouses;Desserts',
 'American (New);Pubs',
 'Steakhouses;Seafood;Sandwiches',
 'Butcher;Delis;Caterers',
 'American (New);Burgers;Cocktail Bars']

## <font color="red">  Problem 4 </font>

Package all of the expressions in a function that takes a url as input and returns the table of information.  Use a `def` statement and put the above helper functions in the body of the main function.  Test this function on the front page of the search.

In [30]:
def parse_page(url: str) -> list:
    """Downloads one Yelp search page and returns one tuple per ranked business.

    Each tuple is (name, rating, address, review count, categories).
    """
    sess = requests.Session()
    page = sess.get(url)
    soup = BeautifulSoup(page.content, "html.parser")

    # Drop sponsored links, then pull each field from the remaining name links
    business_links = remove_ads(get_name_links(soup))
    names = get_names(business_links)
    ratings = get_ratings(business_links)
    addresses = get_addresses(business_links)
    review_nums = get_rating_counts(business_links)
    categories = get_categories(business_links)

    return list(zip(names, ratings, addresses, review_nums, categories))

In [31]:
parse_page('https://www.yelp.com/search?find_desc=Restaurants&find_loc=La+Crosse%2C+WI&start=10')

[('Digger’s Sting Restaurant',
  '4',
  '122 3rd St N',
  62,
  'American (New);Steakhouses'),
 ('The Crow', '3.5', '100 3rd St S', 154, 'American (Traditional);Gastropubs'),
 ('Uno Venti Pizzeria', '4', '120 King St', 17, 'Pizza;Italian;Beer Bar'),
 ('Burritos House', '4.5', '1205 La Crosse St', 30, 'Mexican'),
 ('Milwaukee Burger Company',
  '3.5',
  '3039 Medco Ct',
  23,
  'Burgers;Sports Bars;Beer Bar'),
 ('4 Sisters Wine Bar and Tapas',
  '3.5',
  '100 Harborview Plz',
  111,
  'Wine Bars;Tapas/Small Plates;Tapas Bars'),
 ('Iguana’s Mexican Street Café', '4', '1800 State St', 71, 'Mexican'),
 ('Five Star Eggrolls', '3.5', '1203 La Crosse St', 9, 'Thai;Laotian;Chinese'),
 ('Le Chateau', '4.5', '410 Cass St', 51, 'French;Wine Bars;Cocktail Bars'),
 ('Hmong’s Golden Egg Roll',
  '4',
  '901 State St',
  69,
  'Laotian;Vietnamese;Thai')]

## <font color="red">  Problem 5 </font>

Now perform a linear search to grab all of the information on restaurants in La Crosse.  You will need to

1. Inspect the url for the first, second, etc. pages to determine the pattern.
2. Create a list of urls for all search results.
3. Get the info from all pages and use your function from the last problem to parse the data into a list.
4. Write the results to a csv file. **Hint:** Use `'a'` (append) instead of `'w'` (write) on all pages after the first.

Pattern: append "&start={10 * (n - 1)}", where n is the page number. This also works for the first page ("&start=0").
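The offset pattern can be sketched with a small helper. The function name `page_urls` and its `per_page` parameter are hypothetical, and the sketch assumes each page holds 10 results as observed above.

```python
from urllib.parse import urlencode

def page_urls(n_pages: int, per_page: int = 10) -> list:
    """Builds search URLs for pages 1..n_pages using start = per_page * (n - 1)."""
    base = "https://www.yelp.com/search"
    urls = []
    for page in range(1, n_pages + 1):
        query = urlencode({
            "find_desc": "Restaurants",
            "find_loc": "La Crosse, WI",
            "start": per_page * (page - 1),  # page 1 -> 0, page 2 -> 10, ...
        })
        urls.append(f"{base}?{query}")
    return urls

demo_urls = page_urls(3)
for u in demo_urls:
    print(u)
```

`urlencode` writes spaces as `+` rather than `%20`; both forms decode to the same location string.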

In [32]:
urls = [f"https://www.yelp.com/search?find_desc=Restaurants&find_loc=La%20Crosse%2C%20WI&start={10*pg}" for pg in range(24)]
#urls

In [33]:
all_rest_data = [parse_page(url) for url in urls]
# all_data = []
# for url in urls:
#     print(f"Parsing page at: {url}")
#     data = parse_page(url)
#     all_data.append(data)

In [34]:
import csv
with open("restaurants.csv", "w", newline="", encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    for pg_of_data in all_rest_data:
        writer.writerows(pg_of_data)
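The hint about `'a'` (append) suggests an alternative in which each page is written as soon as it has been parsed. A minimal sketch follows; the rows are made up, the header names are illustrative, and the file goes to a temporary directory so that nothing real is overwritten.

```python
import csv
import os
import tempfile

# Made-up pages of rows shaped like the tuples returned by parse_page
pages = [
    [("Cafe One", "4.5", "34 Oak Ave", 12, "Cafes")],
    [("Bistro Two", "4", "56 Elm Rd", 7, "French;Wine Bars")],
]

path = os.path.join(tempfile.mkdtemp(), "restaurants_demo.csv")
for i, rows in enumerate(pages):
    mode = "w" if i == 0 else "a"  # first page creates the file, later pages append
    with open(path, mode, newline="", encoding="utf-8") as outfile:
        writer = csv.writer(outfile)
        if i == 0:
            writer.writerow(["name", "rating", "address", "review_count", "categories"])
        writer.writerows(rows)

# Read the file back to confirm that both pages were saved
with open(path, newline="", encoding="utf-8") as infile:
    saved = list(csv.reader(infile))
print(len(saved))  # 3
```

Writing page by page means a failure partway through the crawl still leaves the earlier pages on disk.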

## <font color="red">  Bonus Problem </font>

See if you can also grab the latitude and longitude of each result.

In [36]:
" ".join(reversed("2C 4H 9C 8H 5H 9S 5S AD 8C AH 8S 2H 7C 5D 4S JD TH 5C JH 6S 3H KS TC JC".split()))

'JC TC KS 3H 6S JH 5C TH JD 4S 5D 7C 2H 8S AH 8C AD 5S 9S 5H 8H 9C 4H 2C'