## Linear Web Crawling

* A **linear web crawl** pulls information from a list of similarly structured pages
    * Multiple pages of web reviews
    * Multiple days/weeks/years of music played on The Current
    * Multiple pages of reviews for a movie

## Workflow for a Linear Crawl

* Make a function that processes a single page
* Make a list of urls
* Apply the function to the list of urls.
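
The three steps above can be sketched as follows. This is only an illustration: `process_page` is a hypothetical stand-in that looks at the url text, and the `example.com` urls are made up. A real crawl would download and parse each page instead.

```python
# Step 1: a function that processes a single page.
# NOTE: process_page is a hypothetical stand-in; it only inspects the url text.
def process_page(url):
    page_name = url.rsplit("/", 1)[-1]
    return (url, page_name)

# Step 2: a list of urls (made-up addresses for illustration)
urls = [
    "http://example.com/reviews/page1",
    "http://example.com/reviews/page2",
    "http://example.com/reviews/page3",
]

# Step 3: apply the function to every url
results = [process_page(url) for url in urls]
print(results)
```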

## Exploiting url patterns

* We can make a list of all pages by inspecting and exploiting url patterns.
* Examples:
    * The date and hour in
        * `http://www.thecurrent.org/playlist/2014-01-01/01`
    * The `&start=30` in 
        * `https://www.yelp.com/search?find_desc=Restaurants&find_loc=La+Crosse,+WI&start=30`
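
One way to inspect a url like the Yelp example is to split its query string into named parts. The sketch below uses the standard library's `urllib.parse`; it is offered as one possible way to spot the `start` parameter, not as the approach used in the lab.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=La+Crosse,+WI&start=30"

# Pull out the query string, then parse it into a dict mapping names to lists of values
query = parse_qs(urlsplit(url).query)

for key, values in query.items():
    print(key, values)
```

The `start` entry is the part that changes from page to page, so it is the piece to replace when building a list of urls.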

## Formatting Python strings

* Mark insertion points with `{0}`, `{1}`, etc.
* Use the `format` method to insert the items.

In [1]:
base_str = "Hello {0}! How's your {1} going?"
base_str.format("Todd", "Monday")

"Hello Todd! How's your Monday going?"

In [5]:
greeting = lambda name,day: f"Hello {name}! How's your {day} going?"

greeting("Bryce","Friday")

"Hello Bryce! How's your Friday going?"

In [2]:
base_str.format("Chris", "Wednesday")

"Hello Chris! How's your Wednesday going?"

## Formatting strings in a list comprehension

You can use a list comprehension to format a whole list of strings.

In [6]:
base_str2 = "Hello {0}!"
[base_str2.format(name) for name in ["Todd", "Chris", "Brant"]]

['Hello Todd!', 'Hello Chris!', 'Hello Brant!']
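
The same idea can be applied to the playlist url shown earlier. The sketch below is an assumption-laden illustration: it assumes the hour is always two digits running from `00` to `23` (inferred only from the `/01` in the example url), and the names `playlist_base`, `first_day`, and `playlist_urls` are introduced here for the example.

```python
from datetime import date, timedelta

# {0} is the date; {1:02d} pads the hour to two digits (an assumption based
# on the "/01" at the end of the example url)
playlist_base = "http://www.thecurrent.org/playlist/{0}/{1:02d}"

first_day = date(2014, 1, 1)
days = [first_day + timedelta(days=n) for n in range(2)]

# One url per hour of each day
playlist_urls = [playlist_base.format(day.isoformat(), hour)
                 for day in days
                 for hour in range(24)]

print(playlist_urls[:3])
```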

## Example - Yelp 

In a previous lab, we scraped Yelp for a lunch location in La Crosse, WI.  Here is the resulting function.

In [7]:
def get_page_info(soup):
    # Each search result is marked with an 'indexed-biz-name' span
    indexes = soup.find_all('span', class_="indexed-biz-name")
    # Walk up to the enclosing block that holds all of a listing's details
    info_blocks = [tag.find_parent('div', 'biz-listing-large') for tag in indexes]
    names = [tag.find('a', 'biz-name').span.get_text() for tag in info_blocks]
    ratings = [tag.find('div', 'i-stars').img['alt'].split(' ')[0] for tag in info_blocks]
    # Use '' when a listing has no address so the columns stay aligned in zip
    addresses = [tag.find('address').get_text().strip() if tag.find('address') is not None else '' for tag in info_blocks]
    review_count = [tag.find('span', 'review-count').next.strip().split(" ")[0] for tag in info_blocks]
    get_and_combine_categories = lambda cat_list: ";".join([tag.next for tag in cat_list.find_all('a')])
    cat_lists = [tag.find('span', 'category-str-list') for tag in info_blocks]
    categories = [get_and_combine_categories(tag) for tag in cat_lists]
    # One tuple per listing: (name, rating, address, review count, categories)
    return list(zip(names, ratings, addresses, review_count, categories))

## <font color='red'> Exercise 5 </font>

**Tasks**
1. Get the urls for the first two pages of search output.  
2. Identify the part of the url related to the page.
    * Look for `&start=30`
3. Build a `base_string`
    * Replace `&start=30` with `&start={0}`
    * Verify that `base_string.format(30)` replaces the `{0}` with `30`
4. Build a list of urls for all of the search results.
    * Note that there are 8 pages listed for La Crosse.  We want 8 links.
    * Use `range` and a list comprehension to replace `{0}` with `0`, `30`, ...

In [6]:
import requests
from bs4 import BeautifulSoup

### Step 1 - Get a soup for each url

In [26]:
# One url per results page: start=0, 30, ..., 210
base_str = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=La+Crosse,+WI&start={0}"
urls = [base_str.format(i) for i in range(0, 8*30, 30)]
# Reuse a single session for all of the requests
s = requests.Session()

def get_soup(url):
    # Download one page and parse it into a BeautifulSoup object
    print("Loading {}".format(url))
    r = s.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

soups = [get_soup(url) for url in urls]

Loading https://www.yelp.com/search?find_desc=Restaurants&find_loc=La+Crosse,+WI&start=0
Loading https://www.yelp.com/search?find_desc=Restaurants&find_loc=La+Crosse,+WI&start=30
Loading https://www.yelp.com/search?find_desc=Restaurants&find_loc=La+Crosse,+WI&start=60
Loading https://www.yelp.com/search?find_desc=Restaurants&find_loc=La+Crosse,+WI&start=90
Loading https://www.yelp.com/search?find_desc=Restaurants&find_loc=La+Crosse,+WI&start=120
Loading https://www.yelp.com/search?find_desc=Restaurants&find_loc=La+Crosse,+WI&start=150
Loading https://www.yelp.com/search?find_desc=Restaurants&find_loc=La+Crosse,+WI&start=180
Loading https://www.yelp.com/search?find_desc=Restaurants&find_loc=La+Crosse,+WI&start=210
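
When loading several pages in a row, a common courtesy is to pause between requests. The sketch below is not part of the original lab: `fetch_all` is a hypothetical helper, the one-second default delay is an arbitrary assumption, and the demo uses `str.upper` as a stand-in fetch function so it runs without network access.

```python
import time

def fetch_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each url, sleeping `delay` seconds between calls."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause before every request except the first
        pages.append(fetch(url))
    return pages

# Stand-in fetch function so the sketch runs offline
demo_pages = fetch_all(["page-a", "page-b"], fetch=str.upper, delay=0.1)
print(demo_pages)
```

With the definitions from this step, `soups = fetch_all(urls, get_soup)` would play the same role as the list comprehension above.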


### Step 2 - Get the info from each soup

In [27]:
info = [get_page_info(soup) for soup in soups]
info

[[('The Waterfront Restaurant & Tavern',
   '4.5',
   '328 Front St S',
   '182',
   'American (New);Seafood;Steakhouses'),
  ('Buzzard Billy’s',
   '4.0',
   '222 Pearl St',
   '203',
   'American (Traditional);Cajun/Creole'),
  ('Greengrass Cafe',
   '4.0',
   '1904 Campbell Rd',
   '85',
   'Breakfast & Brunch;Bars'),
  ('Lovechild Restaurant',
   '4.5',
   '300 3rd St S',
   '75',
   'Salad;Bars;Steakhouses'),
  ('The Charmant', '4.5', '101 State St', '125', 'French;American (New)'),
  ('The Root Note',
   '4.5',
   '115 4th St S',
   '65',
   'Music Venues;Coffee & Tea;Creperies'),
  ('Piggy’s Restaurant',
   '4.0',
   '501 Front St S',
   '109',
   'Steakhouses;Seafood;Sandwiches'),
  ('Howie’s on La Crosse',
   '4.0',
   '1128 La Crosse St',
   '50',
   'American (New);Pubs'),
  ('Kate’s On State', '4.0', '333 Main St', '79', 'Italian;Seafood'),
  ('Dublin Square Irish Pub & Eatery',
   '4.0',
   '103 3rd St N',
   '173',
   'American (Traditional);Irish Pub'),
  ('The Freightho

## Stinky Cabbage!

Let's hold our noses and use a double `for` loop to print the contents of `info`.

In [30]:
for inner_list in info:
    for row in inner_list:
        print(row)

('The Waterfront Restaurant & Tavern', '4.5', '328 Front St S', '182', 'American (New);Seafood;Steakhouses')
('Buzzard Billy’s', '4.0', '222 Pearl St', '203', 'American (Traditional);Cajun/Creole')
('Greengrass Cafe', '4.0', '1904 Campbell Rd', '85', 'Breakfast & Brunch;Bars')
('Lovechild Restaurant', '4.5', '300 3rd St S', '75', 'Salad;Bars;Steakhouses')
('The Charmant', '4.5', '101 State St', '125', 'French;American (New)')
('The Root Note', '4.5', '115 4th St S', '65', 'Music Venues;Coffee & Tea;Creperies')
('Piggy’s Restaurant', '4.0', '501 Front St S', '109', 'Steakhouses;Seafood;Sandwiches')
('Howie’s on La Crosse', '4.0', '1128 La Crosse St', '50', 'American (New);Pubs')
('Kate’s On State', '4.0', '333 Main St', '79', 'Italian;Seafood')
('Dublin Square Irish Pub & Eatery', '4.0', '103 3rd St N', '173', 'American (Traditional);Irish Pub')
('The Freighthouse Restaurant', '4.0', '107 Vine St', '104', 'Seafood;Steakhouses;Desserts')
('Digger’s Sting Restaurant', '4.0', '122 3rd St N

# <font color="red">TODO: Add a picture of converting a double `for` loop to a comprehension</font>

## Looking at the type of `info`

* `info` is a list of lists.
    * Each call to `get_page_info` gives a list inside our main list.

In [31]:
# This is a list of lists, one list per url
[type(item) for item in info]

[list, list, list, list, list, list, list, list]

### Step 3 - Flattening the results

* We will use 2 `for` clauses to flatten the output
    * The first/outer `for` iterates over the outer list (one inner list per page)
    * The second/inner `for` iterates over the rows of each inner list

In [32]:
flat_info = [row for info_list in info for row in info_list]
flat_info[:5]

[('The Waterfront Restaurant & Tavern',
  '4.5',
  '328 Front St S',
  '182',
  'American (New);Seafood;Steakhouses'),
 ('Buzzard Billy’s',
  '4.0',
  '222 Pearl St',
  '203',
  'American (Traditional);Cajun/Creole'),
 ('Greengrass Cafe',
  '4.0',
  '1904 Campbell Rd',
  '85',
  'Breakfast & Brunch;Bars'),
 ('Lovechild Restaurant',
  '4.5',
  '300 3rd St S',
  '75',
  'Salad;Bars;Steakhouses'),
 ('The Charmant', '4.5', '101 State St', '125', 'French;American (New)')]
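
To see how the two `for` clauses relate to a double `for` loop, the sketch below flattens a small made-up list of lists three ways. The names `nested`, `flat_loop`, `flat_comp`, and `flat_chain` are introduced for this illustration, and `itertools.chain.from_iterable` is shown only as a standard-library alternative.

```python
from itertools import chain

# Made-up stand-in for info: one inner list per page
nested = [[("a", "1"), ("b", "2")], [("c", "3")]]

# Double for loop: the clauses appear in the same order in the comprehension
flat_loop = []
for inner_list in nested:
    for row in inner_list:
        flat_loop.append(row)

# Same iteration order, written as a comprehension with two for clauses
flat_comp = [row for inner_list in nested for row in inner_list]

# Standard-library alternative
flat_chain = list(chain.from_iterable(nested))

print(flat_loop == flat_comp == flat_chain)
```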

## Writing the output to a file

* Turn each row of data into a comma separated string.
* Use a `with` statement to open a file.
* Use `print` and a `for` loop to write the output.

In [19]:
lines = [','.join(row) for row in flat_info]
lines[:10]

['The Waterfront Restaurant & Tavern,4.5,328 Front St S,182,American (New);Seafood;Steakhouses',
 'Buzzard Billy’s,4.0,222 Pearl St,203,American (Traditional);Cajun/Creole',
 'Greengrass Cafe,4.0,1904 Campbell Rd,85,Breakfast & Brunch;Bars',
 'Lovechild Restaurant,4.5,300 3rd St S,75,Salad;Bars;Steakhouses',
 'The Charmant,4.5,101 State St,125,French;American (New)',
 'The Root Note,4.5,115 4th St S,65,Music Venues;Coffee & Tea;Creperies',
 'Piggy’s Restaurant,4.0,501 Front St S,109,Steakhouses;Seafood;Sandwiches',
 'Howie’s on La Crosse,4.0,1128 La Crosse St,50,American (New);Pubs',
 'Kate’s On State,4.0,333 Main St,78,Italian;Seafood',
 'The Freighthouse Restaurant,4.0,107 Vine St,103,Seafood;Steakhouses;Desserts']

### Step 4 - Write to a file

In [96]:
# Write one line per restaurant; the with statement closes the file for us
with open('lunch.csv', 'w') as out_file:
    for line in lines:
        print(line, file=out_file)
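
Joining fields with `','` works as long as no field contains a comma. As a hedged alternative, the sketch below uses the standard library's `csv.writer`, which quotes such fields. The helper `write_rows` and the second sample row are invented for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# The second row is invented to show a field that contains a comma
sample_rows = [
    ('The Charmant', '4.5', '101 State St', '125', 'French;American (New)'),
    ('Example Cafe', '4.0', '1 Main St, Suite 2', '10', 'Cafes'),
]

def write_rows(rows, out_file):
    """Write each tuple as one csv record, quoting fields when needed."""
    writer = csv.writer(out_file)
    writer.writerows(rows)

buffer = io.StringIO()  # in-memory stand-in for an open file
write_rows(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, the `csv` documentation recommends opening it with `newline=''`, for example `open('lunch.csv', 'w', newline='')`.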

In [97]:
# Mac command
%cat lunch.csv

The Waterfront Restaurant & Tavern,4.5,328 Front St S,182,American (New);Seafood;Steakhouses
Buzzard Billy’s,4.0,222 Pearl St,203,Breakfast & Brunch;Bars
The Charmant,4.5,101 State St,124,American (Traditional);Cajun/Creole
Lovechild Restaurant,4.5,300 3rd St S,75,French;American (New)
Piggy’s Restaurant,4.0,501 Front St S,108,Salad;Bars;Steakhouses
Greengrass Cafe,4.0,1904 Campbell Rd,85,Music Venues;Coffee & Tea;Creperies
Dublin Square Irish Pub & Eatery,4.0,103 3rd St N,173,Steakhouses;Seafood;Sandwiches
The Root Note,4.5,115 4th St S,65,American (New);Pubs
The Freighthouse Restaurant,4.0,107 Vine St,103,American (Traditional);Irish Pub
The Crow,3.5,100 3rd St S,104,Italian;Seafood
Howie’s on La Crosse,4.0,1128 La Crosse St,50,Seafood;Steakhouses;Desserts
Kate’s On State,4.0,333 Main St,78,American (Traditional);Breakfast & Brunch;Pubs
Le Chateau,4.5,410 Cass St,46,Diners;Cafes;American (Traditional)
Digger’s Sting Restaurant,4.0,122 3rd St N,50,American (Traditional);Breakfast & Br

In [97]:
# PC command
!type lunch.csv

The Waterfront Restaurant & Tavern,4.5,328 Front St S,182,American (New);Seafood;Steakhouses
Buzzard Billy’s,4.0,222 Pearl St,203,Breakfast & Brunch;Bars
The Charmant,4.5,101 State St,124,American (Traditional);Cajun/Creole
Lovechild Restaurant,4.5,300 3rd St S,75,French;American (New)
Piggy’s Restaurant,4.0,501 Front St S,108,Salad;Bars;Steakhouses
Greengrass Cafe,4.0,1904 Campbell Rd,85,Music Venues;Coffee & Tea;Creperies
Dublin Square Irish Pub & Eatery,4.0,103 3rd St N,173,Steakhouses;Seafood;Sandwiches
The Root Note,4.5,115 4th St S,65,American (New);Pubs
The Freighthouse Restaurant,4.0,107 Vine St,103,American (Traditional);Irish Pub
The Crow,3.5,100 3rd St S,104,Italian;Seafood
Howie’s on La Crosse,4.0,1128 La Crosse St,50,Seafood;Steakhouses;Desserts
Kate’s On State,4.0,333 Main St,78,American (Traditional);Breakfast & Brunch;Pubs
Le Chateau,4.5,410 Cass St,46,Diners;Cafes;American (Traditional)
Digger’s Sting Restaurant,4.0,122 3rd St N,50,American (Traditional);Breakfast & Br