## Crawler exercise (YELP)
#### by Anthony Lo and his team (https://github.com/team-ant/crawler_example)

You will use Python and BeautifulSoup to build a crawler that scrapes https://www.yelp.com. In this exercise, you will learn how to:
- Send an HTTP request from Python and read the server's response
- Preprocess scraped data using BeautifulSoup
- Automate the crawler with simple Python code
- Randomize the user agent and add a timer to minimize the risk of being blocked
- Export the data to a file for further analysis



### Get all shop names on one page
Sample output:
1. Sugar
2. Camper’s
3. Mr & Mrs Fox
4. Shugetsu
5. 金峰靚靚粥麵
6. Public
7. Plat du Jour
8. Formosa Autumn
9. Nha Trang
10. Wah Cheong Congee & Noodle Shop

In [0]:
import requests
from bs4 import BeautifulSoup



In [0]:
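## test connection
# NOTE: `ua` is created in the fake-useragent cell further down; run that cell first
url = 'https://www.yelp.com/search?find_desc=&find_loc=quarry%20bay'
html_text = requests.get(url, headers={'User-Agent': ua.random}).text
# name the parser explicitly so the result does not depend on what is installed
bs = BeautifulSoup(html_text, 'lxml')
# each matching div holds one ranked shop name, e.g. "1. Sugar"
for i in bs.find_all('div',{'class':'lemon--div__373c0__1mboc businessNameWithNoVerifiedBadge__373c0__24q4s border-color--default__373c0__2oFDT'}):
    print(i.text)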

1. Sugar
2. Camper’s
3. Mr & Mrs Fox
4. Shugetsu
5. 金峰靚靚粥麵
6. Public
7. Plat du Jour
8. Formosa Autumn
9. Nha Trang
10. Wah Cheong Congee & Noodle Shop
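
The class string used above contains suffixes such as `__373c0__24q4s` that look auto-generated, so an exact match may stop working when the site changes. As a hedged alternative, the sketch below matches only the readable `businessName` prefix. The HTML snippet and the `/biz/example-shop` link are hand-written stand-ins, not real Yelp markup, and `html.parser` is used only to avoid an extra dependency.

```python
import re
from bs4 import BeautifulSoup

# Hand-written stand-in for two result blocks (assumption, not real Yelp markup)
sample_html = '''
<div class="lemon--div__373c0__1mboc businessNameWithNoVerifiedBadge__373c0__24q4s">
1. <a href="/biz/example-shop">Sugar</a>
</div>
<div class="lemon--div__373c0__1mboc somethingElse__373c0__0000">not a shop name</div>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
# A compiled regex passed to class_ is tested against each CSS class of a tag
name_divs = soup.find_all('div', class_=re.compile(r'^businessName'))
names = [div.text.strip() for div in name_divs]
print(names)
```

Matching on a prefix is more tolerant of changed suffixes, but it can also pick up unrelated elements whose class happens to start the same way, so it is worth checking the results by eye.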


### Get total number of pages
Write a function that returns the total number of search result pages. The function takes a maximum number of pages and never returns more than that limit.

Example:
- if total pages = 6 and max = 3, return 3
- if total pages = 5 and max = 10, return 5
- if no result, return -1
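
The three rules above can be sketched without any network access. This is a minimal illustration, assuming the pagination label ends with the total page count (for example `1 of 34`); `limit_pages` is a hypothetical helper name, not part of the exercise solution.

```python
import re

def limit_pages(pagination_text, max_page=3):
    """Hypothetical helper: parse text like '1 of 34' and cap it at max_page."""
    match = re.search(r'(\d+)\s*$', pagination_text.strip())
    if match is None:
        return -1  # no page count found, treated as "no result"
    return min(int(match.group(1)), max_page)

print(limit_pages('1 of 6', max_page=3))    # 3
print(limit_pages('1 of 5', max_page=10))   # 5
print(limit_pages('', max_page=3))          # -1
```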


In [0]:
## get total pages
def get_total_pages(url, max_page=3):
    """Return the number of result pages for url, capped at max_page, or -1 on failure."""
    try:
        html_text = requests.get(url, headers={'User-Agent': ua.random}).text
        bs = BeautifulSoup(html_text, 'lxml')
        # pagination bar: "page 1 of 34", the page links 1 2 3 4 5 ..., and "Next >"
        temp = bs.find('div',{'class':'lemon--div__373c0__1mboc pagination__373c0__1NjN5 u-padding-t2 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT'})
        # the last word of the span text is the total page count
        temp_text = temp.find('span',{'class':'lemon--span__373c0__3997G text__373c0__2pB8f text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_'}).text
        total_pages = int(temp_text.split(' ')[-1])
        print("Total_pages: {}".format(total_pages))
        # never return more pages than the caller asked for
        return min(total_pages, max_page)
    except Exception:
        # the pagination bar was not found (e.g. no results) or the request failed
        return -1

### Loop through pages and get a list of shop name and links
Given a search result page, crawl all shop names and links for a maximum of 3 pages.

<br>
* Remember to set up a randomized user agent and a timer to avoid being blocked by the server
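
The idea behind this reminder can be sketched with the standard library alone. The notebook itself uses `fake-useragent` for the user agent. The small `USER_AGENTS` pool, `random_headers`, and `polite_pause` below are illustrative assumptions, not part of the exercise solution.

```python
import random
import time

# Small hand-picked pool; these strings are illustrative assumptions
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0',
]

def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}

def polite_pause(min_seconds=1.0, jitter=1.0):
    """Sleep for min_seconds plus a random extra of up to jitter seconds."""
    delay = min_seconds + random.random() * jitter
    time.sleep(delay)
    return delay

headers = random_headers()
delay = polite_pause(min_seconds=0.1, jitter=0.1)
print(headers['User-Agent'], round(delay, 3))
```

A fixed minimum plus random jitter keeps requests from arriving at a perfectly regular rhythm while still guaranteeing a gap between them.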

In [0]:
# Install fake-useragent
!pip install fake-useragent

Collecting fake-useragent
  Downloading https://files.pythonhosted.org/packages/d1/79/af647635d6968e2deb57a208d309f6069d31cb138066d7e821e575112a80/fake-useragent-0.1.11.tar.gz
Building wheels for collected packages: fake-useragent
  Building wheel for fake-useragent (setup.py) ... done
  Created wheel for fake-useragent: filename=fake_useragent-0.1.11-cp36-none-any.whl size=13485 sha256=399f3b2d5fcd06d7c0d31f6f049dc5c68ede0658683c1d8f2b14f3b122ab8870
  Stored in directory: /root/.cache/pip/wheels/5e/63/09/d1dc15179f175357d3f5c00cbffbac37f9e8690d80545143ff
Successfully built fake-useragent
Installing collected packages: fake-useragent
Successfully installed fake-useragent-0.1.11


In [0]:
from fake_useragent import UserAgent
import time
import random
ua = UserAgent()

In [0]:
## get shop name and link
def get_shop_list(url, max_page=3):
    total_pages = get_total_pages(url, max_page)
    output_list = []
    # range() is empty when get_total_pages returns -1, so no page is requested
    for page in range(total_pages):
        # `start` is the offset of the first shop on the page (10 shops per page)
        page_url = url + '&start={}'.format(page * 10)
        page_request = requests.get(page_url, headers={'User-Agent': ua.random})
        page_bs = BeautifulSoup(page_request.text, 'lxml')
        # restaurant name
        for shop in page_bs.find_all('div',{'class':'lemon--div__373c0__1mboc businessNameWithNoVerifiedBadge__373c0__24q4s border-color--default__373c0__2oFDT'}):
            # drop the leading rank ("1. Sugar" -> "Sugar") but keep dots inside the name
            shop_name = '.'.join(shop.text.split('.')[1:]).strip()
            link = 'https://www.yelp.com' + shop.find('a')['href']
            output_list.append([shop_name, link])
            print(shop_name, link)
            # random pause of up to 1 second after each shop
            time.sleep(random.random())
    return output_list
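
Two small string steps in `get_shop_list` are easy to check in isolation: removing the rank in front of a shop name and building the `&start=` offsets. The sketch below repeats the same expressions as standalone functions. `strip_rank` and `page_urls` are hypothetical names, and `St. Example Cafe` is a made-up shop name used only to show a name containing a dot.

```python
def strip_rank(text):
    """Remove the leading rank from text such as '1. Sugar'."""
    return '.'.join(text.split('.')[1:]).strip()

def page_urls(search_url, total_pages, page_size=10):
    """Build one URL per result page by advancing the start offset."""
    return [search_url + '&start={}'.format(page * page_size)
            for page in range(total_pages)]

print(strip_rank('3. Mr & Mrs Fox'))
print(strip_rank('1. St. Example Cafe'))  # made-up name containing a dot

base = 'https://www.yelp.com/search?find_desc=&find_loc=quarry%20bay'
for page_url in page_urls(base, 3):
    print(page_url)
```

Note that `strip_rank` returns an empty string when the text contains no dot at all, so it relies on every name arriving with its rank prefix.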

### Get reviews
Given a shop URL, crawl its reviews, up to `max_page` pages (the function below defaults to 5).

In [0]:
def get_reviews(shop_name, url, max_page=5):
    total_pages = get_total_pages(url, max_page)
    output_list = []
    # range() is empty when get_total_pages returns -1, so no page is requested
    for page in range(total_pages):
        # `start` advances by 20 for each review page
        review_url = url + '?start={}'.format(page * 20)
        review_request = requests.get(review_url, headers={'User-Agent': ua.random})
        review_bs = BeautifulSoup(review_request.text, 'lxml')
        for block in review_bs.find_all('div',{'class': 'lemon--div__373c0__1mboc arrange-unit__373c0__1piwO arrange-unit-fill__373c0__17z0h border-color--default__373c0__2oFDT'}):
            p_comment = block.find('p',{'class': 'lemon--p__373c0__3Qnnj text__373c0__2pB8f comment__373c0__3EKjH text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_'})
            # only blocks that contain a comment paragraph are reviews
            if p_comment is not None:
                # the rating is read from the first character of the star element's aria-label
                star = int(block.find('span',{'class':'lemon--span__373c0__3997G display--inline__373c0__1DbOG border-color--default__373c0__2oFDT'}).find('div')['aria-label'][0])
                output_list.append([shop_name, star, p_comment.text])
        # wait 1 to 2 seconds before requesting the next page
        time.sleep(random.random() + 1)
    return output_list
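
The rating above is taken from the first character of the `aria-label`. A slightly more defensive variant could read the whole leading number with a regular expression. This sketch assumes the label looks like `5 star rating`, which is inferred from the indexing in `get_reviews` rather than confirmed. `parse_star` is a hypothetical helper.

```python
import re

def parse_star(aria_label):
    """Hypothetical helper: read the leading number from a label like '5 star rating'."""
    match = re.match(r'\s*(\d+(?:\.\d+)?)', aria_label)
    if match is None:
        return None  # the label does not start with a number
    return float(match.group(1))

print(parse_star('5 star rating'))
print(parse_star('4.5 star rating'))
print(parse_star('no rating'))
```

Returning `None` instead of raising lets the caller decide whether to skip a review with an unreadable rating.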

### Combine all together!
Given a search query, return the shop list and the reviews of each shop.
Finally, export the reviews to a 3-column CSV file that contains:
1. shop name
2. star rating of the review
3. review content
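
As a rough illustration of the three-column layout, the sketch below writes two placeholder rows with the standard `csv` module. The rows are made up for illustration and are not crawled data. The column names follow the ones used in the pandas cell at the end of this notebook.

```python
import csv
import io

# Placeholder rows (made up for illustration, not crawled data)
rows = [
    ['Example Shop', 5, 'Great congee, friendly staff.'],
    ['Example Shop', 3, 'Food was fine, but the queue was long.'],
]

# In-memory file for the demo; for a real file one could use
# open('reviews.csv', 'w', newline='', encoding='utf-8-sig')
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['rest', 'rating', 'reviews'])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Review text often contains commas and line breaks, so a CSV writer that quotes such fields is safer than joining the columns by hand.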

In [0]:
reviews = []
shop_list = get_shop_list('https://www.yelp.com/search?find_desc=&find_loc=quarry%20bay', 30)
for p_shop in shop_list:
  reviews += get_reviews(p_shop[0], p_shop[1])


Total_pages: 27
Sugar https://www.yelp.com/biz/sugar-%E9%A6%99%E6%B8%AF
Camper’s https://www.yelp.com/biz/%E5%9D%90%E5%BF%98-%E9%A6%99%E6%B8%AF-2
Mr & Mrs Fox https://www.yelp.com/biz/mr-%E5%92%8C-mrs-fox-%E9%A6%99%E6%B8%AF
Shugetsu https://www.yelp.com/biz/%E9%BA%B5%E9%AE%AE%E9%86%AC%E6%B2%B9%E6%88%BF-%E5%91%A8%E6%9C%88-%E9%A6%99%E6%B8%AF
金峰靚靚粥麵 https://www.yelp.com/biz/%E9%87%91%E5%B3%B0%E9%9D%9A%E9%9D%9A%E7%B2%A5%E9%BA%B5-%E9%A6%99%E6%B8%AF
Public https://www.yelp.com/biz/public-%E9%A6%99%E6%B8%AF
Plat du Jour https://www.yelp.com/biz/plat-du-jour-%E9%A6%99%E6%B8%AF
Formosa Autumn https://www.yelp.com/biz/%E4%B8%80%E8%91%89%E8%87%BA%E7%81%A3%E6%96%99%E7%90%86-%E9%A6%99%E6%B8%AF
Nha Trang https://www.yelp.com/biz/%E8%8A%BD%E8%8E%8A-%E9%A6%99%E6%B8%AF-2
Wah Cheong Congee & Noodle Shop https://www.yelp.com/biz/%E8%8F%AF%E6%98%8C%E7%B2%A5%E9%BA%B5-%E9%A6%99%E6%B8%AF
Simplylife https://www.yelp.com/biz/%E6%98%9F%E7%BE%8E%E4%B9%90-%E9%A6%99%E6%B8%AF-2
Tenren’s Tea https://www.yelp.com/biz

In [0]:
import pandas as pd
kt_reviews = pd.DataFrame(reviews)
kt_reviews.columns = ['rest', 'rating', 'reviews']
kt_reviews.to_csv('kt_reviews.csv', index=False, encoding='utf-8-sig')