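## Web Crawling

### Parsing the information we need from a web page
>- The 50 Best Sandwiches in Chicago
>- Fetch the TOP 50 list from the main page
>- Fetch the detail page linked from each entry

#### Data-crawling tasks
>
>- Task 1: crawl the main page for rank, cafe name, menu name, and detail-page link
>- Task 2: crawl each detail page for price, address, phone number, and homepage
>- Task 3: save the combined results of Tasks 1 and 2 to a file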

In [1]:
from bs4 import BeautifulSoup 
from urllib.request import urlopen

import pandas as pd
import re

### Crawling the main page

In [2]:
url  = 'http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/'
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")
soup.title

<title>
  The 50 Best Sandwiches in Chicago |
  Chicago magazine
      |  November 2012
    </title>
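
Some sites reject requests that carry the default `urllib` User-Agent and answer with an HTTP error instead of the page. The notebook does not show whether this site does so, so the sketch below is only a precaution: it builds a `Request` with an explicit User-Agent header. The helper names `build_request` and `fetch_html` and the `'Mozilla/5.0'` value are illustrative assumptions, not part of the original notebook.

```python
from urllib.request import Request, urlopen

def build_request(url, user_agent='Mozilla/5.0'):
    """Return a Request that sends an explicit User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})

def fetch_html(url, timeout=10):
    """Download the page body as bytes (this performs a network call)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

# Building the request does not touch the network, so it is safe to inspect.
req = build_request('http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/')
print(req.get_header('User-agent'))  # urllib stores the header name with this capitalization
```

The bytes returned by `fetch_html(url)` can be passed to `BeautifulSoup(..., "lxml")` in the same way as the `urlopen(url)` result above.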

In [3]:
content_post = soup.find('div', 'content post')
content_post

<div class="content post">
<div class="fb-like fb-like-top" data-href="http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/" data-layout="box_count" data-send="false" data-share="true" data-show-faces="false" data-width="50"></div>
<p><span class="dropcap">F</span>or generations, sandwiches were the ultimate guilty pleasure of subcultures that had no patience for guilt: hungry bachelors, school kids, working stiffs, old men in delis. To fridge-foraging rubes like Dagwood, quality wasn’t half as important as quantity. The sandwich was one of the only snacks you were allowed to pile as high as you wanted with anything you desired and cram into your face with both hands—a meal so inelegant and blithely proud of its inelegance that it came in six-foot segments for parties. And we ate it. Standing up.</p>
<section class="related-content pull-right">
<h3>Related Content</h3>
<ul>
<li><a href="http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiche

In [4]:
contents = content_post.find_all('div', 'sammy')
len(contents)

50

#### Rank

In [5]:
tmp  = contents[0].find('div', 'sammyRank')
rank = int(tmp.get_text())
rank

1

#### Cafe name

In [6]:
tmp  = contents[0].find('a')
tmp1 = tmp.get_text()
# tmp1 is 'BLT\r\nOld Oak Tap\nRead more '

tmp2 = tmp1.find('\n')          # index of the first \n
tmp3 = tmp1.find('\n', tmp2+1)  # index of the second \n

cafe_name = tmp1[tmp2+1:tmp3]
cafe_name

'Old Oak Tap'
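
The index arithmetic above picks out the text between the first and second newline. A minimal alternative sketch, assuming the anchor text always has the three-line shape shown in the comment (menu, cafe name, "Read more"), is to split on line breaks and take the second line. The helper name `cafe_from_anchor_text` is hypothetical.

```python
def cafe_from_anchor_text(text):
    """Return the second line of the anchor text, which holds the cafe name."""
    # splitlines() treats '\r\n' as a single line break, so no stray '\r' is left behind.
    lines = text.splitlines()
    return lines[1].strip()

# Sample string copied from the comment in the cell above.
anchor_text = 'BLT\r\nOld Oak Tap\nRead more '
cafe_name = cafe_from_anchor_text(anchor_text)
print(cafe_name)
```

If an entry had fewer than two lines this helper would raise `IndexError`, which makes a layout change visible instead of silently returning a wrong slice.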

#### Detail-page link

In [7]:
if 'http' not in tmp['href']:
    link = 'http://www.chicagomag.com' + tmp['href']
else:
    link = tmp['href']

link

'http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/'
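
The `if 'http' not in ...` check above handles both relative and absolute `href` values by hand. As a sketch of the same idea, `urllib.parse.urljoin` from the standard library resolves a relative path against a base URL and leaves an absolute URL unchanged. The relative path below is a hypothetical example of what a relative `href` could look like; the notebook only shows the absolute form.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.chicagomag.com'

# Hypothetical relative href, and the absolute link printed by the cell above.
relative = '/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/'
absolute = 'http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/'

from_relative = urljoin(BASE_URL, relative)  # the base is prepended
from_absolute = urljoin(BASE_URL, absolute)  # already absolute, returned as-is

print(from_relative)
print(from_absolute)
```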

#### Menu name

In [8]:
tmp = contents[0].find('b')
menu_name = tmp.get_text()
menu_name

'BLT'

#### Putting it together

In [9]:
def chicago_sandwiches_rank():
    Rank     = []
    CafeName = []
    MenuName = []
    Link     = []
    
    url  = 'http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/'
    html = urlopen(url)
    soup = BeautifulSoup(html, "lxml")
    
    content_post = soup.find('div', 'content post')
    contents = content_post.find_all('div', 'sammy')
    
    for content in contents:
        
        # Rank
        tmp  = content.find('div', 'sammyRank')
        rank = int(tmp.get_text())
        
        # Cafe name
        tmp  = content.find('a')
        tmp1 = tmp.get_text()
        # tmp1 looks like 'BLT\r\nOld Oak Tap\nRead more '
        tmp2 = tmp1.find('\n')          # index of the first \n
        tmp3 = tmp1.find('\n', tmp2+1)  # index of the second \n
        cafe_name = tmp1[tmp2+1:tmp3]
        
        # Detail-page link (prefix the site root when the href is relative)
        if 'http' not in tmp['href']:
            link = 'http://www.chicagomag.com' + tmp['href']
        else:
            link = tmp['href']
        
        # Menu name
        tmp = content.find('b')
        menu_name = tmp.get_text()
        
        Rank.append(rank)
        CafeName.append(cafe_name)
        MenuName.append(menu_name)
        Link.append(link)
    
    # Column names: 순위 = rank, 상호명 = cafe name, 메뉴 = menu, 링크 = link
    data   = {'순위':Rank, '상호명':CafeName, '메뉴':MenuName, '링크':Link }
    ret_df = pd.DataFrame(data)
    
    return ret_df

In [10]:
df = chicago_sandwiches_rank()
df.head(7)

| | 순위 | 상호명 | 메뉴 | 링크 |
|---|---|---|---|---|
| 0 | 1 | Old Oak Tap | BLT | http://www.chicagomag.com/Chicago-Magazine/Nov... |
| 1 | 2 | Au Cheval | Fried Bologna | http://www.chicagomag.com/Chicago-Magazine/Nov... |
| 2 | 3 | Xoco | Woodland Mushroom | http://www.chicagomag.com/Chicago-Magazine/Nov... |
| 3 | 4 | Al’s Deli | Roast Beef | http://www.chicagomag.com/Chicago-Magazine/Nov... |
| 4 | 5 | Publican Quality Meats | PB&L | http://www.chicagomag.com/Chicago-Magazine/Nov... |
| 5 | 6 | Hendrickx Belgian Bread Crafter | Belgian Chicken Curry Salad | http://www.chicagomag.com/Chicago-Magazine/Nov... |
| 6 | 7 | Acadia | Lobster Roll | http://www.chicagomag.com/Chicago-Magazine/Nov... |


### Crawling the detail pages

In [11]:
url_page = df['링크'][5]
url_page

'http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Hendrickx-Belgian-Bread-Crafter-Belgian-Chicken-Curry-Salad/'

In [12]:
html = urlopen(url_page)
soup = BeautifulSoup(html, "lxml")
soup.title

<title>
  6. Hendrickx Belgian Bread Crafter Belgian Chicken Curry Salad |
  Chicago magazine
      |  November 2012
    </title>

In [13]:
conts = soup.find('em')
conts

<em>$7.25. 100 E. Walton St., 312-649-6717</em>

In [14]:
conts_t = conts.get_text()
conts_t

'$7.25. 100 E. Walton St., 312-649-6717'

#### Price

In [15]:
# Note: the pattern also consumes the street number that follows the price
re_price = re.search(r'[$]\d{0,2}[.]?\d{1,2}.\s{1}\d{1,4}', conts_t)

if re_price is not None:
    price = re_price.group()

price

'$7.25. 100'

#### Address

In [16]:
# Note: for this sample the match begins at 'E.', so the street number 100 is not captured
re_address = re.search(r'\s{1}\w{0,3}[ .]?\s{1}\w{3,15}\s{1}\w{2,3}', conts_t)

if re_address is not None:
    address = re_address.group().strip()

address

'E. Walton St'

#### Phone number

In [17]:
re_phone = re.search(r'\d{3}[-]\d{3}[-]\d{4}', conts_t)

if re_phone is not None:
    phone = re_phone.group()

phone

'312-649-6717'
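
The three patterns above are searched independently, which is why the street number ends up attached to the price rather than to the address. As a hedged alternative, the sketch below assumes the `<em>` text always follows the layout `<price>. <address>, <phone>` seen in this one sample and captures all three fields with a single pattern using named groups. The function name `parse_contact_line` and the pattern itself are illustrative and have only been checked against the sample string from this page.

```python
import re

# Assumed layout: "<price>. <address>, <phone>"
CONTACT_PATTERN = re.compile(
    r'(?P<price>\$\d+(?:\.\d{2})?)'    # "$7.25" or "$10"
    r'\.?\s+'                          # sentence period and space after the price
    r'(?P<address>.+?)'                # shortest text up to the comma before the phone
    r',\s*'
    r'(?P<phone>\d{3}-\d{3}-\d{4})'    # "312-649-6717"
)

def parse_contact_line(text):
    """Return a dict with price, address and phone, or None if the layout differs."""
    match = CONTACT_PATTERN.search(text)
    return match.groupdict() if match else None

# Sample string copied from the conts_t output above.
parsed = parse_contact_line('$7.25. 100 E. Walton St., 312-649-6717')
print(parsed)
```

Returning `None` for text that does not fit the assumed layout keeps unexpected pages easy to spot later.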

#### Homepage

In [18]:
if conts.find('a') is None:
    info = '홈페이지 정보 없음'  # "no homepage information"
else:
    info = conts.find('a').get_text()
    
info

'홈페이지 정보 없음'

#### Putting it all together

In [19]:
# Newer tqdm releases deprecate tqdm_notebook in favor of tqdm.notebook.tqdm
from tqdm import tqdm_notebook

In [20]:
def chicago_sandwiches_info():
    Price   = []
    Address = []
    Phone   = []
    Info    = []
    
    for idx, url_page in enumerate(tqdm_notebook(df['링크']), start=1):
        
        # Progress log: "<idx>번째 크롤링" means "crawl number <idx>"
        print("{}번째 크롤링..{}".format(idx, url_page))
        
        html = urlopen(url_page)
        soup = BeautifulSoup(html, "lxml")
        
        conts = soup.find('em')
        conts_t = conts.get_text()
        
        # Price (None when the pattern does not match, rather than
        # reusing the value left over from the previous page)
        re_price = re.search(r'[$]\d{0,2}[.]?\d{1,3}.\s{1}\d{1,4}', conts_t)
        price = re_price.group() if re_price is not None else None
            
        # Address
        re_address = re.search(r'\s{1}\w{0,3}[ .]?\s{1}\w{3,15}\s{1}\w{2,3}', conts_t)
        address = re_address.group().strip() if re_address is not None else None
        
        # Phone number
        re_phone = re.search(r'\d{3}[-]\d{3}[-]\d{4}', conts_t)
        phone = re_phone.group() if re_phone is not None else None
        
        # Homepage ('홈페이지 정보 없음' means "no homepage information")
        link_tag = conts.find('a')
        info = '홈페이지 정보 없음' if link_tag is None else link_tag.get_text()
        
        Price.append(price)
        Address.append(address)
        Phone.append(phone)
        Info.append(info)
        
    df['가격']          = Price
    df['주소']          = Address
    df['전화번호']      = Phone
    df['홈페이지 정보'] = Info

    print('Crawling is Finished !!!')
        
    return df
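
The loop above stops at the first page that fails to download or parse. A minimal sketch of a more tolerant loop follows, assuming it is acceptable to record `None` for a failed page and move on. The names `crawl_details`, `fake_fetch` and the `delay` parameter are hypothetical; `fetch` and `parse` stand in for the `urlopen` call and the parsing logic, and the fake fetcher exists only so the sketch runs without network access.

```python
import time

def crawl_details(urls, fetch, parse, delay=0.0):
    """Fetch and parse each URL, recording None for pages that fail."""
    rows = []
    for url in urls:
        try:
            rows.append(parse(fetch(url)))
        except Exception as exc:  # broad on purpose: one bad page should not end the crawl
            print('skipped {}: {}'.format(url, exc))
            rows.append(None)
        if delay:
            time.sleep(delay)  # optional pause between requests
    return rows

def fake_fetch(url):
    """Stand-in for a real download so the sketch runs offline."""
    if 'missing' in url:
        raise ValueError('page not found')
    return '<em>$10</em>'

rows = crawl_details(
    ['http://example.com/a', 'http://example.com/missing'],
    fetch=fake_fetch,
    parse=len,  # trivial parser: length of the returned HTML string
)
print(rows)
```

Because the result keeps one entry per URL, it stays aligned with the rows of `df` even when some pages are skipped.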

In [21]:
all_df = chicago_sandwiches_info()
all_df.head()

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for url_page in tqdm_notebook(df['링크']):



1번째 크롤링..http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/
2번째 크롤링..http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Au-Cheval-Fried-Bologna/
3번째 크롤링..http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Xoco-Woodland-Mushroom/
4번째 크롤링..http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Als-Deli-Roast-Beef/
5번째 크롤링..http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Publican-Quality-Meats-PB-L/
6번째 크롤링..http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Hendrickx-Belgian-Bread-Crafter-Belgian-Chicken-Curry-Salad/
7번째 크롤링..http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Acadia-Lobster-Roll/
8번째 크롤링..http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Birchwood-Kitchen-Smoked-Salmon-Salad/
9번째 크롤링..http://www

| | 순위 | 상호명 | 메뉴 | 링크 | 가격 | 주소 | 전화번호 | 홈페이지 정보 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Old Oak Tap | BLT | http://www.chicagomag.com/Chicago-Magazine/Nov... | $10. 2109 | W. Chicago Ave | 773-772-0406 | theoldoaktap.com |
| 1 | 2 | Au Cheval | Fried Bologna | http://www.chicagomag.com/Chicago-Magazine/Nov... | $9. 800 | W. Randolph St | 312-929-4580 | aucheval.tumblr.com |
| 2 | 3 | Xoco | Woodland Mushroom | http://www.chicagomag.com/Chicago-Magazine/Nov... | $9.50. 445 | N. Clark St | 312-334-3688 | rickbayless.com |
| 3 | 4 | Al’s Deli | Roast Beef | http://www.chicagomag.com/Chicago-Magazine/Nov... | $9.40. 914 | 914 Noyes St | 847-475-9400 | alsdeli.net |
| 4 | 5 | Publican Quality Meats | PB&L | http://www.chicagomag.com/Chicago-Magazine/Nov... | $10. 825 | W. Fulton Mkt | 312-445-8977 | publicanqualitymeats.com |


In [22]:
all_df.set_index('순위', inplace=True)
all_df

| 순위 | 상호명 | 메뉴 | 링크 | 가격 | 주소 | 전화번호 | 홈페이지 정보 |
|---|---|---|---|---|---|---|---|
| 1 | Old Oak Tap | BLT | http://www.chicagomag.com/Chicago-Magazine/Nov... | $10. 2109 | W. Chicago Ave | 773-772-0406 | theoldoaktap.com |
| 2 | Au Cheval | Fried Bologna | http://www.chicagomag.com/Chicago-Magazine/Nov... | $9. 800 | W. Randolph St | 312-929-4580 | aucheval.tumblr.com |
| 3 | Xoco | Woodland Mushroom | http://www.chicagomag.com/Chicago-Magazine/Nov... | $9.50. 445 | N. Clark St | 312-334-3688 | rickbayless.com |
| 4 | Al’s Deli | Roast Beef | http://www.chicagomag.com/Chicago-Magazine/Nov... | $9.40. 914 | 914 Noyes St | 847-475-9400 | alsdeli.net |
| 5 | Publican Quality Meats | PB&L | http://www.chicagomag.com/Chicago-Magazine/Nov... | $10. 825 | W. Fulton Mkt | 312-445-8977 | publicanqualitymeats.com |
| 6 | Hendrickx Belgian Bread Crafter | Belgian Chicken Curry Salad | http://www.chicagomag.com/Chicago-Magazine/Nov... | $7.25. 100 | E. Walton St | 312-649-6717 | 홈페이지 정보 없음 |
| 7 | Acadia | Lobster Roll | http://www.chicagomag.com/Chicago-Magazine/Nov... | $16. 1639 | S. Wabash Ave | 312-360-9500 | acadiachicago.com |
| 8 | Birchwood Kitchen | Smoked Salmon Salad | http://www.chicagomag.com/Chicago-Magazine/Nov... | $10. 2211 | W. North Ave | 773-276-2100 | birchwoodkitchen.com |
| 9 | Cemitas Puebla | Atomica Cemitas | http://www.chicagomag.com/Chicago-Magazine/Nov... | $9. 3619 | W. North Ave | 773-772-8435 | cemitaspuebla.com |
| 10 | Nana | Grilled Laughing Bird Shrimp and Fried Po’ Boy | http://www.chicagomag.com/Chicago-Magazine/Nov... | $17. 3267 | S. Halsted St | 312-929-2486 | nanaorganic.com |


In [24]:
# file_name='../data/The_50_Best_Sandwiches_in_Chicago.csv'
# df.to_csv(file_name, sep=',', encoding='UTF-8')
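
The commented-out cell above covers the saving step of Task 3. As a small sketch of how the round trip could look, the example below writes a two-row sample to CSV and reads it back with the rank column restored as the index. It uses an in-memory `io.StringIO` buffer instead of the `../data/...` path so that it runs anywhere; the two sample rows are taken from the table above, and the variable names are illustrative.

```python
import io

import pandas as pd

# Two rows from the table above, indexed by rank (순위) as in all_df.
sample = pd.DataFrame(
    {'순위': [1, 2],
     '상호명': ['Old Oak Tap', 'Au Cheval'],
     '메뉴': ['BLT', 'Fried Bologna']}
).set_index('순위')

# Write to an in-memory buffer; a file path would work the same way.
buffer = io.StringIO()
sample.to_csv(buffer, sep=',')

# Read the CSV back and restore the rank column as the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col='순위')
print(restored)
```

When writing to a real file that will be opened in a spreadsheet program, passing `encoding='utf-8-sig'` to `to_csv` is a common choice for keeping the Korean column names readable.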