# Web Scraping 

 - requests : package for making HTTP requests
 - lxml : HTML parsing
 - cssselect : CSS selector support for lxml

### HTTP Method 
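 - GET : used to retrieve a resource from the server
  - viewing a list
  - viewing a post
  - downloading
 - POST : used to add a resource to the server
  - writing a post
  - uploading
 - In practice, this distinction is often not observed.
 - Other methods such as PUT and DELETE also exist, but HTML forms support only GET and POST.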
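
A minimal sketch of how the two methods differ on the wire. It uses the standard library's `urllib.request.Request` so that nothing is actually sent; `https://example.com/board` and the field names are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, used only for illustration; no request is sent.
BASE_URL = "https://example.com/board"

# GET: parameters travel in the query string of the URL.
query = urlencode({"page": 2})
get_req = Request(f"{BASE_URL}?{query}")

# POST: parameters travel in the request body, as bytes.
body = urlencode({"title": "hello", "content": "first post"}).encode("utf-8")
post_req = Request(BASE_URL, data=body)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

With requests, the rough equivalents would be `requests.get(url, params=...)` and `requests.post(url, data=...)`.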

![status](img/14.PNG)

 - 405 (Method Not Allowed) : the request used a method the URL does not accept; switching between POST and GET usually fixes it.
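
As a quick reference, the standard library can translate a status code into its reason phrase. This is only a sketch; `describe_status` and `alternate_method` are hypothetical helper names, not part of requests.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code with its standard reason phrase."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


def alternate_method(method: str) -> str:
    """Hypothetical helper: the method to try after a 405 response."""
    return "POST" if method.upper() == "GET" else "GET"


for code in (200, 404, 405):
    print(describe_status(code))

print(alternate_method("GET"))
```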

In [1]:
import requests

In [2]:
url = 'https://techcrunch.com/2017/03/08/a-new-affordable-naming-startup-for-startups/'

In [3]:
res = requests.get(url)

In [4]:
res

<Response [200]>

In [5]:
res.text[:200]

'<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" lang="en">\n<head>\n\t<title>A new, affordable nami'

## Extracting the body text from HTML

In [6]:
import lxml.html

In [7]:
root = lxml.html.fromstring(res.text)

In [8]:
entries = root.cssselect('.article-entry')

In [9]:
entries

[<Element div at 0x20541f2f6d8>]

In [10]:
len(entries)

1

In [11]:
article = entries[0]
content = article.text_content()

In [12]:
content[:50]

'\n\n\n\nA few years ago, I launched a daily email news'
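
The extracted text starts with a run of newlines, so it is usually worth tidying before further processing. A minimal sketch, assuming that collapsing every run of whitespace into a single space is acceptable; `normalize_whitespace` is a hypothetical helper and `sample` is a made-up string modeled on the output above.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    # str.split() with no argument splits on any whitespace and drops
    # leading/trailing whitespace, so joining the pieces is enough.
    return " ".join(text.split())


# Made-up sample modeled on the start of the article text.
sample = "\n\n\n\nA few years ago,   I launched\ta daily email news"
print(normalize_whitespace(sample))
```

In the notebook this would be called as `normalize_whitespace(content)`.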

## Getting multiple links from a list

In [13]:
res = requests.get('https://techcrunch.com/startups/')
root = lxml.html.fromstring(res.text)

In [14]:
titles = root.cssselect('h2 a')

In [15]:
titles

[<Element a at 0x20541f5c3b8>,
 <Element a at 0x20541f5c408>,
 <Element a at 0x20541f5c458>,
 <Element a at 0x20541f5c4a8>,
 <Element a at 0x20541f5c4f8>,
 <Element a at 0x20541f5c548>,
 <Element a at 0x20541f5c598>,
 <Element a at 0x20541f5c5e8>,
 <Element a at 0x20541f5c638>,
 <Element a at 0x20541f5c688>,
 <Element a at 0x20541f5c6d8>,
 <Element a at 0x20541f5c728>,
 <Element a at 0x20541f5c778>,
 <Element a at 0x20541f5c7c8>,
 <Element a at 0x20541f5c818>,
 <Element a at 0x20541f5c868>,
 <Element a at 0x20541f5c8b8>,
 <Element a at 0x20541f5c908>,
 <Element a at 0x20541f5c958>,
 <Element a at 0x20541f5c9a8>]

In [16]:
# enumerate(..., start=1) numbers the titles from 1 without a manual counter
for i, title in enumerate(titles, start=1):
    print("- " + str(i) + " " + title.text)

- 1 Equity crowdfunding platform Seedrs to launch secondary market
- 2 Garena rebrands to Sea and raises $550 million more to focus on Indonesian e-commerce
- 3 Cornershop, a grocery-delivery app in Chile and Mexico, raises $21M
- 4 “The Handmaid’s Tale” is critical to the success of Hulu’s Live TV service
- 5 Urban-X’s investors showcase features high-tech face masks, navigation for the blind and more
- 6 Germany’s Duolingo competitor Babbel sets its sights on the US
- 7 Inside Andy Rubin’s futuristic Playground hardware incubator
- 8 Fortem raises $5.5 million to hunt and take down unwanted drones
- 9 Baby tech draws seed funding and a few big rounds
- 10 Made In Space reveals the Archinaut, a robot-operated factory in the sky
- 11 Growlabs nabs $2.2M to automate outbound sales
- 12 Scan these new QR-style Spotify Codes to instantly play a song
- 13 Equity podcast: Sequoia’s Roelof Botha on why he’s still bullish on Tesla
- 14 Docker’s new CEO sees gargantuan opportunities ahead
- 15

In [17]:
titles[0].attrib['href']

'https://techcrunch.com/2017/05/07/equity-crowdfunding-platform-seedrs-to-launch-secondary-market/'

In [18]:
for title in titles:
    print(title.attrib['href'])

https://techcrunch.com/2017/05/07/equity-crowdfunding-platform-seedrs-to-launch-secondary-market/
https://techcrunch.com/2017/05/07/sea-change/
https://techcrunch.com/2017/05/07/cornershop-a-grocery-delivery-app-in-chile-and-mexico-raises-21m/
https://techcrunch.com/2017/05/07/the-handmaids-tale-is-critical-to-the-success-of-hulus-live-tv-service/
https://techcrunch.com/2017/05/06/urban-xs-investors-showcase-features-high-tech-face-masks-navigation-for-the-blind-and-more/
https://techcrunch.com/2017/05/06/germanys-duolingo-competitor-babbel-sets-its-sights-on-the-u-s/
https://techcrunch.com/2017/05/06/playground-hardware/
https://techcrunch.com/2017/05/06/fortem-raises-5-5-million-to-hunt-and-take-down-unwanted-drones/
https://techcrunch.com/2017/05/06/baby-tech-draws-seed-funding-and-a-few-big-rounds/
https://techcrunch.com/2017/05/05/made-in-space-reveals-the-archinaut-a-robot-operated-factory-in-the-sky/
https://techcrunch.com/2017/05/05/automated-sales/
https://techcrunch.com/2017/

In [19]:
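articles = []
for title in titles:
    url = title.attrib['href']   # link to the full article
    res_a = requests.get(url)    # download the article page
    articles.append(res_a.text)
# After the loop, res_a still holds the response for the last article.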
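
When downloading many pages in a row, it is polite to pause between requests and to keep going if one page fails. A minimal sketch under those assumptions; `fetch_all` is a hypothetical helper, and the download function is passed in so the loop can be tried offline with a stub.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_all(urls: Iterable[str],
              fetch: Callable[[str], str],
              delay: float = 1.0) -> Dict[str, str]:
    """Download each URL once, pausing `delay` seconds between attempts."""
    pages: Dict[str, str] = {}
    attempted = 0
    for url in urls:
        if url in pages:
            continue  # already downloaded
        if attempted and delay > 0:
            time.sleep(delay)  # pause before every attempt except the first
        attempted += 1
        try:
            pages[url] = fetch(url)
        except Exception as exc:  # broad on purpose: this is only a sketch
            print(f"skipped {url}: {exc}")
    return pages


# Offline demo with a stub instead of a real HTTP call.
demo = fetch_all(["page-1", "page-2", "page-1"],
                 lambda u: f"<html>{u}</html>", delay=0)
print(list(demo))
```

In the notebook, the stub would be replaced by something like `lambda u: requests.get(u, timeout=10).text`.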

### Example (Self)

- Goal: extract only the reporter names for the articles in the list.
- Selector path: div class="river" -> li -> div class="block" -> div class="byline" -> a

In [20]:
import tqdm

In [21]:
subTitles = root.cssselect('.river li .block .byline a')

In [22]:
for subTitle in tqdm.tqdm_notebook(subTitles):
    print(subTitle.text)
    print("https://techcrunch.com" + subTitle.attrib['href'])

Steve O'Hear
https://techcrunch.com/author/steve-ohear/
Catherine Shu
https://techcrunch.com/author/catherine-shu/
Matthew Lynley
https://techcrunch.com/author/matthew-lynley/
Jordan Crook
https://techcrunch.com/author/jordan-crook/
Stefan Etienne
https://techcrunch.com/author/stefan-etienne/
Frederic Lardinois
https://techcrunch.com/author/frederic-lardinois/
Josh Constine
https://techcrunch.com/author/josh-constine/
Lora Kolodny
https://techcrunch.com/author/lora-kolodny/
Joanna Glasner
https://techcrunch.com/author//
Lora Kolodny
https://techcrunch.com/author/lora-kolodny/
John Mannes
https://techcrunch.com/author/john-mannes/
Josh Constine
https://techcrunch.com/author/josh-constine/
Alex Wilhelm
https://techcrunch.com/author/alex-wilhelm/
Frederic Lardinois
https://techcrunch.com/author/frederic-lardinois/
John Biggs
https://techcrunch.com/author/john-biggs/
John Mannes
https://techcrunch.com/author/john-mannes/
Sarah Buhr
https://techcrunch.com/author/sarah-buhr/
Frederic Lardino
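
The author links are relative, which is why the cell above prepends the site address by hand. A sketch of the same step using `urllib.parse.urljoin`, which resolves relative links and leaves absolute ones untouched; `absolute_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin

# The page the links were scraped from.
PAGE_URL = "https://techcrunch.com/startups/"


def absolute_url(href: str, base: str = PAGE_URL) -> str:
    """Resolve href against the page URL; absolute hrefs pass through."""
    return urljoin(base, href)


print(absolute_url("/author/steve-ohear/"))
print(absolute_url("https://techcrunch.com/2017/05/07/sea-change/"))
```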

### TQDM Notebook 
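 - Worth looking up and trying: wrap the iterable of a for loop in it and a progress bar appears. In recent tqdm releases, `tqdm_notebook` is deprecated in favor of `tqdm.notebook.tqdm`.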

## BeautifulSoup 

In [23]:
import urllib.request 
from bs4 import BeautifulSoup

In [24]:
req = urllib.request.Request('https://techcrunch.com/startups/')
data = urllib.request.urlopen(req).read()

In [25]:
bs = BeautifulSoup(data, 'html.parser')

In [26]:
h2s = bs.find_all('h2')

In [27]:
h2s[:5]

[<h2>Hi!</h2>,
 <h2>What big tech companies pay interns</h2>,
 <h2 class="h-alt">Giant companies that won't buy your startup</h2>,
 <h2 class="h-alt">Yik Yak is shutting down</h2>,
 <h2 class="h-alt">Robinhood stock app raises $110M at $1.3B</h2>]

In [28]:
i = 0
for h2 in h2s:
    all_s = h2.find_all('a')
    if len(all_s) > 0:
        i += 1
        print(str(i) + " " + all_s[0].text)

1 Equity crowdfunding platform Seedrs to launch secondary market
2 Garena rebrands to Sea and raises $550 million more to focus on Indonesian e-commerce
3 Cornershop, a grocery-delivery app in Chile and Mexico, raises $21M
4 “The Handmaid’s Tale” is critical to the success of Hulu’s Live TV service
5 Urban-X’s investors showcase features high-tech face masks, navigation for the blind and more
6 Germany’s Duolingo competitor Babbel sets its sights on the US
7 Inside Andy Rubin’s futuristic Playground hardware incubator
8 Fortem raises $5.5 million to hunt and take down unwanted drones
9 Baby tech draws seed funding and a few big rounds
10 Made In Space reveals the Archinaut, a robot-operated factory in the sky
11 Growlabs nabs $2.2M to automate outbound sales
12 Scan these new QR-style Spotify Codes to instantly play a song
13 Equity podcast: Sequoia’s Roelof Botha on why he’s still bullish on Tesla
14 Docker’s new CEO sees gargantuan opportunities ahead
15 Worst quarter for paid TV sub

## Article Image 
 - Fetch an image from the last article that was crawled.

In [29]:
root = lxml.html.fromstring(res_a.text)

In [30]:
imgs = root.cssselect('.article-entry img')

In [31]:
img = imgs[0]

In [32]:
img.attrib['src']

'https://tctechcrunch2011.files.wordpress.com/2017/05/fuchsia-macaree-izzy-wheel.jpg?w=738'

In [33]:
picture = requests.get(img.attrib['src'])

In [34]:
with open('img/a.jpg', 'wb') as f:  # the img/ directory must already exist
    f.write(picture.content)
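
Instead of hard-coding `a.jpg`, the file name can be taken from the image URL, and the target directory can be created on demand. A minimal sketch; `filename_from_url` and `save_bytes` are hypothetical helpers, and the fallback name `image.jpg` is an arbitrary choice.

```python
import os
import posixpath
from urllib.parse import urlparse


def filename_from_url(url: str, default: str = "image.jpg") -> str:
    """Take the last path segment of the URL, ignoring the query string."""
    name = posixpath.basename(urlparse(url).path)
    return name or default


def save_bytes(data: bytes, directory: str, filename: str) -> str:
    """Write data to directory/filename, creating the directory if needed."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "wb") as f:
        f.write(data)
    return path


src = ("https://tctechcrunch2011.files.wordpress.com/2017/05/"
       "fuchsia-macaree-izzy-wheel.jpg?w=738")
print(filename_from_url(src))
```

In the notebook this would be used as `save_bytes(picture.content, 'img', filename_from_url(img.attrib['src']))`.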