# Web Scraping

 - requests : package for making HTTP requests
 - lxml : HTML parsing
 - cssselect : CSS selector support used together with lxml

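### HTTP Method
 - GET : retrieve a resource from the server
  - viewing a list
  - viewing a post
  - downloading
 - POST : add a resource to the server
  - writing a post
  - uploading
 - In practice, this distinction is often not followed.
 - Other methods such as PUT and DELETE exist, but HTML forms in web browsers can only submit GET and POST.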
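
The difference between GET and POST can be seen without sending anything. The sketch below builds two request objects with the standard library's `urllib.request.Request`; the `example.com` URLs and the form fields are placeholders, not part of the original notes.

```python
import urllib.parse
import urllib.request

# GET: parameters travel in the URL's query string (placeholder URL).
query = urllib.parse.urlencode({"page": 2})
get_req = urllib.request.Request("https://example.com/posts?" + query)

# POST: data travels in the request body as bytes (placeholder form field).
body = urllib.parse.urlencode({"title": "hello"}).encode("utf-8")
post_req = urllib.request.Request("https://example.com/posts", data=body)

# The request objects are only built here; nothing is sent over the network.
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Passing `data` is what switches `urllib.request.Request` from GET to POST; with `requests`, the same choice is made by calling `requests.get` or `requests.post`.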

![status](3.PNG)

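 - 405 (Method Not Allowed) : the server does not accept the HTTP method that was used; switching between POST and GET usually fixes it.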
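
When a response comes back with an unfamiliar number, the standard library can translate it into its official phrase. This is a small sketch; `describe_status` is a hypothetical helper name introduced only for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the numeric code followed by its standard reason phrase."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


for code in (200, 404, 405):
    print(describe_status(code))
```

With `requests`, the number to pass in is `res.status_code`.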

In [1]:
import requests

In [2]:
url = 'https://techcrunch.com/2017/03/08/a-new-affordable-naming-startup-for-startups/'

In [3]:
res = requests.get(url)

In [4]:
res

<Response [200]>

In [5]:
res.text[:200]

'<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" lang="en">\n<head>\n\t<title>A new, affordable nami'

## Extracting the Article Body from HTML

In [6]:
import lxml.html

In [7]:
root = lxml.html.fromstring(res.text)

In [8]:
entries = root.cssselect('.article-entry')

In [9]:
entries

[<Element div at 0x1f8039b41d8>]

In [10]:
len(entries)

1

In [11]:
article = entries[0]
content = article.text_content()

In [12]:
content[:50]

'\n\n\n\nA few years ago, I launched a daily email news'
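
The extracted text starts with several newlines, as the output above shows. One possible cleanup, sketched here with a hypothetical helper named `normalize_whitespace`, collapses every run of whitespace into a single space. Note that this also removes paragraph breaks, which may or may not be wanted.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    return " ".join(text.split())


# Sample string modeled on the start of the extracted article text.
sample = "\n\n\n\nA few years ago,\n\tI launched a daily email"
print(normalize_whitespace(sample))
```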

## Collecting Multiple Links from a Listing Page

In [13]:
res = requests.get('https://techcrunch.com/startups/')
root = lxml.html.fromstring(res.text)

In [14]:
titles = root.cssselect('h2 a')

In [15]:
titles

[<Element a at 0x1f8029bd688>,
 <Element a at 0x1f8029bd728>,
 <Element a at 0x1f8029bd778>,
 <Element a at 0x1f8029bd7c8>,
 <Element a at 0x1f8029bd818>,
 <Element a at 0x1f8029bd868>,
 <Element a at 0x1f8029bd8b8>,
 <Element a at 0x1f8039dca48>,
 <Element a at 0x1f8039dca98>,
 <Element a at 0x1f8039dcae8>,
 <Element a at 0x1f8039dcb38>,
 <Element a at 0x1f8039dcb88>,
 <Element a at 0x1f8039dcbd8>,
 <Element a at 0x1f8039dcc28>,
 <Element a at 0x1f8039dcc78>,
 <Element a at 0x1f8039dccc8>,
 <Element a at 0x1f8039dcd18>,
 <Element a at 0x1f8039dcd68>,
 <Element a at 0x1f8039dcdb8>,
 <Element a at 0x1f8039dce08>]

In [16]:
# enumerate(..., 1) numbers the titles starting from 1
for i, title in enumerate(titles, 1):
    print(f"- {i} {title.text}")

- 1 Santa Fe enlists Rubicon Global to curb waste and ramp up recycling
- 2 The League adds read receipts, so paid members can confirm when someone is really ghosting them
- 3 Arthena uses data science to find the best investments in art
- 4 Kangpe is a mobile service connecting Africa to healthcare
- 5 Fritz Lanman takes CEO role at ClassPass as founder Payal Kadakia steps in as Chairman
- 6 Piwik Pro raises $2 million to truly liberate your analytics
- 7 Utilis takes top water innovation prize at Imagine H2O for tech that finds leaks underground
- 8 Mobile ad company Appodeal acquires game platform Corona Labs
- 9 Rock Pamper Scissors, the hairdresser booking app backed by Seedcamp and 500 Startups, shutters
- 10 WeWork’s Adam Neumann is coming to Disrupt
- 11 In the future there will be mindclones
- 12 Cubspot finds camps and classes for your wee ones
- 13 Carbon moves into high-volume manufacturing with SpeedCell system, and bigger 3D printers
- 14 Smart diabetes management service

In [17]:
titles[0].attrib['href']

'https://techcrunch.com/2017/03/17/santa-fe-enlists-rubicon-global-to-curb-waste-and-ramp-up-recycling/'

In [18]:
for title in titles:
    print(title.attrib['href'])

https://techcrunch.com/2017/03/17/santa-fe-enlists-rubicon-global-to-curb-waste-and-ramp-up-recycling/
https://techcrunch.com/2017/03/17/the-league-adds-read-receipts/
https://techcrunch.com/2017/03/17/arthena-y-combinator/
https://techcrunch.com/2017/03/17/kangpe-is-a-mobile-service-connecting-africa-to-healthcare/
https://techcrunch.com/2017/03/17/classpass-ceo-fritz-lanman-payal-kadakia/
https://techcrunch.com/2017/03/17/piwic-raises-2-million-to-truly-liberate-your-analytics/
https://techcrunch.com/2017/03/16/utilis-takes-top-water-innovation-prize-at-imagine-h2o/
https://techcrunch.com/2017/03/16/appodeal-acquires-corona-labs/
https://techcrunch.com/2017/03/16/rock-pamper-scissors-deadpool/
https://techcrunch.com/2017/03/16/weworks-adam-neumann-is-coming-to-disrupt/
https://techcrunch.com/2017/03/16/in-the-future-there-will-be-mindclones/
https://techcrunch.com/2017/03/16/cubspot-finds-camps-and-classes-for-your-wee-ones/
https://techcrunch.com/2017/03/16/carbon-moves-into-high-vo

In [51]:
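articles = []
for title in titles:
    url = title.attrib['href']    # link target of each headline
    res_a = requests.get(url)     # after the loop, res_a holds the last article's response
    articles.append(res_a.text)   # keep the raw HTML of every article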
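
The loop above sends one request per headline back to back. A gentler variant, sketched here under assumptions, pauses between requests and skips pages that fail. `fetch_all`, its `fetch` parameter, and `delay_seconds` are hypothetical names; `fetch` is any callable that returns the page text for a URL, for example `lambda u: requests.get(u, timeout=10).text`.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in order, pausing between requests.

    Returns a dict mapping each successfully fetched URL to its page text.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        try:
            pages[url] = fetch(url)
        except Exception as exc:  # broad on purpose: one bad page should not stop the crawl
            print(f"skipped {url}: {exc}")
    return pages


# Offline demonstration with a stand-in fetcher (no network involved).
def fake_fetch(url):
    if url.endswith("/broken"):
        raise ValueError("simulated failure")
    return "<html>" + url + "</html>"


pages = fetch_all(
    ["https://example.com/a", "https://example.com/broken"],
    fake_fetch,
    delay_seconds=0,
)
print(sorted(pages))
```

Passing the fetcher in as an argument keeps the crawl logic testable without touching the network.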

### Example (Self)

- Goal: extract only the author names of the articles listed on the page.
- Path: div class = "river" -> li -> div class = "block" -> div class = "byline" -> a

In [45]:
import tqdm

In [46]:
subTitles = root.cssselect('.river li .block .byline a')

In [48]:
for subTitle in tqdm.tqdm_notebook(subTitles):
    print(subTitle.text)
    print("https://techcrunch.com" + subTitle.attrib['href'])

Lora Kolodny
https://techcrunch.com/author/lora-kolodny/
Anthony Ha
https://techcrunch.com/author/anthony-ha/
Anthony Ha
https://techcrunch.com/author/anthony-ha/
Sarah Buhr
https://techcrunch.com/author/sarah-buhr/
Jordan Crook
https://techcrunch.com/author/jordan-crook/
John Biggs
https://techcrunch.com/author/john-biggs/
Lora Kolodny
https://techcrunch.com/author/lora-kolodny/
Anthony Ha
https://techcrunch.com/author/anthony-ha/
Steve O'Hear
https://techcrunch.com/author/steve-ohear/
Connie Loizos
https://techcrunch.com/author/connie-loizos/
John Biggs
https://techcrunch.com/author/john-biggs/
John Biggs
https://techcrunch.com/author/john-biggs/
Lora Kolodny
https://techcrunch.com/author/lora-kolodny/
Matthew Lynley
https://techcrunch.com/author/matthew-lynley/
Steve O'Hear
https://techcrunch.com/author/steve-ohear/
Steve O'Hear
https://techcrunch.com/author/steve-ohear/
Lora Kolodny
https://techcrunch.com/author/lora-kolodny/
Catherine Shu
https://techcrunch.com/author/catherine-sh
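
The author links above are site-relative, so the host is prepended by string concatenation. A sketch of an alternative using the standard library's `urljoin`, which also leaves already-absolute links untouched, follows; `absolute_url` and `BASE_URL` are hypothetical names for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://techcrunch.com/startups/"  # the page the links were taken from


def absolute_url(href, base=BASE_URL):
    """Resolve an href found on the page against the page's own URL."""
    return urljoin(base, href)


print(absolute_url("/author/lora-kolodny/"))
print(absolute_url("https://techcrunch.com/author/anthony-ha/"))
```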

### TQDM Notebook
 - Look this up and try it: wrapping the iterable of a for loop with tqdm displays a progress bar.
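
A minimal sketch of the idea, assuming the `tqdm` package is installed; the loop body and the `squares` list are placeholders for real work.

```python
from tqdm import tqdm

squares = []
# Wrapping the iterable is enough; desc sets the label shown next to the bar.
for n in tqdm(range(3), desc="squaring"):
    squares.append(n * n)

print(squares)
```

Inside a Jupyter notebook, `from tqdm.notebook import tqdm` gives a widget-based bar with the same calling convention; recent tqdm releases mark the older `tqdm.tqdm_notebook` name as deprecated in its favor.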

## BeautifulSoup 

In [21]:
import urllib.request 
from bs4 import BeautifulSoup

In [22]:
req = urllib.request.Request('https://techcrunch.com/startups/')
data = urllib.request.urlopen(req).read()

In [23]:
bs = BeautifulSoup(data, 'html.parser')

In [24]:
h2s = bs.find_all('h2')

In [60]:
h2s[:5]

[<h2>Hi!</h2>,
 <h2>Bolt Threads debuts its first product</h2>,
 <h2 class="h-alt">Why YC startups love this new naming startup</h2>,
 <h2 class="h-alt">Instacart raises $400M at $3.4B valuation</h2>,
 <h2 class="h-alt">Tinder Select is a secret version of the app</h2>]

In [49]:
i = 0
for h2 in h2s:
    all_s = h2.find_all('a')
    if len(all_s) > 0:
        i += 1
        print(str(i) + " " +all_s[0].text)

1 Santa Fe enlists Rubicon Global to curb waste and ramp up recycling
2 The League adds read receipts, so paid members can confirm when someone is really ghosting them
3 Arthena uses data science to find the best investments in art
4 Kangpe is a mobile service connecting Africa to healthcare
5 Fritz Lanman takes CEO role at ClassPass as founder Payal Kadakia steps in as Chairman
6 Piwik Pro raises $2 million to truly liberate your analytics
7 Utilis takes top water innovation prize at Imagine H2O for tech that finds leaks underground
8 Mobile ad company Appodeal acquires game platform Corona Labs
9 Rock Pamper Scissors, the hairdresser booking app backed by Seedcamp and 500 Startups, shutters
10 WeWork’s Adam Neumann is coming to Disrupt
11 In the future there will be mindclones
12 Cubspot finds camps and classes for your wee ones
13 Carbon moves into high-volume manufacturing with SpeedCell system, and bigger 3D printers
14 Smart diabetes management service Livongo Health raises $52.5
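
BeautifulSoup also accepts CSS selectors through `select`, so the same `h2 a` selector used with lxml earlier works here too. The sketch below runs on a made-up snippet (`SAMPLE_HTML` is an assumption, not the real page) and assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

# Made-up markup modeled on the h2 elements shown above.
SAMPLE_HTML = """
<h2>Hi!</h2>
<h2 class="h-alt"><a href="/a">First post</a></h2>
<h2 class="h-alt"><a href="/b">Second post</a></h2>
"""

bs = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select("h2 a") returns only anchors nested inside an h2,
# so headings without a link need no explicit length check.
titles = [a.get_text(strip=True) for a in bs.select("h2 a")]

for i, title in enumerate(titles, 1):
    print(i, title)
```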

## Article Image
 - Fetch the image from the last article crawled above (`res_a` holds the final response of the crawl loop).

In [52]:
root = lxml.html.fromstring(res_a.text)

In [53]:
imgs = root.cssselect('.article-entry img')

In [54]:
img = imgs[0]

In [55]:
img.attrib['src']

'https://tctechcrunch2011.files.wordpress.com/2017/03/klp_9195.jpg?w=738'

In [56]:
picture = requests.get(img.attrib['src'])

In [57]:
with open('a.jpg','wb') as f:
    f.write(picture.content)
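
Instead of the fixed name `a.jpg`, the file name can be taken from the image URL itself. This is a sketch using only the standard library; `filename_from_url` and its `default` parameter are hypothetical names.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url, default="image.jpg"):
    """Return the last path segment of a URL, ignoring the query string."""
    name = posixpath.basename(urlparse(url).path)
    return name or default  # fall back when the path ends with a slash


print(filename_from_url(
    "https://tctechcrunch2011.files.wordpress.com/2017/03/klp_9195.jpg?w=738"
))
```

The result can be passed to `open(..., 'wb')` in place of the literal `'a.jpg'`.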