# Web Crawling 
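 - Text data would normally come from a company's internal VOC (voice of customer) data or from tables that hold text columns, but such data is generally hard to obtain. Instead, we use the web, where varied text such as news, SNS posts, and blogs is easy to collect.
 - requests : package for making HTTP requests
 - lxml : HTML processing
 - cssselect : HTML processing (CSS selector support used by lxml)
 - beautiful soup : HTML processing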

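## HTTP Method
 - GET : used to retrieve a resource from the server
  - viewing a list
  - viewing a post
  - downloading
 - POST : used to add a resource to the server, i.e. to add data
  - writing a post
  - uploading
 - In practice the distinction is not always clear, so check the page source and the browser's developer tools to see which method a site uses.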
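
As a rough illustration of the difference, the sketch below builds one GET and one POST request without sending either. It uses the standard library's `urllib.request.Request` so that it runs offline; with requests the corresponding calls would be `requests.get(url, params=...)` and `requests.post(url, data=...)`. The URL and the form field names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "https://example.com/board"  # hypothetical URL, for illustration only

# GET: the parameters travel in the query string of the URL.
get_req = Request(BASE + "?" + urlencode({"page": 2}))

# POST: the data travels in the request body (must be bytes).
body = urlencode({"title": "hello", "content": "first post"}).encode("utf-8")
post_req = Request(BASE, data=body)

# Nothing is sent over the network; we only inspect the prepared requests.
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

A request with a body defaults to POST and one without defaults to GET, which mirrors the "retrieve" versus "add data" split described above.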

## HTTP Status Code
 - Three-digit number
 - 2XX : Success
  - 200 OK
 - 3XX : Redirection
  - 302 Found
 - 4XX : Client Error
  - 400 Bad Request
  - 403 Forbidden
  - 404 Not Found
  - 405 Method Not Allowed
 - 5XX : Server Error
 - 405 : the request used a method the server does not allow for that URL; switching between POST and GET usually resolves it.
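
The status classes can also be looked up programmatically. This is a minimal sketch using the standard library's `http.HTTPStatus`; the helper name `describe` is made up for this example, and the 1XX class is added only for completeness. On a requests response the numeric code is available as `res.status_code`.

```python
from http import HTTPStatus

# First digit of the status code -> class name.
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def describe(code):
    """Return the code, its standard phrase, and its class as one string."""
    status = HTTPStatus(code)  # raises ValueError for codes the stdlib does not know
    return f"{code} {status.phrase} ({STATUS_CLASSES[code // 100]})"

for code in (200, 302, 403, 405):
    print(describe(code))
```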

In [2]:
import requests 

In [3]:
url = 'https://techcrunch.com/2017/03/08/a-new-affordable-naming-startup-for-startups/'
res = requests.get(url)

In [4]:
res # Success

<Response [200]>

In [5]:
res.text[:200] # The response body is returned as text.

'\n<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" lang="en">\n<head>\n\t<title>A new, affordable nam'

## Extracting the article body from HTML
 - lxml
 - BeautifulSoup
 - We want to crawl the body of the article shown below.
![browser](./TEXTMINING/7.png)

In [6]:
import lxml.html
from bs4 import BeautifulSoup

### 1. lxml

In [10]:
root = lxml.html.fromstring(res.text)

 - How to access HTML elements (CSS selectors)
  - class : .
  - id : #

 - Access by class value
  - use .

In [14]:
entries = root.cssselect('.article-entry')
entries

[<Element div at 0x107b1d458>]

In [15]:
len(entries) # Only one tag holds the article body.

1

In [16]:
article = entries[0]
content = article.text_content() # Extract the body content as text.
content[:100]

'\n\n\n\nA few years ago, I launched a daily email newsletter, and I was ecstatic\xa0to be striking out on m'

 - Access by id value
  - use "#"

In [23]:
root.cssselect('#speakable-summary')[0].text_content()[:100]

'A few years ago, I launched a daily email newsletter, and I was ecstatic\xa0to be striking out on my ow'
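
To make the two selector forms concrete, here is a toy matcher for the simplest cases only. It is not how lxml or cssselect work internally; `matches` is a hypothetical helper that takes a selector and a dict of one element's attributes. The point it illustrates is that `class` holds a space-separated list of names, while `id` is a single value.

```python
def matches(selector, attrs):
    """Check a simple '.class' or '#id' selector against an element's attributes."""
    if selector.startswith("."):
        # A class selector matches if the name is one of the space-separated classes.
        return selector[1:] in attrs.get("class", "").split()
    if selector.startswith("#"):
        # An id selector matches only on exact equality.
        return attrs.get("id") == selector[1:]
    raise ValueError("only simple .class and #id selectors are handled here")

print(matches(".article-entry", {"class": "article-entry text"}))
print(matches("#speakable-summary", {"id": "speakable-summary"}))
print(matches(".article", {"class": "article-entry text"}))
```

The first two calls match and the third does not, because `.article` must equal a whole class name rather than a prefix of one.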

### 2. BeautifulSoup

In [24]:
bs = BeautifulSoup(res.text, 'html.parser')
type(bs)

bs4.BeautifulSoup

In [25]:
bs.find_all("div", class_='article-entry')

[<div class="article-entry text">
 <!-- Begin: Wordpress Article Content -->
 <img class="" src="https://tctechcrunch2011.files.wordpress.com/2017/03/screen-shot-2017-03-08-at-1-26-28-pm.png?w=738"/>
 <p id="speakable-summary">A few years ago, I launched a daily email <a href="https://www.strictlyvc.com/" target="_blank">newsletter</a>, and I was ecstatic to be striking out on my own for the first time. Alas, just a few weeks after filing to secure a trademark, an officious-sounding note appeared in my inbox, and soon after, I found myself shelling out $10,000 in lawyer’s fees over a short-lived trademark dispute. It wasn’t nearly as painful as it might have been, but it was a rude realization that figuring out the right brand can be both time-consuming and have implications that founders might not foresee.</p>
 <p>Of course, my experience is hardly rare. Most founders are typically left to either conduct trademark searches on their own via the <a href="https://www.uspto.gov/" target="

In [26]:
h2s = bs.find_all('h2')
len(h2s)

5

In [27]:
h2s

[<h2>Hi!</h2>, <h2 class="section-title collapse-title aside-adjacent">
 	Newsletter Subscriptions
 </h2>, <h2 class="collapse-title section-title">
 		Latest <span>Crunch Report</span> </h2>, <h2>Featured Stories</h2>, <h2><span class="text-muted">Latest From</span> Startups</h2>]

## Fetching multiple links from a list page

In [29]:
res = requests.get('https://techcrunch.com/startups/')
root = lxml.html.fromstring(res.text)
titles = root.cssselect('h2 a')
len(titles)

20

In [30]:
for i, title in enumerate(titles, start=1):
    print("- " + str(i) + " " + title.text)

- 1 Linxo raises $24 million for its app that brings bank accounts together
- 2 Crunch Report | So About That Equifax Hack
- 3 Copycats versus disruptors in Latin America
- 4 A Stanford professor’s advice on surviving the a**hole at your startup
- 5 It’s time to build our own Equifax with blackjack and crypto
- 6 Equity podcast: Roku is going public, 23andMe raises $200M and Juicero is dead
- 7 Red Cross to start testing drones in disaster relief efforts
- 8 Teralytics wants to tap telcos’ big data to help cities get smarter about Uber and Lyft
- 9 Final days to apply for Startup Battlefield Australia
- 10 Crunch Report | Marvel and Star Wars going exclusively to Disney streaming
- 11 Former GrubHub employee testified drivers often complained about ‘ghost orders’
- 12 Canvas’ robot cart could change how factories work
- 13 Streamroot raises $3.2 million for its peer-to-peer video delivery technology
- 14 HotelTonight to expand booking window to 100 days
- 15 StatMuse lets you ask a spo

### Accessing attributes of an HTML tag
 - attrib['attribute name']

In [31]:
titles[0].attrib['href']

'https://techcrunch.com/2017/09/09/linxo-raises-24-million-for-its-app-that-brings-bank-accounts-together/'

In [32]:
for title in titles:
    print(title.attrib['href'])

https://techcrunch.com/2017/09/09/linxo-raises-24-million-for-its-app-that-brings-bank-accounts-together/
https://techcrunch.com/2017/09/08/crunch-report-equifax-hack/
https://techcrunch.com/2017/09/08/copycats-vs-disruptors-in-latin-america/
https://techcrunch.com/2017/09/08/a-stanford-professors-advice-on-surviving-the-ahole-at-your-startup/
https://techcrunch.com/2017/09/08/its-time-to-build-our-own-equifax-with-blackjack-and-crypto/
https://techcrunch.com/2017/09/08/equity-podcast-roku-is-going-public-23andme-raises-200m-and-juicero-is-dead/
https://techcrunch.com/2017/09/08/red-cross-to-start-testing-drones-in-disaster-relief-efforts/
https://techcrunch.com/2017/09/07/teralytics-wants-to-tap-telcos-big-data-to-help-cities-get-smarter-about-uber-and-lyft/
https://techcrunch.com/2017/09/07/final-days-to-apply-for-startup-battlefield-australia/
https://techcrunch.com/2017/09/07/crunch-report-marvel-star-wars-disney/
https://techcrunch.com/2017/09/07/former-grubhub-employee-testimony-
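
`attrib['href']` raises `KeyError` when a tag has no such attribute. Elements also offer `.get()`, which returns `None` (or a default) instead. The sketch below uses the standard library's `xml.etree.ElementTree`, whose element API lxml follows, on a small made-up snippet so that it runs without network access.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet: the second link has no href attribute.
snippet = '<ul><li><a href="/author/a/">A</a></li><li><a>B</a></li></ul>'
root_el = ET.fromstring(snippet)

links = []
for a in root_el.iter("a"):
    href = a.get("href")  # None when the attribute is missing, no KeyError
    if href is not None:
        links.append(href)

print(links)
```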

In [34]:
articles = [] 
for title in titles:
    url = title.attrib['href']
    res_a = requests.get(url)
    articles.append(res_a.text)

In [35]:
len(articles)

20

In [38]:
articles[0][:100]

'<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/sc'
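
The loop above sends its requests back to back, and an exception from any one of them would stop the whole run. A hedged variant is sketched below: it pauses between requests and records failures instead of raising. `fetch_all` and its `fetch` parameter are hypothetical names; in this notebook `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`. A stand-in function is used here so the example runs offline.

```python
import time

def fetch_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls.

    Returns (pages, failed): pages maps each URL to its text,
    failed lists the URLs whose fetch raised an exception.
    """
    pages, failed = {}, []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # be polite to the server between requests
        try:
            pages[url] = fetch(url)
        except Exception:  # broad on purpose in this sketch: log and move on
            failed.append(url)
    return pages, failed

def fake_fetch(url):
    """Stand-in for a real HTTP call; fails for URLs containing 'broken'."""
    if "broken" in url:
        raise IOError("simulated network failure")
    return "<html>" + url + "</html>"

pages, failed = fetch_all(
    ["https://example.com/a", "https://example.com/broken"], fake_fetch, delay=0
)
print(len(pages), failed)
```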

## Example
 - What if we only want the reporter's name shown beneath each article?
 - Access path: div class = 'river' => li => div class = 'block' => div class = 'byline' => a
 - tqdm library : displays a progress bar.

In [39]:
import tqdm

In [41]:
subTitles = root.cssselect('.river li .block .byline a')

In [44]:
for subTitle in tqdm.tqdm_notebook(subTitles[:10]):
    print(subTitle.text)
    print("https://techcrunch.com" + subTitle.attrib['href'])

Romain Dillet
https://techcrunch.com/author/romain-dillet/
Anthony Ha
https://techcrunch.com/author/anthony-ha/
Francisco Coronel
https://techcrunch.com/contributor/francisco-coronel/
Connie Loizos
https://techcrunch.com/author/connie-loizos/
John Biggs
https://techcrunch.com/author/john-biggs/
Alex Wilhelm
https://techcrunch.com/author/alex-wilhelm/
Matt Burns
https://techcrunch.com/author/matt-burns/
Natasha Lomas
https://techcrunch.com/author/natasha-lomas/
Samantha Stein
https://techcrunch.com/author/samantha-stein/
Anthony Ha
https://techcrunch.com/author/anthony-ha/
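
Prepending the host by hand works here because the byline links are site-relative, but it would produce a broken URL for links that are already absolute, like the article links earlier. `urllib.parse.urljoin` from the standard library handles both cases. In this sketch the base URL is the list page fetched earlier, the relative href is inferred from the output above, and the absolute one is copied from the earlier link list.

```python
from urllib.parse import urljoin

base = "https://techcrunch.com/startups/"  # the page the links were found on

# A site-relative href is resolved against the host of the base URL.
relative = urljoin(base, "/author/romain-dillet/")

# An href that is already absolute is returned unchanged.
absolute = urljoin(base, "https://techcrunch.com/2017/09/08/crunch-report-equifax-hack/")

print(relative)
print(absolute)
```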

