# Parsing

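- Parsing: extracting the information you want from raw data (HTML, XML, JSON, XHTML, ...)
- Commonly used Python tools for this => BeautifulSoup (a parser), Selenium (browser automation, often used alongside it) ...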


# BeautifulSoup 


- A module for parsing HTML and XML
- Check whether beautifulsoup4 is installed with the pip list command
- If it is missing: pip install beautifulsoup4
- Import it with: from bs4 import BeautifulSoup
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/


In [3]:
# pip list

In [2]:
# Imports
import requests
from bs4 import BeautifulSoup

In [6]:
# dir(BeautifulSoup)

## BeautifulSoup workflow

- Use requests or urllib to fetch the URL => HTML response => text
- Text => BeautifulSoup object
- Use the functions BeautifulSoup provides to extract and transform the data
- Save the result to a CSV file
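
The four steps above can be sketched end to end as follows. This is a minimal sketch, not the notebook's own code: the inline `html_str` is a made-up stand-in for `response.text` so the example runs without a network call, and the `name` column and in-memory buffer are assumptions chosen for illustration.

```python
import csv
import io

from bs4 import BeautifulSoup

# Step 1 (stand-in): in a real run this text would come from requests.get(url).text
html_str = """<html><body>
<h1>War and Peace</h1>
<div id="text">
<span class="green">Anna
Pavlovna Scherer</span> greeted <span class="green">Prince Vasili Kuragin</span>.
</div>
</body></html>"""

# Step 2: text => BeautifulSoup object
soup = BeautifulSoup(html_str, "html.parser")

# Step 3: extract the data; " ".join(split()) collapses line breaks inside a name
names = [" ".join(tag.get_text().split()) for tag in soup.find_all(class_="green")]

# Step 4: write CSV rows (to an in-memory buffer here; to write a real file,
# use open(path, "w", newline="", encoding="utf-8") instead of io.StringIO())
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name"])                    # header row (hypothetical column name)
writer.writerows([name] for name in names)   # one row per extracted name
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `newline=""` when opening a real file is what the `csv` module documentation recommends, so that the writer controls the line endings itself.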


In [7]:
# Sample URL
url = 'http://pythonscraping.com/pages/warandpeace.html'

In [10]:
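# (1) url => request => response => HTML text
response = requests.get(url)
response.raise_for_status()  # fail early on a non-2xx status; this run returned <Response [200]>
html_str = response.text
# html_str
# Write with an explicit encoding so the file does not depend on the OS default
with open('output/warandpeace.html', 'w', encoding='utf-8') as f:
    f.write(html_str)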
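
The workflow also mentions urllib as an alternative to requests. The sketch below shows what that could look like using only the standard library. The helper names `fetch_html` and `save_html` are hypothetical, and the 10-second timeout and UTF-8 fallback are assumptions, not values from the original notebook.

```python
import os
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page with urllib and decode the bytes to text."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header, else assume UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def save_html(html_str, path):
    """Write the text as UTF-8, creating the target folder if it is missing."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_str)


# Usage (needs network access, so it is left commented out):
# html_str = fetch_html('http://pythonscraping.com/pages/warandpeace.html')
# save_html(html_str, 'output/warandpeace.html')
```

Creating the folder first avoids a `FileNotFoundError` when `output/` does not exist yet.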

In [11]:
ls output

 Volume in drive C has no label.
 Volume Serial Number is 582F-D8CD

 Directory of C:\workspace\scrap\output

2021-09-10  11:24 AM    <DIR>          .
2021-09-10  11:24 AM    <DIR>          ..
2021-09-10  10:47 AM            15,041 google1.html
2021-09-10  10:47 AM            15,057 google2.html
2021-09-10  11:24 AM            11,936 warandpeace.html
               3 File(s)         42,034 bytes
               2 Dir(s)  378,635,317,248 bytes free


In [16]:
html_str

'<html>\n<head>\n<style>\n.green{\n\tcolor:#55ff55;\n}\n.red{\n\tcolor:#ff5555;\n}\n#text{\n\twidth:50%;\n}\n</style>\n</head>\n<body>\n<h1>War and Peace</h1>\n<h2>Chapter 1</h2>\n<div id="text">\n"<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the\nBuonapartes. But I warn you, if you don\'t tell me that this means war,\nif you still try to defend the infamies and horrors perpetrated by\nthat Antichrist- I really believe he is Antichrist- I will have\nnothing more to do with you and you are no longer my friend, no longer\nmy \'faithful slave,\' as you call yourself! But how do you do? I see\nI have frightened you- sit down and tell me all the news.</span>"\n<p/>\nIt was in July, 1805, and the speaker was the well-known <span class="green">Anna\nPavlovna Scherer</span>, maid of honor and favorite of the <span class="green">Empress Marya\nFedorovna</span>. With these words she greeted <span class="green">Prince Vasili Kuragin</span>, a man\nof high rank

In [14]:
# (2) Create the BeautifulSoup object
# BeautifulSoup?
# BeautifulSoup(html_string, parser)
# parser is one of: html.parser / html5lib / xml / lxml


'\nBeautifulSoup(html, parser)\nparser is one of: html.parser/html5lib/xml/lxml\n'
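
The choice of parser affects the tree you get back. The sample page uses self-closing `<p/>` tags, and the short sketch below shows how `html.parser` handles them: each becomes an empty `<p></p>` element, and the text that follows is a sibling of that element, not its child. The HTML snippet is a shortened, made-up stand-in for the real page. Other parsers (lxml, html5lib) need a separate install and may build a different tree from the same input.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample page, including a self-closing <p/>
snippet = '<div id="text">first paragraph<p/>second paragraph</div>'

soup = BeautifulSoup(snippet, "html.parser")

p_tag = soup.div.p
print(repr(p_tag.text))      # the <p> element itself is empty
print(p_tag.next_sibling)    # the following text sits next to it, inside the div
print(soup)                  # serialized tree as html.parser built it
```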

In [17]:
soup = BeautifulSoup(html_str,'html.parser')
soup
type(soup)

bs4.BeautifulSoup

In [19]:
# dir(soup)

In [None]:
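# (3-1) Using BeautifulSoup
# Navigate tags with dot (.) notation
# soup.parent.child looks up a child element under its parent
# soup.tag1.tag2.tag3... (each step returns the first matching tag)
# Text only: use the text / string attributes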

In [22]:
soup.style

<style>
.green{
	color:#55ff55;
}
.red{
	color:#ff5555;
}
#text{
	width:50%;
}
</style>

In [28]:
print(soup.h1)
print(soup.h1.text)
print(soup.h1.string)

<h1>War and Peace</h1>
War and Peace
War and Peace
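
For the `h1` above, `text` and `string` print the same value, but the two attributes are not interchangeable. A small sketch of the difference, using a made-up HTML snippet rather than the real page: `string` is `None` as soon as a tag has more than one child, while `text` concatenates everything underneath.

```python
from bs4 import BeautifulSoup

# Made-up snippet: the <p> mixes plain text with a nested <span>
snippet = (
    "<div><h1>War and Peace</h1>"
    '<p>It was <span class="green">Anna</span> speaking</p></div>'
)
soup = BeautifulSoup(snippet, "html.parser")

h1_string = soup.h1.string   # single text child, so the string is returned
p_string = soup.p.string     # several children, so this is None
p_text = soup.p.text         # all nested text joined together

print(h1_string)
print(p_string)
print(p_text)
```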


In [29]:
print(soup.div.span.string)

Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.


In [None]:
# (3-2) Using BeautifulSoup
# Navigate tags with dot (.) notation
# soup.tag_name.next_sibling : the next sibling node

In [None]:
# The next sibling of the h1 tag => a newline character
soup.h1.next_sibling

In [33]:
print(soup.h1.next_sibling.next_sibling)
print(soup.h1.next_sibling.next_sibling.text)

<h2>Chapter 1</h2>
Chapter 1
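
Chaining `next_sibling` twice works here because exactly one newline string sits between the two headings. A hedged alternative is `find_next_sibling()`, which skips text nodes and returns the next tag directly. The snippet below is a made-up reduction of the page's heading structure.

```python
from bs4 import BeautifulSoup

# Made-up reduction of the page: headings separated by newlines
snippet = "<body>\n<h1>War and Peace</h1>\n<h2>Chapter 1</h2>\n</body>"
soup = BeautifulSoup(snippet, "html.parser")

raw_next = soup.h1.next_sibling           # the newline text node
next_tag = soup.h1.find_next_sibling()    # the next sibling that is a tag

print(repr(raw_next))
print(next_tag)
print(next_tag.text)
```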


In [37]:
target = soup.div.span.next_sibling.next_sibling.next_sibling
print(target.next_sibling) # the sibling that comes right after target

<span class="green">Anna
Pavlovna Scherer</span>


In [40]:
# next_siblings is a generator; wrap it in list() to get a list
print(target.next_siblings)

<generator object PageElement.next_siblings at 0x0000016F66A42DD0>
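
Because `next_siblings` yields both tags and bare strings, it is often useful to keep only the tags. A minimal sketch on a made-up, shortened version of the page: one option filters with `isinstance`, another uses `find_next_siblings()` with a tag name and class.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, made-up version of the div in the sample page
snippet = (
    '<div id="text">"<span class="red">Well, Prince</span>"\n<p/>\n'
    'It was <span class="green">Anna Pavlovna Scherer</span>, maid of honor of the '
    '<span class="green">Empress Marya Fedorovna</span>.</div>'
)
soup = BeautifulSoup(snippet, "html.parser")
first_span = soup.div.span

# Option 1: walk the generator and drop the plain-text nodes
tags_only = [node for node in first_span.next_siblings if isinstance(node, Tag)]
tag_names = [tag.name for tag in tags_only]

# Option 2: let BeautifulSoup filter by tag name and class
green_texts = [tag.text for tag in first_span.find_next_siblings("span", class_="green")]

print(tag_names)
print(green_texts)
```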


In [41]:
list(target.next_siblings)

[<span class="green">Anna
 Pavlovna Scherer</span>,
 ', maid of honor and favorite of the ',
 <span class="green">Empress Marya
 Fedorovna</span>,
 '. With these words she greeted ',
 <span class="green">Prince Vasili Kuragin</span>,
 ', a man\nof high rank and importance, who was the first to arrive at her\nreception. ',
 <span class="green">Anna Pavlovna</span>,
 ' had had a cough for some days. She was, as\nshe said, suffering from la grippe; grippe being then a new word in\n',
 <span class="green">St. Petersburg</span>,
 ', used only by the elite.\n',
 <p></p>,
 '\nAll her invitations without exception, written in French, and\ndelivered by a scarlet-liveried footman that morning, ran as follows:\n',
 <p></p>,
 '\n"',
 <span class="red">If you have nothing better to do, Count [or Prince], and if the
 prospect of spending an evening with a poor invalid is not too
 terrible, I shall be very charmed to see you tonight between 7 and 10-
 Annette Scherer.</span>,
 '"\n',
 <p></p>,
 '\n"'

In [None]:
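# (3-3) Using BeautifulSoup
# - Search for tags with a specific id, class, or attribute value
# find(id='id_name') => a single tag
# find_all(class_='class_name') => list (class_ with an underscore, because class is a Python keyword)
# find_all(attrs={attribute: value}) => list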

In [44]:
# Search by id
# soup.find(id='text')
# soup.find(id='text').text
type(soup.find(id='text'))

bs4.element.Tag

In [45]:
# Search by class
green_list = soup.find_all(class_='green')
len(green_list)

41

In [48]:
print(green_list[0])
green_list[-1].text

<span class="green">Anna
Pavlovna Scherer</span>


'Anna Pavlovna'

In [51]:
# Search by attribute
# find_all(attrs={attribute: value}) => list
green_list2 = soup.find_all(attrs={'class':'green'})
len(green_list2)

41

In [52]:
red_list2 = soup.find_all(attrs={'class':'red'})
len(red_list2)

34
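
The `class_` keyword and the `attrs` dictionary are two spellings of the same search, which is why both calls return 41 green spans. The sketch below illustrates that on a small made-up snippet, along with two details that are easy to miss: `find()` returns `None` when nothing matches, and `limit` caps the number of results.

```python
from bs4 import BeautifulSoup

# Made-up snippet with one red span and two green spans
snippet = (
    '<div id="text"><span class="red">quote</span>'
    '<span class="green">Anna</span><span class="green">Prince</span></div>'
)
soup = BeautifulSoup(snippet, "html.parser")

by_keyword = soup.find_all(class_="green")           # keyword form
by_attrs = soup.find_all(attrs={"class": "green"})   # dictionary form
missing = soup.find(id="no-such-id")                 # no match => None, not an error
first_only = soup.find_all("span", limit=1)          # stop after the first match

print([tag.text for tag in by_keyword])
print([tag.text for tag in by_attrs])
print(missing)
print(first_only)
```

Checking for `None` before calling `.text` on the result of `find()` avoids an `AttributeError` on pages where the element is absent.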

In [53]:
# (3-4) Using BeautifulSoup
# Search with CSS selectors
# soup.select(selector) => list
# soup.select_one(selector) => a single tag

In [54]:
# soup.select('div')
# all span tags
span_list = soup.select('span')
len(span_list)

75

In [55]:
red_list = soup.select('.red')
red_list[0].text

"Well, Prince, so Genoa and Lucca are now just family estates of the\nBuonapartes. But I warn you, if you don't tell me that this means war,\nif you still try to defend the infamies and horrors perpetrated by\nthat Antichrist- I really believe he is Antichrist- I will have\nnothing more to do with you and you are no longer my friend, no longer\nmy 'faithful slave,' as you call yourself! But how do you do? I see\nI have frightened you- sit down and tell me all the news."

In [57]:
first_span = soup.select_one('.red')
first_span.text

"Well, Prince, so Genoa and Lucca are now just family estates of the\nBuonapartes. But I warn you, if you don't tell me that this means war,\nif you still try to defend the infamies and horrors perpetrated by\nthat Antichrist- I really believe he is Antichrist- I will have\nnothing more to do with you and you are no longer my friend, no longer\nmy 'faithful slave,' as you call yourself! But how do you do? I see\nI have frightened you- sit down and tell me all the news."

In [59]:
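# Copying a selector with Chrome DevTools
# 1) Open DevTools on the page: press F12, or right-click and choose [Inspect]
# 2) Click the [Select an element~] tool, then click the part of the web page you want to find
# 3) In the HTML Elements panel, click the options button at about the 11 o'clock position, then [Copy]-[Copy selector]
# 4) Paste with Ctrl+V to check the selector
# Example: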
#text > p

In [65]:
result = soup.select('#text>p')
result

[<p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>,
 <p></p>]
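
Every `<p>` in this result is empty because the page uses `<p/>` only as a separator. The `>` in the selector also matters: it matches direct children only. A short sketch on a made-up snippet (the `blockquote` nesting and the `.blue` class are hypothetical, added only to show the difference):

```python
from bs4 import BeautifulSoup

# Made-up snippet: one <p> directly under #text, one nested deeper
snippet = (
    '<div id="text"><p></p><span class="green">Anna</span>'
    '<blockquote><span class="green">nested</span><p></p></blockquote></div>'
)
soup = BeautifulSoup(snippet, "html.parser")

direct_p = soup.select("#text > p")                  # direct children only
all_p = soup.select("#text p")                       # any depth below #text
direct_green = soup.select("#text > span.green")     # tag and class combined
no_match = soup.select_one(".blue")                  # no match => None

print(len(direct_p), len(all_p))
print([tag.text for tag in direct_green])
print(no_match)
```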

In [58]:
soup

<html>
<head>
<style>
.green{
	color:#55ff55;
}
.red{
	color:#ff5555;
}
#text{
	width:50%;
}
</style>
</head>
<body>
<h1>War and Peace</h1>
<h2>Chapter 1</h2>
<div id="text">
"<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>"
<p></p>
It was in July, 1805, and the speaker was the well-known <span class="green">Anna
Pavlovna Scherer</span>, maid of honor and favorite of the <span class="green">Empress Marya
Fedorovna</span>. With these words she greeted <span class="green">Prince Vasili Kuragin</span>, a man
of high rank and importance, who was the firs