
In [None]:
html_str = """
<html>
  <head>
    <title>안녕하세요</title>
  </head>
  <body>
    <div id="container">
      <p class='p1'>hello</p>
      <p>Bye</p>
    </div>
  </body>
</html>"""

- The data we want lives inside a string formatted as HTML.
- More precisely, the data sits inside individual HTML elements.

In [None]:
from bs4 import BeautifulSoup

In [None]:
soup = BeautifulSoup(html_str, 'html.parser')
soup


<html>
<head>
<title>안녕하세요</title>
</head>
<body>
<div id="container">
<p class="p1">hello</p>
<p>Bye</p>
</div>
</body>
</html>

In [None]:
type(soup)

bs4.BeautifulSoup

# `find`
- `find("tag name", {attribute: value})`: finds a single element; the attribute filter is written as a dictionary
- `find_all("tag name", {attribute: value})`: finds every matching element
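
As a small sketch of how the two calls differ in what they return (the snippet and the `demo_soup` name are illustrative, not part of the notebook): `find` gives back one `Tag`, or `None` when nothing matches, while `find_all` gives back a list-like result that may be empty.

```python
from bs4 import BeautifulSoup

# Illustrative snippet mirroring the structure of html_str
html = '<div id="container"><p class="p1">hello</p><p>Bye</p></div>'
demo_soup = BeautifulSoup(html, "html.parser")

first_p = demo_soup.find("p", {"class": "p1"})  # a single Tag
all_p = demo_soup.find_all("p")                 # list-like result of Tags
missing = demo_soup.find("span")                # no match: None
missing_all = demo_soup.find_all("span")        # no match: empty result

print(first_p.text, len(all_p), missing, len(missing_all))
```

Because `find` can return `None`, it is worth checking the result before reading `.text` from it.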

In [None]:
container_div = soup.find("div", {"id": "container"})
container_div

<div id="container">
<p class="p1">hello</p>
<p>Bye</p>
</div>

In [None]:
type(container_div)

bs4.element.Tag

In [None]:
soup.find_all("p")

[<p class="p1">hello</p>, <p>Bye</p>]

In [None]:
soup.find("p")

<p class="p1">hello</p>

In [None]:
# Find every p element whose class is p1
soup.find_all('p', {'class': 'p1'})

[<p class="p1">hello</p>]

In [None]:
# Extract the text
#  - only works on a single element (Tag) object, not on the list returned by find_all
soup.find('p', {'class': 'p1'}).text

'hello'
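
To illustrate why the text has to be read from an element rather than from a list of elements, here is a minimal sketch (the snippet and variable names are assumptions made for the example). Asking the result of `find_all` for `.text` raises an `AttributeError`, so the text is taken from each element in turn.

```python
from bs4 import BeautifulSoup

# Illustrative snippet; the second <p> has padding to show strip=True
html = '<div id="container"><p class="p1">hello</p><p> Bye </p></div>'
demo_soup = BeautifulSoup(html, "html.parser")
p_tags = demo_soup.find_all("p")

# The result of find_all is a list of Tags, so it has no .text of its own
try:
    p_tags.text
    failed = False
except AttributeError:
    failed = True

# Read the text element by element; strip=True trims surrounding whitespace
texts = [p.get_text(strip=True) for p in p_tags]
print(failed, texts)
```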

In [None]:
# Extract the text inside both p tags
p_tags = soup.find_all("p")
print(p_tags[0].text)
print(p_tags[1].text)

hello
Bye


In [None]:
for p in p_tags:
  print(p.text)

hello
Bye


In [None]:
# Search within an element
container_div = soup.find("div", {"id": "container"})
container_div

<div id="container">
<p class="p1">hello</p>
<p>Bye</p>
</div>

In [None]:
container_div.find_all("p")

[<p class="p1">hello</p>, <p>Bye</p>]
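
Searching from an element only looks at that element's descendants. The sketch below makes the difference visible by adding a `<p>` outside the container; the extra paragraph and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Same container as in html_str, plus one <p> outside it (illustrative)
html = """
<div id="container"><p class="p1">hello</p><p>Bye</p></div>
<p>outside</p>
"""
demo_soup = BeautifulSoup(html, "html.parser")
container = demo_soup.find("div", {"id": "container"})

whole_doc = [p.text for p in demo_soup.find_all("p")]    # searches everything
inside_only = [p.text for p in container.find_all("p")]  # searches the div only

print(whole_doc)
print(inside_only)
```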

# Finding elements with a selector (`Selector`)
- `select("selector")`: selects every element that matches the CSS selector
- `select_one("selector")`: selects only the first element that matches the CSS selector
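
A brief sketch of a few common selector forms, using an illustrative snippet with one nested paragraph (the `p2` and `p3` class names are assumptions for the example). It contrasts the child combinator `>` with the descendant combinator (a space), and shows that `select_one` returns `None` when nothing matches.

```python
from bs4 import BeautifulSoup

html = """
<div id="container">
  <p class="p1">hello</p>
  <section><p class="p2">nested</p></section>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# "#container > p": only <p> elements that are direct children of #container
children = [p.text for p in demo_soup.select("#container > p")]
# "#container p": <p> elements at any depth below #container
descendants = [p.text for p in demo_soup.select("#container p")]
# "p.p2": a <p> whose class list contains p2
by_class = demo_soup.select_one("p.p2")
# No element matches, so select_one returns None
no_match = demo_soup.select_one("p.p3")

print(children, descendants, by_class.text, no_match)
```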


In [None]:
soup.select_one("#container")

<div id="container">
<p class="p1">hello</p>
<p>Bye</p>
</div>

In [None]:
# Use select_one to find .p1
soup.select_one("#container > .p1")

<p class="p1">hello</p>

In [None]:
container_div.select_one(".p1")

<p class="p1">hello</p>

In [None]:
soup.select("#container > p")

[<p class="p1">hello</p>, <p>Bye</p>]

# Extracting text and attributes

In [None]:
# Extract only 'Bye'
soup.select("#container > p")[-1].text

'Bye'

In [None]:
# What if we want the class value (p1) itself?
soup.select_one("#container > .p1").get("class") # get(attribute name)

['p1']

In [None]:
soup.select_one("#container > .p1")['class']

['p1']
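
The `class` value comes back as a list because an element can carry several classes, while attributes such as `id` or `href` come back as plain strings. The following sketch, built on an illustrative snippet, also contrasts `get`, which returns `None` for a missing attribute, with square-bracket access, which raises `KeyError`.

```python
from bs4 import BeautifulSoup

# Illustrative snippet; the attribute values are made up for the example
html = '<div id="container"><a class="head usd" href="/marketindex/">USD</a></div>'
demo_soup = BeautifulSoup(html, "html.parser")
link = demo_soup.select_one("a")

classes = link["class"]                     # multi-valued attribute: a list
href = link["href"]                         # single-valued attribute: a string
div_id = demo_soup.select_one("div")["id"]  # also a string
missing = link.get("title")                 # missing attribute: None

# Square-bracket access raises KeyError for a missing attribute
try:
    link["title"]
    raised = False
except KeyError:
    raised = True

print(classes, href, div_id, missing, raised)
```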

# Scraping Naver exchange rate information

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
NAVER_FINANCE_URL = "https://finance.naver.com/marketindex/"

In [None]:
response = requests.get(NAVER_FINANCE_URL)
html_code = response.content
soup = BeautifulSoup(html_code, 'html.parser')
soup
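
One possible way to make the download step a little more defensive is sketched below. The helper name `fetch_soup`, the 10-second timeout, and the `User-Agent` value are all assumptions chosen for illustration; the notebook itself calls `requests.get` directly.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper, not from the notebook)."""
    # Assumed header value; some sites respond differently to the default client string
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")


# Usage (performs a real network request):
# soup = fetch_soup("https://finance.naver.com/marketindex/")
```

Passing `timeout` keeps the call from waiting indefinitely if the server does not answer.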

In [None]:
# Turn the scraped results into a DataFrame

In [None]:
exchange_list = soup.select_one("#exchangeList")
exchange_list

In [None]:
fin_list = exchange_list.find_all("li") # exchange_list.select("li")
len(fin_list)

4

In [None]:
sample_li = fin_list[0]
sample_li

<li class="on">
<a class="head usd" href="/marketindex/exchangeDetail.naver?marketindexCd=FX_USDKRW" onclick="clickcr(this, 'fr1.usdt', '', '', event);">
<h3 class="h_lst"><span class="blind">미국 USD</span></h3>
<div class="head_info point_dn">
<span class="value">1,298.40</span>
<span class="txt_krw"><span class="blind">원</span></span>
<span class="change"> 3.60</span>
<span class="blind">하락</span>
</div>
</a>
<a class="graph_img" href="/marketindex/exchangeDetail.naver?marketindexCd=FX_USDKRW" onclick="clickcr(this, 'fr1.usdc', '', '', event);">
<img alt="" height="153" src="https://ssl.pstatic.net/imgfinance/chart/marketindex/FX_USDKRW.png" width="295"/>
</a>
<div class="graph_info">
<span class="time">2023.03.28 13:30</span>
<span class="source">하나은행 기준</span>
<span class="count">고시회차<span class="num">461</span>회</span>
</div>
</li>

In [None]:
c_name = sample_li.find("h3", {"class":"h_lst"}).text
c_name

'미국 USD'

In [None]:
c_name = sample_li.select_one("h3.h_lst").text
c_name

'미국 USD'

In [None]:
change = sample_li.select_one("span.change").text
change

' 3.60'

In [None]:
updown = sample_li.select("span.blind")
updown[-1].text

'하락'

In [None]:
exchange_rate = sample_li.select_one("span.value").text
exchange_rate

'1,298.40'

In [None]:
c_name_list = []
exchange_rate_list = []
change_list = []
updown_list = []

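for fin in fin_list:
  # Currency name, e.g. '미국 USD'
  c_name = fin.select_one("h3.h_lst").text.strip()
  # Remove thousands separators before converting to float
  exchange_rate = float(fin.select_one("span.value").text.replace(",", ""))
  change = float(fin.select_one("span.change").text.strip().replace(",", ""))
  # The last span.blind holds the direction: 상승 (up) or 하락 (down)
  updown = fin.select("span.blind")[-1].text.strip()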

  print(c_name, exchange_rate, change, updown)

  c_name_list.append(c_name)
  exchange_rate_list.append(exchange_rate)
  change_list.append(change)
  updown_list.append(updown)

미국 USD 1298.4 3.6 하락
일본 JPY(100엔) 993.61 2.33 상승
유럽연합 EUR 1403.64 0.34 상승
중국 CNY 188.64 0.35 하락
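
As an alternative sketch of the same extraction, each `<li>` can be turned into one dictionary so that a row's fields stay together instead of being spread over four parallel lists. The helper `parse_item`, the English key names, and the trimmed `sample_html` (reduced from the `<li>` markup shown earlier so the example runs without a network request) are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed, illustrative copy of the list markup with two entries
sample_html = """
<ul id="exchangeList">
  <li><h3 class="h_lst"><span class="blind">미국 USD</span></h3>
      <div class="head_info">
        <span class="value">1,298.40</span>
        <span class="change"> 3.60</span>
        <span class="blind">하락</span>
      </div></li>
  <li><h3 class="h_lst"><span class="blind">일본 JPY(100엔)</span></h3>
      <div class="head_info">
        <span class="value">993.61</span>
        <span class="change"> 2.33</span>
        <span class="blind">상승</span>
      </div></li>
</ul>
"""


def parse_item(li):
    """Turn one <li> into a dict (hypothetical helper)."""
    return {
        "name": li.select_one("h3.h_lst").get_text(strip=True),
        "rate": float(li.select_one("span.value").text.replace(",", "")),
        "change": float(li.select_one("span.change").text.strip().replace(",", "")),
        # The last span.blind holds the direction: 상승 (up) or 하락 (down)
        "direction": li.select("span.blind")[-1].text.strip(),
    }


demo_soup = BeautifulSoup(sample_html, "html.parser")
records = [parse_item(li) for li in demo_soup.select("#exchangeList > li")]
for record in records:
    print(record)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame`.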


In [None]:
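fin_datas = {
    "국가": c_name_list,         # country / currency name
    "환율": exchange_rate_list,  # exchange rate in KRW
    "변동": change_list,         # size of the change
    "등락": updown_list          # direction: 상승 (up) or 하락 (down)
}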
fin_datas

{'국가': ['미국 USD', '일본 JPY(100엔)', '유럽연합 EUR', '중국 CNY'],
 '환율': [1298.4, 993.61, 1403.64, 188.64],
 '변동': [3.6, 2.33, 0.34, 0.35],
 '등락': ['하락', '상승', '상승', '하락']}

In [None]:
import pandas as pd

df_finance = pd.DataFrame(fin_datas)
df_finance

| | 국가 | 환율 | 변동 | 등락 |
|---|---|---|---|---|
| 0 | 미국 USD | 1298.4 | 3.6 | 하락 |
| 1 | 일본 JPY(100엔) | 993.61 | 2.33 | 상승 |
| 2 | 유럽연합 EUR | 1403.64 | 0.34 | 상승 |
| 3 | 중국 CNY | 188.64 | 0.35 | 하락 |
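
The change column holds only magnitudes, with the direction kept separately as 상승 (up) or 하락 (down). One possible follow-up, sketched here with the values copied from the `fin_datas` output, is to fold the direction into a signed column. The column name `signed_change` is an assumption for illustration; any direction label other than these two would map to `NaN`.

```python
import pandas as pd

# Values copied from the fin_datas output
fin_datas = {
    "국가": ["미국 USD", "일본 JPY(100엔)", "유럽연합 EUR", "중국 CNY"],
    "환율": [1298.4, 993.61, 1403.64, 188.64],
    "변동": [3.6, 2.33, 0.34, 0.35],
    "등락": ["하락", "상승", "상승", "하락"],
}
df_finance = pd.DataFrame(fin_datas)

# Map each direction label to a sign, then apply it to the magnitude
sign = df_finance["등락"].map({"하락": -1, "상승": 1})
df_finance["signed_change"] = df_finance["변동"] * sign

print(df_finance)
```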
