## Collecting Individual Stocks from Naver Finance
* Data previously collected through FinanceDataReader is collected here directly from the Naver Finance web pages.


### Keyword

* Reading an HTML page
    * pd.read_html(url, encoding="cp949")

* Dropping missing data (axis 0: rows, 1: columns)
    * table[0].dropna()

* Concatenating data frames
    * pd.concat([df1, df2, df3])

* Removing duplicate rows
    * df.drop_duplicates()

* Scientific notation
    * 1.210000e+02 => 121

* Checking the first-row value of the 날짜 (date) column
    * date = df.iloc[0]["날짜"]

* Saving to a file
    * df.to_csv(file_name, index=False)

* Reading from a file
    * pd.read_csv(file_name)
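
A few of the calls listed above can be tried on a small in-memory table before any page is collected. The sketch below is only illustrative: the two sample rows are hypothetical stand-ins shaped like the daily price table, and an `io.StringIO` buffer is used in place of a file on disk.

```python
import io

import pandas as pd

# Hypothetical sample rows shaped like the daily price table.
df = pd.DataFrame(
    {
        "날짜": ["2021.06.24", "2021.06.23"],
        "종가": [322000.0, 321500.0],
    }
)

# First-row value of the 날짜 (date) column.
date = df.iloc[0]["날짜"]

# Floats can be displayed in scientific notation (1.210000e+02);
# casting to int gives the plain number.
plain = int(float("1.210000e+02"))
df["종가"] = df["종가"].astype(int)

# Round-trip through CSV; the buffer stands in for a file name.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Passing `index=False` keeps the row labels out of the CSV, so reading it back does not produce an extra unnamed column.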

## Pages to Collect

* Naver Finance domestic market : https://finance.naver.com/sise/
* Major stocks listed in 2020
    * HYBE : https://finance.naver.com/item/main.nhn?code=352820
    * Kakao Games : https://finance.naver.com/item/main.nhn?code=293490
    * SK Biopharmaceuticals : https://finance.naver.com/item/main.nhn?code=326030

## Loading Libraries

In [21]:
import pandas as pd

## Choosing the URL to Collect

In [3]:
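# Set the stock code and company name as item_code and item_name
item_code = "352820"
item_name = "빅히트"  # Big Hit, listed above under its current name HYBE

# item_code = "326030"
# item_name = "SK바이오팜"

# Build the URL of the stock's daily price page from the code and page number
page_no = 3
url = f"https://finance.naver.com/item/sise_day.nhn?code={item_code}&page={page_no}"
print(url)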

https://finance.naver.com/item/sise_day.nhn?code=352820&page=3
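
The same URL can also be assembled from its parts, which makes it easier to switch stocks or pages. This is a minimal sketch using only the standard library; `BASE_URL` and `build_day_url` are hypothetical names introduced here for illustration.

```python
from urllib.parse import urlencode

# Daily price page used in this notebook (the path is taken from the URL above).
BASE_URL = "https://finance.naver.com/item/sise_day.nhn"


def build_day_url(item_code, page_no):
    """Return the daily price URL for a stock code and a page number."""
    query = urlencode({"code": item_code, "page": page_no})
    return f"{BASE_URL}?{query}"


url = build_day_url("352820", 3)
print(url)
```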


## HTTP Requests with requests
* [Requests: HTTP for Humans™ — Requests documentation](https://requests.readthedocs.io/en/master/)
* [Quickstart — Requests documentation # custom-headers](https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers)

In [4]:
import requests

In [6]:
# Request the page without custom headers; the response shown below is an error page
response = requests.get(url)

In [7]:
response.text

'\n<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">\n<html>\n<head>\n<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">\n<title>네이버 :: 세상의 모든 지식, 네이버</title>\n\n<style type="text/css">\n.error_content * {margin:0;padding:0;}\n.error_content img{border:none;}\n.error_content em {font-style:normal;}\n.error_content {width:410px; margin:80px auto 0; padding:57px 0 0 0; font-size:12px; font-family:"나눔고딕", "NanumGothic", "돋움", Dotum, AppleGothic, Sans-serif; text-align:left; line-height:14px; background:url(https://ssl.pstatic.net/static/common/error/090610/bg_thumb.gif) no-repeat center top; white-space:nowrap;}\n.error_content p{margin:0;}\n.error_content .error_desc {margin-bottom:21px; overflow:hidden; text-align:center;}\n.error_content .error_desc2 {margin-bottom:11px; padding-bottom:7px; color:#888; line-height:18px; border-bottom:1px solid #eee;}\n.error_content .error_desc3 {clear:both; color:#888;}\n.error_con

In [12]:
# Browser-like user-agent header to send with the request
headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}

In [15]:
response = requests.get(url, headers=headers)

In [16]:
response.text

'\n<html lang="ko">\n<head>\n<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">\n<title>네이버 금융</title>\n\n<link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/imgstock/static.pc/20210624194556/css/newstock.css">\n<link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/imgstock/static.pc/20210624194556/css/common.css">\n<link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/imgstock/static.pc/20210624194556/css/layout.css">\n<link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/imgstock/static.pc/20210624194556/css/main.css">\n<link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/imgstock/static.pc/20210624194556/css/newstock2.css">\n<link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/imgstock/static.pc/20210624194556/css/newstock3.css">\n<link rel="stylesheet" type="text/css" href="https://ssl.pstatic.net/imgstock/static.pc/20210624194556/css/world.css">\n</head>\n<body>\n<script
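
The two responses above differ: the one sent without a user-agent header is an error page, while the one sent with the header is the Naver Finance page. A rough way to tell them apart in code is to look for the `error_content` class that appears in the error page's HTML. This is only a heuristic based on the output shown above, and `looks_like_error_page` is a hypothetical helper; the site may change its markup.

```python
def looks_like_error_page(html_text):
    """Heuristic check: the error page shown above styles a block named error_content."""
    return "error_content" in html_text


# Shortened samples modelled on the two responses above.
blocked_sample = '<style type="text/css">\n.error_content * {margin:0;padding:0;}\n</style>'
ok_sample = '<html lang="ko">\n<head>\n<title>네이버 금융</title>\n</head>'

print(looks_like_error_page(blocked_sample))
print(looks_like_error_page(ok_sample))
```

In the notebook this would be called as `looks_like_error_page(response.text)` right after the request.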

## Finding the table Tag with BeautifulSoup

In [17]:
from bs4 import BeautifulSoup as bs

In [18]:
# Parse the response text with the lxml parser
html = bs(response.text, "lxml")

In [19]:
# select returns a list of every table element in the page
temp = html.select("table")

In [23]:
temp

[<table cellspacing="0" class="type2">
 <tr>
 <th>날짜</th>
 <th>종가</th>
 <th>전일비</th>
 <th>시가</th>
 <th>고가</th>
 <th>저가</th>
 <th>거래량</th>
 </tr>
 <tr>
 <td colspan="7" height="8"></td>
 </tr>
 <tr onmouseout="mouseOut(this)" onmouseover="mouseOver(this)">
 <td align="center"><span class="tah p10 gray03">2021.05.27</span></td>
 <td class="num"><span class="tah p11">253,000</span></td>
 <td class="num">
 <img alt="하락" height="6" src="https://ssl.pstatic.net/imgstock/images/images4/ico_down.gif" style="margin-right:4px;" width="7"/><span class="tah p11 nv01">
 				6,000
 				</span>
 </td>
 <td class="num"><span class="tah p11">259,500</span></td>
 <td class="num"><span class="tah p11">260,000</span></td>
 <td class="num"><span class="tah p11">252,000</span></td>
 <td class="num"><span class="tah p11">1,175,231</span></td>
 </tr>
 <tr onmouseout="mouseOut(this)" onmouseover="mouseOver(this)">
 <td align="center"><span class="tah p10 gray03">2021.05.26</span></td>
 <td class="num"><span cl

In [24]:
str(temp)

'[<table cellspacing="0" class="type2">\n<tr>\n<th>날짜</th>\n<th>종가</th>\n<th>전일비</th>\n<th>시가</th>\n<th>고가</th>\n<th>저가</th>\n<th>거래량</th>\n</tr>\n<tr>\n<td colspan="7" height="8"></td>\n</tr>\n<tr onmouseout="mouseOut(this)" onmouseover="mouseOver(this)">\n<td align="center"><span class="tah p10 gray03">2021.05.27</span></td>\n<td class="num"><span class="tah p11">253,000</span></td>\n<td class="num">\n<img alt="하락" height="6" src="https://ssl.pstatic.net/imgstock/images/images4/ico_down.gif" style="margin-right:4px;" width="7"/><span class="tah p11 nv01">\n\t\t\t\t6,000\n\t\t\t\t</span>\n</td>\n<td class="num"><span class="tah p11">259,500</span></td>\n<td class="num"><span class="tah p11">260,000</span></td>\n<td class="num"><span class="tah p11">252,000</span></td>\n<td class="num"><span class="tah p11">1,175,231</span></td>\n</tr>\n<tr onmouseout="mouseOut(this)" onmouseover="mouseOver(this)">\n<td align="center"><span class="tah p10 gray03">2021.05.26</span></td>\n<td class="num"

## Collecting the Data with One Line of pandas

In [26]:
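from io import StringIO

# read_html parses every <table> in the given HTML and returns a list of DataFrames.
# Here the table tags found with BeautifulSoup are passed in as already-decoded text
# (wrapped in StringIO, which newer pandas versions expect for literal HTML).
# When read_html is given the URL directly, Naver's daily price page can be read with
# encoding="cp949"; the default is utf-8, and cp949 often fixes garbled Korean text.
table = pd.read_html(StringIO(str(temp)))
table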

[            날짜        종가      전일비        시가        고가        저가        거래량
 0          NaN       NaN      NaN       NaN       NaN       NaN        NaN
 1   2021.05.27  253000.0   6000.0  259500.0  260000.0  252000.0  1175231.0
 2   2021.05.26  259000.0   1500.0  261500.0  264000.0  256500.0   257975.0
 3   2021.05.25  260500.0   3000.0  265000.0  266000.0  255000.0   307669.0
 4   2021.05.24  263500.0   2000.0  263000.0  272000.0  261000.0   543930.0
 5   2021.05.21  261500.0   4500.0  268000.0  272000.0  257500.0   357864.0
 6          NaN       NaN      NaN       NaN       NaN       NaN        NaN
 7          NaN       NaN      NaN       NaN       NaN       NaN        NaN
 8          NaN       NaN      NaN       NaN       NaN       NaN        NaN
 9   2021.05.20  266000.0   5500.0  261000.0  266000.0  255000.0   342100.0
 10  2021.05.18  260500.0  12000.0  250000.0  260500.0  249000.0   395521.0
 11  2021.05.17  248500.0   6000.0  242500.0  248500.0  242500.0   143607.0
 12  2021.05

In [27]:
# Checking table[0] and table[1] shows that table[0] holds the data we need.
table[0]

Unnamed: 0,날짜,종가,전일비,시가,고가,저가,거래량
0,,,,,,,
1,2021.05.27,253000.0,6000.0,259500.0,260000.0,252000.0,1175231.0
2,2021.05.26,259000.0,1500.0,261500.0,264000.0,256500.0,257975.0
3,2021.05.25,260500.0,3000.0,265000.0,266000.0,255000.0,307669.0
4,2021.05.24,263500.0,2000.0,263000.0,272000.0,261000.0,543930.0
5,2021.05.21,261500.0,4500.0,268000.0,272000.0,257500.0,357864.0
6,,,,,,,
7,,,,,,,
8,,,,,,,
9,2021.05.20,266000.0,5500.0,261000.0,266000.0,255000.0,342100.0


In [29]:
# dropna removes the rows that contain missing values.
temp=table[0].dropna()
temp

Unnamed: 0,날짜,종가,전일비,시가,고가,저가,거래량
1,2021.05.27,253000.0,6000.0,259500.0,260000.0,252000.0,1175231.0
2,2021.05.26,259000.0,1500.0,261500.0,264000.0,256500.0,257975.0
3,2021.05.25,260500.0,3000.0,265000.0,266000.0,255000.0,307669.0
4,2021.05.24,263500.0,2000.0,263000.0,272000.0,261000.0,543930.0
5,2021.05.21,261500.0,4500.0,268000.0,272000.0,257500.0,357864.0
9,2021.05.20,266000.0,5500.0,261000.0,266000.0,255000.0,342100.0
10,2021.05.18,260500.0,12000.0,250000.0,260500.0,249000.0,395521.0
11,2021.05.17,248500.0,6000.0,242500.0,248500.0,242500.0,143607.0
12,2021.05.14,242500.0,2000.0,245000.0,245000.0,241000.0,149565.0
13,2021.05.13,240500.0,5500.0,242000.0,246000.0,238000.0,248997.0
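
By default `dropna()` removes any row that has at least one missing value, while `dropna(how="all")` removes only rows in which every value is missing. The empty spacer rows in the table above are entirely NaN, so both give the same result there. The sketch below uses hypothetical sample rows, including one partially filled row added only to show the difference, and then casts the float columns to integers so that they no longer show a trailing `.0`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: index 0 and 4 mimic the empty spacer rows,
# index 3 is a made-up partially filled row.
raw = pd.DataFrame(
    {
        "날짜": [np.nan, "2021.05.27", "2021.05.26", "2021.05.25", np.nan],
        "종가": [np.nan, 253000.0, 259000.0, 260500.0, np.nan],
        "거래량": [np.nan, 1175231.0, 257975.0, np.nan, np.nan],
    }
)

any_nan = raw.dropna()           # drops rows containing at least one NaN
all_nan = raw.dropna(how="all")  # drops only rows where every value is NaN

# With no NaN left, the numeric columns can be cast from float to int.
clean = any_nan.astype({"종가": int, "거래량": int})
print(clean)
```

Note that the original row labels are kept after `dropna`, which is why the cleaned tables in this notebook skip from index 5 to 9.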


## Writing a Function to Collect Data by Page

In [54]:
from io import StringIO

def get_day_list(item_code, page_no):
    """Return one page of daily prices for item_code as a DataFrame."""
    url = f"https://finance.naver.com/item/sise_day.nhn?code={item_code}&page={page_no}"
    headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}
    response = requests.get(url, headers=headers)
    html = bs(response.text, "lxml")
    temp = html.select("table")
    table = pd.read_html(StringIO(str(temp)))
    # table[0] is the price table; dropna removes its empty spacer rows
    table = table[0].dropna()
    return table

In [55]:
# Check that the function works as intended
get_day_list("352820",2)

Unnamed: 0,날짜,종가,전일비,시가,고가,저가,거래량
1,2021.06.10,272500.0,10500.0,262000.0,279000.0,261500.0,946188.0
2,2021.06.09,262000.0,3000.0,265000.0,265000.0,261000.0,166528.0
3,2021.06.08,265000.0,500.0,265000.0,267000.0,263500.0,153360.0
4,2021.06.07,265500.0,500.0,267000.0,268000.0,261000.0,153686.0
5,2021.06.04,266000.0,3500.0,268500.0,269000.0,260000.0,284503.0
9,2021.06.03,269500.0,6000.0,264000.0,274000.0,262500.0,410640.0
10,2021.06.02,263500.0,2500.0,263000.0,266500.0,261500.0,204841.0
11,2021.06.01,261000.0,4000.0,266000.0,266500.0,258000.0,183347.0
12,2021.05.31,265000.0,4000.0,263500.0,267000.0,261500.0,225074.0
13,2021.05.28,261000.0,8000.0,255000.0,263000.0,254500.0,285689.0


## Collecting Every Date with a Loop
* (Caution) When collecting data over a long period, add a time.sleep() pause between requests so the server is not overloaded.

In [56]:
import numpy as np

In [67]:
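import time

item_list = []
start = 1
end = 10

for page_no in range(start, end + 1):
    # Collect one page at a time and keep each DataFrame in the list
    page = get_day_list(item_code, page_no)
    item_list.append(page)
    # Short pause between requests; 0.1 seconds is an arbitrary choice
    time.sleep(0.1)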

In [68]:
len(item_list)

10
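
The loop above stops at a fixed page number. To collect every available date, one option is to keep requesting pages until the first 날짜 value of a page repeats the previous page's value. The sketch below assumes that a page number past the last page returns the same rows as the last page; that is an assumption about the site's behavior, not something shown above. `collect_pages` and `fake_get_page` are hypothetical names, and `fake_get_page` stands in for `get_day_list` so the example runs without network access.

```python
import time

import pandas as pd


def collect_pages(get_page, item_code, max_pages=100, sleep_sec=0.1):
    """Collect pages until one is empty or repeats the previous page's first date."""
    frames = []
    prev_date = None
    for page_no in range(1, max_pages + 1):
        page = get_page(item_code, page_no)
        if page.empty:
            break
        first_date = page.iloc[0]["날짜"]
        if first_date == prev_date:
            break  # same page served again: assume the end was reached
        frames.append(page)
        prev_date = first_date
        time.sleep(sleep_sec)  # pause so the server is not overloaded
    return frames


def fake_get_page(item_code, page_no):
    """Stand-in for get_day_list: serves three pages, then repeats the last one."""
    rows = [("2021.06.24", 322000.0), ("2021.06.10", 272500.0), ("2021.05.27", 253000.0)]
    date, close = rows[min(page_no, len(rows)) - 1]
    return pd.DataFrame({"날짜": [date], "종가": [close]})


pages = collect_pages(fake_get_page, "352820", sleep_sec=0)
print(len(pages))
```

Passing the page-fetching function in as an argument keeps the stopping logic testable on its own; in the notebook it would be called as `collect_pages(get_day_list, item_code)`.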

## Combining the Collected Data into One DataFrame

<img src="https://pandas.pydata.org/docs/_images/merging_concat_basic.png">

* [Merge, join, concatenate and compare documentation](https://pandas.pydata.org/docs/user_guide/merging.html#merge-join-concatenate-and-compare)

In [72]:
item_list[0]

Unnamed: 0,날짜,종가,전일비,시가,고가,저가,거래량
1,2021.06.24,322000.0,500.0,326000.0,329500.0,317500.0,373105.0
2,2021.06.23,321500.0,3000.0,334500.0,337000.0,318500.0,687701.0
3,2021.06.22,324500.0,16500.0,311000.0,324500.0,311000.0,607966.0
4,2021.06.21,308000.0,5000.0,313000.0,321000.0,299500.0,556480.0
5,2021.06.18,313000.0,16000.0,298500.0,326500.0,297500.0,1630734.0
9,2021.06.17,297000.0,5000.0,292000.0,301000.0,289000.0,567305.0
10,2021.06.16,292000.0,3000.0,290500.0,298000.0,287000.0,478261.0
11,2021.06.15,289000.0,7000.0,285000.0,293000.0,279500.0,461455.0
12,2021.06.14,282000.0,6500.0,277000.0,286000.0,276500.0,502504.0
13,2021.06.11,275500.0,3000.0,272500.0,277500.0,271500.0,324529.0


In [73]:
# Stack the per-page DataFrames vertically into a single DataFrame
df = pd.concat(item_list)
df.shape

(100, 7)

<img src="https://pandas.pydata.org/docs/_images/02_io_readwrite.svg">

In [74]:
df.head()

Unnamed: 0,날짜,종가,전일비,시가,고가,저가,거래량
1,2021.06.24,322000.0,500.0,326000.0,329500.0,317500.0,373105.0
2,2021.06.23,321500.0,3000.0,334500.0,337000.0,318500.0,687701.0
3,2021.06.22,324500.0,16500.0,311000.0,324500.0,311000.0,607966.0
4,2021.06.21,308000.0,5000.0,313000.0,321000.0,299500.0,556480.0
5,2021.06.18,313000.0,16000.0,298500.0,326500.0,297500.0,1630734.0


In [75]:
df.tail()

Unnamed: 0,날짜,종가,전일비,시가,고가,저가,거래량
9,2021.02.04,241500.0,9000.0,234000.0,247000.0,228500.0,861501.0
10,2021.02.03,232500.0,0.0,231500.0,241000.0,225500.0,545274.0
11,2021.02.02,232500.0,15000.0,219000.0,242500.0,213500.0,1363038.0
12,2021.02.01,217500.0,13000.0,204500.0,217500.0,199000.0,608089.0
13,2021.01.29,204500.0,15500.0,222000.0,222000.0,200500.0,773127.0
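
The tail above still shows the per-page row labels (9 to 13) because `pd.concat` keeps each frame's own index by default. A minimal sketch of renumbering the rows and removing repeated rows follows; the two small frames are hypothetical stand-ins for pages returned by `get_day_list`.

```python
import pandas as pd

# Hypothetical stand-ins for two collected pages, each with its own row labels.
page1 = pd.DataFrame(
    {"날짜": ["2021.06.24", "2021.06.23"], "종가": [322000.0, 321500.0]},
    index=[1, 2],
)
page2 = pd.DataFrame(
    {"날짜": ["2021.06.10", "2021.06.09"], "종가": [272500.0, 262000.0]},
    index=[1, 2],
)

stacked = pd.concat([page1, page2])                        # labels 1, 2, 1, 2
renumbered = pd.concat([page1, page2], ignore_index=True)  # labels 0, 1, 2, 3

# If the same page is collected twice, drop_duplicates removes the repeated rows.
with_repeat = pd.concat([page1, page2, page2], ignore_index=True)
deduped = with_repeat.drop_duplicates().reset_index(drop=True)
print(deduped)
```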
