## [Mini Project] Developing a Machine Learning Model for Malicious Website Detection

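### You are an engineer on a corporate security team, and you have been tasked with developing a machine learning model that detects malicious websites based on features extracted from web pages.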

### ▣ What problem are we solving?
 - Extract features from web pages.
 - Build a well-performing AI model that determines whether a site is malicious.

<br>

---

## ▣ About the Data
* Web crawling dataset: Feature_Website.xlsx

## ▣ Variables in the Web Crawling Dataset
* html_code: the raw HTML code collected by crawling
* repu: whether the site is malicious (malicious: malicious site, benign: benign site)
<br>

---

## <b>[Stage 1] Data Collection</b>

* In Stage 1, you will practice building features from HTML code collected by crawling.

# <b>Step 0. Install packages before starting the exercise</b>
* Install the BeautifulSoup library
* Install the openpyxl library

In [1]:
pip install bs4

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py): started
  Building wheel for bs4 (setup.py): finished with status 'done'
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1271 sha256=0914fe543aa23533bcd83f29f4e39de1b42ae3f700bfda84ab528d542166e654
  Stored in directory: c:\users\wslee\appdata\local\pip\cache\wheels\73\2b\cb\099980278a0c9a3e57ff1a89875ec07bfa0b6fcbebb9a8cad3
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install openpyxl

Note: you may need to restart the kernel to use updated packages.


* Import the DataFrame-related libraries

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

---
## <b>Q1. Load the data</b>
### Load the Excel file containing benign and malicious HTML code
- File name: Feature_Website.xlsx


### <span style="color:darkred">[Problem 1] Use the Pandas library to load 'Feature_Website.xlsx' into the variable 'df', then inspect the data with info() and head().</span>

In [126]:
# Write your code below and check the result.
df = pd.read_excel('Feature_Website.xlsx')

In [3]:
# Check the DataFrame's info.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   html_code  40 non-null     object
 1   repu       40 non-null     object
dtypes: object(2)
memory usage: 768.0+ bytes


In [6]:
# Check the loaded data.
df.head()

Unnamed: 0,html_code,repu
0,<!DOCTYPE html>\n<!--[if lt IE 7]> <html lang=...,malicious
1,\n\t\n\n\n\t\n\n\t\n\n\n\t\n\n\n\t\n\n\t\n\t\t...,malicious
2,"<!DOCTYPE html>\n<html lang=""en"">\n <head>\...",malicious
3,"<!DOCTYPE html><html lang=""en""><head><style da...",malicious
4,<!DOCTYPE html>\n\n\n \n \n \n \n ...,malicious
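
Before extracting features, it can help to check how the two labels in `repu` are distributed and to prepare a numeric target. The following is a minimal sketch on a made-up frame: `toy_df` and the `label` column are hypothetical names used only for illustration, and the same calls would be applied to `df` on the real data.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; only the column names match.
toy_df = pd.DataFrame({
    "html_code": ["<html></html>", "<html><body></body></html>", "<p>hi</p>"],
    "repu": ["malicious", "benign", "malicious"],
})

# Count how many rows carry each label.
label_counts = toy_df["repu"].value_counts()
print(label_counts)

# Map the text label to 0/1 so it can later serve as a model target.
toy_df["label"] = (toy_df["repu"] == "malicious").astype(int)
print(toy_df[["repu", "label"]])
```

A strongly skewed count would suggest looking at more than plain accuracy when the model is evaluated later.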


---
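# <b>Step 1. Data collection</b>

### It is rare to build a model using only the data you are given.
### You often need to collect or generate additional data beyond what is provided.
### In this exercise, we take HTML data from benign and malicious sites collected by a web crawler
### and use the BeautifulSoup library to extract the features we need.
### The HTML code of the benign and malicious sites was collected in advance and saved in 'Feature_Website.xlsx'.


### <span style="color:blue">[Example] Use the BeautifulSoup library to print the HTML code and compute the length of the \<title> tag.</span>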

In [23]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(df['html_code'][0], 'html.parser')

* <span style="color:blue">Print the HTML code</span>

In [24]:
print(soup)

<!DOCTYPE html>

<!--[if lt IE 7]> <html lang="en-us" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->
<!--[if IE 7]>    <html lang="en-us" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->
<!--[if IE 8]>    <html lang="en-us" class="a-no-js a-lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="a-no-js" lang="en-us"><!--<![endif]--><head>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<title dir="ltr">Amazon.com</title>
<meta content="width=device-width" name="viewport"/>
<link href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css" rel="stylesheet"/>
<script>

if (true === true) {
    var ue_t0 = (+ new Date()),
        ue_csm = window,
        ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },
        ue_furl = "fls-na.amazon.com",
        ue_mid = "ATVPDKIKX0DER"

* <span style="color:blue">Print the \<title> tag and compute its length</span>

In [25]:
soup = BeautifulSoup(df['html_code'][0], 'html.parser')

# Print the <title> tag
print("* title :", soup.head.title)

# Print the length of the <title> text
print("* title length :", len(str(soup.head.title.getText())))

* title : <title dir="ltr">Amazon.com</title>
* title length : 10


In [127]:
def title_length(soup):
    # Length of the text inside <title>; 0.0 if <head> or <title> is missing.
    try:
        return float(len(soup.head.title.getText()))
    except Exception:
        return 0.0

In [128]:
title_len = []

for index, row in df.iterrows():
    soup = BeautifulSoup(row.html_code, 'html.parser')

    title_len.append(title_length(soup))

In [129]:
df['title_len'] = title_len
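
The loop-and-append pattern above can also be written with `Series.apply`, which parses each document and computes the feature in one expression. This is a minimal sketch under assumptions: `toy_df` and its two HTML strings are made up for illustration, and `title_length` is restated here only so the block runs on its own.

```python
import pandas as pd
from bs4 import BeautifulSoup


def title_length(soup):
    """Return the length of the <title> text, or 0.0 if <head> has no <title>."""
    try:
        return float(len(soup.head.title.get_text()))
    except AttributeError:
        return 0.0


# Hypothetical toy rows standing in for df['html_code'].
toy_df = pd.DataFrame({
    "html_code": [
        "<html><head><title>Amazon.com</title></head><body></body></html>",
        "<html><head></head><body>no title here</body></html>",
    ]
})

# Parse each document and compute the feature in a single expression.
toy_df["title_len"] = toy_df["html_code"].apply(
    lambda html: title_length(BeautifulSoup(html, "html.parser"))
)
print(toy_df["title_len"].tolist())
```

Both forms parse every row once, so the choice is mostly a matter of readability.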

In [34]:
# Print the first <script> tag in <head> of the currently parsed document
print("* script :", soup.head.script)

# Print the length of the <script> text
print("* script length :", len(str(soup.head.script.getText())))

* script : <script>function createPerformanceMark(e){void 0!==window.performance&&void 0!==window.performance.mark&&performance.mark(e)}function createPerformanceMeasure(e,r,o){void 0!==window.performance&&void 0!==window.performance.measure&&performance.measure(e,r,o)}</script>
* script length : 251


In [130]:
df.head()

Unnamed: 0,html_code,repu,title_len
0,<!DOCTYPE html>\n<!--[if lt IE 7]> <html lang=...,malicious,10.0
1,\n\t\n\n\n\t\n\n\t\n\n\n\t\n\n\n\t\n\n\t\n\t\t...,malicious,5.0
2,"<!DOCTYPE html>\n<html lang=""en"">\n <head>\...",malicious,71.0
3,"<!DOCTYPE html><html lang=""en""><head><style da...",malicious,0.0
4,<!DOCTYPE html>\n\n\n \n \n \n \n ...,malicious,27.0


---

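## <b>Q2. Compute the length of the \<script>...\</script> tag in the HTML</b>
- Convert the HTML source into a Python object with BeautifulSoup
- Implement it as a function
- Return the result as a float

### <span style="color:darkred">[Problem 2] Use the BeautifulSoup library to complete a function that computes the length of the \<script> tag in the HTML code, then check the result.</span>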

In [246]:
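# Write the functions that extract the feature data.
def head_script_length(soup):
    # Length of the first <script> tag in <head>, markup included.
    # Note: if <head> has no <script>, str(None) is "None", so this returns 4.
    try:
        return len(str(soup.head.script))
    except Exception:
        return 0.0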

In [247]:
def body_script_length(soup):
    try:
        return len(str(soup.body.script))
    except:
        return 0.0

In [248]:
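# Redefines body_script_length: this version replaces the one above and
# measures only the text inside the first <script> tag in <body>, without the tags.
def body_script_length(soup):
    try:
        return float(len(soup.body.script.getText()))
    except Exception:
        return 0.0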
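
Two details of the functions above are worth keeping in mind. `soup.head.script` returns only the first `<script>` tag, and it returns `None` when there is none, in which case `len(str(None))` is 4 rather than 0. The sketch below shows one possible way to handle both cases. The helper names `first_script_length` and `total_script_length` are hypothetical and not part of the exercise template.

```python
from bs4 import BeautifulSoup


def first_script_length(section):
    """Length of the first <script> tag (markup included) inside `section`.

    Returns 0.0 when the section or the script tag is missing, instead of
    len("None") == 4.
    """
    if section is None:
        return 0.0
    script = section.find("script")
    return float(len(str(script))) if script is not None else 0.0


def total_script_length(section):
    """Sum of the lengths of every <script> tag inside `section`."""
    if section is None:
        return 0.0
    return float(sum(len(str(s)) for s in section.find_all("script")))


html = (
    "<html><head><script>var a=1;</script><script>var b=2;</script></head>"
    "<body><p>no script</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

print(first_script_length(soup.head))   # first <script> only
print(total_script_length(soup.head))   # both <script> tags
print(first_script_length(soup.body))   # no <script> in <body>
print(len(str(soup.body.script)))       # the pitfall: str(None) has length 4
```

Whether the first script or the total is the more useful feature is a modeling choice that this sketch does not settle.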

In [249]:
# Extract the feature data from the html_code column of the DataFrame.
head_script_len = []
body_script_len = []

for index, row in df.iterrows():
    soup = BeautifulSoup(row.html_code, 'html.parser')

    head_script_len.append(head_script_length(soup))
    body_script_len.append(body_script_length(soup))

In [250]:
# Store the extracted feature data in the DataFrame for inspection.
df['head_script_len'] = head_script_len
df['body_script_len'] = body_script_len

In [251]:
df.head()

Unnamed: 0,html_code,repu,title_len,head_script_len,body_script_len,blank_count_html,body_len,src_count_head_script,href_count_head_script,src_count_body_script,href_count_body_script,SrcOrHrefTag_count_head_script,SrcOrHrefTag_count_body_script
0,<!DOCTYPE html>\n<!--[if lt IE 7]> <html lang=...,malicious,10.0,405,185.0,65.0,402.0,0.0,0.0,1.0,0.0,0.0,1.0
1,\n\t\n\n\n\t\n\n\t\n\n\n\t\n\n\n\t\n\n\t\n\t\t...,malicious,5.0,579,0.0,87.0,1041.0,0.0,0.0,1.0,0.0,0.0,1.0
2,"<!DOCTYPE html>\n<html lang=""en"">\n <head>\...",malicious,71.0,817,0.0,190.0,433.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"<!DOCTYPE html><html lang=""en""><head><style da...",malicious,0.0,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,<!DOCTYPE html>\n\n\n \n \n \n \n ...,malicious,27.0,425,0.0,1802.0,2969.0,0.0,0.0,0.0,0.0,0.0,0.0


In [65]:
soup = BeautifulSoup(df['html_code'][0], 'html.parser')

# Print the first <script> tag in <head>
print("* script :", soup.head.script)

# Print the length of the <script> text
print("* script length :", len(str(soup.head.script.getText())))

* script : <script>

if (true === true) {
    var ue_t0 = (+ new Date()),
        ue_csm = window,
        ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },
        ue_furl = "fls-na.amazon.com",
        ue_mid = "ATVPDKIKX0DER",
        ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],
        ue_sn = "opfcaptcha.amazon.com",
        ue_id = 'BY6PY7346W2THV0MAFYZ';
}
</script>
* script length : 388


---

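## <b>Q3. Count the spaces in the HTML</b>

- Convert the HTML source into a Python object with BeautifulSoup
- Implement it as a function
- Return the result as a float

### <span style="color:darkred">[Problem 3] Use the BeautifulSoup library to complete a function that counts the spaces in the \<html> tag of the HTML code, then check the result.</span>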

In [136]:
soup = BeautifulSoup(df['html_code'][0], 'html.parser')

# Count the space characters in the text content of <body>.
body_text = soup.body.getText()
count = body_text.count(" ")

print(count)

65


In [278]:
# Write the function that extracts the feature data.
def count_blank_html(soup):
    # Number of space characters in the <body> markup (tags and attributes included).
    try:
        return float(str(soup.body).count(" "))
    except Exception:
        return 0.0
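
`str.count(" ")` counts only the ASCII space character, and the result depends on whether it runs over the `<body>` markup or only over its text. The sketch below contrasts these variants on a made-up snippet. The HTML string and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html = "<html><body><p class=\"x\">a b\tc\nd</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

body_markup = str(soup.body)       # tags and attributes included
body_text = soup.body.get_text()   # text content only

# ASCII spaces only, as in count_blank_html above.
spaces_in_markup = body_markup.count(" ")
spaces_in_text = body_text.count(" ")

# Any whitespace character (space, tab, newline, ...).
whitespace_in_text = sum(ch.isspace() for ch in body_text)

print(spaces_in_markup, spaces_in_text, whitespace_in_text)
```

The markup count is higher than the text count because the space inside `<p class="x">` is included. That difference is why counting over the text and counting over the markup of the same document give different numbers.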

In [279]:
# Extract the feature data from the html_code column of the DataFrame.
blank_count_html = []

for index, row in df.iterrows():
    soup = BeautifulSoup(row.html_code, 'html.parser')
    blank_count_html.append(count_blank_html(soup))

In [280]:
# Store the extracted feature data in the DataFrame for inspection.
df['blank_count_html'] = blank_count_html

In [281]:
df.head()

Unnamed: 0,html_code,repu,title_len,head_script_len,body_script_len,blank_count_html,body_len,src_count_head_script,href_count_head_script,src_count_body_script,href_count_body_script,SrcOrHrefTag_count_head_script,SrcOrHrefTag_count_body_script
0,<!DOCTYPE html>\n<!--[if lt IE 7]> <html lang=...,malicious,10.0,405,185.0,361.0,3836.0,0.0,0.0,1.0,0.0,0.0,1.0
1,\n\t\n\n\n\t\n\n\t\n\n\n\t\n\n\n\t\n\n\t\n\t\t...,malicious,5.0,579,0.0,774.0,15268.0,0.0,0.0,1.0,0.0,0.0,1.0
2,"<!DOCTYPE html>\n<html lang=""en"">\n <head>\...",malicious,71.0,817,0.0,1469.0,16542.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"<!DOCTYPE html><html lang=""en""><head><style da...",malicious,0.0,4,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0
4,<!DOCTYPE html>\n\n\n \n \n \n \n ...,malicious,27.0,425,0.0,2304.0,20314.0,0.0,0.0,0.0,0.0,0.0,0.0


---

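## <b>Q4. Compute the length of the body in the HTML</b>

- Convert the HTML source into a Python object with BeautifulSoup
- Implement it as a function
- Return the result as a float

### <span style="color:darkred">[Problem 4] Use the BeautifulSoup library to complete a function that computes the length of the \<body> tag in the HTML code, then check the result.</span>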

In [252]:
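# Write the function that extracts the feature data.
def body_length(soup):
    # Length of the <body> markup.
    # Note: if there is no <body>, str(None) is "None", so this returns 4.0.
    try:
        return float(len(str(soup.body)))
    except Exception:
        return 0.0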

In [253]:
# Extract the feature data from the html_code column of the DataFrame.
body_len = []

for index, row in df.iterrows():
    soup = BeautifulSoup(row.html_code, 'html.parser')

    body_len.append(body_length(soup))

In [254]:
# Store the extracted feature data in the DataFrame for inspection.
df['body_len'] = body_len

In [255]:
df.head()

Unnamed: 0,html_code,repu,title_len,head_script_len,body_script_len,blank_count_html,body_len,src_count_head_script,href_count_head_script,src_count_body_script,href_count_body_script,SrcOrHrefTag_count_head_script,SrcOrHrefTag_count_body_script
0,<!DOCTYPE html>\n<!--[if lt IE 7]> <html lang=...,malicious,10.0,405,185.0,65.0,3836.0,0.0,0.0,1.0,0.0,0.0,1.0
1,\n\t\n\n\n\t\n\n\t\n\n\n\t\n\n\n\t\n\n\t\n\t\t...,malicious,5.0,579,0.0,87.0,15268.0,0.0,0.0,1.0,0.0,0.0,1.0
2,"<!DOCTYPE html>\n<html lang=""en"">\n <head>\...",malicious,71.0,817,0.0,190.0,16542.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"<!DOCTYPE html><html lang=""en""><head><style da...",malicious,0.0,4,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0
4,<!DOCTYPE html>\n\n\n \n \n \n \n ...,malicious,27.0,425,0.0,1802.0,20314.0,0.0,0.0,0.0,0.0,0.0,0.0


---

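## <b>Q5. Count the tags with src and href attributes in script</b>

- Convert the HTML source into a Python object with BeautifulSoup
- Implement it as a function
- Return the result as a float

### <span style="color:darkred">[Problem 5] Use the BeautifulSoup library to complete a function that counts the tags with src or href attributes within the \<script> tags of the HTML code, then check the result.</span>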


In [245]:
# Inspect the first <script> tag in <head> for the row at index 33.
soup = BeautifulSoup(df['html_code'][33], 'html.parser')

try:
    print(soup.head.script)
    # Split the tag's markup on '>' to look at its pieces.
    tmp = str(soup.head.script).split('>')
    print(tmp)
except Exception:
    print(0)

<script>
window.vanilla = window.vanilla || {};window.VAN = window.VAN || {};
// For AB testing
// We should come up with a list of feature flags
var defaultFlags = {
parsely: {
enabled: true
},
onesignal: {
enabled: true
},
gtm: {
vanilla: true,
sitespecific: true,
},
selligent: {
enabled: true
},
sourcepoint: {
enabled: true
}
}
window.vanilla.featureFlags = Object.assign({}, defaultFlags, window.vanilla.featureFlags || {})
window.vanilla.addScript = function(src, id) {
var script = window.document.createElement('script');
script.src = src;
script.async = true;
script.id = id;
window.document.head.appendChild(script);
}; !function(){if('PerformanceLongTaskTiming' in window){var g=window.__tti={e:[]};
g.o=new PerformanceObserver(function(l){g.e=g.e.concat(l.getEntries())});
g.o.observe({entryTypes:['longtask']})}}();
var startFramesMeasurement=function(){window.vanilla.framesSampled=[];window.vanilla.frames=0,window.vanilla.loadtime=window.performance.now();var a=function(){window.van

In [233]:
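# Write the functions that extract the feature data.
def head_script_src_count(soup):
    # Counts occurrences of the substring 'src' in the first <script> tag of <head>.
    # This also matches 'src' inside the JavaScript text, not only src attributes.
    try:
        return float(str(soup.head.script).count('src'))
    except Exception:
        return 0.0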

In [234]:
def head_script_href_count(soup):
    try:
        return float(str(soup.head.script).count('href'))
    except:
        return 0.0

In [235]:
def body_script_src_count(soup):
    try:
        return float(str(soup.body.script).count('src'))
    except:
        return 0.0

In [236]:
def body_script_href_count(soup):
    try:
        return float(str(soup.body.script).count('href'))
    except:
        return 0.0

In [292]:
def head_script_SrcOrHrefTag_count(soup):
    try:        
        count = float(len(soup.head.find_all()))
        count -= len(soup.head.find_all(src=False, href=False))
        
        return count
    
    except:
        return 0.0

In [293]:
def body_script_SrcOrHrefTag_count(soup):
    try:
        count = float(len(soup.body.find_all()))
        count -= len(soup.body.find_all(src=False, href=False))
                
        return count
    
    except:
        return 0.0
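
The two helpers above arrive at the number of tags carrying `src` or `href` by subtraction. A more direct formulation is sketched below, together with a variant that counts only `<script>` tags loaded through `src`. The function names and the sample HTML are hypothetical and are not part of the exercise template.

```python
from bs4 import BeautifulSoup


def count_src_or_href_tags(section):
    """Number of tags inside `section` that have a src or an href attribute."""
    if section is None:
        return 0.0
    tags = section.find_all(lambda tag: tag.has_attr("src") or tag.has_attr("href"))
    return float(len(tags))


def count_external_scripts(section):
    """Number of <script> tags inside `section` that load code via src."""
    if section is None:
        return 0.0
    return float(len(section.find_all("script", src=True)))


html = """
<html>
  <head>
    <link href="style.css" rel="stylesheet"/>
    <script src="app.js"></script>
    <script>var src = "not an attribute";</script>
  </head>
  <body><a href="/home">home</a><img src="logo.png"/><p>text</p></body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

print(count_src_or_href_tags(soup.head))   # link + first script
print(count_external_scripts(soup.head))   # only the script with src
print(count_src_or_href_tags(soup.body))   # a + img
```

Matching on attributes avoids counting the word `src` when it merely appears inside JavaScript text, as in the second `<script>` of the sample.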

In [294]:
# Extract the feature data from the html_code column of the DataFrame.
src_count_head_script = []
href_count_head_script = []
src_count_body_script = []
href_count_body_script = []

for index, row in df.iterrows():
    soup = BeautifulSoup(row.html_code, 'html.parser')

    src_count_head_script.append(head_script_src_count(soup))
    href_count_head_script.append(head_script_href_count(soup))
    src_count_body_script.append(body_script_src_count(soup))
    href_count_body_script.append(body_script_href_count(soup))

In [295]:
# Store the extracted feature data in the DataFrame for inspection.
df['src_count_head_script'] = src_count_head_script
df['href_count_head_script'] = href_count_head_script
df['src_count_body_script'] = src_count_body_script
df['href_count_body_script'] = href_count_body_script

In [296]:
df

Unnamed: 0,html_code,repu,title_len,head_script_len,body_script_len,blank_count_html,body_len,src_count_head_script,href_count_head_script,src_count_body_script,href_count_body_script,SrcOrHrefTag_count_head_script,SrcOrHrefTag_count_body_script
0,<!DOCTYPE html>\n<!--[if lt IE 7]> <html lang=...,malicious,10.0,405,185.0,361.0,3836.0,0.0,0.0,1.0,0.0,1,1.0
1,\n\t\n\n\n\t\n\n\t\n\n\n\t\n\n\n\t\n\n\t\n\t\t...,malicious,5.0,579,0.0,774.0,15268.0,0.0,0.0,1.0,0.0,140,1.0
2,"<!DOCTYPE html>\n<html lang=""en"">\n <head>\...",malicious,71.0,817,0.0,1469.0,16542.0,0.0,0.0,0.0,0.0,15,0.0
3,"<!DOCTYPE html><html lang=""en""><head><style da...",malicious,0.0,4,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0,0.0
4,<!DOCTYPE html>\n\n\n \n \n \n \n ...,malicious,27.0,425,0.0,2304.0,20314.0,0.0,0.0,0.0,0.0,77,0.0
5,_x000D_\n_x000D_\n_x000D_\n<!DOCTYPE html>_x00...,malicious,17.0,3180,0.0,0.0,4.0,0.0,0.0,0.0,0.0,15,0.0
6,"<!doctype html>\n\n<html data-ytrk-page=""HOME""...",malicious,36.0,5056,15554.0,1806.0,20018.0,0.0,0.0,0.0,0.0,16,0.0
7,"\n\t<!DOCTYPE html>\n\t<html class=""no-icon-fo...",malicious,45.0,172,0.0,0.0,4.0,0.0,0.0,0.0,0.0,54,0.0
8,"<!DOCTYPE html>\n<html class=""no-js"">\n<head>\...",malicious,77.0,4,1349.0,1337.0,29661.0,0.0,0.0,0.0,0.0,5,0.0
9,"<!DOCTYPE html>\n<html class=""b-header--bl...",malicious,14.0,119,0.0,0.0,4.0,0.0,0.0,0.0,0.0,3,0.0


In [297]:
SrcOrHrefTag_count_head_script = []
SrcOrHrefTag_count_body_script = []

for index, row in df.iterrows():
    soup = BeautifulSoup(row.html_code, 'html.parser')
    
    SrcOrHrefTag_count_head_script.append(head_script_SrcOrHrefTag_count(soup))
    SrcOrHrefTag_count_body_script.append(body_script_SrcOrHrefTag_count(soup))

In [298]:
df['SrcOrHrefTag_count_head_script'] = SrcOrHrefTag_count_head_script
df['SrcOrHrefTag_count_body_script'] = SrcOrHrefTag_count_body_script

In [299]:
df

Unnamed: 0,html_code,repu,title_len,head_script_len,body_script_len,blank_count_html,body_len,src_count_head_script,href_count_head_script,src_count_body_script,href_count_body_script,SrcOrHrefTag_count_head_script,SrcOrHrefTag_count_body_script
0,<!DOCTYPE html>\n<!--[if lt IE 7]> <html lang=...,malicious,10.0,405,185.0,361.0,3836.0,0.0,0.0,1.0,0.0,1.0,4.0
1,\n\t\n\n\n\t\n\n\t\n\n\n\t\n\n\n\t\n\n\t\n\t\t...,malicious,5.0,579,0.0,774.0,15268.0,0.0,0.0,1.0,0.0,140.0,33.0
2,"<!DOCTYPE html>\n<html lang=""en"">\n <head>\...",malicious,71.0,817,0.0,1469.0,16542.0,0.0,0.0,0.0,0.0,15.0,15.0
3,"<!DOCTYPE html><html lang=""en""><head><style da...",malicious,0.0,4,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0
4,<!DOCTYPE html>\n\n\n \n \n \n \n ...,malicious,27.0,425,0.0,2304.0,20314.0,0.0,0.0,0.0,0.0,77.0,35.0
5,_x000D_\n_x000D_\n_x000D_\n<!DOCTYPE html>_x00...,malicious,17.0,3180,0.0,0.0,4.0,0.0,0.0,0.0,0.0,15.0,0.0
6,"<!doctype html>\n\n<html data-ytrk-page=""HOME""...",malicious,36.0,5056,15554.0,1806.0,20018.0,0.0,0.0,0.0,0.0,16.0,43.0
7,"\n\t<!DOCTYPE html>\n\t<html class=""no-icon-fo...",malicious,45.0,172,0.0,0.0,4.0,0.0,0.0,0.0,0.0,54.0,0.0
8,"<!DOCTYPE html>\n<html class=""no-js"">\n<head>\...",malicious,77.0,4,1349.0,1337.0,29661.0,0.0,0.0,0.0,0.0,5.0,198.0
9,"<!DOCTYPE html>\n<html class=""b-header--bl...",malicious,14.0,119,0.0,0.0,4.0,0.0,0.0,0.0,0.0,3.0,0.0


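## <b>Q6. Additional features that can be derived</b>

- Convert the HTML source into a Python object with BeautifulSoup
- Implement it as a function
- Return the result as an appropriate data type

### <span style="color:darkred">[Problem 6] Use the BeautifulSoup library to find additional features that can be built from the HTML code, then check the result.</span>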


In [12]:
# Write the function that extracts the feature data.
def script_src_count(soup):
    # Number of tags anywhere in the document that have a src attribute
    # (not limited to <script> tags).
    try:
        return float(len(soup.find_all(src=True)))
    except Exception:
        return 0.0
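
As a starting point for further features, several simple counts can be collected per document and turned into columns in one step. The sketch below is an assumption-laden illustration: the function `extra_features`, the chosen feature names, and the toy HTML strings are all hypothetical, and whether any of these counts helps separate malicious from benign pages has not been verified here.

```python
from bs4 import BeautifulSoup
import pandas as pd


def extra_features(html):
    """Return a dict of simple candidate features for one HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "html_len": float(len(html)),
        "script_count": float(len(soup.find_all("script"))),
        "iframe_count": float(len(soup.find_all("iframe"))),
        "form_count": float(len(soup.find_all("form"))),
        "link_count": float(len(soup.find_all("a", href=True))),
    }


toy_html = [
    "<html><body><iframe src='x'></iframe><a href='/a'>a</a></body></html>",
    "<html><body><form></form><script>1</script><script>2</script></body></html>",
]

# One row of features per document.
feature_df = pd.DataFrame([extra_features(h) for h in toy_html])
print(feature_df)
```

Returning a dict per document keeps each document parsed only once, however many features are added.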

In [None]:
# Extract the feature data from the html_code column of the DataFrame.


In [None]:
# Check the extracted feature data.

