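## Web Crawling
### 1. Requests: fetch the web page's HTML with requests.
### 2. BeautifulSoup: parse the HTML document so the data can be processed.
> - To use a web page's data directly, the HTML cannot be read as plain text; the tag structure has to be analyzed so that only the desired information is picked out.
### 3. Select the desired elements with the methods that extract the information you need.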

![HTML structure](https://www.tcpschool.com/lectures/img_html_tag_structure.png)
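The three steps above can be sketched as two small functions. This is a minimal sketch rather than a definitive implementation: the names `fetch_html` and `extract_link_texts`, the 10-second timeout, and the sample HTML are assumptions made for illustration. Fetching is kept separate from parsing so the parsing step can be tried on an inline string without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Step 1: download the page and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


def extract_link_texts(html: str) -> list[str]:
    """Steps 2 and 3: parse the HTML, then select every <a> tag's text."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.find_all("a")]


# Hypothetical sample document, used so the example runs offline.
sample_html = '<p><a href="/a">First</a> and <a href="/b"> Second </a></p>'
print(extract_link_texts(sample_html))
```

For a live page, the two functions chain as `extract_link_texts(fetch_html(url))`.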


In [71]:
'''
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
'''

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<h1> Beautifule Soup </h1>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

In [72]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <h1>
   Beautifule Soup
  </h1>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



In [53]:
print(soup.title)
# <title>The Dormouse's story</title>

<title>The Dormouse's story</title>


In [54]:
soup.title.text

"The Dormouse's story"

In [60]:
soup.h1

<h1> Beautifule Soup </h1>

In [61]:
soup.h1.text

' Beautifule Soup '

In [6]:
print(soup.title.name)
# u'title'


title


In [7]:
print(soup.title.string)
# u'The Dormouse's story'


The Dormouse's story


In [8]:

print(soup.title.parent.name)
# u'head'


head


In [9]:
print(soup.p)
# <p class="title"><b>The Dormouse's story</b></p>

<p class="title"><b>The Dormouse's story</b></p>


In [65]:
soup.find_all('p')

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [74]:
length = len(soup.find_all('p'))

In [76]:
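paragraphs = soup.find_all('p')  # query the tree once instead of on every iteration
for i in range(length):
    print(paragraphs[i])       # the tag, markup included
    print(paragraphs[i].text)  # only the text inside the tag
    print('*'*100)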

<p class="title"><b>The Dormouse's story</b></p>
The Dormouse's story
****************************************************************************************************
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
****************************************************************************************************
<p class="story">...</p>
...
****************************************************************************************************


In [10]:
print(soup.p['class'])
# ['title']  (class can hold several values, so it comes back as a list)

['title']


In [12]:
print(soup.p['class'][0])

title


In [11]:
print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


In [13]:
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


In [15]:
print(soup.find(id="link3"))
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
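The same elements can also be reached with CSS selectors through `select()` and `select_one()`, which Beautiful Soup offers alongside `find()` and `find_all()`. The sketch below uses a shortened copy of the sample document; the variable names are chosen for illustration only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document used above.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find() / find_all() filter by tag name and attributes.
by_id = soup.find(id="link3")
by_class = soup.find_all("a", class_="sister")

# select_one() / select() take a CSS selector string instead.
css_by_id = soup.select_one("#link3")
css_by_class = soup.select("p.story > a.sister")

print(by_id == css_by_id)                      # both routes reach the same tag
print([a.get_text() for a in css_by_class])    # link texts, in document order
print([a["href"] for a in by_class])           # attribute access works like a dict
```

Selectors tend to be convenient when the target depends on nesting (for example, only links directly inside `p.story`), while `find_all()` is often enough for a flat tag or attribute filter.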


In [16]:
for link in soup.find_all('a'):
    print(link.get('href'))
    
print(soup.get_text())

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
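The `get_text()` output above keeps every newline from the source HTML. When a single clean line is wanted, `get_text()` accepts `separator` and `strip` arguments, and `stripped_strings` yields the individual text pieces. A minimal sketch on a hypothetical fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with line breaks between the tags.
html = "<p class='story'>Once upon a time\n<a href='#'>Elsie</a>,\n<a href='#'>Lacie</a></p>"
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text piece and drops empty ones;
# separator is placed between the remaining pieces.
one_line = soup.get_text(separator=" ", strip=True)
print(one_line)

# stripped_strings gives the same pieces as a generator.
pieces = list(soup.stripped_strings)
print(pieces)
```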



## Check a site's robots.txt before crawling it
## e.g. google.com/robots.txt, naver.com/robots.txt
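A site's robots.txt states which paths crawlers may visit, and Python's standard library can evaluate those rules. The sketch below parses an inlined, hypothetical robots.txt so it runs offline; the user-agent name `my-crawler` and the example URLs are assumptions, not taken from either site above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# can_fetch(user_agent, url) applies the parsed rules to a concrete URL.
blocked = parser.can_fetch("my-crawler", "https://example.com/search?q=python")
allowed = parser.can_fetch("my-crawler", "https://example.com/books/")
delay = parser.crawl_delay("my-crawler")

print(blocked, allowed, delay)
```

Against a live site, `parser.set_url("https://<host>/robots.txt")` followed by `parser.read()` replaces the `parse()` call.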

In [17]:
import requests
from bs4 import BeautifulSoup

url = "http://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for book in soup.find_all("h3"):
    print(book.a.text)


A Light in the ...
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History ...
The Requiem Red
The Dirty Little Secrets ...
The Coming Woman: A ...
The Boys in the ...
The Black Maria
Starving Hearts (Triangular Trade ...
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little ...
Rip it Up and ...
Our Band Could Be ...
Olio
Mesaerion: The Best Science ...
Libertarianism for Beginners
It's Only the Himalayas


In [18]:
url = "http://quotes.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for quote in soup.find_all("span", class_="text"):
    print(quote.text)

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”


In [26]:
quotes = [quote.text for quote in soup.find_all("span", class_="text")]
quotes

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']

In [37]:
texts = []
for quote in soup.find_all("span", class_="text"):
    # Text with the surrounding quotation marks removed
    cleaned_text = quote.text.strip('“”')
    print(cleaned_text)
    texts.append(cleaned_text)
print(texts)

The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.
It is our choices, Harry, that show what we truly are, far more than our abilities.
There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.
The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.
Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.
Try not to become a man of success. Rather become a man of value.
It is better to be hated for what you are than to be loved for what you are not.
I have not failed. I've just found 10,000 ways that won't work.
A woman is like a tea bag; you never know how strong it is until it's in hot water.
A day without sunshine is like, you know, night.
['The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.
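One detail of `str.strip('“”')` is worth keeping in mind: it treats its argument as a set of characters and removes them only from the two ends of the string. Quotation marks inside the text stay. The second string below is a made-up example to show the difference.

```python
raw = "“A day without sunshine is like, you know, night.”"
nested = "“She said “no” twice.”"  # hypothetical quote with inner quotation marks

# strip() removes any of the given characters, but only at the ends.
print(raw.strip("“”"))
print(nested.strip("“”"))

# replace() is one way to drop the marks everywhere, if that is what is wanted.
no_marks = nested.replace("“", "").replace("”", "")
print(no_marks)
```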

In [40]:
# For text mining, always store the corpus as a list like this
texts = [ quote.text.strip('“”') for quote in soup.find_all("span", class_="text")]
texts

['The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.',
 'It is our choices, Harry, that show what we truly are, far more than our abilities.',
 'There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.',
 'The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.',
 "Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.",
 'Try not to become a man of success. Rather become a man of value.',
 'It is better to be hated for what you are than to be loved for what you are not.',
 "I have not failed. I've just found 10,000 ways that won't work.",
 "A woman is like a tea bag; you never know how strong it is until it's in hot water.",
 'A day without sunshine is like, you know, night.']

In [41]:
from collections import Counter
import string

# Text preprocessing and word-frequency counting
def word_frequency(texts):
    # Join all texts, lowercase them, and strip punctuation
    cleaned_texts = ' '.join(texts).lower().translate(str.maketrans('', '', string.punctuation))
    # Count how often each word occurs
    word_counts = Counter(cleaned_texts.split())
    return word_counts

# Compute the word frequencies
word_counts = word_frequency(texts)
print(word_counts)


Counter({'is': 12, 'a': 9, 'it': 6, 'be': 6, 'to': 5, 'our': 4, 'are': 4, 'not': 4, 'you': 4, 'the': 3, 'as': 3, 'of': 3, 'what': 3, 'than': 3, 'we': 2, 'have': 2, 'thinking': 2, 'without': 2, 'that': 2, 'ways': 2, 'though': 2, 'miracle': 2, 'in': 2, 'its': 2, 'better': 2, 'absolutely': 2, 'become': 2, 'man': 2, 'for': 2, 'like': 2, 'know': 2, 'world': 1, 'created': 1, 'process': 1, 'cannot': 1, 'changed': 1, 'changing': 1, 'choices': 1, 'harry': 1, 'show': 1, 'truly': 1, 'far': 1, 'more': 1, 'abilities': 1, 'there': 1, 'only': 1, 'two': 1, 'live': 1, 'your': 1, 'life': 1, 'one': 1, 'nothing': 1, 'other': 1, 'everything': 1, 'person': 1, 'gentleman': 1, 'or': 1, 'lady': 1, 'who': 1, 'has': 1, 'pleasure': 1, 'good': 1, 'novel': 1, 'must': 1, 'intolerably': 1, 'stupid': 1, 'imperfection': 1, 'beauty': 1, 'madness': 1, 'genius': 1, 'and': 1, 'ridiculous': 1, 'boring': 1, 'try': 1, 'success': 1, 'rather': 1, 'value': 1, 'hated': 1, 'loved': 1, 'i': 1, 'failed': 1, 'ive': 1, 'just': 1, 'fou
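The counts above are dominated by function words such as `is` and `a`, and deleting punctuation turns `it's` into `its`. One common follow-up is to filter out stop words and look only at the top entries with `Counter.most_common()`. The stop-word set below is a tiny hypothetical list for illustration, and only two of the quotes are used to keep the example short.

```python
from collections import Counter
import string

texts = [
    "Try not to become a man of success. Rather become a man of value.",
    "A day without sunshine is like, you know, night.",
]

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "of", "to", "is", "not", "you"}

# Same preprocessing as above: join, lowercase, delete punctuation, split.
table = str.maketrans("", "", string.punctuation)
words = " ".join(texts).lower().translate(table).split()

# Count only the words that are not stop words.
counts = Counter(w for w in words if w not in STOP_WORDS)
top_two = counts.most_common(2)
print(top_two)
```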

# Feature Representation
# Representation Learning = Machine Learning

In [44]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())


['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


In [46]:
import pandas as pd
pd.DataFrame(X.toarray(),
             columns = vectorizer.get_feature_names_out())

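|   | and | document | first | is | one | second | the | third | this |
|---|-----|----------|-------|----|-----|--------|-----|-------|------|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 2 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 3 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |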
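To see where these numbers come from, the same document-term matrix can be rebuilt with the standard library alone. This is a sketch of the idea, not scikit-learn's actual implementation: it assumes lowercasing plus the documented default token pattern (words of two or more characters) and a vocabulary sorted alphabetically.

```python
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Assumed to mirror CountVectorizer's default: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    """Lowercase the document and split it into word tokens."""
    return TOKEN_PATTERN.findall(doc.lower())


# Columns: every distinct token in the corpus, in sorted order.
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

# Rows: for each document, how often each vocabulary term occurs in it.
matrix = [[tokenize(doc).count(term) for term in vocabulary] for doc in corpus]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one word, which is why word order inside a document is lost in this representation.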


In [47]:
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(corpus)
vectorizer2.get_feature_names_out()

pd.DataFrame(X2.toarray(),
             columns = vectorizer2.get_feature_names_out())

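|   | and this | document is | first document | is the | is this | second document | the first | the second | the third | third one | this document | this is | this the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |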
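With `ngram_range=(2, 2)` the columns are pairs of adjacent words rather than single words. A plain-Python sketch of that pairing step is shown below; the tokenizer is an assumption meant to approximate CountVectorizer's defaults, and `bigrams` is a helper name introduced here for illustration.

```python
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Assumed approximation of the default tokenizer: lowercase, 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def bigrams(doc):
    """Return the adjacent word pairs of a document as space-joined strings."""
    tokens = TOKEN_PATTERN.findall(doc.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# Vocabulary of bigrams, sorted, and the counts for the last document.
bigram_vocabulary = sorted({bg for doc in corpus for bg in bigrams(doc)})
last_row = [bigrams(corpus[3]).count(bg) for bg in bigram_vocabulary]

print(bigram_vocabulary)
print(last_row)
```

Unlike single-word counts, bigrams separate the first and fourth documents: they contain the same words, but `is this` and `this the` occur only in the question.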


In [48]:
texts

['The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.',
 'It is our choices, Harry, that show what we truly are, far more than our abilities.',
 'There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.',
 'The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.',
 "Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.",
 'Try not to become a man of success. Rather become a man of value.',
 'It is better to be hated for what you are than to be loved for what you are not.',
 "I have not failed. I've just found 10,000 ways that won't work.",
 "A woman is like a tea bag; you never know how strong it is until it's in hot water.",
 'A day without sunshine is like, you know, night.']

In [49]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
print(X.toarray())


['000' '10' 'abilities' 'absolutely' 'and' 'are' 'as' 'bag' 'be' 'beauty'
 'become' 'better' 'boring' 'cannot' 'changed' 'changing' 'choices'
 'created' 'day' 'everything' 'failed' 'far' 'for' 'found' 'genius'
 'gentleman' 'good' 'harry' 'has' 'hated' 'have' 'hot' 'how'
 'imperfection' 'in' 'intolerably' 'is' 'it' 'just' 'know' 'lady' 'life'
 'like' 'live' 'loved' 'madness' 'man' 'miracle' 'more' 'must' 'never'
 'night' 'not' 'nothing' 'novel' 'of' 'one' 'only' 'or' 'other' 'our'
 'person' 'pleasure' 'process' 'rather' 'ridiculous' 'show' 'strong'
 'stupid' 'success' 'sunshine' 'tea' 'than' 'that' 'the' 'there'
 'thinking' 'though' 'to' 'truly' 'try' 'two' 'until' 'value' 've' 'water'
 'ways' 'we' 'what' 'who' 'without' 'woman' 'won' 'work' 'world' 'you'
 'your']
[[0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
  1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0
  0 0 1 0 2 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0]
 [0 0 1 0 0 1 0 0 0 0 0 0

In [50]:
pd.DataFrame(X.toarray(),
             columns = vectorizer.get_feature_names_out())

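|   | 000 | 10 | abilities | absolutely | and | are | as | bag | be | beauty | ... | we | what | who | without | woman | won | work | world | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | ... | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
| 7 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |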
