
## **Get the website using an HTTP library (Requests)**

In [None]:
import requests

html = requests.get("https://keithgalli.github.io/web-scraping/example.html")

print(html.content)

b'<html>\n<head>\n<title>HTML Example</title>\n</head>\n<body>\n\n<div align="middle">\n<h1>HTML Webpage</h1>\n<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>\n</div>\n\n<h2>A Header</h2>\n<p><i>Some italicized text</i></p>\n\n<h2>Another header</h2>\n<p id="paragraph-id"><b>Some bold text</b></p>\n\n</body>\n</html>\n'
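The `b'...'` prefix and the literal `\n` sequences in the output above appear because `html.content` is a `bytes` object. A minimal sketch of the decoding step follows; the shortened byte string is an assumption that stands in for `html.content`, so nothing is fetched. On a real response, `html.text` performs this decoding using `html.encoding`.

```python
# Shortened, hard-coded copy of the page bytes (assumption: stands in for html.content).
raw = b"<html>\n<head>\n<title>HTML Example</title>\n</head>\n</html>\n"

# Decode the bytes into a str so that newlines are rendered instead of escaped.
text = raw.decode("utf-8")

print(type(raw).__name__)   # the undecoded payload is bytes
print(type(text).__name__)  # the decoded payload is str
print(text)
```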


In [None]:
# Response Object: https://www.w3schools.com/python/ref_requests_response.asp

print('Response url:', html.url)

# HTTP response status codes: https://developer.mozilla.org/th/docs/Web/HTTP/Status
print('Response status:', html.status_code)

print('Response encoding:', html.encoding)

Response url: https://keithgalli.github.io/web-scraping/example.html
Response status: 200
Response encoding: utf-8
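A status of 200 means the request succeeded, but a scraper usually needs to stop early when it does not. The sketch below shows one way to do that with `raise_for_status()`. The helper name `fetch_html`, its `session` parameter, and the 10-second timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page body as bytes, raising on HTTP errors (hypothetical helper)."""
    # Use the supplied session if there is one, otherwise the module-level API.
    client = requests if session is None else session
    response = client.get(url, timeout=timeout)
    # Raises requests.HTTPError when the status code is 4xx or 5xx.
    response.raise_for_status()
    return response.content
```

Calling `fetch_html("https://keithgalli.github.io/web-scraping/example.html")` would return the same bytes that `html.content` holds above, and an error page would raise an exception instead of being parsed silently.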


In [None]:
print('Response headers:', html.headers)

Response headers: Date: Fri, 23 Oct 2020 05:19:57 GMT
Server: Apache
Last-Modified: Sat, 09 Jun 2018 19:15:58 GMT
ETag: "4121bc8-234-56e3a58b39172"
Accept-Ranges: bytes
Content-Length: 564
Cache-Control: max-age=1209600
Expires: Fri, 06 Nov 2020 05:19:57 GMT
Connection: close
Content-Type: text/html




### **Beginning to scrape with BeautifulSoup**

In [None]:
# Install beautifulsoup4
!pip install beautifulsoup4



In [None]:
from bs4 import BeautifulSoup

In [None]:
# Fetch the example page again and parse the raw bytes
# with Python's built-in HTML parser
html = requests.get("https://keithgalli.github.io/web-scraping/example.html")
bs = BeautifulSoup(html.content, "html.parser")

In [None]:
print(bs.prettify())

<html>
 <head>
  <title>
   HTML Example
  </title>
 </head>
 <body>
  <div align="middle">
   <h1>
    HTML Webpage
   </h1>
   <p>
    Link to more interesting example:
    <a href="https://keithgalli.github.io/web-scraping/webpage.html">
     keithgalli.github.io/web-scraping/webpage.html
    </a>
   </p>
  </div>
  <h2>
   A Header
  </h2>
  <p>
   <i>
    Some italicized text
   </i>
  </p>
  <h2>
   Another header
  </h2>
  <p id="paragraph-id">
   <b>
    Some bold text
   </b>
  </p>
 </body>
</html>
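Once the document is parsed, the first tag of a given name can be reached as an attribute of the soup object. The sketch below uses an inline copy of part of the example page (an assumption, so it runs without a network request) to read the title, the first heading, and the link target.

```python
from bs4 import BeautifulSoup

# Inline copy of part of the example page, so no request is needed.
sample_html = """
<html>
<head><title>HTML Example</title></head>
<body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
</body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.title.string   # text inside the first <title> tag
heading = soup.h1.get_text()     # text inside the first <h1> tag
link_target = soup.a["href"]     # value of the href attribute on the first <a> tag

print(page_title)
print(heading)
print(link_target)
```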



### **find() and find_all() with BeautifulSoup**

In [None]:
result_find = bs.find('h2')
result_findall = bs.find_all('h2')

print('find: ', result_find)
print('find_all: ', result_findall)

find:  <h2>A Header</h2>
find_all:  [<h2>A Header</h2>, <h2>Another header</h2>]
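The two methods also differ when nothing matches: `find()` returns `None`, while `find_all()` returns an empty list. A short sketch on an inline sample (an assumption, not the live page) shows both cases along with the `limit` argument:

```python
from bs4 import BeautifulSoup

# Small inline sample with two <h2> tags and no <h3> tags.
sample_html = "<h2>A Header</h2><p>text</p><h2>Another header</h2>"
soup = BeautifulSoup(sample_html, "html.parser")

missing_one = soup.find("h3")               # no match: None
missing_all = soup.find_all("h3")           # no match: empty list
first_only = soup.find_all("h2", limit=1)   # stop searching after one match

print(missing_one)
print(missing_all)
print(first_only)
```

Because `find()` can return `None`, calling `.get_text()` directly on its result fails for a missing tag, so it is worth checking for `None` first.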


In [None]:
result_header_tags = bs.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
print('Header: ', result_header_tags)

result_p_tags = bs.find_all('p')
print('p: ', result_p_tags)

Header:  [<h1>HTML Webpage</h1>, <h2>A Header</h2>, <h2>Another header</h2>]
p:  [<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>, <p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]


In [None]:
for tag in result_p_tags:
    print(tag.get_text())

Link to more interesting example: keithgalli.github.io/web-scraping/webpage.html
Some italicized text
Some bold text
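When the markup is indented or spread over several lines, `get_text()` keeps that whitespace. The sketch below, on an inline sample laid out like the prettified output (an assumption), shows how the separator and `strip` arguments can tidy the result.

```python
from bs4 import BeautifulSoup

# Inline sample with the kind of indentation that prettify() produces.
sample_html = """<p>
  Link to more interesting example:
  <a href="https://keithgalli.github.io/web-scraping/webpage.html">
    keithgalli.github.io/web-scraping/webpage.html
  </a>
</p>"""

paragraph = BeautifulSoup(sample_html, "html.parser").find("p")

raw_text = paragraph.get_text()                   # keeps newlines and indentation
clean_text = paragraph.get_text(" ", strip=True)  # strips each piece, joins with a space

print(repr(raw_text))
print(clean_text)
```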


In [None]:
result_p_by_id = bs.find('p', attrs={'id': 'paragraph-id'})
print(result_p_by_id.get_text())

Some bold text
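The same element can be located in a few equivalent ways. As a sketch on an inline sample (an assumption), the `attrs` dictionary, the `id` keyword argument, and a CSS selector via `select_one()` all find the same paragraph:

```python
from bs4 import BeautifulSoup

# Inline copy of the two paragraphs from the example page.
sample_html = (
    '<p><i>Some italicized text</i></p>'
    '<p id="paragraph-id"><b>Some bold text</b></p>'
)
soup = BeautifulSoup(sample_html, "html.parser")

by_attrs = soup.find("p", attrs={"id": "paragraph-id"})  # dictionary form
by_keyword = soup.find("p", id="paragraph-id")           # keyword-argument form
by_css = soup.select_one("p#paragraph-id")               # CSS selector form

print(by_attrs.get_text())
print(by_keyword.get_text())
print(by_css.get_text())
```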


### **Keyword Arguments**

In [None]:
html = requests.get("http://www.pythonscraping.com/pages/warandpeace.html")
bs = BeautifulSoup(html.content, "html.parser")

In [None]:
# Match <span> tags whose class is either "green" or "red"
result_by_class = bs.find_all('span', {'class': {'green', 'red'}})
print(result_by_class)

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>, <span class="green">Anna
Pavlovna Scherer</span>, <span class="green">Empress Marya
Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>, <span class="red">If you have nothing better to do, Count [or Prince], and if the
prospect of spending an evening with a poor invalid is not too
terrible, I shall be very charmed to see you tonight between 7 and 10-
Annette Scherer.</span>, <span class="red">Heavens! w
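A class filter can also be passed as the `class_` keyword argument, spelled with a trailing underscore because `class` is a reserved word in Python. The sketch below uses a tiny hypothetical sample that mimics the page's red and green spans; the span contents are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure of the page above.
sample_html = (
    '<span class="red">A quoted line.</span> '
    '<span class="green">Anna Pavlovna</span> '
    '<span class="green">the prince</span> '
    '<span class="blue">ignored</span>'
)
soup = BeautifulSoup(sample_html, "html.parser")

green_only = soup.find_all("span", class_="green")             # one class
green_or_red = soup.find_all("span", class_=["green", "red"])  # any class in the list
dict_form = soup.find_all("span", {"class": "green"})          # dictionary equivalent

print([tag.get_text() for tag in green_only])
print(len(green_or_red))
```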

In [None]:
# Match only the "green" spans, then print the tags and their text
result_by_class = bs.find_all('span', {'class': 'green'})
print(result_by_class)
print('------------------')
print([tag.get_text() for tag in result_by_class])

[<span class="green">Anna
Pavlovna Scherer</span>, <span class="green">Empress Marya
Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>, <span class="green">the prince</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">Prince Vasili</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">Wintzingerode</span>, <span class="green">King of Prussia</span>, <span class="green">le Vicomte de Mortemart</span>, <span class="green">Montmorencys</span>, <span class="green">Rohans</span>, <span class="green">Abbe Morio</span>, <span class="green">the Emperor</span>, <span class="green">the prince</span>, <span class="green">Princ