# Beautiful Soup Tutorial - Web Scraping with Python

In [12]:
import pandas as pd
import os
from bs4 import BeautifulSoup

## Reading HTML Files

In [65]:
with open('index.html', 'r') as file: 
    doc = BeautifulSoup(file, 'html.parser')

print(doc)

<html>
<head>
<title>Your Title Here</title>
</head>
<body bgcolor="FFFFFF">
<center><img align="BOTTOM" src="clouds.jpg"/> </center>
<hr/>
<a href="http://somegreatsite.com">Link Name</a>
  
  is a link to another nifty site
  
  <h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
  
  Send me mail at <a href="mailto:support@yourcompany.com">
  
  support@yourcompany.com</a>.
  
  <p> This is a new paragraph!
  
  <p> <b color="red">This is a new paragraph!</b>
<br/> <b><i>This is a new sentence without a paragraph break, in bold italics.</i></b>
<hr/>
</p></p></body>
</html>
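
`BeautifulSoup` also accepts a plain string, which is handy for trying things out without a file on disk. The sketch below assumes a shortened, hypothetical stand-in for the contents of `index.html`; it is not the full file shown above.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the contents of index.html
html = """
<html>
<head><title>Your Title Here</title></head>
<body>
<h1>This is a Header</h1>
<p>This is a new paragraph!</p>
</body>
</html>
"""

# Parse the string with the same built-in parser used for the file
doc = BeautifulSoup(html, "html.parser")

print(doc.title)      # the first <title> tag
print(doc.h1.string)  # the text inside the first <h1> tag
```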


In [19]:
print(doc.prettify())

<html>
 <head>
  <title>
   Your Title Here
  </title>
 </head>
 <body bgcolor="FFFFFF">
  <center>
   <img align="BOTTOM" src="clouds.jpg"/>
  </center>
  <hr/>
  <a href="http://somegreatsite.com">
   Link Name
  </a>
  is a link to another nifty site
  <h1>
   This is a Header
  </h1>
  <h2>
   This is a Medium Header
  </h2>
  Send me mail at
  <a href="mailto:support@yourcompany.com">
   support@yourcompany.com
  </a>
  .
  <p>
   This is a new paragraph!
  </p>
  <p>
   <b color="red">
    This is a new paragraph!
   </b>
   <br/>
   <b>
    <i>
     This is a new sentence without a paragraph break, in bold italics.
    </i>
   </b>
  </p>
  <hr/>
 </body>
</html>


## Find by the Tag Name

Using `.<tag_name>`, you can access the first tag with that name in the document. This is equivalent to `.find('tag_name')`.

In [20]:
tag = doc.title
print(tag)

<title>Your Title Here</title>


To access the content or the string inside the tag, we can use `.string`

In [27]:
print(tag.string)

Your Title Here
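
A few edge cases are worth knowing when using attribute access and `.string`. The following is a minimal sketch on a small made-up HTML string (not the tutorial's `index.html`): it checks that `.title` and `.find('title')` return the same tag, shows what happens when a tag is missing, and shows why `.string` can be `None`.

```python
from bs4 import BeautifulSoup

# Small made-up document for illustration
html = (
    "<html><head><title>Your Title Here</title></head>"
    "<body><p>Plain <b>bold</b></p></body></html>"
)
doc = BeautifulSoup(html, "html.parser")

# Attribute access and .find() return the very same Tag object
same_tag = doc.title is doc.find("title")

# A tag that does not exist gives None rather than raising an error
missing = doc.find("table")

# <p> has two children (a string and a <b> tag), so .string is None
mixed = doc.p.string

# .get_text() joins all the text inside the tag instead
text = doc.p.get_text()

print(same_tag, missing, mixed, text)
```

Because a missing tag comes back as `None`, it is usually worth checking the result before reading `.string` from it.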


In the same way we read a tag's string, we can also change it.

In [31]:
tag.string = 'hello'
print(doc)

<html>
<head>
<title>hello</title>
</head>
<body bgcolor="FFFFFF">
<center><img align="BOTTOM" src="clouds.jpg"/> </center>
<hr/>
<a href="http://somegreatsite.com">Link Name</a>
  
  is a link to another nifty site
  
  <h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
  
  Send me mail at <a href="mailto:support@yourcompany.com">
  
  support@yourcompany.com</a>.
  
  <p> This is a new paragraph!
  
  </p><p> <b color="red">This is a new paragraph!</b>
<br/> <b><i>This is a new sentence without a paragraph break, in bold italics.</i></b>
</p><hr/>
</body>
</html>
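
Attributes can be read and changed in a similar way, because a tag behaves like a dictionary of its attributes. This is a small sketch on a made-up fragment; the replacement URL and the `target` attribute are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Made-up fragment based on the link in index.html
html = '<body bgcolor="FFFFFF"><a href="http://somegreatsite.com">Link Name</a></body>'
doc = BeautifulSoup(html, "html.parser")

link = doc.a
original_href = link["href"]          # read an attribute like a dict key

link["href"] = "https://example.com"  # overwrite an existing attribute
link["target"] = "_blank"             # add a new attribute
del doc.body["bgcolor"]               # remove an attribute

print(original_href)
print(link.attrs)
print(doc.body.attrs)
```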


## Find All by Tag Name

`.find_all('<tag_name>')` returns a list of all the tags with that name in the document.

In [50]:
tags = doc.find_all('p')
print(tags)

[<p> This is a new paragraph!
  
  </p>, <p> <b color="red">This is a new paragraph!</b>
<br/> <b><i>This is a new sentence without a paragraph break, in bold italics.</i></b>
</p>]


Notice that the second `<p>` tag contains other tags. We can reach these nested tags by indexing into the list and calling `.find_all` again on that element.

In [60]:
print(tags[1])
print()
print(tags[1].find_all('b'))

<p> <b color="red">This is a new paragraph!</b>
<br/> <b><i>This is a new sentence without a paragraph break, in bold italics.</i></b>
</p>

[<b color="red">This is a new paragraph!</b>, <b><i>This is a new sentence without a paragraph break, in bold italics.</i></b>]
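
`.find_all` can also narrow a search by attribute, match several tag names at once, or stop after a fixed number of results. The sketch below uses a fragment copied from the second paragraph of the sample document; the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """<p> <b color="red">This is a new paragraph!</b>
<br/> <b><i>This is a new sentence without a paragraph break, in bold italics.</i></b></p>"""
doc = BeautifulSoup(html, "html.parser")

# Only <b> tags whose color attribute equals "red"
red_bold = doc.find_all("b", attrs={"color": "red"})

# A list of names matches any of them, in document order
bold_or_italic = doc.find_all(["b", "i"])

# limit stops the search after the given number of matches
first_bold = doc.find_all("b", limit=1)

print(len(red_bold))
print([tag.name for tag in bold_or_italic])
print(len(first_bold))
```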


## Parsing Website HTML

In [64]:
import requests

url = "https://www.newegg.ca/black-logitech-g-pro-x/p/N82E16826197345?Item=N82E16826197345"
result = requests.get(url)
print(result.text)

<!DOCTYPE html><html lang="en-ca" class="show-tab-store"><head><title>Logitech G Pro X Gaming Headset - Newegg.ca</title><meta charSet="utf-8"/><meta http-equiv="content-type" content="text/html; charset=UTF-8"/><meta name="referrer" content="always"/><meta name="keywords" content="Newegg, Newegg.ca, Logitech G Pro X Gaming Headset"/><meta name="description" content="Buy Logitech G Pro X Gaming Headset with fast shipping and top-rated customer service.Once you know, you Newegg!"/><meta property="og:image" content="https://c1.neweggimages.com/ProductImage/26-197-345-V09.jpg"/><meta property="og:description" content="Buy Logitech G Pro X Gaming Headset with fast shipping and top-rated customer service. Once you know, you Newegg!"/><meta property="og:title" content="Logitech G Pro X Gaming Headset - Newegg.com"/><meta property="og:url" content="https://www.newegg.ca/black-logitech-g-pro-x/p/N82E16826197345"/><meta property="og:type" content="website"/><meta name="language" content="englis

Printing `result.text` gives us one long raw string. If we pass that string to Beautiful Soup instead, we get a parsed document that we can search.

In [70]:
url = "https://www.newegg.ca/black-logitech-g-pro-x/p/N82E16826197345?Item=N82E16826197345"
result = requests.get(url)
doc = BeautifulSoup(result.text, 'html.parser')
print(doc.prettify())

<!DOCTYPE html>
<html class="show-tab-store" lang="en-ca">
 <head>
  <title>
   Logitech G Pro X Gaming Headset - Newegg.ca
  </title>
  <meta charset="utf-8"/>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="always" name="referrer"/>
  <meta content="Newegg, Newegg.ca, Logitech G Pro X Gaming Headset" name="keywords"/>
  <meta content="Buy Logitech G Pro X Gaming Headset with fast shipping and top-rated customer service.Once you know, you Newegg!" name="description"/>
  <meta content="https://c1.neweggimages.com/ProductImage/26-197-345-V09.jpg" property="og:image"/>
  <meta content="Buy Logitech G Pro X Gaming Headset with fast shipping and top-rated customer service. Once you know, you Newegg!" property="og:description"/>
  <meta content="Logitech G Pro X Gaming Headset - Newegg.com" property="og:title"/>
  <meta content="https://www.newegg.ca/black-logitech-g-pro-x/p/N82E16826197345" property="og:url"/>
  <meta content="website" property="og:type"/>
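
Real pages can time out or return an error status, so it can help to wrap the download in a small helper. The following is a sketch, not part of the original notebook: the function name `fetch_soup`, the 10-second timeout, and the `User-Agent` header are all assumptions chosen for illustration.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it as a parsed BeautifulSoup document."""
    # A browser-like User-Agent is an assumption; some sites reject the default one.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

Calling `fetch_soup(url)` with the product URL used in this section would then give a `doc` object equivalent to the one built by hand, provided the site answers the request.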

## Locating Text

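Say we want to know the price of the item sold on the page. We can locate it with `.find_all(text='<special_keyword>')`, which returns every string in the document that exactly equals the keyword. (Since Beautiful Soup 4.4.0 this argument is also available as `string`; `text` is the older name.)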

In [73]:
prices = doc.find_all(text = '$')
print(prices)

['$', '$']
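
An exact match only finds strings that are precisely `$`. To match strings that merely contain a dollar sign, a compiled regular expression can be passed instead. This sketch assumes Beautiful Soup 4.4.0 or later (for the `string` argument) and uses a made-up snippet; the `price-was` item is hypothetical and not taken from the real page.

```python
import re

from bs4 import BeautifulSoup

# Made-up snippet; the second <li> is a hypothetical example
html = (
    '<ul><li class="price-current">$<strong>119</strong><sup>.99</sup></li>'
    '<li class="price-was">Was: $149.99</li></ul>'
)
doc = BeautifulSoup(html, "html.parser")

# Exact match: only strings equal to "$"
exact = doc.find_all(string="$")

# Regex match: any string that contains a dollar sign
partial = doc.find_all(string=re.compile(r"\$"))

print(exact)
print(partial)
```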


That is not very helpful yet, because we want the whole price rather than just the dollar sign. To get there, we need to understand Beautiful Soup's tree structure. Then we can use the dollar sign found above as a starting point to reach the actual price.

## Beautiful Soup Tree Structure

Beautiful Soup represents a document as a tree, mirroring the nesting of the HTML itself. Every tag and string has a parent, which we can reach with `.parent`.

In [85]:
parent = prices[0].parent
print(parent)

<li class="price-current"><span class="price-current-label"></span>$<strong>119</strong><sup>.99</sup></li>


Notice that the parent of `$` is the `<li>` tag. From there, we can look up the `<strong>` tag to get the whole-dollar part of the price.

In [86]:
strong = parent.find('strong')
print(strong.string)

119
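
Besides `.parent`, the tree can be walked sideways and downwards. The sketch below rebuilds the `<li>` element printed above from a string and shows `.next_sibling`, `.previous_sibling`, and a non-recursive `find_all` that lists only direct child tags; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = (
    '<li class="price-current"><span class="price-current-label"></span>'
    "$<strong>119</strong><sup>.99</sup></li>"
)
doc = BeautifulSoup(html, "html.parser")

# Start from the "$" string, as in the search above
dollar = doc.find(string="$")

parent = dollar.parent               # the enclosing <li> tag
next_tag = dollar.next_sibling       # the node right after "$"
prev_tag = dollar.previous_sibling   # the node right before "$"

# Direct child tags of <li> only, without descending further
child_names = [tag.name for tag in parent.find_all(True, recursive=False)]

print(parent.name, next_tag.name, prev_tag.name)
print(child_names)
```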


We can also use `.stripped_strings` to get all the contents inside the `<li>` tag.

In [91]:
value = list(parent.stripped_strings)
print(value)
value = ''.join(value)
print(value)

['$', '119', '.99']
$119.99
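
The joined value is still a string. If the price is needed for arithmetic, it can be converted to a number. This is one possible sketch using `decimal.Decimal`; it assumes a `$` prefix and optional comma thousands separators, which may not hold for every page.

```python
from decimal import Decimal

from bs4 import BeautifulSoup

html = (
    '<li class="price-current"><span class="price-current-label"></span>'
    "$<strong>119</strong><sup>.99</sup></li>"
)
parent = BeautifulSoup(html, "html.parser").li

# Join the stripped text pieces into one string
raw = "".join(parent.stripped_strings)

# Drop the currency symbol and any thousands separators (assumed format)
price = Decimal(raw.replace("$", "").replace(",", ""))

print(raw)
print(price)
```

`Decimal` is used here rather than `float` so that the cents are kept exactly.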
