# BeautifulSoup 

Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

Table of Contents:
* Importing Libraries
* Load our first page
* find and find_all
* select (CSS Selector)
* Get different properties of HTML
* Code navigation
* Exercise
  * Load webpage
  * Grab all social links from the webpage
    * Using find/find_all
    * Using select
  * Scrape HTML into pandas dataframe
    * Using pandas
    * Using beautifulsoup
  * Grab all fun facts that use word 'is'
  * Download an image
  * Solve the mystery challenge

**Importing libraries**

In [1]:
import requests
from bs4 import BeautifulSoup as bs

**Load our first page**

In [2]:
#Load the webpage content 
r = requests.get("https://keithgalli.github.io/web-scraping/example.html")


#Convert to a beautiful soup object
soup = bs(r.content)

print(soup)


<html>
<head>
<title>HTML Example</title>
</head>
<body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body>
</html>
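When no parser is named, Beautiful Soup picks whichever one is installed, so the parse tree can differ between machines. A minimal sketch, assuming Python's built-in `html.parser` and an inline HTML string that stands in for `r.content` (it is not the real page):

```python
from bs4 import BeautifulSoup as bs

# Inline stand-in for r.content so the sketch runs without a network call.
html = "<html><head><title>HTML Example</title></head><body><h1>HTML Webpage</h1></body></html>"

# Naming the parser explicitly keeps results consistent across environments.
soup = bs(html, "html.parser")

title_text = soup.title.string
print(title_text)
```

The same second argument can be passed when parsing a downloaded page.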



In [3]:
print(soup.prettify())

<html>
 <head>
  <title>
   HTML Example
  </title>
 </head>
 <body>
  <div align="middle">
   <h1>
    HTML Webpage
   </h1>
   <p>
    Link to more interesting example:
    <a href="https://keithgalli.github.io/web-scraping/webpage.html">
     keithgalli.github.io/web-scraping/webpage.html
    </a>
   </p>
  </div>
  <h2>
   A Header
  </h2>
  <p>
   <i>
    Some italicized text
   </i>
  </p>
  <h2>
   Another header
  </h2>
  <p id="paragraph-id">
   <b>
    Some bold text
   </b>
  </p>
 </body>
</html>



**find and find_all**

In [4]:
#we can scrape the header using .find()
first_header = soup.find("h2")
first_header

#find returns only the first matching h2


<h2>A Header</h2>

In [5]:
header = soup.find_all("h2")
header

#find_all returns every h2 header in a list

[<h2>A Header</h2>, <h2>Another header</h2>]

In [6]:
# pass in a list of elements to look for
first_header = soup.find(["h1","h2"])
first_header 

#irrespective of the order of the list, find returns h1 because it appears first in the document

<h1>HTML Webpage</h1>

In [7]:
header = soup.find_all(["h1","h2"])
header 

[<h1>HTML Webpage</h1>, <h2>A Header</h2>, <h2>Another header</h2>]

In [8]:
paragraph = soup.find_all("p")
paragraph

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>,
 <p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [9]:
#passing attributes to the find/find_all function
paragraph = soup.find_all("p", attrs={"id":"paragraph-id"})
paragraph

[<p id="paragraph-id"><b>Some bold text</b></p>]
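Two details of `find`/`find_all` are easy to miss: attribute filters can also be written as keyword arguments, and the two functions behave differently when nothing matches. A small sketch on an inline HTML snippet (the snippet is an assumption made for illustration, loosely modelled on the example page):

```python
from bs4 import BeautifulSoup

html = '<body><h2>A Header</h2><p>Plain</p><p id="paragraph-id"><b>Some bold text</b></p></body>'
soup = BeautifulSoup(html, "html.parser")

# These two calls are equivalent ways to filter on the id attribute.
by_attrs = soup.find_all("p", attrs={"id": "paragraph-id"})
by_keyword = soup.find_all("p", id="paragraph-id")

# find() returns None when nothing matches; find_all() returns an empty list.
missing_one = soup.find("h3")
missing_all = soup.find_all("h3")
```

Because `find` can return `None`, it is worth checking the result before chaining another call onto it.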

In [10]:
#nesting find/find_all calls
body = soup.find('body')
print(body)
print("-"*200)
div = body.find('div')
print(div)
print("-"*200)
header = div.find('h1')
print(header)

#chaining calls like this helps narrow the search when the webpage has a lot of data.

<body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
<h1>HTML Webpage</h1>


In [11]:
#if we want to search for some specific string
paragraph = soup.find_all("p", string = "Some")
paragraph

#the output is empty because string="Some" only matches a tag whose entire text is exactly "Some".
#to match a word anywhere inside the text instead of the complete sentence,
#we can pass a compiled regular expression from the re module

[]
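The empty result follows from how a plain `string` argument is compared: it has to equal the tag's whole `.string` value. A minimal sketch on an inline snippet (an assumption for illustration, copied from the two paragraphs of the example page):

```python
from bs4 import BeautifulSoup

html = '<p><i>Some italicized text</i></p><p id="paragraph-id"><b>Some bold text</b></p>'
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the tag's whole .string value.
partial = soup.find_all("p", string="Some")
exact = soup.find_all("p", string="Some bold text")

# .string looks through a single nested child to reach the text.
strings = [p.string for p in soup.find_all("p")]
```

Here `partial` is empty while `exact` holds the second paragraph.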

In [12]:
import re

In [13]:
paragraph = soup.find_all("p", string =re.compile("Some"))
paragraph

#a single word is now enough to find every <p> whose text contains it.

[<p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [14]:
#finding all the h2 headers containing the word "header"
headers = soup.find_all("h2", string = re.compile("header"))
headers


[<h2>Another header</h2>]

In [15]:
#to match the header whether the initial h is uppercase or lowercase we can do
headers = soup.find_all("h2", string = re.compile("(h|H)eader"))
headers

[<h2>A Header</h2>, <h2>Another header</h2>]
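An alternative to spelling out both letter cases in the pattern is the `re.IGNORECASE` flag. A short sketch, assuming an inline snippet with one extra heading added for contrast:

```python
import re
from bs4 import BeautifulSoup

html = "<h2>A Header</h2><h2>Another header</h2><h2>Footer</h2>"
soup = BeautifulSoup(html, "html.parser")

# The flag makes the whole pattern case-insensitive, not just the first letter.
pattern = re.compile("header", re.IGNORECASE)
headers = soup.find_all("h2", string=pattern)
header_texts = [h.string for h in headers]
```

Note that this also matches spellings such as "HEADER", which the `(h|H)eader` pattern would not.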

**select(CSS selector)**

In [16]:
print(soup.body.prettify())

<body>
 <div align="middle">
  <h1>
   HTML Webpage
  </h1>
  <p>
   Link to more interesting example:
   <a href="https://keithgalli.github.io/web-scraping/webpage.html">
    keithgalli.github.io/web-scraping/webpage.html
   </a>
  </p>
 </div>
 <h2>
  A Header
 </h2>
 <p>
  <i>
   Some italicized text
  </i>
 </p>
 <h2>
  Another header
 </h2>
 <p id="paragraph-id">
  <b>
   Some bold text
  </b>
 </p>
</body>



In [17]:
content = soup.select("p")
content

#it is similar to find_all used earlier.

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>,
 <p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [18]:
content = soup.select("div p")
content

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>]

In [19]:
paragraph = soup.select("h2 ~ p")
paragraph

[<p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [20]:
bold_text = soup.select("p#paragraph-id b")
bold_text

[<b>Some bold text</b>]

In [21]:
paragraphs = soup.select("body > p")
print(paragraphs)
print("-"*50)
for paragraph in paragraphs:
  print(paragraph.select("i"))


[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
--------------------------------------------------
[<i>Some italicized text</i>]
[]


In [22]:
#grab elements that have a specific attribute value
soup.select("[align=middle]")


[<div align="middle">
 <h1>HTML Webpage</h1>
 <p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
 </div>]
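The selectors used in this section differ mainly in their combinators. The sketch below puts them side by side on an inline snippet (an assumption for illustration, shaped like the example page) and adds `select_one`, which returns a single element or `None` instead of a list:

```python
from bs4 import BeautifulSoup

html = (
    '<body><div align="middle"><h1>HTML Webpage</h1><p>In div</p></div>'
    '<h2>A Header</h2><p><i>Some italicized text</i></p>'
    '<h2>Another header</h2><p id="paragraph-id"><b>Some bold text</b></p></body>'
)
soup = BeautifulSoup(html, "html.parser")

descendants = soup.select("div p")       # any <p> nested anywhere inside a <div>
children = soup.select("body > p")       # only <p> that are direct children of <body>
later_siblings = soup.select("h2 ~ p")   # <p> that follow an <h2> at the same level
first_bold = soup.select_one("p#paragraph-id b")  # single element or None
no_table = soup.select_one("table")
```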

**Get different properties of the HTML**

In [23]:
#header without tags
header = soup.find("h2")
header.string

'A Header'

In [24]:
# for an element with multiple children, .string returns None, so use get_text()
div = soup.find("div")

print("Before using get_text()")
print(div.prettify())

print("-"*50)
print("After using get_text()")
print(div.get_text())

Before using get_text()
<div align="middle">
 <h1>
  HTML Webpage
 </h1>
 <p>
  Link to more interesting example:
  <a href="https://keithgalli.github.io/web-scraping/webpage.html">
   keithgalli.github.io/web-scraping/webpage.html
  </a>
 </p>
</div>

--------------------------------------------------
After using get_text()

HTML Webpage
Link to more interesting example: keithgalli.github.io/web-scraping/webpage.html
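The blank lines in that output come from the whitespace between tags. `get_text` accepts `separator` and `strip` arguments to tidy this up; a minimal sketch on an inline snippet (an assumption for illustration):

```python
from bs4 import BeautifulSoup

html = '<div><h1>HTML Webpage</h1>\n<p>Link to example: <a href="#">webpage.html</a></p>\n</div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

# strip=True trims each text piece and drops empty ones;
# separator is placed between the remaining pieces.
clean_text = div.get_text(separator=" ", strip=True)

# stripped_strings yields the same trimmed pieces one at a time.
pieces = list(div.stripped_strings)
```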



In [25]:
#get a specific attribute from an element
link = soup.find("a")
link['href'] #to get just the link

'https://keithgalli.github.io/web-scraping/webpage.html'

In [26]:
paragraphs = soup.select("p#paragraph-id")
paragraphs[0]['id']


'paragraph-id'
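Indexing a tag with `['name']` raises a `KeyError` when the attribute is absent, so `.get()` is often safer on real pages. A short sketch; the `example.com` URL is a placeholder, not part of the tutorial pages:

```python
from bs4 import BeautifulSoup

html = '<p id="paragraph-id"><a href="https://example.com/page.html">link</a></p>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

href = link["href"]                        # raises KeyError if href were missing
title = link.get("title")                  # None, because there is no title attribute
fallback = link.get("title", "untitled")   # default value when the attribute is absent
all_attrs = link.attrs                     # dictionary of every attribute on the tag
```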

**Code Navigation**

In [27]:
#path syntax
soup.body.div.h1.string

'HTML Webpage'

In [28]:
#parent - sibling - child
print(soup.prettify())

<html>
 <head>
  <title>
   HTML Example
  </title>
 </head>
 <body>
  <div align="middle">
   <h1>
    HTML Webpage
   </h1>
   <p>
    Link to more interesting example:
    <a href="https://keithgalli.github.io/web-scraping/webpage.html">
     keithgalli.github.io/web-scraping/webpage.html
    </a>
   </p>
  </div>
  <h2>
   A Header
  </h2>
  <p>
   <i>
    Some italicized text
   </i>
  </p>
  <h2>
   Another header
  </h2>
  <p id="paragraph-id">
   <b>
    Some bold text
   </b>
  </p>
 </body>
</html>



Here, "div" is nested within "body", so "body" is the parent of "div" and "div" is a child of "body". 
Elements on the same level as "div", such as "h2", are its siblings. 

In [29]:
soup.body.find("div")

<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>

In [30]:
soup.body.find("div").find_next_siblings()

[<h2>A Header</h2>,
 <p><i>Some italicized text</i></p>,
 <h2>Another header</h2>,
 <p id="paragraph-id"><b>Some bold text</b></p>]
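The parent, child and sibling relationships can each be reached from a tag directly. A minimal sketch on a compact inline snippet (an assumption for illustration; it has no whitespace between tags, which would otherwise show up as extra text nodes in `.children`):

```python
from bs4 import BeautifulSoup

html = (
    '<body><div align="middle"><h1>HTML Webpage</h1></div>'
    '<h2>A Header</h2><p>Text</p></body>'
)
soup = BeautifulSoup(html, "html.parser")
div = soup.body.div

parent_name = div.parent.name                                   # tag that contains the div
child_names = [child.name for child in div.children]            # direct children of the div
sibling_names = [sib.name for sib in div.find_next_siblings()]  # later tags on the same level
previous = div.find_previous_sibling()                          # None: the div comes first
```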

# Exercise:

Use [this](https://keithgalli.github.io/web-scraping/webpage.html) website for the exercise.

**Load webpage**

In [31]:
r = requests.get("https://keithgalli.github.io/web-scraping/webpage.html")

webpage = bs(r.content)

print(webpage.prettify()) 

<html>
 <head>
  <title>
   Keith Galli's Page
  </title>
  <style>
   table {
    border-collapse: collapse;
  }
  th {
    padding:5px;
  }
  td {
    border: 1px solid #ddd;
    padding: 5px;
  }
  tr:nth-child(even) {
    background-color: #f2f2f2;
  }
  th {
    padding-top: 12px;
    padding-bottom: 12px;
    text-align: left;
    background-color: #add8e6;
    color: black;
  }
  .block {
  width: 100px;
  /*float: left;*/
    display: inline-block;
    zoom: 1;
  }
  .column {
  float: left;
  height: 200px;
  /*width: 33.33%;*/
  padding: 5px;
  }

  .row::after {
    content: "";
    clear: both;
    display: table;
  }
  </style>
 </head>
 <body>
  <h1>
   Welcome to my page!
  </h1>
  <img src="./images/selfie1.jpg" width="300px"/>
  <h2>
   About me
  </h2>
  <p>
   Hi, my name is Keith and I am a YouTuber who focuses on content related to programming, data science, and machine learning!
  </p>
  <p>
   Here is a link to my channel:
   <a href="https://www.youtube.com/kgmi

**Grab all of the social links from the webpage**






**Using find/find_all**

In [32]:
all_links_with_tags = webpage.find_all("a")
all_links_with_tags

[<a href="https://www.youtube.com/kgmit">youtube.com/kgmit</a>,
 <a href="#footer"><sup>1</sup></a>,
 <a href="https://www.instagram.com/keithgalli/">https://www.instagram.com/keithgalli/</a>,
 <a href="https://twitter.com/keithgalli">https://twitter.com/keithgalli</a>,
 <a href="https://www.linkedin.com/in/keithgalli/">https://www.linkedin.com/in/keithgalli/</a>,
 <a href="https://www.tiktok.com/@keithgalli">https://www.tiktok.com/@keithgalli</a>,
 <a href="https://www.eliteprospects.com/team/10263/mit-mass.-inst.-of-tech./2014-2015?tab=stats"> MIT (Mass. Inst. of Tech.) </a>,
 <a href="https://www.eliteprospects.com/league/acha-ii/stats/2014-2015"> ACHA II </a>,
 <a href="https://www.eliteprospects.com/league/acha-ii/stats/2014-2015"> </a>,
 <a href="https://www.eliteprospects.com/team/10263/mit-mass.-inst.-of-tech./2015-2016?tab=stats"> MIT (Mass. Inst. of Tech.) </a>,
 <a href="https://www.eliteprospects.com/league/acha-ii/stats/2015-2016"> ACHA II </a>,
 <a href="https://www.elite

In [33]:
only_links_no_tags = []

for link in all_links_with_tags:
  only_links_no_tags.append(link.get("href"))

only_links_no_tags

['https://www.youtube.com/kgmit',
 '#footer',
 'https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli',
 'https://www.eliteprospects.com/team/10263/mit-mass.-inst.-of-tech./2014-2015?tab=stats',
 'https://www.eliteprospects.com/league/acha-ii/stats/2014-2015',
 'https://www.eliteprospects.com/league/acha-ii/stats/2014-2015',
 'https://www.eliteprospects.com/team/10263/mit-mass.-inst.-of-tech./2015-2016?tab=stats',
 'https://www.eliteprospects.com/league/acha-ii/stats/2015-2016',
 'https://www.eliteprospects.com/league/acha-ii/stats/2015-2016',
 'https://www.eliteprospects.com/team/10263/mit-mass.-inst.-of-tech./2016-2017?tab=stats',
 'https://www.eliteprospects.com/league/acha-ii/stats/2016-2017',
 'https://www.eliteprospects.com/stats',
 'https://www.eliteprospects.com/stats',
 'https://www.eliteprospects.com/team/10263/mit-mass.-inst.-of-tech./2018-2019?tab=stats',
 'https://www.elit

In [34]:
social_media_links = only_links_no_tags[2:6]
social_media_links

['https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli']

**Using Select**

In [35]:
links_using_select = webpage.select('a[href^="https"]')
links_using_select

[<a href="https://www.youtube.com/kgmit">youtube.com/kgmit</a>,
 <a href="https://www.instagram.com/keithgalli/">https://www.instagram.com/keithgalli/</a>,
 <a href="https://twitter.com/keithgalli">https://twitter.com/keithgalli</a>,
 <a href="https://www.linkedin.com/in/keithgalli/">https://www.linkedin.com/in/keithgalli/</a>,
 <a href="https://www.tiktok.com/@keithgalli">https://www.tiktok.com/@keithgalli</a>,
 <a href="https://www.eliteprospects.com/team/10263/mit-mass.-inst.-of-tech./2014-2015?tab=stats"> MIT (Mass. Inst. of Tech.) </a>,
 <a href="https://www.eliteprospects.com/league/acha-ii/stats/2014-2015"> ACHA II </a>,
 <a href="https://www.eliteprospects.com/league/acha-ii/stats/2014-2015"> </a>,
 <a href="https://www.eliteprospects.com/team/10263/mit-mass.-inst.-of-tech./2015-2016?tab=stats"> MIT (Mass. Inst. of Tech.) </a>,
 <a href="https://www.eliteprospects.com/league/acha-ii/stats/2015-2016"> ACHA II </a>,
 <a href="https://www.eliteprospects.com/league/acha-ii/stats/20

In [36]:
slinks_without_tags = []

for link in links_using_select:
  slinks_without_tags.append(link.get('href'))

slinks_without_tags

['https://www.youtube.com/kgmit',
 'https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli',
 'https://www.eliteprospects.com/team/10263/mit-mass.-inst.-of-tech./2014-2015?tab=stats',
 'https://www.eliteprospects.com/league/acha-ii/stats/2014-2015',
 'https://www.eliteprospects.com/league/acha-ii/stats/2014-2015',
 'https://www.eliteprospects.com/team/10263/mit-mass.-inst.-of-tech./2015-2016?tab=stats',
 'https://www.eliteprospects.com/league/acha-ii/stats/2015-2016',
 'https://www.eliteprospects.com/league/acha-ii/stats/2015-2016',
 'https://www.eliteprospects.com/team/10263/mit-mass.-inst.-of-tech./2016-2017?tab=stats',
 'https://www.eliteprospects.com/league/acha-ii/stats/2016-2017',
 'https://www.eliteprospects.com/stats',
 'https://www.eliteprospects.com/stats',
 'https://www.eliteprospects.com/team/10263/mit-mass.-inst.-of-tech./2018-2019?tab=stats',
 'https://www.eliteprospects.c

In [37]:
social_media_links_using_select = slinks_without_tags[1:5]
social_media_links_using_select
# slicing by position is fragile: it breaks if links are added or reordered, and it does not scale to pages with many links

['https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli']
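One way to avoid slicing by position is to filter the links by host name. This is a sketch under assumptions: the `SOCIAL_DOMAINS` allow-list and the `is_social` helper are hypothetical names introduced here, and the input is a shortened copy of the link list printed above.

```python
from urllib.parse import urlparse

# Hypothetical allow-list; adjust it to the sites you count as social media.
SOCIAL_DOMAINS = {"instagram.com", "twitter.com", "linkedin.com", "tiktok.com"}

def is_social(href):
    """Return True when the link's host is in the allow-list."""
    host = urlparse(href).hostname or ""
    # Drop a leading "www." so both spellings of a host match.
    if host.startswith("www."):
        host = host[4:]
    return host in SOCIAL_DOMAINS

links = [
    "https://www.youtube.com/kgmit",
    "#footer",
    "https://www.instagram.com/keithgalli/",
    "https://twitter.com/keithgalli",
    "https://www.linkedin.com/in/keithgalli/",
    "https://www.tiktok.com/@keithgalli",
    "https://www.eliteprospects.com/stats",
]
social_links_by_domain = [href for href in links if is_social(href)]
```

The result does not depend on where the links sit in the page, only on which sites are listed.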

**Scrape an HTML table into a Pandas Dataframe**

**Using Pandas**

In [38]:
import pandas as pd

In [39]:
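from io import StringIO

#wrap the HTML in StringIO: recent pandas versions deprecate passing a raw HTML string
table_pd = pd.read_html(StringIO(r.text))
table_pd = table_pd[0]
table_pd.head()

#this is the easy method, but it uses pandas rather than beautifulsoup to scrape the table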

Unnamed: 0,S,Team,League,GP,G,A,TP,PIM,+/-,Unnamed: 9,POST,GP.1,G.1,A.1,TP.1,PIM.1,+/-.1
0,2014-15,MIT (Mass. Inst. of Tech.),ACHA II,17.0,3.0,9.0,12.0,20.0,,|,,,,,,,
1,2015-16,MIT (Mass. Inst. of Tech.),ACHA II,9.0,1.0,1.0,2.0,2.0,,|,,,,,,,
2,2016-17,MIT (Mass. Inst. of Tech.),ACHA II,12.0,5.0,5.0,10.0,8.0,0.0,|,,,,,,,
3,2017-18,Did not play,,,,,,,,|,,,,,,,
4,2018-19,MIT (Mass. Inst. of Tech.),ACHA III,8.0,5.0,10.0,15.0,8.0,,|,,,,,,,


**Using Beautifulsoup**

In [40]:
table = webpage.select("table.hockey-stats")[0]
columns = table.find("thead").find_all("th")
column_names = [c.string for c in columns]

table_rows = table.find("tbody").find_all("tr")
rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    #strip surrounding whitespace from the text of each cell
    row = [cell.get_text().strip() for cell in cells]
    rows.append(row)

df = pd.DataFrame(rows, columns=column_names)
df.head()

Unnamed: 0,S,Team,League,GP,G,A,TP,PIM,+/-,Unnamed: 10,POST,GP.1,G.1,A.1,TP.1,PIM.1,+/-.1
0,2014-15,MIT (Mass. Inst. of Tech.),ACHA II,17.0,3.0,9.0,12.0,20.0,,|,,,,,,,
1,2015-16,MIT (Mass. Inst. of Tech.),ACHA II,9.0,1.0,1.0,2.0,2.0,,|,,,,,,,
2,2016-17,MIT (Mass. Inst. of Tech.),ACHA II,12.0,5.0,5.0,10.0,8.0,0.0,|,,,,,,,
3,2017-18,Did not play,,,,,,,,|,,,,,,,
4,2018-19,MIT (Mass. Inst. of Tech.),ACHA III,8.0,5.0,10.0,15.0,8.0,,|,,,,,,,


**Grab all fun facts that use word "is"**

In [41]:
is_list = webpage.find("ul").find_all("li")
is_list

[<li>Owned my dream car in high school <a href="#footer"><sup>1</sup></a></li>,
 <li>Middle name is Ronald</li>,
 <li>Never had been on a plane until college</li>,
 <li>Dunkin Donuts coffee is better than Starbucks</li>,
 <li>A favorite book series of mine is <i>Ender's Game</i></li>,
 <li>Current video game of choice is <i>Rocket League</i></li>,
 <li>The band that I've seen the most times live is the <i>Zac Brown Band</i></li>]

In [42]:
fact_list = []
for fact in is_list:
  fact = fact.get_text()
  fact_list.append(fact)

fact_list

['Owned my dream car in high school 1',
 'Middle name is Ronald',
 'Never had been on a plane until college',
 'Dunkin Donuts coffee is better than Starbucks',
 "A favorite book series of mine is Ender's Game",
 'Current video game of choice is Rocket League',
 "The band that I've seen the most times live is the Zac Brown Band"]

In [43]:
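#match "is" as a whole word (re was imported earlier) so that words like "this" do not count
fact_list_with_is = [fact for fact in fact_list if re.search(r"\bis\b", fact)]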
fact_list_with_is

['Middle name is Ronald',
 'Dunkin Donuts coffee is better than Starbucks',
 "A favorite book series of mine is Ender's Game",
 'Current video game of choice is Rocket League',
 "The band that I've seen the most times live is the Zac Brown Band"]

**Download an Image**

In [44]:
# Load the webpage content
url = "https://keithgalli.github.io/web-scraping/"
r = requests.get(url+"webpage.html")

# Convert to a beautiful soup object
webpage = bs(r.content)

images = webpage.select("div.row div.column img")
image_url = images[0]['src']
full_url = url + image_url

img_data = requests.get(full_url).content
with open('lake_como.jpg', 'wb') as handler:
    handler.write(img_data)
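Concatenating `url + image_url` works when `src` is a plain relative path. For other forms, such as `./images/selfie1.jpg` from the page above or a full URL, `urllib.parse.urljoin` from the standard library is a more general option. A minimal sketch; the `example.com` address is a placeholder:

```python
from urllib.parse import urljoin

page_url = "https://keithgalli.github.io/web-scraping/webpage.html"

# A relative src is resolved against the directory of the page.
relative = urljoin(page_url, "./images/selfie1.jpg")

# An absolute src is returned unchanged.
absolute = urljoin(page_url, "https://example.com/pic.jpg")
```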

**Solve the mystery challenge!**

In [45]:
file_data = webpage.select('a[href*="challenge"]')
file_data

[<a href="challenge/file_1.html">File 1</a>,
 <a href="challenge/file_2.html">File 2</a>,
 <a href="challenge/file_3.html">File 3</a>,
 <a href="challenge/file_4.html">File 4</a>,
 <a href="challenge/file_5.html">File 5</a>,
 <a href="challenge/file_6.html">File 6</a>,
 <a href="challenge/file_7.html">File 7</a>,
 <a href="challenge/file_8.html">File 8</a>,
 <a href="challenge/file_9.html">File 9</a>,
 <a href="challenge/file_10.html">File 10</a>]

In [46]:
mystery_links_without_tags = []

for link in file_data:
  mystery_links_without_tags.append(link.get('href'))

mystery_links_without_tags

['challenge/file_1.html',
 'challenge/file_2.html',
 'challenge/file_3.html',
 'challenge/file_4.html',
 'challenge/file_5.html',
 'challenge/file_6.html',
 'challenge/file_7.html',
 'challenge/file_8.html',
 'challenge/file_9.html',
 'challenge/file_10.html']

In [47]:
for link in mystery_links_without_tags:
  file_link = url + link
  # print(file_link)
  file = requests.get(file_link)
  file_content = bs(file.content)
  # print(file_content)
  message = file_content.find("p", attrs={"id":"secret-word"})
  message_word = message.text
  print(message_word)

Make
sure
to
smash
that
like
button
and
subscribe
!!!
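The loop above assumes every file contains a `p` with `id="secret-word"`; if one did not, `message.text` would fail on `None`. The sketch below guards against that and joins the words into one sentence. The `extract_secret_word` helper and the inline pages are hypothetical stand-ins for the downloaded files.

```python
from bs4 import BeautifulSoup

def extract_secret_word(html):
    """Return the text of p#secret-word, or None when the page has no such tag."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("p", id="secret-word")
    return tag.get_text(strip=True) if tag is not None else None

# Inline stand-ins for the challenge files; the middle one has no secret word.
pages = [
    '<html><body><p id="secret-word">Make</p></body></html>',
    '<html><body><p>No secret here</p></body></html>',
    '<html><body><p id="secret-word"> sure </p></body></html>',
]

# Skip pages without a secret word, then join the rest into one sentence.
words = [w for w in (extract_secret_word(p) for p in pages) if w is not None]
sentence = " ".join(words)
print(sentence)
```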


**Sources**
* [Comprehensive Python Beautiful Soup Web Scraping Tutorial!](https://www.youtube.com/watch?v=GjKQ6V_ViQE&t=412s) by [Keith Galli](https://github.com/KeithGalli)
* [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [CSS Selector Reference](https://www.w3schools.com/cssref/css_selectors.asp)
* Websites used to scrape data:
  * [For practice](https://keithgalli.github.io/web-scraping/example.html)
  * [For exercise](https://keithgalli.github.io/web-scraping/webpage.html)





