# Web Scraping with BeautifulSoup

This project extracts data from websites using BeautifulSoup together with Python. It demonstrates the capabilities of the BeautifulSoup library and shows how to clean the extracted data and move on to analysis.

This project focuses on scraping data from a Codecademy mock website and analyzing its HTML tags.

Import the 'requests' library, which retrieves the HTML of the website that we'd like to examine:

In [1]:
import requests

Requesting the webpage that we would like to extract data from:

In [2]:
webpage_response = requests.get('https://content.codecademy.com/courses/beautifulsoup/shellter.html')
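A request can fail or hang, so a slightly more defensive fetch may be worth having. The sketch below is only an illustration: the helper name 'fetch_html' and the 10-second timeout are my own assumptions and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded HTML of url (hypothetical helper, not from the notebook)."""
    # The timeout stops the call from waiting indefinitely on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text


# Example usage (needs network access, so it is left commented out):
# html = fetch_html("https://content.codecademy.com/courses/beautifulsoup/shellter.html")
```

Failing early on a bad status code avoids parsing an error page as if it were the turtle listing.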

The HTML of the website we loaded for scraping:

In [3]:
print(webpage_response.text)

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Turtle Shellter</title>
      <link href="https://fonts.googleapis.com/css?family=Poppins" rel="stylesheet">
      <link rel='stylesheet' type='text/css' href='style.css'>
  <script async src='/cdn-cgi/bm/cv/669835187/api.js'></script></head>

  <body>
      <div class="banner">
        <h1>The Shellter</h1>
        <span class="brag">The #1 Turtle Adoption website!</span>
      </div>

      <div class="about">
        <p class="text">Click to learn more about each turtle</p>
      </div>

    <div class="grid">
      <div class="box adopt">
          <a href="aesop.html" class="more-info"><img src="aesop.png" class="headshot"></a>
          <p>Aesop</p>
      </div>

      <div class="box adopt">
          <a href="caesar.html" class="more-info"><img src="caesar.png" class="headshot"></a>
          <p>Caesar</p>
      </div>

      <div class="box adopt">
          <a href="sulla.html" class="more-info"><img src="s

Getting the raw content of the response as bytes (note the b'...' prefix in the output):

In [4]:
webpage = webpage_response.content
print(webpage)

b'<!DOCTYPE html>\n<html>\n  <head>\n    <meta charset="utf-8">\n    <title>Turtle Shellter</title>\n      <link href="https://fonts.googleapis.com/css?family=Poppins" rel="stylesheet">\n      <link rel=\'stylesheet\' type=\'text/css\' href=\'style.css\'>\n  <script async src=\'/cdn-cgi/bm/cv/669835187/api.js\'></script></head>\n\n  <body>\n      <div class="banner">\n        <h1>The Shellter</h1>\n        <span class="brag">The #1 Turtle Adoption website!</span>\n      </div>\n\n      <div class="about">\n        <p class="text">Click to learn more about each turtle</p>\n      </div>\n\n    <div class="grid">\n      <div class="box adopt">\n          <a href="aesop.html" class="more-info"><img src="aesop.png" class="headshot"></a>\n          <p>Aesop</p>\n      </div>\n\n      <div class="box adopt">\n          <a href="caesar.html" class="more-info"><img src="caesar.png" class="headshot"></a>\n          <p>Caesar</p>\n      </div>\n\n      <div class="box adopt">\n          <a href="
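The difference between '.content' (bytes) and '.text' (a decoded string) can be shown without a network call. In this minimal sketch the two variables are stand-ins I made up for the response attributes, and the HTML snippet is invented for illustration. BeautifulSoup accepts either form.

```python
from bs4 import BeautifulSoup

# Stand-in for webpage_response.content: raw bytes, as received from the server
raw_bytes = '<meta charset="utf-8"><p>Café Shellter</p>'.encode("utf-8")
# Stand-in for webpage_response.text: the same data decoded into a str
decoded_text = raw_bytes.decode("utf-8")

# from_encoding tells BeautifulSoup how to decode the bytes input
soup_from_bytes = BeautifulSoup(raw_bytes, "html.parser", from_encoding="utf-8")
soup_from_text = BeautifulSoup(decoded_text, "html.parser")

print(type(raw_bytes).__name__, type(decoded_text).__name__)
print(soup_from_bytes.p.get_text() == soup_from_text.p.get_text())
```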

Import the BeautifulSoup class so that the web scraping can take place, then parse the page with Python's built-in 'html.parser':

In [5]:
from bs4 import BeautifulSoup

In [6]:
soup = BeautifulSoup(webpage, "html.parser")

In [7]:
# Example that has the output: <div id="example">An example div</div>
# soup = BeautifulSoup('<div id="example">An example div</div><p>An example p tag</p>', "html.parser")
# print(soup.div)

Getting the name of a tag using BeautifulSoup:

In [8]:
print(soup.div.name)

div


Getting the attributes of a tag as a dictionary in BeautifulSoup:

In [9]:
print(soup.div.attrs)

{'class': ['banner']}
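Individual attributes can also be read directly from a tag. The following is a small sketch on an invented snippet (the 'id' attribute is my addition for illustration and does not appear on the real page). It shows that 'class' comes back as a list, while single-valued attributes come back as strings.

```python
from bs4 import BeautifulSoup

# Invented snippet; the real page's divs have no id attribute
html = '<div class="box adopt" id="aesop"><p>Aesop</p></div>'
demo = BeautifulSoup(html, "html.parser")
div = demo.div

print(div.name)         # the tag's name
print(div.attrs)        # all attributes as a dictionary
print(div["id"])        # dictionary-style access; raises KeyError if missing
print(div.get("href"))  # .get() returns None instead of raising
```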


Getting the h1 tag from the website:

In [10]:
print(soup.h1)

<h1>The Shellter</h1>


Using the 'find' method to get the first h1 tag in the website:

In [11]:
print(soup.find("h1"))

<h1>The Shellter</h1>


Using 'find_all' to find all the occurrences of an h1 tag on the website:

In [12]:
print(soup.find_all("h1"))

[<h1>The Shellter</h1>]
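The two methods also differ in what they return when nothing matches, which matters when looping over results. A minimal sketch on an invented snippet:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<h1>The Shellter</h1><p>Aesop</p><p>Caesar</p>", "html.parser")

first_p = demo.find("p")           # a single tag: the first match
all_p = demo.find_all("p")         # a list-like collection of every match
missing_one = demo.find("h2")      # no match: find gives None
missing_all = demo.find_all("h2")  # no match: find_all gives an empty collection

print(first_p.get_text())
print([p.get_text() for p in all_p])
print(missing_one, len(missing_all))
```

Because 'find' can return None, it is safer to check the result before calling methods on it.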


To run more complex searches of the HTML, we will use Python's built-in regular expression module, 're':

In [13]:
import re

Finding all the heading tags whose names contain 'h' followed by a digit from 1 to 9 (h1, h2, and so on):

In [14]:
soup.find_all(re.compile("h[1-9]"))

[<h1>The Shellter</h1>]
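BeautifulSoup matches a compiled pattern against tag names with a search rather than a full match, so an unanchored pattern can catch more tags than intended. The sketch below uses an invented snippet to compare a loose pattern with an anchored one. Restricting the range to h1 through h6 is my own choice, based on the heading levels HTML defines.

```python
import re

from bs4 import BeautifulSoup

html = "<h1>Title</h1><h2>Subtitle</h2><table><tr><th>Name</th></tr></table>"
demo = BeautifulSoup(html, "html.parser")

# Unanchored: any tag name that contains the letter "h"
loose = [tag.name for tag in demo.find_all(re.compile("h"))]
# Anchored: only names that are exactly "h" plus one digit from 1 to 6
strict = [tag.name for tag in demo.find_all(re.compile(r"^h[1-6]$"))]

print(loose)
print(strict)
```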

Finding all tags named 'h1', 'a', or 'p':

In [15]:
soup.find_all(['h1', 'a', 'p'])

[<h1>The Shellter</h1>,
 <p class="text">Click to learn more about each turtle</p>,
 <a class="more-info" href="aesop.html"><img class="headshot" src="aesop.png"/></a>,
 <p>Aesop</p>,
 <a class="more-info" href="caesar.html"><img class="headshot" src="caesar.png"/></a>,
 <p>Caesar</p>,
 <a class="more-info" href="sulla.html"><img class="headshot" src="sulla.png"/></a>,
 <p>Sulla</p>,
 <a class="more-info" href="spyro.html"><img class="headshot" src="spyro.png"/></a>,
 <p>Spyro</p>,
 <a class="more-info" href="zelda.html"><img class="headshot" src="zelda.png"/></a>,
 <p>Zelda</p>,
 <a class="more-info" href="bandicoot.html"><img class="headshot" src="bandicoot.png"/></a>,
 <p>Bandicoot</p>,
 <a class="more-info" href="hal.html"><img class="headshot" src="hal.png"/></a>,
 <p>Hal</p>,
 <a class="more-info" href="mock.html"><img class="headshot" src="mock.png"/></a>,
 <p>Mock</p>,
 <a class="more-info" href="sparrow.html"><img class="headshot" src="sparrow.png"/></a>,
 <p>Captain Sparrow</

Finding tags whose class attribute is exactly 'box adopt':

In [16]:
soup.find_all(attrs={'class': 'box adopt'})

[<div class="box adopt">
 <a class="more-info" href="aesop.html"><img class="headshot" src="aesop.png"/></a>
 <p>Aesop</p>
 </div>,
 <div class="box adopt">
 <a class="more-info" href="caesar.html"><img class="headshot" src="caesar.png"/></a>
 <p>Caesar</p>
 </div>,
 <div class="box adopt">
 <a class="more-info" href="sulla.html"><img class="headshot" src="sulla.png"/></a>
 <p>Sulla</p>
 </div>,
 <div class="box adopt">
 <a class="more-info" href="spyro.html"><img class="headshot" src="spyro.png"/></a>
 <p>Spyro</p>
 </div>,
 <div class="box adopt">
 <a class="more-info" href="zelda.html"><img class="headshot" src="zelda.png"/></a>
 <p>Zelda</p>
 </div>,
 <div class="box adopt">
 <a class="more-info" href="bandicoot.html"><img class="headshot" src="bandicoot.png"/></a>
 <p>Bandicoot</p>
 </div>,
 <div class="box adopt">
 <a class="more-info" href="hal.html"><img class="headshot" src="hal.png"/></a>
 <p>Hal</p>
 </div>,
 <div class="box adopt">
 <a class="more-info" href="mock.html"><im
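Matching on the full class string is sensitive to the order of the classes. Here is a hedged sketch, on an invented snippet, comparing that exact-string match with the 'class_' keyword and with a CSS selector, both of which look at individual classes.

```python
from bs4 import BeautifulSoup

# Invented snippet with the same classes in different combinations
html = (
    '<div class="box adopt">Aesop</div>'
    '<div class="adopt box">Caesar</div>'
    '<div class="box">Sulla</div>'
)
demo = BeautifulSoup(html, "html.parser")

# Exact string match: only the tag whose class attribute reads "box adopt"
exact = [d.get_text() for d in demo.find_all(attrs={"class": "box adopt"})]
# Single class: any tag that has "box" among its classes
any_box = [d.get_text() for d in demo.find_all(class_="box")]
# CSS selector: tags that have both classes, in any order
both = [d.get_text() for d in demo.select("div.box.adopt")]

print(exact, any_box, both)
```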

Using a function to find tags with complicated requirements:

In [17]:
def shortcut(tag):
    # 'class' is multi-valued, so BeautifulSoup stores it as a list of strings
    return tag.get('class') == ['headshot']

soup.find_all(shortcut)
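A function passed to 'find_all' receives each tag in turn and should return True for the tags to keep. The sketch below is one possible filter, with a made-up name ('links_to_html_page') and an invented snippet. It keeps only the a tags whose href ends in '.html'.

```python
from bs4 import BeautifulSoup


def links_to_html_page(tag):
    """Hypothetical filter: True for <a> tags whose href ends in .html."""
    # tag.get() with a default avoids an error on <a> tags without an href
    return tag.name == "a" and tag.get("href", "").endswith(".html")


html = (
    '<a href="aesop.html"><img src="aesop.png"></a>'
    '<a href="style.css">stylesheet</a>'
    "<a>no link</a>"
)
demo = BeautifulSoup(html, "html.parser")

matches = [a["href"] for a in demo.find_all(links_to_html_page)]
print(matches)
```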

Selecting tags in BeautifulSoup using CSS selectors:

In [18]:
soup.select('.more-info')

[<a class="more-info" href="aesop.html"><img class="headshot" src="aesop.png"/></a>,
 <a class="more-info" href="caesar.html"><img class="headshot" src="caesar.png"/></a>,
 <a class="more-info" href="sulla.html"><img class="headshot" src="sulla.png"/></a>,
 <a class="more-info" href="spyro.html"><img class="headshot" src="spyro.png"/></a>,
 <a class="more-info" href="zelda.html"><img class="headshot" src="zelda.png"/></a>,
 <a class="more-info" href="bandicoot.html"><img class="headshot" src="bandicoot.png"/></a>,
 <a class="more-info" href="hal.html"><img class="headshot" src="hal.png"/></a>,
 <a class="more-info" href="mock.html"><img class="headshot" src="mock.png"/></a>,
 <a class="more-info" href="sparrow.html"><img class="headshot" src="sparrow.png"/></a>]
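CSS selectors can also express parent and child relationships, and 'select_one' returns only the first match. A minimal sketch on a shortened copy of the page's markup (only two turtles, for brevity):

```python
from bs4 import BeautifulSoup

html = """
<div class="box adopt">
  <a href="aesop.html" class="more-info"><img src="aesop.png" class="headshot"></a>
  <p>Aesop</p>
</div>
<div class="box adopt">
  <a href="caesar.html" class="more-info"><img src="caesar.png" class="headshot"></a>
  <p>Caesar</p>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

# "div.box > p": p tags that are direct children of a div with class "box"
names = [p.get_text() for p in demo.select("div.box > p")]
# select_one returns the first matching tag (or None if nothing matches)
first_link = demo.select_one("a.more-info")["href"]
# Attribute values can be read from each selected tag
images = [img["src"] for img in demo.select("a.more-info > img.headshot")]

print(names, first_link, images)
```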

Retrieving all the text in the HTML site we're analysing, separated by a comma ( , ):

In [19]:
soup.get_text(' , ')

"\n , \n , \n , \n , Turtle Shellter , \n , \n , \n , \n , \n , \n , The Shellter , \n , The #1 Turtle Adoption website! , \n , \n , \n , Click to learn more about each turtle , \n , \n , \n , \n , \n , Aesop , \n , \n , \n , \n , Caesar , \n , \n , \n , \n , Sulla , \n , \n , \n , \n , Spyro , \n , \n , \n , \n , Zelda , \n , \n , \n , \n , Bandicoot , \n , \n , \n , \n , Hal , \n , \n , \n , \n , Mock , \n , \n , \n , \n , Captain Sparrow , \n , \n , \n , (function(){window['__CF$cv$params']={r:'71f151194e569e68',m:'G0b63oj3Z.0RzMFdEpBeEVzgxMsg.CiKizV4LfEftVY-1655862455-0-AdGim3Opz2xErmCuUNUlBjlKl4mltdLv7jmrMt46jH/smYjYBrVDOvr9uCOnkZQuIbob+oeMGEAWJUprOsNd4ICXVbLLroPxafpZAYqa8+0bjNSGGJTqv+3j2BhdZeTyE6ZykM5W0Dm9GkGhe+LE72qlg3yiOO5gEKLPlpoPyQoW',s:[0x8b5ca4c425,0xacb994d00f],}})(); , \n , \n"

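Raw 'get_text' output tends to be dominated by whitespace, and depending on the BeautifulSoup version it may also include script contents. One way to tidy it, sketched here on an invented snippet, is to remove script tags first and pass 'strip=True' so that empty strings are skipped.

```python
from bs4 import BeautifulSoup

html = (
    "<body>\n<h1>The Shellter</h1>\n<p>Aesop</p>\n"
    "<script>var tracking = 1;</script>\n</body>"
)
demo = BeautifulSoup(html, "html.parser")

# Remove <script> tags entirely so their contents cannot leak into the text
for script in demo.find_all("script"):
    script.decompose()

# strip=True trims each piece of text and drops the whitespace-only ones
clean_text = demo.get_text(" , ", strip=True)
print(clean_text)
```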
#### We will now convert our scraped data into a DataFrame that can be manipulated for analysis:

The following is example code that I derived from the Codecademy module on web scraping. Continue on to the next set of code to see my personal application of this technique:

In [20]:
# Example code from Codecademy on how to use BeautifulSoup with Python:

example = """import requests
from bs4 import BeautifulSoup
import pandas as pd

prefix = "https://content.codecademy.com/courses/beautifulsoup/"
webpage_response = requests.get('https://content.codecademy.com/courses/beautifulsoup/shellter.html')

webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

turtle_links = soup.find_all("a")
links = []
#go through all of the a tags and get the links associated with them
for a in turtle_links:
  links.append(prefix+a["href"])
    
#Define turtle_data:
turtle_data = {}

#follow each link:
for link in links:
  webpage = requests.get(link)
  turtle = BeautifulSoup(webpage.content, "html.parser")
  turtle_name = turtle.select(".name")[0].get_text()
  
  stats = turtle.find("ul")
  stats_text = stats.get_text("|")
  turtle_data[turtle_name] = stats_text.split("|")


turtle_df = pd.DataFrame(turtle_data)

print(turtle_df) """
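The example above builds each link by concatenating a prefix with the href. An alternative that I find a little more robust, offered here only as a sketch, is 'urljoin' from the standard library, which resolves a relative href against the URL of the page it came from. The two hrefs are taken from the page output shown earlier.

```python
from urllib.parse import urljoin

page_url = "https://content.codecademy.com/courses/beautifulsoup/shellter.html"
hrefs = ["aesop.html", "caesar.html"]

# urljoin replaces the last path segment of page_url with the relative href
links = [urljoin(page_url, href) for href in hrefs]

for link in links:
    print(link)
```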

We will now set up the variables that hold the rows for our DataFrame. Then we will piece it together and print out our final product:

In [58]:
table = soup.find('body')
table_rows = table.find_all('div', attrs={'class': 'box adopt'})

In [72]:
storage = []
for i in table_rows:
    p = i.find('p')
    row = p.text
    storage.append(row)

In [73]:
print(storage)

['Aesop', 'Caesar', 'Sulla', 'Spyro', 'Zelda', 'Bandicoot', 'Hal', 'Mock', 'Captain Sparrow']


Above is the list of turtle names, which we will now convert to a modest DataFrame with a 'names' column:

In [84]:
dataframe = {}

In [86]:
import pandas as pd

In [87]:
df = pd.DataFrame(dataframe)

In [88]:
df.head()

In [89]:
df = df.assign(names= storage)

#### We now have our very modest DataFrame of Turtle names that we were able to scrape from the web!

In [91]:
df

             names
0            Aesop
1           Caesar
2            Sulla
3            Spyro
4            Zelda
5        Bandicoot
6              Hal
7             Mock
8  Captain Sparrow
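The same kind of DataFrame can also be built in one step from a dictionary of lists, which makes it easy to add more columns. The following is a sketch on a shortened copy of the page's markup. The 'page' column holding each turtle's link is my own addition for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

html = """
<div class="box adopt"><a href="aesop.html" class="more-info"></a><p>Aesop</p></div>
<div class="box adopt"><a href="caesar.html" class="more-info"></a><p>Caesar</p></div>
"""
demo = BeautifulSoup(html, "html.parser")
boxes = demo.find_all("div", attrs={"class": "box adopt"})

# Each dictionary key becomes a column; the lists must have equal length
turtles = pd.DataFrame({
    "names": [box.find("p").get_text() for box in boxes],
    "page": [box.find("a")["href"] for box in boxes],  # hypothetical extra column
})

print(turtles)
```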
