
In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import requests
import re
from bs4 import BeautifulSoup
import os

### Beautiful Soup - Web Scraping

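#### **<font color=red>What is Beautiful Soup?</font>**

Beautiful Soup is a Python library for parsing HTML and XML. It turns a page's markup into a tree of objects that we can search and navigate to pull out the content we want.
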
#### **<font color=orange>So What Do I Do With Beautiful Soup?</font>**

Here, we are looking to retrieve content from a web page, but the web page is written in HTML (HyperText Markup Language), so we need a basic understanding of the different HTML elements used to create web pages.

##### **<font color=purple>HTML Elements</font>**

`<html>` tag identifies all of its contents as HTML

`<head>` tag contains metadata about the web page, such as its title

`<body>` tag contains the main content of the web page

`<p>` or paragraph tag creates new paragraphs of text

`<a>` tag tells the browser to render a link. `<a>` tags use the `href` property to tell the link where to go.

`<div>` tag identifies a division of the page

`<b>` tag bolds the text inside

`<i>` tag italicizes the text inside

`<table>` tag creates a table

`<form>` tag creates an input form
   
><font color=purple>The tags are nested inside of the main html tag like you see below.</font>

```html
<html>
    <head>
    </head>
    <body>
        <p>
            <a href="link" id="name_of_link"></a>
        </p>
        <p class="">
        </p>
    </body>
</html>
```

##### **<font color=purple>HTML Terms</font>**

**child** -> a tag inside of another tag. The `<p>` tags above are children of the `<body>` tag.

**parent** -> the tag another tag is inside of. The `<body>` tag is the parent of the `<p>` tag.

**sibling** -> a tag that is nested inside the same parent tag as another tag. The two `<p>` tags are sibling tags inside of the `<body>` tag.
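
The three terms above map onto attributes and methods of a parsed `BeautifulSoup` tree. The sketch below is only an illustration: the HTML string is a made-up stand-in with the same shape as the skeleton above, and the `id` values `first` and `second` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical page with the same shape as the skeleton above
html = """
<html>
  <head><title>Example</title></head>
  <body><p id="first"><a href="link" id="name_of_link">a link</a></p><p id="second">more text</p></body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
first_p = soup.find("p", id="first")

# parent: the tag that first_p sits inside
parent_name = first_p.parent.name

# children: the tags nested directly inside <body> (recursive=False skips grandchildren)
child_names = [child.name for child in soup.body.find_all(True, recursive=False)]

# sibling: the next <p> tag that shares the same parent
sibling = first_p.find_next_sibling("p")

print(parent_name)
print(child_names)
print(sibling["id"])
```

Note that the `<a>` tag does not show up in `child_names`: it is a child of the first `<p>`, not a direct child of `<body>`.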

##### **<font color=purple>HTML Properties</font>**

These properties (formally called attributes in HTML) are optional, but they make HTML elements easier to work with because they give the elements names. You will have to examine a web page to find out whether it uses them.

`class` is a property of an HTML element. One element can have multiple classes, and many elements can share the same class, so classes cannot be used as unique identifiers.

`id` is a property of an HTML element. Each element can have only one id, and an id value should appear only once per page, so ids can be used as unique identifiers.
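
As a rough illustration of the difference, the sketch below searches a small made-up HTML string by `class` and by `id`. The class names `intro` and `highlight` and the id `opening` are hypothetical, chosen only for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two paragraphs share a class, one has an id
html = """
<body>
  <p class="intro highlight" id="opening">First paragraph</p>
  <p class="highlight">Second paragraph</p>
  <p>Third paragraph</p>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# class is a reserved word in Python, so BeautifulSoup uses class_ instead.
# Searching by class can return several elements.
highlighted = soup.find_all("p", class_="highlight")
highlighted_text = [p.get_text() for p in highlighted]

# Searching by id is expected to match a single element.
opening = soup.find(id="opening")

# An element with several classes exposes them as a list.
opening_classes = opening["class"]

print(highlighted_text)
print(opening.get_text())
print(opening_classes)
```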

#### **<font color=green>Now What?</font>**

We will need to use the `requests` library to retrieve the HTML from a web page we want to scrape. You can review how to use the `requests` library in my notebook [here](https://faithkane3.github.io/time_series_review/time_series_review).

Next, we will inspect the structure of the web page by right-clicking on the page we want to scrape and clicking 'Inspect'. This lets us move our cursor over the part of the page we want to scrape and see the HTML responsible for that section highlighted in the inspector panel. We can then pass the tag name and its properties to `BeautifulSoup`, for example `soup.find(name=tag)`, to locate that element.

We will need to use `BeautifulSoup` to parse the HTML response to our request. 



In [5]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Bayes Data Science'} 
    
response = requests.get(url, headers=headers)
response.ok

True

In [7]:
# Here's our long string

print(type(response.text))

<class 'str'>


In [9]:
# We want to use response.content to make our Soup

print(type(response.content))

<class 'bytes'>
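
The notebook does not say why `response.content` is preferred over `response.text`. One plausible reason is that `response.text` has already been decoded by `requests` using its own guess at the encoding, while raw bytes let `BeautifulSoup` work out the encoding itself, for example from a `<meta charset>` declaration. The sketch below uses a made-up byte string rather than a live response.

```python
from bs4 import BeautifulSoup

# Hypothetical page bytes encoded as ISO-8859-1; \xe9 is "é" in that encoding
raw_bytes = (
    b'<html><head><meta charset="iso-8859-1"></head>'
    b"<body><p>caf\xe9</p></body></html>"
)

# Given bytes, BeautifulSoup detects an encoding and decodes the document
soup_from_bytes = BeautifulSoup(raw_bytes, "html.parser")
paragraph_text = soup_from_bytes.p.get_text()

# Given an already-decoded str, there is nothing left to detect
soup_from_str = BeautifulSoup(raw_bytes.decode("iso-8859-1"), "html.parser")

print(paragraph_text)
print(soup_from_bytes.original_encoding)
print(soup_from_str.original_encoding)
```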


In [16]:
# Use BeautifulSoup

soup = BeautifulSoup(response.content, 'html.parser')

# Now that we have our BeautifulSoup object, we can use its built-in methods and properties

print(type(soup))

<class 'bs4.BeautifulSoup'>


##### Grab Title from Page

In [41]:
soup.find('h1', class_='jupiterx-post-title').get_text()

'Codeup’s Data Science Career Accelerator is Here!'

##### Grab Text from Page

In [47]:
text = soup.find('div', itemprop='text').get_text()
text

'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Students wi

In [43]:
article = soup.find('div', class_='jupiterx-post-content').get_text()
article

'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.Data Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.Our program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Students wi
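
The output above contains `\xa0` (non-breaking space) characters, and sentences from neighbouring paragraphs run together (for example `America.Data Science`). A minimal sketch of one way to tidy this is shown below, using the `separator` and `strip` arguments of `get_text()`. The HTML here is a short made-up stand-in for the article, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article markup
html = """
<div class="jupiterx-post-content">
  <p>This program will help you land a job in&nbsp;Data Science.</p>
  <p>Our program will be 18 weeks long.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", class_="jupiterx-post-content")

# Default behaviour: text pieces are joined as-is and &nbsp; becomes \xa0
raw = content.get_text()

# separator=" " puts a space between text pieces, strip=True trims each piece
# and drops whitespace-only ones; replace() swaps non-breaking spaces for plain ones
clean = content.get_text(separator=" ", strip=True).replace("\xa0", " ")

print(repr(raw))
print(clean)
```

Whether a space or a newline is the better separator depends on what the text will be used for downstream.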