In [1]:
import pandas as pd
import numpy as np

from requests import get
import re
from bs4 import BeautifulSoup
import os

### Beautiful Soup - Web Scraping

#### <font color=red>What is Beautiful Soup?</font>

Beautiful Soup is a Python library for scraping data from an HTML document. Before scraping a site, check that the site allows the practice. You can check by appending `/robots.txt` to the base url of the site. Using the `headers` parameter in your request to identify yourself is also a part of web scraping best practices.
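
As a rough sketch of the `robots.txt` check, Python's built-in `urllib.robotparser` can read the rules for us. The rules and the `example.com` urls below are made up and inlined so the example runs offline; for a real site we would call `set_url()` and `read()` on the parser instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs without a network call.
sample_robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

# Ask whether our user agent may fetch each (made-up) url.
user_agent = 'Codeup Data Science'
blog_allowed = parser.can_fetch(user_agent, 'https://example.com/blog/')
private_allowed = parser.can_fetch(user_agent, 'https://example.com/private/data')

print(blog_allowed, private_allowed)
```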

Check out the docs for BeautifulSoup [here](https://beautiful-soup-4.readthedocs.io/en/latest/). I also found it very useful to code along with some articles and tutorials I found online to get a feel for scraping.

- Simple but useful web scraping with Beautiful Soup [article](https://www.pluralsight.com/guides/web-scraping-with-beautiful-soup).

- Dataquest [tutorial](https://www.dataquest.io/blog/web-scraping-beautifulsoup/) from the curriculum and intro Dataquest [tutorial](https://www.dataquest.io/blog/web-scraping-tutorial-python/).

___

#### <font color=orange>So How Do We Use Beautiful Soup?</font>

Here, we are looking to retrieve content from a web page, but the web page is written in HTML (HyperText Markup Language), so we will use the `requests` library to get a response with the HTML from our desired page and `BeautifulSoup` to parse the HTML response. As you begin scraping, it would be helpful to have a basic understanding of the different HTML elements and attributes used to create web pages.

##### HTML Tree Diagram

![HTML Tree Diagram Image](https://drek4537l1klr.cloudfront.net/mcfedries/Figures/19_01.png)

___

##### HTML Elements

An HTML element consists of a start tag and an end tag along with the content between the tags. For example:

```html
<div>content...content</div>
```

HTML elements can be nested or contain other elements. For example:

```html
<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>
```
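
To see the nesting from Python's point of view, here is a small sketch that parses the same HTML with `BeautifulSoup` and walks between parent and child elements. The variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html = """
<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# The <h1> element is nested inside <body>, so <body> is its parent.
heading = soup.find('h1')
parent_name = heading.parent.name

# Passing True matches every tag; recursive=False keeps only direct children.
child_tags = [child.name for child in soup.body.find_all(True, recursive=False)]

print(parent_name)
print(child_tags)
```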

___

`<html>` tags define the html element --- the whole document.

`<body>` tags define the body element --- the document body.

`<h1>` to `<h6>` tags define a heading element --- a heading. 
- (`<h1>` - `<h6>`, largest to smallest heading size)

`<p>` tags define a paragraph element --- a new paragraph of text.

`<a>` tags define an anchor element, which tells the browser to render a hyperlink to a web page, file, email address, etc. Anchor elements use the `href` attribute to tell the link where to go.

```html
<a href='url_of_link'>Text of the link</a>
```

`<div>` tags define a division element, like a container; it is used to group and style block-level content using the `class` or `id` attributes (defined below).

`<span>` tags define a span element, a container like the `<div>` element above, but for styling inline content instead of block-level content.

`<img>` tags define an image element and use the `src` attribute to hold the image address. The `<img>` tag is self-closing, which means it doesn't need a closing tag.

___

##### HTML or Tag Attributes

These are optional and appear inside of the opening tag, usually as name/value pairs `name='value'`. They make the HTML elements easier to work with because they give the elements names. For example, let's add a class attribute to our `<div>` element from above.

```html
<div class='descriptive_class_name'>content...content</div>
```

`class` is an attribute of an HTML element that applies the same styles to every element sharing that class. One element can have multiple classes and different elements can share the same classes, so classes cannot be used as unique identifiers.

`id` is an attribute of an HTML element. Each element can have only one id, and each id value should appear only once on a page, so ids can be used as unique identifiers.

`itemprop` is an attribute used to add machine-readable properties to an element. Its value names the property, and the element's content supplies the property's value.

`href` is an attribute of an `<a>` element that contains the link address.

```html
<a href="destination.com"></a>
```

`src` is an attribute of an `<img>` element that contains the address for an image. I can size my image using the `width=` and `height=` attributes, as well, if I like.
```html
<img src="img_name.jpg" width="500" height="600">
```
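
Once the HTML is parsed, these attributes can be read from a tag much like keys in a dictionary. The snippet below is a minimal sketch using a made-up piece of HTML; the `id`, class names, and urls are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up HTML that uses the attributes described above.
html = """
<div id='main' class='card featured'>
  <a href='https://example.com/about'>About</a>
  <img src='img_name.jpg' width='500' height='600'>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

div = soup.find('div')
link = soup.find('a')
image = soup.find('img')

div_id = div['id']           # a single string
div_classes = div['class']   # a list, because one element can have multiple classes
link_href = link['href']
image_src = image.get('src')
missing = image.get('alt')   # .get() returns None when the attribute is absent

print(div_id, div_classes, link_href, image_src, missing)
```

Using `.get()` instead of square brackets avoids a `KeyError` when an element doesn't have the attribute we asked for.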

___

#### <font color=green>Now What?</font>

![Web Scraping Workflow Image](https://i.pinimg.com/564x/13/ff/c9/13ffc9bddace4005d58222755647879c.jpg)

**Inspect** the structure of the web page by right-clicking on the part or parts of the page we want to scrape and clicking `inspect`. We can also inspect the source code of a web page by prefixing the url in the address bar of our browser with `view-source:` like in the example below. This shows the HTML as it is returned to your request, before the browser or any JavaScript on the page changes it.
```
view-source:https://ryanorsinger.com/
```

**Obtain** the HTML from our target web page using the `requests` library. You can review how to use the `requests` library in my notebook [here](https://darden_reviews.github.io/api_review).
```python
url = 'https://ryanorsinger.com/'
headers = {'User-Agent': 'Codeup Data Science'} 
    
response = get(url, headers=headers)
```

**Parse** the HTML we receive from our request using Python's Beautiful Soup library. We do this by passing our string of HTML or an HTML file and a parser to BeautifulSoup. This is how we **Create** the Soup object that we will work with in extracting data using element names, attributes, and selectors.
```python
soup = BeautifulSoup(response.text, 'html.parser')
```

**Extract** the data we want from our Soup object.

**Save** our data to a file for future use or prepare it for further use in the next step of our project.
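
One possible way to handle the save step is to cache the extracted data in a JSON file and read it back on later runs instead of requesting the page again. This is only a sketch: the `article` dictionary, the file name, and the `load_or_save` helper are all hypothetical stand-ins.

```python
import json
import os
import tempfile

# Hypothetical record standing in for data extracted from a Soup object.
article = {'title': 'Example Title', 'content': 'Example body text.'}

# Use a temporary directory so the example does not write into the working directory.
filename = os.path.join(tempfile.mkdtemp(), 'article.json')

def load_or_save(path, data):
    """Return the cached data at path if it exists; otherwise save data there and return it."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    with open(path, 'w') as f:
        json.dump(data, f)
    return data

first = load_or_save(filename, article)   # no file yet, so this writes it
second = load_or_save(filename, {})       # file exists, so this reads the cached copy

print(second)
```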

___

##### BeautifulSoup Methods

We can use HTML tags, CSS class (`class_=''`), Regex patterns, CSS selectors, and more with `BeautifulSoup` search methods to retrieve the information we want. For example:
```python

# Create our soup object from the response text returned by get() from the requests library.

from requests import get
from bs4 import BeautifulSoup

response = get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Return the first element that matches the filters (None if nothing matches).
# param name -> A filter on tag name. Default: name=None
# param attrs -> A dictionary of filters on attribute values. Default: attrs={}

soup.find(name, attrs)

# Return a list of all the elements that match the filters.

soup.find_all(name, attrs)

# Return a dictionary of all attributes of this tag.

tag.attrs

# Return all the text in this tag.

tag.text

# Return a list of this tag's direct children (tags and strings).

tag.contents
```

You can find more about filtering your HTML requests with `BeautifulSoup` search methods [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-filters).
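
To make the search methods above concrete, here is a short sketch run against a made-up HTML string rather than a live page. It filters by tag name, by CSS class with `class_`, by a Regex pattern on an attribute, and by a CSS selector with `select_one`. The class names and urls are invented for the example.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML so the example runs without a request.
html = """
<div class='post'>
  <h1 class='title'>First Post</h1>
  <p class='body'>Hello, soup.</p>
  <a href='/page/1'>One</a>
  <a href='/page/2'>Two</a>
  <a href='/about'>About</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Filter by tag name and CSS class.
title = soup.find('h1', class_='title').text

# find_all returns every match; here we keep each link's href.
all_links = [a['href'] for a in soup.find_all('a')]

# A compiled Regex pattern can filter on an attribute's value.
page_links = [a['href'] for a in soup.find_all('a', href=re.compile(r'^/page/'))]

# CSS selectors work through select() and select_one().
body_text = soup.select_one('div.post p.body').text

print(title, all_links, page_links, body_text)
```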

In [2]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Data Science'} 
    
response = get(url, headers=headers)
response.ok

True

In [3]:
# Here's our long string; we'll use this to make our soup object

print(type(response.text))

<class 'str'>


In [5]:
# Create a BeautifulSoup object from our response string

soup = BeautifulSoup(response.text, 'html.parser')

# Now that we have our BeautifulSoup object, we can use its built-in methods and properties

print(type(soup))

<class 'bs4.BeautifulSoup'>
