## Acquire Data through Web Scraping
When the data you need is not available as a CSV file, through an API, or in a SQL database, web scraping is another option: programmatically requesting web pages and extracting the data from their HTML.

**Web Scraping Ethics**
Make sure the website's terms of use allow web scraping. Most sites have a terms of service page, and many publish a `robots.txt` file (for example, `example.com/robots.txt`) that states which parts of the site automated clients may access.
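
One way to check a `robots.txt` policy from Python is the standard library's `urllib.robotparser`. The sketch below is only an illustration: the `ROBOTS_TXT` content and the `is_allowed` helper are hypothetical, and the rules are parsed from a string so the example runs offline. A real check would read the site's own file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs without a network
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, 'Codeup Data Science', 'https://example.com/blog/'))
print(is_allowed(ROBOTS_TXT, 'Codeup Data Science', 'https://example.com/private/page'))
```

Passing a `robots.txt` check does not replace reading the terms of service; it only covers what the site has declared for automated clients.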

At a high level, we'll go about web scraping through this process:

1. Manually explore the site in a web browser, and identify the relevant HTML elements.
2. Use the requests module to obtain the HTML from the page.
3. Use BeautifulSoup to parse the HTML and obtain the text/data that we want.
4. (Maybe) Script the process of requesting another page and parsing the data from it as well.
5. Take this data further down the data science pipeline.

### Steps
1. Import the get() function from the requests module, BeautifulSoup from bs4, and pandas.
2. Assign the address of the web page to a variable named url.
3. Request the content of the web page from the server by using get(), and store the server’s response in the variable response.
4. Print the response text to ensure you have an html page.
5. Take a look at the actual web page contents and inspect the source to understand the structure a bit.
6. Use BeautifulSoup to parse the HTML into a variable ('soup').
7. Identify the key tags you need to extract the data you are looking for.
8. Create a dataframe of the data desired.
9. Run some summary stats and inspect the data to ensure you have what you wanted.
10. Edit the data structure as needed, especially so that one column has all the text you want included in this analysis.
11. Create a corpus of the column with the text you want to analyze.
12. Store that corpus for use in a future notebook.

#### Step 1. Import the get() function from the requests module, BeautifulSoup from bs4, and pandas.

In [1]:
# Import libraries

from requests import get
from bs4 import BeautifulSoup
import os
import numpy as np
import pandas as pd

For this lesson, we'll take a look at an article from Codeup's blog.

#### Step 2. Assign the address of the web page to a variable named url.
#### Step 3. Request the content of the web page from the server by using get(), and store the server's response in the variable response.

In [2]:
# Create a variable to hold the web address

url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'

# Some websites don't accept the python-requests default user-agent
headers = {'User-Agent': 'Codeup Data Science'}

# Get the response
response = get(url, headers=headers)
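
A failed request can still return a body (an error page, for example), so it can help to check the status before parsing. The following is a minimal sketch, not part of the original lesson: the `fetch_html` helper, its `timeout` default, and the use of `raise_for_status()` are additions for illustration.

```python
from requests import get


def fetch_html(url, user_agent='Codeup Data Science', timeout=10):
    """Return the HTML text at url, raising an error if the request fails."""
    headers = {'User-Agent': user_agent}
    # The timeout keeps the call from hanging indefinitely on an unresponsive server
    response = get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text


# Example usage (requires network access):
# html = fetch_html('https://codeup.com/codeups-data-science-career-accelerator-is-here/')
```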

#### Step 4. Print the response text to ensure you have an html page.

In [3]:
print(response.text[:400])

<!DOCTYPE html><html lang="en-US"><head >	<meta charset="UTF-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1" />
	
	<!-- This site is optimized with the Yoast SEO plugin v15.2 - https://yoast.com/wordpress/plugins/seo/ -->
	<title>Codeup’s Data Science Career Accelerator is Here! - Codeup</title>
	<meta name="description" content="The rumors are true! The time has arrived


Now we will take a look at the actual web page contents and inspect the source to understand the structure a bit.

As we see from the first line of the response, the server sent us an **HTML document**. This document describes the overall structure of that web page, along with its specific content (which is what makes that particular page unique).

For the most part, all of the pages from a single website will have the same (or very similar) overall structure. **To write our script, we will need to understand the HTML structure of one page**, and we will use the browser’s Developer Tools to do that.
- `command + option + u` will let you view the source of a page in Chrome (on macOS).
- `command + option + i` will open the Chrome DevTools page inspector (on macOS).
- Right-clicking on specific text in the page and selecting 'Inspect' will take you right to the HTML of that text.
    
In general, we'll be looking for **HTML tags**, and using a couple properties of those tags to identify the content that we want. Two element properties are important to us:
- `class`: A list of the class(es) applied to an element. Classes can be used to target certain elements, but are not guaranteed to be unique.
- `id`: An identifier for an element that is meant to be unique on a page.

We'll use the Beautiful Soup library to work with HTML data in Python.
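
To make the difference between `class` and `id` concrete, here is a small sketch on a hypothetical HTML snippet (the markup and the `featured` class are made up for illustration). It looks up one element by `id` and several by `class`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet for illustration
html = """
<section id="posts">
  <article class="blog_post featured"><h3>Hello World</h3></article>
  <article class="blog_post"><h3>HTML Is Awesome</h3></article>
</section>
"""
soup = BeautifulSoup(html, 'html.parser')

# An id should match at most one element, so find() is a natural fit
posts = soup.find(id='posts')

# A class can be shared by many elements, so find_all() returns a list of matches
blog_posts = soup.find_all('article', class_='blog_post')

# The class attribute comes back as a list because an element can have several classes
first_classes = blog_posts[0]['class']

print(posts.name)        # section
print(len(blog_posts))   # 2
print(first_classes)     # ['blog_post', 'featured']
```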

#### Step 5. Take a look at the actual web page contents and inspect the source to understand the structure a bit.
#### Step 6. Use BeautifulSoup to parse the HTML into a variable ('soup').

In [6]:
# Make a soup variable holding the response content

soup = BeautifulSoup(response.content, 'html.parser')

### Beautiful Soup Methods and Properties

- `soup.title.string`: gets the page's title (the same text shown in the browser tab for a page); this is the text of the `title` element.
    
- `soup.prettify()`: returns the HTML as an indented string, which is useful to print when you want to inspect the HTML
    
- `soup.find_all("a")`: find all the anchor tags, or whatever argument is specified.
    
- `soup.find("h1")`: finds the first matching element
    
- `soup.get_text()`: gets the text from within a matching piece of soup/HTML
    
- The `soup.select()` method takes in a CSS selector as a string and returns all matching elements.
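
The methods listed above can be tried on a small document. This is a minimal sketch using a made-up HTML string (the page content and variable names are assumptions for illustration), so it runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only to exercise the methods
html = """
<html>
  <head><title>Example Blog</title></head>
  <body>
    <h1>Welcome</h1>
    <p class="intro">Read <a href="/first">the first post</a> or
       <a href="/second">the second post</a>.</p>
  </body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')

page_title = soup.title.string                    # text of the <title> element
first_heading = soup.find('h1').get_text()        # text of the first <h1>
links = [a['href'] for a in soup.find_all('a')]   # href of every anchor tag
intro = soup.select('p.intro')                    # CSS selector, returns a list

print(page_title)
print(first_heading)
print(links)
print(len(intro))
```

Note that `find()` returns a single element (or `None` when nothing matches), while `find_all()` and `select()` always return lists.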

#### Step 7. Identify the key tags you need to extract the data you are looking for.

In [15]:
article = soup.find('div', class_='jupiterx-post-content')
article.text

'The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Student

Now that we have some text to process, we can **store it for future use**:

In [16]:
with open('article.txt', 'w') as f:
    f.write(article.text)

We can now package all of our code up in a nice function that we can use later:

In [9]:
def get_article_text():
    # if we already have the data, read it locally
    if os.path.exists('article.txt'):
        with open('article.txt') as f:
            return f.read()

    # otherwise go fetch the data
    url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    article = soup.find('div', class_='jupiterx-post-content')

    # save it for next time
    with open('article.txt', 'w') as f:
        f.write(article.text)

    return article.text
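
The single-article example stops at saving the text, so the dataframe and corpus steps from the list at the top are not demonstrated. A minimal sketch of those steps follows, assuming a hypothetical page that contains several posts; the HTML, the column names, and the way `title` and `body` are combined are all illustrative choices, not the lesson's actual data.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical page with several posts, inlined so the example runs offline
html = """
<section id="posts">
  <article class="blog_post"><h3>Hello World</h3><p>This is the first post!</p></article>
  <article class="blog_post"><h3>HTML Is Awesome</h3><p>It's the language and structure for the web!</p></article>
  <article class="blog_post"><h3>CSS Is Totally Rad</h3><p>CSS Selectors are super powerful</p></article>
</section>
"""
soup = BeautifulSoup(html, 'html.parser')

# Build one row (a dict) per post
rows = []
for post in soup.select('article.blog_post'):
    rows.append({
        'title': post.find('h3').get_text(strip=True),
        'body': post.find('p').get_text(strip=True),
    })

df = pd.DataFrame(rows)

# Combine everything we want to analyze into a single text column
df['text'] = df['title'] + '. ' + df['body']

# A quick summary stat to sanity-check what was scraped
df['word_count'] = df['text'].str.split().str.len()

# The corpus is the list of documents from the text column
corpus = df['text'].tolist()

print(df[['title', 'word_count']])
```

The corpus could then be stored for a later notebook, for example with `df.to_csv(...)` or by writing the list to a file as was done for the single article.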

## HTML and CSS Crash Course

**HTML is the language for content and structure on the web.** This means that HTML **specifies what content is what: text, images, links, tables, containers, etc.**

**CSS is the language for styling and presentation.** This means CSS specifies color, background, texture, position, etc.

### HTML Basics

HTML consists of elements denoted by **tags**. These tags are written in angle brackets, like `<main>`. Most elements have an opening tag (`<main>`) and a matching closing tag (`</main>`), and can contain other elements between them.

HTML tags nest inside of other HTML tags, just like directories and files are nested in other directories.

<html>
    <head>
        <title>This is the title of the page</title>
    </head>
    <body>
        <header>
            <h1>Welcome to the blog!</h1>
            <p>Blog is short for "back-log"</p>
        </header>
        <main>
            <h2>Read your way to insight!</h2>
            <section id="posts">
                <article class="blog_post">
                    <h3>Hello World</h3>
                    <p>This is the first post!</p>
                </article>
                <article class="blog_post">
                    <h3>HTML Is Awesome</h3>
                    <p>It's the language and structure for the web!</p>
                </article>
                <article class="blog_post">
                    <h3>CSS Is Totally Rad</h3>
                    <p>CSS Selectors are super powerful</p>
                </article>
            </section>
        </main>
        <footer>
            <p>All rights reserved.</p>
        </footer>
    </body>
</html>

### CSS Selectors

- The name of the element itself is a selector. For example, `soup.select("p")` will select every paragraph tag and `soup.select("footer")` selects the footer element (and everything inside it).

- The id selector is denoted with a `#`. For example, `soup.select("#posts")` will return the HTML element that has the `id="posts"` attribute.

- The class selector is denoted with a `.` symbol before the class name. For example, `soup.select(".blog_post")` returns all of the elements that have that class name.
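
The three selector types can be tried against a trimmed copy of the sample page from the HTML Basics section. This is a sketch for illustration; the last line, which chains selectors to reach nested elements, goes slightly beyond the three bullet points above.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample page from the HTML Basics section
html = """
<main>
  <section id="posts">
    <article class="blog_post"><h3>Hello World</h3><p>This is the first post!</p></article>
    <article class="blog_post"><h3>HTML Is Awesome</h3><p>It's the language and structure for the web!</p></article>
    <article class="blog_post"><h3>CSS Is Totally Rad</h3><p>CSS Selectors are super powerful</p></article>
  </section>
</main>
<footer><p>All rights reserved.</p></footer>
"""
soup = BeautifulSoup(html, 'html.parser')

paragraphs = soup.select('p')             # element selector: every <p>, including the footer's
posts_section = soup.select('#posts')     # id selector: the one element with id="posts"
blog_posts = soup.select('.blog_post')    # class selector: every element with that class

# Selectors separated by spaces match descendants: h3 inside .blog_post inside #posts
post_titles = [h3.get_text() for h3 in soup.select('#posts .blog_post h3')]

print(len(paragraphs), len(posts_section), len(blog_posts))
print(post_titles)
```

`select()` always returns a list, even for an id selector; `soup.select_one()` returns just the first match.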