### Web Scraping Ethics
#### Make sure the website's terms of use allow for web scraping. 
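
Alongside reading the terms of use, one common courtesy check is the site's `robots.txt`. The following is a minimal sketch using the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT`, the `example.com` URLs, and the helper name `is_allowed` are assumptions made up for illustration; a real check would read the target site's own `robots.txt`, and passing it is not a substitute for the terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blog_ok = is_allowed(SAMPLE_ROBOTS_TXT, 'Codeup Data Science',
                     'https://example.com/blog/post/')
private_ok = is_allowed(SAMPLE_ROBOTS_TXT, 'Codeup Data Science',
                        'https://example.com/private/data/')
print(blog_ok, private_ok)
```

With these sample rules, the blog URL is permitted and the `/private/` URL is not.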

#### Web Scraping Process:
1. Manually explore site -> identify relevant HTML elements.
2. Use the 'requests' module to obtain the HTML from the page.
3. Use BeautifulSoup to parse the HTML and obtain the text/data.
4. Script the process of requesting another page and parsing the data from it as well.
5. Continue along Data Science pipeline.
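
Step 4 of the process above (scripting the request-and-parse cycle for further pages) might look roughly like the sketch below. The function names `parse_title` and `scrape_pages`, the `fetch` parameter, and the `example.com` pages are all hypothetical; the fake pages stand in for real responses so the sketch runs without network access.

```python
import time

from bs4 import BeautifulSoup


def parse_title(html):
    """Parse one page's HTML and return the text of its first <h1>, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    heading = soup.find('h1')
    return heading.get_text(strip=True) if heading else None


def scrape_pages(urls, fetch, delay=0.0):
    """Call fetch(url) for each URL and collect one parsed record per page."""
    results = []
    for url in urls:
        results.append({'url': url, 'title': parse_title(fetch(url))})
        time.sleep(delay)  # pause between requests to go easy on the server
    return results


# Made-up pages keyed by URL; dict.get plays the role of the fetch function.
fake_pages = {
    'https://example.com/a': '<html><body><h1>First post</h1></body></html>',
    'https://example.com/b': '<html><body><h1> Second post </h1></body></html>',
}
results = scrape_pages(list(fake_pages), fake_pages.get)
print(results)
```

Against a live site, `fetch` could be something like `lambda url: get(url, headers=headers).text`, with a non-zero `delay`.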

#### Steps:
1. Import the get() function from the requests module, BeautifulSoup from bs4, and pandas.
2. Assign the address of the web page to a variable named url.
3. Request the content of the web page from the server by using get(), and store the server's response in the variable response.
4. Print the response text to ensure you have an HTML page.
5. Take a look at the actual web page contents and inspect the source to understand the structure a bit.
6. Use BeautifulSoup to parse the HTML into a variable ('soup').
7. Identify the key tags you need to extract the data you are looking for.
8. Create a dataframe of the data desired.
9. Run some summary stats and inspect the data to ensure you have what you wanted.
10. Edit the data structure as needed, especially so that one column has all the text you want included in this analysis.
11. Create a corpus of the column with the text you want to analyze.
12. Store that corpus for use in a future notebook.
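
Steps 6 through 12 can be sketched on a small inline page as follows. The HTML, the `post` class, and the column names are assumptions chosen for illustration, not the structure of any real site; on a real page the tags would come from inspecting the source in step 5.

```python
import json

import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML standing in for response.text.
html = """
<div class="post"><h2>Algebra</h2><p>Solve for x.</p></div>
<div class="post"><h2>Statistics</h2><p>Know the mean and median.</p></div>
"""

# Step 6: parse the HTML.
soup = BeautifulSoup(html, 'html.parser')

# Steps 7-8: extract the key tags into one record per post, then a dataframe.
records = [
    {'title': post.find('h2').get_text(strip=True),
     'body': post.find('p').get_text(strip=True)}
    for post in soup.find_all('div', class_='post')
]
df = pd.DataFrame(records)

# Step 9: quick checks that the data looks as expected.
print(df.shape)
print(df.isna().sum())

# Step 10: combine the pieces into a single text column for analysis.
df['text'] = df['title'] + '. ' + df['body']

# Step 11: the corpus is the list of documents in that column.
corpus = df['text'].tolist()

# Step 12: serialize the corpus; writing this string to a file would make it
# available to a later notebook.
corpus_json = json.dumps(corpus)
```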

In [10]:
from requests import get
from bs4 import BeautifulSoup
import os

In [2]:
url = 'https://codeup.com/data-science/math-in-data-science/'
headers = {'User-Agent': 'Codeup Data Science'} 
response = get(url, headers=headers)

In [3]:
print(response.text[:400])

<!DOCTYPE html>
<html lang="en-US">
<head>
	<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<link rel="pingback" href="https://codeup.com/xmlrpc.php" />

	<script type="text/javascript">
		document.documentElement.className = 'js';
	</script>
	
	<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin /><script id="diviarea-loader">window.DiviPopupData=wi


In [4]:
soup = BeautifulSoup(response.content, 'html.parser')

### Beautiful Soup Methods and Properties
* soup.title.string gets the page's title (the same text shown in the browser tab; this is the <title> element).
* soup.prettify() is useful to print in case you want to see the HTML.
* soup.find_all("a") finds all the anchor tags, or whatever tag is specified as the argument.
* soup.find("h1") finds the first matching element.
* soup.get_text() gets the text from within a matching piece of soup/HTML.
* soup.select() takes in a CSS selector as a string and returns all matching elements. <-- **useful**

In [5]:
article = soup.find('div', id='main-content')
article.text

'\n\n\n\n\n\nWhat are the Math and Stats Principles You Need for Data Science?\nOct 21, 2020 | Data Science\n\n\nComing into our Data Science program, you will need to know some math and stats. However, many of our applicants actually learn in the application process – you don’t need to be an expert before applying! Data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. But what “skills” do we mean, exactly? Just what exactly are the data science math and stats principles you need to know?\nWhat are the main math principles you need to know to get into Codeup’s Data Science program?\n\n\nAlgebra\nDo you know PEMDAS and can you solve for x? You will need to be or become comfortable with the following:\xa0\n\nVariables (x, y, n, etc.)\nFormulas, functions, and variable manipulations (e.g. x^2 = x + 6, solve for x).\nOrder of evaluation: PEMDAS: parentheses, exponents, then multiplic

In [6]:
with open('article.txt', 'w') as f:
    f.write(article.text)

In [11]:
def get_article_text():
    # Read data locally if it exists.
    if os.path.exists('article.txt'):
        with open('article.txt') as f:
            return f.read()
    # Fetch data
    url = 'https://codeup.com/data-science/math-in-data-science/'
    headers = {'User-Agent': 'Codeup Data Science'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    article = soup.find('div', id='main-content')
    
    # Save it for when needed
    with open('article.txt', 'w') as f:
        f.write(article.text)
    
    return article.text

In [12]:
get_article_text()

'\n\n\n\n\n\nWhat are the Math and Stats Principles You Need for Data Science?\nOct 21, 2020 | Data Science\n\n\nComing into our Data Science program, you will need to know some math and stats. However, many of our applicants actually learn in the application process – you don’t need to be an expert before applying! Data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. But what “skills” do we mean, exactly? Just what exactly are the data science math and stats principles you need to know?\nWhat are the main math principles you need to know to get into Codeup’s Data Science program?\n\n\nAlgebra\nDo you know PEMDAS and can you solve for x? You will need to be or become comfortable with the following:\xa0\n\nVariables (x, y, n, etc.)\nFormulas, functions, and variable manipulations (e.g. x^2 = x + 6, solve for x).\nOrder of evaluation: PEMDAS: parentheses, exponents, then multiplic