# Web Scraping with Python I

## Warm-up
Some clarification on assignment
[CodePen html practice](https://codepen.io/careybaldwin02/pen/wOxvga)

## Resources
[requests library documentation](http://docs.python-requests.org/en/master/)  
[BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)  
[CodePen](https://codepen.io/careybaldwin02/pen/wOxvga)  
[epicurious](http://www.epicurious.com/)  
[Market Watch](https://www.marketwatch.com/investing/stock/)  

## Objectives

- Become familiar with HTML and CSS through hands-on examples
- Review and practice HTTP requests
- Understand the purpose of web scraping and review some legal and ethical considerations
- Learn to import and use the BeautifulSoup library to:
  - build objects from HTML pulled from webpages
  - use the built-in bs4 functions for getting data by tag and class



### Web Scraping
When possible, it is much more efficient to extract data from an API (JSON data).  However, this may not always be an option.  Web scraping is gathering data from a webpage, often using a program.  We will be using the BeautifulSoup library along with the skills we have developed to write functions that help us target and format the data we might want from a webpage.  

#### Caveat
It is important to note that webpages change all the time, and a program that we write today may not work in the same way tomorrow.  
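One way to soften this problem is to check whether a tag was actually found before using it. The sketch below uses BeautifulSoup's find(), which is introduced later in this notebook, and relies on the fact that find() returns None when nothing matches. The helper name `first_text` and both HTML snippets are made up for illustration.

```python
from bs4 import BeautifulSoup


def first_text(soup, tag_name, class_name):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = soup.find(tag_name, class_=class_name)
    if tag is None:
        # The page layout changed (or the tag never existed).
        return None
    return tag.get_text(strip=True)


# Hypothetical "before" and "after" versions of the same page.
old_html = '<h4 class="title">Apple Pie</h4>'
new_html = '<h3 class="headline">Apple Pie</h3>'

old_soup = BeautifulSoup(old_html, "html.parser")
new_soup = BeautifulSoup(new_html, "html.parser")

print(first_text(old_soup, "h4", "title"))
print(first_text(new_soup, "h4", "title"))
```

Returning None instead of crashing makes it easier to notice that a page has changed and to decide how to react.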

#### Legal and ethical issues
- often against the 'Terms of Use' of a web site
- factual, non-proprietary data is usually ok
- proprietary data scraping depends on your objective
- potential damages to the site
- public vs. private information
- be open about the information extraction
- is there a public interest?

#### Tools for web scraping
- requests: handles the HTTP requests and responses
- Beautiful Soup: utilizes the 'tag structure' of an html page to quickly parse the contents of a page and retrieve data
- Selenium: pretends to be a browser.  Useful for pages with scripts (extra functionality, interactive components)

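The requests library does not run JavaScript, so content that a page builds with scripts will not appear in the HTML we download.  Selenium drives a real browser, so the scripts run and, to the server, the request looks like it comes from a browser rather than a program.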
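A small sketch of what this means in practice, using BeautifulSoup (introduced later in this notebook) on a made-up HTML snippet: the script that would fill in the price is parsed as plain text and never executed, so the div stays empty.

```python
from bs4 import BeautifulSoup

# Hypothetical page: a browser would run the script and show "42.00".
html = """
<div id="price"></div>
<script>document.getElementById("price").textContent = "42.00";</script>
"""

soup = BeautifulSoup(html, "html.parser")

# The parser only reads the markup; the script is never run.
price_text = soup.find("div", id="price").get_text()
print(repr(price_text))
```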

#### We will use Beautiful Soup
- HTML (and XML) parser - a parser builds a data structure (tree) from a language
- Uses 'tags'
- Creates a parse tree
- Can handle incomplete tagging
- Tags organized in hierarchical dictionaries
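To make the points above concrete, here is a minimal sketch that parses a deliberately incomplete snippet (the closing tags are missing). It assumes Python's built-in "html.parser" rather than lxml so that it runs without extra installs; the snippet itself is made up.

```python
from bs4 import BeautifulSoup

# Neither <p> nor <b> is closed in this snippet.
broken_html = "<p>Hello <b>world"

soup = BeautifulSoup(broken_html, "html.parser")

# The parser still builds a tree: <b> sits inside <p>.
bold_tag = soup.find("b")
print(bold_tag.parent.name)
print(soup.find("p").get_text())
```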



#### Import the modules 

In [None]:
import requests
from bs4 import BeautifulSoup

#### The http request response cycle
We can send a request to a url while specifying keywords.

In [None]:
url = "http://www.epicurious.com/search/Apple Pie"
response = requests.get(url)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")

- results_page (created below) is the object from which we need to extract the data.
- 'lxml' is the library we use for parsing the information.
- We don't need the %20 because the requests library adds that in for us.
- The prettify() function just makes the output look nice.
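To see what the %20 encoding looks like, here is a sketch that builds the same kind of URL by hand with the standard library's `urllib.parse.quote`. This is only an illustration of the encoding; requests does its own encoding internally, and the helper name `build_search_url` is made up.

```python
from urllib.parse import quote

BASE_URL = "http://www.epicurious.com/search/"


def build_search_url(keywords):
    """Percent-encode the keywords (spaces become %20) and append them."""
    return BASE_URL + quote(keywords)


print(build_search_url("Apple Pie"))
# → http://www.epicurious.com/search/Apple%20Pie
```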

Below we see the same program but with a dynamic input. 

In [None]:
keywords = input("Please enter the things you want to see in a recipe ")
url = "http://www.epicurious.com/search/" + keywords
response = requests.get(url)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")
    
print(url)

#### Set up the BeautifulSoup object

The content is all of the HTML for the page.  Let's look for our keywords in the HTML. 

In [None]:
results_page=BeautifulSoup(response.content,'lxml')
# prettify() must be called (with parentheses) to get the formatted HTML
print(results_page.prettify())

# Ctrl+f to search this page for our keywords
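Instead of searching by eye, we could also check for the keywords in code. The sketch below is one possible approach on a made-up HTML snippet; the helper name `contains_keywords` is hypothetical, and it uses "html.parser" so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.content
html = """
<html><body>
  <h1>Search results</h1>
  <a href="/r/1">Classic Apple Pie</a>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")


def contains_keywords(soup, keywords):
    """Return True if every keyword appears somewhere in the page text."""
    page_text = soup.get_text().lower()
    return all(word in page_text for word in keywords.lower().split())


print(contains_keywords(soup, "Apple Pie"))
print(contains_keywords(soup, "Pumpkin Pie"))
```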

#### Exercise: Requests
Write a program, like the one above, that sends a request to the Market Watch website:  https://www.marketwatch.com/investing/stock/.  Make your program dynamic so that it accepts a stock symbol as input, e.g. aapl, and adds it to the url, so that the url becomes, for example, https://www.marketwatch.com/investing/stock/aapl. 

In [None]:
# code here


Parse the response using the BeautifulSoup module and print the results. 

In [None]:
# code here


#### BS4 Functions
These are some useful functions we can use to navigate the tree and extract information. 

|Function|Usage|
|-----|-----|
|```<tag>.find(<tag_name>, attribute=value)```|finds the first matching descendant tag|
|```<tag>.find_all(<tag_name>, attribute=value)```|finds all matching descendant tags|
|```<tag>.get_text()```|returns the text inside the tag with the markup removed|
|```<tag>.parent```|returns the (immediate) parent|
|```<tag>.parents```|returns all parents, up to the document root|
|```<tag>.children```|returns the direct children|
|```<tag>.descendants```|returns all descendants (children, grandchildren, and so on)|
|```<tag>.get(attribute)```|returns the value of the specified attribute|

note: an attribute can be something like the value of an href inside a tag:
```
<a href=www.google.com>Google</a>
```
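The functions in the table can be tried out on a small, made-up HTML snippet. This is a sketch only: it assumes "html.parser" so that it runs without lxml, and it filters with `isinstance(..., Tag)` because children and descendants also include the text between tags.

```python
from bs4 import BeautifulSoup, Tag

html = """
<div class="card">
  <h4>Apple Pie</h4>
  <p>A <a href="/recipes/apple-pie">classic <b>dessert</b></a>.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

div_tag = soup.find("div", class_="card")   # first matching tag
a_tags = div_tag.find_all("a")              # all matching tags
a_tag = a_tags[0]

link_text = a_tag.get_text()                # text with the markup removed
link_href = a_tag.get("href")               # value of the href attribute
parent_name = a_tag.parent.name             # immediate parent
ancestor_names = [p.name for p in a_tag.parents]

# Keep only tags; skip the whitespace strings between them.
child_names = [c.name for c in div_tag.children if isinstance(c, Tag)]
descendant_names = [d.name for d in div_tag.descendants if isinstance(d, Tag)]

print(link_text, link_href, parent_name)
print(child_names)
print(descendant_names)
```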

We won't use all of these functions, but some will be very useful.  The find_all() function, for example, finds all instances of a specified tag.  It returns what's called a ResultSet, which behaves like a list.  

In [None]:
all_a_tags = results_page.find_all('a')
print(type(all_a_tags))
print(all_a_tags)

find() will find the first instance of a specified tag

In [None]:
div_tag = results_page.find('div')
print(div_tag)
print(type(div_tag))

Note the type: find() returns a bs4.element.Tag object, not a string.

Because the result is a Tag, bs4 functions can be chained: we can call find() again on the tag we just found.

In [None]:
div_tag.find('a')

#### we can't do the following:
```
all_a_tags.find('div')
```
because the all_a_tags object is a ResultSet (a list of tags), not a single Tag

In [None]:
# all_a_tags.find('div')
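A sketch of what goes wrong and one way around it, using a made-up HTML snippet: calling find() on the ResultSet raises an AttributeError, while looping over the ResultSet and calling find() on each tag works.

```python
from bs4 import BeautifulSoup

html = """
<a href="/one"><div>first</div></a>
<a href="/two">second</a>
"""
soup = BeautifulSoup(html, "html.parser")
all_a_tags = soup.find_all("a")

# find() exists on a single tag, not on the list of tags.
raised = False
try:
    all_a_tags.find("div")
except AttributeError:
    raised = True
print("AttributeError raised:", raised)

# Instead, apply find() to each tag in the ResultSet.
# find() returns None for a tags that contain no div.
divs = [a_tag.find("div") for a_tag in all_a_tags]
print(divs)
```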

#### Exercise:  Scraping HTML Elements


In [None]:
stocks_page=

# print(stocks_page.prettify())

# Find all a tags in the stocks_page results
stock_a= 

# Find the first div tag in the stocks_page results
stock_div= 

# Find the a tags within div tags



#### Qualifying Data Extraction by Class
We can make the selection of a tag more specific by adding the value of a particular class.  Both find() and find_all() can be qualified by css selector values:
- using selector=value
- using a dictionary

If we examine a webpage, we can find the tag and class name that uniquely identifies the data we want.  

We might want to specify the article tag by class = "recipe-content-card"

We write class_ with a trailing underscore because "class" is a keyword in Python.  BeautifulSoup understands the class_ syntax.  

We can take a look at what comes back and check the length to see how many items on the page fit the criteria we specified in our request.

To summarize, what we are doing here is figuring out what uniquely identifies the data that we are looking for. This is often found in the css selector values. 

In [None]:
#When using this method and looking for 'class' use 'class_' (because class is a reserved word in Python)
#Note that we get a list back because find_all returns a list
print(results_page.find_all('article',class_="recipe-content-card"))
# I can check the length of this to see how many of these I get back
len(results_page.find_all('article',class_="recipe-content-card"))

As an alternative to the method above, where we call find_all() with the tag name and the class_ keyword, we can also pass the selectors as a dictionary.  This is useful when we want to look for multiple selectors: we can then include the selectors as values of the class key in the dictionary.

In [None]:
#Since we're using a string as the key, the fact that class is a reserved word is not a problem
#We get an element back because find returns an element
results_page.find('article',{'class':'recipe-content-card'})
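The two ways of qualifying by class can be compared side by side on a made-up snippet. This sketch assumes "html.parser" so that it runs without lxml, and the article contents are invented for illustration.

```python
from bs4 import BeautifulSoup

html = """
<article class="recipe-content-card"><h4>Apple Pie</h4></article>
<article class="recipe-content-card featured"><h4>Apple Crisp</h4></article>
<article class="article-content-card"><h4>How to Peel Apples</h4></article>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword form: note the trailing underscore in class_
by_keyword = soup.find_all("article", class_="recipe-content-card")

# Dictionary form: "class" is just a string key here
by_dict = soup.find_all("article", {"class": "recipe-content-card"})

print(len(by_keyword), len(by_dict))
print([card.find("h4").get_text() for card in by_dict])
```

Both forms match the second article as well, because a tag matches when any one of its classes equals the value we asked for.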

#### Getting text

Given a particular tag, we can get its content.  For example, we can find the first article with the class "recipe-content-card" and call .get_text() to get the text inside that tag.

In [None]:
results_page.find('article',{'class':'recipe-content-card'}).get_text()
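The text that comes back often contains newlines and extra spaces from the page layout. get_text() accepts `separator` and `strip` arguments that help tidy this up; the sketch below shows the difference on a made-up recipe card.

```python
from bs4 import BeautifulSoup

html = """
<article class="recipe-content-card">
  <h4>Apple Pie</h4>
  <p>Average user rating: <span>4/4</span></p>
</article>
"""
soup = BeautifulSoup(html, "html.parser")
card = soup.find("article", {"class": "recipe-content-card"})

# Default: text is returned exactly as it appears, whitespace included.
raw_text = card.get_text()

# strip=True trims each piece of text; separator joins the pieces.
clean_text = card.get_text(separator=" ", strip=True)

print(repr(raw_text))
print(clean_text)
```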

We can also get the actual value of an attribute.  A common task is extracting all links from a page.  get() returns the value of a tag attribute as a string.

In [None]:
# find the 'article' tag with the class 'recipe-content-card' and name it recipe_tag
recipe_tag = results_page.find('article',{'class':'recipe-content-card'})
# create an object called recipe_link that retrieves the a-tag from within the recipe_tag object
recipe_link = recipe_tag.find('a')
print("a tag:",recipe_link)
# from within the recipe_link object, retrieve the value of the href attribute
link_url = recipe_link.get('href')
print("link url:",link_url)
print(type(link_url))

Note that the url above is a relative path: it needs the scheme and domain (e.g. http://www.epicurious.com) in front of it to become an active link. 
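A sketch of extracting every link from a page and turning relative paths into full urls, using the standard library's `urllib.parse.urljoin`. The HTML snippet and the recipe paths in it are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.epicurious.com"

# Hypothetical links: one relative, one already absolute, one without href.
html = """
<a href="/recipes/food/views/apple-pie">Apple Pie</a>
<a href="http://www.epicurious.com/search/pie">More pies</a>
<a>No link here</a>
"""
soup = BeautifulSoup(html, "html.parser")

# get() returns None when the attribute is missing.
hrefs = [a_tag.get("href") for a_tag in soup.find_all("a")]

# urljoin adds the base to relative paths and leaves full urls alone.
absolute_links = [urljoin(BASE_URL, href) for href in hrefs if href is not None]

for link in absolute_links:
    print(link)
```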

#### Exercise:  Extracting Content
Find the part of the Market Watch webpage that contains the current stock value in large font.  Reference the website:  https://www.marketwatch.com/investing/stock/aapl  Then return the associated text.

In [None]:
# Locate the appropriate tag name and class name


In [None]:
# get the text associated with the current stock value


#### Exercise:  Extracting Attribute Values
At the top left side of the page:  https://www.marketwatch.com/investing/stock/aapl, there is a table containing realtime stock values.  Inspect the part of the page containing the table of stock values.  See if you can find an attribute called "channel".  Let's dig into this a bit more and see what it means.  Let's see if we can extract the "channel" values in the way we extracted the "href" attribute values for recipes.  

In [None]:
# reference the stocks_page object we defined in the exercise above
