# Web Scraping

---

There is a LOT of useful information on the internet, and as data scientists you'll often need access to that information.

Unfortunately, that information is rarely contained neatly in CSVs or even in tabular form. Rather, you have to really work to get what you need.

Lucky for us, there are some useful tools for "scraping" the web, in particular one called BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

In [None]:
import time
import re
import csv
import requests
from bs4 import BeautifulSoup
!pip install lxml

## `BeautifulSoup`


What you're seeing above is the HTML for the NYT homepage. Let's continue with a few basics:

## `soup.title` 

Finds the title of a page

## `soup.title.string`

Gets a string version of that same title 

## `soup.title.parent.name`

Gets the name of the tag that contains the title

## `soup.p`

Get the first paragraph tag in the HTML
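The accessors introduced so far can be tried without downloading anything. The sketch below runs them on a small inline document; the HTML is a made-up stand-in (not real NYT markup), and it assumes Python's built-in `html.parser` rather than `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; not real NYT markup.
html = """
<html>
  <head><title>Sample News Page</title></head>
  <body>
    <p class="lede">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

# html.parser ships with Python, so no extra install is needed here.
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title                 # the whole <title> tag
title_text = soup.title.string          # just the text inside the tag
title_parent = soup.title.parent.name   # name of the tag that encloses <title>
first_paragraph = soup.p                # the first <p> tag in the document

print(page_title)
print(title_text)
print(title_parent)
print(first_paragraph)
```

On a real page you would build `soup` from the downloaded HTML instead of a literal string; the attribute access stays the same.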

## `soup.find_all`

It's important to know that BeautifulSoup transforms HTML into a tree of Python objects. The most important objects to know are:

1. Tag
2. NavigableString
3. BeautifulSoup

## `Tag`

Corresponds to an XML or HTML tag in the original document. For instance:
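A minimal sketch of a `Tag`, assuming a one-line hypothetical snippet of HTML. It shows the tag's name and how its attributes are read like dictionary keys.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration.
soup = BeautifulSoup('<a class="headline" href="/story">Read more</a>', "html.parser")

tag = soup.a                     # the first <a> element, as a Tag object
tag_type = type(tag).__name__    # the class name of the object
tag_name = tag.name              # the tag's name
tag_href = tag["href"]           # attributes are read like dictionary keys
tag_classes = tag["class"]       # class can hold several values, so it is a list

print(tag_type, tag_name, tag_href, tag_classes)
```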

## `NavigableString`

Corresponds to a bit of text within a tag. You use the NavigableString class to access that text.
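As a small illustration (the snippet is hypothetical), the text inside a tag comes back as a `NavigableString`, which behaves like a regular Python string and can be converted to one with `str()`.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical snippet used only for illustration.
soup = BeautifulSoup("<b>Breaking news</b>", "html.parser")

text = soup.b.string                              # the text inside the <b> tag
is_navigable = isinstance(text, NavigableString)  # check the object's type
plain = str(text)                                 # convert to an ordinary str

print(is_navigable, plain)
```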

## `BeautifulSoup object`

Represents the document as a whole.

---

# Exercise 1:

Choose any article from NYT.com and find its title.string value 

In [None]:
# your code here

# Exercise 2

Find any and all hyperlinks contained in this article

In [None]:
# your code here

---

## Navigating the Tree

The easiest way to navigate the parse tree is to call out the tag you want. 

In [None]:
# .HEAD

In [None]:
# .TITLE

You can, of course, delve deeper into the parse tree.

In [None]:
# .BODY.P

In [None]:
# .P

In [None]:
# FIND ALL

As alluded to earlier, it's helpful to be able to navigate the tree step-by-step. A tag's children are available in a list called .contents

In [None]:
# CONTENTS 

You can also iterate over a tag's children with the .children generator

In [None]:
# ITERATE
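The navigation steps above can be sketched on a small inline document. The HTML here is a made-up example written on one line so that no whitespace nodes appear among the children.

```python
from bs4 import BeautifulSoup

# Hypothetical document; one line, so there are no whitespace-only children.
html = ("<html><head><title>Sample News Page</title></head>"
        "<body><p>One</p><p>Two</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

head_tag = soup.head                  # jump straight to <head>
first_body_p = soup.body.p            # delve deeper: first <p> inside <body>
all_paragraphs = soup.find_all("p")   # every <p> tag, as a list

body_contents = soup.body.contents    # a tag's children, as a list
child_names = [child.name for child in soup.body.children]  # iterate the children

print(head_tag)
print(first_body_p)
print(len(all_paragraphs), len(body_contents), child_names)
```

On real pages, line breaks and indentation between tags show up in `.contents` as extra text nodes, so expect more children than visible tags.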

---

# Exercise 3:

Find the first paragraph tag in the article.

In [None]:
# your code here

# Exercise 4: 

Find all of the paragraph tags contained in the article.

In [None]:
# your code here

---

## Filters

In [None]:
# FIND ALL 

In [None]:
# RE

In [None]:
# RE

In [None]:
# FIND ALL MULTIPLE

Filtering by CSS Class

In [None]:
# CSS 

In [None]:
# FIND ALL WITH RE
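One possible way to fill in the filter cells above, shown on a hypothetical snippet. `find_all` accepts a tag name, a compiled regular expression, a list of names, or a CSS class via the `class_` keyword.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup used only to exercise the filters.
html = """
<body>
  <b>Bold text</b>
  <a class="story" href="/one">One</a>
  <a class="story featured" href="/two">Two</a>
  <a class="ad" href="/three">Three</a>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                   # a string matches the tag name exactly
by_regex = soup.find_all(re.compile("^b"))     # tag names starting with "b"
by_list = soup.find_all(["a", "b"])            # any tag whose name is in the list
by_class = soup.find_all("a", class_="story")  # <a> tags carrying the CSS class "story"
by_class_regex = soup.find_all(class_=re.compile("^feat"))  # class matched by regex

print([tag.name for tag in by_regex])
print([tag["href"] for tag in by_class])
```

The keyword is `class_` with a trailing underscore because `class` is a reserved word in Python.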

---

# RSS Feeds

An RSS ('Really Simple Syndication') feed is nothing more than a text file that is updated with information (usually pared down) from a website. For more, check out [this article by Digital Trends](https://www.digitaltrends.com/computing/what-is-an-rss-feed/).

In order to flex our RSS Feed skills we are going to be mimicking this brilliant and simple bot, @TwoHeadlines: <br> https://twitter.com/twoheadlines?lang=en

<br>

The concept is simple. It takes two different headlines from two different outlets via their RSS feeds (which we'll go over in a moment) and combines them to produce often comical and almost always nonsensical news headlines.

<br>

The first thing we must do to create our own TwoHeadlines bot is import a few libraries. Remember, libraries in Python are collections of functions and methods that allow you to perform various actions without writing your own code.

<br>

For instance, in our Two Headlines bot we are going to use: 

#### Feedparser: a library that will allow us to read various RSS feeds (again, we'll get to RSS in a moment)<br>
https://pythonhosted.org/feedparser/introduction.html

#### Random: a library that will allow us to generate random numbers <br> 
https://docs.python.org/2/library/random.html

#### Time: a library that will allow us to work around traditionally tricky time functions <br>
https://docs.python.org/2/library/time.html

<br>

Thus, your first lines of code will look as follows: <br>

In [None]:
import feedparser
import random
import time

<br> Great! Now, we want to begin by defining our function. <br>

Remember, functions come in handy when you want to repeat the same task many times using the same _type_ of input. <br>

In [None]:
# FUNCTION

In [None]:
# FUNCTION EXECUTE
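As a reminder of the pattern, here is a minimal sketch of defining a function once and then calling it on different inputs of the same type. The function name `shout` and the sample headlines are invented for illustration.

```python
def shout(headline):
    """Return the headline in upper case with an exclamation mark."""
    return headline.upper() + "!"


# The same function handles any string we pass in.
first = shout("markets rally")
second = shout("local team wins")

print(first)
print(second)
```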

In this case, we will call our function 'TwoHeadlines' 


In [None]:
# PASS

To best understand what you can get from an RSS feed, take a look at the following examples: 

http://www.wsj.com/public/page/rss_news_and_feeds.html <br>
https://archive.nytimes.com/www.nytimes.com/services/xml/rss/index.html <br>
http://rss.cnn.com/rss/cnn_topstories.rss <br>

To see how you can actually pull these RSS feeds using Python, we're going to rely on feedparser. As an example, let's pull two feeds.

Note that we first set a variable equal to the desired url for the desired RSS feed. Then, we use feedparser to store that information into a new variable.

In [None]:
nyt_rss_url = 'https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml' # find the desired rss feed
espn_rss_url = 'https://www.espn.com/espn/rss/news' # find a second desired rss feed

nyt_feed = feedparser.parse(nyt_rss_url) # use feedparser to, well, parse the feed
espn_feed = feedparser.parse(espn_rss_url) # use feedparser to, well, parse the feed

Next, we need to get a bit creative, because we don't want that entire RSS feed; we just want the headline for the latest article! If you print the parsed feed itself, you get everything it contains at once, so we pull out only the titles instead:

In [None]:
# FEED

In [None]:
# TITLES

But how did we know to use "['entries'][i]['title']"?

To understand, we need to briefly delve into the world of dictionaries 

In [None]:
# DICT

In [None]:
# DICT

As you may be able to see, our RSS is actually formatted quite cleverly. It is a dictionary (a set of key-value pairs) that includes lists. For example, look at the very top of the feed. It starts

#### {'feed': {'title': 'WSJ.com: World News',

The best way to read this is: the first key in the dictionary is 'feed', and the value paired with that key is itself another dictionary (you can tell because it begins with a '{'), whose first key is 'title'.

If we keep searching, we'll see that the headlines live under the 'entries' key. That key holds a list with one dictionary per story, and each story's headline is paired with its 'title' key.

I know this is all exceptionally confusing, but just bear with me. The more you practice parsing information from RSS feeds (or HTML in general) the easier it will become, I promise!

So, if we want that headline, and that headline only, we are going to: 

1. Navigate to the entire RSS feed
2. Navigate to the 'entries' section
3. Navigate to the first item in the 'entries' list (each story is going to have its own item, and we want the first headline)
4. Navigate to the 'title' section 
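The four steps above can be sketched with an ordinary dictionary. The structure below is a hand-written stand-in that mimics the shape of a parsed feed; the outlet name, headlines, and links are invented for illustration.

```python
# Hand-written stand-in for a parsed feed; all values are invented.
sample_feed = {
    "feed": {"title": "Example Outlet: World News"},
    "entries": [
        {"title": "First example headline", "link": "https://example.com/1"},
        {"title": "Second example headline", "link": "https://example.com/2"},
    ],
}

# Steps 1-4: whole feed -> 'entries' list -> first item -> its 'title'.
latest_headline = sample_feed["entries"][0]["title"]

# The same pattern with a loop collects every headline.
all_headlines = [entry["title"] for entry in sample_feed["entries"]]

print(latest_headline)
print(all_headlines)
```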

<br>

Now, back to replicating 'TwoHeadlinesBot'

In [None]:
# LIST

In [None]:
# INDEX

In [None]:
# VARIABLE

So, how do we want to mash our headlines together?

In [None]:
# SPLIT

In [None]:
# LOOP SPLIT
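One way to start, sketched with a made-up headline: `split()` breaks a string into a list of words, which we can then loop over.

```python
# Invented headline used only for illustration.
headline = "City council approves new downtown park"

# split() with no arguments breaks the string on whitespace.
words = headline.split()

# Loop over the words along with their position in the list.
for position, word in enumerate(words):
    print(position, word)
```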

Let's keep going. Remember, we want to take half of one headline and half of a different headline and mash them together. So, how do we get just the first or second half of a list of words?  <br>

## Walkthrough of our Code

1) First, the `[:`. This is a slice. For example, `[:2]` returns everything leading up to (but not including!) the third item in our list.

2) Next, `int` converts the result of the division at the end of the line of code into a whole number, because a slice position has to be an integer.


3) `len` is a function that gives you the number of items in a list. For instance, `len(['a', 'b', 'c'])` returns 3.

4) Finally, we are taking the total number of words in the headline and dividing by two.

In total, we are saying: "Take the headline, find out how many words are in the headline and divide by two. Then, take the first half of that headline and store it as the new headline."

_Note that while for the first headline we take the first half (by putting the ':' before the number), we are taking the second half of the second headline (by putting the ':' after the number)._
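The walkthrough can be sketched as follows. The two headlines and the variable names are invented for illustration; the slicing follows the `[:`, `int`, `len`, and divide-by-two steps described above.

```python
# Invented headlines, six words each.
first_headline = "City council approves new downtown park"
second_headline = "Star striker signs record contract extension"

first_words = first_headline.split()
second_words = second_headline.split()

# ':' before the number keeps everything up to the midpoint (the first half).
first_half = first_words[:int(len(first_words) / 2)]

# ':' after the number keeps everything from the midpoint on (the second half).
second_half = second_words[int(len(second_words) / 2):]

print(first_half)
print(second_half)
```

When a headline has an odd number of words, `int` rounds the midpoint down, so the first half comes out one word shorter than the second.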

## All together, now

Finally, we want to join the two halves of our headline and store the result as the variable 'new_headline'.
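Putting the pieces together, here is a minimal sketch of the whole mash-up as one function. The name `two_headlines` and the sample headlines are assumptions made for illustration; in the real bot, the two inputs would come from the parsed feeds.

```python
def two_headlines(first_headline, second_headline):
    """Join the first half of one headline to the second half of another."""
    first_words = first_headline.split()
    second_words = second_headline.split()

    # // is integer division, equivalent here to int(len(...) / 2).
    first_half = first_words[:len(first_words) // 2]
    second_half = second_words[len(second_words) // 2:]

    # join() glues the list of words back into a single string.
    return " ".join(first_half + second_half)


# Invented headlines standing in for real feed titles.
new_headline = two_headlines(
    "City council approves new downtown park",
    "Star striker signs record contract extension",
)
print(new_headline)
```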