# Web Scraping

---

There is a LOT of useful information on the internet, and as data scientists you'll often need access to that information. 

Unfortunately, that information is rarely contained neatly in CSVs or even in tabular form. Rather, you have to really work to get what you need. 

Lucky for us, there are some useful tools for "scraping" the web – in particular, one called BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [None]:
import time
import re
import csv
import requests
from bs4 import BeautifulSoup
!pip install lxml

## `BeautifulSoup`


In [None]:
url = "https://www.nytimes.com/" # let's scrape the NYT homepage

r = requests.get(url) # the requests library is the easiest way to call to a URL; here we are using a GET command

soup = BeautifulSoup(r.text,'html') # we are going to take the result of that GET command and pass it through bs4

print(soup.prettify()) # 'prettify' does exactly what you'd think – it prettifies the output of the print statement

What you're seeing above is the HTML for the NYT homepage. Let's continue with a few basics:

## `soup.title` 

Finds the title of a page

In [None]:
soup.title 

## `soup.title.string`

Gets a string version of that same title 

In [None]:
soup.title.string 

## `soup.title.parent.name`

In [None]:
soup.title.parent.name # find the parent of the title 
                       # this is exceptionally helpful when you're trying to parse an HTML tree

## `soup.p`

Get the first paragraph tag in the HTML

In [None]:
soup.p 

In [None]:
soup.p['class'] # get the class of that <p> tag

## `soup.find_all`

In [None]:
soup.find_all('a') # find all 'a' tags on the page

In [None]:
for link in soup.find_all('a'): # find all 'a' on the page
    print(link.get('href')) # get the associated href (hyperlink) for each instance 
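The values returned by `link.get('href')` are not always full URLs: some are relative paths, some are same-page anchors, and a tag with no `href` gives `None`. Here is a minimal sketch of one way to tidy them up with the standard library's `urljoin`. The `hrefs` list is hand-written sample data, not real output from the NYT homepage.

```python
from urllib.parse import urljoin

base_url = "https://www.nytimes.com/"

# Hypothetical href values of the kind link.get('href') can return
hrefs = ["/section/world", "https://www.nytimes.com/section/us", None, "#site-content"]

absolute_links = []
for href in hrefs:
    # skip tags without an href and links that only jump within the same page
    if href is None or href.startswith("#"):
        continue
    # relative paths are resolved against the base URL; full URLs pass through unchanged
    absolute_links.append(urljoin(base_url, href))

print(absolute_links)
```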

It's important to know that BeautifulSoup transforms HTML into a tree of Python objects. The most important objects to know are: 

1. Tag
2. NavigableString
3. BeautifulSoup

## `Tag`

Corresponds to an XML or HTML tag in the original document. For instance:

In [None]:
tag = soup.p 
tag.name

In [None]:
tag.attrs # you can easily access a tag's attributes

In [None]:
tag['class'] # or, you can search for a corresponding value as you would in a dictionary 

## `NavigableString` 

Corresponds to a bit of text within a tag. BeautifulSoup uses the NavigableString class to hold that text.

In [None]:
tag.string

## `BeautifulSoup object`

Represents the document as a whole.

In [None]:
soup.name
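To see the three object types side by side without depending on a live page, here is a small sketch that parses a hand-written HTML string. The `html_doc` string and the `sample_soup` and `sample_tag` names are made up for illustration, and Python's built-in `html.parser` is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag, NavigableString

# Hypothetical, hand-written page used only for this illustration
html_doc = "<html><head><title>Sample page</title></head><body><p class='lead'>Hello, world</p></body></html>"
sample_soup = BeautifulSoup(html_doc, "html.parser")

sample_tag = sample_soup.p

print(type(sample_soup).__name__)        # the document as a whole
print(type(sample_tag).__name__)         # a single tag
print(type(sample_tag.string).__name__)  # the text inside that tag
print(sample_tag["class"])               # class is multi-valued, so it comes back as a list
print(sample_soup.name)                  # the special name bs4 gives the whole document
```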

---

# Exercise 1:

Choose any article from NYT.com and find its title.string value 

In [None]:
# your code here

# Solution

In [None]:
url = "https://www.nytimes.com/interactive/2019/10/31/us/california-fire-evacuees.html?action=click&module=Top%20Stories&pgtype=Homepage" # this time, let's scrape a single NYT article

r = requests.get(url) # the requests library is the easiest way to call to a URL; here we are using a GET command

soup = BeautifulSoup(r.text,'html') # we are going to take the result of that GET command and pass it through bs4

soup.title.string 

# Exercise 2

Find any and all hyperlinks contained in this article

In [None]:
# your code here

# Solution

In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))

---

## Navigating the Tree

The easiest way to navigate the parse tree is to call out the tag you want. 

In [None]:
soup.head # let's just call out for the 'head' tag

In [None]:
soup.title # or the 'title' tag

You can, of course, delve deeper into the parse tree.

In [None]:
soup.body.p # get the first <p> tag beneath the <body> tag

In [None]:
# note that using a tag name as an attribute gets you only the first tag by that name

soup.p

In [None]:
# to find all the tags, use something like find_all()

soup.find_all('p')

As alluded to earlier, it's helpful to be able to navigate the tree step-by-step. A tag's children are available in a list called .contents

In [None]:
head_tag = soup.head
head_tag.contents

You can also iterate over a tag's children with the .children generator

In [None]:
for child in head_tag.children:
    print(child)
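The difference between `.contents` and `.children` is easier to see on a tiny document. The sketch below uses a hand-written HTML string (`html_doc` and the `sample_` names are made up for illustration) and the built-in `html.parser`. On a real page you will usually also see strings of whitespace mixed in among the child tags.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written page used only for this illustration
html_doc = "<html><head><title>Sample page</title><script>var x = 1;</script></head><body></body></html>"
sample_soup = BeautifulSoup(html_doc, "html.parser")
sample_head = sample_soup.head

# .contents is a plain list, so it supports len() and indexing
print(len(sample_head.contents))
print(sample_head.contents[0])

# .children is an iterator over the same direct children
child_names = [child.name for child in sample_head.children]
print(child_names)
```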

---

# Exercise 3:

Find the first paragraph tag in the article.

In [None]:
# your code here

# Solution

In [None]:
soup.body.p

# Exercise 4: 

Find all of the paragraph tags contained in the article.

In [None]:
# your code here

# Solution

In [None]:
soup.find_all('p')

---

## Filters

In [None]:
soup.find_all('a') # simply pass in the string for the tag you're searching for

In [None]:
import re # you can pass in regular expressions, too

for tag in soup.find_all(re.compile("^p")): # find all tags whose names start with 'p' (the '^' anchors the match to the start)
    print(tag.name)

In [None]:
for tag in soup.find_all(re.compile("t")): # find all the tags whose names contain the letter 't'
    print(tag.name)
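When you pass a regular expression to `find_all`, BeautifulSoup tests each tag name with the pattern's `search()` method, so an unanchored pattern matches anywhere in the name. The sketch below shows the difference using only the `re` module and a hand-picked list of tag names (the list is illustrative, not taken from the page).

```python
import re

# Hypothetical list of tag names, chosen to show the difference
tag_names = ["p", "span", "pre", "input", "title", "table", "a"]

# "p" matches any name that contains the letter 'p'
contains_p = [name for name in tag_names if re.compile("p").search(name)]

# "^p" only matches names that start with 'p'
starts_with_p = [name for name in tag_names if re.compile("^p").search(name)]

print(contains_p)
print(starts_with_p)
```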

In [None]:
soup.find_all(["a","b"]) # if you pass a list, bs4 will match against any item in that list 

Filtering by CSS Class

In [None]:
soup.find_all(class_="story-meta")

# note the class_, since class is a reserved word in Python

In [None]:
soup.find_all(class_=re.compile("ad"))
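Class filters have two behaviors worth seeing on a small example: a plain string matches any one of a tag's classes, and a regular expression matches any class that contains the pattern. The HTML below is hand-written for illustration (the class names other than `story-meta` are made up), and it is parsed with the built-in `html.parser`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, hand-written markup used only for this illustration
html_doc = """
<div class="header">Top of page</div>
<p class="story-meta byline">By A. Reporter</p>
<p class="story-meta">5 min read</p>
<div class="ad-slot">Sponsored</div>
<p class="summary">A short summary</p>
"""
sample_soup = BeautifulSoup(html_doc, "html.parser")

# a string matches tags that have this class, even alongside other classes
exact = sample_soup.find_all(class_="story-meta")

# a regex matches any class containing "ad", which includes "header" as well as "ad-slot"
loose = sample_soup.find_all(class_=re.compile("ad"))

print([tag.get_text() for tag in exact])
print([tag["class"] for tag in loose])
```

Note how the loose pattern also picks up `header`. If you only want classes that begin with "ad", a pattern such as `^ad` is a safer choice.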

---

# RSS Feeds

An RSS ('Really Simple Syndication') feed is nothing more than a text file that is updated with information (usually pared down) from a website. For more, check out [this article by Digital Trends](https://www.digitaltrends.com/computing/what-is-an-rss-feed/)

In order to flex our RSS Feed skills we are going to be mimicking this brilliant and simple bot, @TwoHeadlines: <br> https://twitter.com/twoheadlines?lang=en

<br>

The concept is simple. It takes two different headlines from two different outlets via their RSS feeds (which we'll go over in a moment) and combines them to produce often comical and almost always nonsensical news headlines.

<br>

The first thing we must do to create our own TwoHeadlines bot is import a few libraries. Remember, libraries in Python are collections of functions and methods that allow you to perform various actions without writing your own code.

<br>

For instance, in our Two Headlines bot we are going to use: 

#### Feedparser: a library that will allow us to read various RSS feeds (again, we'll get to RSS in a moment)<br>
https://pythonhosted.org/feedparser/introduction.html

#### Random: a library that will allow us to generate random numbers <br> 
https://docs.python.org/2/library/random.html

#### Time: a library that will allow us to work around traditionally tricky time functions <br>
https://docs.python.org/2/library/time.html

<br>

Thus, your first lines of code will look as follows: <br>

In [None]:
import feedparser
import random
import time

<br> Great! Now, we want to begin by defining our function. <br>

Remember, functions come in handy when you want to repeat the same task many times using the same _type_ of input. <br>

In [None]:
# for example

def printSentence(sentence):
    print(sentence + " Plus a new sentence.")

In [None]:
printSentence("This is the sentence I want to print.")

In this case, we will call our function 'TwoHeadlines' 


In [None]:
def TwoHeadlines(): # we are leaving the input blank for now, and you'll see why in a moment
    pass            # this 'pass' is here just to avoid an error as we work on our function. To see what happens without it, 
                    # try removing the 'pass' line and see the error you receive.

To best understand what you can get from an RSS feed, take a look at the following examples: 

http://www.wsj.com/public/page/rss_news_and_feeds.html <br>
https://archive.nytimes.com/www.nytimes.com/services/xml/rss/index.html <br>
http://rss.cnn.com/rss/cnn_topstories.rss <br>

To actually pull these RSS feeds using Python, we're going to rely on the feedparser library. As an example, let's pull two feeds.

Note that we first set a variable equal to the desired url for the desired RSS feed. Then, we use feedparser to store that information into a new variable.

In [None]:
nyt_rss_url = 'https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml' # find the desired rss feed
espn_rss_url = 'https://www.espn.com/espn/rss/news' # find a second desired rss feed

nyt_feed = feedparser.parse(nyt_rss_url) # use feedparser to, well, parse the feed
espn_feed = feedparser.parse(espn_rss_url) # use feedparser to, well, parse the feed

Next, we need to get a bit creative, because we don't want that entire RSS feed; we just want the headlines of the latest articles. If you print the whole feed, you'll see far more than we need, so after that we'll pull out only the titles:

In [None]:
print(nyt_feed) # print the full RSS feed

In [None]:
for i in range(0,10): # for the first ten entries in the RSS feed (the ten most recent stories)
    print(nyt_feed['entries'][i]['title']) # print the title of said article

But how did we know to use "['entries'][i]['title']"?

To understand, we need to briefly delve into the world of dictionaries 

In [None]:
dictionary = {'favorite_food':'pasta'} # create a new dictionary 

# consider 'favorite_food' to be the word, and 'pasta' to be the definition, if it helps you

In [None]:
print(dictionary['favorite_food'])

# we then call 'favorite_food' and get the "definition" 
# in reality, this is known as a Key:Value pair, with "Key" being the word, and "Value" being the definition

As you may be able to see, our RSS feed is actually formatted quite cleverly. It is a dictionary (a set of key-value pairs) that includes lists. For example, look at the very top of the feed. It starts 

#### {'feed': {'title': 'WSJ.com: World News',

The best way to read this is: the first key in the dictionary is 'feed', and its value is itself a dictionary whose first key is 'title'. 

You can tell that the value of 'feed' is another dictionary because it begins with a '{'. If we keep searching, we'll see that the headlines sit under 'entries', which is a list with one dictionary per story, and each of those dictionaries has its own 'title'. 

I know this is all exceptionally confusing, but just bear with me. The more you practice parsing information from RSS feeds (or HTML in general) the easier it will become, I promise!

So, if we want that headline, and that headline only, we are going to: 

1. Navigate to the entire RSS feed
2. Navigate to the 'entries' section
3. Navigate to the first item in 'entries' (each story has its own entry, and we want the first headline)
4. Navigate to the 'title' section 
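The four steps above can be sketched on a small, hand-written dictionary that imitates the shape of a parsed feed. Everything in `sample_feed` is made up for illustration; a real feed has many more keys, but the path to a headline is the same.

```python
# Hypothetical stand-in for a parsed feed: a dictionary that contains a list of dictionaries
sample_feed = {
    "feed": {"title": "Example News"},
    "entries": [
        {"title": "First made-up headline", "link": "https://example.com/1"},
        {"title": "Second made-up headline", "link": "https://example.com/2"},
    ],
}

entries = sample_feed["entries"]    # step 2: the list of stories
first_entry = entries[0]            # step 3: the first story (a dictionary)
first_title = first_entry["title"]  # step 4: that story's headline

# the same navigation written in one line
same_title = sample_feed["entries"][0]["title"]

print(first_title)
print(same_title)
```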

<br>

Now, back to replicating 'TwoHeadlinesBot'

In [None]:
my_list = [] # create a new, empty list called 'my_list'

for i in range(0,10): 
    my_list.append(nyt_feed['entries'][i]['title']) # append the first ten titles to this list

In [None]:
my_list[3] # select the item at index 3 of that list (the fourth title)

In [None]:
Article4 = my_list[3]

In [None]:
Article4[:25] # get the first 25 characters of the title of the 3rd index (fourth article) in our list

In [None]:
len(Article4) # how many characters long is our title? 

In [None]:
len(Article4)/2 # figure out the half-way point of the title 

In [None]:
Article4[:int(len(Article4)/2)] # get the first half of our article title, whatever its length

In [None]:
Article5 = my_list[4] # let's see what the next title is in our list
print(Article5)

So, how do we want to mash our headlines together?

In [None]:
nyt_first_story = nyt_feed['entries'][0]['title'] #Recall that '0' is actually the first instance
print(nyt_first_story)

In [None]:
words = nyt_first_story.split(' ') # remember, I can split that single sentence into a list of individual words 
print(words) 

In [None]:
for i in range(0,10): 

    nyt_first_story = nyt_feed['entries'][i]['title'] # pull the title of the ith story in the first RSS feed
    espn_first_story = espn_feed['entries'][i]['title'] # pull the title of the ith story in the second RSS feed

    nyt_words = nyt_first_story.split(' ') # split the title by spaces (aka, make every word in the title its own item)
    espn_words = espn_first_story.split(' ') # split the title by spaces (aka, make every word in the title its own item)

    # print inside the loop so that every pair of headlines is shown, not just the last one
    print(nyt_words) 
    print(" --- ") # print a line for formatting purposes
    print(espn_words)

Let's keep going. Remember, we want to take half of one headline and half of a different headline and mash them together. So, how do we get just the first or second half of a list of words?  <br>

In [None]:
for i in range(0,10): 

    nyt_first_story = nyt_feed['entries'][i]['title'] # pull the title of the ith story in the first RSS feed
    espn_first_story = espn_feed['entries'][i]['title'] # pull the title of the ith story in the second RSS feed

    nyt_words = nyt_first_story.split(' ') # split the title by spaces (aka, make every word in the title its own item)
    espn_words = espn_first_story.split(' ') # split the title by spaces (aka, make every word in the title its own item)

    nyt_words = nyt_words[:int(len(nyt_words)/2)] # keep the first half of the first headline
    espn_words = espn_words[int(len(espn_words)/2):] # keep the second half of the second headline

    # print inside the loop so that every pair of halves is shown, not just the last one
    print(nyt_words)
    print(" --- ")
    print(espn_words)

## Walkthrough of our Code

1) First, the `[:`    

In [None]:
# the ':' at the front of a slice means 'everything leading up to this point'. For instance: 

letters = ['a','b','c','d','e'] # avoid naming a variable 'list', which would hide Python's built-in list
letters = letters[:3]
print(letters)

In other words, we want everything leading up to (but not including!) index 3, which gives us the first three items in our list.

2) Next, `int` converts the result of the division into an integer. A list can only be sliced with whole numbers, and dividing with `/` always produces a float.


In [None]:
len(nyt_words)/2 # the result is a float, which we don't want

In [None]:
int(len(nyt_words)/2) # this gives us an integer

3)  `len` is a function that gives you the number of items in a list. For instance: 

In [None]:
letters = ['a','b','c','d','e'] # avoid naming a variable 'list', which would hide Python's built-in list
len(letters)

4) Finally, we are taking the total number of words in the headline and dividing by two

In total, we are saying: "Take the headline, find out how many words are in the headline and divide by two. Then, take the first half of that headline and store it as the new headline." 

_Note that while for the first headline we take the first half (by putting the ':' before the midpoint, as in `[:n]`), we take the second half of the second headline (by putting the ':' after the midpoint, as in `[n:]`)._
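To check the slicing logic on something small, here is a sketch that splits one made-up headline at its midpoint. It uses the floor-division operator `//`, which gives the same whole number as `int(len(words)/2)` for list lengths; the headline text is invented for illustration.

```python
# Hypothetical headline used only for this illustration
words = "Local team wins big game".split(" ")

midpoint = len(words) // 2    # floor division returns an integer directly

first_half = words[:midpoint]   # everything before the midpoint
second_half = words[midpoint:]  # everything from the midpoint onward

print(midpoint)
print(first_half)
print(second_half)
```

Because both slices use the same midpoint, the two halves always add back up to the original list, even when the number of words is odd.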

## All together, now

Finally, we want to join the two halves of our headline and store it as the variable 'new_headline' 

In [None]:
for i in range(0,10): 

    nyt_first_story = nyt_feed['entries'][i]['title'] # pull the title of the ith story in the first RSS feed
    espn_first_story = espn_feed['entries'][i]['title'] # pull the title of the ith story in the second RSS feed

    nyt_words = nyt_first_story.split(' ') # split the title by spaces (aka, make every word in the title its own item)
    espn_words = espn_first_story.split(' ') # split the title by spaces (aka, make every word in the title its own item)

    nyt_words = nyt_words[:int(len(nyt_words)/2)] 
    espn_words = espn_words[int(len(espn_words)/2):]
    
    new_headline = nyt_words + espn_words # Take the first half of the title from the first RSS feed and add the second half of the second RSS feed
    new_headline = ' '.join(new_headline) # Join the two strings created above with spaces

    print(new_headline) # Print your newly created headline
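Earlier we left the `TwoHeadlines` function empty. One possible way to fill in that idea is sketched below; it is not the original bot's code. The function names `mash_headlines` and `two_headlines`, the `limit` parameter, and the sample titles are all assumptions made for illustration. Using `zip` means the loop simply stops when the shorter list runs out, so a feed with fewer than ten entries does not raise an error.

```python
def mash_headlines(first_title, second_title):
    """Join the first half of one title to the second half of another."""
    first_words = first_title.split(" ")
    second_words = second_title.split(" ")
    first_half = first_words[:len(first_words) // 2]
    second_half = second_words[len(second_words) // 2:]
    return " ".join(first_half + second_half)


def two_headlines(first_titles, second_titles, limit=10):
    """Mash titles pairwise, stopping at `limit` or at the end of the shorter list."""
    pairs = zip(first_titles[:limit], second_titles[:limit])
    return [mash_headlines(first, second) for first, second in pairs]


# Hypothetical titles standing in for the two feeds
sample_news = ["City council approves new park budget", "Storm expected to hit coast tonight"]
sample_sports = ["Rookie pitcher throws first no-hitter", "Coach announces surprise retirement"]

mashed = two_headlines(sample_news, sample_sports)
for headline in mashed:
    print(headline)
```

With the real feeds, the two lists could be built with something like `[entry['title'] for entry in nyt_feed['entries']]` and the same for `espn_feed`.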