# Web Scraping

---

There is a LOT of useful information on the internet, and as data scientists you'll often need access to that information. 

Unfortunately, that information is rarely packaged neatly in CSVs, or even in tabular form. More often, you have to really work to get what you need. 

Lucky for us, there are some useful tools for "scraping" the web, in particular one called BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

In [23]:
import time
import re
import csv
import requests
from bs4 import BeautifulSoup
!pip install lxml #https://lxml.de/



## `BeautifulSoup`


In [24]:
url = "https://www.nytimes.com/news-event/coronavirus" 

r = requests.get(url) # the requests library is the easiest way to call to a URL; here we are using a GET command

soup = BeautifulSoup(r.text,'html') # we are going to take the result of that GET command and pass it through bs4

print(soup.prettify()) # 'prettify' does exactly what you'd think – it prettifies the output of the print statement

<!DOCTYPE html>
<html lang="en" xmlns:og="http://opengraphprotocol.org/schema/">
 <head>
  <title data-rh="true">
   The Coronavirus Outbreak - The New York Times
  </title>
  <meta content="en-US" data-rh="true" itemprop="inLanguage"/>
  <meta content="collection" data-rh="true" id="applicationName" name="applicationName"/>
  <meta content="coronavirus" data-rh="true" name="nyt-collection:identifier"/>
  <meta content="coronavirus" data-rh="true" name="CN"/>
  <meta content="news_eventcollection" data-rh="true" name="nyt-collection:type"/>
  <meta content="column" data-rh="true" name="CT"/>
  <meta content="Health" data-rh="true" name="nyt-collection:display-name"/>
  <meta content="" data-rh="true" name="nyt-collection:tagline"/>
  <meta content="" data-rh="true" name="nyt-collection:promotional-image"/>
  <meta content="collection" data-rh="true" name="PT"/>
  <meta content="100000006957737" data-rh="true" name="asset_id"/>
  <meta content="coronavirus" data-rh="true" name="slug"/>


What you're seeing above is the beginning of the HTML for the NYT coronavirus page (not the homepage). Let's continue with a few basics:

## `soup.title` 

Finds the title of a page

In [25]:
soup.title 

<title data-rh="true">The Coronavirus Outbreak - The New York Times</title>

## `soup.title.string`

Gets a string version of that same title 

In [26]:
soup.title.string 

'The Coronavirus Outbreak - The New York Times'

## `soup.p`

Get the first paragraph tag in the HTML

In [27]:
soup.p 

<p class="updated">Updated weekday evenings</p>

In [28]:
soup.p['class'] # get the class of that <p> tag

['updated']

## `soup.find_all`

Gets every matching tag (here, every paragraph tag) as a list

In [29]:
soup.find_all('p')

[<p class="updated">Updated weekday evenings</p>,
 <p class="g-hp-cta">See all new cases around the world</p>,
 <p class="g-hp-cta">See the U.S. hot spots</p>,
 <p><a href="https://www.nytimes.com/2020/05/15/us/coronavirus-what-to-do-outside.html">Outdoor gatherings</a> lower risk because wind disperses viral droplets, and sunlight can kill some of the virus. Open spaces prevent the virus from building up in concentrated amounts and being inhaled, which can happen when infected people exhale in a confined space for long stretches of time, said Dr. Julian W. Tang, a virologist at the University of Leicester.</p>,
 <p>In the beginning, the coronavirus <a href="https://www.nytimes.com/article/coronavirus-facts-history.html#link-6817bab5">seemed like it was primarily a respiratory illness</a> — many patients had fever and chills, were weak and tired, and coughed a lot, though some people don’t show many symptoms at all. Those who seemed sickest had pneumonia or acute respiratory distress s
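
`find_all` can also filter by attributes, and each tag in the result can be queried further. The sketch below is only an illustration: the HTML is a made-up snippet (not the real NYT page), so it runs without a network call, and it uses the built-in `html.parser` so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# A made-up snippet of HTML (an assumption for illustration, not the NYT page)
html = """
<html><body>
  <p class="updated">Updated weekday evenings</p>
  <p class="g-hp-cta">See all new cases around the world</p>
  <p>Read about <a href="https://example.com/outdoors">outdoor gatherings</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")                   # every <p> tag, in document order
cta = soup.find_all("p", class_="g-hp-cta")       # only <p> tags with this class
links = [a["href"] for a in soup.find_all("a")]   # the href of every <a> tag

print(len(paragraphs))
print(cta[0].get_text())
print(links)
```

Note the trailing underscore in `class_`: `class` is a reserved word in Python, so BeautifulSoup uses `class_` as the keyword argument.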

In [30]:
url = "https://www.nytimes.com/2020/10/01/business/economy/layoffs-unemployment-claims.html" 

r = requests.get(url) # the requests library is the easiest way to call to a URL; here we are using a GET command

soup = BeautifulSoup(r.text,'html') # we are going to take the result of that GET command and pass it through bs4

In [31]:
soup.title

<title data-rh="true">New Layoffs Add to Worries Over U.S. Economic Slowdown - The New York Times</title>

In [32]:
soup.title.string

'New Layoffs Add to Worries Over U.S. Economic Slowdown - The New York Times'

In [33]:
content = soup.find_all('p')

content

[<p>Advertisement</p>,
 <p>Supported by</p>,
 <p class="css-1smgwul e1wiw3jv0" id="article-summary">A standoff over further federal aid and concern over the pandemic’s duration are pushing companies to eliminate jobs.</p>,
 <p class="css-1nuro5j e1jsehar1" itemprop="author" itemscope="" itemtype="http://schema.org/Person">By<!-- --> <a class="css-brehiz e1jsehar0" href="https://www.nytimes.com/by/nelson-d-schwartz"><span class="css-1baulvz" itemprop="name">Nelson D. Schwartz</span></a> and <a class="css-brehiz e1jsehar0" href="http://nytimes.com/by/gillian-friedman"><span class="css-1baulvz last-byline" itemprop="name">Gillian Friedman</span></a></p>,
 <p class="css-158dogj evys1bk0">The American economy is being buffeted by a fresh round of corporate layoffs, signaling new anxiety about the course of the coronavirus pandemic and uncertainty about further legislative relief.</p>,
 <p class="css-158dogj evys1bk0">Companies including <a class="css-1g7m0tk" href="https://www.nytimes.com/2

In [93]:
content_list = []

for p in content:
  content_list.append(p.string)
  
article = content_list[2:40]

In [96]:
article

['A standoff over further federal aid and concern over the pandemic’s duration are pushing companies to eliminate jobs.',
 None,
 'The American economy is being buffeted by a fresh round of corporate layoffs, signaling new anxiety about the course of the coronavirus pandemic and uncertainty about further legislative relief.',
 None,
 None,
 'Democrats are pushing a $2.2 trillion proposal, while the White House has floated a $1.6 trillion plan.',
 'After business shutdowns in the early spring threw 22 million people out of work, the economy rebounded in May and June with the help of stimulus money and rock-bottom interest rates. But the loss of momentum since then, coupled with fears of a second wave of coronavirus cases this fall, has left many experts uneasy about the months ahead.',
 '“The layoffs are an additional headwind in an already weak labor market,” said Rubeela Farooqi, chief U.S. economist for High Frequency Economics. “As long as the virus isn’t contained, this is going to

In [99]:
for i in article: 
  print(i)

A standoff over further federal aid and concern over the pandemic’s duration are pushing companies to eliminate jobs.
None
The American economy is being buffeted by a fresh round of corporate layoffs, signaling new anxiety about the course of the coronavirus pandemic and uncertainty about further legislative relief.
None
None
Democrats are pushing a $2.2 trillion proposal, while the White House has floated a $1.6 trillion plan.
After business shutdowns in the early spring threw 22 million people out of work, the economy rebounded in May and June with the help of stimulus money and rock-bottom interest rates. But the loss of momentum since then, coupled with fears of a second wave of coronavirus cases this fall, has left many experts uneasy about the months ahead.
“The layoffs are an additional headwind in an already weak labor market,” said Rubeela Farooqi, chief U.S. economist for High Frequency Economics. “As long as the virus isn’t contained, this is going to be an ongoing phenomeno
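
Why do some entries come back as `None`? A tag's `.string` is only defined when the tag has a single child; a paragraph that mixes text with links (like the byline) has several children, so `.string` returns `None`. One possible way around this is `.get_text()`, which joins all the text inside a tag. The sketch below uses a made-up two-paragraph snippet rather than the real article.

```python
from bs4 import BeautifulSoup

# Made-up HTML: one plain paragraph and one paragraph that contains links
html = '<p>Plain text only.</p><p>By <a href="#">A. Writer</a> and <a href="#">B. Writer</a></p>'
soup = BeautifulSoup(html, "html.parser")

plain, byline = soup.find_all("p")

print(plain.string)       # Plain text only.
print(byline.string)      # None, because this <p> has more than one child
print(byline.get_text())  # By A. Writer and B. Writer

# Collect the text of every paragraph without any None values
texts = [p.get_text() for p in soup.find_all("p")]
print(texts)
```

Applied to the article above, replacing `p.string` with `p.get_text()` in the loop should fill in the paragraphs that currently show up as `None`.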

---

# RSS Feeds

An RSS ('Really Simple Syndication') feed is nothing more than a text file (written in XML) that is updated with information (usually pared down) from a website. For more, check out [this article by Digital Trends](https://www.digitaltrends.com/computing/what-is-an-rss-feed/).
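
To make "a text file" concrete, here is a minimal sketch of what an RSS file can look like and how its headlines can be pulled out. Everything in the feed is made up for illustration, and the parsing uses Python's built-in `xml.etree.ElementTree` rather than the feedparser library used later in this notebook.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 document (real feeds carry more fields per item)
rss_text = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com/</link>
    <item>
      <title>First made-up headline</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second made-up headline</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>
"""

root = ET.fromstring(rss_text)

feed_title = root.findtext("channel/title")                       # the feed's own title
titles = [item.findtext("title") for item in root.iter("item")]   # one title per story

print(feed_title)
print(titles)
```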

In order to flex our RSS Feed skills we are going to be mimicking this brilliant and simple bot, @TwoHeadlines: <br> https://twitter.com/twoheadlines?lang=en

<br>

The concept is simple. It takes two different headlines from two different outlets via their RSS feeds (which we'll go over in a moment) and combines them to produce often comical and almost always nonsensical news headlines.

<br>

The first thing we must do to create our own TwoHeadlines bot is import a few libraries. Remember, libraries in Python are collections of functions and methods that allow you to perform various actions without writing your own code.

<br>

For instance, in our Two Headlines bot we are going to use: 

#### Feedparser: a library that will allow us to read various RSS feeds (again, we'll get to RSS in a moment)<br>
https://pythonhosted.org/feedparser/introduction.html

#### Random: a library that will allow us to generate random numbers <br> 
https://docs.python.org/2/library/random.html

#### Time: a library that will allow us to work around traditionally tricky time functions <br>
https://docs.python.org/2/library/time.html

<br>

Thus, your first lines of code will look as follows: <br>

In [None]:
!pip install feedparser

import feedparser
import random
import time

<br> Great! Now, we want to begin by defining our function. <br>

Remember, functions come in handy when you want to repeat the same task many times using the same _type_ of input. <br>

In [None]:
# for example

def printSentence(sentence):
    print(sentence + " Plus a new sentence.")

In [None]:
printSentence("This is the sentence I want to print.")

In this case, we will call our function 'TwoHeadlines' 


In [None]:
def TwoHeadlines(): # we are leaving the input blank for now, and you'll see why in a moment
    pass            # this 'pass' is here just to avoid an error as we work on our function. To see what happens without it, 
                    # try removing the 'pass' line and see the error you receive.

To best understand what you can get from an RSS feed, take a look at the following examples: 

http://www.wsj.com/public/page/rss_news_and_feeds.html <br>
https://archive.nytimes.com/www.nytimes.com/services/xml/rss/index.html <br>
http://rss.cnn.com/rss/cnn_topstories.rss <br>

To see how you can actually pull these RSS feeds using Python, we're going to rely on feedparser. As an example, let's pull two feeds.

Note that we first set a variable equal to the desired url for the desired RSS feed. Then, we use feedparser to store that information into a new variable.

In [None]:
nyt_rss_url = 'https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml' # find the desired rss feed
espn_rss_url = 'https://www.espn.com/espn/rss/news' # find a second desired rss feed

nyt_feed = feedparser.parse(nyt_rss_url) # use feedparser to, well, parse the feed
espn_feed = feedparser.parse(espn_rss_url) # use feedparser to, well, parse the feed

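Next, we need to get a bit creative, because we don't want the entire RSS feed; we just want the headlines of the latest articles. If you print the feed itself, you get everything at once, so below we first print the whole feed and then pull out only the titles: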

In [None]:
print(nyt_feed) # print the full RSS feed

In [None]:
for i in range(0,10): # for the first ten entries in the RSS feed (the ten most recent stories)
    print(nyt_feed['entries'][i]['title']) # print the title of said article
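
One caveat about the loop above: `range(0,10)` assumes the feed has at least ten entries, and indexing past the end of a list raises an `IndexError`. A possible alternative is to slice the list, which never goes past the end. The sketch uses a made-up stand-in for the parsed feed (`mock_feed` is an assumption for illustration) so it runs without a network call.

```python
# A made-up stand-in for a parsed feed that has only three entries
mock_feed = {
    "entries": [
        {"title": "Headline A"},
        {"title": "Headline B"},
        {"title": "Headline C"},
    ]
}

# Slicing with [:10] returns *up to* ten entries, so a short feed is not a problem
titles = [entry["title"] for entry in mock_feed["entries"][:10]]

for title in titles:
    print(title)
```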

But how did we know to use "['entries'][i]['title']"?

To understand, we need to briefly delve into the world of dictionaries 

In [None]:
dictionary = {'favorite_food':'pasta'} # create a new dictionary 

# consider 'favorite_food' to be the word, and 'pasta' to be the definition, if it helps you

In [None]:
print(dictionary['favorite_food'])

# we then look up 'favorite_food' and get its "definition" 
# in reality, this is known as a Key:Value pair, with "Key" being the word, and "Value" being the definition

As you may be able to see, our RSS feed is actually formatted quite cleverly. It is a dictionary (a set of key-value pairs) that includes lists. For example, look at the very top of the feed. It starts 

#### {'feed': {'title': 'WSJ.com: World News',

The best way to read this is: the first key in the dictionary is 'feed', and the value for that key is another dictionary (you can tell because it begins with a '{'). The first key inside that inner dictionary is 'title'. 

If we keep searching, we'll see that the headlines live under the 'entries' key, which holds a list with one dictionary per story, and that each of those dictionaries has its own 'title'. 

I know this is all exceptionally confusing, but just bear with me. The more you practice parsing information from RSS feeds (or HTML in general) the easier it will become, I promise!

So, if we want that headline, and that headline only, we are going to: 

1. Navigate to the entire RSS feed
2. Navigate to the 'entries' section
3. Navigate to the first item in 'entries' (each story has its own entry, and we want the first headline)
4. Navigate to the 'title' section 
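
The four steps above can be sketched one at a time. The dictionary below is a made-up, stripped-down stand-in for what a parsed feed looks like (real feeds have many more keys), so the names and values are assumptions for illustration only.

```python
# 1. The entire (mock) RSS feed: a dictionary that contains a list of dictionaries
mock_feed = {
    "feed": {"title": "Example News"},
    "entries": [
        {"title": "First made-up headline", "link": "https://example.com/first"},
        {"title": "Second made-up headline", "link": "https://example.com/second"},
    ],
}

entries = mock_feed["entries"]    # 2. the 'entries' section (a list, one item per story)
first_entry = entries[0]          # 3. the first item in that list (a dictionary)
headline = first_entry["title"]   # 4. the 'title' of that story

print(headline)

# The same lookup written as a single chain, as in the loop earlier
print(mock_feed["entries"][0]["title"])
```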

<br>

Now, back to replicating 'TwoHeadlinesBot'

In [None]:
my_list = [] # create a new, empty list called 'my_list'

for i in range(0,10): 
    my_list.append(nyt_feed['entries'][i]['title']) # append the first ten titles to this list

In [None]:
my_list[3] # select the item at index 3 of that list (the fourth title)

In [None]:
Article4 = my_list[3]

In [None]:
Article4[:25] # get the first 25 characters of the title of the 3rd index (fourth article) in our list

In [None]:
len(Article4) # how many characters long is our title? 

In [None]:
len(Article4)/2 # figure out the half-way point of the title 

In [None]:
Article4[:len(Article4)//2] # get the first half of our article title, whatever its length

In [None]:
Article5 = my_list[4] # let's see what the next title is in our list
print(Article5)

So, how do we want to mash our headlines together?

In [None]:
nyt_first_story = nyt_feed['entries'][0]['title'] #Recall that '0' is actually the first instance
print(nyt_first_story)

In [None]:
words = nyt_first_story.split(' ') # remember, I can split that single sentence into a list of individual words 
print(words) 

In [None]:
for i in range(0,10): 

    nyt_first_story = nyt_feed['entries'][i]['title'] # pull the title of the ith story in the first RSS feed
    espn_first_story = espn_feed['entries'][i]['title'] # pull the title of the ith story in the second RSS feed

    nyt_words = nyt_first_story.split(' ') # split the title by spaces (aka, make every word in the title its own item)
    espn_words = espn_first_story.split(' ') # split the title by spaces (aka, make every word in the title its own item)
    
# these prints sit outside the loop, so they show only the words from the last pair of titles
print(nyt_words) 
print(" --- ") # print a line for formatting purposes
print(espn_words)

Let's keep going. Remember, we want to take half of one headline and half of a different headline and mash them together. So, how do we get just the first or second half of a list of words?  <br>

In [None]:
for i in range(0,10): 

    nyt_first_story = nyt_feed['entries'][i]['title'] # pull the title of the ith story in the first RSS feed
    espn_first_story = espn_feed['entries'][i]['title'] # pull the title of the ith story in the second RSS feed

    nyt_words = nyt_first_story.split(' ') # split the title by spaces (aka, make every word in the title its own item)
    espn_words = espn_first_story.split(' ') # split the title by spaces (aka, make every word in the title its own item)

    nyt_words = nyt_words[:int(len(nyt_words)/2)] 
    espn_words = espn_words[int(len(espn_words)/2):]
    
print(nyt_words)
print(" --- ")
print(espn_words)

## Walkthrough of our Code

1) First, the `[:`    

In [None]:
# the ':' at the front of a slice means 'everything leading up to this point'. For instance: 

letters = ['a','b','c','d','e'] # named 'letters' so we don't overwrite Python's built-in 'list'
letters = letters[:3]
print(letters)

In other words, we want everything leading up to (but not including!) index 3 of our list, which gives us the first three items.

2)  Next, the `int` converts the result of the division at the end of the line of code into an integer. We need that because a position in a list must be a whole number.  


In [None]:
len(nyt_words)/2 # the result is a float, which we don't want

In [None]:
int(len(nyt_words)/2) # this gives us an integer

3)  `len` is a function that gives you the number of items in a list. For instance: 

In [None]:
letters = ['a','b','c','d','e'] # named 'letters' so we don't overwrite Python's built-in 'list'
len(letters)

4) Finally, we are taking the total number of words in the headline and dividing by two

In total, we are saying: "Take the headline, find out how many words are in the headline and divide by two. Then, take the first half of that headline and store it as the new headline." 

_Note that while for the first headline we take the first half (by putting the ':' before the midpoint), we are taking the second half of the second headline (by putting the ':' after the midpoint)._
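
The walkthrough above can be condensed into two small helper functions. This is only a sketch: the names `first_half` and `second_half` are made up for illustration, and it uses Python's integer-division operator `//`, which for list lengths gives the same result as `int(len(words)/2)`.

```python
def first_half(words):
    """Return the words before the midpoint of the list."""
    return words[:len(words) // 2]

def second_half(words):
    """Return the words from the midpoint of the list onward."""
    return words[len(words) // 2:]

words = "one two three four five".split(" ")

print(first_half(words))   # ['one', 'two']
print(second_half(words))  # ['three', 'four', 'five']
```

With an odd number of words the midpoint is rounded down, so the second half ends up one word longer than the first.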

## All together, now

Finally, we want to join the two halves of our headline and store it as the variable 'new_headline' 

In [None]:
for i in range(0,10): 

    nyt_first_story = nyt_feed['entries'][i]['title'] # pull the title of the ith story in the first RSS feed
    espn_first_story = espn_feed['entries'][i]['title'] # pull the title of the ith story in the second RSS feed

    nyt_words = nyt_first_story.split(' ') # split the title by spaces (aka, make every word in the title its own item)
    espn_words = espn_first_story.split(' ') # split the title by spaces (aka, make every word in the title its own item)

    nyt_words = nyt_words[:int(len(nyt_words)/2)] 
    espn_words = espn_words[int(len(espn_words)/2):]
    
    new_headline = nyt_words + espn_words # Take the first half of the title from the first RSS feed and add the second half of the title from the second RSS feed
    new_headline = ' '.join(new_headline) # Join the combined list of words into a single string, separated by spaces

    print(new_headline) # Print your newly created headline
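
Earlier we defined an empty `TwoHeadlines` function and imported `random`, but never came back to either. One possible way to tie them together is sketched below. The function name `two_headlines`, its parameters, and the sample titles are all assumptions for illustration; it takes plain lists of titles, so it runs without fetching any feed.

```python
import random

def two_headlines(first_titles, second_titles, rng=random):
    """Mash the first half of one random title onto the second half of another."""
    first_words = rng.choice(first_titles).split(" ")    # pick a title, split into words
    second_words = rng.choice(second_titles).split(" ")
    mashed = first_words[:len(first_words) // 2] + second_words[len(second_words) // 2:]
    return " ".join(mashed)                              # back to a single string

# Made-up titles standing in for the two feeds
news_titles = ["Senate passes new budget bill", "City council delays vote on transit plan"]
sports_titles = ["Rookie scores late winning goal", "Coach praises defense after narrow road win"]

print(two_headlines(news_titles, sports_titles))
```

With the real feeds, the two lists could be built with something like `[entry['title'] for entry in nyt_feed['entries']]` and the same for `espn_feed`.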