# Scraping White House links

This notebook shows how to scrape White House statements: for each statement we collect its type, its link text, and its link address.

As always, we will begin by importing our libraries.

In [1]:
# 1. importing useful libraries

import requests # to get the website
import time     # to force our code to wait a little before re-trying to grab a webpage
import re       # to grab the exact element we need
from bs4 import BeautifulSoup # to grab the html elements we need

Now let's create:
- data: an empty list, to which we will append our content
- my_headers: a dictionary that tells the server which kind of browser is requesting the webpage

In [2]:
# create an empty list
data  = [] 
# access the webpage as Chrome
my_headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

## 1. Get the page

First, we will try to get the page up to five times:
- if we are unsuccessful, requests.get will raise an exception, and we will print that the attempt failed
- if we are successful, we will break out of the loop

At the end, we will tell the user whether we got the page or not
- try to get a page that does not exist, e.g. 'https://www.whitsadsadasdasdehouse.gov/briefings-statements/page/1/', and see what happens!

In [3]:
# Give the url of the page
page = 'https://www.whitehouse.gov/briefings-statements/page/1/' 
# Initialize src to be False
src  = False

# Now get the page

# try to scrape 5 times
for i in range(1,6): 
    try:
        # get url content
        response = requests.get(page, headers = my_headers)
        # get the html content
        src = response.content
        # if we successfully got the page, break the loop
        break 
    # if requests.get() raised an exception, i.e., the attempt to get the response failed
    except requests.exceptions.RequestException:
        print('failed attempt #', i)
        # wait 2 secs before trying again
        time.sleep(2)

# if we could not get the page 
if not src:
    # all attempts failed; let the user know
    print('Could not get page: ', page)
else:
    # got the page, let the user know
    print('Successfully got page: ', page)

Successfully got page:  https://www.whitehouse.gov/briefings-statements/page/1/
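The retry logic above can also be wrapped in a small reusable function. The sketch below is only an illustration: `fetch_with_retries` and `flaky_fetch` are hypothetical names that do not appear in this notebook, and `flaky_fetch` simulates a download that fails twice so that the example runs without a network connection.

```python
import time

def fetch_with_retries(fetch, attempts=5, wait=2):
    """Call fetch() up to `attempts` times; return its result, or None if every attempt fails."""
    for i in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as err:
            print('failed attempt #', i, '-', err)
            # wait before retrying, but not after the final attempt
            if i < attempts:
                time.sleep(wait)
    return None

# hypothetical stand-in for a download: fails twice, then succeeds
calls = {'n': 0}

def flaky_fetch():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated network failure')
    return b'<html>ok</html>'

src = fetch_with_retries(flaky_fetch, attempts=5, wait=0)
print(src)
```

With requests, the callable could be something like `lambda: requests.get(page, headers=my_headers, timeout=10).content`. Note that `requests.get` returns normally for HTTP errors such as 404; calling `response.raise_for_status()` is one way to turn those into exceptions as well.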


## 2. Find where the information is contained by using "inspect"

We now will need to get, for each statement, the information on the statement type, link text, and link address.

For example, see below

<font color='red'>These pictures were taken on 10/10/2019 at 9:00 AM.

As the White House continuously releases statements, your results will vary: your code will simply pick up the latest statements instead.</font>

<br>

<img src="http://drive.google.com/uc?export=view&id=1DvcN_BhQCS_zD_E2mXZdncsdbhWi9jti"  width="600">

<br>

The fastest way to scrape is through the "inspect" function on your browser
- Right click anywhere on the page, then click inspect (step 0)
- Click on select an element to inspect it (step 1)
- Hover your mouse over the page until you find your element, or click to focus on that one (step 2)
- See the HTML code where your information is contained

The steps to find the HTML tag that contains the statement type of the first statement ("STATEMENTS & RELEASES") are shown below

<br>

<img src="http://drive.google.com/uc?export=view&id=11YLb6ojElT-6U4WcDprqTpwUuRqDTB1v" width="600">

<br>

Similarly, the steps to find the HTML element that contains the link text and the link URL are shown below

<br>
<img src="http://drive.google.com/uc?export=view&id=1NHGdgfdMXNA0PwdCZA2SizxKfHMXYoLm" width="600">
<br>

## 3. Use beautiful soup to parse the information

Next, we will use the powerful functionality of Beautiful Soup to parse the information that we need.

First, we will turn our HTML into a BeautifulSoup object.

In [4]:
# 1. decode('ascii', 'ignore') turns the downloaded bytes into a string and silently drops any non-ASCII characters
# 2. 'lxml' tells the Beautiful Soup library which parser to use to read the HTML in the src variable

soup = BeautifulSoup(src.decode('ascii', 'ignore'), 'lxml')
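One side effect of `decode('ascii', 'ignore')` is worth seeing in isolation: characters that are not ASCII are removed rather than converted. The sketch below uses a made-up byte string containing a non-breaking space; it may be the reason a title such as "Congresson" appears fused in the outputs further down, although that is a guess and not something this notebook confirms.

```python
# a made-up title whose two words are separated by a non-breaking space (U+00A0)
raw = 'Congress\u00a0on the Continuation'.encode('utf-8')

# decoding as ASCII and ignoring errors drops the non-breaking space entirely
ascii_text = raw.decode('ascii', 'ignore')

# decoding as UTF-8 keeps it, so we can replace it with a regular space
utf8_text = raw.decode('utf-8')
normalized = utf8_text.replace('\u00a0', ' ')

print(repr(ascii_text))
print(repr(normalized))
```

If keeping such characters matters, decoding as UTF-8 (assuming the page is served in that encoding) and normalizing afterwards is one alternative.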

Next, we need to find the tag that "contains" the information we want. 

Inspecting the HTML, we find that the tag "article" with an attribute of type class and value "briefing-statement briefing-statement--results" contains both tags we want to parse: 
1. the p tag with an attribute of type class and value "briefing-statement__type" 
2. the h2 tag with an attribute of type class and value "briefing-statement__title", which in turn contains the a tag with the link

By saying "contain" we mean that this information is found after the article tag opens and before it closes.
This is easy to see in the "inspect view" because everything that an element contains is indented more than that element!

See for example below:

<br>
<img src="http://drive.google.com/uc?export=view&id=1PK1_UjH3a-VoPw6JHBODhHYm-83gjEZl" width="600">
<br>
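To make the idea of "containing" concrete, here is a minimal sketch on a hand-written HTML snippet. The snippet and the example.com link are invented for illustration, and the built-in 'html.parser' is used instead of 'lxml' only so that the sketch has no extra dependency.

```python
from bs4 import BeautifulSoup

# invented HTML: one <p> sits inside the <article>, another sits outside it
html = """
<article class="briefing-statement briefing-statement--results">
  <p class="briefing-statement__type">Remarks</p>
  <h2 class="briefing-statement__title"><a href="https://example.com/remarks/">Example title</a></h2>
</article>
<p class="footer">Outside the article</p>
"""

soup = BeautifulSoup(html, 'html.parser')
article = soup.find('article')

# searching from the article only sees tags between <article> and </article>
inside = [p.text.strip() for p in article.find_all('p')]

# searching from the whole page also sees the <p> outside the article
everywhere = [p.text.strip() for p in soup.find_all('p')]

print(inside)
print(everywhere)
```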

To find all article tags, we simply ask Beautiful Soup to look for this tag in the "soup" variable, which contains the HTML

In [5]:
articles = soup.findAll('article')

Now verify that the first element of the articles list contains an article.

In [6]:
articles[0]

<article class="briefing-statement briefing-statement--results">
<div class="briefing-statement__content">
<p class="briefing-statement__type">
				Remarks			</p>
<h2 class="briefing-statement__title"><a href="https://www.whitehouse.gov/briefings-statements/remarks-president-trump-international-association-chiefs-police-annual-conference-exposition-chicago-il/">Remarks by President Trump at the International Association of Chiefs of Police Annual Conference and Exposition | Chicago, IL</a></h2>
<div class="meta meta--left">
<div class="meta__issue">
<p class="issue-flag issue-flag--left">
<a href="https://www.whitehouse.gov/issues/law-justice/">
			Law &amp; Justice		</a>
</p>
</div>
<p class="meta__date">
<time>Oct 28, 2019</time>
</p>
</div>
</div>
</article>

You can take a look at the attributes of that HTML tag and verify that they are the ones we want

In [7]:
articles[0].attrs

{'class': ['briefing-statement', 'briefing-statement--results']}

As article tags may be used for other kinds of information, we would like to keep only the articles with the right attributes. To do this, we tell Beautiful Soup to keep only the articles that are briefing statements, as shown below:

In [8]:
# keep only <article> tags whose class contains the "briefing-statement" substring
articles = soup.findAll('article', {'class':re.compile('briefing-statement')})
len(articles)

10
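When a compiled regular expression is passed as a filter, Beautiful Soup tests it with the pattern's `search()` method, so any class value that contains the substring qualifies. The sketch below shows this matching on its own with the standard `re` module; the list of class names is made up for illustration.

```python
import re

# class values as they might appear on different tags (made-up list)
classes = ['briefing-statement', 'briefing-statement--results', 'meta__date']

# substring match: the same kind of pattern used in the cell above
pattern = re.compile('briefing-statement')
matches = [c for c in classes if pattern.search(c)]

# anchoring the pattern with ^ and $ would accept only the exact class name
exact_pattern = re.compile(r'^briefing-statement$')
exact_matches = [c for c in classes if exact_pattern.search(c)]

print(matches)
print(exact_matches)
```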

Alright, now that we have all articles, let's grab the correct information for each article.
We need to grab two kinds of information:
    
- the text that is contained within the **p** tag with attribute "class" of value "briefing-statement__type"
- (i) the value of the "href" attribute and (ii) the text of the **a** tag _that is contained_ within the **h2** tag with attribute "class" of value "briefing-statement__title"

Let's first do this with the first article, and then do it for every article using a for loop

In [9]:
article = articles[0]
article

<article class="briefing-statement briefing-statement--results">
<div class="briefing-statement__content">
<p class="briefing-statement__type">
				Remarks			</p>
<h2 class="briefing-statement__title"><a href="https://www.whitehouse.gov/briefings-statements/remarks-president-trump-international-association-chiefs-police-annual-conference-exposition-chicago-il/">Remarks by President Trump at the International Association of Chiefs of Police Annual Conference and Exposition | Chicago, IL</a></h2>
<div class="meta meta--left">
<div class="meta__issue">
<p class="issue-flag issue-flag--left">
<a href="https://www.whitehouse.gov/issues/law-justice/">
			Law &amp; Justice		</a>
</p>
</div>
<p class="meta__date">
<time>Oct 28, 2019</time>
</p>
</div>
</div>
</article>

In [10]:
# first grab the <p> tag and everything it contains from the first article
# to do so, simply use Beautiful Soup's "find" functionality 
# ("find" returns only the first occurrence, which is the <p> tag we want)

p = article.find('p')
p



<p class="briefing-statement__type">
				Remarks			</p>

In [11]:
# grab the text from the p tag simply using the .text attribute of Beautiful Soup
article_type = p.text
article_type

'\n\t\t\t\tRemarks\t\t\t'

In [12]:
# strip unnecessary whitespace from the text by using the .strip() method of Python strings
article_type = article_type.strip()

In [13]:
article_type

'Remarks'

Finding the **a** tag url and text is slightly trickier, because there are multiple **a** tags within the **article** tag that we grabbed. 

To find the right one, we simply need to note that this **a** tag is contained within the **h2** tag, and that there is only one **h2** tag in the **article** tag.

<br>
<img src="http://drive.google.com/uc?export=view&id=1PK1_UjH3a-VoPw6JHBODhHYm-83gjEZl" width="600">
<br>

As such, we need to perform two searches

In [15]:
# find h2 within article
h2 = article.find('h2')

# find a tag within h2
a  = h2.find('a')

a

<a href="https://www.whitehouse.gov/briefings-statements/text-letter-counsel-vice-president-chairman-cummings-chairman-engel-chairman-schiff/">Text of a Letter from the Counsel to the Vice President to Chairman Cummings, Chairman Engel and Chairman Schiff</a>

In [16]:
# get the url
article_url  = a.attrs['href']
article_text = a.text.strip()

In [17]:
print('type  = '+ article_type)
print('url   = '+ article_url)
print('text  = '+ article_text)


type  = Statements & Releases
url   = https://www.whitehouse.gov/briefings-statements/text-letter-counsel-vice-president-chairman-cummings-chairman-engel-chairman-schiff/
text  = Text of a Letter from the Counsel to the Vice President to Chairman Cummings, Chairman Engel and Chairman Schiff


### Et voila!!!
<br><br><br>


# 4. Scraping all articles in webpage
Now to scrape all articles from the webpage, simply use a for loop!

In [18]:
data = []

articles = soup.findAll('article', {'class':re.compile('briefing-statement')})

for article in articles:
    
    # find p, grab text, strip() it
    p = article.find('p')
    article_type = p.text.strip()
    
    # find h2, find a, grab a's url and text
    h2 = article.find('h2')
    a  = h2.find('a')
    article_url  = a.attrs['href']
    article_text = a.text.strip()
    
    # add all the info to the data list
    data.append([article_type, article_url, article_text])

for article in data:
    print(article)

['Statements & Releases', 'https://www.whitehouse.gov/briefings-statements/text-letter-counsel-vice-president-chairman-cummings-chairman-engel-chairman-schiff/', 'Text of a Letter from the Counsel to the Vice President to Chairman Cummings, Chairman Engel and Chairman Schiff']
['Remarks', 'https://www.whitehouse.gov/briefings-statements/remarks-president-trump-welcoming-2019-stanley-cup-champions-st-louis-blues/', 'Remarks by President Trump Welcoming the 2019 Stanley Cup Champions, St. Louis Blues']
['Statements & Releases', 'https://www.whitehouse.gov/briefings-statements/text-message-congress-continuation-national-emergency-respect-significant-narcotics-traffickers-centered-colombia/', 'Text of a Message to the Congresson the Continuation of the National Emergency with Respect to Significant Narcotics Traffickers Centered in Colombia']
['Statements & Releases', 'https://www.whitehouse.gov/briefings-statements/text-notice-continuation-national-emergency-respect-significant-narcotics-

# 5. Scraping all articles in webpage with failsafes

Below I show you how to deal with elements that cannot be found


In [19]:
data = []

articles = soup.findAll('article', {'class':re.compile('briefing-statement')})

for article in articles:
    
    # initialize to not found
    article_type = 'NA'
    article_url  = 'NA'
    article_text = 'NA'
    
    # find p, grab text, strip() it
    p = article.find('p')
    
    # if you found it
    if p:
        article_type = p.text.strip()
    
    # find h2, find a, grab a's url and text
    h2 = article.find('h2')
    # only search inside h2 if it exists; calling .find on None would raise an AttributeError
    a  = h2.find('a') if h2 else None
    
    # if you found it
    if a:
        article_url  = a.attrs['href']
        article_text = a.text.strip()
    
    # add all the info to the data list
    # if some element hasn't been found, the 'NA' string will be added
    data.append([article_type, article_url, article_text])

for article in data:
    print(article)

['Statements & Releases', 'https://www.whitehouse.gov/briefings-statements/text-letter-counsel-vice-president-chairman-cummings-chairman-engel-chairman-schiff/', 'Text of a Letter from the Counsel to the Vice President to Chairman Cummings, Chairman Engel and Chairman Schiff']
['Remarks', 'https://www.whitehouse.gov/briefings-statements/remarks-president-trump-welcoming-2019-stanley-cup-champions-st-louis-blues/', 'Remarks by President Trump Welcoming the 2019 Stanley Cup Champions, St. Louis Blues']
['Statements & Releases', 'https://www.whitehouse.gov/briefings-statements/text-message-congress-continuation-national-emergency-respect-significant-narcotics-traffickers-centered-colombia/', 'Text of a Message to the Congresson the Continuation of the National Emergency with Respect to Significant Narcotics Traffickers Centered in Colombia']
['Statements & Releases', 'https://www.whitehouse.gov/briefings-statements/text-notice-continuation-national-emergency-respect-significant-narcotics-
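The failsafe logic can also be collected in one helper so that every missing piece falls back to 'NA'. This is a sketch under assumptions: `parse_article` is a hypothetical name, the HTML is invented, and 'html.parser' is used only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

def parse_article(article):
    """Return [type, url, text] for one <article>, using 'NA' for anything missing."""
    article_type = article_url = article_text = 'NA'

    p = article.find('p')
    if p is not None:
        article_type = p.text.strip()

    # guard the <h2> first: a missing <h2> must not crash the search for <a>
    h2 = article.find('h2')
    a = h2.find('a') if h2 is not None else None
    if a is not None:
        # .get returns the fallback if the <a> tag has no href attribute
        article_url = a.get('href', 'NA')
        article_text = a.text.strip()

    return [article_type, article_url, article_text]

# invented HTML: the second article has no <h2> and therefore no link
html = """
<article class="briefing-statement"><p>Remarks</p><h2><a href="https://example.com/a/">Title A</a></h2></article>
<article class="briefing-statement"><p>Statements</p></article>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = [parse_article(article) for article in soup.find_all('article')]

for row in rows:
    print(row)
```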

# 6. Scraping many pages!

To scrape many pages, we simply add a for loop to everything...




In [20]:
data = []
numPages = 5

for k in range(1,numPages+1):
    
    # Give the url of the page
    page = 'https://www.whitehouse.gov/briefings-statements/page/'+str(k)+'/' 
    # Initialize src to be False
    src  = False

    # Now get the page

    # try to scrape 5 times
    for i in range(1,6): 
        try:
            # get url content
            response = requests.get(page, headers = my_headers)
            # get the html content
            src = response.content
            # if we successfully got the page, break the loop
            break 
        # if requests.get() raised an exception, i.e., the attempt to get the response failed
        except requests.exceptions.RequestException:
            print('failed attempt #', i)
            # wait 2 secs before trying again
            time.sleep(2)

    # if we could not get the page 
    if not src:
        # all attempts failed; let the user know
        print('Could not get page: ', page)
        # move on to the next page
        continue 
    else:
        # got the page, let the user know
        print('Successfully got page: ', page)
    

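    # parse the page we just downloaded; without this step we would keep reading the soup built from page 1
    soup = BeautifulSoup(src.decode('ascii', 'ignore'), 'lxml')

    articles = soup.findAll('article', {'class':re.compile('briefing-statement')})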

    for article in articles:

        # initialize to not found
        article_type = 'NA'
        article_url  = 'NA'
        article_text = 'NA'

        # find p, grab text, strip() it
        p = article.find('p')

        # if you found it
        if p:
            article_type = p.text.strip()

        # find h2, find a, grab a's url and text
        h2 = article.find('h2')
        # only search inside h2 if it exists; calling .find on None would raise an AttributeError
        a  = h2.find('a') if h2 else None

        # if you found it
        if a:
            article_url  = a.attrs['href']
            article_text = a.text.strip()

        # add all the info to the data list
        # if some element hasn't been found, the 'NA' string will be added
        data.append([article_type, article_url, article_text])

Successfully got page:  https://www.whitehouse.gov/briefings-statements/page/1/
Successfully got page:  https://www.whitehouse.gov/briefings-statements/page/2/
Successfully got page:  https://www.whitehouse.gov/briefings-statements/page/3/
Successfully got page:  https://www.whitehouse.gov/briefings-statements/page/4/
Successfully got page:  https://www.whitehouse.gov/briefings-statements/page/5/
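Two small helpers can keep a multi-page scrape tidy: building the list of page URLs up front, and dropping rows that show up twice. Duplicates are an assumption here: they could occur if new statements are published while the loop is running and older ones shift to the next page. The names `pages` and `drop_duplicates` and the sample rows are hypothetical.

```python
# build every page URL up front from the pattern used in the loop above
base_url = 'https://www.whitehouse.gov/briefings-statements/page/{}/'
num_pages = 5
pages = [base_url.format(k) for k in range(1, num_pages + 1)]

def drop_duplicates(rows):
    """Keep the first row seen for each URL (the URL is element 1 of each row)."""
    seen = set()
    unique = []
    for row in rows:
        url = row[1]
        if url not in seen:
            seen.add(url)
            unique.append(row)
    return unique

# made-up rows in the same [type, url, text] layout as the data list
sample = [
    ['Remarks', 'https://example.com/a/', 'Title A'],
    ['Remarks', 'https://example.com/a/', 'Title A'],
    ['Statements & Releases', 'https://example.com/b/', 'Title B'],
]
unique_rows = drop_duplicates(sample)

print(pages[0])
print(len(unique_rows))
```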


In [21]:
for article in data:
    print(article)

['Statements & Releases', 'https://www.whitehouse.gov/briefings-statements/text-letter-counsel-vice-president-chairman-cummings-chairman-engel-chairman-schiff/', 'Text of a Letter from the Counsel to the Vice President to Chairman Cummings, Chairman Engel and Chairman Schiff']
['Remarks', 'https://www.whitehouse.gov/briefings-statements/remarks-president-trump-welcoming-2019-stanley-cup-champions-st-louis-blues/', 'Remarks by President Trump Welcoming the 2019 Stanley Cup Champions, St. Louis Blues']
['Statements & Releases', 'https://www.whitehouse.gov/briefings-statements/text-message-congress-continuation-national-emergency-respect-significant-narcotics-traffickers-centered-colombia/', 'Text of a Message to the Congresson the Continuation of the National Emergency with Respect to Significant Narcotics Traffickers Centered in Colombia']
['Statements & Releases', 'https://www.whitehouse.gov/briefings-statements/text-notice-continuation-national-emergency-respect-significant-narcotics-

# 7. Saving data for later use

To save the data, we will create a txt file and write one statement per row, as follows:

**every row will have the article_type followed by the article_url followed by the article_text, separated by tabs**

Doing so is easy




In [22]:
with open('whitehouse_statements.txt', mode='w', encoding='utf-8') as f:
    for statement in data:
        f.write(statement[0] + '\t' + statement[1] + '\t' + statement[2] + '\n')

FileNotFoundError: [Errno 2] No such file or directory: 'files/whitehouse_statements.txt'
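The error above mentions a `files/` folder, so it appears to come from a run that wrote to `files/whitehouse_statements.txt` while that folder did not exist (an inference from the message, since the cell as shown writes to the current folder). Opening a file in 'w' mode creates the file but not its parent folders. The sketch below creates the folder first and uses the standard `csv` module with a tab delimiter, which also copes with titles that contain tabs. The rows are made up, and a temporary folder is used so the example leaves nothing behind.

```python
import csv
import os
import tempfile

# made-up rows in the same [type, url, text] layout as the data list
rows = [
    ['Remarks', 'https://example.com/a/', 'Title A'],
    ['Statements & Releases', 'https://example.com/b/', 'Title with a\ttab inside'],
]

# create the target folder first; exist_ok=True makes this safe to re-run
out_dir = os.path.join(tempfile.mkdtemp(), 'files')
os.makedirs(out_dir, exist_ok=True)
path = os.path.join(out_dir, 'whitehouse_statements.txt')

# csv.writer quotes any field that contains the delimiter
with open(path, mode='w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerows(rows)

# read the file back to confirm that nothing was lost
with open(path, mode='r', encoding='utf-8', newline='') as f:
    rows_back = list(csv.reader(f, delimiter='\t'))

print(rows_back == rows)
```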

# 8. Opening data to use it


To load this data back from the file, we simply need to read the file, split it into lines, and split each line on the tab character:

In [None]:
with open('whitehouse_statements.txt', mode='r', encoding = 'utf-8') as f:
    data = f.read()
    
# break into lines, then break each line into article_type, article_url, article_text
# throw away the last element because it is simply an empty string (the file ends with a newline)
data = data.split('\n')[0:-1]

for i in range(0,len(data)):
    data[i] = data[i].split('\t')
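As a variation, each line can be turned into a dictionary so that later code can refer to fields by name instead of by position. This is a sketch with invented content: `io.StringIO` stands in for the opened file, and the field names are an assumption chosen to mirror the three columns.

```python
import io

# stand-in for the opened file, with two made-up rows
fake_file = io.StringIO(
    'Remarks\thttps://example.com/a/\tTitle A\n'
    'Statements & Releases\thttps://example.com/b/\tTitle B\n'
)

fields = ['type', 'url', 'text']
records = []

# iterating over a file yields one line at a time, including the trailing newline
for line in fake_file:
    line = line.rstrip('\n')
    # skip empty lines instead of slicing off the last element
    if not line:
        continue
    records.append(dict(zip(fields, line.split('\t'))))

print(records[0]['url'])
```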



# <font color='red'> If you understood this notebook, HW2 will be a walk in the park </font>