# Reading the website in Python

In [1]:
import requests

In [2]:
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

In [3]:
# Print the first 500 characters of the response body

In [4]:
print(r.text[0:500])  

<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page


# Parsing the HTML using Beautiful Soup

We're going to parse the HTML using Beautiful Soup 4, a popular Python library for web scraping.

In [5]:
from bs4 import BeautifulSoup  
soup = BeautifulSoup(r.text, 'html.parser') 

The code above parses the HTML (stored in `r.text`) into a special object called `soup` that the Beautiful Soup library understands. In other words, Beautiful Soup is reading the HTML and making sense of its structure.

(**Note** that `html.parser` is the parser included with the Python standard library, though Beautiful Soup can use other parsers as well. See the Beautiful Soup documentation on the differences between parsers to learn more.)
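To make the idea of a parsed document concrete, here is a minimal sketch that parses a small hand-written HTML string. The `sample_html` snippet and the names `sample_soup` and `paragraph` are made up for illustration and are not part of the article being scraped.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, standing in for r.text
sample_html = "<html><body><p class='intro'>Hello, <b>world</b>!</p></body></html>"

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Tags can be located by name, and nested tags through their parent
paragraph = sample_soup.find('p')
print(paragraph['class'])         # class is multi-valued, so a list comes back
print(paragraph.find('b').text)   # text inside the nested <b> tag
print(paragraph.get_text())       # all text inside <p>, with tags removed
```

Once the string has been parsed, the structure can be navigated by tag name rather than by searching the raw text.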

# Collecting all of the records

The Python code above is the standard code I use with every web scraping project. Now, we're going to start taking advantage of the patterns we noticed in the article formatting to build our dataset!

You might have noticed that each record has the following format:

`<span class="short-desc"><strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span></span>`

There's an outer `<span>` tag, and then nested within it is a `<strong>` tag plus another `<span>` tag, which itself contains an `<a>` tag. All of these tags affect the formatting of the text. And because the New York Times wants each record to appear in a consistent way in your web browser, we know that each record will be tagged in a consistent way in the HTML. This is the pattern that allows us to build our dataset!

**Let's ask Beautiful Soup to find all of the records:**

In [6]:
results = soup.find_all('span', attrs={'class':'short-desc'}) 

In [7]:
len(results)

180
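As a side note, Beautiful Soup also accepts the class as a `class_` keyword argument, which should select the same tags as the `attrs` dictionary. The sketch below checks this on a made-up two-record snippet (the claims and the `example.com` URLs are placeholders, not data from the article).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two records in the article's format
sample_html = (
    '<span class="short-desc"><strong>Jan. 1 </strong>"Claim one." '
    '<span class="short-truth"><a href="https://example.com/1">(Reason one.)</a></span></span>'
    '<span class="short-desc"><strong>Jan. 2 </strong>"Claim two." '
    '<span class="short-truth"><a href="https://example.com/2">(Reason two.)</a></span></span>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Two spellings of the same search
by_attrs = sample_soup.find_all('span', attrs={'class': 'short-desc'})
by_keyword = sample_soup.find_all('span', class_='short-desc')

# The nested "short-truth" spans are not matched by either search
print(len(by_attrs), len(by_keyword))
```

The trailing underscore in `class_` is needed because `class` is a reserved word in Python.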

We can also slice the object like a list, in order **to examine the first three results:**

In [8]:
results[0:3] 

[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_

In [9]:
results[-1] # The last result

<span class="short-desc"><strong>Nov. 11 </strong>“I'd rather have him  – you know, work with him on the Ukraine than standing and arguing about whether or not  – because that whole thing was set up by the Democrats.” <span class="short-truth"><a href="https://www.nytimes.com/interactive/2017/12/10/us/politics/trump-and-russia.html" target="_blank">(There is no evidence that Democrats "set up" Russian interference in the election.)</a></span></span>

# Extracting the date

Web scraping is often an iterative process, in which you experiment with your code until it works exactly as you desire. To simplify the experimentation, we'll start by only working with the first record in the results object, and then later on we'll modify our code to use a loop:

In [10]:
first_result = results[0]  
first_result 

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

To locate the date, we can call the `find()` method on `first_result`. It returns a single tag that matches a specific pattern, in contrast to the `find_all()` method we used above, which returns all tags that match a pattern:

In [11]:
first_result.find('strong') 

<strong>Jan. 21 </strong>
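The difference between the two methods also shows up when nothing matches. A small sketch on a made-up snippet (the variable names are illustrative only):

```python
from bs4 import BeautifulSoup

sample = BeautifulSoup(
    '<span><strong>Jan. 21</strong> text <strong>second</strong></span>',
    'html.parser',
)

first_strong = sample.find('strong')     # first matching tag only
all_strong = sample.find_all('strong')   # list-like collection of every match
missing = sample.find('em')              # no match: find() returns None
missing_all = sample.find_all('em')      # no match: find_all() returns an empty list

print(first_strong.text, len(all_strong), missing, len(missing_all))
```

Because `find()` can return `None`, chaining `.text` directly onto it raises an `AttributeError` when a record lacks the expected tag.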

In [12]:
first_result.find('strong').text

'Jan. 21\xa0'

In [13]:
first_result.find('strong').text[0:-1] # Drop the trailing non-breaking space (\xa0)

'Jan. 21'

In [14]:
first_result.find('strong').text[0:-1] + ', 2017' # Append the year to form a full date

'Jan. 21, 2017'
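Slicing with `[0:-1]` assumes there is exactly one unwanted trailing character. One possible alternative, sketched here on the string shown above, is `str.strip()`, which in Python 3 treats the non-breaking space `\xa0` as whitespace:

```python
raw_date = 'Jan. 21\xa0'   # text of the <strong> tag, as shown above

# strip() removes leading and trailing whitespace, including \xa0,
# and leaves the string unchanged if there is nothing to remove
clean_date = raw_date.strip()
full_date = clean_date + ', 2017'

print(repr(full_date))
```

This variant would keep working if some dates had no trailing space at all, though whether the article contains such cases is not something the output above tells us.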

# Extracting the lie
Let's take another look at `first_result`:

In [15]:
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

Our goal is to extract the two sentences about Iraq. Unfortunately, there isn't a pair of opening and closing tags that starts immediately before the lie and ends immediately after the lie. Therefore, we're going to have to use a different technique:

In [16]:
first_result.contents # The tag's children, returned as a Python list

[<strong>Jan. 21 </strong>,
 "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” ",
 <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>]

In [17]:
first_result.contents[1]  

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

In [18]:
first_result.contents[1][1:-2] # Drop the opening curly quote, the closing curly quote, and the trailing space

"I wasn't a fan of Iraq. I didn't want to go into Iraq."
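To see why indexing into `contents` works, it can help to look at the type of each child. The sketch below uses a made-up record in the article's format (the claim, explanation, and URL are placeholders) and also shows a `strip()`-based alternative to the `[1:-2]` slice:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical record in the same shape as the article's records
record_html = (
    '<span class="short-desc"><strong>Jan. 21\xa0</strong>“Sample claim.” '
    '<span class="short-truth"><a href="https://example.com/source">'
    '(Sample explanation.)</a></span></span>'
)
record = BeautifulSoup(record_html, 'html.parser').find('span')

# The children alternate between tags and plain strings
child_types = [type(child).__name__ for child in record.contents]
print(child_types)

# The bare text between the tags is a NavigableString, which behaves like str
raw_lie = record.contents[1]
lie = raw_lie.strip().strip('“”')   # trim whitespace, then the curly quotes
print(repr(lie))
```

The text has no tag of its own, which is why it appears as a string element in the list rather than something `find()` could locate.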

# Extracting the explanation
Based upon what you've seen already, you might have figured out that we have at least two options for how we extract the third component of the record, which is the writer's explanation of why the President's statement was a lie.

The ***first option*** is to index into the `contents` attribute, like we did when extracting the lie:

In [19]:
first_result.contents[2]

<span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>

The ***second option*** is to search for the surrounding tag, like we did when extracting the date:

In [20]:
first_result.find('a') 

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

Either way, we can access the `text` attribute and then slice off the opening and closing parentheses:

In [21]:
first_result.find('a').text[1:-1] # Drop the surrounding parentheses

'He was for an invasion before he was against it.'
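The two options should give the same text, because the `text` attribute of the inner `<span>` is just the text of the `<a>` tag it contains. A quick check on a made-up record (placeholder claim, explanation, and URL):

```python
from bs4 import BeautifulSoup

# Hypothetical record in the same shape as the article's records
record_html = (
    '<span class="short-desc"><strong>Jan. 21\xa0</strong>“Sample claim.” '
    '<span class="short-truth"><a href="https://example.com/source">'
    '(Sample explanation.)</a></span></span>'
)
record = BeautifulSoup(record_html, 'html.parser').find('span', class_='short-desc')

via_contents = record.contents[2].text   # option 1: third child, the inner <span>
via_find = record.find('a').text         # option 2: search for the <a> tag

explanation = via_find[1:-1]             # drop the surrounding parentheses
print(via_contents == via_find, repr(explanation))
```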

# Extracting the URL

Finally, we want to extract the URL of the article that substantiates the writer's claim that the President was lying.

Let's examine the `<a>` tag within `first_result`:

In [22]:
first_result.find('a') 

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

Beautiful Soup treats tag attributes and their values like key-value pairs in a dictionary: you put the attribute name in brackets (like a dictionary key), and you get back the attribute's value:

In [23]:
first_result.find('a')['href']

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'
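The dictionary analogy extends a little further: a tag also has a `get()` method and an `attrs` dictionary. A minimal sketch on a made-up link (the URL is a placeholder):

```python
from bs4 import BeautifulSoup

link = BeautifulSoup(
    '<a href="https://example.com/story" target="_blank">(text)</a>',
    'html.parser',
).find('a')

url = link['href']             # bracket access, raises KeyError if missing
target = link.get('target')    # get() returns the value if present
missing = link.get('title')    # ...and None if the attribute is absent
all_attrs = link.attrs         # every attribute as a plain dict

print(url, target, missing, all_attrs)
```

Using `get()` may be preferable when some links might lack the attribute, since it avoids an exception.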

# Building the dataset

Now that we've figured out how to extract the four components of `first_result`, we can create a loop to repeat this process on all 180 results. We'll store the output in a list of tuples called `records`:


In [24]:
records = []  
for result in results:  
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append((date, lie, explanation, url))

In [25]:
len(records) 

180
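The loop above relies on every record having the same shape. If that assumption ever failed, `find()` would return `None` and the loop would stop with an error. One way to guard against this is sketched below. The `parse_record` helper and the sample HTML are hypothetical and not part of the original code; the sketch uses `strip()` instead of fixed slices.

```python
from bs4 import BeautifulSoup

def parse_record(result, year='2017'):
    """Hypothetical helper: return (date, lie, explanation, url), or None if malformed."""
    strong = result.find('strong')
    link = result.find('a')
    if strong is None or link is None or len(result.contents) < 2:
        return None
    date = strong.text.strip() + ', ' + year
    lie = str(result.contents[1]).strip().strip('“”')
    explanation = link.text[1:-1]
    url = link.get('href')
    return (date, lie, explanation, url)

# Made-up input: one well-formed record and one malformed one
sample_html = (
    '<span class="short-desc"><strong>Jan. 21\xa0</strong>“First claim.” '
    '<span class="short-truth"><a href="https://example.com/1">(First reason.)</a></span></span>'
    '<span class="short-desc">Malformed record with no tags</span>'
)
sample_results = BeautifulSoup(sample_html, 'html.parser').find_all('span', class_='short-desc')

# Keep only the records that could be parsed
sample_records = [rec for rec in map(parse_record, sample_results) if rec is not None]
print(sample_records)
```

Comparing the length of the parsed list with the number of results is then a simple way to detect skipped records.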

In [26]:
records[0:3] # Let's do a quick spot check of the first three records:

[('Jan. 21, 2017',
  "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
  'He was for an invasion before he was against it.',
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
 ('Jan. 21, 2017',
  'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.',
  'Trump was on the cover 11 times and Nixon appeared 55 times.',
  'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'),
 ('Jan. 23, 2017',
  'Between 3 million and 5 million illegal votes caused me to lose the popular vote.',
  "There's no evidence of illegal voting.",
  'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html')]

# Applying a tabular data structure (DataFrame)

In [27]:
import pandas as pd 

In [28]:
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url']) 

In [29]:
df.head()  

|   | date | lie | explanation | url |
|---|---|---|---|---|
| 0 | Jan. 21, 2017 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | Jan. 21, 2017 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | Jan. 23, 2017 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | Jan. 25, 2017 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | Jan. 25, 2017 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


In [30]:
df.tail() 

|   | date | lie | explanation | url |
|---|---|---|---|---|
| 175 | Oct. 25, 2017 | We have trade deficits with almost everybody. | We have trade surpluses with more than 100 cou... | https://www.bea.gov/newsreleases/international... |
| 176 | Oct. 27, 2017 | Wacky & totally unhinged Tom Steyer, who has b... | Steyer has financially supported many winning ... | https://www.opensecrets.org/donor-lookup/resul... |
| 177 | Nov. 1, 2017 | Again, we're the highest-taxed nation, just ab... | We're not. | http://www.politifact.com/truth-o-meter/statem... |
| 178 | Nov. 7, 2017 | When you look at the city with the strongest g... | Several other cities, including New York and L... | http://www.politifact.com/truth-o-meter/statem... |
| 179 | Nov. 11, 2017 | I'd rather have him – you know, work with him... | There is no evidence that Democrats "set up" R... | https://www.nytimes.com/interactive/2017/12/10... |


In [31]:
df['date'] = pd.to_datetime(df['date']) 

In [32]:
df.head() 

|   | date | lie | explanation | url |
|---|---|---|---|---|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |
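Converting the column with `pd.to_datetime` is what makes date-aware operations available. As a small illustration on made-up rows in the same date format (the `sample_df` frame is hypothetical), the `.dt` accessor can then pull out components such as the month:

```python
import pandas as pd

# Hypothetical frame with dates in the same text format as above
sample_df = pd.DataFrame({'date': ['Jan. 21, 2017', 'Jan. 23, 2017', 'Nov. 11, 2017']})
sample_df['date'] = pd.to_datetime(sample_df['date'])

# With a datetime column, components are available through .dt
months = sample_df['date'].dt.month.tolist()
print(months)

# Count how many rows fall in each month
print(sample_df['date'].dt.month.value_counts().sort_index())
```

With the original strings, the same grouping would have required parsing the month names by hand.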


In [33]:
df.to_csv('trump_lies.csv')
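One detail worth knowing about `to_csv`: by default it also writes the DataFrame index, which shows up as an extra `Unnamed: 0` column when the file is read back. The sketch below demonstrates this on a made-up one-row frame, using in-memory buffers instead of files; `index=False` and `parse_dates` are standard pandas options, while the sample data is hypothetical.

```python
import io
import pandas as pd

# Hypothetical one-row frame with the same columns as df
sample_df = pd.DataFrame(
    [('2017-01-21', 'First claim.', 'First reason.', 'https://example.com/1')],
    columns=['date', 'lie', 'explanation', 'url'],
)
sample_df['date'] = pd.to_datetime(sample_df['date'])

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()
print(columns_with_index)

# index=False omits it; parse_dates restores the datetime type on reload
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index, parse_dates=['date'])
print(reloaded.columns.tolist(), reloaded['date'].dtype)
```

Whether to keep the index in the file is a matter of preference; for a default integer index it usually carries no information.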

# FINAL