# Web scraping a New York Times article
### Hello, and welcome to my data analysis project.

Web scraping is the process of extracting information from a web page by taking advantage of patterns in the page's underlying code.

This project will:
- Read the web page into Python and parse the HTML using Beautiful Soup.
- Collect all of the records to build a dataset.
- Apply a tabular data structure to the dataset and export it to a CSV file.

In [24]:
# Read the web page's HTML into Python using the requests library
import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html?mcubz=0')
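
The request above assumes the page is reachable and that the server responds successfully. A minimal sketch of a more defensive fetch follows; the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page's HTML as text, raising an exception on HTTP errors."""
    # timeout stops the call from waiting indefinitely on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(...)` with the article URL should return the same text as `r.text`, while failing loudly if the page cannot be retrieved.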

In [14]:
# Print the first 500 characters of html
print(r.text[0:500])

<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page


In [21]:
# Parse the HTML using the Beautiful Soup library
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

# Each of the records in the dataset has the following format


![title](article_1.png)

In [35]:
# Let's ask Beautiful Soup to find all of the records
# This code searches the soup object for all span tags with the attribute class='short-desc'
results = soup.find_all('span', attrs={'class': 'short-desc'})

In [18]:
# results acts like a Python list, so we can check its length
len(results)

116
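
Beautiful Soup offers more than one way to express the same search. The sketch below runs three equivalent lookups on a small, hypothetical HTML snippet that mimics the article's markup (the snippet and variable names are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two records plus one unrelated span
sample_html = (
    '<span class="short-desc"><strong>Jan. 21\xa0</strong>first quote</span>'
    '<span class="short-desc"><strong>Jan. 23\xa0</strong>second quote</span>'
    '<span class="other">not a record</span>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Three ways to select span tags whose class is "short-desc"
by_attrs = sample_soup.find_all('span', attrs={'class': 'short-desc'})
by_keyword = sample_soup.find_all('span', class_='short-desc')
by_selector = sample_soup.select('span.short-desc')

print(len(by_attrs), len(by_keyword), len(by_selector))
```

All three calls should return the same two tags; which form to use is largely a matter of taste.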

In [37]:
# Check that the last result in the results object matches the last record in the article
results[-1]

<span class="short-desc"><strong>July 19 </strong>“But the F.B.I. person really reports directly to the president of the United States, which is interesting.” <span class="short-truth"><a href="https://www.usatoday.com/story/news/politics/onpolitics/2017/07/20/fbi-director-reports-justice-department-not-president/495094001/" target="_blank">(He reports directly to the attorney general.)</a></span></span>

In [36]:
# Slice the results object like a list to examine the first 3 results
results[0:3]

[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_

The goal is to extract the President's lies from the New York Times article and store them in a structured dataset.

In [38]:
# Separate each of the 116 records into its four components in order to give the dataset its structure

## Extracting the date

In [39]:
# The results object is still in the form of raw html
first_result = results[0]
# Return the first record in the results object
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [41]:
# Search for the <strong> tag using the find() method to find the date
first_result.find('strong')

<strong>Jan. 21 </strong>

In [43]:
# Access first_result's .text attribute to return a string
first_result.find('strong').text

'Jan. 21\xa0'

In [44]:
# "\xa0" is the escape sequence for the non-breaking space (&nbsp;) in the HTML source, so slice it off the end of the string
first_result.find('strong').text[0:-1]

'Jan. 21'
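
Slicing with `[0:-1]` works when there is exactly one trailing character to drop. As an alternative sketch, `str.strip()` removes any leading or trailing whitespace, and Python treats the non-breaking space as whitespace. The variable names here are illustrative only.

```python
raw_date = 'Jan. 21\xa0'  # the value returned by .text above

# strip() with no arguments removes surrounding whitespace,
# including the non-breaking space (\xa0)
clean_date = raw_date.strip()
print(repr(clean_date))
```

This version does not depend on how many whitespace characters follow the date.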

In [45]:
# Add the year to the end of the date string
first_result.find('strong').text[0:-1] +', 2017'

'Jan. 21, 2017'

## Extracting the quote

In [50]:
# Examine the first_result object again (raw html code)
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [51]:
# The quote is not wrapped in its own pair of opening and closing tags, so a different technique is needed to extract it
# first_result has a .contents attribute, which returns a Python list of its "children": the tags and strings nested directly within the tag
# Index this list to extract the second element

In [47]:
# Second element of the object
first_result.contents[1]

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

In [62]:
# Slice off the curly quotation marks and space at the end
first_result.contents[1][1:-2]

"I wasn't a fan of Iraq. I didn't want to go into Iraq."
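
To make the structure of `.contents` easier to see, the sketch below rebuilds the first record from a string and prints the type of each child. It also strips the quotation marks with `str.strip()` rather than fixed slice positions, which is an alternative to the slicing above, not the notebook's own method.

```python
from bs4 import BeautifulSoup

# The first record from the article, reproduced as a string
record_html = (
    '<span class="short-desc"><strong>Jan. 21\xa0</strong>'
    '“I wasn\'t a fan of Iraq. I didn\'t want to go into Iraq.” '
    '<span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/'
    'in-2002-donald-trump-said-he-supported-invading-iraq-on-the" '
    'target="_blank">(He was for an invasion before he was against it.)</a></span></span>'
)
record = BeautifulSoup(record_html, 'html.parser').find('span')

# .contents lists the direct children: a Tag, a NavigableString, and another Tag
for child in record.contents:
    print(type(child).__name__)

# Strip surrounding whitespace first, then the curly quotation marks
quote = record.contents[1].strip().strip('“”')
print(quote)
```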

## Extracting the explanation

In [64]:
# The explanation is the third element of the .contents list
first_result.contents[2]

<span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>

In [65]:
# Search for the <a> tag that surrounds the explanation text
first_result.find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

In [66]:
# Slice off the opening and closing parentheses
first_result.find('a').text[1:-1]

'He was for an invasion before he was against it.'

## Extracting the URL 

In [67]:
first_result.find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

In [68]:
# Access the tag's href attribute like a dictionary key
first_result.find('a')['href']

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'
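
Dictionary-style access raises a `KeyError` when the attribute is missing. The sketch below compares it with the tag's `.get()` method on two hypothetical links (the markup and URL are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical links: one with an href attribute and one without
links = BeautifulSoup(
    '<a href="https://example.com/source">(with link)</a><a>(no link)</a>',
    'html.parser',
).find_all('a')

with_href, without_href = links

print(with_href['href'])         # same style of lookup as above
print(without_href.get('href'))  # .get() returns None instead of raising KeyError
```

Every record in this article has a link, so either form works here; `.get()` is mainly useful when the markup is less regular.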

## Building the dataset

In [69]:
# Create a loop to extract all 116 results
# The output will be stored in a list of tuples called records
records = []
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'
    quote = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append((date, quote, explanation, url))

In [71]:
len(records)

116
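
The loop above relies on every record having the same structure, which holds for this article. For pages where that is less certain, a more defensive variant could look like the sketch below. The function name `parse_record`, the sample markup, and the choice to return `None` for incomplete records are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup

def parse_record(result, year='2017'):
    """Return (date, quote, explanation, url), or None if a piece is missing."""
    date_tag = result.find('strong')
    link = result.find('a')
    if date_tag is None or link is None or len(result.contents) < 2:
        return None
    date = date_tag.text.strip() + ', ' + year
    quote = str(result.contents[1]).strip().strip('“”')
    explanation = link.text.strip().strip('()')
    return (date, quote, explanation, link.get('href'))

# Hypothetical markup: one well-formed record and one without a link
sample = BeautifulSoup(
    '<span class="short-desc"><strong>Jan. 21\xa0</strong>“A quote.” '
    '<span class="short-truth"><a href="https://example.com/a">(A correction.)</a></span></span>'
    '<span class="short-desc"><strong>Jan. 22\xa0</strong>“No link here.”</span>',
    'html.parser',
)
parsed = [parse_record(r) for r in sample.find_all('span', attrs={'class': 'short-desc'})]
records_ok = [p for p in parsed if p is not None]
print(records_ok)
```

Comparing `len(records_ok)` with `len(parsed)` then shows how many records were skipped.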

In [70]:
# Check the first 3 records
records[0:3]

[('Jan. 21, 2017',
  "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
  'He was for an invasion before he was against it.',
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
 ('Jan. 21, 2017',
  'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.',
  'Trump was on the cover 11 times and Nixon appeared 55 times.',
  'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'),
 ('Jan. 23, 2017',
  'Between 3 million and 5 million illegal votes caused me to lose the popular vote.',
  "There's no evidence of illegal voting.",
  'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html')]

In [72]:
# Convert list of tuples into a DataFrame

In [73]:
import pandas as pd
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])

In [74]:
# View first 5 rows of the dataframe
df.head()

|   | date | lie | explanation | url |
|---|------|-----|-------------|-----|
| 0 | Jan. 21, 2017 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | Jan. 21, 2017 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | Jan. 23, 2017 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | Jan. 25, 2017 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | Jan. 25, 2017 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


In [75]:
# Convert the date column to datetime format
df['date'] = pd.to_datetime(df['date'])
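
If some date strings cannot be parsed, `pd.to_datetime` raises an error by default. A hedged sketch of one way to surface such rows instead: passing `errors='coerce'` converts unparseable values to `NaT`, which can then be counted. The sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical values: one date in the article's style and one malformed entry
dates = pd.Series(['Jan. 21, 2017', 'not a date'])

# errors='coerce' turns unparseable strings into NaT instead of raising
parsed_dates = pd.to_datetime(dates, errors='coerce')
print(parsed_dates.isna().sum())  # number of values that failed to parse
```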

In [76]:
df.head()

|   | date | lie | explanation | url |
|---|------|-----|-------------|-----|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


## Exporting the dataset to a CSV file

In [78]:
# The index parameter is set to False
# so that the index (integers 0 to 115) is not included in the CSV file.
df.to_csv('trump_quotes.csv', index=False, encoding='utf-8')
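
A CSV file stores dates as plain text, so the datetime type is lost on export. The sketch below shows one way to check a round trip: it writes a hypothetical one-row frame to an in-memory buffer (standing in for the file on disk) and reads it back with `parse_dates`.

```python
import io
import pandas as pd

# Hypothetical one-row frame with the same columns as df
sample_df = pd.DataFrame(
    [(pd.Timestamp('2017-01-21'), 'A quote.', 'A correction.', 'https://example.com/a')],
    columns=['date', 'lie', 'explanation', 'url'],
)

buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)  # write to memory instead of disk
buffer.seek(0)

# parse_dates asks pandas to convert the date column back to datetime
restored = pd.read_csv(buffer, parse_dates=['date'])
print(restored.dtypes)
```

The same `parse_dates=['date']` argument should work when reading `trump_quotes.csv` back from disk.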

## Thank you! 