# Web Page Structure

First, we make a GET request to https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html
and store the result in a "response" object.<br>
We then use **"response.content"** to get the body of the response and assign it to content.

**Note**: the content is the same as the HTML above.

In [1]:
import requests
response = requests.get("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html")
content = response.content

In [2]:
content[0:500]

b'<!DOCTYPE html>\n<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->\n<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page'
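The `b'...'` prefix in the output above indicates that response.content is a bytes object rather than a string. The following is a minimal sketch of what that means in practice; `sample_content` is a made-up stand-in for the real response body, and UTF-8 is assumed as the encoding.

```python
# Hypothetical stand-in for response.content (not the real page)
sample_content = b'<!DOCTYPE html>\n<html lang="en"><body>caf\xc3\xa9</body></html>'

# Slicing bytes returns bytes, which is why the notebook output starts with b'
head = sample_content[0:15]

# Decoding turns the bytes into a regular str (UTF-8 is an assumption here)
text = sample_content.decode("utf-8")

print(type(head).__name__, type(text).__name__)
```

Beautiful Soup accepts either form, so decoding by hand is optional; it is mainly useful when you want to print or search the raw HTML yourself.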

# Retrieving Elements from a Page

* We'll use the BeautifulSoup library to parse the Web page with Python. This library allows us to extract tags from an HTML document.<br>
* We can think of HTML documents as "trees," and the nested tags as "branches" (similar to a family tree).<br>
Note: html.parser is the parser included with the Python standard library.
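To make the tree picture concrete, here is a small sketch that uses the standard library's html.parser module directly to print an indented outline of nested tags. The class name `TreePrinter` and the sample markup are illustrative assumptions, not part of the original notebook.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Record each opening tag, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes every tag is explicitly closed (true for the sample below)
        self.depth -= 1


# Hypothetical markup shaped like one record on the page
sample_html = (
    '<span class="short-desc"><strong>Jan. 21</strong>'
    '<span class="short-truth"><a href="#">(note)</a></span></span>'
)

printer = TreePrinter()
printer.feed(sample_html)
outline = printer.lines
print("\n".join(outline))
```

Each level of indentation corresponds to one "branch" below its parent tag.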

In [3]:
from bs4 import BeautifulSoup
# Initialize the parser and pass in the content we grabbed earlier
parser = BeautifulSoup(content, 'html.parser')

* The code above parses the HTML (stored in content) into a BeautifulSoup object, which we have named parser.
* Beautiful Soup reads the HTML and builds a tree of tags that we can search and navigate.


In [4]:
results = parser.find_all("span", class_="short-desc")

The "find_all" method finds every occurrence of a tag within the current element and returns the matches as a list (a ResultSet, which is a subclass of the Python list).

In [5]:
len(results)

116
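As a small illustration of how find_all behaves, the sketch below runs it against a made-up fragment instead of the live page. The markup and the variable names (`sample_html`, `sample_soup`) are assumptions for demonstration only.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: two records plus one unrelated span
sample_html = """
<div>
  <span class="short-desc"><strong>Jan. 21 </strong>first claim</span>
  <span class="short-desc"><strong>Jan. 23 </strong>second claim</span>
  <span class="other">not a record</span>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every match as a list-like ResultSet
matches = sample_soup.find_all("span", class_="short-desc")

# find returns only the first match, or None when nothing matches
first_match = sample_soup.find("span", class_="short-desc")

# A filter that matches nothing yields an empty result, not an error
missing = sample_soup.find_all("span", class_="no-such-class")

print(len(matches), len(missing))
```

Checking the length of the result, as the notebook does, is a quick way to confirm that the tag and class filter match what you expect.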

In [6]:
results[0:3]

[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_

In [7]:
# Index 1 selects the second record, which we use as the working example
first_result = results[1]
first_result

<span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>

**Pattern of record**<br>
Each record has four components **(date, lie, explanation, and URL)** 

# Extract Date
Pass the "strong" tag name to the find_all method.

In [8]:
first_result.find_all('strong')[0].text



'Jan. 21\xa0'

In [9]:
first_result.find_all('strong')[0].text[0:-1]

'Jan. 21'
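The trailing `\xa0` is a non-breaking space. Slicing with `[0:-1]` removes it, but only because we know it is there. The sketch below compares slicing with `str.strip()`, which also treats the non-breaking space as whitespace; the variable names are illustrative.

```python
import unicodedata

raw_date = "Jan. 21\xa0"  # same value as the output above

# Identify the trailing character
trailing_name = unicodedata.name(raw_date[-1])

sliced = raw_date[0:-1]      # drops the last character, whatever it is
stripped = raw_date.strip()  # drops leading/trailing whitespace only

# If a date ever arrives without the trailing space, slicing cuts real text
clean_date = "Jan. 21"
over_sliced = clean_date[0:-1]
still_clean = clean_date.strip()

print(trailing_name, repr(sliced), repr(stripped), repr(over_sliced))
```

Slicing is fine when the format is known to be fixed, as it is on this page; `strip()` is the more forgiving choice if the markup might vary.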

In [10]:
data = first_result.contents

In [11]:
data

[<strong>Jan. 21 </strong>,
 '“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” ',
 <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span>]

# Extract Trump's Statements

In [12]:
data[1][1:-2]

'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.'

# Extract Explanation

In [13]:
data[2]

<span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span>

In [14]:
first_result.find('a').text[1:-1]

'Trump was on the cover 11 times and Nixon appeared 55 times.'

# Extract URL string
Note: the href value is stored as an attribute inside the tag itself.

In [15]:
first_result.find('a')['href']

'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'
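Tag attributes in Beautiful Soup behave much like dictionary entries. The sketch below, run on a made-up fragment, shows square-bracket access alongside `.get()`, which returns None instead of raising an error when the attribute is absent. The markup and the example.com URL are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one link with an href, one without
sample_soup = BeautifulSoup(
    '<a href="http://example.com/" target="_blank">(text)</a><a>no link</a>',
    "html.parser",
)
with_href, without_href = sample_soup.find_all("a")

url = with_href["href"]                 # raises KeyError if href is missing
all_attrs = with_href.attrs             # every attribute as a dict
missing_url = without_href.get("href")  # None rather than an exception

print(url, all_attrs, missing_url)
```

On this page every record has a link, so square-bracket access works; `.get()` is a reasonable alternative when some tags may lack the attribute.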

# Dataset

In [17]:
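final_data = []

for row in results:
    # Date: drop the trailing non-breaking space
    date = row.find_all('strong')[0].text[0:-1]
    # Statement: drop the opening quote, the closing quote, and the trailing space
    trump_stmts = row.contents[1][1:-2]
    # Explanation: drop the surrounding parentheses
    explanation = row.find_all('a')[0].text[1:-1]
    # URL: read the href attribute of the link
    url = row.find('a')['href']

    final_data.append((date, trump_stmts, explanation, url))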

In [18]:
final_data[0:3]

[('Jan. 21',
  "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
  'He was for an invasion before he was against it.',
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
 ('Jan. 21',
  'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.',
  'Trump was on the cover 11 times and Nixon appeared 55 times.',
  'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'),
 ('Jan. 23',
  'Between 3 million and 5 million illegal votes caused me to lose the popular vote.',
  "There's no evidence of illegal voting.",
  'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html')]
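The extraction steps can also be wrapped in a function, which makes them easy to try on a single record. The version below is only a sketch: `parse_record` is a hypothetical helper, it uses `strip()` instead of fixed slice offsets, and it is exercised on a hand-built copy of the first record, not the live page.

```python
from bs4 import BeautifulSoup


def parse_record(row):
    """Return (date, statement, explanation, url) for one short-desc span."""
    date = row.find("strong").text.strip()           # strip() also removes \xa0
    statement = row.contents[1].strip().strip("“”")  # trim space, then curly quotes
    link = row.find("a")
    explanation = link.text.strip("()")              # trim surrounding parentheses
    url = link.get("href")                           # None if the link has no href
    return date, statement, explanation, url


# Hand-built copy of the first record shown earlier
sample_html = (
    '<span class="short-desc"><strong>Jan. 21\xa0</strong>'
    '“I wasn\'t a fan of Iraq. I didn\'t want to go into Iraq.” '
    '<span class="short-truth">'
    '<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">'
    '(He was for an invasion before he was against it.)</a></span></span>'
)
sample_row = BeautifulSoup(sample_html, "html.parser").find("span", class_="short-desc")

record = parse_record(sample_row)
print(record)
```

With a helper like this, the dataset could be built with a list comprehension over results, though the explicit loop in the notebook does the same job.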

# Construct a dataframe using Pandas

In [19]:
import pandas as pd
df = pd.DataFrame(final_data, columns=['date', 'trump_stmts', 'explanation', 'url'])

In [20]:
df.head()

| | date | trump_stmts | explanation | url |
|---|---|---|---|---|
| 0 | Jan. 21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | Jan. 21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | Jan. 23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | Jan. 25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | Jan. 25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


In [21]:
df.tail()

| | date | trump_stmts | explanation | url |
|---|---|---|---|---|
| 111 | July 6 | As a result of this insistence, billions of do... | NATO countries agreed to meet defense spending... | http://www.nbcnews.com/politics/donald-trump/f... |
| 112 | July 17 | We’ve signed more bills — and I’m talking abou... | Clinton, Carter, Truman, and F.D.R. had signed... | https://www.nytimes.com/2017/07/17/us/politics... |
| 113 | July 19 | Um, the Russian investigation — it’s not an in... | It is. | http://time.com/4823514/donald-trump-investiga... |
| 114 | July 19 | I heard that Harry Truman was first, and then ... | Presidents Clinton, Carter, Truman, and F.D.R.... | https://www.nytimes.com/2017/07/17/us/politics... |
| 115 | July 19 | But the F.B.I. person really reports directly ... | He reports directly to the attorney general. | https://www.usatoday.com/story/news/politics/o... |


# Exporting to CSV file

In [22]:
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')
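One way to see why `index=False` is passed is to round-trip a small frame through CSV. The sketch below writes to an in-memory buffer, so it does not touch trump_lies.csv; the one-row `sample_df` and the example.com URL are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical one-row frame with the same columns as df
sample_df = pd.DataFrame(
    [("Jan. 21", "I wasn't a fan of Iraq.", "He was for an invasion.", "https://example.com/a")],
    columns=["date", "trump_stmts", "explanation", "url"],
)

# With index=False, the columns come back exactly as written
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# With the default index=True, the row index becomes an extra unnamed column
buffer_with_index = io.StringIO()
sample_df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
restored_with_index = pd.read_csv(buffer_with_index)

print(list(restored.columns))
print(list(restored_with_index.columns))
```

Reading the exported file back with `pd.read_csv('trump_lies.csv')` is a similar quick check that the export worked.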