## Reading the webpage into Python using requests

In [67]:
import pandas as pd

In [4]:
import requests
r = requests.get("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html")
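
The request above does not check whether the server actually returned the page. A minimal sketch of a more defensive fetch is shown here; the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML, raising an error if the request fails.

    `fetch_html` and the default timeout are hypothetical choices
    made for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently handing back an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(...)` with the article URL should give the same text as `r.text`, but it fails loudly if the page cannot be retrieved.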

In [9]:
# Read the first 500 characters
r.text[:500]
# Compare this with the page's HTML source (in Chrome: View page source)

'<!DOCTYPE html>\n<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->\n<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page'

### Parsing the HTML using BeautifulSoup

In [11]:
# ! pip install beautifulsoup4



In [12]:
from bs4 import BeautifulSoup

In [13]:
soup = BeautifulSoup(r.text, 'html.parser')

In [14]:
type(soup)

bs4.BeautifulSoup

### Collecting all of the records

In [25]:
results = soup.find_all("span",attrs={"class":"short-desc"})

In [26]:
len(results)

180
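
Beautiful Soup offers a few equivalent ways to express this search. The sketch below compares them on a small hand-written snippet that stands in for the real page (the quotes and `example.com` links are made up for illustration).

```python
from bs4 import BeautifulSoup

# A tiny hand-written snippet standing in for the real page.
html = """
<span class="short-desc"><strong>Jan. 21 </strong>“Quote one.” <span class="short-truth"><a href="https://example.com/1">(Fact one.)</a></span></span>
<span class="short-desc"><strong>Jan. 23 </strong>“Quote two.” <span class="short-truth"><a href="https://example.com/2">(Fact two.)</a></span></span>
"""
soup = BeautifulSoup(html, "html.parser")

# The form used in this notebook: an explicit attribute dictionary.
by_attrs = soup.find_all("span", attrs={"class": "short-desc"})
# The class_ keyword (with a trailing underscore, since `class` is reserved).
by_keyword = soup.find_all("span", class_="short-desc")
# A CSS selector.
by_selector = soup.select("span.short-desc")

print(len(by_attrs), len(by_keyword), len(by_selector))
```

All three calls find the same two outer spans; the nested `short-truth` spans are not matched.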

In [27]:
results[:2]

[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>]

In [29]:
results[-1]

<span class="short-desc"><strong>Nov. 11 </strong>“I'd rather have him  – you know, work with him on the Ukraine than standing and arguing about whether or not  – because that whole thing was set up by the Democrats.” <span class="short-truth"><a href="https://www.nytimes.com/interactive/2017/12/10/us/politics/trump-and-russia.html" target="_blank">(There is no evidence that Democrats "set up" Russian interference in the election.)</a></span></span>

### Extracting the date

In [30]:
first_results = results[0]

In [31]:
first_results

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [32]:
first_results.find("strong")

<strong>Jan. 21 </strong>

In [57]:
first_results.find("strong").text[:-1] + ', 2017'

'Jan. 21, 2017'

And that's how we successfully extracted the date for the first result.
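
Slicing with `[:-1]` assumes the date always ends with exactly one trailing space. As a sketch of an alternative, `get_text(strip=True)` removes surrounding whitespace whatever its length; the HTML here is a shortened copy of the first record.

```python
from bs4 import BeautifulSoup

# Shortened copy of the first record, used only for illustration.
html = """<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq.” </span>"""
tag = BeautifulSoup(html, "html.parser").find("span", class_="short-desc")

# strip=True trims leading and trailing whitespace from the tag's text.
date = tag.find("strong").get_text(strip=True) + ", 2017"
print(date)
```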

### Extract the lie
Next we extract the lie itself. It is not wrapped in a tag of its own, so we cannot search for it by tag name. Instead we use the `.contents` attribute of the tag, which returns a list of its children.

In [36]:
first_results.contents

[<strong>Jan. 21 </strong>,
 "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” ",
 <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>]

In [37]:
first_results.contents[1]

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

Now we remove the opening curly quote from the front, and the closing curly quote and trailing space from the back.

In [40]:
first_results.contents[1][1:-2]

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

### Extract the link

In [58]:
first_results.contents[-1].find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

A tag's attributes can be accessed like the keys of a dictionary, so `'href'` returns the link URL.

In [59]:
first_results.contents[-1].find('a')['href']

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'
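
Square-bracket access raises a `KeyError` when the attribute is missing. If some links might lack an `href`, the tag's `.get()` method returns `None` instead, just as `dict.get` does. A small sketch with two made-up anchors:

```python
from bs4 import BeautifulSoup

# Two hypothetical anchors: one with an href, one without.
soup = BeautifulSoup(
    '<a href="https://example.com/source">(with link)</a><a>(no link)</a>',
    "html.parser",
)
with_link, without_link = soup.find_all("a")

print(with_link["href"])         # bracket access works when the attribute exists
print(without_link.get("href"))  # .get() returns None rather than raising
```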

### Extracting the explanation

In [61]:
first_results.contents[-1].text

'(He was for an invasion before he was against it.)'

In [62]:
first_results.contents[-1].text[1:-1]

'He was for an invasion before he was against it.'
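
Both the lie and the explanation were cleaned by slicing at fixed positions. An alternative sketch uses `str.strip` with the characters to remove, which does not depend on how many of them are present. The two strings are copied from the outputs above.

```python
raw_lie = "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "
raw_explanation = "(He was for an invasion before he was against it.)"

# First strip whitespace, then the curly quotes at either end.
lie = raw_lie.strip().strip("“”")
# Strip the surrounding parentheses.
explanation = raw_explanation.strip("()")

print(lie)
print(explanation)
```

Note that `strip` removes every matching character at each end, so a string that legitimately ends with a parenthesis or quote would lose it too.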

### Building the dataset

In [64]:
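record = []
for result in results:
    # Date: text of the <strong> tag without its trailing space, plus the year
    date = result.find("strong").text[:-1] + ', 2017'
    # Lie: second child, without the curly quotes and trailing space
    lie = result.contents[1][1:-2]
    # Explanation: text of the last child, without the parentheses
    explanation = result.contents[-1].text[1:-1]
    # Link: href of the <a> tag inside the last child
    link = result.contents[-1].find('a')['href']
    record.append((date, lie, explanation, link))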

In [65]:
len(record)

180

In [66]:
record[:3]

[('Jan. 21, 2017',
  "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
  'He was for an invasion before he was against it.',
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
 ('Jan. 21, 2017',
  'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.',
  'Trump was on the cover 11 times and Nixon appeared 55 times.',
  'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'),
 ('Jan. 23, 2017',
  'Between 3 million and 5 million illegal votes caused me to lose the popular vote.',
  "There's no evidence of illegal voting.",
  'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html')]
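
The loop above works because all 180 records share the same structure. If a page contained malformed records, the loop would stop with an exception. The following is a sketch of a more defensive version; the helper `parse_record` and the second, deliberately broken record are hypothetical and used only for illustration.

```python
from bs4 import BeautifulSoup

html = """
<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>
<span class="short-desc">“A fragment with no date or source.” </span>
"""


def parse_record(tag, year="2017"):
    """Return (date, lie, explanation, link), or None if a part is missing."""
    strong = tag.find("strong")
    truth = tag.find("span", class_="short-truth")
    link = truth.find("a") if truth else None
    if strong is None or truth is None or link is None:
        return None
    date = f"{strong.get_text(strip=True)}, {year}"
    lie = tag.contents[1].strip().strip("“”")
    explanation = truth.get_text(strip=True).strip("()")
    return (date, lie, explanation, link.get("href"))


soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all("span", class_="short-desc")
# Keep only the records that parsed cleanly.
records = [r for r in map(parse_record, tags) if r is not None]
print(len(records))
```

Here the malformed second record is skipped rather than stopping the whole scrape.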

### Building the tabular dataset

In [68]:
webscrapped_data = pd.DataFrame(record,columns=['Date',"Lie", "Explanation", "Hyperlink"])

In [69]:
webscrapped_data.head()

| | Date | Lie | Explanation | Hyperlink |
|---|---|---|---|---|
| 0 | Jan. 21, 2017 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | Jan. 21, 2017 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | Jan. 23, 2017 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | Jan. 25, 2017 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | Jan. 25, 2017 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


In [70]:
webscrapped_data["Date"] = pd.to_datetime(webscrapped_data["Date"])

In [71]:
webscrapped_data.head()

| | Date | Lie | Explanation | Hyperlink |
|---|---|---|---|---|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |
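
`pd.to_datetime` inferred the date format here. As a sketch, the format can also be stated explicitly, and `errors="coerce"` turns anything that does not match into `NaT` instead of raising. The format string assumes every date is a three-letter month abbreviation followed by a period, which may not hold for every record on the page; the third value is a made-up bad entry.

```python
import pandas as pd

# Two dates in the notebook's format plus one hypothetical bad value.
dates = pd.Series(["Jan. 21, 2017", "Jan. 23, 2017", "not a date"])

# %b = abbreviated month name, %d = day of month, %Y = four-digit year.
parsed = pd.to_datetime(dates, format="%b. %d, %Y", errors="coerce")
print(parsed)
```

Counting the `NaT` values afterwards (for example with `parsed.isna().sum()`) shows how many dates failed to parse.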


In [72]:
webscrapped_data.set_index("Date", drop=True, inplace=True)

In [73]:
webscrapped_data.head()

| Date | Lie | Explanation | Hyperlink |
|---|---|---|---|
| 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |
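
One benefit of a datetime index is that rows can be selected with a partial date string. The sketch below uses a small stand-in frame with placeholder values rather than the scraped data.

```python
import pandas as pd

# Stand-in data with the same index name as the scraped frame.
df = pd.DataFrame(
    {"Lie": ["first", "second", "third"]},
    index=pd.to_datetime(["2017-01-21", "2017-01-23", "2017-02-03"]).rename("Date"),
)

# A year-month string selects every row in that month.
january = df.loc["2017-01"]
print(january)
```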


### Storing the data in a CSV file

In [74]:
webscrapped_data.to_csv('trumplies.csv', index=True,encoding='utf8')
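
When the file is read back, the index comes back as plain text unless pandas is told to parse it. A minimal round-trip sketch follows; it uses an in-memory buffer and placeholder rows in place of `trumplies.csv` so that it runs on its own.

```python
import io

import pandas as pd

# Stand-in frame with a datetime index named "Date".
df = pd.DataFrame(
    {"Lie": ["first", "second"]},
    index=pd.to_datetime(["2017-01-21", "2017-01-23"]).rename("Date"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=True)
buffer.seek(0)

# index_col restores the index; parse_dates converts it back to datetimes.
restored = pd.read_csv(buffer, index_col="Date", parse_dates=True)
print(restored.index.dtype)
```

With a real file, the same `index_col` and `parse_dates` arguments apply to `pd.read_csv('trumplies.csv', ...)`.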