# Web Scraping using BeautifulSoup 

## Web scraping (also called screen scraping, web data extraction, or web harvesting) is a technique for extracting large amounts of data from websites and saving it to a local file on your computer or to a database in tabular (spreadsheet-like) format.

In [1]:
# Loading libraries
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Here is the link to the "Trump's Lies" article from The New York Times: https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html

# Here is the structure of the first lie:

In [2]:
# Named `url` so it is not overwritten by the per-record `link` variable used later
url = "https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html"
response = requests.get(url)
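The request above has no timeout and does not check the HTTP status code. A minimal sketch of a more defensive fetch follows. The helper name `fetch_html` and the 10-second timeout are assumptions for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Hypothetical helper: download a page and return its HTML as text."""
    # The timeout stops the call from hanging forever on a slow server.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently handing an error page to the parser.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html` with the article URL returns the HTML text, which can then be passed to `BeautifulSoup` in the same way as `response.text`.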

# Collecting all the records

In [3]:
soup = BeautifulSoup(response.text, "html.parser")

In [4]:
results = soup.find_all(name = "span", attrs = {"class" : "short-desc"})

In [5]:
len(results)

180
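The same records can also be selected with a CSS selector. The sketch below compares `find_all` with `select` on a small made-up HTML snippet that only imitates the structure of the page (the quotes, notes, and example.com URLs are placeholders, not real records).

```python
from bs4 import BeautifulSoup

# Made-up snippet that mimics the page structure: two records,
# each a <span class="short-desc"> wrapping a nested short-truth span.
html = """
<div>
  <span class="short-desc"><strong>Jan. 21 </strong>“First quote.” <span class="short-truth"><a href="https://example.com/a">(First note.)</a></span></span>
  <span class="short-desc"><strong>Jan. 23 </strong>“Second quote.” <span class="short-truth"><a href="https://example.com/b">(Second note.)</a></span></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keyword form of the attrs={"class": ...} filter used in the notebook.
by_find_all = soup.find_all("span", class_="short-desc")

# Equivalent CSS selector: <span> elements carrying the class short-desc.
by_select = soup.select("span.short-desc")

print(len(by_find_all), len(by_select))
```

Both calls return only the outer `short-desc` spans. The nested `short-truth` spans are not matched because their class differs.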

In [6]:
results[0]

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

# Parse the first lie into 4 structured columns:
* Date
* Lie
* Explanation
* Link

# Inspecting the first result to see the pattern of the HTML

In [7]:
r = results[0]

In [8]:
r

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

# Parsing date by looking at the strong tag

In [9]:
date = r.find("strong").text[:-1] + ", 2017"

In [10]:
date

'Jan. 21, 2017'
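The `[:-1]` slice assumes the text of the strong tag always ends with exactly one extra character. A hedged alternative is to strip surrounding whitespace instead, sketched here on a one-tag snippet copied from the first record:

```python
import pandas as pd
from bs4 import BeautifulSoup

# The <strong> tag from the first record, including its trailing space.
snippet = "<strong>Jan. 21 </strong>"
strong = BeautifulSoup(snippet, "html.parser").find("strong")

# get_text(strip=True) removes leading and trailing whitespace,
# however many characters there are.
date_text = strong.get_text(strip=True) + ", 2017"

# Convert the string into a pandas Timestamp.
date_value = pd.to_datetime(date_text)

print(date_text, date_value)
```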

# Parsing the lie by taking the text node that follows the strong tag and slicing off the quotation marks

In [11]:
lie = r.contents[1][1:-2]
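The fixed slice `[1:-2]` relies on the text node having one opening quotation mark, then one closing quotation mark followed by a single space. A sketch of a variant that strips whitespace and curly quotes explicitly, run on the first record's HTML as shown above:

```python
from bs4 import BeautifulSoup

# HTML of the first record, copied from the notebook output.
html = (
    '<span class="short-desc"><strong>Jan. 21 </strong>'
    "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "
    '<span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/'
    'in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">'
    "(He was for an invasion before he was against it.)</a></span></span>"
)

r = BeautifulSoup(html, "html.parser").find("span", class_="short-desc")

# contents[1] is the text node between the <strong> tag and the nested span.
raw = r.contents[1]

# Remove surrounding whitespace first, then the curly quotation marks.
lie_text = raw.strip().strip("“”")

print(lie_text)
```

This gives the same string as the slice for the first record, without depending on the exact number of trailing characters.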

In [12]:
lie

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

# Parsing explanation by looking at the a tag

In [13]:
explanation = r.find("a").text[1:-1]

In [14]:
explanation

'He was for an invasion before he was against it.'

# Parsing the link from the href attribute of the a tag

In [15]:
link = r.find("a")["href"]

In [16]:
link

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'
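Indexing a tag with `["href"]` raises a `KeyError` when the attribute is missing, while `.get("href")` returns `None`. The sketch below shows the difference on a made-up snippet (the example.com URL is a placeholder):

```python
from bs4 import BeautifulSoup

# Made-up snippet: one anchor with an href, one without.
snippet = '<span><a href="https://example.com/source">(note)</a><a>(no link)</a></span>'
soup = BeautifulSoup(snippet, "html.parser")
with_href, without_href = soup.find_all("a")

# Dictionary-style access works when the attribute exists.
present = with_href["href"]

# .get() returns None instead of raising when it does not.
missing = without_href.get("href")

try:
    without_href["href"]
    raised = False
except KeyError:
    raised = True

print(present, missing, raised)
```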

# Now, iterating over all the records and putting them into a dataframe

In [17]:
rows = []
for r in results:
    date = r.find("strong").text[:-1] + ", 2017"
    lie = r.contents[1][1:-2]
    explanation = r.find("a").text[1:-1]
    link = r.find("a")["href"]
    rows.append( (date, lie, explanation, link) )
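The loop above assumes every record has a strong tag, a text node, and an a tag. A minimal sketch of the same extraction wrapped in a function that skips malformed records is shown below. The function name `parse_record` and the choice to return `None` for malformed records are assumptions for illustration, not part of the original notebook.

```python
from bs4 import BeautifulSoup


def parse_record(tag, year="2017"):
    """Hypothetical helper: return (date, lie, explanation, link) or None."""
    strong = tag.find("strong")
    anchor = tag.find("a")
    # Skip records that do not follow the expected structure.
    if strong is None or anchor is None or len(tag.contents) < 2:
        return None
    date = strong.get_text(strip=True) + ", " + year
    # Text node after <strong>: drop whitespace, then curly quotes.
    lie = str(tag.contents[1]).strip().strip("“”")
    # Anchor text is wrapped in parentheses on the page.
    explanation = anchor.get_text(strip=True).strip("()")
    link = anchor.get("href")
    return (date, lie, explanation, link)


# First record from the notebook output, plus a made-up malformed one.
html = (
    '<span class="short-desc"><strong>Jan. 21 </strong>'
    "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "
    '<span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/'
    'in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">'
    "(He was for an invasion before he was against it.)</a></span></span>"
    '<span class="short-desc">no tags here</span>'
)
spans = BeautifulSoup(html, "html.parser").find_all("span", class_="short-desc")

parsed = [parse_record(s) for s in spans]
rows_clean = [row for row in parsed if row is not None]
print(rows_clean)
```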

In [18]:
df = pd.DataFrame(data = rows, columns=["Date", "Lie", "Explanation", "Link"])
df["Date"] = pd.to_datetime(df["Date"])
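The introduction mentions saving the extracted data to a local file. A sketch of writing the table as CSV and reading it back follows, using one made-up row and an in-memory buffer in place of a real file path (both are assumptions for illustration).

```python
import io

import pandas as pd

# One made-up row with the same four columns as the notebook's dataframe.
sample_rows = [
    ("Jan. 21, 2017", "Example statement", "Example note", "https://example.com/a"),
]
sample_df = pd.DataFrame(sample_rows, columns=["Date", "Lie", "Explanation", "Link"])
sample_df["Date"] = pd.to_datetime(sample_df["Date"])

# index=False keeps the row index out of the file, which avoids an
# extra "Unnamed: 0" column when the file is read back.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the CSV back, parsing the Date column as datetimes again.
round_trip = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
print(round_trip)
```

Passing a file path such as a `.csv` filename to `to_csv` instead of the buffer writes the same content to disk.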

In [19]:
df.head(20)

| | Date | Lie | Explanation | Link |
|---|---|---|---|---|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |
| 5 | 2017-01-25 | You had millions of people that now aren't ins... | The real number is less than 1 million, accord... | https://www.nytimes.com/2017/03/13/us/politics... |
| 6 | 2017-01-25 | So, look, when President Obama was there two w... | There were no gun homicide victims in Chicago ... | https://www.dnainfo.com/chicago/2017-chicago-m... |
| 7 | 2017-01-26 | We've taken in tens of thousands of people. We... | Vetting lasts up to two years. | https://www.nytimes.com/interactive/2017/01/29... |
| 8 | 2017-01-26 | I cut off hundreds of millions of dollars off ... | Most of the cuts were already planned. | https://www.washingtonpost.com/news/fact-check... |
| 9 | 2017-01-28 | The coverage about me in the @nytimes and the ... | It never apologized. | https://www.nytimes.com/2016/11/13/us/election... |
