
# **Web Scraping in Python**

---


## **Step 1: Reading the web page into Python**

To install the ***requests*** library, run this command from the command line: ***pip install requests***

In [0]:
import requests
webPage = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')
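
The request above assumes the download succeeds. A minimal sketch of a more defensive fetch follows, assuming a hypothetical helper named `fetch_html` and an arbitrary 10-second timeout (neither is part of the original notebook). It sets a timeout so the call cannot hang indefinitely and raises an exception on HTTP error codes instead of quietly returning an error page.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as text, or raise on network/HTTP errors.

    `fetch_html` and the 10-second default are illustrative choices,
    not something the original notebook defines.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.exceptions.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.text


# Usage (requires network access):
# html = fetch_html('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')
# print(html[0:500])
```

Both connection problems and HTTP errors surface as subclasses of `requests.exceptions.RequestException`, so a single `except` clause can handle them.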

In [28]:
# The "webPage" object has a "text" attribute, which contains the same HTML code that we can see by viewing the page source in a web browser
# Print the first 500 characters of the HTML source code
print(webPage.text[0:500])

<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page


## **Step 2: Parsing the HTML using Beautiful Soup**

To install the Beautiful Soup library (imported as ***bs4***), run this command from the command line: ***pip install beautifulsoup4***

In [0]:
from bs4 import BeautifulSoup
# BeautifulSoup is reading the HTML in the "webPage" object and making sense of its structure
beautifulSoup = BeautifulSoup(webPage.text, 'html.parser')

**recordList** is a list-like collection containing records in this format:
`<span class="short-desc"><strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span></span>`

In [0]:
# "recordList" acts like a Python list
recordList = beautifulSoup.find_all('span', attrs={'class':'short-desc'})

In [31]:
len(recordList)

180
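
`find_all` accepts the class filter in more than one form. The sketch below runs against a made-up two-record snippet rather than the live page (the `example.com` URLs and the claim text are placeholders). It compares the `attrs` dictionary used above, the `class_` keyword, and a CSS selector passed to `select`.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the article's markup (made-up text, same structure)
html = '''
<span class="short-desc"><strong>Jan. 1 </strong>“First claim.” <span class="short-truth"><a href="https://example.com/1">(First correction.)</a></span></span>
<span class="short-desc"><strong>Jan. 2 </strong>“Second claim.” <span class="short-truth"><a href="https://example.com/2">(Second correction.)</a></span></span>
'''
soup = BeautifulSoup(html, 'html.parser')

by_attrs = soup.find_all('span', attrs={'class': 'short-desc'})
by_keyword = soup.find_all('span', class_='short-desc')  # class_ avoids the reserved word "class"
by_selector = soup.select('span.short-desc')             # CSS selector syntax

# All three match the two outer spans and skip the nested "short-truth" spans
print(len(by_attrs), len(by_keyword), len(by_selector))  # 2 2 2
```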

In [32]:
# Print the first 3 records
recordList[0:3]  

[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_

In [33]:
# Get the last record of the article
recordList[-1]

<span class="short-desc"><strong>Nov. 11 </strong>“I'd rather have him  – you know, work with him on the Ukraine than standing and arguing about whether or not  – because that whole thing was set up by the Democrats.” <span class="short-truth"><a href="https://www.nytimes.com/interactive/2017/12/10/us/politics/trump-and-russia.html" target="_blank">(There is no evidence that Democrats "set up" Russian interference in the election.)</a></span></span>

## **Step 3: Build the Dataset**

### **1. Extracting the date**

In [34]:
# Get the first record
firstRecord = recordList[0]
print(firstRecord)

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>


In [35]:
# Returns a Beautiful Soup Tag object, which represents the <strong> HTML tag
firstRecord.find('strong')

<strong>Jan. 21 </strong>

In [36]:
# Returns a regular Python string by accessing the text attribute of the tag
# \xa0 is an escape sequence: a single character that represents the non-breaking space (&nbsp;) in the HTML source
firstRecord.find('strong').text

'Jan. 21\xa0'

In [37]:
# Slice off \xa0 from the end of the string
firstRecord.find('strong').text[0:-1]

'Jan. 21'
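
The slice `[0:-1]` removes the last character whatever it is. A minimal alternative sketch, assuming the only unwanted characters are whitespace at the ends, uses `str.strip()`, which treats the non-breaking space as whitespace. The variable names are illustrative.

```python
raw_date = 'Jan. 21\xa0'  # same value as the tag text shown above

# str.strip() with no arguments removes leading/trailing Unicode whitespace,
# and the non-breaking space \xa0 counts as whitespace
clean_date = raw_date.strip()
print(repr(clean_date))  # 'Jan. 21'

# If a date ever arrived without the trailing \xa0, slicing would eat a digit
already_clean = 'Jan. 21'
print(repr(already_clean.strip()), repr(already_clean[0:-1]))  # 'Jan. 21' 'Jan. 2'
```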

In [38]:
# Add year to the date
firstRecord.find('strong').text[0:-1] + ', 2017'

'Jan. 21, 2017'

### **2. Extracting the lie**

In [39]:
# Print the first record
firstRecord

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>

In [40]:
# The firstRecord tag has a contents attribute which returns a Python list containing its children (nested tags and text strings)
firstRecord.contents

[<strong>Jan. 21 </strong>,
 "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” ",
 <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>]
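
The list above mixes two kinds of children: nested tags and a bare text string. The sketch below uses a made-up record with the same structure (placeholder text and URL) and tells them apart with `isinstance` checks against the `Tag` and `NavigableString` classes from `bs4`.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = ('<span class="short-desc"><strong>Jan. 1 </strong>“A claim.” '
        '<span class="short-truth"><a href="https://example.com/1">(A correction.)</a></span></span>')
record = BeautifulSoup(html, 'html.parser').find('span')

kinds = []
for child in record.contents:
    if isinstance(child, Tag):
        # Nested element: has a name, attributes and its own children
        kinds.append('tag <' + child.name + '>')
    elif isinstance(child, NavigableString):
        # Plain text sitting directly inside the parent tag
        kinds.append('text')

print(kinds)  # ['tag <strong>', 'text', 'tag <span>']
```

The text child is why indexing `contents[1]` yields something that behaves like a string rather than a tag.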

In [41]:
# Get the LIE from the list
firstRecord.contents[1]

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

In [42]:
# Slice off the curly quotation marks and a space
firstRecord.contents[1][1:-2]

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

### **3. Extracting the explanation**

In [43]:
firstRecord.contents[2]

<span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>

In [44]:
firstRecord.find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

In [45]:
# Returns the writer's explanation of the lie
firstRecord.find('a').text[1:-1]

'He was for an invasion before he was against it.'
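
The fixed slices `[1:-2]` and `[1:-1]` depend on the exact number of wrapping characters. A small sketch of an alternative, assuming the text is always wrapped in curly quotes or parentheses, passes the characters to remove to `str.strip`. The variable names are illustrative.

```python
raw_lie = "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

# First drop surrounding whitespace, then drop curly quotes from both ends
lie = raw_lie.strip().strip('“”')
print(lie)

# The same idea removes the parentheses around the explanation text
raw_explanation = '(He was for an invasion before he was against it.)'
explanation = raw_explanation.strip('()')
print(explanation)
```

Note that `strip` removes every matching character at both ends, so text that legitimately ends in a quotation mark or parenthesis would lose it as well.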

### **4. Extracting the URL**

In [46]:
firstRecord.find('a')

<a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a>

In [47]:
# BeautifulSoup treats tag attributes and values as key-value pairs in a dictionary
firstRecord.find('a')['href']

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'
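
A short sketch of the dictionary-style access, using a made-up snippet in which the second link has no `href` (the markup and URL are placeholders). Square brackets behave like dictionary indexing, while `.get()` returns `None` when the attribute is missing.

```python
from bs4 import BeautifulSoup

html = '<a href="https://example.com" target="_blank">(text)</a><a>(no link)</a>'
soup = BeautifulSoup(html, 'html.parser')
first_link, second_link = soup.find_all('a')

print(first_link.attrs)         # all attributes of the tag as a dictionary
print(first_link['href'])       # square brackets raise KeyError if the attribute is missing
print(second_link.get('href'))  # .get() returns None instead of raising
```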

In [0]:
# "records" is a list of tuples
records = []
for record in recordList: 
  date = record.find('strong').text[0:-1] + ', 2017'
  lie = record.contents[1][1:-2]
  explanation = record.find('a').text[1:-1]
  url = record.find('a')['href']
  records.append((date, lie, explanation, url))
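
The loop above assumes every record contains a `<strong>` tag, a text child and a link. A hedged sketch of a more forgiving variant follows. The helper `parse_record`, the sample markup and the `example.com` URL are all hypothetical. Records missing a piece are skipped instead of raising an exception.

```python
from bs4 import BeautifulSoup


def parse_record(record, year='2017'):
    """Return (date, lie, explanation, url), or None if a piece is missing."""
    date_tag = record.find('strong')
    link = record.find('a')
    if date_tag is None or link is None or len(record.contents) < 2:
        return None
    date = date_tag.text.strip() + ', ' + year
    lie = str(record.contents[1]).strip().strip('“”')
    explanation = link.text.strip('()')
    url = link.get('href')
    return (date, lie, explanation, url)


# Two made-up records: the second one has no link and should be skipped
html = (
    '<span class="short-desc"><strong>Jan. 1&nbsp;</strong>“A claim.” '
    '<span class="short-truth"><a href="https://example.com/1">(A correction.)</a></span></span>'
    '<span class="short-desc"><strong>Jan. 2&nbsp;</strong>“A claim without a link.”</span>'
)
soup = BeautifulSoup(html, 'html.parser')
spans = soup.find_all('span', attrs={'class': 'short-desc'})

sample_records = [parsed for parsed in map(parse_record, spans) if parsed is not None]
print(sample_records)
```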

In [49]:
len(records)

180

In [50]:
# Print the first 3 records
records[0:3]

[('Jan. 21, 2017',
  "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
  'He was for an invasion before he was against it.',
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
 ('Jan. 21, 2017',
  'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.',
  'Trump was on the cover 11 times and Nixon appeared 55 times.',
  'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'),
 ('Jan. 23, 2017',
  'Between 3 million and 5 million illegal votes caused me to lose the popular vote.',
  "There's no evidence of illegal voting.",
  'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html')]

## **Step 4: Applying a tabular data structure**

In [0]:
import pandas as pd
dataFrame = pd.DataFrame(records, columns=['Date', 'Lie', 'Explanation', 'URL'])

In [53]:
# Print the first few rows of the DataFrame
dataFrame.head()

| | Date | Lie | Explanation | URL |
|---|---|---|---|---|
| 0 | Jan. 21, 2017 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | Jan. 21, 2017 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | Jan. 23, 2017 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | Jan. 25, 2017 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | Jan. 25, 2017 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


In [54]:
# Print the last few rows of the DataFrame
dataFrame.tail()

| | Date | Lie | Explanation | URL |
|---|---|---|---|---|
| 175 | Oct. 25, 2017 | We have trade deficits with almost everybody. | We have trade surpluses with more than 100 cou... | https://www.bea.gov/newsreleases/international... |
| 176 | Oct. 27, 2017 | Wacky & totally unhinged Tom Steyer, who has b... | Steyer has financially supported many winning ... | https://www.opensecrets.org/donor-lookup/resul... |
| 177 | Nov. 1, 2017 | Again, we're the highest-taxed nation, just ab... | We're not. | http://www.politifact.com/truth-o-meter/statem... |
| 178 | Nov. 7, 2017 | When you look at the city with the strongest g... | Several other cities, including New York and L... | http://www.politifact.com/truth-o-meter/statem... |
| 179 | Nov. 11, 2017 | I'd rather have him – you know, work with him... | There is no evidence that Democrats "set up" R... | https://www.nytimes.com/interactive/2017/12/10... |


In [0]:
# Convert the "Date" column to pandas' datetime format
dataFrame['Date'] = pd.to_datetime(dataFrame['Date'])
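
`pd.to_datetime` works out the layout of the date strings on its own. For comparison, here is a minimal sketch of an explicit format using the standard library's `datetime.strptime`. It assumes every date looks like the ones shown above (abbreviated month, period, day, comma, year), and months written differently would need separate handling. The name `DATE_FORMAT` is illustrative.

```python
from datetime import datetime

# Assumed layout: abbreviated month name, a period, day, comma, four-digit year.
# %b depends on the locale; English month abbreviations are assumed here.
DATE_FORMAT = '%b. %d, %Y'

parsed = datetime.strptime('Jan. 21, 2017', DATE_FORMAT)
print(parsed.date())  # 2017-01-21

last = datetime.strptime('Nov. 11, 2017', DATE_FORMAT)
print(last.date())    # 2017-11-11
```

`pd.to_datetime` also accepts a `format` argument that takes the same kind of format string, which may be useful when the automatic parsing is too permissive.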

In [56]:
# Print the first few rows of the DataFrame
dataFrame.head()

| | Date | Lie | Explanation | URL |
|---|---|---|---|---|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


In [57]:
# Print the last few rows of the DataFrame
dataFrame.tail()

| | Date | Lie | Explanation | URL |
|---|---|---|---|---|
| 175 | 2017-10-25 | We have trade deficits with almost everybody. | We have trade surpluses with more than 100 cou... | https://www.bea.gov/newsreleases/international... |
| 176 | 2017-10-27 | Wacky & totally unhinged Tom Steyer, who has b... | Steyer has financially supported many winning ... | https://www.opensecrets.org/donor-lookup/resul... |
| 177 | 2017-11-01 | Again, we're the highest-taxed nation, just ab... | We're not. | http://www.politifact.com/truth-o-meter/statem... |
| 178 | 2017-11-07 | When you look at the city with the strongest g... | Several other cities, including New York and L... | http://www.politifact.com/truth-o-meter/statem... |
| 179 | 2017-11-11 | I'd rather have him – you know, work with him... | There is no evidence that Democrats "set up" R... | https://www.nytimes.com/interactive/2017/12/10... |


### **Exporting the dataset to a CSV file**

In [0]:
# Setting the index parameter to False tells pandas not to write the integer index to the CSV file
dataFrame.to_csv('president_lies.csv', index = False, encoding = 'utf-8')
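
Free-text columns such as Lie and Explanation can contain commas and double quotes, which CSV has to escape. The sketch below illustrates that quoting convention with the standard library's `csv` module and made-up rows. It is not how pandas writes the file internally, although the default of `to_csv` is the same minimal-quoting style.

```python
import csv
import io

# Made-up row: one field with a comma, one with embedded double quotes
rows = [('2017-01-21', 'A claim, with a comma', 'He said "no"')]

buffer = io.StringIO()  # in-memory stand-in for a file on disk
writer = csv.writer(buffer)
writer.writerow(['Date', 'Lie', 'Explanation'])
writer.writerows(rows)
print(buffer.getvalue())

# Reading the text back recovers the original values, commas and quotes included
buffer.seek(0)
restored = list(csv.reader(buffer))
print(restored[1])
```

In the written text the second and third fields are wrapped in double quotes, and the inner quotes are doubled, so the reader can still split each line into exactly three fields.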

In [67]:
# Reading the CSV file
dataFrame = pd.read_csv('president_lies.csv', parse_dates= ['Date'], encoding= 'utf-8')
dataFrame

| | Date | Lie | Explanation | URL |
|---|---|---|---|---|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |
| ... | ... | ... | ... | ... |
| 175 | 2017-10-25 | We have trade deficits with almost everybody. | We have trade surpluses with more than 100 cou... | https://www.bea.gov/newsreleases/international... |
| 176 | 2017-10-27 | Wacky & totally unhinged Tom Steyer, who has b... | Steyer has financially supported many winning ... | https://www.opensecrets.org/donor-lookup/resul... |
| 177 | 2017-11-01 | Again, we're the highest-taxed nation, just ab... | We're not. | http://www.politifact.com/truth-o-meter/statem... |
| 178 | 2017-11-07 | When you look at the city with the strongest g... | Several other cities, including New York and L... | http://www.politifact.com/truth-o-meter/statem... |
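
CSV files store only text, which is why `parse_dates` is passed when reading the file back. A small sketch of the difference, using a made-up two-row frame and an in-memory buffer in place of `president_lies.csv`:

```python
import io

import pandas as pd

sample = pd.DataFrame(
    [('2017-01-21', 'First claim'), ('2017-11-11', 'Second claim')],
    columns=['Date', 'Lie'],
)
sample['Date'] = pd.to_datetime(sample['Date'])

buffer = io.StringIO()  # in-memory stand-in for the CSV file
sample.to_csv(buffer, index=False)

buffer.seek(0)
without_parsing = pd.read_csv(buffer)                      # Date comes back as text
buffer.seek(0)
with_parsing = pd.read_csv(buffer, parse_dates=['Date'])   # Date comes back as datetime

print(without_parsing['Date'].dtype)
print(with_parsing['Date'].dtype)
```

Without `parse_dates`, date-based operations such as sorting by month or filtering by range would need the column converted again.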


# **Summary: Putting together all the code in one block**

In [0]:
# Step 1: Reading the web page into Python
import requests
webPage = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

# Step 2: Parsing the HTML using Beautiful Soup
from bs4 import BeautifulSoup
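# BeautifulSoup is reading the HTML in the "webPage" object and making sense of its structure
beautifulSoup = BeautifulSoup(webPage.text, 'html.parser')
# Collect every record (the <span class="short-desc"> tags) into a list-like object
recordList = beautifulSoup.find_all('span', attrs={'class':'short-desc'})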

# Step 3: Build the Dataset
records = []
for record in recordList: 
  # Extract the Date
  date = record.find('strong').text[0:-1] + ', 2017'
  # Extract the Lie
  lie = record.contents[1][1:-2]
  # Extract the explanation
  explanation = record.find('a').text[1:-1]
  # Extract the URL
  url = record.find('a')['href']
  records.append((date, lie, explanation, url))

# Step 4: Applying a tabular data structure
import pandas as pd
dataFrame = pd.DataFrame(records, columns=['Date', 'Lie', 'Explanation', 'URL'])  
# Convert the "Date" column to pandas' datetime format
dataFrame['Date'] = pd.to_datetime(dataFrame['Date'])
# Setting the index parameter to False tells pandas not to write the integer index to the CSV file
dataFrame.to_csv('president_lies.csv', index = False, encoding = 'utf-8')