# Scraping Example #1

#### We will be scraping a New York Times article titled "Trump Says U.S. Will Hit Mexico With 5% Tariffs on All Goods"

The article is found at: https://www.nytimes.com/2019/05/30/us/politics/trump-mexico-tariffs.html

To do this, we will need to *import* modules from **BeautifulSoup**, **requests**, and **pandas**.

In [1]:
# import statements:
from bs4 import BeautifulSoup
from bs4 import NavigableString
import requests
import pandas as pd

### Using the requests module, we'll "get" the HTML document

In [2]:
# Get the entire HTML document using the get() function from the requests module
r = requests.get("https://www.nytimes.com/2019/05/30/us/politics/trump-mexico-tariffs.html")

In [3]:
# Let's see what we've got here...
# Square brackets [] let us slice by index, so that we only read the first/last N characters
print(r.text[:500])
print(".........................")
print(r.text[-200:])

<!DOCTYPE html>
<html lang="en" class="story" xmlns:og="http://opengraphprotocol.org/schema/">
  <head>
    <title data-rh="true">Trump Says U.S. Will Hit Mexico With 5% Tariffs on All Goods - The New York Times</title>
    <meta data-rh="true" itemprop="inLanguage" content="en-US"/><meta data-rh="true" property="article:published" content="2019-05-30T23:48:36.000Z"/><meta data-rh="true" property="article:modified" content="2019-06-04T21:15:04.171Z"/><meta data-rh="true" http-equiv="Content-Lang
.........................
g) {
  window.Raven.setTagsContext(tag);
});
    };
    includeRaven.appendChild(script);
  });
}
</script>
</div>

    
    <!-- RELEASE 6d6c09d13c578e1f1ad5935614ffd4777860fad6 -->
  </body>
</html>


### The text is just HTML with all its tags; BeautifulSoup will help us parse the data

In [4]:
# We'll pass the text onto the BeautifulSoup parser
soup = BeautifulSoup(r.text, 'html.parser')
soup

<!DOCTYPE html>

<html class="story" lang="en" xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title data-rh="true">Trump Says U.S. Will Hit Mexico With 5% Tariffs on All Goods - The New York Times</title>
<meta content="en-US" data-rh="true" itemprop="inLanguage"/><meta content="2019-05-30T23:48:36.000Z" data-rh="true" property="article:published"/><meta content="2019-06-04T21:15:04.171Z" data-rh="true" property="article:modified"/><meta content="en" data-rh="true" http-equiv="Content-Language"/><meta content="noarchive" data-rh="true" name="robots"/><meta content="100000006533915" data-rh="true" name="articleid"/><meta content="nyt://article/92f06fa1-480c-5a98-b35b-6f8661feb65b" data-rh="true" name="nyt_uri"/><meta content="pubp://event/df5cf82c50f847bda1dffae4e8e6c02e" data-rh="true" name="pubp_event_id"/><meta content="The president said he would use tariffs to punish Mexico until it stops the flow of migrants into the United States." data-rh="true" name="description"/><me

### After looking through the HTML file, we look for the tags that precede the actual text...

In [7]:
# We've figured out which HTML tags hold the actual textual content: <p> tags with this class attribute.
results = soup.find_all('p', attrs={'class':'css-18icg9x evys1bk0'})

In [6]:
# How many lines did we pick up?
NumberOfLines = len(results)
print(NumberOfLines)

0
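A count of 0 means `find_all` matched nothing. One possible cause (an assumption here, since the live page is not inspected) is that auto-generated class names such as `css-18icg9x evys1bk0` no longer match what the page serves. The sketch below uses a small made-up HTML snippet, `sample_html`, to show how the same selector behaves when the class matches, when it is stale, and when no class filter is applied at all.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only; it is not taken from the article.
sample_html = """
<article>
  <p class="css-18icg9x evys1bk0">First paragraph.</p>
  <p class="css-18icg9x evys1bk0">Second paragraph.</p>
  <p class="caption">Photo caption.</p>
</article>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Matches: the class string is identical to the attribute value.
matching = sample_soup.find_all("p", attrs={"class": "css-18icg9x evys1bk0"})

# Stale selector: no tag carries this (hypothetical) class, so the list is empty.
stale = sample_soup.find_all("p", attrs={"class": "css-old-name"})

# No class filter: every <p> tag, including the caption.
every_paragraph = sample_soup.find_all("p")

print(len(matching), len(stale), len(every_paragraph))
```

If the filtered search comes back empty, comparing it against the unfiltered `find_all("p")` is a quick way to tell whether the page has no paragraphs at all or whether only the class filter is wrong.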


In [None]:
# Show me the first N lines...
print(results[0:5])

In [None]:
# Show me the first line (index = 0) and just the textual content
print(results[0].contents[0])
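Note that `.contents[0]` is only the first child of the tag, which may be a fragment of the paragraph or even another tag. A minimal sketch on made-up HTML (the `sample_html` string is hypothetical) compares it with `get_text()`, which joins all the text inside the tag:

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up HTML for illustration only.
sample_html = (
    '<p>Tariffs will <a href="https://example.com/">rise</a> next month.</p>'
    '<p><em>Note:</em> details may change.</p>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")
first, second = sample_soup.find_all("p")

# First paragraph: the first child is the text before the link.
first_child = first.contents[0]
full_text = first.get_text()

# Second paragraph: the first child is the <em> tag, not a string.
second_child = second.contents[0]

print(repr(str(first_child)))
print(repr(full_text))
print(isinstance(second_child, NavigableString))
```

Whether the first text node or the full text is the right choice depends on what is being counted; the cells in this notebook work on the first child only.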

In [None]:
# Can you find the word 'Trump' in that line? 
# What about the word 'UCSB'?
print(results[0].contents[0].find('Trump'))
print(results[0].contents[0].find('UCSB'))
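The string method `find()` returns the index where the substring starts, or -1 when it is absent. A short sketch on a made-up sentence (`sample_line` is not from the article) shows both cases, and why `in` is the safer test for presence:

```python
# Made-up sentence for illustration only.
sample_line = "Trump announced new tariffs on goods from Mexico."

trump_index = sample_line.find("Trump")      # found at the very start
tariffs_index = sample_line.find("tariffs")  # found later in the string
ucsb_index = sample_line.find("UCSB")        # not present

print(trump_index, tariffs_index, ucsb_index)

# An index of 0 is falsy, so "if sample_line.find('Trump'):" would wrongly
# skip a match at the start. The "in" operator avoids that trap.
has_trump = "Trump" in sample_line
has_ucsb = "UCSB" in sample_line
print(has_trump, has_ucsb)
```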

In [None]:
# Show me all the textual content with an index
for i in range(NumberOfLines):
    print(i, results[i].contents[0])

In [None]:
# How many times is 'migra' mentioned in the first paragraph? Matches 'immigrant', 'migrants', 'immigration', etc.
print(results[0].contents[0].count('migra'))

# How many times is 'trade' mentioned in the first paragraph?
print(results[0].contents[0].count('trade'))
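The cell above looks at one paragraph only, and `count()` is case-sensitive. To total the mentions over every paragraph, one option is to sum the per-paragraph counts. The helper `count_mentions` and the sample strings below are hypothetical, written only to illustrate the idea:

```python
def count_mentions(paragraphs, term, ignore_case=True):
    """Return how many times term occurs across all paragraphs."""
    if ignore_case:
        term = term.lower()
        paragraphs = [text.lower() for text in paragraphs]
    return sum(text.count(term) for text in paragraphs)


# Made-up paragraphs for illustration only.
sample_paragraphs = [
    "Migrants crossed the border.",
    "Immigration and trade were discussed.",
    "Trade talks continue.",
]

migra_total = count_mentions(sample_paragraphs, "migra")
trade_total = count_mentions(sample_paragraphs, "trade")
trade_case_sensitive = count_mentions(sample_paragraphs, "trade", ignore_case=False)

print(migra_total, trade_total, trade_case_sensitive)
```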

In [None]:
# How can I find other embedded things in there, like URLs?
print(results[0].find('a'))
print(results[12].find('a')['href'])

In [None]:
# Let's look for embedded URLs in ALL the text
for i in range(NumberOfLines):
    if results[i].find('a') is not None:
        print(results[i].find('a')['href'])
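`find('a')` returns only the first link in each paragraph, and indexing with `['href']` raises a `KeyError` for an `<a>` tag that has no `href` attribute. A sketch on made-up HTML (the `sample_html` string is hypothetical) collects every link with `find_all('a', href=True)` instead:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only.
sample_html = """
<p>See the <a href="https://example.com/one">first</a> and
<a href="https://example.com/two">second</a> sources.</p>
<p>No links here.</p>
<p>An <a name="anchor-only">anchor without href</a>.</p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

urls = []
for paragraph in sample_soup.find_all("p"):
    # href=True keeps only <a> tags that actually carry an href attribute.
    for link in paragraph.find_all("a", href=True):
        urls.append(link["href"])

print(urls)
```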

In [None]:
# How long is each paragraph (character count per line)?
for i in range(NumberOfLines):
    print(i, len(results[i].contents[0]))

### OK, we've taken a look at the "lay of the land"; now it's time to collect what we want

In [None]:
# Place all the records in a Python list called "records"
records = []
for i in range(NumberOfLines):
    content = results[i].contents[0]
    print(i)
    print(type(content))
    if isinstance(content, NavigableString):
        migrantMentions = content.count('migra')
        tradeMentions = content.count('trade')
        charCount = len(content)
        urls = "none"
        if results[i].find('a') is not None:
            urls = results[i].find('a')['href']
        records.append((content, migrantMentions, tradeMentions, urls, charCount))

In [None]:
# Use the pandas module and DataFrame() function to create a table of our "records"
df = pd.DataFrame(records, columns=['content', '*migra* mentioned', '*trade* mentioned', 'urls', 'char count'])

In [None]:
df

In [None]:
# Export our table into a Comma-Separated Value (csv) file.
df.to_csv('trump_mexico_border.csv', index=False, encoding='utf-8')
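To check what the export produces without touching the disk, the same steps can be run on a couple of made-up records and written to an in-memory buffer. This is only a sketch: `sample_paragraphs`, the simplified column names, and the use of `io.StringIO` in place of a file name are assumptions for illustration.

```python
import io

import pandas as pd

# Made-up paragraphs for illustration only.
sample_paragraphs = ["New tariffs affect trade.", "Officials discussed migration."]

sample_records = []
for text in sample_paragraphs:
    sample_records.append(
        (text, text.count("migra"), text.count("trade"), "none", len(text))
    )

columns = ["content", "migra mentions", "trade mentions", "urls", "char count"]
sample_df = pd.DataFrame(sample_records, columns=columns)

# Write the CSV to memory instead of a file, then read it back.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
round_trip = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
```

With `index=False` the row numbers are left out, so the first line of the output is the header row and each record follows on its own line.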

## Caution...

* Web scraping works best with **static, well-structured web pages**. Dynamic or interactive content on a web page is often not accessible through the HTML source, which makes scraping it much harder!
* Web scraping is a "fragile" approach for building a dataset. The HTML on a page you are scraping can **change at any time**, which may cause your scraper to stop working.
* If you can **download the data** you need from a website (wget on Linux or simply save the page via your web browser), or if the website provides an **API with data access**, those approaches are preferable to scraping since they are easier to implement and less likely to break.
* If you are scraping a lot of pages from the same website in rapid succession, it's best to **insert delays in your code** so that you don't overwhelm the website with requests. If the website decides you are causing a problem, **it can block your IP address (which may affect everyone in your building!).**
* Before scraping a website, you should review its **robots.txt** file (the file defined by the Robots exclusion standard) to check whether you are "allowed" to scrape it.
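
The last two cautions can be sketched with the standard library alone. The example below parses a made-up robots.txt with `urllib.robotparser` and pauses between requests with `time.sleep`. The rules in `SAMPLE_ROBOTS_TXT`, the helper names `load_rules` and `polite_fetch`, and the stub `fake_fetch` are all hypothetical; a real scraper would load the site's own robots.txt and pass a function that actually downloads the page.

```python
import time
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real site publishes its own robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_fetch(urls, rules, fetch, user_agent="*", delay_seconds=None):
    """Fetch only the allowed URLs, sleeping after each request."""
    if delay_seconds is None:
        # Fall back to 1 second when robots.txt gives no Crawl-delay.
        delay_seconds = rules.crawl_delay(user_agent) or 1
    pages = {}
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # skip anything robots.txt disallows
        pages[url] = fetch(url)
        time.sleep(delay_seconds)
    return pages


def fake_fetch(url):
    """Stand-in for a real download so the sketch runs offline."""
    return "<html>stub for " + url + "</html>"


rules = load_rules(SAMPLE_ROBOTS_TXT)
candidate_urls = [
    "https://example.com/news/story.html",
    "https://example.com/private/notes.html",
]
# delay_seconds=0 keeps the demo instant; omit it to honor Crawl-delay.
pages = polite_fetch(candidate_urls, rules, fake_fetch, delay_seconds=0)
print(list(pages))
```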

*Thanks to Kevin Markham's similar exercise at https://github.com/justmarkham/trump-lies/blob/master/trump_lies.ipynb*