### Intro to Python

Python is a popular multi-purpose coding language. We usually teach it over the course of weeks but you'll see some of the essentials in the next hour or so (assuming all has gone well in the first half of all this). You're looking at one interface to Python, a Jupyter Notebook. It has two kinds of "cells" -- a text or Markdown cell (like this one) and then a Code cell for Python instructions (like the next one). You execute either by holding the shift key and hitting "enter".

In [1]:
# do some addition
2*(2+5)

14

In [2]:
# store the result and use it in another computation
x = 2+3
5*x

25

Python is an "object oriented" language meaning that it deals in so-called software "objects". It knows about a lot of different types of data, besides numbers. There are strings and True/False booleans, and, well, you name it. You can even make new objects to suit your needs if your data have special structure you need to account for.

There are two special objects we'll see today that hold information. One is a list and one is a dictionary. Lists store information in numerical order, and dictionaries store data under names. 

In [3]:
x = 5
x<2

False

In [4]:
x = "abc"
y = "def"
x+y*5

'abcdefdefdefdefdef'

In [5]:
# a list
x = [4,7,"hi",3.14]
y = ["bye",1000]
x+y

[4, 7, 'hi', 3.14, 'bye', 1000]

In [6]:
# store the combined list in z and display it
z = x+y
z

[4, 7, 'hi', 3.14, 'bye', 1000]
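
Because Python starts counting at zero, the third element of a list sits at index 2. Here is a minimal sketch of pulling single items and a slice out of the combined list; the names `third`, `last` and `first_two` are just illustrative.

```python
# the same combined list as above
z = [4, 7, "hi", 3.14, "bye", 1000]

third = z[2]        # index 2 is the third element
last = z[-1]        # negative indexes count back from the end
first_two = z[:2]   # a slice: the elements at index 0 and 1

print(third)        # hi
print(last)         # 1000
print(first_two)    # [4, 7]
```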

In [7]:
for e in z:
    print(e*5)

20
35
hihihihihi
15.700000000000001
byebyebyebyebye
5000


In [8]:
help(list)

Help on class list in module builtins:

class list(object)
 |  list(iterable=(), /)
 |  
 |  Built-in mutable sequence.
 |  
 |  If no argument is given, the constructor creates a new empty list.
 |  The argument must be an iterable if specified.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __iadd__(self, value, /)
 |      Implement self+=value.
 |  
 |  __imul__(self, value, /)
 |      Implement self*=value.
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self))

In [9]:
# a dictionary
x = {"name":"Web Scraping","time":"9am-11am","pages":1000}
type(x)

dict

In [10]:
# pull out the data associated with the "pages" key
x["pages"]

1000
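
Dictionaries support more than a single lookup. The sketch below reuses the dictionary from the cell above; the `"room"` and `"instructor"` keys are made up for illustration and are not part of the original data.

```python
x = {"name": "Web Scraping", "time": "9am-11am", "pages": 1000}

time_slot = x["time"]                     # look up a value by its key
x["room"] = "TBD"                         # add a new key (hypothetical)
missing = x.get("instructor", "unknown")  # .get returns a default instead of raising KeyError

# loop over every key/value pair, in insertion order
for key, value in x.items():
    print(key, "->", value)
```

Using `.get` with a default is handy when scraped data may or may not contain a given field.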

## Intro to Web Scraping

We've been collecting data using simple interfaces -- extensions to Chrome. As we mentioned, while open data portals, APIs and other publication mechanisms provide easy ways to get to the information we need for our analysis and reporting, there are plenty of other valuable data sources for us to take advantage of: web pages (HTML), PDF files, email dumps, etc. Automating the extraction of useful information from web pages is known as **"web scraping."** Terrible name aside, web scraping is very powerful and it's something you'll want to master. Today, we'll close our session talking about some of the basics of web scraping in Python.

![Web Scraping](https://blog.hartleybrody.com/wp-content/uploads/2012/12/scraper-tool.jpg)


There are many ways to scrape information from the web, but we're going to use Python, [requests](http://docs.python-requests.org/en/master/) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/).

### Trump's Lies

Since we haven't talked about Trump nearly enough in this class(!), let's take a look at a New York Times piece on [all of the lies Trump told in 2017](https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html). Fun! This will lead to a good scraping exercise. Before we get to the code, take a quick look at the NYTimes piece.

The first part of web scraping is making HTTP requests to pull the pages we need. We will use the [requests](http://docs.python-requests.org/en/master/) library.

In [11]:
# make the request to get the Trump Lies HTML
from requests import get

url = 'https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html'

lightbulb = {
    'From': 'markh@columbia.edu',
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
}

response = get(url, headers=lightbulb)

In [12]:
type(response)

requests.models.Response

Above, we include a "header" field (represented as a dictionary). The headers pass information to the web server that might change the way it returns content. The "User-Agent" header tells the server what kind of browser the request is being made from -- some servers don't like handing pages out to bots, so in later exercises we might need to specify it.

For now, we are using the "From" header to announce ourselves. I like to tell a source that I'm taking data. If you want to know more about headers, have a look [here](https://en.wikipedia.org/wiki/List_of_HTTP_header_fields).
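
If you'd like to see exactly which headers would go out before contacting the server, one option is to build the request without sending it. This is a sketch, assuming the `requests` library's `Request(...).prepare()` workflow; the email address and User-Agent string are placeholders, not values from this class.

```python
from requests import Request

url = 'https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html'

headers = {
    'From': 'you@example.com',             # placeholder address
    'User-Agent': 'my-class-scraper/0.1',  # hypothetical agent string
}

# build the request object but do not send it over the network
prepared = Request('GET', url, headers=headers).prepare()

print(prepared.method, prepared.url)
for name, value in prepared.headers.items():
    print(name + ':', value)
```

Nothing is fetched here; it is only a way to inspect what `get(url, headers=...)` would transmit.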

In [13]:
# let's see what we have. remember that response.text will give us a string value of the page HTML

print(response.text)

<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
<!--[if IE 8]> <html lang="en" class="no-js ie8 lt-ie10 lt-ie9 page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
<!--[if (lt IE 8)]> <html lang="en" class="no-js lt-ie10 lt-ie9 lt-ie8 page-intera

This is kind of a mess. The whole web page has been read in as a string. Thankfully, one of the great things about Python is a package called [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/), designed by [Leonard Richardson](http://www.crummy.com/self/). It is truly a thing of beauty. BeautifulSoup is a parser for HTML (and XML) that creates an object that lets you interact with the components of a web page. You can search for tags, extract attributes from the tags and pull the content contained in a tag. [The documentation is pretty simple too.](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) At the time of writing, the latest version of BeautifulSoup is 4.6.0 and the package is called bs4.

We will come back to parsing Trump's lies with BeautifulSoup but let's start with a simple example first. Here is some very simple HTML that I'd like to run through BeautifulSoup:

```html
<html>

    <head>
        <title>My Technology News Site</title>
    </head>

    <body>
        <div>
            <p class="title"><strong>Steve Jobs introduces the public beta of Mac OS X</strong></p>
            <div class="description">Sept 13, 2000 - Steve Jobs <a href="https://www.apple.com/pr/library/2000/09/13Apple-Releases-Mac-OS-X-Public-Beta.html" target="_blank">introduces</a> the public beta of Mac OS X for US$29.95.</div>
            <div class="author">Author: Michael Young</div>
        </div>
    </body>

</html>

```

In [14]:
from bs4 import BeautifulSoup

our_html = '''
<html>

    <head>
        <title>My Technology News Site
    </head>

    <body>
        <div>
            <p class="title"><strong>Steve Jobs introduces the public beta of Mac OS X</strong></p>
            <div class="description">Sept 13, 2000 - Steve Jobs <a href="https://www.apple.com/pr/library/2000/09/13Apple-Releases-Mac-OS-X-Public-Beta.html" target="_blank">introduces</a> the public beta of Mac OS X for US$29.95.</div>
            <div class="author">Author: Michael Young</div>
            <p> Another paragraph </p>
        </div>
    </body>

</html>
'''

# BeautifulSoup takes two arguments: a string (hopefully with HTML in it) and the parser we'd like to use
soup = BeautifulSoup(our_html, 'html.parser')

# print out a pretty version of the BeautifulSoup object
print(soup.prettify())

<html>
 <head>
  <title>
   My Technology News Site
  </title>
 </head>
 <body>
  <div>
   <p class="title">
    <strong>
     Steve Jobs introduces the public beta of Mac OS X
    </strong>
   </p>
   <div class="description">
    Sept 13, 2000 - Steve Jobs
    <a href="https://www.apple.com/pr/library/2000/09/13Apple-Releases-Mac-OS-X-Public-Beta.html" target="_blank">
     introduces
    </a>
    the public beta of Mac OS X for US$29.95.
   </div>
   <div class="author">
    Author: Michael Young
   </div>
   <p>
    Another paragraph
   </p>
  </div>
 </body>
</html>



Let's do a super-quick review of [HTML](https://en.wikipedia.org/wiki/HTML):

Hypertext Markup Language is the language used for creating web pages. HTML uses `tags` which help label as well as structure the data in the document. Web browsers use the tags to help render the web page but do not display the tags themselves.

`Tags` normally come in pairs and have an opening tag `<p>` and a closing tag `</p>`:
```html
<p>this is a paragraph tag</p>
```

Other tags like the image tag `<img>` don't have a closing tag:
```html
<img src="http://somesite.com/images/logo.jpg" />
```

The other important thing about HTML tags is that they can contain one or more `attributes`. In the `<img>` tag above, for example, the `src` attribute is used to specify the URL of the image. A tag with multiple attributes could look like this:
```html
<p attribute_1="value1" attribute_2="value2">
Our content goes here
</p>
```
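
To connect attributes back to Python, here is a small sketch of how BeautifulSoup exposes them, using the two snippets above. The variable names (`snippet`, `soup_attrs`, `p`, `img`) are illustrative only.

```python
from bs4 import BeautifulSoup

snippet = '''
<p attribute_1="value1" attribute_2="value2">Our content goes here</p>
<img src="http://somesite.com/images/logo.jpg" />
'''

soup_attrs = BeautifulSoup(snippet, 'html.parser')

p = soup_attrs.p
print(p.attrs)            # all attributes of the tag, as a dictionary
print(p['attribute_1'])   # a single attribute, looked up like a dictionary key

img = soup_attrs.img
print(img['src'])
print(img.get('alt'))     # .get returns None when the attribute is missing
```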

HTML documents typically have nested tags (think of a tree!) that look like this:
```html
<html>
  <head>
    <title>My First Website!</title>
  </head>

  <body>
    <p>My mom would be proud of this.</p>
  </body>
</html>
```
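
The tree idea can be made concrete with a few lines of Python. The sketch below parses the small page above and prints each tag indented by how deeply it is nested; `outline` is a helper written for this example, not a BeautifulSoup function.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

page = '''<html>
  <head>
    <title>My First Website!</title>
  </head>
  <body>
    <p>My mom would be proud of this.</p>
  </body>
</html>'''

tree = BeautifulSoup(page, 'html.parser')

def outline(tag, depth=0):
    """Return (depth, tag name) pairs for a tag and every tag nested inside it."""
    rows = [(depth, tag.name)]
    for child in tag.children:
        if isinstance(child, Tag):   # skip the plain strings between tags
            rows.extend(outline(child, depth + 1))
    return rows

for depth, name in outline(tree.html):
    print('  ' * depth + name)
```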



Back to BeautifulSoup...

When we run our HTML document through BeautifulSoup, we get a Python object that allows us to traverse, query and manipulate the HTML document.

Here are a few ways to inspect our simple HTML document that we loaded above:

In [15]:
# <title> tag
print(soup.title)

# name of the <title> tag
print(soup.title.name)

# string value in the <title> tag
print(soup.title.text)

<title>My Technology News Site
    </title>
title
My Technology News Site
    


In [16]:
# how about if we want to find the first <div> tag?

print(soup.div.prettify())

<div>
 <p class="title">
  <strong>
   Steve Jobs introduces the public beta of Mac OS X
  </strong>
 </p>
 <div class="description">
  Sept 13, 2000 - Steve Jobs
  <a href="https://www.apple.com/pr/library/2000/09/13Apple-Releases-Mac-OS-X-Public-Beta.html" target="_blank">
   introduces
  </a>
  the public beta of Mac OS X for US$29.95.
 </div>
 <div class="author">
  Author: Michael Young
 </div>
 <p>
  Another paragraph
 </p>
</div>



In [17]:
# the string value for the first <p> tag within the first <div>

soup.div.p.text

'Steve Jobs introduces the public beta of Mac OS X'

In [18]:
# the text inside the first <div> nested under the first <div> (!?!)

soup.div.div.text

'Sept 13, 2000 - Steve Jobs introduces the public beta of Mac OS X for US$29.95.'

In [19]:
# here is how we'd get the "href" attribute of the <a> tag

soup.div.div.a["href"]

'https://www.apple.com/pr/library/2000/09/13Apple-Releases-Mac-OS-X-Public-Beta.html'

In [20]:
#### For You To Try

# how would you find the url in the description?



We can use `find` and `find_all` to search through the HTML to find certain tags and tag/attribute combinations. Let's take a look:

In [21]:
# find all <p> tags
for q in soup.find_all('p'):
    print(q.text)
    print("+++")

Steve Jobs introduces the public beta of Mac OS X
+++
 Another paragraph 
+++


In [22]:
# find all <div class="author">...</div> tags
for author_div in soup.find_all('div', attrs={'class': 'author'}):
    print(author_div.text)

Author: Michael Young
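
A short sketch of how `find` and `find_all` differ, run on a trimmed copy of our sample HTML. It assumes BeautifulSoup's `class_` keyword, which is a shorthand for `attrs={'class': ...}`; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '''
<div>
  <p class="title"><strong>Steve Jobs introduces the public beta of Mac OS X</strong></p>
  <div class="author">Author: Michael Young</div>
  <p>Another paragraph</p>
</div>
'''

doc = BeautifulSoup(html, 'html.parser')

first_p = doc.find('p')                      # only the first matching tag
all_p = doc.find_all('p')                    # a list-like result with every match
title_p = doc.find_all('p', class_='title')  # filter on the class attribute
missing = doc.find('table')                  # no match: find returns None

print(first_p.text)
print(len(all_p))
print(len(title_p))
print(missing)
```

Because `find` returns `None` when nothing matches, it is worth checking the result before asking for `.text` on a page you don't control.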


### "The president is still lying..."

Coming back to the Trump's lies page, how can we use BeautifulSoup to parse out the lies and create CSV data that we can use for our own analysis?


In [24]:
from requests import get
from bs4 import BeautifulSoup

i = 1
url = 'https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html?page='+str(i)

lightbulb = {
    'From': '<put your email here>'
}

# http request
response = get(url, headers=lightbulb)

# run the HTML through BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# print out a pretty version of the BeautifulSoup object
print(soup.prettify())

<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!-->
<html class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemscope="" itemtype="http://schema.org/NewsArticle" lang="en" xmlns:og="http://opengraphprotocol.org/schema/">
 <!--<![endif]-->
 <!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
 <!--[if IE 8]> <html lang="en" class="no-js ie8 lt-ie10 lt-ie9 page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
 <!--[if (lt IE 8)]> <html lang="en" class="no-js lt-ie10 lt-ie9 lt-ie8 pa

Still a mess! Let's open up the [link](https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html) in Chrome and view the HTML there. Remember the `View Page Source` option that allows us to peek under the covers of any web page? Another great resource is [Chrome Developer Tools](https://developer.chrome.com/devtools) which gives you an even greater look under the hood! 

Do we see any patterns in the NYTimes HTML?

```html
<span class="short-desc"><strong>Jan. 21&nbsp;</strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>&nbsp;&nbsp;<span class="short-desc"><strong>Jan. 21&nbsp;</strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>&nbsp;&nbsp;
```

In [25]:
# we can find each lie between the <span class="short-desc"> and </span> tags

for lie in soup.find_all('span', attrs={'class': 'short-desc'}):
    print(lie.prettify())

<span class="short-desc">
 <strong>
  Jan. 21
 </strong>
 “I wasn't a fan of Iraq. I didn't want to go into Iraq.”
 <span class="short-truth">
  <a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">
   (He was for an invasion before he was against it.)
  </a>
 </span>
</span>

<span class="short-desc">
 <strong>
  Jan. 21
 </strong>
 “A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.”
 <span class="short-truth">
  <a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">
   (Trump was on the cover 11 times and Nixon appeared 55 times.)
  </a>
 </span>
</span>

<span class="short-desc">
 <strong>
  Jan. 23
 </strong>
 “Between 3 million and 5 million illegal votes caused me to lose the popular vote.”
 <span class="short-truth">
  <a href="https://www.nytimes.com/2017/01/23/us/pol

In [26]:
# how would we find each date?

for lie in soup.find_all('span', attrs={'class': 'short-desc'}):
    date = lie.find('strong')
    print(date.string)

Jan. 21 
Jan. 21 
Jan. 23 
Jan. 25 
Jan. 25 
Jan. 25 
Jan. 25 
Jan. 26 
Jan. 26 
Jan. 28 
Jan. 29 
Jan. 30 
Feb. 3 
Feb. 4 
Feb. 5 
Feb. 6 
Feb. 6 
Feb. 6 
Feb. 6 
Feb. 7 
Feb. 7 
Feb. 9 
Feb. 9 
Feb. 10 
Feb. 12 
Feb. 16 
Feb. 16 
Feb. 16 
Feb. 16 
Feb. 16 
Feb. 16 
Feb. 18 
Feb. 18 
Feb. 24 
Feb. 24 
Feb. 24 
Feb. 27 
Feb. 27 
Feb. 28 
Feb. 28 
Feb. 28 
March 3 
March 4 
March 4 
March 7 
March 13 
March 13 
March 15 
March 17 
March 20 
March 21 
March 22 
March 22 
March 22 
March 29 
March 31 
April 2 
April 2 
April 5 
April 6 
April 11 
April 12 
April 12 
April 12 
April 12 
April 16 
April 18 
April 21 
April 21 
April 27 
April 28 
April 28 
April 28 
April 29 
April 29 
April 29 
April 29 
April 29 
April 29 
May 1 
May 1 
May 1 
May 2 
May 4 
May 4 
May 4 
May 8 
May 8 
May 8 
May 12 
May 12 
May 13 
May 26 
June 1 
June 1 
June 4 
June 5 
June 20 
June 21 
June 21 
June 21 
June 21 
June 21 
June 21 
June 21 
June 21 
June 22 
June 23 
June 27 
June 28 
June 29 
July 6 
Ju

**There's another way!**

BeautifulSoup tags have a `contents` attribute that returns a `list` of all of the tag's children. The children in this case are all of the tags and strings nested directly under a tag.

In [27]:
# print out the tag "contents"

for lie in soup.find_all('span', attrs={'class': 'short-desc'}):
    print(lie.contents)
    print('---\n')

[<strong>Jan. 21 </strong>, "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” ", <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span>]
---

[<strong>Jan. 21 </strong>, '“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” ', <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span>]
---

[<strong>Jan. 23 </strong>, '“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” ', <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html" target="_blank">(There's no evidence of illegal voting.)</a></span>]
-

In [28]:
# another way to get the date: the <strong> tag is the first item in contents

for lie in soup.find_all('span', attrs={'class': 'short-desc'}):
    date = lie.contents[0].string
    print(date)

Jan. 21 
Jan. 21 
Jan. 23 
Jan. 25 
Jan. 25 
Jan. 25 
Jan. 25 
Jan. 26 
Jan. 26 

In [29]:
type(lie)

bs4.element.Tag

In [30]:
# how about getting the actual lie?

for lie in soup.find_all('span', attrs={'class': 'short-desc'}):
    lie_text = lie.contents[1].string
    print(lie_text)

“I wasn't a fan of Iraq. I didn't want to go into Iraq.” 
“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” 
“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” 
“Now, the audience was the biggest ever. But this crowd was massive. Look how far back it goes. This crowd was massive.” 
“Take a look at the Pew reports (which show voter fraud.)” 
“You had millions of people that now aren't insured anymore.” 
“So, look, when President Obama was there two weeks ago making a speech, very nice speech. Two people were shot and killed during his speech. You can't have that.” 
“We've taken in tens of thousands of people. We know nothing about them. They can say they vet them. They didn't vet them. They have no papers. How can you vet somebody when you don't know anything about them and you have no papers? How do you vet them? You can't.” 
“I cut off hundreds of million
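
Each `short-desc` span actually carries four pieces of information: the date, the quote, the correction and the link behind it. Here is a sketch that pulls all four out of the first span from the NYTimes HTML shown earlier; the `record` dictionary and its key names are my own choice, not something the page defines.

```python
from bs4 import BeautifulSoup

# the first lie, copied from the NYTimes HTML sample shown earlier
sample = '''<span class="short-desc"><strong>Jan. 21&nbsp;</strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>'''

first_lie = BeautifulSoup(sample, 'html.parser').find('span', attrs={'class': 'short-desc'})
link = first_lie.find('a')

record = {
    'date': first_lie.contents[0].string.strip(),  # strip() also removes the non-breaking space
    'lie': first_lie.contents[1].string.strip(),
    'explanation': link.text.strip('()'),          # drop the surrounding parentheses
    'url': link['href'],
}

for key, value in record.items():
    print(key + ':', value)
```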

### Putting It All Together

Now that we have the scraping part knocked out (congrats!), what if we wanted to save the data to a local `csv` file, or to pandas, to do further analysis? How might we do that?


In [31]:
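from csv import writer

# open the file in a "with" block so it is flushed and closed when we're done;
# newline="" stops the csv module from writing blank rows on Windows
with open("lie.csv", "w", newline="", encoding="utf-8") as f:
    output = writer(f)
    output.writerow(["date", "description"])

    for lie in soup.find_all('span', attrs={'class': 'short-desc'}):
        lie_date = lie.contents[0].string
        lie_text = lie.contents[1].string
        output.writerow([lie_date, lie_text])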
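
To check what ends up in the file, you can read the rows back with `csv.DictReader`. The sketch below is self-contained: it uses two rows shaped like the scraped data and an in-memory `io.StringIO` buffer standing in for `lie.csv`, and it strips the stray whitespace before writing (an extra step the cell above does not take).

```python
import csv
import io

# two rows shaped like the scraped data (dates end in a non-breaking space)
rows = [
    ["Jan. 21\xa0", "“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "],
    ["Jan. 23\xa0", "“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” "],
]

buffer = io.StringIO()   # stands in for the file on disk
out = csv.writer(buffer)
out.writerow(["date", "description"])
for date, text in rows:
    out.writerow([date.strip(), text.strip()])

# rewind and read the rows back as dictionaries keyed by the header row
buffer.seek(0)
records = list(csv.DictReader(buffer))

print(len(records))
print(records[0]["date"])
```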

&#x1f3c6; **Challenge round!** &#x1f3c6;

Pick one or two of these tasks and use your skills with web scraping to answer the question. In each case, there is a URL and a data question attached to it. These come mainly from an excellent list compiled by Dan Nguyen at Stanford.

>Site: [https://analytics.usa.gov/](https://analytics.usa.gov/)<br>
Task: Number of people visiting US Government web sites now<br><br>
Site: [http://www.state.gov/r/pa/ode/socialmedia/](http://www.state.gov/r/pa/ode/socialmedia/)<br>
Task: The number of Pinterest accounts maintained by U.S. State Department embassies and missions<br><br>
Site: [https://petitions.whitehouse.gov/](https://petitions.whitehouse.gov/)<br>
Task: Number of petitions that have reached their goal<br><br>
Site: [https://www.faa.gov/air_traffic/flight_info/aeronav/aero_data/](https://www.faa.gov/air_traffic/flight_info/aeronav/aero_data/)<br>
Task: Number of airports with existing construction related activity<br><br>
Site: [https://www.osha.gov/pls/imis/establishment.html](https://www.osha.gov/pls/imis/establishment.html)<br>
Task: Number of OSHA enforcement inspections involving Wal-Mart in California since 2014<br><br>
Site: [https://www.tdcj.state.tx.us/death_row/dr_scheduled_executions.html](https://www.tdcj.state.tx.us/death_row/dr_scheduled_executions.html)<br>
Task: Number of days until Texas's next scheduled execution <br><br>