# Chapter 9. Getting data

## 9.1 stdin and stdout

If you run your Python scripts at the command line, you can pipe data through them using sys.stdin and sys.stdout.

In [3]:
import sys, re
sys.argv
# sys.argv is the list of command-line arguments
# sys.argv[0] is the name of the program itself
# sys.argv[1] will be the regex specified at the command line

['/Users/boyuan/anaconda3/envs/sds/lib/python3.8/site-packages/ipykernel_launcher.py',
 '-f',
 '/Users/boyuan/Library/Jupyter/runtime/kernel-695b37f1-9614-4d45-8a11-86f9590b3fed.json']

In [5]:
# egrep.py
regex = sys.argv[1]
for line in sys.stdin: # for every line passed into the script
    if re.search(regex, line): 
        sys.stdout.write(line) # if it matches the regex, write it to stdout

In [6]:
# line_count.py 
import sys
count = 0
for line in sys.stdin:
    count += 1
print(count)

0


In [None]:
!cat SomeFile.txt | python egrep.py "[0-9]" | python line_count.py

The | is the pipe character, which means “use the output of the left command as the input of the right command.” You can build pretty elaborate data-processing pipelines this way.
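The same chaining idea can be tried without a shell. The following is a minimal sketch, not part of the original scripts: the function names `matching_lines` and `count_lines` are hypothetical, and `io.StringIO` stands in for real piped input so the example runs anywhere.

```python
import io
import re
from typing import Iterable, Iterator


def matching_lines(lines: Iterable[str], regex: str) -> Iterator[str]:
    """Yield only the lines that match the regex (the egrep.py step)."""
    for line in lines:
        if re.search(regex, line):
            yield line


def count_lines(lines: Iterable[str]) -> int:
    """Count the lines it receives (the line_count.py step)."""
    return sum(1 for _ in lines)


# io.StringIO stands in for sys.stdin (an assumption for this sketch)
fake_stdin = io.StringIO("no digits here\nroom 101\nagent 007\n")

# The output of the first step is the input of the second, like a pipe
digit_line_count = count_lines(matching_lines(fake_stdin, "[0-9]"))
print(digit_line_count)
```

Because each step consumes lines lazily, nothing is held in memory beyond the current line, which is roughly how the shell pipeline behaves too.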

In [None]:
# most_common_words.py

import sys 

from collections import Counter

# pass in number of words as first argument

try:
    num_words = int(sys.argv[1])
except (IndexError, ValueError):  # argument missing or not an integer
    print('usage: most_common_words.py num_words')
    sys.exit(1) # nonzero exit code indicates error

counter = Counter(word.lower()
                  for line in sys.stdin
                  for word in line.strip().split()
                  if word)
for word, count in counter.most_common(num_words):
    sys.stdout.write(str(count))
    sys.stdout.write("\t")
    sys.stdout.write(word)
    sys.stdout.write("\n")

In [None]:
!cat the_bible.txt | python most_common_words.py 10

## 9.2 Reading files

### The basics of text files

In [None]:
# 'r' means read-only; it's assumed if you leave it out
file_for_reading = open('reading_file.txt', 'r')
file_for_reading2 = open('reading_file.txt')

In [None]:
# 'w' is write -- will destroy the file if it already exists!
file_for_writing = open('writing_file.txt', 'w')

In [None]:
# 'a' is append -- for adding to the end of the file
file_for_appending = open('appending_file.txt', 'a')

In [None]:
# don't forget to close your files when you're done
file_for_writing.close()

Because it is easy to forget to close your files, you should always use them in a with block, at the end of which they will be closed automatically:

In [None]:
with open(filename) as f:
    data = function_that_gets_data_from(f)
    
# at this point f has already been closed, so don't try to use it
process(data)
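The cell above uses placeholder names, so here is a small runnable sketch of the same pattern. It assumes a throwaway temporary file (created here only for illustration) and checks that the file really is closed once the with block ends.

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name

with open(path) as f:
    lines = f.readlines()   # read everything while the file is open

# The with block has ended, so the file is closed
file_was_closed = f.closed

try:
    f.readline()
    read_failed = False
except ValueError:          # reading a closed file raises ValueError
    read_failed = True

os.remove(path)             # clean up the temporary file
print(lines, file_was_closed, read_failed)
```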

In [None]:
starts_with_hash = 0

with open('input.txt') as f:
    for line in f: # look at each line in the file
        if re.match("^#",line): # use a regex to see if it starts with '#'
            starts_with_hash += 1 # if it does, add 1 to the count

In [None]:
def get_domain(email_address: str) -> str:
    """Split on '@' and return the last piece"""
    return email_address.lower().split("@")[-1]

# a couple of tests
assert get_domain('joelgrus@gmail.com') == 'gmail.com'
assert get_domain('joel@m.datasciencester.com') == 'm.datasciencester.com'

from collections import Counter

with open('email_addresses.txt', 'r') as f:
    domain_counts = Counter(get_domain(line.strip())
                            for line in f
                            if "@" in line)

### Delimited files

Never parse a comma-separated file yourself. You will screw up the edge cases.
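One such edge case is a quoted field that itself contains a comma. The sketch below uses a made-up record (the company-name field is an assumption for illustration) to compare a naive split with the csv module.

```python
import csv
import io

# One made-up record whose second field contains a comma inside quotes
raw = 'AAPL,"Apple, Inc.",90.91\n'

# Naive approach: split on every comma
naive_fields = raw.strip().split(",")

# csv.reader understands the quoting rules
csv_fields = next(csv.reader(io.StringIO(raw)))

print(naive_fields)  # the quoted field gets split in two
print(csv_fields)    # the quoted field stays in one piece
```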

In [None]:
import csv

'''
6/20/2014 AAPL 90.91
6/20/2014 MSFT 41.68
6/20/2014 FB 64.5
6/19/2014 AAPL 91.86
6/19/2014 MSFT 41.51
6/19/2014 FB 64.34
'''

with open('tab_delimited_stock_prices.txt') as f:
    tab_reader = csv.reader(f, delimiter = '\t')
    for row in tab_reader:
        date = row[0]
        symbol = row[1]
        closing_price = float(row[2])
        process(date, symbol, closing_price)

In [None]:
with open('colon_delimited_stock_prices.txt') as f:
    colon_reader = csv.DictReader(f, delimiter=':')
    for dict_row in colon_reader:
        date = dict_row["date"]
        symbol = dict_row["symbol"]
        closing_price = float(dict_row["closing_price"])
        process(date, symbol, closing_price)

In [None]:
todays_prices = {'AAPL': 90.91, 'MSFT': 41.68, 'FB': 64.5 }

# newline='' lets the csv module control line endings, as the csv docs recommend
with open('comma_delimited_stock_prices.txt', 'w', newline='') as f:
    csv_writer = csv.writer(f, delimiter=',')
    for stock, price in todays_prices.items():
        csv_writer.writerow([stock, price])
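To see what the writer actually produces, here is a hedged sketch that writes to an in-memory buffer instead of a file and reads the result back. The rows are made up, and `io.StringIO` is only a stand-in for a real file.

```python
import csv
import io

rows = [["AAPL", 90.91], ["Widgets, Inc.", 12.5]]  # made-up rows

buffer = io.StringIO()   # in-memory stand-in for a real file
writer = csv.writer(buffer, delimiter=",")
writer.writerows(rows)

written_text = buffer.getvalue()

# Read the same text back; every field comes back as a string,
# so the price has to be converted to float again
buffer.seek(0)
read_back = [[name, float(price)] for name, price in csv.reader(buffer)]

print(written_text)
print(read_back)
```

The writer adds quotes around the field that contains a comma, which is exactly the bookkeeping that is easy to get wrong by hand.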

## 9.3 Scraping the web

### HTML and the parsing thereof

Pages on the web are written in HTML, in which text is (ideally) marked up into elements and their attributes:

<html>
    <head>
        <title>A web page</title>
    </head>
    <body>
        <p id="author">Joel Grus</p>
        <p id="subject">Data Science</p>
    </body>
</html>

In a perfect world, where all web pages were marked up semantically for our benefit, we would be able to extract data using rules like “find the `<p>` element whose id is subject and return the text it contains.” In the actual world, HTML is not generally well formed, let alone annotated. This means we’ll need help making sense of it.

To get data out of HTML, we will use the Beautiful Soup library, which builds a tree out of the various elements on a web page and provides a simple interface for accessing them. As I write this, the latest version is Beautiful Soup 4.6.0, which is what we’ll be using. We’ll also be using the Requests library, which is a much nicer way of making HTTP requests than anything that’s built into Python.

In [12]:
from bs4 import BeautifulSoup
import requests

url = ("https://raw.githubusercontent.com/"
       "joelgrus/data/master/getting-data.html")
html = requests.get(url).text
soup = BeautifulSoup(html, 'html5lib')

In [13]:
soup

<!DOCTYPE html>
<html lang="en-US"><head>
    <title>Getting Data</title>
    <meta charset="utf-8"/>
</head>
<body>
    <h1>Getting Data</h1>
    <div class="explanation">
        This is an explanation.
    </div>
    <div class="comment">
        This is a comment.
    </div>
    <div class="content">
        <p id="p1">This is the first paragraph.</p>
        <p class="important">This is the second paragraph.</p>
    </div>
    <div class="signature">
        <span id="name">Joel</span>
        <span id="twitter">@joelgrus</span>
        <span id="email">joelgrus-at-gmail</span>
    </div>


</body></html>

In [14]:
first_paragraph = soup.find('p')
first_paragraph

<p id="p1">This is the first paragraph.</p>

In [15]:
first_paragraph_text = soup.p.text
first_paragraph_text

'This is the first paragraph.'

In [16]:
first_paragraph_words = soup.p.text.split()
first_paragraph_words

['This', 'is', 'the', 'first', 'paragraph.']

In [17]:
first_paragraph_id = soup.p['id']
first_paragraph_id

'p1'

In [18]:
first_paragraph_id2 = soup.p.get('id')
first_paragraph_id2

'p1'

In [19]:
all_paragraphs = soup.find_all('p')
all_paragraphs

[<p id="p1">This is the first paragraph.</p>,
 <p class="important">This is the second paragraph.</p>]

In [20]:
paragraphs_with_ids = [p for p in soup('p') if p.get('id')]
paragraphs_with_ids

[<p id="p1">This is the first paragraph.</p>]

In [21]:
important_paragraphs = soup('p', {'class':'important'})
important_paragraphs

[<p class="important">This is the second paragraph.</p>]

In [22]:
important_paragraphs2 = soup('p', 'important')
important_paragraphs2

[<p class="important">This is the second paragraph.</p>]

In [23]:
important_paragraphs3 = [p for p in soup('p')
                         if 'important' in p.get('class', [])]
important_paragraphs3

[<p class="important">This is the second paragraph.</p>]

In [24]:
spans_inside_divs = [span
                     for div in soup('div')
                     for span in div('span')]
spans_inside_divs

[<span id="name">Joel</span>,
 <span id="twitter">@joelgrus</span>,
 <span id="email">joelgrus-at-gmail</span>]

### Example: Keeping tabs on Congress

In [27]:
from bs4 import BeautifulSoup
import requests

url = "https://www.house.gov/representatives"
text = requests.get(url).text
soup = BeautifulSoup(text, 'html5lib')

In [28]:
all_urls = [a['href']
            for a in soup('a')
            if a.has_attr('href')]
print(len(all_urls))

966


In [29]:
import re

# Must start with http:// or https://
# Must end with .house.gov or .house.gov/
regex = r"^https?://.*\.house\.gov/?$"

# Let's write some tests!
assert re.match(regex, "http://joel.house.gov")
assert re.match(regex, "https://joel.house.gov")
assert re.match(regex, "http://joel.house.gov/")
assert re.match(regex, "https://joel.house.gov/")
assert not re.match(regex, "joel.house.gov")
assert not re.match(regex, "http://joel.house.com")
assert not re.match(regex, "https://joel.house.gov/biography")

# And now apply
good_urls = [url for url in all_urls if re.match(regex, url)]

print(len(good_urls))

874


In [30]:
# Use set to get rid of the duplicates 

good_urls = list(set(good_urls))
print(len(good_urls))

437


In [31]:
html = requests.get('https://jayapal.house.gov').text
soup = BeautifulSoup(html, 'html5lib')

# Use a set because the links might appear multiple times.
links = {a['href'] for a in soup('a') if 'press releases' in a.text.lower()}

print(links)

{'https://jayapal.house.gov/category/press-releases/', 'https://jayapal.house.gov/media/press-releases/'}


In [None]:
from typing import Dict, Set

press_releases: Dict[str, Set[str]] = {}
    
for house_url in good_urls:
    html = requests.get(house_url).text
    soup = BeautifulSoup(html, 'html5lib')
    pr_links = {a['href'] for a in soup('a') if 'press releases' in a.text.lower()}
    
    print(f"{house_url}: {pr_links}")
    press_releases[house_url] = pr_links

Most sites will have a robots.txt file that indicates how frequently you may scrape the site (and which paths you’re not supposed to scrape).
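Python's standard library can read these rules for you. The following is a minimal sketch using `urllib.robotparser`; the robots.txt lines, the user-agent name `my-scraper`, and the example.com URLs are all hypothetical, and the rules are parsed from memory rather than fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real one lives at <site>/robots.txt
robots_lines = [
    "User-agent: *",
    "Crawl-delay: 5",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)   # parse in-memory lines instead of fetching

# "my-scraper" is a made-up user-agent name
allowed_public = parser.can_fetch("my-scraper", "https://example.com/press-releases")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/notes")
delay_seconds = parser.crawl_delay("my-scraper")

print(allowed_public, allowed_private, delay_seconds)
```

For a live site you would instead call `set_url` with the site's robots.txt address followed by `read()`, then sleep for the reported delay between requests.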

In [33]:
def paragraph_mentions(text: str, keyword: str) -> bool:
    """
    Returns True if a <p> inside the text mentions {keyword}
    """
    soup = BeautifulSoup(text, 'html5lib')
    paragraphs = [p.get_text() for p in soup('p')]
    return any(keyword.lower() in paragraph.lower()
               for paragraph in paragraphs)

In [34]:
text = """<body><h1>Facebook</h1><p>Twitter</p>"""
assert paragraph_mentions(text, "twitter") # is inside a <p>
assert not paragraph_mentions(text, "facebook") # not inside a <p>

In [35]:
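from urllib.parse import urljoin

for house_url, pr_links in press_releases.items():
    for pr_link in pr_links:
        # urljoin handles both relative and absolute links
        # (the links printed above for jayapal.house.gov are absolute)
        url = urljoin(house_url, pr_link)
        text = requests.get(url).text

        if paragraph_mentions(text, 'data'):
            print(f"{house_url}")
            break # done with this house_url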

https://quigley.house.gov/
https://long.house.gov/
https://robinkelly.house.gov/
https://foxx.house.gov/
https://horsford.house.gov
https://ruiz.house.gov
https://cicilline.house.gov/
https://riggleman.house.gov
https://luria.house.gov
https://chu.house.gov/
https://castor.house.gov/
https://crist.house.gov
https://pascrell.house.gov/
https://frankel.house.gov
https://bass.house.gov/
https://rubengallego.house.gov/
https://buddycarter.house.gov/
https://horn.house.gov/
https://gosar.house.gov/


## 9.4 Using APIs

Many websites and web services provide application programming interfaces (APIs), which allow you to explicitly request data in a structured format. This saves you the trouble of having to scrape them.

### JSON and XML

Because HTTP is a protocol for transferring text, the data you request through a web API needs to be serialized into a string format. Often this serialization uses JavaScript Object Notation (JSON). JavaScript objects look quite similar to Python dicts, which makes their string representations easy to interpret.

In [None]:
{ "title" : "Data Science Book",
  "author" : "Joel Grus",
  "publicationYear" : 2019,
  "topics" : [ "data", "science", "data science"] }

We can parse JSON using Python’s json module. In particular, we will use its loads function, which deserializes a string representing a JSON object into a Python object:

In [36]:
import json 

serialized = '''
{ "title" : "Data Science Book",
  "author" : "Joel Grus",
  "publicationYear" : 2019,
  "topics" : [ "data", "science", "data science"] }
'''

# parse the JSON to create a python dict 

deserialized = json.loads(serialized)
assert deserialized['publicationYear'] == 2019
assert "data science" in deserialized['topics']
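Going the other direction works the same way. As a small sketch, `json.dumps` serializes a Python dict into a JSON string, and feeding that string back through `json.loads` recovers an equal dict; the variable names here are chosen only for illustration.

```python
import json

book = {"title": "Data Science Book",
        "author": "Joel Grus",
        "publicationYear": 2019,
        "topics": ["data", "science", "data science"]}

serialized_book = json.dumps(book)            # Python dict -> JSON string
round_tripped = json.loads(serialized_book)   # JSON string -> Python dict

print(type(serialized_book).__name__)
print(round_tripped == book)
```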

### Using an unauthenticated API

In [40]:
import requests, json

github_user = "joelgrus"
endpoint = f"https://api.github.com/users/{github_user}/repos"

repos = json.loads(requests.get(endpoint).text)
repos[0]

{'id': 112873601,
 'node_id': 'MDEwOlJlcG9zaXRvcnkxMTI4NzM2MDE=',
 'name': 'advent2017',
 'full_name': 'joelgrus/advent2017',
 'private': False,
 'owner': {'login': 'joelgrus',
  'id': 1308313,
  'node_id': 'MDQ6VXNlcjEzMDgzMTM=',
  'avatar_url': 'https://avatars1.githubusercontent.com/u/1308313?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/joelgrus',
  'html_url': 'https://github.com/joelgrus',
  'followers_url': 'https://api.github.com/users/joelgrus/followers',
  'following_url': 'https://api.github.com/users/joelgrus/following{/other_user}',
  'gists_url': 'https://api.github.com/users/joelgrus/gists{/gist_id}',
  'starred_url': 'https://api.github.com/users/joelgrus/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/joelgrus/subscriptions',
  'organizations_url': 'https://api.github.com/users/joelgrus/orgs',
  'repos_url': 'https://api.github.com/users/joelgrus/repos',
  'events_url': 'https://api.github.com/users/joelgrus/events{/pri

In [43]:
from collections import Counter
from dateutil.parser import parse

dates = [parse(repo['created_at']) for repo in repos]
month_counts = Counter(date.month for date in dates)
weekday_counts = Counter(date.weekday() for date in dates)

In [45]:
print(month_counts, weekday_counts)

Counter({11: 5, 12: 4, 9: 4, 7: 4, 2: 3, 1: 3, 5: 3, 8: 2, 6: 1, 4: 1}) Counter({4: 7, 2: 7, 1: 5, 5: 4, 6: 4, 3: 2, 0: 1})


In [47]:
last_5_repositories = sorted(repos,
                             key = lambda r: r['pushed_at'],
                             reverse = True)[:5]
last_5_languages = [repo['language']
                    for repo in last_5_repositories]

last_5_languages

['Python', 'JavaScript', 'Python', 'Python', 'Python']

One of the benefits of using Python is that someone has already built a library for pretty much any API you’re interested in accessing. When they’re done well, these libraries can save you a lot of the trouble of figuring out the hairier details of API access.

### Finding APIs

If you need data from a specific site, look for a “developers” or “API” section of the site for details, and try searching the web for `python <sitename> api` to find a library.

## 9.5 Example: Using the Twitter API

**Caution:** Don’t share the keys, don’t publish them in your book, and don’t check them into your public GitHub repository. One simple solution is to store them in a credentials.json file that doesn’t get checked in, and to have your code use json.loads to retrieve them. Another solution is to store them in environment variables and use os.environ to retrieve them.
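Both approaches might look like the sketch below. The environment variable names, the dictionary keys, and the two function names are assumptions made for illustration; use whatever names fit your own setup.

```python
import json
import os
from typing import Dict


def load_credentials_from_env() -> Dict[str, str]:
    """Read the keys from environment variables (names are hypothetical)."""
    return {
        "consumer_key": os.environ["TWITTER_CONSUMER_KEY"],
        "consumer_secret": os.environ["TWITTER_CONSUMER_SECRET"],
    }


def load_credentials_from_file(path: str = "credentials.json") -> Dict[str, str]:
    """Read the keys from a JSON file that is kept out of version control."""
    with open(path) as f:
        return json.loads(f.read())
```

If you go the file route, adding credentials.json to your .gitignore keeps it from being committed by accident. A missing environment variable raises a KeyError here, which fails loudly instead of silently sending an empty key.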

Twitter offers two ways to authenticate. The simple way, OAuth 2, suffices when you just want to do simple searches. The complex way, OAuth 1, is required when you want to perform actions (e.g., tweeting) or (in particular for us) connect to the Twitter stream.

In [51]:
from twython import Twython
from twython import TwythonStreamer

## 9.6 For further exploration