# Chapter 9. Getting Data

In [1]:
from __future__ import division
from collections import Counter
import math, random, csv, json
from bs4 import BeautifulSoup
import requests
import re

In order to be a data scientist you need data (yes, really).  
In this chapter, we take a look at different ways of getting data into Python and into the right formats.

## stdin and stdout

If you run your Python scripts at the command line, you can *pipe* data through them using <code>sys.stdin</code> and <code>sys.stdout</code>.  
For example, here is a script that reads in lines of text and spits back out the ones that match a regular expression:
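The script itself is not reproduced here. A minimal sketch of what `egrep.py` might look like follows, assuming the regular expression arrives as the first command-line argument; the helper name `matching_lines` is made up for illustration.

```python
# egrep.py (sketch): echo the lines of stdin that match a regex
import re
import sys

def matching_lines(lines, pattern):
    """Yield each line that contains a match for the regex `pattern`."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line

# Only run as a filter when a pattern was actually supplied on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    for line in matching_lines(sys.stdin, sys.argv[1]):
        sys.stdout.write(line)   # lines already end in '\n'
```

Keeping the matching logic in a function, separate from the `sys.stdin` loop, makes it easy to try out on an ordinary list of strings.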

Here's a program that counts the lines it receives and then writes out the count:
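That program is also not shown, so here is one possible sketch of `line_count.py`. The function name `count_lines` is an assumption, and an in-memory stream stands in for `sys.stdin` so the example runs without a pipe.

```python
# line_count.py (sketch): count the lines arriving on a stream
import io
import sys

def count_lines(stream):
    """Return the number of lines produced by iterating over `stream`."""
    return sum(1 for _ in stream)

# In the real script the last line would be: print(count_lines(sys.stdin))
# Here a StringIO object plays the role of stdin.
fake_stdin = io.StringIO(u"first\nsecond\nthird\n")
print(count_lines(fake_stdin))
```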

You could then use these to count how many lines of a file contain numbers.  
In Windows, you would use:  

<code>type SomeFile.txt | python egrep.py "[0-9]" | python line_count.py</code>  

whereas in a Unix system you would use:  

<code>cat SomeFile.txt | python egrep.py "[0-9]" | python line_count.py</code>  

The | is the pipe character, which means "use the output of the left command as the input of the right command".  
You can build some pretty elaborate data-processing pipelines this way.

Similarly, here is a script that counts the words in its input and writes out the most common ones:
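The script is missing from this copy. A rough sketch of `most_common_words.py` follows, assuming the number of words to report is the first command-line argument and that words are simply whitespace-separated tokens, lowercased. The function name is illustrative.

```python
# most_common_words.py (sketch): report the most frequent words on stdin
import sys
from collections import Counter

def most_common_words(lines, num_words):
    """Return the `num_words` most frequent lowercase words as (word, count) pairs."""
    counter = Counter(word.lower()
                      for line in lines
                      for word in line.strip().split())
    return counter.most_common(num_words)

# Only run as a filter when the word count was supplied on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    for word, count in most_common_words(sys.stdin, int(sys.argv[1])):
        sys.stdout.write("{0}\t{1}\n".format(count, word))
```

Running it on a text file would then look like this: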

In [2]:
cat lorem_ipsum.txt | python most_common_words.py 10

7	eget
7	in
6	non
5	sed
5	ut
4	vel
4	et
4	nulla
4	neque
4	ullamcorper


**Note**  
Plenty of Unix command-line tools, such as <code>grep</code> and <code>egrep</code>, already do this kind of work and are probably preferable to building your own from scratch.

## Reading Files

### The Basics of Text Files

The first step in working with a text file is to obtain a *file object* using `open`:

In [3]:
# 'r' means read-only
file_for_reading = open('reading_file.txt', 'r')

# 'w' means write
#!# This will destroy the file if it already exists!
file_for_writing = open('writing_file.txt', 'w')

# 'a' means append
# for adding to the end of a file
file_for_appending = open('appending_file.txt', 'a')

#!# Don't forget to close your files when you're done!
file_for_reading.close()
file_for_writing.close()
file_for_appending.close()

Since it is easy to forget to close your files, you should always use them in a `with` block, at the end of which they will be closed automatically:
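As a small illustration, the sketch below writes to and then reads from a scratch file inside `with` blocks. The file name `with_example.txt` and the temporary directory are assumptions made so the example does not touch your own files.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "with_example.txt")

with open(path, "w") as f:
    f.write("first line\n")
    f.write("second line\n")

# Leaving the block closed the file, with no explicit call to f.close()
print(f.closed)   # True

with open(path, "r") as f:
    data = f.read()
```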

If you need to read a whole text file, you can just iterate over the lines of the file using `for`:

In [4]:
starts_with_hash = 0

with open('lorem_ipsum.txt', 'r') as f:
    for line in f:                 # look at each line in the file
        if re.match("^#",line):    # use a regex to see if it starts with '#'
            starts_with_hash += 1  # if it does, add 1 to the count

starts_with_hash

1

Every line you get this way ends in a newline character (`\n`), so you will often want to `strip()` it before doing anything with it.

For example, imagine you have a file full of email addresses, one per line, and that you need to generate a histogram of the domains.  
A good first approximation is to just take the parts of the email addresses that come after the @.

In [5]:
def get_domain(email_address):
    """ split on '@' and return the last piece """
    return email_address.lower().split("@")[-1]

with open('email_addresses.txt', 'r') as f:
    domain_counts = Counter(get_domain(line.strip()) for line in f if "@" in line)

### Delimited Files

Frequently you will work with files that have lots of data on each line.  
These files are often either *comma-separated* or *tab-separated*.  
Don't try to parse them yourself -- use Python's `csv` module or the `pandas` library.  
If the file has no headers, you can use `csv.reader` to iterate over the rows.

For example, if we had a tab-delimited file of stock prices, we could process them with:

In [6]:
import csv

with open('tab_delimited_stock_prices.txt', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        date = row[0]
        symbol = row[1]
        closing_price = float(row[2])
        print(date, symbol, closing_price)

('6/20/2014', 'AAPL', 90.91)
('6/20/2014', 'MSFT', 41.68)
('6/20/2014', 'FB', 64.5)
('6/19/2014', 'AAPL', 91.86)
('6/19/2014', 'MSFT', 41.51)
('6/19/2014', 'FB', 64.34)


If your file has headers, you can either skip the header row (with an initial call to `next(reader)`), or get each row as a `dict`, with the headers as keys, by using `csv.DictReader`:

In [7]:
with open('colon_delimited_stock_prices.txt', 'r') as f:
    reader = csv.DictReader(f, delimiter=':')
    for row in reader:
        date = row["date"]
        symbol = row["symbol"]
        closing_price = float(row["closing_price"])
        print(date, symbol, closing_price)

('6/20/2014', 'AAPL', 90.91)
('6/20/2014', 'MSFT', 41.68)
('6/20/2014', 'FB', 64.5)


Even if your file doesn't have headers you can still use `DictReader` by passing it the keys as a `fieldnames` parameter.
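A brief sketch of the `fieldnames` approach is shown below. It is written for Python 3 and reads from an in-memory string instead of a file, and the key names are chosen here purely for illustration.

```python
import csv
import io

# Two header-less, tab-delimited rows standing in for a file on disk
raw = "6/20/2014\tAAPL\t90.91\n6/20/2014\tMSFT\t41.68\n"

reader = csv.DictReader(io.StringIO(raw),
                        fieldnames=["date", "symbol", "closing_price"],
                        delimiter="\t")

# Because fieldnames is given, the first line is treated as data, not as a header
rows = [(row["date"], row["symbol"], float(row["closing_price"]))
        for row in reader]
print(rows)
```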

You can similarly write out delimited data using `csv.writer`:

In [8]:
today_prices = { 'AAPL' : 95.95, 'MSFT' : 43.34, 'FB' : 66.66 }

with open('comma_delimited_stock_prices.txt', 'w') as f:
    writer = csv.writer(f, delimiter=',')
    for stock, price in today_prices.items():
        writer.writerow([stock, price])

`csv.writer()` will do the right thing if your fields themselves have commas in them.  
Your own hand-rolled writer probably won't.  
For example, if you attempt: 

In [9]:
results = [["test1", "success", "Monday"],
           ["test2", "success, kind of", "Tuesday"],
           ["test3", "failure, kind of", "Wednesday"],
           ["test4", "failure, utter", "Thursday"]]

#!# don't do this!
with open('bad_csv.csv', 'w') as f:
    for row in results:
        f.write(",".join(map(str, row)))  # might have too many commas in it
        f.write("\n")                     # row might have newlines as well

You will end up with a `csv` file that looks like:

In [10]:
with open('bad_csv.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        print(row)

['test1', 'success', 'Monday']
['test2', 'success', ' kind of', 'Tuesday']
['test3', 'failure', ' kind of', 'Wednesday']
['test4', 'failure', ' utter', 'Thursday']


that will be very difficult to make sense of.  
Open the `bad_csv.csv` file in a text editor and see for yourself.
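For contrast, here is a sketch of the same rows written with `csv.writer`. It targets Python 3 and uses an in-memory buffer in place of a file so that nothing is written to disk.

```python
import csv
import io

results = [["test1", "success", "Monday"],
           ["test2", "success, kind of", "Tuesday"],
           ["test3", "failure, kind of", "Wednesday"],
           ["test4", "failure, utter", "Thursday"]]

buffer = io.StringIO(newline="")   # stands in for open('good_csv.csv', 'w')
writer = csv.writer(buffer, delimiter=",")
writer.writerows(results)          # fields containing commas get quoted

# Read the data back: every row still has exactly three fields
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter=","))
print(rows == results)
```

The writer wraps any field containing the delimiter in double quotes, which is what lets the reader recover the original three columns.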

## Scraping the Web

### HTML and the Parsing Thereof

To get data out of HTML, we will use the [BeautifulSoup library](https://www.crummy.com/software/BeautifulSoup/), which builds a tree out of the various elements on a web page and provides a simple interface for accessing them.  
We will also be using the [requests library](http://docs.python-requests.org/en/latest/), which is a much nicer way of making HTTP requests than anything that's built into Python.  
Python's built-in HTML parser is not very lenient, meaning that it doesn't deal well with HTML that's not perfectly formed, so we will use [html5lib](https://github.com/html5lib/).

To use Beautiful Soup, we need to pass some HTML into the `BeautifulSoup()` function.  
In our examples, this will be the result of a call to `requests.get`:

In [11]:
from bs4 import BeautifulSoup
import requests
html = requests.get("http://www.example.com").text
soup = BeautifulSoup(html, 'html5lib')

after which we can get pretty far using a few simple methods.

We will typically work with `Tag` objects, which correspond to the tags representing the structure of an HTML page.  
For example, to find the first `<p>` tag (and its contents), you can use:

In [12]:
first_paragraph = soup.find('p')  # or just soup.p -- see below
first_paragraph

<p>This domain is established to be used for illustrative examples in documents. You may use this\n    domain in examples without prior coordination or asking for permission.</p>

In [13]:
also_first_paragraph = soup.p
also_first_paragraph

<p>This domain is established to be used for illustrative examples in documents. You may use this\n    domain in examples without prior coordination or asking for permission.</p>

You can get the text contents of a `Tag` using its `text` property:

In [14]:
first_paragraph_text = soup.p.text
first_paragraph_text

u'This domain is established to be used for illustrative examples in documents. You may use this\n    domain in examples without prior coordination or asking for permission.'

In [15]:
first_paragraph_words = soup.p.text.split()
first_paragraph_words

[u'This',
 u'domain',
 u'is',
 u'established',
 u'to',
 u'be',
 u'used',
 u'for',
 u'illustrative',
 u'examples',
 u'in',
 u'documents.',
 u'You',
 u'may',
 u'use',
 u'this',
 u'domain',
 u'in',
 u'examples',
 u'without',
 u'prior',
 u'coordination',
 u'or',
 u'asking',
 u'for',
 u'permission.']

You can extract a tag's attributes by treating it like a `dict`:

In [16]:
first_paragraph_id2 = soup.p.get('id')  # returns None if no 'id'
print(first_paragraph_id2)
first_paragraph_id2

None


You can get multiple tags at once:

In [17]:
all_paragraphs = soup.find_all('p')  # or just soup('p')
all_paragraphs

[<p>This domain is established to be used for illustrative examples in documents. You may use this\n    domain in examples without prior coordination or asking for permission.</p>,
 <p><a href="http://www.iana.org/domains/example">More information...</a></p>]

In [18]:
also_all_paragraphs = soup('p')
also_all_paragraphs

[<p>This domain is established to be used for illustrative examples in documents. You may use this\n    domain in examples without prior coordination or asking for permission.</p>,
 <p><a href="http://www.iana.org/domains/example">More information...</a></p>]

In [19]:
paragraphs_with_ids = [p for p in soup('p') if p.get('id')]
paragraphs_with_ids

[]

Frequently you will want to find tags with a specific `class`:

In [20]:
important_paragraphs = soup('p', {'class' : 'important'})
important_paragraphs

[]

In [21]:
important_paragraphs2 = soup('p', 'important')
important_paragraphs2

[]

In [22]:
important_paragraphs3 = [p for p in soup('p') if 'important' in p.get('class', [])]
important_paragraphs3

[]

You can combine methods to implement more elaborate logic.  
For example, if you want to find every `<span>` element that is contained in a `<div>` element, you could do this:

In [23]:
# Warning -- this will return the same span multiple times if it lies within multiple divs
# If that is the case, be more clever
span_inside_divs = [span 
                    for div in soup('div')    # for each <div> on the page, 
                    for span in div('span')]  # find each <span> inside of it.
span_inside_divs

[]
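Since `example.com` contains no `<span>` elements, the result above is empty. The sketch below runs the same idea on a small made-up HTML snippet, and shows one way of being "more clever" about duplicates: keep each `<span>` that has some `<div>` ancestor. It uses Python's built-in `html.parser` so that `html5lib` is not required; that choice is only for this example.

```python
from bs4 import BeautifulSoup

# Invented markup: one span sits inside two nested divs
html = """
<div id="outer">
  <div id="inner"><span>nested</span></div>
  <span>shallow</span>
</div>
<span>outside</span>
"""
soup = BeautifulSoup(html, "html.parser")

# The nested comprehension visits 'nested' once per enclosing div
naive = [span for div in soup("div") for span in div("span")]

# Visiting each span once and checking for a div ancestor avoids duplicates
spans_in_divs = [span for span in soup("span")
                 if span.find_parent("div") is not None]

print(len(naive))
print([span.text for span in spans_in_divs])
```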

Good. Enough with the basics; let's look at an example.

### Example: O'Reilly Books About Data