Chapter 6 - Data Loading, Storage, and File Formats

The tools in this book are of little use if you can’t easily import and export data in
Python. I’m going to be focused on input and output with pandas objects, though there
are of course numerous tools in other libraries to aid in this process. NumPy, for example,
features low-level but extremely fast binary data loading and storage, including
support for memory-mapped arrays. See Chapter 12 for more on those.
Input and output typically fall into a few main categories: reading text files and other
more efficient on-disk formats, loading data from databases, and interacting with network
sources like web APIs.

## Reading and Writing Data in Text Format

Python has become a beloved language for text and file munging due to its simple syntax
for interacting with files, intuitive data structures, and convenient features like tuple
packing and unpacking.

pandas features a number of functions for reading tabular data as a DataFrame object.
Table 6-1 has a summary of all of them, though read_csv and read_table are likely the
ones you’ll use the most.

In [1]:
#Table 6-1. Parsing functions in pandas

#Function           Description

#read_csv           Load delimited data from a file, URL, or file-like object. Use comma as default delimiter
#read_table         Load delimited data from a file, URL, or file-like object. Use tab ('\t') as default delimiter
#read_fwf           Read data in fixed-width column format (that is, no delimiters)
#read_clipboard     Version of read_table that reads data from the clipboard. Useful for converting tables from web pages

I’ll give an overview of the mechanics of these functions, which are meant to convert
text data into a DataFrame. The options for these functions fall into a few categories:

• Indexing: can treat one or more columns as the index of the returned DataFrame, and
whether to get column names from the file, the user, or not at all.

• Type inference and data conversion: includes user-defined value conversions
and custom lists of missing value markers.

• Datetime parsing: includes the ability to combine date and time information
spread over multiple columns into a single column in the result.

• Iterating: support for iterating over chunks of very large files.

• Unclean data issues: skipping rows or a footer, comments, or other minor things
like numeric data with thousands separated by commas.

In [2]:
# Type inference is one of the more important features of these functions; that means you
# don’t have to specify which columns are numeric, integer, boolean, or string. Handling
# dates and other custom types requires a bit more effort, though. Let’s start with a small
# comma-separated (CSV) text file:

In [3]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np

In [4]:
# Since this is comma-delimited, we can use read_csv to read it into a DataFrame:

df = pd.read_csv('Chapter 6 practice csv.csv')
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [5]:
# The book uses the Unix cat shell command to print the raw contents of
# the file to the screen. If you’re on Windows, you can use type instead
# of cat to achieve the same effect.

!type withoutheader.csv

withoutheader.csv: not found


In [6]:
# We could also have used read_table and specified the delimiter:
pd.read_table('Chapter 6 practice csv.csv', sep = ',')
# Here we see no difference, since this file is already a CSV file, meaning it is already comma-separated.

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [7]:
# A file will not always have a header row. To read such a file, you have a couple of options.
# You can allow pandas to assign default column names, or you can specify names yourself:
pd.read_csv('withoutheader.csv', header = None)

FileNotFoundError: File b'withoutheader.csv' does not exist

In [8]:
pd.read_csv('withoutheader.csv', names = ['a','b','c','d','message'])

FileNotFoundError: File b'withoutheader.csv' does not exist

In [9]:
# Suppose you wanted the message column to be the index of the returned DataFrame.
# You can either indicate you want the column at index 4 or named 'message' using the
# index_col argument:

names = ['a','b','c','d','message']

In [10]:
pd.read_csv('withoutheader.csv', names = names, index_col = 'message')

FileNotFoundError: File b'withoutheader.csv' does not exist

In [None]:
# In the event that you want to form a hierarchical index from multiple columns, just
# pass a list of column numbers or names:


In [None]:
# How to create a csv file
# I added this part myself to get familiar with creating a file.
import csv
data = [['Stock', 'Sales'],
        ['100', '24'],
        ['120', '33'],
        ['23', '5']]
with open('test.csv', 'w', newline = '') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    writer.writerows(data)  # without this call, test.csv stays empty

In [None]:
# Use DataFrame to display; the first row of data holds the column names
dt = DataFrame(data[1:], columns = data[0])
dt

In [None]:
# Try my own real-life example
pd.read_csv('SPY.csv')

In [None]:
# Let's use this example to try applications from the textbook - starting from Page 157

# In the event that you want to form a hierarchical index from multiple columns, just
# pass a list of column numbers or names:

!type SPY.csv

In [None]:
# For a hierarchical index
parsed = pd.read_csv('SPY.csv', index_col = ['Date', 'Open',])
parsed

In [None]:
# In some cases, a table might not have a fixed delimiter, using whitespace or some other
# pattern to separate fields. In these cases, you can pass a regular expression as a delimiter
# for read_table. Consider a text file that looks like this:
list(open('SPY.csv'))

In [None]:
# While you could do some munging by hand, in this case fields are separated by a variable
# amount of whitespace. This can be expressed by the regular expression \s+, so we have
# then:

result = pd.read_table('SPY.csv', sep = r'\s+')  # raw string so \s is not treated as a string escape
result

Because there was one fewer column name than the number of fields in each data row, read_table infers that the first column should be the DataFrame’s index in this special case.
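A minimal sketch of this index inference, using a small in-memory table instead of SPY.csv. The column names `A`, `B`, `C` and the row labels are made up for illustration; the only assumption is that the header has one fewer name than each data row has fields.

```python
import io
import pandas as pd

# Hypothetical whitespace-delimited text: three names in the header,
# four fields in every data row.
raw = io.StringIO(
    "A B C\n"
    "aaa -0.26  -1.02 -0.61\n"
    "bbb  0.92   0.30 -0.03\n"
    "ccc -0.26  -0.38 -0.75\n"
)

# r"\s+" matches any run of whitespace between fields
result = pd.read_csv(raw, sep=r"\s+")

print(result)
print(list(result.index))  # the first field of each row became the index
```

The same call works with read_table; read_csv is used here only because the separator is passed explicitly.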

The parser functions have many additional arguments to help you handle the wide variety of exceptional file formats that occur (see Table 6-2). For example, you can skip the second, third, and fourth lines of a file (row numbers 1, 2, and 3, counting from 0) with skiprows:

In [None]:
pd.read_csv('SPY.csv',skiprows = [1,2,3])

In [None]:
# Handling missing values is an important and frequently nuanced part of the file parsing
# process. Missing data is usually either not present (empty string) or marked by some
# sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as NA, -1.#IND, and NULL:

In [None]:
!type US3M.csv


In [None]:
result = pd.read_csv('US3M.csv')
pd.isnull(result)

In [None]:
# The na_values option can take either a list or set of strings to consider missing values:
result = pd.read_csv('US3M.csv', na_values = ['NULL'])
result

In [None]:
# Different NA sentinels can be specified for each column in a dict:
sentinels = {'message': ['foo', 'NA'], 'something':['two']}
sentinels

In [None]:
dt1 = pd.read_csv('US3M.csv', na_values = sentinels)
dt1
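As a sketch of per-column sentinels on data where the effect is visible, the example below uses a hypothetical in-memory file. The column names `RATE` and `NOTE`, and the use of `.` as a missing-value marker, are assumptions for illustration and are not taken from US3M.csv.

```python
import io
import pandas as pd

# Hypothetical file: "." marks a missing rate, "foo" marks a missing note
raw = io.StringIO(
    "DATE,RATE,NOTE\n"
    "2016-01-04,0.61,ok\n"
    "2016-01-05,.,holiday\n"
    "2016-01-06,0.62,foo\n"
)

# Each key is a column name; each value lists the extra strings
# that should be read as NaN in that column only.
sentinels = {"RATE": ["."], "NOTE": ["foo"]}
frame = pd.read_csv(raw, na_values=sentinels)

print(frame)
print(frame.dtypes)  # RATE is numeric because "." was read as NaN
```

Marking the sentinel at read time keeps the column numeric, which is an alternative to replacing the marker with a placeholder number afterwards.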

In [None]:
# Doing a little cleansing my own way
clean_values = dt1.replace({'.': 0.0001})

Table 6-2. read_csv/read_table function arguments

| Argument | Description |
| --- | --- |
| path | String indicating filesystem location, URL, or file-like object |
| sep or delimiter | Character sequence or regular expression to use to split fields in each row |
| header | Row number to use as column names. Defaults to 0 (first row), but should be None if there is no header row |
| index_col | Column numbers or names to use as the row index in the result. Can be a single name/number or a list of them for a hierarchical index |
| names | List of column names for result, combine with header=None |
| skiprows | Number of rows at beginning of file to ignore or list of row numbers (starting from 0) to skip |
| na_values | Sequence of values to replace with NA |
| comment | Character or characters to split comments off the end of lines |
| parse_dates | Attempt to parse data to datetime; False by default. If True, will attempt to parse all columns. Otherwise can specify a list of column numbers or names to parse. If an element of the list is a tuple or list, will combine multiple columns together and parse to date (for example, if date/time is split across two columns) |
| keep_date_col | If joining columns to parse date, keep the joined columns. Default False |
| converters | Dict containing column number or name mapping to functions. For example {'foo': f} would apply the function f to all values in the 'foo' column |
| dayfirst | When parsing potentially ambiguous dates, treat as international format (e.g. 7/6/2012 -> June 7, 2012). Default False |
| date_parser | Function to use to parse dates |
| nrows | Number of rows to read from beginning of file |
| iterator | Return a TextParser object for reading file piecemeal |
| chunksize | For iteration, size of file chunks |
| skip_footer | Number of lines to ignore at end of file (spelled skipfooter in newer pandas versions) |
| verbose | Print various parser output information, like the number of missing values placed in non-numeric columns |
| encoding | Text encoding for unicode. For example 'utf-8' for UTF-8 encoded text |
| squeeze | If the parsed data only contains one column, return a Series |
| thousands | Separator for thousands, e.g. ',' or '.' |

### Reading Text Files in Pieces

In [None]:
# When processing very large files or figuring out the right set of arguments to correctly
# process a large file, you may only want to read in a small piece of a file or iterate through
# smaller chunks of the file.
clean_values

In [None]:
# If you want to only read out a small number of rows (avoiding reading the entire file),
# specify that with nrows:
pd.read_csv('US3M.csv', nrows = 5)

In [None]:
chunker = pd.read_csv('US3M.csv', chunksize = 1000)

In [None]:
# The TextParser object returned by read_csv allows you to iterate over the parts of the
# file according to the chunksize. For example, we can iterate over US3M.csv, aggregating
# the value counts in the 'USD3MTD156N' column like so:
tot = Series([], dtype = 'float64')
for piece in chunker:
    tot = tot.add(piece['USD3MTD156N'].value_counts(), fill_value = 0)

tot = tot.sort_values(ascending = False)

tot[:10]
# Not working yet; I will come back to this with real projects to understand what it really means.
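To make the chunked aggregation easier to follow, here is a small self-contained sketch on a hypothetical ten-row table. The `key` and `value` columns and the chunk size of 4 are made up so that the chunk boundaries are easy to see.

```python
import io
import pandas as pd

# Hypothetical data: ten rows whose 'key' column cycles through a few letters
keys = "abcabcaaab"
raw = io.StringIO(
    "key,value\n" + "\n".join(f"{k},{i}" for i, k in enumerate(keys))
)

# With chunksize set, read_csv returns an iterator of DataFrames
chunker = pd.read_csv(raw, chunksize=4)  # 10 rows: chunks of 4, 4, and 2

tot = pd.Series(dtype="float64")  # explicit dtype for the empty accumulator
n_chunks = 0
for piece in chunker:
    n_chunks += 1
    # add() aligns on the index; fill_value=0 covers keys not seen so far
    tot = tot.add(piece["key"].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
print(n_chunks)
print(tot)
```

The chunker is an iterator, so it is used up as the loop runs. If the loop cell is executed a second time, the reader generally has nothing left to yield; re-create the chunker with read_csv first.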

### Writing Data Out to Text Format

In [None]:
# Data can also be exported to delimited format. Let’s consider one of the CSV files read above:
data = pd.read_csv('US3M.csv')

In [None]:
data

In [None]:
# Using DataFrame’s to_csv method, we can write the data out to a comma-separated file:
data.to_csv('outdata.csv')

In [None]:
!type outdata.csv

In [None]:
# Other delimiters can be used, of course (writing to ... so it just prints the text results.):
data.to_csv('outdata.csv', sep = '|')
!type outdata.csv

In [None]:
# Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value:

In [None]:
data.to_csv('outdata.csv', na_rep= "NULL")

In [None]:
!type outdata.csv
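The effect of sep and na_rep can also be inspected without writing a file, since to_csv returns the text as a string when no path is given. The sketch below uses a hypothetical two-row frame; the column names are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value
data = pd.DataFrame(
    {"DATE": ["2016-01-04", "2016-01-05"], "RATE": [0.61, np.nan]}
)

# With no path argument, to_csv returns the CSV text instead of writing a file
default_text = data.to_csv(index=False)
null_text = data.to_csv(index=False, na_rep="NULL")
piped_text = data.to_csv(index=False, sep="|", na_rep="NULL")

print(default_text)  # the missing value is written as an empty field
print(null_text)     # the missing value is written as NULL
print(piped_text)    # same, with | between fields
```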

In [None]:
# With no other options specified, both the row and the column labels are written.
# Both of these can be disabled.
data.to_csv('outdata.csv', index = False, header = False)
!type outdata.csv

In [None]:
# You can also write only a subset of the columns, and in an order of your choosing
data.to_csv('outdata.csv', index = False, columns = ['DATE','USD3MTD156N'])

!type outdata.csv

In [None]:
# Series also has a to_csv method:
from datetime import datetime
dates = pd.date_range('1/1/2000', periods = 7)

In [None]:
ts = Series(np.arange(7), index = dates)
ts.to_csv('outdata.csv')  # call the method; a bare ts.to_csv only displays the bound method

In [None]:
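# With a bit of wrangling (no header, first column as index), you can read a CSV version
# of a Series with read_csv, but there is also a from_csv convenience method that makes
# it a bit simpler. Note that from_csv was deprecated in pandas 0.21 and removed in 1.0;
# on newer versions, use pd.read_csv('outdata.csv', header=None, index_col=0, parse_dates=True).
Series.from_csv('outdata.csv', parse_dates = True)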

### Manually Working with Delimited Formats

In [None]:
#Most forms of tabular data can be loaded from disk using functions like
#pandas.read_table. In some cases, however, some manual processing may be necessary.
#It’s not uncommon to receive a file with one or more malformed lines that trip up
#read_table. To illustrate the basic tools, consider a small CSV file:

!type ex7.csv

In [None]:
#For any file with a single-character delimiter, you can use Python’s built-in csv module.
#To use it, pass any open file or file-like object to csv.reader:
import csv
f = open('ex7.csv')
reader = csv.reader(f)

In [None]:
# Can only use 'line' here.
for line in reader:
    print (line)

In [None]:
# From there, it’s up to you to do the wrangling necessary to put the data in the form
# that you need it. For example:
lines = list(csv.reader(open('ex7.csv')))
lines

In [None]:
header, values = lines[0], lines[1:]


In [None]:
header

In [None]:
values

In [None]:
# h and v do not need to be defined beforehand: they are loop variables bound by the
# comprehension itself. zip(*values) transposes the rows into columns.
data_dict = {h: v for h, v in zip(header, zip(*values))}
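A minimal sketch of this header/column pairing on an in-memory file. The contents are hypothetical stand-ins, since the real contents of ex7.csv are not shown here.

```python
import csv
import io

# Hypothetical file contents: one header row and two data rows
raw = io.StringIO('"a","b","c"\n"1","2","3"\n"4","5","6"\n')

lines = list(csv.reader(raw))
header, values = lines[0], lines[1:]

# zip(*values) turns the list of rows into a sequence of columns;
# zipping with the header pairs each column name with its column.
data_dict = {h: v for h, v in zip(header, zip(*values))}
print(data_dict)
```

Each value in the dict is a tuple of strings, because csv.reader does no type conversion.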

In [None]:
# CSV files come in many different flavors. Defining a new format with a different delimiter,
# string quoting convention, or line terminator is done by defining a simple subclass
# of csv.Dialect:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL  # must be set explicitly under Python 3
reader = csv.reader(f, dialect = my_dialect)

In [None]:
# Individual CSV dialect parameters can also be given as keywords to csv.reader without
# having to define a subclass:
reader = csv.reader(f, delimiter = '|')
#My own practice:
pd.read_csv('ex7.csv')

The possible options (attributes of csv.Dialect) and what they do can be found in
Table 6-3.

In [None]:
#Table 6-3. CSV dialect options

#Argument               Description
#delimiter              One-character string to separate fields. Defaults to ','.
#lineterminator         Line terminator for writing, defaults to '\r\n'. Reader ignores this and recognizes
#                       cross-platform line terminators.
#quotechar              Quote character for fields with special characters (like a delimiter). Default is '"'.
#quoting                Quoting convention. Options include csv.QUOTE_ALL (quote all fields),
#                       csv.QUOTE_MINIMAL (only fields with special characters like the delimiter),
#                       csv.QUOTE_NONNUMERIC, and csv.QUOTE_NONE (no quoting). See Python’s
#                       documentation for full details. Defaults to QUOTE_MINIMAL.
# skipinitialspace      Ignore whitespace after each delimiter. Default False.
# doublequote           How to handle quoting character inside a field. If True, it is doubled. See online
#                       documentation for full detail and behavior.
# escapechar            String to escape the delimiter if quoting is set to csv.QUOTE_NONE. Disabled by
#                       default

For files with more complicated or fixed multicharacter delimiters, you
will not be able to use the csv module. In those cases, you’ll have to do
the line splitting and other cleanup using string’s split method or the
regular expression method re.split.
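As a sketch of that manual approach, the example below splits lines on a hypothetical two-character delimiter `::` with optional surrounding whitespace. The delimiter and the sample lines are assumptions chosen for illustration.

```python
import re

# Hypothetical lines using "::" as a multicharacter delimiter,
# with inconsistent whitespace around it
raw_lines = [
    "one :: two::three\n",
    "1::2 ::  3\n",
]

# \s*::\s* consumes the delimiter together with any whitespace around it
pattern = re.compile(r"\s*::\s*")
rows = [pattern.split(line.strip()) for line in raw_lines]

header, values = rows[0], rows[1:]
print(header)
print(values)
```

For a fixed delimiter with no stray whitespace, `line.split('::')` would be enough; the regular expression only earns its keep when the separator varies.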

In [None]:
# To write delimited files manually, you can use csv.writer. It accepts an open, writable
# file object and the same dialect and format options as csv.reader:
with open('mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))

### JSON Data

In [None]:
#JSON (short for JavaScript Object Notation) has become one of the standard formats
#for sending data by HTTP request between web browsers and other applications. It is
#a much more flexible data format than a tabular text form like CSV. Here is an example:

In [None]:
obj = """
{"name":"Wes",
"places_lived":["United States", "Spain", "Germany"],
"pet": null,
"siblings":[{"name":"Scott", "age": 25, "pet":"Zuko"},
            {"name":"Katie", "age": 33, "pet":"Cisco"}]}
            """

obj

JSON is very nearly valid Python code with the exception of its null value null and
some other nuances (such as disallowing trailing commas at the end of lists). The basic
types are objects (dicts), arrays (lists), strings, numbers, booleans, and nulls. All of the
keys in an object must be strings. There are several Python libraries for reading and
writing JSON data. I’ll use json here as it is built into the Python standard library. To
convert a JSON string to Python form, use json.loads:

In [None]:
#convert a JSON string to Python form
import json
result = json.loads(obj)
result

json.dumps on the other hand converts a Python object back to JSON:

In [None]:
asjson = json.dumps(result)
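A short self-contained sketch of the round trip, using a trimmed-down version of the object above. It shows how JSON types map to Python types in both directions.

```python
import json

obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"}]}
"""

result = json.loads(obj)
print(type(result))               # JSON object becomes a dict
print(result["pet"])              # JSON null becomes None
print(result["siblings"][0])      # nested objects become nested dicts

asjson = json.dumps(result)
print(asjson)                     # None is written back as null

# Decoding the re-encoded string gives an equal Python object
roundtrip = json.loads(asjson)
```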

How you convert a JSON object or list of objects to a DataFrame or some other data
structure for analysis will be up to you. Conveniently, you can pass a list of JSON objects
to the DataFrame constructor and select a subset of the data fields:

In [None]:
siblings = DataFrame(result['siblings'], columns = ['name', 'age', 'pet'])
siblings

For an extended example of reading and manipulating JSON data (including nested
records), see the USDA Food Database example in the next chapter.

(An effort is underway to add fast native JSON export (to_json) and
decoding (from_json) to pandas. This was not ready at the time of writing.)

### XML and HTML: Web Scraping

Python has many libraries for reading and writing data in the ubiquitous HTML and
XML formats. lxml (http://lxml.de) is one that has consistently strong performance in
parsing very large files. lxml has multiple programmer interfaces; first I’ll show using
lxml.html for HTML, then parse some XML using lxml.objectify.

Many websites make data available in HTML tables for viewing in a browser, but not
downloadable as an easily machine-readable format like JSON, CSV, or XML. I noticed
that this was the case with Yahoo! Finance’s stock options data. If you aren’t
familiar with this data, options are derivative contracts giving you the right to buy
(call option) or sell (put option) a company’s stock at some particular price (the
strike) between now and some fixed point in the future (the expiry). People trade both
call and put options across many strikes and expiries; this data can all be found together
in tables on Yahoo! Finance.

To get started, find the URL you want to extract data from, open it with
urllib.request.urlopen (urllib2 in Python 2) and parse the stream with lxml like so:

In [None]:
from lxml.html import parse
from urllib.request import urlopen

In [None]:
parsed = parse(urlopen('http://finance.yahoo.com/q/op?s=AAPL+Options'))
doc = parsed.getroot()

Using this object, you can extract all HTML tags of a particular type, such as table tags
containing the data of interest. As a simple motivating example, suppose you wanted
to get a list of every URL linked to in the document; links are a tags in HTML. Using
the document root’s findall method along with an XPath (a means of expressing
“queries” on the document):

In [None]:
links = doc.findall('.//a')

In [None]:
links[15:20]

But these are objects representing HTML elements; to get the URL and link text you
have to use each element’s get method (for the URL) and text_content method (for
the display text):

In [None]:
lnk = links[25]
lnk

In [None]:
lnk.get('href')

In [None]:
lnk.text_content()

Thus, getting a list of all URLs in the document is a matter of writing this list comprehension:

In [None]:
urls = [lnk.get('href') for lnk in doc.findall('.//a')]
urls[-10:]

Now, finding the right tables in the document can be a matter of trial and error; some
websites make it easier by giving a table of interest an id attribute. I determined that
these were the two tables containing the call data and put data, respectively:

In [None]:
tables = doc.findall('.//table')
calls = tables[0]
puts = tables[1]  # assumption: the puts table follows the calls table; check the indices against the current page

calls

In [None]:
puts

Each table has a header row followed by each of the data rows:

In [None]:
rows = calls.findall('.//tr')

For the header as well as the data rows, we want to extract the text from each cell; in
the case of the header these are th cells and td cells for the data:

In [None]:
def _unpack(row, kind = 'td'):
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]

Thus, we obtain:

In [None]:
_unpack(rows[0], kind = 'th')

In [None]:
_unpack(rows[1], kind='td')  # rows[0] is the header row and has no td cells

Now, it’s a matter of combining all of these steps together to convert this data into a
DataFrame. Since the numerical data is still in string format, we want to convert some,
but perhaps not all of the columns to floating point format. You could do this by hand,
but, luckily, pandas has a class TextParser that is used internally in the read_csv and
other parsing functions to do the appropriate automatic type conversion:

In [None]:
import numpy as np
import pandas as pd
from pandas.io.parsers import TextParser

In [None]:
def parse_options_data(table):
    rows = table.findall('.//tr')
    header = _unpack(rows[0], kind = 'th')
    data = [_unpack(r) for r in rows[1:]]
    return TextParser(data, names = header).get_chunk()

Finally, we invoke this parsing function on the lxml table objects and get DataFrame
results:

In [None]:
#QUESTION: Gotta figure this out..
call_data = parse_options_data(calls)
put_data = parse_options_data(puts)

#### Bonus: Fetching the Yahoo Finance Page 
http://pythoncentral.io/python-beautiful-soup-example-yahoo-finance-scraper/

In [None]:
optionUrl = 'http://finance.yahoo.com/q/op?s=AAPL+Options'
optionsPage = urlopen(optionUrl)

This code retrieves the Yahoo Finance HTML and returns a file-like object.

If you go to the page we opened with Python and use your browser's "get source" command you'll see that it's a large, complicated HTML file. It will be Python's job to simplify and extract the useful data using the BeautifulSoup module. BeautifulSoup is an external module so you'll have to install it. If you haven't installed BeautifulSoup already, you can get it here
http://pythoncentral.io/python-beautiful-soup-example-yahoo-finance-scraper/

In [None]:
# Beautiful Soup Example: Loading a Page
# The following code will load the page into BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(optionsPage, 'html.parser')  # name the parser explicitly to avoid a warning

In [None]:
soup.findAll(text = 'AAPL161230C00117000')

This result isn’t very useful yet. It’s just the string we searched for (Python 2 displays it with a 'u' prefix, marking a unicode string). However, BeautifulSoup returns things in a tree format, so we can find the context in which this text occurs by asking for its parent node like so:

In [None]:
soup.findAll(text = 'AAPL161230C00117000')[0].parent

Bingo. It's still a little messy, but you can see all of the data that we need is there. If you ignore all the stuff in brackets, you can see that this is just the data from one row.

In [None]:
optionsTable = [
    [x.text for x in y.parent.contents]
    for y in soup.findAll('td', attrs={'class': 'yfnc_h', 'nowrap': ''})
]

In [None]:
optionsTable

This code is a little dense, so let's take it apart piece by piece. The code is a list comprehension within a list comprehension. Let's look at the outer loop first:

In [None]:
for y in soup.findAll('td', attrs={'class': 'yfnc_h', 'nowrap': ''})

This uses BeautifulSoup's findAll function to get all of the HTML elements with a td tag, a class of yfnc_h and a nowrap of nowrap. We chose this because it's a unique element in every table entry.

If we had just gotten td's with the class yfnc_h we would have gotten seven elements per table entry. Another thing to note is that we have to wrap the attributes in a dictionary because class is one of Python's reserved words. From the table above it would return this:

<td nowrap="nowrap"><a href="/q/op?s=AAPL&amp;k=110.000000"><strong>110.00</strong></a></td>

We need to get one level higher and then get the text from all of the child nodes of this node's parent. That's what this code does:

In [None]:
[x.text for x in y.parent.contents]

This works, but you should be careful if this is code you plan to reuse frequently. If Yahoo changes the way they format their HTML, this could stop working. If you plan to use code like this in an automated way, it would be best to wrap it in a try/except block and validate the output.

This is only a simple Beautiful Soup example, and gives you an idea of what you can do with HTML and XML parsing in Python. You can find the Beautiful Soup documentation here. You'll find a lot more tools for searching and validating HTML documents.

#### Parsing XML with lxml.objectify

XML (extensible markup language) is another common structured data format supporting
hierarchical, nested data with metadata. The files that generate the book you
are reading actually form a series of large XML documents.

Above, I showed the lxml library and its lxml.html interface. Here I show an alternate
interface that’s convenient for XML data, lxml.objectify.

The New York Metropolitan Transportation Authority (MTA) publishes a number of
data series about its bus and train services (http://www.mta.info/developers/download.html).
Here we’ll look at the performance data, which is contained in a set of XML files.
Each train or bus service has a different file (like Performance_MNR.xml for the
Metro-North Railroad) containing monthly data as a series of XML records that look like this:

<INDICATOR>
    <INDICATOR_SEQ>373889</INDICATOR_SEQ>
    <PARENT_SEQ></PARENT_SEQ>
    <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
    <INDICATOR_NAME>Escalator Availability</INDICATOR_NAME>
    <DESCRIPTION>Percent of the time that escalators are operational
    systemwide. The availability rate is based on physical observations performed
    the morning of regular business days only. This is a new indicator the agency
    began reporting in 2009.</DESCRIPTION>
    <PERIOD_YEAR>2011</PERIOD_YEAR>
    <PERIOD_MONTH>12</PERIOD_MONTH>
    <CATEGORY>Service Indicators</CATEGORY>
    <FREQUENCY>M</FREQUENCY>
    <DESIRED_CHANGE>U</DESIRED_CHANGE>
    <INDICATOR_UNIT>%</INDICATOR_UNIT>
    <DECIMAL_PLACES>1</DECIMAL_PLACES>
    <YTD_TARGET>97.00</YTD_TARGET>
    <YTD_ACTUAL></YTD_ACTUAL>
    <MONTHLY_TARGET>97.00</MONTHLY_TARGET>
    <MONTHLY_ACTUAL></MONTHLY_ACTUAL>
</INDICATOR>

Using lxml.objectify, we parse the file and get a reference to the root node of the XML
file with getroot:

In [None]:
from lxml import objectify

In [None]:
path = 'Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()

root.INDICATOR returns a generator yielding each <INDICATOR> XML element. For each
record, we can populate a dict of tag names (like YTD_ACTUAL) to data values (excluding
a few tags):
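The step can be sketched as follows. To keep the example runnable without the MTA file or lxml, it swaps in the standard library's xml.etree.ElementTree and a shortened copy of the record shown above; the `PERFORMANCE` wrapper element and the choice of skipped tags are assumptions for illustration. Unlike lxml.objectify, ElementTree leaves every value as a string, so no type conversion happens here.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample record; the PERFORMANCE root is assumed
xml_text = """
<PERFORMANCE>
  <INDICATOR>
    <INDICATOR_SEQ>373889</INDICATOR_SEQ>
    <PARENT_SEQ></PARENT_SEQ>
    <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
    <INDICATOR_NAME>Escalator Availability</INDICATOR_NAME>
    <PERIOD_YEAR>2011</PERIOD_YEAR>
    <PERIOD_MONTH>12</PERIOD_MONTH>
    <YTD_TARGET>97.00</YTD_TARGET>
    <YTD_ACTUAL></YTD_ACTUAL>
  </INDICATOR>
</PERFORMANCE>
""".strip()

# Tags to leave out of each record (an illustrative choice)
skip_fields = {"PARENT_SEQ", "INDICATOR_SEQ"}

root = ET.fromstring(xml_text)
data = []
for elt in root.findall("INDICATOR"):
    # Map each child tag name to its text; empty elements give None
    record = {
        child.tag: child.text
        for child in elt
        if child.tag not in skip_fields
    }
    data.append(record)

print(data)
```

A list of dicts like `data` can be passed directly to the DataFrame constructor, with numeric columns converted afterwards.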

## Binary Data Formats

One of the easiest ways to store data efficiently in binary format is using Python’s built-in
pickle serialization. Conveniently, pandas objects all have a to_pickle method (named save in
early pandas releases) which writes the data to disk as a pickle:

In [None]:
frame = pd.read_csv('US3M.csv')
frame

In [None]:
# Write to a separate file so the pickle does not overwrite the CSV source
frame.to_pickle('US3M.pkl')
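The underlying mechanism can be sketched with the standard library's pickle module alone. The records below and the file name `records.pkl` are hypothetical; the file is written to a temporary directory so nothing in the working folder is overwritten.

```python
import os
import pickle
import tempfile

# Hypothetical records standing in for a small table
records = [
    {"DATE": "2016-01-04", "RATE": 0.61},
    {"DATE": "2016-01-05", "RATE": None},
]

path = os.path.join(tempfile.mkdtemp(), "records.pkl")

# Pickle files are binary, so open them in 'wb' and 'rb' modes
with open(path, "wb") as f:
    pickle.dump(records, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored)
```

Pickle is convenient for short-term storage, but only load pickle files from sources you trust, since unpickling can execute arbitrary code.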

## Interacting with HTML and APIs

Still needs to dig deeper into

## Interacting with Databases