### Imports

In [None]:
import pandas as pd
import numpy as np

# Chapter 6. Data Loading, Storage, and File Formats

Accessing data is a necessary first step for using most of the tools in this book. I'm going to be focused on data input and output using pandas.

Input and output typically fall into a few main categories:

* Reading text files
* Loading from databases
* Interacting with network sources like APIs

## Reading and writing data in text format

There are a number of functions in pandas for reading different text files. Some include:
* read_csv
* read_fwf
* read_excel

Some of these functions have become very complex over time because real-world data is messy.



In [None]:
df = pd.read_csv('examples/ex1.csv')
df

Since this file is comma-delimited, we can use read_csv to read it into a DataFrame.

A file will not always have a header row. ex2.csv is one such file. You can make pandas assign default column names:

In [None]:
pd.read_csv('examples/ex2.csv', header = None)

Or you can specify the names yourself:

In [None]:
pd.read_csv('examples/ex2.csv', names = ['a', 'b', 'c', 'd', 'message'])
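
To make the difference between the two options concrete, here is a minimal sketch that reads the same headerless text both ways. The inline data is made up for illustration (it is not the content of ex2.csv), and io.StringIO simply stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical headerless CSV text standing in for a file like ex2.csv
raw = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# header=None: pandas labels the columns with integers 0..4
default_names = pd.read_csv(io.StringIO(raw), header=None)

# names=[...]: we supply the labels; the first line is treated as data
custom_names = pd.read_csv(io.StringIO(raw), names=['a', 'b', 'c', 'd', 'message'])

print(list(default_names.columns))
print(list(custom_names.columns))
```

In both cases the first line of the text is kept as a data row, so both frames have three rows.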

Suppose you wanted the message column to be the index:

In [None]:
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('examples/ex2.csv', names = names, index_col = 'message')

In the event that you want to form a hierarchical index from multiple columns, pass a list of column numbers or names:

In [None]:
parsed = pd.read_csv('examples/csv_mindex.csv', index_col = ['key1', 'key2'])
parsed
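
As a rough sketch of what the hierarchical index gives you, the example below builds one from inline text and then selects by the outer key and by a full key pair. The column names and values are assumptions chosen for illustration, not the content of csv_mindex.csv.

```python
import io
import pandas as pd

# Hypothetical data with two key columns and two value columns
raw = (
    "key1,key2,value1,value2\n"
    "one,a,1,2\n"
    "one,b,3,4\n"
    "two,a,5,6\n"
    "two,b,7,8\n"
)

parsed_demo = pd.read_csv(io.StringIO(raw), index_col=['key1', 'key2'])

# Selecting on the outer level returns the rows under that key
print(parsed_demo.loc['one'])

# A tuple of both keys addresses a single row
print(parsed_demo.loc[('two', 'b'), 'value1'])
```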

Sometimes the delimiter is not the character you expect. In some cases it must be specified manually, as in this example where the fields are separated by a variable amount of whitespace:

In [None]:
result = pd.read_csv('examples/ex3.txt', sep = r'\s+')  # raw string avoids an invalid-escape warning
result

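Because the header row has one fewer column name than the data rows have fields, pandas infers that the first column should be the index.

You can skip specific rows with *skiprows*. Here the first, third, and fourth rows of the file are skipped: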
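
A small self-contained sketch of this behaviour, using made-up inline data rather than ex3.txt: the header names three columns while each data row has four fields, so the leading field becomes the index.

```python
import io
import pandas as pd

# Hypothetical whitespace-delimited text with irregular spacing
raw = (
    "A      B    C\n"
    "aaa  1.5   2.25  -0.5\n"
    "bbb  0.75     4.0   8.125\n"
)

# The regular expression \s+ matches one or more whitespace characters
result_demo = pd.read_csv(io.StringIO(raw), sep=r'\s+')

print(result_demo)
print(list(result_demo.index))
```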

In [None]:
pd.read_csv('examples/ex4.csv', skiprows = [0, 2, 3])

Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (empty string) or marked by some *sentinel* value. By default, pandas uses a set of commonly occurring sentinels, such as **NA** and **NULL**.

In [None]:
result = pd.read_csv('examples/ex5.csv')
result

In [None]:
pd.isnull(result)

The *na_values* option can take either a list or set of strings to consider as missing values. It also accepts a dict that maps column names to the sentinels for that column:

In [None]:
result = pd.read_csv('examples/ex5.csv', na_values = ['NULL'])
result

In [None]:
sentinels = {'message' : ['foo', 'NA'], 'something' : ['two']}
pd.read_csv('examples/ex5.csv', na_values = sentinels)
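
Here is a minimal sketch of per-column sentinels on inline data. The rows are invented for illustration and only loosely mirror ex5.csv. I also show keep_default_na = False, which is a separate read_csv option that switches off the built-in sentinel list.

```python
import io
import pandas as pd

# Hypothetical data: one literal 'NA' and one empty field
raw = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

# Per-column sentinels: 'foo' and 'NA' only count in 'message', 'two' only in 'something'
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
with_sentinels = pd.read_csv(io.StringIO(raw), na_values=sentinels)

# With the defaults disabled and no na_values given, 'NA' stays a plain string
no_defaults = pd.read_csv(io.StringIO(raw), keep_default_na=False)

print(with_sentinels.isna())
print(no_defaults['message'].tolist())
```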

### Reading text files in pieces

Before looking at a large file, we make the pandas display settings more compact:

In [None]:
pd.options.display.max_rows = 10
result = pd.read_csv('examples/ex6.csv')
result

If you only want to read a small number of rows, specify that with *nrows*:

In [None]:
pd.read_csv('examples/ex6.csv', nrows = 5)

To read a file in pieces, specify a chunksize as a number of rows:

In [None]:
chunker = pd.read_csv('examples/ex6.csv', chunksize = 1000)
chunker

The *TextFileReader* object returned by *read_csv* allows you to iterate over the parts of the file according to the *chunksize*. For example, we can iterate over *ex6.csv*, aggregating the value counts in the 'key' column like so:

In [None]:
chunker = pd.read_csv('examples/ex6.csv', chunksize = 1000)

tot = pd.Series([], dtype = 'int64')  # explicit dtype: an empty Series otherwise has an ambiguous default
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value = 0)
tot = tot.sort_values(ascending = False)
tot[:10]
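
The same chunked aggregation can be tried without the example file. This sketch generates ten made-up rows in memory and reads them four at a time; the data and the chunk size are assumptions picked so the result is easy to check by hand.

```python
import io
import pandas as pd

# Hypothetical data: keys cycle A, B, C over ten rows (A appears 4 times, B and C 3 times)
rows = [f"{'ABC'[i % 3]},{i}" for i in range(10)]
raw = "key,value\n" + "\n".join(rows) + "\n"

# Each iteration yields a DataFrame with at most 4 rows
chunker_demo = pd.read_csv(io.StringIO(raw), chunksize=4)

tot_demo = pd.Series([], dtype='int64')
for piece in chunker_demo:
    # fill_value=0 treats keys missing from one side as a count of zero
    tot_demo = tot_demo.add(piece['key'].value_counts(), fill_value=0)

tot_demo = tot_demo.sort_values(ascending=False)
print(tot_demo)
```

Only one chunk is held in memory at a time, which is the point of reading in pieces.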

### Writing data to text format

Data can also be exported to a delimited format. Let's consider one of the CSV files read before:

In [None]:
data = pd.read_csv('examples/ex5.csv')
data

Using DataFrame's *to_csv* method, we can write the data out to a comma-separated file:

In [None]:
data.to_csv('examples/out.csv')

Other delimiters can be used, of course (writing to sys.stdout so it prints the text to the console):

In [None]:
import sys
data.to_csv(sys.stdout, sep = '|')

Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value:

In [None]:
data.to_csv(sys.stdout, na_rep = 'NULL')
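
A few other *to_csv* options are worth knowing about. The sketch below writes to an in-memory buffer instead of a file, drops the row index with index = False, and writes only a subset of columns. The small DataFrame is made up for illustration.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column of interest
demo = pd.DataFrame({'a': [1, 5],
                     'b': [2.0, np.nan],
                     'message': [None, 'world']})

buf = io.StringIO()
# index=False omits the row labels; columns selects and orders the output columns
demo.to_csv(buf, index=False, columns=['a', 'b'], na_rep='NULL')

text = buf.getvalue()
print(text)
```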

### Working with Delimited Formats

It's possible to load most forms of tabular data from disk using functions like *pandas.read_csv*. In some cases, however, some manual processing may be necessary. It's not uncommon to receive a file with one or more malformed lines that trip up *read_csv*. For any file with a single-character delimiter, you can use Python's built-in csv module:

In [None]:
import csv

# Open the file in a with block so it is closed once we are done reading
with open('examples/ex7.csv') as f:
    reader = csv.reader(f)
    for line in reader:
        print(line)

From there, it's up to you to do the wrangling necessary to put the data in the form that you need it. Let's take this step by step.

First, we read the file into a list of lines:

In [None]:
with open('examples/ex7.csv') as f:
    lines = list(csv.reader(f))

Then, we split the lines into the header line and the data lines:

In [None]:
header, values = lines[0], lines[1:]

Then we can create a dictionary of data columns using a dictionary comprehension and the expression zip(*values), which transposes rows to columns:

In [None]:
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict
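
The zip(*values) step is easier to follow on a tiny example. The header and rows below are made up; the sketch shows the intermediate transposed columns before they are paired with the header.

```python
# Hypothetical parsed lines: one header row and two data rows
header_demo = ['a', 'b', 'c']
values_demo = [['1', '2', '3'],
               ['4', '5', '6']]

# Unpacking the rows into zip groups the first items, the second items, and so on
columns_demo = list(zip(*values_demo))
print(columns_demo)

# Pair each header name with its column
data_dict_demo = {h: v for h, v in zip(header_demo, columns_demo)}
print(data_dict_demo)
```

Note that the csv module leaves every field as a string, so any numeric conversion still has to be done afterwards.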

CSV files come in many different flavors. To define a new format with a different delimiter, string quoting convention, or line terminator, we define a simple subclass of csv.Dialect:

In [None]:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL
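
Once defined, the dialect can be passed to csv.reader or csv.writer through the dialect argument. Here is a minimal sketch that repeats the class so it runs on its own and uses made-up semicolon-delimited text in memory.

```python
import csv
import io

class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

# Hypothetical input: the quoted field contains the delimiter itself
raw = 'a;b;c\n1;"two;with;semicolons";3\n'
rows = list(csv.reader(io.StringIO(raw), dialect=my_dialect))
print(rows)

# Writing with the same dialect quotes only the fields that need it
out = io.StringIO()
writer = csv.writer(out, dialect=my_dialect)
writer.writerow(('one', 'two;three', '4'))
print(out.getvalue())
```

Individual dialect parameters, such as delimiter, can also be given as keyword arguments to csv.reader without defining a subclass.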

### JSON Data

JSON (JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more free-form data format than a tabular text form like CSV. Here is an example:

In [None]:
obj = """
{"name": "Wes",
 "places_lived" : ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38, "pets": ["Sixies", "Stache", "Cisco"]}]
}
"""

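JSON is very nearly valid Python code, with the exception of its null value, null, and some other nuances. The basic types are objects (dicts), arrays (lists), strings, numbers, booleans, and nulls.

There are several Python libraries for reading and writing JSON data. Here I use json, which is built into the standard library. To convert a JSON string to Python form, use json.loads: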

In [None]:
import json
result = json.loads(obj)
result

json.dumps, on the other hand, converts a Python object back to JSON:

In [None]:
asjson = json.dumps(result)
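
As a quick illustration of the type mapping, this sketch round-trips a small made-up dict. Python's None and True become JSON's null and true on the way out, and come back unchanged on the way in.

```python
import json

# Hypothetical record covering the basic JSON types
record = {"name": "Wes", "pet": None, "active": True, "ages": [30, 38]}

encoded = json.dumps(record)    # Python object -> JSON string
decoded = json.loads(encoded)   # JSON string -> Python object

print(encoded)
print(decoded == record)
```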

How you convert a JSON object or list of objects to a DataFrame or some other data structure for analysis will be up to you. Conveniently, you can pass a list of dicts (which were previously JSON objects) to the DataFrame constructor and select a subset of the data fields:

In [None]:
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
siblings

The *pandas.read_json* function can automatically convert JSON datasets in specific arrangements into a Series or DataFrame. For example:

In [None]:
data = pd.read_json('examples/example.json')
data
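
One arrangement that read_json handles with its default options is a JSON array of objects, where each object becomes a row. The sketch below uses made-up inline JSON rather than example.json, and also shows the reverse direction with to_json.

```python
import io
import pandas as pd

# Hypothetical JSON text: an array of objects, one object per row
raw = '[{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}, {"a": 7, "b": 8, "c": 9}]'

frame_demo = pd.read_json(io.StringIO(raw))
print(frame_demo)

# orient='records' writes the frame back out as an array of objects
as_records = frame_demo.to_json(orient='records')
print(as_records)
```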

### XML and HTML: Web Scraping

Python has many libraries for reading and writing data in the ubiquitous HTML and XML formats. Examples include lxml, Beautiful Soup, and html5lib. While lxml is comparatively much faster in general, the other libraries can better handle malformed HTML and XML files.

pandas has a built-in function, *read_html*, which uses libraries like *lxml* and Beautiful Soup to automatically parse tables out of HTML files as DataFrame objects. To show how this works, I downloaded an HTML file (used in the pandas documentation) from the United States FDIC government agency showing bank failures. First, you must install some additional libraries used by read_html.

In [None]:
tables = pd.read_html('examples/fdic_failed_bank_list.html')
len(tables)

In [None]:
failures = tables[0]
failures.head()

Because failures has many columns, pandas inserts a line break character \ in the displayed output.

As we will learn in later chapters, from here we could proceed to do some data cleaning and analysis, like computing the number of bank failures by year:

In [None]:
close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.dt.year.value_counts()
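
The same two steps work on any column of date strings. Here is a minimal sketch with three made-up closing dates, so the counts are easy to verify by eye.

```python
import pandas as pd

# Hypothetical closing dates (not taken from the FDIC file)
closing = pd.Series(['2011-04-20', '2009-10-02', '2011-03-11'])

# Parse the strings into datetime64 values, then count rows per calendar year
timestamps = pd.to_datetime(closing)
counts = timestamps.dt.year.value_counts()
print(counts)
```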

### Parsing XML with lxml.objectify

In [None]:
from lxml import objectify
path = 'datasets/mta_perf/Performance_MNR.xml'
with open(path) as f:  # close the file handle once parsing is done
    parsed = objectify.parse(f)
root = parsed.getroot()

In [None]:
data = []

skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ',
               'DESIRED_CHANGE', 'DECIMAL_PLACES']

for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)

perf = pd.DataFrame(data)
perf.head()

## Binary Data Formats

One of the easiest ways to store data (also known as serialization) efficiently in binary format is using Python's built-in *pickle* serialization. pandas objects all have a *to_pickle* method that writes the data to disk in pickle format:

In [None]:
frame = pd.read_csv('examples/ex1.csv')
frame

In [None]:
frame.to_pickle('examples/frame_pickle')

Reading pickled objects is done using the built-in pickle module directly or, more conveniently, the *pandas.read_pickle* function.

In [None]:
pd.read_pickle('examples/frame_pickle')
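
To check that nothing is lost on the way, this sketch round-trips a small made-up DataFrame through a pickle file in a temporary directory (so it leaves nothing behind) and compares the result with the original.

```python
import os
import tempfile
import pandas as pd

# Hypothetical frame with mixed column types
frame_demo = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'frame_pickle')
    frame_demo.to_pickle(path)        # serialize to disk
    restored = pd.read_pickle(path)   # load it back

# equals() compares values, dtypes, and labels
print(restored.equals(frame_demo))
```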

### Caution

Pickle is only recommended as a short-term storage format. The problem is that it is hard to guarantee that the format will be stable over time; an object pickled today may not unpickle with a later version of a library.


pandas also has built-in support for two more binary data formats: HDF5 and MessagePack (MessagePack support was deprecated in pandas 0.25 and removed in 1.0).

### Using HDF5 format

HDF5 is a well-regarded file format intended for storing large quantities of scientific array data. It is available as a C library, and it has interfaces available in many other languages, including Java, Julia, MATLAB, and Python. The "HDF" in HDF5 stands for *hierarchical data format*. Each HDF5 file can store multiple datasets and supporting metadata. Compared with simpler formats, HDF5 supports on-the-fly compression with a variety of compression modes, enabling data with repeated patterns to be stored more efficiently. HDF5 can be a good choice for working with very large datasets that don't fit into memory, as you can efficiently read and write small sections of much larger arrays.

While it is possible to directly access HDF5 files using either the PyTables or h5py libraries, pandas provides a high-level interface that simplifies storing Series and DataFrame objects.

In [None]:
frame = pd.DataFrame({'a' : np.random.randn(100)})
store = pd.HDFStore('mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
store

Objects contained in the HDF5 file can then be retrieved with the same dict-like API:

In [None]:
store['obj1']

HDFStore supports two storage schemas, 'fixed' and 'table'. The latter is generally slower, but it supports query operations using a special syntax:

In [None]:
store.put('obj2', frame, format = 'table')
store.select('obj2', where = ['index >= 10 and index <= 15'])

In [None]:
store.close()

### Reading Microsoft Excel files

pandas also supports reading tabular data stored in Excel 2003 (and higher) files using either the ExcelFile class or the *pandas.read_excel* function. Internally these tools use the add-on packages *xlrd* and *openpyxl* to read XLS and XLSX files, respectively. These must be installed separately from pandas using pip or conda.

To use *ExcelFile*, create an instance by passing a path to an *xls* or *xlsx* file:

In [None]:
xlsx = pd.ExcelFile('examples/ex1.xlsx')

Data stored in a sheet can then be read into a DataFrame by passing the *ExcelFile* to *pandas.read_excel* (the object's own parse method does the same thing):

In [None]:
pd.read_excel(xlsx, 'Sheet1')

If you are reading multiple sheets in a file, then it is faster to create the ExcelFile, but you can also simply pass the filename to pandas.read_excel:

In [None]:
frame = pd.read_excel('examples/ex1.xlsx', 'Sheet1')
frame

To write pandas data to Excel format, you must first create an *ExcelWriter*, then write data to it using a pandas object's *to_excel* method:

In [None]:
writer = pd.ExcelWriter('examples/ex2.xlsx')
frame.to_excel(writer, sheet_name = 'Sheet1')
# close() writes the file to disk; writer.save() was removed in pandas 2.0
writer.close()

## Interacting with Web APIs

Many websites have public APIs providing data feeds via JSON or some other format. There are a number of ways to access these APIs from Python; one easy-to-use method that I recommend is the requests package.

To find the last 30 GitHub issues for pandas on GitHub, we can make a *GET* HTTP request using the add-on requests library:

In [None]:
import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)

In [None]:
resp

The Response object's json method returns the JSON payload parsed into native Python objects (here, a list of dictionaries):

In [None]:
data = resp.json()
data[0]['title']

Each element in *data* is a dictionary containing all the data found on a GitHub issue page (except for the comments). We can pass *data* directly to DataFrame and extract fields of interest:

In [None]:
issues = pd.DataFrame(data, columns = ['number', 'title', 'labels', 'state'])
issues

## Interacting with Databases

In a business setting, most data may not be stored in text or Excel files. SQL-based relational databases are in wide use, and many alternative databases have become quite popular. The choice of database usually depends on the performance, data integrity, and scalability needs of an application.

Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions to simplify the process. As an example:

In [None]:
import sqlite3

query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL,         d INTEGER)
;"""

con = sqlite3.connect('mydata.sqlite')
con.execute(query)
con.commit()  # the parentheses are required; without them commit is never called
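
To see the whole path from SQL to a DataFrame in one place, here is a minimal sketch that uses an in-memory SQLite database so it does not touch mydata.sqlite. The inserted rows are made up for illustration; pandas.read_sql_query accepts a sqlite3 connection directly.

```python
import sqlite3
import pandas as pd

# ':memory:' creates a throwaway database that disappears when closed
con_demo = sqlite3.connect(':memory:')
con_demo.execute(
    "CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)"
)

# Hypothetical rows; the ? placeholders let sqlite3 handle quoting
rows = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
con_demo.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", rows)
con_demo.commit()

# Column names come from the table definition
loaded = pd.read_sql_query("SELECT * FROM test", con_demo)
con_demo.close()

print(loaded)
```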