# 6.1 Reading/Writing Data in Text Format

- pandas features a number of functions for reading tabular data as a DataFrame object
    - including csv, hdf, json, sql, etc.
- `read_x` functions have optional arguments covering
    - **indexing**: treat one or more columns as the index of the returned DataFrame, and whether to get column names from the file, from the user, or not at all
    - **type inference and data conversion**: includes user-defined value conversions and a custom list of missing value markers
    - **datetime parsing**: includes combining date and time information spread over multiple columns into a single column in the result
    - **iterating**: support for iterating over chunks of very large files
    - **unclean data issues**: skipping rows, a footer, comments, etc.
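
- A minimal sketch of a few of these option groups working together. The inline text and the column names (`id`, `when`, `value`) are made up for illustration; `comment`, `index_col`, and `parse_dates` are standard `read_csv` arguments.

```python
import io

import pandas as pd

# Hypothetical messy input: a leading comment line and a comment in the middle.
raw = """# exported by a hypothetical tool
id,when,value
1,2020-01-01,10
# a comment line in the middle
2,2020-01-02,20
"""

df_clean = pd.read_csv(
    io.StringIO(raw),
    comment="#",           # unclean data: ignore lines starting with '#'
    index_col="id",        # indexing: use the 'id' column as the row index
    parse_dates=["when"],  # datetime parsing: convert 'when' to datetimes
)
print(df_clean)
print(df_clean.dtypes)
```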


In [20]:
%%bash
mkdir -p examples
rm -f examples/ex1.csv
cat <<EOT >> examples/ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
EOT

In [19]:
import pandas as pd

df = pd.read_csv("examples/ex1.csv")
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [15]:
import pandas as pd

df = pd.read_table("examples/ex1.csv", sep=',')
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [22]:
%%bash
mkdir -p examples
rm -f examples/ex2.csv
cat <<EOT >> examples/ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
EOT

In [23]:
import pandas as pd

pd.read_csv('examples/ex2.csv', header=None)

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [25]:
import pandas as pd

pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [26]:
# optional index columns specified by index_col
import pandas as pd

pd.read_csv('examples/ex2.csv',
            names=['a', 'b', 'c', 'd', 'message'],
            index_col='message')

message,a,b,c,d
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


In [27]:
%%bash
mkdir -p examples
rm -f examples/csv_mindex.csv
cat <<EOT >> examples/csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
EOT

In [28]:
# form a hierarchical index from multiple columns
parsed = pd.read_csv('examples/csv_mindex.csv',
                     index_col=['key1', 'key2'])
parsed

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


## Missing Data Handling

- missing data is usually either not present (an empty string) or marked by some sentinel value
    - by default, pandas uses a set of commonly occurring sentinels, such as NA and NULL
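
- A small sketch of how the default sentinels behave and how to switch them off with `keep_default_na=False`. The inline data is a shortened, made-up variant of the example file used in this section.

```python
import io

import pandas as pd

text = "something,a,message\none,1,NA\ntwo,,world\n"

# Default behaviour: 'NA' and the empty field are both read as missing.
default = pd.read_csv(io.StringIO(text))
print(default.isna())

# With the default sentinels disabled, 'NA' and '' are kept as plain strings.
raw_strings = pd.read_csv(io.StringIO(text), keep_default_na=False)
print(raw_strings.isna())
```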


In [29]:
%%bash
mkdir -p examples
rm -f examples/ex5.csv
cat <<EOT >> examples/ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
EOT

In [30]:
pd.read_csv('examples/ex5.csv')

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [31]:
# na_values takes a list or set of strings to treat as missing values
pd.read_csv('examples/ex5.csv', na_values=['NULL']) 

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [33]:
# different NA sentinels can be specified for each column
pd.read_csv('examples/ex5.csv',
            na_values={'message': ['foo', 'NA'], 'something': ['two']})

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,


## Reading Text Files in Pieces

- when processing very large files or figuring out the right set of arguments to correctly process a large file, you may only want to read a small piece of a file or iterate through smaller chunks of the file


In [34]:
# only read a small number of rows, specify with nrows

pd.read_csv('examples/ex5.csv', nrows=1)

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3,4,


In [35]:
# to read a file in pieces, specify a chunksize as a number of rows
chunker = pd.read_csv('examples/ex5.csv', chunksize=2)
chunker

<pandas.io.parsers.TextFileReader at 0x11dd2f390>

In [36]:
# the TextFileReader object returned allows you to iterate over the parts of
# the file according to the chunk size
for piece in chunker:
    print(piece)

  something  a  b    c  d message
0       one  1  2  3.0  4     NaN
1       two  5  6  NaN  8   world
  something  a   b   c   d message
2     three  9  10  11  12     foo
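
- Chunked reading is mostly useful when each piece is reduced to a small result before the next one is loaded. A hedged sketch, using made-up inline data with hypothetical `key` and `value` columns, that keeps running per-key totals:

```python
import io

import pandas as pd

# Ten made-up rows; keys cycle through a, b, c and value counts 0..9.
text = "key,value\n" + "".join(f"{'abc'[i % 3]},{i}\n" for i in range(10))

totals = {}
for chunk in pd.read_csv(io.StringIO(text), chunksize=4):
    # Aggregate within the chunk, then fold the result into the running totals.
    for key, subtotal in chunk.groupby("key")["value"].sum().items():
        totals[key] = totals.get(key, 0) + int(subtotal)

print(totals)
```

Only one chunk of at most four rows is held in memory at a time; the `totals` dict is the only state carried between chunks.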


## Writing Data to Text Format

- data can be exported to a delimited format

In [37]:
import sys
data = pd.read_csv('examples/ex5.csv')
data.to_csv(sys.stdout, sep='|')  # write to stdout (instead of a file) with | as the delimiter

|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
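
- Missing values are written as empty strings by default. A short sketch of a few other `to_csv` options (`na_rep`, `index`, `columns`), using a small made-up frame and an in-memory buffer instead of a file:

```python
import io

import numpy as np
import pandas as pd

data = pd.DataFrame({
    "something": ["one", "two"],
    "a": [1, 5],
    "c": [3.0, np.nan],
    "message": [np.nan, "world"],
})

buf = io.StringIO()
data.to_csv(
    buf,
    na_rep="NULL",                            # marker written for missing values
    index=False,                              # do not write the row labels
    columns=["something", "c", "message"],    # write a subset, in this order
)
csv_text = buf.getvalue()
print(csv_text)
```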


# 6.2 Binary Data Formats

- pandas has built-in support for binary formats like Python pickle and HDF5
    - MessagePack support (`to_msgpack` / `read_msgpack`) was deprecated in pandas 0.25 and removed in 1.0
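
- A minimal pickle round trip, as a sketch. The frame contents and the file name `frame.pkl` are arbitrary; a temporary directory is used so nothing is left behind. Pickle is best kept for short-term storage, and pickle files from untrusted sources should not be loaded.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")
    frame.to_pickle(path)            # serialize the DataFrame to disk
    restored = pd.read_pickle(path)  # load it back into memory

print(restored)
```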

## HDF5 Format

- a well-regarded file format intended for storing large quantities of scientific array data
- HDF stands for hierarchical data format
- each HDF5 file can store multiple datasets and supporting metadata
- supports on-the-fly compression with a variety of compression modes
    - enabling data with repeated patterns to be stored more efficiently
- can be a good choice for working with very large datasets that don't fit into memory, as one can efficiently read/write small sections of much larger arrays

In [40]:
# the HDFStore class from pandas provides an interface for
# storing Series and DataFrame objects in an HDF5 file
import pandas as pd
import numpy as np

frame = pd.DataFrame({'a': np.random.randn(100)})
store = pd.HDFStore('examples/pandas_hdfstore.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
store

<class 'pandas.io.pytables.HDFStore'>
File path: examples/pandas_hdfstore.h5

In [46]:
# objects contained in the HDF5 file can be retrieved with a dict-like API
store['obj1']

Unnamed: 0,a
0,0.304143
1,-1.339213
2,1.213074
3,-0.116605
4,-0.950282
...,...
95,-0.096383
96,0.277227
97,-0.050240
98,0.620772


In [45]:
print(store.info())

<class 'pandas.io.pytables.HDFStore'>
File path: examples/pandas_hdfstore.h5
/obj1                frame        (shape->[100,1])
/obj1_col            series       (shape->[100])  


In [50]:
# HDFStore supports two storage schemas, 'fixed' and 'table'
# the latter is generally slower but it supports query operations with
# special syntax
store = pd.HDFStore('examples/pandas_hdfstore.h5')
store.put('obj2', frame, format='table')
store.select('obj2', where=['index >= 10 and index <= 15'])

Unnamed: 0,a
10,-1.137287
11,0.056313
12,-0.081224
13,0.318588
14,0.867176
15,1.500923


In [51]:
store.close()

In [53]:
# pass key by keyword; newer pandas versions deprecate passing it positionally
frame.to_hdf('examples/pandas_hdfstore.h5', key='obj3', format='table')

In [54]:
pd.read_hdf('examples/pandas_hdfstore.h5', 'obj3', where=['index < 5'])

Unnamed: 0,a
0,0.304143
1,-1.339213
2,1.213074
3,-0.116605
4,-0.950282


## Parquet

- Apache Parquet provides a partitioned binary columnar serialization for data frames
    - designed to make reading/writing data frames efficient
    - makes sharing data across data analysis languages easy
- uses a variety of compression techniques to shrink the file size
- supports all pandas dtypes, including extension dtypes such as datetime with tz
- the user specifies an *engine* to direct the serialization, one of `pyarrow`, `fastparquet`, or `auto`


In [2]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': list('abc'),
                   'b': list(range(1, 4)),
                   'c': np.arange(3, 6).astype('u1'),
                   'd': np.arange(4.0, 7.0, dtype='float64'),
                   'e': [True, False, True],
                   'f': pd.date_range('20130101', periods=3),
                   'g': pd.date_range('20130101', periods=3, tz='US/Eastern'),
                   'h': pd.Categorical(list('abc')),
                   'i': pd.Categorical(list('abc'), ordered=True)}) 
df

Unnamed: 0,a,b,c,d,e,f,g,h,i
0,a,1,3,4.0,True,2013-01-01,2013-01-01 00:00:00-05:00,a,a
1,b,2,4,5.0,False,2013-01-02,2013-01-02 00:00:00-05:00,b,b
2,c,3,5,6.0,True,2013-01-03,2013-01-03 00:00:00-05:00,c,c


In [3]:
df.dtypes

a                        object
b                         int64
c                         uint8
d                       float64
e                          bool
f                datetime64[ns]
g    datetime64[ns, US/Eastern]
h                      category
i                      category
dtype: object

In [4]:
df.to_parquet('examples/parrow.parquet', engine='pyarrow')

In [5]:
df.to_parquet('examples/fastparquet.parquet', engine='fastparquet')

In [6]:
pd.read_parquet('examples/fastparquet.parquet', engine='fastparquet')

Unnamed: 0,a,b,c,d,e,f,g,h,i
0,a,1,3,4.0,True,2013-01-01,2013-01-01 00:00:00-05:00,a,a
1,b,2,4,5.0,False,2013-01-02,2013-01-02 00:00:00-05:00,b,b
2,c,3,5,6.0,True,2013-01-03,2013-01-03 00:00:00-05:00,c,c


In [8]:
pd.read_parquet('examples/parrow.parquet',
                engine='pyarrow',
                columns=['a', 'b'])

Unnamed: 0,a,b
0,a,1
1,b,2
2,c,3


# 6.3 Interacting with Web APIs



In [9]:
import requests

url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp

<Response [200]>

In [11]:
data = resp.json()
data

[{'url': 'https://api.github.com/repos/pandas-dev/pandas/issues/32471',
  'repository_url': 'https://api.github.com/repos/pandas-dev/pandas',
  'labels_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/32471/labels{/name}',
  'comments_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/32471/comments',
  'events_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/32471/events',
  'html_url': 'https://github.com/pandas-dev/pandas/issues/32471',
  'id': 576531429,
  'node_id': 'MDU6SXNzdWU1NzY1MzE0Mjk=',
  'number': 32471,
  'title': 'Dataframe Groupby value_counts with bins parameter',
  'user': {'login': 'scottboston',
   'id': 23064098,
   'node_id': 'MDQ6VXNlcjIzMDY0MDk4',
   'avatar_url': 'https://avatars3.githubusercontent.com/u/23064098?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/scottboston',
   'html_url': 'https://github.com/scottboston',
   'followers_url': 'https://api.github.com/users/scottboston/followers',
   'following

In [12]:
data[0]['title']

'Dataframe Groupby value_counts with bins parameter'

In [13]:
# create a DataFrame by passing in fields of interest
issues = pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])
issues

Unnamed: 0,number,title,labels,state
0,32471,Dataframe Groupby value_counts with bins param...,[],open
1,32470,Mishandling exception when trying to access in...,[],open
2,32469,DOC: fix styling (css) of getting started tuto...,"[{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT...",open
3,32468,Should Groupby.sum modify _selected_obj?,[],open
4,32467,CLN: use _values_for_argsort for join_non_uniq...,[],open
5,32466,Should Whitespaces be placed at the begging of...,"[{'id': 106935113, 'node_id': 'MDU6TGFiZWwxMDY...",open
6,32465,TST: Fixed xfail for tests in pandas/tests/tse...,"[{'id': 127685, 'node_id': 'MDU6TGFiZWwxMjc2OD...",open
7,32464,Grouping by all columns of an empty DataFrame ...,[],open
8,32463,Difference between count and nunique formatting,[],open
9,32462,Inconsistent result with cumsum columns,[],open
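
- The `labels` column above still holds nested lists of dicts. A hedged sketch of flattening such records with `pd.json_normalize`; the `sample` payload below is made up and only mimics the shape of the real API response (no network call is made).

```python
import pandas as pd

# Hypothetical records shaped like the GitHub issues payload.
sample = [
    {"number": 1, "title": "first", "state": "open",
     "user": {"login": "alice"},
     "labels": [{"name": "Bug"}, {"name": "Docs"}]},
    {"number": 2, "title": "second", "state": "open",
     "user": {"login": "bob"},
     "labels": []},
]

# Nested dicts become dotted column names such as 'user.login'.
flat = pd.json_normalize(sample)

# Lists are left as-is, so pull the label names out explicitly.
flat["label_names"] = flat["labels"].apply(
    lambda labels: [label["name"] for label in labels]
)
print(flat[["number", "title", "user.login", "label_names"]])
```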


# 6.4 Interacting with Databases

- Loading data from SQL into a DataFrame

In [14]:
import sqlite3

query = """
CREATE TABLE test(
    a VARCHAR(20),
    b VARCHAR(20),
    c REAL,
    d INTEGER
);
"""

conn = sqlite3.connect('examples/data.sqlite')
conn.execute(query)
conn.commit()

In [20]:
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]

stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
conn.executemany(stmt, data)
conn.commit()

In [21]:
cursor = conn.execute("SELECT * FROM test")
rows = cursor.fetchall()
rows

[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

In [23]:
description = cursor.description
description

(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [24]:
df = pd.DataFrame(rows, columns=[d[0] for d in cursor.description])
df

Unnamed: 0,a,b,c,d
0,Atlanta,Georgia,1.25,6
1,Tallahassee,Florida,2.6,3
2,Sacramento,California,1.7,5


- SQLAlchemy abstracts away many of the differences between SQL databases, and pandas has `read_sql` for reading data through a general SQLAlchemy connection


In [27]:
import sqlalchemy as sqla

db = sqla.create_engine('sqlite:///examples/data.sqlite')
pd.read_sql('SELECT * FROM test', db)

Unnamed: 0,a,b,c,d
0,Atlanta,Georgia,1.25,6
1,Tallahassee,Florida,2.6,3
2,Sacramento,California,1.7,5
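
- `read_sql` also accepts a plain `sqlite3` connection, and query values can be passed through `params` instead of being formatted into the SQL string. A self-contained sketch using an in-memory database (an assumption made here so the example does not touch `examples/data.sqlite`):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE test(a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)"
)
conn.executemany(
    "INSERT INTO test VALUES(?, ?, ?, ?)",
    [("Atlanta", "Georgia", 1.25, 6),
     ("Tallahassee", "Florida", 2.6, 3),
     ("Sacramento", "California", 1.7, 5)],
)
conn.commit()

# The '?' placeholder is filled from params by the sqlite3 driver.
result = pd.read_sql(
    "SELECT a, d FROM test WHERE d >= ? ORDER BY d DESC",
    conn,
    params=(5,),
)
conn.close()
print(result)
```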
