# Data Loading, Storage, and File Formats

Accessing data is a necessary first step in any data analysis. This note focuses on data input and output using pandas. Input and output typically fall into a few main categories:
 + reading text files and other efficient on-disk formats
 + loading data from databases, and
 + interacting with network sources such as web APIs
 

## Reading Data in Text Format
  pandas features a number of functions for reading tabular data as a DataFrame object. The table below summarises some of them; `read_csv` is likely the one you'll use the most.
  
  **Function** | **Description**
  --- | ---
  `read_csv` | Load delimited data from a file, URL, or file-like object; use comma as default delimiter
  `read_fwf` | Read data in fixed-width column format (i.e., no delimiters)
  `read_clipboard` | Version of `read_csv` that reads data from the clipboard; useful for converting tables from web pages
  `read_excel` | Read tabular data from an Excel XLS or XLSX file
  `read_hdf` | Read HDF5 files written by pandas
  `read_html` | Read all tables found in the given HTML document
  `read_json` | Read data from a JSON (JavaScript Object Notation) string representation
  `read_msgpack` | Read pandas data encoded using the MessagePack binary format (deprecated in pandas 0.25 and removed in 1.0)
  `read_pickle` | Read an arbitrary object stored in Python pickle format
  `read_sas` | Read a SAS dataset stored in one of the SAS system's custom storage formats
  `read_sql` | Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame
  `read_stata` | Read a dataset from Stata file format
  `read_feather` | Read the Feather binary file format
  
  Some of these functions, like `read_csv()`, perform *type inference*, because the column data types are not part of the data format. That means you don't necessarily have to specify which columns are numeric, integer, boolean, or string. Other data formats, like HDF5, Feather, and msgpack, have the data types stored in the format.
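
As a minimal sketch of type inference, the snippet below reads a small in-memory CSV (the `csv_text` contents are made up for illustration and stand in for a file on disk) and inspects the data types pandas chose. It also shows that an explicit `dtype` mapping can override the inferred type for a column.

```python
import io

import pandas as pd

# Hypothetical CSV contents, used here instead of a file on disk
csv_text = "a,b,flag,message\n1,2.5,True,hello\n5,6.0,False,world\n"

# No types are specified: pandas infers them from the text
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)

# An explicit dtype overrides the inference for the named column
df_float = pd.read_csv(io.StringIO(csv_text), dtype={'a': 'float64'})
print(df_float.dtypes)
```

The exact dtype reported for the text column depends on the pandas version, so the sketch only prints it rather than relying on a particular name.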
  
  Handling dates and other custom types can require extra effort. The following sections will demonstrate how to read data in various formats in text files. Let's start with a small comma-separated (CSV) text file that I've created: 

In [1]:
!cat examples/ex1.csv

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


***Note:*** *Prefixing a command with an exclamation mark (`!`) passes it to the underlying shell rather than to the Python interpreter. In the example above, I used the Linux `cat` command to print the raw contents of the file to the screen. On Windows, you can use `type` instead of `cat` to achieve the same effect.*

   As the file is comma-delimited, we can use `read_csv()` to read it into a DataFrame. Before that, we need to import the pandas library:

In [2]:
import pandas as pd
df = pd.read_csv('examples/ex1.csv')
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


#### Header row 

By default, `read_csv()` will read the first row as the header row. However, a file will not always have a header row. Consider this example:

In [3]:
!cat examples/ex2.csv

1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


To read the file using `read_csv()`, there are several options you can configure. You can allow pandas to assign default column names:

In [4]:
pd.read_csv('examples/ex2.csv', header=None)

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Alternatively, you can specify the column names yourself and indicate which column to use as the index:

In [5]:
names = ['a','b','c','d','message']
pd.read_csv('examples/ex2.csv', names=names, index_col='message')

message,a,b,c,d
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


In [6]:
data = pd.read_csv('examples/ex2.csv', names=names, index_col='message')

#### Hierarchical Index

To form a hierarchical index from multiple columns, pass a list of column numbers or names to `index_col`. Consider this example:

In [7]:
!cat examples/csv_mindex.csv

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


In [8]:
pd.read_csv('examples/csv_mindex.csv', index_col=['key1','key2'])

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


#### Delimiters apart from Comma

In some cases, a table in the file might not have a fixed delimiter, using whitespace or some other pattern to separate fields. Consider a text file that looks like this:

In [9]:
list(open('examples/ex3.txt'))

['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

The fields here are separated by a variable amount of whitespace. In these cases, you can pass a regular expression (regex for short) as the delimiter for `read_csv()`. Here the pattern `\s+`, which matches one or more whitespace characters, does the job:

In [10]:
pd.read_csv('examples/ex3.txt', sep=r'\s+')

Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491


   In the above example, there was one fewer column name than the number of data columns. Thus `read_csv` infers that the first column should be the DataFrame's index.
   
   The parser functions have several additional arguments to help you handle the wide variety of exceptional file formats that occur. For example, you can skip the first, third and fourth rows of a file with `skiprows`:

In [11]:
!cat examples/ex4.csv

# hey!
a,b,c,d,message
# just wanted to make things more difficult
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


In [12]:
pd.read_csv('examples/ex4.csv', skiprows=[0,2,3])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
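
The same file can also be read without counting row numbers by hand. The sketch below assumes the unwanted lines all start with `#` and uses the `comment` option of `read_csv`; the text is an in-memory copy of the file contents shown above, so that the snippet runs on its own.

```python
import io

import pandas as pd

# In-memory copy of examples/ex4.csv as printed above
ex4_text = """# hey!
a,b,c,d,message
# just wanted to make things more difficult
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
"""

# Lines that begin with the comment character are ignored entirely
df = pd.read_csv(io.StringIO(ex4_text), comment='#')
print(df)
```

Note that `comment` also cuts off the rest of any data line that contains the character, so this approach only suits files where `#` never appears inside a value.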


#### Missing Values

Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (empty string) or marked by some *sentinel* value (e.g., NA). By default, pandas uses a set of commonly occurring sentinels, such as NA and NULL:

In [13]:
!cat examples/ex5.csv

something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo


In [14]:
data = pd.read_csv('examples/ex5.csv')
data

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


Checking for null values:

In [15]:
pd.isnull(data)

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,True,False,False
2,False,False,False,False,False,False


The `na_values` option can take either a list or set of strings to treat as missing values. You can also specify different NA sentinels for each column in a dict:

In [16]:
sentinels = {'message': ['foo','NA'], 'something': ['two']}
pd.read_csv('examples/ex5.csv', na_values=sentinels)

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,
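
Sometimes the opposite is needed: a default sentinel such as `NA` is a legitimate value (a country code, for instance) and should not be turned into a missing value. A possible sketch, assuming made-up data and using the `keep_default_na` option of `read_csv` (not used elsewhere in this note), switches off the default sentinels and treats only empty fields as missing:

```python
import io

import pandas as pd

# Hypothetical data in which 'NA' is a real code, not a missing value
csv_text = "code,value\nNA,1\nUS,\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,  # do not apply the built-in sentinel list
    na_values=[''],         # only empty fields count as missing
)
print(df)
print(df.isnull())
```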


The table below lists some frequently used options in `pandas.read_csv()`:

**Argument** | **Description**
--- | ---
`path` | String indicating filesystem location, URL, or file-like object
`sep` or `delimiter` | Character sequence or regular expression to use to split fields in each row
`header` | Row number to use as column names; defaults to 0 (first row), but should be `None` if there is no header row
`index_col` | Column numbers or names to use as the row index in the result; can be a single name/number or a list of them for a hierarchical index
`names` | List of column names for the result; combine with `header=None`
`skiprows` | Number of rows at beginning of file to ignore or list of row numbers (starting from 0) to skip.
`na_values` | Sequence of values to replace with NA
`comment` | Character(s) to split comments off the end of lines
`parse_dates` | Attempt to parse data to `datetime`; `False` by default. If `True`, will attempt to parse all columns. Otherwise, can specify a list of column numbers or names to parse. If an element of the list is a tuple or list, will combine multiple columns together and parse to date (e.g., if date/time is split across two columns)
`keep_date_col` | If joining columns to parse date, keep the joined columns; `False` by default
`converters` | Dict containing column numbers or names mapping to functions (e.g., `{'foo': f}` would apply the function `f` to all values in the `'foo'` column)
`dayfirst` | When parsing potentially ambiguous dates, treat as international format (e.g., 7/6/2012 -> June 7, 2012); `False` by default
`date_parser` | Function to use to parse dates
`nrows` | Number of rows to read from beginning of file
`iterator` | Return a TextParser object for reading file piecemeal
`chunksize` | For iteration, size of file chunks
`skipfooter` | Number of lines to ignore at end of file
`verbose` | Print various parser output information, like the number of missing values placed in non-numeric columns
`encoding` | Text encoding for Unicode (e.g., 'utf-8' for UTF-8 encoded text)
`squeeze` | If the parsed data only contains one column, return a Series
`thousands` | Separator for thousands (e.g., ',' or '.')
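
To make a few of these options concrete, here is a small sketch that combines `sep`, `parse_dates`, `thousands` and `converters` in one call. The sample data and column names are invented for illustration only.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data with dates and
# comma-grouped thousands
csv_text = (
    "date;amount;code\n"
    "2012-06-07;1,234;ab\n"
    "2012-06-08;12,000;cd\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=';',                          # field delimiter
    parse_dates=['date'],             # parse this column as datetime
    thousands=',',                    # strip thousands separators
    converters={'code': str.upper},   # apply str.upper to each value
)
print(df)
print(df.dtypes)
```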

## Reading Text Files in Pieces

   When processing very large files or figuring out the right set of arguments to correctly process a large file, you may only want to read in a small piece of a file or iterate through smaller chunks of the file.
   
   Before we look at a large file, we can make the pandas display more compact by lowering the `display.max_rows` setting (default: 60). To read only a few rows rather than the whole file, pass the `nrows` option to `read_csv()`:

In [17]:
pd.options.display.max_rows = 10
result = pd.read_csv('examples/ex6.csv')
result

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.501840,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
...,...,...,...,...,...
9995,2.311896,-0.417070,-1.409599,-0.515821,L
9996,-0.479893,-0.650419,0.745152,-0.646038,E
9997,0.523331,0.787112,0.486066,1.093156,K
9998,-0.362559,0.598894,-1.843201,0.887292,G


In [18]:
# read only the first five rows with the nrows parameter
result = pd.read_csv('examples/ex6.csv', nrows=5)
result

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q


To read a file in pieces, specify a `chunksize` as a number of rows:

In [19]:
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
print(type(chunker))

tot = pd.Series([], dtype='float64')
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
    
# display first 10 rows in descending order
tot.sort_values(ascending=False)[:10]

<class 'pandas.io.parsers.TextFileReader'>




E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64
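
The counts above come out as floats because `add` with `fill_value` works on floating-point data. One possible variation, sketched here on a small generated CSV (the data is made up so the snippet runs without `ex6.csv`), collects the per-chunk counts first and then sums them by key, which keeps the totals as integers:

```python
import io

import pandas as pd

# Hypothetical data: ten rows whose keys cycle through A, B, C
rows = "\n".join(f"{i},{'ABC'[i % 3]}" for i in range(10))
csv_text = "value,key\n" + rows + "\n"

# Read four rows at a time and count the keys within each chunk
chunker = pd.read_csv(io.StringIO(csv_text), chunksize=4)
pieces = [piece['key'].value_counts() for piece in chunker]

# Stack the per-chunk counts and add up the entries for each key
counts = pd.concat(pieces).groupby(level=0).sum()
print(counts.sort_values(ascending=False))
```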

## Writing Data to Text Format

Data can also be exported to a delimited format. Using DataFrame's `to_csv()` method, we can write the data out to a comma-separated file:

In [20]:
# print data from one of the earlier examples
print("{}\n".format(data))

# output data to out.csv file
data.to_csv('examples/out.csv')

# invoking shell command to display file contents
!cat examples/out.csv

  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


#### Header and Index options
By default, the `to_csv()` method writes the index labels to the file. We can change this by passing `index=False`:

In [21]:
data.to_csv('examples/out2.csv', index=False)

!cat examples/out2.csv

something,a,b,c,d,message
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo


We can also omit the header row by passing `header=False`:

In [22]:
data.to_csv('examples/out3.csv', header=False)

!cat examples/out3.csv

0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


#### Delimiter options

Delimiters other than the default comma (`,`) can be used. The example below uses `|` as the delimiter and writes to the Python console (`sys.stdout`) instead of a file:

In [23]:
import sys

data.to_csv(sys.stdout, sep='|')

|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo


#### Missing Values

   Missing values appear as empty strings in the output. You can denote them with some other sentinel value using `na_rep`:

In [24]:
data.to_csv(sys.stdout, na_rep='NULL')

,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo


#### Subset of Data

You can also write only a subset of the columns, and in an order of your choosing:

In [25]:
data.to_csv(sys.stdout, columns=['a','b','c'])

,a,b,c
0,1,2,3.0
1,5,6,
2,9,10,11.0


#### `to_csv` for Series

A pandas Series also has a `to_csv()` method, so writing to CSV is not limited to DataFrames:

In [26]:
import numpy as np

dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7), index=dates)
ts.to_csv('examples/tseries.csv', header=False)

!cat examples/tseries.csv

2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6
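
Reading such a file back into a Series takes a little care, because the file has no header and the first column holds dates. A minimal round-trip sketch, writing to an in-memory buffer instead of `examples/tseries.csv` so that it runs on its own:

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range('2000-01-01', periods=7)
ts = pd.Series(np.arange(7), index=dates)

# Write to an in-memory buffer rather than a file on disk
buf = io.StringIO()
ts.to_csv(buf, header=False)
buf.seek(0)

# No header row; the first column becomes a parsed date index
frame = pd.read_csv(buf, header=None, index_col=0, parse_dates=True)

# The result is a one-column DataFrame; take that column as a Series
restored = frame.iloc[:, 0]
print(restored)
```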


## Working with Delimited Formats

Most forms of tabular data on disk can be loaded using functions like `pandas.read_csv()`. In some cases, though, some manual processing may be necessary. It's not uncommon to receive a file with one or more malformed lines that trip up `read_csv`. To illustrate the basic tools, consider a small CSV file:

In [27]:
!cat examples/ex7.csv

"a","b","c"
"1","2","3"
"1","2","3"


Suppose that the first line represents the header row and the second and third lines contain values that we would like to parse as tuples. For any file with a single-character delimiter, you can use Python's built-in `csv` module. To use it, pass any open file or file-like object to `csv.reader`:

In [28]:
import csv
# open the file in a with block so that it is closed afterwards
with open('examples/ex7.csv') as f:
    reader = csv.reader(f)

    # iterating through the reader yields values with any quote characters removed
    for line in reader:
        print(line)

['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']


Let's take this step by step in order to read the values and store them as tuples:

In [29]:
# Read the file into a list of lines
with open('examples/ex7.csv') as f:
    lines = list(csv.reader(f))
    
# Split the lines into the header line and the data lines
header, values = lines[0], lines[1:]

# Create a dictionary of data columns using dictionary comprehension and 
# zip(*values), which transposes rows to columns
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict

{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}

CSV files come in many different flavors. To define a new format with a different delimiter, string quoting convention, or line terminator, we define a simple subclass of `csv.Dialect`:

In [30]:
!cat examples/ex8.csv

"a";"b";"c"
"1";"2";"3"
"1";"2";"3"


In [31]:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

# open the file in a with block so that it is closed afterwards
with open('examples/ex8.csv') as f:
    reader = csv.reader(f, dialect=my_dialect)

    for line in reader:
        print(line)

['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']


We can also give individual CSV dialect parameters as keywords to `csv.reader` without having to define a subclass:
```Python
reader = csv.reader(f, delimiter='|')
```

Possible options (attributes of `csv.Dialect`) and what they do are listed in the table below:

**Argument** | **Description**
--- | ---
`delimiter` | One-character string to separate fields; defaults to ','.
`lineterminator` | Line terminator for writing; defaults to '\r\n'. Reader ignores this and recognises cross-platform line terminators
`quotechar` | Quote character for fields with special characters (like a delimiter); default is '"'.
`quoting` |  Quoting convention. Options include `csv.QUOTE_ALL` (quote all fields), `csv.QUOTE_MINIMAL` (only fields with special characters like the delimiter), `csv.QUOTE_NONNUMERIC`, and `csv.QUOTE_NONE` (no quoting). Defaults to `QUOTE_MINIMAL`
`skipinitialspace` | Ignore whitespace after each delimiter; default is `False`.
`doublequote` | How to handle the quoting character inside a field; if `True`, it is doubled
`escapechar` | String to escape the delimiter if `quoting` is set to `csv.QUOTE_NONE`; disabled by default.
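
As a short illustration of two of these options, the sketch below reads made-up text with `skipinitialspace=True` and writes a row with `quoting=csv.QUOTE_ALL`. In-memory buffers are used in place of files so the snippet is self-contained.

```python
import csv
import io

# Hypothetical input with a space after each comma and a quoted field
raw = 'name, city, score\n"Smith, J", Oslo, 3\n'

# skipinitialspace drops the whitespace that follows each delimiter
rows = list(csv.reader(io.StringIO(raw), skipinitialspace=True))
print(rows)

# QUOTE_ALL wraps every field in quotes, including numbers
out = io.StringIO()
writer = csv.writer(out, delimiter=';', quoting=csv.QUOTE_ALL,
                    lineterminator='\n')
writer.writerow(['a', 1, 'x;y'])
print(out.getvalue())
```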

*Note: For files with more complicated or fixed multicharacter delimiters, you will not be able to use the `csv` module. In those cases, you'll have to do the line splitting and other cleanup using string's `split` method or the regular expression method `re.split`.*
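
For instance, a file that uses `::` as its delimiter, with inconsistent spacing around it, could be handled along these lines. The sample lines are invented, and the pattern is one possible choice rather than a general recipe.

```python
import re

# Hypothetical lines using a two-character delimiter with stray spaces
lines = ["a :: b :: c", "1::2 ::3", "4 ::5::  6"]

# Split on '::' together with any whitespace around it
rows = [re.split(r"\s*::\s*", line.strip()) for line in lines]

header, values = rows[0], rows[1:]
print(header)
print(values)
```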

You can also write delimited files manually using `csv.writer`:

In [None]:
with open('mydata.csv', 'w', newline='') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1','2','3'))