# Data Loading, Storage, and File Formats

Accessing data is a necessary first step in data analysis. This note focuses on data input and output using pandas. Input and output typically fall into a few main categories:
 + reading text files and other efficient on-disk formats
 + loading data from databases, and
 + interacting with network sources like web APIs
 

## Reading and Writing Data in Text Format
  pandas features a number of functions for reading tabular data as a DataFrame object. The table below summarises some of them; `read_csv` is likely the one you'll use the most.
  
  **Function** | **Description**
  --- | ---
  `read_csv` | Load delimited data from a file, URL, or file-like object; uses a comma as the default delimiter
  `read_fwf` | Read data in fixed-width column format (i.e., no delimiters)
  `read_clipboard` | Version of `read_csv` that reads data from the clipboard; useful for converting tables from web pages
  `read_excel` | Read tabular data from an Excel XLS or XLSX file
  `read_hdf` | Read HDF5 files written by pandas
  `read_html` | Read all tables found in the given HTML document
  `read_json` | Read data from a JSON (JavaScript Object Notation) string representation
  `read_msgpack` | Read pandas data encoded using the MessagePack binary format (deprecated in pandas 0.25 and removed in 1.0)
  `read_pickle` | Read an arbitrary object stored in Python pickle format
  `read_sas` | Read a SAS dataset stored in one of the SAS system's custom storage formats
  `read_sql` | Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame
  `read_stata` | Read a dataset from Stata file format
  `read_feather` | Read the Feather binary file format
  
  Some of these functions, like `read_csv()`, perform *type inference*, because the column data types are not part of the data format. That means you don't necessarily have to specify which columns are numeric, integer, boolean, or string. Other data formats, like HDF5, Feather, and msgpack, have the data types stored in the format.
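
Type inference can be seen in a minimal sketch like the one below. The CSV text and its column names are made up for illustration, and `io.StringIO` stands in for a file on disk so the snippet runs on its own:

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk
csv_text = (
    "id,price,in_stock,label\n"
    "1,2.5,True,apple\n"
    "2,3.0,False,pear\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# No dtypes were passed; each column's type was inferred from the text
print(df.dtypes)
```

The integer, float, and boolean columns should come back with numeric or boolean dtypes, while `label` stays as text.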
  
  Handling dates and other custom types can require extra effort. The following sections will demonstrate how to read data in various formats in text files. Let's start with a small comma-separated (CSV) text file that I've created: 

In [1]:
!cat examples/ex1.csv

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


***Note:*** *Using an exclamation mark (!) before a command passes it to the underlying shell (not to the Python interpreter). In the example above, I used the Linux `cat` command to print the raw contents of the file to the screen. On Windows, you can use `type` instead of `cat` to achieve the same effect.*

   As the file is comma-delimited, we can use the `read_csv()` function to read it into a DataFrame:

In [2]:
import pandas as pd
df = pd.read_csv('examples/ex1.csv')
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


By default, `read_csv()` will read the first row as the header row. However, a file will not always have a header row. Consider this example:

In [3]:
!cat examples/ex2.csv

1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


To read the file using `read_csv()`, there are several options you can configure. You can allow pandas to assign default column names:

In [4]:
pd.read_csv('examples/ex2.csv', header=None)

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Alternatively, you can specify the column names yourself with `names` and choose which column becomes the row index with `index_col`:

In [5]:
names = ['a','b','c','d','message']
pd.read_csv('examples/ex2.csv', names=names, index_col='message')

message,a,b,c,d
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


In [6]:
data = pd.read_csv('examples/ex2.csv', names=names, index_col='message')
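
The two approaches above can be compared side by side in a self-contained sketch. Here the contents of `ex2.csv` are reproduced as a string and read through `io.StringIO`, which is my substitution for the file path:

```python
import io

import pandas as pd

# Same rows as examples/ex2.csv, held in memory (no header row)
raw = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# Option 1: let pandas number the columns 0..4
default_names = pd.read_csv(io.StringIO(raw), header=None)

# Option 2: supply names and promote 'message' to the row index
names = ["a", "b", "c", "d", "message"]
indexed = pd.read_csv(io.StringIO(raw), names=names, index_col="message")

print(list(default_names.columns))
print(indexed.loc["world", "b"])  # look up a value by its index label
```

With `message` as the index, rows can be selected by label rather than by position.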

To form a hierarchical index from multiple columns, pass a list of column numbers or names to `index_col`. Consider this example:

In [7]:
!cat examples/csv_mindex.csv

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


In [8]:
pd.read_csv('examples/csv_mindex.csv', index_col=['key1','key2'])

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
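
As a rough illustration of what the hierarchical index gives you, the sketch below reads the same rows from an in-memory string (my stand-in for `csv_mindex.csv`) and selects by the outer level and by a full key pair:

```python
import io

import pandas as pd

# Same content as examples/csv_mindex.csv, held in memory
raw = (
    "key1,key2,value1,value2\n"
    "one,a,1,2\none,b,3,4\none,c,5,6\none,d,7,8\n"
    "two,a,9,10\ntwo,b,11,12\ntwo,c,13,14\ntwo,d,15,16\n"
)

parsed = pd.read_csv(io.StringIO(raw), index_col=["key1", "key2"])

# Selecting on the outer level returns the four rows under 'one'
outer = parsed.loc["one"]

# A (key1, key2) tuple addresses a single row
single = parsed.loc[("two", "c"), "value1"]

print(parsed.index.nlevels)
print(outer)
print(single)
```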


In some cases, a table in the file might not have a fixed delimiter, using whitespace or some other pattern to separate fields. Consider a text file that looks like this:

In [9]:
list(open('examples/ex3.txt'))

['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

The fields here are separated by a variable amount of whitespace. In these cases, you can pass a regular expression (regex for short) as the delimiter for `read_csv()`. Here the pattern is `\s+` (one or more whitespace characters), which gives:

In [10]:
pd.read_csv('examples/ex3.txt', sep=r'\s+')  # raw string avoids an invalid-escape warning

Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491


   In the above example, the header row has one fewer column name than there are fields in each data row. Thus `read_csv` infers that the first column should be the DataFrame's index.
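
The index inference described above can be checked with a short sketch. The text of `ex3.txt` is reproduced as a string and wrapped in `io.StringIO`, which is my assumption to keep the snippet self-contained:

```python
import io

import pandas as pd

# Same content as examples/ex3.txt: three header names, four fields per row
raw = (
    "            A         B         C\n"
    "aaa -0.264438 -1.026059 -0.619500\n"
    "bbb  0.927272  0.302904 -0.032399\n"
    "ccc -0.264273 -0.386314 -0.217601\n"
    "ddd -0.871858 -0.348382  1.100491\n"
)

# A raw-string regex splits on any run of whitespace
result = pd.read_csv(io.StringIO(raw), sep=r"\s+")

print(list(result.columns))  # only the three named columns
print(list(result.index))    # the unnamed first field became the index
```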
   
   The parser functions have several additional arguments to help you handle the wide variety of exceptional file formats that occur. For example, you can skip the first, third, and fourth rows of a file with `skiprows`:

In [11]:
!cat examples/ex4.csv

# hey!
a,b,c,d,message
# just wanted to make things more difficult
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


In [12]:
pd.read_csv('examples/ex4.csv', skiprows=[0,2,3])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
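
When the unwanted rows all start with the same marker, the `comment` option is an alternative to listing row numbers. The sketch below is a comparison under that assumption, with the contents of `ex4.csv` reproduced in memory rather than read from disk:

```python
import io

import pandas as pd

# Same content as examples/ex4.csv, held in memory
raw = (
    "# hey!\n"
    "a,b,c,d,message\n"
    "# just wanted to make things more difficult\n"
    "# who reads CSV files with computers, anyway?\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
    "9,10,11,12,foo\n"
)

# Skip rows by position (0-based line numbers)
by_rows = pd.read_csv(io.StringIO(raw), skiprows=[0, 2, 3])

# Or ignore every line that begins with the comment character
by_comment = pd.read_csv(io.StringIO(raw), comment="#")

print(by_rows.equals(by_comment))
```

`skiprows` depends on the exact line positions, whereas `comment` keeps working if comment lines are added or moved. Note that `comment` would also truncate a data line at any `#` that appears inside it.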


Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (empty string) or marked by some *sentinel* value (e.g., NA). By default, pandas uses a set of commonly occurring sentinels, such as NA and NULL:

In [13]:
!cat examples/ex5.csv

something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo


In [14]:
result = pd.read_csv('examples/ex5.csv')
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


Checking for null values:

In [15]:
pd.isnull(result)

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,True,False,False
2,False,False,False,False,False,False


The `na_values` option can take either a list or set of strings to treat as missing values. You can also specify different NA sentinels for each column in a dict:

In [16]:
sentinels = {'message': ['foo','NA'], 'something': ['two']}
pd.read_csv('examples/ex5.csv', na_values=sentinels)

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,
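
To see how the per-column sentinels change the result, the sketch below counts missing values per column with and without the dict. The contents of `ex5.csv` are reproduced as a string, which is my substitution for the file:

```python
import io

import pandas as pd

# Same content as examples/ex5.csv, held in memory
raw = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

# Default sentinels only: the empty field and 'NA' become missing
default = pd.read_csv(io.StringIO(raw))

# Extra per-column sentinels are added on top of the defaults
sentinels = {"message": ["foo", "NA"], "something": ["two"]}
custom = pd.read_csv(io.StringIO(raw), na_values=sentinels)

print(default.isnull().sum())
print(custom.isnull().sum())
```

The sentinels apply only to the columns they are keyed on, so `'two'` is treated as missing in `something` but would be left alone in any other column.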


The table below lists some frequently used options in `pandas.read_csv()`:

**Argument** | **Description**
--- | ---
`filepath_or_buffer` | String indicating filesystem location, URL, or file-like object
`sep` or `delimiter` | Character sequence or regular expression to use to split fields in each row
`header` | Row number to use as column names; defaults to 0 (first row), but should be `None` if there is no header row
`index_col` | Column numbers or names to use as the row index in the result; can be a single name/number or a list of them for a hierarchical index
`names` | List of column names for the result; combine with `header=None`
`skiprows` | Number of rows at beginning of file to ignore or list of row numbers (starting from 0) to skip.
`na_values` | Sequence of values to replace with NA
`comment` | Character(s) to split comments off the end of lines
`parse_dates` | Attempt to parse data to `datetime`; `False` by default. If `True`, will attempt to parse all columns. Otherwise, can specify a list of column numbers or names to parse. If an element of the list is a tuple or list, will combine multiple columns together and parse to date (e.g., if date/time is split across two columns)
`keep_date_col` | If joining columns to parse date, keep the joined columns; `False` by default
`converters` | Dict containing column number or name mapping to functions (e.g., `{'foo': f}` would apply the function `f` to all values in the `'foo'` column)
`dayfirst` | When parsing potentially ambiguous dates, treat as international format (e.g., 7/6/2012 -> June 7, 2012); `False` by default
`date_parser` | Function to use to parse dates
`nrows` | Number of rows to read from beginning of file
`iterator` | Return a TextParser object for reading file piecemeal
`chunksize` | For iteration, size of file chunks
`skipfooter` | Number of lines to ignore at end of file
`verbose` | Print various parser output information, like the number of missing values placed in non-numeric columns
`encoding` | Text encoding for Unicode (e.g., 'utf-8' for UTF-8 encoded text)
`squeeze` | If the parsed data only contains one column, return a Series
`thousands` | Separator for thousands (e.g., ',' or '.')
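
A few of these options can be combined in one short sketch. The data below is invented for illustration (ten rows with quoted amounts that use a comma as the thousands separator) and is read from an in-memory string rather than a file:

```python
import io

import pandas as pd

# Hypothetical data: ten rows, amounts written like "1,000"
lines = ["key,amount"] + [f'k{i},"{i + 1},000"' for i in range(10)]
raw = "\n".join(lines) + "\n"

# nrows: read only the first three data rows
head = pd.read_csv(io.StringIO(raw), nrows=3, thousands=",")

# chunksize: iterate over the file four rows at a time
total = 0
chunk_lengths = []
for chunk in pd.read_csv(io.StringIO(raw), chunksize=4, thousands=","):
    chunk_lengths.append(len(chunk))
    total += chunk["amount"].sum()

print(head)
print(chunk_lengths, total)
```

Reading in chunks keeps only one piece of the file in memory at a time, which is the usual reason to reach for `chunksize` on large inputs.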