In [1]:
import pandas as pd
import numpy as np

<h3>Reading and Writing Data in Text Format</h3>

pandas provides a number of functions for reading tabular data as a DataFrame object. The tables below summarize some of them, though read_csv and read_table are likely the ones we'll use most often.

![alt Text](Images/DataLoadingandStorage/reading_and_writing.png)

![alt Text](Images/DataLoadingandStorage/reading_and_writing2.png)

The optional arguments for these functions may fall into a few categories:

<b>Indexing</b> - Can treat one or more columns as the index of the returned DataFrame, and controls whether to get column names from the file, the user, or not at all.

<b>Type inference and data conversion</b> - This includes user-defined value conversions and custom lists of missing value markers.

<b>Datetime parsing</b> - Includes combining capability, such as combining date and time information spread over multiple columns into a single column in the result.

<b>Iterating</b> - Support for iterating over chunks of very large files.

<b>Unclean data issues</b> - Skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.

<b>Note - </b> Some of these functions, like pandas.read_csv, perform <b>type inference</b>, because the column data types are not part of the data format. That means we don't necessarily have to specify which columns are numeric, integer, boolean, or string. Other data formats, like HDF5, Feather, and msgpack, store the data types in the format itself.
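To make type inference concrete, here is a minimal sketch that reads a small in-memory CSV (the contents and column names are made up for illustration) and inspects the dtypes pandas picked. The second call shows how the dtype argument can override the inferred type for one column.

```python
import io

import pandas as pd

# Hypothetical CSV text, held in memory so the example needs no file on disk
csv_text = "id,price,in_stock,label\n1,2.5,True,apple\n2,3.0,False,pear\n"

# No types are declared anywhere: read_csv infers them from the values
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)

# Override the inference for a single column by passing dtype
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})
print(as_text["id"].tolist())
```

Passing dtype is useful when a column only looks numeric, such as an identifier with leading zeros.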

In [2]:
import os
os.listdir()

['.idea',
 '.ipynb_checkpoints',
 'array_archive.npz',
 'array_compressed.npz',
 'bacon.txt',
 'Built-in Data Structures, Functions and Files.ipynb',
 'Data Cleaning and Preparation.ipynb',
 'Data Loading, Storage, and File Formats.ipynb',
 'ex2.xlsx',
 'frame_pickle',
 'Images',
 'ipyhon_script_test.py',
 'mydata.csv',
 'mydata.sqlite',
 'Numpy.ipynb',
 'out.csv',
 'Pandas.ipynb',
 'pydata-book-2nd-edition',
 'pydata-book-2nd-edition.zip',
 'Python Data Analysis - 1.ipynb',
 'some_array.npy',
 'sonnet.txt',
 'text.txt',
 'tseries.csv']

In [3]:
os.getcwd()

'E:\\PycharmProjects\\Practice Python\\Python Data Science Analysis'

In [4]:
path = os.getcwd()+"\\pydata-book-2nd-edition\\examples\\ex1.csv"

In [5]:
!type pydata-book-2nd-edition\\examples\\ex1.csv

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


<b>Note</b> - Here, the Windows !type command prints the raw contents of the file to the screen (on Unix-like systems, !cat does the same).

Since this file is comma-delimited, we can use read_csv, mentioned in the table above, to read it into a DataFrame:

In [6]:
df = pd.read_csv('pydata-book-2nd-edition\\examples\\ex1.csv')

In [7]:
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


We could also have used read_table and specified the delimiter:

In [8]:
pd.read_table('pydata-book-2nd-edition\\examples\\ex1.csv', sep=',')

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


<b>Note: - </b> A file will not always have a header row. Consider the file shown below:

In [9]:
!type pydata-book-2nd-edition\\examples\\ex2.csv

1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


To read such files, we have a couple of options. We can allow pandas to assign default column names, or we can specify the names ourselves:

In [10]:
pd.read_csv('pydata-book-2nd-edition\\examples\\ex2.csv', header = None)

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [11]:

pd.read_csv('pydata-book-2nd-edition\\examples\\ex2.csv',names = ['a', 'b', 'c', 'd', 'message'])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Suppose we wanted the message column to be the index of the returned DataFrame. We can either indicate that we want the column at index 4 or the column named 'message' using the index_col argument:

In [12]:
names = ['a', 'b', 'c', 'd', 'message']

In [13]:
pd.read_csv('pydata-book-2nd-edition/examples/ex2.csv', names = names, index_col='message')

Unnamed: 0_level_0,a,b,c,d
message,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


If we want to form a hierarchical index from multiple columns, we can pass a list of column numbers or names:

In [14]:
!type pydata-book-2nd-edition\\examples\\csv_mindex.csv

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


In [15]:
parsed = pd.read_csv('pydata-book-2nd-edition\\examples\\csv_mindex.csv', index_col=['key1', 'key2'])

In [16]:
parsed

Unnamed: 0_level_0,Unnamed: 1_level_0,value1,value2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to separate fields, as in the example shown below:

In [17]:
!type pydata-book-2nd-edition\\examples\\ex3.txt

            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491


In [18]:
list(open('pydata-book-2nd-edition\\examples\\ex3.txt'))

['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

While we could do some data munging by hand, the fields here are separated by a variable amount of whitespace. In these cases, we can pass a regular expression as a delimiter for read_table. Here the delimiter can be expressed by the regular expression <b>'\s+'</b>, so we could then have:

In [19]:
result = pd.read_table('pydata-book-2nd-edition\\examples\\ex3.txt', sep=r'\s+')

In [20]:
result

Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491


<b>Note: - </b>Because there was one fewer column name than the number of data columns, read_table infers that the first column should be the DataFrame's index in this special case.

The parser functions have many additional arguments to help us handle the wide variety of exceptional file formats that occur. For example, we can skip the first, third, and fourth rows of a file with <b>skiprows</b>, as in the example below:

In [21]:
!type pydata-book-2nd-edition\\examples\\ex4.csv

# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


In [22]:
pd.read_csv('pydata-book-2nd-edition\\examples\\ex4.csv', skiprows=[0,2,3])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
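When the unwanted rows all start with the same marker, the comment argument of read_csv is an alternative to listing row numbers. The sketch below assumes the same contents as ex4.csv, held in an in-memory string so that it runs without the file, and compares the two approaches.

```python
import io

import pandas as pd

# Same layout as ex4.csv above, reproduced in memory for illustration
raw = (
    "# hey!\n"
    "a,b,c,d,message\n"
    "# just wanted to make things more difficult for you\n"
    "# who reads CSV files with computers, anyway?\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
    "9,10,11,12,foo\n"
)

# Skip rows by position, as in the cell above
by_rows = pd.read_csv(io.StringIO(raw), skiprows=[0, 2, 3])

# Skip every line that begins with the comment character instead
by_comment = pd.read_csv(io.StringIO(raw), comment="#")
print(by_comment)
```

The comment approach does not depend on knowing in advance which row numbers hold the stray lines.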


Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (empty string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as NA and NULL:

In [23]:
!type pydata-book-2nd-edition\\examples\\ex5.csv 

something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo


In [24]:
result = pd.read_csv('pydata-book-2nd-edition/examples/ex5.csv')

In [25]:
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [26]:
pd.isnull(result)

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,True,False,False
2,False,False,False,False,False,False


The na_values option can take either a list or a set of strings to treat as missing values:

In [27]:
result = pd.read_csv('pydata-book-2nd-edition/examples/ex5.csv', na_values=['NULL'])

In [28]:
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


Different NA sentinels can be specified for each column in a dict:

In [29]:
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}

In [30]:
pd.read_csv('pydata-book-2nd-edition/examples/ex5.csv', na_values=sentinels)

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,
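The default sentinels can also be switched off. As a minimal sketch, assuming the same contents as ex5.csv held in an in-memory string, passing keep_default_na=False makes read_csv keep the literal text NA in the message column instead of converting it to a missing value.

```python
import io

import pandas as pd

# Same rows as ex5.csv above, reproduced in memory for illustration
raw = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

# Default behaviour: the string NA is treated as missing
default = pd.read_csv(io.StringIO(raw))
print(default["message"].isna().tolist())

# With the default sentinels disabled, NA is kept as an ordinary string
literal = pd.read_csv(io.StringIO(raw), keep_default_na=False)
print(literal["message"].tolist())
```

This matters for data where NA is a legitimate value, for example a country or region code.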


Some frequently used options in pandas.read_csv and pandas.read_table are:

![alt Text](Images/DataLoadingandStorage/read_csv_func_args1.png)

![alt Text](Images/DataLoadingandStorage/read_csv_func_args2.png)

<h3>Reading Text Files in Pieces</h3>

When processing very large files or figuring out the right set of arguments to correctly process a large file, we may only want to read in a small piece of a file or iterate through smaller chunks of the file.

Before we look at a large file, we make the pandas display settings more  compact:

In [31]:
pd.options.display.max_rows= 10

In [32]:
result = pd.read_csv('pydata-book-2nd-edition/examples/ex6.csv')

In [33]:
result

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.501840,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
...,...,...,...,...,...
9995,2.311896,-0.417070,-1.409599,-0.515821,L
9996,-0.479893,-0.650419,0.745152,-0.646038,E
9997,0.523331,0.787112,0.486066,1.093156,K
9998,-0.362559,0.598894,-1.843201,0.887292,G


If we want to read only a small number of rows (avoiding reading the entire file), we can specify that with nrows:

In [34]:
pd.read_csv('pydata-book-2nd-edition/examples/ex6.csv', nrows=5)

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q


To read a file in pieces, specify a chunksize as a number of rows:

In [35]:
chunker = pd.read_csv('pydata-book-2nd-edition/examples/ex6.csv', chunksize=1000)

In [36]:
chunker

<pandas.io.parsers.TextFileReader at 0x2045b783d08>

The TextFileReader object returned by read_csv allows us to iterate over the parts of the file according to the chunksize. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column like so:

In [37]:
chunker = pd.read_csv('pydata-book-2nd-edition/examples/ex6.csv', chunksize=1000)

tot = pd.Series([], dtype='float64')  # explicit dtype avoids the empty-Series warning
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value = 0)
    
tot = tot.sort_values(ascending=False)


In [38]:
tot

E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
     ...  
5    157.0
2    152.0
0    151.0
9    150.0
1    146.0
Length: 36, dtype: float64

TextFileReader is also equipped with a get_chunk method that enables us to read pieces of an arbitrary size.
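A short sketch of get_chunk, using a ten-row CSV generated in memory (the column names x and y are made up for illustration). Each call picks up where the previous one stopped, and the sizes can differ from call to call.

```python
import io

import pandas as pd

# Ten illustrative rows: x counts from 0 to 9 and y is its square
raw = "x,y\n" + "".join(f"{i},{i * i}\n" for i in range(10))

# iterator=True returns a reader object instead of a DataFrame
reader = pd.read_csv(io.StringIO(raw), iterator=True)

first = reader.get_chunk(3)   # the first three rows
second = reader.get_chunk(4)  # the next four rows
reader.close()

print(first)
print(second)
```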

<h3>Writing Data to Text Format</h3>

Data can also be exported to a delimited format. Let's consider one of the CSV Files read before:

In [39]:
data = pd.read_csv('pydata-book-2nd-edition/examples/ex5.csv')

In [40]:
data

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


Using DataFrame's <b>to_csv</b> method, we can write the data out to a comma-separated file:

In [41]:
data.to_csv('out.csv')

In [42]:
!type out.csv

,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


Other delimiters can be used, of course (writing to sys.stdout so it prints the text result to the console):

In [43]:
import sys

In [44]:
data.to_csv(sys.stdout, sep='|')

|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo


Missing values appear as empty strings in the output. We might want to denote them by some other sentinel value:

In [45]:
data.to_csv(sys.stdout, na_rep='NULL')

,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo


With no other options specified, both the row and column labels are written. Both of these can be disabled:

In [46]:
data.to_csv(sys.stdout, index = False, header=False)

one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo


We can also write only a subset of the columns, and in an order of our choosing

In [47]:
data.to_csv(sys.stdout, index = False, columns = ['a', 'b', 'c'])

a,b,c
1,2,3.0
5,6,
9,10,11.0


Series also has a to_csv method

In [48]:
dates = pd.date_range('1/1/2000', periods = 7)

In [49]:
dates

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07'],
              dtype='datetime64[ns]', freq='D')

In [50]:
ts = pd.Series(np.arange(7), index = dates)

In [51]:
ts.to_csv('tseries.csv')

In [52]:
!type tseries.csv

,0
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6
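To check that nothing is lost on the way out, we can read the written series back in. This is a sketch under the assumption that an in-memory buffer stands in for tseries.csv and that the file is written without a header row: index_col=0 restores the index and parse_dates=True asks pandas to parse it as dates.

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range("1/1/2000", periods=7)
ts = pd.Series(np.arange(7), index=dates)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
ts.to_csv(buffer, header=False)
buffer.seek(0)

# Column 0 holds the dates and column 1 the values
restored = pd.read_csv(buffer, header=None, index_col=0, parse_dates=True)[1]
print(restored)
```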


<h3>Working with Delimited Formats</h3>

It is possible to load most forms of tabular data from disk using functions like pandas.read_table. In some cases, however, some manual processing may be necessary. It is not uncommon to receive a file with one or more malformed lines that trip up read_table. To illustrate the basic tools, consider a small CSV file:

In [53]:
!type pydata-book-2nd-edition\\examples\\ex7.csv

"a","b","c"
"1","2","3"
"1","2","3"


For any file with a single-character delimiter, we can use Python's built-in csv module. To use it, we pass any open file or file-like object to csv.reader:

In [54]:
import csv

In [55]:
f = open('pydata-book-2nd-edition/examples/ex7.csv')

In [56]:
reader = csv.reader(f)

Iterating through the reader like a file yields lists of values with any quote characters removed:

In [57]:
for line in reader:
    print(line)

['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']


From there, it is up to us to do the wrangling necessary to put the data in the form that we need it. 
Let's take this step by step. First, we read the file into a list of lines:

In [58]:
with open('pydata-book-2nd-edition/examples/ex7.csv') as f:
    lines = list(csv.reader(f))

In [59]:
lines

[['a', 'b', 'c'], ['1', '2', '3'], ['1', '2', '3']]

Then we split the lines into the header line and the data lines:


In [60]:
header, values = lines[0], lines[1:]

In [61]:
header

['a', 'b', 'c']

In [62]:
values

[['1', '2', '3'], ['1', '2', '3']]

Then we can create a dictionary of data columns using a dictionary comprehension and the expression zip(*values), which transposes rows to columns:

In [63]:
data_dict = {h:v for h,v in zip(header, zip(*values))}

In [64]:
data_dict

{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
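From a dictionary of columns it is a short step to a DataFrame. The sketch below repeats the steps above on an in-memory copy of ex7.csv and then converts the string values to integers, since csv.reader performs no type conversion.

```python
import csv
import io

import pandas as pd

# Same contents as ex7.csv, reproduced in memory for illustration
raw = '"a","b","c"\n"1","2","3"\n"1","2","3"\n'

lines = list(csv.reader(io.StringIO(raw)))
header, values = lines[0], lines[1:]

# zip(*values) transposes the rows into columns
data_dict = {h: v for h, v in zip(header, zip(*values))}

# Every value is still a string at this point, so convert explicitly
frame = pd.DataFrame(data_dict).astype(int)
print(frame)
```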

To write delimited files manually, we can use csv.writer. It accepts an open, writable file object and the same dialect and format options as csv.reader:

In [65]:
# newline='' stops the csv module from emitting blank lines between rows on Windows
with open('mydata.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))

In [66]:
!type mydata.csv

one,two,three
1,2,3
4,5,6
7,8,9
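The dialect options mentioned above can be bundled into a subclass of csv.Dialect and shared between a writer and a reader. The following is a sketch only: the class name PipeDialect and the pipe delimiter are choices made for this example, and an in-memory buffer stands in for a file.

```python
import csv
import io


class PipeDialect(csv.Dialect):
    """Hypothetical dialect: pipe-delimited, quoting only where needed."""
    delimiter = "|"
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = "\n"
    quoting = csv.QUOTE_MINIMAL


buffer = io.StringIO()
writer = csv.writer(buffer, dialect=PipeDialect)
writer.writerow(("one", "two", "three"))
# The middle field contains the delimiter, so the writer quotes it
writer.writerow(("1", "2|x", "3"))

buffer.seek(0)
rows = list(csv.reader(buffer, dialect=PipeDialect))
print(buffer.getvalue())
print(rows)
```

Individual options such as delimiter can also be passed directly as keyword arguments to csv.reader and csv.writer without defining a class.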



<h3>JSON Data </h3>

JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more free-form data format than a tabular text form like CSV. Here is an example:

In [67]:
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings" : [{"name": "Scott", "age": 30, "pets":["Zeus", "Zuko"]},
{"name": "Katie", "age":38,
"pets":["Sixes", "Stache", "Cisco"]}]

}
"""

JSON is very nearly valid Python code, with the exception of its null value 'null' and some other nuances (such as disallowing trailing commas at the end of lists). The basic types are objects (dicts), arrays (lists), strings, numbers, booleans, and nulls. All of the keys in an object must be strings. There are several Python libraries for reading and writing JSON data. We'll use <b>json</b> here, as it is built into the Python standard library. To convert a JSON string to Python form, use json.loads:

In [68]:
import json

In [69]:
result = json.loads(obj)

In [70]:
result

{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

<b>json.dumps</b> on the other hand, converts a Python Object back to JSON:

In [71]:
asjson = json.dumps(result)

In [72]:
asjson

'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'

How we convert a JSON object or list of objects to a DataFrame or some other data structure for analysis will be up to us. Conveniently, we can pass a list of dicts to the DataFrame constructor and select a subset of the data fields:

In [73]:
siblings = pd.DataFrame(result['siblings'], columns = ['name', 'age'])

In [74]:
siblings

Unnamed: 0,name,age
0,Scott,30
1,Katie,38


The pandas.read_json function can automatically convert JSON datasets in specific arrangements into a Series or DataFrame. For example:

In [75]:
!type pydata-book-2nd-edition\\examples\\example.json

[{"a": 1, "b": 2, "c": 3},
 {"a": 4, "b": 5, "c": 6},
 {"a": 7, "b": 8, "c": 9}]


The default options for pandas.read_json assume that each object in the JSON array is a row in the table:

In [76]:
data = pd.read_json('pydata-book-2nd-edition/examples/example.json')

In [77]:
data

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9


If we need to export data from pandas to JSON, one way is to use the <b>to_json</b> method on Series and DataFrame:

In [78]:
print(data.to_json())

{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}


In [79]:
print(data.to_json(orient='records'))

[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
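The records layout can be read straight back with read_json. As a minimal sketch, the frame below is rebuilt by hand from the values shown above and the JSON text is wrapped in io.StringIO so that no file is needed.

```python
import io

import pandas as pd

# Same values as example.json above, built directly for illustration
data = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

# One JSON object per row
as_records = data.to_json(orient="records")

# Read the text back, stating the layout explicitly
restored = pd.read_json(io.StringIO(as_records), orient="records")
print(restored)
```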


<h3>XML and HTML: Web Scraping</h3>

Python has many libraries for reading and writing data in the ubiquitous HTML and XML formats. Examples include lxml, Beautiful Soup, and html5lib. While lxml is comparatively much faster in general, the other libraries can better handle malformed HTML or XML files.

pandas has a built-in function, <b>read_html</b>, which uses libraries like lxml and Beautiful Soup to automatically parse tables out of HTML files as DataFrame objects.

The pandas.read_html function has a number of options, but by default it searches for and attempts to parse all tabular data contained within &lt;table&gt; tags. The result is a list of DataFrame objects:

In [80]:
tables = pd.read_html('pydata-book-2nd-edition/examples/fdic_failed_bank_list.html')

In [81]:
tables

[                             Bank Name             City  ST   CERT  \
 0                          Allied Bank         Mulberry  AR     91   
 1         The Woodbury Banking Company         Woodbury  GA  11297   
 2               First CornerStone Bank  King of Prussia  PA  35312   
 3                   Trust Company Bank          Memphis  TN   9956   
 4           North Milwaukee State Bank        Milwaukee  WI  20364   
 ..                                 ...              ...  ..    ...   
 542                 Superior Bank, FSB         Hinsdale  IL  32646   
 543                Malta National Bank            Malta  OH   6629   
 544    First Alliance Bank & Trust Co.       Manchester  NH  34264   
 545  National State Bank of Metropolis       Metropolis  IL   3815   
 546                   Bank of Honolulu         Honolulu  HI  21029   
 
                    Acquiring Institution        Closing Date  \
 0                           Today's Bank  September 23, 2016   
 1              

In [82]:
len(tables)

1

In [83]:
failures = tables[0]

In [84]:
failures

Unnamed: 0,Bank Name,City,ST,CERT,Acquiring Institution,Closing Date,Updated Date
0,Allied Bank,Mulberry,AR,91,Today's Bank,"September 23, 2016","November 17, 2016"
1,The Woodbury Banking Company,Woodbury,GA,11297,United Bank,"August 19, 2016","November 17, 2016"
2,First CornerStone Bank,King of Prussia,PA,35312,First-Citizens Bank & Trust Company,"May 6, 2016","September 6, 2016"
3,Trust Company Bank,Memphis,TN,9956,The Bank of Fayette County,"April 29, 2016","September 6, 2016"
4,North Milwaukee State Bank,Milwaukee,WI,20364,First-Citizens Bank & Trust Company,"March 11, 2016","June 16, 2016"
...,...,...,...,...,...,...,...
542,"Superior Bank, FSB",Hinsdale,IL,32646,"Superior Federal, FSB","July 27, 2001","August 19, 2014"
543,Malta National Bank,Malta,OH,6629,North Valley Bank,"May 3, 2001","November 18, 2002"
544,First Alliance Bank & Trust Co.,Manchester,NH,34264,Southern New Hampshire Bank & Trust,"February 2, 2001","February 18, 2003"
545,National State Bank of Metropolis,Metropolis,IL,3815,Banterra Bank of Marion,"December 14, 2000","March 17, 2005"


In [85]:
failures.head()

Unnamed: 0,Bank Name,City,ST,CERT,Acquiring Institution,Closing Date,Updated Date
0,Allied Bank,Mulberry,AR,91,Today's Bank,"September 23, 2016","November 17, 2016"
1,The Woodbury Banking Company,Woodbury,GA,11297,United Bank,"August 19, 2016","November 17, 2016"
2,First CornerStone Bank,King of Prussia,PA,35312,First-Citizens Bank & Trust Company,"May 6, 2016","September 6, 2016"
3,Trust Company Bank,Memphis,TN,9956,The Bank of Fayette County,"April 29, 2016","September 6, 2016"
4,North Milwaukee State Bank,Milwaukee,WI,20364,First-Citizens Bank & Trust Company,"March 11, 2016","June 16, 2016"


Because failures has many columns, pandas wraps the plain-text output shown earlier and marks each continued line with a backslash (\).

In [86]:
close_timestamps = pd.to_datetime(failures['Closing Date'])

In [87]:
close_timestamps.dt.year.value_counts()

2010    157
2009    140
2011     92
2012     51
2008     25
       ... 
2004      4
2001      4
2007      3
2003      3
2000      2
Name: Closing Date, Length: 15, dtype: int64
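The same date handling can be tried on a handful of values without the HTML file. This sketch takes three closing dates from the table above and passes an explicit format string, which is an assumption about how the dates are written rather than something read_html provides.

```python
import pandas as pd

# Three closing dates copied from the failures table above
closing = pd.Series(["September 23, 2016", "August 19, 2016", "May 3, 2001"])

# %B is the full month name, %d the day, and %Y the four-digit year
timestamps = pd.to_datetime(closing, format="%B %d, %Y")

# Count how many closings fall in each year
counts = timestamps.dt.year.value_counts()
print(counts)
```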

<h3>Binary Data Formats</h3>

One of the easiest ways to store data (also known as serialization) efficiently in binary format is using Python's built-in pickle serialization. Pandas objects all have a <b>to_pickle</b> method that writes the data to disk in pickle format:

In [88]:
frame = pd.read_csv('pydata-book-2nd-edition/examples/ex1.csv')

In [89]:
frame

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [90]:
frame.to_pickle('frame_pickle')

We can read any "pickled" object stored in a file by using the built-in pickle module directly, or even more conveniently using pandas.read_pickle:

In [91]:
pd.read_pickle('frame_pickle')

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
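For comparison, here is a sketch of the other route mentioned above, loading the file with the built-in pickle module. The frame contents are made up for illustration and a temporary directory is used so that nothing is left on disk.

```python
import os
import pickle
import tempfile

import pandas as pd

# Small illustrative frame standing in for the one read from ex1.csv
frame = pd.DataFrame({"a": [1, 5, 9], "message": ["hello", "world", "foo"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame_pickle")
    frame.to_pickle(path)

    # The file is an ordinary pickle, so the standard library can load it
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(loaded)
```

As with any pickle, only load files from sources we trust, because unpickling can execute arbitrary code.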


<b>Note: </b> pickle is only recommended as a short-term storage format. The problem is that it is hard to guarantee that the format will be stable over time; an object pickled today may not unpickle with a later version of a library.

Some other storage formats for pandas or NumPy data include:
<pre><b>bcolz</b> - A compressible column-oriented binary format based on the Blosc compression library.
<b>Feather</b> - A cross-language column-oriented file format designed in collaboration with the R programming community. Feather uses the <b>Apache Arrow</b> columnar memory format.</pre>

<h3>Reading Microsoft Excel Files</h3>

pandas also supports reading tabular data stored in Excel 2003 (and higher) files using either the ExcelFile class or the pandas.read_excel function. Internally these tools use the add-on packages xlrd and openpyxl to read XLS and XLSX files, respectively. We may need to install these manually with pip or conda.

To use ExcelFile, create an instance by passing a path to an xls or xlsx file:

In [92]:
xlsx = pd.ExcelFile('pydata-book-2nd-edition/examples/ex1.xlsx', engine='openpyxl')

Data stored in a sheet can then be read into a DataFrame by passing the ExcelFile to pandas.read_excel (or by calling its parse method):

In [93]:
pd.read_excel(xlsx, 'Sheet1')

Unnamed: 0.1,Unnamed: 0,a,b,c,d,message
0,0,1,2,3,4,hello
1,1,5,6,7,8,world
2,2,9,10,11,12,foo


If we are reading multiple sheets in a file, then it is faster to create the ExcelFile, but we can also simply pass the filename to pandas.read_excel:

In [94]:
frame = pd.read_excel('pydata-book-2nd-edition/examples/ex1.xlsx', 'Sheet1', engine='openpyxl')

In [95]:
frame

Unnamed: 0.1,Unnamed: 0,a,b,c,d,message
0,0,1,2,3,4,hello
1,1,5,6,7,8,world
2,2,9,10,11,12,foo


To write pandas data to Excel format, we must first create an ExcelWriter, then write data to it using pandas objects' to_excel method:

In [96]:
writer = pd.ExcelWriter('ex2.xlsx')

In [97]:
frame.to_excel(writer, sheet_name='Sheet1')

In [98]:
writer.close()  # writes the file to disk; save() is not available in recent pandas releases

<h3>Interacting with Web APIs</h3>

Many websites have public APIs providing data feeds via JSON or some other format. There are a number of ways to access these APIs from Python; one easy-to-use method that I recommend is the requests package.

In [99]:
import requests



In [100]:
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'

In [101]:
response = requests.get(url)

In [102]:
response

<Response [200]>

In [103]:
response.status_code == 200

True

The Response object's json method will return the JSON parsed into native Python objects, which here is a list of dictionaries:

In [104]:
data = response.json()

In [105]:
data[0]['title']

'DOC: Update contributing.rst'

Each element in data is a dictionary containing all of the data found on a GitHub issue page (except for the comments). We can pass data directly to DataFrame and extract fields of interest:

In [106]:
issues = pd.DataFrame(data, columns = ['number', 'title', 'labels', 'state'])

In [107]:
issues

Unnamed: 0,number,title,labels,state
0,38938,DOC: Update contributing.rst,"[{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT...",open
1,38937,ENH: pd.read_excel with table parameter,"[{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg=...",open
2,38936,CLN: Consolidate raise_on_missing and on_versi...,"[{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg=...",open
3,38935,DOC: remove use of head() in the comparison docs,"[{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT...",open
4,38934,ENH: Improve numerical stability for groupby.m...,"[{'id': 233160, 'node_id': 'MDU6TGFiZWwyMzMxNj...",open
...,...,...,...,...
25,38896,API: setitem copy/view behavior ndarray vs Cat...,"[{'id': 1741841389, 'node_id': 'MDU6TGFiZWwxNz...",open
26,38895,ENH: Add numba engine to several rolling aggre...,"[{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg=...",open
27,38889,CLN/TST: Pyarrow CSV engine,"[{'id': 47229171, 'node_id': 'MDU6TGFiZWw0NzIy...",open
28,38886,CLN: add typing for dtype arg in core/arrays (...,"[{'id': 31404521, 'node_id': 'MDU6TGFiZWwzMTQw...",open


In [108]:
issues.head(n = 10)

Unnamed: 0,number,title,labels,state
0,38938,DOC: Update contributing.rst,"[{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT...",open
1,38937,ENH: pd.read_excel with table parameter,"[{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg=...",open
2,38936,CLN: Consolidate raise_on_missing and on_versi...,"[{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg=...",open
3,38935,DOC: remove use of head() in the comparison docs,"[{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT...",open
4,38934,ENH: Improve numerical stability for groupby.m...,"[{'id': 233160, 'node_id': 'MDU6TGFiZWwyMzMxNj...",open
5,38932,BUG: rank_2d raising with mixed dtypes,"[{'id': 31404521, 'node_id': 'MDU6TGFiZWwzMTQw...",open
6,38931,BUG: DataFrame.__setitem__ raising ValueError...,"[{'id': 2822098, 'node_id': 'MDU6TGFiZWwyODIyM...",open
7,38930,TST/REF: splitting pandas/io/parsers.py into m...,"[{'id': 211029535, 'node_id': 'MDU6TGFiZWwyMTE...",open
8,38927,TST: stricten xfails,"[{'id': 127685, 'node_id': 'MDU6TGFiZWwxMjc2OD...",open
9,38926,CLN: Unify number recognition tests in read_cs...,"[{'id': 211029535, 'node_id': 'MDU6TGFiZWwyMTE...",open


<h3>Interacting with Databases </h3>

In a business setting, most of the data may not be stored in text or Excel files. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative databases have become quite popular. The choice of database is usually dependent on the performance, data integrity, and scalability needs of an application.

Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions to simplify the process. As an example, we'll now create a SQLite database using Python's built-in sqlite3 driver:

In [109]:
import sqlite3

In [110]:
query = """
    CREATE TABLE IF NOT EXISTS test
    (a VARCHAR(20), b VARCHAR(20),
    c REAL, d INTEGER
    );
"""

In [111]:
con = sqlite3.connect('mydata.sqlite')

In [112]:
con.execute(query)

In [None]:
con.commit()

Then, insert a few rows of data:

In [None]:
data = [('Atlanta', 'Georgia', 1.23, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
       ('Sacramento', 'California', 1.7, 5)]

In [None]:
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"

In [None]:
con.executemany(stmt, data)

In [None]:
con.commit()

Most Python SQL drivers (PyODBC, psycopg2, MySQLdb, pymssql, etc.) return a list of tuples when selecting data from a table:

In [None]:
cursor = con.execute('select * from test')

In [None]:
rows = cursor.fetchall()

In [None]:
rows
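The list of tuples can be handed to the DataFrame constructor once we have the column names, which the cursor exposes through its description attribute. The sketch below repeats the steps above against an in-memory database (":memory:" is used here only so the example leaves no file behind) and also shows pandas.read_sql_query, which does the same work in one call.

```python
import sqlite3

import pandas as pd

# An in-memory database stands in for mydata.sqlite
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)"
)

data = [("Atlanta", "Georgia", 1.23, 6),
        ("Tallahassee", "Florida", 2.6, 3),
        ("Sacramento", "California", 1.7, 5)]
con.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", data)
con.commit()

cursor = con.execute("SELECT * FROM test")
rows = cursor.fetchall()

# The first item of each description entry is the column name
columns = [desc[0] for desc in cursor.description]
frame = pd.DataFrame(rows, columns=columns)

# pandas can also run the query and build the DataFrame itself
same = pd.read_sql_query("SELECT * FROM test", con)
con.close()

print(frame)
```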