# 6.1 Data loading, storage and file formats

## Reading and Writing Data in Text Format

pandas features a number of functions for reading tabular data as a DataFrame object.

Here are some of them:

Function|Description
---|---
read_csv | Load delimited data from a file, URL, or file-like object; uses a comma as the default delimiter
read_fwf | Read data in fixed-width column format (i.e., no delimiters)
read_clipboard | Version of read_csv that reads data from the clipboard (useful for converting tables from web pages)
read_excel | Read tabular data from an Excel XLS or XLSX file
read_hdf | Read HDF5 files written by pandas
read_json | Read data from a JSON string representation
read_msgpack | Read pandas data encoded using the MessagePack binary format (removed in pandas 1.0)
read_pickle | Read an arbitrary object stored in Python pickle format
read_sas | Read a SAS dataset stored in one of the SAS system's custom storage formats
read_sql | Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame
read_stata | Read a dataset from Stata file format
read_feather | Read the Feather binary file format

The optional arguments for these functions fall into a few categories:

- Indexing -- can treat one or more columns as the index of the returned DataFrame, and whether to get column names from the file, from the user, or not at all
- Type inference and data conversion -- includes user-defined value conversions and custom lists of missing-value markers
- Datetime parsing -- includes combining capability, such as combining date and time information spread over multiple columns into a single column in the result
- Unclean data issues -- skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas
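The categories above can be seen together in one call. This is a minimal sketch, not an example from the book: the inline data, the column names, and the `missing` marker are all assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical inline data standing in for a file on disk
raw = io.StringIO(
    "# exported by some tool\n"
    "date,time,city,sales,rating\n"
    '2000-01-01,09:30,Oslo,"1,250",4.5\n'
    '2000-01-02,10:15,Bergen,"2,000",missing\n'
)

df = pd.read_csv(
    raw,
    comment="#",            # messy data: skip comment lines
    thousands=",",          # messy data: read "1,250" as 1250
    na_values=["missing"],  # data conversion: extra missing-value marker
    index_col="city",       # indexing: use a column as the row index
)

# datetime parsing: combine the date and time columns after loading
df["timestamp"] = pd.to_datetime(df["date"] + " " + df["time"])
print(df)
```

Combining the date and time columns after loading keeps the sketch independent of the `parse_dates` options, whose behavior has changed between pandas versions.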

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('examples/ex1.csv')
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [3]:
# not all CSV files have a header row; header=None makes pandas assign default column names
df = pd.read_csv('examples/ex2.csv', header=None)
df

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [4]:
# or we can supply the column names ourselves
df = pd.read_csv('examples/ex2.csv', names=['a','b','c','d','message'])
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [5]:
names=['a','b','c','d','message']

In [6]:
# suppose you want the 'message' column to be the index; you can use index_col for this
pd.read_csv('examples/ex2.csv', names=names, index_col='message')

message,a,b,c,d
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


In [7]:
# in the event that you want to form a hierarchical index from multiple columns,
# pass a list of column numbers or names

parsed = pd.read_csv('examples/csv_mindex.csv', index_col=['key1','key2'])
parsed

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


CONTINUE ON PAGE 172

### Reading text files in pieces

In [8]:
# select how many rows you want to read
df = pd.read_csv('examples/ex3.csv', nrows=3)
df

Unnamed: 0,one,tow,three,four,five
0,2,5,7,3,7
1,3,5,6,3,6
2,7,3,7,3,3


In [9]:
# to read a file in pieces, specify a chunksize as a number of rows
chunker = pd.read_csv('examples/ex3.csv', chunksize=3)
chunker

<pandas.io.parsers.readers.TextFileReader at 0x7f78205f4b20>
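The `TextFileReader` returned here is an iterator that yields one DataFrame per chunk. A minimal sketch of consuming it, assuming hypothetical inline data with `key` and `value` columns rather than the contents of `examples/ex3.csv`:

```python
import io

import pandas as pd

# Hypothetical data standing in for a large file on disk
rows = [f"{'abc'[i % 3]},{i}" for i in range(10)]
raw = io.StringIO("key,value\n" + "\n".join(rows) + "\n")

chunk_sizes = []
totals = pd.Series(dtype="float64")
for piece in pd.read_csv(raw, chunksize=4):
    # each piece is an ordinary DataFrame with at most 4 rows
    chunk_sizes.append(len(piece))
    # accumulate per-key sums without holding the whole file in memory
    totals = totals.add(piece.groupby("key")["value"].sum(), fill_value=0)

print(chunk_sizes)
print(totals.sort_index())
```

Only one chunk is in memory at a time, so the same loop works for files that are much larger than the available RAM.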

### Writing data to text formats

In [10]:
# use the to_csv method to write the data out
df.to_csv('examples/01-out.csv')

In [11]:
# you can also pass sys.stdout instead of a path to print the result rather than write a file
import sys

In [12]:
df.to_csv(sys.stdout, sep='|') # prints the text result to console

|one|tow|three|four|five
0|2|5|7|3|7
1|3|5|6|3|6
2|7|3|7|3|3


In [13]:
# disable row and column labels
df.to_csv(sys.stdout, index=False, header=False)

2,5,7,3,7
3,5,6,3,6
7,3,7,3,3


In [14]:
# to only write a subset of columns 
df.to_csv(sys.stdout, index=False, columns=['three', 'five'])

three,five
7,7
6,6
7,3


In [15]:
# series also have a to_csv method
dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7), index=dates)
ts.to_csv('examples/tseries.csv')
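Reading such a series back takes two extra arguments, since the dates are stored as plain text in the first column. A minimal sketch, assuming an in-memory buffer in place of `examples/tseries.csv` and a hypothetical series name `value`:

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range("1/1/2000", periods=7)
ts = pd.Series(np.arange(7), index=dates, name="value")

buffer = io.StringIO()  # in-memory stand-in for a file on disk
ts.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the index; parse_dates=True converts it back to datetimes
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)["value"]
print(restored)
```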

### Working with delimited formats

LOOKUP IN BOOK ON PAGE 178
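For files with a single-character delimiter, Python's built-in `csv` module is one option. The sketch below is a minimal illustration; the semicolon-delimited sample text and the variable names are assumptions, not data from the book's examples.

```python
import csv
import io

import pandas as pd

# Hypothetical semicolon-delimited text standing in for a file on disk
raw = io.StringIO('a;b;c\n1;2;3\n4;"5;5";6\n')

reader = csv.reader(raw, delimiter=";", quotechar='"')
header, *rows = list(reader)

# transpose the rows into a dict of columns, then build a DataFrame
columns = {name: list(col) for name, col in zip(header, zip(*rows))}
df = pd.DataFrame(columns)
print(df)
```

Note that `csv.reader` returns every field as a string, so numeric columns need an explicit conversion afterwards.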

### JSON data

JSON is very nearly valid Python code, with the exception of its null value ```null``` and some other nuances.

In [16]:
import json

In [17]:
obj = """
{"name": "Wes",
"places_lived": ["USA", "Spain", "Norway"],
"pet": null,
"siblings": [
    {"name": "Scott", "age": 30, "pets":["Zeus","Zuko"]},
    {"name": "Stine", "age": 38, "pets":["Mille","Milo","Grumpy"]}
    ]
}
"""

In [18]:
# convert a JSON string to Python objects
result = json.loads(obj)
result

{'name': 'Wes',
 'places_lived': ['USA', 'Spain', 'Norway'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Stine', 'age': 38, 'pets': ['Mille', 'Milo', 'Grumpy']}]}

In [19]:
# to convert a Python object to JSON
asjson = json.dumps(result)
asjson

'{"name": "Wes", "places_lived": ["USA", "Spain", "Norway"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Stine", "age": 38, "pets": ["Mille", "Milo", "Grumpy"]}]}'

How you convert JSON to a DataFrame is up to you. You can pass a list of dicts to the DataFrame constructor and select a subset of the data fields:

In [20]:
siblings = pd.DataFrame(result['siblings'], columns=['name','age'])
siblings

Unnamed: 0,name,age
0,Scott,30
1,Stine,38


But ```pandas.read_json``` can automatically convert JSON datasets in specific arrangements into a Series or DataFrame. For example:

In [21]:
data = pd.read_json('examples/example.json')
data

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9


In [22]:
# if you need to export from pandas to JSON you can use to_json method on Series and DataFrame

print(data.to_json())

{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}


In [23]:
print(data.to_json(orient='records'))

[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
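The `orient` argument also matters when reading the data back. A minimal round-trip sketch, assuming a small DataFrame built inline instead of `examples/example.json`:

```python
import io

import pandas as pd

data = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

as_records = data.to_json(orient="records")

# wrap the string in StringIO: newer pandas versions warn when read_json
# is given a literal JSON string instead of a path or file-like object
restored = pd.read_json(io.StringIO(as_records), orient="records")
print(restored)
```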


### XML and HTML: Web Scraping

You also need the bs4 and/or html5lib libraries installed.

The ```pandas.read_html``` function has a number of options, but by default it searches for and attempts to parse all tabular data contained within ```<table>``` tags. The result is a list of DataFrame objects.

In [24]:
from bs4 import BeautifulSoup
import html5lib

In [27]:
tables = pd.read_html('examples/a-webpage.html')
tables

[                        Company           Contact  Country
 0           Alfreds Futterkiste      Maria Anders  Germany
 1    Centro comercial Moctezuma   Francisco Chang   Mexico
 2                  Ernst Handel     Roland Mendel  Austria
 3                Island Trading     Helen Bennett       UK
 4  Laughing Bacchus Winecellars   Yoshi Tannamuri   Canada
 5  Magazzini Alimentari Riuniti  Giovanni Rovelli    Italy]

FOR XML READ PAGE 183

## Binary data formats

One of the easiest ways to store data efficiently in binary format (also known as serialization) is Python's built-in pickle serialization.

pandas objects all have a ```to_pickle``` method that writes data to disk in pickle format.

In [28]:
frame = pd.read_csv('examples/ex1.csv')
frame

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [29]:
# to write data to pickle
frame.to_pickle('examples/frame_pickle')

In [30]:
# to read pickle data
pd.read_pickle('examples/frame_pickle')

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


> PS! pickle is only recommended as a short-term storage format, since it is hard to guarantee that the format stays stable across library versions.

### Using HDF5 Format

HDF5 (Hierarchical Data Format 5) is a well-regarded file format intended for storing large quantities of scientific array data.
- HDF5 supports on-the-fly compression with a variety of compression modes
- It can be a good solution for very large datasets that do not fit into memory.

In [31]:
frame = pd.DataFrame({'a': np.random.randn(100)})

In [32]:
store = pd.HDFStore('mydata.h5') # something went wrong: PyTables is not installed (the pip package is named tables)

ImportError: Missing optional dependency 'pytables'.  Use pip or conda to install pytables.

### Reading Microsoft Excel Files

In [36]:
import openpyxl

In [38]:
xlsx = pd.ExcelFile('examples/ex1.xlsx')

In [39]:
# data stored in a sheet can then be read into a DataFrame with pd.read_excel (or xlsx.parse)
pd.read_excel(xlsx, 'Sheet1')

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,5,6,7,8
2,9,10,11,12


In [41]:
# to write pandas data to Excel format, first create an ExcelWriter,
# then pass it to the to_excel method

writer = pd.ExcelWriter('examples/ex2.xlsx')
frame.to_excel(writer, sheet_name='Sheet1')
writer.close()  # writer.save() was removed in pandas 2.0


> you can also pass a file path to to_excel and avoid the ExcelWriter

```frame.to_excel('examples/ex2.xlsx')```

## Interacting with WEB APIs

In [43]:
import requests

In [46]:
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'

In [47]:
res = requests.get(url)
res

<Response [200]>

In [48]:
# res.json() parses the JSON body into native Python objects (here, a list of dicts)

data = res.json()
data[0]['title']

'BUG: Unexpected results when adding offsets to periods stored in series.'

In [49]:
# note: 'lables' is a misspelling of the 'labels' key, so that column comes back empty
issues = pd.DataFrame(data, columns=['number', 'title', 'lables', 'state'])
issues

Unnamed: 0,number,title,lables,state
0,47883,BUG: Unexpected results when adding offsets to...,,open
1,47882,BUG: FutureWarning for timezone-naive date tim...,,open
2,47881,BUG: fix Dataframe.join with categorical index...,,open
3,47880,ENH: parse 8 or 9 digit delimited dates,,open
4,47879,PERF: preserve Index._id through pickle round-...,,open
5,47878,PERF: MultiIndex.copy(deep=False) not preservi...,,open
6,47877,DOC: Additions/updates to documentation-GH46359,,open
7,47874,BUG: `to_sql` string to date and/or time conve...,,open
8,47872,ENH: Allow different `dtype` in `pandas.Series...,,open
9,47871,BUG: to_csv requires escapechar unnecessarily ...,,open


## Interacting with databases

First create a database using sqlite3

In [50]:
import sqlite3

In [51]:
query = """ 
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
);
"""

In [52]:
con = sqlite3.connect('mydata.sqlite')

In [53]:
con.execute(query)

<sqlite3.Cursor at 0x7f77d512b420>

In [54]:
con.commit()

Add some data

In [55]:
data = [('Atlanta', 'Ohio', 1.25, 6),('Utah', 'New York', 4.5, 3),('Boston', 'Florida', 1.7, 5),]

In [56]:
stmt = "INSERT INTO test VALUES(?,?,?,?)"

In [57]:
con.executemany(stmt, data)

<sqlite3.Cursor at 0x7f77d51b3c70>

In [58]:
# Then select data
# most connectors return a list of tuples when selecting data

cursor = con.execute('select * from test')

In [59]:
rows = cursor.fetchall()
rows

[('Atlanta', 'Ohio', 1.25, 6),
 ('Utah', 'New York', 4.5, 3),
 ('Boston', 'Florida', 1.7, 5)]

In [60]:
# you can pass the list of tuples to the DataFrame constructor,
# but you also need the column names, contained in the cursor's description attribute

cursor.description

(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [61]:
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])

Unnamed: 0,a,b,c,d
0,Atlanta,Ohio,1.25,6
1,Utah,New York,4.5,3
2,Boston,Florida,1.7,5


This is quite a bit of munging that you'd rather not repeat each time you query the database.

pandas has a ```read_sql``` function that enables you to read data easily from a general SQLAlchemy connection.

In [62]:
import sqlalchemy as sqla

In [64]:
db = sqla.create_engine('sqlite:///mydata.sqlite')

In [66]:
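# note: no rows come back here because the INSERT above was never committed on con;
# call con.commit() after executemany so that other connections can see the data
pd.read_sql('select * from test', db)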

Unnamed: 0,a,b,c,d
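
`read_sql` also accepts a plain `sqlite3` connection, so SQLAlchemy is not strictly required for SQLite. A minimal sketch, assuming an in-memory database rather than the `mydata.sqlite` file and reusing two of the rows inserted earlier:

```python
import sqlite3

import pandas as pd

# in-memory database, so the sketch leaves no file behind
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")

rows = [("Atlanta", "Ohio", 1.25, 6), ("Utah", "New York", 4.5, 3)]
con.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", rows)
con.commit()  # persist the inserts and make them visible to other connections

# column names are taken from the query result automatically
result = pd.read_sql("SELECT * FROM test", con)
print(result)
con.close()
```

With a file-backed database the `commit()` call matters: without it, a second connection to the same file does not see the inserted rows.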
