# Reading and writing data across multiple formats

A practical introduction.

In [20]:
import pandas as pd

*import pandas as pd* is a widely used convention.

Our data directory contains some files from the Blooth store. Let's import one!

In [21]:
!pwd

/Users/hendorf/code/pandas-pydata-berlin-2017/notebooks


In [22]:
sales_data = pd.read_csv('../data/blooth_sales_data.csv')

### Let's explore our data set

In [51]:
#pd.set_option('display.max_rows', 10)
sales_data

,name,birthday,customer,orderdate,product,units,unitprice
0,Mohamed,1978-03-15,Federated North Corporation,2016-07-05 10:33:46.009401,banana,10,10.00
1,Kim,1986-01-05,Star Application Agency,2016-07-15 10:33:46.009447,Rubik’s Cube,5,16.44
2,Chester,1967-05-24,Hardware Industries,2016-07-25 10:33:46.009479,iPad,48,865.62
3,Boyce,1980-11-12,Bell Building Co,2016-07-06 10:33:46.009507,Harry Potter book,44,18.54
4,Wilma,1953-12-24,Net Electronic Consulting Limited,2016-07-09 10:33:46.009537,iPad,40,772.63
...,...,...,...,...,...,...,...
995,Tory,1969-11-23,Venture Limited,2016-07-10 10:33:46.036736,Lipitor,49,10.78
996,Randolph,1987-01-09,Hill Omega Data Industries,2016-07-23 10:33:46.036760,iPad,43,404.96
997,Lawrence,1979-12-11,Internet Innovation Provider Inc,2016-07-09 10:33:46.036783,Lipitor,9,10.90
998,Ward,1980-12-04,Solutions Group,2016-07-18 10:33:46.036807,Corolla,28,21155.09


In [52]:
pd.reset_option('display.max_rows')
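
If we only want to shorten the display for a single cell, a context manager saves us the explicit reset. A minimal sketch, assuming a small made-up frame (`df` is illustrative, not the Blooth data):

```python
import pandas as pd

# Illustrative frame with more rows than we want to print (hypothetical data).
df = pd.DataFrame({"units": range(100)})

default_max_rows = pd.get_option("display.max_rows")

# The option is changed only inside the with-block.
with pd.option_context("display.max_rows", 10):
    inside = pd.get_option("display.max_rows")
    truncated_repr = repr(df)

# Outside the block the previous setting is back in place.
after = pd.get_option("display.max_rows")
```

This does the same job as `set_option` followed by `reset_option`, but the original setting is restored even if the cell raises an error.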

In [24]:
type(sales_data)

pandas.core.frame.DataFrame

In [25]:
len(sales_data)

1000

In [26]:
sales_data.head(5)

,name,birthday,customer,orderdate,product,units,unitprice
0,Mohamed,1978-03-15,Federated North Corporation,2016-07-05 10:33:46.009401,banana,10,10.0
1,Kim,1986-01-05,Star Application Agency,2016-07-15 10:33:46.009447,Rubik’s Cube,5,16.44
2,Chester,1967-05-24,Hardware Industries,2016-07-25 10:33:46.009479,iPad,48,865.62
3,Boyce,1980-11-12,Bell Building Co,2016-07-06 10:33:46.009507,Harry Potter book,44,18.54
4,Wilma,1953-12-24,Net Electronic Consulting Limited,2016-07-09 10:33:46.009537,iPad,40,772.63


In [27]:
sales_data.tail(5)

,name,birthday,customer,orderdate,product,units,unitprice
995,Tory,1969-11-23,Venture Limited,2016-07-10 10:33:46.036736,Lipitor,49,10.78
996,Randolph,1987-01-09,Hill Omega Data Industries,2016-07-23 10:33:46.036760,iPad,43,404.96
997,Lawrence,1979-12-11,Internet Innovation Provider Inc,2016-07-09 10:33:46.036783,Lipitor,9,10.9
998,Ward,1980-12-04,Solutions Group,2016-07-18 10:33:46.036807,Corolla,28,21155.09
999,Jonnie,1981-03-02,Design Hill Corporation,2016-07-03 10:33:46.036831,iPad,42,570.57


In [28]:
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
name         1000 non-null object
birthday     1000 non-null object
customer     1000 non-null object
orderdate    1000 non-null object
product      1000 non-null object
units        1000 non-null int64
unitprice    1000 non-null float64
dtypes: float64(1), int64(1), object(5)
memory usage: 54.8+ KB


Note: the date and datetime columns are still string objects.

In [29]:
sales_data = pd.read_csv('../data/blooth_sales_data.csv',
                         parse_dates=['birthday', 'orderdate']
                        )
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
name         1000 non-null object
birthday     1000 non-null datetime64[ns]
customer     1000 non-null object
orderdate    1000 non-null datetime64[ns]
product      1000 non-null object
units        1000 non-null int64
unitprice    1000 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(1), object(3)
memory usage: 54.8+ KB


In [30]:
sales_data.head(5)

,name,birthday,customer,orderdate,product,units,unitprice
0,Mohamed,1978-03-15,Federated North Corporation,2016-07-05 10:33:46.009401,banana,10,10.0
1,Kim,1986-01-05,Star Application Agency,2016-07-15 10:33:46.009447,Rubik’s Cube,5,16.44
2,Chester,1967-05-24,Hardware Industries,2016-07-25 10:33:46.009479,iPad,48,865.62
3,Boyce,1980-11-12,Bell Building Co,2016-07-06 10:33:46.009507,Harry Potter book,44,18.54
4,Wilma,1953-12-24,Net Electronic Consulting Limited,2016-07-09 10:33:46.009537,iPad,40,772.63
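
Why bother parsing the dates at all? Once a column is `datetime64`, we can pick it apart and calculate with it. A small self-contained sketch, assuming two rows in the layout of the sales file (the in-memory `csv_text` stands in for the real file):

```python
import io
import pandas as pd

# Two rows in the layout of the sales file (values copied from the output above).
csv_text = (
    "name,birthday,orderdate,units\n"
    "Mohamed,1978-03-15,2016-07-05 10:33:46.009401,10\n"
    "Kim,1986-01-05,2016-07-15 10:33:46.009447,5\n"
)

sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["birthday", "orderdate"])

# Datetime columns expose their components through the .dt accessor ...
order_weekday = sample["orderdate"].dt.day_name()
birth_year = sample["birthday"].dt.year

# ... and support arithmetic: subtracting two datetime columns gives timedeltas.
age_in_days = (sample["orderdate"] - sample["birthday"]).dt.days
```

None of this works on plain string columns, which is why we parse the dates while reading.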


The automatic date parser is US-friendly by default: it reads ambiguous dates month first (MM/DD/YYYY). Add *dayfirst=True* for international and European formats (DD/MM/YYYY).

In [31]:
sales_data = pd.read_csv('../data/blooth_sales_data.csv',
                         parse_dates=['birthday', 'orderdate'],
                         dayfirst=True
                        )
sales_data.head(5)

,name,birthday,customer,orderdate,product,units,unitprice
0,Mohamed,1978-03-15,Federated North Corporation,2016-07-05 10:33:46.009401,banana,10,10.0
1,Kim,1986-01-05,Star Application Agency,2016-07-15 10:33:46.009447,Rubik’s Cube,5,16.44
2,Chester,1967-05-24,Hardware Industries,2016-07-25 10:33:46.009479,iPad,48,865.62
3,Boyce,1980-11-12,Bell Building Co,2016-07-06 10:33:46.009507,Harry Potter book,44,18.54
4,Wilma,1953-12-24,Net Electronic Consulting Limited,2016-07-09 10:33:46.009537,iPad,40,772.63
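
The output above is unchanged because our file stores its dates year first (YYYY-MM-DD). The effect of *dayfirst* only shows up with an ambiguous date. A minimal sketch using `pd.to_datetime` on a made-up string (`ambiguous` is an illustrative value, not taken from the file):

```python
import pandas as pd

# An ambiguous date: 2 January or 1 February?
ambiguous = "01/02/2016"

month_first = pd.to_datetime(ambiguous)               # default: 2016-01-02
day_first = pd.to_datetime(ambiguous, dayfirst=True)  # day first: 2016-02-01
```

When the format of a file is known in advance, passing an explicit format string is probably the safer choice, since it removes the guessing altogether.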


### Other formats

In [32]:
sales_data_json = pd.read_json('../data/blooth_sales_data.json')
sales_data_json.head(5)

,birthday,customer,name,orderdate,product,unitprice,units
0,1978-03-15,Federated North Corporation,Mohamed,2016-07-05 10:33:46.009401,banana,10.0,10
1,1986-01-05,Star Application Agency,Kim,2016-07-15 10:33:46.009447,Rubik’s Cube,16.44,5
2,1967-05-24,Hardware Industries,Chester,2016-07-25 10:33:46.009479,iPad,865.62,48
3,1980-11-12,Bell Building Co,Boyce,2016-07-06 10:33:46.009507,Harry Potter book,18.54,44
4,1953-12-24,Net Electronic Consulting Limited,Wilma,2016-07-09 10:33:46.009537,iPad,772.63,40


In [33]:
sales_data_json = pd.read_json('../data/blooth_sales_data.json',
                              )
sales_data_json.head(5)

,birthday,customer,name,orderdate,product,unitprice,units
0,1978-03-15,Federated North Corporation,Mohamed,2016-07-05 10:33:46.009401,banana,10.0,10
1,1986-01-05,Star Application Agency,Kim,2016-07-15 10:33:46.009447,Rubik’s Cube,16.44,5
2,1967-05-24,Hardware Industries,Chester,2016-07-25 10:33:46.009479,iPad,865.62,48
3,1980-11-12,Bell Building Co,Boyce,2016-07-06 10:33:46.009507,Harry Potter book,18.54,44
4,1953-12-24,Net Electronic Consulting Limited,Wilma,2016-07-09 10:33:46.009537,iPad,772.63,40


In [34]:
sales_data_json.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 7 columns):
birthday     1000 non-null object
customer     1000 non-null object
name         1000 non-null object
orderdate    1000 non-null object
product      1000 non-null object
unitprice    1000 non-null float64
units        1000 non-null int64
dtypes: float64(1), int64(1), object(5)
memory usage: 62.5+ KB


Our JSON does not store its datetimes in the expected format!

They should be ISO 8601, "YYYY-MM-DDTHH:MM:SS.NNNZ", e.g. 2012-04-23T18:25:43.511Z
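
Before rewriting the file, it may be worth trying the `convert_dates` parameter of `read_json`. As far as I know, `read_json` only attempts date conversion by default on columns whose labels look date-like (for example ending in `_at` or `_time`, or named `date`), so `birthday` and `orderdate` would be skipped. A sketch under that assumption, using an in-memory copy of two records instead of the real file:

```python
import io
import pandas as pd

# Two records in the layout of the JSON file (values copied from the output above).
json_text = """[
  {"name": "Mohamed", "birthday": "1978-03-15",
   "orderdate": "2016-07-05 10:33:46.009401", "units": 10},
  {"name": "Kim", "birthday": "1986-01-05",
   "orderdate": "2016-07-15 10:33:46.009447", "units": 5}
]"""

# Name the columns that should be parsed as dates explicitly.
sample_json = pd.read_json(
    io.StringIO(json_text),
    orient="records",
    convert_dates=["birthday", "orderdate"],
)
```

If this works on the real file, the manual conversion below becomes optional, but it is still a useful look at what happens under the hood.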

In [40]:
import json
with open('../data/blooth_sales_data.json', 'r') as f:
    _json = json.load(f)
_json[0]

{'birthday': '1978-03-15',
 'customer': 'Federated North Corporation',
 'name': 'Mohamed',
 'orderdate': '2016-07-05 10:33:46.009401',
 'product': 'banana',
 'unitprice': 10.0,
 'units': 10}

In [41]:
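import datetime
for j in _json:
    # Rewrite orderdate as an ISO 8601 string (this drops the microseconds)
    j['orderdate'] = (datetime.datetime
                      .strptime(j['orderdate'], "%Y-%m-%d %H:%M:%S.%f")
                      .strftime("%Y-%m-%dT%H:%M:%SZ"))
    # Convert birthday into a datetime object
    j['birthday'] = datetime.datetime.strptime(j['birthday'], "%Y-%m-%d")
_json[0]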

{'birthday': datetime.datetime(1978, 3, 15, 0, 0),
 'customer': 'Federated North Corporation',
 'name': 'Mohamed',
 'orderdate': '2016-07-05T10:33:46Z',
 'product': 'banana',
 'unitprice': 10.0,
 'units': 10}

In [42]:
sales_data_from_dict = pd.DataFrame(_json)
sales_data_from_dict.head(5)

,birthday,customer,name,orderdate,product,unitprice,units
0,1978-03-15,Federated North Corporation,Mohamed,2016-07-05T10:33:46Z,banana,10.0,10
1,1986-01-05,Star Application Agency,Kim,2016-07-15T10:33:46Z,Rubik’s Cube,16.44,5
2,1967-05-24,Hardware Industries,Chester,2016-07-25T10:33:46Z,iPad,865.62,48
3,1980-11-12,Bell Building Co,Boyce,2016-07-06T10:33:46Z,Harry Potter book,18.54,44
4,1953-12-24,Net Electronic Consulting Limited,Wilma,2016-07-09T10:33:46Z,iPad,772.63,40


In [43]:
sales_data_from_dict.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
birthday     1000 non-null datetime64[ns]
customer     1000 non-null object
name         1000 non-null object
orderdate    1000 non-null object
product      1000 non-null object
unitprice    1000 non-null float64
units        1000 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 54.8+ KB
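
Note what happened: `birthday` went in as datetime objects and came out as `datetime64[ns]`, while `orderdate` went in as ISO 8601 *strings* and stayed `object`. The `DataFrame` constructor does not parse strings, however well formatted. A minimal sketch of converting such a column afterwards with `pd.to_datetime` (the two `records` are illustrative copies of the rows above):

```python
import pandas as pd

# Records with orderdate as ISO 8601 strings, as produced by the loop above.
records = [
    {"name": "Mohamed", "orderdate": "2016-07-05T10:33:46Z"},
    {"name": "Kim", "orderdate": "2016-07-15T10:33:46Z"},
]
df = pd.DataFrame(records)
dtype_before = df["orderdate"].dtype  # still a string column

# Vectorized conversion of the whole column.
df["orderdate"] = pd.to_datetime(df["orderdate"])
```

Because of the trailing `Z`, the resulting column is timezone-aware (UTC), unlike the naive timestamps we get from `strptime`.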


In [45]:
with open('../data/blooth_sales_data.json', 'r') as f:
    _json = json.load(f)
# This time convert both columns to datetime objects
for j in _json:
    j['orderdate'] = datetime.datetime.strptime(j['orderdate'], "%Y-%m-%d %H:%M:%S.%f")
    j['birthday'] = datetime.datetime.strptime(j['birthday'], "%Y-%m-%d")
sales_data_from_dict = pd.DataFrame(_json)
sales_data_from_dict.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
birthday     1000 non-null datetime64[ns]
customer     1000 non-null object
name         1000 non-null object
orderdate    1000 non-null datetime64[ns]
product      1000 non-null object
unitprice    1000 non-null float64
units        1000 non-null int64
dtypes: datetime64[ns](2), float64(1), int64(1), object(3)
memory usage: 54.8+ KB
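
So far we have only read data. Writing works the same way in reverse, via the `to_...` methods. A minimal round-trip sketch, assuming a small made-up frame and in-memory strings instead of files (pass a path as first argument to write to disk):

```python
import io
import pandas as pd

# Small illustrative frame (hypothetical subset of the sales columns).
df = pd.DataFrame({
    "name": ["Mohamed", "Kim"],
    "orderdate": pd.to_datetime(["2016-07-05 10:33:46", "2016-07-15 10:33:46"]),
    "units": [10, 5],
})

# Without a path, to_csv returns the CSV as a string.
# index=False keeps the row index out of the file.
csv_text = df.to_csv(index=False)

# date_format="iso" writes datetimes as ISO 8601 strings instead of epoch numbers.
json_text = df.to_json(orient="records", date_format="iso")

# Read both back, telling pandas which column holds dates.
csv_back = pd.read_csv(io.StringIO(csv_text), parse_dates=["orderdate"])
json_back = pd.read_json(
    io.StringIO(json_text), orient="records", convert_dates=["orderdate"]
)
```

Leaving out `index=False` writes the index as a first column without a header, which `read_csv` then loads as a column named `Unnamed: 0`.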


### Summary

* reading and writing data is simple with pandas
* imports are very customizable
* pandas handles a lot by default, e.g.:
    * header
    * datatype (int/float, but not datetime)
    * skipping blank lines
    * …
* data cleansing (like the datetime example above) can also be done in pandas directly; we'll see that later.
* caveats:
    * integer columns with NaN values become floats
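
The NaN caveat is easy to reproduce. A small sketch, assuming a made-up CSV with one missing `units` value; the nullable `Int64` dtype shown as a workaround is one option among several:

```python
import io
import pandas as pd

# Kim's units value is missing (illustrative data).
csv_text = "name,units\nMohamed,10\nKim,\nChester,48\n"

# Default behaviour: the missing value forces the column to float64.
df = pd.read_csv(io.StringIO(csv_text))
default_dtype = df["units"].dtype

# Requesting the nullable integer dtype keeps whole numbers and marks the gap as <NA>.
nullable = pd.read_csv(io.StringIO(csv_text), dtype={"units": "Int64"})
nullable_dtype = nullable["units"].dtype
```

Note the capital `I` in `"Int64"`: lowercase `"int64"` is the plain NumPy dtype, which cannot hold missing values.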
   