# Reading and writing data across multiple formats

A practical introduction to Pandas.

In [31]:
import pandas as pd

*import pandas as pd* is a widely used convention.

In our data directory we have some files from the *Blooth store*. *Let's import one*!

In [32]:
!pwd

/Users/hendorf/code/pandas-pydata-berlin-2017/notebooks


In [33]:
!ls ../data

blooth_sales_data.csv    blooth_sales_data.json   createFakeData.py
blooth_sales_data.html   blooth_sales_data.xlsx   ~$blooth_sales_data.xlsx


In [34]:
sales_data = pd.read_csv('../data/blooth_sales_data.csv')

### Let's explore our data set

In [35]:
pd.set_option('display.max_rows', 10)  # change presets for data preview
sales_data

Unnamed: 0,name,birthday,customer,orderdate,product,units,unitprice
0,Pasquale,1967-09-02,Electronics Inc,2016-07-17 13:48:03.156566,Thriller record,2,13.27
1,India,1968-12-13,Electronics Resource Group,2016-07-06 13:48:03.156596,Corolla,26,24458.69
2,Wayne,1992-09-10,East Application Contract Inc,2016-07-22 13:48:03.156618,Rubik’s Cube,41,15.79
3,Cori,1986-11-05,Signal Industries,2016-07-23 13:48:03.156638,iPhone,16,584.01
4,Chang,1972-04-23,Star Alpha Industries,2016-07-16 13:48:03.156657,Harry Potter book,4,25.69
...,...,...,...,...,...,...,...
995,Ethan,1952-12-08,Application Industries,2016-07-21 13:48:03.177885,Harry Potter book,39,24.40
996,Rudolph,1959-10-15,Network Software West Inc,2016-07-19 13:48:03.177903,Rubik’s Cube,9,15.11
997,Annmarie,1982-06-04,Atlantic Corporation,2016-07-13 13:48:03.177924,Thriller record,19,9.16
998,Chang,1984-02-05,Venture Alpha Corporation,2016-07-13 13:48:03.177943,Harry Potter book,24,28.21


In [36]:
pd.reset_option('display.max_rows')

#### Let's see what we have got now

In [37]:
type(sales_data)

pandas.core.frame.DataFrame

In [38]:
len(sales_data)

1000

#### Inspect your DataFrame with pandas methods

In [39]:
sales_data.head(5)

Unnamed: 0,name,birthday,customer,orderdate,product,units,unitprice
0,Pasquale,1967-09-02,Electronics Inc,2016-07-17 13:48:03.156566,Thriller record,2,13.27
1,India,1968-12-13,Electronics Resource Group,2016-07-06 13:48:03.156596,Corolla,26,24458.69
2,Wayne,1992-09-10,East Application Contract Inc,2016-07-22 13:48:03.156618,Rubik’s Cube,41,15.79
3,Cori,1986-11-05,Signal Industries,2016-07-23 13:48:03.156638,iPhone,16,584.01
4,Chang,1972-04-23,Star Alpha Industries,2016-07-16 13:48:03.156657,Harry Potter book,4,25.69


In [40]:
sales_data.tail(5)

Unnamed: 0,name,birthday,customer,orderdate,product,units,unitprice
995,Ethan,1952-12-08,Application Industries,2016-07-21 13:48:03.177885,Harry Potter book,39,24.4
996,Rudolph,1959-10-15,Network Software West Inc,2016-07-19 13:48:03.177903,Rubik’s Cube,9,15.11
997,Annmarie,1982-06-04,Atlantic Corporation,2016-07-13 13:48:03.177924,Thriller record,19,9.16
998,Chang,1984-02-05,Venture Alpha Corporation,2016-07-13 13:48:03.177943,Harry Potter book,24,28.21
999,Ervin,1977-10-14,Provider Agency,2016-07-09 13:48:03.177962,iPhone,39,663.83


In [41]:
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
name         1000 non-null object
birthday     1000 non-null object
customer     1000 non-null object
orderdate    1000 non-null object
product      1000 non-null object
units        1000 non-null int64
unitprice    1000 non-null float64
dtypes: float64(1), int64(1), object(5)
memory usage: 54.8+ KB


**Note: floats and ints were detected automatically, but the date(time) columns are still string objects.**

In [42]:
sales_data = pd.read_csv('../data/blooth_sales_data.csv',
                         parse_dates=['birthday', 'orderdate']
                        )
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
name         1000 non-null object
birthday     1000 non-null datetime64[ns]
customer     1000 non-null object
orderdate    1000 non-null datetime64[ns]
product      1000 non-null object
units        1000 non-null int64
unitprice    1000 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(1), object(3)
memory usage: 54.8+ KB


In [43]:
sales_data.head(5)

Unnamed: 0,name,birthday,customer,orderdate,product,units,unitprice
0,Pasquale,1967-09-02,Electronics Inc,2016-07-17 13:48:03.156566,Thriller record,2,13.27
1,India,1968-12-13,Electronics Resource Group,2016-07-06 13:48:03.156596,Corolla,26,24458.69
2,Wayne,1992-09-10,East Application Contract Inc,2016-07-22 13:48:03.156618,Rubik’s Cube,41,15.79
3,Cori,1986-11-05,Signal Industries,2016-07-23 13:48:03.156638,iPhone,16,584.01
4,Chang,1972-04-23,Star Alpha Industries,2016-07-16 13:48:03.156657,Harry Potter book,4,25.69
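
If the file is already loaded, the string columns can also be converted afterwards. A minimal sketch, assuming a small in-memory sample (`csv_text` is made up from the first two rows shown above, it is not the real file) and using `pd.to_datetime`:

```python
import io

import pandas as pd

# Hypothetical two-row sample mirroring some columns of blooth_sales_data.csv.
csv_text = """name,birthday,orderdate,units,unitprice
Pasquale,1967-09-02,2016-07-17 13:48:03.156566,2,13.27
India,1968-12-13,2016-07-06 13:48:03.156596,26,24458.69
"""

# Without parse_dates, the two date columns arrive as plain strings.
sample = pd.read_csv(io.StringIO(csv_text))

# Convert the string columns to datetimes after loading.
for column in ["birthday", "orderdate"]:
    sample[column] = pd.to_datetime(sample[column])

print(sample.dtypes)
```

This gives the same column types as *parse_dates*, but lets you decide per column and after a first look at the data.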


The automatic date parser is US-friendly by default: for ambiguous dates it assumes month first (MM/DD/YYYY). Add *dayfirst=True* for international and European formats (DD/MM/YYYY).

In [44]:
sales_data = pd.read_csv('../data/blooth_sales_data.csv',
                         parse_dates=['birthday', 'orderdate'],
                         dayfirst=True
                        )
sales_data.head(5)

Unnamed: 0,name,birthday,customer,orderdate,product,units,unitprice
0,Pasquale,1967-09-02,Electronics Inc,2016-07-17 13:48:03.156566,Thriller record,2,13.27
1,India,1968-12-13,Electronics Resource Group,2016-07-06 13:48:03.156596,Corolla,26,24458.69
2,Wayne,1992-09-10,East Application Contract Inc,2016-07-22 13:48:03.156618,Rubik’s Cube,41,15.79
3,Cori,1986-11-05,Signal Industries,2016-07-23 13:48:03.156638,iPhone,16,584.01
4,Chang,1972-04-23,Star Alpha Industries,2016-07-16 13:48:03.156657,Harry Potter book,4,25.69
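
The preview looks the same as before, presumably because the dates in this file are written year first and are therefore not ambiguous. The effect of *dayfirst* only shows on ambiguous strings. A small sketch with a made-up date string (not taken from the data set):

```python
import pandas as pd

# Hypothetical ambiguous date: 3 April or 4 March?
ambiguous = "03/04/2016"

month_first = pd.to_datetime(ambiguous)               # default: month first
day_first = pd.to_datetime(ambiguous, dayfirst=True)  # European reading

print(month_first.date())  # 2016-03-04
print(day_first.date())    # 2016-04-03
```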


### JSON

In [45]:
sales_data_json = pd.read_json('../data/blooth_sales_data.json')
sales_data_json.head(5)

Unnamed: 0,birthday,customer,name,orderdate,product,unitprice,units
0,1967-09-02,Electronics Inc,Pasquale,2016-07-17 13:48:03.156566,Thriller record,13.27,2
1,1968-12-13,Electronics Resource Group,India,2016-07-06 13:48:03.156596,Corolla,24458.69,26
2,1992-09-10,East Application Contract Inc,Wayne,2016-07-22 13:48:03.156618,Rubik’s Cube,15.79,41
3,1986-11-05,Signal Industries,Cori,2016-07-23 13:48:03.156638,iPhone,584.01,16
4,1972-04-23,Star Alpha Industries,Chang,2016-07-16 13:48:03.156657,Harry Potter book,25.69,4


In [47]:
sales_data_json.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 7 columns):
birthday     1000 non-null object
customer     1000 non-null object
name         1000 non-null object
orderdate    1000 non-null object
product      1000 non-null object
unitprice    1000 non-null float64
units        1000 non-null int64
dtypes: float64(1), int64(1), object(5)
memory usage: 62.5+ KB


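**The date columns are still strings!** The timestamps in the JSON are not in ISO 8601 form ("YYYY-MM-DDTHH:MM:SS.NNNZ", e.g. 2012-04-23T18:25:43.511Z), and by default `read_json` only tries to convert columns whose labels look date-like (for example ending in `_at` or `_time`, or named `date`), so `birthday` and `orderdate` are left untouched.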
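
For reference, a timestamp like the ones in this file can be rewritten in that ISO 8601 form with the standard library alone. A sketch, assuming the timestamps are meant as UTC (the data set does not say so):

```python
from datetime import datetime, timezone

# One timestamp in the format used by the Blooth data files.
raw = "2016-07-17 13:48:03.156566"

parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")

# Assumption: the value is UTC, so we attach that time zone explicitly.
aware = parsed.replace(tzinfo=timezone.utc)

# Millisecond precision, with "Z" as the usual short form of "+00:00".
iso_millis = aware.isoformat(timespec="milliseconds").replace("+00:00", "Z")

print(iso_millis)  # 2016-07-17T13:48:03.156Z
```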

One approach to fix the date columns: read the JSON with the `json` module and create the DataFrame from a list of dictionaries.

In [48]:
import json
with open('../data/blooth_sales_data.json', 'r') as f:
    _json = json.load(f)
_json[0]

{'birthday': '1967-09-02',
 'customer': 'Electronics Inc',
 'name': 'Pasquale',
 'orderdate': '2016-07-17 13:48:03.156566',
 'product': 'Thriller record',
 'unitprice': 13.27,
 'units': 2}

In [49]:
import datetime
for j in _json:
    j['orderdate'] = datetime.datetime.strptime(j['orderdate'], "%Y-%m-%d %H:%M:%S.%f").strftime("%Y-%m-%dT%H:%M:%SZ")
    j['birthday'] = datetime.datetime.strptime(j['birthday'], "%Y-%m-%d")
_json[0]

{'birthday': datetime.datetime(1967, 9, 2, 0, 0),
 'customer': 'Electronics Inc',
 'name': 'Pasquale',
 'orderdate': '2016-07-17T13:48:03Z',
 'product': 'Thriller record',
 'unitprice': 13.27,
 'units': 2}

In [50]:
sales_data_from_dict = pd.DataFrame(_json)
sales_data_from_dict.head(5)

Unnamed: 0,birthday,customer,name,orderdate,product,unitprice,units
0,1967-09-02,Electronics Inc,Pasquale,2016-07-17T13:48:03Z,Thriller record,13.27,2
1,1968-12-13,Electronics Resource Group,India,2016-07-06T13:48:03Z,Corolla,24458.69,26
2,1992-09-10,East Application Contract Inc,Wayne,2016-07-22T13:48:03Z,Rubik’s Cube,15.79,41
3,1986-11-05,Signal Industries,Cori,2016-07-23T13:48:03Z,iPhone,584.01,16
4,1972-04-23,Star Alpha Industries,Chang,2016-07-16T13:48:03Z,Harry Potter book,25.69,4


In [51]:
sales_data_from_dict.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
birthday     1000 non-null datetime64[ns]
customer     1000 non-null object
name         1000 non-null object
orderdate    1000 non-null object
product      1000 non-null object
unitprice    1000 non-null float64
units        1000 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 54.8+ KB

Note: `orderdate` is still an object column, because `strftime` turned the parsed value back into a string. Keeping the `datetime` objects gives the desired dtype:


In [52]:
with open('../data/blooth_sales_data.json', 'r') as f:
    _json = json.load(f)
for j in _json:
    j['orderdate'] = datetime.datetime.strptime(j['orderdate'], "%Y-%m-%d %H:%M:%S.%f")
    j['birthday'] = datetime.datetime.strptime(j['birthday'], "%Y-%m-%d")
sales_data_from_dict = pd.DataFrame(_json)
sales_data_from_dict.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
birthday     1000 non-null datetime64[ns]
customer     1000 non-null object
name         1000 non-null object
orderdate    1000 non-null datetime64[ns]
product      1000 non-null object
unitprice    1000 non-null float64
units        1000 non-null int64
dtypes: datetime64[ns](2), float64(1), int64(1), object(3)
memory usage: 54.8+ KB


Probably more efficient:

In [53]:
sales_data_json = pd.read_json('../data/blooth_sales_data.json',
                              convert_dates=['birthday', 'orderdate']
                              )
sales_data_json.head(5)

Unnamed: 0,birthday,customer,name,orderdate,product,unitprice,units
0,1967-09-02,Electronics Inc,Pasquale,2016-07-17 13:48:03.156566,Thriller record,13.27,2
1,1968-12-13,Electronics Resource Group,India,2016-07-06 13:48:03.156596,Corolla,24458.69,26
2,1992-09-10,East Application Contract Inc,Wayne,2016-07-22 13:48:03.156618,Rubik’s Cube,15.79,41
3,1986-11-05,Signal Industries,Cori,2016-07-23 13:48:03.156638,iPhone,584.01,16
4,1972-04-23,Star Alpha Industries,Chang,2016-07-16 13:48:03.156657,Harry Potter book,25.69,4


In [54]:
sales_data_json.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 7 columns):
birthday     1000 non-null datetime64[ns]
customer     1000 non-null object
name         1000 non-null object
orderdate    1000 non-null datetime64[ns]
product      1000 non-null object
unitprice    1000 non-null float64
units        1000 non-null int64
dtypes: datetime64[ns](2), float64(1), int64(1), object(3)
memory usage: 62.5+ KB


In [83]:
sales_data_json.describe()

Unnamed: 0,unitprice,units
count,1000.0,1000.0
mean,2202.87399,25.842
std,6369.863333,14.581689
min,5.02,1.0
25%,10.775,13.0
50%,18.005,26.0
75%,508.98,39.0
max,24967.31,50.0


### Working with Excel

In [63]:
sales_data_excel = pd.read_excel('../data/blooth_sales_data.xlsx',
                                usecols='A:B')  # older pandas versions called this parameter parse_cols
sales_data_excel.head(5)

Unnamed: 0,name,birthday
0,Pasquale,1967-09-02
1,India,1968-12-13
2,Wayne,1992-09-10
3,Cori,1986-11-05
4,Chang,1972-04-23


In [64]:
sales_data_excel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
name        1000 non-null object
birthday    1000 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(1)
memory usage: 15.7+ KB


#### Exporting to Excel

In [82]:
sales_data_json.head(3)

Unnamed: 0,birthday,customer,name,orderdate,product,unitprice,units
0,1967-09-02,Electronics Inc,Pasquale,2016-07-17 13:48:03.156566,Thriller record,13.27,2
1,1968-12-13,Electronics Resource Group,India,2016-07-06 13:48:03.156596,Corolla,24458.69,26
2,1992-09-10,East Application Contract Inc,Wayne,2016-07-22 13:48:03.156618,Rubik’s Cube,15.79,41


In [81]:
sales_data_json.to_excel('../data/sales_data_json_exported.xlsx', index=False, sheet_name='Sales Data')
# Make sure the file extension is correct: it is used to determine the export engine.

**Example: Exporting to Excel is very powerful when using the *xlsxwriter engine***

![title](../pic/xlsxwriterexample.png)

### Reading from the Clipboard

In [69]:
# put this in your clipboard
sales_data_json.head(5)

Unnamed: 0,birthday,customer,name,orderdate,product,unitprice,units
0,1967-09-02,Electronics Inc,Pasquale,2016-07-17 13:48:03.156566,Thriller record,13.27,2
1,1968-12-13,Electronics Resource Group,India,2016-07-06 13:48:03.156596,Corolla,24458.69,26
2,1992-09-10,East Application Contract Inc,Wayne,2016-07-22 13:48:03.156618,Rubik’s Cube,15.79,41
3,1986-11-05,Signal Industries,Cori,2016-07-23 13:48:03.156638,iPhone,584.01,16
4,1972-04-23,Star Alpha Industries,Chang,2016-07-16 13:48:03.156657,Harry Potter book,25.69,4


In [71]:
sales_from_clipboard = pd.read_clipboard()
sales_from_clipboard.head(5)

Unnamed: 0,0,1967-09-02,Electronics Inc,Pasquale,2016-07-17 13:48:03.156566,Thriller record,13.27,2
0,1,1968-12-13,Electronics Resource Group,India,2016-07-06 13:48:03.156596,Corolla,24458.69,26
1,2,1992-09-10,East Application Contract Inc,Wayne,2016-07-22 13:48:03.156618,Rubik’s Cube,15.79,41
2,3,1986-11-05,Signal Industries,Cori,2016-07-23 13:48:03.156638,iPhone,584.01,16
3,4,1972-04-23,Star Alpha Industries,Chang,2016-07-16 13:48:03.156657,Harry Potter book,25.69,4


### Summary

* reading and writing data is simple with Pandas
* very customizable imports
* the many options can be overwhelming for beginners - be patient with yourself
* a lot of the handling is done by Pandas by default, e.g.:
    * header
    * datatype (int/float, but not datetime)
    * skipping blank lines
    * …
* data cleansing (see the datetime example above) can also be done directly in Pandas, as we'll see later
* exporting data is simple,…
* caveats:
    * integer columns containing NaN values are converted to floats
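
The defaults and the NaN caveat listed above can be tried on a tiny in-memory file. This is only a sketch: `csv_text` is a made-up sample with one blank line and one missing `units` value, and the nullable `Int64` dtype is shown as one possible workaround, not as the recommended fix.

```python
import io

import pandas as pd

# Hypothetical sample: a blank line after the header and a missing "units" value.
csv_text = """name,units,unitprice

Pasquale,2,13.27
India,,24458.69
Wayne,41,15.79
"""

# Defaults: the first line becomes the header and the blank line is skipped.
defaults = pd.read_csv(io.StringIO(csv_text))
print(defaults.dtypes)  # "units" is float64 because of the missing value

# Possible workaround: ask for the nullable integer dtype explicitly.
nullable = pd.read_csv(io.StringIO(csv_text), dtype={"units": "Int64"})
print(nullable.dtypes)  # "units" stays an integer column with a missing entry
```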

End of our light warm-up.
   