# Reading and writing data across multiple formats

A practical introduction to pandas.

In [None]:
import pandas as pd

*import pandas as pd* is a widely used convention.

In our data directory we have some files from the *Blooth store*. *Let's import one*!

In [None]:
!pwd

In [None]:
!ls ../data

In [None]:
sales_data = pd.read_csv('../data/blooth_sales_data.csv')

### Let's explore our data set

In [None]:
pd.set_option('display.max_rows', 10)  # change presets for data preview
sales_data

In [None]:
pd.reset_option('display.max_rows')

#### Let's see what we have got now

In [None]:
type(sales_data)

In [None]:
len(sales_data)

#### Inspect your DataFrame with pandas methods

In [None]:
sales_data.head(5)

In [None]:
sales_data.tail(5)

In [None]:
sales_data.info()

**Note: floats and ints were detected automatically, but the date(time) columns are still string objects.**

In [None]:
sales_data = pd.read_csv('../data/blooth_sales_data.csv',
                         parse_dates=['birthday', 'orderdate']
                        )
sales_data.info()

In [None]:
sales_data.head(5)
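
To see the effect of `parse_dates` without the Blooth file at hand, here is a minimal sketch using a small in-memory CSV. The column names and values are made up for illustration and are not taken from the store data.

```python
import io

import pandas as pd

# Hypothetical two-row CSV; not the real Blooth sales data.
csv_text = "customer,orderdate\nalice,2021-03-04\nbob,2021-05-06\n"

# Without parse_dates the date column is read as plain strings.
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates pandas converts the listed columns to a datetime dtype.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['orderdate'])

print(raw.dtypes)
print(parsed.dtypes)
```

Comparing the two printed dtype listings shows which column changed.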

The automatic date parser assumes US-style dates by default, i.e. month first (MM/DD/YYYY). Add *dayfirst=True* for international and European formats (DD/MM/YYYY).

In [None]:
sales_data = pd.read_csv('../data/blooth_sales_data.csv',
                         parse_dates=['birthday', 'orderdate'],
                         dayfirst=True)
sales_data.head(5)
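
The month-first versus day-first behavior is easiest to see with an ambiguous date. The sketch below uses a made-up in-memory CSV (hypothetical column names and values) and reads it twice, once with the default and once with `dayfirst=True`.

```python
import io

import pandas as pd

# Hypothetical data: "03/04/2021" could mean 4 March or 3 April.
csv_text = "orderdate,amount\n03/04/2021,10\n05/06/2021,20\n"

# Default: the first number is interpreted as the month.
month_first = pd.read_csv(io.StringIO(csv_text), parse_dates=['orderdate'])

# dayfirst=True: the first number is interpreted as the day.
day_first = pd.read_csv(io.StringIO(csv_text),
                        parse_dates=['orderdate'],
                        dayfirst=True)

print(month_first['orderdate'].dt.month.tolist())
print(day_first['orderdate'].dt.month.tolist())
```

The same strings end up as different dates, so it is worth checking a few rows against the source file after importing.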

### JSON

In [None]:
sales_data_json = pd.read_json('../data/blooth_sales_data.json')
sales_data_json.head(5)

In [None]:
sales_data_json.info()

**The JSON is not correctly formatted for datetime!** The values should be ISO 8601 ("YYYY-MM-DDTHH:MM:SS.NNNZ"), e.g. 2012-04-23T18:25:43.511Z.

One approach to solve it: read the JSON and create the DataFrame from a list of dictionaries.

In [None]:
import json
with open('../data/blooth_sales_data.json', 'r') as f:
    _json = json.load(f)
_json[0]

In [None]:
import datetime
for j in _json:
    j['orderdate'] = datetime.datetime.strptime(j['orderdate'], "%Y-%m-%d %H:%M:%S.%f").strftime("%Y-%m-%dT%H:%M:%SZ")
    j['birthday'] = datetime.datetime.strptime(j['birthday'], "%Y-%m-%d")
_json[0]

In [None]:
sales_data_from_dict = pd.DataFrame(_json)
sales_data_from_dict.head(5)

In [None]:
sales_data_from_dict.info()

In [None]:
with open('../data/blooth_sales_data.json', 'r') as f:
    _json = json.load(f)
for j in _json:
    j['orderdate'] = datetime.datetime.strptime(j['orderdate'], "%Y-%m-%d %H:%M:%S.%f")
    j['birthday'] = datetime.datetime.strptime(j['birthday'], "%Y-%m-%d")
sales_data_from_dict = pd.DataFrame(_json)
sales_data_from_dict.info()
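
The difference between the two attempts above comes down to what ends up in the dictionaries: `strftime` returns a string, so that column stays a plain string column, while `datetime` objects become a datetime column when the DataFrame is built. A minimal sketch with two made-up records (not the Blooth data):

```python
import datetime

import pandas as pd

# Hypothetical records using the same timestamp layout as the JSON file.
raw = [{"orderdate": "2012-04-23 18:25:43.511"},
       {"orderdate": "2012-05-01 09:00:00.000"}]
fmt = "%Y-%m-%d %H:%M:%S.%f"

# Variant 1: parse, then format back to an ISO-like string -> still text.
as_strings = pd.DataFrame(
    [{"orderdate": datetime.datetime.strptime(r["orderdate"], fmt)
                                    .strftime("%Y-%m-%dT%H:%M:%SZ")}
     for r in raw]
)

# Variant 2: keep the datetime objects -> pandas builds a datetime column.
as_datetimes = pd.DataFrame(
    [{"orderdate": datetime.datetime.strptime(r["orderdate"], fmt)}
     for r in raw]
)

print(as_strings.dtypes)
print(as_datetimes.dtypes)
```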


Probably more efficient:

In [None]:
sales_data_json = pd.read_json('../data/blooth_sales_data.json',
                              convert_dates=['birthday', 'orderdate']
                              )
sales_data_json.head(5)

In [None]:
sales_data_json.info()

In [None]:
sales_data_json.describe()

### Working with Excel

In [None]:
sales_data_excel = pd.read_excel('../data/blooth_sales_data.xlsx',
                                usecols='A:B')  # parse_cols was renamed to usecols
sales_data_excel.head(5)

In [None]:
sales_data_excel.info()

#### Exporting to Excel

In [None]:
sales_data_json.head(3)

In [None]:
sales_data_json.to_excel('../data/sales_data_json_exported.xlsx', index=False, sheet_name='Sales Data')
# Make sure the file extension is correct: pandas uses it to determine the export engine

**Example: Exporting to Excel is very powerful when using the *xlsxwriter engine***

![title](../pic/xlsxwriterexample.png)

For more see: http://xlsxwriter.readthedocs.io
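
As a hedged sketch of what the xlsxwriter engine allows, the snippet below writes a small made-up DataFrame and applies a number format, a column width and a frozen header row. It assumes the optional `xlsxwriter` package is installed; the helper name `export_formatted`, the column names and the output path are hypothetical and not part of the original notebook.

```python
import importlib.util
import os
import tempfile

import pandas as pd


def export_formatted(df, path, sheet_name="Sales Data"):
    """Write df to an .xlsx file and format its second column (B)."""
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        df.to_excel(writer, sheet_name=sheet_name, index=False)
        workbook = writer.book                  # underlying xlsxwriter workbook
        worksheet = writer.sheets[sheet_name]   # worksheet created by to_excel
        money = workbook.add_format({"num_format": "#,##0.00"})
        worksheet.set_column("B:B", 14, money)  # width 14 plus number format
        worksheet.freeze_panes(1, 0)            # keep the header row visible


# Made-up example data, not the Blooth sales file.
demo = pd.DataFrame({"product": ["lamp", "desk"],
                     "revenue": [1234.5, 98765.432]})

# Only run the export when the optional dependency is available.
if importlib.util.find_spec("xlsxwriter") is not None:
    out_path = os.path.join(tempfile.gettempdir(), "sales_formatted_demo.xlsx")
    export_formatted(demo, out_path)
```

Formatting is applied through the workbook and worksheet objects rather than through `to_excel` itself, which is why the `ExcelWriter` context manager is used here.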

### Reading from the Clipboard

In [None]:
# copy the output of this cell to your clipboard
sales_data_json.head(5)

In [None]:
sales_from_clipboard = pd.read_clipboard()
sales_from_clipboard.head(5)

### Summary

* reading and writing data is simple with Pandas
* very customizable imports
* the many options can be overwhelming for beginners - be patient with yourself
* a lot of the handling is done by Pandas by default, e.g.:
    * header
    * datatype (int/float, but not datetime)
    * skipping blank lines
    * …
* data cleansing (see the datetime example above) can also be done in pandas directly; we'll see that later.
* exporting data is simple,…
* caveats:
    * integer columns with missing values (NaN) become floats
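
The NaN caveat can be reproduced with a tiny in-memory CSV. The data below is made up for illustration; the second read shows one possible workaround, pandas' nullable `Int64` dtype, which is a suggestion and not something the notebook itself uses.

```python
import io

import pandas as pd

# Hypothetical CSV: the second row has no quantity.
csv_text = "order_id,quantity\n1,5\n2,\n3,7\n"

# Default behavior: the missing value forces the whole column to float.
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)

# Possible workaround: request the nullable integer dtype explicitly.
df_nullable = pd.read_csv(io.StringIO(csv_text), dtype={"quantity": "Int64"})
print(df_nullable.dtypes)
```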

End of our light warm-up.
   