# File formats

Many file formats are in widespread use in data science. In this lecture, we review common file formats and their trade-offs, how to choose an appropriate format, and the mechanics of reading and writing each one.

## Reading data from different file formats

### CSV

#### When the CSV file can be read as is

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/profiles.csv')

In [3]:
df.head(1)

Unnamed: 0,address,birthdate,blood_group,company,current_location,job,mail,name,residence,sex,ssn,username,website
0,"8009 آل علي Place Apt. 860\nNorth نصري, LA 85261",1917-10-14,A-,آل محمد بن علي بن جماز-آل عواض,"(Decimal('84.6993865'), Decimal('173.552786'))",Toxicologist,jyln11@gmail.com,المهندس عبد الرّحمن حجار,"5059 نورس Cove\nالعقيلmouth, WV 15321",M,045-50-8831,slshy,"['https://www.al.com/', 'https://lkhrfy.biz/',..."


In [4]:
df.loc[0]

address              8009 آل علي Place Apt. 860\nNorth نصري, LA 85261
birthdate                                                  1917-10-14
blood_group                                                        A-
company                                آل محمد بن علي بن جماز-آل عواض
current_location       (Decimal('84.6993865'), Decimal('173.552786'))
job                                                      Toxicologist
mail                                                 jyln11@gmail.com
name                                         المهندس عبد الرّحمن حجار
residence                       5059 نورس Cove\nالعقيلmouth, WV 15321
sex                                                                 M
ssn                                                       045-50-8831
username                                                        slshy
website             ['https://www.al.com/', 'https://lkhrfy.biz/',...
Name: 0, dtype: object

#### When scrubbing of rows may be needed

In [5]:
import csv

In [6]:
rows = []
with open('data/profiles.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        rows.append(row)

In [7]:
list(map(len, rows))

[13, 13, 13, 13]

In [8]:
rows[:2]

[['address',
  'birthdate',
  'blood_group',
  'company',
  'current_location',
  'job',
  'mail',
  'name',
  'residence',
  'sex',
  'ssn',
  'username',
  'website'],
 ['8009 آل علي Place Apt. 860\nNorth نصري, LA 85261',
  '1917-10-14',
  'A-',
  'آل محمد بن علي بن جماز-آل عواض',
  "(Decimal('84.6993865'), Decimal('173.552786'))",
  'Toxicologist',
  'jyln11@gmail.com',
  'المهندس عبد الرّحمن حجار',
  '5059 نورس Cove\nالعقيلmouth, WV 15321',
  'M',
  '045-50-8831',
  'slshy',
  "['https://www.al.com/', 'https://lkhrfy.biz/', 'https://www.al.org/']"]]

In [9]:
df = pd.DataFrame(rows[1:], columns=rows[0])

In [10]:
df.head(1)

Unnamed: 0,address,birthdate,blood_group,company,current_location,job,mail,name,residence,sex,ssn,username,website
0,"8009 آل علي Place Apt. 860\nNorth نصري, LA 85261",1917-10-14,A-,آل محمد بن علي بن جماز-آل عواض,"(Decimal('84.6993865'), Decimal('173.552786'))",Toxicologist,jyln11@gmail.com,المهندس عبد الرّحمن حجار,"5059 نورس Cove\nالعقيلmouth, WV 15321",M,045-50-8831,slshy,"['https://www.al.com/', 'https://lkhrfy.biz/',..."
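Row-by-row reading is mainly useful when some rows are malformed and have to be dropped or repaired before building a data frame. The following is a minimal sketch of that idea, not part of the original notebook: the inline data in `raw` is hypothetical, and the rule "keep a row only if it has as many fields as the header" is just one possible scrubbing criterion.

```python
import csv
import io

# Hypothetical data for illustration: the third line is missing two fields.
raw = "name,job,sex\nKaren,Hotel manager,F\nbroken row\nWang,Engineer,F\n"

reader = csv.reader(io.StringIO(raw))
header = next(reader)  # first row holds the column names

good_rows, bad_rows = [], []
for row in reader:
    # Keep a row only if it has exactly one value per header column.
    if len(row) == len(header):
        good_rows.append(row)
    else:
        bad_rows.append(row)

print(good_rows)
print(bad_rows)
```

The surviving rows can then be passed to `pd.DataFrame(good_rows, columns=header)` in the same way as above, while `bad_rows` can be inspected separately.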


### Tab-delimited

Tab-delimited files are read the same way as CSV files; only the separator changes.
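As a small illustration of what changing the separator means, the sketch below writes and re-reads two rows with the standard `csv` module using a tab delimiter. It works on an in-memory buffer rather than on `data/profiles.txt`, and the two rows are chosen only for illustration.

```python
import csv
import io

rows = [
    ["name", "company"],
    ["Karen Butler DVM", "Diaz, Williams and Nelson"],  # field contains a comma
]

# Write the rows with a tab as the field separator.
buf = io.StringIO()
csv.writer(buf, delimiter="\t").writerows(rows)
text = buf.getvalue()

# Reading back with the same delimiter recovers the original rows.
parsed = list(csv.reader(io.StringIO(text), delimiter="\t"))
print(parsed == rows)
```

Because the comma is no longer the delimiter, the company name does not need to be quoted in the tab-delimited text.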

#### Direct reading into DataFrame

In [11]:
df = pd.read_csv('data/profiles.txt', sep='\t')

In [12]:
df.head()

Unnamed: 0,address,birthdate,blood_group,company,current_location,job,mail,name,residence,sex,ssn,username,website
0,"8009 آل علي Place Apt. 860\nNorth نصري, LA 85261",1917-10-14,A-,آل محمد بن علي بن جماز-آل عواض,"(Decimal('84.6993865'), Decimal('173.552786'))",Toxicologist,jyln11@gmail.com,المهندس عبد الرّحمن حجار,"5059 نورس Cove\nالعقيلmouth, WV 15321",M,045-50-8831,slshy,"['https://www.al.com/', 'https://lkhrfy.biz/',..."
1,"7372 Sheila Springs Apt. 873\nLake Danielle, W...",1913-05-10,AB+,"Diaz, Williams and Nelson","(Decimal('-26.938657'), Decimal('-33.134258'))",Hotel manager,susan77@yahoo.com,Karen Butler DVM,"85542 Shelby Branch Apt. 181\nWest Joseph, KS ...",F,611-04-5409,angelamccormick,['https://rogers.com/']
2,湖南省丽华县沙湾潘街c座 318538,2012-09-10,B+,快讯网络有限公司,"(Decimal('7.011327'), Decimal('-59.304854'))",系统集成工程师,jun96@hotmail.com,王凤兰,吉林省太原市城东太原街d座 229987,F,622924194002266298,leizhao,"['http://www.an.cn/', 'http://ping.cn/', 'http..."


#### Row by row processing

In [13]:
rows = []
with open('data/profiles.txt') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        rows.append(row)

In [14]:
list(map(len, rows))

[13, 13, 13, 13]

### JSON

JSON is the most popular format for sharing information over the web. Most data retrieval APIs return JSON.
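JSON maps directly onto Python dictionaries, lists, strings, numbers, booleans and `None`. The sketch below shows that mapping with the standard `json` module; the `payload` string is a hypothetical response body modelled on one profile record, not actual API output.

```python
import json

# Hypothetical API response body, modelled on one profile record.
payload = (
    '{"username": "slshy", "website": ["https://www.al.com/"], '
    '"birthdate": null, "current_location": [84.6993865, 173.552786]}'
)

record = json.loads(payload)    # JSON text -> Python objects
print(type(record))             # a JSON object becomes a dict
print(record["website"][0])     # a JSON array becomes a list
print(record["birthdate"])      # JSON null becomes None

# Python objects -> JSON text; ensure_ascii=False keeps non-ASCII text readable.
text = json.dumps({"name": "王凤兰"}, ensure_ascii=False)
print(text)
```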

In [15]:
import json

In [16]:
with open('data/profiles.json') as f:
    profiles = json.load(f)

In [17]:
len(profiles)

3

In [18]:
profiles[0]

{'job': 'Toxicologist',
 'company': 'آل محمد بن علي بن جماز-آل عواض',
 'ssn': '045-50-8831',
 'residence': '5059 نورس Cove\nالعقيلmouth, WV 15321',
 'current_location': [84.6993865, 173.552786],
 'blood_group': 'A-',
 'website': ['https://www.al.com/',
  'https://lkhrfy.biz/',
  'https://www.al.org/'],
 'username': 'slshy',
 'name': 'المهندس عبد الرّحمن حجار',
 'sex': 'M',
 'address': '8009 آل علي Place Apt. 860\nNorth نصري, LA 85261',
 'mail': 'jyln11@gmail.com',
 'birthdate': None}

#### Using a REST API to retrieve JSON data

In [19]:
import os

In [20]:
if not os.path.exists('data/pokemon.json'):
    ! curl -o data/pokemon.json https://pokeapi.co/api/v2/pokemon/23

In [21]:
with open('data/pokemon.json') as f:
    pokemon = json.load(f)

In [22]:
pokemon.keys()

dict_keys(['abilities', 'base_experience', 'forms', 'game_indices', 'height', 'held_items', 'id', 'is_default', 'location_area_encounters', 'moves', 'name', 'order', 'species', 'sprites', 'stats', 'types', 'weight'])

In [23]:
pokemon['name']

'ekans'

In [24]:
pokemon['abilities']

[{'ability': {'name': 'unnerve',
   'url': 'https://pokeapi.co/api/v2/ability/127/'},
  'is_hidden': True,
  'slot': 3},
 {'ability': {'name': 'shed-skin',
   'url': 'https://pokeapi.co/api/v2/ability/61/'},
  'is_hidden': False,
  'slot': 2},
 {'ability': {'name': 'intimidate',
   'url': 'https://pokeapi.co/api/v2/ability/22/'},
  'is_hidden': False,
  'slot': 1}]

### XML

In [25]:
# cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
import xml.etree.ElementTree as ET

In [26]:
tree = ET.parse('data/profiles.xml')
root = tree.getroot()

In [27]:
root.tag

'duke'

In [28]:
ET.dump(root)

<duke>
    <employee>
        <address>8009 آل علي Place Apt. 860
        North نصري, LA 85261</address>
        <birthdate>None</birthdate>
        <blood_group>A-</blood_group>
        <company>آل محمد بن علي بن جماز-آل عواض</company>
        <current_location>84.6993865</current_location>
        <current_location>173.552786</current_location>
        <job>Toxicologist</job>
        <mail>jyln11@gmail.com</mail>
        <name>المهندس عبد الرّحمن حجار</name>
        <residence>5059 نورس Cove
        العقيلmouth, WV 15321</residence>
        <sex>M</sex>
        <ssn>045-50-8831</ssn>
        <username>slshy</username>
        <website>https://www.al.com/</website>
        <website>https://lkhrfy.biz/</website>
        <website>https://www.al.org/</website>
    </employee>
    <employee>
        <address>7372 Sheila Springs Apt. 873
        Lake Danielle, WV 06246</address>
        <birthdate>None</birthdate>
        <blood_group>AB+</blood_group>
        <company>Diaz, Williams and N

In [29]:
for employee in root:
    for elem in employee:
        print(f'{elem.tag:>20}: {elem.text}')
    break

             address: 8009 آل علي Place Apt. 860
        North نصري, LA 85261
           birthdate: None
         blood_group: A-
             company: آل محمد بن علي بن جماز-آل عواض
    current_location: 84.6993865
    current_location: 173.552786
                 job: Toxicologist
                mail: jyln11@gmail.com
                name: المهندس عبد الرّحمن حجار
           residence: 5059 نورس Cove
        العقيلmouth, WV 15321
                 sex: M
                 ssn: 045-50-8831
            username: slshy
             website: https://www.al.com/
             website: https://lkhrfy.biz/
             website: https://www.al.org/


In [30]:
root.findall('.')

[<Element 'duke' at 0x11449e9a8>]

In [31]:
root.findall('./')

[<Element 'employee' at 0x11449e958>,
 <Element 'employee' at 0x1144ae4a8>,
 <Element 'employee' at 0x1144aeae8>]

In [32]:
root.findall('.//')[:5]

[<Element 'employee' at 0x11449e958>,
 <Element 'address' at 0x1144a8f48>,
 <Element 'birthdate' at 0x1144a8f98>,
 <Element 'blood_group' at 0x1144ae048>,
 <Element 'company' at 0x1144ae098>]

In [33]:
for item in root.findall('.//company'):
    print(item.text)

آل محمد بن علي بن جماز-آل عواض
Diaz, Williams and Nelson
快讯网络有限公司
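The strings passed to `findall` are a limited subset of XPath. The sketch below illustrates the difference between matching direct children and matching descendants at any depth; the inline document `xml_text` is a hypothetical, cut-down version of the structure in `data/profiles.xml`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down document with the same shape as data/profiles.xml.
xml_text = """
<duke>
    <employee>
        <name>Karen Butler DVM</name>
        <website>https://rogers.com/</website>
    </employee>
    <employee>
        <name>王凤兰</name>
        <website>http://www.an.cn/</website>
        <website>http://ping.cn/</website>
    </employee>
</duke>
""".strip()

root = ET.fromstring(xml_text)

# A bare tag name matches direct children of the element only.
employees = root.findall("employee")

# './/website' matches website elements at any depth below the root.
all_sites = [e.text for e in root.findall(".//website")]

# Repeated child tags (one employee, several websites) come back as a list.
sites_per_employee = {
    emp.findtext("name"): [w.text for w in emp.findall("website")]
    for emp in employees
}
print(sites_per_employee)
```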


### HDF5

Like XML and JSON, HDF5 files store hierarchical data that can be annotated. The strong point of HDF5 is its ability to store large numerical data sets so that parts of the data can be selectively loaded into memory for analysis. HDF5 is also easy to use for people familiar with `numpy`, and it is widely used in the scientific community.

There are two popular libraries for working with HDF5: `pytables` and `h5py`. Pandas uses `pytables`, and the stored schema can be quite unintuitive, but that does not matter since we usually just use Pandas to read it back in.

#### Pandas and `tables`

In [34]:
import tables

In [35]:
f = tables.open_file('data/profiles.h5')

In [36]:
f

File(filename=data/profiles.h5, title='', mode='r', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/duke (Group) ''
/duke/axis0 (Array(14,)) ''
  atom := StringAtom(itemsize=11, shape=(), dflt=b'')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/duke/axis1 (Array(3,)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/duke/block0_items (Array(1,)) ''
  atom := StringAtom(itemsize=9, shape=(), dflt=b'')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/duke/block0_values (Array(3, 1)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/duke/block1_items (Array(2,)) ''
  atom := StringAtom(itemsize=10, shape=(), dflt=b'')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
 

In [37]:
f.root.duke.axis0[:]

array([b'location_x', b'location_y', b'address', b'birthdate',
       b'blood_group', b'company', b'job', b'mail', b'name', b'residence',
       b'sex', b'ssn', b'username', b'website'], dtype='|S11')

In [38]:
f.root.duke.axis1[:]

array([0, 1, 2])

In [39]:
f.root.duke.block0_items[:]

array([b'birthdate'], dtype='|S9')

In [40]:
f.root.duke.block0_values[:]

array([[-1647820800000000000],
       [-1787616000000000000],
       [ 1347235200000000000]])

In [41]:
f.close()

#### Reading into `pandas`

In [42]:
df = pd.read_hdf('data/profiles.h5')

In [43]:
df

Unnamed: 0,location_x,location_y,address,birthdate,blood_group,company,job,mail,name,residence,sex,ssn,username,website
0,84.699387,173.552786,"8009 آل علي Place Apt. 860\nNorth نصري, LA 85261",1917-10-14,A-,آل محمد بن علي بن جماز-آل عواض,Toxicologist,jyln11@gmail.com,المهندس عبد الرّحمن حجار,"5059 نورس Cove\nالعقيلmouth, WV 15321",M,045-50-8831,slshy,"https://www.al.com/,https://lkhrfy.biz/,https:..."
1,-26.938657,-33.134258,"7372 Sheila Springs Apt. 873\nLake Danielle, W...",1913-05-10,AB+,"Diaz, Williams and Nelson",Hotel manager,susan77@yahoo.com,Karen Butler DVM,"85542 Shelby Branch Apt. 181\nWest Joseph, KS ...",F,611-04-5409,angelamccormick,https://rogers.com/
2,7.011327,-59.304854,湖南省丽华县沙湾潘街c座 318538,2012-09-10,B+,快讯网络有限公司,系统集成工程师,jun96@hotmail.com,王凤兰,吉林省太原市城东太原街d座 229987,F,622924194002266298,leizhao,"http://www.an.cn/,http://ping.cn/,https://gao...."


#### Using `h5py`

For actually working directly with HDF5, I find `h5py` more intuitive.

In [44]:
import h5py

In [45]:
filename = 'data/simulations.h5'
if os.path.exists(filename):
    os.remove(filename)
f = h5py.File(filename, 'w')  # open for writing; h5py 3.x defaults to read-only mode

In [46]:
import numpy as np
import pendulum

In [47]:
start = pendulum.datetime(2019, 8, 31)
stop = start.add(days=3)
for day in pendulum.period(start, stop):
    g = f.create_group(day.format('ddd'))
    g.attrs['date'] = day.format('LLL')
    g.attrs['analyst'] = 'Mario'
    for expt in range(3):
        data = np.random.poisson(size=(100, 100))
        ds = g.create_dataset(f'expt-{expt:02d}', data=data)

In [48]:
f.close()  # close the handle used for writing before reopening the file
f = h5py.File(filename, 'r')

In [49]:
list(f.keys())

['Mon', 'Sat', 'Sun', 'Tue']

In [50]:
list(f['Sat'].attrs.keys())

['analyst', 'date']

In [51]:
f['Sat'].attrs['analyst']

'Mario'

In [52]:
f['Sat'].attrs['date']

'August 31, 2019 12:00 AM'

In [53]:
list(f['Sat'].keys())

['expt-00', 'expt-01', 'expt-02']

In [54]:
f['Sat']['expt-01'][5:10, 5:10]

array([[0, 2, 4, 0, 2],
       [2, 0, 1, 0, 3],
       [0, 1, 2, 2, 3],
       [0, 3, 0, 1, 1],
       [1, 0, 1, 2, 1]])

In [55]:
f['Sat']['expt-01'][5:10, 5:10].sum(axis=0)

array([ 3,  6,  8,  5, 10])

In [56]:
f.close()

### Avro

In [57]:
import fastavro 

In [58]:
%%bash --out s
fastavro --schema data/profiles.avro

In [59]:
schema = json.loads(s)  # the schema is printed as JSON, so parse it instead of using eval

In [60]:
schema

{'__rec_avro_schema__': True,
 'type': 'record',
 'name': 'rec_avro.rec_object',
 'fields': [{'name': '_',
   'type': [{'type': 'map',
     'values': ['null',
      'boolean',
      'int',
      'long',
      'float',
      'double',
      'string',
      'bytes',
      'rec_avro.rec_object']},
    {'type': 'array',
     'items': ['null',
      'boolean',
      'int',
      'long',
      'float',
      'double',
      'string',
      'bytes',
      'rec_avro.rec_object']}]}]}

In [61]:
with open('data/profiles.avro', 'rb') as f:
    avro_reader = fastavro.reader(f, reader_schema=schema)
    for record in avro_reader:
        print(record)

{'_': {'job': 'Toxicologist', 'company': 'آل محمد بن علي بن جماز-آل عواض', 'ssn': '045-50-8831', 'residence': '5059 نورس Cove\nالعقيلmouth, WV 15321', 'current_location': {'_': [84.69938659667969, 173.5527801513672]}, 'blood_group': 'A-', 'website': {'_': ['https://www.al.com/', 'https://lkhrfy.biz/', 'https://www.al.org/']}, 'username': 'slshy', 'name': 'المهندس عبد الرّحمن حجار', 'sex': 'M', 'address': '8009 آل علي Place Apt. 860\nNorth نصري, LA 85261', 'mail': 'jyln11@gmail.com', 'birthdate': None}}
{'_': {'job': 'Hotel manager', 'company': 'Diaz, Williams and Nelson', 'ssn': '611-04-5409', 'residence': '85542 Shelby Branch Apt. 181\nWest Joseph, KS 53929', 'current_location': {'_': [-26.938657760620117, -33.13425827026367]}, 'blood_group': 'AB+', 'website': {'_': ['https://rogers.com/']}, 'username': 'angelamccormick', 'name': 'Karen Butler DVM', 'sex': 'F', 'address': '7372 Sheila Springs Apt. 873\nLake Danielle, WV 06246', 'mail': 'susan77@yahoo.com', 'birthdate': None}}
{'_': {'

#### Avro to JSON

In [62]:
from rec_avro import from_rec_avro_destructive

In [63]:
with open('data/profiles.avro', 'rb') as f:
    avro_reader = fastavro.reader(f, reader_schema=schema)
    for record in avro_reader:
        print(from_rec_avro_destructive(record))

{'job': 'Toxicologist', 'company': 'آل محمد بن علي بن جماز-آل عواض', 'ssn': '045-50-8831', 'residence': '5059 نورس Cove\nالعقيلmouth, WV 15321', 'current_location': [84.69938659667969, 173.5527801513672], 'blood_group': 'A-', 'website': ['https://www.al.com/', 'https://lkhrfy.biz/', 'https://www.al.org/'], 'username': 'slshy', 'name': 'المهندس عبد الرّحمن حجار', 'sex': 'M', 'address': '8009 آل علي Place Apt. 860\nNorth نصري, LA 85261', 'mail': 'jyln11@gmail.com', 'birthdate': None}
{'job': 'Hotel manager', 'company': 'Diaz, Williams and Nelson', 'ssn': '611-04-5409', 'residence': '85542 Shelby Branch Apt. 181\nWest Joseph, KS 53929', 'current_location': [-26.938657760620117, -33.13425827026367], 'blood_group': 'AB+', 'website': ['https://rogers.com/'], 'username': 'angelamccormick', 'name': 'Karen Butler DVM', 'sex': 'F', 'address': '7372 Sheila Springs Apt. 873\nLake Danielle, WV 06246', 'mail': 'susan77@yahoo.com', 'birthdate': None}
{'job': '系统集成工程师', 'company': '快讯网络有限公司', 'ssn': '

### Parquet

In [64]:
import fastparquet

In [65]:
parq = fastparquet.ParquetFile('data/profiles.parq')

In [66]:
parq.columns

['location_x',
 'location_y',
 'address',
 'birthdate',
 'blood_group',
 'company',
 'job',
 'mail',
 'name',
 'residence',
 'sex',
 'ssn',
 'username',
 'website']

In [67]:
df = parq.to_pandas()

In [68]:
df.head(1)

Unnamed: 0,location_x,location_y,address,birthdate,blood_group,company,job,mail,name,residence,sex,ssn,username,website
0,84.699387,173.552786,"8009 آل علي Place Apt. 860\nNorth نصري, LA 85261",1917-10-14,A-,آل محمد بن علي بن جماز-آل عواض,Toxicologist,jyln11@gmail.com,المهندس عبد الرّحمن حجار,"5059 نورس Cove\nالعقيلmouth, WV 15321",M,045-50-8831,slshy,"https://www.al.com/,https://lkhrfy.biz/,https:..."


#### Reading directly in `pandas`

In [69]:
df = pd.read_parquet('data/profiles.parq')

In [70]:
df.head(1)

Unnamed: 0,location_x,location_y,address,birthdate,blood_group,company,job,mail,name,residence,sex,ssn,username,website
0,84.699387,173.552786,"8009 آل علي Place Apt. 860\nNorth نصري, LA 85261",1917-10-14,A-,آل محمد بن علي بن جماز-آل عواض,Toxicologist,jyln11@gmail.com,المهندس عبد الرّحمن حجار,"5059 نورس Cove\nالعقيلmouth, WV 15321",M,045-50-8831,slshy,"https://www.al.com/,https://lkhrfy.biz/,https:..."


### SQL

A relational database isn't really a file type, but SQLite3 stores its data in a single file.

In [71]:
import sqlite3

In [72]:
conn = sqlite3.connect('data/profiles.sqlite')
c = conn.cursor()

In [73]:
c.execute("SELECT * FROM sqlite_master WHERE type='table'")
c.fetchall()

[('table',
  'duke',
  'duke',
  2,
  'CREATE TABLE duke (\n\tid BIGINT, \n\tlocation_x FLOAT, \n\tlocation_y FLOAT, \n\taddress TEXT, \n\tbirthdate DATETIME, \n\tblood_group TEXT, \n\tcompany TEXT, \n\tjob TEXT, \n\tmail TEXT, \n\tname TEXT, \n\tresidence TEXT, \n\tsex TEXT, \n\tssn TEXT, \n\tusername TEXT, \n\twebsite TEXT\n)')]

In [74]:
c.execute('SELECT * FROM duke')
c.fetchone()

(0,
 84.6993865,
 173.552786,
 '8009 آل علي Place Apt. 860\nNorth نصري, LA 85261',
 '1917-10-14 00:00:00.000000',
 'A-',
 'آل محمد بن علي بن جماز-آل عواض',
 'Toxicologist',
 'jyln11@gmail.com',
 'المهندس عبد الرّحمن حجار',
 '5059 نورس Cove\nالعقيلmouth, WV 15321',
 'M',
 '045-50-8831',
 'slshy',
 'https://www.al.com/,https://lkhrfy.biz/,https://www.al.org/')

In [75]:
conn.close()
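The same `sqlite3` calls work on any SQLite database. The sketch below uses an in-memory database (so `data/profiles.sqlite` is not touched) with a hypothetical, reduced version of the `duke` table, and shows two conveniences not used above: `sqlite3.Row` for access by column name and `?` placeholders for query parameters.

```python
import sqlite3

# An in-memory database avoids touching data/profiles.sqlite.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can then be indexed by column name

# Hypothetical, reduced version of the duke table.
conn.execute("CREATE TABLE duke (id INTEGER, name TEXT, blood_group TEXT)")
conn.executemany(
    "INSERT INTO duke VALUES (?, ?, ?)",
    [(0, "Karen Butler DVM", "AB+"), (1, "王凤兰", "B+")],
)

# '?' placeholders let sqlite3 handle quoting of the parameter value.
row = conn.execute(
    "SELECT name, blood_group FROM duke WHERE blood_group = ?", ("B+",)
).fetchone()
result = dict(row)
print(result)

conn.close()
```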

## How the files were created

### Create fake profiles using `Faker`

In [76]:
from faker import Faker

In [77]:
fakes = [
    Faker('zh_CN'), 
    Faker('ar_SA'), 
    Faker('en_US'), 
]

In [78]:
n = 3
p = [0.3, 0.2, 0.5]
np.random.seed(1)
locales = np.random.choice(len(fakes), size=n, p=p)

In [79]:
profiles = [fakes[locale].profile() for locale in locales]

In [80]:
profiles

[{'job': 'Scientist, forensic',
  'company': 'المغاولة and Sons',
  'ssn': '177-66-4460',
  'residence': '8475 جوريّة Brooks\nعرفانbury, MT 02323',
  'current_location': (Decimal('54.993302'), Decimal('-16.808277')),
  'blood_group': 'A+',
  'website': ['http://al.com/'],
  'username': 'mlhjr',
  'name': 'السيد موسى المشاولة',
  'sex': 'M',
  'address': '43457 كتوم Plains Suite 096\nبنانview, MD 48123',
  'mail': 'syfldwynlrshd@hotmail.com',
  'birthdate': datetime.date(1992, 4, 24)},
 {'job': 'Minerals surveyor',
  'company': 'Rojas-Manning',
  'ssn': '148-32-8045',
  'residence': '733 Gordon Freeway Apt. 006\nWest Richard, DE 29439',
  'current_location': (Decimal('36.7676655'), Decimal('-50.791245')),
  'blood_group': 'AB-',
  'website': ['https://www.shah.com/'],
  'username': 'sherry05',
  'name': 'Vanessa Smith',
  'sex': 'F',
  'address': '337 John Mountain\nSouth Josebury, IL 88948',
  'mail': 'davisbarbara@yahoo.com',
  'birthdate': datetime.date(1989, 9, 19)},
 {'job': '电子/电器

### Make JSON files

In [81]:
import datetime
import decimal

In [82]:
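def converter(o):
    # Fallback for objects the encoder cannot serialize itself.
    # Note: the birthdates are datetime.date objects, which are not instances
    # of datetime.datetime, so this function returns None for them and they
    # are written as null in data/profiles.json.
    if isinstance(o, datetime.datetime):
        return str(o)
    if isinstance(o, decimal.Decimal):
        return str(o)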

In [83]:
import simplejson

In [84]:
with open('data/profiles.json', 'w') as f:
    simplejson.dump(profiles , f, default=converter)
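The `default` argument is a hook that the encoder calls only for objects it cannot serialize on its own. The sketch below is a variant of the idea using the standard `json` module rather than `simplejson`; the helper name `to_jsonable` and the choices to write dates as ISO strings and decimals as floats are assumptions made for illustration.

```python
import datetime
import decimal
import json


def to_jsonable(o):
    """Fallback for objects the json module cannot serialize itself."""
    # datetime.datetime is a subclass of datetime.date, so this covers both.
    if isinstance(o, datetime.date):
        return o.isoformat()
    if isinstance(o, decimal.Decimal):
        return float(o)
    # Raising TypeError signals that the object is not supported.
    raise TypeError(f"Cannot serialize {type(o).__name__}")


# Hypothetical record using the same field types as a Faker profile.
record = {
    "birthdate": datetime.date(1992, 4, 24),
    "current_location": (
        decimal.Decimal("54.993302"),
        decimal.Decimal("-16.808277"),
    ),
}

text = json.dumps(record, default=to_jsonable)
print(text)
```

Raising `TypeError` for unknown types, instead of silently returning `None`, makes unsupported values fail loudly rather than being stored as `null`.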

### Make XML files

In [85]:
from json2xml import json2xml, readfromstring

In [86]:
with open('data/profiles.xml', 'w') as f:
    # dumps returns a string and takes no file argument
    data = readfromstring(simplejson.dumps(profiles, default=converter))
    f.write(json2xml.Json2xml({'employee': data}, wrapper="duke").to_xml())

### Make AVRO files

In [87]:
from rec_avro import to_rec_avro_destructive, rec_avro_schema

In [88]:
with open('data/profiles.json') as f:  # close the file once it has been read
    ps = simplejson.load(f)
avro_objects = [to_rec_avro_destructive(rec) for rec in ps]
with open('data/profiles.avro', 'wb') as f_out:
    fastavro.writer(f_out, fastavro.parse_schema(rec_avro_schema()), avro_objects)

### Make `pandas` data frame

In [89]:
df = pd.DataFrame(profiles)

In [90]:
df.iloc[0]

address             43457 كتوم Plains Suite 096\nبنانview, MD 48123
birthdate                                                1992-04-24
blood_group                                                      A+
company                                           المغاولة and Sons
current_location                            (54.993302, -16.808277)
job                                             Scientist, forensic
mail                                      syfldwynlrshd@hotmail.com
name                                            السيد موسى المشاولة
residence                   8475 جوريّة Brooks\nعرفانbury, MT 02323
sex                                                               M
ssn                                                     177-66-4460
username                                                      mlhjr
website                                            [http://al.com/]
Name: 0, dtype: object

### Make comma delimited files

In [91]:
df.to_csv('data/profiles.csv', index=False)

### Make tab-delimited files

In [92]:
df.to_csv('data/profiles.txt', index=False, sep='\t')

### Munge pandas data to be compatible with storage

In [93]:
df.birthdate = pd.to_datetime(df.birthdate)
df = (
    df.current_location.
    apply(pd.Series).
    merge(df, left_index=True, right_index=True).
    drop('current_location', axis=1).
    rename({0: 'location_x', 1: 'location_y'}, axis=1)
)
df['location_x'] = df['location_x'].astype('float')
df['location_y'] = df['location_y'].astype('float')
df.website = df.website.apply(lambda s: ','.join(s))

### Make HDF5 files

In [94]:
df.to_hdf('data/profiles.h5', key='duke')

### Make Parquet files

In [95]:
fastparquet.write('data/profiles.parq', df)

### Make SQLite3 database files

In [96]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///data/profiles.sqlite', echo=False)

In [97]:
df.to_sql('duke', con=engine, if_exists='replace', index_label='id')