# Python 101
## XIV. File content parsing

---

## I. `json` file parsing

Python's standard library includes the `json` module for parsing and serializing JSON.

In [None]:
import os
import json

### 1. Basic workflow

- transforming Python objects to a JSON string

In [None]:
data = {f'key{i}': [f'val{i}{j}' for j in range(10)] for i in range(10)}
json.dumps(data)
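
`json.dumps` also accepts formatting options. The following is a minimal sketch, using a small hypothetical `record` dict rather than the `data` object above, of how `indent` and `sort_keys` change the output and how Python types map to JSON types:

```python
import json

# Hypothetical sample object, used only for illustration
record = {'name': 'john', 'active': True, 'score': None, 'tags': ('a', 'b')}

# Compact output: True -> true, None -> null, tuple -> array
compact = json.dumps(record)
print(compact)

# Human-readable output with sorted keys and 2-space indentation
pretty = json.dumps(record, indent=2, sort_keys=True)
print(pretty)
```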

- serializing Python objects to a JSON file

In [None]:
with open('./data/test.json', 'w') as f:
    json.dump(data, f)

- loading from json string

In [None]:
data2 = json.loads('{"users": [{"name": "john", "mail": "john@doe.com"}, '
                   '{"name": "joe", "mail": "joe@doe.com"}], '
                   ' "groups": [{"name": "does", "users": ["john", "joe"]}]}')
data2
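
The value returned by `json.loads` is built from ordinary dicts and lists, so it can be navigated with normal indexing. Malformed input raises `json.JSONDecodeError`. A short sketch, with a hypothetical input string:

```python
import json

# Hypothetical input, shaped like the example above
text = '{"users": [{"name": "john"}, {"name": "joe"}]}'
parsed = json.loads(text)

# Plain dict and list access works on the parsed result
names = [user['name'] for user in parsed['users']]
print(names)

# Malformed JSON raises json.JSONDecodeError, a subclass of ValueError
error = None
try:
    json.loads('{"users": [}')
except json.JSONDecodeError as exc:
    error = exc
    print(f'Invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}')
```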

- loading from a JSON file

In [None]:
with open('./data/test.json', 'r') as f:
    data = json.load(f)
data
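
A round trip through JSON does not always return an identical object, because JSON has fewer types than Python. The sketch below writes to a temporary directory instead of `./data` (an assumption made so that it runs anywhere) and shows that non-string keys come back as strings and tuples come back as lists:

```python
import json
import os
import tempfile

# Hypothetical object with an int key and a tuple value
original = {1: ('a', 'b'), 'nested': {'ok': True}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'roundtrip.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(original, f)
    with open(path, 'r', encoding='utf-8') as f:
        restored = json.load(f)

# The int key is now the string '1' and the tuple is now a list
print(restored)
print(restored == original)
```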

### Exercise: Export the `users` table from `data/example.db` to a json file (`data/users.json`)

---

## II. Structured file parsing with `pandas`

`pandas` (from **pan**el **da**ta) is a third-party library for processing, transforming, and exporting structured data.

In [None]:
import pandas as pd

Reading from several file formats is supported (details <a href="https://pandas.pydata.org/pandas-docs/stable/io.html">here</a>):
- csv, tsv
- json
- sql
- excel
- pickle
- html
- parquet
- hdf5
- stata
- sas
- google big query
    
#### 1. Reading/writing CSV and TSV

In [None]:
cars = pd.read_csv('data/cars.csv', sep=',', index_col='brand')
cars.head()

In [None]:
cars.loc[cars.group == 'vw', ['group']].values.tolist()

- showing the first or last n lines

In [None]:
cars.head(3)

In [None]:
cars.tail()

- column names, index name, data shape

In [None]:
print('Data shape:', cars.shape, '| Index:', cars.index.name, 
      '| Columns:', cars.columns, '| Data types:', cars.dtypes)

- exporting data to tsv

In [None]:
cars.to_csv('data/cars.tsv', sep='\t')
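
The same read and write calls work on in-memory text, which is handy for trying out options such as `sep` and `index_col`. A minimal sketch, where `csv_text` is a hypothetical stand-in for `data/cars.csv` (its real columns may differ):

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of data/cars.csv
csv_text = 'brand,group,country\naudi,vw,de\nskoda,vw,cz\nvolvo,geely,se\n'
cars_demo = pd.read_csv(io.StringIO(csv_text), sep=',', index_col='brand')

# to_csv returns the text as a string when no path is given
tsv_text = cars_demo.to_csv(sep='\t')
print(tsv_text)

# Reading the TSV back with the matching separator restores the table
restored = pd.read_csv(io.StringIO(tsv_text), sep='\t', index_col='brand')
print(restored.equals(cars_demo))
```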

#### 2. Reading/writing Excel files

In [None]:
pd.read_excel('data/test.xlsx', sheet_name=0).head()

In [None]:
with pd.ExcelWriter('data/test.xlsx') as ew:
    cars.to_excel(ew, sheet_name='cars')
    # `data` is the dict loaded from test.json, so wrap it in a DataFrame first
    pd.DataFrame(data).to_excel(ew, sheet_name='test')

#### 3. Reading/writing JSON

In [None]:
data = pd.read_json('data/test.json')
data

In [None]:
data.to_json('tmp.json')
os.remove('tmp.json')
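
The layout of the JSON written by `to_json` depends on its `orient` parameter. A small sketch with a hypothetical two-row table, parsing the output back with the `json` module to inspect the structure:

```python
import json

import pandas as pd

# Hypothetical small table, used only for illustration
df = pd.DataFrame({'group': ['vw', 'geely']}, index=['audi', 'volvo'])

# Default for a DataFrame is orient='columns': {column: {index: value}}
by_columns = json.loads(df.to_json())

# orient='index' gives {index: {column: value}}
by_index = json.loads(df.to_json(orient='index'))

# orient='records' gives a list of row dicts and drops the index
records = json.loads(df.to_json(orient='records'))

print(by_columns, by_index, records, sep='\n')
```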

---

## III. xml file parsing with `lxml.etree`

The third-party library `lxml` supports XML processing through its `etree` submodule.



In [None]:
from lxml import etree

### 1. `Element` class
- create node

In [None]:
root = etree.Element('root')

- adding elements

In [None]:
root.append(etree.Element('child1'))

In [None]:
child2 = etree.SubElement(root, "child2")
child3 = etree.SubElement(root, "child3")

- displaying document

In [None]:
print(etree.tostring(root, pretty_print=True, encoding='unicode'))

- elements behave like lists

In [None]:
for element in root:
    print(element.tag)

In [None]:
root.append(etree.Element('child4'))

In [None]:
for element in root:
    print(element.tag)
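
Besides iteration and `append`, elements support indexing, slicing, `len` and `insert`. The sketch below rebuilds a small tree so that it is self-contained. The fallback import of the standard library's `xml.etree.ElementTree` is an assumption made so the example also runs without `lxml`.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library's ElementTree is a close enough stand-in
    import xml.etree.ElementTree as etree

root = etree.Element('root')
for name in ('child1', 'child2', 'child3'):
    etree.SubElement(root, name)

# Indexing returns a single element, slicing returns a list of elements
first = root[0]
last_two = root[1:]
print(first.tag, [el.tag for el in last_two])

# len() counts direct children only
print(len(root), len(first))

# insert() places a new element at the given position
root.insert(0, etree.Element('child0'))
tags = [el.tag for el in root]
print(tags)
```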

- checking for children

In [None]:
print(etree.iselement(root))
print(f'root has {len(root)} children, child2 has {len(child2)}')

- attributes are stored with the elements

In [None]:
child5 = etree.SubElement(root, 'child5', badass='very much')

In [None]:
etree.tostring(child5)

In [None]:
root.get('hello')

In [None]:
root.set('hello', 'aye')
print(etree.tostring(root, pretty_print=True, encoding='unicode'))

In [None]:
child2.text = "interesting text"

In [None]:
print(etree.tostring(root, pretty_print=True, encoding='unicode'))
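
The attribute and text calls used above can be summarized in one self-contained sketch. The element and attribute names are hypothetical, and the fallback import is an assumption made so the example also runs without `lxml`.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library's ElementTree is a close enough stand-in
    import xml.etree.ElementTree as etree

root = etree.Element('root', hello='aye')

# get() returns None for a missing attribute unless a default is given
missing = root.get('missing')
fallback = root.get('missing', 'n/a')

# set() adds or overwrites an attribute; values are passed as strings
root.set('count', '3')

# attrib is a dict-like view of all attributes of the element
attrs = dict(root.attrib)
print(attrs)

# text holds the content between the opening tag and the first child
child = etree.SubElement(root, 'child')
child.text = 'interesting text'
serialized = etree.tostring(root, encoding='unicode')
print(serialized)
```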

- tree iteration

In [None]:
root = etree.Element("root")
etree.SubElement(root, "child").text = "Child 1"
etree.SubElement(root, "child").text = "Child 2"
etree.SubElement(root, "another").text = "Child 3"
etree.SubElement(root, "child").text = "Child 1"
etree.SubElement(root, "child").text = "Child 2"

In [None]:
for element in root.iter():
    print(f'{element.tag}, {element.text}')

In [None]:
for element in root.iter("child"):
    print(f'{element.tag}, {element.text}')

In [None]:
for element in root.iter("another", "child"):
    print(f'{element.tag}, {element.text}')
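
`iter()` visits the whole subtree, while `find()` and `findall()` with a bare tag name look at direct children only. The sketch below contrasts them on a hypothetical nested document. It filters on a single tag because passing several tags to `iter()`, as in the cell above, works in `lxml` but not in the standard library module used here as a fallback.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library's ElementTree is a close enough stand-in
    import xml.etree.ElementTree as etree

root = etree.fromstring(
    '<root><child>Child 1</child>'
    '<group><child>Child 2</child></group></root>'
)

# iter() walks the whole subtree, at any depth
all_children = [el.text for el in root.iter('child')]

# findall() with a bare tag name only matches direct children
direct_children = [el.text for el in root.findall('child')]

# A './/' prefix makes the search recursive
nested_children = [el.text for el in root.findall('.//child')]

# find() returns the first match or None; findtext() returns its text
group = root.find('group')
missing = root.find('missing')
nested_text = root.findtext('group/child')

print(all_children, direct_children, nested_children, nested_text)
```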

### 2. `ElementTree` class

In [None]:
root = etree.XML('''\
<?xml version="1.0"?>
<!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "parsnips"> ]>
<root>
  <a>&tasty;</a>
</root>
''')

tree = etree.ElementTree(root)
print(etree.tostring(tree, encoding='unicode'))

- parsing xml string

In [None]:
some_xml_data = (b'<root hello="aye"><child1/><child2>interesting text</child2>'
                 b'<child3/><child4/><child5 badass="very much"/></root>')
root = etree.fromstring(some_xml_data)

In [None]:
for elem in root:
    print(elem.tag, elem.text)
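
A parsed tree can be turned into plain dicts and lists, which the `json` module can then serialize. The following is one possible sketch, not the only sensible mapping: `element_to_dict` is a hypothetical helper, the output layout is an arbitrary choice, and the fallback import is an assumption made so the example also runs without `lxml`.

```python
import json

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library's ElementTree is a close enough stand-in
    import xml.etree.ElementTree as etree


def element_to_dict(element):
    """Hypothetical helper: recursively convert an element to a dict."""
    return {
        'tag': element.tag,
        'attributes': dict(element.attrib),
        'text': element.text,
        'children': [element_to_dict(child) for child in element],
    }


some_xml_data = (b'<root hello="aye"><child1/><child2>interesting text</child2>'
                 b'<child5 badass="very much"/></root>')
root = etree.fromstring(some_xml_data)

as_dict = element_to_dict(root)
print(json.dumps(as_dict, indent=2))
```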

---

## IV. Exercises

### 1. Export the data in `data/movies.xml` to a JSON file