# Python 101
## XIV. File content parsing

---

## I. `json` file parsing

Python's standard library provides the `json` module for reading and writing json data.

In [1]:
import os
import json

### 1. Basic workflow

- transforming python objects to json string

In [2]:
data = {f'key{i}': [f'val{i}{j}' for j in range(10)] for i in range(10)}
json.dumps(data)

'{"key0": ["val00", "val01", "val02", "val03", "val04", "val05", "val06", "val07", "val08", "val09"], "key1": ["val10", "val11", "val12", "val13", "val14", "val15", "val16", "val17", "val18", "val19"], "key2": ["val20", "val21", "val22", "val23", "val24", "val25", "val26", "val27", "val28", "val29"], "key3": ["val30", "val31", "val32", "val33", "val34", "val35", "val36", "val37", "val38", "val39"], "key4": ["val40", "val41", "val42", "val43", "val44", "val45", "val46", "val47", "val48", "val49"], "key5": ["val50", "val51", "val52", "val53", "val54", "val55", "val56", "val57", "val58", "val59"], "key6": ["val60", "val61", "val62", "val63", "val64", "val65", "val66", "val67", "val68", "val69"], "key7": ["val70", "val71", "val72", "val73", "val74", "val75", "val76", "val77", "val78", "val79"], "key8": ["val80", "val81", "val82", "val83", "val84", "val85", "val86", "val87", "val88", "val89"], "key9": ["val90", "val91", "val92", "val93", "val94", "val95", "val96", "val97", "val98", "val99"]}'
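
The single-line string above is hard to read. As a small sketch, `json.dumps` also accepts the `indent` and `sort_keys` parameters for human-readable output; the `record` dictionary below is made up for illustration.

```python
import json

# Hypothetical record, used only for illustration.
record = {"name": "john", "tags": ["a", "b"], "active": True, "score": None}

# indent=2 pretty-prints; sort_keys=True makes the key order deterministic.
pretty = json.dumps(record, indent=2, sort_keys=True)
print(pretty)
```

Note that Python's `True` and `None` are written as `true` and `null` in the json text.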

- serializing python objects to json file

In [3]:
with open('./data/test.json', 'w') as f:
    json.dump(data, f)
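
Opening a file with `'w'` fails if its parent directory does not exist yet. A minimal sketch of a more defensive write, assuming a throwaway temporary directory instead of the notebook's `./data` folder and a made-up `payload`:

```python
import json
import tempfile
from pathlib import Path

# Made-up payload with a non-ASCII character.
payload = {"key0": ["val00", "val01"], "city": "Zürich"}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "test.json"
    # Create missing parent directories before opening the file.
    target.parent.mkdir(parents=True, exist_ok=True)
    # An explicit encoding avoids platform-dependent defaults;
    # ensure_ascii=False keeps non-ASCII characters readable in the file.
    with target.open("w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)
    written = target.read_text(encoding="utf-8")

print(written)
```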

- loading from json string

In [4]:
data2 = json.loads('{"users": [{"name": "john", "mail": "john@doe.com"}, {"name": "joe", "mail": "joe@doe.com"}],'
                   ' "groups": [{"name": "does", "users": ["john", "joe"]}]}')
data2

{'users': [{'name': 'john', 'mail': 'john@doe.com'},
  {'name': 'joe', 'mail': 'joe@doe.com'}],
 'groups': [{'name': 'does', 'users': ['john', 'joe']}]}
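
`json.loads` raises `json.JSONDecodeError` (a subclass of `ValueError`) when the input is not valid json. A hedged sketch of catching it; `parse_or_none` is a hypothetical helper name, not part of the `json` module:

```python
import json


def parse_or_none(text):
    """Return the decoded object, or None if text is not valid json."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno point at the position where decoding failed.
        print(f"invalid json at line {err.lineno}, column {err.colno}: {err.msg}")
        return None


good = parse_or_none('{"name": "john", "active": true}')
bad = parse_or_none("{'name': 'john'}")  # single quotes are not valid json
```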

- loading from a json file

In [5]:
with open('./data/test.json', 'r') as f:
    data = json.load(f)
data

{'key0': ['val00',
  'val01',
  'val02',
  'val03',
  'val04',
  'val05',
  'val06',
  'val07',
  'val08',
  'val09'],
 'key1': ['val10',
  'val11',
  'val12',
  'val13',
  'val14',
  'val15',
  'val16',
  'val17',
  'val18',
  'val19'],
 'key2': ['val20',
  'val21',
  'val22',
  'val23',
  'val24',
  'val25',
  'val26',
  'val27',
  'val28',
  'val29'],
 'key3': ['val30',
  'val31',
  'val32',
  'val33',
  'val34',
  'val35',
  'val36',
  'val37',
  'val38',
  'val39'],
 'key4': ['val40',
  'val41',
  'val42',
  'val43',
  'val44',
  'val45',
  'val46',
  'val47',
  'val48',
  'val49'],
 'key5': ['val50',
  'val51',
  'val52',
  'val53',
  'val54',
  'val55',
  'val56',
  'val57',
  'val58',
  'val59'],
 'key6': ['val60',
  'val61',
  'val62',
  'val63',
  'val64',
  'val65',
  'val66',
  'val67',
  'val68',
  'val69'],
 'key7': ['val70',
  'val71',
  'val72',
  'val73',
  'val74',
  'val75',
  'val76',
  'val77',
  'val78',
  'val79'],
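 'key8': ['val80',
  'val81',
  'val82',
  'val83',
  'val84',
  'val85',
  'val86',
  'val87',
  'val88',
  'val89'],
 'key9': ['val90',
  'val91',
  'val92',
  'val93',
  'val94',
  'val95',
  'val96',
  'val97',
  'val98',
  'val99']}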
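
A round trip through json does not always return the original object, because json has fewer types than Python. A small illustration with made-up data:

```python
import json

# Made-up object with an integer key and a tuple value.
original = {1: ("a", "b"), "nested": {"ok": True}}
restored = json.loads(json.dumps(original))

# Integer keys come back as strings and tuples come back as lists.
print(restored)
```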

### Exercise: Export the `users` table from `data/example.db` to a json file (`data/users.json`)
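
One possible shape of a solution, sketched with the standard-library `sqlite3` module. The real schema of `data/example.db` is not shown here, so the example builds an in-memory database with assumed `name` and `mail` columns and serializes to a string; an actual solution would connect to the file and call `json.dump` on an open `data/users.json`.

```python
import json
import sqlite3

# In-memory stand-in for data/example.db; the columns are an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, mail TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("john", "john@doe.com"), ("joe", "joe@doe.com")],
)

# sqlite3.Row lets each row be converted to a dict keyed by column name.
conn.row_factory = sqlite3.Row
rows = [dict(row) for row in conn.execute("SELECT * FROM users")]
conn.close()

users_json = json.dumps(rows, indent=2)
print(users_json)
```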

---

## II. Structured file parsing with `pandas`

`pandas` (the name derives from **pan**el **da**ta) is a third-party library for processing, transforming and exporting structured data.

In [13]:
import pandas as pd

Reading from several file formats is supported (details <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html">here</a>):
- csv, tsv
- json
- sql
- excel
- pickle
- html
- parquet
- hdf5
- stata
- sas
- google big query
    
#### 1. Reading/writing CSV and TSV

In [14]:
cars = pd.read_csv('data/cars.csv', sep=',', index_col='brand')
cars.head()

| brand | group |
|---|---|
| audi | vw |
| bugatti | vw |
| porsche | vw |
| seat | vw |
| skoda | vw |
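
`read_csv` accepts any file-like object, which makes it easy to try the `sep` and `index_col` parameters without a file on disk. A minimal sketch, assuming a made-up excerpt with the same two columns as `cars.csv`:

```python
import io

import pandas as pd

# Made-up sample with the same two columns as cars.csv.
csv_text = "brand,group\naudi,vw\nseat,vw\nmini,bmw\n"

sample = pd.read_csv(io.StringIO(csv_text), sep=",", index_col="brand")
print(sample)

# The index holds the brands, so a value can be looked up by row label.
print(sample.loc["mini", "group"])
```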


In [25]:
cars.loc[cars.group == 'vw', ['group']].values.tolist()

[['vw'], ['vw'], ['vw'], ['vw'], ['vw'], ['vw']]
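
The expression above combines a boolean mask with a column selection. A step-by-step sketch of the same idea on made-up data:

```python
import pandas as pd

# Made-up frame in the same shape as `cars`.
sample = pd.DataFrame(
    {"group": ["vw", "vw", "bmw"]},
    index=pd.Index(["audi", "seat", "mini"], name="brand"),
)

# Step 1: comparing a column to a value yields a boolean Series.
mask = sample["group"] == "vw"
# Step 2: .loc[rows, columns] keeps the rows where the mask is True.
selected = sample.loc[mask, ["group"]]
# A list of column labels keeps a DataFrame, so tolist() is nested.
print(selected.values.tolist())
# A single column label gives a Series and a flat list instead.
print(sample.loc[mask, "group"].tolist())
```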

- showing the first or last n rows

In [27]:
cars.head(3)

| brand | group |
|---|---|
| audi | vw |
| bugatti | vw |
| porsche | vw |


In [None]:
cars.tail()

- column names, index name, data shape

In [28]:
print('Data shape:', cars.shape, '| Index:', cars.index.name, 
      '| Columns:', cars.columns, '| Data types:', cars.dtypes)

Data shape: (36, 1) | Index: brand | Columns: Index(['group'], dtype='object') | Data types: group    object
dtype: object


- exporting data to tsv

In [29]:
cars.to_csv('data/cars.tsv', sep='\t')
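
Reading a TSV file back needs the same separator that was used for writing. A short round-trip sketch on made-up data, kept in memory rather than written to `data/cars.tsv`:

```python
import io

import pandas as pd

# Made-up frame in the same shape as `cars`.
sample = pd.DataFrame(
    {"group": ["vw", "bmw"]},
    index=pd.Index(["audi", "mini"], name="brand"),
)

# With no path argument, to_csv returns the text instead of writing a file.
tsv_text = sample.to_csv(sep="\t")
print(repr(tsv_text))

# Reading it back uses the same separator and restores the index column.
restored = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col="brand")
```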

#### 2. Reading/writing Excel files

In [30]:
pd.read_excel('data/test.xlsx', sheet_name=0).head()

|   | brand | group |
|---|---|---|
| 0 | audi | vw |
| 1 | bugatti | vw |
| 2 | porsche | vw |
| 3 | seat | vw |
| 4 | skoda | vw |


In [None]:
with pd.ExcelWriter('data/test.xlsx') as ew:
    cars.to_excel(ew, sheet_name='cars')
    # data is a plain dict at this point, so wrap it in a DataFrame first
    pd.DataFrame(data).to_excel(ew, sheet_name='test')

#### 3. Reading/writing json files

In [32]:
data = pd.read_json('data/test.json')
data

|   | key0 | key1 | key2 | key3 | key4 | key5 | key6 | key7 | key8 | key9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | val00 | val10 | val20 | val30 | val40 | val50 | val60 | val70 | val80 | val90 |
| 1 | val01 | val11 | val21 | val31 | val41 | val51 | val61 | val71 | val81 | val91 |
| 2 | val02 | val12 | val22 | val32 | val42 | val52 | val62 | val72 | val82 | val92 |
| 3 | val03 | val13 | val23 | val33 | val43 | val53 | val63 | val73 | val83 | val93 |
| 4 | val04 | val14 | val24 | val34 | val44 | val54 | val64 | val74 | val84 | val94 |
| 5 | val05 | val15 | val25 | val35 | val45 | val55 | val65 | val75 | val85 | val95 |
| 6 | val06 | val16 | val26 | val36 | val46 | val56 | val66 | val76 | val86 | val96 |
| 7 | val07 | val17 | val27 | val37 | val47 | val57 | val67 | val77 | val87 | val97 |
| 8 | val08 | val18 | val28 | val38 | val48 | val58 | val68 | val78 | val88 | val98 |
| 9 | val09 | val19 | val29 | val39 | val49 | val59 | val69 | val79 | val89 | val99 |


In [34]:
data.to_json('tmp.json')
os.remove('tmp.json')
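
The layout of the json produced by `to_json` depends on the `orient` parameter. A short sketch comparing the default column-oriented layout with `orient='records'`, on a made-up frame:

```python
import io
import json

import pandas as pd

# Made-up frame with two columns and the default integer index.
frame = pd.DataFrame({"key0": ["val00", "val01"], "key1": ["val10", "val11"]})

# Default for a DataFrame: {column -> {index -> value}}.
by_column = frame.to_json()
# orient='records': a list with one object per row.
by_row = frame.to_json(orient="records")
print(by_column)
print(by_row)

# read_json needs the matching orient; StringIO wraps the in-memory text.
restored = pd.read_json(io.StringIO(by_row), orient="records")
```

The records layout drops the index, which is usually fine for a default integer index but loses information for a meaningful one.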

---

## III. xml file parsing with `lxml.etree`

The third-party library `lxml` supports xml processing through its `etree` submodule.



In [35]:
from lxml import etree

### 1. `Element` class
- create node

In [36]:
root = etree.Element('root')

- adding elements

In [37]:
root.append(etree.Element('child1'))

In [38]:
child2 = etree.SubElement(root, "child2")
child3 = etree.SubElement(root, "child3")

- displaying document

In [39]:
print(etree.tostring(root, pretty_print=True, encoding='unicode'))

<root>
  <child1/>
  <child2/>
  <child3/>
</root>



- elements work like lists

In [40]:
for element in root:
    print(element.tag)

child1
child2
child3


In [41]:
root.append(etree.Element('child4'))

In [42]:
for element in root:
    print(element.tag)

child1
child2
child3
child4


- checking for children

In [43]:
print(etree.iselement(root))
print(f'root has {len(root)} child, child2 has {len(child2)}')

True
root has 4 child, child2 has 0
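
The list analogy extends to indexing, slicing and `insert`. A minimal sketch on a fresh tree; one difference from a real list is that an element has a single parent, so appending an existing child moves it instead of copying it:

```python
from lxml import etree

root = etree.Element("root")
for name in ("child1", "child2", "child3"):
    etree.SubElement(root, name)

# Indexing and slicing work as on a list.
first = root[0]
last_two = root[-2:]
# insert() places a new element at a given position.
root.insert(0, etree.Element("child0"))
# Appending an element that is already in the tree moves it to the end.
root.append(root[0])

tags = [child.tag for child in root]
print(tags)
```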


- attributes are stored with the elements

In [44]:
child5 = etree.SubElement(root, 'child5', badass='very much')

In [45]:
etree.tostring(child5)

b'<child5 badass="very much"/>'

In [46]:
root.get('hello')  # returns None: root has no 'hello' attribute yet

In [54]:
root.set('hello', 'aye')
print(etree.tostring(root, pretty_print=True, encoding='unicode'))

<root hello="aye">
  <child1/>
  <child2/>
  <child3/>
  <child4/>
  <child5 badass="very much"/>
</root>
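
Besides `get` and `set`, attributes can be handled through the dictionary-like `attrib` property. A small sketch on a standalone element:

```python
from lxml import etree

node = etree.Element("child5", badass="very much")

# get() returns None, or the given default, for a missing attribute.
missing = node.get("hello")
fallback = node.get("hello", "n/a")

# attrib behaves like a dictionary that is bound to the element.
node.attrib["hello"] = "aye"
pairs = sorted(node.items())
print(pairs)
print(etree.tostring(node))
```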



In [55]:
child2.text = "interesting text"

In [57]:
print(etree.tostring(root, pretty_print=True, encoding='unicode'))

<root hello="aye">
  <child1/>
  <child2>interesting text</child2>
  <child3/>
  <child4/>
  <child5 badass="very much"/>
</root>



- tree iteration

In [60]:
root = etree.Element("root")
etree.SubElement(root, "child").text = "Child 1"
etree.SubElement(root, "child").text = "Child 2"
etree.SubElement(root, "another").text = "Child 3"
etree.SubElement(root, "child").text = "Child 1"
etree.SubElement(root, "child").text = "Child 2"

In [61]:
for element in root.iter():
    print(f'{element.tag}, {element.text}')

root, None
child, Child 1
child, Child 2
another, Child 3
child, Child 1
child, Child 2


In [62]:
for element in root.iter("child"):
    print(f'{element.tag}, {element.text}')

child, Child 1
child, Child 2
child, Child 1
child, Child 2


In [63]:
for element in root.iter("another", "child"):
    print(f'{element.tag}, {element.text}')

child, Child 1
child, Child 2
another, Child 3
child, Child 1
child, Child 2
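
`iter` walks the whole subtree, while `findall` with a plain tag name only looks at direct children. A sketch of the difference on a made-up document with one nested `child`:

```python
from lxml import etree

# Made-up document: one direct <child>, one nested inside <group>.
root = etree.fromstring(
    "<root><child>Child 1</child><group><child>Nested</child></group>"
    "<another>Child 3</another></root>"
)

# iter() visits matching elements at any depth, in document order.
all_children = [el.text for el in root.iter("child")]
# findall() with a plain tag name returns direct children only.
direct_children = [el.text for el in root.findall("child")]
# findtext() returns the text of the first matching direct child.
another_text = root.findtext("another")

print(all_children, direct_children, another_text)
```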


- `ElementTree` class

In [64]:
root = etree.XML('''\
<?xml version="1.0"?>
<!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "parsnips"> ]>
<root>
  <a>&tasty;</a>
</root>
''')

tree = etree.ElementTree(root)
print(etree.tostring(tree, encoding='unicode'))

<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "parsnips">
]>
<root>
  <a>parsnips</a>
</root>


- parsing xml string

In [65]:
some_xml_data = (b'<root hello="aye"><child1/><child2>interesting text</child2>'
                 b'<child3/><child4/><child5 badass="very much"/></root>')
root = etree.fromstring(some_xml_data)

In [66]:
for elem in root:
    print(elem.tag, elem.text)

child1 None
child2 interesting text
child3 None
child4 None
child5 None
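
For files, `etree.parse` takes a path or a file-like object and returns an `ElementTree`. Malformed input raises `etree.XMLSyntaxError`. A hedged sketch of both, using in-memory bytes instead of a real file:

```python
import io

from lxml import etree

# etree.parse() reads a file-like object and returns an ElementTree.
tree = etree.parse(io.BytesIO(b"<root><child1/><child2>text</child2></root>"))
root = tree.getroot()
child_tags = [el.tag for el in root]
print(child_tags)

# The <child1> tag is never closed, so parsing fails.
try:
    etree.fromstring(b"<root><child1></root>")
    error_message = None
except etree.XMLSyntaxError as err:
    error_message = str(err)
print(error_message)
```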


---

## IV. Exercises

### 1. Export from data/movies.xml into a json file
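
One possible starting point, not a reference solution. The layout of `data/movies.xml` is not shown here, so the sketch uses a made-up stand-in document and a hypothetical helper, `element_to_dict`, which assumes the document contains no comments or processing instructions. A real solution would use `etree.parse` on the file and `json.dump` to write the result.

```python
import json

from lxml import etree


def element_to_dict(element):
    """Convert an element into a plain dict that json can serialize."""
    return {
        "tag": element.tag,
        "attributes": dict(element.attrib),
        "text": (element.text or "").strip(),
        "children": [element_to_dict(child) for child in element],
    }


# Stand-in document; the real movies.xml may look different.
sample = etree.fromstring(
    '<movies><movie year="1999"><title>Example</title></movie></movies>'
)
as_json = json.dumps(element_to_dict(sample), indent=2)
print(as_json)
```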