# Lecture 14

## Thursday, October 18th 2018

* Comments on security
* Input Files
  * XML
  * YAML
  * JSON
* Comments on *pickling*

In [None]:
from IPython.display import HTML

# Brief Comments on Security
* Digital security is a HUGE field encompassing protection of identity, assets, and technology, among other things
* We will briefly discuss authentication, which touches on identity protection
* How do you keep your potentially sensitive code secure?
* It's not enough to make your repo private; someone could steal your password and get access to all your secret recipes
* Multi-factor authentication is a proposed solution

## Multi-Factor Authentication
* The basic motivation is centered on how to confirm someone's identity
* Before a user is granted access to a resource, they must provide two or more pieces of evidence to an authentication mechanism
* **Two-factor** authentication requires exactly two pieces of evidence, for example:
  - Password
  - Authentication code sent to the user from the authentication system
* There are pros and cons to multi-factor authentication

## Comments on Two-Factor Authentication
* Most of you have used two-factor authentication in one way or another
* *Gmail* supports two-factor authentication (2-Step Verification) for signing in to email
* Supercomputers often require two-factor authentication to log in to the machine
* Typical workflow:
  - Enter password
  - Receive a new passcode on your mobile device through an app that is connected to the authentication mechanism used by the service
  - Enter new passcode to gain entry to the system
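
The passcode step in this workflow can be sketched with a time-based one-time password (TOTP, RFC 6238), which is one common scheme behind authenticator apps. Whether a particular service uses TOTP is an assumption here. The helper name `totp` is hypothetical, and the secret is the published RFC test value, not a real credential.

```python
import hashlib
import hmac
import struct
import time


def totp(secret, for_time=None, step=30, digits=6):
    """Return a time-based one-time passcode (RFC 6238 with HMAC-SHA1)."""
    if for_time is None:
        for_time = time.time()
    # Number of `step`-second intervals since the Unix epoch
    counter = int(for_time) // step
    # HMAC-SHA1 of the counter packed as an 8-byte big-endian integer
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte select a 4-byte window
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    # Keep the last `digits` decimal digits, zero-padded
    return str(code % 10 ** digits).zfill(digits)


# RFC 6238 test secret, for illustration only
shared_secret = b"12345678901234567890"
print(totp(shared_secret))  # the value changes every 30 seconds
```

In this scheme the service and the app on your phone hold the same secret, so both can compute the current code independently and the service only has to compare the two.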

## Some pros and cons

**Pros**
* Provides an extra layer of security
* As long as you have your phone, you can receive a secure passcode
* Passcodes are freshly generated, so they change every time you want to log in

**Cons**: Relating to the common use of authentication via mobile phone
* SMS messages can be intercepted
* People lose their mobile phones; phones can also be stolen
* Mobile phones aren't the most secure devices (e.g. email is always logged in)

## Two-factor authentication on `GitHub`
* You may want to consider setting up two-factor authentication (2FA) for your `GitHub` account
* It will help you start thinking about security in the real world
* Here are some resources:
  - [Securing your account with two-factor authentication (2FA)](https://help.github.com/articles/securing-your-account-with-two-factor-authentication-2fa/)
  - [Two-factor authentication](https://blog.github.com/2013-09-03-two-factor-authentication/)
  - [About two-factor authentication](https://help.github.com/articles/about-two-factor-authentication/)

# Input Files and Parsing

We usually want to read data into our software:
* Input parameters to the code (e.g. time step, linear algebra solvers, physical parameters, etc.)
* Input fields (e.g. fields to visualize)
* Calibration data
* $\vdots$

This data can be provided by us, or the client, or come from a database somewhere.

There are *many* ways of reading in and parsing data.  In fact, this is often a non-trivial exercise depending on the quality of the data as well as its size.

## XML Intro

```xml
<?xml version="1.0"?>

<ctml>

    <reactionData id="test_mechanism">

        <!-- reaction 01  -->
        <reaction reversible="yes" type="Elementary" id="reaction01">
            <equation>H + O2 [=] OH + O</equation>
            <rateCoeff>
                <Kooij>
                    <A units="cm3/mol/s">3.52e+16</A>
                    <b>-0.7</b>
                    <E units="kJ/mol">71.4</E>
                </Kooij>
            </rateCoeff>
            <reactants>H:1 O2:1</reactants>
            <products>OH:1 O:1</products>
        </reaction>

        <!-- reaction 02 -->
        <reaction reversible="yes" type="Elementary" id="reaction02">
            <equation>H2 + O [=] OH + H</equation>
            <rateCoeff>
                <Kooij>
                    <A units="cm3/mol/s">5.06e+4</A>
                    <b>2.7</b>
                    <E units="kJ/mol">26.3</E>
                </Kooij>
            </rateCoeff>
            <reactants>H2:1 O:1</reactants>
            <products>OH:1 H:1</products>
        </reaction>

    </reactionData>

</ctml>
```

## What is XML?

**Note:** Material presented here taken from the following sources
* [w3schools XML tutorial](https://www.w3schools.com/xml/default.asp)
* [`Python` `xml.etree.ElementTree` documentation](https://docs.python.org/2/library/xml.etree.elementtree.html)
* [`XML` Documentation](https://www.w3.org/TR/2008/REC-xml-20081126/)
* [`XML` Wikipedia Page](https://en.wikipedia.org/wiki/XML)

Some basic `XML` comments:
* XML stands for `Extensible Markup Language`
* XML is just information wrapped in tags
* It doesn't *do* anything per se
* Its format is both machine- and human-readable

## Some Basic `XML` Anatomy

```xml
<?xml version="1.0" encoding="UTF-8"?> <!-- This is the optional XML prolog; if present, it must come first -->
<!-- This is an XML comment -->

<dogshelter> <!-- This is the root element -->
    <dog id="dog1"> <!-- This is the first child element.
                         It has an `id` attribute -->
        <name> Cloe </name> <!-- First subchild element -->
        <age> 3 </age> <!-- Second subchild element -->
        <breed> Border Collie </breed>
        <playgroup> Yes </playgroup>
    </dog>
    <dog id="dog2"> 
        <name> Karl </name> 
        <age> 7 </age>
        <breed> Beagle </breed>
        <playgroup> Yes </playgroup>
    </dog>
</dogshelter>
```

Note that all `XML` elements have a closing tag!

## Some More Basic `XML` Anatomy
See the [w3schools XML tutorial](https://www.w3schools.com/xml/default.asp) for a very nice summary of the essential `XML` rules.

`XML` elements:  a few things to be aware of:
* Elements can contain text, attributes, and other elements
* `XML` names are case sensitive and cannot contain spaces
* Be consistent in your naming convention

`XML` attributes:  a few things to be aware of:
* `XML` attributes must be in quotes
* There are no rules about when to use elements or attributes
  - You could make an attribute an element and it might read better
* Rule of thumb:  Data should be stored as elements.  Metadata should be stored as attributes.
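
As a small illustration of the rule of thumb, the sketch below stores the same dog's age once as an element and once as an attribute, then reads both with `xml.etree.ElementTree`. The two `XML` snippets, including the `units` attribute, are made up for this example.

```python
import xml.etree.ElementTree as ET

# Age stored as data in a child element, with the unit as metadata in an attribute
as_element = ET.fromstring('<dog id="dog1"><age units="years">3</age></dog>')
# Age stored as an attribute of the dog element itself
as_attribute = ET.fromstring('<dog id="dog1" age="3"/>')

print(as_element.find('age').text)          # element text is a string
print(as_element.find('age').get('units'))  # metadata about the age
print(as_attribute.get('age'))              # attribute values are strings too
```

Either layout parses fine. The element version leaves room to attach metadata such as units to the value later.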

## Python and `XML`
We will use the `xml.etree.ElementTree` module to read in and parse `XML` input files in `Python`.

A very nice tutorial can be found in the [`Python` `ElementTree` documentation](https://docs.python.org/2/library/xml.etree.elementtree.html#module-xml.etree.ElementTree).

We'll work with the `shelterdogs.xml` file to start.

In [None]:
import xml.etree.ElementTree as ET
tree = ET.parse('shelterdogs.xml')
dogshelter = tree.getroot()


print(dogshelter)
print(dogshelter.tag)
print(dogshelter.attrib)

### Looping Over Child Elements

In [None]:
for child in dogshelter:
    print(child.tag, child.attrib)

### Accessing Children by Index

In [None]:
print(dogshelter[0][0].text)

In [None]:
print(dogshelter[1][0].text)

In [None]:
print(dogshelter[0][2].text)

### The `Element.iter()` Method
From the documentation:
> Creates a tree iterator with the current element as the root. The iterator iterates over this element and all elements below it, in document (depth first) order. 

In [None]:
for age in dogshelter.iter('age'):
    print(age.text)

### The `Element.findall()` Method
From the documentation:
> Finds all matching subelements, by tag name or path. Returns a list containing all matching elements in document order.

In [None]:
print(dogshelter.findall('dog'))
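
The difference between `iter()` and `findall()` is easy to miss, so here is a self-contained sketch. It assumes a shortened version of the dog shelter data, embedded as a string so that no file is needed.

```python
import xml.etree.ElementTree as ET

shelter_xml = """
<dogshelter>
    <dog id="dog1"><name>Cloe</name><age>3</age></dog>
    <dog id="dog2"><name>Karl</name><age>7</age></dog>
</dogshelter>
"""
root = ET.fromstring(shelter_xml)

# findall() with a bare tag name only searches direct children of the element
print(len(root.findall('dog')))  # the two <dog> children
print(len(root.findall('age')))  # 0, because <age> is a grandchild of the root

# iter() walks the whole subtree, so it reaches the nested <age> elements
print([age.text for age in root.iter('age')])

# A path expression lets findall() reach deeper levels
print([age.text for age in root.findall('dog/age')])
```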

In [None]:
for dog in dogshelter.findall('dog'): # Iterate over each child
    print('ID:  {}'.format(dog.get('id'))) # Use the get() method to get the attribute of the child
    print('----------')
    
    print('Name:  {}'.format(dog.find('name').text)) # Use the find() method to find a specific subchild

    age_element = dog.find('age')
    age = float(age_element.text)
    # attrib is a dict, so look for 'months' among the attribute values
    if 'months' in age_element.attrib.values():
        years = age / 12.0
        print('Age: {} years'.format(years))
    else:
        print('Age: {} years'.format(age))
    
    print('Breed: {}'.format(dog.find('breed').text))
    
    if (dog.find('playgroup').text.split()[0] == 'Yes'):
        print('PLAYGROUP')
    else:
        print('NO PLAYGROUP')
    print('\n::::::::::::::::::::\n')
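
The same methods apply to the reaction file from the `XML` intro. The sketch below embeds a shortened copy of the first reaction as a string and pulls out the rate coefficient parameters. The dictionary layout and the variable names are illustrative choices, not a prescribed design.

```python
import xml.etree.ElementTree as ET

ctml = """
<ctml>
    <reactionData id="test_mechanism">
        <reaction reversible="yes" type="Elementary" id="reaction01">
            <equation>H + O2 [=] OH + O</equation>
            <rateCoeff>
                <Kooij>
                    <A units="cm3/mol/s">3.52e+16</A>
                    <b>-0.7</b>
                    <E units="kJ/mol">71.4</E>
                </Kooij>
            </rateCoeff>
        </reaction>
    </reactionData>
</ctml>
"""
root = ET.fromstring(ctml)

reactions = {}
for reaction in root.iter('reaction'):
    kooij = reaction.find('rateCoeff/Kooij')
    reactions[reaction.get('id')] = {
        'equation': reaction.find('equation').text,
        'reversible': reaction.get('reversible') == 'yes',
        # Element text is a string, so convert the numbers explicitly
        'A': float(kooij.find('A').text),
        'b': float(kooij.find('b').text),
        'E': float(kooij.find('E').text),
        'E_units': kooij.find('E').get('units'),
    }

print(reactions['reaction01'])
```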

# What is JSON?
* Stands for **JavaScript Object Notation**
* It's actually language agnostic
  - No need to learn JavaScript to use it
* Like XML, it's a human-readable format

## Some Basic `JSON` Anatomy

```json
{
    "dogShelter": "MSPCA-Angell",
    "dogs": [
        {
            "name": "Cloe",
            "age": 3,
            "breed": "Border Collie",
            "attendPlaygroup": "Yes"
        },
        {
            "name": "Karl",
            "age": 7,
            "breed": "Beagle",
            "attendPlaygroup": "Yes"
        }
    ]
}

```

## `JSON` and `Python`
* `Python` supports `JSON` through the `json` module in the standard library
* Saving `Python` data to `JSON` format is called *serialization*
* Loading a `JSON` file into `Python` data is called *deserialization*
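
Both directions can be tried without any files by using the string versions of the functions, `json.dumps` and `json.loads`. This is a minimal sketch; the `adopted` and `owner` fields are made up to show how `Python` values map onto `JSON` values.

```python
import json

dog = {"name": "Cloe", "age": 3, "breed": "Border Collie",
       "adopted": False, "owner": None}

# Serialization: Python object -> JSON text (False becomes false, None becomes null)
text = json.dumps(dog)
print(text)

# Deserialization: JSON text -> Python object
restored = json.loads(text)
print(restored == dog)
```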

## Deserialization
Since we're interested in reading in some fancy input file, we'll begin by discussing deserialization.

We'll work with the `shelterdogs.json` file.

In [None]:
import json
with open("shelterdogs.json", "r") as shelterdogs_file:
    shelterdogs = json.load(shelterdogs_file)

In [None]:
print(shelterdogs["dogs"])

In [None]:
print(type(shelterdogs))

### Comments on Deserialization
That was pretty nice!  We got a `Python` dictionary out.  We sure know how to work with `Python` dictionaries.
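
Here is a short sketch of what working with that dictionary might look like. It assumes the file contents match the `JSON` anatomy example and loads them from a string so that the snippet runs on its own.

```python
import json

shelter_text = """
{
    "dogShelter": "MSPCA-Angell",
    "dogs": [
        {"name": "Cloe", "age": 3, "breed": "Border Collie", "attendPlaygroup": "Yes"},
        {"name": "Karl", "age": 7, "breed": "Beagle", "attendPlaygroup": "Yes"}
    ]
}
"""
shelter = json.loads(shelter_text)

# JSON objects become dicts, arrays become lists, and numbers become int or float
for dog in shelter["dogs"]:
    print("{} ({}), age {}".format(dog["name"], dog["breed"], dog["age"]))

oldest = max(shelter["dogs"], key=lambda dog: dog["age"])
print("Oldest dog:", oldest["name"])
```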

## Serialization
You can also write data out to `JSON` format.  Let's just do a brief example.

In [None]:
somedogs = {"shelterDogs": [{"name": "Cloe", "age": 3, "breed": "Border Collie", "attendPlaygroup": "Yes"}, 
                           {"name": "Karl", "age": 7, "breed": "Beagle", "attendPlaygroup": "Yes"}]}

In [None]:
with open("shelterdogs_write.json", "w") as write_dogs:
    json.dump(somedogs, write_dogs, indent=4)

## Some `JSON` References
* [www.json.org](https://www.json.org/)
* [Wikipedia page](https://en.wikipedia.org/wiki/JSON)
* [Working with `JSON` data in `Python`](https://realpython.com/python-json/)
* [w3schools `JSON` syntax](https://www.w3schools.com/js/js_json_syntax.asp)

# What is `YAML`?
* The official website: [`YAML`](http://yaml.org/)
* From the official website:
  - `YAML` stands for YAML Ain't Markup Language
    * Example of a [*recursive acronym*](https://en.wikipedia.org/wiki/Recursive_acronym) (like GNU, "GNU's Not Unix"!)
  - "What It Is:  YAML is a human friendly data serialization standard for all programming languages."
* YAML is quite friendly to use and continues to gain in popularity

## `YAML` Anatomy
```yaml
shelterDogs:
- {age: 3, attendPlaygroup: 'Yes', breed: Border Collie, name: Cloe}
- {age: 7, attendPlaygroup: 'Yes', breed: Beagle, name: Karl}
shelterStaff:
- {Job: dogWalker, age: 100, name: Bob}
- {Job: PlaygroupLeader, age: 47, name: Sally}
```

In [None]:
someshelter = {"shelterDogs": [{"name": "Cloe", "age": 3, "breed": "Border Collie", "attendPlaygroup": "Yes"}, 
                           {"name": "Karl", "age": 7, "breed": "Beagle", "attendPlaygroup": "Yes"}], 
               "shelterStaff": [{"name": "Bob", "age": 100, "Job": "dogWalker"}, 
                                {"name": "Sally", "age": 47, "Job": "PlaygroupLeader"}]}

In [None]:
import yaml # Provided by the PyYAML package; use conda install pyyaml (or pip install pyyaml) if you need to install it
print(yaml.dump(someshelter))

### Serialization

In [None]:
with open("shelter_write.yaml", "w") as write_dogs:
    yaml.dump(someshelter, write_dogs)

### Deserialization

In [None]:
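with open("shelterdogs.yaml", "r") as shelter_dogs:
    # safe_load only builds plain Python types; newer PyYAML versions require a Loader for yaml.load
    some_shelter = yaml.safe_load(shelter_dogs)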

In [None]:
print(some_shelter)

In [None]:
print(some_shelter["shelterStaff"])
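
The same round trip can be done entirely in memory. The sketch below assumes PyYAML is installed and uses a shortened version of the shelter data. It also shows the two spellings `YAML` allows: block style (indentation) and flow style (braces and brackets, as in the anatomy example). Which one `yaml.dump` picks by default depends on the PyYAML version, so the style is set explicitly here.

```python
import yaml  # PyYAML

shelter = {
    "shelterDogs": [{"name": "Cloe", "age": 3}, {"name": "Karl", "age": 7}],
    "shelterStaff": [{"name": "Bob", "Job": "dogWalker"}],
}

# Block style: one key per line, nesting shown by indentation
block_text = yaml.safe_dump(shelter, default_flow_style=False)
print(block_text)

# Flow style: JSON-like braces and brackets
flow_text = yaml.safe_dump(shelter, default_flow_style=True)
print(flow_text)

# Both spellings describe the same data
print(yaml.safe_load(block_text) == yaml.safe_load(flow_text) == shelter)
```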

# What is `pickle`?
* `Python` has its own module for loading and writing `Python` data
* Part of the `Python` standard library
* Fast
* Can store arbitrarily complex `Python` data structures

## Some caveats
* `Python` specific:  no guarantee of cross-language compatibility
* Not every `Python` data structure can be serialized by `pickle`
* Older versions of `Python` don't support newer `pickle` protocols
  - The latest protocol can handle the most `Python` data structures
  - Newer versions of `Python` can still read data written with older protocols
  - Older versions of `Python` cannot read data written with newer protocols
* Make sure to use *binary mode* when opening `pickle` files
  - Data will get corrupted otherwise
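
Two of these caveats can be seen in a few lines. This is a minimal sketch: the choice of protocol 2 and of a generator as the unpicklable object are just convenient examples.

```python
import pickle

data = {"name": "Cloe", "age": 3}

# Protocol numbers available in this interpreter
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)

# Writing with an older protocol keeps the bytes readable by older Python versions
old_style = pickle.dumps(data, protocol=2)
print(pickle.loads(old_style) == data)

# Not everything can be pickled: a generator is one example
squares = (n * n for n in range(3))
try:
    pickle.dumps(squares)
except (TypeError, pickle.PicklingError) as err:
    print("Cannot pickle:", err)
```

`pickle.dumps` returns a `bytes` object rather than text, which is why files holding pickled data are opened in binary mode.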

In [None]:
import pickle

someshelter = {"shelterDogs": [{"name": "Cloe", "age": 3, "breed": "Border Collie", "attendPlaygroup": "Yes"}, 
                           {"name": "Karl", "age": 7, "breed": "Beagle", "attendPlaygroup": "Yes"}], 
               "shelterStaff": [{"name": "Bob", "age": 100, "Job": "dogWalker"}, 
                                {"name": "Sally", "age": 47, "Job": "PlaygroupLeader"}]}

with open('data.pickle', 'wb') as f:
    pickle.dump(someshelter, f, pickle.HIGHEST_PROTOCOL) # highest protocol is the most recent one

In [None]:
with open('data.pickle', 'rb') as f:
    data = pickle.load(f)

print(data)

In [None]:
%%bash
cat "data.pickle"