<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Courses</font></h1>
</center>

---

<CENTER>
<H1>
    <font color="red">Manipulating YAML Files with Python</font>
</H1>
</CENTER>

## <font color="red">Objectives</font>

In this presentation, we want to:

- Understand the YAML data format
- Learn how to create a YAML file
- Learn how to read a YAML file and manipulate its content.

## <font color="red">Useful Links</font>

- [Official YAML Website](https://yaml.org/) 
- [PyYAML Documentation](https://pyyaml.org/wiki/PyYAMLDocumentation)
- [Python YAML: A Comprehensive Guide for Beginners](https://www.pythoncentral.io/python-yaml-a-comprehensive-guide-for-beginners/)
- [YAML Tutorial : A Complete Language Guide with Examples](https://spacelift.io/blog/yaml) by Omkar Birade.
- [YAML: The Missing Battery in Python](https://realpython.com/python-yaml/) by Real Python.


## <font color="red">Modules needed</font>

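We will rely on:

- `yaml`: Python interface to YAML (provided by the PyYAML package)
- `pprint`: pretty-printing of Python objects (standard library)
- `pandas`: tabular data manipulation
- `seaborn`: statistical plotting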

In [None]:
import pprint

In [None]:
import yaml

In [None]:
import pandas as pd

In [None]:
import seaborn as sns

## <font color="red">What is YAML?</font>

- YAML stands for "YAML Ain't Markup Language" (initially "Yet Another Markup Language"). It is a data serialization language: a standardized syntax used to exchange data across different systems.
- It is a simple human-readable format that is often used for configuration files and in applications where data is being stored or transmitted.
- YAML libraries exist for most modern programming languages, and the format is widely used for data persistence, internet messaging, cross-language data sharing, etc.
- YAML (since version 1.2) is a superset of JSON, another widely used data interchange format.
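
Because YAML is a superset of JSON, a JSON document can usually be read directly by a YAML parser. The sketch below illustrates this with `yaml.safe_load()`; the sample string is made up for illustration, and since PyYAML follows the older YAML 1.1 rules, unusual JSON inputs may behave differently.

```python
import json
import yaml

# Made-up JSON text used only for illustration
json_text = '{"name": "Earth", "num_moons": 1, "neighbors": ["Venus", "Mars"]}'

from_json = json.loads(json_text)      # parsed by the JSON parser
from_yaml = yaml.safe_load(json_text)  # same text parsed by the YAML parser

print(from_yaml)
print("Same result:", from_json == from_yaml)
```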

#### A few areas where YAML is used

YAML is often used for configuration files that are parsed and read by a programming language or framework. Its format makes it easy for developers and system administrators to understand and modify configuration settings.

- _Configuration Management_: One of the primary uses of YAML is in configuration files. YAML’s human-friendly syntax makes it an excellent choice for storing configuration data. It’s readable, writable, and easy to update, making it a go-to choice for developers worldwide.
- _Data serialization_: YAML can represent complex data structures in a format that’s easy to understand and manipulate.
- _CI/CD_: Many CI/CD products rely on YAML to describe their pipelines (GitHub Actions, GitLab CI/CD, Azure DevOps, CircleCI).

## <font color="red">Structure of a YAML File</font>

- YAML files conventionally end with the suffix `.yaml` or `.yml`.
- A YAML document is typically built from key-value pairs.
- A YAML format primarily uses three node types:
   - __Maps/Dictionaries__ (mapping): An unordered set of key/value node pairs, with the restriction that each of the keys is unique. 
   - __Arrays/Lists__ (sequences): An ordered series of zero or more nodes. In particular, a sequence may contain the same node more than once. It could even contain itself.
   - __Literals__ (strings, numbers, Booleans, etc.): An opaque datum that can be presented as a series of zero or more Unicode characters.
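
As a minimal sketch of how the three node types map to Python objects when loaded with PyYAML, consider the made-up snippet below (the keys and values are illustrative only): mappings become `dict`, sequences become `list`, and literals become `str`, `int`, `float`, or `bool`.

```python
import yaml

# Made-up sample covering the three node types
text = """
planet: Earth          # literal (string)
num_moons: 1           # literal (integer)
habitable: true        # literal (Boolean)
moons:                 # sequence
  - Moon
atmosphere:            # nested mapping
  nitrogen: 78.08
  oxygen: 20.95
"""

data = yaml.safe_load(text)

# Show the Python type chosen for each top-level value
for key, value in data.items():
    print(f"{key}: {type(value).__name__}")
```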

#### Syntax
- __Beginning of file__: A document can start with three dashes: `---`
  - These dashes mark the start of a new YAML document; they are optional when the file holds a single document.
  - A file can contain several documents, each one introduced by its own `---` line.
  - Triple dots, `...` (optional), end a YAML document without starting a new one.
- __Comments__: Comments start with `#` and run to the end of the line.
- __Indentation__: A YAML file relies on whitespace and indentation to indicate nesting. The number of spaces per level doesn’t matter as long as it is consistent; tab characters are not allowed for indentation.
- __Mapping__: Mappings associate keys with values, and the pairs are unordered. Maps can be nested by increasing the indentation, and a new map can be started at the same level by returning to the previous indentation.
- __Sequences__: Sequences are represented by using the hyphen (`-`) and space. They are ordered and can be embedded inside a map using indentation.
- __Literals__:
  - String literals in YAML do not need to be quoted. Quoting only matters when a string contains special characters (such as `:` or `#`) or could be mistaken for another type (such as `true` or `123`).
- __Multiple lines__: Values can span multiple lines using `|` or `>`.
  - Spanning multiple lines using a “Literal Block Scalar” `|` will include the newlines and any trailing spaces.
  - Using a “Folded Block Scalar” `>` will fold newlines to spaces; it is used to make what would otherwise be a very long line easier to read and edit.
    
  In both cases, the leading indentation of the block is stripped from the resulting string.
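
The document markers and block scalars described above can be sketched as follows. The stream is made up for illustration; `yaml.safe_load_all()` is PyYAML's function for reading a stream that holds several documents.

```python
import yaml

# Made-up stream with two documents separated by '---'
stream = """\
---
name: Mercury
note: |
  closest planet
  to the Sun
...
---
name: Earth
note: >
  third planet
  from the Sun
"""

# safe_load_all yields one Python object per document
docs = list(yaml.safe_load_all(stream))

print(len(docs))
print(repr(docs[0]["note"]))  # literal block: newlines are kept
print(repr(docs[1]["note"]))  # folded block: newlines become spaces
```

Both strings keep a single trailing newline, which is the default behavior for block scalars.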

#### Supported literals

| Data Type | YAML |
| :--- | :--- |
| Null | null, ~ |
| Boolean | true, false |
| Integer | 10, 0b10, 0x10, 0o10 |
| Floating-Point | 3.14, 12.5e-9, .inf, .nan |
| String | Lorem ipsum |
| Date and Time | 01:35, 2024-10-23, 2024-10-23 01:15:32 |
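
To see which Python type a given literal becomes, it can help to load a few values and inspect them. The sketch below uses made-up keys. Note that PyYAML follows the YAML 1.1 rules, so some results differ from what the table suggests: for instance, a number in exponent notation without a sign in the exponent (`5.8e7`) is loaded as a string, and an unquoted `no` is loaded as a Boolean.

```python
import datetime
import yaml

# Made-up keys; the values illustrate how PyYAML resolves literals
text = """
nothing: ~
flag: true
count: 10
hexadecimal: 0x10
binary: 0b10
ratio: 3.14
infinite: .inf
day: 2024-10-23
moment: 2024-10-23 01:15:32
word: Lorem ipsum
signed_exponent: 5.8e+7
unsigned_exponent: 5.8e7
bare_no: no
quoted_no: "no"
"""

values = yaml.safe_load(text)

# Print each key with the resolved Python type and value
for key, value in values.items():
    print(f"{key:18} {type(value).__name__:9} {value!r}")
```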

### Example

Consider the following YAML file content:

```
---
# key: value [mapping]
Description: >
      We provide information on
      the planets of the solar system.

Information: |
    The given data is the distance to the sun,
    the orbital and physical characteristics, and
    the number of moons of each planet.

sun:
   mean_radius: "696340 km"     # string [literal]
   # key: value is an array [sequence]
   planets: 
     - Mercury:
         distance: 0.580e8      # floating point [literal]
         orbital_characteristics:
            orbit_duration: 87.96 Earth days    # string [literal]
            orbital_speed: 47.36 km/s           # string [literal]
         physical_characteristics:
            mean_radius: "2439.7 km"           # string [literal]
            surface_area: 7.48e7               # string [literal]
         num_moons: 0                          # integer number [literal]
    
     - Earth:
         distance: 1.497e8
         orbital_characteristics:
            orbit_duration: "1.0 Earth years"
            orbital_speed: "29.7827 km/s"
         physical_characteristics:
            mean_radius: "6371.0  km"
            surface_area: 510072.0e3
         num_moons: 1
```

## <font color="red">PyYAML: YAML Parser for Python</font>

- PyYAML is a YAML parser and emitter for Python, which allows you to easily load and dump YAML data.
- It is used by importing the `yaml` module.
- Its main functions are:
   - `yaml.load()`: Parses a YAML document and loads its contents into a Python data structure.
   - `yaml.dump()`: Converts a Python object into a YAML document.
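
A minimal round trip between YAML text and Python objects might look like the sketch below. It uses the safe variants `yaml.safe_load()` and `yaml.safe_dump()`, which restrict parsing and output to standard YAML types; the sample text and the `has_rings` key are made up for illustration.

```python
import yaml

# Made-up YAML text used only for illustration
text = "name: Earth\nnum_moons: 1\n"

planet = yaml.safe_load(text)   # YAML text -> Python dict
planet["has_rings"] = False     # modify the Python object

# Python dict -> YAML text; sort_keys=False keeps the insertion order
dumped = yaml.safe_dump(planet, sort_keys=False)
print(dumped)
```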

## <font color="red">Reading a YAML File in Python</font>

In [None]:
file_name = "solar_planet_data.yaml"

In [None]:
import urllib.request
url = "https://raw.githubusercontent.com/astg606/astg_pymaterials/main/yaml/solar_planet.yaml"
urllib.request.urlretrieve(url, file_name)

#### Loading YAML content

We can read the content of a YAML file and convert it to Python data structures.

- Use the `yaml.load()` function only when the data comes from a trusted source.
- The `Loader=yaml.FullLoader` argument selects a loader that supports the full YAML language.

In [None]:
with open(file_name, 'r') as fid:
    solar_system = yaml.load(fid, Loader=yaml.FullLoader)

- An alternative approach is the `yaml.safe_load()` function, which is the recommended way to load data from untrusted sources.

In [None]:
try:
    with open(file_name, 'r') as fid:
        solar_system = yaml.safe_load(fid)
except yaml.YAMLError as exc:
    print(f"Error reading YAML file: {exc}")
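
To illustrate what the `try`/`except` pattern above protects against, the sketch below feeds `yaml.safe_load()` three made-up inputs: a malformed document, a document carrying a Python-specific tag (the kind of content that makes unsafe loaders dangerous), and a valid document. The first two are expected to raise a subclass of `yaml.YAMLError`.

```python
import yaml

# Made-up inputs used only for illustration
samples = {
    "malformed": "planets: [Mercury, Venus",                # missing closing bracket
    "python_tag": "!!python/object/apply:os.getcwd []",     # Python-specific tag
    "valid": "planets: [Mercury, Venus]",
}

results = {}
for label, text in samples.items():
    try:
        results[label] = yaml.safe_load(text)
    except yaml.YAMLError as exc:
        # Record the exception class instead of stopping the program
        results[label] = f"rejected ({type(exc).__name__})"
    print(label, "->", results[label])
```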

In [None]:
type(solar_system)

When the top level of the YAML file is a mapping, as it is here, parsing the file returns a Python dictionary.

#### Inspecting the dictionary

In [None]:
solar_system.keys()

In [None]:
solar_system['Description']

In [None]:
solar_system['Information']

In [None]:
solar_system['sun']

In [None]:
solar_system['sun'].keys()

In [None]:
type(solar_system['sun']['planets'])

In [None]:
len(solar_system['sun']['planets'])

In [None]:
for planet in solar_system['sun']['planets']:
    print(planet.keys())

In [None]:
for planet in solar_system['sun']['planets']:
    for x in planet:
        print(f"{x} --> {planet[x]}")
        print()

#### Get the list of planets

In [None]:
list_planets = list()
for planet in solar_system['sun']['planets']:
    for x in planet:
        list_planets.append(x)

In [None]:
list_planets

#### Create a Pandas DataFrame

In [None]:
columns = [
    'distance', 'num_moons', 'orbit_duration', 'orbital_speed', 
    'mean_radius', 'surface_area'
]
df = pd.DataFrame(index=list_planets, columns=columns)
df

In [None]:
for planet in solar_system['sun']['planets']:
    for x in planet:
        data = [planet[x]['distance'], 
                planet[x]['num_moons'],
                planet[x]['orbital_characteristics']['orbit_duration'],
                planet[x]['orbital_characteristics']['orbital_speed'],
                planet[x]['physical_characteristics']['mean_radius'],
                planet[x]['physical_characteristics']['surface_area']
               ]
        df.loc[x] = data

In [None]:
df

In [None]:
df.info()

#### Changing the data type

In [None]:
df['distance'] = df['distance'].apply(float)
df['num_moons'] = df['num_moons'].apply(int)
df['surface_area'] = df['surface_area'].apply(float)

In [None]:
df.info()

In [None]:
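def remove_words(x):
    """Strip a trailing unit (e.g. 'km', 'Earth years') and return a float."""
    if isinstance(x, str) and ('km' in x or 'Earth' in x):
        return float(x.split()[0])
    # Leave every other value (numbers, strings without units) unchanged
    return x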

In [None]:
# DataFrame.map requires pandas >= 2.1; older versions use df.applymap
df = df.map(remove_words)

In [None]:
df

In [None]:
df.info()

#### Simple plot

In [None]:
g = sns.relplot(x=df.index, y="distance", 
                size="mean_radius", hue="num_moons",
                sizes=(40, 400), 
                alpha=.75, 
                palette="muted",
                height=6, data=df
               )
g.set_xticklabels(rotation=90)

## <font color="red">Writing a YAML document</font>

We use the `yaml.dump()` function to convert Python objects into a YAML document:

```python
with open("new_file.yaml", 'w') as fid:
    yaml.dump(data, fid)
```
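
The layout of the output can be adjusted with keyword arguments. The sketch below, which uses a made-up dictionary, compares the three settings of `default_flow_style`: `False` gives block style everywhere, `True` gives the compact JSON-like flow style, and `None` uses flow style only for collections that contain nothing but simple values.

```python
import yaml

# Made-up data used only for illustration
planet = {"name": "Earth", "num_moons": 1, "neighbors": ["Venus", "Mars"]}

block = yaml.safe_dump(planet, sort_keys=False, default_flow_style=False)
flow = yaml.safe_dump(planet, sort_keys=False, default_flow_style=True)
mixed = yaml.safe_dump(planet, sort_keys=False, default_flow_style=None)

for label, text in [("block", block), ("flow", flow), ("mixed", mixed)]:
    print(f"--- {label} ---")
    print(text)
```

All three strings load back to the same dictionary; only the presentation differs.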

#### Create a YAML document from a Pandas DataFrame

- We need to convert the Pandas DataFrame to a list of dictionaries (one per row) and then to a YAML document.

In [None]:
yaml_doc = yaml.dump(df.to_dict(orient='records'), default_flow_style=None)

In [None]:
pprint.pprint(yaml_doc)

#### Dump a Pandas DataFrame into a YAML file

In [None]:
with open("new_file.yaml", 'w') as fid:
    yaml.dump({'results': df.to_dict(orient='records')}, fid, default_flow_style=False)

You can read the file to create a new Pandas DataFrame:

In [None]:
with open('new_file.yaml', mode="rt") as fid:
    df_merged = pd.DataFrame(yaml.full_load(fid)['results'])

In [None]:
df_merged

In [None]:
df_merged.index = list_planets
df_merged

## <font color="red">Best practices</font>

- __Use consistent indentation__: Use the same number of spaces (two or four are common choices) throughout your file.
- __Keep your YAML files short__: If you have many parameters, try to split your configuration settings into multiple files to facilitate the debugging process.
- __Use descriptive key names__: Key names should be meaningful so that other people can easily understand your configuration.
- __Structure complex data__: Use lists (items introduced by a dash, `-`) for ordered collections and maps (key-value pairs) for named settings.
- __Use comments__: Comments help readers understand what a piece of data is used for.
- __Use quotes for strings__: This is especially useful if your string contains special characters, starts with a number, or could be read as another type (for example `no` or `1.10`).
- __Avoid using the `yaml.load()` function without specifying the `Loader` argument__: An unsafe loader can lead to security vulnerabilities, as arbitrary code execution is possible if the YAML file contains malicious data. Prefer `yaml.safe_load()` for untrusted input.
- __Validate the file before parsing it__: Use a YAML linter or validator to check the syntax and structure of the file. This can help catch errors or inconsistencies before they cause issues in your application.
- __Handle errors gracefully when parsing YAML files__: Use `try-except` blocks to catch any exceptions that may occur during the parsing process and handle them appropriately.
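
Several of these practices can be combined in a small loading helper. The sketch below is one possible approach, not a standard API: the function name `load_config`, its `required_keys` parameter, and the sample text are all hypothetical. It parses with `yaml.safe_load()`, converts parsing errors into a clear message, and checks the basic structure before returning the data.

```python
import yaml

def load_config(text, required_keys):
    """Hypothetical helper: parse YAML text safely and validate its structure."""
    try:
        data = yaml.safe_load(text)
    except yaml.YAMLError as exc:
        raise ValueError(f"Invalid YAML: {exc}") from exc

    # The configuration is expected to be a mapping at the top level
    if not isinstance(data, dict):
        raise ValueError("Top-level YAML node must be a mapping")

    # Report every missing key at once to ease debugging
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"Missing required keys: {missing}")
    return data

# Made-up configuration used only for illustration
good = "name: Earth\nnum_moons: 1\n"
config = load_config(good, required_keys=["name", "num_moons"])
print(config)

try:
    load_config("name: Earth\n", required_keys=["name", "num_moons"])
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```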
