# The `pyyaml` Library

Configurations are often stored in YAML files (YAML is a superset of JSON).

Reading those configs in our Python app is dead easy with the `pyyaml` library. Of course, you're not limited to config files: you can just as easily read (and write) Docker Compose YAML files, etc.

This library can be found here:

[https://pyyaml.org/](https://pyyaml.org/)

and can be pip installed via:

```bash
pip install pyyaml
```

For this notebook, you'll find two companion YAML files, one named `config.yaml`, and the other `docker-compose.yaml`.

Let's look at how to read the `config.yaml` file and access data in it.

In [3]:
import yaml

Although the yaml library provides a `load()` function, it is not considered safe when loading YAML data that you are not 100% in control of, because a YAML document can be crafted to make code execute while it is being loaded. For that reason, `safe_load()` is the preferred method.

In [4]:
with open("config.yaml") as f:
    config = yaml.safe_load(f)
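
To make the safety point concrete, here is a minimal sketch of what `safe_load()` does with a document that carries a Python-specific tag. The document string is a made-up example, not something from the companion files, and `os.getcwd` was picked only because it is harmless.

```python
import yaml

# Hypothetical untrusted input: a Python-specific tag that an unsafe
# loader could use to call an arbitrary Python callable.
untrusted = "!!python/object/apply:os.getcwd []"

try:
    yaml.safe_load(untrusted)
    rejected = False
except yaml.YAMLError as exc:
    # safe_load only builds plain types (dict, list, str, int, ...),
    # so it refuses the tag instead of calling anything.
    rejected = True
    print(type(exc).__name__)

print(rejected)
```

The loader raises an error rather than running the tagged callable, which is the behavior you want for any file you did not write yourself.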

And let's see what we got:

In [5]:
config

{'observer': {'latitude': 33.4,
  'longitude': -111.8,
  'horizon_file': 'data/horizon.csv'},
 'catalog': {'file': 'data/dso_catalog.csv',
  'categories': ['emission_nebula',
   'reflection_nebula',
   'hii_regions',
   'galaxies',
   'galaxy_clusters']}}
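
For reference, here is a sketch of YAML text that would produce the dictionary above. The real `config.yaml` is not reproduced in this notebook, so the layout below is an assumption reconstructed from the output. It also shows that `safe_load()` accepts a string as well as an open file.

```python
import yaml

# Hypothetical contents of config.yaml, reconstructed from the dictionary
# shown above; the real file may be laid out differently.
CONFIG_TEXT = """\
observer:
  latitude: 33.4
  longitude: -111.8
  horizon_file: data/horizon.csv
catalog:
  file: data/dso_catalog.csv
  categories:
    - emission_nebula
    - reflection_nebula
    - hii_regions
    - galaxies
    - galaxy_clusters
"""

# safe_load accepts a string as well as an open file object.
config = yaml.safe_load(CONFIG_TEXT)

# Mappings become dicts, sequences become lists, scalars become Python types.
latitude = config["observer"]["latitude"]
categories = config["catalog"]["categories"]
print(type(latitude).__name__, latitude)
print(len(categories), categories[0])
```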

From this point on, we can use it however we want, possibly even loading it into a Pydantic model:

In [6]:
from pydantic import BaseModel

class ObserverSettings(BaseModel):
    latitude: float
    longitude: float
    horizon_file: str | None = None

class CatalogSettings(BaseModel):
    file: str
    categories: list[str]
    
class Settings(BaseModel):
    observer: ObserverSettings
    catalog: CatalogSettings

In [7]:
settings = Settings.model_validate(config)
settings

Settings(observer=ObserverSettings(latitude=33.4, longitude=-111.8, horizon_file='data/horizon.csv'), catalog=CatalogSettings(file='data/dso_catalog.csv', categories=['emission_nebula', 'reflection_nebula', 'hii_regions', 'galaxies', 'galaxy_clusters']))

One very interesting thing about YAML is that it is technically a superset of JSON, so valid JSON also happens to be valid YAML.

As an example of this, you'll find a file in this folder named `nobel_prizes.json`. It is JSON, and we could certainly load it up into a Python dictionary structure using the `json.load` function:

In [8]:
import json

with open("nobel_prizes.json") as f:
    data_json = json.load(f)

In [9]:
data_json["prizes"][:3]

[{'year': 2023,
  'category': 'chemistry',
  'overallMotivation': None,
  'laureates': [{'id': 1029,
    'firstname': 'Moungi',
    'surname': 'Bawendi',
    'motivation': '"for the discovery and synthesis of quantum dots"',
    'share': 3},
   {'id': 1030,
    'firstname': 'Louis',
    'surname': 'Brus',
    'motivation': '"for the discovery and synthesis of quantum dots"',
    'share': 3},
   {'id': 1031,
    'firstname': 'Aleksey',
    'surname': 'Yekimov',
    'motivation': '"for the discovery and synthesis of quantum dots"',
    'share': 3}]},
 {'year': 2023,
  'category': 'economics',
  'overallMotivation': None,
  'laureates': [{'id': 1034,
    'firstname': 'Claudia',
    'surname': 'Goldin',
    'motivation': '"for having advanced our understanding of women’s labour market outcomes"',
    'share': 1}]},
 {'year': 2023,
  'category': 'literature',
  'overallMotivation': None,
  'laureates': [{'id': 1032,
    'firstname': 'Jon',
    'surname': 'Fosse',
    'motivation': '"for his innovative plays and prose which give voice to the unsayable"',
    'share': 1}]}]

But interestingly enough, you can also load this same data using `pyyaml`:

In [10]:
with open("nobel_prizes.json") as f:
    data_yaml = yaml.safe_load(f)

In [11]:
data_yaml["prizes"][:3]

[{'year': 2023,
  'category': 'chemistry',
  'overallMotivation': None,
  'laureates': [{'id': 1029,
    'firstname': 'Moungi',
    'surname': 'Bawendi',
    'motivation': '"for the discovery and synthesis of quantum dots"',
    'share': 3},
   {'id': 1030,
    'firstname': 'Louis',
    'surname': 'Brus',
    'motivation': '"for the discovery and synthesis of quantum dots"',
    'share': 3},
   {'id': 1031,
    'firstname': 'Aleksey',
    'surname': 'Yekimov',
    'motivation': '"for the discovery and synthesis of quantum dots"',
    'share': 3}]},
 {'year': 2023,
  'category': 'economics',
  'overallMotivation': None,
  'laureates': [{'id': 1034,
    'firstname': 'Claudia',
    'surname': 'Goldin',
    'motivation': '"for having advanced our understanding of women’s labour market outcomes"',
    'share': 1}]},
 {'year': 2023,
  'category': 'literature',
  'overallMotivation': None,
  'laureates': [{'id': 1032,
    'firstname': 'Jon',
    'surname': 'Fosse',
    'motivation': '"for his innovative plays and prose which give voice to the unsayable"',
    'share': 1}]}]

In [12]:
data_json == data_yaml

True

As you can see, identical results.

Now, I am not suggesting you stop using `json.load` to parse JSON data and switch to `pyyaml` instead - the json loader is faster since it has less work to do - I just wanted to point out that since JSON is a subset of YAML, a YAML parser is totally able to parse JSON.
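
If you want to check the speed difference on your own machine, a rough timing sketch along these lines should do. The JSON document is a small invented one, not `nobel_prizes.json`, and the absolute numbers will vary from machine to machine.

```python
import json
import timeit

import yaml

# Small hypothetical JSON document, just to have something to parse.
text = json.dumps(
    {"prizes": [{"year": 2023, "category": "chemistry", "share": 3}] * 50}
)

json_result = json.loads(text)
yaml_result = yaml.safe_load(text)

# Time a handful of runs of each parser; absolute numbers depend on the machine.
json_seconds = timeit.timeit(lambda: json.loads(text), number=20)
yaml_seconds = timeit.timeit(lambda: yaml.safe_load(text), number=20)

print(json_result == yaml_result)
print(f"json: {json_seconds:.4f}s  yaml: {yaml_seconds:.4f}s")
```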

Of course, we could further load this data into a Pydantic model:

In [13]:
class Laureate(BaseModel):
    id: int
    firstname: str | None = None
    surname: str | None = None
    motivation: str | None = None
    share: int | None = None

class Prize(BaseModel):
    year: int | None = None
    category: str | None = None
    overallMotivation: str | None = None
    laureates: list[Laureate] = []

class Prizes(BaseModel):
    prizes: list[Prize]

data_pydantic = Prizes.model_validate(data_yaml)

In [14]:
data_pydantic.prizes[0:3]

[Prize(year=2023, category='chemistry', overallMotivation=None, laureates=[Laureate(id=1029, firstname='Moungi', surname='Bawendi', motivation='"for the discovery and synthesis of quantum dots"', share=3), Laureate(id=1030, firstname='Louis', surname='Brus', motivation='"for the discovery and synthesis of quantum dots"', share=3), Laureate(id=1031, firstname='Aleksey', surname='Yekimov', motivation='"for the discovery and synthesis of quantum dots"', share=3)]),
 Prize(year=2023, category='economics', overallMotivation=None, laureates=[Laureate(id=1034, firstname='Claudia', surname='Goldin', motivation='"for having advanced our understanding of women’s labour market outcomes"', share=1)]),
 Prize(year=2023, category='literature', overallMotivation=None, laureates=[Laureate(id=1032, firstname='Jon', surname='Fosse', motivation='"for his innovative plays and prose which give voice to the unsayable"', share=1)])]

## Writing YAML back to file

Now let's look at how we can write yaml out to a file.

We're going to start with the `docker-compose.yaml` file, which sets up a dockerized Redis database.

We're going to read the data in, modify something and write it back out - to keep the original docker-compose file intact, I won't overwrite the original file, but instead write it to a copy until we have things ironed out.

In [15]:
with open('docker-compose.yaml') as f:
    redis = yaml.safe_load(f)
redis

{'version': '3',
 'services': {'redis': {'image': 'redis:latest',
   'container_name': 'redis_queue',
   'restart': 'always',
   'ports': ['6379:6379'],
   'volumes': ['data-volume:/data']}},
 'volumes': {'data-volume': None}}
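
Notice the `None` under `volumes` in the output above. A key with no value loads as `None`, and when such a value is dumped again it is spelled out as `null`. A minimal sketch, using an inline string instead of the compose file:

```python
import yaml

# A key with no value, as in the `volumes` section of a compose file.
text = "volumes:\n  data-volume:\n"

loaded = yaml.safe_load(text)
print(loaded)

# Dumping the same structure spells the missing value out explicitly.
dumped = yaml.dump(loaded)
print(dumped)
```

Both spellings mean the same thing to a YAML parser, so this difference is cosmetic.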

Now, let's say I need to modify the port mappings, and the container name.

In [16]:
redis["services"]["redis"]["container_name"] = "redis_instance_2"
redis["services"]["redis"]["ports"] = ["9000:6379"]

In [17]:
redis

{'version': '3',
 'services': {'redis': {'image': 'redis:latest',
   'container_name': 'redis_instance_2',
   'restart': 'always',
   'ports': ['9000:6379'],
   'volumes': ['data-volume:/data']}},
 'volumes': {'data-volume': None}}

And let's write it out to file:

In [20]:
with open("docker-compose.new.yaml", "w") as f:
    yaml.dump(redis, f)

Taking a look at the output file, we see this:

```yaml
services:
  redis:
    command: redis-server --save 20 1 --loglevel warning --requirepass secret
    container_name: redis_instance_2
    image: redis:latest
    ports:
    - 9000:6379
    restart: always
    volumes:
    - data-volume:/data
version: '3'
volumes:
  data-volume: null
```

This is in contrast to how our original file looked:

```yaml
---
version: '3'

services:
  redis:
    image: redis:latest
    container_name: redis_queue
    restart: always
    ports:
      - '6379:6379'
    command: redis-server --save 20 1 --loglevel warning --requirepass secret
    volumes:
      - data-volume:/data

volumes:
  data-volume:
```

The new version looks almost identical to our original version, except that the keys have been sorted (e.g. `version` now comes after `services`, not at the top where we would like it to be, etc). Note also that our original file started with three dashes (`---`), which marks an explicit document, whereas the newly created YAML file does not, i.e. it is an implicit document.

Fortunately, `pyyaml` has different arguments we can use to control all this. In this case, we want to retain the ordering.

In [21]:
with open("docker-compose.new.yaml", "w") as f:
    yaml.dump(redis, f, sort_keys=False)

And the file now looks like this:

```yaml
version: '3'
services:
  redis:
    image: redis:latest
    container_name: redis_instance_2
    restart: always
    ports:
    - 9000:6379
    command: redis-server --save 20 1 --loglevel warning --requirepass secret
    volumes:
    - data-volume:/data
volumes:
  data-volume: null
```
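
You can see the effect of `sort_keys` without touching any file, because `yaml.dump()` returns a string when no stream is passed. The dictionary below is a trimmed-down, made-up stand-in for the compose data.

```python
import yaml

# Trimmed-down stand-in for the compose data, with `version` first.
data = {"version": "3", "services": {"redis": {"image": "redis:latest"}}}

# With no stream argument, dump returns the YAML text as a string.
sorted_text = yaml.dump(data)                    # default: keys sorted
ordered_text = yaml.dump(data, sort_keys=False)  # keeps insertion order

print(sorted_text)
print(ordered_text)
```

Either way the data loads back identically; only the order of the keys in the text changes.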

The only thing missing is the explicit document start our original file had, which would be nice to preserve.

In [22]:
with open("docker-compose.new.yaml", "w") as f:
    yaml.dump(redis, f, sort_keys=False, explicit_start=True)
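
Here is a small sketch of what `explicit_start=True` adds, dumping to a string so the result is visible. The dictionary is an invented fragment of the compose data.

```python
import yaml

# Invented fragment of the compose data.
data = {"version": "3", "volumes": {"data-volume": None}}

text = yaml.dump(data, sort_keys=False, explicit_start=True)
print(text)

# The marker is presentation only: loading gives back the same data.
round_trip = yaml.safe_load(text)
print(round_trip == data)
```

The output now begins with a `---` line, matching the way the original file started.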

## Conclusion

The `pyyaml` library has much more functionality than I cover here, but to get started reading and writing yaml files, what we covered here is more than sufficient. By all means, go check out the library docs and learn more!

In future videos I'll show you a YAML query tool that we can use from Python. Since, as I showed earlier, JSON is a subset of YAML, it can also be used as a JSON query tool. But that's for another day!