# Common File Formats

Your data will come to you in many different formats, and your outputs will come in at least as many.  This notebook shows examples of some common formats and how to deal with them.

*Terminology:* the verb *serialize* will come up a lot.  "To serialize some data" just means to convert it to a format that can be stored permanently in a file.

All cells in this notebook are designed to be runnable without depending on any other cells.  This means you can copy-paste them into your editor to see how they work.  It also means I'm going to repeat a lot of code (like imports) in every cell, which you usually won't see with Notebooks.

## A note about file extensions

File extensions are a lie.  Sort of.

The extension in a file--.txt, .docx, .exe, .py--is just part of the file's name.  You can change--or even remove--the file extension, and *nothing about the file's contents changes.*  The extensions *are* used by your operating system, though, to determine what the default program is for opening the file.  Notepad for .txt files, LibreOffice for .docx, PyCharm for .py, etc; but you *can* open any file in any program.

A file extension is just a way to convey "this file contains data formatted according to the .txt/.docx/.py specifications."  It's like changing the name of a file from "Paper - Draft.docx" to "Paper - Final.docx"; just changing the name changes nothing about the contents of the file.  It only changes the message you're conveying to the person who's looking at the file: "this is the final, complete version of the paper, not a draft."  That person can, of course, ignore that information and still treat it like a draft!  In this same way, you can force your computer to open a .csv file in Notepad, or a .txt file in Word, or a .py file in Excel.  It might look and act weird when you do, but you can absolutely do it.
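
To make this concrete, here is a minimal sketch (the file name `notes.txt` is made up for illustration, and everything happens in a temporary folder).  It writes some text to a `.txt` file, renames it to `.csv`, and checks that the bytes inside are identical before and after.

```python
import tempfile
from pathlib import Path

# Work in a throwaway folder so we don't clutter anything.
with tempfile.TemporaryDirectory() as folder:
    original = Path(folder) / "notes.txt"   # hypothetical file name
    original.write_text("just some text", encoding="utf8")
    before = original.read_bytes()

    # Change only the extension: .txt --> .csv
    renamed = original.rename(original.with_suffix(".csv"))
    after = renamed.read_bytes()

print(renamed.name)      # the name changed...
print(before == after)   # ...but the contents did not
```

The rename only touches the file's name; nothing ever rewrites the bytes.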

# General note: text files

By and large, text files (sometimes called "plain" text files, or "plain text") are the most common way you will receive your data.  A text file is just a file that contains text: it does *not* contain any fancy binary data to specify layouts and formats.  A huge number of files that you interact with probably every day are actually text files (some of these will be discussed in more detail later):

- CSV and TSV files
- JSON files
- .txt files
- HTML files
- Code--in Python, R, Julia, etc
- Config files (.cfg, .ini)

A narrow definition of "text file" would be something like: any file where, when you open it, the first chunk of binary data directly corresponds to some text character (e.g. a letter, or space, or accent mark), as does the next, etc etc up until (and including) the very last one.

By this token, some common files that are *not* text files:

- Images: the binary data corresponds to pixel colors/brightness, *not* text characters.
- Compiled code (.exe, .dll, .elf, .o, etc): the binary data corresponds to *machine instructions,* not text characters.
- Sound files (.mp3, .wav, .ogg, .flac, etc): the binary data corresponds to *audio information,* not text characters.

## Encoding: the bane of everyone who works with text files

All files are, ultimately, binary data (1s and 0s).  To map these digits to text, an *encoding* is needed.  An encoding is just a list of text characters and their corresponding binary values.  So, you might define a capital 'A' as 0, 'B' as 1, 'C' as 10, etc.

There are a few common encodings you'll run into. Each one of these differs in terms of 1) what characters they have mappings for in the first place; 2) what the specific mappings are.

- ASCII: The simplest and most universal encoding, covering unaccented English letters, digits, and common punctuation (128 characters in total).  Most other encodings are supersets of ASCII.
- UTF-8: The one you should be using by default.  UTF-8 extends ASCII by adding mappings for text in (almost) every script ever used, including extinct ones like cuneiform and Egyptian hieroglyphs.  This is the default encoding on almost every modern Linux distribution, and on modern versions of macOS.
- Windows-1252, often incorrectly identified as ISO-8859-1: Matches UTF-8 for plain ASCII characters, but uses different mappings for everything else (accented letters, smart quotes, and so on).  This is the traditional default on Windows in English and Western European locales.
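
As a small illustration, the sketch below encodes three characters under three encodings and records the resulting bytes (or `None` when the encoding has no mapping for that character).  The `results` dictionary is just a name chosen for this example.

```python
# The same character can become different bytes under different encodings.
results = {}
for character in ["A", "é", "“"]:
    for encoding in ["ascii", "utf-8", "windows-1252"]:
        try:
            encoded = character.encode(encoding)
        except UnicodeEncodeError:
            # This encoding has no mapping for this character.
            encoded = None
        results[(character, encoding)] = encoded
        print(repr(character), encoding, encoded)
```

The plain `A` comes out as the same single byte everywhere, while the accented letter and the smart quote get different bytes in UTF-8 and Windows-1252, and have no ASCII mapping at all.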

By and large, most people have standardized on UTF-8 as a general default, with ISO-8859-1 sometimes coming into the mix.  Just by looking at a file, you can't usually tell what the encoding is.  Some text editing programs will try to infer the encoding, but they can make mistakes.  And if you try to open a file with the wrong encoding, you'll get errors or garbled characters.  Debugging these can be a pain, but here's the rule of thumb I've learned:

1. UTF-8 should be your first guess.
2. If that doesn't work, try Windows-1252.
3. If that doesn't work, try your system's default encoding.
4. If that doesn't work, go back to where you got the data and see if the encoding is documented, or ask the person who gave it to you.
5. Worst case, you might have to find the specific bytes that can't be read, google them, and try to piece together what encoding is probably being used based on what encodings have a mapping for that string of bytes.
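
The first three steps of that list can be sketched as a small helper.  This is only one possible approach: `read_with_fallback` and the file name `mystery.txt` are names invented for this example, not part of any library.

```python
import locale
import tempfile
from pathlib import Path

def read_with_fallback(path, encodings=("utf-8", "windows-1252")):
    """Try each encoding in order; return (text, encoding_that_worked)."""
    # Step 3 of the list: fall back on the system's default encoding.
    candidates = list(encodings) + [locale.getpreferredencoding(False)]
    for encoding in candidates:
        try:
            with open(path, "r", encoding=encoding) as infile:
                return infile.read(), encoding
        except UnicodeDecodeError:
            continue  # this guess was wrong; try the next one
    raise ValueError(f"None of {candidates} could decode {path}")

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "mystery.txt"   # hypothetical file
    path.write_text("“Hello”", encoding="windows-1252")
    text, encoding = read_with_fallback(path)

print(text, encoding)
```

One caveat: decoding without an error does not prove the guess was right.  Windows-1252 will decode many byte sequences into the wrong characters without complaining, so it's still worth eyeballing the result.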

When you open a file in Python, you can optionally specify the file encoding.  If you don't, Python will default to whatever your system's default encoding is (Windows-1252 for Windows, UTF-8 for basically everything else).  I strongly recommend always explicitly specifying an encoding when reading and writing files, in case someone needs to run your code on a different computer that may have a different operating system, or the same operating system configured with a different default encoding.

## File "modes"

Files can be opened in several different "modes."

1. "Read" mode (`mode="r"`).  The file's contents can only be read; no modifications can be made.
2. "Write" mode (`mode="w"`).  The file's contents are erased, and then re-written to.
3. "Append" mode (`mode="a"`).  The file's contents are preserved, and new data is written to the very end of the file.

In addition, Python lets you specify whether a file's contents should be *decoded* from binary to text using a specified encoding, or whether the data should be read in as "raw bytes" that are *not* decoded.  The former is the default behavior.  The latter can be specified by adding a `"b"` to the end of the `mode` string.  E.g.:

1. Read the file as raw bytes; don't decode anything: `mode="rb"`
2. Open the file for writing, and write raw bytes into it; don't encode data before writing: `mode="wb"`
3. Open the file for appending, and write raw bytes into it: `mode="ab"`

Usually you only use these "binary modes" in fairly specific circumstances.  It's rare that you'll use them for general-purpose file access.
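
Here is a short sketch of the modes side by side, run against a throwaway file (the name `modes.txt` is made up for the example).

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "modes.txt"   # hypothetical file

    with open(path, "w", encoding="utf8") as outfile:   # erase, then write
        outfile.write("first")
    with open(path, "w", encoding="utf8") as outfile:   # erases "first"
        outfile.write("second")
    with open(path, "a", encoding="utf8") as outfile:   # keep, add to the end
        outfile.write(" third")

    with open(path, "r", encoding="utf8") as infile:    # decoded text
        as_text = infile.read()
    with open(path, "rb") as infile:                    # raw bytes, no decoding
        as_bytes = infile.read()

print(repr(as_text))
print(repr(as_bytes))
```

The second `"w"` wipes out what the first one wrote, the `"a"` adds to what is there, and the two reads return the same content as a `str` and as `bytes` respectively.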

## Basic text files in Python

In [1]:
# Create two files to show different encodings.  Smart quote characters
# are where I've seen the most differences and issues with encodings,
# mostly because UTF-8 and Windows-1252 encode them differently.
example_text = "“Hello”, he said, “to you”"

# We'll come back to `with` in just a moment.
with open("my file.txt", "w", encoding="utf8") as OUT:
    OUT.write(example_text)
    
with open("my windows-1252 file.txt", "w", encoding="windows-1252") as OUT:
    OUT.write(example_text)

*Digression: `with`*

Sometimes when you open a file, your program might unexpectedly exit (e.g. crash) before the file gets closed.  This can cause some issues.  The `with` statement in Python helps you avoid this: if the program exits for any reason, the `with` statement makes sure that the file gets closed.

You'll usually see `with` used for opening files, opening connections (e.g., connecting to a database), or sometimes, temporarily moving computations to a different device (e.g. from the CPU to the GPU; neural network libraries are the most common places to see this).

You can think of `with` as taking code that looks like:
```python
my_file = open("some file")
# do stuff with my_file
my_file.close()
```

And making sure that `my_file.close()` *always* gets called.  The above block, rewritten using `with`, looks like:

```python
with open("some file") as my_file:
    # do stuff with my_file
```

This will bind the result of `open("some file")` to the variable `my_file`, and make that variable accessible to you within the scope of the `with` block.

This paradigm of using `with` is often called *context management,* since it's kind of like saying "in this context, `my_file` means `open("some file")`.  Now, do some stuff with `my_file`..."
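
If it helps, you can think of `with` as roughly equivalent to a `try`/`finally` block.  The sketch below is a simplification of what Python really does, and the file name is made up for the example; it shows both versions leaving the file closed.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "some file.txt"   # hypothetical file
    path.write_text("hello", encoding="utf8")

    # Roughly what `with` does for you:
    my_file = open(path, "r", encoding="utf8")
    try:
        contents = my_file.read()
    finally:
        my_file.close()   # runs even if the `try` block raises an error

    # The `with` version; the file is closed once the block ends.
    with open(path, "r", encoding="utf8") as my_other_file:
        other_contents = my_other_file.read()

print(my_file.closed, my_other_file.closed)
```

Note that `my_other_file` still exists as a variable after the `with` block ends; it just refers to a file that has already been closed.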

*End of digression.*

In [2]:
# Open a file and read its contents as text.
# Note: the file extension (".txt") is part of the file's name.
# Python--and all programming languages I know of--generally
# require you to include the file extension on a file, if present.
#
# Note that `encoding="utf8"` is *required* on Windows computers--
# it may not be on Mac and Linux, where UTF-8 is often the system-wide
# default.
contents = open(
    "my file.txt",   # Open the file "my file.txt"...
    mode="r",        # ...in "read" mode...
    encoding="utf8", # ...using utf-8 encoding...
).read()             # ...and read the resulting text into memory.
print(contents)

# Note: `mode=` is usually omitted, and the open mode passed positionally.
# When reading in binary mode, you cannot specify an encoding--the encoding
# tells how to map from binary to text, but reading a file in binary mode
# says "don't do that conversion."
binary_contents = open(
    "my file.txt", # Open the file "my file.txt"...
    "rb"           # ...in "binary read" mode...
).read()           # ...and read the resulting bytes into memory.
print(binary_contents)

“Hello”, he said, “to you”
b'\xe2\x80\x9cHello\xe2\x80\x9d, he said, \xe2\x80\x9cto you\xe2\x80\x9d'


In [3]:
# Specify an encoding.  The name of the encoding is usually case-insensitive;
# this is one of the only places in Python where things are case-insensitive.
# The `encoding` argument is usually passed by name; it's the fourth
# defined argument to `open()`.
#
# Note that the windows-1252 encoding will be used by default on many Windows
# computers if you don't specify encoding=.
contents = open(
    "my windows-1252 file.txt", # Open the file "my windows-1252 file.txt"...
    "r",                        # ...in "read" mode...
    encoding="windows-1252"     # ...decode the binary data using the Windows-1252 codec...
).read()                        # ...and read the resulting text into memory.
print(contents)

“Hello”, he said, “to you”


In [4]:
# Errors will happen if the file can't be decoded using the specified encoding.
contents = open("my windows-1252 file.txt", "r", encoding="utf8").read()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte

Once you've read data into your program, it's just a string.  It's up to you at that point to figure out how to manipulate that string into the format you need.

Sometimes you have to do that manipulation by hand, e.g., if there's some strange custom data format that you need to parse.  But, for the most common kinds of data format--CSV, JSON, XML, and a few others--there are tools in Python to do that for you.

# CSV: Tabular Data

CSV files are basically universal for storing tabular data.  They are text files that store data using the following conventions:

1. There is a one-to-one relationship between the rows in the tabular data and the lines in the text file.
2. Columns are separated by commas.
3. If a comma appears inside a cell, e.g. you have a string of text, that string must be *escaped*, usually by just surrounding it with quotes.
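
The quoting convention is easiest to see in action.  In this sketch, `io.StringIO` stands in for a real file so the example doesn't have to touch the disk; the data is made up.

```python
import csv
import io

# io.StringIO is an in-memory stand-in for a file opened in text mode.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Language", "Greeting"])
writer.writerow(["English", "Hello, world"])    # this cell contains a comma
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back: the quotes are removed and the comma stays inside the cell.
rows = list(csv.reader(io.StringIO(csv_text, newline="")))
print(rows)
```

The writer only adds quotes around the cell that needs them, and the reader strips them again, so you rarely have to think about escaping yourself.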

Every spreadsheet program and data-oriented language can work with CSV files quite easily.  They have some downsides when it comes to speed--they can be slower to read and write than some other formats--and when you have *huge* data files, this can be exacerbated.  But for 95% of all use cases, CSV is a really good go-to, if only because of how universally used it is.

(Sidebar: .xlsx files from Excel are not text files; they're actually compressed archives storing a mishmash of folders and files, with the actual file contents being stored as XML.  Excel files are pretty universal too, and will generally be smaller than the equivalent CSV, but can be a *lot* slower to read and write).

Strictly speaking, CSV files are a subset of *delimited files.*  Any symbol (or sequence of symbols) could be used to separate columns.  Tabs are a common alternative--yielding Tab Separated Value (.tsv) files--as are *pipe* characters (i.e., "|").  All CSV parsers I know of allow you to specify the column delimiter character, though, so the same approaches that work with CSV can work very easily with other delimited formats.
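
For example, Python's `csv` module takes a `delimiter` argument.  The sketch below parses the same made-up table twice, once pipe-delimited and once tab-delimited, from in-memory strings.

```python
import csv
import io

pipe_delimited = "Language|Family\nZulu|Niger-Congo\nTamil|Dravidian\n"
tab_delimited = "Language\tFamily\nZulu\tNiger-Congo\nTamil\tDravidian\n"

# Same reader, different delimiter.
pipe_rows = list(csv.reader(io.StringIO(pipe_delimited), delimiter="|"))
tab_rows = list(csv.reader(io.StringIO(tab_delimited), delimiter="\t"))

print(pipe_rows)
print(pipe_rows == tab_rows)
```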

Python's standard library has a `csv` module that contains some basic, but useful, tools for reading and writing CSV files.  These work with Python `list`s and `dict`ionaries, and don't make it easy to do much beyond very basic analyses, but they do make it very easy to work with enormous files (they only read/write one row at a time, so you can process a huge file with a very small amount of RAM).
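
As a rough sketch of that row-at-a-time style, the example below totals one column without ever holding the whole file in memory.  The file here is tiny and written on the spot (`languages.csv` is a made-up name), but the same loop would work on a file far larger than your RAM.

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "languages.csv"   # hypothetical file
    path.write_text(
        "Language,Number of Speakers (millions)\n"
        "English,1000\nZulu,28\nTamil,83\nZapotec,0.5\n",
        encoding="utf8",
    )

    total = 0.0
    with open(path, "r", encoding="utf8", newline="") as infile:
        for row in csv.DictReader(infile):
            # Every value arrives as a string; convert it ourselves.
            total += float(row["Number of Speakers (millions)"])

print(total)
```

Only one row is alive at a time inside the loop; the running total is the only thing that persists between rows.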

Later on, we'll see the `pandas` third-party library, which offers a *much* richer set of ways to interact with tabular data--including CSV data--and is generally a better option if your data isn't enormous.  But, the standard library's `csv` tools are still important to know about.

In [5]:
import csv

# Read a CSV file into a list of lists:
# [
#     [row1_col1, row1_col2, row1_col3, ...],
#     [row2_col1, row2_col2, row2_col3, ...],
#     ...
# ]
with open("simple_csv.csv", "r", encoding="utf8") as INFILE:
    reader = csv.reader(INFILE)
    for row in reader:
        print(row)
        
# csv.DictReader --> returns a list of dictionaries, one dictionary
# per row, containing {column_name: value} pairs.
# [
#     {"col1": row1_col1, "col2": row1_col2, ...},
#     {"col1": row2_col1, "col2": row2_col2, ...},
#     ...
# ]
# Unlike csv.reader(), this will default to parsing the first row as
# column headers.  csv.reader() treats the first row like any other
# row.
print()
with open("simple_csv.csv", "r", encoding="utf8") as INFILE:
    reader = csv.DictReader(INFILE)
    for row in reader:
        print(row)

['Language', 'Number of Speakers (millions)', 'Language Family']
['English', '1000', 'Indo-European']
['Zulu', '28', 'Niger-Congo']
['Tamil', '83', 'Dravidian']
['Zapotec', '0.5', 'Oto-Manguean']

{'Language': 'English', 'Number of Speakers (millions)': '1000', 'Language Family': 'Indo-European'}
{'Language': 'Zulu', 'Number of Speakers (millions)': '28', 'Language Family': 'Niger-Congo'}
{'Language': 'Tamil', 'Number of Speakers (millions)': '83', 'Language Family': 'Dravidian'}
{'Language': 'Zapotec', 'Number of Speakers (millions)': '0.5', 'Language Family': 'Oto-Manguean'}


In [6]:
import csv

# Write to files with the corresponding writer objects.
# NOTE: newline="" is needed here.  Normally, when writing in text mode, Python
# translates every "\n" into the system's line ending ("\r\n" on Windows).
# The CSV writers write their own line endings to make sure everything complies
# with the CSV specification; so setting `newline=""` essentially says "it is
# the job of whatever is writing data into this file to manage when text moves
# down to the next line"--and in this case, that's good, because the csv
# writers can handle it.
with open("my new csv file.csv", "w", encoding="utf8", newline="") as OUTFILE:
    writer = csv.writer(OUTFILE)
    # first row is not "special"--just write the column names.
    # It's up to us to make sure the order of columns later is correct.
    writer.writerow(["Language", "Speakers", "Language Family"])
    
    # Write data one row at a time, with columns in the same order
    # as the first row we wrote.
    writer.writerow(["Djinang", "125", "Pama-Nyungan"])
    writer.writerow(["Unangam Tunuu", "150", "Eskimo-Aleut"])
    writer.writerow(["Hurrian", "0 (extinct)", "Hurro-Urartian"])
    
# csv.DictWriter will manage column order for us.  We pass its .writerow()
# method a dictionary of {column: value} pairs, and it puts them in the
# right places.  So the order of the pairs doesn't matter.
with open("my new csv file.csv", "w", encoding="utf8", newline="") as OUTFILE:
    writer = csv.DictWriter(OUTFILE, fieldnames=["Language", "Speakers", "Language Family"])
    writer.writeheader()
    writer.writerow({"Language": "Djinang", "Speakers": "125", "Language Family": "Pama-Nyungan"})
    writer.writerow({"Speakers": "150", "Language": "Unangam Tunuu", "Language Family": "Eskimo-Aleut"})
    writer.writerow({"Language Family": "Hurro-Urartian", "Speakers": "0 (extinct)", "Language": "Hurrian"})

In [7]:
# Of course, since CSV files are just text, we can open them like normal text files.
csv_data = open("simple_csv.csv", "r", encoding="utf8").read()
print(csv_data)

Language,Number of Speakers (millions),Language Family
English,1000,Indo-European
Zulu,28,Niger-Congo
Tamil,83,Dravidian
Zapotec,0.5,Oto-Manguean



# JSON files: universal, flexible, and fast (enough)

JavaScript Object Notation (JSON) is a more general way to store data than CSV.  It's used almost literally everywhere.  Usually, the term JSON is used as a noun to refer to "a piece of data formatted using the JSON specification."

JSONs look a *lot* like Python `dict`ionaries.  They're key-value pairs surrounded by curly brackets.  `dict`ionaries are actually just a more general form of JSONs, if you squint.  Here are the major differences:

- JSON keys can only be strings.
- JSON values can only be strings, numbers, arrays, booleans (`true` and `false`--all lowercase, unlike Python's `True` and `False`), missing (`null`, compare to Python's `None`), or other JSONs.  Arrays can contain strings, numbers, arrays, and JSONs in any combination.
- Text *must* be surrounded by *double quotes.*  Python doesn't care about `'single'` versus `"double"` quotes for strings, but JSON does, and JSON only allows double quotes.
- All values must be *literal values.*  JSON does not have a notion of variables.  (when you're creating a JSON in Python, Python will convert variables to their literal values--you don't have to worry about that step).
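
A couple of these differences are easy to check with the standard library's `json` module.  This is just a quick sketch with made-up data.

```python
import json

# true / false / null become True / False / None.
parsed = json.loads('{"ok": true, "bad": false, "missing": null}')
print(parsed)

# Single-quoted strings are not valid JSON.
try:
    json.loads("{'name': 'Henry'}")
    single_quotes_ok = True
except json.JSONDecodeError as error:
    single_quotes_ok = False
    print("Could not parse:", error)
```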

Beyond that, JSONs look like Python dictionaries:

```json
{
    "name": "Henry",
    "job title": "Data Scientist",
    "favorite languages": ["Python", "Julia", "Haskell", "Lua"],
    "favorite numbers": [42, 5, 2.718],
    "some random junk": [1, "two", [3, 4, 5], {"six": 6, "7": "seven"}]
}
```

You'll often see JSONs written on a single line and without spaces between things, to save a bit of space in files:
```json
{"1":2,"3":4,"5":"six"}
```

You'll also see a lot of files in the not-quite-official-but-effectively-official *newline-delimited JSON* (ndjson) format, where each line in a file contains one JSON object.  Strictly speaking, this ndjson format is *not* part of the JSON specification; for a file to contain multiple JSONs and still be a completely valid JSON file, top to bottom, you'd need it to be an array of JSONs.  But, there is no benefit to doing this; to parse the JSON array, you'd need to parse the entire file all at once, which might take a while if the file is large.  With ndjson, you can read one line at a time, parse it, do what you need, and then grab the next.  This *incremental* parsing is way easier to work with, so ndjson is *effectively* the way that JSON data is always stored in files.
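
Here is a minimal sketch of writing and then incrementally reading an ndjson file, using only the standard library.  The records and the file name `languages.ndjson` are made up for the example.

```python
import json
import tempfile
from pathlib import Path

records = [
    {"language": "Zulu", "speakers": 28},
    {"language": "Tamil", "speakers": 83},
]

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "languages.ndjson"   # hypothetical file

    # Write: one JSON object per line.
    with open(path, "w", encoding="utf8") as outfile:
        for record in records:
            outfile.write(json.dumps(record) + "\n")

    # Read: parse one line at a time.
    loaded = []
    with open(path, "r", encoding="utf8") as infile:
        for line in infile:
            if line.strip():               # skip any blank lines
                loaded.append(json.loads(line))

print(loaded)
```

This works because `json.dumps()` puts each object on a single line by default; any newline inside a string value is written as the two characters `\n`.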

The simplest way to parse JSON data in Python is with the `json` module, which only has four functions you ever really need to care about (and really, only two of those are ever super useful):

- `json.loads("some text")`: parse a JSON string and return a corresponding Python dictionary.
- `json.dumps(a dictionary)`: convert a Python dictionary to a corresponding JSON string, and throw an error if there's something that can't be JSON'd.  (e.g., a special Python object).
- `json.load(opened file)`: read the specified file's contents, and parse it as a single JSON object. (`opened file` should be a readable file-like object, e.g. what you get from calling `open()`)
- `json.dump(dictionary, opened file)`: convert the Python dictionary `dictionary` into a JSON object, and write it to `opened file` (`opened file` should be a writeable file-like object, e.g. what you get from calling `open()`).

Realistically, you'll very rarely use `json.load()` or `json.dump()`.  Usually you'll be working with `json.loads()` and `json.dumps()`.
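
For completeness, here is a small sketch of the file-based pair, `json.dump()` and `json.load()`, next to the string-based pair.  The file name `person.json` is made up for the example.

```python
import json
import tempfile
from pathlib import Path

data = {"name": "Henry", "favorite numbers": [42, 5, 2.718]}

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "person.json"   # hypothetical file

    # json.dump() writes straight into an opened file...
    with open(path, "w", encoding="utf8") as outfile:
        json.dump(data, outfile, indent=4)

    # ...and json.load() reads the whole file as one JSON object.
    with open(path, "r", encoding="utf8") as infile:
        from_file = json.load(infile)

# The string-based versions do the same thing without a file.
from_string = json.loads(json.dumps(data))

print(from_file == data, from_string == data)
```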

In [8]:
import json

json_data = """{
    "name": "Henry",
    "job title": "Data Scientist",
    "favorite languages": ["Python", "Julia", "Haskell", "Lua"],
    "favorite numbers": [42, 5, 2.718],
    "some random junk": [1, "two", [3, 4, 5], {"six": 6, "7": "seven"}]
}"""
parsed = json.loads(json_data)
print(parsed)
print(type(parsed))

{'name': 'Henry', 'job title': 'Data Scientist', 'favorite languages': ['Python', 'Julia', 'Haskell', 'Lua'], 'favorite numbers': [42, 5, 2.718], 'some random junk': [1, 'two', [3, 4, 5], {'six': 6, '7': 'seven'}]}
<class 'dict'>


In [9]:
import json

my_dict = {1:2, 3:4, "5": 6}
# All keys will be converted to strings!
# repr() just makes sure that strings get printed with quotes around them;
# normally the quotes we put around strings aren't part of printed output.
print(repr(json.dumps(my_dict)))

# Dictionaries, lists, strings, ints, bools, and None can all be serialized.
print(repr(json.dumps([1,2,3])))
print(repr(json.dumps("blah blah blah")))
print(repr(json.dumps([True, {"Nope": False}])))

'{"1": 2, "3": 4, "5": 6}'
'[1, 2, 3]'
'"blah blah blah"'
'[true, {"Nope": false}]'


Sometimes, you may end up needing way more speed than you can get with the built-in `json` module.  Usually, it's good enough--but sometimes you have so much data that you need to squeeze some extra speed out of it.  If you want to see some libraries that are specialized for faster JSON parsing, check the `01a - Faster JSON processing.ipynb` notebook.

# XML: JSON's bigger, badder, uglier cousin

eXtensible Markup Language (XML) is a data format that is almost as universal as JSON, but is more commonly reserved for marking up document structure.  HTML is a close relative of XML, and all the Microsoft Office file formats--.docx, .xlsx, .pptx, etc--store their information as XML under the hood (that's actually what the "x" in the extension indicates).

XML is based around a few ideas:
- The *tag*.  A basic tag looks like: `<tagname>contents</tagname>`.  `tagname` can be basically anything.  `contents` can be text, or more XML tags.
- Some tags have no contents: `<tagname />`
- Some tags have *attributes*: `<person name="Henry" job="DataScientist">`.  Attribute values must always be quoted.

With a few exceptions, you can just make tag names and attribute names on the fly as you see fit, kind of like keys in a JSON.  
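
To see these pieces from Python's side, here is a small sketch using the standard library's `xml.etree.ElementTree`.  The tag and attribute names are invented on the fly, which is exactly the point.

```python
import xml.etree.ElementTree as ET

xml_text = (
    '<person name="Henry" job="DataScientist">'
    "<language>Python</language>"
    "<retired />"
    "</person>"
)
person = ET.fromstring(xml_text)

print(person.tag)       # the tag name
print(person.attrib)    # the attributes, as a dictionary
children = [(child.tag, child.text) for child in person]
print(children)         # an empty tag has no text at all

# Attribute values need quotes; without them the parser refuses the document.
try:
    ET.fromstring("<person name=Henry />")
    unquoted_ok = True
except ET.ParseError:
    unquoted_ok = False
print(unquoted_ok)
```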

XML is far more flexible, but far more complex, than JSON.  A quick list of the tradeoffs:

Pros of XML:
- Allows you to store much more complex and detailed data.
- Can easily be tweaked/extended to create new "dialects" with new features (SVG, RSS, and XHTML are all dialects of XML!).
- A very natural way to express certain kinds of data.
- XML supports variables (sort of), JSON does not.
- XML supports comments inside the document.

Cons of XML:
- It is extremely "noisy"--the syntax has a lot of stuff that isn't the data itself.  This makes it very hard for a human to read, and means XML files are almost always larger than equivalent JSON files for the same data.
- It is usually slower to read and write than JSON data, all else being equal.
- XML doesn't have notions of arrays or data types; this can make it a bit unintuitive to store some kinds of data.
- It's harder--not impossible, but harder--to parse an XML file incrementally.  You can have tags that open once and take forever to close, which means you have to keep something around in RAM until it closes.
- It's *extremely* rare, but there are some vulnerabilities in XML.  E.g., the delightfully named "billion laughs" exploit.
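
On the incremental parsing point: the standard library does offer `xml.etree.ElementTree.iterparse()`, which hands you each element as soon as its closing tag has been read.  The sketch below is one way to use it, with a small made-up document held in memory instead of a large file.

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = b"""<data>
    <country name="Liechtenstein"><rank>1</rank></country>
    <country name="Singapore"><rank>4</rank></country>
    <country name="Panama"><rank>68</rank></country>
</data>"""

ranks = {}
# With events=("end",), each element is yielded when its closing tag is reached.
for event, element in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
    if element.tag == "country":
        ranks[element.get("name")] = int(element.find("rank").text)
        # Drop this element's children, text, and attributes to free memory.
        element.clear()

print(ranks)
```

Notice that you have to decide for yourself when an element is finished with and clear it.  That bookkeeping is the "harder" part compared to reading ndjson one line at a time.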

If we take the previous JSON example:
```json
{
    "name": "Henry",
    "job title": "Data Scientist",
    "favorite languages": ["Python", "Julia", "Haskell", "Lua"],
    "favorite numbers": [42, 5, 2.718],
    "some random junk": [1, "two", [3, 4, 5], {"six": 6, "7": "seven"}]
}
```

We can convert it to XML (there are many possible ways we could do this, but this is just one):
```xml
<person>
    <name>Henry</name>
    <job_title>Data Scientist</job_title>
    <favorite_languages>
        <item>Python</item>
        <item>Julia</item>
        <item>Haskell</item>
        <item>Lua</item>
    </favorite_languages>
    <favorite_numbers>
        <item>42</item>
        <item>5</item>
        <item>2.718</item>
    </favorite_numbers>
    <some_random_junk>
        <item>1</item>
        <item>two</item>
        <item>
            <item>3</item>
            <item>4</item>
            <item>5</item>
        </item>
        <item>
            <six>6</six>
            <seven>7</seven>
        </item>
    </some_random_junk>
</person>
```

Or, we could maybe write the very last bit as an empty tag with some attributes:
```xml
<person>
    <!-- some stuff -->
    <item><map six="6" seven="7" /></item>
</person>
```

It's a *lot* wordier and more complex than JSON, and typing it out by hand is a pain.  But, it allows a *lot* more flexibility.  I still avoid it wherever I can, though; usually, I never need the extra flexibility, and can just add a field to a JSON and get whatever I need.

Python has some tools for parsing XML.  I'm not going to cover them too deeply here, since XML data is relatively uncommon compared to CSV/tabular data and JSON.  That said, there are three libraries you should be aware of:

- `xml`, in the standard library.
- `lxml`, a third party library.  Faster and generally upgraded version of `xml`.
- `BeautifulSoup`, a third party library.  Specialized for HTML parsing, but you can probably make it work on general XML.  Provides a very easy-to-use interface, if a slightly slow one, for navigating an XML document.

In [10]:
import xml.etree.ElementTree as etree

# Quick parsing demo--mostly borrowed from Python's `xml` documentation.
xml_data = """<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>"""

# Load the XML data into a tree-like structure.
root = etree.fromstring(xml_data)
# if you have a .xml file:
# tree = etree.parse("filename")
# root = tree.getroot()

# `root` is now an iterable over XML `element`s.
# Each element can be indexed to get tags "inside" of it,
# and we can access different attributes on them.
for child in root:
    print(child.tag, child.attrib)
    # print("\t", child[0].tag, child[0].attrib)
    # print("\t", child[-1].tag, child[-1].attrib)

country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}


In [11]:
# using lxml
from lxml import etree

xml_data = """<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>"""

root = etree.fromstring(xml_data)
# root = lxml.etree.ElementTree(root)

# then, same as before
for child in root:
    print(child.tag, child.attrib)
    print("\t", child[0].tag, child[0].attrib)
    print("\t", child[-1].tag, child[-1].attrib)

country {'name': 'Liechtenstein'}
	 rank {}
	 neighbor {'name': 'Switzerland', 'direction': 'W'}
country {'name': 'Singapore'}
	 rank {}
	 neighbor {'name': 'Malaysia', 'direction': 'N'}
country {'name': 'Panama'}
	 rank {}
	 neighbor {'name': 'Colombia', 'direction': 'E'}


In [12]:
# using BeautifulSoup.  Install as `beautifulsoup4`,
# import as `bs4`.
from bs4 import BeautifulSoup

xml_data = """<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>"""

# `features="xml"` tells BeautifulSoup that this is plain
# XML; BeautifulSoup normally assumes it is specifically HTML,
# and will throw a warning if you don't specify this.  It will
# also use an HTML parser, which won't be as reliable as an XML
# parser if the data is in fact XML.
soup = BeautifulSoup(xml_data, features="xml")

# bs4 sometimes adds surrounding <html></html> tags to documents
# if they're not already present.
# BeautifulSoup uses .find() and .find_all() to find tags with certain attributes.
for i in soup.find_all("country"):
    print(i.find("neighbor"))
    
# find all neighbors with direction="W" attributes
print()
for i in soup.find_all("neighbor", direction="W"):
    print(i)

<neighbor direction="E" name="Austria"/>
<neighbor direction="N" name="Malaysia"/>
<neighbor direction="W" name="Costa Rica"/>

<neighbor direction="W" name="Switzerland"/>
<neighbor direction="W" name="Costa Rica"/>


In general, I recommend *not* using XML for storing your own data, unless you have to.

# Pickle, the Python-native format

Plaintext, CSV, JSON, and XML all have their pros and cons, but the biggest con is this: *they can't store native Python objects.*  Sure, if you're only using lists and dictionaries and simple data types, they'll all work. But what if you need to store some more complex object, like a set, or a custom class, or even a function?  This is where `pickle` comes in: Python's native, in-house solution to serializing data.

Compared to the previous formats, `pickle` has some very important properties:
- It is a binary format, *not* a text format.  The 1s and 0s are for Python's eyes, not yours.
- It can store *almost any kind of thing that Python can work with* (a few things, like open files and lambda functions, can't be pickled).
- Like JSON, it stores one-thing-per-file.  But that thing can be a list, or dictionary, or set, or tuple that contains the other things you need.
- Pickle is *only* meant for Python.  Other languages *can* read pickled data--everything about how it works is completely open-source--but they usually don't, because there's no reason to load a Python dictionary in R.  (or, there are usually better ways to do it).
- ***Very importantly***: pickled files *can execute arbitrary Python code when you load them.* This can be a big security risk.  This is *not* an issue if it's a pickle file you created, but treat pickle files like executables: only use them if you trust where they came from.
    - Usually, you should not be using pickle files to share data with people online, for this very reason.  It's better to convert your data to CSV/JSON/XML/etc and share any code that they might need to read the data into their program.
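
To show why that warning matters, here is a deliberately harmless sketch.  The class name `NotWhatItSeems` is made up; `__reduce__` is the real hook that `pickle` uses to ask an object how it should be rebuilt.

```python
import pickle

class NotWhatItSeems:
    # pickle calls __reduce__ to ask "how do I rebuild this object?"
    # The answer is (some_callable, arguments).  Here the callable is eval,
    # so *loading* the pickle evaluates a string of Python code.
    def __reduce__(self):
        return (eval, ("21 * 2",))

payload = pickle.dumps(NotWhatItSeems())

# Loading runs eval("21 * 2"); we get its result back, not an object.
result = pickle.loads(payload)
print(result)
```

Here the string being evaluated is just arithmetic, but it could be any Python code at all, which is why you should only load pickles from sources you trust.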
    
The `pickle` module in the standard library has basically the same interface as `json`.

In [13]:
# This will throw an error; JSON can't serialize Python sets.
import json
my_data = {
    "Texas": {"Houston", "Dallas", "Fort Worth", "Arlington", "Austin", "San Antonio"},
    "California": {"Los Angeles", "San Francisco", "Sacramento", "San Bruno"},
    "New York": {"New York", "Albany", "Syracuse", "Rochester", "Buffalo"}
}
print(json.dumps(my_data))

TypeError: Object of type set is not JSON serializable

In [14]:
# This will work just fine!
# But you won't be able to read most of this.
import pickle
my_data = {
    "Texas": {"Houston", "Dallas", "Fort Worth", "Arlington", "Austin", "San Antonio"},
    "California": {"Los Angeles", "San Francisco", "Sacramento", "San Bruno"},
    "New York": {"New York", "Albany", "Syracuse", "Rochester", "Buffalo"}
}
print(pickle.dumps(my_data))
print()

dumped = pickle.dumps(my_data)
print(pickle.loads(dumped))

b'\x80\x04\x95\xd7\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x05Texas\x94\x8f\x94(\x8c\tArlington\x94\x8c\x06Dallas\x94\x8c\nFort Worth\x94\x8c\x0bSan Antonio\x94\x8c\x06Austin\x94\x8c\x07Houston\x94\x90\x8c\nCalifornia\x94\x8f\x94(\x8c\nSacramento\x94\x8c\tSan Bruno\x94\x8c\x0bLos Angeles\x94\x8c\rSan Francisco\x94\x90\x8c\x08New York\x94\x8f\x94(\x8c\x06Albany\x94\x8c\tRochester\x94\x8c\x08Syracuse\x94\x8c\x07Buffalo\x94h\x0f\x90u.'

{'Texas': {'Arlington', 'Fort Worth', 'Austin', 'Houston', 'San Antonio', 'Dallas'}, 'California': {'Los Angeles', 'San Bruno', 'Sacramento', 'San Francisco'}, 'New York': {'New York', 'Rochester', 'Syracuse', 'Buffalo', 'Albany'}}


In [15]:
# When saving to and loading from files, the file needs to be opened in binary mode.
# .p is a common extension for pickled files (.pkl and .pickle are also widely used).
import pickle
my_data = {
    "Texas": {"Houston", "Dallas", "Fort Worth", "Arlington", "Austin", "San Antonio"},
    "California": {"Los Angeles", "San Francisco", "Sacramento", "San Bruno"},
    "New York": {"New York", "Albany", "Syracuse", "Rochester", "Buffalo"}
}

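# Dump the data to a file.  The "with" block closes the file when it finishes,
# so everything is written to disk before we read it back.
with open("my_data.p", "wb") as OUTFILE:
    pickle.dump(my_data, OUTFILE)

# Load the data from file
with open("my_data.p", "rb") as INFILE:
    print(pickle.load(INFILE))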

{'Texas': {'Arlington', 'Fort Worth', 'Austin', 'Houston', 'San Antonio', 'Dallas'}, 'California': {'Los Angeles', 'San Bruno', 'Sacramento', 'San Francisco'}, 'New York': {'New York', 'Rochester', 'Syracuse', 'Buffalo', 'Albany'}}
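
Earlier I mentioned that pickle can even store a function.  Here is a small sketch of what that means in practice, and where it stops working.  It uses `math.sqrt` purely as a convenient example function; as far as I know, pickle stores functions by name rather than by copying their code, which is why the lambda in the second half fails.

```python
import math
import pickle

# A function is pickled as a *reference* (module name + function name),
# not as a copy of its code, so loading it hands back the very same object.
restored = pickle.loads(pickle.dumps(math.sqrt))
same_function = restored is math.sqrt
print(same_function)  # prints True

# A lambda has no importable name, so pickle has nothing to refer to.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError) as error:
    lambda_pickled = False
    print("Could not pickle the lambda:", type(error).__name__)
```

The practical upshot: whoever loads a pickled function (or custom class) needs to be able to import the same module you used when you saved it.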


# A note on avoiding binary data formats

Plain text formats for data storage are popular for a reason.  They're extremely easy to work with.  There are plenty of excellent *binary* data formats that are great for storing and sharing data--Parquet for tabular data, for instance--but they usually require more work to implement a parser for, and have more specialized use cases and goals.

I recommend, in general, that you default to storing your data in plain text files, with a UTF-8 encoding.  They're easy to store, easy to share, and easy to inspect without having to load the whole thing into a program first.
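
As a quick sketch of that recommendation, the cell below writes and reads a text file with the encoding spelled out.  The file name `utf8_demo.txt` and the sample text are made up, and the file goes into a temporary folder so nothing is left behind.

```python
import os
import tempfile

text = "naïve café, 東京, π ≈ 3.14159\n"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "utf8_demo.txt")  # hypothetical file name

    # Name the encoding explicitly; on some systems the default is not UTF-8.
    with open(path, "w", encoding="utf-8") as OUTFILE:
        OUTFILE.write(text)

    with open(path, "r", encoding="utf-8") as INFILE:
        round_tripped = INFILE.read()

print(round_tripped == text)  # prints True
```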

# Compressed files: Making files smaller without losing any data

Data compression is extremely important.  A lot of data contains some amount of repetition; any time there's repetition, you can essentially "get rid" of some of it, replace it with a placeholder, and save some space in the file.  There are a lot of different algorithms for compressing data, each with their own pros and cons, but you'll see a few pop up all over the place:

- GZIP, which uses the .gz extension.
- BZIP2, which uses the .bz2 extension.
- LZMA (sometimes erroneously called XZ; XZ is the name of a program that implements LZMA compression), which uses the .xz extension.
- LZ4, which uses the .lz4 extension.
- Zstandard, which uses the .zst extension.  It's taking over as the general go-to compression algorithm, for good reason.

GZIP, BZIP2, and LZMA/XZ are supported in Python's standard library.  Zstandard is supported in the `zstandard` third-party library (and, starting with Python 3.14, in the standard library's `compression.zstd` module), and LZ4 in the `lz4` third-party library.  (Clever names, I know.)
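
To see the "repetition saves space" idea in action, here is a rough sketch using only the three standard-library modules.  The sample data is made up and deliberately very repetitive, so the numbers you see will be far better than what you'd get on real data.

```python
import bz2
import gzip
import lzma

# Repetitive data (made up for this demo) compresses very well.
data = b"the quick brown fox jumps over the lazy dog\n" * 1000

compressed = {
    "gzip": gzip.compress(data),
    "bz2": bz2.compress(data),
    "lzma": lzma.compress(data),
}
decompressors = {
    "gzip": gzip.decompress,
    "bz2": bz2.decompress,
    "lzma": lzma.decompress,
}

print("original:", len(data), "bytes")
for name, blob in compressed.items():
    # Decompressing must give back exactly the bytes we started with.
    assert decompressors[name](blob) == data
    print(name + ":", len(blob), "bytes")
```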

The details of how compression works and how to pick the right algorithm are a discussion for another time, but each algorithm makes different tradeoffs between compression ratio (how small the final compressed data is), compression speed, decompression speed, and memory use for compression/decompression.  Here's my quick rule of thumb for how to pick a compression algorithm:
- LZ4: bad compression ratios, but extremely fast.  Pick this when you need to be able to read and write to your file very quickly, and saving space is a "nice to have" feature.
- GZIP/BZIP2 when you're willing to give up more speed in exchange for better compression ratios.
- LZMA/XZ for when you are willing to give up a lot of speed for *really* good compression ratios, and pretty fast decompression later on.
- Zstandard if you can use it, and you don't have any weird compatibility requirements.  (It can be as fast as LZ4 and have compression ratios as high as XZ).  Use Zstandard when you need good compression ratios and *very* fast decompression--only LZ4 is faster, but Zstandard has way better compression ratios.

Every compression algorithm lets you specify a *compression level*.  Higher compression levels mean your computer will spend more time figuring out how to save as much space as possible when compressing the file, so it's a speed-versus-size tradeoff.
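
Here is a small sketch of setting the compression level with the standard-library modules.  Note that the argument has a different name in each module: `compresslevel` for `gzip` and `preset` for `lzma`.  The word list and random seed are made up just to generate some fake text to compress.

```python
import gzip
import lzma
import random

# Build some mildly repetitive fake text (a made-up word list, fixed seed).
random.seed(0)
words = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta"]
data = " ".join(random.choice(words) for _ in range(50_000)).encode("utf-8")

# Compress the same data at a low and a high level with each module.
sizes = {
    "gzip level 1": len(gzip.compress(data, compresslevel=1)),
    "gzip level 9": len(gzip.compress(data, compresslevel=9)),
    "lzma preset 0": len(lzma.compress(data, preset=0)),
    "lzma preset 9": len(lzma.compress(data, preset=9)),
}

print("original:", len(data), "bytes")
for label, size in sizes.items():
    print(label + ":", size, "bytes")
```

Try wrapping each call in a timer to see the other half of the tradeoff; how much the higher levels buy you depends heavily on the data.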

An important note: once a file is compressed, it's a binary file, no longer a text file!

Reading and writing compressed files looks the same regardless of the algorithm used, so the following examples will just use LZMA/XZ.  Every library has an `open()` function that behaves almost exactly like Python's built-in `open()`, but which will manage compression/decompression for the specified algorithm.

In [16]:
import lzma

# Open in "wt" mode --> Python will handle compressing the text and writing the binary
# data to the file.
with lzma.open("compression_demo.xz", "wt") as OUTFILE:
    OUTFILE.write("I'm some data!  This isn't enough data to get good compression with.  "
                  "IN fact, the compressed file may actually get _larger_, since it needs "
                  "to add some data about the compression into the file.")
    
# Read the data with Python's open() function
print(open("compression_demo.xz", "rb").read())

# Read it with lzma.open(), which will handle decompression transparently for us.
# "rb" mode will decompress the data and return it as bytes, which can be a bit faster
# than "rt" mode, which performs the extra step of decoding this into a normal Python
# string.  When in doubt, you generally want "rt".
print()
print(lzma.open("compression_demo.xz", "rb").read())
print()
print(lzma.open("compression_demo.xz", "rt").read())

b'\xfd7zXZ\x00\x00\x04\xe6\xd6\xb4F\x02\x00!\x01\x16\x00\x00\x00t/\xe5\xa3\xe0\x00\xc1\x00\x90]\x00$\x89\xc9\xa2\x03\xd1\x14\xfc0Y\x02\xf3\xbd@j\x0f\xceK\x93\xce-\xe9\x08?\xf8\xf6\x8f\x9f\r}\x99g\x02[\xb4\x80/w0(\xb2\xff\xb4\x82\x84\xc94\xb5j\xcf;\x19\x93\xb0\xfdCi3\xf0;N&\x07:yn\xe4\xa7\xdb\xc1&heO\x0c\xa9\x13\xde\xa0\x96\n\x95\x02t\xeb_"\xa9\xfcw\xceX6\x16sp\x9b\x05X\xe9C\x02%\xd5\x04\x8c\x06OP\x08\xd6d\x05\xa1mO\xef\x1f\xbbM\x088h9\xc0W\xb2\xec\xd3!iF9\x15"r&\x06C\xbf\xeb}*n\x00@\x1fU4ss\xd26\x00\x01\xac\x01\xc2\x01\x00\x00B`mM\xb1\xc4g\xfb\x02\x00\x00\x00\x00\x04YZ'

b"I'm some data!  This isn't enough data to get good compression with.  IN fact, the compressed file may actually get _larger_, since it needs to add some data about the compression into the file."

I'm some data!  This isn't enough data to get good compression with.  IN fact, the compressed file may actually get _larger_, since it needs to add some data about the compression into the file.
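
To back up the claim that the other algorithms look the same, here is a sketch that runs the identical write-then-read code against `gzip`, `bz2`, and `lzma` from the standard library.  The file names and sample text are made up, and everything is written to a temporary folder.

```python
import bz2
import gzip
import lzma
import os
import tempfile

text = "The same few lines of code work for every algorithm.\n" * 100
results = {}

with tempfile.TemporaryDirectory() as folder:
    for module, extension in [(gzip, ".gz"), (bz2, ".bz2"), (lzma, ".xz")]:
        path = os.path.join(folder, "compression_demo" + extension)

        # Each module's open() takes the same mode strings as the built-in open().
        with module.open(path, "wt", encoding="utf-8") as OUTFILE:
            OUTFILE.write(text)

        with module.open(path, "rt", encoding="utf-8") as INFILE:
            results[extension] = INFILE.read()

print(all(value == text for value in results.values()))  # prints True
```

Only the module name and the file extension change between algorithms; the reading and writing code stays the same.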
