# Parsing a KML File Using Beautiful Soup

We are going to process this open-source, publicly curated [SAM KML](https://indigo.sgn.missouri.edu/static/Africa_SAM_2013.kml) file. An excerpt is shown here:

```XML
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document>
	<name>Africa.kml</name>
...


						<Placemark>
							<name>Empty EW site</name>
							<visibility>0</visibility>
							<description><![CDATA[Date of last activity:  N/A<br>]]></description>
							<LookAt>
								<longitude>32.32708810370011</longitude>
								<latitude>31.25651784460001</latitude>
								<altitude>0</altitude>
								<heading>-3.294199353736108</heading>
								<tilt>0</tilt>
								<range>832.197179666657</range>
								<gx:altitudeMode>relativeToSeaFloor</gx:altitudeMode>
							</LookAt>
							<styleUrl>#msn_open-diamond16730</styleUrl>
							<Point>
								<coordinates>32.32774961582337,31.2566830854452,0</coordinates>
							</Point>
						</Placemark>

...
</Document>
</kml>
```

The file contains a collection of "placemarks", which mark the locations of objects on Earth. Placemarks are one of the most common elements of KML files and can have many different child elements, such as the name, description, and coordinates in the excerpt. If you are unfamiliar with KML files, you may find it helpful to [review their basic structure](https://developers.google.com/kml/documentation/kml_tut#basic_kml).

This lab shows how Beautiful Soup can be used to load the KML data into a pandas data frame.
Afterward, you can take the data frame forward for plotting and other tasks.

Recall from a previous lab: You will need to restart your notebook kernel (Kernel > Restart) after the next command completes!

In [1]:
!conda install -y lxml

Collecting package metadata: done
Solving environment: - 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - conda-forge/linux-64::matplotlib==3.0.3=py37_1
\ Killed


In [None]:
from bs4 import BeautifulSoup

# Within a KML coordinate tuple (longitude,latitude,altitude), spaces after commas
# are not allowed, so each tuple can be split on commas alone.
# Source: https://groups.google.com/forum/#!msg/kml-support-getting-started/lf6V4YCJlr8/32e1YlfJ8joJ

# TODO: fix the filename for the final version
FILENAME = '/dsa/data/all_datasets/Africa_SAM_2013.kml'

# Collect every parsed entry in this module-level list so that the recursive
# functions below do not have to pass results back up the call stack.
entries = []

with open(FILENAME) as f:
    print("Input File '{}' opened for reading".format(FILENAME))
    
    tree = BeautifulSoup(f, "lxml-xml")

    print("XML File parsed into a DOM")


In [None]:
# The 'tree' object returned from BeautifulSoup is a BeautifulSoup object, also known as a 'parse tree'.
# It is referred to as a container in the functions defined below.
type(tree)

In [None]:
print(tree.Placemark)  # Here's a glimpse of the source tree for one Placemark

The following functions parse the KML tree,
pulling out the attributes we need into the `entries` list defined above.
The first function, `kml_sqlgen_all_DFS`, walks the "Folder" and "Document" containers in the tree recursively, depth first.
After it has descended into a container's children, it passes that container to the second function, `kml_sqlgen_single_container`. The second function processes each placemark that sits directly inside the container
and puts its component geometries (point, line, polygon) into the `entries` list.

In [None]:
def kml_sqlgen_all_DFS(tree):
    """
    Executes a depth-first search on the provided KML container recursively.
    If a container has either a Folder or Document child, it will recurse down.
    Once a container is met that has no relevant children,
    and after the recursion has finished,
    the current container will be sent to
    kml_sqlgen_single_container for final processing.

    Inputs:
        tree:       The container to be processed
    """

    ignoredFolderNames = ['Facilities', 'SHORAD SAMs']

    # KML is a hierarchy of nodes. Look only at the direct child containers
    # and pass each one back to this function.
    for child in tree.find_all(['Folder', 'Document'], recursive=False):

        child_name = child.find('name').contents[0]
        print("Processing : {}".format(child_name))

        if child_name in ignoredFolderNames:
            print("Skipping")
            continue

        # Recursive call to process a subtree
        kml_sqlgen_all_DFS(child)

    # Recursion Stop if the (sub-)tree object is an ignored folder
    if (tree.find('name').contents[0] in ignoredFolderNames):
        return

    kml_sqlgen_single_container(tree)
    return
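
The traversal order can be easier to follow on a toy structure. The sketch below is not the lab's code: it replaces the parse tree with plain nested dictionaries and uses a hypothetical helper named `walk`. The folder and placemark names, apart from `Facilities` and `SHORAD SAMs`, are made up for illustration. It mirrors the same idea: skip ignored folders, descend into child containers first, then handle the placemarks that sit directly in the current container.

```python
# Hypothetical stand-in for a KML hierarchy: each node has a name,
# optional child folders, and optional placemark names.
IGNORED = {"Facilities", "SHORAD SAMs"}


def walk(node, parents=()):
    """Yield (path, placemark) pairs depth first, skipping ignored folders."""
    if node["name"] in IGNORED:
        return
    path = parents + (node["name"],)
    # Descend into the child containers first.
    for child in node.get("folders", []):
        yield from walk(child, path)
    # Then handle the placemarks directly inside this container.
    for placemark in node.get("placemarks", []):
        yield "/".join(path + (placemark,)), placemark


# Made-up example data, not taken from the real KML file.
toy_tree = {
    "name": "Africa.kml",
    "folders": [
        {"name": "Folder A", "placemarks": ["Empty EW site"]},
        {"name": "Facilities", "placemarks": ["Ignored site"]},
    ],
    "placemarks": ["Top-level site"],
}

paths = [path for path, _ in walk(toy_tree)]
print(paths)
```

The placemark under `Facilities` never appears in `paths`, and the nested placemark is reported before the top-level one, which is the depth-first order described in the docstring.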

In [None]:
def kml_sqlgen_single_container(container):
    """
    Search a single container, processing all of the place-marks in the
    container and their relevant geometries.
    
    Adds entries to the global list
    """

    name = container.find('name').text
    print("Processing {} as a single container".format(name))

    temp = container.find_parent(['Folder', 'Document'])
    while (temp):
        name = temp.find('name').text + "/" + name
        temp = temp.find_parent(['Folder', 'Document'])

    print("Found Nav-Path to be {}".format(name))
    for placemark in container.find_all('Placemark', recursive=False):

        geometries = placemark.find_all(['Point', 'LineString', 'Polygon'])

        # We don't want to process Placemarks that don't have any geometry.
        if not geometries:
            continue

        print("Found a placemark : {}".format(placemark.find('name').text))
        pathString = name + "/" + placemark.find('name').text

        for geometry in geometries:

            # Test the geometry itself rather than the placemark: a placemark can hold
            # several geometries, and placemark.Point always returns only the first Point.
            if geometry.name == 'Point':
                # This sample data doesn't contain polygons or linestrings; coords will
                # most likely need to be handled in a similar fashion for those types.
                entries.append(dict(type='point', coords=geometry.find('coordinates').text, path=pathString))

            elif geometry.name == 'Polygon':
                entries.append(dict(type='polygon', coords=str(geometry), path=pathString))

            else:   # At this point, it can only be a LineString.
                entries.append(dict(type='line', coords=str(geometry), path=pathString))
    
    return
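
The function stores point coordinates as the raw text of the `<coordinates>` element. A minimal sketch of turning that text into numbers is shown below. The helper name `parse_coordinates` is hypothetical and not part of the lab. It assumes the usual KML layout: values inside a tuple are separated by commas with no spaces, tuples are separated by whitespace, and the altitude is optional.

```python
def parse_coordinates(text):
    """Turn a KML <coordinates> string into a list of (lon, lat, alt) floats."""
    points = []
    # Tuples are separated by whitespace; values inside a tuple by commas.
    for chunk in text.split():
        values = [float(v) for v in chunk.split(",")]
        lon, lat = values[0], values[1]
        # Assumption: a missing altitude is treated as 0.0.
        alt = values[2] if len(values) > 2 else 0.0
        points.append((lon, lat, alt))
    return points


# The coordinate string from the Placemark excerpt at the top of this lab.
point = parse_coordinates("32.32774961582337,31.2566830854452,0")
print(point)

# A made-up multi-tuple string, as a LineString might contain.
line = parse_coordinates("1,2 3,4,5")
print(line)
```

Because the text is split on whitespace first, the same helper could be applied to the multi-tuple coordinate strings of LineString or Polygon elements once those are extracted.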

In [None]:
entries.clear()

kml_sqlgen_all_DFS(tree.Document)


In [None]:
print(len(entries))
print(entries[0])
print(type(entries))

In [None]:
import pandas
frame = pandas.DataFrame(entries)

frame

Now that the data is loaded into a data frame, you can use a collection of other Python tools to plot, manipulate, and analyze the data.
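
As one possible next step, the `coords` strings of point entries can be split into numeric columns. The sketch below is self-contained: it builds a small data frame from hypothetical entries in the same shape as the `entries` list (the `path` values are made up), and the names `sample` and `points` are illustrative only.

```python
import pandas

# Hypothetical entries in the same shape as the `entries` list built above.
sample_entries = [
    {"type": "point", "coords": "32.32774961582337,31.2566830854452,0", "path": "Africa.kml/Folder A/Empty EW site"},
    {"type": "point", "coords": "10.5,-3.25,0", "path": "Africa.kml/Folder A/Another site"},
]
sample = pandas.DataFrame(sample_entries)

# Keep only the point rows; polygons and lines store a different kind of string.
points = sample[sample["type"] == "point"].copy()

# Split "lon,lat,alt" into three columns and convert them to floats.
parts = points["coords"].str.strip().str.split(",", expand=True).astype(float)
points["longitude"] = parts[0]
points["latitude"] = parts[1]

print(points[["path", "longitude", "latitude"]])
```

With numeric `longitude` and `latitude` columns in place, the frame can be passed to plotting or geospatial libraries.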

# Save your Notebook