# 7. Collections

What would a LiDAR processing package be without handling multiple acquisition tiles at a time? As of the **0.3.0** update, this is possible with pyfor. pyfor subclasses `geopandas.GeoDataFrame` to create a new class, `CloudDataFrame`. For those familiar with `geopandas`, this provides a flexible and extensible class for manipulating collections of point cloud tiles.

The `CloudDataFrame` is a memory-efficient API for handling large collections of `.las` files. It supports spatial queries and can execute arbitrary functions on each tile in parallel.

## Creating a Collection

The first step is to create the collection. To do so, we point `pyfor.collection.from_dir` at a directory containing many `.las` files. The optional `n_jobs` argument sets the number of threads the collection will use when conducting processing tasks.

In [7]:
import pyfor
col = pyfor.collection.from_dir("/home/bryce/Desktop/pyfor_macdunn/", n_jobs=4)
col.head()

| | las_path | bounding_box |
|---|---|---|
| 0 | /home/bryce/Desktop/pyfor_macdunn/44123F6102.las | POLYGON ((471419.78 1105407.84, 474843.21 1105... |
| 1 | /home/bryce/Desktop/pyfor_macdunn/44123F6103.las | POLYGON ((474664.8 1105280.45, 478087.76 11052... |
| 2 | /home/bryce/Desktop/pyfor_macdunn/44123F6101.las | POLYGON ((468174.57 1105535.66, 471598.82 1105... |


On initialization (using `.from_dir`), pyfor collects the file path of each `.las` file in the directory we provided and stores these paths in the `las_path` column. Additionally, a bounding box is read from the header of each `.las` file and stored in the `bounding_box` column, which is set as the data frame's geometry column.
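To make the header read concrete, here is a minimal sketch that extracts the extent fields from a LAS public header block using only the standard library. It is not pyfor's implementation, and the function names `read_header_bounds` and `bounds_to_wkt` are hypothetical. It assumes the LAS 1.x header layout, in which the extents are six little-endian doubles starting at byte 179.

```python
import struct

# Extent fields in the LAS 1.x public header block: six little-endian
# doubles stored in the order max_x, min_x, max_y, min_y, max_z, min_z.
EXTENT_OFFSET = 179
EXTENT_FORMAT = "<6d"


def read_header_bounds(las_path):
    """Return (min_x, min_y, max_x, max_y) as recorded in a .las header."""
    with open(las_path, "rb") as f:
        # Every LAS file starts with the 4-byte signature "LASF"
        if f.read(4) != b"LASF":
            raise ValueError(f"{las_path} is not a LAS file")
        f.seek(EXTENT_OFFSET)
        raw = f.read(struct.calcsize(EXTENT_FORMAT))
    max_x, min_x, max_y, min_y, _max_z, _min_z = struct.unpack(EXTENT_FORMAT, raw)
    return min_x, min_y, max_x, max_y


def bounds_to_wkt(bounds):
    """Format (min_x, min_y, max_x, max_y) as a closed rectangular WKT polygon."""
    min_x, min_y, max_x, max_y = bounds
    corners = [
        (min_x, min_y),
        (max_x, min_y),
        (max_x, max_y),
        (min_x, max_y),
        (min_x, min_y),  # repeat the first corner to close the ring
    ]
    ring = ", ".join(f"{x} {y}" for x, y in corners)
    return f"POLYGON (({ring}))"
```

Because only the header is touched, no point records are loaded. The result is only as reliable as the header itself, which is worth keeping in mind for files written by tools that do not update their extents.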

## Mapping a Collection

It may be of interest to generate information about where our tiles are located. If we trust the headers stored in our `.las` files, we can easily accomplish this with the `CloudDataFrame`. Because this object inherits from `GeoDataFrame`, we simply need to call the `.to_file` method to save our geometries. Before saving, let's also set the `.crs` attribute of our data frame.

In [14]:
import pyproj

col.crs = pyproj.Proj({'init': 'epsg:2994'}).srs
col.to_file('/home/bryce/Desktop/pyfor_macdunn/geoms/bbox.shp')

## Processing in Parallel

Now we can use the `par_apply` method, which applies an arbitrary function to each tile in the collection. Let's define a function we might want to apply to each tile. This function must accept a single argument on each iteration: the path of the `.las` file.

In [9]:
def my_func(las_path):
    # Load a cloud object
    pc = pyfor.cloud.Cloud(las_path)
    # Return the third entry of the tile's minimums (stored as min_z below)
    return pc.data.min[2]

# Set a new column for the CloudDataFrame
col["min_z"] = col.par_apply(my_func, "las_path")
col.head()

| | las_path | bounding_box | min_z |
|---|---|---|---|
| 0 | /home/bryce/Desktop/pyfor_macdunn/44123F6102.las | POLYGON ((471419.78 1105407.84, 474843.21 1105... | 354.0 |
| 1 | /home/bryce/Desktop/pyfor_macdunn/44123F6103.las | POLYGON ((474664.8 1105280.45, 478087.76 11052... | 262.8 |
| 2 | /home/bryce/Desktop/pyfor_macdunn/44123F6101.las | POLYGON ((468174.57 1105535.66, 471598.82 1105... | 299.54 |
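The contract described above (one path in, one value out, results in tile order) can be illustrated with the standard library alone. The sketch below is not how `par_apply` is implemented; `parallel_apply` and `tile_name` are hypothetical names, and a thread pool is used only to keep the example short.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_apply(func, paths, n_jobs=4):
    """Call func(path) for every path with n_jobs workers, keeping input order."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which worker finishes first
        return list(pool.map(func, paths))


def tile_name(las_path):
    # Stand-in for real per-tile work such as loading a cloud
    return os.path.splitext(os.path.basename(las_path))[0]
```

Because results come back in input order, they can be assigned directly to a new column, as with `min_z` above. For CPU-bound work in pure Python, a process pool is usually the better fit, since threads share the interpreter lock.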


## Spatial Indexing and Queries

`CloudDataFrame` takes advantage of the `.lax` spatial indexing method implemented in Martin Isenburg's `lastools` (which is automatically installed as part of your environment). It relies on another Python package, `laxpy`, to generate and parse `.lax` files, which are required for conducting spatial queries on collections. Our first step is to generate these `.lax` files if they are not already present, and `CloudDataFrame` has a convenient method for this.

For lower-level access to `.lax` files, see [laxpy](https://github.com/brycefrank/laxpy).

In [11]:
col.create_index()

This method calls `lasindex` and generates `.lax` files using the default settings. Now that `.lax` files are present, we can conduct spatial queries via the `CloudDataFrame.clip` method. Note that this method differs from `Cloud.clip` in that it is intended for memory-efficient clipping of multiple polygons.
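Before running queries, it can be useful to confirm that every tile actually has an index. The following is a small sketch, assuming each `.lax` file sits beside its `.las` file and shares its base name; `missing_lax` is a hypothetical helper, not part of pyfor.

```python
from pathlib import Path


def missing_lax(las_dir):
    """Return the .las files in las_dir that have no .lax file beside them."""
    las_files = sorted(Path(las_dir).glob("*.las"))
    # Assumption: the index for tile.las is named tile.lax in the same directory
    return [p for p in las_files if not p.with_suffix(".lax").exists()]
```

An empty list means every tile in the directory has a matching index file.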

First, we need to access our query polygon data. `.clip` will accept either a list of `shapely.geometry.Polygon` objects or, more conveniently, a `geopandas.GeoSeries`. We will use the latter.

In [24]:
import geopandas as gpd

polys = gpd.GeoSeries.from_file('/home/bryce/Desktop/pyfor_macdunn/geoms/query_polygons.shp')
col.clip(polys, '/home/bryce/Desktop/pyfor_macdunn/query')
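One reason a collection-level clip can stay memory-efficient is that most tiles never need to be opened for a given query. The sketch below shows that general idea with plain bounding boxes. It is an illustration under stated assumptions, not a description of how `CloudDataFrame.clip` works internally, and the tile names and extents are hypothetical.

```python
def boxes_intersect(a, b):
    """True if two (min_x, min_y, max_x, max_y) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidate_tiles(tile_bounds, query_bounds):
    """Return the paths of tiles whose bounding box intersects the query box."""
    return [
        path
        for path, box in tile_bounds.items()
        if boxes_intersect(box, query_bounds)
    ]


# Hypothetical tile extents, not the values from the collection above
tiles = {
    "tile_a.las": (0.0, 0.0, 100.0, 100.0),
    "tile_b.las": (100.0, 0.0, 200.0, 100.0),
    "tile_c.las": (200.0, 0.0, 300.0, 100.0),
}

# A query box straddling the boundary between tile_a and tile_b
print(candidate_tiles(tiles, (90.0, 10.0, 110.0, 20.0)))
# prints ['tile_a.las', 'tile_b.las']
```

Only the candidate tiles would then need a finer lookup, which is the role a spatial index such as `.lax` plays within each file.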