# Project 
## Our Proposal
We choose to divide key-biotopes by the types of habitats they contain. We then run a simple learning algorithm on laser data (stored in `.laz` files) and see whether we can distinguish areas that include particular key-biotopes from others.

### Method:
Laser measurements are available for square areas of 2.5 km × 2.5 km. To facilitate the analysis, we proceed in steps:

- **Step 1:** Divide these squares into smaller squares of equal area (henceforth called tiles).
- **Step 2:** Compute the percentage of each tile's area that is occupied by key-biotopes. Then, according to these percentages, label tiles by whether they contain key-biotopes or not, i.e. form a positive set and a negative set.
- **Step 3:** From the laser data, compute a set of variables that characterize each tile.
- **Step 4:** Apply a classification algorithm on the resulting dataset.
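
The labelling in Step 2 can be sketched as follows. This is only an illustration: the array `kb_ratio` and the cut-off `THRESHOLD` are hypothetical and are not taken from the project data.

```python
import numpy as np

# Hypothetical key-biotope share of each tile (fraction of tile area, 0..1)
kb_ratio = np.array([0.0, 0.02, 0.35, 0.0, 0.80, 0.10])

# Assumed cut-off: a tile is "positive" when more than 10% of it is key-biotope
THRESHOLD = 0.10

# 1 = positive set, 0 = negative set
labels = (kb_ratio > THRESHOLD).astype(int)
print(labels)
```

Where the cut-off is placed is a design choice: a low value gives more positive tiles, but many of them will contain only a sliver of key-biotope.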

### Recap
In the last set of exercises you went through step 1 and most of step 2. You produced shapefiles of tiles describing their location, the ratio of their area occupied by key-biotopes (coniferous forests), and the `.laz` files that contain the corresponding laser measurements.
In this set of exercises you will first choose a dataset, i.e. a subset of tiles. Then you will load the corresponding `.laz` files and summarize the data in each to a handful of variables.

# Select a Dataset
## Load Data
- To Do:
1. Load the tiles shapefiles from the last step.

## Pick a Dataset Randomly?
- To Do:
1. How much of the dataset is positive (contains key-biotopes)?
2. If we used a simple classification rule, "all tiles belong to the negative class and don't contain key-biotopes", how often would we be correct?
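
Both quantities follow directly from the labels. A minimal sketch on made-up labels (the array below is hypothetical, not the project data):

```python
import numpy as np

# Hypothetical labels: 1 = tile contains key-biotopes, 0 = it does not
labels = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])

positive_share = labels.mean()          # fraction of positive tiles
baseline_accuracy = 1 - positive_share  # accuracy of "always predict negative"

print(f"positive share: {positive_share:.2f}")
print(f"baseline accuracy: {baseline_accuracy:.2f}")
```

The rarer the positive class, the closer this trivial rule gets to 100% accuracy, which is why accuracy alone says little on such data.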

Most machine learning algorithms used for classification are designed with the assumption of a roughly equal number of instances in each class. A dataset that has many more instances of one class is called **imbalanced**.
While an imbalanced dataset can be the result of flawed data collection, in some problems it is expected. In our problem, it is natural to expect that the majority of the area is not covered by key-biotopes.

One way to deal with this imbalance is to *resample* the data: we choose a subset of the data that contains almost equal numbers of instances of each class.
- To Do:
1. Pick a subset of the dataset randomly, but make sure it is not imbalanced.

In [None]:
N_SAMPLE = 1000
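
One possible way to draw such a subset is to undersample the larger class. The sketch below is an assumption about how this could be done, not the required solution; the helper `balanced_sample` and the example labels are hypothetical.

```python
import numpy as np

def balanced_sample(labels, n_sample, seed=0):
    """Return indices of a random subset with equal counts of both classes."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)

    # Each class contributes half of the sample, limited by the rarer class
    n_per_class = min(n_sample // 2, len(pos_idx), len(neg_idx))

    chosen = np.concatenate([
        rng.choice(pos_idx, size=n_per_class, replace=False),
        rng.choice(neg_idx, size=n_per_class, replace=False),
    ])
    rng.shuffle(chosen)  # avoid having all positives first
    return chosen

# Hypothetical imbalanced labels: 50 positive tiles out of 1000
labels = np.array([1] * 50 + [0] * 950)
subset = balanced_sample(labels, n_sample=60)
print(len(subset), labels[subset].sum())
```

Note that if there are fewer positive tiles than half of the requested sample size, the subset returned is smaller than requested.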

# Step 3:
Now that you have a dataset of tiles, it is time to get the actual laser measurements that describe these tiles.

## 3.1 Introduction to Lidar and `.las` Files:
**Read this section to understand the code provided** 

### Lidar
Lidar (light detection and ranging) is an optical remote-sensing technique. It is a method to determine ranges (variable distance) by targeting an object with a laser and measuring the time for the reflected light to return to the receiver.
Lidar is used primarily in airborne laser mapping to densely sample the surface of the earth, producing highly accurate x, y, z measurements.
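
The ranging principle comes down to one formula: the pulse travels to the target and back, so the distance is half the round-trip time multiplied by the speed of light. A small sketch, where the function name and the example time are made up for illustration:

```python
# Speed of light in vacuum, in metres per second
SPEED_OF_LIGHT = 299_792_458.0

def range_from_round_trip(seconds):
    """Distance to the target; the pulse travels there and back, hence the division by 2."""
    return SPEED_OF_LIGHT * seconds / 2

# Hypothetical round-trip time of 10 microseconds, roughly 1.5 km
print(range_from_round_trip(10e-6))
```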

For each laser pulse, several attributes are recorded in addition to x, y and z values. This additional information includes
- intensity: the return strength of the laser pulse,
- return number & number of returns: an emitted laser pulse can have up to 5 returns depending on the reflecting surface,
- point classification value: a classification defines the type of object which reflected the laser pulse. Lidar points can be classified into a number of categories including bare earth or ground, top of canopy, and water. Different classes are defined by integer codes.

### LAS and LAZ Files
Lidar produces mass point cloud datasets which are stored in `.las` files or their compressed version `.laz` files.

A LAS file contains a number of fields for each point in the lidar point cloud. Those fields (or dimensions) include coordinates X, Y and Z, intensity, return_number, number_of_returns, scan_direction_flag, edge_of_flight_line, classification, etc.

A LAS file also has a header. It contains metadata about the file, including information about the coordinate reference system (CRS), the range of elevation, and the scale and offset of the coordinates.

See more about [Lidar](https://pro.arcgis.com/en/pro-app/latest/help/data/las-dataset/what-is-lidar-.htm).


### Reading LAZ files

- We use the `pylas` module to read these files

```python
import pylas

las_data = pylas.read('path/to/file.laz') 

```

- Read the coordinates, classification value and other attributes

```python

x, y, z = las_data.X, las_data.Y, las_data.Z
classification = las_data.classification
```

- The upper-case `X`, `Y` and `Z` fields hold the raw integer values stored in the file. To get real-world coordinates, scale and offset them with the values stored in the header.

```python
x = las_data.X * las_data.header.x_scale + las_data.header.x_offset
```
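
To see what the scaling does without opening a file, here is a sketch with made-up numbers. The scale, offset and raw values are hypothetical; real ones come from `las_data.header` and `las_data.X`.

```python
import numpy as np

# Hypothetical header values; real ones come from las_data.header
x_scale, x_offset = 0.01, 500_000.0

# Raw integer X values as stored in the file
X_raw = np.array([0, 12_345, 250_000], dtype=np.int32)

# Real-world coordinate = raw integer * scale + offset
x = X_raw * x_scale + x_offset
print(x)
```

Storing integers together with a common scale and offset keeps the file compact while preserving a fixed precision (here 0.01 coordinate units).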

## 3.2 Summarize LAZ Files?
LAZ files store millions of points, each with multiple attributes, covering some surface area. We want to boil this huge amount of data down to a few variables that meaningfully characterize the area in question. However, there is no clear way to do this: we don't know which variables could serve this purpose, or how many we should look for. It might even be impossible to get a useful summary this way.

A systematic solution to this problem should be an iterative process, where the set of variables will be tested in machine learning and subsequently improved.

**In this project we will carry out one such test. In the following code we chose to summarize each LAZ file with 30 variables: the percentage of points, the mean height "Z_mean" and the standard deviation of height "Z_std", for each combination of classification value (1 and 2) and return number (1, ..., 5).**
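
The 30 variables are 2 classification values × 5 return numbers × 3 statistics. The sketch below enumerates them in that order (class, then return number, then statistic); the naming scheme `c{class}_r{return}_{stat}` is a hypothetical choice, useful later as column names.

```python
# Hypothetical column names, ordered by class, then return number, then statistic
CLASSES = [1, 2]
RETURN_NUMBERS = [1, 2, 3, 4, 5]
STATS = ["pct", "Z_mean", "Z_std"]

column_names = [
    f"c{c}_r{n}_{stat}"
    for c in CLASSES
    for n in RETURN_NUMBERS
    for stat in STATS
]

print(len(column_names))
print(column_names[:3])
```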

In [None]:
# This function takes a laz file's name and the bounds of a tile, and returns a list of 30 variables that summarize
# the laser measurements of the tile. If the file cannot be read, all 30 values are NaN.

# path_to_dir: path to directory where laz files are stored

import numpy as np
import pylas

def summary_variables(path_to_dir, file_name, x_min, y_min, x_max, y_max):
    var = [np.nan]*30
    
    try:
        laz = pylas.read(path_to_dir + file_name[:6] + '/' + file_name.replace('.laz.laz', '.laz'))
    except Exception:  # file missing or unreadable: return the all-NaN summary
        return var
    else:
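        # Convert raw integer coordinates to real-world units and keep points within the x bounds
        x = laz.X * laz.header.x_scale + laz.header.x_offset
        ind_x = (x >= x_min) & (x <= x_max)
        # Only points that passed the x filter are tested against the y bounds
        y = laz.Y[ind_x] * laz.header.y_scale + laz.header.y_offset
        ind_y = (y >= y_min) & (y <= y_max)
        
        # Height, classification and return number of the points inside the tile
        z = laz.Z[ind_x][ind_y] * laz.header.z_scale + laz.header.z_offset
        cls = laz.classification[ind_x][ind_y]
        rn = laz.return_number[ind_x][ind_y]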
        num_pts = len(z)
        
        i = 0
        for c in [1,2]:
            ind_c = (cls==c)
            z_c = z[ind_c]
            
            if len(z_c) > 0:
                for n in [1,2,3,4,5]:
                    ind_n = (rn[ind_c] == n)
                    z_n = z_c[ind_n]
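                    # Share of all points in the tile with this class and return number
                    var[i] = len(z_n)/num_pts
                    
                    # Mean and std stay NaN when no points match
                    if len(z_n) > 0:
                        var[i+1] = z_n.mean()
                        var[i+2] = z_n.std()
                    i += 3
            else:
                # No points of this class: skip its 15 slots (left as NaN)
                i += 15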
        return var

- To Do:
1. In a new DataFrame, use the function provided to write the `summary_variables` output for each tile in your dataset.
2. Save the resulting DataFrame into a `.csv` file for the next step of the project.
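
A rough sketch of how the results could be collected and saved, assuming `pandas` is available. The column names and the stand-in rows are hypothetical; in the exercise each row would be the list returned by `summary_variables` for one tile.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical column names for the 30 variables (class, return number, statistic)
columns = [f"c{c}_r{n}_{s}" for c in (1, 2) for n in range(1, 6)
           for s in ("pct", "Z_mean", "Z_std")]

# Stand-in for the lists returned by summary_variables, one per tile
rows = [[np.nan] * 30, list(np.linspace(0.0, 1.0, 30))]

df = pd.DataFrame(rows, columns=columns)

# Written to an in-memory buffer here; pass a file path such as "learning_data.csv" instead
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(df.shape)
```

Tiles whose file could not be read end up as rows of NaN, so it may be worth dropping or inspecting those rows before the classification step.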

# Important Note:

The `.laz` files are available from the [Lantmäteriet website](https://www.lantmateriet.se/en/maps-and-geographic-information/geodataprodukter/produktlista/laserdata-nedladdning-skog/). However, those files are very large (more than 2 TB in total; the subset used here is tens of GB), so it might be better to remove the part dealing with them from the exercises and instead provide the summarized dataset as a "learning_data.csv" file.