# Project 

## The Challenge
The Swedish Forest Agency has mapped high conservation value forests through field surveys. Approximately 67 000 areas (key-biotopes, *nyckelbiotoper* in Swedish) are delimited in the database.
However, field surveys are costly and time-consuming. It is therefore proposed to use national continuous-cover datasets, such as laser scanning, satellite images and aerial photos, to identify high conservation value forests.

### The Problem:
The question is whether the database of forests with high conservation value (key-biotopes) could be used in machine learning to train a model to recognise similar forests. In particular, one needs to answer the following:

    - Is this dataset appropriate to use as training data?
    - If yes, how should the data set be prepared to be optimal training data? For example:
        - Should the dataset be divided into subsets of sites that exhibit similar characteristics? Which characteristics?
        - Should the polygons be edited in some way to improve accuracy?
    - Is more data needed?

## Our Proposal
We choose to divide key-biotopes by the types of habitats they contain. We then run a simple learning algorithm on laser data (stored in `.laz` files) and see whether we can distinguish areas that include particular key-biotopes from others.

### Method:
Laser measurements are available for square areas of 2.5 km × 2.5 km. To facilitate the analysis, we proceed in steps:

- **Step 1:** Divide these squares into smaller squares of equal area (henceforth called tiles).
- **Step 2:** Compute the percentage of each tile's area that is occupied by key-biotopes. Then, according to these percentages, label tiles by whether they contain key-biotopes or not, i.e. build a positive set and a negative set.
- **Step 3:** From the laser data, compute a set of variables that characterize each tile.
- **Step 4:** Apply a classification algorithm on the resulting dataset.

### Choices:
In the following we will only attempt a simplified application of the solution. Several parameters affect the subsequent learning and, in principle, they should be fine-tuned systematically or through trial and error. Instead, we will just make some naive choices.

1. We choose to analyse tiles that include coniferous forests (Barrskogar) as key-biotopes. You are welcome to pick a different habitat type.

In [None]:
HABITAT = 'Barrskogar'

2. Selecting the size of tiles.
    - If the size is small (high resolution): one gets a larger number of tiles to compare. There is a higher probability of getting tiles that are fully covered by key-biotopes, so their characteristics will correspond perfectly to key-biotopes. However, if the size is too small, a tile might fail to capture the characteristics that define the key-biotope.
    - If the size is big (low resolution): a larger tile might better capture the overall properties that distinguish a key-biotope from everything else. A smaller dataset is also less expensive to analyse. However, if the size is too big, we will lose all ability to differentiate key-biotopes.
    
We should try different sizes to arrive systematically at the optimum value. For simplicity, however, we will just **divide each laser square into 10 × 10 smaller squares, i.e. tiles of size 250 m × 250 m.**

In [None]:
TILE_SIZE = 250*250 # area of one tile, in square meters
DIV_SIDE = 10 # divide each side of a laser square by this value
DIV_AREA = 100 # number of tiles per laser square (DIV_SIDE squared)

3. We choose to label a tile as containing a key-biotope if 50% or more of its area is covered by key-biotopes.

In [None]:
THRESHOLD = 0.5
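
The tile constants chosen above are tied together by simple arithmetic, which the following minimal sketch checks. It assumes the 2.5 km side length given in the Method section; `SQUARE_SIDE`, `tile_side`, `tile_area` and `n_tiles` are illustrative names that do not appear in the notebook.

```python
# Assumption: each laser square is 2.5 km = 2500 m on a side (see Method).
SQUARE_SIDE = 2500  # meters; hypothetical name, not defined in the notebook
DIV_SIDE = 10       # number of divisions along each side

tile_side = SQUARE_SIDE / DIV_SIDE  # side length of one tile, in meters
tile_area = tile_side ** 2          # area of one tile, in square meters
n_tiles = DIV_SIDE ** 2             # number of tiles per laser square

print(tile_side, tile_area, n_tiles)
```

Note that `TILE_SIZE` is an area (square meters), not a side length, which matters later when it is used as the denominator of the key-biotope ratio.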

# Step 1:
In this step we work with the laser data referenced in 'rutor_shapefiles'. Recall that 'rutor_shapefiles' describes square-shaped areas in Sweden for which laser measurements are available (and stored in `.laz` files). We aim to divide each square area into 100 equal square tiles.

## 1.1 Load The Data, Explore and Filter
- To Do:
1. Import the necessary modules.
2. Load 'rutor_shapefiles' into a GeoDataFrame.
3. Explore the data, drop all uninformative columns. Note: in the following analysis, we also won't be needing any dates.
4. Read the Coordinate Reference System (CRS) of the GeoDataFrame. What are the units of distance in this CRS?
5. Currently we only have access to laser data of the year 2020, i.e. folders (column 'Block') starting with the number '20' as in `20A012` to `20F050`. Filter your GeoDataFrame accordingly.
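
The filter in item 5 comes down to a string-prefix test on the 'Block' codes. The sketch below shows that test in plain Python on a made-up list of codes; the sample values and the helper name `is_2020_block` are assumptions for illustration only.

```python
# Hypothetical sample of 'Block' values; the real ones come from the GeoDataFrame.
blocks = ["19B003", "20A012", "20F050", "21C007"]

def is_2020_block(block: str) -> bool:
    """Return True if the block code starts with '20' (laser data from 2020)."""
    return block.startswith("20")

# Keep only the blocks for which laser data is currently available.
blocks_2020 = [b for b in blocks if is_2020_block(b)]
print(blocks_2020)
```

On a GeoDataFrame the same condition can be written as a boolean mask, for example `gdf[gdf["Block"].str.startswith("20")]`, where `gdf` stands for whatever name you gave the loaded data.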

## 1.2 Split
- To Do:
1. Write a function that produces a rectangular polygon from four values (x_min, y_min, x_max, y_max). (x_min, y_min) is lower left corner, and (x_max, y_max) is the upper right corner.
        - Import the needed module from `shapely`.
2. Similar to 1, write a function that returns a list of $n \times n$ rectangles instead of one, where $n$ is an integer.
3. Write a function that produces a list of $n \times n$ squares from a square polygon.
        - `polygon.bounds` gives the values (x_min, y_min, x_max, y_max)
4. Now use the function from the previous point. Create a new, empty GeoDataFrame. Fill its 'geometry' column by dividing the polygons in the geometry of the GeoDataFrame from the previous section. Add CRS information to the new GeoDataFrame.
5. Fill the data of the new GeoDataFrame with the corresponding values from columns: 'square', 'Block', 'Las_Namn'. Add a new code to identify each tile.
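
The core of the splitting logic in items 2 and 3 is independent of any geometry library. The sketch below works on plain bounds tuples; the function name `split_bounds` and the example coordinates are assumptions, and it is one possible approach rather than the expected solution.

```python
def split_bounds(x_min, y_min, x_max, y_max, n):
    """Split a rectangle into an n x n grid.

    Returns a list of (x_min, y_min, x_max, y_max) tuples, one per cell.
    """
    dx = (x_max - x_min) / n  # width of one cell
    dy = (y_max - y_min) / n  # height of one cell
    cells = []
    for i in range(n):        # column index, left to right
        for j in range(n):    # row index, bottom to top
            cells.append((x_min + i * dx, y_min + j * dy,
                          x_min + (i + 1) * dx, y_min + (j + 1) * dy))
    return cells

# Example: a 2500 m x 2500 m square with its lower left corner at the origin.
tiles = split_bounds(0, 0, 2500, 2500, 10)
print(len(tiles), tiles[0], tiles[-1])
```

Each tuple can then be turned into a polygon, for instance with `shapely.geometry.box(x_min, y_min, x_max, y_max)`, and the input values for a square polygon can be read from `polygon.bounds`.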

# Step 2:
Now that we have divided the area into tiles, it's time to determine how much of each tile's area is occupied by coniferous-forest key-biotopes. Recall 'keybiotopes_habitatgroups_shapefiles':
- These shapefiles are the result of the first set of exercises analysing sksNyckelBiotoper_shapefiles. Both shapefiles display the locations and attributes of the key biotopes (Nyckelbiotoper) in Sweden.

In this step we will find the subset of key-biotopes that contain coniferous forests (a subset of 'keybiotopes_habitatgroups_shapefiles'), then find the intersection between this subset and the tiles from step 1.

## 2.1 Load and Filter Data
- To Do:
1. Read 'keybiotopes_habitatgroups_shapefiles' into a GeoDataFrame.
2. Find the subset of keybiotopes which include coniferous forests (Barrskogar).

## 2.2 Intersection and Key-biotope Ratio
- To Do:
1. Find the intersection between the subset of key-biotopes and laser tiles.
2. The area of each tile is TILE_SIZE. Find the ratio of each tile's area that is occupied by key-biotopes.
        -  use function `groupby`
3. Get a column with these ratios and add it to the GeoDataFrame of all tiles. Note that tiles not in the intersection will have ratio = 0.
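
The aggregation behind items 2 and 3 can be illustrated without any geodata: sum the intersection areas per tile, divide by the tile area, and default to 0 for tiles that were never hit. The tile identifiers and areas below are invented, and the sketch assumes the key-biotope polygons do not overlap each other (otherwise the summed area would be counted twice).

```python
from collections import defaultdict

TILE_SIZE = 250 * 250  # area of one tile, in square meters

# Hypothetical intersection pieces as (tile_id, area in square meters).
# A single tile can be intersected by several key-biotope polygons.
pieces = [("t1", 20000.0), ("t1", 15000.0), ("t2", 5000.0)]
all_tiles = ["t1", "t2", "t3"]  # "t3" has no intersection at all

# Equivalent of a groupby on tile_id followed by a sum of the areas.
covered = defaultdict(float)
for tile_id, area in pieces:
    covered[tile_id] += area

# Tiles missing from the intersection get a ratio of 0.
ratios = {t: covered.get(t, 0.0) / TILE_SIZE for t in all_tiles}
print(ratios)
```

With GeoDataFrames, the pieces would typically come from `geopandas.overlay(..., how="intersection")`, with the area of each piece taken from its geometry before grouping by the tile code.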

## 2.3 Which Tiles Contain Key-biotopes?
- To Do:
1. Add a new column to the tiles' GeoDataFrame that labels a tile as 'contains key-biotopes' (value = 1) or 'doesn't contain key-biotopes' (value = 0). The label should depend on whether the area occupied by key-biotopes in a tile is at least the threshold, i.e. 50% of the tile's total area.
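
The labelling rule itself is a single comparison against `THRESHOLD`. The sketch below applies it to a few made-up ratios, following the "50% or more" rule from the Choices section; the helper name `label_tile` and the sample values are assumptions.

```python
THRESHOLD = 0.5

def label_tile(ratio, threshold=THRESHOLD):
    """Return 1 if the key-biotope ratio reaches the threshold, else 0."""
    return 1 if ratio >= threshold else 0

# Hypothetical ratios per tile; "t3" sits exactly on the threshold.
ratios = {"t1": 0.56, "t2": 0.08, "t3": 0.5}
labels = {tile_id: label_tile(r) for tile_id, r in ratios.items()}
print(labels)
```

Using `>=` rather than `>` means a tile that is exactly half covered is counted as positive; whichever convention you pick, it should be applied consistently when the labels are later used for training.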

## 2.4 Write The Result
- To Do:
1. Write the tiles' GeoDataFrame into a shapefile to use in the next steps of the project.