# Analysing forest recovery after logging events - data exploration

You have recently started working as a graduate data analyst at the Victorian Forest Monitoring Program, which is part of the Victorian Department of Energy, Environment and Climate Action. 

Your team is interested in how different logging events appear in satellite imagery, which might be able to help them detect logging events in the future. They have a dataset of known logging events in Victoria since the 1950s, and they want to see what you can learn by examining the satellite imagery for known events. 

For your first practical, your team would like you to familiarise yourself with the available data, specifically, the different kinds of logging events that can occur. You'll start by exploring the available data, which will give you insight into how the data is structured, and how to pull useful information out.

## Overview

During this activity, you will learn to:

* load geospatial data files and explore their contents.
* clean, explore and filter all available data to identify different types of logging events.
* export a collection of events that can be compared with satellite imagery (topic of the next practical).

Some of the Python code for performing the sample selection has been provided for you, but there will be a number of opportunities to write your own code, as well as customise existing code to explore different results.

### Guiding text

This practical contains a number of headings to help guide you. 

* <span style="color:blue;font-weight:bold">Your task</span>: This indicates there is a task you must complete before proceeding. It will usually require you to add code or text before you can move on.
* <span style="color:green;font-weight:bold">Need some help?</span>: Your demonstrators are here to help -- this text is there to remind you to ask for help if you're not sure what you need to do. You can ask for help at any time.
* <span style="color:orange;font-weight:bold">Going further</span>: This indicates that there is an *optional* extension you can try if you've already completed the tasks.
* **Code explanation**: The text following this header will provide you with more information about how the code works -- you only need to read this if you're interested.

## Notebook setup

Another analyst on your team has some existing code that you'll be able to use to start your work. They've already identified the useful Python packages you'll work with: `pandas` and `geopandas`.
* `pandas` is a tool for working with data in tables. You can learn more about it in the [pandas documentation](https://pandas.pydata.org/docs/getting_started/overview.html).
* `geopandas` is a tool for working with geospatial data in tables. It extends the functionality of `pandas` to work with geospatial data. You can learn more about it in the [geopandas documentation](https://geopandas.org/en/stable/getting_started/introduction.html).

To run the code, click on the next cell, and press `Shift`+`Enter` on your keyboard.

In [None]:
import geopandas
import pandas

# Change a pandas setting to view all columns and all rows of loaded data
pandas.set_option("display.max_columns", None)
pandas.set_option("display.max_rows", None)

## Load the data

### Geospatial data

Your colleagues have provided you with a GeoPackage file in the data folder, which contains timber harvesting events in Victoria. [GeoPackages](https://www.geopackage.org/) are a geospatial data format, and use the file extension ".gpkg". 

This particular GeoPackage contains **polygons** representing logged areas, along with a number of **attributes** that capture data about the event, such as the year it happened, what kind of trees were logged, and many other useful pieces of data. The file is called "LOG_SEASON.gpkg". The figure below shows a preview of the data.

<center><img src="polygon_attribute.PNG"/></center>

*The **polygons** in blue show the areas that have been logged. The polygon's **attributes** in the table are the data associated with the polygon, such as when it took place and how big the area is.*

### Code to load geospatial data

To load GeoPackages, your colleagues recommend you use the `geopandas` package. This will allow you to view a table containing each logging event and all the relevant information. Each row of the table corresponds to a polygon and the columns display the attributes for that polygon. The data is located in your data folder - the **path** to the data is `"LOG_SEASON.gpkg"`.

They let you know that `geopandas` has a useful function called `read_file` that will import the polygons and their attributes into a table. You can read more about this function in the [geopandas documentation](https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html#geopandas.read_file).

### <span style="color:blue;font-weight:bold">Your task</span>

> Load the logging data using the `read_file` function from `geopandas` and assign the loaded data to a variable called `logging_season_data`.
>
> What's provided:
> * The variable you'll assign the data to: `logging_season_data`.
> * The `=` sign that will assign the results of any code that comes after it to the variable.
>
> What you'll need to add:
> * After the `=` sign, type `geopandas.read_file()` to call the function.
> * Inside the `()` for the function, type the path to the file: `"LOG_SEASON.gpkg"`
>
>After adding the required code to the cell below, run the cell by clicking on the cell and pressing `Shift`+`Enter` on your keyboard.

### <span style="color:green;font-weight:bold">Need some help?</span>
>If you're not sure what to do, get in touch with a demonstrator (in the room or online) and show them your screen to talk through what you've tried and what the next step might be.

In [None]:
# Include the code to load the data after the = sign.
logging_season_data = 

## View the loaded data

The `logging_season_data` variable is a geopandas `GeoDataFrame` -- a special type of object (known as a **class**) that lists geospatial data in a table with one row per polygon. You can read more about the `GeoDataFrame` class in the [geopandas documentation](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.html).

A `GeoDataFrame` comes with a number of useful **methods** that can be run to perform common operations. A `GeoDataFrame` has many methods, but two useful ones are `head()` and `tail()`, which allow you to view the first 5 rows and last 5 rows of the `GeoDataFrame`.

**To use a method**, you type the name of the `GeoDataFrame`, then a `.`, then the `method()`. To view the first 5 rows of our loaded data, you would type: `logging_season_data.head()`
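
As a minimal sketch of the `name.method()` pattern, the example below calls `head()` and `tail()` on a small made-up `pandas` table. The table `toy_table` and its columns are hypothetical and are not part of the logging dataset. A `GeoDataFrame` is assumed to respond the same way, since it extends the `pandas` table.

```python
import pandas

# Hypothetical toy table with ten rows, standing in for a real dataset
toy_table = pandas.DataFrame(
    {
        "EVENT_ID": list(range(1, 11)),
        "HECTARES": [1.5 * n for n in range(1, 11)],
    }
)

# Call each method by typing the table name, a dot, then the method name
first_rows = toy_table.head()
last_rows = toy_table.tail()

print(first_rows)
print(last_rows)
```

Each method returns a new, smaller table, so the original `toy_table` is left unchanged.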

### <span style="color:blue;font-weight:bold">Your task</span>
> View the first 5 and last 5 rows of the `logging_season_data` table using the `head()` and `tail()` methods.
>
> What's provided:
> * Two empty cells, one for each method
>
> What you'll need to add:
> * The full command for viewing the first 5 rows
> * The full command for viewing the last 5 rows
>
> After adding the required code to the cells below, run the cells by clicking on each cell and pressing `Shift`+`Enter` on your keyboard.
>
> Remember, you can call a method using the following format: `name_of_geodataframe.method()`


In [None]:
# View the first five rows using the head() method


In [None]:
# View the last 5 rows using the tail() method


### <span style="color:orange;font-weight:bold">Going further</span>
> This is an optional exercise to further your own understanding. There are no questions to answer for this component.
> 
> What happens if you add a number inside the brackets of the `head` and `tail` methods?

### Exercise: understanding columns

When working with new data, it's important to have a clear understanding of the different pieces of information available to you. This can be obtained by a combination of viewing the data, online research, and collaboration with people who are more knowledgeable about the data. For example, your colleagues share a few pieces of useful terminology specific to forestry:

* [Silviculture](https://www.forestrycorporation.com.au/operations/silviculture) is the science of forestry.
* A [coupe](https://www.vicforests.com.au/vicforest-forest-management/ops-planning/where-vicforests-operates/timber-release-plan) is a defined area in a forest that timber can be harvested from.
* A [silvicultural system](https://www.fs.usda.gov/Internet/FSE_DOCUMENTS/fseprd530429.pdf) is the planned strategy for managing a coupe, including the harvesting and regeneration of timber.

> **Note**: It is a normal part of working to use the internet to investigate new concepts you come across, and we encourage you to search online if you come across terminology you don't understand.

### <span style="color:blue;font-weight:bold">Your task</span>
> Look through both the **first** and **last** 5 entries of the `logging_season_data` above, and write a sentence to describe the purpose of each column listed below. Often, the name will be descriptive, but in some cases, you will need to look at the values in that column to help your understanding. Your colleague has provided you with some of the descriptions to help you out. If some of the values are unfamiliar to you, try searching for them online.  
>
> Double click the text below to add your answers after each column name.

### <span style="color:green;font-weight:bold">Need some help?</span>
>If you're not sure what to do, get in touch with a demonstrator (in the room or online) and discuss your current thoughts about what the columns represent.

**LOGHISTID**: The unique identifier for each logging event. It is built from other columns, using the following format: FMA/BLOCK/COMPART/COUPENO/SEASON/SECTN_SDE

**COUPEADD**: The unique identifier for each logging coupe. It is built from other columns, using the following format: FMA/BLOCK/COMPART/COUPENO

**SEASON**: The financial year the logging event took place. For example, a logging event happening in the 2019-2020 financial year would have the SEASON code 201920

**STARTDATE**:

**ENDDATE**:

**HECTARES**:

**X_SILVSYS**:

**X_FORETYPE**:


## Clean the data

Your colleagues let you know that the data is incomplete in places. For example, some logging events are missing a start date or end date, which might make it hard to group logging events by year. Luckily, you can use the SEASON column instead.

The SEASON code's current format isn't very easy to work with, so your colleagues have provided you with some code to extract the first year listed in the SEASON code. This is given as a string (a text label of the year) and an integer (the numeric value of the year). Each will have its own use later on. The code below will identify the starting year in each format, and add two new columns (STARTYEAR_STRING and STARTYEAR_INTEGER) to your table.

To run the code, click on the next cell, and press `Shift`+`Enter` on your keyboard.

In [None]:
# Create a new column STARTYEAR_STRING, using the first four digits of the SEASON column
logging_season_data.loc[:, "STARTYEAR_STRING"] = logging_season_data.loc[:, "SEASON"].str[0:4]

# Create a new column STARTYEAR_INTEGER, which converts STARTYEAR_STRING to a numerical value
logging_season_data.loc[:, "STARTYEAR_INTEGER"] = logging_season_data.loc[:, "STARTYEAR_STRING"].astype(int)

**Code Explanation**
> The code above uses a `GeoDataFrame` feature called `loc[]`. Here, *loc* is short for *location*, as it lets you access data in a given row and column. Unlike a method, `loc` is used with square brackets rather than round brackets.
> 
> `loc[]` takes two items inside the square brackets: 
> 
> * the **first item** defines which **row** to access
> * the **second item** defines which **column** to access
> * each item can be a specific row label (in this table, the row number), a specific column name, or `:` to access all values in the column or row
>
> For example, the following code would access all rows in the `HECTARES` column and assign them their original value [rounded](https://www.w3schools.com/python/ref_func_round.asp) to the nearest integer:
>
> `logging_season_data.loc[:, "HECTARES"] = round(logging_season_data.loc[:, "HECTARES"])`
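
The row and column selection described above can be sketched on a small made-up table. The table `toy_table` and its values are hypothetical; only the column names mirror the ones used in this practical.

```python
import pandas

# Hypothetical toy table; the SEASON codes are made-up examples
toy_table = pandas.DataFrame(
    {
        "SEASON": ["201516", "201920", "202122"],
        "HECTARES": [12.4, 3.6, 20.2],
    }
)

# All rows (:) of one column, keeping only the first four characters
toy_table.loc[:, "STARTYEAR_STRING"] = toy_table.loc[:, "SEASON"].str[0:4]

# Convert the text label of the year to a numeric value
toy_table.loc[:, "STARTYEAR_INTEGER"] = toy_table.loc[:, "STARTYEAR_STRING"].astype(int)

# A single value: the row labelled 1, in the STARTYEAR_INTEGER column
single_value = toy_table.loc[1, "STARTYEAR_INTEGER"]

# A whole row: the row labelled 0, across all columns (:)
first_row = toy_table.loc[0, :]

print(toy_table)
print(single_value)
```

The string version is handy for labels, while the numeric version allows comparisons such as "greater than or equal to".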

### <span style="color:blue;font-weight:bold">Your task</span>
> Add code to the cell below that will display the **first 5 rows** in the dataset. To run the code, click on the next cell, and press `Shift`+`Enter` on your keyboard.
>
>Then, scroll to the far right of the table to see the new columns.

### <span style="color:green;font-weight:bold">Need some help?</span>
>If you're not sure what to do, get in touch with a demonstrator (in the room or online) and show them your screen to talk through what you've tried and what the next step might be.

In [None]:
# View the first five rows of the logging_season_data GeoDataFrame to see the new columns


## Filter the data

During the next stage of the analysis, you'll be comparing your selected logging events to satellite imagery. Your colleagues recommend that you use Sentinel-2 imagery, as it is freely available, and has higher resolution than Landsat. The first Sentinel-2 satellite started collecting data over Australia in 2015, with the second satellite joining it in 2017. Once both satellites were operational, they began to collect data over Australia every 3-5 days.

Before proceeding with the analysis, it is valuable to filter the dataset to only keep logging events that occurred after Sentinel-2 was launched. This will help with the next part of the analysis when finding satellite images that coincide with the logging events.

### <span style="color:blue;font-weight:bold">Your task</span>
> Choose a value for the starting year that is after the launch of Sentinel-2.
>
> What's provided:
> * The variable you'll assign the data to: `starting_year`.
> * The `=` sign that will assign the results of any code that comes after it to the variable.
>
> What you'll need to add:
> * After the `=` sign, type the value of a year that is after the launch of Sentinel-2
>
> After adding the required code to the cell below, run the cell by clicking on it and pressing `Shift`+`Enter` on your keyboard. Then run the cell after it to filter the data and see the results.

In [None]:
# Set the value for the starting year
starting_year = 

In [None]:
# This cell filters the data based on the starting_year

# Identify all rows where the starting year (as an integer) is greater than or equal to the set starting_year
events_post_start_year = logging_season_data.loc[:, "STARTYEAR_INTEGER"] >= starting_year

# Create a new table containing only the rows where the logging event occurred during or after the starting year
logging_events_for_analysis = logging_season_data.loc[events_post_start_year, :]

# Display the number of logging events remaining
print(f"The number of logging events that occur during or after {starting_year} is {len(logging_events_for_analysis)}")

**Code Explanation**
> The code above uses `loc` in two ways:
> 
> * The first line of code identifies all rows (using `:`) where the value in the `"STARTYEAR_INTEGER"` column is greater than or equal to (`>=`) the `starting_year` value. This produces a value of `True` or `False` for each row. These results are then assigned to `events_post_start_year`.
> * The second line of code filters the data by locating the rows where the `events_post_start_year` value is `True` and assigns it to `logging_events_for_analysis`.
>
> Finally, the last line of code uses a formatted string (an f-string) inside `print()` to display the results, substituting the values of the variables inside the `{}`. In this code, the Python function `len()` is used to count the number of rows in the filtered GeoDataFrame.
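
The two-step filter described above can be sketched on a small made-up table. The table `toy_table`, its years, and the cutoff `example_cutoff_year` are hypothetical values chosen only for illustration; they are not a suggestion for your own `starting_year`.

```python
import pandas

# Hypothetical toy table with made-up years
toy_table = pandas.DataFrame(
    {
        "EVENT_ID": ["A", "B", "C", "D"],
        "STARTYEAR_INTEGER": [1998, 2003, 2010, 2001],
    }
)

# Arbitrary cutoff used only for this illustration
example_cutoff_year = 2002

# Step 1: compare every row to the cutoff, giving True or False per row
is_on_or_after_cutoff = toy_table.loc[:, "STARTYEAR_INTEGER"] >= example_cutoff_year

# Step 2: keep only the rows where the comparison was True
filtered_table = toy_table.loc[is_on_or_after_cutoff, :]

print(is_on_or_after_cutoff.tolist())
print(f"Rows kept for {example_cutoff_year} onwards: {len(filtered_table)}")
```

Keeping the `True`/`False` results in their own variable makes it easy to inspect which rows will be kept before filtering.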

### Exercise: Changing the year

You can change the value of the starting year to identify different selections of logging events. 

### <span style="color:blue;font-weight:bold">Your task</span>
> Edit the `starting_year` value and rerun the two cells above to see the number of logging events occurring during and after both 2015 and 2016. Record the number of events for each value of `starting_year` below.
> Double click the text below to add your answers for each year.

Number of logging events during and after each starting year:

**2015**:

**2016**:

**2017**: 1371

### Exercise: Choosing a year

Based on your results above, decide which year you want to use to filter your dataset. **There is no right or wrong answer**: all the years are likely to have some Sentinel-2 data available around the time of the logging event. 


### <span style="color:blue;font-weight:bold">Your task</span>
> Set the `starting_year` value above to the year of your choice, and re-run both cells.

## Explore the data

Your colleague recommends two useful **methods** for viewing the data:

- `.value_counts(subset="COLUMNNAME")` which identifies all the unique values in a subset of the GeoDataFrame and counts how many times they occur.
- `.explore(column="COLUMNNAME")` which displays an interactive map of all the polygons in your dataset, colour-coded by the chosen column.
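
As a minimal sketch of what `value_counts` returns, the example below counts values in a small made-up `pandas` table. The column name `FOREST_TYPE` and its values are hypothetical and do not come from the logging dataset. The interactive map from `explore` is not shown here, since it needs real polygons to be useful.

```python
import pandas

# Hypothetical toy table; the column name and values are made up
toy_table = pandas.DataFrame(
    {
        "FOREST_TYPE": ["Ash", "Mixed", "Ash", "Ash", "Mixed", "Other"],
    }
)

# Count how many times each unique value appears in the chosen column
column_to_count = "FOREST_TYPE"
counts = toy_table.value_counts(subset=column_to_count)

# By default the most frequent value is listed first
most_common_value = counts.index[0]

print(counts)
print(f"Most common value: {most_common_value}")
```

Because the result is sorted from most to least frequent, the first entry answers "which value is most common?" directly.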

Run the cells below to look at how the data are distributed across time:

In [None]:
column_for_value_counts = "STARTYEAR_STRING"
logging_events_for_analysis.value_counts(subset=column_for_value_counts)

In [None]:
column_for_explore = "STARTYEAR_STRING"
logging_events_for_analysis.explore(column=column_for_explore)

### Exercise: Explore other columns

Your team is interested in the answers to the following questions:

* **Question 1**: How often was the Single Tree Selection silvicultural system used?
* **Question 2**: What was the most common silvicultural system used? 
* **Question 3**: What was the most common type of forest?
* **Question 4**: Which two forest types appear most commonly around Bairnsdale?

### <span style="color:blue;font-weight:bold">Your task</span>
> In the two cells below, assign a column name to the `column_for_value_counts` and `column_for_explore` variables, choosing the column name that will help you view the relevant data for each question. Review your earlier answers that describe each column if you're not sure which column to use.
>
> What's provided:
> * The variables you'll assign the chosen column to: `column_for_value_counts` and `column_for_explore`.
> * The `=` sign that will assign the results of any code that comes after it to the variables.
>
> What you'll need to add:
> * After the `=` sign, type the name of the column that will select the relevant data for each question. Column names must be surrounded by quotation marks: `"COLUMNNAME"`
>
> When you have viewed the relevant data, add your answers to the four questions by double-clicking the <span style="color:blue;font-weight:bold">Your answers</span> cell.

#### Code to count how many times each value appears in the column

In [None]:
# Assign the column name to the column_for_value_counts variable to view the relevant data for each question. You must keep the quote marks around the column name.
column_for_value_counts = 
logging_events_for_analysis.value_counts(subset=column_for_value_counts)

#### Code to view all polygons on a map, colour-coded by the provided column

In [None]:
# Assign the column name to the column_for_explore variable to view the relevant data for each question. You must keep the quote marks around the column name.
column_for_explore = 
logging_events_for_analysis.explore(column=column_for_explore)

### <span style="color:blue;font-weight:bold">Your answers</span>

**Question 1**: 

**Question 2**:

**Question 3**:

**Question 4**: 



### <span style="color:orange;font-weight:bold">Going further</span>
> This is an optional exercise to further your own understanding. There are no questions to answer for this component.
> 
> What else can you learn from using the `.value_counts()` and `.explore()` methods? Try setting different columns as the values for the variables above and re-run the cells.

## Export your cleaned and filtered data for use in the next step

Now that you know a bit more about the data, and have cleaned and filtered it, you can save a copy to work with in the next stage, where you'll find satellite imagery that was captured before and after the logging event.

Run the next cell to save your cleaned data to a new GeoPackage, which will allow you to use your filtered data in the next stage. 

In [None]:
logging_events_for_analysis.to_file("LOG_SEASON_FILTERED.gpkg")

## Submit your work

1. Ensure you have added answers to all the questions and save your file (in the menu bar, click File > Save Notebook).

2. In the file browser, right-click the `prac1_logging_site_selection.ipynb` file, and press "Download"

3. Email the downloaded file to caitlinisabeladams@swin.edu.au