# Analysing forest recovery after logging events - data exploration

You have recently started working as a graduate data analyst at the Victorian Forest Monitoring Program, which is part of the Victorian Department of Environment, Land, Water and Planning. 

Your team is interested in how different logging events appear in satellite imagery, which might be able to help them detect logging events in the future. They have a dataset of known logging events in Victoria since the 1950s, and they want to see what you can learn by examining the satellite imagery for known events. 

For your first practical, your team would like you to familiarise yourself with the available data, specifically, the different kinds of logging events that can occur. You'll start by exploring the available data, which will give you insight into how the data is structured, and how to pull useful information out.

## Overview

During this activity, you will learn to:

* load geospatial data files and explore their contents
* clean, explore and filter all available data to identify different types of logging events
* export a collection of events that can be compared with satellite imagery (topic of the next practical)

Some of the Python code for performing the sample selection has been provided for you, but there will be a number of opportunities to write your own code, as well as customise existing code to explore different results.

## Notebook setup

Another analyst on your team has some existing code that you'll be able to use to start your work. They've already identified the useful Python packages you'll work with: `pandas` and `geopandas`.
* `pandas` is a tool for working with data in tables. You can learn more about it in the [pandas documentation](https://pandas.pydata.org/docs/getting_started/overview.html).
* `geopandas` is a tool for working with geospatial data in tables. It extends the functionality of `pandas` to work with geospatial data. You can learn more about it in the [geopandas documentation](https://geopandas.org/en/stable/getting_started/introduction.html).

To run the code, click on the next cell, and press `Shift`+`Enter` on your keyboard.

In [None]:
import geopandas
import pandas

# Change a pandas setting to view all columns and all rows of loaded data
pandas.set_option("display.max_columns", None)
pandas.set_option("display.max_rows", None)

## Load the data

### Geospatial data

Your colleagues have provided you with a GeoPackage file containing timber harvesting events in Victoria. [GeoPackages](https://www.geopackage.org/) are a geospatial data format, and use the file extension ".gpkg". 

This particular GeoPackage contains **polygons** representing logged areas, along with a number of **attributes** that capture data about the event, such as the year it happened, what kind of trees were logged, and many other useful pieces of data. Your colleagues have provided the GeoPackage in the data folder, and it is called "LOG_SEASON.gpkg". The figure below shows a preview of the data.

<center><img src="images/polygon_attribute.PNG"/></center>

*The **polygons** in blue show the areas that have been logged. The polygon's **attributes** in the table are the data associated with the polygon, such as when it took place and how big the area is.*

### Code to load geospatial data

To load GeoPackages, your colleagues recommend you use the `geopandas` package. This will allow you to view a table containing each logging event and all the relevant information. The data is located in your data folder - the **path** to the data is `"data/LOG_SEASON.gpkg"`.

They let you know that `geopandas` has a useful function called `read_file` that will import the polygons and their attributes into a table. You can read more about this function in the [geopandas documentation](https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html#geopandas.read_file).

> **Your task**: Load the logging data using the `read_file` function from `geopandas` and assign the loaded data to a variable called `logging_season_data`.
>
> What's provided:
> * The variable you'll assign the data to: `logging_season_data`.
> * The `=` sign that will assign the results of any code that comes after it to the variable.
>
> What you'll need to add:
> * After the `=` sign, type `geopandas.read_file()` to call the function.
> * Inside the `()` for the function, type the path to the file: `"data/LOG_SEASON.gpkg"`
>
>After adding the required code to the cell below, run the cell by clicking on the cell and pressing `Shift`+`Enter` on your keyboard.

In [None]:
# Include the code to load the data after the = sign.
logging_season_data = 

## View the loaded data

To get an understanding of what kind of data is in this file, you can view the **first five entries** in the dataset using the `.head()` function, or the **last five entries** in the dataset using the `.tail()` function:

In [None]:
logging_season_data.tail()
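
As a minimal sketch of how `.head()` and `.tail()` behave, the example below uses a small made-up table (`toy_data` and its values are hypothetical, not taken from the real dataset). Both functions return five rows by default, and accept a number if you want more or fewer.

```python
import pandas

# Hypothetical toy table standing in for the logging data (values are made up)
toy_data = pandas.DataFrame(
    {
        "SEASON": ["201415", "201516", "201617", "201718", "201819", "201920", "202021"],
        "HECTARES": [12.5, 8.0, 20.1, 5.4, 16.3, 9.9, 11.0],
    }
)

# .head() returns the first five rows by default
first_rows = toy_data.head()

# .tail() returns the last five rows by default; pass a number to change how many
last_two_rows = toy_data.tail(2)

print(first_rows)
print(last_two_rows)
```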

### Exercise: understanding columns

When working with new data, it's important to have a clear understanding of the different pieces of information available to you. 

> **Your task**: look through both the _first_ and _last_ five entries of the dataset above, and write a sentence to describe the purpose of each column listed below. Often, the name will be descriptive, but in some cases, you will need to look at the values in that column to help your understanding. Your colleague has provided you with some of the descriptions to help you out.

>Double click the text below to add your answers after each column name.

**LOGHISTID**: The unique identifier for each logging event. It is built from other columns, using the following format: FMA/BLOCK/COMPART/COUPENO/SEASON/SECTN_SDE

**COUPEADD**: The unique identifier for each logging coupe. It is built from other columns, using the following format: FMA/BLOCK/COMPART/COUPENO

**SEASON**: The financial year the logging event took place. For example, a logging event happening in the 2019-2020 financial year would have the SEASON code 201920

**STARTDATE**:

**ENDDATE**:

**HECTARES**:

**X_SILVSYS**:

**X_FORETYPE**:
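
If it helps to see how the LOGHISTID format described above fits together, the sketch below splits some identifiers on `/` and labels each part. The identifiers in `toy_ids` are invented purely for illustration and will not match real values in the dataset; only the part names come from the format given above.

```python
import pandas

# Hypothetical identifiers following the FMA/BLOCK/COMPART/COUPENO/SEASON/SECTN_SDE pattern
toy_ids = pandas.DataFrame(
    {"LOGHISTID": ["AAA/111/222/0001/201920/A", "BBB/333/444/0002/201718/B"]}
)

# Split each identifier on "/" into separate columns, then name the columns
part_names = ["FMA", "BLOCK", "COMPART", "COUPENO", "SEASON", "SECTN_SDE"]
parts = toy_ids["LOGHISTID"].str.split("/", expand=True)
parts.columns = part_names

# The first four parts, joined back together, give the COUPEADD format
coupeadd = toy_ids["LOGHISTID"].str.split("/").str[:4].str.join("/")

print(parts)
print(coupeadd)
```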


## Clean the data

Your colleagues let you know that the data isn't necessarily complete in some sections. For example, there are some logging events where the start date or end date is missing, which might make it hard to group logging events by year. Luckily, you can use the SEASON column instead.

The SEASON code's current format isn't very easy to work with, so your colleagues have provided you with some code to extract the first year listed in the SEASON code. This is given as a string (a text label of the year) and an integer (the numeric value of the year). Each will have its own use later on. The code below will identify the starting year in each format, and add two new columns (STARTYEAR_STRING and STARTYEAR_INTEGER) to your table.

To run the code, click on the next cell, and press `Shift`+`Enter` on your keyboard.

In [None]:
# Create a new column STARTYEAR_STRING, using the first four digits of the SEASON column
logging_season_data.loc[:, "STARTYEAR_STRING"] = logging_season_data.loc[:, "SEASON"].str[0:4]

# Create a new column STARTYEAR_INTEGER, which converts STARTYEAR_STRING to a numerical value
logging_season_data.loc[:, "STARTYEAR_INTEGER"] = logging_season_data.loc[:, "STARTYEAR_STRING"].astype(int)
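
To see what this extraction does in isolation, here is a small sketch on a made-up table (`toy_seasons` and its values are hypothetical). It assumes, as the code above does, that the SEASON codes are stored as text.

```python
import pandas

# Hypothetical SEASON codes in the format described above (e.g. 2019-2020 is "201920")
toy_seasons = pandas.DataFrame({"SEASON": ["201920", "201718", "201516"]})

# Take the first four characters of each code to get the starting year as text
toy_seasons["STARTYEAR_STRING"] = toy_seasons["SEASON"].str[0:4]

# Convert the text to a number so it can be compared with other numbers
toy_seasons["STARTYEAR_INTEGER"] = toy_seasons["STARTYEAR_STRING"].astype(int)

print(toy_seasons)
```

The text version is convenient as a category label, while the integer version allows comparisons such as "greater than or equal to 2017". If the SEASON codes were stored as numbers rather than text, the `.str` step would fail, and converting with `.astype(str)` first would likely be needed.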

To see the new columns, you can view the first five entries in the dataset using the `.head()` function:

In [None]:
# View the first five rows of the dataset to see the new columns
logging_season_data.head()

## Filter the data

During the next stage of the analysis, you'll be comparing your selected logging events to satellite imagery. Your colleagues recommend that you use Sentinel-2 imagery, as it is freely available, and has higher resolution than Landsat. The first Sentinel-2 satellite started collecting data over Australia in 2015, with the second satellite joining it in 2017. Once both satellites were operational, they began to collect data over Australia every 3-5 days.

Given this, you decide to filter your dataset to only look at events after both Sentinel-2 satellites began collecting data in 2017.

In [None]:
# Set the value for the starting year
starting_year = 2017

# Identify all rows where the starting year (as an integer) is greater than or equal to the set starting_year
events_post_start_year = (
    logging_season_data.loc[:, "STARTYEAR_INTEGER"] >= starting_year
)

# Create a new table containing only the rows where the logging event occurred during or after the starting year
logging_events_for_analysis = logging_season_data.loc[events_post_start_year, :]

# Display the number of logging events remaining
print(
    f"The number of logging events that occur during or after {starting_year} is {len(logging_events_for_analysis)}"
)
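
The filtering step above works in two stages: a comparison produces one True/False value per row, and `.loc` then keeps only the rows marked True. The sketch below shows the same idea on a tiny made-up table (`toy_events`, `toy_starting_year` and all values are hypothetical).

```python
import pandas

# Hypothetical toy table; identifiers and years are made up
toy_events = pandas.DataFrame(
    {
        "LOGHISTID": ["A", "B", "C", "D"],
        "STARTYEAR_INTEGER": [2014, 2016, 2017, 2019],
    }
)

toy_starting_year = 2017

# Stage 1: compare every row with the starting year to get True/False values
mask = toy_events.loc[:, "STARTYEAR_INTEGER"] >= toy_starting_year
print(mask.tolist())  # [False, False, True, True]

# Stage 2: keep only the rows where the mask is True
filtered = toy_events.loc[mask, :]
print(len(filtered))  # 2
```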

### Exercise: Changing the year

You can change the value of the starting year to identify different selections of logging events. 

> **Your task**: edit the `starting_year` value above and rerun the cell to see the number of logging events occurring during and after both 2015 and 2016. Record the number of events for each value of `starting_year` below.

>Double click the text below to add your answers for each year.

Number of logging events during and after each starting year:

**2015**:

**2016**:

**2017**: 1371

### Exercise: Choosing a year

Based on your results above, decide which year you want to use to filter your dataset. There is no right answer, as all the years are likely to have some Sentinel-2 data available around the time of the logging event.

> **Your task**: set the `starting_year` value above to the year of your choice, and re-run the cell.

## Explore the data

Your colleague recommends two useful functions for viewing the data:

- `.value_counts()` which counts the number of rows that have a given value in a chosen column. 
- `.explore()` which displays an interactive map of all the polygons in your dataset.
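
As a rough illustration of what `.value_counts()` returns, the sketch below counts values in a made-up column (`toy_events` and its values are hypothetical). The result lists each distinct value alongside the number of rows containing it, with the most common value first.

```python
import pandas

# Hypothetical toy column of starting years (values are made up)
toy_events = pandas.DataFrame(
    {"STARTYEAR_STRING": ["2017", "2018", "2017", "2019", "2017", "2018"]}
)

# Count how many rows have each distinct value, most common first
counts = toy_events.loc[:, "STARTYEAR_STRING"].value_counts()
print(counts)

# The number of distinct values is the length of the result
print(len(counts))  # 3
```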


Run the cells below to look at how the data are distributed across time:

In [None]:
column_to_explore = "STARTYEAR_STRING"
logging_events_for_analysis.loc[:, column_to_explore].value_counts()

In [None]:
column_to_explore = "STARTYEAR_STRING"
logging_events_for_analysis.explore(column=column_to_explore)

### Exercise: Explore other columns

Your team is interested in the answers to the following questions:

* **Question 1**: In your filtered dataset, how many Clearfelling events occurred?
* **Question 2**: What was the most common type of event in your filtered dataset?
* **Question 3**: How many categories of Forest Type are there? 
* **Question 4**: What two Forest Types appear most commonly around Bairnsdale?

> **Your task**: modify the `column_to_explore` variable in the two cells below to explore other columns of the data, and record your answers below the code cells.

> Double click the text below to add your answers.

#### Code to count how many times each value appears in the column

In [None]:
column_to_explore = "STARTYEAR_STRING"  # Replace the column name in the quotes with the column that will help answer each question. You must keep the quote marks around the column name.
logging_events_for_analysis.loc[:, column_to_explore].value_counts()

#### Code to view all polygons on a map, colour-coded by the provided column

In [None]:
column_to_explore = "STARTYEAR_STRING"  # Replace the column name in the quotes with the column that will help answer each question. You must keep the quote marks around the column name.
logging_events_for_analysis.explore(column=column_to_explore)

### Your answers

**Question 1**: 

**Question 2**:

**Question 3**:

**Question 4**: 



## Export your cleaned and filtered data for use in the next step

Now that you know a bit more about the data, and have cleaned and filtered it, you can save a copy to work with in the next stage, where you'll find satellite imagery that was captured before and after the logging event.

Run the next cell to save your cleaned data to a new GeoPackage, which will allow you to use your filtered data in the next stage. 

In [None]:
logging_events_for_analysis.to_file("data/LOG_SEASON_FILTERED.gpkg")

## Submit your work

1. Ensure you have added answers to all the questions and save your file (in the menu bar, click File > Save Notebook).

2. In the file browser, right-click the `prac1_logging_site_selection.ipynb` file, and press "Download"

3. Email the downloaded file to caitlinisabeladams@swin.edu.au