<div>
<img src="images/jrc_ec_logo.jpg" width="400"/>
</div>

# **QUASSA**
### IACS **QU**ality **AS**sessment **SA**mple Extraction Tool

The code included in this notebook, developed by the JRC, implements the procedure for Member States to define their samples for the IACS quality assessment. This involves extracting parcels from the ranked list provided by the Commission, following the selection rules in the Union Level Methodology 2024 (Chapters 3 and 4). Member States have discretion over the sample sizes, within the constraints of the Methodology.

The tool takes the ranked list in CSV format and returns a list of parcels assigned to buckets representing different interventions.

Authors: Mateusz Dobrychłop (mateusz.dobrychlop@ext.ec.europa.eu), Fernando Fahl (fernando.fahl@ext.ec.europa.eu), Ferdinando Urbano (ferdinando.urbano@ec.europa.eu)

# Release notes

**Version 2 (Jupyter Notebook)**

---
**Latest update: September 6th** - v2.7
* **Holding level intervention selection** - the target selection UI now allows for selecting Holding Level interventions. Selecting one or more Holding Level interventions results in creating an additional output file, with the summary of all holdings related to the Holding Level interventions.
* **Retain / filter out non-contributing parcels** - a new optional parameter that allows the user to decide if a highest-ranked parcel of a holding should be filtered out or retained, when all parcels of a recently completed bucket are filtered out from the parcel list.
* Summary output files - after every extraction a new file is included in the output. It's a simple text summary file listing the parameters used and some statistics calculated on the extracted dataset.
* Extraction ID - the user can now define a unique ID for each extraction. If a custom ID is not defined, the tool will automatically assign a randomized ID to simplify the distinction between output files generated for different extraction iterations.
* Increase all targets by x% - replaces the now deprecated target capping option and allows the user to automatically increase all bucket targets by a defined percentage.
* Default target value definition - the default placeholder values for bucket targets are now calculated based on the algorithm described in the Union Level Methodology.
---

The solution is still in its beta testing phase. Implementation of some new features, optimization, and fixes based on user feedback are planned.

# Introduction

## Python, Jupyter Notebook, Prerequisites

### Python

The first prerequisite for running this solution is **Python**. You can download and install Python for your system using [this link](https://www.python.org/downloads/). We recommend version 3.12 or newer.

### Jupyter Notebook

The solution is written in Python, and made available in the form of a **Jupyter Notebook**. It is an interactive document that allows you to combine executable code, text, images, and visualizations all in one easily accessible place. Documentation can be found [here](https://docs.jupyter.org/en/latest/).

To open and use this notebook, you will need an environment that supports Jupyter Notebooks. Some recommended options to look into:
* Running the notebook in your web browser [(link)](https://docs.jupyter.org/en/latest/running.html)
* JupyterLab [(link)](https://jupyterlab.readthedocs.io/en/latest/)
* Visual Studio Code [(link)](https://code.visualstudio.com/download)

### Other Prerequisites

Once Python and the Jupyter Notebook environment are installed, please install the following Python libraries:

* **IPython**: An enhanced interactive Python interpreter.
* **ipywidgets**: Tools for creating interactive GUIs (sets of widgets) within Jupyter notebooks.
* **pandas**: A data manipulation and analysis library.
* **openpyxl**: A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
* **matplotlib**: A popular data visualization library.

Assuming the Python and pip paths are correctly added to your system's PATH, you can quickly install the required libraries using the following command:

```
pip install ipython ipywidgets pandas openpyxl notebook matplotlib
```

Alternatively, you can use the *requirements.txt* file provided with the code. This will install all the necessary requirements as well.

```
pip install -r requirements.txt
```
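To confirm the installation before opening the notebook, a quick check along the following lines may help. This is a minimal sketch that only tests whether Python can locate each package; the helper name `find_missing` is made up for this example and is not part of the tool.

```python
from importlib.util import find_spec

# Import names of the libraries installed by the pip command above
# ("IPython" is the import name of the ipython package).
REQUIRED = ["IPython", "ipywidgets", "pandas", "openpyxl", "matplotlib", "notebook"]


def find_missing(modules):
    """Return the module names that Python cannot locate in this environment."""
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(REQUIRED)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```

If any name is reported as missing, re-running one of the pip commands above should resolve it.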

## How to Use This Notebook


* **Running the Code:** Each section of code, called a "code cell", can be executed independently by selecting it and pressing Shift + Enter, or by using the "Run" button in the toolbar. This will run the code within that cell and display any outputs directly below it. The code included in the cells is intentionally kept very brief and simple - to have a closer look at the details behind the algorithm, you can open one of the Python files that the notebook imports from.

* **Interactive Elements:** This notebook includes interactive widgets (like checkboxes, buttons, text boxes, and possibly more in the future) designed to make it easier for you to change parameters and interact with the data without needing to modify the code directly. We actually advise against changing anything in the code cells. The widgets are not displayed by default - they appear after running a code cell that implements them.

* **Sequential Execution:** It's important to run the cells in the order they appear. Some cells depend on code or data from earlier cells, so running them out of order might result in errors or incorrect outputs.

# Input Data

The solution takes the following CSV files as input:

* **Parcel file (mandatory)** - a file that contains information about parcels, holdings, and UA groups, from which the algorithm selects sample data.

* **Target file (optional)** - a file that defines the target number of parcels that each bucket (corresponding to a certain UA group) should be populated with, as long as there is enough data in the parcel file.

The bucket target values, as well as other parameters, can also be defined manually using interactive widgets.

Detailed instructions on how to format the input files are included below.

## Parcel File - introduction

### Parcel file format

The parcel file should be a standard CSV (*comma separated value*) file. Despite the file format's name, the delimiter (symbol separating columns) used in some CSV files is not always actually a comma, so make sure the delimiter used in the file you use is in fact a comma.
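If you are unsure which delimiter a file uses, Python's standard `csv.Sniffer` can make a guess from the first few lines. The snippet below is a minimal sketch; the function name `detect_delimiter` and the list of candidate delimiters are assumptions made for this illustration, and the tool itself may check the file differently.

```python
import csv


def detect_delimiter(sample_text, candidates=",;\t|"):
    """Guess the delimiter used in a sample of CSV text."""
    return csv.Sniffer().sniff(sample_text, delimiters=candidates).delimiter


# A semicolon-separated sample, which would need converting before loading.
sample = (
    "gsa_par_id;gsa_hol_id;ua_grp_id;ranking;covered\n"
    "QWER-5668-44453;3221;E5;1;1\n"
)

delimiter = detect_delimiter(sample)
if delimiter != ",":
    print(f"Delimiter is {delimiter!r}; convert the file to comma-separated before loading.")
```

In practice, `sample_text` would be the first few lines read from your own file.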

Below is the list of columns that the parcel file must contain:

| **Column name** | **Type** | **Description** | **Comments** |
| --------------- | -------- | --------------- | ------------ |
| gsa_par_id | integer (whole number) or string (text) | Parcel ID | |
| gsa_hol_id | integer (whole number) | Holding ID | |
| ua_grp_id | string (text) | UA group ID | |
| covered | integer (whole number) | Is the parcel covered by a VHR image? (1 - yes, 0 - no) | Can only contain 0 or 1. This column has to exist even if the option of using all parcels regardless of coverage is selected. |
| ranking | integer (whole number) | Ranking value defining the priority in sample parcel selection order. | |


What's also important to note:

* The column names must be identical to the names included in the list above.
* The data types have to match the requirements listed above.
* There should be no empty cells in the dataset.
* Including other columns in the file should not prevent the tool from working, but could potentially slow it down.
* The order of the columns does not matter.

Preview of a correctly formatted dataset:

```csv
gsa_par_id,gsa_hol_id,ua_grp_id,ranking,covered
QWER-5668-44453,3221,E5,1,1
QWER-5668-44453,3221,ANC,1,1
UIOP-4671-02080,2137,BIS,2,1
UIOP-4671-02080,2137,ANC,2,1
HJKL-4470-03366,8901,ANC,3,1
```

Please keep in mind that the naming conventions for the identifiers included in the file can differ between Member States.
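The requirements listed above can be checked before loading the file. The following is a minimal sketch using only the standard library; the function name `check_parcel_csv` and the wording of the messages are made up for this example, and the checks performed by the *Load* button may differ.

```python
import csv
import io

REQUIRED_COLUMNS = {"gsa_par_id", "gsa_hol_id", "ua_grp_id", "covered", "ranking"}
INTEGER_COLUMNS = ("gsa_hol_id", "covered", "ranking")


def check_parcel_csv(text):
    """Return a list of human-readable problems found in parcel CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for column in sorted(REQUIRED_COLUMNS):
            if not (row[column] or "").strip():
                problems.append(f"line {line_no}: empty value in {column}")
        for column in INTEGER_COLUMNS:
            value = (row[column] or "").strip()
            if value and not value.lstrip("-").isdigit():
                problems.append(f"line {line_no}: {column} is not a whole number")
        if (row["covered"] or "").strip() not in ("", "0", "1"):
            problems.append(f"line {line_no}: covered must be 0 or 1")
    return problems


sample = """gsa_par_id,gsa_hol_id,ua_grp_id,ranking,covered
QWER-5668-44453,3221,E5,1,1
UIOP-4671-02080,2137,BIS,2,3
"""
print(check_parcel_csv(sample))
```

An empty list means that none of the checked rules were violated.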

## Loading the parcel file

Run the code cell below to display a set of widgets that will allow you to select your input file.

You can point to the parcel file in your file system by clicking the *Select* button, and then using the frame that appears in order to locate and select the file. Once the file is selected, the path to that file should appear in the *Parcel file path* field.

Alternatively, you can define or modify the path manually. To do so, put the path to the parcel file in the *Parcel file path* field. The path can be absolute or relative. For example, if the file is named *parcels.csv* and located in the *input* folder that exists in the same location as this notebook, then the relative path to place in the *Parcel file path* field is:

```
input/parcels.csv
```

**After selecting the parcel file or manually providing the path, click the *Load* button to verify and load the data.** The interface will let you know if any issues are encountered.

In [None]:
from modules.data_manager import DataManager
from modules import gui

dm = DataManager()

uploaded_files = gui.display_parcel_input_config(dm)

## Bucket Targets - introduction

The bucket target is the target number of parcels that a given bucket (corresponding to a given UA group) should be populated with, as long as there is enough data in the parcel file.

There are 3 methods of editing the target values:

* Manually writing the values using widgets / fields.
* Loading a CSV file with target values listed for each bucket / UA group.
* Increasing all targets by a given percentage (this replaces the deprecated target capping option).

### Target File Format

The target values can be loaded from a CSV file too. This option is not mandatory, but can speed up the process of setting up target values for larger sets of UA groups.

The UA groups considered in the analysis are derived from the parcel file, so the target file, if used, must be consistent with it. The target file must list every UA group ID present in the parcel file, must not list any ID that is absent from it, and must use exactly the same naming as the parcel file.

The target file should be a standard CSV (*comma separated value*) file. Despite the file format's name, the delimiter (symbol separating columns) used in some CSV files is not always actually a comma, so make sure the delimiter used in the file you use is in fact a comma.

Below is the list of columns that the target file must contain:

| **Column name** | **Type** | **Description** | **Comments** |
| --------------- | -------- | --------------- | ------------ |
| ua_grp_id | string (text) | UA group ID | Values must be unique and consistent with the IDs included in the parcel file (can't contain IDs not included in the parcel file, and must contain all IDs that are included in the parcel file) |
| target | integer (whole number) | Target number of parcels the bucket should be populated with, if there is enough data in the parcel file | Can be 0 or more |

What's also important to note:

* The column names must be identical to the names included in the list above.
* The data types have to match the requirements listed above.
* There should be no empty cells in the dataset (buckets with target value equal to 0 are allowed).
* Including other columns in the file should not prevent the tool from working.
* The order of the columns does not matter.

Preview of a correctly formatted dataset:

```csv
ua_grp_id,target
E5,70
ANC,250
BIS,2
```
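The consistency rules between the two files come down to a comparison of two sets of UA group IDs. The sketch below illustrates this with plain Python lists standing in for the `ua_grp_id` columns; the function name `compare_ua_groups` is hypothetical and not part of the tool.

```python
def compare_ua_groups(parcel_groups, target_groups):
    """Return (missing_from_targets, unknown_in_targets, duplicated_in_targets)."""
    parcel_set = set(parcel_groups)
    seen, duplicates = set(), set()
    for group in target_groups:
        if group in seen:
            duplicates.add(group)  # target IDs must be unique
        seen.add(group)
    return sorted(parcel_set - seen), sorted(seen - parcel_set), sorted(duplicates)


# ua_grp_id values as they appear in the parcel file (repeats are expected there).
parcel_groups = ["E5", "ANC", "BIS", "ANC", "ANC"]
# ua_grp_id values as they appear in the target file.
target_groups = ["E5", "ANC", "BIS"]

print(compare_ua_groups(parcel_groups, target_groups))
```

Three empty lists indicate that the target file covers exactly the UA groups found in the parcel file, with no duplicates.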



### Loading the Target File

In addition to displaying the widgets for manual input of target values, the code cell below displays a set of widgets that will allow you to select your target file.

You can point to the target file in your file system by clicking the *Select* button, then using the frame that appears to locate and select the file. After selecting your file, the path to that file should appear in the *Target file path* field. 

Alternatively, you can use the field to directly define the path to the file. To do so, put the path to the target file in the *Target file path* field. The path can be absolute or relative. For example, if the file is named *targets.csv* and located in the *input* folder that exists in the same location as this notebook, then the relative path to place in the *Target file path* field is:

```
input/targets.csv
```

**After selecting the target file or manually providing the path, click the *Load* button to verify and load the data.** The interface will let you know in case issues are encountered.

### Selecting Holding Level interventions

To mark an intervention as Holding Level, click the corresponding "HL" button. After selecting, it will change color to yellow. Click the button again to deselect.

### Increasing all targets by a given percentage

The *Increase all targets by* field can be used to increase all target values by a certain percentage. Entering a value in the field and clicking *Recalculate* multiplies every target value by (1 + value / 100). For example, entering "10" in the field increases all target values by 10%.
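As an illustration of the arithmetic, a 10% increase corresponds to multiplying each target by 1.10. The sketch below assumes that fractional results are rounded up to the next whole parcel; this rounding rule and the function name `increase_targets` are assumptions made for the example, and the tool may round differently.

```python
import math


def increase_targets(targets, percent):
    """Scale every target by (1 + percent / 100), rounding up to a whole parcel."""
    # Assumption: round up, so that an increase never leaves a small target unchanged.
    return {
        group: math.ceil(value * (100 + percent) / 100)
        for group, value in targets.items()
    }


targets = {"E5": 70, "ANC": 250, "BIS": 2}
print(increase_targets(targets, 10))
```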

## Setting target values

**Once the parcel file is loaded**, run the code cell below to display a set of widgets that will let you define the bucket target for each UA group identified in the parcel file.

By default, the target values are calculated from the number of parcels related to each UA group detected in the input data, following the algorithm described in the Union Level Methodology. The values can also be loaded from a file (as described in the previous sections) and / or modified manually.

Each widget corresponds to a single bucket / UA group and consists of 4 elements:

* The UA group's identifier (non-modifiable)
* Target number of parcels for the bucket (modifiable)
* An "HL" button for marking the intervention as Holding Level (see above)
* Number of rows identified in the parcel file corresponding to the UA group (in brackets, non-modifiable)

A message above the widget grid displays information on the last method that was used to set the values for the targets.

If the UA group's name is truncated due to its length, hover your mouse over it to see the full name.


In [None]:
try:
    gui.display_bucket_targets_config(dm)
except NameError:
    print("Please run the previous code cell first ('Loading the parcel file')")

In [None]:
print(dm.ua_groups)

## Setting Optional Parameters

Run the code below to display widgets that allow for the control over advanced / optional parameters.

* **Limit search to 3% of holdings** - once the selected interventions represent 3% of all holdings present in the dataset, the algorithm continues to pick only interventions that belong to the already selected holdings. If some buckets are still empty (contain 0 interventions) at the end of that procedure, the remaining data is searched to add 1 intervention to each empty bucket.
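The effect of the 3% limit can be sketched roughly as follows. This is a simplified illustration, not the code in `modules/sample_extraction`: the function name `select_with_holding_limit`, the tuple layout of the rows, and the rounding of the limit (rounded up here) are all assumptions made for the example.

```python
import math


def select_with_holding_limit(rows, targets, fraction=0.03):
    """Fill buckets from rows of (parcel_id, holding_id, ua_group), sorted by ranking."""
    all_holdings = {holding for _, holding, _ in rows}
    limit = math.ceil(len(all_holdings) * fraction)  # assumption: round up
    buckets = {group: [] for group in targets}
    selected_holdings = set()

    for parcel, holding, group in rows:
        if group not in buckets or len(buckets[group]) >= targets[group]:
            continue  # unknown UA group, or bucket already full
        if holding not in selected_holdings and len(selected_holdings) >= limit:
            continue  # limit reached: only already selected holdings qualify
        buckets[group].append(parcel)
        selected_holdings.add(holding)

    # Fallback: each bucket that is still empty receives its highest-ranked row.
    for parcel, holding, group in rows:
        if group in buckets and targets[group] > 0 and not buckets[group]:
            buckets[group].append(parcel)
    return buckets


rows = [("P1", 1, "A"), ("P2", 2, "A"), ("P3", 1, "B"), ("P4", 3, "C")]
print(select_with_holding_limit(rows, {"A": 2, "B": 1, "C": 1}))
```

With only three holdings in this toy dataset the limit is a single holding, so bucket C can only be served by the fallback step.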

**VHR image coverage options**

* **Include all parcels in the sample extraction** - search through all interventions, both covered and not covered by the VHR images.

* **Prioritize parcels covered by VHR images (beta)** - prioritize covered parcels without completely disregarding the non-covered ones. **Currently in testing phase.**

* **Include only parcels covered by VHR images** - only consider interventions with value of the "covered" column set to 1.
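One way to picture the three coverage options is as different ways of preparing the ranked list before selection. The sketch below is an interpretation only: the mode names, the function name `apply_coverage_mode`, and in particular the behaviour assumed for the beta "prioritize" option (covered parcels first, ranking order kept within each group) are assumptions and may not match the tool.

```python
def apply_coverage_mode(rows, mode):
    """Order or filter rows (dicts with 'ranking' and 'covered') for a coverage mode."""
    by_rank = sorted(rows, key=lambda row: row["ranking"])
    if mode == "all":
        return by_rank
    if mode == "only":
        return [row for row in by_rank if row["covered"] == 1]
    if mode == "prioritize":
        # Assumption: covered rows first; the stable sort keeps ranking order within groups.
        return sorted(by_rank, key=lambda row: row["covered"] != 1)
    raise ValueError(f"unknown mode: {mode}")


rows = [
    {"ranking": 1, "covered": 0},
    {"ranking": 2, "covered": 1},
    {"ranking": 3, "covered": 0},
]
for mode in ("all", "prioritize", "only"):
    print(mode, [row["ranking"] for row in apply_coverage_mode(rows, mode)])
```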

**Non-contributing, highest-ranked parcel of a holding**

* **Filter out when a bucket is filled** - when a bucket assigned to a UA group is filled, all rows related to that UA group that remain in the dataset are filtered out.

In [None]:
try:
    gui.display_optional_parameters(dm)
except NameError:
    print("Please run the first code cell first ('Loading the parcel file')")

# Output Information

### Output file

Once the tool finishes, it automatically saves the result in two formats (.xlsx and .csv). Both files follow the same naming convention and contain the same information. Each row of the output file corresponds to a single intervention within a parcel that was assigned to one of the UA group buckets.

In the future, the interface will offer some flexibility regarding the nature of the output files.

The output file contains the following columns:

| **Column name** | **Type** | **Description** | **Comments** |
| --------------- | -------- | --------------- | ------------ |
| bucket_id | string (text) | Analogous to the UA group ID. ID of the UA group associated with the bucket. | |
| gsa_par_id | string (text) | Parcel ID. | |
| gsa_hol_id | integer (whole number) | Holding ID | |
| ranking | integer (whole number) | Ranking value derived directly from the parcel file, defining the priority in sample parcel selection order. | |
| covered | integer (whole number) | Is the parcel covered by a VHR image? (1 - yes, 0 - no) | Can only contain 0 or 1 |
| order_added | integer (whole number) | A counter value indicating the general order in which the information was added to buckets. | |
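Because the CSV output uses the columns listed above, it can be post-processed with a few lines of Python. The sketch below counts rows per bucket; the sample data is invented for illustration and the function name `parcels_per_bucket` is hypothetical.

```python
import csv
import io
from collections import Counter


def parcels_per_bucket(csv_text):
    """Count the output rows assigned to each bucket_id."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["bucket_id"] for row in reader)


# Made-up sample that follows the column list above.
sample_output = """bucket_id,gsa_par_id,gsa_hol_id,ranking,covered,order_added
E5,QWER-5668-44453,3221,1,1,1
ANC,QWER-5668-44453,3221,1,1,2
ANC,UIOP-4671-02080,2137,2,1,3
"""

for bucket, count in sorted(parcels_per_bucket(sample_output).items()):
    print(bucket, count)
```

To apply this to a real extraction, read the text of the generated CSV file and pass it to the function in place of `sample_output`.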

### Live progress preview

After running the cell that executes the algorithm, a set of widgets is displayed that illustrates the progress of assigning parcels to different buckets. Orange text corresponds to buckets that are not yet full. Green text corresponds to buckets that were successfully filled. In upcoming versions of the tool, we plan to let users set the refresh rate of the output widgets, allowing for a slight performance gain at the cost of the accuracy of the progress widgets.

# Running Extraction

Before you run the cell below, please make sure the target values are set up correctly.

Even though we are working on optimizing the code, it can potentially run for a very long time, depending heavily on the size and other aspects of the input data. In the initial stages, the buckets should populate at a relatively high rate, with very visible changes to the progress widgets. In later stages of the analysis, however, some buckets can stay empty for a longer period of time while the algorithm looks for the right parcels to populate them with. It is not uncommon for most of the analysis to be completed very quickly, with the last one or two small buckets requiring most of the time to be filled.

For details regarding the current state of the algorithm, please see the **Release notes** section near the top of the notebook.

In [None]:
from modules import sample_extraction as se

if not dm.parcel_file_loaded:
    print("Please load the parcel file first and run this cell again. Also, don't forget to set the bucket targets.")

elif not dm.targets_displayed_and_set:
    print("Please set the bucket targets first (or set them again) and run this cell again.")

else:
    buckets = se.prepare_buckets(dm.ua_groups)
    parcels = se.prepare_input_dataframe(dm)
    widgets = gui.display_output_area(buckets, dm)

    final_buckets = se.iterate_over_interventions_fast(parcels, buckets, dm, widgets)

    if se.buckets_full(buckets):
        print("Analysis completed. All buckets are full!")
    else:
        print("Analysis completed. Some buckets are not full.") 

    prefix, summary_path, excel_path, csv_path, hlcsv, hlexcel = se.generate_samples_extract_output(final_buckets, dm)
    print("Output file generated.")

    print(f"Excel file: {excel_path}")
    print(f"CSV file: {csv_path}")
    print(f"Summary file: {summary_path}")

    if dm.holding_level_interventions != []:
        print(f"Holding level Excel file: {hlexcel}")
        print(f"Holding level CSV file: {hlcsv}")

# Final statistics

Run the cell below after the extraction is finished, to display a detailed summary of the extracted data.

In [None]:
print(final_buckets)

In [None]:
from modules import visualizations as vis

for ua_group in dm.ua_groups:
    dm.ua_groups[ua_group]["selected"] = len(dm.final_bucket_state[ua_group]["parcels"])

vis.display_statistics_summary(dm)
vis.display_reused_and_covered(dm) 
vis.display_bucket_stats(dm)