# IACS QA Sample Extraction
<div>
<img src="images/jrc_ec_logo.jpg" width="400"/>
</div>

The code included in this notebook, developed by the JRC, implements the procedure for Member States to define their samples for the IACS quality assessment. This involves extracting parcels from the ranked list provided by the Commission, following the selection rules in the Union Level Methodology 2024 (Chapters 3 and 4). Member States have discretion over the sample sizes, within the constraints of the Methodology.

The tool takes the ranked list in CSV format and returns a list of parcels assigned to buckets representing different interventions.

Authors: Mateusz Dobrychłop (mateusz.dobrychlop@ext.ec.europa.eu), Fernando Fahl (fernando.fahl@ext.ec.europa.eu), Ferdinando Urbano (ferdinando.urbano@ec.europa.eu)

## Release notes

**The solution is still in its beta testing phase. Implementation of some of the expected features and general optimization is planned for the upcoming weeks.**

The current version of the tool follows a **simplified algorithm** that will be expanded with new parameters in the near future:

* All parcels, covered and not covered by the HR images, are currently taken into account. This aspect will be controlled by a set of configuration parameters in an upcoming update (that will let the user prioritize covered parcels)

* Limiting the fraction of holdings (the *3% rule*) is not implemented yet (but will be soon)

* Some more flexibility regarding the output will soon be introduced. We are also working on making the output more informative.

* In the upcoming updates, a lot of emphasis will be placed on speed improvements.

## Introduction

### Python, Jupyter Notebook, Prerequisites

#### Python

The first prerequisite for running this solution is **Python**. You can download and install Python for your system using [this link](https://www.python.org/downloads/). We recommend version 3.12 or newer.

#### Jupyter Notebook

The solution is written in Python, and made available in the form of a **Jupyter Notebook**. It is an interactive document that allows you to combine executable code, text, images, and visualizations all in one easily accessible place. Documentation can be found [here](https://docs.jupyter.org/en/latest/).

To open and use this notebook, you will need an environment that supports Jupyter Notebooks. Some recommended options to look into:
* Running the notebook in your web browser [(link)](https://docs.jupyter.org/en/latest/running.html)
* JupyterLab [(link)](https://jupyterlab.readthedocs.io/en/latest/)
* Visual Studio Code [(link)](https://code.visualstudio.com/download)

#### Other Prerequisites

Once Python and the Jupyter Notebook environment are installed, please install the following Python libraries:

* **IPython**: An enhanced interactive Python interpreter.
* **ipywidgets**: Tools for creating interactive GUIs (sets of widgets) within Jupyter notebooks.
* **pandas**: A data manipulation and analysis library.
* **openpyxl**: A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.

Assuming the Python and pip paths are correctly added to your system's PATH, you can quickly install the required libraries using the following command:

```
pip install ipython ipywidgets pandas openpyxl notebook
```
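To confirm that the installation succeeded, a quick check along these lines can be run in any Python session. This is a minimal sketch and not part of the tool itself; the helper name `find_missing_modules` is hypothetical.

```python
import importlib.util

# Import names of the libraries listed above (note the capitalisation of IPython).
REQUIRED_MODULES = ["IPython", "ipywidgets", "pandas", "openpyxl"]


def find_missing_modules(module_names):
    """Return the names of modules that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```

If any library is reported as missing, rerun the `pip install` command above in the same environment that runs the notebook.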

### How to Use This Notebook

* **Running the Code:** Each section of code, called a "code cell", can be executed independently by selecting it and pressing Shift + Enter, or by using the "Run" button in the toolbar. This will run the code within that cell and display any outputs directly below it. The code included in the cells is intentionally kept very brief and simple - to have a closer look at the details behind the algorithm, you can open one of the Python files that the notebook imports from.

* **Interactive Elements:** This notebook includes interactive widgets (like checkboxes, buttons, text boxes, and possibly more in the future) designed to make it easier for you to change parameters and interact with the data without needing to modify the code directly. We actually advise against changing anything in the code cells. The widgets are not displayed by default - they appear after running a code cell that implements them.

* **Sequential Execution:** It's important to run the cells in the order they appear. Some cells depend on code or data from earlier cells, so running them out of order might result in errors or incorrect outputs.

## Loading Input Data

The solution takes the following CSV files as input:

* **Parcel file (mandatory)** - a file that contains information about parcels, holdings, and UA groups, from which the algorithm selects the sample data.

* **Target file (optional)** - a file that defines the target number of parcels that each bucket (corresponding to a certain UA group) should be populated with, as long as there is enough data in the parcel file.

The bucket target values, as well as other parameters, can also be defined manually using interactive widgets.

Detailed instructions on how to format the input files are included below.

### Parcel File

#### File format

The parcel file must be a standard CSV (*comma-separated values*) file. Despite the format's name, some CSV files use a different delimiter (the symbol separating columns), such as a semicolon, so make sure the delimiter in your file is a comma.
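If you are unsure which delimiter a file uses, a rough check of its header line can help. The sketch below is only an illustration and not part of the tool: `guess_delimiter` is a hypothetical helper that picks the candidate character occurring most often in the header.

```python
def guess_delimiter(header_line, candidates=(",", ";", "\t", "|")):
    """Return the candidate character that occurs most often in the header line."""
    return max(candidates, key=header_line.count)


# Header lines used only for illustration.
comma_header = "gsa_par_id,gsa_hol_id,ua_grp_id,covered,ranking"
semicolon_header = "gsa_par_id;gsa_hol_id;ua_grp_id;covered;ranking"

print(repr(guess_delimiter(comma_header)))
print(repr(guess_delimiter(semicolon_header)))
```

For a real file, the header can be read with `open(path, encoding="utf-8").readline()` and passed to the same function. If the result is not a comma, re-export the file with comma delimiters before loading it.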

Below is the list of columns that the parcel file must contain:

| **Column name** | **Type** | **Description** | **Comments** |
| --------------- | -------- | --------------- | ------------ |
| gsa_par_id | string (text) | Parcel ID | |
| gsa_hol_id | integer (whole number) | Holding ID | |
| ua_grp_id | string (text) | UA group ID | |
| covered | integer (whole number) | Is the parcel covered by a HR image? (1 - yes, 0 - no) | Can only contain 0 or 1 |
| ranking | integer (whole number) | Ranking value defining the priority in sample parcel selection order. | |


What's also important to note:

* The column names must be identical to the names included in the list above.
* The data types have to match the requirements listed above.
* There should be no empty cells in the dataset.
* Including other columns in the file should not prevent the tool from working, but could potentially slow it down.
* The order of the columns does not matter.

Preview of a correctly formatted dataset:

```csv
gsa_par_id,gsa_hol_id,ua_grp_id,covered,ranking
QWER-5668-44453,3221,E5,1,1
QWER-5668-44453,3221,ANC,1,1
UIOP-4671-02080,2137,BIS,1,2
UIOP-4671-02080,2137,ANC,1,2
HJKL-4470-03366,8901,ANC,1,3
```

Please keep in mind that the naming conventions for the identifiers included in the file can differ between Member States.
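The formatting requirements above can be checked before loading the file into the notebook. The following is a minimal sketch using only the Python standard library; it is not the validation performed by the tool's *Load* button, and the function names are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ["gsa_par_id", "gsa_hol_id", "ua_grp_id", "covered", "ranking"]
INTEGER_COLUMNS = ["gsa_hol_id", "covered", "ranking"]


def is_integer(text):
    """Return True if the text can be parsed as a whole number."""
    try:
        int(text)
    except ValueError:
        return False
    return True


def validate_parcel_rows(csv_file):
    """Return a list of human-readable problems found in a parcel CSV."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for column in REQUIRED_COLUMNS:
            if not (row[column] or "").strip():
                problems.append(f"line {line_no}: empty value in '{column}'")
        for column in INTEGER_COLUMNS:
            value = (row[column] or "").strip()
            if value and not is_integer(value):
                problems.append(f"line {line_no}: '{column}' is not a whole number")
        if (row["covered"] or "").strip() not in ("", "0", "1"):
            problems.append(f"line {line_no}: 'covered' must be 0 or 1")
    return problems


# Hypothetical data: the second row has an invalid 'covered' value and no ranking.
sample = io.StringIO(
    "gsa_par_id,gsa_hol_id,ua_grp_id,covered,ranking\n"
    "AAAA-0000-00001,1001,E5,1,1\n"
    "AAAA-0000-00002,1002,BIS,2,\n"
)
for problem in validate_parcel_rows(sample):
    print(problem)
```

To check a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to `validate_parcel_rows`.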

#### Loading the parcel file

Run the code cell below to display a set of widgets that will allow you to select your input file.

Enter the path to the parcel file in the *Parcel file path* field. The path can be absolute or relative. For example, if the file is named *parcels.csv* and is located in the *input* folder next to this notebook, the relative path to enter in the *Parcel file path* field is:

```
input/parcels.csv
```

After providing the path, click the *Load* button to verify and load the data. The interface will let you know in case issues are encountered.
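If the interface reports that the file cannot be found, it can help to see what a relative path resolves to. The sketch below assumes that relative paths are interpreted against the current working directory, which for a notebook is usually the folder containing it; the file name is the example from the text above.

```python
from pathlib import Path

# Example relative path from the text above; the file itself may not exist.
parcel_path = Path("input") / "parcels.csv"

print(parcel_path.is_absolute())  # False for a relative path
print(parcel_path.resolve())      # absolute location the relative path points to
print(parcel_path.is_file())      # True only if a file exists at that location
```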

In [None]:
from modules import gui

uploaded_files = gui.display_parcel_input_config()

### Bucket targets

The bucket target is the number of parcels that a given bucket (corresponding to a given UA group) should be populated with, as long as there is enough data in the parcel file.

There are 3 methods of editing the target values:

* Manually writing the values using widgets / fields.
* Loading a CSV file with target values listed for each bucket / UA group.
* Setting a cutoff threshold that sets the maximum target value across all buckets.

#### **Bucket Target Widgets**

**Once the parcel file is loaded**, run the code cell below to display a set of widgets that will let you define the bucket target for each UA group identified in the parcel file.

If you load the target values from a file (process explained below), the widgets also show a preview of the currently loaded values, which can then be edited manually. By default, the target values are set to 300.

Each widget corresponds to a single bucket / UA group and consists of 3 elements:

* The UA group's identifier (non-modifiable)
* Target number of parcels for the bucket (modifiable)
* Number of rows identified in the parcel file corresponding to the UA group (in brackets, non-modifiable)

A message above the widget grid displays information on the last method that was used to set the values for the targets. 

#### **Target File**

The target values can be loaded from a CSV file too. This option is not mandatory, but can speed up the process of setting up target values for larger sets of UA groups.

It is important to remember that the UA groups considered in the analysis are derived from the parcel file, so the target file, if used, must be consistent with the parcel file. It can't list UA group IDs that are not present in the parcel file, it has to list all the UA group IDs that are included in the parcel file, and it has to follow the exact same naming as the parcel file.
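The consistency rule can be expressed as a comparison of two sets of UA group IDs. The sketch below is illustrative only; `compare_ua_groups` is a hypothetical helper and the IDs are made up.

```python
def compare_ua_groups(parcel_groups, target_groups):
    """Return (missing_from_targets, unknown_in_targets) as sorted lists."""
    parcel_set = set(parcel_groups)
    target_set = set(target_groups)
    missing_from_targets = sorted(parcel_set - target_set)
    unknown_in_targets = sorted(target_set - parcel_set)
    return missing_from_targets, unknown_in_targets


# Hypothetical UA group IDs, used only for illustration.
parcel_groups = ["E5", "ANC", "BIS", "ANC"]
target_groups = ["E5", "ANC", "XYZ"]

missing, unknown = compare_ua_groups(parcel_groups, target_groups)
print("In parcel file but not in target file:", missing)
print("In target file but not in parcel file:", unknown)
```

Both lists must be empty for the target file to be consistent with the parcel file.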

##### File Format

The target file must be a standard CSV (*comma-separated values*) file. Despite the format's name, some CSV files use a different delimiter (the symbol separating columns), such as a semicolon, so make sure the delimiter in your file is a comma.

Below is the list of columns that the target file must contain:

| **Column name** | **Type** | **Description** | **Comments** |
| --------------- | -------- | --------------- | ------------ |
| ua_grp_id | string (text) | UA group ID | Values must be unique and consistent with the IDs included in the parcel file (can't contain IDs not included in the parcel file, and must contain all IDs that are included in the parcel file) |
| target | integer (whole number) | Target number of parcels the bucket should be populated with, if there is enough data in the parcel file | Can be 0 or more |

What's also important to note:

* The column names must be identical to the names included in the list above.
* The data types have to match the requirements listed above.
* There should be no empty cells in the dataset (buckets with target value equal to 0 are allowed).
* Including other columns in the file should not prevent the tool from working.
* The order of the columns does not matter.

Preview of a correctly formatted dataset:

```csv
ua_grp_id,target
E5,70
ANC,250
BIS,2
```

##### Loading the Target File

In addition to displaying the widgets for manual input of target values, the code cell below displays a set of widgets that will allow you to select your target file.

Enter the path to the target file in the *Target file path* field. The path can be absolute or relative. For example, if the file is named *targets.csv* and is located in the *input* folder next to this notebook, the relative path to enter in the *Target file path* field is:

```
input/targets.csv
```

After providing the path, click the *Load* button to verify and load the data. The interface will let you know in case issues are encountered.

#### **Target cutoff value**

The *Target cutoff* field can be used to define a cutoff value for all target values at once. Putting a value in the field and clicking *Recalculate* adjusts all the target values so that none of them exceeds the cutoff value.
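One way to read this behaviour is that each target is capped at the cutoff while smaller targets stay unchanged. The sketch below assumes exactly that; the tool's own recalculation may differ in detail, and `apply_cutoff` is a hypothetical name.

```python
def apply_cutoff(targets, cutoff):
    """Return a copy of the targets in which no value exceeds the cutoff."""
    return {ua_group: min(target, cutoff) for ua_group, target in targets.items()}


# Target values taken from the target file preview above.
targets = {"E5": 70, "ANC": 250, "BIS": 2}
print(apply_cutoff(targets, 100))  # → {'E5': 70, 'ANC': 100, 'BIS': 2}
```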

In [None]:
gui.display_bucket_targets_config()

## Output Information

### Output file

Once the tool finishes, it automatically saves two files in two different formats (.xlsx and .csv), which share the same name and contain the same information. Each row of the output file corresponds to a single intervention within a parcel that was assigned to one of the UA group buckets.

In the future, the interface will offer some flexibility regarding the nature of the output files.

The output file contains the following columns:

| **Column name** | **Type** | **Description** | **Comments** |
| --------------- | -------- | --------------- | ------------ |
| bucket_id | string (text) | Analogous to the UA group ID. ID of the UA group associated with the bucket. | |
| gsa_par_id | string (text) | Parcel ID. | |
| gsa_hol_id | integer (whole number) | Holding ID | |
| ranking | integer (whole number) | Ranking value derived directly from the parcel file, defining the priority in sample parcel selection order. | |
| order_added | integer (whole number) | A counter value indicating the general order in which the information was added to buckets. | |
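As an illustration of how the output can be inspected, the sketch below counts the rows assigned to each bucket. It uses made-up rows with the column names from the table above and only the Python standard library; `count_rows_per_bucket` is a hypothetical helper, and the name of the real output file is not assumed here.

```python
import csv
import io
from collections import Counter


def count_rows_per_bucket(csv_file):
    """Return a dict mapping each bucket_id to its number of output rows."""
    counts = Counter(row["bucket_id"] for row in csv.DictReader(csv_file))
    return dict(counts)


# Hypothetical output rows, used only for illustration.
output_csv = io.StringIO(
    "bucket_id,gsa_par_id,gsa_hol_id,ranking,order_added\n"
    "E5,QWER-5668-44453,3221,1,1\n"
    "ANC,QWER-5668-44453,3221,1,2\n"
    "ANC,HJKL-4470-03366,8901,3,3\n"
)

rows_per_bucket = count_rows_per_bucket(output_csv)
print(rows_per_bucket)
```

Comparing these counts with the bucket targets shows which buckets were filled completely.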

### Live progress preview

After running the cell that executes the algorithm, a set of widgets is displayed that illustrates the progress of assigning parcels to different buckets. Orange text corresponds to buckets that are not yet full. Green text corresponds to buckets that were successfully filled. In upcoming versions of the tool, we plan to let users set the refresh rate of the output widgets, allowing for a slight performance gain at the cost of less accurate progress widgets.

## Setting Optional Parameters (work in progress)

Unavailable for now. 

In [None]:
# gui.display_advanced_config()

## Running Extraction

Before you run the cell below, please make sure the target values are set up correctly.

Even though we are working on optimizing the code, it can potentially run for a very long time, depending heavily on the size and other aspects of the input data. In the initial stages, the buckets should populate at a relatively high rate, with clearly visible changes to the progress widgets. In later stages of the analysis, however, some buckets can stay empty for a longer period of time while the algorithm looks for the right parcels to populate them with. It is not uncommon for most of the analysis to be completed very quickly, with the last one or two small buckets requiring most of the time to fill.

For details regarding the current state of the algorithm, please see the **Release notes** section near the top of the notebook.
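To give an intuition for what "filling buckets from a ranked list" means, here is a deliberately simplified sketch. It is not the code in `modules/sample_extraction` and does not implement the rules of the Methodology. It assumes that a lower ranking value means higher priority and that each row is added to its UA group's bucket until that bucket reaches its target; `fill_buckets` and the sample data are hypothetical.

```python
def fill_buckets(rows, targets):
    """Assign rows to UA group buckets in ascending ranking order.

    rows: iterable of dicts with 'gsa_par_id', 'ua_grp_id' and 'ranking' keys.
    targets: dict mapping UA group ID to the target number of parcels.
    """
    buckets = {ua_group: [] for ua_group in targets}
    order_added = 0
    for row in sorted(rows, key=lambda r: r["ranking"]):
        if all(len(buckets[g]) >= targets[g] for g in targets):
            break  # every bucket has reached its target
        bucket = buckets.get(row["ua_grp_id"])
        if bucket is None or len(bucket) >= targets[row["ua_grp_id"]]:
            continue  # unknown UA group, or this bucket is already full
        order_added += 1
        bucket.append({**row, "order_added": order_added})
    return buckets


# Hypothetical input rows and targets, used only for illustration.
rows = [
    {"gsa_par_id": "P1", "ua_grp_id": "E5", "ranking": 1},
    {"gsa_par_id": "P1", "ua_grp_id": "ANC", "ranking": 1},
    {"gsa_par_id": "P2", "ua_grp_id": "ANC", "ranking": 2},
    {"gsa_par_id": "P3", "ua_grp_id": "ANC", "ranking": 3},
]
targets = {"E5": 1, "ANC": 2}

for ua_group, selected in fill_buckets(rows, targets).items():
    print(ua_group, [r["gsa_par_id"] for r in selected])
```

The sketch also hints at why the last buckets can take the longest: once the common UA groups are full, most remaining rows are skipped while the loop searches for rows belonging to the few groups still below target.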

In [None]:
from modules import sample_extraction

buckets = sample_extraction.prepare_buckets(gui.PARAMETERS["ua_groups"])
parcels = sample_extraction.prepare_input_dataframe(gui.PARAMETERS["parcels_df"])

widgets = gui.display_output_area(buckets)


sample_extraction.iterate_over_interventions(parcels, buckets, widgets)

if sample_extraction.buckets_full(buckets):
    print("\nAll buckets full!")
else:
    print("\nSome buckets not full!")

sample_extraction.generate_output(buckets)   
print("\nOutput file generated.")