# JRC Sample Extraction
<div>
<img src="jrc_ec_logo.jpg" width="400"/>
</div>

A software tool that selects inspection samples for the quality assessments (QA) of the Area Monitoring System (AMS) and the GeoSpatial Application (GSA).

Authors: Mateusz Dobrychłop (mateusz.dobrychlop@ext.ec.europa.eu), Fernando Fahl (fernando.fahl@ext.ec.europa.eu), Ferdinando Urbano (ferdinando.urbano@ec.europa.eu)

## Introduction

### Working With an Interactive Jupyter Notebook

#### What is a Jupyter Notebook?

What you are looking at is a *Jupyter Notebook*. It's an interactive document that combines executable code, text, images, and visualizations in one easily accessible place. This format is particularly useful for data analysis, scientific research, and complex computations, providing a hands-on way to explore and present data dynamically. It also offers a more user-friendly way to run Python scripts and supply their input data than a purely command-line solution.

#### Where to Open This Notebook

To open and use this notebook, you will need an environment that supports Jupyter Notebooks. The simplest option for many is Visual Studio Code (VSCode) with the Python and Jupyter extensions installed, which lets you run Jupyter Notebooks directly within the editor. Alternatively, you can use traditional Jupyter environments such as JupyterLab or Jupyter Notebook, available through Anaconda or directly in your web browser.

#### How to Use This Notebook

* **Running the Code:** Each section of code, called a "cell", can be executed independently by selecting it and pressing Shift + Enter, or by using the "Run" button in the toolbar. This runs the code within that cell and displays any outputs directly below it. The code in the cells is intentionally kept brief and simple. To look more closely at the details behind the algorithm, open one of the Python files that the notebook imports from.

* **Interactive Elements:** This notebook includes interactive widgets (like checkboxes, buttons, text boxes, and possibly more in the future) designed to let you change parameters and interact with the data without modifying the code directly. We advise against changing anything in the code cells. The widgets are not displayed by default; they appear after you run the code cell that implements them.

* **Sequential Execution:** It's important to run the cells in the order they appear. Some cells depend on code or data from earlier cells, so running them out of order might result in errors or incorrect outputs.

#### Getting Started

To begin, simply start at the top of the notebook and work your way down. Each cell is designed to guide you through the process and provide interactive elements to adjust the analysis. If you encounter any issues, ensure that you have run all preceding cells in the order they appear.

### Installation of Python and other prerequisites

To ensure that you can run this Jupyter Notebook smoothly, you'll need to install Python along with several required libraries. Follow these step-by-step instructions to set up everything you need.

#### Step 1: Install Python
1. Download Python: Go to the official Python website at python.org and download the latest version of Python 3.12.x for your operating system (Windows, macOS, or Linux).
2. Install Python: Open the downloaded file and run the installer. On Windows, make sure you check the box that adds Python to PATH (labelled "Add python.exe to PATH" in recent installers) before clicking "Install Now".

#### Step 2: Install Required Libraries
1. Open your command prompt or terminal:

    * On Windows, you can search for CMD or Command Prompt in your start menu.

    * On macOS, open the Terminal application from your Applications/Utilities folder.

    * On Linux, open your terminal from your applications menu or by pressing Ctrl+Alt+T.

2. Install the libraries using pip: Type the following command and press Enter:

    ```
    pip install ipython ipywidgets pandas openpyxl notebook
    ```

    This command will install:

    * IPython: An enhanced interactive Python interpreter.
    * ipywidgets: Tools for creating interactive GUIs (sets of widgets) within Jupyter notebooks.
    * pandas: A powerful data manipulation and analysis library.
    * openpyxl: A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
    * notebook: The package that includes Jupyter Notebook.
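
To confirm that the installation worked, a short check like the one below can be run in a Python session. It is only a sketch: the package names come from the `pip` command above, while the helper name `installed_versions` is made up for this example and is not part of the tool.

```python
from importlib import metadata

# Package names as used in the pip command above.
REQUIRED = ["ipython", "ipywidgets", "pandas", "openpyxl", "notebook"]


def installed_versions(packages):
    """Return {package: installed version, or None if the package is missing}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` can be added by running the `pip install` command again.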

#### Step 3: Launching the Notebook

* Option 1: Using Visual Studio Code (VSCode)
    * Install VSCode: Download and install Visual Studio Code from code.visualstudio.com.
    * Install the Python and Jupyter extensions: Open VSCode and go to the Extensions view by clicking on the square icon on the sidebar or pressing Ctrl+Shift+X. Search for 'Python' and 'Jupyter' and install the extensions offered by Microsoft.
    * Open the notebook in VSCode: Open the folder containing your notebook using File > Open Folder, and then click on the notebook file to open it in a VSCode tab.

* Option 2: Using Jupyter Notebook directly
    * Launch the notebook: Type the following command in your terminal or command prompt:

        ```
        jupyter notebook
        ```
    * This will start the Jupyter Notebook server and should automatically open a web browser window showing the Notebook Dashboard. From here, you can navigate to and open the notebook file you need to use.

## Loading Input Data

The solution takes the following CSV files as input:

* **Parcel file (mandatory)** - a file that contains information about parcels, holdings, and UA groups, from which the algorithm selects sample data.

* **Target file (optional)** - a file that defines the target number of parcels that each bucket (corresponding to a certain UA group) should be populated with, as long as there is enough data in the parcel file.

The bucket target values, as well as other parameters, can also be defined manually using interactive widgets.

Detailed instructions on how to format the input files are included below.

### Parcel File

#### File format

The parcel file should be a standard CSV (*comma-separated values*) file. Despite the format's name, some CSV files use a different delimiter (the symbol separating columns), such as a semicolon, so make sure that the delimiter in your file is in fact a comma.
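
If you are unsure which delimiter a file uses, Python's standard `csv.Sniffer` can guess it from the first few lines. The snippet below is a minimal sketch using in-memory sample text. The function name `detect_delimiter` and the list of candidate delimiters are assumptions made for this example and are not part of the tool.

```python
import csv


def detect_delimiter(sample_text, candidates=",;\t|"):
    """Guess the column delimiter from the first lines of a CSV file."""
    return csv.Sniffer().sniff(sample_text, delimiters=candidates).delimiter


# Illustrative header and rows; a real check would read the first lines of the file.
comma_sample = (
    "gsa_par_id,gsa_hol_id,ua_grp_id,covered,ranking\n"
    "QWER-5668-44453,3221,E5,1,1\n"
    "QWER-5668-44453,3221,ANC,1,1\n"
)
semicolon_sample = comma_sample.replace(",", ";")

print(repr(detect_delimiter(comma_sample)))
print(repr(detect_delimiter(semicolon_sample)))
```

A file for which the detected delimiter is not a comma should be re-exported with commas before it is loaded.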

Below is the list of columns that the parcel file must contain:

| **Column name** | **Type** | **Description** | **Comments** |
| --------------- | -------- | --------------- | ------------ |
| gsa_par_id | string (text) | Parcel ID | |
| gsa_hol_id | integer (whole number) | Holding ID | |
| ua_grp_id | string (text) | UA group ID | |
| covered | integer (whole number) | Is the parcel covered by an HR image? (1 - yes, 0 - no) | Can only contain 0 or 1 |
| ranking | integer (whole number) | Ranking value defining the priority in sample parcel selection order. | |


What's also important to note:

* The column names must be identical to the names included in the list above.

* The data types have to match the requirements listed above.

* There should be no empty cells in the dataset.

* Including other columns in the file should not prevent the tool from working, but could potentially slow it down.

* The order of the columns does not matter.

Preview of a correctly formatted dataset:

| **gsa_par_id** | **gsa_hol_id** | **ua_grp_id** | **covered** | **ranking** |
| -------------- | -------------- | ------------- | ----------- | ----------- |
| QWER-5668-44453 | 3221 | E5 | 1 | 1 |
| QWER-5668-44453 | 3221 | ANC | 1 | 1 |
| UIOP-4671-02080 | 2137 | BIS | 1 | 2 |
| UIOP-4671-02080 | 2137 | ANC | 1 | 2 |
| HJKL-4470-03366 | 8901 | ANC | 1 | 3 |

Please keep in mind that the naming conventions for the identifiers included in the file can differ between member states.
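
The formatting rules above can be checked before loading a file into the tool. The following is a rough sketch using pandas, not the validation performed by the tool itself. The function name `check_parcel_table` and the wording of the messages are assumptions made for this example.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = ["gsa_par_id", "gsa_hol_id", "ua_grp_id", "covered", "ranking"]
INTEGER_COLUMNS = ["gsa_hol_id", "covered", "ranking"]


def check_parcel_table(df):
    """Return a list of problems found; an empty list means none were found."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    if df[REQUIRED_COLUMNS].isna().any().any():
        problems.append("empty cells found in required columns")
    for col in INTEGER_COLUMNS:
        if not pd.api.types.is_integer_dtype(df[col]):
            problems.append(f"column '{col}' is not of integer type")
    if not df["covered"].isin([0, 1]).all():
        problems.append("column 'covered' contains values other than 0 and 1")
    return problems


# In-memory stand-in for a parcel file; a real check would pass the file path instead.
csv_text = (
    "gsa_par_id,gsa_hol_id,ua_grp_id,covered,ranking\n"
    "QWER-5668-44453,3221,E5,1,1\n"
    "QWER-5668-44453,3221,ANC,1,1\n"
)
parcels = pd.read_csv(io.StringIO(csv_text), dtype={"gsa_par_id": str, "ua_grp_id": str})
print(check_parcel_table(parcels))
```

Reading the ID columns with `dtype=str` keeps text identifiers from being converted to numbers when they happen to consist of digits only.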

Run the code cell below to display a set of widgets that will allow you to select your input file.

[INSTRUCTIONS + PARCEL FILE FORMATTING HERE]

```python
import sample_extraction_gui as gui

uploaded_files = gui.display_parcel_input_config()
```


## Define bucket targets

Run the code cell below to display a set of widgets that will allow you to define UA group bucket targets.

[INSTRUCTIONS + TARGET FILE FORMATTING HERE]

```python
gui.display_bucket_targets_config()

# TODO: handle the case where this cell is run before the parcels are loaded
# (no widgets are shown) and the parcel file is loaded only afterwards.
# TODO: add a reload button? It could be shown together with the "no ua_grp_id" error.
```


## Set output

## Set optional parameters

To be added in future versions.

```python
# gui.display_advanced_config()
```

## Running extraction

```python
import sample_extraction

# Prepare the buckets and the parcel table from the parameters set in the widgets above.
buckets = sample_extraction.prepare_buckets(gui.PARAMETERS["ua_groups"])
parcels = sample_extraction.prepare_input_dataframe(gui.PARAMETERS["parcels_df"])

# Display the output widgets for the buckets.
widgets = gui.display_output_area(buckets)

# Run the extraction.
sample_extraction.iterate_over_interventions(parcels, buckets, widgets)

# Report whether every bucket is full.
if sample_extraction.buckets_full(buckets):
    print("\nAll buckets full!")
else:
    print("\nSome buckets not full!")

# Write the output file.
sample_extraction.generate_output(buckets)
print("\nOutput file generated.")
```



All buckets full!

Output file generated.


main changes:
- when looping through a holding, every time a parcel is checked, all of its interventions have to be added to all possible buckets (unless the bucket is full)
- when looping through the parcels of a holding, stop adding parcels to a bucket once 3 parcels from that holding are already in it; the other buckets still have to be checked. implement some kind of local bucket list that is then merged with the main one?

- work on the interface a bit
- add some parameters
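
One possible reading of the first two points above, as a rough sketch only. This is not the code in `sample_extraction`. The function name `fill_buckets`, the tuple layout of `rows`, and the parameter `max_per_holding` are assumptions made for illustration. The sketch also assumes that each row is one parcel/UA group pair and that the rows are already sorted in selection order.

```python
from collections import defaultdict


def fill_buckets(rows, targets, max_per_holding=3):
    """Assign parcels to UA group buckets with a per-holding cap.

    rows: iterable of (parcel_id, holding_id, ua_group) tuples in selection order.
    targets: {ua_group: number of parcels the bucket should hold}.
    """
    buckets = {group: [] for group in targets}
    # (ua_group, holding_id) -> parcels from that holding already in that bucket
    per_holding = defaultdict(int)
    for parcel_id, holding_id, ua_group in rows:
        bucket = buckets.get(ua_group)
        if bucket is None:
            continue  # no target defined for this UA group
        if len(bucket) >= targets[ua_group]:
            continue  # bucket is already full
        if per_holding[(ua_group, holding_id)] >= max_per_holding:
            continue  # cap reached for this bucket only; other buckets are unaffected
        bucket.append(parcel_id)
        per_holding[(ua_group, holding_id)] += 1
    return buckets


rows = [
    ("P1", 10, "ANC"), ("P1", 10, "E5"),
    ("P2", 10, "ANC"), ("P3", 10, "ANC"),
    ("P4", 10, "ANC"), ("P4", 10, "E5"),
    ("P5", 20, "ANC"),
]
result = fill_buckets(rows, targets={"ANC": 4, "E5": 2})
print(result)
```

In this example parcel `P4` is skipped for the `ANC` bucket because holding `10` already contributes three parcels there, but it is still added to `E5`. Counting per (bucket, holding) pair is one way to avoid a separate local bucket list.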


---
- 3 interventions per bucket


- parcel -> check all interventions

- holding check -> 



- once a bucket is filled, maybe remove all rows related to it from the main list?
---


modifiable parameters: 
- 3% limit
- just the parcels covered by area images or whole list?
- priority to the covered?

- costas, augusta, slavko, gilbert, paulo




- 3% rule = for all buckets, once 3% of holdings have been added, stop adding to a bucket even if below threshold. after reaching 3% of holdings, only keep adding parcels from the holdings already added
- grouping rows together (1 row per parcel with a list of interventions in one column)
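
The row grouping mentioned in the last point could be done with a pandas `groupby`. This is only a sketch: it assumes that an "intervention" corresponds to one `ua_grp_id` row of the parcel file, and the output column name `ua_grp_ids` is made up for this example.

```python
import pandas as pd

rows = pd.DataFrame(
    {
        "gsa_par_id": ["QWER-5668-44453", "QWER-5668-44453", "UIOP-4671-02080"],
        "gsa_hol_id": [3221, 3221, 2137],
        "ua_grp_id": ["E5", "ANC", "BIS"],
    }
)

# One row per parcel, with all of its UA groups collected into a list.
grouped = (
    rows.groupby(["gsa_par_id", "gsa_hol_id"], sort=False)["ua_grp_id"]
    .agg(list)
    .reset_index(name="ua_grp_ids")
)
print(grouped)
```

With `sort=False` the parcels keep the order in which they first appear, which matters if the rows were sorted by ranking beforehand.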