# JRC Sample Extraction
<div>
<img src="jrc_ec_logo.jpg" width="400"/>
</div>

A solution that selects inspection samples for the quality assessments (QA) of the Area Monitoring System (AMS) and the GeoSpatial Application (GSA).

[INTRODUCTION + GENERAL INSTRUCTIONS HERE]

Authors: Fernando Fahl (fernando.fahl@ext.ec.europa.eu), Mateusz Dobrychłop (mateusz.dobrychlop@ext.ec.europa.eu), Ferdinando Urbano (ferdinando.urbano@ec.europa.eu)

## Load parcel list

Run the code cell below to display a set of widgets that will allow you to select your input file.

[INSTRUCTIONS + PARCEL FILE FORMATTING HERE]

In [1]:
import sample_extraction_gui as gui
uploaded_files = gui.display_parcel_input_config()


## Define bucket targets

Run the code cell below to display a set of widgets that will allow you to define ua group bucket targets.

[INSTRUCTIONS + TARGET FILE FORMATTING HERE]

In [2]:
gui.display_bucket_targets_config()


## Set output and optional parameters

[INSTRUCTIONS]

In [3]:
gui.display_advanced_config()


## Running extraction

[INSTRUCTIONS]

In [None]:
# pseudocodish code that makes it easy to follow the algorithm workflow

# Old script:

## Imports

In [None]:
import pandas as pd
import datetime
import cosmetics_tools

## Data extraction and output writing functions

The functions below process two key input files:
- the ranked "interventions" CSV file, in which each row is an intervention type associated with a parcel, which in turn is associated with a holding
- the "targets" CSV file, which lists the target parcel count for each bucket; each bucket corresponds to a single intervention type

The interventions file is loaded into a DataFrame (keeping only the rows where `covered` equals 1, and only some of the columns) and sorted by the "ranking" column. This DataFrame is the main data structure that the extraction iterates over.

The targets file is used to create a dictionary that represents the buckets, which is gradually populated with rows from the interventions DataFrame. Targets above 300 are capped at 300.

This cell also defines a simple function that saves the output as an Excel file.

**TODO: Instead of just adding a single row to a bucket, I should probably also implement a loop through all rows of a parcel.**
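To make the two inputs concrete, here is a minimal sketch with made-up data. The column names (`gsa_par_id`, `gsa_hol_id`, `ua_grp_id`, `ranking`, `covered`, `target1`) are the ones read by the functions in the next cell; the values, the in-memory CSV text and the `toy_` variable names are purely illustrative.

```python
import io
import pandas as pd

# Made-up stand-ins for the two input CSV files (values are illustrative only).
interventions_csv = io.StringIO(
    "gsa_par_id,gsa_hol_id,ua_grp_id,ranking,covered\n"
    "P1,H1,A,2,1\n"
    "P2,H1,B,1,1\n"
    "P3,H2,A,3,0\n"
)
targets_csv = io.StringIO("ua_grp_id,target1\nA,2\nB,1\n")

# Interventions: keep covered rows, select the needed columns, sort by ranking.
toy_interventions = pd.read_csv(interventions_csv)
toy_interventions = toy_interventions[toy_interventions["covered"] == 1]
toy_interventions = toy_interventions[["gsa_par_id", "gsa_hol_id", "ua_grp_id", "ranking"]]
toy_interventions = toy_interventions.sort_values(by="ranking")

# Targets: one empty bucket per ua group, keyed by ua_grp_id.
toy_targets = pd.read_csv(targets_csv).set_index("ua_grp_id")["target1"].to_dict()
toy_buckets = {grp: {"target": int(t), "parcels": []} for grp, t in toy_targets.items()}

print(toy_interventions)
print(toy_buckets)
```

In this toy input, parcel `P3` is dropped because it is not covered, and the remaining rows come out ordered by ranking (`P2`, then `P1`).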

In [None]:
def extract_interventions(path):
    """
    Extracts the interventions from the csv file and returns a dataframe with the columns:
    - parcel_id
    - holding_id
    - intervention_type_id
    - ranking
    """
    interventions_full_df = pd.read_csv(path)
    # filter down to keep only rows where covered = 1
    interventions_full_df = interventions_full_df[interventions_full_df["covered"] == 1]
    interventions_df = interventions_full_df[["gsa_par_id", "gsa_hol_id", "ua_grp_id", "ranking"]]
    interventions_df = interventions_df.sort_values(by="ranking")
    interventions_df = interventions_df.rename(columns={"ua_grp_id": "intervention_type_id", 
                                                        "gsa_hol_id": "holding_id", 
                                                        "gsa_par_id": "parcel_id"})
    
    # add a unique row id column that is a combination of parcel_id, holding_id and intervention_type_id
    # (joined with a separator, so that different id combinations are less likely to produce the same string);
    # this will be used to identify which rows were already added to buckets
    interventions_df['row_id'] = (interventions_df['parcel_id'].astype(str) + "_"
                                  + interventions_df['holding_id'].astype(str) + "_"
                                  + interventions_df['intervention_type_id'].astype(str))
    interventions_df['order_added'] = 0


    return interventions_df

def buckets_global_count(buckets):
    """
    Returns a number of rows already added to all buckets together
    """
    return sum([len(bucket['parcels']) for bucket in buckets.values()])

def extract_buckets(path):
    """
    Extracts the targets from the csv file and returns a dictionary with the keys being the intervention_type_id
    and the values being a dictionary with the keys:
    - target: the target number of parcels (capped at 300)
    - parcels: a list of dictionaries with the keys:
        - parcel_id
        - holding_id
        - ranking
        - order_added
    """
    targets_full_df = pd.read_csv(path)
    targets_df = targets_full_df[["ua_grp_id", "target1"]]
    # map each ua_grp_id to its target parcel count
    targets = targets_df.set_index('ua_grp_id')['target1'].to_dict()
    buckets = {}
    for bucket_id, target in targets.items():
        # cap the target of every bucket at 300 parcels
        buckets[bucket_id] = {'target': min(target, 300), 'parcels': []}
    return buckets

def generate_output(buckets):
    """
    Generates an output xlsx file with the following columns:
    - bucket_id
    - parcel_id
    - holding_id
    - ranking
    - order_added
    - target
    """
    output = []
    for bucket_id, bucket in buckets.items():
        for parcel in bucket['parcels']:
            output.append([bucket_id, parcel["parcel_id"], parcel["holding_id"], parcel["ranking"], parcel["order_added"], bucket['target']])
    output_df = pd.DataFrame(output, columns=["bucket_id", "parcel_id", "holding_id", "ranking", "order_added", "target"])

    filename = "output/output_" + datetime.datetime.now().strftime("%Y-%m-%d_%H%M%S") + ".xlsx"
    output_df.to_excel(filename, index=False)

## Row allocation functions

Functions that distribute intervention DataFrame information into the buckets dictionary.
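As a rough illustration of the basic idea only (not the functions defined in the cell below), the following sketch fills toy buckets from rows that are already in ranking order and skips rows whose bucket has reached its target. The `fill_buckets` helper and the data are hypothetical, and the per-holding logic is deliberately left out.

```python
def fill_buckets(rows, targets):
    """Assign (parcel_id, intervention_type_id) rows to buckets until targets are met."""
    buckets = {key: [] for key in targets}
    for parcel_id, intervention_type_id in rows:
        bucket = buckets.get(intervention_type_id)
        # Skip rows without a bucket, and rows whose bucket is already full.
        if bucket is None or len(bucket) >= targets[intervention_type_id]:
            continue
        bucket.append(parcel_id)
    return buckets

# Rows are assumed to be in ranking order already.
toy_rows = [("P1", "A"), ("P2", "A"), ("P3", "B"), ("P4", "A")]
toy_filled = fill_buckets(toy_rows, {"A": 2, "B": 1})
print(toy_filled)
```

Here `P4` is not selected, because bucket `A` already holds its two target parcels by the time that row is reached.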

In [None]:


def buckets_full(buckets):
    """
    Returns True if all buckets are full, False otherwise
    """
    return all(len(bucket['parcels']) >= bucket['target'] for bucket in buckets.values())


def check_holding_group_old(holding_group, buckets, added_rows):
    """
    Old version, not called by the main loop.
    Checks the holding group for parcels that can be added to buckets.
    If possible, adds up to 3 rows from the holding group to buckets (3 in total, not 3 per bucket).
    Adds added rows to the added_rows set.
    """
    counter = 3
    for index, holding_row in holding_group.iterrows():
        if buckets_full(buckets) or counter == 0:
            break
        for bucket_id, bucket in buckets.items():
            if holding_row["intervention_type_id"] == bucket_id and len(bucket['parcels']) < bucket['target'] and holding_row["row_id"] not in added_rows:
                bucket['parcels'].append({"parcel_id": holding_row["parcel_id"],
                                          "holding_id": holding_row["holding_id"],
                                          "ranking": holding_row["ranking"],
                                          # generate_output() expects this key
                                          "order_added": buckets_global_count(buckets) + 1,
                                          })
                added_rows.add(holding_row["row_id"])
                counter -= 1

    return buckets

def all_buckets_used_3_times(bucket_counter):
    """
    Returns True if all buckets have been used 3 times, False otherwise.
    """
    return all(value == 3 for value in bucket_counter.values())

def check_holding_group(holding_group, buckets, added_rows):
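    """
    New version: goes through the holding parcel by parcel and adds all interventions
    of each parcel to the matching buckets, with at most 3 rows from this holding per bucket.
    """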
    # create a dictionary that has the same keys as buckets, but the values are just zeros
    bucket_counter = {key: 0 for key in buckets.keys()}
    for index, holding_row in holding_group.iterrows():
        if buckets_full(buckets) or all_buckets_used_3_times(bucket_counter):
            break
        parcel_group = holding_group[holding_group["parcel_id"] == holding_row["parcel_id"]]
        buckets, bucket_counter = check_parcel(parcel_group, buckets, added_rows, bucket_counter)

    return buckets


def check_parcel(parcel_group, buckets, added_rows, bucket_counter=None):
    """
    Adds the rows of a single parcel to the matching buckets that are not full yet.
    If bucket_counter is given, at most 3 rows per bucket are added and the counter is updated.
    """
    for index, parcel_row in parcel_group.iterrows():
        if buckets_full(buckets):
            break
        for bucket_id, bucket in buckets.items():
            if parcel_row["intervention_type_id"] == bucket_id and len(bucket['parcels']) < bucket['target'] and parcel_row["row_id"] not in added_rows:
                if bucket_counter is None or bucket_counter[bucket_id] < 3:
                    bucket['parcels'].append({"parcel_id": parcel_row["parcel_id"],
                                              "holding_id": parcel_row["holding_id"],
                                              "ranking": parcel_row["ranking"],
                                              "order_added": buckets_global_count(buckets) + 1,
                                              })
                    added_rows.add(parcel_row["row_id"])
                    # the counter only exists when called from check_holding_group
                    if bucket_counter is not None:
                        bucket_counter[bucket_id] += 1

    return buckets, bucket_counter


def iterate_over_interventions(interventions_df, buckets): #, progress_bars):
    """
    Main loop of the script.
    Iterates over the rows in the interventions dataframe and adds parcels to the buckets.
    """

    print("Buckets: (\033[92mgreen\033[0m = full, \033[93myellow\033[0m = still looking for parcels)")
    checked_holdings = set()
    added_rows = set()
    for index, row in interventions_df.iterrows():
        if buckets_full(buckets):
            break
        if row["holding_id"] not in checked_holdings:
            checked_holdings.add(row["holding_id"])
            holding_group = interventions_df[interventions_df["holding_id"] == row["holding_id"]]
            buckets = check_holding_group(holding_group, buckets, added_rows)
        else:
            #buckets = check_individual_row(row, buckets, added_rows)
            parcel_group = interventions_df[interventions_df["parcel_id"] == row["parcel_id"]]
            buckets, bucket_counter = check_parcel(parcel_group, buckets, added_rows)

        cosmetics_tools.print_progress(buckets)
        #cosmetics_tools.update_progress_bars(buckets, progress_bars)

    return buckets

## Execute solution

In [None]:
interventions_path = "input/MT_ua_grp_tiles.csv"
targets_path = "input/MT_view_target_sample_size.csv"

interventions_df = extract_interventions(interventions_path)
buckets = extract_buckets(targets_path)
# pbars = cosmetics_tools.show_progress_bars(buckets)

iterate_over_interventions(interventions_df, buckets) #, pbars)    

# Indicate the reason why the run ended
if buckets_full(buckets):
    print("\nAll buckets full!")
else:
    print("\nSome buckets not full!")

generate_output(buckets)   
print("\nOutput file generated.")

Main changes:
- When looping through a holding, every time a parcel is checked, all interventions from it have to be added to all possible buckets (unless the bucket is full).
- When looping through holding parcels, stop adding parcels to a bucket if 3 parcels from that holding are already in the bucket, but the other buckets still have to be checked. Implement some kind of local bucket list that is then merged with the big one?
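One possible shape for the "local bucket list" idea is sketched below: allocations for a single holding are collected locally, capped at 3 parcels per bucket, and then merged into the global buckets. The helper name `allocate_holding`, the `max_per_bucket` parameter and the toy data are assumptions made for illustration, and the sketch assumes each parcel appears at most once per intervention type.

```python
from collections import defaultdict

def allocate_holding(holding_rows, buckets, max_per_bucket=3):
    """Collect one holding's rows locally, then merge them into the global buckets.

    holding_rows: list of (parcel_id, intervention_type_id), in ranking order.
    buckets: {intervention_type_id: {"target": int, "parcels": list}}.
    """
    local = defaultdict(list)
    for parcel_id, intervention_type_id in holding_rows:
        bucket = buckets.get(intervention_type_id)
        if bucket is None:
            continue
        room = bucket["target"] - len(bucket["parcels"])
        taken = len(local[intervention_type_id])
        # Stop adding to this bucket once it is full or the holding cap is hit;
        # the other buckets are still checked for the remaining rows.
        if taken < min(room, max_per_bucket):
            local[intervention_type_id].append(parcel_id)
    # Merge the local allocation into the global buckets.
    for intervention_type_id, parcels in local.items():
        buckets[intervention_type_id]["parcels"].extend(parcels)
    return buckets

toy_global = {"A": {"target": 10, "parcels": []}, "B": {"target": 1, "parcels": []}}
toy_holding = [("P1", "A"), ("P1", "B"), ("P2", "A"), ("P3", "A"), ("P4", "A"), ("P4", "B")]
allocate_holding(toy_holding, toy_global)
print(toy_global)
```

In the toy run, `P4` is rejected for bucket `A` because the holding already contributed 3 parcels to it, and rejected for bucket `B` because that bucket's target of 1 is already met by `P1`.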

- Work on the interface a bit.
- Add some parameters.


---
- 3 interventions per bucket
- parcel -> check all interventions
- holding check -> 
- once a bucket is filled, maybe remove all rows related to it from the main list?
---
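The last idea in the notes above, removing the rows of a filled bucket from the main list, could look roughly like this. The helper name `drop_rows_of_full_buckets` and the toy data are hypothetical; the column name `intervention_type_id` and the bucket layout follow the functions earlier in this notebook.

```python
import pandas as pd

def drop_rows_of_full_buckets(interventions_df, buckets):
    """Return interventions_df without the rows whose bucket has reached its target."""
    full_ids = [
        bucket_id
        for bucket_id, bucket in buckets.items()
        if len(bucket["parcels"]) >= bucket["target"]
    ]
    return interventions_df[~interventions_df["intervention_type_id"].isin(full_ids)]

toy_rows_df = pd.DataFrame({
    "parcel_id": ["P1", "P2", "P3"],
    "intervention_type_id": ["A", "B", "A"],
})
toy_state = {
    "A": {"target": 1, "parcels": [{"parcel_id": "P1"}]},  # full
    "B": {"target": 2, "parcels": []},                      # still open
}
remaining = drop_rows_of_full_buckets(toy_rows_df, toy_state)
print(remaining)
```

Note that shrinking the DataFrame also changes what the per-holding and per-parcel lookups in the main loop would see, so this would need to be checked against the intended allocation rules.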


Modifiable parameters:
- 3% limit
- Use just the parcels covered by area images, or the whole list?
- Give priority to the covered parcels?
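If the 3% limit becomes a parameter, one reading of it is that new holdings are accepted only while the share of selected holdings stays within the limit, and that holdings already selected can always contribute more parcels. The sketch below encodes that reading only; the function name `holding_allowed`, its arguments and the `holding_limit` default are assumptions, not existing code.

```python
def holding_allowed(holding_id, selected_holdings, total_holdings, holding_limit=0.03):
    """Return True if parcels from holding_id may still be added.

    selected_holdings: set of holding ids that already contributed parcels.
    total_holdings: number of distinct holdings in the input list.
    """
    # Holdings that are already part of the sample can always contribute more.
    if holding_id in selected_holdings:
        return True
    # A new holding is accepted only if the share stays within the limit.
    return (len(selected_holdings) + 1) / total_holdings <= holding_limit

# Toy example: 6 of 200 holdings (3%) are already selected.
toy_selected = {f"H{i}" for i in range(1, 7)}
print(holding_allowed("H1", toy_selected, 200))   # already selected
print(holding_allowed("H99", toy_selected, 200))  # would exceed the limit
```

Keeping the check in one small function would make it easy to switch the limit off or to change its value from the interface.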

- costas, augusta, slavko, gilbert, paulo




- 3% rule: for all buckets, if 3% of holdings have been added, stop adding to a bucket even if it is below its threshold. Once 3% of holdings is reached, only keep adding parcels from the holdings already added.
- Grouping rows together (1 row per parcel, with a list of interventions in one column).
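The row grouping mentioned in the last note can be sketched with a pandas `groupby`. The toy data and the `intervention_type_ids` column name are made up, and the sketch assumes that all rows of a parcel share the same `holding_id` and `ranking`.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "parcel_id": ["P1", "P1", "P2"],
    "holding_id": ["H1", "H1", "H1"],
    "intervention_type_id": ["A", "B", "A"],
    "ranking": [1, 1, 2],
})

# One row per parcel, with its intervention types collected into a list.
grouped = (
    toy_df.groupby(["parcel_id", "holding_id", "ranking"], sort=False)["intervention_type_id"]
    .agg(list)
    .reset_index(name="intervention_type_ids")
)
print(grouped)
```

With one row per parcel, the main loop would no longer need to look up all rows of a parcel each time it is checked.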