<span style="font-size:xx-large; font-weight:bold"> *BdApPV* Analysis of crowdsourced data of detection of PV panels</span>

This notebook presents the analysis of crowdsourced data from the [BdApPv](https://www.bdpv.fr/_BDapPV/) campaigns run by [BDPV](https://www.bdpv.fr) for the collaborative annotation of photovoltaic panels.

This notebook can be viewed online:
* Live, with MyBinder, [at this URL](https://mybinder.org/v2/git/https%3A%2F%2Fgit.sophia.mines-paristech.fr%2Foie%2Fbdappv.git/HEAD?labpath=annotations.ipynb)
* As a static page, with NbViewer, [at this URL](https://nbviewer.org/urls/git.sophia.mines-paristech.fr/oie/bdappv/-/raw/master/annotations.ipynb)

In [None]:
# Download the data folder from the Zenodo repository.
# Skip this cell if you have already downloaded the data and placed it in this folder.
!wget 'https://zenodo.org/record/7358126/files/data.zip?download=1' -O 'data.zip'
# unzip the file
!unzip 'data.zip' 
# delete the zip file
!rm 'data.zip'

In [1]:
# Initial settings and imports
%load_ext autoreload
%autoreload 2

from lib.utils import *
from click_analysis import process_img as process_click
from polygon_analysis import process_img as process_surface
from lib.utils import load_js, previous_next, interactive_plot, get_image
import pandas as pd
import matplotlib
from plotly import express as xp
import seaborn as sb
from matplotlib import pyplot as plt
import plotly.figure_factory as ff
import re
from collections import Counter
import plotly.express as px
matplotlib.rcParams['figure.figsize'] = (15, 12)

# Selection of the campaign

*BDApPv* consists of two distinct campaigns, run on two image datasets: one from Google (1) and one from [IGN](https://www.ign.fr/) (2).

Here we choose the campaign we want to work on :

In [2]:
CAMPAIGN="google"
#CAMPAIGN="ign"

# Load input data

The data for each campaign (metadata and annotations) is provided as a separate JSON file.

In [3]:
# Load JSON file
INPUT_FILE="data/raw/input-{campaign}.json".format(campaign=CAMPAIGN)
items = load_js(INPUT_FILE)

In [4]:
items_by_id = {item.id : item for item in items}

## Map of input data

Here we display the number of input images per department in France. Note that some of the input images are located in neighboring countries; these are excluded from the map.

In [5]:
# Extract the department code for items located in France
def dep(item):
    if not hasattr(item, "department") or "FRANCE" not in item.department:
        return None
    # Keep only the digits of the department field
    return re.sub(r"\D", "", item.department)

# Count images per department
deps_count = Counter(dep(item) for item in items if dep(item) is not None)

# Build dataframe
deps_count_df = pd.DataFrame(deps_count.items(), columns=["dep", "count"])

In [6]:
# Load the GeoJSON of French departments
import json
from urllib.request import urlopen
with urlopen('https://raw.githubusercontent.com/gregoiredavid/france-geojson/master/departements-version-simplifiee.geojson') as response:
    geojson = json.load(response)

In [None]:
# display number of input images per department
fig = px.choropleth(
    deps_count_df, 
    geojson=geojson, 
    locations="dep",
    projection="mercator",
    featureidkey="properties.code",
    color='count',  
    fitbounds="locations",
    color_continuous_scale="Viridis",
    width=700, height=700)
fig.show()

# Process

Each campaign consisted of two separate phases:
- Click on panels
- Draw polygons

In the code below, we demonstrate how the annotations of several users are aggregated for a single image. The same processes are also available as batch scripts that can be run on the whole database at once: see `click_analysis.py` and `polygon_analysis.py`.

## Phase 1 : clicks

In the first phase, users were asked to click on PV panels in images. We aggregate the locations of those clicks by summing density kernels centered on each click, then finding the local maxima above a given threshold.

Note that **several matches / points can be found** on each image with this method, corresponding to several PV panels.
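As an illustration only, the aggregation described above can be sketched with NumPy. This is a minimal sketch and not the code of `click_analysis.py`: the function name `aggregate_clicks`, the Gaussian kernel shape, the `sigma` parameter and the `(x, y)` input format are all assumptions made for the example.

```python
import numpy as np

def aggregate_clicks(clicks, shape, sigma=5.0, threshold=2.0):
    """Sum one Gaussian kernel per click and return local maxima above threshold.

    clicks: iterable of (x, y) pixel coordinates (hypothetical input format).
    shape: (height, width) of the image.
    Returns (density, matches), where matches is a list of (x, y, score).
    """
    height, width = shape
    ys, xs = np.mgrid[0:height, 0:width]
    density = np.zeros(shape, dtype=float)
    for cx, cy in clicks:
        # Unnormalised kernel (assumption): each click adds exactly 1.0 at its centre
        density += np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

    # A pixel is a local maximum if it is >= each of its 8 neighbours
    padded = np.pad(density, 1, mode="constant", constant_values=-np.inf)
    is_max = np.ones(shape, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            neighbour = padded[1 + dy:1 + dy + height, 1 + dx:1 + dx + width]
            is_max &= density >= neighbour

    # Keep only the maxima whose level reaches the threshold
    matches = [
        (int(x), int(y), float(density[y, x]))
        for y, x in zip(*np.nonzero(is_max & (density >= threshold)))
    ]
    return density, matches

# Three nearby clicks reinforce each other; the isolated click stays below 2.0
density, matches = aggregate_clicks(
    [(20, 20), (21, 20), (20, 21), (70, 50)], shape=(80, 100))
print(matches)
```

With an unnormalised kernel, the score of a match reads roughly as "how many users clicked near this spot", which is one way to interpret an absolute threshold such as 2.0.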

The code below provides an interactive process over the images.

The image on the left shows the clicks and kernels.

The image on the right is the one that was shown to users. When a match is found (a maximum above the threshold), a green cross is drawn on it. In the JSON output of the batch scripts, a score is attached to each match: the level of the corresponding maximum in the left image.

In [None]:
def item_f(item) :
    print("Id: ", item.id)
    process_click(item, display=True, campaign=CAMPAIGN, threshold=2.0)
    
previous_next(items, item_f)

## Phase 2 : polygons

In the second phase, users were asked to draw polygons around the panels in the images selected in phase 1.

We aggregate those polygons into a single image and apply a threshold to it (either an absolute count or a ratio of the number of users), thus generating a mask.

Finally, we apply [polygon detection](https://www.geeksforgeeks.org/python-detect-polygons-in-an-image-using-opencv/) (with [OpenCV](https://opencv.org/) library) on the resulting mask. 

Note that **several matches / polygons can be found** on each image with this method, corresponding to several PV panels.
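The aggregation and thresholding steps can be sketched as follows. This is a hedged illustration rather than the code of `polygon_analysis.py`: the helper names `rasterize` and `vote_mask`, the input format (a dict mapping a user id to a list of polygons), and the choice to count each user at most once per pixel are assumptions. The final polygon detection on the mask (done with OpenCV in the notebook) is left out.

```python
import numpy as np

def rasterize(polygon, shape):
    """Boolean mask of the pixels lying inside a polygon (even-odd rule).

    polygon: list of (x, y) vertices; shape: (height, width).
    """
    height, width = shape
    ys, xs = np.mgrid[0:height, 0:width]
    inside = np.zeros(shape, dtype=bool)
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if y1 == y2:
            continue  # a horizontal edge never crosses a horizontal ray
        # Does a ray cast to the right of the pixel cross this edge?
        crosses = (y1 > ys) != (y2 > ys)
        x_at_y = x1 + (ys - y1) * (x2 - x1) / (y2 - y1)
        inside ^= crosses & (xs < x_at_y)
    return inside

def vote_mask(polygons_by_user, shape, threshold=0.45):
    """Count, per pixel, how many users covered it, then threshold the ratio."""
    votes = np.zeros(shape, dtype=int)
    for polygons in polygons_by_user.values():
        user_mask = np.zeros(shape, dtype=bool)
        for polygon in polygons:
            user_mask |= rasterize(polygon, shape)
        votes += user_mask  # assumption: one vote per user and per pixel
    ratio = votes / max(len(polygons_by_user), 1)
    return votes, ratio >= threshold

# Two users roughly agree on a panel, a third one draws something else
polygons_by_user = {
    "user_a": [[(2, 2), (12, 2), (12, 12), (2, 12)]],
    "user_b": [[(4, 4), (14, 4), (14, 14), (4, 14)]],
    "user_c": [[(15, 15), (19, 15), (19, 19), (15, 19)]],
}
votes, mask = vote_mask(polygons_by_user, shape=(20, 20), threshold=0.45)
print(votes.max(), int(mask.sum()))
```

In this toy case only the area covered by two of the three users (ratio 2/3) passes the 0.45 threshold, while areas drawn by a single user (ratio 1/3) are discarded.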

The code below provides an interactive process over the images.

The image on the left shows the sum of polygons, before threshold, and the final polygons detected on the mask. In case of success, the detected polygons are drawn on the image on the right.

In [None]:
def item_f(item) :
    print("Id: ", item.id)
    process_surface(item, display=True, threshold=0.45, campaign=CAMPAIGN)

previous_next(items, item_f)

# Threshold analysis

In this section, we load batch results for both phases, computed with low thresholds (1.0), and we explore the best threshold to be chosen for each phase.

In [None]:
# First we gather general stats in a dictionary of ID => stats
stats = {
    item.id : dict( 
        id=item.id,
        nb_poly = len(item.polygons),
        nb_clicks = len(item.clicks),
        nb_poly_actors = len(set(poly.action.actorId for poly in item.polygons)))
    for item in items
}

## Clicks threshold

First, we load the click analysis output of a batch run done with a low threshold:

In [None]:
analysed_clicks = load_js("data/validation/campaign-%s/click-analysis-thres=1.0.json" % CAMPAIGN)
analysed_clicks_by_id = {res.id:res for res in analysed_clicks}

In [None]:
click_stats = [
    dict(
        id=clicks.id,
        click_idx= click_idx,
        score=click.score,
        nb_clicks=stats[clicks.id]["nb_clicks"]) 
            for clicks in analysed_clicks  
            for click_idx, click in enumerate(clicks.clicks)]
click_stats = pd.DataFrame(click_stats)
click_stats.head()
click_stats["relative_score"] = click_stats.score / click_stats.nb_clicks
click_stats

Then we display a scatter plot of the detected locations, placed on the following axes:
- **Score**: the score of the match, i.e. the absolute level of the detected maximum
- **Relative score**: the score divided by the number of clicks on the image

In [None]:
# Scatter plot
fig = xp.scatter(click_stats, y="score", x="relative_score")

# Add interactive preview of selected location
def show_item(row) :
    id = row["id"]
    res = analysed_clicks_by_id[id]
    
    process_click(
        items_by_id[id], 
        display=True, 
        clicks_to_draw=res.clicks, 
        selected_idx=row["click_idx"],
        campaign=CAMPAIGN)

print_html("<h3>Click on the points to show the corresponding image and location (highlighted in red)</h3>")

interactive_plot(click_stats, fig, show_item, event="click")

By exploring the results, we see that **2.0** seems to be a suitable **absolute threshold value** for both campaigns (Google and IGN) to exclude false positives.

## Polygons threshold

First, we load the output of a batch analysis done with a low threshold value.

In [None]:
result_polys = load_js("data/validation/campaign-%s/polygon-analysis-thres=1.0.json" % CAMPAIGN)
result_polys_by_id = {res.id:res for res in result_polys}

In [None]:
poly_stats = [
    dict(
        img_id= polys.id,
        poly_idx= poly_idx,
        score=poly.score,
        nb_actors=stats[polys.id]["nb_poly_actors"]) 
            for polys in result_polys 
            for poly_idx, poly in enumerate(polys.polygons)]
poly_stats = pd.DataFrame(poly_stats)
poly_stats.head()
poly_stats["relative_score"] = poly_stats.score / poly_stats.nb_actors
poly_stats

Then we display a scatter plot of the identified panels, placed on the following axes:
- **Score**: the score of the polygon, i.e. the sum value in this polygon
- **Relative score**: the score divided by the number of actors (between 0 and 1)

In [None]:
# Scatter plot
fig = xp.scatter(poly_stats, y="score", x="relative_score")


# Add interactive preview of selected panel
def show_item(row) :
    id = row["img_id"]
    print(id)
    res = result_polys_by_id[id]
    process_surface(
        items_by_id[id], 
        display=True, 
        polys_to_draw=res.polygons, 
        selected_idx=row["poly_idx"],
        campaign=CAMPAIGN)

print_html("<h3>Click on the points to show the corresponding image and panel (highlighted in red)</h3>")
interactive_plot(poly_stats, fig, show_item, event="click")

By exploring the data, we find that **0.45** is a good relative threshold for polygon detection: it excludes false positives (windows, swimming pools, etc.).
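To make the effect of a relative threshold concrete, here is a small sketch that filters a table shaped like `poly_stats`. The rows of `toy_stats` are made-up values for illustration and are not taken from the campaigns; `RELATIVE_THRESHOLD` is a name introduced for the example.

```python
import pandas as pd

# Hypothetical rows mimicking the columns of poly_stats (values are made up)
toy_stats = pd.DataFrame({
    "img_id": ["a", "a", "b", "c"],
    "poly_idx": [0, 1, 0, 0],
    "score": [4.0, 1.0, 3.0, 2.0],
    "nb_actors": [5, 5, 6, 8],
})
toy_stats["relative_score"] = toy_stats.score / toy_stats.nb_actors

# Keep only the polygons whose relative score reaches the threshold
RELATIVE_THRESHOLD = 0.45
kept = toy_stats[toy_stats.relative_score >= RELATIVE_THRESHOLD]
print(kept[["img_id", "poly_idx", "relative_score"]])
```

A relative threshold compares images annotated by different numbers of users on the same scale, which an absolute score cannot do.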