# Dataset Comparison & Validation 

This notebook compares three datasets of solar power plants in China to assess their consistency, coverage, and accuracy, and to evaluate the different methods used to detect or map solar installations.

The comparison process includes:
- General statistics: spatial coverage, plus province-level and region-level distribution counts
- Visual inspection using map-based plots
- Matching quality assessment between datasets

Our goal is to assess how many installations each dataset detects and how they are distributed across provinces, and to identify potential discrepancies or complementarities between the datasets.
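
To make the matching quality assessment more concrete, here is a minimal sketch of one possible approach. It assumes each installation is reduced to a single (latitude, longitude) point and that two installations match when they lie within a distance threshold. The helper names `haversine_km` and `match_rate`, the 1 km threshold, and the toy coordinates are all hypothetical and are not taken from any of the datasets.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def match_rate(reference, candidates, max_km=1.0):
    """Share of reference points with at least one candidate within max_km.

    The 1 km default is an illustrative assumption, not a tuned value.
    """
    if not reference:
        return 0.0
    matched = sum(
        any(haversine_km(lat, lon, clat, clon) <= max_km for clat, clon in candidates)
        for lat, lon in reference
    )
    return matched / len(reference)

# Toy coordinates for illustration only (not real installations)
ref_points = [(36.10, 100.60), (40.00, 95.00)]
cand_points = [(36.101, 100.601), (25.00, 110.00)]
rate = match_rate(ref_points, cand_points)
print(f"Share of reference points matched: {rate:.0%}")
```

This brute-force comparison is quadratic in the number of points, so for full datasets a spatial index or a GeoPandas spatial join would likely be a better fit. For polygon datasets, an overlap-based criterion may also be more appropriate than point distance.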

# Setup

In [None]:
import pandas as pd 
import geopandas as gpd
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Datasets 

We'll compare the following datasets:
- [GEM](https://globalenergymonitor.org/): Manual compilation of solar power plant data from public sources, reports, and satellite imagery, curated by the Global Energy Monitor (GEM) with geolocation and attribute verification.
- [PV_China_2020 (Zhang et al., 2022)](https://doi.org/10.5194/essd-14-3743-2022): Pixel-based classification of Landsat-8 imagery using a Random Forest model in Google Earth Engine, with training data from crowdsourced and manually labeled PV/non-PV regions, followed by morphological filtering and polygon vectorization.
- [ChinaPV_Vectorized_2020 (Liu et al., 2024)](https://doi.org/10.1038/s41597-024-04356-z): Random Forest classification on cloud-filtered Landsat-8 imagery enriched with texture features, refined by morphological operations and manual correction using Google Earth Pro to produce high-accuracy polygon vectors of PV installations.

We focus on the year 2020 because it is the common reference year across all three datasets. Aligning on a single year keeps differences between the datasets from being confounded with installations built at different times.
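
As an illustration of the temporal alignment, the sketch below restricts a tabular dataset to plants operating by 2020 and then counts them per province. It uses a toy DataFrame. The column names `name`, `province`, and `start_year` and the helper `operating_by` are assumptions made for this example, so the real files may use different field names.

```python
import pandas as pd

# Toy stand-in for a plant-level table; values and column names are hypothetical.
plants = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "province": ["Qinghai", "Qinghai", "Gansu", "Gansu"],
    "start_year": [2018, 2021, 2020, None],
})

def operating_by(df, year, year_col="start_year"):
    """Keep rows whose start year is known and not later than `year`."""
    years = pd.to_numeric(df[year_col], errors="coerce")  # non-numeric -> NaN
    return df[years.le(year)].copy()  # NaN compares as False, so it is dropped

plants_2020 = operating_by(plants, 2020)
counts = plants_2020.groupby("province").size()
print(counts)
```

Dropping rows with an unknown start year is a judgment call. Depending on the analysis, it may be preferable to keep them and flag them instead.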