# Importing and cleaning training data over Africa

In [1]:
import sys
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
sys.path.append('../Scripts')
from deafrica_plotting import map_shapefile

## GFSAD Training data

Reference data generated for creating global cropland extent maps. The classification method used for Africa is described [here](https://www.mdpi.com/2072-4292/9/10/1065). The data were downloaded from: https://web.croplands.org/app/data/search?page=1&page_size=200

Definition of cropland:
- “…lands cultivated with plants harvested for food, feed, and fiber, include both seasonal crops (e.g., wheat, rice, corn, soybeans, cotton) and continuous plantations (e.g., coffee, tea, rubber, cocoa, oil palms). Cropland fallow are lands uncultivated during a season or a year but are farmlands and are equipped for cultivation, including plantations (Teluguntla et al., 2015). Cropland extent includes all planted crops and fallow lands. Non-croplands include all other land cover classes other than croplands and cropland fallow.”


In [25]:
file = "data/training_data/GFSAD_training_data.csv"
df = pd.read_csv(
    file, delimiter=",")
df.head()

Unnamed: 0,id,year,month,lat,lon,country,land_use_type,crop_primary,crop_secondary,water,intensity,source_type,source_class,source_description,use_validation
0,165750,2015,7,34.081668,68.181839,Afghanistan,1,0,0,0,0,derived,,labeled_vhri,False
1,141512,2015,4,32.180843,62.988052,Afghanistan,1,0,0,0,0,derived,,labeled_vhri,False
2,137125,2016,1,41.027053,19.620895,Albania,1,0,0,0,0,derived,,labeled_vhri,False
3,164279,2016,7,36.194971,0.466232,Algeria,1,0,0,0,0,derived,,labeled_vhri,False
4,156753,2015,9,35.508195,1.856003,Algeria,1,0,0,0,0,derived,,labeled_vhri,False


In [26]:
# Build point geometries from the lon/lat columns (x = lon, y = lat)
GFSAD_train = gpd.GeoDataFrame(
    df.drop(['lon', 'lat'], axis=1),
    crs='epsg:4326',
    geometry=gpd.points_from_xy(df.lon, df.lat))
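
The geometry above is built in (x, y) order, so longitude has to come before latitude. A minimal sketch of a sanity check that could be run before building the points, using the five rows shown in the `df.head()` output. The helper name `valid_lat_lon` is hypothetical, and a range check like this only catches swapped columns when a value falls outside ±90, so it is a first filter rather than a guarantee.

```python
# (lat, lon) pairs copied from the df.head() output above.
rows = [
    (34.081668, 68.181839),
    (32.180843, 62.988052),
    (41.027053, 19.620895),
    (36.194971, 0.466232),
    (35.508195, 1.856003),
]


def valid_lat_lon(lat, lon):
    """Return True if the pair lies inside the valid WGS84 ranges."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


# Collect any rows whose coordinates fall outside the valid ranges.
bad_rows = [(lat, lon) for lat, lon in rows if not valid_lat_lon(lat, lon)]
print(f"{len(bad_rows)} of {len(rows)} rows have out-of-range coordinates")

# Points are (x, y), so longitude comes first when building geometries.
xy_pairs = [(lon, lat) for lat, lon in rows]
```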

In [27]:
afr = gpd.read_file('data/african_countries.shp')

In [28]:
GFSAD_train_afr = gpd.overlay(GFSAD_train, afr, how='intersection')
# GFSAD_train_afr.plot()
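
For point-in-polygon filtering, a spatial join is a common alternative to `gpd.overlay`. The sketch below is not the method used in this notebook. It uses a synthetic square polygon and three made-up points, with hypothetical column names, and assumes a geopandas version that accepts the `predicate` keyword (0.10 or later).

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical stand-in for the country polygons: a single square.
countries = gpd.GeoDataFrame(
    {"country_name": ["Squareland"]},
    geometry=[box(0, 0, 10, 10)],
    crs="EPSG:4326",
)

# Three made-up training points; the second one lies outside the square.
points = gpd.GeoDataFrame(
    {"id": [1, 2, 3]},
    geometry=[Point(5, 5), Point(20, 20), Point(1, 9)],
    crs="EPSG:4326",
)

# Keep only points that fall within a polygon, attaching the polygon attributes.
inside = gpd.sjoin(points, countries, how="inner", predicate="within")
print(inside[["id", "country_name"]])
```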

In [29]:
GFSAD_train_afr['class'] = GFSAD_train_afr['land_use_type']

In [31]:
# GFSAD_train_afr

In [33]:
map_shapefile(GFSAD_train_afr, attribute='class')

In [34]:
GFSAD_train_afr.to_file("data/training_data/GFSAD_training_Africa.shp")
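
The shapefile format stores attributes in a dBASE table, which limits field names to 10 characters, so several of the column names above will be shortened on export. A minimal sketch for spotting the affected columns beforehand: it assumes simple truncation to the first 10 characters (the actual writer may rename fields differently) and only covers the columns visible in the `df.head()` output plus `class`, not those added from `african_countries.shp`.

```python
# Column names from the df.head() output (minus lon/lat), plus the added 'class'.
columns = [
    "Unnamed: 0", "id", "year", "month", "country", "land_use_type",
    "crop_primary", "crop_secondary", "water", "intensity", "source_type",
    "source_class", "source_description", "use_validation", "class",
]

DBF_MAX_FIELD_LEN = 10  # dBASE field-name limit used by the shapefile format

# Naive truncation (assumption): keep the first 10 characters of long names.
truncated = {
    name: name[:DBF_MAX_FIELD_LEN]
    for name in columns
    if len(name) > DBF_MAX_FIELD_LEN
}
for original, short in truncated.items():
    print(f"{original} -> {short}")

# Flag shortened names that would collide with one another.
short_names = [name[:DBF_MAX_FIELD_LEN] for name in columns]
collisions = {s for s in short_names if short_names.count(s) > 1}
print("collisions:", collisions)
```

Renaming the long columns before calling `to_file`, or writing to a format without this limit, avoids relying on automatic truncation.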

## [Bayas et al. (2017)](https://www.nature.com/articles/sdata2017136) Global Crop Reference Dataset 

Collected in September 2016 using Geo-Wiki.
Reference data are available from:
- Any 30m cell classified as crop: http://store.pangaea.de/Publications/See_2017/crop_all.zip
- Control dataset, validated cells classified as crop: http://store.pangaea.de/Publications/See_2017/crop_con.zip 

Definition of cropland:
- "...the definition used for the campaign follows that of GEOGLAM/JECAM.  The annual cropland from a remote sensing perspective is a piece of land of a minimum of 0.25 ha (minimum width of 30 m) that is sowed/planted and harvestable at least once within the 12 months after the sowing/planting date. The annual cropland produces an herbaceous cover and is sometimes combined with some tree or woody vegetation’. According to this GEOGLAM/JECAM definition, perennial crops, agroforestry plantations, palm oil, coffee, tree crops and fallows are not included in the cropland class"

The dataset contains only 'cropland' points and no other land cover classes. As it contains nearly 120,000 points, it's probably best to randomly subsample it, e.g. with `df.sample(n=2000)`.
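
A minimal sketch of such a subsample on a synthetic table of the same size. The frame `df_demo` and its values are made up for illustration, and the seed value is arbitrary. Passing `random_state` makes the draw repeatable, which helps if the training set needs to be regenerated later.

```python
import pandas as pd

# Synthetic stand-in for the ~120,000-row reference table (values are made up).
n_rows = 120_000
df_demo = pd.DataFrame({
    "locationid": range(n_rows),
    "class": 1,
})

# random_state fixes the seed so the same 2000 rows are drawn on every run.
# Sampling is without replacement by default, so no row is drawn twice.
subset = df_demo.sample(n=2000, random_state=42)
print(len(subset), subset["locationid"].is_unique)
```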


In [8]:
file = "data/training_data/global_crop_reference_dataset_control.csv"
df = pd.read_csv(
    file, delimiter=",")
df.head()

Unnamed: 0,locationid,userid,centroid_X,centroid_Y
0,1642116,222222,-1.25119,52.952381
1,1642116,222222,-1.250595,52.952381
2,1642116,222222,-1.25119,52.952976
3,1642116,222222,-1.250595,52.952976
4,1642116,222222,-1.25119,52.953571


In [9]:
# Build point geometries from the centroid columns (x = centroid_X, y = centroid_Y)
crop_train = gpd.GeoDataFrame(
    df.drop(['centroid_X', 'centroid_Y'], axis=1),
    crs='epsg:4326',
    geometry=gpd.points_from_xy(df.centroid_X, df.centroid_Y))

In [10]:
afr = gpd.read_file('data/african_countries.shp')

In [11]:
crop_train_afr = gpd.overlay(crop_train, afr, how='intersection')

In [15]:
crop_train_afr['class'] = 1

In [23]:
# map_shapefile(crop_train_afr, attribute='ID') #can't plot all points (120,000!)

In [22]:
crop_train_afr.to_file("data/training_data/globalCropRefernceData_Africa_2016_control.shp")

## CrowdVal project data

Collected using Geo-Wiki by/for the ESA CCI Land Cover Team to assist in validating their prototype 20 m Sentinel-2A land cover product.
Data available from here: https://geo-wiki.org/Application/index.php