# Working with External Datasets (Vector Files)

The DE Africa Sandbox allows users to bring external data, such as shapefiles and GeoJSON files, into their analyses.

This tutorial will take you through:

1. Importing the required packages
2. Setting the path for the vector file
3. Loading the external dataset
4. Displaying the dataset on a basemap
5. Loading the satellite imagery using the extent of the external dataset
6. Masking the area of interest in the satellite imagery using the external dataset

For this tutorial, the external dataset is in shapefile format.

## Set up notebook

In your **Training folder**, create a new Python 3 notebook. Name it `external_dataset.ipynb`. For more instructions on creating a new notebook, see the [instructions from Session 2](../session_2/04_load_data_exercise.ipynb#Make-a-new-notebook).

### Load packages and functions

In the first cell, type the following code and then run the cell to import the necessary Python dependencies.

    import sys
    import datacube
    import numpy as np
    import pandas as pd
    import geopandas as gpd
    
    from datacube.utils import geometry

    sys.path.append('../Scripts')
    from deafrica_datahandling import load_ard, mostcommon_crs
    from deafrica_plotting import map_shapefile, rgb
    from deafrica_spatialtools import xr_rasterize

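Take note of the following imports from the code above. They are the ones used to read the vector file, convert it to a datacube geometry, display it on a basemap, and rasterise it.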

    import geopandas as gpd
    from datacube.utils import geometry
    from deafrica_plotting import map_shapefile
    from deafrica_spatialtools import xr_rasterize

### Connect to the datacube

Enter the following code and run the cell to create our `dc` object, which provides access to the datacube.

    dc = datacube.Datacube(app='import_dataset')

1. Create a folder called `data` in the Training directory.
2. Download the [zip file](_static/external_dataset/reserve.zip) and extract it on your local machine.
3. Copy the files that make up the reserve shapefile (`.cpg`, `.dbf`, `.shp`, `.shx`) into the `data` folder.

Create a variable called `shapefile_path` to store the path of the shapefile, as shown below.

    shapefile_path = "data/reserve.shp"
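
A shapefile is made up of several component files that must sit next to each other, so a quick check before loading can save a confusing error later. The sketch below uses a hypothetical helper, `missing_shapefile_parts`, which is not part of the DE Africa scripts, and assumes the four extensions listed above are the ones you copied into the `data` folder.

```python
from pathlib import Path

# Extensions of the component files copied into the data folder.
EXPECTED_EXTENSIONS = (".shp", ".shx", ".dbf", ".cpg")


def missing_shapefile_parts(shapefile_path, extensions=EXPECTED_EXTENSIONS):
    """Return the expected component files that are not on disk.

    Hypothetical helper, for illustration only.
    """
    base = Path(shapefile_path)
    return [
        str(base.with_suffix(ext))
        for ext in extensions
        if not base.with_suffix(ext).exists()
    ]


missing = missing_shapefile_parts("data/reserve.shp")
if missing:
    print("Missing files:", missing)
else:
    print("All shapefile components found.")
```

If any files are reported as missing, copy them into the `data` folder again before continuing.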

Read the shapefile into a GeoDataFrame using the `gpd.read_file` function.

    gdf = gpd.read_file(shapefile_path)

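Convert all of the shapes into a single datacube geometry using `geometry.Geometry`. Here, `gdf.unary_union` merges every shape in the GeoDataFrame into one, and `gdf.crs` supplies its coordinate reference system.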

    geom = geometry.Geometry(gdf.unary_union, gdf.crs)

Use the `map_shapefile` function to display the shapefile on a basemap.

    map_shapefile(gdf, attribute=gdf.columns[0], fillOpacity=0, weight=2)

Create a query object. This time, we will replace `x` and `y` with `geopolygon`.

Previously, the query object looked like this:

    query = {
        'x': x,
        'y': y,
        'group_by': 'solar_day',
        'time': ('2019-01-15'),
        'resolution': (-10, 10),
    }

Update it to use `geopolygon` instead:

    query = {
        'geopolygon': geom,
        'group_by': 'solar_day',
        'time': ('2019-01-15'),
        'resolution': (-10, 10),
    }
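
The extent of a polygon is its bounding box, which plays the role that the hand-typed `x` and `y` ranges played in the earlier query. The pure-Python sketch below illustrates the idea with a made-up polygon. The coordinates and the `bounding_extent` helper are hypothetical; they are not taken from `reserve.shp` or from the datacube API.

```python
def bounding_extent(coords):
    """Return (x_range, y_range) covering a list of (x, y) vertices."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), max(xs)), (min(ys), max(ys))


# Hypothetical polygon vertices in (longitude, latitude) order.
example_polygon = [(30.0, -2.0), (30.5, -2.1), (30.4, -2.6), (29.9, -2.4)]

x_range, y_range = bounding_extent(example_polygon)
print("x:", x_range)  # (29.9, 30.5)
print("y:", y_range)  # (-2.6, -2.0)
```

Because a bounding box is a rectangle, it generally covers some pixels that fall outside the polygon itself, which is why the masking step later in this tutorial is needed.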


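Identify the most common projection system in the input query, then use it as the output projection when loading the data with `load_ard`.

    # Find the most common CRS among the datasets matched by the query
    output_crs = mostcommon_crs(dc=dc, product='s2_l2a', query=query)

    # Load the red, green and blue bands, keeping only observations
    # with at least 95% good-quality pixels
    ds = load_ard(dc=dc,
                  products=['s2_l2a'],
                  output_crs=output_crs,
                  measurements=["red", "green", "blue"],
                  min_gooddata=0.95,
                  **query
                 )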

Print the `ds` result.

    ds

Display the loaded imagery as a true-colour image using the `rgb` function.
    
    rgb(ds)

Convert the shapefile to a raster using the `xr_rasterize` function.
    
    mask = xr_rasterize(gdf, ds)
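
To build intuition for what rasterising a polygon means, the sketch below marks each cell of a small grid as `True` when its centre falls inside a polygon, using the standard ray-casting test. This is an illustration only and is not how `xr_rasterize` is implemented. The `point_in_polygon` and `rasterize_polygon` helpers and the triangle coordinates are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize_polygon(polygon, width, height):
    """Return a height x width grid of booleans, testing each cell centre."""
    return [
        [point_in_polygon(col + 0.5, row + 0.5, polygon) for col in range(width)]
        for row in range(height)
    ]


# Hypothetical triangle laid over a 4 x 4 grid of unit cells.
triangle = [(0, 0), (4.5, 0), (0, 4.5)]
mask = rasterize_polygon(triangle, width=4, height=4)

# Print the mask: '#' marks cells inside the polygon, '.' cells outside.
for row in mask:
    print("".join("#" if cell else "." for cell in row))
```

The result is a grid with the same shape as the raster, which is what makes it usable as a mask for the satellite data.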

Mask the dataset using `ds.where` to set pixels outside the polygon to `NaN`.

    ds = ds.where(mask)
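
The effect of `where` is easiest to see on a small array. The sketch below uses NumPy's `np.where` on made-up values as a stand-in for the xarray call. The `data` and `mask` arrays are hypothetical and are not derived from the reserve shapefile.

```python
import numpy as np

# Hypothetical 3 x 3 band of pixel values.
data = np.array([[10.0, 20.0, 30.0],
                 [40.0, 50.0, 60.0],
                 [70.0, 80.0, 90.0]])

# Boolean mask: True for pixels inside the polygon.
mask = np.array([[True, True, False],
                 [True, False, False],
                 [False, False, False]])

# Keep values where the mask is True; replace everything else with NaN.
masked = np.where(mask, data, np.nan)
print(masked)
```

`NaN` can only be stored in floating-point arrays, so integer bands become floats when they are masked this way.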

Display the results of the masked area.

    rgb(ds)