This notebook explains the initial data management steps.  Functions used in these steps are in the file_management module ('utils/file_management.py'), imported with 'from utils import file_management'.

Step 1a) Documents the data used in the project.  ScienceBase is used to document source data and retrieval metadata.  Where permitted, data were downloaded using Python code.  In some cases, data had to be requested and then delivered.

Step 1b) Documents data management steps and concepts to be considered before processing starts.

Step 1c) Creates a file that documents information about TIFF files.  This file is used to control processing methods.

Step 2) Creates serialized versions of HydroSHEDS basin data for quick retrieval of information in processing steps.

## Step 1a: Download Source Data
      
#### Standard HydroBASINS from HydroSHEDS: 
    * Store src HydroSHEDS data in repo under 'data/HydroSHEDS'
    * Individual file names need to remain as downloaded
    * Structure under this folder does not matter
    * See more information and download directions here: Coming Soon
    
#### Landscape Variables (var files):
    * Store src landscape variable (var) files in repo under 'data/var'
    * Structure under this folder does not matter
    * See more information and download directions for var files used in this assessment here: Coming Soon
    
#### Source File Information:
    * User created csv file containing processing information about each var file
    * For ease of use name and store file in the repo as: 'data/var/file_processing_info.csv'
    * Required fields for each file include: file_name, variable, src_short, summary_type, label 
    * File name must match the corresponding var file name; a shapefile can be represented by its .shp file alone
    * See more information and download directions here: Coming Soon
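
Because later steps depend on every required field being filled in, it can help to check the csv before processing starts. The sketch below is one possible check using only the standard library. The function name `find_incomplete_rows` is hypothetical and is not part of 'utils/file_management.py'; the required fields are the ones listed above.

```python
import csv
import io

# Required fields as listed above
REQUIRED_FIELDS = ['file_name', 'variable', 'src_short', 'summary_type', 'label']

def find_incomplete_rows(csv_file):
    """Return (file_name, missing_fields) pairs for rows lacking a required value.

    csv_file is any open text file object, for example
    open('data/var/file_processing_info.csv', newline='').
    Hypothetical helper, for illustration only.
    """
    incomplete = []
    for row in csv.DictReader(csv_file):
        # A field counts as missing when it is absent, empty, or only whitespace
        missing = [f for f in REQUIRED_FIELDS if not (row.get(f) or '').strip()]
        if missing:
            incomplete.append((row.get('file_name') or '', missing))
    return incomplete

# Small in-memory example; the second row has no label
example = io.StringIO(
    'file_name,variable,src_short,summary_type,label\n'
    'a.tif,tree-coverfraction-layer,lc100_epoch2015_v2.0.1,count mean nodata,lc2015_tree_cov\n'
    'b.tif,moss-coverfraction-layer,lc100_epoch2015_v2.0.1,count mean nodata,\n'
)
print(find_incomplete_rows(example))
```

An empty list means every row has a value for each required field.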

In [1]:
#Example showing a list of URLs with files for Land Cover data
#Note: downloading the files themselves with Python was unreliable because the server kept cutting off access
import urllib.request as url_r

target_url = 'https://s3-eu-west-1.amazonaws.com/vito-lcv/2015/ZIPfiles/manifest_cgls_lc_v2_100m_global_2015.txt'

#Read the manifest (one URL per line) and split it into a list of URLs
#Any download error is left to propagate so the original traceback is kept
with url_r.urlopen(target_url) as response:
    download_urls_str = response.read().decode('utf8')
list_urls = download_urls_str.splitlines()

In [2]:
#Shows first 3 urls
list_urls[0:3]

['https://s3-eu-west-1.amazonaws.com/vito-lcv/2015/ZIPfiles/E000N00_ProbaV_LC100_epoch2015_global_v2.0.1_products_EPSG-4326.zip',
 'https://s3-eu-west-1.amazonaws.com/vito-lcv/2015/ZIPfiles/E000N20_ProbaV_LC100_epoch2015_global_v2.0.1_products_EPSG-4326.zip',
 'https://s3-eu-west-1.amazonaws.com/vito-lcv/2015/ZIPfiles/E000N40_ProbaV_LC100_epoch2015_global_v2.0.1_products_EPSG-4326.zip']
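
Since the server tended to cut off access, one option is to retry a failed request a few times before giving up. The sketch below is an assumption about how that could be done and is not the method used in this project. `fetch_with_retries` and its parameters are hypothetical names, and the `opener` argument exists only so the retry logic can be tried without network access.

```python
import time
import urllib.request

def fetch_with_retries(url, attempts=3, wait_seconds=5, opener=urllib.request.urlopen):
    """Return the bytes at url, retrying when the connection fails.

    Hypothetical helper, for illustration only.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url) as response:
                return response.read()
        except OSError as error:  # urllib.error.URLError is a subclass of OSError
            last_error = error
            if attempt < attempts:
                # Pause before the next try to give the server time to recover
                time.sleep(wait_seconds)
    raise last_error

# Usage, assuming list_urls from the cell above:
# zip_bytes = fetch_with_retries(list_urls[0])
```

Errors that are not subclasses of OSError are not retried here; whether that is the right set of errors to catch depends on how the server fails.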

==================================================================================================================
******************************************************************************************************************
==================================================================================================================

## Step 1b: Create intermediate files containing data needed for processing steps

### A) Understand and modify initial data as needed
    * All source data are currently in the same coordinate system, and no coordinate system tests have been built in yet.  The user must ensure data are in WGS84 ({'init':'EPSG:4326'}).
    * The nodata value must be changed on some grids before the attribution process can be completed.  The cause is not yet known, as the same nodata value is used successfully by the methods for other grids.  Example grids that needed this step include Pasture2000_5m.tif and Cropland2000_5m.tif.
    

In [3]:
#Example of how to verify coordinate system

#For a raster, open the file with rasterio and read the crs attribute to get coordinate system information
import rasterio
file_name = 'data/var/CroplandPastureArea2000_Geotiff/Pasture2000_5m.tif'
with rasterio.open(file_name) as data:
    coordinate_system = data.crs
print(f'Raster coordinate system {coordinate_system}')

#For a shapefile, read the data with geopandas and read the crs attribute of the GeoDataFrame
import geopandas as gpd
file_name = 'data/HydroSHEDS/basins/hybas_ar_lev12_v1c/hybas_ar_lev12_v1c.shp'
gdf = gpd.read_file(file_name)
gdf_coordinate_system = gdf.crs
print(f'Shapefile coordinate system {gdf_coordinate_system}')

Raster coordinate system EPSG:4326
Shapefile coordinate system {'init': 'epsg:4326'}
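
The two outputs above describe the same coordinate system in different forms (a string and a dictionary), so a direct comparison between them would fail. As a sketch of the kind of test that is not yet built in, the helper below accepts either form. `is_epsg_4326` is a hypothetical name and is not part of 'utils/file_management.py'.

```python
def is_epsg_4326(crs):
    """Return True if crs identifies EPSG:4326.

    Accepts the string form ('EPSG:4326'), the dictionary form
    ({'init': 'epsg:4326'}), or any object whose str() gives the string form.
    Hypothetical helper, for illustration only.
    """
    if isinstance(crs, dict):
        crs = crs.get('init', '')
    return str(crs).strip().lower() == 'epsg:4326'

print(is_epsg_4326('EPSG:4326'))            # True
print(is_epsg_4326({'init': 'epsg:4326'}))  # True
print(is_epsg_4326('EPSG:3857'))            # False
```

A check like this only compares identifiers; it does not confirm that the coordinates stored in a file actually match the declared system.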


In [None]:
#Example of reassigning nodata value for the Pasture2000_5m grid

import rasterio
import numpy as np

src_file = 'data/var/CroplandPastureArea2000_Geotiff/Pasture2000_5m.tif'
out_file = 'data/var/CroplandPastureArea2000_Geotiff/Pasture2000_5m_update_nodata999.tif'

#Read all bands and copy the profile (metadata) of the original file, closing the file afterwards
with rasterio.open(src_file) as data:
    init_data = data.read()
    profile = data.profile

#In this case the initial nodata value is the only negative number, so replace it with the unique value of 999
new_data = np.where(init_data < 0, 999, init_data)

#Reuse the original profile, updated to declare 999 as the nodata value, when writing the new file
profile.update(nodata=999)
with rasterio.open(out_file, 'w', **profile) as dst:
    dst.write(new_data)
    


==================================================================================================================
******************************************************************************************************************
==================================================================================================================

## Step 1c: Create a file documenting landscape variable files that need processing

This file provides the information that methods need to handle the files in summarization steps.

    * Creates file: f'{directory}/{short_name}_file_info.json'  -> example   data/var/tif_vars_file_info.json
    
    * The example below creates the json file 'data/var/tif_vars_file_info.json'.  The information in this file is used in later steps to inform how tif data from our source landscape variable datasets (stored in the data/var directory in the repository) are attributed to basins in HydroSHEDS HydroBASINS.
    
    * Currently only gridded data (such as .tif) use this system, but we hope to include a similar process for vector data

In [4]:
from utils import file_management as f_mng

directory = 'data/var'
file_type = 'tif'
extension = f'.{file_type}'

short_name = f'{file_type}_vars'
file_list, directory = f_mng.find_files(directory, suffix=extension)

#convert file information to dictionary and export to json
file_info, missing_info = f_mng.store_file_info(file_list, directory, short_name)



In [5]:
#Show information about 2 files that were documented in the JSON
file_info[0:2]

[{'file_name': 'W120N80_ProbaV_LC100_epoch2015_global_v2.0.1_water-seasonal-coverfraction-layer_EPSG-4326.tif',
  'file_path': 'data/var/LC100_epoch2015/W120N80_ProbaV_LC100_epoch2015_global_v2.0/W120N80_ProbaV_LC100_epoch2015_global_v2.0.1_water-seasonal-coverfraction-layer_EPSG-4326.tif',
  'bounds': {'xmin': -120.00049603174602,
   'xmax': -100.0004960317461,
   'ymin': 60.00049603174611,
   'ymax': 80.00049603174604},
  'no_data_val': 255.0,
  'pixel_size': 0.0009920634920634888,
  'crs': 'EPSG:4326',
  'variable': 'water-seasonal-coverfraction-layer',
  'src_short': 'lc100_epoch2015_v2.0.1',
  'summary_type': 'count mean nodata',
  'label': 'lc2015_waterseas_cov',
  'categorical': 'no',
  'pixel_inclusion': 'centroid'},
 {'file_name': 'W120N80_ProbaV_LC100_epoch2015_global_v2.0.1_moss-coverfraction-layer_EPSG-4326.tif',
  'file_path': 'data/var/LC100_epoch2015/W120N80_ProbaV_LC100_epoch2015_global_v2.0/W120N80_ProbaV_LC100_epoch2015_global_v2.0.1_moss-coverfraction-layer_EPSG-4326

______________________________________________________________________________________________
#### Below is a count of the number of .tif files ready to be processed and a view of the information for the first 5 files.

Note: File processing information is available in two formats.  The function passes back a list of dictionaries.  Each dictionary stores information about a file.  The same information is also saved to disk as 'data/var/tif_vars_file_info.json', for processing in later sessions.  Here is an example of printing the number of files processed and showing the first 5 records using the returned variable file_info.  

In [6]:
print (f'{len(file_info)} files are documented and ready to process \n')

import pandas as pd
pd.DataFrame(file_info[0:5])

940 files are documented and ready to process 



Unnamed: 0,bounds,categorical,crs,file_name,file_path,label,no_data_val,pixel_inclusion,pixel_size,src_short,summary_type,variable
0,"{'xmin': -120.00049603174602, 'xmax': -100.000...",no,EPSG:4326,W120N80_ProbaV_LC100_epoch2015_global_v2.0.1_w...,data/var/LC100_epoch2015/W120N80_ProbaV_LC100_...,lc2015_waterseas_cov,255.0,centroid,0.000992,lc100_epoch2015_v2.0.1,count mean nodata,water-seasonal-coverfraction-layer
1,"{'xmin': -120.00049603174602, 'xmax': -100.000...",no,EPSG:4326,W120N80_ProbaV_LC100_epoch2015_global_v2.0.1_m...,data/var/LC100_epoch2015/W120N80_ProbaV_LC100_...,lc2015_moss_cov,255.0,centroid,0.000992,lc100_epoch2015_v2.0.1,count mean nodata,moss-coverfraction-layer
2,"{'xmin': -120.00049603174602, 'xmax': -100.000...",no,EPSG:4326,W120N80_ProbaV_LC100_epoch2015_global_v2.0.1_c...,data/var/LC100_epoch2015/W120N80_ProbaV_LC100_...,lc2015_crops_cov,255.0,centroid,0.000992,lc100_epoch2015_v2.0.1,count mean nodata,crops-coverfraction-layer
3,"{'xmin': -120.00049603174602, 'xmax': -100.000...",no,EPSG:4326,W120N80_ProbaV_LC100_epoch2015_global_v2.0.1_t...,data/var/LC100_epoch2015/W120N80_ProbaV_LC100_...,lc2015_tree_cov,255.0,centroid,0.000992,lc100_epoch2015_v2.0.1,count mean nodata,tree-coverfraction-layer
4,"{'xmin': -120.00049603174602, 'xmax': -100.000...",no,EPSG:4326,W120N80_ProbaV_LC100_epoch2015_global_v2.0.1_g...,data/var/LC100_epoch2015/W120N80_ProbaV_LC100_...,lc2015_grass_cov,255.0,centroid,0.000992,lc100_epoch2015_v2.0.1,count mean nodata,grass-coverfraction-layer
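
One way the `bounds` field could be used is to skip tiles that cannot intersect the area being processed. The sketch below is an assumption about such a use, not a description of how later steps work. `bounds_overlap` is a hypothetical name and the area of interest is a made-up extent; the tile bounds are copied from the records shown above.

```python
def bounds_overlap(a, b):
    """Return True if two bounds dictionaries (xmin, xmax, ymin, ymax) overlap.

    Extents that only touch along an edge are treated as not overlapping.
    Hypothetical helper, for illustration only.
    """
    return (a['xmin'] < b['xmax'] and b['xmin'] < a['xmax']
            and a['ymin'] < b['ymax'] and b['ymin'] < a['ymax'])

# Bounds of the W120N80 tile as documented in file_info
tile = {'xmin': -120.00049603174602, 'xmax': -100.0004960317461,
        'ymin': 60.00049603174611, 'ymax': 80.00049603174604}

# Made-up extents used only to exercise the function
inside_tile = {'xmin': -110.0, 'xmax': -105.0, 'ymin': 65.0, 'ymax': 70.0}
far_away = {'xmin': 10.0, 'xmax': 20.0, 'ymin': -5.0, 'ymax': 5.0}

print(bounds_overlap(tile, inside_tile))  # True
print(bounds_overlap(tile, far_away))     # False
```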


In [7]:
len(file_info)

940
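
In a later session the same records can be read back from the json file instead of being rebuilt. The sketch below assumes the file holds the same list of dictionaries as `file_info`; `load_file_info` and `records_for_label` are hypothetical names, not functions from 'utils/file_management.py'.

```python
import json

def load_file_info(json_path):
    """Read the list of file records saved in Step 1c back into memory."""
    with open(json_path, encoding='utf8') as f:
        return json.load(f)

def records_for_label(records, label):
    """Return only the records whose 'label' field equals label."""
    return [r for r in records if r.get('label') == label]

# Usage, assuming the json file created above exists:
# file_info = load_file_info('data/var/tif_vars_file_info.json')
# tree_files = records_for_label(file_info, 'lc2015_tree_cov')
```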

In [8]:
#Examples where files are in the directory but file_processing_info.csv does not include them; shows 3 records
missing_info[0:3]

[{'file_name': 'WaterDepletionCat_WG3.tif',
  'missing_fields': ['field_name',
   'variable',
   'src_short',
   'summary_type',
   'label',
   'all_touched',
   'conditional']},
 {'file_name': 'Pasture2000_5m_update_nodata999.tif',
  'missing_fields': ['field_name',
   'variable',
   'src_short',
   'summary_type',
   'label',
   'all_touched',
   'conditional']},
 {'file_name': 'Cropland2000_5m.tif',
  'missing_fields': ['field_name',
   'variable',
   'src_short',
   'summary_type',
   'label',
   'all_touched',
   'conditional']}]

_____________________________________________________________________________________________________________
#### The process also returns the variable 'missing_info', a list of dictionaries describing .tif files that met the criteria (i.e. specified directory and file_type) but were either not in the file_processing_info.csv file or were in the file with missing values.

In [9]:
#Example of how to use missing_info to see which files that met the criteria are not currently ready for processing

#Import file_processing_info to get a count of number of columns
import pandas as pd
df = pd.read_csv('data/var/file_processing_info.csv')
rows, cols = df.shape

#If missing_info has data, count and print number of files that are not in the csv.
#Also count and print number of files that are in the csv but have missing data.  In this case print the file name and missing fields.
if missing_info:
    w_missing_file = 0
    w_missing_fields = 0
    for record in missing_info:
        if 'missing_fields' in record and len(record['missing_fields'])==cols:
            w_missing_file += 1
        elif 'missing_fields' in record and len(record['missing_fields'])<cols:
            w_missing_fields += 1
            print (f"{record['file_name']} found in file_processing_info.csv but has missing fields: {record['missing_fields']}")
            print ('\n')
        elif 'file_failed' in record:
            print (f"{record['file_name']} failed")
            print ('\n')
    print (f'{w_missing_fields} file(s) with missing fields in data/var/file_processing_info.csv')
    print (f'{w_missing_file} file(s) are not in data/var/file_processing_info.csv')
else:
    print('No files with missing processing information detected in processed directory')

0 file(s) with missing fields in data/var/file_processing_info.csv
3188 file(s) are not in data/var/file_processing_info.csv


==================================================================================================================
******************************************************************************************************************
==================================================================================================================

## Step 2: Create serialized versions of HydroSHEDS level 12 basins

Three variations are created and pickled for future use, allowing quick access to consistent information throughout processing steps:
    * geodataframe containing all attributes and all geospatial information (poly)
    * dataframe containing all attributes except geospatial information -> requires less RAM and loads faster when geospatial data are not needed
    * list of hybas_ids 
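
For readers unfamiliar with pickling, the sketch below shows the general write and read pattern with the standard library. It is not the implementation in 'utils/file_management.py'; `write_pkl` and `read_pkl` are hypothetical names, and the ids are illustrative values.

```python
import pickle

def write_pkl(obj, path):
    """Serialize a picklable object (such as a list of ids) to path."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def read_pkl(path):
    """Load an object previously written by write_pkl.

    Only unpickle files from a trusted source, since loading a pickle can run code.
    """
    with open(path, 'rb') as f:
        return pickle.load(f)

# Usage with a short, illustrative list of basin ids:
# write_pkl([1120000010, 1120000020], 'example_ids.pkl')
# ids = read_pkl('example_ids.pkl')
```

Pickled files load quickly because the object is restored directly rather than re-read and re-parsed from the source shapefiles.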

In [None]:
from utils import file_management as f_mng
f_mng.build_basin_data(level='12', version='v1c', directory = 'data/HydroSHEDS')
#Prints number of records

In [10]:
#Test read gdf, show first 2 rows
pkl_gdf = 'data/basins_lvl12_gdf.pkl'
gdf = f_mng.read_pkl_gdf(pkl_gdf)
gdf.head(2)

Unnamed: 0,HYBAS_ID,NEXT_DOWN,NEXT_SINK,MAIN_BAS,DIST_SINK,DIST_MAIN,SUB_AREA,UP_AREA,PFAF_ID,ENDO,COAST,ORDER,SORT,geometry
0,1120000010,0,1120000010,1120000010,0.0,0.0,11.0,11.0,111011001000,0,1,0,1,"POLYGON ((32.50000000000002 29.94583333333337,..."
1,1120000020,0,1120000020,1120000020,0.0,0.0,137.0,416.8,111011002100,0,0,1,2,"POLYGON ((32.36250000000002 29.97083333333337,..."


In [11]:
#There should be 1034083 rows and 14 columns
gdf.shape

(1034083, 14)

In [12]:
#Test read df, show first 2 rows
pkl_df_path = 'data/basins_lvl12_df.pkl'
df=f_mng.read_pkl_df(pkl_df_path)
df.head(2)

Unnamed: 0,HYBAS_ID,NEXT_DOWN,NEXT_SINK,MAIN_BAS,DIST_SINK,DIST_MAIN,SUB_AREA,UP_AREA,PFAF_ID,ENDO,COAST,ORDER,SORT
0,1120000010,0,1120000010,1120000010,0.0,0.0,11.0,11.0,111011001000,0,1,0,1
1,1120000020,0,1120000020,1120000020,0.0,0.0,137.0,416.8,111011002100,0,0,1,2


In [13]:
#There should be 1034083 rows and 13 columns (same as gdf but no geometry)
df.shape

(1034083, 13)

In [14]:
#Test read list of HydroBASINS IDs, show first five
pkl_list_path = 'data/basins_lvl12.txt'
hybas_id_list=f_mng.read_pkl_df(pkl_list_path)
hybas_id_list[0:5]

[1121976320, 4120903680, 3120562180, 8120172550, 2120220680]