## Step 1a: Download Source Data
      
#### Standard HydroBASINS from HydroSHEDS: 
    * Store the source HydroSHEDS data in the repo under 'data/HydroSHEDS'
    * Individual file names must remain as downloaded
    * The folder structure under this directory does not matter
    * See more information and download directions here:
    
#### Landscape Variables (var files):
    * Store the source landscape variable (var) files in the repo under 'data/var'
    * The folder structure under this directory does not matter
    * See more information and download directions for var files used in this assessment here:
    
#### Source File Information:
    * User-created CSV file containing processing information about each var file
    * For ease of use, name and store the file in the repo as 'data/var/file_processing_info.csv'
    * Required fields for each file include: file_name, variable, src_short, summary_type, label
    * Each file_name must match the corresponding var file name; a shapefile can be represented by its .shp file alone
    * See more information and download directions here:
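
Before running later steps, it can help to confirm that the CSV has the required fields listed above. The following is a minimal sketch using only the standard library; the function name `check_processing_info` and the second sample row are hypothetical, and this is not part of the repository's `utils` module.

```python
import csv
import io

# Required columns, as listed in the bullet points above
REQUIRED_FIELDS = ["file_name", "variable", "src_short", "summary_type", "label"]

def check_processing_info(csv_file):
    """Return (missing_columns, incomplete_rows) for an open CSV file object."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing_columns = [f for f in REQUIRED_FIELDS if f not in header]
    incomplete_rows = []
    for row in reader:
        # A required field counts as empty if it is blank or absent in this row
        empty = [f for f in REQUIRED_FIELDS
                 if f in header and not (row.get(f) or "").strip()]
        if empty:
            incomplete_rows.append((row.get("file_name", ""), empty))
    return missing_columns, incomplete_rows

# In-memory stand-in for 'data/var/file_processing_info.csv';
# the second row is a made-up record with an empty label.
sample = io.StringIO(
    "file_name,variable,src_short,summary_type,label\n"
    "Pasture2000_5m_update_nodata999.tif,pasture,"
    "earthstat_cropland_pasture_2000,count mean nodata,pasture2000\n"
    "example_missing_label.tif,example_variable,example_source,count mean nodata,\n"
)
missing_columns, incomplete_rows = check_processing_info(sample)
print(missing_columns)
print(incomplete_rows)
```

To check the real file, pass `open('data/var/file_processing_info.csv', newline='')` instead of the in-memory sample.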

In [1]:
#Note: the nodata value must be changed on some grids before the attribution process can be completed.
#The cause is unclear, but it may be related to how the grids were created.
#Affected files: Pasture2000_5m.tif, Cropland2000_5m.tif

import rasterio
import numpy as np

src_path = 'data/var/CroplandPastureArea2000_Geotiff/Pasture2000_5m.tif'
dst_path = 'data/var/CroplandPastureArea2000_Geotiff/Pasture2000_5m_update_nodata999.tif'

#Read the grid and copy its profile; the source file is closed when the block exits
with rasterio.open(src_path) as src:
    init_data = src.read()
    profile = src.profile

#In this case the initial nodata value is the only negative number, so replace it with the unique value 999
new_data = np.where(init_data < 0, 999, init_data)

#Reuse the original profile with nodata set to 999, then write new_data to the new file
profile.update(nodata=999)
with rasterio.open(dst_path, 'w', **profile) as dst:
    dst.write(new_data)
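
The replacement above relies on two assumptions: the original nodata value is the only negative number in the grid, and 999 does not already occur as a real value. The sketch below shows one way to check both before replacing. The helper name `replace_negative_nodata` is hypothetical, and the small array stands in for the raster data.

```python
import numpy as np

def replace_negative_nodata(arr, new_nodata=999):
    """Replace a single negative nodata value with new_nodata after sanity checks."""
    negatives = np.unique(arr[arr < 0])
    if negatives.size > 1:
        # More than one distinct negative value: the "only negative number" assumption fails
        raise ValueError(f"expected one negative nodata value, found {negatives.tolist()}")
    if np.any(arr == new_nodata):
        # The new nodata value would collide with real data
        raise ValueError(f"{new_nodata} already occurs in the data")
    return np.where(arr < 0, new_nodata, arr)

# Stand-in for init_data; -9999.0 plays the role of the original nodata value
demo = np.array([[0.25, -9999.0], [0.5, 0.0]])
cleaned = replace_negative_nodata(demo)
print(cleaned.tolist())
```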

In [2]:
#Example: fetch the manifest that lists the download URLs for the Land Cover data files
#Note: downloading the files themselves with Python was unreliable because the server kept cutting off access
import urllib.request as url_r

target_url = 'https://s3-eu-west-1.amazonaws.com/vito-lcv/2015/ZIPfiles/manifest_cgls_lc_v2_100m_global_2015.txt'
with url_r.urlopen(target_url) as response:
    download_urls_str = response.read().decode('utf8')
#splitlines handles both '\r\n' and '\n' line endings; blank lines are skipped
list_urls = [line for line in download_urls_str.splitlines() if line]

In [3]:
#Shows first 4 urls
list_urls[0:4]

['https://s3-eu-west-1.amazonaws.com/vito-lcv/2015/ZIPfiles/E000N00_ProbaV_LC100_epoch2015_global_v2.0.1_products_EPSG-4326.zip',
 'https://s3-eu-west-1.amazonaws.com/vito-lcv/2015/ZIPfiles/E000N20_ProbaV_LC100_epoch2015_global_v2.0.1_products_EPSG-4326.zip',
 'https://s3-eu-west-1.amazonaws.com/vito-lcv/2015/ZIPfiles/E000N40_ProbaV_LC100_epoch2015_global_v2.0.1_products_EPSG-4326.zip',
 'https://s3-eu-west-1.amazonaws.com/vito-lcv/2015/ZIPfiles/E000N60_ProbaV_LC100_epoch2015_global_v2.0.1_products_EPSG-4326.zip']
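
Because the server kept cutting off access, one possible workaround is to retry each download with an increasing delay. This is a minimal sketch, not what was used for this assessment: `fetch_with_retries`, its parameters, and the flaky stand-in opener are all hypothetical, and the demonstration makes no network request.

```python
import http.client
import io
import time
import urllib.error
import urllib.request

def fetch_with_retries(url, opener=urllib.request.urlopen, attempts=4,
                       base_delay=1.0, sleep=time.sleep):
    """Return the response body as bytes, retrying when the connection fails."""
    last_error = None
    for attempt in range(attempts):
        try:
            with opener(url) as response:
                return response.read()
        except (urllib.error.URLError, http.client.HTTPException,
                ConnectionError, TimeoutError) as err:
            last_error = err
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
                sleep(base_delay * 2 ** attempt)
    raise last_error

# Demonstration with a stand-in opener that drops the first two connections
calls = {"count": 0}

def flaky_opener(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionResetError("connection dropped")
    return io.BytesIO(b"zip bytes")

delays = []  # record the requested delays instead of actually sleeping
body = fetch_with_retries("https://example.invalid/tile.zip",
                          opener=flaky_opener, sleep=delays.append)
print(body, calls["count"], delays)
```

With the default `opener` and `sleep`, the same function could be called on each entry of `list_urls`.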

==================================================================================================================
******************************************************************************************************************
==================================================================================================================

## Step 1b: Create intermediate files containing data needed for processing steps

### A) Create file: f'{directory}/{short_name}_file_info.json'  -> example: data/var/tif_vars_file_info.json
    * The example below creates the JSON file 'data/var/tif_vars_file_info.json'. Later steps use the information in this file to determine how .tif data from the source landscape variable datasets (stored in the repository's data/var directory) are attributed to HydroSHEDS HydroBASINS basins.

In [4]:
from utils import file_management as f_mng

directory = 'data/var'
file_type = '.tif'

short_name = f'{file_type}_vars'
file_list, directory = f_mng.find_files(directory, suffix=file_type)

#convert file information to dictionary and export to json
file_info, missing_info = f_mng.store_file_info(file_list, directory, short_name)

______________________________________________________________________________________________
#### Below is a count of the .tif files ready to be processed and a view of the information for the first 5 files.

Note: File processing information is available in two formats. The function returns a list of dictionaries, each storing information about one file. The same information is also saved to disk as 'data/var/tif_vars_file_info.json' for use in later sessions. The example below prints the number of files processed and shows the first 5 records using the returned variable file_info.

In [5]:
print (f'{len(file_info)} files are documented and ready to process \n')

import pandas as pd
pd.DataFrame(file_info[0:5])

1038 files are documented and ready to process 



|   | categorical | file_name | file_path | label | pixel_inclusion | src_short | summary_type | variable |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | yes | WaterDepletionCat_WG3.tif | data/var/WaterDepletion_WaterGap3/WaterDepleti... | h2o_depletion | all_touching | earthstat_water_gap3 | majority max min | water_depletion_class |
| 1 | no | Pasture2000_5m_update_nodata999.tif | data/var/CroplandPastureArea2000_Geotiff/Pastu... | pasture2000 | all_touching | earthstat_cropland_pasture_2000 | count mean nodata | pasture |
| 2 | no | Cropland2000_5m_update_nodata999.tif | data/var/CroplandPastureArea2000_Geotiff/Cropl... | cropland2000 | all_touching | earthstat_cropland_pasture_2000 | count mean nodata | cropland |
| 3 | no | gpw_v4_population_density_rev11_2015_30_sec.tif | data/var/Gridded Population of the World_v4.11... | gpw2015_pop_den | centroid | gpw2015_v4_rev11_30_sec | count mean nodata | population_density |
| 4 | no | W120N80_ProbaV_LC100_epoch2015_global_v2.0.1_w... | data/var/LC100_epoch2015/W120N80_ProbaV_LC100_... | lc2015_waterseas_cov | centroid | lc100_epoch2015_v2.0.1 | count mean nodata | water-seasonal-coverfraction-layer |


In [6]:
len(file_info)

1038
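
In a later session the returned variable is no longer in memory, so the saved JSON file is the way back to the same records. The sketch below assumes the file holds a JSON list of dictionaries with the keys shown in the table above; `load_file_info` is a hypothetical helper, and a temporary file stands in for 'data/var/tif_vars_file_info.json'.

```python
import json
import os
import tempfile

def load_file_info(json_path):
    """Read the saved list of per-file dictionaries back from disk."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)

# Two records copied from the table above (file_path omitted because it is truncated there)
records = [
    {"categorical": "no", "file_name": "Pasture2000_5m_update_nodata999.tif",
     "label": "pasture2000", "pixel_inclusion": "all_touching",
     "src_short": "earthstat_cropland_pasture_2000",
     "summary_type": "count mean nodata", "variable": "pasture"},
    {"categorical": "no", "file_name": "gpw_v4_population_density_rev11_2015_30_sec.tif",
     "label": "gpw2015_pop_den", "pixel_inclusion": "centroid",
     "src_short": "gpw2015_v4_rev11_30_sec",
     "summary_type": "count mean nodata", "variable": "population_density"},
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "tif_vars_file_info.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    loaded_info = load_file_info(demo_path)

# Example query: which files use centroid-based pixel inclusion?
centroid_files = [r["file_name"] for r in loaded_info
                  if r["pixel_inclusion"] == "centroid"]
print(len(loaded_info), centroid_files)
```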

In [9]:
#Examples of missing data dictionaries
missing_info[0:3]

[{'file_name': 'Cropland2000_5m.tif',
  'missing_fields': ['field_name',
   'variable',
   'src_short',
   'summary_type',
   'label',
   'all_touched',
   'conditional']},
 {'file_name': 'Pasture2000_5m.tif',
  'missing_fields': ['field_name',
   'variable',
   'src_short',
   'summary_type',
   'label',
   'all_touched',
   'conditional']},
 {'file_name': 'W120N80_ProbaV_LC100_epoch2015_global_v2.0.1_moss-coverfraction-StdDev_EPSG-4326.tif',
  'missing_fields': ['field_name',
   'variable',
   'src_short',
   'summary_type',
   'label',
   'all_touched',
   'conditional']}]

_____________________________________________________________________________________________________________
#### The process also returned the variable missing_info, a list of dictionaries describing .tif files that met the criteria (i.e., were in the specified directory and had the specified file_type) but were either absent from file_processing_info.csv or listed there with missing values.

In [10]:
#Example of how to use missing_info to see which files that met the criteria are not yet ready for processing

#Import file_processing_info to get the number of columns
import pandas as pd
df = pd.read_csv('data/var/file_processing_info.csv')
rows, cols = df.shape

#If missing_info has data, count and print the number of files that are not in the csv.
#Also count and print the number of files that are in the csv but have missing data, printing the file name and missing fields for each.
if missing_info:
    w_missing_file = 0
    w_missing_fields = 0
    for record in missing_info:
        if 'missing_fields' in record and len(record['missing_fields'])==cols:
            w_missing_file += 1
        elif 'missing_fields' in record and len(record['missing_fields'])<cols:
            w_missing_fields += 1
            print (f"{record['file_name']} found in file_processing_info.csv but has missing fields: {record['missing_fields']}")
            print ('\n')
        elif 'file_failed' in record:
            print (f"{record['file_name']} failed")
            print ('\n')
    print (f'{w_missing_fields} file(s) with missing fields in data/var/file_processing_info.csv')
    print (f'{w_missing_file} file(s) are not in data/var/file_processing_info.csv')
else:
    print('No files with missing processing information detected in processed directory')

0 file(s) with missing fields in data/var/file_processing_info.csv
3083 file(s) are not in data/var/file_processing_info.csv
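
A complementary view of missing_info is to count how often each field is missing across all records, which shows at a glance which columns of the csv need the most attention. This is a minimal sketch: `count_missing_fields` is a hypothetical helper, the first sample record is copied from the missing_info output shown earlier, and the second is made up.

```python
from collections import Counter

def count_missing_fields(missing_info):
    """Count how many records are missing each field."""
    counts = Counter()
    for record in missing_info:
        # Records without a 'missing_fields' key (e.g. failed files) contribute nothing
        counts.update(record.get("missing_fields", []))
    return counts

sample_missing = [
    {"file_name": "Cropland2000_5m.tif",
     "missing_fields": ["field_name", "variable", "src_short", "summary_type",
                        "label", "all_touched", "conditional"]},
    # Hypothetical record: present in the csv but with an empty label
    {"file_name": "example_missing_label.tif", "missing_fields": ["label"]},
]

field_counts = count_missing_fields(sample_missing)
for field, n in field_counts.most_common():
    print(f"{field}: {n}")
```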


### B) Create serialized versions of HydroSHEDS level 12 basins

Three variations are created and pickled for future use, allowing quick access to consistent information throughout the processing steps:
    * GeoDataFrame containing all attributes and all geospatial information (polygons)
    * DataFrame containing all attributes except geospatial information
    * list of hybas_ids
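
The idea behind the three variations can be sketched with the standard library alone. This is not the implementation of `f_mng.build_basin_data`, which presumably works on a GeoDataFrame; here plain dictionaries stand in for basin records, the geometry values are placeholders, and all function names are hypothetical.

```python
import os
import pickle
import tempfile

def build_variations(records, geometry_key="geometry"):
    """Split basin records into full records, attribute-only records, and an id list."""
    with_geometry = [dict(r) for r in records]
    attributes_only = [{k: v for k, v in r.items() if k != geometry_key}
                       for r in records]
    ids = [r["HYBAS_ID"] for r in records]
    return with_geometry, attributes_only, ids

def pickle_to(path, obj):
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def unpickle_from(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-in records; the geometry strings are placeholders, not real polygons
basins = [
    {"HYBAS_ID": 1120000010, "NEXT_DOWN": 0, "SUB_AREA": 11.0,
     "geometry": "<polygon placeholder>"},
    {"HYBAS_ID": 1120000020, "NEXT_DOWN": 0, "SUB_AREA": 137.0,
     "geometry": "<polygon placeholder>"},
]

with_geometry, attributes_only, ids = build_variations(basins)

with tempfile.TemporaryDirectory() as tmp:
    # Write each variation once, then reload only what a given step needs
    pickle_to(os.path.join(tmp, "basins_gdf.pkl"), with_geometry)
    pickle_to(os.path.join(tmp, "basins_df.pkl"), attributes_only)
    pickle_to(os.path.join(tmp, "basins_ids.pkl"), ids)
    restored_ids = unpickle_from(os.path.join(tmp, "basins_ids.pkl"))

print(restored_ids)
```

Keeping the id list and the attribute-only table separate means steps that do not need polygons can avoid loading the much larger geospatial object.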

In [11]:
from utils import file_management as f_mng
f_mng.build_basin_data(level='12', version='v1c', directory = 'data/HydroSHEDS')
#Prints number of records

1034083


In [12]:
#Test read gdf
pkl_gdf = 'data/basins_lvl12_gdf.pkl'
gdf = f_mng.read_pkl_gdf(pkl_gdf)
gdf.head(5)

|   | HYBAS_ID | NEXT_DOWN | NEXT_SINK | MAIN_BAS | DIST_SINK | DIST_MAIN | SUB_AREA | UP_AREA | PFAF_ID | ENDO | COAST | ORDER | SORT | geometry |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1120000010 | 0 | 1120000010 | 1120000010 | 0.0 | 0.0 | 11.0 | 11.0 | 111011001000 | 0 | 1 | 0 | 1 | POLYGON ((32.50000000000002 29.94583333333337,... |
| 1 | 1120000020 | 0 | 1120000020 | 1120000020 | 0.0 | 0.0 | 137.0 | 416.8 | 111011002100 | 0 | 0 | 1 | 2 | POLYGON ((32.36250000000002 29.97083333333337,... |
| 2 | 1121694330 | 1120000020 | 1120000020 | 1120000020 | 19.5 | 19.5 | 135.1 | 280.0 | 111011002200 | 0 | 0 | 1 | 3 | POLYGON ((32.36250000000002 29.9666666666667, ... |
| 3 | 1121693980 | 1121694330 | 1120000020 | 1120000020 | 35.3 | 35.3 | 144.9 | 144.9 | 111011002300 | 0 | 0 | 1 | 4 | POLYGON ((32.25833333333335 29.9916666666667, ... |
| 4 | 1120000030 | 0 | 1120000030 | 1120000030 | 0.0 | 0.0 | 186.8 | 186.9 | 111011003000 | 0 | 1 | 0 | 5 | POLYGON ((32.40000000000002 29.73750000000003,... |


In [13]:
gdf.shape

(1034083, 14)

In [14]:
#Test read df
pkl_df_path = 'data/basins_lvl12_df.pkl'
df=f_mng.read_pkl_df(pkl_df_path)
df.head()

|   | HYBAS_ID | NEXT_DOWN | NEXT_SINK | MAIN_BAS | DIST_SINK | DIST_MAIN | SUB_AREA | UP_AREA | PFAF_ID | ENDO | COAST | ORDER | SORT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1120000010 | 0 | 1120000010 | 1120000010 | 0.0 | 0.0 | 11.0 | 11.0 | 111011001000 | 0 | 1 | 0 | 1 |
| 1 | 1120000020 | 0 | 1120000020 | 1120000020 | 0.0 | 0.0 | 137.0 | 416.8 | 111011002100 | 0 | 0 | 1 | 2 |
| 2 | 1121694330 | 1120000020 | 1120000020 | 1120000020 | 19.5 | 19.5 | 135.1 | 280.0 | 111011002200 | 0 | 0 | 1 | 3 |
| 3 | 1121693980 | 1121694330 | 1120000020 | 1120000020 | 35.3 | 35.3 | 144.9 | 144.9 | 111011002300 | 0 | 0 | 1 | 4 |
| 4 | 1120000030 | 0 | 1120000030 | 1120000030 | 0.0 | 0.0 | 186.8 | 186.9 | 111011003000 | 0 | 1 | 0 | 5 |


In [15]:
df.shape

(1034083, 13)
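
The attribute-only table is enough for network questions that do not need polygons. In HydroBASINS, NEXT_DOWN holds the HYBAS_ID of the next downstream basin and is 0 when there is none, so following it repeatedly traces a basin to its outlet. The sketch below uses the five rows shown above as a plain dictionary; `trace_downstream` is a hypothetical helper, not part of `utils`.

```python
# HYBAS_ID -> NEXT_DOWN, copied from the five rows shown above
next_down = {
    1120000010: 0,
    1120000020: 0,
    1121694330: 1120000020,
    1121693980: 1121694330,
    1120000030: 0,
}

def trace_downstream(hybas_id, next_down):
    """Return the list of basin ids from hybas_id down to its outlet basin."""
    path = [hybas_id]
    seen = {hybas_id}
    while True:
        nxt = next_down.get(path[-1], 0)
        if nxt == 0:
            # 0 (or an id outside the lookup) marks the end of the chain
            break
        if nxt in seen:
            raise ValueError(f"cycle detected at basin {nxt}")
        path.append(nxt)
        seen.add(nxt)
    return path

route = trace_downstream(1121693980, next_down)
print(route)
```

With the full table loaded, the lookup could be built with `dict(zip(df['HYBAS_ID'], df['NEXT_DOWN']))`.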

In [16]:
#Test read list (the list is read with the same helper used for the DataFrame pickle)
pkl_list_path = 'data/basins_lvl12.txt'
hybas_id_list=f_mng.read_pkl_df(pkl_list_path)
hybas_id_list[0:5]

[1121976320, 4120903680, 3120562180, 8120172550, 2120220680]