This placeholder section explores methods for attributing vector data (points, lines, polygons) to HydroBASINS.  Many open source methods are available, but the challenge is finding one that is not extremely slow.  The most efficient solution may differ for each vector data type.  This step needs more exploration, although the point-to-basin example seems to be working well.

get list of level 3 basins
for each lvl 3 basin:
    if #rows > 4500:
        get level 5 basins
        for each lvl 5 basin:
            get field and stat info from csv
            run stats
    else:
        run stats

in run stats, branch on each stat in stats:
    stat == area_weighted
    stat == len_weighted
    stat == max
    stat == min
    stat ==
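
The stat branches listed above could be dispatched along these lines. This is only a minimal sketch under assumptions: the function name `run_stats`, the column names `basin_id`, `value`, `area` and `length`, and the toy numbers are all hypothetical and are not part of this notebook's utilities.

```python
import pandas as pd

def run_stats(df, stat, by='basin_id', value='value'):
    """Summarize `value` per basin for one stat name (hypothetical helper)."""
    if stat in ('area_weighted', 'len_weighted'):
        #the weight column is assumed to be named 'area' or 'length'
        weight = 'area' if stat == 'area_weighted' else 'length'
        weighted = (df[value] * df[weight]).groupby(df[by]).sum()
        total = df[weight].groupby(df[by]).sum()
        return weighted / total
    if stat in ('max', 'min'):
        return df.groupby(by)[value].agg(stat)
    raise ValueError(f'unsupported stat: {stat}')

#toy pieces of features already split by basin (made-up numbers)
pieces = pd.DataFrame({
    'basin_id': [1, 1, 2],
    'value': [10.0, 20.0, 5.0],
    'area': [1.0, 3.0, 2.0],
    'length': [2.0, 2.0, 4.0],
})

area_mean = run_stats(pieces, 'area_weighted')
len_mean = run_stats(pieces, 'len_weighted')
max_val = run_stats(pieces, 'max')
```

Each call returns a Series indexed by basin, so the results for several stats can be joined back to the basin table on the basin identifier.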

## Step 3a: Example Summarizing Point Based Data to Basins
* Point-to-basin example shown below using GRanD dam data

In [1]:
#Import needed packages
import geopandas as gpd
from geopandas.tools import sjoin   #sjoin relies on the rtree spatial index; there is a lot of documentation for resolving rtree install issues
from timeit import default_timer as timer
import pandas as pd
from utils import file_management as f_mng

* First prep data by reading in basin and dam data into geopandas.  Subset the basin data as we only need the geometry and the identifiers.

In [2]:
#Read GRanD data into geodataframe
grand_file_nm = 'data/var/GRanD_Version_1_3/GRanD_dams_v1_3.shp'
grand_gdf = gpd.read_file(grand_file_nm)

#Read in pickled level 12 HydroBASINS with geographic information (see step1_data_management.ipynb)
gdf = f_mng.read_pkl_gdf()

#subset basin data to only include identifiers and geometry
gdf_sub = gdf[['HYBAS_ID','PFAF_ID','geometry']]

* Next use geopandas to spatially join the two datasets.  We use a left join from dams to basins so that each dam is assigned the basin it falls within.
* Note that using op='within' rather than op='intersects' saves a lot of processing time (both give the same results here).  In newer geopandas releases the op argument is named predicate.

In [3]:
start= timer()

#Specifying within rather than intersects saves a lot of processing time
spatial_join_gdf = sjoin(grand_gdf,gdf_sub,how='left',op='within')

#Just for reference of how long takes to process
proc_time = str(timer()-start) 
print(f'Processing time: {proc_time}') 

Processing time: 116.36953754999558


In [4]:
#The new dataframe should have the same number of rows as the original dams geodataframe.
print(len(grand_gdf))
print(len(spatial_join_gdf))

7320
7320
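
Matching row counts show that no dam was duplicated by the join, but a left join also keeps dams that fall outside every basin, with a missing `HYBAS_ID`. A quick check along these lines could flag both cases. The toy frame below only stands in for `spatial_join_gdf`; the `dam_id` column and all values are made up for illustration.

```python
import pandas as pd

#toy stand-in for the joined dams table (hypothetical ids)
joined = pd.DataFrame({
    'dam_id': [1, 2, 3],
    'HYBAS_ID': [101.0, None, 202.0],
})

#dams that did not fall within any basin
n_unmatched = int(joined['HYBAS_ID'].isna().sum())
#dams that matched more than one basin would appear as repeated ids
n_duplicated = int(joined['dam_id'].duplicated().sum())

print(f'unmatched dams: {n_unmatched}, duplicated dams: {n_duplicated}')
```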


* Next summarize variables of interest.  In this case we are interested in count of dams and sum of max storage capacity.  If there is one dam in a basin the count = 1 and the storage capacity for that dam will carry over to the basin, but if multiple dams are in the same basin, we count the number of dams and sum the max capacities for all dams in the basin.

In [6]:
#Select id columns and columns we would like to summarize
#.copy() makes an independent dataframe, so adding a column does not raise SettingWithCopyWarning
summarize_prep = spatial_join_gdf[['HYBAS_ID','CAP_MCM']].copy()
#add count field that we can use to count dams per basin
summarize_prep['count'] = 1
#sum number of dams and max storage capacity of reservoirs for each basin with GRanD dams
summarize_df = summarize_prep.groupby(['HYBAS_ID']).sum()
summarize_df.reset_index(inplace=True)
summarize_df = summarize_df.rename(columns={'count':'grand_dam_count','CAP_MCM':'grand_cap_mcm_sum'})


In [7]:
summarize_df.columns

Index(['HYBAS_ID', 'grand_cap_mcm_sum', 'grand_dam_count'], dtype='object')
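
The count-and-sum step can also be written with pandas named aggregation, which avoids the helper `count` column. The following is a minimal sketch on made-up rows: the values are assumptions, while the column names follow the cells above.

```python
import pandas as pd

#toy stand-in for spatial_join_gdf[['HYBAS_ID','CAP_MCM']] (made-up values)
joined = pd.DataFrame({
    'HYBAS_ID': [101, 101, 202],
    'CAP_MCM': [5.0, 7.5, 1.0],
})

#one row per basin: number of dams and summed storage capacity
summary = (
    joined.groupby('HYBAS_ID')
    .agg(grand_dam_count=('CAP_MCM', 'size'),
         grand_cap_mcm_sum=('CAP_MCM', 'sum'))
    .reset_index()
)
print(summary)
```

Using `size` counts every joined dam row, including rows where the capacity value is missing.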

In [9]:
#link back to original data and fill nan values with 0s (e.g. count of 0 dams for basins without data)
all_basins_df = pd.DataFrame(gdf_sub.drop(columns=['geometry']))
df_merge = all_basins_df.merge(summarize_df, how='left', on='HYBAS_ID')
df_merge = df_merge.fillna(0.0)

In [12]:
print (df_merge.shape)
print (df_merge.columns)

(1034083, 4)
Index(['HYBAS_ID', 'PFAF_ID', 'grand_cap_mcm_sum', 'grand_dam_count'], dtype='object')
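
Because `fillna(0.0)` on the whole frame leaves the dam count as a float, one option is to fill and cast per column after the merge. This is a sketch with made-up basin ids, not the notebook's data; only the column names are taken from the cells above.

```python
import pandas as pd

#toy basin list and summary table (hypothetical ids and values)
all_basins = pd.DataFrame({'HYBAS_ID': [101, 202, 303]})
summary = pd.DataFrame({
    'HYBAS_ID': [101],
    'grand_cap_mcm_sum': [12.5],
    'grand_dam_count': [2],
})

#validate raises an error if HYBAS_ID is repeated on either side
merged = all_basins.merge(summary, how='left', on='HYBAS_ID', validate='one_to_one')
#fill each summary column explicitly, then restore the integer count
merged = merged.fillna({'grand_cap_mcm_sum': 0.0, 'grand_dam_count': 0})
merged['grand_dam_count'] = merged['grand_dam_count'].astype(int)
print(merged)
```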


In [14]:
df_merge.tail(3)

           HYBAS_ID       PFAF_ID  grand_cap_mcm_sum  grand_dam_count
1034080  2120113730  227405670002                0.0              0.0
1034081  2120113740  214092090020                0.0              0.0
1034082  2120113750  213097060014                0.0              0.0


In [None]:
#Export to csv (index=False leaves out the dataframe row index, which carries no information here)
outfile_name = 'output/hb12_grand_local.csv'
df_merge.to_csv(outfile_name, sep=',', index=False)

In [None]:
#exploring polygon and line to basin

In [None]:
from utils import file_management as f_mng

In [None]:
#any level can be used but in testing performance was best when ~ 3,000 lvl12 basins were processed at one time
#these steps help ensure we are fairly close to the best performance
#the average number of lvl12 basins in lvl 3 basins is , but the max is Y so we need to break those out further

level = 3
pfaf_id_list = f_mng.basin_list_by_pfaf_lvl(level=level)

#read pickled geodataframe of lvl12 basins
gdf = f_mng.read_pkl_gdf()
gdf['PFAF_ID_str'] = gdf['PFAF_ID'].astype(str)

In [None]:
#count level 3 basins that fall outside the target chunk size
over = 0
under = 0
for pfaf_id in pfaf_id_list:
    pid = str(pfaf_id)
    lvl3_gdf = gdf.loc[gdf['PFAF_ID_str'].str.startswith(pid,na=False)]
    rows = len(lvl3_gdf)
    
    if rows > 3000:
        over+=1
        #print (f'{pid} : {rows}')
        #draft of the level 5 breakdown (separate names so the outer loop variables are not overwritten)
        #lvl5_id_list = f_mng.basin_list_by_pfaf_lvl(level=5)
        #lvl5_list = [x for x in lvl5_id_list if str(x).startswith(pid)]
        #for lvl5_id in lvl5_list:
        #    pid5 = str(lvl5_id)
        #    lvl5_gdf = lvl3_gdf.loc[lvl3_gdf['PFAF_ID_str'].str.startswith(pid5,na=False)]
        #    rows5 = len(lvl5_gdf)
        #    if rows5 > 3000 or rows5 < 200:
        #        print (f'{pid5} : {rows5}')
    elif rows < 250:
        under+=1
        #print (f'{pid} : {rows}')
    

In [None]:
print (over)

In [None]:
print (under)
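
The level 3 then level 5 split explored above can be sketched without looping over every prefix, by slicing the Pfafstetter code string. This is a hedged illustration rather than the notebook's method: `chunk_by_prefix`, the tiny threshold, and the toy ids are assumptions, and it relies on a level 12 `PFAF_ID` starting with the ids of its parent levels, as the `startswith` filter above does.

```python
import pandas as pd

def chunk_by_prefix(pfaf_str, max_rows, coarse=3, fine=5):
    """Label each row with its coarse prefix, or its fine prefix where the coarse group exceeds max_rows."""
    coarse_key = pfaf_str.str[:coarse]
    fine_key = pfaf_str.str[:fine]
    #number of rows sharing each row's coarse prefix
    coarse_size = coarse_key.map(coarse_key.value_counts())
    #keep the fine prefix only for rows in oversized coarse groups
    return fine_key.where(coarse_size > max_rows, coarse_key)

#toy ids, shortened to 6 digits for readability (made up)
ids = pd.Series(['123451', '123452', '123461', '123462', '456781'])
chunks = chunk_by_prefix(ids, max_rows=3)
print(chunks.value_counts().sort_index())
```

The resulting labels can be passed to `groupby` so that each chunk is processed in one pass; groups that are still too large after the level 5 split would need a further level.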

In [1]:
test['id']=1

NameError: name 'test' is not defined