### SPLIT AFRICA DATASET

The split is random and unseeded, so I re-ran this notebook until the train/validation/test fractions came out reasonably close to their targets. Results:

```
block relative split: 0.7013422818791947 0.19966442953020133 0.09899328859060402 0.1989247311827957
block split fracs: 0.5618279569892473 0.15994623655913978 0.0793010752688172 0.1989247311827957
--------------------------------------------------

train group frac: 0.5754954954954955
* low-med-high rfracs: 0.9912093359573009 0.9953038273162859 1.01233024086334

valid group frac: 0.11729729729729729
* low-med-high rfracs: 0.99800672957355 0.9987246993654794 1.0203430467621144

test group frac: 0.07117117117117117
* low-med-high rfracs: 1.1151295961422545 1.0108583524961854 0.917252000150246

hold_out group frac: 0.23621621621621622
* low-med-high rfracs: 0.9869556136718608 1.008040234194403 0.984026961999244

```

In a perfect split, the `low-med-high rfracs` would all equal 1 and the `block split fracs` would equal the respective `group fracs`. Notes for this split:

1. The validation set is a little low in group count (0.117 of groups vs. 0.160 of blocks), but that speeds up training. It's the test group that matters.
2. The low-med-high rfracs are good. The test group is high on low water content (1.12) and low on high water content (0.92), which you can argue is a feature since it requires us to score better on the harder categories.
3. The hold-out group can be used for additional training or additional validation if needed.
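
To make the `rfracs` concrete, here is a minimal sketch on toy data of how I read the calculation in `sub_df`: the expected row count for a tag is the overall row count for that tag scaled by the split's share of unique groups, and the rfrac is observed over expected. The helper name `relative_fracs` and the toy frame are illustrative only and are not part of the notebook.

```python
import pandas as pd

def relative_fracs(df, split_name):
    """Return (group_frac, {tag: observed/expected}) for one split."""
    n_groups = df.group_id.nunique()
    sdf = df[df.data_split == split_name]
    # share of all unique groups that landed in this split
    group_frac = sdf.group_id.nunique() / n_groups
    # rows per tag we would expect if the split mirrored the full dataset
    expected = df.water_tag.value_counts() * group_frac
    observed = sdf.water_tag.value_counts().reindex(expected.index, fill_value=0)
    return group_frac, (observed / expected).to_dict()

# toy data: 4 groups, 2 rows each; the test split is skewed towards 'low'
toy = pd.DataFrame({
    'group_id':   ['g1', 'g1', 'g2', 'g2', 'g3', 'g3', 'g4', 'g4'],
    'water_tag':  ['low', 'high', 'low', 'high', 'low', 'low', 'low', 'high'],
    'data_split': ['train'] * 4 + ['test'] * 4,
})

test_frac, test_rfracs = relative_fracs(toy, 'test')
print(test_frac, test_rfracs)
```

In this toy case the test split holds half of the groups but three of the five `low` rows, so its `low` rfrac is above 1 and its `high` rfrac is below 1, the same pattern as the real test group above.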
 

---

In [1]:
import pandas as pd
from random import sample

In [2]:
BASENAMES=['africa_small','africa','africa_mtn']
URI_TMPL='https://storage.googleapis.com/surface-water-public/data/v1/datasets/{}.csv'

In [3]:
df=pd.concat([ pd.read_csv(URI_TMPL.format(b)) for b in BASENAMES ])
df.shape[0],df.drop_duplicates(subset=['block_id']).shape[0],df.drop_duplicates(subset=['group_id']).shape[0]

(6731, 744, 5550)

---

In [4]:
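def water_counter(df,quantiles=(0.15,0.85),labels=('low','medium','high')):
    # bin edges: 0, the requested water quantiles, and just above 1 so water==1 is kept
    wqs=[0]+[df.water.quantile(q) for q in quantiles]+[1+1e-8]
    dfs=[]
    for i,l in enumerate(labels):
        # rows with wqs[i] <= water < wqs[i+1]
        _df=df[(df.water>=wqs[i])&(df.water<wqs[i+1])].copy()
        _df['water_tag']=l
        dfs.append(_df)
    # shuffle the rows so the tags are interleaved
    return pd.concat(dfs).sample(frac=1)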


def split_groups(blocks,valid_frac=0.2,test_frac=0.1,hold_out_frac=0.2,relative_frac=True):
    # with relative_frac, valid/test fracs apply to what is left after the hold-out is removed
    if relative_frac:
        valid_frac=(1-hold_out_frac)*valid_frac
        test_frac=(1-hold_out_frac)*test_frac
    nb_blocks=len(blocks)
    nb_valid=int(valid_frac*nb_blocks)
    nb_test=int(test_frac*nb_blocks)
    nb_hold_out=int(hold_out_frac*nb_blocks)
    # unseeded shuffle: every run produces a different split
    blocks=sample(blocks,nb_blocks)
    valid=blocks[:nb_valid]
    test=blocks[nb_valid:nb_valid+nb_test]
    hold_out=blocks[nb_valid+nb_test:nb_valid+nb_test+nb_hold_out]
    # train gets the remainder, including blocks lost to int() truncation
    train=blocks[nb_valid+nb_test+nb_hold_out:]
    rtotal=len(train)+len(valid)+len(test)
    print('block relative split:',len(train)/rtotal,len(valid)/rtotal,len(test)/rtotal,len(hold_out)/nb_blocks)
    print('block split fracs:',len(train)/nb_blocks,len(valid)/nb_blocks,len(test)/nb_blocks,len(hold_out)/nb_blocks)
    return train, valid, test, hold_out


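def sub_df(typ,df,blocks,cnt):
    # rows belonging to this split's blocks
    sdf=df[df.block_id.isin(blocks)].copy()
    sdf['data_split']=typ

    # fraction of all unique groups that landed in this split
    sgcnt=sdf.drop_duplicates(subset=['group_id']).shape[0]
    frac=sgcnt/cnt

    # expected rows per water tag if the split mirrored the full dataset
    ltcnt=df[df.water_tag=='low'].shape[0]*frac
    mtcnt=df[df.water_tag=='medium'].shape[0]*frac
    htcnt=df[df.water_tag=='high'].shape[0]*frac

    # observed rows per water tag in this split
    sltcnt=sdf[sdf.water_tag=='low'].shape[0]
    smtcnt=sdf[sdf.water_tag=='medium'].shape[0]
    shtcnt=sdf[sdf.water_tag=='high'].shape[0]
    print()
    print(f'{typ} group frac:',frac)
    # rfrac = observed/expected; 1 means the tag is represented proportionally
    print('* low-med-high rfracs:',sltcnt/ltcnt,smtcnt/mtcnt,shtcnt/htcnt)
    return sdf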
    
    
def df_splitter(df,train, valid, test, hold_out):
    cnt=df.drop_duplicates(subset=['group_id']).shape[0]
    tdf=sub_df('train',df,train,cnt)    
    vdf=sub_df('valid',df,valid,cnt)    
    sdf=sub_df('test',df,test,cnt)    
    hdf=sub_df('hold_out',df,hold_out,cnt)
    df=pd.concat([tdf, vdf, sdf, hdf])
    return df
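
As an aside, the tagging done by `water_counter` can also be sketched with `pd.cut`, which assigns every row to a left-closed bin in one pass. This is an alternative sketch, not the notebook's method: the name `tag_water` and the toy data are hypothetical, it does not shuffle the rows, and rows outside the bin range get a missing tag where `water_counter` would drop them. `pd.cut` also raises an error when two bin edges coincide (for example if the 0.15 quantile is exactly 0), so it would need adjusting in that case.

```python
import pandas as pd

def tag_water(df, quantiles=(0.15, 0.85), labels=('low', 'medium', 'high')):
    """Tag rows by water fraction using the same bin edges as water_counter."""
    # edges: 0, the water quantiles, and just above 1 so water == 1 is included
    edges = [0.0] + [df.water.quantile(q) for q in quantiles] + [1 + 1e-8]
    out = df.copy()
    # right=False gives bins [edge_i, edge_i+1), matching >= and < in water_counter
    out['water_tag'] = pd.cut(out.water, bins=edges, labels=list(labels), right=False)
    return out

# toy data: 21 evenly spaced water fractions between 0 and 1
toy = pd.DataFrame({'water': [i / 20 for i in range(21)]})
tagged = tag_water(toy)
print(tagged.water_tag.value_counts())
```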

---

In [5]:
df=water_counter(df)

In [6]:
blocks=list(df.block_id.unique())

In [7]:
train, valid, test, hold_out=split_groups(blocks)
print('-'*50)
df=df_splitter(df, train, valid, test, hold_out)

block relative split: 0.7013422818791947 0.19966442953020133 0.09899328859060402 0.1989247311827957
block split fracs: 0.5618279569892473 0.15994623655913978 0.0793010752688172 0.1989247311827957
--------------------------------------------------

train group frac: 0.5754954954954955
* low-med-high rfracs: 0.9912093359573009 0.9953038273162859 1.01233024086334

valid group frac: 0.11729729729729729
* low-med-high rfracs: 0.99800672957355 0.9987246993654794 1.0203430467621144

test group frac: 0.07117117117117117
* low-med-high rfracs: 1.1151295961422545 1.0108583524961854 0.917252000150246

hold_out group frac: 0.23621621621621622
* low-med-high rfracs: 0.9869556136718608 1.008040234194403 0.984026961999244
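
Because `split_groups` shuffles without a seed, the numbers above cannot be reproduced by re-running the cell. One way to replace the manual re-runs is to try a range of seeds and keep the one whose group fractions land closest to the targets. The following is a minimal sketch of that idea under my own assumptions: `split_blocks`, `split_error`, the toy data, and the choice of error measure (largest absolute gap between group fraction and target) are all illustrative, not what was used to produce this dataset. The default fractions are the absolute ones implied by the notebook's defaults, (1-0.2)*0.2 = 0.16 and (1-0.2)*0.1 = 0.08.

```python
import random
import pandas as pd

def split_blocks(blocks, seed, valid_frac=0.16, test_frac=0.08, hold_out_frac=0.2):
    """Seeded version of the block shuffle; train gets the remainder."""
    shuffled = random.Random(seed).sample(blocks, len(blocks))
    n = len(shuffled)
    nv, nt, nh = int(valid_frac * n), int(test_frac * n), int(hold_out_frac * n)
    return {
        'valid': shuffled[:nv],
        'test': shuffled[nv:nv + nt],
        'hold_out': shuffled[nv + nt:nv + nt + nh],
        'train': shuffled[nv + nt + nh:],
    }

def split_error(df, split, targets):
    """Largest absolute gap between a split's group fraction and its target."""
    n_groups = df.group_id.nunique()
    gaps = []
    for name, blocks in split.items():
        frac = df[df.block_id.isin(blocks)].group_id.nunique() / n_groups
        gaps.append(abs(frac - targets[name]))
    return max(gaps)

# toy data: 50 blocks holding between 1 and 12 groups each
rng = random.Random(0)
rows = [{'block_id': f'block_{b}', 'group_id': f'group_{b}_{g}'}
        for b in range(50) for g in range(rng.randint(1, 12))]
toy = pd.DataFrame(rows)
blocks = list(toy.block_id.unique())
targets = {'train': 0.56, 'valid': 0.16, 'test': 0.08, 'hold_out': 0.2}

# keep the seed with the smallest error; re-running with it reproduces the split
best_seed = min(range(50), key=lambda s: split_error(toy, split_blocks(blocks, s), targets))
best_split = split_blocks(blocks, best_seed)
print(best_seed, split_error(toy, best_split, targets))
```

Recording `best_seed` alongside the output would let anyone regenerate the same train/valid/test/hold_out assignment later.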


---

In [8]:
df.sample(3)

Unnamed: 0,year,month,aoi,crs,block_id,group_id,lat,lon,water,not_water,no_data,water_score,timestamp,lat_id,lon_id,tile_id,gsw_path,s1_path,water_tag,data_split
1009,2015,2,africa,EPSG:32636,block_31_28,group_30.92_27.67,27.66922,30.915104,0.064522,0.930309,0.005169,0.060485,1422749000000.0,lat_27.6692193670,lon_30.9151032240,lon_30d9151032240_lat_27d6692193670-201502,gs://surface-water-public/data/v1/jrc/africa/G...,gs://surface-water-public/data/v1/sentinel_1/a...,medium,train
2545,2015,10,africa_mtn,EPSG:32734,block_23_-33,group_22.60_-33.49,-33.493187,22.59989,0.082783,0.916355,0.000862,0.086269,1443658000000.0,lat_-33.4931859987,lon_22.5998896684,lon_22d5998896684_lat_-33d4931859987-201510,gs://surface-water-public/data/v1/jrc/africa_m...,gs://surface-water-public/data/v1/sentinel_1/a...,medium,train
3544,2015,4,africa,EPSG:32632,block_9_12,group_8.82_11.82,11.818235,8.820251,0.011711,0.983963,0.004326,0.011703,1427846000000.0,lat_11.8182351470,lon_8.8202508459,lon_8d8202508459_lat_11d8182351470-201504,gs://surface-water-public/data/v1/jrc/africa/G...,gs://surface-water-public/data/v1/sentinel_1/a...,low,train


---

In [9]:
df.to_csv('surface-water.africa.master.csv',index=False)
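
Since the split is made on `block_id`, no block should appear in more than one `data_split`. A quick sanity check before relying on the saved CSV could look like the sketch below. The helper `leaked_keys` and the toy frames are hypothetical; on the real data it would be called on `df` (or on the CSV read back in).

```python
import pandas as pd

def leaked_keys(df, key='block_id', split_col='data_split'):
    """Return the key values that appear in more than one split."""
    n_splits = df.groupby(key)[split_col].nunique()
    return sorted(n_splits[n_splits > 1].index)

# toy data: each block sits in exactly one split
clean = pd.DataFrame({
    'block_id':   ['block_1', 'block_1', 'block_2', 'block_3'],
    'data_split': ['train', 'train', 'valid', 'test'],
})
# same data, but one row of block_1 is moved into the test split
leaky = clean.copy()
leaky.loc[1, 'data_split'] = 'test'

print(leaked_keys(clean))
print(leaked_keys(leaky))
```

An empty list means the splits are disjoint at the block level. The same helper can be pointed at `group_id` with `key='group_id'` if groups are also expected to stay within a single split.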