### SPLIT AFRICA DATASET 256

The split below is random and unseeded, so I re-ran this notebook until the split was reasonably even across train/validation/test. In a perfect split, every `low-med-high rfracs` value would equal 1 and each `block split fracs` value would equal the corresponding `group frac`.
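To make the `rfracs` definition concrete, here is a minimal sketch on made-up counts (the numbers are hypothetical and not taken from this dataset). An rfrac is the observed count of a water tag in a split divided by the count you would expect if the split held exactly its group share of that tag, which mirrors the arithmetic in `sub_df` further down.

```python
# Hypothetical counts, for illustration only.
total_by_tag = {'low': 150, 'medium': 700, 'high': 150}  # rows per water tag, full dataset
split_by_tag = {'low': 32, 'medium': 138, 'high': 27}    # rows per water tag, one split
group_frac = 0.2                                         # share of all groups in this split

# observed / expected, where expected = total * group_frac
rfracs = {
    tag: split_by_tag[tag] / (total_by_tag[tag] * group_frac)
    for tag in total_by_tag
}
print(rfracs)
```

A value above 1 means the tag is over-represented in the split; a value below 1 means it is under-represented.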

RESULTS:

```
-------------------------------------------------------------------------------------
block relative split: 0.7023809523809523 0.1984126984126984 0.0992063492063492 0.2
block split fracs: 0.5619047619047619 0.15873015873015872 0.07936507936507936 0.2
-------------------------------------------------------------------------------------

train group frac: 0.5904858072478403
* low-med-high rfracs: 1.0148718087916262 0.9897717827057199 1.0396853004491233

valid group frac: 0.1447324133288455
* low-med-high rfracs: 1.0072902291506942 1.0529494510312574 0.7643257517676123

test group frac: 0.07921014248849995
* low-med-high rfracs: 1.0821125050586806 1.0016411902022613 0.8878871836378919

hold_out group frac: 0.18557163693481432
* low-med-high rfracs: 0.9119430750627848 0.9905487518178216 1.1053855455306483

```

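NOTES:

1. Validation is a little low in group count (about 0.145 of groups versus 0.159 of blocks), but that speeds up training. It's the test group that matters.
2. The low-med-high rfracs are good: all but one are within about 12% of 1. The exception is the validation high-water rfrac at 0.76.
3. The hold-out group can be used for additional training or additional validation if needed.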
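Re-running by hand until the split looks even could be automated with a seeded search. The following is only a sketch under stated assumptions: `split_blocks` and `worst_rfrac_deviation` are hypothetical helpers (not part of this notebook), the toy frame stands in for the real data, and the group share is approximated by the plain row share rather than the unique `(group_id, window_index)` count used in `sub_df`.

```python
import random
import pandas as pd

def split_blocks(blocks, seed, fracs=(0.16, 0.08, 0.2)):
    """Shuffle blocks with a fixed seed and cut them into four splits."""
    rng = random.Random(seed)
    shuffled = rng.sample(blocks, len(blocks))
    n_valid, n_test, n_hold = (int(f * len(blocks)) for f in fracs)
    return {
        'valid': shuffled[:n_valid],
        'test': shuffled[n_valid:n_valid + n_test],
        'hold_out': shuffled[n_valid + n_test:n_valid + n_test + n_hold],
        'train': shuffled[n_valid + n_test + n_hold:],
    }

def worst_rfrac_deviation(df, split):
    """Largest |rfrac - 1| over every split and water tag."""
    tag_totals = df.water_tag.value_counts()
    worst = 0.0
    for blocks in split.values():
        sdf = df[df.block_id.isin(blocks)]
        frac = len(sdf) / len(df)  # assumption: row share, not unique-window share
        if frac == 0:
            continue
        counts = sdf.water_tag.value_counts()
        for tag, total in tag_totals.items():
            rfrac = counts.get(tag, 0) / (total * frac)
            worst = max(worst, abs(rfrac - 1))
    return worst

# Toy data standing in for the real windows table.
rng = random.Random(0)
toy = pd.DataFrame({
    'block_id': [f'block_{rng.randrange(40)}' for _ in range(2000)],
    'water_tag': rng.choices(['low', 'medium', 'high'], weights=[15, 70, 15], k=2000),
})
blocks = sorted(toy.block_id.unique())

# Score a handful of seeds and keep the most balanced one.
scores = {seed: worst_rfrac_deviation(toy, split_blocks(blocks, seed)) for seed in range(20)}
best_seed = min(scores, key=scores.get)
print(best_seed, scores[best_seed])
```

Recording the chosen seed would also make the split reproducible, which the unseeded `sample` call in this notebook does not.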
 

---

In [1]:
import pandas as pd
from random import sample

In [2]:
WSIZE=256
DSET=f'surface-water.africa.win{WSIZE}'
NO_DATA_MAX=0.05
IDENTS=['group_id','window_index']

In [3]:
df=pd.read_csv(f'{DSET}.csv')
print(
    df.shape[0],
    df.drop_duplicates(subset=['block_id']).shape[0],
    df.drop_duplicates(subset=IDENTS).shape[0])
df=df[df.no_data<=NO_DATA_MAX]
print(
    df.shape[0],
    df.drop_duplicates(subset=['block_id']).shape[0],
    df.drop_duplicates(subset=IDENTS).shape[0])
df.sample(3)

10458 673 10241
9095 630 8913


|  | gsw_path | s1_path | block_id | group_id | window_index | window | water | not_water | no_data | dataset |
|---|---|---|---|---|---|---|---|---|---|---|
| 2835 | gs://surface-water-public/data/v1/jrc/africa/G... | gs://surface-water-public/data/v1/sentinel_1/a... | block_17_-28 | group_17.86_-28.75 | 1 | (0, 256, 256, 256) | 0.037018 | 0.962051 | 0.000931 | africa |
| 2468 | gs://surface-water-public/data/v1/jrc/africa/G... | gs://surface-water-public/data/v1/sentinel_1/a... | block_22_-18 | group_22.42_-18.97 | 3 | (256, 256, 256, 256) | 0.017105 | 0.982895 | 0.0 | africa |
| 6177 | gs://surface-water-public/data/v1/jrc/africa/G... | gs://surface-water-public/data/v1/sentinel_1/a... | block_14_12 | group_14.32_12.85 | 1 | (0, 256, 256, 256) | 0.109314 | 0.844833 | 0.045853 | africa |


---

In [4]:
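def water_counter(df,quantiles=(0.15,0.85),labels=('low','medium','high')):
    # Tag each row low/medium/high by where its water fraction falls
    # relative to the given quantiles, then shuffle the rows.
    # Bin edges: 0, the quantile values, and just above 1 so water==1 lands in the top bin.
    wqs=[0]+[df.water.quantile(q) for q in quantiles]+[1+1e-8]
    dfs=[]
    for i,l in enumerate(labels):
        _df=df[(df.water>=wqs[i])&(df.water<wqs[i+1])].copy()
        _df['water_tag']=l
        dfs.append(_df)
    return pd.concat(dfs).sample(frac=1)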


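def split_groups(blocks,valid_frac=0.2,test_frac=0.1,hold_out_frac=0.2,relative_frac=True):
    # Randomly partition the block ids into train/valid/test/hold_out.
    # With relative_frac=True, valid_frac and test_frac are fractions of the
    # blocks left after the hold-out set is removed, so rescale them to
    # fractions of all blocks.
    if relative_frac:
        valid_frac=(1-hold_out_frac)*valid_frac
        test_frac=(1-hold_out_frac)*test_frac
    nb_blocks=len(blocks)
    nb_valid=int(valid_frac*nb_blocks)
    nb_test=int(test_frac*nb_blocks)
    nb_hold_out=int(hold_out_frac*nb_blocks)
    # random.sample with k=len(blocks) returns a shuffled copy; no seed is set,
    # so every run produces a different split.
    blocks=sample(blocks,nb_blocks)
    valid=blocks[:nb_valid]
    test=blocks[nb_valid:nb_valid+nb_test]
    hold_out=blocks[nb_valid+nb_test:nb_valid+nb_test+nb_hold_out]
    # train takes whatever is left, including the int() rounding remainder
    train=blocks[nb_valid+nb_test+nb_hold_out:]
    rtotal=len(train)+len(valid)+len(test)
    print('block relative split:',len(train)/rtotal,len(valid)/rtotal,len(test)/rtotal,len(hold_out)/nb_blocks)
    print('block split fracs:',len(train)/nb_blocks,len(valid)/nb_blocks,len(test)/nb_blocks,len(hold_out)/nb_blocks)
    return train, valid, test, hold_out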


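def sub_df(typ,df,blocks,cnt):
    # Select the rows whose block is in this split and label them.
    sdf=df[df.block_id.isin(blocks)].copy()
    sdf['data_split']=typ

    # frac: share of unique (group_id, window_index) windows in this split
    sgcnt=sdf.drop_duplicates(subset=IDENTS).shape[0]
    frac=sgcnt/cnt

    # expected per-tag row counts if this split held exactly `frac` of each tag
    ltcnt=df[df.water_tag=='low'].shape[0]*frac
    mtcnt=df[df.water_tag=='medium'].shape[0]*frac
    htcnt=df[df.water_tag=='high'].shape[0]*frac

    # observed per-tag row counts in this split
    sltcnt=sdf[sdf.water_tag=='low'].shape[0]
    smtcnt=sdf[sdf.water_tag=='medium'].shape[0]
    shtcnt=sdf[sdf.water_tag=='high'].shape[0]
    print()
    print(f'{typ} group frac:',frac)
    # rfrac = observed/expected; 1 means the tag is perfectly balanced
    print('* low-med-high rfracs:',sltcnt/ltcnt,smtcnt/mtcnt,shtcnt/htcnt)
    return sdf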
    
    
def df_splitter(df,train, valid, test, hold_out):
    cnt=df.drop_duplicates(subset=IDENTS).shape[0]
    tdf=sub_df('train',df,train,cnt)    
    vdf=sub_df('valid',df,valid,cnt)    
    sdf=sub_df('test',df,test,cnt)    
    hdf=sub_df('hold_out',df,hold_out,cnt)
    df=pd.concat([tdf, vdf, sdf, hdf])
    return df

---

In [5]:
df=water_counter(df)
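As an illustration of what the 0.15/0.85 quantile tagging does, here is a minimal sketch on synthetic data. It swaps in `pd.cut` for the loop in `water_counter` (a different technique with the same half-open `[low, high)` bins), and the beta-distributed `water` values are an assumption chosen only to look roughly like skewed water fractions.

```python
import numpy as np
import pandas as pd

# Synthetic, skewed water fractions in (0, 1); not real data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({'water': rng.beta(1, 8, size=1000)})

# Same thresholds as water_counter's defaults.
q_low, q_high = toy.water.quantile([0.15, 0.85])

# right=False gives [edge_i, edge_i+1) bins, matching the >= / < test in water_counter.
toy['water_tag'] = pd.cut(
    toy.water,
    bins=[0, q_low, q_high, 1 + 1e-8],
    labels=['low', 'medium', 'high'],
    right=False,
)
print(toy.water_tag.value_counts(normalize=True))
```

By construction roughly 15% of rows end up tagged `low`, 70% `medium`, and 15% `high`, so the `high` tag is a small group and its rfrac is the one most sensitive to which blocks land in a split.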

In [6]:
blocks=list(df.block_id.unique())

In [7]:
print('-'*85)
train, valid, test, hold_out=split_groups(blocks)
print('-'*85)
df=df_splitter(df, train, valid, test, hold_out)

-------------------------------------------------------------------------------------
block relative split: 0.7023809523809523 0.1984126984126984 0.0992063492063492 0.2
block split fracs: 0.5619047619047619 0.15873015873015872 0.07936507936507936 0.2
-------------------------------------------------------------------------------------

train group frac: 0.5904858072478403
* low-med-high rfracs: 1.0148718087916262 0.9897717827057199 1.0396853004491233

valid group frac: 0.1447324133288455
* low-med-high rfracs: 1.0072902291506942 1.0529494510312574 0.7643257517676123

test group frac: 0.07921014248849995
* low-med-high rfracs: 1.0821125050586806 1.0016411902022613 0.8878871836378919

hold_out group frac: 0.18557163693481432
* low-med-high rfracs: 0.9119430750627848 0.9905487518178216 1.1053855455306483
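Because the split is made by block, no `block_id` should appear in more than one `data_split`. A quick sanity check could look like the sketch below; `blocks_in_multiple_splits` is a hypothetical helper, and the toy frame deliberately contains one leaking block to show what a failure looks like.

```python
import pandas as pd

def blocks_in_multiple_splits(df):
    """Return the block_ids that are assigned to more than one data_split."""
    n_splits = df.groupby('block_id').data_split.nunique()
    return sorted(n_splits[n_splits > 1].index)

# Toy frame: block_3 is (wrongly) present in both test and train.
toy = pd.DataFrame({
    'block_id':   ['block_1', 'block_1', 'block_2', 'block_3', 'block_3'],
    'data_split': ['train',   'train',   'valid',   'test',    'train'],
})
leaks = blocks_in_multiple_splits(toy)
print(leaks)
```

On the real `df` produced above, the expectation is an empty list, since `sub_df` selects rows with `block_id.isin(blocks)` on disjoint block lists.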


---

In [8]:
df.sample(3)

|  | gsw_path | s1_path | block_id | group_id | window_index | window | water | not_water | no_data | dataset | water_tag | data_split |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1584 | gs://surface-water-public/data/v1/jrc/africa/G... | gs://surface-water-public/data/v1/sentinel_1/a... | block_-7_32 | group_-7.83_32.64 | 3 | (256, 256, 256, 256) | 0.010162 | 0.988968 | 0.00087 | africa | low | valid |
| 4416 | gs://surface-water-public/data/v1/jrc/africa/G... | gs://surface-water-public/data/v1/sentinel_1/a... | block_27_-15 | group_27.27_-15.71 | 1 | (0, 256, 256, 256) | 0.239059 | 0.759201 | 0.00174 | africa | high | valid |
| 7501 | gs://surface-water-public/data/v1/jrc/africa_m... | gs://surface-water-public/data/v1/sentinel_1/a... | block_-13_9 | group_-13.37_9.58 | 3 | (256, 256, 256, 256) | 0.025818 | 0.974182 | 0.0 | africa_mtn | medium | train |


In [9]:
file_name=f'{DSET}.split.csv'
df.to_csv(file_name,index=False)
print(f'UPLOAD: gsutil cp {file_name} gs://surface-water-public/data/v1/datasets')

UPLOAD: gsutil cp surface-water.africa.win256.split.csv gs://surface-water-public/data/v1/datasets
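A downstream consumer would presumably load the split file and select rows by `data_split`. The sketch below shows one way to do that; it writes a tiny stand-in CSV to a temporary directory so it runs without the real file, and the rows and the `toy.split.csv` name are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Tiny stand-in for the split file (hypothetical rows, same column names as above).
toy = pd.DataFrame({
    'block_id':   ['block_1', 'block_2', 'block_3', 'block_4'],
    'water':      [0.04, 0.24, 0.01, 0.11],
    'data_split': ['train', 'valid', 'test', 'hold_out'],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.split.csv')
    toy.to_csv(path, index=False)   # index=False, as in the cell above
    split_df = pd.read_csv(path)

# One DataFrame per split, keyed by the data_split label.
by_split = {name: part for name, part in split_df.groupby('data_split')}
train_df, valid_df = by_split['train'], by_split['valid']
print(train_df.shape[0], valid_df.shape[0])
```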


---

In [14]:
# !gsutil cp surface-water.africa.win256.split.csv gs://surface-water-public/data/v1/datasets

Copying file://surface-water.africa.win256.split.csv [Content-Type=text/csv]...
/ [1 files][  2.8 MiB/  2.8 MiB]                                                
Operation completed over 1 objects/2.8 MiB.                                      


---

##### NOT-WATER/WATER RATIO

In [11]:
_df=df[df.data_split.isin(['train','valid'])]

In [12]:
_df.not_water.mean()/_df.water.mean()

8.499056392503826
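Note that the figure above is a ratio of means (mean `not_water` over mean `water`), which is not the same as averaging the per-window ratios. A minimal sketch on three made-up windows shows how far apart the two can be when some windows contain very little water.

```python
import pandas as pd

# Hypothetical windows, for illustration only.
toy = pd.DataFrame({
    'water':     [0.10, 0.50, 0.02],
    'not_water': [0.90, 0.50, 0.98],
})

ratio_of_means = toy.not_water.mean() / toy.water.mean()   # what In [12] computes
mean_of_ratios = (toy.not_water / toy.water).mean()        # dominated by low-water windows
print(ratio_of_means, mean_of_ratios)
```

The ratio of means reflects the overall balance of not-water to water area across all windows, whereas the mean of ratios is pulled up sharply by windows with only a sliver of water.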

In [13]:
_df.drop_duplicates(subset=IDENTS)[['water','not_water']].describe()

|  | water | not_water |
|---|---|---|
| count | 6553.0 | 6553.0 |
| mean | 0.104713 | 0.890427 |
| std | 0.122543 | 0.124226 |
| min | 0.005005 | 0.369141 |
| 25% | 0.022202 | 0.855698 |
| 50% | 0.052628 | 0.941559 |
| 75% | 0.137955 | 0.973724 |
| max | 0.598251 | 0.994995 |
