# Test effect of number of samples used in training

We previously established the best kmer size and training parameters. We also found that the method is fairly robust to sample quality, provided that most of the training samples are of high quality. We will now test how many training samples are needed to get accurate predictions.

In [1]:
from functions import *
import warnings, os, json
from numpy import nan
warnings.filterwarnings('ignore')

get_device_name(0)

'Tesla V100-PCIE-32GB'

We will build training sets with 1 to 7 samples per species, leaving the remaining samples as the validation set. For each validation sample, we will record the actual and the predicted species, so that we can later check whether each prediction was correct.
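As a rough illustration of this kind of split, the sketch below marks a fixed number of samples per species as training and the rest as validation. It is not the implementation of `get_training_set`, which lives in `functions` and is not shown here; the helper `split_per_species`, the `species` column and the toy table are all hypothetical.

```python
import pandas as pd

def split_per_species(samples, n_training, seed=None):
    """Hypothetical helper: flag n_training random samples per species as training."""
    # Draw n_training rows without replacement within each species
    train_idx = samples.groupby('species').sample(n=n_training, random_state=seed).index
    out = samples.copy()
    # Everything that was not drawn for training becomes validation
    out['is_valid'] = ~out.index.isin(train_idx)
    return out

# Toy table (made-up names): 2 species with 10 samples each
toy = pd.DataFrame({
    'sample': [f'sample_{i}' for i in range(1, 21)],
    'species': ['species_A'] * 10 + ['species_B'] * 10,
})

split = split_per_species(toy, n_training=3, seed=0)

# Count training and validation samples per species
n_train = (~split['is_valid']).groupby(split['species']).sum()
n_valid = split['is_valid'].groupby(split['species']).sum()
print(n_train.to_dict(), n_valid.to_dict())
```

With 10 samples per species, choosing 3 for training leaves 7 for validation, which mirrors the `n_valid = 10-n_training` argument used in the loop.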


In [None]:
kmer_size = 7
all_bp_tr = [x*1e6 for x in [1,2,5,10,20,50,100,200]]

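# Write one record per validation prediction to a temporary text file
with open('n_training.txt','w') as outfile:
    for n_training in range(1, 8):  # 1 to 7 training samples per species
        for replicate in range(50):
            # Clear the notebook output every 10 replicates to keep it manageable
            if replicate % 10 == 0: clear_output()
            print('N training', n_training)
            print('Replicate', replicate)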

            df = get_training_set(kmer_size, all_bp_tr, n_valid = 10-n_training, minbp_filter = None)

            learn = train_cnn(df, 
                              pretrained = True,
                              callbacks = CutMix,
                              architecture = 'ig_resnext101_32x8d',
                              transforms = aug_transforms(do_flip = False,
                                                  max_rotate = 0,
                                                  max_zoom = 1,
                                                  max_lighting = 0.5,
                                                  max_warp = 0,
                                                  p_affine = 0,
                                                  p_lighting = 0.75
                                                 ),
                              loss_fn = LabelSmoothingCrossEntropyFlat()
                             )
            preds = get_predictions(learn, df, all_bp_tr)

            # The training samples are the same for every prediction in this replicate
            training_samples = df.loc[~df['is_valid'], 'sample'].drop_duplicates().sort_values().tolist()

            for _, row in preds.iterrows():

                print(row['sample'], row['actual'], row['prediction'])

                entry = dict(kmer_size = kmer_size,
                             replicate = replicate,
                             bp_training = '|'.join(str(bp) for bp in sorted(all_bp_tr)),
                             bp_valid = row['bp_valid'],
                             samples_training = '|'.join(training_samples),
                             n_samp_training = len(training_samples),
                             sample_valid = row['sample'],
                             valid_actual = row['actual'],
                             valid_prediction = row['prediction'])
                # Each record is written as a dict literal, one per line
                print(entry, file = outfile)
                

print('DONE')
            

N training 4
Replicate 40


S-97 S_bannisterioides S_bannisterioides
S-97 S_bannisterioides S_bannisterioides
S-97 S_bannisterioides S_bannisterioides
S-97 S_bannisterioides S_bannisterioides
S-97 S_bannisterioides S_bannisterioides
S-97 S_bannisterioides S_bannisterioides
S-97 S_bannisterioides S_bannisterioides
S-97 S_bannisterioides S_paralias
S-93 S_bannisterioides S_paralias
S-93 S_bannisterioides S_paralias
S-93 S_bannisterioides S_paralias
S-93 S_bannisterioides S_ciliatum
S-91 S_bannisterioides S_bogotense
S-91 S_bannisterioides S_bannisterioides
S-91 S_bannisterioides S_bogotense
S-91 S_bannisterioides S_bogotense
S-91 S_bannisterioides S_lindenianum
S-91 S_bannisterioides S_lindenianum
S-91 S_bannisterioides S_ciliatum
S-91 S_bannisterioides S_paralias
S-99 S_bannisterioides S_bannisterioides
S-99 S_bannisterioides S_bannisterioides
S-99 S_bannisterioides S_bannisterioides
S-99 S_bannisterioides S_bannisterioides
S-99 S_bannisterioides S_bannisterioides
S-99 S_bannisterioides S_bannisterioides
S-99 S_ba

S-100 S_bannisterioides S_bannisterioides
S-100 S_bannisterioides S_bannisterioides
S-100 S_bannisterioides S_bannisterioides
S-100 S_bannisterioides S_bannisterioides
S-100 S_bannisterioides S_bannisterioides
S-100 S_bannisterioides S_bannisterioides
S-100 S_bannisterioides S_bannisterioides
S-100 S_bannisterioides S_bannisterioides
S-98 S_bannisterioides S_bannisterioides
S-98 S_bannisterioides S_bannisterioides
S-98 S_bannisterioides S_bannisterioides
S-98 S_bannisterioides S_bannisterioides
S-98 S_bannisterioides S_bannisterioides
S-98 S_bannisterioides S_bannisterioides
S-98 S_bannisterioides S_bannisterioides
S-98 S_bannisterioides S_bannisterioides
S-96 S_bannisterioides S_bannisterioides
S-96 S_bannisterioides S_bannisterioides
S-96 S_bannisterioides S_bannisterioides
S-96 S_bannisterioides S_bannisterioides
S-96 S_bannisterioides S_bannisterioides
S-96 S_bannisterioides S_bannisterioides
S-96 S_bannisterioides S_bannisterioides
S-96 S_bannisterioides S_bannisterioides
S-93 S_b

S-95 S_bannisterioides S_bannisterioides
S-95 S_bannisterioides S_bannisterioides
S-95 S_bannisterioides S_bannisterioides
S-95 S_bannisterioides S_bannisterioides
S-95 S_bannisterioides S_bannisterioides
S-95 S_bannisterioides S_bannisterioides
S-95 S_bannisterioides S_bannisterioides
S-95 S_bannisterioides S_bannisterioides
S-96 S_bannisterioides S_paralias
S-96 S_bannisterioides S_bannisterioides
S-96 S_bannisterioides S_bannisterioides
S-96 S_bannisterioides S_bannisterioides
S-96 S_bannisterioides S_bannisterioides
S-96 S_bannisterioides S_bannisterioides
S-96 S_bannisterioides S_bannisterioides
S-96 S_bannisterioides S_bannisterioides
S-100 S_bannisterioides S_bannisterioides
S-100 S_bannisterioides S_bannisterioides
S-100 S_bannisterioides S_bannisterioides
S-100 S_bannisterioides S_bannisterioides
S-100 S_bannisterioides S_bannisterioides
S-100 S_bannisterioides S_bannisterioides
S-100 S_bannisterioides S_bannisterioides
S-100 S_bannisterioides S_bannisterioides
S-91 S_banniste
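The loop above stores each record as the printed form of a Python dict, which later has to be parsed back with `eval`. One possible alternative, shown here only as a sketch, is to write one JSON object per line using the standard `json` module. The records and the in-memory buffer below are made up for illustration and stand in for the real entries and the output file.

```python
import io
import json
import math

# Made-up records with the same kind of fields as the entries written above
entries = [
    dict(replicate=0, sample_valid='sample_1', valid_actual='species_A',
         valid_prediction='species_B', bp_valid=1e6),
    dict(replicate=0, sample_valid='sample_2', valid_actual='species_A',
         valid_prediction='species_A', bp_valid=float('nan')),
]

buffer = io.StringIO()  # stands in for the open output file
for entry in entries:
    # One JSON object per line; by default NaN is written as the token NaN
    buffer.write(json.dumps(entry) + '\n')

# Reading back needs no eval(): each line is parsed independently
buffer.seek(0)
records = [json.loads(line) for line in buffer]
print(records[0])
```

Note that `json.dumps` does not accept NumPy integer types directly, so values taken from a DataFrame may need to be converted with `int()` first.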


Now that we have finished training and testing all models, let's save the results as a table that can be easily read in R:

In [None]:
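# Each line is a dict literal written by print(); eval() needs `nan` (imported
# above) in scope if any value is missing. Only use eval() on files you created.
with open('n_training.txt','r') as infile:
    df = pd.DataFrame([eval(x) for x in infile])

df.to_csv('n_training.csv')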

Path('n_training.txt').unlink()

print('DONE')
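Before moving to R, a quick check of the table can be done in pandas. The sketch below computes the fraction of correct predictions for each number of training samples, using the column names written above (`n_samp_training`, `valid_actual`, `valid_prediction`). The toy data frame is made up for illustration and is not a result of this experiment; in practice `results` would come from `pd.read_csv('n_training.csv')`.

```python
import pandas as pd

# Made-up records standing in for the contents of n_training.csv
results = pd.DataFrame({
    'n_samp_training': [3, 3, 3, 3, 6, 6, 6, 6],
    'valid_actual': ['species_A'] * 8,
    'valid_prediction': ['species_A', 'species_B', 'species_A', 'species_B',
                         'species_A', 'species_A', 'species_A', 'species_B'],
})

# A prediction is correct when the predicted species matches the actual one
results['correct'] = results['valid_actual'] == results['valid_prediction']

# Mean of a boolean column is the fraction of correct predictions
accuracy = results.groupby('n_samp_training')['correct'].mean()
print(accuracy)
```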