# Summary
This kernel is a cleaned-up version of my submission scripts. I learned a lot from this challenge. In short, fast geometric ensembling gives me an incredible boost in performance, and the bucket iterator makes it possible to run everything within 7200s. Minor improvements are based on other great kernels. Thank you all!

# References
model structure & clr from https://www.kaggle.com/shujian/single-rnn-with-4-folds-clr  
hidden size 256 from https://www.kaggle.com/artgor/text-modelling-in-pytorch  
speed up pre-processing from https://www.kaggle.com/syhens/speed-up-your-preprocessing  
the idea to reduce oov from https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings  
misspell dictionary & punctuations from https://www.kaggle.com/theoviel/improve-your-score-with-some-text-preprocessing  
latex cleaning from https://www.kaggle.com/sunnymarkliu/more-text-cleaning-to-increase-word-coverage  
pytorch text processing routines from https://github.com/howardyclo/pytorch-seq2seq-example/blob/master/seq2seq.ipynb  
capsule from https://www.kaggle.com/spirosrap/bilstm-attention-kfold-clr-extra-features-capsule  
Please correct me if I missed any.

# Some new things
## performance
**fast geometric ensemble** from https://arxiv.org/abs/1802.10026. It gives a consistent and significant boost in both LB score and CV score for various models when combined with a learnable embedding.  
**semi-supervised ensemble** similar to the [Malware Classification Challenge 1st solution](https://www.kaggle.com/c/malware-classification/discussion/13897). ~~Marginal significance can be observed with a large test set.~~ It didn't bring me any benefit in the 2nd stage.  
**"mix up" embeddings**. The idea is to randomly choose a linear combination of two embeddings rather than simply averaging them. Though no significant improvement can be observed, I still keep it in my solution as regularization.
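
As a rough illustration of fast geometric ensembling: the paper trains with a cyclical learning rate, keeps a copy of the model each time the rate reaches its minimum, and averages the predictions of those copies. The sketch below only shows that bookkeeping in plain Python. `triangular_lr` mirrors the triangular schedule of the `CyclicLR` class in this kernel; `snapshot_steps` and `average_predictions` are hypothetical helpers named for illustration, and the toy numbers stand in for real model outputs. It is not the training loop of this submission.

```python
import math


def triangular_lr(iteration, base_lr=0.001, max_lr=0.002, step_size=300):
    """Triangular cyclical rate: base -> max -> base every 2 * step_size iterations."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


def snapshot_steps(total_iterations, step_size=300):
    """Iterations at which a full cycle ends, i.e. the rate is back at base_lr."""
    cycle_len = 2 * step_size
    return list(range(cycle_len, total_iterations + 1, cycle_len))


def average_predictions(snapshot_predictions):
    """Equal-weight average of per-example probabilities over all snapshots."""
    n = len(snapshot_predictions)
    return [sum(column) / n for column in zip(*snapshot_predictions)]


# toy usage: three cycles, one prediction vector per snapshot (assumed values)
steps = snapshot_steps(total_iterations=1800, step_size=300)
preds = [[0.2, 0.9], [0.4, 0.7], [0.3, 0.8]]
ensemble = average_predictions(preds)
```

In a real run, each entry of `preds` would be the output of evaluating the model at one of the `steps`, so the ensemble costs no extra training time beyond a single run.
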
## speed
**bucket iterator**, similar to the one in torchtext. It runs twice as fast as static padding.  
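
To see where the saving comes from, the toy sketch below counts how many token slots have to be processed under static padding (every sequence padded to a fixed `max_len`) versus bucketing (sort by length, then pad each batch only to its own longest sequence). The function names and the toy lengths are made up for illustration; the `BucketSampler` in this kernel additionally shuffles the batches inside each bucket, and the real speed-up depends on the length distribution of the data.

```python
def static_padding_cost(lengths, max_len):
    """Token slots processed when every sequence is padded to max_len."""
    return len(lengths) * max_len


def bucketed_padding_cost(lengths, batch_size):
    """Token slots processed when sequences are sorted by length and each
    batch is padded only to its own longest sequence."""
    ordered = sorted(lengths, reverse=True)
    cost = 0
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        cost += len(batch) * max(batch)
    return cost


lengths = [5, 70, 12, 8, 30, 9, 11, 6]  # toy question lengths in tokens
static = static_padding_cost(lengths, max_len=70)        # 8 * 70 slots
bucketed = bucketed_padding_cost(lengths, batch_size=4)  # 4 * 70 + 4 * 9 slots
```
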
## misc
  - speed up capsule.  
  - load embedding file with pandas. It saves ~80 seconds per embedding.  
  - reduce oov by replacing an oov word with its capitalized, upper-case, or lower-case version if available. The final oov rate is about 7.5%.
  - minor changes to the model structure. I don't think they really work...
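
The oov fallback from the list above can be sketched as a stand-alone function. This is a simplified version with a toy embedding table; `lookup_with_case_fallback` is a hypothetical name, and the order of the fallbacks (lower, upper, capitalized) follows the `embedding2numpy` function in this kernel.

```python
def lookup_with_case_fallback(embeddings, word):
    """Return (vector, matched_form). Tries the word itself, then its
    lower-case, upper-case and capitalized forms; returns (None, None)
    if every form is missing."""
    for candidate in (word, word.lower(), word.upper(), word.capitalize()):
        if candidate in embeddings:
            return embeddings[candidate], candidate
    return None, None


# toy embedding table (assumed values, 2 dimensions for brevity)
toy_embeddings = {"quora": [0.1, 0.2], "NASA": [0.3, 0.4], "India": [0.5, 0.6]}

vec_a, form_a = lookup_with_case_fallback(toy_embeddings, "Quora")   # matched via lower()
vec_b, form_b = lookup_with_case_fallback(toy_embeddings, "nasa")    # matched via upper()
vec_c, form_c = lookup_with_case_fallback(toy_embeddings, "india")   # matched via capitalize()
vec_d, form_d = lookup_with_case_fallback(toy_embeddings, "qwertz")  # still out of vocabulary
```
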

In [1]:
import os
import random
import re
import time
from collections import Counter
from itertools import chain
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.utils import shuffle
from torch import optim
from torch.utils.data import Dataset, Sampler, DataLoader
from tqdm import tqdm

In [2]:
# constants
embedding_glove = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
embedding_fasttext = '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec'
embedding_para = '../input/embeddings/paragram_300_sl999/paragram_300_sl999.txt'
embedding_w2v = '../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'
train_path = '../input/train.csv'
test_path = '../input/test.csv'

mispell_dict = {"ain't": "is not", "aren't": "are not", "can't": "cannot", "'cause": "because",
                "could've": "could have", "couldn't": "could not", "didn't": "did not", "doesn't": "does not",
                "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
                "he'd": "he would", "he'll": "he will", "he's": "he is", "how'd": "how did",
                "how'd'y": "how do you", "how'll": "how will", "how's": "how is", "i'd": "i would",
                "i'd've": "i would have", "i'll": "i will", "i'll've": "I will have", "i'm": "i am",
                "i've": "I have", "isn't": "is not", "it'd": "it would",
                "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have", "it's": "it is",
                "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have",
                "mightn't": "might not", "mightn't've": "might not have", "must've": "must have",
                "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not",
                "needn't've": "need not have", "o'clock": "of the clock", "oughtn't": "ought not",
                "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not",
                "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have",
                "she'll": "she will", "she'll've": "she will have", "she's": "she is",
                "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have",
                "so've": "so have", "so's": "so as", "this's": "this is", "that'd": "that would",
                "that'd've": "that would have", "that's": "that is", "there'd": "there would",
                "there'd've": "there would have", "there's": "there is", "here's": "here is",
                "they'd": "they would", "they'd've": "they would have", "they'll": "they will",
                "they'll've": "they will have", "they're": "they are", "they've": "they have",
                "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have",
                "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have",
                "weren't": "were not", "what'll": "what will", "what'll've": "what will have",
                "what're": "what are", "what's": "what is", "what've": "what have", "when's": "when is",
                "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have",
                "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
                "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not",
                "won't've": "will not have", "would've": "would have", "wouldn't": "would not",
                "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would",
                "y'all'd've": "you all would have", "y'all're": "you all are", "y'all've": "you all have",
                "you'd": "you would", "you'd've": "you would have", "you'll": "you will",
                "you'll've": "you will have", "you're": "you are", "you've": "you have", 'colour': 'color',
                'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling',
                'counselling': 'counseling', 'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor',
                'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize',
                'youtu ': 'youtube ', 'qoura': 'quora', 'sallary': 'salary', 'whta': 'what',
                'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can',
                'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doi': 'do I',
                'thebest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation',
                'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis',
                'etherium': 'ethereum', 'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017',
                '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess',
                "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization',
                'demonitization': 'demonetization', 'demonetisation': 'demonetization'}

puncts = '\'!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
punct_mapping = {"‘": "'", "₹": "e", "´": "'", "°": "", "€": "e", "™": "tm", "√": " sqrt ", "×": "x", "²": "2",
                 "—": "-", "–": "-", "’": "'", "_": "-", "`": "'", '”': '"', '“': '"', "£": "e",
                 '∞': 'infinity', 'θ': 'theta', '÷': '/', 'α': 'alpha', '•': '.', 'à': 'a', '−': '-', 'β': 'beta',
                 '∅': '', '³': '3', 'π': 'pi', '\u200b': ' ', '…': ' ... ', '\ufeff': '', 'करना': '', 'है': ''}
for p in puncts:
    punct_mapping[p] = ' %s ' % p

p = re.compile(r'(\[ math \]).+(\[ / math \])')
p_space = re.compile(r'[^\x20-\x7e]')

In [3]:
#  seeding functions
def set_seed(seed):
    np.random.seed(seed)
    torch.manual_seed(seed + 1)
    if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed + 2)
    random.seed(seed + 4)


In [4]:
#data loading & pre-processing

def clean_text(text):
    # clean latex maths
    text = p.sub(' [ math ] ', text)
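    # clean invisible chars (note: this strips every non-ASCII character, so the
    # non-ASCII keys of punct_mapping can no longer match in the loop below)
    text = p_space.sub(r'', text)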
    # clean punctuations
    for punct in punct_mapping:
        if punct in text:
            text = text.replace(punct, punct_mapping[punct])
    tokens = []
    for token in text.split():
        # replace contractions & correct misspells
        token = mispell_dict.get(token.lower(), token)
        tokens.append(token)
    text = ' '.join(tokens)
    return text

def load_data(train_path=train_path, test_path=test_path, debug=False):
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)
    if debug:
        train_df = train_df[:10000]
        test_df = test_df[:10000]
    s = time.time()
    train_df['question_text'] = train_df['question_text'].apply(clean_text)
    test_df['question_text'] = test_df['question_text'].apply(clean_text)
    print('preprocessing {}s'.format(time.time() - s))
    return train_df, test_df

In [5]:
# vocabulary functions
def build_counter(sents, splited=False):
    counter = Counter()
    for sent in tqdm(sents, ascii=True, desc='building counter'):
        if splited:
            counter.update(sent)
        else:
            counter.update(sent.split())
    return counter


def build_vocab(counter, max_vocab_size):
    most_common = counter.most_common(max_vocab_size)
    # ids: 0 = <PAD>, 1..n = tokens by descending frequency, n + 1 = <UNK>
    # (n can be smaller than max_vocab_size, e.g. in debug mode)
    vocab = {'token2id': {'<PAD>': 0, '<UNK>': len(most_common) + 1}}
    vocab['token2id'].update(
        {token: _id + 1 for _id, (token, count) in
         tqdm(enumerate(most_common), desc='building vocab')})
    vocab['id2token'] = {v: k for k, v in vocab['token2id'].items()}
    return vocab

def tokens2ids(tokens, token2id):
    # unknown tokens map to the explicit <UNK> id
    unk_id = token2id['<UNK>']
    return [token2id.get(token, unk_id) for token in tokens]

#  data set
class TextDataset(Dataset):
    def __init__(self, df, vocab=None, num_max=None, max_seq_len=100,
                 max_vocab_size=95000):
        if num_max is not None:
            df = df[:num_max]

        self.src_sents = df['question_text'].tolist()
        self.qids = df['qid'].values
        if vocab is None:
            src_counter = build_counter(self.src_sents)
            vocab = build_vocab(src_counter, max_vocab_size)
        self.vocab = vocab
        if 'src_seqs' not in df.columns:
            self.src_seqs = []
            for sent in tqdm(self.src_sents, desc='tokenize'):
                seq = tokens2ids(sent.split()[:max_seq_len], vocab['token2id'])
                self.src_seqs.append(seq)
        else:
            self.src_seqs = df['src_seqs'].tolist()
        if 'target' in df.columns:
            self.targets = df['target'].values
        else:
            # no labels (test set): random placeholders keep the batching code unchanged
            self.targets = np.random.randint(2, size=(len(self.src_sents),))
        self.max_seq_len = max_seq_len

    def __len__(self):
        return len(self.src_sents)

    # for bucket iterator
    def get_keys(self):
        lens = np.fromiter(
            tqdm(((min(self.max_seq_len, len(c.split()))) for c in self.src_sents), desc='generate lens'),
            dtype=np.int32)
        return lens

    def __getitem__(self, index):
        return self.qids[index], self.src_sents[index], self.src_seqs[index], self.targets[index]

In [6]:

#  dynamic padding
def _pad_sequences(seqs):
    lens = [len(seq) for seq in seqs]
    max_len = max(lens)

    padded_seqs = torch.zeros(len(seqs), max_len).long()
    for i, seq in enumerate(seqs):
        end = lens[i]
        padded_seqs[i, :end] = torch.LongTensor(seq)
    return padded_seqs, lens


def collate_fn(data):
    qids, src_sents, src_seqs, targets, = zip(*data)
    src_seqs, src_lens = _pad_sequences(src_seqs)
    return qids, src_sents, src_seqs, src_lens, torch.FloatTensor(targets)


#  bucket iterator
def divide_chunks(l, n):
    if n == len(l):
        yield np.arange(len(l), dtype=np.int32), l
    else:
        # looping till length l
        for i in range(0, len(l), n):
            data = l[i:i + n]
            yield np.arange(i, i + len(data), dtype=np.int32), data


def prepare_buckets(lens, bucket_size, batch_size, shuffle_data=True, indices=None):
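    lens = -lens  # negate so that argsort yields descending length
    assert bucket_size % batch_size == 0 or bucket_size == len(lens)
    if indices is None:
        if shuffle_data:
            indices = shuffle(np.arange(len(lens), dtype=np.int32))
        else:
            indices = np.arange(len(lens), dtype=np.int32)
    # reorder the keys so that position k refers to example indices[k];
    # this also covers externally supplied (e.g. weighted) indices
    lens = lens[indices]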
    new_indices = []
    extra_batch = None
    for chunk_index, chunk in (divide_chunks(lens, bucket_size)):
        # sort indices in bucket by descending order of length
        indices_sorted = chunk_index[np.argsort(chunk, axis=-1)]
        batches = []
        for _, batch in divide_chunks(indices_sorted, batch_size):
            if len(batch) == batch_size:
                batches.append(batch.tolist())
            else:
                assert extra_batch is None
                assert batch is not None
                extra_batch = batch
        # shuffling batches within buckets
        if shuffle_data:
            batches = shuffle(batches)
        for batch in batches:
            new_indices.extend(batch)

    if extra_batch is not None:
        new_indices.extend(extra_batch)
    return indices[new_indices]


class BucketSampler(Sampler):

    def __init__(self, data_source, sort_keys, bucket_size=None, batch_size=1536, shuffle_data=True):
        super().__init__(data_source)
        self.shuffle = shuffle_data
        self.batch_size = batch_size
        self.sort_keys = sort_keys
        self.bucket_size = bucket_size if bucket_size is not None else len(sort_keys)
        if not shuffle_data:
            self.index = prepare_buckets(self.sort_keys, bucket_size=self.bucket_size, batch_size=self.batch_size,
                                         shuffle_data=self.shuffle)
        else:
            self.index = None
        self.weights = None

    def set_weights(self, w):
        # w holds one non-negative sampling weight per example
        w = np.asarray(w, dtype=np.float64)
        assert np.all(w >= 0)
        self.weights = w / w.sum()

    def __iter__(self):
        indices = None
        if self.weights is not None:
            total = len(self.sort_keys)

            indices = np.random.choice(total, (total,), p=self.weights)
        if self.shuffle:
            self.index = prepare_buckets(self.sort_keys, bucket_size=self.bucket_size, batch_size=self.batch_size,
                                         shuffle_data=self.shuffle, indices=indices)
        return iter(self.index)

    def get_reverse_indexes(self):
        indexes = np.zeros((len(self.index),), dtype=np.int32)
        for i, j in enumerate(self.index):
            indexes[j] = i
        return indexes

    def __len__(self):
        return len(self.sort_keys)


In [7]:
# embedding stuffs
def read_embedding(embedding_file):
    """
    Read an embedding file into a dictionary.
    Each line of the file should look like: word 0.13 0.22 ... 0.44
    :param embedding_file: path of the embedding file.
    :return: a dictionary mapping each word to its embedding (numpy array)
    """
    if os.path.basename(embedding_file) == 'wiki-news-300d-1M.vec':
        skip_head = 1  # fastText .vec files start with a "<count> <dim>" header line
    else:
        skip_head = None
    if os.path.basename(embedding_file) == 'paragram_300_sl999.txt':
        encoding = 'latin'
    else:
        encoding = 'utf-8'
    embeddings_index = {}
    t_chunks = pd.read_csv(embedding_file, index_col=0, skiprows=skip_head, encoding=encoding, sep=' ', header=None,
                           quoting=3,
                           doublequote=False, quotechar=None, engine='c', na_filter=False, low_memory=True,
                           chunksize=10000)
    for t in t_chunks:
        for k, v in zip(t.index.values, t.values):
            embeddings_index[k] = v.astype(np.float32)
    return embeddings_index


def get_emb(embedding_index, word, word_raw):
    if word == word_raw:
        return None
    else:
        return embedding_index.get(word, None)


def embedding2numpy(embedding_path, word_index, num_words, embed_size, emb_mean=0., emb_std=0.5,
                    report_stats=False):
    embedding_index = read_embedding(embedding_path)
    num_words = min(num_words + 2, len(word_index))
    if report_stats:
        all_coefs = []
        for v in embedding_index.values():
            all_coefs.append(v.reshape([-1, 1]))
        all_coefs = np.concatenate(all_coefs)
        print(all_coefs.mean(), all_coefs.std(), np.linalg.norm(all_coefs, axis=-1).mean())
    embedding_matrix = np.zeros((num_words, embed_size), dtype=np.float32)
    oov = 0
    oov_cap = 0
    oov_upper = 0
    oov_lower = 0
    for word, i in word_index.items():
        if i == 0 or i >= num_words:  # skip padding and ids outside the matrix
            continue
        embedding_vector = embedding_index.get(word, None)
        if embedding_vector is None:
            # fall back to the lower-case, upper-case, then capitalized form
            embedding_vector = get_emb(embedding_index, word.lower(), word)
            if embedding_vector is not None:
                oov_lower += 1
            else:
                embedding_vector = get_emb(embedding_index, word.upper(), word)
                if embedding_vector is not None:
                    oov_upper += 1
                else:
                    embedding_vector = get_emb(embedding_index, word.capitalize(), word)
                    if embedding_vector is not None:
                        oov_cap += 1
                    else:
                        oov += 1
                        embedding_vector = np.random.normal(emb_mean, emb_std, size=embed_size)

        embedding_matrix[i] = embedding_vector

    print('oov %d, recovered via lower/upper/capitalized %d/%d/%d, vocab size %d' % (
        oov, oov_lower, oov_upper, oov_cap, len(word_index)))
    return embedding_matrix


def load_embedding(vocab, max_vocab_size, embed_size):
    # load embedding
    embedding_matrix1 = embedding2numpy(embedding_glove, vocab['token2id'], max_vocab_size, embed_size,
                                        emb_mean=-0.005838499, emb_std=0.48782197, report_stats=False)
    # -0.005838499 0.48782197 0.37823704
    # oov 9196
    # embedding_matrix2 = embedding2numpy(embedding_fasttext, vocab.token2id, max_vocab_size, embed_size,
    #                                    report_stats=False, emb_mean=-0.0033469985, emb_std=0.109855495, )
    # -0.0033469985 0.109855495 0.07475414
    # oov 12885
    embedding_matrix2 = embedding2numpy(embedding_para, vocab['token2id'], max_vocab_size, embed_size,
                                        emb_mean=-0.0053247833, emb_std=0.49346462, report_stats=False)
    # -0.0053247833 0.49346462 0.3828983
    # oov 9061
    # embedding_w2v
    # -0.003527845 0.13315111 0.09407869
    # oov 18927
    return [embedding_matrix1, embedding_matrix2]


In [8]:

# cyclic learning rate
def set_lr(optimizer, lr):
    for g in optimizer.param_groups:
        g['lr'] = lr


class CyclicLR:
    def __init__(self, optimizer, base_lr=0.001, max_lr=0.002, step_size=300., mode='triangular',
                 gamma=0.99994, scale_fn=None, scale_mode='cycle'):
        super(CyclicLR, self).__init__()
        self.optimizer = optimizer
        self.base_lr = base_lr
        self.max_lr = max_lr
        self.step_size = step_size
        self.mode = mode
        self.gamma = gamma
        if scale_fn is None:
            if self.mode == 'triangular':
                self.scale_fn = lambda x: 1.
                self.scale_mode = 'cycle'
            elif self.mode == 'triangular2':
                self.scale_fn = lambda x: 1 / (2. ** (x - 1))
                self.scale_mode = 'cycle'
            elif self.mode == 'exp_range':
                self.scale_fn = lambda x: gamma ** x
                self.scale_mode = 'iterations'
        else:
            self.scale_fn = scale_fn
            self.scale_mode = scale_mode
        self.clr_iterations = 0.
        self.trn_iterations = 0.
        self.history = {}
        self._reset()

    def _reset(self, new_base_lr=None, new_max_lr=None,
               new_step_size=None):
        if new_base_lr is not None:
            self.base_lr = new_base_lr
        if new_max_lr is not None:
            self.max_lr = new_max_lr
        if new_step_size is not None:
            self.step_size = new_step_size
        self.clr_iterations = 0.

    def clr(self):
        cycle = np.floor(1 + self.clr_iterations / (2 * self.step_size))
        x = np.abs(self.clr_iterations / self.step_size - 2 * cycle + 1)
        if self.scale_mode == 'cycle':
            return self.base_lr + (self.max_lr - self.base_lr) * np.maximum(0, (1 - x)) * self.scale_fn(cycle)
        else:
            return self.base_lr + (self.max_lr - self.base_lr) * np.maximum(0, (1 - x)) * self.scale_fn(
                self.clr_iterations)

    def on_train_begin(self):
        if self.clr_iterations == 0:
            set_lr(self.optimizer, self.base_lr)
        else:
            set_lr(self.optimizer, self.clr())

    def on_batch_end(self):
        self.trn_iterations += 1
        self.clr_iterations += 1
        set_lr(self.optimizer, self.clr())


In [9]:
# model

class Capsule(nn.Module):
    def __init__(self, input_dim_capsule=1024, num_capsule=5, dim_capsule=5, routings=4):
        super(Capsule, self).__init__()
        self.num_capsule = num_capsule
        self.dim_capsule = dim_capsule
        self.routings = routings
        self.activation = self.squash
        self.W = nn.Parameter(
            nn.init.xavier_normal_(torch.empty(1, input_dim_capsule, self.num_capsule * self.dim_capsule)))

    def forward(self, x):
        u_hat_vecs = torch.matmul(x, self.W)
        batch_size = x.size(0)
        input_num_capsule = x.size(1)
        u_hat_vecs = u_hat_vecs.view((batch_size, input_num_capsule,
                                      self.num_capsule, self.dim_capsule))
        u_hat_vecs = u_hat_vecs.permute(0, 2, 1,
                                        3).contiguous()  # (batch_size,num_capsule,input_num_capsule,dim_capsule)
        with torch.no_grad():
            b = torch.zeros_like(u_hat_vecs[:, :, :, 0])
        for i in range(self.routings):
            c = torch.nn.functional.softmax(b, dim=1)  # (batch_size,num_capsule,input_num_capsule)
            outputs = self.activation(torch.sum(c.unsqueeze(-1) * u_hat_vecs, dim=2))  # bij,bijk->bik
            if i < self.routings - 1:
                b = (torch.sum(outputs.unsqueeze(2) * u_hat_vecs, dim=-1))  # bik,bijk->bij
        return outputs  # (batch_size, num_capsule, dim_capsule)

    def squash(self, x, axis=-1):
        # simplified squash: rescale each capsule vector to (almost) unit length
        s_squared_norm = (x ** 2).sum(axis, keepdim=True)
        scale = torch.sqrt(s_squared_norm + 1e-7)
        return x / scale


#  model
class Attention(nn.Module):
    def __init__(self, feature_dim, max_seq_len=70):
        super().__init__()
        self.attention_fc = nn.Linear(feature_dim, 1)
        self.bias = nn.Parameter(torch.zeros(1, max_seq_len, 1, requires_grad=True))

    def forward(self, rnn_output):
        """
        forward attention scores and attended vectors
        :param rnn_output: (#batch,#seq_len,#feature)
        :return: attended_outputs (#batch,#feature)
        """
        attention_weights = self.attention_fc(rnn_output)
        seq_len = rnn_output.size(1)
        attention_weights = self.bias[:, :seq_len, :] + attention_weights
        attention_weights = torch.tanh(attention_weights)
        attention_weights = torch.exp(attention_weights)
        attention_weights_sum = torch.sum(attention_weights, dim=1, keepdim=True) + 1e-7
        attention_weights = attention_weights / attention_weights_sum
        attended = torch.sum(attention_weights * rnn_output, dim=1)
        return attended


class InsincereModel(nn.Module):
    def __init__(self, device, hidden_dim, hidden_dim_fc, embedding_matrixs, vocab_size=None, embedding_dim=None,
                 dropout=0.1, num_capsule=5, dim_capsule=5, capsule_out_dim=1, alpha=0.8, beta=0.8,
                 finetuning_vocab_size=120002,
                 embedding_mode='mixup', max_seq_len=70):
        super(InsincereModel, self).__init__()
        self.beta = beta
        self.embedding_mode = embedding_mode
        self.finetuning_vocab_size = finetuning_vocab_size
        self.alpha = alpha
        vocab_size, embedding_dim = embedding_matrixs[0].shape
        self.raw_embedding_weights = embedding_matrixs
        # from_pretrained is a classmethod; it builds the layer from the matrix
        # and freezes the weights by default
        self.embedding_0 = nn.Embedding.from_pretrained(torch.from_numpy(embedding_matrixs[0]))
        self.embedding_1 = nn.Embedding.from_pretrained(torch.from_numpy(embedding_matrixs[1]))
        self.embedding_mean = nn.Embedding.from_pretrained(
            torch.from_numpy((embedding_matrixs[0] + embedding_matrixs[1]) / 2))
        self.learnable_embedding = nn.Embedding(finetuning_vocab_size, embedding_dim, padding_idx=0)
        nn.init.constant_(self.learnable_embedding.weight, 0)
        self.learn_embedding = False
        self.spatial_dropout = nn.Dropout2d(p=0.2)
        self.device = device
        self.hidden_dim = hidden_dim
        self.rnn0 = nn.LSTM(embedding_dim, int(hidden_dim / 2), num_layers=1, bidirectional=True, batch_first=True)
        self.rnn1 = nn.GRU(hidden_dim, int(hidden_dim / 2), num_layers=1, bidirectional=True, batch_first=True)
        self.capsule = Capsule(input_dim_capsule=self.hidden_dim, num_capsule=num_capsule, dim_capsule=dim_capsule)
        self.dropout2 = nn.Dropout(0.3)
        self.lincaps = nn.Linear(num_capsule * dim_capsule, capsule_out_dim)
        self.attention1 = Attention(self.hidden_dim, max_seq_len=max_seq_len)
        self.attention2 = Attention(self.hidden_dim, max_seq_len=max_seq_len)
        self.fc = nn.Linear(hidden_dim * 4 + capsule_out_dim, hidden_dim_fc)
        self.norm = torch.nn.LayerNorm(hidden_dim * 4 + capsule_out_dim)
        self.dropout1 = nn.Dropout(0.2)
        self.dropout_linear = nn.Dropout(p=dropout)
        self.hidden2out = nn.Linear(hidden_dim_fc, 1)

    def set_embedding_mode(self, embedding_mode):
        self.embedding_mode = embedding_mode

    def enable_learning_embedding(self):
        self.learn_embedding = True

    def init_weights(self):
        ih = (param.data for name, param in self.named_parameters() if 'weight_ih' in name)
        hh = (param.data for name, param in self.named_parameters() if 'weight_hh' in name)
        b = (param.data for name, param in self.named_parameters() if 'bias' in name)
        for k in ih:
            nn.init.xavier_uniform_(k)
        for k in hh:
            nn.init.orthogonal_(k)
        for k in b:
            nn.init.constant_(k, 0)

    def apply_spatial_dropout(self, emb):
        emb = emb.permute(0, 2, 1).unsqueeze(-1)
        emb = self.spatial_dropout(emb).squeeze(-1).permute(0, 2, 1)
        return emb

    def forward(self, seqs, lens, return_logits=True):
        # forward embeddings
        if self.embedding_mode == 'mixup':
            emb0 = self.embedding_0(seqs)  # batch_size x seq_len x embedding_dim
            emb1 = self.embedding_1(seqs)
            prob = np.random.beta(self.alpha, self.beta, size=(seqs.size(0), 1, 1)).astype(np.float32)
            prob = torch.from_numpy(prob).to(self.device)
            emb = emb0 * prob + emb1 * (1 - prob)
        elif self.embedding_mode == 'emb0':
            emb = self.embedding_0(seqs)
        elif self.embedding_mode == 'emb1':
            emb = self.embedding_1(seqs)
        elif self.embedding_mode == 'mean':
            emb = self.embedding_mean(seqs)
        else:
            raise ValueError('unknown embedding_mode: {}'.format(self.embedding_mode))
        if self.learn_embedding:
            seq_clamped = torch.clamp(seqs, 0, self.finetuning_vocab_size - 1)
            emb_learned = self.learnable_embedding(seq_clamped)
            emb = emb + emb_learned
        emb = self.apply_spatial_dropout(emb)
        # forward rnn encoder
        lstm_output0, _ = self.rnn0(emb)
        lstm_output1, _ = self.rnn1(lstm_output0)
        # forward capsule
        content3 = self.capsule(lstm_output1)
        batch_size = content3.size(0)
        content3 = content3.view(batch_size, -1)
        content3 = self.dropout2(content3)
        content3 = torch.relu(self.lincaps(content3))
        # forward feature extractor
        feature_att1 = self.attention1(lstm_output0)
        feature_att2 = self.attention2(lstm_output1)
        feature_avg2 = torch.mean(lstm_output1, dim=1)
        feature_max2, _ = torch.max(lstm_output1, dim=1)
        feature = torch.cat((feature_att1, feature_att2, feature_avg2, feature_max2, content3), dim=-1)
        feature = self.norm(feature)
        feature = self.dropout1(feature)
        feature = torch.relu(feature)
        # forward dense layer
        out = self.fc(feature)
        out = self.dropout_linear(out)
        out = self.hidden2out(out)  # batch_size x 1
        if not return_logits:
            out = torch.sigmoid(out)
        return out



In [10]:

#  util functions
class AverageMeter(object):
    """Computes and stores the average and current value"""

    def __init__(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def margin_score(targets, predictions):
    return ((targets == 1) * (1 - predictions) + (targets == 0) * (predictions)).mean()


def report_perf(valid_dataset, predictions_va, threshold, idx, epoch_cur, desc='val set'):
    val_f1 = f1_score(valid_dataset.targets, predictions_va > threshold)
    val_auc = roc_auc_score(valid_dataset.targets, predictions_va)
    val_margin = margin_score(valid_dataset.targets, predictions_va)
    print('idx {} epoch {} {} f1 : {:.4f} auc : {:.4f} margin : {:.4f}'.format(
        idx,
        epoch_cur,
        desc,
        val_f1,
        val_auc,
        val_margin))


def get_gpu_memory_usage(device_id):
    return round(torch.cuda.max_memory_allocated(device_id) / 1000 / 1000)


def avg(loss_list):
    if len(loss_list) == 0:
        return 0
    else:
        return sum(loss_list) / len(loss_list)



In [11]:


# evaluation
def eval_model(model, data_iter, device, order_index=None):
    model.eval()
    predictions = []
    with torch.no_grad():
        for batch_data in data_iter:
            qid_batch, src_sents, src_seqs, src_lens, tgts = batch_data
            src_seqs = src_seqs.to(device)
            out = model(src_seqs, src_lens, return_logits=False)
            predictions.append(out)
    predictions = torch.cat(predictions, dim=0)
    if order_index is not None:
        predictions = predictions[order_index]
    predictions = predictions.to('cpu').numpy().ravel()
    return predictions


In [12]:
# cross validation

def cv(train_df, test_df, device=None, n_folds=10, shared_resources=None, share=True, **kwargs):
    if device is None:
        device = torch.device("cuda:{}".format(0) if torch.cuda.is_available() else "cpu")
    max_vocab_size = kwargs['max_vocab_size']
    embed_size = kwargs['embed_size']
    threshold = kwargs['threshold']
    max_seq_len = kwargs['max_seq_len']
    if shared_resources is None:
        shared_resources = {}
    if share:
        if 'vocab' not in shared_resources:
            # build the vocabulary from the train and test questions together
            counter = build_counter(chain(train_df['question_text'], test_df['question_text']))
            vocab = build_vocab(counter, max_vocab_size=max_vocab_size)
            shared_resources['vocab'] = vocab
            # tokenize sentences
            seqs = []
            for sent in tqdm(train_df['question_text'], desc='tokenize'):
                seq = tokens2ids(sent.split()[:max_seq_len], vocab['token2id'])
                seqs.append(seq)
            train_df['src_seqs'] = seqs
            seqs = []
            for sent in tqdm(test_df['question_text'], desc='tokenize'):
                seq = tokens2ids(sent.split()[:max_seq_len], vocab['token2id'])
                seqs.append(seq)
            test_df['src_seqs'] = seqs
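    # load the embeddings once, but only if a shared vocab exists (avoids a KeyError when share=False)
    if 'vocab' in shared_resources and 'embedding_matrix' not in shared_resources:
        embedding_matrix = load_embedding(shared_resources['vocab'], max_vocab_size, embed_size)
        shared_resources['embedding_matrix'] = embedding_matrix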
    splits = list(
        StratifiedKFold(n_splits=n_folds, shuffle=True).split(train_df['target'], train_df['target']))
    scores = []
    best_threshold = []
    best_threshold_global = None
    best_score = -1
    predictions_train_reduced = []
    targets_train = []
    predictions_tes_reduced = np.zeros((len(test_df), n_folds))
    predictions_te =  np.zeros((len(test_df),))
    for idx, (train_idx, valid_idx) in enumerate(splits):
        grow_df = train_df.iloc[train_idx].reset_index(drop=True)
        dev_df = train_df.iloc[valid_idx].reset_index(drop=True)
        predictions_te_i, predictions_va, targets_va, best_threshold_i = main(grow_df, dev_df, test_df, device,
                                                                              **kwargs,
                                                                              idx=idx,
                                                                              shared_resources=shared_resources,
                                                                              return_reduced=True)
        # predictions_te_i: this fold's test predictions, shape (len(test_df),)
        predictions_tes_reduced[:, idx] = predictions_te_i
        scores.append([f1_score(targets_va, predictions_va > threshold), roc_auc_score(targets_va, predictions_va)])
        best_threshold.append(best_threshold_i)
        predictions_te += predictions_te_i / n_folds
        predictions_train_reduced.append(predictions_va)
        targets_train.append(targets_va)
    # mean pairwise correlation between the folds' test predictions
    coeff = (np.corrcoef(predictions_tes_reduced, rowvar=False).sum() - n_folds) / n_folds / (n_folds - 1)
    # pool the out-of-fold predictions and targets
    predictions_train_reduced = np.concatenate(predictions_train_reduced)
    targets_train = np.concatenate(targets_train)  # len_train
    # search the threshold that maximises F1 on the pooled out-of-fold predictions
    for t in np.arange(0, 1, 0.01):
        score = f1_score(targets_train, predictions_train_reduced > t)
        if score > best_score:
            best_score = score
            best_threshold_global = t
    print('avg of best threshold {} macro-f1 best threshold {} best score {}'.format(best_threshold,
                                                                                     best_threshold_global, best_score))
    return predictions_te, predictions_te, scores, best_threshold_global, coeff

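The `coeff` value returned by `cv` is the mean off-diagonal entry of the correlation matrix between the folds' test predictions: the matrix sum minus the `n_folds` ones on the diagonal, divided by the `n_folds * (n_folds - 1)` remaining entries. A minimal sketch on simulated predictions (random toy data, not the kernel's outputs) checks that formula against an explicit off-diagonal mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_folds = 1000, 5

# Simulated fold predictions: a shared signal plus independent noise per fold.
signal = rng.random(n_samples)
fold_predictions = signal[:, None] + 0.1 * rng.standard_normal((n_samples, n_folds))

# Columns are treated as variables, giving an (n_folds, n_folds) matrix.
corr = np.corrcoef(fold_predictions, rowvar=False)

# Same formula as in cv(): drop the diagonal (all ones) and average the rest.
coeff = (corr.sum() - n_folds) / n_folds / (n_folds - 1)

# Equivalent, more explicit form using a mask over the off-diagonal entries.
off_diagonal = corr[~np.eye(n_folds, dtype=bool)]
print(coeff, off_diagonal.mean())
```

A value close to 1 suggests the fold models make nearly identical predictions, so averaging them is likely to add little diversity.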

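Both `cv` and `main` pick the decision threshold by scanning a grid from 0.00 to 0.99 and keeping the value that maximises F1. The sketch below isolates that search. The helper names `f1` and `search_threshold` are hypothetical, the data is synthetic, and a small NumPy F1 stands in for `sklearn.metrics.f1_score` (which the kernel uses) to keep the example dependency-light.

```python
import numpy as np


def f1(targets, predicted):
    # Plain-NumPy F1 for binary labels; `predicted` is a boolean array.
    tp = np.sum((targets == 1) & predicted)
    fp = np.sum((targets == 0) & predicted)
    fn = np.sum((targets == 1) & ~predicted)
    return 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0


def search_threshold(targets, predictions, grid=np.arange(0, 1, 0.01)):
    # Hypothetical helper: scan the grid and keep the threshold with the best F1.
    best_score, best_threshold = -1.0, None
    for t in grid:
        score = f1(targets, predictions > t)
        if score > best_score:
            best_score, best_threshold = score, t
    return best_threshold, best_score


# Toy data: positives score in [0.6, 1.0), negatives in [0.0, 0.4).
rng = np.random.default_rng(0)
targets = (rng.random(2000) < 0.1).astype(int)
predictions = 0.6 * targets + 0.4 * rng.random(2000)

best_threshold, best_score = search_threshold(targets, predictions)
print(best_threshold, best_score)
```

Because the comparison is a strict `>`, ties are resolved in favour of the smallest threshold on the grid, which matches the loops in `cv` and `main`.
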
In [13]:
# main routine
def main(train_df, valid_df, test_df, device=None, epochs=3, fine_tuning_epochs=3, batch_size=512, learning_rate=0.001,
         learning_rate_max_offset=0.001, dropout=0.1,
         threshold=None,
         max_vocab_size=95000, embed_size=300, max_seq_len=70, print_every_step=500, idx=0, shared_resources=None,
         return_reduced=True):
    if device is None:
        device = torch.device("cuda:{}".format(0) if torch.cuda.is_available() else "cpu")

    if shared_resources is None:
        shared_resources = {}
    batch_time = AverageMeter()
    data_time = AverageMeter()
    mean_len = AverageMeter()
    # reuse the shared vocab and embedding matrix when available, otherwise build them
    if 'vocab' not in shared_resources:
        counter = build_counter(chain(train_df['question_text'], test_df['question_text']))
        vocab = build_vocab(counter, max_vocab_size=max_vocab_size)
    else:
        vocab = shared_resources['vocab']
    if 'embedding_matrix' not in shared_resources:
        embedding_matrix = load_embedding(vocab, max_vocab_size, embed_size)
    else:
        embedding_matrix = shared_resources['embedding_matrix']
    # create test dataset
    test_dataset = TextDataset(test_df, vocab=vocab, max_seq_len=max_seq_len)
    tb = BucketSampler(test_dataset, test_dataset.get_keys(), batch_size=batch_size,
                       shuffle_data=False)
    test_iter = DataLoader(dataset=test_dataset,
                           batch_size=batch_size,
                           sampler=tb,
                           # shuffle=False,
                           num_workers=0,
                           collate_fn=collate_fn)

    train_dataset = TextDataset(train_df, vocab=vocab, max_seq_len=max_seq_len)
    # keys = train_dataset.get_keys()  # for bucket sorting
    valid_dataset = TextDataset(valid_df, vocab=vocab, max_seq_len=max_seq_len)
    vb = BucketSampler(valid_dataset, valid_dataset.get_keys(), batch_size=batch_size,
                       shuffle_data=False)
    valid_index_reverse = vb.get_reverse_indexes()
    # init model and optimizers
    model = InsincereModel(device, hidden_dim=256, hidden_dim_fc=16, dropout=dropout,
                           embedding_matrixs=embedding_matrix,
                           vocab_size=len(vocab['token2id']),
                           embedding_dim=embed_size, max_seq_len=max_seq_len)
    if idx == 0:
        print(model)
        print('total trainable {}'.format(count_parameters(model)))
    model = model.to(device)
    optimizer = optim.Adam([p for p in model.parameters() if p.requires_grad], lr=learning_rate)

    # init iterator
    train_iter = DataLoader(dataset=train_dataset,
                            batch_size=batch_size,
                            # shuffle=True,
                            # sampler=NegativeSubSampler(train_dataset, train_dataset.targets),
                            sampler=BucketSampler(train_dataset, train_dataset.get_keys(), bucket_size=batch_size * 20,
                                                  batch_size=batch_size),
                            num_workers=0,
                            collate_fn=collate_fn)

    valid_iter = DataLoader(dataset=valid_dataset,
                            batch_size=batch_size,
                            sampler=vb,
                            # shuffle=False,
                            collate_fn=collate_fn)

    # train model

    loss_list = []
    global_steps = 0
    total_steps = epochs * len(train_iter)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    end = time.time()
    predictions_tes = []
    predictions_vas = []
    n_fge = 0
    clr = CyclicLR(optimizer, base_lr=learning_rate, max_lr=learning_rate + learning_rate_max_offset,
                   step_size=300, mode='exp_range')
    clr.on_train_begin()
    # from here on, fine_tuning_epochs holds the index of the first fine-tuning epoch
    fine_tuning_epochs = epochs - fine_tuning_epochs
    predictions_te = None
    for epoch in tqdm(range(epochs)):

        fine_tuning = epoch >= fine_tuning_epochs
        start_fine_tuning = fine_tuning_epochs == epoch
        if start_fine_tuning:
            # enable the learnable embedding and rebuild the optimizer so that it
            # also covers the newly trainable parameters
            model.enable_learning_embedding()
            optimizer = optim.Adam([p for p in model.parameters() if p.requires_grad], lr=learning_rate)
            global_steps = 0
            total_steps = (epochs - fine_tuning_epochs) * len(train_iter)
            # restart the cyclic schedule; snapshots are taken every 2 * step_size batches
            clr = CyclicLR(optimizer, base_lr=learning_rate, max_lr=learning_rate + learning_rate_max_offset,
                           step_size=int(len(train_iter) / 8))
            clr.on_train_begin()
            # running averages of the snapshot predictions (fast geometric ensembling)
            predictions_te = np.zeros((len(test_df),))
            predictions_va = np.zeros((len(valid_dataset.targets),))
        for batch_data in train_iter:
            data_time.update(time.time() - end)
            qids, src_sents, src_seqs, src_lens, tgts = batch_data
            # tracks the total number of tokens per batch (reported as 'maxlen' in the log)
            mean_len.update(sum(src_lens))
            src_seqs = src_seqs.to(device)
            tgts = tgts.to(device)
            model.train()
            optimizer.zero_grad()

            out = model(src_seqs, src_lens, return_logits=True).view(-1)
            loss = loss_fn(out, tgts)
            loss.backward()
            optimizer.step()

            loss_list.append(loss.item())

            global_steps += 1
            batch_time.update(time.time() - end)
            end = time.time()
            if global_steps % print_every_step == 0:
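                # report 0 when running on CPU, where no CUDA device can be queried
                curr_gpu_memory_usage = (get_gpu_memory_usage(device_id=torch.cuda.current_device())
                                         if torch.cuda.is_available() else 0)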
                print('Global step: {}/{} Total loss: {:.4f}  Current GPU memory '
                      'usage: {} maxlen {} '.format(global_steps, total_steps, avg(loss_list), curr_gpu_memory_usage,
                                                    mean_len.avg))
                loss_list = []

                # print(f'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                #      f'Data {data_time.val:.3f} ({data_time.avg:.3f})\t')
            # take a snapshot every 2 * step_size batches during fine-tuning
            if fine_tuning and global_steps % (2 * clr.step_size) == 0:
                # predictions of the current snapshot alone (logged as 'val set mean')
                predictions_te_tmp2 = eval_model(model, test_iter, device)
                predictions_va_tmp2 = eval_model(model, valid_iter, device, valid_index_reverse)
                report_perf(valid_dataset, predictions_va_tmp2, threshold, idx, epoch,
                            desc='val set mean')
                # update the running mean over the n_fge snapshots seen so far;
                # test predictions stay in sampler order and are reordered after training
                predictions_te = (predictions_te * n_fge + predictions_te_tmp2) / (n_fge + 1)
                predictions_va = (predictions_va * n_fge + predictions_va_tmp2) / (n_fge + 1)
                report_perf(valid_dataset, predictions_va, threshold, idx, epoch,
                            desc='val set (fge)')
                predictions_tes.append(predictions_te_tmp2.reshape([-1, 1]))
                predictions_vas.append(predictions_va_tmp2.reshape([-1, 1]))
                n_fge += 1

            clr.on_batch_end()
        if not fine_tuning:
            predictions_va = eval_model(model, valid_iter, device, valid_index_reverse)
            report_perf(valid_dataset, predictions_va, threshold, idx, epoch)
    # pprint(model.attention1.bias.data.to('cpu'))
    # pprint(model.attention2.bias.data.to('cpu'))
    # reorder index
    if predictions_te is not None:
        predictions_te = predictions_te[tb.get_reverse_indexes()]
    else:
        predictions_te = eval_model(model, test_iter, device, tb.get_reverse_indexes())
    best_score = -1
    best_threshold = None
    for t in np.arange(0, 1, 0.01):
        score = f1_score(valid_dataset.targets, predictions_va > t)
        if score > best_score:
            best_score = score
            best_threshold = t
    print('best threshold on validation set: {:.2f} score {:.4f}'.format(best_threshold, best_score))
    if not return_reduced and len(predictions_vas) > 0:
        predictions_te = np.concatenate(predictions_tes, axis=1)
        predictions_te = predictions_te[tb.get_reverse_indexes(), :]
        predictions_va = np.concatenate(predictions_vas, axis=1)

    # test predictions, validation predictions, validation targets, best threshold
    return predictions_te, predictions_va, valid_dataset.targets, best_threshold


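During fine-tuning, `main` keeps a running mean of the snapshot predictions instead of storing every snapshot and averaging at the end. The sketch below replays that update rule on random stand-in arrays (toy data, not model outputs) and checks that it matches the plain mean over all snapshots.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_snapshots = 8, 4

# Stand-ins for the predictions of successive model snapshots (random toy data).
snapshots = [rng.random(n_samples) for _ in range(n_snapshots)]

running_mean = np.zeros(n_samples)
n_fge = 0  # number of snapshots averaged so far
for snapshot in snapshots:
    # Same update as in main(): rescale the old mean by its count, add the
    # new snapshot, then renormalise by the new count.
    running_mean = (running_mean * n_fge + snapshot) / (n_fge + 1)
    n_fge += 1

# The incremental update agrees with averaging all snapshots at once.
print(np.allclose(running_mean, np.mean(snapshots, axis=0)))
```

The incremental form needs memory for only one prediction vector per data set, which matters here because the averaged test predictions cover every row of `test_df`.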

In [14]:
# seeding
set_seed(233)
epochs = 8
batch_size = 512
learning_rate = 0.001
learning_rate_max_offset = 0.002
fine_tuning_epochs = 2
threshold = 0.31
max_vocab_size = 120000
embed_size = 300
print_every_step = 500
max_seq_len = 70
share = True
dropout = 0.1
sub = pd.read_csv('../input/sample_submission.csv')
train_df, test_df = load_data()
# shuffling
trn_idx = np.random.permutation(len(train_df))
train_df = train_df.iloc[trn_idx].reset_index(drop=True)
n_folds = 5
n_repeats = 1
args = {'epochs': epochs, 'batch_size': batch_size, 'learning_rate': learning_rate, 'threshold': threshold,
        'max_vocab_size': max_vocab_size,
        'embed_size': embed_size, 'print_every_step': print_every_step, 'dropout': dropout,
        'learning_rate_max_offset': learning_rate_max_offset,
        'fine_tuning_epochs': fine_tuning_epochs, 'max_seq_len': max_seq_len}
predictions_te_all = np.zeros((len(test_df),))
for _ in range(n_repeats):
    if n_folds > 1:
        _, predictions_te, _, threshold, coeffs = cv(train_df, test_df, n_folds=n_folds, share=share, **args)
        print('coeff between predictions {}'.format(coeffs))
    else:
        predictions_te, _, _, _ = main(train_df, test_df, test_df, **args)
    predictions_te_all += predictions_te / n_repeats
# write 0/1 labels rather than True/False
sub['prediction'] = (predictions_te_all > threshold).astype(int)
sub.to_csv("submission.csv", index=False)

preprocssing 18.934462547302246s


building conuter: 1681928it [00:11, 140238.71it/s]
building vocab: 120000it [00:00, 828733.67it/s]
tokenize: 100%|██████████| 1306122/1306122 [00:15<00:00, 86189.80it/s] 
tokenize: 100%|██████████| 375806/375806 [00:04<00:00, 90142.27it/s] 


oov 6264/308/297/761/120002
oov 6162/50165/0/0/120002


generate lens: 375806it [00:00, 651869.25it/s]
generate lens: 261225it [00:00, 531373.75it/s]


InsincereModel(
  (embedding_0): Embedding(120002, 300)
  (embedding_1): Embedding(120002, 300)
  (embedding_mean): Embedding(120002, 300)
  (learnable_embedding): Embedding(120002, 300, padding_idx=0)
  (spatial_dropout): Dropout2d(p=0.2)
  (rnn0): LSTM(300, 128, batch_first=True, bidirectional=True)
  (rnn1): GRU(256, 128, batch_first=True, bidirectional=True)
  (capsule): Capsule()
  (dropout2): Dropout(p=0.3)
  (lincaps): Linear(in_features=25, out_features=1, bias=True)
  (attention1): Attention(
    (attention_fc): Linear(in_features=256, out_features=1, bias=True)
  )
  (attention2): Attention(
    (attention_fc): Linear(in_features=256, out_features=1, bias=True)
  )
  (fc): Linear(in_features=1025, out_features=16, bias=True)
  (norm): LayerNorm(torch.Size([1025]), eps=1e-05, elementwise_affine=True)
  (dropout1): Dropout(p=0.2)
  (dropout_linear): Dropout(p=0.1)
  (hidden2out): Linear(in_features=16, out_features=1, bias=True)
)
total trainable 36762931


generate lens: 1044897it [00:01, 552802.88it/s]
  0%|          | 0/8 [00:00<?, ?it/s]

Global step: 500/16328 Total loss: 0.1396  Current GPU memory usage: 1374 maxlen 7555.308 
Global step: 1000/16328 Total loss: 0.1151  Current GPU memory usage: 1374 maxlen 7559.864 
Global step: 1500/16328 Total loss: 0.1088  Current GPU memory usage: 1374 maxlen 7555.432 
Global step: 2000/16328 Total loss: 0.1072  Current GPU memory usage: 1374 maxlen 7557.9285 


 12%|█▎        | 1/8 [01:50<12:50, 110.06s/it]

idx 0 epoch 0 val set f1 : 0.6634 auc : 0.9637 margin : 0.0564
Global step: 2500/16328 Total loss: 0.1037  Current GPU memory usage: 1374 maxlen 7558.7384 
Global step: 3000/16328 Total loss: 0.1027  Current GPU memory usage: 1374 maxlen 7559.904333333333 
Global step: 3500/16328 Total loss: 0.1014  Current GPU memory usage: 1374 maxlen 7559.050571428572 
Global step: 4000/16328 Total loss: 0.1010  Current GPU memory usage: 1374 maxlen 7558.1075 


 25%|██▌       | 2/8 [03:41<11:02, 110.37s/it]

idx 0 epoch 1 val set f1 : 0.6773 auc : 0.9671 margin : 0.0541
Global step: 4500/16328 Total loss: 0.0958  Current GPU memory usage: 1374 maxlen 7554.211555555556 
Global step: 5000/16328 Total loss: 0.0956  Current GPU memory usage: 1374 maxlen 7557.3308 
Global step: 5500/16328 Total loss: 0.0963  Current GPU memory usage: 1374 maxlen 7558.135636363636 
Global step: 6000/16328 Total loss: 0.0958  Current GPU memory usage: 1374 maxlen 7558.583666666666 


 38%|███▊      | 3/8 [05:32<09:13, 110.63s/it]

idx 0 epoch 2 val set f1 : 0.6850 auc : 0.9685 margin : 0.0558
Global step: 6500/16328 Total loss: 0.0916  Current GPU memory usage: 1374 maxlen 7557.627230769231 
Global step: 7000/16328 Total loss: 0.0910  Current GPU memory usage: 1374 maxlen 7558.000142857143 
Global step: 7500/16328 Total loss: 0.0910  Current GPU memory usage: 1374 maxlen 7557.5212 
Global step: 8000/16328 Total loss: 0.0918  Current GPU memory usage: 1374 maxlen 7557.510375 


 50%|█████     | 4/8 [07:23<07:23, 110.77s/it]

idx 0 epoch 3 val set f1 : 0.6787 auc : 0.9695 margin : 0.0540
Global step: 8500/16328 Total loss: 0.0877  Current GPU memory usage: 1374 maxlen 7558.384705882353 
Global step: 9000/16328 Total loss: 0.0857  Current GPU memory usage: 1374 maxlen 7557.049555555555 
Global step: 9500/16328 Total loss: 0.0874  Current GPU memory usage: 1374 maxlen 7556.988842105263 
Global step: 10000/16328 Total loss: 0.0876  Current GPU memory usage: 1374 maxlen 7557.8059 


 62%|██████▎   | 5/8 [09:14<05:32, 110.90s/it]

idx 0 epoch 4 val set f1 : 0.6874 auc : 0.9699 margin : 0.0545
Global step: 10500/16328 Total loss: 0.0841  Current GPU memory usage: 1374 maxlen 7558.084285714286 
Global step: 11000/16328 Total loss: 0.0823  Current GPU memory usage: 1374 maxlen 7555.868818181818 
Global step: 11500/16328 Total loss: 0.0838  Current GPU memory usage: 1374 maxlen 7558.569217391304 
Global step: 12000/16328 Total loss: 0.0838  Current GPU memory usage: 1374 maxlen 7558.248583333333 


 75%|███████▌  | 6/8 [11:06<03:42, 111.05s/it]

idx 0 epoch 5 val set f1 : 0.6808 auc : 0.9690 margin : 0.0533
Global step: 500/4082 Total loss: 0.0817  Current GPU memory usage: 1891 maxlen 7557.673387729484 
idx 0 epoch 6 val set mean f1 : 0.6824 auc : 0.9685 margin : 0.0520
idx 0 epoch 6 val set (fge) f1 : 0.6824 auc : 0.9685 margin : 0.0520
Global step: 1000/4082 Total loss: 0.0836  Current GPU memory usage: 1891 maxlen 7557.647818209271 
idx 0 epoch 6 val set mean f1 : 0.6858 auc : 0.9686 margin : 0.0527
idx 0 epoch 6 val set (fge) f1 : 0.6899 auc : 0.9700 margin : 0.0523
Global step: 1500/4082 Total loss: 0.0856  Current GPU memory usage: 1891 maxlen 7557.5714389640625 
idx 0 epoch 6 val set mean f1 : 0.6837 auc : 0.9690 margin : 0.0527
idx 0 epoch 6 val set (fge) f1 : 0.6925 auc : 0.9707 margin : 0.0524
Global step: 2000/4082 Total loss: 0.0873  Current GPU memory usage: 1891 maxlen 7557.898989189948 
idx 0 epoch 6 val set mean f1 : 0.6888 auc : 0.9698 margin : 0.0487
idx 0 epoch 6 val set (fge) f1 : 0.6943 auc : 0.9713 margi

 88%|████████▊ | 7/8 [14:48<02:24, 144.57s/it]

Global step: 2500/4082 Total loss: 0.0692  Current GPU memory usage: 1891 maxlen 7558.092092770921 
idx 0 epoch 7 val set mean f1 : 0.6790 auc : 0.9666 margin : 0.0502
idx 0 epoch 7 val set (fge) f1 : 0.6950 auc : 0.9713 margin : 0.0512
Global step: 3000/4082 Total loss: 0.0692  Current GPU memory usage: 1891 maxlen 7558.122392758756 
idx 0 epoch 7 val set mean f1 : 0.6795 auc : 0.9665 margin : 0.0518
idx 0 epoch 7 val set (fge) f1 : 0.6952 auc : 0.9714 margin : 0.0513
Global step: 3500/4082 Total loss: 0.0729  Current GPU memory usage: 1891 maxlen 7558.388670138448 
idx 0 epoch 7 val set mean f1 : 0.6806 auc : 0.9668 margin : 0.0497
idx 0 epoch 7 val set (fge) f1 : 0.6975 auc : 0.9714 margin : 0.0511
Global step: 4000/4082 Total loss: 0.0752  Current GPU memory usage: 1891 maxlen 7558.051335713406 
idx 0 epoch 7 val set mean f1 : 0.6793 auc : 0.9674 margin : 0.0515
idx 0 epoch 7 val set (fge) f1 : 0.6972 auc : 0.9714 margin : 0.0511


100%|██████████| 8/8 [18:32<00:00, 168.16s/it]


best threshold on validation set: 0.35 score 0.6979


generate lens: 375806it [00:00, 667980.08it/s]
generate lens: 261225it [00:00, 542069.32it/s]
generate lens: 1044897it [00:01, 554797.57it/s]
  0%|          | 0/8 [00:00<?, ?it/s]

Global step: 500/16328 Total loss: 0.1398  Current GPU memory usage: 1891 maxlen 7558.452 
Global step: 1000/16328 Total loss: 0.1159  Current GPU memory usage: 1891 maxlen 7562.229 
Global step: 1500/16328 Total loss: 0.1094  Current GPU memory usage: 1891 maxlen 7561.242666666667 
Global step: 2000/16328 Total loss: 0.1084  Current GPU memory usage: 1891 maxlen 7558.8605 


 12%|█▎        | 1/8 [01:51<13:00, 111.44s/it]

idx 1 epoch 0 val set f1 : 0.6554 auc : 0.9648 margin : 0.0567
Global step: 2500/16328 Total loss: 0.1040  Current GPU memory usage: 1891 maxlen 7561.014 
Global step: 3000/16328 Total loss: 0.1008  Current GPU memory usage: 1891 maxlen 7561.043 
Global step: 3500/16328 Total loss: 0.1015  Current GPU memory usage: 1891 maxlen 7555.178857142857 
Global step: 4000/16328 Total loss: 0.1016  Current GPU memory usage: 1891 maxlen 7559.97525 


 25%|██▌       | 2/8 [03:42<11:08, 111.40s/it]

idx 1 epoch 1 val set f1 : 0.6838 auc : 0.9682 margin : 0.0558
Global step: 4500/16328 Total loss: 0.0970  Current GPU memory usage: 1891 maxlen 7559.31 
Global step: 5000/16328 Total loss: 0.0954  Current GPU memory usage: 1891 maxlen 7559.0486 
Global step: 5500/16328 Total loss: 0.0956  Current GPU memory usage: 1891 maxlen 7560.981636363636 
Global step: 6000/16328 Total loss: 0.0957  Current GPU memory usage: 1891 maxlen 7560.251666666667 


 38%|███▊      | 3/8 [05:34<09:16, 111.38s/it]

idx 1 epoch 2 val set f1 : 0.6865 auc : 0.9703 margin : 0.0583
Global step: 6500/16328 Total loss: 0.0917  Current GPU memory usage: 1891 maxlen 7557.930153846154 
Global step: 7000/16328 Total loss: 0.0916  Current GPU memory usage: 1891 maxlen 7560.280285714286 
Global step: 7500/16328 Total loss: 0.0909  Current GPU memory usage: 1891 maxlen 7559.520933333333 
Global step: 8000/16328 Total loss: 0.0905  Current GPU memory usage: 1891 maxlen 7560.793375 


 50%|█████     | 4/8 [07:25<07:25, 111.35s/it]

idx 1 epoch 3 val set f1 : 0.6860 auc : 0.9698 margin : 0.0556
Global step: 8500/16328 Total loss: 0.0876  Current GPU memory usage: 1891 maxlen 7559.288823529412 
Global step: 9000/16328 Total loss: 0.0861  Current GPU memory usage: 1891 maxlen 7559.685888888889 
Global step: 9500/16328 Total loss: 0.0873  Current GPU memory usage: 1891 maxlen 7557.630631578947 
Global step: 10000/16328 Total loss: 0.0870  Current GPU memory usage: 1891 maxlen 7560.3489 


 62%|██████▎   | 5/8 [09:16<05:34, 111.33s/it]

idx 1 epoch 4 val set f1 : 0.6912 auc : 0.9710 margin : 0.0513
Global step: 10500/16328 Total loss: 0.0828  Current GPU memory usage: 1891 maxlen 7558.223333333333 
Global step: 11000/16328 Total loss: 0.0824  Current GPU memory usage: 1891 maxlen 7559.4941818181815 
Global step: 11500/16328 Total loss: 0.0831  Current GPU memory usage: 1891 maxlen 7558.728260869565 
Global step: 12000/16328 Total loss: 0.0840  Current GPU memory usage: 1891 maxlen 7559.7404166666665 


 75%|███████▌  | 6/8 [11:07<03:42, 111.22s/it]

idx 1 epoch 5 val set f1 : 0.6903 auc : 0.9706 margin : 0.0534
Global step: 500/4082 Total loss: 0.0818  Current GPU memory usage: 1892 maxlen 7559.809273497568 
idx 1 epoch 6 val set mean f1 : 0.6899 auc : 0.9696 margin : 0.0488
idx 1 epoch 6 val set (fge) f1 : 0.6899 auc : 0.9696 margin : 0.0488
Global step: 1000/4082 Total loss: 0.0846  Current GPU memory usage: 1892 maxlen 7559.967235391817 
idx 1 epoch 6 val set mean f1 : 0.6901 auc : 0.9697 margin : 0.0515
idx 1 epoch 6 val set (fge) f1 : 0.6973 auc : 0.9711 margin : 0.0501
Global step: 1500/4082 Total loss: 0.0848  Current GPU memory usage: 1892 maxlen 7559.830859886512 
idx 1 epoch 6 val set mean f1 : 0.6914 auc : 0.9703 margin : 0.0512
idx 1 epoch 6 val set (fge) f1 : 0.7012 auc : 0.9720 margin : 0.0505
Global step: 2000/4082 Total loss: 0.0861  Current GPU memory usage: 1892 maxlen 7559.622981889654 
idx 1 epoch 6 val set mean f1 : 0.6932 auc : 0.9706 margin : 0.0509
idx 1 epoch 6 val set (fge) f1 : 0.7027 auc : 0.9725 margin

 88%|████████▊ | 7/8 [14:50<02:24, 144.74s/it]

Global step: 2500/4082 Total loss: 0.0690  Current GPU memory usage: 1892 maxlen 7559.354875898549 
idx 1 epoch 7 val set mean f1 : 0.6846 auc : 0.9684 margin : 0.0484
idx 1 epoch 7 val set (fge) f1 : 0.7034 auc : 0.9726 margin : 0.0502
Global step: 3000/4082 Total loss: 0.0705  Current GPU memory usage: 1892 maxlen 7559.531483667847 
idx 1 epoch 7 val set mean f1 : 0.6853 auc : 0.9686 margin : 0.0497
idx 1 epoch 7 val set (fge) f1 : 0.7037 auc : 0.9726 margin : 0.0501
Global step: 3500/4082 Total loss: 0.0732  Current GPU memory usage: 1892 maxlen 7559.235869427156 
idx 1 epoch 7 val set mean f1 : 0.6858 auc : 0.9680 margin : 0.0490
idx 1 epoch 7 val set (fge) f1 : 0.7049 auc : 0.9726 margin : 0.0499
Global step: 4000/4082 Total loss: 0.0745  Current GPU memory usage: 1892 maxlen 7559.275883294349 
idx 1 epoch 7 val set mean f1 : 0.6820 auc : 0.9683 margin : 0.0510
idx 1 epoch 7 val set (fge) f1 : 0.7051 auc : 0.9727 margin : 0.0501


100%|██████████| 8/8 [18:34<00:00, 168.51s/it]


best threshold on validation set: 0.32 score 0.7056


generate lens: 375806it [00:00, 659100.98it/s]
generate lens: 261224it [00:00, 512095.96it/s]
generate lens: 1044898it [00:01, 551131.59it/s]
  0%|          | 0/8 [00:00<?, ?it/s]

Global step: 500/16328 Total loss: 0.1367  Current GPU memory usage: 1892 maxlen 7560.884 
Global step: 1000/16328 Total loss: 0.1146  Current GPU memory usage: 1892 maxlen 7560.271 
Global step: 1500/16328 Total loss: 0.1097  Current GPU memory usage: 1892 maxlen 7552.416 
Global step: 2000/16328 Total loss: 0.1076  Current GPU memory usage: 1892 maxlen 7552.8585 


 12%|█▎        | 1/8 [01:51<13:00, 111.49s/it]

idx 2 epoch 0 val set f1 : 0.6628 auc : 0.9633 margin : 0.0672
Global step: 2500/16328 Total loss: 0.1041  Current GPU memory usage: 1892 maxlen 7555.2724 
Global step: 3000/16328 Total loss: 0.1025  Current GPU memory usage: 1892 maxlen 7553.816333333333 
Global step: 3500/16328 Total loss: 0.1008  Current GPU memory usage: 1892 maxlen 7553.732 
Global step: 4000/16328 Total loss: 0.0998  Current GPU memory usage: 1892 maxlen 7549.6235 


 25%|██▌       | 2/8 [03:42<11:08, 111.35s/it]

idx 2 epoch 1 val set f1 : 0.6844 auc : 0.9672 margin : 0.0571
Global step: 4500/16328 Total loss: 0.0968  Current GPU memory usage: 1892 maxlen 7550.959777777778 
Global step: 5000/16328 Total loss: 0.0949  Current GPU memory usage: 1892 maxlen 7552.5042 
Global step: 5500/16328 Total loss: 0.0972  Current GPU memory usage: 1892 maxlen 7552.246181818182 
Global step: 6000/16328 Total loss: 0.0953  Current GPU memory usage: 1892 maxlen 7554.101166666666 


 38%|███▊      | 3/8 [05:33<09:16, 111.35s/it]

idx 2 epoch 2 val set f1 : 0.6889 auc : 0.9688 margin : 0.0547
Global step: 6500/16328 Total loss: 0.0920  Current GPU memory usage: 1892 maxlen 7554.217230769231 
Global step: 7000/16328 Total loss: 0.0924  Current GPU memory usage: 1892 maxlen 7553.983285714286 
Global step: 7500/16328 Total loss: 0.0916  Current GPU memory usage: 1892 maxlen 7553.5232 
Global step: 8000/16328 Total loss: 0.0905  Current GPU memory usage: 1892 maxlen 7553.504125 


 50%|█████     | 4/8 [07:24<07:25, 111.29s/it]

idx 2 epoch 3 val set f1 : 0.6944 auc : 0.9691 margin : 0.0537
Global step: 8500/16328 Total loss: 0.0880  Current GPU memory usage: 1892 maxlen 7552.598352941176 
Global step: 9000/16328 Total loss: 0.0864  Current GPU memory usage: 1892 maxlen 7553.432222222223 
Global step: 9500/16328 Total loss: 0.0862  Current GPU memory usage: 1892 maxlen 7552.822947368421 
Global step: 10000/16328 Total loss: 0.0872  Current GPU memory usage: 1892 maxlen 7553.5118 


 62%|██████▎   | 5/8 [09:16<05:33, 111.24s/it]

idx 2 epoch 4 val set f1 : 0.6952 auc : 0.9697 margin : 0.0504
Global step: 10500/16328 Total loss: 0.0829  Current GPU memory usage: 1892 maxlen 7551.47380952381 
Global step: 11000/16328 Total loss: 0.0838  Current GPU memory usage: 1892 maxlen 7552.6246363636365 
Global step: 11500/16328 Total loss: 0.0828  Current GPU memory usage: 1892 maxlen 7552.619043478261 
Global step: 12000/16328 Total loss: 0.0832  Current GPU memory usage: 1892 maxlen 7551.484833333333 


 75%|███████▌  | 6/8 [11:07<03:42, 111.25s/it]

idx 2 epoch 5 val set f1 : 0.6883 auc : 0.9694 margin : 0.0520
Global step: 500/4082 Total loss: 0.0819  Current GPU memory usage: 1892 maxlen 7552.282284638318 
idx 2 epoch 6 val set mean f1 : 0.6858 auc : 0.9686 margin : 0.0511
idx 2 epoch 6 val set (fge) f1 : 0.6858 auc : 0.9686 margin : 0.0511
Global step: 1000/4082 Total loss: 0.0835  Current GPU memory usage: 1892 maxlen 7552.454552317681 
idx 2 epoch 6 val set mean f1 : 0.6911 auc : 0.9691 margin : 0.0503
idx 2 epoch 6 val set (fge) f1 : 0.6955 auc : 0.9705 margin : 0.0507
Global step: 1500/4082 Total loss: 0.0850  Current GPU memory usage: 1892 maxlen 7552.1380037829185 
idx 2 epoch 6 val set mean f1 : 0.6913 auc : 0.9693 margin : 0.0511
idx 2 epoch 6 val set (fge) f1 : 0.7004 auc : 0.9712 margin : 0.0508
Global step: 2000/4082 Total loss: 0.0879  Current GPU memory usage: 1892 maxlen 7552.500982731995 
idx 2 epoch 6 val set mean f1 : 0.6955 auc : 0.9696 margin : 0.0522
idx 2 epoch 6 val set (fge) f1 : 0.7015 auc : 0.9716 margi

 88%|████████▊ | 7/8 [14:50<02:24, 144.81s/it]

Global step: 2500/4082 Total loss: 0.0686  Current GPU memory usage: 1892 maxlen 7552.128645056287 
idx 2 epoch 7 val set mean f1 : 0.6838 auc : 0.9675 margin : 0.0485
idx 2 epoch 7 val set (fge) f1 : 0.7022 auc : 0.9717 margin : 0.0506
Global step: 3000/4082 Total loss: 0.0704  Current GPU memory usage: 1892 maxlen 7552.653482880755 
idx 2 epoch 7 val set mean f1 : 0.6861 auc : 0.9676 margin : 0.0486
idx 2 epoch 7 val set (fge) f1 : 0.7029 auc : 0.9718 margin : 0.0503
Global step: 3500/4082 Total loss: 0.0732  Current GPU memory usage: 1892 maxlen 7552.444938397053 
idx 2 epoch 7 val set mean f1 : 0.6846 auc : 0.9673 margin : 0.0514
idx 2 epoch 7 val set (fge) f1 : 0.7035 auc : 0.9718 margin : 0.0504
Global step: 4000/4082 Total loss: 0.0743  Current GPU memory usage: 1892 maxlen 7552.490336082728 
idx 2 epoch 7 val set mean f1 : 0.6845 auc : 0.9671 margin : 0.0538
idx 2 epoch 7 val set (fge) f1 : 0.7037 auc : 0.9717 margin : 0.0509


100%|██████████| 8/8 [18:33<00:00, 168.35s/it]


best threshold on validation set: 0.34 score 0.7048


generate lens: 375806it [00:00, 628678.45it/s]
generate lens: 261224it [00:00, 536819.22it/s]
generate lens: 1044898it [00:01, 524318.48it/s]
  0%|          | 0/8 [00:00<?, ?it/s]

Global step: 500/16328 Total loss: 0.1389  Current GPU memory usage: 1892 maxlen 7542.16 
Global step: 1000/16328 Total loss: 0.1136  Current GPU memory usage: 1892 maxlen 7552.615 
Global step: 1500/16328 Total loss: 0.1101  Current GPU memory usage: 1892 maxlen 7550.710666666667 
Global step: 2000/16328 Total loss: 0.1076  Current GPU memory usage: 1892 maxlen 7550.202 


 12%|█▎        | 1/8 [01:51<12:59, 111.40s/it]

idx 3 epoch 0 val set f1 : 0.6615 auc : 0.9631 margin : 0.0594
Global step: 2500/16328 Total loss: 0.1027  Current GPU memory usage: 1892 maxlen 7552.8512 
Global step: 3000/16328 Total loss: 0.1033  Current GPU memory usage: 1892 maxlen 7549.687666666667 
Global step: 3500/16328 Total loss: 0.1034  Current GPU memory usage: 1892 maxlen 7550.526285714286 
Global step: 4000/16328 Total loss: 0.0989  Current GPU memory usage: 1892 maxlen 7550.97325 


 25%|██▌       | 2/8 [03:42<11:08, 111.45s/it]

idx 3 epoch 1 val set f1 : 0.6828 auc : 0.9667 margin : 0.0602
Global step: 4500/16328 Total loss: 0.0960  Current GPU memory usage: 1892 maxlen 7551.875333333333 
Global step: 5000/16328 Total loss: 0.0958  Current GPU memory usage: 1892 maxlen 7551.2826 
Global step: 5500/16328 Total loss: 0.0956  Current GPU memory usage: 1892 maxlen 7548.973818181818 
Global step: 6000/16328 Total loss: 0.0963  Current GPU memory usage: 1892 maxlen 7550.0171666666665 


 38%|███▊      | 3/8 [05:34<09:16, 111.34s/it]

idx 3 epoch 2 val set f1 : 0.6893 auc : 0.9686 margin : 0.0501
Global step: 6500/16328 Total loss: 0.0921  Current GPU memory usage: 1892 maxlen 7549.584923076923 
Global step: 7000/16328 Total loss: 0.0922  Current GPU memory usage: 1892 maxlen 7551.096428571429 
Global step: 7500/16328 Total loss: 0.0910  Current GPU memory usage: 1892 maxlen 7549.363333333334 
Global step: 8000/16328 Total loss: 0.0913  Current GPU memory usage: 1892 maxlen 7551.2225 


 50%|█████     | 4/8 [07:25<07:25, 111.27s/it]

idx 3 epoch 3 val set f1 : 0.6878 auc : 0.9691 margin : 0.0539
Global step: 8500/16328 Total loss: 0.0868  Current GPU memory usage: 1892 maxlen 7549.5905882352945 
Global step: 9000/16328 Total loss: 0.0856  Current GPU memory usage: 1892 maxlen 7548.676222222222 
Global step: 9500/16328 Total loss: 0.0876  Current GPU memory usage: 1892 maxlen 7549.413157894737 
Global step: 10000/16328 Total loss: 0.0874  Current GPU memory usage: 1892 maxlen 7548.8495 


 62%|██████▎   | 5/8 [09:16<05:33, 111.30s/it]

idx 3 epoch 4 val set f1 : 0.6883 auc : 0.9694 margin : 0.0529
Global step: 10500/16328 Total loss: 0.0832  Current GPU memory usage: 1892 maxlen 7550.3786666666665 
Global step: 11000/16328 Total loss: 0.0835  Current GPU memory usage: 1892 maxlen 7551.063727272727 
Global step: 11500/16328 Total loss: 0.0829  Current GPU memory usage: 1892 maxlen 7550.367217391305 
Global step: 12000/16328 Total loss: 0.0841  Current GPU memory usage: 1892 maxlen 7549.294833333333 


 75%|███████▌  | 6/8 [11:07<03:42, 111.25s/it]

idx 3 epoch 5 val set f1 : 0.6890 auc : 0.9693 margin : 0.0508
Global step: 500/4082 Total loss: 0.0819  Current GPU memory usage: 1893 maxlen 7550.141142319159 
idx 3 epoch 6 val set mean f1 : 0.6848 auc : 0.9680 margin : 0.0531
idx 3 epoch 6 val set (fge) f1 : 0.6848 auc : 0.9680 margin : 0.0531
Global step: 1000/4082 Total loss: 0.0837  Current GPU memory usage: 1893 maxlen 7550.431375509588 
idx 3 epoch 6 val set mean f1 : 0.6868 auc : 0.9683 margin : 0.0516
idx 3 epoch 6 val set (fge) f1 : 0.6930 auc : 0.9697 margin : 0.0524
Global step: 1500/4082 Total loss: 0.0864  Current GPU memory usage: 1893 maxlen 7550.469736650662 
idx 3 epoch 6 val set mean f1 : 0.6865 auc : 0.9692 margin : 0.0522
idx 3 epoch 6 val set (fge) f1 : 0.6958 auc : 0.9707 margin : 0.0523
Global step: 2000/4082 Total loss: 0.0873  Current GPU memory usage: 1893 maxlen 7550.176891759091 
idx 3 epoch 6 val set mean f1 : 0.6945 auc : 0.9697 margin : 0.0516
idx 3 epoch 6 val set (fge) f1 : 0.6990 auc : 0.9713 margin

 88%|████████▊ | 7/8 [14:50<02:24, 144.86s/it]

Global step: 2500/4082 Total loss: 0.0690  Current GPU memory usage: 1893 maxlen 7550.340092228401 
idx 3 epoch 7 val set mean f1 : 0.6817 auc : 0.9663 margin : 0.0500
idx 3 epoch 7 val set (fge) f1 : 0.6998 auc : 0.9714 margin : 0.0517
Global step: 3000/4082 Total loss: 0.0704  Current GPU memory usage: 1893 maxlen 7550.638790502427 
idx 3 epoch 7 val set mean f1 : 0.6829 auc : 0.9667 margin : 0.0487
idx 3 epoch 7 val set (fge) f1 : 0.6990 auc : 0.9714 margin : 0.0512
Global step: 3500/4082 Total loss: 0.0727  Current GPU memory usage: 1893 maxlen 7550.77073542487 
idx 3 epoch 7 val set mean f1 : 0.6827 auc : 0.9665 margin : 0.0479
idx 3 epoch 7 val set (fge) f1 : 0.6991 auc : 0.9714 margin : 0.0507
Global step: 4000/4082 Total loss: 0.0743  Current GPU memory usage: 1893 maxlen 7550.3826788132465 
idx 3 epoch 7 val set mean f1 : 0.6828 auc : 0.9668 margin : 0.0535
idx 3 epoch 7 val set (fge) f1 : 0.6994 auc : 0.9714 margin : 0.0511


100%|██████████| 8/8 [18:34<00:00, 168.42s/it]


best threshold on validation set: 0.34 score 0.7007


generate lens: 375806it [00:00, 630569.73it/s]
generate lens: 261224it [00:00, 542757.35it/s]
generate lens: 1044898it [00:01, 558659.47it/s]
  0%|          | 0/8 [00:00<?, ?it/s]

Global step: 500/16328 Total loss: 0.1386  Current GPU memory usage: 1893 maxlen 7551.298 
Global step: 1000/16328 Total loss: 0.1138  Current GPU memory usage: 1893 maxlen 7551.34 
Global step: 1500/16328 Total loss: 0.1092  Current GPU memory usage: 1893 maxlen 7550.491333333333 
Global step: 2000/16328 Total loss: 0.1079  Current GPU memory usage: 1893 maxlen 7554.267 


 12%|█▎        | 1/8 [01:51<13:01, 111.63s/it]

idx 4 epoch 0 val set f1 : 0.6573 auc : 0.9623 margin : 0.0582
Global step: 2500/16328 Total loss: 0.1036  Current GPU memory usage: 1893 maxlen 7554.5264 
Global step: 3000/16328 Total loss: 0.1027  Current GPU memory usage: 1893 maxlen 7554.539 
Global step: 3500/16328 Total loss: 0.0998  Current GPU memory usage: 1893 maxlen 7552.808 
Global step: 4000/16328 Total loss: 0.1007  Current GPU memory usage: 1893 maxlen 7554.8835 


 25%|██▌       | 2/8 [03:42<11:09, 111.52s/it]

idx 4 epoch 1 val set f1 : 0.6756 auc : 0.9660 margin : 0.0577
Global step: 4500/16328 Total loss: 0.0962  Current GPU memory usage: 1893 maxlen 7555.483333333334 
Global step: 5000/16328 Total loss: 0.0943  Current GPU memory usage: 1893 maxlen 7554.1348 
Global step: 5500/16328 Total loss: 0.0961  Current GPU memory usage: 1893 maxlen 7555.3789090909095 
Global step: 6000/16328 Total loss: 0.0959  Current GPU memory usage: 1893 maxlen 7555.644666666667 


 38%|███▊      | 3/8 [05:34<09:17, 111.44s/it]

idx 4 epoch 2 val set f1 : 0.6816 auc : 0.9680 margin : 0.0521
Global step: 6500/16328 Total loss: 0.0911  Current GPU memory usage: 1893 maxlen 7554.718615384615 
Global step: 7000/16328 Total loss: 0.0918  Current GPU memory usage: 1893 maxlen 7554.197428571429 
Global step: 7500/16328 Total loss: 0.0920  Current GPU memory usage: 1893 maxlen 7553.9 
Global step: 8000/16328 Total loss: 0.0903  Current GPU memory usage: 1893 maxlen 7556.113375 


 50%|█████     | 4/8 [07:25<07:25, 111.33s/it]

idx 4 epoch 3 val set f1 : 0.6811 auc : 0.9672 margin : 0.0509
Global step: 8500/16328 Total loss: 0.0872  Current GPU memory usage: 1893 maxlen 7555.6950588235295 
Global step: 9000/16328 Total loss: 0.0857  Current GPU memory usage: 1893 maxlen 7554.451 
Global step: 9500/16328 Total loss: 0.0863  Current GPU memory usage: 1893 maxlen 7553.9912631578945 
Global step: 10000/16328 Total loss: 0.0867  Current GPU memory usage: 1893 maxlen 7553.4212 


 62%|██████▎   | 5/8 [09:16<05:33, 111.32s/it]

idx 4 epoch 4 val set f1 : 0.6828 auc : 0.9688 margin : 0.0543
Global step: 10500/16328 Total loss: 0.0837  Current GPU memory usage: 1893 maxlen 7555.656095238095 
Global step: 11000/16328 Total loss: 0.0817  Current GPU memory usage: 1893 maxlen 7555.474181818182 
Global step: 11500/16328 Total loss: 0.0834  Current GPU memory usage: 1893 maxlen 7554.076434782609 
Global step: 12000/16328 Total loss: 0.0843  Current GPU memory usage: 1893 maxlen 7556.219166666667 


 75%|███████▌  | 6/8 [11:07<03:42, 111.31s/it]

idx 4 epoch 5 val set f1 : 0.6862 auc : 0.9688 margin : 0.0524
Global step: 500/4082 Total loss: 0.0813  Current GPU memory usage: 1893 maxlen 7555.19982739683 
idx 4 epoch 6 val set mean f1 : 0.6795 auc : 0.9674 margin : 0.0526
idx 4 epoch 6 val set (fge) f1 : 0.6795 auc : 0.9674 margin : 0.0526
Global step: 1000/4082 Total loss: 0.0837  Current GPU memory usage: 1893 maxlen 7555.173939302431 
idx 4 epoch 6 val set mean f1 : 0.6854 auc : 0.9675 margin : 0.0527
idx 4 epoch 6 val set (fge) f1 : 0.6885 auc : 0.9689 margin : 0.0526
Global step: 1500/4082 Total loss: 0.0859  Current GPU memory usage: 1893 maxlen 7555.0291721227995 
idx 4 epoch 6 val set mean f1 : 0.6860 auc : 0.9678 margin : 0.0501
idx 4 epoch 6 val set (fge) f1 : 0.6920 auc : 0.9696 margin : 0.0518
Global step: 2000/4082 Total loss: 0.0857  Current GPU memory usage: 1893 maxlen 7555.155903411484 
idx 4 epoch 6 val set mean f1 : 0.6849 auc : 0.9680 margin : 0.0531
idx 4 epoch 6 val set (fge) f1 : 0.6949 auc : 0.9702 margin

 88%|████████▊ | 7/8 [14:50<02:24, 144.79s/it]

Global step: 2500/4082 Total loss: 0.0693  Current GPU memory usage: 1893 maxlen 7555.050183100502 
idx 4 epoch 7 val set mean f1 : 0.6777 auc : 0.9650 margin : 0.0493
idx 4 epoch 7 val set (fge) f1 : 0.6958 auc : 0.9702 margin : 0.0516
Global step: 3000/4082 Total loss: 0.0711  Current GPU memory usage: 1893 maxlen 7554.928243473698 
idx 4 epoch 7 val set mean f1 : 0.6772 auc : 0.9660 margin : 0.0491
idx 4 epoch 7 val set (fge) f1 : 0.6958 auc : 0.9704 margin : 0.0511
Global step: 3500/4082 Total loss: 0.0725  Current GPU memory usage: 1893 maxlen 7555.7302807062115 
idx 4 epoch 7 val set mean f1 : 0.6777 auc : 0.9653 margin : 0.0508
idx 4 epoch 7 val set (fge) f1 : 0.6967 auc : 0.9703 margin : 0.0511
Global step: 4000/4082 Total loss: 0.0735  Current GPU memory usage: 1893 maxlen 7555.381755509048 
idx 4 epoch 7 val set mean f1 : 0.6777 auc : 0.9659 margin : 0.0516
idx 4 epoch 7 val set (fge) f1 : 0.6966 auc : 0.9704 margin : 0.0512


100%|██████████| 8/8 [18:34<00:00, 168.34s/it]


best threshold on validation set: 0.32 score 0.6967
avg of best threshold [0.35000000000000003, 0.32, 0.34, 0.34, 0.32] macro-f1 best threshold 0.32 best score 0.7007802742990707
coeff between predictions 0.9621045053695433
