### Run this notebook in Google Colab for exactly reproducible results. (Results may vary on other GPU specs)

#### Before running this notebook, please note that my Colab session reported GPU RAM Free: 16280 MB. If yours has less, the cell with sequence length 128 and batch size 32 (Model 3) will give an out-of-memory error.

#### Colab trick: Colab generally gives you around 25 GB of CPU RAM once you run out of memory while using the CPU, and around 14 GB of GPU RAM if your Gmail account is old or you have been using the GPU for 12 hours. So to get that extra 2 GB of GPU RAM, which proves useful in most cases, all you have to do is run out of memory in a new Google Colab account :). Hope everyone has at least two Gmail accounts :D

#### NOTE: This solution uses roberta-large and bert-large-uncased-whole-word-masking.

# SIMPLE TRANSFORMERS

In [1]:
!pip install --upgrade transformers
!pip install simpletransformers
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/cd/38/c9527aa055241c66c4d785381eaf6f80a28c224cae97daa1f8b183b5fabb/transformers-2.9.0-py3-none-any.whl (635kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/98/2c/8df20f3ac6c22ac224fff307ebc102818206c53fc454ecd37d8ac2060df5/sentencepiece-0.1.86-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)
Collecting tokenizers==0.7.0
  Downloading https://files.pythonhosted.org/packages/14/e5/a26eb4716523808bb0a799fcfdceb6ebf77a18169d9591b2f46a9adb87d9/tokenizers-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.8MB)

In [1]:
import psutil
import humanize
import os
import GPUtil as GPU

GPUs = GPU.getGPUs()
gpu = GPUs[0]
def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available), " |     Proc size: " + humanize.naturalsize(process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total     {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm()

Gen RAM Free: 12.7 GB  |     Proc size: 156.9 MB
GPU RAM Free: 16280MB | Used: 0MB | Util   0% | Total     16280MB


In [2]:
import numpy as np
import pandas as pd
from google.colab import files
from tqdm import tqdm
import warnings
warnings.simplefilter('ignore')
import gc
from scipy.special import softmax
from simpletransformers.classification import ClassificationModel
from sklearn.model_selection import train_test_split, StratifiedKFold, KFold
import sklearn
from sklearn.metrics import log_loss
from sklearn.metrics import *
from sklearn.model_selection import *
import re
import random
import torch
pd.options.display.max_colwidth = 200

def seed_all(seed_value):
    random.seed(seed_value) # Python
    np.random.seed(seed_value) # cpu vars
    torch.manual_seed(seed_value) # cpu  vars
    
    if torch.cuda.is_available(): 
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value) # gpu vars
        torch.backends.cudnn.deterministic = True  #needed
        torch.backends.cudnn.benchmark = False

seed_all(2)

In [3]:
train = pd.read_csv('/content/drive/My Drive/Colab Notebooks/COVID-19 Tweet Classification Challenge/updated_train.csv')
test = pd.read_csv('/content/drive/My Drive/Colab Notebooks/COVID-19 Tweet Classification Challenge/updated_test.csv')
sample_sub = pd.read_csv('/content/drive/My Drive/Colab Notebooks/COVID-19 Tweet Classification Challenge/updated_ss.csv')
train.shape, test.shape, sample_sub.shape

((5287, 3), (1962, 2), (1962, 2))

In [8]:
train.head()

| | ID | text | target |
|---|---|---|---|
| 0 | train_0 | The bitcoin halving is cancelled due to | 1 |
| 1 | train_1 | MercyOfAllah In good times wrapped in its gran... | 0 |
| 2 | train_2 | 266 Days No Digital India No Murder of e learn... | 1 |
| 3 | train_3 | India is likely to run out of the remaining RN... | 1 |
| 4 | train_4 | In these tough times the best way to grow is t... | 0 |


In [9]:
test.head()

| | ID | text |
|---|---|---|
| 0 | test_2 | Why is explained in the video take a look |
| 1 | test_3 | Ed Davey fasting for Ramadan No contest |
| 2 | test_4 | Is Doja Cat good or do you just miss Nicki Minaj |
| 3 | test_8 | How Boris Johnson s cheery wounded in action p... |
| 4 | test_9 | Man it s terrible Not even a reason to get on ... |


In [5]:
train.target.value_counts()

0    2746
1    2541
Name: target, dtype: int64
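
With 2,746 negatives and 2,541 positives the classes are nearly balanced, which gives a handy reference point for the log-loss numbers later in the notebook: a model that always predicts the class-1 frequency would score roughly 0.69. A small sketch of that arithmetic (the helper name `prior_log_loss` is only for illustration and is not used elsewhere in the notebook):

```python
import math

def prior_log_loss(n_neg, n_pos):
    """Log loss of a constant predictor that always outputs the class-1 frequency."""
    p = n_pos / (n_neg + n_pos)
    # Positives contribute -log(p), negatives contribute -log(1 - p);
    # weighting by class frequency gives the mean loss.
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

baseline = prior_log_loss(n_neg=2746, n_pos=2541)
print(f"Constant-prior baseline log loss: {baseline:.4f}")
```

Any fold score well below this baseline means the model is learning something beyond the class balance.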

In [6]:
print(train['text'].apply(lambda x: len(x.split())).describe())

count    5287.000000
mean       20.258180
std        10.006057
min         3.000000
25%        14.000000
50%        19.000000
75%        23.000000
max        61.000000
Name: text, dtype: float64


In [7]:
# Note: this counts characters, whereas the train cell above counts words.
print(test['text'].apply(lambda x: len(x)).describe())

count    1962.000000
mean      110.431193
std        55.119819
min        21.000000
25%        73.000000
50%       106.500000
75%       123.000000
max       288.000000
Name: text, dtype: float64


In [0]:
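# Drop the ID column. When the columns are not named 'text' and 'labels',
# Simple Transformers reads the first column as text and the second as the label.
train1=train.drop(['ID'],axis=1)
test1=test.drop(['ID'],axis=1)
# eval_model expects a label column, so give the test set a dummy label of 0.
test1['label']=0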

# RoBERTa Large Model 1

In [None]:
%%time
err=[]
y_pred_tot=[]

# 20-fold stratified CV; a fresh model is trained on every fold.
fold=StratifiedKFold(n_splits=20, shuffle=True, random_state=2)
for train_index, test_index in fold.split(train1,train1['target']):
    train1_trn, train1_val = train1.iloc[train_index], train1.iloc[test_index]
    model = ClassificationModel('roberta', 'roberta-large', use_cuda=True,num_labels=2, args={'train_batch_size':32,
                                                                         'reprocess_input_data': True,
                                                                         'overwrite_output_dir': True,
                                                                         'fp16': False,
                                                                         'do_lower_case': False,
                                                                         'num_train_epochs': 2,
                                                                         'max_seq_length': 64,
                                                                         'regression': False,
                                                                         'manual_seed': 2,
                                                                         "learning_rate":3e-5,
                                                                         'weight_decay':0,
                                                                         "save_eval_checkpoints": False,
                                                                         "save_model_every_epoch": False,
                                                                         "silent": True})
    model.train_model(train1_trn)
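    # eval_model returns (result, model_outputs, wrong_predictions); keep the raw outputs.
    raw_outputs_val = model.eval_model(train1_val)[1]
    # Softmax over the two logits, then keep P(class 1).
    raw_outputs_val = softmax(raw_outputs_val,axis=1)[:,1]
    fold_loss = log_loss(train1_val['target'], raw_outputs_val)
    print(f"Log_Loss: {fold_loss}")
    err.append(fold_loss)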
    raw_outputs_test = model.eval_model(test1)[1]
    raw_outputs_test = softmax(raw_outputs_test,axis=1)[:,1]
    y_pred_tot.append(raw_outputs_test)
print("Mean LogLoss: ",np.mean(err))
final=pd.DataFrame()
final['ID']=test['ID']
final['target']=np.mean(y_pred_tot, 0)
print(final.shape)
final.to_csv('20fold_rbl_2_3e5_32_64_0.csv',index=False)

In [None]:
files.download("20fold_rbl_2_3e5_32_64_0.csv")

#### Local Mean LogLoss: 0.2077
#### Public lb: 0.1724
#### Private lb: 0.1638
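
The scoring step inside the loop turns the raw two-column model outputs into class-1 probabilities with a softmax and then computes log loss on them. Below is a dependency-free sketch of those two steps on made-up logits. The notebook itself uses `scipy.special.softmax` and `sklearn.metrics.log_loss`; the function names and numbers here are illustrative only.

```python
import math

def positive_probability(logits):
    """Softmax over a [logit_class0, logit_class1] pair, returning P(class 1)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    return exps[1] / sum(exps)

def binary_log_loss(y_true, p_pos, eps=1e-15):
    """Mean negative log-likelihood of the true labels given P(class 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p) if y == 1 else -math.log(1 - p)
    return total / len(y_true)

# Hypothetical raw outputs for three tweets and their true labels.
raw_outputs = [[2.0, -1.0], [-0.5, 1.5], [0.0, 0.0]]
labels = [0, 1, 1]

probs = [positive_probability(row) for row in raw_outputs]
print(probs)
print(binary_log_loss(labels, probs))
```

Because log loss punishes confident mistakes heavily, well-calibrated probabilities matter as much as getting the class right.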

# RoBERTa Large Model 2

In [None]:
%%time
err=[]
y_pred_tot=[]

# 20-fold stratified CV; a fresh model is trained on every fold.
fold=StratifiedKFold(n_splits=20, shuffle=True, random_state=2)
for train_index, test_index in fold.split(train1,train1['target']):
    train1_trn, train1_val = train1.iloc[train_index], train1.iloc[test_index]
    model = ClassificationModel('roberta', 'roberta-large', use_cuda=True,num_labels=2, args={'train_batch_size':32,
                                                                         'reprocess_input_data': True,
                                                                         'overwrite_output_dir': True,
                                                                         'fp16': False,
                                                                         'do_lower_case': False,
                                                                         'num_train_epochs': 2,
                                                                         'max_seq_length': 64,
                                                                         'regression': False,
                                                                         'manual_seed': 2,
                                                                         "learning_rate":5e-5,
                                                                         'weight_decay':0,
                                                                         "save_eval_checkpoints": False,
                                                                         "save_model_every_epoch": False,
                                                                         "silent": True})
    model.train_model(train1_trn)
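    # eval_model returns (result, model_outputs, wrong_predictions); keep the raw outputs.
    raw_outputs_val = model.eval_model(train1_val)[1]
    # Softmax over the two logits, then keep P(class 1).
    raw_outputs_val = softmax(raw_outputs_val,axis=1)[:,1]
    fold_loss = log_loss(train1_val['target'], raw_outputs_val)
    print(f"Log_Loss: {fold_loss}")
    err.append(fold_loss)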
    raw_outputs_test = model.eval_model(test1)[1]
    raw_outputs_test = softmax(raw_outputs_test,axis=1)[:,1]
    y_pred_tot.append(raw_outputs_test)
print("Mean LogLoss: ",np.mean(err))
final=pd.DataFrame()
final['ID']=test['ID']
final['target']=np.mean(y_pred_tot, 0)
print(final.shape)
final.to_csv('20fold_rbl_2_5e5_32_64_0.csv',index=False)

In [None]:
files.download("20fold_rbl_2_5e5_32_64_0.csv")

#### Local Mean LogLoss: 0.1974
#### Public lb: 0.1659
#### Private lb: 0.1660
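
Each model cell appends one vector of test probabilities per fold to `y_pred_tot` and submits the element-wise mean, `np.mean(y_pred_tot, 0)`. A toy sketch of that averaging, using three hypothetical folds and four test rows and written without NumPy so it runs anywhere:

```python
from statistics import fmean

# Hypothetical P(class 1) for four test rows, one list per fold.
y_pred_tot = [
    [0.90, 0.10, 0.60, 0.30],
    [0.80, 0.20, 0.50, 0.40],
    [0.70, 0.30, 0.70, 0.20],
]

# zip(*...) regroups the values by test row, so each mean is taken
# across folds, which is what np.mean(y_pred_tot, 0) does.
fold_mean = [fmean(row_preds) for row_preds in zip(*y_pred_tot)]
print(fold_mean)
```

Averaging over 20 fold models in this way is already a small ensemble, even before any blending across architectures.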

# RoBERTa Large Model 3

In [None]:
%%time
err=[]
y_pred_tot=[]

# 20-fold stratified CV; a fresh model is trained on every fold.
fold=StratifiedKFold(n_splits=20, shuffle=True, random_state=2)
for train_index, test_index in fold.split(train1,train1['target']):
    train1_trn, train1_val = train1.iloc[train_index], train1.iloc[test_index]
    model = ClassificationModel('roberta', 'roberta-large', use_cuda=True,num_labels=2, args={'train_batch_size':32,
                                                                         'reprocess_input_data': True,
                                                                         'overwrite_output_dir': True,
                                                                         'fp16': False,
                                                                         'do_lower_case': False,
                                                                         'num_train_epochs': 2,
                                                                         'max_seq_length': 128,
                                                                         'regression': False,
                                                                         'manual_seed': 2,
                                                                         "learning_rate":3e-5,
                                                                         'weight_decay':0,
                                                                         "save_eval_checkpoints": False,
                                                                         "save_model_every_epoch": False,
                                                                         "silent": True})
    model.train_model(train1_trn)
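    # eval_model returns (result, model_outputs, wrong_predictions); keep the raw outputs.
    raw_outputs_val = model.eval_model(train1_val)[1]
    # Softmax over the two logits, then keep P(class 1).
    raw_outputs_val = softmax(raw_outputs_val,axis=1)[:,1]
    fold_loss = log_loss(train1_val['target'], raw_outputs_val)
    print(f"Log_Loss: {fold_loss}")
    err.append(fold_loss)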
    raw_outputs_test = model.eval_model(test1)[1]
    raw_outputs_test = softmax(raw_outputs_test,axis=1)[:,1]
    y_pred_tot.append(raw_outputs_test)
print("Mean LogLoss: ",np.mean(err))
final=pd.DataFrame()
final['ID']=test['ID']
final['target']=np.mean(y_pred_tot, 0)
print(final.shape)
final.to_csv('20fold_rbl_2_3e5_32_128.csv',index=False)

In [None]:
files.download("20fold_rbl_2_3e5_32_128.csv")

#### Local Mean LogLoss: 0.1998
#### Public lb: 0.1692
#### Private lb: 0.1663
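
Model 3 is the cell flagged at the top of the notebook as needing the full 16 GB of GPU RAM. If your GPU has less, one common workaround is to halve the per-step batch and accumulate gradients over two steps, which keeps the effective batch size at 32. Treat the following as an untested sketch: `gradient_accumulation_steps` is a Simple Transformers training argument, but this configuration is not the one used for the scores reported here, so the results will not match exactly.

```python
# Hypothetical lower-memory variant of the Model 3 arguments
# (not the setup used for the reported scores).
low_memory_args = {
    'train_batch_size': 16,            # half of the original 32
    'gradient_accumulation_steps': 2,  # accumulate 2 steps before each update
    'max_seq_length': 128,
    'num_train_epochs': 2,
    'learning_rate': 3e-5,
    'manual_seed': 2,
}

effective_batch_size = (
    low_memory_args['train_batch_size']
    * low_memory_args['gradient_accumulation_steps']
)
print(effective_batch_size)
```

The remaining keys from the original `args` dict (for example `reprocess_input_data` and `overwrite_output_dir`) would be carried over unchanged.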

# BERT Large Model 4

In [10]:
%%time
err=[]
y_pred_tot=[]

# 20-fold stratified CV; a fresh model is trained on every fold.
fold=StratifiedKFold(n_splits=20, shuffle=True, random_state=2)
for train_index, test_index in fold.split(train1,train1['target']):
    train1_trn, train1_val = train1.iloc[train_index], train1.iloc[test_index]
    model = ClassificationModel('bert', 'bert-large-uncased-whole-word-masking', use_cuda=True,num_labels=2, args={'train_batch_size':32,
                                                                         'reprocess_input_data': True,
                                                                         'overwrite_output_dir': True,
                                                                         'fp16': False,
                                                                         'do_lower_case': True,
                                                                         'num_train_epochs': 2,
                                                                         'max_seq_length': 64,
                                                                         'regression': False,
                                                                         'manual_seed': 2,
                                                                         "learning_rate":5e-5,
                                                                         'weight_decay':0,
                                                                         "save_eval_checkpoints": False,
                                                                         "save_model_every_epoch": False,
                                                                         "silent": True})
    model.train_model(train1_trn)
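    # eval_model returns (result, model_outputs, wrong_predictions); keep the raw outputs.
    raw_outputs_val = model.eval_model(train1_val)[1]
    # Softmax over the two logits, then keep P(class 1).
    raw_outputs_val = softmax(raw_outputs_val,axis=1)[:,1]
    fold_loss = log_loss(train1_val['target'], raw_outputs_val)
    print(f"Log_Loss: {fold_loss}")
    err.append(fold_loss)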
    raw_outputs_test = model.eval_model(test1)[1]
    raw_outputs_test = softmax(raw_outputs_test,axis=1)[:,1]
    y_pred_tot.append(raw_outputs_test)
print("Mean LogLoss: ",np.mean(err))
final=pd.DataFrame()
final['ID']=test['ID']
final['target']=np.mean(y_pred_tot, 0)
print(final.shape)
final.to_csv('20fold_bluwwm_2_5e5_32_64.csv',index=False)

In [0]:
files.download("20fold_bluwwm_2_5e5_32_64.csv")

#### Local Mean LogLoss: 0.21xx
#### Public lb: 0.1893
#### Private lb: 0.1821

# Among these models, Model 1 alone was enough to reach private rank 1 on the leaderboard. But I did not know this at the time :)

# Here comes blending: I tried many blends and will show you a few of them.

In [None]:
tr3=pd.read_csv('20fold_rbl_2_3e5_32_64_0.csv')  # Mean LogLoss: 0.2077, Public lb: 0.1724
tr4=pd.read_csv('20fold_rbl_2_5e5_32_64_0.csv')  # Mean LogLoss: 0.1974, Public lb: 0.1659
tr5=pd.read_csv('20fold_rbl_2_3e5_32_128.csv')   # Mean LogLoss: 0.1998, Public lb: 0.1692
tr6=pd.read_csv('20fold_bluwwm_2_5e5_32_64.csv') # Mean LogLoss: 0.21XX, Public lb: 0.1893

final=pd.DataFrame()
final['ID'] = tr4['ID']

In [None]:
final['target'] =  tr4['target']*0.4 + tr5['target']*0.3 + tr3['target']*0.2 + tr6['target']*0.1
final.to_csv('ensemble6.csv', index=False) # Private lb: 0.1597
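
The blends in this section are all weighted averages of the per-model probabilities, with weights that sum to 1. A small sketch of the same idea as a reusable function follows; the name `blend` and the toy predictions are for illustration and are not part of the notebook.

```python
import math

def blend(predictions, weights):
    """Weighted average of several equal-length lists of probabilities."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction list")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights should sum to 1 so outputs stay in [0, 1]")
    # zip(*predictions) yields one tuple per test row, one value per model.
    return [
        sum(w * p for w, p in zip(weights, row))
        for row in zip(*predictions)
    ]

# Toy P(class 1) values standing in for tr4, tr5, tr3 and tr6.
tr4_p = [0.9, 0.2]
tr5_p = [0.8, 0.1]
tr3_p = [0.7, 0.3]
tr6_p = [0.6, 0.4]

# Same weights as ensemble6.
blended = blend([tr4_p, tr5_p, tr3_p, tr6_p], [0.4, 0.3, 0.2, 0.1])
print(blended)
```

Checking that the weights sum to 1 catches the easy mistake of editing one weight and forgetting to rebalance the others.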

In [None]:
final['target'] =  tr4['target']*0.4 + tr5['target']*0.25 + tr3['target']*0.2 + tr6['target']*0.15
final.to_csv('ensemble7.csv', index=False) # Private lb: 0.1591

In [None]:
final['target'] =  ((tr4['target']*0.7 + tr5['target']*0.3)*0.7 + tr3['target']*0.3)*0.7 + tr6['target']*0.3
final.to_csv('ensemble8.csv', index=False) # Private lb: 0.1588
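
The nested expression for ensemble8 is still a plain weighted average; expanding the brackets gives the effective weight of each model. A quick check of that arithmetic:

```python
import math

# ((tr4*0.7 + tr5*0.3)*0.7 + tr3*0.3)*0.7 + tr6*0.3, expanded per model.
effective_weights = {
    'tr4': 0.7 * 0.7 * 0.7,  # RoBERTa, lr 5e-5, seq 64
    'tr5': 0.3 * 0.7 * 0.7,  # RoBERTa, lr 3e-5, seq 128
    'tr3': 0.3 * 0.7,        # RoBERTa, lr 3e-5, seq 64
    'tr6': 0.3,              # BERT whole-word-masking
}

for name, weight in effective_weights.items():
    print(f"{name}: {weight:.3f}")

# The weights still sum to 1, so the blend stays a valid probability.
total_weight = sum(effective_weights.values())
print(f"total: {total_weight:.3f}")
```

So ensemble8 works out to roughly 0.343, 0.147, 0.21 and 0.3, which is the largest BERT weight among the blends shown here.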

In [None]:
final['target'] =  tr4['target']*0.5 + tr5['target']*0.3 + tr6['target']*0.2
final.to_csv('ensemble9.csv', index=False) # Private lb: 0.1595

# All of them gave a private score under 0.16. Power of blending :)

### PS: I was exhausted by then so I did not note their public scores.

#### Most of you would be using only roberta-large, and if you are wondering what the best score was when blending only the RoBERTa models, it was 0.1629, just a little lower than Model 1's single score of 0.1638.

# Best Score: ensemble8

Simple Transformers is a good library for getting started with SOTA models, for all the noobs out there like me :)