# Wassim Henia Solution

<font size="4"> Hello, this is my solution to the AI4D iCompass Social Media Sentiment Analysis for Tunisian Arabizi challenge. Both training and inference are performed in this one notebook. Both finish in less than 8 hours, so they respect the time limits (8 hours for training and 2 hours for inference). </font>

-------

<font size="3"> <b> Solution Overview: </b> The Tunisian dialect, and Maghrebi dialects generally, are a mosaic of different languages, namely Arabic (to a large degree), French, English and a few Italian loan words. Using transfer learning alongside ensembling should boost the model's performance. A good ensemble lets different models capture the different parts of a sentence that are crucial to determining its underlying sentiment. </font>

<font size="3"> However, this poses several challenges. First, transfer learning models are huge, with millions of parameters, so the eight-hour ceiling greatly limits the size of the ensemble. Second, all open-source BERT-like Arabic models only support Arabic script, and the data is in Arabizi. Lastly, the dataset is relatively small and somewhat biased, so using it to train a model that generalises well is difficult.</font>

<font size="3"> To address this, my solution focused on three things: speeding up data loading and optimizing training time; finding a good validation strategy to assess the model's performance despite the small dataset and to reduce bias; and crafting an algorithm to encode Arabizi text into Arabic characters. </font>

<font size="4">  <b>Important Note: </b></font> <font size="3"> This notebook doesn't reproduce exactly the same score as on the leaderboard, although the result would still rank first. This is because I trained the Simple Transformers model and the fastai model on a Tesla P4 GPU for the leaderboard submission, whereas here they are trained on a P100. Changing the GPU makes the models non-reproducible despite fixing the seed. </font>

<font size="3"> Models that I have used:  </font>
    
    - DistilBERT multilingual (Simple Transformers)
    - fastai text classifier
    - Multilingual BERT (Hugging Face)
    - CamemBERT (Hugging Face) for French understanding
    - DistilBERT multilingual (Hugging Face)
    - Arabic Dialect BERT (Hugging Face)
    
<font size="3"> Note that English BERT models are trained on English Wikipedia, so they also pick up common words from other languages, and this proves to be helpful at inference. </font>

In [None]:
import time

start_time = time.time()

In [None]:
from fastai import *
from fastai.text import *
from fastai.callbacks import *

In [None]:
import pandas as pd
import numpy as np
import torchvision
import os
from torchvision import transforms, utils
from PIL import Image
from torchvision.transforms import ToTensor
from torch.autograd import Variable
import torch
import torch.nn.functional as F
from torch.utils.data.sampler import SubsetRandomSampler
import matplotlib.pyplot as plt
import PIL 
from torch import nn
import time
import random
from matplotlib import gridspec
from sklearn.model_selection import train_test_split

import re

In [None]:
import transformers
from transformers import TFAutoModel, AutoTokenizer
from tqdm.notebook import tqdm
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, processors

from sklearn.model_selection import KFold

import warnings 
warnings.simplefilter('ignore')

from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler, Dataset, Sampler
from transformers import BertTokenizer, AutoModel


In [None]:
#Sorted sentences from shortest to longest to speed up the inference time
test = pd.read_csv("Test.csv")
test.columns = ["ID", "text"]

test["lens"] = test.text.apply(len)
test = test.sort_values(by="lens")

test_idxes = test.ID
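Sorting by length only pays off if each prediction can still be matched to its original row. The notebook handles this by reloading the test set in the `test_idxes` order. The sketch below shows the same idea in plain Python. The helper name `predict_sorted` and the stand-in `predict_batch` callable are hypothetical and not part of the notebook.

```python
def predict_sorted(texts, predict_batch, batch_size=2):
    """Run predict_batch on length-sorted batches, then restore the input order."""
    # Indices of the texts, ordered from shortest to longest
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    preds = [None] * len(texts)

    for start in range(0, len(order), batch_size):
        batch_idx = order[start:start + batch_size]
        batch_preds = predict_batch([texts[i] for i in batch_idx])
        # Write each prediction back to the position of its original row
        for i, p in zip(batch_idx, batch_preds):
            preds[i] = p
    return preds


texts = ["barcha behi", "ok", "ma3jebnich el film hedha"]
# Stand-in "model" (assumption): it simply returns the length of each text
lengths = predict_sorted(texts, lambda batch: [len(t) for t in batch])
print(lengths)
```

Because batches contain sentences of similar length, each batch needs little padding, while the returned list is still aligned with the input rows.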

# Simple Transformers

In [None]:
!pip install simpletransformers -q

from simpletransformers.classification import ClassificationModel
from sklearn.model_selection import train_test_split

In [None]:
#Utility function that loads train and test in the same order
def get_data():
    
    train = pd.read_csv("Train.csv")
    test = pd.read_csv("Test.csv")
    
    train.columns = ["ID", "text", "label"]
    test.columns = ["ID", "text"]
    
    test = test.set_index("ID", drop=True).loc[test_idxes].reset_index()
    
    return train, test

<font size="3"> The function `removeDuplicates` collapses runs of repeated characters. It transforms a sentence such as "Chibkkk yesssser ta7kiii!!!" into "Chibk yeser ta7ki!", which improves accuracy in some models. </font>

In [None]:
def removeDuplicates(S):
    """Collapse runs of the same character in S (a list of characters) and return a string."""

    n = len(S)

    # Nothing to collapse in an empty or single-character input;
    # return it as a string so that callers never receive None.
    if n < 2:
        return "".join(S)

    # j is the index of the last distinct character written so far
    j = 0

    # Traverse the characters
    for i in range(n):

        # If the current character S[i] differs from S[j],
        # write it at the next free position
        if S[j] != S[i]:
            j += 1
            S[j] = S[i]

    # Keep only the collapsed prefix
    j += 1
    S = S[:j]
    return "".join(S)
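The same run-collapsing can be written with a single regular expression. This is a minimal sketch with a hypothetical name, `collapse_repeats`. It is an alternative formulation, not the function used in the rest of the notebook.

```python
import re


def collapse_repeats(text):
    # (.) captures one character; \1+ matches one or more further copies of it
    return re.sub(r"(.)\1+", r"\1", text)


print(collapse_repeats("Chibkkk yesssser ta7kiii!!!"))  # Chibk yeser ta7ki!
```

Unlike the list-based version, this takes a string directly, so no `list(...)` conversion is needed before the call.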

<font size="3"> Each model has its own fixed seed </font>

In [None]:
def set_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.enabled = False 
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    
set_seed(7)

In [None]:
#Add one to the labels because you can't set a negative number to be the target in CrossEntropy loss
train, test = get_data()
train.label+=1

train["text"]=train['text'].apply(lambda x :removeDuplicates(list(x.rstrip())) )
test["text"]=test['text'].apply(lambda x :removeDuplicates(list(x.rstrip())) )
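The `+1` shift has to be undone when predictions are written out. The sketch below assumes the original labels are -1, 0 and 1, which is an inference from the shift and from `num_labels=3` rather than something stated in this section. The helper names are hypothetical.

```python
LABEL_SHIFT = 1  # assumption: original labels are -1, 0, 1


def to_class_index(label):
    """Map an original label to a non-negative class index for CrossEntropyLoss."""
    return label + LABEL_SHIFT


def to_label(class_index):
    """Map a predicted class index back to the original label space."""
    return class_index - LABEL_SHIFT


predicted_classes = [0, 2, 1]  # e.g. argmax over the three logits
labels = [to_label(c) for c in predicted_classes]
print(labels)  # [-1, 1, 0]
```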

In [None]:
#Utility function to get the model
def get_model(model_type, model_name, n_epochs = 2, train_batch_size = 100, eval_batch_size = 64, seq_len = 120, lr = 2e-5):
  model = ClassificationModel(model_type, model_name,num_labels=3, args={'train_batch_size':train_batch_size,
                                                                         "eval_batch_size": eval_batch_size,
                                                                         'reprocess_input_data': True,
                                                                         'overwrite_output_dir': True,
                                                                         'fp16': False,
                                                                         'do_lower_case': False,
                                                                         'num_train_epochs': n_epochs,
                                                                         'max_seq_length': seq_len,
                                                                         'manual_seed': 2,
                                                                         "learning_rate":lr,
                                                                         "save_eval_checkpoints": False,
                                                                         "save_model_every_epoch": False,})
  return model

In [None]:
tmp = pd.DataFrame()
tmp['text'] = train['text']
tmp['labels'] = train['label']


tmp_trn, tmp_val = train_test_split(tmp, test_size=0.1, random_state=2)


In [None]:
if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

In [None]:
fast_model = get_model('distilbert', 'distilbert-base-multilingual-cased', n_epochs=2,lr=2e-4,seq_len=150,train_batch_size=160)


In [None]:
#Training the model
fast_model.train_model(tmp_trn)

# Fast AI 

fastai's text classifier uses an RNN architecture (AWD-LSTM) rather than the BERT architecture, so it adds diversity and boosts the score of the ensemble.
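One common way to combine architecturally different models is soft voting, that is, averaging the class probabilities each model assigns. The sketch below is only an illustration of that idea. The function name `soft_vote`, the equal default weights and the toy probabilities are assumptions, and this section does not show how the final ensemble is actually blended.

```python
def soft_vote(prob_sets, weights=None):
    """Average per-model class probabilities and return the winning class per sample.

    prob_sets: one entry per model; each entry is a list with one
               list of class probabilities per sample.
    """
    n_models = len(prob_sets)
    if weights is None:
        weights = [1.0 / n_models] * n_models  # equal weights by default

    n_samples = len(prob_sets[0])
    n_classes = len(prob_sets[0][0])

    preds = []
    for s in range(n_samples):
        avg = [
            sum(w * probs[s][c] for w, probs in zip(weights, prob_sets))
            for c in range(n_classes)
        ]
        preds.append(max(range(n_classes), key=avg.__getitem__))
    return preds


# Toy probabilities for two samples and three classes (made up for illustration)
bert_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
lstm_probs = [[0.3, 0.4, 0.3], [0.1, 0.2, 0.7]]
print(soft_vote([bert_probs, lstm_probs]))  # [0, 2]
```

On the second sample the two models disagree, and the more confident one decides the outcome, which is the benefit of mixing models that make different mistakes.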

In [None]:
def set_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.enabled = False 
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)

In [None]:
set_seed(7)

In [None]:
train, test = get_data()

In [None]:
data = (TextList.from_df(pd.concat([train, test]), cols='text')
                .split_by_rand_pct(0.1,seed=7)
                .label_for_lm()  
                .databunch(bs=48))
data.show_batch()

In [None]:
learn = language_model_learner(data,AWD_LSTM, drop_mult=0.8)


In [None]:
learn.lr_find()
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(5, 1e-2, moms=(0.8,0.7))


In [None]:
learn.unfreeze()
learn.fit_one_cycle(5, 1e-3, moms=(0.8,0.7))

In [None]:
learn.save_encoder('fine_tuned_enc')

In [None]:
test_datalist = TextList.from_df(test, cols='text', vocab=data.vocab)


In [None]:
data_clas = (TextList.from_df(train, cols='text', vocab=data.vocab)
             .split_by_rand_pct(0.1,seed=7)
             .label_from_df(cols= 'label')
             .add_test(test_datalist)
             .databunch(bs=32))

data_clas.show_batch()

In [None]:
learn_classifier = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.3)

# load the encoder saved  
learn_classifier.load_encoder('fine_tuned_enc')

In [None]:
learn_classifier.freeze()
learn_classifier.lr_find()
learn_classifier.recorder.plot()


In [None]:
learn_classifier.fit_one_cycle(3, 2e-2, moms=(0.8,0.7))


In [None]:
learn_classifier.freeze_to(-2)
learn_classifier.fit_one_cycle(5, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))

In [None]:
learn_classifier.freeze_to(-3)
learn_classifier.lr_find()
learn_classifier.recorder.plot()

In [None]:
learn_classifier.freeze_to(-3)
learn_classifier.fit_one_cycle(4, slice(2e-5/(2.6**4),2e-5), moms=(0.8,0.7))

In [None]:
learn_classifier.unfreeze()
learn_classifier.lr_find()
learn_classifier.recorder.plot()

In [None]:
learn_classifier.fit_one_cycle(2, slice(2e-4/(2.6**4),2e-4), moms=(0.8,0.7))

# Bert-Base-Uncased

In [None]:
loss_fn = nn.CrossEntropyLoss()

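<font size="3"> Similar in purpose to removeDuplicates, but using regular expressions: it lowercases the text, normalizes whitespace, and collapses a two-character sequence that repeats back to back (for example, "kkkk" becomes "kk"). </font>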

In [None]:
def text_preprocessing(text): 

    text = text.lower()
    text = re.sub(r'\s+', ' ', text).strip()
    
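    # Collapse a two-character sequence repeated back to back ("kkkk" -> "kk", "lolo" -> "lo").
    # Note: the class [a-g-i-z] matches a-g, a literal '-', and i-z, so 'h' is excluded.
    text = re.sub(r'([a-g-i-z][a-g-i-z])\1+', r'\1', text)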
        
    return text
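A few sample inputs make the behaviour of this pattern concrete. The function body below is copied from the cell above so that the snippet runs on its own. The example strings are made up for illustration.

```python
import re


def text_preprocessing(text):
    text = text.lower()
    text = re.sub(r'\s+', ' ', text).strip()
    text = re.sub(r'([a-g-i-z][a-g-i-z])\1+', r'\1', text)
    return text


print(text_preprocessing("Chibkkkk   yesssser"))  # chibkk yesser
print(text_preprocessing("lolololo"))              # lo
print(text_preprocessing("hahaha"))                # hahaha ('h' is not in the class)
print(text_preprocessing("kkk"))                   # kkk (needs at least two full pairs)
```

Unlike `removeDuplicates`, a run is only shortened when a pair of characters repeats at least twice, so runs of three identical letters are left untouched.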


In [None]:
train, test = get_data()
train.label+=1

In [None]:
train["text"]=train['text'].apply(text_preprocessing)
test["text"]=test['text'].apply(text_preprocessing)

In [None]:
#Load the tokenizer for bert-base-uncased, and set the padding token
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name ,do_lower_case=True)

pad = tokenizer.pad_token_id

In [None]:
#Utility function that tokenizes sentences, adds special tokens, and clips sequences longer than max_len (keeping the final special token).
#The return value is a tuple containing: list of token-id lists, list of attention masks
def preprocessing_for_bert(data, max_len=256):

    input_ids = []
    attention_masks = []
    tmp = tokenizer.encode("ab")[-1]

    for sentence in data:

        encoding = tokenizer.encode(sentence)

        if len(encoding) > max_len:
            encoding = encoding[:max_len-1] + [tmp]

        in_ids = encoding
        att_mask = [1]*len(encoding)
        
        input_ids.append(in_ids)
        attention_masks.append(att_mask)

    return input_ids, attention_masks
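The clipping step keeps the closing special token so that a truncated sequence still ends the way the model expects. This is a minimal sketch of that step on plain lists, with a hypothetical helper name. The ids 101 and 102 are the `[CLS]` and `[SEP]` ids of `bert-base-uncased`, and the other ids are arbitrary.

```python
def truncate_keep_last(token_ids, max_len, end_id):
    """Clip token_ids to max_len while keeping end_id as the final token."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    # Keep the first max_len - 1 tokens and re-append the end token
    return list(token_ids[:max_len - 1]) + [end_id]


encoded = [101, 7, 8, 9, 10, 102]
print(truncate_keep_last(encoded, max_len=4, end_id=102))  # [101, 7, 8, 102]
```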

In [None]:
#Custom PyTorch dataset that inherits from torch.utils.data.Dataset
#It returns the tokenized sentence, the mask, the label (when training), and the sentence length
class BertDataset(Dataset):

    def __init__(self, data, masks, label=None):
        
        self.data = data
        self.masks = masks
        
        # None means the dataset is unlabeled (test data)
        self.labels = label
        
        self.lengths = [len(i) for i in data]
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        # `is not None` avoids an element-wise comparison when labels is a tensor
        if self.labels is not None:
            return (self.data[idx], self.masks[idx], self.labels[idx], self.lengths[idx])
        else:  #For inference on unlabeled data
            return (self.data[idx], self.masks[idx], None, self.lengths[idx])

In [None]:
#This data collator pads sentences to longest sentence in the batch. It does speed up data loading.
def data_collator(data):
    
    sentence, mask, label, length = zip(*data)
    
    tensor_dim = max(length)
    
    out_sentence = torch.full((len(sentence), tensor_dim), dtype=torch.long, fill_value=pad)
    out_mask = torch.zeros(len(sentence), tensor_dim, dtype=torch.long)

    for i in range(len(sentence)):
        
        out_sentence[i][:len(sentence[i])] = torch.Tensor(sentence[i])
        out_mask[i][:len(mask[i])] = torch.Tensor(mask[i])
    
    if label[0] is not None:
        return (out_sentence, out_mask, torch.Tensor(label).long())
    else:
        return (out_sentence, out_mask)
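To see why padding to the longest sentence in each batch helps, it is enough to count the token slots the model has to process. The sketch below compares that count with padding every sentence to a fixed length of 256. The function names and the toy lengths are assumptions for illustration.

```python
def slots_fixed(lengths, max_len):
    """Token slots processed when every sentence is padded to max_len."""
    return max_len * len(lengths)


def slots_dynamic(lengths, batch_size):
    """Token slots processed when each batch is padded to its own longest sentence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


lengths = [5, 12, 7, 30, 9, 6, 40, 8]  # toy sentence lengths in tokens
print(slots_fixed(lengths, max_len=256))    # 2048
print(slots_dynamic(lengths, batch_size=4))  # 280
```

The gap grows further when sentences of similar length are batched together, since each batch maximum then sits close to every sentence in that batch.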

In [None]:
#This custom sampler is key in speeding up the training
#It samples sentences with similar length together
#This greatly reduces the paddings in the sentence, and saves up important computation
#The sampler return the indices in order that the dataloader would use for creating the batches
class KSampler(Sampler):

    def __init__(self, data_source, batch_size):
        self.lens = [x[3] for x in data_source]  #Stores the lengths of the sentences (4th item returned by the dataset)
        self.batch_size = batch_size

    def __iter__(self):

        idx = list(range(len(self.lens)))  #Indexes of the sentences
        arr = list(zip(self.lens, idx))  #Array of tuples containing lengths of the sentence and its index

        random.shuffle(arr)   #Randomly shuffle them
        n = self.batch_size*100

        iterator = []

        for i in range(0, len(self.lens), n):
            dt = arr[i:i+n]  #Get batch_size*100 element
            dt = sorted(dt, key=lambda x: x[0])  #Sort them from shortest, to longest

            for j in range(0, len(dt), self.batch_size):
                indices = list(map(lambda x: x[1], dt[j:j+self.batch_size]))  #Get and store the indices of every batch
                iterator.append(indices)

        random.shuffle(iterator) #Randomly shuffle the batches
        return iter([item for sublist in iterator for item in sublist])  #Flatten nested list

    def __len__(self):
        return len(self.lens)
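The bucketing logic can be tried in isolation on plain lists. The sketch below is a simplified stand-in with a hypothetical name, `bucketed_batches`. It returns whole batches instead of a flat index stream, and it uses its own seeded random generator, so it is not a drop-in replacement for the class above.

```python
import random


def bucketed_batches(lengths, batch_size, chunk_batches=100, seed=0):
    """Group indices into batches of similar length, then shuffle the batches."""
    rng = random.Random(seed)
    pairs = list(zip(lengths, range(len(lengths))))
    rng.shuffle(pairs)  # randomize which sentences share a chunk

    chunk = batch_size * chunk_batches
    batches = []
    for start in range(0, len(pairs), chunk):
        # Sort only within the chunk, from shortest to longest
        block = sorted(pairs[start:start + chunk], key=lambda p: p[0])
        for j in range(0, len(block), batch_size):
            batches.append([idx for _, idx in block[j:j + batch_size]])

    rng.shuffle(batches)  # randomize the order in which batches are seen
    return batches


lengths = list(range(1, 65))  # 64 toy sentences with lengths 1..64
batches = bucketed_batches(lengths, batch_size=8)
spreads = [max(lengths[i] for i in b) - min(lengths[i] for i in b) for b in batches]
print(spreads)
```

Returning whole batches matches the shape expected by the `batch_sampler` argument of `DataLoader`, which keeps a short final batch from shifting the boundaries of the batches that follow it.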


In [None]:
import torch
import torch.nn as nn
from transformers import BertModel

# The bert Classifier
class BertClassifier(nn.Module):
    """Bert Model for Classification Tasks.
    """
    def __init__(self, freeze_bert=False):
        """
        @param    freeze_bert (bool): Set `False` to fine-tune the BERT model,
                      `True` to train only the classifier head
        """
        super(BertClassifier, self).__init__()
        # Specify hidden size of BERT, hidden size of our classifier, and number of labels
        D_in, H, D_out = 768, 200, 3
#768,100,3
        # Instantiate BERT model
        self.bert = BertModel.from_pretrained(model_name)
        print(model_name)

        # Instantiate an one-layer feed-forward classifier
        self.classifier = nn.Sequential(
            nn.Linear(D_in, H),
            nn.ReLU(),
            nn.Dropout(0.05),
            nn.Linear(H, D_out),      
        )

        # Freeze the BERT model
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False
        
    def forward(self, input_ids, attention_mask):
        """
        Feed input to BERT and the classifier to compute logits.
        @param    input_ids (torch.Tensor): an input tensor with shape (batch_size,
                      max_length)
        @param    attention_mask (torch.Tensor): a tensor that hold attention mask
                      information with shape (batch_size, max_length)
        @return   logits (torch.Tensor): an output tensor with shape (batch_size,
                      num_labels)
        """
        # Feed input to BERT
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)
        
        # Extract the last hidden state of the token `[CLS]` for classification task
        last_hidden_state_cls = outputs[0][:, 0, :]

        # Feed input to classifier to compute logits
        logits = self.classifier(last_hidden_state_cls)
        return logits

In [None]:
from transformers import AdamW, get_linear_schedule_with_warmup

def initialize_model(epochs=4):
    """Initialize the Bert Classifier, the optimizer and the learning rate scheduler.
    """
    # Instantiate Bert Classifier
    bert_classifier = BertClassifier(freeze_bert=False)

    # Tell PyTorch to run the model on GPU
    bert_classifier.to(device)
    #tried 1e-5/2e-5/6e-5/4e-/3/7/
    # Create the optimizer
    optimizer = AdamW(bert_classifier.parameters(),#5e-5
                      lr=6e-5,    # Learning rate chosen after trying several values
                      eps=1e-8    # Epsilon value
                      )

    # Total number of training steps
    total_steps = len(train_dataloader) * epochs

    # Set up the learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=0, # Default value
                                                num_training_steps=total_steps)
    return bert_classifier, optimizer, scheduler
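With `num_warmup_steps=0`, the scheduler decays the learning rate linearly from its initial value to zero over `total_steps`. The sketch below reproduces, as far as I understand it, the multiplier that `get_linear_schedule_with_warmup` applies to the base learning rate. The function name is hypothetical and the step counts are made up.

```python
def linear_schedule_multiplier(step, num_training_steps, num_warmup_steps=0):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 6e-5
for step in (0, 500, 1000):
    print(step, base_lr * linear_schedule_multiplier(step, num_training_steps=1000))
```

Because `total_steps` is computed from `len(train_dataloader) * epochs`, the value of `epochs` passed to `initialize_model` should match the one passed to the training loop, otherwise the decay ends too early or too late.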

In [None]:
def set_seed(seed_value=5):
    """Set seed for reproducibility.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)
    
set_seed()

In [None]:
def train_fn(model, train_dataloader, val_dataloader=None, fold=None, epochs=4, evaluation=False, prefix=""):
    """Train the BertClassifier model.
    """
    # Start training loop
    max_acc = -99
    print("Start training, fold %d ...\n" % (fold))
    for epoch_i in range(epochs):
        # =======================================
        #               Training
        # =======================================
        # Print the header of the result table
        print(f"{'Epoch':^7} | {'Batch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9} | {'Elapsed':^9}")
        print("-"*70)

        # Measure the elapsed time of each epoch
        t0_epoch, t0_batch = time.time(), time.time()

        # Reset tracking variables at the beginning of each epoch
        total_loss, batch_loss, batch_counts = 0, 0, 0

        # Put the model into the training mode
        model.train()

        # For each batch of training data...
        for step, batch in enumerate(train_dataloader):
            batch_counts +=1
            # Load batch to GPU
            b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)
            # Zero out any previously calculated gradients
            model.zero_grad()

            # Perform a forward pass. This will return logits.
            logits = model(b_input_ids, b_attn_mask)

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels)
            batch_loss += loss.item()
            total_loss += loss.item()

            # Perform a backward pass to calculate gradients
            loss.backward()

            # Clip the norm of the gradients to 1.0 to prevent "exploding gradients"
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            # Update parameters and the learning rate
            optimizer.step()
            scheduler.step()

            # Print the loss values and time elapsed for every 20 batches
            if (step % 20 == 0 and step != 0) or (step == len(train_dataloader) - 1):
                # Calculate time elapsed for 20 batches
                time_elapsed = time.time() - t0_batch
                
                preds = torch.argmax(logits, dim=1).flatten()
                accuracy = (preds == b_labels).cpu().numpy().mean() * 100
                # Print training results
                print(f"{epoch_i + 1:^7} | {step:^7} | {batch_loss / batch_counts:^12.6f} | {'-':^10} | {accuracy:^9.2f} | {time_elapsed:^9.2f}")

                # Reset batch tracking variables
                batch_loss, batch_counts = 0, 0
                t0_batch = time.time()
                
            if step%200 == 0 and step != 0 and epoch_i != 0 and epoch_i != 1: #From the third epoch on, calculate validation accuracy every 200 steps
                
                print("-"*70)

                if evaluation == True:

                    val_loss, val_accuracy = evaluate_fn(model, val_dataloader)
                    
                    if val_accuracy > max_acc:
                        max_acc = val_accuracy
                        torch.save(model, prefix+"best_"+str(fold))
                        print("new max")
                        

                    print(val_accuracy)
                    
                    print("-"*70)
                print("\n")
                
                model.train()

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)

        print("-"*70)
        # =======================================
        #               Evaluation
        # =======================================
        if evaluation == True:
            # After the completion of each training epoch, measure the model's performance
            # on our validation set.
            val_loss, val_accuracy = evaluate_fn(model, val_dataloader)
            
            if val_accuracy > max_acc:
                max_acc = val_accuracy
                torch.save(model, prefix+"best_"+str(fold))
                print("new max")
                

            # Print performance over the entire training data
            time_elapsed = time.time() - t0_epoch
            
            print(f"{epoch_i + 1:^7} | {'-':^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")
            print("-"*70)
        print("\n")
    
    print("Training complete!")


def evaluate_fn(model, val_dataloader):
    """After the completion of each training epoch, measure the model's performance
    on our validation set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)
            

        # Compute loss
        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())

        # Get the predictions
        preds = torch.argmax(logits, dim=1).flatten()

        # Calculate the accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        val_accuracy.append(accuracy)

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy

In [None]:
import torch

if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

In [None]:
MAX_LEN = 256

#Preprocess data
X, X_masks = preprocessing_for_bert(train['text'].values, max_len=MAX_LEN)

In [None]:
def get_indices(arr, idxs):  #Helper function to get multiple indexes from a list
    
    output = []
    for idx in idxs:
        output.append(arr[idx])
        
    return output

<font size="3"> To train every Hugging Face model, I used cross-validation and, for each fold, saved the checkpoint with the highest validation score. </font>
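For readers unfamiliar with K-fold splitting, the sketch below shows in plain Python what a shuffled split does: every sample lands in the validation set of exactly one fold. The function name `kfold_indices` is hypothetical, and its shuffled indices will not match those produced by scikit-learn's `KFold`.

```python
import random


def kfold_indices(n_samples, n_splits=5, seed=2020):
    """Yield (train_indices, val_indices) for a shuffled K-fold split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)

    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]

    start = 0
    for size in fold_sizes:
        val = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, val
        start += size


folds = list(kfold_indices(23))
for fold, (train_ids, val_ids) in enumerate(folds):
    print(fold, len(train_ids), len(val_ids))
```

Keeping the best checkpoint per fold yields one model per fold, each validated on data it never trained on.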

In [None]:
n = 5

kfolds = KFold(n_splits=n, shuffle=True, random_state=2020)
fold = 0

for train_ids, val_ids in kfolds.split(X):
    
    train_inputs, train_masks = get_indices(X, train_ids) , get_indices(X_masks, train_ids)
    train_labels = train.label.values[train_ids]
    
    val_inputs, val_masks = get_indices(X, val_ids) , get_indices(X_masks, val_ids)
    val_labels = train.label.values[val_ids]
    
    batch_size = 32
    
    
    val_inputs, val_labels, val_masks = list(zip(*sorted(zip(val_inputs, val_labels, val_masks), key=lambda x: len(x[0]))))  #Order the validation data for faster validation
    val_inputs, val_labels, val_masks = list(val_inputs), list(val_labels), list(val_masks)
    

    train_labels = torch.tensor(train_labels)
    val_labels = torch.tensor(val_labels)
    
    # Create the DataLoader for our training set
    train_data = BertDataset(train_inputs, train_masks, train_labels)  #Use the custom dataset
    train_sampler = KSampler(train_data, batch_size)  #Use the custom sampler
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size, collate_fn=data_collator)  #Use the custom collator

    # Create the DataLoader for our validation set
    val_data = BertDataset(val_inputs, val_masks, val_labels)
    val_sampler = SequentialSampler(val_data)
    val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size, collate_fn=data_collator)
    
    
    bert_classifier, optimizer, scheduler = initialize_model(epochs=3)
    train_fn(bert_classifier, train_dataloader, val_dataloader, fold= fold, epochs=3, evaluation=True, prefix="bert_")
    
    fold += 1

<font size="3"> Using the custom sampler, dataset, and data collator reduced training time from 90 minutes to 8 minutes. I used the same strategy for the rest of the models, with only minor differences in the pre-processing. </font>

# Multi-lingual Bert

In [None]:
def set_seed(seed_value=5):
    """Set seed for reproducibility.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)
set_seed()


In [None]:
train, test = get_data()
train.label+=1



model_name = 'bert-base-multilingual-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name ,do_lower_case=True)

pad = tokenizer.pad_token_id

In [None]:
train["text"]=train['text'].apply(lambda x :removeDuplicates(list(x.rstrip())) )
test["text"]=test['text'].apply(lambda x :removeDuplicates(list(x.rstrip())) )

In [None]:
X, X_masks = preprocessing_for_bert(train['text'].values, max_len=MAX_LEN)

In [None]:
n = 5

kfolds = KFold(n_splits=n, shuffle=True, random_state=2020)
fold = 0

best_accs = []

for train_ids, val_ids in kfolds.split(X):
    
    train_inputs, train_masks = get_indices(X, train_ids) , get_indices(X_masks, train_ids)
    train_labels = train.label.values[train_ids]
    
    val_inputs, val_masks = get_indices(X, val_ids) , get_indices(X_masks, val_ids)
    val_labels = train.label.values[val_ids]
    
    batch_size = 32
    
    
    val_inputs, val_labels, val_masks = list(zip(*sorted(zip(val_inputs, val_labels, val_masks), key=lambda x: len(x[0]))))  #Order the validation data for faster validation
    val_inputs, val_labels, val_masks = list(val_inputs), list(val_labels), list(val_masks)
    

    train_labels = torch.tensor(train_labels)
    val_labels = torch.tensor(val_labels)
    
    # Create the DataLoader for our training set
    train_data = BertDataset(train_inputs, train_masks, train_labels)
    train_sampler = KSampler(train_data, batch_size)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size, collate_fn=data_collator)

    # Create the DataLoader for our validation set
    val_data = BertDataset(val_inputs, val_masks, val_labels)
    val_sampler = SequentialSampler(val_data)
    val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size, collate_fn=data_collator)
    
    
    bert_classifier, optimizer, scheduler = initialize_model(epochs=3)
    train_fn(bert_classifier, train_dataloader, val_dataloader, fold= fold, epochs=3, evaluation=True, prefix="multi-bert_")
    
    fold += 1

# Camembert

In [None]:
def set_seed(seed_value=5):
    """Set seed for reproducibility.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)
set_seed()

In [None]:
train, test = get_data()
train.label+=1



model_name = 'camembert-base'
tokenizer = AutoTokenizer.from_pretrained(model_name ,do_lower_case=True)

pad = tokenizer.pad_token_id

In [None]:
import torch
import torch.nn as nn
from transformers import BertModel

# Create the BertClassifier class
class BertClassifier(nn.Module):
    """Bert Model for Classification Tasks.
    """
    def __init__(self, freeze_bert=False):
        """
        @param    freeze_bert (bool): Set `False` to fine-tune the BERT model,
                      `True` to train only the classifier head
        """
        super(BertClassifier, self).__init__()
        # Specify hidden size of BERT, hidden size of our classifier, and number of labels
        D_in, H, D_out = 768, 200, 3
#768,100,3
        # Instantiate BERT model
        self.bert = AutoModel.from_pretrained(model_name)
        print(model_name)

        # Instantiate an one-layer feed-forward classifier
        self.classifier = nn.Sequential(
            nn.Linear(D_in, H),
            nn.ReLU(),
            nn.Dropout(0.05),
            nn.Linear(H, D_out),      
        )

        # Freeze the BERT model
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False
        
    def forward(self, input_ids, attention_mask):
        """
        Feed input to BERT and the classifier to compute logits.
        @param    input_ids (torch.Tensor): an input tensor with shape (batch_size,
                      max_length)
        @param    attention_mask (torch.Tensor): a tensor that hold attention mask
                      information with shape (batch_size, max_length)
        @return   logits (torch.Tensor): an output tensor with shape (batch_size,
                      num_labels)
        """
        # Feed input to BERT
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)
        
        # Extract the last hidden state of the token `[CLS]` for classification task
        last_hidden_state_cls = outputs[0][:, 0, :]

        # Feed input to classifier to compute logits
        logits = self.classifier(last_hidden_state_cls)
        return logits

In [None]:
X, X_masks = preprocessing_for_bert(train['text'].values, max_len=MAX_LEN)

In [None]:
n = 5

kfolds = KFold(n_splits=n, shuffle=True, random_state=2020)
fold = 0

best_accs = []

for train_ids, val_ids in kfolds.split(X):
    
    train_inputs, train_masks = get_indices(X, train_ids) , get_indices(X_masks, train_ids)
    train_labels = train.label.values[train_ids]
    
    val_inputs, val_masks = get_indices(X, val_ids) , get_indices(X_masks, val_ids)
    val_labels = train.label.values[val_ids]
    
    batch_size = 32
    
    
    val_inputs, val_labels, val_masks = list(zip(*sorted(zip(val_inputs, val_labels, val_masks), key=lambda x: len(x[0]))))  #Order the validation data for faster validation
    val_inputs, val_labels, val_masks = list(val_inputs), list(val_labels), list(val_masks)
    

    train_labels = torch.tensor(train_labels)
    val_labels = torch.tensor(val_labels)
    
    # Create the DataLoader for our training set
    train_data = BertDataset(train_inputs, train_masks, train_labels)
    train_sampler = KSampler(train_data, batch_size)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size, collate_fn=data_collator)

    # Create the DataLoader for our validation set
    val_data = BertDataset(val_inputs, val_masks, val_labels)
    val_sampler = SequentialSampler(val_data)
    val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size, collate_fn=data_collator)
    
    
    bert_classifier, optimizer, scheduler = initialize_model(epochs=3)
    train_fn(bert_classifier, train_dataloader, val_dataloader, fold= fold, epochs=3, evaluation=True, prefix="fr_")
    
    fold += 1
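
<font size="3"> Sorting the validation set by length only helps if each batch is padded to its own longest sequence. The sketch below assumes that <code>data_collator</code> (defined earlier in the notebook, not shown in this section) pads this way. <code>padded_tokens</code> and the toy lengths are hypothetical; the function simply counts how many token slots one pass over the data would process. </font>

```python
def padded_tokens(lengths, batch_size):
    """Count token slots when every batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)  # every row is padded to the batch maximum
    return total


# Toy sequence lengths: short and long sentences interleaved
lengths = [5, 120, 7, 110, 6, 115, 8, 100]

unsorted_cost = padded_tokens(lengths, batch_size=2)
sorted_cost = padded_tokens(sorted(lengths), batch_size=2)

print(unsorted_cost, sorted_cost)
```

<font size="3"> With these toy lengths the sorted order needs 488 token slots instead of 890, because short sentences are no longer padded up to the length of a long neighbour. </font>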

# DistilBERT Multilingual

In [None]:
def set_seed(seed_value=5):
    """Set seed for reproducibility.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)
set_seed()

In [None]:
train, test = get_data()
train.label+=1

train["text"]=train['text'].apply(lambda x :removeDuplicates(list(x.rstrip())) )
test["text"]=test['text'].apply(lambda x :removeDuplicates(list(x.rstrip())) )



model_name = 'distilbert-base-multilingual-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name ,do_lower_case=True)

pad = tokenizer.pad_token_id

In [None]:
X, X_masks = preprocessing_for_bert(train['text'].values, max_len=MAX_LEN)

In [None]:
n = 5

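kfolds = KFold(n_splits=n, shuffle=True, random_state=2020)  #Keyword arguments: recent scikit-learn versions reject positional shuffle/random_state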
fold = 0

best_accs = []

for train_ids, val_ids in kfolds.split(X):
    
    train_inputs, train_masks = get_indices(X, train_ids) , get_indices(X_masks, train_ids)
    train_labels = train.label.values[train_ids]
    
    val_inputs, val_masks = get_indices(X, val_ids) , get_indices(X_masks, val_ids)
    val_labels = train.label.values[val_ids]
    
    batch_size = 32
    
    
    val_inputs, val_labels, val_masks = list(zip(*sorted(zip(val_inputs, val_labels, val_masks), key=lambda x: len(x[0]))))  #Order the validation data for faster validation
    val_inputs, val_labels, val_masks = list(val_inputs), list(val_labels), list(val_masks)
    

    train_labels = torch.tensor(train_labels)
    val_labels = torch.tensor(val_labels)
    
    # Create the DataLoader for our training set
    train_data = BertDataset(train_inputs, train_masks, train_labels)
    train_sampler = KSampler(train_data, batch_size)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size, collate_fn=data_collator)

    # Create the DataLoader for our validation set
    val_data = BertDataset(val_inputs, val_masks, val_labels)
    val_sampler = SequentialSampler(val_data)
    val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size, collate_fn=data_collator)
    
    
    bert_classifier, optimizer, scheduler = initialize_model(epochs=3)
    train_fn(bert_classifier, train_dataloader, val_dataloader, fold= fold, epochs=3, evaluation=True, prefix="distill-multi_")
    
    fold += 1

# Arabic Dialect Bert

In [None]:
class BertClassifier(nn.Module):
    """Bert Model for Classification Tasks.
    """
    def __init__(self, freeze_bert=False):
        """
        @param    freeze_bert (bool): Set `False` to fine-tune the BERT model,
                      `True` to freeze its weights and train only the classifier
        """
        super(BertClassifier, self).__init__()
        # Specify hidden size of BERT, hidden size of our classifier, and number of labels
        D_in, H, D_out = 768, 200, 3
#768,100,3
        # Instantiate BERT model
        self.bert = BertModel.from_pretrained(model_name)

        # Instantiate a feed-forward classifier with one hidden layer
        self.classifier = nn.Sequential(
            nn.Linear(D_in, H),
            nn.ReLU(),
            nn.Linear(H, D_out),      
        )

        # Freeze the BERT model
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False
        
    def forward(self, input_ids, attention_mask):
        """
        Feed input to BERT and the classifier to compute logits.
        @param    input_ids (torch.Tensor): an input tensor with shape (batch_size,
                      max_length)
        @param    attention_mask (torch.Tensor): a tensor that hold attention mask
                      information with shape (batch_size, max_length)
        @return   logits (torch.Tensor): an output tensor with shape (batch_size,
                      num_labels)
        """
        # Feed input to BERT
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)
        
        # Extract the last hidden state of the token `[CLS]` for classification task
        last_hidden_state_cls = outputs[0][:, 0, :]

        # Feed input to classifier to compute logits
        logits = self.classifier(last_hidden_state_cls)
        return logits

In [None]:
def set_seed(seed_value=5):
    """Set seed for reproducibility.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)
set_seed()

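<font size="3"> This function is also important. It transliterates Arabizi (Latin letters and digits) into Arabic script with a chain of Python str.replace calls. Light preprocessing is done before the replacement: accents are folded, the text is lowercased, unsupported characters are dropped, and repeated letters are collapsed. Multi-character patterns such as "kh" and "ch" are replaced before single letters so that they are not split. </font>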

In [None]:
def convert(text):
    
    text = text.replace('ß',"b")
    text = text.replace('à',"a")
    text = text.replace('á',"a")
    text = text.replace('ç',"c")
    text = text.replace('è',"e")
    text = text.replace('é',"e")
    text = text.replace('$',"s")
    text = text.replace("1","")
    text = text.replace("ù", "u")
    
    
    text = text.lower()  #Make text lowercase, so Chbik and chbik are the same word
    #Keep only alphanumeric characters, spaces and some punctuation
    text = re.sub(r'[^A-Za-z0-9 ,!?.]', '', text)

    
    # Remove '@name' (has no effect here: '@' was already stripped by the filter above)
    text = re.sub(r'(@.*?)[\s]', ' ', text)

    # Replace '&amp;' with '&' (has no effect here: '&' was already stripped by the filter above)
    text = re.sub(r'&amp;', '&', text)

    # Collapse runs of whitespace into one space and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()
    
    text = re.sub(r'([h][h][h][h])\1+', r'\1', text)  #Keep hhhh (laughter) but shorten very long runs
    text = re.sub(r'([a-gi-z])\1+', r'\1', text)  #Collapse repeated letters other than h
    text = re.sub(r' [0-9]+ ', " ", text)  #Remove standalone numbers
    text = re.sub(r'^[0-9]+ ', "", text)  #Remove a number at the start of the text
    
    
    text = " " + text+ " "  #Add spaces at the beginning and the end


    text = text.replace("ouw", "و")
    text = text.replace("th", "ذ")
    text = text.replace("kh", "خ")
    text = text.replace("ch", "ش")
    text = text.replace("ou", "و")
    text = text.replace("aye", "اي")
    text = text.replace("dh", "ض")
    text = text.replace("bil", "بال")
    text = text.replace("ph", "ف")
    text = text.replace("iw", "يو")
    text = text.replace("sh", "ش")
    text = text.replace("ca", "كا")
    text = text.replace("ci", "سي")
    text = text.replace("ce", "سو")
    text = text.replace("co", "كو")
    text = text.replace("ck", "ك")

    text = text.replace(" i", " ا")
    text = text.replace(" a", " ا")
    text = text.replace(" e", " ا")
    text = text.replace(" o", " ا")
    
    text = text.replace("a ", "ا ")
    text = text.replace("e ", "ا ")
    text = text.replace("i ", "ي ")
    text = text.replace("o ", "و ")
    
    text = text.replace("e", "")
    text = text.replace("a", "")
    text = text.replace("o", "")

    text = text.replace("b", "ب")
    text = text.replace("i", "")
    text = text.replace("k", "ك")
    text = text.replace("3", "ع")
    text = text.replace("5", "خ")
    text = text.replace("r", "ر")
    text = text.replace("4", "ر")
    text = text.replace("y", "ي")
    text = text.replace("s", "ص")
    text = text.replace("w", "و")
    text = text.replace("m", "م")
    text = text.replace("9", "ق")
    text = text.replace("n","ن")
    text = text.replace("d", "د")
    text = text.replace("l" ,"ل")
    text = text.replace("h", "ه")
    text = text.replace("7", "ح")
    text = text.replace("j" ,"ج")
    text = text.replace("t", "ت")
    text = text.replace("8", "غ")
    text = text.replace("2", "أ")
    text = text.replace("f", "ف")
    text = text.replace("p", "ب")
    text = text.replace("u", "و")
    text = text.replace("g", "ق")
    text = text.replace("v", "ف")
    text = text.replace("c", "س")
    text = text.replace("z", "ز")
    text = text.replace("q", "ك")
    text = text.replace("x", "اكس")
    
    
    return text.strip()  #Strip spaces at the beginning and end
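
<font size="3"> Because <code>convert</code> chains str.replace calls, the order of the rules matters: digraphs such as "kh" and "ch" have to be replaced before the single letters they contain. The following is a minimal sketch on a five-rule subset of the mapping above. <code>RULES</code> and <code>apply_rules</code> are illustrative names and are not part of the notebook. </font>

```python
# Ordered rules taken from the mapping above: digraphs first, single letters last
RULES = [
    ("kh", "خ"),
    ("ch", "ش"),
    ("k", "ك"),
    ("h", "ه"),
    ("c", "س"),
]


def apply_rules(text, rules):
    """Apply each (latin, arabic) replacement in the given order."""
    for latin, arabic in rules:
        text = text.replace(latin, arabic)
    return text


# Digraphs replaced first: each digraph becomes a single Arabic letter
digraph_first = apply_rules("kh ch", RULES)

# Single letters replaced first: the digraphs are split into two letters each
single_first = apply_rules("kh ch", list(reversed(RULES)))

print(digraph_first)
print(single_first)
```

<font size="3"> In the reversed order, "kh" is read as "k" followed by "h" and comes out as two letters, which is why the digraph rules sit at the top of <code>convert</code>. </font>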


In [None]:
train, test = get_data()
train.label+=1

train["text"]=train['text'].apply(convert)
test["text"]=test['text'].apply(convert)



model_name = 'moha/bert_ar_multi_dialect_c19'  #Load the Arabic dialect model 
tokenizer = AutoTokenizer.from_pretrained(model_name ,do_lower_case=True)

pad = tokenizer.pad_token_id

In [None]:
X, X_masks = preprocessing_for_bert(train['text'].values, max_len=MAX_LEN)

In [None]:
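#Due to the importance of this model in the ensemble, it is trained on ten folds
n = 10

kfolds = KFold(n_splits=n, shuffle=True, random_state=2020)  #Keyword arguments: recent scikit-learn versions reject positional shuffle/random_state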
fold = 0

best_accs = []

for train_ids, val_ids in kfolds.split(X):
    
    train_inputs, train_masks = get_indices(X, train_ids) , get_indices(X_masks, train_ids)
    train_labels = train.label.values[train_ids]
    
    val_inputs, val_masks = get_indices(X, val_ids) , get_indices(X_masks, val_ids)
    val_labels = train.label.values[val_ids]
    
    batch_size = 32
    
    
    val_inputs, val_labels, val_masks = list(zip(*sorted(zip(val_inputs, val_labels, val_masks), key=lambda x: len(x[0]))))  #Order the validation data for faster validation
    val_inputs, val_labels, val_masks = list(val_inputs), list(val_labels), list(val_masks)
    

    train_labels = torch.tensor(train_labels)
    val_labels = torch.tensor(val_labels)
    
    # Create the DataLoader for our training set
    train_data = BertDataset(train_inputs, train_masks, train_labels)
    train_sampler = KSampler(train_data, batch_size)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size, collate_fn=data_collator)

    # Create the DataLoader for our validation set
    val_data = BertDataset(val_inputs, val_masks, val_labels)
    val_sampler = SequentialSampler(val_data)
    val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size, collate_fn=data_collator)
    
    
    bert_classifier, optimizer, scheduler = initialize_model(epochs=3)
    train_fn(bert_classifier, train_dataloader, val_dataloader, fold= fold, epochs=3, evaluation=True, prefix="ar_")
    
    fold += 1

In [None]:
print("Train time (minutes): %f" % ((time.time() - start_time)/60))

# Inference

In [None]:
#Utility function to predict using bert model
def bert_single_predict(model, test_dataloader):

    model.eval() #Turn on eval mode

    all_logits = []

    for batch in tqdm(test_dataloader): #Iterate through the dataloader

        b_input_ids, b_attn_mask = tuple(t.to(device) for t in batch)[:2]  #Take to GPU
        
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)  #Perform inference, no grad
        all_logits.append(logits)
    
    all_logits = torch.cat(all_logits, dim=0)

    probs = F.softmax(all_logits, dim=1).cpu().numpy()  #Convert them to probabilities and return

    return probs


#Predict with every fold of the same base model. Takes sentences (a list of strings) and models (a list with one model per fold).
#Returns a list with one array of probabilities per model
def bert_ensemble_predict(sentences, models, max_len=256):
    
    inputs, masks = preprocessing_for_bert(sentences, max_len=max_len)  #Preprocess
    
    
    dataset = BertDataset(inputs, masks)  #Create the dataset, and the sequential sampler and dataloader
    sample = SequentialSampler(dataset)
    dataloader = DataLoader(dataset, sampler=sample, batch_size=128, collate_fn=data_collator)
    
    preds = []
    
    for model in models:  #Perform inference for each fold of the same base model
        preds.append(bert_single_predict(model, dataloader))
        
    return preds 

In [None]:
start_time = time.time()

In [None]:
#Fast ai inference
fast_ai_probs, target = learn_classifier.get_preds(DatasetType.Test, ordered=True)

In [None]:
#Simple transformers inference
from scipy.special import softmax


train, test = get_data()
test["text"]=test['text'].apply(lambda x :removeDuplicates(list(x.rstrip())) )

pred = fast_model.predict(list(test['text']))
simple_probs = softmax(pred[1],axis=1)

In [None]:
#Arabic

train, test = get_data()
train.label+=1

test["text"]=test['text'].apply(convert)  #Apply same transformation as training


model_name = 'moha/bert_ar_multi_dialect_c19'
tokenizer = AutoTokenizer.from_pretrained(model_name ,do_lower_case=True)

pad = tokenizer.pad_token_id


lang_models = []
for i in range(10):  #Load saved models, and append them to list
    lang_models.append(torch.load("ar_best_"+str(i), map_location=device))

out = bert_ensemble_predict(test.text.tolist(), lang_models, max_len=512)


arabic_probs = out[0]
for i in range(1,10): #Sum up the probabilities of the different folds
    arabic_probs = out[i] + arabic_probs
    

<font size="3"> Inference for the remaining models follows the same pattern: reload the tokenizer, load the saved fold checkpoints, and sum the per-fold probabilities. </font>
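
<font size="3"> Note that the fold probabilities are summed, not averaged, so a model trained on ten folds contributes rows that sum to 10 while a five-fold model contributes rows that sum to 5. The sketch below shows the difference on toy data. <code>combine_folds</code> is a hypothetical helper that is not in the notebook, and plain lists stand in for the arrays returned by <code>bert_ensemble_predict</code>. </font>

```python
def combine_folds(fold_probs, average=False):
    """Sum (or average) per-fold class probabilities element-wise.

    fold_probs: list of folds; each fold is a list of rows,
                and each row is a list of class probabilities.
    """
    n_folds = len(fold_probs)
    combined = []
    for rows in zip(*fold_probs):  # the same sample across all folds
        summed = [sum(values) for values in zip(*rows)]  # per-class sum
        if average:
            summed = [value / n_folds for value in summed]
        combined.append(summed)
    return combined


# One toy prediction, repeated to imitate 5 and 10 identical folds
one_fold = [[0.2, 0.3, 0.5]]

five_fold_sum = combine_folds([one_fold] * 5)
ten_fold_sum = combine_folds([one_fold] * 10)
ten_fold_mean = combine_folds([one_fold] * 10, average=True)

print(sum(five_fold_sum[0]), sum(ten_fold_sum[0]), sum(ten_fold_mean[0]))
```

<font size="3"> Averaging would put every model on the same scale before the ensemble weights are applied. Whether that would improve the score has not been tested here. </font>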

In [None]:
#Distill

train, test = get_data()
train.label+=1

test["text"]=test['text'].apply(lambda x :removeDuplicates(list(x.rstrip())) )


model_name = 'distilbert-base-multilingual-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name ,do_lower_case=True)

pad = tokenizer.pad_token_id


lang_models = []
for i in range(5):
    lang_models.append(torch.load("distill-multi_best_"+str(i), map_location=device))

out = bert_ensemble_predict(test.text.tolist(), lang_models, max_len=512)


distill_bert = out[0]
for i in range(1,5):
    distill_bert = out[i] + distill_bert
    

In [None]:
#French

train, test = get_data()
train.label+=1


model_name = 'camembert-base'
tokenizer = AutoTokenizer.from_pretrained(model_name ,do_lower_case=True)

pad = tokenizer.pad_token_id


lang_models = []
for i in range(5):
    lang_models.append(torch.load("fr_best_"+str(i), map_location=device))

out = bert_ensemble_predict(test.text.tolist(), lang_models, max_len=512)


camembert_probs = out[0]
for i in range(1,5):
    camembert_probs = out[i] + camembert_probs
    

In [None]:
#Multilingual

train, test = get_data()
train.label+=1


model_name = 'bert-base-multilingual-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name ,do_lower_case=True)

pad = tokenizer.pad_token_id

test["text"]=test['text'].apply(lambda x :removeDuplicates(list(x.rstrip())) )


lang_models = []
for i in range(5):
    lang_models.append(torch.load("multi-bert_best_"+str(i), map_location=device))

out = bert_ensemble_predict(test.text.tolist(), lang_models, max_len=512)


multibert_probs = out[0]
for i in range(1,5):
    multibert_probs = out[i] + multibert_probs
    

In [None]:
#Bert Uncased

train, test = get_data()
train.label+=1


model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name ,do_lower_case=True)

pad = tokenizer.pad_token_id

test["text"]=test['text'].apply(text_preprocessing)


lang_models = []
for i in range(5):
    lang_models.append(torch.load("bert_best_"+str(i), map_location=device))

out = bert_ensemble_predict(test.text.tolist(), lang_models, max_len=512)


bert_probs = out[0]
for i in range(1,5):
    bert_probs = out[i] + bert_probs
    

<font size="3"> <b> Ensemble formula </b> =  bert_probs\*1.2 + multibert_probs\*0.8 + camembert_probs + distill_bert\*1.1 + arabic_probs\*1.15 + fast_ai_probs\*0.9 + simple_probs\*0.9 </font>

<font size="3"> The weights signify the importance of each model in the ensemble. Experimenting with different weights might yield a better model. </font>
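
<font size="3"> The formula can be sketched on toy data as follows. <code>weighted_ensemble</code> and the two probability tables are hypothetical. The notebook does the same thing with NumPy arrays, then subtracts 1 to map the class indices 0, 1, 2 back to the labels -1, 0, 1. </font>

```python
def weighted_ensemble(prob_list, weights):
    """Weighted sum of class probabilities, then argmax shifted to labels -1, 0, 1."""
    n_rows = len(prob_list[0])
    n_classes = len(prob_list[0][0])
    labels = []
    for row in range(n_rows):
        scores = [
            sum(weight * probs[row][cls] for probs, weight in zip(prob_list, weights))
            for cls in range(n_classes)
        ]
        labels.append(scores.index(max(scores)) - 1)  # class index -> label
    return labels


# Two toy models that disagree on the first sample
model_a = [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]]
model_b = [[0.1, 0.2, 0.7], [0.1, 0.1, 0.8]]

labels_equal = weighted_ensemble([model_a, model_b], [1.0, 1.0])
labels_skewed = weighted_ensemble([model_a, model_b], [3.0, 1.0])

print(labels_equal, labels_skewed)
```

<font size="3"> With equal weights the confident second model decides the first sample. Giving the first model three times the weight flips that prediction, which is the effect the weights in the formula are meant to control. </font>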

In [None]:
train, test = get_data()

#Ensemble formula
test["label"] = (bert_probs*1.2+multibert_probs*0.8+camembert_probs+distill_bert*1.1+arabic_probs*1.15+fast_ai_probs.numpy()*0.9+simple_probs*0.9).argmax(1)-1
test[["ID", "label"]].to_csv("preds.csv", index=False)

In [None]:
print("Inference time (minutes): %f" % ((time.time() - start_time)/60))