# Shopee-Product-Matching
![Shopee](https://cdn.lynda.com/course/563030/563030-636270778700233910-16x9.jpg)


* As with images, we can use a pretrained neural network to extract text embeddings for sentences. One of the most popular choices is **BERT**.

* A high-level explanation of **ArcFace loss**, borrowed from [eneszvo's notebook](https://www.kaggle.com/code/eneszvo/shopee-summary-efficientnet-arcface-bert#Pretrained-embedding)


# ArcFace embedding

ArcFace is a modification of the loss that forces embeddings of the same class to be closer together and embeddings of different classes to be farther apart. A simple example with MNIST: https://www.kaggle.com/slawekbiel/arcface-explained/.

Basically, it modifies the softmax loss function to achieve better separation between classes in the embedding space. To understand the formula, let's first recall the definitions of cross-entropy loss and softmax.

**Cross-entropy loss** - measures the performance of a classification model whose output is a probability value between 0 and 1. It is the usual choice for multi-class classification.
$$CE = -\sum_{i=1}^{C} t_{i}\log(a_{i})$$
where $C$ is the number of classes, $t$ is usually a one-hot vector (or a binary vector for multi-label classification) representing the labels, and $a$ is the output of the activation function (usually the output layer of the network).

**Softmax function** - the activation function in the output layer of neural network models that predict a multinomial probability distribution.
$$ softmax(x_{i}) = \frac{e^{x_{i}}}{\sum_{j=1}^{n} e^{x_{j}}}$$
where $x$ is an $n$-dimensional vector.
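
As a small illustration of the definition above, here is a minimal NumPy sketch of softmax. Subtracting the maximum before exponentiating is a common stability trick and is an addition of this sketch, not part of the formula; the example logits are made up.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    x = np.asarray(x, dtype=np.float64)
    # Subtracting the maximum does not change the result but avoids overflow in exp.
    shifted = x - x.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])  # hypothetical logits for 3 classes
probs = softmax(logits)
print(probs, probs.sum())
```

The outputs are positive, sum to 1, and keep the ordering of the logits, which is what lets them be read as class probabilities.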

**Softmax loss** - cross-entropy loss applied on softmax activation function,
$$CE = -\sum_{i=1}^{C} t_{i}\log(a_{i}) = -\sum_{i=1}^{C} t_{i}\log(\frac{e^{x_{i}}}{\sum_{j=1}^{n} e^{x_{j}}}).$$

If $t=(t_{1}, t_{2},..., t_{C})$ is a one-hot vector with $t_{i}=1$, then the softmax loss becomes
$$CE = -\log(\frac{e^{x_{i}}}{\sum_{j=1}^{n} e^{x_{j}}}).$$
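
The one-hot case can be checked numerically. The sketch below is a rough illustration, assuming NumPy and made-up logits; `softmax_loss` is a hypothetical helper name, and it uses the log-sum-exp form of the same formula for stability.

```python
import numpy as np

def softmax_loss(logits, target):
    """Cross-entropy of softmax(logits) against a one-hot target at index `target`."""
    logits = np.asarray(logits, dtype=np.float64)
    # log-softmax computed with the log-sum-exp trick
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

logits = [2.0, 1.0, 0.1]                        # hypothetical logits for 3 classes
loss_correct = softmax_loss(logits, target=0)   # true class has the largest logit
loss_wrong = softmax_loss(logits, target=2)     # true class has the smallest logit
print(loss_correct, loss_wrong)
```

The loss is small when the true class already has the largest logit and grows as the true class logit falls behind the others.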

**Softmax cost** function is the average of the loss functions over the training set (or batch), 
$$CE = -\frac{1}{N}\sum_{i=1}^{N}\log(\frac{e^{x_{i}}}{\sum_{j=1}^{n} e^{x_{j}}}).$$
Based on the figure below,

**softmax cost** can be written as
$$CE = -\frac{1}{N}\sum_{i=1}^{N}\log(\frac{e^{W_{y_{i}}^{T}x_{i} + b_{y_{i}}}}{\sum_{j=1}^{n} e^{W_{j}^{T}x_{i} + b_{j}}}),$$
where $x_{i}$ denotes the embedding of the $i$-th sample, belonging to the $y_{i}$-th class (from the image above, $y_{i}=2$), and $W_{j}$ denotes the $j$-th column of the weight matrix $W$.

Let's fix $b=\mathbf{0}$, normalize all weight columns so that $\Vert W_{j}\Vert=1$, and normalize the embedding vector so that $\Vert x\Vert=1$. After normalization, the embeddings are distributed on a unit hypersphere, and we have
$$W_{j}^{T}x + b_{j}=\frac{W_{j}^{T}x}{\Vert W_{j}\Vert\Vert x\Vert} = \cos(\angle(W_{j}, x))=\cos(\theta_{j}),$$
where $\theta_{j}$ is the angle between $W_{j}$ and $x$.
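
The normalization step can be verified with a few lines of NumPy. This is only a sketch on random data: the shapes (4-dimensional embeddings, 3 classes) and the seed are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # columns W_j: one weight vector per class
x = rng.normal(size=4)        # one embedding

# Normalize every weight column and the embedding to unit length.
W_unit = W / np.linalg.norm(W, axis=0, keepdims=True)
x_unit = x / np.linalg.norm(x)

# With b = 0 the logits of the normalized layer are plain dot products ...
logits = W_unit.T @ x_unit
# ... and they coincide with the cosine between each W_j and x.
cosines = (W.T @ x) / (np.linalg.norm(W, axis=0) * np.linalg.norm(x))
# The angles theta_j are recovered with arccos.
theta = np.arccos(np.clip(logits, -1.0, 1.0))
print(logits, theta)
```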

Further, we can recover the angle $\theta$ by applying $\arccos$ to both sides, and then increase $\theta$ by a margin penalty $m$. The authors of the ArcFace paper (https://arxiv.org/pdf/1801.07698.pdf) explain it as:  
> We add an additive angular margin penalty $m$ between $x_{i}$ and $W_{y_{i}}$ to simultaneously enhance the intra-class compactness and inter-class discrepancy.

Finally, the softmax cost becomes 
$$CE = -\frac{1}{N}\sum_{i=1}^{N}\log(\frac{e^{s \cos(\theta_{y_{i}}+m)}}{e^{s \cos(\theta_{y_{i}}+m)} + \sum_{j=1, j\neq y_{i}}^{n} e^{s \cos(\theta_{j})}})$$
where $s$ is a scale factor that defines the radius of the hypersphere on which the embeddings are distributed. 
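
To see the effect of the margin on a single sample, here is a rough NumPy sketch of the final formula. The cosine values are made up, `arcface_logits` and `cross_entropy` are hypothetical helper names, and the defaults `s=30`, `m=0.5` simply mirror the values used in the configuration of this notebook. The sketch ignores the case where $\theta + m$ exceeds $\pi$.

```python
import numpy as np

def arcface_logits(cosines, target, s=30.0, m=0.5):
    """Add the angular margin m to the target class only, then scale by s."""
    cosines = np.asarray(cosines, dtype=np.float64)
    theta = np.arccos(np.clip(cosines, -1.0, 1.0))
    out = cosines.copy()
    out[target] = np.cos(theta[target] + m)   # cos(theta_y + m) for the true class
    return s * out

def cross_entropy(logits, target):
    """Softmax loss for one sample, computed with log-sum-exp."""
    shifted = logits - logits.max()
    return -(shifted[target] - np.log(np.exp(shifted).sum()))

cosines = np.array([0.8, 0.3, -0.1])   # hypothetical cos(theta_j); class 0 is the true class
plain = 30.0 * cosines                  # scaled logits without a margin
with_margin = arcface_logits(cosines, target=0)
print(cross_entropy(plain, 0), cross_entropy(with_margin, 0))
```

Only the true-class logit is lowered, so the same embedding incurs a larger loss and has to move closer to its class center to compensate.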

* I already trained an image model using the ArcFace module; see the training notebook [ArcFace Training Image Model](https://github.com/cr21/Shopee-Product-Matching/blob/main/notebooks/ArcfaceLoss%5Btraining%5D.ipynb).

In [1]:
%config Completer.use_jedi = False

# Import Packages

In [2]:
import sys
sys.path.append('../input/timmmaster')
import timm

In [3]:
import math
import os
import gc
import numpy as np
import cv2
import pandas as pd
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.model_selection import StratifiedKFold
import torch
from torch import nn 
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F 
import albumentations
from albumentations.pytorch.transforms import ToTensorV2
from torch.optim import lr_scheduler
from tqdm import tqdm
import matplotlib.pyplot as plt
from sklearn import metrics
from datetime import date
from sklearn.metrics import f1_score, accuracy_score
from collections  import Counter
import random

from transformers import AutoTokenizer, AutoModel

import warnings
warnings.filterwarnings('ignore')
from sklearn.neighbors import NearestNeighbors

import cudf
import cuml
import cupy
from cuml.feature_extraction.text import TfidfVectorizer
from cuml import PCA
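# NOTE: this import shadows sklearn's NearestNeighbors above; the cuML (GPU) version is the one used below
from cuml.neighbors import NearestNeighbors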

# Configuration Options


In [4]:
TRAIN_DIR = '../input/shopee-product-matching/train_images'
TEST_DIR = '../input/shopee-product-matching/test_images'
TRAIN_CSV = '../input/shopee-product-matching/train.csv'
TEST_CSV='../input/shopee-product-matching/test.csv'
MODEL_PATH = './'


class CFG:
    seed = 18900 
    num_workers = 3
    bert_model_name = '../input/bertmodel/paraphrase-xlm-r-multilingual-v1'
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    
    ## ARCFACE

    scale = 30
    margin = 0.5
    fc_dim = 768
    classes = 11014
    
    batch_size = 16
    use_fc=True
    max_length=128
    ### NearestNeighbors
    bert_knn = 42
    chunk = 32
    max_preds = 42
    nearlest_one = True # True is better
    bert_knn_threshold = 0.84  # Cosine distance threshold
    
    # Set to True for testing or submission
    COMPUTE_CV=True
    use_auto_mix_precision=False
    bert_model_path =  "../input/bert-trained-22/paraphrase-xlm-r-multilingual-v1_epoch7-bs16x1.pt"

In [5]:
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = True # faster, but can make results non-deterministic; set False for full reproducibility

seed_everything(CFG.seed)

# Solution Approach

* In this competition, two or more images with the **same label group** are **similar products**.
* We can use this information to turn the business problem into a **multi-class classification** problem.
* From the image EDA, I found that there are **11014** different classes and that the dataset is **not balanced**.
* The plot below shows that there are **hardly 1000 data points with more than 10 products per label**.
* In this notebook I used the **weighted sampler technique in PyTorch to handle the imbalanced classification problem**.
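
The idea behind weighted sampling can be sketched without PyTorch. The example below is an illustration under assumptions: the labels are made up, and NumPy's `choice` stands in for the sampler. In PyTorch, per-sample weights like these would typically be passed to `torch.utils.data.WeightedRandomSampler`.

```python
from collections import Counter
import numpy as np

labels = ["a"] * 8 + ["b"] * 2   # hypothetical, imbalanced label_group values
counts = Counter(labels)

# Each sample is weighted by the inverse frequency of its class,
# so every class receives the same total weight.
sample_weights = np.array([1.0 / counts[label] for label in labels])

rng = np.random.default_rng(0)
probs = sample_weights / sample_weights.sum()
drawn = rng.choice(len(labels), size=10_000, replace=True, p=probs)
drawn_counts = Counter(labels[i] for i in drawn)
print(drawn_counts)
```

Although class `a` has four times as many samples as class `b`, both are drawn roughly equally often.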


In [6]:
def read_dataset():
    # COMPUTE_CV=True reads the test set (testing / submission), otherwise the train set
    if CFG.COMPUTE_CV:
        df = pd.read_csv(TEST_CSV)
        print("Total Number of Test Data", df.shape[0])
    else:
        df = pd.read_csv(TRAIN_CSV)
        print("Total Number of Train Data", df.shape[0])
    return df
        

In [7]:
def f1_score(y_true, y_pred):
    y_true = y_true.apply(lambda x: set(x.split()))
    y_pred = y_pred.apply(lambda x: set(x.split()))
    intersection = np.array([len(x[0] & x[1]) for x in zip(y_true, y_pred)])
    len_y_pred = y_pred.apply(lambda x: len(x)).values
    len_y_true = y_true.apply(lambda x: len(x)).values
    f1 = 2 * intersection / (len_y_pred + len_y_true)
    return f1
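
For a single row, the metric above is the F1 score between two sets of posting ids. A minimal pure-Python sketch, with made-up ids and a hypothetical helper name `row_f1`:

```python
def row_f1(true_ids, pred_ids):
    """F1 between two space-separated lists of posting ids, treated as sets."""
    true_set, pred_set = set(true_ids.split()), set(pred_ids.split())
    return 2 * len(true_set & pred_set) / (len(true_set) + len(pred_set))

# 2 ids in common, 3 true ids, 4 predicted ids: 2 * 2 / (3 + 4) = 4/7
score = row_f1("id_1 id_2 id_3", "id_1 id_2 id_4 id_5")
print(score)
```

Predicting too many ids is penalized through the denominator, which is why the neighbor threshold matters for this competition metric.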

# Build Model

In [8]:
class ArcFaceModule(nn.Module):
    def __init__(self, in_features, out_features, scale=30, margin=0.5, easy_margin=False, ls_eps=0.0 ):
        super(ArcFaceModule, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.scale = scale
        self.margin = margin
        self.weight = nn.Parameter(torch.FloatTensor(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.easy_margin=easy_margin
        self.ls_eps=ls_eps
        self.cos_m = math.cos(margin)
        self.sin_m = math.sin(margin)
        self.th = math.cos(math.pi - margin)
        self.mm = math.sin(math.pi - margin) * margin
        
        
        
    
    def forward(self, input, label):
        # For unit-length x and W, the dot product x.W = ||x|| ||W|| cos(theta) reduces to cos(theta)
        cosine = F.linear(F.normalize(input), F.normalize(self.weight))
        # sin(theta)^2 = 1 - cos(theta)^2; the clamp guards against tiny negative values from rounding
        sine = torch.sqrt((1.0 - torch.pow(cosine, 2)).clamp(0, 1))
        # phi = cos(theta + margin) = cos(theta) cos(margin) - sin(theta) sin(margin)
        phi = cosine * self.cos_m - sine * self.sin_m
        if self.easy_margin:
            phi = torch.where(cosine > 0, phi, cosine)
        else:
            # when theta + margin would exceed pi, fall back to a linear penalty
            phi = torch.where(cosine > self.th, phi, cosine - self.mm)

        # one-hot encode the labels on the same device as the logits
        one_hot = torch.zeros(cosine.size(), device=cosine.device)
        one_hot.scatter_(1, label.view(-1, 1).long(), 1)
        if self.ls_eps > 0:
            one_hot = (1 - self.ls_eps) * one_hot + self.ls_eps / self.out_features
        # apply the margin to the true-class logit only; keep the plain cosine for all other classes
        output = (one_hot * phi) + ((1.0 - one_hot) * cosine)
        # scale the output
        output *= self.scale
        # return the scaled logits and the cross-entropy loss computed on them
        return output, nn.CrossEntropyLoss()(output, label)
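
The `th` and `mm` constants in the module handle the case where $\theta + m$ would exceed $\pi$, where $\cos(\theta + m)$ stops decreasing with $\theta$. The scalar sketch below mirrors that branch logic in plain Python for a single target-class cosine; `arcface_target_cos` is a hypothetical name used only for this illustration.

```python
import math

def arcface_target_cos(cos_theta, margin=0.5, easy_margin=False):
    """Target-class value used in place of cos(theta), for one scalar cosine."""
    sin_theta = math.sqrt(max(0.0, 1.0 - cos_theta ** 2))
    # cos(theta + margin) via the angle-addition identity
    phi = cos_theta * math.cos(margin) - sin_theta * math.sin(margin)
    if easy_margin:
        return phi if cos_theta > 0 else cos_theta
    th = math.cos(math.pi - margin)            # cos(theta) at which theta + margin = pi
    mm = math.sin(math.pi - margin) * margin   # size of the linear fallback penalty
    return phi if cos_theta > th else cos_theta - mm

grid = [i / 100 for i in range(-100, 101)]
values = [arcface_target_cos(c) for c in grid]
print(values[0], values[-1])
```

With the fallback in place, the penalized value never increases as the cosine decreases, so the loss keeps pushing the embedding toward its class center even for very large angles.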

In [9]:
# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]  
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask
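
The same masked averaging can be illustrated in NumPy on a tiny hand-made batch. This is a sketch only: the numbers are invented, and `masked_mean` is a hypothetical stand-in for the PyTorch function above.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    """Average token embeddings per sentence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(np.float64)   # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)         # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)         # avoid division by zero
    return summed / counts

# One sentence with three token slots; the last slot is padding.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean(tokens, mask)
print(pooled)
```

The padded token does not affect the result: the pooled vector is the mean of the first two tokens only.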

In [10]:
   
    
    
class ShopeeBertEncoder(nn.Module):
    
    def __init__(self,
                     model_name=None,
                     loss_fn='ArcFace',
                     classes = CFG.classes,
                     fc_dim = CFG.fc_dim,
                     pretrained=True,
                     use_fc=True,
                     COMPUTE_CV=CFG.COMPUTE_CV,
                     margin = CFG.margin,
                    scale = CFG.scale,
                ):
        
        super(ShopeeBertEncoder,self).__init__()
        print("Building Model Backbone for {}".format(model_name))
        
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # create the backbone network from the pretrained model 
        self.backbone = AutoModel.from_pretrained(model_name).to(CFG.device)
        in_features = 768
        self.use_fc = use_fc

        self.dropout = nn.Dropout(p=0.0)
        self.classifier = nn.Linear(in_features, fc_dim)
        self.bn = nn.BatchNorm1d(fc_dim)
        self.init_params()
        in_features = fc_dim

        # use the `classes` argument (defaults to CFG.classes) for the size of the output layer
        if loss_fn=='softmax':
            self.final = nn.Linear(in_features, classes)
        elif loss_fn =='ArcFace':
            self.final = ArcFaceModule( in_features,
                                        classes,
                                        scale = scale,
                                        margin = margin,
                                        easy_margin = False,
                                        ls_eps = 0.0)
            
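    def forward(self, texts, labels=torch.tensor([0])):
        features = self.get_features(texts)
        
        if self.training:
            if isinstance(self.final, ArcFaceModule):
                # ArcFaceModule needs the labels and returns (scaled logits, loss)
                return self.final(features, labels.to(CFG.device))
            # plain softmax head: a linear layer that returns logits only
            return self.final(features)
        else:
            # at inference time the embedding itself is the output
            return features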
    
    def init_params(self):
        nn.init.xavier_normal_(self.classifier.weight)
        nn.init.constant_(self.classifier.bias,0)
        nn.init.constant_(self.bn.weight, 1)
        nn.init.constant_(self.bn.bias, 0)
        
        
    def get_features(self,texts):
        
        encoding = self.tokenizer(texts, 
                                  padding=True,
                                  truncation=True,
                                  max_length = CFG.max_length,
                                  return_tensors='pt').to(CFG.device)
        input_ids = encoding['input_ids']
        attention_mask = encoding['attention_mask']
        
        embedding = self.backbone(input_ids, attention_mask=attention_mask)
        x = mean_pooling(embedding, attention_mask)
        
        if self.use_fc :
            x = self.dropout(x)
            x = self.classifier(x)
            x = self.bn(x)
            
        return x
    
    


# Get BERT Embeddings

In [11]:
def get_bert_Embeddings(df, col, model_name,model_path,fc_dim=768, use_fc=True, chunk=8):
    
    print("Getting BERT ArcFace Embeddings")
    # this is a costly GPU operation, so be careful with memory
    
    # build the model from the trained weights and put it in evaluation mode
    model = ShopeeBertEncoder(model_name=model_name, fc_dim=fc_dim, use_fc=use_fc)
    model.to(CFG.device)
    model.load_state_dict(torch.load(model_path, map_location=CFG.device))
    model.eval()
    
    
    
    bert_embeddings = torch.zeros((df.shape[0], fc_dim)).to(CFG.device)
    for i in tqdm(list(range(0, df.shape[0], chunk)) + [df.shape[0]-chunk], desc="get_bert_embeddings", ncols=80):
        
        titles = []
        
        # get title chunk wise
        for title in df[col][i:i+chunk].values:
            try:
                # preprocess the title before tokenization: resolve escape sequences, drop non-ASCII characters
                title = title.encode('utf-8').decode("unicode_escape")
                title = title.encode('ascii', 'ignore').decode("unicode_escape")
                
            except Exception as err:
                # keep the original error as the cause for easier debugging
                raise Exception("Title Processing Error") from err
                
            title = title.lower()
            titles.append(title)

        # get bert embedding by forward pass chunk by chunk
        with torch.no_grad():
            if CFG.use_auto_mix_precision:
                with torch.cuda.amp.autocast():
                    model_output = model(titles)
            else:
                model_output = model(titles)

        bert_embeddings[i:i+chunk] = model_output  
        
    del model, titles, model_output

    gc.collect()
    torch.cuda.empty_cache()

    return bert_embeddings
                    
    
    

# Get Top K Neighbors

In [12]:
def getTOPK(df, embeddings, k=50, threshold=CFG.bert_knn_threshold, metric='cosine'):
    preds = []
    CHUNK = 128
    model = NearestNeighbors(n_neighbors=k, metric=metric)
    model.fit(embeddings)
    
    print('Finding similar Title...')
    # number of chunks, rounded up
    CTS = len(embeddings)//CHUNK
    if len(embeddings)%CHUNK!=0: CTS += 1
    for j in range( CTS ):

        a = j*CHUNK
        b = (j+1)*CHUNK
        b = min(b,len(embeddings))
        print('chunk',a,'to',b)
        distances, indices = model.kneighbors(embeddings[a:b,])

        for row in range(b-a):
            # keep only the neighbors closer than the distance threshold
            IDX = np.where(distances[row,]<threshold)[0]
            IDS = indices[row,IDX]
            o = df.iloc[IDS].posting_id.values
            preds.append(o)

    del model, distances, indices, embeddings
    _ = gc.collect()
    
    return preds
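
The thresholding step can be illustrated with a brute-force NumPy version on three toy embeddings. This is a sketch, not the cuML implementation: it assumes cosine distance is defined as one minus cosine similarity, and it skips the top-k restriction because the toy set is tiny. The value 0.17 is the threshold passed at the call site in this notebook.

```python
import numpy as np

def cosine_neighbors(embeddings, threshold):
    """For every row, return the indices of all rows within the cosine distance threshold."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    distances = 1.0 - unit @ unit.T   # pairwise cosine distance
    return [np.where(row < threshold)[0] for row in distances]

# Rows 0 and 1 point in almost the same direction; row 2 is orthogonal to row 0.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
matches = cosine_neighbors(embeddings, threshold=0.17)
print([m.tolist() for m in matches])
```

Every item matches itself (distance 0), so each prediction list contains at least the query's own posting id.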

In [13]:
# def get_neighbors(df, embeddings, k=50,threshold=CFG.bert_knn_threshold, metric='cosine'):
#     knn_model = NearestNeighbors(n_neighbors = k,metric=metric)
#     knn_model.fit(embeddings)
#     distances, indices = knn_model.kneighbors(embeddings)
#     predictions = []
#     for i in tqdm(range(embeddings.shape[0])):
#         idx = np.where(distances[i,] < threshold )[0]
        
#         ids = indices[i,idx]
#         posting_ids = df['posting_id'].iloc[ids].values
#         predictions.append(posting_ids)
    
#     del knn_model, distances, indices
#     gc.collect()
#     return  predictions
    

    



# Make Predictions

In [14]:
df =read_dataset()

def combine_predictions(row):
    x = np.concatenate([row[col] for col in CFG.prediction_cols])
    return ' '.join( np.unique(x) )


Total Number of Train DAta 3
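
`combine_predictions` takes the union of the ids found in every prediction column. The sketch below shows the same operation on two made-up arrays; the second (image-based) array is a hypothetical example, since this notebook only fills a single `predictions` column.

```python
import numpy as np

text_preds = np.array(["id_1", "id_2"])    # hypothetical matches from text embeddings
image_preds = np.array(["id_2", "id_3"])   # hypothetical matches from another model

# Concatenate all candidate ids, drop duplicates, and join them into the submission format.
matches = " ".join(np.unique(np.concatenate([text_preds, image_preds])))
print(matches)
```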


In [15]:

    
    
bert_embeddings = get_bert_Embeddings(df=df, col='title', model_name=CFG.bert_model_name, model_path=CFG.bert_model_path,
                                      fc_dim=768, use_fc=True, chunk=32)
print('bert_embeddings.shape:', bert_embeddings.shape)

if len(df)<=3:
    best_k = 3
else:
    best_k=42
    
predictions = getTOPK(df, bert_embeddings.cpu().detach().numpy(), k=best_k, threshold=0.17, metric='cosine')
# predictions = get_neighbors(df, , k=_k,threshold=0.17, metric='cosine')

df['predictions']=predictions
CFG.prediction_cols =['predictions']




df['matches'] = df.apply(combine_predictions, axis=1)

df[['posting_id', 'matches']].to_csv('submission.csv', index=False)
submission_df = pd.read_csv('submission.csv')
submission_df.head(10)

Getting BERT ArcFace Embeddings
Building Model Backbone for ../input/bertmodel/paraphrase-xlm-r-multilingual-v1


get_bert_embeddings: 100%|████████████████████████| 2/2 [00:00<00:00,  2.23it/s]


bert_embeddings.shape: torch.Size([3, 768])
Finding similar Title...
chunk 0 to 3


| | posting_id | matches |
|---|---|---|
| 0 | test_2255846744 | test_2255846744 |
| 1 | test_3588702337 | test_3588702337 |
| 2 | test_4015706929 | test_4015706929 |


In [16]:
# bert_embeddings = get_bert_Embeddings(df=df, col='title', model_name=CFG.bert_model_name, model_path=CFG.bert_model_path,
#                                       fc_dim=768, use_fc=True, chunk=32)
# print('bert_embeddings.shape:', bert_embeddings.shape)

# if len(df) <= 3:
#     _k  = 3
# else:
#     _k=42
# predictions = get_neighbors(df, bert_embeddings.cpu().detach().numpy(), k=_k,threshold=0.17, metric='cosine')

# df['predictions']=predictions
# # df['predictions'] = df['posting_id']
# CFG.prediction_cols =['predictions']

# def combine_predictions(row):
#     x = np.concatenate([row[col] for col in CFG.prediction_cols])
#     return ' '.join( np.unique(x) )

# if CFG.COMPUTE_CV:
#     test_df = pd.read_csv('../input/shopee-product-matching/test.csv')
# else:
#     test_df = pd.read_csv('../input/shopee-product-matching/train.csv')
# test_df = test_df[max_cnt:]
# test_df['predictions'] = test_df['posting_id']
# test_df['matches'] = test_df['posting_id']


# df['matches'] = df.apply(combine_predictions, axis=1)
# # df['matches'] = df['posting_id']
# df = pd.concat([df, test_df], ignore_index=True)

# df[['posting_id', 'matches']].to_csv('submission.csv', index=False)
# submission_df = pd.read_csv('submission.csv')
# submission_df.head(10)