# Sentence Transformers for Question Answering

This Jupyter notebook demonstrates the use of Sentence Transformers for building a question-answering model.

It includes installation of necessary libraries, data preprocessing, model training, evaluation, and submission to the Kaggle competition.

> Thanks: https://www.kaggle.com/code/dtakeshi/dtc-zoomcamp-q-a-sentence-transformers

## Setup

- This section covers the installation of the sentence-transformers and datasets libraries, and imports necessary Python modules.

In [3]:
# Note: this notebook was run on a Kaggle GPU (P100)
!pip install -Uq sentence-transformers datasets nltk session_info pipenv

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cuml 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 23.8.0 requires pandas<1.6.0dev0,>=1.3, but you have pandas 2.0.3 which is incompatible.
cudf 23.8.0 requires protobuf<5,>=4.21, but you have protobuf 3.20.3 which is incompatible.
cuml 23.8.0 requires dask==2023.7.1, but you have dask 2023.12.1 which is incompatible.
cuml 23.8.0 requires distributed==2023.7.1, but you have distributed 2023.12.1 which is incompatible.
dask-cuda 23.8.0 requires dask==2023.7.1, but you have dask 2023.12.1 which is incompatible.
dask-cuda 23.8.0 requires distributed==2023.7.1, but you have distributed 2023.12.1 which is incompatible.
dask-cuda 23.8.0 requires pandas<1.6.0

In [4]:
import os
import re
import nltk
import string
import numpy as np
import pandas as pd

import torch
from sentence_transformers import SentenceTransformer, models, evaluation
from sentence_transformers import losses, InputExample  # SentencesDataset
from torch.utils.data import DataLoader
from sentence_transformers import util
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

from datasets import load_dataset
import datasets; datasets.disable_progress_bar()

from tqdm.notebook import tqdm
tqdm._instances.clear()
# Use tqdm.pandas() to enable with progress_apply
tqdm.pandas()

import warnings
warnings.filterwarnings("ignore")
import gc
gc.collect()

180

In [5]:
import nltk
from nltk.tokenize import sent_tokenize

In [6]:
# !pip install session_info -Uq
import session_info
session_info.show(html=False)

-----
datasets                    2.16.1
nltk                        3.2.4
numpy                       1.24.3
pandas                      2.1.4
sentence_transformers       2.2.2
session_info                1.0.0
sklearn                     1.2.2
torch                       2.0.0
tqdm                        4.66.1
-----
IPython             8.14.0
jupyter_client      7.4.9
jupyter_core        5.3.1
jupyterlab          4.0.10
notebook            6.5.6
-----
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
Linux-5.15.133+-x86_64-with-glibc2.35
-----
Session information updated at 2024-01-26 05:13


## Data Loading

Loads training and test data from CSV files.

In [7]:
train_questions_df = pd.read_csv("/kaggle/input/dtc-zoomcamp-qa-challenge/train_questions.csv")
train_answers_df   = pd.read_csv("/kaggle/input/dtc-zoomcamp-qa-challenge/train_answers.csv")
test_questions_df  = pd.read_csv("/kaggle/input/dtc-zoomcamp-qa-challenge/test_questions.csv")
test_answers_df    = pd.read_csv("/kaggle/input/dtc-zoomcamp-qa-challenge/test_answers.csv")

train_questions_df.shape, train_answers_df.shape, test_questions_df.shape, test_answers_df.shape,

((397, 6), (397, 5), (516, 5), (516, 5))

In [8]:
# Check Duplicates
print(train_questions_df.duplicated('question_id').sum())
print(train_answers_df.duplicated('answer_id').sum())
print(test_questions_df.duplicated('question_id').sum())
print(test_answers_df.duplicated('answer_id').sum())

1
1
2
1


In [9]:
# Remove Duplicates
train_questions_df = train_questions_df.drop_duplicates('question_id').copy()
train_answers_df   = train_answers_df.drop_duplicates('answer_id').copy()
test_questions_df  = test_questions_df.drop_duplicates('question_id').copy()
test_answers_df    = test_answers_df.drop_duplicates('answer_id').copy()

## Data Wrangling

- Merging several data sources into one data-set for analysis.
- Identifying gaps or empty cells in data and either filling or removing them.
- Deleting irrelevant or unnecessary data.
- Identifying severe outliers in data and either explaining the inconsistencies or deleting them to facilitate analysis.

### Data Preprocessing

- Processes the training data into triplets: the question, its positive (correct) answer, and the remaining candidate answers as negatives.

In [10]:
train_questions_df_triplets = pd.DataFrame()

for question_id, question, course, year, candidate_answers, answer_id in train_questions_df.values:
    answers_list = list(map(int, candidate_answers.split(',')))
    negative_ids = [x for x in answers_list if x != answer_id]
    negative_list, positive_list = [],[]    
        
    positive_list.append(train_answers_df[train_answers_df.answer_id == int(answer_id)]['answer'].values[0])
    for neg_id in negative_ids:
        negative_list.append(train_answers_df[train_answers_df.answer_id == int(neg_id)]['answer'].values[0])
    
    # Adding a single new row
    new_row = {
        'question_id'  : question_id, 
        'question'     : question, 
        'positive'     : positive_list[0],
        'negatives'    : negative_list,
        #'answer_id'    : answer_id,
        #'relevant_docs': answers_list,
    }
    
    train_questions_df_triplets = train_questions_df_triplets._append(new_row, ignore_index=True)
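
The triplet construction above can be pictured on toy data. The sketch below uses plain dictionaries instead of DataFrames; the ids and texts are invented for illustration, and `build_triplet_row` is a hypothetical helper that is not part of the notebook.

```python
# Toy stand-ins for train_answers_df and one row of train_questions_df
# (all ids and texts below are invented for illustration).
answers = {
    11: "Yes, homework deadlines are listed in the course calendar.",
    12: "The certificate requires two passed projects.",
    13: "Use Python 3.10 for the exercises.",
}

def build_triplet_row(question_id, question, candidate_answers, answer_id, answers):
    """Split a comma-separated candidate list into one positive and the negatives."""
    candidate_ids = [int(x) for x in candidate_answers.split(",")]
    negatives = [answers[i] for i in candidate_ids if i != int(answer_id)]
    return {
        "question_id": question_id,
        "question": question,
        "positive": answers[int(answer_id)],
        "negatives": negatives,
    }

row = build_triplet_row(1, "Where are the deadlines?", "11,12,13", 11, answers)
print(row["positive"])
print(len(row["negatives"]))
```

Collecting such dictionaries in a list and building the DataFrame once with `pd.DataFrame(rows)` is a common alternative to appending row by row.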

In [11]:
train_questions_df_triplets

Unnamed: 0,question_id,question,positive,negatives
0,79062,"For categorical target set, where the distribu...",Alexey\nShould we use something non-standard t...,[I don't know if you're referring to this. [im...
1,468946,Is there anything that we are not allowed to u...,"No, I don't think there is anything you cannot...","[44. Two people submitted twice. So actually, ..."
2,968800,I have been catching up and have been doing ho...,"Alexey\nYes, you will be. You can submit the p...",[Ankush\nI don't know if there's an example pr...
3,688404,Could you please explain what code we should l...,Alexey\nI think the question refers to the hom...,"[Alexey\nIt's not over for you, because we onl..."
4,63921,Is it just me or does the model have really ba...,"Dmitry\nIt's fine, because this is the showcas...",[I don't know if you're referring to this. [im...
...,...,...,...,...
391,241788,Can the model with the ROC AUC score of around...,"Yes, it can. It's really dataset dependent. Fo...",[I took this course a while ago. Back then it ...
392,595103,When I click tab in the parentheses of the iPy...,"Let's say I do “import numpy as np” and then, ...",[You mean another iteration of the course? Yes...
393,450348,Can you please explain the use cases of Splunk...,Alexey\nSplunk – I don’t know. It's not a data...,[Ankush\nIf you're talking specifically about ...
394,864660,Why did you use model2bin in the last question...,"Yes, it was not mentioned. But what was mentio...",[If two people want to work on the same datase...


In [12]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(train_questions_df_triplets, test_size=0.2, random_state=42)

# Display the shapes of the resulting sets
print("Train set shape     :", train_df.shape)
print("Validation set shape:", val_df.shape)

Train set shape     : (316, 4)
Validation set shape: (80, 4)


In [13]:
df_train = train_questions_df_triplets.iloc[train_df.index]
df_train.shape

(316, 4)

## Model Initialization

Initializes the Sentence Transformer model.

```
Siamese Neural Networks Architecture:

Input 1  |      |-> Subnetwork 1 ---|
         |      |                   |      Triplet loss                                         Manhattan-L1 distance
Input 2  |----->|-> Subnetwork 2 ---|--> ( Contrastive loss )--> Output (Distance/Similarity) ( Euclidean distance    )
         |      |                   |      Binary loss                                          Angular Cosine distance
Labels   |      |_ _ _ _ _ _ _ _ _ _|
```

Pretrained Models
* https://www.sbert.net/docs/pretrained_models.html
* https://github.com/microsoft/unilm/tree/master/minilm
    * https://huggingface.co/microsoft/MiniLM-L12-H384-uncased


Using the Model 'all-mpnet-base-v2' from HuggingFace

The all-* models were trained on all available training data (more than 1 billion training pairs) and are designed as general-purpose models.

The all-mpnet-base-v2 model provides the best quality, while all-MiniLM-L6-v2 is 5 times faster and still offers good quality.

In [16]:
from transformers import AutoConfig, AutoTokenizer, AutoModel, TFAutoModel
# from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification

## By default, input text longer than 384 word pieces is truncated.
## It maps sentences & paragraphs to a 768 dimensional dense vector space.
## Fine-tuned on a dataset of 1B sentence pairs.
model_checkpoint = 'sentence-transformers/all-mpnet-base-v2'

## By default, input text longer than 512 word pieces is truncated.
## It maps sentences & paragraphs to a 768 dimensional dense vector space.
## trained on 215M (question, answer) pairs from diverse sources.
# model_checkpoint = 'sentence-transformers/multi-qa-mpnet-base-dot-v1'
# model_checkpoint = 'sentence-transformers/multi-qa-mpnet-base-cos-v1'

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.max_len_single_sentence

510

In [18]:
from sentence_transformers import SentenceTransformer, models, evaluation

# Load the model
model = SentenceTransformer(model_checkpoint)
model.max_seq_length = 512
model = model.to('cuda')
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

In [20]:
## Step 1: Define the transformer (word embedding) model, max_seq_length=512
word_embedding_model = models.Transformer(model_checkpoint, max_seq_length=512)
## Step 2: Define the pooling strategy over the token embeddings
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_cls_token= True,
    pooling_mode_mean_tokens= True,
    pooling_mode_max_tokens = True,
    pooling_mode_mean_sqrt_len_tokens = True,
)
## Step 3: Define the dense layer
dense_layer = models.Dense(
    in_features=pooling_model.get_sentence_embedding_dimension(),
    out_features=768,
)
## Step 4: Define the normalize layer if needed
normalize_layer = models.Normalize()
## Step 5: Construct the final model
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_layer, normalize_layer], device='cuda')

# Move the model to GPU
# model = model.to('cuda')
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': True, 'pooling_mode_mean_sqrt_len_tokens': True})
  (2): Dense({'in_features': 3072, 'out_features': 768, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
  (3): Normalize()
)
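
The Dense layer's `in_features` of 3072 follows from enabling all four pooling modes: each mode yields a 768-dimensional vector, and the vectors are concatenated (4 x 768 = 3072). A minimal pure-Python sketch with 3-dimensional toy tokens is shown below; `pool_tokens` is a hypothetical helper, it ignores padding masks, and the concatenation order is an assumption made for illustration.

```python
import math

def pool_tokens(token_embeddings):
    """Concatenate four pooled views of a (tokens x dim) list of vectors."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    columns = [[tok[d] for tok in token_embeddings] for d in range(dim)]
    cls_vec = list(token_embeddings[0])                       # first token
    max_vec = [max(col) for col in columns]                   # element-wise max
    mean_vec = [sum(col) / n for col in columns]              # element-wise mean
    sqrt_vec = [sum(col) / math.sqrt(n) for col in columns]   # sum / sqrt(length)
    return cls_vec + max_vec + mean_vec + sqrt_vec

tokens = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0], [2.0, 2.0, 1.0], [0.0, 2.0, 1.0]]
pooled = pool_tokens(tokens)
print(len(pooled))  # 4 pooling modes x 3 dimensions
```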

## Evaluation Formula for both dot product and cosine similarity:

* The dot product (also known as the inner product or scalar product) of two vectors A and B is the sum of their element-wise products:
    * > $ \text{Dot Product}(A, B) = A \cdot B = \sum_{i=1}^{n} A_i \cdot B_i $
* The cosine similarity of A and B is the dot product normalized by the Euclidean norms (magnitudes) ∥A∥ and ∥B∥ of the two vectors:
    * > $ \text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\| \cdot \|B\|} $
    * > $ \|\mathbf{A}\| = \sqrt{\sum_{i=1}^{n} A_i^2},  \|\mathbf{B}\| = \sqrt{\sum_{i=1}^{n} B_i^2} $
* For unit-length embeddings, such as those produced by a final Normalize layer, both norms equal 1 and the two scores coincide.
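
The two formulas can be checked by hand on small vectors. The following is a minimal pure-Python sketch (the helper names `dot`, `norm`, `cosine`, and `normalize` are illustrative and not part of the notebook); it also shows that the dot product of unit-normalized vectors equals the cosine similarity of the originals.

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean norm."""
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Dot product divided by the product of the norms."""
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

a, b = [3.0, 4.0], [4.0, 3.0]
print(dot(a, b))      # 3*4 + 4*3
print(cosine(a, b))   # 24 / (5 * 5)
ua, ub = normalize(a), normalize(b)
print(abs(dot(ua, ub) - cosine(a, b)) < 1e-12)
```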

In [21]:
from sentence_transformers import util
from sklearn.metrics.pairwise import cosine_similarity

query_embedding   = model.encode('How big is London', show_progress_bar=0)
passage_embedding = model.encode([
    'London has 9,787,426 inhabitants at the 2011 census',
    'London is known for its finacial district'
], show_progress_bar=0)

print(f"Dot Similarity   : {util.dot_score(query_embedding, passage_embedding).tolist()}")
print(f"Cosine Similarity: {cosine_similarity(query_embedding.reshape(1, -1), passage_embedding)}")

Dot Similarity   : [[0.6112002730369568, 0.48592284321784973]]
Cosine Similarity: [[0.61120033 0.48592287]]


## Post-processing: Check Initial Validation Scores

In [22]:
from nltk.tokenize import sent_tokenize

# SEED = 42
# np.random.seed(SEED)
# Function to split, filter and shuffle into sentence pieces
def process_sentences(text, min_sent=7, shuffle=False):
    """Split text into chunks of up to `min_sent` sentences, optionally shuffling the sentences within each chunk."""
    sentences = sent_tokenize(text)
    sentences = np.split(sentences, range(min_sent, len(sentences), min_sent))
    sentences = filter(lambda x: len(x), sentences)
    sentences = [np.random.permutation(i) if shuffle else i for i in sentences]
    sentences = list(map(' '.join, sentences))
    return sentences
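
The chunking step can be sketched without NumPy or NLTK. The version below is an illustration under stated assumptions: `chunk_sentences` is a hypothetical helper, and it uses a naive regular-expression splitter in place of `sent_tokenize`, so sentence boundaries may differ from NLTK's on real text.

```python
import re

def chunk_sentences(text, size=7):
    """Group sentences into space-joined chunks of at most `size` sentences."""
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]

text = "One. Two! Three? Four. Five."
chunks = chunk_sentences(text, size=2)
print(chunks)
```

Each chunk is later encoded separately, which keeps long answers within the model's maximum sequence length.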

In [23]:
try:
    del train_questions_df['predictions']
except KeyError:  # column does not exist yet
    pass

train_questions_df_predict = pd.DataFrame()

for question_id, question, course, year, candidate_answers, answer_id in tqdm(train_questions_df.values):
    answers_list = candidate_answers.split(',')
    
    candidate_answers_list = []
    for ans_id in answers_list:
        candidate_answers_list.append(train_answers_df[train_answers_df.answer_id == int(ans_id)]['answer'].values[0])

    # Adding a single new row
    new_row = {
        'question_id': question_id, 
        'question': question, 
        'candidate_answers_id': answers_list,
        'candidate_answers': candidate_answers_list,
    }

    train_questions_df_predict = train_questions_df_predict._append(new_row, ignore_index=True)
    
    
train_predictions = []
similarity_scores_df = []
for question, candidate_answers, candidate_answers_id in zip(
    tqdm(train_questions_df_predict['question']), 
    train_questions_df_predict['candidate_answers'],
    train_questions_df_predict['candidate_answers_id']
):
    q_embeddings = model.encode([question], show_progress_bar=0)
    a_embeddings = []    
    for answer in candidate_answers:
        a_embedding = np.mean(model.encode(process_sentences(answer), show_progress_bar=0), axis=0)
        a_embeddings.append(a_embedding)    
    similarity_scores = cosine_similarity(q_embeddings, a_embeddings)
    
    train_predictions.append(int(candidate_answers_id[np.argmax(similarity_scores)]))
    similarity_scores_df.append(similarity_scores)

train_questions_df['predictions'] = train_predictions

  0%|          | 0/396 [00:00<?, ?it/s]

  0%|          | 0/396 [00:00<?, ?it/s]
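
The prediction loop averages the chunk embeddings of each candidate answer and picks the candidate with the highest cosine similarity to the question. A minimal sketch of that selection step, using invented 2-dimensional vectors in place of real model embeddings (`mean_vector` and `cosine` are illustrative helpers):

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Invented 2-d "embeddings": one question and chunk embeddings per candidate answer id.
question_vec = [1.0, 0.0]
candidate_chunks = {
    "101": [[0.0, 1.0], [0.2, 0.8]],
    "102": [[0.9, 0.1], [0.7, 0.3]],
    "103": [[-1.0, 0.0]],
}

scores = {aid: cosine(question_vec, mean_vector(chunks))
          for aid, chunks in candidate_chunks.items()}
predicted = int(max(scores, key=scores.get))  # id of the most similar candidate
print(predicted)
```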

In [24]:
# Accuracy calculation
accuracy_full  = (train_questions_df['predictions'] == train_questions_df['answer_id']).mean()
print(f'Accuracy full : {accuracy_full:.4f}')

accuracy_val   = train_questions_df[train_questions_df['question_id'].isin(val_df['question_id']  )].apply(lambda row: row['predictions']==row['answer_id'], axis=1).mean()
accuracy_train = train_questions_df[train_questions_df['question_id'].isin(train_df['question_id'])].apply(lambda row: row['predictions']==row['answer_id'], axis=1).mean()
print(f'Accuracy val  : {accuracy_val:.4f}')
print(f'Accuracy train: {accuracy_train:.4f}')

Accuracy full : 0.9242
Accuracy val  : 0.9250
Accuracy train: 0.9241


## Evaluation Dataset

In [25]:
from sentence_transformers import evaluation

BATCH_SIZE = 16
evaluators = []

# Create lists to store the transformed evaluation data
sentences1, sentences2, labels = [], [], []
for idx, row in tqdm(val_df.iterrows(), total=len(val_df)):
    question = row['question']
    positive = row['positive']
    negative_list = row.get('negatives', [])  # Handle the case where 'negatives' is missing

    # Append the data for positive example
    sentences1.append(question)
    sentences2.append(positive)
    labels.append(1)  # Assuming positive examples should be labeled as 1

    # Append the data for each negative example
    for negative in negative_list:
        sentences1.append(question)
        sentences2.append(negative)
        labels.append(0)  # Assuming negative examples should be labeled as 0
        
print("Number of Binary Evaluator Samples:", len(labels))    
# Initialize the BinaryClassificationEvaluator with sentences1, sentences2, and labels
binary_acc_evaluator = evaluation.BinaryClassificationEvaluator(sentences1, sentences2, labels, batch_size=BATCH_SIZE)
evaluators.append(binary_acc_evaluator)

# Evaluate the model and get the results
# results = binary_evaluator.compute_metrices(model)['dot']["ap"]
results = binary_acc_evaluator(model)
print(f"Binary Accuracy Evaluator Results : {results:.4f}")

  0%|          | 0/80 [00:00<?, ?it/s]

Number of Binary Evaluator Samples: 400
Binary Accuracy Evaluator Results : 0.9140


In [26]:
# Create lists to store the transformed evaluation data
anchors, positives, negatives = [], [], []
for idx, row in tqdm(val_df.iterrows(), total=len(val_df)):
    question = row['question']
    positive = row['positive']
    negative_list = row.get('negatives', [])  # Handle the case where 'negatives' is missing

    # Append the data for each question, positive, negative example
    for negative in negative_list:
        anchors.append(question)
        positives.append(positive)
        negatives.append(negative)

print("Number of Triplet Evaluator Samples:", len(anchors))
# Initialize the TripletEvaluator with anchors, positives, and negatives
triplet_evaluator = evaluation.TripletEvaluator(anchors, positives, negatives, batch_size=BATCH_SIZE)
evaluators.append(triplet_evaluator)

# Evaluate the model and get the results
results = triplet_evaluator(model)
print(f"Triplet Evaluator Results          : {results:.4f}")

  0%|          | 0/80 [00:00<?, ?it/s]

Number of Triplet Evaluator Samples: 320
Triplet Evaluator Results          : 0.9688
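
A triplet counts as correct when the anchor is closer to the positive than to the negative. The sketch below computes that accuracy with cosine distance on invented 2-dimensional vectors; `triplet_accuracy` is a hypothetical helper, and the library's evaluator may also consider other distance functions.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def triplet_accuracy(triplets):
    """Share of (anchor, positive, negative) triplets with d(a, p) < d(a, n)."""
    correct = sum(cosine_distance(a, p) < cosine_distance(a, n) for a, p, n in triplets)
    return correct / len(triplets)

toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),   # positive is closer: correct
    ([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]),   # correct
    ([1.0, 1.0], [1.0, -1.0], [1.0, 0.9]),  # negative is closer: wrong
    ([1.0, 0.0], [1.0, 0.2], [-1.0, 0.0]),  # correct
]
acc = triplet_accuracy(toy_triplets)
print(acc)
```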


In [27]:
# Create lists to store the transformed evaluation data
ir_queries       = {}     # qid => query (qid => question)
ir_corpus        = {}     # cid => doc (qid => positive)
ir_relevant_docs = {}     # qid => Set[cid]: relevant documents for a given query (qid => set([relevant_answer_ids]))
# ir_needed_qids   = set()  # QIDs we need in the corpus

for idx, row in tqdm(val_df.iterrows(), total=len(val_df)):
    question = row['question']
    positive = row['positive']
    qid      = row['question_id']
    relevant_docs = set([qid])

    # Populate the dictionaries
    ir_queries[qid] = question
    ir_corpus[qid]  = positive  
    ir_relevant_docs[qid] = relevant_docs
    # ir_needed_qids.add(qid)

# Create an instance of InformationRetrievalEvaluator
ir_evaluator = evaluation.InformationRetrievalEvaluator(ir_queries, ir_corpus, ir_relevant_docs)
evaluators.append(ir_evaluator)

# Evaluate the model and get the results
results = ir_evaluator(model)
print(f"InformationRetrieval Evaluator Results: {results:.4f}")

  0%|          | 0/80 [00:00<?, ?it/s]

InformationRetrieval Evaluator Results: 0.8509


In [28]:
# Create a SequentialEvaluator, which runs all three evaluators in order.
# Here main_score_function returns the full list of scores (scores[:]) rather than a single value.
seq_evaluator = evaluation.SequentialEvaluator(evaluators, main_score_function=lambda scores: scores[:])
seq_evaluator(model, epoch=0, steps=0)

[0.9139851527653133, 0.96875, 0.8509006211180123]
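
Conceptually, a sequential evaluator calls each evaluator in turn and passes the collected scores to a reducer. The class below is a hypothetical stand-in written for illustration, not the library implementation; its default reducer returns the last score, which to my understanding mirrors the library's default `main_score_function`.

```python
class TinySequentialEvaluator:
    """Hypothetical stand-in: call each evaluator in order, then reduce the scores."""

    def __init__(self, evaluators, main_score_function=lambda scores: scores[-1]):
        self.evaluators = evaluators
        self.main_score_function = main_score_function

    def __call__(self, model):
        scores = [evaluator(model) for evaluator in self.evaluators]
        return self.main_score_function(scores)

# Toy evaluators that ignore the model and return fixed scores.
toy_evaluators = [lambda m: 0.91, lambda m: 0.97, lambda m: 0.85]

last_score = TinySequentialEvaluator(toy_evaluators)(model=None)
all_scores = TinySequentialEvaluator(toy_evaluators, lambda s: s[:])(model=None)
print(last_score)
print(all_scores)
```

Returning the whole list is convenient for inspection, but a single number is needed whenever the score is used to select the best checkpoint.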

## Training Data Preparation

Prepares training examples and dataloader.

MultipleNegativesRankingLoss needs pairs of related sentences, in this case a question and its answer.

Alternatively, TripletLoss could be used, with the question as the anchor, the selected answer as the positive, and the other candidate answers as negatives.

- https://huggingface.co/blog/how-to-train-sentence-transformers

Below is a brief summary of the input format each loss function expects:

* **MultipleNegativesRankingLoss**: Takes (anchor, positive) pairs, optionally extended with a hard negative as a third text. The other positives in the batch act as negatives.
* **CachedMultipleNegativesRankingLoss**: Same input format as MultipleNegativesRankingLoss; uses gradient caching so that much larger batch sizes fit in memory.
* **MultipleNegativesSymmetricRankingLoss**: Takes (anchor, positive) pairs and adds the reverse direction (finding the anchor given the positive) to the loss.
* **TripletLoss**: Standard triplet loss; requires (anchor, positive, negative) triplets.
* **BatchAllTripletLoss**: Takes single sentences with class labels and forms all valid triplets within a batch.
* **BatchHardSoftMarginTripletLoss**: Same input as BatchHardTripletLoss, but uses a soft margin instead of a fixed one.
* **BatchHardTripletLoss**: Takes single sentences with class labels and, for each anchor, uses the hardest positive and the hardest negative in the batch.
* **BatchSemiHardTripletLoss**: Same input as BatchHardTripletLoss, but builds triplets from semi-hard negatives instead of the hardest ones.
* **MegaBatchMarginLoss**: Takes (anchor, positive) pairs in large batches and, for each anchor, mines the hardest negative from the other positives.
* **CosineSimilarityLoss**: Takes sentence pairs labeled with a float similarity score.
* **SoftmaxLoss**: Takes sentence pairs with a class label, as in natural language inference data.
* **ContrastiveLoss**: Takes sentence pairs labeled 1 (similar) or 0 (dissimilar).
* **ContrastiveTensionLoss**: Unsupervised objective trained from unlabeled sentences.
* **OnlineContrastiveLoss**: Same input as ContrastiveLoss, but computes the loss only for the hard positive and hard negative pairs in each batch.
* **DenoisingAutoEncoderLoss**: Takes (corrupted sentence, original sentence) pairs for unsupervised denoising training.
* **MSELoss**: Mean squared error between the computed embedding and a target embedding, typically used for distillation.
* **MarginMSELoss**: Takes (query, positive, negative) triplets labeled with a teacher model's score margin.
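
To make the in-batch-negatives idea behind MultipleNegativesRankingLoss concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the library code: `mnr_loss` is a hypothetical helper, and the cosine similarity with a scale of 20 reflects what I understand to be the library defaults.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Every positive in the batch is a candidate; all but index i act as negatives.
        logits = [scale * cosine(anchor, p) for p in positives]
        log_sum_exp = math.log(sum(math.exp(l) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each anchor is near its own positive
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each anchor is near the other's positive
print(mnr_loss(anchors, aligned) < mnr_loss(anchors, swapped))
```

Larger batches provide more in-batch negatives per anchor, which is why this loss tends to benefit from bigger batch sizes.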

![](https://i.imgur.com/kV7eWJg.png)

In [29]:
from sentence_transformers import losses, InputExample, SentencesDataset
from torch.utils.data import DataLoader
from itertools import cycle, zip_longest

train_samples_MultipleNegativesRankingLoss = []
train_samples_ConstrativeLoss = []
for question_id, question, positive, negative_list in tqdm(train_df.values):        
    for negative in negative_list:
        positive_sents = process_sentences(positive, shuffle=True)
        negative_sents = process_sentences(negative, shuffle=True)
        # lower = list(filter(lambda x: len(x) == min(map(len, [positive_sents, negative_sents])), [positive_sents, negative_sents])).pop()
        # np.random.shuffle(lower)
        # for positive_sent, negative_sent in zip_longest(positive_sents, negative_sents, fillvalue=lower[0]):
        for positive_sent, negative_sent in zip(positive_sents, negative_sents):
            # MultipleNegativesRankingLoss, if A is a positive of B, then B is a positive of A
            train_samples_MultipleNegativesRankingLoss.append(InputExample(
                texts=[question, positive_sent, negative_sent], label=1
            ))
            train_samples_MultipleNegativesRankingLoss.append(InputExample(
                texts=[positive_sent, question, negative_sent], label=1
            ))
            if np.random.uniform() < 0.7:
                # ConstrativeLoss
                train_samples_ConstrativeLoss.append(InputExample(texts=[question, positive_sent], label=1))
                train_samples_ConstrativeLoss.append(InputExample(texts=[question, negative_sent], label=0))
            
print(f"Number of MultipleNegativesRankingLoss Training Examples: {len(train_samples_MultipleNegativesRankingLoss)}") 
print(f"Number of OnlineContrastiveLoss        Training Examples: {len(train_samples_ConstrativeLoss)}\n")        
# Shuffle the training examples
np.random.shuffle(train_samples_MultipleNegativesRankingLoss)
np.random.shuffle(train_samples_ConstrativeLoss)

# Create data loader and loss for MultipleNegativesRankingLoss
# train_samples_MultipleNegativesRankingLoss  = SentencesDataset(train_samples_MultipleNegativesRankingLoss, model=model)
train_dataloader_MultipleNegativesRankingLoss = DataLoader(train_samples_MultipleNegativesRankingLoss, shuffle=True, batch_size=BATCH_SIZE)
train_loss_MultipleNegativesRankingLoss       = losses.MultipleNegativesRankingLoss(model)

# Create data loader and loss for OnlineContrastiveLoss, we use cosine distance (cosine_distance = 1-cosine_similarity)
train_dataloader_ConstrativeLoss = DataLoader(train_samples_ConstrativeLoss, shuffle=True, batch_size=BATCH_SIZE)
train_loss_ConstrativeLoss       = losses.OnlineContrastiveLoss(model=model)

## Alternative losses; each expects a matching dataset format ((anchor, positive, negative), where the negative is optional for some).
# train_loss = losses.ContrastiveLoss(model=model)
# train_loss = losses.MultipleNegativesSymmetricRankingLoss(model=model)
# train_loss = losses.MultipleNegativesRankingLoss(model=model)
# train_loss = losses.TripletLoss(model=model)
# len(dataloader) is the number of batches per epoch
print(f"Number of MultipleNegativesRankingLoss Training Epoch   : {len(train_dataloader_MultipleNegativesRankingLoss)}")
print(f"Number of OnlineContrastiveLoss        Training Epoch   : {len(train_dataloader_ConstrativeLoss)}")

  0%|          | 0/316 [00:00<?, ?it/s]

Number of MultipleNegativesRankingLoss Training Examples: 4546
Number of OnlineContrastiveLoss        Training Examples: 3112

Number of MultipleNegativesRankingLoss Training Epoch   : 285
Number of OnlineContrastiveLoss        Training Epoch   : 195
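
The second objective is a contrastive loss over labeled pairs. The sketch below implements the standard contrastive form with cosine distance on toy vectors; it is an assumption-laden illustration: `contrastive_loss` is a hypothetical helper, the margin of 0.5 is what I believe the library default to be, and the online variant's extra step of keeping only hard pairs in each batch is not implemented here.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def contrastive_loss(pairs, margin=0.5):
    """Mean of 0.5 * (y * d^2 + (1 - y) * max(0, margin - d)^2) over (a, b, y) pairs."""
    total = 0.0
    for a, b, label in pairs:
        d = cosine_distance(a, b)
        # Positives (y=1) are pulled together; negatives (y=0) are pushed past the margin.
        total += 0.5 * (label * d ** 2 + (1 - label) * max(0.0, margin - d) ** 2)
    return total / len(pairs)

good = [([1.0, 0.0], [1.0, 0.0], 1), ([1.0, 0.0], [0.0, 1.0], 0)]
bad = [([1.0, 0.0], [0.0, 1.0], 1), ([1.0, 0.0], [1.0, 0.0], 0)]
print(contrastive_loss(good))
print(contrastive_loss(bad))
```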


## Model Training

Trains the model over the specified number of epochs.

- https://github.com/UKPLab/sentence-transformers/tree/master/examples

In [30]:
# Configure the training
initial_learning_rate = 2e-5  # 2e-5
weight_decay          = 0.01  # 0.01
num_epochs   = 100
warmup_steps = np.ceil(len(train_dataloader_MultipleNegativesRankingLoss) * num_epochs * 0.1)  # 10% of the total planned steps (batches per epoch * num_epochs)

# Early stopping parameters
early_stop = False
best_score = float('-inf')
patience = 2
counter  = 0

# Best model file path
best_model_path = './best_model'
best_model_state_dict = 'best_model.safetensors'

# Train the model
for epoch in tqdm(range(num_epochs), desc="Training"):  # Set the number of epochs
    # cleaning
    gc.collect()
    model.fit(
        train_objectives = [
            (train_dataloader_MultipleNegativesRankingLoss, train_loss_MultipleNegativesRankingLoss), 
            (train_dataloader_ConstrativeLoss, train_loss_ConstrativeLoss)
        ],
        epochs           = 1,
        optimizer_params = {'lr': initial_learning_rate},
        weight_decay     = weight_decay,
        warmup_steps     = warmup_steps,
        show_progress_bar= 0,                           # Hide the progress bar
        # evaluator      = seq_evaluator,
        # output_path    = best_model_path,
        # save_best_model= True
    )
    # Evaluate on the validation set
    score = binary_acc_evaluator(model)
    print(
        f"Epoch {epoch + 1}: "
        f"binary_evaluator Score = {score}, "
        f"triplet_evaluator Score = {triplet_evaluator(model)}, "
        f"ir_evaluator Score = {ir_evaluator(model)}"
    )
    # Check for early stopping
    if score > best_score:
        best_score = score
        counter = 0
        # Save the trained model
        torch.save(model.state_dict(), best_model_state_dict)
        # Use Sentence Transformers' save method
        # model.save(best_model_path)
    else:
        counter += 1
        # No improvement: decay the learning rate and weight decay for the next epoch
        initial_learning_rate *= 0.85
        weight_decay          *= 0.85
        if counter >= patience:
            print(f"Early stopping at epoch {epoch + 1 - patience}")
            early_stop = True
            break

# Restore the best weights if early stopping occurred
if early_stop:
    # Load the best model state dict into the model
    model.load_state_dict(torch.load(best_model_state_dict))
    print("Best weights restored.")

Training:   0%|          | 0/100 [00:00<?, ?it/s]

Epoch 1: binary_evaluator Score = 0.9447816265234429, triplet_evaluator Score = 0.99375, ir_evaluator Score = 0.8752380952380951

Epoch 2: binary_evaluator Score = 0.957717869564054, triplet_evaluator Score = 0.996875, ir_evaluator Score = 0.8839972527472529

Epoch 3: binary_evaluator Score = 0.9678803045385579, triplet_evaluator Score = 0.996875, ir_evaluator Score = 0.8886607142857142

Epoch 4: binary_evaluator Score = 0.9722261108198784, triplet_evaluator Score = 0.996875, ir_evaluator Score = 0.9022023809523809

Epoch 5: binary_evaluator Score = 0.9765535115112803, triplet_evaluator Score = 0.996875, ir_evaluator Score = 0.8967708333333333

Epoch 6: binary_evaluator Score = 0.9792795897575098, triplet_evaluator Score = 0.996875, ir_evaluator Score = 0.8827083333333334

Epoch 7: binary_evaluator Score = 0.9802939077629669, triplet_evaluator Score = 0.996875, ir_evaluator Score = 0.8827083333333334

Epoch 8: binary_evaluator Score = 0.9834489278804479, triplet_evaluator Score = 0.996
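
The early-stopping logic of the training loop can be replayed on a list of validation scores. The sketch below is a simplified illustration: `run_with_early_stopping` is a hypothetical helper, the scores are invented, and no model is trained or saved.

```python
def run_with_early_stopping(scores, lr=2e-5, patience=2, decay=0.85):
    """Replay per-epoch validation scores; return (best_epoch, stop_epoch, final_lr)."""
    best_score, best_epoch, counter = float("-inf"), None, 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            # Improvement: remember it (the real loop saves a checkpoint here).
            best_score, best_epoch, counter = score, epoch, 0
        else:
            counter += 1
            lr *= decay  # shrink the learning rate after an epoch without improvement
            if counter >= patience:
                return best_epoch, epoch, lr
    return best_epoch, None, lr

best_epoch, stop_epoch, final_lr = run_with_early_stopping(
    [0.94, 0.96, 0.95, 0.97, 0.96, 0.96]
)
print(best_epoch, stop_epoch)
```

With a patience of 2, training stops after two consecutive epochs without improvement, and the weights from the best epoch are the ones to restore.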

In [31]:
# Load the fine-tuned model from the Hugging Face Hub
model = SentenceTransformer('celik-muhammed/all-mpnet-base-v2-finetuned-dtc-zoomcamp')
model.max_seq_length = 512
model = model.to('cuda')
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': True, 'pooling_mode_mean_sqrt_len_tokens': True})
  (2): Dense({'in_features': 3072, 'out_features': 768, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
  (3): Normalize()
)

In [32]:
binary_acc_evaluator(model)

0.9843386953324966

## Prediction on Training Data

Generates predictions on the training data and appends them to the dataframe.

In [33]:
try:
    del train_questions_df['predictions']
except KeyError:  # column does not exist yet
    pass

train_questions_df_predict = pd.DataFrame()

for question_id, question, course, year, candidate_answers, answer_id in tqdm(train_questions_df.values):
    answers_list = candidate_answers.split(',')
    
    candidate_answers_list = []
    for ans_id in answers_list:
        candidate_answers_list.append(train_answers_df[train_answers_df.answer_id == int(ans_id)]['answer'].values[0])

    # Adding a single new row
    new_row = {
        'question_id': question_id, 
        'question': question, 
        'candidate_answers_id': answers_list,
        'candidate_answers': candidate_answers_list,
    }

    train_questions_df_predict = train_questions_df_predict._append(new_row, ignore_index=True)
    
    
train_predictions = []
similarity_scores_df = []
for question, candidate_answers, candidate_answers_id in zip(
    tqdm(train_questions_df_predict['question']), 
    train_questions_df_predict['candidate_answers'],
    train_questions_df_predict['candidate_answers_id']
):
    q_embeddings = model.encode([question], show_progress_bar=0)
    a_embeddings = []    
    for answer in candidate_answers:
        a_embedding = np.mean(model.encode(process_sentences(answer), show_progress_bar=0), axis=0)
        a_embeddings.append(a_embedding)    
    similarity_scores = cosine_similarity(q_embeddings, a_embeddings)
    
    train_predictions.append(int(candidate_answers_id[np.argmax(similarity_scores)]))
    similarity_scores_df.append(similarity_scores)

train_questions_df['predictions'] = train_predictions

  0%|          | 0/396 [00:00<?, ?it/s]

  0%|          | 0/396 [00:00<?, ?it/s]

## Evaluation

Calculates and prints the accuracy of the model on the full training data and on its train and validation splits.

In [34]:
# Accuracy calculation
accuracy_full  = (train_questions_df['predictions'] == train_questions_df['answer_id']).mean()
print(f'Accuracy full : {accuracy_full:.4f}')

accuracy_val   = train_questions_df[train_questions_df['question_id'].isin(val_df['question_id']  )].apply(lambda row: row['predictions']==row['answer_id'], axis=1).mean()
accuracy_train = train_questions_df[train_questions_df['question_id'].isin(train_df['question_id'])].apply(lambda row: row['predictions']==row['answer_id'], axis=1).mean()
print(f'Accuracy val  : {accuracy_val:.4f}')
print(f'Accuracy train: {accuracy_train:.4f}')

Accuracy full : 0.9949
Accuracy val  : 0.9750
Accuracy train: 1.0000
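
The per-split accuracies can also be computed with boolean masks instead of a row-wise `apply`. This is a small sketch on toy arrays; the IDs, predictions, and the validation split below are assumed values, not data from the competition.

```python
import numpy as np

# Toy data (assumed values): five questions with true and predicted answer IDs.
question_ids = np.array([1, 2, 3, 4, 5])
answer_ids = np.array([10, 20, 30, 40, 50])
predictions = np.array([10, 20, 31, 40, 50])
val_ids = np.array([3, 4])  # hypothetical validation split

correct = predictions == answer_ids       # one boolean per question
in_val = np.isin(question_ids, val_ids)   # True for validation questions

accuracy_full = correct.mean()
accuracy_val = correct[in_val].mean()
accuracy_train = correct[~in_val].mean()
print(f"{accuracy_full:.4f} {accuracy_val:.4f} {accuracy_train:.4f}")
```

Since the full set contains the questions the model was trained on, the validation figure is probably the more informative of the three.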


## Prediction and Submission for Test Data

Generates predictions for the test data and creates a submission file for the Kaggle competition.

In [35]:
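# Drop predictions from a previous run, if present, so the cell can be re-run
test_questions_df = test_questions_df.drop(columns='predicted_answer_id', errors='ignore')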

test_questions_df = test_questions_df.drop_duplicates(subset='question_id')
test_questions_df.shape

# Collect one dict per question and build the dataframe once at the end
# (avoids calling the private, slow DataFrame._append inside the loop)
rows = []

for question_id, question, course, year, candidate_answers in tqdm(test_questions_df.values):
    answers_list = candidate_answers.split(',')

    # Look up the text of every candidate answer by its ID
    candidate_answers_list = []
    for ans_id in answers_list:
        candidate_answers_list.append(test_answers_df[test_answers_df.answer_id == int(ans_id)]['answer'].values[0])

    rows.append({
        'question_id': question_id,
        'question': question,
        'candidate_answers_id': answers_list,
        'candidate_answers': candidate_answers_list,
    })

test_questions_df_predict = pd.DataFrame(rows)
    
    
test_predictions = []
similarity_scores_df = []
for question, candidate_answers, candidate_answers_id in zip(
    tqdm(test_questions_df_predict['question']), 
    test_questions_df_predict['candidate_answers'],
    test_questions_df_predict['candidate_answers_id']
):
    q_embeddings = model.encode([question], show_progress_bar=0)
    a_embeddings = []    
    for answer in candidate_answers:
        a_embedding = np.mean(model.encode(process_sentences(answer), show_progress_bar=0), axis=0)
        a_embeddings.append(a_embedding)    
    similarity_scores = cosine_similarity(q_embeddings, a_embeddings)
    
    test_predictions.append(int(candidate_answers_id[np.argmax(similarity_scores)]))
    similarity_scores_df.append(similarity_scores)

test_questions_df['predicted_answer_id'] = test_predictions

  0%|          | 0/514 [00:00<?, ?it/s]


In [36]:
test_questions_df['predicted_answer_id'].nunique()

491

In [37]:
test_questions_df[['question_id', 'predicted_answer_id']].to_csv('submission.csv', index=False)

In [38]:
submission_df = pd.read_csv("/kaggle/working/submission.csv")
submission_df.head()

|   | question_id | predicted_answer_id |
|---|-------------|---------------------|
| 0 | 707 | 767296 |
| 1 | 534450 | 573165 |
| 2 | 996163 | 571892 |
| 3 | 860215 | 988549 |
| 4 | 980124 | 384381 |
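
Before uploading, it can help to sanity-check the file. The sketch below is one possible check, assuming the submission must contain exactly the two columns written above, one row per question, and integer answer IDs; the helper `check_submission` is hypothetical and not part of the competition tooling.

```python
import csv
import io

def check_submission(csv_text, expected_rows):
    """Return a list of problems found in a submission CSV (an empty list means it looks fine)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["question_id", "predicted_answer_id"]:
        return [f"unexpected columns: {reader.fieldnames}"]
    rows = list(reader)
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    question_ids = [row["question_id"] for row in rows]
    if len(set(question_ids)) != len(question_ids):
        problems.append("duplicate question_id values")
    if not all((row["predicted_answer_id"] or "").isdigit() for row in rows):
        problems.append("non-integer predicted_answer_id values")
    return problems

# Two rows copied from the head of the submission shown above.
sample = "question_id,predicted_answer_id\n707,767296\n534450,573165\n"
print(check_submission(sample, expected_rows=2))
```

For the real file, the text could be read with `open('submission.csv').read()` and `expected_rows` set to the number of test questions.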


In [39]:
# Save Versions
# now = datetime.datetime.now().strftime("%d-%b-%Y %H-%M-%S")
# now = pd.Timestamp.now().strftime("%d-%b-%Y %H-%M-%S")
# np.save(now, np.array([now]))

from IPython.display import FileLink, FileLinks
local_file = FileLink(r'submission.csv', result_html_prefix="Click here to download: ")
local_file

## Save and Push the Model to the Hugging Face Hub

- https://huggingface.co/docs/transformers/v4.15.0/model_sharing#directly-push-your-model-to-the-hub

In [41]:
model.save(".", model_name=save_to_hub_model_id)

{'__version__': {'sentence_transformers': '2.2.2',
  'transformers': '4.36.2',
  'pytorch': '2.0.0'}}

In [1]:
# import huggingface_hub as hf
# from huggingface_hub import HfApi
# import json

# # Ensure you are logged in
# # hf.login()

# HUGGINGFACE_TOKEN = ''
# if HUGGINGFACE_TOKEN:
#     !huggingface-cli login --token $HUGGINGFACE_TOKEN --add-to-git-credential
    
#     # Replace 'username/model' with the actual username and model name on the Hugging Face Model Hub
#     save_to_hub_model_id = f"{hf.whoami()['name']}/{model_checkpoint.split('/')[-1]}-finetuned-dtc-zoomcamp"
    
#     # Save the model in local as safetensors
#     model.save(save_to_hub_model_id, model_name=save_to_hub_model_id)
#     # Save the model in PyTorch format using native PyTorch functions
#     torch.save(model.state_dict(), save_to_hub_model_id+'/pytorch_model.bin')
        
#     # update config file
#     model.save(save_to_hub_model_id, model_name=save_to_hub_model_id)

#     # Push the tokenizer
#     tokenizer.push_to_hub(save_to_hub_model_id)
#     # Push to Hub all model Safetensors, PyTorch format
#     api = HfApi()
#     api.upload_folder(
#         repo_id=save_to_hub_model_id,
#         folder_path=save_to_hub_model_id,
#         repo_type="model",
#     )

## Check Saved Model

In [44]:
from sentence_transformers import SentenceTransformer
sentences = ['This is an example sentence', 'Each sentence is converted']

model_checkpoint = 'celik-muhammed/all-mpnet-base-v2-finetuned-dtc-zoomcamp'
model      = SentenceTransformer(model_checkpoint)
embeddings = model.encode(sentences)
print(embeddings.shape)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

(2, 768)


## Export to TFLite

- https://www.tensorflow.org/lite/guide/ops_select
- https://huggingface.co/docs/transformers/main/en/tflite#export-to-tflite

In [45]:
from transformers import AutoConfig, AutoTokenizer, AutoModel, TFAutoModel
import tensorflow as tf; tf.keras.backend.clear_session()

## Load the model (architecture + weights), from safetensors
model_checkpoint = 'celik-muhammed/all-mpnet-base-v2-finetuned-dtc-zoomcamp'
model     = TFAutoModel.from_pretrained(model_checkpoint, from_pt=False)
## Set the entire model to be non-trainable
model.trainable = False
model.compile()
model.summary()

config.json:   0%|          | 0.00/613 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

All PyTorch model weights were used when initializing TFMPNetModel.

All the weights of TFMPNetModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMPNetModel for predictions without further training.


Model: "tfmp_net_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 mpnet (TFMPNetMainLayer)    multiple                  109486464 
                                                                 
Total params: 109486464 (417.66 MB)
Trainable params: 0 (0.00 Byte)
Non-trainable params: 109486464 (417.66 MB)
_________________________________________________________________


In [46]:
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
converter.target_spec.supported_ops = [
  tf.lite.OpsSet.TFLITE_BUILTINS, # enable TensorFlow Lite ops
  tf.lite.OpsSet.SELECT_TF_OPS, # enable TensorFlow ops
]
tflite_model = converter.convert()
with open('mpnet-dtc-zoomcamp_tfmodel.tflite', 'wb') as f_out:
    f_out.write(tflite_model)

## Dummy Testing Model and Prepare for Containerization

In [48]:
!pip install -U https://github.com/alexeygrigorev/tflite-aws-lambda/raw/main/tflite/tflite_runtime-2.14.0-cp310-cp310-linux_x86_64.whl

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting tflite-runtime==2.14.0
  Downloading https://github.com/alexeygrigorev/tflite-aws-lambda/raw/main/tflite/tflite_runtime-2.14.0-cp310-cp310-linux_x86_64.whl (2.4 MB)
     2.4/2.4 MB 39.0 MB/s eta 0:00:00
Installing collected packages: tflite-runtime
Successfully installed tflite-runtime-2.14.0


In [49]:
# !pip install -U https://github.com/alexeygrigorev/tflite-aws-lambda/raw/main/tflite/tflite_runtime-2.14.0-cp310-cp310-linux_x86_64.whl
import requests
import numpy as np
import tflite_runtime.interpreter as tflite
import json  # Import the json module
# Disable TensorFlow warnings, before you import tensorflow
import os; os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import gc; gc.collect()

# Example question and answers; the smoke test below only feeds a dummy token-ID tensor
question    = "That is a happy person"
answers     = ["That is a happy dog", "That is a very happy person", "Today is a sunny day"]
dummy_input = np.array([[1, 1, 1]], dtype=np.int32)

# Load the TFLite model in TFLite Interpreter and allocate tensors.
interpreter = tflite.Interpreter(model_path='mpnet-dtc-zoomcamp_tfmodel.tflite')
interpreter.allocate_tensors()

# Get input and output tensors.
# input_details    = interpreter.get_input_details()
# output_details   = interpreter.get_output_details()

# Print the signatures from the converted model
# {'serving_default': {'inputs': ['attention_mask', 'input_ids'], 'outputs': ['last_hidden_state', 'pooler_output']}}
interpreter.get_signature_list()

infer  = interpreter.get_signature_runner("serving_default")
output = infer(input_ids=dummy_input, attention_mask=dummy_input)['last_hidden_state'].squeeze()
output = np.mean(output, axis=0)

# Convert the result to a JSON-formatted string
result_json = json.dumps({"shape": output.shape}, indent=2)
print(result_json)

INFO: Created TensorFlow Lite delegate for select TF ops.
INFO: TfLiteFlexDelegate delegate: 4 nodes delegated out of 1350 nodes with 1 partitions.

INFO: Created TensorFlow Lite XNNPACK delegate for CPU.


{
  "shape": [
    768
  ]
}


## Testing the TFLite Model with the Tokenizer

In [51]:
# !pip install -U https://github.com/alexeygrigorev/tflite-aws-lambda/raw/main/tflite/tflite_runtime-2.14.0-cp310-cp310-linux_x86_64.whl
import requests
import numpy as np
import tflite_runtime.interpreter as tflite
import json  # Import the json module
# Disable TensorFlow warnings, before you import tensorflow
import os; os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import gc; gc.collect()
from transformers import AutoTokenizer
# Load the tokenizer that belongs to the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained('celik-muhammed/all-mpnet-base-v2-finetuned-dtc-zoomcamp')

# Example question and candidate answers
question = "That is a happy person"
answers = ["That is a happy dog", "That is a very happy person", "Today is a sunny day"]
# Tokenize the question
tokenized_question = tokenizer(question, return_tensors="np", padding=True, truncation=True, max_length=128)

# Load the TFLite model in TFLite Interpreter and allocate tensors.
interpreter = tflite.Interpreter(model_path='mpnet-dtc-zoomcamp_tfmodel.tflite')
interpreter.allocate_tensors()

# Get input and output tensors.
# input_details    = interpreter.get_input_details()
# output_details   = interpreter.get_output_details()
# {'serving_default': {'inputs': ['attention_mask', 'input_ids'], 'outputs': ['last_hidden_state', 'pooler_output']}}
interpreter.get_signature_list()

infer  = interpreter.get_signature_runner("serving_default")
output = infer(
    input_ids      = np.array(tokenized_question['input_ids'], dtype=np.int32), 
    attention_mask = np.array(tokenized_question['attention_mask'], dtype=np.int32)
)['last_hidden_state'].squeeze()

# Get the output data
q_embedding = np.mean(output, axis=0)

# q_embedding now holds the question embedding. Next, embed each answer
# and compute its cosine similarity with the question.

# Embed and score each candidate answer
a_embeddings = []
for i, answer in enumerate(answers):
    # Process each answer with your model and obtain the embedding
    tokenized_answers  = tokenizer(answer, return_tensors="np", padding=True, truncation=True, max_length=128)
    output = infer(
        input_ids      = np.array(tokenized_answers['input_ids'], dtype=np.int32),
        attention_mask = np.array(tokenized_answers['attention_mask'], dtype=np.int32)
    )['last_hidden_state'].squeeze()
    a_embedding = np.mean(output, axis=0)
    
    # Calculate cosine similarity between the question and the current answer
    similarity = np.dot(q_embedding, a_embedding) / (np.linalg.norm(q_embedding) * np.linalg.norm(a_embedding))
    a_embeddings.append({"answer": answer, "similarity": f"{similarity:.3f}"})

# Convert the result to a JSON-formatted string
result_json = json.dumps({"question": question, "results": a_embeddings}, indent=2)
print(result_json)

{
  "question": "That is a happy person",
  "results": [
    {
      "answer": "That is a happy dog",
      "similarity": "0.760"
    },
    {
      "answer": "That is a very happy person",
      "similarity": "0.971"
    },
    {
      "answer": "Today is a sunny day",
      "similarity": "0.355"
    }
  ]
}
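
The cell above averages all token vectors, which works when a single sentence is encoded without padding. If several sentences were tokenized together and padded to a common length, the padding positions would need to be excluded from the average. A minimal sketch of such a mask-weighted mean, using toy arrays rather than real model output (the helper name `masked_mean_pool` is hypothetical):

```python
import numpy as np

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors, ignoring positions where attention_mask is 0."""
    hidden = np.asarray(last_hidden_state, dtype=np.float32)        # (batch, tokens, dim)
    mask = np.asarray(attention_mask, dtype=np.float32)[..., None]  # (batch, tokens, 1)
    summed = (hidden * mask).sum(axis=1)                            # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                  # avoid division by zero
    return summed / counts

# One "sentence" with two real tokens and one padding token (assumed values).
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

With the padding token masked out, the result is the mean of the first two token vectors only.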


## Convert .ipynb to .py

Shell command for converting the notebook to a Python script:

> jupyter nbconvert --to script model.ipynb

A handler function like the one below must be added to the lambda_function.py file:


```py
def lambda_handler(event, context):
    payload = event['inputs']
    result = query(payload)
    return result
```

In [52]:
%%writefile lambda_function.py
#!/usr/bin/env python
# coding: utf-8

# !pip install -U https://github.com/alexeygrigorev/tflite-aws-lambda/raw/main/tflite/tflite_runtime-2.14.0-cp310-cp310-linux_x86_64.whl
import requests
import numpy as np
import tflite_runtime.interpreter as tflite
import json  # Import the json module
# Disable TensorFlow warnings, before you import tensorflow
import os; os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import gc; gc.collect()
from transformers import AutoTokenizer
# Load the tokenizer that belongs to the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained('celik-muhammed/all-mpnet-base-v2-finetuned-dtc-zoomcamp')

# Load the TFLite model in TFLite Interpreter and allocate tensors.
interpreter = tflite.Interpreter(model_path='mpnet-dtc-zoomcamp_tfmodel.tflite')
interpreter.allocate_tensors()
infer  = interpreter.get_signature_runner("serving_default")

def query(payload):
    # Read the question and candidate answers from the payload
    question = payload['question']
    answers  = payload['answers']
    
    # Tokenize the question and compute its embedding
    tokenized_question = tokenizer(question, return_tensors="np", padding=True, truncation=True, max_length=128)
    output = infer(
        input_ids      = np.array(tokenized_question['input_ids'], dtype=np.int32), 
        attention_mask = np.array(tokenized_question['attention_mask'], dtype=np.int32)
    )['last_hidden_state'].squeeze()
    # Get the output data
    q_embedding = np.mean(output, axis=0)

    # Embed each candidate answer and score it against the question
    a_embeddings = []
    for i, answer in enumerate(answers):
        # Process each answer with your model and obtain the embedding
        tokenized_answers  = tokenizer(answer, return_tensors="np", padding=True, truncation=True, max_length=128)
        output = infer(
            input_ids      = np.array(tokenized_answers['input_ids'], dtype=np.int32),
            attention_mask = np.array(tokenized_answers['attention_mask'], dtype=np.int32)
        )['last_hidden_state'].squeeze()
        a_embedding = np.mean(output, axis=0)

        # Calculate cosine similarity between the question and the current answer
        similarity = np.dot(q_embedding, a_embedding) / (np.linalg.norm(q_embedding) * np.linalg.norm(a_embedding))
        a_embeddings.append({"answer": answer, "similarity": f"{similarity:.3f}"})

    # Convert the result to a JSON-formatted string
    result_json = json.dumps({"question": question, "results": a_embeddings}, indent=2)
    return result_json

def lambda_handler(event, context):
    payload = event['inputs']
    result = query(payload)
    return result

Writing lambda_function.py
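
The handler above assumes that the event always has the shape `{"inputs": {"question": ..., "answers": [...]}}`. A possible guard against malformed events is sketched below; the function `parse_event` is a hypothetical helper, not something the notebook or AWS provides.

```python
def parse_event(event):
    """Extract (question, answers) from a Lambda event or raise ValueError."""
    inputs = event.get("inputs") if isinstance(event, dict) else None
    if not isinstance(inputs, dict):
        raise ValueError("event must contain an 'inputs' object")
    question = inputs.get("question")
    answers = inputs.get("answers")
    if not isinstance(question, str) or not question.strip():
        raise ValueError("'question' must be a non-empty string")
    if (not isinstance(answers, list) or not answers
            or not all(isinstance(a, str) for a in answers)):
        raise ValueError("'answers' must be a non-empty list of strings")
    return question, answers

question, answers = parse_event(
    {"inputs": {"question": "That is a happy person", "answers": ["That is a happy dog"]}}
)
print(question, answers)
```

Inside `lambda_handler`, the returned pair could then be passed on to the embedding code instead of indexing the payload directly.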


## Testing function for Lambda

In [53]:
import lambda_function

# Create the event with the question and candidate answers
event = {
    "inputs": {
    "question": "That is a happy person",
    "answers" : [
        "That is a happy dog",
        "That is a very happy person",
        "Today is a sunny day"
        ]
    },
}

# Invoke the Lambda function locally
result = lambda_function.lambda_handler(event, None)
print(result)

{
  "question": "That is a happy person",
  "results": [
    {
      "answer": "That is a happy dog",
      "similarity": "0.760"
    },
    {
      "answer": "That is a very happy person",
      "similarity": "0.971"
    },
    {
      "answer": "Today is a sunny day",
      "similarity": "0.355"
    }
  ]
}
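
Because `query` returns a JSON string in which the similarities are formatted as strings, a caller has to parse the result and convert the scores before ranking. A short sketch, using a compact copy of the output above as sample input:

```python
import json

# Compact copy of the handler output shown above.
result = json.dumps({
    "question": "That is a happy person",
    "results": [
        {"answer": "That is a happy dog", "similarity": "0.760"},
        {"answer": "That is a very happy person", "similarity": "0.971"},
        {"answer": "Today is a sunny day", "similarity": "0.355"},
    ],
})

parsed = json.loads(result)
# Similarities arrive as strings, so convert them to float before comparing.
best = max(parsed["results"], key=lambda r: float(r["similarity"]))
print(best["answer"])
```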


## Pipenv for Dependency Management

In [56]:
%%bash
get_versions() {
  for lib in "$@"; do
    info=$(pip show $lib 2>/dev/null)
    if [ $? -eq 0 ]; then
      version=$(echo "$info" | awk -F ': ' '$1=="Version" {print $2}')
      echo "$lib==$version"
    fi
  done
}
# Example usage
get_versions  "transformers" "tflite_runtime" "numpy" > requirements.txt
cat requirements.txt

transformers==4.36.2
tflite_runtime==2.14.0
numpy==1.24.3
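
The same pinned list can also be produced from Python with the standard library. This is a sketch of an alternative to the bash helper, not what the notebook ran; the function name `pinned_requirements` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

def pinned_requirements(packages):
    """Return 'name==version' lines for every package that is installed."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Skip packages that are not installed, like the bash helper does.
            continue
    return lines

lines = pinned_requirements(["transformers", "tflite_runtime", "numpy"])
print("\n".join(lines))
```

The printed lines depend on what is installed in the current environment, so they will only match the output above on the same Kaggle image.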


In [57]:
# Generate a Pipfile from requirements.txt
!pipenv --python 3.10
# Create a virtual environment and install packages
!pipenv install
# Update the Pipfile.lock
!pipenv lock
# Clean up the temporary virtual environment
!pipenv --rm
# Check
# !ls ~/.local/share/virtualenvs/
!cat Pipfile

Creating a virtualenv for this project...
Pipfile: /kaggle/working/Pipfile
Using /opt/conda/bin/python3.1 (3.10.12) to create virtualenv...
created virtual environment CPython3.10.12.final.0-64 in 1620ms
  creator CPython3Posix(dest=/root/.local/share/virtualenvs/working-MLRu3Pvq, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==23.3.1, setuptools==69.0.2, wheel==0.42.0
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
✔ Successfully created virtual environment!
Virtualenv location: /root/.local/share/virtualenvs/working-MLRu3Pvq
requirements.txt found in /kaggle/

Pipfile.lock not found, creating...
Locking [packages] dependencies...
Building requirements...
Resolving dependencies...
✔ Success! Locking...
Locking [dev-packages] dependencies...
Updated Pipfile.lock (02cd654209c8f6373a9812d72a6e911b97765ff86013ea787cebbef2b80b7454)!
Installing dependencies from Pipfile.lock (0b7454)...
To activate this project's virtualenv, run pipenv shell.
Alternatively, run a command inside the virtualenv with pipenv run.

Locking [packages] dependencies...
Building requirements...
Resolving dependencies...
✔ Success! Locking...
Locking [dev-packages] dependencies...
Updated Pipfile.lock (02cd654209c8f6373a9812d72a6e911b97765ff86013ea787cebbef2b80b7454)!

Removing virtualenv (/root/.local/share/virtualenvs/working-MLRu3Pvq)...

[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
transformers = "==4.36.2"
tflite-runtime = "==2.14.0"
numpy = "==1.24.3"

[dev-packages]

[requires]
python_version = "3.10"
python_full_version = "3.10.12"


## Preparing Docker Image

Once the Dockerfile below has been created, build the Docker image on top of the recommended public base image for Lambda:

- https://repost.aws/knowledge-center/lambda-container-images

> docker build -t dtc-zoomcamp-q-a-challenge .

To test, first run the image that was built:

> docker run --gpus all -it --rm -p 9000:8080 dtc-zoomcamp-q-a-challenge:latest

In [58]:
%%writefile Dockerfile
# public image for Lambda
FROM public.ecr.aws/lambda/python:3.10

# Optional alternative: install the pinned dependencies from requirements.txt
# COPY ["requirements.txt", "./"]
# RUN pip install -r requirements.txt

# Install the tflite_runtime wheel built for the Lambda image, plus the other dependencies
RUN pip install --upgrade pip
RUN pip install -U https://github.com/alexeygrigorev/tflite-aws-lambda/raw/main/tflite/tflite_runtime-2.14.0-cp310-cp310-linux_x86_64.whl
RUN pip install -U transformers
RUN pip install -U numpy

# Copy function code and model into the container
COPY ["lambda_function.py", "mpnet-dtc-zoomcamp_tfmodel.tflite", "./"]

# Set the CMD to your handler 
CMD [ "lambda_function.lambda_handler" ]

Writing Dockerfile


In [59]:
# Build and run the image locally
# docker build -t dtc-zoomcamp-q-a-challenge .
# docker run -it --rm -p 9000:8080 dtc-zoomcamp-q-a-challenge:latest

## client_to_docker_test.py created per AWS documentation for testing

Run the file:

> python client_to_docker_test.py

This is the output I received, which shows that "That is a very happy person" was ranked as the most similar answer, which is correct:

```json
{
  "question": "That is a happy person",
  "results": [
    {
      "answer": "That is a happy dog",
      "similarity": "0.760"
    },
    {
      "answer": "That is a very happy person",
      "similarity": "0.971"
    },
    {
      "answer": "Today is a sunny day",
      "similarity": "0.355"
    }
  ]
}
```

In [60]:
%%writefile client_to_docker_test.py
import requests

# curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'
url = 'http://localhost:9000/2015-03-31/functions/function/invocations'

# Create the event with the question and candidate answers
event = {
    "inputs": {
    "question": "That is a happy person",
    "answers" : [
        "That is a happy dog",
        "That is a very happy person",
        "Today is a sunny day"
        ]
    },
}

# Send POST request using requests module
response = requests.post(url, json=event)
# Print the response
print(response.text)

Writing client_to_docker_test.py


In [42]:
# Local test against the running container; this does not always succeed
!python client_to_docker_test.py

{"errorMessage": "Select TensorFlow op(s), included in the given model, is(are) not supported by this interpreter. Make sure you apply/link the Flex delegate before inference. For the Android, it can be resolved by adding \"org.tensorflow:tensorflow-lite-select-tf-ops\" dependency. See instructions: https://www.tensorflow.org/lite/guide/ops_selectNode number 241 (FlexRealDiv) failed to prepare.", "errorType": "RuntimeError", "requestId": "215971ae-cde1-485b-b15c-39ba19645a33", "stackTrace": ["  File \"/var/task/lambda_function.py\", line 56, in lambda_handler\n    result = query(payload)\n", "  File \"/var/task/lambda_function.py\", line 28, in query\n    output = infer(\n", "  File \"/var/lang/lib/python3.10/site-packages/tflite_runtime/interpreter.py\", line 249, in __call__\n    self._interpreter_wrapper.Invoke(self._subgraph_index)\n"]}
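
When the handler raises, the response body is a JSON object with `errorMessage` and `errorType` keys, as in the output above. A client could tell the two cases apart along the lines of the sketch below. The helper `parse_lambda_response` is hypothetical, and the double-decoding step is an assumption: since `lambda_handler` returns a string produced by `json.dumps`, a successful body may arrive as a JSON string that itself contains JSON.

```python
import json

def parse_lambda_response(body):
    """Return (ok, payload) for a response body from the local Lambda endpoint."""
    payload = json.loads(body)
    # Assumption: a handler that returns a JSON string may be encoded twice.
    if isinstance(payload, str):
        payload = json.loads(payload)
    if isinstance(payload, dict) and "errorMessage" in payload:
        return False, payload
    return True, payload

error_body = '{"errorMessage": "Node number 241 (FlexRealDiv) failed to prepare.", "errorType": "RuntimeError"}'
ok, payload = parse_lambda_response(error_body)
print(ok, payload["errorType"])
```

In `client_to_docker_test.py`, `response.text` would take the place of `error_body`.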


## Docker Hub

In [62]:
# Tag the existing image as username/repository:tag
# docker tag dtc-zoomcamp-q-a-challenge:latest developerhost/dtc-zoomcamp-q-a-challenge:latest

# Push the newly tagged image to Docker Hub:
# docker push developerhost/dtc-zoomcamp-q-a-challenge:latest

# you can pull the image:
# docker pull developerhost/dtc-zoomcamp-q-a-challenge:latest

## End of the Project