In [22]:
!pip install -q -U accelerate bitsandbytes langchain langchain-community sentence-transformers ragatouille faiss-gpu rank_bm25
# ! pip install -q -U beautifulsoup4 # Install beautifulsoup4 if you are running the notebook not in Kaggle
!pip install -q -U keras-nlp
!pip install -q -U "keras>=3"  # Quote the specifier; unquoted, the shell treats ">3" as an output redirect

In [23]:
import os

# The backend must be selected before Keras is imported; setting it afterwards has no effect.
os.environ["KERAS_BACKEND"] = "torch"  # Or "jax" or "tensorflow".
#os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"]="1.00" # Avoid memory fragmentation on JAX backend.

import keras
import keras_nlp
import pandas as pd

from bs4 import BeautifulSoup
from typing import Optional, List, Tuple
from IPython.display import display, Markdown

from transformers import AutoTokenizer
from ragatouille import RAGPretrainedModel
from langchain.docstore.document import Document
from langchain.prompts.prompt import PromptTemplate
from langchain_core.runnables import ConfigurableField
from langchain_community.vectorstores import FAISS, Chroma
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DataFrameLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy

In [24]:


data = pd.read_csv('/kaggle/input/kaggle-solutions-methods/kaggle_winning_solutions_methods.csv')
data.head()



Unnamed: 0,link,place,competition_name,prize,team,kind,metric,year,nm,writeup,num_tokens,methods,cleaned_methods
0,https://www.kaggle.com/c/asl-signs/discussion/...,2,Google - Isolated Sign Language Recognition,"$100,000",1165,Research,PostProcessorKernelDesc,2023,406306,<h2>TLDR</h2>\n<p>We used an approach similar ...,2914,"['EfficientNet-B0', 'Data Augmentation', 'Norm...",Replace augmentation
1,https://www.kaggle.com/c/asl-signs/discussion/...,2,Google - Isolated Sign Language Recognition,"$100,000",1165,Research,PostProcessorKernelDesc,2023,406306,<h2>TLDR</h2>\n<p>We used an approach similar ...,2914,"['EfficientNet-B0', 'Data Augmentation', 'Norm...",Finger tree rotate
2,https://www.kaggle.com/c/asl-signs/discussion/...,2,Google - Isolated Sign Language Recognition,"$100,000",1165,Research,PostProcessorKernelDesc,2023,406306,<h2>TLDR</h2>\n<p>We used an approach similar ...,2914,"['EfficientNet-B0', 'Data Augmentation', 'Norm...",Data Augmentation
3,https://www.kaggle.com/c/asl-signs/discussion/...,2,Google - Isolated Sign Language Recognition,"$100,000",1165,Research,PostProcessorKernelDesc,2023,406306,<h2>TLDR</h2>\n<p>We used an approach similar ...,2914,"['EfficientNet-B0', 'Data Augmentation', 'Norm...",Onecycle scheduler
4,https://www.kaggle.com/c/asl-signs/discussion/...,2,Google - Isolated Sign Language Recognition,"$100,000",1165,Research,PostProcessorKernelDesc,2023,406306,<h2>TLDR</h2>\n<p>We used an approach similar ...,2914,"['EfficientNet-B0', 'Data Augmentation', 'Norm...",Flip pose


In [25]:


data['writeup'][42]



'<p>Here is a quick overview of the 5th-place solution.</p>\n<ol>\n<li><p><strong>we applied various augmentations like flip, concatenation, etc</strong><br>\n1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -&gt; 0.78)</p></li>\n<li><p><strong>the model is only a transformer model based on the public kernels</strong><br>\n2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78-&gt;0.8) in public LB.<br>\n2.1.1. 3 layers of transformer with the embedding size 480.</p></li>\n<li><p><strong>Preprocessing by mean and std of single sign sequence</strong><br>\n3.1. the preprocessing does affect the final performance. <br>\n3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.</p></li>\n<li><p><strong>Feature engineering like distances between points</strong><br>\n4.1. we selected and used around 106 p

In [26]:


%%time

def clean_html(html_content):
    """Function to clean up HTML tags in each writeup"""
    soup = BeautifulSoup(html_content, 'html.parser')
    # Use '\n' as a separator to preserve the structure of the various parts
    text = soup.get_text(separator='\n', strip=True)
    return text

data['writeup'] = data['writeup'].apply(clean_html) # This might take a while



CPU times: user 30 s, sys: 109 ms, total: 30.1 s
Wall time: 30.1 s


In [27]:
print(data['writeup'][42])

Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.
Feature engineering like distances between points
4.1. we selected and used around 106 points (as the public notebook by Heck).
4.2. distances withinpoints of hands/nose/eyes/… are calculated.
some methods to prevent overfitting like awp, random mask of frames, ema,

In [28]:


data['LLM_context'] = (
    "Competition Name: " + data['competition_name'] +
    ",\nPlace: " + data['place'].astype(str) +
    ",\nMethods Used: " + data['methods'] +
    ",\nSolution: " + data['writeup']
)

print(data['LLM_context'][42])



Competition Name: Google - Isolated Sign Language Recognition,
Place: 5,
Methods Used: ['Augmentation', 'Transformer model', 'Preprocessing', 'Feature engineering', 'Overfitting prevention'],
Solution: Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.
Feature engineering like distances between points
4.1. we select

In [29]:
data = data.drop("writeup", axis=1) # We remove 'writeup' column as it is already in LLM_context

In [30]:
loader = DataFrameLoader(data, page_content_column="LLM_context")
docs = loader.load()
docs_subset = docs[:1500] # Part of the data is used to reduce execution time.

In [31]:
print("-----------PAGE CONTENT-----------")
print(docs_subset[42].page_content)
print("\n\n-----------METADATA-----------\n")
print(docs_subset[42].metadata)

-----------PAGE CONTENT-----------
Competition Name: Google - Isolated Sign Language Recognition,
Place: 5,
Methods Used: ['Augmentation', 'Transformer model', 'Preprocessing', 'Feature engineering', 'Overfitting prevention'],
Solution: Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.
Feature engineering like dist

In [32]:
EMBEDDING_MODEL_NAME = "BAAI/bge-base-en-v1.5"
CHUNK_SIZE = 512 # We choose a chunk size adapted to our model

In [33]:
%%time

def split_documents(
    chunk_size: int,
    knowledge_base: List[Document],
    tokenizer_name: Optional[str] = EMBEDDING_MODEL_NAME,
) -> List[Document]:
    """
    Split documents into chunks of maximum size `chunk_size` tokens and return a list of documents.
    """
    
    text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        AutoTokenizer.from_pretrained(tokenizer_name),
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size / 10),
        add_start_index=True,
        strip_whitespace=True,
    )

    docs_processed = []
    for doc in knowledge_base:
        docs_processed += text_splitter.split_documents([doc])

    # Remove duplicates
    unique_texts = {}
    docs_processed_unique = []
    for doc in docs_processed:
        if doc.page_content not in unique_texts:
            unique_texts[doc.page_content] = True
            docs_processed_unique.append(doc)

    return docs_processed_unique

chunked_docs = split_documents(
    CHUNK_SIZE,  
    docs_subset,
    tokenizer_name=EMBEDDING_MODEL_NAME,
)

Token indices sequence length is longer than the specified maximum sequence length for this model (956 > 512). Running this sequence through the model will result in indexing errors


CPU times: user 31.7 s, sys: 12.8 ms, total: 31.7 s
Wall time: 31.8 s
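
The splitter above measures chunk length in tokens and lets consecutive chunks overlap by a tenth of `chunk_size`. The sliding-window idea behind that can be sketched in plain Python. This is only an illustration: whitespace-separated words stand in for tokenizer tokens, `chunk_words` and `deduplicate` are hypothetical helpers rather than LangChain functions, and the real `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as paragraphs and newlines.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split `text` into windows of at most `chunk_size` words.

    Consecutive windows share `overlap` words. Words stand in for
    tokenizer tokens here (an assumption made for illustration).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


def deduplicate(chunks: List[str]) -> List[str]:
    """Drop repeated chunks while keeping first-seen order."""
    return list(dict.fromkeys(chunks))


# Toy input: 25 "tokens", windows of 10 with an overlap of 1 (10%).
toy_text = " ".join(f"w{i}" for i in range(25))
toy_chunks = chunk_words(toy_text, chunk_size=10, overlap=1)
for chunk in toy_chunks:
    print(chunk)
```

The overlap means the last word of one window is repeated as the first word of the next, so a sentence cut at a chunk boundary still appears with some of its context in at least one chunk.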


In [34]:
%%time

embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    multi_process=False,
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},  # set True for cosine similarity
)


CPU times: user 250 ms, sys: 98.2 ms, total: 348 ms
Wall time: 826 ms
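
Setting `normalize_embeddings` to `True` scales every embedding to unit length, and for unit-length vectors the inner product equals the cosine similarity. A small check with made-up 2-dimensional vectors (real embeddings from this model are much longer; the helper names are illustrative only):

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale `vec` to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between `a` and `b`."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

cos_ab = cosine_similarity(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))

print(cos_ab, dot_normalized)  # the two values agree up to floating-point error
```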


In [35]:
num_docs = 5 # Default number of documents to retrieve

bm25_retriever = BM25Retriever.from_documents(
    chunked_docs
    ).configurable_fields(
    k=ConfigurableField(
        id="search_kwargs_bm25",
        name="k",
        description="Number of documents to retrieve",
    )
)

faiss_vectorstore = FAISS.from_documents(
    chunked_docs, embedding_model, distance_strategy=DistanceStrategy.COSINE
)

faiss_retriever = faiss_vectorstore.as_retriever(
    search_kwargs={"k": num_docs}
    ).configurable_fields(
    search_kwargs=ConfigurableField(
        id="search_kwargs_faiss",
        name="Search Kwargs",
        description="The search kwargs to use",
    )
)

# initialize the ensemble retriever
vector_database = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5] # You can adjust the weight of each retriever in the EnsembleRetriever
)
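
The `EnsembleRetriever` merges the BM25 and FAISS result lists by rank rather than by raw score, using weighted Reciprocal Rank Fusion (RRF). The sketch below shows the idea on toy document ids. The function name, the ids, and the constant `c = 60` (a commonly used default) are assumptions for illustration, not a copy of the library code.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(
    rankings: Sequence[Sequence[str]],
    weights: Sequence[float],
    c: int = 60,
) -> List[str]:
    """Fuse several ranked lists of document ids.

    Each document receives weight / (rank + c) from every list it
    appears in (rank starts at 1); documents are returned by total score.
    """
    if len(rankings) != len(weights):
        raise ValueError("one weight is required per ranking")
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Toy rankings standing in for the BM25 and dense (FAISS) retrievers.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.5, 0.5])
print(fused)
```

Documents found by both retrievers accumulate two contributions, so they tend to rise above documents that only one retriever returned. Raising one weight shifts the fused order toward that retriever's ranking.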

In [36]:
print(data.iloc[42, :])

link                https://www.kaggle.com/c/asl-signs/discussion/...
place                                                               5
competition_name          Google - Isolated Sign Language Recognition
prize                                                        $100,000
team                                                            1,165
kind                                                         Research
metric                                        PostProcessorKernelDesc
year                                                             2023
nm                                                             406491
num_tokens                                                        473
methods             ['Augmentation', 'Transformer model', 'Preproc...
cleaned_methods                                       Post-processing
LLM_context         Competition Name: Google - Isolated Sign Langu...
Name: 42, dtype: object


In [37]:
user_query = """
I want to understand the 5th-place solution in the 'Google - Isolated Sign Language Recognition' competition. 
What overfitting prevention techniques were used, and how did they ensure model robustness?
"""
config = {"configurable": {"search_kwargs_faiss": {"k": 5}, "search_kwargs_bm25": 5}}
retrieved_docs = vector_database.invoke(user_query, config=config)
print("----------------------Top document content----------------------")
print(retrieved_docs[0].page_content)
print("----------------------Top document metadata----------------------")
print(retrieved_docs[0].metadata)

----------------------Top document content----------------------
Competition Name: Google - Isolated Sign Language Recognition,
Place: 5,
Methods Used: ['Augmentation', 'Transformer model', 'Preprocessing', 'Feature engineering', 'Overfitting prevention'],
Solution: Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.

In [38]:
reranker = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")



In [39]:


page_contents = [doc.page_content for doc in retrieved_docs]  # keep only the text
relevant_docs = reranker.rerank(user_query, page_contents, k=5)
relevant_docs = [doc["content"] for doc in relevant_docs]



100%|██████████| 1/1 [00:00<00:00,  6.89it/s]


In [40]:


print(relevant_docs[0])



Competition Name: Google - Isolated Sign Language Recognition,
Place: 5,
Methods Used: ['Augmentation', 'Transformer model', 'Preprocessing', 'Feature engineering', 'Overfitting prevention'],
Solution: Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.
Feature engineering like distances between points
4.1. we select
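
ColBERT-style rerankers score a document by late interaction: every query token embedding is matched to its most similar document token embedding, and those maxima are summed (often called MaxSim). The sketch below shows that scoring rule on made-up 2-dimensional token vectors. The vectors, the document names, and the helper functions are assumptions for illustration; the actual model computes its own per-token embeddings.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rerank_by_maxsim(
    query_vecs: Sequence[Vector],
    docs: Dict[str, Sequence[Vector]],
    k: int,
) -> List[str]:
    """Return the names of the `k` documents with the highest MaxSim score."""
    ranked = sorted(docs, key=lambda name: maxsim_score(query_vecs, docs[name]), reverse=True)
    return ranked[:k]


# Two query "tokens" and three toy documents (hypothetical embeddings).
query = [[1.0, 0.0], [0.0, 1.0]]
toy_docs = {
    "doc_x": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
    "doc_y": [[1.0, 0.0], [1.0, 0.0]],  # matches only the first query token
    "doc_z": [[0.6, 0.8]],              # partially matches both
}

top_docs = rerank_by_maxsim(query, toy_docs, k=2)
print(top_docs)
```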

In [41]:
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_instruct_2b_en")

In [42]:


%%time
display(Markdown(gemma_lm.generate("Hi, what can you tell me about Kaggle competitions?", max_length=256)))



Hi, what can you tell me about Kaggle competitions?

**What are Kaggle competitions?**

Kaggle competitions are a platform where data scientists and machine learning engineers can participate in a wide range of data science and machine learning challenges. These competitions offer a unique opportunity to learn from experts, solve real-world problems, and potentially win prizes.

**Key features of Kaggle competitions:**

* **Real-world datasets:** Competitions typically use real-world datasets that are relevant to various industries and domains.
* **Multiple data modalities:** Competitions allow participants to submit solutions for various data modalities, including images, text, and time series.
* **Various challenge levels:** Competitions offer different challenge levels to cater to different skill sets and experience levels.
* **Community engagement:** Kaggle provides a vibrant community where participants can interact, share knowledge, and collaborate on solutions.
* **Prizes and recognition:** Winners of Kaggle competitions receive significant prizes and recognition, including cash, prizes, and public acclaim.

**Benefits of participating in Kaggle competitions:**

* **Learn from industry experts:** Solve real-world problems and gain insights from data science and machine learning experts.
* **Boost your resume:** Winning a Kaggle competition can significantly enhance your

CPU times: user 23.9 s, sys: 605 ms, total: 24.5 s
Wall time: 24.6 s


In [43]:
prompt_template = """
Based on your extensive knowledge and the following detailed context, 
please provide a comprehensive answer to explain concepts from Kaggle competition solution write-ups:

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""

RAG_PROMPT_TEMPLATE = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)


In [44]:
def answer_with_rag(
    question: str,
    llm,
    knowledge_index: EnsembleRetriever,
    reranker: Optional[RAGPretrainedModel] = None,
    num_retrieved_docs: int = 10,
    num_docs_final: int = 5,
) -> Tuple[str, List[Document]]:
    # Gather documents with retriever
    print("=> Retrieving documents...")
    config = {"configurable": {"search_kwargs_faiss": {"k": num_retrieved_docs}, "search_kwargs_bm25": num_retrieved_docs}}
    relevant_docs = knowledge_index.invoke(question, config=config)
    relevant_docs = [doc.page_content for doc in relevant_docs]  # keep only the text
    
    # Optionally rerank results
    if reranker:
        print("=> Reranking documents...")
        relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)
        relevant_docs = [doc["content"] for doc in relevant_docs]
        
    relevant_docs = relevant_docs[:num_docs_final] # Keeping only num_docs_final documents

    # Build the final prompt
    context = relevant_docs[0] # We select only the top relevant document
    
    final_prompt = RAG_PROMPT_TEMPLATE.format(
        context = context,  
        question=question
    )

    # Generate an answer
    print("=> Generating answer...")
    answer = llm.generate(final_prompt, max_length=1024)

    return answer, relevant_docs
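
`answer_with_rag` passes only the top-ranked document to the prompt. If more context is wanted, one option is to join several reranked documents, labelled by rank, while staying within a size budget. The sketch below is a possible variant, not part of the original pipeline: `build_context`, `max_docs`, and `max_chars` are hypothetical, and the character budget is only a crude stand-in for the model's token limit.

```python
from typing import List


def build_context(docs: List[str], max_docs: int = 3, max_chars: int = 4000) -> str:
    """Join up to `max_docs` documents, labelled by rank, within a character budget.

    The top document is always kept; later documents are added only while
    the running total stays within `max_chars`.
    """
    parts: List[str] = []
    used = 0
    for i, doc in enumerate(docs[:max_docs]):
        block = f"Document {i}:\n{doc}"
        if parts and used + len(block) > max_chars:
            break  # budget exhausted, stop adding lower-ranked documents
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


toy_docs = ["alpha", "beta", "gamma", "delta"]
toy_context = build_context(toy_docs, max_docs=2)
print(toy_context)
```

The returned string could then be passed as `context` when formatting the prompt template. A longer context leaves fewer tokens for the answer under a fixed `max_length`, so the budget has to be chosen with that limit in mind.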

In [45]:
%%time
question = """I want to understand the 5th-place solution in the 'Google - Isolated Sign Language Recognition' competition. 
What overfitting prevention techniques were used, and how did they ensure model robustness?
"""
answer, relevant_docs = answer_with_rag(question, gemma_lm, vector_database, reranker)



=> Retrieving documents...
=> Reranking documents...


100%|██████████| 1/1 [00:00<00:00,  4.29it/s]


=> Generating answer...
CPU times: user 27.5 s, sys: 1.22 s, total: 28.7 s
Wall time: 28.7 s
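
As the greeting example earlier shows, `generate` returns the prompt followed by the completion, so the answer is whatever comes after the final `ANSWER:` marker of the prompt template. Splitting on the first occurrence of the marker can go wrong if the retrieved context happens to contain the same string. The sketch below strips the exact prompt when the output starts with it and otherwise falls back to the text after the last marker. `extract_completion` is a hypothetical helper, and the toy strings are made up.

```python
def extract_completion(generated: str, prompt: str, marker: str = "ANSWER:") -> str:
    """Return the text the model produced after the prompt.

    Prefer removing the exact prompt prefix; if the output does not start
    with the prompt, fall back to the text after the last `marker`.
    Returns an empty string when neither approach applies.
    """
    if generated.startswith(prompt):
        return generated[len(prompt):].strip()
    _, found, tail = generated.rpartition(marker)
    return tail.strip() if found else ""


# Toy prompt whose context contains the marker string itself.
toy_prompt = "CONTEXT:\nsome text with ANSWER: inside\n\nQUESTION:\nWhy?\n\nANSWER:\n"
toy_output = toy_prompt + "Because of EMA."

print(extract_completion(toy_output, toy_prompt))
```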


In [46]:
def get_gemma_answer(generated_answer: str) -> str:
    """Function to get Gemma answer"""
    split = generated_answer.split("ANSWER:")
    return split[1] if len(split) > 1 else "No answer has been generated"

display(Markdown("### Gemma Answer"))
display(Markdown(get_gemma_answer(answer)))
display(Markdown("### Source docs"))
for i, doc in enumerate(relevant_docs):
    display(Markdown(f"**Document {i}------------------------------------------------------------**"))
    display(Markdown(doc))

### Gemma Answer


**Overfitting prevention techniques used in the 5th-place solution:**

* **Random masking of frames:** This technique randomly selects a subset of frames from the training data and trains the model on this subset. This helps to prevent the model from overfitting to the specific training data and improves its generalizability.
* **Early stopping:** This technique stops training the model when it reaches a certain number of epochs or when the validation loss starts to increase. This helps to prevent the model from overfitting to the training data and improves its generalizability.
* **Data augmentation:** This technique is used to increase the size of the training dataset and to introduce diversity into the training data. This helps to prevent the model from overfitting to the training data and improves its generalizability.
* **Mean and standard deviation of the single sign sequence:** This technique is used to pre-process the training data and to improve the performance of the model.

**How these techniques ensured model robustness:**

* **Random masking of frames:** This technique helped to prevent the model from overfitting to the specific training data by exposing it to a wide range of images.
* **Early stopping:** This technique helped to prevent the model from overfitting to the training data by stopping training when it reached a certain number of epochs.
* **Data augmentation:** This technique helped to increase the size of the training dataset and to introduce diversity into the training data. This helped to prevent the model from overfitting to the training data and improved its generalizability.
* **Mean and standard deviation of the single sign sequence:** This technique helped to improve the performance of the model by reducing overfitting and by introducing diversity into the training data.

### Source docs

**Document 0------------------------------------------------------------**

Competition Name: Google - Isolated Sign Language Recognition,
Place: 5,
Methods Used: ['Augmentation', 'Transformer model', 'Preprocessing', 'Feature engineering', 'Overfitting prevention'],
Solution: Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.
Feature engineering like distances between points
4.1. we selected and used around 106 points (as the public notebook by Heck).
4.2. distances withinpoints of hands/nose/eyes/… are calculated.
some methods to prevent overfitting like awp, random mask of frames, ema, etc …
many thanks to my teammates
@qiaoshiji
@zengzhaoyang
The source code for training models can be found here :
https://github.com/zhouyuanzhe/kaggleasl5thplacesolution

**Document 1------------------------------------------------------------**

Competition Name: Google - Isolated Sign Language Recognition,
Place: 8,
Methods Used: ['Transformer models', 'FFN encoder', 'Cosine schedule', 'Dropout', 'Label smoothing', 'Sequence cutout augmentation', 'Mirror left augmentation', 'Random rotate augmentation', 'Linear interpolation', 'Min-max normalization', 'Mean/std normalization', 'Time shift delta features', 'Angle features', 'Point to point distances', 'Tflite conversion', 'Speed up with model.half().float()', 'Normalizing points across the whole sequence', 'Mixup (tried but did not work)', 'CNNs with mixup (tried but did not work)'],
Solution: Here is a quick overview of the 8th place solution.
3 transformers models, 2 layers each (384 hidden, 512 hidden ffn), with an ffn encoder (512->384), trained from scratch. LR 8e-4 with cosine schedule trained for ~300 epochs, dropout 0.1, batch size 1024, label smoothing 0.1. Using hands, lips and pose (above waist only). On one transformer all pose and a subset of lips were used for diversity.
Augmentations
most important was sequence cutout. On each sample, and each body part (left hand, right hand, lips, pose) with a 0.4 proba convert to nan 5 random slices of 0.15 x SequenceLength. It was hard to overfit with this in.
mirror left
random rotate.
Preprocessing
Linear interpolation of longer sequences to max length of 96.
Normalise each body part, using min max - I found this better than mean/std. In one model I used mean/std for diversity.
Create time shift delta features on a subset of points, using time shifts of
[1, 2, 3, 4, 6, 8, 12, 16]

**Document 2------------------------------------------------------------**

Competition Name: Google - Isolated Sign Language Recognition,
Place: 6,
Methods Used: ['MLP', 'Encoder', 'Transformer', 'Convolutional Neural Network (CNN)', 'Data Augmentation', 'Cross Entropy Loss', 'Weight Decay', 'Mean Teacher', 'Knowledge Distillation', 'Ensemble Learning', 'Stratified K-fold', 'Baseline Model', 'Deberta', 'Max Pooling', 'Normalization', 'Interpolation', 'Manifold Mixup', 'Face CutMix', 'Outlier Sample Mining (OUSM)', 'Model Soup', 'Data Relabeling', 'Data Truncation', 'Mish Activation Function'],
Solution: Thanks to both, the organizers of this competition who offered a fun yet challenging problem as well as all of the other competitors - well done to everyone who worked hard for small incremental increases.
Although I am the one posting the topic, this is the result of a great team effort, so big shoutout to
@christofhenkel
.
Brief Summary
Our solution is a 2 model ensemble of a MLP-encoder-frame-transformer model. We pushed our transformer models close to the limit and implemented a lot of tricks to climb up to 6th place.
I have 1403 hours of experiment monitoring time in April (that’s 48h per day :)).
Update :
Code is available here :
https://github.com/TheoViel/kaggle_islr
Detailed Summary
Preprocessing & Model
Preprocessing
Remove frames without fingers
Stride the sequence (use 1 every n frames) such that the sequence size is
<= max_len
. We used
max_len=25
and
80
in the final ensemble

**Document 3------------------------------------------------------------**

Competition Name: Google - Isolated Sign Language Recognition,
Place: 26,
Methods Used: ['Mixup', 'Mirroring', 'LLaMa-inspired architecture', 'RMSNorm normalization', 'Lion optimizer', 'Cosine decay learning rate', 'Batch size 128', 'Dropout 0.1', 'Exponential moving average of weights'],
Solution: Github with all the code used
Summary
The most important part of the solution is the data utilization. Major improvements were from keypoints choice and mixup. External data does not help because it is from a very different distribution. Given data amount does not benefit larger models so ensembles of small models is the way to utilize given constraints to the fullest.
Most augmentations are not helpful, because they prevent model from learning the true data distribution. So only used mirroring and mixup (0.5).
Inputs to the model
All models are trained to support sequences of up to 512 frames.
Preprocessing
Only 2d coordinates are used as 3rd dimension leads to unstable training.
To normalize inputs all keypoints are shifted so that head is located at the origin.
Scaling did not provide any benefit so not used.
All nans are replaced with 0 after normalization.
Chosen keypoints
All (21) hand keypoints
26 face keypoints
17 pose keypoints
Architecture
LLaMa-inspired architecture. Most notable improvement comes from much better normalization RMSNorm.
For all models head dimensions are set to 64
Single model (Private/Public LB: 0.8543689/0.7702471)
6 heads 5 layers 9.2M parameters
Ensemble of 3 models (Private/Public LB: 0.8584568/0.7725324)
2 heads 6 layers 1.7M parameters per model
Larger models could be fit into file size limit, but it would time out during submission.
Augmentations

**Document 4------------------------------------------------------------**

Competition Name: Google - Isolated Sign Language Recognition,
Place: 11,
Methods Used: ['Ensemble', 'Strong augmentation', 'Manual model conversion from pytorch to tensorflow', 'CLIP transformer architecture', 'Decrease parameter size', 'Motion features', 'Longer epoch'],
Solution: Thank you to the organizer and Kaggle for hosting this interesting challenge.
Especially I enjoyed this strict inference time restriction. It keeps model size reasonable and requires us for some practical technique.
TL;DR
Ensemble 5 transformer models
Strong augmentation
Manual model conversion from pytroch to tensorflow
Code is available here ->
https://github.com/bamps53/kaggle-asl-11th-place-solution
Overview
I started from
@hengck23
‘s
great discussion
and
notebook
. Thanks for sharing a lot of useful tricks as always!
The changes I made are following;
Change model architecture to CLIP transformer in HuggingFace
Decrease parameter size to maximize latency within the range of same accuracy
Some strong augmentations
Horizontal flip(p=0.5)
Random 3d rotation(p=1, -45~45)
Random scale(p=1, 0.5~1.5)
Random shift(p=1, 0.7~1.3)
Random mask frames(p=1, mask_ratio=0.5)
Random resize (p=1, 0.5~1.5)
Add motion features
current - prev
next - current
Velocity
Longer epoch, 250 for 5 fold and 300 for all data
For the details, please refer to the code.(planning to upload)
Model conversion