Setup and important aspects:
- Gemma 2B and 7B 
- HuggingFace
- LangChain

In [1]:

from transformers import pipeline, set_seed
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from accelerate.utils import release_memory
import torch 
# from kaggle_secrets import UserSecretsClient
# from huggingface_hub import login 
import huggingface_hub
from datasets import Dataset
from peft import LoraConfig, PeftModel
import pandas as pd 
import langchain 
from langchain.text_splitter import CharacterTextSplitter, HTMLHeaderTextSplitter
from langchain.docstore.document import Document 
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
import evaluate 
import transformers
from langchain.llms.base import LLM
from typing import Any 
import warnings 
import gc 
import random 
import numpy as np 

warnings.filterwarnings("ignore")
from dotenv import load_dotenv
import os
# Load the .env file
load_dotenv()
# Get the secret key
huggingface_key = os.getenv("HUGGINGFACE_SECRET_KEY")


# Set seed
set_seed(42)
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)


# read writeups dataset 
writeups = pd.read_csv("kaggle-winning-solutions-methods/kaggle_winning_solutions_methods.csv")
writeups = writeups.drop_duplicates(subset = ['link', 'writeup']).reset_index(drop = True)

huggingface_hub.login(token=huggingface_key)

model_huggingface_path = "./google_model/gemma_2b_it"
pipe = transformers.pipeline(
    'text-generation',
    model=model_huggingface_path,
    model_kwargs={"torch_dtype": torch.float16},
    device='cuda',
    max_new_tokens=512,
)



  from .autonotebook import tqdm as notebook_tqdm
2024-04-12 21:14:34.084201: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.


Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.


Token is valid (permission: read).
Your token has been saved to /home/xuananh/.cache/huggingface/token
Login successful


Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.75it/s]


Pipelines provide an efficient and user-friendly way to use models for inference. They consist of:
- A tokenizer, which, if not explicitly specified, is loaded automatically from the model configuration on HuggingFace.
- The model itself
- Parameters for controlling the generated output

In the code above, one crucial parameter has been configured:

- `max_new_tokens` - controls the **maximum number of newly generated tokens**. If not specified, the default value may be too small to generate enough text for a complete summary.

In [2]:
# Import the first writeup from the dataset and inspect the first 1000 chars
writeup_small = writeups.iloc[0, 9]  # row 0, column 9 (the `writeup` column)
print('Number of characters:', len(writeup_small))
writeup_small[:1000]

Number of characters: 9864


'<h2>TLDR</h2>\n<p>We used an approach similar to audio spectrogram classification using the EfficientNet-B0 model, with numerous augmentations and transformer models such as BERT and DeBERTa as helper models. The final solution consists of one EfficientNet-B0 with an input size of 160x80, trained on a single fold from 8 randomly split folds, as well as DeBERTa and BERT trained on the full dataset. A single fold model using EfficientNet has a CV score of 0.898 and a leaderboard score of ~0.8.</p>\n<p>We used only competition data.</p>\n<h2>1. Data Preprocessing</h2>\n<h3>1.1 CNN Preprocessing</h3>\n<ul>\n<li>We extracted 18 lip points, 20 pose points (including arms, shoulders, eyebrows, and nose), and all hand points, resulting in a total of 80 points.</li>\n<li>During training, we applied various augmentations.</li>\n<li>We implemented standard normalization.</li>\n<li>Instead of dropping NaN values, we filled them with zeros after normalization.</li>\n<li>We interpolated the time ax

In [3]:
writeups.head(3)

Unnamed: 0,link,place,competition_name,prize,team,kind,metric,year,nm,writeup,num_tokens,methods,cleaned_methods
0,https://www.kaggle.com/c/asl-signs/discussion/...,2,Google - Isolated Sign Language Recognition,"$100,000",1165,Research,PostProcessorKernelDesc,2023,406306,<h2>TLDR</h2>\n<p>We used an approach similar ...,2914,"['EfficientNet-B0', 'Data Augmentation', 'Norm...",Replace augmentation
1,https://www.kaggle.com/c/asl-signs/discussion/...,3,Google - Isolated Sign Language Recognition,"$100,000",1165,Research,PostProcessorKernelDesc,2023,406568,<p>We used an <strong>ensemble of six conv1d m...,1744,"['Conv1D', 'Transformers', 'Data preprocessing...",Conv1D
2,https://www.kaggle.com/c/asl-signs/discussion/...,4,Google - Isolated Sign Language Recognition,"$100,000",1165,Research,PostProcessorKernelDesc,2023,406673,<p>I would like to thank the organizers and al...,2189,"['XY coordinates', 'Normalization', 'Flip', '1...",Max pooling


In [4]:
messages = [
    {
        "role": "user",
        "content": "Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:\n\n{}".format(writeup_small)
    }
]

prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_k=20,
    top_p=0.3,
    add_special_tokens=True
)
print(outputs[0]["generated_text"][len(prompt):])

Sure, here's a summary of the text in a technical way:

**1. Data Preprocessing**

* Extract 80 points from the image, including lip and pose points.
* Apply various augmentations and normalizations.
* Fill NaN values with zeros and use nearest interpolation for the time axis.

**2. Augmentation**

* Use common and CNN specific augmentations.
* Implement a mixup augmentation that only works with CNNs.

**3. Training**

* Train EfficientNet-B0 and BERT on a single fold with 0.1 warm-up.
* Train a transformer model with a ranger optimizer and 4-layer transformer.
* Tune hyperparameters with Optuna.

**4. Submissions**

* Aggregate models in a tf.Module.
* Calculate ensemble weights for fold 0 and apply to the full dataset.

**5. PS. Need BETTER TFlite DepthwiseConv2D**

* Explore different ways to implement depthwise convolution in tflite.
* Experiment with different FLOP configurations.

**6. Conclusion**

* EfficientNet-B0 achieved a leaderboard score of 0.8.
* Transformers improved th

# Chat template

Check how a test conversation is rendered by the chat template.

In [5]:
test_messages = [
    {"role": "user",
     "content": "This is a test"},
    {"role": "assistant",
     "content": "Good for you!"},
    {"role": "user",
     "content": "Ah ah"},
]

test_prompt = pipe.tokenizer.apply_chat_template(test_messages, tokenize=False, add_generation_prompt=True)
print(test_prompt)

<start_of_turn>user
This is a test<end_of_turn>
<start_of_turn>model
Good for you!<end_of_turn>
<start_of_turn>user
Ah ah<end_of_turn>
<start_of_turn>model
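
To make the template explicit, here is a minimal pure-Python sketch that reproduces the string printed above. The function name `format_gemma_chat` is hypothetical, and the real template shipped with the tokenizer is written in Jinja and may differ in details (for example, some versions prepend a `<bos>` token), so treat this as an illustration rather than the library's implementation.

```python
def format_gemma_chat(messages, add_generation_prompt=True):
    """Render chat messages in the turn format shown in the output above (illustrative only)."""
    rendered = ""
    for message in messages:
        # The template uses the role name "model" where the messages say "assistant".
        role = "model" if message["role"] == "assistant" else message["role"]
        rendered += f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a model turn so that generation continues as the model's answer.
        rendered += "<start_of_turn>model\n"
    return rendered


test_messages = [
    {"role": "user", "content": "This is a test"},
    {"role": "assistant", "content": "Good for you!"},
    {"role": "user", "content": "Ah ah"},
]
print(format_gemma_chat(test_messages))
```

With `add_generation_prompt=False` the trailing `<start_of_turn>model` line is omitted, which is useful when the conversation is only being inspected and not sent for generation.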



In [2]:
messages_eli5 = [
    {"role": "user",
     "content": "Summarize the following text while avoiding difficult jargon using bullet points chapters. Explain it like I am a 5 years old:\n\n{}".format(writeup_small)},
]

prompt_eli5 = pipe.tokenizer.apply_chat_template(messages_eli5, tokenize=False, add_generation_prompt=True)
outputs_eli5 = pipe(
    prompt_eli5,
    add_special_tokens=True,
    do_sample=True,
    temperature=0.1,
    top_k=20,
    top_p=0.3
)
print(outputs_eli5[0]["generated_text"][len(prompt_eli5):])

NameError: name 'writeup_small' is not defined

In [7]:
messages_few_shot = [
    {"role": "user",
     "content": "This film was great, rich of details and with great actors."},
    {"role": "assistant",
     "content": "SENTIMENT: Positive.\nSUBJECT: Film"},
    {"role": "user",
     "content": "This park is dirty."},
    {"role": "assistant",
     "content": "SENTIMENT: Negative.\nSUBJECT: Park"},
    {"role": "user",
     "content": "This notebook is fantastic. I'm learning a lot"},
]

prompt_few_shot = pipe.tokenizer.apply_chat_template(messages_few_shot, tokenize=False, add_generation_prompt=True)
outputs_few_shot = pipe(
    prompt_few_shot,
    add_special_tokens=True,
    do_sample=True,
    temperature=0.1,
    top_k=20,
    top_p=0.3
)
print(outputs_few_shot[0]["generated_text"][len(prompt_few_shot):])

SENTIMENT: Positive.
SUBJECT: Notebook<end_of_turn>


In [8]:
prompt_few_shot = pipe.tokenizer.apply_chat_template(messages_few_shot, 
                                                     tokenize=False, 
                                                     add_generation_prompt=False)

print(prompt_few_shot)


<start_of_turn>user
This film was great, rich of details and with great actors.<end_of_turn>
<start_of_turn>model
SENTIMENT: Positive.
SUBJECT: Film<end_of_turn>
<start_of_turn>user
This park is dirty.<end_of_turn>
<start_of_turn>model
SENTIMENT: Negative.
SUBJECT: Park<end_of_turn>
<start_of_turn>user
This notebook is fantastic. I'm learning a lot<end_of_turn>



# Pipeline parameters

Parameters of the transformers pipeline:
- `do_sample`: if set to True, the model samples the next token from the probability distribution over the entire vocabulary; if set to False, it uses greedy decoding. Together with `num_beams`, it selects among the different decoding strategies. I opted for True and `num_beams=1` (the default), which is multinomial sampling.
- `temperature`: this parameter controls randomness. The lower the temperature, the more deterministic the results, in the sense that the most probable token is picked. I opted for a very low value because we need to encourage factual responses rather than creative ones.
- `top_p`: controls the sampling of tokens: only the most probable tokens whose cumulative probability reaches `top_p` are kept. Higher values allow more tokens to be sampled, including less likely ones. We opted for a relatively low value to maintain coherence given the task at hand.
- `top_k`: in simple terms, together with `top_p`, it controls the number of tokens to keep for prediction: only the `top_k` most probable tokens are considered. Once again, a low value favours less creative responses, which is exactly what we are looking for in this notebook.
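
The interplay of these parameters can be sketched on a toy distribution. The code below is a simplified illustration, not the transformers implementation: the function name, the token names and the logit values are all made up, and it only shows which tokens remain eligible for sampling after temperature scaling, `top_k` and `top_p` are applied in that order.

```python
import math


def filter_next_token_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return the renormalized probabilities of the tokens that survive filtering."""
    # Temperature: dividing logits by a value < 1 sharpens the distribution.
    scaled = {token: logit / temperature for token, logit in logits.items()}
    peak = max(scaled.values())
    weights = {token: math.exp(value - peak) for token, value in scaled.items()}
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)

    # top_k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    total = sum(weight for _, weight in ranked)
    ranked = [(token, weight / total) for token, weight in ranked]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Hypothetical logits for four candidate tokens.
toy_logits = {"EfficientNet": 2.0, "BERT": 1.0, "model": 0.0, "banana": -1.0}
print(filter_next_token_distribution(toy_logits))  # all four tokens survive
print(filter_next_token_distribution(toy_logits, temperature=0.1, top_k=20, top_p=0.3))
```

With the settings used in this notebook (`temperature=0.1`, `top_k=20`, `top_p=0.3`), the toy distribution collapses onto the single most probable token, which is why the outputs are close to deterministic even though `do_sample=True`.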

## Key learnings
- Following a chat template is highly recommended in order to mimic the model's training process (and therefore make the most of the model's knowledge).
- Prompt engineering is mandatory: a poor prompt will lead to poor results. Techniques such as few-shot learning can be useful tools in our arsenal.
- Controlling the generation parameters is important and depends on the task, whether we seek creativity or factual responses.
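
As a small illustration of the few-shot pattern used earlier, the helper below builds the alternating user/assistant messages from a list of examples. The name `build_few_shot_messages` is hypothetical; the resulting list has the same shape as `messages_few_shot` and could be passed to `apply_chat_template` in the same way.

```python
def build_few_shot_messages(examples, query):
    """Turn (input, expected_answer) pairs plus a final query into chat messages."""
    messages = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The final user turn is the one the model is expected to answer.
    messages.append({"role": "user", "content": query})
    return messages


examples = [
    ("This film was great, rich of details and with great actors.", "SENTIMENT: Positive.\nSUBJECT: Film"),
    ("This park is dirty.", "SENTIMENT: Negative.\nSUBJECT: Park"),
]
messages_few_shot = build_few_shot_messages(examples, "This notebook is fantastic. I'm learning a lot")
print(len(messages_few_shot))
```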

# Langchain 

## Stuffing

<!-- import image from image path -->
![image](image/image.png)


## MapReduce

<!-- import image from image path -->
![image](image/mapreduce.png)

## Refine

<!-- import image from image path -->
![image](image/refine.png)

## Document splitting strategies
Tokens are pieces of words: when we write a prompt, the input is transformed into tokens. One token doesn't mean one word, but we can generally approximate 1 token ~ 4 characters in English ~ 3/4 of a word. Simply put, 75 words ~ 100 tokens.

Depending on the model used, we can accommodate a certain number of tokens, shared between the prompt and the model's generation, which forces us to split the input if the context is too large. Gemma has a maximum context length of 8192 tokens, which roughly translates to more than 6100 words.
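
The rule of thumb can be turned into a quick back-of-the-envelope check. This is only a sketch: the helper names are hypothetical, and the 4-characters-per-token ratio is an approximation, not a tokenizer count (the `num_tokens` column of the dataset reports 2914 for the first writeup, noticeably more than the estimate below).

```python
import math

CHARS_PER_TOKEN = 4          # rule of thumb from the text, not an exact tokenizer count
GEMMA_CONTEXT_TOKENS = 8192  # maximum context length quoted above


def estimate_tokens(n_chars):
    """Approximate the number of tokens from the number of characters."""
    return math.ceil(n_chars / CHARS_PER_TOKEN)


def fits_in_context(n_chars, max_new_tokens=512, context_tokens=GEMMA_CONTEXT_TOKENS):
    """Check whether the prompt plus the generation budget fits in the context window."""
    return estimate_tokens(n_chars) + max_new_tokens <= context_tokens


writeup_chars = 9864  # length of the first writeup printed earlier
print(estimate_tokens(writeup_chars), fits_in_context(writeup_chars))
```

For an exact count, the prompt would have to be passed through the model's tokenizer; the estimate is only meant to flag documents that are clearly too long.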

Scenario C - Output is poor and writeups don't follow a clear structure: This is the most difficult, yet plausible, scenario, in which our model struggles with the winner's stream of consciousness and lack of a clear document structure. Moreover, if the writeup is lengthy, the situation could be particularly challenging. In such cases, a character splitting strategy could be ideal.

Let's see it in action, testing a formatted fake writeup from the [documentation](https://www.kaggle.com/solution-write-up-documentation) and a messy one.

In [9]:
example_clean_writeup = """
# Context section This # section only #contains 2 links # Data context link to the competition data page # Overview of the Approach this section should describe the models or algorithms used, describe the data preprocessing, feature engineering, and/or feature selection strategy, described the validation strategy.# Details of the submission this section should include what was special, creative, important, and/or impactful about the submission. And also, what was tried and didn’t work. # Sources this section should include links to helpful resources like research papers, past winning write-up solutions, forum posts, helpful notebooks, etc.""" 

example_messy_writeup = """
This section only contains 2 links, and here the link to the competition data page.
Partial section.
Another partial section.
Extensive model secondi which describes the models or algorithms used, describe the data preprocessing, feature engineering, and/or feature selection strategy, described the validation strategy.
Special section should include what was special, creative, important, and/or impactful about the submission. And also, what was tried and didn’t work. Last section should include links to helpful resources like research papers, past winning write-up solutions, forum posts, helpful notebooks, etc."""

## Langchain

### Split clean writeup

In [10]:
# Split the clean writeup based on sections
text_splitter = CharacterTextSplitter(separator='#', chunk_size=100, chunk_overlap=50)
texts_clean_writeup = text_splitter.split_text(example_clean_writeup)

# Print each split
for i in texts_clean_writeup:
    print("----------------")
    print(i)

Created a chunk of size 209, which is longer than the specified 100
Created a chunk of size 175, which is longer than the specified 100


----------------
# Context section This # section only #contains 2 links
----------------
section only #contains 2 links # Data context link to the competition data page
----------------
Overview of the Approach this section should describe the models or algorithms used, describe the data preprocessing, feature engineering, and/or feature selection strategy, described the validation strategy.
----------------
Details of the submission this section should include what was special, creative, important, and/or impactful about the submission. And also, what was tried and didn’t work.
----------------
Sources this section should include links to helpful resources like research papers, past winning write-up solutions, forum posts, helpful notebooks, etc.


### Split  messy writeup


In [11]:
example_messy_writeup = """
This section only contains 2 links, and here the link to the competition data page.
Partial section.
Another partial section.
Extensive model secondi which describes the models or algorithms used, describe the data preprocessing, feature engineering, and/or feature selection strategy, described the validation strategy.
Special section should include what was special, creative, important, and/or impactful about the submission. And also, what was tried and didn’t work. Last section should include links to helpful resources like research papers, past winning write-up solutions, forum posts, helpful notebooks, etc."""
# Split the messy writeup based on newlines
text_splitter = CharacterTextSplitter(separator='\n', chunk_size=100, chunk_overlap=50)
texts_messy_writeup = text_splitter.split_text(example_messy_writeup)

for i in texts_messy_writeup:
    print("----------------")
    print(i)

Created a chunk of size 194, which is longer than the specified 100


----------------
This section only contains 2 links, and here the link to the competition data page.
Partial section.
----------------
Partial section.
Another partial section.
----------------
Extensive model secondi which describes the models or algorithms used, describe the data preprocessing, feature engineering, and/or feature selection strategy, described the validation strategy.
----------------
Special section should include what was special, creative, important, and/or impactful about the submission. And also, what was tried and didn’t work. Last section should include links to helpful resources like research papers, past winning write-up solutions, forum posts, helpful notebooks, etc.
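
The warnings above ("Created a chunk of size ..., which is longer than the specified 100") follow from how a separator-based splitter works: the text is first cut on the separator, and the pieces are then merged up to `chunk_size`, so a single piece that is already longer than `chunk_size` is emitted whole. The sketch below is a simplified stand-in that assumes this split-then-merge behaviour and deliberately ignores `chunk_overlap`; `split_and_merge` is a hypothetical name, not LangChain's code.

```python
def split_and_merge(text, separator, chunk_size):
    """Split on `separator`, then greedily merge pieces without exceeding `chunk_size`."""
    pieces = [piece.strip() for piece in text.split(separator) if piece.strip()]
    chunks, current = [], ""
    for piece in pieces:
        if not current:
            # Start a new chunk, even if this single piece is already too long.
            current = piece
        elif len(current) + len(separator) + len(piece) <= chunk_size:
            current = current + separator + piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


toy_text = "aaaa\nbb\ncc\n" + "d" * 30
for chunk in split_and_merge(toy_text, separator="\n", chunk_size=10):
    print(len(chunk), repr(chunk))
```

In the toy input, the three short lines are merged into one chunk of exactly 10 characters, while the 30-character line cannot be cut on the separator and therefore exceeds `chunk_size`, as the long sections of the writeups do.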


### Split by HTML header

In [12]:
print(writeup_small)

<h2>TLDR</h2>
<p>We used an approach similar to audio spectrogram classification using the EfficientNet-B0 model, with numerous augmentations and transformer models such as BERT and DeBERTa as helper models. The final solution consists of one EfficientNet-B0 with an input size of 160x80, trained on a single fold from 8 randomly split folds, as well as DeBERTa and BERT trained on the full dataset. A single fold model using EfficientNet has a CV score of 0.898 and a leaderboard score of ~0.8.</p>
<p>We used only competition data.</p>
<h2>1. Data Preprocessing</h2>
<h3>1.1 CNN Preprocessing</h3>
<ul>
<li>We extracted 18 lip points, 20 pose points (including arms, shoulders, eyebrows, and nose), and all hand points, resulting in a total of 80 points.</li>
<li>During training, we applied various augmentations.</li>
<li>We implemented standard normalization.</li>
<li>Instead of dropping NaN values, we filled them with zeros after normalization.</li>
<li>We interpolated the time axis to a siz

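The `return_each_element` parameter of `HTMLHeaderTextSplitter` controls the granularity of the output: when set to `True`, each HTML element is returned separately with its associated headers; when set to `False`, as in the next cell, elements that share the same headers are aggregated into a single document.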

In [13]:
# Split on HTML headers
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2")
]

# Split the real HTML writeup based on headers
text_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on, 
                                       return_each_element=False
                                       )

texts_html_writeup = text_splitter.split_text(writeup_small)

print('Length writup:', len(writeup_small))
print('Number of splits:', len(texts_html_writeup))
print('Element returned:', type(texts_html_writeup[0]))
print('Length of each split:', [len(i.page_content) for i in texts_html_writeup])

# Print the first characters for each split
print(); print([(i.page_content[:50], i.metadata) for i in texts_html_writeup])

Length writup: 9864
Number of splits: 6
Element returned: <class 'langchain_core.documents.base.Document'>
Length of each split: [511, 1567, 1040, 1203, 1031, 1260]

[('We used an approach similar to audio spectrogram c', {'Header 2': 'TLDR'}), ('We extracted 18 lip points, 20 pose points (includ', {'Header 2': '1. Data Preprocessing'}), ('These augmentations are used in both CNN training ', {'Header 2': '2. Augmentation'}), ('Train on one fold with a random split (8 folds in ', {'Header 2': '3. Training'}), ('We rewrote all our models in Keras and transferred', {'Header 2': '4. Submissions, Conversion and Ensemble'}), ('Depthwise convolution models performed very well f', {'Header 2': '5. PS. Need BETTER TFlite DepthwiseConv2D'})]


In [14]:
for i, text in enumerate(texts_html_writeup):
    # Join the metadata and the content together
    final_content = '\n'.join(text.metadata.values()) + '\n' + text.page_content
    # Replace the old content with the enriched one
    text.page_content = final_content
    
    # Print some examples
    if i < 2:
        print(final_content); print()

TLDR
We used an approach similar to audio spectrogram classification using the EfficientNet-B0 model, with numerous augmentations and transformer models such as BERT and DeBERTa as helper models. The final solution consists of one EfficientNet-B0 with an input size of 160x80, trained on a single fold from 8 randomly split folds, as well as DeBERTa and BERT trained on the full dataset. A single fold model using EfficientNet has a CV score of 0.898 and a leaderboard score of ~0.8.  
We used only competition data.

1. Data Preprocessing
We extracted 18 lip points, 20 pose points (including arms, shoulders, eyebrows, and nose), and all hand points, resulting in a total of 80 points. During training, we applied various augmentations. We implemented standard normalization. Instead of dropping NaN values, we filled them with zeros after normalization. We interpolated the time axis to a size of 160 using 'nearest' interpolation: yy = F.interpolate(yy[None, None, :], size=self.new_size, mode='n
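
The two steps above (splitting on headers, then prepending the header to the content) can be approximated with the standard library alone. This is a rough sketch under simplifying assumptions: it only handles `<h2>` headers without attributes, strips tags with a regular expression, and uses the hypothetical name `split_on_h2`; it is not how `HTMLHeaderTextSplitter` is implemented.

```python
import re
from html import unescape


def split_on_h2(html):
    """Split an HTML string on <h2> headers and prepend each header to its section text."""
    # With a capturing group, re.split returns [before, header1, body1, header2, body2, ...].
    parts = re.split(r"<h2>(.*?)</h2>", html)
    sections = []
    for header, body in zip(parts[1::2], parts[2::2]):
        text = re.sub(r"<[^>]+>", " ", body)      # drop the remaining tags
        text = " ".join(unescape(text).split())   # collapse whitespace
        sections.append({"header": header, "content": header + "\n" + text})
    return sections


toy_html = (
    "<h2>TLDR</h2>\n<p>We used only competition data.</p>\n"
    "<h2>1. Data Preprocessing</h2>\n<ul>\n<li>We implemented standard normalization.</li>\n</ul>"
)
for section in split_on_h2(toy_html):
    print(section["content"])
    print()
```

Keeping the header inside the content gives the model the section title as context when each chunk is summarized on its own.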

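We use `CharacterTextSplitter` again, this time on the header-based documents, so that sections longer than `chunk_size` can be split further.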

In [15]:
text_splitter = langchain.text_splitter.CharacterTextSplitter(chunk_size=2000, chunk_overlap=100)

# Split
splits = text_splitter.split_documents(texts_html_writeup)
print('Number of final splits:', len(splits))
print('Length of each final split:', [len(i.page_content) for i in splits])

print(); print([(i.page_content[:50], i.metadata) for i in splits])

Number of final splits: 6
Length of each final split: [516, 1589, 1056, 1215, 1071, 1302]

[('TLDR\nWe used an approach similar to audio spectrog', {'Header 2': 'TLDR'}), ('1. Data Preprocessing\nWe extracted 18 lip points, ', {'Header 2': '1. Data Preprocessing'}), ('2. Augmentation\nThese augmentations are used in bo', {'Header 2': '2. Augmentation'}), ('3. Training\nTrain on one fold with a random split ', {'Header 2': '3. Training'}), ('4. Submissions, Conversion and Ensemble\nWe rewrote', {'Header 2': '4. Submissions, Conversion and Ensemble'}), ('5. PS. Need BETTER TFlite DepthwiseConv2D\nDepthwis', {'Header 2': '5. PS. Need BETTER TFlite DepthwiseConv2D'})]


## Key learnings

- Stuffing, MapReduce and Refine are three different techniques that can be used to summarize documents.
- Given the task, document structure and model capabilities, we might need to split our document into chunks to fit the context in our prompt or to improve the summary.
- Kaggle writeups can all potentially fit in Gemma given its context length, but different strategies, such as section splitting based on HTML formatting, are worth testing.
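
The control flow of the three techniques can be sketched in a few lines. The code below is a schematic illustration, not LangChain's implementation: the function names are hypothetical, the prompts are reduced to plain concatenation, and `toy_llm` is a stand-in that only records how often the model would be called.

```python
def summarize_stuff(chunks, llm):
    """Stuffing: put every chunk in a single prompt and call the model once."""
    return llm("\n\n".join(chunks))


def summarize_map_reduce(chunks, llm):
    """MapReduce: summarize each chunk on its own, then summarize the summaries."""
    partial_summaries = [llm(chunk) for chunk in chunks]
    return llm("\n\n".join(partial_summaries))


def summarize_refine(chunks, llm):
    """Refine: summarize the first chunk, then update that summary chunk by chunk."""
    summary = llm(chunks[0])
    for chunk in chunks[1:]:
        summary = llm(summary + "\n\n" + chunk)
    return summary


calls = []


def toy_llm(text):
    """Stand-in for a real model: records the call and returns the first 20 characters."""
    calls.append(text)
    return text[:20]


chunks = ["TLDR ...", "1. Data Preprocessing ...", "2. Augmentation ..."]
for strategy in (summarize_stuff, summarize_map_reduce, summarize_refine):
    calls.clear()
    strategy(chunks, toy_llm)
    print(strategy.__name__, "->", len(calls), "model calls")
```

Stuffing needs a single call but requires the whole document to fit in the context; MapReduce and Refine trade extra calls for smaller prompts, and the MapReduce calls on individual chunks are independent of each other while the Refine calls are sequential.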

In [16]:
with torch.no_grad():
    torch.cuda.empty_cache()
gc.collect()

class GemmaLLM(LLM):
    hf_pipe: Any = None
    pipe_kwargs: Any = None
        
    def __init__(self, hf_pipeline, pipe_kwargs):
        super(GemmaLLM, self).__init__()
        self.hf_pipe = hf_pipeline
        self.pipe_kwargs = pipe_kwargs

    @property
    def _llm_type(self):
        return "Gemma pipeline"

    def _call(self, prompt, **kwargs):
        """
        This is the part that gets invoked by LangChain. We make sure that we pass the parameters we
        previously discussed to the HF pipeline, returning only the output without the prompt.
        """
        outputs = self.hf_pipe(
            prompt,
            do_sample=self.pipe_kwargs['do_sample'],
            temperature=self.pipe_kwargs['temperature'],
            top_k=self.pipe_kwargs['top_k'],
            top_p=self.pipe_kwargs['top_p'],
            add_special_tokens=self.pipe_kwargs['add_special_tokens']
        )
        return outputs[0]["generated_text"][len(prompt):]  

    @property
    def _identifying_params(self):
        """Pipeline params"""
        return {"n": self.pipe_kwargs}

langchain_hf = GemmaLLM(hf_pipeline=pipe,
                        pipe_kwargs={
                            'do_sample':True,
                            'temperature':0.1,
                            'top_k':20,
                            'top_p':0.3,
                            'add_special_tokens':True
                })
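
The slicing in `_call` relies on the text-generation pipeline returning the prompt followed by the completion. The sketch below demonstrates the pattern with a fake pipeline, so it runs without a model; `fake_text_generation_pipeline` and `complete` are hypothetical names used only for this illustration.

```python
def fake_text_generation_pipeline(prompt, **generation_kwargs):
    """Stand-in for the HF pipeline: returns the prompt followed by a canned completion."""
    return [{"generated_text": prompt + "* EfficientNet-B0 on a single fold."}]


def complete(pipeline, prompt, **generation_kwargs):
    """Call the pipeline and return only the newly generated text."""
    outputs = pipeline(prompt, **generation_kwargs)
    # The generated text starts with the prompt, so drop the first len(prompt) characters.
    return outputs[0]["generated_text"][len(prompt):]


prompt = "<start_of_turn>user\nSummarize this.<end_of_turn>\n<start_of_turn>model\n"
print(complete(fake_text_generation_pipeline, prompt, do_sample=True, temperature=0.1))
```

As far as I know, the text-generation pipeline also accepts `return_full_text=False`, which should make the slicing unnecessary; the manual version is kept here because it matches the wrapper above.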

In [17]:
prompt[:350]

'<start_of_turn>user\nSummarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:\n\n<h2>TLDR</h2>\n<p>We used an approach similar to audio spectrogram classification using the EfficientNet-B0 model, with numerous augmentations and transformer models s'

In [18]:
out = langchain_hf.invoke(prompt)
print(out)

Sure, here's a summary of the text in a technical way:

**1. Data Preprocessing**

* Extract 80 points from the image, including lip and pose points.
* Apply various augmentations and normalizations.
* Fill NaN values with zeros and use nearest interpolation for the time axis.

**2. Augmentation**

* Use common and CNN specific augmentations.
* Implement a mixup augmentation that only works with CNNs.

**3. Training**

* Train EfficientNet-B0 and BERT on a single fold with 0.1 warm-up.
* Train a transformer model with a ranger optimizer and 4-layer transformer.
* Tune hyperparameters with Optuna.

**4. Submissions**

* Aggregate models in a tf.Module.
* Calculate ensemble weights for fold 0 and apply to the full dataset.

**5. PS. Need BETTER TFlite DepthwiseConv2D**

* Explore different ways to implement depthwise convolution in tflite.
* Experiment with different FLOP configurations.

**6. Conclusion**

* EfficientNet-B0 achieved a leaderboard score of 0.8.
* Transformers improved th

In [19]:
from typing import Any

from llama_index.core.callbacks import CallbackManager
from llama_index.core.llms import (
    CustomLLM,
    CompletionResponse,
    CompletionResponseGen,
    LLMMetadata,
)
from llama_index.core.llms.callbacks import llm_completion_callback

class Gemma(CustomLLM):
    num_output: int = 512
    model_name: str = "Gemma"
    model: Any = None

    def __init__(self, model, num_output):
        super(Gemma, self).__init__()
        self.model = model
        self.num_output = num_output

    @property
    def metadata(self) -> LLMMetadata:
        """Get LLM metadata."""
        return LLMMetadata(
            num_output=self.num_output,
            model_name=self.model_name,
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        return CompletionResponse(text=self.model.generate(prompt, max_length=self.num_output))

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        response = ""
        for token in self.model.generate(prompt, max_length=self.num_output):
            response += token
            yield CompletionResponse(text=response, delta=token)

In [20]:
# response = Gemma(gemma_lm, 512).complete(sample_query)
# print(response.text)

NameError: name 'gemma_lm' is not defined

# MapReduce strategy


In [None]:
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

# Define prompt for summarization of each chunk
prompt_template = """<bos><start_of_turn>user
Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:

{text}<end_of_turn>
<start_of_turn>model"""
prompt_init = PromptTemplate.from_template(prompt_template)

# Define prompt for final output, the summary of summaries
combine_template = """<bos><start_of_turn>user
You are given a text containing summaries of different part of a document.
Create one single summary combining all the information of the chapters. Divide the summary in chapters, be impersonal and use bullet points:

{text}<end_of_turn>
<start_of_turn>model"""
combine_prompt = PromptTemplate.from_template(combine_template)

# Create the chain of summarization, using map_reduce
chain = load_summarize_chain(langchain_hf, chain_type='map_reduce', map_prompt=prompt_init, combine_prompt=combine_prompt)

# Run the chain on the chunks
out_summary = chain.invoke(splits)
print(out_summary['output_text'].replace('\n\n','\n'))


 1: EfficientNet-B0
* The EfficientNet-B0 model is a deep neural network architecture that is designed to be efficient.
* The model consists of a hierarchy of depthwise convolutions, followed by a global average pooling layer.
* The model is trained using a single fold of 8 randomly split folds.
**Chapter 1: Data Preparation**
* The dataset consists of 10,000 images with 10 classes.
* The EfficientNet-B0 model is trained on a single fold with the following settings:
    * Input size: 160x80
    * Number of filters: 512
    * Number of layers: 19
    * Batch size: 32
    * Learning rate: 0.001
**Chapter 2: Model Training**
* The EfficientNet-B0 model is trained on the single fold with the following settings:
    * Input size: 160x80
    * Number of filters: 512
    * Number of layers: 19
    * Batch size: 32
    * Learning rate: 0.001
**Chapter 3: Evaluation**
* The model is evaluated on the single fold with the following metrics:
    * CV score: 0.898
    * Leaderboard score: ~0.8
**Ch

In [None]:

# Repeat the process above, with verbose True
chain = load_summarize_chain(langchain_hf, chain_type='map_reduce', verbose=True, map_prompt=prompt_init, combine_prompt=combine_prompt)

# Run the chain on the chunks
out_summary = chain.invoke(splits)



> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
<bos><start_of_turn>user
Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:

TLDR
We used an approach similar to audio spectrogram classification using the EfficientNet-B0 model, with numerous augmentations and transformer models such as BERT and DeBERTa as helper models. The final solution consists of one EfficientNet-B0 with an input size of 160x80, trained on a single fold from 8 randomly split folds, as well as DeBERTa and BERT trained on the full dataset. A single fold model using EfficientNet has a CV score of 0.898 and a leaderboard score of ~0.8.  
We used only competition data.<end_of_turn>
<start_of_turn>model
Prompt after formatting:
<bos><start_of_turn>user
Summarize the following text in a technical way. Focus

# Refine method

In [None]:


# Define prompt for the first summarization
prompt_template = """<bos><start_of_turn>user
Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:

{text}<end_of_turn>
<start_of_turn>model"""
prompt_init = PromptTemplate.from_template(prompt_template)

# Define prompt for the refine phase, enhancing the previous summary with the new information
refine_template = """<bos><start_of_turn>user
Your job is to produce a final document divided in chapters and bullet points.
You are given a text containing an existing summary to a certain point:

{existing_answer}

You can now refine it (if necessary) with more context below.

{text}

Given the new context, refine the original summary.<end_of_turn>
<start_of_turn>model"""
prompt_refine = PromptTemplate.from_template(refine_template)


chain = load_summarize_chain(langchain_hf, chain_type='refine',
                             return_intermediate_steps=True,
                             input_key='input_documents',
                             output_key='output_text',
                             question_prompt=prompt_init,
                             refine_prompt=prompt_refine,
                             verbose=True)

out_summary = chain.invoke(splits, return_only_outputs=True)
print(out_summary['output_text'])



> Entering new RefineDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
<bos><start_of_turn>user
Summarize the following text in a technical way. Focus on facts, numbers and strategies used. Divide the summary in chapters, be impersonal and use bullet points:

TLDR
We used an approach similar to audio spectrogram classification using the EfficientNet-B0 model, with numerous augmentations and transformer models such as BERT and DeBERTa as helper models. The final solution consists of one EfficientNet-B0 with an input size of 160x80, trained on a single fold from 8 randomly split folds, as well as DeBERTa and BERT trained on the full dataset. A single fold model using EfficientNet has a CV score of 0.898 and a leaderboard score of ~0.8.  
We used only competition data.<end_of_turn>
<start_of_turn>model[0m

[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3m<bos><start_

# Fine-tuning Gemma with LoRA

In [None]:
# Load the CNN/DailyMail validation split, which has fewer examples than the training split
validation = pd.read_csv('cnn_dailymail/validation.csv')[['article', 'highlights']]
validation.head()

,article,highlights
0,"Sally Forrest, an actress-dancer who graced th...","Sally Forrest, an actress-dancer who graced th..."
1,A middle-school teacher in China has inked hun...,Works include pictures of Presidential Palace ...
2,A man convicted of killing the father and sist...,"Iftekhar Murtaza, 29, was convicted a year ago..."
3,Avid rugby fan Prince Harry could barely watch...,Prince Harry in attendance for England's crunc...
4,A Triple M Radio producer has been inundated w...,Nick Slater's colleagues uploaded a picture to...


In [None]:
# Local path to the Gemma 2B instruction-tuned checkpoint
model_id = "google_model/gemma_2b_it"

# LoRA adapters of rank 6 on every attention and MLP projection
lora_config = LoraConfig(
    r=6,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Load the base weights in 4-bit NF4 and run computation in float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "right"  # right-pad batches; works around an overflow issue with fp16 training
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=bnb_config)

Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.11s/it]
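
To get a feel for what `r=6` means: LoRA freezes each targeted weight matrix of shape `d_out x d_in` and trains two small matrices, `B` (`d_out x r`) and `A` (`r x d_in`), so each adapted module adds `r * (d_in + d_out)` trainable parameters. The sketch below counts them. The layer shapes are illustrative placeholders, not values read from the Gemma checkpoint, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


r = 6  # rank used in the LoraConfig above

# Hypothetical (d_in, d_out) shapes, chosen only for illustration.
example_shapes = {"square_proj": (2048, 2048), "wide_proj": (2048, 16384)}

for name, (d_in, d_out) in example_shapes.items():
    full = d_in * d_out                      # parameters in the frozen weight
    added = lora_param_count(d_in, d_out, r)  # parameters in the adapter
    print(f"{name}: full={full:,} lora={added:,} ({added / full:.3%} of full)")
```

For real counts, `peft` models expose `print_trainable_parameters()` once the adapters are attached.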


In [None]:
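# The validation split is used as training data here to keep the run small
train_data = Dataset.from_pandas(validation)

def formatting_prompts_func(example):
    """Turn a batch of (article, highlights) pairs into chat-formatted training strings."""
    output_texts = []
    for i in range(len(example['article'])):
        messages = [
            # User turn: the instruction followed by the article
            {"role": "user",
             "content": f"Given the following article, write a short summary of the article in 2-3 sentences:\n\nArticle: {example['article'][i]}"},
            # Assistant turn: the reference summary the model should learn to produce
            {"role": "assistant",
             "content": f"{example['highlights'][i]}"}
        ]
        # Render with the tokenizer's chat template; no generation prompt, since the answer is already included
        output_texts.append(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False))

    return output_texts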

# Print the first training example
print(formatting_prompts_func(train_data[:1])[0])

<start_of_turn>user
Given the following article, write a short summary of the article in 2-3 sentences:

Article: Sally Forrest, an actress-dancer who graced the silver screen throughout the '40s and '50s in MGM musicals and films such as the 1956 noir While the City Sleeps died on March 15 at her home in Beverly Hills, California. Forrest, whose birth name was Katherine Feeney, was 86 and had long battled cancer. Her publicist, Judith Goffin, announced the news Thursday. Scroll down for video . Actress: Sally Forrest was in the 1951 Ida Lupino-directed film 'Hard, Fast and Beautiful' (left) and the 1956 Fritz Lang movie 'While the City Sleeps' A San Diego native, Forrest became a protege of Hollywood trailblazer Ida Lupino, who cast her in starring roles in films including the critical and commercial success Not Wanted, Never Fear and Hard, Fast and Beautiful. Some of Forrest's other film credits included Bannerline, Son of Sinbad, and Excuse My Dust, according to her iMDB page. The p
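
The printed example follows Gemma's turn-based layout, the same `<start_of_turn>` / `<end_of_turn>` markers used in the hand-written prompt templates earlier in this notebook. A minimal sketch of that layout in plain Python, useful for checking prompts without loading a tokenizer. `format_gemma_turns` is a hypothetical helper, and the real `apply_chat_template` output may differ in details such as a leading `<bos>` token or trailing whitespace.

```python
def format_gemma_turns(messages):
    """Render chat messages with Gemma-style turn markers (illustrative only)."""
    # Assumption: the "assistant" role is written as "model" in the rendered text.
    role_map = {"user": "user", "assistant": "model"}
    parts = []
    for message in messages:
        role = role_map[message["role"]]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    return "".join(parts)


demo_messages = [
    {"role": "user", "content": "Summarize: the cat sat on the mat."},
    {"role": "assistant", "content": "A cat sat on a mat."},
]
demo_text = format_gemma_turns(demo_messages)
print(demo_text)
```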

In [None]:
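from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    max_seq_length=512,  # truncate formatted examples to 512 tokens
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,  # effective batch size of 4 per device
        warmup_steps=2,
        max_steps=25,  # short demonstration run, far less than one epoch
        learning_rate=2e-4,
        fp16=True,
        logging_steps=2,
        report_to='none',  # disable external experiment trackers
        output_dir='logs',
        optim="paged_adamw_8bit"  # paged 8-bit AdamW from bitsandbytes
    ),
    peft_config=lora_config,  # attach the LoRA adapters to the quantized model
    formatting_func=formatting_prompts_func,  # builds the chat-formatted training text
)
trainer.train()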

Map: 100%|██████████| 13368/13368 [00:04<00:00, 3146.53 examples/s]


Step,Training Loss
2,3.443
4,3.5491
6,3.0839
8,3.0991
10,2.6273
12,2.6998
14,2.6311
16,2.5855
18,2.3657
20,2.5368


TrainOutput(global_step=25, training_loss=2.80006965637207, metrics={'train_runtime': 24.1194, 'train_samples_per_second': 4.146, 'train_steps_per_second': 1.037, 'total_flos': 599516784721920.0, 'train_loss': 2.80006965637207, 'epoch': 0.01})
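
The reported `'epoch': 0.01` follows directly from the training arguments: 25 optimizer steps at an effective batch size of 4 cover only 100 of the 13368 examples. The arithmetic, assuming a single GPU (the device count is not stated in the notebook):

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 25
num_examples = 13368  # taken from the "Map" progress bar above
num_devices = 1       # assumption: single-GPU run

# Examples consumed per optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Examples consumed over the whole run
examples_seen = effective_batch * max_steps
# Fraction of one pass over the dataset
epoch_fraction = examples_seen / num_examples

print(effective_batch, examples_seen, round(epoch_fraction, 2))
```

The same figure can be cross-checked from the log: `train_samples_per_second` (4.146) times `train_runtime` (24.1194 s) is also roughly 100 samples. With so little data seen, the falling loss shows that the setup trains, not that the adapter has converged.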