In this notebook, you will learn optimization techniques, using the Santa 2024 competition as the working example.<br>
1. [Issues in Competition](https://www.kaggle.com/code/utm529fg/eng-tutorial-1-issues-in-competitions)
2. Greedy ← this notebook
3. [BS: Beam Search](https://www.kaggle.com/code/utm529fg/eng-tutorial-3-beam-search)
4. [HC: Hill Climbing](https://www.kaggle.com/code/utm529fg/eng-tutorial-4-hill-climbing)
5. [SA: Simulated Annealing](https://www.kaggle.com/code/utm529fg/eng-tutorial-5-simulated-annealing)
6. [GA: Genetic Algorithm](https://www.kaggle.com/code/utm529fg/eng-tutorial-6-genetic-algorithm)
7. [Consideration of improvement](https://www.kaggle.com/code/utm529fg/eng-tutorial-7-consideration-of-improvement)

# 2-1. What Is a Greedy Algorithm?
In the previous tutorial, we explained that evaluating every possible ordering of the words is impossible in a realistic amount of time.<br>
This time, let's build the sequence of 10 words one word at a time, from the beginning to the end, choosing each word so that the score is as good as possible. Because the score is a perplexity, lower is better. The algorithm is as follows.<br>
step1. Calculate the perplexity score of each word on its own.<br>
step2. Fix the word with the best (lowest) score as the first word.<br>
step3. Calculate the perplexity score of each two-word sequence formed by appending one of the remaining words as the second word.<br>
step4. Fix the word that gives the best-scoring sequence as the second word.<br>
step5. Repeat steps 3-4, growing the sequence by one word each time, until no words remain.<br>

The idea is to greedily fix the best word at each step, one word at a time, without looking ahead.<br>
Let's follow the flow of the algorithm while solving an actual problem.
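Before running the real model, the five steps can be sketched on a toy problem. The snippet below is only an illustration: `toy_score` is a hypothetical stand-in for perplexity (it counts adjacent word pairs that are out of alphabetical order, lower is better) and `greedy_append` is a name introduced here, not part of the competition code.

```python
from typing import Callable, List, Tuple


def toy_score(text: str) -> float:
    """Hypothetical stand-in for perplexity (lower is better).

    Counts adjacent word pairs that are out of alphabetical order.
    This is NOT the competition metric.
    """
    words = text.split()
    return float(sum(a > b for a, b in zip(words, words[1:])))


def greedy_append(words: List[str], score_fn: Callable[[str], float]) -> Tuple[float, str]:
    """Build a sequence by always appending the word with the lowest score."""
    remaining = list(words)
    solution = ""
    best = float("inf")
    while remaining:
        # Steps 1 and 3: score every candidate that extends the current sequence
        candidates = [(solution + " " + w).strip() for w in remaining]
        scores = [score_fn(c) for c in candidates]
        # Steps 2 and 4: keep the candidate with the lowest score
        best_idx = min(range(len(scores)), key=scores.__getitem__)
        best = scores[best_idx]
        solution = candidates[best_idx]
        remaining.pop(best_idx)
    return best, solution


best, text = greedy_append(["elf", "advent", "chimney"], toy_score)
print(best, text)
print(toy_score("advent chimney elf"))
```

With this toy score, every single word ties at the first step, so the first word in the list is fixed and the final result is worse than the alphabetical ordering. This is the price of never looking ahead.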


In [1]:
import pandas as pd
sample_submission = pd.read_csv('/kaggle/input/santa-2024/sample_submission.csv')
words = sample_submission.loc[0,'text'].split()
words

['advent',
 'chimney',
 'elf',
 'family',
 'fireplace',
 'gingerbread',
 'mistletoe',
 'ornament',
 'reindeer',
 'scrooge']

In [2]:
"""Evaluation metric for Santa 2024."""

import gc
import os
from math import exp
from collections import Counter
from typing import List, Optional, Union

import numpy as np
import pandas as pd
import transformers
import torch

os.environ['OMP_NUM_THREADS'] = '1'
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
PAD_TOKEN_LABEL_ID = torch.nn.CrossEntropyLoss().ignore_index
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


class ParticipantVisibleError(Exception):
    pass


def score(
    solution: pd.DataFrame,
    submission: pd.DataFrame,
    row_id_column_name: str,
    model_path: str = '/kaggle/input/gemma-2/transformers/gemma-2-9b/2',
    load_in_8bit: bool = False,
    clear_mem: bool = False,
) -> float:
    """
    Calculates the mean perplexity of submitted text permutations compared to an original text.

    Parameters
    ----------
    solution : DataFrame
        DataFrame containing the original text in a column named 'text'.
        Includes a row ID column specified by `row_id_column_name`.

    submission : DataFrame
        DataFrame containing the permuted text in a column named 'text'.
        Must have the same row IDs as the solution.
        Includes a row ID column specified by `row_id_column_name`.

    row_id_column_name : str
        Name of the column containing row IDs.
        Ensures aligned comparison between solution and submission.

    model_path : str, default='/kaggle/input/gemma-2/transformers/gemma-2-9b/2'
        Path to the serialized LLM.

    load_in_8bit : bool, default=False
        Use 8-bit quantization for the model. Requires CUDA.

    clear_mem : bool, default=False
        Clear GPU memory after scoring by clearing the CUDA cache.
        Useful for testing.

    Returns
    -------
    float
        The mean perplexity score. Lower is better.

    Raises
    ------
    ParticipantVisibleError
        If the submission format is invalid or submitted strings are not valid permutations.

    Examples
    --------
    >>> import pandas as pd
    >>> model_path = "/kaggle/input/gemma-2/transformers/gemma-2-9b/2"
    >>> solution = pd.DataFrame({
    ...     'id': [0, 1],
    ...     'text': ["this is a normal english sentence", "the quick brown fox jumps over the lazy dog"]
    ... })
    >>> submission = pd.DataFrame({
    ...     'id': [0, 1],
    ...     'text': ["sentence english normal a is this", "lazy the over jumps fox brown quick the dog"]
    ... })
    >>> score(solution, submission, 'id', model_path=model_path, clear_mem=True) > 0
    True
    """
    # Check that each submitted string is a permutation of the solution string
    sol_counts = solution.loc[:, 'text'].str.split().apply(Counter)
    sub_counts = submission.loc[:, 'text'].str.split().apply(Counter)
    invalid_mask = sol_counts != sub_counts
    if invalid_mask.any():
        raise ParticipantVisibleError(
            'At least one submitted string is not a valid permutation of the solution string.'
        )

    # Calculate perplexity for the submitted strings
    sub_strings = [
        ' '.join(s.split()) for s in submission['text'].tolist()
    ]  # Split and rejoin to normalize whitespace
    scorer = PerplexityCalculator(
        model_path=model_path,
        load_in_8bit=load_in_8bit,
    )  # Initialize the perplexity calculator with a pre-trained model
    perplexities = scorer.get_perplexity(
        sub_strings
    )  # Calculate perplexity for each submitted string

    if clear_mem:
        # Just move on if it fails. Not essential if we have the score.
        try:
            scorer.clear_gpu_memory()
        except Exception:
            print('GPU memory clearing failed.')

    return float(np.mean(perplexities))


class PerplexityCalculator:
    """
    Calculates perplexity of text using a pre-trained language model.

    Adapted from https://github.com/asahi417/lmppl/blob/main/lmppl/ppl_recurrent_lm.py

    Parameters
    ----------
    model_path : str
        Path to the pre-trained language model

    load_in_8bit : bool, default=False
        Use 8-bit quantization for the model. Requires CUDA.

    device_map : str, default="auto"
        Device mapping for the model.
    """

    def __init__(
        self,
        model_path: str,
        load_in_8bit: bool = False,
        device_map: str = 'auto',
    ):
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
        # Configure model loading based on quantization setting and device availability
        if load_in_8bit:
            if DEVICE.type != 'cuda':
                raise ValueError('8-bit quantization requires CUDA device')
            quantization_config = transformers.BitsAndBytesConfig(load_in_8bit=True)
            self.model = transformers.AutoModelForCausalLM.from_pretrained(
                model_path,
                quantization_config=quantization_config,
                device_map=device_map,
            )
        else:
            self.model = transformers.AutoModelForCausalLM.from_pretrained(
                model_path,
                torch_dtype=torch.float16 if DEVICE.type == 'cuda' else torch.float32,
                device_map=device_map,
            )

        self.loss_fct = torch.nn.CrossEntropyLoss(reduction='none')

        self.model.eval()

    def get_perplexity(
        self, input_texts: Union[str, List[str]], debug=False
    ) -> Union[float, List[float]]:
        """
        Calculates the perplexity of given texts.

        Parameters
        ----------
        input_texts : str or list of str
            A single string or a list of strings.

        debug : bool, default=False
            Print debugging information.

        Returns
        -------
        float or list of float
            A single perplexity value if input is a single string,
            or a list of perplexity values if input is a list of strings.

        Examples
        --------
        >>> import pandas as pd
        >>> model_path = "/kaggle/input/gemma-2/transformers/gemma-2-9b/2"
        >>> scorer = PerplexityCalculator(model_path=model_path)

        >>> submission = pd.DataFrame({
        ...     'id': [0, 1, 2],
        ...     'text': ["this is a normal english sentence", "thsi is a slihgtly misspelled zr4g sentense", "the quick brown fox jumps over the lazy dog"]
        ... })
        >>> perplexities = scorer.get_perplexity(submission["text"].tolist())
        >>> perplexities[0] < perplexities[1]
        True
        >>> perplexities[2] < perplexities[0]
        True

        >>> perplexities = scorer.get_perplexity(["this is a sentence", "another sentence"])
        >>> all(p > 0 for p in perplexities)
        True

        >>> scorer.clear_gpu_memory()
        """
        single_input = isinstance(input_texts, str)
        input_texts = [input_texts] if single_input else input_texts

        loss_list = []
        with torch.no_grad():
            # Process each sequence independently
            for text in input_texts:
                # Explicitly add sequence boundary tokens to the text
                text_with_special = f"{self.tokenizer.bos_token}{text}{self.tokenizer.eos_token}"

                # Tokenize
                model_inputs = self.tokenizer(
                    text_with_special,
                    return_tensors='pt',
                    add_special_tokens=False,
                )

                if 'token_type_ids' in model_inputs:
                    model_inputs.pop('token_type_ids')

                model_inputs = {k: v.to(DEVICE) for k, v in model_inputs.items()}

                # Get model output
                output = self.model(**model_inputs, use_cache=False)
                logits = output['logits']

                # Shift logits and labels for calculating loss
                shift_logits = logits[..., :-1, :].contiguous()  # Drop last prediction
                shift_labels = model_inputs['input_ids'][..., 1:].contiguous()  # Drop first input

                # Calculate token-wise loss
                loss = self.loss_fct(
                    shift_logits.view(-1, shift_logits.size(-1)),
                    shift_labels.view(-1)
                )

                # Calculate average loss
                sequence_loss = loss.sum() / len(loss)
                loss_list.append(sequence_loss.cpu().item())

                # Debug output
                if debug:
                    print(f"\nProcessing: '{text}'")
                    print(f"With special tokens: '{text_with_special}'")
                    print(f"Input tokens: {model_inputs['input_ids'][0].tolist()}")
                    print(f"Target tokens: {shift_labels[0].tolist()}")
                    print(f"Input decoded: {self.tokenizer.decode(model_inputs['input_ids'][0])}")
                    print(f"Target decoded: {self.tokenizer.decode(shift_labels[0])}")
                    print(f"Individual losses: {loss.tolist()}")
                    print(f"Average loss: {sequence_loss.item():.4f}")

        ppl = [exp(i) for i in loss_list]

        if debug:
            print("\nFinal perplexities:")
            for text, perp in zip(input_texts, ppl):
                print(f"Text: '{text}'")
                print(f"Perplexity: {perp:.2f}")

        return ppl[0] if single_input else ppl

    def clear_gpu_memory(self) -> None:
        """Clears GPU memory by deleting references and emptying caches."""
        if not torch.cuda.is_available():
            return

        # Delete model and tokenizer if they exist
        if hasattr(self, 'model'):
            del self.model
        if hasattr(self, 'tokenizer'):
            del self.tokenizer

        # Run garbage collection
        gc.collect()

        # Clear CUDA cache and reset memory stats
        with DEVICE:
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()
            torch.cuda.reset_peak_memory_stats()
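In `get_perplexity` above, the per-token cross-entropy losses are averaged and then passed through `exp`. A minimal sketch of that arithmetic, using made-up token probabilities instead of model output (the function name `perplexity_from_token_probs` is hypothetical):

```python
from math import exp, log


def perplexity_from_token_probs(probs):
    """Perplexity = exp(mean negative log-probability of the target tokens)."""
    losses = [-log(p) for p in probs]  # token-wise cross-entropy
    return exp(sum(losses) / len(losses))


# If the model assigned probability 1/4 to every target token,
# the perplexity would be about 4: "as uncertain as a 4-way guess".
print(perplexity_from_token_probs([0.25, 0.25, 0.25, 0.25]))

# Perplexity is the inverse geometric mean of the probabilities,
# so one confident token and one unlikely token can give the same value.
print(perplexity_from_token_probs([0.5, 0.125]))
```

This is why lower is better: a lower perplexity means the model found the word sequence more predictable.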

In [3]:
scorer = PerplexityCalculator('/kaggle/input/gemma-2/transformers/gemma-2-9b/2')


step1. Calculate the perplexity score of each word on its own.

In [4]:
candidates = words.copy()
scores = scorer.get_perplexity(candidates)
pd.DataFrame({'text_candidate':candidates, 'score':scores})

Unnamed: 0,text_candidate,score
0,advent,145673000.0
1,chimney,167921.2
2,elf,29595270.0
3,family,11957570.0
4,fireplace,279026.6
5,gingerbread,311276.7
6,mistletoe,333951.1
7,ornament,66274.63
8,reindeer,227734.9
9,scrooge,9039.643


step2. Fix the word with the best (lowest) score as the first word.

In [5]:
# Select the text with the best (lowest) score
intermediate_solution = candidates[np.argmin(scores)]

# Remove the chosen word from the list of remaining words
used_word = words[np.argmin(scores)]
words.remove(used_word)
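As a check on this selection step, the same logic can be replayed without the model, using the single-word scores as they are displayed (rounded) in the table above. This is a plain-Python sketch; `min(..., key=...)` plays the role of `np.argmin` here.

```python
words = ['advent', 'chimney', 'elf', 'family', 'fireplace',
         'gingerbread', 'mistletoe', 'ornament', 'reindeer', 'scrooge']
# Scores copied from the step 1 output table (rounded as displayed)
scores = [145673000.0, 167921.2, 29595270.0, 11957570.0, 279026.6,
          311276.7, 333951.1, 66274.63, 227734.9, 9039.643]

# Index of the lowest perplexity, equivalent to np.argmin(scores)
best_idx = min(range(len(scores)), key=scores.__getitem__)
first_word = words[best_idx]

# Everything except the chosen word stays available for the next step
remaining = words[:best_idx] + words[best_idx + 1:]

print(first_word)
print(remaining)
```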

step3. Calculate the perplexity score of each two-word sequence formed by appending one of the remaining words as the second word.

In [6]:
candidates = [intermediate_solution + ' ' + word for word in words]
scores = scorer.get_perplexity(candidates)
pd.DataFrame({'text_candidate':candidates, 'score':scores})

Unnamed: 0,text_candidate,score
0,scrooge advent,62259.256371
1,scrooge chimney,51614.736408
2,scrooge elf,28062.445433
3,scrooge family,7320.526276
4,scrooge fireplace,196319.752648
5,scrooge gingerbread,57580.399367
6,scrooge mistletoe,3918.395352
7,scrooge ornament,18840.234595
8,scrooge reindeer,50419.082327


step4. Fix the word that gives the best-scoring sequence as the second word.

In [7]:
words

['advent',
 'chimney',
 'elf',
 'family',
 'fireplace',
 'gingerbread',
 'mistletoe',
 'ornament',
 'reindeer']

In [8]:
# Select the text with the best (lowest) score
intermediate_solution = candidates[np.argmin(scores)]

# Remove the chosen word from the list of remaining words
used_word = words[np.argmin(scores)]
words.remove(used_word)

In [9]:
intermediate_solution

'scrooge mistletoe'

In [10]:
words

['advent',
 'chimney',
 'elf',
 'family',
 'fireplace',
 'gingerbread',
 'ornament',
 'reindeer']

step5. Repeat steps 3-4, growing the sequence by one word each time, until no words remain.

In [11]:
while len(words)>0:
    candidates = [intermediate_solution + ' ' + word for word in words]
    scores = scorer.get_perplexity(candidates)
    display(pd.DataFrame({'text_candidate':candidates, 'score':scores}))

    # Select the text with the best (lowest) score
    intermediate_solution = candidates[np.argmin(scores)]

    # Remove the chosen word from the list of remaining words
    used_word = words[np.argmin(scores)]
    words.remove(used_word)

Unnamed: 0,text_candidate,score
0,scrooge mistletoe advent,5928.342844
1,scrooge mistletoe chimney,11516.804401
2,scrooge mistletoe elf,4510.045405
3,scrooge mistletoe family,4270.025167
4,scrooge mistletoe fireplace,13050.249091
5,scrooge mistletoe gingerbread,6876.99801
6,scrooge mistletoe ornament,2640.982405
7,scrooge mistletoe reindeer,4726.487003


Unnamed: 0,text_candidate,score
0,scrooge mistletoe ornament advent,4203.824562
1,scrooge mistletoe ornament chimney,9473.464891
2,scrooge mistletoe ornament elf,4876.521806
3,scrooge mistletoe ornament family,3949.127708
4,scrooge mistletoe ornament fireplace,6310.688108
5,scrooge mistletoe ornament gingerbread,5397.817585
6,scrooge mistletoe ornament reindeer,5482.820844


Unnamed: 0,text_candidate,score
0,scrooge mistletoe ornament family advent,3075.583751
1,scrooge mistletoe ornament family chimney,6511.010852
2,scrooge mistletoe ornament family elf,4106.443071
3,scrooge mistletoe ornament family fireplace,4800.918343
4,scrooge mistletoe ornament family gingerbread,4953.31585
5,scrooge mistletoe ornament family reindeer,4371.285895


Unnamed: 0,text_candidate,score
0,scrooge mistletoe ornament family advent chimney,4303.515389
1,scrooge mistletoe ornament family advent elf,3623.923291
2,scrooge mistletoe ornament family advent firep...,3004.33793
3,scrooge mistletoe ornament family advent ginge...,3768.283983
4,scrooge mistletoe ornament family advent reindeer,4405.570315


Unnamed: 0,text_candidate,score
0,scrooge mistletoe ornament family advent firep...,2330.658792
1,scrooge mistletoe ornament family advent firep...,2589.901181
2,scrooge mistletoe ornament family advent firep...,2946.228775
3,scrooge mistletoe ornament family advent firep...,2693.071115


Unnamed: 0,text_candidate,score
0,scrooge mistletoe ornament family advent firep...,1836.51494
1,scrooge mistletoe ornament family advent firep...,2480.97337
2,scrooge mistletoe ornament family advent firep...,1954.959977


Unnamed: 0,text_candidate,score
0,scrooge mistletoe ornament family advent firep...,1639.823265
1,scrooge mistletoe ornament family advent firep...,1441.497092


Unnamed: 0,text_candidate,score
0,scrooge mistletoe ornament family advent firep...,1327.96935


In [12]:
print('score', min(scores))
print('text', intermediate_solution)

score 1327.9693500653907
text scrooge mistletoe ornament family advent fireplace chimney elf reindeer gingerbread


The greedy method has produced a solution. For `id=0`, the number of text evaluations is $\sum_{i=1}^{10}i = 55$.<br>
Even for `id=5`, which has 100 words, the number of evaluations is $\sum_{i=1}^{100}i = 5050$, which is a realistic amount of computation.<br>
Now let's solve all the remaining problems in the same way.
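The evaluation counts follow from the closed form $n(n+1)/2$. The short sketch below reproduces them and, for contrast, prints the number of orderings an exhaustive search would face for 10 distinct words ($10!$). The helper name `greedy_evaluations` is introduced here for illustration only.

```python
from math import factorial


def greedy_evaluations(n: int) -> int:
    """Number of scored candidates: n + (n - 1) + ... + 1."""
    return n * (n + 1) // 2


for n in (10, 100):
    print(n, "words ->", greedy_evaluations(n), "evaluations")

# Exhaustive search over 10 distinct words, for comparison
print("10! =", factorial(10))
```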

In [13]:
def greedy1(text):
    """Build a word order by repeatedly appending the word that gives the lowest perplexity."""
    words = text.split()
    intermediate_solution = ''
    while len(words)>0:
        # lstrip() removes the leading space that appears while intermediate_solution is still empty
        candidates = [(intermediate_solution + ' ' + word).lstrip() for word in words]
        scores = scorer.get_perplexity(candidates)
        intermediate_solution = candidates[np.argmin(scores)]
        used_word = words[np.argmin(scores)]
        words.remove(used_word)

    return min(scores), intermediate_solution

In [14]:
from tqdm import tqdm
# id=0 was already solved step by step above, so reuse that result
scores = [min(scores)]
solutions = [intermediate_solution]
# desc='id' sets the label shown as the prefix of the tqdm progress bar
for i in tqdm(range(1, 6), desc='id'):
    text = sample_submission.loc[i,'text']
    # Named best_score so the metric function score() defined earlier is not overwritten
    best_score, solution = greedy1(text)
    scores.append(best_score)
    solutions.append(solution)

sample_submission['greedy1_text'] = solutions
sample_submission['greedy1_score'] = scores
sample_submission

id: 100%|██████████| 5/5 [14:24<00:00, 172.92s/it]


Unnamed: 0,id,text,greedy1_text,greedy1_score
0,0,advent chimney elf family fireplace gingerbrea...,scrooge mistletoe ornament family advent firep...,1327.96935
1,1,advent chimney elf family fireplace gingerbrea...,scrooge mistletoe ornament and reindeer family...,2017.017068
2,2,yuletide decorations gifts cheer holiday carol...,yuletide gifts grinch ornament nutcracker deco...,967.775366
3,3,yuletide decorations gifts cheer holiday carol...,yuletide gifts unwrap holiday cheer the nutcra...,762.588639
4,4,hohoho candle poinsettia snowglobe peppermint ...,eggnog fruitcake poinsettia snowglobe wreath c...,283.847607
5,5,advent chimney elf family fireplace gingerbrea...,eggnog yuletide poinsettia mistletoe fruitcake...,128.943587


In [15]:
print('Greedy 1 Average Perplexity:', np.mean(scores))

Greedy 1 Average Perplexity: 914.6902695695588


# 2-2. Improving the Greedy Algorithm
Now that we have implemented the basic greedy algorithm, let's consider an idea for improving the score further.<br>
After fixing the first word, we always placed the next word to the right of the current sequence. Why not also evaluate placing it to the left?<br>
The number of evaluations roughly doubles, to about 100 for `id=0` ($2 \times 55 = 110$) and about 10,000 for `id=5` ($2 \times 5050 = 10100$).

In [16]:
def greedy2(text):
    words = text.split()
    intermediate_solution = ''
    while len(words)>0:
        candidates = [(intermediate_solution + ' ' + word).lstrip() for word in words]
        candidates += [(word + ' ' + intermediate_solution).rstrip() for word in words] # added relative to greedy1: also try each word on the left
        scores = scorer.get_perplexity(candidates)
        intermediate_solution = candidates[np.argmin(scores)]
        used_word = words[np.argmin(scores) % len(words)] # changed from greedy1: modulo maps left-side candidates back to their word
        words.remove(used_word)

    return min(scores), intermediate_solution
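The `% len(words)` in `greedy2` works because the candidate list is the right-side candidates followed by the left-side candidates, both in the same word order. The sketch below shows that mapping on made-up data (the partial solution `"family"` and the three words are arbitrary examples, and no model is involved).

```python
words = ["elf", "advent", "chimney"]
partial = "family"  # arbitrary example of a partial solution

# Same layout as in greedy2: first all right-side placements, then all left-side ones
candidates = [partial + " " + w for w in words]
candidates += [w + " " + partial for w in words]

mapped = []
for idx, cand in enumerate(candidates):
    word = words[idx % len(words)]          # index into the remaining words
    side = "right" if idx < len(words) else "left"
    mapped.append((cand, word, side))
    print(idx, repr(cand), "->", word, side)
```

Indices 0 to 2 are the right-side placements and indices 3 to 5 are the left-side ones, so index 4 refers to the same word as index 1.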

In [17]:
scores = []
solutions = []
for i in tqdm(range(6), desc='id'):
    text = sample_submission.loc[i,'text']
    # Named best_score so the metric function score() defined earlier is not overwritten
    best_score, solution = greedy2(text)
    scores.append(best_score)
    solutions.append(solution)

sample_submission['greedy2_text'] = solutions
sample_submission['greedy2_score'] = scores
sample_submission.to_csv('greedy.csv', index=False)

id: 100%|██████████| 6/6 [29:02<00:00, 290.47s/it]


Let's now compare these scores with those of the first greedy method.

In [18]:
sample_submission[['id', 'greedy1_score', 'greedy2_score']]

Unnamed: 0,id,greedy1_score,greedy2_score
0,0,1327.96935,708.039583
1,1,2017.017068,710.810771
2,2,967.775366,750.765798
3,3,762.588639,884.618298
4,4,283.847607,309.319586
5,5,128.943587,135.131719


In [19]:
sample_submission[['greedy1_score', 'greedy2_score']].mean()

greedy1_score    914.690270
greedy2_score    583.114292
dtype: float64

For the first three problems, the improved greedy method scored better; for the last three, the original method did.<br>
Because the greedy algorithm fixes the solution one word at a time without looking ahead, it does not always find the globally optimal solution, but it does find a reasonable solution with a realistic amount of search.<br>
You may want to try a different variant, for example one that places words only on the left side.
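As one possible variation (not part of the original notebook), each remaining word could be tried at every position of the partial sequence rather than only at the ends. The sketch below uses a hypothetical `toy_score` in place of perplexity so that it runs without the model; swapping in the real scorer would multiply the number of evaluations considerably.

```python
from typing import Callable, List, Tuple


def toy_score(text: str) -> float:
    """Hypothetical stand-in for perplexity (lower is better), NOT the real metric."""
    words = text.split()
    return float(sum(a > b for a, b in zip(words, words[1:])))


def greedy_insert(words: List[str], score_fn: Callable[[str], float]) -> Tuple[float, str]:
    """Greedily insert one word per step at whichever position scores lowest."""
    remaining = list(words)
    placed: List[str] = []
    best = float("inf")
    while remaining:
        best = float("inf")
        best_move = None
        for i, word in enumerate(remaining):
            for pos in range(len(placed) + 1):
                candidate = placed[:pos] + [word] + placed[pos:]
                s = score_fn(" ".join(candidate))
                if best_move is None or s < best:
                    best, best_move = s, (i, candidate)
        i, placed = best_move
        remaining.pop(i)
    return best, " ".join(placed)


best, text = greedy_insert(["elf", "advent", "chimney"], toy_score)
print(best, text)
```

On this toy input the insertion variant can repair an unlucky first choice, because later words may be placed in front of or between earlier ones.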

# 2-3. Submission
Finally, let's submit the solutions.<br>
For each id, the text with the better (lower) score is saved as the answer.

In [20]:
sub = sample_submission.copy()
sub.loc[sub['greedy1_score']<sub['greedy2_score'], 'text'] = sub.loc[sub['greedy1_score']<sub['greedy2_score'], 'greedy1_text']
sub.loc[sub['greedy1_score']>sub['greedy2_score'], 'text'] = sub.loc[sub['greedy1_score']>sub['greedy2_score'], 'greedy2_text']
sub = sub[['id','text']]
sub.to_csv('submission.csv', index=False)
sub

Unnamed: 0,id,text
0,0,reindeer scrooge mistletoe elf gingerbread orn...
1,1,reindeer the scrooge mistletoe elf gingerbread...
2,2,ornament yuletide holiday decorations gifts nu...
3,3,yuletide gifts unwrap holiday cheer the nutcra...
4,4,eggnog fruitcake poinsettia snowglobe wreath c...
5,5,eggnog yuletide poinsettia mistletoe fruitcake...


`sub.loc[sub['greedy1_score']<sub['greedy2_score'], 'text'] = sub.loc[sub['greedy1_score']<sub['greedy2_score'], 'greedy1_text']`<br>
This selects the rows where `greedy1_score` is lower than `greedy2_score` (that is, where greedy1 scored better) and assigns the value of the `greedy1_text` column to the `text` column of those rows.<br>

`sub.loc[sub['greedy1_score']>sub['greedy2_score'], 'text'] = sub.loc[sub['greedy1_score']>sub['greedy2_score'], 'greedy2_text']`<br>
This selects the rows where `greedy1_score` is higher than `greedy2_score` (that is, where greedy2 scored better) and assigns the value of the `greedy2_text` column to the `text` column of those rows.<br>

`sub = sub[['id','text']]`<br>
This keeps only the `id` and `text` columns needed for the final output and stores them as the new `sub` DataFrame.
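The selection rule can be replayed on the scores reported in the comparison table, without pandas or the model. This is only a sketch: the values are copied as displayed, and `<=` is used so that a tie would fall back to greedy1. The two strict comparisons in the notebook would leave the original sample text in place on a tie, though no tie occurs in this data.

```python
# Scores copied from the greedy1 / greedy2 comparison table (ids 0 to 5)
greedy1 = [1327.96935, 2017.017068, 967.775366, 762.588639, 283.847607, 128.943587]
greedy2 = [708.039583, 710.810771, 750.765798, 884.618298, 309.319586, 135.131719]

# Lower perplexity wins; ties go to greedy1
chosen = ["greedy1" if g1 <= g2 else "greedy2" for g1, g2 in zip(greedy1, greedy2)]
best = [min(g1, g2) for g1, g2 in zip(greedy1, greedy2)]
mean_best = sum(best) / len(best)

print(chosen)
print(round(mean_best, 4))
```

Taking the better method per id gives a lower mean than either method alone, which is why the submission mixes the two.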

In the next tutorial, I will introduce beam search, which builds on the greedy algorithm to improve the score further.