# Gemma and Qwen7B

This notebook uses two models, Qwen7B and Gemma, to classify the posts.

It then ensembles the two predictions. I don't yet know the best way to ensemble the two models, so for now we average their log-probabilities.
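One alternative to averaging raw log-probabilities is to renormalize each model's Yes/No pair into a probability and take a (possibly weighted) mean. The sketch below is only an illustration of that idea: the helper names, the `weights` parameter and the toy numbers are assumptions and are not used elsewhere in this notebook.

```python
import math


def yes_probability(logprobs):
    """Renormalize one model's Yes/No log-probabilities into P(Yes)."""
    p_yes = math.exp(logprobs.get("Yes", -math.inf))
    p_no = math.exp(logprobs.get("No", -math.inf))
    total = p_yes + p_no
    # Fall back to 0.5 if neither token was returned.
    return p_yes / total if total > 0 else 0.5


def ensemble_yes_probability(per_model_logprobs, weights=None):
    """Weighted arithmetic mean of P(Yes) across models (hypothetical helper)."""
    if weights is None:
        weights = [1.0] * len(per_model_logprobs)
    weighted = sum(w * yes_probability(lp) for w, lp in zip(weights, per_model_logprobs))
    return weighted / sum(weights)


# Toy log-probabilities for a single row (made-up values).
qwen_row = {"Yes": math.log(0.8), "No": math.log(0.2)}
gemma_row = {"Yes": math.log(0.4), "No": math.log(0.6)}

equal_mix = ensemble_yes_probability([qwen_row, gemma_row])
qwen_heavy = ensemble_yes_probability([qwen_row, gemma_row], weights=[3.0, 1.0])
```

With equal weights this is a plain average of the two probabilities. Unequal weights would let a stronger model count for more, but suitable values would have to be tuned on held-out data.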


Credits: Kudos to the authors of these notebooks. Thanks for sharing:
- https://www.kaggle.com/code/yangyefd/batch-gemma3-sample-rules-classification
- https://www.kaggle.com/code/aerdem4/jigsaw-acrc-qwen7b-finetune-logits-processor-zoo

## How to install dependencies

Interactive notebooks: Click Add-Ons->Install Dependencies, then Run.

Submission: Dependencies are installed automatically.

## Prepare Qwen7B and Gemma

In [52]:
import vllm, torch
from logits_processor_zoo.vllm import MultipleChoiceLogitsProcessor


In [53]:
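def predict_with_model(df, model_path):
    """Load the model at model_path and return, for each row of df,
    a dict mapping the answer token ("Yes"/"No") to its log-probability."""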
    model = vllm.LLM(
        model_path,
        tensor_parallel_size=torch.cuda.device_count(),
        gpu_memory_utilization=0.95,
        trust_remote_code=True,
        dtype="half",
        enforce_eager=True,
        max_model_len=4096,
        disable_log_stats=True,
        enable_prefix_caching=True
    )
    tokenizer = model.get_tokenizer()


    SYS_PROMPT = """
    You are given a comment on reddit. Your task is to classify if it violates the given rule. Only respond Yes/No.
    """
    
    prompts = []
    for i, row in df.iterrows():
        text = f"""
    r/{row.subreddit}
    Rule: {row.rule}
    
    1) {row.positive_example_1}
    Violation: Yes
    
    2) {row.negative_example_1}
    Violation: No
    
    3) {row.negative_example_2}
    Violation: No
    
    4) {row.positive_example_2}
    Violation: Yes
    
    5) {row.body}
    """
        if not prompts:
            print(text)  # show only the first prompt as a sanity check
        
        messages = [
            {"role": "system", "content": SYS_PROMPT},
            {"role": "user", "content": text}
        ]
    
        prompt = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=False,
        ) + "Violation: "
        prompts.append(prompt)
    
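    # Steer the single generated token towards the two answer choices.
    mclp = MultipleChoiceLogitsProcessor(tokenizer, choices=["Yes", "No"])

    outputs = model.generate(
        prompts,
        vllm.SamplingParams(
            seed=0,
            skip_special_tokens=True,
            max_tokens=1,  # only the Yes/No token is needed
            logits_processors=[mclp],
            logprobs=len(mclp.choices),  # request the top-2 log-probabilities
        ),
        use_tqdm=True,
    )

    # For each prompt, map the decoded token to its log-probability.
    logprobs = []
    for output in outputs:
        first_token_logprobs = output.outputs[0].logprobs[0].values()
        logprobs.append({lp.decoded_token: lp.logprob for lp in first_token_logprobs})
    return logprobs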
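To make the returned values easier to interpret, here is a toy illustration of what constraining the answer to a fixed set of tokens means: keep only the logits of the allowed choices and normalize over them. This is a sketch of the general idea with made-up logits. It is not the implementation used by `logits_processor_zoo` or vLLM, and `choice_logprobs` is a hypothetical helper.

```python
import math


def choice_logprobs(logits, choices=("Yes", "No")):
    """Log-softmax computed over the allowed choice tokens only."""
    kept = {tok: logits[tok] for tok in choices if tok in logits}
    peak = max(kept.values())  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in kept.values()))
    return {tok: v - log_norm for tok, v in kept.items()}


# Made-up next-token logits; "Maybe" scores highest but is not an allowed choice.
toy_logits = {"Yes": 2.0, "No": 0.5, "Maybe": 3.0, "The": 1.0}
restricted = choice_logprobs(toy_logits)
```

The result has the same shape as one element of the list returned by `predict_with_model`: a dict from answer token to log-probability.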

## Prepare training data and examples

In [54]:
import os
import pandas as pd

# Use the test set during a competition rerun, otherwise the training set.
data_path = (
    "/kaggle/input/jigsaw-agile-community-rules/test.csv"
    if os.getenv("KAGGLE_IS_COMPETITION_RERUN")
    else "/kaggle/input/jigsaw-agile-community-rules/train.csv"
)

df = pd.read_csv(data_path)
df.head(10)

Unnamed: 0,row_id,body,rule,subreddit,positive_example_1,positive_example_2,negative_example_1,negative_example_2,rule_violation
0,0,Banks don't want you to know this! Click here ...,"No Advertising: Spam, referral links, unsolici...",Futurology,If you could tell your younger self something ...,hunt for lady for jack off in neighbourhood ht...,Watch Golden Globe Awards 2017 Live Online in ...,"DOUBLE CEE x BANDS EPPS - ""BIRDS""\n\nDOWNLOAD/...",0
1,1,SD Stream [ ENG Link 1] (http://www.sportsstre...,"No Advertising: Spam, referral links, unsolici...",soccerstreams,[I wanna kiss you all over! Stunning!](http://...,LOLGA.COM is One of the First Professional Onl...,#Rapper \n🚨Straight Outta Cross Keys SC 🚨YouTu...,[15 Amazing Hidden Features Of Google Search Y...,0
2,2,Lol. Try appealing the ban and say you won't d...,No legal advice: Do not offer or request legal...,pcmasterrace,Don't break up with him or call the cops. If ...,It'll be dismissed: https://en.wikipedia.org/w...,Where is there a site that still works where y...,Because this statement of his is true. It isn'...,1
3,3,she will come your home open her legs with an...,"No Advertising: Spam, referral links, unsolici...",sex,Selling Tyrande codes for 3€ to paypal. PM. \n...,tight pussy watch for your cock get her at thi...,NSFW(obviously) http://spankbang.com/iy3u/vide...,Good News ::Download WhatsApp 2.16.230 APK for...,1
4,4,code free tyrande --->>> [Imgur](http://i.imgu...,"No Advertising: Spam, referral links, unsolici...",hearthstone,wow!! amazing reminds me of the old days.Well...,seek for lady for sex in around http://p77.pl/...,must be watch movie https://sites.google.com/s...,We're streaming Pokemon Veitnamese Crystal RIG...,1
5,5,I live in the US it's it possible to get in tr...,No legal advice: Do not offer or request legal...,soccerstreams,Public School? You are under no legal obligati...,"OP, you need to get the fuck away from your bo...",It looks like it could be a sterile cotton swa...,That is called battery. Two wrongs don't make...,0
6,6,young kitty watching for your buddy get her he...,"No Advertising: Spam, referral links, unsolici...",sex,Save on Medicine!! Save over $700 a month on ...,Make your life comfortable. Get up to 15% Disc...,They have nothing on the platypus though goo.g...,Try and see if someone at www.siddhantayoga.co...,0
7,7,[liệt dương](http://namkhoathientam.com/nguyen...,"No Advertising: Spam, referral links, unsolici...",gifs,EARN MONEY in online . Just Sign up and View f...,You can use www.easy-lol.com/probuilds/\n\nIt ...,HD | [English Stream](http://www.ufc187livestr...,* **SD - http://livestreamnba.ru/2016/12/19/pr...,0
8,8,"Maybe true, but that's very short-sighted. The...",No legal advice: Do not offer or request legal...,The_Donald,"OP, you need to get the fuck away from your bo...",Steal the dogs back and put a lean on all her ...,Is this 100% legal tho? Are their any copyrigh...,"If you masturbate before the age of 18, you're...",1
9,9,you can sue them for negligence and try and re...,No legal advice: Do not offer or request legal...,legaladvice,"IIRC the laws require photo id, and social sec...",Tell them you want to take possession of the a...,That is called battery. Two wrongs don't make...,"Heard you might have their address, it could b...",1


## Predict / Apply MultiChoice processor

In [None]:
qwen7b_model_path = "/kaggle/input/jigsaw-acrc-qwen7b-v01"
qwen7b_outputs = predict_with_model(df, qwen7b_model_path)
qwen7b_outputs

INFO 07-25 10:52:12 [config.py:1604] Using max model len 4096
INFO 07-25 10:52:12 [llm_engine.py:228] Initializing a V0 LLM engine (v0.10.0) with config: model='/kaggle/input/jigsaw-acrc-qwen7b-v01', speculative_config=None, tokenizer='/kaggle/input/jigsaw-acrc-qwen7b-v01', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=/kaggle/

2025-07-25 10:52:17.396446: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1753440737.419158     399 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1753440737.425863     399 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


INFO 07-25 10:52:22 [__init__.py:235] Automatically detected platform cuda.
(VllmWorkerProcess pid=399) INFO 07-25 10:52:23 [multiproc_worker_utils.py:226] Worker ready; awaiting tasks
(VllmWorkerProcess pid=399) INFO 07-25 10:52:24 [cuda.py:346] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=399) INFO 07-25 10:52:24 [cuda.py:395] Using XFormers backend.


[W725 10:52:36.832267644 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W725 10:52:46.503733379 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W725 10:52:54.610826222 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W725 10:52:57.226569539 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W725 10:53:08.261066281 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3


In [None]:
gemma_path = "/kaggle/input/gemma-7b-gguf/gemma-2-9b-it.Q5_K_M.gguf"
gemma_outputs = predict_with_model(df, gemma_path)

gemma_outputs

## Averaging predictions and creating a submission file


In [None]:
# Combine the Qwen and Gemma predictions by averaging their log-probabilities.
# A token missing from a model's output falls back to -999 (effectively zero probability).
final_preds = []
for q_lp, g_lp in zip(qwen7b_outputs, gemma_outputs):
    avg_yes = (q_lp.get("Yes", -999) + g_lp.get("Yes", -999)) / 2
    avg_no = (q_lp.get("No", -999) + g_lp.get("No", -999)) / 2
    final_preds.append("Yes" if avg_yes > avg_no else "No")

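# Format for Kaggle submission.
# The data has a row_id column (there is no example_id). The target column name
# and the 0/1 encoding are assumed to mirror rule_violation in train.csv;
# check sample_submission.csv before submitting.
submission_df = pd.DataFrame({
    "row_id": df["row_id"],
    "rule_violation": [int(pred == "Yes") for pred in final_preds]
})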

submission_df.to_csv("submission.csv", index=False)
print("Saved submission.csv")
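When the notebook runs on train.csv, the `rule_violation` column is available, so the Yes/No predictions can be sanity-checked against it. A minimal sketch, assuming predictions are "Yes"/"No" strings and labels are 0/1; the `accuracy` helper and the toy values are illustrative only.

```python
def accuracy(predictions, labels):
    """Fraction of "Yes"/"No" predictions that match 0/1 labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    hits = sum(int(pred == "Yes") == int(label) for pred, label in zip(predictions, labels))
    return hits / len(labels)


# Toy data: three of the four predictions match their labels.
toy_preds = ["Yes", "No", "No", "Yes"]
toy_labels = [1, 0, 1, 1]
toy_accuracy = accuracy(toy_preds, toy_labels)
```

In this notebook the call would look like `accuracy(final_preds, df["rule_violation"])`, which only works outside a competition rerun because test.csv is not expected to contain the label.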


In [None]:
import matplotlib.pyplot as plt

# Extract individual logprobs for plotting
qwen_yes = [lp.get("Yes", -999) for lp in qwen7b_outputs]
qwen_no  = [lp.get("No", -999) for lp in qwen7b_outputs]

gemma_yes = [lp.get("Yes", -999) for lp in gemma_outputs]
gemma_no  = [lp.get("No", -999) for lp in gemma_outputs]

avg_yes = [(qy + gy) / 2 for qy, gy in zip(qwen_yes, gemma_yes)]
avg_no  = [(qn + gn) / 2 for qn, gn in zip(qwen_no, gemma_no)]

# Plot histograms
plt.figure(figsize=(12, 6))
plt.hist(qwen_yes, bins=30, alpha=0.5, label='Qwen7B - Yes', density=True)
plt.hist(qwen_no, bins=30, alpha=0.5, label='Qwen7B - No', density=True)
plt.hist(gemma_yes, bins=30, alpha=0.5, label='Gemma - Yes', density=True)
plt.hist(gemma_no, bins=30, alpha=0.5, label='Gemma - No', density=True)
plt.hist(avg_yes, bins=30, alpha=0.6, label='Avg - Yes', histtype='step', linewidth=2, density=True)
plt.hist(avg_no, bins=30, alpha=0.6, label='Avg - No', histtype='step', linewidth=2, density=True)

plt.title("Histogram of Logprobs for 'Yes' and 'No' Predictions")
plt.xlabel("Log Probability")
plt.ylabel("Density")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
