# Mitigating Polarisation in Online Discussions Through Adaptive Moderation Techniques


## Model selection

We use quantized versions of the Llama-2-13B-chat variant. Earlier experiments with the Llama-2-13B base model yielded unsatisfactory results. The models are stored in the GGUF format used by the `llama.cpp` project, on which the high-level Python library is based.

The quantization method was chosen to preserve accuracy while keeping inference relatively fast. Model size is not a concern on Linux, because the model file is memory-mapped and loaded lazily through the file cache (see the comments in the code below). *If you intend to run this notebook on Windows or macOS, make sure the RAM can hold the whole model at once*.
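
The lazy loading mentioned above relies on memory mapping. The following is a minimal sketch of the idea using only Python's standard `mmap` module; it is not how `llama.cpp` is implemented. The helper `read_slice_lazily` and the temporary stand-in file are hypothetical and exist only for illustration.

```python
import mmap
import os
import tempfile


def read_slice_lazily(path: str, offset: int, length: int) -> bytes:
    """Map a file into memory and copy out only the requested bytes.

    The operating system pages data in on demand, so touching a small
    slice does not require reading the whole file into RAM.
    """
    with open(path, "rb") as f:
        # Length 0 maps the entire file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


# Stand-in for a model file: 4 magic bytes followed by 1 MiB of zeros.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"GGUF" + bytes(1024 * 1024))
    demo_path = tmp.name

header = read_slice_lazily(demo_path, 0, 4)
os.remove(demo_path)
print(header)
```

Only the pages that are actually accessed need to be resident, which is why the file on disk can be larger than what is comfortably held in RAM at once, at some cost in performance.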

The model was selected and downloaded from the [following HuggingFace repository](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF).

We use the `llama-cpp-python` library to run the model locally (not to be confused with the `pyllama-cpp` library).

In [None]:
%load_ext autoreload
%autoreload 2

import llama_cpp
import tasks.models
from tasks.actors import Actor, LlmActor
import tasks.conversation


MAX_TOKENS = 512
# see this for a discussion on ctx width for llama models
# https://github.com/ggerganov/llama.cpp/issues/194
CTX_WIDTH_TOKENS = 512 
MODEL_PATH = "/home/dimits/bin/llm_models/llama-2-13b.Q5_K_S.gguf"
RANDOM_SEED = 42
INFERENCE_THREADS = 4


llm = llama_cpp.Llama(
      model_path=MODEL_PATH,
      seed=RANDOM_SEED,
      n_ctx=CTX_WIDTH_TOKENS,
      n_threads=INFERENCE_THREADS,
      # will vary from machine to machine
      n_gpu_layers=11,
      # if run on Linux, model size does not matter since the model uses mmap for lazy loading
      # https://github.com/ggerganov/llama.cpp/discussions/638
      # lazy loading still carries some performance cost
      use_mmap=True,
      # the "llama-2" chat format leads to a well-known model collapse, so "alpaca" is used instead
      # https://www.reddit.com/r/LocalLLaMA/comments/17th1sk/cant_make_the_llamacpppython_respond_normally/
      chat_format="alpaca", 
      use_mlock=True, # ask the OS to keep the memory-mapped model in RAM if possible
      verbose=False,
)
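
The context window set by `n_ctx` covers the prompt and the generated tokens together, so with both constants at 512 the room left for a reply depends on how long the prompt is. The sketch below only illustrates that arithmetic; `completion_budget` is a hypothetical helper, not part of `llama-cpp-python`.

```python
def completion_budget(n_ctx: int, prompt_tokens: int, max_tokens: int) -> int:
    """Return how many tokens can still be generated inside the context window."""
    # Tokens left in the window once the prompt has been placed in it.
    remaining = max(n_ctx - prompt_tokens, 0)
    # The reply can use at most what is left, and never more than max_tokens.
    return min(max_tokens, remaining)


# Example: a 120-token prompt in a 512-token window with max_tokens=512.
budget = completion_budget(n_ctx=512, prompt_tokens=120, max_tokens=512)
print(budget)
```

In this notebook the prompt length could presumably be measured with `len(llm.tokenize(prompt.encode("utf-8")))` before calling the model.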

When using `create_completion()` instead of `create_chat_completion()`, the model refuses to answer at all once the prompt grows beyond a few sentences (see https://github.com/run-llama/llama_index/issues/8973).

The model is also extremely sensitive to the prompt template, frequently producing no output (see https://huggingface.co/TheBloke/Nous-Capybara-34B-GGUF/discussions/4).
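
To make the role of the template concrete, here is a minimal sketch of an Alpaca-style prompt and of how stop sequences such as `"###"` cut a reply short. The layout is an approximation of the commonly used Alpaca format and is not guaranteed to match what `chat_format="alpaca"` produces inside `llama-cpp-python`; both helper functions are hypothetical.

```python
def build_alpaca_prompt(system: str, instruction: str) -> str:
    """Assemble an Alpaca-style prompt (approximate layout, see lead-in)."""
    return f"{system}\n\n### Instruction:\n{instruction}\n\n### Response:\n"


def truncate_at_stop(text: str, stop: list[str]) -> str:
    """Cut the text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for marker in stop:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


prompt = build_alpaca_prompt(
    "You are an assistant who specializes in computer science.",
    "Describe what Linux is.",
)

# A made-up raw generation in which the model starts inventing a new prompt.
raw_output = "Linux is a kernel.\n### Instruction:\nDescribe Windows."
reply = truncate_at_stop(raw_output, ["###", "\n"])
print(reply)
```

Because `"###"` opens every section header in this layout, using it as a stop sequence ends generation as soon as the model begins writing its own instruction block.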

In [None]:
output_completion = llm("Q: You are an assistant who specializes in computer science. Describe what Linux is. A: ",
              max_tokens=32, 
              stop=["Q:", "\n"], 
              echo=True)
print(output_completion)

In [None]:
output_chat = llm.create_chat_completion(
      messages = [
        {
            "role": "system", 
            "content": "You are an assistant who specializes in computer science."
        },
        {
            "role": "user",
            "content": "Describe what Linux is."
        }
      ],
      max_tokens=MAX_TOKENS,
      # prevent model from making up its own prompts
      # may need tuning
      stop=["###", "\n"]
)
print(output_chat)

In [None]:
output_chat["choices"][0]["message"] # type: ignore
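
The chained indexing above fails with a `KeyError` or `IndexError` if the response has no choices or no message. Below is a small defensive sketch; `extract_reply` is a hypothetical helper, and `mock_response` is a hand-written dictionary that mimics the nested shape indexed above rather than real model output.

```python
from typing import Any


def extract_reply(response: dict[str, Any]) -> str:
    """Return the assistant's text, or an empty string if it is missing."""
    choices = response.get("choices") or []
    if not choices:
        return ""
    message = choices[0].get("message") or {}
    # The content may be absent or None; normalise to a stripped string.
    return (message.get("content") or "").strip()


# Hand-written stand-in with the same nesting as the chat completion result.
mock_response = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": " Linux is an open-source operating system kernel. ",
            },
            "finish_reason": "stop",
        }
    ]
}

reply_text = extract_reply(mock_response)
print(reply_text)
```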

## Setting up a conversation

We create our own playground, in which models pretending to be users take turns participating in the discussion. The design is based in part on [Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk](https://arxiv.org/abs/2401.05033); the difference is that instead of a client and an agent, we have three clients interacting with each other with no specific goal in mind.

Our playground consists of three parts: *Models*, *Actors* and the *Conversation*.
* **Models** are wrappers around the actual LLMs, so that LLM behavior can be tweaked freely without affecting the rest of our API.
* **Actors** are objects that define a prompt template and apply it to Models.
    * Actors could also be *human*, *IR-based models* or just *sophisticated random samplers* as seen in [DeliData: A dataset for deliberation in multi-party problem solving](https://arxiv.org/abs/2108.05271)
* The conversation is handled by the **ConversationManager**, which gives each Actor a turn to speak and records the history of the dialogue. It is also responsible for determining which parts of the conversation history are fed as context to each Actor.
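
The three parts described above can be sketched in a few lines. This is an illustrative toy, not the code in the `tasks` package: `CannedModel`, `SketchActor` and `SketchConversation` are hypothetical names, and the sketch assumes that `history_len` is the number of most recent messages passed as context and that `conv_len` is the number of rounds in which every actor speaks once.

```python
from collections import deque


class CannedModel:
    """Stand-in for an LLM wrapper: records prompts, returns numbered replies."""

    def __init__(self) -> None:
        self.prompts: list[str] = []

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return f"message {len(self.prompts)}"


class SketchActor:
    """Applies a prompt template to the recent history and queries the model."""

    TEMPLATE = "You are {name}.\nConversation so far:\n{history}\nYour reply:"

    def __init__(self, name: str, model: CannedModel) -> None:
        self.name = name
        self.model = model

    def speak(self, history: list[str]) -> str:
        prompt = self.TEMPLATE.format(name=self.name, history="\n".join(history))
        return self.model(prompt)


class SketchConversation:
    """Gives each actor a turn, logs everything, exposes only recent context."""

    def __init__(self, actors: list[SketchActor], history_len: int, conv_len: int) -> None:
        self.actors = actors
        self.conv_len = conv_len
        # Bounded deque: older messages fall out of the context automatically.
        self.context: deque[str] = deque(maxlen=history_len)
        self.log: list[str] = []

    def run(self) -> list[str]:
        for _ in range(self.conv_len):
            for actor in self.actors:
                entry = f"{actor.name}: {actor.speak(list(self.context))}"
                self.context.append(entry)
                self.log.append(entry)
        return self.log


toy_model = CannedModel()
toy_chat = SketchConversation(
    [SketchActor("alice", toy_model), SketchActor("bob", toy_model)],
    history_len=2,
    conv_len=3,
)
toy_log = toy_chat.run()
print("\n".join(toy_log))
```

Keeping the full log separate from the bounded context is what lets the manager record the whole dialogue while still limiting what each actor sees.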


In [None]:
model = tasks.models.LlamaModel(llm, max_out_tokens=MAX_TOKENS, seed=RANDOM_SEED)

In [None]:
userA_name = "Steve2001"
userB_name = "GeorgeBush78"
userA: Actor = LlmActor(model=model,
                              name=userA_name,
                              role="chat user",
                              attributes=["level-headed", "rational", "open-minded"],
                              context=f"Argue with {userB_name}. Claim that social media are harmful for mental health.",
                              instructions="")

userB: Actor = LlmActor(model=model,
                              name=userB_name,
                              role="chat user",
                              attributes=["uncompromising", "irrational", "angry"],
                              context=f"Argue with {userA_name} that social media are not harmful for mental health.",
                              instructions=f"Disagree with {userA_name}.")

moderator: Actor = LlmActor(model=model,
                            name="moderator01",
                            role="chat moderator",
                            attributes=["just", "cautious", "strict"],
                            context="Moderate a conversation about the impact of social media on mental health.",
                            instructions="Intervene if one user dominates or veers off-topic. Respond only if necessary.")

## The experiment

In [None]:
chat = tasks.conversation.ConversationManager([userB, userA], history_len=2, conv_len=3)
logs = chat.begin_conversation(verbose=True)