# LongLLaMA Code: Focused Transformer Training for Context Scaling
**LongLLaMA-Code-Instruct-7B is an LLM that builds on CodeLLaMA using FoT context extension and instruction tuning**.

It was created by taking [CodeLLaMA-7B](https://huggingface.co/codellama/CodeLlama-7b-hf), continuing pre-training with the [Focused Transformer (FoT)](https://arxiv.org/abs/2307.03170) method, and then instruction tuning the model.

This notebook is a demo of [LongLLaMA-Code-Instruct-7B](TODO).  
To create the model we have used the following datasets:
* [MathInstruct](https://huggingface.co/datasets/TIGER-Lab/MathInstruct)
* [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca)
* [ShareGPT-Processed](https://huggingface.co/datasets/zetavg/ShareGPT-Processed)

For more, see the [FoT paper](https://arxiv.org/abs/2307.03170) and the [GitHub repository](https://github.com/CStanKonrad/long_llama).  
Note that because we started from Meta's [CodeLLaMA-7B](https://huggingface.co/codellama/CodeLlama-7b-hf) model and used GPT outputs to fine-tune it, the model is intended for research purposes only.


**We provide the basic code for quantization that should suffice to run most of the demo parts on the free Colab GPU.**  
**However, the quantization code is not optimized and may result in reduced performance.**  
**Generation on the free Colab GPU may be slow, but processing the long input should take 1-2 minutes.**  
**Running this model without quantization may require an A100 40GB GPU.**  
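
To make the memory trade-off above concrete, here is a rough back-of-the-envelope sketch of how much space the weights alone occupy at different precisions. It assumes exactly 7 billion parameters (an approximation of "7B") and ignores activations, the FoT memory cache, and framework overhead, so actual GPU usage will be higher.

```python
# Rough weight-memory estimate; ignores activations, the FoT memory cache,
# and framework overhead, so real usage is higher.
N_PARAMS = 7e9  # assumption: "7B" taken as exactly 7 billion parameters

BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Return the size of the weights alone in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


estimates = {name: weight_memory_gb(N_PARAMS, b) for name, b in BYTES_PER_PARAM.items()}
for name, gb in estimates.items():
    print(f"{name:>8}: ~{gb:.1f} GB")
```

Under these assumptions, the 8-bit weights take about half the space of the `bfloat16` weights, which is why the quantized path is the default in this notebook.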

# Initial steps

In [None]:
!pip install --upgrade pip
!pip install transformers==4.33.2 sentencepiece accelerate


In [None]:
import gc
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, PreTrainedModel, PreTrainedTokenizer
from typing import List, Optional



In [None]:
MODEL_PATH = "syzymon/long_llama_code_7b_instruct"
TOKENIZER_PATH = MODEL_PATH
# to reduce GPU memory usage we will use reduced precision
TORCH_DTYPE = torch.bfloat16

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

print(device)

cuda


In [None]:
# To fit most of the demo parts on a single Google Colab GPU,
# we provide basic, unoptimized quantization code.
# Change to False to disable quantization.
QUANTIZED = True
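
To illustrate why quantization can reduce model quality, the sketch below applies a very simple absmax 8-bit scheme to a handful of made-up weight values and measures the round-trip error. This is only a didactic simplification: the notebook itself relies on bitsandbytes, which uses a more elaborate scheme, and the function names here are hypothetical.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.0123, -0.4567, 0.25, 0.9, -0.0031]  # made-up example values
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"largest round-trip error: {max_error:.5f}")
```

Each restored value can differ from the original by up to half a quantization step, and small weights lose proportionally more precision than large ones.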

In [None]:
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)

# unoptimized quantization code for running with free Colab GPU
def load_and_qunatize_model(num_bit: int, model_path):
    print(f"!!!!!WARNING!!!!! The model will be quantized to {num_bit} bits!\n"
          "This may affect the model performance!")

    !pip3 install huggingface_hub
    !pip3 install bitsandbytes
    !git clone https://github.com/CStanKonrad/long_llama.git
    !cp -r long_llama/src long_llama_code/
    from long_llama_code.modeling_longllama import LongLlamaForCausalLM
    from long_llama_code.configuration_longllama import LongLlamaConfig
    from transformers import AutoConfig
    from accelerate.utils import BnbQuantizationConfig
    from accelerate.utils import load_and_quantize_model
    from accelerate import init_empty_weights
    from huggingface_hub import snapshot_download, hf_hub_download


    cfg = LongLlamaConfig.from_pretrained(model_path)
    cfg.mem_attention_grouping = (1, 1024)
    with init_empty_weights():
        empty_model = LongLlamaForCausalLM(cfg)

    gc.collect()
    if num_bit == 8:
        weights_loc = hf_hub_download(repo_id=model_path, filename="quantized/pytorch_model_8bit.bin")
        bnb_quantization_config = BnbQuantizationConfig(load_in_8bit=True, llm_int8_threshold=6)
    elif num_bit == 4:
        # May run out of RAM on Colab
        weights_loc = snapshot_download(model_path)
        bnb_quantization_config = BnbQuantizationConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
    else:
        raise ValueError(f"{num_bit} quantization not supported.")

    gc.collect()
    model = load_and_quantize_model(empty_model, weights_location=weights_loc, bnb_quantization_config=bnb_quantization_config, device_map="auto")
    model.eval()
    return model

if not QUANTIZED:
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        torch_dtype=TORCH_DTYPE,
        device_map=device,
        trust_remote_code=True,
        # mem_attention_grouping is used
        # to trade speed for memory usage
        # for details, see the section Additional configuration
        # in the Github repository
        mem_attention_grouping=(1, 1024),
    )
    model.eval()
else:
    model = load_and_qunatize_model(8, MODEL_PATH)


# The demo

## Question answering on long documents
Here we show the ability of the model to answer questions about long documents.   
Because it has 7B parameters, it should perform better than the [3B-parameter model](https://colab.research.google.com/github/CStanKonrad/long_llama/blob/main/long_llama_instruct_colab.ipynb).


### Downloading and memory loading
Code for downloading files and loading them into model memory.  

In [None]:
import urllib.request
import tempfile
import shutil
import os


def get_file(url: str, main_file: Optional[str] = None):
    with tempfile.TemporaryDirectory() as tmp_dir:
        file_path = os.path.join(tmp_dir, "_paper.tmp")
        if main_file is not None:
            # we are dealing with an archive
            file_path += ".tar"

        urllib.request.urlretrieve(url, file_path)

        if main_file is not None:
            # we are dealing with an archive
            shutil.unpack_archive(file_path, tmp_dir)
            main_file_path = os.path.join(tmp_dir, main_file)
        else:
            main_file_path = file_path

        with open(main_file_path, "r") as f:
            data = f.read()

    return data


def get_paper(url: str, main_file: Optional[str] = None):
    return get_file(url, main_file)


def get_files(url_list: List[str]):
    data = []
    for url in url_list:
        data.append(get_file(url))

    data = "\n".join(data)
    return data


@torch.no_grad()
def load_to_memory(model: PreTrainedModel, tokenizer: PreTrainedTokenizer, text: str):
    tokenized_data = tokenizer(text, return_tensors="pt")
    input_ids = tokenized_data.input_ids
    input_ids = input_ids.to(model.device)
    torch.manual_seed(0)
    output = model(input_ids=input_ids)
    memory = output.past_key_values
    return memory


@torch.no_grad()
def generate_with_memory(model: PreTrainedModel, tokenizer: PreTrainedTokenizer, memory, prompt: str, temperature=0.2):
    tokenized_data = tokenizer(prompt, return_tensors="pt")
    input_ids = tokenized_data.input_ids
    input_ids = input_ids.to(model.device)

    streamer = TextStreamer(tokenizer, skip_prompt=False)

    new_memory = memory

    stop = False
    while not stop:
        output = model(input_ids, past_key_values=new_memory, last_context_length=3072)
        new_memory = output.past_key_values
        assert len(output.logits.shape) == 3
        assert output.logits.shape[0] == 1
        last_logit = output.logits[[0], [-1], :]
        dist = torch.distributions.Categorical(logits=last_logit / temperature)
        next_token = dist.sample()
        if next_token[0] == tokenizer.eos_token_id:
            streamer.put(next_token[None, :])
            streamer.end()
            stop = True
        else:
            input_ids = next_token[None, :]
            streamer.put(input_ids)


PROMPT_PREFIX = "You are an AI assistant. User will you give you a task. Your goal is to complete the task as faithfully as you can.\n\n"


def construct_question_prompt(question: str):
    prompt = (
        f"\nYou are an AI assistant. User will you give you a task. Your goal is to complete the task as faithfully as you can.\n\n"
        "Answer the question below using the information from the text above.\n"
        f"Question: {question}\nAnswer: "
    )
    return prompt


def ask_model(model: PreTrainedModel, tokenizer: PreTrainedTokenizer, prompt: str, memory, seed=0):
    # generate_with_memory tokenizes the prompt itself; here we only fix the seed
    torch.manual_seed(seed)
    generate_with_memory(model, tokenizer, memory, prompt)
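
`generate_with_memory` divides the last logits by `temperature` before sampling from a categorical distribution. The following standard-library sketch shows the effect of that division on a made-up three-token vocabulary; the helper names and the toy logits are illustrative assumptions and are not part of the notebook's code.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for a 3-token vocabulary
probs_t1 = softmax_with_temperature(toy_logits, 1.0)
probs_t02 = softmax_with_temperature(toy_logits, 0.2)
print([round(p, 3) for p in probs_t1])
print([round(p, 3) for p in probs_t02])

rng = random.Random(0)
samples = [sample_token(toy_logits, 0.2, rng) for _ in range(10)]
print(samples)
```

A temperature below 1 concentrates probability on the highest-scoring token, so the default of 0.2 makes the answers close to greedy decoding while still allowing some variation between seeds.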

In [None]:
try:
    del chatbot
except NameError:
    # chatbot has not been created yet
    pass
gc.collect()
torch.cuda.empty_cache()


### Questions about code
We download the instruction tuning files from the long_llama repository and ask the model questions about the implementation.
Each question is asked independently without updating the memory.

In [None]:
instruct_dp = get_files(
    [
        "https://raw.githubusercontent.com/CStanKonrad/long_llama/2c88620d0ec9c28e13b4c208be34ebac68b90e37/instruction_fine_tuning/arguments.py",
        "https://raw.githubusercontent.com/CStanKonrad/long_llama/2c88620d0ec9c28e13b4c208be34ebac68b90e37/instruction_fine_tuning/data_processing.py",
        "https://raw.githubusercontent.com/CStanKonrad/long_llama/2c88620d0ec9c28e13b4c208be34ebac68b90e37/instruction_fine_tuning/fine_tuning.py",
    ]
)
try:
    del fot_memory
except NameError:
    # fot_memory has not been created yet
    pass
gc.collect()
torch.cuda.empty_cache()
fot_memory = load_to_memory(model, tokenizer, PROMPT_PREFIX + instruct_dp)

In [None]:
prompt = construct_question_prompt("What is the purpose of this code?")
ask_model(model, tokenizer, prompt, fot_memory)

The purpose of this code is to train a LongLlamaForCausalLM model using the MixedTuneDataset and save the trained model and its parameters to a specified output directory.</s>


In [None]:
prompt = construct_question_prompt("What is used for preparing the data? Name the most important functions and classes.")
ask_model(model, tokenizer, prompt, fot_memory)

The most important functions and classes used for preparing the data are:

- Data args: This dataclass contains the arguments related to the data processing, such as the data type, data filter, and data path.
- Tokenization args: This dataclass contains the arguments related to the tokenization process, such as the pre-prompt text, the prompt field, the post-prompt text, and the post-response text.
- Data processor: This class is used for preparing the data, tokenizing the text, and padding the data.
- Data collator: This class is used for collating the data for training and evaluation.
- MixedTuneDataset: This class is used for creating a dataset for training and evaluation.

These classes and functions are used for preparing the data for training and evaluation.</s>


In [None]:
prompt = construct_question_prompt("Can you say something more about `tokenize_text_no_special_tokens`?")
ask_model(model, tokenizer, prompt, fot_memory)

In [None]:
prompt = construct_question_prompt("Can you say something more about `MixedTuneDataset`?")
ask_model(model, tokenizer, prompt, fot_memory)

In [None]:
prompt = construct_question_prompt("What are the main model configuration options? Enumerate them, use `` for their names.")
ask_model(model, tokenizer, prompt, fot_memory)

In [None]:
prompt = construct_question_prompt("What are the options to configure the data? Enumerate them, use `` for their names.")
ask_model(model, tokenizer, prompt, fot_memory)

### Questions about FoT paper
We download the FoT paper and ask basic questions about its content.  

In [None]:
try:
    del fot_memory
except NameError:
    # fot_memory has not been created yet
    pass
gc.collect()
torch.cuda.empty_cache()
fot_paper = get_paper(url="https://raw.githubusercontent.com/CStanKonrad/long_llama/main/assets/fot_paper.tar", main_file="fot_paper.tex")
if QUANTIZED:
    fot_paper = fot_paper[:50000]
fot_memory = load_to_memory(model, tokenizer, PROMPT_PREFIX + fot_paper)

In [None]:
prompt = construct_question_prompt("What is the paper above about?")
ask_model(model, tokenizer, prompt, fot_memory)

The paper above is about the Focused Transformer (\method{}) technique, which is a method for improving the context length in transformer models by endowing attention layers with access to an external memory. The paper discusses the design choices and ablations of the \method{} technique, as well as its effectiveness in handling distractions and extrapolating to longer contexts.</s>


In [None]:
prompt = construct_question_prompt("What method is introduced in the paper?")
ask_model(model, tokenizer, prompt, fot_memory)

In [None]:
prompt = construct_question_prompt("How is the 3B model called by the authors?")
ask_model(model, tokenizer, prompt, fot_memory)

The 3B model is called \largeModels{} in the text.</s>


In [None]:
prompt = construct_question_prompt("Name all six authors of the presented paper.")
ask_model(model, tokenizer, prompt, fot_memory)

In [None]:
prompt = construct_question_prompt("What is the distraction issue?")
ask_model(model, tokenizer, prompt, fot_memory)

In [None]:
prompt = construct_question_prompt('What are the three main contributions of the paper?')
ask_model(model, tokenizer, prompt, fot_memory)

The three main contributions of the paper are:
1. Identifying the \problem{} in long-context language models and proposing the \method{} technique to address it.
2. Demonstrating the effectiveness of \method{} in improving the context length of existing models and fine-tuning pre-trained models.
3. Providing ablation studies and additional analysis to support the effectiveness of \method{} in various scenarios.</s>


## Working with code

The base [CodeLLaMA-7B](https://huggingface.co/codellama/CodeLlama-7b-hf) model was trained on a large amount of code data.    
During FoT tuning, Python made up a significant portion of the data mixture.  
For instruction tuning, we used the [MathInstruct](https://huggingface.co/datasets/TIGER-Lab/MathInstruct) dataset, which contains both Chain of Thought and Program of Thought examples.  
As a result, the model can be used to manipulate code, including but not limited to:
* refactoring
* rewriting to other languages
* explaining  

However, as this model has only 7B parameters it can still make simple mistakes.  
Below we show some examples of how the model can be used.

### Code for the chat interface
Here, we provide the code for communicating with the model in an iterative way.

In [None]:
class ChatOutputBuffer:
    """
    Streams generated text to the output and truncates it
    once one of the specified stop sequences (stop_text)
    has been generated
    """

    def __init__(self, stop_text: List[str], tokenizer: PreTrainedTokenizer):
        self.tokenizer = tokenizer
        self.streamer = TextStreamer(tokenizer, skip_prompt=False)
        self.max_stop_seq = 0
        self.stop_seq = []
        for st in stop_text:
            self.stop_seq.append(st)
            self.max_stop_seq = max(self.max_stop_seq, len(st))

        self.output_buffer = np.empty((0,), dtype=np.int64)

    def reset_output_buffer(self):
        self.output_buffer = np.empty((0,), dtype=np.int64)

    def advance_output(self):
        beg = 0
        end = len(self.output_buffer) - self.max_stop_seq

        if end > beg:
            output = self.output_buffer[beg:end]
            self.streamer.put(output)
            self.output_buffer = self.output_buffer[end:]

    def flush_buffer(self):
        if len(self.output_buffer) > 0:
            self.streamer.put(self.output_buffer)
            self.output_buffer = self.output_buffer[len(self.output_buffer) :]
        self.streamer.end()

    def generation_too_long(self, text: str) -> int:
        end_requests = 0
        for st in self.stop_seq:
            if text.endswith(st):
                end_requests += 1
        return end_requests

    def update_buffer(self, next_tok: int) -> bool:
        assert isinstance(next_tok, int)

        array_next_tok = np.array([next_tok], dtype=np.int64)
        self.output_buffer = np.concatenate([self.output_buffer, array_next_tok], axis=0)

        suffix = self.output_buffer[-self.max_stop_seq :]
        decoded = self.tokenizer.decode(suffix)
        end_requests = self.generation_too_long(decoded)
        if end_requests > 0:
            decoded = self.tokenizer.decode(suffix[1:])
            while self.generation_too_long(decoded) == end_requests:
                suffix = suffix[1:]
                decoded = self.tokenizer.decode(suffix[1:])

            left_intact = len(self.output_buffer) - len(suffix)

            self.output_buffer = self.output_buffer[:left_intact]
            self.flush_buffer()
            return True

        self.advance_output()
        return False


class SimpleChatBot:
    def __init__(self, model: PreTrainedModel, tokenizer: PreTrainedTokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.prompt = "A chat between a user (denoted as USER:) and an artificial intelligence assistant (denoted as ASSISTANT:). The assistant gives helpful, detailed, and polite answers to the user's questions.\n\n"
        self.tokenized_prompt = self.tokenizer.encode(self.prompt, return_tensors="pt", add_special_tokens=False)
        self.tokenized_prompt = torch.concatenate(
            [torch.tensor([self.tokenizer.bos_token_id], dtype=torch.long).reshape(1, 1), self.tokenized_prompt],
            dim=-1,
        )
        self.model_name = "\nASSISTANT: "
        self.tokenized_model_name = self.tokenizer.encode(
            self.model_name, return_tensors="pt", add_special_tokens=False
        )
        self.user_name = "\nUSER: "
        self.tokenized_user_name = self.tokenizer.encode(self.user_name, return_tensors="pt", add_special_tokens=False)
        self.past_key_values = None

        self.t = 0.2
        self.output_buffer = ChatOutputBuffer(
            [self.model_name.strip(), self.user_name.strip(), self.tokenizer.eos_token], self.tokenizer
        )

    @torch.no_grad()
    def ask(self, text: str):
        input_ids = self.tokenizer.encode(text, return_tensors="pt", add_special_tokens=False)

        input_ids = torch.concatenate([self.tokenized_user_name, input_ids, self.tokenized_model_name], dim=-1)

        if self.past_key_values is None:
            input_ids = torch.concatenate([self.tokenized_prompt, input_ids], dim=-1)

        self.output_buffer.reset_output_buffer()
        output_text = self.model_name
        output_ids = self.tokenizer.encode(
            output_text, return_tensors="pt", add_special_tokens=self.past_key_values is None
        )
        self.output_buffer.streamer.put(output_ids)

        is_writing = True

        step_id = 0

        while is_writing:
            input_ids = input_ids.to(self.model.device)
            output = self.model(input_ids=input_ids, past_key_values=self.past_key_values)

            logits = output.logits
            assert len(logits.shape) == 3
            assert logits.shape[0] == 1
            last_logit = logits[[0], [-1], :]

            if step_id <= 2:
                # discourage ending the answer within the first few tokens
                last_logit[..., self.tokenizer.eos_token_id] = -1e4

            dist = torch.distributions.Categorical(logits=last_logit / self.t)
            next_token = dist.sample()
            # Note that parts of cut out text may remain in model memory
            # this is implemented in this way for performance reasons
            past_key_values = output.past_key_values
            assert len(next_token.shape) == 1
            should_stop = self.output_buffer.update_buffer(next_token[0].cpu().item())
            if should_stop:
                is_writing = False
            else:
                input_ids = next_token[None, :]
                self.past_key_values = past_key_values

            step_id += 1
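
The core idea behind `ChatOutputBuffer` is to hold back the most recent part of the output so that a stop sequence (such as `USER:`) can be cut off before it is displayed. The sketch below shows that idea on plain strings instead of token ids. It is a simplified illustration with a hypothetical class name, not a drop-in replacement for the class above.

```python
class HoldbackStream:
    """Emit text as it arrives, but hold back the last few characters so a
    stop sequence can be removed before it is ever shown."""

    def __init__(self, stop_sequences):
        self.stop_sequences = list(stop_sequences)
        self.holdback = max(len(s) for s in self.stop_sequences)
        self.pending = ""  # text received but not yet shown
        self.shown = ""    # text already emitted

    def push(self, piece: str) -> bool:
        """Add new text; return True once a stop sequence has been seen."""
        self.pending += piece
        hits = [i for i in (self.pending.find(s) for s in self.stop_sequences) if i != -1]
        if hits:
            # Show only the text before the earliest stop sequence
            self.shown += self.pending[: min(hits)]
            self.pending = ""
            return True
        # Everything except the last `holdback` characters is safe to show
        safe = len(self.pending) - self.holdback
        if safe > 0:
            self.shown += self.pending[:safe]
            self.pending = self.pending[safe:]
        return False

    def flush(self):
        """Show whatever is left when generation ends without a stop sequence."""
        self.shown += self.pending
        self.pending = ""


stream = HoldbackStream(["USER:", "</s>"])
for piece in ["Hello", " there", "!\nUS", "ER: next question"]:
    if stream.push(piece):
        break
else:
    stream.flush()
print(repr(stream.shown))
```

Because at least as many characters as the longest stop sequence are always held back, a stop sequence that arrives split across several pieces is still caught before any part of it is shown.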

### Ask to improve/refactor/rewrite
You can ask the model to improve the code.  
Note that this is only a 7B parameter model, so the results may contain errors.

In [None]:
try:
    del fot_memory
except NameError:
    # fot_memory has not been created yet
    pass
gc.collect()
torch.cuda.empty_cache()

chatbot = SimpleChatBot(model=model, tokenizer=tokenizer)
chatbot.ask('''
#include <bits/stdc++.h>

const int MAX_INPUT_LENGTH = 2*1000000;
int _preff[MAX_INPUT_LENGTH + 7];

void _some_func(char *_text, int _dlugosc)
{
    _preff[0] = 0;
    _preff[1] = 0;
    int p = 0;
    for (int i = 2; i <= _dlugosc; ++i)
    {
        while (p != 0 && _text[p + 1] != _text[i])
            p = _preff[p];

        if (_text[p + 1] == _text[i])
            ++p;

        _preff[i] = p;
    }
}

Can you change the variable and function names in the code above so that they are more descriptive?''')


<s> 
ASSISTANT:  Sure, here's the code with more descriptive variable and function names:

```c
#include <bits/stdc++.h>

const int MAX_INPUT_LENGTH = 2*1000000;
int prefix_array[MAX_INPUT_LENGTH + 7];

void compute_prefix_array(char *text, int length)
{
   prefix_array[0] = 0;
   prefix_array[1] = 0;
   int p = 0;
   for (int i = 2; i <= length; ++i)
   {
       while (p != 0 && text[p + 1] != text[i])
           p = prefix_array[p];
       
       if (text[p + 1] == text[i])
           ++p;
       
       prefix_array[i] = p;
   }
}
```
In this code, the `compute_prefix_array` function takes a `text` pointer and a `length` integer as input, and it populates the `prefix_array` array with the prefix array values for the given text. The `prefix_array` array is used to store the prefix values for each character in the text. The `p` variable is used to keep track of the current position in the text, and it is updated based on the prefix values in the `prefix_array` array.


In [None]:
chatbot.ask("Great! Can you rewrite it in Python? Change global arrays to local lists.")

When the code is simple, the model can even improve it.

In [None]:
chatbot = SimpleChatBot(model=model, tokenizer=tokenizer)
chatbot.ask('''
def is_prime(x):
    if x <= 1:
        return False
    for i in range(2, x):
        if x % i == 0:
            return False
    return True

I have written the code above but it is pretty slow.
Can you make it faster? Say O(sqrt(x)).''')

<s> 
ASSISTANT:  Sure, here's an implementation of the is_prime function that runs in O(sqrt(x)) time:

```python
def is_prime(x):
   if x <= 1:
       return False
   for i in range(2, int(x**0.5) + 1):
       if x % i == 0:
           return False
   return True
```
This implementation uses the fact that if a number x is prime, then any number less than or equal to x that is divisible by x is also a factor of x. Therefore, we can check for divisibility by numbers up to the square root of x without having to check all numbers up to x.

This implementation has a time complexity of O(sqrt(x)) because the loop iterates up to `int(x**0.5) + 1`, which is O(sqrt(x)).
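
Since the model's output may contain errors, it is worth checking rewritten code against the original. A minimal sanity check, assuming the two functions are copied verbatim from the prompt and from the answer above (renamed here to avoid a name clash), might look like this:

```python
def is_prime_slow(x):
    """Original O(x) version from the prompt."""
    if x <= 1:
        return False
    for i in range(2, x):
        if x % i == 0:
            return False
    return True


def is_prime_fast(x):
    """O(sqrt(x)) version as returned by the model."""
    if x <= 1:
        return False
    for i in range(2, int(x**0.5) + 1):
        if x % i == 0:
            return False
    return True


# Compare both versions on a range of small inputs, including non-positive ones
mismatches = [n for n in range(-5, 2000) if is_prime_slow(n) != is_prime_fast(n)]
print("mismatches:", mismatches)
```

Agreement on small inputs does not prove correctness in general. For very large integers, `int(x**0.5)` can be inexact because of floating-point rounding, and `math.isqrt(x)` is the exact alternative.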


In [None]:
chatbot.ask("Great! Can you rewrite this faster implementation in C++?")

In [None]:
chatbot.ask("Thanks! Can you add more comments?")

## Chat
We have also used the [ShareGPT-Processed](https://huggingface.co/datasets/zetavg/ShareGPT-Processed) dataset to enhance the model's conversational abilities. The chat prompt was inspired by [LongChat](https://github.com/DachengLi1/LongChat).
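
As a string-level view of the chat format used by `SimpleChatBot`, the sketch below assembles the text the model would see for a multi-turn conversation. The helper `build_chat_text` is hypothetical: the actual class tokenizes each piece separately, prepends the BOS token, and reuses cached key-value pairs instead of rebuilding the full text, so the token sequences may not match exactly.

```python
SYSTEM_PROMPT = (
    "A chat between a user (denoted as USER:) and an artificial intelligence "
    "assistant (denoted as ASSISTANT:). The assistant gives helpful, detailed, "
    "and polite answers to the user's questions.\n\n"
)
USER_TAG = "\nUSER: "
ASSISTANT_TAG = "\nASSISTANT: "


def build_chat_text(turns, next_user_message):
    """Build the conversation text.

    turns: list of (user_text, assistant_text) pairs already exchanged.
    next_user_message: the new user message awaiting an answer.
    """
    text = SYSTEM_PROMPT
    for user_text, assistant_text in turns:
        text += USER_TAG + user_text + ASSISTANT_TAG + assistant_text
    # The text ends with the assistant tag so the model continues as the assistant
    text += USER_TAG + next_user_message + ASSISTANT_TAG
    return text


chat_text = build_chat_text(
    turns=[("Hi!", "Hello! How can I help you today?")],
    next_user_message="Who are you?",
)
print(chat_text)
```
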

Feel free to try the chat yourself:

In [None]:
chatbot = SimpleChatBot(model=model, tokenizer=tokenizer)
while True:
    user_text = input("USER: ")
    chatbot.ask(user_text)

<s> 
ASSISTANT:  Hello! How can I help you today?

ASSISTANT:  I'm a computer program designed to assist and communicate with you in a friendly and helpful way. I'm here to answer your questions and provide information to the best of my abilities. Is there something specific you would like to know?

ASSISTANT:  I'm a computer program designed to assist and communicate with you in a friendly and helpful way. I'm here to answer your questions and provide information to the best of my abilities. Is there something specific you would like to know?

ASSISTANT:  I'm a computer program designed to assist and communicate with you in a friendly and helpful way. I'm here to answer your questions and provide information to the best of my abilities. Is there something specific you would like to know

ASSISTANT:  Conversational models are a type of artificial intelligence that are designed to mimic human conversation. There are several state-of-the-art techniques used to create conversational model