T-Ragx
=====================

#### Enhancing Translation with RAG-Powered Large Language Models

<br><br>

<p align="center">
  <picture>
    <img alt="T-Ragx Featured Image" src="https://raw.githubusercontent.com/rayliuca/T-Ragx/main/assets/featured_repo.png" height="300" style="max-width: 100%;">
  </picture>
  <br/>
  <br/>
</p>


### Imports

#### Install Packages

In [None]:
# ! wget https://raw.githubusercontent.com/rayliuca/T-Ragx/main/requirements.txt

# # the CMAKE_ARGS will enable CUDA support for llama-cpp, see the llama-cpp-python GitHub for more details
# ! CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install -r requirements.txt -qqqq
# ! pip install t-ragx

##### Suppress llama_cpp Logs

In [1]:
import ctypes

from llama_cpp import llama_log_set

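# A no-op callback: llama.cpp calls it for every log message and we discard it.
def null_callback(level, message, user_data):
    pass

# Wrap the Python function in a C function pointer; keep a reference in `log_callback`
# so the callback is not garbage-collected while llama.cpp still holds it.
log_callback = ctypes.CFUNCTYPE(None, ctypes.c_int, ctypes.c_char_p, ctypes.c_void_p)(null_callback)
llama_log_set(log_callback, ctypes.c_void_p())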

#### Import

In [2]:
import t_ragx
import glob
import logging

logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("elasticsearch").setLevel(logging.WARNING)
logging.getLogger("llama-cpp-python").setLevel(logging.WARNING)



### Set up the input processor

The input processor handles translation memory and glossary retrieval for us.

In [3]:
input_processor = t_ragx.processors.ElasticInputProcessor()

input_processor.load_general_glossary()
input_processor.load_general_translation(elastic_index="general_translation_memory", elasticsearch_host=["https://t-ragx-fossil.rayliu.ca", "https://t-ragx-fossil2.rayliu.ca"])

HEAD https://t-ragx-fossil2.rayliu.ca:443/general_translation_memory [status:200 duration:2.168s]


#### Let's try it out!

In [4]:
example = "ティラノサウルス（学名：genus Tyrannosaurus）は、約7,000万 - 約6,600万年前（中生代白亜紀末期マーストリヒチアン）の北アメリカ大陸（画像資料[注 1]）に生息していた肉食恐竜"

In [5]:
glossary_results = input_processor.batch_search_glossary([example], max_k=5, source_lang='ja', target_lang='en')
glossary_results

[{'ティラノサウルス': ['9951 Tyrannosaurus', 'Tyrannosaurus'],
  '北アメリカ': ['North America'],
  '白亜紀': ['Cretaceous'],
  '中生代': ['Mesozoic'],
  'ストリ': ['Stryj', 'Sutri']}]

In [6]:
memory_results = input_processor.search_memory([example], top_k=3, source_lang='ja', target_lang='en')
memory_results

[[{'score': 48.205418,
   'distance': 79,
   'ja': '巨大なティラノサウルスは8000万~6600万年前の後期白亜紀に繁栄しました。',
   'en': 'Gigantic tyrannosaurs thrived in the Late Cretaceous from 80-66 million years ago.',
   'normed_distance': 0.7745098039215687},
  {'score': 54.356842,
   'distance': 85,
   'ja': 'この恐竜は、白亜紀のカンパニアン期(約7100万~7500万年前)に現在のモンゴルにあたる地域で生息していた。',
   'en': 'Halszkaraptor escuilliei lived during the Campanian stage of the Cretaceous (about 71-75 million years ago) in what is now Mongolia. ',
   'normed_distance': 0.8333333333333334},
  {'score': 49.845387,
   'distance': 90,
   'ja': 'それは、白亜紀の終わりの6,600万年前に起こった。',
   'en': 'That happened 66m years ago, at the end of the Cretaceous period.',
   'normed_distance': 0.8823529411764706}]]
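
Each memory hit carries a `normed_distance` field. As a minimal sketch, assuming a lower `normed_distance` means a closer match to the query, the hits could be filtered before use. The helper `filter_memory_hits` and the `0.8` threshold are hypothetical and not part of T-Ragx; the sample data is a trimmed copy of the output above.

```python
# Hypothetical helper (not part of t_ragx): keep only the closer translation-memory hits.
def filter_memory_hits(hits, max_normed_distance=0.8):
    """Return hits at or below the threshold, closest first.

    Assumes that a lower `normed_distance` means a closer match.
    """
    kept = [h for h in hits if h["normed_distance"] <= max_normed_distance]
    return sorted(kept, key=lambda h: h["normed_distance"])


# Trimmed copy of the hits printed above (English side only).
sample_hits = [
    {"distance": 79, "normed_distance": 0.7745098039215687,
     "en": "Gigantic tyrannosaurs thrived in the Late Cretaceous from 80-66 million years ago."},
    {"distance": 85, "normed_distance": 0.8333333333333334,
     "en": "Halszkaraptor escuilliei lived during the Campanian stage of the Cretaceous (about 71-75 million years ago) in what is now Mongolia. "},
    {"distance": 90, "normed_distance": 0.8823529411764706,
     "en": "That happened 66m years ago, at the end of the Cretaceous period."},
]

close_hits = filter_memory_hits(sample_hits)
print([h["distance"] for h in close_hits])  # [79]
```

The threshold is arbitrary here; a sensible value would depend on the language pair and on how the distance is normalized.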

### Set up the models

We have a few options here. T-Ragx supports Hugging Face Transformers, llama-cpp-python, or an OpenAI- or Ollama-compatible API as backends.


In [7]:
# mistral_model = t_ragx.models.OllamaModel(host='localhost', port='11434', endpoint='/api/generate', model="t_ragx_beagle",
#                  protocol="http")

# mistral_model = t_ragx.models.OpenAIModel(host='localhost', port=11434, endpoint='/v1', model="t_ragx_mistral",
#                  protocol="http")

mistral_model = t_ragx.models.LlamaCppPythonModel(
    repo_id="rayliuca/TRagx-GGUF-Mistral-7B-Instruct-v0.2",
    filename="*Q4_K_M*",
    chat_format="mistral-instruct",
    model_config={'n_ctx':2048},
)


#### Put all the pieces together to create the full T-Ragx translator

In [8]:
t_ragx_translator = t_ragx.TRagx([mistral_model], input_processor=input_processor)

#### And Translate!

In [9]:
t_ragx_translator.batch_translate([example], source_lang_code='ja', target_lang_code='en', memory_search_args={'top_k':3}, generation_args=[{'max_tokens':2048}])

  0%|          | 0/1 [00:00<?, ?it/s]

['Tyrannosaurus (genus name) was a carnivorous theropod dinosaur that lived approximately 70-66 million years ago (Late Cretaceous, late Maastrichtian) on the North American continent (image data [note 1]).']

### Document Level Translation

Let's try translating a sample text from WMT23.

In [10]:
sample_text_list = [
    '徹底した感染対策／クラツー公式 - 助成金でお得旅',
    '助成金ツアー・県民割旅行ならクラブツーリズム！',
    '各都道府県毎の助成を活用したツアーをご紹介。',
    '密集・密閉・密接を避けた新しい旅の形をご提案。',
    '助成金ツアーで旅に出よう／クラブツーリズム。',
    'CTPは全国旅行支援と併用可・到着地: 沖縄, 関西, 北海道, 東北, 九州。'
]

#### Get preceding text using the helper utility

In [11]:
pre_text_list = t_ragx.utils.helper.get_preceding_text(sample_text_list)
pre_text_list

[[],
 ['徹底した感染対策／クラツー公式 - 助成金でお得旅'],
 ['徹底した感染対策／クラツー公式 - 助成金でお得旅', '助成金ツアー・県民割旅行ならクラブツーリズム！'],
 ['徹底した感染対策／クラツー公式 - 助成金でお得旅',
  '助成金ツアー・県民割旅行ならクラブツーリズム！',
  '各都道府県毎の助成を活用したツアーをご紹介。'],
 ['助成金ツアー・県民割旅行ならクラブツーリズム！',
  '各都道府県毎の助成を活用したツアーをご紹介。',
  '密集・密閉・密接を避けた新しい旅の形をご提案。'],
 ['各都道府県毎の助成を活用したツアーをご紹介。',
  '密集・密閉・密接を避けた新しい旅の形をご提案。',
  '助成金ツアーで旅に出よう／クラブツーリズム。']]
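
Judging from the output above, each sentence is paired with up to three sentences that come directly before it. The sliding window can be sketched in plain Python as follows. This is an illustration of the observed behaviour, not the library's actual implementation; the function name `preceding_window` and the parameter `max_sent` are hypothetical.

```python
# Hypothetical re-implementation of the sliding window seen in the output above.
def preceding_window(sentences, max_sent=3):
    """For each sentence, collect up to `max_sent` sentences that come right before it."""
    return [sentences[max(0, i - max_sent):i] for i in range(len(sentences))]


demo = ["s0", "s1", "s2", "s3", "s4", "s5"]
windows = preceding_window(demo)
for context in windows:
    print(context)
# []
# ['s0']
# ['s0', 's1']
# ['s0', 's1', 's2']
# ['s1', 's2', 's3']
# ['s2', 's3', 's4']
```

The first sentence has no context, and from the fourth sentence onward the window slides forward one sentence at a time.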

#### Translate!

In [12]:
t_ragx_translator.batch_translate(sample_text_list, pre_text_list=pre_text_list, source_lang_code='ja', target_lang_code='en', memory_search_args={'top_k':3})

  0%|          | 0/6 [00:00<?, ?it/s]

['Thorough infection prevention measures / Katsuura Official - Affordable travel with subsidies',
 "If it's a subsidized tour or a hometown trip, it's club tourism!",
 'We introduce tours that make use of subsidies from each prefecture.',
 'We propose new forms of travel that avoid crowded, enclosed and close-contact situations.',
 'Go on a tour with subsidized travel ② Club Tourism.',
 'CTP can be used for national travel support and can be used at the following destinations: Okinawa, Kansai, Hokkaido, Tohoku, and Kyushu.']

#### Note that translation is non-deterministic, so the output varies somewhat between runs

In [13]:
t_ragx_translator.batch_translate(sample_text_list, pre_text_list=pre_text_list, source_lang_code='ja', target_lang_code='en', memory_search_args={'top_k':3})

  0%|          | 0/6 [00:00<?, ?it/s]

['Thorough infection prevention measures / Katsuura Official - Affordable travel with subsidies',
 "If it's a subsidized tour or a citizen's tour, it's club tourism!",
 'Here is a tour that utilizes the grants available in each prefecture.',
 'We propose a new form of travel to avoid crowded and closed spaces and close contact.',
 'Go on a tour with subsidies! Club Tourism.',
 'CTP can be used for nationwide travel support and can be used for the following destinations: Okinawa, Kansai, Hokkaido, Tohoku, and Kyushu.']
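
To get a rough sense of how much two runs differ, the outputs can be compared segment by segment. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a simple character-level similarity; this is a swapped-in illustration, not a translation-quality metric and not something T-Ragx provides. The helper `run_similarity` is hypothetical, and the sample strings are copied from the two runs above.

```python
from difflib import SequenceMatcher

# First and third segments from the two runs shown above.
run_a = [
    "Thorough infection prevention measures / Katsuura Official - Affordable travel with subsidies",
    "We introduce tours that make use of subsidies from each prefecture.",
]
run_b = [
    "Thorough infection prevention measures / Katsuura Official - Affordable travel with subsidies",
    "Here is a tour that utilizes the grants available in each prefecture.",
]


def run_similarity(a, b):
    """Character-level similarity ratio in [0, 1] for each aligned pair of segments."""
    return [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]


scores = run_similarity(run_a, run_b)
for i, score in enumerate(scores):
    print(f"segment {i}: {score:.2f}")
```

A ratio of 1.0 means the two runs produced identical text for that segment; lower values flag segments worth inspecting by hand.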

### Expected Translations (TRagx-Mistral-7B-Instruct-v0.2 without quantization)
- Thorough infection countermeasures/Katsuura official - Affordable travel with subsidies
- If you're looking for a subsidized tour or a discounted trip for residents, then Club Tourism!
- Here are some tours that make use of the subsidies offered by each prefecture.
- We propose a new form of travel that avoids crowded, closed, and close contact.
- Go on a tour with subsidies/Club Tourism.
- CTP can be used for nationwide travel support and can be used in conjunction with other travel support programs. Arrival destinations: Okinawa, Kansai, Hokkaido, Tohoku, Kyushu.


### Reference Translation
- Thorough Infection Controls / Club Tourism Official - Subsidized Budget Travel
- Subsidized Tours - Club Tourism for Domestic Tourism Campaign!
- We provide tours that take advantage of subsidies in each prefecture.
- We suggest a new type of travel that avoids crowded places, closed spaces, and close contact.
- Take a Trip on a Subsidized Tour / Club Tourism.
- The Club Tourism Pass can be used with the national travel subsidy program - Destinations: Okinawa, Kansai, Hokkaido, Tohoku, Kyushu.
