This project aims to analyze the Vietnamese language to develop a faster typing method by implementing word prediction based on partial input. For instance, inputting only x0ch2 should yield xin chào as the predicted output.
Completeness: v7 is essentially a better VNI. Everything VNI can do, v7 can do as well, so you can input any possible Vietnamese word with v7.
Use the script below to try the v7 method!
- The Vietnamese language uses many diacritics, which makes typing in Vietnamese time-consuming.
- v7 aims to simplify Vietnamese typing by using only the initial consonant and tone to predict the intended words. For example, instead of typing tưởng tượng as tuong73 tuong75 (VNI) or tuongwr tuongwj (Telex), you can type t3t5 with v7!
- Naturally, this reduction in keystrokes leads to some information loss. For instance, the input t3t5 could also correspond to tiểu tiện, as 3 represents the hook tone (hỏi) and 5 represents the underdot tone (nặng).
- This project analyzes and addresses these problems to ultimately introduce v7, enhancing the Vietnamese typing experience.
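To make the input format concrete, here is a minimal sketch of how a v7 string such as t3t5 could be split into (letters, tone digit) units. It assumes each syllable is typed as a run of lowercase letters followed by exactly one tone digit from 0 to 7; the function name split_units is hypothetical and is not taken from the project's code.

```python
import re

# Assumption: one typed unit = one or more letters followed by one tone digit (0-7).
_UNIT = re.compile(r"([a-z]+)([0-7])")


def split_units(text: str) -> list[tuple[str, int]]:
    """Split a v7-style input such as 't3t5' into (letters, tone) pairs."""
    units = []
    for chunk in text.lower().split():
        pos = 0
        for match in _UNIT.finditer(chunk):
            if match.start() != pos:
                raise ValueError(f"unexpected characters in {chunk!r}")
            units.append((match.group(1), int(match.group(2))))
            pos = match.end()
        if pos != len(chunk):
            raise ValueError(f"trailing characters in {chunk!r}")
    return units


print(split_units("t3t5"))
print(split_units("x0ch2"))
```

Both tưởng tượng and tiểu tiện reduce to the same two units, which is exactly the ambiguity the prediction step has to resolve.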
v7 inherits conventions from both of the earlier input methods, VNI and Telex.
- Special consonants:
  - g for both g and gh.
  - ng for both ng and ngh.
  - z for gi. (z6 → giúp, giết, giáp, ...)
  - dd for đ. (dd4 → đã, đãi, đỗ, ...) (Telex style)
- Tones (VNI style):
  - 0 for no tone: tuân, câm, tân ...
  - 1 for normal acute: cấm, tiếng, tấn, thính ... (compare with 6 to see the differences)
  - 2 for grave: tuần, cầm, tần ...
  - 3 for hook: tẩn, cẩm, hỉ ...
  - 4 for tilde: mãi, rã, phũ ...
  - 5 for normal underdot: nhậm, phụng, độn, mạnh ... (compare with 7 to see the differences)
  - 6 for entering/checked acute: cấp, tiếc, tất, thích ... (everything with an acute that ends with p, t, c, ch must be tone 6)
  - 7 for entering/checked underdot: nhập, phục, đột, mạch ... (everything with an underdot that ends with p, t, c, ch must be tone 7)
- Special vowels:
  - Lots of ă, â, ê, ô, ơ, ư when typing Vietnamese? Not a problem anymore: just type a, e, o, u and v7 will predict the most suitable ones for you! This feature also helps reduce the number of keys you have to type!
This 8-tone system follows the Vietnamese Eight-Tone Analysis.
Note: If you aren't familiar with the 8-tone system, you can still configure v7 to use the traditional VNI 6-tone system. However, the 8-tone system is highly recommended because it gives much better AI results!
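The rules above can be sketched in Python as a small encoder that turns a fully written phrase into its shortest v7 code (initial consonant plus 8-tone digit). This is an illustrative sketch, not the project's implementation: the function names are hypothetical, the onset is taken to be the leading run of non-vowel letters, and only the four consonant rules listed above are applied. Onsets the README does not mention (for example qu) and syllables that start with a vowel are left as they fall out of this heuristic.

```python
import unicodedata

# Combining marks of the five Vietnamese tone diacritics in NFD form.
TONE_MARKS = {"\u0301": 1, "\u0300": 2, "\u0309": 3, "\u0303": 4, "\u0323": 5}
CHECKED_FINALS = ("p", "t", "c", "ch")
VOWELS = set("aeiouy")


def tone_digit(syllable: str) -> int:
    """Return the 8-tone digit (0-7) of one written syllable."""
    lowered = syllable.lower()
    decomposed = unicodedata.normalize("NFD", lowered)
    tone = next((TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS), 0)
    checked = lowered.endswith(CHECKED_FINALS)
    if checked and tone == 1:
        return 6  # entering/checked acute
    if checked and tone == 5:
        return 7  # entering/checked underdot
    return tone


def initial_key(syllable: str) -> str:
    """Return the keys typed for the initial consonant of one syllable."""
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if plain.startswith("gi"):
        return "z"
    onset = ""
    for ch in plain:
        if ch in VOWELS:
            break
        onset += ch
    onset = onset.replace("đ", "dd")
    return {"gh": "g", "ngh": "ng"}.get(onset, onset)


def encode(phrase: str) -> str:
    """Encode a phrase, e.g. 'xin chào' becomes 'x0ch2'."""
    return "".join(initial_key(s) + str(tone_digit(s)) for s in phrase.split())


for phrase in ("xin chào", "tưởng tượng", "tiểu tiện", "giúp", "đã"):
    print(phrase, "->", encode(phrase))
```

Splitting tones 1 and 5 into checked and unchecked variants costs no extra keystrokes, but it narrows the candidate set: cấm (tone 1) and cấp (tone 6) no longer share a code.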
v7 predicts the words/phrases users want to type by checking and ranking possible words/phrases. It operates in two modes:
In Dictionary Mode, v7 searches for matching phrases in the dictionary and ranks them based on trained usage frequency.
- Limitations:
  - Can only detect phrases present in the dictionary (although users can add more phrases to the dictionary).
  - No understanding of the context.
  - Only effective for predicting a single word or one dictionary phrase at a time.
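The lookup-and-rank idea behind Dictionary Mode can be sketched as follows. This is a simplified illustration under stated assumptions: the class name PhraseDictionary, its methods, and the usage counts are all hypothetical, and the real dictionary format and ranking details may differ.

```python
from collections import defaultdict


class PhraseDictionary:
    """Maps a typed code to candidate phrases, ranked by usage count."""

    def __init__(self) -> None:
        self._by_code: dict[str, dict[str, int]] = defaultdict(dict)

    def add(self, code: str, phrase: str, count: int = 1) -> None:
        """Register a phrase under its code, or add to its usage count."""
        bucket = self._by_code[code]
        bucket[phrase] = bucket.get(phrase, 0) + count

    def predict(self, code: str, top_k: int = 5) -> list[str]:
        """Return up to top_k phrases for the code, most frequent first."""
        candidates = self._by_code.get(code, {})
        ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
        return [phrase for phrase, _ in ranked[:top_k]]


# Made-up counts, for illustration only.
dictionary = PhraseDictionary()
dictionary.add("t3t5", "tưởng tượng", 120)
dictionary.add("t3t5", "tiểu tiện", 15)
dictionary.add("x0ch2", "xin chào", 300)

print(dictionary.predict("t3t5"))
print(dictionary.predict("k1"))
```

The second lookup returns an empty list, which illustrates the first limitation: a code with no dictionary entry yields no prediction until the user adds a phrase for it.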
AI Mode uses v7gpt, a GPT-like model with a custom tokenizer built specifically for v7, trained on a Vietnamese corpus and based on Andrej Karpathy's nanoGPT.
- Advantages:
  - Works in any circumstance.
  - Understands the context in which the user is writing to predict the most suitable next word.
  - Can effectively predict entire sentences at a time.
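To illustrate what using context means here, the sketch below picks, for each typed code, the candidate that scores highest given the previous word. It is not how v7gpt works internally: a toy bigram table with made-up scores stands in for the language model, and the names decode, candidates, and bigram are hypothetical.

```python
def decode(codes, candidates, bigram, start="<s>"):
    """Greedily choose one word per code, conditioned on the previous word."""
    previous, output = start, []
    for code in codes:
        options = candidates.get(code, [])
        if not options:
            # Unknown code: pass it through unchanged.
            output.append(code)
            previous = code
            continue
        best = max(options, key=lambda word: bigram.get((previous, word), 0.0))
        output.append(best)
        previous = best
    return output


# Toy data standing in for a trained model (all scores are invented).
candidates = {"t3": ["tưởng", "tiểu"], "t5": ["tượng", "tiện"]}
bigram = {
    ("<s>", "tưởng"): 0.6,
    ("<s>", "tiểu"): 0.4,
    ("đi", "tiểu"): 0.7,
    ("đi", "tưởng"): 0.1,
    ("tưởng", "tượng"): 0.9,
    ("tiểu", "tiện"): 0.8,
}

print(decode(["t3", "t5"], candidates, bigram))
print(decode(["t3", "t5"], candidates, bigram, start="đi"))
```

The same input t3t5 resolves differently depending on the preceding word, which a frequency-ranked dictionary lookup cannot do.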
Future plans include combining both modes to create the most robust Vietnamese input method.
This project uses Python 3.12.
To run the app in Dictionary Mode, follow these steps:
- Install the required packages:
pip install -r requirements.txt
- Start the application:
python main.py --lang en --ai false # VNI 6-tone is not yet supported for Dictionary mode
To run the app in AI Mode, follow these steps:
- Install the required packages for AI Mode (Torch is required):
pip install -r requirements_ai.txt
- Download the pretrained model checkpoint:
gdown 1dDP0jIJ79syE6vt6QnVl05_4fYpuwrqd -O checkpoints/v7gpt.pth
- Start the application:
python main.py --lang en --ai true --vni_tones false # use [--vni_tones true] if you want VNI 6-tone