# Preprocessing in NLP

The goal of this whole process is to give a machine learning or deep learning model input that is "clean" and "normalized", so that it performs better at analysis and prediction.

**NLP Python packages:**

|NLP Library|Description|
|---|---|
|NLTK|One of the most widely used NLP libraries, often regarded as the foundational Python NLP toolkit.|
|spaCy|A performance-optimized library, widely used in deep learning pipelines.|
|Stanford CoreNLP (Python)|Written in Java, with a client-server architecture that makes it usable from Python (for example through NLTK).|
|TextBlob|Works in Python 2 and Python 3. Used for processing textual data; exposes most common NLP operations through a simple API.|
|Gensim|A robust open-source NLP library for Python that is efficient and scalable.|
|Pattern|A lightweight NLP module, generally used for web mining, crawling, and similar spidering tasks.|
|Polyglot|Well suited to massively multilingual applications; its features include language identification and entity extraction.|
|PyNLPl|Also known as 'Pineapple'; a Python library that provides parsers for many data formats such as FoLiA/Giza/Moses/ARPA/Timbl/CQL.|
|Vocabulary|Useful for getting semantic information about the words in a given text.|
|pyvi|Python Vietnamese Core NLP Toolkit|
|underthesea|Underthesea - Vietnamese NLP Toolkit|


**The process of NLP Preprocessing is as follows:**
1. General cleaning
    - Case normalization
    - Normalize grammatical structure
    - Regular expression handling
2. Removing noise from the dataset
    - Removing special characters/patterns
    - Removing punctuation
    - Removing stop words
    - Removing unnecessary components: tables, images, etc.
3. Normalizing text into the right format for the ML algorithm
    - Tagging: Part-of-speech tagging, named-entity recognition
    - Stemming / Lemmatization
4. Tokenization
5. Text Mining

In [5]:
# import libraries
import pandas as pd
import numpy as np
import string
import re
import nltk

In [6]:
# link data: https://github.com/WhySchools/VFND-vietnamese-fake-news-datasets/blob/master/CSV/vn_news_223_tdlfr.csv

raw = pd.read_csv(
    r"contents\theory\aiml_algorithms\dl_nlp\data\vn_news_223_tdlfr.csv"
)
print(raw.shape)
raw.head(3)

(223, 3)


Unnamed: 0,text,domain,label
0,Thủ tướng Abe cúi đầu xin lỗi vì hành động phi...,binhluan.biz,1
1,Thủ tướng Nhật cúi đầu xin lỗi vì tinh thần ph...,www.ipick.vn,1
2,Choáng! Cơ trưởng đeo khăn quàng quẩy banh nóc...,tintucqpvn.net,1


## General Cleaning

1. **Case normalization**: Text is usually converted to lowercase to reduce complexity when comparing words. Keep in mind, however, that **lowercasing can change the meaning of some text**, e.g. "US" vs. "us".

2. **Spelling correction (if needed)**: In some language-analysis tasks, correct spelling matters a great deal.

3. **Handling abbreviations and slang**: For example, "ko" -> "không" and "k" -> "không" (Vietnamese), or "u" -> "you" (English). Making these lexical variants consistent helps the model understand the text better.
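
The case normalization and abbreviation steps above can be sketched as follows. This is a minimal illustration, not a complete cleaner: the `ABBREVIATIONS` map and the `general_clean` helper are hypothetical names, and mixing Vietnamese and English entries in one map is only done here to mirror the examples in the list.

```python
import re

# Hypothetical abbreviation map taken from the examples above; extend per corpus.
ABBREVIATIONS = {"ko": "không", "k": "không", "u": "you"}

# \b...\b matches whole words only, so "ko" inside "koala" is left untouched.
ABBREVIATION_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b"
)


def general_clean(text, lowercase=True):
    """Optionally lowercase the text, then expand known abbreviations."""
    if lowercase:
        text = text.lower()
    return ABBREVIATION_RE.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


print(general_clean("Tôi KO biết"))  # tôi không biết
```

Passing `lowercase=False` keeps strings such as "US" intact, at the cost of missing upper-case abbreviations.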

## Removing Noise

Text data often contains noise such as punctuation, special characters, and irrelevant symbols. Preprocessing helps remove these elements, making the text cleaner and easier to analyze.

1. **Remove unwanted characters or symbols**: special characters, emoji, URLs, email addresses, HTML entities, HTML tags, etc. This reduces components that carry no semantic value or add noise.

2. **Remove extra whitespace, redundant line breaks, or punctuation** such as `. , ! $ ( ) * % @`, which keeps the data tidy and consistent.

3. **Remove or replace meaningless tokens (stop words, words that carry no meaning in context)**: Stop words (such as "và", "hoặc", "của" in Vietnamese; "the", "is", "at" in English) usually carry little semantic information and can add noise to the model. Whether to keep or drop them depends on the task, because stop words do matter in some contexts.

4. **Remove irrelevant elements**: For example, code snippets, tables, or metadata embedded in the text that are not needed for the analysis.
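
A rough sketch of these noise-removal steps using only the standard library is shown below. The regular expressions are deliberately simple assumptions (they will not catch every URL or email format), the stop-word set is a tiny illustrative sample, and emoji handling is left out.

```python
import re
import string

# Simplified patterns; real-world URLs and emails can be more varied.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")

# Tiny illustrative stop-word set; a real project would load a full list.
STOP_WORDS = {"the", "is", "at", "và", "hoặc", "của"}


def remove_noise(text, stop_words=STOP_WORDS):
    """Strip HTML tags, URLs, emails, punctuation, extra spaces and stop words."""
    text = HTML_TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)      # before punctuation removal, while URLs are intact
    text = EMAIL_RE.sub(" ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() with no argument also collapses repeated whitespace and newlines.
    tokens = [t for t in text.split() if t.lower() not in stop_words]
    return " ".join(tokens)


sample = "<p>The cat sat at  http://example.com, mail me@example.com!</p>"
print(remove_noise(sample))  # cat sat mail
```

The order matters: URLs and emails are removed before punctuation, otherwise stripping `:`, `/`, and `@` would leave fragments behind.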


## Normalizing

Different forms of a word (e.g., "run", "running", "ran") convey the same core meaning but appear as different strings. Preprocessing techniques like stemming and lemmatization help standardize these variations.

1. **Tagging**
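    - **Part-of-speech (POS) tagging**: labels each token with its grammatical category (noun, verb, adjective, etc.).
    - **Named-entity recognition (NER)**: finds and labels spans that name entities such as people, organizations, and locations.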

2. **Stemming / Lemmatization (reducing lexical variants; mainly useful for English)**:
    - **Stemming**: cuts off the "tail" (suffix) of a word to reduce it to a "stem", which may not be a valid dictionary word.
        ```text
        connecting  -->  connect
        connected  -->  connect
        connectivity  -->  connect
        connect  -->  connect
        connects  -->  connect
        ```
    - **Lemmatization**: reduces a word to its canonical dictionary form (lemma) based on its part of speech and context. **Lemmatization** is broadly similar to **stemming** in that it strips word endings to obtain a base form, but the resulting forms are real words, unlike with **stemming** (for example, lemmatizing `moved` as a verb yields `move`). NLTK uses the **WordNet** dictionary to map words according to their part of speech (noun, verb, adverb, or adjective). Use part-of-speech tagging (`nltk.pos_tag`) to obtain that information for each word.

    -> Both techniques reduce duplication when the same word appears in different variant forms.

3. **Label processing (for supervised tasks)**:
    - Check and normalize the label data. For example, convert "positive" / "negative" / "neutral" to 0 / 1 / 2 or similar.
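
To make the stemming idea and the label mapping concrete, here is a deliberately naive sketch. `toy_stem` and its suffix list are hypothetical and far cruder than a real stemmer; in practice you would more likely use `nltk.stem.PorterStemmer` or `nltk.stem.WordNetLemmatizer`. The label-to-id mapping follows the 0 / 1 / 2 example above.

```python
# Suffixes are checked in this order, longest first (illustrative only).
SUFFIXES = ("ivity", "ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["connecting", "connected", "connectivity", "connect", "connects"]
print({w: toy_stem(w) for w in words})  # every value is 'connect'
print(toy_stem("running"))              # 'runn': a stem need not be a real word

# Label normalization for a supervised task.
LABEL_TO_ID = {"positive": 0, "negative": 1, "neutral": 2}
labels = ["positive", "neutral", "negative"]
label_ids = [LABEL_TO_ID[label] for label in labels]
print(label_ids)  # [0, 2, 1]
```

The `runn` output illustrates why stems are not guaranteed to be dictionary words, which is the gap lemmatization is meant to close.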

In [33]:
from pyvi import ViTokenizer, ViPosTagger
import string


# remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))


# stop words: https://github.com/stopwords/vietnamese-stopwords/blob/master/vietnamese-stopwords-dash.txt
stop_words = (
    pd.read_csv(
        r"contents\theory\aiml_algorithms\dl_nlp\data\vietnamese-stopwords-dash.txt",
        header=None,
    )
    .iloc[:, 0]
    .tolist()
)


def process_text(text):
    # replace the abbreviation "ko" with "không" (whole-word match);
    # reassign `text` so the tokenizer below sees the replaced string
    text = re.sub(r"\bko\b", "không", text)

    tokens = ViTokenizer.tokenize(text)
    pos_tags = ViPosTagger.postagging(tokens)
    processed_text = []
    for token, pos in zip(pos_tags[0], pos_tags[1]):
        # if token not in stop_words:
        if pos.startswith("Np"):
            processed_text.append(token.title())
        else:
            processed_text.append(token.lower())

    return " ".join(processed_text)


# Apply the function to the 'text' column
raw["processed_text"] = raw["text"].apply(process_text)
raw[["text", "processed_text"]]

Unnamed: 0,text,processed_text
0,Thủ tướng Abe cúi đầu xin lỗi vì hành động phi...,thủ_tướng Abe cúi đầu xin_lỗi vì hành_động phi...
1,Thủ tướng Nhật cúi đầu xin lỗi vì tinh thần ph...,thủ_tướng Nhật cúi đầu xin_lỗi vì tinh_thần ph...
2,Choáng! Cơ trưởng đeo khăn quàng quẩy banh nóc...,choáng ! cơ_trưởng đeo khăn_quàng quẩy banh nó...
3,Chưa bao giờ nhạc Kpop lại dễ hát đến thế!!!\n...,chưa bao_giờ nhạc Kpop lại dễ hát đến thế ! ! ...
4,"Đại học Hutech sẽ áp dụng cải cách ""Tiếq Việt""...","đại_học Hutech sẽ áp_dụng cải_cách "" Tiếq Việt..."
...,...,...
218,“Siêu máy bay” A350 sẽ chở CĐV Việt Nam đi Mal...,“ siêu máy_bay ” A350 sẽ chở cđv Việt_Nam đi M...
219,Thưởng 20.000 USD cho đội tuyển cờ vua Việt Na...,thưởng 20.000 usd cho đội_tuyển cờ_vua Việt_Na...
220,Trường Sơn giành HCV tại giải cờ vua đồng đội ...,Trường_Sơn giành hcv tại giải cờ_vua đồng_đội ...
221,Chuyện về chàng sinh viên Luật - Kiện tướng Lê...,chuyện về chàng sinh_viên Luật - kiện_tướng Lê...


## Tokenizing

![](https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/images/08-tokenization-vs-embedding.png)

**Tokenization**: Tokenization (splitting text into smaller units) is a crucial preprocessing step in text processing. Its goal is to turn the raw text (a character string) into a list of tokens (meaningful units).

- Split sentences into word or subword units.
- This is usually easier in English (split on spaces and special characters), whereas Vietnamese needs a dedicated word-segmentation model or library (such as VnCoreNLP, PyVi, etc.).

Each character/word/subword is then mapped to a numerical value. There are three levels of tokenization:
- _Word-level tokenization_: each word is represented by one numerical value. Split on whitespace or use a tokenizer library (for Vietnamese: VnCoreNLP, PyVi, RDRSegmenter, …). Example: "I love you" ---> [0, 1, 2]
- _Character-level tokenization_: each character (letter, punctuation mark) is one token. Useful for some tasks (especially language models that need to handle spelling, or data with many new words).
- _Sub-word tokenization_: breaks each word into parts and tokenizes those, so one word can become several tokens. It combines the advantages of word-level and character-level tokenization and is used in BERT, GPT, RoBERTa, PhoBERT, etc.
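
As a small illustration of the word-level and character-level cases, the sketch below assigns integer ids in order of first appearance. `build_vocab` is a hypothetical helper; real tokenizers usually order the vocabulary by frequency and reserve ids for padding and unknown tokens.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


text = "I love you"

# Word level: one id per whitespace-separated word.
word_tokens = text.split()
word_vocab = build_vocab(word_tokens)
word_ids = [word_vocab[t] for t in word_tokens]

# Character level: one id per character, spaces included.
char_tokens = list(text)
char_vocab = build_vocab(char_tokens)
char_ids = [char_vocab[c] for c in char_tokens]

print(word_ids)  # [0, 1, 2]
print(char_ids)  # [0, 1, 2, 3, 4, 5, 1, 6, 3, 7]
```

The character-level sequence is longer but its vocabulary stays small, which is the trade-off between the two levels.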

> Choose the tokenization level that fits the problem, or try several levels and compare performance, or combine (stack) them with `tf.keras.layers.concatenate`.

---
**Why is tokenization important?**
- Both classical NLP models (Bag-of-Words, TF-IDF, etc.) and modern deep learning models operate on discrete units (tokens).
- Tokenization determines how the model perceives the text: errors or poor choices in tokenization significantly affect model quality.
- Modern language models (**Transformers** such as `BERT`, `GPT`, `RoBERTa`, etc.) still need **tokenization**, usually subword tokenization (e.g. `BPE`, `SentencePiece`). The reason: the model has to split text into the "codes" already learned in its vocabulary so that each token can be mapped to the right **embedding** vector.
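
To show how a fixed vocabulary drives subword splitting, here is a sketch of greedy longest-match-first matching in the style of WordPiece (the scheme used by BERT-family tokenizers). The `toy_vocab` and the `wordpiece_tokenize` function are assumptions for illustration; BPE and SentencePiece build and apply their vocabularies differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return [unk_token]  # nothing fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"token", "##ization", "##ize", "un", "##known"}
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("unknown", toy_vocab))       # ['un', '##known']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Each resulting piece has an entry in the vocabulary, so it can be mapped to an id and then to an embedding vector.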

---
**What tokenization and embedding actually do:**

- **Tokenization**: converts text into a list of tokens in text form (words, subwords, or characters), which are then converted to numbers (indices) through a vocabulary.
- **Embedding**: converts each token index into a real-valued, multi-dimensional **vector** (the embedding). Each token is represented by a vector whose dimension `d` is fixed in advance (e.g. 100, 200, or 300). The vectors capture relationships between tokens in the vector space.
- When the model receives its input (a list of token IDs), it looks up each token ID in the corresponding row of the embedding matrix `W` to get the `d`-dimensional vector for that token.
- Result: instead of a list of IDs, the model has a sequence of vectors (one per token) that reflect the semantic and, in modern models, contextual information of those tokens. The model then uses these vectors in operations such as dot products, pooling, and attention to learn the semantic structure of the text.
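
The lookup step can be sketched with plain Python lists. The vocabulary, the dimension `d = 4`, and the random initial values are assumptions for illustration; in a real model `W` is a trainable tensor with far more rows and columns.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

vocab = {"i": 0, "love": 1, "you": 2}
d = 4  # embedding dimension (illustrative; real models use hundreds)

# Embedding matrix W: one row of d floats per token id.
W = [[random.uniform(-1, 1) for _ in range(d)] for _ in vocab]

# Tokenize, map tokens to ids, then look up one row per id.
token_ids = [vocab[t] for t in "i love you".split()]
vectors = [W[i] for i in token_ids]  # lookup is plain row indexing

print(token_ids)                      # [0, 1, 2]
print(len(vectors), len(vectors[0]))  # 3 4
```

During training the rows of `W` are updated like any other weights, which is how tokens used in similar contexts end up with similar vectors.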

---

**Best practices for tokenization**

***1. Choose a suitable tokenization method:***

- Modern Transformer models usually use subword tokenization (BPE, SentencePiece) because it copes well with new words, misspellings, rare words, etc.
- For traditional approaches (Bag-of-Words, TF-IDF) on Vietnamese, use a dedicated word-segmentation library (e.g. VnCoreNLP).

***2. Keep (or handle appropriately) punctuation and emoji if they carry meaning for the task.***

- Many social-media sentiment analysis tasks need emoji to capture nuance.

***3. Check tokenization quality:***

- Especially for Vietnamese, poor tokenization can break the meaning.
- Inspect some texts after tokenization to make sure the result is sensible and that compound words are not split incorrectly (e.g. "điện thoại" and "cầm tay" split into "điện", "thoại", "cầm", "tay").

***4. Handle special words, hashtags, and mentions:***

- For social-media tasks, decide whether to keep or drop hashtags (#myhashtag), mentions (@username), URLs, etc. when tokenizing.

***5. Normalization:***

- Usually, convert the text to lowercase.
- For Vietnamese, check whether tone marks are written consistently, and normalize combining Unicode characters.

**Note**: The original BERT/PhoBERT models may skip lowercasing in order to preserve case. Check what the pre-trained model expects.
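
The Unicode point in item 5 matters because the same Vietnamese letter can be stored either as one precomposed code point or as a base letter plus combining marks. A minimal sketch with the standard `unicodedata` module follows; choosing NFC as the target form is an assumption here, and the important part is to pick one form and apply it everywhere.

```python
import unicodedata

composed = "Vi\u1ec7t"                               # "Việt" with a precomposed "ệ"
decomposed = unicodedata.normalize("NFD", composed)  # base letter + combining marks

print(composed == decomposed)          # False: same appearance, different code points
print(len(composed), len(decomposed))  # 4 6


def normalize_text(text):
    """Bring text to NFC so visually identical strings compare equal."""
    return unicodedata.normalize("NFC", text)


print(normalize_text(decomposed) == composed)  # True
```

Without this step, a vocabulary lookup for a decomposed string can miss an entry that was stored in composed form.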

---

**Do you need to re-tune the tokenizer?**

Modern Large Language Models (LLMs) such as GPT, BERT-based models, T5, RoBERTa, … usually come with a pre-trained vocabulary and tokenization scheme. When you download such a model, it ships with a tokenizer (and the corresponding embedding matrix) that is already in sync with the model.

> In practice, most users do not need to retrain or re-tune the tokenizer.
> - Because the model was pre-trained on a huge amount of data, the original tokenizer (subword / BPE / SentencePiece) is already reasonably well optimized.
> - If you change the tokenizer on your own (adding tokens, removing tokens, changing how subwords are split), you have to retrain (or substantially adjust) the embeddings, and you lose compatibility with the pre-trained weights.


***Exception: in some special cases, you may re-train / fine-tune the tokenizer.***

> For example, when you have a very specialized domain (such as medicine, chemistry, or finance) with many terms, abbreviations, and symbols that never (or very rarely) appear in the pre-training corpus.
> - In that case the old tokenizer may produce many "[UNK]" (unknown) tokens, or very long token sequences because no suitable subword is found.
> - To improve representation quality, you can retrain the tokenizer on domain data and then fine-tune (or retrain from scratch) the embeddings. This, however, requires a lot of resources and experience, because you must make sure the model (and its embedding matrix) is compatible with the new tokenizer.


For most common NLP projects, re-tuning the tokenizer is unnecessary (and even harmful if you lack the data to retrain the embeddings). Typically, you:
- Use the original tokenizer that ships with the model.
- Fine-tune the model on the data you care about. Everything from tokenization to embeddings is inherited as is.

### by split()
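
A minimal sketch of whitespace tokenization with Python's built-in `str.split()`; the sample sentence is an assumption for illustration.

```python
text = "Natural language processing is fun!"

# With no argument, split() breaks on any run of whitespace.
tokens = text.split()
print(tokens)  # ['Natural', 'language', 'processing', 'is', 'fun!']
```

Punctuation stays attached to the neighbouring word ("fun!"), which is the main limitation of this approach.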

### by RegEx
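
A minimal sketch of regex-based tokenization with the standard `re` module. The pattern is one possible choice, not a canonical one: it keeps runs of word characters together and emits each punctuation mark as its own token.

```python
import re

text = "Natural language processing is fun!"

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['Natural', 'language', 'processing', 'is', 'fun', '!']
```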

### by NLTK

### by spaCy

### by keras
Use [`tf.keras.layers.TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) with parameters such as:
- `max_tokens` - Maximum number of words in the vocabulary (e.g. 20000 or the number of unique words in your text), including one slot for OOV (out-of-vocabulary) tokens.
- `standardize` - How to standardize the text. The default is `"lower_and_strip_punctuation"`, which lowercases the text and removes all punctuation marks.
- `split` - How to split the text. The default is `"whitespace"`, which splits on whitespace.
- `ngrams` - Which n-grams to build from the split text. An integer creates n-grams up to that size, so `ngrams=2` yields single words as well as contiguous pairs of words.
- `output_mode` - How to output tokens:
    - `"int"` (integer mapping): maps each word to its index in the vocabulary
    - `"multi_hot"`: one-hot style encoding, marking whether each vocabulary word appears in the text
    - `"count"`: the number of times each vocabulary word appears in the text
    - [`"tf_idf"`](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting): **Term Frequency - Inverse Document Frequency**, which extracts and weights terms in text documents.
        - `TF` is how often a term appears in a document. It indicates whether the term stands out in that particular document.
        - `IDF` is the inverse of how often a term appears across the collection of documents. It indicates whether the term is common or rare in the whole corpus.
        - **High TF**: the term appears many times in the document => it plays an important role in that document.
        - **High IDF**: the term is rare (appears in few documents of the collection) => it is highly discriminative.
        - **High TF-IDF**: the term is very important (discriminative) for that document (frequent in the document and rare in other documents).
- `output_sequence_length` - Used with `output_mode="int"`. Sets how many tokens each output sequence contains: longer sequences are truncated and shorter ones are padded, so that the output tensor has exactly `shape = (batch_size, output_sequence_length)`.
- `pad_to_max_tokens` - Defaults to False. If True, the output feature axis is padded to `max_tokens` even when the vocabulary has fewer unique tokens than `max_tokens`. Only applies when `output_mode` is `multi_hot`, `count`, or `tf_idf`.
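
To make the TF, IDF, and TF-IDF bullets concrete, here is a small worked example in plain Python. It assumes one common textbook variant, `tf * log(N / df)`; Keras and scikit-learn use smoothed variants, so their numbers will differ. The three toy documents are made up for illustration.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(tokens):
    """Return {term: tf * idf} for one tokenized document."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }


scores = tf_idf(tokenized[0])
print(round(scores["the"], 3))  # 0.135: frequent here, but also common elsewhere
print(round(scores["cat"], 3))  # 0.183: rarer across documents, so weighted higher
```

Although "the" appears twice in the first document and "cat" only once, "cat" gets the higher score because it occurs in fewer documents.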


### by Gensim

### by transformer pretrain

In [None]:
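import torch
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-uncased"

# Load the tokenizer that ships with the pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Inspect the subword pieces and the integer ids the model would receive
text = "Tokenization is important"
print(tokenizer.tokenize(text))
print(tokenizer(text)["input_ids"])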

## Text Mining

## Feature Extraction

Preprocessing can involve extracting features from text, such as word frequencies, n-grams, or word embeddings, which are essential for building machine learning models.

## Dimensionality Reduction

Text data often has high dimensionality because of its large vocabulary. Weighting schemes such as term frequency-inverse document frequency (TF-IDF), which down-weight uninformative terms, or explicit dimensionality reduction methods can help keep the feature space manageable.