# Homework Lab 2: Text Preprocessing with Vietnamese
**Overview:** In this exercise, we will build a text preprocessing program for Vietnamese.

Import the necessary libraries. Note that we are using the underthesea library for Vietnamese tokenization. To install it, follow the instructions below. ([link](https://github.com/undertheseanlp/underthesea))

In [4]:
import os,glob
import codecs
import sys
import re
from underthesea import word_tokenize

## Question 1: Create a Corpus and Survey the Data

The data in this section is partially extracted from the [VNTC](https://github.com/duyvuleo/VNTC) dataset. VNTC is a Vietnamese news dataset covering various topics. In this section, we will only process the science topic from VNTC. We will create a corpus from both the train and test directories. Complete the following program:

- Write `sentences_list` to a file named `dataset_name.txt`, with each element as a document on a separate line.
- Check how many documents are in the corpus.


In [5]:
dataset_name = "VNTC_khoahoc"
path = ['./VNTC_khoahoc/Train_Full/', './VNTC_khoahoc/Test_Full/']

if os.listdir(path[0]) == os.listdir(path[1]):
    folder_list = [os.listdir(path[0]), os.listdir(path[1])]
    print("train labels = test labels")
else:
    print("train labels differ from test labels")

doc_num = 0
sentences_list = []
meta_data_list = []
for i in range(2):
    for folder_name in folder_list[i]:
        folder_path = path[i] + folder_name
        if folder_name[0] != ".":
            for file_name in glob.glob(os.path.join(folder_path, '*.txt')):
                # Read the file content into f
                f = codecs.open(file_name, 'br')
                # Convert the data to UTF-16 format for Vietnamese text
                file_content = (f.read().decode("utf-16")).replace("\r\n", " ")
                sentences_list.append(file_content.strip())
                f.close
                # Count the number of documents
                doc_num += 1

#### YOUR CODE HERE ####
with open(f"{dataset_name}.txt", "w", encoding="utf-8") as out_f:
    for sentence in sentences_list:
        out_f.write(sentence + "\n")

print(f"Number of documents in corpus: {doc_num}")#### END YOUR CODE #####

train labels = test labels
Number of documents in corpus: 84132


## Question 2: Write Preprocessing Functions







### Question 2.1: Write a Function to Clean Text
Hint:
- The text should only retain the following characters: aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬbBcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆfFgGhHiIìÌỉỈĩĨíÍịỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTuUùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ0-9(),!?\'\
- Then trim the whitespace in the input text.

In [15]:
def clean_str(string):
    #### YOUR CODE HERE ####
    pattern = r"[^aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬ" \
              r"bBcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆ" \
              r"fFgGhHiIìÌỉỈĩĨíÍịỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌ" \
              r"ôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢ" \
              r"pPqQrRsStTuUùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰ" \
              r"vVwWxXyYỳỲỷỶỹỸýÝỵỴzZ0-9(),!?\'\\s]"
    
    # Thay các ký tự không hợp lệ thành rỗng
    cleaned = re.sub(pattern, "", string)
    
    # Chuẩn hoá khoảng trắng (xóa khoảng trắng thừa)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    
    return cleaned
    #### END YOUR CODE #####

### Question 2.2: Write a Function to Convert Text to Lowercase

In [7]:
# make all text lowercase
def text_lowercase(string):
    #### YOUR CODE HERE ####
    return string.lower()
    #### END YOUR CODE #####

### Question 2.3: Tokenize Words
Hint: Use the `word_tokenize()` function imported above with two parameters: `strings` and `format="text"`.


In [12]:
def tokenize(strings):
    #### YOUR CODE HERE ####
    tokenized_text = word_tokenize(strings, format="text")
    # Chia thành list token
    return tokenized_text.split()
    #### END YOUR CODE #####

### Question 2.4: Remove Stop Words
To remove stop words, we use a list of Vietnamese stop words stored in the file `./vietnamese-stopwords.txt`. Complete the following program:
- Check each word in the text (`strings`). If a word is not in the stop words list, add it to `doc_words`.


In [13]:
def remove_stopwords(strings):
    #### YOUR CODE HERE ####
    with open("./vietnamese-stopwords.txt", encoding="utf-8") as f:
        stop_words = set([line.strip() for line in f if line.strip() != ""])

    doc_words = []
    for w in strings:
        if w.lower() not in stop_words:
            doc_words.append(w)

    return doc_words
    #### END YOUR CODE #####

## Question 2.5: Build a Preprocessing Function
Hint: Call the functions `clean_str`, `text_lowercase`, `tokenize`, and `remove_stopwords` in order, then return the result from the function.


In [10]:
def text_preprocessing(strings):
    #### YOUR CODE HERE ####
    cleaned = clean_str(strings)
    lowered = text_lowercase(cleaned)
    tokens = tokenize(lowered)
    filtered = remove_stopwords(tokens)
    
    return filtered
    #### END YOUR CODE #####


## Question 3: Perform Preprocessing
Now, we will read the corpus from the file created in Question 1. After that, we will call the preprocessing function for each document in the corpus.

Hint: Call the `text_preprocessing()` function with `doc_content` as the input parameter and save the result in the variable `temp1`.


In [16]:
#### YOUR CODE HERE ####
clean_docs = []
with open(f"{dataset_name}.txt", "r", encoding="utf-8") as f:
    for line in f:
        doc_content = line.strip()
        if doc_content:  # bỏ dòng trống
            temp1 = text_preprocessing(doc_content)
            # nếu preprocessing trả về list tokens thì nối lại thành string
            if isinstance(temp1, list):
                clean_docs.append(" ".join(temp1))
            else:
                clean_docs.append(temp1)
#### END YOUR CODE #####

print("\nlength of clean_docs = ", len(clean_docs))
print('clean_docs[0]:\n' + clean_docs[0])



length of clean_docs =  84132
clean_docs[0]:
thànhlậpdựánpolicyphòngchốnghivaidsởvn ( nlđ ) quỹhỗtrợkhẩncấpvềaidscủahoakỳvừathànhlậpdựánpolicytạivnvớicamkếthỗtrợchínhphủvànhândânvnđốiphóhivaidsdựáncónhiệmvụchínhlàcảithiệncôngtácphòngchốnghivaidsthôngquacáclĩnhvựcxâydựngchínhsách , ràsoátcácvănbảnphápluật , xâydựngchiếnlượcquảngbá , xâydựngchươngtrìnhđàotạovềphòngchốnghivaids , lênkếhoạchbốtrínguồnlực , huấnluyệnvànghiêncứuvềphươngtiệntruyềnthôngđạichúng , tổchứccáchoạtđộngnhằmgiảmkỳthịvàphânbiệtđốixửđốivớingườicóhivaidstheottxvn , dựánpolicyđặcbiệtquantâmđếncôngtáctruyềnthôngphòngchốnghivaids , coiđâylàmộtbiệnpháptíchcựcvàhữuhiệutrongviệcphòngchốngcóhiệuquảhivaidsthờigiantới , dựánpolicysẽtiếptụctổchứccáchoạtđộngnhằmnângcaonhậnthứcchonhữngngườicótráchnhiệmvớicôngtácchỉđạophòngchốnghivaids


## Question 4: Save Preprocessed Data
Hint: Save the preprocessed data to a file named `dataset_name + '.clean.txt'`, where each document is written on a separate line.


In [17]:
#### YOUR CODE HERE ####
with open(f"{dataset_name}.clean.txt", "w", encoding="utf-8") as f:
    for doc in clean_docs:
        f.write(doc + "\n")
#### END YOUR CODE #####