# Homework Lab 2: Text Preprocessing with Vietnamese
**Overview:** In this exercise, we will build a text preprocessing program for Vietnamese.

Import the necessary libraries. We use the underthesea library ([link](https://github.com/undertheseanlp/underthesea)) for Vietnamese word tokenization; the cell below installs it together with the other dependencies.

In [41]:
%pip install requests tqdm pandas zipfile36 underthesea patool

In [42]:
import os
import glob
import codecs
import sys
import re
from underthesea import word_tokenize

## Question 1: Create a Corpus and Survey the Data

The data in this section is partially extracted from the [VNTC](https://github.com/duyvuleo/VNTC) dataset. VNTC is a Vietnamese news dataset covering various topics. In this section, we will only process the science topic from VNTC. We will create a corpus from both the train and test directories. Complete the following program:

- Write `sentences_list` to a file named `dataset_name + '.txt'`, with each element (one document) on a separate line.
- Check how many documents are in the corpus.


In [43]:
import requests
import zipfile
import patoolib

URL = "https://github.com/duyvuleo/VNTC/archive/refs/heads/master.zip"
ZIP_PATH = "VNTC_master.zip"
EXTRACT_DIR = "VNTC_raw"
FINAL_DIR = "VNTC_khoahoc"

print("Downloading dataset...")
response = requests.get(URL)
response.raise_for_status()  # stop early if the download failed
with open(ZIP_PATH, "wb") as f:
    f.write(response.content)

print("Extracting files...")
with zipfile.ZipFile(ZIP_PATH, "r") as zip_ref:
    zip_ref.extractall(EXTRACT_DIR)

TRAIN_PATH = os.path.join(EXTRACT_DIR, "VNTC-master", "Data", "10Topics", "Ver1.1", "Train_Full.rar")
TEST_PATH = os.path.join(EXTRACT_DIR, "VNTC-master", "Data", "10Topics", "Ver1.1", "Test_Full.rar")

patoolib.extract_archive(TRAIN_PATH, outdir=FINAL_DIR)
patoolib.extract_archive(TEST_PATH, outdir=FINAL_DIR)


Downloading dataset...
Extracting files...


INFO patool: Extracting VNTC_raw/VNTC-master/Data/10Topics/Ver1.1/Train_Full.rar ...
INFO patool: running /usr/bin/unrar x -kb -or -- /content/VNTC_raw/VNTC-master/Data/10Topics/Ver1.1/Train_Full.rar
INFO patool: ... VNTC_raw/VNTC-master/Data/10Topics/Ver1.1/Train_Full.rar extracted to `VNTC_khoahoc'.
INFO patool: Extracting VNTC_raw/VNTC-master/Data/10Topics/Ver1.1/Test_Full.rar ...
INFO patool: running /usr/bin/unrar x -kb -or -- /content/VNTC_raw/VNTC-master/Data/10Topics/Ver1.1/Test_Full.rar
INFO patool: ... VNTC_raw/VNTC-master/Data/10Topics/Ver1.1/Test_Full.rar extracted to `VNTC_khoahoc'.


'VNTC_khoahoc'

In [44]:
dataset_name = "VNTC_khoahoc"
path = ['./VNTC_khoahoc/Train_Full/', './VNTC_khoahoc/Test_Full/']

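# Compare the label folders regardless of listing order.
# folder_list is defined in both cases so the loop below never hits a NameError.
train_labels = sorted(os.listdir(path[0]))
test_labels = sorted(os.listdir(path[1]))
folder_list = [train_labels, test_labels]
if train_labels == test_labels:
    print("train labels = test labels")
else:
    print("train labels differ from test labels")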

doc_num = 0
sentences_list = []
meta_data_list = []
for i in range(2):
    for folder_name in folder_list[i]:
        folder_path = path[i] + folder_name
        # Only the 'Khoa hoc' (science) topic is processed
        if folder_name == 'Khoa hoc':
            for file_name in glob.glob(os.path.join(folder_path, '*.txt')):
                # Read the raw bytes; the with-block closes the file
                with open(file_name, 'rb') as f:
                    # Decode the UTF-16 bytes and join Windows line breaks with spaces
                    file_content = f.read().decode("utf-16").replace("\r\n", " ")
                sentences_list.append(file_content.strip())
                # Count the number of documents
                doc_num += 1

#### YOUR CODE HERE ####
with open(dataset_name + '.txt', 'w', encoding='utf-8') as f:
    for sentence in sentences_list:
        f.write(sentence + '\n')

print("Number of documents = ", doc_num)
#### END YOUR CODE #####

train labels = test labels
Number of documents =  15664
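
The loop above replaces only `\r\n` with a space, so a file containing a bare `\n` would still spread one document over several lines of the corpus file. A minimal sketch of a stricter reader follows. `read_document` is a hypothetical helper that is not part of the exercise, and it assumes the source files are UTF-16 encoded, as the decoding step above does.

```python
import os
import tempfile

def read_document(file_path, encoding="utf-16"):
    """Return the file's text as a single line (hypothetical helper)."""
    with open(file_path, "r", encoding=encoding) as f:
        text = f.read()
    # str.split() with no argument splits on any run of whitespace,
    # so "\r\n", "\n", and tabs all collapse to single spaces.
    return " ".join(text.split())

# Small self-check on a temporary UTF-16 file
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "wb") as f:
        f.write("Dòng một\r\nDòng hai\nDòng ba".encode("utf-16"))
    print(read_document(sample_path))
```

Collapsing every whitespace run keeps the one-document-per-line invariant that Question 3 relies on when it reads the corpus back line by line.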


## Question 2: Write Preprocessing Functions

### Question 2.1: Write a Function to Clean Text
Hint:
- The text should only retain the following characters: aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬbBcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆfFgGhHiIìÌỉỈĩĨíÍịỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTuUùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ0-9(),!?\'\
- Then trim the whitespace in the input text.

In [45]:
allowed_chars = r"a-zA-Z0-9àÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆiIìÌỉỈĩĨíÍịỊoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢuUùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰyYỳỲỷỶỹỸýÝỵỴ(),!?'\" "
# Matches any single character that is NOT in the whitelist above
clean_pattern = re.compile(f"[^{allowed_chars}]")
def clean_str(string):
    #### YOUR CODE HERE ####
    if not string:
        return ""
    cleaned_string = clean_pattern.sub('', string)

    return cleaned_string.strip()
    #### END YOUR CODE #####
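
A whitelist of precomposed letters only matches text stored in composed (NFC) form. If a document arrives in decomposed form (base letter plus combining marks), the marks would be stripped as disallowed characters. The sketch below normalizes to NFC first and replaces disallowed characters with a space instead of deleting them, so neighbouring words are not glued together. `clean_str_nfc` and `EXTRA_ALLOWED` are hypothetical names, and `str.isalnum()` is used as a stand-in for the explicit whitelist, so this variant also keeps letters from other scripts.

```python
import re
import unicodedata

# Punctuation and space kept in addition to letters and digits (assumed set)
EXTRA_ALLOWED = set("(),!?'\" ")

def clean_str_nfc(text):
    """Normalize to NFC, blank out disallowed characters, collapse whitespace."""
    if not text:
        return ""
    text = unicodedata.normalize("NFC", text)
    kept = "".join(
        ch if (ch.isalnum() or ch in EXTRA_ALLOWED) else " "
        for ch in text
    )
    return re.sub(r"\s+", " ", kept).strip()

# Decomposed input: every accented letter is a base letter plus combining marks
decomposed = unicodedata.normalize("NFD", "Tiếng Việt: xử lý văn bản.")
print(clean_str_nfc(decomposed))
```

Whether the VNTC files actually contain decomposed text is not established here; checking `unicodedata.is_normalized("NFC", text)` on a few documents would settle it.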

### Question 2.2: Write a Function to Convert Text to Lowercase

In [46]:
# make all text lowercase
def text_lowercase(string):
    #### YOUR CODE HERE ####
    if string is not None:
        string = string.lower()
    return string
    #### END YOUR CODE #####

### Question 2.3: Tokenize Words
Hint: Use the `word_tokenize()` function imported above with two parameters: `strings` and `format="text"`.


In [47]:
def tokenize(strings):
    #### YOUR CODE HERE ####
    return word_tokenize(strings, format="text")
    #### END YOUR CODE #####
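
As far as the underthesea README describes it, `format="text"` returns a single string in which the syllables of a multi-syllable word are joined by underscores and words are separated by spaces. The sketch below shows how such a string can be turned back into a list of words. `text_to_tokens` is a hypothetical helper, and the sample string is hard-coded to mimic the shape of that output rather than produced by calling the library.

```python
def text_to_tokens(tokenized_text):
    """Split underscore-joined text into a list of words with spaces restored."""
    return [token.replace("_", " ") for token in tokenized_text.split()]

# Hard-coded string shaped like word_tokenize(..., format="text") output
sample = "Chàng_trai 9X Quảng_Trị khởi_nghiệp từ nấm_sò"
print(text_to_tokens(sample))
```

Keeping the underscore form is convenient for the later steps, because a plain `split()` then yields whole words rather than single syllables.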

### Question 2.4: Remove Stop Words
To remove stop words, we use a list of Vietnamese stop words stored in the file `./vietnamese-stopwords.txt`. Complete the following program:
- Check each word in the text (`strings`). If a word is not in the stop words list, add it to `doc_words`.


In [48]:
with open('./vietnamese-stopwords.txt', 'r', encoding='utf-8') as f:
    STOP_WORDS = {line.strip() for line in f}
def remove_stopwords(strings):
    #### YOUR CODE HERE ####
    doc_words = [word for word in strings.split() if word not in STOP_WORDS]
    return ' '.join(doc_words)
    #### END YOUR CODE #####
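
Because `remove_stopwords` runs on tokenized text, multi-syllable words reach it joined by underscores. If the stop word file stores such entries with spaces (an assumption worth checking against the actual file), they would never match. A minimal sketch of normalizing the list before lookup follows; `normalize_stopwords` and `remove_stopwords_from` are hypothetical helpers, and the entries are an in-memory stand-in for the file.

```python
def normalize_stopwords(lines):
    """Build a stop word set whose multi-syllable entries use underscores."""
    stop_words = set()
    for line in lines:
        syllables = line.split()
        if syllables:  # skip blank lines
            stop_words.add("_".join(syllables))
    return stop_words

def remove_stopwords_from(tokenized_text, stop_words):
    """Drop every whitespace-separated token found in stop_words."""
    kept = [word for word in tokenized_text.split() if word not in stop_words]
    return " ".join(kept)

# In-memory stand-in for the lines of the stop word file
raw_entries = ["có thể", "và", ""]
stop_words = normalize_stopwords(raw_entries)
print(remove_stopwords_from("tôi có_thể học và làm", stop_words))
```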

### Question 2.5: Build a Preprocessing Function
Hint: Call the functions `clean_str`, `text_lowercase`, `tokenize`, and `remove_stopwords` in order, then return the result from the function.


In [49]:
def text_preprocessing(strings):
    #### YOUR CODE HERE ####
    strings = clean_str(strings)
    strings = text_lowercase(strings)
    strings = tokenize(strings)
    strings = remove_stopwords(strings)
    return strings
    #### END YOUR CODE #####


## Question 3: Perform Preprocessing
Now, we will read the corpus from the file created in Question 1. After that, we will call the preprocessing function for each document in the corpus.

Hint: Call the `text_preprocessing()` function with `doc_content` as the input parameter and save the result in the variable `temp1`.


In [None]:
#### YOUR CODE HERE ####
clean_docs = []
with open(dataset_name + '.txt', 'r', encoding='utf-8') as f:
    for doc_content in f:
        temp1 = text_preprocessing(doc_content)
        clean_docs.append(temp1)
#### END YOUR CODE #####
print("\nlength of clean_docs = ", len(clean_docs))
print('clean_docs[0]:\n' + clean_docs[0])

## Question 4: Save Preprocessed Data
Hint: Save the preprocessed data to a file named `dataset_name + '.clean.txt'`, where each document is written on a separate line.


In [None]:
#### YOUR CODE HERE ####
with open(dataset_name + '.clean.txt', 'w', encoding='utf-8') as f:
    for doc in clean_docs:
        f.write(doc + '\n')
#### END YOUR CODE #####
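
Questions 3 and 4 hold every cleaned document in memory before writing. For a larger corpus, the two steps can be merged into one streaming pass. The sketch below is one way to do that under stated assumptions: `preprocess_file` is a hypothetical helper, and the `preprocess` argument is expected to return text without a trailing newline, as `text_preprocessing` does after cleaning.

```python
import os
import tempfile

def preprocess_file(input_path, output_path, preprocess):
    """Apply preprocess to each input line and write the results line by line."""
    doc_count = 0
    with open(input_path, "r", encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for doc_content in src:
            dst.write(preprocess(doc_content) + "\n")
            doc_count += 1
    return doc_count

# Demo: strip plus lower stands in for text_preprocessing
with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "corpus.txt")
    clean_path = os.path.join(tmp_dir, "corpus.clean.txt")
    with open(raw_path, "w", encoding="utf-8") as f:
        f.write("Xin Chào\nViệt Nam\n")
    print(preprocess_file(raw_path, clean_path, lambda s: s.strip().lower()))
```

In the notebook, the call would take `dataset_name + '.txt'`, `dataset_name + '.clean.txt'`, and `text_preprocessing` as its three arguments.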