# Text data cleaning by means of sentence segmentation/tokenization

### This program is intended to clean text data by removing unnecessary sentences introduced by the strings "baca juga" and "advertisement". It uses the spaCy NLP library for sentence segmentation.

### 1. Reading dataset

#### We will be reading a dataset/corpus containing news text data.

In [149]:
import pandas as pd
import os

# reading the news dataset
news_dataset = pd.read_csv(f"{os.getcwd()}/news_dataset.csv")
# drop rows without content
news_dataset_clean = news_dataset.dropna(subset=['content'])
# note: this result is not assigned, so duplicated content stays in news_dataset_clean
news_dataset_clean['content'].drop_duplicates().reset_index(drop=True)
news_dataset_clean = news_dataset_clean.reset_index(drop=True)

print(news_dataset_clean["content"])

0      ALASKA — It's been snowing a lot lately in nor...
1      ESB Science Blast. 2.508 suka. An incredible S...
2      Suara.com - BRIZZI merupakan salah satu kartu ...
3      - Cash Tree Siapa sih yang tidak kenal dengan ...
4      KBEonline.id – Solo Leveling Season 2 episode ...
                             ...                        
897    HAIJAKARTA.ID – Nomor HP kamu bisa menerima sa...
898    JABAR EKSPRES – Hadir kembali link penghasil s...
899    Jakarta, benang.id – GoPay, unit bisnis Financ...
900    SINGAPORE: The Competition and Consumer Commis...
901    Once you sign up, $111 is added to your bonus ...
Name: content, Length: 902, dtype: object


#### The dataset contains 902 rows of news texts.
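
Since `drop_duplicates()` returns a new object instead of modifying the dataframe in place, the row count above still includes any duplicated articles unless the result is assigned back. A minimal sketch on a toy dataframe (the `toy` frame and its values are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# toy data standing in for the news dataset (illustrative values only)
toy = pd.DataFrame({"content": ["berita A", "berita B", "berita A", None]})

# drop missing rows, drop duplicated content, and assign the result back
toy_clean = (
    toy.dropna(subset=["content"])
       .drop_duplicates(subset=["content"])
       .reset_index(drop=True)
)

print(len(toy_clean))
```

Applying the same chain to `news_dataset` may change the number of rows reported above.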

In [150]:
import warnings

warnings.simplefilter("ignore", UserWarning)

# regex with the unnecessary keywords
regex_unecessary_words = r"(?i)\b(baca juga|advertisement|iklan)\b"

# add new column 'matches', where True marks news text containing an unnecessary keyword
news_dataset_clean['matches'] = news_dataset_clean['content'].str.contains(regex_unecessary_words)

# count the True values in column 'matches' and the number of rows in news_dataset_clean
nrow_matches = news_dataset_clean['matches'].sum()
nrow_news_dataset = len(news_dataset_clean)

print(f"{nrow_matches} instances of news text with unnecessary words \n{nrow_matches/nrow_news_dataset:.0%} of instances with unnecessary words")

310 instances of news text with unnecessary words 
34% of instances with unnecessary words


#### Of the 902 instances, 310 (34%) contain unnecessary words. Below are randomly picked news texts containing the aforementioned words:

In [165]:
import random
import re

# the keywords from the regex, listed separately for the loop
key_unecessary_words = ["baca juga", "advertisement", "iklan"]

# loop over each keyword in key_unecessary_words
for keyword in key_unecessary_words:
    matches = news_dataset_clean[news_dataset_clean['content'].str.contains(r'(?i)\b' + re.escape(keyword) + r'\b', na=False)]
    matches = matches.reset_index(drop=True) # reset the indexing
    print(f"--- Unnecessary words: {keyword} ({len(matches)} observations) ---")

    if len(matches) > 0:
        # print up to three randomly chosen news texts (the same text may be drawn more than once)
        for num in range(min(3, len(matches))):
            random_news_instance = random.randint(0, len(matches) - 1)
            print(f"Random news {num}: {matches['content'].iloc[random_news_instance]}\n")
    else:
        print(f"No matches for {keyword} found in the dataset.\n")

--- Unnecessary words: baca juga (291 observations) ---
Random news 0: Daftar Aplikasi Pinjaman Rp500 Ribu Langsung Cair, Aman dan Cepat Bisa untuk Tambahan Dana Mudik Lebaran Reporter: Yuni Khaerunisa | Editor: Syamsul Falaq | Rabu 19-03-2025,17:44 WIB Daftar aplikasi pinjaman Rp500 ribu yang bisa langsung cair, aman dan cepat--iStockphoto SPinjam atau ShopeePinjam ini adalah produk dari Shopee yang disediakan oleh PT Lentera Dana Nusantara sebagai operator platform layanan ini. Perusahaan ini sudah terdaftar dan diawasi langsung oleh OJK, sehingga untuk keamanannya pasti terjamin. Bunga yang ditawarkan mulai dari 1,95 persen per bulan. Sementara itu, limit nominalnya Rp500.000 dengan pilihan tenor 2, 3, 6, 12 bulan. Prosesnya juga cepat dengan pencairan dana dalam waktu 5 menit saja. Kamu bisa membuka aplikasi Shopee dan masukkan jumlah pinjaman, minimal Rp500.000 untuk memenuhi kebutuhan. BACA JUGA : Langsung Cair dengan KTP Hingga 80 Juta, Inilah Pinjaman Online Resmi OJK yang Mudah

#### Using a regular expression (regex), we detect 310 news texts containing the strings "baca juga", "advertisement", or "iklan", and we can inspect random examples of them. However, two points need to be taken into consideration:
1. The regex only retrieves news texts that contain the strings; it ignores the sentence each string belongs to (e.g., "baca juga" can occur in non-initial positions, and the strings can be part of a meaningful sentence rather than just an imperative phrase redirecting readers to other news articles)
2. The strings "iklan" and "advertisement" seem to be noise only when they are fully capitalized

#### To account for these problems, we will:
1. Modify the regex to only filter out the strings "iklan" and "advertisement" when they are fully capitalized
2. Use spaCy to segment the news into separate sentences, then modify the regex to filter out "baca juga" in sentence-initial positions
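
The two steps above can be sketched with the standard `re` module. This is only an illustration under stated assumptions: the `sentences` list is hypothetical (it stands in for the output of a sentence segmenter), and the names `initial_baca_juga` and `uppercase_ads` are introduced here and do not appear elsewhere in this notebook.

```python
import re

# hypothetical, already-segmented sentences (not taken from the dataset)
sentences = [
    "Baca juga: Cara Daftar QRIS TAP",
    "Pengguna disarankan baca juga syarat dan ketentuan.",
    "ADVERTISEMENT",
    "Harga iklan naik tahun ini.",
]

# "baca juga" only at the start of a sentence, in any letter case
initial_baca_juga = re.compile(r"^\s*baca juga\b", re.IGNORECASE)
# "iklan"/"advertisement" only when fully capitalized (case-sensitive)
uppercase_ads = re.compile(r"\b(?:ADVERTISEMENT|IKLAN)\b")

# keep sentences that match neither pattern
kept = [
    s for s in sentences
    if not initial_baca_juga.search(s) and not uppercase_ads.search(s)
]
print(kept)
```

In this sketch, the sentence with a non-initial "baca juga" and the sentence with lowercase "iklan" are kept, while the other two are dropped.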

### 2. Updating regular expressions

In [169]:
# new regex, intended to retrieve "iklan" and "advertisement" only when in uppercase
# caution: in Python's re, a leading (?i) applies to the whole pattern, so the uppercase
# alternatives below still match in any letter case; a scoped group such as
# (?i:baca juga) would restrict the flag to "baca juga"
new_regex = r"(?i)\b(baca juga)\b|\b(ADVERTISEMENT|IKLAN)\b"

# work on a copy so that the original dataframe is not modified through an alias
new_dataset = news_dataset_clean.copy()

# add new column 'matches', where True marks news text containing an unnecessary keyword
new_dataset['matches'] = new_dataset['content'].str.contains(new_regex)

# count the True values in column 'matches' and the number of rows in new_dataset
new_nrow_matches = new_dataset['matches'].sum()
new_nrow_news_dataset = len(new_dataset)

print(f"{new_nrow_matches} instances of news text with unnecessary words \n{new_nrow_matches/new_nrow_news_dataset:.0%} of instances with unnecessary words")

310 instances of news text with unnecessary words 
34% of instances with unnecessary words
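
The count above remains 310, the same as before the regex was changed. One likely reason is that a leading `(?i)` in Python's `re` syntax makes the whole pattern case-insensitive, including the uppercase alternatives. The sketch below contrasts the global flag with a scoped `(?i:...)` group (available since Python 3.6); the sample strings and the names `global_flag` and `scoped_flag` are made up for illustration.

```python
import re

# global flag: (?i) at the start makes the WHOLE pattern case-insensitive
global_flag = re.compile(r"(?i)\b(?:baca juga)\b|\b(?:ADVERTISEMENT|IKLAN)\b")
# scoped flag: (?i:...) limits case-insensitivity to "baca juga"
scoped_flag = re.compile(r"\b(?i:baca juga)\b|\b(?:ADVERTISEMENT|IKLAN)\b")

lowercase_ad = "Biaya iklan digital terus meningkat."

print(bool(global_flag.search(lowercase_ad)))               # True: lowercase "iklan" still matches
print(bool(scoped_flag.search(lowercase_ad)))               # False: only uppercase IKLAN would match
print(bool(scoped_flag.search("BACA JUGA : berita lain")))  # True: "baca juga" in any case
print(bool(scoped_flag.search("ADVERTISEMENT")))            # True: uppercase marker
```

Passing the scoped pattern to `str.contains` would presumably yield a different count than the one shown above.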


In [163]:
# the keywords from the new regex, listed separately for the loop
new_unecessary_words = ["baca juga",  "ADVERTISEMENT", "IKLAN"]

# loop over each keyword in new_unecessary_words
for keyword in new_unecessary_words:
    if keyword == "baca juga":
        pattern = r"(?i)\b" + re.escape(keyword) + r"\b"  # case-insensitive for "baca juga"
    else:
        pattern = r"\b" + re.escape(keyword) + r"\b"  # exact match for "ADVERTISEMENT" and "IKLAN"

    new_matches = new_dataset[new_dataset['content'].str.contains(pattern, na=False)] # subset dataframe containing keyword
    new_matches = new_matches.reset_index(drop=True) # reset the indexing
    print(f"--- Unnecessary words: {keyword} ({len(new_matches)} observations) ---")

    if len(new_matches) > 0:
        # print up to three randomly chosen news texts (the same text may be drawn more than once)
        for num in range(min(3, len(new_matches))):
            random_news_instance = random.randint(0, len(new_matches) - 1)
            print(f"Random news {num}: {new_matches['content'].iloc[random_news_instance]}\n")
    else:
        print(f"No matches for {keyword} found in the dataset.\n")

--- Unnecessary words: baca juga (291 observations) ---
Random news 0: tirto.id - Pembayaran digital terbaru bernama QRIS TAP telah diresmikan oleh Bank Indonesia pada Jumat (14/3/2025) di Jakarta Pusat. Tercatat ada sejumlah 15 penyedia jasa pembayaran (PJP) berupa bank dan e-wallet yang dapat menerapkan fitur tersebut saat ini. QRIS TAP adalah pengembangan dari metode QRIS biasa, tetapi telah berinovasi menggunakan teknologi berbasis Near Field Communication (NFC). Untuk menerapkan metode pembayaran tersebut, pengguna harus memiliki ponsel berbasis NFC. Teknologi tersebut memungkinkan pengguna untuk tidak melakukan pemindaian kode QR (QRIS CPM) lagi. Cukup menempelkan atau mendekatkan ponsel ke NFC Reader, lalu sistem yang tersedia akan membaca data dari e-wallet atau e-banking milik pengguna. Proses pembacaan data dari ponsel NFC oleh NFC Reader diklaim hanya memakan waktu 0,3 detik saja. Ini jauh lebih cepat dibanding QRIS dengan basis teknologi chip yang butuh 4-5 detik untuk memin

#### Restricting "iklan" and "advertisement" to uppercase decreased the number of observations matched by these strings. It also showed that some of the "advertisement" strings are sentence-bound.

#### To conclude, of the three strings "baca juga", "advertisement", and "iklan", two are likely to be sentence-bound: "baca juga" and "advertisement". Therefore, we will use the spaCy library to handle these two.

#### As for the remaining strings, the non-sentence-bound "IKLAN" and "ADVERTISEMENT" in uppercase, we will use regex to remove them from the news texts.
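
As a rough sketch of that regex-based removal, the helper below deletes the uppercase markers and tidies the leftover whitespace. The function name `strip_uppercase_noise` and the sample sentence are hypothetical and only serve as an illustration.

```python
import re

# uppercase-only markers, matched case-sensitively
uppercase_noise = re.compile(r"\b(?:ADVERTISEMENT|IKLAN)\b")

def strip_uppercase_noise(text: str) -> str:
    """Remove uppercase ad markers and collapse the leftover whitespace."""
    without_markers = uppercase_noise.sub(" ", text)
    return re.sub(r"\s+", " ", without_markers).strip()

# made-up sample: the uppercase marker is noise, the capitalized word "Iklan" is not
sample = "Harga emas naik. ADVERTISEMENT Iklan layanan masyarakat tetap tayang."
cleaned = strip_uppercase_noise(sample)
print(cleaned)
```

On a dataframe, the helper could be applied with `new_dataset['content'].apply(strip_uppercase_noise)`. Note that collapsing whitespace also removes line breaks, which may or may not be desirable.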

### 3. Sentence segmentation/tokenization with spaCy

In [69]:
import spacy

nlp = spacy.load("xx_ent_wiki_sm")
sentencizer = nlp.add_pipe('sentencizer')
sentencizer.from_disk(f"{os.getcwd()}/sentencizer.json") # json file with list of punctuation mark

text = "The U.S. Drug Enforcement Administration (DEA) says hello. And have a nice day. Halo! Berikut merupakan teks berbahasa Indonesia. Presiden A.S. baru saja memberikan pidato di depan Gedung Putih."

text_unpunctuated = "She loves to read books on weekends she enjoys visiting the library and discovering new authors he prefers to read mystery novels and often finishes one in a day their favorite place is the bookstore which is near the park they meet there every Saturday morning to share their new finds they sometimes discuss their favorite characters and plot twists with excitement"

# Process the text using spaCy
doc = nlp(text)

# Print the segmented sentences
for sent in doc.sents:
    print(sent.text)

# Process the unpunctuated text using spaCy
doc = nlp(text_unpunctuated)

# Print the segmented sentences
for sent in doc.sents:
    print(sent.text)

The U.S. Drug Enforcement Administration (DEA) says hello.
And have a nice day.
Halo!
Berikut merupakan teks berbahasa Indonesia.
Presiden A.S. baru saja memberikan pidato di depan Gedung Putih.
She loves to read books on weekends she enjoys visiting the library and discovering new authors he prefers to read mystery novels and often finishes one in a day their favorite place is the bookstore which is near the park they meet there every Saturday morning to share their new finds they sometimes discuss their favorite characters and plot twists with excitement


#### The spaCy sentencizer pipe segments sentences at punctuation marks. Unlike a plain regex split on punctuation, it does not break at abbreviations such as "A.S." and "U.S.", and it handles both Indonesian and English text.
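
For contrast, here is what a naive regex split does with the same kind of text. The splitting rule (break on whitespace that follows `.`, `!`, or `?`) is an assumption chosen for illustration; it is not used elsewhere in this notebook.

```python
import re

text = "The U.S. Drug Enforcement Administration (DEA) says hello. And have a nice day."

# naive rule: split on whitespace that follows ., ! or ?
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for sentence in naive_sentences:
    print(sentence)
```

The naive rule breaks after "U.S.", producing the fragment "The U.S." as a separate sentence.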

#### Caveat: The spaCy sentencizer pipe does not work well with unpunctuated text. We could write a function that roughly estimates sentence boundaries in unpunctuated text, but its results would not be accurate.

### 4. Sentence segmentation with NLTK

In [61]:
import nltk

text = "The U.S. Drug Enforcement Administration (DEA) says hello. And have a nice day. Halo! Berikut merupakan teks berbahasa Indonesia. Presiden A.S. baru saja memberikan pidato di depan Gedung Putih."

text_unpunctuated = "She loves to read books on weekends she enjoys visiting the library and discovering new authors he prefers to read mystery novels and often finishes one in a day their favorite place is the bookstore which is near the park they meet there every Saturday morning to share their new finds they sometimes discuss their favorite characters and plot twists with excitement"

tokenized_text = nltk.tokenize.sent_tokenize(text)

for token in tokenized_text:
    print(token)

tokenized_text = nltk.tokenize.sent_tokenize(text_unpunctuated)

for token in tokenized_text:
    print(token)

The U.S. Drug Enforcement Administration (DEA) says hello.
And have a nice day.
Halo!
Berikut merupakan teks berbahasa Indonesia.
Presiden A.S. baru saja memberikan pidato di depan Gedung Putih.
She loves to read books on weekends she enjoys visiting the library and discovering new authors he prefers to read mystery novels and often finishes one in a day their favorite place is the bookstore which is near the park they meet there every Saturday morning to share their new finds they sometimes discuss their favorite characters and plot twists with excitement


#### Just like the spaCy sentencizer pipe, NLTK's `sent_tokenize` function segments sentences at punctuation marks while leaving abbreviations intact. It also handles both Indonesian and English text.

#### Caveat: Just like the spaCy sentencizer pipe, it does not work well with unpunctuated text.

#### The challenge with both tools is that they cannot detect sentence boundaries in unpunctuated text, while most "baca juga" sentences are not delimited by any punctuation marks.

### 5. Heuristic approach to text data cleaning

#### From the previous observations, we know that the strings "baca juga" and "advertisement" are sentence-bound. Since spaCy and NLTK rely on punctuation to segment sentences and cannot segment unpunctuated text, we need to move to a heuristic approach. To do so, we first need to understand the patterns of the sentences that contain "baca juga" and "advertisement".
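
One simple way to study such patterns is to look at the text immediately before and after each occurrence of a keyword. The sketch below is a hypothetical helper (`keyword_contexts` is a name introduced here), and the sample string is made up, loosely modeled on the output shown earlier.

```python
import re

def keyword_contexts(text, keyword, width=40):
    """Return (left, match, right) context windows for each keyword hit."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    contexts = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]   # up to `width` chars before the hit
        right = text[m.end():m.end() + width]              # up to `width` chars after the hit
        contexts.append((left, m.group(0), right))
    return contexts

# made-up sample text for illustration
sample = "Prosesnya cepat. BACA JUGA : Pinjaman Online Resmi OJK yang Mudah"
contexts = keyword_contexts(sample, "baca juga", width=20)

for left, hit, right in contexts:
    print(f"{left!r} | {hit!r} | {right!r}")
```

Inspecting the left and right windows across many articles can reveal recurring cues, such as a preceding period or a following colon, that a heuristic could rely on.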

### 5.1. Analyzing sentence patterns bound to the "baca juga" string