## Text data cleaning by means of sentence segmentation/tokenization

### This program cleans text data by removing unnecessary sentences that begin with the strings "baca juga" (Indonesian for "read also") or "advertisement". It uses the spaCy NLP library for sentence segmentation.

### 1. Reading dataset

#### We will be reading a dataset/corpus containing news text data.

In [52]:
import pandas as pd
import os

news_dataset = pd.read_csv(f"{os.getcwd()}/news_dataset.csv")
print(news_dataset["content"])

0                                                     NaN
1       ALASKA — It's been snowing a lot lately in nor...
2                                                     NaN
3       ESB Science Blast. 2.508 suka. An incredible S...
4                                                     NaN
                              ...                        
1043    HAIJAKARTA.ID – Nomor HP kamu bisa menerima sa...
1044    JABAR EKSPRES – Hadir kembali link penghasil s...
1045    Jakarta, benang.id – GoPay, unit bisnis Financ...
1046    SINGAPORE: The Competition and Consumer Commis...
1047    Once you sign up, $111 is added to your bonus ...
Name: content, Length: 1048, dtype: object


#### The dataset contains 1048 rows of news text, some of which have missing (NaN) content.

In [53]:
import warnings

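# pandas emits a UserWarning when a pattern passed to str.contains has a capture group; silence it.
warnings.simplefilter("ignore", UserWarning)

# Case-insensitive pattern for the unwanted strings.
regex_unecessary_words = r"(?i)\b(baca juga|advertisement|iklan)\b"

# na=False counts rows with missing content as non-matches, so the column is strictly boolean.
news_dataset['matches'] = news_dataset['content'].str.contains(regex_unecessary_words, na=False)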

nrow_matches = news_dataset['matches'].sum()
nrow_news_dataset = len(news_dataset)

print(f"{nrow_matches} instances of news text \n{nrow_matches/nrow_news_dataset:.0%} of instances contain unnecessary words")

310 instances of news text 
30% of instances contain unnecessary words
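
Because some rows have missing content, the share of matches depends on which denominator is used. The sketch below illustrates this on a toy DataFrame; the four rows are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data (not from news_dataset.csv); the first row has missing content.
toy = pd.DataFrame({
    "content": [
        None,
        "Baca juga: berita lain.",
        "Harga emas naik.",
        "IKLAN Promo hari ini.",
    ]
})

# Same strings as in the notebook, with a non-capturing group to avoid the pandas warning.
pattern = r"(?i)\b(?:baca juga|advertisement|iklan)\b"

# na=False makes missing rows count as non-matches instead of NaN.
has_match = toy["content"].str.contains(pattern, na=False)
n_matches = int(has_match.sum())

share_all_rows = n_matches / len(toy)
share_rows_with_text = n_matches / int(toy["content"].notna().sum())

print(f"{n_matches} matches")
print(f"{share_all_rows:.0%} of all rows, {share_rows_with_text:.0%} of rows with text")
```

The notebook divides by `len(news_dataset)`, so its percentage is relative to all rows, including those with missing content.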


#### Of the 1048 instances, 310 (30%) contain unnecessary words. Below are randomly picked news texts containing these words:

In [54]:
import random
import re

# Rows whose content matches any unwanted string (rows with missing content are excluded).
matches = news_dataset[news_dataset['content'].str.contains(regex_unecessary_words, na=False)]

key_unecessary_words = ["baca juga", "advertisement", "iklan"]

for keyword in key_unecessary_words:
    print(f"unecessary words: {keyword}")
    matches_keyword = matches[matches['content'].str.contains(r'(?i)\b' + re.escape(keyword) + r'\b', na=False)]
    if matches_keyword.empty:
        print(f"No matches for {keyword} found in the dataset.\n")
        continue
    # Pick up to three distinct rows at random so the same article is not printed twice.
    n_samples = min(3, len(matches_keyword))
    for num, position in enumerate(random.sample(range(len(matches_keyword)), n_samples)):
        print(f"Random news {num}: {matches_keyword['content'].iloc[position]}\n")

unecessary words: baca juga
Random news 0: Jakarta – CGS International Sekuritas Indonesia melihat pergerakan Indeks Harga Saham Gabungan (IHSG) secara teknikal hari ini (19/3) berpotensi untuk melanjutkan pelemahannya. “IHSG diprediksi akan melanjutkan pelemahannya dengan kisaran support 5.795-6.010 dan resistance 6.435-6.650,” tulis manajemen CGS dalam risetnya di Jakarta, 19 Maret 2025. Di mana, pada perdagangan kemarin (18/3) IHSG kembali ditutup bertahan pada zona merah ke posisi 6.223,38 dari dibuka pada level 6.458,66 atau melemah 3,84 persen. Baca juga: Pasar Saham RI Gelap: Longsor Besar dan Dibekukan Sementara Manajemen CGS menjelaskan bahwa, melemahnya indeks di bursa Wall Street dan berlanjutnya aksi jual investor asing dalam jumlah yang besar diprediksi akan menjadi katalis negatif untuk IHSG. Pelemahan indeks bursa Wall Street tercermin dari indeks Dow Jones yang turun 0,62 persen dan indeks S&P 500 mengalami pelemahan sebanyak 1,02 persen. Secara rinci, investor asing me

#### The regular expression (regex) detects 310 news texts containing the strings "baca juga", "advertisement", or "iklan", and the cell above prints randomly picked examples of them. However, two limitations should be kept in mind:
1. The regex only retrieves news texts that contain the strings above; it does not consider the sentence in which a string occurs. For example, "baca juga" can appear in a non-initial position as part of a meaningful sentence, rather than as an imperative phrase redirecting readers to other news articles.
2. The strings "iklan" and "advertisement" appear to be noise only when they are written in all capital letters.
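
Both limitations can be reproduced on toy strings. The sample texts below are invented for illustration and are not taken from the dataset; the pattern is the same case-insensitive one used above.

```python
import re

# Same case-insensitive pattern as in the notebook.
pattern = re.compile(r"(?i)\b(baca juga|advertisement|iklan)\b")

# Hypothetical example texts (assumptions, not rows from the dataset).
samples = {
    # A redirect line: this is the noise we want to catch.
    "redirect": "Harga naik. Baca juga: Harga beras turun. Pasar tetap ramai.",
    # "baca juga" inside a meaningful sentence, in non-initial position.
    "meaningful_baca_juga": "Selain menulis, ia baca juga buku sejarah.",
    # Lowercase "iklan" used as an ordinary word.
    "meaningful_iklan": "Perusahaan itu menambah belanja iklan tahun ini.",
    # All-caps marker left over from the page layout.
    "noise_caps": "ADVERTISEMENT Harga emas naik hari ini.",
}

# The pattern cannot tell noise from meaningful use: every sample is flagged.
flagged = {name: bool(pattern.search(text)) for name, text in samples.items()}
print(flagged)
```

All four samples are flagged, although only the first and the last contain actual noise.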

#### To address these problems, we will:
1. Modify the regex to filter out the strings "iklan" and "advertisement" only when they are written in all capital letters
2. Use spaCy to segment each news text into sentences, then modify the regex to filter out "baca juga" only in sentence-initial position
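
The two steps could look roughly like the sketch below. It is not the notebook's final implementation: the blank multilingual pipeline (`spacy.blank("xx")`) with the rule-based `sentencizer`, the helper name `clean_text`, and the sample text are all assumptions made for illustration.

```python
import re
import spacy

# Assumption: a blank multilingual pipeline with a rule-based sentence splitter.
# A trained model could be substituted if better sentence boundaries are needed.
nlp = spacy.blank("xx")
nlp.add_pipe("sentencizer")

# Step 1: case-sensitive, so only the all-caps markers are treated as noise.
caps_noise = re.compile(r"\b(?:IKLAN|ADVERTISEMENT)\b")
# Step 2: case-insensitive, but anchored to the start of a sentence.
initial_baca_juga = re.compile(r"^\s*baca juga\b", re.IGNORECASE)


def clean_text(text):
    """Hypothetical helper: strip all-caps markers, then drop 'baca juga' sentences."""
    text = caps_noise.sub(" ", text)
    kept = [
        sent.text.strip()
        for sent in nlp(text).sents
        if not initial_baca_juga.match(sent.text)
    ]
    return " ".join(sentence for sentence in kept if sentence)


# Invented sample text, not taken from the dataset.
sample = (
    "ADVERTISEMENT Harga naik. Baca juga: Harga beras turun. "
    "Selain menulis, ia baca juga buku sejarah. Belanja iklan meningkat."
)
cleaned = clean_text(sample)
print(cleaned)
```

Lowercase "iklan" and the non-initial "baca juga" are kept because they belong to meaningful sentences. One caveat of this approach: when a "Baca juga" headline is not followed by sentence-final punctuation, as in the sample output above, the splitter merges it with the next sentence, so that sentence would be dropped as well.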