## Text data cleaning by means of sentence segmentation/tokenization

### This program cleans text data by removing unnecessary sentences that begin with the strings "baca juga" (Indonesian for "read also") and "advertisement". It uses the spaCy NLP library for sentence segmentation.

### 1. Reading dataset

#### We will be reading a dataset/corpus containing news text data.

In [42]:
import pandas as pd
import os

news_dataset = pd.read_csv(f"{os.getcwd()}/news_dataset.csv")
print(news_dataset["content"])

0                                                     NaN
1       ALASKA — It's been snowing a lot lately in nor...
2                                                     NaN
3       ESB Science Blast. 2.508 suka. An incredible S...
4                                                     NaN
                              ...                        
1043    HAIJAKARTA.ID – Nomor HP kamu bisa menerima sa...
1044    JABAR EKSPRES – Hadir kembali link penghasil s...
1045    Jakarta, benang.id – GoPay, unit bisnis Financ...
1046    SINGAPORE: The Competition and Consumer Commis...
1047    Once you sign up, $111 is added to your bonus ...
Name: content, Length: 1048, dtype: object


#### The dataset contains 1048 rows of news text (indices 0 to 1047), some of which have missing content (NaN).

In [43]:
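import warnings

# Silence the pandas UserWarning about capture groups in the pattern passed to str.contains
warnings.simplefilter("ignore", UserWarning)

# Case-insensitive, whole-word match for the unwanted keywords
unecessary_words = r"(?i)\b(baca juga|advertisement|iklan)\b"

# na=False treats rows with missing content as non-matches instead of NaN
news_dataset['matches'] = news_dataset['content'].str.contains(unecessary_words, na=False)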

nrow_matches = news_dataset['matches'].sum()
nrow_news_dataset = len(news_dataset)

print(f"{nrow_matches} instances of news text \n{nrow_matches/nrow_news_dataset:.0%} of instances with unnecessary words")

310 instances of news text 
30% of instances with unnecessary words


#### Of the 1048 instances, 310 (about 30%) contain unnecessary words. Below are randomly picked news texts containing these words:

In [44]:
import random
import re

matches = news_dataset[news_dataset['content'].str.contains(unecessary_words, na=False)]

key_unecessary_words = ["baca juga", "advertisement", "iklan"]

for keyword in key_unecessary_words:
    # Keep only the rows that contain the current keyword as a whole word
    keyword_mask = matches['content'].str.contains(r'(?i)\b' + re.escape(keyword) + r'\b', na=False)
    matches_reset = matches[keyword_mask].reset_index(drop=True)
    print(f"unecessary words: {keyword}")
    # Print up to three randomly chosen examples for this keyword
    for num in range(min(3, len(matches_reset))):
        random_news_instance = random.randint(0, len(matches_reset) - 1)
        print(f"random news {num}: {matches_reset['content'][random_news_instance]}\n")



unecessary words: baca juga
random news 0: Info Penting! Ada Cs DANA Palsu dari WhatsApp Yang Berujung Penipuan, Perhatikan Ciri-ciri ini Agar Terhindar Reporter: Denis Ahmad | Editor: Denis Ahmad | Rabu 19-03-2025,06:58 WIB Hati-hati modus penipuan mengaku sebagai Cs DANA dari WhatsApp -radarindramayu.id-Foto via dana.id - Instagram RADARINDRAMAYU.ID - Menginformasikan dari laman Instagram resmi @ dana .id pada 16 Maret 2025, saat ini sedang marak modus penipuan yang mengatasnamakan Aplikasi dana . Sebuah aplikasi dompet digital , yang populer di gunakan masyarakat Indonesia untuk membantu transaksi /pembayaran jadi lebih mudah. Namun, dari kemudahan ini banyak sekali para penipu yang memanfaatkan segala cara untuk menghasut para pengguna DANA. Berbagai hal telah dilakukan oleh para penipu, mulai dari Link DANA yang bermodus mendapatkan saldo gratis . BACA JUGA: Simulasi Cicilan Pinjaman Non KUR Bank Mandiri 2025: Ajukan Sekarang dan Dapatkan Dana Hingga Rp1,5 Miliar! Hingga yang terb

#### Using a regular expression, we detect 310 news texts containing the strings "baca juga", "advertisement", or "iklan", and the random samples above show them in context. However, this regular expression only checks whether the strings occur: it has no notion of sentences, so a match can be at the beginning, middle, or end of a sentence.
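To make the limitation concrete, here is a minimal sketch using only the standard `re` module. The three example sentences are hypothetical and do not come from the dataset; the pattern is the same one used in the cells above. It shows that the pattern matches regardless of where the keyword sits in the sentence.

```python
import re

# Same pattern as in the cells above: case-insensitive, whole-word match
unecessary_words = r"(?i)\b(baca juga|advertisement|iklan)\b"

# Hypothetical example sentences (not taken from the dataset)
examples = {
    "beginning": "Baca juga: cara mengisi saldo dompet digital.",
    "middle": "Tarif iklan televisi naik tahun ini.",
    "end": "Artikel ini tidak memuat iklan.",
}

# Record the character offset of the first match in each sentence
positions = {}
for label, sentence in examples.items():
    match = re.search(unecessary_words, sentence)
    positions[label] = match.start() if match else None
    print(f"{label}: first match at character {positions[label]}")
```

Only the first sentence is the kind of boilerplate we want to drop; the other two are ordinary sentences that merely mention the word "iklan", yet the pattern flags all three.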

#### To address this problem, we will use spaCy to segment each text into sentences, and we will modify the regular expression so that it matches these strings only at the beginning of a sentence.
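The approach described above could look roughly like the following sketch. It rests on several assumptions that are not in the original notebook: spaCy v3 with a blank multi-language pipeline (`spacy.blank("xx")`) and the rule-based `sentencizer` instead of a trained model, a hypothetical helper named `remove_unnecessary_sentences`, an anchored pattern `sentence_start_pattern`, and a made-up sample text. If spaCy is not installed, the sketch falls back to a naive regex split so that it still runs.

```python
import re

# Assumed pattern: match a keyword only at the very start of a sentence
sentence_start_pattern = re.compile(r"(?i)^\s*(?:baca juga|advertisement|iklan)\b")

try:
    import spacy

    # Blank multi-language pipeline plus the rule-based sentencizer (spaCy v3 API);
    # no trained model download is needed for this
    nlp = spacy.blank("xx")
    nlp.add_pipe("sentencizer")

    def split_sentences(text):
        """Segment text into sentences with spaCy."""
        return [sent.text.strip() for sent in nlp(text).sents if sent.text.strip()]

except ImportError:

    def split_sentences(text):
        """Fallback without spaCy: naive split after '.', '!' or '?'."""
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def remove_unnecessary_sentences(text):
    """Drop every sentence that starts with one of the unwanted keywords."""
    kept = [s for s in split_sentences(text) if not sentence_start_pattern.match(s)]
    return " ".join(kept)


# Hypothetical sample text (not taken from the dataset)
sample = (
    "Penipu menawarkan saldo gratis. "
    "BACA JUGA: Simulasi cicilan pinjaman 2025! "
    "Harga iklan televisi naik. "
    "Tetap waspada."
)
cleaned = remove_unnecessary_sentences(sample)
print(cleaned)
```

In this sketch, the sentence starting with "BACA JUGA" is removed, while the sentence that mentions "iklan" in the middle is kept. Applying it to the dataset might then be a matter of something like `news_dataset['content'].dropna().apply(remove_unnecessary_sentences)`, though real news text will likely need a closer look at how the sentencizer handles abbreviations and headlines.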