# Text data cleaning by means of sentence segmentation/tokenization

### This program cleans text data by removing unnecessary sentences that begin with the strings "baca juga" (Indonesian for "read also") and "advertisement". It uses the spaCy NLP library for sentence segmentation.

### 1. Reading dataset

#### We will be reading a dataset/corpus containing news text data.

In [60]:
import pandas as pd
import os

# read the news dataset and drop rows with missing 'content'
news_dataset = pd.read_csv(f"{os.getcwd()}/news_dataset.csv")
news_dataset_clean = news_dataset.dropna(subset=['content'])
news_dataset_clean = news_dataset_clean.reset_index(drop=True)

print(news_dataset_clean["content"])

0      ALASKA — It's been snowing a lot lately in nor...
1      ESB Science Blast. 2.508 suka. An incredible S...
2      Suara.com - BRIZZI merupakan salah satu kartu ...
3      - Cash Tree Siapa sih yang tidak kenal dengan ...
4      KBEonline.id – Solo Leveling Season 2 episode ...
                             ...                        
897    HAIJAKARTA.ID – Nomor HP kamu bisa menerima sa...
898    JABAR EKSPRES – Hadir kembali link penghasil s...
899    Jakarta, benang.id – GoPay, unit bisnis Financ...
900    SINGAPORE: The Competition and Consumer Commis...
901    Once you sign up, $111 is added to your bonus ...
Name: content, Length: 902, dtype: object


#### After dropping rows with missing content, the dataset contains 902 rows of news texts.

In [72]:
import warnings

warnings.simplefilter("ignore", UserWarning)

# case-insensitive, whole-word regex for the unnecessary keywords
regex_unecessary_words = r"(?i)\b(baca juga|advertisement|iklan)\b"

# add a new column 'matches', where True marks a news text containing an unnecessary keyword
news_dataset_clean['matches'] = news_dataset_clean['content'].str.contains(regex_unecessary_words)

# count the True values in column 'matches' and the total number of rows in news_dataset_clean
nrow_matches = news_dataset_clean['matches'].sum()
nrow_news_dataset = len(news_dataset_clean)

print(news_dataset_clean['matches'])
print(f"{nrow_matches} instances of news text with unnecessary words \n{nrow_matches/nrow_news_dataset:.0%} of instances with unnecessary words")

0      False
1      False
2      False
3       True
4       True
       ...  
897    False
898     True
899    False
900    False
901    False
Name: matches, Length: 902, dtype: bool
310 instances of news text with unnecessary words 
34% of instances with unnecessary words


#### Of the 902 instances, 310 (34%) are news texts containing unnecessary words. Below are randomly picked news texts containing these words:

In [74]:
import random
import re

# subset of rows whose 'content' matches regex_unecessary_words
matches = news_dataset_clean[news_dataset_clean['content'].str.contains(regex_unecessary_words, na=False)]

# the individual keywords covered by the regex
key_unecessary_words = ["baca juga", "advertisement", "iklan"]

# loop over each keyword in key_unecessary_words
for keyword in key_unecessary_words:
    print(f"--- Unnecessary words: {keyword} ---")
    # subset of rows containing this keyword (case-insensitive, whole word)
    matches_keyword = matches[matches['content'].str.contains(r'(?i)\b' + re.escape(keyword) + r'\b', na=False)]
    if len(matches_keyword) == 0:
        # report a missing keyword once instead of once per sample
        print(f"No matches for {keyword} found in the dataset.\n")
        continue
    for num in range(3): # print three randomly chosen news texts (repeats are possible)
        random_news_instance = random.randint(0, len(matches_keyword) - 1)
        print(f"Random news {num}: {matches_keyword['content'].iloc[random_news_instance]}\n")

--- Unnecessary words: baca juga ---
Random news 0: Tap Layar Dapat Saldo DANA Gratis Lewat Game Fruit Match, Seru dan Bikin Untung! Reporter: Neng Erlin | Editor: Ria Sofyan | Jumat 21-03-2025,10:59 WIB Ilustrasi. Tap layar dapat saldo dana gratis lewat game fruit match, seru dan bikin untung!--(Sumber : Doc/BETV) BETVNEWS - Saat ini kamu bisa mendapatkan saldo DANA gratis dari Fruit Match apk yang akan memberimu uang sebagai reward saat memainkan game tersebut. Oleh sebab itu, hal ini bisa menjadi salah satu cara menyenangkan untuk mendapatkan uang yang bisa langsung masuk ke akun DANA milikmu. BACA JUGA: Baru Instal Langsung Dapet Rp10 Ribu, Coba Color Lab Sekarang Saldo DANA Gratis Auto Cair! BACA JUGA: Reses DPRD Seluma Habiskan Rp750 Juta, Tokoh Masyarakat: Ajang Tampung Aspirasi, Nihil Realisasi Aplikasi ini merupakan aplikasi penghasil uang berbasis permainan puzzle yang mengharuskan pemain mencocokkan buah-buahan untuk mendapatkan poin. Poin yang dikumpulkan inilah yang nantiny

#### Using a regular expression (regex), we detect 310 news texts containing the strings "baca juga", "advertisement", or "iklan", and the random samples above show these strings in context. However, two caveats are important:
1. The regex only retrieves news texts containing these strings; it does not consider the sentence in which they occur (e.g., "baca juga" can appear in non-initial positions, where it may be part of a meaningful sentence rather than an imperative phrase redirecting readers to other news articles)
2. The strings "iklan" and "advertisement" seem to be noise only when they are written in all capitals
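
Both caveats can be reproduced on a few toy sentences. The snippet below is a minimal sketch: the sentences in `samples` are hypothetical, written for illustration only, and are not taken from the dataset. It applies the same case-insensitive regex to each sentence and shows that it cannot tell noise apart from meaningful text.

```python
import re

# same case-insensitive regex as in the cell above
regex_unecessary_words = r"(?i)\b(baca juga|advertisement|iklan)\b"

# hypothetical example sentences (assumption: not taken from news_dataset.csv)
samples = [
    "BACA JUGA: Cara Klaim Saldo DANA Gratis",      # redirect line, noise
    "Pemerintah mengatur iklan rokok di televisi.",  # 'iklan' inside a meaningful sentence
    "Silakan baca juga panduan resmi dari bank.",    # 'baca juga' in non-initial position
    "ADVERTISEMENT",                                 # all-caps ad marker, noise
]

# True when the regex finds any keyword in the sentence
flagged = [bool(re.search(regex_unecessary_words, s)) for s in samples]

for sentence, is_flagged in zip(samples, flagged):
    print(f"{is_flagged!s:5} | {sentence}")
```

All four sentences are flagged, although only the first and last are noise, which is why a stricter regex and sentence-level filtering are needed.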

#### To account for these problems, we will:
1. Modify the regex so that it filters out "iklan" and "advertisement" only when they are written in all capitals
2. Use spaCy to segment the news texts into sentences, then modify the regex to filter out "baca juga" only in sentence-initial position
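
One possible shape for these two steps is sketched below. It is not the notebook's final implementation. The names `regex_caps_ads`, `regex_initial_baca_juga`, `split_sentences`, and `clean_sentences` are hypothetical, and the sketch assumes spaCy v3 with a blank Indonesian pipeline plus the rule-based `sentencizer` component, since the notebook does not state which spaCy model it uses. The example sentences are again made up for illustration.

```python
import re

# step 1: case-sensitive pattern, so only all-caps ad markers count as noise
regex_caps_ads = re.compile(r"\b(?:IKLAN|ADVERTISEMENT)\b")

# step 2: case-insensitive pattern anchored at the start of a sentence
regex_initial_baca_juga = re.compile(r"^\s*baca juga\b", re.IGNORECASE)


def split_sentences(text):
    """Segment text into sentences with spaCy's rule-based sentencizer.

    Assumption: spaCy v3 is installed; a blank pipeline is used here,
    which may differ from the model the notebook ends up using.
    """
    import spacy  # imported lazily so the regex helpers work without spaCy

    nlp = spacy.blank("id")
    nlp.add_pipe("sentencizer")
    return [sent.text.strip() for sent in nlp(text).sents]


def clean_sentences(sentences):
    """Drop sentences starting with 'baca juga' and strip all-caps ad markers."""
    cleaned = []
    for sentence in sentences:
        if regex_initial_baca_juga.search(sentence):
            continue  # the whole sentence is a redirect to another article
        sentence = regex_caps_ads.sub("", sentence).strip()
        if sentence:  # skip sentences that consisted only of an ad marker
            cleaned.append(sentence)
    return cleaned


# hypothetical, already segmented sentences (assumption: not from the dataset)
sentences = [
    "ADVERTISEMENT Pemerintah mengatur iklan rokok di televisi.",
    "Baca juga: Cara Klaim Saldo DANA Gratis.",
    "Silakan baca juga panduan resmi dari bank.",
]
cleaned = clean_sentences(sentences)
print(cleaned)
```

On a full news text, the two helpers would be chained as `clean_sentences(split_sentences(text))`. Because the sentencizer splits on sentence-final punctuation, a "BACA JUGA" headline that lacks a closing period may be merged with the sentence that follows it, so the segmentation output should be inspected before filtering the whole dataset.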

In [None]:
new_regex = 