## Text data cleaning by means of sentence segmentation/tokenization

### This program cleans text data by removing unnecessary sentences that begin with the strings "baca juga" (Indonesian for "read also"), "advertisement", or "iklan" (Indonesian for "advertisement"). It uses the spaCy NLP library for sentence segmentation.

### 1. Reading dataset

#### We will be reading a dataset/corpus containing news text data.

In [4]:
import pandas as pd
import os

news_dataset = pd.read_csv(f"{os.getcwd()}/news_dataset.csv")
print(news_dataset["content"])

0                                                     NaN
1       ALASKA — It's been snowing a lot lately in nor...
2                                                     NaN
3       ESB Science Blast. 2.508 suka. An incredible S...
4                                                     NaN
                              ...                        
1043    HAIJAKARTA.ID – Nomor HP kamu bisa menerima sa...
1044    JABAR EKSPRES – Hadir kembali link penghasil s...
1045    Jakarta, benang.id – GoPay, unit bisnis Financ...
1046    SINGAPORE: The Competition and Consumer Commis...
1047    Once you sign up, $111 is added to your bonus ...
Name: content, Length: 1048, dtype: object


#### The dataset contains 1048 rows of news text (indices 0 to 1047); some rows have no content (NaN).

In [5]:
import warnings

# pandas emits a UserWarning when a pattern passed to str.contains has a capture group; silence it
warnings.simplefilter("ignore", UserWarning)

unecessary_words = r"(?i)\b(baca juga|advertisement|iklan)\b"

# na=False treats rows with missing content as non-matches instead of propagating NaN
news_dataset['matches'] = news_dataset['content'].str.contains(unecessary_words, na=False)

nrow_matches = news_dataset['matches'].sum()
nrow_news_dataset = len(news_dataset)

# The :.0% format spec renders the ratio as a percentage
print(f"{nrow_matches} instances of news text \n{nrow_matches / nrow_news_dataset:.0%} of instances with unnecessary words")

310 instances of news text 
30% of instances with unnecessary words


#### Of the 1048 instances, 310 (30%) are news texts containing unnecessary words. Below are randomly picked news texts containing these words:

In [6]:
import random

matches = news_dataset[news_dataset['content'].str.contains(unecessary_words, na=False)]
matches_reset = matches.reset_index()

for num in range(3):
    random_news_instance = random.randint(0, len(matches_reset) - 1)
    print(f"news {num}: {matches_reset['content'][random_news_instance]}")


news 0: Cuan dari HP! Game Penghasil Uang 2025 Ini Bisa Hasilkan Saldo DANA Gratis Rp500 Ribu Tanpa Syarat Ribet! Reporter: Asep Kusuma | Editor: Asep Kusuma | Jumat 21-03-2025,03:41 WIB Cuan dari HP! Game Penghasil Uang 2025 Ini Bisa Hasilkan Saldo DANA Gratis Rp500 Ribu Tanpa Syarat Ribet!-canva- RADARINDRAMAYU.ID - Mau dapat saldo DANA gratis Rp500.000 dengan cara mudah? Di tahun 2025, semakin banyak aplikasi penghasil uang yang bisa dimanfaatkan untuk menambah isi dompet elektronik kamu. Salah satu yang menarik perhatian adalah game penghasil saldo DANA yang terbukti membayar. Cukup dengan menyelesaikan misi tertentu, kamu bisa mendapatkan saldo yang bisa langsung ditarik ke e-wallet favoritmu seperti DANA, GoPay, dan OVO. BACA JUGA: Kiper Dewa United Sebut Timnas Indonesia Bisa Jadi Tambahan yang Hebat untuk Piala Dunia 2026 Banyak pengguna yang sudah berhasil menarik saldo dari aplikasi &#34;Low Go&#34; ini. Beberapa di antaranya bahkan berhasil mendapatkan hingga ratusan ribu ru

#### The regular expression detects 310 news texts containing "baca juga", "advertisement", or "iklan", and the randomly picked text above illustrates such a match. However, the expression only checks whether these strings occur somewhere in the text: it is not aware of sentence boundaries, so a match can fall at the beginning, middle, or end of a sentence.

#### To address this problem, we will use spaCy to segment each text into sentences, and we will modify the regular expression so that it matches these strings only at the beginning of a sentence.
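
The sentence-level filter described above could look roughly like the sketch below. It is not the notebook's actual implementation. It assumes the text has already been split into sentences (for example with a sentence segmenter such as spaCy) and covers only the regex side. The names `SENTENCE_START_MARKERS` and `drop_marked_sentences` are hypothetical, and apart from the "BACA JUGA" fragment taken from the output above, the sample sentences are made up for illustration.

```python
import re

# Assumed pattern: the same three markers as before, but anchored with ^ so
# that they only match at the start of a sentence (leading whitespace allowed).
SENTENCE_START_MARKERS = re.compile(r"(?i)^\s*(?:baca juga|advertisement|iklan)\b")


def drop_marked_sentences(sentences):
    """Return the sentences that do not begin with one of the markers."""
    return [s for s in sentences if not SENTENCE_START_MARKERS.match(s)]


# Hand-split, illustrative sentences; in practice a segmenter would produce them.
sample_sentences = [
    "Kamu bisa mendapatkan saldo DANA gratis.",
    "BACA JUGA: Kiper Dewa United Sebut Timnas Indonesia Bisa Jadi Tambahan yang Hebat",
    "Banyak iklan muncul di aplikasi ini.",  # marker in mid-sentence: kept
    "Advertisement",
]

cleaned = drop_marked_sentences(sample_sentences)
print(cleaned)
```

Anchoring the pattern keeps sentences that merely mention a marker word in the middle, while dropping sentences that open with one. Whether this removes exactly the intended text depends on how the segmenter draws sentence boundaries.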