## Text data cleaning by means of sentence segmentation/tokenization

### This program cleans text data by removing unnecessary sentences that begin with the strings "baca juga" ("read also"), "advertisement", or "iklan" ("advertisement"). It uses the spaCy NLP library for sentence segmentation.

### 1. Reading dataset

#### We will be reading a dataset/corpus containing news text data.

In [13]:
import pandas as pd
import os

news_dataset = pd.read_csv(f"{os.getcwd()}/news_dataset.csv")
print(news_dataset["content"])

0                                                     NaN
1       ALASKA — It's been snowing a lot lately in nor...
2                                                     NaN
3       ESB Science Blast. 2.508 suka. An incredible S...
4                                                     NaN
                              ...                        
1043    HAIJAKARTA.ID – Nomor HP kamu bisa menerima sa...
1044    JABAR EKSPRES – Hadir kembali link penghasil s...
1045    Jakarta, benang.id – GoPay, unit bisnis Financ...
1046    SINGAPORE: The Competition and Consumer Commis...
1047    Once you sign up, $111 is added to your bonus ...
Name: content, Length: 1048, dtype: object


#### The dataset contains 1,048 rows of news text (index 0 to 1047), some of which have missing (NaN) content.

In [14]:
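import warnings

# str.contains emits a UserWarning when the pattern has a capture group; silence it.
warnings.simplefilter("ignore", UserWarning)

# Case-insensitive, whole-word match for the boilerplate markers.
unecessary_words = r"(?i)\b(baca juga|advertisement|iklan)\b"

# na=False treats rows with missing content as non-matches.
news_dataset['matches'] = news_dataset['content'].str.contains(unecessary_words, na=False)

nrow_matches = news_dataset['matches'].sum()
nrow_news_dataset = len(news_dataset)

# Format the ratio as a percentage (the ratio 0.30 is 30%, not 0.30 percent).
print(f"{nrow_matches} instances of news text \n{nrow_matches/nrow_news_dataset:.2%} of instances with unnecessary words")

310 instances of news text 
29.58% of instances with unnecessary words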


#### Of the 1,048 instances, 310 (about 30%) are news texts containing unnecessary words. Below are randomly picked news texts containing those words:

In [15]:
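import random

# Keep only rows whose content matches the pattern (missing content counts as no match).
matches = news_dataset[news_dataset['content'].str.contains(unecessary_words, na=False)]
matches_reset = matches.reset_index()

# Draw up to three distinct rows so the same text is not printed twice.
sample_size = min(3, len(matches_reset))
for num, row in enumerate(random.sample(range(len(matches_reset)), k=sample_size)):
    print(f"news {num}: {matches_reset['content'][row]}\n")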


news 0: Rekomendasi Pinjaman Tercepat Dan Termudah Limit 200 Ribu, Bisa Untuk THR Cepat Cair Hitungan Jam Reporter: Alfin Ananda | Editor: Syamsul Falaq | Kamis 20-03-2025,11:31 WIB Rekomendasi Pinjaman Tercepat Dan Termudah Limit 200 Ribu, Bisa Untuk THR Cepat Cair Hitungan Jam-- YOGYAKARTA, diswayjogja.id- Ketika kamu memerlukan dana tambahan sebesar 600 ribu rupiah dalam waktu yang singkat, pinjaman online menjadi solusi yang sangat praktis. Salah satu solusi untuk memenuhi kebutuhan tersebut adalah dengan meminjam uang terlebih dahulu, zaman sekarang hal tersebut sudah bisa kamu lakukan secara online melalui aplikasi pinjaman. Saat ini banyak orang mencari solusi pinjaman online legal yang menawarkan pinjaman dengan nominal mulai dari 200 ribu rupiah yang langsung cair ke rekening. Pinjaman online kini memang menjadi salah satu solusi yang kerap digunakan sebagian orang ketika sedang membutuhkan uang dalam keadaan darurat dan mendesak. BACA JUGA : Kebutuhan Lebaran Mencapai 80 Juta

#### The regular expression finds 310 news texts containing "baca juga", "advertisement", or "iklan"; a randomly picked example is shown above. However, the pattern matches these strings wherever they occur: it has no notion of sentences, so a match can fall at the beginning, middle, or end of a sentence.
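The limitation can be illustrated with a minimal sketch that uses only Python's `re` module and the same pattern. The first example string is taken from the sample output above; the second is a hypothetical sentence made up for illustration, in which "iklan" is an ordinary word rather than a boilerplate marker.

```python
import re

# Same pattern as in the notebook: it matches the phrases anywhere in the text.
unanchored = re.compile(r"(?i)\b(baca juga|advertisement|iklan)\b")

examples = [
    # Marker opens the sentence (taken from the sample news text).
    "BACA JUGA : Kebutuhan Lebaran Mencapai 80 Juta",
    # Hypothetical sentence: "iklan" is used as a normal word mid-sentence.
    "Belanja iklan digital terus meningkat tahun ini.",
]

# Character offset of the first match in each example.
positions = [unanchored.search(text).start() for text in examples]
print(positions)
```

Both strings match, but only the first match sits at offset 0. A filter based on this pattern alone would therefore flag the second text as well, even though it contains no boilerplate sentence.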

#### To address this, we will use spaCy to segment each text into sentences and modify the regular expression so that it matches these strings only at the beginning of a sentence.
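One way this plan could look is sketched below; it is not necessarily the implementation used later in the notebook. The names `LEADING_MARKER`, `build_sentencizer`, `drop_marked_sentences`, and `clean_text` are hypothetical. The sketch also assumes that a blank multi-language spaCy pipeline (`"xx"`) with the rule-based `sentencizer` component is adequate for this mixed English/Indonesian corpus; a trained model may segment better.

```python
import re

# Anchored variant of the notebook's pattern: the marker must open the sentence.
# Leading whitespace is tolerated because a segmenter may keep it.
LEADING_MARKER = re.compile(r"^\s*(?:baca juga|advertisement|iklan)\b", re.IGNORECASE)


def build_sentencizer():
    """Return a rule-based spaCy pipeline that splits on sentence punctuation.

    Assumption: a blank multi-language pipeline ("xx") plus the built-in
    "sentencizer" component is good enough for this corpus.
    """
    import spacy  # imported lazily so the filtering helper has no spaCy dependency

    nlp = spacy.blank("xx")
    nlp.add_pipe("sentencizer")
    return nlp


def drop_marked_sentences(sentences):
    """Keep only the sentences that do not start with a boilerplate marker."""
    return [s for s in sentences if not LEADING_MARKER.match(s)]


def clean_text(text, nlp):
    """Segment `text` with spaCy, drop marked sentences, and rejoin the rest."""
    sentences = [sent.text.strip() for sent in nlp(text).sents]
    return " ".join(drop_marked_sentences(sentences))
```

To use it, build the pipeline once with `nlp = build_sentencizer()` and reuse it for every row, for example `news_dataset['content'].dropna().apply(clean_text, nlp=nlp)`. Because the sentencizer splits on punctuation only, a "baca juga" line with no closing punctuation may be merged with the text that follows it and removed together with it.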