## Text data cleaning by means of sentence segmentation/tokenization

### This program cleans text data by removing unnecessary sentences that begin with the strings "baca juga" ("read also"), "advertisement", or "iklan" ("advertisement"). It uses the spaCy NLP library for sentence segmentation.

### 1. Reading dataset

#### We will be reading a dataset/corpus containing news text data.

In [15]:
import pandas as pd
import os

news_dataset = pd.read_csv(f"{os.getcwd()}/news_dataset.csv")
print(news_dataset["content"])

0                                                     NaN
1       ALASKA — It's been snowing a lot lately in nor...
2                                                     NaN
3       ESB Science Blast. 2.508 suka. An incredible S...
4                                                     NaN
                              ...                        
1043    HAIJAKARTA.ID – Nomor HP kamu bisa menerima sa...
1044    JABAR EKSPRES – Hadir kembali link penghasil s...
1045    Jakarta, benang.id – GoPay, unit bisnis Financ...
1046    SINGAPORE: The Competition and Consumer Commis...
1047    Once you sign up, $111 is added to your bonus ...
Name: content, Length: 1048, dtype: object


#### The dataset contains 1048 rows of news text, some of which are empty (NaN).

In [20]:
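import warnings

# Silence UserWarnings, such as the one pandas raises for the capture group in the pattern below.
warnings.simplefilter("ignore", UserWarning)

# Case-insensitive, whole-word match for the marker strings.
unecessary_words = r"(?i)\b(baca juga|advertisement|iklan)\b"

# na=False counts rows with missing content (NaN) as non-matches, so the column stays boolean.
news_dataset['matches'] = news_dataset['content'].str.contains(unecessary_words, na=False)
print(f"{news_dataset['matches'].sum()} instances of news text")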

310 instances of news text


In [25]:
matches = news_dataset[news_dataset['content'].str.contains(unecessary_words,  na=False)]
matches_reset = matches.reset_index()

print(matches_reset['content'][0])
print(matches_reset['content'][4])
print(matches_reset['content'][20])

- Cash Tree Siapa sih yang tidak kenal dengan Cash Tree, yang merupakan salah satu aplikasi penghasil uang paling populer ini sayang untuk kamu lewatkan. Melalui aplikasi ini, kamu bisa langsung menggunakannya dan langsung cair ke dompet elektronik yang kamu miliki seperti e-wallet DANA, cukup mudah kok caranya. Kamu hanya perlu menyelesaikan berbagai misi dan menonton video saja untuk proses pengumpulan koin, nantinya jika sudah mencapai batas penukaran bisa ditukarkan menjadi saldo DANA. BACA JUGA:Aplikasi Penghasil Saldo DANA, OVO dan Shopee 2025 yang Terbukti Membayar Hingga Rp500.000! BACA JUGA:Baru Rilis! Aplikasi Penghasil Saldo DANA, OVO dan Shopee 2025 Terbukti Membayar Tanpa Modal!
Jakarta (ANTARA) - Pelatih Rans Simba Bogor Anthony Garbelotto menyebut bahwa permainan solid anak asuhnya, telah membawa tim untuk meraih sembilan kemenangan beruntun, serta menundukkan empat tim kuat di Indonesian Basketball League (IBL) 2025. &#34;Contohnya saat kami menang melawan juara IBL 202

#### Using a regular expression, we find 310 news texts that contain "baca juga", "advertisement", or "iklan". However, this pattern only detects the strings themselves: it has no notion of sentence boundaries, so a match may fall at the beginning, middle, or end of a sentence.
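To make the position problem concrete, here is a minimal sketch using only Python's built-in `re` module. The sample strings are invented for illustration and do not come from the dataset, and the pattern uses a non-capturing group, a small variation on the one above.

```python
import re

# Same marker strings as above, written with a non-capturing group.
pattern = re.compile(r"\b(?:baca juga|advertisement|iklan)\b", re.IGNORECASE)

# Invented examples: the marker appears at the start, then mid-sentence twice.
samples = [
    "BACA JUGA: Aplikasi penghasil saldo terbaru!",
    "Jangan lupa baca juga syarat dan ketentuannya.",
    "The company cut its advertisement budget.",
]

# Record the character offset at which the first match starts in each sample.
starts = [pattern.search(text).start() for text in samples]

for text, start in zip(samples, starts):
    position = "start of text" if start == 0 else "inside the sentence"
    print(f"offset {start:>2} ({position}): {text}")
```

Only the first sample is the kind of boilerplate we want to remove. The other two are ordinary sentences that happen to contain a marker string, which is why a plain substring match flags too much.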

#### To address this, we will use spaCy to segment each text into sentences, and we will modify the regular expression so that it matches these strings only at the beginning of a sentence.
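The approach described above could be sketched roughly as follows. This is an illustration under assumptions, not necessarily the pipeline the notebook ends up using: it assumes spaCy v3, a blank multi-language pipeline (`spacy.blank("xx")`) with the rule-based `sentencizer` component, a hypothetical helper named `drop_marked_sentences`, and an invented sample text.

```python
import re
import spacy

# Assumption: a blank multi-language pipeline with the rule-based sentencizer.
# A trained pipeline could be swapped in for better sentence boundaries.
nlp = spacy.blank("xx")
nlp.add_pipe("sentencizer")

# Anchored with ^ so that only sentence-initial marker strings match.
sentence_start = re.compile(r"^(?:baca juga|advertisement|iklan)\b", re.IGNORECASE)

def drop_marked_sentences(text):
    """Hypothetical helper: remove sentences that begin with a marker string."""
    doc = nlp(text)
    kept = []
    for sent in doc.sents:
        sentence = sent.text.strip()
        if not sentence_start.match(sentence):
            kept.append(sentence)
    return " ".join(kept)

# Invented sample: the second sentence is boilerplate, the third only
# mentions "baca juga" mid-sentence and should be kept.
sample = (
    "Kamu bisa menukar koin menjadi saldo. "
    "BACA JUGA: Aplikasi penghasil saldo terbaru! "
    "Jangan lupa baca juga syarat dan ketentuannya."
)

cleaned = drop_marked_sentences(sample)
print(cleaned)
```

Matching per sentence rather than per document means a marker string in the middle of a sentence no longer triggers removal. The quality of the result depends on how well the segmenter handles cases such as "BACA JUGA:Aplikasi", where no space follows the colon.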