# 03 — Text Preprocessing & Linguistic Normalization

## Objectives
- Clean and normalize user review text
- Handle characteristics of Indonesian (slang, affixes)
- Provide text that is ready for:
  - Sentiment Analysis
  - Emotion Detection
  - Topic Modeling
  - Embedding-based Clustering

## Input
- data/raw/raw_e_wallet_reviews.csv

## Output
- data/processed/clean_e_wallet_reviews.csv

###### GOOGLE DRIVE MOUNT

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


###### IMPORT LIBRARIES

In [2]:
import pandas as pd
import numpy as np
import re
import string

from tqdm import tqdm

# NLP libraries
import nltk
from nltk.corpus import stopwords

# Indonesian NLP
!pip install Sastrawi
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

Collecting Sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl.metadata (909 bytes)
Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209 kB)
Installing collected packages: Sastrawi
Successfully installed Sastrawi-1.0.1


In [3]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

###### PATH CONFIGURATION & LOAD DATA

In [4]:
BASE_PATH = "/content/drive/MyDrive/ewallet_nlp_clustering_project"

RAW_DATA_PATH = f"{BASE_PATH}/data/raw"
PROCESSED_DATA_PATH = f"{BASE_PATH}/data/processed"

INPUT_FILE = f"{RAW_DATA_PATH}/raw_e_wallet_reviews.csv"
OUTPUT_FILE = f"{PROCESSED_DATA_PATH}/clean_e_wallet_reviews.csv"

df = pd.read_csv(INPUT_FILE)
df["review_date"] = pd.to_datetime(df["review_date"])

print(f"Loaded {len(df)} reviews")

Loaded 15000 reviews


## Preprocessing Design Decisions

Key principles:
- Preserve semantic meaning (avoid over-cleaning)
- Remove noise that degrades embedding quality
- Indonesian-specific normalization is mandatory
- Keep preprocessing deterministic & reproducible

Pipeline stages:
1. Case folding
2. URL, mention, hashtag, digit, and punctuation removal (emoji are not explicitly stripped by `basic_cleaning`)
3. Repeated character normalization
4. Slang normalization
5. Stopword removal
6. Stemming (Bahasa Indonesia)

###### BASIC TEXT CLEANING FUNCTIONS

In [5]:
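def basic_cleaning(text):
    text = text.lower()                                   # case folding
    text = re.sub(r"http\S+|www\S+", "", text)            # strip URLs
    text = re.sub(r"@\w+|#\w+", "", text)                 # strip mentions and hashtags
    text = re.sub(r"\d+", "", text)                       # strip digits
    # Delete ASCII punctuation. Nothing is put in its place, so words separated
    # only by punctuation are joined ("buruk,banyak" -> "burukbanyak").
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()              # collapse whitespace
    return text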
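
The quality check further down shows `buruk,banyak` turning into `burukbanyak`, because punctuation is deleted rather than replaced. The following is a minimal sketch of a variant, not the notebook's method: the name `basic_cleaning_spaced` and the compiled patterns are assumptions introduced here. It replaces every non-word character (punctuation and emoji alike) with a space and returns an empty string for non-string input such as missing values.

```python
import re

# Hypothetical patterns for this sketch; only the first three mirror the notebook.
URL_RE = re.compile(r"http\S+|www\S+")
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")
DIGIT_RE = re.compile(r"\d+")
NON_WORD_RE = re.compile(r"[^\w\s]|_")   # punctuation, emoji, underscore
SPACE_RE = re.compile(r"\s+")


def basic_cleaning_spaced(text):
    """Variant of basic_cleaning that keeps word boundaries intact."""
    if not isinstance(text, str):        # e.g. NaN from a missing review
        return ""
    text = text.lower()
    text = URL_RE.sub("", text)
    text = MENTION_HASHTAG_RE.sub("", text)
    text = DIGIT_RE.sub("", text)
    text = NON_WORD_RE.sub(" ", text)    # replace with a space, do not delete
    return SPACE_RE.sub(" ", text).strip()


print(basic_cleaning_spaced("keamanan sangat buruk,banyak transaksi"))
print(basic_cleaning_spaced("Mantap 👍👍 bgt!!"))
```

One trade-off to be aware of: hyphenated words are split, so `e-wallet` becomes `e wallet` instead of `ewallet`.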

###### REPEATED CHARACTER NORMALIZATION

In [6]:
def normalize_repeated_chars(text):
    # Collapse any character repeated 3+ times down to 2 (e.g. "lamaaaa" -> "lamaa")
    return re.sub(r'(.)\1{2,}', r'\1\1', text)
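
To make the effect of the regex concrete, here is a small sketch with a hypothetical `keep` parameter (not part of the notebook). With `keep=2` it behaves like `normalize_repeated_chars`; with `keep=1` it collapses runs to a single character.

```python
import re

REPEAT_RE = re.compile(r"(.)\1{2,}")   # a character followed by 2+ copies of itself


def collapse_repeats(text, keep=2):
    """Shrink runs of 3+ identical characters to `keep` copies."""
    return REPEAT_RE.sub(lambda m: m.group(1) * keep, text)


for sample in ["lamaaaa bangeeet", "sihh haruss"]:
    print(sample, "->", collapse_repeats(sample))

print(collapse_repeats("lamaaaa", keep=1))
```

Runs of exactly two characters are left alone, which is why tokens such as `sihh` and `haruss` survive in the cleaned output. Collapsing to one character recovers `lama` from `lamaaaa`, but it would also turn an elongated `maaaaf` into `maf` rather than `maaf`, so neither setting is safe for every word.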

###### SLANG NORMALIZATION (INDONESIAN)

In [7]:
slang_dict = {
    "gk": "tidak",
    "ga": "tidak",
    "nggak": "tidak",
    "bgt": "banget",
    "yg": "yang",
    "tp": "tapi",
    "apk": "aplikasi",
    "lemot": "lambat",
    "error": "kesalahan"
}

def normalize_slang(text):
    tokens = text.split()
    normalized = [slang_dict.get(t, t) for t in tokens]
    return " ".join(normalized)
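
The sample outputs later in this notebook still contain `gak` and `blm`, which the dictionary above does not cover. A possible way to grow the lookup without editing the base dictionary is sketched here; `EXTRA_SLANG`, `build_slang_lookup`, and `normalize_slang_with` are hypothetical names, and the two extra mappings are assumptions about the intended standard forms.

```python
# Copy of the notebook's dictionary, repeated so the sketch is self-contained.
BASE_SLANG = {
    "gk": "tidak", "ga": "tidak", "nggak": "tidak", "bgt": "banget",
    "yg": "yang", "tp": "tapi", "apk": "aplikasi",
    "lemot": "lambat", "error": "kesalahan",
}

# Assumed additions based on tokens seen in the sample outputs.
EXTRA_SLANG = {"gak": "tidak", "blm": "belum"}


def build_slang_lookup(*dicts):
    """Merge several slang dictionaries; later ones override earlier ones."""
    merged = {}
    for d in dicts:
        merged.update(d)
    return merged


def normalize_slang_with(text, lookup):
    """Token-wise replacement, same logic as normalize_slang."""
    return " ".join(lookup.get(tok, tok) for tok in text.split())


lookup = build_slang_lookup(BASE_SLANG, EXTRA_SLANG)
print(normalize_slang_with("blm bisa buka gak jelas", lookup))
```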

###### STOPWORD REMOVAL

In [8]:
stop_words = set(stopwords.words("indonesian"))

def remove_stopwords(text):
    tokens = text.split()
    filtered = [t for t in tokens if t not in stop_words]
    return " ".join(filtered)
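
In the final dataset preview, `aplikasi gk bisa dibuka keluar sendiri` ends up as `aplikasi buka`, which suggests that the negation (`gk` mapped to `tidak`) is dropped as a stop word. For sentiment analysis that can flip the meaning. The sketch below shows one way to keep negations; the `NEGATIONS` set and the function names are assumptions, and `SAMPLE_STOP_WORDS` is a tiny stand-in for `set(stopwords.words("indonesian"))`.

```python
# Tiny illustrative stop word set; the notebook uses NLTK's Indonesian list.
SAMPLE_STOP_WORDS = {"yang", "bisa", "tidak", "sendiri", "keluar", "di"}

# Assumed negation words to protect from removal.
NEGATIONS = {"tidak", "bukan", "belum", "jangan"}


def build_stop_words(stop_words, keep=NEGATIONS):
    """Return the stop word set with protected words taken out."""
    return set(stop_words) - set(keep)


def remove_stopwords_with(text, stop_set):
    """Drop tokens that appear in stop_set."""
    return " ".join(tok for tok in text.split() if tok not in stop_set)


stop_set = build_stop_words(SAMPLE_STOP_WORDS)
print(remove_stopwords_with("aplikasi tidak bisa dibuka keluar sendiri", stop_set))
```

Whether to protect negations depends on the downstream task: for topic modeling they add little, while for sentiment and emotion detection they often matter.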

###### STEMMING (SASTRAWI)

In [9]:
factory = StemmerFactory()
stemmer = factory.create_stemmer()

def stem_text(text):
    return stemmer.stem(text)
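
Review corpora repeat the same words many times, so caching stemmed tokens may reduce repeated work. The following is a generic sketch, assuming a single-word stem function is available; `make_cached_stemmer` and `toy_stem` are hypothetical, and `toy_stem` is only a stand-in so the example runs without Sastrawi. Whether this gives a measurable speed-up with Sastrawi, or identical output to stemming the whole string at once, is not verified here.

```python
from functools import lru_cache


def make_cached_stemmer(stem_word, maxsize=100_000):
    """Wrap a single-word stem function with an LRU cache, applied per token."""
    cached = lru_cache(maxsize=maxsize)(stem_word)

    def stem_text_cached(text):
        return " ".join(cached(tok) for tok in text.split())

    stem_text_cached.cache_info = cached.cache_info   # expose hit/miss counters
    return stem_text_cached


def toy_stem(word):
    """Stand-in stemmer for demonstration only: strips a trailing 'nya'."""
    return word[:-3] if word.endswith("nya") and len(word) > 3 else word


stem_demo = make_cached_stemmer(toy_stem)
print(stem_demo("drivernya aplikasinya drivernya"))
print(stem_demo.cache_info())
```

In the notebook, `stemmer.stem` could be passed in place of `toy_stem`.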

###### FULL PREPROCESSING PIPELINE

In [10]:
tqdm.pandas()

def full_preprocess(text):
    text = basic_cleaning(text)
    text = normalize_repeated_chars(text)
    text = normalize_slang(text)
    text = remove_stopwords(text)
    text = stem_text(text)
    return text

df["clean_text"] = df["review_text"].progress_apply(full_preprocess)

100%|██████████| 15000/15000 [23:24<00:00, 10.68it/s]
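
The run above takes roughly 23 minutes for 15,000 reviews. Because every stage is deterministic, identical review texts (short ones such as `good` or `best` are likely to repeat) only need to be processed once. A minimal sketch, using a hypothetical helper `preprocess_unique` and a demo function in place of `full_preprocess`:

```python
def preprocess_unique(texts, preprocess):
    """Apply `preprocess` once per distinct text and reuse the result."""
    cache = {}
    results = []
    for text in texts:
        if text not in cache:
            cache[text] = preprocess(text)
        results.append(cache[text])
    return results


calls = []


def demo_preprocess(text):
    """Stand-in for full_preprocess that records how often it is called."""
    calls.append(text)
    return text.lower().strip()


cleaned = preprocess_unique(["Good", "best", "Good", "Good"], demo_preprocess)
print(cleaned)
print(len(calls), "preprocess calls for", len(cleaned), "texts")
```

In the notebook this could be used as `df["clean_text"] = preprocess_unique(df["review_text"], full_preprocess)`, at the cost of losing the tqdm progress bar. The actual saving depends on how many duplicates the dataset contains, which is not measured here.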


###### QUALITY CHECK (BEFORE vs AFTER)

In [11]:
df[["review_text", "clean_text"]].sample(5, random_state=42)

| | review_text | clean_text |
|---|---|---|
| 11499 | best | best |
| 6475 | e-wallet paling aneh, kok bisa saldo hilang ta... | ewallet aneh saldo hilang riwayat transaksi te... |
| 13167 | sekarang drivernya pada gak bener, apalagi go ... | drivernya gak bener go food bayar sesuai tarif... |
| 862 | keamanan sangat buruk,banyak transaksi di luar... | aman burukbanyak transaksi tahu kitadi gak rugi |
| 5970 | good | good |


###### EMPTY & SHORT TEXT HANDLING

In [12]:
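# Length of the cleaned text, in characters
df["clean_length"] = df["clean_text"].apply(len)

# Keep only reviews whose cleaned text has at least 10 characters
df = df[df["clean_length"] >= 10]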

print(f"Remaining reviews after cleaning: {len(df)}")

Remaining reviews after cleaning: 11141
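
The 10-character threshold removes 3,859 of the 15,000 reviews (15,000 minus 11,141), and short reviews such as `good` or `best` still carry sentiment. Before dropping them, it can help to look at what the filter discards. A small sketch with a hypothetical helper `summarize_short_texts`:

```python
from collections import Counter


def summarize_short_texts(clean_texts, min_chars=10, top_n=5):
    """Count texts shorter than min_chars and list the most frequent ones."""
    short = [t for t in clean_texts if len(t) < min_chars]
    return {
        "total": len(clean_texts),
        "dropped": len(short),
        "most_common_dropped": Counter(short).most_common(top_n),
    }


sample = ["good", "best", "good", "aplikasi bagus transaksi"]
report = summarize_short_texts(sample)
print(report)
```

In the notebook the helper could be called on `df["clean_text"].tolist()` before the filter is applied. If the discarded texts turn out to be mostly one-word opinions, keeping them for sentiment analysis and filtering only for topic modeling or clustering may be a reasonable alternative.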


###### FINAL DATASET STRUCTURE

In [13]:
final_columns = [
    "review_id",
    "app_name",
    "review_text",
    "clean_text",
    "rating",
    "review_date",
    "app_version",
    "thumbs_up",
    "reviewer_name"
]

df_final = df[final_columns]
df_final.head()

| | review_id | app_name | review_text | clean_text | rating | review_date | app_version | thumbs_up | reviewer_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2cc12584-7610-41f7-94b7-8d433e258aa3 | DANA | woii dana Kenapaa sihh haruss banget gitu kalo... | woii dana kenapaa sihh haruss banget gitu kalo... | 1 | 2026-01-21 17:54:20 | 2.109.2 | 0 | Sy Pian |
| 1 | e99ad58e-3e2a-494e-8857-82840a07c717 | DANA | aplikasi gk bisa dibuka keluar sendiri | aplikasi buka | 1 | 2026-01-21 17:50:48 | | 0 | Tommy Hardiansyah |
| 3 | e66455aa-0f34-4eee-8409-133a7c612e8b | DANA | tolong di perbaiki sistem aplikasi dana nya ..... | tolong baik sistem aplikasi dana nya login pay... | 5 | 2026-01-21 17:48:19 | 2.98.0 | 0 | Ucokers |
| 5 | 7994021f-4daa-47c4-bb58-64606f2bdd30 | DANA | Apk yg bagus guna transaksi.. | aplikasi bagus transaksi | 4 | 2026-01-21 17:46:35 | 2.110.0 | 0 | Beryl Hamizan |
| 6 | 1453ebb7-3e0f-45f1-b4f7-264f08233e37 | DANA | blm bisa buka dana cicik | blm buka dana cicik | 5 | 2026-01-21 17:46:14 | 2.111.0 | 0 | H iwan H iwan |


###### SAVE CLEAN DATASET

In [14]:
df_final.to_csv(OUTPUT_FILE, index=False)

print(f"Clean dataset saved to:\n{OUTPUT_FILE}")
print(f"Final dataset size: {len(df_final)}")

Clean dataset saved to:
/content/drive/MyDrive/ewallet_nlp_clustering_project/data/processed/clean_e_wallet_reviews.csv
Final dataset size: 11141


## Preprocessing Summary

✔ Indonesian language normalization applied  
✔ Noise reduced (URLs, mentions, hashtags, digits, punctuation; very short reviews dropped)  
✔ Slang & repeated characters handled  
✔ Stopword removal & stemming completed  
✔ Text suitable for transformer-based embedding  

Output dataset is ready for:
- Sentiment analysis
- Emotion detection
- Topic modeling
- Unsupervised clustering