# Phase 2: Data Preprocessing & Cleaning

> **Author:** Azizah Adilah

> **Date:** January 2026

> **Goal:** Transform raw text into a clean format suitable for sentiment analysis.

### **Import Libraries**

#### **Step 1: Import Libraries and Load Raw Data**
In this initial step, I import the necessary Python libraries for data manipulation (`pandas`), text processing (`re`, `string`), and Indonesian language processing (`Sastrawi`). I also load the raw dataset collected from the web scraping phase.

In [12]:
import pandas as pd
import re
import string
# Sastrawi: NLP library for Indonesian (Bahasa Indonesia)
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# Load data 
df = pd.read_csv('../data/reviews_dana_raw.csv')

# Only use column 'content' and 'score'
df = df[['content', 'score']]
df.head()

| | content | score |
|---|---|---|
| 0 | lumayan TPI kadang erorr sampah | 5 |
| 1 | sangat buruk pelayananya baru beberapa hari ap... | 1 |
| 2 | luar biasa berguna | 5 |
| 3 | bagus | 5 |
| 4 | sangat baik | 5 |


### Data Observation & Slang Identification
Before proceeding with cleaning, I performed a manual inspection and word frequency analysis to identify common slang and recurring issues in DANA reviews:

* **Top Issues Identified:** Transaction errors (saldo hilang), DANA Cicil functionality, and login loops after updates.
* **Slang Discoveries:** Found various abbreviations like 'ccln' (cicilan), 'kbnykn' (kebanyakan), and 'ndak' (tidak).
* **Business Insight:** There is a high volume of discussion regarding 'DANA Cicil' and 'Paylater' features, often associated with negative sentiment when the features are unavailable.

### **Cleaning Function**

#### **Step 2: General Text Cleaning**
The goal of this step is to standardize the text. The `clean_text` function performs:
* **Lowercasing**: Converting all text to lowercase to ensure uniformity.
* **Removing Special Characters**: Eliminating numbers, punctuation, and symbols that do not contribute to sentiment analysis.
* **Whitespace Trimming**: Stripping leading and trailing spaces for better tokenization. Runs of spaces inside a review are left as they are.

In [3]:
def clean_text(text):
    # Lowercasing
    text = str(text).lower()
    # Remove digits and ASCII punctuation (characters are deleted, not replaced by a space)
    text = re.sub(r'\d+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Strip leading and trailing whitespace
    text = text.strip()
    return text

df['content_clean'] = df['content'].apply(clean_text)
df.head()

| | content | score | content_clean |
|---|---|---|---|
| 0 | lumayan TPI kadang erorr sampah | 5 | lumayan tpi kadang erorr sampah |
| 1 | sangat buruk pelayananya baru beberapa hari ap... | 1 | sangat buruk pelayananya baru beberapa hari ap... |
| 2 | luar biasa berguna | 5 | luar biasa berguna |
| 3 | bagus | 5 | bagus |
| 4 | sangat baik | 5 | sangat baik |
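
Because `clean_text` deletes punctuation outright, words joined by a symbol can fuse together, and double spaces inside a review survive. The following is a minimal sketch of one possible variant, not the function used in this notebook: `clean_text_spaced` is a hypothetical name, and the input string is made up for illustration. It replaces digits and punctuation with a space and then collapses whitespace.

```python
import re
import string

# Translation table that maps every ASCII punctuation character to a space
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def clean_text_spaced(text):
    """Hypothetical variant of clean_text that keeps word boundaries intact."""
    text = str(text).lower()
    # Replace digits with a space instead of deleting them
    text = re.sub(r'\d+', ' ', text)
    # Replace punctuation with a space so "raib/hilang" does not become "raibhilang"
    text = text.translate(_PUNCT_TO_SPACE)
    # Collapse any run of whitespace into a single space and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

# Illustrative input only; not taken from the dataset
print(clean_text_spaced("Uang raib/hilang,  menunggu berhari-hari!!"))
# -> uang raib hilang menunggu berhari hari
```

The trade-off is that hyphenated reduplications such as "berhari-hari" are split into two tokens, so whether this variant is preferable depends on how later steps treat those words.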


### **Top Words Checking**

#### **Step 3: Top Words Frequency Check**
Before applying any heavy processing, I analyze the word frequency distribution of the initial cleaned text. This quantitative step helps identify the most dominant words, including potential slang, abbreviations, and common stopwords that need to be addressed in the next steps.

In [4]:
from collections import Counter

# Join all reviews into one large text
all_text = ' '.join(df['content_clean'].astype(str))
words = all_text.split()

# Count the 50 most frequent words
top_words = Counter(words).most_common(50)
print(top_words)

[('dana', 312), ('di', 224), ('bisa', 194), ('sangat', 189), ('saya', 180), ('bagus', 137), ('tidak', 120), ('aplikasi', 118), ('dan', 113), ('nya', 102), ('ini', 102), ('ada', 91), ('membantu', 84), ('mantap', 66), ('cicil', 65), ('udah', 61), ('buka', 58), ('gak', 57), ('sudah', 55), ('saldo', 54), ('kenapa', 52), ('transaksi', 51), ('ga', 49), ('ke', 48), ('mau', 47), ('lagi', 46), ('baik', 45), ('apk', 45), ('untuk', 45), ('ok', 43), ('masuk', 43), ('uang', 43), ('tapi', 42), ('yang', 42), ('ya', 39), ('gk', 38), ('mudah', 38), ('aja', 36), ('padahal', 36), ('sekali', 35), ('akun', 35), ('cepat', 33), ('malah', 33), ('tolong', 32), ('baru', 31), ('terus', 31), ('sekarang', 29), ('banget', 28), ('masih', 28), ('sering', 27)]
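
The list above is dominated by function words ('di', 'dan', 'nya') and the brand name, which makes slang candidates such as 'gak', 'ga', and 'gk' harder to spot. As a rough sketch, the same count can be repeated while skipping a set of known function words. The helper name `top_content_words`, the mini-corpus, and the exclusion set below are all hypothetical and only illustrate the idea.

```python
from collections import Counter

def top_content_words(texts, exclude, n=10):
    """Count words across all texts, skipping anything in `exclude`."""
    counts = Counter(
        word
        for text in texts
        for word in str(text).split()
        if word not in exclude
    )
    return counts.most_common(n)

# Hypothetical mini-corpus and exclusion set, for illustration only
reviews = ["dana gak bisa di buka", "saldo dana hilang", "gak bisa login di dana"]
function_words = {"dana", "di", "bisa"}
print(top_content_words(reviews, function_words, n=3))
```

On the real data, `texts` would be `df['content_clean']` and `exclude` could be whatever stopword list is chosen later in the notebook.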


### **Manual Sampling** 

#### **Step 4: Manual Data Sampling**
While frequency checks are useful, they lack context. In this step, I perform a qualitative review by sampling random rows of the reviews (taken from the `content_clean` column). This helps me understand the "flavor" of user language: how they express frustration, praise, and technical issues in their own words.

In [5]:
# Show 10 random examples from the cleaned reviews
print("--- Manual Sampling (Cleaned Content) ---")
samples = df['content_clean'].sample(10).values
for i, sample in enumerate(samples):
    print(f"{i+1}. {sample}")

--- Manual Sampling (Cleaned Content) ---
1. alhamdulillah klw punya aplikasi dana smua mudah dari apa aja semua serba bisa  trimakasih dana
2. the beast
3. good job team thanks
4. bagus
5. cukup membantu
6. sangat memuaskan
7. terbaik
8. sangat baik
9. semoga cepet muncul pitur dana cicil nya
10. mantap


### **Checking the context of specific words**

#### **Step 5: Contextual Investigation for Specific Keywords**
Based on the top words and manual samples, I investigate specific "high-signal" keywords (e.g., 'cicil', 'saldo', 'login'). This investigation reveals deep business insights, such as recurring technical bugs or specific feature dissatisfaction, which are crucial for the final business analysis report.

In [6]:
keywords = ['cicil', 'saldo', 'login', 'kecewa', 'update']

for word in keywords:
    print(f"\n=== Context for: {word} ===")
    # Taking 5 examples of reviews that contain those words
    display_samples = df[df['content_clean'].str.contains(word, case=False, na=False)]['content_clean'].head(5).values
    for i, text in enumerate(display_samples):
        print(f"{i+1}. {text}")


=== Context for: cicil ===
1. tolol apklikasi udh d munculin ccln masa iya blm dpt dan cicilan bertahap mengecewakan off pakai dana skrng dah kbnykn potong bertahunx pakai mengecewakan cuihhhh ganti yg lain juga bnyk g rugi juga g pakai dana orang dah bnyk gopay shopee pay juga ada udh paylater lg
2. fitur dana cicil fungsi nya apa klo ga bisa di gunain walau limit ada ga bisa bayar apapun di hapus aj ga guna
3. tolong dana cicil
4. blum pernah nyicil tiba tiba banyak cicilan dana gak jelas sekarang banyak pungli haram
5. kenapa dana saya tida ada cicill

=== Context for: saldo ===
1. apk dana tidak bisa di buka padahal jaringan enak nanti ujungnya saldo hangus tolong di perbaikin
2. saya sangat kecewa kepada dana semua sudah di lakukan tapi saldo nya tidak ada  uang raibhilang di rekening dana lapor admin jawabannya ngambang sepertinya aplikasi ini tidak aman lagi  kecewa berat bukti transfer ada tapi saldo hilang begitu aja lapor cs menunggu berharihari padahal uang ny mau di gunaka
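
Long reviews like the one above bury the keyword in a lot of surrounding text. A keyword-in-context view, which shows only a few tokens on either side of each match, can make the pattern easier to read. This is a minimal sketch: `keyword_in_context` and its `window` parameter are hypothetical names, and the example string is an excerpt of the second 'saldo' review.

```python
def keyword_in_context(text, keyword, window=3):
    """Return up to `window` tokens on each side of every token containing `keyword`."""
    tokens = str(text).split()
    snippets = []
    for i, token in enumerate(tokens):
        # Substring match, so 'cicil' also catches 'cicilan' and 'nyicil'
        if keyword in token:
            start = max(0, i - window)
            snippets.append(' '.join(tokens[start:i + window + 1]))
    return snippets

review = "bukti transfer ada tapi saldo hilang begitu aja lapor cs"
print(keyword_in_context(review, "saldo", window=2))
# -> ['ada tapi saldo hilang begitu']
```

The substring match mirrors the behavior of `str.contains` in the cell above; matching whole tokens only would require comparing `token == keyword` instead.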

### **Slang Normalization (Mapping)**

#### **Step 6: Slang Normalization (Mapping)**
App reviews are often written in informal Indonesian ("Bahasa Gaul"). I define a custom slang mapping dictionary to transform abbreviations like 'gk' to 'tidak' and 'ccln' to 'cicilan'. This normalization ensures that the sentiment analysis model can recognize the actual meaning behind the informal text.

In [7]:
slang_dict = {
    # From the findings obtained
    "ccln": "cicilan", "ccl": "cicil",
    "skrng": "sekarang", "skrg": "sekarang", "krg": "kurang",
    "kbnykn": "kebanyakan", "bnyk": "banyak",
    "klo": "kalau", "kalo": "kalau",
    "aj": "saja", "aja": "saja",
    "mw": "mau", "mo": "mau",
    "sdh": "sudah", "udh": "sudah", "dah": "sudah",
    "ndak": "tidak", "gk": "tidak", "g": "tidak", "ga": "tidak", "gak": "tidak",
    "apl": "aplikasi", "apk": "aplikasi", "apklikasi": "aplikasi",
    "notip": "notifikasi",
    "mending": "lebih baik",
    
    # Emotion/transaction words
    "eror": "error", "errror": "error",
    "ilang": "hilang", "raib": "hilang",
    "nyesel": "menyesal",
    "cuih": "kecewa", # Negative emotion interjection
}
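
The cell above only defines the dictionary. One way to apply it, sketched here under the assumption that replacement should happen on whole tokens, is a small lookup function. `normalize_slang` is a hypothetical helper name, and `slang_subset` is a four-entry excerpt of `slang_dict` so the example runs on its own.

```python
def normalize_slang(text, mapping):
    """Replace each whitespace-separated token that has an entry in `mapping`."""
    return ' '.join(mapping.get(token, token) for token in str(text).split())

# Excerpt of slang_dict, repeated here so the sketch is self-contained
slang_subset = {"apk": "aplikasi", "gk": "tidak", "skrng": "sekarang", "udh": "sudah"}

print(normalize_slang("apk gk bisa di buka skrng", slang_subset))
# -> aplikasi tidak bisa di buka sekarang
```

In the notebook this could be run as `df['content_clean'].apply(lambda t: normalize_slang(t, slang_dict))` before stopword removal. Because only exact tokens match, elongated spellings such as 'cuihhhh' or variants such as 'erorr' would not be caught by the 'cuih' and 'eror' entries.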

### **Stopwords Removal (Including Custom Brand Terms)**

#### **Step 7: Stopwords Removal**
Stopwords are common words that appear frequently but don't carry sentiment (e.g., 'dan', 'di'). In this step, I also remove context-specific terms like 'dana' and 'saya' to eliminate "noise" from the data, allowing the more meaningful emotional keywords to stand out in the visualizations.

In [10]:
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

factory = StopWordRemoverFactory()
# Get the standard Indonesian stopword list
stopwords = factory.get_stop_words()

# Add specific brand/context words (stored as a set for fast membership checks)
custom_stopwords = set(stopwords) | {'dana', 'saya', 'aplikasi', 'apk', 'nya', 'ya', 'ini', 'itu'}

def remove_stopwords(text):
    return ' '.join(word for word in str(text).split() if word not in custom_stopwords)

# Run before stemming
df['content_clean'] = df['content_clean'].apply(remove_stopwords)
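
General-purpose stopword lists can include negation words, and removing a word like 'tidak' turns "tidak bisa" (cannot) into "bisa" (can), which reverses the sentiment. Whether the Sastrawi list contains it can be checked with `'tidak' in stopwords`. The sketch below assumes it might and shows one way to carve out exceptions. The function names, the `keep_words` parameter, and the short `base` list (a stand-in for `factory.get_stop_words()`) are all hypothetical.

```python
def build_stopword_set(base_words, extra_words=(), keep_words=()):
    """Combine base and extra stopwords, then remove any words that must be kept."""
    return (set(base_words) | set(extra_words)) - set(keep_words)

def remove_stopwords_keep_negation(text, stopword_set):
    """Drop every token found in `stopword_set`."""
    return ' '.join(w for w in str(text).split() if w not in stopword_set)

# Stand-in for factory.get_stop_words(); the real list is much longer
base = ["dan", "di", "yang", "tidak"]
stopword_set = build_stopword_set(base, extra_words=["dana", "nya"], keep_words=["tidak"])

print(remove_stopwords_keep_negation("dana tidak bisa di buka", stopword_set))
# -> tidak bisa buka
```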

### **Indonesian Language Stemming (Sastrawi)**

#### **Step 8: Indonesian Language Stemming (Sastrawi)**
Finally, I use the Sastrawi library to perform stemming. This process reduces words to their root form (e.g., 'mengecewakan' becomes 'kecewa'). Stemming is vital in Indonesian NLP to consolidate various word forms into a single meaningful base, which improves the clustering of sentiments.

In [11]:
# Initialize Sastrawi Stemmer
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

factory = StemmerFactory()
stemmer = factory.create_stemmer()

# Execute Stemming Process
# Note: This might take 2-3 minutes for 1000 rows as Sastrawi processes each word
print("Starting Stemming process... Please wait.")

# Applying stemming to the cleaned content
df['content_clean'] = df['content_clean'].apply(lambda x: stemmer.stem(str(x)))

# Export the Final Cleaned Data
# Saving the result to a new CSV file for Phase 3
df.to_csv('../data/reviews_dana_cleaned.csv', index=False)

print("Process Completed! Cleaned data is saved to '../data/reviews_dana_cleaned.csv'")

Starting Stemming process... Please wait.
Process Completed! Cleaned data is saved to '../data/reviews_dana_cleaned.csv'
