# NER for Citizen Reports (Damaged Roads, Broken Streetlights, Garbage): Complete Draft Notebook

**Key features**
- Data collection via the **official X API** (*optional*), plus a **dummy CSV** option if you do not have access yet.
- **Pre-processing**: text cleaning, de-identification of handles/URLs, date normalization.
- **Weak supervision (baseline rules)**: regex and gazetteers for pre-labeling (ISSUE_TYPE, DATETIME, SEVERITY, LOCATION*).
- **NER annotation** workflow with **Doccano**: export a *template* and import the annotation results (JSONL spans), then convert them to **BIO**.
- **Fine-tuning IndoBERT** (HuggingFace) for **token classification**.
- **Evaluation** with **seqeval** (entity-level F1 per label plus the macro average).
- **Inference** that outputs structured JSON (issue_type, location, datetime, severity, etc.).
- (Optional) **Geocoding** with Nominatim (OpenStreetMap).
- (Optional) **FastAPI** `/extract` endpoint for integration.
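
The weak-supervision item in the list above relies on regex and gazetteer rules to pre-label text before manual annotation. The sketch below shows one way such rules could look for the ISSUE_TYPE label. The pattern table `ISSUE_PATTERNS`, the issue names, and the function `prelabel` are illustrative assumptions, not rules taken from this notebook.

```python
import re
from typing import Dict, List

# Hypothetical gazetteer: issue names mapped to trigger phrases (illustrative only).
ISSUE_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "ROAD_DAMAGE": re.compile(r"\bjalan(?:an)?\s+(?:rusak|berlubang)\b", re.IGNORECASE),
    "STREETLIGHT": re.compile(r"\blampu\s+jalan\s+(?:mati|padam)\b", re.IGNORECASE),
    "GARBAGE": re.compile(r"\bsampah\s+(?:menumpuk|berserakan)\b", re.IGNORECASE),
}


def prelabel(text: str) -> List[dict]:
    """Return character-offset ISSUE_TYPE spans, sorted by start offset."""
    spans = []
    for issue, pattern in ISSUE_PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append({
                "start": match.start(),
                "end": match.end(),
                "label": "ISSUE_TYPE",
                "value": issue,
            })
    return sorted(spans, key=lambda span: span["start"])


example = "Lampu jalan mati di depan pasar, jalan rusak parah"
spans = prelabel(example)
for span in spans:
    print(span["value"], repr(example[span["start"]:span["end"]]))
```

Character offsets are kept (rather than token indices) so the spans can be handed to an annotation tool that works on raw text.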

> **Ethics/ToS:** Use the **official X API**. Avoid scraping HTML/cookies in ways that violate the ToS.
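
For the Doccano step listed among the features, character spans have to be turned into token-level BIO tags before fine-tuning. The following is a minimal sketch under two assumptions: each exported record has a `text` field and a `label` list of `[start, end, tag]` triples, and tokens are split on whitespace. The function name `spans_to_bio` and the sample record are hypothetical.

```python
import re
from typing import List, Tuple


def spans_to_bio(text: str, spans: List[Tuple[int, int, str]]) -> List[Tuple[str, str]]:
    """Convert character spans to (token, BIO tag) pairs using whitespace tokens."""
    tagged = []
    previous_span = None  # index of the span that covered the previous token
    for match in re.finditer(r"\S+", text):
        tag, current_span = "O", None
        for index, (start, end, label) in enumerate(spans):
            # A token is assigned to a span only when it lies fully inside it.
            if match.start() >= start and match.end() <= end:
                current_span = index
                prefix = "I-" if previous_span == index else "B-"
                tag = prefix + label
                break
        previous_span = current_span
        tagged.append((match.group(), tag))
    return tagged


# Hypothetical record in the assumed JSONL shape.
record = {
    "text": "jalan rusak di Jepara",
    "label": [[0, 11, "ISSUE_TYPE"], [15, 21, "LOCATION"]],
}
bio = spans_to_bio(record["text"], [tuple(span) for span in record["label"]])
print(bio)
```

Tokens that only partly overlap a span are tagged `O` here; a real pipeline would also need to align these tags with the subword tokens produced by the IndoBERT tokenizer.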


In [None]:
# !pip install -U pip
# !pip install transformers datasets accelerate seqeval evaluate tqdm rich sentencepiece dateparser rapidfuzz unidecode pydantic==2.* python-dotenv requests ujson pandas scikit-learn matplotlib
# # Optional (CRF)
# # !pip install sklearn-crfsuite

import os, json, random, re, time, math
from dataclasses import dataclass
from typing import List, Dict, Any, Tuple

import numpy as np
import pandas as pd

BASE_DIR = os.getcwd()
DATA_DIR = os.path.join(BASE_DIR, 'data')
RAW_DIR = os.path.join(DATA_DIR, 'raw')
PROCESSED_DIR = os.path.join(DATA_DIR, 'processed')


for d in [DATA_DIR, RAW_DIR, PROCESSED_DIR]:
    os.makedirs(d, exist_ok=True)


In [None]:
!dpkg --configure -a
!sudo apt-get update
!sudo apt-get install -y ca-certificates curl gnupg
!sudo mkdir -p /etc/apt/keyrings
!curl -fsSL https://deb.nodesource.com/gpgkey/nodesource-repo.gpg.key | sudo gpg --dearmor -o /etc/apt/keyrings/nodesource.gpg
!NODE_MAJOR=18 && echo "deb [signed-by=/etc/apt/keyrings/nodesource.gpg] https://deb.nodesource.com/node_$NODE_MAJOR.x nodistro main" | sudo tee /etc/apt/sources.list.d/nodesource.list
!sudo apt-get update
!sudo apt-get install nodejs -y

Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,628 B]
Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]

In [None]:
auth_token = 'YOUR_AUTH_TOKEN'  # fill in with the auth_token value from your Twitter cookies; never commit a real token
filename = 'data_jalan.csv'
keyword = 'jalan rusak since:2023-09-01 until:2025-09-01 lang:id'
limit = 1000



!npx -y tweet-harvest@2.6.1 -o "{filename}" -s "{keyword}" --tab "LATEST" -l {limit} --token {auth_token}

Tweet Harvest [v2.6.1]

Research by Helmi Satria
Use it for Educational Purposes only!

This script uses Chromium Browser to crawl data from Twitter with your Twitter auth token.
Please enter your Twitter auth token when prompted.

Note: Keep your access token secret! Don't share it with anyone else.
Note: This script only runs on your local device.

Opening twitter search page...

-- Scrolling... (1) (2)
Filling in keywords: jalan rusak since:2023-09-01 until:2025-09-01 lang:id

 (3)

Your tweets saved to: /content/tweets-data/data_jalan.csv
Total tweets saved: 18

Your tweets saved to: /

In [None]:
auth_token = 'YOUR_AUTH_TOKEN'  # fill in with the auth_token value from your Twitter cookies; never commit a real token
filename = 'data_lampu.csv'
keyword = 'lampu jalan mati since:2023-09-01 until:2025-09-01 lang:id'
limit = 1000



!npx -y tweet-harvest@2.6.1 -o "{filename}" -s "{keyword}" --tab "LATEST" -l {limit} --token {auth_token}

Opening twitter search page...

-- Scrolling... (1) (2)
Filling in keywords: lampu jalan mati since:2023-09-01 until:2025-09-01 lang:id

 (3)

Your tweets saved to: /content/tweets-data/data_lampu.csv
Total tweets saved:

In [None]:
auth_token = 'YOUR_AUTH_TOKEN'  # fill in with the auth_token value from your Twitter cookies; never commit a real token
filename = 'data_sampah.csv'
keyword = 'sampah menumpuk since:2023-09-01 until:2025-09-01 lang:id'
limit = 500



!npx -y tweet-harvest@2.6.1 -o "{filename}" -s "{keyword}" --tab "LATEST" -l {limit} --token {auth_token}

Opening twitter search page...

-- Scrolling... (1) (2) (3)
Filling in keywords: sampah menumpuk since:2023-09-01 until:2025-09-01 lang:id

 (4)

Your tweets saved to: /content/tweets-data/data_sampah.csv
Total tweets saved: 20

-- Scrolling... (1)

In [None]:
auth_token = 'YOUR_AUTH_TOKEN'  # fill in with the auth_token value from your Twitter cookies; never commit a real token
filename = 'data_banjir.csv'
keyword = 'banjir since:2023-09-01 until:2025-09-01 lang:id'
limit = 500



!npx -y tweet-harvest@2.6.1 -o "{filename}" -s "{keyword}" --tab "LATEST" -l {limit} --token {auth_token}

Opening twitter search page...

-- Scrolling... (1) (2)
Filling in keywords: banjir since:2023-09-01 until:2025-09-01 lang:id

 (3)

Your tweets saved to: /content/tweets-data/data_banjir.csv
Total tweets saved: 17


In [None]:
auth_token = 'YOUR_AUTH_TOKEN'  # fill in with the auth_token value from your Twitter cookies; never commit a real token
filename = 'data_pohon.csv'
keyword = 'pohon tumbang since:2023-09-01 until:2025-09-01 lang:id'
limit = 500



!npx -y tweet-harvest@2.6.1 -o "{filename}" -s "{keyword}" --tab "LATEST" -l {limit} --token {auth_token}

Opening twitter search page...

-- Scrolling... (1) (2) (3)
Filling in keywords: pohon tumbang since:2023-09-01 until:2025-09-01 lang:id

 (4) (5) (6) (7)

Your tweets saved to: /co

In [None]:
# List of filenames to combine
filenames = [
    'data_jalan.csv',
    'data_lampu.csv',
    'data_sampah.csv',
    'data_banjir.csv',
    'data_pohon.csv'
]

# Directory where the files are saved
data_dir = '/content/tweets-data'

# List to hold individual DataFrames
dfs = []

# Read each CSV file and append its DataFrame to the list
for filename in filenames:
    filepath = os.path.join(data_dir, filename)
    if os.path.exists(filepath):
        df = pd.read_csv(filepath)
        dfs.append(df)
    else:
        print(f"File not found: {filepath}")

# Concatenate all DataFrames into a single DataFrame
combined_df = pd.concat(dfs, ignore_index=True)

# Save the combined DataFrame to a new CSV file
output_filename = 'combined_data.csv'
output_filepath = os.path.join(RAW_DIR, output_filename) # Use RAW_DIR from the first code cell
combined_df.to_csv(output_filepath, index=False)

print(f"Combined data saved to: {output_filepath}")

# Display the first few rows of the combined DataFrame
display(combined_df.head())

# Print the shape of the combined DataFrame
print(f"Shape of the combined DataFrame: {combined_df.shape}")

Combined data saved to: /content/data/raw/combined_data.csv


Unnamed: 0,conversation_id_str,created_at,favorite_count,full_text,id_str,image_url,in_reply_to_screen_name,lang,location,quote_count,reply_count,retweet_count,tweet_url,user_id_str,username
0,1962299990761734245,Sun Aug 31 23:42:30 +0000 2025,0,tapi sumpah guyss jgn rusak fasilitas umum kar...,1962299990761734245,,,in,,0,0,0,https://x.com/undefined/status/196229999076173...,1693985563450433536,
1,1962298947650920833,Sun Aug 31 23:38:21 +0000 2025,1,Beneran gw pas jalan kerja di gojek nangis sed...,1962298947650920833,,,in,,0,0,0,https://x.com/undefined/status/196229894765092...,1574439522053804032,
2,1962296611000000927,Sun Aug 31 23:29:04 +0000 2025,16,Mobil Dinas Kesulitan di Jalan Rusak Warga Pil...,1962296611000000927,,,in,,1,1,2,https://x.com/undefined/status/196229661100000...,41730943,
3,1962285953726509097,Sun Aug 31 22:46:43 +0000 2025,1,helehhhh prett bgt orang jepara aslinya gamau ...,1962285953726509097,,,in,,0,0,0,https://x.com/undefined/status/196228595372650...,1121031670289690625,
4,1962179468489113897,Sun Aug 31 22:33:25 +0000 2025,0,@yorselves Apakah adil rakyat yg mati gegara l...,1962282606768341410,,yorselves,in,,0,0,0,https://x.com/undefined/status/196228260676834...,1393643270,


Shape of the combined DataFrame: (3272, 15)


In [None]:
!pip install sastrawi
!pip install wordcloud
!pip install nltk
!pip install emoji
!pip install nlp-id
!pip install tqdm
!pip install transformers
!pip install torch

Collecting sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl.metadata (909 bytes)
Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209 kB)
Installing collected packages: sastrawi
Successfully installed sastrawi-1.0.1
Collecting emoji
  Downloading emoji-2.15.0-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.15.0-py3-none-any.whl (608 kB)
Installing collected packages: emoji
Successfully installed emoji-2.15.0
Collecting nlp-id
  Downloading nlp_id-0.1.20.0-py3-none-any.whl.metadata (7.6 kB)
Collecting huggingface-hub==0.34.4 (from nlp-id)
  Downloading huggingface_hub-0.34.4-py3-none-any.whl.metadata (14 kB)

In [None]:
# Import the required libraries
import pandas as pd
import numpy as np
import re
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import nltk
import emoji
from nltk.tokenize import word_tokenize
from tqdm import tqdm

# Libraries for Indonesian-language preprocessing
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

from nlp_id.tokenizer import Tokenizer
from nlp_id.stopword import StopWord

# Download NLTK data
nltk.download('punkt_tab')

print("All libraries imported successfully!")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


All libraries imported successfully!


In [None]:
# Load dataset
df = pd.read_csv('/content/data/raw/combined_data.csv')

print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print(f"Available columns: {list(df.columns)}")
display(df.head(10))

Dataset loaded successfully!
Dataset shape: (3272, 15)
Available columns: ['conversation_id_str', 'created_at', 'favorite_count', 'full_text', 'id_str', 'image_url', 'in_reply_to_screen_name', 'lang', 'location', 'quote_count', 'reply_count', 'retweet_count', 'tweet_url', 'user_id_str', 'username']


Unnamed: 0,conversation_id_str,created_at,favorite_count,full_text,id_str,image_url,in_reply_to_screen_name,lang,location,quote_count,reply_count,retweet_count,tweet_url,user_id_str,username
0,1962299990761734245,Sun Aug 31 23:42:30 +0000 2025,0,tapi sumpah guyss jgn rusak fasilitas umum kar...,1962299990761734245,,,in,,0,0,0,https://x.com/undefined/status/196229999076173...,1693985563450433536,
1,1962298947650920833,Sun Aug 31 23:38:21 +0000 2025,1,Beneran gw pas jalan kerja di gojek nangis sed...,1962298947650920833,,,in,,0,0,0,https://x.com/undefined/status/196229894765092...,1574439522053804032,
2,1962296611000000927,Sun Aug 31 23:29:04 +0000 2025,16,Mobil Dinas Kesulitan di Jalan Rusak Warga Pil...,1962296611000000927,,,in,,1,1,2,https://x.com/undefined/status/196229661100000...,41730943,
3,1962285953726509097,Sun Aug 31 22:46:43 +0000 2025,1,helehhhh prett bgt orang jepara aslinya gamau ...,1962285953726509097,,,in,,0,0,0,https://x.com/undefined/status/196228595372650...,1121031670289690625,
4,1962179468489113897,Sun Aug 31 22:33:25 +0000 2025,0,@yorselves Apakah adil rakyat yg mati gegara l...,1962282606768341410,,yorselves,in,,0,0,0,https://x.com/undefined/status/196228260676834...,1393643270,
5,1961745502762078665,Sun Aug 31 18:34:54 +0000 2025,0,@alisyarief Jangan rusak fasilitas tapi kejar ...,1962222581517537415,,alisyarief,in,,0,0,0,https://x.com/undefined/status/196222258151753...,1855268187392323584,
6,1961626746744443281,Sun Aug 31 16:50:43 +0000 2025,0,@SwaggyMoYT @Elektroo12_ ANJINK GOBLOK. Dari d...,1962196362034049306,,busetwaduhkacau,in,,0,1,0,https://x.com/undefined/status/196219636203404...,1509411039187718147,
7,1962186910639817124,Sun Aug 31 16:13:09 +0000 2025,1,Stay safe ges yg mau ikut turun besok. Sekolah...,1962186910639817124,,,in,,0,1,0,https://x.com/undefined/status/196218691063981...,1127213416995741699,
8,1962157841952969209,Sun Aug 31 16:02:38 +0000 2025,0,@kanggalongratis Ga diliburin kah rijjak? Seko...,1962184263245451361,,kanggalongratis,in,,0,1,0,https://x.com/undefined/status/196218426324545...,1733521150263877632,
9,1962177559460012253,Sun Aug 31 15:37:34 +0000 2025,0,@budiwsso Berdasarkan info terkini dari berita...,1962177954617979057,,budiwsso,in,,0,0,0,https://x.com/undefined/status/196217795461797...,1720665183188922368,


In [None]:
# Create a new DataFrame with selected columns
cleaned_df = combined_df[['created_at', 'full_text']].copy()

# Rename columns for clarity (optional, but good practice)
cleaned_df = cleaned_df.rename(columns={
    'full_text': 'text'
})

# Define the output path for the new CSV file
output_cleaned_filepath = os.path.join(PROCESSED_DIR, 'cleaned_combined_data.csv') # Use PROCESSED_DIR

# Save the new DataFrame to a CSV file
cleaned_df.to_csv(output_cleaned_filepath, index=False)

print(f"Cleaned data saved to: {output_cleaned_filepath}")

# Display the first few rows of the new DataFrame
display(cleaned_df.head())

# Print the shape of the new DataFrame
print(f"Shape of the cleaned DataFrame: {cleaned_df.shape}")

Cleaned data saved to: /content/data/processed/cleaned_combined_data.csv


Unnamed: 0,created_at,text
0,Sun Aug 31 23:42:30 +0000 2025,tapi sumpah guyss jgn rusak fasilitas umum kar...
1,Sun Aug 31 23:38:21 +0000 2025,Beneran gw pas jalan kerja di gojek nangis sed...
2,Sun Aug 31 23:29:04 +0000 2025,Mobil Dinas Kesulitan di Jalan Rusak Warga Pil...
3,Sun Aug 31 22:46:43 +0000 2025,helehhhh prett bgt orang jepara aslinya gamau ...
4,Sun Aug 31 22:33:25 +0000 2025,@yorselves Apakah adil rakyat yg mati gegara l...


Shape of the cleaned DataFrame: (3272, 2)
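
The feature list mentions date normalization, and the `created_at` column shown above uses strings such as `Sun Aug 31 23:42:30 +0000 2025`. The sketch below parses that layout with the standard library and converts it to ISO 8601. The helper name `parse_created_at` and the conversion to UTC+7 (Jakarta time) are assumptions made for illustration; the notebook itself does not specify a target time zone.

```python
from datetime import datetime, timedelta, timezone

# Layout of the created_at strings shown in the table above.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Assumption: reports are read in Western Indonesia Time (UTC+7).
WIB = timezone(timedelta(hours=7))


def parse_created_at(value: str) -> datetime:
    """Parse a created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, CREATED_AT_FORMAT)


ts = parse_created_at("Sun Aug 31 23:42:30 +0000 2025")
ts_utc = ts.isoformat()
ts_wib = ts.astimezone(WIB).isoformat()
print(ts_utc)
print(ts_wib)
```

Note that `%a` and `%b` depend on the process locale, so this assumes English day and month names are in effect.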


In [None]:
import os, re, random
import numpy as np
import pandas as pd
import emoji
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from nlp_id.tokenizer import Tokenizer
from nlp_id.stopword import StopWord



# === Indonesian NLP tools ===
tokenizer_nlp = Tokenizer()
stopwords = set(StopWord().get_stopword())
stemmer = StemmerFactory().create_stemmer()

# === Slang Dictionary ===
def load_slang_dict(file_path="/content/slangword.txt"):
    slang_dict = {}
    if os.path.exists(file_path):
        with open(file_path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or ':' not in line:
                    continue
                slang, meaning = line.split(':', 1)
                slang_dict[slang.strip().lower()] = meaning.strip().lower()
    return slang_dict

slang_dict = load_slang_dict()

def mapping_slang(tokens, slang_dict):
    return [slang_dict.get(w.lower(), w) for w in tokens]

# === Regex Utility ===
URL_RE = re.compile(r'https?://\S+')
HANDLE_RE = re.compile(r'@\w+')
WHITESPACE_RE = re.compile(r'\s+')

# === Clean + Preprocess ===
def clean_text(text: str) -> str:
    """
    Combines the earlier clean_text function with preprocess_text.
    """
    # Handle missing values
    if not isinstance(text, str):
        return ""

    # --- Step 1: Basic cleaning (from the earlier code)
    # The placeholders lose their angle brackets in Step 4, so a handle
    # ends up as the plain token "user" in the cleaned text.
    text = URL_RE.sub(" <URL> ", text)
    text = HANDLE_RE.sub(" <USER> ", text)
    text = WHITESPACE_RE.sub(" ", text).strip()

    # --- Step 2: Case folding
    text = text.lower()

    # --- Step 3: Remove leftover URLs, mentions, emoji, hashtags
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    text = emoji.replace_emoji(text, replace="")
    text = re.sub(r"#\w+", "", text)
    # Pad with spaces so neighbouring words are not fused together
    text = re.sub(r"&amp;", " dan ", text)

    # --- Step 4: Remove special characters and digits, then collapse
    # runs of three or more identical characters into one
    text = re.sub(r"[^a-z\s]", " ", text)
    text = re.sub(r"(.)\1{2,}", r"\1", text)

    # --- Step 5: Tokenization
    tokens = tokenizer_nlp.tokenize(text)

    # --- Step 6: Slang mapping
    tokens = mapping_slang(tokens, slang_dict)

    # --- Step 7: Stopword removal + drop words of two characters or fewer
    tokens = [w for w in tokens if w not in stopwords and len(w) > 2]

    # --- Step 8: Stemming (Sastrawi)
    tokens = [stemmer.stem(w) for w in tokens]

    return " ".join(tokens)

print("✅ clean_text is ready with slang mapping, stopword removal, and stemming!")


✅ clean_text is ready with slang mapping, stopword removal, and stemming!


In [None]:
# Apply preprocessing to the data
print("Preprocessing text...")

# Apply preprocessing to the text column
tqdm.pandas()
cleaned_df['cleaned_text'] = cleaned_df['text'].progress_apply(clean_text)

print("Preprocessing finished!")
display(cleaned_df.head())

Preprocessing text...


100%|██████████| 3272/3272 [14:39<00:00,  3.72it/s]

Preprocessing finished!





Unnamed: 0,created_at,text,cleaned_text
0,Sun Aug 31 23:42:30 +0000 2025,tapi sumpah guyss jgn rusak fasilitas umum kar...,sumpah guyss rusak fasilitas kitaa pakai rambu...
1,Sun Aug 31 23:38:21 +0000 2025,Beneran gw pas jalan kerja di gojek nangis sed...,beneran jalan kerja gojek nang sedih lihat hal...
2,Sun Aug 31 23:29:04 +0000 2025,Mobil Dinas Kesulitan di Jalan Rusak Warga Pil...,mobil dinas sulit jalan rusak warga pilih reka...
3,Sun Aug 31 22:46:43 +0000 2025,helehhhh prett bgt orang jepara aslinya gamau ...,heleh prett orang jepara asli gamau ribet pusi...
4,Sun Aug 31 22:33:25 +0000 2025,@yorselves Apakah adil rakyat yg mati gegara l...,user adil rakyat mati gegara lubang jalan kual...


In [None]:
# Define the output path for the preprocessed data
output_preprocessed_filepath = os.path.join(PROCESSED_DIR, 'preprocessed_data.csv')

# Save the cleaned DataFrame with the new 'cleaned_text' column to a CSV file
cleaned_df.to_csv(output_preprocessed_filepath, index=False)

print(f"Preprocessed data saved to: {output_preprocessed_filepath}")

Preprocessed data saved to: /content/data/processed/preprocessed_data.csv
