# **Vector Space Model**

The Vector Space Model (VSM) is a way to turn text into a form a computer can work with. Imagine each document (for example, a news article) mapped to a point in a high-dimensional space. Each dimension represents a word or token, and the point's position is determined by how often those words appear in the document.
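As a minimal sketch of this idea, the snippet below maps two made-up sentences to count vectors over a shared vocabulary. The sentences and the names `docs`, `vocabulary`, and `to_vector` are illustrative assumptions, not part of the notebook, which uses TF-IDF weights rather than raw counts.

```python
from collections import Counter

# Two made-up documents (hypothetical, for illustration only)
docs = {
    "doc_a": "rupiah melemah terhadap dolar",
    "doc_b": "dolar menguat rupiah melemah rupiah tertekan",
}

# One dimension per distinct token across all documents
vocabulary = sorted({tok for text in docs.values() for tok in text.split()})

def to_vector(text):
    """Count how often each vocabulary token occurs in the text."""
    counts = Counter(text.split())
    return [counts[tok] for tok in vocabulary]

vectors = {name: to_vector(text) for name, text in docs.items()}

print(vocabulary)
for name, vec in vectors.items():
    print(name, vec)
```

Each document becomes a point whose coordinates are its token counts; documents that share many words end up close together in that space.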

### Preprocessing

In [46]:
!pip install Sastrawi nltk



**Sastrawi**: This library is designed specifically for processing Indonesian text. One of its main functions is stemming, which reduces a word to its root form (for example, "memakan" becomes "makan"). Stemming matters when building a Vector Space Model (VSM) because it groups inflected forms of the same word into a single term.

**nltk**: Short for Natural Language Toolkit, this library provides a range of tools for natural language processing.

In [47]:
# import the required libraries
import pandas as pd
import re
from tqdm import tqdm
import nltk
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk.corpus import stopwords
import string

In [48]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [49]:
df = pd.read_csv("/content/drive/MyDrive/PPW/Hasil_crawling_cnbc.csv")

In [50]:
df.head()

Unnamed: 0,judul,tanggal,isi,url,kategori
0,5 Terminal Bus Populer di Jakarta yang Melayan...,19 December 2024 18:40,"Jakarta, CNBC Indonesia- Musim mudik liburan N...",https://www.cnbcindonesia.com/research/2024121...,Research
1,Waspada! ini Titik Rawan Macet di Tol Pulau Ja...,19 December 2024 17:35,"Jakarta, CNBC Indonesia- Arus mudik periode li...",https://www.cnbcindonesia.com/research/2024121...,Research
2,"Termasuk Alfamart, Ini Daftar 5 Raksasa Jaring...",19 December 2024 16:55,"Jakarta,CNBC Indonesia-Indonesia menjadi rumah...",https://www.cnbcindonesia.com/research/2024121...,Research
3,"Amukan Dolar AS Tak Terbendung, Rupiah Bernasi...",19 December 2024 14:30,"Jakarta, CNBC Indonesia -Nilai tukar mata uang...",https://www.cnbcindonesia.com/research/2024121...,Research
4,Ini 6 BPR Syariah Terbaik Penopang Usaha Kecil...,19 December 2024 14:10,"Jakarta, CNBC Indonesia -Bank Pembiayaan Rakya...",https://www.cnbcindonesia.com/research/2024121...,Research


In [51]:
# keep only the isi and kategori columns
df = df[['isi','kategori']]
df.head()

Unnamed: 0,isi,kategori
0,"Jakarta, CNBC Indonesia- Musim mudik liburan N...",Research
1,"Jakarta, CNBC Indonesia- Arus mudik periode li...",Research
2,"Jakarta,CNBC Indonesia-Indonesia menjadi rumah...",Research
3,"Jakarta, CNBC Indonesia -Nilai tukar mata uang...",Research
4,"Jakarta, CNBC Indonesia -Bank Pembiayaan Rakya...",Research


In [52]:
# combine the "Market" and "News" categories, sampling 50 articles from each
df_combined = pd.concat([
    df[df["kategori"] == "Market"].sample(n=50, random_state=42),
    df[df["kategori"] == "News"].sample(n=50, random_state=42)
])

# Now df_combined holds the combined data
df_combined

Unnamed: 0,isi,kategori
383,"Jakarta, CNBC Indonesia -Rupiah melemah terhad...",Market
353,"Jakarta, CNBC Indonesia- PT Bank Rakyat Indone...",Market
370,"Jakarta, CNBC Indonesia- Eksistensi produk ele...",Market
345,"Jakarta, CNBC Indonesia- Indeks Harga Saham Ga...",Market
344,"Jakarta, CNBC Indonesia -Pasar kripto mengalam...",Market
...,...,...
119,"Jakarta, CNBC Indonesia- Perdana Menteri (PM) ...",News
181,"Jakarta, CNBC Indonesia -Kementerian Kelautan ...",News
125,"Jakarta, CNBC Indonesia- Pemerintahan Presiden...",News
162,"Jakarta, CNBC Indonesia- Mantan Menteri Energi...",News


In [53]:
df_combined.info()

<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 383 to 113
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   isi       100 non-null    object
 1   kategori  100 non-null    object
dtypes: object(2)
memory usage: 2.3+ KB


In [54]:
data = df_combined.copy()

data = data.sample(frac=1, ignore_index=True)

Shuffling the rows before modeling, as done here for the Vector Space Model, matters for two reasons:

- Prevents ordering bias: `df_combined` lists all 50 Market articles followed by all 50 News articles, so any order-sensitive step (such as taking the first rows as training data) would otherwise see a skewed class mix. Shuffling helps the model learn from general patterns in the data rather than from the order in which rows were supplied.
- Improves generalization: a model trained on a well-mixed sample tends to predict previously unseen data better.

In short, this line prepares the data for the modeling steps that follow.
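One caveat: `data.sample(frac=1, ignore_index=True)` is called without a seed, so the row order changes on every run. A minimal sketch of seeded shuffling using only the standard library follows; the `records` list and the `shuffled` helper are hypothetical stand-ins for the DataFrame. With pandas, passing `random_state=42` to `sample` should give the same kind of reproducibility.

```python
import random

# Hypothetical stand-in for the DataFrame: ten rows ordered by category
records = [("doc_%d" % i, "Market" if i < 5 else "News") for i in range(10)]

def shuffled(rows, seed):
    """Return a shuffled copy of rows; the same seed gives the same order."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows

first = shuffled(records, seed=42)
second = shuffled(records, seed=42)

print(first == second)                    # same seed, same order
print(sorted(first) == sorted(records))   # no rows lost or duplicated
```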

#### Proses Cleaning Text

**Filtering Stopwords**:
- Removes stopwords (common words that carry little meaning for text analysis, such as "yang", "dan", "di") using the Indonesian stopword list provided by NLTK, so that only the informative words remain in the text.

In [55]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [56]:
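# Load the Indonesian stopword list as a set for fast membership tests.
# Referring to nltk.corpus.stopwords explicitly keeps this cell re-runnable,
# because the name `stopwords` is rebound from the module to the word set here.
stopwords = set(nltk.corpus.stopwords.words('indonesian'))
# save stopwords, one word per line
with open('stopwords.txt', 'w') as f:
    for item in sorted(stopwords):
        f.write("%s\n" % item)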

The **cleansing function** cleans the raw text before analysis.

How it works:

- Collapses extra whitespace and removes non-ASCII characters, digits, and punctuation.
- Converts all letters to lowercase.
- Removes newlines so the text becomes a single line.

Purpose:

To prepare the text so it is easier for a computer to process and analyze, for example when building a Vector Space Model.

In [57]:
def cleansing(text):
    # Collapse runs of whitespace (including newlines) into a single space
    text = re.sub(r'[\s]+', ' ', text)
    # Drop non-ASCII characters
    text = text.encode('ascii', 'ignore').decode('utf-8')
    # Remove any remaining non-ASCII characters (a no-op after the step above)
    text = re.sub(r'[^\x00-\x7f]', r'', text)
    # Remove digits
    text = re.sub(r'\d+', '', text)
    # Convert the text to lowercase
    text = text.lower()
    # Replace hyphens that sit between two word characters with a space
    text = re.sub(r'\b-\b', ' ', text)
    # Replace punctuation with a space
    text = re.sub(r'[^\w\s]+', ' ', text)
    # Remove newlines (already collapsed by the first step)
    text = text.replace('\n', '')
    return text
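To see what these steps do, here is a small sketch on a made-up sentence. `cleansing_sketch` is a condensed, hypothetical copy of the regex steps above (the redundant no-op steps are left out), defined again only so the example runs on its own.

```python
import re

def cleansing_sketch(text):
    """Condensed version of the cleaning steps, for illustration only."""
    text = re.sub(r'\s+', ' ', text)                       # collapse whitespace
    text = text.encode('ascii', 'ignore').decode('utf-8')  # drop non-ASCII
    text = re.sub(r'\d+', '', text)                        # remove digits
    text = text.lower()                                    # lowercase
    text = re.sub(r'\b-\b', ' ', text)                     # split hyphenated words
    text = re.sub(r'[^\w\s]+', ' ', text)                  # punctuation -> space
    return text

sample = "Diskon 50% pada Januari-Februari 2025,\nkata PLN."
cleaned = cleansing_sketch(sample)

# Punctuation is replaced by spaces, so runs of spaces can remain;
# splitting on whitespace shows the surviving tokens.
print(cleaned.split())
```

The leftover runs of spaces are harmless in this pipeline because the stopword step splits the text on whitespace again.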

**Remove Stopword**

In [58]:
def stopword(text):
    text = text.split() # Split the text into words
    text = [word for word in text if word not in stopwords] # Drop stopwords from the text
    text = ' '.join(text) # Join the remaining words back into a string
    return text
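A minimal sketch of the same filtering on a made-up sentence. The five-word `demo_stopwords` set is a hypothetical stand-in; the notebook uses NLTK's full Indonesian list.

```python
# Hypothetical mini stopword set, for illustration only
demo_stopwords = {"yang", "dan", "di", "akan", "pada"}

def remove_stopwords(text, stopword_set):
    """Keep only the words that are not in the stopword set."""
    return " ".join(w for w in text.split() if w not in stopword_set)

sample = "pelanggan yang akan menerima diskon pada januari dan februari"
filtered = remove_stopwords(sample, demo_stopwords)
print(filtered)
```

Using a set rather than a list keeps each membership test constant-time, which adds up when every token of every article is checked.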

#### Stemming

In [59]:
# Build the Sastrawi stemmer once; creating it on every call is slow
stemmer = StemmerFactory().create_stemmer()

def stemming_indo(text):
    # Reduce every word in the text to its root form
    return stemmer.stem(text)

In [60]:
def preprocessing(text):
    clean_text = cleansing(text)
    stopword_text = stopword(clean_text)
    stemmed_text = stemming_indo(stopword_text)
    return stemmed_text

In [61]:
print(data["isi"][0])

Jakarta, CNBC Indonesia -Direktur Utama PT PLN (Persero), Darmawan Prasodjo membeberkan empat pelanggan listrik yang akan menerima paket stimulus ekonomi melalui diskon tarif listrik 50% pada Januari dan Februari 2025. Setidaknya ada 81,4 juta pelanggan Rumah Tangga (RT) dari empat yang akan mendapatkan diskon tersebut.
Diantaranya: golongan listrik 450 Volt Amphere (VA) sebanyak 24,6 juta pelanggan, kemudian 900 VA sebanyak 38 juta pelanggan, lalu 1.300 VA sebanyak 14,1 juta dan pelanggan 2.200 VA sekitar 4,6 juta.
"Ini menyasar 97% pelanggan, diskon 50% pada bulan Januari-Februari 2025. Ini berkah untuk daya beli masyarakat, kami siap menjalankan berkah ini tentunya untuk pelanggan pra bayar kami, misalnya beli Rp 100 ribu bisa jadi separuhnya," terang Darmawan Prasodjo, dalam Konfrensi Pers: Paket Stimulus Ekonomi Untuk Kesejahteraan Masyarakat, dikutip Kamis (19/12/2024).


Darmawan menegaskan, PLN mendukung penuh kebijakan tersebut dan memastikan mekanisme penyaluran diskon listri

In [62]:
print(preprocessing(data["isi"][0]))

jakarta cnbc indonesia direktur utama pt pln persero darmawan prasodjo beber langgan listrik terima paket stimulus ekonomi diskon tarif listrik januari februari juta langgan rumah tangga rt diskon golong listrik volt amphere va juta langgan va juta langgan va juta langgan va juta sasar langgan diskon januari februari berkah daya beli masyarakat jalan berkah langgan pra bayar beli rp ribu paruh terang darmawan prasodjo konfrensi pers paket stimulus ekonomi sejahtera masyarakat kutip kamis darmawan pln dukung penuh bijak mekanisme salur diskon listrik jalan sasar proses registrasi jabar diskon langgan pascabayar nikmat otomatis langgan bayar tagih listrik periode januari februari langgan prabayar diskon oleh langgan beli token listrik periode langgan pascabayar nominal tagih bulan otomatis rang bayar listrik langgan prabayar potong langsung dapat langgan beli token listrik mana pln mobile ritel ritel agen mana jabar darmawan menteri uang menkeu sri mulyani bijak apbn instrumen selenggara

In [63]:
data["cleaned"] = data["isi"].apply(preprocessing)

In [64]:
data.head()

Unnamed: 0,isi,kategori,cleaned
0,"Jakarta, CNBC Indonesia -Direktur Utama PT PLN...",News,jakarta cnbc indonesia direktur utama pt pln p...
1,"Jakarta, CNBC Indonesia -Kementerian Energi da...",News,jakarta cnbc indonesia menteri energi sumber d...
2,"Jakarta, CNBC Indonesia-Pengusaha yang tergabu...",News,jakarta cnbc indonesia usaha gabung asosiasi u...
3,"Jakarta, CNBC Indonesia- Tempat rekreasi Taman...",Market,jakarta cnbc indonesia rekreasi taman mini ind...
4,"Jakarta, CNBC Indonesia -Rupiah dalam melawan ...",Market,jakarta cnbc indonesia rupiah lawan dolar amer...


In [65]:
data["isi"][0]

'Jakarta, CNBC Indonesia -Direktur Utama PT PLN (Persero), Darmawan Prasodjo membeberkan empat pelanggan listrik yang akan menerima paket stimulus ekonomi melalui diskon tarif listrik 50% pada Januari dan Februari 2025. Setidaknya ada 81,4 juta pelanggan Rumah Tangga (RT) dari empat yang akan mendapatkan diskon tersebut.\nDiantaranya: golongan listrik 450 Volt Amphere (VA) sebanyak 24,6 juta pelanggan, kemudian 900 VA sebanyak 38 juta pelanggan, lalu 1.300 VA sebanyak 14,1 juta dan pelanggan 2.200 VA sekitar 4,6 juta.\n"Ini menyasar 97% pelanggan, diskon 50% pada bulan Januari-Februari 2025. Ini berkah untuk daya beli masyarakat, kami siap menjalankan berkah ini tentunya untuk pelanggan pra bayar kami, misalnya beli Rp 100 ribu bisa jadi separuhnya," terang Darmawan Prasodjo, dalam Konfrensi Pers: Paket Stimulus Ekonomi Untuk Kesejahteraan Masyarakat, dikutip Kamis (19/12/2024).\n\n\nDarmawan menegaskan, PLN mendukung penuh kebijakan tersebut dan memastikan mekanisme penyaluran diskon 

In [66]:
data["cleaned"][0]

'jakarta cnbc indonesia direktur utama pt pln persero darmawan prasodjo beber langgan listrik terima paket stimulus ekonomi diskon tarif listrik januari februari juta langgan rumah tangga rt diskon golong listrik volt amphere va juta langgan va juta langgan va juta langgan va juta sasar langgan diskon januari februari berkah daya beli masyarakat jalan berkah langgan pra bayar beli rp ribu paruh terang darmawan prasodjo konfrensi pers paket stimulus ekonomi sejahtera masyarakat kutip kamis darmawan pln dukung penuh bijak mekanisme salur diskon listrik jalan sasar proses registrasi jabar diskon langgan pascabayar nikmat otomatis langgan bayar tagih listrik periode januari februari langgan prabayar diskon oleh langgan beli token listrik periode langgan pascabayar nominal tagih bulan otomatis rang bayar listrik langgan prabayar potong langsung dapat langgan beli token listrik mana pln mobile ritel ritel agen mana jabar darmawan menteri uang menkeu sri mulyani bijak apbn instrumen selenggar

In [67]:
# Save the dataframe before reading it
data.to_csv("data_VSMbersih.csv", index=False)

# Now read the saved file
data = pd.read_csv("data_VSMbersih.csv")

### TF-IDF

### Split data

In [68]:
data.head()

Unnamed: 0,isi,kategori,cleaned
0,"Jakarta, CNBC Indonesia -Direktur Utama PT PLN...",News,jakarta cnbc indonesia direktur utama pt pln p...
1,"Jakarta, CNBC Indonesia -Kementerian Energi da...",News,jakarta cnbc indonesia menteri energi sumber d...
2,"Jakarta, CNBC Indonesia-Pengusaha yang tergabu...",News,jakarta cnbc indonesia usaha gabung asosiasi u...
3,"Jakarta, CNBC Indonesia- Tempat rekreasi Taman...",Market,jakarta cnbc indonesia rekreasi taman mini ind...
4,"Jakarta, CNBC Indonesia -Rupiah dalam melawan ...",Market,jakarta cnbc indonesia rupiah lawan dolar amer...


In [69]:
data = data.sample(frac = 1, ignore_index=True)

In [70]:
data_train = data[:80]
data_test = data[80:]
data_train.head(10)

Unnamed: 0,isi,kategori,cleaned
0,"Jakarta, CNBC Indonesia- Hapus tagih kredit us...",Market,jakarta cnbc indonesia hapus tagih kredit usah...
1,"Jakarta, CNBC Indonesia- Salim Grup terkenal d...",Market,jakarta cnbc indonesia salim grup kenal bisnis...
2,"Jakarta, CNBC Indonesia -Liburan periode peray...",Market,jakarta cnbc indonesia libur periode raya nata...
3,"Jakarta, CNBC Indonesia- Pemerintahan Presiden...",News,jakarta cnbc indonesia perintah presiden prabo...
4,"Jakarta, CNBC Indonesia- Pemerintah Jepang mas...",News,jakarta cnbc indonesia perintah jepang komitme...
5,"Jakarta, CNBC Indonesia- Selama tiga hari beru...",Market,jakarta cnbc indonesia untun indeks harga saha...
6,"Jakarta, CNBC Indonesia- Perdana Menteri (PM) ...",News,jakarta cnbc indonesia perdana menteri pm kana...
7,"Jakarta, CNBC Indonesia -Menteri Koordinator B...",News,jakarta cnbc indonesia menteri koordinator bid...
8,"Jakarta, CNBC Indonesia -Pemerintah bakal memb...",News,jakarta cnbc indonesia perintah insentif kenda...
9,"Jakarta, CNBC Indonesia- Perusahaan holding di...",Market,jakarta cnbc indonesia usaha holding bidang en...


In [71]:
def tfidf_vsm(data, kategori):
    # Fit the vectorizer on the given texts and build the TF-IDF matrix
    tfidf = TfidfVectorizer()
    tfidf_matrix = tfidf.fit_transform(data)
    feature_names = tfidf.get_feature_names_out()

    # One row per document, one column per term, with the label in front
    df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
    df_tfidf.insert(0, 'Kategori Berita', kategori.reset_index(drop=True))

    return tfidf, df_tfidf

tfidf_model, df_tfidf = tfidf_vsm(data_train['cleaned'], data_train['kategori'])
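For intuition about the weights in this matrix, the sketch below computes TF-IDF by hand on three made-up documents. It assumes the default `TfidfVectorizer` settings: raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The `tfidf_vectors` helper and the toy documents are hypothetical, and tokenization is a plain whitespace split, which differs from scikit-learn's default token pattern (for instance, it keeps one-letter tokens).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

toy_docs = ["rupiah lemah dolar", "rupiah kuat", "menteri energi"]
vocab, rows = tfidf_vectors(toy_docs)

for doc, row in zip(toy_docs, rows):
    print(doc, [round(v, 3) for v in row])
```

In the first toy document, "rupiah" appears in two of the three documents and therefore receives a lower weight than "dolar", which appears in only one.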

In [72]:
df_tfidf.head(10)

Unnamed: 0,Kategori Berita,abad,abbas,abdurrahman,abstain,ac,acara,acceptance,aceh,aces,...,yuan,yuannya,yulaswati,yuliono,yupiter,zenix,ziswaf,zona,zulhas,zulkifli
0,Market,0.0,0.0,0.067267,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Market,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Market,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,News,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,News,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Market,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.099036,0.0,0.0
6,News,0.0,0.0,0.0,0.024929,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,News,0.0,0.0,0.0,0.0,0.0,0.038468,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,News,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.048693,0.0,0.0,0.0,0.0
9,Market,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [73]:
df_tfidf.shape

(80, 3053)

In [74]:
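def model_tf_idf(data, model, kategori):
    # Reuse the vectorizer fitted on the training data (transform only, no refit)
    tfidf_matrix = model.transform(data)
    feature_names = model.get_feature_names_out()

    # Same column layout as the training matrix, with the label in front
    df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
    df_tfidf.insert(0, 'Kategori Berita', kategori.reset_index(drop=True))

    return df_tfidf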

df_tfidf_test = model_tf_idf(data_test['cleaned'], tfidf_model, data_test['kategori'])

In [75]:
tfidf_model

In [76]:
df_tfidf_test.head()

Unnamed: 0,Kategori Berita,abad,abbas,abdurrahman,abstain,ac,acara,acceptance,aceh,aces,...,yuan,yuannya,yulaswati,yuliono,yupiter,zenix,ziswaf,zona,zulhas,zulkifli
0,News,0.0,0.0,0.0,0.0,0.0,0.04534,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Market,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Market,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Market,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Market,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052748,0.0,0.0


#### Save Dataset

In [77]:
df_tfidf.to_csv("data_train_vsm.csv", index=False)
df_tfidf_test.to_csv("data_test_vsm.csv", index=False)

In [78]:
import pickle

In [79]:
with open('tfidf_model.pkl', 'wb') as f:
    pickle.dump(tfidf_model, f)

In [80]:
X = df_tfidf.drop('Kategori Berita', axis=1)
y = df_tfidf['Kategori Berita']

In [81]:
X.shape

(80, 3052)

In [82]:
x_test = df_tfidf_test.drop('Kategori Berita', axis=1)
y_test = df_tfidf_test['Kategori Berita']

In [83]:
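# The test set must be transformed with the vectorizer fitted on the training
# data, which df_tfidf_test already is. Refitting on the test texts would build
# a different vocabulary and overwrite tfidf_model, so reuse the existing frame.
df_test = df_tfidf_test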

In [84]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
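As a rough sketch of what a k-nearest-neighbors classifier does at prediction time: it measures the distance from a new vector to every training vector and takes a majority vote among the closest ones. The helper `knn_predict` and the two-dimensional toy vectors below are hypothetical; the distance is Euclidean, which matches the default metric of `KNeighborsClassifier`, while tie-breaking and the internal search structure may differ.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to the query."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-dimensional "document vectors" (made up for illustration)
train_X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_y = ["Market", "Market", "News", "News"]

print(knn_predict(train_X, train_y, [0.8, 0.2]))
print(knn_predict(train_X, train_y, [0.2, 0.8]))
```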

In [85]:
y_pred = knn.predict(x_test)

In [86]:
y_pred

array(['News', 'Market', 'Market', 'Market', 'Market', 'News', 'News',
       'News', 'Market', 'News', 'News', 'Market', 'News', 'Market',
       'Market', 'Market', 'Market', 'News', 'Market', 'Market'],
      dtype=object)

In [87]:
y_test


Unnamed: 0,Kategori Berita
0,News
1,Market
2,Market
3,Market
4,Market
5,News
6,News
7,News
8,Market
9,News


In [88]:
# Combine y_test and y_pred side by side for comparison
y_pred_df = pd.DataFrame({'y_pred': y_pred})

# Align the indexes before concatenating column-wise
result_df = pd.concat([y_test.reset_index(drop=True), y_pred_df], axis=1)

result_df

Unnamed: 0,Kategori Berita,y_pred
0,News,News
1,Market,Market
2,Market,Market
3,Market,Market
4,Market,Market
5,News,News
6,News,News
7,News,News
8,Market,Market
9,News,News


In [89]:
# Evaluate the predictions: accuracy, per-class report, and confusion matrix
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

print(classification_report(y_test, y_pred))

print(confusion_matrix(y_test, y_pred))

Accuracy: 0.9
              precision    recall  f1-score   support

      Market       0.83      1.00      0.91        10
        News       1.00      0.80      0.89        10

    accuracy                           0.90        20
   macro avg       0.92      0.90      0.90        20
weighted avg       0.92      0.90      0.90        20

[[10  0]
 [ 2  8]]
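The figures in the report can be re-derived from the confusion matrix alone. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, both in sorted order (Market, then News); the variable names are illustrative.

```python
labels = ["Market", "News"]
# Rows are true labels, columns are predicted labels
cm = [[10, 0],
      [2, 8]]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

metrics = {}
for i, label in enumerate(labels):
    predicted_as = sum(cm[r][i] for r in range(len(labels)))  # column sum
    actual = sum(cm[i])                                       # row sum
    metrics[label] = {
        "precision": cm[i][i] / predicted_as,
        "recall": cm[i][i] / actual,
    }

print(f"accuracy: {accuracy:.2f}")
for label, m in metrics.items():
    print(f"{label}: precision={m['precision']:.2f} recall={m['recall']:.2f}")
```

Read this way, the two errors are both News articles predicted as Market, which is why Market precision drops to 10/12 and News recall to 8/10.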


In [90]:
# Report the accuracy on the test set
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9


In [91]:
# Save the trained KNN model
import pickle
filename = 'knn_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(knn, f)

# Load the saved model (example)
# with open(filename, 'rb') as f:
#     loaded_model = pickle.load(f)
# y_pred_loaded = loaded_model.predict(x_test)

# Optionally, save the predictions and true labels together
result_df.to_csv('knn_predictions.csv', index=False)