# FINAL PROJECT

## IMPLEMENTATION

[Link Streamlit](https://ppw-uas.streamlit.app/)

## Import Library

In [1]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.svm import SVC

import pandas as pd
import warnings
import joblib
import nltk
import re

nltk.download('stopwords')
nltk.download('punkt')
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.





The imports above bring in the libraries needed for text analysis with Naive Bayes, including LDA for dimensionality reduction and TF-IDF for text feature extraction. A brief explanation of the main imports:
- `from sklearn.naive_bayes import GaussianNB`: imports `GaussianNB` from Scikit-Learn, a Naive Bayes implementation that models each feature with a Gaussian distribution. It is one of several Naive Bayes variants that can be used for classification.

- `from sklearn.decomposition import LatentDirichletAllocation`: imports `LatentDirichletAllocation` from Scikit-Learn. LDA is a topic-modeling algorithm; in text analysis it can also serve as dimensionality reduction, because each document is represented by its topic proportions.

- `from sklearn.model_selection import train_test_split`: `train_test_split` divides the dataset into training and test subsets, a common step in modeling and evaluation.

- `from sklearn.metrics import accuracy_score, classification_report, confusion_matrix`: imports the functions used to evaluate the classifier. `accuracy_score` measures accuracy, `classification_report` summarizes several evaluation metrics, and `confusion_matrix` breaks the predictions down by true and predicted class.

- `from sklearn.feature_extraction.text import TfidfVectorizer`: `TfidfVectorizer` converts text into TF-IDF (term frequency-inverse document frequency) vectors, which are used as features for the classifier.

- `import joblib`: `joblib` is used to save and load trained models.

- `import pandas as pd`: `pandas` is used for data manipulation, including reading data into a DataFrame and saving it.

- `import warnings`: `warnings` manages the warnings raised during execution; here `warnings.filterwarnings('ignore')` suppresses them.

- `import nltk`, `word_tokenize`, `stopwords`, and `import re`: used in the preprocessing steps below for tokenizing, stopword removal, and regex-based cleaning.

## Load Dataset

The code below uses `pandas` to read a CSV file from the given location and then displays the resulting DataFrame. This is useful for loading and exploring data from an existing CSV file.

A brief explanation:

- `df = pd.read_csv('/content/drive/MyDrive/ppw/tugas/3kategori.csv')`: reads the CSV file at the given path and stores the result in the variable `df`, a DataFrame object holding the data from the file.

- `df`: displays the contents of the DataFrame.

After these two commands the data is available in the working environment as a DataFrame and can be inspected, analyzed, or processed further.

In [2]:
df = pd.read_csv('/content/drive/MyDrive/ppw/tugas/3kategori.csv')
df

Unnamed: 0,Judul,Isi,Label
0,Sultan Korban Kabel Menjuntai Akan Lanjutkan K...,"JAKARTA, KOMPAS.com - Sultan Rif'at Alfatih, k...",megapolitan
1,Aktivitas Firli Bahuri Usai Tak Aktif Jadi Ket...,"JAKARTA, KOMPAS.com - Ketua Komisi Pemberantas...",megapolitan
2,Firli Bahuri Tak Hadiri Peringatan Hari Antiko...,"JAKARTA, KOMPAS.com - Ketua Komisi Pemberantas...",megapolitan
3,Pemilik Bengkel di Bogor Diduga Dianiaya Tetan...,"BOGOR, KOMPAS.com - HSN (75), lansia pemilik b...",megapolitan
4,Bertambah 315 Kasus Covid-19 di Jakarta dalam ...,"JAKARTA, KOMPAS.com - Dinas Kesehatan (Dinkes)...",megapolitan
...,...,...,...
2645,Jawaban Anies soal Polusi Udara DKI Jakarta ya...,KOMPAS.com- Debat perdana calon presiden (capr...,tren
2646,"Tengah Mewabah, Ini 5 Bahan Alami untuk Membas...",KOMPAS.com - Wabah kutu busuk atau kutu kasur ...,tren
2647,"Driver Gojek Bisa Nilai Penumpang, Ini Risiko ...",KOMPAS.com - Sebuah unggahan yang menyatakan p...,tren
2648,"Meteran Listrik Diganti AMI, Apakah Biaya List...",KOMPAS.co - PT PLN (Persero) memastikan akan m...,tren


## Checking for NULL Data

`df.isnull().sum()` is a common idiom in pandas, a popular Python library for data manipulation. `df.isnull()` returns a boolean DataFrame in which each missing value is marked `True`, and `.sum()` then adds up those flags column by column.

So `df.isnull().sum()` returns a Series in which each entry is the number of null values in the corresponding DataFrame column.
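As a minimal illustration of what this call returns, the sketch below builds a small hypothetical DataFrame with a few missing entries and counts them per column. The column names mirror the dataset, but the values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up, not taken from the dataset
toy = pd.DataFrame({
    "Judul": ["a", None, "c"],
    "Isi": ["x", "y", "z"],
    "Label": ["tren", np.nan, None],
})

mask = toy.isnull()        # boolean DataFrame: True where a value is missing
null_counts = mask.sum()   # column-wise count of the True flags
print(null_counts)
```

A column with a count of 0, as in the real output below, has no missing values.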

In [3]:
df.isnull().sum()

Judul    0
Isi      0
Label    0
dtype: int64

## Cleaning

In [5]:
def cleaning(text):
  text = re.sub(r'[^a-zA-Z\s]', '', text).strip()
  return text

df['Cleaning'] = df['Isi'].apply(cleaning)
df['Cleaning']

0       JAKARTA KOMPAScom  Sultan Rifat Alfatih korban...
1       JAKARTA KOMPAScom  Ketua Komisi Pemberantasan ...
2       JAKARTA KOMPAScom  Ketua Komisi Pemberantasan ...
3       BOGOR KOMPAScom  HSN  lansia pemilik bengkel d...
4       JAKARTA KOMPAScom  Dinas Kesehatan Dinkes DKI ...
                              ...                        
2645    KOMPAScom Debat perdana calon presiden capres ...
2646    KOMPAScom  Wabah kutu busuk atau kutu kasur me...
2647    KOMPAScom  Sebuah unggahan yang menyatakan pen...
2648    KOMPASco  PT PLN Persero memastikan akan mengg...
2649    KOMPAScom  Menyesap kopi di pagi hari sesaat s...
Name: Cleaning, Length: 2650, dtype: object
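To see what the regular expression in `cleaning` does, the sketch below applies the same pattern to a short made-up sentence. It also includes a hypothetical variant, `cleaning_with_space` (not part of the notebook), that substitutes a space instead of an empty string. This is one possible way to avoid joining words that were separated only by punctuation, which may be how vocabulary entries such as `acarajadi` arise.

```python
import re

def cleaning(text):
    # Same pattern as the notebook: drop everything except letters and whitespace
    return re.sub(r'[^a-zA-Z\s]', '', text).strip()

def cleaning_with_space(text):
    # Hypothetical variant: replace removed characters with a space,
    # then collapse repeated whitespace
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample sentence, not taken from the dataset
sample = "JAKARTA, KOMPAS.com - Acara selesai.Jadi 143 orang pulang."
cleaned = cleaning(sample)
spaced = cleaning_with_space(sample)
print(cleaned)
print(spaced)
```

Note that the variant also splits `KOMPAS.com` into two tokens, so which behavior is preferable depends on the corpus.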

## Tokenizing

In [6]:
def tokenizer(text):
  text = text.lower()
  return word_tokenize(text)

df['Tokenizing'] = df['Cleaning'].apply(tokenizer)
df['Tokenizing']

0       [jakarta, kompascom, sultan, rifat, alfatih, k...
1       [jakarta, kompascom, ketua, komisi, pemberanta...
2       [jakarta, kompascom, ketua, komisi, pemberanta...
3       [bogor, kompascom, hsn, lansia, pemilik, bengk...
4       [jakarta, kompascom, dinas, kesehatan, dinkes,...
                              ...                        
2645    [kompascom, debat, perdana, calon, presiden, c...
2646    [kompascom, wabah, kutu, busuk, atau, kutu, ka...
2647    [kompascom, sebuah, unggahan, yang, menyatakan...
2648    [kompasco, pt, pln, persero, memastikan, akan,...
2649    [kompascom, menyesap, kopi, di, pagi, hari, se...
Name: Tokenizing, Length: 2650, dtype: object

## Stopword Removal

In [7]:
corpus = stopwords.words('indonesian')

def stopwordText(words):
 return [word for word in words if word not in corpus]

df['Stopword Removal'] = df['Tokenizing'].apply(stopwordText)

# Join the tokens back into a single string
df['Full Text'] = df['Stopword Removal'].apply(lambda x: ' '.join(x))
df['Full Text']

0       jakarta kompascom sultan rifat alfatih korban ...
1       jakarta kompascom ketua komisi pemberantasan k...
2       jakarta kompascom ketua komisi pemberantasan k...
3       bogor kompascom hsn lansia pemilik bengkel taj...
4       jakarta kompascom dinas kesehatan dinkes dki j...
                              ...                        
2645    kompascom debat perdana calon presiden capres ...
2646    kompascom wabah kutu busuk kutu kasur menyeran...
2647    kompascom unggahan penumpang ojek online ojol ...
2648    kompasco pt pln persero mengganti meteran list...
2649    kompascom menyesap kopi pagi bangun tidur sala...
Name: Full Text, Length: 2650, dtype: object

## TFIDF

In [9]:
def tfidf(dokumen, category):
  vectorizer = TfidfVectorizer()
  x = vectorizer.fit_transform(dokumen).toarray()
  terms = vectorizer.get_feature_names_out()

  final_tfidf = pd.DataFrame(x, columns=terms)
  final_tfidf.insert(0, 'Dokumen', dokumen)
  final_tfidf.insert(len(final_tfidf.columns),'Label', category)

  return (vectorizer, final_tfidf)

tfidf_vectorizer, final_tfidf = tfidf(df['Full Text'], df['Label'])
final_tfidf

Unnamed: 0,Dokumen,abdominal,abdominalrupanya,abdominalsebagai,abdulrahman,abs,abu,acara,acarajadi,adabaca,...,yuridisselain,yusuf,zambia,zen,zikrillah,zonasigapyear,zoni,zonihal,zurich,Label
0,jakarta kompascom sultan rifat alfatih korban ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,megapolitan
1,jakarta kompascom ketua komisi pemberantasan k...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,megapolitan
2,jakarta kompascom ketua komisi pemberantasan k...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.079198,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,megapolitan
3,bogor kompascom hsn lansia pemilik bengkel taj...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.039295,0.0,0.0,megapolitan
4,jakarta kompascom dinas kesehatan dinkes dki j...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,megapolitan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2645,kompascom debat perdana calon presiden capres ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,tren
2646,kompascom wabah kutu busuk kutu kasur menyeran...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,tren
2647,kompascom unggahan penumpang ojek online ojol ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,tren
2648,kompasco pt pln persero mengganti meteran list...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,tren


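## Declaring X and y by Dropping the Document and Label Columns

This step separates the features (X) from the target (y) so that they can be used for modeling. A brief explanation:

- `final_tfidf.drop(['Dokumen', 'Label'], axis=1)`: drops the `Dokumen` and `Label` columns from `final_tfidf`. The `axis=1` parameter indicates that columns, not rows, are dropped.

- `X`: holds every column of `final_tfidf` except `Dokumen` and `Label`, that is, the TF-IDF weight of each term.

So `X` contains the features from `final_tfidf`, and `y` contains the target taken from the `Label` column of `df`.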
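The same separation can be sketched on a tiny hypothetical frame. The names `toy_tfidf`, `X_toy`, and `y_toy` and all values are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of final_tfidf: text, two term columns, and a label
toy_tfidf = pd.DataFrame({
    "Dokumen": ["doc one", "doc two"],
    "kata": [0.5, 0.0],
    "teks": [0.0, 0.7],
    "Label": ["tren", "megapolitan"],
})

# drop() returns a new DataFrame; toy_tfidf itself is left unchanged
X_toy = toy_tfidf.drop(["Dokumen", "Label"], axis=1)
y_toy = toy_tfidf["Label"]

print(list(X_toy.columns))
print(list(y_toy))
```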

In [10]:
X = final_tfidf.drop(['Dokumen', 'Label'], axis=1)
X

Unnamed: 0,abdominal,abdominalrupanya,abdominalsebagai,abdulrahman,abs,abu,acara,acarajadi,adabaca,adakah,...,yunani,yuridisselain,yusuf,zambia,zen,zikrillah,zonasigapyear,zoni,zonihal,zurich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.079198,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.039295,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2645,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
2646,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
2647,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
2648,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0


In [11]:
y = df['Label']
y

0       megapolitan
1       megapolitan
2       megapolitan
3       megapolitan
4       megapolitan
           ...     
2645           tren
2646           tren
2647           tren
2648           tren
2649           tren
Name: Label, Length: 2650, dtype: object

## Splitting Data

- `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)`: splits the data into four subsets: `X_train` (training features), `X_test` (test features), `y_train` (training target), and `y_test` (test target).

    - `X` is the DataFrame containing the features.
    - `y` is the Series containing the target, or label.
    - `test_size=0.3` means that 30% of the data is used as test data and the remaining 70% as training data.
    - `random_state=42` sets the seed so that the split is reproducible and gives consistent results. You can replace it with another value if needed.
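A minimal sketch of the effect of `test_size` and `random_state`, using a hypothetical array of 10 samples rather than the real features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)
print(np.array_equal(X_te, X_te2))
```

On the 2,650 rows used here, the same setting yields 795 test rows, which matches the total `support` in the classification reports further down.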

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## **Modeling With LDA**

In [13]:
def find_best_lda(Xtrain, Xtest, n_components, alpha, beta):
  # Note: y_train and y_test are read from the notebook's global scope
  looping = 1
  best = {'k' : 0, 'alpha' : 0, 'beta' : 0, 'accuracy' : 0, 'model': '', 'lda' : '', 'lda_Xtrain' : '', 'lda_Xtest' : ''}
  history = pd.DataFrame(columns=["Pengujian Ke", "K", "Alpha", "Beta", "Accuracy"])

  # Try every combination of topic count (k), alpha, and beta
  for k in n_components:
    for a in alpha:
      for b in beta:
        lda = LatentDirichletAllocation(n_components=k, doc_topic_prior=a, topic_word_prior=b)
        lda_Xtrain = lda.fit_transform(Xtrain)
        lda_Xtest = lda.transform(Xtest)

        # Create the Naive Bayes model
        model = GaussianNB()

        # Train the model on the training data
        model.fit(lda_Xtrain, y_train)

        # Predict on the test data
        y_pred = model.predict(lda_Xtest)

        # Compute the accuracy
        accuracy = accuracy_score(y_test, y_pred)
        print(f"Jumlah Topik: {k}, Alpha: {a}, Beta: {b}, Accuracy: {accuracy}")

        # Keep the configuration with the highest accuracy so far
        if accuracy > best['accuracy']:
          best['accuracy'] = accuracy
          best['k'] = k
          best['alpha'] = a
          best['beta'] = b
          best['model'] = model
          best['lda'] = lda
          best['lda_Xtrain'] = lda_Xtrain
          best['lda_Xtest'] = lda_Xtest

        # Record this trial in the history table
        history.loc[len(history)] = [f"Pengujian Ke- {looping}", k, a, b, accuracy]
        looping += 1

  return (best, history)

k = [3, 4, 5]
alpha = [0.3, 0.4]
beta = [0.1, 0.2]
best_param, history = find_best_lda(X_train, X_test, k, alpha, beta)

Jumlah Topik: 3, Alpha: 0.3, Beta: 0.1, Accuracy: 0.7886792452830189
Jumlah Topik: 3, Alpha: 0.3, Beta: 0.2, Accuracy: 0.6968553459119496
Jumlah Topik: 3, Alpha: 0.4, Beta: 0.1, Accuracy: 0.5974842767295597
Jumlah Topik: 3, Alpha: 0.4, Beta: 0.2, Accuracy: 0.650314465408805
Jumlah Topik: 4, Alpha: 0.3, Beta: 0.1, Accuracy: 0.5911949685534591
Jumlah Topik: 4, Alpha: 0.3, Beta: 0.2, Accuracy: 0.6792452830188679
Jumlah Topik: 4, Alpha: 0.4, Beta: 0.1, Accuracy: 0.5132075471698113
Jumlah Topik: 4, Alpha: 0.4, Beta: 0.2, Accuracy: 0.6691823899371069
Jumlah Topik: 5, Alpha: 0.3, Beta: 0.1, Accuracy: 0.6465408805031446
Jumlah Topik: 5, Alpha: 0.3, Beta: 0.2, Accuracy: 0.6389937106918239
Jumlah Topik: 5, Alpha: 0.4, Beta: 0.1, Accuracy: 0.6364779874213836
Jumlah Topik: 5, Alpha: 0.4, Beta: 0.2, Accuracy: 0.589937106918239


In [14]:
best_param

{'k': 3,
 'alpha': 0.3,
 'beta': 0.1,
 'accuracy': 0.7886792452830189,
 'model': GaussianNB(),
 'lda': LatentDirichletAllocation(doc_topic_prior=0.3, n_components=3,
                           topic_word_prior=0.1),
 'lda_Xtrain': array([[0.93979932, 0.03073772, 0.02946296],
        [0.9287801 , 0.03534205, 0.03587785],
        [0.03972918, 0.03921265, 0.92105817],
        ...,
        [0.92280624, 0.03862461, 0.03856915],
        [0.03386824, 0.93175616, 0.0343756 ],
        [0.94202508, 0.02887316, 0.02910177]]),
 'lda_Xtest': array([[0.03805526, 0.92401887, 0.03792587],
        [0.91439354, 0.04199023, 0.04361623],
        [0.04211858, 0.04293847, 0.91494295],
        ...,
        [0.93979932, 0.03073772, 0.02946296],
        [0.02713094, 0.02726526, 0.9456038 ],
        [0.04211858, 0.04293847, 0.91494295]])}

In [15]:
history

Unnamed: 0,Pengujian Ke,K,Alpha,Beta,Accuracy
0,Pengujian Ke- 1,3,0.3,0.1,0.788679
1,Pengujian Ke- 2,3,0.3,0.2,0.696855
2,Pengujian Ke- 3,3,0.4,0.1,0.597484
3,Pengujian Ke- 4,3,0.4,0.2,0.650314
4,Pengujian Ke- 5,4,0.3,0.1,0.591195
5,Pengujian Ke- 6,4,0.3,0.2,0.679245
6,Pengujian Ke- 7,4,0.4,0.1,0.513208
7,Pengujian Ke- 8,4,0.4,0.2,0.669182
8,Pengujian Ke- 9,5,0.3,0.1,0.646541
9,Pengujian Ke- 10,5,0.3,0.2,0.638994


In [16]:
history.to_csv("history.csv", index=False)
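The best configuration can also be read directly from the recorded results instead of scanning the log by eye. The sketch below rebuilds the twelve (K, Alpha, Beta, Accuracy) rows printed above, with accuracies rounded to six decimals, into a DataFrame named `results` (a name chosen for this example) and picks the row with the highest accuracy.

```python
import pandas as pd

# The twelve trials from the printed log, accuracies rounded to six decimals
rows = [
    (3, 0.3, 0.1, 0.788679), (3, 0.3, 0.2, 0.696855),
    (3, 0.4, 0.1, 0.597484), (3, 0.4, 0.2, 0.650314),
    (4, 0.3, 0.1, 0.591195), (4, 0.3, 0.2, 0.679245),
    (4, 0.4, 0.1, 0.513208), (4, 0.4, 0.2, 0.669182),
    (5, 0.3, 0.1, 0.646541), (5, 0.3, 0.2, 0.638994),
    (5, 0.4, 0.1, 0.636478), (5, 0.4, 0.2, 0.589937),
]
results = pd.DataFrame(rows, columns=["K", "Alpha", "Beta", "Accuracy"])

# idxmax() returns the index label of the highest accuracy
best_row = results.loc[results["Accuracy"].idxmax()]
print(best_row)
```

The selected row agrees with the contents of `best_param`: K = 3, Alpha = 0.3, Beta = 0.1.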

### Declaring K, Alpha, and Beta

### LDA

LDA is applied to the training data, producing a matrix called `lda_x_train`. This matrix holds the document-topic representation of the training data, that is, the proportion of each topic in each training document.

The LDA model fitted on the training data is then used to transform the test data into the same document-topic representation, called `lda_x_test`. This shows how the test documents relate to the topics identified from the training data.

Converting the text into topic proportions makes this information usable for further analysis or for training a classifier on far fewer features than the raw TF-IDF matrix.
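The fit-on-train, transform-on-test pattern described here can be sketched on a hypothetical toy corpus. The documents, the variable names, and `random_state=0` are assumptions made for the example; unlike the notebook, which vectorizes the full dataset before splitting, the sketch fits the vectorizer on the toy training documents only.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, not taken from the dataset
train_docs = [
    "harga beras naik di pasar",
    "pasar saham turun hari ini",
    "tim sepak bola menang besar",
    "pemain bola cedera saat latihan",
]
test_docs = ["harga saham naik", "tim bola latihan"]

vectorizer = TfidfVectorizer()
tfidf_train = vectorizer.fit_transform(train_docs)  # learn the vocabulary
tfidf_test = vectorizer.transform(test_docs)        # reuse the same vocabulary

lda_toy = LatentDirichletAllocation(
    n_components=2, doc_topic_prior=0.3, topic_word_prior=0.1, random_state=0
)
topics_train = lda_toy.fit_transform(tfidf_train)   # fit topics, then transform
topics_test = lda_toy.transform(tfidf_test)         # apply the fitted topics

print(topics_train.shape, topics_test.shape)
print(topics_test.sum(axis=1))  # each row is a topic distribution
```

Each row of the result has one value per topic, so the number of features drops from the vocabulary size to `n_components`.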

In [17]:
lda = best_param['lda']
lda_x_train = best_param['lda_Xtrain']
lda_x_test = best_param['lda_Xtest']

### Dimensionality Reduction Output

In [19]:
topik_columns = [f"Topik {i}" for i in range(1, best_param['k']+1)]
dokumen = final_tfidf['Dokumen']
output_proporsi_TD = pd.DataFrame(lda_x_train, columns=topik_columns)
output_proporsi_TD.insert(0,'Dokumen', dokumen)
output_proporsi_TD.insert(len(output_proporsi_TD.columns),'Label', final_tfidf['Label'])
output_proporsi_TD

Unnamed: 0,Dokumen,Topik 1,Topik 2,Topik 3,Label
0,jakarta kompascom sultan rifat alfatih korban ...,0.939799,0.030738,0.029463,megapolitan
1,jakarta kompascom ketua komisi pemberantasan k...,0.928780,0.035342,0.035878,megapolitan
2,jakarta kompascom ketua komisi pemberantasan k...,0.039729,0.039213,0.921058,megapolitan
3,bogor kompascom hsn lansia pemilik bengkel taj...,0.914394,0.041990,0.043616,megapolitan
4,jakarta kompascom dinas kesehatan dinkes dki j...,0.032659,0.933845,0.033495,megapolitan
...,...,...,...,...,...
1850,kompascom calon presiden capres nomor urut men...,0.026095,0.946559,0.027346,tren
1851,kompascom pria jepang berusia sengaja meledakk...,0.917000,0.040727,0.042273,tren
1852,kompascom kucing salah hewan diselimuti sejara...,0.922806,0.038625,0.038569,tren
1853,kompascom kementerian pendidikan kebudayaan ri...,0.033868,0.931756,0.034376,tren
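One thing worth checking in the cell above: `lda_x_train` follows the shuffled row order of `X_train`, while `final_tfidf['Dokumen']` and `final_tfidf['Label']` are in the original order, so inserting them by position may pair topic proportions with the wrong document. The sketch below, on hypothetical toy data, shows one way to look up the matching rows through `X_train.index`. The names `toy`, `reduced`, and `aligned` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: the feature value equals the document number
toy = pd.DataFrame({
    "Dokumen": [f"doc {i}" for i in range(10)],
    "feature": np.arange(10, dtype=float),
    "Label": ["a", "b"] * 5,
})
X_toy = toy[["feature"]]
y_toy = toy["Label"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

# Stand-in for the LDA output: a plain array in the same row order as X_tr
reduced = X_tr.to_numpy() * 2

aligned = pd.DataFrame(reduced, columns=["Topik 1"])
# Look up text and label through the original index of each training row
aligned.insert(0, "Dokumen", toy.loc[X_tr.index, "Dokumen"].to_numpy())
aligned["Label"] = y_tr.to_numpy()
print(aligned)
```

In the toy output every `Topik 1` value is twice the document number shown next to it, which confirms that the rows line up.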


### Save the Dimensionality Reduction Result

In [20]:
output_proporsi_TD.to_csv('reduksi dimensi.csv', index=False)

### Word Proportions per Topic

In [21]:
# Word distribution for each topic
distribusi_kata_topik = pd.DataFrame(lda.components_)
distribusi_kata_topik

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,3775,3776,3777,3778,3779,3780,3781,3782,3783,3784
0,6.934566,1.466913,1.466913,2.090392,1.484881,0.1,0.1,2.792707,4.221731,1.511237,...,1.497145,0.1,0.1,1.554227,0.1,0.1,2.433385,0.283921,0.1,4.462682
1,0.1,0.1,0.1,0.1,0.1,0.1,0.100001,0.10004,0.1,0.1,...,0.1,0.1,2.866828,0.1,0.1,1.743133,0.1,47.554841,2.119711,0.1
2,0.1,0.1,0.1,0.1,0.1,2.07767,2.628249,0.1,0.1,0.1,...,0.1,1.069578,0.1,0.1,0.635887,0.1,0.1,0.1,0.1,0.1


### Model Naive Bayes With LDA

In [22]:
# Retrieve the best Naive Bayes model found during the search
model = best_param['model']

# Predict on the test data
y_pred = model.predict(lda_x_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Akurasi:", accuracy)

# Show the classification report
print("Laporan Klasifikasi:")
print(classification_report(y_test, y_pred))

# Show the confusion matrix
confusion = confusion_matrix(y_test, y_pred)
print("Confusion Matriks:")
print(confusion)

Akurasi: 0.7886792452830189
Laporan Klasifikasi:
              precision    recall  f1-score   support

 megapolitan       0.63      0.73      0.68       241
        tren       0.88      0.81      0.84       554

    accuracy                           0.79       795
   macro avg       0.75      0.77      0.76       795
weighted avg       0.80      0.79      0.79       795

Confusion Matriks:
[[177  64]
 [104 450]]
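The figures in the classification report can be recomputed from the confusion matrix alone. The sketch below does this with NumPy, assuming scikit-learn's layout of rows as true classes and columns as predicted classes, in the order `megapolitan`, `tren`.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[177, 64],
               [104, 450]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)

print("accuracy :", accuracy)
print("precision:", precision.round(2))
print("recall   :", recall.round(2))
print("f1-score :", f1.round(2))
```

For example, precision for `megapolitan` is 177 / (177 + 104), about 0.63, and its recall is 177 / (177 + 64), about 0.73, matching the report.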


### Predict

In [30]:
data = ["Badan Pengawas Obat dan Makanan (BPOM) menemukan 143 obat tradisional dan suplemen kesehatan yang mengandung bahan kimia obat (BKO)."]
a = tfidf_vectorizer.transform(data).toarray()
b = lda.transform(a)
model.predict(b)

array(['tren'], dtype='<U11')

### Save Model

In [31]:
joblib.dump(lda, "lda.pkl")
joblib.dump(model, "naive bayes.pkl")

['naive bayes.pkl']
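A saved model can later be restored with `joblib.load`. The sketch below shows a round trip on a hypothetical toy classifier written to a temporary directory; the file name `toy_model.pkl` and the data are assumptions for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy topic proportions and labels
X_toy = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y_toy = np.array(["megapolitan", "megapolitan", "tren", "tren"])
clf = GaussianNB().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")  # hypothetical file name
    joblib.dump(clf, path)                     # write the fitted model to disk
    restored = joblib.load(path)               # read it back

# The restored model should give the same predictions as the original
same = np.array_equal(clf.predict(X_toy), restored.predict(X_toy))
print(same)
```

For the LDA pipeline, prediction on new text needs the vectorizer, the LDA model, and the classifier applied in that order, as in the Predict cell, so all three saved files have to be loaded together.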

## Modeling Without LDA

### Training the Model on the Original Dataset

In [32]:
# Create the Naive Bayes model
nb = GaussianNB()

# Train the model on the training data
nb.fit(X_train, y_train)

# Predict on the test data
y_pred_nb = nb.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred_nb)
print("Akurasi:", accuracy)

# Show the classification report
print("Laporan Klasifikasi:")
print(classification_report(y_test, y_pred_nb))

# Show the confusion matrix
confusion = confusion_matrix(y_test, y_pred_nb)
print("Confusion Matriks:")
print(confusion)

Akurasi: 1.0
Laporan Klasifikasi:
              precision    recall  f1-score   support

 megapolitan       1.00      1.00      1.00       241
        tren       1.00      1.00      1.00       554

    accuracy                           1.00       795
   macro avg       1.00      1.00      1.00       795
weighted avg       1.00      1.00      1.00       795

Confusion Matriks:
[[241   0]
 [  0 554]]


### Predict

In [33]:
data = ["Badan Pengawas Obat dan Makanan (BPOM) menemukan 143 obat tradisional dan suplemen kesehatan yang mengandung bahan kimia obat (BKO)."]
tfidf_matrix = tfidf_vectorizer.transform(data).toarray()
nb.predict(tfidf_matrix)

array(['tren'], dtype='<U11')

### Save Model

In [34]:
joblib.dump(nb, "Naive Bayes (Asli).pkl")

['Naive Bayes (Asli).pkl']

## Save Vectorizer

In [35]:
joblib.dump(tfidf_vectorizer, 'vectorizer.pkl')

['vectorizer.pkl']