# **Final Project for Web Search and Mining (News Classification with Latent Dirichlet Allocation (LDA) Dimensionality Reduction Using the Naive Bayes Method)**

## Deployment

[Streamlit app](https://ppw-uas.streamlit.app/)

## Import Library

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.svm import SVC

import pandas as pd
import warnings
import joblib
import nltk
import re

nltk.download('stopwords')
nltk.download('punkt')
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


The code above imports the libraries and modules needed to classify text with Naive Bayes, using Latent Dirichlet Allocation (LDA) for dimensionality reduction and TF-IDF for text feature extraction. A brief explanation of each import:

- `from sklearn.decomposition import LatentDirichletAllocation`: imports the `LatentDirichletAllocation` class from scikit-learn. LDA is a topic-modeling algorithm; here its per-document topic proportions serve as a lower-dimensional representation of the text.

- `from sklearn.model_selection import train_test_split`: `train_test_split` splits a dataset into training and testing subsets, a standard step in modeling and evaluation.

- `from sklearn.naive_bayes import GaussianNB`: imports `GaussianNB`, the Naive Bayes implementation that assumes Gaussian-distributed features. It is one of several Naive Bayes variants available for classification.

- `from sklearn.metrics import accuracy_score, classification_report, confusion_matrix`: imports functions for evaluating a classifier. `accuracy_score` measures accuracy, `classification_report` summarizes precision, recall, and F1-score per class, and `confusion_matrix` shows how predictions are distributed across the true classes.

- `from sklearn.feature_extraction.text import TfidfVectorizer`: `TfidfVectorizer` converts text into TF-IDF (term frequency-inverse document frequency) vectors, which serve as features for the classifier.

- `import joblib`: imports `joblib`, used to save and load trained models.

- `import pandas as pd`: `pandas` is used for data manipulation, including reading data into and saving data from a DataFrame.

- `import warnings`: the `warnings` module manages warnings raised during execution; here `warnings.filterwarnings('ignore')` suppresses them.

## Load Dataset (transform to Term Frequency)

The code below uses `pandas` to read a CSV file from the given path and then displays the resulting DataFrame. This is useful for loading and exploring data that already exists as a CSV file.

A brief explanation:

- `df = pd.read_csv('/content/drive/MyDrive/Kuliah /Semester 7/Pencarian Dan Penambangan Web/ppw/Dataset/Data Berita CNBC.csv')`: reads the CSV file at the given full path with `pandas` and stores the result in the variable `df`, a DataFrame holding the contents of the file.

- `df`: evaluating `df` as the last expression of the cell displays the DataFrame.

After this cell runs, the data from the CSV file is available as a DataFrame that can be inspected, analyzed, or transformed in later steps.
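As a minimal sketch of the same loading step without access to the Google Drive path, the snippet below reads a tiny in-memory CSV. The two rows and the variable names `csv_text` and `sample_df` are made up for illustration; only the column names mirror part of the real dataset.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for the real dataset file.
csv_text = """Judul,Artikel,Category
Contoh judul satu,Isi artikel pertama,Tech
Contoh judul dua,Isi artikel kedua,News
"""

# pd.read_csv accepts a path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)          # (number of rows, number of columns)
print(list(sample_df.columns))  # column names taken from the header row
```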

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Kuliah /Semester 7/Pencarian Dan Penambangan Web/ppw/Dataset/Data Berita CNBC.csv')
df

Unnamed: 0,Url,Judul,Tanggal,Author,Artikel,Category
0,https://www.cnbcindonesia.com/tech/20231106180...,Astronom Yakin Alien Sudah Tahu di Bumi Ada Ma...,06 November 2023 19:50,"Intan Rakhmayanti Dewi, CNBC Indonesia","Jakarta, CNBC Indonesia - Astronom yakin alien...",Tech
1,https://www.cnbcindonesia.com/tech/20231106175...,Scarlett Johansson Ngamuk Wajah dan Suaranya D...,06 November 2023 19:00,"Intan Rakhmayanti Dewi, CNBC Indonesia","Jakarta, CNBC Indonesia - Aktrisnominasi Oscar...",Tech
2,https://www.cnbcindonesia.com/tech/20231106164...,"Dulu Bikin Ghozali Kaya Raya, Startup Ini Peca...",06 November 2023 18:20,"Novina Putri Bestari, CNBC Indonesia","Jakarta, CNBC Indonesia - Masih ingat Ghozali ...",Tech
3,https://www.cnbcindonesia.com/tech/20231106162...,"Jeff Bezos Pindah ke Rumah Rp 2,3 T Demi Pacar...",06 November 2023 17:40,"Novina Putri Bestari, CNBC Indonesia","Jakarta, CNBC Indonesia - Jeff Bezos memutuska...",Tech
4,https://www.cnbcindonesia.com/tech/20231106172...,BRIBRAIN Sukses Raih Penghargaan Ini di IDC Aw...,06 November 2023 17:33,"Khoirul Anam, CNBC Indonesia","Jakarta, CNBC Indonesia- PT Bank Rakyat Indone...",Tech
...,...,...,...,...,...,...
1495,https://www.cnbcindonesia.com/entrepreneur/202...,"Bisnis Kopi Meluas, Kompetisi Barista dan Latt...",03 April 2023 12:41,"Zefanya Aprilia, CNBC Indonesia","Jakarta, CNBCIndonesia -Roadshow Bezzera Latte...",Entrepreneur
1496,https://www.cnbcindonesia.com/entrepreneur/202...,"Rahusna, Alumni ITB Penemu Mesin Pemilah Sampa...",03 April 2023 12:33,"Mentari Puspadini, CNBC Indonesia","Jakarta, CNBC Indonesia - Hingga saat ini, per...",Entrepreneur
1497,https://www.cnbcindonesia.com/entrepreneur/202...,Pameran Barang Mewah Terbesar akan Hadir di RI...,31 March 2023 12:45,"Zefanya Aprilia, CNBC Indonesia","Jakarta, CNBCIndonesia -Indonesia akan menyele...",Entrepreneur
1498,https://www.cnbcindonesia.com/entrepreneur/202...,SETC & INOTEK Kerja Sama Gelar Pelatihan UMKM ...,31 March 2023 09:30,"Teti Purwanti, CNBC Indonesia","Jakarta, CNBC Indonesia - Sampoerna Entreprene...",Entrepreneur


## Check for NULL Data

`df.isnull().sum()` counts the missing values in the DataFrame `df`. The result is a Series giving the number of missing values in each column.

In [None]:
df.isnull().sum()

Url         0
Judul       0
Tanggal     0
Author      0
Artikel     0
Category    0
dtype: int64
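Because the real dataset has no missing values, the effect of the check is easier to see on a toy DataFrame. The sketch below uses made-up data (`toy` is a hypothetical name) with a few deliberately missing cells.

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing cells (None and NaN both count as missing).
toy = pd.DataFrame({
    "Judul": ["a", "b", None],
    "Artikel": ["x", np.nan, np.nan],
    "Category": ["Tech", "News", "Tech"],
})

# isnull() marks each missing cell as True; sum() counts the True values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Columns with a non-zero count would need to be filled or dropped before the text-processing steps that follow.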

## Cleaning

In [None]:
def cleaning(text):
  text = re.sub(r'[^a-zA-Z\s]', '', text).strip()
  return text

df['Cleaning'] = df['Artikel'].apply(cleaning)
df['Cleaning']

0       Jakarta CNBC Indonesia  Astronom yakin alien s...
1       Jakarta CNBC Indonesia  Aktrisnominasi Oscar S...
2       Jakarta CNBC Indonesia  Masih ingat Ghozali ya...
3       Jakarta CNBC Indonesia  Jeff Bezos memutuskan ...
4       Jakarta CNBC Indonesia PT Bank Rakyat Indonesi...
                              ...                        
1495    Jakarta CNBCIndonesia Roadshow Bezzera Latte A...
1496    Jakarta CNBC Indonesia  Hingga saat ini permas...
1497    Jakarta CNBCIndonesia Indonesia akan menyeleng...
1498    Jakarta CNBC Indonesia  Sampoerna Entrepreneur...
1499    Jakarta CNBC Indonesia  Masalah postur tubuh d...
Name: Cleaning, Length: 1500, dtype: object
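To make the effect of the regular expression concrete, here is a small sketch that applies the same pattern to two made-up strings. The sample sentences are illustrative and not taken from the dataset.

```python
import re

def cleaning(text):
    # Same pattern as the notebook: drop everything except ASCII letters and whitespace.
    return re.sub(r'[^a-zA-Z\s]', '', text).strip()

sample = "Jakarta, CNBC Indonesia - Harga naik 5% pada 2023!"
cleaned = cleaning(sample)
print(repr(cleaned))

# Removed characters are not replaced by a space, so hyphenated words are fused.
fused = cleaning("e-commerce")
print(fused)
```

Digits and punctuation disappear while the whitespace around them stays, which is why double spaces appear in the `Cleaning` column above.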

## Tokenizing

In [None]:
def tokenizer(text):
  text = text.lower()
  return word_tokenize(text)

df['Tokenizing'] = df['Cleaning'].apply(tokenizer)
df['Tokenizing']

0       [jakarta, cnbc, indonesia, astronom, yakin, al...
1       [jakarta, cnbc, indonesia, aktrisnominasi, osc...
2       [jakarta, cnbc, indonesia, masih, ingat, ghoza...
3       [jakarta, cnbc, indonesia, jeff, bezos, memutu...
4       [jakarta, cnbc, indonesia, pt, bank, rakyat, i...
                              ...                        
1495    [jakarta, cnbcindonesia, roadshow, bezzera, la...
1496    [jakarta, cnbc, indonesia, hingga, saat, ini, ...
1497    [jakarta, cnbcindonesia, indonesia, akan, meny...
1498    [jakarta, cnbc, indonesia, sampoerna, entrepre...
1499    [jakarta, cnbc, indonesia, masalah, postur, tu...
Name: Tokenizing, Length: 1500, dtype: object

## Stopword Removal

In [None]:
corpus = stopwords.words('indonesian')

def stopwordText(words):
 return [word for word in words if word not in corpus]

df['Stopword Removal'] = df['Tokenizing'].apply(stopwordText)

# Join the tokens back into a single string per document
df['Full Text'] = df['Stopword Removal'].apply(lambda x: ' '.join(x))
df['Full Text']

0       jakarta cnbc indonesia astronom alien bumi keh...
1       jakarta cnbc indonesia aktrisnominasi oscar sc...
2       jakarta cnbc indonesia ghozali mendadak kaya n...
3       jakarta cnbc indonesia jeff bezos memutuskan p...
4       jakarta cnbc indonesia pt bank rakyat indonesi...
                              ...                        
1495    jakarta cnbcindonesia roadshow bezzera latte a...
1496    jakarta cnbc indonesia permasalahan sampah pol...
1497    jakarta cnbcindonesia indonesia menyelenggarak...
1498    jakarta cnbc indonesia sampoerna entrepreneurs...
1499    jakarta cnbc indonesia postur tubuh tulang pin...
Name: Full Text, Length: 1500, dtype: object
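The filtering and re-joining steps can be sketched without downloading the NLTK corpus. The stopword set below is a tiny hypothetical stand-in for `stopwords.words('indonesian')`, and the token list is made up.

```python
# Hypothetical mini stopword list; the notebook uses NLTK's full Indonesian list.
stop_words = {"yang", "di", "dan", "ini", "itu"}

tokens = ["harga", "saham", "yang", "naik", "di", "bursa", "ini"]

# Keep only the tokens that are not stopwords.
filtered = [token for token in tokens if token not in stop_words]

# Join the remaining tokens back into one string, as done for the 'Full Text' column.
full_text = " ".join(filtered)
print(full_text)
```

The sketch stores the stopwords in a `set`, which makes each membership test constant-time on average; converting the notebook's `corpus` list with `set(corpus)` would be one possible way to speed up the filtering on 1,500 articles.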

## TFIDF

In [None]:
def tfidf(dokumen, category):
  vectorizer = TfidfVectorizer()
  x = vectorizer.fit_transform(dokumen).toarray()
  terms = vectorizer.get_feature_names_out()

  final_tfidf = pd.DataFrame(x, columns=terms)
  final_tfidf.insert(0, 'Dokumen', dokumen)
  final_tfidf.insert(len(final_tfidf.columns),'Category', category)

  return (vectorizer, final_tfidf)

tfidf_vectorizer, final_tfidf = tfidf(df['Full Text'], df['Category'])
final_tfidf

Unnamed: 0,Dokumen,aa,aad,aalco,aam,aamiin,ab,aba,abad,abadi,...,zulfi,zulfikar,zulhas,zulkifli,zumba,zuna,zunaldi,zusammenarbeit,zuzu,Category
0,jakarta cnbc indonesia astronom alien bumi keh...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Tech
1,jakarta cnbc indonesia aktrisnominasi oscar sc...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Tech
2,jakarta cnbc indonesia ghozali mendadak kaya n...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Tech
3,jakarta cnbc indonesia jeff bezos memutuskan p...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Tech
4,jakarta cnbc indonesia pt bank rakyat indonesi...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Tech
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1495,jakarta cnbcindonesia roadshow bezzera latte a...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Entrepreneur
1496,jakarta cnbc indonesia permasalahan sampah pol...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Entrepreneur
1497,jakarta cnbcindonesia indonesia menyelenggarak...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Entrepreneur
1498,jakarta cnbc indonesia sampoerna entrepreneurs...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Entrepreneur


## Declaring X and y by Dropping the Document and Label Columns

This step separates the features (X) from the target (y) so they can be used for modeling. A brief explanation:

- `X = final_tfidf.drop(['Dokumen', 'Category'], axis=1)`: builds the DataFrame `X` containing the features. The `drop` method with `axis=1` removes the two columns 'Dokumen' and 'Category' from `final_tfidf`, so `X` holds every remaining column, that is, the TF-IDF weight of each term. These features are used for modeling.

- `y = df['Category']`: builds the Series `y` containing the target labels to predict. It takes only the 'Category' column from `df` and serves as the target variable for classification.
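The same feature/target separation can be sketched on a toy table. The two documents, the two term columns, and the names `toy_tfidf`, `X_toy`, and `y_toy` are invented for illustration.

```python
import pandas as pd

# Toy stand-in for final_tfidf: one text column, two term columns, one label column.
toy_tfidf = pd.DataFrame({
    "Dokumen": ["teks satu", "teks dua"],
    "harga": [0.5, 0.0],
    "saham": [0.0, 0.7],
    "Category": ["News", "Tech"],
})

# Features: everything except the raw text and the label.
X_toy = toy_tfidf.drop(["Dokumen", "Category"], axis=1)

# Target: the label column only.
y_toy = toy_tfidf["Category"]

print(list(X_toy.columns))
print(list(y_toy))
```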

In [None]:
X = final_tfidf.drop(['Dokumen', 'Category'], axis=1)
X

Unnamed: 0,aa,aad,aalco,aam,aamiin,ab,aba,abad,abadi,abai,...,zulfa,zulfi,zulfikar,zulhas,zulkifli,zumba,zuna,zunaldi,zusammenarbeit,zuzu
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1495,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1496,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1497,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1498,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
y = df['Category']
y

0               Tech
1               Tech
2               Tech
3               Tech
4               Tech
            ...     
1495    Entrepreneur
1496    Entrepreneur
1497    Entrepreneur
1498    Entrepreneur
1499    Entrepreneur
Name: Category, Length: 1500, dtype: object

## Splitting Data

This step uses `train_test_split` from scikit-learn to split the dataset into training and testing subsets. A brief explanation:

- `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)`: splits the data into four subsets: `X_train` (training features), `X_test` (testing features), `y_train` (training targets), and `y_test` (testing targets).

    - `X` is the DataFrame containing the features.
    - `y` is the Series containing the target labels.
    - `test_size=0.3` assigns 30% of the data to the test set and the remaining 70% to the training set.
    - `random_state=42` fixes the random seed so that the split is reproducible. Any other integer can be used instead.
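A minimal sketch of what these parameters do, using synthetic arrays with the same number of rows as the dataset (1,500). The arrays `X_demo` and `y_demo` are placeholders, not the real TF-IDF matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 1500 rows, 2 feature columns, 3 classes.
X_demo = np.arange(1500 * 2).reshape(1500, 2)
y_demo = np.arange(1500) % 3

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
print(X_tr.shape, X_te.shape)  # 70% / 30% of the 1500 rows

# Repeating the call with the same seed reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```

With 1,500 documents this yields 1,050 training rows and 450 test rows, which matches the support total of 450 in the classification reports further down.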

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## **Modeling With LDA**

### Finding the Best Parameters for LDA

In [None]:
def find_best_lda(Xtrain, Xtest, n_components, alpha, beta):
  # Note: y_train and y_test are read from the notebook's global scope
  looping = 1
  best = {'k' : 0, 'alpha' : 0, 'beta' : 0, 'accuracy' : 0, 'model': None, 'lda' : None, 'lda_Xtrain' : None, 'lda_Xtest' : None}
  history = pd.DataFrame(columns=["Pengujian Ke", "K", "Alpha", "Beta", "Accuracy"])

  # Grid search over every combination of topic count (k), alpha, and beta
  for k in n_components:
    for a in alpha:
      for b in beta:
        # Fit LDA on the training data only, then project the test data
        lda = LatentDirichletAllocation(n_components=k, doc_topic_prior=a, topic_word_prior=b)
        lda_Xtrain = lda.fit_transform(Xtrain)
        lda_Xtest = lda.transform(Xtest)

        # Create the Naive Bayes model
        model = GaussianNB()

        # Train the model on the topic representation of the training data
        model.fit(lda_Xtrain, y_train)

        # Predict on the test data
        y_pred = model.predict(lda_Xtest)

        # Compute the accuracy
        accuracy = accuracy_score(y_test, y_pred)
        print(f"Jumlah Topik: {k}, Alpha: {a}, Beta: {b}, Accuracy: {accuracy}")

        # Keep the best-scoring combination together with its fitted objects
        if accuracy > best['accuracy']:
          best['accuracy'] = accuracy
          best['k'] = k
          best['alpha'] = a
          best['beta'] = b
          best['model'] = model
          best['lda'] = lda
          best['lda_Xtrain'] = lda_Xtrain
          best['lda_Xtest'] = lda_Xtest

        # Log this trial
        history.loc[len(history)] = [f"Pengujian Ke- {looping}", k, a, b, accuracy]
        looping += 1

  return (best, history)

k = [3, 4, 5]
alpha = [0.3, 0.4]
beta = [0.1, 0.2]
best_param, history = find_best_lda(X_train, X_test, k, alpha, beta)

Jumlah Topik: 3, Alpha: 0.3, Beta: 0.1, Accuracy: 0.45111111111111113
Jumlah Topik: 3, Alpha: 0.3, Beta: 0.2, Accuracy: 0.45555555555555555
Jumlah Topik: 3, Alpha: 0.4, Beta: 0.1, Accuracy: 0.4288888888888889
Jumlah Topik: 3, Alpha: 0.4, Beta: 0.2, Accuracy: 0.35555555555555557
Jumlah Topik: 4, Alpha: 0.3, Beta: 0.1, Accuracy: 0.37555555555555553
Jumlah Topik: 4, Alpha: 0.3, Beta: 0.2, Accuracy: 0.43333333333333335
Jumlah Topik: 4, Alpha: 0.4, Beta: 0.1, Accuracy: 0.45111111111111113
Jumlah Topik: 4, Alpha: 0.4, Beta: 0.2, Accuracy: 0.5555555555555556
Jumlah Topik: 5, Alpha: 0.3, Beta: 0.1, Accuracy: 0.39111111111111113
Jumlah Topik: 5, Alpha: 0.3, Beta: 0.2, Accuracy: 0.4311111111111111
Jumlah Topik: 5, Alpha: 0.4, Beta: 0.1, Accuracy: 0.5933333333333334
Jumlah Topik: 5, Alpha: 0.4, Beta: 0.2, Accuracy: 0.4822222222222222
Jumlah Topik: 10, Alpha: 0.3, Beta: 0.1, Accuracy: 0.3933333333333333
Jumlah Topik: 10, Alpha: 0.3, Beta: 0.2, Accuracy: 0.34444444444444444
Jumlah Topik: 10, Alpha:

In [None]:
best_param

{'k': 51,
 'alpha': 0.4,
 'beta': 0.1,
 'accuracy': 0.62,
 'model': GaussianNB(),
 'lda': LatentDirichletAllocation(doc_topic_prior=0.4, n_components=51,
                           topic_word_prior=0.1),
 'lda_Xtrain': array([[0.0148147 , 0.01481453, 0.01481452, ..., 0.01481469, 0.01481457,
         0.01481457],
        [0.01451178, 0.01451172, 0.01451171, ..., 0.01451177, 0.01723969,
         0.01451173],
        [0.01408354, 0.01408345, 0.01503033, ..., 0.01408353, 0.01408347,
         0.01408347],
        ...,
        [0.01266995, 0.01372004, 0.01266972, ..., 0.01266994, 0.0126698 ,
         0.0126698 ],
        [0.01319867, 0.01387401, 0.05909735, ..., 0.01319865, 0.0131983 ,
         0.0131983 ],
        [0.01282085, 0.0128203 , 0.01282025, ..., 0.01282082, 0.01282044,
         0.01282044]]),
 'lda_Xtest': array([[0.01470892, 0.01469677, 0.01469592, ..., 0.01473377, 0.01470013,
         0.01639436],
        [0.0137538 , 0.0137471 , 0.01671984, ..., 0.01375346, 0.01374884,
        

In [None]:
history

Unnamed: 0,Pengujian Ke,K,Alpha,Beta,Accuracy
0,Pengujian Ke- 1,3,0.3,0.1,0.451111
1,Pengujian Ke- 2,3,0.3,0.2,0.455556
2,Pengujian Ke- 3,3,0.4,0.1,0.428889
3,Pengujian Ke- 4,3,0.4,0.2,0.355556
4,Pengujian Ke- 5,4,0.3,0.1,0.375556
...,...,...,...,...,...
71,Pengujian Ke- 72,53,0.4,0.2,0.426667
72,Pengujian Ke- 73,54,0.3,0.1,0.533333
73,Pengujian Ke- 74,54,0.3,0.2,0.457778
74,Pengujian Ke- 75,54,0.4,0.1,0.480000


In [None]:
history.to_csv("history.csv", index=False)

### Declaring K, Alpha, and Beta

### LDA

In the next step, I use the Latent Dirichlet Allocation (LDA) algorithm to reduce the dimensionality of the text data. LDA identifies the main topics that occur across the documents. The result is a document-topic representation in which each document is described as a mixture of the available topics.

LDA is fitted on the training data, which produces a matrix called `lda_x_train`. This matrix holds the document-topic representation of the training data, one row per document and one column per topic.

The LDA model fitted on the training data is then used to transform the test data into the same document-topic representation, called `lda_x_test`. This shows how the test documents relate to the topics identified from the training data.

Converting the text into this topic representation replaces tens of thousands of TF-IDF columns with a handful of topic proportions, which can then be used for further analysis or as input to a classifier.
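The fit-on-train, transform-on-test pattern described above can be sketched on a small synthetic matrix. The random counts, the choice of 3 topics, and the names `train_counts`, `test_counts`, and `lda_demo` are assumptions for illustration and do not reflect the notebook's data or best parameters.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic non-negative document-term matrices: 12 "terms", 20 train and 5 test documents.
rng = np.random.default_rng(0)
train_counts = rng.integers(0, 5, size=(20, 12))
test_counts = rng.integers(0, 5, size=(5, 12))

# Fit on the training documents only, then reuse the fitted model for the test documents.
lda_demo = LatentDirichletAllocation(n_components=3, random_state=0)
train_topics = lda_demo.fit_transform(train_counts)
test_topics = lda_demo.transform(test_counts)

print(train_topics.shape, test_topics.shape)  # 12 term columns reduced to 3 topic columns

# Each row is a topic distribution, so its entries sum to 1.
rows_sum_to_one = np.allclose(train_topics.sum(axis=1), 1.0)
print(rows_sum_to_one)
```

Passing `random_state` makes the fitted topics reproducible between runs; without it, repeated runs of the same parameter combination can give different accuracies.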

In [None]:
lda = best_param['lda']
lda_x_train = best_param['lda_Xtrain']
lda_x_test = best_param['lda_Xtest']

### Dimensionality Reduction Results

In [None]:
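topik_columns = [f"Topik {i}" for i in range(1, best_param['k']+1)]
# lda_x_train follows the shuffled row order of X_train, so look up the
# documents and categories by X_train's index instead of by position
dokumen = final_tfidf.loc[X_train.index, 'Dokumen'].to_numpy()
kategori = final_tfidf.loc[X_train.index, 'Category'].to_numpy()
output_proporsi_TD = pd.DataFrame(lda_x_train, columns=topik_columns)
output_proporsi_TD.insert(0,'Dokumen', dokumen)
output_proporsi_TD.insert(len(output_proporsi_TD.columns),'Category', kategori)
output_proporsi_TD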

Unnamed: 0,Dokumen,Topik 1,Topik 2,Topik 3,Topik 4,Topik 5,Topik 6,Topik 7,Topik 8,Topik 9,...,Topik 42,Topik 43,Topik 44,Topik 45,Topik 46,Topik 47,Topik 48,Topik 49,Topik 50,Category
0,jakarta cnbc indonesia astronom alien bumi keh...,0.015029,0.015029,0.015029,0.015029,0.015029,0.015029,0.015029,0.015029,0.015029,...,0.015029,0.157409,0.015029,0.015158,0.015029,0.015029,0.015029,0.015029,0.015029,Tech
1,jakarta cnbc indonesia aktrisnominasi oscar sc...,0.014724,0.014724,0.014724,0.014724,0.014724,0.014724,0.014724,0.014724,0.014724,...,0.021393,0.271855,0.014724,0.014723,0.014724,0.014724,0.014724,0.014724,0.014724,Tech
2,jakarta cnbc indonesia ghozali mendadak kaya n...,0.014285,0.014285,0.014285,0.014285,0.015272,0.014285,0.014285,0.014285,0.014285,...,0.014285,0.275922,0.014285,0.018278,0.014285,0.014285,0.014285,0.014285,0.014285,Tech
3,jakarta cnbc indonesia jeff bezos memutuskan p...,0.013477,0.013477,0.013477,0.013476,0.014530,0.013477,0.013477,0.013476,0.013477,...,0.013477,0.331347,0.013477,0.019481,0.013477,0.013477,0.013477,0.013477,0.013477,Tech
4,jakarta cnbc indonesia pt bank rakyat indonesi...,0.013464,0.013464,0.013464,0.013464,0.013464,0.013464,0.013464,0.013464,0.013464,...,0.013464,0.336778,0.013464,0.016933,0.013464,0.013464,0.013464,0.013464,0.013464,Tech
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1045,jakarta cnbc indonesia gerakan maju tani nusan...,0.014201,0.016163,0.014200,0.014978,0.014200,0.014200,0.015274,0.014197,0.015295,...,0.015981,0.208255,0.014201,0.024359,0.014200,0.014200,0.014201,0.014200,0.014200,Entrepreneur
1046,jakarta cnbc indonesia kebanyakan orang tujuan...,0.014546,0.014546,0.014546,0.014546,0.014546,0.014546,0.014546,0.014546,0.014546,...,0.014546,0.277413,0.014546,0.020979,0.014546,0.014546,0.014546,0.014546,0.014546,Entrepreneur
1047,jakarta cnbc indonesia kisah george soros mili...,0.012902,0.012901,0.012902,0.012900,0.012902,0.012902,0.012902,0.012899,0.014375,...,0.012901,0.336847,0.012902,0.042399,0.012902,0.012902,0.012902,0.012902,0.012902,Entrepreneur
1048,jakarta cnbc indonesia keindahan kota karawang...,0.013369,0.013368,0.013369,0.013368,0.013369,0.014791,0.013406,0.013368,0.013369,...,0.013368,0.328861,0.013369,0.019507,0.013369,0.013369,0.013369,0.013369,0.013369,Entrepreneur


### Save the Dimensionality Reduction Results

In [None]:
output_proporsi_TD.to_csv('reduksi dimensi.csv', index=False)

### Word Proportions per Topic

In [None]:
# Topic-word distribution: one row per topic, one column per term
distribusi_kata_topik = pd.DataFrame(lda.components_)
distribusi_kata_topik

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,29337,29338,29339,29340,29341,29342,29343,29344,29345,29346
0,0.100019,0.10001,0.100004,0.100004,0.100033,0.100012,0.1,0.100001,0.100002,0.100207,...,0.100023,0.100023,0.100069,0.10001,0.100004,0.100007,0.1,0.100023,0.1,0.100096
1,0.100037,0.10001,0.100003,0.100004,0.100034,0.100012,0.1,0.100002,0.100002,0.100204,...,0.100028,0.100028,0.100068,0.100009,0.100004,0.100007,0.1,0.100028,0.1,0.100118
2,0.100019,0.10001,0.100003,0.100004,0.100033,0.100012,0.1,0.100001,0.100002,0.100206,...,0.100022,0.100022,0.100068,0.100009,0.100004,0.100007,0.1,0.100022,0.1,0.100096
3,0.100019,0.10001,0.100003,0.100004,0.100032,0.100012,0.1,0.100001,0.100002,0.100202,...,0.100036,0.100036,0.100067,0.100017,0.100008,0.100007,0.1,0.100036,0.1,0.100094
4,0.100019,0.100012,0.100004,0.100004,0.100033,0.100012,0.1,0.100001,0.100002,0.100206,...,0.100023,0.100023,0.10008,0.100009,0.100004,0.100007,0.1,0.100023,0.1,0.100096
5,0.100019,0.10001,0.100004,0.100004,0.100033,0.100012,0.1,0.100001,0.100002,0.100207,...,0.100023,0.100023,0.100069,0.10001,0.100004,0.100007,0.1,0.100023,0.1,0.100096
6,0.100019,0.10001,0.100003,0.100004,0.100033,0.100013,0.1,0.100001,0.100002,0.100206,...,0.100023,0.100023,0.100084,0.100009,0.100004,0.100007,0.1,0.100023,0.1,0.100096
7,0.100024,0.10001,0.100003,0.100005,0.100032,0.100012,0.1,0.100001,0.100002,0.1002,...,0.100036,0.100036,0.100089,0.100011,0.100005,0.100007,0.1,0.100036,0.1,0.100093
8,0.100019,0.10001,0.100003,0.100004,0.100033,0.100012,0.1,0.100002,0.100002,0.100206,...,0.100023,0.100023,0.100068,0.100009,0.100004,0.100007,0.1,0.100023,0.1,0.100096
9,0.100019,0.10001,0.100004,0.100004,0.100033,0.100012,0.1,0.100001,0.100002,0.100207,...,0.100023,0.100023,0.100069,0.10001,0.100004,0.100007,0.1,0.100023,0.1,0.100096


### Model Naive Bayes With LDA

In [None]:
# Retrieve the best Naive Bayes model found during the parameter search
model = best_param['model']

# Predict on the test data
y_pred = model.predict(lda_x_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Akurasi:", accuracy)

# Show the classification report
print("Laporan Klasifikasi:")
print(classification_report(y_test, y_pred))

# Show the confusion matrix
confusion = confusion_matrix(y_test, y_pred)
print("Confusion Matriks:")
print(confusion)

Akurasi: 0.5377777777777778
Laporan Klasifikasi:
              precision    recall  f1-score   support

Entrepreneur       0.51      0.50      0.50       161
        News       0.46      0.60      0.52       141
        Tech       0.70      0.52      0.60       148

    accuracy                           0.54       450
   macro avg       0.56      0.54      0.54       450
weighted avg       0.56      0.54      0.54       450

Confusion Matriks:
[[80 64 17]
 [40 85 16]
 [36 35 77]]
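The figures in the report can be recomputed by hand from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The sketch below copies the matrix printed above; the variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true, columns = predicted).
labels = ["Entrepreneur", "News", "Tech"]
cm = np.array([[80, 64, 17],
               [40, 85, 16],
               [36, 35, 77]])

# Accuracy: correctly classified documents (the diagonal) over all documents.
accuracy = np.trace(cm) / cm.sum()

# Recall: diagonal over row totals; precision: diagonal over column totals.
recall = np.diag(cm) / cm.sum(axis=1)
precision = np.diag(cm) / cm.sum(axis=0)

print(f"accuracy={accuracy:.4f}")
for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```

The diagonal sums to 242 of 450 test documents, which gives the reported accuracy of about 0.54. The first row also shows that 64 of the 161 Entrepreneur articles were predicted as News.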


### Predict

In [None]:
data = ["Penelitian ini menggabungkan konsep kecerdasan buatan dengan algoritma penjadwalan dalam upaya meningkatkan efisiensi produksi dalam lingkungan manufaktur. Kami memperkenalkan pendekatan yang memanfaatkan kecerdasan komputasional, yaitu algoritma optimasi berbasis swarm intelligence, seperti algoritma PSO (Particle Swarm Optimization) dan algoritma ACO (Ant Colony Optimization). Tujuan utama penelitian ini adalah untuk mengoptimalkan jadwal produksi dengan meminimalkan waktu produksi dan biaya, sambil mempertimbangkan berbagai kendala produksi seperti kapasitas mesin, waktu pemrosesan, dan persyaratan bahan baku. Melalui eksperimen dan simulasi, kami membandingkan hasil dari algoritma swarm intelligence dengan pendekatan konvensional. Hasilnya menunjukkan bahwa algoritma PSO dan ACO dapat menghasilkan jadwal produksi yang lebih efisien, dengan waktu produksi yang lebih pendek dan biaya yang lebih rendah. Selain itu, algoritma ini mampu beradaptasi dengan perubahan dalam lingkungan produksi dan menghasilkan jadwal yang optimal bahkan dalam situasi yang kompleks. Penelitian ini menunjukkan potensi besar dari penggunaan kecerdasan komputasional dalam meningkatkan efisiensi dan produktivitas dalam industri manufaktur. Hasil penelitian ini dapat digunakan sebagai dasar untuk mengembangkan sistem penjadwalan cerdas yang dapat diterapkan dalam berbagai industri."]
a = tfidf_vectorizer.transform(data).toarray()
b = lda.transform(a)
model.predict(b)

array(['Tech'], dtype=object)

### Save Model

In [None]:
joblib.dump(lda, "lda.pkl")
joblib.dump(model, "naive bayes.pkl")

['naive bayes.pkl']

## Modeling Without LDA

### Training the Model on the Original Dataset

In [None]:
# Create the Naive Bayes model
nb = GaussianNB()

# Train the model on the training data (full TF-IDF features, no LDA)
nb.fit(X_train, y_train)

# Predict on the test data
y_pred_nb = nb.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred_nb)
print("Akurasi:", accuracy)

# Show the classification report
print("Laporan Klasifikasi:")
print(classification_report(y_test, y_pred_nb))

# Show the confusion matrix
confusion = confusion_matrix(y_test, y_pred_nb)
print("Confusion Matriks:")
print(confusion)

Akurasi: 0.8822222222222222
Laporan Klasifikasi:
              precision    recall  f1-score   support

Entrepreneur       0.84      0.89      0.86       161
        News       0.90      0.92      0.91       141
        Tech       0.92      0.84      0.88       148

    accuracy                           0.88       450
   macro avg       0.89      0.88      0.88       450
weighted avg       0.88      0.88      0.88       450

Confusion Matriks:
[[143   7  11]
 [ 11 130   0]
 [ 16   8 124]]


### Predict

In [None]:
data = ["Penelitian ini menggabungkan konsep kecerdasan buatan dengan algoritma penjadwalan dalam upaya meningkatkan efisiensi produksi dalam lingkungan manufaktur. Kami memperkenalkan pendekatan yang memanfaatkan kecerdasan komputasional, yaitu algoritma optimasi berbasis swarm intelligence, seperti algoritma PSO (Particle Swarm Optimization) dan algoritma ACO (Ant Colony Optimization). Tujuan utama penelitian ini adalah untuk mengoptimalkan jadwal produksi dengan meminimalkan waktu produksi dan biaya, sambil mempertimbangkan berbagai kendala produksi seperti kapasitas mesin, waktu pemrosesan, dan persyaratan bahan baku. Melalui eksperimen dan simulasi, kami membandingkan hasil dari algoritma swarm intelligence dengan pendekatan konvensional. Hasilnya menunjukkan bahwa algoritma PSO dan ACO dapat menghasilkan jadwal produksi yang lebih efisien, dengan waktu produksi yang lebih pendek dan biaya yang lebih rendah. Selain itu, algoritma ini mampu beradaptasi dengan perubahan dalam lingkungan produksi dan menghasilkan jadwal yang optimal bahkan dalam situasi yang kompleks. Penelitian ini menunjukkan potensi besar dari penggunaan kecerdasan komputasional dalam meningkatkan efisiensi dan produktivitas dalam industri manufaktur. Hasil penelitian ini dapat digunakan sebagai dasar untuk mengembangkan sistem penjadwalan cerdas yang dapat diterapkan dalam berbagai industri."]
tfidf_matrix = tfidf_vectorizer.transform(data).toarray()
nb.predict(tfidf_matrix)

array(['Tech'], dtype='<U12')

### Save Model

In [None]:
joblib.dump(nb, "Naive Bayes (Asli).pkl")

['Naive Bayes (Asli).pkl']

## Save Vectorizer

In [None]:
joblib.dump(tfidf_vectorizer, 'vectorizer.pkl')

['vectorizer.pkl']
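The saved files can be restored later with `joblib.load`, for example by a separate application that only needs to predict. The sketch below shows the round trip on a tiny made-up classifier written to a temporary directory; the file name `demo_model.pkl` and the toy data are hypothetical and unrelated to the files saved above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny made-up training set with two features and two classes.
X_small = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y_small = np.array(["Tech", "Tech", "News", "News"])
clf = GaussianNB().fit(X_small, y_small)

# Save the fitted model, then load it back as a new object.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model gives the same predictions as the original.
same_predictions = np.array_equal(clf.predict(X_small), restored.predict(X_small))
print(same_predictions)
```

For the LDA pipeline, new text would have to pass through the saved objects in the same order as in the Predict cells: the vectorizer first, then LDA, then the Naive Bayes model.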