<a href="https://colab.research.google.com/github/fdhliakbar/IR-Lab/blob/main/TF_IDF_Pertemuan4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practicum 4: TF-IDF Weighting

<img src="https://www.infront.com/wp-content/uploads/2021/01/tf%E2%80%93idf-1080.jpg" alt="TF IDF Image"/>

In Information Retrieval (IR), we often want to know how important a word is in one document compared with the entire document collection.
The weighting method used for this is TF-IDF (Term Frequency - Inverse Document Frequency).

TF-IDF helps a system judge how relevant a word is to a document, and it is widely used in:
- Search engines,
- Recommendation systems,
- Text mining, and
- Natural Language Processing (NLP).

## Term Frequency
Term Frequency (TF) shows how often a word (term) appears in a document.

Formula: `TF(t, d) = (number of occurrences of term t in document d) / (total number of words in document d)`

The more often a word appears in a document, the more important it is within that document. However, common words such as “yang”, “dan”, and “atau” have a high TF in many documents, so a discriminating factor (IDF) is needed.
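The TF definition above can be sketched in a few lines of plain Python. This is only an illustration: the function name `term_frequency` and the sample sentence are assumptions made for the example, not part of the practicum dataset.

```python
from collections import Counter

def term_frequency(tokens):
    """Return TF(t, d) for every term of one tokenized document."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    # Occurrences of the term divided by the total number of words.
    return {term: count / total for term, count in counts.items()}

# Hypothetical document, already lowercased and split on whitespace.
doc = "pemilu dan suara dan hasil pemilu".split()
tf = term_frequency(doc)
print(tf["pemilu"])  # "pemilu" accounts for 2 of the 6 tokens
```

Because every count is divided by the same total, the TF values of one document always sum to 1.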

## Inverse Document Frequency
Inverse Document Frequency (IDF) measures how unique or rare a word is across the whole document collection (corpus).

Formula: `IDF(t) = log(N / df(t))`

Where:

- N = total number of documents

- df(t) = number of documents that contain the term t

If a word appears in many documents, it is less informative.
A word that appears in few documents is more distinctive and has a high IDF value.
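A minimal sketch of the IDF formula follows. It assumes the natural logarithm (the formula above does not state a base) and uses a hypothetical three-document corpus; the function name is also an assumption.

```python
import math

def inverse_document_frequency(documents):
    """Return IDF(t) = log(N / df(t)) for every term in a tokenized corpus."""
    n_docs = len(documents)
    df = {}
    for tokens in documents:
        for term in set(tokens):  # each document counts once per term
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in df.items()}

# Hypothetical corpus: "dan" appears in every document, the rest in one each.
corpus = [
    "pemilu dan suara".split(),
    "aplikasi dan situs".split(),
    "wisata dan kota".split(),
]
idf = inverse_document_frequency(corpus)
print(idf["dan"], idf["pemilu"])
```

A term found in all documents gets `log(N / N) = 0`, which is how IDF suppresses common words.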

## TF-IDF
TF-IDF is the product of TF and IDF. It weights a word according to:
- How often the word appears in the document (TF),
- How rarely the word appears across the whole collection (IDF).

Formula: `TF-IDF(t, d) = TF(t, d) × IDF(t)`

A word that appears often in one document but rarely in the others gets a high TF-IDF value, which marks it as important for that document.
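The sketch below combines the two formulas and picks the highest-weighted term of one document. It is only an illustration under stated assumptions: natural logarithm, whitespace tokenization, and a hypothetical toy corpus.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: TF-IDF weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in documents for term in set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical corpus; "pemilu" is frequent in the first document only.
corpus = [
    "pemilu dan suara pemilu".split(),
    "aplikasi dan situs".split(),
    "wisata dan kota".split(),
]
weights = tf_idf(corpus)
top_term = max(weights[0], key=weights[0].get)
print(top_term)
```

In this toy corpus "dan" receives a weight of zero in every document, while "pemilu" dominates the first one.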

<img src="https://miro.medium.com/v2/resize:fit:1358/0*W3Rzv6djRGrftW7r.PNG" width="108%" alt="TF-IDF example" />

## Summary

| Component  | Function                                   | Indicates                           |
| ---------- | ------------------------------------------ | ----------------------------------- |
| **TF**     | Frequency of a word within a document      | Local importance                    |
| **IDF**    | Rarity of a word across all documents      | Global importance                   |
| **TF-IDF** | Combination of TF and IDF                  | Overall importance weight of a word |


## Next: Practicum Steps 🫡

## Initialization and Library Imports

In [29]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [30]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB

In [31]:
!pip install Sastrawi



In [32]:
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

## Reading the Dataset

In [33]:
data =  pd.read_excel('dataKumparan1.xlsx')
data.head()

|   | Topic | Title | Content |
| - | ----- | ----- | ------- |
| 0 | Politik | Pelanggaran Pemilu, Tiga Caleg di Sulteng Dipr... | Komisioner Bawaslu Sigi, Sulawesi Tengah, Agus... |
| 1 | Politik | Pemilu Susulan di Kota Jayapura, Suara Jokowi ... | Walaupun dua dari lima distrik melakukan pemil... |
| 2 | Politik | Tsamara Amany Dipinang, Pengurus PSI Daerah Me... | Tsamara Amany, politisi Partai Solidaritas Ind... |
| 3 | Politik | Ada 47 TPS di Sulawesi Utara Berpotensi Pemili... | Badan Pengawas Pemilu (Bawaslu) Provinsi Sulaw... |
| 4 | Politik | Ketua KPPS di Sleman Ditemukan Tewas Gantung D... | Tugiman, Ketua Kelompok Penyelenggara Pemungut... |


### Dataset Size

In [34]:
print('Dataset size: ', data.shape)

Dataset size:  (60, 3)


### Splitting the Data into Training & Testing Sets

In [35]:
# No random_state is set, so the rows chosen for each split change on every run.
x_train, x_test, y_train, y_test = train_test_split(data['Content'], data['Topic'], train_size=0.5, test_size=0.16)
train_data = pd.DataFrame({'Content': x_train, 'Topic': y_train})
test_data = pd.DataFrame({'Content': x_test, 'Topic': y_test})

### Training Data

In [36]:
df1 = pd.DataFrame(train_data)
print(df1)

                                              Content      Topic
27  Pesta demokrasi terbesar di Indonesia resmi di...  Teknologi
8   Pertamuan para kiai sepuh se-Jawa Timur yang d...    Politik
2   Tsamara Amany, politisi Partai Solidaritas Ind...    Politik
21  Facebook hari Rabu mengaku telah secara tidak ...  Teknologi
45  Jumat (19/4) kawasan Terminal Keberangkatan Do...     Travel
11  Proses rekapitulasi surat suara melalui Sistem...    Politik
14  KPU memberikan klarifikasi mengenai banyaknya ...    Politik
1   Walaupun dua dari lima distrik melakukan pemil...    Politik
32  Super hype! Penayangan perdana film 'Avengers:...  Teknologi
25  Apple ternyata tidak main-main untuk terjun ke...  Teknologi
47  Pemerintah Kota Banjarmasin baru saja menyeles...     Travel
37  Belanda merupakan negara yang secara geografis...  Teknologi
26  Sistem operasi iOS terbaru untuk perangkat App...  Teknologi
18  Isu kecurangan di Pemilu 2019 terus menyeruak....    Politik
19  Cawapres 02 Sandiaga 

### Testing Data

In [37]:
df2 = pd.DataFrame(test_data)
print(df2)

                                              Content      Topic
7   Badan Pengawas Pemilu (Bawaslu) Kota Banjarmas...    Politik
41  Berlibur menjadi kegiatan yang paling dinanti ...     Travel
39  Perusahaan e-commerce marketplace Tokopedia ke...  Teknologi
4   Tugiman, Ketua Kelompok Penyelenggara Pemungut...    Politik
6   Peningkatan perolehan suara untuk Partai Keadi...    Politik
36  Platform media sosial populer Instagram kembal...  Teknologi
13  Aplikasi dan situs ayojagatps.com ikut meramai...    Politik
29  Mesin pencari Google menampilkan sebuah doodle...  Teknologi
42  Dinas Kebudayaan dan Pariwisata (Disbudpar) Ko...     Travel
17  Capres nomor urut 02 Prabowo Subianto sempat m...    Politik


### Training & Testing Data Sizes

In [38]:
print('Training data size : ', train_data.shape)
print('Testing data size : ', test_data.shape)
n_train = train_data.shape[0]
n_test = test_data.shape[0]

Training data size :  (30, 2)
Testing data size :  (10, 2)
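The 30 and 10 rows reported above follow from the fractions passed to `train_test_split` on a 60-row dataset. The sketch below reproduces that arithmetic under the assumption that fractional test sizes are rounded up and fractional train sizes are rounded down, which matches the reported shapes; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, train_size, test_size):
    """Row counts for fractional train/test sizes, plus the rows left unused."""
    n_test = math.ceil(test_size * n_samples)     # 0.16 * 60 = 9.6 -> 10
    n_train = math.floor(train_size * n_samples)  # 0.5 * 60 = 30.0 -> 30
    n_unused = n_samples - n_train - n_test
    return n_train, n_test, n_unused

print(split_sizes(60, train_size=0.5, test_size=0.16))
```

Since the two fractions do not add up to 1, the remaining 20 rows belong to neither set.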


### Combining and Checking the Data

In [42]:
sparse_data = pd.concat([train_data, test_data], ignore_index=True)
# sparse_data.head()

df3 = pd.DataFrame(sparse_data)
print(df3)

                                              Content      Topic
0   Pesta demokrasi terbesar di Indonesia resmi di...  Teknologi
1   Pertamuan para kiai sepuh se-Jawa Timur yang d...    Politik
2   Tsamara Amany, politisi Partai Solidaritas Ind...    Politik
3   Facebook hari Rabu mengaku telah secara tidak ...  Teknologi
4   Jumat (19/4) kawasan Terminal Keberangkatan Do...     Travel
5   Proses rekapitulasi surat suara melalui Sistem...    Politik
6   KPU memberikan klarifikasi mengenai banyaknya ...    Politik
7   Walaupun dua dari lima distrik melakukan pemil...    Politik
8   Super hype! Penayangan perdana film 'Avengers:...  Teknologi
9   Apple ternyata tidak main-main untuk terjun ke...  Teknologi
10  Pemerintah Kota Banjarmasin baru saja menyeles...     Travel
11  Belanda merupakan negara yang secara geografis...  Teknologi
12  Sistem operasi iOS terbaru untuk perangkat App...  Teknologi
13  Isu kecurangan di Pemilu 2019 terus menyeruak....    Politik
14  Cawapres 02 Sandiaga 

In [46]:
print('Combined data size : ', sparse_data.shape)
n_document = sparse_data.shape[0]

Combined data size :  (40, 2)


## With Stemming and Stop Words

### Preprocessing

### Stemming (Reducing Words to Their Base Form)

In [53]:
sparse_data['teks'] = sparse_data['Content']

# Create the Sastrawi stemmer once and reuse it for every document.
stemmerFactory = StemmerFactory()
stemmer = stemmerFactory.create_stemmer()

for row in range(n_document):
  # Replace the raw text in the 'teks' column with its stemmed form.
  sparse_data.loc[row, 'teks'] = stemmer.stem(sparse_data.loc[row, 'teks'])


In [54]:
# sparse_data.head()
df4 = pd.DataFrame(sparse_data)
print(df4)


                                              Content      Topic  \
0   Pesta demokrasi terbesar di Indonesia resmi di...  Teknologi   
1   Pertamuan para kiai sepuh se-Jawa Timur yang d...    Politik   
2   Tsamara Amany, politisi Partai Solidaritas Ind...    Politik   
3   Facebook hari Rabu mengaku telah secara tidak ...  Teknologi   
4   Jumat (19/4) kawasan Terminal Keberangkatan Do...     Travel   
5   Proses rekapitulasi surat suara melalui Sistem...    Politik   
6   KPU memberikan klarifikasi mengenai banyaknya ...    Politik   
7   Walaupun dua dari lima distrik melakukan pemil...    Politik   
8   Super hype! Penayangan perdana film 'Avengers:...  Teknologi   
9   Apple ternyata tidak main-main untuk terjun ke...  Teknologi   
10  Pemerintah Kota Banjarmasin baru saja menyeles...     Travel   
11  Belanda merupakan negara yang secara geografis...  Teknologi   
12  Sistem operasi iOS terbaru untuk perangkat App...  Teknologi   
13  Isu kecurangan di Pemilu 2019 terus menyerua

### TF-IDF Weighting: Building the TF Matrix

In [55]:
vectorize = CountVectorizer()
tf = vectorize.fit_transform(sparse_data['teks'])

print('Number of documents:', tf.shape[0])
print('Number of terms:', tf.shape[1])

Number of documents: 40
Number of terms: 2254
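To make the counting step less of a black box, here is a plain-Python sketch of roughly what `CountVectorizer` does with its default settings: lowercase the text, keep tokens of two or more word characters, sort the vocabulary, and count. The function name `count_matrix` and the two sample texts are assumptions for illustration; the real class returns a sparse matrix instead of nested lists.

```python
import re
from collections import Counter

# Tokens are runs of two or more word characters (letters, digits, underscore).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def count_matrix(texts):
    """Build a sorted vocabulary and a dense document-term count matrix."""
    tokenized = [TOKEN_PATTERN.findall(text.lower()) for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Hypothetical texts; single-character tokens such as "a" and "b" are dropped.
texts = ["Pemilu 2019 di Sleman", "Suara pemilu a b 02"]
vocabulary, rows = count_matrix(texts)
print(vocabulary)
print(rows)
```

Digits count as word characters, which is why numeric tokens can end up in the vocabulary alongside ordinary words.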


In [56]:
print('Term list:')
vectorize.get_feature_names_out()

Term list:


array(['00', '000', '01', ..., 'zainuddin', 'zamih', 'zat'], dtype=object)

In [63]:
print('Stop word list:')
print(vectorize.get_stop_words())

Stop word list:
None
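The result is `None` because `CountVectorizer()` was created without a `stop_words` argument, so no words are filtered out. As a minimal sketch of what stop word removal does, the snippet below filters a token list against a small set. The set is hypothetical and only holds the three common words mentioned earlier in this practicum; a real list would be much longer.

```python
# Hypothetical stop word set for illustration only.
STOP_WORDS = {"yang", "dan", "atau"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word set."""
    return [token for token in tokens if token not in stop_words]

tokens = "suara yang sah dan tidak sah".split()
filtered = remove_stop_words(tokens)
print(filtered)
```

With scikit-learn, a comparable effect can be obtained by passing a list of words through the `stop_words` parameter of `CountVectorizer`.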


In [58]:
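print('TF matrix:')
tf_matrix = pd.DataFrame(tf.toarray(), columns=vectorize.get_feature_names_out())
tf_matrix

TF matrix:


|   | 00 | 000 | 01 | 02 | 03 | 030 | 039 | 04 | 043 | 052 | ... | yakin | yakni | yang | yasar | yogyakarta | yusuf | zahid | zainuddin | zamih | zat |
| - | -- | --- | -- | -- | -- | --- | --- | -- | --- | --- | --- | ----- | ----- | ---- | ----- | ---------- | ----- | ----- | --------- | ----- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 8 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | 2 | 0 | ... | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |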


In [59]:
print('TF matrix (training data only):')
tf_train = tf_matrix[:n_train]
tf_train.shape

TF matrix (training data only):


(30, 2254)

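### Computing the TF-IDF Weights

In [60]:
transformer = TfidfTransformer(use_idf=True)

# Intended adjustment: compute df from the training rows only, so that the
# query (test) documents do not contribute to the document frequencies.
n = n_train
df = tf_train.astype(bool).sum(axis=0)
idf = np.log(n/df)
transformer.idf_ = idf

# Convert the TF matrix into a TF-IDF weight matrix.
# Note: fit_transform() re-estimates the IDF values from all rows of tf,
# so the idf_ assigned manually above is overwritten at this step.
weight = transformer.fit_transform(tf)
print('Number of documents:', weight.shape[0])
print('Number of terms:', weight.shape[1])

Number of documents: 40
Number of terms: 2254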
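For reference, `TfidfTransformer` with default settings does not use the plain `log(N / df)` formula: it applies a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and then scales every row to unit L2 length. The sketch below reproduces that computation in plain Python and, as an assumption about what the df adjustment in the cell is aiming for, estimates df from the first `n_fit` rows only. The function name and the toy count matrix are hypothetical.

```python
import math

def smoothed_tfidf(counts, n_fit):
    """TF-IDF rows with IDF estimated from the first n_fit rows only."""
    n_terms = len(counts[0])
    # Document frequency over the "training" rows only.
    df = [sum(1 for row in counts[:n_fit] if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF stays finite even when a term never occurs in those rows.
    idf = [math.log((1 + n_fit) / (1 + d)) + 1 for d in df]
    weighted = []
    for row in counts:
        raw = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in raw))
        weighted.append([v / norm if norm else 0.0 for v in raw])
    return weighted

# Hypothetical 3-document, 3-term count matrix; the last row plays the test document.
counts = [
    [2, 1, 0],
    [0, 1, 0],
    [1, 0, 3],
]
weights = smoothed_tfidf(counts, n_fit=2)
print(weights)
```

With scikit-learn itself, a similar effect would come from calling `fit` on the training rows and `transform` on the full matrix, rather than `fit_transform` on everything.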


### Final Weighting Result

In [61]:
weight_matrix = pd.DataFrame(weight.toarray(), columns=vectorize.get_feature_names_out())
weight_matrix

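|   | 00 | 000 | 01 | 02 | 03 | 030 | 039 | 04 | 043 | 052 | ... | yakin | yakni | yang | yasar | yogyakarta | yusuf | zahid | zainuddin | zamih | zat |
| - | -- | --- | -- | -- | -- | --- | --- | -- | --- | --- | --- | ----- | ----- | ---- | ----- | ---------- | ----- | ----- | --------- | ----- | --- |
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.050464 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.028924 | 0.054796 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.079195 | 0.0 | 0.0 | 0.035786 | 0.0 | 0.0398 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.053024 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.08687 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.055199 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.069 | 0.0 | 0.0 | 0.0 | 0.0 | 0.094945 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.023616 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.042682 | 0.0 | 0.1375 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.027109 | 0.051357 | 0.0 | 0.0 | 0.037301 | 0.0 | 0.074603 | 0.0 | ... | 0.0 | 0.0 | 0.027834 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.027837 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.041832 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.070904 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |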


In [62]:
# Split the weight matrix into training rows and testing rows

weight_train = weight_matrix[:n_train]
weight_test = weight_matrix[n_train:]

## Next: Computing Cosine Similarity with the Data from Practicum 4
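As a preview of that step, cosine similarity between one test row and each training row can be sketched as follows. This is a plain-Python illustration with hypothetical toy vectors and a hypothetical function name; in the notebook, the imported `cosine_similarity` from scikit-learn would normally be applied to `weight_test` and `weight_train`.

```python
import math

def cosine_similarity_rows(query, documents):
    """Cosine similarity between one query vector and each document vector."""
    def norm(vector):
        return math.sqrt(sum(v * v for v in vector))

    scores = []
    for doc in documents:
        dot = sum(q * d for q, d in zip(query, doc))
        denom = norm(query) * norm(doc)
        # An all-zero vector has no direction, so its similarity is set to 0.
        scores.append(dot / denom if denom else 0.0)
    return scores

# Hypothetical TF-IDF rows: two training documents and one test document.
train_rows = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0]]
test_row = [1.0, 0.0, 0.0]
scores = cosine_similarity_rows(test_row, train_rows)
print(scores)
```

The training document with the highest score is the one whose term weights point in the direction closest to the test document.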

---

## POSTTEST 4 - Modification with Stop Words

1. Complete the posttest by following the practicum steps.
2. Build a **stop word list** from the data in the provided `Excel` file.
3. Analyze the `TF-IDF` results computed from the processed data, and explain your findings.

## Posttest Program Code

**Note**: If you have questions about the technical side of the practicum, don't hesitate to ask. You can ask in the group or by private chat with the assistants `Fadhli` & `Aufa`.

## Good Luck

<div align="center">
<img src="https://i.pinimg.com/1200x/24/00/2b/24002be0f4ae505370e288ea883713f0.jpg" alt="Banner Haruhi suzumiya" />
</div>