<a href="https://colab.research.google.com/github/fdhliakbar/IR-Lab/blob/main/TF_IDF_Pertemuan4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 4: TF-IDF Weighting

In Information Retrieval (IR), we often want to know how important a word is to one document relative to the entire document collection.
The weighting method used for this is TF-IDF (Term Frequency - Inverse Document Frequency).

TF-IDF helps a system estimate how relevant a word is to a document, and it is widely used in:
- Search engines,
- Recommender systems,
- Text mining, and
- Natural Language Processing (NLP).

## Term Frequency
Term Frequency (TF) measures how often a word (term) appears in a document.

Formula:

$$
\mathrm{TF}(t, d) = \frac{\text{number of occurrences of term } t \text{ in document } d}{\text{total number of words in document } d}
$$

The more often a word appears in a document, the more important it is to that document. However, common words such as “yang” (which), “dan” (and), and “atau” (or) have a high TF in many documents, so a discriminating factor (IDF) is needed.
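
As a small illustration of the TF formula, the sketch below computes TF for a toy document using only the standard library. The helper name `term_frequency`, the whitespace tokenization, and the sample sentence are assumptions made for this example; they are not part of the lab's dataset or code.

```python
from collections import Counter

def term_frequency(term, document):
    """Return count(term in document) / total number of words in document."""
    tokens = document.lower().split()  # naive whitespace tokenization (assumption)
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)

doc = "pemilu dan suara dan hasil pemilu"  # made-up sentence with 6 words
tf_pemilu = term_frequency("pemilu", doc)  # 2 occurrences out of 6 words
tf_dan = term_frequency("dan", doc)        # 2 occurrences out of 6 words
tf_hasil = term_frequency("hasil", doc)    # 1 occurrence out of 6 words
print(tf_pemilu, tf_dan, tf_hasil)
```

In this toy document the common word "dan" receives the same TF as the topical word "pemilu", which is the weakness that IDF is meant to correct.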

## Inverse Document Frequency
Inverse Document Frequency (IDF) measures how unique or rare a word is across the entire document collection (corpus).

Formula: `IDF(t) = log(N / df(t))`

Where:

- N = total number of documents

- df(t) = number of documents that contain the term t

A word that appears in many documents is less informative.
A word that appears in few documents is more distinctive and has a high IDF value.
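
The IDF formula can be sketched in a few lines of plain Python. The function name `inverse_document_frequency`, the four-document toy corpus, and the choice of the natural logarithm are assumptions for illustration (the natural logarithm matches the `np.log` call used later in this lab).

```python
import math

def inverse_document_frequency(term, documents):
    """Return log(N / df(term)), using the natural logarithm."""
    n_documents = len(documents)
    # df(t): number of documents that contain the term at least once
    df = sum(1 for doc in documents if term in doc.lower().split())
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(n_documents / df)

# Made-up corpus of four short documents
corpus = [
    "pemilu dan suara",
    "wisata dan kuliner",
    "teknologi dan aplikasi",
    "pemilu dan partai",
]
idf_dan = inverse_document_frequency("dan", corpus)        # df = 4, log(4/4)
idf_pemilu = inverse_document_frequency("pemilu", corpus)  # df = 2, log(4/2)
idf_wisata = inverse_document_frequency("wisata", corpus)  # df = 1, log(4/1)
print(idf_dan, idf_pemilu, idf_wisata)
```

A term found in every document gets an IDF of zero, while the rarest term gets the largest value.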

## TF-IDF
TF-IDF is the product of TF and IDF. It weights a word based on:
- How often the word appears in the document (TF),
- How rarely the word appears across the whole collection (IDF).

Formula: `TF-IDF(t, d) = TF(t, d) × IDF(t)`

A word that appears often in one document but rarely in the others has a high TF-IDF value, which marks it as important for that document.
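
Putting the two parts together, the following sketch weights every term of one toy document. The three-sentence corpus, the helper name `tf_idf`, and the whitespace tokenization are assumptions for this example only; the formulas are the unsmoothed ones given in this section, not scikit-learn's defaults.

```python
import math
from collections import Counter

# Made-up corpus; each document is split on whitespace
corpus = [
    "pemilu dan suara pemilu",
    "wisata dan kuliner",
    "teknologi dan aplikasi",
]
tokenized = [doc.split() for doc in corpus]
n_documents = len(tokenized)

# df(t): number of documents containing each term
document_frequency = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """TF-IDF(t, d) = TF(t, d) * IDF(t), with IDF(t) = log(N / df(t))."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_documents / document_frequency[term])
    return tf * idf

# Weight every distinct term of the first document
weights = {term: tf_idf(term, tokenized[0]) for term in set(tokenized[0])}
for term, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.4f}")
```

The word "pemilu" ranks first because it is frequent in this document and absent from the others, while "dan" drops to zero because it occurs in every document.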

<img src="https://tse4.mm.bing.net/th/id/OIP.JESrk-lt1T3U11qX0fzG1gHaDm?rs=1&pid=ImgDetMain&o=7&rm=3" width="108%" alt="TF-IDF example" />

## Conclusion

| Component  | Function                               | Indicates                      |
| ---------- | -------------------------------------- | ------------------------------ |
| **TF**     | Frequency of a word within a document  | Local importance               |
| **IDF**    | Rarity of a word across the collection | Global importance              |
| **TF-IDF** | Combination of TF and IDF              | Overall importance of the word |


## Next: Lab Steps 🫡

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB

In [4]:
!pip install Sastrawi

Collecting Sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl.metadata (909 bytes)
Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209 kB)
Installing collected packages: Sastrawi
Successfully installed Sastrawi-1.0.1


In [5]:
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

## Read the Dataset

In [6]:
data = pd.read_excel('dataKumparan1.xlsx')
data.head()

Unnamed: 0,Topic,Title,Content
0,Politik,"Pelanggaran Pemilu, Tiga Caleg di Sulteng Dipr...","Komisioner Bawaslu Sigi, Sulawesi Tengah, Agus..."
1,Politik,"Pemilu Susulan di Kota Jayapura, Suara Jokowi ...",Walaupun dua dari lima distrik melakukan pemil...
2,Politik,"Tsamara Amany Dipinang, Pengurus PSI Daerah Me...","Tsamara Amany, politisi Partai Solidaritas Ind..."
3,Politik,Ada 47 TPS di Sulawesi Utara Berpotensi Pemili...,Badan Pengawas Pemilu (Bawaslu) Provinsi Sulaw...
4,Politik,Ketua KPPS di Sleman Ditemukan Tewas Gantung D...,"Tugiman, Ketua Kelompok Penyelenggara Pemungut..."


### Dataset Size

In [12]:
print('Dataset size: ', data.shape)

Dataset size:  (60, 3)


### Splitting the Data into Training and Testing Sets

In [13]:
x_train, x_test, y_train, y_test = train_test_split(data['Content'], data['Topic'], train_size=0.5, test_size=0.16)
train_data = pd.DataFrame({'Content': x_train, 'Topic': y_train})
test_data = pd.DataFrame({'Content': x_test, 'Topic': y_test})
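
The `train_test_split` call above does not fix a random seed, so each run selects different rows. A minimal sketch, assuming a hypothetical seed of 42 and a synthetic 60-row table standing in for `dataKumparan1.xlsx`, shows how `random_state` makes the split repeatable and how the two fractions translate into row counts:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 60-article dataset (not the real data)
toy = pd.DataFrame({
    "Content": [f"article {i}" for i in range(60)],
    "Topic": ["Politik", "Teknologi", "Travel"] * 20,
})

def split(frame, seed=42):  # the seed value is an arbitrary assumption
    return train_test_split(
        frame["Content"], frame["Topic"],
        train_size=0.5, test_size=0.16, random_state=seed,
    )

x_train, x_test, y_train, y_test = split(toy)
x_train_again, _, _, _ = split(toy)

print(len(x_train), len(x_test))                   # 30 training rows, 10 test rows
print(x_train.index.equals(x_train_again.index))   # True: same rows on every run
```

With 60 rows, `train_size=0.5` yields 30 training rows and `test_size=0.16` yields 10 test rows, so the remaining 20 rows are not used by either set.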

### Training Data

In [14]:
df1 = pd.DataFrame(train_data)
print(df1)

                                              Content      Topic
1   Walaupun dua dari lima distrik melakukan pemil...    Politik
36  Platform media sosial populer Instagram kembal...  Teknologi
44  Hari Guru atau Minggu Apreasi Guru di Amerika ...     Travel
20  Facebook sedang mengembangkan asisten suara un...  Teknologi
49  Bila biasanya mesin fotokopi terdapat di kios-...     Travel
0   Komisioner Bawaslu Sigi, Sulawesi Tengah, Agus...    Politik
43  Anda suka dessert? Di berbagai media sosial, p...     Travel
4   Tugiman, Ketua Kelompok Penyelenggara Pemungut...    Politik
16  Pemungutan suara Pemilu 2019 telah usai. Tapi ...    Politik
59  Di balik kemegahan Pegunungan Tianzhu China, a...     Travel
30  Kemajuan teknologi memudahkan dan mengubah keb...  Teknologi
34  Persaingan aplikasi pesan instan kian sengit s...  Teknologi
31  Jelang bulan Ramadhan yang tinggal menghitung ...  Teknologi
11  Proses rekapitulasi surat suara melalui Sistem...    Politik
26  Sistem operasi iOS te

### Testing Data

In [15]:
df2 = pd.DataFrame(test_data)
print(df2)

                                              Content      Topic
54  Selain mempertimbangkan budaya, alam, dan kuli...     Travel
51  Tahun 2018 lalu, keluarga besar Kim Kardashian...     Travel
38  Kamu mungkin pernah merasa kesulitan untuk ber...  Teknologi
22  Amazon.com mengatakan akan menutup toko daring...  Teknologi
9   Seorang Ketua KPPS bernama Baharuddin Effendi ...    Politik
56  Sulawesi Utara (Sulut) menunjukkan komitmennya...     Travel
53  Untuk mendukung pengembangan pariwisata Sulawe...     Travel
40  Wahana hiburan Trampolin hadir pertama kali di...     Travel
21  Facebook hari Rabu mengaku telah secara tidak ...  Teknologi
5   Menteri Komunikasi dan Informatika (Menkominfo...    Politik


### Training and Testing Data Sizes

In [16]:
print('Training data size: ', train_data.shape)
print('Testing data size: ', test_data.shape)
n_train = train_data.shape[0]
n_test = test_data.shape[0]

Training data size:  (30, 2)
Testing data size:  (10, 2)


### Checking the Data

In [9]:
sparse_data = pd.concat([train_data, test_data], ignore_index=True)
sparse_data.head()
print('Combined data size: ', sparse_data.shape)
n_document = sparse_data.shape[0]

Combined data size:  (40, 2)


## With Stemming and Stop Words

### Preprocessing

In [17]:
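stemmerFactory = StemmerFactory()
stemmer = stemmerFactory.create_stemmer()


In [18]:
# Stemming: read the raw article text from 'Content' and store the stemmed
# result in a new 'teks' column, which the vectorizer reads later.
for row in range(n_document):
  sparse_data.loc[row, 'teks'] = stemmer.stem(sparse_data.loc[row, 'Content'])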

In [19]:
# Sparse_data.head()
df4 = pd.DataFrame(sparse_data)
print(df4)


                                              Content      Topic  \
0   Hari Guru atau Minggu Apreasi Guru di Amerika ...     Travel   
1   Peningkatan perolehan suara untuk Partai Keadi...    Politik   
2   Tsamara Amany, politisi Partai Solidaritas Ind...    Politik   
3   Pertamuan para kiai sepuh se-Jawa Timur yang d...    Politik   
4   Komisioner Bawaslu Sigi, Sulawesi Tengah, Agus...    Politik   
5   Anda suka dessert? Di berbagai media sosial, p...     Travel   
6   Super hype! Penayangan perdana film 'Avengers:...  Teknologi   
7   Menteri Komunikasi dan Informatika (Menkominfo...    Politik   
8   Hari Bumi yang diperingati pada tanggal 22 Apr...     Travel   
9   Perkembangan teknologi otomotif tidak hanya te...  Teknologi   
10  Kabar bahagia datang bagi para penyelam di sel...     Travel   
11  Pariwisata Bintan terus menunjukan perkembanga...     Travel   
12  Perusahaan e-commerce marketplace Tokopedia ke...  Teknologi   
13  Facebook sedang mengembangkan asisten suara 

### Computing the Weights

In [20]:
vectorize = CountVectorizer()
tf = vectorize.fit_transform(sparse_data['teks'])

print('Number of documents:', tf.shape[0])
print('Number of terms:', tf.shape[1])

Number of documents: 40
Number of terms: 2343


In [21]:
print('List of terms:')
vectorize.get_feature_names_out()

List of terms:


array(['00', '000', '01', ..., 'zamih', 'ziarah', 'zoetry'], dtype=object)

In [22]:
print('List of stop words:')
vectorize.get_stop_words()

List of stop words:


In [24]:
print('TF matrix:')
tf_matrix = pd.DataFrame(tf.toarray(), columns=vectorize.get_feature_names_out())
tf_matrix

TF matrix:


Unnamed: 0,00,000,01,02,03,04,052,08,09,10,...,yerusalem,yesus,yogyakarta,yunani,yusuf,zahid,zainuddin,zamih,ziarah,zoetry
0,0,0,0,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,2,0,0,0,0,0,0,...,0,0,0,0,1,0,1,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
6,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
print('TF matrix (training data only):')
tf_train = tf_matrix[:n_train]
tf_train.shape

TF matrix (training data only):


(30, 2343)

In [26]:
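transformer = TfidfTransformer(use_idf=True)

# Intended adjustment: compute df from the training rows only, so that the
# query / test documents do not contribute to df.
n = n_train
df = tf_train.astype(bool).sum(axis=0)
idf = np.log(n/df)  # terms that occur only in test documents have df = 0, so idf is inf
transformer.idf_ = idf

# Note: fit_transform() refits the transformer on every row of tf, so the idf_
# assigned above is recomputed from all 40 documents.
weight = transformer.fit_transform(tf)
print('Number of documents:', weight.shape[0])
print('Number of terms:', weight.shape[1])

Number of documents: 40
Number of terms: 2343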
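
One way to keep the test documents out of the document-frequency count is to fit the transformer on the training rows only and then transform every row. The sketch below uses a hypothetical three-document training set and one test document rather than the lab data. Note that scikit-learn's default IDF is the smoothed form `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row, which differs from the plain `log(N / df)` formula.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up documents; 'aplikasi' never appears in the training set
train_docs = ["pemilu suara partai", "wisata kuliner pantai", "pemilu partai koalisi"]
test_docs = ["pemilu wisata aplikasi"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(train_docs + test_docs)
n_train = len(train_docs)

transformer = TfidfTransformer(use_idf=True)
transformer.fit(counts[:n_train])        # df and idf come from the training rows only
weights = transformer.transform(counts)  # weight all rows with that idf

# Reproduce scikit-learn's smoothed idf by hand for comparison
train_counts = counts[:n_train].toarray()
df = (train_counts > 0).sum(axis=0)
manual_idf = np.log((1 + n_train) / (1 + df)) + 1

print(np.allclose(transformer.idf_, manual_idf))
print(weights.shape)
```

Because of the smoothing, a term with `df = 0` in the training rows still receives a finite IDF instead of a division by zero.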


In [27]:
weight_matrix = pd.DataFrame(weight.toarray(), columns=vectorize.get_feature_names_out())
weight_matrix

Unnamed: 0,00,000,01,02,03,04,052,08,09,10,...,yerusalem,yesus,yogyakarta,yunani,yusuf,zahid,zainuddin,zamih,ziarah,zoetry
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.086113,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.065716
1,0.0,0.0,0.0,0.049835,0.064545,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.029553,0.059106,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.034416,0.0,0.038277,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038709,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05908,0.0,0.0
6,0.0,0.026067,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.038412,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [28]:
# Split the weight matrix into training and testing rows

weight_train = weight_matrix[:n_train]
weight_test = weight_matrix[n_train:]

## Next: Computing Cosine Similarity with the Data from Lab 4
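
A minimal sketch of that next step, assuming small hand-made weight vectors in place of `weight_train` and `weight_test` (the numbers are illustrative, not results from the lab data). `cosine_similarity` compares each test row with every training row, and the highest score identifies the most similar training document.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy TF-IDF weights: 3 training documents and 1 test document over 4 terms
toy_weight_train = np.array([
    [0.8, 0.6, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.6, 0.0, 0.0, 0.8],
])
toy_weight_test = np.array([[0.8, 0.6, 0.0, 0.0]])

# Result has shape (n_test, n_train); entry [i, j] compares test i with train j
similarity = cosine_similarity(toy_weight_test, toy_weight_train)
best_match = similarity.argmax(axis=1)  # index of the closest training document

print(similarity.round(2))
print(best_match)
```

In a classification setting, the topic of the best-matching training document could then be used as the predicted topic of the test document.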

## POSTTEST 4

1. Complete the posttest by following the same steps as in the lab.
2. Build a stop word list from the data in the provided Excel file.
3. Analyze the TF-IDF results computed from the processed data, and explain your findings.


### Good Luck
<div align="center">
<img src="https://i.pinimg.com/1200x/24/00/2b/24002be0f4ae505370e288ea883713f0.jpg" alt="Haruhi Suzumiya banner" />
</div>