Week 3 tutorial: modeling word and document vectors with TF-IDF

Source: https://stackabuse.com/python-for-nlp-creating-tf-idf-model-from-scratch/

Import the required libraries

In [None]:
import nltk
import numpy as np
import random
import string
import re


Process the input text (several sentences) and tokenize it.

In [None]:
article_text = 'Saya sedang belajar Pemrosesan Bahasa Alami.\n'
article_text += 'Adik sedang belajar Bahasa Inggris.\n'
article_text += 'Ibu menonton video rekaman pelajaran Bahasa Arab.\n'
article_text += 'Jumlah angka penderita Covid-19 di Indonesia masih terus naik.\n'
article_text += 'Jumlah tenaga kesehatan yang terinfeksi Covid-19 masih bertambah.\n'
article_text += 'Masyarakat Indonesia harus disiplin supaya Covid-19 dapat terkendali.\n'


In [None]:
# Recent NLTK releases may package this tokenizer data as 'punkt_tab' instead
nltk.download('punkt')
corpus = nltk.sent_tokenize(article_text)

for i in range(len(corpus)):
    corpus[i] = corpus[i].lower()
    corpus[i] = re.sub(r'\W', ' ', corpus[i])  # replace punctuation with spaces
    corpus[i] = re.sub(r'\s+', ' ', corpus[i])  # collapse repeated whitespace

# Count how often each token occurs across the whole corpus
wordfreq = {}
for sentence in corpus:
    tokens = nltk.word_tokenize(sentence)
    for token in tokens:
        if token not in wordfreq:
            wordfreq[token] = 1
        else:
            wordfreq[token] += 1

import heapq
# Keep at most the 200 most frequent words; adjust this vocabulary size to suit the data.
most_freq = heapq.nlargest(200, wordfreq, key=wordfreq.get)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
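
The counting loop above can also be written with `collections.Counter` from the standard library. The sketch below is only an illustration: the two-sentence `sentences` list is a hypothetical stand-in for the cleaned `corpus`, and `str.split` replaces `nltk.word_tokenize`, which is equivalent only because the text has already been stripped of punctuation.

```python
from collections import Counter

# Hypothetical stand-in for the cleaned, lower-cased `corpus` built above.
sentences = [
    "saya sedang belajar pemrosesan bahasa alami",
    "adik sedang belajar bahasa inggris",
]

# Count every whitespace-separated token across all sentences.
wordfreq = Counter(token for s in sentences for token in s.split())

# Keep at most the 200 most frequent words, like heapq.nlargest(200, ...).
most_freq = [word for word, _ in wordfreq.most_common(200)]

print(wordfreq["bahasa"])
print(most_freq[:3])
```

`Counter` returns 0 for unseen tokens, so no explicit membership check is needed.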


Compute the idf values.

Note that each document here is **one sentence**.

In [None]:
word_idf_values = {}
for token in most_freq:
    doc_containing_word = 0
    for document in corpus:
        if token in nltk.word_tokenize(document):
            doc_containing_word += 1
    word_idf_values[token] = np.log10(len(corpus)/(doc_containing_word))

In [None]:
print(word_idf_values)

{'bahasa': 0.3010299956639812, 'covid': 0.3010299956639812, '19': 0.3010299956639812, 'sedang': 0.47712125471966244, 'belajar': 0.47712125471966244, 'jumlah': 0.47712125471966244, 'indonesia': 0.47712125471966244, 'masih': 0.47712125471966244, 'saya': 0.7781512503836436, 'pemrosesan': 0.7781512503836436, 'alami': 0.7781512503836436, 'adik': 0.7781512503836436, 'inggris': 0.7781512503836436, 'ibu': 0.7781512503836436, 'menonton': 0.7781512503836436, 'video': 0.7781512503836436, 'rekaman': 0.7781512503836436, 'pelajaran': 0.7781512503836436, 'arab': 0.7781512503836436, 'angka': 0.7781512503836436, 'penderita': 0.7781512503836436, 'di': 0.7781512503836436, 'terus': 0.7781512503836436, 'naik': 0.7781512503836436, 'tenaga': 0.7781512503836436, 'kesehatan': 0.7781512503836436, 'yang': 0.7781512503836436, 'terinfeksi': 0.7781512503836436, 'bertambah': 0.7781512503836436, 'masyarakat': 0.7781512503836436, 'harus': 0.7781512503836436, 'disiplin': 0.7781512503836436, 'supaya': 0.7781512503836436
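
The values printed above can be checked by hand. The corpus has 6 sentences, and 'bahasa' appears in 3 of them, so its idf is log10(6/3). A minimal sketch of that arithmetic follows; the helper name `idf` is hypothetical and uses `math.log10` instead of `np.log10` to stay dependency-free.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency with a base-10 log, as in the cell above."""
    return math.log10(n_docs / doc_freq)

print(idf(6, 3))  # a word found in 3 of 6 sentences, e.g. 'bahasa'
print(idf(6, 2))  # a word found in 2 of 6 sentences, e.g. 'sedang'
print(idf(6, 1))  # a word found in only 1 sentence, e.g. 'saya'
```

The rarer a word is across sentences, the larger its idf, which is why words unique to one sentence receive the highest weight.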

Compute the tf values. The code below uses the raw term count; the variant normalized by document length is left as a commented-out line.

In [None]:
word_tf_values = {}
for token in most_freq:
    sent_tf_vector = []
    for document in corpus:
        doc_freq = 0
        for word in nltk.word_tokenize(document):
            if token == word:
                doc_freq += 1
        word_tf = doc_freq  # raw count, no normalization
        # word_tf = doc_freq / len(nltk.word_tokenize(document))  # normalize by document length
        sent_tf_vector.append(word_tf)
    word_tf_values[token] = sent_tf_vector

In [None]:
print(word_tf_values)

{'bahasa': [1, 1, 1, 0, 0, 0], 'covid': [0, 0, 0, 1, 1, 1], '19': [0, 0, 0, 1, 1, 1], 'sedang': [1, 1, 0, 0, 0, 0], 'belajar': [1, 1, 0, 0, 0, 0], 'jumlah': [0, 0, 0, 1, 1, 0], 'indonesia': [0, 0, 0, 1, 0, 1], 'masih': [0, 0, 0, 1, 1, 0], 'saya': [1, 0, 0, 0, 0, 0], 'pemrosesan': [1, 0, 0, 0, 0, 0], 'alami': [1, 0, 0, 0, 0, 0], 'adik': [0, 1, 0, 0, 0, 0], 'inggris': [0, 1, 0, 0, 0, 0], 'ibu': [0, 0, 1, 0, 0, 0], 'menonton': [0, 0, 1, 0, 0, 0], 'video': [0, 0, 1, 0, 0, 0], 'rekaman': [0, 0, 1, 0, 0, 0], 'pelajaran': [0, 0, 1, 0, 0, 0], 'arab': [0, 0, 1, 0, 0, 0], 'angka': [0, 0, 0, 1, 0, 0], 'penderita': [0, 0, 0, 1, 0, 0], 'di': [0, 0, 0, 1, 0, 0], 'terus': [0, 0, 0, 1, 0, 0], 'naik': [0, 0, 0, 1, 0, 0], 'tenaga': [0, 0, 0, 0, 1, 0], 'kesehatan': [0, 0, 0, 0, 1, 0], 'yang': [0, 0, 0, 0, 1, 0], 'terinfeksi': [0, 0, 0, 0, 1, 0], 'bertambah': [0, 0, 0, 0, 1, 0], 'masyarakat': [0, 0, 0, 0, 0, 1], 'harus': [0, 0, 0, 0, 0, 1], 'disiplin': [0, 0, 0, 0, 0, 1], 'supaya': [0, 0, 0, 0, 0, 1], 'da
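
For comparison, the length-normalized tf divides each count by the number of tokens in the document. The sketch below assumes a document is already a list of tokens; `normalized_tf` is a hypothetical helper, and `str.split` stands in for `nltk.word_tokenize` on the cleaned first sentence.

```python
def normalized_tf(token, document_tokens):
    """Count of `token` divided by the number of tokens in the document."""
    if not document_tokens:
        return 0.0  # avoid dividing by zero for an empty document
    return document_tokens.count(token) / len(document_tokens)

# First sentence of the corpus after lower-casing and punctuation removal.
tokens = "saya sedang belajar pemrosesan bahasa alami".split()

print(normalized_tf("bahasa", tokens))  # 1 occurrence out of 6 tokens
print(normalized_tf("covid", tokens))   # absent from this sentence
```

With normalization, a word counts for less in a long sentence than in a short one, which matters when documents differ a lot in length.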

Compute the tf-idf values

In [None]:
tfidf_values = []
for token in word_tf_values.keys():
    tfidf_sentences = []  # one document is one sentence
    for tf_sentence in word_tf_values[token]:
        tf_idf_score = tf_sentence * word_idf_values[token]
        tfidf_sentences.append(tf_idf_score)
    tfidf_values.append(tfidf_sentences)

In [None]:
print(tfidf_values)

[[0.3010299956639812, 0.3010299956639812, 0.3010299956639812, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.3010299956639812, 0.3010299956639812, 0.3010299956639812], [0.0, 0.0, 0.0, 0.3010299956639812, 0.3010299956639812, 0.3010299956639812], [0.47712125471966244, 0.47712125471966244, 0.0, 0.0, 0.0, 0.0], [0.47712125471966244, 0.47712125471966244, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.47712125471966244, 0.47712125471966244, 0.0], [0.0, 0.0, 0.0, 0.47712125471966244, 0.0, 0.47712125471966244], [0.0, 0.0, 0.0, 0.47712125471966244, 0.47712125471966244, 0.0], [0.7781512503836436, 0.0, 0.0, 0.0, 0.0, 0.0], [0.7781512503836436, 0.0, 0.0, 0.0, 0.0, 0.0], [0.7781512503836436, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.7781512503836436, 0.0, 0.0, 0.0, 0.0], [0.0, 0.7781512503836436, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.7781512503836436, 0.0, 0.0, 0.0], [0.0, 0.0, 0.7781512503836436, 0.0, 0.0, 0.0], [0.0, 0.0, 0.7781512503836436, 0.0, 0.0, 0.0], [0.0, 0.0, 0.7781512503836436, 0.0, 0.0, 0.0], [0.0, 0.0, 0.778151250383

Display the result as an array. After the transpose, each row is a sentence and each column is a word. Look at how similar the sentence rows are to one another!

In [None]:
tf_idf_model = np.asarray(tfidf_values)
tf_idf_model = np.transpose(tf_idf_model)
print(tf_idf_model)

[[0.30103    0.         0.         0.47712125 0.47712125 0.
  0.         0.         0.77815125 0.77815125 0.77815125 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.        ]
 [0.30103    0.         0.         0.47712125 0.47712125 0.
  0.         0.         0.         0.         0.         0.77815125
  0.77815125 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.        ]
 [0.30103    0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.77815125 0.77815125 0.77815125 0.77815125 0.77815125
  0.77815125 0.         0.         0.         0.         0.
  0.         0.         0.  

Compute the cosine similarity between sentences
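
Cosine similarity is the dot product of two vectors divided by the product of their norms. A minimal dependency-free sketch follows; `cosine_similarity` is a hypothetical helper that is not part of the source tutorial, and returning 0.0 for an all-zero vector is an assumption made here to avoid a division by zero.

```python
import math

def cosine_similarity(u, v):
    """Return dot(u, v) / (||u|| * ||v||), or 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
```

Because the function only iterates over its arguments, it should also accept rows of `tf_idf_model` directly.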

Cosine similarity between sentences 1 and 2

In [None]:
# Divide the dot product by the product of both norms (note the parentheses)
cos_sim_01 = np.dot(tf_idf_model[0], tf_idf_model[1]) / (np.linalg.norm(tf_idf_model[0]) * np.linalg.norm(tf_idf_model[1]))
print(f"{cos_sim_01:.4f}")

0.2680


Cosine similarity between sentences 1 and 3

In [None]:
# Divide the dot product by the product of both norms (note the parentheses)
cos_sim_02 = np.dot(tf_idf_model[0], tf_idf_model[2]) / (np.linalg.norm(tf_idf_model[0]) * np.linalg.norm(tf_idf_model[2]))
print(f"{cos_sim_02:.4f}")

0.0306


Cosine similarity between sentences 1 and 4

In [None]:
# Divide the dot product by the product of both norms (note the parentheses)
cos_sim_03 = np.dot(tf_idf_model[0], tf_idf_model[3]) / (np.linalg.norm(tf_idf_model[0]) * np.linalg.norm(tf_idf_model[3]))
print(f"{cos_sim_03:.4f}")

0.0000


Cosine similarity between sentences 4 and 5

In [None]:
# Divide the dot product by the product of both norms (note the parentheses)
cos_sim_34 = np.dot(tf_idf_model[3], tf_idf_model[4]) / (np.linalg.norm(tf_idf_model[3]) * np.linalg.norm(tf_idf_model[4]))
print(f"{cos_sim_34:.4f}")

0.1686


Exercise: how would you compute the cosine similarity between words?