# Bag of words model as a text representation
Bag of words simplifies a text to a collection of its words, ignoring grammar and the position of each word in the sentence. The text is converted to lowercase and punctuation is ignored.
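
To make the idea concrete, here is a minimal hand-rolled sketch that uses only the standard library. The function name `bag_of_words` is hypothetical, and the regular expression (runs of two or more word characters) is an assumption chosen to mirror the kind of tokenization seen in the outputs further down; it is not the library's implementation.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring case, punctuation and word order."""
    # Lowercase first, then keep runs of two or more word characters.
    # Punctuation such as '.' and '-' acts as a separator and is dropped.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)


counts = bag_of_words('Linux distributions include the Linux kernel.')
print(counts)
```

The word "linux" is counted twice because its two occurrences differ only in position, and position is exactly what the model discards.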

In [14]:
corpus = ['Linux has been around since the mid-1990s.',
          'Linux distributions include the Linux kernel.',
          'Linux is one of the most prominent open-source software.']
corpus

['Linux has been around since the mid-1990s.',
 'Linux distributions include the Linux kernel.',
 'Linux is one of the most prominent open-source software.']

Applying the bag of words model with CountVectorizer
- The bag of words model can be applied with scikit-learn's CountVectorizer.

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorized_X = vectorizer.fit_transform(corpus).todense()
vectorized_X

matrix([[1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1],
        [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1]],
       dtype=int64)

In [16]:
vectorizer.get_feature_names_out()

array(['1990s', 'around', 'been', 'distributions', 'has', 'include', 'is',
       'kernel', 'linux', 'mid', 'most', 'of', 'one', 'open', 'prominent',
       'since', 'software', 'source', 'the'], dtype=object)
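
Each column of the matrix corresponds to the feature name at the same index. The short sketch below pairs the second row of the matrix with the feature names, both copied by hand from the outputs above, to show which words that row counts. The variable names are illustrative only.

```python
# Feature names and the second matrix row, copied from the outputs above.
feature_names = ['1990s', 'around', 'been', 'distributions', 'has', 'include', 'is',
                 'kernel', 'linux', 'mid', 'most', 'of', 'one', 'open', 'prominent',
                 'since', 'software', 'source', 'the']
row_doc2 = [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Keep only the words that actually occur in document 2.
nonzero = {name: count for name, count in zip(feature_names, row_doc2) if count}
print(nonzero)
# {'distributions': 1, 'include': 1, 'kernel': 1, 'linux': 2, 'the': 1}
```

This matches the second sentence of the corpus, 'Linux distributions include the Linux kernel.', where "linux" appears twice.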

In [17]:
# Euclidean distance measures how close two documents (vectors) are
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Work on a plain ndarray; newer scikit-learn releases may reject np.matrix input
X = np.asarray(vectorized_X)
for i in range(len(X)):
    for j in range(i + 1, len(X)):
        # Slicing keeps each row 2-D (1 x n_features), as euclidean_distances expects
        jarak = euclidean_distances(X[i:i + 1], X[j:j + 1])
        print(f'Distance between documents {i+1} and {j+1}: {jarak}')

Distance between documents 1 and 2: [[3.16227766]]
Distance between documents 1 and 3: [[3.74165739]]
Distance between documents 2 and 3: [[3.46410162]]




The Euclidean distances show that documents 1 and 2 are the closest pair (about 3.16), compared with documents 2 and 3 (about 3.46) and documents 1 and 3 (about 3.74).
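
The reported values can be checked by hand. The Euclidean distance between two count vectors is the square root of the sum of squared differences of their components. The sketch below applies that formula to the three rows of the matrix printed earlier, copied by hand. The helper name `euclidean` is hypothetical and only the standard library is used.

```python
import math

# Rows of the bag of words matrix, copied from the output above.
doc1 = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
doc2 = [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
doc3 = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1]


def euclidean(a, b):
    """Square root of the sum of squared component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(round(euclidean(doc1, doc2), 8))  # 3.16227766, i.e. sqrt(10)
print(round(euclidean(doc1, doc3), 8))  # 3.74165739, i.e. sqrt(14)
print(round(euclidean(doc2, doc3), 8))  # 3.46410162, i.e. sqrt(12)
```

Documents 1 and 2 differ in ten vector components by one count each, which gives the square root of 10, the smallest of the three distances.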

# Stop word filtering on text
Stop word filtering simplifies the text representation by ignoring common words such as determiners (the, a, an), auxiliary verbs (do, be, will), and prepositions (on, in, at).
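
As a rough illustration of the idea, the sketch below removes stop words from a tokenized sentence using only the standard library. The `STOP_WORDS` set and the function name are hypothetical and deliberately tiny; a real stop word list, such as the one a library ships, is much larger.

```python
import re

# Hypothetical, deliberately small stop list for illustration only.
STOP_WORDS = {'the', 'an', 'do', 'be', 'will', 'on', 'in', 'at',
              'is', 'of', 'has', 'been'}


def tokenize_without_stop_words(text):
    """Lowercase, split into words, and drop anything in STOP_WORDS."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


filtered = tokenize_without_stop_words(
    'Linux is one of the most prominent open-source software.')
print(filtered)
# ['linux', 'one', 'most', 'prominent', 'open', 'source', 'software']
```

With this small list, "one" and "most" survive. A larger list may remove them as well, which shrinks the vocabulary further.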

In [18]:
corpus

['Linux has been around since the mid-1990s.',
 'Linux distributions include the Linux kernel.',
 'Linux is one of the most prominent open-source software.']

Applying stop word filtering with CountVectorizer
- Stop word filtering can be applied by passing stop_words='english' to CountVectorizer.

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english')
vectorized_X = vectorizer.fit_transform(corpus).todense()
vectorized_X

matrix([[1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
        [0, 1, 1, 1, 2, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]], dtype=int64)

In [22]:
vectorizer.get_feature_names_out()

array(['1990s', 'distributions', 'include', 'kernel', 'linux', 'mid',
       'open', 'prominent', 'software', 'source'], dtype=object)