# **Bag of Words & Stop Word Filtering**

# Bag of Words

Bag of words simplifies the representation of text to a collection of words, ignoring grammar and the position of each word in the sentence. The text is converted to lowercase and punctuation is ignored.
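As a minimal sketch of this idea, independent of any library, the snippet below lowercases a sentence, drops punctuation, and counts the remaining words. The function name `bag_of_words` and the two example sentences are assumptions made for illustration; they are not part of the dataset used in this notebook.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word counts for `text`, ignoring case, punctuation, and word order."""
    # Lowercase first, then keep only runs of letters and digits as words.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(words)

first = bag_of_words("Dog bites man.")
second = bag_of_words("Man bites dog!")

print(first)
print(first == second)  # True: same words, only the order differs
```

The two sentences mean different things, yet they produce the same bag. That loss of word order is the simplification the model makes.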

**Dataset**

In [2]:
corpus = [
    'Linux has been around since the mid-1990s.',
    'Linux distributions include the Linux kernel.',
    'Linux is one of the most prominent open-source software'
]

corpus

['Linux has been around since the mid-1990s.',
 'Linux distributions include the Linux kernel.',
 'Linux is one of the most prominent open-source software']

**Bag of Words model with CountVectorizer**

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorized_X = vectorizer.fit_transform(corpus).todense()
vectorized_X

matrix([[1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1],
        [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1]],
       dtype=int64)

In [4]:
vectorizer.get_feature_names()

['1990s',
 'around',
 'been',
 'distributions',
 'has',
 'include',
 'is',
 'kernel',
 'linux',
 'mid',
 'most',
 'of',
 'one',
 'open',
 'prominent',
 'since',
 'software',
 'source',
 'the']
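To see how the matrix columns relate to the feature names, the counts can be rebuilt by hand. This is only a sketch: it assumes tokenization by lowercasing and matching the regular expression `(?u)\b\w\w+\b` (tokens of two or more word characters), which is the documented default `token_pattern` of `CountVectorizer`. The names `TOKEN_PATTERN`, `tokenize`, `vocabulary`, and `count_matrix` are hypothetical helpers, not scikit-learn API.

```python
import re

corpus = [
    'Linux has been around since the mid-1990s.',
    'Linux distributions include the Linux kernel.',
    'Linux is one of the most prominent open-source software'
]

# Assumed to mirror CountVectorizer's default token_pattern:
# tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(document):
    """Lowercase the document and split it into tokens."""
    return TOKEN_PATTERN.findall(document.lower())

# Sorted unique tokens; a term's position in this list is its column index.
vocabulary = sorted({token for document in corpus for token in tokenize(document)})

# One row per document, one column per vocabulary term.
count_matrix = [
    [tokenize(document).count(term) for term in vocabulary]
    for document in corpus
]

linux_column = vocabulary.index('linux')
print(linux_column)                   # 8
print(count_matrix[1][linux_column])  # 2: 'linux' occurs twice in document 2
```

Each row is one document and each column is one term in alphabetical order, which is why the second row holds a `2` in the `linux` column.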

**Euclidean distance to measure the distance between documents**

In [5]:
from sklearn.metrics.pairwise import euclidean_distances

# Compare every unique pair of documents (i < j).
for i in range(len(vectorized_X)):
    for j in range(i + 1, len(vectorized_X)):
        jarak = euclidean_distances(vectorized_X[i], vectorized_X[j])
        print(f'Distance between document {i+1} and {j+1}: {jarak}')

Distance between document 1 and 2: [[3.16227766]]
Distance between document 1 and 3: [[3.74165739]]
Distance between document 2 and 3: [[3.46410162]]
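The Euclidean distance between two count vectors is the square root of the sum of their squared element-wise differences. The sketch below recomputes the three distances with the standard library only, using the rows copied from the matrix output shown earlier; the helper name `euclidean` is an assumption for illustration.

```python
import math

# Rows copied from the CountVectorizer output above (one per document).
doc_1 = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
doc_2 = [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
doc_3 = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1]

def euclidean(u, v):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(euclidean(doc_1, doc_2))  # sqrt(10): the vectors differ by 1 in ten positions
print(euclidean(doc_1, doc_3))  # sqrt(14)
print(euclidean(doc_2, doc_3))  # sqrt(12)
```

Documents 1 and 2 have the smallest distance here, so under this measure they are the most similar pair in the corpus.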


# Stop Word Filtering on Text

Stop word filtering simplifies the text representation by ignoring certain words such as determiners, auxiliary verbs, and prepositions.

**Dataset**

In [6]:
corpus

['Linux has been around since the mid-1990s.',
 'Linux distributions include the Linux kernel.',
 'Linux is one of the most prominent open-source software']

**Stop word filtering with CountVectorizer**

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
vectorized_X = vectorizer.fit_transform(corpus).todense()
vectorized_X

matrix([[1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
        [0, 1, 1, 1, 2, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]], dtype=int64)

In [8]:
vectorizer.get_feature_names()

['1990s',
 'distributions',
 'include',
 'kernel',
 'linux',
 'mid',
 'open',
 'prominent',
 'software',
 'source']
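Conceptually, stop word filtering is a set-membership test applied to each term. The sketch below removes stop words from the 19-term vocabulary obtained earlier. Note that `STOP_WORDS` here is a hand-picked, hypothetical subset containing only the nine words that disappeared from this corpus; the built-in English list used by `stop_words='english'` is much longer.

```python
# Vocabulary produced earlier without stop word filtering.
full_vocabulary = [
    '1990s', 'around', 'been', 'distributions', 'has', 'include', 'is',
    'kernel', 'linux', 'mid', 'most', 'of', 'one', 'open', 'prominent',
    'since', 'software', 'source', 'the',
]

# Hypothetical subset for illustration only: just the words that were
# dropped from this corpus, not scikit-learn's full English list.
STOP_WORDS = {'around', 'been', 'has', 'is', 'most', 'of', 'one', 'since', 'the'}

# Keep every term that is not a stop word, preserving alphabetical order.
filtered_vocabulary = [term for term in full_vocabulary if term not in STOP_WORDS]

print(filtered_vocabulary)
print(len(full_vocabulary), '->', len(filtered_vocabulary))  # 19 -> 10
```

Dropping these words shrinks each document vector from 19 to 10 dimensions while keeping the terms that carry most of the topical content.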