# 10 Introduction to Text Processing: Bag of Words & Stop Word Filtering

## The Bag of Words model as a text representation

Bag of Words simplifies the representation of a text by treating it as a collection of words, ignoring grammar and the position of each word in the sentence. Only the number of times each word occurs is kept. The text is converted to lowercase and punctuation is ignored.

Reference: [https://en.wikipedia.org/wiki/Bag-of-words_model](https://en.wikipedia.org/wiki/Bag-of-words_model)
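As a rough illustration of the description above, the sketch below builds a bag of words by hand using only the standard library. The sample sentence, the `bag_of_words` helper, and the simple `[a-z0-9]+` tokenizer are assumptions made for this example and are not part of any library.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a word -> count mapping, ignoring case, punctuation, and word order."""
    # Lowercase first, then keep only runs of letters/digits (punctuation is dropped)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Illustrative sentence (not taken from the dataset)
bow = bag_of_words("The cat sat on the mat. The mat was flat!")
print(bow.most_common(2))
```

Two texts that contain the same words in a different order produce identical counts, which is exactly the information this model discards.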

### Dataset

In [1]:
corpus = [
    'Linux has been around since the mid-1990s and has since reached a user-base that spans the globe.',
    'Distributions include the Linux kernel and supporting system software and libraries.',
    'Linux is one of the most prominent examples of free and open-source software collaboration.'
]

### Bag of Words model with `CountVectorizer`

The Bag of Words model can be applied using scikit-learn's `CountVectorizer`.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorized_X = vectorizer.fit_transform(corpus).todense()
vectorized_X

matrix([[1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
         0, 1, 2, 0, 0, 1, 0, 0, 1, 2, 1],
        [0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
         0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0],
        [0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2, 1, 1,
         1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0]])

In [3]:
vectorizer.vocabulary_

{'linux': 15,
 'has': 10,
 'been': 4,
 'around': 2,
 'since': 23,
 'the': 30,
 'mid': 16,
 '1990s': 0,
 'and': 1,
 'reached': 22,
 'user': 31,
 'base': 3,
 'that': 29,
 'spans': 26,
 'globe': 9,
 'distributions': 6,
 'include': 11,
 'kernel': 13,
 'supporting': 27,
 'system': 28,
 'software': 24,
 'libraries': 14,
 'is': 12,
 'one': 19,
 'of': 18,
 'most': 17,
 'prominent': 21,
 'examples': 7,
 'free': 8,
 'open': 20,
 'source': 25,
 'collaboration': 5}

In [4]:
dict(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[0]))

{'1990s': 0,
 'and': 1,
 'around': 2,
 'base': 3,
 'been': 4,
 'collaboration': 5,
 'distributions': 6,
 'examples': 7,
 'free': 8,
 'globe': 9,
 'has': 10,
 'include': 11,
 'is': 12,
 'kernel': 13,
 'libraries': 14,
 'linux': 15,
 'mid': 16,
 'most': 17,
 'of': 18,
 'one': 19,
 'open': 20,
 'prominent': 21,
 'reached': 22,
 'since': 23,
 'software': 24,
 'source': 25,
 'spans': 26,
 'supporting': 27,
 'system': 28,
 'that': 29,
 'the': 30,
 'user': 31}
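Each value in `vocabulary_` is the column index of that word in the matrix printed earlier. As a small sketch of how to read one row, the snippet below rebuilds the mapping from the alphabetically sorted word list and labels the non-zero counts of the second document. The row is copied from the matrix output above; the names `words`, `row_doc2`, and `counts_doc2` are made up for this example.

```python
# Words in alphabetical order, as in the sorted vocabulary above
words = (
    "1990s and around base been collaboration distributions examples free "
    "globe has include is kernel libraries linux mid most of one open "
    "prominent reached since software source spans supporting system that "
    "the user"
).split()

# Word -> column index, equivalent to the sorted vocabulary_ output
vocabulary = {word: index for index, word in enumerate(words)}

# Second row of the count matrix (document 2), copied from the output above
row_doc2 = [0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
            0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Keep only the words that actually occur in document 2
counts_doc2 = {
    word: row_doc2[index]
    for word, index in vocabulary.items()
    if row_doc2[index] > 0
}
print(counts_doc2)
```

In scikit-learn 1.0 and later, `vectorizer.get_feature_names_out()` returns the words in this same column order, so the word list does not have to be typed by hand.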

### Euclidean Distance for measuring the closeness/distance between documents (vectors)


In [5]:
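import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Newer scikit-learn releases reject np.matrix input, so convert to a plain ndarray first
X = np.asarray(vectorized_X)

# Visit each unordered pair of documents exactly once
for i in range(len(X)):
    for j in range(i + 1, len(X)):
        # Slicing keeps each row 2-D, as euclidean_distances expects
        distance = euclidean_distances(X[i:i + 1], X[j:j + 1])
        print(f'Distance between documents {i+1} and {j+1}: {distance}')

Distance between documents 1 and 2: [[5.19615242]]
Distance between documents 1 and 3: [[5.74456265]]
Distance between documents 2 and 3: [[4.47213595]]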
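For reference, the Euclidean distance between two count vectors is the square root of the sum of the squared differences of their components. The sketch below recomputes the first distance by hand, using the first two rows of the count matrix printed earlier; the helper name `euclidean` is hypothetical and only the standard library is used.

```python
import math


def euclidean(a, b):
    """Square root of the sum of squared component-wise differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Rows 1 and 2 of the Bag of Words matrix, copied from the output above
doc1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
        0, 1, 2, 0, 0, 1, 0, 0, 1, 2, 1]
doc2 = [0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(euclidean(doc1, doc2))
```

A smaller distance means the two documents have more similar word counts; in the output above, documents 2 and 3 are the closest pair.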


## Stop Word Filtering on text

Stop Word Filtering simplifies the text representation by ignoring certain common words, such as determiners (the, a, an), auxiliary verbs (do, be, will), and prepositions (on, in, at).

Reference: [https://en.wikipedia.org/wiki/Stop_word](https://en.wikipedia.org/wiki/Stop_word)
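The idea can be sketched in a few lines of plain Python. The `STOP_WORDS` set below is a tiny stop list invented for this example (real lists are far longer), and `remove_stop_words` is a hypothetical helper, not a library function.

```python
import re

# Tiny illustrative stop list (an assumption; real lists contain hundreds of words)
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "on", "in", "at"}


def remove_stop_words(text):
    """Lowercase the text, split it into words, and drop the stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


filtered = remove_stop_words(
    "Distributions include the Linux kernel and supporting system software and libraries."
)
print(filtered)
```

The remaining words carry most of the topical content, so the resulting vectors are shorter and less dominated by words that appear in nearly every document.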

### Dataset

In [6]:
corpus

['Linux has been around since the mid-1990s and has since reached a user-base that spans the globe.',
 'Distributions include the Linux kernel and supporting system software and libraries.',
 'Linux is one of the most prominent examples of free and open-source software collaboration.']

### Stop Word Filtering with `CountVectorizer`

Stop Word Filtering can also be applied using `CountVectorizer`, by passing `stop_words='english'`.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
vectorized_X = vectorizer.fit_transform(corpus).todense()
vectorized_X

matrix([[1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1],
        [0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0],
        [0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0]])

In [8]:
vectorizer.vocabulary_

{'linux': 10,
 'mid': 11,
 '1990s': 0,
 'reached': 14,
 'user': 19,
 'base': 1,
 'spans': 17,
 'globe': 6,
 'distributions': 3,
 'include': 7,
 'kernel': 8,
 'supporting': 18,
 'software': 15,
 'libraries': 9,
 'prominent': 13,
 'examples': 4,
 'free': 5,
 'open': 12,
 'source': 16,
 'collaboration': 2}
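Comparing the two vocabularies printed in this notebook shows which words the `'english'` stop list removed. The sketch below does this with a plain set difference; both word lists are copied from the outputs above, and the variable names are illustrative.

```python
# Vocabulary without stop word filtering (32 words, from the earlier output)
full_vocabulary = set(
    "1990s and around base been collaboration distributions examples free "
    "globe has include is kernel libraries linux mid most of one open "
    "prominent reached since software source spans supporting system that "
    "the user".split()
)

# Vocabulary with stop_words='english' (20 words, from the output above)
filtered_vocabulary = set(
    "1990s base collaboration distributions examples free globe include "
    "kernel libraries linux mid open prominent reached software source "
    "spans supporting user".split()
)

# Words present before filtering but absent afterwards
removed = sorted(full_vocabulary - filtered_vocabulary)
print(removed)
```

Judging by these outputs, the list removes more than determiners and prepositions: `system` is dropped as well, so it is worth checking what a stop list removes before relying on it.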