# Tutorial 10
# Text Processing: Bag of Words & Stop Word Filtering

### 1. Bag of Words
Bag of Words **simplifies the representation of text into a collection of words, ignoring grammar and the position of each word in the sentence**. The text is converted to lowercase and punctuation is ignored.
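To make the "ignores word position" idea concrete, here is a minimal standard-library sketch. The helper `bag_of_words` and the two example sentences are illustrative assumptions, not part of the original notebook. The regular expression keeps tokens of two or more word characters, which is meant to approximate the default tokenization of CountVectorizer.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Hypothetical helper: lowercase the text, drop punctuation, count tokens."""
    # Keep runs of two or more word characters; punctuation acts as a separator.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

# Two made-up sentences with the same words in a different order.
bag_a = bag_of_words('Linux distributions include the Linux Kernel.')
bag_b = bag_of_words('The Linux Kernel, Linux distributions include!')

print(bag_a)
print(bag_a == bag_b)  # word order and punctuation do not affect the bag
```

Because only word counts are kept, both sentences produce the same bag even though one of them is not grammatical.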

#### - Sample Dataset

In [10]:
corpus = [
    'Linux has been around since the mid-1190s.', 
    'Linux distributions include the Linux Kernel.',
    'Linux is one of the most prominent open-source software.'
    ]

corpus

['Linux has been around since the mid-1190s.',
 'Linux distributions include the Linux Kernel.',
 'Linux is one of the most prominent open-source software.']

#### - Bag of Words with CountVectorizer
The Bag of Words model **can be applied using CountVectorizer**.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorized_x = vectorizer.fit_transform(corpus).todense()
vectorized_x

matrix([[1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1],
        [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1]],
       dtype=int64)

In [12]:
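# Note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions, use vectorizer.get_feature_names_out() instead.
vectorizer.get_feature_names()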

['1190s',
 'around',
 'been',
 'distributions',
 'has',
 'include',
 'is',
 'kernel',
 'linux',
 'mid',
 'most',
 'of',
 'one',
 'open',
 'prominent',
 'since',
 'software',
 'source',
 'the']
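As a cross-check on how the matrix columns line up with the feature names, the following sketch rebuilds the vocabulary and the count matrix with the standard library only. It is an approximation of what CountVectorizer does under its default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary), not the library's actual implementation. The names `tokenized`, `vocabulary`, and `count_matrix` are chosen for illustration.

```python
import re
from collections import Counter

corpus = [
    'Linux has been around since the mid-1190s.',
    'Linux distributions include the Linux Kernel.',
    'Linux is one of the most prominent open-source software.',
]

# Tokenize each document: lowercase, keep runs of two or more word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]

# The vocabulary is the sorted set of all tokens; each token gets one column.
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one count per vocabulary term.
count_matrix = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

for row in count_matrix:
    print(row)
```

Each column index corresponds to the term at the same position in `vocabulary`, so the value 2 in the second row sits in the `'linux'` column.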

#### - Euclidean Distance for Measuring the Distance between Documents (Vectors)

In [14]:
from sklearn.metrics.pairwise import euclidean_distances

# Compare every unique pair of documents (i < j), skipping self-comparisons
for i in range(len(vectorized_x)):
    for j in range(i + 1, len(vectorized_x)):
        distance = euclidean_distances(vectorized_x[i], vectorized_x[j])
        print(f'Distance between documents {i+1} and {j+1}: {distance}')

Distance between documents 1 and 2: [[3.16227766]]
Distance between documents 1 and 3: [[3.74165739]]
Distance between documents 2 and 3: [[3.46410162]]
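The first reported value can be reproduced by hand. The sketch below applies the Euclidean distance formula, the square root of the sum of squared differences, to the first two rows of the count matrix shown earlier. The helper name `euclidean` is an assumption made for illustration.

```python
import math

# Rows 1 and 2 of the count matrix produced by CountVectorizer above.
doc1 = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
doc2 = [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

def euclidean(a, b):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(round(euclidean(doc1, doc2), 8))  # 3.16227766
```

The two vectors differ by 1 in ten positions, so the distance is the square root of 10.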


### 2. Stop Word Filtering on Text
Stop Word Filtering **simplifies the representation of text by ignoring certain words, such as determiners (the, a, an), auxiliary verbs (do, be, will), and prepositions (on, in, at)**.
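The filtering step itself is simple to sketch without any library. The stop word set below is a small, hand-picked assumption for illustration and is far shorter than the English list that ships with scikit-learn. The helper `filter_stop_words` is likewise hypothetical.

```python
import re

# A tiny illustrative stop word list (an assumption, not scikit-learn's list).
STOP_WORDS = {'the', 'a', 'an', 'do', 'be', 'will', 'on', 'in', 'at'}

def filter_stop_words(text):
    """Tokenize in lowercase and drop any token found in STOP_WORDS."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

filtered = filter_stop_words('Linux distributions include the Linux Kernel.')
print(filtered)  # ['linux', 'distributions', 'include', 'linux', 'kernel']
```

Removing such words shrinks the vocabulary, so each document vector has fewer dimensions.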

#### - Sample Dataset

In [15]:
corpus = [
    'Linux has been around since the mid-1190s.', 
    'Linux distributions include the Linux Kernel.',
    'Linux is one of the most prominent open-source software.'
    ]

corpus

['Linux has been around since the mid-1190s.',
 'Linux distributions include the Linux Kernel.',
 'Linux is one of the most prominent open-source software.']

#### - Stop Word Filtering with CountVectorizer
Stop Word Filtering can also be **applied using CountVectorizer**.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
vectorized_x = vectorizer.fit_transform(corpus).todense()
vectorized_x

matrix([[1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
        [0, 1, 1, 1, 2, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]], dtype=int64)

In [18]:
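# Note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions, use vectorizer.get_feature_names_out() instead.
vectorizer.get_feature_names()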

['1190s',
 'distributions',
 'include',
 'kernel',
 'linux',
 'mid',
 'open',
 'prominent',
 'software',
 'source']