# Feature Extraction from Text

### Bag of Words Model

In [1]:
import pandas as pd

In [2]:
# example texts for training the model
corpus = [
    'Kami sedang belajar data science',
    'Kami mempelajari Machine Learning untuk data teks',
    'Data Science adalah Ilmu Data',
    'Kami sangat antusias belajar Data Science',
    'Banyak data data bertebaran di Internet'
]

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to convert text into a matrix of token counts:

In [3]:
# initialize the bag-of-words model
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [4]:
# learn the vocabulary of the corpus
vect.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [5]:
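# inspect the vocabulary
# (scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2)
vect.get_feature_names()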

['adalah',
 'antusias',
 'banyak',
 'belajar',
 'bertebaran',
 'data',
 'di',
 'ilmu',
 'internet',
 'kami',
 'learning',
 'machine',
 'mempelajari',
 'sangat',
 'science',
 'sedang',
 'teks',
 'untuk']

In [7]:
# transform the corpus list into a feature matrix
X_vect = vect.transform(corpus)
X_vect

<5x18 sparse matrix of type '<class 'numpy.int64'>'
	with 27 stored elements in Compressed Sparse Row format>

In [None]:
'''
Dense matrix (3x3)
1 0 0
0 5 0
0 0 1

Sparse matrix: value (row, column)
1 (1,1)
5 (2,2)
1 (3,3)
shape: 3x3
'''
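
The illustration above can be reproduced with `scipy.sparse`, the package scikit-learn relies on for its sparse output. This is a minimal sketch; the names `dense`, `sparse`, and `triples` are illustrative and not part of the notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix

# the 3x3 dense matrix from the illustration above
dense = np.array([[1, 0, 0],
                  [0, 5, 0],
                  [0, 0, 1]])

# the sparse version stores only the non-zero values and their positions
sparse = csr_matrix(dense)

# the COO view exposes (row, column, value) triples; indices are 0-based here,
# whereas the illustration above counts from 1
coo = sparse.tocoo()
triples = sorted((int(r), int(c), int(v))
                 for r, c, v in zip(coo.row, coo.col, coo.data))
print(triples)
print(sparse.shape, sparse.nnz)
```

Only three values are stored instead of nine, which is the saving that matters once the vocabulary grows large.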

In [8]:
# convert the sparse matrix to a dense matrix
X_vect.toarray()

array([[0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1],
       [1, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0],
       [0, 0, 1, 0, 1, 2, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [9]:
# inspect what each feature means using a pandas DataFrame
pd.DataFrame(X_vect.toarray(),
             columns=vect.get_feature_names())

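| | adalah | antusias | banyak | belajar | bertebaran | data | di | ilmu | internet | kami | learning | machine | mempelajari | sangat | science | sedang | teks | untuk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
| 2 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 1 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |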


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
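
The last sentence of the quote, that word positions are ignored, can be checked directly. The sketch below uses two made-up documents (`docs` is a hypothetical example, not part of the corpus above) containing the same words in a different order.

```python
from sklearn.feature_extraction.text import CountVectorizer

# two hypothetical documents with the same words in a different order
docs = ['data science belajar kami', 'kami belajar data science']

bow = CountVectorizer()
counts = bow.fit_transform(docs).toarray()

# both rows are identical because only the word counts are kept
print(counts)
print((counts[0] == counts[1]).all())
```

A model trained on these vectors cannot tell the two sentences apart, which is the trade-off the Bag of Words representation accepts.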

In [10]:
# check the type of the corpus
type(corpus)

list

In [11]:
# print the corpus
print(corpus)

['Kami sedang belajar data science', 'Kami mempelajari Machine Learning untuk data teks', 'Data Science adalah Ilmu Data', 'Kami sangat antusias belajar Data Science', 'Banyak data data bertebaran di Internet']


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).

> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
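
To see how this applies to the small corpus used here, the following sketch measures how many cells of the count matrix are actually stored. The corpus is repeated so the snippet runs on its own; `density` is an illustrative name.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Kami sedang belajar data science',
    'Kami mempelajari Machine Learning untuk data teks',
    'Data Science adalah Ilmu Data',
    'Kami sangat antusias belajar Data Science',
    'Banyak data data bertebaran di Internet'
]

X = CountVectorizer().fit_transform(corpus)

n_rows, n_cols = X.shape
total_cells = n_rows * n_cols

# nnz is the number of explicitly stored (non-zero) entries
density = X.nnz / total_cells
print(X.nnz, total_cells, density)
```

With only five short sentences about 30% of the cells are non-zero, so the saving is modest; the figures quoted above (more than 99% zeros) describe much larger corpora.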

In [12]:
# example text for testing the model
simple_test = ["saya adalah data scientist"]

To **make predictions**, a new observation must have **the same features as the training data**, both in number and in meaning.

In [13]:
# transform the new text into a matrix
simple_test_vect = vect.transform(simple_test)

In [14]:
# inspect it using a pandas DataFrame
pd.DataFrame(simple_test_vect.toarray(), columns=vect.get_feature_names())

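| | adalah | antusias | banyak | belajar | bertebaran | data | di | ilmu | internet | kami | learning | machine | mempelajari | sangat | science | sedang | teks | untuk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |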
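
Only two of the four words in the test sentence show up in the result. A short sketch of why, assuming the default `CountVectorizer` settings used above: words that were not seen during `fit` have no column, so `transform` silently drops them. The names `known` and `unknown` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Kami sedang belajar data science',
    'Kami mempelajari Machine Learning untuk data teks',
    'Data Science adalah Ilmu Data',
    'Kami sangat antusias belajar Data Science',
    'Banyak data data bertebaran di Internet'
]

vect = CountVectorizer().fit(corpus)

# the analyzer applies the same lowercasing and tokenization as transform()
analyze = vect.build_analyzer()
tokens = analyze('saya adalah data scientist')

# split the tokens by whether they belong to the learned vocabulary
known = [t for t in tokens if t in vect.vocabulary_]
unknown = [t for t in tokens if t not in vect.vocabulary_]
print(known)
print(unknown)

# the transformed row keeps the training width and counts only known words
row = vect.transform(['saya adalah data scientist'])
print(row.shape, row.sum())
```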


**Summary:**

- `vect.fit(train)` **learns the vocabulary** from the training data
- `vect.transform(train)` uses the **fitted vocabulary** to build the feature matrix for the training data
- `vect.transform(test)` uses the **fitted vocabulary** to build the feature matrix for the test data
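
The three steps can be sketched as follows. The `train` and `test` lists are a hypothetical split that reuses sentences from this notebook, and `fit_transform` is shown as a common shortcut for the first two steps on the training data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical train/test split, reusing sentences from this notebook
train = ['Kami sedang belajar data science',
         'Data Science adalah Ilmu Data']
test = ['saya adalah data scientist']

# two-step version: learn the vocabulary, then build the matrix
two_step = CountVectorizer()
two_step.fit(train)
X_train_two_step = two_step.transform(train)

# one-step shortcut on the training data
one_step = CountVectorizer()
X_train_one_step = one_step.fit_transform(train)

# both routes give the same training matrix
same = (X_train_two_step.toarray() == X_train_one_step.toarray()).all()
print(same)

# the test data is only transformed, never fitted, so its columns match training
X_test = one_step.transform(test)
print(X_train_one_step.shape, X_test.shape)
```

Calling `fit` again on the test data would build a different vocabulary, and the resulting columns would no longer line up with the ones the model was trained on.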

### Tf-Idf Model

We will use [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to convert text into a matrix of tf-idf weights:

In [15]:
# initialize the tf-idf model
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()

In [16]:
# learn the vocabulary of the corpus
vect.fit(corpus)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [17]:
# inspect the vocabulary
vect.get_feature_names()

['adalah',
 'antusias',
 'banyak',
 'belajar',
 'bertebaran',
 'data',
 'di',
 'ilmu',
 'internet',
 'kami',
 'learning',
 'machine',
 'mempelajari',
 'sangat',
 'science',
 'sedang',
 'teks',
 'untuk']

In [18]:
# transform the corpus list into a feature matrix
X_vect = vect.transform(corpus)

In [19]:
# convert the sparse matrix to a dense matrix
X_vect.toarray()

array([[0.        , 0.        , 0.        , 0.48431837, 0.        ,
        0.28604623, 0.        , 0.        , 0.        , 0.402028  ,
        0.        , 0.        , 0.        , 0.        , 0.402028  ,
        0.60030013, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.2000152 , 0.        , 0.        , 0.        , 0.28111439,
        0.41975436, 0.41975436, 0.41975436, 0.        , 0.        ,
        0.        , 0.41975436, 0.41975436],
       [0.54580932, 0.        , 0.        , 0.        , 0.        ,
        0.52016213, 0.        , 0.54580932, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.36553486,
        0.        , 0.        , 0.        ],
       [0.        , 0.51468496, 0.        , 0.41524458, 0.        ,
        0.24525014, 0.        , 0.        , 0.        , 0.34469051,
        0.        , 0.        , 0.        , 0.51468496, 0.34469051,
        0.        , 0.        , 0.        ],
    

In [20]:
# inspect what each feature means using a pandas DataFrame
pd.DataFrame(X_vect.toarray(), columns=vect.get_feature_names())

| | adalah | antusias | banyak | belajar | bertebaran | data | di | ilmu | internet | kami | learning | machine | mempelajari | sangat | science | sedang | teks | untuk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.484318 | 0.0 | 0.286046 | 0.0 | 0.0 | 0.0 | 0.402028 | 0.0 | 0.0 | 0.0 | 0.0 | 0.402028 | 0.6003 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.200015 | 0.0 | 0.0 | 0.0 | 0.281114 | 0.419754 | 0.419754 | 0.419754 | 0.0 | 0.0 | 0.0 | 0.419754 | 0.419754 |
| 2 | 0.545809 | 0.0 | 0.0 | 0.0 | 0.0 | 0.520162 | 0.0 | 0.545809 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.365535 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.514685 | 0.0 | 0.415245 | 0.0 | 0.24525 | 0.0 | 0.0 | 0.0 | 0.344691 | 0.0 | 0.0 | 0.0 | 0.514685 | 0.344691 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.451375 | 0.0 | 0.451375 | 0.430165 | 0.451375 | 0.0 | 0.451375 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
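
The weights above can be recomputed by hand from the raw counts. The sketch below assumes the defaults shown in the `TfidfVectorizer` output earlier (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), for which scikit-learn documents the smoothed formula idf = ln((1 + n) / (1 + df)) + 1. The names `manual` and `reference` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'Kami sedang belajar data science',
    'Kami mempelajari Machine Learning untuk data teks',
    'Data Science adalah Ilmu Data',
    'Kami sangat antusias belajar Data Science',
    'Banyak data data bertebaran di Internet'
]

# term frequencies: the same counts the Bag of Words model produces
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# document frequency: number of documents that contain each word
df = (counts > 0).sum(axis=0)

# smoothed inverse document frequency
idf = np.log((1 + n_docs) / (1 + df)) + 1

# weight the counts, then scale every row to unit L2 length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
```

A word such as `data`, which occurs in every document, gets the minimum idf of 1, so its weight is lower than that of rarer words in the same row even when it occurs more often.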


In [21]:
# transform the new text into a matrix
X_vect_new = vect.transform(simple_test)

In [22]:
# inspect it using a pandas DataFrame
pd.DataFrame(X_vect_new.toarray(), columns=vect.get_feature_names())

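| | adalah | antusias | banyak | belajar | bertebaran | data | di | ilmu | internet | kami | learning | machine | mempelajari | sangat | science | sedang | teks | untuk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.90275 | 0.0 | 0.0 | 0.0 | 0.0 | 0.430165 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |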
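
The two non-zero weights for the test sentence can be worked out with plain arithmetic, again assuming the default smoothed idf and L2 normalization. The document frequencies come from the training corpus: `adalah` appears in 1 of the 5 documents and `data` in all 5. The unseen words `saya` and `scientist` contribute nothing.

```python
import math

n_docs = 5  # number of training documents

# smoothed idf: ln((1 + n) / (1 + df)) + 1
idf_adalah = math.log((1 + n_docs) / (1 + 1)) + 1   # df = 1
idf_data = math.log((1 + n_docs) / (1 + 5)) + 1     # df = 5, so idf is exactly 1

# each word occurs once in the test sentence, so tf = 1 for both;
# the L2 norm of the row is the length of the vector (idf_adalah, idf_data)
norm = math.hypot(idf_adalah, idf_data)

w_adalah = idf_adalah / norm
w_data = idf_data / norm
print(round(w_adalah, 6), round(w_data, 6))
```

The idf values are fixed at `fit` time, so the test sentence is weighted with the training corpus statistics rather than its own.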
