# 09 Feature Extraction

Feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better human interpretations.

Reference: [https://en.wikipedia.org/wiki/Feature_extraction](https://en.wikipedia.org/wiki/Feature_extraction)

## One Hot Encoding for Categorical / Nominal Variables

One Hot Encoding represents each distinct value of a categorical feature (explanatory variable) as its own binary (0/1) column.

### Dataset

In [1]:
X = [
    {'kota': 'Jakarta'},
    {'kota': 'Bandung'},
    {'kota': 'Surabaya'},
]

### One Hot Encoding with `DictVectorizer`

One Hot Encoding can be applied using `DictVectorizer`.

In [2]:
from sklearn.feature_extraction import DictVectorizer
onehot_encoder = DictVectorizer()

In [3]:
encoded_X = onehot_encoder.fit_transform(X).toarray()
encoded_X

array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.]])

In [4]:
for x, encoded in zip(X, encoded_X):
    print(f'{x["kota"]} -> {encoded}')

Jakarta -> [0. 1. 0.]
Bandung -> [1. 0. 0.]
Surabaya -> [0. 0. 1.]
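
The column order in the output above is consistent with the category values being sorted alphabetically (Bandung, Jakarta, Surabaya). As a minimal sketch of the idea, the same vectors can be built by hand in plain Python. The helper `one_hot` is hypothetical and written only for illustration; it is not part of scikit-learn and handles a single categorical key.

```python
# Hypothetical helper (not a scikit-learn function): one-hot encode one categorical key.
def one_hot(records, key):
    # Sort the distinct values so that the column order is deterministic.
    categories = sorted({record[key] for record in records})
    # One row per record, one column per category: 1.0 where the value matches.
    rows = [
        [1.0 if record[key] == category else 0.0 for category in categories]
        for record in records
    ]
    return categories, rows


X = [
    {'kota': 'Jakarta'},
    {'kota': 'Bandung'},
    {'kota': 'Surabaya'},
]

categories, rows = one_hot(X, 'kota')
print(categories)
for record, row in zip(X, rows):
    print(f"{record['kota']} -> {row}")
```

Each row contains exactly one 1.0, located in the column of the city it encodes.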


## Feature Standardization

### Dataset

In [5]:
import numpy as np

X = np.array([[1., 1., 6., 14., 10.0, 2.], 
              [1., 1., 14., 16., 11., 16.],
              [1., 4., 16., 3., 1., 12.]])

X

array([[ 1.,  1.,  6., 14., 10.,  2.],
       [ 1.,  1., 14., 16., 11., 16.],
       [ 1.,  4., 16.,  3.,  1., 12.]])

### Standardization with `scale`

In [6]:
from sklearn import preprocessing

preprocessing.scale(X)

array([[ 0.        , -0.70710678, -1.38873015,  0.52489066,  0.59299945,
        -1.35873244],
       [ 0.        , -0.70710678,  0.46291005,  0.87481777,  0.81537425,
         1.01904933],
       [ 0.        ,  1.41421356,  0.9258201 , -1.39970842, -1.4083737 ,
         0.33968311]])
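
The values above match a column-wise z-score: each value has its column mean subtracted and is then divided by the column's population standard deviation. The sketch below reproduces this with the standard library only. The helper `standardize_columns` is hypothetical, and returning 0.0 for a constant column is an assumption chosen to mirror the zeros in the first column of the output above.

```python
from statistics import fmean, pstdev


# Hypothetical helper: z-score every column of a list-of-rows matrix.
def standardize_columns(rows):
    scaled_columns = []
    for column in zip(*rows):  # iterate over columns instead of rows
        mean = fmean(column)
        std = pstdev(column)  # population standard deviation (divides by n)
        if std == 0:
            # Constant column: every value equals the mean, so emit zeros.
            scaled_columns.append([0.0 for _ in column])
        else:
            scaled_columns.append([(value - mean) / std for value in column])
    # Transpose back to the original row-major layout.
    return [list(row) for row in zip(*scaled_columns)]


X_rows = [
    [1., 1., 6., 14., 10., 2.],
    [1., 1., 14., 16., 11., 16.],
    [1., 4., 16., 3., 1., 12.],
]

Z = standardize_columns(X_rows)
for row in Z:
    print([round(value, 8) for value in row])
```

For the second column `[1, 1, 4]`, the mean is 2 and the population standard deviation is the square root of 2, so the first entry becomes (1 - 2) / 1.41421356, roughly -0.70710678.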

## Bag of Words Model as a Text Representation

Reference: [https://en.wikipedia.org/wiki/Bag-of-words_model](https://en.wikipedia.org/wiki/Bag-of-words_model)

### Dataset

In [7]:
corpus = [
    'La Liga talking points',
    'Premier League talking points',
    'Talking points from the weekend Serie A matches'
]

### Bag of Words Model with `CountVectorizer`

The Bag of Words model can be applied using `CountVectorizer`.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorized_X = vectorizer.fit_transform(corpus).todense()
vectorized_X

matrix([[0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0],
        [1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1]])

In [9]:
vectorizer.vocabulary_

{'la': 1,
 'liga': 3,
 'talking': 8,
 'points': 5,
 'premier': 6,
 'league': 2,
 'from': 0,
 'the': 9,
 'weekend': 10,
 'serie': 7,
 'matches': 4}

In [10]:
dict(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[0]))

{'from': 0,
 'la': 1,
 'league': 2,
 'liga': 3,
 'matches': 4,
 'points': 5,
 'premier': 6,
 'serie': 7,
 'talking': 8,
 'the': 9,
 'weekend': 10}
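
Note that the single-letter token `a` from "Serie A" does not appear in the vocabulary. A rough sketch of the counting logic in plain Python follows. It assumes tokens are lowercased runs of two or more word characters, which is consistent with the vocabulary shown above. The names `tokenize` and `count_vector` are hypothetical helpers, not scikit-learn functions.

```python
import re
from collections import Counter

# Assumption: a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r'\b\w\w+\b')


def tokenize(text):
    # Lowercase first so that 'Talking' and 'talking' count as the same term.
    return TOKEN_PATTERN.findall(text.lower())


corpus = [
    'La Liga talking points',
    'Premier League talking points',
    'Talking points from the weekend Serie A matches',
]

tokenized = [tokenize(document) for document in corpus]

# Map every distinct term to a column index, in alphabetical order.
all_terms = sorted({term for tokens in tokenized for term in tokens})
vocabulary = {term: index for index, term in enumerate(all_terms)}


def count_vector(tokens):
    counts = Counter(tokens)  # missing terms count as 0
    # dicts preserve insertion order, so this follows the column indices.
    return [counts[term] for term in vocabulary]


vectors = [count_vector(tokens) for tokens in tokenized]
print(vocabulary)
for vector in vectors:
    print(vector)
```

Word order is discarded: each document is reduced to the number of times each vocabulary term occurs in it.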

### Euclidean Distance for Measuring the Closeness/Distance Between Documents (Vectors)


In [14]:
from sklearn.metrics.pairwise import euclidean_distances

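# Compare every pair of documents once (j starts at i + 1, so i == j never occurs).
# np.asarray turns each 1 x n np.matrix row into a plain 2D array, since
# newer scikit-learn releases may reject np.matrix inputs.
for i in range(len(vectorized_X)):
    for j in range(i + 1, len(vectorized_X)):
        jarak = euclidean_distances(np.asarray(vectorized_X[i]), np.asarray(vectorized_X[j]))
        print(f'Distance between documents {i+1} and {j+1}: {jarak}')

Distance between documents 1 and 2: [[2.]]
Distance between documents 1 and 3: [[2.64575131]]
Distance between documents 2 and 3: [[2.64575131]]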
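
The Euclidean distance between two count vectors is the square root of the sum of squared differences between their entries. As a small illustrative sketch, the three distances can be recomputed with the standard library only. The `vectors` list below simply copies the count matrix produced by `CountVectorizer` earlier in this section.

```python
import math
from itertools import combinations

# Count vectors copied from the CountVectorizer output shown earlier.
vectors = [
    [0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1],
]

distances = {}
# combinations yields each unordered pair of documents exactly once.
for (i, u), (j, v) in combinations(enumerate(vectors, start=1), 2):
    # Square root of the sum of squared element-wise differences.
    distances[(i, j)] = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

for (i, j), distance in distances.items():
    print(f'Distance between documents {i} and {j}: {distance:.8f}')
```

Documents 1 and 2 differ in four terms (la, liga, premier, league), which gives a distance of 2. Document 3 differs from each of the other two in seven terms, which gives the square root of 7, roughly 2.64575131.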
