The sklearn.feature_extraction module extracts features from datasets in formats such as text and images, and returns them in a representation supported by machine learning algorithms.


In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import time

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer, FeatureHasher


pd.options.display.max_columns = None

# Feature extraction methods

- [1. Basic feature extraction](#1.)

  - [1.1 DictVectorizer](#1.1)
  - [1.2 FeatureHasher](#1.2)

- [2. Text feature extraction](#2.)

- [3. Image feature extraction](#3.)


# [1. Basic feature extraction](#1.)


### [1.1 DictVectorizer](#1.1)

The class DictVectorizer can be used to convert feature arrays represented as lists of standard Python dict objects to the NumPy/SciPy representation used by scikit-learn estimators.

DictVectorizer implements what is called one-of-K or “one-hot” coding for categorical (aka nominal, discrete) features.

In general, DictVectorizer behaves like OneHotEncoder, with one advantage: it recognizes numerical values and passes them through unchanged instead of turning them into binary features.


In [18]:
measurements = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
    {"city": "San Francisco", "temperature": 18.0},
]

vec = DictVectorizer()

features = vec.fit_transform(measurements).toarray()
columns = vec.get_feature_names_out()
features, columns

(array([[ 1.,  0.,  0., 33.],
        [ 0.,  1.,  0., 12.],
        [ 0.,  0.,  1., 18.]]),
 array(['city=Dubai', 'city=London', 'city=San Francisco', 'temperature'],
       dtype=object))

In [19]:
df = pd.DataFrame(features, columns=columns)
df

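|   | city=Dubai | city=London | city=San Francisco | temperature |
|---|------------|-------------|--------------------|-------------|
| 0 | 1.0        | 0.0         | 0.0                | 33.0        |
| 1 | 0.0        | 1.0         | 0.0                | 12.0        |
| 2 | 0.0        | 0.0         | 1.0                | 18.0        |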
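
A fitted DictVectorizer only knows the categories it saw during fit. The sketch below reuses the same measurements and shows what happens with a city that was not in the training data, and how inverse_transform maps encoded rows back to dicts. The "Paris" sample and the variable names are made up for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
    {"city": "San Francisco", "temperature": 18.0},
]

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix
vec = DictVectorizer(sparse=False)
vec.fit(measurements)

# "Paris" was not seen during fit, so none of the city columns is set;
# the numeric temperature is still passed through unchanged.
new_sample = [{"city": "Paris", "temperature": 20.0}]  # hypothetical sample
encoded_new = vec.transform(new_sample)
print(encoded_new)

# inverse_transform maps rows back to dicts keyed by the encoded column
# names, keeping only the non-zero entries.
decoded = vec.inverse_transform(vec.transform(measurements))
print(decoded[0])
```

Because unseen categories are dropped silently, it can be worth checking new data against vec.get_feature_names_out() before relying on the encoded output.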


### [1.2 FeatureHasher](#1.2)

FeatureHasher implements feature hashing, also known as the hashing trick.

This class turns sequences of symbolic feature names (strings) into scipy.sparse matrices, using a hash function to compute the matrix column corresponding to a name. The hash function employed is the signed 32-bit version of Murmurhash3.

This class is a low-memory alternative to DictVectorizer and CountVectorizer, because it does not store a vocabulary of feature names.
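
To make the idea concrete, here is a minimal pure-Python sketch of the hashing trick described above. It is not scikit-learn's implementation: the helper names signed_hash32 and hash_features are hypothetical, and an MD5-based stand-in replaces MurmurHash3, so the column positions will differ from what FeatureHasher produces.

```python
import hashlib


def signed_hash32(token: str) -> int:
    """Stand-in signed 32-bit hash (MD5-based, NOT MurmurHash3)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little", signed=True)


def hash_features(sample: dict, n_features: int, alternate_sign: bool = True) -> list:
    """Map a {feature name: value} dict onto a fixed-length row."""
    row = [0.0] * n_features
    for name, value in sample.items():
        h = signed_hash32(name)
        column = abs(h) % n_features  # the hash decides the column
        # Optionally flip the sign for negative hashes, so that colliding
        # features tend to cancel out rather than pile up.
        sign = -1.0 if (alternate_sign and h < 0) else 1.0
        row[column] += sign * value
    return row


D = [{"dog": 1, "cat": 2, "elephant": 4}, {"dog": 2, "run": 5}]
rows = [hash_features(d, n_features=10) for d in D]
for row in rows:
    print(row)
```

No vocabulary is stored anywhere: the column for a feature name is recomputed from its hash every time, which is why the approach needs so little memory and why the original names cannot be recovered from the output.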


In [30]:
from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=10, input_type="dict")
D = [{"dog": 1, "cat": 2, "elephant": 4}, {"dog": 2, "run": 5}]
f = h.transform(D)
f.toarray()

array([[ 0.,  0., -4., -1.,  0.,  0.,  0.,  0.,  0.,  2.],
       [ 0.,  0.,  0., -2., -5.,  0.,  0.,  0.,  0.,  0.]])

In [39]:
h = FeatureHasher(n_features=15, input_type="string", alternate_sign=False)
raw_X = [["dog", "cat", "snake"], ["snake", "dog"], ["cat", "bird"]]
f = h.transform(raw_X)
f.toarray()

array([[0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 2., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
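
In the output above, the third row holds a single 2 even though "cat" and "bird" are different tokens, which suggests that both hashed to the same column. The sketch below, with a hypothetical helper hashed_counts and an arbitrary larger table size of 2**10, compares how many columns each row occupies for a small and a large n_features. It prints the counts rather than assuming them, since the exact columns depend on the hash.

```python
from sklearn.feature_extraction import FeatureHasher

raw_X = [["dog", "cat", "snake"], ["snake", "dog"], ["cat", "bird"]]


def hashed_counts(n_features):
    """Hash the token lists into a dense array with n_features columns."""
    hasher = FeatureHasher(
        n_features=n_features, input_type="string", alternate_sign=False
    )
    return hasher.transform(raw_X).toarray()


small = hashed_counts(15)
large = hashed_counts(2**10)  # arbitrary larger size, chosen for illustration

# With alternate_sign=False every token adds +1, so each row sums to its
# token count even when two tokens share a column.
print(small.sum(axis=1))

# Number of occupied columns per row; a value below the number of distinct
# tokens in that row signals a collision.
print((small != 0).sum(axis=1))
print((large != 0).sum(axis=1))
```

A larger n_features makes collisions less likely at the cost of a wider (but still sparse) matrix.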