### [Feature Extraction (FE)](https://scikit-learn.org/stable/modules/feature_extraction.html)

- Used to extract feature information from text & image datasets.
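- Very different from [feature selection](): feature extraction turns raw data such as text or images into numerical features, while feature selection is a technique applied to the result of a feature extraction method.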

### [Features from Dicts](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html#sklearn.feature_extraction.DictVectorizer)

- [DictVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html#sklearn.feature_extraction.DictVectorizer) converts feature arrays (lists of Python `dict` objects) to the NumPy/SciPy representation used by scikit-learn estimators.

- Uses one-of-K (aka "one hot") coding for categorical features. Categorical features are `attribute: value` pairs whose values come from a discrete, unordered set.

```python
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

vec = DictVectorizer()
vec.fit_transform(measurements).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vec.get_feature_names_out().tolist()
```

```
['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
```

- `DictVectorizer` accepts multiple string values for one feature, e.g. multiple categories for a movie.

```python
movie_entry = [{'category': ['thriller', 'drama'], 'year': 2003},
               {'category': ['animation', 'family'], 'year': 2011},
               {'year': 1974}]

vec.fit_transform(movie_entry).toarray()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vec.get_feature_names_out().tolist() == ['category=animation',
                                         'category=drama',
                                         'category=family',
                                         'category=thriller',
                                         'year']

# Features that were not seen during fit are silently ignored.
vec.transform({'category': ['thriller'],
               'unseen_feature': '3'}).toarray()
```

### Dict Vectorizer - NLP applications

- Suppose we have an algorithm that extracts Part of Speech (PoS) tags, which we want to use as features for training a sequence classifier (e.g. a chunker). The `pos_window` dict shown below could be such a window of features extracted around the word ‘sat’ in the sentence ‘The cat sat on the mat.’

- The description can be vectorized into a sparse 2D matrix, suitable for a classifier.

- Extracting this info around each individual word of a corpus of documents yields a *very wide matrix (many one-hot features)* that is mostly zeros. To keep the result memory-efficient, `DictVectorizer` returns a `scipy.sparse` matrix by default.

```python
pos_window = [
    {
        'word-2': 'the',
        'pos-2': 'DT',
        'word-1': 'cat',
        'pos-1': 'NN',
        'word+1': 'on',
        'pos+1': 'PP',
    },
    # in a real application one would extract many such dictionaries
]

vec = DictVectorizer()
pos_vectorized = vec.fit_transform(pos_window)
print(pos_vectorized)

pos_vectorized.toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vec.get_feature_names_out().tolist())
```

```
  (0, 0)	1.0
  (0, 1)	1.0
  (0, 2)	1.0
  (0, 3)	1.0
  (0, 4)	1.0
  (0, 5)	1.0
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']
```
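
To illustrate how the matrix widens once a window is extracted around every word, here is a minimal sketch that builds one dict per token of the example sentence. The PoS tags are hand-assigned for illustration (following the `'PP'` tag used above), and `window_features` is a hypothetical helper, not part of scikit-learn.

```python
from sklearn.feature_extraction import DictVectorizer

# Hand-tagged tokens for "The cat sat on the mat." (tags are illustrative).
tagged = [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD'),
          ('on', 'PP'), ('the', 'DT'), ('mat', 'NN')]

def window_features(tokens, i, offsets=(-2, -1, 1)):
    """Hypothetical helper: words and tags at the given offsets from token i."""
    feats = {}
    for offset in offsets:
        j = i + offset
        if not 0 <= j < len(tokens):
            continue  # the window runs past the sentence boundary
        word, pos = tokens[j]
        feats[f'word{offset:+d}'] = word   # e.g. 'word-2', 'word+1'
        feats[f'pos{offset:+d}'] = pos
    return feats

windows = [window_features(tagged, i) for i in range(len(tagged))]

vec = DictVectorizer()
X = vec.fit_transform(windows)             # scipy.sparse matrix by default

print(X.shape)                             # (number of tokens, number of distinct key=value pairs)
print(X.nnz / (X.shape[0] * X.shape[1]))   # fraction of entries that are non-zero
```

Here `windows[2]` is the same dict as the `pos_window` entry for ‘sat’. Even for a single six-word sentence, most entries of the matrix are zero, and the proportion of zeros grows with vocabulary size.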


### [Feature Hashing](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher) 

- [FeatureHasher](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher) is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the “hashing trick”. 

- Instead of building a hash table of features during training, as vectorizers do, instances of FeatureHasher *apply a hash function to the features to directly determine their column index* in sample matrices. 

- The result is increased speed and reduced memory usage, at the expense of inspectability; the hasher *does not remember what the input features looked like* and has no `inverse_transform` method.

- Since the hash function can cause collisions between (unrelated) features, a signed hash function is used. The sign determines the sign of the value stored in the output matrix for a feature. 

- This means that collisions are likely to cancel out rather than accumulate error - so the expected mean of any output feature’s value is zero. 

- The signed hash is enabled by default (`alternate_sign=True`) and is particularly useful for small hash table sizes (`n_features < 10000`). For large hash table sizes it can be disabled, which allows the output to be passed to estimators like `MultinomialNB` or `chi2` feature selectors that expect non-negative inputs.

- `FeatureHasher` accepts maps (like Python’s `dict` and its variants in the `collections` module), `(feature, value)` pairs, or strings, depending on the constructor parameter `input_type`. Maps are treated as lists of `(feature, value)` pairs.

- Single strings have an implicit value of 1, so `['feat1', 'feat2', 'feat3']` is interpreted as `[('feat1', 1), ('feat2', 1), ('feat3', 1)]`.

- If a single feature occurs multiple times in a sample, the feature values will be summed (so ('feat', 2) and ('feat', 3.5) become ('feat', 5.5)). The output from `FeatureHasher` is a scipy.sparse matrix in the CSR format.

- Feature hashing can be used in document classification. Unlike `CountVectorizer`, `FeatureHasher` does not do word splitting or any other preprocessing except Unicode-to-UTF-8 encoding. See below for a combined tokenizer/hasher.
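
The input types and the summing behaviour described above can be sketched as follows. This is a minimal illustration, not a recommended configuration: `n_features=16` is chosen only to keep the output small, and the feature names are made up.

```python
from sklearn.feature_extraction import FeatureHasher

# Dict input (the default input_type). FeatureHasher is stateless,
# so transform() can be called without fitting first.
hasher = FeatureHasher(n_features=16, input_type='dict')
X_dict = hasher.transform([{'dog': 1, 'cat': 2}, {'run': 5}])
print(X_dict.shape)     # (2, 16)
print(X_dict.format)    # csr

# String input: each string has an implicit value of 1.
# alternate_sign=False keeps every stored value non-negative.
str_hasher = FeatureHasher(n_features=16, input_type='string',
                           alternate_sign=False)
X_str = str_hasher.transform([['feat1', 'feat2', 'feat3']])
print(X_str.sum())      # 3.0, even if two features collide in one column

# Pair input: a feature repeated within one sample is summed.
pair_hasher = FeatureHasher(n_features=16, input_type='pair',
                            alternate_sign=False)
X_pair = pair_hasher.transform([[('feat', 2), ('feat', 3.5)]])
print(X_pair.data)      # [5.5]
```

With the default `alternate_sign=True`, the stored values can be negative, so the sums above would only hold up to sign.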

### Implementation

- `FeatureHasher` uses the signed 32-bit variant of MurmurHash3. The maximum number of features supported is currently $2^{31}-1$.

- The original formulation of the hashing trick used two separate hash functions $h$ and $\xi$ to determine the column index and the sign of a feature, respectively. This implementation instead uses a single hash and assumes the sign bit of MurmurHash3 is independent of its other bits.

- Since a simple modulo is used to map the hash value to a column index, consider using a power of two as the `n_features` parameter. Otherwise the features will not be mapped evenly to the columns.
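
The idea of deriving both the column index and the sign from one signed 32-bit hash can be sketched as below. This is an illustration of the technique under stated assumptions, not scikit-learn's actual implementation: `hashed_column` is a hypothetical helper, and it relies on the `murmurhash3_32` utility from `sklearn.utils`.

```python
from sklearn.utils import murmurhash3_32

def hashed_column(feature, n_features=2 ** 4, seed=0):
    """Hypothetical helper: map a feature name to (column index, sign)."""
    h = murmurhash3_32(feature, seed=seed)   # signed 32-bit integer
    index = abs(h) % n_features              # simple modulo picks the column
    sign = 1 if h >= 0 else -1               # sign bit decides +1 or -1
    return index, sign

for name in ['feat1', 'feat2', 'feat3']:
    print(name, hashed_column(name))
```

When `n_features` is a power of two, the modulo keeps only the low bits of the hash value, which is why that choice spreads features evenly over the columns.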