<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Feature-Extraction-for-Text-Data" data-toc-modified-id="Feature-Extraction-for-Text-Data-1">Feature Extraction for Text Data</a></span></li><li><span><a href="#What-about-new-words-in-the-test-set?" data-toc-modified-id="What-about-new-words-in-the-test-set?-2">What about new words in the test set?</a></span></li><li><span><a href="#Unseen-data-are-not-at-prediction-time" data-toc-modified-id="Unseen-data-are-not-at-prediction-time-3">Unseen data are not at prediction time</a></span></li><li><span><a href="#What-is-Hashing?" data-toc-modified-id="What-is-Hashing?-4">What is Hashing?</a></span></li><li><span><a href="#What-are-Hash-Maps?" data-toc-modified-id="What-are-Hash-Maps?-5">What are Hash Maps?</a></span></li><li><span><a href="#Feature-Hashing,-aka-hashing-trick" data-toc-modified-id="Feature-Hashing,-aka-hashing-trick-6">Feature Hashing, aka hashing trick</a></span></li><li><span><a href="#The-Advantages-of-Feature-Hashing" data-toc-modified-id="The-Advantages-of-Feature-Hashing-7">The Advantages of Feature Hashing</a></span></li><li><span><a href="#The-Disadvantages-of-Feature-Hashing" data-toc-modified-id="The-Disadvantages-of-Feature-Hashing-8">The Disadvantages of Feature Hashing</a></span></li><li><span><a href="#HashingVectorizer" data-toc-modified-id="HashingVectorizer-9">HashingVectorizer</a></span></li><li><span><a href="#TfidfVectorizer" data-toc-modified-id="TfidfVectorizer-10">TfidfVectorizer</a></span></li><li><span><a href="#Caveats" data-toc-modified-id="Caveats-11">Caveats</a></span></li><li><span><a href="#Bonus-Material" data-toc-modified-id="Bonus-Material-12">Bonus Material</a></span></li></ul></div>

<center><h2>Feature Extraction for Text Data</h2></center>


Text is a very rich data source. 

However, it can be challenging to make it amenable to machine learning.

We need tools to build feature vectors from text documents.

In [21]:
reset -fs

In [22]:
from sklearn.feature_extraction.text import *

In [23]:
whos

Variable                Type         Data/Info
----------------------------------------------
CountVectorizer         type         <class 'sklearn.feature_e<...>on.text.CountVectorizer'>
ENGLISH_STOP_WORDS      frozenset    frozenset({'one', 'full',<...>ch', 'a', 'alone', 're'})
HashingVectorizer       type         <class 'sklearn.feature_e<...>.text.HashingVectorizer'>
TfidfTransformer        type         <class 'sklearn.feature_e<...>n.text.TfidfTransformer'>
TfidfVectorizer         type         <class 'sklearn.feature_e<...>on.text.TfidfVectorizer'>
strip_accents_ascii     function     <function strip_accents_ascii at 0x7f8b6bf7f550>
strip_accents_unicode   function     <function strip_accents_u<...>nicode at 0x7f8b6bf7f3a0>
strip_tags              function     <function strip_tags at 0x7f8b6bfe69d0>


| Class | What it does |
|:-------|:------|
| feature_extraction.text.CountVectorizer | Convert a collection of text documents to a matrix of token counts |
| feature_extraction.text.HashingVectorizer | Convert a collection of text documents to a matrix of token occurrences |
| feature_extraction.text.TfidfTransformer | Transform a count matrix to a normalized tf or tf-idf representation |
| feature_extraction.text.TfidfVectorizer | Convert a collection of raw documents to a matrix of TF-IDF features |

In [24]:
# Sample text
text = ["problem of evil",
        "evil king",
        "horizon problem",
        "king of kings"]

In [25]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectors = CountVectorizer()
X = count_vectors.fit_transform(text)
X

<4x6 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [26]:
X.toarray()

array([[1, 0, 0, 0, 1, 1],
       [1, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 1],
       [0, 0, 1, 1, 1, 0]])

In [27]:
# Note: scikit-learn 1.0 deprecated get_feature_names() in favor of get_feature_names_out()
count_vectors.get_feature_names()

['evil', 'horizon', 'king', 'kings', 'of', 'problem']

In [28]:
import pandas as pd

pd.DataFrame(X.toarray(),
             columns=count_vectors.get_feature_names())

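|   | evil | horizon | king | kings | of | problem |
|:--|:-----|:--------|:-----|:------|:---|:--------|
| 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 | 1 | 1 | 0 |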
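To make the counting step concrete, here is a minimal pure-Python sketch of the same idea. It is not scikit-learn's implementation; the names `tokenize`, `vocabulary`, and `transform` are hypothetical, and the regular expression is an assumption chosen to mirror `CountVectorizer`'s default token pattern (lowercased tokens of two or more word characters).

```python
import re

text = ["problem of evil", "evil king", "horizon problem", "king of kings"]

# Assumption: tokens are runs of two or more word characters, lowercased
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    """Split a document into lowercase tokens."""
    return TOKEN_PATTERN.findall(doc.lower())

# "fit": learn a sorted vocabulary and give each token a column index
all_tokens = {tok for doc in text for tok in tokenize(doc)}
vocabulary = {tok: i for i, tok in enumerate(sorted(all_tokens))}

def transform(docs):
    """Count how often each vocabulary token occurs in each document."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            if tok in vocabulary:  # tokens outside the vocabulary are skipped
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows

counts = transform(text)
print(list(vocabulary))
print(counts)
```

The columns follow the sorted vocabulary, which is why `evil` comes first and `problem` last.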


<center><h2>What about new words in the test set?</h2></center>



In [29]:
test_text = ["queen problem"]

# Note: only call transform on the test set (never call fit or fit_transform)
X_test = count_vectors.transform(test_text)

pd.DataFrame(X_test.toarray(),
             columns=count_vectors.get_feature_names())

|   | evil | horizon | king | kings | of | problem |
|:--|:-----|:--------|:-----|:------|:---|:--------|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 |


<center><h2>Unseen data are not handled at prediction time</h2></center>

Any new words are dropped because the model was not trained on them. This extends to categorical data and some continuous data.

This issue is frequently called "out of distribution". Out of distribution is one of the most difficult problems in machine learning.

Remember - "Data is the world's best regularizer."

You want to collect as much data as possible. In particular, collect data that will look like your production data.

[Source](https://stackoverflow.com/questions/30287371/countvectorizer-matrix-varies-with-new-test-data-for-classification)
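One way to see how much of the test text is being dropped is to compare its tokens against the fitted vocabulary. The sketch below is one possible check, not a standard recipe; the variable names `unseen_tokens` and `oov_rate` are hypothetical. It relies on the fitted `vocabulary_` attribute and the `build_analyzer()` method of `CountVectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_text = ["problem of evil", "evil king", "horizon problem", "king of kings"]
test_text = ["queen problem"]

count_vectors = CountVectorizer().fit(train_text)

# The analyzer applies the same preprocessing and tokenization used during fit
analyzer = count_vectors.build_analyzer()

test_tokens = [tok for doc in test_text for tok in analyzer(doc)]
unseen_tokens = [tok for tok in test_tokens if tok not in count_vectors.vocabulary_]

# Fraction of test tokens that transform() will silently drop
oov_rate = len(unseen_tokens) / len(test_tokens)

print(unseen_tokens)
print(oov_rate)
```

A high rate on production-like data suggests the training set does not yet look like the data the model will see.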

<center><h2>What is Hashing?</h2></center>

<center>Converting an object into a single numeric value.</center>

<center>Maps data of arbitrary size to data of a fixed size (often in a repeatable way).</center>
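As a small illustration, the sketch below maps strings of any length to an integer in a fixed range. The helper `stable_hash` is hypothetical, and MD5 is used only because it gives the same answer on every run; Python's built-in `hash()` for strings is salted per process unless `PYTHONHASHSEED` is set, which is one reason hashing is only "often" repeatable.

```python
import hashlib

def stable_hash(value: str, n_buckets: int) -> int:
    """Map a string of any length to an integer in [0, n_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n_buckets

# Inputs of very different sizes land in the same fixed range
short_index = stable_hash("dog", 10)
long_index = stable_hash("a much longer piece of text " * 100, 10)

print(short_index, long_index)
```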

<center><h2>What are Hash Maps?</h2></center>

<center>Key-value pairs, where the key is hashed to a numeric location (index) and the value stored there is an object</center>
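Python's `dict` is a hash map. To show the mechanism, here is a toy version; `ToyHashMap` is a hypothetical teaching class, not how `dict` is actually implemented. The key is hashed to a bucket index, and the value is stored in that bucket.

```python
class ToyHashMap:
    """A minimal hash map that resolves collisions by chaining."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _index(self, key):
        # Hash the key, then fold it into the available bucket range
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)

m = ToyHashMap()
m.put("dog", 10)
m.put("cat", 2)
m.put("dog", 4)  # replaces the earlier value for "dog"
print(m.get("dog"), m.get("cat"))
```

Note that the toy map still stores the keys so it can tell colliding keys apart. Feature hashing drops that step, which is where its memory savings and its collisions both come from.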

<center><h2>Feature Hashing, aka hashing trick</h2></center>
<br>
<center><img src="../images/feature_hashing.jpeg" width="100%"/></center>

<center>Map feature values to indices in a feature vector.</center>

<center>Those feature values could be <b>any</b> hashable object</center>

Source: https://www.quora.com/Can-you-explain-feature-hashing-in-an-easily-understandable-way
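The idea can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not scikit-learn's code: `hash_features` is a hypothetical helper, and MD5 stands in for the signed 32-bit MurmurHash3 that scikit-learn uses. One bit of the hash picks the sign, similar in spirit to the alternating-sign option in `FeatureHasher`.

```python
import hashlib

def hash_features(features: dict, n_features: int = 10) -> list:
    """Map {feature name: value} to a fixed-length vector via hashing."""
    vector = [0.0] * n_features
    for name, value in features.items():
        digest = int.from_bytes(hashlib.md5(name.encode("utf-8")).digest(), "big")
        index = digest % n_features                 # which slot the feature lands in
        sign = 1.0 if (digest >> 64) & 1 else -1.0  # assumption: one hash bit picks the sign
        vector[index] += sign * value
    return vector

one_feature = hash_features({"dog": 10})
three_features = hash_features({"dog": 10, "cat": 2, "elephant": 4})

print(one_feature)
print(three_features)
```

No vocabulary is stored anywhere: the index is recomputed from the feature name every time.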

In [9]:
from sklearn.feature_extraction import FeatureHasher

# help(FeatureHasher)

In [10]:
# Let's feature hash 
h = FeatureHasher(n_features=10)

# 1 instance & 1 feature with a value
d = [{'dog': 10,}] 

# 1 instance & 1 feature with a changed value
# d = [{'dog': 4,}] 

# 1 instance & 3 features
# d = [{'dog': 10, 'cat':2, 'elephant':4}] 

# 2 instances
# d = [{'dog': 10, 'cat':2, 'elephant':4},
#      {'dog': 2, 'run': 5}] 

# 1 instance & many features 🤭
# d = [{l:n for l, n in zip('abcdefghijklmnopqrstuvwxyz', range(26))}] 
    
h.fit_transform(d).toarray()

array([[  0.,   0.,   0., -10.,   0.,   0.,   0.,   0.,   0.,   0.]])

Learn more: https://www.youtube.com/watch?v=Uv9dY6Obv-s

<center><h2>The Advantages of Feature Hashing</h2></center>

- Fast 
- Simple
- Memory efficient
    - Limits feature vector size (compared to one-hot encoding)
    - No need to store the feature-to-index mapping itself (the hash function computes the index)

<center><h2>The Disadvantages of Feature Hashing</h2></center>

- Interpretability & feature importances - Cannot go from feature indices back to feature names
- Hash collisions 
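Collisions are easy to provoke when there are more features than slots. The sketch below hashes 26 single-letter feature names into 10 slots, so by the pigeonhole principle at least one slot must hold several names. `bucket_of` is a hypothetical helper, and MD5 is an assumption used only for repeatability.

```python
import hashlib
from collections import defaultdict

def bucket_of(name: str, n_features: int) -> int:
    """Return the slot a feature name hashes to."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n_features

n_features = 10
buckets = defaultdict(list)
for letter in "abcdefghijklmnopqrstuvwxyz":
    buckets[bucket_of(letter, n_features)].append(letter)

# Slots shared by more than one feature name
collisions = {index: names for index, names in buckets.items() if len(names) > 1}

for index, names in sorted(collisions.items()):
    print(index, names)
```

Once two names share a slot, nothing in the vector says which one contributed the value, which is also why feature importances cannot be traced back to names.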

<center><h2>HashingVectorizer</h2></center>

In [30]:
from sklearn.feature_extraction.text import HashingVectorizer

hash_vector = HashingVectorizer()
X = hash_vector.fit_transform(text)
X.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [31]:
hash_vector = HashingVectorizer(n_features=7)
X = hash_vector.transform(text)  # HashingVectorizer is stateless, so you don't have to call fit on it
X.toarray()

array([[ 0.        ,  0.        , -0.57735027,  0.57735027,  0.57735027,
         0.        ,  0.        ],
       [ 0.        ,  0.        , -0.70710678,  0.70710678,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        , -0.70710678,  0.        , -0.70710678,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  1.        ,
         0.        ,  0.        ]])
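Because no vocabulary is learned, a word that never appeared before still receives a feature index. The sketch below transforms the earlier "queen problem" example without any fit step. The settings `n_features=2**10` and `alternate_sign=False` are choices made for this illustration (the latter keeps values non-negative so that colliding tokens cannot cancel out).

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Assumption: a small n_features and no sign alternation, chosen for readability
hash_vector = HashingVectorizer(n_features=2**10, alternate_sign=False)

# No fit is needed, and "queen" is not dropped
X_test = hash_vector.transform(["queen problem"])

print(X_test.shape)
print(X_test.nnz)  # number of non-zero entries in the row
```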

In [32]:
hash_vector.get_feature_names()

# Hashing in this case is a one-way operation

AttributeError: 'HashingVectorizer' object has no attribute 'get_feature_names'

There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):

- there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.

- there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).

- no IDF weighting as this would render the transformer stateful.

[Source](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
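If IDF weighting is wanted anyway, one option is to append a `TfidfTransformer` after the hashing step in a pipeline. The hashing step stays stateless, and only the IDF weights are learned during fit. The parameter values below (`n_features=2**10`, `alternate_sign=False`, `norm=None`) are assumptions for this sketch so that the transformer receives plain non-negative counts.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

text = ["problem of evil", "evil king", "horizon problem", "king of kings"]

hashed_tfidf = make_pipeline(
    # Raw hashed counts: no sign flipping and no per-row normalization
    HashingVectorizer(n_features=2**10, alternate_sign=False, norm=None),
    # Learns IDF weights over hashed columns instead of named tokens
    TfidfTransformer(),
)

X = hashed_tfidf.fit_transform(text)
print(X.shape)
```

The trade-off is that the pipeline as a whole is now stateful, so it must be fit on training data before transforming new text.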

<center><h2>TfidfVectorizer</h2></center>

TF-IDF was covered in a previous course (MSDS 621). Reference those notes [here](https://github.com/parrt/msds692/blob/9a7ded161fe37eacd6d3b21533aebd9d7e148ef1/notes/tfidf.ipynb) and [here](https://github.com/parrt/msds692/blob/6ec05ef972ec94e185645f407002956d63111a86/hw/tfidf.md).
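As a quick refresher, here is a minimal sketch that applies `TfidfVectorizer` with its default settings to the same sample text used earlier. The names `row_norms`, `king_idf`, and `kings_idf` are introduced only for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["problem of evil", "evil king", "horizon problem", "king of kings"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(text)

# With default settings each document vector is scaled to unit L2 length
row_norms = np.linalg.norm(X.toarray(), axis=1)

# Rarer tokens get a larger IDF weight: "kings" is in one document, "king" in two
king_idf = tfidf.idf_[tfidf.vocabulary_["king"]]
kings_idf = tfidf.idf_[tfidf.vocabulary_["kings"]]

print(X.shape)
print(row_norms)
print(king_idf, kings_idf)
```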

<center><h2>Caveats</h2></center>

- scikit-learn is not designed for rich text processing and modeling. It only does a couple of things.
- I suggest using spaCy for more options.

<center><h2>Bonus Material</h2></center>

How should I choose `n_features` in `FeatureHasher`?

- Default is good enough.
- Otherwise a power of 2 is a good idea.

[Source](https://datascience.stackexchange.com/questions/77819/how-should-i-choose-n-features-in-featurehasher-in-sklearn)
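A rough way to reason about the choice is to estimate how many slots stay distinct for a given number of unique tokens. The sketch below assumes an idealized hash that spreads tokens uniformly and independently over the slots, which real hash functions only approximate; `expected_occupied_buckets` is a hypothetical helper.

```python
def expected_occupied_buckets(n_tokens: int, n_features: int) -> float:
    """Expected number of distinct slots used when n_tokens unique tokens
    are hashed uniformly at random into n_features slots."""
    return n_features * (1.0 - (1.0 - 1.0 / n_features) ** n_tokens)

# 26 tokens into 10 slots: collisions are unavoidable
small = expected_occupied_buckets(26, 10)

# 1,000 tokens into 2**20 slots (the FeatureHasher default): collisions are rare
large = expected_occupied_buckets(1_000, 2**20)

print(round(small, 2))
print(round(large, 2))
```

The gap between the number of tokens and the number of occupied slots is a rough count of tokens lost to collisions.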

<br>
<br> 
<br>

----