# Video: Examples of Feature Extraction in Scikit-Learn

This video shows off the variety of feature extraction methods available in scikit-learn.

In [None]:
import numpy as np
import sklearn.feature_extraction

Script:
* The feature extraction module of scikit-learn has many different classes designed to turn different kinds of data into new feature columns.
* I will now show you three examples to illustrate their breadth.
* The first example is the `DictVectorizer` class.
* This class takes in a sequence of dictionaries and maps each key to a new column.

In [None]:
dict_data = [{"fruit": "mango", "yellow": 2, "rating": 5}, {"fruit": "pear", "green": 4, "rating": 3}]

In [None]:
dict_vectorizer = sklearn.feature_extraction.DictVectorizer()
dict_vectorizer.fit(dict_data)

Script:
* The fit method records every key found in the data, and where a value is a string, it records the key/value pair instead.
* Each of the saved keys or key/value pairs will become a new column.


In [None]:
dict_vectorizer.feature_names_

['fruit=mango', 'fruit=pear', 'green', 'rating', 'yellow']

Script:
* Note the different handling between string and number values.
* String values require more columns: one for each distinct string value seen under a key.
* After fitting, we can transform the data.

In [None]:
dict_vectorizer.transform(dict_data)

<2x5 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in Compressed Sparse Row format>

Script:
* Because the keys and values present may vary a lot from dictionary to dictionary, the vectorizer assumes only a small portion of them will be used at any time and returns a sparse matrix.
* Let's change that to a dense matrix to get a better look -- this example is small enough that we do not need to worry about the space.

In [None]:
dict_vectorizer.transform(dict_data).todense()

matrix([[1., 0., 0., 5., 2.],
        [0., 1., 4., 3., 0.]])
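
To make the column layout easier to read, here is a small sketch that pairs each dense value with its feature name. It assumes the same `dict_data` as above and uses `sparse=False` to get a dense array directly; `labeled_rows` is just an illustrative variable name.

```python
import sklearn.feature_extraction

dict_data = [
    {"fruit": "mango", "yellow": 2, "rating": 5},
    {"fruit": "pear", "green": 4, "rating": 3},
]

# sparse=False asks the vectorizer for a dense NumPy array instead of a sparse matrix.
dense_vectorizer = sklearn.feature_extraction.DictVectorizer(sparse=False)
dense_rows = dense_vectorizer.fit_transform(dict_data)

# Pair every column value with the feature name saved during fit.
labeled_rows = [
    dict(zip(dense_vectorizer.feature_names_, row)) for row in dense_rows
]

for labeled_row in labeled_rows:
    print(labeled_row)
```

Keys that a dictionary did not contain, such as `green` in the first one, show up as zeros.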

Script:
* If a string value was not seen during training, it will not have a column to represent that value occurring.
* A feature hasher is a more flexible version that can handle new string values, but comes at the cost of blurring which key/value pairs were actually present.
* Instead of learning one column per key or key/value pair, a fixed number of columns is chosen at the start, and keys and string values are pseudorandomly mapped to those columns.
* The worse visibility from sharing columns is a downside, but a possible upside is not needing to "fit" the data to assign columns.
* With a feature hasher, the data can be mapped to columns without a first pass identifying features.
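
The column-sharing idea can be sketched in plain Python. This is only an illustration under stated assumptions: it uses `hashlib.md5` and no sign flipping, whereas scikit-learn's `FeatureHasher` uses a signed MurmurHash3, so the column positions will not match the real class. The helper names `hash_column` and `hash_row` are hypothetical.

```python
import hashlib


def hash_column(name, n_columns):
    # Deterministically map a feature name to one of n_columns buckets.
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % n_columns


def hash_row(record, n_columns=10):
    # Build one output row without any prior fitting step.
    row = [0.0] * n_columns
    for key, value in record.items():
        if isinstance(value, str):
            # String values become a "key=value" feature with weight 1.
            name, weight = f"{key}={value}", 1.0
        else:
            name, weight = key, float(value)
        # Features that land in the same bucket are simply added together.
        row[hash_column(name, n_columns)] += weight
    return row


dict_data = [
    {"fruit": "mango", "yellow": 2, "rating": 5},
    {"fruit": "pear", "green": 4, "rating": 3},
]
hashed_rows = [hash_row(record) for record in dict_data]
for hashed_row in hashed_rows:
    print(hashed_row)
```

Because no vocabulary is stored, a value that was never seen before, such as `{"fruit": "kiwi"}`, still gets a column, possibly one shared with another feature.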


In [None]:
feature_hasher = sklearn.feature_extraction.FeatureHasher(n_features=10)

In [None]:
feature_hasher.transform(dict_data).todense()

matrix([[0., 0., 5., 2., 0., 0., 0., 1., 0., 0.],
        [0., 0., 3., 0., 5., 0., 0., 0., 0., 0.]])
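
Here is a hedged sketch of how the two classes treat a string value that was not in the training data. The `new_data` record is made up for illustration. The dict vectorizer silently ignores the unseen `fruit=kiwi` pair, while the feature hasher still maps it to one of its ten columns.

```python
import sklearn.feature_extraction

dict_data = [
    {"fruit": "mango", "yellow": 2, "rating": 5},
    {"fruit": "pear", "green": 4, "rating": 3},
]
# Hypothetical new record with a fruit that was never seen during fit.
new_data = [{"fruit": "kiwi", "rating": 4}]

dict_vectorizer = sklearn.feature_extraction.DictVectorizer()
dict_vectorizer.fit(dict_data)
# Only the known "rating" column is filled in; "fruit=kiwi" has no column.
dict_row = dict_vectorizer.transform(new_data).toarray()
print(dict_row)

feature_hasher = sklearn.feature_extraction.FeatureHasher(n_features=10)
# No fit step is needed; both features are hashed into the ten columns.
hashed_row = feature_hasher.transform(new_data).toarray()
print(hashed_row)
```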

Script:
* Usually feature hashers are used with many more columns -- the default is about a million (2**20) -- but I reduced the number to make this example clearer.
* Both dict vectorizers and feature hashers are very flexible ways to handle dictionaries of data.
* The last feature extractor that I will show now is for text instead.
* This feature extractor basically breaks up the input text into words and assigns them weights based on how frequent they are in that particular text, and how rare they are overall.
* Here are some sample input texts.

In [None]:
documents = []
documents.append("mangos are the best.")
documents.append("mango mango mango")
documents.append("pears are ok")
documents.append("apples are decent")

Script:
* And here is the class to turn these texts into vectors.

In [None]:
text_vectorizer = sklearn.feature_extraction.text.TfidfVectorizer()
text_vectorizer.fit(documents)

Script:
* What does that name mean?
* TF stands for term frequency.
* How many times does a particular word show up in this text?
* IDF stands for inverse document frequency.
* That is basically how rare this word is across documents.
* In its simplest form, that is the total number of documents divided by the number of documents containing this word; scikit-learn uses a smoothed, logarithmic version of that ratio.
* So a word like "the" usually has a low inverse document frequency since it is ubiquitous.
* And a word like "kumquat" usually has a high inverse document frequency since it is not used often.
* Those frequencies will be recalculated to match the documents with which you fit the vectorizer.
* Let's look at what words were vectorized.

In [None]:
text_vectorizer.get_feature_names_out()

array(['apples', 'are', 'best', 'decent', 'mango', 'mangos', 'ok',
       'pears', 'the'], dtype=object)

Script:
* You can see the text was lowercased and split into words at spaces and punctuation.
* The one period in the input above was removed.
* And both "mango" and "mangos" are present -- it is not smart enough to combine them.
* Though that can often be done by adding preprocessing.
* Let's look at the features that come out of this vectorizer.

In [None]:
text_vectorizer.transform(documents).todense()

matrix([[0.        , 0.34578314, 0.5417361 , 0.        , 0.        ,
         0.5417361 , 0.        , 0.        , 0.5417361 ],
        [0.        , 0.        , 0.        , 0.        , 1.        ,
         0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.41137791, 0.        , 0.        , 0.        ,
         0.        , 0.64450299, 0.64450299, 0.        ],
        [0.64450299, 0.41137791, 0.        , 0.64450299, 0.        ,
         0.        , 0.        , 0.        , 0.        ]])
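
The weights in this matrix can be reproduced by hand. The sketch below assumes scikit-learn's default settings (`smooth_idf=True` and `norm='l2'`), where the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit length. The documents are pre-tokenized by hand, and `tfidf_row` is a hypothetical helper name.

```python
import math

# The sample documents, already lowercased and split into words.
tokenized_documents = [
    ["mangos", "are", "the", "best"],
    ["mango", "mango", "mango"],
    ["pears", "are", "ok"],
    ["apples", "are", "decent"],
]

n_docs = len(tokenized_documents)
vocabulary = sorted({word for doc in tokenized_documents for word in doc})

# Document frequency: in how many documents does each word appear?
doc_freq = {
    word: sum(word in doc for doc in tokenized_documents) for word in vocabulary
}

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {
    word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocabulary
}


def tfidf_row(tokens):
    # Term frequency times IDF, then scale the row to unit Euclidean length.
    raw = [tokens.count(word) * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


tfidf_rows = [tfidf_row(doc) for doc in tokenized_documents]
for row in tfidf_rows:
    print([round(value, 4) for value in row])
```

The word "are" appears in three of the four documents, so it gets a lower weight than words that appear in only one.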

Script:
* Like most feature extractors, the output defaults to a sparse matrix.
* And most of the entries are zero since the expectation is that most documents will contain only a small fraction of the total possible words.
* That's it for these feature extraction examples.
* Check out the feature extraction module in scikit-learn for more choices.