# Scikit Learn Tutorial #13 - Feature extraction

<table align="left"><td>
  <a target="_blank"  href="https://colab.research.google.com/github/TannerGilbert/Tutorials/blob/master/Scikit-Learn-Tutorial/13.%20Feature%20extraction.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab
  </a>
</td><td>
  <a target="_blank"  href="https://github.com/TannerGilbert/Tutorials/blob/master/Scikit-Learn-Tutorial/13.%20Feature%20extraction.ipynb">
    <img width=32px src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
</td></table>

![Scikit Learn Logo](http://scikit-learn.org/stable/_static/scikit-learn-logo-small.png)

## Why is feature extraction important?

Raw data such as text, images, or Python dicts is often not in a format that Machine Learning algorithms can consume. Feature extraction converts this data into numerical features that these algorithms support.

## Feature Extraction in Scikit Learn

Scikit Learn's <i>sklearn.feature_extraction</i> module provides classes and functions to extract features from data such as text or images.

### Loading features from dicts (DictVectorizer)

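DictVectorizer transforms data stored as Python dicts into a numeric matrix that can be used for Machine Learning. By default the result is a SciPy sparse matrix, which <i>toarray()</i> converts to a NumPy array. Categorical (string) values are one-hot encoded, while numeric values are passed through unchanged.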

In [1]:
from sklearn.feature_extraction import DictVectorizer

# Creating a list of dicts
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

vec = DictVectorizer()
vec.fit_transform(measurements).toarray()

array([[ 1.,  0.,  0., 33.],
       [ 0.,  1.,  0., 12.],
       [ 0.,  0.,  1., 18.]])

In [2]:
# Printing the names of the new features
# (get_feature_names() was removed in scikit-learn 1.2)
vec.get_feature_names_out()

array(['city=Dubai', 'city=London', 'city=San Francisco', 'temperature'],
      dtype=object)
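The following is a minimal sketch of a few DictVectorizer details that the cells above don't show: dense output via `sparse=False`, the `vocabulary_` mapping, how a category that was not seen during fitting is handled, and `inverse_transform`. The sample data and variable names are made up for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical sample data for this sketch
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'San Francisco', 'temperature': 18.},
]

# sparse=False makes the vectorizer return a dense NumPy array directly
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(measurements)

# vocabulary_ maps each feature name to its column index
columns = vec.vocabulary_
print(columns)

# A city that was not seen during fit has no column, so it is silently ignored;
# only the numeric 'temperature' value survives
unseen = vec.transform([{'city': 'Paris', 'temperature': 5.}])
print(unseen)

# inverse_transform maps rows back to dicts keyed by feature name
# (features with a value of zero are left out)
restored = vec.inverse_transform(X)
print(restored)
```

Because unseen categories are dropped without a warning, it can be worth checking `vocabulary_` when new data produces rows that are unexpectedly all zeros in the one-hot columns.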

### Feature hashing (FeatureHasher)

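FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing (also called the hashing trick). Instead of building a vocabulary, it applies a hash function to each feature name to determine its column index. It therefore needs no fitting step, but distinct features can collide in the same column when <i>n_features</i> is small.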

In [3]:
from sklearn.feature_extraction import FeatureHasher

# Creating array of dicts
data = [
    {'dog': -1, 'cat': 2, 'elephant': 4},
    {'dog': 2, 'run': 5, 'cat': -7}
]

h = FeatureHasher(n_features=4)
h.transform(data).toarray()

array([[ 0.,  1., -4.,  2.],
       [-5., -2.,  0., -7.]])
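As a rough illustration of why FeatureHasher is cheap on memory, the sketch below hashes lists of raw tokens with `input_type='string'`. The token lists and `n_features=8` are arbitrary choices for this example. The exact column each token lands in depends on the hash function, so no values are shown here.

```python
from sklearn.feature_extraction import FeatureHasher

# Hashing is stateless: there is no fit step and no stored vocabulary
hasher = FeatureHasher(n_features=8, input_type='string')

# With input_type='string', each sample is an iterable of tokens
# and every token has an implied value of 1
tokens = [
    ['dog', 'cat', 'cat'],
    ['elephant', 'run'],
]
X = hasher.transform(tokens)

# The number of columns is fixed by n_features, no matter how many
# distinct tokens appear in the data
print(X.shape)  # (2, 8)
```

The trade-off is that the mapping cannot be inverted: unlike DictVectorizer, there is no way to recover feature names from column indices.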

### Text feature extraction

Scikit Learn provides utilities for the most common steps of extracting numeric features from text:
<ul>
    <li><b>tokenizing</b> strings and giving an integer id for each possible token.</li>
    <li><b>counting</b> the occurrences of tokens in each document.</li>
    <li><b>normalizing</b> and weighting with diminishing importance tokens that occur in the majority of samples / documents.</li>
</ul>

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# Creating Dataset
data = [
    'Test sentence one of three.',
    'Second test sentence of three.',
    'Last sentence of three.'
]

vec = CountVectorizer()

vec.fit_transform(data).toarray()

array([[0, 1, 1, 0, 1, 1, 1],
       [0, 1, 0, 1, 1, 1, 1],
       [1, 1, 0, 0, 1, 0, 1]], dtype=int64)

In [5]:
vec.transform(['New sentence']).toarray()

array([[0, 0, 0, 0, 1, 0, 0]], dtype=int64)
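To make the tokenizing and counting steps more visible, here is a small sketch that inspects what CountVectorizer does internally with the same three sentences. It uses `build_analyzer()` and the fitted `vocabulary_` attribute. The variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

data = [
    'Test sentence one of three.',
    'Second test sentence of three.',
    'Last sentence of three.'
]

vec = CountVectorizer()

# The analyzer lowercases the text and splits it into tokens
# (by default, words of at least two characters; punctuation is dropped)
analyze = vec.build_analyzer()
tokens = analyze('Test sentence one of three.')
print(tokens)  # ['test', 'sentence', 'one', 'of', 'three']

# Fitting assigns every token a column index in alphabetical order
vec.fit(data)
vocab = sorted(vec.vocabulary_.items(), key=lambda item: item[1])
print(vocab)

# 'new' was never seen during fit, so it has no column and is ignored;
# only the count for 'sentence' (column 4) is set
new_counts = vec.transform(['New sentence']).toarray()
print(new_counts)  # [[0 0 0 0 1 0 0]]
```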

Large texts contain many words like "a" and "the" that carry little meaning for our classifier and can mislead our model. To prevent this, we could run CountVectorizer and then delete all tokens that appear in more than k percent of the documents, or we could use Scikit Learn's <i>TfidfTransformer</i> in combination with the CountVectorizer, or <i>TfidfVectorizer</i>, which combines both of them. These two classes put weights on the tokens: the more documents a token appears in, the lower its weight.

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer

data_vec = vec.transform(data).toarray()
data_vec

array([[0, 1, 1, 0, 1, 1, 1],
       [0, 1, 0, 1, 1, 1, 1],
       [1, 1, 0, 0, 1, 0, 1]], dtype=int64)

In [7]:
tfidf = TfidfTransformer()
data_vec_weighted = tfidf.fit_transform(data_vec)
data_vec_weighted.toarray()

array([[0.        , 0.3645444 , 0.61722732, 0.        , 0.3645444 ,
        0.46941728, 0.3645444 ],
       [0.        , 0.3645444 , 0.        , 0.61722732, 0.3645444 ,
        0.46941728, 0.3645444 ],
       [0.69903033, 0.41285857, 0.        , 0.        , 0.41285857,
        0.        , 0.41285857]])
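To see where these numbers come from, the sketch below recomputes the weights by hand with NumPy, assuming the default TfidfTransformer settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`). The count matrix is copied from the output above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Token counts from the CountVectorizer output
counts = np.array([
    [0, 1, 1, 0, 1, 1, 1],
    [0, 1, 0, 1, 1, 1, 1],
    [1, 1, 0, 0, 1, 0, 1],
])

n_docs = counts.shape[0]

# Document frequency: in how many documents does each token occur?
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the raw counts, then scale every row to unit L2 length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# The hand-computed matrix should match Scikit Learn's result
reference = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, reference))  # True
```

Tokens that occur in all three documents ("of", "sentence", "three") get an idf of exactly 1, while "one", which occurs in a single document, gets the highest weight in its row.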

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer()
tfidf_vec.fit_transform(data).toarray()

array([[0.        , 0.3645444 , 0.61722732, 0.        , 0.3645444 ,
        0.46941728, 0.3645444 ],
       [0.        , 0.3645444 , 0.        , 0.61722732, 0.3645444 ,
        0.46941728, 0.3645444 ],
       [0.69903033, 0.41285857, 0.        , 0.        , 0.41285857,
        0.        , 0.41285857]])
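The other option mentioned earlier, dropping tokens that appear in too many documents, doesn't have to be done by hand. A minimal sketch, assuming a threshold of 90% chosen only for this example, uses CountVectorizer's `max_df` parameter:

```python
from sklearn.feature_extraction.text import CountVectorizer

data = [
    'Test sentence one of three.',
    'Second test sentence of three.',
    'Last sentence of three.'
]

# A float max_df is a share of documents: tokens that occur in more than
# 90% of the documents are removed from the vocabulary
vec = CountVectorizer(max_df=0.9)
X = vec.fit_transform(data)

# 'of', 'sentence' and 'three' occur in every document, so only the
# more specific tokens remain
kept = sorted(vec.vocabulary_)
print(kept)  # ['last', 'one', 'second', 'test']
print(X.toarray())
```

Unlike TF-IDF weighting, this removes the frequent tokens entirely instead of down-weighting them, so the threshold needs to be chosen with the dataset in mind.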

## Resources

<ul>
    <li><a href="http://scikit-learn.org/stable/modules/feature_extraction.html">Feature Extraction (Scikit Learn Documentation)</a></li>
    <li><a href="https://datascience.stackexchange.com/questions/29006/feature-selection-vs-feature-extraction-which-to-use-when">Feature selection vs Feature extraction. Which to use when? (stackexchange)</a></li>
    <li><a href="https://en.wikipedia.org/wiki/Feature_extraction">Feature Extraction (Wikipedia)</a></li>
</ul>

## Conclusion

That was a quick overview of feature extraction and how to implement it in Scikit Learn.
I hope you liked this tutorial. If you did, consider subscribing to my <a href="https://www.youtube.com/channel/UCBOKpYBjPe2kD8FSvGRhJwA">YouTube channel</a> or following me on social media. If you have any questions, feel free to contact me.