# Introduction

In this lab, we will learn how to perform feature extraction with the scikit-learn library. Feature extraction transforms raw data, such as text or images, into numerical features that machine learning algorithms can use. Here we work with two kinds of input: lists of dicts and text documents.



# Loading features from dicts

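In this step, we will learn how to load features from a list of dicts using the **DictVectorizer** class in scikit-learn. String values are one-hot encoded into one column per `key=value` pair, while numeric values are passed through unchanged.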

```python
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

vec = DictVectorizer()
features = vec.fit_transform(measurements).toarray()
feature_names = vec.get_feature_names_out()

print(features)
print(feature_names)
```

Output:

```text
[[ 1.  0.  0. 33.]
 [ 0.  1.  0. 12.]
 [ 0.  0.  1. 18.]]
['city=Dubai' 'city=London' 'city=San Francisco' 'temperature']
```
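
A fitted DictVectorizer can also be applied to new records. The sketch below is an illustration rather than part of the original lab: the `new_rows` data is made up, and it relies on the documented behavior that feature names not seen during fitting are silently ignored at transform time.

```python
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

vec = DictVectorizer(sparse=False)  # return a dense NumPy array directly
vec.fit(measurements)

# Hypothetical new records: 'Paris' was never seen during fit
new_rows = [
    {'city': 'London', 'temperature': 15.},
    {'city': 'Paris', 'temperature': 20.},
]
new_features = vec.transform(new_rows)
print(new_features)

# inverse_transform maps each row back to a dict of its non-zero features
restored = vec.inverse_transform(new_features)
print(restored)
```

The `Paris` row has zeros in all three city columns, so only its temperature survives the round trip.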


# Feature hashing

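In this step, we will learn how to perform feature hashing using the **FeatureHasher** class in scikit-learn. Feature hashing (also called the hashing trick) applies a hash function to each feature name to pick its column in a fixed-length vector. No vocabulary has to be stored, at the cost that two different features can collide in the same column.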
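
To make the idea concrete, here is a minimal pure-Python sketch of the hashing trick. It is an illustration only: `hash_features` is a hypothetical helper, and it uses MD5 from `hashlib` to get a stable hash, whereas FeatureHasher uses the signed 32-bit version of MurmurHash3 plus a sign trick that this sketch omits.

```python
import hashlib

def hash_features(tokens, n_features=8):
    """Map string tokens to a fixed-length count vector (hypothetical helper)."""
    vector = [0] * n_features
    for token in tokens:
        # Hash the token, then fold the hash into the range [0, n_features)
        digest = hashlib.md5(token.encode('utf-8')).digest()
        index = int.from_bytes(digest[:4], 'little') % n_features
        vector[index] += 1
    return vector

vec_a = hash_features(['category=thriller', 'category=drama', 'year=2003'])
vec_b = hash_features(['year=1974'])
print(vec_a)
print(vec_b)
```

Both vectors have length 8 regardless of how many distinct tokens exist, which is the property that makes hashing attractive for very large or open-ended feature sets.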

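```python
from sklearn.feature_extraction import FeatureHasher

movies = [
    {'category': ['thriller', 'drama'], 'year': 2003},
    {'category': ['animation', 'family'], 'year': 2011},
    {'year': 1974},
]

# With input_type='string', each sample must be an iterable of strings.
# Passing the dicts directly would hash only their keys, so flatten each
# dict into "key=value" tokens first.
def to_tokens(movie):
    tokens = [f"category={c}" for c in movie.get('category', [])]
    tokens.append(f"year={movie['year']}")
    return tokens

hasher = FeatureHasher(input_type='string')
hashed_features = hasher.transform([to_tokens(m) for m in movies]).toarray()

print(hashed_features)
```

Output:

```text
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
```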
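
The printed matrix is almost entirely zeros because FeatureHasher defaults to `n_features=2**20` columns. A smaller sketch makes the fixed output width easier to see. It assumes a made-up `records` list with one scalar value per key and an arbitrarily chosen `n_features=8`:

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical records: a string value is hashed as "key=value" with weight 1,
# a numeric value is used as the weight of the hashed key.
records = [
    {'category': 'thriller', 'year': 2003},
    {'category': 'animation', 'year': 2011},
    {'year': 1974},
]

hasher = FeatureHasher(n_features=8, input_type='dict')
hashed = hasher.transform(records)  # no fit needed: the hasher is stateless
dense = hashed.toarray()

print(hashed.shape)
print(dense)
```

Some entries may be negative: with the default `alternate_sign=True`, the hash also determines a sign for each feature so that collisions tend to cancel rather than accumulate.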


# Text feature extraction

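In this step, we will learn how to perform text feature extraction using the **CountVectorizer** and **TfidfVectorizer** classes in scikit-learn. Both convert a collection of text documents into a numerical matrix: CountVectorizer stores raw token counts, while TfidfVectorizer re-weights those counts by inverse document frequency.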

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(corpus).toarray()
feature_names = vectorizer.get_feature_names_out()

print(features)
print(feature_names)
```

Output:

```text
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
```


# Customizing the vectorizer classes

In this step, we will learn how to customize the behavior of the vectorizer classes by passing them callables, such as a custom tokenizer.

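```python
# Reuses `corpus` and `CountVectorizer` from the previous step.
def my_tokenizer(s):
    # Split on whitespace only, so punctuation stays attached to tokens
    return s.split()

# token_pattern=None avoids the warning that the default pattern is unused
vectorizer = CountVectorizer(tokenizer=my_tokenizer, token_pattern=None)
features = vectorizer.fit_transform(corpus).toarray()

print(features)
```

Output:

```text
[[0 1 0 1 1 0 0 1 0 1]
 [0 1 0 0 1 0 2 1 0 1]
 [1 0 0 0 0 1 0 1 1 0]
 [0 0 1 1 1 0 0 1 0 1]]
```

The matrix now has ten columns instead of nine because `document.` and `document?` are counted as separate tokens.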




# Summary

In this lab, we learned how to perform feature extraction using the scikit-learn library. We covered loading features from dicts, feature hashing, and text feature extraction, and we saw how to customize the vectorizer classes by passing our own callables. Feature extraction is an important step in machine learning because it turns raw data into the numerical format that algorithms need in order to make predictions.