# Extracting Features with Transformers

The datasets we have used so far have been described in terms of features. In the previous notebooks, we used a transaction-centric dataset. However, ultimately this was just a different format for representing feature-based data.

There are many other types of datasets, including text, images, sounds, movies, or even real objects. Most data mining algorithms, however, rely on having numerical or categorical features. This means we need a way to represent these types of data as features before we pass them to a data mining algorithm.

The key concepts:
- Extracting features from datasets
- Creating new features
- Selecting good features
- Creating your own transformer for custom datasets

## Feature extraction

Extracting features is one of the most critical tasks in data mining, and **it generally affects your end result more than the choice of data mining algorithm**. Unfortunately, there are no hard-and-fast rules for choosing features that will result in high-performance data mining. In many ways, this is where the science of data mining becomes more of an art. Creating good features relies on intuition, domain expertise, data mining experience, trial and error, and sometimes a little luck.

## Representing reality in models

Not all datasets are presented in terms of features. Sometimes, a dataset consists of nothing more than all of the books that have been written by a given author. Sometimes, it is the footage of every movie released in 1979. At other times, it is a library collection of interesting historical artifacts.

From these datasets, we may want to perform a data mining task. For the books, we
may want to know the different categories that the author writes in. For the films, we
may wish to see how women are portrayed. For the historical artifacts, we may want
to know whether they are from one country or another. It isn't possible to just pass
these raw datasets into a decision tree and see what the result is.

For a data mining algorithm to assist us here, we need to represent these as features.
Features are a way to create a model, and the model provides an approximation of
reality in a form that data mining algorithms can understand. A model, therefore, is
just a simplified version of some aspect of the real world. As an example, the game of
chess is a simplified model for historical warfare.
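
As a rough illustration of representing raw text as features, the sketch below counts how often a few chosen words appear in each document. The function name `extract_word_counts`, the vocabulary, and the two sample sentences are all made up for this example; real feature extraction from books would need a far larger vocabulary and more careful text handling.

```python
from collections import Counter
import re


def extract_word_counts(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    # Lowercase the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # A Counter returns 0 for words that never appeared.
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary and documents, chosen only for illustration.
vocabulary = ["ship", "sea", "murder", "detective"]
documents = [
    "The ship left port and the sea was calm. The ship sailed on.",
    "The detective examined the murder scene. A second murder followed.",
]

# Each document becomes a fixed-length row of numeric features.
feature_matrix = [extract_word_counts(doc, vocabulary) for doc in documents]
print(feature_matrix)
```

Each row now has the same length and contains only numbers, which is the shape most data mining algorithms expect as input.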

Selecting features has another advantage: it reduces the complexity of the real
world into a more manageable model. Imagine how much information it would take
to properly, accurately, and fully describe a real-world object to someone who has
no background knowledge of the item. You would need to describe the size, weight,
texture, composition, age, flaws, purpose, origin, and so on.

The complexity of real objects is too much for current algorithms, so we use these
simpler models instead.
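
To make this concrete, here is a minimal sketch of reducing a real object to a handful of features. The attribute names (`height_cm`, `weight_kg`, `age_years`, `material`), the list of materials, and the two artifacts are hypothetical values invented for this example. Numeric attributes are kept as they are, and the categorical `material` attribute is one-hot encoded.

```python
def encode_artifact(artifact, materials):
    """Turn a dict describing an artifact into a flat numeric feature vector."""
    # Numeric attributes are used directly.
    numeric = [artifact["height_cm"], artifact["weight_kg"], artifact["age_years"]]
    # The categorical attribute becomes one 0/1 column per known material.
    one_hot = [1 if artifact["material"] == m else 0 for m in materials]
    return numeric + one_hot


# Hypothetical categories and artifacts, for illustration only.
materials = ["bronze", "clay", "stone"]
artifacts = [
    {"height_cm": 30.0, "weight_kg": 2.5, "age_years": 1800, "material": "clay"},
    {"height_cm": 12.0, "weight_kg": 0.8, "age_years": 2400, "material": "bronze"},
]

X = [encode_artifact(a, materials) for a in artifacts]
print(X)
```

Everything not listed in the dictionary (texture, flaws, purpose, and so on) is simply discarded, which is exactly the simplification described above.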

This simplification also focuses our intent in the data mining application. In later
chapters, we will look at clustering, where this focus is critically important. If you put
random features in, you will get random results out.

However, there is a downside: this simplification reduces the level of detail and may
remove good indicators of the things we wish to perform data mining on.

Thought should always be given to how to represent reality in the form of a model.
Rather than just using what has been used in the past, you need to consider the goal
of the data mining exercise. What are you trying to achieve? In Chapter 3, Predicting
Sports Winners with Decision Trees, we created features by thinking about the goal
(predicting winners) and used a little domain knowledge to come up with ideas for
new features.
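
The sketch below shows one way a goal-driven feature could be built for predicting winners: whether each team won its previous game. It is an assumption-laden illustration, not the code from that chapter. The function `add_last_win_features`, the field names, the team names, and the scores are all hypothetical.

```python
from collections import defaultdict


def add_last_win_features(games):
    """Add 'did this team win its previous game?' features to each game.

    Games must be given in chronological order.
    """
    # Teams with no earlier game default to False (no known previous win).
    won_last = defaultdict(bool)
    rows = []
    for game in games:
        home, visitor = game["home"], game["visitor"]
        home_win = game["home_pts"] > game["visitor_pts"]
        # Features are computed from earlier games only, before updating state.
        rows.append({
            **game,
            "home_last_win": won_last[home],
            "visitor_last_win": won_last[visitor],
            "home_win": home_win,
        })
        won_last[home] = home_win
        won_last[visitor] = not home_win
    return rows


# Hypothetical results, for illustration only.
games = [
    {"home": "A", "visitor": "B", "home_pts": 100, "visitor_pts": 90},
    {"home": "B", "visitor": "C", "home_pts": 95, "visitor_pts": 99},
    {"home": "C", "visitor": "A", "home_pts": 80, "visitor_pts": 85},
]

rows = add_last_win_features(games)
for row in rows:
    print(row["home"], row["visitor"], row["home_last_win"], row["visitor_last_win"])
```

The new features use only information available before each game, so they could be computed for a game that has not been played yet.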

**Note**: Not all features need to be numeric or categorical. Algorithms have
been developed that work directly on text, graphs, and other data
structures. Unfortunately, those algorithms are outside the scope of
this book. In this book, we mainly use numeric or categorical features.
