In [1]:
import pandas as pd, numpy as np, matplotlib.pyplot as plt


# Feature Engineering

The core concepts we've covered so far assume you're working with neat numerical data arranged in rows and columns, but real-world data is rarely that tidy. That's where feature engineering comes in: it's the practice of taking whatever information you have about a problem and turning it into numbers that a machine learning model can actually use.

Throughout this chapter, we'll explore practical feature engineering techniques for different types of data. We'll look at features for representing categorical data, text, and images. We'll also look at derived features for increasing model complexity, and at ways to fill in missing data. Converting arbitrary data into well-structured numerical vectors in this way is often called vectorization.

So let's bridge the gap between textbook machine learning and real-world applications!



## Categorical Features

Categorical data is a common kind of non-numerical data. Suppose you're exploring housing market data: alongside numerical features like price and number of rooms, you'll also have descriptive attributes such as the neighborhood. For example, your data might look like this:


In [9]:
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]
pd.DataFrame(data)

|   | price | rooms | neighborhood |
|---|-------|-------|--------------|
| 0 | 850000 | 4 | Queen Anne |
| 1 | 700000 | 3 | Fremont |
| 2 | 650000 | 3 | Wallingford |
| 3 | 600000 | 2 | Fremont |


You might be tempted to encode the neighborhoods with a direct numerical mapping:

In [3]:
{'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3}

{'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3}

While mapping categories to integers might seem intuitive, it is generally not a useful approach in Scikit-Learn. The package's models treat numerical features as quantities that can be compared and combined, so this mapping would imply, for example, that

$\text{Queen Anne} < \text{Fremont} < \text{Wallingford}$

(Local real estate agents might get a kick out of that one.)
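
To make the problem concrete, here is a small sketch that reuses the integer codes from the mapping above. The variable names (`codes`, `implied_order`, `same_as_fremont`) are illustrative and not part of the original notebook. The point is only that ordinary arithmetic on the codes produces statements about neighborhoods that nothing in the data supports.

```python
# Integer codes copied from the mapping shown above.
codes = {'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3}

# Ordering: sorting by code ranks the neighborhoods, even though
# the numbers were assigned arbitrarily.
implied_order = sorted(codes, key=codes.get)

# Arithmetic: the codes also claim that
# Wallingford - Queen Anne equals Fremont.
difference = codes['Wallingford'] - codes['Queen Anne']
same_as_fremont = difference == codes['Fremont']

print(implied_order)
print(same_as_fremont)
```
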
Instead, a widely used technique is one-hot encoding, which creates one extra column per category and fills it with 1 where the row belongs to that category and 0 otherwise. The good news? When your data comes as a list of dictionaries, Scikit-Learn's `DictVectorizer` handles this transformation for you.

In [11]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)

array([[     0,      1,      0, 850000,      4],
       [     1,      0,      0, 700000,      3],
       [     0,      0,      1, 650000,      3],
       [     1,      0,      0, 600000,      2]])

The categorical `neighborhood` feature has been expanded into three separate columns, one per neighborhood label, and each row has a 1 in the column for its own neighborhood. With the categorical feature encoded this way, you can fit a Scikit-Learn model as usual.
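
To see what the encoding is doing, the same matrix can be rebuilt by hand in plain Python. This is only a rough sketch of the idea, not how `DictVectorizer` is implemented. The helper `one_hot_row` and the column order (neighborhoods sorted alphabetically, then `price`, then `rooms`) are assumptions chosen to match the output shown above.

```python
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'},
]

# One indicator column per distinct neighborhood; sorting keeps the
# column order deterministic.
categories = sorted({row['neighborhood'] for row in data})

def one_hot_row(row):
    """Hypothetical helper: indicator flags, then the numeric features."""
    flags = [1 if row['neighborhood'] == c else 0 for c in categories]
    return flags + [row['price'], row['rooms']]

encoded = [one_hot_row(row) for row in data]

for row in encoded:
    print(row)
```

Each row ends up with exactly one 1 among its indicator columns, which is where the name "one-hot" comes from.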

Want to check what each column represents? Just take a look at the feature names.