# M1.1 Feature Engineering 
[![View notebooks on Github](https://img.shields.io/static/v1.svg?logo=github&label=Repo&message=View%20On%20Github&color=lightgrey)](https://github.com/cltl/ml4nlp_tutorial_notebooks/blob/main/my_notebooks/m1_1_feature_engineering.ipynb)
[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cltl/ml4nlp_tutorial_notebooks/blob/main/my_notebooks/m1_1_feature_engineering.ipynb)  


### Learning Objectives
By working through this notebook, you will learn:
1. What features are and the different types that exist
2. How to use `DictVectorizer` to process and transform features

## Introduction to Feature Engineering

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy.

**Why is it important?**
- Raw data is often not in a format suitable for machine learning algorithms 
- Good features can make simple models perform extremely well
- By extracting features we can understand our data a lot better.


## Understanding Features

### What is a Feature?

A feature is an individual measurable property or characteristic of a phenomenon being observed. In machine learning, features are the input variables used to make predictions. In NLP, which features to use depends heavily on the task we are trying to complete. For example, in named entity recognition, where we aim to check whether a word or substring refers to a named entity, it might be relevant whether the string starts with a capital letter. Of course, many non-entity words also start with a capital (such as the first word in a sentence), so based on the feature `is_capitalized` alone we might not know for sure whether the word refers to a named entity. This is why it is useful to have many different features. In general, we distinguish between a few types of features based on the values they can take; the categories are listed below.

### Types of Features
1. **Numerical Features**: Continuous or discrete numbers
2. **Categorical Features**: Discrete categories or labels
3. **Binary Features**: True/False, Yes/No, 1/0
4. **Text Features**: Words, characters, n-grams
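
To make these types a bit more concrete for NLP, here is a minimal sketch of a feature extractor for a single token. The function name `token_features` and all feature names in it are made up for illustration; they are not part of any library.

```python
def token_features(token, position):
    """Return a feature dictionary for one token (illustrative feature names)."""
    return {
        'word_length': len(token),                # numerical
        'suffix': token[-2:].lower(),             # categorical
        'is_capitalized': token[:1].isupper(),    # binary
        'is_first_in_sentence': position == 0,    # binary
        'lowercase_form': token.lower(),          # text
    }

sentence = ['Yesterday', 'Anna', 'visited', 'Utrecht']
features = [token_features(tok, i) for i, tok in enumerate(sentence)]

for tok, feats in zip(sentence, features):
    print(tok, feats)
```

Note that 'Yesterday' is capitalized only because it starts the sentence, which is why `is_capitalized` is combined with `is_first_in_sentence` here.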

**Question**: Can you identify the type of features for the instance below?
```python
house_instance = {
    'city': 'Amsterdam', 
    'rooms': 3,
    'area_m2': 75.0, 
    'type': 'apartment'
    }
```


## Example 1: Numerical Features
Let's start with an example of numerical data, taking the NLP side a bit more loosely and focusing on a more general machine learning problem to keep things simple. Say we have measurements from different weather stations in the Netherlands, and for each station we store two pieces of information: temperature and humidity. See the Python cell below. 
<details>
  <summary>Refresher: instance, keys, values
  </summary>

Each **instance** in your dataset is a dictionary, which contains keys and values. In this toy example an instance is a weather station, the **keys** are the features (e.g. temperature, humidity, city), and the **values** (as the name indicates) are what that feature is for a specific instance. For one weather station, the value for the key "temperature" might be 18.5 (probably in degrees).
</details>
 

**Note:** For NLP feature engineering and in these exercises we will _first_ store the features in a python dictionary (as it's easier to control).

In [2]:
# Temperature and humidity measurements
measurements = [
    {'temperature': 18.5, 'humidity': 75},
    {'temperature': 22.0, 'humidity': 60},
    {'temperature': 15.2, 'humidity': 80},
    {'temperature': 20.1, 'humidity': 65},
]

### Vectorizing the data: `DictVectorizer` 

Computers, however, love numbers, so we want to simplify this and convert each dictionary to a vector, which is just a list of numbers. For each weather station we thus obtain one vector.
For our experiments we can use an existing class from `sklearn`, `DictVectorizer`, which makes it really easy to convert our dictionaries to vectors.

Let's go and vectorize our `measurements`!

In [3]:
# ignore warning messages for cleaner output of the website
import warnings
warnings.filterwarnings('ignore')

from sklearn.feature_extraction import DictVectorizer

In [4]:
# Remember to set sparse=False, otherwise you get a weird type of variable that is annoying to work with (we discuss this later)
vec_num = DictVectorizer(sparse=False)  
X_num = vec_num.fit_transform(measurements)

print("Feature matrix is of type:", type(X_num))
print("Feature matrix: \n", X_num)
print("\nFeature names:", vec_num.get_feature_names_out())   

Feature matrix is of type: <class 'numpy.ndarray'>
Feature matrix: 
 [[75.  18.5]
 [60.  22. ]
 [80.  15.2]
 [65.  20.1]]

Feature names: ['humidity' 'temperature']


**What happened?** When values are numerical (int or float), DictVectorizer treats them as **numerical features** and keeps them as-is. Each feature becomes one column in the matrix. Simple!

<details>
  <summary><b>Question</b> : What are the columns? </summary>

The columns represent the dictionary features. The column order does not have to follow the order of the keys in our dictionaries, though: by default `DictVectorizer` sorts the feature names alphabetically, which is why `humidity` comes before `temperature` here. To see what each column represents we use the function `vec_num.get_feature_names_out()`, which shows us which feature each column holds.  
-  `humidity` → 1 column
- `temperature` → 1 column
- **Total: 2 columns**

This is not that relevant here, but it becomes more relevant for categorical values.
</details>
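
If you want to look up a column by name instead of reading the name list by eye, the fitted vectorizer also keeps a `vocabulary_` dictionary that maps each feature name to its column index. A small sketch, reusing two of the measurements from above:

```python
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'temperature': 18.5, 'humidity': 75},
    {'temperature': 22.0, 'humidity': 60},
]

vec_num = DictVectorizer(sparse=False)
X_num = vec_num.fit_transform(measurements)

# vocabulary_ maps each feature name to its column index
temp_col = vec_num.vocabulary_['temperature']
print("Column index of 'temperature':", temp_col)
print("All temperatures:", X_num[:, temp_col])
```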


Also, the option `sparse=False` ensures that the returned object is a NumPy array, which is very easy to work with. If you don't set this option you get a `scipy.sparse._csr.csr_matrix` object, which is more memory-efficient but much harder to work with or modify.
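
If you do end up with a sparse matrix (for example because you left out the `sparse` argument), you can still convert it to a regular NumPy array afterwards with `.toarray()`. A minimal sketch; the variable names `vec_default`, `X_sp` and `X_as_dense` are only for illustration:

```python
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'temperature': 18.5, 'humidity': 75},
    {'temperature': 22.0, 'humidity': 60},
]

vec_default = DictVectorizer()              # no sparse argument: output is sparse
X_sp = vec_default.fit_transform(measurements)
print(type(X_sp))

X_as_dense = X_sp.toarray()                 # convert to a dense NumPy array
print(type(X_as_dense))
print(X_as_dense)
```

For small toy data this conversion is harmless; for large one-hot matrices it can use a lot of memory.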

<details>
  <summary><b>Extra</b> : How does DictVectorizer remember the features? </summary>

In Python, a variable like `vec_num` usually does not change unless it is overwritten. 
But in the code above `vec_num` is created in line 2 and used in line 3 via `fit_transform()`, and then in line 7 `get_feature_names_out()` shows that it has remembered the features. 

The important part is that `DictVectorizer` is a class and `vec_num` is an instance of that class, which can store variables as well. So while it seems that `fit_transform()` only returns `X_num`, it also _stores_ the feature names inside `vec_num`. 

</details>
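
To see how an object can "remember" something between calls, here is a toy class that mimics the idea. `TinyVectorizer` is a made-up stand-in, not how scikit-learn implements `DictVectorizer`, and it only handles numerical features.

```python
class TinyVectorizer:
    """Toy stand-in for DictVectorizer; handles numerical features only."""

    def fit(self, instances):
        # Collect all keys and store them, sorted, on the object itself
        self.feature_names_ = sorted({key for inst in instances for key in inst})
        return self

    def transform(self, instances):
        # Use the stored names to fix the column order; missing keys become 0.0
        return [[float(inst.get(name, 0.0)) for name in self.feature_names_]
                for inst in instances]

    def fit_transform(self, instances):
        return self.fit(instances).transform(instances)


tiny = TinyVectorizer()
rows = tiny.fit_transform([{'temperature': 18.5, 'humidity': 75}])
print(rows)
print(tiny.feature_names_)   # still available after fit_transform returned
```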


## Example 2: Categorical Values
Now let's look at categorical (string) data:

In [5]:
# Weather data with categorical features
weather_data = [
    {'city': 'Amsterdam', 'weather': 'sunny', 'wind': 'strong'},
    {'city': 'Rotterdam', 'weather': 'cloudy', 'wind': 'weak'},
    {'city': 'Utrecht', 'weather': 'rainy', 'wind': 'weak'},
    {'city': 'Amsterdam', 'weather': 'snowy', 'wind': 'weak'},
]


### Vectorizing Categorical Data -> _One-Hot Encoding_
For many features we don't have exact numbers; instead the feature takes one of a set of categories. For example, the feature "city" can be "Amsterdam", "Rotterdam" or "Utrecht", so it has 3 different categories. Again, for our machine learning experiments we want to convert this to numbers.

In [6]:
# Create and fit the vectorizer
vec_cat = DictVectorizer(sparse=False)  # sparse=False to get dense array for visualization
X_cat = vec_cat.fit_transform(weather_data)

print("Feature matrix is of shape:", X_cat.shape)   # Shape returns (number of rows, number of columns). 
print("Feature matrix: \n", X_cat)
# print("\nFeature names:", vec_cat.get_feature_names_out()) 

Feature matrix is of shape: (4, 9)
Feature matrix: 
 [[1. 0. 0. 0. 0. 0. 1. 1. 0.]
 [0. 1. 0. 1. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 1. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 1. 0. 0. 1.]]



<!-- ### Understanding the Output: One-Hot Encoding -->

**What happened?** DictVectorizer detected that all values are strings and treated them as **categorical features**. 
For categorical features it creates a separate column for each value. This representation is called **one-hot encoding**: it creates a binary (0 or 1) feature for each unique value, and the vector is 1 at the position of the instance's value and 0 at the other positions.

<!-- #### How Many Columns Are Created? -->

So for categorical features, the number of columns equals the number of unique values:
- `city` has 3 unique values (Amsterdam, Rotterdam, Utrecht) → **3 columns**
- `weather` has 4 unique values (sunny, cloudy, rainy, snowy) → **4 columns**
- `wind` has 2 unique values (strong, weak) → **2 columns**
- **Total: 9 columns**
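
To check this counting by hand, the sketch below builds the same one-hot encoding in plain Python. It is only meant to illustrate the idea; the helper `one_hot` and the variables `names` and `column_of` are made up here and are not how `DictVectorizer` is implemented internally.

```python
weather_data = [
    {'city': 'Amsterdam', 'weather': 'sunny', 'wind': 'strong'},
    {'city': 'Rotterdam', 'weather': 'cloudy', 'wind': 'weak'},
    {'city': 'Utrecht', 'weather': 'rainy', 'wind': 'weak'},
    {'city': 'Amsterdam', 'weather': 'snowy', 'wind': 'weak'},
]

# Step 1 ("fit"): collect every feature=value pair and give it a column index
names = sorted({f"{key}={value}" for inst in weather_data for key, value in inst.items()})
column_of = {name: idx for idx, name in enumerate(names)}

# Step 2 ("transform"): put a 1.0 in the column of each pair the instance has
def one_hot(instance):
    row = [0.0] * len(names)
    for key, value in instance.items():
        row[column_of[f"{key}={value}"]] = 1.0
    return row

print("Number of columns:", len(names))
print("First instance:", one_hot(weather_data[0]))
```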

<!-- #### Reading the Feature Matrix -->

Let's look at the first row (Amsterdam, sunny, strong):


In [7]:
print("First instance features:")
print(weather_data[0])
print("\nFirst row of feature matrix:")
print(X_cat[0])
print("\nFeature names:")
print(vec_cat.get_feature_names_out())

First instance features:
{'city': 'Amsterdam', 'weather': 'sunny', 'wind': 'strong'}

First row of feature matrix:
[1. 0. 0. 0. 0. 0. 1. 1. 0.]

Feature names:
['city=Amsterdam' 'city=Rotterdam' 'city=Utrecht' 'weather=cloudy'
 'weather=rainy' 'weather=snowy' 'weather=sunny' 'wind=strong' 'wind=weak']


The vector `[1, 0, 0, 0, 0, 0, 1, 1, 0]` means:
- `city=Amsterdam`: 1 (others are 0)
- `weather=sunny`: 1 (others are 0)
- `wind=strong`: 1 (others are 0)

Each instance has exactly one '1' per original feature, indicating which category it belongs to.

Second check: in the 1st column we see that besides the first row, the last row also has the value 1. This is correct because the last item in our `weather_data` list also has 'Amsterdam' as the city.

As you may have realized by now, reading one-hot vectors is a tad annoying. It is good practice to understand what is going on, but you may now see why we prefer to read and create our feature data in dictionaries first. :) 
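
If you ever need to read a one-hot row without counting columns, `DictVectorizer` also offers `inverse_transform()`, which lists the non-zero columns of each row by name. A short sketch using the same weather data:

```python
from sklearn.feature_extraction import DictVectorizer

weather_data = [
    {'city': 'Amsterdam', 'weather': 'sunny', 'wind': 'strong'},
    {'city': 'Rotterdam', 'weather': 'cloudy', 'wind': 'weak'},
    {'city': 'Utrecht', 'weather': 'rainy', 'wind': 'weak'},
    {'city': 'Amsterdam', 'weather': 'snowy', 'wind': 'weak'},
]

vec_cat = DictVectorizer(sparse=False)
X_cat = vec_cat.fit_transform(weather_data)

# One dictionary per row, containing only the columns that are non-zero
decoded = vec_cat.inverse_transform(X_cat)
for row in decoded:
    print(sorted(row))
```

Note that the keys are the one-hot column names (such as `city=Amsterdam`), not the original `city` key.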

## Example 3: When Categorical Values Go Wrong

What happens if we accidentally have categorical data stored as numbers? Let's see:

In [8]:
# Oops! City codes stored as numbers
bad_data = [
    {'city_code': 1, 'weather': 'rainy'},   # 1 = Amsterdam
    {'city_code': 2, 'weather': 'sunny'},   # 2 = Rotterdam
    {'city_code': 1, 'weather': 'cloudy'},  # 1 = Amsterdam
    {'city_code': 3, 'weather': 'rainy'},   # 3 = Utrecht
]

vec_bad = DictVectorizer(sparse=False)
X_bad = vec_bad.fit_transform(bad_data)

print("Feature matrix (WRONG!):")
print(X_bad)
print("\nFeature names:")
print(vec_bad.get_feature_names_out())

Feature matrix (WRONG!):
[[1. 0. 1. 0.]
 [2. 0. 0. 1.]
 [1. 1. 0. 0.]
 [3. 0. 1. 0.]]

Feature names:
['city_code' 'weather=cloudy' 'weather=rainy' 'weather=sunny']



**Problem!** 
 - DictVectorizer treats `city_code` as a numerical feature because the values are numbers. But city codes are categorical. While we can still distinguish the cities this way, the codes now impose an ordering that is arbitrary to us but that the model takes at face value. 
 - Due to this arbitrary ordering it seems that city 2 (Rotterdam) is "twice as much" as city 1 (Amsterdam), something that will not sit right with people from Amsterdam. Moreover, the classification algorithms we will train later on cannot really separate the cities right now, as the effects of the different city codes interfere with each other within a single column. So let's fix this.

**The fix:** Convert categorical data to strings before vectorizing:


In [9]:
# Fix: Convert to strings
good_data = [
    {'city_code': '1', 'weather': 'rainy'},
    {'city_code': '2', 'weather': 'sunny'},
    {'city_code': '1', 'weather': 'cloudy'},
    {'city_code': '3', 'weather': 'rainy'},
]

vec_good = DictVectorizer(sparse=False)
X_good = vec_good.fit_transform(good_data)

print("Feature matrix (CORRECT!):")
print(X_good)
print("\nFeature names:")
print(vec_good.get_feature_names_out())

Feature matrix (CORRECT!):
[[1. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0. 1.]
 [1. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 1. 0.]]

Feature names:
['city_code=1' 'city_code=2' 'city_code=3' 'weather=cloudy'
 'weather=rainy' 'weather=sunny']



Now `city_code` is properly one-hot encoded with 3 columns!

**Key lesson:** Always ensure your categorical variables are stored as strings, not numbers, before using DictVectorizer. But don't forget that for some features you really do want the numerical values. For example, in an NLP task where we store a feature `word_length` that counts the number of characters in a word, we really do want an integer, as the scale means something here. 

## Example 4: Mixed Features

Real-world data often has both categorical and numerical features:

In [10]:
# Housing data (simplified)
housing_data = [
    {'city': 'Amsterdam', 'rooms': 3, 'area_m2': 75.0, 'type': 'apartment'},
    {'city': 'Rotterdam', 'rooms': 4, 'area_m2': 120.0, 'type': 'house'},
    {'city': 'Utrecht', 'rooms': 2, 'area_m2': 55.0, 'type': 'apartment'},
    {'city': 'Amsterdam', 'rooms': 5, 'area_m2': 150.0, 'type': 'house'},
]

vec_mixed = DictVectorizer(sparse=False)
X_mixed = vec_mixed.fit_transform(housing_data)

print("Feature matrix:")
print(X_mixed)
print("\nFeature names:")
print(vec_mixed.get_feature_names_out())

Feature matrix:
[[ 75.   1.   0.   0.   3.   1.   0.]
 [120.   0.   1.   0.   4.   0.   1.]
 [ 55.   0.   0.   1.   2.   1.   0.]
 [150.   1.   0.   0.   5.   0.   1.]]

Feature names:
['area_m2' 'city=Amsterdam' 'city=Rotterdam' 'city=Utrecht' 'rooms'
 'type=apartment' 'type=house']



**Column count breakdown:**
- `area_m2` (numerical) → 1 column
- `city` (3 unique values) → 3 columns
- `rooms` (numerical) → 1 column
- `type` (2 unique values) → 2 columns
- **Total: 7 columns**

Notice how `city` and `type` (strings) are one-hot encoded, while `rooms` and `area_m2` (numbers) remain as single numerical features.



## Sparse vs Dense Representation

When you have many categorical features with many possible values, most entries in the feature matrix are zeros. This creates **sparse** matrices, that is, matrices in which most of the values are 0. Let's see the difference:


In [11]:
# Create a larger dataset with more categories
large_data = [
    {'city': f'City_{i%10}', 'district': f'District_{i%20}', 'price': i*1000}
    for i in range(50)
]

# Dense representation
vec_dense = DictVectorizer(sparse=False)
X_dense = vec_dense.fit_transform(large_data)

# Sparse representation (default)
vec_sparse = DictVectorizer(sparse=True)
X_sparse = vec_sparse.fit_transform(large_data)

print(f"Dense matrix shape: {X_dense.shape} - Type of the variable: {type(X_dense)}  (classic Numpy array, which we like)")
print(f"Dense matrix size in memory: {X_dense.nbytes} bytes")
print(f"\nSparse matrix shape: {X_sparse.shape} - Type of the variable: {type(X_sparse)}  (weird Scipy sparse matrix type)")
print(f"Sparse matrix size in memory: {X_sparse.data.nbytes + X_sparse.indptr.nbytes + X_sparse.indices.nbytes} bytes")
print(f"\nMemory savings: {(1 - (X_sparse.data.nbytes + X_sparse.indptr.nbytes + X_sparse.indices.nbytes) / X_dense.nbytes) * 100:.1f}%")

Dense matrix shape: (50, 31) - Type of the variable: <class 'numpy.ndarray'>  (classic Numpy array, which we like)
Dense matrix size in memory: 12400 bytes

Sparse matrix shape: (50, 31) - Type of the variable: <class 'scipy.sparse._csr.csr_matrix'>  (weird Scipy sparse matrix type)
Sparse matrix size in memory: 2004 bytes

Memory savings: 83.8%


**Sparse matrices** only store the entries that are actually set, making them much more memory-efficient. This is especially important when working with large datasets or many categorical features. However, in these exercises we often still prefer the dense array, as it is easier to inspect and modify.
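
To get a feeling for how sparse such a matrix is, you can compare the number of stored entries (the `nnz` attribute of a SciPy sparse matrix) with the total number of cells. A small sketch that rebuilds the same toy dataset:

```python
from sklearn.feature_extraction import DictVectorizer

large_data = [
    {'city': f'City_{i%10}', 'district': f'District_{i%20}', 'price': i*1000}
    for i in range(50)
]

X_sparse = DictVectorizer(sparse=True).fit_transform(large_data)

n_rows, n_cols = X_sparse.shape
stored = X_sparse.nnz            # number of explicitly stored entries
total = n_rows * n_cols

print(f"Stored entries: {stored} of {total} cells")
print(f"Fraction stored: {stored / total:.1%}")
```

Each instance has three features, so each row stores three entries no matter how many columns the one-hot encoding creates.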

## Handling Unknown Values
Until now our DictVectorizer used the function `vec.fit_transform()`, which is nice in the sense that it is only 1 line of code (and we coders are lazy by nature), but for our experiments this is a problem. 
When we have different dataset splits, say training, validation and test data, we don't want to call `fit()` on each split separately. Instead we fit a single vectorizer once and use that same vectorizer to `transform()` every split. 
If we call `vec.fit_transform()` on each split separately, two issues can appear: 1) What if some values are missing from one split but do appear in the others? 2) What if the ordering of our features is different for each split?
Let's see some of these issues in practice to understand them better.
<!-- But it is crucial here that the columns of them match, and that it is complete, that is we have enough column to account for all our features (numerical or catagorical). -->

What happens when new data has categories not seen during training?

### Problem: New feature values in the test data are dropped

In [12]:
# Original training data
train_data = [
    {'city': 'Amsterdam', 'weather': 'rainy'},
    {'city': 'Rotterdam', 'weather': 'sunny'},
]

# New test data with unseen category
test_data = [
    {'city': 'Amsterdam', 'weather': 'sunny'},
    {'city': 'Den Haag', 'weather': 'rainy'},  # New city!
]

vec = DictVectorizer(sparse=False)
vec.fit(train_data)

# we could also do fit and transform in one go with:
# X_train = vec.fit_transform(train_data)

# Transform test data - unseen values are ignored by default
X_test = vec.transform(test_data)

print(f"Test features (Shape is: {X_test.shape}):")
print(X_test)
print("\nFeature names:")
print(vec.get_feature_names_out())

Test features (Shape is: (2, 4)):
[[1. 0. 0. 1.]
 [0. 0. 1. 0.]]

Feature names:
['city=Amsterdam' 'city=Rotterdam' 'weather=rainy' 'weather=sunny']


Notice that 'Den Haag' is silently ignored. All its columns for the city feature are zeros, because 'Den Haag' wasn't in the training data, so when we fitted our DictVectorizer it was not aware of Den Haag. 

### Solution: Ensure that the DictVectorizer is fit on all the data (if possible)

In [13]:
# Original training data
train_data = [
    {'city': 'Amsterdam', 'weather': 'rainy'},
    {'city': 'Rotterdam', 'weather': 'sunny'},
]

# New test data with unseen category
test_data = [
    {'city': 'Amsterdam', 'weather': 'sunny'},
    {'city': 'Den Haag', 'weather': 'rainy'},  # New city!
]

# We combine the lists of dictionaries
full_data = train_data + test_data

# We just do fit, not transforming yet, as fitting can be a bit faster than fit_transform
vec = DictVectorizer(sparse=False)
vec.fit(full_data)

# Transform test data - this time 'Den Haag' is known to the vectorizer 
X_test = vec.transform(test_data)

print(f"Test features (Shape is: {X_test.shape}):")
print(X_test)
print("\nFeature names:")
print(vec.get_feature_names_out())

Test features (Shape is: (2, 5)):
[[1. 0. 0. 0. 1.]
 [0. 1. 0. 1. 0.]]

Feature names:
['city=Amsterdam' 'city=Den Haag' 'city=Rotterdam' 'weather=rainy'
 'weather=sunny']


So now `vec.transform()` uses a vectorizer that was fit on the full dataset, so each vector has 5 columns, including one for "Den Haag".
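
The same fitted vectorizer can then be used to transform every split, which guarantees that all matrices have the same columns in the same order. A short sketch with the same toy data:

```python
from sklearn.feature_extraction import DictVectorizer

train_data = [
    {'city': 'Amsterdam', 'weather': 'rainy'},
    {'city': 'Rotterdam', 'weather': 'sunny'},
]
test_data = [
    {'city': 'Amsterdam', 'weather': 'sunny'},
    {'city': 'Den Haag', 'weather': 'rainy'},
]

vec = DictVectorizer(sparse=False)
vec.fit(train_data + test_data)

# One vectorizer, used for both splits
X_train = vec.transform(train_data)
X_test = vec.transform(test_data)

print("Train shape:", X_train.shape)
print("Test shape: ", X_test.shape)
print(X_train)
```

In `X_train` the `city=Den Haag` column exists but contains only zeros, since no training instance has that city.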

##  Self-check questions:
- What are the different types of features in machine learning?
- How are the different features transformed when we vectorize them?
- How does `DictVectorizer` work? What do `vec.fit()`, `vec.transform()`, and `vec.fit_transform()` do?