# Introduction to Feature Engineering

[Resource](https://www.learndatasci.com/tutorials/intro-feature-engineering-machine-learning-python/)

While building my first linear regression models, I noticed a serious knowledge gap: feature engineering. After some preparation, I've put together a new learning path, and I will take the time to do a much-needed deep dive into the topic. This is an intermediary step in the creation of my first legitimate model. Not the most fun topic, but like all the others, it's an essential one.

As I always say, let's begin.

# Introduction

Let's put a face to the name: **Feature engineering is the process of transforming data to increase the predictive performance of machine learning models**. It's both useful and necessary for the following reasons:
1. Often better predictive accuracy: Well-constructed features let the model weight variables more appropriately. They can even lead to faster convergence.
1. Better interpretability of relationships in the data: When we engineer new features and understand how they relate to our outcome of interest, that opens up our understanding of the data. If we skip feature engineering and use complex models (that to a large degree automate feature engineering), we may still achieve a high evaluation score, but at the cost of understanding our data and its relationship with the target variable less well.

Every data science pipeline begins with **Exploratory Data Analysis (EDA)**. EDA is a crucial precursor step because it gives us a better sense of which features we need to create or modify. The next step is usually data cleaning and standardization, depending on how unstructured or messy the data is.

Feature engineering comes next, and we begin that process by evaluating a model's baseline performance on the data at hand. We then iteratively construct features and continuously evaluate model performance (comparing it with the baseline) through a process called feature selection, until we are satisfied with the results.

## What this article does and doesn't cover

Feature engineering is a vast field with many domain-specific tangents. This article covers some of the popular techniques used for tabular datasets. It does not cover feature engineering for natural language processing, image classification, time-series data, etc.

(Unfortunate. I don't care for any of these except time-series data.)

# The two approaches to feature engineering

1. **The checklist approach:** using tried and tested methods to construct features.
1. **The domain-based approach:** incorporating domain knowledge of the dataset's subject matter into constructing new features.

Now we're gonna take a look at these approaches using some actual data. Let's do some importing.

```python
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
sns.set_palette(sns.color_palette(['#851836', '#edbd17']))
sns.set_style("darkgrid")

from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
```

We will now demonstrate the checklist approach using a dataset on supermarket sales. Note that the dataset has been slightly modified for the tutorial.