
Stars Badge Forks Badge Pull Requests Badge Issues Badge GitHub contributors Visitors

Module 6: Feature Engineering

Don't forget to hit the ⭐ if you like this repo.

Group: DataAce

| Name | Matric No |
| --- | --- |
| Myza Nazifa binti Nazry | A20EC0219 |
| Nur Izzah Mardhiah binti Rashidi | A20EC0116 |
| Amirah Raihanah binti Abdul Rahim | A20EC0182 |
| Radin Dafina binti Radin Zulkar Nain | A20EC0135 |

Contents:

Definition

What is Feature Engineering?

Let us start with the definition. Feature engineering is a machine learning technique that leverages data to create new variables that aren't in the training set. It can produce new features for both supervised and unsupervised learning, with the goal of simplifying and speeding up data transformations while also enhancing model accuracy. Feature engineering is required whenever you work with machine learning models: regardless of the data or the architecture, a poorly designed feature has a direct impact on your model.

To understand it better: feature engineering is the pre-processing step of machine learning that extracts features from raw data. It helps represent the underlying problem to predictive models in a better way, which in turn improves the model's accuracy on unseen data. A predictive model contains predictor variables and an outcome variable, and the feature engineering process selects the most useful predictor variables for the model.


Since 2016, automated feature engineering has also been used in various machine learning software packages to extract features from raw data automatically. Feature engineering in ML consists mainly of four processes: Feature Creation, Transformations, Feature Extraction, and Feature Selection.

Importance of Feature Engineering

Feature engineering refers to the process of designing artificial features for an algorithm. The algorithm then uses these artificial features to improve its performance, or in other words to produce better results. Data scientists spend most of their time working with data, so it is important that this work leads to accurate models.

When feature engineering activities are done correctly, the resulting dataset is optimal and contains all of the important factors that affect the business problem. These datasets, in turn, yield the most accurate predictive models and the most useful insights.

Processes in Feature Engineering

1. Feature Creation: Feature creation means finding the most useful variables to use in a predictive model. The process is subjective, and it requires human creativity and intervention. New features are created by combining existing features through operations such as addition, subtraction, and ratio, and these new features offer great flexibility, as sketched below.
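A minimal sketch of feature creation with pandas, using a small hypothetical housing DataFrame (the column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical dataset with a few raw features
df = pd.DataFrame({
    "total_price": [300_000, 450_000, 250_000],
    "living_area": [1500, 2000, 1000],
    "lot_area":    [5000, 6000, 3000],
})

# New features built by combining existing ones
df["price_per_sqft"] = df["total_price"] / df["living_area"]  # ratio
df["total_area"]     = df["living_area"] + df["lot_area"]     # addition
df["yard_area"]      = df["lot_area"] - df["living_area"]     # subtraction

print(df)
```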

2. Transformations: The transformation step of feature engineering involves adjusting the predictor variables to improve the accuracy and performance of the model. For example, it ensures that the model can accept a wide variety of input data; it puts all the variables on the same scale, making the model easier to understand; and it keeps all the features within an acceptable range, improving accuracy and avoiding computational errors.
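A common transformation is standardization, which puts features on the same scale. A minimal sketch with scikit-learn (the feature values are invented):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales: area (sq ft) and bedrooms
X = np.array([[1500.0, 3], [2000.0, 4], [1000.0, 2]])

# Rescale each column to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```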

3. Feature Extraction: Feature extraction is an automated feature engineering process that generates new variables by extracting them from the raw data. The main aim of this step is to reduce the volume of data so that it can be easily used and managed for data modelling. Feature extraction methods include cluster analysis, text analytics, edge detection algorithms, and principal components analysis (PCA).
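As an example of feature extraction, PCA can compress many correlated columns into a handful of new variables. A minimal sketch, assuming random data stands in for the raw features:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in raw data: 100 samples, 10 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Extract 3 new features (principal components) from the raw columns
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # variance captured by each component
```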

4. Feature Selection: While developing a machine learning model, only a few variables in the dataset are useful for building it; the remaining features are either redundant or irrelevant. Feeding the model all of these redundant and irrelevant features may negatively impact and reduce its overall performance and accuracy. Hence it is very important to identify and select the most appropriate features from the data and remove the irrelevant or less important ones, which is done with the help of feature selection in machine learning. "Feature selection is a way of selecting the subset of the most relevant features from the original features set by removing the redundant, irrelevant, or noisy features."
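One simple form of feature selection is univariate filtering. A minimal sketch with scikit-learn's SelectKBest on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the strongest statistical link to the target
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)        # (150, 2)
print(selector.get_support())  # boolean mask of the kept features
```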

Feature Engineering Techniques

1. Imputation: Feature engineering deals with inappropriate data, missing values, human errors, general errors, insufficient data sources, and so on. Missing values within the dataset strongly affect the performance of the algorithm, and the "imputation" technique is used to deal with them. Imputation is responsible for handling such irregularities within the dataset.

One option is simply to drop any row or column that has a large percentage of missing values. But to preserve the size of the dataset, it is usually better to impute the missing data instead, which can be done as follows:

For numerical data imputation, a default value can be imputed in a column, or missing values can be filled with the mean or median of the column. For categorical data imputation, missing values can be replaced with the most frequently occurring value in the column.
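A minimal sketch of both cases with pandas (the toy columns are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, np.nan, 40, 35, np.nan],
    "city": ["KL", "JB", np.nan, "KL", "KL"],
})

# Numerical column: fill missing values with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: fill missing values with the most frequent value
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```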

2. Handling Outliers: Outliers are deviated values or data points that lie so far from the other data points that they badly affect the performance of the model. This feature engineering technique first identifies the outliers and then removes them.

Standard deviation can be used to identify the outliers. Every value sits at some distance from the mean; if a value lies further away than a chosen threshold, it can be considered an outlier. The z-score can also be used to detect outliers, as sketched below.
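A minimal z-score sketch with NumPy, flagging values more than two standard deviations from the mean (the threshold of 2 is an arbitrary choice for illustration):

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 95])  # 95 looks suspicious

# z-score: how many standard deviations each value is from the mean
z_scores = np.abs((values - values.mean()) / values.std())

print(values[z_scores > 2])   # outliers:  [95]
print(values[z_scores <= 2])  # remaining: [10 12 11 13 12]
```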

3. Log transform: Logarithm transformation, or log transform, is one of the most commonly used mathematical techniques in machine learning. Log transform helps handle skewed data, making the distribution closer to normal after transformation. It also reduces the effect of outliers on the data: because the magnitude differences are normalized, the model becomes much more robust.
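A minimal sketch using np.log1p, which computes log(1 + x) and is therefore safe when the data contains zeros (the income values are invented to illustrate skew):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature
income = pd.Series([20_000, 25_000, 30_000, 45_000, 1_000_000])

income_log = np.log1p(income)  # log(1 + x)
print(income_log)
```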

4. Binning: In machine learning, overfitting is one of the main issues that degrade the performance of the model; it occurs due to a greater number of parameters and noisy data. One of the popular feature engineering techniques, "binning", can be used to normalize the noisy data. The process involves segmenting different features into bins.
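A minimal binning sketch with pandas' pd.cut (the age boundaries and labels are arbitrary choices for illustration):

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 34, 52, 71])

# Segment the continuous feature into labelled bins
age_group = pd.cut(ages,
                   bins=[0, 18, 35, 60, 100],
                   labels=["child", "young adult", "adult", "senior"])
print(age_group)
```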

5. Feature Split: As the name suggests, feature split is the process of splitting a feature into two or more parts to make new features. This technique helps the algorithms better understand and learn the patterns in the dataset.

The feature splitting process enables the new features to be clustered and binned, which results in extracting useful information and improving the performance of the data models.
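A minimal sketch that splits a name column and a timestamp column into more informative parts (the columns and values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "full_name": ["Ada Lovelace", "Alan Turing"],
    "timestamp": ["2023-01-15 08:30", "2023-06-02 21:45"],
})

# Split one string feature into two parts
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", expand=True)

# A timestamp splits naturally into several useful components
ts = pd.to_datetime(df["timestamp"])
df["month"] = ts.dt.month
df["hour"]  = ts.dt.hour

print(df)
```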

6. One hot encoding: One hot encoding is a popular encoding technique in machine learning. It converts categorical data into a form that machine learning algorithms can easily understand and use to make good predictions. It enables grouping of categorical data without losing any information.
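A minimal sketch with pandas' get_dummies, where each category becomes its own 0/1 column so no artificial ordering is implied:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

encoded = pd.get_dummies(df, columns=["color"])
print(encoded)
```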

Advanced Techniques for Feature Engineering

As the field of data science continues to evolve, feature engineering has emerged as a critical step in the machine learning pipeline. "Advanced Techniques for Feature Engineering" is a topic that covers the more complex and specialized techniques used in feature engineering. While basic techniques such as one-hot encoding and feature scaling are commonly used, there are a number of advanced techniques that can be employed for specific data types and applications. This topic covers advanced techniques for feature engineering in areas such as unstructured data, natural language processing, time-series data, and recommendation systems, as well as novel approaches to feature engineering such as automated feature engineering and unsupervised feature learning. By exploring these advanced techniques, data scientists can improve the performance of their machine learning models and gain a deeper understanding of the feature engineering process.

Proper evaluation of feature engineering techniques is critical to ensure that the selected features are informative and relevant to the machine learning model. In this topic, we will discuss techniques for evaluating feature importance and feature relevance, as well as methods for evaluating machine learning models such as cross-validation and metrics like precision, recall, and F1-score.
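As a minimal sketch of such an evaluation, the snippet below scores a simple model with 5-fold cross-validation and the F1 metric; the choice of model and dataset is illustrative only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Wrapping preprocessing in a pipeline re-fits the scaler inside each
# fold, so no information leaks from validation data into training
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print(scores.mean())
```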

By understanding how to properly evaluate feature engineering techniques and machine learning models, data scientists can improve the accuracy and robustness of their models, leading to better performance and more informed decisions.

Advantages and Disadvantages of Feature Engineering

Advantages

| Advantage | Description |
| --- | --- |
| Improved model performance | Improves the accuracy and performance of machine learning models. By selecting and transforming the most relevant features, the model can better capture patterns in the data, resulting in better predictions. |
| Better interpretation | Makes the model more interpretable by creating features based on domain-specific knowledge, so it is easier to understand how the model makes predictions and to identify the most important factors. |
| More efficient modeling | Helps reduce the complexity of the model, making it easier and faster to train. |

Disadvantages

| Disadvantage | Description |
| --- | --- |
| Time-consuming | Requires a lot of trial and error, and it can be difficult to know which features are most relevant and how to transform them most effectively. |
| Risk of overfitting | Occurs when the model is too closely tuned to the training data and is unable to generalize to new data. |
| Expertise required | Effective feature engineering often requires domain-specific expertise and knowledge. |
| Increased complexity | Feature engineering can also increase the complexity of the model in some cases, making it more difficult to interpret and understand. This can be a disadvantage if interpretability is a key consideration. |

Contribution 🛠️

Please create an Issue for any improvements, suggestions or errors in the content.

You can also contact me on LinkedIn for any other queries or feedback.
