<a href="https://colab.research.google.com/github/hussain0048/Machine-Learning/blob/master/Feature_Engineering_in_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1-Introduction**
In the real world, data rarely comes in perfect form. With this in mind, one of the more critical steps in using machine learning in practice is Feature Engineering, that is, taking whatever information you have about your problem and turning it into numbers that you can use to build your feature matrix[1].

Feature engineering is the process of using domain knowledge about the data to create features that can be used to train a machine learning algorithm. Done well, it can improve the accuracy of the trained model’s predictions[1].

In this article, I will cover a few common examples of feature engineering tasks: features for representing categorical data and features for representing text.

# **2-Categorical Features**

One common type of non-numerical data is categorical data. For example, imagine you are exploring some data on housing prices, and along with numerical features like “price” and “rooms,” you also have “neighborhood” information.

Your data might look something like this[1]:

In [1]:
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]

In [None]:
data

You might be tempted to encode this data with a straightforward numerical mapping:



In [None]:
{'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3}

It turns out that this is not generally a useful approach in Scikit-Learn: the package’s models make the fundamental assumption that numerical features reflect algebraic quantities[1].

Thus such a mapping would imply, for example, that Queen Anne < Fremont < Wallingford, or even that Wallingford – Queen Anne = Fremont, which (niche demographic jokes aside) does not make much sense.
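To make the problem concrete, here is a minimal sketch in plain Python. It reuses the integer mapping shown above; the variable names `mapping`, `difference`, and `implied_order` are illustrative and not part of any library.

```python
# The tempting integer mapping from the text
mapping = {'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3}

# Arithmetic on the codes "works", although it means nothing for neighborhoods
difference = mapping['Wallingford'] - mapping['Queen Anne']

# Sorting by code imposes an order that the categories do not really have
implied_order = sorted(mapping, key=mapping.get)

print(difference == mapping['Fremont'])  # True
print(implied_order)  # ['Queen Anne', 'Fremont', 'Wallingford']
```

A model that treats these codes as ordinary numbers can pick up on exactly these accidental relationships.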

In this case, one proven technique is to use **one-hot encoding**, which effectively creates extra columns indicating the presence or absence of a category with a value of 1 or 0, respectively. When your data comes as a list of dictionaries, Scikit-Learn’s DictVectorizer will do this for you:[1]

In [None]:
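from sklearn.feature_extraction import DictVectorizer
# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix;
# dtype=int stores the values as integers
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)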
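To see what one-hot encoding does to this data, the following is a rough plain-Python sketch of the same idea. It is not how DictVectorizer is implemented. The helper `one_hot_rows` is a hypothetical name introduced only for illustration, and the `feature=value` column labels are chosen to resemble the naming DictVectorizer uses.

```python
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]

def one_hot_rows(records, categorical_key):
    """Expand one categorical key into 0/1 indicator columns (illustrative helper)."""
    # One indicator column per distinct category, in sorted order
    categories = sorted({rec[categorical_key] for rec in records})
    # Remaining keys are assumed to be numeric and are passed through unchanged
    numeric_keys = sorted(k for k in records[0] if k != categorical_key)
    columns = [f"{categorical_key}={c}" for c in categories] + numeric_keys
    rows = []
    for rec in records:
        indicators = [int(rec[categorical_key] == c) for c in categories]
        rows.append(indicators + [rec[k] for k in numeric_keys])
    return columns, rows

columns, rows = one_hot_rows(data, 'neighborhood')
print(columns)
for row in rows:
    print(row)
```

Each row has exactly one 1 among the three neighborhood columns, so no ordering or arithmetic relationship between neighborhoods is implied.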

# **References**

[1] [Feature Engineering in Machine Learning](https://thecleverprogrammer.com/2020/07/04/feature-engineering-in-machine-learning/)