# Feature Engineering: Beginner's Guide Part 1
---
#### Techniques to process Numerical and Categorical Data in Python

## Introduction 
This notebook is a supplement to [Feature engineering in python: The Basics.(Free Guide)](https://www.theblublog.com/feature-engineering-in-python-a-free-guide). It aims to provide starter code and examples for engineering numerical and categorical features.

To learn more about this or other data science topics, visit [The Blu Blog](https://www.theblublog.com). Learn data science with 100% free guides and interactive notebooks.


## Engineering features for Numerical data

### Rescaling Numeric features
Rescaling is a common preprocessing task in machine learning. There are several rescaling techniques, but one of the simplest is min-max scaling. Min-max scaling uses the minimum and maximum values of a feature to rescale its values to within a range, most often 0 to 1.
The scikit-learn 'MinMaxScaler' offers two ways to rescale a feature. One option is to use 'fit()' to calculate the minimum and maximum values of the feature, then use 'transform()' to rescale the feature. The second option is to use 'fit_transform()' to do both operations at once. There is no mathematical difference between the two options, but it is sometimes useful to perform the steps separately, for example to fit on one set of data and then apply the same transformation to another. An example follows.

Let's start by importing the necessary libraries.

In [None]:
#Load Libraries
import numpy as np
from sklearn import preprocessing
import random
import matplotlib.pyplot as plt

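Let's create a small dataset called sales. The values are hard-coded here; the commented-out line shows how random values could be generated with the random library instead.

In [None]:
#Creating a dataset
sales = np.array([[-200],[-10],[50],[1000],[15],[20],[30],[50],[100],[200],[10000],[-12000],[150000],[160000]])

#Alternative: 50 random values (a NumPy array has no append method, so build the array in one step)
#sales = np.array([[random.randint(-100000000, 1000000000)] for _ in range(50)])
print(sales)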

Now let's create a scaler and scale the sales.

In [None]:
# Create a scaler
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0,1))

#Scale feature
scaled_sales = minmax_scale.fit_transform(sales)

#Show feature
scaled_sales
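
The separate 'fit()' and 'transform()' steps described above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the train and test arrays are made-up values, and the manual calculation assumes the default range of 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical train/test split of a single numeric feature (illustrative values)
train = np.array([[10.0], [20.0], [30.0], [50.0]])
test = np.array([[20.0], [40.0], [60.0]])

# Learn the minimum and maximum from the training data only
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(train)

# Reuse the learned minimum and maximum on both sets
scaled_train = scaler.transform(train)
scaled_test = scaler.transform(test)

# The same result computed by hand: (x - min) / (max - min)
manual_test = (test - train.min()) / (train.max() - train.min())

print(scaled_train.ravel())
print(scaled_test.ravel())
print(np.allclose(scaled_test, manual_test))
```

Because the scaler only knows the training minimum and maximum, test values outside that range end up outside 0 to 1 (here the value 60 maps above 1).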

### Standardizing features
A common alternative to min-max scaling is standardization: rescaling a feature so that it has a mean of 0 and a standard deviation of 1. Standardization shifts and rescales the values but does not change the shape of their distribution.
Each transformed value expresses how many standard deviations the original value lies from the feature's mean (also called a z-score in statistics). In my experience, standardization is chosen more often than min-max scaling as the scaling technique for machine learning preprocessing. The best choice, however, depends on the learning algorithm. For instance, principal component analysis often works better with standardization, while min-max scaling is typically recommended for neural networks.


In [None]:
#Create a scaler
std_scaler = preprocessing.StandardScaler()
std_sales = std_scaler.fit_transform(sales)

# Show feature standardized
std_sales
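
To make the z-score concrete, here is a small sketch that compares 'StandardScaler' with the formula applied by hand. The values are illustrative only. The manual version uses the population standard deviation (NumPy's default, ddof=0), which is what 'StandardScaler' uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values for a single feature
values = np.array([[-200.0], [50.0], [100.0], [1000.0]])

# Standardize with scikit-learn
z_sklearn = StandardScaler().fit_transform(values)

# Standardize by hand: z = (x - mean) / standard deviation
z_manual = (values - values.mean()) / values.std()

print(z_sklearn.ravel())
print(np.allclose(z_sklearn, z_manual))
print(z_sklearn.mean(), z_sklearn.std())
```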

### Normalizing
Normalization is another method for feature scaling. It is used most often when the data does not follow a Gaussian distribution. Bringing values with different scales onto a single scale simplifies the data for modeling, so that each feature (variable) tends to have a similar impact on the final model. Note that the scikit-learn 'Normalizer' used in this section works on observations rather than on features: it rescales each row of the feature matrix so that the row has unit norm (with norm="l2", a Euclidean length of 1).

Let's import the Normalizer class from scikit-learn.

In [None]:
from sklearn.preprocessing import Normalizer

In [None]:
# Create feature matrix
x = np.array([[2.5, 1.5],[2.1, 3.4], [1.5, 10.2], [4.63, 34.4], [10.9, 3.3], [17.5,0.8], [15.4, 0.7]])

# Create normalizer
normalizer = Normalizer(norm="l2")

# Transform feature matrix
normalizer.transform(x)
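
As a quick check of what 'Normalizer' does with norm="l2", the sketch below divides each row by its Euclidean length by hand and compares the result. The small matrix is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Illustrative feature matrix: each row is one observation
x_small = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]])

# Normalize each row to unit L2 norm with scikit-learn
normalized = Normalizer(norm="l2").fit_transform(x_small)

# The same operation by hand: divide every row by its Euclidean length
row_lengths = np.linalg.norm(x_small, axis=1, keepdims=True)
manual = x_small / row_lengths

print(normalized)
print(np.linalg.norm(normalized, axis=1))  # every row now has length 1
```

Rows [3, 4] and [6, 8] point in the same direction, so both become [0.6, 0.8] after normalization. Only the direction of each observation is kept, not its magnitude.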

## Engineering Features for Categorical data

### Encoding for Ordinal
Encoding is the process of converting ordinal data into a numeric format so that a machine learning algorithm can make sense of it. To transform ordinal data into numeric data, we usually map each class to a number that preserves the order of the classes. For example, cold, average, and hot are mapped to 1, 2, and 3 respectively. Let’s see how we can do this easily.

Let's start by importing pandas and creating a data set.

In [None]:
#Importing libraries
import pandas as pd

#Creating the data
data = pd.DataFrame({"Temprature":["Very Cold", "Cold", "Warm","Hot", "Very Hot"]})

print(data)


Now let's map the data to numerical values.

In [None]:
#Mapping to numerical data
scale_map = {"Very Cold": -3,
             "Cold": -1,
             "Warm": 0,
             "Hot" : 1,
             "Very Hot" : 3}

#Mapping each category to its numeric value (categories missing from scale_map would become NaN)
data_mapped = data["Temprature"].map(scale_map)
data["encoded_temp"] = data_mapped
data
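
An alternative to writing the mapping by hand is to let pandas assign the numbers from an explicit category order. This is only a sketch under the assumption that consecutive integer codes (0, 1, 2, ...) are acceptable; the names order, temps, and codes are illustrative and not part of the original notebook.

```python
import pandas as pd

# The categories listed from lowest to highest
order = ["Very Cold", "Cold", "Warm", "Hot", "Very Hot"]

# Illustrative observations in arbitrary order
temps = pd.Series(["Warm", "Very Cold", "Hot", "Cold", "Very Hot", "Warm"])

# An ordered categorical stores each value as its position in `order`
ordered = pd.Categorical(temps, categories=order, ordered=True)
codes = pd.Series(ordered.codes, index=temps.index)

print(codes.tolist())  # [2, 0, 3, 1, 4, 2]
```

Unlike the dictionary above, this approach always produces evenly spaced codes, so use an explicit mapping when the gaps between classes should differ.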

### Nominal Data
In one-hot encoding, we convert each class of nominal data into its own feature and assign a binary value of 1 or 0 to indicate whether the observation belongs to that class. Let’s see how this can be done using the LabelBinarizer in scikit-learn.


Let's import the libraries and create a data frame.

In [None]:
#Import libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
# Create the dataset
color_data = {"itemid": ["A1","B1","C2", "D4","E9"],
              "color" : ["red","blue","green","yellow","pink"]}

color_data = pd.DataFrame(color_data)
color_data


One-hot encoding with LabelBinarizer

In [None]:
# Creating one-hot encoder
one_hot = LabelBinarizer() 

# One-hot encode the data and assign to a var
color_encoding = one_hot.fit_transform(color_data.color)

#  feature classes
color_new = one_hot.classes_

#creating new Data Frame with encoded values 
encoded = pd.DataFrame(color_encoding)
encoded.columns = color_new

#Deleting color column and merging with encoded values
color_data_new = color_data.drop("color",axis = 1)
color_data_new = pd.concat([color_data_new,encoded],axis = 1)

#Viewing new data
print(color_data_new)

One hot encoding with Pandas 



In [None]:
#Creating encoded df
encoded_pd  = pd.get_dummies(color_data.color)

#Deleting color column and merging with encoded values
color_data_pd = color_data.drop("color", axis = 1)
color_data_pd = pd.concat([color_data_pd,encoded_pd],axis = 1)

#Viewing new data
print(color_data_pd)

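It’s good practice to drop one of the encoded features after one-hot encoding. The dropped column can always be inferred from the remaining ones, so removing it eliminates the linear dependency between the encoded features.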


In [None]:
#Dropping final column
color_data_pd.drop("yellow",axis =1, inplace = True)
color_data_pd
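
The encoding and the column drop can also be combined in one call. The sketch below assumes the same kind of data as above, rebuilt under the illustrative name items, and relies on the 'columns' and 'drop_first' parameters of 'pd.get_dummies'.

```python
import pandas as pd

# Illustrative data with the same structure as color_data
items = pd.DataFrame({"itemid": ["A1", "B1", "C2", "D4", "E9"],
                      "color": ["red", "blue", "green", "yellow", "pink"]})

# Encode the color column in place and drop the first encoded category
encoded_items = pd.get_dummies(items, columns=["color"], drop_first=True)

print(encoded_items.columns.tolist())
print(encoded_items)
```

Note that 'drop_first=True' removes the first category in sorted order (here blue) rather than yellow, and that the new columns are prefixed with the original column name. Which category is dropped does not matter for removing the linear dependency.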