[source](https://google.com)

# Feature Engineering

The prediction quality of any machine learning algorithm depends largely on the quality of the input it receives.

The process of creating appropriate data features by applying business context is called **feature engineering**.

## Dealing with missing values

Missing data can mislead an analysis or cause it to fail outright.

To avoid such issues, you need to impute the missing data.

The four most commonly used techniques for data imputation are:

1. Delete
2. Replace with summary
3. Random replace
4. Use a predictive model

In [3]:
from numpy import nan
import pandas as pd
data = {'A':[2,nan,nan,10,10,10],'B':[6,6,6,10,10,10],'C':[nan,2,nan,10,10,10]}
df = pd.DataFrame(data,columns=['A','B','C'])
df

|   | A | B | C |
|---|---|---|---|
| 0 | 2.0 | 6 | NaN |
| 1 | NaN | 6 | 2.0 |
| 2 | NaN | 6 | NaN |
| 3 | 10.0 | 10 | 10.0 |
| 4 | 10.0 | 10 | 10.0 |
| 5 | 10.0 | 10 | 10.0 |


### Delete
Delete the rows that contain missing values.

This is suitable and effective when the number of rows with missing values is insignificant (say < 5%) compared to the overall record count.

Use the `dropna()` function in pandas.

In [5]:
df.dropna()

|   | A | B | C |
|---|---|---|---|
| 3 | 10.0 | 10 | 10.0 |
| 4 | 10.0 | 10 | 10.0 |
| 5 | 10.0 | 10 | 10.0 |
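
Before deleting rows, it can help to measure how many rows would be lost. The sketch below rebuilds the small example frame and compares the fraction of incomplete rows against the 5% rule of thumb mentioned above. The constant `MAX_MISSING_FRACTION` and the variable names are illustrative assumptions, not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [2, np.nan, np.nan, 10, 10, 10],
    'B': [6, 6, 6, 10, 10, 10],
    'C': [np.nan, 2, np.nan, 10, 10, 10],
})

# Fraction of rows that contain at least one missing value
missing_row_fraction = df.isna().any(axis=1).mean()
print(f"Rows with missing values: {missing_row_fraction:.0%}")

# Hypothetical cut-off taken from the "say < 5%" rule of thumb
MAX_MISSING_FRACTION = 0.05

if missing_row_fraction < MAX_MISSING_FRACTION:
    cleaned = df.dropna()   # few rows affected: deleting them is cheap
else:
    cleaned = df            # too many rows would be lost: impute instead
```

In this toy frame half of the rows are incomplete, so deletion would discard too much data and one of the other techniques is the better choice.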


### Replace with summary
This is the most commonly used imputation technique.

For **continuous or quantitative** variables, the mean, median, or mode of the respective column can be used to replace the missing values.

For **categorical or qualitative** variables, the mode (most frequent value) works better.

Use the `fillna()` function in pandas.

In [7]:
df.fillna(df.mean())

|   | A | B | C |
|---|---|---|---|
| 0 | 2.0 | 6 | 8.0 |
| 1 | 8.0 | 6 | 2.0 |
| 2 | 8.0 | 6 | 8.0 |
| 3 | 10.0 | 10 | 10.0 |
| 4 | 10.0 | 10 | 10.0 |
| 5 | 10.0 | 10 | 10.0 |
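
The mean is only one possible summary. As a minimal sketch, assuming a made-up frame with one numeric and one categorical column (the names `income` and `segment` are hypothetical), the median can be used for the numeric column and the mode for the categorical one:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one skewed numeric column and one categorical column
df_mixed = pd.DataFrame({
    'income': [30.0, np.nan, 50.0, 40.0, 500.0],
    'segment': ['retail', 'retail', None, 'corporate', 'retail'],
})

filled = df_mixed.copy()

# Median is less sensitive than the mean to the extreme value 500.0
filled['income'] = filled['income'].fillna(filled['income'].median())

# mode() returns a Series (there can be ties), so take the first entry
filled['segment'] = filled['segment'].fillna(filled['segment'].mode()[0])

print(filled)
```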


## Handling Categorical Data
Most machine learning libraries are designed to work with numerical variables.

Categorical variables in their original form, as text descriptions, therefore can't be used directly for model building.

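### Create dummy variable
A dummy variable is a Boolean variable that indicates the presence of a category with the value 1 and its absence with the value 0.

You should create **k-1** dummy variables, where **k** is the number of levels of the categorical variable; the omitted level is implied when all the other dummy variables are 0.

Pandas provides a useful function, `get_dummies`, to create dummy variables for a given categorical variable. By default it creates one column per level (k columns); pass `drop_first=True` to get k-1 columns.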

In [10]:
import pandas as pd
df = pd.DataFrame({'A': ['high', 'medium', 'low'],
 'B': [10,20,30]},
 index=[0, 1, 2])
df

|   | A | B |
|---|---|---|
| 0 | high | 10 |
| 1 | medium | 20 |
| 2 | low | 30 |


In [11]:
# Create dummy variables (dtype=int gives 1/0 instead of True/False on recent pandas)
pd.get_dummies(df, prefix='A', columns=['A'], dtype=int)

|   | B | A_high | A_low | A_medium |
|---|---|---|---|---|
| 0 | 10 | 1 | 0 | 0 |
| 1 | 20 | 0 | 0 | 1 |
| 2 | 30 | 0 | 1 | 0 |
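
The output above contains one column per level (k = 3). As a sketch of the k-1 recommendation, the same call with `drop_first=True` drops the first level in sorted order, which then acts as the baseline. The variable names `df_cat` and `dummies` are only for illustration.

```python
import pandas as pd

df_cat = pd.DataFrame({'A': ['high', 'medium', 'low'],
                       'B': [10, 20, 30]})

# drop_first=True removes the first level (alphabetically 'high'),
# leaving k-1 = 2 dummy columns; dtype=int gives 1/0 instead of booleans
dummies = pd.get_dummies(df_cat, prefix='A', columns=['A'],
                         drop_first=True, dtype=int)
print(dummies)
```

A row where both `A_low` and `A_medium` are 0 corresponds to the dropped level, `high`.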


## Normalizing Data
The unit or scale of measurement varies between variables, so an analysis with the raw measurements could be artificially skewed toward the variables with higher absolute values.

Bringing all variables to the same order of magnitude prevents variables with large raw values from dominating the analysis and distorting the conclusions.

Two broadly used methods for rescaling data are **normalization** and **standardization**.

Normalizing data can be achieved by *Min-Max* scaling; the formula below scales all numeric values to the range 0 to 1.

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

> Ensure you remove extreme outliers before applying the above technique, as they can squeeze the normal values in your data into a small interval

The standardization technique will transform the variables to have a zero mean and standard deviation of one. The formula for standardization is given below and the outcome is commonly known as z-scores.

$$Z = \frac{X - \mu}{\sigma}$$

where μ is the mean and σ is the standard deviation.

Standardization is often the preferred method for many analyses, as it tells us where each data point lies within its distribution and gives a rough indication of outliers.
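
To make the two formulas concrete, here is a minimal NumPy sketch that applies them by hand to a small made-up array (the values in `x` are illustrative only). It uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # illustrative values

# Min-max scaling: (X - X_min) / (X_max - X_min)
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (z-scores): (X - mean) / standard deviation
x_std = (x - x.mean()) / x.std()

print(x_minmax)                    # values now lie in [0, 1]
print(x_std.mean(), x_std.std())   # approximately 0 and 1
```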

In [13]:
from sklearn import datasets
import numpy as np
from sklearn import preprocessing
iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target
std_scale = preprocessing.StandardScaler().fit(X)
X_std = std_scale.transform(X)
minmax_scale = preprocessing.MinMaxScaler().fit(X)
X_minmax = minmax_scale.transform(X)

print('Mean before standardization: petal length={:.1f}, petal width={:.1f}'
 .format(X[:,0].mean(), X[:,1].mean()))
print('SD before standardization: petal length={:.1f}, petal width={:.1f}'
 .format(X[:,0].std(), X[:,1].std()))
print('Mean after standardization: petal length={:.1f}, petal width={:.1f}'
 .format(X_std[:,0].mean(), X_std[:,1].mean()))
print('SD after standardization: petal length={:.1f}, petal width={:.1f}'
 .format(X_std[:,0].std(), X_std[:,1].std()))
print('\nMin value before min-max scaling: petal length={:.1f}, petal width={:.1f}'
 .format(X[:,0].min(), X[:,1].min()))
print('Max value before min-max scaling: petal length={:.1f}, petal width={:.1f}'
 .format(X[:,0].max(), X[:,1].max()))
print('Min value after min-max scaling: petal length={:.1f}, petal width={:.1f}'
 .format(X_minmax[:,0].min(), X_minmax[:,1].min()))
print('Max value after min-max scaling: petal length={:.1f}, petal width={:.1f}'
 .format(X_minmax[:,0].max(), X_minmax[:,1].max()))

In [14]:
from sklearn import datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
iris = datasets.load_iris()
# Let's convert to dataframe
iris = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
 columns= iris['feature_names'] + ['species'])
# replace the values with class labels
iris.species = np.where(iris.species == 0.0, 'setosa',
                        np.where(iris.species == 1.0, 'versicolor', 'virginica'))
# let's remove spaces from column name
iris.columns = iris.columns.str.replace(' ','')
iris.describe()

|   | sepallength(cm) | sepalwidth(cm) | petallength(cm) | petalwidth(cm) |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


[example data](https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/overview)

## Exploratory Data Analysis (EDA)
EDA is all about understanding your data by summarizing and visualizing it. At a high level, EDA can be performed in two ways: univariate analysis and multivariate analysis.

### Univariate Analysis
Individual variables are analyzed in isolation to gain a better understanding of each one.

Pandas provides the `describe` function to create summary statistics in tabular format for all variables.

These statistics are very useful for numerical variables, as they help reveal quality issues such as missing values and the presence of outliers.
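
As a hedged sketch of how such quality checks might look in code, the block below counts missing values per column and flags potential outliers with the common 1.5 × IQR rule. The IQR rule is one conventional choice, not something prescribed here, and the variable names are illustrative. It loads the iris data directly from scikit-learn so that it runs on its own.

```python
import pandas as pd
from sklearn import datasets

# Feature columns only, as a DataFrame (requires scikit-learn >= 0.23)
iris_df = datasets.load_iris(as_frame=True).data

# Missing values per column
missing_per_column = iris_df.isna().sum()
print(missing_per_column)

# Flag values more than 1.5 * IQR outside the quartiles (a common rule of thumb)
q1 = iris_df.quantile(0.25)
q3 = iris_df.quantile(0.75)
iqr = q3 - q1
is_outlier = (iris_df < q1 - 1.5 * iqr) | (iris_df > q3 + 1.5 * iqr)
outliers_per_column = is_outlier.sum()
print(outliers_per_column)
```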

In [17]:
iris['species'].value_counts()

In [18]:
# Plot a histogram of each numeric column and set the size of the plot
iris.hist(figsize=(15, 8))
plt.suptitle("Histogram", fontsize=16)  # use suptitle to add a title above all subplots
plt.show()

In [19]:
iris.boxplot() # plot boxplot
plt.title("Box Plot", fontsize=16)
plt.show()

### Multivariate Analysis
In multivariate analysis you try to establish how the variables relate to one another.

In [21]:
# print the mean for each column by species
iris.groupby(by = "species").mean()
# plot for mean of each feature for each label class
iris.groupby(by = "species").mean().plot(kind="bar")
plt.title('Class vs Measurements')
plt.ylabel('mean measurement(cm)')
plt.xticks(rotation=0) # manage the xticks rotation
plt.grid(True)
# Use bbox_to_anchor option to place the legend outside plot area to be tidy
plt.legend(loc="upper left", bbox_to_anchor=(1,1))

### Correlation Matrix
The correlation function uses the Pearson correlation coefficient, which results in a number between -1 and 1. A strong negative relationship is indicated by a coefficient close to -1, and a strong positive relationship is indicated by a coefficient close to 1.
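
For intuition, the Pearson coefficient can be computed by hand and compared with pandas. The sketch below uses a small set of made-up length and width values; the helper `pearson_r` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# Illustrative measurements, not taken from the dataset output above
length = [1.4, 1.3, 4.7, 4.5, 6.0, 5.1]
width = [0.2, 0.2, 1.4, 1.5, 2.5, 1.9]

r_manual = pearson_r(length, width)
r_pandas = pd.Series(length).corr(pd.Series(width))  # Pearson by default
print(r_manual, r_pandas)
```

Both values should agree up to floating-point error, and the result always lies between -1 and 1.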

In [23]:
# create correlation matrix (drop the non-numeric species column first)
corr = iris.drop(columns='species').corr()
corr

|   | sepallength(cm) | sepalwidth(cm) | petallength(cm) | petalwidth(cm) |
|---|---|---|---|---|
| sepallength(cm) | 1.0 | -0.11757 | 0.871754 | 0.817941 |
| sepalwidth(cm) | -0.11757 | 1.0 | -0.42844 | -0.366126 |
| petallength(cm) | 0.871754 | -0.42844 | 1.0 | 0.962865 |
| petalwidth(cm) | 0.817941 | -0.366126 | 0.962865 | 1.0 |


In [24]:
import statsmodels.api as sm
sm.graphics.plot_corr(corr, xnames=list(corr.columns))
plt.show()

### Pair Plot
You can understand the relationships between attributes by looking at the distribution of each pair of attributes. This uses a built-in function to create a matrix of scatter plots of all attributes against all attributes.
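
The text does not name the built-in function, so the sketch below assumes `pandas.plotting.scatter_matrix`, which is one readily available option. It reloads the iris features from scikit-learn so that it runs on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

# Feature columns only, as a DataFrame (requires scikit-learn >= 0.23)
iris_df = datasets.load_iris(as_frame=True).data

# One scatter plot per pair of attributes; histograms on the diagonal
axes = pd.plotting.scatter_matrix(iris_df, figsize=(10, 10), diagonal='hist')
plt.suptitle("Pair Plot", fontsize=16)
plt.show()
```

The call returns a 4 × 4 grid of axes, one per pair of the four measurements.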

### Findings from EDA
* There are no missing values.
* Sepal is longer than petal. Sepal length ranges from 4.3 to 7.9 with an average of 5.84, whereas petal length ranges from 1 to 6.9 with an average of 3.76.
* Sepal is also wider than petal. Sepal width ranges from 2 to 4.4 with an average of 3.06, whereas petal width ranges from 0.1 to 2.5 with an average of 1.20.
* The average petal length of setosa is much smaller than that of versicolor and virginica; however, the average sepal width of setosa is higher than that of versicolor and virginica.
* Petal length and petal width are strongly positively correlated (coefficient of 0.96), that is, width tends to increase as length increases.
* Petal length is moderately negatively correlated with sepal width (coefficient of -0.43), that is, petal length tends to decrease as sepal width increases.
* Initial conclusion from data: Based on the length and width of sepal and petal alone, versicolor and virginica appear similar in size, whereas setosa is noticeably different from the other two.

In [27]:
df = pd.read_csv('') 