# **Feature Engineering Flowchart** 



*   **Feature Transformation**:

 - Missing Value Imputation
 - Handling Categorical Features
 - Outlier Detection
 - Feature Scaling

*   **Feature Construction**:

 - Feature Splitting


*   **Feature Selection**:
 - Backward Elimination
 - Forward Selection


*   **Feature Extraction**:
 - Curse of Dimensionality
 - PCA
 - LDA
 - t-SNE



P.S.: This notebook focuses on Feature Construction (including Feature Splitting) and on the Curse of Dimensionality (under Feature Extraction).

# **Feature Construction:**

**Feature Construction** - This is a special part of Feature Engineering because it is almost entirely manual: there is no fixed procedure and no mathematical formula. We rely on domain knowledge and intuition to create new features from the ones we already have.

**Formal definition of Feature Construction:** Higher-level features can be obtained from already available features and added to the feature vector; for example, for the study of diseases the feature 'Age' is useful and is defined as Age = 'Year of death' minus 'Year of birth'. This process is referred to as feature construction.
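The 'Age' example from the definition can be sketched in pandas. The records and the column names `year_of_birth` and `year_of_death` are made up for illustration; they do not come from the Titanic data used later.

```python
import pandas as pd

# Hypothetical records; column names are illustrative assumptions.
records = pd.DataFrame({
    "year_of_birth": [1850, 1872, 1901],
    "year_of_death": [1912, 1940, 1985],
})

# Construct the higher-level feature from the two existing features.
records["age"] = records["year_of_death"] - records["year_of_birth"]
print(records)
```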

# **Feature Splitting:**

**Feature Splitting:** Sometimes the data is not tidy: a single column packs several pieces of information together. Splitting such a column into separate columns lets us analyse each piece on its own.


OR


It is the process of splitting a feature into two or more parts and using those parts as new features. This can help an algorithm learn patterns in the dataset that were hidden inside the combined column.
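As a small illustration of the idea, the sketch below splits a cabin-style code such as `C85` into a letter part and a number part. The toy values and the new column names `deck` and `number` are assumptions chosen for this example only.

```python
import pandas as pd

# Toy cabin codes; in this sketch each code is one letter followed by digits.
cabins = pd.DataFrame({"cabin": ["C85", "C123", "E46"]})

# Named groups become the column names of the extracted frame.
parts = cabins["cabin"].str.extract(r"^(?P<deck>[A-Z])(?P<number>\d+)$")

cabins["deck"] = parts["deck"]
cabins["number"] = parts["number"].astype(int)
print(cabins)
```

Codes that do not match the pattern would come back as missing values from `str.extract`, so real data would need that case handled before the integer conversion.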

# Code:

In [42]:
import numpy as np 
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split , cross_val_score
from sklearn.linear_model import LogisticRegression

import warnings

In [43]:
warnings.filterwarnings('ignore')

In [44]:
from google.colab import files

In [45]:
upload = files.upload()

Saving train.csv to train (1).csv


In [46]:
df = pd.read_csv('train.csv')[['Age' , 'Pclass' , 'SibSp' , 'Parch' , 'Survived']]

In [47]:
df.head()

| | Age | Pclass | SibSp | Parch | Survived |
|---|---|---|---|---|---|
| 0 | 22.0 | 3 | 1 | 0 | 0 |
| 1 | 38.0 | 1 | 1 | 0 | 1 |
| 2 | 26.0 | 3 | 0 | 0 | 1 |
| 3 | 35.0 | 1 | 1 | 0 | 1 |
| 4 | 35.0 | 3 | 0 | 0 | 0 |


In [48]:
df.dropna(inplace = True)

Dropped every row that contains a missing value using `dropna`.

In [49]:
df.head()

| | Age | Pclass | SibSp | Parch | Survived |
|---|---|---|---|---|---|
| 0 | 22.0 | 3 | 1 | 0 | 0 |
| 1 | 38.0 | 1 | 1 | 0 | 1 |
| 2 | 26.0 | 3 | 0 | 0 | 1 |
| 3 | 35.0 | 1 | 1 | 0 | 1 |
| 4 | 35.0 | 3 | 0 | 0 | 0 |


In [50]:
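# Features: the first four columns; copy so new columns can be added to x safely
x = df.iloc[:, 0:4].copy()
# Target: the last column (Survived)
y = df.iloc[:, -1]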

In [51]:
x.head()

| | Age | Pclass | SibSp | Parch |
|---|---|---|---|---|
| 0 | 22.0 | 3 | 1 | 0 |
| 1 | 38.0 | 1 | 1 | 0 |
| 2 | 26.0 | 3 | 0 | 0 |
| 3 | 35.0 | 1 | 1 | 0 |
| 4 | 35.0 | 3 | 0 | 0 |


In [52]:
np.mean(cross_val_score(LogisticRegression(),x,y,scoring='accuracy',cv=20)) 

0.6933333333333332

**Inference**: We use `cross_val_score` with 20 folds to estimate how well the model generalises to data it was not trained on. With Logistic Regression the mean accuracy is only about 69%, which is quite low.
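To make the cross-validation step more concrete, here is a minimal sketch on synthetic data (an assumption; `make_classification` stands in for the Titanic columns). It keeps the per-fold scores so the spread across folds can be seen next to the mean.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with four features; not the Titanic columns.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# One accuracy value per fold; the model is refit for every fold.
scores = cross_val_score(
    LogisticRegression(), X_demo, y_demo, scoring="accuracy", cv=20
)
print(f"mean accuracy: {scores.mean():.3f}, std across folds: {scores.std():.3f}")
```

Looking at the standard deviation as well as the mean helps judge whether a small change in mean accuracy is larger than the normal variation between folds.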

Let's use Feature Construction to see if we can improve on this by working more on our data.

## Applying Feature Construction

In [53]:
x['Family_size'] = x['SibSp'] + x['Parch'] + 1

We create a new column, `Family_size`, by adding the `SibSp` (siblings/spouses aboard) and `Parch` (parents/children aboard) columns.

The +1 counts the passenger themselves in the family size.

In [54]:
x.head()

| | Age | Pclass | SibSp | Parch | Family_size |
|---|---|---|---|---|---|
| 0 | 22.0 | 3 | 1 | 0 | 2 |
| 1 | 38.0 | 1 | 1 | 0 | 2 |
| 2 | 26.0 | 3 | 0 | 0 | 1 |
| 3 | 35.0 | 1 | 1 | 0 | 2 |
| 4 | 35.0 | 3 | 0 | 0 | 1 |


Next, we create a function that maps the family size to a family type.

In [55]:
def myfamily(num):
  if num == 1:
    #alone 
    return 0
  elif num > 1 and num <= 4:
    #small family
    return 1
  else:
    #large family
    return 2

In [56]:
myfamily(4)

1

In [57]:
x['Family_type'] = x['Family_size'].apply(myfamily)

Using this function, we can now tell whether a passenger is travelling alone (0), with a small family (1), or with a large family (2).
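The same binning can also be expressed with `pd.cut` instead of a hand-written function. This is an alternative sketch on a few toy family sizes, not the approach used in this notebook; the bin edges mirror the thresholds in `myfamily`.

```python
import pandas as pd

# Toy family sizes covering each category.
family_size = pd.Series([1, 2, 4, 5, 8], name="Family_size")

# Bins are right-inclusive by default: (0, 1] alone, (1, 4] small, (4, inf) large.
family_type = pd.cut(
    family_size, bins=[0, 1, 4, float("inf")], labels=[0, 1, 2]
).astype(int)
print(family_type.tolist())
```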

In [58]:
x.head()

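| | Age | Pclass | SibSp | Parch | Family_size | Family_type |
|---|---|---|---|---|---|---|
| 0 | 22.0 | 3 | 1 | 0 | 2 | 1 |
| 1 | 38.0 | 1 | 1 | 0 | 2 | 1 |
| 2 | 26.0 | 3 | 0 | 0 | 1 | 0 |
| 3 | 35.0 | 1 | 1 | 0 | 2 | 1 |
| 4 | 35.0 | 3 | 0 | 0 | 1 | 0 |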


In [59]:
x.drop(columns = ['SibSp' , 'Parch' , 'Family_size'] , inplace = True)

In [60]:
x.head()

| | Age | Pclass | Family_type |
|---|---|---|---|
| 0 | 22.0 | 3 | 1 |
| 1 | 38.0 | 1 | 1 |
| 2 | 26.0 | 3 | 0 |
| 3 | 35.0 | 1 | 1 |
| 4 | 35.0 | 3 | 0 |


In [61]:
np.mean(cross_val_score(LogisticRegression(),x,y,scoring='accuracy',cv=20))

0.7003174603174602

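The mean accuracy rises slightly, from about 69.3% to about 70.0%, after applying Feature Construction. The gain is small, so it is better read as a modest improvement than as a decisive one.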

# Using Feature Splitting technique

In [62]:
df = pd.read_csv('train.csv')

In [63]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [64]:
df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

**Inference:** Every name follows the same pattern. In `Braund, Mr. Owen Harris`, the surname (`Braund`) comes first, then the salutation (`Mr`), followed by the given names (`Owen Harris`).

In [65]:
df['Title'] = df['Name'].str.split(',' , expand = True)[1].str.split('.',expand=True)[0]

This code first splits the name at the comma and keeps the part after it, then splits that part at the period and keeps the part before it, which is the salutation (title). The following cells show each stage separately.

In [66]:
df['Name'].str.split(',' , expand = True)[1]          

0                                  Mr. Owen Harris
1       Mrs. John Bradley (Florence Briggs Thayer)
2                                      Miss. Laina
3               Mrs. Jacques Heath (Lily May Peel)
4                                Mr. William Henry
                          ...                     
886                                    Rev. Juozas
887                           Miss. Margaret Edith
888                 Miss. Catherine Helen "Carrie"
889                                Mr. Karl Howell
890                                    Mr. Patrick
Name: 1, Length: 891, dtype: object

**Inference**: The surname before the comma (for example, `Braund`) has been removed.

In [67]:
df['Name'].str.split(',' , expand = True)[1].str.split('.',expand=True)   

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Mr | Owen Harris | |
| 1 | Mrs | John Bradley (Florence Briggs Thayer) | |
| 2 | Miss | Laina | |
| 3 | Mrs | Jacques Heath (Lily May Peel) | |
| 4 | Mr | William Henry | |
| ... | ... | ... | ... |
| 886 | Rev | Juozas | |
| 887 | Miss | Margaret Edith | |
| 888 | Miss | Catherine Helen "Carrie" | |
| 889 | Mr | Karl Howell | |


**Inference:** Splitting at the period separates the salutation (column 0) from the given names (column 1). Column 2 is filled only for names that contain a second period.

In [68]:
df['Name'].str.split(',' , expand = True)[1].str.split('.',expand=True)[0]

0         Mr
1        Mrs
2       Miss
3        Mrs
4         Mr
       ...  
886      Rev
887     Miss
888     Miss
889       Mr
890       Mr
Name: 0, Length: 891, dtype: object

**Inference:** Selecting column 0 gives us the salutation of each passenger.

In [69]:
df[['Title' , 'Name']]

| | Title | Name |
|---|---|---|
| 0 | Mr | Braund, Mr. Owen Harris |
| 1 | Mrs | Cumings, Mrs. John Bradley (Florence Briggs Th... |
| 2 | Miss | Heikkinen, Miss. Laina |
| 3 | Mrs | Futrelle, Mrs. Jacques Heath (Lily May Peel) |
| 4 | Mr | Allen, Mr. William Henry |
| ... | ... | ... |
| 886 | Rev | Montvila, Rev. Juozas |
| 887 | Miss | Graham, Miss. Margaret Edith |
| 888 | Miss | Johnston, Miss. Catherine Helen "Carrie" |
| 889 | Mr | Behr, Mr. Karl Howell |


**Inference:** We have derived a new `Title` column from the original `Name` column, which can prove useful in further analysis of the data.
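Splitting at the comma keeps the space that follows it, so the extracted titles carry a leading space. As an alternative sketch (not the method used in this notebook), a single regular expression with `str.extract` can capture the title without surrounding whitespace. The three names below are taken from the output shown earlier.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# Capture the text between the comma and the first period, then trim spaces.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```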

In [70]:
# Select 'Survived' before averaging so that non-numeric columns are not aggregated
df.groupby('Title')['Survived'].mean().sort_values(ascending = False)

Title
 the Countess    1.000000
 Mlle            1.000000
 Sir             1.000000
 Ms              1.000000
 Lady            1.000000
 Mme             1.000000
 Mrs             0.792000
 Miss            0.697802
 Master          0.575000
 Col             0.500000
 Major           0.500000
 Dr              0.428571
 Mr              0.156673
 Jonkheer        0.000000
 Rev             0.000000
 Don             0.000000
 Capt            0.000000
Name: Survived, dtype: float64

**Inference:** Grouping by title shows the survival rate for each salutation. Passengers titled `Mrs` (about 79%) and `Miss` (about 70%) have high survival rates, while `Mr` is low (about 16%). The titles at exactly 100% or 0% are rare ones, so those rates likely rest on very few passengers.

Hence, feature splitting gives us another feature to use when predicting survival.
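A survival rate is easier to interpret next to the number of passengers behind it. The sketch below shows one way to report both per title; the six rows are toy data invented for illustration, not values from the Titanic file.

```python
import pandas as pd

# Toy data (an assumption), small enough to check by hand.
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Mrs", "Mrs", "Sir"],
    "Survived": [0, 0, 1, 1, 1, 1],
})

# Named aggregation: survival rate and group size for each title.
summary = (
    toy.groupby("Title")["Survived"]
    .agg(rate="mean", count="count")
    .sort_values("rate", ascending=False)
)
print(summary)
```

In this toy table `Sir` has a rate of 1.0 but only one passenger, which is the kind of case where a rare title can look more informative than it is.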

# **Feature Extraction**

**Feature Extraction:** Feature extraction is the process of transforming raw data into numerical features that can be processed while preserving the information in the original data set. It often yields better results than applying machine learning directly to the raw data.

# **Curse of Dimensionality**

**Curse of Dimensionality**: Feature construction and feature splitting show how adding features can help, but this only works up to a point.

Adding too many features can hurt the performance of the model and also makes the model far more complex.

As we keep adding features, the performance of the model improves up to an optimum point; beyond that point, it begins to drop.

In machine learning, we typically meet high-dimensional data when working with:

1.   Images
2.   Text data (which can produce a very large number of columns)




**What exactly is the problem with a higher-dimensional space?**

- Sparsity: as we increase the number of dimensions, the data points spread further away from each other.

- Computation: working with the data becomes more expensive as the number of features grows.
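The sparsity effect can be observed with a small experiment: draw the same number of random points in spaces of increasing dimension and measure the average distance between them. This is a minimal sketch; the helper `mean_pairwise_distance`, the uniform sampling in the unit cube, and the chosen dimensions are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pairwise_distance(n_points, n_dims):
    """Average Euclidean distance between uniform random points in the unit cube."""
    points = rng.random((n_points, n_dims))
    # Differences between every pair of points, shape (n_points, n_points, n_dims).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Keep each pair once and skip the zero distances on the diagonal.
    upper = np.triu_indices(n_points, k=1)
    return dists[upper].mean()

mean_distances = {d: mean_pairwise_distance(200, d) for d in (2, 10, 100)}
for dims, dist in mean_distances.items():
    print(f"{dims:>3} dimensions: mean distance {dist:.2f}")
```

With the number of points held fixed, the mean distance grows as dimensions are added, which is what "the data points spread away from each other" means in practice.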

**Solution:**

**Dimensionality Reduction**

The dimensionality can be reduced in 2 ways:
 - Feature Selection: out of all the n features, we select a subset of the original features to work with.
 - Feature Extraction: we derive a new, smaller set of features from the given set of features.
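The difference between the two routes can be sketched with scikit-learn on synthetic data. `SelectKBest` is used here as a simple stand-in for feature selection (it is a filter method, not the backward elimination or forward selection listed in the flowchart), and `PCA` stands in for feature extraction. The data, `k=3`, and `n_components=3` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data with 10 features (an assumption; not the Titanic data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

# Feature selection: keep 3 of the original columns, unchanged.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_demo, y_demo)
kept_columns = selector.get_support(indices=True)

# Feature extraction: build 3 new columns as combinations of all 10.
pca = PCA(n_components=3)
X_extracted = pca.fit_transform(X_demo)

print("selected original columns:", kept_columns)
print("shapes:", X_selected.shape, X_extracted.shape)
```

Both results have three columns, but the selected columns are still original features, while the extracted ones are new features that mix all of the inputs.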

**Summary:** High-dimensional data is risky for a machine learning model: beyond a certain number of features, the data becomes sparse, computation becomes expensive, and performance tends to drop.