### Feature Engineering Techniques

#### 1) Handling Missing Values
a) dropna()<br>
b) fillna()<br>

#### 2) Encoding Categorical Variables
a) Label Encoding<br>
b) Find and Replace<br>
c) pd.get_dummies()<br>

#### 3) Creating New Features (Date-related columns)

#### 4) Scaling and Standardization
a) StandardScaler<br>
b) MinMaxScaler<br>
c) Normalize<br>

#### 5) Binning

#### 6) Groupby

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### 1) Handling Missing Values

In [2]:
d = {'Prod':['A','B','A','C','B','C','C','B',np.nan,'B','A','B','C','B','C'],
    'Sales':[120,230,150,np.nan,100,80,180,240,np.nan,140,210,320,450,np.nan,200]}

df = pd.DataFrame(d)
df.head()

Unnamed: 0,Prod,Sales
0,A,120.0
1,B,230.0
2,A,150.0
3,C,
4,B,100.0


In [3]:
df.shape

(15, 2)

In [4]:
df.isnull().sum()

Prod     1
Sales    3
dtype: int64

In [5]:
df1 = df.copy()
df1

Unnamed: 0,Prod,Sales
0,A,120.0
1,B,230.0
2,A,150.0
3,C,
4,B,100.0
5,C,80.0
6,C,180.0
7,B,240.0
8,,
9,B,140.0


#### a) dropna()
1) It is used to drop/remove null values.<br>
2) By default (axis=0), it drops every row that contains at least one null value.

In [6]:
df1.isnull().sum()

Prod     1
Sales    3
dtype: int64

In [7]:
df1.dropna(inplace=True,axis=0) # axis=0 => Row wise

In [8]:
df1.isnull().sum()

Prod     0
Sales    0
dtype: int64

In [9]:
df1

Unnamed: 0,Prod,Sales
0,A,120.0
1,B,230.0
2,A,150.0
4,B,100.0
5,C,80.0
6,C,180.0
7,B,240.0
9,B,140.0
10,A,210.0
11,B,320.0


In [10]:
df1.shape

(12, 2)

#### b) fillna()
1) Fills the null values.<br>
2) For categorical columns, we usually fill null values with the mode of the column, or create a new category such as 'Other', 'Null' or 'Not Available'.<br>
3) For continuous columns, we handle null values with methods such as mean imputation, median imputation, ffill, bfill or interpolate.
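
The cells that follow only demonstrate mean imputation and a fixed mode value. The other options listed above can be sketched on a small toy series. The series `s` and `c` below are hypothetical and are not the notebook's `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous column with gaps
s = pd.Series([10.0, np.nan, 30.0, np.nan, 100.0])

median_filled = s.fillna(s.median())   # median of [10, 30, 100] is 30
forward_filled = s.ffill()             # carry the last valid value forward
backward_filled = s.bfill()            # pull the next valid value backward
interpolated = s.interpolate()         # linear interpolation between neighbours

# Hypothetical categorical column with one gap
c = pd.Series(['A', 'B', np.nan, 'B'])
mode_filled = c.fillna(c.mode()[0])    # 'B' is the most frequent value
other_filled = c.fillna('Other')       # or keep the gap as its own category

print(interpolated.tolist())           # [10.0, 20.0, 30.0, 65.0, 100.0]
```

Which method is appropriate depends on the data: ffill, bfill and interpolate only make sense when the row order is meaningful, for example in a time series.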

In [11]:
df2 = df.copy()
df2.isnull().sum()

Prod     1
Sales    3
dtype: int64

In [12]:
df2.sample(5)

Unnamed: 0,Prod,Sales
9,B,140.0
7,B,240.0
6,C,180.0
0,A,120.0
2,A,150.0


In [13]:
df['Prod'].value_counts()

B    6
C    5
A    3
Name: Prod, dtype: int64

In [14]:
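# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df2['Sales'] = df2['Sales'].fillna(df2['Sales'].mean())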
df2.isnull().sum()

Prod     1
Sales    0
dtype: int64

In [15]:
df2.head()

Unnamed: 0,Prod,Sales
0,A,120.0
1,B,230.0
2,A,150.0
3,C,201.666667
4,B,100.0


In [16]:
df2['Prod'].value_counts()

B    6
C    5
A    3
Name: Prod, dtype: int64

In [17]:
# 'B' is the mode of Prod (see value_counts above)
df2['Prod'] = df2['Prod'].fillna('B')
df2.isnull().sum()

Prod     0
Sales    0
dtype: int64

In [18]:
df2.head()

Unnamed: 0,Prod,Sales
0,A,120.0
1,B,230.0
2,A,150.0
3,C,201.666667
4,B,100.0


### 2) Encoding Categorical Columns
a) Label Encoding<br>
b) Find and Replace<br>
c) pd.get_dummies()<br>

In [19]:
d1 = {'State' :['UP','MP','Goa','AP','Goa','UP','MP'],
     'Month':['Jan','Feb','Apr','Mar','Feb','Apr','Apr'],
     'Prod':['A1','B2','A1','B1','C2','C2','B2']}
dfcc = pd.DataFrame(d1)
dfcc.head()

Unnamed: 0,State,Month,Prod
0,UP,Jan,A1
1,MP,Feb,B2
2,Goa,Apr,A1
3,AP,Mar,B1
4,Goa,Feb,C2


In [20]:
dfcc.shape

(7, 3)

In [21]:
dfcc.dtypes

State    object
Month    object
Prod     object
dtype: object

#### a) Label Encoding

In [22]:
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()

In [23]:
dfcc['Month'].value_counts()

Apr    3
Feb    2
Jan    1
Mar    1
Name: Month, dtype: int64

In [24]:
dfcc['Month'] = lb.fit_transform(dfcc['Month'])
dfcc['Month'].value_counts()

0    3
1    2
2    1
3    1
Name: Month, dtype: int64

In [25]:
dfcc.dtypes

State    object
Month     int32
Prod     object
dtype: object
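
The output above shows the codes 0 to 3 but not which month each code stands for. A minimal sketch for recovering the mapping, using a separate encoder `enc` (a name introduced here) on the same month values:

```python
from sklearn.preprocessing import LabelEncoder

months = ['Jan', 'Feb', 'Apr', 'Mar', 'Feb', 'Apr', 'Apr']
enc = LabelEncoder()
codes = enc.fit_transform(months)

# classes_ is sorted, so the code of a label is its position in this array
mapping = {str(label): int(code) for code, label in enumerate(enc.classes_)}
print(mapping)   # {'Apr': 0, 'Feb': 1, 'Jan': 2, 'Mar': 3}

# inverse_transform recovers the original labels from the codes
restored = enc.inverse_transform(codes)
print(list(restored) == months)
```

Note that the codes follow alphabetical order, not calendar order, so 'Apr' gets a smaller code than 'Jan'.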

#### b) Find and replace

In [26]:
dfcc['State'].value_counts()

UP     2
MP     2
Goa    2
AP     1
Name: State, dtype: int64

In [27]:
dfcc['State'] = dfcc['State'].replace({'UP':10,'Goa':15,'MP':20,'AP':25})
dfcc['State'].value_counts()

10    2
20    2
15    2
25    1
Name: State, dtype: int64

In [28]:
dfcc.dtypes

State     int64
Month     int32
Prod     object
dtype: object

In [29]:
dfcc.head()

Unnamed: 0,State,Month,Prod
0,10,2,A1
1,20,1,B2
2,15,0,A1
3,25,3,B1
4,15,1,C2


In [30]:
dfcc['Prod'].value_counts()

A1    2
B2    2
C2    2
B1    1
Name: Prod, dtype: int64

#### c) pd.get_dummies()
1) It generates a dataframe.<br>
2) It creates as many columns as there are categories in the column that needs to be encoded.

In [31]:
dfcc2 = pd.get_dummies(data=dfcc,columns=['Prod'])
dfcc2

Unnamed: 0,State,Month,Prod_A1,Prod_B1,Prod_B2,Prod_C2
0,10,2,1,0,0,0
1,20,1,0,0,1,0
2,15,0,1,0,0,0
3,25,3,0,1,0,0
4,15,1,0,0,0,1
5,10,0,0,0,0,1
6,20,0,0,0,1,0


### 3) Creating New Columns (Date related columns)

In [32]:
# Date-Format : YYYY-MM-DD  or YYYY/MM/DD
d3 = {'Dates':['2019-03-19','2022-10-12','2021-07-05','2020-01-26','2021-05-24']}
dw = pd.DataFrame(d3)
dw.head()

Unnamed: 0,Dates
0,2019-03-19
1,2022-10-12
2,2021-07-05
3,2020-01-26
4,2021-05-24


#### Q) How to create Year, Quarter, Month, Day, Weekday and Week number columns from the Dates column in dataframe dw

In [33]:
dw.dtypes

Dates    object
dtype: object

In [34]:
dw['Dates'] = pd.to_datetime(dw['Dates'])
dw.dtypes

Dates    datetime64[ns]
dtype: object

In [35]:
dw['Year'] = dw['Dates'].dt.year
dw['Quarter'] = dw['Dates'].dt.quarter
dw['Month'] = dw['Dates'].dt.month
dw['Day'] = dw['Dates'].dt.day
dw.head()

Unnamed: 0,Dates,Year,Quarter,Month,Day
0,2019-03-19,2019,1,3,19
1,2022-10-12,2022,4,10,12
2,2021-07-05,2021,3,7,5
3,2020-01-26,2020,1,1,26
4,2021-05-24,2021,2,5,24


In [36]:
dw['Weekday'] = dw['Dates'].dt.weekday
# Series.dt.week is deprecated (removed in pandas 2.0); use the ISO calendar week instead
dw['Week_Num'] = dw['Dates'].dt.isocalendar().week
dw.head()


Unnamed: 0,Dates,Year,Quarter,Month,Day,Weekday,Week_Num
0,2019-03-19,2019,1,3,19,1,12
1,2022-10-12,2022,4,10,12,2,41
2,2021-07-05,2021,3,7,5,0,27
3,2020-01-26,2020,1,1,26,6,4
4,2021-05-24,2021,2,5,24,0,21


### 4) Scaling, Standardization and Normalization

#### a) normalize
1) It returns a numpy array

In [37]:
from sklearn.preprocessing import normalize

In [38]:
d3 = {'x1':[1,2,3,4,5],
     'x2': [6,7,8,9,10]}
dfn =  pd.DataFrame(d3)
dfn

Unnamed: 0,x1,x2
0,1,6
1,2,7
2,3,8
3,4,9
4,5,10


In [39]:
dfn_l1 = normalize(dfn,norm='l1')
dfn_l1
# L1 norm: x = x / (sum of the absolute values in the same row)
# After normalization, the values in each row sum to 1

array([[0.14285714, 0.85714286],
       [0.22222222, 0.77777778],
       [0.27272727, 0.72727273],
       [0.30769231, 0.69230769],
       [0.33333333, 0.66666667]])

In [40]:
dfn

Unnamed: 0,x1,x2
0,1,6
1,2,7
2,3,8
3,4,9
4,5,10


In [41]:
dfn_l2 = normalize(dfn,norm='l2')
dfn_l2

array([[0.16439899, 0.98639392],
       [0.27472113, 0.96152395],
       [0.35112344, 0.93632918],
       [0.40613847, 0.91381155],
       [0.4472136 , 0.89442719]])
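
Both norms can be reproduced by hand, which makes the row-wise behaviour explicit. This is a small check with NumPy on the same values as `dfn`; the names `X`, `l1_manual` and `l2_manual` are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[1, 6], [2, 7], [3, 8], [4, 9], [5, 10]], dtype=float)

# L1: divide each row by the sum of the absolute values in that row
l1_manual = X / np.abs(X).sum(axis=1, keepdims=True)

# L2: divide each row by its Euclidean length sqrt(x1^2 + x2^2)
l2_manual = X / np.sqrt((X ** 2).sum(axis=1, keepdims=True))

print(np.allclose(l1_manual, normalize(X, norm='l1')))
print(np.allclose(l2_manual, normalize(X, norm='l2')))
print(l2_manual[0])   # first row: [1, 6] / sqrt(37)
```

After L2 normalization, the squared values in each row sum to 1, just as the raw values do after L1 normalization.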

#### b) MinMaxScaler

In [42]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler((3,5))

In [43]:
dfn

Unnamed: 0,x1,x2
0,1,6
1,2,7
2,3,8
3,4,9
4,5,10


In [44]:
dfn_mms = mms.fit_transform(dfn)
dfn_mms

array([[3. , 3. ],
       [3.5, 3.5],
       [4. , 4. ],
       [4.5, 4.5],
       [5. , 5. ]])

In [45]:
# X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# X_scaled = X_std * (max - min) + min

In [46]:
x_std4 = (4 - dfn['x1'].min())/(dfn['x1'].max() - dfn['x1'].min())
print(x_std4)
x_sc4 = x_std4*(5-3) + 3
print(x_sc4)

0.75
4.5
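
The outline at the top also lists StandardScaler. A minimal sketch of it on the same `dfn` values, assuming the default settings: each column is centred on its mean and divided by its population standard deviation (ddof=0). The names `ss`, `dfn_ss` and `manual` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

dfn = pd.DataFrame({'x1': [1, 2, 3, 4, 5], 'x2': [6, 7, 8, 9, 10]})

ss = StandardScaler()
dfn_ss = ss.fit_transform(dfn)   # returns a numpy array, like MinMaxScaler

# z = (x - column mean) / column standard deviation (population std, ddof=0)
manual = (dfn - dfn.mean()) / dfn.std(ddof=0)

print(dfn_ss)
print(np.allclose(dfn_ss, manual.to_numpy()))
```

After scaling, every column has mean 0 and standard deviation 1, so both columns of `dfn` end up with identical values here.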


### 5) Binning
1) Creating groups (bins) out of existing continuous data.<br>
2) Bins can all have the same width, or they can have different widths.

In [47]:
d4 = {'Sales':[1350,2450,4530,4552,5430,2190,5409,6765,3020,2810]}
df_bin = pd.DataFrame(d4)
df_bin.head()

Unnamed: 0,Sales
0,1350
1,2450
2,4530
3,4552
4,5430


In [48]:
df_bin['Sales_bin'] = pd.cut(df_bin['Sales'],bins=[1300,2800,4530,6500],
                             labels=['Low','Med','High'])
df_bin.head(10)
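# pd.cut intervals are open on the left and closed on the right by default
# bin1 = (1300, 2800]  - Low
# bin2 = (2800, 4530]  - Med
# bin3 = (4530, 6500]  - High
# Values outside the bin edges (e.g. 6765) become NaN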

Unnamed: 0,Sales,Sales_bin
0,1350,Low
1,2450,Low
2,4530,Med
3,4552,High
4,5430,High
5,2190,Low
6,5409,High
7,6765,
8,3020,Med
9,2810,Med
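
The example above uses custom edges, and 6765 falls outside them. Two variations can be sketched on the same Sales values: equal-width bins, and an open-ended last bin. The names `equal_width` and `open_ended` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

sales = pd.Series([1350, 2450, 4530, 4552, 5430, 2190, 5409, 6765, 3020, 2810])

# Equal-width bins: an integer asks pd.cut to split the observed range into 3 equal parts
equal_width = pd.cut(sales, bins=3, labels=['Low', 'Med', 'High'])

# Custom edges with an open-ended last bin, so 6765 gets a label instead of NaN
open_ended = pd.cut(sales, bins=[1300, 2800, 4530, np.inf],
                    labels=['Low', 'Med', 'High'])

print(pd.DataFrame({'Sales': sales,
                    'equal_width': equal_width,
                    'open_ended': open_ended}))
```

With `bins=3` the edges are derived from the minimum and maximum of the data, so every value receives a label.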


In [49]:
# Create a Sales_bin, based on low and high sales values
df_bin['Sales_bin_binary'] = np.where(df_bin['Sales']<=3500,'Low_Sales','High_Sales')
df_bin.head(10)

Unnamed: 0,Sales,Sales_bin,Sales_bin_binary
0,1350,Low,Low_Sales
1,2450,Low,Low_Sales
2,4530,Med,High_Sales
3,4552,High,High_Sales
4,5430,High,High_Sales
5,2190,Low,Low_Sales
6,5409,High,High_Sales
7,6765,,High_Sales
8,3020,Med,Low_Sales
9,2810,Med,Low_Sales


### 6) Groupby operations
We can use different aggregate functions with groupby, such as sum(), min(), max(), count() and mean().

In [50]:
df2

Unnamed: 0,Prod,Sales
0,A,120.0
1,B,230.0
2,A,150.0
3,C,201.666667
4,B,100.0
5,C,80.0
6,C,180.0
7,B,240.0
8,B,201.666667
9,B,140.0


In [51]:
df2['Prod'].value_counts()

B    7
C    5
A    3
Name: Prod, dtype: int64

#### Q) Find Sum of Sales based on Prod
#### Q) Find Prod wise sum of Sales

In [52]:
res = df2.groupby(['Prod'])['Sales'].sum()
res

Prod
A     480.000000
B    1433.333333
C    1111.666667
Name: Sales, dtype: float64

In [53]:
res2 = df2.groupby(['Prod']).agg({'Sales':['min','max','sum','mean','count','median','var','std']})
res2

,Sales,Sales,Sales,Sales,Sales,Sales,Sales,Sales
Prod,min,max,sum,mean,count,median,var,std
A,120.0,210.0,480.0,160.0,3,150.0,2100.0,45.825757
B,100.0,320.0,1433.333333,204.761905,7,201.666667,5057.804233,71.118241
C,80.0,450.0,1111.666667,222.333333,5,200.0,18702.222222,136.756068


### Cross Validation
It is a resampling technique that repeatedly splits the data into training and test sets, so that a model can be evaluated on data it was not trained on.

### Types of Cross Validation
#### 1) K-Fold CV
a) The whole data is divided into k sets of almost equal size. The first set is selected as the test set and the model is trained on the remaining k-1 sets. The test error rate is then calculated by evaluating the fitted model on the test set.
<br>
b) In the second iteration, the 2nd set is selected as the test set, the remaining k-1 sets are used to train the model, and the error is calculated again. This process continues until each of the k sets has been used as the test set once.
<br>
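
The procedure in a) and b) can be sketched as a loop that fits on k-1 folds and scores on the held-out fold. The synthetic data, the LinearRegression model and the mean squared error metric are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (hypothetical): y is roughly 3*x + 2 with a little noise
rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + 2 + rng.normal(scale=0.5, size=20)

kf = KFold(n_splits=5)
fold_errors = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # train on k-1 folds
    pred = model.predict(X[test_idx])                           # predict the held-out fold
    fold_errors.append(mean_squared_error(y[test_idx], pred))

cv_error = np.mean(fold_errors)   # average test error over the k folds
print(fold_errors)
print(cv_error)
```

The averaged error is usually what gets reported as the cross-validated estimate.
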
#### 2) Stratified KFold
Suppose your data contains reviews for a cosmetic product used by both men and women. When we split the data into train and test sets by random sampling, most of the reviews from men may end up in the test data and be under-represented in the training data. When we train the model on a training sample that does not represent the actual population, the model will not predict the test data with good accuracy.
<br>
Another example: out of 100 records, 80 are summer sales and 20 are winter sales.
When we split the data at random (75% train data, 25% test data), it is possible that all 75 training records are summer sales and none are winter sales.
<br>
This is where Stratified Sampling comes to the rescue. Here the data is split in such a way that each split keeps the class proportions of the population.
<br>
Let’s return to the cosmetic product example, with reviews from 1000 customers of which 60% are female and 40% are male. We want to split the data into train and test data in the proportion 80:20. The 800 training reviews (80% of 1000) are chosen so that 480 come from female customers and 320 from male customers. In the same way, the 200 test reviews (20% of 1000) contain 120 from female customers and 80 from male customers.
<br>
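
The 80:20 split described above can be checked with a short sketch. It assumes scikit-learn's `train_test_split` with its `stratify` argument; the arrays `gender` and `customer_id` are hypothetical stand-ins for the review data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 hypothetical customers: 600 female ('F') and 400 male ('M')
gender = np.array(['F'] * 600 + ['M'] * 400)
customer_id = np.arange(1000)

train_id, test_id, train_g, test_g = train_test_split(
    customer_id, gender, test_size=0.2, stratify=gender, random_state=42)

# Count how many customers of each class landed in each split
print(np.unique(train_g, return_counts=True))   # 480 'F' and 320 'M'
print(np.unique(test_g, return_counts=True))    # 120 'F' and 80 'M'
```
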
#### 3) LeaveOneOut
a) Instead of dividing the data into 2 subsets, we select a single observation as test data, label everything else as training data, and train the model. In the second iteration, the 2nd observation is selected as test data and the model is trained on the remaining data. This repeats until every observation has been used as test data once.

In [54]:
from sklearn.model_selection import KFold, StratifiedKFold, LeaveOneOut

#### KFold

In [55]:
x = np.array(['a','b','c','d','e','f'])
kf = KFold(n_splits=3)

for train,test in kf.split(x):
    print('Train data',x[train],'Test data',x[test])

Train data ['c' 'd' 'e' 'f'] Test data ['a' 'b']
Train data ['a' 'b' 'e' 'f'] Test data ['c' 'd']
Train data ['a' 'b' 'c' 'd'] Test data ['e' 'f']


In [56]:
x = np.array(['a','b','c','d','e','f'])
kf = KFold(n_splits=4)

for train,test in kf.split(x):
    print('Train data',x[train],'Test data',x[test])

Train data ['c' 'd' 'e' 'f'] Test data ['a' 'b']
Train data ['a' 'b' 'e' 'f'] Test data ['c' 'd']
Train data ['a' 'b' 'c' 'd' 'f'] Test data ['e']
Train data ['a' 'b' 'c' 'd' 'e'] Test data ['f']


In [57]:
x = np.array(['a','b','c','d','e','f','g','h','i','j','k','l'])
kf = KFold(n_splits=5)

for train,test in kf.split(x):
    print('Train data',x[train],'Test data',x[test])

Train data ['d' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l'] Test data ['a' 'b' 'c']
Train data ['a' 'b' 'c' 'g' 'h' 'i' 'j' 'k' 'l'] Test data ['d' 'e' 'f']
Train data ['a' 'b' 'c' 'd' 'e' 'f' 'i' 'j' 'k' 'l'] Test data ['g' 'h']
Train data ['a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'k' 'l'] Test data ['i' 'j']
Train data ['a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j'] Test data ['k' 'l']


#### Stratified K-Fold

In [58]:
x = np.array([5,10,15,20,25,30,35,40,45,50])
y = np.array([1,0,0,1,1,0,0,1,0,1])
skf = StratifiedKFold(n_splits=2)
for train,test in skf.split(x,y):
    print('Train',x[train],'Test',x[test])
    
# 1 (5,20,25,40,50)
# 0 (10,15,30,35,45)

# Split1 -> Train(30(0),35(0),40(1),45(0),50(1)) - Test(5(1),10(0),15(0),20(1),25(1))
# Split2 -> Train(5(1),10(0),15(0),20(1),25(1)) -  Test(30(0),35(0),40(1),45(0),50(1))

Train [30 35 40 45 50] Test [ 5 10 15 20 25]
Train [ 5 10 15 20 25] Test [30 35 40 45 50]


In [59]:
x = np.array([5,10,15,20,25,30,35,40,45,50])
y = np.array([1,0,0,1,1,0,0,1,0,1])
skf = StratifiedKFold(n_splits=4)
for train,test in skf.split(x,y):
    print('Train',x[train],'Test',x[test])

Train [15 25 30 35 40 45 50] Test [ 5 10 20]
Train [ 5 10 20 35 40 45 50] Test [15 25 30]
Train [ 5 10 15 20 25 30 45 50] Test [35 40]
Train [ 5 10 15 20 25 30 35 40] Test [45 50]


#### LeaveOneOut

In [60]:
x = np.array(['a','b','c','d','e','f','g','h'])
loo = LeaveOneOut()
for train,test in loo.split(x):
    print('Train data',x[train],'Test data',x[test])

Train data ['b' 'c' 'd' 'e' 'f' 'g' 'h'] Test data ['a']
Train data ['a' 'c' 'd' 'e' 'f' 'g' 'h'] Test data ['b']
Train data ['a' 'b' 'd' 'e' 'f' 'g' 'h'] Test data ['c']
Train data ['a' 'b' 'c' 'e' 'f' 'g' 'h'] Test data ['d']
Train data ['a' 'b' 'c' 'd' 'f' 'g' 'h'] Test data ['e']
Train data ['a' 'b' 'c' 'd' 'e' 'g' 'h'] Test data ['f']
Train data ['a' 'b' 'c' 'd' 'e' 'f' 'h'] Test data ['g']
Train data ['a' 'b' 'c' 'd' 'e' 'f' 'g'] Test data ['h']
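
LeaveOneOut produces one split per observation, which is the same as KFold with `n_splits` equal to the number of observations. A small check on the same array (the names `loo_tests` and `kf_tests` are introduced here for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

x = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# Collect the test indices produced by each splitter
loo_tests = [test.tolist() for _, test in LeaveOneOut().split(x)]
kf_tests = [test.tolist() for _, test in KFold(n_splits=len(x)).split(x)]

print(loo_tests == kf_tests)            # True
print(LeaveOneOut().get_n_splits(x))    # 8
```

Because the model is trained once per observation, LeaveOneOut becomes expensive on large datasets.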
