<a href="https://colab.research.google.com/github/dk-wei/ml-algo-implementation/blob/main/ML_Feature_Engineering_Udemy_Notes_(%E5%BC%BA%E6%8E%A8).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Engineering


Udemy Course: [Feature Engineering for ML](https://cisco.udemy.com/course/feature-engineering-for-machine-learning/learn/lecture/15675348#overview)

1. Variable Types
  - Numerical variable
    - Discrete (e.g. number of bank accounts, number of pets in a family)
    
    - Continuous (e.g. amount a customer pays each time, interest rate paid)
      
  - Categorical variable
    - Ordinal (order is meaningful, e.g. student grades: A, B, C...; education degree: BA, MA...)
    - Nominal (order is meaningless, e.g. vehicle make: BMW, Mercedes; country of birth: China, Germany)
      - High cardinality effect
        - A categorical feature has so many distinct labels that almost every record has its own label (e.g. an ID). Feature importance from tree-based models is unreliable for such features (overfitting).
        - scikit-learn requires strings to be converted to numerical values, so encoding is mandatory; the most basic option is one-hot encoding.
        - Uneven distribution between the training and testing datasets: some labels exist only in the training set, others only in the test set.
        - Overfitting: too many labels lead to overfitting, especially for tree-based algorithms, because they add splits.
        - This can be addressed with target encoding, or by dropping the feature altogether.
      - Rare labels effect
        - Similar to high cardinality: a feature has a few rare labels with very few observations, which can be spotted by checking the proportion of each label. They cause similar problems (uneven distribution, overfitting). Two remedies:
          1. one-hot encoding 
          2. group them all into a single rare group
  - Date and time
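    - Extract week of year, quarter of year, how long since some event, etc. as features
    - For example, from a payment date ('1987-09-01 15:20:20') we can derive:
      - Day: day of the week, whether it is a weekend
      - Month: 1-12, which quarter, which semester, whether it is holiday season
      - Year: which year
      - Hour: **morning, afternoon, daytime or night?**
      - Elapsed time: time difference from the first payment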

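  - Mixed variable (both numerical and categorical, e.g. vehicle registration 湘A 22331) 
    - Usually the numerical and categorical parts are extracted into two columns, Num and Cat
    - number of missed payments (1-3, D, A)
      - When a variable holds both numbers and categories, create two columns, one for the numbers and one for the categories, and mark everything else as missing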
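
The date and time derivations listed above can be sketched with pandas. This is a minimal illustration rather than the course's own code: the `payments` frame, its column names and the second and third timestamps are made up for the example; only the first timestamp comes from the notes.

```python
import pandas as pd

# Hypothetical payment log; only the first timestamp appears in the notes.
payments = pd.DataFrame({
    "payment_date": pd.to_datetime([
        "1987-09-01 15:20:20",
        "1987-09-05 09:00:00",
        "1987-12-24 22:10:00",
    ])
})

dt = payments["payment_date"].dt
payments["day_of_week"] = dt.dayofweek        # Monday=0 ... Sunday=6
payments["is_weekend"] = dt.dayofweek >= 5    # Saturday or Sunday
payments["month"] = dt.month
payments["quarter"] = dt.quarter
payments["semester"] = (dt.quarter + 1) // 2  # 1 for Jan-Jun, 2 for Jul-Dec
payments["year"] = dt.year
payments["hour"] = dt.hour

# Elapsed time: whole days since the first payment in the log
first_payment = payments["payment_date"].min()
payments["days_since_first"] = (payments["payment_date"] - first_payment).dt.days

print(payments)
```

A holiday-season flag or a morning/afternoon/night bucket could be derived from `month` and `hour` in the same way, with cut-offs that depend on the business context.
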

2. Missing Data Imputation
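  - Drop the records: when data are missing completely at random and the missing records total no more than 5%, simply deleting them is the quick and blunt option 
  - Drop the feature: if a feature has too many missing values, e.g. more than half, there is no point keeping it
  - Missing data imputation always rests on one small assumption: all data are missing completely at random.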
  - Numerical Features 
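    - One detail to note: always use values computed from the training data to fill the NAs in both the training and test sets
    - mean/median imputation (when there are few missing values)
      - After imputing, compare the data describe/variance before and after
    - arbitrary value imputation
      - Fill with a value outside the normal range, e.g. `999` or `-1`
      - After imputing, compare the data describe/variance before and after
    - end of tail imputation
      - Fill with `mean+3*std`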
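  - **Categorical Features (when there are few missing values)**
    - frequent category imputation
      - Mode imputation: fill with the mode value
      - This is awkward for later feature extraction, because the filled values do not reflect reality
    - add a 'missing' category
      - Fill with 'missing'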
  - Both
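    - Random sample imputation
      - Fill with randomly sampled values. The drawback is that every training run may produce a different model and different predictions. We can set a seed, so although it is called random, it is not really that random
    - Add a missing indicator + mean/median/mode imputation
      - Alongside filling with any of these values, add a missing-indicator column marking whether the value was `missing`, **especially when missing values are present in more than 5% of the observations.**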
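
A minimal sketch of the imputation ideas above, assuming pandas: fill values are learned from the training set only and then applied to both sets, with a missing indicator added before filling. The frames, column names and the helper `impute` are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# Toy data; column names and values are made up for illustration.
train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 50.0],
                      "city": ["A", "B", "A", np.nan]})
test = pd.DataFrame({"age": [np.nan, 40.0],
                     "city": [np.nan, "B"]})

# Learn the fill values from the training set only
age_median = train["age"].median()
city_mode = train["city"].mode()[0]

def impute(df):
    """Apply the training-set fill values to any frame (hypothetical helper)."""
    out = df.copy()
    # Missing indicator must be created before the NAs are filled
    out["age_was_missing"] = out["age"].isna().astype(int)
    out["age"] = out["age"].fillna(age_median)            # median imputation
    out["city_mode"] = out["city"].fillna(city_mode)      # frequent category
    out["city_labelled"] = out["city"].fillna("missing")  # extra 'missing' label
    return out

train_imputed = impute(train)
test_imputed = impute(test)
print(test_imputed)
```

Comparing `train["age"].describe()` with `train_imputed["age"].describe()` is one way to carry out the before/after check mentioned in the notes.
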
3. Categorical Data Encoding
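  - Note that categorical encoding is always about the different labels within a single column
  - Traditional techniques
    1. one-hot encoding / dummy variable
      - For example, if the color column contains labels such as red, green and blue, create one column per color holding either 1 or 0; simple and direct
      - **Beware of the dummy variable trap: only (k-1) of the k labels need to be one-hot encoded**
      - There are exceptions: tree-based models do not need (k-1) encoding, because **a tree model does not use all the features**; and when the importance of every label is needed, all k variables are used too
      - Advantages: it suits linear models particularly well and keeps all the information in the categorical values. Disadvantages: it can cause the curse of dimensionality and carries a lot of redundant information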
    2. ordinal/label encoding
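      - Replace the categorical labels with the numbers 1, 2, 3...
      - Advantages: simple, **works well with tree-based models**, and does not enlarge the feature space. Disadvantages: not suited to linear models, and it appears to impose an order on the labels when there is none
    3. count/freq encoding
      - Represent each label by its count/frequency, e.g. if red appears twice, red is encoded as 2
      - Advantages: simple, works well with tree-based models, and does not enlarge the feature space. Disadvantages: different labels with the same count receive the same code (losing valuable information), it is not suited to linear models, and it cannot handle new labels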
  - Monotonic relationship
    1. Target guided ordered label encoding
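      - Assign integer labels following the order of the target mean of each label, which creates a monotonic relationship between categories and target
      - Advantage: works for both linear and non-linear models. Disadvantages: it may cause overfitting and is somewhat hard to cross-validate
    2. **mean/target encoding**
      - Replace each label with the target mean for that label, see this [example](https://maxhalford.github.io/blog/target-encoding/); it also creates a monotonic relationship between categories and target
      - Advantages: works for both linear and non-linear models and **suits high cardinality features well**. Disadvantages: it may cause overfitting, is somewhat hard to cross-validate, and different labels can end up with the same code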
    3. probability ratio encoding
    4. weight of evidence (WOE)
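      - Formula: $WOE = \ln\left(\frac{\text{proportion of good events}}{\text{proportion of bad events}}\right)$
      - WOE originated in the financial & credit industry as a measure of the risk of loan default; **it is essentially a log of odds**
      - Advantages: it creates a natural logistic scale, **which suits logistic regression particularly well**; the WOE values of different labels can be compared with each other; and it creates a monotonic relationship between label and target. Disadvantages: it is still prone to overfitting, and the denominator can be 0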

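    5. Handling three cases: predominant variable / few categories / high cardinality
      - predominant variable: only two labels, one extremely rare and the other covering 99%
      - few categories: only a handful of labels, one covering 99% and the others small even when combined; encode those directly as a rare label
      - high cardinality: use target encoding, or pool several labels together and encode them directly as a rare label
  - Alternative techniques
    1. binary encoding
      - Encode the labels in binary
    2. feature hashing
      - An unusual encoding method, probably used mostly in competitions
    3. others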
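
Several of the encodings above can be sketched with plain pandas. This is an illustrative example under assumed data, not the course's implementation: the `color`/`target` frames are made up, and the mappings are learned on the training frame and then applied to a test frame that contains an unseen label.

```python
import pandas as pd

# Toy data, made up for illustration
train = pd.DataFrame({
    "color":  ["red", "red", "green", "blue", "blue", "blue"],
    "target": [1, 0, 1, 0, 0, 1],
})
test = pd.DataFrame({"color": ["red", "yellow"]})  # "yellow" never seen in train

# One-hot encoding into k-1 columns (the first label, "blue", is dropped)
one_hot = pd.get_dummies(train["color"], prefix="color", drop_first=True)

# Count encoding: each label is replaced by how often it occurs
count_map = train["color"].value_counts().to_dict()

# Mean/target encoding: each label is replaced by its target mean
target_mean = train.groupby("color")["target"].mean()
mean_map = target_mean.to_dict()

# Target guided ordered encoding: integer rank of the target mean
ordinal_map = {label: rank
               for rank, label in enumerate(target_mean.sort_values().index)}

# Apply the learned mappings; unseen labels become NaN
encoded_test = pd.DataFrame({
    "count": test["color"].map(count_map),
    "target_mean": test["color"].map(mean_map),
    "ordered": test["color"].map(ordinal_map),
})
print(encoded_test)
```

The NaN row for "yellow" shows the "cannot handle new labels" drawback noted above; in practice such labels need a fallback, for example the global target mean or a rare-label group.
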
4. Linear Model Assumptions   **the four key assumptions**
  1. Linear relationship between variable and target
    - Y = C + a0x0 + a1x1 + ... + anxn
    - assess with scatter plots
    - non-linear transformations of variables can improve the linear relationship
    - **Residuals (difference between predictions and real values) should be normally distributed and centered around zero**  
  2. Multivariate normality 
    -  The values of each variable follow a Gaussian distribution
    -  assess with a Q-Q plot or a residual plot, and test statistically with the Kolmogorov-Smirnov test
    - log-transformation may help if variables are not normally distributed 
  3. no or little co-linearity between features
    - assess with correlation matrix or the variance inflation factor (VIF)
  4. Homoscedasticity (homogeneity of variance)
    - the error terms have the same finite variance across all values of the independent variables
    - tested by 
       - residual plot 
       - Levene's test
       - Bartlett's test
       - Goldfeld-Quandt test
    - non-linear transformation and feature scaling can help improve homoscedasticity
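
The variance inflation factor mentioned under the co-linearity assumption can be computed directly from its definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on the other features. The sketch below uses only NumPy; the helper name `vif` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

# Synthetic data: x2 is almost a copy of x0, x1 is independent
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = rng.normal(size=500)
x2 = x0 + 0.1 * rng.normal(size=500)

scores = vif(np.column_stack([x0, x1, x2]))
print(scores)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic co-linearity; here `x0` and `x2` land far above that while `x1` stays close to 1.
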
5. Distributions
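  - A distribution describes the likelihood of obtaining a certain value 
  - Linear models place fairly strict requirements on the distribution of the data. Other models, **such as SVMs and neural networks, do not require the data to follow a particular distribution, although in general a Gaussian distribution makes the model perform a little better**. Some ways to change the distribution:
    1. log, $\ln(x)$
    2. Exponential/power, $x^{n}$
    3. Reciprocal, $1/x$
    4. Box-Cox, $\frac{x^{\lambda} - 1}{\lambda}, \lambda \in[-5, 5]$ (taken as $\ln(x)$ when $\lambda = 0$)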
  - Types
    - Discrete
      - Binomial
      - Poisson
    - Continuous 
      - Gaussian
      - Skewed
        - For a feature with a skewed distribution, use the median for missing data imputation, because the median then represents the distribution better than the mean
6. Outliers
  - Generally values that are unusually large or small, far outside the normal range of the population
  - The impact differs by model
    - Badly affected: linear models, regression-based models, AdaBoost
    - Little affected: tree-based algorithms; a Decision Tree, a Bagging Tree or a Random Forest can all handle outliers very effectively.
  - Outlier detection:
    1. normal distribution: mean ± 3 std, or the 5th and 95th quantiles
    2. skewed distribution: IQR = 75th quantile - 25th quantile, upper limit = 75th quantile + 1.5 IQR, lower limit = 25th quantile - 1.5 IQR; the 1.5 can be increased to 3
      - A box plot shows the same thing directly
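   - Outlier handling:
    1. trimming
      - Delete the outliers
      - Advantage: simple. Disadvantage: it may delete too much data
    2. missing data
      - Mark them as missing data and treat them with missing data imputation
    3. discretisation
      - Place them in the upper or lower bin during binning
    4. censoring
      - capping, winsorization, top/bottom coding
      - Replace with the capping value, so that nothing is above/below a given limit
        - capping options: 3 std, 75th quantile + 1.5 IQR / 25th quantile - 1.5 IQR 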
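
The IQR rule and capping described above can be sketched as follows, assuming pandas; the sample values are made up so that exactly one point is an obvious outlier.

```python
import pandas as pd

# Toy values, made up for illustration; 95 is the obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# 1.5 is the usual multiplier; 3 flags only extreme outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)

trimmed = values[~is_outlier]                     # trimming: drop the rows
capped = values.clip(lower=lower, upper=upper)    # censoring: cap at the limits

print(lower, upper)
print(capped.tolist())
```

As with imputation, the limits would normally be learned on the training set and then applied unchanged to the test set.
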
    


7. Variable Transformation
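  - Transforming a skewed distribution into a normal distribution generally gives better model performance
  - Methods:
    1. logarithmic    
      - `np.log(1+x)`
    2. exponential/power transformation    
      - `x^{1/2}, x^{1/1.5}, x^{2}, x^{n}` are all worth trying
    3. reciprocal     
      -  `1/x`
    4. rank transformation
    5. box-cox
      - `stats.boxcox(x)` (from `scipy.stats`)   lambda here is a hyperparameter and an optimal lambda exists; when no `lmbda` is passed, SciPy estimates it
    6. yeo-johnson  (considerably more complex)
      - `stats.yeojohnson(x)` (from `scipy.stats`)   lambda is again a hyperparameter with an optimal value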
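
The transformations above can be compared on a synthetic right-skewed sample by looking at the skewness before and after. This is a sketch assuming NumPy and SciPy; the lognormal sample and the variable names are made up for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed, strictly positive data (Box-Cox requires x > 0)
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

log_x = np.log1p(x)                # same as np.log(1 + x)
sqrt_x = np.sqrt(x)                # power transformation with exponent 1/2
reciprocal_x = 1.0 / x
rank_x = stats.rankdata(x)         # rank transformation

# With no lambda supplied, SciPy returns the transformed data
# together with the lambda that maximises the log-likelihood
boxcox_x, boxcox_lambda = stats.boxcox(x)
yeo_x, yeo_lambda = stats.yeojohnson(x)

for name, v in [("raw", x), ("log1p", log_x), ("sqrt", sqrt_x),
                ("box-cox", boxcox_x), ("yeo-johnson", yeo_x)]:
    print(f"{name:12s} skew = {stats.skew(v): .3f}")
print("estimated box-cox lambda:", boxcox_lambda)
```

Skewness close to 0 suggests a more symmetric distribution; a Q-Q plot is a useful complementary check. To transform a test set consistently, the lambda estimated on the training set can be passed back through the `lmbda` argument.
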

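8. Discretisation
  - **The inputs are always continuous numeric variables**: discretisation turns continuous numeric variables into discrete numeric variables, which is what we usually call **binning**. Its advantages:
      1. It can change the feature's distribution, turning a skewed one into an unskewed one
      2. It can handle outliers
  - The methods are split into supervised and unsupervised: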
    - Unsupervised
      1. equal-width
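        - Split evenly, so that every bin has the same width
        - Advantages: it largely reproduces the original distribution, can handle outliers, creates discrete values simply, and makes the subsequent categorical encoding easy
      2. equal-freq
        - Create n bins, each holding roughly the same number of observations
        - Advantages: it improves the value spread (the distribution goes from peaked to flat), can also handle outliers, creates discrete variables, and also makes the subsequent categorical encoding easy
      3. K means
         - Find n centroids in the values; the values closest to each centroid form a bin
         - Advantages: it largely reproduces the original distribution, can handle outliers (although outliers still have some influence, especially on the last bucket), creates discrete values simply, and makes the subsequent categorical encoding easy. Disadvantage: it does not improve the distribution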
    - Supervised
      1. decision tree
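        - Fit a very simple decision tree (max_depth = 3) on a single continuous variable column against the target, then bin using the predicted probability of each group
        - Advantages: it can handle outliers (a property of tree-based models), creates discrete values and, more importantly, establishes a monotonic relationship. 
    - Domain knowledge / Arbitrary 
      - Bin according to common sense or regulations
        - For example, age bands of 20-40 and 40-60, or income bands of 0-40k and 45k-60k, defined by hand.
   - After discretisation, the continuous variables have effectively become ordinal categorical features, e.g. first bin, second bin... or intervals such as [0, 4], [5, 10], and they need to be converted again using the categorical encoding techniques above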
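
Equal-width, equal-frequency and hand-defined binning can be sketched with `pandas.cut` and `pandas.qcut`. The data below are synthetic and the variable names are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed incomes, made up for illustration
rng = np.random.default_rng(0)
income = pd.Series(rng.exponential(scale=30_000, size=1000))

# Equal-width: 5 bins of identical width, returned as bin indices 0..4
equal_width = pd.cut(income, bins=5, labels=False)

# Equal-frequency: 5 bins holding (roughly) the same number of rows
equal_freq = pd.qcut(income, q=5, labels=False)

# Arbitrary / domain-knowledge bins with hand-picked edges: (20, 40], (40, 60]
age = pd.Series([22, 35, 41, 58])
age_band = pd.cut(age, bins=[20, 40, 60], labels=["20-40", "40-60"])

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
print(age_band.tolist())
```

On skewed data most observations fall into the first equal-width bin, while the equal-frequency bins are balanced, which is the "improves value spread" point made above.
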

9. Feature scaling 
  - Usually the last step of the feature processing pipeline, right before the data go into the model
  - features with bigger magnitude dominate the regression model
  - The effect of feature scaling differs by model:
    - Strongly affected: linear & logistic regression, neural networks, SVM, Euclidean-distance-based models (KNN & K-means), Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), gradient descent
    - Little affected: tree-based models, random forests, gradient boosted trees
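  - Feature scaling methods
    1. standardization
      - z-score = (x-mean)/std
      - Rescales the feature so that the variance is 1 and the center is at 0, but leaves the shape of the distribution essentially unchanged, so outliers remain. To make a variable normally distributed, do not use this method; use the variable transformation methods above instead
    2. mean normalization
      - x_scaled = (x-mean)/(max - min)
      - Rescales the feature into the range [-1,1], centered at 0. Outliers remain and can squeeze the other values into a narrow part of the range. It does not make a variable normally distributed; use variable transformation for that
    3. scaling to max and min (MinMaxScaling)
      - x_scaled = (x-min)/(max - min)
      - Rescales the feature into the range [0,1]; the variance and the mean vary. Outliers remain and can squeeze the other values into a narrow part of the range. It does not make a variable normally distributed; use variable transformation for that
    4. scaling to absolute max (MaxAbsScaling)
      - x_scaled = x/max(|x|)
      - Rescales into [-1,1] without centering; the effect is much the same as MinMax
    5. scaling to median and quantiles (robust scaling)
      - x_scaled = [x-median(x)]/[75th quant(x) - 25th quant(x)]
      - The median is centered at 0. It is called robust scaling because it uses the median and quantiles, both of which resist outliers well, so it can handle outliers
    6. scaling to unit norm
      - Compute the L2 norm across the variables of each observation, then divide each variable by that L2 norm
      - A rather unusual method; it is not obvious what its particular advantage is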
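
Each scaling formula above can be written out in a line of NumPy, which makes the differences easy to inspect. This is only a sketch on a made-up vector with one large value; in practice the statistics would be learned on the training set (for instance with the corresponding scikit-learn scalers) and reused on the test set.

```python
import numpy as np

# Toy feature with one large value, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

standardized = (x - x.mean()) / x.std()                  # mean 0, variance 1
mean_normalized = (x - x.mean()) / (x.max() - x.min())   # centered, within [-1, 1]
min_max = (x - x.min()) / (x.max() - x.min())            # range [0, 1]
max_abs = x / np.abs(x).max()                            # range [-1, 1]

q25, q50, q75 = np.percentile(x, [25, 50, 75])
robust = (x - q50) / (q75 - q25)                         # median 0, IQR 1

# Unit norm works per observation (row), across several features
rows = np.array([[3.0, 4.0],
                 [6.0, 8.0]])
unit_norm = rows / np.linalg.norm(rows, axis=1, keepdims=True)

print(min_max)
print(robust)
print(unit_norm)
```

With min-max scaling the four small values end up crowded near 0 because of the single large one, whereas robust scaling keeps them spread around 0 and leaves the large value as a clearly visible outlier.
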



10. Feature selection
  - **Note:** numerical features can often be converted to categorical features through binning, after which the methods below apply
  - Correlation method (compares features with each other only)
    - Pearson's correlation coefficient (linear correlation)
      - above 0.7 already counts as highly correlated
    - Spearman's rank correlation coefficient
    - Kendall rank correlation coefficient
  - Statistical method (compare each feature with the target, rank, then select those with a higher rank or lower p-value)
    - mutual information / information gain 
      - mutual information between each feature and the target; the larger the better
    - Chi-square / Fisher score 
      - Mostly used when both the features and the target are categorical, i.e. classification; for example, to find out whether a feature such as male/female affects whether someone survived
      - The null hypothesis is that the feature has no predictive value for the target. The smaller the p-value, the stronger the evidence against the null hypothesis, so **comparatively** the more important the feature.
    - univariate tests / ANOVA
      - The null hypothesis of ANOVA is that two or more samples have the same mean. Its assumptions are: samples are independent, normally distributed, and have homogeneity of variance

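  - Performance metric method (recommended: more modern and better suited to machine learning)
    - univariate roc-auc / rmse
      - **Fit a model (for example a tree classifier) on each feature on its own and rank the features by the resulting metric; features with a roc-auc above 0.5 are generally worth considering. This method works well when you have a large number of features**
        - classification: roc-auc, accuracy, precision, recall, etc
        - regression: MSE, RMSE, R2, etc
    - select features by target mean encoding
        - This method is aimed specifically at numerical features: derive the target encoding from the training set, then compute the roc-auc and rank by it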
  - Coefficient method (not recommended: the assumptions are demanding and regularization can distort the coefficients)
    - Linear model / Logistic model 
      - When the following assumptions hold, the coefficients can represent feature importance:
        1. linear relationship 
        2. no multicollinearity
        3. independent
        4. normally distributed
        5. **all features in the same scale**, needs feature scaling
      - If regularization is used, the magnitude of the coefficients is affected and this approach no longer applies
  - Lasso regularization
    - Regularization reduces the freedom of the model, so the model is less likely to fit the noise of the training data and generalizes better. In other words, constraining the parameters makes the model learn less noise. More broadly, this is a bias & variance trade-off. There are three main kinds of regularization:
        1. L1 (lasso)
          - this method will shrink some (less important) parameters to **ZERO**
        2. L2 (ridge)
        3. L1/ L2 (elastic net)
  - Tree-based importance
    - Tree models have the following advantages:
      1. most popular ML algo
      2. highly accurate
      3. good generalization (low overfitting)
      4. robust to outliers
      5. interpretability 
    - **Random forest importance is computed from each feature's average impurity decrease across the N trees**
    - Possible pitfalls:
      - correlated features show equal or similar importance: a high importance may only reflect correlation with another feature, so features with similar importance need a double check
      - the importance of correlated features is lower than their real importance
      - **high-cardinality variables** show greater importance (trees are biased towards this type of variable)
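    - Hybrid feature selection methods: some less orthodox approaches, for when you demand a lot from the model
      - feature shuffling
        - Shuffle each feature in turn and see which shuffle makes model performance drop the most; the bigger the drop, the more important the feature
      - recursive feature elimination
        - First obtain importances from a tree model, then remove features one by one from least to most important and watch how performance drops after each removal. If it drops a lot, keep the feature; if it makes little difference, remove it
      - recursive feature addition
        - Similar to the method above: first obtain importances from a tree model, then add features one by one from most to least important and watch how performance improves after each addition. If it improves a lot, keep the feature; if it makes little difference, remove it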
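
Feature shuffling, as described above, can be sketched with scikit-learn. The dataset is synthetic and built so that feature 0 matters most, feature 1 matters less and feature 2 is pure noise; the model choice, the split and all names are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends strongly on feature 0,
# weakly on feature 1, and not at all on feature 2
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
baseline = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Shuffle one column at a time on the held-out set and record the roc-auc drop
drops = []
for j in range(X.shape[1]):
    X_shuffled = X_test.copy()
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
    auc = roc_auc_score(y_test, model.predict_proba(X_shuffled)[:, 1])
    drops.append(baseline - auc)

print("baseline roc-auc:", round(baseline, 3))
print("drop per feature:", np.round(drops, 3))
```

The model is trained once and only the evaluation data are shuffled, so the procedure is cheap compared with recursive elimination or addition, which refit the model at every step. scikit-learn also ships `sklearn.inspection.permutation_importance`, which repeats the shuffle several times and averages the result.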
