# Feature Engineering

I'll measure baseline model performance first, so that we can later judge whether each feature engineering method actually helps.


## Baseline Model

I'll compare some mainstream models:

- DummyClassifier (predicts either the most frequent class or at random, stratified or uniform)
- Logistic Regression (since the target variable is boolean)
- Decision Tree Classifier
- Naive Bayes
- Random Forest Classifier
- XGBoost


In [1]:
import pandas as pd

data = pd.read_csv("../data/raw/online_shoppers_intention.csv")
data

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.000000,0.200000,0.200000,0.000000,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.000000,0.000000,0.100000,0.000000,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.000000,0.200000,0.200000,0.000000,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.050000,0.140000,0.000000,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.500000,0.020000,0.050000,0.000000,0.0,Feb,3,3,1,4,Returning_Visitor,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12325,3,145.0,0,0.0,53,1783.791667,0.007143,0.029031,12.241717,0.0,Dec,4,6,1,1,Returning_Visitor,True,False
12326,0,0.0,0,0.0,5,465.750000,0.000000,0.021333,0.000000,0.0,Nov,3,2,1,8,Returning_Visitor,True,False
12327,0,0.0,0,0.0,6,184.250000,0.083333,0.086667,0.000000,0.0,Nov,3,2,1,13,Returning_Visitor,True,False
12328,4,75.0,0,0.0,15,346.000000,0.000000,0.021053,0.000000,0.0,Nov,2,2,3,11,Returning_Visitor,False,False


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Splitting the data into features and target
X = data.drop(columns=['Revenue'])
y = data['Revenue']

# One-hot encoding for categorical variables
X = pd.get_dummies(X, drop_first=True)

# Splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Models to evaluate
models = {
    "Dummy (most frequent)": DummyClassifier(strategy="most_frequent"),
    "Dummy (stratified)": DummyClassifier(strategy="stratified"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(eval_metric='logloss')
}

# Evaluating each model
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"Model: {name}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(classification_report(y_test, y_pred))
    print("-" * 50)

Model: Dummy (most frequent)
Accuracy: 0.8333
              precision    recall  f1-score   support

       False       0.83      1.00      0.91      2055
        True       0.00      0.00      0.00       411

    accuracy                           0.83      2466
   macro avg       0.42      0.50      0.45      2466
weighted avg       0.69      0.83      0.76      2466

--------------------------------------------------
Model: Dummy (stratified)
Accuracy: 0.7332
              precision    recall  f1-score   support

       False       0.84      0.85      0.84      2055
        True       0.18      0.17      0.17       411

    accuracy                           0.73      2466
   macro avg       0.51      0.51      0.51      2466
weighted avg       0.73      0.73      0.73      2466

--------------------------------------------------


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Model: Logistic Regression
Accuracy: 0.8715
              precision    recall  f1-score   support

       False       0.88      0.98      0.93      2055
        True       0.74      0.35      0.47       411

    accuracy                           0.87      2466
   macro avg       0.81      0.66      0.70      2466
weighted avg       0.86      0.87      0.85      2466

--------------------------------------------------
Model: Decision Tree
Accuracy: 0.8573
              precision    recall  f1-score   support

       False       0.92      0.91      0.91      2055
        True       0.57      0.59      0.58       411

    accuracy                           0.86      2466
   macro avg       0.74      0.75      0.75      2466
weighted avg       0.86      0.86      0.86      2466

--------------------------------------------------
Model: Naive Bayes
Accuracy: 0.8021
              precision    recall  f1-score   support

       False       0.92      0.84      0.88      2055
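        True       0.43      0.62      0.51       411

    accuracy                           0.80      2466
   macro avg       0.68      0.73      0.69      2466
weighted avg       0.84      0.80      0.81      2466

--------------------------------------------------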

| Model                 | Accuracy | Precision (False) | Recall (False) | F1-Score (False) | Precision (True) | Recall (True) | F1-Score (True) | Macro Avg Precision | Macro Avg Recall | Macro Avg F1-Score | Weighted Avg Precision | Weighted Avg Recall | Weighted Avg F1-Score |
| --------------------- | -------- | ----------------- | -------------- | ---------------- | ---------------- | ------------- | --------------- | ------------------- | ---------------- | ------------------ | ---------------------- | ------------------- | --------------------- |
| Dummy (most frequent) | 0.8333   | 0.83              | 1.00           | 0.91             | 0.00             | 0.00          | 0.00            | 0.42                | 0.50             | 0.45               | 0.69                   | 0.83                | 0.76                  |
| Dummy (stratified)    | 0.7332   | 0.84              | 0.85           | 0.84             | 0.18             | 0.17          | 0.17            | 0.51                | 0.51             | 0.51               | 0.73                   | 0.73                | 0.73                  |
| Logistic Regression   | 0.8715   | 0.88              | 0.98           | 0.93             | 0.74             | 0.35          | 0.47            | 0.81                | 0.66             | 0.70               | 0.86                   | 0.87                | 0.85                  |
| Decision Tree         | 0.8573   | 0.92              | 0.91           | 0.91             | 0.57             | 0.59          | 0.58            | 0.74                | 0.75             | 0.75               | 0.86                   | 0.86                | 0.86                  |
| Naive Bayes           | 0.8021   | 0.92              | 0.84           | 0.88             | 0.43             | 0.62          | 0.51            | 0.68                | 0.73             | 0.69               | 0.84                   | 0.80                | 0.81                  |
| Random Forest         | 0.8954   | 0.92              | 0.96           | 0.94             | 0.75             | 0.55          | 0.64            | 0.83                | 0.76             | 0.79               | 0.89                   | 0.90                | 0.89                  |
| XGBoost               | 0.8917   | 0.92              | 0.96           | 0.94             | 0.72             | 0.57          | 0.64            | 0.82                | 0.76             | 0.79               | 0.89                   | 0.89                | 0.89                  |

<br/>

**Random Forest** had the best accuracy at 0.8954 (not what I expected, as I always thought XGBoost was the de facto 'best' model on the market right now).

One more thing to note is that the dataset doesn't have a lot of positive samples (15.5%), so accuracy alone is misleading: the most-frequent dummy already reaches 0.8333 without finding a single buyer. Looking at metrics like precision and recall is therefore important.
Precision is the share of predicted positives that are truly positive, so raising it means fewer false positives (FPs). Recall is the share of actual positives that the model finds, so raising it means fewer false negatives (FNs). In our case, since we are predicting whether a visiting customer converts to an actual transaction, it is more important to reduce FPs.

Hence precision is slightly more important in this case, and Random Forest also has the highest precision on the positive class (0.75) among all the models tested.
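
To make the precision/recall definitions concrete, here is a minimal sketch on made-up labels (the two lists are illustrative only, not taken from the dataset). It derives both metrics from the confusion matrix counts.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = session ended in a purchase, 0 = it did not
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# ravel() on a 2x2 confusion matrix yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # of the predicted buyers, how many really bought
recall = tp / (tp + fn)     # of the real buyers, how many the model found

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Cutting FPs raises precision, while cutting FNs raises recall, which is why the choice between them depends on which mistake costs more.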


### Possible Ways of Performing Feature Engineering

(Generated by GPT; I still need to vet each one and see whether it is applicable)

1. **Handling Missing Values**:

   - Impute missing numerical values with mean, median, or a constant.
   - Impute missing categorical values with the most frequent value or a constant.

> Enoch's comment: No missing values

2. **Scaling and Normalization**:

   - Standardize numerical features using StandardScaler.
   - Normalize numerical features using MinMaxScaler or RobustScaler.

> Enoch's comment: Makes sense, helps process data for ML training later on
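
A minimal sketch of scaling inside a scikit-learn `Pipeline`, so the scaler is fitted on the training split only. The data here is synthetic (two made-up columns on very different scales), and the names `X_demo`, `y_demo` and `scaled_logreg` are placeholders. Scaling is also what the convergence warning printed for Logistic Regression suggests trying.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features: two columns on very different scales
rng = np.random.default_rng(42)
n = 500
duration = rng.exponential(scale=1000.0, size=n)  # e.g. seconds spent on pages
rate = rng.uniform(0.0, 0.2, size=n)              # e.g. a bounce-rate-like ratio
X_demo = np.column_stack([duration, rate])
y_demo = (duration / 1000.0 - 10.0 * rate + rng.normal(0, 0.5, n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# The pipeline fits the scaler on X_tr only, then reuses those statistics on X_te
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_tr, y_tr)
print(f"Test accuracy: {scaled_logreg.score(X_te, y_te):.3f}")
```

Tree-based models (Decision Tree, Random Forest, XGBoost) split on thresholds, so scaling should matter much less for them than for Logistic Regression.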

3. **Encoding Categorical Variables**:

   - One-hot encoding for nominal categorical variables.
   - Ordinal encoding for ordinal categorical variables.

> Enoch's comment: Ordinal: 'Month' ; Nominal: 'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType'
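
A rough sketch of both encodings on a toy frame that reuses the dataset's column names (the values are made up). One thing worth checking: `pd.get_dummies` without a `columns=` argument only converts string/category columns, so integer-coded columns such as `Browser` would stay as plain numbers. The month spellings in `month_order` are an assumption, so compare them with `data["Month"].unique()` before using this on the real data.

```python
import pandas as pd

# Toy frame reusing the dataset's column names; the values are made up
df = pd.DataFrame({
    "Month": ["Feb", "Nov", "May", "Dec"],
    "Browser": [1, 2, 2, 4],
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Returning_Visitor", "Other"],
})

# Ordinal: map each month label to its calendar position.
# Assumption: these spellings match the data (check data["Month"].unique()).
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "June",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_to_num = {name: i + 1 for i, name in enumerate(month_order)}
df["Month_num"] = df["Month"].map(month_to_num)

# Nominal: list the columns explicitly so integer-coded categories are encoded too
encoded = pd.get_dummies(df, columns=["Browser", "VisitorType"], dtype=int)
encoded = encoded.drop(columns=["Month"])
print(encoded.columns.tolist())
```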

4. **Interaction Features**:

   - Create interaction terms between numerical features (e.g., `Administrative * Administrative_Duration`).
   - Create interaction terms between categorical features (e.g., `Month * VisitorType`).

> Enoch's comment: How do I figure out which features are 'good' interaction terms?

5. **Polynomial Features**:

   - Generate polynomial features for numerical columns (e.g., square, cube).

> Enoch's comment: How do I figure out which features should use polynomial features?

6. **Feature Binning**:

   - Bin numerical features into discrete intervals (e.g., binning `Administrative_Duration` into low, medium, high).

> Enoch's comment: Columns that are applicable: 'Administrative', 'Administrative_Duration', 'Informational', 'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration'. Maybe 'PageValues' too? IDK
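
A small sketch of binning with `pd.cut`. The cut points (30 and 300 seconds) are hypothetical and would need to come from looking at the real distribution. Quantile binning with `pd.qcut` is an alternative, but it raises an error on duplicate bin edges unless `duplicates="drop"` is passed, which can happen when many sessions have a duration of 0.

```python
import pandas as pd

# Made-up durations in seconds
durations = pd.Series([0.0, 12.5, 75.0, 145.0, 600.0, 1800.0],
                      name="Administrative_Duration")

# Hypothetical cut points; each bin includes its right edge: (-inf, 30], (30, 300], (300, inf)
bins = [-float("inf"), 30, 300, float("inf")]
labels = ["low", "medium", "high"]
duration_bin = pd.cut(durations, bins=bins, labels=labels)
print(duration_bin.tolist())
```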

7. **Log Transformation**:

   - Apply log transformation to skewed numerical features (e.g., `np.log1p(ProductRelated_Duration)`).

> Enoch's comment: How to find skewed numerical features?
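
One way to find skewed columns is pandas' `DataFrame.skew()`. The sketch below runs it on synthetic data (one long-tailed column, one roughly symmetric one) and log-transforms whatever exceeds a threshold. The `|skew| > 1` cut-off is a common rule of thumb, not a hard rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    # Long right tail, like a duration column might have
    "ProductRelated_Duration": rng.lognormal(mean=6.0, sigma=1.5, size=1000),
    # Roughly symmetric
    "ExitRates": rng.uniform(0.0, 0.2, size=1000),
})

skewness = toy.skew()
# Rule of thumb (an assumption): |skew| > 1 counts as highly skewed
skewed_cols = skewness[skewness.abs() > 1].index.tolist()

toy_log = toy.copy()
toy_log[skewed_cols] = np.log1p(toy_log[skewed_cols])  # log1p handles zeros safely

print(skewness.round(2))
print(toy_log[skewed_cols].skew().round(2))
```

On the real data, `data.select_dtypes("number").skew()` would give the same kind of ranking, and a histogram per column is a quick visual check.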

8. **Feature Aggregation**:

   - Aggregate related features (e.g., sum or mean of `Administrative`, `Informational`, and `ProductRelated`).

> Enoch's comment: This makes sense - sum or mean is probably the most useful. Not sure if min/max will have a similar effect. Maybe a short maximum time spent also correlates with whether the user decides to proceed with a transaction
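
A sketch of a few aggregates over the three page-type columns, on two made-up sessions. The new column names (`TotalPages`, `TotalDuration`, `MaxDuration`, `AvgDurationPerPage`) are placeholders I made up for illustration.

```python
import pandas as pd

# Two made-up sessions using the dataset's column names
toy = pd.DataFrame({
    "Administrative": [0, 3], "Administrative_Duration": [0.0, 145.0],
    "Informational": [0, 0], "Informational_Duration": [0.0, 0.0],
    "ProductRelated": [1, 53], "ProductRelated_Duration": [0.0, 1783.79],
})
count_cols = ["Administrative", "Informational", "ProductRelated"]
duration_cols = [c + "_Duration" for c in count_cols]

toy["TotalPages"] = toy[count_cols].sum(axis=1)
toy["TotalDuration"] = toy[duration_cols].sum(axis=1)
toy["MaxDuration"] = toy[duration_cols].max(axis=1)

# Average time per page; sessions with zero pages get 0 instead of inf/NaN
pages = toy["TotalPages"].where(toy["TotalPages"] > 0)
toy["AvgDurationPerPage"] = (toy["TotalDuration"] / pages).fillna(0.0)

print(toy[["TotalPages", "TotalDuration", "MaxDuration", "AvgDurationPerPage"]])
```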

9. **Feature Selection**:

   - Use statistical tests (e.g., ANOVA, chi-square) to select important features.
   - Use model-based feature selection (e.g., feature importance from tree-based models).

> Enoch's comment: Can try this
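
A minimal sketch of the statistical-test route using `SelectKBest` with the ANOVA F-test (`f_classif`). The data is synthetic from `make_classification`, and `k=3` is an arbitrary choice. On the real data the selector should be fitted on the training split only, otherwise the test set leaks into the selection.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 8 features, of which 3 carry signal
X_arr, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Score every feature against the target and keep the top k
selector = SelectKBest(score_func=f_classif, k=3).fit(X_demo, y_demo)
selected = X_demo.columns[selector.get_support()].tolist()
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)

print(scores.round(1))
print("Selected:", selected)
```

The chi-square test (`chi2`) works the same way but requires non-negative features, so it fits counts and one-hot columns better than raw continuous ones.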

10. **Dimensionality Reduction**:

    - Apply PCA (Principal Component Analysis) to reduce dimensionality of numerical features.
    - Use t-SNE or UMAP for visualization or dimensionality reduction.

> Enoch's comment: Not very familiar with this - have heard about it, and know what it does, but not sure how it will affect or make the data 'better'
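
To get a feel for what PCA does, here is a sketch on synthetic data where six columns are noisy mixes of two hidden factors. PCA should compress them into fewer components while keeping most of the variance. Whether that helps the models here is an open question; it tends to matter more for many correlated numeric columns than for a dataset of this width.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
hidden = rng.normal(size=(300, 2))
# Six observed columns that are noisy linear mixes of the two hidden factors
X_demo = hidden @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(300, 6))

# PCA is scale-sensitive, so standardize first
X_scaled = StandardScaler().fit_transform(X_demo)

# A float n_components keeps as many components as needed to explain that share of variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
print(pca.explained_variance_ratio_.round(3))
```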

11. **Temporal Features**:

    - Extract temporal features from `Month` (e.g., seasonality, quarter).
    - Create binary features for specific time periods (e.g., `is_holiday`).

> Enoch's comment: Hmm... need to explore more into this

12. **Target Encoding**:

    - Encode categorical variables based on their relationship with the target variable (e.g., mean encoding).

> Enoch's comment: Huh?
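
In plain terms, target (mean) encoding replaces each category with the average target value seen for that category, for example "the purchase rate of sessions with this `TrafficType`". A toy sketch with made-up rows follows. The `_te` suffix is just a naming choice.

```python
import pandas as pd

# Made-up rows; Revenue is 1 for a purchase, 0 otherwise
train = pd.DataFrame({
    "TrafficType": [1, 1, 1, 2, 2, 3],
    "Revenue":     [1, 0, 0, 1, 1, 0],
})
test = pd.DataFrame({"TrafficType": [1, 2, 4]})  # 4 never appears in train

# Purchase rate per category, learned from training rows only
category_means = train.groupby("TrafficType")["Revenue"].mean()
global_mean = train["Revenue"].mean()

train["TrafficType_te"] = train["TrafficType"].map(category_means)
# Unseen categories fall back to the overall purchase rate
test["TrafficType_te"] = test["TrafficType"].map(category_means).fillna(global_mean)

print(test)
```

Because the encoding is built from the target, it leaks easily: encoding training rows with their own labels (as done above for brevity) can inflate scores. In practice the means are usually computed out-of-fold and smoothed toward the global mean for rare categories.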

13. **Text Features**:

    - If any text data exists, extract features using TF-IDF or word embeddings.

> Enoch's comment: Don't have

14. **Outlier Detection and Removal**:

    - Detect and remove outliers using statistical methods (e.g., z-score, IQR).

> Enoch's comment: Probably applicable to the numerical ones
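
A sketch of the IQR rule on a made-up series. It flags values outside 1.5 × IQR from the quartiles, and shows clipping as a gentler alternative to dropping rows. Whether an extreme duration is an error or a genuinely long session is a judgment call the rule cannot make.

```python
import pandas as pd

# Made-up durations with one extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 500], name="ProductRelated_Duration")

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)  # flag for removal or inspection
clipped = values.clip(lower=lower, upper=upper)   # or cap instead of dropping

print(f"bounds: [{lower}, {upper}]")
print(values[is_outlier].tolist())
```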

15. **Feature Clustering**:

    - Cluster similar features and create cluster labels as new features.

> Enoch's comment: ?

16. **Feature Importance Analysis**:

    - Analyze feature importance using models like Random Forest or XGBoost and drop less important features.

> Enoch's comment: ??
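
Concretely, a fitted tree ensemble exposes a `feature_importances_` array, one score per input column, summing to 1. The sketch below trains a Random Forest on synthetic data and ranks the columns. Any threshold for "drop this feature" would be an assumption to validate with cross-validation.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 6 features, of which 2 carry signal
X_arr, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances, sorted from most to least useful
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances.round(3))
```

Impurity-based importances can favor columns with many distinct values, so `sklearn.inspection.permutation_importance` on held-out data is a useful cross-check.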

17. **Custom Feature Engineering**:

    - Create domain-specific features (e.g., `BounceRates / ExitRates` to measure engagement).

> Enoch's comment: ???

18. **Binary Features**:

    - Convert categorical features into binary flags (e.g., `Weekend` as 0 or 1).

> Enoch's comment: Easy - just convert boolean to int

19. **Interaction with Target**:

    - Create features based on interaction with the target variable (e.g., conditional probabilities).

> Enoch's comment: Huh...?

20. **Cross-Validation Based Features**:

    - Generate features using cross-validation predictions (e.g., stacking or blending).

> Enoch's comment:

21. **Feature Hashing**:

    - Use feature hashing for high-cardinality categorical variables.

> Enoch's comment:

22. **Lag Features**:

    - Create lag features for time-series data (e.g., previous month's `PageValues`).

> Enoch's comment:

23. **Cyclic Features**:

    - Encode cyclic features like `Month` using sine and cosine transformations.

> Enoch's comment: I heard about this, can look more into this
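
The idea is to place the months on a circle so that December ends up next to January, which a plain 1..12 encoding cannot express. A small sketch, assuming `Month` has already been mapped to the numbers 1 to 12 (the `Month_num` column and the helper `circle_distance` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Assumes Month has already been mapped to 1..12 (Jan=1, Dec=12)
months = pd.DataFrame({"Month_num": [1, 2, 6, 11, 12]})

# Spread the 12 months evenly around a circle
angle = 2 * np.pi * (months["Month_num"] - 1) / 12
months["Month_sin"] = np.sin(angle)
months["Month_cos"] = np.cos(angle)

def circle_distance(df, i, j):
    """Euclidean distance between rows i and j in (sin, cos) space."""
    a = df.loc[i, ["Month_sin", "Month_cos"]].to_numpy(dtype=float)
    b = df.loc[j, ["Month_sin", "Month_cos"]].to_numpy(dtype=float)
    return float(np.linalg.norm(a - b))

print(circle_distance(months, 0, 1))  # Jan vs Feb
print(circle_distance(months, 4, 0))  # Dec vs Jan: same gap as Jan vs Feb
```

Both columns are needed together, since sine or cosine alone maps two different months to the same value.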

24. **Feature Pruning**:

    - Remove highly correlated features to avoid multicollinearity.

> Enoch's comment:
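
A sketch of correlation-based pruning on synthetic data, where one column is deliberately built as a near-copy of another. The 0.9 threshold is a hypothetical cut-off, and the rule used here (drop the later column of each highly correlated pair) is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
bounce = rng.uniform(0, 0.2, 200)
toy = pd.DataFrame({
    "BounceRates": bounce,
    "ExitRates": bounce + rng.normal(0, 0.005, 200),  # built as a near-copy on purpose
    "PageValues": rng.exponential(5.0, 200),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # hypothetical cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
pruned = toy.drop(columns=to_drop)

print("Dropping:", to_drop)
```

Multicollinearity mostly hurts the interpretability and stability of linear models such as Logistic Regression; tree-based models usually tolerate it.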

25. **Synthetic Features**:

    - Generate synthetic features using GANs or other generative models.

> Enoch's comment: Something to consider in the future, considering the number of positive samples is quite low (15.5%)
