# 3. Feature Engineering

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np

In [2]:
BASE_DIR = Path.cwd().parent

# Random seed for reproducibility
SEED = 42
np.random.seed(SEED)

Now that we have our labeled subset of the dataset - annotated in [**2. Exploratory Data Analysis & Data Labeling**](https://github.com/alikhalajii/text-classification-life-sciences/blob/master/notebooks/02-exploratory-data-analysis-data-labeling.ipynb) - we’re ready to transform the text data into numerical features using **TF-IDF (Term Frequency–Inverse Document Frequency)**.

### Why TF-IDF?

TF-IDF is a widely used technique for converting raw text into meaningful numerical representations. It helps us:

- **Capture word importance**: Words that appear frequently in a document but rarely across others get higher scores, making them more informative.
- **Reduce noise**: Common words (e.g. “app”, “system”) that appear in many documents are down-weighted.
- **Enable model training**: Most machine learning models require numerical input. TF-IDF provides a sparse, interpretable feature matrix suitable for classifiers like logistic regression.

This transformation allows us to move from raw text to structured input, setting the stage for effective model development.


For each column in the fully cleaned dataset (df_clean), we count the number of unique values (“Unique Counts”) and, based on clearly defined thresholds, select an appropriate encoding method:


In [3]:
df_clean = pd.read_csv(
    BASE_DIR / "data/02_df_cleaned_all_text.csv",
    encoding='utf-8',
    keep_default_na=False,  # <— don’t treat empty strings as NaN
    na_values=[]            # <— no additional “missing” tokens
)

df_clean.nunique().sort_values(ascending=True)

AppType                              13
LeadingIndustry                      15
PFDBIndustry                        100
GTMLoBName                          131
Information                         241
SolutionsCapability                 353
GTMInnovationName                  1087
BusinessRoleDescription            1121
ApplicationComponentText           1127
Subtitle                           1498
GTMAppName                         2225
RoleCombinedToolTipDescription     2251
AppKeyFeatures                     2959
GTMAppDescription                  4024
Title                             11637
CombinedTitle                     12636
AppName                           13903
all_text                          14098
fioriId                           14112
dtype: int64

## 3.1 Automated Feature Encoding Strategy Based on Full Dataset

* To ensure our feature encoding strategy generalizes beyond the small labeled sample (300 rows), we analyze the full metadata dataset (`df_clean`) containing over 14,000 entries.

Each column is inspected based on:
- its number of unique values (cardinality),
- and its data type (e.g., numeric or categorical)  (!included here for completeness, although in our current dataset all features are of type `object`.)

From this analysis, we assign an appropriate encoding strategy as follows:

- **One-Hot Encoding** for features with **≤ 15 unique values**, preserving interpretability for low-cardinality categorical variables.
- **Target or Frequency Encoding** for features with **16 to 350 unique values**, balancing dimensionality and signal strength.
- **Grouping/Embedding** for features with **> 350 unique values**, typically better handled with learned representations in neural architectures.

We exclude the following columns from this encoding logic:
- Technical IDs (`fioriId`),
- Target column (`Label`),
- Pre-labels used during verification (`pre_label`),
- and long free-text input (`all_text`).

The final output is a decision table (`decision_df`) summarizing the data type, cardinality, and recommended encoding for each relevant feature.



In [4]:
# Compute metadata from full dataset
unique_counts = df_clean.nunique(dropna=False)
dtypes = df_clean.dtypes.astype(str)

# Decide encoding strategy per feature
def decide_encoding(col: str) -> str:
    if pd.api.types.is_numeric_dtype(df_clean[col]):
        return "keep (numeric)"
    u = unique_counts[col]
    if u <= 15:
        return "one-hot"
    elif u <= 350:
        return "target/frequency"
    else:
        return "group/embed"

# Apply decision logic, exclude non-feature columns
excluded = ["Label", "pre_label", "all_text"]
decisions = {
    col: decide_encoding(col)
    for col in df_clean.columns if col not in excluded
}

# Assemble encoding decision table
decision_df = pd.DataFrame({
    "dtype": dtypes,
    "unique_count": unique_counts,
    "decision": pd.Series(decisions)
}).sort_values("decision")

from IPython.display import display
display(decision_df)


Unnamed: 0,dtype,unique_count,decision
AppKeyFeatures,object,2959,group/embed
Subtitle,object,1498,group/embed
SolutionsCapability,object,353,group/embed
RoleCombinedToolTipDescription,object,2251,group/embed
Title,object,11637,group/embed
GTMAppName,object,2225,group/embed
GTMInnovationName,object,1087,group/embed
CombinedTitle,object,12636,group/embed
BusinessRoleDescription,object,1121,group/embed
ApplicationComponentText,object,1127,group/embed


> **Insight:** The heuristic flags potential positives automatically, but all `pre_label` values must be manually verified and corrected.  


## 3.2 Apply One-Hot Encoding for Low-Cardinality Features

* We transform all categorical features that were marked as **one-hot** in the `decisions` dictionary. This approach creates binary indicator columns for each category, which preserves interpretability and avoids unnecessary dimensionality inflation—especially important for features with low cardinality. We begin by loading the annotated dataset from CSV, then identify the relevant columns using a list comprehension over `decisions`. Using `pandas.get_dummies()`, we generate the one-hot encoded columns without dropping the first category, ensuring full representation. Finally, we drop the original categorical columns and concatenate the new binary indicators back into the main DataFrame, resulting in a numerically enriched dataset ready for further transformations.

In [5]:
df_annotated_full = pd.read_csv(
    BASE_DIR / "data/02_df_annotated_full.csv",
    encoding='utf-8',
    keep_default_na=False,
    na_values=[]
)

In [6]:
# Identify one-hot features
one_hot_cols = [col for col, strategy in decisions.items() if strategy == "one-hot"]

# Apply pandas get_dummies
df_ohe = pd.get_dummies(
    df_annotated_full[one_hot_cols],
    prefix=one_hot_cols,
    drop_first=False
)

# Drop original cols and concat new indicators
df_transformed = pd.concat(
    [df_annotated_full.drop(columns=one_hot_cols), df_ohe],
    axis=1
)


## 3.3. Compute Target/Frequency Encoding for Medium-Cardinality Features

* Next, we transform features labeled as **target/frequency** in `decision_df`.  
- **Frequency encoding** replaces each category with its overall occurrence frequency.  
- **Target encoding** replaces each category with the mean of the target variable (`Label`) for that category.  

We will implement frequency encoding here for simplicity, but target encoding can be swapped in.

In [7]:
from sklearn.model_selection import KFold

# Identify medium-cardinality features
freq_cols = [col for col, strategy in decisions.items() if strategy == "target/frequency"]

# Frequency encoding
freq_maps = {
    col: df_transformed[col].value_counts(normalize=True)
    for col in freq_cols
}

for col, fmap in freq_maps.items():
    df_transformed[f"{col}_freq"] = df_transformed[col].map(fmap)

# Optionally drop original columns
df_transformed.drop(columns=freq_cols, inplace=True)


In [8]:
df_transformed[["PFDBIndustry_freq", "GTMLoBName_freq", "Information_freq"]].head()

Unnamed: 0,PFDBIndustry_freq,GTMLoBName_freq,Information_freq
0,0.002,0.002,0.91
1,0.812,0.818,0.91
2,0.812,0.818,0.91
3,0.812,0.818,0.91
4,0.812,0.818,0.91


To avoid recomputing the `df_transformed` in future notebooks, we export it:

In [9]:
# Save the transformed DataFrame
df_transformed.to_csv(BASE_DIR / "data/03_df_transformed.csv", index=False)


## 3.4. Group Rare Categories for High-Cardinality Features

* For features marked as **group/embed**, we collapse infrequent categories into an "Other" bucket.  
Here, we keep the top 10 most common categories and map all other values to `"Other"`.  
This preserves major signals while avoiding thousands of sparse dummy columns.



In [10]:
# Identify high-cardinality features
embed_cols = [col for col, strategy in decisions.items() if strategy == "group/embed"]

df_grouped = df_transformed.copy()

# For each high-cardinality feature, keep top 10 most frequent, group others
for col in embed_cols:
    top_cats = df_grouped[col].value_counts().nlargest(10).index
    df_grouped[f"{col}_grp"] = df_grouped[col].where(df_grouped[col].isin(top_cats), other="Other")

# Drop original high-card cols
df_grouped.drop(columns=embed_cols, inplace=True)

df_grouped.head()


Unnamed: 0,all_text_label,pre_label,Label,all_text,AppType_Analytical,AppType_Reuse Component,AppType_SAP GUI,AppType_Transactional,"AppType_Transactional, Analytical","AppType_Transactional, Fact sheet",...,ApplicationComponentText_grp,BusinessRoleDescription_grp,CombinedTitle_grp,GTMAppDescription_grp,GTMAppName_grp,GTMInnovationName_grp,RoleCombinedToolTipDescription_grp,SolutionsCapability_grp,Subtitle_grp,Title_grp
0,Plan Repairs Call up a worklist of all repair ...,0,0,Plan Repairs Call up a worklist of all repair ...,False,False,False,True,False,False,...,Other,Other,Other,Other,Other,Other,Other,Other,,Other
1,Delete Driver This app is a SAP GUI for HTML t...,1,1,Delete Driver This app is a SAP GUI for HTML t...,False,False,True,False,False,False,...,Other,Other,Other,This app is a SAP GUI for HTML transaction. Th...,SAP Fiori theme for SAP GUI for HTML (SAP S/4H...,SAP Fiori visual theme for classic application...,Other,,,Other
2,Display Pegged Requirements For the MRP elemen...,0,0,Display Pegged Requirements For the MRP elemen...,False,False,True,False,False,False,...,Other,,Other,Other,SAP Fiori theme for SAP GUI for HTML (SAP S/4H...,SAP Fiori visual theme for classic application...,unknown,,,Other
3,Manage Amount for Prepayment Meter This app is...,1,1,Manage Amount for Prepayment Meter This app is...,False,False,True,False,False,False,...,Other,,Other,This app is a SAP GUI for HTML transaction. Th...,SAP Fiori theme for SAP GUI for HTML (SAP S/4H...,SAP Fiori visual theme for classic application...,unknown,,,Other
4,Create Measurement Document This app is a SAP ...,1,1,Create Measurement Document This app is a SAP ...,False,False,True,False,False,False,...,Measuring points and counters,Other,Other,This app is a SAP GUI for HTML transaction. Th...,SAP Fiori theme for SAP GUI for HTML (SAP S/4H...,SAP Fiori visual theme for classic application...,Other,,,Other


* Low-cardinality features remain fully interpretable via binary indicators. Medium-cardinality features are compactly represented through numerical aggregation, and very high-cardinality features are grouped or embedded into a low-dimensional space to prevent an uncontrolled explosion of the feature dimension.


Overall, we aim to minimize memory and computational overhead while preserving the full information content of the data.

In [11]:
df_grouped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 40 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   all_text_label                                    500 non-null    object 
 1   pre_label                                         500 non-null    int64  
 2   Label                                             500 non-null    int64  
 3   all_text                                          500 non-null    object 
 4   AppType_Analytical                                500 non-null    bool   
 5   AppType_Reuse Component                           500 non-null    bool   
 6   AppType_SAP GUI                                   500 non-null    bool   
 7   AppType_Transactional                             500 non-null    bool   
 8   AppType_Transactional, Analytical                 500 non-null    bool   
 9   AppType_Transactional

## 3.5 Text Vectorization with TF–IDF and SVD

To incorporate semantic information from the free-text field `all_text`, we apply a two-step transformation:

1. **TF–IDF Vectorization**  
   We compute a TF–IDF matrix using up to 5,000 features, capturing both unigrams and bigrams while filtering out English stop words. This converts raw text into a sparse numerical representation that reflects term importance across the dataset.

2. **Dimensionality Reduction via Truncated SVD**  
   We reduce the TF–IDF matrix to 50 latent components using Truncated SVD. This step compresses the feature space while preserving key semantic structures, making the data more tractable for modeling.


In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# TF-IDF Vectorization on the concatenated text field
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english')
X_tfidf = tfidf.fit_transform(df_grouped['all_text'])

# Truncated SVD
n_components = 50
svd = TruncatedSVD(n_components=n_components, random_state=SEED)
X_svd = svd.fit_transform(X_tfidf)

# Mapping: Original Terms to SVD Components
svd_feature_mapping = pd.DataFrame(
    svd.components_,
    index=[f"svd_{i}" for i in range(n_components)],
    columns=tfidf.get_feature_names_out()
)

# Save the SVD feature mapping to a CSV file for later use in feature interpretation
svd_feature_mapping.to_csv(BASE_DIR / "data/03_svd_feature_mapping.csv")

# Top-5 Terms per SVD Component
top_terms = {
    comp: svd_feature_mapping.loc[comp]
                        .nlargest(5)
                        .index
                        .tolist()
    for comp in svd_feature_mapping.index
}

# Create a DataFrame with SVD components as features
df_text = pd.DataFrame(
    X_svd,
    index=df_grouped.index,
    columns=svd_feature_mapping.index
)
df_final = pd.concat(
    [df_grouped.drop(columns=['all_text']), df_text],
    axis=1
)

# Save the final DataFrame with SVD features
svd_feature_mapping.to_csv(BASE_DIR / "data/svd_feature_mapping.csv")


To avoid recomputing the TF-IDF matrix in future notebooks, we persist it using `joblib`:

In [13]:
import joblib
joblib.dump(X_tfidf, BASE_DIR / "checkpoints/X_tfidf_vectorized.pkl")


['/home/khalaji/Coding/text-classification-life-sciences/checkpoints/X_tfidf_vectorized.pkl']

In [14]:
# For next steps
df_final.to_csv(BASE_DIR / "data/03_df_final.csv", index=False, encoding='utf-8')


### Text Vectorization & Dimensionality Reduction Summary

- **Component Interpretation**  
  Each SVD axis is mapped back to its top-5 contributing terms, enabling qualitative inspection of the latent topics.

- **Feature Assembly**  
  The 50 SVD features are concatenated with our cleaned metadata frame to form `df_final`, the complete input for downstream modeling.

---

👉 Continue to the next step:  
<a href="https://github.com/alikhalajii/text-classification-life-sciences/blob/master/notebooks/04-traning-baseline-models.ipynb" target="_blank">**4. Training Baseline Models (LR–RF)**</a>
