In [1]:
from pathlib import Path
import pandas as pd
import numpy as np

In [2]:
BASE_DIR = Path.cwd().parent

# Random seed for reproducibility
SEED = 42
np.random.seed(SEED)

# 2. Exploratory Data Analysis & Data Labeling

After investigating and cleaning the Fiori Apps dataset in [**1. Data Cleaning**](https://github.com/alikhalajii/text-classification-life-sciences/blob/master/notebooks/01-data-cleaning.ipynb), we now move forward to explore the cleaned data in more depth.

This notebook focuses on understanding the structure, distribution, and patterns within the dataset to support manual annotation and guide model development.


## 2.1. Load the cleaned CSV

We begin by reloading the cleaned dataset generated in the previous step and verifying its integrity.

In [3]:
df_clean = pd.read_csv(
    BASE_DIR / "data/01_cleaned_df.csv",
    index_col=0,
    encoding='utf-8',
    keep_default_na=False,  # keep empty strings as "" instead of NaN
    na_values=[]            # no additional missing-value tokens
)


print(f"Shape: {df_clean.shape}")         # Expect about (14_112, 17)
print("Missing values per column:")
print(df_clean.isna().sum())              # All should be zero
print("\nData types:")
print(df_clean.dtypes)                    # Expect all 'object'

Shape: (14112, 17)
Missing values per column:
AppKeyFeatures                    0
AppName                           0
AppType                           0
ApplicationComponentText          0
BusinessRoleDescription           0
CombinedTitle                     0
GTMAppDescription                 0
GTMAppName                        0
GTMInnovationName                 0
GTMLoBName                        0
Information                       0
LeadingIndustry                   0
PFDBIndustry                      0
RoleCombinedToolTipDescription    0
SolutionsCapability               0
Subtitle                          0
Title                             0
dtype: int64

Data types:
AppKeyFeatures                    object
AppName                           object
AppType                           object
ApplicationComponentText          object
BusinessRoleDescription           object
CombinedTitle                     object
GTMAppDescription                 object
GTMAppName                   
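
To illustrate why the two NA-related options matter, here is a minimal sketch on a made-up three-column CSV (the rows and the names `csv_text`, `default`, and `strict` are illustrative assumptions, not part of the Fiori dataset). With pandas defaults, both an empty field and the literal string `NA` are read as missing; with the settings used above they stay as plain strings.

```python
import io

import pandas as pd

# Toy CSV (made-up rows): one empty field and one literal "NA" string.
csv_text = "fioriId,Title,Subtitle\nA1,Post Invoice,\nB2,NA,Overview\n"

# Default behaviour: empty fields and "NA" are both parsed as NaN.
default = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Same options as in the cell above: nothing is converted to NaN.
strict = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    keep_default_na=False,
    na_values=[],
)

n_missing_default = int(default.isna().sum().sum())
n_missing_strict = int(strict.isna().sum().sum())
print(n_missing_default, n_missing_strict)
```

One consequence is that `isna()` can no longer detect blank cells, so an empty-string count such as `(df_clean == "").sum()` is the more informative check for this dataframe.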

---

## 2.2. Univariate Feature Checks

* Quickly inspect distribution and uniqueness of a few high-priority features.

First, we inspect a few key attributes (**LeadingIndustry**, **AppType**, and **CombinedTitle**) using simple value counts.



In [4]:
# Inspect LeadingIndustry distribution
print(">>> LeadingIndustry (unique:", df_clean["LeadingIndustry"].nunique(), "categories)")
display(
    df_clean["LeadingIndustry"]
      .value_counts()
      .sort_values(ascending=False)
      .head(10)
)

# Check CombinedTitle uniqueness & top-10
print(f">>> CombinedTitle unique values: {df_clean['CombinedTitle'].nunique()}")
display(df_clean["CombinedTitle"].value_counts().head(10))

# Inspect AppType distribution
print(">>> AppType (unique:", df_clean["AppType"].nunique(), "categories)")
df_clean["AppType"].value_counts().head(10)


>>> LeadingIndustry (unique: 15 categories)


LeadingIndustry
                                    11896
Cross Industry                       1921
Oil, Gas, and Energy                   71
Retail                                 62
Public Sector                          51
Defense and Security                   28
Insurance                              19
Utilities                              17
Agribusiness (Consumer Products)       16
Consumer Products                       9
Name: count, dtype: int64

>>> CombinedTitle unique values: 12636


CombinedTitle
                                                                                                                                                             1246
My Inbox - All Items                                                                                                                                            9
Schedule Unplanned Contract Settlement                                                                                                                          7
Schedule Update Settlement Calendar                                                                                                                             5
Schedule Accruals Reversal - Obsolete Contracts, Schedule Accruals Reversal - Obsolete Contracts                                                                4
Approve Purchase Contracts, Approve Purchase Orders, Approve Service Entry Sheets - Lean Services, Approve Supplier Invoices, Approve Supplier Quotations       4
Display Change

>>> AppType (unique: 13 categories)


AppType
SAP GUI                           10346
Transactional                      2000
Web Dynpro                          898
Analytical                          396
Web Client UI                       128
Reuse Component                     112
Transactional, Reuse Component       83
Transactional, Analytical            67
Fact sheet                           46
Transactional, Fact sheet            14
Name: count, dtype: int64

We have identified:
- Class imbalances, e.g. in **LeadingIndustry**, where 11,896 of 14,112 rows are blank and “Cross Industry” (1,921 rows) dominates the remainder  
- Rare or overly diverse categories (e.g. **CombinedTitle** has over 12,000 unique values)  
- Boilerplate or template descriptions that can later be filtered out  

These insights guide a balanced, representative sampling strategy for our manual annotation phase.
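
To put numbers on such imbalances, relative shares are often easier to read than raw counts. The following is a small sketch on toy values; the helper `category_shares` and the `(blank)` placeholder are hypothetical names introduced here, not part of the notebook.

```python
import pandas as pd


def category_shares(values: pd.Series, blank_label: str = "(blank)") -> pd.DataFrame:
    """Return count and relative share per category, with empty strings made visible."""
    cleaned = values.replace("", blank_label)  # exact-match replacement of empty strings
    counts = cleaned.value_counts()
    return pd.DataFrame({"count": counts, "share": (counts / len(cleaned)).round(3)})


# Toy column (made-up values) that mimics a mostly blank categorical feature.
toy = pd.Series(["", "", "", "Cross Industry", "Retail", ""])
shares = category_shares(toy)
print(shares)
```

Applied to `df_clean["LeadingIndustry"]`, the same helper would make the share of blank entries explicit instead of leaving it as an unlabeled first row in the value counts.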

---

## 2.3. Text Field Consolidation

**Create a unified `all_text` field**  


Based on our univariate checks, we now merge all descriptive and title-like columns into a single free-text field called **`all_text`**. This combined text field will:

- Capture all Life-Science–relevant keywords and phrases in one place  
- Simplify downstream text processing (tokenization, vectorization)  
- Leave structured categorical features (e.g. **LeadingIndustry**, **AppType**) intact for separate encoding  

In [5]:
# Prepare a single free-text field for annotation
text_cols = [
    "CombinedTitle", "AppKeyFeatures", "GTMAppDescription",
    "RoleCombinedToolTipDescription", "ApplicationComponentText",
    "BusinessRoleDescription", "Subtitle", "AppName"
]

df_clean["all_text"] = (
    df_clean[text_cols]
        .agg(" ".join, axis=1)                # concatenate with space
        .str.replace(r"\s+", " ", regex=True) # collapse multiple spaces
        .str.strip()                          # trim leading/trailing
)

# Name the index 'fioriId', then move it into a regular column
df_clean.index.name = "fioriId"
df_clean = df_clean.reset_index()  # brings index into columns

df_clean[["fioriId", "all_text"]].head()

Unnamed: 0,fioriId,all_text
0,1KE4,Profit Center Assignment Monitor You can obtai...
1,AB08,Reverse Journal Entry - Asset Accounting-Speci...
2,ABAAL,Post Depreciation Manually - Unplanned and Pla...
3,ABAON,Post Retirement (Non-Integrated) - Without Cus...
4,ABAVN,Post Retirement - By Scrapping This app is a S...


By centralizing narrative content in `all_text`, we ensure our keyword filtering and embedding pipelines focus on the richest textual signal while preserving metadata for later models.  
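
As a worked example of what the concatenation chain does to blank fields and stray whitespace, here is the same three-step pattern on two made-up rows (the ids `X1`/`X2` and the reduced column list are assumptions for illustration only).

```python
import pandas as pd

# Two toy rows with empty fields and a double space inside one value.
toy = pd.DataFrame(
    {
        "CombinedTitle": ["Plan Repairs", "Delete Driver"],
        "AppKeyFeatures": ["Call up a  worklist", ""],
        "Subtitle": ["", ""],
    },
    index=pd.Index(["X1", "X2"], name="fioriId"),
)

toy["all_text"] = (
    toy[["CombinedTitle", "AppKeyFeatures", "Subtitle"]]
    .agg(" ".join, axis=1)                 # join the row's values with spaces
    .str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace
    .str.strip()                           # drop leading/trailing space left by blanks
)
print(toy["all_text"].tolist())
```

Empty fields contribute only an extra separator, which the whitespace collapse and `strip()` then remove, so blank columns leave no trace in `all_text`.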

We export the cleaned dataframe, including the `all_text` column, to a CSV file so it can be reused in the next steps of the pipeline.

In [6]:
df_clean.to_csv(BASE_DIR / "data/02_df_cleaned_all_text.csv", index=False, encoding='utf-8')

---

## 2.4. Build Annotation Sample

* 250 random + 250 keyword hits  

We create two equally sized, non-overlapping pools:  
1. **Random sample** – 250 records drawn uniformly from the records whose `all_text` matches none of the keywords.  
2. **Keyword sample** – up to 250 records whose `all_text` contains at least one Life-Science keyword.  

In [7]:
# Keyword filtering for annotation sample
keywords = ["clinical", "patient", "lab", "gxp", "regulatory", "validation"]
pattern   = "|".join(keywords)
hits_mask = df_clean["all_text"].str.contains(pattern, case=False, na=False)

# Keyword pool (may be smaller than 250 if few hits)
kw_pool   = df_clean[hits_mask]
rnd_pool  = df_clean[~hits_mask]

kw_n  = min(250, len(kw_pool))
rnd_n = 250

sample_keyword = kw_pool.sample(n=kw_n, random_state=SEED)
sample_random  = rnd_pool.sample(n=rnd_n, random_state=SEED)

# Combine, deduplicate (a safeguard; the two pools are disjoint), shuffle
annotation_df = (
    pd.concat([sample_keyword, sample_random])
      .drop_duplicates(subset="fioriId")
      .sample(frac=1, random_state=SEED)
      .reset_index(drop=True)
)

print(annotation_df.shape) # Should be (500, 19)
annotation_df.head()


(500, 19)


Unnamed: 0,fioriId,AppKeyFeatures,AppName,AppType,ApplicationComponentText,BusinessRoleDescription,CombinedTitle,GTMAppDescription,GTMAppName,GTMInnovationName,GTMLoBName,Information,LeadingIndustry,PFDBIndustry,RoleCombinedToolTipDescription,SolutionsCapability,Subtitle,Title,all_text
0,F4152,Call up a worklist of all repair objects to in...,Plan Repairs,Transactional,S4CRM: In-House Repair,Customer Service Representative - In-House Rep...,Plan Repairs,You can use this app to schedule the diagnosis...,Plan Repairs (SAP S/4HANA OP),Optimized service delivery through streamlined...,"Constituent Omnichannel Services, Customer, In...",,Cross Industry,Aerospace and Defense; Automotive; Chemicals; ...,Customer Service Manager - In-House Repair : M...,In-House Repair (S/4),,Plan Repairs,Plan Repairs Call up a worklist of all repair ...
1,O4D4,,Delete Driver,SAP GUI,Transportation and Distribution,Shipping Specialist (IOG)| Transportation Sche...,Delete Driver,This app is a SAP GUI for HTML transaction. Th...,SAP Fiori theme for SAP GUI for HTML (SAP S/4H...,SAP Fiori visual theme for classic application...,,,,,Transportation Scheduler (Oil & Gas) : Schedul...,,,Delete Driver,Delete Driver This app is a SAP GUI for HTML t...
2,MD09,"For the MRP element selected, you can display:...",Display Pegged Requirements,SAP GUI,Material Requirements Planning,,Display Pegged Requirements,You can use the Pegged requirements function t...,SAP Fiori theme for SAP GUI for HTML (SAP S/4H...,SAP Fiori visual theme for classic application...,,,,,unknown,,,Display Pegged Requirements,Display Pegged Requirements For the MRP elemen...
3,EPREPAY,,Manage Amount for Prepayment Meter,SAP GUI,Contract Billing,,Manage Amount for Prepayment Meter,This app is a SAP GUI for HTML transaction. Th...,SAP Fiori theme for SAP GUI for HTML (SAP S/4H...,SAP Fiori visual theme for classic application...,,,,,unknown,,,Manage Amount for Prepayment Meter,Manage Amount for Prepayment Meter This app is...
4,IK11,,Create Measurement Document,SAP GUI,Measuring points and counters,Maintenance Supervisor| Production Operator - ...,Create Measurement Document,This app is a SAP GUI for HTML transaction. Th...,SAP Fiori theme for SAP GUI for HTML (SAP S/4H...,SAP Fiori visual theme for classic application...,,,,,Production Operator - Discrete Manufacturing :...,,,Create Measurement Document,Create Measurement Document This app is a SAP ...
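
One caveat is worth checking before relying on the keyword pool: the pattern is a plain substring alternation, so a short keyword such as `lab` also matches inside unrelated words. The sketch below compares it with a word-boundary variant on made-up sentences; `word_pattern` and the example texts are assumptions for illustration, not what the notebook uses.

```python
import re

import pandas as pd

keywords = ["clinical", "patient", "lab", "gxp", "regulatory", "validation"]

# Pattern as used in the notebook: matches the keyword anywhere, even inside words.
substring_pattern = "|".join(keywords)

# Hypothetical stricter variant: \b restricts matches to whole words.
word_pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"

texts = pd.Series(
    [
        "Manage lab results for clinical trials",
        "Display available stock",   # "available" contains "lab"
        "Print shipping labels",     # "labels" contains "lab"
    ]
)

substring_hits = texts.str.contains(substring_pattern, case=False, na=False)
word_hits = texts.str.contains(word_pattern, case=False, na=False)
print(pd.DataFrame({"text": texts, "substring": substring_hits, "whole_word": word_hits}))
```

The stricter pattern trades recall for precision: it would no longer match plurals such as "patients" or "labs" unless they are added to the keyword list, so which variant is preferable depends on how noisy the keyword pool turns out to be.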


---

## 2.5. Heuristic pre-labeling

**Pre-Labeling Strategy**  
We automatically assign a **pre-label** to all 500 sampled apps to speed up manual review:

- **Label 1 (Potentially Relevant):**  
  If the combined text contains **any** of the six Life-Science keywords.

- **Label 0 (Potentially Not Relevant):**  
  If none of those keywords appear.

This pre-labeling flags likely relevant cases for prioritized inspection, reducing the manual effort on clearly irrelevant entries.  


In [8]:
# Add heuristic pre-label
keywords = ["clinical", "patient", "lab", "gxp", "regulatory", "validation"]
pattern = "|".join(keywords)

annotation_df["pre_label"] = (
    annotation_df["all_text"]
        .str.contains(pattern, case=False, na=False)
        .astype(int)          # True→1, False→0
)

annotation_df["pre_label"].value_counts()


pre_label
0    250
1    250
Name: count, dtype: int64
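
For manual review it can help to see which keyword triggered a pre-label. The following sketch adds a hypothetical `matched_keywords` column on made-up rows; the helper name and the toy dataframe are assumptions, and the pattern is the same substring alternation as above.

```python
import re

import pandas as pd

keywords = ["clinical", "patient", "lab", "gxp", "regulatory", "validation"]
pattern = "|".join(keywords)

# Toy annotation rows (made-up text).
toy = pd.DataFrame(
    {
        "fioriId": ["X1", "X2", "X3"],
        "all_text": [
            "Clinical trial validation report",
            "Display available stock",
            "Post invoice",
        ],
    }
)


def matched_keywords(text: str) -> str:
    """Return the distinct keywords found in `text`, lower-cased and sorted."""
    found = {m.lower() for m in re.findall(pattern, text, flags=re.IGNORECASE)}
    return ", ".join(sorted(found))


toy["matched_keywords"] = toy["all_text"].map(matched_keywords)
toy["pre_label"] = (toy["matched_keywords"] != "").astype(int)
print(toy)
```

A reviewer who sees `lab` as the only trigger for "Display available stock" can overturn that pre-label at a glance.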

---

## 2.6. Manual Review & Label Verification

* Provide annotators with just the necessary fields (`fioriId`, `all_text`, `pre_label`):  



In [9]:
# Export to CSV for verification of pre-labels

cols_to_export = ["fioriId", "all_text", "pre_label"]
annotation_df[cols_to_export].to_csv(BASE_DIR / "data/02_annotation_sample.csv", index=False)

We perform a **manual audit** of the 500 pre-labeled apps to ensure high-quality ground truth:

- **Preserve all entries**:  No rows are removed or altered; every app’s original text stays intact.  
- **Adjust only the `Label` field**:  We overwrite the pre-labels where necessary based on human judgment.  
- **Balance the sample**:  We aim for an even **250 relevant / 250 not relevant** final split.

This fully validated, evenly distributed sample now serves as a reliable training set for our subsequent modeling steps.  

***Feel free to verify it yourself, or continue with our manually verified version of the annotated dataset.***

**Quality Check of Manual Labels**

- **Data Integrity**  
  We verify that the first two columns (`fioriId` and `all_text`) remain identical to the exported CSV, so no identifiers or text have been altered or lost.

- **Label Balance**  
  Then we confirm the label distribution:

In [10]:
df_annotation_sample_verified = pd.read_csv(BASE_DIR / "data/annotation_sample_verified.csv", encoding='utf-8')

df_annotation_sample = pd.read_csv(BASE_DIR / "data/02_annotation_sample.csv", encoding='utf-8')

print(df_annotation_sample_verified.iloc[:, :2].equals(df_annotation_sample.iloc[:, :2])) # True
print(df_annotation_sample_verified["Label"].value_counts()) # Balance check


True
Label
0    250
1    250
Name: count, dtype: int64
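
Beyond the balance check, a cross-tabulation of `pre_label` against `Label` shows how often the manual review overturned the heuristic. This is a minimal sketch on made-up labels; the toy frame `verified` and the derived names are assumptions, and no figures from the real annotation file are implied.

```python
import pandas as pd

# Toy verified sample (made-up labels): one pre-label was overturned.
verified = pd.DataFrame(
    {
        "fioriId": ["X1", "X2", "X3", "X4", "X5"],
        "pre_label": [1, 1, 0, 0, 1],
        "Label": [1, 0, 0, 0, 1],
    }
)

# Rows: heuristic pre-label, columns: human-verified label.
confusion = pd.crosstab(verified["pre_label"], verified["Label"])

# Apps where the annotator disagreed with the heuristic.
overturned = verified[verified["pre_label"] != verified["Label"]]

# Share of apps where pre-label and verified label agree.
agreement = (verified["pre_label"] == verified["Label"]).mean()

print(confusion)
print(overturned["fioriId"].tolist(), round(agreement, 2))
```

Run against `df_annotation_sample_verified`, the off-diagonal cells would indicate how much signal the manual pass added on top of the keyword heuristic.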


---

## 2.7. Merge Annotations with Full Metadata

Once the labels have been manually verified, we merge the annotated sample (`df_annotation_sample_verified`) with the original metadata (`df_clean`) using the common key `fioriId`.

This results in a unified dataset (`df_annotated_full`) that includes both the human-verified labels and all original descriptive fields, ready for downstream feature engineering.

We ensure that:
- All 500 annotated apps are retained (left join).
- No metadata is lost or altered during the merge.


In [11]:
# Merge verified labels with full app metadata on fioriId
df_annotated_full = df_annotation_sample_verified.set_index("fioriId").join(
    df_clean.set_index("fioriId"),
    how="left",
    lsuffix="_label"  # suffix for overlapping columns from the annotation file (all_text -> all_text_label)
).reset_index()

df_annotated_full.to_csv(BASE_DIR / "data/02_df_annotated_full.csv", index=False, encoding='utf-8')


In [12]:
df_annotated_full.head()

Unnamed: 0,fioriId,all_text_label,pre_label,Label,AppKeyFeatures,AppName,AppType,ApplicationComponentText,BusinessRoleDescription,CombinedTitle,...,GTMInnovationName,GTMLoBName,Information,LeadingIndustry,PFDBIndustry,RoleCombinedToolTipDescription,SolutionsCapability,Subtitle,Title,all_text
0,F4152,Plan Repairs Call up a worklist of all repair ...,0,0,Call up a worklist of all repair objects to in...,Plan Repairs,Transactional,S4CRM: In-House Repair,Customer Service Representative - In-House Rep...,Plan Repairs,...,Optimized service delivery through streamlined...,"Constituent Omnichannel Services, Customer, In...",,Cross Industry,Aerospace and Defense; Automotive; Chemicals; ...,Customer Service Manager - In-House Repair : M...,In-House Repair (S/4),,Plan Repairs,Plan Repairs Call up a worklist of all repair ...
1,O4D4,Delete Driver This app is a SAP GUI for HTML t...,1,1,,Delete Driver,SAP GUI,Transportation and Distribution,Shipping Specialist (IOG)| Transportation Sche...,Delete Driver,...,SAP Fiori visual theme for classic application...,,,,,Transportation Scheduler (Oil & Gas) : Schedul...,,,Delete Driver,Delete Driver This app is a SAP GUI for HTML t...
2,MD09,Display Pegged Requirements For the MRP elemen...,0,0,"For the MRP element selected, you can display:...",Display Pegged Requirements,SAP GUI,Material Requirements Planning,,Display Pegged Requirements,...,SAP Fiori visual theme for classic application...,,,,,unknown,,,Display Pegged Requirements,Display Pegged Requirements For the MRP elemen...
3,EPREPAY,Manage Amount for Prepayment Meter This app is...,1,1,,Manage Amount for Prepayment Meter,SAP GUI,Contract Billing,,Manage Amount for Prepayment Meter,...,SAP Fiori visual theme for classic application...,,,,,unknown,,,Manage Amount for Prepayment Meter,Manage Amount for Prepayment Meter This app is...
4,IK11,Create Measurement Document This app is a SAP ...,1,1,,Create Measurement Document,SAP GUI,Measuring points and counters,Maintenance Supervisor| Production Operator - ...,Create Measurement Document,...,SAP Fiori visual theme for classic application...,,,,,Production Operator - Discrete Manufacturing :...,,,Create Measurement Document,Create Measurement Document This app is a SAP ...
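
The two guarantees stated for this merge (all annotated apps retained, no metadata lost) can also be checked programmatically. Below is a sketch using `DataFrame.merge` with its `validate` and `indicator` options on toy frames; it assumes `fioriId` is unique in both inputs, and the frames `labels` and `metadata` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the verified sample and the full metadata table.
labels = pd.DataFrame({"fioriId": ["X1", "X2", "X3"], "Label": [1, 0, 1]})
metadata = pd.DataFrame(
    {
        "fioriId": ["X1", "X2", "X3", "X4"],
        "AppType": ["SAP GUI", "Transactional", "Analytical", "SAP GUI"],
    }
)

merged = labels.merge(
    metadata,
    on="fioriId",
    how="left",
    validate="one_to_one",  # raises MergeError if fioriId is duplicated on either side
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)

# Annotated apps that found no metadata row (should be empty).
unmatched = merged[merged["_merge"] != "both"]
print(len(merged), len(unmatched))
```

If `fioriId` turned out not to be unique in the real data, `validate="one_to_one"` would fail loudly instead of silently duplicating annotated rows.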




This exported dataframe serves as the foundation for feature engineering in the next notebook: [**3. Feature Engineering**](https://github.com/alikhalajii/text-classification-life-sciences/blob/master/notebooks/03-feature-engineering.ipynb).