# A Light-weight TMF Document Classifier
This classifier attempts to assign documents to the 'artifact names' defined by the CDISC Trial Master File (TMF) Reference Model, using only the title of a sub-artifact or the Subject line of a printed email message.

- Baseline Pre-trained Model
- Class-binning Model
- GPT-augmented Training Set approach
- Test and compare models with unseen data

## Preprocessing

In [1]:
import pandas as pd

# Read selected columns from the TMF Reference Model's specifications
tmf_specs_file = "data/specs/TMF-Reference-Model.xlsx"
tmf_rm = pd.read_excel(tmf_specs_file, 
                       sheet_name='Taxonomy',
                       usecols="B, E, F, H, Q, R, S",
                       header=0,)
tmf_rm.head(5)

Unnamed: 0,Zone Name,Artifact #,Artifact name,Subartifacts,Trial Level Document,Region Level Document,Site Level Document
0,Trial Management,01.01.01,Trial Master File Plan,"Document Transfer Documentation,\nEvidence of ...",X,,
1,Trial Management,01.01.02,Trial Management Plan,"Clinical Development Plan,\nProject Management...",X,X,
2,Trial Management,01.01.03,Quality Plan,"Quality Documentation,\nQuality Plan,\nQuality...",X,X,
3,Trial Management,01.01.04,List of SOPs Current During Trial,"List of SOPs Current During Trial,\nSOP Waiver...",X,X,
4,Trial Management,01.01.05,Operational Procedure Manual,Operational Procedure Manual,X,X,


In [2]:
mapping = {'Zone Name': 'Zone',
          'Artifact name': 'Artifact',
          'Trial Level Document': 'Trial Level',
          'Region Level Document': 'Region Level',
          'Site Level Document': 'Site Level'}
tmf_rm.rename(mapping, axis=1, inplace=True)
tmf_rm

Unnamed: 0,Zone,Artifact #,Artifact,Subartifacts,Trial Level,Region Level,Site Level
0,Trial Management,01.01.01,Trial Master File Plan,"Document Transfer Documentation,\nEvidence of ...",X,,
1,Trial Management,01.01.02,Trial Management Plan,"Clinical Development Plan,\nProject Management...",X,X,
2,Trial Management,01.01.03,Quality Plan,"Quality Documentation,\nQuality Plan,\nQuality...",X,X,
3,Trial Management,01.01.04,List of SOPs Current During Trial,"List of SOPs Current During Trial,\nSOP Waiver...",X,X,
4,Trial Management,01.01.05,Operational Procedure Manual,Operational Procedure Manual,X,X,
...,...,...,...,...,...,...,...
226,Statistics,11.05.02,Tracking Information,Tracking Information,X,,
227,Statistics,11.05.03,Meeting Material,"Agenda,\nAttendance Sheet,\nMinutes,\nPresenta...",X,,
228,Statistics,11.05.04,Filenote,Filenote,X,,
229,TBD,99.99.99,Curriculum Vitae,Curriculum Vitae,,,


### Binarize 'Scope' Columns
- An 'X' check mark in the scope columns means 'Yes'; it is recoded as 1 and blank cells as 0

In [3]:
tmf_rm.fillna(0, inplace=True)
# Replace 'X' with 1 in check marked columns
check_mark_cols = ['Trial Level', 'Region Level', 'Site Level']
tmf_rm[check_mark_cols] = tmf_rm[check_mark_cols].replace('X', 1)
tmf_rm

Unnamed: 0,Zone,Artifact #,Artifact,Subartifacts,Trial Level,Region Level,Site Level
0,Trial Management,01.01.01,Trial Master File Plan,"Document Transfer Documentation,\nEvidence of ...",1,0,0
1,Trial Management,01.01.02,Trial Management Plan,"Clinical Development Plan,\nProject Management...",1,1,0
2,Trial Management,01.01.03,Quality Plan,"Quality Documentation,\nQuality Plan,\nQuality...",1,1,0
3,Trial Management,01.01.04,List of SOPs Current During Trial,"List of SOPs Current During Trial,\nSOP Waiver...",1,1,0
4,Trial Management,01.01.05,Operational Procedure Manual,Operational Procedure Manual,1,1,0
...,...,...,...,...,...,...,...
226,Statistics,11.05.02,Tracking Information,Tracking Information,1,0,0
227,Statistics,11.05.03,Meeting Material,"Agenda,\nAttendance Sheet,\nMinutes,\nPresenta...",1,0,0
228,Statistics,11.05.04,Filenote,Filenote,1,0,0
229,TBD,99.99.99,Curriculum Vitae,Curriculum Vitae,0,0,0
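
The cell above fills every missing value in the frame with 0, including any in the text columns. As a minimal sketch on a toy frame (the rows below are illustrative, not read from the workbook), the recoding can instead be limited to the three scope columns and yield integer flags directly:

```python
import pandas as pd

# Toy stand-in for tmf_rm; values are illustrative only
toy = pd.DataFrame({
    "Artifact #": ["01.01.01", "01.01.02", "99.99.99"],
    "Trial Level": ["X", "X", None],
    "Region Level": [None, "X", None],
    "Site Level": [None, None, None],
})

check_mark_cols = ["Trial Level", "Region Level", "Site Level"]

# 'X' becomes 1; anything else, including missing cells, becomes 0
toy[check_mark_cols] = toy[check_mark_cols].eq("X").astype(int)
print(toy)
```

This leaves the other columns untouched, so a genuinely missing title would still show up as missing rather than as 0.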


### Split Sub-artifacts as individual Titles
- Sub-artifacts of an `Artifact` type are stored as a comma-separated list of titles in a single cell

In [4]:
def split_subartifacts(row, artifact_subs):
    '''
    Split sub-artifacts for each Artifact and
    construct a DataFrame with Artifact #, Sub-artifact
    '''
    subartifacts = row['Subartifacts'].split(',')
    # *** TO DO: does not clean
    clean_subartifacts = [x.strip() for x in subartifacts if x.strip() != '']
    for sub_artifact in clean_subartifacts:
        artifact_subs.append({'Artifact #': row['Artifact #'], 
#                               'Artifact': row['Artifact'],
                              'Subartifact Title': sub_artifact.lower()})

# List to accumulate data
artifact_subs = []

# Apply the function to each row of tmf_rm, passing the accumulator list
tmf_rm.apply(lambda row: split_subartifacts(row, artifact_subs), axis=1)

# Create the DataFrame after accumulating all data
tmf_sub_artifacts = pd.DataFrame(artifact_subs)

tmf_sub_artifacts

Unnamed: 0,Artifact #,Subartifact Title
0,01.01.01,document transfer documentation
1,01.01.01,evidence of quality review
2,01.01.01,request to lock tmf
3,01.01.01,trial master file plan
4,01.01.01,trial master file index
...,...,...
640,06.01.04,investigational product shipment
641,06.01.04,packaging order
642,06.01.04,investigational product shipment request form
643,06.01.04,temperature monitoring
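
The same split can be written without an accumulator list. The sketch below uses a toy frame whose cells mimic the `,\n` separated lists shown earlier; it is an alternative formulation under that assumption, not the notebook's own method:

```python
import pandas as pd

# Toy stand-in for tmf_rm; the trailing comma is deliberate
toy = pd.DataFrame({
    "Artifact #": ["01.01.01", "01.01.05"],
    "Subartifacts": [
        "Trial Master File Plan,\nTrial Master File Index,",
        "Operational Procedure Manual",
    ],
})

# Split each cell into a list, then give every list element its own row
titles = (
    toy.assign(**{"Subartifact Title": toy["Subartifacts"].str.split(",")})
       .explode("Subartifact Title")
)
titles["Subartifact Title"] = titles["Subartifact Title"].str.strip().str.lower()

# Drop empty fragments left behind by trailing commas
titles = titles.loc[titles["Subartifact Title"] != "",
                    ["Artifact #", "Subartifact Title"]].reset_index(drop=True)
print(titles)
```

Note that either approach splits on every comma, so a title that itself contains a comma would be broken into two rows.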


### Systematic Indiscernible Sub-artifact Titles
- The same sub-artifact title can appear under several `Artifact` types, for example when the same document is created for, or distributed to, different entities, or belongs to a different `Zone`. Such documents cannot be distinguished by the sub-artifact title alone.
- For example, CVs or certificates of an Investigator must be classified to different `Artifact` types depending on whether the Investigator is a 'Principal Investigator'.

In [5]:
# Join two tables
tmf = pd.merge(tmf_rm[['Zone','Artifact #', 
                       'Trial Level', 'Region Level', 'Site Level']], 
               tmf_sub_artifacts, on='Artifact #')

# Non-unique sub-artifacts/Titles
duplicates = tmf.groupby(['Subartifact Title']).filter(lambda x: len(x) > 1)
indiscernable_titles = duplicates['Subartifact Title'].unique()
indiscernable_titles

array(['monitoring plan', 'audit certificate', 'icf summary of changes',
       'receipt of acknowledgement', 'quarterly line listing', 'medwatch',
       'serious adverse reaction', 'serious adverse device events',
       'analysis of similar event', 'susars', 'usade', 'cioms',
       'medical license', 'clinical trial agreement', 'manual',
       'irt uat certification', 'sdtm', 'validation plan',
       'validation report', 'relevant communications',
       'tracking information', 'filenote', 'agenda', 'attendance sheet',
       'minutes', 'presentation materials'], dtype=object)
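
As a hedged alternative to `groupby(...).filter(...)`, `DataFrame.duplicated` with `keep=False` flags every row whose title occurs more than once. The frame below is a toy example; its artifact numbers are placeholders rather than values from the Reference Model:

```python
import pandas as pd

# Toy data; artifact numbers are placeholders
toy = pd.DataFrame({
    "Artifact #": ["A.01", "B.01", "A.02", "C.01"],
    "Subartifact Title": ["minutes", "minutes",
                          "trial master file plan", "medical license"],
})

# keep=False marks every member of a duplicated group, not only the repeats
is_shared = toy.duplicated(subset="Subartifact Title", keep=False)
shared_titles = toy.loc[is_shared, "Subartifact Title"].unique()
print(shared_titles)
```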

### Synthesize 'Artifact Bins' to Group Indiscernible Titles

In [6]:
# Synthesize an artificial (or binned) Artifact for each indiscernable title
synthetic_artifact = []
for i, title in enumerate(indiscernable_titles):
    synthetic_artifact.append({'Zone': 'TBD', 
                             'Artifact #': f"99.00.{i + 1}",
                             'Trial Level': 1,
                             'Region Level': 1,
                             'Site Level': 1,
                             'Subartifact Title': title}
                           )
synthetic_artifact = pd.DataFrame(synthetic_artifact)

# Binned artifacts store the mapping between original and synthesized Artifact #
subset = ['Artifact #', 'Subartifact Title']
binned_artifact_map = pd.merge(synthetic_artifact[subset], duplicates[subset],  on='Subartifact Title')
binned_artifact_map.rename(columns={'Artifact #_x': 'Artifact Bin', 
                                 'Artifact #_y': 'Artifact #',
                                }, inplace=True)
binned_artifact_map.drop(columns=['Subartifact Title'], inplace=True)

# One-to-many relationship between Artifact Bin and Artifact #
binned_artifact_map

Unnamed: 0,Artifact Bin,Artifact #
0,99.00.1,01.01.08
1,99.00.1,01.01.09
2,99.00.2,01.01.14
3,99.00.2,09.01.01
4,99.00.3,02.02.03
...,...,...
102,99.00.26,07.03.03
103,99.00.26,08.03.03
104,99.00.26,09.03.03
105,99.00.26,10.05.03
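
One way such a map could be used at prediction time is to expand a predicted bin back into its candidate artifact numbers. The sketch below copies four rows from the output above; `resolve` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd

# Slice shaped like binned_artifact_map (rows copied from the output above)
bin_map = pd.DataFrame({
    "Artifact Bin": ["99.00.1", "99.00.1", "99.00.2", "99.00.2"],
    "Artifact #": ["01.01.08", "01.01.09", "01.01.14", "09.01.01"],
})

# One list of candidate artifact numbers per bin
candidates = bin_map.groupby("Artifact Bin")["Artifact #"].apply(list).to_dict()

def resolve(predicted_label):
    """Return candidate artifact numbers for a predicted label (hypothetical helper)."""
    # Labels that are not bins resolve to themselves
    return candidates.get(predicted_label, [predicted_label])

print(resolve("99.00.1"))
print(resolve("01.01.01"))
```

Choosing among the candidates still needs information beyond the title, such as the zone or the document's recipient.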


### Write Synthetic `Artifacts` to Excel
- for labeling unseen training data

In [7]:
binned_sub_artifact_specs = pd.DataFrame()
binned_sub_artifact_specs = pd.merge(binned_artifact_map, tmf,  on='Artifact #')
# binned_artifact_specs.drop(columns=['Subartifact Title_y'], inplace=True)
binned_sub_artifact_specs

Unnamed: 0,Artifact Bin,Artifact #,Zone,Trial Level,Region Level,Site Level,Subartifact Title
0,99.00.1,01.01.08,Trial Management,1,1,1,monitoring plan
1,99.00.1,01.01.08,Trial Management,1,1,1,risk-based monitoring plan
2,99.00.1,01.01.08,Trial Management,1,1,1,risk based monitoring plan
3,99.00.1,01.01.08,Trial Management,1,1,1,risk based monitoring evidence
4,99.00.1,01.01.09,Trial Management,1,0,0,monitoring plan
...,...,...,...,...,...,...,...
509,99.00.25,11.05.03,Statistics,1,0,0,presentation materials
510,99.00.26,11.05.03,Statistics,1,0,0,agenda
511,99.00.26,11.05.03,Statistics,1,0,0,attendance sheet
512,99.00.26,11.05.03,Statistics,1,0,0,minutes
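
In the output above, bin 99.00.1 is paired with every sub-artifact of 01.01.08, not only with the shared title 'monitoring plan', because the join key is `Artifact #` alone. If one row per bin, artifact and shared title is what the labeling sheet needs (an assumption about the intent), a sketch that keeps the title in the map and joins on both keys could look like this; `bin_map_with_title` is a hypothetical name:

```python
import pandas as pd

# Toy frames; artifact numbers and titles are taken from the outputs above
tmf_toy = pd.DataFrame({
    "Artifact #": ["01.01.08", "01.01.08", "01.01.09"],
    "Subartifact Title": ["monitoring plan", "risk-based monitoring plan",
                          "monitoring plan"],
})
# Same as binned_artifact_map, but with the title column kept
bin_map_with_title = pd.DataFrame({
    "Artifact Bin": ["99.00.1", "99.00.1"],
    "Artifact #": ["01.01.08", "01.01.09"],
    "Subartifact Title": ["monitoring plan", "monitoring plan"],
})

# Joining on both keys keeps only the rows that carry the shared title
specs = pd.merge(bin_map_with_title, tmf_toy,
                 on=["Artifact #", "Subartifact Title"])
print(specs)
```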


In [8]:

### Write Binned artifact specs to labeled training data sets

try:
    labeled_data_file = "data/labeled_training_sets.xlsx"
    sheet_name = 'Artifact Bins'
    writer = pd.ExcelWriter(labeled_data_file, engine='openpyxl', mode='a', if_sheet_exists='replace')
    with writer:
        binned_sub_artifact_specs.to_excel(writer, sheet_name=sheet_name)
    print(f"Sheet '{sheet_name}' updated to {labeled_data_file}")
except Exception as e:
    print(f"An error occurred: {e}")
# finally:
#     # Explicitly close the writer
#     writer.close()

An error occurred: [Errno 13] Permission denied: 'data/labeled_training_sets.xlsx'
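
`[Errno 13] Permission denied` is reported by the operating system when the file is opened. A common cause on Windows is that the workbook is still open in Excel, although the output alone does not confirm that here. Separately, `mode='a'` requires the workbook to exist already. The sketch below shows one way to pick the writer options accordingly; `writer_kwargs` and `write_sheet` are hypothetical helpers, and writing requires `openpyxl` to be installed:

```python
import os
import pandas as pd

def writer_kwargs(path):
    """Pick ExcelWriter options based on whether the workbook already exists."""
    if os.path.exists(path):
        # Append mode can replace a single sheet and keep the others
        return {"engine": "openpyxl", "mode": "a", "if_sheet_exists": "replace"}
    # if_sheet_exists is only valid in append mode, so omit it for new files
    return {"engine": "openpyxl", "mode": "w"}

def write_sheet(df, path, sheet_name):
    """Write df to one sheet of the workbook (hypothetical helper)."""
    # The context manager saves and closes the file even if to_excel raises
    with pd.ExcelWriter(path, **writer_kwargs(path)) as writer:
        df.to_excel(writer, sheet_name=sheet_name)

print(writer_kwargs("data/does_not_exist.xlsx"))
```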


#### Remove Indiscernible Titles and Replace with Synthetic Artifact Bins

In [9]:
synthetic_artifact

Unnamed: 0,Zone,Artifact #,Trial Level,Region Level,Site Level,Subartifact Title
0,TBD,99.00.1,1,1,1,monitoring plan
1,TBD,99.00.2,1,1,1,audit certificate
2,TBD,99.00.3,1,1,1,icf summary of changes
3,TBD,99.00.4,1,1,1,receipt of acknowledgement
4,TBD,99.00.5,1,1,1,quarterly line listing
5,TBD,99.00.6,1,1,1,medwatch
6,TBD,99.00.7,1,1,1,serious adverse reaction
7,TBD,99.00.8,1,1,1,serious adverse device events
8,TBD,99.00.9,1,1,1,analysis of similar event
9,TBD,99.00.10,1,1,1,susars


In [10]:
# Delete indistinguishable Titles Before Creating binned Artifacts
idx_duplicates = duplicates.index
tmf = tmf[~tmf.index.isin(idx_duplicates)]

# Append synthesized Artifact #
tmf = pd.concat([tmf, synthetic_artifact], ignore_index=True)

# Training data management
try:
    labeled_traing_data = tmf[['Artifact #', 'Subartifact Title']]
    labeled_data_file = "data/labeled_training_sets.xlsx"
    pre_train_specs = 'TMF 3.3'
    writer = pd.ExcelWriter(labeled_data_file, engine='openpyxl', 
                            mode='a', if_sheet_exists='replace')
    with writer:
        labeled_traing_data.to_excel(writer, sheet_name=pre_train_specs)
    print(f"Sheet '{pre_train_specs}' updated to {labeled_data_file}")
except Exception as e:
    print(f"An error occurred: {e}")
    
# labeled_traing_data

An error occurred: [Errno 13] Permission denied: 'data/labeled_training_sets.xlsx'


### Imbalanced Class Distributions in the TMF Reference Model

In [11]:
# Statistics of sub-artifact counts
tmf_sub_artifacts.groupby(['Artifact #']).size().describe().T

count    231.000000
mean       2.792208
std        2.170981
min        1.000000
25%        1.000000
50%        2.000000
75%        4.000000
max       14.000000
dtype: float64
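
The summary shows a 25th percentile of 1 and a maximum of 14, so at least a quarter of the artifacts have a single title to learn from. A small sketch of how the share of such single-title classes could be computed, using a toy frame with illustrative counts:

```python
import pandas as pd

# Toy frame: three classes with 5, 2 and 1 titles respectively
toy = pd.DataFrame({
    "Artifact #": ["01.01.01"] * 5 + ["01.01.02"] * 2 + ["01.01.05"],
})

class_sizes = toy.groupby("Artifact #").size()

# Share of classes that have exactly one training title
singleton_share = (class_sizes == 1).mean()

print(class_sizes.sort_values(ascending=False))
print(f"Share of single-title classes: {singleton_share:.2f}")
```

Classes with one example cannot be split into a training part and a validation part, which limits stratified cross-validation on the Reference Model alone.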

In [12]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords

from concurrent.futures import ThreadPoolExecutor
import time

# Custom transformer using NLTK PorterStemmer and tokenizer
class StemmingTransformer():
    def __init__(self):
        self.stemmer = PorterStemmer()
        self.transform_time = None
    
    def fit(self, X, y=None):
        return self    

    def stem_text(self, text):
        return " ".join([self.stemmer.stem(token) for token in nltk.word_tokenize(text)])

    def transform(self, X):
        start_time_ = time.time()
        
        with ThreadPoolExecutor() as executor:
            X_transformed = list(executor.map(self.stem_text, X))
        
        self.transform_time = time.time() - start_time_
        return X_transformed    

# Custom transformer using NLTK lemmatizer and tokenizer
class LemmatizingTransformer():
    def __init__(self):
        self.stemmer = WordNetLemmatizer()
        self.transform_time = None
            
    def fit(self, X, y=None):
        return self

    def stem_text(self, text):
        return " ".join([self.stemmer.lemmatize(token) for token in nltk.word_tokenize(text)])

    def transform(self, X):
        start_time_ = time.time()
        X_transformed = [" ".join([self.stemmer.lemmatize(token) for token in word_tokenize(text)]) for text in X]
        self.transform_time = time.time() - start_time_
        return X_transformed

# Download required NLTK data 
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
# nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Jimmy\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Jimmy\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jimmy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True
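
Porter stemming is rule-based and needs none of the downloaded NLTK data; only the tokenizer does. Because stemming is CPU-bound pure Python, the thread pool in `StemmingTransformer` probably adds little speed, though that is not measured here. A minimal single-threaded sketch, assuming whitespace tokenization is good enough for short titles:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_title(text):
    """Stem each whitespace-separated token (simplified tokenization, an assumption)."""
    return " ".join(stemmer.stem(token) for token in text.split())

titles = ["evidence of quality review", "attendance sheet"]
stems = [stem_title(title) for title in titles]
print(stems)
```

Unlike `word_tokenize`, `str.split` keeps punctuation attached to words, so results can differ for titles that contain punctuation.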

In [13]:
corpus = tmf.copy()

text_var  = 'Subartifact Title'
target = 'Artifact #'

stemmer = StemmingTransformer()
corpus['stem'] = stemmer.transform(corpus[text_var])
X, y = corpus[['stem']], corpus[target]
print(f'Time lapsed: {stemmer.transform_time}')

corpus[[text_var, 'stem', target]]

Time lapsed: 0.20485258102416992


Unnamed: 0,Subartifact Title,stem,Artifact #
0,document transfer documentation,document transfer document,01.01.01
1,evidence of quality review,evid of qualiti review,01.01.01
2,request to lock tmf,request to lock tmf,01.01.01
3,trial master file plan,trial master file plan,01.01.01
4,trial master file index,trial master file index,01.01.01
...,...,...,...
559,filenote,filenot,99.00.22
560,agenda,agenda,99.00.23
561,attendance sheet,attend sheet,99.00.24
562,minutes,minut,99.00.25


In [14]:
corpus

Unnamed: 0,Zone,Artifact #,Trial Level,Region Level,Site Level,Subartifact Title,stem
0,Trial Management,01.01.01,1,0,0,document transfer documentation,document transfer document
1,Trial Management,01.01.01,1,0,0,evidence of quality review,evid of qualiti review
2,Trial Management,01.01.01,1,0,0,request to lock tmf,request to lock tmf
3,Trial Management,01.01.01,1,0,0,trial master file plan,trial master file plan
4,Trial Management,01.01.01,1,0,0,trial master file index,trial master file index
...,...,...,...,...,...,...,...
559,TBD,99.00.22,1,1,1,filenote,filenot
560,TBD,99.00.23,1,1,1,agenda,agenda
561,TBD,99.00.24,1,1,1,attendance sheet,attend sheet
562,TBD,99.00.25,1,1,1,minutes,minut


In [15]:
# lemmatizer = LemmatizingTransformer()
# corpus['lemma'] = lemmatizer.transform(corpus[text_var])
# print(f'Time lapsed:{lemmatizer.transform_time}')
# corpus

In [16]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier


In [17]:
from sklearn.compose import ColumnTransformer

# Define a custom lemmatizing tokenizer (callable class)
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc) if t.isalpha()]

# Domain-specific stop words; defined here but not passed to the vectorizer below
my_stop_words = ['a', 'of', 'and', 'for', 'to', 
                 'document', 'documentation', 'plan', 'letter', 'form',
                 'information'
                ]

preprocessor = ColumnTransformer(
    transformers=[
        ('tfidf', TfidfVectorizer(ngram_range=(1, 2), 
                                  stop_words=stopwords.words('english'),
                                  lowercase=True), 'stem')
    ],
    remainder='passthrough')
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
     ])

best_model = pipeline.fit(X, y)
pre_train_score =  best_model.score(X, y)
print(f"Pre-train Score: {pre_train_score}")

# Missed predictions
if pre_train_score < 1:
    print("Missed Predictions:")
    predictions = best_model.predict(X)
    missed = []
    for real, predicted in zip(y, predictions):
        if real != predicted:
            missed.append({'Actual': real, 'Predicted': predicted})

    missed = pd.DataFrame(missed)
    print(missed)

Pre-train Score: 1.0
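
A score of 1.0 here is measured on the same titles the model was fitted on, so it shows that the forest can memorize the reference titles rather than how it handles unseen wording. As a rough sketch with made-up titles and labels (none of the strings below come from the workbook), scoring on paraphrases that were not part of the fit gives a more honest number:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up training titles and labels, for illustration only
train = pd.DataFrame({
    "title": ["monitoring visit report", "site visit log",
              "protocol amendment", "protocol synopsis"],
    "label": ["A", "A", "B", "B"],
})
# Paraphrased titles the model has not seen during fitting
unseen = pd.DataFrame({
    "title": ["visit report for monitoring", "amendment to protocol"],
    "label": ["A", "B"],
})

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      RandomForestClassifier(random_state=0))
model.fit(train["title"], train["label"])

train_score = model.score(train["title"], train["label"])
unseen_score = model.score(unseen["title"], unseen["label"])
print(f"train: {train_score:.2f}, unseen: {unseen_score:.2f}")
```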


### Save Pre-trained Model

In [18]:
from joblib import dump, load

# Save the model
version = '0.1'
model_filename =  f"TMF_classifier_v{version}.joblib"
dump(best_model, model_filename)


['TMF_classifier_v0.1.joblib']
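
A saved pipeline expects the same input shape it was fitted on: a DataFrame with a `stem` column holding stemmed text. The sketch below round-trips a tiny stand-in model through `joblib`; the decision tree and the three training rows are assumptions made to keep the example small, and the file is written to a temporary directory:

```python
import os
import tempfile

import pandas as pd
from joblib import dump, load
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in model with the same input shape as the notebook's pipeline
X_toy = pd.DataFrame({"stem": ["trial master file plan", "qualiti plan", "attend sheet"]})
y_toy = pd.Series(["01.01.01", "01.01.03", "99.00.24"])
toy_model = Pipeline([
    ("preprocessor", ColumnTransformer([("tfidf", TfidfVectorizer(), "stem")])),
    ("classifier", DecisionTreeClassifier(random_state=0)),
]).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_classifier.joblib")
    dump(toy_model, path)
    restored = load(path)

# New titles must be stemmed and wrapped the same way before predicting
new_titles = pd.DataFrame({"stem": ["attend sheet"]})
print(restored.predict(new_titles))
```

Files written by `joblib` are pickles, so they are generally expected to be loaded with the same library versions that produced them.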

### Build the Classifier with GPT-augmented Training Data

In [19]:
# Read the GPT-generated sub-artifact titles from the 'GPT prompts' sheet of the same workbook
tmf_specs_file = "data/specs/TMF-Reference-Model.xlsx"
gpt_sub_titles = pd.read_excel(tmf_specs_file, 
                       sheet_name='GPT prompts',
                       usecols="A, C",
                       header=0,)
gpt_sub_titles.dropna(inplace=True)
gpt_subs = []
gpt_sub_titles.apply(lambda row: split_subartifacts(row, gpt_subs), axis=1)
gpt_subs = pd.DataFrame(gpt_subs)
gpt_subs.head(10)

Unnamed: 0,Artifact #,Subartifact Title
0,01.01.05,study operations manual
1,01.01.05,investigator site file guide
2,01.01.10,study results dissemination plan
3,01.01.10,trial results publication guidelines
4,01.01.12,clinical trial progress update
5,01.01.12,interim trial status summary
6,01.01.13,study update bulletin
7,01.01.13,trial progress newsletter
8,01.01.15,trial filenote log
9,01.01.15,study note tracker


In [20]:
text_var  = 'Subartifact Title'
target = 'Artifact #'

stemmer = StemmingTransformer()
gpt_subs['stem'] = stemmer.transform(gpt_subs[text_var])
print(f'Time lapsed: {stemmer.transform_time:.4f}')
# gpt_subs[[text_var, 'stem', target]]

X_gpt = pd.concat([corpus[['stem']], gpt_subs[['stem']]], axis=0)
y_gpt = pd.concat([corpus[target], gpt_subs[target]], axis=0)
X_gpt

Time lapsed: 0.0313


Unnamed: 0,stem
0,document transfer document
1,evid of qualiti review
2,request to lock tmf
3,trial master file plan
4,trial master file index
...,...
89,study-level submiss dataset
90,interim statist summari
91,mid-trial statist report
92,final statist summari


In [21]:
from sklearn.base import clone
gpt_model = clone(pipeline).fit(X_gpt, y_gpt)  # fresh copy; `best_model` is not refit in place
print(f"GPT score: {gpt_model.score(X_gpt, y_gpt)}")

GPT score: 0.9984802431610942
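
A training score just below 1.0 means at least one training row is predicted differently from its own label. One plausible cause, not verified here, is that a GPT-generated stem duplicates an existing stem under a different `Artifact #`, which no classifier can fit perfectly. The sketch below shows how such conflicts could be listed; the stems and labels are made up:

```python
import pandas as pd

# Made-up stems and placeholder labels, for illustration only
toy = pd.DataFrame({
    "stem": ["interim statist summari", "interim statist summari",
             "final statist summari"],
    "Artifact #": ["label_a", "label_b", "label_c"],
})

# Count distinct labels per stem; more than one means a conflict
labels_per_stem = toy.groupby("stem")["Artifact #"].nunique()
conflicts = labels_per_stem[labels_per_stem > 1].index.tolist()
print(conflicts)
```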


In [22]:
from joblib import dump, load

# Save the model
version = '0.1'
gpt_model_filename =  f"TMF_classifier_gpt_v{version}.joblib"
dump(gpt_model, gpt_model_filename)

['TMF_classifier_gpt_v0.1.joblib']