Classification of Movies.

### Problem Statement
Return to Table of Contents

In the DATA data there are about 1500 cases that use the generic "Injury" code in the "Injury/Illness" column (feature). We want to understand whether a classification model can assign these reports a more relevant injury/illness code. To do this we use NLP and a supervised classification model to explore which injury/illness codes (i.e., classes or labels) would fit better.

This notebook focuses on applying the classification algorithm to the records labeled "INJURY" and exploring which other class may fit each of them.

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime

import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer # wordnet itself is imported from nltk.corpus below
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk import pos_tag # For parts of speech
from nltk import word_tokenize # To create tokens
from nltk.corpus import wordnet, stopwords # For stop words

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.metrics import r2_score, accuracy_score, mean_absolute_error, confusion_matrix, ConfusionMatrixDisplay, classification_report, balanced_accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from IPython.display import display, clear_output

In [None]:
pd.set_option('display.max_colwidth', None) # PD has a limit of 50 characters.  This takes out the limit and uses the full text.
pd.options.display.float_format = "{:.4f}".format # Pandas displays float numbers as 4 decimal places.

In [None]:
# Set the default color cycle for MatPlotLib Plots
#plt.rcParams['axes.prop_cycle'] = plt.cycler(color=['b', 'r', 'g']) # Specifies specific color cycling (https://matplotlib.org/stable/gallery/color/named_colors.html)
#plt.style.available # Available Color Styles
plt.style.use('tableau-colorblind10') # Defining a specific color style to use: Tableau colorblind.
#plt.style.use('seaborn-colorblind') # Defining a specific color style to use. Seaborn-colorblind

#colors = plt.rcParams['axes.prop_cycle'].by_key()['color'] # Extract Colors being defined in the plt.style.use                       
#print('\n'.join(color for color in colors))

# TEXT NORMALIZATION FUNCTIONS

A note of caution: if the normalization functions are used for a query-ranking calculation (e.g., similarity ranking), the same function must be applied to both the cleaned target text and the input query text. For classification we will not call the text normalization function again; however, the vectorizer may apply stopwords, and we may inadvertently introduce new stopwords (more on this below).

For now, recall the list of stopwords and the text normalization functions used in the previous step and Jupyter Notebook.

In [None]:
# Stopwords used in this DATASET.
stopwords_to_add = ['area','building','employee','employees','facility','personnel','work','worker','workers', 
                    'na', 'n/a']
remove_as_stopword = ['no']
#stopwords.words('english') # Default Stopwords from NLTK
stopwords_english = list(filter(lambda w: w not in remove_as_stopword, stopwords.words('english'))) 
stopwords_custom = stopwords_english + stopwords_to_add
print(stopwords_custom)

In [None]:
def text_normalization(text, word_reduction_method):
    text = str(text) # Convert narrative to string.
    df = pd.DataFrame({'': [text]}) # Converts narrative to a dataframe format to use the replace functions.
    df[''] = df[''].str.lower() # Convert narrative to lower case.
    df[''] = df[''].str.replace(r"\d+", " ", regex=True) # Remove numbers (regex=True is required in pandas >= 2.0)
    df[''] = df[''].str.replace(r"[^\w\s]", " ", regex=True) # Remove special characters
    df[''] = df[''].str.replace("_", " ", regex=False) # Remove underscore characters
    df[''] = df[''].str.replace(r"\s+", " ", regex=True) # Replace multiple spaces with a single space
    text = df[''].iloc[0] # Extracts narrative from dataframe (str(df[0:1]) would also include the row index).
    tokenizer = RegexpTokenizer(r'\w+') # Tokenizer.
    tokens = tokenizer.tokenize(text) # Tokenize words.
    # Note for NRC: remove stopwords and words of 1 letter only, so that 2-letter acronyms (e.g., TS) are kept. Can increase to a higher value as needed.
    filtered_words = [w for w in tokens if len(w) > 1 and w not in stopwords_custom]
    if word_reduction_method == 'Lemmatization':
        lemmatizer = WordNetLemmatizer()
        reduced_words = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in filtered_words] # Lemmatization. The second argument is the POS tag.
    elif word_reduction_method == 'Stemming':
        stemmer = PorterStemmer() # Stemming can make words unreadable but is faster than lemmatization.
        reduced_words = [stemmer.stem(w) for w in filtered_words]
    else: # Without this branch an unknown method would fail later with an undefined reduced_words.
        raise ValueError("word_reduction_method must be 'Lemmatization' or 'Stemming'")
    return " ".join(reduced_words) # Join words with space.

def get_wordnet_pos(word): # Reference: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/#wordnetlemmatizer
    #"""Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

In [None]:
# DATA LOADING: DF Cleaned Data

In [None]:
# Read dataframe.

df_DF = pd.read_csv(r'.\output_data\df_DF_clean_ML_All.csv', encoding = "utf-8-sig", #parse_dates=['Occurrence Date', 'Reporting Date', 'Final Date'],
                      keep_default_na=False,
                      na_values=['', '-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A','N/A', '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan']) # Removed 'NA' from list as NA is used for NNSA.

# Encoding "cp1252" or "utf-8-sig" used so that Excel does not create special characters. Standard Python is utf-8.
# See reference for explanation https://stackoverflow.com/questions/57061645/why-is-%C3%82-printed-in-front-of-%C2%B1-when-code-is-run

df_DF.head(1)

In [None]:
df_DF.columns


In [None]:
# Column selection for selecting columns in loops used in the data cleaning, visualization and model functions below.
dfcolumns = list(df_DF.columns.values)
dfcolumns_index = pd.DataFrame(dfcolumns, columns=['column'])
pd.set_option('display.max_rows', None) # Shows all the columns in the DF. Comment out to restore the default row limit.
dfcolumns_index

In [None]:
print(df_DF.shape)
df_DF.info()
# Note that the range index and the total number of rows match.
# There may be some columns here that we may not need but for the exercise we will leave them.

# EDA Pre-Classification

At this point we want to evaluate the data and identify any potential issues (e.g., data balance). Many of the features in the columns could in theory be used to classify the DF reports, provided that the reports include enough information for such a classification. We had previously identified issues with data balance.

In our problem statement we want to reclassify the reports labeled as "INJURY" into something more specific. Given the data we could train a classification model to use the reports to classify the "Body parts".

In [None]:
df_DF.columns


In [None]:
print(f'Number of "ORG" unique values: {len(df_DF["ORG"].unique())}')
df_DF['ORG'].value_counts(dropna = False).head(5)

In [None]:
print(f'Number of "Site" unique values: {len(df_DF["Site"].unique())}')
df_DF['Site'].value_counts(dropna = False).head(5)

In [None]:
print(f'Number of "Activity" unique values: {len(df_DF["Activity"].unique())}')
df_DF['Activity'].value_counts(dropna = False).head(5)

In [None]:
print(f'Number of "Body Part" unique values: {len(df_DF["Body Part"].unique())}')
df_DF['Body Part'].value_counts(dropna = False).head(5)
# Some of the "body parts" will be highly correlated to some "injury/illness" 
# Like "Ear(s)" body part and "Hearing Loss" or "Hearing Impairment" injury/illness.

In [None]:
DF_injury_illness_counts = df_DF['Injury / Illness'].value_counts(dropna = False).rename_axis('unique_values').reset_index(name='counts')
DF_injury_illness_counts#.head(10)

Notes: The df_DF dataset has a total of over 20K reports and 97 classes/labels. Some classes are very well represented while others are not (i.e., the dataset is imbalanced). For example, there are over XXXX reports labeled "Strain", while a few labels (e.g., electrocution, snake bite, hepatitis) have only one report. Because of this, an algorithm will struggle to learn these poorly represented labels from the training data, and with so few records we will also not be able to evaluate its performance on them.

To address this there are a few approaches that we can take:

Get more representative data for those reports that do not have enough records.
Undersampling/oversampling approaches
Obtain more data representative of these codes.
Create synthetic data: With the help of a subject matter expert, create synthetic records with words and terms expected for that label/class.
Augment the data (i.e., from other similar datasets): For example, we could try to obtain similar data from BLS/CDC/NIOSH and integrate it with this one. Integration would need some form of processing step to make the datasets compatible.
Undersampling: Remove labels that do not meet a threshold (e.g., minimum of 50 records). This means that we will lose the labels below the threshold.
Oversampling:
The simplest approach would be to repeat the records a number of times until the class meets a minimum threshold.
Synthetic Minority Over-sampling Technique (SMOTE)
Use the data as is, without modification, acknowledging that the classes with low numbers will probably not be reliable and that we will not be able to evaluate their performance.
Develop a curated standard used for machine learning approaches which addresses known challenges (e.g., data balance, data bias, etc.)
Use a combination of the approaches above.
Manually divide the data, especially for the classes that have a small number of records, acknowledging that there is not enough data and the model may not perform well on these classes.
In our case we would ideally also curate a sample of legitimate "INJURY" reports and use that in the training data of our model.
In any case, the labels with a very small number of records occur infrequently, so there is also a low probability that the reports we want to reclassify would fall within these labels.
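The simplest oversampling option listed above (repeating records until a class meets a minimum threshold) can be sketched as follows. This is only an illustration on made-up data: the helper name `oversample_to_minimum`, the toy column names, and the threshold of 3 are assumptions, not part of the original notebook.

```python
import pandas as pd

def oversample_to_minimum(df, label_col, min_count, random_state=0):
    """Repeat rows of rare classes (sampling with replacement) until
    every class has at least min_count rows. Hypothetical helper."""
    parts = []
    for _, group in df.groupby(label_col):
        parts.append(group)  # always keep the original rows
        shortfall = min_count - len(group)
        if shortfall > 0:
            # Draw the missing rows from the class itself, with replacement.
            parts.append(group.sample(n=shortfall, replace=True,
                                      random_state=random_state))
    return pd.concat(parts, ignore_index=True)

# Toy data: one well-represented class and one class with a single record.
toy = pd.DataFrame({
    "label": ["STRAIN"] * 5 + ["SNAKE BITE"],
    "text": ["strain text"] * 5 + ["snake bite text"],
})
balanced = oversample_to_minimum(toy, "label", min_count=3)
counts = balanced["label"].value_counts().to_dict()
print(counts)
```

If this kind of oversampling is used, it is generally safer to apply it to the training split only, so that copies of the same record do not end up in both the training and the testing data.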

In [None]:
# TO EXPORT AS CSV
DF_injury_illness_counts.to_csv (r'.\output_data\DF_injury_illness_counts.csv', 
                                    encoding='utf-8-sig', index = False, header=True) 

# Address Data Balance

As discussed above, there are multiple approaches that we can take each with its own advantages and disadvantages. For this example, we will use any class label that has at least 60 records.

This approach builds a new dataframe for training/testing that has at least 60 and at most 300 records for each class. Many scripts could be developed to filter the data: we could focus on long narratives, do random sampling, manually select data, etc.

This approach does the following:

Randomly samples 300 records from the classes/labels that have more than 300 records and selects all records from those that have fewer.
Drops all records whose label/class has a value count of less than 60.
Removes the records with the generic "Injury" label, as the purpose of the exercise is that once we have a trained model we want to assign new labels to the records that have the "Injury" label.

In [None]:
# Random Sampling to address data balance.
# Random sample of 300 records for classes that have more than 300 and keeps all records for classes that have fewer.
df_DF_filtered = df_DF.groupby('Injury / Illness').apply(lambda s: s.sample(min(len(s), 
                                                                                      300))).copy().reset_index(drop = True)
# Filters out classes that have fewer than 60 records, if any.
df_DF_filtered = df_DF_filtered.groupby('Injury / Illness').filter(lambda x: len(x) >= 60).reset_index(drop = True)

# We also don't want to have records with the generic "INJURY" as the point of the exercise is to find a better class/label.
df_DF_filtered = df_DF_filtered[df_DF_filtered['Injury / Illness'] != 'INJURY'].reset_index(drop = True)

In [None]:
print(df_DF_filtered.shape)
df_DF_filtered.info()

Notes:

We can see that our df_DF_filtered dataframe now only has labels with value counts between 60 and 300 records.
We will use this new dataframe for training/testing our model.
We could probably spend many hours just discussing whether this is or is not a correct approach.
Some classes (e.g., those that have fewer than 60 records) will not be represented, and the model will never assign them.
Labels like "HEARING IMPAIRMENT" had fewer than 60 records while "HEARING LOSS" had more. As a result, the model will potentially assign "HEARING LOSS" to any hearing-related issue; we need to recognize this limitation of filtering classes out of our training data.

In [None]:
print(df_DF_filtered.shape)
df_DF_filtered['Injury / Illness'].value_counts().rename_axis('unique_values').reset_index(name='counts')

# Bag of Words and Classification Models

This section discusses the Bag of Words (BoW) model matrix and classification model (training, and testing).

The classification model discussed in this notebook is logistic regression, which is one of the simplest and most explainable classification models. Using a different model follows the same process. There are many methods and algorithms for classification (https://www.geeksforgeeks.org/multiclass-classification-using-scikit-learn/).

The approach initially used here is that of BLS Alexander Measures, which uses a logistic regression. Note that these were early approaches; the U.S. Bureau of Labor Statistics (BLS) and the U.S. Centers for Disease Control and Prevention (CDC) National Institute for Occupational Safety and Health (NIOSH) have since published newer methods which may be more accurate. BLS and NIOSH have been using ML classification algorithms for more than a decade to classify worker injury data from U.S. industry. Their data are coded using the Occupational Injury and Illness Classification System (OIICS).

References:

NIOSH CDC NASA AI MSHA Data Autocoding Competition Github (https://github.com/NASA-Tournament-Lab/CDC-NLP-Occ-Injury-Coding): Solutions use deep learning algorithms to classify the data.
BLS Autocoding with Deep Neural Networks (https://www.bls.gov/iif/automated-coding/deep-neural-networks.pdf)
BLS Alexander Measures Autocoding Github (https://github.com/ameasure/autocoding-class): Solution uses a Logistic Regression model to classify the data.
BLS MSHA Autocoding: https://www.bls.gov/iif/automated-coding.htm
MSHA Data and Reports: https://www.msha.gov/data-and-reports
NIOSH AI MSHA Data Autocoding Competition https://archive.cdc.gov/www_cdc_gov/niosh/updates/upd-02-26-20.html
Other References:

https://towardsdatascience.com/how-to-balance-a-dataset-in-python-36dff9d12704
https://towardsdatascience.com/essential-guide-to-multi-class-and-multi-output-algorithms-in-python-3041fea55214

# Bag of Words (BoW) Model Matrix

The Bag of Words (BoW) model represents the text of each record as a vector with one entry per token (e.g., words, phrases, etc.). Stacking these vectors gives a matrix that is then used to perform calculations, in this case to calculate which class or label a text belongs to.

References:

SK-Learn Bag of Words: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#bags-of-words

In [None]:
# Creating BOW for the Target DataFrame with selected Vectorizer TFIDF
# Note that CountVectorizer is another vectorization function.
# The below function defines an instance of the vectorizer with its specific parameters.
# Note that the ngram, max_df and min_df will have a large impact on the size of the matrix.

vectorizer = TfidfVectorizer(lowercase=True, 
                             analyzer='word',
                             stop_words=stopwords_custom, 
                             # Ensures I am applying the same stopwords here and in text normalization function.
                             ngram_range=(1, 2), # Considers 1-grams (i.e., single words) and 2-grams (i.e., two words) 
                             max_df = 0.95, 
                             min_df = 0.001)

The classification algorithm can be applied to the vectorization of different texts in the dataframe. In this notebook we have created two versions of the combined text fields of the df_DF data: one that is lemmatized and another that is stemmed.

As a data scientist you will need to decide which word reduction method (if any) works best for your application. Note that there may be many different approaches to developing a model. The BLS and NIOSH references above also provide examples of different ways to solve the same problem.

It is important to note that NLP problems are high dimensional: each token is a dimension. In many cases it is recommended to use dimensionality reduction methods. See more in the Dimensionality Reduction section.
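As a rough illustration of dimensionality reduction on a TFIDF matrix, the sketch below uses scikit-learn's TruncatedSVD. The choice of TruncatedSVD, the toy documents, and n_components=2 are assumptions made for this example; the method actually used elsewhere in this material may differ.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up narratives; each distinct token becomes one dimension.
docs = [
    "cut finger on blade",
    "insect sting on arm",
    "cut arm on glass",
    "loud noise hearing loss",
]
tfidf = TfidfVectorizer().fit_transform(docs)

# Project the token space onto 2 components (illustrative value only).
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print(tfidf.shape, "->", reduced.shape)
```

TruncatedSVD works directly on sparse matrices, so the TFIDF matrix does not need to be converted with .toarray() first.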

Vectorization function references:

SK Learn TFIDF Vectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
SK Learn Count Vectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
For max_df and min_df explanation: https://stackoverflow.com/questions/27697766/understanding-min-df-and-max-df-in-scikit-countvectorizer
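The effect of max_df and min_df on the vocabulary can be seen on a tiny made-up corpus. The documents and the helper name `vocabulary` below are assumptions for illustration; CountVectorizer is used because it accepts the same two parameters as TfidfVectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Four made-up narratives; "worker" appears in every document.
docs = [
    "worker cut finger",
    "worker cut hand",
    "worker strained back",
    "worker stung by insect",
]

def vocabulary(**kwargs):
    """Return the sorted tokens kept by a CountVectorizer with the given settings."""
    return sorted(CountVectorizer(**kwargs).fit(docs).vocabulary_)

all_terms = vocabulary()                    # no filtering
no_common = vocabulary(max_df=0.9)          # drop tokens in more than 90% of documents
no_rare = vocabulary(min_df=2)              # drop tokens in fewer than 2 documents
both = vocabulary(max_df=0.9, min_df=2)     # apply both filters

print(len(all_terms), len(no_common), no_rare, both)
```

A float value is read as a proportion of documents and an integer as an absolute document count, so the min_df = 0.001 used in this notebook means "at least 0.1% of the documents".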

In [None]:
# BoW Array creates an array of the vectors for each token.
# The vectors for each token are represented as a TFIDF value.
# Note that if we wanted to use the lemmatized version of the text just change the column name.
bow_array = vectorizer.fit_transform(df_DF_filtered['norm_text_stem']).toarray()
#bow_array

In [None]:
features = vectorizer.get_feature_names_out() # Extracts the token names.
df_bow = pd.DataFrame(bow_array, columns = features) # Converts the BoW array to a dataframe with token names.

In [None]:
# The BoW matrix is mostly zeros, which is why it is commonly stored as a sparse matrix.
print(df_bow.shape)
df_bow.tail(15)

# Classification Process

Recall from Notebook_0 that developing and evaluating a supervised ML model (e.g., classification) involves the following general steps:

Defining dependent variable or variables to be predicted (y)
Defining independent variable or variables (x) that will be used to predict y
Scaling/normalization if needed
Train/Test Split (Typically 80/20)
Fitting the training data to the model
Calculate predictions using the testing data
Evaluate Model Performance (e.g., Calculating metrics, predicting, using the model, etc.)
Deploy and use the model.
Data considerations and issues that may affect supervised machine learning:

Data balance (e.g., are all classes represented). Null accuracy may be a worthwhile metric.
Amount of data. Evaluate if there is enough data for dividing the dataset into training/testing (typically 80/20).
Data bias: any biases in the data most probably, knowingly or unknowingly, will also show in the outputs of the model.
Explore the need for using synthetic/augmented data.
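Null accuracy, mentioned in the list above, is the accuracy obtained by always predicting the most frequent class. A minimal sketch on made-up labels follows; the function name `null_accuracy` and the toy labels are assumptions for illustration.

```python
from collections import Counter

def null_accuracy(labels):
    """Return the majority class and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# Made-up, imbalanced label list.
toy_labels = ["STRAIN"] * 7 + ["LACERATION"] * 2 + ["INSECT STING"]
label, score = null_accuracy(toy_labels)
print(label, score)
```

A trained classifier is only useful if its accuracy is clearly above this baseline.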

# Step 1: Defining Dependent Variables

In [None]:
y = df_DF_filtered.loc[:,'Injury / Illness'] # Variable to be predicted


In [None]:
len(y)


# Step 2: Defining Independent Variables

In [None]:
x = df_DF_filtered.loc[:,'norm_text_stem'] # The features we want to use for predicting is the tokens from the narratives.


In [None]:
len(x)


# Step 3: Scaling/Normalization (if needed)

TfidfVectorizer L2-normalizes each TFIDF vector by default (norm='l2'), so there is no need to normalize the data in this case.
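The normalization can be checked directly: with the default settings, every row of the TFIDF matrix should have a Euclidean length of 1. The toy documents below are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cut finger on blade", "insect sting on arm", "cut arm on glass"]

# Default TfidfVectorizer settings; rows are scaled to unit L2 norm.
tfidf_rows = TfidfVectorizer().fit_transform(docs).toarray()
row_norms = np.linalg.norm(tfidf_rows, axis=1)

print(row_norms)
```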

# Step 4: Train/Test Split

We use the scikit-learn train_test_split function (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [None]:
# Dividing the sampled data into 80 for Training and 20 for testing.
X_train, X_test, y_train, y_test = train_test_split(x, y, 
                                                    stratify = y, # Stratified sampling preserves the class proportions in both splits.
                                                    test_size=0.2, # 80/20 for training/testing
                                                    random_state=1)

Stratified sampling does not balance the classes; it keeps the relative class frequencies approximately the same in the training and test sets.
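The following sketch shows this on made-up data with an 80/20 class imbalance (the texts, labels and variable names are assumptions for illustration). After a stratified split, both subsets keep the same 80/20 proportion.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# 100 made-up records: 80 of one class and 20 of another.
texts = [f"report {i}" for i in range(100)]
labels = ["STRAIN"] * 80 + ["INSECT STING"] * 20

_, _, y_tr, y_te = train_test_split(
    texts, labels,
    stratify=labels,   # keep class proportions in both splits
    test_size=0.2,
    random_state=1,
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```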

In [None]:
X_train.sample(2)


# Step 4a: Vectorization

Because this is NLP Classification there is an extra step to vectorize the training and testing data.

In [None]:
# Converts the training text data to a Bag of Words Array Matrix of the vectors of previously defined tokens.
X_train = vectorizer.transform(X_train).toarray()

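# Note that the vectorizer was already fit on the full df_DF_filtered text above.
# Calling fit_transform here instead would build the vocabulary and IDF weights from the training split only,
# which avoids leaking test-set information into training.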

In [None]:
print(f"Total length of the data {len(y)}.") # Total number of values in df_DF_filtered dataframe
print(f"Training length of the data {len(y_train)}.") # This should be approximately 80% of the data.
print(f"Testing length of the data {len(y_test)}.") # This should be approximately 20% of the data.

In [None]:
print(df_DF_filtered.shape) # Shape of the df_DF_filtered dataframe.
print(len(X_train), len(X_train[1])) # Shape of X_train
X_train # Sample of X_train (i.e., sparse matrix of training data)

In [None]:
# Converts the testing data to the BoW Array of the Training data
X_test = vectorizer.transform(X_test).toarray()

IMPORTANT NOTE: If you erroneously apply ".fit_transform" to the test data, you may get an error in later parts of the process. fit_transform recalculates the tokens and the BoW matrix based on the test data, which causes the Train and Test BoW matrices to have different shapes. transform creates the vectors for the Testing data based on the tokens learned from the Training data.
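The shape mismatch described in this note can be reproduced on a few made-up documents (the texts and variable names below are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["cut finger on blade", "insect sting on arm"]
test_docs = ["bee sting on hand"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary from the training text
test_matrix = vec.transform(test_docs)        # reuses that vocabulary: same number of columns

# Refitting on the test text builds a different vocabulary and a different shape.
refit_matrix = TfidfVectorizer().fit_transform(test_docs)

print(train_matrix.shape, test_matrix.shape, refit_matrix.shape)
```

Tokens that appear only in the test text (here "bee" and "hand") are simply ignored by transform, because the fitted vocabulary has no column for them.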



In [None]:
print(df_DF_filtered.shape) # Shape of the df_DF_filtered dataframe.
print(len(X_test), len(X_test[1])) # Shape of X_test
X_test # Sample of X_test (i.e., BoW matrix of the testing data)

# Step 5: Fitting the training data to the model

In this step we will fit the training data to the model. Many classification algorithms can be used, including Logistic Regression, Support Vector Machines (SVM), Naive Bayes, Random Forests, and other classifiers. Typically the performance of several algorithms is compared before one is selected and deployed.

Documentation References:

https://scikit-learn.org/stable/modules/svm.html
Multi-class classification metrics: these differ from binary classification metrics, and the parameters of each metric evaluation function (e.g., average) need to be adjusted accordingly.

https://scikit-learn.org/stable/modules/model_evaluation.html
Confusion Matrix:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html#sklearn.metrics.ConfusionMatrixDisplay

In [None]:
# Defines Classifier (clf) Algorithms
# Applying other algorithms would have a similar script by replacing the appropriate CLF.

#clf = LogisticRegression(C=0.1, solver='liblinear', multi_class='auto', class_weight='balanced', random_state = 0)
clf = RandomForestClassifier(max_depth = None,  class_weight = 'balanced', random_state=0) 
#clf = MultinomialNB() # Requires: from sklearn.naive_bayes import MultinomialNB
#clf = svm.SVC(C = 1.0, class_weight = 'balanced', break_ties = True) # Requires: from sklearn import svm (and probability=True if predict_proba is used later)

# Note that some algorithms may take longer to train/fit.

In [None]:
%%time
# Fits training data to CLF algorithm.
clf.fit(X_train, y_train)

# Step 6: Calculate predictions using the testing data

In [None]:
%%time
# Different CLF algorithms may take longer to calculate predictions.
y_predict = clf.predict(X_test)

cm = confusion_matrix(y_test, y_predict)
print('Confusion Matrix : \n', cm)
total1=sum(sum(cm))

# Step 7: Evaluating Model Performance

In [None]:
%%time
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
fig, ax = plt.subplots(figsize=(12,12))
disp.plot(ax=ax)
plt.xticks(rotation=90)
plt.show();
# Diagonal shows the Predicted matches the Label from the Test data and hence predicted correctly.

In [None]:
# List of targets is needed to calculate some metrics.
target_name_list = sorted(df_DF_filtered['Injury / Illness'].unique().tolist())
print(target_name_list)

Accuracy Score: computes the accuracy, either the fraction (default) or the count (normalize=False) of correct predictions.

Documentation References:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html#sklearn.metrics.balanced_accuracy_score
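The difference between the two scores shows up clearly on an imbalanced toy example (the labels below are made up for illustration). A model that always predicts the majority class looks good on accuracy but not on balanced accuracy.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up test labels: 9 of one class, 1 of another.
y_true_toy = ["STRAIN"] * 9 + ["INSECT STING"]
# A "model" that always predicts the majority class.
y_pred_toy = ["STRAIN"] * 10

acc = accuracy_score(y_true_toy, y_pred_toy)           # fraction of correct predictions
bal = balanced_accuracy_score(y_true_toy, y_pred_toy)  # mean of the per-class recalls

print(acc, bal)
```

Here the recall is 1.0 for "STRAIN" and 0.0 for "INSECT STING", so the balanced accuracy is their average.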

In [None]:
accuracy_score(y_test, y_predict) # Accuracy score is the fraction of predictions that were correct.


In [None]:
balanced_accuracy_score(y_test, y_predict) # Deals with imbalanced datasets. It is the average of the recall obtained on each class.


In [None]:
# Comparing these lists shows whether any class appears in the Training data and 
# not in the Test data, or vice versa. 
# If there are any classes identified that do not occur in one of the training/test/predicted datasets 
# it is probably caused by a low number of records in the class and the split not accounting for such a low number.

unique_train_list = sorted(y_train.unique())
unique_test_list = sorted(y_test.unique())
unique_pred_list = sorted(set(list(y_predict)))
#print(f'Unique Train Classes: {unique_train_list}')
#print(f'Unique Test Classes: {unique_test_list}')
#print(f'Unique Predicted Classes: {unique_pred_list}')
# The symmetric difference (^) returns the classes that are in either set but not in both.
print(f'Classes found in Training or Test data but not both: {set(unique_train_list) ^ set(unique_test_list)}')
print(f'Classes found in Test or Predicted but not both: {set(unique_pred_list) ^ set(unique_test_list)}')

Precision Score: the precision is the ratio tp / (tp + fp), where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative. The best value is 1 and the worst value is 0.

The average parameter has the following options: ‘micro’, ‘macro’, ‘samples’, ‘weighted’, None or the default = ’binary’. This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data.

'binary': Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score).
Documentation References:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score
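The averaging options described above can be compared on a small made-up example with three classes (the labels "A", "B", "C" and the predictions are assumptions for illustration). Class "C" is never predicted, so zero_division=0 is passed to set its precision to 0 without a warning.

```python
from sklearn.metrics import precision_score

# Made-up labels: class "A" has 4 records, "B" and "C" have 1 each.
y_true_toy = ["A", "A", "A", "A", "B", "C"]
y_pred_toy = ["A", "A", "A", "B", "B", "A"]

per_class = precision_score(y_true_toy, y_pred_toy, average=None, zero_division=0)
micro = precision_score(y_true_toy, y_pred_toy, average="micro", zero_division=0)
macro = precision_score(y_true_toy, y_pred_toy, average="macro", zero_division=0)
weighted = precision_score(y_true_toy, y_pred_toy, average="weighted", zero_division=0)

print(per_class)               # one value per class, in sorted label order
print(micro, macro, weighted)  # three different summaries of the same predictions
```

The per-class values are 0.75, 0.5 and 0. Micro averaging pools all predictions (4 correct out of 6), macro averaging takes the plain mean of the three values, and weighted averaging weights each value by the number of true records in that class.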

In [None]:
print(precision_score(y_test, y_predict, labels = target_name_list, average='micro'))


Recall: the recall is the ratio tp / (tp + fn), where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The best value is 1 and the worst value is 0.

Documentation References:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

In [None]:
recall_score(y_test, y_predict, average='micro')


F1 Score: the F1 score can be interpreted as the harmonic mean of the precision and recall; it reaches its best value at 1 and its worst at 0. The relative contributions of precision and recall to the F1 score are equal. The average parameter has the same options as for the precision above.

Documentation References:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
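For a single class, the three metrics can be worked out by hand from the confusion counts. The sketch below uses made-up counts (tp=6, fp=2, fn=4) and a hypothetical helper name, `precision_recall_f1`, purely for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    # Harmonic mean of precision and recall (0 when both are 0).
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
print(round(p, 4), round(r, 4), round(f1, 4))
```

With these counts the precision is 6/8, the recall is 6/10, and the F1 score is 2 * 0.75 * 0.6 / 1.35, which lies between the two and closer to the lower one.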

In [None]:
print(f1_score(y_test, y_predict, labels = target_name_list, average='micro'))


Classification Report: provides the classification metrics for each class as well as for the full dataset. It also provides the number of records or data points used for testing (the support).



In [None]:
print(classification_report(y_test, y_predict, labels = target_name_list, zero_division = 0))
# Observations:
# Classes that had a low number of records tended to have low metrics. 
# Although low metrics can also occur in the classes with high number of records.

Note that many "Injury/Illness" labels and classes are either very generic (e.g., "other...", "multiple...") or overlap with, or relate to, one another in their definitions.

Having more data on these labels/classes would help improve the overall model performance.

# Step 8: Deploy and use the Model

Now that we have a model we can use this to assign, classify, or predict a better class to the generic "INJURY" reports.

First we want to filter those reports that have the "INJURY" class applied in the "Injury / Illness" column.

In [None]:
df_DF_INJURY = df_DF[df_DF['Injury / Illness'] == 'INJURY'].reset_index(drop = True)


In [None]:
print(df_DF_INJURY.shape)
df_DF_INJURY['Injury / Illness'].value_counts()

Now that we have the data filtered, we want to use our classifier to predict which would have been a potential better class.



In [None]:
x_INJURY = df_DF_INJURY.loc[:,'norm_text_stem'] # The features we want to use for predicting is the tokens from the narratives.


In [None]:
x_INJURY.sample(2)


In [None]:
# Converts the INJURY data to the BoW Array of the Training data
x_INJURY = vectorizer.transform(x_INJURY).toarray()

In [None]:
print(len(X_test), len(X_test[1])) # Shape of X_test
print(len(x_INJURY), len(x_INJURY[1])) # Shape of X_INJURY. The shape of [1] should match the Training data.
x_INJURY

In [None]:
%%time
# Different CLF algorithms may take longer to calculate predictions.
y_predict_INJURY = clf.predict(x_INJURY)  # Predicted values for our Injury Reports

In [None]:
print(len(y_predict_INJURY)) # This should match our number of Injury reports.
y_predict_INJURY[:3]

In [None]:
# Create a new column with the predicted values.
df_DF_INJURY['Predicted Injury/Illness'] = y_predict_INJURY

In [None]:
# We can extract the probability of the prediction.
# Note that this is a probability of all labels/classes being predicted.
y_predict_INJURY_prob = clf.predict_proba(x_INJURY)
df_DF_INJURY['Prediction Probability'] = y_predict_INJURY_prob.max(axis=1)

In [None]:
df_DF_INJURY['Predicted Injury/Illness'].value_counts()


Let's explore some of the "Injury / Illness" classes that should be obvious to both a person and a classification model. The following classes also do not seem to overlap with other labels/classes.

INSECT STING: Should have terms such as insect, sting, insect names, etc.
HEARING LOSS: Should have terms such as hearing, noise, dB, hearing protection, audio, etc.
LOSS OF CONSCIOUSNESS: Should have terms such as unconscious, pass out, faint, etc.
COVID-19: Should have terms such as virus, COVID, quarantine, etc.

We could use the prediction probability to decide how to handle each report: predictions with a high probability could be assigned by the ML classification model, while those with a low probability would require manual review by an SME.
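
The probability-based routing described above could look like the following sketch. The data frame, the `Action` column, and the `REVIEW_THRESHOLD` value of 0.75 are assumptions for illustration only; a real cut-off would need to be chosen against validation data.

```python
import pandas as pd

# Hypothetical predictions; in this notebook they would come from df_DF_INJURY.
toy_predictions = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Predicted Injury/Illness": ["INSECT STING", "COVID-19", "HEARING LOSS", "INSECT STING"],
    "Prediction Probability": [0.92, 0.41, 0.77, 0.30],
})

REVIEW_THRESHOLD = 0.75  # Assumed cut-off, not a recommended value.

# Rows at or above the threshold are auto-assigned; the rest go to SME review.
is_confident = toy_predictions["Prediction Probability"] >= REVIEW_THRESHOLD
toy_predictions["Action"] = is_confident.map({True: "AUTO-ASSIGN", False: "SME REVIEW"})

print(toy_predictions["Action"].value_counts())
```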

Recall that some of these will be legitimate generic "Injury" reports. Ideally, our model's training data would also include examples of what these generic "Injury" reports look like, so that the model can choose that class when appropriate.

In [None]:
df_DF_INJURY[df_DF_INJURY['Predicted Injury/Illness'] == 'INSECT STING']\
[['ID', 'COMBINED_NARRATIVES', 'Injury / Illness', 'Predicted Injury/Illness', 'Prediction Probability']]\
.sort_values('Prediction Probability', ascending = False)

In [None]:
df_DF_INJURY[df_DF_INJURY['Predicted Injury/Illness'] == 'COVID-19']\
[['ID', 'COMBINED_NARRATIVES', 'Injury / Illness', 'Predicted Injury/Illness', 'Prediction Probability']]\
.sort_values('Prediction Probability', ascending = False)

In [None]:
df_DF_INJURY[df_DF_INJURY['Predicted Injury/Illness'] == 'HEARING LOSS']\
[['ID', 'COMBINED_NARRATIVES', 'Injury / Illness', 'Predicted Injury/Illness', 'Prediction Probability']]\
.sort_values('Prediction Probability', ascending = False)

# RESULTS EXPORT


In [None]:
# TO EXPORT AS CSV
df_DF_INJURY.to_csv(r'.\output_data\df_DF_INJURY.csv',
                    encoding='utf-8-sig', index = False, header=True)

# Using the Classifier as Recommendation System

A model that recommends, or even fully automates, the classification of categorical values from the description text is possible. This system could be developed within df_DF or any other system (e.g., OTHER to predict keywords). It would minimize human errors (e.g., misclassification) and could resolve issues like the one above, where the best label/class is not applied for whatever reason (e.g., staff not familiar with all classes, time constraints, lack of training, interpretation, laziness, etc.). It would also make the process more efficient while increasing data quality. However, this would require a few things:

Good quality training data, potentially a curated data set
Willingness from EHSS

The example below uses an input text box where we can submit a text description (e.g., df_DF combined text), and it recommends the best "INJURY/ILLNESS" classes given the classifier model's training data.

In [None]:
# Prompt for a custom narrative.
text_to_predict = pd.Series(input('Input text to predict df_DF "INJURY/ILLNESS" Category'))
# Normalizes the input text using stemming and converts it to a Pandas Series, which is the input format used for the vectorizer.
text_to_predict = pd.Series(text_normalization(text = text_to_predict, word_reduction_method = "Stemming"))

# Transform the input narrative into the BoW model vector array
X_input = vectorizer.transform(text_to_predict).toarray()
# Predicts the "INJURY/ILLNESS" class given the input text vectors
y_predict_input = clf.predict(X_input)
# Probabilities of all classes for the input text, computed once and reused below.
y_predict_input_probs = clf.predict_proba(X_input)[0]
# Probability of the prediction. Because this is a multiclass problem, the probability of the assigned class can be lower than 0.5.
y_predict_input_prob = y_predict_input_probs.max()

df_prediction_probabilities = pd.DataFrame()
# Class labels. clf.classes_ follows the column order of predict_proba; y.unique() does not guarantee that order.
df_prediction_probabilities['CLASS'] = clf.classes_
df_prediction_probabilities['PROBABILITIES'] = y_predict_input_probs # Probabilities of all classes.

# Prints the predicted class and its probability.
print(f"\nPredicted Text INJURY/ILLNESS is: {y_predict_input[0]}")
print(f"Probability of prediction is: {y_predict_input_prob}")

display(df_prediction_probabilities.sort_values('PROBABILITIES', ascending = False))
print(f"Total of probabilities: {round(df_prediction_probabilities['PROBABILITIES'].sum(), 2)}.")
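
For a recommendation system, the logic of the cell above could be wrapped in a reusable function that returns only the few most likely classes. The sketch below is one possible shape for such a helper, not part of this notebook: `recommend_classes`, its parameters, and the toy data are hypothetical, and it assumes the texts passed in are already normalized and that the classifier exposes `predict_proba` and `classes_`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def recommend_classes(texts, fitted_vectorizer, fitted_clf, top_k=3):
    """Return the top_k most probable classes for each already-normalized text."""
    proba = fitted_clf.predict_proba(fitted_vectorizer.transform(texts))
    rows = []
    for i, text in enumerate(texts):
        best = np.argsort(proba[i])[::-1][:top_k]  # Column indices, most probable first.
        for rank, j in enumerate(best, start=1):
            rows.append({"TEXT": text, "RANK": rank,
                         "CLASS": fitted_clf.classes_[j], "PROBABILITY": proba[i, j]})
    return pd.DataFrame(rows)

# Hypothetical, already-stemmed training data used only to exercise the function.
toy_texts = ["bee sting arm", "wasp sting hand", "loud nois ear",
             "hear loss nois", "faint pass out", "pass out dizzi"]
toy_labels = ["INSECT STING", "INSECT STING", "HEARING LOSS",
              "HEARING LOSS", "LOSS OF CONSCIOUSNESS", "LOSS OF CONSCIOUSNESS"]
toy_vectorizer = CountVectorizer()
toy_clf = LogisticRegression(max_iter=1000)
toy_clf.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

recommendations = recommend_classes(["sting on hand"], toy_vectorizer, toy_clf, top_k=2)
print(recommendations)
```

Returning a ranked short list rather than a single label keeps a person in the loop: the user picks from the suggestions instead of the system silently assigning a class.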

We can explore some narratives that contain specific words (e.g., "snake"). The word "snake" may be related to various types of injuries, such as those involving motorized plumbing snakes, the animal, and others. The model uses all of the words in the training data related to an injury to determine its type or class of injury/illness. Depending on how much training data a class had, the model may have had an easier or harder time classifying it.

In [None]:
df_DF[df_DF['COMBINED_NARRATIVES'].str.contains('snake', case = False, na = False)]\
[['COMBINED_NARRATIVES','Injury / Illness']].sample(10)
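
Beyond sampling rows, it can help to count how the labels are distributed among narratives that contain a keyword. The sketch below uses a hypothetical `toy_reports` data frame with made-up narratives and labels (they are not taken from df_DF); only the column names mirror this notebook.

```python
import pandas as pd

# Hypothetical narratives and labels, for illustration only.
toy_reports = pd.DataFrame({
    "COMBINED_NARRATIVES": [
        "Employee was bitten by a snake near the trail.",
        "Plumber strained wrist using a motorized drain snake.",
        "Snake auger kicked back and cut the employee's hand.",
        "Employee stung by a wasp.",
        None,  # Missing narrative; na=False treats it as "no match".
    ],
    "Injury / Illness": ["ANIMAL BITE", "STRAIN", "LACERATION", "INSECT STING", "INJURY"],
})

has_keyword = toy_reports["COMBINED_NARRATIVES"].str.contains("snake", case=False, na=False)
label_counts = toy_reports.loc[has_keyword, "Injury / Illness"].value_counts()
print(label_counts)
```

A keyword spread across many labels, as in this toy example, suggests the word alone is a weak signal and the model has to rely on the surrounding words.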

# NOTEBOOK END