# OrderImaging Phrase Structures Final Analytics

#### Overview
- Notebook by Adam Lang
- This is a demo of work I completed while at Nuance/Microsoft in 2023.
- The purpose of this notebook is to complete the analysis of the OrderImaging phrase patterns from the previous notebook.

# Part 1: Duplicate/Overlapping patterns seen in NLU model + Core n-grams

### 1. Import libraries

In [3]:
#import libraries
import pandas as pd

### 2. Load Excel files into pandas

In [4]:
#load data
df1 = pd.read_excel("sample_pattern_transformed_data.xlsx", sheet_name="sample_pattern_transformed_data")
df2 = pd.read_excel("sample_pattern_transformed_data.xlsx", sheet_name="Comparison_Tab")

In [5]:
#check df1
df1.head()

Unnamed: 0,Index,number_of_samples_in_NLU_model,Phrase_Samples_with_original_entities,Phrase_Samples_with_new_entities,ordered_entities,entity_pattern,number_of_entities_used
0,0,38,"(ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, ...","ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, M...",MODALITY BODY_REGION CONTRAST_MODIFIER,MBC,3
1,1,35,"(ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, ...","ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, M...",MODALITY BODY_REGION,MB,2
2,2,31,"(ORDER_ACTION, MODALITY, PRIORITY)","ORDER_ACTION, MODALITY, PRIORITY",MODALITY PRIORITY,MPr,2
3,3,29,"(ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, ...","ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, M...",MODALITY BODY_REGION PRIORITY,MBPr,3
4,4,28,"(ORDER_ACTION, MODALITY, FOR_SECRET, BODY_REGION)","ORDER_ACTION, MODALITY, FOR_SECRET, BODY_REGION",MODALITY BODY_REGION,MB,2


In [6]:
#df1 columns - review columns in dataframe
df1.columns

Index(['Index', 'number_of_samples_in_NLU_model',
       'Phrase_Samples_with_original_entities',
       'Phrase_Samples_with_new_entities', 'ordered_entities',
       'entity_pattern', 'number_of_entities_used'],
      dtype='object')

In [7]:
#df1 info
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2220 entries, 0 to 2219
Data columns (total 7 columns):
 #   Column                                 Non-Null Count  Dtype 
---  ------                                 --------------  ----- 
 0   Index                                  2220 non-null   int64 
 1   number_of_samples_in_NLU_model         2220 non-null   int64 
 2   Phrase_Samples_with_original_entities  2220 non-null   object
 3   Phrase_Samples_with_new_entities       2220 non-null   object
 4   ordered_entities                       2220 non-null   object
 5   entity_pattern                         2220 non-null   object
 6   number_of_entities_used                2220 non-null   int64 
dtypes: int64(3), object(4)
memory usage: 121.5+ KB


In [8]:
#df2
df2.head(10)

Unnamed: 0,transformed_pattern,entity_pattern_count,Sample Count
0,BMPPC,5,3
1,BMPPPC,6,3
2,BMPPPV,6,37
3,BMPPPVC,7,3
4,BMPPVC,6,3
5,BMPVC,5,3
6,BMPrC,4,6
7,BMPrP,4,28
8,BMPrPP,5,29
9,BMPrPPP,6,26


In [9]:
#df2 info
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92 entries, 0 to 91
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   transformed_pattern   92 non-null     object
 1   entity_pattern_count  92 non-null     int64 
 2   Sample Count          92 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 2.3+ KB


### 3. Filter and select the relevant columns from df2

In [10]:
#filter and select relevant columns
filtered_df2 = df2[df2['transformed_pattern'].isin(df1['entity_pattern'])][['transformed_pattern', 'Sample Count']]

In [11]:
#view filtered df2
filtered_df2.head()

Unnamed: 0,transformed_pattern,Sample Count
0,BMPPC,3
1,BMPPPC,3
2,BMPPPV,37
3,BMPPPVC,3
4,BMPPVC,3
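
The filter in step 3 combines a boolean row mask with a column selection. A minimal sketch of the same idea on toy data is shown here; the frames `samples` and `comparison` and their values are illustrative stand-ins, not the real sheets.

```python
import pandas as pd

# Toy stand-ins for df1 and df2 (illustrative values, not the real data)
samples = pd.DataFrame({"entity_pattern": ["MB", "MBC", "MB", "MPr"]})
comparison = pd.DataFrame({
    "transformed_pattern": ["MB", "MPr", "BMPPC"],
    "Sample Count": [10, 4, 3],
})

# Boolean mask: True where the comparison pattern also occurs in the samples
mask = comparison["transformed_pattern"].isin(samples["entity_pattern"])

# .loc applies the row mask and the column selection in a single step
filtered = comparison.loc[mask, ["transformed_pattern", "Sample Count"]]
print(filtered)
```

Using `.loc` with the mask and the column list is equivalent to the chained indexing used in the notebook, and it tends to be easier to read when the line gets long.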


### 4. Merge df1 with the filtered_df2 based on common column "entity_pattern"

In [12]:
#merge based on common column
merged_df = pd.merge(df1, filtered_df2, how='inner', left_on='entity_pattern', right_on='transformed_pattern')


#head
merged_df.head()

Unnamed: 0,Index,number_of_samples_in_NLU_model,Phrase_Samples_with_original_entities,Phrase_Samples_with_new_entities,ordered_entities,entity_pattern,number_of_entities_used,transformed_pattern,Sample Count
0,27,18,"(ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, ...","ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, P...",PRIORITY MODALITY BODY_REGION CONTRAST_MODIFIER,PrMBC,4,PrMBC,31
1,35,17,"(ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, ...","ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, M...",PRIORITY MODALITY BODY_REGION CONTRAST_MODIFIER,PrMBC,4,PrMBC,31
2,45,17,"(ORDER_ACTION, FOR_SECRET, PATIENT_SECRET, PRI...","ORDER_ACTION, FOR_SECRET, PATIENT_SECRET, PRIO...",PRIORITY MODALITY BODY_REGION CONTRAST_MODIFIER,PrMBC,4,PrMBC,31
3,47,16,"(ORDER_ACTION, A_AN_SECRET, PRIORITY, MODALITY...","ORDER_ACTION, A_AN_SECRET, PRIORITY, MODALITY,...",PRIORITY MODALITY BODY_REGION CONTRAST_MODIFIER,PrMBC,4,PrMBC,31
4,51,16,"(ORDER_ACTION, FOR_SECRET, MY_THE_THIS_SECRET,...","ORDER_ACTION, FOR_SECRET, MY_THE_THIS_SECRET, ...",PRIORITY MODALITY BODY_REGION CONTRAST_MODIFIER,PrMBC,4,PrMBC,31
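
The merge attaches the per-pattern `Sample Count` to every sample row that shares the pattern, so one comparison row can match many sample rows. The sketch below reproduces that behaviour on toy data (frame names and values are assumptions for illustration). It also passes `validate="many_to_one"`, which asks pandas to raise an error if a pattern appears more than once on the right-hand side; that check is an optional safeguard, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for df1 and filtered_df2 (illustrative values)
samples = pd.DataFrame({
    "Index": [0, 1, 2, 3],
    "entity_pattern": ["MB", "MBC", "MB", "MPr"],
})
comparison = pd.DataFrame({
    "transformed_pattern": ["MB", "MPr", "BMPPC"],
    "Sample Count": [10, 4, 3],
})

# Inner merge keeps only sample rows whose pattern exists in the comparison frame
merged = pd.merge(
    samples,
    comparison,
    how="inner",
    left_on="entity_pattern",
    right_on="transformed_pattern",
    validate="many_to_one",  # fail loudly if a right-hand pattern is duplicated
)
print(merged)
```

Both `MB` rows receive the same `Sample Count`, while the `MBC` row is dropped because it has no match.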


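### 5. Drop extra 'transformed_pattern' column and rename 'Sample Count' to match 'entity_pattern'

In [13]:
#drop the duplicate key column left over from the merge
merged_df.drop(columns=['transformed_pattern'], inplace=True)
#NOTE: this rename is a no-op. The column is named 'Sample Count' (with a space), not 'Sample_Count',
#so rename() silently ignores the key and the column keeps its name (see the info() output in step 6)
merged_df.rename(columns={'Sample_Count': 'transformed_pattern'}, inplace=True)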

### 6. Final dataframe

In [14]:
#print df
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1518 entries, 0 to 1517
Data columns (total 8 columns):
 #   Column                                 Non-Null Count  Dtype 
---  ------                                 --------------  ----- 
 0   Index                                  1518 non-null   int64 
 1   number_of_samples_in_NLU_model         1518 non-null   int64 
 2   Phrase_Samples_with_original_entities  1518 non-null   object
 3   Phrase_Samples_with_new_entities       1518 non-null   object
 4   ordered_entities                       1518 non-null   object
 5   entity_pattern                         1518 non-null   object
 6   number_of_entities_used                1518 non-null   int64 
 7   Sample Count                           1518 non-null   int64 
dtypes: int64(4), object(4)
memory usage: 95.0+ KB
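
The `info()` output above still lists `Sample Count`, which shows that the rename in step 5 did not take effect: the key `'Sample_Count'` does not match the actual column name. The sketch below illustrates this on a toy frame and shows how `errors="raise"` can surface such a mismatch. The target name `pattern_sample_count` is hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy frame with the same column naming as the merged data (illustrative values)
df = pd.DataFrame({"entity_pattern": ["MB"], "Sample Count": [10]})

# 'Sample_Count' (underscore) does not exist, so rename() returns the frame unchanged
unchanged = df.rename(columns={"Sample_Count": "pattern_sample_count"})
print(list(unchanged.columns))

# errors="raise" turns the silent no-op into a KeyError
try:
    df.rename(columns={"Sample_Count": "pattern_sample_count"}, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)

# With the correct key (space, not underscore) the rename takes effect
renamed = df.rename(columns={"Sample Count": "pattern_sample_count"})
print(list(renamed.columns))
```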


### 7. To CSV

In [15]:
#to csv
merged_df.to_csv('sample_duplicate_patterns_df.csv', index=True)
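
Writing with `index=True` stores the row index as an extra, unnamed first column, which pandas labels `Unnamed: 0` when the file is read back. A small round-trip sketch on toy data (written to an in-memory string rather than a file) shows the difference; whether the index is worth keeping depends on how the CSV is consumed downstream.

```python
import io
import pandas as pd

# Toy frame (illustrative values)
df = pd.DataFrame({"entity_pattern": ["MB", "MPr"], "Sample Count": [10, 4]})

# to_csv() without a path returns the CSV text, so no file is needed here
with_index = pd.read_csv(io.StringIO(df.to_csv(index=True)))
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```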

# Part 2: NLU samples not in core list
- The purpose of this portion of the notebook is to find which string patterns or NLU samples are not in the core list of samples.

### 1. Import libraries

In [17]:
#import libraries
import pandas as pd

### 2. Load data from Excel to pandas

In [16]:
#load data
df3 = pd.read_excel("sample_pattern_transformed_data.xlsx", sheet_name="sample_pattern_transformed_data")
df4 = pd.read_excel("sample_pattern_transformed_data.xlsx", sheet_name="Comparison_Tab")

In [17]:
#check df3 - full dataset
df3.head()

Unnamed: 0,Index,number_of_samples_in_NLU_model,Phrase_Samples_with_original_entities,Phrase_Samples_with_new_entities,ordered_entities,entity_pattern,number_of_entities_used
0,0,38,"(ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, ...","ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, M...",MODALITY BODY_REGION CONTRAST_MODIFIER,MBC,3
1,1,35,"(ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, ...","ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, M...",MODALITY BODY_REGION,MB,2
2,2,31,"(ORDER_ACTION, MODALITY, PRIORITY)","ORDER_ACTION, MODALITY, PRIORITY",MODALITY PRIORITY,MPr,2
3,3,29,"(ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, ...","ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, M...",MODALITY BODY_REGION PRIORITY,MBPr,3
4,4,28,"(ORDER_ACTION, MODALITY, FOR_SECRET, BODY_REGION)","ORDER_ACTION, MODALITY, FOR_SECRET, BODY_REGION",MODALITY BODY_REGION,MB,2


In [18]:
#df3 columns
df3.columns

Index(['Index', 'number_of_samples_in_NLU_model',
       'Phrase_Samples_with_original_entities',
       'Phrase_Samples_with_new_entities', 'ordered_entities',
       'entity_pattern', 'number_of_entities_used'],
      dtype='object')

In [19]:
#df3 info
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2220 entries, 0 to 2219
Data columns (total 7 columns):
 #   Column                                 Non-Null Count  Dtype 
---  ------                                 --------------  ----- 
 0   Index                                  2220 non-null   int64 
 1   number_of_samples_in_NLU_model         2220 non-null   int64 
 2   Phrase_Samples_with_original_entities  2220 non-null   object
 3   Phrase_Samples_with_new_entities       2220 non-null   object
 4   ordered_entities                       2220 non-null   object
 5   entity_pattern                         2220 non-null   object
 6   number_of_entities_used                2220 non-null   int64 
dtypes: int64(3), object(4)
memory usage: 121.5+ KB


In [20]:
#df4 head
df4.head()

Unnamed: 0,transformed_pattern,entity_pattern_count,Sample Count
0,BMPPC,5,3
1,BMPPPC,6,3
2,BMPPPV,6,37
3,BMPPPVC,7,3
4,BMPPVC,6,3


In [21]:
#df4 info
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92 entries, 0 to 91
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   transformed_pattern   92 non-null     object
 1   entity_pattern_count  92 non-null     int64 
 2   Sample Count          92 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 2.3+ KB


### 3. Filter and select the relevant columns from df4

In [22]:
#filter and select relevant columns
filtered_df4 = df4[df4['transformed_pattern'].isin(df3['entity_pattern'])][['transformed_pattern', 'Sample Count']]

In [23]:
#view filtered df4
filtered_df4.head()

Unnamed: 0,transformed_pattern,Sample Count
0,BMPPC,3
1,BMPPPC,3
2,BMPPPV,37
3,BMPPPVC,3
4,BMPPVC,3


In [24]:
#filtered_df4 info
filtered_df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92 entries, 0 to 91
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   transformed_pattern  92 non-null     object
 1   Sample Count         92 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1.6+ KB


### 4. Merge df3 with filtered_df4 based on common column entity_pattern

In [25]:
#merge based on common column
merged_df_NLU = pd.merge(df3, filtered_df4, how='inner', left_on='entity_pattern', right_on='transformed_pattern')

In [26]:
#head
merged_df_NLU.head()

Unnamed: 0,Index,number_of_samples_in_NLU_model,Phrase_Samples_with_original_entities,Phrase_Samples_with_new_entities,ordered_entities,entity_pattern,number_of_entities_used,transformed_pattern,Sample Count
0,27,18,"(ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, ...","ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, P...",PRIORITY MODALITY BODY_REGION CONTRAST_MODIFIER,PrMBC,4,PrMBC,31
1,35,17,"(ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, ...","ORDER_ACTION, ORDER_TYPE_SECRET, FOR_SECRET, M...",PRIORITY MODALITY BODY_REGION CONTRAST_MODIFIER,PrMBC,4,PrMBC,31
2,45,17,"(ORDER_ACTION, FOR_SECRET, PATIENT_SECRET, PRI...","ORDER_ACTION, FOR_SECRET, PATIENT_SECRET, PRIO...",PRIORITY MODALITY BODY_REGION CONTRAST_MODIFIER,PrMBC,4,PrMBC,31
3,47,16,"(ORDER_ACTION, A_AN_SECRET, PRIORITY, MODALITY...","ORDER_ACTION, A_AN_SECRET, PRIORITY, MODALITY,...",PRIORITY MODALITY BODY_REGION CONTRAST_MODIFIER,PrMBC,4,PrMBC,31
4,51,16,"(ORDER_ACTION, FOR_SECRET, MY_THE_THIS_SECRET,...","ORDER_ACTION, FOR_SECRET, MY_THE_THIS_SECRET, ...",PRIORITY MODALITY BODY_REGION CONTRAST_MODIFIER,PrMBC,4,PrMBC,31


In [27]:
#merged_df_NLU info
merged_df_NLU.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1518 entries, 0 to 1517
Data columns (total 9 columns):
 #   Column                                 Non-Null Count  Dtype 
---  ------                                 --------------  ----- 
 0   Index                                  1518 non-null   int64 
 1   number_of_samples_in_NLU_model         1518 non-null   int64 
 2   Phrase_Samples_with_original_entities  1518 non-null   object
 3   Phrase_Samples_with_new_entities       1518 non-null   object
 4   ordered_entities                       1518 non-null   object
 5   entity_pattern                         1518 non-null   object
 6   number_of_entities_used                1518 non-null   int64 
 7   transformed_pattern                    1518 non-null   object
 8   Sample Count                           1518 non-null   int64 
dtypes: int64(4), object(5)
memory usage: 106.9+ KB
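
The inner merge above keeps only rows whose pattern appears in both sheets, which is why it yields the same 1518 rows as Part 1. To list the samples that are not in the comparison list, one option is an anti-join. The sketch below shows two common ways to express it on toy data; the frame names and values are illustrative assumptions, and it is not the method used in the original notebook.

```python
import pandas as pd

# Toy stand-ins for df3 and df4 (illustrative values)
samples = pd.DataFrame({
    "Index": [0, 1, 2, 3],
    "entity_pattern": ["MB", "XYZ", "MB", "QRS"],
})
comparison = pd.DataFrame({
    "transformed_pattern": ["MB", "MPr"],
    "Sample Count": [10, 4],
})

# Option 1: negate the isin mask to keep samples whose pattern is NOT in the list
in_list = samples["entity_pattern"].isin(comparison["transformed_pattern"])
not_in_list = samples[~in_list]

# Option 2: left merge with indicator=True, then keep rows found only on the left
flagged = samples.merge(
    comparison,
    how="left",
    left_on="entity_pattern",
    right_on="transformed_pattern",
    indicator=True,
)
left_only = flagged[flagged["_merge"] == "left_only"]

print(not_in_list)
print(left_only[["Index", "entity_pattern"]])
```

On the real data, and assuming each `transformed_pattern` value is unique in the comparison sheet, either option should select the 2220 - 1518 = 702 sample rows that the inner merge drops.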


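### 5. Drop extra 'transformed_pattern' column and rename 'Sample Count' to match 'entity_pattern'

In [28]:
#drop the duplicate key column left over from the merge
merged_df_NLU.drop(columns=['transformed_pattern'], inplace=True)
#NOTE: this rename is a no-op. The column is named 'Sample Count' (with a space), not 'Sample_Count',
#so rename() silently ignores the key and the column keeps its name (see the info() output in step 6)
merged_df_NLU.rename(columns={'Sample_Count': 'transformed_pattern'}, inplace=True)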

### 6. Final Dataframe

In [29]:
#merged_df_NLU info
merged_df_NLU.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1518 entries, 0 to 1517
Data columns (total 8 columns):
 #   Column                                 Non-Null Count  Dtype 
---  ------                                 --------------  ----- 
 0   Index                                  1518 non-null   int64 
 1   number_of_samples_in_NLU_model         1518 non-null   int64 
 2   Phrase_Samples_with_original_entities  1518 non-null   object
 3   Phrase_Samples_with_new_entities       1518 non-null   object
 4   ordered_entities                       1518 non-null   object
 5   entity_pattern                         1518 non-null   object
 6   number_of_entities_used                1518 non-null   int64 
 7   Sample Count                           1518 non-null   int64 
dtypes: int64(4), object(4)
memory usage: 95.0+ KB


### 7. To CSV

In [30]:
#to csv
merged_df_NLU.to_csv('sample_pattern_NLU_not_in_core_df.csv', index=True)