# Coding Treatment Keywords

With a clean dataset of health records, we can now scan the report text for keywords. Each keyword gets its own boolean column that records whether the keyword, or one of its synonyms, appears in the text. Finally, we count the True values in each row, which makes it easy to find records that the script has not classified.

In [1]:
import pandas as pd

In [2]:
origin_df = pd.read_csv('C:\\Users\\avery\\OneDrive\\health_database_docs\\pos_only_second_review.csv', parse_dates=['report_date', 'report_time', 'timestamp'])

print(origin_df.dtypes)

print(origin_df.columns)

person_id                  int64
report_date       datetime64[ns]
report_time       datetime64[ns]
timestamp         datetime64[ns]
ailment_text              object
treatment_text            object
temp                     float64
season_visits              int64
is_camper                   bool
home_notified               bool
fixed                       bool
flagged                     bool
flagged_again               bool
dtype: object
Index(['person_id', 'report_date', 'report_time', 'timestamp', 'ailment_text',
       'treatment_text', 'temp', 'season_visits', 'is_camper', 'home_notified',
       'fixed', 'flagged', 'flagged_again'],
      dtype='object')


In [3]:
print(origin_df.shape)

(3794, 13)


Import the CSV containing keywords and their synonyms.

In [4]:
keyword_df = pd.read_csv('C:\\Users\\avery\\OneDrive\\health_database_docs\\keyword_v1.csv')

keyword_df.fillna('----', inplace=True)

# copy first column so the code includes it in the search later
keyword_df['index_copy'] = keyword_df['keyword']

# then set first column as index
keyword_df.set_index('index_copy', inplace=True)


print(keyword_df.head())

                       keyword        syn_1  syn_2   syn_3      syn_4  \
index_copy                                                              
abdominal pain  abdominal pain     abd pain   ----    ----       ----   
abrasion              abrasion       scrape   ----    ----       ----   
allergy                allergy     allergic  runny  sneeze  histamine   
anaphylaxis        anaphylaxis  epinephrine   ----    ----       ----   
blister                blister     hot spot   ----    ----       ----   

                    syn_5 syn_6 syn_7 Unnamed: 8  
index_copy                                        
abdominal pain       ----  ----  ----       ----  
abrasion             ----  ----  ----       ----  
allergy         allergies  ----  ----       ----  
anaphylaxis          ----  ----  ----       ----  
blister              ----  ----  ----       ----  


Transpose the table so that each keyword becomes a column, which makes it easy to convert into a dictionary in the next step.

In [5]:
transposed = keyword_df.transpose(copy=True)
print(transposed.head())

index_copy  abdominal pain  abrasion    allergy  anaphylaxis   blister  \
keyword     abdominal pain  abrasion    allergy  anaphylaxis   blister   
syn_1             abd pain    scrape   allergic  epinephrine  hot spot   
syn_2                 ----      ----      runny         ----      ----   
syn_3                 ----      ----     sneeze         ----      ----   
syn_4                 ----      ----  histamine         ----      ----   

index_copy         bm     bruise  bug bite  congestion         cut  ...  \
keyword            bm     bruise  bug bite  congestion         cut  ...   
syn_1           bowel  contusion  mosquito       nasal  laceration  ...   
syn_2             lax       ----      bite      stuffy       slice  ...   
syn_3        movement       ----    papule   congested     scratch  ...   
syn_4       constipat       ----       fly       sinus        ----  ...   

index_copy  predator  capture  hike   hit    campout  anxiety   lice  grass  \
keyword     predator  cap

Create a dictionary from the dataframe, mapping each keyword to its list of search terms.

In [6]:
keyword_dict = transposed.to_dict(orient='list')
print(keyword_dict)

{'abdominal pain': ['abdominal pain', 'abd pain', '----', '----', '----', '----', '----', '----', '----'], 'abrasion': ['abrasion', 'scrape', '----', '----', '----', '----', '----', '----', '----'], 'allergy': ['allergy', 'allergic', 'runny', 'sneeze', 'histamine', 'allergies', '----', '----', '----'], 'anaphylaxis': ['anaphylaxis', 'epinephrine', '----', '----', '----', '----', '----', '----', '----'], 'blister': ['blister', 'hot spot', '----', '----', '----', '----', '----', '----', '----'], 'bm': ['bm', 'bowel', 'lax', 'movement', 'constipat', '----', '----', '----', '----'], 'bruise': ['bruise', 'contusion', '----', '----', '----', '----', '----', '----', '----'], 'bug bite': ['bug bite', 'mosquito', 'bite', 'papule', 'fly', 'bug', 'bites', '----', '----'], 'congestion': ['congestion', 'nasal', 'stuffy', 'congested', 'sinus', '----', '----', '----', '----'], 'cut': ['cut', 'laceration', 'slice', 'scratch', '----', '----', '----', '----', '----'], 'diarrhea': ['diarrhea', 'dirrhea',
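The dictionary above still carries the `----` placeholder in every list. As a minimal sketch of an alternative, assuming the empty cells are left as missing values rather than filled, each row can be reduced to its non-missing terms. The names `toy_keywords` and `toy_dict` and the sample rows are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature of the keyword table; the real one is read from keyword_v1.csv.
toy_keywords = pd.DataFrame(
    {
        "keyword": ["abrasion", "allergy"],
        "syn_1": ["scrape", "allergic"],
        "syn_2": [None, "runny"],
    }
)

# For each row, keep only the non-missing cells as that keyword's search terms.
toy_dict = {
    row["keyword"]: row.dropna().tolist()
    for _, row in toy_keywords.iterrows()
}

print(toy_dict)
```

Built this way, the lists contain only real search terms, so no placeholder needs to be skipped during the search.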

Add a column to the health records for each key in the dictionary, with a default value of False.

In [7]:
default_bool_df = origin_df.copy(deep=True)

for key in keyword_dict:
    default_bool_df[key] = False

print(default_bool_df.shape)

(3794, 64)


Define a function to search a column for a keyword.

In [8]:
def test_for_word(dataframe, text_column, keyword, bool_column):
    """Set bool_column to True for every row whose text_column contains keyword.

    Modifies dataframe in place and returns None.
    """

    # Boolean mask: True for each record whose text contains the keyword.
    # str.contains is case-sensitive and treats the keyword as a regular expression by default;
    # na=False counts records with missing text as "no match" so the mask stays purely boolean.
    test = dataframe[text_column].str.contains(keyword, na=False)

    # Locate all records where the test is True, then set the target column to True
    dataframe.loc[test, bool_column] = True
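How `str.contains` decides on a match affects every count that follows. The sketch below uses a hypothetical `toy_text` series (not taken from the real records) to show the effect of the `case`, `regex` and `na` parameters; it illustrates the pandas behaviour and does not change the search used in this notebook.

```python
import pandas as pd

# Hypothetical report text, including one missing entry.
toy_text = pd.Series(["Scrape on left knee", "ticket 135", None])

# Default matching is case-sensitive, so the capitalised "Scrape" is not found.
default_match = toy_text.str.contains("scrape", na=False)

# case=False ignores capitalisation.
loose_match = toy_text.str.contains("scrape", case=False, na=False)

# With the default regex=True, "." matches any character, so "1.5" matches "135".
regex_match = toy_text.str.contains("1.5", na=False)

# regex=False treats the pattern as literal text, so "1.5" no longer matches "135".
literal_match = toy_text.str.contains("1.5", regex=False, na=False)

print(default_match.tolist(), loose_match.tolist())
print(regex_match.tolist(), literal_match.tolist())
```

If the report text mixes capitalisation, or a synonym ever contains a character such as `.` or `(`, these options are worth setting explicitly.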

Run a nested loop to apply the function above to each key in the dictionary and to each item in that key's value list. For example, we want to search for 'nausea', 'nauseous', 'queasy', etc., and set the column 'nausea' to True for each record that includes any term from the synonym list.

In [9]:
for key in keyword_dict:

    for value in keyword_dict[key]:
        # skip the '----' placeholder that filled the empty synonym cells
        if value == '----':
            continue
        test_for_word(default_bool_df, 'ailment_text', value, key)

# check output
print(default_bool_df['wound'].sum())

99


In [10]:
print(default_bool_df['vomitting'].sum())

86
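The nested loop calls `str.contains` once per synonym. A possible alternative, sketched here on hypothetical toy data, joins each keyword's synonyms into a single regular expression so the text column is scanned once per keyword. The names `toy_records` and `toy_dict` are assumptions for illustration, and `re.escape` is used so that each synonym is matched literally.

```python
import re

import pandas as pd

# Hypothetical records and synonym lists, standing in for the real data.
toy_records = pd.DataFrame(
    {"ailment_text": ["felt queasy after lunch", "scrape on knee", "nausea and headache"]}
)
toy_dict = {
    "nausea": ["nausea", "nauseous", "queasy"],
    "abrasion": ["abrasion", "scrape"],
}

for key, synonyms in toy_dict.items():
    # Escape each synonym, then join with "|" (regex OR) into one pattern per keyword.
    pattern = "|".join(re.escape(term) for term in synonyms)
    # One pass over the text column; missing text counts as no match.
    toy_records[key] = toy_records["ailment_text"].str.contains(pattern, na=False)

print(toy_records[["nausea", "abrasion"]])
```

This version also creates the boolean column directly, so the columns do not need to be initialised to False beforehand.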


In [11]:
print(default_bool_df.columns)

Index(['person_id', 'report_date', 'report_time', 'timestamp', 'ailment_text',
       'treatment_text', 'temp', 'season_visits', 'is_camper', 'home_notified',
       'fixed', 'flagged', 'flagged_again', 'abdominal pain', 'abrasion',
       'allergy', 'anaphylaxis', 'blister', 'bm', 'bruise', 'bug bite',
       'congestion', 'cut', 'diarrhea', 'fatigue', 'fever', 'headache',
       'hives', 'homesick', 'itch', 'menstrual ', 'mono', 'nausea',
       'respiratory', 'sore throat', 'splinter', 'sting', 'stomach ache',
       'sun burn', 'tick', 'rolled ankle', 'vomitting', 'wound', 'poison ivy',
       'swelling', 'rash', 'cough', 'asthma', 'fracture', 'burn', 'cold',
       'nosebleed', 'toe', 'bike', 'predator', 'capture', 'hike', 'hit',
       'campout', 'anxiety', ' lice', 'grass', 'eye', 'sleep'],
      dtype='object')


Count the True values in each record so that records with none can be identified as unclassified.

In [15]:
# Create copy of df
df_complete = default_bool_df.copy(deep=True)

# Add new column, "sum_true", which sums the number of trues in the bool columns.
# Columns 0 - 12 are essential data, and 13: are bools representing categories.
df_complete['sum_true'] = df_complete.iloc[:, 13:].sum(axis=1)

print(df_complete[['report_date', 'sum_true']].head())

  report_date  sum_true
0  2013-06-11         0
1  2013-06-19         2
2  2013-06-23         2
3  2013-06-26         1
4  2013-08-01         1
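Once `sum_true` exists, the unclassified records are the rows where it equals zero. The sketch below shows that filter on a hypothetical toy frame, and selects the category columns by name rather than by position, which keeps working if columns are later added or reordered. The names `toy`, `category_columns` and `unclassified` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical records with two category columns already filled in.
toy = pd.DataFrame(
    {
        "ailment_text": ["unclear note", "scrape on knee", "queasy, scrape"],
        "abrasion": [False, True, True],
        "nausea": [False, False, True],
    }
)

# Name the category columns explicitly instead of relying on their position.
category_columns = ["abrasion", "nausea"]

# True counts as 1 and False as 0, so the row sum is the number of matched categories.
toy["sum_true"] = toy[category_columns].sum(axis=1)

# Records that matched no category at all.
unclassified = toy[toy["sum_true"] == 0]

print(unclassified)
```

In the notebook itself, the list of keys in `keyword_dict` could play the role of `category_columns`.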


Export to CSV for exploration.

In [16]:
df_complete.to_csv('C:\\Users\\avery\\OneDrive\\health_database_docs\\complete_records.csv')
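By default `to_csv` also writes the row index as an unlabelled first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below demonstrates the difference with an in-memory buffer and a hypothetical two-row frame; whether the index is wanted in the exported file is a choice for the analysis, not something the notebook specifies.

```python
import io

import pandas as pd

# Hypothetical frame standing in for df_complete.
toy = pd.DataFrame({"person_id": [101, 102], "sum_true": [0, 2]})

# Default export: the index is written as an extra, unlabelled column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```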