# Flag misleading text

Before we can scan report text and classify our records, we need to fix a pervasive source of error. Healthcare staff often ask patients whether they have a certain symptom, such as a headache, and then record the *absence* of that symptom. For example, a report might contain a line that reads "no vomiting" or "denies headache". Records with phrases like this need to be adjusted.

Luckily, negative findings almost always come *after* positive findings. That means a simple function, applied to every record, can remove everything from the first negative keyword onward.
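As a rough illustration of that idea, here is a minimal sketch of such a truncation function. The name `truncate_at_negation`, the `NEGATIVE_KEYWORDS` list, and the example strings are assumptions made for this sketch, not part of the original workflow, and whole-word matching is a design choice of the sketch rather than something the records require.

```python
import re

# Hypothetical keyword list, modelled on phrases like "no vomiting" and "denies headache".
NEGATIVE_KEYWORDS = ["no", "denies", "deny", "not", "none"]

# One case-insensitive pattern that matches any keyword as a whole word.
NEGATION_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in NEGATIVE_KEYWORDS) + r")\b",
    flags=re.IGNORECASE,
)


def truncate_at_negation(text):
    """Return the part of text that precedes the first negative keyword, if any."""
    if not isinstance(text, str):
        return text  # leave missing values (None/NaN) untouched
    match = NEGATION_PATTERN.search(text)
    if match is None:
        return text  # nothing to cut
    # Drop the keyword and everything after it, then tidy trailing punctuation.
    return text[: match.start()].rstrip(" ,;.")


# Made-up example strings, not taken from the real records.
examples = ["fever and cough, denies headache", "nausea, no vomiting", "chest pain"]
for text in examples:
    print(repr(truncate_at_negation(text)))
```

With a pandas DataFrame, a function like this could be applied to a text column with `Series.apply`. Because it only works when the negative findings really do follow the positive ones, reviewing the affected records first is still worthwhile.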


In [1]:
import pandas as pd

In [2]:
origin_df = pd.read_csv('C:\\Users\\avery\\OneDrive\\health_database_docs\\positive_only_records.csv')

Add a 'flagged' column and set its default to False:

In [3]:
origin_df['flagged'] = False

Select every record whose text contains a negation keyword, then set 'flagged_again' to True for those records. Rows where 'flagged' is already True are skipped.

In [4]:
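# Keywords that suggest a symptom is being negated rather than reported.
# Matching is by substring, so 'not' also matches words such as 'nothing'.
flag_list = ['no ', 'denies', 'not', 'none', 'deny']

# Work on a deep copy so origin_df stays unchanged.
flagged_copy = origin_df.copy(deep=True)
flagged_copy['flagged_again'] = False

def flag_word(dataframe, text_col, keyword, result_col):
    """Set result_col to True for not-yet-flagged rows whose text_col contains keyword (case-insensitive)."""
    # regex=False treats the keyword as literal text; na=False counts missing text as no match.
    test = dataframe[text_col].str.contains(keyword, case=False, regex=False, na=False) & ~dataframe['flagged']
    dataframe.loc[test, result_col] = True


for keyword in flag_list:
    flag_word(flagged_copy, 'ailment_text', keyword, 'flagged_again')


print(flagged_copy['flagged_again'].value_counts())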


False    2910
True      888
Name: flagged_again, dtype: int64
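Some of the flagged rows may be false positives, because a substring search for a keyword such as 'not' also matches unrelated words. The sketch below contrasts substring matching with whole-word matching on a few made-up records. The `demo` DataFrame and the combined `pattern` are assumptions for illustration, and this is one possible refinement, not the method used to produce the counts above.

```python
import re

import pandas as pd

# Same keywords as above, without the trailing space on 'no'.
flag_list = ["no", "denies", "not", "none", "deny"]

# Made-up records for illustration only.
demo = pd.DataFrame(
    {"ailment_text": ["nothing unusual, mild cough", "denies headache", "No fever", "sore throat"]}
)

# Plain substring search: 'not' also matches inside 'nothing'.
substring_hit = demo["ailment_text"].str.contains("not", case=False, regex=False, na=False)

# Whole-word search: \b anchors each keyword at word boundaries.
pattern = r"\b(?:" + "|".join(re.escape(k) for k in flag_list) + r")\b"
word_hit = demo["ailment_text"].str.contains(pattern, case=False, regex=True, na=False)

print(substring_hit.tolist())
print(word_hit.tolist())
```

Here the substring search flags only the first record (through 'nothing'), while the whole-word search flags the second and third. Whole-word matching is stricter in both directions: 'deny' no longer matches 'denying', so the keyword list may need extra variants if this approach is adopted.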


Export the flagged results to CSV for review.

In [5]:
flagged_copy.to_csv('C:\\Users\\avery\\OneDrive\\health_database_docs\\flagged_twice.csv')