# Flagged words check

The goal of this project is to automate the flagged words check carried out during the <b>sub QA</b> process. The project consists of the following steps:

<a href='#1'>1. Data preprocessing</a>

<a href='#2'>2. Flagged words search</a>

<a href='#3'>3. Analysis of the flagged words</a>

<a href='#4'>4. Conclusion</a>

<a id='1'></a> 
 ### 1. Data preprocessing

In [1]:
#import libraries
import pandas as pd
import numpy as np
import gspread
from oauth2client.service_account import ServiceAccountCredentials
from pandas import json_normalize  # top-level location since pandas 1.0; pandas.io.json.json_normalize is deprecated

In [2]:
#import dataset
df = pd.read_csv('submissions_project_14564_20211219175212_91.csv')

In [3]:
#remove columns that might distort our results
column_names_list = ['please_scan',
                    'please_share_a_picture',
                    'please_take_a_photo',
                    'please_take_the_photo',
                    'please_upload',
                    'we_would_like_to_see_a_picture',
                    'please_take_a_screenshot',
                    'you_work',
                    'status'
                    ]

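def columns_drop(df, column_names):
    """Drop every column whose name contains one of the given fragments."""
    result = df  # start from the input so an empty list returns the data unchanged
    for name in column_names:
        # regex=False matches the fragment literally rather than as a regular expression
        matching = result.columns[result.columns.str.contains(name, regex=False)]
        result = result.drop(columns=matching)
    return result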

df_update = columns_drop(df, column_names_list)

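#lower-case every value so that the search is case-insensitive
#(note: astype(str) also turns missing values into the string 'nan')
df_update = df_update.astype(str).apply(lambda x: x.str.lower())
#convert the submission id back to an integer
df_update['id'] = df_update['id'].astype('int')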

In [4]:
#import flagged words list from GoogleSheets
scope = ['https://spreadsheets.google.com/feeds']
credentials = ServiceAccountCredentials.from_json_keyfile_name('/Users/anaitagadzhanyan/Desktop/Практикум/extreme-minutia-321117-677aeb6efe7b.json', scope)
gc = gspread.authorize(credentials)
spreadsheet_key = '1ss6VRW29xQOynuzhm2VG_0Wu4YLf1pM9omnQXxkHrPs'
book = gc.open_by_key(spreadsheet_key)
worksheet = book.worksheet("sheet")
table = worksheet.get_all_values()

#the flagged words are stored in the first column of the sheet
df_dataframe = pd.DataFrame(table)
flagged_words_list = df_dataframe[0].to_list()

The original dataset contains many columns, but for the flagged words analysis we only need the columns with open text (OT) answers and the submission ID column (`id`). Selecting the OT columns manually would take a lot of time. Instead, we remove the <b>non-OT</b> columns that are known to contain flagged words (for example, "streetbees" as part of a link) and would therefore distort the results.
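The filtering idea can be illustrated on a toy table. This is a minimal sketch, not the notebook's actual data: the column names and values below are made up, and it uses plain substring matching on the column names.

```python
import pandas as pd

# Toy data; column names and values are hypothetical.
toy = pd.DataFrame({
    "id": [1, 2],
    "please_upload_a_receipt": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    "why_do_you_like_it": ["It is tasty", "Cheap"],
    "status": ["approved", "approved"],
})

# Name fragments that mark columns to exclude.
fragments = ["please_upload", "status"]

# Keep a column only if none of the fragments occurs in its name.
keep = [col for col in toy.columns if not any(frag in col for frag in fragments)]
toy_filtered = toy[keep]

print(list(toy_filtered.columns))
```

Only the `id` column and the open text column remain, so a link stored in the upload column can no longer trigger a match.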

We also converted all string values to lower case and the submission ID (the `id` column) to an integer type.

Finally, we imported the flagged words list from a Google Sheets file.
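Because the answers are lower-cased before the search, the flagged words need the same treatment. A blank cell in the sheet is also risky: the empty string is a substring of every answer, so it would flag every cell. The helper below is a hypothetical sketch of such a cleanup, assuming the sheet may contain blanks, padding or mixed case; `normalize_words` is not part of the original notebook.

```python
def normalize_words(words):
    """Lower-case and strip the words, dropping blanks and duplicates (order kept)."""
    seen = set()
    cleaned = []
    for word in words:
        word = str(word).strip().lower()
        # Skip empty entries: "" is contained in every string.
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    return cleaned

# Hypothetical raw values, as they might come from a sheet column.
sample = ["Streetbees", " refund ", "", "refund"]
cleaned_sample = normalize_words(sample)
print(cleaned_sample)
```

In the notebook this could be applied to `flagged_words_list` before running the search.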

<a id='2'></a> 
 ### 2. Flagged words search

In [5]:
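#Flagged words search function
def flagged_words_function(df, words):
    #index by submission id on a new object, so the caller's DataFrame is not modified
    df = df.set_index('id', drop=False)
    result_append = []
    for word in words:
        #True for every string cell that contains the flagged word
        #(applymap is deprecated since pandas 2.1 in favour of DataFrame.map)
        mask = df.applymap(lambda x: isinstance(x, str) and word in x)
        #keep the matching cells, everything else becomes NaN
        result_append.append(df[mask])
    return pd.concat(result_append)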

df_result = flagged_words_function(df_update, flagged_words_list).dropna(how='all').drop_duplicates()
df_result_new = pd.DataFrame(df_result.replace(0, np.nan).stack())

df_result_new.to_excel("flagged_words.xlsx")

Applying the function produces an Excel sheet that lists the submission ID, the question and the answer for every answer that contains a flagged word.
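The shape of that output is easiest to see on a toy example. The sketch below uses made-up questions and answers and a single combined mask instead of one mask per word, so it illustrates the masking and stacking idea rather than reproducing the notebook's function.

```python
import pandas as pd

# Toy answers, already lower-cased; names and values are hypothetical.
toy = pd.DataFrame(
    {
        "id": [101, 102, 103],
        "q1_why": ["i love streetbees", "it is cheap", "no reason"],
        "q2_other": ["nothing", "ask for a refund", "nothing"],
    }
).set_index("id")

words = ["streetbees", "refund"]

# True where a cell is a string containing at least one flagged word.
mask = toy.apply(
    lambda col: col.map(lambda x: isinstance(x, str) and any(w in x for w in words))
)

# where() blanks the non-matching cells; stack() turns the table into a
# Series indexed by (id, question); dropna() leaves only the matches.
hits = toy.where(mask).stack().dropna()
print(hits)
```

Each remaining row pairs a submission ID and a question with the answer that triggered the match, which is the layout written to the Excel file.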

<a id='3'></a> 
 ### 3. Analysis of the flagged words

Next, we look through the resulting file. If an OT answer looks suspicious, we copy its submission ID into the Admin and take a closer look at the submission.
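To avoid copying the same ID twice, the IDs to review can be collected once from the result. This is a minimal sketch on a hypothetical result Series; it assumes the result is indexed by submission ID and question, and the level names `id` and `question` are illustrative.

```python
import pandas as pd

# Hypothetical search result: index levels are (id, question).
hits = pd.Series(
    ["i love streetbees", "ask for a refund", "refund please"],
    index=pd.MultiIndex.from_tuples(
        [(101, "q1_why"), (102, "q2_other"), (102, "q3_more")],
        names=["id", "question"],
    ),
)

# Unique submission IDs, in order of first appearance.
ids_to_review = hits.index.get_level_values("id").unique().tolist()
print(ids_to_review)
```

Submission 102 matched in two questions but appears only once in the review list.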

<a id='4'></a> 
 ### 4. Conclusion

Automating the flagged words check removes a source of human error (e.g. a QA specialist missing a word) and saves time. The only remaining manual work is keeping the script's list of excluded columns up to date, that is, the columns that contain links or other non-OT answers that may distort the results.
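Part of that maintenance could itself be assisted by a script. The sketch below suggests candidate columns for the exclusion list by checking how many values in each column look like links. The helper `columns_with_links`, the URL pattern and the 0.5 threshold are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def columns_with_links(df, threshold=0.5):
    """Return names of columns where more than `threshold` of the values look like URLs."""
    flagged = []
    for col in df.columns:
        values = df[col].astype(str)
        # Share of values that contain an http:// or https:// prefix.
        share = values.str.contains(r"https?://", regex=True).mean()
        if share > threshold:
            flagged.append(col)
    return flagged

# Toy data; column names and values are hypothetical.
toy = pd.DataFrame({
    "photo_of_receipt": ["https://example.com/1.jpg", "https://example.com/2.jpg"],
    "why_do_you_like_it": ["it is tasty", "cheap"],
})

link_columns = columns_with_links(toy)
print(link_columns)
```

The returned names are only suggestions; a reviewer would still decide which of them belong in `column_names_list`.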