# Data cleaning
This notebook provides sample [EDSL](https://docs.expectedparrot.com/) code for automating a data cleaning task. Step by step, we show how to use EDSL to prompt a model to suggest appropriate sense checks for a dataset, and then run those checks to identify the values that fail them.

EDSL is an open-source library for simulating surveys and experiments with AI agents and large language models. Please see our [documentation page](https://docs.expectedparrot.com/) for tips and tutorials on getting started.

### Importing the tools
EDSL comes with a [variety of question types](https://docs.expectedparrot.com/en/latest/questions.html) that we can choose from based on the desired form of the response (multiple choice, free text, etc.). Here we select `QuestionList` for prompting a model to generate a list of appropriate sense checks for a dataset, and `QuestionMultipleChoice` for generating responses to the sense check questions. `Scenario` objects are used to parameterize the questions with the data (explained below):

In [1]:
from edsl.questions import QuestionMultipleChoice, QuestionList
from edsl import Scenario, Survey

import pandas
import random

We'll use some observations of ages as our dataset for cleaning. It's a list of random ages between 22 and 85 with some bad values mixed in (`0.99`, `-5`, `1050`, `'old'` and `2`):

In [2]:
ages = [
84, 62, 79, 57, 59, 55, 68, 66, 47, 54, 76, 33, 74, 56, 47, 24, 23, 38, 38, 54, 51, 84, 71, 46, 38,
26, 50, 56, 62, 39, 31, 52, 69, 84, 69, 48, 48, 23, 65, 54, 78, 51, 69, 77, 75, 76, 26, 44, 61, 32,
70, 24, 74, 22, 32, 24, 80, 65, 36, 42, 84, 66, 40, 85, 28, 22, 67, 25, 70, 77, 53, 69, 64, 27, 61,
68, 68, 78, 0.99, 83, 58, 33, 46, 43, 50, 85, 28, 82, 50, 61, 66, 32, 45, 70, 56, 50, 43, 30, 43, 55,
33, 72, 43, 43, -5, 32, 43, 45, 67, 84, 37, 63, 52, 53, 58, 79, 79, 80, 62, 75, 57, 60, 39, 79, 49,
60, 60, 37, 45, 36, 1050, 73, 70, 56, 39, 58, 69, 77, 68, 84, 78, 48, 31, 74, 27, 55, 56, 66, 35, 39,
57, 47, 29, 24, 47, 60, 43, 37, 84, 64, 28, 22, 37, 71, 77, 76, 84, 63, 76, 58, 41, 72, 22, 63, 78,
49, 82, 69, 'old', 37, 27, 29, 54, 83, 80, 74, 48, 76, 49, 26, 38, 35, 36, 25, 23, 71, 33, 39, 40, 35,
85, 24, 57, 85, 63, 53, 62, 47, 69, 76, 71, 48, 62, 23, 25, 84, 32, 63, 75, 31, 25, 50, 85, 36, 58,
85, 34, 62, 43, 2, 50, 83, 44, 73, 81, 44, 43, 82, 84, 30, 24, 63, 63, 59, 46, 30, 62, 25, 52, 23
]
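
For comparison with the model-driven checks that follow, a purely rule-based baseline can be sketched as below. The helper name `flag_age` and the bounds of 18 and 120 are assumptions made for illustration; they are not part of EDSL or of this notebook's workflow.

```python
def flag_age(value, low=18, high=120):
    """Return a short reason if `value` fails a basic check, else None.

    `low` and `high` are illustrative bounds, not values from the notebook.
    """
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return "not a number"
    if isinstance(value, float) and not value.is_integer():
        return "not a whole number"
    if value < low:
        return "too low"
    if value > high:
        return "too high"
    return None


# A few values taken from the `ages` list above.
sample = [84, 0.99, -5, 1050, "old", 2]
flagged = {repr(v): flag_age(v) for v in sample if flag_age(v) is not None}
print(flagged)
```

Hand-written rules like these only catch the problems someone thought of in advance, which is the gap the model-drafted sense checks are meant to address.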

Next we create a question prompting a model to draft a list of sense check questions about each piece of data. We provide a description and sample of the data, and specify that the sense checks should be formatted as dictionaries so that we can readily transform them into new EDSL questions for the model to answer. Note that we specifically instruct the model to include a placeholder for a piece of data in the question texts so that we can run them for each piece of data all at once, using `Scenario` objects ([learn more about using scenarios](https://docs.expectedparrot.com/en/latest/scenarios.html)):

In [3]:
data_description = "a list of ages (in years) of adult participants in a social science experiment"
sample_data = random.sample(ages, 5)
number_of_questions = 4

In [4]:
q = QuestionList(
    question_name = "data_cleaning_questions",
    question_text = f"""Consider a dataset consisting of {data_description}.
    Here is a sample of the data: {sample_data}.
    Draft a set of {number_of_questions} appropriate sense checks for this dataset, formatted as a list 
    of dictionaries where each dictionary has keys 'question_name', 'question_text' and 'question_options'. 
    The 'question_name' should be formatted 'q0', 'q1', etc.
    The 'question_text' should be questions to be asked about each piece of data individually and without 
    reference to other data, using '<data>' as a placeholder for the piece of data in each question text. 
    The question_options should be complete, appropriate sets of responses for the corresponding question 
    that provide clarity about potential problems with the individual piece of data.""",
    max_list_items = number_of_questions
)

We generate a response by calling the `run()` method for the question. This generates a dataset of `Results` that we can begin analyzing:

In [5]:
results = q.run()

EDSL comes with [built-in methods for analyzing results](https://docs.expectedparrot.com/en/latest/results.html) as datasets, dataframes, JSON and other forms. We can inspect a list of all the components that are directly accessible:

In [6]:
results.columns

['agent.agent_instruction',
 'agent.agent_name',
 'answer.data_cleaning_questions',
 'comment.data_cleaning_questions_comment',
 'iteration.iteration',
 'model.frequency_penalty',
 'model.logprobs',
 'model.max_tokens',
 'model.model',
 'model.presence_penalty',
 'model.temperature',
 'model.top_logprobs',
 'model.top_p',
 'prompt.data_cleaning_questions_system_prompt',
 'prompt.data_cleaning_questions_user_prompt',
 'question_options.data_cleaning_questions_question_options',
 'question_text.data_cleaning_questions_question_text',
 'question_type.data_cleaning_questions_question_type',
 'raw_model_response.data_cleaning_questions_raw_model_response']

Here we select just the response and print it in a table:

In [7]:
results.select("data_cleaning_questions").print(format="rich")

We can extract the response as a list and use it to create new `Question` objects that run the sense checks on the data:

In [8]:
data_cleaning_questions = results.select("data_cleaning_questions").to_list()[0]
data_cleaning_questions

[{'question_name': 'q0',
  'question_text': 'Is the age <data> a reasonable value for an adult participant?',
  'question_options': ['Yes',
   "No, it's too low",
   "No, it's too high",
   "No, it's not a number"]},
 {'question_name': 'q1',
  'question_text': 'Does the age <data> follow the expected numerical format (integer)?',
  'question_options': ['Yes',
   'No, it contains decimal points',
   'No, it includes non-numeric characters',
   "No, it's not a number at all"]},
 {'question_name': 'q2',
  'question_text': 'Is the age <data> within the typical lifespan of a human being?',
  'question_options': ['Yes',
   "No, it's unusually high",
   "No, it's negative",
   "No, it's not a valid age"]},
 {'question_name': 'q3',
  'question_text': 'Is the age <data> indicative of a living adult human?',
  'question_options': ['Yes',
   "No, it's below the adult age threshold",
   "No, it's implausibly high",
   "No, it's not a valid age"]}]

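Because these dictionaries are model output, it may be worth checking their shape before building questions from them. The sketch below assumes the three keys and the `<data>` placeholder requested in the prompt above; the helper `check_question_spec` is hypothetical and not part of EDSL.

```python
REQUIRED_KEYS = {"question_name", "question_text", "question_options"}


def check_question_spec(spec, placeholder="<data>"):
    """Return a list of problems found in one model-drafted question dict."""
    if not isinstance(spec, dict):
        return ["not a dictionary"]
    problems = []
    missing = REQUIRED_KEYS - spec.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if placeholder not in str(spec.get("question_text", "")):
        problems.append("placeholder not found in question_text")
    options = spec.get("question_options", [])
    if not isinstance(options, list) or len(options) < 2:
        problems.append("fewer than two question_options")
    return problems


# One well-formed dictionary (copied from the output above) and one
# deliberately broken dictionary invented for illustration.
good = {
    "question_name": "q0",
    "question_text": "Is the age <data> a reasonable value for an adult participant?",
    "question_options": ["Yes", "No, it's too low", "No, it's too high", "No, it's not a number"],
}
bad = {"question_name": "q1", "question_text": "Is the age valid?"}

print(check_question_spec(good))
print(check_question_spec(bad))
```

An empty list means the dictionary has the expected shape; anything else is a candidate for manual editing before it is turned into a question.
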
Here we can make any desired edits to the proposed questions, and then use them to create new formatted questions which we combine into a `Survey` to administer all at once to the model ([learn more about constructing surveys](https://docs.expectedparrot.com/en/latest/surveys.html)):

In [9]:
questions = []

for spec in data_cleaning_questions:  # `spec` avoids shadowing the QuestionList `q` defined earlier
    question_name = spec["question_name"]
    question_text = spec["question_text"]
    question_options = spec["question_options"]
        
    new_q = QuestionMultipleChoice(
        question_name = question_name,
        question_text = question_text.replace("<data>", "{{ age }}"),
        question_options = question_options
    )
    questions.append(new_q)

questions

[Question('multiple_choice', question_name = 'q0', question_text = 'Is the age {{ age }} a reasonable value for an adult participant?', question_options = ['Yes', "No, it's too low", "No, it's too high", "No, it's not a number"]),
 Question('multiple_choice', question_name = 'q1', question_text = 'Does the age {{ age }} follow the expected numerical format (integer)?', question_options = ['Yes', 'No, it contains decimal points', 'No, it includes non-numeric characters', "No, it's not a number at all"]),
 Question('multiple_choice', question_name = 'q2', question_text = 'Is the age {{ age }} within the typical lifespan of a human being?', question_options = ['Yes', "No, it's unusually high", "No, it's negative", "No, it's not a valid age"]),
 Question('multiple_choice', question_name = 'q3', question_text = 'Is the age {{ age }} indicative of a living adult human?', question_options = ['Yes', "No, it's below the adult age threshold", "No, it's implausibly high", "No, it's not a valid ag

In [10]:
survey = Survey(questions)

We create `Scenario` objects for the individual ages that we will insert in the question texts when we run the survey. This allows us to efficiently run multiple versions of the questions at once ([learn more about using scenarios](https://docs.expectedparrot.com/en/latest/scenarios.html)):

In [11]:
scenarios = [Scenario({"age":age}) for age in ages]
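
To see what each scenario does to a question text, here is a minimal stand-in for the substitution step using only the standard library. EDSL performs its own templating internally; the `render` helper below is hypothetical and only mimics the simple `{{ name }}` case.

```python
import re


def render(template, values):
    """Replace each `{{ key }}` in `template` with `str(values[key])`.

    Hypothetical helper for illustration; unknown keys are left untouched.
    """
    def substitute(match):
        key = match.group(1)
        return str(values[key]) if key in values else match.group(0)

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = "Is the age {{ age }} a reasonable value for an adult participant?"
for age in [84, 0.99, "old"]:
    print(render(template, {"age": age}))
```

Each scenario yields its own version of the question text, so a single question definition can be applied to every value in the dataset.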

Finally, we add the scenarios to the survey and run it. The new results will include a `Result` for each response to the survey:

In [12]:
results = survey.by(scenarios).run()
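
As a rough way to anticipate the size of a run, the sketch below counts the combinations of scenarios and questions, assuming one prompt is sent per (scenario, question) pair. The names `sample_ages` and `question_names` are stand-ins for the notebook's `ages` and question names, and the de-duplication step is an optional idea rather than something this notebook does.

```python
from itertools import product

# Stand-ins for the notebook's data: a short list with repeated values.
sample_ages = [84, 62, 84, 0.99, "old", 62]
question_names = ["q0", "q1", "q2", "q3"]

# Every age is paired with every question.
pairs = list(product(sample_ages, question_names))
print(len(pairs))  # 6 ages x 4 questions = 24

# Dropping repeated values (keeping first-seen order) shrinks the workload.
unique_ages = list(dict.fromkeys(sample_ages))
unique_pairs = list(product(unique_ages, question_names))
print(len(unique_pairs))  # 4 unique ages x 4 questions = 16
```

Running the checks on unique values only would require mapping the answers back to the original rows afterwards.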

Here we inspect all the components of these new results, print a table of the first few rows (limited with `max_rows`), and then print a table filtered to the cases needing review, meaning those where any sense check returned a response other than 'Yes' (see the `filter` expression):

In [13]:
results.columns

['agent.agent_instruction',
 'agent.agent_name',
 'answer.q0',
 'answer.q1',
 'answer.q2',
 'answer.q3',
 'comment.q0_comment',
 'comment.q1_comment',
 'comment.q2_comment',
 'comment.q3_comment',
 'iteration.iteration',
 'model.frequency_penalty',
 'model.logprobs',
 'model.max_tokens',
 'model.model',
 'model.presence_penalty',
 'model.temperature',
 'model.top_logprobs',
 'model.top_p',
 'prompt.q0_system_prompt',
 'prompt.q0_user_prompt',
 'prompt.q1_system_prompt',
 'prompt.q1_user_prompt',
 'prompt.q2_system_prompt',
 'prompt.q2_user_prompt',
 'prompt.q3_system_prompt',
 'prompt.q3_user_prompt',
 'question_options.q0_question_options',
 'question_options.q1_question_options',
 'question_options.q2_question_options',
 'question_options.q3_question_options',
 'question_text.q0_question_text',
 'question_text.q1_question_text',
 'question_text.q2_question_text',
 'question_text.q3_question_text',
 'question_type.q0_question_type',
 'question_type.q1_question_type',
 'question_type.q

In [14]:
(results
 .select("age", "answer.*")
 .print(pretty_labels = {"age":"Age",
                         "answer.q0":'question_text.q0_question_text', 
                         "answer.q1":'question_text.q1_question_text',
                         "answer.q2":'question_text.q2_question_text',
                         "answer.q3":'question_text.q3_question_text'},
        format="rich",
       max_rows=10)
)

Showing only the first 10 rows of 250 rows.


In [15]:
(results
 .filter("q0 != 'Yes' or q1 != 'Yes' or q2 != 'Yes' or q3 != 'Yes'")
 .select("age").print(format="rich")
)
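
The filter expression keeps any row where at least one check returned something other than 'Yes'. The same logic can be sketched in plain Python; the rows below are hypothetical answers invented for illustration, not actual model output, and `needs_review` is an assumed helper name.

```python
# Hypothetical answers for three ages (made-up values, not real model output).
rows = [
    {"age": 84, "q0": "Yes", "q1": "Yes", "q2": "Yes", "q3": "Yes"},
    {"age": -5, "q0": "No, it's too low", "q1": "Yes",
     "q2": "No, it's negative", "q3": "No, it's not a valid age"},
    {"age": "old", "q0": "No, it's not a number", "q1": "No, it's not a number at all",
     "q2": "No, it's not a valid age", "q3": "No, it's not a valid age"},
]
check_names = ["q0", "q1", "q2", "q3"]


def needs_review(row):
    """True if any sense check returned an answer other than 'Yes'."""
    return any(row[name] != "Yes" for name in check_names)


for_review = [row["age"] for row in rows if needs_review(row)]
passed = [row["age"] for row in rows if not needs_review(row)]
print(for_review)
print(passed)
```

Splitting the data this way gives both the values to inspect and a candidate cleaned list in one pass.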

This notebook can be readily edited and expanded for other data cleaning and data labeling purposes. Please see our [documentation page](https://docs.expectedparrot.com/) for examples of other methods and use cases and let us know if you have any questions!