### 3. Data Quality Checks

IPA designed (FILL)

#### High Frequency Checks

As mentioned earlier, a uniquely powerful feature of CAI is that you can spot problems while data collection is still underway and relay them to the field team to fix during implementation. This could mean anything from spotting a flaw in the survey design that prevents data from being collected properly to running checks that flag poor or inconsistent enumerator performance. While data RAs sometimes provide technical support to help develop the cleaning and checking code, it is the field RAs who run the code daily, or as often as they can; this is why this type of checking code is called a “high frequency check” (HFC). Once the checks have been run, the field RAs should read the output, look for data issues or signs of poor data collection, and raise their findings with the field team to seek explanations.

IPA has developed programs containing the most common checks, which every project should run. They are flexible enough to be customized to the project and its data structure. We recommend starting with these checks, as they are the most important and are relatively easy to set up. Beyond these, you can add more advanced checks that target specific data quality concerns in the incoming data. See the resources below for tips on setting up a system to run HFCs and for best practices.

-	**Resource:** [CAI high-frequency checks](https://github.com/PovertyAction/high-frequency-checks) (GitHub)
-	**Resource:** [HFC code examples](https://northwestern.app.box.com/folder/50078386361) (Box) – see the sub-folder called “Advanced Checks” for some additional types of data quality checks that can be implemented.

Some HFC best practices:

- Set up a system whereby one person runs the checks on a daily basis, or as regularly as possible. Automate this process as much as possible (ideally it would be a “one-click” step).
- Save the output in a folder whose name includes the date. Generate the folder name automatically in Stata using the system value \`c(current_date)', which is referenced with the same quotes as a local macro. For example, you could write "${file_directory}/\`c(current_date)'/hfc_output.xlsx". This way, each run of your HFCs writes to its own folder and does not overwrite previously generated output.
  
- Think about the variables you include in your checks.
    - Be selective – Do not add every variable in your survey to your HFCs. Include variables that are important for your main analysis (e.g. income, consumption) or that are critical to identifying observations (e.g. household size, respondent/household head name). You might also want to consider including variables that you think are vulnerable to improper entry (e.g. sensitive health questions that enumerators might skip because they feel uncomfortable, or long sections prefaced by a screening question that enumerators might be tempted to answer “no” to).

    - You might need to construct aggregated variables – Some of the variables that you want to check might not exist in the raw data. In that case, before running your HFCs, first create a cleaning/outcomes do-file that constructs the outcome variables of interest. For example, a survey might ask about different types of income-generating activities for each member of the household, and it could be useful to aggregate income over all activities and all individuals to generate a measure of total household income. One household member reporting zero income might not be a problem in itself, but if the income of the whole household comes to zero, some follow-up with the field team may be necessary.
- Lay out a clear system ahead of time for dividing the HFC tasks among the project team, e.g. who will be responsible for running the code, reading the output, communicating with the field team and/or PIs, and making corrections to the data.
-	Save a backup of the raw data often – you can automate this in your HFC code.
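
The practices above (dated output folders, automatic raw-data backups, and checks on aggregated variables) can be sketched in a few lines. The example below is a minimal, language-neutral illustration written in Python rather than Stata; it is not taken from IPA's HFC programs. The function names, the `raw_backup_` prefix, the ISO date format for the folder name, and the field names `hh_id` and `income` are all assumptions made for illustration.

```python
import shutil
from datetime import date
from pathlib import Path


def dated_output_path(base_dir, run_date=None):
    """Return base_dir/<YYYY-MM-DD> for one HFC run (ISO date format is an assumption)."""
    run_date = run_date or date.today()
    return Path(base_dir) / run_date.isoformat()


def prepare_run(base_dir, raw_file, run_date=None):
    """Create the dated folder and copy the raw data into it as a backup."""
    out_dir = dated_output_path(base_dir, run_date)
    out_dir.mkdir(parents=True, exist_ok=True)  # safe to re-run on the same day
    backup = out_dir / ("raw_backup_" + Path(raw_file).name)
    shutil.copy2(raw_file, backup)  # copy2 also preserves file timestamps
    return out_dir, backup


def zero_income_households(rows):
    """Sum income over all members and activities per household.

    `rows` is a list of dicts with the hypothetical keys "hh_id" and "income".
    Returns the sorted IDs of households whose total income is zero.
    """
    totals = {}
    for row in rows:
        totals[row["hh_id"]] = totals.get(row["hh_id"], 0) + row["income"]
    return sorted(hh for hh, total in totals.items() if total == 0)
```

With this layout, a household is flagged only when every member and activity reports zero income, which matches the idea that a single zero is not a problem in itself. Because the folder name comes from the run date, output from earlier days is left untouched.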

#### Back Checking

To ensure that we are collecting high-quality data, we use “back checks” during the survey process to double-check that both the survey teams and the instruments are performing well in the field.

Generally, the field RA will hire a separate team of back checkers to randomly visit respondents who have recently completed the survey and ask them a subset of questions from the survey. This “back check” helps us to do two things:

- Gauge the reliability of questions we ask.
- Gauge the reliability of each enumerator on the team.

Field RAs are in charge of designing and executing the back checks. However, data RAs sometimes collaborate with the field RA to construct code that will give the field RAs close to real-time information about the reliability of the surveys after a back check survey is finished. To get a sense of how field RAs think about the back check process, read:

- **Resource:** [Back Check Manual](https://northwestern.app.box.com/file/311246841032) (Box)

If you’re assigned to a project that will require back check code, be sure to learn about and use bcstats, a Stata program that helps automate the back checking process.

- **Resource:** [bcstats](https://github.com/PovertyAction/bcstats) (GitHub) – IPA user-written Stata program that compares survey and back check data and produces a dataset of comparisons. It also runs checks on the data, implementing “enumerator checks” for type 1 and type 2 variables and “stability checks” for type 3 variables.
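
To make the idea of comparing survey and back check data concrete, here is a minimal Python sketch of the core comparison: match records by respondent ID, compare a chosen set of variables, and summarize mismatches by enumerator. It illustrates the concept only and is not how bcstats is implemented; bcstats does considerably more, including the type-specific checks mentioned above. The function name and the keys `id` and `enumerator` are hypothetical.

```python
from collections import defaultdict


def backcheck_error_rates(survey, backcheck, check_vars):
    """Compare survey and back check answers, matched on respondent ID.

    `survey` rows need the keys "id" and "enumerator" plus every name in
    `check_vars`; `backcheck` rows need "id" plus every name in `check_vars`.
    Returns {enumerator: (mismatches, comparisons, error_rate)}.
    """
    survey_by_id = {row["id"]: row for row in survey}
    counts = defaultdict(lambda: [0, 0])  # enumerator -> [mismatches, comparisons]
    for bc_row in backcheck:
        s_row = survey_by_id.get(bc_row["id"])
        if s_row is None:
            continue  # no matching survey record; this needs separate follow-up
        for var in check_vars:
            counts[s_row["enumerator"]][1] += 1
            if s_row[var] != bc_row[var]:
                counts[s_row["enumerator"]][0] += 1
    return {enum: (m, n, m / n) for enum, (m, n) in counts.items()}
```

Grouping the same mismatch counts by variable instead of by enumerator would speak to the other goal of back checking: gauging the reliability of the questions themselves.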