# Evaluation criteria

The goal of this assignment is to assess your hands-on "data engineering" skills.  
It measures your proficiency in general programming and engineering tasks using Python.

Getting the final answer(s) correct is less important than being able to explain your code and the choices and assumptions you made.

Set aside approximately 1 hour to complete this exercise.

**In this exercise we expect you to demonstrate the following abilities:**
 * Getting the job done efficiently
 * Understanding the dataset and its assumptions
 * Choosing meaningful names for variables & functions
 * Writing maintainable code (for this exercise, include a couple of high-level comments in each cell explaining what your code does, and keep the code readable)


# Setting up a data science workspace

You have full freedom in setting up a data science runtime.  
The main objective is to have a runtime where you can run this notebook and the code you will develop. 
You can opt for a local setup on your PC, or even a cloud setup if you're up for it.  
Please have this notebook ready for a discussion of your code and for additional questions during our next conversation. 

**Your environment will need support for:**
 * Python 3
 * managing dataframes 
 * ingesting JSON



# Importing packages

We would like you to put all your import statements here, together in one place.  
Before submitting, please make sure you remove any unused imports.   

In [None]:
## your imports go here.

import pandas as pd

# Data ingestion

## Load in answer data from answers.txt

We've extracted data from our newest application __MYFAKESURVEYAPP__ into the file 'answers.txt'.  
Users of this application voluntarily answer survey questions, packaged into survey forms, to tell us how they're doing at any given time. For example, user Martha may answer the survey 'Daily Living', which has 3 questions, every morning.  

Each row of the file is a JSON object representing a user's answer to a question on a form.

There may also be references to voice recordings in the answer dataset; feel free to ignore those records.

Load the answers.txt file into a pandas dataframe. The structure of a data record and a data dictionary of key fields are shown below.

**Hint: Use the questionHid field to find the answer value in the formValues dictionary**

```
{
  "formId": "xxxxxxx",                 = Identifier of the questionnaire being asked
  "formValues":                        = Dictionary capturing values returned from the form
  {
    "QXX": [                           = Question label - the key (label) may change for different questions
      "7.8"                            = Answer to question
    ],
    "questionId": [                    = Identifier of Question
      "xxxxxxx"
    ],
    "attachment": [
      ""
    ],
    "questionType": [
      "HTML"
    ],
    "taskId": [                        = Identifier of Task Instance (A task is an instance of a form that a user completes)
      "xxxxxxx"
    ]
  },
  "questionId": "xxxxxxx",             = Identifier of Question (Same as above)
  "questionHid": "QXX",                = Specifies the key (label) to find the answer value in formValues
  "created": 1557503177316,            = When answer was submitted (answered date)
  "resources": null,                   = File attachments would be shown here
  "id": "xxxxxxx",                     = Identifier of Answer
  "type": "HTML",
  "painLog": "",
  "updated": 1557503177316,
  "userId": "xxxxxxx",                 = Identifier of user answering the question
  "taskId": "xxxxxxx"                  = Identifier of Task Instance (A task is an instance of a form that a user completes - Same as above)
}
```
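One possible way to follow the hint is sketched below; it is an illustration under stated assumptions, not the expected solution. The two sample records are made up in the documented shape, and the helper names `extract_answer` and `load_answers` as well as the `answer` column are hypothetical.

```python
import io
import json

import pandas as pd

# Two made-up records in the documented shape; the real data lives in answers.txt.
sample_lines = io.StringIO(
    '{"id": "a1", "userId": "u1", "questionHid": "PAIN1", "created": 1557503177316, '
    '"formValues": {"PAIN1": ["7.8"], "questionType": ["HTML"]}}\n'
    '{"id": "a2", "userId": "u1", "questionHid": "NAME", "created": 1557503177316, '
    '"formValues": {"NAME": ["Martha"], "questionType": ["HTML"]}}\n'
)


def extract_answer(record):
    """Return the first value stored under the record's questionHid key, or None."""
    values = (record.get("formValues") or {}).get(record.get("questionHid"))
    return values[0] if values else None


def load_answers(lines):
    """Parse one JSON object per non-empty line and add a hypothetical 'answer' column."""
    records = [json.loads(line) for line in lines if line.strip()]
    frame = pd.DataFrame(records)
    frame["answer"] = [extract_answer(record) for record in records]
    return frame


answers = load_answers(sample_lines)
print(answers[["id", "questionHid", "answer"]])
```

With the real file, the same function could be called as `load_answers(handle)` inside `with open("answers.txt") as handle:`, assuming the file really holds one JSON object per line.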

In [None]:
## your code goes here


# Data Pipeline
### Quality checks

We would like you to add several checks on this data based on these constraints:  
 * Each record should be unique (i.e. appear only once in the data set). Hint: each unique record can be identified by its 'id'
 * Answers to questions PAIN1, PAIN2, PAIN3 should be numeric and in the range 0 - 10, or be missing (NaN or None value).

Filter out any records that do not meet these criteria.
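The two constraints could be checked roughly as follows. This is a sketch on made-up data: the `answer` column is a hypothetical name for the extracted answer value, the range is assumed to include both 0 and 10, and an empty string is treated as invalid rather than missing.

```python
import pandas as pd

# Made-up frame; 'answer' is a hypothetical column holding the extracted answer value.
answers = pd.DataFrame(
    {
        "id": ["a1", "a1", "a2", "a3", "a4", "a5"],
        "questionHid": ["PAIN1", "PAIN1", "PAIN2", "PAIN3", "PAIN1", "NAME"],
        "answer": ["3", "3", "11", None, "abc", "Martha"],
    }
)

PAIN_QUESTIONS = ["PAIN1", "PAIN2", "PAIN3"]

# Uniqueness check: keep only the first occurrence of every id.
deduplicated = answers.drop_duplicates(subset="id", keep="first")

# Range check: only pain questions are constrained.
is_pain = deduplicated["questionHid"].isin(PAIN_QUESTIONS)
is_missing = deduplicated["answer"].isna()
numeric = pd.to_numeric(deduplicated["answer"], errors="coerce")
in_range = numeric.between(0, 10)  # non-numeric values become NaN and fail this test

# Keep non-pain records, plus pain records that are missing or within range.
clean = deduplicated[~is_pain | is_missing | in_range]
print(clean)
```

Here the duplicate `a1`, the out-of-range `a2`, and the non-numeric `a4` are dropped, while the missing `a3` and the non-pain record `a5` are kept.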

In [None]:
## your code goes here



### Data Transformations

Create a DataFrame with a column PAINAVG that contains the combined average of answers to questions PAIN1, PAIN2, PAIN3 for each user for each day. 

For example, for each day and user: PAINAVG = (sum of all PAIN1, PAIN2, PAIN3 answers on the day) / (number of PAIN1, PAIN2, PAIN3 answers on the day)
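The formula above amounts to a group-wise mean, which could be sketched like this. The data is made up, the `answer`, `value`, and `day` columns are hypothetical names, and `created` is assumed to be epoch milliseconds in UTC (the sample record's value is consistent with that, but the source does not state it).

```python
import pandas as pd

# Made-up pain answers; 'answer' is a hypothetical column with the extracted value.
pain_answers = pd.DataFrame(
    {
        "userId": ["u1", "u1", "u1", "u2"],
        "questionHid": ["PAIN1", "PAIN2", "PAIN1", "PAIN3"],
        "created": [1557503177316, 1557506777316, 1557589577316, 1557503177316],
        "answer": ["2", "4", "6", "5"],
    }
)

# Convert answers to numbers; anything non-numeric becomes NaN and is skipped by mean().
pain_answers["value"] = pd.to_numeric(pain_answers["answer"], errors="coerce")

# Assumption: 'created' is epoch milliseconds in UTC; keep only the calendar day.
pain_answers["day"] = pd.to_datetime(pain_answers["created"], unit="ms").dt.date

# Mean over all pain answers per user and day = sum of answers / number of answers.
pain_avg = (
    pain_answers.groupby(["userId", "day"], as_index=False)["value"]
    .mean()
    .rename(columns={"value": "PAINAVG"})
)
print(pain_avg)
```

Pooling all three questions into one mean follows the formula literally; averaging per question first and then across questions would give a different result when the questions have different answer counts.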

In [None]:
## your code goes here



Users' names are provided as answers to records with the questionHid NAME. Create a new DataFrame called usernames that includes a column identifying the user's name for every userId.
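A minimal sketch of this lookup table, on made-up data, might look as follows. The `answer` column is a hypothetical name for the extracted answer value, and keeping the last NAME answer per user is an assumption about how to resolve repeated answers.

```python
import pandas as pd

# Made-up frame; 'answer' is a hypothetical column holding the extracted answer value.
answers = pd.DataFrame(
    {
        "userId": ["u1", "u1", "u2", "u2"],
        "questionHid": ["NAME", "PAIN1", "NAME", "NAME"],
        "answer": ["Martha", "3", "Adam", "Adam"],
    }
)

# Keep NAME records only, then reduce to one row per userId.
usernames = (
    answers.loc[answers["questionHid"] == "NAME", ["userId", "answer"]]
    .dropna(subset=["answer"])
    .drop_duplicates(subset="userId", keep="last")  # assumption: last answer wins
    .rename(columns={"answer": "NAME"})
    .reset_index(drop=True)
)
print(usernames)
```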


In [None]:
## your code goes here



# Computations


Using a pandas dataframe, compute the average of PAINAVG for each patient NAME to create a table like the example shown below. (Example numbers may not be accurate.)


| NAME      | AVGPAINAVG |
| --------- | ---------- |
| Adam      | 4.26       |
| Alejandro | 5.20       |
| Allison   | 4.79       |
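A table of this shape could be produced by joining the daily averages to the names and averaging again, as in the sketch below. Both input frames are made up, and the column layout (`userId`, `PAINAVG`, `NAME`) is an assumption about the results of the earlier steps.

```python
import pandas as pd

# Made-up inputs standing in for the results of the earlier steps.
pain_avg = pd.DataFrame(
    {"userId": ["u1", "u1", "u2"], "PAINAVG": [3.0, 6.0, 5.0]}
)
usernames = pd.DataFrame({"userId": ["u1", "u2"], "NAME": ["Martha", "Adam"]})

# Attach names to the daily averages, then average the daily values per name.
avg_pain_by_name = (
    pain_avg.merge(usernames, on="userId", how="left")
    .groupby("NAME", as_index=False)["PAINAVG"]
    .mean()
    .rename(columns={"PAINAVG": "AVGPAINAVG"})
)
print(avg_pain_by_name)
```

Grouping by NAME combines different users who happen to share a name, and users without a NAME record drop out of the result; whether that is acceptable is a choice worth stating.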



In [None]:
## your code goes here


Using a pandas dataframe, compute the percentage of questions that were answered, i.e. that have a non-null (non-empty) answer value, for each patient NAME to create a table like the example shown below.

| NAME      | COMPLIANCE |
| --------- | ---------- |
| Adam      | 0.84       |
| Alejandro | 0.93       |
| Allison   | 0.93       |


What is the overall percentage of answered questions?
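Both the per-name and the overall figure could be derived from a single boolean column, as sketched here on made-up data. The `answer` column is a hypothetical name, and treating empty strings as unanswered alongside None/NaN is an assumption; the result is a fraction between 0 and 1, matching the example table.

```python
import pandas as pd

# Made-up frame with names already attached; 'answer' is a hypothetical column.
answers = pd.DataFrame(
    {
        "NAME": ["Martha", "Martha", "Martha", "Martha", "Adam", "Adam"],
        "answer": ["3", None, "5", "", "7", "2"],
    }
)

# Assumption: None/NaN and blank strings both count as unanswered.
answered = answers["answer"].fillna("").astype(str).str.strip().ne("")

# The mean of a boolean column is the fraction of True values.
compliance = (
    answers.assign(answered=answered)
    .groupby("NAME", as_index=False)["answered"]
    .mean()
    .rename(columns={"answered": "COMPLIANCE"})
)
overall_compliance = answered.mean()

print(compliance)
print(round(overall_compliance, 2))
```

Note that the overall figure is computed over all answers, not as the mean of the per-name values, so users with more records weigh more.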


In [None]:
## your code goes here
