# Data Preparation
This script prepares the data for user-researchers to begin tagging the responses.

The end output is a combined dataset comprising:
- User Intent Survey (UIS) responses
- Feedex survey responses
    + Feedex is an anonymised subset of Zendesk

These two datasets should be de-duplicated and have Personally Identifiable Information (PII) removed.

Given that PII is already removed before we import the datasets into Python, the remaining steps are:

- [ ] Remove duplicates from the UIS and Zendesk survey responses
- [ ] Extract the Feedex responses from the Zendesk survey
- [ ] Join the two datasets together
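
The three steps above can be sketched on toy data as follows. This is a minimal illustration rather than the notebook's actual implementation: the column names (`response_id`, `ticket_id`, `source`, `response_text`, `survey`) and the values are hypothetical, and it assumes that Feedex rows can be identified by a flag column and that "joining" means stacking the rows of both datasets.

```python
import pandas as pd

# Toy stand-ins for the real data; all column names and values are hypothetical.
uis = pd.DataFrame({
    "response_id": [1, 2, 2, 3],
    "response_text": ["find a form", "renew passport", "renew passport", "check tax"],
})
zendesk = pd.DataFrame({
    "ticket_id": [10, 11, 11, 12],
    "source": ["feedex", "email", "email", "feedex"],
    "response_text": ["page is broken", "cannot log in", "cannot log in", "link missing"],
})

# Step 1: remove exact duplicate rows from each dataset.
uis_dedup = uis.drop_duplicates().reset_index(drop=True)
zendesk_dedup = zendesk.drop_duplicates().reset_index(drop=True)

# Step 2: keep only the Feedex rows (assumes a hypothetical "source" column flags them).
feedex = zendesk_dedup[zendesk_dedup["source"] == "feedex"]

# Step 3: stack the two datasets row-wise, labelling where each row came from.
combined = pd.concat(
    [
        uis_dedup[["response_text"]].assign(survey="uis"),
        feedex[["response_text"]].assign(survey="feedex"),
    ],
    ignore_index=True,
)
```

`pd.concat` is used here because the two surveys are treated as separate sets of responses with no shared key. If the real datasets do share an identifier, a key-based `DataFrame.merge` may be the more appropriate way to join them.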

In [None]:
# allow multiple outputs in same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pandas as pd

# read in the UIS and Zendesk datasets
data_uis = pd.read_csv("../data/uisdata_all_cols_all_of_march.csv")
data_zendesk = pd.read_csv("../data/cleaned_zendesk-01-04-2020.csv")

These dataframes have many columns, so rather than viewing their full contents, we first list the column names.

In [None]:
list(data_uis)
list(data_zendesk)

In [None]:
data_uis.head(10)
data_zendesk.head(10)

## Saving for Version Control
Below, we convert our Jupyter notebook into two formats so that it is easy to version-control:

- HTML: This is so readers can see the outputs.
- Python: This is so readers can run the code.

In [None]:
# convert to HTML (for reading) and Python script (for coding)
!jupyter nbconvert --to html dataprep_uisfeedex.ipynb
!jupyter nbconvert --to python dataprep_uisfeedex.ipynb
