NOTE: WHEN DOWNLOADING NEW DATA FROM SMARTSURVEY, ENSURE THAT THE 'Display Columns Headings in One Row' CHECKBOX IS CHECKED, OTHERWISE THE DATA WILL BE IN THE WRONG FORMAT.
This repo contains code used to automate the classification of responses to the user intent survey conducted on GOV.UK.
The project is described in the blog post
Nominally this application requires the following:
- Python 3.6.2
Setting up an environment with anaconda or venv before proceeding is recommended. The required packages can then be installed with pip install -r requirements.txt.
The only unusual requirement is the classifyintents package, which was developed to handle the cleaning of the data; it is installed by the pip command above.
- create_training_set.py: Combines multiple, differently formatted, manually classified datasets into a single authoritative training set.
- cleaner.py: Conducts initial cleaning of a dataset prior to modelling or predicting.
python cleaner.py <input file (csv)> <output file (pkl)>
- trainer.py: Trains a model on the cleaned data that cleaner.py outputs as a pickle object.
python trainer.py <cleaned data (pkl)> <model object (pkl)>
- predictor.py: Makes predictions on newly acquired data downloaded from SurveyMonkey, using the model trained by trainer.py.
python predictor.py <input data (csv)> <model object (pkl)>
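The three usage lines above imply an order: clean, train, then predict. A minimal sketch of chaining them from Python follows. The helper names `build_pipeline_commands` and `run_pipeline` are illustrative and not part of this repo, and any file paths passed in are placeholders.

```python
import subprocess
import sys


def build_pipeline_commands(raw_csv, cleaned_pkl, model_pkl, new_csv):
    """Return the three script invocations in the order they are meant to run.

    Hypothetical helper: the argument order for each script follows the
    usage lines in this README.
    """
    python = sys.executable  # interpreter of the currently active environment
    return [
        [python, "cleaner.py", raw_csv, cleaned_pkl],
        [python, "trainer.py", cleaned_pkl, model_pkl],
        [python, "predictor.py", new_csv, model_pkl],
    ]


def run_pipeline(commands):
    """Run each command in turn, stopping at the first non-zero exit code."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling `run_pipeline(build_pipeline_commands(...))` from the repository root, with paths of your choosing, would run the scripts one after another; `check=True` makes a failure in one step stop the later steps from running.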
Note that for privacy reasons, no data are stored in this repository. .gitkeep files are used to retain the following directory structure:
- input_data
  - Contains raw downloads from SurveyMonkey prior to being classified using predictor.py.
- models
  - Pickle objects of the trained models are stored here.
- output_data
  - Cleaned data produced by cleaner.py are stored here.
  - Predicted data that have been classified using predictor.py are also stored here.
- training_data
  - Pre-classified training data are stored here.
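If the directories are ever missing (for example in a fresh working copy made without git), they can be recreated by hand. The following is a sketch only, using the four top-level directory names listed above; the function name `create_data_directories` is illustrative and not part of this repo.

```python
from pathlib import Path

# Top-level data directories named in this README.
DIRECTORIES = ["input_data", "models", "output_data", "training_data"]


def create_data_directories(root="."):
    """Create each data directory under root with an empty .gitkeep inside."""
    created = []
    for name in DIRECTORIES:
        directory = Path(root) / name
        directory.mkdir(parents=True, exist_ok=True)  # no error if it exists
        (directory / ".gitkeep").touch()  # empty placeholder so git tracks the folder
        created.append(directory)
    return created
```

The function is safe to run repeatedly: existing directories and .gitkeep files are left in place.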
To assess the performance of PII removal, run the script pii_test_cases.py:
python pii_test_cases.py <input_data/raw.csv> <output_data/classified/classified.csv> <output.csv>
This will take all the cases in which PII was identified, and combine these with the uncleansed examples to facilitate comparison. From this comparison, new test cases can be created to improve the performance of the PII removal.
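The comparison described above can be pictured with the sketch below. It is not the code in pii_test_cases.py. The column names `respondent_id` and `comment` are hypothetical, and the sketch assumes that a case counts as "PII identified" whenever the cleaned text differs from the raw text.

```python
import csv


def load_rows(path):
    """Read a CSV file into a list of dictionaries keyed by column heading."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def pair_pii_cases(raw_rows, classified_rows,
                   id_column="respondent_id", text_column="comment"):
    """Pair each altered comment with its uncleansed original.

    Assumption: a row is treated as a PII case when its cleaned text
    differs from the raw text for the same (hypothetical) id column.
    """
    raw_by_id = {row[id_column]: row[text_column] for row in raw_rows}
    pairs = []
    for row in classified_rows:
        original = raw_by_id.get(row[id_column])
        if original is not None and original != row[text_column]:
            pairs.append({
                id_column: row[id_column],
                "raw": original,
                "cleaned": row[text_column],
            })
    return pairs
```

Reading the paired rows side by side shows what the cleaning step changed, which helps when writing new test cases.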
This work is due to be revisited in 2017.