> [!IMPORTANT]
> This repo is currently a work in progress 🔨. The outputs of the application cannot be guaranteed at this time. Use with caution.
Public consultations are processes where government bodies invite the public to share their views on policy issues to inform decision-making. Responses are encouraged via specific, structured questions, but they can also be submitted as unstructured text via email.
For all responses to the consultation to be analysed, feedback from unstructured text responses needs to be processed and assigned to the relevant questions in the consultation. This process is currently time-consuming, error-prone and resource-intensive.
Extr@ct is a Flask application that uses AI tools to automate the processing of these unstructured consultation responses, enabling faster information extraction and reducing the burden on staff. The tool allows a user to upload a structured list of questions and an unstructured response; the AI then extracts the relevant answers.
This project is currently in development and not production-ready. We have prioritised functionality over the user interface for this initial release, so there may be unexpected errors. UI improvements are planned for future releases.
The project requirements are currently specific to the infrastructure within DSIT, including AWS and Azure services for deployment and GPT-4o access. We aim to support flexible model, server, and database configurations in future versions.
We are evaluating performance using historic labelled data and testing on live consultations. Initial results are promising, and we hope to publish a comprehensive evaluation in future.
Before getting started, ensure you have:
- AWS CLI
- Terraform
- Docker
- Python 3.12
- Poetry
- A GPT-4o endpoint created in Azure OpenAI
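You can sanity-check the installed tooling from a terminal (a quick sketch; exact version output will differ on your machine):

```
aws --version
terraform -version
docker --version
python3 --version
poetry --version
```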
The application relies on databases created in AWS and an Azure OpenAI endpoint. The required databases are created using Terraform and are detailed in the `infrastructure/` folder.
To get started, log in to AWS with

```
aws sso login --profile <your_aws_profile>
```

where `<your_aws_profile>` is the name of your AWS profile. If you have not created your AWS profile locally, use the AWS documentation to configure your setup.
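If you have not configured a profile before, AWS CLI v2 can also create one interactively (a minimal sketch; the profile name is a placeholder of your choosing):

```
aws configure sso --profile <your_aws_profile>
```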
Create the required databases in AWS using Terraform:

```
cd infrastructure/
terraform init
terraform apply
cd ..
```

Review the proposed changes before confirming the apply.
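If you'd like to inspect the proposed infrastructure before applying, you can generate a plan first (standard Terraform workflow, not a project-specific step):

```
terraform plan
```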
Within the S3 bucket created by Terraform (default: `extract-upload-data-dev`), visible in the AWS Management Console, create the system and human prompt templates at `prompt_templates/approved/prompt_template_system.jinja` and `prompt_templates/approved/prompt_template_human.jinja`. Copy these templates from the `prompt_templates/approved/` directory of your local repo to the bucket.
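As a sketch, assuming the default bucket name, the templates could be copied with the AWS CLI:

```
aws s3 cp prompt_templates/approved/prompt_template_system.jinja \
  s3://extract-upload-data-dev/prompt_templates/approved/prompt_template_system.jinja
aws s3 cp prompt_templates/approved/prompt_template_human.jinja \
  s3://extract-upload-data-dev/prompt_templates/approved/prompt_template_human.jinja
```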
Next, add your GPT-4o model configuration secrets, which are handled by AWS Secrets Manager. In the AWS Management Console, go to Secrets Manager > Secrets > `extract-secrets` and manually insert the following key-value pairs:
| Key | Explanation | Example |
|---|---|---|
| AZURE_OPENAI_ENDPOINT | Azure OpenAI URI | https://xyz.openai.azure.com/ |
| AZURE_OPENAI_API_KEY | API Key for Azure GPT-4o | abcd1234 |
| AZURE_OPENAI_API_VERSION | Version of the deployed GPT-4o model | 1970-01-01-preview |
| AZURE_OPENAI_CHAT_DEPLOYMENT_NAME | The name of your deployed model | my-gpt-4o-model |
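Alternatively, if you prefer the CLI, something like the following should set all four values in one go (illustrative values taken from the table above; substitute your own):

```
aws secretsmanager put-secret-value \
  --secret-id extract-secrets \
  --secret-string '{
    "AZURE_OPENAI_ENDPOINT": "https://xyz.openai.azure.com/",
    "AZURE_OPENAI_API_KEY": "abcd1234",
    "AZURE_OPENAI_API_VERSION": "1970-01-01-preview",
    "AZURE_OPENAI_CHAT_DEPLOYMENT_NAME": "my-gpt-4o-model"
  }'
```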
For details on using Secrets Manager, see the AWS Secrets Manager documentation.
To start developing locally, install dependencies using Poetry:

```
poetry install
```
and create a `.env` file. If you followed the default installation instructions above, use:

```
MODEL_SECRETS_ID=extract-secrets
RESPONSES_TABLE_NAME=extract-responses
QUESTIONS_TABLE_NAME=extract-questions
REVIEW_TABLE_NAME=extract-reviews
MANUAL_TABLE_NAME=extract-manual-reviews
USER_UPLOAD_S3_BUCKET=extract-upload-data-dev
SYSTEM_PROMPT_TEMPLATE_PATH=prompt_templates/approved/prompt_template_system.jinja
HUMAN_PROMPT_TEMPLATE_PATH=prompt_templates/approved/prompt_template_human.jinja
REGION=eu-west-2
ENVIRONMENT=DEV
```

| Key | Description |
|---|---|
| MODEL_SECRETS_ID | AWS Secrets Manager ID |
| RESPONSES_TABLE_NAME | DynamoDB table name, responses overview data |
| QUESTIONS_TABLE_NAME | DynamoDB table name, each question and model response |
| REVIEW_TABLE_NAME | DynamoDB table name, reviews carried out by the user |
| MANUAL_TABLE_NAME | DynamoDB table name, text manually added to the extraction |
| USER_UPLOAD_S3_BUCKET | S3 bucket name, used to store uploaded questions and responses as well as prompt templates |
| SYSTEM_PROMPT_TEMPLATE_PATH | S3 file path, location of the system prompt template |
| HUMAN_PROMPT_TEMPLATE_PATH | S3 file path, location of the human prompt template |
| REGION | AWS region services are deployed in |
| ENVIRONMENT | Can be DEV, PREPROD, or PROD |
If you have changed the naming of any infrastructure, reflect this in `.env`.
Now SSO into AWS using your AWS profile name:

```
aws sso login --profile <your_aws_profile>
```
The Docker container can be built with

```
docker build -t <image_name> .
```

replacing `<image_name>` with an image name of your choice (e.g. `extract`).
Then run the image with your AWS credentials, your `.env` file, and your AWS profile (named `<your_aws_profile>`):

```
docker run -it \
  -p 5000:5000 \
  --env-file=.env \
  -v ~/.aws/credentials:/root/.aws/credentials:ro \
  -v ~/.aws/config:/root/.aws/config:ro \
  -v ~/.aws/sso/cache:/root/.aws/sso/cache:ro \
  -v ~/.aws/cli/cache:/root/.aws/cli/cache:ro \
  -v $(pwd)/.env:/app/.env:ro \
  -e AWS_PROFILE=<your_aws_profile> \
  <image_name>
```
Make sure you have logged in to AWS via SSO so that the Docker container can connect to AWS with your credentials.
Open http://127.0.0.1:5000 in a browser to see the app running.
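You can also check the app is responding from the command line (assuming the default port mapping above):

```
curl -I http://127.0.0.1:5000
```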
- Go to the "Upload Questions" tab and upload a `.csv` of the consultation questions you want to extract responses to from unstructured text. The `.csv` must have `question_label` and `question_text` columns, where `question_label` values are unique to each question; a minimal example is sketched below.
- Then go to "Upload Responses" to upload the `.pdf`s you want to extract question responses from. You can upload multiple `.pdf`s at once.
- Then go to "Extract" and press the "Activate LLM Extractor" button. This will start the processing of the questions and responses. Note this can take a few minutes depending on the number of questions and the size of the `.pdf`(s). Once the extraction is complete, an "Extraction successful" message will appear, or an error message if it fails.
- Once extracted, go to "Review" to view all the questions the model has extracted responses to. You will then be able to Accept or Reject each model response. The process returns the exact text that is in the `.pdf`; to review what the model actually returned, go to "Show Details", or to see the `.pdf` text in full, go to "Full Response Text".
- Then go to "Manual Extraction" to view all questions, including ones the model has extracted nothing for; here you can manually add any relevant text the model hasn't included to each question.
Our roadmap includes:
- Deploying on live consultations in the immediate term
- Publishing a robust evaluation of the tool's performance
- Releasing a stable v1.0 once the application reaches maturity
This project is licensed under the MIT Licence. See the LICENCE file for details.
If you'd like to contribute, please see CONTRIBUTING.md.
