![Rijksoverheid logo](https://www.rijksoverheid.nl/binaries/content/gallery/rijksoverheid/channel-afbeeldingen/logos/logo-ro.svg)

# Dutch Government Policy QA dataset
This dataset is open-source and can be found on the open data portal of the [Rijksoverheid](https://www.rijksoverheid.nl/opendata/vac-s). It contains up to 2500 frequently asked questions from Dutch citizens. The questions concern Dutch government policies and cover topics such as "Belasting" (tax), "Asbest" (asbestos), or "Klimaat" (climate).<br>
More info about the status and contact information can be found [here](https://data.overheid.nl/dataset/vraag-antwoordcombinaties-van-rijksoverheid-nl#panel-description). <br><br>

**How to use:** <br>
The easiest way to get results is to open this notebook in Google Colab and run it.

### In this notebook:
- The Dutch policy QA data is imported via the API with a crawler
- Initial EDA is performed to check the size, completeness, and volume
- The necessary columns are exported as a CSV
- A short answer is retrieved manually from the context
- Extra EDA is performed on the final dataset
- The PolicyQA dataset is converted to the correct input for the QA model by using our DF to JSON converter
- The PolicyQA dataset in JSON format is used as input for the model
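
The DF-to-JSON conversion step listed above can be sketched as follows. This is a minimal illustration and not the project's own converter: it assumes the QA model expects SQuAD-style JSON, and the field names `question`, `context`, and `short_answer` are hypothetical.

```python
import json


def records_to_squad(records, title="PolicyQA"):
    """Convert row dicts to a SQuAD-style dict.

    Assumed (hypothetical) keys per row: 'question', 'context', 'short_answer'.
    Rows whose short answer does not occur verbatim in the context are skipped.
    """
    paragraphs = []
    for idx, row in enumerate(records):
        start = row["context"].find(row["short_answer"])
        if start == -1:
            continue  # the answer span cannot be located in the context
        paragraphs.append({
            "context": row["context"],
            "qas": [{
                "id": str(idx),
                "question": row["question"],
                "answers": [{"text": row["short_answer"], "answer_start": start}],
            }],
        })
    return {"version": "v1.1", "data": [{"title": title, "paragraphs": paragraphs}]}


# Made-up example row, for illustration only
rows = [{
    "question": "What is asbestos?",
    "context": "Asbestos is a collective name for several minerals.",
    "short_answer": "a collective name",
}]
squad = records_to_squad(rows)
print(json.dumps(squad, ensure_ascii=False, indent=2))
```

With a pandas DataFrame, the rows could be obtained via `df.to_dict("records")` before calling the function.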

In [None]:
# if using Colab, install necessary libraries
%pip install transformers

In [2]:
# Import libraries
import requests
import pandas as pd
import numpy as np
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
import json
import time
import matplotlib.pyplot as plt

## Import Data
First, the data is imported using `crawler.py`.<br>
Then the data is checked for volume, completeness, and so on.

In [None]:
# Run crawler
!python3 /scripts/crawler.py

In [None]:
# Import csv
df = pd.read_csv('policyqa-raw.csv')
df.head()

## Initial EDA

In [None]:
# Check info
df.info()
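
Beyond `df.info()`, the volume and completeness checks could be made explicit with a small helper. The following is a minimal sketch; the helper name `completeness_report` and the toy columns are hypothetical and only stand in for the real data.

```python
import pandas as pd


def completeness_report(df):
    """Return per-column non-null counts and the percentage of missing values."""
    report = pd.DataFrame({
        "non_null": df.notna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
    })
    # Columns with the most missing values come first
    return report.sort_values("missing_pct", ascending=False)


# Made-up toy frame; the real columns come from policyqa-raw.csv
toy = pd.DataFrame({
    "question": ["a", "b", "c", "d"],
    "answer": ["x", None, "z", None],
})
report = completeness_report(toy)
print(f"rows: {len(toy)}, duplicate rows: {toy.duplicated().sum()}")
print(report)
```

Calling `completeness_report(df)` on the imported data would show at a glance which columns are too sparse to be useful.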

## Clean and export Data
We remove HTML tags as well as brackets. The brackets have to be removed so that the input file can be created as JSON.<br>
The data is then exported so that we can annotate the short answer manually.
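
The cleaning described here could look roughly like the sketch below. It is an illustration under assumptions: the notebook does not show which brackets are removed, so square and curly brackets are assumed, and the helper name `clean_text` is hypothetical.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")        # anything that looks like an HTML tag
BRACKET_RE = re.compile(r"[\[\]{}]")   # assumption: square and curly brackets


def clean_text(text):
    """Strip HTML tags, unescape entities, drop brackets, collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    text = BRACKET_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


# Made-up input string, for illustration only
cleaned = clean_text("<p>Some <b>answer</b> text [with brackets] &amp; entities.</p>")
print(cleaned)
```

On a DataFrame this could be applied per column, for example `df["answer"] = df["answer"].astype(str).map(clean_text)`, where the column name `answer` is an assumption.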

In [None]:
# Take first 7 columns
df = df.iloc[:, 0:7]
# Drop the "canonical" and "dataurl" columns
df = df.drop(columns=["canonical", "dataurl"])
df

In [None]:
# Export for manual annotation (note: this overwrites the raw CSV that was read above)
df.to_csv('policyqa-raw.csv', encoding='utf-8-sig')