![Rijksoverheid logo](https://www.rijksoverheid.nl/binaries/content/gallery/rijksoverheid/channel-afbeeldingen/logos/logo-ro.svg)

# Dutch Government Policy QA dataset
This dataset is open source and available on the open data portal of the [Rijksoverheid](https://www.rijksoverheid.nl/opendata/vac-s). It contains up to 2500 questions frequently asked by Dutch citizens. The questions concern Dutch government policy and cover topics such as "Belasting" (tax), "Asbest" (asbestos), or "Klimaat" (climate).<br>
More info about the status and contact information can be found [here](https://data.overheid.nl/dataset/vraag-antwoordcombinaties-van-rijksoverheid-nl#panel-description). <br><br>

**How to use:** <br>
It is best to use Google Colab and run the notebook to get results.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/berryyblom/domain-adaptation-transformers-forQA/blob/main/notebooks/eda-policyqa.ipynb)

### In this notebook:
- The Dutch policy QA data is imported via the API with a crawler
- Initial EDA is performed to check the size, completeness, and volume
- The necessary columns are exported as a CSV
- A short answer is retrieved manually from the context
- Extra EDA is performed on the final dataset
- The PolicyQA dataset is converted to the correct input for the QA model by using our DataFrame-to-JSON converter
- The PolicyQA dataset in JSON format is used as input for the model

In [None]:
# If using Colab, install the necessary libraries
%pip install transformers seaborn

In [None]:
# Import libraries
import requests
import pandas as pd
import numpy as np
from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0
import json
import time
import matplotlib.pyplot as plt

## Import Data
First the data is imported using crawler.py <br>
Then the data is checked for volume, completeness etc.
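
The crawler script itself is not shown in this notebook. As a rough illustration of the idea only, the sketch below collects records from a paginated source into a single DataFrame. The `fetch_page(offset, rows)` callable and its parameters are hypothetical and are not taken from `crawler.py` or from the real API; an offline stand-in replaces the HTTP call so the example runs without network access.

```python
from typing import Callable, Dict, List

import pandas as pd


def crawl_all(fetch_page: Callable[[int, int], List[Dict]], rows: int = 200) -> pd.DataFrame:
    """Collect every record from a paginated source into one DataFrame.

    fetch_page(offset, rows) is a hypothetical callable that returns a list
    of records, and an empty list once the offset is past the last record.
    """
    records: List[Dict] = []
    offset = 0
    while True:
        page = fetch_page(offset, rows)
        if not page:
            break  # no more records to fetch
        records.extend(page)
        offset += len(page)
    return pd.DataFrame.from_records(records)


# Offline stand-in for the real HTTP call (made-up records).
_fake_source = [{"id": i, "question": f"Vraag {i}?"} for i in range(5)]


def fake_fetch(offset: int, rows: int) -> List[Dict]:
    return _fake_source[offset:offset + rows]


dfdemo = crawl_all(fake_fetch, rows=2)
print(dfdemo.shape)
```

In a real crawler, `fake_fetch` would be replaced by a function that issues the request (for example with `requests.get`) and returns the decoded JSON records.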

In [None]:
# Run crawler
!python3 /scripts/crawler.py

In [None]:
# Import csv
dfraw = pd.read_csv('../data/temp/policyqa-raw.csv')
dfraw.head()

## Initial EDA
EDA performed on the raw data
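
Besides `info()`, completeness can be summarised per column as a share of missing values. The following is a minimal sketch on a toy frame; the helper name `completeness_report` and the toy values are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd


def completeness_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column non-null counts and the share of missing values."""
    return pd.DataFrame({
        "non_null": df.notna().sum(),
        "missing_share": df.isna().mean(),
    })


# Toy frame with made-up values; one 'content' cell is missing on purpose.
toy = pd.DataFrame({
    "question": ["vraag 1", "vraag 2"],
    "introduction": ["intro 1", "intro 2"],
    "content": ["extra uitleg", None],
})
report = completeness_report(toy)
print(report)
```

Applied to the raw data, this would be `completeness_report(dfraw)`.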

In [None]:
# Check info
dfraw.info()

In [None]:
dfraw['content'][0]

## Raw conclusion
* Inspecting the data shows that for around 50% of the questions the actual answer appears only in the Introduction column.
* The actual answer is always in the Introduction column; the Content column is supplementary to the answer.
* Therefore, the data is exported with only the necessary columns.
* We add supplementary annotations in the form of a short answer, derived from the Introduction column (the actual answer, supplied by domain experts).

In [None]:
# Keep the first 7 columns
dfraw = dfraw.iloc[:, 0:7]
# Drop the canonical and dataurl columns
dfraw = dfraw.drop(columns=["canonical", "dataurl"])
# Export to csv
dfraw.to_csv('policyqa-raw.csv', encoding='utf-8-sig')

## EDA
The annotated data is added manually. <br>
To convert the annotated data to a clean dataset, we call scripts/converter.py. <br>
This script:
- removes non-alphanumeric characters
- adds all answer start character positions
- converts the DataFrame to the correct model input, which is saved at data/dataV3.json
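
The three steps can be sketched as follows. This is not the code of `converter.py`: it assumes a SQuAD-style JSON layout as the model input, and the column names `context` and `short_answer`, the helper names, and the toy row are all hypothetical.

```python
import json
import re

import pandas as pd


def clean(text: str) -> str:
    """Keep word characters (letters, digits, underscore) and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def answer_starts(context: str, answer: str) -> list:
    """Return the character offset of every non-overlapping occurrence of answer."""
    if not answer:
        return []
    return [m.start() for m in re.finditer(re.escape(answer), context)]


def to_squad(df: pd.DataFrame) -> dict:
    """Build a SQuAD-style dict (assumed target format) from a DataFrame."""
    data = []
    for i, row in df.iterrows():
        context = clean(row["context"])
        answer = clean(row["short_answer"])
        answers = [{"text": answer, "answer_start": s}
                   for s in answer_starts(context, answer)]
        data.append({
            "title": str(i),
            "paragraphs": [{
                "context": context,
                "qas": [{"id": str(i), "question": row["question"], "answers": answers}],
            }],
        })
    return {"version": "sketch", "data": data}


# Made-up example row, for illustration only.
toy = pd.DataFrame({
    "question": ["Vanaf wanneer geldt de regeling?"],
    "context": ["De regeling geldt vanaf 1 januari. Meer informatie volgt."],
    "short_answer": ["vanaf 1 januari"],
})
squad = to_squad(toy)
print(json.dumps(squad, ensure_ascii=False, indent=2))
```

Cleaning the context and the answer with the same function before searching keeps the stored `answer_start` offsets aligned with the text the model actually sees.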

In [None]:
# Run converter
!python3 /scripts/converter.py

In [None]:
# Read annotated cleaned csv
df = pd.read_csv('../data/temp/policyqa-annotated-clean.csv')
df.head()

In [None]:
# Drop first 2 columns
df = df.iloc[: , 2:]

#### Charts for dataset statistics
- Number of characters
- Number of words

In [None]:
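def plot_character_length_histogram(text, label='Context'):
    # Histogram of the number of characters per entry; `label` names the column in the title
    text.str.len().hist().set(
        xlabel='Character length', ylabel='Count',
        title=f'Character length histogram ({label})')

plot_character_length_histogram(df['question'], label='Question')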

The histogram shows that questions range from 10 to 100 characters, with most between 30 and 70 characters.

In [None]:
plot_character_length_histogram(df['introductioncontent'])

The histogram shows that contexts range from 300 to 5800 characters, with most between 900 and 1900 characters.
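
Ranges like these can also be computed directly instead of being read off a histogram. A minimal sketch, assuming a text column held in a pandas Series; the helper name `length_summary` and the toy questions are made up for illustration.

```python
import pandas as pd


def length_summary(text: pd.Series) -> pd.DataFrame:
    """Summarise character and word counts for a text column."""
    chars = text.str.len()
    words = text.str.split().str.len()
    return pd.DataFrame({"characters": chars.describe(), "words": words.describe()})


toy = pd.Series(["Wat is asbest?", "Hoe vraag ik zorgtoeslag aan?"])
summary = length_summary(toy)
print(summary.loc[["min", "50%", "max"]])
```

On the notebook's data this would be called as `length_summary(df['question'])` or `length_summary(df['introductioncontent'])`.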

In [None]:
df['question'].str.split().\
    map(len).\
    hist()

# We can see that the number of words ranges from 3 to 15

In [None]:
df['question'].str.split().\
   apply(lambda x : [len(i) for i in x]). \
   map(lambda x: np.mean(x)).hist()

# The average word length is 5.5

In [None]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop = set(stopwords.words('dutch'))

In [None]:
# Flatten the tokenised questions into a single list of words
new = df['question'].str.split()
new = new.values.tolist()
corpus = [word for i in new for word in i]

# Count how often each Dutch stopword occurs
from collections import defaultdict
dic = defaultdict(int)
for word in corpus:
    if word in stop:
        dic[word] += 1

In [None]:
def plot_top_stopwords_barchart(text):
    stop=set(stopwords.words('dutch'))
    
    new= text.str.split()
    new=new.values.tolist()
    corpus=[word for i in new for word in i]
    from collections import defaultdict
    dic=defaultdict(int)
    for word in corpus:
        if word in stop:
            dic[word]+=1
            
    top=sorted(dic.items(), key=lambda x:x[1],reverse=True)[:10] 
    x,y=zip(*top)
    plt.bar(x,y)

In [None]:
plot_top_stopwords_barchart(df['question'])
# We can see that stopwords like Ik (I), een (a), mijn (my), etc. are the most frequent stopwords.

In [None]:
# Code Snippet for Top Non-Stopwords Barchart

import seaborn as sns
from nltk.corpus import stopwords
from collections import  Counter

def plot_top_non_stopwords_barchart(text):
    stop=set(stopwords.words('dutch'))
    
    new= text.str.split()
    new=new.values.tolist()
    corpus=[word for i in new for word in i]

    counter=Counter(corpus)
    most=counter.most_common()
    x, y=[], []
    for word,count in most[:40]:
        if (word not in stop):
            x.append(word)
            y.append(count)
            
    sns.barplot(x=y,y=x).set(title='Top Non-Stopwords')

In [None]:
plot_top_non_stopwords_barchart(df['question'])

In [None]:
plot_top_non_stopwords_barchart(df['introductioncontent'])

We can see paragraph and paragraphtitle are common words. These do not add any meaning to the text, so let's remove them. Also the colon (:) and the word title do not add meaning to the text, so we remove those words as well.

In [None]:
# Remove the markup tokens in one pass; 'paragraphtitle' is listed before 'paragraph' so the longer token is matched whole
df['introductioncontent'] = df['introductioncontent'].str.replace(r'paragraphtitle|paragraph|title|:', '', regex=True)