# Data Quality Check
### This notebook samples a random meeting, compiles the scraped data for that meeting, and prints the meeting URL so that the scraped data can be checked against the actual page in a browser

In [None]:
import pandas as pd
import random

# read the data
meeting = pd.read_parquet('./data/data_meeting.parquet')
data_agenda1 = pd.read_parquet('./data/data_agenda1.parquet')
data_agenda2 = pd.read_parquet('./data/data_agenda2.parquet')
data_agenda3 = pd.read_parquet('./data/data_agenda3.parquet')
data_speech1 = pd.read_parquet('./data/data_speech1.parquet')
data_speech2 = pd.read_parquet('./data/data_speech2.parquet')
data_speech3 = pd.read_parquet('./data/data_speech3.parquet')
parMem = pd.read_parquet('./data/parliament_members.parquet')

agenda = pd.concat([data_agenda1, data_agenda2, data_agenda3], axis=0)
speech = pd.concat([data_speech1, data_speech2, data_speech3], axis=0)

In [None]:
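# draw one random meeting id (kept as a one-element list so it can be passed to isin below)
sample_id = random.sample(list(meeting['meeting_id']), 1)
# look up the url of the sampled meeting
sample_url = meeting[meeting['meeting_id']==sample_id[0]]['url'].values[0]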

# look up all data for the sample meeting
sample_agendas = agenda[agenda['meeting_id'].isin(sample_id)]
sample_speech = speech[speech['meeting_id'].isin(sample_id)]

# subset the data to perform merge and remove noise
sample_agendas = sample_agendas[['meeting_id', 'agenda_item_id', 'title', 'type']]
sample_speech = sample_speech[['agenda_item_id', 'speech_item_id', 'time_start', 'time_end',
                               'speaker_name', 'speaker_party', 'speaker_role',
                               'speech_item_text', 'duration', 'number_of_words']]

# merge the sample data
sample_data = pd.merge(sample_agendas, sample_speech, on='agenda_item_id', how='inner')
# print the url to check in a browser for matching data
print(f'url of sample meeting: {sample_url}')

# save the sample data to csv 
#sample_data[['agenda_item_id','speech_item_id','speaker_name','speaker_party','speaker_role','speech_item_text']].to_csv('./data/sample_data.csv', index=False)
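
The inner merge above silently drops agenda items without speeches and speech items without a matching agenda item. A minimal sketch of how such rows could be surfaced, using an outer merge with pandas' `indicator=True` option. The toy frames and their values are hypothetical stand-ins for `sample_agendas` and `sample_speech`, not real data:

```python
import pandas as pd

# Hypothetical stand-ins for sample_agendas and sample_speech (made-up values).
toy_agendas = pd.DataFrame({
    "meeting_id": [1, 1, 1],
    "agenda_item_id": [10, 11, 12],
    "title": ["A", "B", "C"],
})
toy_speech = pd.DataFrame({
    "agenda_item_id": [10, 10, 12, 13],
    "speech_item_id": [1, 3, 5, 7],
})

# An outer merge with indicator=True adds a "_merge" column that records
# whether each key was found in the left frame, the right frame, or both.
check = pd.merge(toy_agendas, toy_speech, on="agenda_item_id",
                 how="outer", indicator=True)

# Keys that an inner merge would have dropped.
agendas_without_speech = check.loc[check["_merge"] == "left_only", "agenda_item_id"].tolist()
speech_without_agenda = check.loc[check["_merge"] == "right_only", "agenda_item_id"].tolist()

print("agenda items without speeches:", agendas_without_speech)
print("speech items without an agenda item:", speech_without_agenda)
```

If both lists are empty for a sampled meeting, the inner merge has not lost any rows for that meeting.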

## Caveats:
- Speech item ids only use every second value, although they still keep their sequential order. This is because items with the orator role are removed, and the orator opens every meeting and introduces every speaker.
- The agenda dataframe has redundant columns. It should contain only: 'meeting_id', 'agenda_item_id', 'title', 'type'.
- The speech dataframe has redundant columns. It should contain only: 'agenda_item_id', 'speech_item_id', 'time_start',
       'time_end', 'speaker_name', 'speaker_party', 'speaker_role', 'speech_item_text', 'duration', 'number_of_words'.
       ('speaker_title' is simply a combination of speaker party and speaker name. It would be better to keep party and name in separate fields and remove 'speaker_title'.)
- The type of minister is not specified; the data only records whether a speaker is a minister or not. For example, _Indenrigs- og sundhedsministeren (Bertel Haarder)_ appears only as "minister" in the data. Example taken from https://www.ft.dk/forhandlinger/20101/20101M013_2010-11-05_1000.htm
- Agenda titles are still somewhat vague, because only the short title is stored, which is a subset of the full agenda text. For instance, the short title from the same link is: _1. behandling af L 30: Om nedlæggelse af Momsfondet._
  The actual agenda text is: _"Forslag til lov om ændring af lov om konkurrencemæssig ligestilling mellem kommuners og regioners egenproduktion og køb af ydelser hos eksterne leverandører i relation til udgifter til merværdiafgift m.v. samt om Momsfondet. (Nedlæggelse af Momsfondet)."_
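
Two of the caveats above can be checked programmatically. The following is a minimal sketch, assuming that `speech_item_id` is numeric. The toy frame and its values are hypothetical and only illustrate the check; the real ids may use a different format:

```python
import pandas as pd

# Hypothetical speech items for one meeting (made-up values).
toy_speech = pd.DataFrame({
    "speech_item_id": [2, 4, 6, 8],
    "speaker_name": ["Name A", "Name B", "Name A", "Name C"],
    "speaker_party": ["P1", "P2", "P1", "P3"],
    "speaker_title": ["Name A (P1)", "Name B (P2)", "Name A (P1)", "Name C (P3)"],
})

# Caveat 1: ids skip every second value but stay in sequential order.
is_increasing = toy_speech["speech_item_id"].is_monotonic_increasing
steps = toy_speech["speech_item_id"].diff().dropna().astype(int)
step_sizes = sorted(steps.unique().tolist())
print("ids in increasing order:", is_increasing)
print("distinct step sizes between consecutive ids:", step_sizes)

# Caveat 3: 'speaker_title' duplicates name and party, so it can be dropped.
slim_speech = toy_speech.drop(columns=["speaker_title"])
print(list(slim_speech.columns))
```

A single step size of 2 would match the pattern described in the first caveat. Any other step sizes would point to additional removed or missing speech items.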