# Sampling the data
The first step towards annotating the data is to draw a sample. To make the sample representative, I use stratified sampling.

In [1]:
# import libraries
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

In [3]:
# load data
data_df = pd.read_csv("./data/abgeordnetenwatch_data_long.csv", sep = ";")

In this step I create an additional column that holds, for each question, the combination of the categories party, gender, and topic. Politicians of a certain party or gender might be more or less inclined to answer a question, and questions on certain topics might have a higher or lower chance of being answered. I therefore want a representative sample, which will be drawn using the newly created column as the strata.

In [4]:
# create new column
data_df["stratify_column"] = data_df[["party", "gender", "topic"]].apply(lambda x: "_".join(x.astype(str)), axis=1)
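To show what the resulting values look like, here is a minimal sketch on made-up data. The frame `toy_df` and its three rows are assumptions for illustration only and are not taken from the real dataset.

```python
import pandas as pd

# Toy data, invented for illustration only
toy_df = pd.DataFrame({
    "party": ["SPD", "CDU", "SPD"],
    "gender": ["female", "male", "female"],
    "topic": ["Health", "Traffic", "Health"],
})

# Same construction as above: join the three categories row by row
toy_df["stratify_column"] = toy_df[["party", "gender", "topic"]].apply(
    lambda x: "_".join(x.astype(str)), axis=1
)

print(toy_df["stratify_column"].tolist())
```

Rows that share the same party, gender, and topic end up with the same string and therefore belong to the same stratum.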

This code chunk removes combinations with fewer than 5 questions so that a proper stratified random sample can be drawn.

In [5]:
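# count the number of questions per combination
counted_values = data_df["stratify_column"].value_counts()

# keep only the combinations with at least 5 questions
data_df = data_df[data_df["stratify_column"].map(counted_values) >= 5]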
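The effect of this filter can be sketched on a small, made-up frame. The name `toy_df` and the stratum labels are hypothetical; only the filtering pattern matches the cell above.

```python
import pandas as pd

# Toy data, invented for illustration only: three strata with 6, 5 and 2 rows
toy_df = pd.DataFrame({
    "stratify_column": ["A_f_health"] * 6 + ["B_m_traffic"] * 5 + ["C_f_energy"] * 2
})

# Count the rows per stratum
toy_counts = toy_df["stratify_column"].value_counts()

# Map each row to the size of its stratum and keep rows from strata with at least 5 rows
toy_filtered = toy_df[toy_df["stratify_column"].map(toy_counts) >= 5]

print(toy_filtered["stratify_column"].value_counts())
```

In this sketch the stratum with only 2 rows is dropped, while the strata with 5 and 6 rows are kept. Very small strata matter in practice because `StratifiedShuffleSplit` raises an error when a class has fewer than 2 members.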

In [7]:
# draw a stratified sample of 0.75 % of the data
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.0075, random_state = 42)
for train_index, test_index in split.split(data_df, data_df["stratify_column"]):
    sample = data_df.iloc[test_index]
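One way to check whether a stratified sample is representative is to compare the share of each stratum in the full data with its share in the sample. The following is a minimal sketch on made-up data. The names `toy_df`, `toy_split`, `toy_sample`, and `comparison` are hypothetical, and the larger `test_size` of 0.1 is chosen only to keep the toy example small.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data, invented for illustration only: three strata of different sizes
toy_df = pd.DataFrame({"stratify_column": ["a"] * 600 + ["b"] * 300 + ["c"] * 100})

# Draw a single stratified split; only the test indices are used as the sample
toy_split = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
_, toy_test_index = next(toy_split.split(toy_df, toy_df["stratify_column"]))
toy_sample = toy_df.iloc[toy_test_index]

# Compare the relative frequency of each stratum in the population and in the sample
comparison = pd.DataFrame({
    "population": toy_df["stratify_column"].value_counts(normalize=True),
    "sample": toy_sample["stratify_column"].value_counts(normalize=True),
})
print(comparison)
```

If the two columns are close for every stratum, the sample mirrors the composition of the data. The same comparison could be run on `data_df` and `sample` from the cell above.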

To simplify the encoding of the answers, the columns are rearranged.

In [14]:
sample = sample[["party", "politician_id", "first_name", "last_name", "gender", "year_of_birth", "residence", "question_date", "question_id", "parliament", "topic", "stratify_column", "question_text", "question_teaser", "answer"]]

The next step exports the sample as a CSV file.

In [15]:
sample.to_csv("./data/stratified_sample.csv", index=False)