# Parsing votes
> Downloading & parsing votes from [bundestag.de](https://www.bundestag.de/parlament/plenum/abstimmung/liste).

## How to use

Run the notebook from top to bottom. The parameters set in the cells below control the download and transformation steps.

### CLI equivalent

Download

    uv run bundestag download bundestag_sheets --do-create-xlsx-uris-json

Transformation

    uv run bundestag transform bundestag_sheet --sheet-source=json_file

### Skip processing

Running this notebook redoes the data processing. To skip that step, run

    uv run bundestag download huggingface

instead.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import json
import logging

import pandas as pd

from bundestag.data.download.bundestag_sheets import (
    Source,
    run as download_bundestag_sheets,
)
from bundestag.data.transform.bundestag_sheets import run as transform_bundestag_sheets
from bundestag.data.utils import RE_SHEET
from bundestag.fine_logging import setup_logging
from bundestag.paths import get_paths

## Setup

In [None]:
logger = logging.getLogger(__name__)
setup_logging(logging.INFO)

In [None]:
paths = get_paths("../data")
paths

In [None]:
dry = False

## Collecting URIs for `.xlsx`/`.xls` documents

In [None]:
nmax = 999
pattern = RE_SHEET
assume_yes = True
do_create_xlsx_uris_json = False
max_pages = 10
json_filename = "xlsx_uris.json"
source = Source.json_file

html_dir = paths.raw_bundestag_html
json_path = html_dir.parent / json_filename
sheet_dir = paths.raw_bundestag_sheets

In [None]:
download_bundestag_sheets(
    html_dir=html_dir,
    sheet_dir=sheet_dir,
    nmax=nmax,
    dry=dry,
    pattern=pattern,
    assume_yes=assume_yes,
    source=source,
    do_create_xlsx_uris_json=do_create_xlsx_uris_json,
    max_pages=max_pages,
)

In [None]:
with json_path.open("r") as f:
    xlsx_uris = json.load(f)

xlsx_uris
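
With `do_create_xlsx_uris_json = False`, the cell that opens `json_path` relies on the JSON file already being on disk. A minimal sketch of a guarded loader is shown here. `load_uris` is a hypothetical helper, not part of the `bundestag` package. The hint in the error message assumes, as the flag name suggests, that setting `do_create_xlsx_uris_json=True` (re)creates the file. Nothing is assumed about the JSON's internal structure.

```python
import json
from pathlib import Path


def load_uris(json_path: Path):
    """Hypothetical helper: load the URI JSON file or fail with a clear hint."""
    if not json_path.exists():
        # Assumption: the download step writes this file when the flag is enabled.
        raise FileNotFoundError(
            f"{json_path} not found; rerun the download step with "
            "do_create_xlsx_uris_json=True to create it."
        )
    with json_path.open("r", encoding="utf-8") as f:
        return json.load(f)
```

In this notebook the call would be `xlsx_uris = load_uris(json_path)`.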

## Transforming sheet files

In [None]:
preprocessed_path = paths.preprocessed_bundestag

In [None]:
transform_bundestag_sheets(
    html_dir=html_dir,
    sheet_dir=sheet_dir,
    preprocessed_path=preprocessed_path,
    dry=dry,
    source=source,
    json_filename=json_filename,
)

In [None]:
results_parquet_path = preprocessed_path / "bundestag.de_votes.parquet"
results_parquet_path

## Loading sheets DataFrame

In [None]:
df = pd.read_parquet(results_parquet_path)
df.head()
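
After loading, a quick sanity check on the frame can catch an empty or partially written parquet file. The sketch below uses only standard pandas calls. `summarize` is a hypothetical helper, and the columns of the toy frame are placeholders, not the actual schema of `bundestag.de_votes.parquet`.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Hypothetical helper: collect basic size and completeness figures."""
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "n_duplicate_rows": int(df.duplicated().sum()),
        # Cast to plain int so the result is easy to print or serialize.
        "nulls_per_column": {col: int(n) for col, n in df.isna().sum().items()},
    }


# Toy frame with placeholder columns, only to show the call.
toy = pd.DataFrame(
    {
        "name": ["a", "b", "b", "c"],
        "vote": ["ja", "nein", "nein", None],
    }
)
summary = summarize(toy)
```

With the real data the call would be `summarize(df)`.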