# Clustering polls into topics
> Clusters polls using their titles and descriptions. The goal is an unsupervised grouping into rough topics, following [this gensim tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html#sphx-glr-auto-examples-core-run-topics-and-transformations-py).

> Note: To process German text you may need to run `python -m spacy download de_core_news_sm` first, if you have not already done so.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import logging
import pprint

import pandas as pd

from bundestag.fine_logging import setup_logging
from bundestag.paths import get_paths
from bundestag.poll_clustering import (
    SpacyTransformer,
    clean_text,
    compare_word_frequencies,
    pca_plot_lda_topics,
)


logger = logging.getLogger(__name__)
setup_logging(logging.INFO)

paths = get_paths("../data")
paths

In [None]:
file = paths.preprocessed_abgeordnetenwatch / "polls_111.parquet"
file

In [None]:
df_polls = pd.read_parquet(file)
df_polls.head(3).T

## Clustering based on poll title



Sanity check: inspect the word counts and the longest and shortest titles.
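
A sanity check of this kind could look roughly like the sketch below. It is not the notebook's original code: `title_stats` is a hypothetical helper, words are counted by plain whitespace splitting, and the example titles are made up rather than taken from the dataset.

```python
from statistics import mean, median


def title_stats(titles):
    """Summarise whitespace-delimited word counts for a list of titles."""
    counts = [len(title.split()) for title in titles]
    return {
        "n_titles": len(titles),
        "min_words": min(counts),
        "median_words": median(counts),
        "mean_words": mean(counts),
        "max_words": max(counts),
        # Shortest and longest title, measured in words
        "shortest": min(titles, key=lambda t: len(t.split())),
        "longest": max(titles, key=lambda t: len(t.split())),
    }


# Made-up example titles, not taken from the dataset
example_titles = [
    "Änderung des Infektionsschutzgesetzes",
    "Verlängerung des Bundeswehreinsatzes in Mali",
    "Mindestlohn",
]
stats = title_stats(example_titles)
print(stats)
```

On the notebook's data this would be called as `title_stats(df_polls["poll_title"].tolist())`, assuming the column contains no missing values.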

### Cleaning using spacy

Reference: [End-to-End Topic Modeling in Python: Latent Dirichlet Allocation (LDA)](https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0)

In [None]:
# !python -m spacy download de_core_news_sm

In [None]:
col = "poll_title"
nlp_col = f"{col}_nlp_processed"

In [None]:
st = SpacyTransformer()
df_polls[nlp_col] = df_polls.pipe(clean_text, col=col, nlp=st.nlp)

In [None]:
df_polls.head(3).T

### Inspecting word frequencies

In [None]:
compare_word_frequencies(df_polls, col, nlp_col)

After cleaning, the word count distribution shifts toward lower values, as expected, and no document is left without any words.
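
The two observations can also be checked numerically rather than read off a plot. The sketch below is only an illustration: `n_words` and `compare_counts` are hypothetical helpers, the data is made up, and it assumes the cleaned column holds either strings or lists of tokens (the notebook does not show which).

```python
def n_words(doc):
    """Count words in a raw string or in an already tokenised list."""
    return len(doc.split()) if isinstance(doc, str) else len(doc)


def compare_counts(raw_docs, cleaned_docs):
    """Compare mean word counts and count documents emptied by cleaning."""
    raw = [n_words(doc) for doc in raw_docs]
    cleaned = [n_words(doc) for doc in cleaned_docs]
    return {
        "mean_raw": sum(raw) / len(raw),
        "mean_cleaned": sum(cleaned) / len(cleaned),
        "n_empty_after_cleaning": sum(count == 0 for count in cleaned),
    }


# Made-up raw titles and a made-up cleaning result (assumed token lists)
raw_docs = [
    "Änderung des Infektionsschutzgesetzes",
    "Verlängerung des Bundeswehreinsatzes in Mali",
]
cleaned_docs = [
    ["änderung", "infektionsschutzgesetz"],
    ["verlängerung", "bundeswehreinsatz", "mali"],
]
summary = compare_counts(raw_docs, cleaned_docs)
print(summary)
```

A lower `mean_cleaned` than `mean_raw` together with `n_empty_after_cleaning == 0` corresponds to the observation stated above.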

### Transforming using LDA

In [None]:
st.fit_lda(df_polls[nlp_col].values.tolist(), num_topics=10)
print("Discovered topics:")
pprint.pprint(st.lda_topics)

In [None]:
df_lda = st.transform_documents(df_polls[nlp_col])
df_lda.head().T

In [None]:
df_polls, nlp_feature_cols = df_polls.pipe(
    st.transform, col=nlp_col, return_new_cols=True
)
df_polls.head()
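
How `st.transform` builds its feature columns is not shown in this notebook. As a rough illustration only: a gensim LDA model returns, per document, a sparse list of `(topic_id, probability)` pairs, and one plausible way to obtain fixed-width feature columns is to expand each list into a dense row. The helper `to_dense`, the column names and the numbers below are all hypothetical.

```python
def to_dense(doc_topics, num_topics):
    """Expand sparse (topic_id, probability) pairs into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        dense[topic_id] = prob
    return dense


# Made-up sparse topic distributions for three documents and four topics
num_topics = 4
sparse_docs = [
    [(0, 0.9), (2, 0.1)],
    [(1, 0.55), (3, 0.45)],
    [(2, 1.0)],
]

# Hypothetical column names; the project's actual names may differ
columns = [f"topic_{i}" for i in range(num_topics)]
dense_rows = [to_dense(doc, num_topics) for doc in sparse_docs]

# Index of the most probable topic per document
dominant_topics = [row.index(max(row)) for row in dense_rows]

for row, topic in zip(dense_rows, dominant_topics):
    print(dict(zip(columns, row)), "-> dominant topic", topic)
```

Dense rows like these are the kind of input that a dimensionality reduction such as PCA can work with, since every document then has the same number of features.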

In [None]:
pca_plot_lda_topics(df_polls, st, col, nlp_feature_cols)