# 01 - TF-IDF on the Governance Set
This notebook runs TF-IDF on the governance data set.

The code in this notebook was roughly based on [TF-IDF Vectorizer scikit-learn](https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a) by Mukesh Chaudhary. We replicate their steps and build from there.

**braindump**

* XXX make a histogram of words per document-frequency class for Corné. Something like 20 bins. Compute the document frequency. Goal: study df_min and df_max.
* histogram of df with the number of words within the DV set, so you can see what df_min and df_max do.
* scree plot to determine the number of topics
    - https://stackoverflow.com/questions/69091520/determine-the-correct-number-of-topics-using-latent-semantic-analysis
* commonality value @Mariek
* make a limited set of topics there, and then a scatter plot of the word weights for each combination.


---
## Dependencies and Imports
This is where we install and import the dependencies. Most operations are performed using scikit-learn and pandas. We also load WordCloud and Matplotlib for visualising the word weights.

In [1]:
!pip install pandas scikit-learn wordcloud



In [2]:
import re
import sys
from pathlib import Path
WRITE='w'
READ_BINARY='rb'
print("python=={}".format(re.sub(r'\s.*', '', sys.version)))

from sklearn import __version__ as sklearn__version__
print(f"scikit-learn=={sklearn__version__}")
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

import pandas as pd
print(f"pandas=={pd.__version__}")
ROW    = 0
COLUMN = 1
STRING = 'string'
OBJECT = 'object'
NUMBER = 'number'
CATEGORY = 'category'
INTEGER = 'integer'
UNSIGNED = 'unsigned'
FLOAT = 'float'

import numpy as np
print(f"numpy=={np.__version__}")

from wordcloud import WordCloud
from wordcloud import __version__ as wordcloud__version__
print(f"wordcloud=={wordcloud__version__}")

import matplotlib.pyplot as plt
from matplotlib import __version__ as matplotlib__version__
print(f"matplotlib=={matplotlib__version__}")


python==3.11.4
scikit-learn==1.2.2
pandas==2.0.2
numpy==1.25.0
wordcloud==1.9.2
matplotlib==3.7.1


---
## Data Loading
All data was preprocessed by the "_00 - Preprocess the Governance Data Set_" notebook, so data loading here is simple: we don't have to worry about tokenization, stemming, or stop words. This notebook really only needs the DV data set; we also load the complete corpus for comparison.

In [None]:
CACHE_DIR = '../cache/Governance'

# The Parquet files, gzipped.
ALL_PARQUET_GZ = CACHE_DIR + '/ALL_documents.parquet.gz'
DV_PARQUET_GZ  = CACHE_DIR + '/DV_documents.parquet.gz'

ALL_corpus = pd.read_parquet(ALL_PARQUET_GZ)
DV_corpus  = pd.read_parquet(DV_PARQUET_GZ)

# columns of the data set
DOCUMENT_BODY = 'body'
DOCUMENT_TITLE = 'Titel'
DOCUMENT_JAAR = 'Jaar'
MUNICIPALITY_CODE='GM_CODE'

DV_corpus


For use in graphs, the Enexis and Google Sheets colour codes.

In [None]:
ENEXIS_PINK='#cc2b72'
ENEXIS_DARK_PINK='#a72a81'
ENEXIS_VERY_DARK_PINK='#942d88'
ENEXIS_GREEN='#c3da45'
ENEXIS_DARKGREEN='#94b950'
ENEXIS_LIGHTGREY='#f0f0f0'

SHEET_GREEN='#00b3ae'


---
## Document Frequency Tuning
Filtering on document frequency turned out to be a great way to clip off words that are common across the corpus. More on that below. For now, we perform our TF-IDF analysis using the best values we found.

As our documents are quite large, we set `sublinear_tf`, which replaces the raw term frequency $tf$ with $1 + \log(tf)$. This dampens the effect of document length on the output.
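To get a feel for the dampening, here is a minimal pure-Python sketch of the scaling that scikit-learn documents for `sublinear_tf=True`, namely `1 + log(tf)` with the natural logarithm. The helper name `sublinear_tf` and the raw counts are made up for illustration; this is not the library's code.

```python
import math


def sublinear_tf(raw_count: int) -> float:
    """Dampened term frequency: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(raw_count) if raw_count > 0 else 0.0


# Hypothetical raw counts of one word, from a short to a very long document.
for count in (1, 10, 100, 1000):
    print(f"raw tf={count:>4}  sublinear tf={sublinear_tf(count):.2f}")
```

Going from 1 to 100 occurrences raises the weight from 1.0 to roughly 5.6 instead of 100, so a long document that repeats a word many times no longer dominates.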

*note*: the data type of `min_df` and `max_df` changes its meaning. `int`s are taken to mean absolute document counts, while `float`s are taken to mean a fraction of the documents (between 0.0 and 1.0, not a percentage). We mix the two here, as you can see in the code below.
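The int-versus-float rule can be sketched in a few lines. This is a simplified illustration of the documented behaviour (terms with a document frequency strictly below `min_df` or strictly above `max_df` are ignored), not scikit-learn's actual implementation. The helper names and the corpus size of 200 documents are assumptions made for the example.

```python
from numbers import Integral


def resolve_df_bounds(min_df, max_df, n_docs):
    """Translate min_df/max_df into absolute document counts.

    Integers are used as counts; floats are read as a fraction of n_docs.
    """
    low = min_df if isinstance(min_df, Integral) else min_df * n_docs
    high = max_df if isinstance(max_df, Integral) else max_df * n_docs
    return low, high


def keep_term(doc_freq, low, high):
    """A term survives when its document frequency lies within the bounds."""
    return low <= doc_freq <= high


# Hypothetical corpus of 200 documents, with the same mix of types as above.
low, high = resolve_df_bounds(min_df=15, max_df=0.85, n_docs=200)
print(f"keep terms that occur in {low} up to {high:.0f} documents")
```

A pitfall worth remembering: `max_df=1` (an `int`) keeps only terms that occur in a single document, while `max_df=1.0` (a `float`) places no upper limit at all.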

In [None]:
MIN_DF = 15   # count
MAX_DF = 0.85 # fraction of documents
SUBLINEAR_TF = True


---
## Apply Vectorizers to Governance Data Set
Here we define and run TF-IDF on the governance data sets. The count vectorizer is a good way to get some idea of how the documents and words relate. A better understanding of that helps with `max_df` and `min_df` tuning, for example.

The main reason to define these functions is to ensure the operations yield clean data frames for easy analysis.

In [None]:
def count_vectorize(series):
    vectorizer = CountVectorizer()

    # run the vectorizer on the data
    word_matrix = vectorizer.fit_transform(series)
    words_list = vectorizer.get_feature_names_out()

    # take the output and package it into various useful data frames
    per_document    = pd.DataFrame(index=series.index, columns=words_list, data=word_matrix.toarray())
    sum_over_corpus = pd.DataFrame(per_document.sum(), columns=['sum']).T

    return vectorizer, per_document, sum_over_corpus


def tfidf_vectorize(series, min_df, max_df, sublinear_tf):
    vectorizer = TfidfVectorizer(min_df=min_df, max_df=max_df, sublinear_tf=sublinear_tf)

    # run the vectorizer on the data
    word_matrix = vectorizer.fit_transform(series)
    words_list = vectorizer.get_feature_names_out()

    # take the output and package it into various useful data frames
    matrix = pd.DataFrame(index=series.index, columns=words_list, data=word_matrix.toarray())
    idf = pd.DataFrame(columns=words_list, data=[vectorizer.idf_])

    return vectorizer, matrix, idf


In [None]:
all_docs_vectorizer, all_docs_matrix, all_docs_idf = tfidf_vectorize(ALL_corpus[DOCUMENT_BODY], min_df=MIN_DF, max_df=MAX_DF, sublinear_tf=SUBLINEAR_TF)
all_docs_matrix


In [None]:
#all_docs_vectorizer.stop_words_
len(all_docs_vectorizer.stop_words_)


In [None]:
dv_docs_vectorizer, dv_docs_matrix, dv_docs_idf = tfidf_vectorize(DV_corpus[DOCUMENT_BODY], min_df=MIN_DF, max_df=MAX_DF, sublinear_tf=SUBLINEAR_TF)
dv_docs_matrix


In [None]:
#dv_docs_vectorizer.stop_words_
len(dv_docs_vectorizer.stop_words_)


### Dimensionality Reduction via `min_df` and `max_df` Tuning
To determine useful values for `min_df` and `max_df`, we need a histogram of the document frequency of each term.

In the dimensionality reduction we strive to remove two kinds of words: words that occur in only a few documents (these are probably off topic, and likely to be spelling errors or specific to a certain municipality), and words that occur in almost all documents. The latter do not contribute to the clustering and just end up making the clusters look very similar.

We arrived at the values for `MIN_DF` and `MAX_DF` by experimenting with a few word clouds, until we saw that the weasel words disappeared and the clusters started making sense.
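As a small illustration of what clipping on document frequency does, the sketch below computes the document frequency by hand for a toy corpus and splits the vocabulary into kept, too rare, and too common. The documents and the `*_TOY` thresholds are invented for the example and are not the governance data or the values used in this notebook.

```python
from collections import Counter

# Hypothetical, already preprocessed toy documents (not the governance data).
docs = [
    "gemeente energie beleid energie",
    "gemeente warmte beleid",
    "gemeente energie windpark",
    "gemeente beleid",
]

# Document frequency: each word counts at most once per document.
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

MIN_DF_TOY = 2     # absolute count (int)
MAX_DF_TOY = 0.75  # fraction of documents (float)
upper = MAX_DF_TOY * len(docs)

kept = sorted(w for w, df in doc_freq.items() if MIN_DF_TOY <= df <= upper)
too_rare = sorted(w for w, df in doc_freq.items() if df < MIN_DF_TOY)
too_common = sorted(w for w, df in doc_freq.items() if df > upper)

print("kept:", kept)
print("too rare:", too_rare)
print("too common:", too_common)
```

Here "gemeente" occurs in every document and is clipped by the upper bound, while "warmte" and "windpark" each occur in one document and fall below the lower bound.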


In [None]:
DV_document_count = DV_corpus[DOCUMENT_BODY].shape[0]
max_df_line = int(DV_document_count * MAX_DF)

_, count_per_dv_document, _ = count_vectorize(DV_corpus[DOCUMENT_BODY])

histo_data = count_per_dv_document.astype(bool).sum(axis=ROW).sort_values()
histo_data[histo_data<MIN_DF].sample(10)


In [None]:
histo_data[histo_data>max_df_line].sample(10)


In [None]:
print(f"min_df={MIN_DF} and max_df={max_df_line} ({MAX_DF*100}% of {DV_document_count} documents)")

fig, ax = plt.subplots()
ax.set_yscale('log')
ax.axvline(x=MIN_DF, color=ENEXIS_PINK, label=f"min_df={MIN_DF}")
ax.axvline(x=max_df_line, color=ENEXIS_PINK, label=f"max_df={max_df_line} ({MAX_DF*100}% of {DV_document_count} documents)")

ax.spines[['right', 'top']].set_visible(False)
ax.spines[['left', 'bottom']].set_color(ENEXIS_LIGHTGREY)

# plt.legend(loc='upper right', fontsize=9)
plt.hist(histo_data, bins=DV_document_count, color=SHEET_GREEN);


In [None]:
count_per_dv_document


In [None]:
total_DV = dv_docs_matrix.mean().dropna().sort_values()
total_DV.nlargest(60)


---
## Word Cloud of Unique Words in DV vs All
Here we take the mean TF-IDF weight of each word in the "dv" matrix and subtract its mean weight in the "all" matrix, to determine which words are distinctive for DV documents.
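When two pandas Series are subtracted, they are aligned on their index; a word that is present in only one of them yields `NaN`, which `dropna()` then removes. The pure-Python sketch below mimics that behaviour with plain dictionaries. The words and weights are invented for illustration only.

```python
# Hypothetical mean TF-IDF weights per word (invented numbers).
dv_mean = {"laadpaal": 0.12, "energie": 0.08, "warmtenet": 0.05}
all_mean = {"laadpaal": 0.03, "energie": 0.07, "begroting": 0.09}

# Only words present in both vocabularies survive, mirroring the effect of
# index alignment followed by dropna().
shared = dv_mean.keys() & all_mean.keys()
difference = {word: dv_mean[word] - all_mean[word] for word in shared}

# Largest positive difference first: these words weigh more in DV documents.
ranked = sorted(difference.items(), key=lambda item: item[1], reverse=True)
for word, delta in ranked:
    print(f"{word:<10} {delta:+.2f}")
```

Note that a word clipped from either vocabulary by `min_df` or `max_df` drops out of the comparison entirely, even if it is very typical for one of the two sets.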

In [None]:
unique_for_DV = (dv_docs_matrix.mean() - all_docs_matrix.mean()).dropna().sort_values()
unique_for_DV.nlargest(20)


In [None]:
cloud = WordCloud(background_color="white", max_words=50).generate_from_frequencies(unique_for_DV)
plt.axis('off')
plt.imshow(cloud);


In [None]:
unique_on_idf = (dv_docs_idf.T - all_docs_idf.T).dropna()
unique_on_idf[0].nlargest(20)

In [None]:
cloud = WordCloud(background_color="white", max_words=50).generate_from_frequencies(unique_on_idf[0])
plt.axis('off')
plt.imshow(cloud);


In [None]:
dv_docs_idf.T[0].nlargest(10)
