### **Disclaimer**: The vectorization checkups for LDA are based on the **currently supported** bigram-based approach. This notebook reflects the results of the current vectorization pipeline.

## Cleaning Checkups

In [1]:
import sys
sys.path.append("../src")  

In [2]:
import pandas as pd
from config import CLEANED_COMPLAINTS_FILE

# -------------------------------------------------
# Load cleaned data
# -------------------------------------------------
df = pd.read_csv(CLEANED_COMPLAINTS_FILE)

print(f"Dataset shape: {df.shape}")

# -------------------------------------------------
# 1) NaN checks
# -------------------------------------------------
nan_counts = df[["lda_description", "bertopic_description"]].isna().sum()
print("\nNaN counts:")
print(nan_counts)

# -------------------------------------------------
# 2) Empty / whitespace-only checks
# -------------------------------------------------
empty_lda = (df["lda_description"].fillna("").str.strip() == "").sum()
empty_bertopic = (df["bertopic_description"].fillna("").str.strip() == "").sum()

print("\nEmpty-string counts:")
print(f"  LDA empty:        {empty_lda}")
print(f"  BERTopic empty:   {empty_bertopic}")

# -------------------------------------------------
# 3) Rows invalid for BOTH pipelines (should be 0)
# -------------------------------------------------
invalid_rows = df[
    (df["lda_description"].fillna("").str.strip() == "") &
    (df["bertopic_description"].fillna("").str.strip() == "")
]

print(f"\nRows invalid for BOTH pipelines: {len(invalid_rows)}")

# -------------------------------------------------
# 4) Accidental 'nan' strings (rare but dangerous)
# -------------------------------------------------
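# Real NaN values are already counted in check 1; here only stringified "nan" remains
nan_string_lda = (df["lda_description"].fillna("").str.strip().str.lower() == "nan").sum()
nan_string_bt = (df["bertopic_description"].fillna("").str.strip().str.lower() == "nan").sum()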

print("\nLiteral 'nan' string counts:")
print(f"  LDA:      {nan_string_lda}")
print(f"  BERTopic: {nan_string_bt}")

# -------------------------------------------------
# 5) Token / length sanity checks
# -------------------------------------------------
df["lda_token_count"] = df["lda_description"].fillna("").str.split().str.len()
df["bertopic_char_len"] = df["bertopic_description"].fillna("").str.len()

print("\nLDA token count summary:")
print(df["lda_token_count"].describe())

print("\nBERTopic character length summary:")
print(df["bertopic_char_len"].describe())

# -------------------------------------------------
# 6) Final hard assertions (fail fast)
# -------------------------------------------------
assert nan_counts.sum() == 0, "❌ NaN values detected"
assert nan_string_lda == 0 and nan_string_bt == 0, "❌ Literal 'nan' strings detected"
assert len(invalid_rows) == 0, "❌ Rows empty for both LDA and BERTopic"

print("\n✅ Cleaning sanity checks passed. Data is safe for vectorization & topic modeling.")

Dataset shape: (930, 4)

NaN counts:
lda_description         0
bertopic_description    0
dtype: int64

Empty-string counts:
  LDA empty:        0
  BERTopic empty:   0

Rows invalid for BOTH pipelines: 0

Literal 'nan' string counts:
  LDA:      0
  BERTopic: 0

LDA token count summary:
count    930.000000
mean       7.036559
std        5.962094
min        1.000000
25%        3.000000
50%        5.000000
75%       10.000000
max       65.000000
Name: lda_token_count, dtype: float64

BERTopic character length summary:
count    930.000000
mean      97.982796
std       88.028629
min        3.000000
25%       38.000000
50%       70.000000
75%      132.000000
max      941.000000
Name: bertopic_char_len, dtype: float64

✅ Cleaning sanity checks passed. Data is safe for vectorization & topic modeling.
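
One subtlety behind check 4: with pandas' default settings, a literal `nan` field in a CSV is parsed as a missing value on load, so it surfaces in the NaN check rather than as a string. The sketch below illustrates this on a small in-memory CSV. The three rows are made up for illustration and are not taken from the complaints file; `keep_default_na=False` is a standard `read_csv` option that keeps such fields as plain strings.

```python
import io
import pandas as pd

# Hypothetical three-row CSV: one normal text, one literal "nan", one empty field.
csv_text = "id,lda_description\n1,lampe defekt\n2,nan\n3,\n"

# Default parsing: both "nan" and the empty field become missing values.
default_df = pd.read_csv(io.StringIO(csv_text))
missing_default = int(default_df["lda_description"].isna().sum())

# With keep_default_na=False nothing is converted, so the two cases stay distinguishable.
raw_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
text = raw_df["lda_description"].astype(str).str.strip()
literal_nan = int((text.str.lower() == "nan").sum())
empty = int((text == "").sum())

print(missing_default, literal_nan, empty)
```

Whether the cleaning pipeline should load its input this way is a design decision outside this notebook; the point here is only that the NaN check and the literal-string check can overlap.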


### (1) Check-up of the LDA cleaning pipeline output 

In [3]:
import pandas as pd
from config import CLEANED_COMPLAINTS_FILE

def preview_cleaning(n=10):
    """Print cleaned text above raw text for easy inspection."""
    df = pd.read_csv(CLEANED_COMPLAINTS_FILE)
    sample = df[["description", "lda_description"]].sample(n)

    for i, row in sample.iterrows():
        print("------ SAMPLE ------")
        print("CLEANED:")
        print(row["lda_description"])
        print("\nRAW:")
        print(row["description"])
        print("\n")

# Run it:
preview_cleaning(10)

------ SAMPLE ------
CLEANED:
mülllagerung entlang neu pinakothek

RAW:
Mülllagerung entlang der neuen Pinakothek 


------ SAMPLE ------
CLEANED:
beleuchtung einseitig ausfallen

RAW:
Beleuchtung einseitig ausgefallen.


------ SAMPLE ------
CLEANED:
klara ziegler bogen verlängerung radweg grüngürtel gefilde richtung arnold sommerfeld neu parkbänk aufstellen mülleimer sammle öfter müll dauerhaft lösung sinnvoll

RAW:
Am Klara-Ziegler-Bogen - in der Verlängerung des Radwegs durch den "Grüngürtel im Gefilde" in Richtung Arnold-Sommerfeld-Straße wurden einige neue Parkbänke aufgestellt - aber leider kein Mülleimer. Ich sammle dort öfter Müll, aber eine dauerhafte Lösung wäre sinnvoller.



------ SAMPLE ------
CLEANED:
hängeleucht leuchtstoffröhren defekt lampengla verschmutzen

RAW:
Bei Hängeleuchte: eine von beiden Leuchtstoffröhren defekt und Lampenglas verschmutzt 


------ SAMPLE ------
CLEANED:
leuchte ostpark ausfallen

RAW:
Leuchte 7-4 im Ostpark ist ausgefallen.


------ SAMPLE 

### (2) Check-up of the BERTopic cleaning pipeline output 

In [4]:
import pandas as pd
from config import CLEANED_COMPLAINTS_FILE

def preview_cleaning(n=10):
    """Print cleaned text above raw text for easy inspection."""
    df = pd.read_csv(CLEANED_COMPLAINTS_FILE)
    sample = df[["description", "bertopic_description"]].sample(n)

    for i, row in sample.iterrows():
        print("------ SAMPLE ------")
        print("CLEANED:")
        print(row["bertopic_description"])
        print("\nRAW:")
        print(row["description"])
        print("\n")

# Run it:
preview_cleaning(10)

------ SAMPLE ------
CLEANED:
ampelanlage sigl zschokkestraße ausgefallen

RAW:
Ampelanlage Sigl-/Zschokkestraße ausgefallen 


------ SAMPLE ------
CLEANED:
schild beklebt

RAW:
Schild beklebt 


------ SAMPLE ------
CLEANED:
in der torgauer str vor hausnummer 3 auf der linken seite die zweite straßenbeleuchtung hab die mastnummer nicht mitgenommen leuchtet weiß oder gar nicht die anderen leuchten sind orange

RAW:
In der Torgauer Str. vor Hausnummer 3 auf der linken Seite die zweite Straßenbeleuchtung (hab die Mastnummer nicht mitgenommen) leuchtet weiß oder gar nicht. Die anderen Leuchten sind orange.


------ SAMPLE ------
CLEANED:
lampe blinkt

RAW:
Lampe blinkt


------ SAMPLE ------
CLEANED:
die straßenlampe steht vor dem haus wilhelm weitling str 18 mastnummerschild schlecht lesbar die strassenlampe ist heute zum wiederholten male ausgefallen

RAW:
Die Straßenlampe steht vor dem Haus Wilhelm-Weitling-Str.18, Mastnummerschild schlecht lesbar,
die Strassenlampe ist heute zum wied

## Vectorization Checkups (LDA)

In [5]:
from pathlib import Path
import sys
import os

PROJECT_ROOT = Path.cwd().parents[0]

SRC_PATH = PROJECT_ROOT / "src"
if str(SRC_PATH) not in sys.path:
    sys.path.insert(0, str(SRC_PATH))

os.chdir(PROJECT_ROOT)

print("Project root set to:", PROJECT_ROOT)
print("Working directory:", Path.cwd())
print("src/ added to sys.path")

Project root set to: /Users/dd/PycharmProjects/complaints_analysis
Working directory: /Users/dd/PycharmProjects/complaints_analysis
src/ added to sys.path
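
Note that `Path.cwd().parents[0]` followed by `os.chdir` is not idempotent: re-running the cell moves the root one directory higher each time. A possible alternative, sketched here under the assumption that the project root is the nearest ancestor containing a `src` directory, is to search upwards for that marker. The helper name `find_project_root` is hypothetical and not part of the project.

```python
from pathlib import Path
import tempfile

def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")

# Self-contained demo on a throwaway layout: <tmp>/src and <tmp>/notebooks
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "src").mkdir()
    (root / "notebooks").mkdir()
    from_notebooks = find_project_root(root / "notebooks")
    from_root = find_project_root(root)
    same_root = from_notebooks == root and from_root == root

print(same_root)
```

In the notebook this would be called as `find_project_root(Path.cwd())`, which gives the same result whether the working directory is `notebooks/` or already the project root.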


In [6]:
# Inspect feature names (vocabulary sanity check)

import joblib

bow_vocab = joblib.load("data/vectorized/lda/lda_bow_feature_names.joblib")
tfidf_vocab = joblib.load("data/vectorized/lda/lda_tfidf_feature_names.joblib")

In [7]:
bow_vocab[:50]

['abbauen',
 'abbrechen',
 'abdeckung',
 'abend',
 'abends',
 'abfall',
 'abfluss',
 'abgefallen',
 'ablauf',
 'ablegen',
 'abschaltung',
 'abschnitt',
 'absperrung',
 'abstellen',
 'agnesstr',
 'allee',
 'alt',
 'ampel',
 'ampel ausfallen',
 'ampel komplett',
 'ampel mast',
 'ampel rot',
 'ampelanlage',
 'anderer',
 'andrea',
 'anfang',
 'anforderung',
 'angebracht',
 'angefahr',
 'angefahr lampenschirm',
 'anlage',
 'ansehen',
 'anwesen',
 'art',
 'ast',
 'aufbauen',
 'auffüllen',
 'aufgang',
 'aufhängen',
 'aufkleber',
 'aufkleber öfter',
 'aufstellen',
 'augustiner',
 'ausfahrt',
 'ausfall',
 'ausfallen',
 'ausfallen einmündung',
 'ausfallen höhe',
 'ausfallen mast',
 'ausfallen mastnummer']

In [8]:
tfidf_vocab[:50]

['abbauen',
 'abbrechen',
 'abdeckung',
 'abend',
 'abends',
 'abfall',
 'abfluss',
 'abgefallen',
 'ablauf',
 'ablegen',
 'abschaltung',
 'abschnitt',
 'absperrung',
 'abstellen',
 'agnesstr',
 'allee',
 'alt',
 'ampel',
 'ampel ausfallen',
 'ampel komplett',
 'ampel mast',
 'ampel rot',
 'ampelanlage',
 'anderer',
 'andrea',
 'anfang',
 'anforderung',
 'angebracht',
 'angefahr',
 'angefahr lampenschirm',
 'anlage',
 'ansehen',
 'anwesen',
 'art',
 'ast',
 'aufbauen',
 'auffüllen',
 'aufgang',
 'aufhängen',
 'aufkleber',
 'aufkleber öfter',
 'aufstellen',
 'augustiner',
 'ausfahrt',
 'ausfall',
 'ausfallen',
 'ausfallen einmündung',
 'ausfallen höhe',
 'ausfallen mast',
 'ausfallen mastnummer']

In [10]:
from scipy.sparse import load_npz
import numpy as np
import pandas as pd

X = load_npz("data/vectorized/lda/lda_bow_matrix.npz")
freq = X.sum(axis=0).A1

top = pd.DataFrame({"token": bow_vocab, "freq": freq}).sort_values("freq", ascending=False)
top.head(30)

| | token | freq |
|---|---|---|
| 45 | ausfallen | 192 |
| 162 | defekt | 132 |
| 560 | leuchte | 126 |
| 625 | mast | 92 |
| 141 | brunnen | 90 |
| 504 | lampe | 89 |
| 570 | leuchten | 64 |
| 479 | komplett | 60 |
| 532 | laterne | 52 |
| 96 | beleuchtung | 50 |

In [11]:
# Document frequency distribution

from scipy.sparse import load_npz
import numpy as np

X = load_npz("data/vectorized/lda/lda_bow_matrix.npz")
doc_freq = np.asarray((X > 0).sum(axis=0)).ravel()
doc_freq_ratio = doc_freq / X.shape[0]

# Show some stats
doc_freq_ratio.min(), doc_freq_ratio.max()

(0.0021691973969631237, 0.20498915401301518)

Interpretation
	•	The least frequent terms appear in ~0.2% of documents (≈ 2 documents), matching the min_df=2 threshold.
	•	The most frequent term appears in ~20% of documents, indicating that no single token dominates the corpus.

Assessment
	•	The vocabulary is well balanced:
	•	Rare terms are filtered appropriately.
	•	Overly generic terms are successfully suppressed.
	•	The chosen min_df and max_df parameters are well calibrated for this dataset.
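
To make the link between the minimum ratio and `min_df` explicit, the ratios can be converted back to absolute document counts. The following is a minimal sketch on a toy matrix; the 4×3 matrix and the helper name `doc_freq_summary` are illustrative and not part of the pipeline.

```python
import numpy as np
from scipy.sparse import csr_matrix

def doc_freq_summary(X):
    """Return (min_count, max_count, min_ratio, max_ratio) of per-term document frequency."""
    # Number of documents (rows) in which each term (column) occurs at least once
    doc_freq = np.asarray((X > 0).sum(axis=0)).ravel()
    ratio = doc_freq / X.shape[0]
    return int(doc_freq.min()), int(doc_freq.max()), float(ratio.min()), float(ratio.max())

# Toy document-term matrix: 4 documents, 3 terms
toy = csr_matrix(np.array([
    [2, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [3, 0, 0],
]))
summary = doc_freq_summary(toy)
print(summary)
```

Applied to the real `X`, a minimum count of 2 would confirm directly that the `min_df=2` threshold is in effect.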

In [12]:
import pandas as pd

token_freq = X.sum(axis=0).A1
top = pd.DataFrame({
    "token": bow_vocab,
    "freq": token_freq
}).sort_values("freq", ascending=False)

top.head(20)

| | token | freq |
|---|---|---|
| 45 | ausfallen | 192 |
| 162 | defekt | 132 |
| 560 | leuchte | 126 |
| 625 | mast | 92 |
| 141 | brunnen | 90 |
| 504 | lampe | 89 |
| 570 | leuchten | 64 |
| 479 | komplett | 60 |
| 532 | laterne | 52 |
| 96 | beleuchtung | 50 |

Interpretation

• High-frequency terms continue to clearly reflect the core municipal complaint themes, including:
• Infrastructure failures and outages (ausfallen, defekt)
• Street lighting and public illumination (leuchte, lampe, laterne, licht, beleuchtung, dunkel)
• Public facilities and water features (brunnen, wasser)
• Spatial and contextual references (platz, richtung, ecke)

• In addition to unigrams, the presence of the frequent bigram “leuchte ausfallen” indicates that the n-gram configuration successfully captures recurring phrase-level complaint patterns, improving semantic specificity over unigram-only representations.

• No obvious noise tokens, stopword leakage, or preprocessing artifacts are present.

Assessment

• The vocabulary preserves domain-relevant semantics while enriching representations with meaningful multi-word expressions.
• The (1,2) n-gram configuration improves expressiveness without degrading vocabulary quality or introducing sparsity-related noise.
• The preprocessing and vectorization steps successfully retain both general and phrase-level signal, making the setup suitable as a stable baseline for topic modeling.
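
To see how much of the vocabulary the (1,2) n-gram setting actually contributes, the feature names can be grouped by their number of tokens. This is a small sketch using a slice of the vocabulary printed earlier in this notebook; the helper name `split_by_ngram` is hypothetical.

```python
def split_by_ngram(vocab):
    """Group feature names by their number of whitespace-separated tokens."""
    groups = {}
    for term in vocab:
        groups.setdefault(len(term.split()), []).append(term)
    return groups

# A slice of the BoW vocabulary shown in the feature-name inspection
sample_vocab = ["ampel", "ampel ausfallen", "ampel komplett",
                "ampel mast", "ampel rot", "ampelanlage"]

groups = split_by_ngram(sample_vocab)
unigrams, bigrams = groups.get(1, []), groups.get(2, [])
print(len(unigrams), len(bigrams))
```

Running the same function on the full `bow_vocab` would give the unigram-to-bigram ratio for the whole feature space.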


In [13]:
# Sparsity per document (number of distinct vocabulary features per row)
tokens_per_doc = (X > 0).sum(axis=1).A1

tokens_per_doc.min(), tokens_per_doc.mean(), tokens_per_doc.max()

(1, 6.0813449023861175, 48)

Interpretation
	•	The complaints are short texts, with an average of ~6 distinct vocabulary features (unigrams and bigrams) per document after cleaning.
	•	A small number of documents contain only a single feature (e.g. a report consisting only of “ausgefallen”), which is typical for municipal issue reports.
	•	Some longer descriptions (up to 48 distinct features) exist and provide richer context, which helps stabilize topic modeling.

Assessment
	•	This distribution is expected for Open311-style complaint data.
	•	While short documents can limit topic granularity, the dataset still contains sufficient semantic signal.
	•	No additional filtering is required at this stage.
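
The min/mean/max summary hides how many documents sit at the short end. A histogram of distinct features per document makes that visible; the sketch below uses a toy matrix, and the helper name `features_per_doc_histogram` is an assumption for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

def features_per_doc_histogram(X):
    """Map 'number of distinct features' to 'number of documents with that many'."""
    per_doc = np.asarray((X > 0).sum(axis=1)).ravel().astype(int)
    counts = np.bincount(per_doc)
    return {int(k): int(v) for k, v in enumerate(counts) if v > 0}

# Toy document-term matrix: 4 documents, 4 terms
toy = csr_matrix(np.array([
    [1, 0, 0, 0],
    [2, 1, 0, 0],
    [0, 0, 3, 0],
    [1, 1, 1, 1],
]))
hist = features_per_doc_histogram(toy)
print(hist)
```

On the real matrix, the entry for key `1` would give the exact number of single-feature documents mentioned in the interpretation.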

In [14]:
X_tfidf = load_npz("data/vectorized/lda/lda_tfidf_matrix.npz")

print(X.nnz / X.shape[0])
print(X_tfidf.nnz / X_tfidf.shape[0])

6.0813449023861175
6.0813449023861175


Interpretation
	•	Both Bag-of-Words and TF-IDF representations contain the same average number of non-zero entries per document.
	•	This is consistent with TF-IDF reweighting terms without altering the underlying sparsity structure.

Assessment
	•	The two vectorization techniques are directly comparable.
	•	Any differences in topic modeling results will be attributable to term weighting, not preprocessing artifacts.
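
Equal averages alone do not show that the two matrices have non-zero entries in the same positions. A stricter check compares the sparsity patterns directly. The sketch below assumes both matrices can be converted to CSR; the helper name `same_sparsity_pattern` and the toy matrices are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

def same_sparsity_pattern(A, B):
    """True if A and B have non-zero entries in exactly the same positions."""
    if A.shape != B.shape:
        return False
    A = csr_matrix(A, copy=True)
    B = csr_matrix(B, copy=True)
    for M in (A, B):
        M.sum_duplicates()   # canonical form: merged duplicates, sorted indices
        M.eliminate_zeros()  # drop explicitly stored zeros
    return bool(np.array_equal(A.indptr, B.indptr)
                and np.array_equal(A.indices, B.indices))

# Toy counts and a reweighted version with the same non-zero positions
bow_toy = csr_matrix(np.array([[2, 0, 1], [0, 1, 0]]))
tfidf_toy = csr_matrix(np.array([[0.8, 0.0, 0.6], [0.0, 1.0, 0.0]]))
# A matrix with one extra non-zero entry
other_toy = csr_matrix(np.array([[0.8, 0.1, 0.6], [0.0, 1.0, 0.0]]))

match = same_sparsity_pattern(bow_toy, tfidf_toy)
mismatch = same_sparsity_pattern(bow_toy, other_toy)
print(match, mismatch)
```

In the notebook, `same_sparsity_pattern(X, X_tfidf)` returning `True` would back the comparability claim more firmly than matching averages.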

In [15]:
tfidf_means = np.asarray(X_tfidf.mean(axis=0)).ravel()
top = np.argsort(tfidf_means)[-20:]
[(tfidf_vocab[i], tfidf_means[i]) for i in reversed(top)]

[('ausfallen', 0.05314867091189336),
 ('leuchte', 0.04788258975708639),
 ('defekt', 0.04033811201454292),
 ('mast', 0.02979671664099885),
 ('lampe', 0.02602416891748578),
 ('brunnen', 0.021934535403264406),
 ('leuchten', 0.021445646524032),
 ('licht', 0.021320446909211452),
 ('beleuchtung', 0.018672738314930982),
 ('leuchte ausfallen', 0.018402756031356615),
 ('komplett', 0.018396382822178144),
 ('laterne', 0.015589136511541587),
 ('dunkel', 0.015095454023947212),
 ('funktionieren', 0.013151424829676624),
 ('liegen', 0.012907590783925489),
 ('röhre', 0.012746497368724376),
 ('straßenlaterne', 0.012379516293690225),
 ('ecke', 0.012323661845269823),
 ('ampel', 0.011037352666449021),
 ('fehlen', 0.010909501575143317)]

**TF-IDF Feature Inspection — Interpretation**

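The listed terms have the highest **average TF-IDF scores** across documents. Because the mean is taken over all documents, it favors terms that are frequent and not down-weighted too strongly by IDF, so the ranking stays close to the BoW frequency ranking.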

- Dominated by **domain-specific failure terms** (*ausfallen, defekt, leuchte, lampe*).
- The presence of the bigram *leuchte ausfallen* shows that the **(1,2)-gram setup captures meaningful phrase-level semantics**.
- No generic stopwords → **cleaning and vectorization are effective**.
- Strong semantic coherence around **infrastructure faults and lighting**.
- High overlap with BoW results → **stable and consistent feature space**.

**Conclusion:**  
TF-IDF features are well-formed and semantically coherent. The addition of bigrams improves expressiveness without introducing noise, making the setup suitable for LDA topic modeling.
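
The overlap with the BoW results can be quantified rather than judged by eye. The sketch below compares the ten highest-ranked terms from the BoW frequency output with the first ten terms of the TF-IDF output, both copied from this notebook; the helper name `top_k_overlap` is hypothetical.

```python
def top_k_overlap(a, b):
    """Return (shared terms in the order of `a`, Jaccard similarity) for two ranked lists."""
    set_a, set_b = set(a), set(b)
    shared = [t for t in a if t in set_b]
    jaccard = len(set_a & set_b) / len(set_a | set_b)
    return shared, jaccard

# Top-10 terms copied from the BoW and TF-IDF outputs in this notebook
bow_top10 = ["ausfallen", "defekt", "leuchte", "mast", "brunnen",
             "lampe", "leuchten", "komplett", "laterne", "beleuchtung"]
tfidf_top10 = ["ausfallen", "leuchte", "defekt", "mast", "lampe",
               "brunnen", "leuchten", "licht", "beleuchtung", "leuchte ausfallen"]

shared, jaccard = top_k_overlap(bow_top10, tfidf_top10)
print(len(shared), round(jaccard, 2))
```

Eight of the ten terms are shared; the TF-IDF list swaps *komplett* and *laterne* for *licht* and the bigram *leuchte ausfallen*.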