## Cleaning Checkups

In [1]:
import sys
sys.path.append("../src")  

In [2]:
import pandas as pd
from config import CLEANED_COMPLAINTS_FILE

# -------------------------------------------------
# Load cleaned data
# -------------------------------------------------
df = pd.read_csv(CLEANED_COMPLAINTS_FILE)

print(f"Dataset shape: {df.shape}")

# -------------------------------------------------
# 1) NaN checks
# -------------------------------------------------
nan_counts = df[["lda_description", "bertopic_description"]].isna().sum()
print("\nNaN counts:")
print(nan_counts)

# -------------------------------------------------
# 2) Empty / whitespace-only checks
# -------------------------------------------------
empty_lda = (df["lda_description"].fillna("").str.strip() == "").sum()
empty_bertopic = (df["bertopic_description"].fillna("").str.strip() == "").sum()

print("\nEmpty-string counts:")
print(f"  LDA empty:        {empty_lda}")
print(f"  BERTopic empty:   {empty_bertopic}")

# -------------------------------------------------
# 3) Rows invalid for BOTH pipelines (should be 0)
# -------------------------------------------------
invalid_rows = df[
    (df["lda_description"].fillna("").str.strip() == "") &
    (df["bertopic_description"].fillna("").str.strip() == "")
]

print(f"\nRows invalid for BOTH pipelines: {len(invalid_rows)}")

# -------------------------------------------------
# 4) Accidental 'nan' strings (rare but dangerous)
# -------------------------------------------------
nan_string_lda = (df["lda_description"].astype(str).str.lower() == "nan").sum()
nan_string_bt = (df["bertopic_description"].astype(str).str.lower() == "nan").sum()

print("\nLiteral 'nan' string counts:")
print(f"  LDA:      {nan_string_lda}")
print(f"  BERTopic: {nan_string_bt}")

# -------------------------------------------------
# 5) Token / length sanity checks
# -------------------------------------------------
df["lda_token_count"] = df["lda_description"].fillna("").str.split().str.len()
df["bertopic_char_len"] = df["bertopic_description"].fillna("").str.len()

print("\nLDA token count summary:")
print(df["lda_token_count"].describe())

print("\nBERTopic character length summary:")
print(df["bertopic_char_len"].describe())

# -------------------------------------------------
# 6) Final hard assertions (fail fast)
# -------------------------------------------------
assert nan_counts.sum() == 0, "❌ NaN values detected"
assert nan_string_lda == 0 and nan_string_bt == 0, "❌ Literal 'nan' strings detected"
assert len(invalid_rows) == 0, "❌ Rows empty for both LDA and BERTopic"

print("\n✅ Cleaning sanity checks passed. Data is safe for vectorization & topic modeling.")

Dataset shape: (930, 4)

NaN counts:
lda_description         0
bertopic_description    0
dtype: int64

Empty-string counts:
  LDA empty:        0
  BERTopic empty:   0

Rows invalid for BOTH pipelines: 0

Literal 'nan' string counts:
  LDA:      0
  BERTopic: 0

LDA token count summary:
count    930.000000
mean       7.036559
std        5.962094
min        1.000000
25%        3.000000
50%        5.000000
75%       10.000000
max       65.000000
Name: lda_token_count, dtype: float64

BERTopic character length summary:
count    930.000000
mean      97.982796
std       88.028629
min        3.000000
25%       38.000000
50%       70.000000
75%      132.000000
max      941.000000
Name: bertopic_char_len, dtype: float64

✅ Cleaning sanity checks passed. Data is safe for vectorization & topic modeling.
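
The NaN, blank, and literal-'nan' checks above repeat the same column expressions several times. As a minimal sketch (the helper name `text_column_report` and the toy series are illustrative assumptions, not part of the project code), the three counts can be gathered in one function. It converts only the non-missing values to strings, because `astype(str)` can render a genuine NaN as the text "nan", which would blur check 1 and check 4.

```python
import pandas as pd


def text_column_report(series):
    """Count missing, blank, and literal 'nan' entries in a text column.

    Hypothetical helper for illustration; not part of the project code.
    """
    missing = series.isna()
    # Convert only non-missing values, so real NaNs never become the text 'nan'.
    text = series[~missing].map(str).str.strip()
    return {
        "missing": int(missing.sum()),
        "blank": int((text == "").sum()),
        "literal_nan": int((text.str.lower() == "nan").sum()),
    }


# Toy data: one missing value, two blank strings, one literal 'NaN' string.
sample = pd.Series(["leuchte defekt", None, "   ", "NaN", ""])
report = text_column_report(sample)
print(report)
```

Calling it once per column (for example on `df["lda_description"]` and `df["bertopic_description"]`) would give the same three numbers that the cell above prints separately.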


### (1) Check-up of the LDA cleaning pipeline output 

In [3]:
import pandas as pd
from config import CLEANED_COMPLAINTS_FILE

def preview_cleaning(n=10):
    """Print cleaned text above raw text for easy inspection."""
    df = pd.read_csv(CLEANED_COMPLAINTS_FILE)
    sample = df[["description", "lda_description"]].sample(n)

    for i, row in sample.iterrows():
        print("------ SAMPLE ------")
        print("CLEANED:")
        print(row["lda_description"])
        print("\nRAW:")
        print(row["description"])
        print("\n")

# Run it:
preview_cleaning(10)

------ SAMPLE ------
CLEANED:
seite funktion

RAW:
Auf einer Seite nicht in Funktion


------ SAMPLE ------
CLEANED:
defekt stromkast beleuchtungsmast

RAW:
Defekter Stromkasten an  Beleuchtungsmast


------ SAMPLE ------
CLEANED:
hängeleucht leuchtstoffröhren defekt lampengla verschmutzen

RAW:
Bei Hängeleuchte: eine von beiden Leuchtstoffröhren defekt und Lampenglas verschmutzt


------ SAMPLE ------
CLEANED:
mast schwarz leuchtstoffstäb defekt

RAW:
Mast Nr. 7 (schwarze 3): Einer von beiden Leuchtstoffstäben defekt


------ SAMPLE ------
CLEANED:
leuchte

RAW:
Leuchte aus


------ SAMPLE ------
CLEANED:
stromverteilerkast vermutlich stassenlatern beschädigen standort gehweg kapruner ecke mitterfeldstasse kabel freilegen potentiell gefahr passant

RAW:
Stromverteilerkasten, vermutlich für Stassenlaternen, beschädigt. Standort: Gehweg Kapruner Str. Ecke Mitterfeldstasse. Kabel sind freigelegt. Potentielle Gefahr für Passanten.


------ SAMPLE ------
CLEANED:
zugang enlischer garten sc

### (2) Check-up of the BERTopic cleaning pipeline output 

In [4]:
import pandas as pd
from config import CLEANED_COMPLAINTS_FILE

def preview_cleaning(n=10):
    """Print cleaned text above raw text for easy inspection."""
    df = pd.read_csv(CLEANED_COMPLAINTS_FILE)
    sample = df[["description", "bertopic_description"]].sample(n)

    for i, row in sample.iterrows():
        print("------ SAMPLE ------")
        print("CLEANED:")
        print(row["bertopic_description"])
        print("\nRAW:")
        print(row["description"])
        print("\n")

# Run it:
preview_cleaning(10)

------ SAMPLE ------
CLEANED:
mindestens eine laterne defekt flackern evtl auch zwei da straße sehr dunkel anfang der straße auf höhe hausnummer 1 und 3

RAW:
Mindestens eine Laterne defekt (flackern) evtl auch zwei, da straße sehr dunkel. Anfang der Straße auf Höhe Hausnummer 1 und 3


------ SAMPLE ------
CLEANED:
schaltkasten defekt und offen sehr gefährlich für kinder wegen stromschlag

RAW:
Schaltkasten defekt und offen. Sehr gefährlich für Kinder wegen Stromschlag. 


------ SAMPLE ------
CLEANED:
eien röhre ausgefallen

RAW:
Eien Röhre ausgefallen


------ SAMPLE ------
CLEANED:
es liegt seit einem monat auf dem boden bitte nehmen sie danke

RAW:
Es liegt seit einem Monat auf dem Boden
Bitte nehmen sie  danke


------ SAMPLE ------
CLEANED:
am kombinierten fuß und radweg ist die lampe 11 ausgefallen mast nr 12 und 13 sind auch ausgefallen

RAW:
Am kombinierten Fuß- und Radweg ist die Lampe 11 ausgefallen. Mast Nr 12 und 13 sind auch ausgefallen. 


------ SAMPLE ------
CLEANED:


## Vectorization Checkups (LDA)

In [5]:
from pathlib import Path
import sys
import os

PROJECT_ROOT = Path.cwd().parents[0]

SRC_PATH = PROJECT_ROOT / "src"
if str(SRC_PATH) not in sys.path:
    sys.path.insert(0, str(SRC_PATH))

os.chdir(PROJECT_ROOT)

print("Project root set to:", PROJECT_ROOT)
print("Working directory:", Path.cwd())
print("src/ added to sys.path")

Project root set to: /Users/dd/PycharmProjects/complaints_analysis
Working directory: /Users/dd/PycharmProjects/complaints_analysis
src/ added to sys.path


In [6]:
# Inspect feature names (vocabulary sanity check)

import joblib

bow_vocab = joblib.load("data/vectorized/lda/lda_bow_feature_names.joblib")
tfidf_vocab = joblib.load("data/vectorized/lda/lda_tfidf_feature_names.joblib")

In [7]:
bow_vocab[:50]

['abbauen',
 'abbrechen',
 'abdeckung',
 'abend',
 'abends',
 'abfall',
 'abfluss',
 'abgefallen',
 'ablauf',
 'ablegen',
 'abschaltung',
 'abschnitt',
 'absperrung',
 'abstellen',
 'agnesstr',
 'allee',
 'alt',
 'ampel',
 'ampelanlage',
 'anderer',
 'andrea',
 'anfang',
 'anforderung',
 'angebracht',
 'angefahr',
 'anlage',
 'ansehen',
 'anwesen',
 'art',
 'ast',
 'aufbauen',
 'auffüllen',
 'aufgang',
 'aufhängen',
 'aufkleber',
 'aufstellen',
 'augustiner',
 'ausfahrt',
 'ausfall',
 'ausfallen',
 'ausgang',
 'ausgebrannt',
 'ausgefall',
 'ausgefalle',
 'ausgefallen',
 'ausgehen',
 'ausleger',
 'ausleuchten',
 'ausreichend',
 'ausschließlich']

In [8]:
tfidf_vocab[:50]

['abbauen',
 'abbrechen',
 'abdeckung',
 'abend',
 'abends',
 'abfall',
 'abfluss',
 'abgefallen',
 'ablauf',
 'ablegen',
 'abschaltung',
 'abschnitt',
 'absperrung',
 'abstellen',
 'agnesstr',
 'allee',
 'alt',
 'ampel',
 'ampelanlage',
 'anderer',
 'andrea',
 'anfang',
 'anforderung',
 'angebracht',
 'angefahr',
 'anlage',
 'ansehen',
 'anwesen',
 'art',
 'ast',
 'aufbauen',
 'auffüllen',
 'aufgang',
 'aufhängen',
 'aufkleber',
 'aufstellen',
 'augustiner',
 'ausfahrt',
 'ausfall',
 'ausfallen',
 'ausgang',
 'ausgebrannt',
 'ausgefall',
 'ausgefalle',
 'ausgefallen',
 'ausgehen',
 'ausleger',
 'ausleuchten',
 'ausreichend',
 'ausschließlich']

In [9]:
from scipy.sparse import load_npz
import numpy as np
import pandas as pd

X = load_npz("data/vectorized/lda/lda_bow_matrix.npz")
freq = X.sum(axis=0).A1

top = pd.DataFrame({"token": bow_vocab, "freq": freq}).sort_values("freq", ascending=False)
top.head(30)

           token  freq
39     ausfallen   192
119       defekt   132
383      leuchte   126
414         mast    92
107      brunnen    90
352        lampe    89
384     leuchten    64
339     komplett    60
366      laterne    52
77   beleuchtung    50


In [10]:
# Document frequency distribution: share of documents containing each term

from scipy.sparse import load_npz
import numpy as np

X = load_npz("data/vectorized/lda/lda_bow_matrix.npz")
doc_freq = np.asarray((X > 0).sum(axis=0)).ravel()
doc_freq_ratio = doc_freq / X.shape[0]

# Show some stats
doc_freq_ratio.min(), doc_freq_ratio.max()

(0.0021691973969631237, 0.20498915401301518)

**Interpretation**

- The least frequent terms appear in ~0.2% of documents (≈ 2 documents), matching the min_df=2 threshold.
- The most frequent term appears in ~20% of documents, so no single token dominates the corpus.

**Assessment**

- The vocabulary is well balanced:
  - Rare terms are filtered appropriately.
  - No term exceeds ~20% document frequency, so overly generic terms are suppressed.
- The chosen min_df and max_df parameters are well calibrated for this dataset.
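
To make the effect of the two thresholds concrete, here is a minimal pure-Python sketch of document-frequency filtering. The function name `filter_vocab_by_df`, the toy corpus, and the value `max_df=0.5` are assumptions for illustration only; the notebook states min_df=2 but does not show the max_df value that was actually used.

```python
from collections import Counter


def filter_vocab_by_df(tokenized_docs, min_df=2, max_df=0.5):
    """Keep terms whose document frequency lies within the two thresholds.

    min_df is an absolute document count, max_df a fraction of all documents.
    max_df=0.5 is an illustrative assumption, not the project's setting.
    """
    n_docs = len(tokenized_docs)
    # Count each term once per document, however often it repeats there.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {
        term: count / n_docs
        for term, count in doc_freq.items()
        if count >= min_df and count / n_docs <= max_df
    }


# Toy corpus: 'leuchte' is in every document, 'defekt' in two, the rest in one.
docs = [
    ["leuchte", "defekt"],
    ["leuchte", "ausfallen"],
    ["leuchte", "mast", "defekt"],
    ["leuchte", "brunnen"],
]
kept = filter_vocab_by_df(docs, min_df=2, max_df=0.5)
print(kept)  # {'defekt': 0.5}
```

In the toy corpus the ubiquitous term is removed by max_df and the one-off terms by min_df, which is the same two-sided trimming that the min/max document-frequency ratios above are meant to verify.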

In [11]:
import pandas as pd

token_freq = X.sum(axis=0).A1
top = pd.DataFrame({
    "token": bow_vocab,
    "freq": token_freq
}).sort_values("freq", ascending=False)

top.head(20)

           token  freq
39     ausfallen   192
119       defekt   132
383      leuchte   126
414         mast    92
107      brunnen    90
352        lampe    89
384     leuchten    64
339     komplett    60
366      laterne    52
77   beleuchtung    50


**Interpretation**

- High-frequency terms clearly reflect core municipal issue themes, such as:
  - Infrastructure failures (ausfallen, defekt)
  - Street lighting (leuchte, lampe, laterne, beleuchtung)
  - Traffic signals
  - Water features (brunnen) and waste
- No obvious noise tokens or stopword leakage appear among the top tokens.

**Assessment**

- The vocabulary preserves domain-relevant semantics.
- The preprocessing and vectorization steps retained meaningful signal.

In [12]:
# Sparsity per document: number of distinct vocabulary terms in each row
tokens_per_doc = (X > 0).sum(axis=1).A1

tokens_per_doc.min(), tokens_per_doc.mean(), tokens_per_doc.max()

(1, 5.004338394793926, 40)

**Interpretation**

- The complaints are short texts, averaging ~5 distinct vocabulary terms per document after cleaning and vectorization.
- Some documents contain only a single term (e.g. “ausgefallen”), which is typical for municipal issue reports.
- Some longer descriptions (up to 40 distinct terms) exist and provide richer context, which helps stabilize topic modeling.

**Assessment**

- This distribution is expected for Open311-style complaint data.
- While short documents can limit topic granularity, the dataset still contains sufficient semantic signal.
- No additional filtering is required at this stage.
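
One detail worth keeping in mind when reading these numbers: `(X > 0).sum(axis=1)` counts distinct terms per document, not total tokens. The sketch below shows the difference on a small made-up count matrix (`X_toy` is illustrative, not the project's data).

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy count matrix: 3 documents x 3 vocabulary terms (values are term counts).
X_toy = csr_matrix(np.array([
    [2, 0, 1],
    [0, 0, 3],
    [1, 1, 1],
]))

# Distinct terms per document: how many columns are non-zero in each row.
distinct_terms = np.asarray((X_toy > 0).sum(axis=1)).ravel()

# Total tokens per document: the sum of the counts in each row.
total_tokens = np.asarray(X_toy.sum(axis=1)).ravel()

print(distinct_terms)  # [2 1 3]
print(total_tokens)    # [3 3 3]
```

All three toy documents have three tokens, yet they contain between one and three distinct terms, so the two statistics answer different questions about document length.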

In [13]:
X_tfidf = load_npz("data/vectorized/lda/lda_tfidf_matrix.npz")

print(X.nnz / X.shape[0])
print(X_tfidf.nnz / X_tfidf.shape[0])

5.004338394793926
5.004338394793926


**Interpretation**

- The Bag-of-Words and TF-IDF matrices have the same average number of non-zero entries per document (≈5.0).
- This is consistent with TF-IDF reweighting terms without altering the underlying sparsity structure, although equal averages alone do not prove that the patterns match entry by entry.

**Assessment**

- The two vectorization techniques are directly comparable.
- Any differences in topic modeling results should be attributable to term weighting, not preprocessing artifacts.
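
Comparing average non-zeros per row is a quick check; a stricter one compares the positions of the non-zero entries directly. The following is a sketch under stated assumptions: `same_sparsity_pattern` is a hypothetical helper, and the toy matrices and column weights merely stand in for the real BoW and TF-IDF matrices.

```python
import numpy as np
from scipy.sparse import csr_matrix


def same_sparsity_pattern(a, b):
    """Return True if two sparse matrices have non-zeros in identical positions.

    Hypothetical helper for illustration; works on copies of the inputs.
    """
    a = csr_matrix(a, copy=True)
    b = csr_matrix(b, copy=True)
    for m in (a, b):
        m.eliminate_zeros()  # drop explicitly stored zeros
        m.sort_indices()     # canonical column order within each row
    return (
        a.shape == b.shape
        and np.array_equal(a.indptr, b.indptr)
        and np.array_equal(a.indices, b.indices)
    )


bow = csr_matrix(np.array([[2, 0, 1], [0, 0, 3], [1, 1, 1]]))

# Stand-in for TF-IDF: rescale each stored count by a per-column weight.
weights = np.array([0.5, 2.0, 1.0])
weighted = bow.astype(float)
weighted.data *= weights[weighted.indices]

# Same number of non-zeros per row as `bow`, but in different columns.
shifted = csr_matrix(np.array([[2, 1, 0], [0, 0, 3], [1, 1, 1]]))

print(same_sparsity_pattern(bow, weighted))  # True
print(same_sparsity_pattern(bow, shifted))   # False
```

The `shifted` matrix has the same average non-zeros per row as `bow`, so the average-based check would not tell them apart, while the position-based check does.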

In [14]:
tfidf_means = np.asarray(X_tfidf.mean(axis=0)).ravel()
top = np.argsort(tfidf_means)[-20:]
[(tfidf_vocab[i], tfidf_means[i]) for i in reversed(top)]

[('ausfallen', 0.06795509854614504),
 ('leuchte', 0.057813988145756126),
 ('defekt', 0.05123290162402364),
 ('mast', 0.03707816485638278),
 ('lampe', 0.032546784108814425),
 ('leuchten', 0.0260057723548115),
 ('komplett', 0.024744525309191676),
 ('brunnen', 0.024369866454794428),
 ('licht', 0.024180885516339495),
 ('beleuchtung', 0.022223461653063804),
 ('laterne', 0.018966296868147974),
 ('dunkel', 0.018906665145957256),
 ('funktionieren', 0.01503672117201347),
 ('röhre', 0.014780035514492982),
 ('straßenlaterne', 0.014116746507961377),
 ('liegen', 0.013981140594759968),
 ('ecke', 0.013965894415703775),
 ('fehlen', 0.012977885635742835),
 ('ampel', 0.012808725619474682),
 ('mastnummer', 0.012702596333957325)]

**TF-IDF Feature Inspection — Interpretation**

The listed terms have the highest **average TF-IDF scores** across documents: they occur often
enough to matter corpus-wide while still keeping a meaningful IDF weight.

- Dominated by **domain-specific failure and lighting terms** (*ausfallen, defekt, leuchte, lampe*).
- No generic stopwords → **cleaning and vectorization are effective**.
- Strong semantic coherence around **infrastructure faults and lighting**.
- High overlap with BoW results (9 of the 10 displayed BoW top tokens also rank in the TF-IDF top 10) → **stable and consistent feature space**.

**Conclusion:**  
TF-IDF features are well-formed and suitable for LDA topic modeling. No changes needed.
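
The overlap between the two rankings can be quantified with plain set operations. The sketch below uses the ten BoW tokens displayed earlier and the first ten entries of the TF-IDF list above; the variable names are illustrative.

```python
# Top-10 tokens by raw frequency, copied from the BoW output shown earlier.
bow_top10 = [
    "ausfallen", "defekt", "leuchte", "mast", "brunnen",
    "lampe", "leuchten", "komplett", "laterne", "beleuchtung",
]

# First ten tokens by mean TF-IDF, copied from the output above.
tfidf_top10 = [
    "ausfallen", "leuchte", "defekt", "mast", "lampe",
    "leuchten", "komplett", "brunnen", "licht", "beleuchtung",
]

shared = set(bow_top10) & set(tfidf_top10)
only_bow = set(bow_top10) - set(tfidf_top10)
only_tfidf = set(tfidf_top10) - set(bow_top10)

print(f"Shared tokens: {len(shared)} of {len(bow_top10)}")  # Shared tokens: 9 of 10
print("Only in BoW top 10:   ", sorted(only_bow))           # ['laterne']
print("Only in TF-IDF top 10:", sorted(only_tfidf))         # ['licht']
```

Nine of the ten tokens coincide; only *laterne* (rank 11 by mean TF-IDF) and *licht* differ between the two lists.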