## Cleaning Checkups

In [47]:
import sys
sys.path.append("../src")  

In [48]:
import pandas as pd
from config import CLEANED_COMPLAINTS_FILE

# -------------------------------------------------
# Load cleaned data
# -------------------------------------------------
df = pd.read_csv(CLEANED_COMPLAINTS_FILE)

print(f"Dataset shape: {df.shape}")

# -------------------------------------------------
# 1) NaN checks
# -------------------------------------------------
nan_counts = df[["lda_description", "bertopic_description"]].isna().sum()
print("\nNaN counts:")
print(nan_counts)

# -------------------------------------------------
# 2) Empty / whitespace-only checks
# -------------------------------------------------
empty_lda = (df["lda_description"].fillna("").str.strip() == "").sum()
empty_bertopic = (df["bertopic_description"].fillna("").str.strip() == "").sum()

print("\nEmpty-string counts:")
print(f"  LDA empty:        {empty_lda}")
print(f"  BERTopic empty:   {empty_bertopic}")

# -------------------------------------------------
# 3) Rows invalid for BOTH pipelines (should be 0)
# -------------------------------------------------
invalid_rows = df[
    (df["lda_description"].fillna("").str.strip() == "") &
    (df["bertopic_description"].fillna("").str.strip() == "")
]

print(f"\nRows invalid for BOTH pipelines: {len(invalid_rows)}")

# -------------------------------------------------
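# 4) Accidental 'nan' strings (rare but dangerous)
#    Exact "nan"/"NaN" cells are parsed as missing by pd.read_csv by default
#    and surface in check 1; this catches remaining variants ("NAN", " nan ").
# -------------------------------------------------
# dropna() first so that real missing values are not counted as 'nan' strings
nan_string_lda = (df["lda_description"].dropna().astype(str).str.strip().str.lower() == "nan").sum()
nan_string_bt = (df["bertopic_description"].dropna().astype(str).str.strip().str.lower() == "nan").sum()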

print("\nLiteral 'nan' string counts:")
print(f"  LDA:      {nan_string_lda}")
print(f"  BERTopic: {nan_string_bt}")

# -------------------------------------------------
# 5) Token / length sanity checks
# -------------------------------------------------
df["lda_token_count"] = df["lda_description"].fillna("").str.split().str.len()
df["bertopic_char_len"] = df["bertopic_description"].fillna("").str.len()

print("\nLDA token count summary:")
print(df["lda_token_count"].describe())

print("\nBERTopic character length summary:")
print(df["bertopic_char_len"].describe())

# -------------------------------------------------
# 6) Final hard assertions (fail fast)
# -------------------------------------------------
assert nan_counts.sum() == 0, "❌ NaN values detected"
assert nan_string_lda == 0 and nan_string_bt == 0, "❌ Literal 'nan' strings detected"
assert len(invalid_rows) == 0, "❌ Rows empty for both LDA and BERTopic"

print("\n✅ Cleaning sanity checks passed. Data is safe for vectorization & topic modeling.")

Dataset shape: (926, 4)

NaN counts:
lda_description         0
bertopic_description    0
dtype: int64

Empty-string counts:
  LDA empty:        0
  BERTopic empty:   0

Rows invalid for BOTH pipelines: 0

Literal 'nan' string counts:
  LDA:      0
  BERTopic: 0

LDA token count summary:
count    926.000000
mean       7.025918
std        5.960172
min        1.000000
25%        3.000000
50%        5.000000
75%       10.000000
max       65.000000
Name: lda_token_count, dtype: float64

BERTopic character length summary:
count    926.000000
mean      97.791577
std       88.034912
min        3.000000
25%       38.000000
50%       70.000000
75%      132.000000
max      941.000000
Name: bertopic_char_len, dtype: float64

✅ Cleaning sanity checks passed. Data is safe for vectorization & topic modeling.
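
Check 4 depends on how `pd.read_csv` treats missing-value markers. The sketch below uses a small in-memory CSV (the column names and rows are made up for illustration) to show that, with default settings, a cell containing exactly `nan` is read as a missing value, whereas `keep_default_na=False` keeps it as a string.

```python
import io
import pandas as pd

# Hypothetical three-row CSV: a literal "nan" cell, a normal cell, an empty cell.
csv_text = "id,text\n1,nan\n2,hello\n3,\n"

# Default parsing: the "nan" cell and the empty cell both become missing values.
default_df = pd.read_csv(io.StringIO(csv_text))
n_missing_default = int(default_df["text"].isna().sum())

# keep_default_na=False: nothing is converted, so "nan" stays a plain string.
raw_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
n_missing_raw = int(raw_df["text"].isna().sum())
n_literal_nan = int((raw_df["text"].str.lower() == "nan").sum())

print(n_missing_default, n_missing_raw, n_literal_nan)
```

With the default reader used in this notebook, such cells would therefore be caught by the NaN check in step 1 rather than by the string comparison in step 4.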


### (1) Check-up of the LDA cleaning pipeline output 

In [49]:
import pandas as pd
from config import CLEANED_COMPLAINTS_FILE

def preview_cleaning(n=10):
    """Print cleaned text above raw text for easy inspection."""
    df = pd.read_csv(CLEANED_COMPLAINTS_FILE)
    sample = df[["description", "lda_description"]].sample(n)

    for i, row in sample.iterrows():
        print("------ SAMPLE ------")
        print("CLEANED:")
        print(row["lda_description"])
        print("\nRAW:")
        print(row["description"])
        print("\n")

# Run it:
preview_cleaning(10)

------ SAMPLE ------
CLEANED:
möchten beschweren müllberge dreck grillgestank rücksichtslos laut musiklärm jedesmal schön wetter neubürger sippe westpark grillbereich unterhalb rosengarten verursachen

RAW:
Möchte mich beschweren über die Müllberge,den Dreck,den Grillgestank und den rücksichtslos lauten Musiklärm,den jedesmal bei schönem Wetter immer wieder die gleichen Neubürger mit ihrer ganzen Sippe im Westpark im Grillbereich unterhalb des Rosengartens verursachen!!Warum wird nichts dagegen getan?


------ SAMPLE ------
CLEANED:
laterne defekt höhe nordendstraße münchen

RAW:
Laterne defekt auf Höhe Nordendstraße 34, 80801 München


------ SAMPLE ------
CLEANED:
sperrmüll bachbett

RAW:
Sperrmüll im „Bachbett“


------ SAMPLE ------
CLEANED:
isartalbahnweg mast defekt

RAW:
Am Isartalbahnweg ist der Mast Nr. 9 defekt


------ SAMPLE ------
CLEANED:
lochhausener mastnummer lampe ausfallen

RAW:
Lochhausener Str. - Mastnummer 123 - T8-Lampe ausgefallen


------ SAMPLE ------
CLEANED:

### (2) Check-up of the BERTopic cleaning pipeline output 

In [50]:
import pandas as pd
from config import CLEANED_COMPLAINTS_FILE

def preview_cleaning(n=10):
    """Print cleaned text above raw text for easy inspection."""
    df = pd.read_csv(CLEANED_COMPLAINTS_FILE)
    sample = df[["description", "bertopic_description"]].sample(n)

    for i, row in sample.iterrows():
        print("------ SAMPLE ------")
        print("CLEANED:")
        print(row["bertopic_description"])
        print("\nRAW:")
        print(row["description"])
        print("\n")

# Run it:
preview_cleaning(10)

------ SAMPLE ------
CLEANED:
eine der hängeleuchten ist aus

RAW:
Eine der Hängeleuchten ist aus. 


------ SAMPLE ------
CLEANED:
an der sophienstr ist eine gehwegleuchte ausgefallen

RAW:
An der Sophienstr. ist eine Gehwegleuchte ausgefallen. 


------ SAMPLE ------
CLEANED:
man sieht das auf dem foto nicht so gut von den beiden lampen ist die linke defekt

RAW:
Man sieht das auf dem Foto nicht so gut. Von den beiden Lampen ist die linke defekt


------ SAMPLE ------
CLEANED:
straßenlaterne vor am westpark 8 neben der litfaßsäule funktioniert nicht

RAW:
Straßenlaterne vor Am Westpark 8 (neben der Litfaßsäule) funktioniert nicht.


------ SAMPLE ------
CLEANED:
am anfang der perhamerstraße von fürstenriederstr bis auf höhe lutzstraße ging in der nacht von 6 9 auf 7 9 24 keine einzige straßenlaterne

RAW:
Am Anfang der Perhamerstraße (von Fürstenriederstr. bis auf Höhe Lutzstraße) ging in der Nacht von 6.9. auf 7.9.24 keine einzige Straßenlaterne.


------ SAMPLE ------
CLEANED:
neue

## Vectorization Checkups (LDA)

In [1]:
# ===== Project path bootstrap (RUN ONCE) =====
from pathlib import Path
import sys
import os

# Resolve project root assuming this notebook is in /notebooks/
PROJECT_ROOT = Path.cwd().parents[0]

# Add src/ to Python path for module imports
SRC_PATH = PROJECT_ROOT / "src"
if str(SRC_PATH) not in sys.path:
    sys.path.insert(0, str(SRC_PATH))

# Change working directory to project root (for relative file paths)
os.chdir(PROJECT_ROOT)

print("Project root set to:", PROJECT_ROOT)
print("Working directory:", Path.cwd())
print("src/ added to sys.path")

Project root set to: /Users/dd/PycharmProjects/complaints_analysis
Working directory: /Users/dd/PycharmProjects/complaints_analysis
src/ added to sys.path
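
The bootstrap cell is marked RUN ONCE because it changes the working directory, so a second run of `Path.cwd().parents[0]` would resolve to the parent of the project root. One possible alternative, sketched here, searches upwards for a marker directory and gives the same answer no matter how often it runs. The function name `find_project_root` and the choice of `src` as the marker are assumptions made for this illustration, not part of the project.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Return the closest directory at or above `start` that contains `marker`/."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory at or above {start} contains '{marker}/'")


# Demonstration on a throwaway directory tree: <tmp>/src and <tmp>/notebooks.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "src").mkdir()
    (root / "notebooks").mkdir()

    found_from_notebooks = find_project_root(root / "notebooks")
    found_from_root = find_project_root(root)  # simulates a second run after chdir
    is_idempotent = found_from_notebooks == found_from_root == root

print(is_idempotent)

# In the notebook this would be used as:
# PROJECT_ROOT = find_project_root(Path.cwd())
```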


In [2]:
# Inspect feature names (vocabulary sanity check)

import joblib

bow_vocab = joblib.load("data/vectorized/lda/lda_bow_feature_names.joblib")
tfidf_vocab = joblib.load("data/vectorized/lda/lda_tfidf_feature_names.joblib")

In [3]:
bow_vocab[:50]

['abbauen',
 'abbrechen',
 'abdeckung',
 'abend',
 'abends',
 'abfall',
 'abfluss',
 'abgefallen',
 'ablauf',
 'ablegen',
 'abschaltung',
 'abschnitt',
 'absperrung',
 'abstellen',
 'agnesstr',
 'allee',
 'alt',
 'ampel',
 'ampelanlage',
 'anderer',
 'andrea',
 'anfang',
 'anforderung',
 'angebracht',
 'angefahr',
 'anlage',
 'ansehen',
 'anwesen',
 'art',
 'ast',
 'aufbauen',
 'auffüllen',
 'aufgang',
 'aufhängen',
 'aufkleber',
 'aufstellen',
 'augustiner',
 'ausfahrt',
 'ausfall',
 'ausfallen',
 'ausgang',
 'ausgebrannt',
 'ausgefall',
 'ausgefalle',
 'ausgefallen',
 'ausgehen',
 'ausleger',
 'ausleuchten',
 'ausreichend',
 'ausschließlich']

In [4]:
tfidf_vocab[:50]

['abbauen',
 'abbrechen',
 'abdeckung',
 'abend',
 'abends',
 'abfall',
 'abfluss',
 'abgefallen',
 'ablauf',
 'ablegen',
 'abschaltung',
 'abschnitt',
 'absperrung',
 'abstellen',
 'agnesstr',
 'allee',
 'alt',
 'ampel',
 'ampelanlage',
 'anderer',
 'andrea',
 'anfang',
 'anforderung',
 'angebracht',
 'angefahr',
 'anlage',
 'ansehen',
 'anwesen',
 'art',
 'ast',
 'aufbauen',
 'auffüllen',
 'aufgang',
 'aufhängen',
 'aufkleber',
 'aufstellen',
 'augustiner',
 'ausfahrt',
 'ausfall',
 'ausfallen',
 'ausgang',
 'ausgebrannt',
 'ausgefall',
 'ausgefalle',
 'ausgefallen',
 'ausgehen',
 'ausleger',
 'ausleuchten',
 'ausreichend',
 'ausschließlich']

In [5]:
from scipy.sparse import load_npz
import numpy as np
import pandas as pd

X = load_npz("data/vectorized/lda/lda_bow_matrix.npz")
freq = X.sum(axis=0).A1

top = pd.DataFrame({"token": bow_vocab, "freq": freq}).sort_values("freq", ascending=False)
top.head(30)

           token  freq
39     ausfallen   191
117       defekt   132
380      leuchte   126
411         mast    92
105      brunnen    90
349        lampe    89
381     leuchten    63
336     komplett    59
363      laterne    50
77   beleuchtung    50


In [6]:
# Document frequency distribution

from scipy.sparse import load_npz
import numpy as np

X = load_npz("data/vectorized/lda/lda_bow_matrix.npz")
doc_freq = np.asarray((X > 0).sum(axis=0)).ravel()
doc_freq_ratio = doc_freq / X.shape[0]

# Show some stats
doc_freq_ratio.min(), doc_freq_ratio.max()

(0.002178649237472767, 0.2047930283224401)

Interpretation
	•	The least frequent terms appear in ~0.2% of documents. The minimum ratio of 0.00218 equals 2/918, i.e. exactly 2 documents, which matches the min_df=2 threshold.
	•	The most frequent term appears in ~20% of documents, so no single token dominates the corpus.

Assessment
	•	The vocabulary looks well balanced:
		•	Rare terms (fewer than 2 documents) are filtered out.
		•	No retained term occurs in more than ~20% of documents, so overly generic terms appear to be suppressed.
	•	The chosen min_df and max_df parameters appear well calibrated for this dataset.
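
How document-frequency thresholds translate into the observed ratio range can be sketched on a toy count matrix. The data and the `max_df = 0.7` value below are hypothetical (the notes above only state `min_df=2`), and the filtering is written out by hand rather than taken from the project's vectorizer.

```python
import numpy as np

# Toy document-term count matrix: 5 documents x 4 terms (hypothetical data).
counts = np.array([
    [1, 0, 2, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
])
n_docs = counts.shape[0]

# Document frequency: number of documents in which each term occurs at least once.
doc_freq = (counts > 0).sum(axis=0)
doc_freq_ratio = doc_freq / n_docs

# Assumed thresholds: min_df as an absolute count, max_df as a ratio.
min_df, max_df = 2, 0.7
keep = (doc_freq >= min_df) & (doc_freq_ratio <= max_df)

kept_ratio = doc_freq_ratio[keep]
print("document frequencies:", doc_freq)
print("terms kept:", keep)
print("smallest kept ratio >= min_df / n_docs:", kept_ratio.min() >= min_df / n_docs)
```

After filtering, the smallest possible ratio is `min_df / n_docs`, which is why a minimum of 2 documents shows up as roughly 0.2% in a corpus of several hundred documents.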

In [7]:
import pandas as pd

token_freq = X.sum(axis=0).A1
top = pd.DataFrame({
    "token": bow_vocab,
    "freq": token_freq
}).sort_values("freq", ascending=False)

top.head(20)

           token  freq
39     ausfallen   191
117       defekt   132
380      leuchte   126
411         mast    92
105      brunnen    90
349        lampe    89
381     leuchten    63
336     komplett    59
363      laterne    50
77   beleuchtung    50


Interpretation
	•	High-frequency terms clearly reflect core municipal issue themes, such as:
		•	Infrastructure failures
		•	Street lighting
		•	Traffic signals
		•	Water features and waste
	•	No obvious noise tokens or stopword leakage are present.

Assessment
	•	The vocabulary preserves domain-relevant semantics.
	•	The preprocessing and vectorization steps successfully retained meaningful signal.
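
A manual read of the vocabulary can be backed by a simple mechanical check. The sketch below takes an excerpt of the sorted BoW vocabulary shown earlier and flags neighbouring entries where one is a prefix of the next. This is only a heuristic for manual review, not a step in the project's pipeline: some pairs are legitimately distinct words, while others may be inconsistently lemmatized forms of the same verb.

```python
# Excerpt of the sorted BoW vocabulary shown earlier in this notebook.
vocab = [
    "abend", "abends", "ampel", "ampelanlage", "ausfall", "ausfallen",
    "ausgang", "ausgefall", "ausgefalle", "ausgefallen", "ausgehen",
]

# Flag neighbours in sorted order where one entry is a prefix of the next.
# Candidates only: "ampel" / "ampelanlage" are distinct words, whereas
# "ausgefalle" / "ausgefallen" look like variants of one form.
ordered = sorted(vocab)
prefix_pairs = [
    (a, b) for a, b in zip(ordered, ordered[1:]) if b.startswith(a)
]

for a, b in prefix_pairs:
    print(f"{a} -> {b}")
```

Whether flagged variants should be merged is a judgment call. With short documents, merging them concentrates counts on fewer features, at the cost of one more normalization step.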

In [8]:
# Sparsity per document (number of distinct vocabulary terms per row)
tokens_per_doc = (X > 0).sum(axis=1).A1

tokens_per_doc.min(), tokens_per_doc.mean(), tokens_per_doc.max()

(1, 4.994553376906318, 40)

Interpretation
	•	The complaints are short texts, with an average of ~5 distinct vocabulary terms per document after cleaning and vectorization. The count is based on X > 0, so a term that repeats within a document is counted once.
	•	A small number of documents contain only a single term (e.g. “ausgefallen”), which is typical for municipal issue reports.
	•	Some longer descriptions (up to 40 distinct terms) exist and provide richer context, which helps stabilize topic modeling.

Assessment
	•	This distribution is expected for Open311-style complaint data.
	•	While short documents can limit topic granularity, the dataset still contains sufficient semantic signal.
	•	No additional filtering is required at this stage.
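
The expression `(X > 0).sum(axis=1)` counts distinct vocabulary terms per document, which can differ from the total token count when a term repeats. A minimal sketch on a hypothetical dense count matrix shows the two quantities side by side:

```python
import numpy as np

# Hypothetical document-term counts: 3 documents x 4 vocabulary terms.
counts = np.array([
    [2, 1, 0, 0],   # 3 tokens, 2 distinct terms
    [0, 0, 1, 0],   # 1 token,  1 distinct term
    [1, 1, 1, 3],   # 6 tokens, 4 distinct terms
])

# Distinct terms per document: how many columns are non-zero in each row.
distinct_terms = (counts > 0).sum(axis=1)

# Total tokens per document: the row sums of the raw counts.
total_tokens = counts.sum(axis=1)

print("distinct terms:", distinct_terms)
print("total tokens:  ", total_tokens)
```

This is one reason why the per-document figures here are lower than the `lda_token_count` summary from the cleaning check. Another is that terms removed by the vocabulary thresholds never enter the matrix.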

In [9]:
X_tfidf = load_npz("data/vectorized/lda/lda_tfidf_matrix.npz")

print(X.nnz / X.shape[0])
print(X_tfidf.nnz / X_tfidf.shape[0])

4.994553376906318
4.994553376906318


Interpretation
	•	The Bag-of-Words and TF-IDF matrices have the same average number of non-zero entries per document, and therefore the same total number of non-zero entries.
	•	This is consistent with TF-IDF reweighting terms without altering the underlying sparsity structure.

Assessment
	•	The two vectorization techniques are directly comparable.
	•	Any differences in topic modeling results will be attributable to term weighting, not preprocessing artifacts.
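
Equal `nnz / n_docs` values show that both matrices hold the same total number of non-zero entries. A stricter check compares the positions of those entries. The sketch below illustrates why the positions should coincide, assuming a scikit-learn-style smoothed IDF and L2 row normalization. The notebook does not state the vectorizer settings, so both are assumptions, and the counts are hypothetical.

```python
import numpy as np

# Hypothetical document-term counts: 3 documents x 4 terms, no empty rows.
counts = np.array([
    [2, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 3],
], dtype=float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Assumed smoothed IDF: ln((1 + n) / (1 + df)) + 1, which is always >= 1.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF-IDF scales every count by a strictly positive factor ...
tfidf = counts * idf
# ... and L2 normalization divides each row by a positive norm.
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

# Zero entries stay zero and non-zero entries stay non-zero.
same_pattern = np.array_equal(counts > 0, tfidf > 0)
print("identical non-zero pattern:", same_pattern)
```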

In [10]:
tfidf_means = np.asarray(X_tfidf.mean(axis=0)).ravel()
top = np.argsort(tfidf_means)[-20:]
[(tfidf_vocab[i], tfidf_means[i]) for i in reversed(top)]

[('ausfallen', 0.06811506872020248),
 ('leuchte', 0.058020352149674055),
 ('defekt', 0.05137879254587185),
 ('mast', 0.03719220622876415),
 ('lampe', 0.03266920163313266),
 ('leuchten', 0.025904562939074605),
 ('komplett', 0.02475369719785844),
 ('brunnen', 0.024453050065529985),
 ('licht', 0.02412279608641433),
 ('beleuchtung', 0.022306561459963202),
 ('dunkel', 0.018957822436670102),
 ('laterne', 0.018735438419480988),
 ('funktionieren', 0.014851177203165426),
 ('röhre', 0.01484489983295294),
 ('straßenlaterne', 0.014173669174849659),
 ('ecke', 0.014074454660146923),
 ('liegen', 0.01403574448520288),
 ('fehlen', 0.013021792220663526),
 ('ampel', 0.012847347302876951),
 ('höhe', 0.012512371305062315)]

**TF-IDF Feature Inspection — Interpretation**

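The listed terms have the highest **mean TF-IDF scores** across all documents. Because the mean is
taken over every document, including those where a term is absent, frequent terms still rank high,
which explains the close match with the BoW frequency ranking.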

- Dominated by **domain-specific failure terms** (*ausfallen, defekt, leuchte, lampe*).
- No generic stopwords → **cleaning and vectorization are effective**.
- Strong semantic coherence around **infrastructure faults and lighting**.
- High overlap with BoW results → **stable and consistent feature space**.

**Conclusion:**  
TF-IDF features are well-formed and suitable for LDA topic modeling. No changes needed.
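
The overlap with the BoW results can be quantified. The sketch below copies the ten most frequent BoW tokens and the ten tokens with the highest mean TF-IDF from the outputs above and computes their Jaccard overlap. The choice of ten terms and of Jaccard as the measure are illustrative assumptions.

```python
# Top-10 tokens copied from the BoW frequency table and the mean TF-IDF list above.
bow_top = ["ausfallen", "defekt", "leuchte", "mast", "brunnen",
           "lampe", "leuchten", "komplett", "laterne", "beleuchtung"]
tfidf_top = ["ausfallen", "leuchte", "defekt", "mast", "lampe",
             "leuchten", "komplett", "brunnen", "licht", "beleuchtung"]

shared = set(bow_top) & set(tfidf_top)
union = set(bow_top) | set(tfidf_top)
jaccard = len(shared) / len(union)

print(f"shared: {len(shared)} of {len(bow_top)}")
print(f"Jaccard overlap: {jaccard:.2f}")
print("only in BoW top-10:", sorted(set(bow_top) - set(tfidf_top)))
print("only in TF-IDF top-10:", sorted(set(tfidf_top) - set(bow_top)))
```

Nine of the ten terms are shared. Only "laterne" (BoW) and "licht" (TF-IDF) differ.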