
# Clustering & Recommendations on Netflix 🎬  
**Pipeline:** Cleaning → TF-IDF → Shingling → MinHash → LSH → Jaccard Filter → Hybrid Cosine (content + genres/country) → Top-N Recos → Distance Matrix for Clustering

> This notebook is based on your code snippets and is structured so that you can run it directly in VS Code.



## 1) Setup & Imports
> If a library is missing, install it in your active environment (terminal in VS Code):
```bash
pip install pandas numpy scikit-learn datasketch scipy mmh3
```


In [None]:

import pandas as pd 
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from datasketch import MinHash, MinHashLSH
import re
import mmh3
from itertools import combinations
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import string
from scipy.sparse import lil_matrix
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Load data
# Make sure the CSV is in the same folder or adjust the path.
df = pd.read_csv("netflix_titles.csv", encoding="latin1", sep=",", quotechar='"', engine="python")
print(f"Raw rows loaded: {len(df)}")



## 2) Text Cleaning & Normalization
- Removes parenthesized additions from titles and normalizes text (lowercase, punctuation stripped).
- Deduplicates on the normalized title **and** on the cleaned description.


In [None]:

#%% Text cleaning
def normalize_title(title):
    if pd.isna(title):
        return ''
    return re.sub(r'\(.*?\)', '', title).lower().strip()

def clean_text(text):
    text = str(text).lower()
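    # Escape the punctuation so that characters like \ and ] are matched literally inside the class
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)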
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

df['title_normalized'] = df['title'].fillna('').apply(normalize_title)
df['title_clean'] = df['title'].fillna('').apply(clean_text)
df['description_clean'] = df['description'].fillna('').apply(clean_text)

# Drop duplicates
df = df.drop_duplicates(subset='title_normalized').reset_index(drop=True)
df = df.drop_duplicates(subset='description_clean').reset_index(drop=True)
print(f"Data loaded: {len(df)} unique titles after dedup.")
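
To make the effect of the two helpers concrete, here is a minimal standalone sketch on made-up strings. It reimplements them without pandas (a `None` check stands in for `pd.isna`) and escapes the punctuation class with `re.escape`; the `demo_` names and the sample inputs are hypothetical.

```python
import re
import string

def demo_normalize_title(title):
    # Assumption: None stands in for a missing value (the notebook uses pd.isna).
    if title is None:
        return ''
    # Drop parenthesized parts, then lowercase and trim.
    return re.sub(r'\(.*?\)', '', title).lower().strip()

def demo_clean_text(text):
    text = str(text).lower()
    # Replace every punctuation character with a space.
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical sample inputs, not rows from the dataset.
print(demo_normalize_title("Stranger Things (Season 2)"))  # stranger things
print(demo_clean_text("Hello, World! It's  2020."))         # hello world it s 2020
```

Note that the apostrophe in "It's" becomes a space, so contractions split into two tokens.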



## 3) Genres & Countries → Multi-Hot Features
- `listed_in` (genres, comma-separated) → list
- Combined with `country` → MultiLabelBinarizer


In [None]:

# Process genres and countries
df['genre_list'] = df['listed_in'].apply(lambda x: [g.strip() for g in x.split(',')] if pd.notnull(x) else [])
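# Split multi-country strings as well, and skip missing values so '' never becomes a label
df['country_list'] = df['country'].apply(lambda x: [c.strip() for c in x.split(',') if c.strip()] if pd.notnull(x) else [])
df['combined_features'] = df['genre_list'] + df['country_list']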

# One-hot encode genres + countries
mlb = MultiLabelBinarizer()
genre_country_matrix = mlb.fit_transform(df['combined_features'])
print("Genre+Country feature matrix shape:", genre_country_matrix.shape)
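
As an illustration of what the multi-hot encoding produces, the following plain-Python sketch builds the same kind of 0/1 matrix for two made-up titles. It is not the scikit-learn implementation; `demo_multi_hot` and the sample labels are assumptions for illustration only.

```python
def demo_multi_hot(label_lists):
    # Collect the sorted vocabulary of all labels (one column per label).
    classes = sorted({label for labels in label_lists for label in labels})
    column = {label: idx for idx, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        rows.append(row)
    return classes, rows

# Hypothetical genre + country lists for two titles.
demo_features = [
    ["Dramas", "India"],
    ["Comedies", "India", "United States"],
]
demo_classes, demo_matrix = demo_multi_hot(demo_features)
print(demo_classes)  # ['Comedies', 'Dramas', 'India', 'United States']
print(demo_matrix)   # [[0, 1, 1, 0], [1, 0, 1, 1]]
```

Each column is one genre or country, so two titles that share a label have a 1 in the same column.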



## 4) TF-IDF on Descriptions & Top Words per Title


In [None]:

#%% TF-IDF vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['description_clean'])
feature_names = vectorizer.get_feature_names_out()

# Walk the non-zero entries once in COO format (avoids slow per-element sparse indexing)
coo = tfidf_matrix.tocoo()
tfidf_words = defaultdict(list)
for r, c, score in zip(coo.row, coo.col, coo.data):
    tfidf_words[r].append((feature_names[c], score))

top_n = 20
def top_words(doc_idx, n=top_n):
    words_scores = tfidf_words[doc_idx]
    words_scores.sort(key=lambda x: x[1], reverse=True)
    words = [w for w, _ in words_scores[:n]]
    return ' '.join(words)

df['description_tfidf'] = [top_words(i) for i in range(len(df))]
print("\nExample top words for first description:")
print(df.loc[0, 'description_tfidf'])
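
To show why the top words tend to be the rare ones, here is a small plain-Python sketch of TF-IDF weighting. It assumes the smoothed idf `ln((1 + n) / (1 + df)) + 1` with L2 normalization, which to my understanding matches the `TfidfVectorizer` defaults, and it uses plain whitespace tokenization. The `demo_` names and the three documents are made up.

```python
import math
from collections import Counter

def demo_tfidf(docs):
    # Assumption: whitespace tokenization; TfidfVectorizer's default tokenizer
    # additionally drops single-character tokens.
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        # Term count times smoothed idf, then L2 normalization per document.
        weights = {w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
                   for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        vectors.append({w: v / norm for w, v in weights.items()})
    return vectors

def demo_top_words(vector, n):
    # Highest-weighted words first, mirroring top_words() in the cell above.
    ranked = sorted(vector.items(), key=lambda x: x[1], reverse=True)
    return [w for w, _ in ranked[:n]]

demo_docs = [
    "detective hunts killer city",
    "chef opens restaurant city",
    "detective opens bakery city",
]
demo_vectors = demo_tfidf(demo_docs)
print(demo_top_words(demo_vectors[0], 2))  # the two words unique to document 0
```

"city" occurs in every document and gets the lowest weight, while "hunts" and "killer" occur only once in the corpus and rank highest.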



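## 5) Shingling (q-grams)
> Defaults to `q=1` (unigrams). You can set `q=2` (bigrams) so that two titles only match on shared word pairs, which makes the comparison stricter. The shingles are built from `description_tfidf`, so adjacent words follow the TF-IDF ranking rather than the original sentence order.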


In [None]:

#%% Shingling
def shingle(q, text):
    words = text.split()
    return [words[i:i+q] for i in range(len(words)-q+1)]

q = 1
shingle_vector = [shingle(q, text) for text in df['description_tfidf']]
print("\nExample shingles for first description:")
print(shingle_vector[0][:10])
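
The difference between `q=1` and `q=2` is easiest to see on a short string. This sketch repeats the shingling logic under a hypothetical name (`demo_shingle`) on a made-up top-word string.

```python
def demo_shingle(q, text):
    # Slide a window of q words over the text; each shingle is a list of words.
    words = text.split()
    return [words[i:i + q] for i in range(len(words) - q + 1)]

demo_text = "detective hunts killer city"  # hypothetical top-word string
print(demo_shingle(1, demo_text))  # [['detective'], ['hunts'], ['killer'], ['city']]
print(demo_shingle(2, demo_text))  # [['detective', 'hunts'], ['hunts', 'killer'], ['killer', 'city']]
print(demo_shingle(5, demo_text))  # [] because the text has fewer than 5 words
```

A text with fewer than `q` words yields no shingles at all, which the MinHash step has to tolerate.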



## 6) MinHash Signatures (custom, MurmurHash3)
> Generates a MinHash signature of length `k` for each document. The fraction of matching positions between two signatures approximates the **Jaccard similarity**.


In [None]:

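def listhash(shingle, seed):
    # Hash the whole shingle as one string so that word order matters for q >= 2
    # (XOR-ing per-word hashes would make ['a', 'b'] and ['b', 'a'] collide).
    return mmh3.hash(' '.join(shingle), seed)

def minhash_k(shingles, k):
    # A document without shingles has no minimum. Give it a sentinel signature
    # outside mmh3's signed 32-bit range instead of raising ValueError;
    # note that all such documents then share the same signature.
    if not shingles:
        return [2**31] * k
    return [min(listhash(s, seed) for s in shingles) for seed in range(1, k+1)]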

k = 50
minhash_signatures = np.array([minhash_k(shingles, k) for shingles in shingle_vector])
print("\nExample MinHash signature for first doc:")
print(minhash_signatures[0])
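
The claim that matching signature positions approximate the Jaccard similarity can be checked on two small sets with a known overlap. The sketch below uses only the standard library: a seeded MD5 prefix stands in for `mmh3.hash`, and all `demo_` names and the sets are hypothetical.

```python
import hashlib

def demo_hash(token, seed):
    # Assumption: a seeded MD5 prefix stands in for mmh3.hash(token, seed).
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def demo_signature(tokens, k):
    # One minimum per seed; position i uses hash function number i.
    return [min(demo_hash(t, seed) for t in tokens) for seed in range(1, k + 1)]

def demo_estimate(sig_a, sig_b):
    # Fraction of positions where both signatures agree.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

set_a = {f"w{i}" for i in range(0, 30)}    # w0..w29
set_b = {f"w{i}" for i in range(10, 40)}   # w10..w39
exact = len(set_a & set_b) / len(set_a | set_b)  # 20 shared / 40 total = 0.5

sig_a = demo_signature(set_a, 256)
sig_b = demo_signature(set_b, 256)
print(f"exact={exact:.2f} estimate={demo_estimate(sig_a, sig_b):.2f}")
```

The estimate should land near 0.5 but not exactly on it; a longer signature narrows the gap at the cost of more hashing.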



## 7) LSH (Bands × Rows) → Candidate Pairs
> Splits the signatures into `bands × rows` (here 10 × 5) and collects pairs that are identical in at least one band.


In [None]:

def lsh_candidates(signatures, bands, rows):
    assert bands * rows == signatures.shape[1], "bands * rows must equal signature length"
    candidates = set()
    n = signatures.shape[0]
    
    for b in range(bands):
        buckets = defaultdict(list)
        for i in range(n):
            band_sig = tuple(signatures[i, b*rows:(b+1)*rows])
            buckets[band_sig].append(i)
        for bucket_docs in buckets.values():
            # bucket_docs is filled in ascending index order, so each pair is already (smaller, larger)
            for pair in combinations(bucket_docs, 2):
                candidates.add(pair)
    return candidates

bands = 10
rows = 5
candidates = lsh_candidates(minhash_signatures, bands, rows)
print(f"\nNumber of candidate pairs: {len(candidates)}")
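
How the choice of `bands` and `rows` shapes the candidate set follows from the standard banding formula: a pair with Jaccard similarity `s` becomes a candidate with probability `1 - (1 - s^rows)^bands`. The sketch below evaluates it for the 10 × 5 setting, assuming independent signature positions; the function names are hypothetical.

```python
def demo_candidate_probability(s, bands, rows):
    # A single band matches with probability s**rows; a pair becomes a
    # candidate if at least one of the bands matches.
    return 1 - (1 - s ** rows) ** bands

def demo_lsh_threshold(bands, rows):
    # Common approximation of the similarity at which the S-curve rises steeply.
    return (1 / bands) ** (1 / rows)

lsh_bands_demo, lsh_rows_demo = 10, 5
print(f"approx. threshold: {demo_lsh_threshold(lsh_bands_demo, lsh_rows_demo):.2f}")
for s in (0.2, 0.35, 0.5, 0.65, 0.8, 0.95):
    p = demo_candidate_probability(s, lsh_bands_demo, lsh_rows_demo)
    print(f"s={s:.2f} -> P(candidate)={p:.3f}")
```

With 10 × 5 the approximate threshold is about 0.63, and a pair with true similarity 0.35 becomes a candidate only about 5% of the time. If pairs around 0.35 matter, more bands with fewer rows (for example 25 × 2) would shift the curve to the left.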



## 8) MinHash-Based Jaccard Estimate & Filter
> Estimates the Jaccard similarity as the fraction of matching signature positions and keeps only pairs at or above `threshold`.


In [None]:

#%% Jaccard similarity for candidate pairs
def jaccard_list(doc1_idx, doc2_idx, signatures):
    sig1 = signatures[doc1_idx]
    sig2 = signatures[doc2_idx]
    matches = np.sum(sig1 == sig2)
    return matches / len(sig1)

threshold = 0.35
similarities = []
for i, j in candidates:
    sim = jaccard_list(i, j, minhash_signatures)
    if sim >= threshold:
        similarities.append((i, j, sim))

similarities.sort(key=lambda x: x[2], reverse=True)
print(f"\nTop 5 similar pairs (threshold={threshold}):")
for i, j, sim in similarities[:5]:
    print(f"- {df.loc[i, 'title']} ↔ {df.loc[j, 'title']} | similarity: {sim:.2f}")
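
Because the signature-based value is only an estimate, it can help to know how noisy it is and to re-check borderline pairs exactly. The sketch below shows an exact set-based Jaccard and the binomial standard error of the estimate; both helpers are hypothetical additions, not part of the pipeline above.

```python
import math

def demo_exact_jaccard(words_a, words_b):
    # |A ∩ B| / |A ∪ B| on the actual word sets; two empty sets count as 0 here.
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def demo_estimate_std_error(jaccard, k):
    # Each signature position matches with probability `jaccard`,
    # so the match fraction has a binomial standard error.
    return math.sqrt(jaccard * (1 - jaccard) / k)

print(demo_exact_jaccard("detective hunts killer".split(),
                         "detective hunts thief".split()))  # 0.5
print(round(demo_estimate_std_error(0.35, 50), 3))
```

For `k = 50` and a true similarity of 0.35 the standard error is about 0.067, so pairs close to the cutoff can fall on either side of it by chance.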



## 9) Recommendations: Hybrid of MinHash & Cosine (TF-IDF + Genres/Country)
- Start with the MinHash hits.
- Top up with weighted cosine similarities (0.7 content, 0.3 metadata).
- Make sure every title ends up with **top-N** recommendations.


In [None]:

#%% Build recommendations and ensure all movies have top-N
recommendations = defaultdict(list)

# Fill from MinHash similarities first
for i, j, sim in similarities:
    if df.loc[i, 'title_normalized'] == df.loc[j, 'title_normalized']:
        continue
    recommendations[i].append((j, sim))
    recommendations[j].append((i, sim))

# Calculate cosine similarity for descriptions
desc_similarity = cosine_similarity(tfidf_matrix)

# Calculate cosine similarity for genre + country
genre_similarity = cosine_similarity(genre_country_matrix)

# Combine both: You can adjust weights (e.g., 0.7 for descriptions, 0.3 for genres)
cosine_sim = 0.7 * desc_similarity + 0.3 * genre_similarity

top_n = 5
for i in range(len(df)):
    if len(recommendations[i]) < top_n:
        sims = cosine_sim[i]
        best_idx = np.argsort(sims)[::-1]
        # Titles already recommended via MinHash must not be added twice
        seen = {r[0] for r in recommendations[i]}
        for j in best_idx:
            if i == j or j in seen:
                continue
            recommendations[i].append((int(j), float(sims[j])))
            # Stop once this title has top_n recommendations in total
            if len(recommendations[i]) >= top_n:
                break

# Truncate to top-N total
for k_idx, recs in recommendations.items():
    recommendations[k_idx] = sorted(recs, key=lambda x: x[1], reverse=True)[:top_n]

example_idx = np.random.randint(0, len(df))
print(f"\nFinal recommendations for '{df.loc[example_idx, 'title']}':")
for rec_idx, sim in recommendations[example_idx]:
    print(f"- {df.loc[rec_idx, 'title']} (similarity: {sim:.2f})")
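
The 0.7 / 0.3 weighting can be traced by hand on three tiny items. This plain-Python sketch blends two cosine similarities the same way as `cosine_sim` above; the vectors and the `demo_` helpers are made up for illustration.

```python
import math

def demo_cosine(u, v):
    # Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def demo_blend(desc_vecs, meta_vecs, i, w_desc=0.7, w_meta=0.3):
    # Weighted similarity of item i to every other item, best first.
    scores = []
    for j in range(len(desc_vecs)):
        if j == i:
            continue
        score = (w_desc * demo_cosine(desc_vecs[i], desc_vecs[j])
                 + w_meta * demo_cosine(meta_vecs[i], meta_vecs[j]))
        scores.append((j, score))
    return sorted(scores, key=lambda x: x[1], reverse=True)

# Hypothetical vectors: item 1 shares item 0's description, item 2 its metadata.
demo_desc = [[1, 0], [1, 0], [0, 1]]
demo_meta = [[1, 0], [0, 1], [1, 0]]
for j, score in demo_blend(demo_desc, demo_meta, 0):
    print(j, round(score, 2))  # 1 0.7, then 2 0.3
```

With these weights, an identical description outranks identical metadata, which matches the intent of treating genres and country as a secondary signal.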



## 10) Similarity → Distance for Clustering
> Use this distance matrix as the input for hierarchical clustering or DBSCAN.


In [None]:

similarity_matrix = cosine_sim
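distance_matrix = np.clip(1 - similarity_matrix, 0, None)  # clip tiny negatives from floating-point error; use this for the clustering
np.fill_diagonal(distance_matrix, 0)  # the distance of a title to itself is exactly 0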
distance_matrix



## 11) (Optional) Hierarchical Clustering (Subsample)
> Full pairwise distances are O(n²). For large datasets, use a subsample or compute only the top-k neighbors.


In [None]:

# Optional demo on a small subset to avoid O(n^2) blowup
subset = min(400, distance_matrix.shape[0])  # adjust as needed
if subset >= 3:
    Z = linkage(squareform(distance_matrix[:subset, :subset], checks=False), method='average')
    labels = fcluster(Z, t=0.7, criterion='distance')
    print("Cluster labels (first few):", labels[:20])
else:
    print("Not enough items for clustering demo.")
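
To see what average linkage with a distance cut does, the following naive plain-Python sketch merges the two closest clusters until the smallest average distance exceeds `t`. It is a slow illustration of the idea, not SciPy's implementation, and its label numbering may differ from `fcluster`; `demo_average_linkage` and the 4 × 4 matrix are hypothetical.

```python
def demo_average_linkage(dist, t):
    # Start with every item in its own cluster.
    clusters = [[i] for i in range(len(dist))]

    def avg_dist(a, b):
        # Average linkage: mean distance over all cross-cluster pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, x, y = min(
            (avg_dist(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > t:
            break  # every remaining merge would exceed the cut distance
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    # Assign 1-based labels; the numbering itself is arbitrary.
    labels = [0] * len(dist)
    for label, members in enumerate(clusters, start=1):
        for i in members:
            labels[i] = label
    return labels

# Hypothetical 4x4 distance matrix with two tight pairs.
demo_dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
print(demo_average_linkage(demo_dist, t=0.7))  # [1, 1, 2, 2]
```

Lowering `t` below 0.2 would leave items 2 and 3 unmerged, and raising it to 0.9 or above would join everything into one cluster.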
