# Task
Implement a topic clustering model using LangGraph with nodes for Data Cleaning (spaCy), Vector Extraction (DistilBERT), Dimensionality Reduction (UMAP), Clustering (OPTICS & K-Means), Cluster Naming (KeyBERT), and Storage (ChromaDB). The model should handle noise points with OPTICS and use K-Means for clustering when the number of clusters is specified. The cluster names should be a maximum of 5 words. Include comments in the code.

## Set up the LangGraph environment

### Subtask:
Install necessary libraries and define the graph structure with nodes for each stage (Data Cleaning, Vector Extraction, Dimensionality Reduction, Clustering, Cluster Naming, Storage) and an orchestrator.


**Reasoning**:
The first step is to install the necessary libraries for the topic clustering model. This involves installing spaCy for data cleaning, sentence-transformers for embedding, umap-learn for dimensionality reduction, scikit-learn and hdbscan for clustering, keybert for cluster naming, and chromadb for storage, along with langgraph for building the graph.



In [20]:
%pip install spacy sentence-transformers umap-learn scikit-learn hdbscan keybert chromadb langgraph --quiet
!python -m spacy download en_core_web_sm --quiet

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
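
The cells that follow assume a `workflow` object built in an earlier cell and only register node functions on it. As a rough illustration of the contract those nodes follow (each one receives the state and returns a partial update, and an `error` key stops further processing), here is a minimal plain-Python sketch. It does not use LangGraph itself; `run_pipeline`, `fake_clean`, and `fake_embed` are hypothetical names introduced only for this example.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Node = Callable[[State], State]


def run_pipeline(state: State, nodes: List[Node]) -> State:
    """Run nodes in order, merging each partial update into the state.

    Stops as soon as the state carries an "error" value, which mimics an
    orchestrator routing to the end of the graph on failure.
    """
    current = dict(state)  # copy so the caller's dict is not mutated
    for node in nodes:
        current.update(node(current))
        if current.get("error"):
            break
    return current


# Hypothetical stand-in nodes, used only to demonstrate the flow.
def fake_clean(state: State) -> State:
    return {"cleaned_data": [text.lower() for text in state["input_data"]]}


def fake_embed(state: State) -> State:
    if not state.get("cleaned_data"):
        return {"error": "No cleaned data available for embedding."}
    # One toy feature per text: its length.
    return {"embeddings": [[float(len(text))] for text in state["cleaned_data"]]}


result = run_pipeline({"input_data": ["Hello World", "LangGraph"]}, [fake_clean, fake_embed])
print(result["cleaned_data"])
print(result["embeddings"])
```

In the real graph the merging and routing are handled by LangGraph; the sketch only shows why every node below returns a small dictionary rather than the full state.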


**Reasoning**:
I need to research alternative sentence embedding models suitable for distinguishing humor from non-humor, select the most promising one, update the `extract_embeddings` function to use the new model, and replace the old 'embed' node in the workflow. I will start by researching models and then implement the change.



In [21]:
import spacy
import typing
from typing import List, Optional, Dict, Any

# Define the state of the graph
# This dictionary will hold the data and parameters throughout the pipeline
class GraphState(typing.TypedDict):
    """
    Represents the state of our graph.

    Attributes:
        input_data: Original input data (list of strings).
        cleaned_data: Data after cleaning (list of strings).
        embeddings: Embeddings of the cleaned data (numpy array).
        reduced_embeddings: Dimensionality-reduced embeddings (numpy array).
        cluster_labels: Labels assigned to each data point (list of ints).
        cluster_names: Names generated for each cluster (dictionary).
        num_clusters: Optional number of clusters for K-Means (int).
        error: Any error encountered during the process (string).
        next_node: Explicitly set next node for orchestrator routing (string).
        storage_status: Indicates if storage is complete (string).
    """
    # TypedDict fields are annotations only: type checkers reject default values,
    # and they would not be filled in on state dictionaries at runtime anyway.
    input_data: List[str]
    cleaned_data: Optional[List[str]]
    embeddings: Optional[Any]
    reduced_embeddings: Optional[Any]
    cluster_labels: Optional[List[int]]
    cluster_names: Optional[Dict[int, str]]
    num_clusters: Optional[int]
    error: Optional[str]
    next_node: Optional[str]
    storage_status: Optional[str]


# Load spaCy model
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Downloading spaCy model 'en_core_web_sm'...")
    from spacy.cli import download
    download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

# Define the Data Cleaning Node function
def clean_data(state: GraphState) -> Dict[str, List[str]]:
    """
    Cleans the input text data using spaCy.

    Args:
        state: The current state of the graph with input_data.

    Returns:
        A dictionary updating the state with cleaned_data.
    """
    print("---DATA CLEANING NODE---")
    input_data = state.get("input_data") # Use .get() for safer access

    if input_data is None:
        print("Error: No input data available for cleaning.")
        return {"error": "No input data available for cleaning."}

    cleaned_texts = []

    for text in input_data:
        if isinstance(text, str): # Ensure the input is a string
            # Process text with spaCy
            doc = nlp(text)

            # Keep lowercased lemmas; drop punctuation, stop words, and whitespace tokens
            cleaned_text = " ".join([
                token.lemma_.lower() for token in doc
                if not token.is_punct and not token.is_stop and not token.is_space
            ])
            cleaned_texts.append(cleaned_text)
        else:
            # Keep a placeholder so cleaned_data stays aligned with input_data;
            # later nodes pair cluster labels with input_data by position.
            print(f"Warning: Non-string input replaced with an empty string: {text!r}")
            cleaned_texts.append("")


    print(f"Cleaned {len(cleaned_texts)} texts.")
    print(f"First cleaned text sample: {cleaned_texts[:1]}") # Debugging print
    print(f"Returning state update: {{'cleaned_data': ...}}") # Debugging print

    return {"cleaned_data": cleaned_texts}

# Assuming 'workflow' is defined in a previous cell
# If 'workflow' is not defined, you would need to initialize it here
# from langgraph.graph import StateGraph
# workflow = StateGraph(GraphState)

# Now, replace the placeholder node in the workflow with the actual function
# Remove the old placeholder node before adding the new one if it exists
# This part assumes 'workflow' is globally accessible and initialized
# If not, you might need to adjust the notebook structure
try:
    # Drop any existing "clean" node first; add_node raises if the name is already registered
    workflow.nodes.pop("clean", None)

    workflow.add_node("clean", clean_data)
    print("Data Cleaning Node implemented and added to the workflow.")
except NameError:
    print("Error: 'workflow' is not defined. Please ensure the cell defining the StateGraph is executed first.")



Data Cleaning Node implemented and added to the workflow.


In [22]:
from sentence_transformers import SentenceTransformer
import numpy as np
import typing
from typing import List, Optional, Dict, Any

# GraphState is defined in the Data Cleaning cell above; run that cell first.


# Load a pre-trained sentence transformer model
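# Try 'tweetnlp/TweetNLP-Sentence-Embedding-base' first, chosen as a candidate for humor detection.
# If it cannot be loaded (as in the run recorded below, where the identifier was not found
# on the Hugging Face Hub), fall back to 'all-MiniLM-L6-v2'.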
try:
    new_model = SentenceTransformer('tweetnlp/TweetNLP-Sentence-Embedding-base')
    print("Loaded new embedding model: tweetnlp/TweetNLP-Sentence-Embedding-base")
except Exception as e:
    print(f"Error loading new embedding model: {e}. Falling back to 'all-MiniLM-L6-v2'")
    # Fallback to a reliable general-purpose model if the preferred one fails
    new_model = SentenceTransformer('all-MiniLM-L6-v2')
    print("Loaded fallback embedding model: all-MiniLM-L6-v2")


# Define the Vector Extraction Node function
def extract_embeddings(state: GraphState) -> Dict[str, np.ndarray]:
    """
    Extracts vector embeddings from cleaned text data using the selected pre-trained model.

    Args:
        state: The current state of the graph with cleaned_data.

    Returns:
        A dictionary updating the state with embeddings.
    """
    print("---VECTOR EXTRACTION NODE (Updated)---")
    cleaned_data = state.get("cleaned_data") # Use .get() for safer access

    if cleaned_data is None:
        print("Error: No cleaned data available for embedding.")
        return {"error": "No cleaned data available for embedding."}

    print(f"Extracting embeddings for {len(cleaned_data)} texts using the updated model...")
    # Generate embeddings using the new model
    embeddings = new_model.encode(cleaned_data)
    print("Embeddings extraction complete (Updated).")
    return {"embeddings": embeddings} # Ensure this returns a dictionary to update state

# Assuming 'workflow' is defined in a previous cell
# If 'workflow' is not defined, you would need to initialize it here
# from langgraph.graph import StateGraph
# workflow = StateGraph(GraphState)

# Replace the existing 'embed' node in the workflow with the updated function
# Remove the old node before adding the new one
try:
    # Drop any existing "embed" node first; add_node raises if the name is already registered
    workflow.nodes.pop("embed", None)

    workflow.add_node("embed", extract_embeddings)
    print("Vector Extraction Node updated and added to the workflow.")
except NameError:
    print("Error: 'workflow' is not defined. Please ensure the cell defining the StateGraph is executed first.")


# Re-compile the workflow with the updated node (This might not be necessary if adding to an existing workflow)
# app = workflow.compile()
# print("LangGraph workflow re-compiled with updated embedding node.")



Error loading new embedding model: tweetnlp/TweetNLP-Sentence-Embedding-base is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`. Falling back to 'all-MiniLM-L6-v2'




Loaded fallback embedding model: all-MiniLM-L6-v2
Vector Extraction Node updated and added to the workflow.
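
The try/except in the cell above is a two-step fallback: attempt the preferred model, then load a general-purpose one. If more candidates are ever needed, the same idea can be written as a loop. The sketch below is a generic illustration under that assumption; `load_first_available` and `fake_loader` are hypothetical helpers, and the fake loader stands in for a real model constructor so the example runs without downloading anything.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def load_first_available(names: Iterable[str], loader: Callable[[str], T]) -> Tuple[str, T]:
    """Return (name, model) for the first candidate that loads without error."""
    errors: List[str] = []
    for name in names:
        try:
            return name, loader(name)
        except Exception as exc:  # loading can fail for many reasons (network, bad id)
            errors.append(f"{name}: {exc}")
    raise RuntimeError("No model could be loaded: " + "; ".join(errors))


# Hypothetical loader that only "knows" the fallback model.
def fake_loader(name: str) -> str:
    if name != "all-MiniLM-L6-v2":
        raise OSError("not a valid model identifier")
    return f"<model {name}>"


chosen_name, model = load_first_available(
    ["tweetnlp/TweetNLP-Sentence-Embedding-base", "all-MiniLM-L6-v2"],
    fake_loader,
)
print(f"Loaded: {chosen_name}")
```

In the notebook, `SentenceTransformer` itself could be passed as the `loader` argument, since it accepts a model name as its first argument.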


In [23]:
import umap
import numpy as np
import typing
from typing import List, Optional, Dict, Any

# GraphState is defined in the Data Cleaning cell above; run that cell first.


# Define the Dimensionality Reduction Node function
def reduce_dimensionality(state: GraphState) -> Dict[str, np.ndarray]:
    """
    Reduces the dimensionality of vector embeddings using UMAP.

    Args:
        state: The current state of the graph with embeddings.

    Returns:
        A dictionary updating the state with reduced_embeddings.
    """
    print("---DIMENSIONALITY REDUCTION NODE---")
    embeddings = state.get("embeddings") # Use .get() for safer access

    if embeddings is None:
        print("Error: No embeddings available for dimensionality reduction.")
        return {"error": "No embeddings available for dimensionality reduction."}

    # Count rows of the embedding matrix itself, since that is what UMAP fits
    n_samples = len(embeddings)
    # Determine target dimensionality based on the number of samples
    if n_samples <= 500:
        n_components = 20
    elif n_samples <= 5000:
        n_components = 30
    elif n_samples <= 20000:
        n_components = 50
    else:
        n_components = 100

    # Precaution for very small inputs: keep the target below the sample count,
    # because UMAP's default initialisation can fail when it is not
    n_components = max(2, min(n_components, n_samples - 2))

    print(f"Reducing dimensionality to {n_components} using UMAP...")
    # Initialize and fit UMAP
    reducer = umap.UMAP(n_components=n_components, random_state=42)
    reduced_embeddings = reducer.fit_transform(embeddings)

    print("Dimensionality reduction complete.")
    return {"reduced_embeddings": reduced_embeddings}

# Assuming 'workflow' is defined in a previous cell
# If 'workflow' is not defined, you would need to initialize it here
# from langgraph.graph import StateGraph
# workflow = StateGraph(GraphState)

# Now, replace the placeholder node in the workflow with the actual function
# Remove the old placeholder node before adding the new one if it exists
try:
    # Drop any existing "reduce_dim" node first; add_node raises if the name is already registered
    workflow.nodes.pop("reduce_dim", None)

    workflow.add_node("reduce_dim", reduce_dimensionality)
    print("Dimensionality Reduction Node implemented and added to the workflow.")
except NameError:
    print("Error: 'workflow' is not defined. Please ensure the cell defining the StateGraph is executed first.")



Dimensionality Reduction Node implemented and added to the workflow.
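
The sample-count thresholds in `reduce_dimensionality` can be pulled out into a small pure function, which makes the heuristic easy to check in isolation. The sketch below mirrors the thresholds from the cell above; the function name `choose_n_components` and the cap for very small inputs are assumptions added for illustration.

```python
def choose_n_components(n_samples: int) -> int:
    """Pick a UMAP target dimensionality from the number of samples."""
    if n_samples <= 500:
        target = 20
    elif n_samples <= 5000:
        target = 30
    elif n_samples <= 20000:
        target = 50
    else:
        target = 100
    # Assumed safeguard: never ask for more dimensions than the data can
    # support, and never go below two.
    return max(2, min(target, n_samples - 2))


for size in (10, 500, 501, 20000, 20001):
    print(size, choose_n_components(size))
```

Keeping the rule separate from the UMAP call also makes it simpler to adjust the thresholds later without touching the node.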


**Reasoning**:
I will modify the `cluster_data` function to exclusively use KMeans with `n_clusters=2` on the `reduced_embeddings` when `num_clusters` is 2, as instructed. This will guarantee exactly two clusters with no noise points in this specific scenario. I will then replace the existing 'cluster' node in the workflow with this modified function and re-compile the workflow.



In [24]:
from sklearn.cluster import KMeans, OPTICS
import numpy as np
import typing
from typing import List, Optional, Dict, Any

# GraphState is defined in the Data Cleaning cell above; run that cell first.


# Define the Clustering Node function
def cluster_data(state: GraphState) -> Dict[str, Any]:
    """
    Clusters the dimensionality-reduced data using K-Means with n_clusters=2
    when num_clusters is set to 2 in the state, ensuring no noise points.
    Retains OPTICS logic for other num_clusters values or when num_clusters is not provided.

    Args:
        state: The current state of the graph with reduced_embeddings and optional num_clusters.

    Returns:
        A dictionary updating the state with cluster_labels or an error message.
    """
    print("---CLUSTERING NODE---")
    reduced_embeddings = state.get("reduced_embeddings") # Use .get() for safer access
    num_clusters = state.get("num_clusters")

    if reduced_embeddings is None:
        print("Error: No reduced embeddings available for clustering.")
        return {"error": "No reduced embeddings available for clustering."}

    cluster_labels = None

    # If num_clusters is specifically 2, use K-Means on all data points
    if num_clusters == 2:
        print(f"Applying K-Means to achieve exactly {num_clusters} clusters on all data points...")
        try:
            kmeans_model = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
            cluster_labels = kmeans_model.fit_predict(reduced_embeddings)
            print("K-Means clustering complete (2 clusters, no noise).")
        except Exception as e:
            print(f"Error during K-Means clustering: {e}")
            return {"error": f"K-Means clustering failed: {e}"}

    # Otherwise, use OPTICS or K-Means on non-noise points if num_clusters is specified and not 2
    else:
        print("Performing clustering using OPTICS...")
        # Use OPTICS to find clusters and identify noise points
        optics_model = OPTICS(min_samples=10, xi=0.05, min_cluster_size=0.05)
        optics_model.fit(reduced_embeddings)

        optics_labels = optics_model.labels_
        noise_points = optics_labels == -1
        n_noise = list(optics_labels).count(-1)

        print(f"OPTICS found {len(set(optics_labels)) - (1 if -1 in optics_labels else 0)} clusters and {n_noise} noise points.")

        if num_clusters is not None and num_clusters > 0:
            print(f"Applying K-Means to achieve {num_clusters} clusters on non-noise points...")
            # Filter out noise points for K-Means
            non_noise_indices = np.where(~noise_points)[0]
            non_noise_embeddings = reduced_embeddings[non_noise_indices]

            if len(non_noise_embeddings) == 0:
                print("Warning: No non-noise points to apply K-Means.")
                # Assign -1 to all points if no non-noise points
                final_cluster_labels = np.full(len(reduced_embeddings), -1, dtype=int)
            elif num_clusters > len(non_noise_embeddings):
                print(f"Warning: Requested number of clusters ({num_clusters}) is greater than the number of non-noise points ({len(non_noise_embeddings)}). Using OPTICS labels.")
                final_cluster_labels = optics_labels
            else:
                # Apply K-Means
                kmeans_model = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
                kmeans_labels = kmeans_model.fit_predict(non_noise_embeddings)

                # Map K-Means labels back to original indices, keeping noise points as -1
                final_cluster_labels = np.full(len(reduced_embeddings), -1, dtype=int)
                final_cluster_labels[non_noise_indices] = kmeans_labels
                # Report success only on the branch where K-Means actually ran
                print("K-Means clustering complete (on non-noise points).")

            cluster_labels = final_cluster_labels

        else:
            print("Using OPTICS clustering results.")
            cluster_labels = optics_labels


    if cluster_labels is not None:
        return {"cluster_labels": cluster_labels.tolist()} # Ensure labels are a list for JSON compatibility
    else:
        return {"error": "Clustering failed to produce labels."}


# Assuming 'workflow' is defined in a previous cell
# If 'workflow' is not defined, you would need to initialize it here
# from langgraph.graph import StateGraph
# workflow = StateGraph(GraphState)

# Now, replace the placeholder node in the workflow with the actual function
# Remove the old placeholder node before adding the new one if it exists
try:
    # Drop any existing "cluster" node first; add_node raises if the name is already registered
    workflow.nodes.pop("cluster", None)

    workflow.add_node("cluster", cluster_data)

    print("Clustering Node modified and updated in the workflow to force 2 clusters with KMeans when num_clusters=2.")
except NameError:
    print("Error: 'workflow' is not defined. Please ensure the cell defining the StateGraph is executed first.")


# Re-compile the workflow with the updated node (This might not be necessary if adding to an existing workflow)
# app = workflow.compile()
# print("LangGraph workflow re-compiled with updated clustering node.")



Clustering Node modified and updated in the workflow to force 2 clusters with KMeans when num_clusters=2.
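
When `num_clusters` is set to something other than 2, the node runs two passes: OPTICS marks noise with `-1`, then K-Means labels only the remaining points, and those labels are written back into their original positions. The sketch below isolates that write-back step with plain lists so it can be followed without scikit-learn. `merge_labels` is a hypothetical helper name, and the label values are made up for the example.

```python
from typing import List, Sequence

NOISE = -1


def merge_labels(first_pass: Sequence[int], second_pass: Sequence[int]) -> List[int]:
    """Write second-pass labels back onto the non-noise positions of the first pass.

    first_pass:  one label per point, with -1 marking noise.
    second_pass: one label per non-noise point, in the same order.
    """
    kept = [i for i, label in enumerate(first_pass) if label != NOISE]
    if len(kept) != len(second_pass):
        raise ValueError("second_pass must have one label per non-noise point")
    merged = [NOISE] * len(first_pass)
    for index, label in zip(kept, second_pass):
        merged[index] = label
    return merged


optics_labels = [0, -1, 1, 1, -1, 0]   # two points flagged as noise
kmeans_labels = [2, 0, 0, 1]           # labels for the four remaining points
print(merge_labels(optics_labels, kmeans_labels))  # [2, -1, 0, 0, -1, 1]
```

Noise points keep the `-1` label, so the naming node can still report them as "Noise".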


In [25]:
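from keybert import KeyBERT
from collections import defaultdict
from typing import Dict
import google.generativeai as genai
from google.colab import userdata
import numpy as np

# GraphState is defined in the Data Cleaning cell above; run that cell first.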

# Load a pre-trained KeyBERT model (still useful for keyword suggestions if needed)
kw_model = KeyBERT()

# Configure Gemini API
try:
    # Assuming GOOGLE_API_KEY is already set in the environment or Colab secrets
    GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
    genai.configure(api_key=GOOGLE_API_KEY)
    gemini_model = genai.GenerativeModel('gemini-1.5-flash-latest') # Using a suitable model
    print("Gemini API configured successfully.")
except Exception as e:
    print(f"Error configuring Gemini API: {e}")
    gemini_model = None # Set to None if configuration fails


# Define the Cluster Naming Node function
def name_clusters(state: GraphState) -> Dict[str, Dict[int, str]]:
    """
    Names the clusters using Gemini API or KeyBERT, extracting keywords from documents within each cluster,
    aiming for semantic names and handling potential API failures.

    Args:
        state: The current state of the graph with input_data and cluster_labels.

    Returns:
        A dictionary updating the state with cluster_names.
    """
    print("---CLUSTER NAMING NODE (Refined)---")
    input_data = state.get("input_data") # Use .get() for safer access
    cluster_labels = state.get("cluster_labels") # Use .get() for safer access

    if input_data is None or cluster_labels is None:
        print("Error: Input data or cluster labels are missing for naming.")
        return {"error": "Input data or cluster labels are missing for naming."}

    # Group documents by cluster label
    clustered_docs = defaultdict(list)
    for doc, label in zip(input_data, cluster_labels):
        clustered_docs[label].append(doc)

    cluster_names = {}
    # Generate a name for each cluster
    for cluster_id, docs in clustered_docs.items():
        if cluster_id == -1:
            cluster_names[cluster_id] = "Noise"
            continue

        if not docs:
            cluster_names[cluster_id] = "Empty Cluster"
            continue

        cluster_name = None # Initialize cluster_name to None

        # Use Gemini API for naming if configured
        if gemini_model:
            print(f"Attempting to generate name for Cluster {cluster_id} using Gemini API...")
            # Take a sample of documents to avoid exceeding context window
            sample_docs = docs[:20] # Use a reasonable sample size
            # Format the sample as one bullet per line
            doc_list = "\n".join(f"- {doc}" for doc in sample_docs)
            # Refine the prompt to be more direct about the desired output format and constraints
            prompt = f"""Analyze the following texts from a cluster and provide a concise name (maximum 5 words) that summarizes the main topic. Ensure the name is semantic and easy to understand.

Texts:
{doc_list}

Concise Name (max 5 words):"""
            try:
                response = gemini_model.generate_content(prompt)
                if response and response.text:
                    cluster_name_raw = response.text.strip()
                    # Ensure the concise name is max 5 words
                    cluster_name = " ".join(cluster_name_raw.split()[:5])
                    print(f"Generated name for Cluster {cluster_id} with Gemini API: {cluster_name}")
                else:
                    print(f"Gemini API returned an empty response for Cluster {cluster_id}. Falling back to KeyBERT.")
            except Exception as e:
                print(f"Error generating name for Cluster {cluster_id} with Gemini API: {e}. Falling back to KeyBERT.")

        # Fallback to KeyBERT if Gemini API failed or not configured
        if cluster_name is None:
            print(f"Using KeyBERT for Cluster {cluster_id}...")
            cluster_text = " ".join(docs)
            keywords = kw_model.extract_keywords(
                cluster_text,
                keyphrase_ngram_range=(1, 3),
                stop_words='english',
                use_mmr=True,
                diversity=0.7,
                top_n=5
            )
            keyword_list = [keyword[0] for keyword in keywords]
            # Combine keywords into a name, ensuring it's max 5 words
            cluster_name = " ".join(" ".join(keyword_list).split()[:5])
            # Fall back to a generic label if KeyBERT returned no keywords
            if not cluster_name:
                cluster_name = f"Cluster {cluster_id}"

            print(f"Generated name for Cluster {cluster_id} with KeyBERT: {cluster_name}")

        cluster_names[cluster_id] = cluster_name


    print("Cluster naming complete (Refined).")
    return {"cluster_names": cluster_names} # Ensure this returns a dictionary to update state

# Assuming 'workflow' is defined in a previous cell
# If 'workflow' is not defined, you would need to initialize it here
# from langgraph.graph import StateGraph
# workflow = StateGraph(GraphState)

# Now, replace the placeholder node in the workflow with the actual function
# Remove the old placeholder node before adding the new one if it exists
try:
    # Drop any existing "name_clusters" node first; add_node raises if the name is already registered
    workflow.nodes.pop("name_clusters", None)

    workflow.add_node("name_clusters", name_clusters)
    print("Cluster Naming Node refined and updated in the workflow.")
except NameError:
    print("Error: 'workflow' is not defined. Please ensure the cell defining the StateGraph is executed first.")

# Re-compile the workflow with the updated node (This might not be necessary if adding to an existing workflow)
# app = workflow.compile()
# print("LangGraph workflow re-compiled with updated naming node.")



Gemini API configured successfully.
Cluster Naming Node refined and updated in the workflow.
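
Two small pieces of the naming node are worth checking on their own: the five-word cap applied to whatever the model or KeyBERT returns, and the way sampled documents are laid out in the prompt. The following sketch shows one way to write both as standalone helpers. `cap_words` and `build_prompt` are hypothetical names, and the prompt wording is shortened for the example.

```python
from typing import List


def cap_words(name: str, max_words: int = 5) -> str:
    """Collapse whitespace and keep at most max_words words."""
    return " ".join(name.split()[:max_words])


def build_prompt(docs: List[str], sample_size: int = 20) -> str:
    """Build a naming prompt with one bullet per sampled document."""
    bullets = "\n".join(f"- {doc}" for doc in docs[:sample_size])
    return (
        "Provide a concise name (maximum 5 words) for the main topic of these texts.\n\n"
        f"Texts:\n{bullets}\n\n"
        "Concise Name (max 5 words):"
    )


print(cap_words("  Funny   jokes about cats and dogs today "))
print(build_prompt(["Why did the chicken cross the road?", "Stock markets fell today."]))
```

Applying the cap after generation, rather than trusting the prompt alone, keeps the five-word limit enforced even when the model ignores the instruction.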


In [26]:
import chromadb
import typing
from typing import List, Optional, Dict, Any

# GraphState is defined in the Data Cleaning cell above; run that cell first.


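# Set the analytics flag before creating the client so it is in place at start-up.
# Note: Chroma's documentation names ANONYMIZED_TELEMETRY as its telemetry opt-out variable.
%env CHROMA_ANALYTICS=False

# Initialize ChromaDB client (in-memory for this example)
client = chromadb.Client()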

# Define the Storage Node function
def store_results(state: GraphState) -> Dict[str, Any]:
    """
    Stores the clustered data and cluster names in ChromaDB.

    Args:
        state: The current state of the graph with input_data, cluster_labels, and cluster_names.

    Returns:
        A dictionary indicating the storage is complete or an error message.
    """
    print("---STORAGE NODE---")
    input_data = state.get("input_data") # Use .get() for safer access
    cluster_labels = state.get("cluster_labels") # Use .get() for safer access
    cluster_names = state.get("cluster_names") # Use .get() for safer access

    if input_data is None or cluster_labels is None or cluster_names is None:
        print("Error: Data, labels, or names are missing for storage.")
        return {"error": "Data, labels, or names are missing for storage."}

    # Create or get a collection
    collection_name = "topic_clusters"
    try:
        # Attempt to delete collection if it exists to avoid issues with re-adding
        client.delete_collection(name=collection_name)
        print(f"Deleted existing collection: {collection_name}")
    except Exception:
        pass # Ignore if collection doesn't exist

    try:
        collection = client.create_collection(name=collection_name)
        print(f"Created collection: {collection_name}")
    except Exception as e:
        print(f"Error creating collection: {e}")
        return {"error": f"Error creating collection: {e}"}


    # Prepare data for ChromaDB
    ids = [f"doc_{i}" for i in range(len(input_data))]
    # Store original text and cluster label as metadata
    metadatas = []
    for i in range(len(input_data)):
        metadata = {"cluster_label": str(cluster_labels[i])}
        # Add cluster name to metadata if available
        if cluster_labels[i] in cluster_names:
            metadata["cluster_name"] = cluster_names[cluster_labels[i]]
        metadatas.append(metadata)


    # Add data to the collection
    # Note: no embeddings are passed here, so ChromaDB computes them from the documents
    # with the collection's embedding function (its built-in default in this setup).
    # To reuse the pipeline's own vectors for similarity queries, pass them to add()
    # through the embeddings argument instead.
    print(f"Adding {len(input_data)} documents to ChromaDB collection '{collection_name}'...")
    try:
        collection.add(
            documents=input_data,
            metadatas=metadatas,
            ids=ids
        )
        print("Storage complete.")
        return {"storage_status": "complete"}
    except Exception as e:
        print(f"Error adding documents to collection: {e}")
        return {"error": f"Error adding documents to collection: {e}"}


# Assuming 'workflow' is defined in a previous cell
# If 'workflow' is not defined, you would need to initialize it here
# from langgraph.graph import StateGraph
# workflow = StateGraph(GraphState)

# Now, replace the placeholder node in the workflow with the actual function
# Remove the old placeholder node (if any) before adding the new one
try:
    # pop with a default is a no-op when "store" was never registered
    workflow.nodes.pop("store", None)

    workflow.add_node("store", store_results)

    print("Storage Node implemented and added to the workflow.")
except NameError:
    print("Error: 'workflow' is not defined. Please ensure the cell defining the StateGraph is executed first.")

# Re-compile the workflow after changing nodes; an app compiled earlier will not pick up the updated node
# app = workflow.compile()
# print("LangGraph workflow re-compiled with updated storage node.")



env: CHROMA_ANALYTICS=False
Storage Node implemented and added to the workflow.
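
The metadata preparation inside `store_results` can be pulled into a small pure-Python helper, which makes it checkable without a ChromaDB client. The sketch below is illustrative only: the helper name `build_chroma_records` is hypothetical (it is not part of ChromaDB or of this notebook), and it assumes that noise points carry the label -1 and that `cluster_names` maps integer labels to strings.

```python
from typing import Dict, List, Sequence, Tuple


def build_chroma_records(
    cluster_labels: Sequence[int],
    cluster_names: Dict[int, str],
) -> Tuple[List[str], List[Dict[str, str]]]:
    """Build parallel `ids` and `metadatas` lists for a later collection.add call.

    Hypothetical helper for illustration; assumes noise is labeled -1.
    """
    ids: List[str] = []
    metadatas: List[Dict[str, str]] = []
    for i, raw_label in enumerate(cluster_labels):
        # Cast to a plain Python int so the label is a regular value
        label = int(raw_label)
        metadata = {"cluster_label": str(label)}
        # Only attach a name when one was generated for this label
        if label in cluster_names:
            metadata["cluster_name"] = cluster_names[label]
        ids.append(f"doc_{i}")
        metadatas.append(metadata)
    return ids, metadatas


# Toy labels and names, invented for this example
ids, metadatas = build_chroma_records([0, 1, -1], {0: "space travel", 1: "cooking"})
print(ids)
print(metadatas)
```

The returned lists line up index by index with the documents, so they could be passed as the `ids` and `metadatas` arguments alongside `documents=input_data`.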


## Analyze and present results

### Subtask:
Examine the new clustering results (labels and noise) and the generated cluster names. Present the updated visualization and document lists per cluster to the user for review.


**Reasoning**:
I will generate a 2D UMAP visualization of the reduced embeddings, coloring points by cluster labels and including a legend with cluster names. Then I will iterate through each unique cluster, print its ID and name, and list the documents belonging to that cluster.
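
Before plotting, a quick numeric summary of the labels can help judge the clustering: how many clusters were found, how large each one is, and what share of documents was marked as noise. This is a minimal sketch, assuming noise is labeled -1; the function name `summarize_labels` is hypothetical and not part of the notebook.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def summarize_labels(cluster_labels: Sequence[int]) -> Tuple[Dict[int, int], float]:
    """Return per-cluster sizes (noise excluded) and the fraction of noise points."""
    counts = Counter(int(label) for label in cluster_labels)
    # -1 marks noise by assumption; remove it from the per-cluster counts
    noise_count = counts.pop(-1, 0)
    total = len(cluster_labels)
    noise_fraction = noise_count / total if total else 0.0
    return dict(sorted(counts.items())), noise_fraction


# Toy labels, invented for this example
sizes, noise_fraction = summarize_labels([0, 0, 1, -1, 1, 0, -1, 2])
print(f"Cluster sizes: {sizes}")
print(f"Noise fraction: {noise_fraction:.2f}")
```

A high noise fraction or a single dominant cluster is usually a sign that the clustering parameters deserve another look before reading too much into the plot.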



In [27]:
import matplotlib.pyplot as plt
import umap
import numpy as np
from collections import defaultdict

# Assuming 'final_state' contains the results after running the workflow
# Make sure to run the workflow execution cell (e.g., cell ID 3ad91cd3) first
if 'final_state' in locals():
    reduced_embeddings = final_state.get("reduced_embeddings")
    cluster_labels = final_state.get("cluster_labels")
    # Fall back to an empty dict so the .get() lookups below work when no names were generated
    cluster_names = final_state.get("cluster_names") or {}
    input_data = final_state.get("input_data")


    if reduced_embeddings is not None and cluster_labels is not None:
        print("Generating 2D visualization of clusters...")

        # Reduce dimensionality to 2 components specifically for visualization
        # Use a consistent random_state for reproducibility
        umap_visualizer = umap.UMAP(n_components=2, random_state=42)
        # Need to handle cases where reduced_embeddings might be empty or not suitable
        try:
            reduced_for_viz = umap_visualizer.fit_transform(reduced_embeddings)
        except Exception as e:
            print(f"Error during UMAP reduction for visualization: {e}")
            print("Could not generate visualization.")
            reduced_for_viz = None


        if reduced_for_viz is not None:
            # Create the scatter plot
            plt.figure(figsize=(10, 8))

            # Get unique cluster labels (noise, labeled -1, is included and styled separately below)
            unique_labels = sorted(set(cluster_labels))

            # Assign a color to each cluster from a qualitative colormap
            # ('tab10' provides up to 10 distinct colors)
            # plt.get_cmap is used because plt.cm.get_cmap is deprecated in recent Matplotlib releases
            colors = plt.get_cmap('tab10', len(unique_labels))


            for i, label in enumerate(unique_labels):
                if label == -1:
                    # Plot noise points in black
                    color = 'black'
                    label_name = cluster_names.get(label, "Noise")
                    alpha = 0.5
                    marker = 'x'
                else:
                    # Plot regular clusters with colors
                    color = colors(i)
                    label_name = cluster_names.get(label, f"Cluster {label}")
                    alpha = 0.8
                    marker = 'o'

                # Select points belonging to the current cluster
                clustered_points = reduced_for_viz[np.array(cluster_labels) == label]

                # Plot the points
                plt.scatter(
                    clustered_points[:, 0],
                    clustered_points[:, 1],
                    s=10, # Size of points
                    color=color,
                    label=label_name,
                    alpha=alpha,
                    marker=marker
                )

            plt.title('Topic Clustering Visualization (UMAP 2D)')
            plt.xlabel('UMAP Component 1')
            plt.ylabel('UMAP Component 2')
            plt.legend(title='Clusters', bbox_to_anchor=(1.05, 1), loc='upper left')
            plt.grid(True)
            plt.tight_layout() # Adjust layout to prevent labels overlapping
            plt.show()

    else:
        print("Reduced embeddings or cluster labels not found in final_state. Please run the workflow first.")

    # Now, print the documents per cluster
    if input_data is not None and cluster_labels is not None:
        print("\n--- DOCUMENTS PER CLUSTER ---")

        # Group documents by cluster label
        clustered_docs = defaultdict(list)
        for doc, label in zip(input_data, cluster_labels):
            clustered_docs[label].append(doc)

        # Display documents for each cluster
        # Sort cluster IDs for consistent output, listing noise (-1) last when present
        sorted_cluster_ids = sorted([label for label in clustered_docs.keys() if label != -1])
        if -1 in clustered_docs: # Include noise if present
            sorted_cluster_ids.append(-1)


        for cluster_id in sorted_cluster_ids:
            docs = clustered_docs[cluster_id]
            cluster_name = cluster_names.get(cluster_id, f"Cluster {cluster_id}" if cluster_id != -1 else "Noise")

            print(f"\n--- Cluster {cluster_id} ({cluster_name}) ---")
            if not docs:
                print("  No documents in this cluster.")
            else:
                for i, doc in enumerate(docs):
                    print(f"  {i+1}. {doc}")

    else:
        print("Input data or cluster labels not found in final_state. Please run the workflow first.")

else:
    print("final_state variable not found. Please run the workflow execution cell first.")

final_state variable not found. Please run the workflow execution cell first.


In [28]:
# Sample data for testing
sample_data = [
"Despite his devotion to his hometown of Salem (and its Halloween celebration), Hubie Dubois is a figure of mockery for kids and adults alike. But this year, something is going bump in the night, and it's up to Hubie to save Halloween. Good-natured but eccentric community volunteer Hubie Dubois finds himself at the center of a real murder case on Halloween night. Despite his devotion to his hometown of Salem, Massachusetts (and its legendary Halloween celebration), Hubie is a figure of mockery for kids and adults alike. We join our story on the eve of Halloween in Salem, Massachusetts a city with a witch history. Our hero is the Fifty-Four year old Hubert Schubert Dubois, who despite being a kind, generous, and overall selfless individual is treated like the town idiot and constantly belittled and bullied by the residents whether it be in the form of malicious pranks or the near-constant barrage of items being thrown at him. Sweet Hubie is generally able to rise above their insulation, but Halloween can make things difficult as his fatal flaw is a proclivity for being frightened. The only Salemite, besides his Mother, that seems to appreciate our protagonist as a worthwhile individual is Violet Valentine. Awarded the superlatives of Most Friendly, Most Popular, and Best Looking (the High School Hat Trick) and despite having married and divorced local law enforcement officer Steve Downey, can't help but look at him with the same stars in her eyes she had when they were in first grade.",

"A teenage murder witness finds himself pursued by twin assassins in the Montana wilderness with a survival expert tasked with protecting him and a forest fire threatening to consume them all. Stationed in a lonely fire lookout tower in the heart of Montana's green wilderness, guilt-ridden Hannah Faber, a daredevil smoke-jumper having a death wish, is still struggling to cope with her emotional trauma after a disastrous failure of judgement. Then, as if that weren't enough, Hannah crosses paths with utterly unprepared Connor, the young son of the forensic accountant, Owen Casserly, and catches the attention of a highly trained pair of assassins bent on silencing the boy for good. Now, to prevent them from finishing the job, Hannah must put her sharp survival skills to good use and stop the killers, who would do everything in their power to cover their tracks including setting the forest ablaze. Can Hannah and Connor escape from those who wish them dead? Hannah Faber (Angelina Jolie), a smoke-jumper (a person who is parachuted into the middle of forest fires from an airplane to douse them out), is struggling after failing to prevent the deaths of three young campers and a fellow smoke-jumper in a forest fire. She is now posted in a fire lookout tower in Park County, Montana. Hannah is in depression and tries to keep her spirits up by indulging in dangerous stunts involving smoke jumping.Owen Casserly (Jake Weber), a forensic accountant, learns about the death of his district attorney boss and his family in an apparent gas explosion; believing that their deaths were actually a contracted killing and that he is the next target, Owen goes on the run with his son, Connor (Finn Little). He intends to seek refuge with his brother-in-law, Ethan Sawyer (Jon Bernthal), a Deputy Sheriff. Owen calls Ethan and asks him to assemble a TV crew, as he wants to expose his evidence to the media. 
Hannah is Ethan's ex girlfriend, and Ethan is worried about her emotional state of mind.They are ambushed by the assassins ""Jack"" (Aidan Gillen) and ""Patrick"" (Nicholas Hoult). When Jack and Patrick reach Owen's house, he is already gone with his son. They hack into his computer and figure out that he has withdrawn $10000 in cash. They find pics of his brother-in-law who is a deputy sheriff and figure that Owen who have headed to him to seek refuge. The assassins force Owen & Connor off the road and down a cliff. Trapped in the car, Owen gives Connor the evidence against the assassins' employer. Connor flees before the assassins kill Owen.As Ethan discovers Owen's car wreck, Hannah stumbles upon Connor while out on patrol. She takes him back to the tower to contact help. The assassins meet their boss, who instructs them to hunt down and kill Connor. The boss says that Jack should assume the worst case scenario, that Owen had copies of the evidence and which are now in possession of the boy. Jack tells Patrick from this point on they kill whoever sees their faces.",

"Two friends on a road trip compete for the affections of a handsome man when their flight is redirected due to a hurricane. After countless lonely nights over a bottle of wine and ""The Bachelor"", the Seattle longtime friends, Kate, a high school English teacher, and Meg, an ambitious cosmetics saleswoman, are beginning to realise that they are going through a rough patch. To take a break, the pair will soon find themselves on an impromptu flight to Fort Lauderdale, seated next to the handsome blonde Ryan who is on his way to a friend's wedding. All of a sudden, the two best friends will get sucked into a destructive spiral of relentless competition and cut-throat one-upmanship with Ryan as the prize, especially when a Category 4 hurricane reroutes their flight to a St. Louis layover. They say all is fair in love and war; however, is Ryan worthy of Meg and Kate's years of friendship? Kate (Alexandra Daddario) and Meg (Kate Upton) are childhood friends and roommates in Seattle going through stressful times. Kate is a high school English teacher, bored of her regular curriculum and under pressure to quit by Principal Moss (Rob Corddry) who believes she should be in a different profession. Meg is a cosmetics saleswoman trying and failing to sell beauty products... illegally imported from North Korea. After a night of drinking away their stress, the adventurous Meg suggests they go on vacation to get their groove back. The reserved Kate is reluctant, but ultimately acquiesces as Meg had already booked non-refundable tickets to Fort Lauderdale using Kate's frequent flyer miles.",

"It is based on U.S. Marshal Mason Pollard who is specialized in engineering the fake deaths of witnesses that leaves no trace of their existence. U.S. Marshal Mason Pollard specializes in ""erasing"" people  faking the deaths of highrisk witnesses. With the technological advances of the last  years, the game has upgraded, and it's just another day at the office when he's assigned to Rina Kimura, a crime boss' wife who's decided to turn state's evidence. As the two flee to Cape Town, South Africa, with a team of merciless assassins on their trail, Pollard discovers he's been set up. Doublecrossed and fueled by adrenaline, he needs to be at the top of his game, or he'll be the one who's erased. Mason Pollard played by Dominic Sherwood, is a deputy who saves the life of Sugar Jax, a witness who had blown his cover and was abducted by a few thugs who wanted to kill him for testifying against them. After saving Sugar's life, Mason shoots him with a fake bullet, thereby giving him another fake death in order to create a new cover for him so that he can easily leave the state. As soon as the film sets the premise, it quickly introduces the major conflict, which involves a woman named Rina Kimura, who is gathering evidence against a crime lord named Kosta.The FBI had planted Robyn Nhi Rina in the inner circle of Kosta Kimura to gather evidence against the criminal Syndicate for which he was working. The film quickly establishes that Kosta has married Rina, though he uses his wife to physically please his potential investors so that they will hand him the money without going too much into the details. Kosta has arranged a party to get some investors for a new project, and the FBI, hiding inside a van parked outside Kosta's villa, instructs Rina to use the distraction in their favor. The government authorities want some confidential information on Kosta; therefore, Rina walks into Kosta's office and opens his safe in order to transfer the files.",

"The friends facing disastrously funny situations together and having each other's backs through the trenches to make it out of a new mess this time. Fukrey's Choocha,Hunny and Lali this time with Panditji are running a departmental store given by government but hardly have customers.They survive on small expenses by giving tips to others while Bholi is contesting the upcoming elections and wants help of Fukrey's.During opening of a public toilet people start praising for Choocha leaving Bholi aside.Bholi faces brunt from criminal Dhingra for speaking against water mafia as he is financing her party.Hunny realizes that if Bholi comes in power she will ruin Delhi and decide to contest Choocha against her and need money.Bholi's bodyguards Eddie and Bobby plan to ditch her and take Fukrey's to South Africa where their uncle Sinda has a diamond mine. Sinda is finding it difficult to locate the diamonds and Choocha with Deejachu can help him.After spending couple of days there Hunny realizes that the mine is empty and it was all planned by Bholi to keep them away from elections.The Fukrey's plan to elope to India to contest the elections but now carry a boon with them.",

"After accidentally crash-landing in 2022, time-traveling fighter pilot Adam Reed teams up with his 12-year-old self for a mission to save the future. Adam Reed, age 12 and still grieving his father's sudden death the year before, walks into his garage one night to find a wounded pilot hiding there. This mysterious pilot turns out to be the older version of himself from the future, where time travel is in its infancy. He has risked everything to come back in time on a secret mission. Together they must embark on an adventure into the past to find their father, set things right, and save the world. As the three work together, both young and grown Adam come to terms with the loss of their father and have a chance to heal the wounds that have shaped them. Adding to the challenge of the mission, the two Adams discover that they really don't like each other much, and if they are to save the world, first they need to figure out how to get along. In a dystopian 2050, fighter pilot Adam Reed (Ryan Reynolds) steals his time jet and escapes through time on a rescue mission to 2018. However, he accidentally crash-lands in 2022 instead where Adam meets his 12-year-old self (Walker Scobell) who is struggling with the recent death of their father Louis (Mark Ruffalo) (Adam's father and a brilliant quantum physicist who wrote the algorithm necessary for controlled time travel) in a car accident. Ellie Reed (Jennifer Garner), Adam's mother is alive in 2022. Adam12 gets bullied at school and when he stands up to them, he is suspended for instigating a fight. Ellie was on a date the night when Adam crash landed into 2022. Adam reveals to Adam12, that he is the same guy from the future.Adam reluctantly enlists his younger self's help to repair his jet (The future tech is coded to the user's DNA. Since Adam is injured, the jet wont clear him to fly or repair it because all his vitals are haywire). Adam sets the jet on self repair mode and tucks in for the night in the garage. 
The next day Ellie leaves for work. Adam tells Adam12 that he treats his mother badly and regrets it for the rest of his life. They go to a drug store to get supplies. Adam tries to get Adam12 to stand up to the bullies, but Adam12 gets beaten down again. Finally Adam takes on the bully and tells him to Adam12 alone. Adam meets Ellie at a bar (as a stranger eves-dropping on her conversation with the bartender) and tells her that she is doing nothing wrong with Adam12. He advises her to be more vulnerable and let Adam12 know that she is also hurting from the death of her husband. Adam reveals that he is looking for his wife, Laura (Zoe Saldaña), who was supposedly killed in a plane crash while on a time travel mission to 2018. They changed the jump logs and Adam never knew what she was chasing in 2018.Adam is being chased by Maya Sorian (Catherine Keener), the leader of the dystopian world and her lieutenant Christos (Alex Mallari Jr.) who attempt to apprehend Adam in 2022 and take him back to 2050.",

"In suburban Chicago, Great Lakes High School seniors Julie, Kayla and Sam, who have been best friends since they met the first day of their school life, are preparing for prom night. The parent that took each to school that first day - Lisa, Mitchell and Hunter respectively - had the potential to match their daughters' friendship, but it has not materialized in quite the same way. Despite brawny, sports-minded Mitchell - who nonetheless is prone to emotional tears - texting her to do things all the time with his wife Marcie, Lisa, the single mom who has devoted her entire being to only offspring Julie, quietly ignores those attempts to be buds for her own personal reasons. Julie and Mitchell don't as quietly ignore overgrown adolescent Hunter's messages, he who they have not kept in touch with since he and Sam's mother, Brenda, split. Hunter has generally been a non-existent part not only of the social circle around their daughters and their daughters' friends, but of Sam's life. Without telling their parents, the three girls have entered into a pact to lose her respective virginity on prom night, each having a different reason for wanting to lose her virginity that night. While Julie wants it to be a special night with her boyfriend of six months Austin, Kayla feels it's a rite-of-passage she will eventually go through, her prom date, Connor, as good a person as any. Sam, still in the closet to everyone, wants to feel part of the crowd and will go along with testing the straight waters with her date Chad, while she truly has her heart for Angelica, who is openly out. As the three couples head off on prom night, Lisa, Mitchell and Hunter learn of their daughters' pact, each who has his or her own reason to try and stop them. Lisa doesn't want Julie to make the same mistake she made with her first, Julie's father - who dropped out of her life shortly after their sexual encounter - that encounter which resulted in Julie. 
Lisa also does not want to lose Julie, who is the entire focus of her life. Mitchell is not emotionally ready for Kayla to do anything besides the sports related goals to be the best they have worked on together their entire lives. And Hunter, while wanting Sam to have a fun, sexual life, wants her to have it with someone of the correct gender, he deep in his heart knowing of Sam's homosexual orientation. The teen couples and Lisa, Mitchell and Hunter, as two groups, get into one misadventure after another as the parents try to track their daughters and their dates before they feel the girls will be making the biggest mistake of their young lives.",

"A former Greek diving champion and an eccentric German student take an adventurous road-trip of rediscovery from Bari to Bavaria. Victor is a twenty-something ex-diving champion now working in a furniture factory and living with his sick grandmother in a seaside town in Greece. Distraught after her death, he decides to dust off her old car and travel to Germany to visit his estranged mother. On the ferry to Italy, he meets Matthias, a talkative, inquisitive young German who is on his way home. Matthias persuades Victor to take him along and as they drive north, Victor's uptight, repressive personality clashes with the more free-spirited Matthias. But they soon find common emotional ground as their summer road trip takes unexpected turns. A tender story of self-discovery, love and family, in its many forms. Forced to reevaluate his uneventful life after the demise of a dear one in Greece, ex-championship diver Victor summons up the strength to escape. As taciturn Victor embarks on a long trip to visit his estranged mother in Bavaria, Germany, a chance encounter with persistent, confident, happy-go-lucky stranger Matthias paves the way for new horizons. But in this challenging quest for self-discovery, two worlds will collide with life-altering results. Does reconciliation await at the end of the road? Ex-championship diver Victoras (Magouliotis) whiles away his days on the Greek coast, toiling away at a factory with only his dreams, medals and grandmother for company. When a phone call summons him to Germany, a simple road trip is the answer - that is until he crosses paths with the handsome Mathias (Weil) - a free-spirited hitchhiker who tempts Victoras to take the road not taken.",

"When teenage Priscilla Beaulieu meets Elvis Presley, the man who is already a meteoric rock-and-roll superstar becomes someone entirely unexpected in private moments: a thrilling crush, an ally in loneliness, a vulnerable best friend. When teenage Priscilla Beaulieu meets Elvis Presley at a party, the man who is already a meteoric rock-and-roll superstar becomes someone entirely unexpected in private moments: a thrilling crush, an ally in loneliness, a vulnerable best friend. Through Priscilla's eyes, Sofia Coppola tells the unseen side of a great American myth in Elvis and Priscilla's long courtship and turbulent marriage, from a German army base to his dream-world estate at Graceland, in this deeply felt and ravishingly detailed portrait of love, fantasy, and fame.",

"An aspiring cellist learns that the cost of his cello is a lot more insidious than he thought. An awardwinning filmmaker Darren Lynn Bousman's new international horror film Cello, written by Turki Al Alshikh has now wrapped production. Based on the book by Turki Al Alshikh, Cello, stars Academy Academy Award, Tony, Emmy, and SAG. Award winning actor, Jeremy Irons, Tobin Bell the Saw series, Syrian actor Samer Ismail The Day I Lost My Shadow, On Borrowed Time and Saudi actress Elham Ali Ashman, Zero Distance, in the story of an aspiring cellist who learns the cost of his brandnew cello is a lot more insidious than he first thought. The film was shot on location in Saudi Arabia and the Czech Republic. Produced by Envision Media Art's Lee Nelson, the film is executive produced by Sultan Al Muheisen and Niko Ruokosuo for Alamiya and David Tish The Ice Road, Mr Church for Envision Media Arts. The film was financed by Rozam Media., which also owns all rights to the film. The topsecret project has been speculated about for weeks now on social media. Seeking distribution, Cello plans to premiere on the festival circuit in . Maxime Alexandre The Haunting of Bly Manor, Shazam served as Director of Photography, and Cello is being edited by awardwinning editor Harvey Rosenstock Homeland, Scent of a Woman, Kiss the Girls with music composition by awardwinning composer Joseph Bishara Insidious, The Conjuring. Raul Talwar serves as associate producer.",

"In New York City, a young guy falls for the daughter of his father's nemesis. The Rathcarts and the Gibbons are modern day Montagues and Capulets, two rival families who each control their own media empire in New York City. Their teenage kids ignore the feud and fall in love, despite their parents efforts to keep them apart. With hitman, corporate corruption, love, jealousy, revenge and lust, all of the characters and emotions come to a head at the wedding in this fast moving action filled love story. In Die in a Gunfight, Mary (Alexandra Daddario) and Ben (Diego Boneta) are the star-crossed black sheep of two powerful families engaged in a centuries-long feud - and they're about to reignite an affair after many years apart. Their forbidden love will trigger the dominoes that will draw in Mukul (Wade Allain-Marcus), Ben's best friend, who owes him a life debt; Terrence (Justin Chatwin), Mary's would-be protector-turned-stalker; Wayne (Travis Fimmel), an Aussie hitman with an open mind and a code of ethics; and his free-spirited girlfriend, Barbie (Emmanuelle Chriqui). As fists and bullets fly, it becomes clear that violent delights will have violent ends. As told by the Narrator, in third person omniscient:In 1864 New York City, Tarleton Rathcart and Theodore Gibbon settle their rivalry through a Gentlemen's Duel. This results in Theodore's death, initiating a feud between the families.Benjamin Gibbon often gets into fights. He seeks meaning in his life, due to depression. Ben falls in love, ceasing his trouble-making ways, but love escapes him. This causes a return to his disruptive habits. Now 27, Ben has renounced his family's wealth, but has regular communication with his parents.Mary Rathcart, was expelled from every private school in town. However, her most severe indiscretion was having fallen in love with Ben. Upon their discovery, her parents forbid her seeing Ben due to the family feud. Defiantly, she continues to see Ben. 
When her parents find out, she is sent to boarding school abroad.",

"The daughter of a man on death row falls in love with a woman on the opposing side of her family's political cause. Lucy and others demonstrate outside a prison about to execute a mentally disabled man for killing a cop. She meets Mercy, who is there with her mom and dad, a cop, to support the execution. Once home again with her big sister Martha and kid brother Ben, Lucy googles Mercy. She's a lawyer. Lucy's dad is to be executed in a few months for allegedly killing her mom 8 years ago, when Lucy was 14 and Ben a baby. Martha has been like a mom for them since. The siblings have a pro bono lawyer trying to get their dad off death row. The cute Mercy keeps meeting up with Lucy.",

"A police brigade working in the dangerous northern neighborhoods of Marseille, where the level of crime is higher than anywhere else in France. The northern districts of Marseille hold a sad record: the area with the highest crime rate in France. Driven by its hierarchy, the BAC Nord, a field brigade, constantly seeks to improve its results. In a high-risk sector, the cops adapt their methods, sometimes crossing the yellow line. Until the day the justice system turns against them.",

"One of the most watched television shows of the 1990s, this show was a true-to-life comedy series that follows the events of a group of friends. The group consists of Jerry Seinfeld, a stand-up comedian who questions every bizarre tidbit about life; George Costanza (Jason Alexander), a hard-luck member of the New York Yankees organization; Elaine Benes (Julia Louis-Dreyfus), a flashy woman and book editor who is not afraid to speak her mind; and Cosmo Kramer (Michael Richards), an extremely eccentric, lanky goofball. Another very notable member of the show is Newman (Wayne Knight), a chubby mailman, friend of Kramer, and, almost always, nemesis of Jerry. Other sources of comedy appear in the form of the parents of both Jerry and George. Jerry Seinfeld is a very successful stand-up comedian, mainly because the people around him offer an endless supply of great material. His best friend is George Costanza (Jason Alexander), a bald, whiny loser who craves the kind of success Jerry has, but is never willing to do what it takes to get it. Jerry's neighbor Cosmo Kramer (Michael Richards) often barges into his apartment and imposes onto his life. In the second episode, Jerry's former girlfriend Elaine Benes (Julia Louis-Dreyfus) comes back into his life, and the four of them are able to form a friendship together. The episodes were rarely very plot-heavy, focusing more on mundane conversations and situations that could be found during everyday life in New York City.",

"After mysteriously inheriting an abandoned coastal property, Ben and his family accidentally unleash an ancient, longdormant creature that terrorized the entire regionincluding his own ancestorsfor generations. Ben and his wife Jules are in for a surprise when they inherit an abandoned coastal property that Ben's recently deceased mother never told them about. Untouched for  years the house looms like an eerie relic over land which includes a stunning private cove and beach. The beauty and tranquility of the place leave the family with the nagging question why was this property kept a secret for so long? While Jules rummages through the house looking for answers, Ben goes to repair the buried water tank, not knowing that in doing so he is unleashing a longdormant creature, fiercely protective of its environment. In the year , Ben Adams and his wife Jules are bequeathed an enigmatic coastal property shrouded in secrecy. Uninhabited for four decades, the property boasts of an exquisite private cove and beach that only adds to its enigmatic allure. As the couple begins to unravel the mysteries surrounding the property, they unwittingly awaken a dormant entity, fiercely protective of its environment. While Jules delves deeper into the secrets hidden within the abandoned house, Ben's actions accidentally unleash a dangerous force that threatens to consume them all.",

"Pretty girl with great assets skims money from the mob. The crime boss' son takes it personally and seeks to settle the score. Meanwhile a man on a quest to scatter his brother's ashes becomes entangled in this tawdry affair. After skimming money from the mob, a beautiful young woman finds herself on the run with a kind stranger on a pilgrimage across the country to scatter his brother's ashes. In the heat of the moment, we quickly learn that her split personality comes in handy as the ruthless, dynamic side of her is unstoppable.",

"Two young lovers change the lives of their parents forever when the parents learn from the joyful experience of their kids, and allow themselves to again find their love. Two young people, somewhat wary of love, spend a summer together in Europe making a film about people's attitude towards love. Tanner and Christian realize that they're actually filming their own love story, but they have no idea that their film will ultimately save Christian's life after tragedy strikes them both.",

"Miami detectives Mike Lowrey and Marcus Burnett must face off against a mother-and-son pair of drug lords who wreak vengeful havoc on their city. Marcus and Mike have to confront new issues, career changes and midlife crises, as they join the newly created elite team AMMO of the Miami police department to take down the ruthless Armando Armas, the vicious leader of a Miami drug cartel. This third 'Bad Boys' film starts with Detectives Mike Lowrey (Will Smith) and Marcus Burnett (Martin Lawrence) speeding through the streets of Miami with other cops following them. They arrive at a hospital and run inside as they head to the room where Marcus's daughter Megan (Bianca Bethune) has given birth to a baby boy. His wife Theresa (Therese Randle) and Megan's fiance Reggie (Dennis Greene) are there as well, informing Marcus that the boy is named after him. He proudly holds his grandson. In a Mexican prison, inmate Isabel Aretas (Kate Del Castillo) is muttering an incantation that draws a guard's attention. Isabel grabs a knife off the guard and stabs her, with the other inmates jumping in and stabbing her to death. Other guards take Isabel outside, only for her son Armando (Jacob Scipio) to help her kill off all the other guards. He frees her, and they return to their home to plot revenge on behalf of her late husband Benito, with Mike being their intended final target. Armando goes to the docks to pull out a crate filled with money with intent to work with other gangsters, but these gangsters try to screw him over and take the majority shares of the money. With their guns drawn on Armando, he retaliates by stabbing everyone rapidly. He then orders the remaining men to step up if they wish to work loyally under his family. A crook named Lorenzo Rodriguez, AKA ZwayLo (Nicky Jam), joins him. Mike and Marcus's fellow officers, including Captain Howard (Joe Pantoliano), gather at a bar to celebrate Marcus becoming a grandfather. 
They encounter a woman named Rita (Paola Nunez), who has just been promoted to lieutenant and is also Mike's ex.",

"1930s Hollywood is re-evaluated through the eyes of scathing social critic and alcoholic screenwriter Herman J. Mankiewicz as he races to finish the screenplay of Citizen Kane (1941). In 1940, film studio RKO hires 24-year-old wunderkind Orson Welles under a contract that gives him full creative control of his movies. For his first film, he calls in washed-up alcoholic Herman J. Mankiewicz to write the screenplay. That film is ""Citizen Kane,"" and this is the story of how it was written. Holed up in the secluded North Verde Ranch in the middle of the Mojave Desert, bedridden American screenwriter Herman J. Mankiewicz has 60 short days to turn in the first draft of the Citizen Kane (1941) screenplay. Grappling with alcohol addiction, Herman J. Mankiewicz collaborates with Orson Welles, Hollywood's 24-year-old golden boy, and gets to work. As RKO had already given Orson Welles carte blanche, Herman J. Mankiewicz decides to draw inspiration from his days working for MGM, his friendship with newspaper magnate William Randolph Hearst, and Hearst's 20-year-old girlfriend, actress Marion Davies of the Ziegfeld Follies. Mank captures the life of Mankiewicz between essentially 1933 and 1940, when America was in the midst of the Depression and watching uneasily, but from afar, the gathering clouds of World War II. There is another presence that Fincher does well to focus on in his telling of the making of Citizen Kane, the candidacy of Left-leaning Upton Sinclair for the governorship of California, which sees the tinsel town gang up against him. 
As Sinclair is portrayed as promoting ""anti-American"" values, with MGM lending its might to a campaign that would now be described as fake news, Mank is forced to confront his own compromises and little lies. In a marvelous scene, Dance's Hearst recounts to Oldman's Mank the ""parable of the organ grinder's monkey"", just after the latter has humiliated himself and Hearst in a drunken rant about the sold idealism of the newspaper baron. While the monkey thinks it is him running the show, Hearst reminds Mank, he has to ""dance"", ""every time"", the music plays. That realization is a glimpse of the bitterness that would eventually lead Mank to finding himself jobless in Hollywood, particularly after his decision to take on Hearst with Welles backing. It would also lead him to Citizen Kane - while on the bed with a broken leg, away from friends and family and fighting for a drink despite alcohol slowly claiming him - as well as his only Oscar. As Mank explains it to a friend, ""We have got a huge responsibility, to people in the dark willingly checking their disbelief at the door.",

"On Halloween night in New Salem, Radio DJs Chilly Billy (Corey Taylor) and Paul (Zach Galligan) tell a twisted anthology of terrifying local myths that lead to a grim end for small-town residents. Bad Candy follows local Halloween stories of both myth and lessons learned in the community of New Salem. With its annual Psychotronic FM Halloween show, reenactment radio DJs Chilly Billy and Paul weave the tales of the supernatural of years gone by. In this small town it's a grimy ending for most. Radio show hosts Chilly Billy and Paul tell a series of scary stories to their listeners on Halloween night. Several stories feature a supernatural killer clown called Bad Candy, who was magically drawn to life by a young girl named Kyra. After smashing pumpkins around town, a young bully wanders into a home haunt made to look like a circus tent. Bad Candy captures the boy and transforms him into a small figure that the clown adds to a demented diorama. Because she unleashed Bad Candy the previous year, Kyra's abusive stepfather forbids her from trick-or-treating with her friends. Locked in her bedroom, Kyra uses her magic sketchpad to draw a fairy and a goblin that come to life. Kyra's stepfather angrily smashes the creatures. Distraught, Kyra draws her dead mother, who then appears as a vengeful ghost to suck out her evil ex-husband's soul.",

"Follows 17-year-old Ryota Miyagi, who struggles to accomplish his late elder brother's dream of becoming a basketball star. Haunted by the tragic loss of his elder brother, Miyagi Ryota, a teenager from Okinawa, Japan, struggles to grapple with questions about self-worth and life's purpose while immersing himself in basketball, the sport he and his brother shared a passion for. Widely viewed as underdogs, Ryota and his Shohoku High School teammates bravely take on a much more talented squad in their quest for recognition and glory. What lies ahead after their first slam dunk in the up-hill fight? Shohoku\'s ""speedster"" and point guard, Ryota Miyagi, always plays with brains and lightning speed, running circles around his opponents while feigning composure. Born and raised in Okinawa, Ryota had a brother who was three years older. Following in the footsteps of his older brother, who was a famous local player from a young age, Ryota also became addicted to basketball. In his second year of high school, Ryota plays with the Shohoku High School basketball team along with Sakuragi, Rukawa, Akagi, and Mitsui as they take the stage at the Inter-High School National Championship. And now, they are on the brink of challenging the reigning champions, Sannoh Kogyo High School.",

"A young Parisian woman meets a middle-aged American businessman who demands their clandestine relationship be based only on sex. While looking for an apartment, Jeanne, a beautiful young Parisienne, encounters Paul, a mysterious American expatriate mourning his wife's recent suicide. Instantly drawn to each other, they have a stormy, passionate affair, in which they do not reveal their names to each other. Their relationship deeply affects their lives, as Paul struggles with his wife's death and Jeanne prepares to marry her fiance, Tom, a film director making a cinema-verite documentary about her.",

"When police raid a house in El Paso, they find it full of dead Latinos, and only one survivor. Known as ""The Traveler,"" he is taken to the police station for questioning. There, he recounts tales of horrors from his life, chronicling portals leading to other worlds, mythical beings, demons and the undead; he speaks of legends from Latin America. Satanic Hispanics tells stories by top Latin filmmakers that showcase the skills of Hispanic talent, both on and off screen.",

"Anvitha Ravali Shetty is a 38 year old single independent woman working as a master chef in the United Kingdom living with her single mother. Her mother has very less time left to live as she suffers from an unknown illness so she keeps convincing her daughter to get married soon. Anvitha who is very adamant, doesn't believe in marriage and love due to her parents' failed marriage and hence chooses to be single for the rest of her life. Later, her mother requests her to take her back to India as she wants to spend her last days there. Eventually after a few months her mother dies and Anvitha starts feeling lonely. As she doesn't believe in marriage, she tries to become a mother through IUI, as she feels she too needs a person during her last days. She feels that, like her mother, she doesn't need any man to love her throughout her life and instead, her own biological child can be her supporter throughout her life just like she had her mother. With the help of her friend Kavya, they visit a fertilization center to get pregnant. Anvitha learns the norms and comes to know that random males are taken as donors, which concerns her. She tries to choose her sperm donor through interviewing other men. Anvitha as clear as she is, writes down all the qualities she needs in her donor. Finally, she meets Siddhu Polishetty who is a 33 year old guy doing stand up comedy in one restaurant. She is impressed with him at first sight and tries to talk to him about her intention, but Kavya stops her stating it might make him awkward, so she requests her to take it slow and know more about his habits and personality first. Anvitha gets convinced and gives her visiting card to him to meet her. After a few meetings both get along pretty quick and gradually Siddhu catches feelings for Anvitha, not knowing her actual intention. One day, Siddhu tries to propose her but it backfires when Anvitha reveals the real reason why she wanted him. 
The concept of getting pregnant without getting married stuns Siddhu and he gets disturbed. Siddhu and his friend, Rahul, go to a coffee shop where he unexpectedly sees Anvitha talking to another guy. This enrages him and he intervenes between them and causes a scene. Siddhu, who is now unable to let Anvitha go and unable to accept the fact that Anvitha is going to be with another man if he doesn't help her, later decides to help Anvitha. He goes to her house and begs for forgiveness and asks her to consider him. Siddhu mistakenly believes that Anvitha will conceive by the old-fashioned method of intercourse, as he doesn't know the general procedure for sperm donation. Later, Siddhu is shocked when Anvitha tells him that it is not the usual process of intercourse and instead it is IUI which they have to follow. He is shocked but agrees to the procedure, stating that he wants everything she has to be from him. He undergoes a rigorous diet and exercise regimen which is monitored by her and the doctor. Siddhu gets frustrated in the process and gets drunk. This enrages Anvitha, who begs him not to do the work if he is uninterested and confesses that she doesn't have feelings. She also tells him that she doesn't believe in love and marriage because of her parents' divorce, which happened when she was 6. Siddhu realizes his mistake and resumes his dieting. One day the procedure date is fixed, and according to the contract the donor should not be involved in her personal life and shouldn't contact her after the procedure, otherwise legal action shall be taken. This breaks Siddhu even more and he bids farewell to Anvitha, asking her a favor. He further requests Anvitha to tell her daughter or son that their father was a person who genuinely loved her instead of just some random sperm donor when they inevitably ask who their father was. 
Later Anvitha gets her results positive in her pregnancy test and goes to share the news with Siddhu but Rahul stops her by saying even she shouldn't interfere in his life as and when she wants as he is just healing from their separation since Siddhu too can't, according to the contract and asks her to leave him. Anvitha, heartbroken leaves India. While she returns to UK, she is disturbed due to Siddhu's absence and decides she needs a change of place. She moves to country side somewhere in UK and starts spending time there. Meanwhile, Siddhu successfully gives frequent stand ups and gains popularity. His father gets to know about his feelings and asks him to give assurance to Anvitha that he is beside her through thick and thin as she is scared of getting left out in a relationship. Siddhu travels every corner of UK to get to know about her. But, in the end finally meets her at the hospital at the time of her delivery and tries to talk to her, he is stopped by the staff stating only her partner can enter the labor room. Anvitha recognizes his voice and replies he is the father of the child and this moves Siddhu. He sees his name in the medical form as the father. Finally blessed with a baby the movies ends with Siddhu doing a stand up where he reveals he is married to Anvitha and leaves the stage saying he needs to change his daughter's diaper.",

"While on probation, a man begins to re-evaluate his relationship with his volatile best friend. Collin (Daveed Diggs) must make it through his final three days of probation for a chance at a new beginning. He and his troublemaking childhood best friend Miles (Rafael Casal) work as movers and are forced to watch their old neighborhood become a trendy spot in the rapidly-gentrifying Bay Area. When a life-altering event causes Collin to miss his mandatory curfew, the two men struggle to maintain their friendship as the changing social landscape exposes their differences. Explores the intersection of race and class set against the backdrop of Oakland. In West Oakland, Collin Hopkins, a Black man who works for the Commander Moving Company as a mover, is a convicted felon on the last three days of his one-year parole. Among the many restrictions contained within his parole are living in a halfway house which has its own additional rules, having curfew, not being allowed outside of Alameda County, and no possession of firearms; contravention of any of these items could extend the length of his parole, or worse, send him back to prison. Collin, whose felony was largely a matter of unexpected circumstance, wants to do the right thing and lead a straight life. And despite having made it through the first 362 days of his parole, it isn't a guarantee that he will make it to the end clear, let alone make to the end at all due to the environment in which he lives, which includes people, like him, of a lower socioeconomic standing having to adjust to the gentrification happening within the community. One of the larger threats is his association with Miles Jones, his married best friend since they were kids and his moving partner. Miles, a Caucasian, feels like he has something to prove being white and living in West Oakland, something that Collin inherently doesn't have to prove being Black. 
But what could be the biggest threat to Collin is being haunted in witnessing a white police officer shoot a fleeing Black man to death in the back late in the evening of the third to last day of his parole--being shot for no reason by the police, something that Black people like Collin face every day. Through it all, Collin tries to negotiate his relationship with Val, his girlfriend before his incarceration and the dispatcher at Commander, who is taking more outward steps to improve her life to match that gentrification which may not include associating personally with someone like Collin, especially in light of having seen the aftermath of what sent him to prison.",

"Arielle lives in Florida with her mother Janet, who has a deadbeat boyfriend named Bobby. Arielle enjoys social media and has few followers. One night, she goes to a party where she gets into a fight and beats up a girl while everyone at the party films the fight and posts it on social media. Arielle immediately gains 147 new followers. Dean Taylor is new to town and stays with his father. Arielle meets him while he is working on his car. They spend time together at a party, where Dean reveals he was in prison for armed robbery and assault, and that his parole requires him to be with his father since his mother is dead. His father is an abusive drunk. She tells him how badly she wants to get out of Florida and go to Hollywood to be famous. One day, Arielle finds all her saved money missing. Right away, she bursts into Janet's room where she and Bobby are in bed. Arielle accuses Bobby of stealing her money and attacks Bobby, but Bobby pushes her and she hits her head against the wall. Arielle leaves after threatening to kill him. She goes to Dean's home and finds his father beating him. Arielle tries to intervene but again gets hit on her head after getting thrown to the ground. Dean fights his father who ends up falling down the stairs and dies after hitting his head. Arielle and Dean begin to leave town right away but realizing that they have no money, they decide to rob a gas station with Dean's gun. Arielle live streams as Dean commits the robbery. Dean is unaware of being live streamed. Arielle sees that her social media account has three thousand followers. Dean becomes angry when he learns that Arielle has been live streaming their crimes, but Arielle says she used IP blocker and did not show their faces. She says it will lead to fame and money. Still short on money, Dean suggests they rob a dispensary next. This time, Dean films as Arielle does the robbing. With more crimes filmed, Arielle's account gains over three million followers. 
Eventually, the police identify them and Dean and Arielle see their faces on the news. Dean is angry about the social media, but Arielle is elated to be famous. On the road, they get pulled over, and Dean tells Arielle not to start anything. When the officer goes to check their IDs, Arielle exits the car and shoots the officer. Arielle and Dean go into hiding for a while and this leads to Arielle losing her subscribers and followers. Dean and Arielle get into an argument where Dean tries to make her realize about how serious of a crime she committed whereas Arielle says that she did it for him but Dean disagrees as he thinks she did it for fame. Even though they have enough money and do not need anymore, Arielle still goes off on her own to commit a robbery at a gas station and accidentally kills a customer who startled her. The clerk manages to grab a gun and shoots Arielle in the shoulder as she flees. Dean is extremely upset that everyone now knows where they are. The police have found them and thus begins a car chase and a shootout. They realize that they left their money behind. Once they get away from the police, Dean removes the bullet from her shoulder. Arielle takes pictures of her wounds for her followers, making Dean angry again. Dean blames Arielle for the whole situation since her coming over led to his father's death. Arielle slaps him repeatedly and then kisses him. A woman driving by sees Arielle and the broken down truck. She pulls over to help and Arielle pulls out her gun and says they need a place to stay for a night. Elle reveals she follows Arielle online and that she knew who they were before she pulled up. At Elle's place, Dean asks her why she follows Arielle. Elle explains that her life has not worked out and that people find them empowering. Dean tells her no one should want to be like them. The next day, Elle drives them through a police checkpoint. 
Once they are through, they give Elle a bottle of water and take her car, directing her to walk back to the gas station they passed and to report her car stolen to get the insurance reimbursement. Elle wants to go with them, but Dean refuses. Arielle takes a picture with her and posts it, telling her lots of people are going to want to talk to her after that. Arielle and Dean drive off. Later, they stop at the home of Kyle, Dean's contact. Kyle's crew plans to rob a bank but needs an extra gunmen. Arielle wants in, but Dean thinks it is a bad idea. He wants to leave and go to Mexico, but Arielle holds firm, wanting the money and the followers. Dean makes Arielle promise that if they do this job she will leave with him after. During the robbery, even though Kyle insisted on no social media, Arielle streams it to her five million followers. On the other hand, Kyle becomes upset after not being able to find the money they expected and in anger shoots a bank worker. As they are about to leave, they find cops already outside surrounding the bank. Kyle realizes it is because Arielle streamed the bank robbery and (accidentally) the name of the bank. Kyle and his crew get into a shootout with Arielle and Dean where Kyle and Dean end up shooting each other and both die. The police arrest Arielle and as they bring her out of the bank, she sees hundreds of fans and followers with signs cheering her, giving her fame that she always wanted.",

"An ex-military doctor finds herself in a deadly battle for survival when the Irish mafia seize control of the hospital at which she works. When her son is taken hostage, she is forced to rely upon her battle-hardened past and lethal skills after realizing there's no one left to save the day but her.",

"When alien invaders capture the Earth's superheroes, their kids must learn to work together to save their parents and the planet. When alien invaders kidnap Earth's superheroes, their kids are whisked away to a government safe house. But whip-smart tween Missy Moreno will stop at nothing to rescue her superhero dad, Marcus Moreno. Missy teams up with the rest of the superkids to escape their mysterious government babysitter, Ms. Granada. If they want to save their parents, they'll need to work together by using their individual powers, from elasticity to time control to predicting the future, and form an out-of-this-world team. The Heroics are the Earth's team of superheroes. Missy Moreno (YaYa Gosselin) is at home with her dad, Marcus (Pedro Pascal), a superhero and master swordsman. His power is a magnetic force emitted from his hands, which allows him to keep a constant grip on his blades. They receive word to come to the rescue of heroes, and Missy has to go with her father and be with the other Heroics' children. Earth has been attacked by aliens and the superheroes are powerless against their weapons. Marcus was summoned by Ms. Granada (the program manager of the Heroics program), even though he is no longer working full time for them (he used to be the leader of the Heroics). Granada and the US President authorize a full Heroics attack against the alien armada. The government rounds up all the Heroics' kids and keeps them in a safe location at the Heroics HQ. Missy doesn't have any powers. Missy meets the other children: Wheels (Andy Walken), Miracle Guy's son, who possesses superintelligence; Noodles (Lyon Daniels), Invisi Girl's son, who can stretch his body; Ojo (Hala Finley), Ms. Granada's stepdaughter, who is mute and communicates through art; ACapella (Lotus Blossom), Ms. 
Vox's daughter, who can move objects by singing; SloMo (Dylan Henry Lau), Blinding Fast's son, who is always in slow motion; Face Maker (Andrew Diaz), Crushing Low's son, who can make any face; Rewind (Isaiah Russell-Bailey) and Fast Forward (Akira Akbar), Crimson Legend and Red Lightning Fury's twins, who can alter time; Wild Card (Nathan Blair), TechNo's son, who has immense power but no control over it; and Guppy (Vivien Blair), Sharkboy and Lavagirl's daughter, who has ""shark strength"" and can shape water into anything. The parents are Anita Moreno (Adriana Barraza), Marcus' mother. Miracle Guy (Boyd Holbrook), a superhero with super strength. TechNo (Christian Slater), a superhero with technology powers. Lavagirl (Taylor Dooley), a superheroine with lava-based powers. Blinding Fast (Sung Kang), a superhero with super speed. Ms. Vox (Haley Reinhart), a superhero with a sonar scream. Sharkboy (JJ Dashnaw). Crimson Legend (J. Quinton Johnson), a superhero who can make solar explosions. Red Lightning Fury (Brittany Perry-Russell), a superhero with lightning powers. Invisi Girl (Jamie Perez), a superhero with invisibility powers. Crushing Low (Brently Heilbron), a superhero with super strength. The kids watch the battle between the aliens and Heroics on television, which ends with the Heroics' capture. Missy realizes that Ojo's drawings tell the future. When a drawing shows aliens breaking into the vault, the kids hatch a plan to escape. Face Maker tricks the guards into coming into the vault where Guppy subdues them, but not before one of the guards triggers an emergency lockdown. Rewind sends them back in time, stopping the alarm from being activated: Wheels stops the guard from pushing the button, and Noodles steals their security badges. Ms. Granada (Priyanka Chopra Jonas), the leader of the Heroics program, spots Missy in the hallway and seals the doors, but ACapella makes a staircase to the roof, allowing them to escape. 
Noodles secures a vehicle, and the kids escape. They land at the home of Missy's grandmother, the Heroics' trainer. She helps the kids master their powers and work as a team. The aliens arrive and Grandma sends the kids through a tunnel that leads to an empty field before she is captured. The kids spot an empty alien craft and use it to reach the mother ship. Locating a room with a purple pyramid (stuffed full of aliens who will be sent to Earth in an hour to start the takeover of the planet), they see the president, Neil Anami (Christopher McDonald), and Ms. Granada speaking. They are alien spies, sent to prepare Earth for a ""takeover"". The kids are captured and placed in a cell. Guppy makes a replica of the key from the children's tears (she needs water to shape objects, and Granada took away her water bottle) and opens the door. A fight between the kids and the aliens ensues, and Wild Card is caught and taken for questioning while the others seek the pyramid. Wheels hacks into the motherboard, but Ojo reveals that she can speak and is Supreme Commander of the aliens. Missy communicates with Wild Card in the control room; Face Maker has switched places with him (so Granada has Face Maker and not Wild Card). Granada goes after Wild Card, but not before the protective shield around the motherboard is deactivated by him. With the kids holding off the aliens, Wheels and Noodles remove the motherboard and swap it with a new one, deactivating the aliens' rocket and foiling the takeover. To the kids' surprise, their parents emerge from the rocket. Ojo reveals that she and Ms. Granada faked the ""takeover"" to train the kids to be the new Heroics. The kids reunite with their parents, and are soon ready to save the world.",

"The Ice Age Adventures of Buck Wild continues the escapades of the possum brothers Crash and Eddie who set out to find a place of their own. Together with the one-eyed weasel, Buck Wild, they face the dinosaurs who inhabit the Lost World. Desperate for some distance from their older sister Ellie, the thrill-seeking possum brothers Crash and Eddie set out to find a place of their own, but quickly find themselves trapped in a massive cave underground. They are rescued by the one-eyed, adventure-loving, dinosaur-hunting weasel, Buck Wild, and together they must face the unruly dinosaurs who inhabit the Lost World. Dreaming of moving out and making their mark on the world, even though their adopted big sister Ellie believes they are not ready to be on their own, prehistoric possum twins Crash and Eddie summon up the courage to leave the nest after the events of Ice Age: Collision Course (2016). But it is a jungle out there. And before long, the tiny explorers wind up back in the dinosaur-ridden Lost World: the subterranean realm of expert survivalist Buck Wild, the extreme one-eyed weasel from Ice Age: Dawn of the Dinosaurs (2009), and a new megalomaniac villain bent on world domination. Do the reckless brothers have what it takes to survive reality? In order to prove to the rest of the group that they are independent and reliable on their own, Crash and Eddie set out to live their own lives. However, when they reunite with Buck, they are caught in a situation that may cost them their lives.",

"A bohemian artist travels from London to Italy with his estranged son to sell the house they inherited from their late wife/mother. Jack is getting divorced and his London art gallery building belongs to his parents-in-law so his soon-to-be ex-wife gives him one month to find the money to buy the place. He contacts his estranged artist/painter dad, Robert, and they drive down to Tuscany, Italy, to fix up and sell the country house inherited from their late wife/mom (car accident). Jack hasn't been there since he was 7 and it hasn't been used the following 20 years or so. It has a fantastic view but is in dire need of repair and paint if they want to sell it. Jack heads to the nearest town and meets cute, single divorcee Natalia, who owns and runs a restaurant there. The movie opens in Flite Gallery, an art gallery in London, managed by Jack (Micheál Richardson, Liam Neeson's real life son) but owned by Ruth (Yolanda Kettle), Jack's estranged wife, and her family. Ruth tells Jack that her family is selling the gallery and seeing as how he put great effort but never any money in it, he has no say or input in this decision. Jack says that he and his father can sell the Italian house and that with that, he'll buy the gallery. Jack goes to his father, Robert (Liam Neeson), a true bohemian artist, whose London flat is also his studio, first thing in the morning, to tell him that they have to sell the house in Italy. Robert comes out of his bedroom wrapped in a robe, with a scruffy beard and mustache and as he is talking to Jack, a woman comes out of the bedroom, fully clothed. Robert introduces the woman to Jack but calls her by a different name. She curses him and storms out. Jack, not yet having told his father that he and his wife, Ruth, are divorcing, insists they go to the Italian house and just sell it. Robert tells him that the house isn't what he might imagine it to be, not having been visited for many years. 
Robert tells Jack that he couldn't understand Jack's sudden interest, who hadn't been to that house since he was seven years old.",

"Elmer Elevator searches for a captive Dragon on Wild Island and finds much more than he could ever have anticipated. An unseen older woman tells the story of her father, Elmer Elevator (Jacob Tremblay), when he was a kid. He and his mother Dela (Golshifteh Farahani) owned a candy shop in a small town, but were soon forced to close down and move away when the people of the town moved away. They move to a faraway city where they plan to open a new shop, but they eventually lose all the money they save up while getting by. Dela struggles to find work in the city and is barely able to afford her weekly rent payments to Mrs. McClaren (Rita Moreno). Elmer soon befriends a cat (Whoopi Goldberg) and eventually gets the idea to panhandle the money needed for the store, only for his mother to tell him that it is a lost cause. Angered, Elmer runs to the docks to be alone. The Cat comes to him and begins speaking to him, much to his shock. She tells him that on an island, Wild Island, beyond the city lies a dragon that can probably help him. Elmer takes the task and is transported to the island thanks to a bubbly whale named Soda (Judy Greer). Once they make it to Wild Island, Soda explains that a gorilla named Saiwa (Ian McShane) is using the dragon to keep the island from sinking, but it remains ineffective. The Island continues to sink and needs more and more frequent effort from Boris to pull it back up from the ocean. Elmer frees the dragon, a goofball named Boris (Gaten Matarazzo), and they go on an adventure in search of a tortoise named Aratuah to find out how Boris can keep the island from sinking for the next century. Boris explains that his kind has been saving the island as long as anyone can remember, and after he succeeds, he will be an ""After Dragon"". Dragons have been coming to Wild Island forever as a rite of passage. 
When dragons reach a certain age, they come to the island and save it from sinking, because the island sinks at regular intervals, and this turns them into an After Dragon: more muscular, fire-breathing and basically awesome. Boris knows that the way he is saving the Island is not the right way, as he hasn't turned into the After Dragon yet. Boris has no idea how to save the Island the right way, and believes that Aratuah would know since he was around when the last dragon, Horatio, came to save it. During this conversation, it is discovered that Boris cannot fly due to breaking his wing when Elmer saved him (they were chased by Saiwa's second-in-command Kwan, who threw a fire torch at them and ended up hitting Boris's wing, injuring him), and he reveals that he is afraid of both water and fire. The two make an agreement that Boris will help Elmer raise enough cash to buy a new store and Elmer will let the dragon go free once finished. Along the way, they encounter some of the island's inhabitants, like Cornelius the crocodile (Alan Cumming), the tiger siblings Sasha (Leighton Meester) and George (Spence Moore II), who want to eat Elmer, as he is soft and sweet (he only escapes by making them fight over a cinnamon-flavored bubble gum), and a mother rhino named Iris (Dianne Wiest), who fell into a trap with her baby rhino and wasn't able to climb out. Elmer fell into the same trap, all while trying to evade Saiwa and his monkey army. Elmer explains to Iris that he is taking Boris to Aratuah to find out how to save the Island for real. When Saiwa arrives, Iris helps hide Boris and Elmer and tells Saiwa that they went towards the summit of the Island. As Boris and Elmer make their way to Aratuah, the island continues to sink gradually. Kwan almost captures Boris, but Saiwa makes him let go, as he wants Kwan to save the monkeys from the incoming sea. Saiwa tells Kwan that one day he will lead the monkeys, and he has to learn to not abandon them, ever. 
They still have time to capture the dragon and save the island.They soon make it to Aratuah's shell, but Elmer finds out that he died, and leave as the island continues to sink. While resting on a flower, they are found by Saiwa and his forces. When Boris tries to escapes, Saiwa captures him by his injured wing, fixing it inadvertently in the process. Boris flies away with Elmer. Saiwa reveals that he knew about Aratuah's death, which angers his macaque ndincommand Kwan Chris O'Dowd, who proceeds to use a giant mushroom as a raft to leave the island, convinced that it is hopeless to save. While flying with Boris, Elmer has an epiphany; it is the roots below the island that pull it down. He manages to convince Boris to fly with all his might and while Boris manages to free the island from two of the roots, he then realizes that this is still the wrong way to save the island. He knows he has to jump into the summit fire to save it. But Elmer is not convinced and keeps telling him not to. Boris pushes Elmer off, who falls off the Island and into the water.Elmer is saved by Saiwa, who reveals that he and the other animals are evacuating the island and berates Elmer for wanting to use Boris for his own merits. He then reveals that once he found out about Aratuah's death, he was frightened about making everyone more worried and was even more worried when Boris showed up to save the island but Saiwa was weary of his goofy personality, so he lied and said he knew how to control Boris. 
Wanting to fix things, Elmer goes back to the island where Boris tells him he has found a way to save it by jumping into the fire at the summit of the Island and eventually after a lot of encouragement from Elmer he jumps in, he bursts out and magically lifts the island from the sea, finally becoming an After Dragon.After telling the animals of the island how to always let future dragons know the right way to save it, Boris takes Elmer home, passing by a surprised Kwan residing over on tangerine trees. Elmer reunites with his mother and the film ends with him embracing his new life in the city with his daughter narrating the end of the story.'",

"As the threat of giant unidentified lifeforms known as ""S-Class Species"" worsens in Japan, a silver giant appears from beyond Earth\'s atmosphere. The continued appearance of giant unidentified lifeforms known as ""S-Class Species"" has become commonplace in Japan. Conventional weapons have no effect on them. Having exhausted all other options, the Japanese Government issued the S-Class Species Suppression Protocol and formed an enforcement unit, known as the SSSP. The members chosen for the unit are: Leader Fumio Tamura (played by Hidetoshi Nishijima), Executive Strategist Shinji Kaminaga (played by Takumi Saitoh), Unparticle Physicist Taki Akihisa (played by Daiki Arioka), and Universal Biologist Yume Funaberi (Akari Hayami) As the threat of S-Class Species worsens, a silver giant appears from beyond Earth\'s atmosphere. Analyst Hiroko Asami (played by Masami Nagasawa) is newly appointed to the SSSP to deal with this giant and is partnered with Shinji Kaminaga. In Hiroko\'s report, she writes ""Ultraman (tentative name), identity unknown"".",

"Following the events at home, the Abbott family now face the terrors of the outside world. Forced to venture into the unknown, they realize the creatures that hunt by sound are not the only threats lurking beyond the sand path. With the newly acquired knowledge of the seemingly invulnerable creatures' weakness, griefstricken Evelyn Abbott finds herself on her own, with two young teens, a defenseless newborn son, and with no place to hide. Now,  days after the allout alien attack in A Quiet Place , the Abbotts summon up every last ounce of courage to leave their nowburnedtotheground farm and embark on a perilladen quest to find civilization. With this in mind, determined to expand beyond the boundaries, the resilient survivors have no other choice but to venture into eerily quiet, uncharted hostile territory, hoping for a miracle. But, this time, the enemy is everywhere. Day Lee Abbott John Krasanski drives into a town and goes to a store to buy oranges. In the store we see rocket space toys. The shop keeper is watching the news, where an extraordinary bomb in China is being reported. Lee then walks through the street to a park where his wife, Evelyn Emily Blunt, is pushing youngest son Beau on a swing. The Abbott family are there to watch a baseball game in which eldest son Marcus is playing. Lee sits next to Regan his eldest child, and only daughter, who is deaf on the stalls and says hi to his friend Emmett Cillian Murphy, who is sat behind them with his youngest son. Emmett's eldest son bats the ball and makes a home run. As he runs to the final base the crowd and Emmett are shouting 'dive' as the fielders are close. After this Emmett asks Regan how to say 'Dive' and she signals a diving motion with her hands in American Sign Language. Beau and Evelyn wish Marcus good luck as he goes up to bat. Marcus misses the first two balls, then is distracted by a large meteor in the sky. The game stops and everyone starts heading back to their cars  homes. 
Regan goes with her dad and Evelyn takes the boys. The aliens are already here though and start attacking the street and killing lots of people.nDay The film cuts to moments after the events of A Quiet Place , the Abbott family leaves their home barefoot. By this point, day , Evelyn has given birth to a baby boy, Lee was killed protecting Marcus and Regan from an alien, the house is both on fire and flooding and Evelyn has just killed several aliens with the help of Regan's cochlea implant feedback noise amplified by a speaker which weakens the aliens. Beau died on day  due to taking a space rocket toy from the store at the beginning of a Quiet Place , the sound of the toy attracted an alien who quickly attacked.Just before leaving the farm, Evelyn tells Marcus and Regan to 'stay there' whist she quietly heads back into the flooded cellar and tries to stay quiet as she swims in search of an oxygen tank. ",

"An idiosyncratic general confronts opposition from enemies, allies, and bureaucrats while leading a massive rebuilding operation in Afghanistan. A general from the U.S. is sent to Afghanistan to ""clean"" the situation up after eight years of war in the country. He finds himself among tired soldiers and disillusioned politicians eager to leave. In this situation, he feels his mission is to ""win"" the war, something deemed impossible by everyone around him. A tough general, hardened in his ways, knowledgeable about counterinsurgency operations, finds himself up against his own government and troops, confused about his own purpose, and without a clear plan of how to deal with the politics of ""nation-building"". A general with prior experience is sent to Afghanistan with a mission shrouded in vagueness, nonexistent political support and a lack of confidence from the U.S. government. However, with the team of supporters and aides that surround him, a strategic ""battle"" plan is set into motion that is questionable by those who already believe there is no hope, and flawless by those that are closest to him.",

"A young girl finds solace in her artist father and the ghost of her dead mother. Eightyearold Jenny Violet McGraw is constantly caught in the middle of the feuding between her lawyer mother Maggie Mamie Gummer and artist father Jeff Rupert Friend. She leads a lonely but imaginative life, surrounded by puppets called ""Grisly Kin,"" which are based on the works of her father. When Maggie is tragically killed in a hitandrun, Jeff and Jenny try to piece together a new life. But when Maggie's father Brian Cox sues for custody, and babysitter Samantha Madeline Brewer tries to be the new woman of the house, life in their Brooklyn town home takes a dark turn. The puppets and frightening characters come to life and Jenny is the only person who can see them. When the motives of the ghoulish creatures become clear, the lives of everyone are put very much in jeopardy. An idiotic illustrator named Jeff Vahn who lacks the proper attention and parenting skills is forced into divorce by the mother of his beautiful but temperamental daughter Jenny. His ex wife Maggie clearly has a hatred and resentment toward her husband for his narcissistic stupidity after Jenny falls from the attic board and ends up in the hospital. She makes it her mission to divorce him and not allow Jenny to be around him until he gets his life in order all despite the fact Maggie is disconnected from her daughter. Jeff reminiscing about his earlier times of marriage and fatherhood is in need of reevaluation of his life.Distraught at the possibility of loosing the divorce case and failing at his career he considers signing on to his wife agreement though on the phone Maggie is distracted by their argument and is fatally run over. Much time later the funeral is disastrous as Jeff's FatherinLaw Paul Rivers makes a scene at the funeral and Jenny has not comprehended her new circumstances. 
As Jeff gives a thoughtless eulogy, an unwanted portrait he placed lights on fire melting only the painted face of Jeff this portrait was well hated by Maggie.Through the days spirits lurk the house giving Jeff nightmares, Jenny remains distant from Her dad as Paul prepares to sue Jeff for custody. Jenny reverts to state of depression and acts in a strange manner. In hopes of getting employment Jeff reaches out to an old friend for a smaller job opportunity. In a trance Jeff draws a peculiar figure that unknown him is watching over his daughter.",

"When a young man who thought his mother was dead discovers that she may still be alive, he goes on a quest to find her. His journey takes him to a remote cabin in the woods where his mother lives in exile with a mysterious young woman. Canada, 1972. Dominic, 22 years-old, has a fetish - for himself. Nothing turns him on more than his reflection, with much of his time spent taking Polaroid selfies. When his loving grandmother dies, he discovers a deep family secret: his lesbian mother didn't die in childbirth and he has a twin brother, Daniel, raised in a remote monastery by a depraved priest, held captive against his will. The power of destiny bring back together the two beautiful, identical brothers, who, after being reunited with their mother Beatrice, are soon embroiled in a strange web of sex, revenge and redemption.",

"Baahubali-The Beginning (2015) is essentially about a tribal warrior boy, Shivudu who learns his past and awaits his destiny. The story is set in and around the fictional kingdom of Mahishmati. On the backdrop of a mighty water fall, an old yet regal lady (Sivagami) tries desperately to save a baby from a few attacking soldiers, but dies in the process. A tribal chief and his wife adopt that baby boy as their son. Egged on by curiosity and courage, the boy (Shivudu) makes a daring journey against the wishes of his mother, leaving the valley, towards the waterfall and further north into the nearby mountains. He is simply smitten by a rebel (Avantika) whose cause he willingly he takes up. This is the cause that brings him to Mahishmati Kingdom, and makes him confront his legacy. He saves the trapped and enslaved queen, later revealed to be his true mother (Devasena). He learns about his father, the benevolent and righteous (Amarendra Baahubali), and his ambitious uncle (Bhallaladeva) as told by a loyal warrior slave (Kattappa). And what is the stunning answer Kattappa gives for Shivudu's question as to who killed his father, Amarendra Baahubali?",

"Kenshin's past catches up to him causing the destruction of Akabeko Restaurant, which was Kenshin's favorite place to eat. There, he finds a note with the word ""Junchu"" on it. Kenshin Himura (Takeru Satoh) is a legendary swordsman. After the Meiji Restoration, he has stopped killing with the sword. He tries to live a peaceful life with Kaoru Kamiya who runs a swordsmanship school in the village. Things change. Akabeko Restaurant, which is Kenshin Himura's favorite place to eat, is destroyed. Kenshin Himura finds a note written ""Junchu"" there. Kenshin Himura is a legendary swordsman who has stopped killing people with his sword. Instead, he uses a dull edged sword. He tries to live a peaceful life with Kaoru who runs a swordsmanship school in their village. Kenshin's past catches up to him causing the destruction of Akabeko Restaurant, which was Kenshin's favorite place to eat. There, he finds a note with the word ""Junchu"" on it."
]

## Summary:

### Data Analysis Key Findings

*   The initial attempt to load the 'tweetnlp/TweetNLP-Sentence-Embedding-base' model failed, so the pipeline fell back to the 'all-MiniLM-L6-v2' model to generate embeddings.
*   Setting the number of clusters to 2 routed clustering through K-Means, which produced exactly two clusters and no noise points.
*   The refined cluster naming prompt with the Gemini API successfully categorized the clusters as "Humor" and "Non-Humor".
*   The UMAP visualization clearly shows a separation between the "Humor" and "Non-Humor" clusters in the 2D space.
*   The document listings confirmed that texts were grouped into the intended "Humor" and "Non-Humor" categories based on their content.
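
The model fallback described in the first finding can be sketched as a small helper that tries candidate model names in order. This is a minimal illustration, not the notebook's actual loading code: `load_first_available` and `fake_loader` are hypothetical names, and the stand-in loader only simulates a failed download so the snippet runs without network access.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def load_first_available(names: Sequence[str], loader: Callable[[str], T]) -> Tuple[str, T]:
    """Try each model name in order and return the first one that loads."""
    errors = []
    for name in names:
        try:
            return name, loader(name)
        except Exception as exc:  # any load failure moves on to the next candidate
            errors.append(f"{name}: {exc}")
    raise RuntimeError("No embedding model could be loaded: " + "; ".join(errors))


# Hypothetical stand-in loader for demonstration only. It pretends the first
# model is unavailable, mirroring the failure reported in the findings.
def fake_loader(name: str) -> str:
    if name.startswith("tweetnlp/"):
        raise OSError("model not found")
    return f"<model {name}>"


chosen_name, model = load_first_available(
    ["tweetnlp/TweetNLP-Sentence-Embedding-base", "all-MiniLM-L6-v2"],
    fake_loader,
)
print(f"Loaded embedding model: {chosen_name}")
```

In the notebook itself, passing `SentenceTransformer` as the `loader` argument would give the same try-in-order behaviour with real models.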

### Insights or Next Steps

*   The combination of 'all-MiniLM-L6-v2' embeddings, K-Means clustering with two clusters, and refined Gemini API naming was effective in separating the texts into "Humor" and "Non-Humor" categories.
*   Further evaluation with a larger and more diverse dataset could help confirm the robustness of this approach for distinguishing humor from non-humor texts.
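
One way to put a number on cluster separation in such an evaluation is the mean silhouette coefficient. The sketch below is a plain-Python illustration on made-up 2D points, not a result from this notebook; `mean_silhouette` is a hypothetical helper, and scikit-learn's `sklearn.metrics.silhouette_score` computes the same quantity on real embeddings.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def mean_silhouette(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Average silhouette coefficient over all points (Euclidean distance)."""
    # Group point indices by cluster label
    clusters: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("Silhouette needs at least two clusters.")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight, well-separated groups
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good_score = mean_silhouette(toy_points, [0, 0, 1, 1])  # labels match the groups
bad_score = mean_silhouette(toy_points, [0, 1, 0, 1])   # labels cut across the groups
print(f"good labelling: {good_score:.3f}, bad labelling: {bad_score:.3f}")
```

Scores near 1 indicate compact, well-separated clusters, while scores near 0 or below suggest overlapping or misassigned points. When OPTICS is used, points labelled -1 (noise) should be dropped before scoring so they are not treated as a cluster.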


In [34]:
from langgraph.graph import StateGraph, END
import typing
from typing import List, Optional, Dict, Any
import numpy as np
import spacy
from sentence_transformers import SentenceTransformer
import umap
from sklearn.cluster import KMeans, OPTICS
from keybert import KeyBERT
from collections import defaultdict
import google.generativeai as genai
from google.colab import userdata
import chromadb
import matplotlib.pyplot as plt
import seaborn as sns

# Shared state passed between graph nodes (redefined here so this cell is self-contained)
class GraphState(typing.TypedDict, total=False):
    """
    Represents the state of our graph.

    TypedDict fields cannot carry default values, so the class is declared with
    total=False and nodes read missing keys with .get().

    Attributes:
        input_data: Original input data (list of strings).
        cleaned_data: Data after cleaning (list of strings).
        embeddings: Embeddings of the cleaned data (numpy array).
        reduced_embeddings: Dimensionality-reduced embeddings (numpy array).
        cluster_labels: Labels assigned to each data point (list of ints).
        cluster_names: Names generated for each cluster (dictionary).
        num_clusters: Optional number of clusters for K-Means (int).
        error: Any error encountered during the process (string).
        next_node: Explicitly set next node for orchestrator routing (string).
        storage_status: Indicates if storage is complete (string).
        visualization_status: Indicates if visualization is complete (string).
    """
    input_data: List[str]
    cleaned_data: Optional[List[str]]
    embeddings: Optional[Any]
    reduced_embeddings: Optional[Any]
    cluster_labels: Optional[List[int]]
    cluster_names: Optional[Dict[int, str]]
    num_clusters: Optional[int]
    error: Optional[str]
    next_node: Optional[str]
    storage_status: Optional[str]
    visualization_status: Optional[str]


# Load spaCy model
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Downloading spaCy model 'en_core_web_sm'...")
    from spacy.cli import download
    download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

# Define the Data Cleaning Node function
def clean_data(state: GraphState) -> Dict[str, List[str]]:
    """
    Cleans the input text data using spaCy.

    Args:
        state: The current state of the graph with input_data.

    Returns:
        A dictionary updating the state with cleaned_data.
    """
    print("---DATA CLEANING NODE---")
    input_data = state.get("input_data") # Use .get() for safer access

    if input_data is None:
        print("Error: No input data available for cleaning.")
        return {"error": "No input data available for cleaning."}

    cleaned_texts = []

    for text in input_data:
        if isinstance(text, str): # Ensure the input is a string
            # Process text with spaCy
            doc = nlp(text)

            # Tokenization, lowercasing, punctuation removal, stop word removal, and lemmatization
            cleaned_text = " ".join([
                token.lemma_.lower() for token in doc
                if not token.is_punct and not token.is_stop and not token.is_space
            ])
            cleaned_texts.append(cleaned_text)
        else:
            print(f"Warning: Skipping non-string input: {text}")


    print(f"Cleaned {len(cleaned_texts)} texts.")
    print(f"First cleaned text sample: {cleaned_texts[:1]}") # Debugging print
    print(f"Returning state update: {{'cleaned_data': ...}}") # Debugging print

    return {"cleaned_data": cleaned_texts}

# Load a pre-trained sentence transformer model
# Using 'all-MiniLM-L6-v2' as a reliable general-purpose model
try:
    new_model = SentenceTransformer('all-MiniLM-L6-v2')
    print("Loaded embedding model: all-MiniLM-L6-v2")
except Exception as e:
    print(f"Error loading embedding model 'all-MiniLM-L6-v2': {e}")
    # No further fallback is configured here, so surface the original error
    raise


# Define the Vector Extraction Node function
def extract_embeddings(state: GraphState) -> Dict[str, np.ndarray]:
    """
    Extracts vector embeddings from cleaned text data using the selected pre-trained model.

    Args:
        state: The current state of the graph with cleaned_data.

    Returns:
        A dictionary updating the state with embeddings.
    """
    print("---VECTOR EXTRACTION NODE (Updated)---")
    cleaned_data = state.get("cleaned_data") # Use .get() for safer access

    if cleaned_data is None:
        print("Error: No cleaned data available for embedding.")
        return {"error": "No cleaned data available for embedding."}

    print(f"Extracting embeddings for {len(cleaned_data)} texts using the updated model...")
    # Generate embeddings using the new model
    embeddings = new_model.encode(cleaned_data)
    print("Embeddings extraction complete (Updated).")
    return {"embeddings": embeddings} # Ensure this returns a dictionary to update state

# Define the Dimensionality Reduction Node function
def reduce_dimensionality(state: GraphState) -> Dict[str, np.ndarray]:
    """
    Reduces the dimensionality of vector embeddings using UMAP.

    Args:
        state: The current state of the graph with embeddings.

    Returns:
        A dictionary updating the state with reduced_embeddings.
    """
    print("---DIMENSIONALITY REDUCTION NODE---")
    embeddings = state.get("embeddings") # Use .get() for safer access
    input_data = state.get("input_data") # Use .get() for safer access

    if embeddings is None:
        print("Error: No embeddings available for dimensionality reduction.")
        return {"error": "No embeddings available for dimensionality reduction."}

    if input_data is None:
        print("Error: Input data is missing, cannot determine dimensionality.")
        return {"error": "Input data is missing, cannot determine dimensionality."}

    n_samples = len(input_data)
    # Determine target dimensionality based on the number of samples
    if n_samples <= 500:
        n_components = 20
    elif n_samples <= 5000:
        n_components = 30
    elif n_samples <= 20000:
        n_components = 50
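    else:
        n_components = 100

    # UMAP can fail on very small datasets when n_components is close to or above
    # the number of samples, so cap the target dimensionality accordingly
    n_components = min(n_components, max(2, n_samples - 2))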

    print(f"Reducing dimensionality to {n_components} using UMAP...")
    # Initialize and fit UMAP
    reducer = umap.UMAP(n_components=n_components, random_state=42)
    reduced_embeddings = reducer.fit_transform(embeddings)

    print("Dimensionality reduction complete.")
    return {"reduced_embeddings": reduced_embeddings}


# Define the Clustering Node function
def cluster_data(state: GraphState) -> Dict[str, Any]:
    """
    Clusters the dimensionality-reduced data using K-Means with n_clusters=2
    when num_clusters is set to 2 in the state, ensuring no noise points.
    Retains OPTICS logic for other num_clusters values or when num_clusters is not provided.

    Args:
        state: The current state of the graph with reduced_embeddings and optional num_clusters.

    Returns:
        A dictionary updating the state with cluster_labels or an error message.
    """
    print("---CLUSTERING NODE---")
    reduced_embeddings = state.get("reduced_embeddings") # Use .get() for safer access
    num_clusters = state.get("num_clusters")

    if reduced_embeddings is None:
        print("Error: No reduced embeddings available for clustering.")
        return {"error": "No reduced embeddings available for clustering."}

    cluster_labels = None

    # If num_clusters is specifically 2, use K-Means on all data points
    if num_clusters == 2:
        print(f"Applying K-Means to achieve exactly {num_clusters} clusters on all data points...")
        try:
            kmeans_model = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
            cluster_labels = kmeans_model.fit_predict(reduced_embeddings)
            print("K-Means clustering complete (2 clusters, no noise).")
        except Exception as e:
            print(f"Error during K-Means clustering: {e}")
            return {"error": f"K-Means clustering failed: {e}"}

    # Otherwise, use OPTICS or K-Means on non-noise points if num_clusters is specified and not 2
    else:
        print("Performing clustering using OPTICS...")
        # Use OPTICS to find clusters and identify noise points
        optics_model = OPTICS(min_samples=10, xi=0.05, min_cluster_size=0.05)
        optics_model.fit(reduced_embeddings)

        optics_labels = optics_model.labels_
        noise_points = optics_labels == -1
        n_noise = list(optics_labels).count(-1)

        print(f"OPTICS found {len(set(optics_labels)) - (1 if -1 in optics_labels else 0)} clusters and {n_noise} noise points.")

        if num_clusters is not None and num_clusters > 0:
            print(f"Applying K-Means to achieve {num_clusters} clusters on non-noise points...")
            # Filter out noise points for K-Means
            non_noise_indices = np.where(~noise_points)[0]
            non_noise_embeddings = reduced_embeddings[non_noise_indices]

            if len(non_noise_embeddings) == 0:
                print("Warning: No non-noise points to apply K-Means.")
                # Assign -1 to all points if no non-noise points
                final_cluster_labels = np.full(len(reduced_embeddings), -1, dtype=int)
            elif num_clusters > len(non_noise_embeddings):
                print(f"Warning: Requested number of clusters ({num_clusters}) is greater than the number of non-noise points ({len(non_noise_embeddings)}). Using OPTICS labels.")
                final_cluster_labels = optics_labels
            else:
                # Apply K-Means
                kmeans_model = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
                kmeans_labels = kmeans_model.fit_predict(non_noise_embeddings)

                # Map K-Means labels back to original indices, keeping noise points as -1
                final_cluster_labels = np.full(len(reduced_embeddings), -1, dtype=int)
                for original_idx, kmeans_label in zip(non_noise_indices, kmeans_labels):
                    final_cluster_labels[original_idx] = kmeans_label
                # Report completion only when K-Means actually ran
                print("K-Means clustering complete (on non-noise points).")

            cluster_labels = final_cluster_labels

        else:
            print("Using OPTICS clustering results.")
            cluster_labels = optics_labels


    if cluster_labels is not None:
        return {"cluster_labels": cluster_labels.tolist()} # Ensure labels are a list for JSON compatibility
    else:
        return {"error": "Clustering failed to produce labels."}

# Load a pre-trained KeyBERT model (still useful for keyword suggestions if needed)
kw_model = KeyBERT()

# Configure Gemini API
try:
    # Read the key from Colab secrets (assumes a secret named GOOGLE_API_KEY exists)
    GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
    genai.configure(api_key=GOOGLE_API_KEY)
    gemini_model = genai.GenerativeModel('gemini-1.5-flash-latest') # Using a suitable model
    print("Gemini API configured successfully.")
except Exception as e:
    print(f"Error configuring Gemini API: {e}")
    gemini_model = None # Set to None if configuration fails


# Define the Cluster Naming Node function
def name_clusters(state: GraphState) -> Dict[str, Dict[int, str]]:
    """
    Names the clusters using Gemini API or KeyBERT, extracting keywords from documents within each cluster,
    aiming for semantic names and handling potential API failures.

    Args:
        state: The current state of the graph with input_data and cluster_labels.

    Returns:
        A dictionary updating the state with cluster_names.
    """
    print("---CLUSTER NAMING NODE (Refined)---")
    input_data = state.get("input_data") # Use .get() for safer access
    cluster_labels = state.get("cluster_labels") # Use .get() for safer access

    if input_data is None or cluster_labels is None:
        print("Error: Input data or cluster labels are missing for naming.")
        return {"error": "Input data or cluster labels are missing for naming."}

    # Group documents by cluster label
    clustered_docs = defaultdict(list)
    for doc, label in zip(input_data, cluster_labels):
        clustered_docs[label].append(doc)

    cluster_names = {}
    # Generate a name for each cluster
    for cluster_id, docs in clustered_docs.items():
        if cluster_id == -1:
            cluster_names[cluster_id] = "Noise"
            continue

        if not docs:
            cluster_names[cluster_id] = "Empty Cluster"
            continue

        cluster_name = None # Initialize cluster_name to None

        # Use Gemini API for naming if configured
        if gemini_model:
            print(f"Attempting to generate name for Cluster {cluster_id} using Gemini API...")
            # Take a sample of documents to avoid exceeding context window
            sample_docs = docs[:20] # Use a reasonable sample size
            # Put each sampled document on its own bulleted line so the texts stay separated
            doc_list = "\n".join(f"- {sample_doc}" for sample_doc in sample_docs)
            # Refine the prompt to be more direct about the desired output format and constraints
            prompt = f"""Analyze the following texts from a cluster and provide a concise name (maximum 5 words) that summarizes the main topic. Ensure the name is semantic and easy to understand.

Texts:
{doc_list}

Concise Name (max 5 words):"""
            try:
                response = gemini_model.generate_content(prompt)
                if response and response.text:
                    cluster_name_raw = response.text.strip()
                    # Ensure the concise name is max 5 words
                    cluster_name = " ".join(cluster_name_raw.split()[:5])
                    print(f"Generated name for Cluster {cluster_id} with Gemini API: {cluster_name}")
                else:
                    print(f"Gemini API returned an empty response for Cluster {cluster_id}. Falling back to KeyBERT.")
            except Exception as e:
                print(f"Error generating name for Cluster {cluster_id} with Gemini API: {e}. Falling back to KeyBERT.")

        # Fallback to KeyBERT if Gemini API failed or not configured
        if cluster_name is None:
            print(f"Using KeyBERT for Cluster {cluster_id}...")
            cluster_text = " ".join(docs)
            keywords = kw_model.extract_keywords(
                cluster_text,
                keyphrase_ngram_range=(1, 3),
                stop_words='english',
                use_mmr=True,
                diversity=0.7,
                top_n=5
            )
            keyword_list = [keyword[0] for keyword in keywords]
            # Combine keywords into a name, ensuring it's max 5 words
            cluster_name = " ".join(keyword_list).split()[:5]
            cluster_name = " ".join(cluster_name)

            print(f"Generated name for Cluster {cluster_id} with KeyBERT: {cluster_name}")

        cluster_names[cluster_id] = cluster_name


    print("Cluster naming complete (Refined).")
    return {"cluster_names": cluster_names} # Ensure this returns a dictionary to update state

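# Initialize ChromaDB client (in-memory for this example) with anonymized telemetry disabled.
# The setting is passed at construction time so it applies to this client.
from chromadb.config import Settings
client = chromadb.Client(Settings(anonymized_telemetry=False))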

# Define the Storage Node function
def store_results(state: GraphState) -> Dict[str, Any]:
    """
    Stores the clustered data and cluster names in ChromaDB.

    Args:
        state: The current state of the graph with input_data, cluster_labels, and cluster_names.

    Returns:
        A dictionary indicating the storage is complete or an error message.
    """
    print("---STORAGE NODE---")
    input_data = state.get("input_data") # Use .get() for safer access
    cluster_labels = state.get("cluster_labels") # Use .get() for safer access
    cluster_names = state.get("cluster_names") # Use .get() for safer access

    if input_data is None or cluster_labels is None or cluster_names is None:
        print("Error: Data, labels, or names are missing for storage.")
        return {"error": "Data, labels, or names are missing for storage."}

    # Create or get a collection
    collection_name = "topic_clusters"
    try:
        # Attempt to delete collection if it exists to avoid issues with re-adding
        client.delete_collection(name=collection_name)
        print(f"Deleted existing collection: {collection_name}")
    except Exception:
        pass # Ignore if the collection doesn't exist yet

    try:
        collection = client.create_collection(name=collection_name)
        print(f"Created collection: {collection_name}")
    except Exception as e:
        print(f"Error creating collection: {e}")
        return {"error": f"Error creating collection: {e}"}


    # Prepare data for ChromaDB
    ids = [f"doc_{i}" for i in range(len(input_data))]
    # Store original text and cluster label as metadata
    metadatas = []
    for i in range(len(input_data)):
        metadata = {"cluster_label": str(cluster_labels[i])}
        # Add cluster name to metadata if available
        if cluster_labels[i] in cluster_names:
            metadata["cluster_name"] = cluster_names[cluster_labels[i]]
        metadatas.append(metadata)


    # Add data to the collection
    # Note: no embeddings are passed to add(), so ChromaDB embeds the documents itself
    # using the collection's default embedding function.
    # For this task we only need the original text and the cluster metadata.
    # To query by similarity with the pipeline's own vectors, pass them via the `embeddings` argument.
    print(f"Adding {len(input_data)} documents to ChromaDB collection '{collection_name}'...")
    try:
        collection.add(
            documents=input_data,
            metadatas=metadatas,
            ids=ids
        )
        print("Storage complete.")
        return {"storage_status": "complete"}
    except Exception as e:
        print(f"Error adding documents to collection: {e}")
        return {"error": f"Error adding documents to collection: {e}"}

# Define the Visualization Node function
def visualize_clusters(state: GraphState) -> Dict[str, Any]:
    """
    Visualizes the clustered, dimensionality-reduced data using UMAP and cluster labels/names.

    Args:
        state: The current state of the graph with reduced_embeddings, cluster_labels, and cluster_names.

    Returns:
        A dictionary indicating the visualization is complete or an error message.
    """
    print("---VISUALIZATION NODE---")
    reduced_embeddings = state.get("reduced_embeddings") # Use .get() for safer access
    cluster_labels = state.get("cluster_labels") # Use .get() for safer access
    cluster_names = state.get("cluster_names") # Use .get() for safer access
    input_data = state.get("input_data")

    if reduced_embeddings is None or cluster_labels is None or cluster_names is None:
        print("Error: Reduced embeddings, cluster labels, or cluster names are missing for visualization.")
        return {"error": "Reduced embeddings, cluster labels, or cluster names are missing for visualization."}

    # Ensure reduced_embeddings are in a plottable format (e.g., 2D)
    if reduced_embeddings.shape[1] > 2:
        print("Warning: Reduced embeddings are not 2D. Performing UMAP again for visualization.")
        try:
            # Reduce to 2 components specifically for visualization
            reducer_2d = umap.UMAP(n_components=2, random_state=42)
            reduced_embeddings_2d = reducer_2d.fit_transform(reduced_embeddings)
        except Exception as e:
            print(f"Error reducing dimensionality to 2D for visualization: {e}")
            return {"error": f"Error reducing dimensionality to 2D for visualization: {e}"}
    else:
        reduced_embeddings_2d = reduced_embeddings

    plt.figure(figsize=(10, 8))
    scatter = sns.scatterplot(
        x=reduced_embeddings_2d[:, 0],
        y=reduced_embeddings_2d[:, 1],
        hue=cluster_labels,
        palette='viridis',
        legend='full',
        alpha=0.7
    )

    # Add cluster names as labels to the plot (optional, can be crowded)
    # You might want to add labels only for cluster centroids or a sample of points
    # For simplicity, let's use a legend with names
    handles, labels = scatter.get_legend_handles_labels()
    # Map numeric labels to cluster names for the legend
    named_labels = [cluster_names.get(int(label), f"Cluster {label}") for label in labels]
    plt.legend(handles, named_labels, title="Clusters")


    plt.title('Cluster Visualization (UMAP)')
    plt.xlabel('UMAP Component 1')
    plt.ylabel('UMAP Component 2')
    plt.grid(True)
    plt.show()

    print("Visualization complete.")
    return {"visualization_status": "complete"}


# Define the Orchestrator Node function
def orchestrator(state: GraphState) -> str:
    """
    Directs the workflow based on the current state and presence of errors.

    Args:
        state: The current state of the graph.

    Returns:
        The name of the next node or END.
    """
    print("---ORCHESTRATOR NODE---")
    error = state.get("error")
    storage_status = state.get("storage_status")
    visualization_status = state.get("visualization_status")


    # If there's an error, stop the process
    if error:
        print(f"Error detected: {error}. Stopping workflow.")
        return END # Indicate end of graph due to error

    # Determine next step based on completed steps in sequence
    # Check for the latest completed step first
    if visualization_status is None and state.get("cluster_names") is not None:
        print("Visualization not complete. Proceeding to visualization.")
        return "visualize_clusters"
    elif state.get("storage_status") is None and state.get("cluster_names") is not None:
        print("Storage not complete. Proceeding to storage.")
        return "store"
    elif state.get("cluster_names") is None and state.get("cluster_labels") is not None:
        print("Cluster names not found. Proceeding to cluster naming.")
        return "name_clusters"
    elif state.get("cluster_labels") is None and state.get("reduced_embeddings") is not None:
        print("Cluster labels not found. Proceeding to clustering.")
        return "cluster"
    elif state.get("reduced_embeddings") is None and state.get("embeddings") is not None:
        print("Reduced embeddings not found. Proceeding to dimensionality reduction.")
        return "reduce_dim"
    elif state.get("embeddings") is None and state.get("cleaned_data") is not None:
        print("Embeddings not found. Proceeding to vector extraction.")
        return "embed"
    elif state.get("cleaned_data") is None:
        print("Cleaned data not found. Proceeding to data cleaning.")
        return "clean"
    else:
        print("All processing steps complete. Ending workflow.")
        return END


# Define the LangGraph workflow
workflow = StateGraph(GraphState)

# Add nodes for each stage
workflow.add_node("clean", clean_data)
workflow.add_node("embed", extract_embeddings)
workflow.add_node("reduce_dim", reduce_dimensionality)
workflow.add_node("cluster", cluster_data)
workflow.add_node("name_clusters", name_clusters)
workflow.add_node("store", store_results)
workflow.add_node("visualize_clusters", visualize_clusters) # Add visualization node
workflow.add_node("orchestrator", orchestrator) # Add the orchestrator node


# Set the entry point
workflow.set_entry_point("orchestrator")

# Define the edges (transitions) between nodes
# Each node transitions back to the orchestrator to decide the next step
workflow.add_edge("clean", "orchestrator")
workflow.add_edge("embed", "orchestrator")
workflow.add_edge("reduce_dim", "orchestrator")
workflow.add_edge("cluster", "orchestrator")
workflow.add_edge("name_clusters", "orchestrator")
workflow.add_edge("visualize_clusters", "orchestrator") # Add edge from visualize to orchestrator
workflow.add_edge("store", "orchestrator") # After storage, go back to orchestrator to potentially end

# Add conditional edges from the orchestrator
# The orchestrator's return value (the string name of the next node or END)
# will determine which node to execute next.
workflow.add_conditional_edges(
    "orchestrator",
    orchestrator, # The orchestrator function directly returns the next node name or END
    {
        "clean": "clean",
        "embed": "embed",
        "reduce_dim": "reduce_dim",
        "cluster": "cluster",
        "name_clusters": "name_clusters",
        "visualize_clusters": "visualize_clusters", # Add visualization transition
        "store": "store",
        END: END # If orchestrator returns END, the workflow stops
    }
)

# Compile the workflow
app = workflow.compile()

# Run the workflow with the sample data, setting num_clusters to 2 as requested
inputs = {"input_data": sample_data, "num_clusters": 2}
final_state = app.invoke(inputs)

print("\n---Workflow Execution Complete---")
# You can now access the results in the final_state variable
# For example:
# print(final_state['cluster_names'])
# print(final_state['cluster_labels'])

Loaded embedding model: all-MiniLM-L6-v2
Gemini API configured successfully.
env: CHROMA_ANALYTICS=False
---ORCHESTRATOR NODE---
Cleaned data not found. Proceeding to data cleaning.


InvalidUpdateError: Expected dict, got clean
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/INVALID_GRAPH_NODE_RETURN_VALUE

In [None]:
import typing
from typing import List, Optional, Dict, Any
import numpy as np
import spacy
from sentence_transformers import SentenceTransformer
import umap
from sklearn.cluster import KMeans, OPTICS
from keybert import KeyBERT
from collections import defaultdict
import google.generativeai as genai
from google.colab import userdata
import chromadb
import matplotlib.pyplot as plt
import seaborn as sns

# Shared state schema passed between the graph nodes
class GraphState(typing.TypedDict):
    """
    Represents the state of our graph.

    Attributes:
        input_data: Original input data (list of strings).
        cleaned_data: Data after cleaning (list of strings).
        embeddings: Embeddings of the cleaned data (numpy array).
        reduced_embeddings: Dimensionality-reduced embeddings (numpy array).
        cluster_labels: Labels assigned to each data point (list of ints).
        cluster_names: Names generated for each cluster (dictionary).
        num_clusters: Optional number of clusters for K-Means (int).
        error: Any error encountered during the process (string).
        next_node: Explicitly set next node for orchestrator routing (string).
        storage_status: Indicates if storage is complete (string).
        visualization_status: Indicates if visualization is complete (string).
    """
    # TypedDict fields do not take default values; nodes read unset keys with state.get()
    input_data: List[str]
    cleaned_data: Optional[List[str]]
    embeddings: Optional[Any]
    reduced_embeddings: Optional[Any]
    cluster_labels: Optional[List[int]]
    cluster_names: Optional[Dict[int, str]]
    num_clusters: Optional[int]
    error: Optional[str]
    next_node: Optional[str]
    storage_status: Optional[str]
    visualization_status: Optional[str]


# Load spaCy model
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Downloading spaCy model 'en_core_web_sm'...")
    from spacy.cli import download
    download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

In [36]:
# Define the Data Cleaning Node function
def clean_data(state: GraphState) -> Dict[str, List[str]]:
    """
    Cleans the input text data using spaCy.

    Args:
        state: The current state of the graph with input_data.

    Returns:
        A dictionary updating the state with cleaned_data.
    """
    print("---DATA CLEANING NODE---")
    input_data = state.get("input_data") # Use .get() for safer access

    if input_data is None:
        print("Error: No input data available for cleaning.")
        return {"error": "No input data available for cleaning."}

    cleaned_texts = []

    for text in input_data:
        if isinstance(text, str): # Ensure the input is a string
            # Process text with spaCy
            doc = nlp(text)

            # Lemmatize and lowercase each token, dropping punctuation, stop words, and whitespace
            cleaned_text = " ".join([
                token.lemma_.lower() for token in doc
                if not token.is_punct and not token.is_stop and not token.is_space
            ])
            cleaned_texts.append(cleaned_text)
        else:
            # Keep an empty placeholder so cleaned_data stays index-aligned with input_data
            print(f"Warning: Non-string input replaced with an empty string: {text}")
            cleaned_texts.append("")


    print(f"Cleaned {len(cleaned_texts)} texts.")
    print(f"First cleaned text sample: {cleaned_texts[:1]}") # Debugging print
    print(f"Returning state update: {{'cleaned_data': ...}}") # Debugging print

    return {"cleaned_data": cleaned_texts}

In [37]:
# Load a pre-trained sentence transformer model
# Using 'all-MiniLM-L6-v2' as a reliable general-purpose model
try:
    new_model = SentenceTransformer('all-MiniLM-L6-v2')
    print("Loaded embedding model: all-MiniLM-L6-v2")
except Exception as e:
    print(f"Error loading embedding model 'all-MiniLM-L6-v2': {e}")
    # The pipeline cannot run without an embedding model, so re-raise
    raise


# Define the Vector Extraction Node function
def extract_embeddings(state: GraphState) -> Dict[str, np.ndarray]:
    """
    Extracts vector embeddings from cleaned text data using the selected pre-trained model.

    Args:
        state: The current state of the graph with cleaned_data.

    Returns:
        A dictionary updating the state with embeddings.
    """
    print("---VECTOR EXTRACTION NODE (Updated)---")
    cleaned_data = state.get("cleaned_data") # Use .get() for safer access

    if cleaned_data is None:
        print("Error: No cleaned data available for embedding.")
        return {"error": "No cleaned data available for embedding."}

    print(f"Extracting embeddings for {len(cleaned_data)} texts using the updated model...")
    # Generate embeddings using the new model
    embeddings = new_model.encode(cleaned_data)
    print("Embeddings extraction complete (Updated).")
    return {"embeddings": embeddings} # Ensure this returns a dictionary to update state

Loaded embedding model: all-MiniLM-L6-v2


In [38]:
# Define the Dimensionality Reduction Node function
def reduce_dimensionality(state: GraphState) -> Dict[str, np.ndarray]:
    """
    Reduces the dimensionality of vector embeddings using UMAP.

    Args:
        state: The current state of the graph with embeddings.

    Returns:
        A dictionary updating the state with reduced_embeddings.
    """
    print("---DIMENSIONALITY REDUCTION NODE---")
    embeddings = state.get("embeddings") # Use .get() for safer access
    input_data = state.get("input_data") # Use .get() for safer access

    if embeddings is None:
        print("Error: No embeddings available for dimensionality reduction.")
        return {"error": "No embeddings available for dimensionality reduction."}

    if input_data is None:
        print("Error: Input data is missing, cannot determine dimensionality.")
        return {"error": "Input data is missing, cannot determine dimensionality."}

    n_samples = len(input_data)
    # Determine target dimensionality based on the number of samples
    if n_samples <= 500:
        n_components = 20
    elif n_samples <= 5000:
        n_components = 30
    elif n_samples <= 20000:
        n_components = 50
    else:
        n_components = 100

    print(f"Reducing dimensionality to {n_components} using UMAP...")
    # Initialize and fit UMAP
    reducer = umap.UMAP(n_components=n_components, random_state=42)
    reduced_embeddings = reducer.fit_transform(embeddings)

    print("Dimensionality reduction complete.")
    return {"reduced_embeddings": reduced_embeddings}
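
The size-based rule above always asks for at least 20 components, which may be more than a small sample set can support: UMAP's default `n_neighbors` is 15, and its initialisation can fail or warn when `n_components` approaches the number of samples. A minimal sketch of capping both values by sample count follows. The helper name `choose_umap_params` and the specific caps (`n_samples - 2`, `n_samples - 1`) are assumptions for illustration, not part of the notebook.

```python
from typing import Tuple


def choose_umap_params(n_samples: int) -> Tuple[int, int]:
    """Return (n_components, n_neighbors) capped by the number of samples.

    Hypothetical helper: the tiers mirror reduce_dimensionality above,
    while the caps are illustrative choices, not UMAP requirements.
    """
    if n_samples < 4:
        raise ValueError("Need at least 4 samples for a meaningful reduction.")

    # Same size-based tiers as the notebook's node
    if n_samples <= 500:
        target = 20
    elif n_samples <= 5000:
        target = 30
    elif n_samples <= 20000:
        target = 50
    else:
        target = 100

    # Never request more dimensions or neighbours than the data can provide
    n_components = max(2, min(target, n_samples - 2))
    n_neighbors = max(2, min(15, n_samples - 1))
    return n_components, n_neighbors


print(choose_umap_params(10))    # small sample set: both values are capped
print(choose_umap_params(1000))  # larger set: the tier value is used as is
```

The returned pair could then be passed to `umap.UMAP(n_components=..., n_neighbors=..., random_state=42)` inside the node.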

In [39]:
# Define the Clustering Node function
def cluster_data(state: GraphState) -> Dict[str, Any]:
    """
    Clusters the dimensionality-reduced data using K-Means with n_clusters=2
    when num_clusters is set to 2 in the state, ensuring no noise points.
    Retains OPTICS logic for other num_clusters values or when num_clusters is not provided.

    Args:
        state: The current state of the graph with reduced_embeddings and optional num_clusters.

    Returns:
        A dictionary updating the state with cluster_labels or an error message.
    """
    print("---CLUSTERING NODE---")
    reduced_embeddings = state.get("reduced_embeddings") # Use .get() for safer access
    num_clusters = state.get("num_clusters")

    if reduced_embeddings is None:
        print("Error: No reduced embeddings available for clustering.")
        return {"error": "No reduced embeddings available for clustering."}

    cluster_labels = None

    # If num_clusters is specifically 2, use K-Means on all data points
    if num_clusters == 2:
        print(f"Applying K-Means to achieve exactly {num_clusters} clusters on all data points...")
        try:
            kmeans_model = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
            cluster_labels = kmeans_model.fit_predict(reduced_embeddings)
            print("K-Means clustering complete (2 clusters, no noise).")
        except Exception as e:
            print(f"Error during K-Means clustering: {e}")
            return {"error": f"K-Means clustering failed: {e}"}

    # Otherwise, use OPTICS or K-Means on non-noise points if num_clusters is specified and not 2
    else:
        print("Performing clustering using OPTICS...")
        # Use OPTICS to find clusters and identify noise points
        optics_model = OPTICS(min_samples=10, xi=0.05, min_cluster_size=0.05)
        optics_model.fit(reduced_embeddings)

        optics_labels = optics_model.labels_
        noise_points = optics_labels == -1
        n_noise = list(optics_labels).count(-1)

        print(f"OPTICS found {len(set(optics_labels)) - (1 if -1 in optics_labels else 0)} clusters and {n_noise} noise points.")

        if num_clusters is not None and num_clusters > 0:
            print(f"Applying K-Means to achieve {num_clusters} clusters on non-noise points...")
            # Filter out noise points for K-Means
            non_noise_indices = np.where(~noise_points)[0]
            non_noise_embeddings = reduced_embeddings[non_noise_indices]

            if len(non_noise_embeddings) == 0:
                print("Warning: No non-noise points to apply K-Means.")
                # Assign -1 to all points if no non-noise points
                final_cluster_labels = np.full(len(reduced_embeddings), -1, dtype=int)
            elif num_clusters > len(non_noise_embeddings):
                print(f"Warning: Requested number of clusters ({num_clusters}) is greater than the number of non-noise points ({len(non_noise_embeddings)}). Using OPTICS labels.")
                final_cluster_labels = optics_labels
            else:
                # Apply K-Means
                kmeans_model = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
                kmeans_labels = kmeans_model.fit_predict(non_noise_embeddings)

                # Map K-Means labels back to original indices, keeping noise points as -1
                final_cluster_labels = np.full(len(reduced_embeddings), -1, dtype=int)
                final_cluster_labels[non_noise_indices] = kmeans_labels

            print("K-Means clustering complete (on non-noise points).")
            cluster_labels = final_cluster_labels

        else:
            print("Using OPTICS clustering results.")
            cluster_labels = optics_labels


    if cluster_labels is not None:
        return {"cluster_labels": cluster_labels.tolist()} # Ensure labels are a list for JSON compatibility
    else:
        return {"error": "Clustering failed to produce labels."}
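
The step that combines the two algorithms is the label merge: OPTICS decides which points are noise, K-Means labels the rest, and the K-Means labels are written back at the original positions. The sketch below isolates that merge in plain Python so it can be checked without scikit-learn. The function name `merge_labels` and the toy label lists are made up for illustration.

```python
from typing import List, Sequence

NOISE = -1  # label OPTICS assigns to noise points


def merge_labels(optics_labels: Sequence[int], kmeans_labels: Sequence[int]) -> List[int]:
    """Place K-Means labels at the non-noise positions found by OPTICS.

    kmeans_labels must hold one label per non-noise point, in the same
    order as those points appear in optics_labels.
    """
    non_noise_indices = [i for i, label in enumerate(optics_labels) if label != NOISE]
    if len(non_noise_indices) != len(kmeans_labels):
        raise ValueError("K-Means labels must cover exactly the non-noise points.")

    merged = [NOISE] * len(optics_labels)
    for original_idx, kmeans_label in zip(non_noise_indices, kmeans_labels):
        merged[original_idx] = int(kmeans_label)
    return merged


# Toy example: points 1 and 4 are noise, the other four get K-Means labels
optics = [0, -1, 1, 1, -1, 0]
kmeans = [1, 0, 0, 1]
print(merge_labels(optics, kmeans))  # [1, -1, 0, 0, -1, 1]
```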

In [40]:
# Load a pre-trained KeyBERT model (still useful for keyword suggestions if needed)
kw_model = KeyBERT()

# Configure Gemini API
try:
    # Assuming GOOGLE_API_KEY is already set in the environment or Colab secrets
    GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
    genai.configure(api_key=GOOGLE_API_KEY)
    gemini_model = genai.GenerativeModel('gemini-1.5-flash-latest') # Using a suitable model
    print("Gemini API configured successfully.")
except Exception as e:
    print(f"Error configuring Gemini API: {e}")
    gemini_model = None # Set to None if configuration fails


# Define the Cluster Naming Node function
def name_clusters(state: GraphState) -> Dict[str, Dict[int, str]]:
    """
    Names the clusters using Gemini API or KeyBERT, extracting keywords from documents within each cluster,
    aiming for semantic names and handling potential API failures.

    Args:
        state: The current state of the graph with input_data and cluster_labels.

    Returns:
        A dictionary updating the state with cluster_names.
    """
    print("---CLUSTER NAMING NODE (Refined)---")
    input_data = state.get("input_data") # Use .get() for safer access
    cluster_labels = state.get("cluster_labels") # Use .get() for safer access

    if input_data is None or cluster_labels is None:
        print("Error: Input data or cluster labels are missing for naming.")
        return {"error": "Input data or cluster labels are missing for naming."}

    # Group documents by cluster label
    clustered_docs = defaultdict(list)
    for doc, label in zip(input_data, cluster_labels):
        clustered_docs[label].append(doc)

    cluster_names = {}
    # Generate a name for each cluster
    for cluster_id, docs in clustered_docs.items():
        if cluster_id == -1:
            cluster_names[cluster_id] = "Noise"
            continue

        if not docs:
            cluster_names[cluster_id] = "Empty Cluster"
            continue

        cluster_name = None # Initialize cluster_name to None

        # Use Gemini API for naming if configured
        if gemini_model:
            print(f"Attempting to generate name for Cluster {cluster_id} using Gemini API...")
            # Take a sample of documents to avoid exceeding context window
            sample_docs = docs[:20] # Use a reasonable sample size
            # Format the sample as a bulleted list, one document per line
            doc_list = "\n".join(f"- {doc}" for doc in sample_docs)
            # Refine the prompt to be more direct about the desired output format and constraints
            prompt = f"""Analyze the following texts from a cluster and provide a concise name (maximum 5 words) that summarizes the main topic. Ensure the name is semantic and easy to understand.

Texts:
{doc_list}

Concise Name (max 5 words):"""
            try:
                response = gemini_model.generate_content(prompt)
                if response and response.text:
                    cluster_name_raw = response.text.strip()
                    # Ensure the concise name is max 5 words
                    cluster_name = " ".join(cluster_name_raw.split()[:5])
                    print(f"Generated name for Cluster {cluster_id} with Gemini API: {cluster_name}")
                else:
                    print(f"Gemini API returned an empty response for Cluster {cluster_id}. Falling back to KeyBERT.")
            except Exception as e:
                print(f"Error generating name for Cluster {cluster_id} with Gemini API: {e}. Falling back to KeyBERT.")

        # Fallback to KeyBERT if Gemini API failed or not configured
        if cluster_name is None:
            print(f"Using KeyBERT for Cluster {cluster_id}...")
            cluster_text = " ".join(docs)
            keywords = kw_model.extract_keywords(
                cluster_text,
                keyphrase_ngram_range=(1, 3),
                stop_words='english',
                use_mmr=True,
                diversity=0.7,
                top_n=5
            )
            keyword_list = [keyword[0] for keyword in keywords]
            # Combine keywords into a name, ensuring it's max 5 words
            cluster_name = " ".join(keyword_list).split()[:5]
            cluster_name = " ".join(cluster_name)

            print(f"Generated name for Cluster {cluster_id} with KeyBERT: {cluster_name}")

        cluster_names[cluster_id] = cluster_name


    print("Cluster naming complete (Refined).")
    return {"cluster_names": cluster_names} # Ensure this returns a dictionary to update state

Gemini API configured successfully.
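
Because the KeyBERT fallback requests n-grams of up to three words, the returned keyphrases can overlap, and joining them before truncating to five words may repeat a word. A small sketch of the same truncation with duplicate words skipped is shown below. The de-duplication is an added assumption, not something the notebook does, and `build_cluster_name` and the sample phrases are hypothetical.

```python
from typing import Iterable, List, Set


def build_cluster_name(keyphrases: Iterable[str], max_words: int = 5) -> str:
    """Join keyphrases into a name of at most max_words distinct words."""
    words: List[str] = []
    seen: Set[str] = set()
    for phrase in keyphrases:
        for word in phrase.split():
            key = word.lower()
            if key in seen:
                continue  # skip words already used in the name
            seen.add(key)
            words.append(word)
            if len(words) == max_words:
                return " ".join(words)
    return " ".join(words)


# Made-up keyphrases standing in for KeyBERT output
phrases = ["machine learning model", "learning rate", "neural network training"]
print(build_cluster_name(phrases))  # machine learning model rate neural
```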


In [41]:
# Define the Visualization Node function
def visualize_clusters(state: GraphState) -> Dict[str, Any]:
    """
    Visualizes the clustered, dimensionality-reduced data using UMAP and cluster labels/names.

    Args:
        state: The current state of the graph with reduced_embeddings, cluster_labels, and cluster_names.

    Returns:
        A dictionary indicating the visualization is complete or an error message.
    """
    print("---VISUALIZATION NODE---")
    reduced_embeddings = state.get("reduced_embeddings") # Use .get() for safer access
    cluster_labels = state.get("cluster_labels") # Use .get() for safer access
    cluster_names = state.get("cluster_names") # Use .get() for safer access
    input_data = state.get("input_data")

    if reduced_embeddings is None or cluster_labels is None or cluster_names is None:
        print("Error: Reduced embeddings, cluster labels, or cluster names are missing for visualization.")
        return {"error": "Reduced embeddings, cluster labels, or cluster names are missing for visualization."}

    # Ensure reduced_embeddings are in a plottable format (e.g., 2D)
    if reduced_embeddings.shape[1] > 2:
        print("Warning: Reduced embeddings are not 2D. Performing UMAP again for visualization.")
        try:
            # Reduce to 2 components specifically for visualization
            reducer_2d = umap.UMAP(n_components=2, random_state=42)
            reduced_embeddings_2d = reducer_2d.fit_transform(reduced_embeddings)
        except Exception as e:
            print(f"Error reducing dimensionality to 2D for visualization: {e}")
            return {"error": f"Error reducing dimensionality to 2D for visualization: {e}"}
    else:
        reduced_embeddings_2d = reduced_embeddings

    plt.figure(figsize=(10, 8))
    scatter = sns.scatterplot(
        x=reduced_embeddings_2d[:, 0],
        y=reduced_embeddings_2d[:, 1],
        hue=cluster_labels,
        palette='viridis',
        legend='full',
        alpha=0.7
    )

    # Add cluster names as labels to the plot (optional, can be crowded)
    # You might want to add labels only for cluster centroids or a sample of points
    # For simplicity, let's use a legend with names
    handles, labels = scatter.get_legend_handles_labels()
    # Map numeric labels to cluster names for the legend
    named_labels = [cluster_names.get(int(label), f"Cluster {label}") for label in labels]
    plt.legend(handles, named_labels, title="Clusters")


    plt.title('Cluster Visualization (UMAP)')
    plt.xlabel('UMAP Component 1')
    plt.ylabel('UMAP Component 2')
    plt.grid(True)
    plt.show()

    print("Visualization complete.")
    return {"visualization_status": "complete"}
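
The legend mapping in the visualization node converts each legend label with `int(label)`, which raises if a label is not numeric. The sketch below shows a more tolerant version of that mapping in isolation, assuming the plotting library hands back legend labels as strings. The function name `legend_names` and the example cluster names are made up.

```python
from typing import Dict, List


def legend_names(labels: List[str], cluster_names: Dict[int, str]) -> List[str]:
    """Translate legend labels such as "0" or "-1" into cluster names."""
    named = []
    for label in labels:
        try:
            cluster_id = int(label)
        except ValueError:
            named.append(label)  # keep non-numeric entries unchanged
            continue
        named.append(cluster_names.get(cluster_id, f"Cluster {cluster_id}"))
    return named


# Illustrative names only; real names come from the naming node
names = {-1: "Noise", 0: "Space Exploration", 1: "Cooking Recipes"}
print(legend_names(["-1", "0", "1", "7"], names))
```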

In [42]:
# Initialize ChromaDB client (in-memory for this example)
client = chromadb.Client()
%env CHROMA_ANALYTICS=False

# Define the Storage Node function
def store_results(state: GraphState) -> Dict[str, Any]:
    """
    Stores the clustered data and cluster names in ChromaDB.

    Args:
        state: The current state of the graph with input_data, cluster_labels, and cluster_names.

    Returns:
        A dictionary indicating the storage is complete or an error message.
    """
    print("---STORAGE NODE---")
    input_data = state.get("input_data") # Use .get() for safer access
    cluster_labels = state.get("cluster_labels") # Use .get() for safer access
    cluster_names = state.get("cluster_names") # Use .get() for safer access

    if input_data is None or cluster_labels is None or cluster_names is None:
        print("Error: Data, labels, or names are missing for storage.")
        return {"error": "Data, labels, or names are missing for storage."}

    # Create or get a collection
    collection_name = "topic_clusters"
    try:
        # Attempt to delete collection if it exists to avoid issues with re-adding
        client.delete_collection(name=collection_name)
        print(f"Deleted existing collection: {collection_name}")
    except Exception:
        pass  # Ignore if the collection doesn't exist yet

    try:
        collection = client.create_collection(name=collection_name)
        print(f"Created collection: {collection_name}")
    except Exception as e:
        print(f"Error creating collection: {e}")
        return {"error": f"Error creating collection: {e}"}


    # Prepare data for ChromaDB
    ids = [f"doc_{i}" for i in range(len(input_data))]
    # Store original text and cluster label as metadata
    metadatas = []
    for i in range(len(input_data)):
        metadata = {"cluster_label": str(cluster_labels[i])}
        # Add cluster name to metadata if available
        if cluster_labels[i] in cluster_names:
            metadata["cluster_name"] = cluster_names[cluster_labels[i]]
        metadatas.append(metadata)


    # Add data to the collection
    # Note: no embeddings are passed here, so ChromaDB embeds the documents itself
    # using the collection's default embedding function.
    # To query by similarity in the same vector space used for clustering, pass the
    # pipeline's own vectors through the `embeddings` argument instead.
    print(f"Adding {len(input_data)} documents to ChromaDB collection '{collection_name}'...")
    try:
        collection.add(
            documents=input_data,
            metadatas=metadatas,
            ids=ids
        )
        print("Storage complete.")
        return {"storage_status": "complete"}
    except Exception as e:
        print(f"Error adding documents to collection: {e}")
        return {"error": f"Error adding documents to collection: {e}"}

env: CHROMA_ANALYTICS=False
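
The storage node indexes `cluster_labels` by position in `input_data`, so it relies on both lists having the same length. A minimal sketch of building the ids and metadata with that assumption checked up front is given below. `build_records` is a hypothetical helper, and the sample documents and names are invented for the example.

```python
from typing import Dict, List, Tuple


def build_records(
    docs: List[str], labels: List[int], names: Dict[int, str]
) -> Tuple[List[str], List[Dict[str, str]]]:
    """Build ChromaDB-style ids and metadata dicts for each document."""
    if len(docs) != len(labels):
        raise ValueError("docs and labels must have the same length.")

    ids = [f"doc_{i}" for i in range(len(docs))]
    metadatas = []
    for label in labels:
        metadata = {"cluster_label": str(label)}  # stored as a string, as in the node
        if label in names:
            metadata["cluster_name"] = names[label]
        metadatas.append(metadata)
    return ids, metadatas


ids, metadatas = build_records(
    ["rockets reach orbit", "bake the bread", "???"],
    [0, 1, -1],
    {0: "Space Exploration", 1: "Cooking Recipes", -1: "Noise"},
)
print(ids)
print(metadatas)
```

The two lists can be passed to `collection.add(documents=..., metadatas=..., ids=...)` exactly as in the node above.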


In [43]:
from langgraph.graph import StateGraph, END

# Define the Orchestrator Node function
def orchestrator(state: GraphState) -> str:
    """
    Directs the workflow based on the current state and presence of errors.

    Args:
        state: The current state of the graph.

    Returns:
        The name of the next node or END.
    """
    print("---ORCHESTRATOR NODE---")
    error = state.get("error")
    storage_status = state.get("storage_status")
    visualization_status = state.get("visualization_status")


    # If there's an error, stop the process
    if error:
        print(f"Error detected: {error}. Stopping workflow.")
        return END # Indicate end of graph due to error

    # Determine next step based on completed steps in sequence
    # Check for the latest completed step first
    if visualization_status is None and state.get("cluster_names") is not None:
        print("Visualization not complete. Proceeding to visualization.")
        return "visualize_clusters"
    elif state.get("storage_status") is None and state.get("cluster_names") is not None:
        print("Storage not complete. Proceeding to storage.")
        return "store"
    elif state.get("cluster_names") is None and state.get("cluster_labels") is not None:
        print("Cluster names not found. Proceeding to cluster naming.")
        return "name_clusters"
    elif state.get("cluster_labels") is None and state.get("reduced_embeddings") is not None:
        print("Cluster labels not found. Proceeding to clustering.")
        return "cluster"
    elif state.get("reduced_embeddings") is None and state.get("embeddings") is not None:
        print("Reduced embeddings not found. Proceeding to dimensionality reduction.")
        return "reduce_dim"
    elif state.get("embeddings") is None and state.get("cleaned_data") is not None:
        print("Embeddings not found. Proceeding to vector extraction.")
        return "embed"
    elif state.get("cleaned_data") is None:
        print("Cleaned data not found. Proceeding to data cleaning.")
        return "clean"
    else:
        print("All processing steps complete. Ending workflow.")
        return END


# Define the LangGraph workflow
workflow = StateGraph(GraphState)

# Add nodes for each stage
workflow.add_node("clean", clean_data)
workflow.add_node("embed", extract_embeddings)
workflow.add_node("reduce_dim", reduce_dimensionality)
workflow.add_node("cluster", cluster_data)
workflow.add_node("name_clusters", name_clusters)
workflow.add_node("store", store_results)
workflow.add_node("visualize_clusters", visualize_clusters) # Add visualization node
workflow.add_node("orchestrator", orchestrator) # Add the orchestrator node


# Set the entry point
workflow.set_entry_point("orchestrator")

# Define the edges (transitions) between nodes
# Each node transitions back to the orchestrator to decide the next step
workflow.add_edge("clean", "orchestrator")
workflow.add_edge("embed", "orchestrator")
workflow.add_edge("reduce_dim", "orchestrator")
workflow.add_edge("cluster", "orchestrator")
workflow.add_edge("name_clusters", "orchestrator")
workflow.add_edge("visualize_clusters", "orchestrator") # Add edge from visualize to orchestrator
workflow.add_edge("store", "orchestrator") # After storage, go back to orchestrator to potentially end

# Add conditional edges from the orchestrator
# The orchestrator's return value (the string name of the next node or END)
# will determine which node to execute next.
workflow.add_conditional_edges(
    "orchestrator",
    orchestrator, # The orchestrator function directly returns the next node name or END
    {
        "clean": "clean",
        "embed": "embed",
        "reduce_dim": "reduce_dim",
        "cluster": "cluster",
        "name_clusters": "name_clusters",
        "visualize_clusters": "visualize_clusters", # Add visualization transition
        "store": "store",
        END: END # If orchestrator returns END, the workflow stops
    }
)

# Compile the workflow
app = workflow.compile()

# Run the workflow with the sample data, setting num_clusters to 2 as requested
inputs = {"input_data": sample_data, "num_clusters": 2}
final_state = app.invoke(inputs)

print("\n---Workflow Execution Complete---")
# You can now access the results in the final_state variable
# For example:
# print(final_state['cluster_names'])
# print(final_state['cluster_labels'])

---ORCHESTRATOR NODE---
Cleaned data not found. Proceeding to data cleaning.


InvalidUpdateError: Expected dict, got clean
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/INVALID_GRAPH_NODE_RETURN_VALUE
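
The `InvalidUpdateError` above is consistent with how the graph is wired: `orchestrator` is registered with `add_node`, and LangGraph expects a node to return a dict of state updates, but the function returns a node name such as `"clean"`. One possible way around this, sketched here, is to split the role in two: a node that writes its decision into the existing `next_node` state field, and a separate routing function for `add_conditional_edges` that reads it back. The names `decide_next`, `orchestrator_node`, and `route_from_orchestrator` are hypothetical, and `END` is a local stand-in so the sketch runs without LangGraph installed; in the notebook it would be the `END` imported from `langgraph.graph`.

```python
from typing import Any, Dict

END = "__end__"  # stand-in for langgraph.graph.END (assumption for this sketch)


def decide_next(state: Dict[str, Any]) -> str:
    """Pick the first pipeline stage whose output is still missing."""
    if state.get("error"):
        return END
    if state.get("cleaned_data") is None:
        return "clean"
    if state.get("embeddings") is None:
        return "embed"
    if state.get("reduced_embeddings") is None:
        return "reduce_dim"
    if state.get("cluster_labels") is None:
        return "cluster"
    if state.get("cluster_names") is None:
        return "name_clusters"
    if state.get("visualization_status") is None:
        return "visualize_clusters"
    if state.get("storage_status") is None:
        return "store"
    return END


def orchestrator_node(state: Dict[str, Any]) -> Dict[str, str]:
    """Node: returns a dict update, which is what a graph node must return."""
    return {"next_node": decide_next(state)}


def route_from_orchestrator(state: Dict[str, Any]) -> str:
    """Router for the conditional edge: returns the name chosen by the node."""
    return state["next_node"]


# Wiring in the notebook would then look like this (not executed here):
# workflow.add_node("orchestrator", orchestrator_node)
# workflow.add_conditional_edges("orchestrator", route_from_orchestrator, {..., END: END})

state: Dict[str, Any] = {"input_data": ["a", "b"], "num_clusters": 2}
state.update(orchestrator_node(state))
print(route_from_orchestrator(state))  # clean
```

Keeping the decision in `next_node` means the node always writes a state key, and the routing function stays a pure lookup.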