<a href="https://colab.research.google.com/github/cangurosorte/Dependencias/blob/main/Detecci%C3%B3n_de_Comunidades_(con_filtro).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
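
The notebook's code cell reads a fixed set of columns from every CSV row when it builds the graph, so a file with a missing column only fails partway through section 5. A minimal sketch of checking the header up front is given here; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names that are not part of the notebook, and the column list is taken from the fields the cell accesses.

```python
# Columns that section 5 of the notebook reads from each CSV row.
REQUIRED_COLUMNS = [
    "LocalVMName", "LocalAssetID", "LocalGroups", "LocalIP", "LocalPort",
    "Protocol", "LocalProcessName", "RemoteVMName", "RemoteAssetID",
    "RemoteGroups", "RemoteIP", "RemotePort", "ConnectionCount",
]


def missing_columns(header):
    """Return the required column names absent from `header`, in REQUIRED_COLUMNS order."""
    present = set(header)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Hypothetical header that lacks the connection count.
example_header = [c for c in REQUIRED_COLUMNS if c != "ConnectionCount"]
print(missing_columns(example_header))
```

With pandas, the same helper could be called as `missing_columns(df.columns)` right after `pd.read_csv`.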

In [13]:
# -*- coding: utf-8 -*-
"""
# Notebook for Community Detection and Visualization in Networks (v3)
@luisvaldes

This Google Colab notebook analyzes a CSV file of network connections
to identify communities of virtual machines.

**Features:**
1.  **Data Loading:** Reads a CSV file with one connection per row (section 4 shows the expected columns).
2.  **Optional Filtering:** Can exclude "Unscanned Device" nodes before the analysis.
3.  **Graph Construction:** Builds a directed graph model with NetworkX.
4.  **Community Detection:** Applies the Louvain algorithm to find groups.
5.  **Report Generation:** Exports a text file with the analysis results.
6.  **Interactive Visualization:** Uses the `jaal` library to explore the graph interactively.

**Usage Instructions:**
1.  Upload your CSV file to the Colab environment.
2.  Set the `csv_file_path` variable to the name of your file.
3.  Set the name of the output file in `output_file_path`.
4.  Adjust the filtering and visualization parameters as needed.
5.  Run all cells in order.
"""

# ==============================================================================
# 1. DEPENDENCY INSTALLATION
# ==============================================================================
# Install the libraries needed for the analysis and the visualization.
# - pandas: data manipulation.
# - networkx: graph creation and analysis.
# - python-louvain: community detection algorithm.
# - jaal: interactive graph visualization.
# - matplotlib: basic plotting (used here for the community colormap).
!pip install pandas networkx python-louvain jaal matplotlib

# ==============================================================================
# 2. LIBRARY IMPORTS
# ==============================================================================
import pandas as pd
import networkx as nx
import community.community_louvain # Louvain implementation from the python-louvain package
from jaal import Jaal
from collections import defaultdict
import os
import matplotlib.cm as cm # Import colormap
import matplotlib.pyplot as plt # Import pyplot for colormaps
import numpy as np # Import numpy for color mapping

print("Libraries imported successfully.")

# ==============================================================================
# 3. PARAMETER DEFINITIONS
# ==============================================================================
# --- File Parameters ---
# TODO: Set this variable to the path of your CSV file.
# Make sure the file has been uploaded to the Colab session (the default working directory is /content).
csv_file_path = '/content/dependencias.csv'  # EXAMPLE: 'datos_de_red.csv'

# Name of the text file where the analysis report will be saved.
output_file_path = 'reporte_comunidades.txt'

# --- Filtering Parameters ---
# If True, every connection that involves a node named "Unscanned Device"
# is excluded from the analysis.
filter_unescanned = True

# --- Visualization Parameters ---
# Whether to apply the vis.js physics engine to lay out the nodes.
apply_physics = True

# Whether to draw arrows showing the direction of each connection.
show_arrows = True


# ==============================================================================
# 4. SAMPLE DATA CREATION (IF THE FILE DOES NOT EXIST)
# ==============================================================================
# This cell creates a sample CSV file if the specified one is not found,
# so the notebook can run from start to finish without errors
# even if the user has not uploaded their own file yet.

if not os.path.exists(csv_file_path):
    print(f"Warning: File '{csv_file_path}' was not found.")
    print("Creating a sample data file for the demonstration.")

    sample_data = {
        'Day': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7],
        'LocalVMName': ['WebServer1', 'WebServer1', 'DBServer1', 'AppServer1', 'WebServer2', 'AppServer2', 'DBServer2', 'AppServer2', 'WebServer1', 'MonitorServer', 'WebServer1', 'AuthServer', 'WebServer2', 'BackupServer', 'DBServer1', 'Unscanned Device'],
        'LocalAssetID': ['A01', 'A01', 'A02', 'A03', 'A04', 'A05', 'A06', 'A05', 'A01', 'A07', 'A01', 'A08', 'A04', 'A09', 'A02', 'A10'],  # 'A10' is a placeholder ID for the unscanned device
        'LocalGroups': ['Web', 'Web', 'Database', 'Apps', 'Web', 'Apps', 'Database', 'Apps', 'Web', 'Monitoring', 'Web', 'Security', 'Web', 'Backup', 'Database', 'Unknown'],
        'LocalIP': ['10.0.1.10', '10.0.1.10', '10.0.2.20', '10.0.3.30', '10.0.1.11', '10.0.3.31', '10.0.2.21', '10.0.3.31', '10.0.1.10', '10.0.4.40', '10.0.1.10', '10.0.5.50', '10.0.1.11', '10.0.6.60', '10.0.2.20', '192.168.1.100'],
        'LocalPort': [80, 443, 1433, 8080, 80, 8080, 1433, 8081, 5000, 9100, 22, 88, 443, 21, 1433, 12345],
        'Protocol': ['TCP', 'TCP', 'TCP', 'TCP', 'TCP', 'TCP', 'TCP', 'UDP', 'TCP', 'TCP', 'TCP', 'UDP', 'TCP', 'TCP', 'TCP', 'TCP'],
        'LocalProcessName': ['nginx', 'nginx', 'sqlservr', 'java', 'apache', 'java', 'sqlservr', 'app.exe', 'python_api', 'prometheus', 'sshd', 'kerberos', 'apache', 'ftp', 'sqlservr.exe', 'unknown'],
        'RemoteVMName': ['AppServer1', 'DBServer1', 'BackupServer', 'DBServer1', 'AppServer2', 'DBServer2', 'BackupServer', 'WebServer1', 'AppServer2', 'WebServer2', 'MonitorServer', 'WebServer1', 'AuthServer', 'DBServer1', 'AppServer1', 'WebServer1'],
        'RemoteAssetID': ['A03', 'A02', 'A09', 'A02', 'A05', 'A06', 'A09', 'A01', 'A05', 'A04', 'A07', 'A01', 'A08', 'A02', 'A03', 'A01'],
        'RemoteGroups': ['Apps', 'Database', 'Backup', 'Database', 'Apps', 'Database', 'Backup', 'Web', 'Apps', 'Web', 'Monitoring', 'Web', 'Security', 'Database', 'Apps', 'Web'],
        'RemoteIP': ['10.0.3.30', '10.0.2.20', '10.0.6.60', '10.0.2.20', '10.0.3.31', '10.0.2.21', '10.0.6.60', '10.0.1.10', '10.0.3.31', '10.0.1.11', '10.0.4.40', '10.0.1.10', '10.0.5.50', '10.0.2.20', '10.0.3.30', '10.0.1.10'],
        'RemotePort': [8080, 1433, 21, 1433, 8080, 1433, 21, 54321, 8081, 9100, 22, 88, 443, 21, 8080, 80],
        'ConnectionCount': [150, 20, 5, 200, 180, 210, 8, 30, 90, 40, 3, 120, 130, 2, 45, 10]
    }
    sample_df = pd.DataFrame(sample_data)
    sample_df.to_csv(csv_file_path, index=False)
    print(f"Sample file '{csv_file_path}' created.")


# ==============================================================================
# 5. DATA LOADING AND GRAPH CONSTRUCTION
# ==============================================================================
try:
    # Load the data from the CSV file
    df = pd.read_csv(csv_file_path, dtype={'LocalProcessName': str}) # Read column 7 (LocalProcessName) as text
    print(f"File '{csv_file_path}' loaded. {len(df)} records found.")

    # --- NODE FILTERING ---
    # If `filter_unescanned` is True, drop the rows whose source or
    # destination is "Unscanned Device".
    if filter_unescanned:
        initial_rows = len(df)
        unescanned_device_name = "Unscanned Device"
        df = df[
            (df['LocalVMName'].astype(str).str.strip() != unescanned_device_name) &
            (df['RemoteVMName'].astype(str).str.strip() != unescanned_device_name)
        ]
        filtered_rows = initial_rows - len(df)
        if filtered_rows > 0:
            print(f"Filtering enabled: excluded {filtered_rows} connections involving '{unescanned_device_name}'.")
        else:
            print(f"Filtering enabled: no connections with '{unescanned_device_name}' were found to exclude.")
    # Create a directed graph to represent the connections
    G = nx.DiGraph()

    # Iterate over each DataFrame row to build the graph.
    # Note: add_node updates the attributes of a node that already exists, so
    # each node keeps the values from the last row in which it appeared.
    for _, row in df.iterrows():
        # --- Source Node ---
        source_node = str(row['LocalVMName']) # Ensure node names are strings
        G.add_node(
            source_node,
            ip=row['LocalIP'],
            port=row['LocalPort'],
            process=row['LocalProcessName'],
            asset_id=row['LocalAssetID'],
            groups=row['LocalGroups'],
            type='Local'
        )

        # --- Target Node ---
        target_node = str(row['RemoteVMName']) # Ensure node names are strings
        G.add_node(
            target_node,
            ip=row['RemoteIP'],
            port=row['RemotePort'],
            asset_id=row['RemoteAssetID'],
            groups=row['RemoteGroups'],
            type='Remote'
        )

        # --- Connection (Edge) ---
        # If an edge already exists, sum the ConnectionCount
        if G.has_edge(source_node, target_node):
            G[source_node][target_node]['weight'] += row['ConnectionCount']
        else:
            G.add_edge(
                source_node,
                target_node,
                protocol=row['Protocol'],
                weight=row['ConnectionCount'] # The edge weight will be the connection count
            )

    print("Graph built successfully.")
    print(f" - Total nodes: {G.number_of_nodes()}")
    print(f" - Total connections: {G.number_of_edges()}")

except FileNotFoundError:
    print(f"Error: The file '{csv_file_path}' was not found.")
    print("Please ensure the CSV file is uploaded to Colab and the name is correct.")
except Exception as e:
    print(f"An error occurred while processing the file: {e}")


# ==============================================================================
# 6. COMMUNITY DETECTION
# ==============================================================================
# The Louvain implementation in python-louvain, a widely used community
# detection method, only accepts undirected graphs.
# Therefore, we first convert our directed graph to an undirected one.
# Edge weights ('weight', i.e. ConnectionCount) are carried over, with one
# caveat: when both u->v and v->u exist, `to_undirected()` keeps the attributes
# of only one of the two directed edges, so reciprocal counts are not summed.

if 'G' in locals() and G.number_of_nodes() > 0:
    G_undirected = G.to_undirected()

    # The `best_partition` function from the `community` library (python-louvain) is used.
    # This function implements the Louvain algorithm to find the partition
    # of nodes that maximizes the network's "modularity".
    # The `weight='weight'` parameter is crucial, as it tells the algorithm to
    # consider the strength of connections (ConnectionCount) when calculating modularity,
    # resulting in more significant communities.
    partition = community.community_louvain.best_partition(G_undirected, weight='weight')

    # Invert the dictionary to group nodes by community
    # The result will be: {community_id: [node1, node2, ...]}
    communities = defaultdict(list)
    for node, community_id in partition.items():
        communities[community_id].append(node)

    # Assign the community as an attribute to each node in the original graph for visualization
    nx.set_node_attributes(G, partition, 'community')

    print(f"Community detection completed.")
    print(f"Found {len(communities)} communities.")
else:
    print("The graph is empty. Community detection cannot be performed.")
    communities = {}


# ==============================================================================
# 7. ANALYSIS AND REPORT GENERATION
# ==============================================================================
if 'G' in locals() and communities:
    with open(output_file_path, 'w', encoding='utf-8') as f:
        f.write("="*50 + "\n")
        f.write("NETWORK COMMUNITY ANALYSIS REPORT\n")
        f.write("="*50 + "\n\n")

        # --- General Summary ---
        num_communities = len(communities)
        f.write(f"Total communities found: {num_communities}\n\n")

        # --- Detail by Community ---
        f.write("-" * 50 + "\n")
        f.write("NODE DETAILS BY COMMUNITY\n")
        f.write("-" * 50 + "\n\n")

        sorted_communities = sorted(communities.items(), key=lambda item: len(item[1]), reverse=True)

        for community_id, nodes in sorted_communities:
            f.write(f"Community ID: {community_id}\n")
            f.write(f"  - Number of Nodes: {len(nodes)}\n")
            f.write(f"  - Community Members:\n")
            for node in sorted([str(n) for n in nodes]): # Ensure nodes are strings before sorting
                f.write(f"    - {node}\n")
            f.write("\n")

        # --- Border Node Analysis ---
        border_nodes = defaultdict(set)
        for u, v in G.edges():
            comm_u = partition.get(u)
            comm_v = partition.get(v)
            if comm_u != comm_v:
                border_nodes[u].add(comm_v)
                border_nodes[v].add(comm_u)

        f.write("-" * 50 + "\n")
        f.write("NODES CONNECTING MULTIPLE COMMUNITIES (BRIDGE NODES)\n")
        f.write("-" * 50 + "\n\n")

        if not border_nodes:
            f.write("No nodes connecting different communities were found.\n")
        else:
            f.write("The following nodes have direct connections to communities other than their own:\n\n")
            sorted_border_nodes = sorted(border_nodes.items(), key=lambda item: len(item[1]), reverse=True)
            for node, external_communities in sorted_border_nodes:
                own_community = partition.get(node)
                f.write(f"Node: '{node}' (belongs to Community {own_community})\n")
                f.write(f"  - Connects to communities: {sorted(list(external_communities))}\n")
                f.write("\n")

    print(f"Report successfully saved to '{output_file_path}'.")
    print("You can download the file from the left panel in Colab.")
else:
    print("The report was not generated because no communities were detected.")


# ==============================================================================
# 8. INTERACTIVE VISUALIZATION WITH JAAL
# ==============================================================================
if 'G' in locals() and G.number_of_nodes() > 0:
    print("\nGenerating interactive visualization...")
    print("The visualization may take a few seconds to appear.")

    edge_df = nx.to_pandas_edgelist(G, source='from', target='to')

    node_df = pd.DataFrame.from_dict(dict(G.nodes(data=True)), orient='index')
    node_df['id'] = node_df.index.astype(str) # Use the index (node names) as the string 'id' column

    # Add a 'color' column to node_df based on community
    if 'community' in node_df.columns and num_communities > 0:
        # Use a modern way to get a colormap
        colors = plt.get_cmap('viridis', num_communities)
        norm = plt.Normalize(vmin=0, vmax=num_communities - 1) # Normalize community IDs to the range [0, 1]
        node_df['color'] = node_df['community'].apply(lambda x: 'rgb({},{},{})'.format(*[int(c*255) for c in colors(norm(x))[:3]]))
    else:
        print("Community information not available for coloring.")
        # Assign a default color if community info is not available
        node_df['color'] = 'rgb(128,128,128)' # Gray color


    vis_options = {
        'physics': {'enabled': apply_physics},
        'edges': {
            'arrows': {'to': {'enabled': show_arrows, 'scaleFactor': 0.7}},
            'label': {'enabled': True}  # Enable edge labels (shows weight by default)
            },
        'interaction':{'hover': True, 'tooltipDelay': 200},
         'nodes': {
            'color': {'inherit': 'false'} # Ensure node color is not inherited
        }
    }

    Jaal(edge_df=edge_df, node_df=node_df).plot(vis_opts=vis_options) # Pass vis_opts to the plot method
    print("\nVisualization ready. Interact with the graph below.")
else:
    print("The visualization cannot be generated because the graph is empty.")

Libraries imported successfully.
File '/content/dependencias.csv' loaded. 441024 records found.
Filtering enabled: excluded 410306 connections involving 'Unscanned Device'.
Graph built successfully.
 - Total nodes: 224
 - Total connections: 1090
Community detection completed.
Found 10 communities.
Report successfully saved to 'reporte_comunidades.txt'.
You can download the file from the left panel in Colab.

Generating interactive visualization...
The visualization may take a few seconds to appear.
Parsing the data...Done


<IPython.core.display.Javascript object>


Visualization ready. Interact with the graph below.
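
Section 6 converts the directed graph with `G.to_undirected()` before running Louvain. According to the NetworkX documentation, when both (u, v) and (v, u) exist, the undirected edge takes its attributes from the directed edges in arbitrary order, so the two `weight` values are not added together. One possible alternative, sketched here under that assumption, is to sum reciprocal weights explicitly. `symmetrize_weights` and the sample edges are hypothetical, and the sketch works on plain dictionaries rather than on the notebook's graph.

```python
from collections import defaultdict


def symmetrize_weights(directed_edges):
    """Collapse (u, v) and (v, u) into one undirected edge whose weight is their sum."""
    undirected = defaultdict(int)
    for (u, v), weight in directed_edges.items():
        key = tuple(sorted((u, v)))  # same key for both directions
        undirected[key] += weight
    return dict(undirected)


# Hypothetical directed edges: {(source, target): ConnectionCount}
edges = {
    ("WebServer1", "AppServer1"): 150,
    ("AppServer1", "WebServer1"): 45,
    ("AppServer1", "DBServer1"): 200,
}
undirected_edges = symmetrize_weights(edges)
for (u, v), weight in sorted(undirected_edges.items()):
    print(u, v, weight)
```

The resulting triples could then be loaded into an `nx.Graph` with `add_weighted_edges_from` and passed to `best_partition` in place of `G_undirected`.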


In [16]:
display(node_df.head())

Unnamed: 0,ip,port,process,asset_id,groups,type,community,id,color
srwlssatp04.suramericana.com.co,10.201.30.22,0,,projects/190611524863/locations/us-central1/as...,todas-las-vm,Remote,0,srwlssatp04.suramericana.com.co,"rgb(68,1,84)"
srwlssatp05.suramericana.com.co,10.201.30.24,0,,projects/190611524863/locations/us-central1/as...,todas-las-vm,Remote,0,srwlssatp05.suramericana.com.co,"rgb(68,1,84)"
epsohsappp01.suramericana.com.co,10.203.16.140,0,,projects/190611524863/locations/us-central1/as...,todas-las-vm,Remote,1,epsohsappp01.suramericana.com.co,"rgb(71,39,119)"
epsohsappp02.suramericana.com.co,10.203.16.141,0,,projects/190611524863/locations/us-central1/as...,todas-las-vm,Remote,1,epsohsappp02.suramericana.com.co,"rgb(71,39,119)"
sradmappp01.suramericana.com.co,10.203.21.71,7001,,projects/190611524863/locations/us-central1/as...,todas-las-vm,Local,1,sradmappp01.suramericana.com.co,"rgb(71,39,119)"
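
The `type` column above records a single role per node. Section 5 calls `G.add_node` for both endpoints of every row, and `add_node` updates the attributes of a node that already exists, so a machine that appears as both source and destination is labeled with whichever role was written last. A minimal sketch of collecting every role instead is given here; the rows and the `collect_roles` helper are hypothetical.

```python
from collections import defaultdict


def collect_roles(rows):
    """Map each VM name to the set of roles ('Local', 'Remote') it appears in."""
    roles = defaultdict(set)
    for local_vm, remote_vm in rows:
        roles[local_vm].add("Local")
        roles[remote_vm].add("Remote")
    return dict(roles)


# Hypothetical (LocalVMName, RemoteVMName) pairs.
rows = [
    ("WebServer1", "AppServer1"),
    ("AppServer1", "DBServer1"),
    ("DBServer1", "BackupServer"),
]
roles = collect_roles(rows)
for name in sorted(roles):
    print(name, sorted(roles[name]))
```

A joined string such as `"Local,Remote"` could then be stored as the node's `type` attribute so the table shows both roles.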
