Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Topology validate fails with error: dictionary changed size during iteration #131

Closed
kthare10 opened this issue Aug 18, 2022 · 5 comments
Closed
Assignees
Labels
1.3 For release 1.3

Comments

@kthare10
Copy link
Collaborator

Topology validate fails with error: dictionary changed size during iteration

Snapshot of the stack trace:

2022-08-18 13:19:29,962 - orchestrator - {slices_controller.py:66} - [waitress-0] - ERROR - dictionary changed size during iteration
Traceback (most recent call last):
  File "/usr/src/app/fabric_cf/orchestrator/core/orchestrator_handler.py", line 248, in create_slice
    topology.validate()
  File "/usr/local/lib/python3.9/site-packages/fim/user/topology.py", line 661, in validate
    s.validate_constraints(node_interfaces)
  File "/usr/local/lib/python3.9/site-packages/fim/user/network_service.py", line 231, in validate_constraints
    self.__validate_nstype_constraints(nstype, interfaces)
  File "/usr/local/lib/python3.9/site-packages/fim/user/network_service.py", line 218, in __validate_nstype_constraints
    sites.add(owner.site)
  File "/usr/local/lib/python3.9/site-packages/fim/user/node.py", line 111, in site
    return self.get_property('site') if self.__dict__.get('topo', None) is not None else None
  File "/usr/local/lib/python3.9/site-packages/fim/user/node.py", line 227, in get_property
    _, node_properties = self.topo.graph_model.get_node_properties(node_id=self.node_id)
  File "/usr/local/lib/python3.9/site-packages/fim/graph/networkx_property_graph.py", line 92, in get_node_properties
    node_props = self.storage.get_graph(self.graph_id).nodes[self._find_node(node_id=node_id)].copy()
  File "/usr/local/lib/python3.9/site-packages/fim/graph/networkx_mixin.py", line 58, in _find_node
    query_match = list(nxq.search_nodes(self.storage.get_graph(use_graph_id),
RuntimeError: dictionary changed size during iteration
@kthare10
Copy link
Collaborator Author

ospf_slice.graphml.txt

Attaching the graphml of the topology which caused the error.

@ibaldin ibaldin self-assigned this Sep 1, 2022
@ibaldin
Copy link
Contributor

ibaldin commented Sep 1, 2022

Need to see if this is still a problem.

@ibaldin ibaldin added the 1.3 For release 1.3 label Sep 8, 2022
@ibaldin
Copy link
Contributor

ibaldin commented Sep 10, 2022

@kthare10 is this error popping up in Orchestrator at graph ingestion? Is it possible somehow the model is being modified by one thread while being validated by another? This code is not thread-safe and at least at first glance this is looking to me like a race condition. Simply validating this graph after loading it doesn't create any problems.

@kthare10
Copy link
Collaborator Author

Yes, the error pops up as soon as Orchestrator is validating the incoming create slice request. No other thread is processing the slice at that point. This function is called directly from the Flask end point handler. I'll try to reproduce it and keep you posted.

    def create_slice(self, *, token: str, slice_name: str, slice_graph: str, ssh_key: str,
                     lease_end_time: str) -> List[dict]:
        if self.globals.is_maintenance_mode_on():
            raise OrchestratorException(Constants.MAINTENANCE_MODE_ERROR)

        slice_id = None
        controller = None
        new_slice_object = None
        asm_graph = None
        try:
            end_time = self.__validate_lease_end_time(lease_end_time=lease_end_time)

            controller = self.controller_state.get_management_actor()
            self.logger.debug(f"create_slice invoked for Controller: {controller}")

            # Validate the slice graph
            topology = ExperimentTopology(graph_string=slice_graph)
            topology.validate()
....

@kthare10
Copy link
Collaborator Author

kthare10 commented Oct 6, 2022

It looks the issue is because NetworkXGraphImporter class has an inbuilt NetworkXGraphStorage class which uses a singleton object __NetworkXGraphStorage which is not thread safe.

Also, in the current CF implementation, when the topology object goes out of scope it is not explicitly deleting the graph from the NetworkXGraphStorage. So this Singleton instance on production must be growing pretty big.

It has resulted in some of the operations taking as long as 250 seconds.

class NetworkXGraphImporter(ABCGraphImporter):
    """
    Importer for NetworkX graphs. Stores graphs in a single NetworkX
    object. Care is taken to disambiguate nodes when loading graphs.
    """
    READ_FORMATS = ["json_nodelink", "graphml"]

    def __init__(self, *, logger=None):
        """
        Initialize the importer setting up storage and logger
        :param logger:
        """
        self.storage = NetworkXGraphStorage(logger=logger)

class NetworkXGraphStorage:
    def __init__(self, logger=None):
        if not NetworkXGraphStorage.storage_instance:
            NetworkXGraphStorage.storage_instance = NetworkXGraphStorage.__NetworkXGraphStorage(logger=logger)

    class __NetworkXGraphStorage:
        """
        Singleton in-memory storage for graphs
        """

        def __init__(self, logger=None):
            self.graphs = nx.Graph()
            self.start_id = 1
            self.log = logger

@ibaldin ibaldin mentioned this issue Oct 6, 2022
@ibaldin ibaldin closed this as completed Oct 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1.3 For release 1.3
Projects
None yet
Development

No branches or pull requests

2 participants