# mapping out inheritance structures in yt

(`pyenv activate yt_dev`)



a yt frontend: a `Dataset` class that implements some "abstract" classes:

```
class GadgetDataset(SPHDataset):
    _index_class: Type[Index] = GadgetBinaryIndex
    _file_class: Type[ParticleFile] = GadgetBinaryFile
    _field_info_class: Type[FieldInfoContainer] = GadgetFieldInfo
```

The `_index_class` has another attached "abstract" class, `_index_class.io`, that each frontend implements; in this case, `IOHandlerGadgetHDF5`. Each of these classes has methods that may or may not need to be rewritten for each frontend. At this stage there is lots of code duplication: how to identify code structure that could be better abstracted?

Help simplify writing new frontends, daskification...


Concentrating on the most dask-relevant part of this, the `io` classes, want to know how classes implement specific methods. e.g., `io._read_particle_selection` and `io._read_particle_coords`... 


# 1. identifying child classes

In [None]:
# looking for children:
from yt.utilities.io_handler import BaseIOHandler, BaseParticleIOHandler
BaseIOHandler.__subclasses__()  # only the direct subclasses of BaseIOHandler

In [None]:
BaseParticleIOHandler.__subclasses__()

In [None]:
from yt.frontends.sph.io import IOHandlerSPH
IOHandlerSPH.__subclasses__()

So, need to assemble recursively! 

In [None]:
all_subclasses = []
def subclasses_assemble(parent):
    for child in parent.__subclasses__(): 
        all_subclasses.append(child)  # add the class itself
        subclasses_assemble(child)  # continue downward... 
    
subclasses_assemble(BaseParticleIOHandler)
all_subclasses
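A variant of the recursion above that returns its result instead of appending to a global list, sketched on a hypothetical toy hierarchy (the `Toy*` classes are made up and are not yt classes). It also skips classes it has already seen, which matters once multiple inheritance is involved:

```python
def all_subclasses(parent):
    """Return every direct and indirect subclass of `parent`, each listed once."""
    found = []
    for child in parent.__subclasses__():
        if child not in found:
            found.append(child)
        # continue downward, skipping anything reached via another branch
        for descendant in all_subclasses(child):
            if descendant not in found:
                found.append(descendant)
    return found


# hypothetical toy hierarchy standing in for the IO handlers
class ToyBase:
    pass


class ToyParticle(ToyBase):
    pass


class ToySPH(ToyParticle):
    pass


class ToyGadget(ToySPH):
    pass


class ToyMixed(ToySPH, ToyParticle):  # reachable through two parents
    pass


names = sorted(cls.__name__ for cls in all_subclasses(ToyBase))
print(names)
```

`ToyMixed` shows up under both `ToySPH` and `ToyParticle`, so without the membership checks it would be listed twice.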

# 2. inspecting source code

In [None]:
import inspect
from yt.frontends.gadget.io import IOHandlerGadgetBinary

inspect.getsourcefile(IOHandlerGadgetBinary._read_particle_coords)

In [None]:
inspect.getsourcefile(IOHandlerGadgetBinary._read_particle_selection)

In [None]:
inspect.getsourcelines(IOHandlerGadgetBinary._read_particle_selection)
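The file and line number say where the code lives, but not directly which class in the hierarchy supplies it. A minimal sketch of one way to find that, by walking the method resolution order and checking each class's own namespace. The `Toy*` handler classes and the helper name `defining_class` are hypothetical:

```python
def defining_class(cls, name):
    """Return the first class in cls.__mro__ whose own namespace holds `name`."""
    for klass in cls.__mro__:
        if name in vars(klass):
            return klass
    return None


# hypothetical stand-ins for a base handler and two frontends
class ToyHandler:
    def _read_particle_coords(self):
        return "base"

    def _read_particle_selection(self):
        return "base"


class ToySPHHandler(ToyHandler):
    def _read_particle_coords(self):  # overridden here
        return "sph"


class ToyGadgetHandler(ToySPHHandler):
    pass  # inherits everything


coords_owner = defining_class(ToyGadgetHandler, "_read_particle_coords")
selection_owner = defining_class(ToyGadgetHandler, "_read_particle_selection")
print(coords_owner.__name__, selection_owner.__name__)
```

Here `ToyGadgetHandler` gets `_read_particle_coords` from `ToySPHHandler` and `_read_particle_selection` from `ToyHandler`.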

In [None]:
import yt
import inspect
import ipywidgets
import traitlets

from IPython.display import Markdown, display
import textwrap
import collections.abc

In [None]:
base_class = BaseParticleIOHandler

all_subclasses = []
def subclasses_assemble(parent):
    for child in parent.__subclasses__(): 
        all_subclasses.append(child)  # add the class itself
        subclasses_assemble(child)  # continue downward... 
    
subclasses_assemble(base_class)

# sort classes alphabetically by name
sorted_subclasses = sorted(all_subclasses, key=lambda cls: cls.__name__)
sorted_subclasses

class_dropdown = ipywidgets.Dropdown(options=[(_.__name__, _) for _ in sorted_subclasses])
func_dropdown = ipywidgets.Dropdown(options=[_ for _ in dir(base_class) if not _.startswith("__")])
defined_at = ipywidgets.HTML()
source = ipywidgets.Output(layout=ipywidgets.Layout(width="100%", height="50em"))


def update_class(event):
    current_func = func_dropdown.value
    func_dropdown.options = [_ for _ in dir(class_dropdown.value) if not _.startswith("__")]
    if current_func in func_dropdown.options:
        func_dropdown.value = current_func

update_class(None)
class_dropdown.observe(update_class, ["value"])
    
def update_source(event):
    cls = class_dropdown.value
    f = getattr(cls, func_dropdown.value)
    
    source.clear_output()
    if not isinstance(f, collections.abc.Callable):
        defined_at.value = ""  # not callable: nothing to locate
        return
    defined_at.value = f"<tt>{inspect.getsourcefile(f)}:{inspect.getsourcelines(f)[1]}</tt>"
    with source:
        display(
            Markdown(
                data="```python\n"
                + textwrap.dedent(inspect.getsource(f))
                + "\n```"
            )
        )

# class_dropdown is already observed above; only func_dropdown needs wiring here
func_dropdown.observe(update_source, ["value"])
update_source(None)
display(ipywidgets.VBox([class_dropdown, func_dropdown, defined_at, source]))

# check out `_read_particle_coords`: Swift vs Gadget vs OWLSSubfind (uses Gadget)

## 3. semi-automation?

understand which frontends use which overrides of a given function...

2 parts: building some graphs and testing code similarity



### graph visualization with graphviz

In [None]:
# Add nodes 1, 2 and 3
from graphviz import Digraph  # "directed" graph
dot = Digraph()
dot.node('1',label='node 1')
dot.node('2')
dot.node('3')

# Add edges pointing from 2 and from 3 to 1
dot.edge('2','1')
dot.edge('3','1')
dot

### inheritance digraph with yt

Recursively construct inheritance structure, **highlighting when a class defines or overrides a specified function**

In [None]:
import collections.abc
import inspect
from graphviz import Digraph
from typing import Optional, Any

class ChildNode:
    # a class that is a child of some parent
    def __init__(self, 
                 child: Any, 
                 child_id: int, 
                 parent: Optional[Any]=None, 
                 parent_id: Optional[int]=None,
                 color: Optional[str]="#000000"):
        self.child = child
        self.child_name = child.__name__
        self._child_id = child_id
        self.parent = parent
        
        self._parent_id = parent_id
        self.parent_name = None
        if parent:
            self.parent_name = parent.__name__
            
        self.color = color
    
    @property
    def child_id(self) -> str:
        return str(self._child_id)
    
    @property
    def parent_id(self) -> Optional[str]:
        if self._parent_id is not None:
            return str(self._parent_id)
        return None
    
        
class ClassGraphTree:
    
    def __init__(self, 
                 baseclass: Any, 
                 funcname: Optional[str]=None, 
                 default_color: Optional[str]= "#000000",
                 func_override_color: Optional[str]= "#ff0000",
                 **kwargs):
        """
        baseclass: 
            the starting base class to begin mapping from
        funcname: 
            the name of a function to watch for overrides
        default_color: 
            the default outline color of nodes, in any graphviz string
        func_override_color: 
            the outline color of nodes that override funcname, in any graphviz string
        **kwargs:
            any additional keyword arguments are passed to graphviz.Digraph(**kwargs)
        """
        self.baseclass = baseclass
        self.basename: str = baseclass.__name__
        self.funcname = funcname
        self.dot = Digraph(**kwargs)
        self._nodenum: int = 0
        self._node_list = []        
        self._current_node = 1                
        self._default_color = default_color
        self._override_color = func_override_color
        self.build()
        
    def _get_source_info(self, obj) -> Optional[str]:
        f = getattr(obj, self.funcname)
        if isinstance(f, collections.abc.Callable):
            return f"{inspect.getsourcefile(f)}:{inspect.getsourcelines(f)[1]}"
        return None
    
    def _node_overrides_func(self, child, parent) -> bool:
        # it overrides when the function lives at a different file:line than the parent's
        return self._get_source_info(child) != self._get_source_info(parent)

    def _get_new_node_color(self, child, parent) -> str:
        if self.funcname and self._node_overrides_func(child, parent):
            return self._override_color
        return self._default_color

    def _get_baseclass_color(self) -> str:
        color = self._default_color        
        if self.funcname:
            f = getattr(self.baseclass, self.funcname)
            class_where_its_defined = f.__qualname__.split('.')[0]
            if self.basename == class_where_its_defined: 
                # then it's defined here, use the override color
                color = self._override_color
        return color
        
    
    def check_subclasses(self, parent, parent_id: int, node_i: int) -> int:
        for child in parent.__subclasses__():            
            color = self._get_new_node_color(child, parent) # color changes if overridden
            new_node = ChildNode(child, node_i, parent=parent, parent_id=parent_id, color=color)            
            self._node_list.append(new_node)
            
            node_i += 1
            node_i = self.check_subclasses(child, node_i - 1, node_i)
        return node_i
            
            
    def build(self):
        # builds a list of nodes with references for inheritance 
        
        # initialize with the top node       
        color = self._get_baseclass_color()        
        self._node_list.append(ChildNode(self.baseclass, self._current_node, parent=None, color=color))
        self._current_node += 1
        
        # recursively follow the subclasses
        _ = self.check_subclasses(self.baseclass, self._current_node - 1, self._current_node)        
            
        # now build the graph
        for node in self._node_list:            
            self.dot.node(node.child_id, label=node.child_name, color=node.color)
            if node.parent:                
                self.dot.edge(node.child_id, node.parent_id)                   
            


In [None]:
c = ClassGraphTree(BaseIOHandler, "_read_particle_selection")
c.dot

ALL THE FRONTENDS (and intermediate classes)! 
* black arrows: point to the class' parent
* red outlines: the selected function is different from the parent
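For a feel of what the coloring amounts to, here is a stripped-down sketch that writes DOT text by hand, without the `graphviz` package. It is not the `ClassGraphTree` implementation: it assumes single inheritance and unique class names, and it flags a class when the function name sits in the class's own namespace rather than comparing file and line numbers. `dot_source` and the `Toy*` classes are hypothetical:

```python
def dot_source(base, funcname, override_color="#ff0000", default_color="#000000"):
    """Return DOT text for the subclass tree under `base`."""
    lines = ["digraph {"]

    def visit(cls, parent):
        # a class defines or overrides the function if it is in its own namespace
        color = override_color if funcname in vars(cls) else default_color
        lines.append(f'    {cls.__name__} [color="{color}"]')
        if parent is not None:
            lines.append(f"    {cls.__name__} -> {parent.__name__}")  # child points to parent
        for child in cls.__subclasses__():
            visit(child, cls)

    visit(base, None)
    lines.append("}")
    return "\n".join(lines)


# hypothetical toy classes, not yt frontends
class ToyIO:
    def read(self):
        return "base"


class ToyA(ToyIO):
    def read(self):  # override
        return "a"


class ToyB(ToyIO):
    pass


class ToyC(ToyA):
    pass


text = dot_source(ToyIO, "read")
print(text)
```

The printed text can be pasted into any DOT renderer; `ToyIO` and `ToyA` come out red, `ToyB` and `ToyC` black.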

lots to see... limit by choosing a different top class. just the particle frontends:

In [None]:
c = ClassGraphTree(BaseParticleIOHandler, "_read_particle_selection")
c.dot

highlight a different function

In [None]:
c = ClassGraphTree(BaseParticleIOHandler, "_read_particle_coords")
c.dot

## 4. code similarity ?

Each of the red nodes overrides the selected function. How similar is each override?

https://github.com/fyrestone/pycode_similar is nice and easy: "This is a simple plagiarism detection tool for python code, the basic idea is to normalize python AST representation and use difflib to get the modification from referenced code to candidate code."

AST = abstract syntax trees (https://docs.python.org/3/library/ast.html, https://ruslanspivak.com/lsbasi-part7/)
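A rough sketch of the "compare ASTs with difflib" idea using only the standard library. This is not how `pycode_similar` is implemented (it does its own normalization); the helper name `ast_similarity` and the example snippets are made up. The point is only that comparing parsed trees ignores comments and spacing:

```python
import ast
import difflib


def ast_similarity(src_a: str, src_b: str) -> float:
    """Similarity ratio in [0, 1] between the dumped ASTs of two snippets."""
    dump_a = ast.dump(ast.parse(src_a))
    dump_b = ast.dump(ast.parse(src_b))
    # autojunk=False stops difflib from treating very common characters as junk
    return difflib.SequenceMatcher(None, dump_a, dump_b, autojunk=False).ratio()


original = "def total(values):\n    return sum(values)\n"
reformatted = "def total(values):\n    # add everything up\n    return sum( values )\n"
different = (
    "def total(values):\n"
    "    out = 0\n"
    "    for v in values:\n"
    "        out += v\n"
    "    return out\n"
)

same_score = ast_similarity(original, reformatted)  # comment and spacing vanish in the AST
other_score = ast_similarity(original, different)   # structurally different body
print(same_score, other_score)
```

The first comparison gives exactly 1.0 because both snippets parse to the same tree; the second lands somewhere below 1.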

In [None]:
import pycode_similar
from yt.frontends.swift.io import IOHandlerSwift
from yt.frontends.gadget.io import IOHandlerGadgetHDF5

src_swift = textwrap.dedent(inspect.getsource(IOHandlerSwift._read_particle_coords))
src_gadget = textwrap.dedent(inspect.getsource(IOHandlerGadgetHDF5._read_particle_coords))

In [None]:
# pycode_similar.detect([the reference, candidate 1, candidate 2, ...])
result = pycode_similar.detect([src_swift, src_swift, src_gadget])
result

In [None]:
# comparison of the same function
def pull_result(result, indx):
    return (result[indx][1][0].plagiarism_percent,
            result[indx][1][0].plagiarism_count, 
            result[indx][1][0].total_count)

pull_result(result, 0)


In [None]:
# gadget vs swift
pull_result(result, 1)

In [None]:
# swift vs gadget
pull_result(pycode_similar.detect([src_gadget, src_swift]), 0)

not symmetric! intrinsic AST complexity or `pycode_similar` oddity? 

averaging from here on:

In [None]:
r1 = pycode_similar.detect([src_swift, src_gadget])
r2 = pycode_similar.detect([src_gadget, src_swift])

(r1[0][1][0].plagiarism_percent + r2[0][1][0].plagiarism_percent)/2
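One plausible reason for the asymmetry (an assumption, not checked against the `pycode_similar` source): if a score is "how much of the reference shows up in the candidate", it is normalized by the reference's length, so swapping the roles changes the denominator. A toy illustration with `difflib`, where `directional_match` is a hypothetical helper:

```python
import difflib


def directional_match(reference, candidate):
    """Fraction of the reference's elements that are matched in the candidate."""
    matcher = difflib.SequenceMatcher(None, reference, candidate, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


shorter = ["a", "b", "c"]
longer = ["a", "b", "c", "d", "e", "f"]

forward = directional_match(shorter, longer)   # all 3 reference items are found
backward = directional_match(longer, shorter)  # only 3 of 6 reference items are found
averaged = (forward + backward) / 2
print(forward, backward, averaged)
```

Averaging the two directions, as done above, gives one number per pair regardless of which snippet is treated as the reference.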

## 5. putting it all together

things are getting complicated... new repo: https://github.com/chrishavlin/inheritance_explorer (not yt specific, but not tested on anything other than yt...)


In [None]:
import yt
from inheritance_explorer import ClassGraphTree

In [None]:
base_class = yt.utilities.io_handler.BaseParticleIOHandler
# fname = "_read_particle_selection"
fname = "_read_particle_coords"

In [None]:
cgt = ClassGraphTree(base_class, funcname=fname)

In [None]:
cgt.digraph()

but now... pull the source code every time the function gets overridden and then do a similarity test for every permutation to get a "similarity matrix"
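A minimal sketch of how such a matrix could be assembled, assuming some directional score function is available. This is not the `inheritance_explorer` code: `line_overlap` is a deliberately crude, made-up stand-in for the real similarity test, and plain nested lists are used instead of an array:

```python
from itertools import combinations


def line_overlap(reference: str, candidate: str) -> float:
    """Toy directional score: share of the reference's distinct lines found in the candidate."""
    ref = {line.strip() for line in reference.splitlines() if line.strip()}
    cand = {line.strip() for line in candidate.splitlines() if line.strip()}
    return len(ref & cand) / len(ref) if ref else 0.0


def similarity_matrix(sources, score=line_overlap):
    """Symmetric matrix of pairwise scores, averaged over both comparison directions."""
    n = len(sources)
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        mean = (score(sources[i], sources[j]) + score(sources[j], sources[i])) / 2
        matrix[i][j] = matrix[j][i] = mean
    return matrix


# made-up snippets standing in for the overriding functions' source
sources = [
    "x = 1\ny = 2\nreturn x + y",
    "x = 1\ny = 2\nz = 3\nreturn x + y",
    "return None",
]
toy_matrix = similarity_matrix(sources)
print(toy_matrix)
```

Each row and column corresponds to one entry of `sources`, so the indices can be mapped back to whichever classes the source came from.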

In [None]:
M = cgt.similarity_results['matrix']
M.shape

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 12))

plt.imshow(M)
_ = plt.gca().set_xticks(range(M.shape[0]))
_ = plt.gca().set_yticks(range(M.shape[0]))
plt.colorbar()

each row and column is a child class, the color is how similar the function source is to other instances of overriding the function:

* diagonal is always 1 (a self comparison)
* symmetric because I'm averaging each comparison direction

Look for similarity above some cutoff:

In [None]:
cgt.similarity_cutoff

In [None]:
plt.figure(figsize=(8, 8))
plt.imshow(M * (M>=cgt.similarity_cutoff), cmap="gray")
_ = plt.gca().set_xticks(range(M.shape[0]))
_ = plt.gca().set_yticks(range(M.shape[0]))

since each row, column refers back to a node, we can add it to the graph!
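Roughly, going from matrix to extra graph edges could look like the sketch below: keep each off-diagonal pair at or above the cutoff and look up the class names by index. The names, values, and the helper `similar_pairs` are made up for illustration and are not results from yt or the package's actual method:

```python
def similar_pairs(matrix, names, cutoff):
    """List (name_i, name_j, value) for every off-diagonal pair at or above the cutoff."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle only: the matrix is symmetric
            if matrix[i][j] >= cutoff:
                pairs.append((names[i], names[j], matrix[i][j]))
    return pairs


# made-up values, not measured similarities
names = ["HandlerA", "HandlerB", "HandlerC"]
matrix = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.8],
    [0.2, 0.8, 1.0],
]
edges = similar_pairs(matrix, names, cutoff=0.75)
print(edges)
```

Each returned tuple would become one similarity arrow between two nodes.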

In [None]:
cgt.digraph(include_similarity=True) 

* black arrows: inheritance (same as before)
* red outlines: the selected function is overridden
* isolated red outline: no other class's function source is above the cutoff
* colored arrows: point between classes containing function source above similarity cutoff. The actual color isn't meaningful... yet?

Look at that Gadget, Halo loop! and the Swift, OWLS, Gadget Binary loop!

The similarity loops, distinct from the inheritance structure, suggest there are simplifications to be made!

# improvements?

* Static graph: color nodes or arrows by similarity value
* code similarity: other methods? build in the similarity matrix plot cause it's neat?

Interactivity?
* a node editor? the node list is separate from the graphviz digraph construction, easy to export as json (or whatever)
* display the similarity arrows when hovering over a node?
* option to display source code?
* open the source code in external editor?

In [None]:
import subprocess
file_list = list(cgt._override_src_files.values()) # list of files and line numbers of the functions
vscode_args = ['code',"-g"] + file_list[:3]
_ = subprocess.Popen(vscode_args)
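On the JSON export idea from the interactivity list: since nodes are plain records, a round trip through `json` is short. A sketch with a hypothetical `NodeRecord` dataclass (its fields loosely follow the `ChildNode` class from earlier, but this is not what `inheritance_explorer` stores):

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class NodeRecord:
    # hypothetical record, loosely mirroring the ChildNode fields
    node_id: int
    name: str
    parent_id: Optional[int]
    color: str


records = [
    NodeRecord(1, "ToyBase", None, "#ff0000"),
    NodeRecord(2, "ToyChild", 1, "#000000"),
]

as_json = json.dumps([asdict(record) for record in records], indent=2)
restored = [NodeRecord(**item) for item in json.loads(as_json)]
print(as_json)
```

A node editor or other front end could then read the same JSON without needing graphviz at all.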

* other packages for code comparison? I didn't find any that let me isolate a single function of my choice like this... 
* any other parts of yt (or other packages) that might be interesting to map out? 