# Custom Cypher Queries with Falkor

This notebook demonstrates how to write custom Cypher queries to explore your codebase knowledge graph.

Topics covered:
1. Basic graph traversal
2. Finding dependencies
3. Analyzing complexity hotspots
4. Discovering patterns
5. Advanced analysis patterns
6. Performance optimization tips
7. Custom analysis examples

## Setup

First, connect to Neo4j. The queries below assume the graph already holds data from a previous ingestion run.

In [None]:
from falkor.graph import Neo4jClient
from falkor.config import load_config
import pandas as pd

# Load config and connect
config = load_config()
db = Neo4jClient(
    uri=config.neo4j.uri,
    username=config.neo4j.user,
    password=config.neo4j.password
)

# Verify connection
stats = db.get_stats()
print(f"✓ Connected. Graph contains {stats['total_nodes']} nodes")

## 1. Basic Graph Traversal

Let's start with simple queries to understand the graph structure.

In [None]:
# Find all files in the codebase
query = """
MATCH (f:File)
RETURN f.filePath AS path, 
       f.language AS language, 
       f.loc AS lines_of_code
ORDER BY f.loc DESC
LIMIT 10
"""

result = db.execute_query(query)
df = pd.DataFrame(result)
print("\nTop 10 Largest Files:")
print(df.to_string(index=False))

In [None]:
# Find all classes with their method counts
query = """
MATCH (f:File)-[:CONTAINS]->(c:Class)
OPTIONAL MATCH (f)-[:CONTAINS]->(m:Function)
WHERE m.qualifiedName STARTS WITH c.qualifiedName + '.'
WITH c, f, count(m) AS method_count
RETURN c.name AS class_name,
       f.filePath AS file,
       method_count,
       c.complexity AS complexity
ORDER BY method_count DESC
LIMIT 10
"""

result = db.execute_query(query)
df = pd.DataFrame(result)
print("\nClasses with Most Methods:")
print(df.to_string(index=False))
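
The `+ '.'` in the `STARTS WITH` filter above keeps a class from claiming the methods of another class whose name merely begins with the same characters. Below is a minimal Python sketch of the same prefix test. The qualified names (`pkg.Foo`, `pkg.FooBar`) and the helper `is_method_of` are made up for illustration and do not come from a real graph.

```python
def is_method_of(method_qname: str, class_qname: str) -> bool:
    """Mirror of: m.qualifiedName STARTS WITH c.qualifiedName + '.'"""
    return method_qname.startswith(class_qname + ".")


# Hypothetical qualified names, not taken from a real graph
methods = ["pkg.Foo.run", "pkg.FooBar.run", "pkg.FooBar.stop"]

# Without the trailing dot, pkg.Foo also matches pkg.FooBar's methods
loose = [m for m in methods if m.startswith("pkg.Foo")]

# With the trailing dot, only pkg.Foo's own members match
strict = [m for m in methods if is_method_of(m, "pkg.Foo")]

print(f"loose: {len(loose)} matches, strict: {len(strict)} match")
```

Members of nested classes (for example a hypothetical `pkg.Foo.Inner.run`) still pass the strict test for `pkg.Foo`, so the count can include more than direct methods.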

## 2. Finding Dependencies

Explore import relationships and dependencies between files.

In [None]:
# Find files with most imports (highly coupled)
query = """
MATCH (f:File)-[:IMPORTS]->(m:Module)
WITH f, count(DISTINCT m) AS import_count
RETURN f.filePath AS file,
       import_count
ORDER BY import_count DESC
LIMIT 10
"""

result = db.execute_query(query)
df = pd.DataFrame(result)
print("\nFiles with Most Imports (High Coupling):")
print(df.to_string(index=False))

In [None]:
# Find most imported modules (highly depended upon)
query = """
MATCH (f:File)-[:IMPORTS]->(m:Module)
WHERE m.is_external = false
WITH m, collect(DISTINCT f.filePath) AS importers
RETURN m.qualifiedName AS module,
       size(importers) AS imported_by_count,
       importers[..3] AS sample_importers
ORDER BY imported_by_count DESC
LIMIT 10
"""

result = db.execute_query(query)
df = pd.DataFrame(result)
print("\nMost Depended-Upon Internal Modules:")
print(df.to_string(index=False))

In [None]:
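# Find dependency chains (A imports B imports C)
# coalesce() falls back to qualifiedName for intermediate nodes without a filePath
query = """
MATCH path = (f1:File)-[:IMPORTS*2..3]->(f2:File)
WHERE f1 <> f2
RETURN [node IN nodes(path) | coalesce(node.filePath, node.qualifiedName)] AS dependency_chain,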
       length(path) AS chain_length
ORDER BY chain_length DESC
LIMIT 5
"""

result = db.execute_query(query)
print("\nLong Dependency Chains:")
for i, record in enumerate(result[:5], 1):
    chain = record['dependency_chain']
    print(f"\n{i}. Chain length: {record['chain_length']}")
    for j, file in enumerate(chain, 1):
        print(f"   {j}. {file}")
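
To make the `*2..3` quantifier concrete, the sketch below enumerates the same kind of paths over a tiny in-memory import graph. The file names, the `IMPORTS` dictionary, and the `chains` helper are invented for illustration. The traversal only approximates Cypher's matching: each relationship is used at most once per path, and paths that end where they started are dropped, as `WHERE f1 <> f2` does.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical import edges: a.py imports b.py, b.py imports c.py, ...
IMPORTS: Dict[str, List[str]] = {
    "a.py": ["b.py"],
    "b.py": ["c.py"],
    "c.py": ["d.py"],
    "d.py": [],
}


def chains(graph: Dict[str, List[str]], min_hops: int = 2, max_hops: int = 3) -> List[List[str]]:
    """Return every path with min_hops..max_hops edges, longest first."""
    found: List[List[str]] = []

    def walk(path: List[str], used: FrozenSet[Tuple[str, str]]) -> None:
        hops = len(path) - 1
        if hops >= min_hops and path[0] != path[-1]:
            found.append(list(path))
        if hops == max_hops:
            return  # do not traverse past the upper bound
        for nxt in graph.get(path[-1], []):
            edge = (path[-1], nxt)
            if edge not in used:  # an edge may appear only once per path
                walk(path + [nxt], used | {edge})

    for start in graph:
        walk([start], frozenset())
    return sorted(found, key=len, reverse=True)


for chain in chains(IMPORTS):
    print(" -> ".join(chain))
```

Raising the upper bound increases the number of candidate paths quickly on a real graph, which is why the query keeps it at 3.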

## 3. Analyzing Complexity Hotspots

Find the most complex parts of your codebase.

In [None]:
# Find most complex functions
query = """
MATCH (f:File)-[:CONTAINS]->(func:Function)
RETURN func.name AS function,
       f.filePath AS file,
       func.complexity AS complexity,
       func.lineStart AS line,
       size(func.parameters) AS param_count
ORDER BY func.complexity DESC
LIMIT 10
"""

result = db.execute_query(query)
df = pd.DataFrame(result)
print("\nMost Complex Functions:")
print(df.to_string(index=False))

In [None]:
# Find complexity distribution by file
query = """
MATCH (file:File)-[:CONTAINS]->(func:Function)
WITH file, 
     sum(func.complexity) AS total_complexity,
     count(func) AS function_count,
     avg(func.complexity) AS avg_complexity
RETURN file.filePath AS file,
       total_complexity,
       function_count,
       round(avg_complexity, 2) AS avg_complexity
ORDER BY total_complexity DESC
LIMIT 10
"""

result = db.execute_query(query)
df = pd.DataFrame(result)
print("\nFiles with Highest Total Complexity:")
print(df.to_string(index=False))

## 4. Discovering Patterns

Find interesting patterns in your codebase.

In [None]:
# Find functions that call many other functions (fan-out)
query = """
MATCH (f:Function)-[:CALLS]->(called)
WITH f, count(DISTINCT called) AS call_count
WHERE call_count > 5
RETURN f.name AS function,
       f.filePath AS file,
       call_count,
       f.complexity AS complexity
ORDER BY call_count DESC
LIMIT 10
"""

result = db.execute_query(query)
df = pd.DataFrame(result)
print("\nFunctions with High Fan-Out:")
print(df.to_string(index=False))

In [None]:
# Find utility functions (called by many others)
query = """
MATCH (caller:Function)-[:CALLS]->(f:Function)
WITH f, count(DISTINCT caller) AS caller_count
WHERE caller_count > 3
RETURN f.name AS function,
       f.filePath AS file,
       caller_count,
       f.complexity AS complexity
ORDER BY caller_count DESC
LIMIT 10
"""

result = db.execute_query(query)
df = pd.DataFrame(result)
print("\nUtility Functions (High Fan-In):")
print(df.to_string(index=False))

In [None]:
# Find classes with inheritance chains
query = """
MATCH path = (child:Class)-[:INHERITS*1..5]->(parent:Class)
RETURN child.name AS child_class,
       [node IN nodes(path) | node.name] AS inheritance_chain,
       length(path) AS depth
ORDER BY depth DESC
LIMIT 10
"""

result = db.execute_query(query)
print("\nDeep Inheritance Hierarchies:")
for i, record in enumerate(result[:5], 1):
    chain = record['inheritance_chain']
    print(f"\n{i}. Depth: {record['depth']}")
    print(f"   {' -> '.join(chain)}")

In [None]:
# Find methods that override parent methods
query = """
MATCH (child_method:Function)-[:OVERRIDES]->(parent_method:Function)
MATCH (child_class:Class)-[:INHERITS]->(parent_class:Class)
WHERE child_method.qualifiedName STARTS WITH child_class.qualifiedName + '.'
  AND parent_method.qualifiedName STARTS WITH parent_class.qualifiedName + '.'
RETURN child_method.name AS method_name,
       child_class.name AS child_class,
       parent_class.name AS parent_class,
       child_method.complexity AS child_complexity,
       parent_method.complexity AS parent_complexity
ORDER BY child_method.name
LIMIT 10
"""

result = db.execute_query(query)
df = pd.DataFrame(result)
print("\nMethod Overrides:")
print(df.to_string(index=False))

## 5. Advanced Analysis Patterns

More complex queries for deeper insights.

In [None]:
# Find cohesion within files (functions calling each other in same file)
# The CALLS relationship is bound to a variable so count() only counts
# pairs where a call actually exists (unmatched OPTIONAL MATCH rows are null).
query = """
MATCH (file:File)-[:CONTAINS]->(f1:Function),
      (file)-[:CONTAINS]->(f2:Function)
WHERE f1 <> f2
OPTIONAL MATCH (f1)-[call:CALLS]->(f2)
WITH file,
     count(DISTINCT f1) AS total_functions,
     count(call) AS internal_calls
WHERE total_functions > 2
RETURN file.filePath AS file,
       total_functions,
       internal_calls,
       round(toFloat(internal_calls) / (total_functions * (total_functions - 1)), 3) AS cohesion
ORDER BY cohesion DESC
LIMIT 10
"""

result = db.execute_query(query)
df = pd.DataFrame(result)
print("\nFiles with High Internal Cohesion:")
print(df.to_string(index=False))
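
The `cohesion` column is the fraction of possible caller/callee pairs inside a file that actually occur: a file with n functions has n * (n - 1) ordered pairs. Here is a small Python sketch of the same arithmetic. The function name and sample numbers are chosen for illustration only.

```python
def cohesion(internal_calls: int, total_functions: int) -> float:
    """Share of ordered function pairs in a file that are linked by a call."""
    possible_pairs = total_functions * (total_functions - 1)
    if possible_pairs == 0:
        return 0.0  # zero or one function: no pairs to compare
    return round(internal_calls / possible_pairs, 3)


# 4 functions give 4 * 3 = 12 ordered pairs; 3 of them are real calls
print(cohesion(internal_calls=3, total_functions=4))
```

A score of 1.0 would mean every function calls every other function in the file, so real files usually score far lower.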

In [None]:
# Find bottleneck functions: in-degree x out-degree as a rough proxy for
# betweenness centrality (an exact figure needs a graph algorithm library)
query = """
MATCH (f:Function)
OPTIONAL MATCH (caller)-[:CALLS]->(f)
OPTIONAL MATCH (f)-[:CALLS]->(callee)
WITH f,
     count(DISTINCT caller) AS in_degree,
     count(DISTINCT callee) AS out_degree
WHERE in_degree > 2 AND out_degree > 2
RETURN f.name AS function,
       f.filePath AS file,
       in_degree,
       out_degree,
       in_degree * out_degree AS centrality_score
ORDER BY centrality_score DESC
LIMIT 10
"""

result = db.execute_query(query)
df = pd.DataFrame(result)
print("\nBottleneck Functions (High Centrality):")
print(df.to_string(index=False))
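
The `centrality_score` in this query is the product of in-degree and out-degree, a cheap estimate of how many caller/callee pairs pass through a function. The sketch below computes the same product on a hypothetical call list. The function names and the `degree_scores` helper are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical (caller, callee) pairs
CALLS = [
    ("main", "load"), ("main", "process"), ("cli", "process"),
    ("process", "parse"), ("process", "save"), ("load", "parse"),
]


def degree_scores(calls: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Return in_degree * out_degree for every function in the call list."""
    callers: Dict[str, Set[str]] = defaultdict(set)  # callee -> distinct callers
    callees: Dict[str, Set[str]] = defaultdict(set)  # caller -> distinct callees
    for caller, callee in calls:
        callers[callee].add(caller)
        callees[caller].add(callee)
    names = set(callers) | set(callees)
    return {
        name: len(callers.get(name, ())) * len(callees.get(name, ()))
        for name in names
    }


scores = degree_scores(CALLS)
for name, score in sorted(scores.items(), key=lambda item: (-item[1], item[0])):
    print(f"{name}: {score}")
```

Entry points and leaf functions score 0 because one of their degrees is 0, which matches the query's `in_degree > 2 AND out_degree > 2` filter dropping them.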

In [None]:
# Find decorator usage patterns
query = """
MATCH (f:Function)
WHERE size(f.decorators) > 0
UNWIND f.decorators AS decorator
RETURN decorator,
       count(*) AS usage_count,
       collect(DISTINCT f.name)[..5] AS example_functions
ORDER BY usage_count DESC
LIMIT 10
"""

result = db.execute_query(query)
df = pd.DataFrame(result)
print("\nMost Used Decorators:")
print(df.to_string(index=False))

## 6. Performance Optimization Tips

Guidelines for writing efficient Cypher queries.

In [None]:
# EXPLAIN shows the planned execution without running the query;
# PROFILE runs it and reports actual rows and database hits
query = """
EXPLAIN
MATCH (f:Function)
WHERE f.complexity > 10
RETURN f.name, f.complexity
ORDER BY f.complexity DESC
LIMIT 10
"""

result = db.execute_query(query)
print("Query execution plan generated (see Neo4j Browser for details)")
print("\nTips:")
print("- Use indexes on frequently queried properties")
print("- Filter early with WHERE clauses")
print("- Limit relationship traversal depth")
print("- Use PROFILE to see actual query performance")
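
One way to apply the "limit relationship traversal depth" tip is to build variable-length queries through a helper that refuses unbounded or very deep patterns. The sketch below is only an illustration. `build_import_chain_query` and `MAX_ALLOWED_HOPS` are hypothetical names, and it assumes hop bounds must be written as literals in the query text while `LIMIT` can be passed as the `$limit` parameter. How parameters are passed to `execute_query` is not shown in this notebook, so the call itself is left out.

```python
from typing import Dict, Tuple

MAX_ALLOWED_HOPS = 5  # hypothetical ceiling; tune for your graph size


def build_import_chain_query(
    min_hops: int = 2, max_hops: int = 3, limit: int = 5
) -> Tuple[str, Dict[str, int]]:
    """Return (query, params) for a depth-bounded import-chain search."""
    for value in (min_hops, max_hops, limit):
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError("hop bounds and limit must be integers")
    if not 1 <= min_hops <= max_hops <= MAX_ALLOWED_HOPS:
        raise ValueError(
            f"need 1 <= min_hops <= max_hops <= {MAX_ALLOWED_HOPS}"
        )
    if limit < 1:
        raise ValueError("limit must be positive")

    # Bounds are validated integers, so interpolating them is safe here
    query = (
        f"MATCH path = (f1:File)-[:IMPORTS*{min_hops}..{max_hops}]->(f2:File)\n"
        "WHERE f1 <> f2\n"
        "RETURN [node IN nodes(path) | node.filePath] AS dependency_chain,\n"
        "       length(path) AS chain_length\n"
        "ORDER BY chain_length DESC\n"
        "LIMIT $limit"
    )
    return query, {"limit": limit}


query_text, params = build_import_chain_query(2, 3, limit=5)
print(query_text)
print(params)
```

Validating the bounds before interpolation keeps arbitrary text out of the query string and makes an accidental `*1..50` fail fast instead of scanning the whole graph.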

## 7. Custom Analysis Examples

Combine patterns to answer specific questions about your codebase.

In [None]:
# Question: Which files have the highest "change risk"?
# (High complexity + many imports + many dependents)
# Each OPTIONAL MATCH is aggregated before the next one so the row
# multiplication between them does not inflate sum(func.complexity).

query = """
MATCH (file:File)
OPTIONAL MATCH (file)-[:CONTAINS]->(func:Function)
WITH file, sum(func.complexity) AS total_complexity
OPTIONAL MATCH (file)-[:IMPORTS]->(module)
WITH file, total_complexity, count(DISTINCT module) AS import_count
OPTIONAL MATCH (other_file:File)-[:IMPORTS]->(file)
WITH file,
     total_complexity,
     import_count,
     count(DISTINCT other_file) AS imported_by_count
WITH file,
     total_complexity,
     import_count,
     imported_by_count,
     (total_complexity * 0.4 + import_count * 10 + imported_by_count * 15) AS risk_score
RETURN file.filePath AS file,
       total_complexity,
       import_count,
       imported_by_count,
       round(risk_score) AS risk_score
ORDER BY risk_score DESC
LIMIT 10
"""

result = db.execute_query(query)
df = pd.DataFrame(result)
print("\nFiles with Highest Change Risk:")
print(df.to_string(index=False))
print("\nRisk factors: complexity, coupling (imports), dependents")
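
The weights in the risk formula (0.4, 10, 15) are hard-coded in the query, so trying other weightings means editing Cypher. As a sketch, the same score can be recomputed in Python from the returned columns. The `risk_score` and `rank` helpers and the sample rows are hypothetical and only mirror the formula above.

```python
from typing import Dict, List, Optional


def risk_score(
    total_complexity: Optional[float],
    import_count: int,
    imported_by_count: int,
    w_complexity: float = 0.4,
    w_imports: float = 10.0,
    w_dependents: float = 15.0,
) -> float:
    """Weighted sum matching the query's risk_score expression."""
    complexity = total_complexity or 0  # guard against a missing value
    return (
        complexity * w_complexity
        + import_count * w_imports
        + imported_by_count * w_dependents
    )


def rank(rows: List[Dict], **weights: float) -> List[Dict]:
    """Sort rows by risk, highest first, using optional custom weights."""
    return sorted(
        rows,
        key=lambda r: risk_score(
            r["total_complexity"], r["import_count"], r["imported_by_count"], **weights
        ),
        reverse=True,
    )


# Hypothetical rows shaped like the query result
rows = [
    {"file": "cli.py", "total_complexity": 60, "import_count": 8, "imported_by_count": 0},
    {"file": "core.py", "total_complexity": 120, "import_count": 4, "imported_by_count": 6},
    {"file": "utils.py", "total_complexity": 30, "import_count": 1, "imported_by_count": 9},
]

print([r["file"] for r in rank(rows)])
print([r["file"] for r in rank(rows, w_dependents=0.0)])
```

Setting a weight to zero shows how much one factor drives the ranking before you change the query itself.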

## Cleanup

In [None]:
db.close()
print("✓ Connection closed")
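
If a query raises partway through the notebook, the cleanup cell may never run. The standard-library `contextlib.closing` wrapper calls `close()` on exit even when an exception occurs. The sketch below shows it with a stand-in `FakeClient` class, which is hypothetical and only mimics the `execute_query()` and `close()` methods used in this notebook.

```python
from contextlib import closing
from typing import Dict, List


class FakeClient:
    """Stand-in for the notebook's client; not the real Falkor class."""

    def __init__(self) -> None:
        self.closed = False

    def execute_query(self, query: str) -> List[Dict[str, int]]:
        if "BAD" in query:
            raise RuntimeError("query failed")
        return [{"n": 1}]

    def close(self) -> None:
        self.closed = True


client = FakeClient()
try:
    # closing() guarantees client.close() runs when the block exits
    with closing(client) as fake_db:
        fake_db.execute_query("BAD QUERY")
except RuntimeError as err:
    print(f"Query error: {err}")

print(f"Connection closed: {client.closed}")
```

The same pattern should work with any object that exposes a `close()` method.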

## Summary

In this notebook, you learned:

1. **Basic Traversal**: How to navigate the knowledge graph
2. **Dependencies**: Finding import relationships and coupling
3. **Complexity**: Identifying complexity hotspots
4. **Patterns**: Discovering common code patterns
5. **Advanced Analysis**: Cohesion, centrality, custom metrics
6. **Performance**: Query optimization techniques
7. **Custom Analysis**: Combining patterns into a composite change-risk score

## Resources

- [Neo4j Cypher Documentation](https://neo4j.com/docs/cypher-manual/current/)
- [Graph Data Science Library](https://neo4j.com/docs/graph-data-science/current/)
- Falkor Schema: See `falkor/graph/schema.py`

## Next Steps

- Try `03_visualization.ipynb` for graph visualization
- Explore `04_batch_analysis.ipynb` for multi-project analysis
- Write your own queries to answer specific questions about your codebase!