# Exploring the Decision Tree

This module requires the completion of [the initial data loading module](1%20-%20Loading%20Decision%20Tree%20Data.ipynb).  Please complete module 1 before continuing.

Our decision tree contains a lot of valuable information we can gather.  As a review of our initial module, our decision tree contains questions and answers.  A question is classified by an answer, and each answer can result in subsequent questions.  This set of data should allow us to guide a user through this decision tree to narrow down the available decisions he or she needs to make to get a drink.

This module will attempt to answer the following questions:
* [What are all of the questions in the database?](#What-are-all-of-the-questions-in-the-database?)
* [What are all of the answers in the database?](#What-are-all-of-the-answers-in-the-database?)
* [At any question, what are my available answers?](#At-any-question,-what-are-my-available-answers?)
* [At any answer, what are my next questions?](#At-any-answer,-what-are-my-next-questions?)
* [Making the Above Queries Generic](#Making-the-Above-Queries-Generic)
* [At any question, what are all possible subsequent questions?](#At-any-question,-what-are-all-possible-subsequent-questions?)
* [Given the current set of answers, what are my possible available questions?](#Given-the-current-set-of-answers,-what-are-my-possible-available-questions?)

The answers to the questions above are invaluable for constructing a simple UI that guides the user through the decision tree.  First, we need to set up the environment for this notebook.

In [1]:
import os
import pandas as pd
from IPython.display import display, Markdown, Latex
from py2neo import Graph
from string import Template

# Connect to our Graph database, ensure connectivity, and store connection in variable.
graph = Graph(uri="bolt://localhost", user="neo4j", password="123changeme");

# Set up a local path reference
rel_path = os.getcwd()

## What are all of the questions in the database?

To get a grasp of the questions asked in the decision tree, let's get a holistic list of the questions in the database.  Now, you may be thinking: _Why are we doing this?_  The answer is twofold: 1 - we can use the information later in the module :), and 2 - we can also do some nifty things with the query.

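The query below gives us a list of questions _ordered by path length_, so the questions come back in the order in which they appear in the depth of the decision tree.  Because questions and answers alternate along a path, the query divides the path length by 2 to report depth in terms of questions.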

In [2]:
query = """
MATCH p=(q:Question { id: 1 })-[*0..]->(n:Question)
WITH DISTINCT q, n, length(p) as depth
RETURN (depth / 2) as Depth, coalesce(n.value, q.value) as Question, coalesce(n.id, q.id) as ID
ORDER BY depth
"""
table = graph.run(query).to_data_frame(columns=['ID','Question','Depth']);
display(table);

Unnamed: 0,ID,Question,Depth
0,1,Is Holiday,0
1,2,Is Frappucino,1
2,3,Frappucino Flavor,2
3,9,Is Tea,2
4,4,Size,3
5,10,Tea Type,3
6,5,Is Espresso Based,3
7,4,Size,4
8,7,Added Flavor,4
9,11,Coffee Type,4


Notice that certain questions appear multiple times, such as 'Size'.  This is because the 'Size' question exists at different depths within the tree depending on what path you take.  If you decide to answer questions differently, you may end up at 'Size' at different times.
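
To see why one question can land at more than one depth, here is a minimal pure-Python sketch of the same idea. It does not touch Neo4j: the `EDGES` dictionary is a hypothetical toy tree made up for illustration, and `question_depths` is a hypothetical helper, not part of this notebook's code.

```python
from collections import deque

# Hypothetical toy tree, NOT the real database contents.
# Keys are nodes; values are the nodes they point to.
# Questions ("Q:") and answers ("A:") alternate along every path.
EDGES = {
    "Q:Is Holiday": ["A:Holiday=Yes", "A:Holiday=No"],
    "A:Holiday=Yes": ["Q:Size"],
    "A:Holiday=No": ["Q:Is Tea"],
    "Q:Is Tea": ["A:Tea=Yes"],
    "A:Tea=Yes": ["Q:Size"],
}

def question_depths(edges, root):
    """Return sorted, distinct (depth, question) pairs over every path from root."""
    found = set()
    queue = deque([(root, 0)])  # (node, relationships walked so far)
    while queue:
        node, hops = queue.popleft()
        if node.startswith("Q:"):
            # Two relationships (question -> answer -> question) per level.
            found.add((hops // 2, node))
        for child in edges.get(node, []):
            queue.append((child, hops + 1))
    return sorted(found)

depths = question_depths(EDGES, "Q:Is Holiday")
for depth, question in depths:
    print(depth, question)
```

In this toy tree, `Q:Size` is reported at depth 1 (via the "yes" branch) and again at depth 2 (via the tea branch), which mirrors the duplicate 'Size' rows above.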

## What are all of the answers in the database?

Similar to the previous query - let's check out the available answers.  The only change we made was to start with the initial question, then branch from the resulting answers.

In [3]:
query = """
MATCH p=(q:Question { id: 1 })-[:IS_CLASSIFIED_BY]->(a:Answer)-[*0..]->(n:Answer)
WITH DISTINCT a, n, length(p) as depth
RETURN (depth / 2) as Depth, coalesce(n.value, a.value) as Question, coalesce(n.id, a.id) as ID
ORDER BY depth
"""
table = graph.run(query).to_data_frame(columns=['ID','Question','Depth']);
display(table);

Unnamed: 0,ID,Question,Depth
0,2,No,0
1,1,Yes,0
2,3,Yes,1
3,4,No,1
4,3,Yes,1
5,4,No,1
6,5,Strawberries & Crème,2
7,6,Pumpkin Spice,2
8,7,Coffee,2
9,8,Mocha,2


Notice that the same thing happens with duplicate answers, for the same reason as above.  As paths start, end, and merge with one another, it is possible to arrive at the same question and answer block at different depths in the tree.

## At any question, what are my available answers?

Let's attempt to tackle the first of the remaining items on our list.  The first two are extremely simple to do.  We need a query that takes the ID of a node with the `:Question` label and returns a list of nodes with the `:Answer` label.

In [6]:
def get_answers_by_question(id):
    query_string = """
        MATCH (q:Question { id: $id })-[:IS_CLASSIFIED_BY]->(a:Answer)
        RETURN a.id as Answer_ID, q.value as Question, a.value as Answer
        ORDER BY a.value;
    """
    template = Template(query_string);
    query = template.substitute(id=id);
    return graph.run(query);
    
MY_QUESTION_ID = 3; # you can change this to any of the question IDs above

result = get_answers_by_question(MY_QUESTION_ID);

table = result.to_data_frame(columns=['Answer_ID', 'Question', 'Answer']);

display(table);

Unnamed: 0,Answer_ID,Question,Answer
0,7,Frappucino Flavor,Coffee
1,9,Frappucino Flavor,Java Chip
2,8,Frappucino Flavor,Mocha
3,6,Frappucino Flavor,Pumpkin Spice
4,5,Frappucino Flavor,Strawberries & Crème
5,10,Frappucino Flavor,Vanilla Bean
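
Cypher itself uses `$name` for query parameters, so the same query text could be sent unchanged with the value passed separately, rather than pasted into the string by `string.Template`. The sketch below only builds the query text and the parameter dictionary. It assumes py2neo's `Graph.run(cypher, parameters)` form for the actual call, which is shown in a comment because it needs a live database. The helper names are hypothetical.

```python
from string import Template

QUERY = """
MATCH (q:Question { id: $id })-[:IS_CLASSIFIED_BY]->(a:Answer)
RETURN a.id AS Answer_ID, q.value AS Question, a.value AS Answer
ORDER BY a.value
"""

def render_with_template(question_id):
    """What the cell above does: paste the value into the query text."""
    return Template(QUERY).substitute(id=question_id)

def build_parameterized(question_id):
    """Alternative: leave $id in the text and ship the value separately."""
    return QUERY, {"id": int(question_id)}

rendered = render_with_template(3)
query, params = build_parameterized(3)

print(rendered)  # the query now contains the literal "{ id: 3 }"
print(params)    # the value travels next to the unchanged query text

# With a live connection (assumption: the `graph` object created earlier):
# result = graph.run(query, params)
```

Passing the value as a parameter avoids quoting problems if an ID ever stops being a plain integer, and it lets the database reuse the query plan across different IDs.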


## At any answer, what are my next questions?

This query is extremely similar to the last one.  We just want to know the questions spawning from any given answer.

In [8]:
def get_questions_by_answer(id):
    query_string = """
        MATCH (a:Answer { id: $id })-[:RESULTS_IN]->(q:Question)
        RETURN q.id as ID, q.value as Value
        ORDER BY q.value;
    """
    template = Template(query_string);
    query = template.substitute(id=id);
    return graph.run(query);
    
MY_ANSWER_ID = 18; # you can change this to any of the answer IDs above

result = get_questions_by_answer(MY_ANSWER_ID);

table = result.to_data_frame(columns=['ID', 'Value']);

display(table);

Unnamed: 0,ID,Value
0,4,Size


## Making the Above Queries Generic

The queries above can easily be made generic.  What if we want a single query that works whether a given ID belongs to a question or an answer?  We can easily modify the query to work with either kind of node.  Give it any ID existing for a Question or Answer, together with the node's label, and the query will provide the resulting nodes in the decision tree.  As a bonus, I also added in whether or not the result node is terminal (i.e. doesn't lead to any other nodes in the tree).

In [12]:
def get_next_steps(id, type):
    query_string = """
        MATCH (a { id: $id })-[:RESULTS_IN|:IS_CLASSIFIED_BY]->(b)
        WHERE '$type' IN labels(a)
        OPTIONAL MATCH (b)-[:RESULTS_IN|:IS_CLASSIFIED_BY]->(c)
        WITH DISTINCT a, b, c IS NULL as Terminal
        RETURN b.id as Value_ID, a.value as Key, b.value as Value, labels(b)[0] as Value_Type, Terminal
        ORDER BY b.value;
    """
    template = Template(query_string);
    query = template.substitute(id=id, type=type);
    return graph.run(query);
    
MY_ID = 1; # you can change this to any of the IDs above
TYPE = 'Question';

result = get_next_steps(MY_ID, TYPE);

table = result.to_data_frame(columns=['Value_ID', 'Key', 'Value', 'Value_Type', 'Terminal']);

display(table);

Unnamed: 0,Value_ID,Key,Value,Value_Type,Terminal
0,2,Is Holiday,No,Answer,False
1,1,Is Holiday,Yes,Answer,False
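
The label argument matters because question IDs and answer IDs overlap (both a question and an answer have ID 1 in the tables above). The following pure-Python sketch mimics the shape of the generic query on a hypothetical toy dataset; `VALUES`, `EDGES`, and `next_steps` are illustrative names, not part of the notebook or the database.

```python
# Hypothetical toy data: (label, id) -> display value.
VALUES = {
    ("Question", 1): "Is Holiday",
    ("Answer", 1): "Yes",
    ("Answer", 2): "No",
    ("Question", 2): "Is Frappucino",
}

# Hypothetical outgoing relationships; in this toy data "Yes" leads nowhere.
EDGES = {
    ("Question", 1): [("Answer", 1), ("Answer", 2)],
    ("Answer", 2): [("Question", 2)],
}

def next_steps(node_id, label):
    """Return the direct successors of a node, flagging terminal ones."""
    start = (label, node_id)  # the label disambiguates overlapping IDs
    rows = []
    for child in EDGES.get(start, []):
        rows.append({
            "Value_ID": child[1],
            "Key": VALUES[start],
            "Value": VALUES[child],
            "Value_Type": child[0],
            # Terminal means the successor has no outgoing relationships.
            "Terminal": not EDGES.get(child),
        })
    return sorted(rows, key=lambda row: row["Value"])

rows = next_steps(1, "Question")
for row in rows:
    print(row)
```

Calling `next_steps(1, "Answer")` on the same data returns an empty list, since answer 1 has no successors in this toy graph, while `next_steps(1, "Question")` returns both answers.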


## At any question, what are all possible subsequent questions?

This is an interesting question to ask our data.  We have initially been looking at our questions or answers one node at a time, or from the initial question node.  What if the user wants to answer a question at a random place in the tree?  We should know two things for a given question: All of the questions left below in the decision tree, and all of the possible paths back up to our initial question.

The query below will tell us all questions the user has available from an initial starting point ID.  For the example of `id: 10`, simply _selecting_ the single question "Tea Type" allows us to reduce the total number of questions from 11 to 5 (the 4 related questions plus "Tea Type" itself).

Imagine a form that knew how to trim the number of fields based on your real-time input.

In [15]:
def get_next_steps(id):
    query_string = """
        MATCH (q:Question { id: $id })
        OPTIONAL MATCH (q)-[*1..]->(n1:Question)
        OPTIONAL MATCH (q)<-[*1..]-(n2:Question)
        RETURN DISTINCT q, n1, n2
    """
    template = Template(query_string);
    query = template.substitute(id=id);
    return graph.run(query);
    
MY_QUESTION_ID = 10; # you can change this to any of the question IDs above

result = get_next_steps(MY_QUESTION_ID);

# extracting the list of distinct nodes as a set
result_list = list(result.to_subgraph().nodes);

# mapping the list of nodes to a simple array of [id, value]
result_list = list(map(lambda n: [n['id'], n['value']], result_list))

pd.DataFrame(result_list, columns=["ID", "Value"])

Unnamed: 0,ID,Value
0,1,Is Holiday
1,2,Is Frappucino
2,4,Size
3,9,Is Tea
4,10,Tea Type


## Given the current set of answers, what are my possible available questions?

This is probably the holy grail of this exercise.  The user has provided answers to any question he or she decides to start with.  We need to be able to display the possible questions available based on the set of answers the user provides.  The below query will do just that.

**Note**: the example set of three answer IDs (`[4, 29, 21]`) is a _valid_ set.  This means we are giving three answers that fit within the same sub-decision tree.  We will show how to validate answer selections in the next section.

In [18]:
def get_next_steps(id_list):
    query_string = """
        MATCH p=(:Question { id: 1 })-[*0..]->(z:Answer)
        WHERE z.id IN $id_list
        AND ALL(id in $id_list WHERE ANY(node IN nodes(p) WHERE 'Answer' IN labels(node) AND node.id = id))
        WITH DISTINCT nodes(p) AS determined, p
        WITH COLLECT(p) as collectedPaths, MAX(length(p)) AS maxLength, determined
        WITH FILTER(path IN collectedPaths WHERE length(path) = maxLength) AS longestPaths, determined
        WITH EXTRACT(path IN longestPaths | LAST(nodes(path)))[0] as last, FILTER(node IN determined WHERE 'Question' IN labels(node)) AS determinedNodes
        MATCH (last)-[*1..]->(q:Question)
        WITH collect(DISTINCT q) + determinedNodes as merged
        UNWIND merged as returnNodes
        RETURN returnNodes
    """
    template = Template(query_string);
    query = template.substitute(id_list=id_list);
    return graph.run(query);
    
MY_ANSWERS = [4, 29, 21]; # you can change this to any of the answer IDs above

# just showing you the selected answers
table = graph.run(Template("""
MATCH (q)-[:IS_CLASSIFIED_BY]->(a:Answer)
WHERE a.id IN $ids
RETURN a.id as ID, q.value as Question, a.value as Answer
""").substitute(ids=MY_ANSWERS)).to_data_frame(columns=['ID','Question','Answer']);
display(Markdown('##### Selected Answers:'));
display(table);

result = get_next_steps(MY_ANSWERS);

# extracting the list of distinct nodes as a set
result_list = list(result.to_subgraph().nodes);

# mapping the list of nodes to a simple array of [id, value]
result_list = list(map(lambda n: [n['id'], n['value']], result_list))

display(Markdown('##### Available Questions:'));
pd.DataFrame(result_list, columns=["ID", "Value"])

##### Selected Answers:

Unnamed: 0,ID,Question,Answer
0,4,Is Frappucino,No
1,29,Is Tea,No
2,21,Is Espresso Based,No


##### Available Questions:

Unnamed: 0,ID,Value
0,11,Coffee Type
1,4,Size
2,1,Is Holiday
3,2,Is Frappucino
4,5,Is Espresso Based
5,8,Is Iced
6,9,Is Tea


#### What's Next?

This module covered the exploration of the decision tree data.  In order to determine what drink order we have, we need to group answer sets together.  The next module loads mock `Set` nodes and runs a few queries against the new data.

[Go to the next module >>](3%20-%20Grouping%20Answers%20Into%20Sets.ipynb)