<a href="https://colab.research.google.com/github/gisalgs/notebooks/blob/main/kd-tree-querying-colab.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Range Query Using k-D Trees


So far we have discussed how to query a k-D tree for a point. We can use k-D trees to conduct more complicated queries. For example, we can use a k-D tree to efficiently find all the points in a certain region, such as a circle or rectangle; we call this a **range query**. We can also find a specified number of points nearest to a given point. We will focus on a particular query called the **orthogonal range query**, where the range is specified as a rectangle. There are other kinds of queries, including the circular range query and the nearest neighbor query. More details can be found in the textbook *GIS Algorithms*, and the code is available in the [indexing repository](https://github.com/gisalgs/indexing) on GitHub.

## Orthogonal range query: a brute force approach

Given a **rectangle**, can we quickly find all the points that fall within the rectangle? A brute-force approach is to test every point. We can also call it exhaustive search since it exhausts all the possible cases. This is straightforward to implement, but it can be extremely slow, especially when we have a lot of points to test and we have to test many times -- think about the number of places in the world. 

To show how a brute-force approach works, we will complete the code below. Here, we encode the rectangle as a list of two lists that together hold four values: `[[xmin, xmax], [ymin, ymax]]`.
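
As a rough illustration of the `[[xmin, xmax], [ymin, ymax]]` encoding, the sketch below runs an exhaustive search over plain `(x, y)` tuples rather than `Point` objects. The names `inside` and `scan_all` are hypothetical and are not part of the `geom` package, and treating the rectangle boundary as inside is an assumption.

```python
def inside(x, y, rect):
    """True if (x, y) lies in rect = [[xmin, xmax], [ymin, ymax]], edges included."""
    return rect[0][0] <= x <= rect[0][1] and rect[1][0] <= y <= rect[1][1]

def scan_all(coords, rect):
    """Exhaustive search: test every coordinate pair against the rectangle."""
    return [(x, y) for x, y in coords if inside(x, y, rect)]

coords = [(2, 2), (0, 5), (8, 0), (9, 8), (7, 14), (13, 12), (14, 13)]
box = [[10, 14.5], [10, 13.5]]
print(scan_all(coords, box))   # [(13, 12), (14, 13)]
```

Every point is visited exactly once, so the cost of one query grows linearly with the number of points.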

In [None]:
# assume: geom is already installed (shell) or cloned (notebook)
#         and the path is set (for shell only)
!git clone https://github.com/gisalgs/geom.git 

## <font color="red">Question 1: testing brute force search</font>

Please complete the two TODO parts below and run the code cell.

In [None]:
from geom.point import *

# TODO:
#
# Write one line of code that contains a lambda function that 
#     takes three arguments, x, y, and rect (rectangle) and 
#     returns True if the point at x, y is in the rect, and False otherwise
# and assigns the lambda function to a name in_rect

# lambda function here:




def search_all(all_points, rect):
    '''
    Finds the points that fall in a rectangle
    
    Input
        all_points    a list of Point objects
        rect          a rectangle encoded as  [[xmin, xmax], [ymin, ymax]]

    Output
        found         a list containing points in rect
    '''
    # TODO:
    #
    #    Complete the code for a brute-force search
    #    Make sure the code aligns inside the function




# testing 
raw_coord = [ (2,2), (0,5), (8,0), (9,8), (7,14), (13,12), (14,13) ]
my_points = [Point(d[0], d[1]) for d in raw_coord]
rect = [ [10, 14.5], [10, 13.5] ]

found = search_all(my_points, rect)
found

We will try this on a larger data set of 1 million random points later, after we introduce the k-D tree version of the query.

## Orthogonal range query: using k-D trees

Given what we have discussed about tree structures, we should expect that a k-D tree can help with range queries. We will use the following figure to explain the algorithm.

<img src="figures/kdtree-range-query-2.png" width=450/>

Here we discuss a search algorithm that is implemented in a function called `range_query_orthogonal`, listed below. This function is saved in a file called **kdtree2a.py**, which is available on GitHub [here](https://github.com/gisalgs/indexing/blob/master/kdtree2a.py). In this function, we use a list called `found` to hold the points found by the search process. The list `found` must be created outside the function and passed to it.

![](figures/kd-tree-orth-query-code.png)

The function `range_query_orthogonal` can be broken down into 5 logical blocks and we explain each of them below.

(1) We make sure that the node is not empty. `t` is `None` when we try to go below a leaf node. When this happens, there is nothing to check and we simply return, meaning we exit the function.

(2) If we reach here, `t` is a valid node. Here we determine which dimension to use given the depth. The value of depth is passed in by the function call, and we increase it by 1 each time we go down one level of the tree (see below).

(3) If the point at node `t` is to the left of or below the rectangle (depending on which dimension we are using, X or Y here), we only need to check the right branch of the node. To do that, we call the same function again, but this time we set the node to be checked to `t.right` and increase the depth by 1. Note that we do not call the function on the left branch, meaning we exclude everything in that branch. After this is done, we exit the function, meaning we do not run anything below this part.

(4) This is the code that handles the situation when the point at the node is to the right of or above the rectangle. By the same reasoning, only the left branch needs to be checked.

(5) If we reach here, the point at the node is sandwiched between the two bounds (either X or Y, depending on the depth of the node). There are two things to take care of. First, we check whether the point is in the rectangle; if it is, we append the point at the node (`t.point`) to the list called `found`, which collects all the points found. Second, we have to search both the left and right branches of the node.
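
The five blocks can be sketched in a self-contained form as follows. This is not the code in `kdtree2a.py`: the `Node` class, the `build` function, and `range_query_sketch` are hypothetical stand-ins that use plain `(x, y)` tuples, and the median-split construction is an assumption about how a balanced tree might be built.

```python
class Node:
    """A minimal k-D tree node holding one (x, y) tuple."""
    def __init__(self, point, left=None, right=None):
        self.point, self.left, self.right = point, left, right

def build(points, depth=0):
    """Build a balanced 2-D tree by splitting at the median of the current axis."""
    if not points:
        return None
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(pts[mid],
                build(pts[:mid], depth + 1),
                build(pts[mid + 1:], depth + 1))

def range_query_sketch(t, rect, found, depth=0):
    # (1) nothing to check below a leaf
    if t is None:
        return
    # (2) the axis alternates with depth: 0 for X, 1 for Y
    axis = depth % 2
    lo, hi = rect[axis]
    # (3) node is left of/below the rectangle: only the right branch can qualify
    if t.point[axis] < lo:
        range_query_sketch(t.right, rect, found, depth + 1)
        return
    # (4) node is right of/above the rectangle: only the left branch can qualify
    if t.point[axis] > hi:
        range_query_sketch(t.left, rect, found, depth + 1)
        return
    # (5) node is between the bounds on this axis: test it, then search both branches
    x, y = t.point
    if rect[0][0] <= x <= rect[0][1] and rect[1][0] <= y <= rect[1][1]:
        found.append(t.point)
    range_query_sketch(t.left, rect, found, depth + 1)
    range_query_sketch(t.right, rect, found, depth + 1)

coords = [(2, 2), (0, 5), (8, 0), (9, 8), (7, 14), (13, 12), (14, 13)]
tree = build(coords)
found = []
range_query_sketch(tree, [[10, 14.5], [10, 13.5]], found)
print(sorted(found))   # [(13, 12), (14, 13)]
```

In this small example the root `(8, 0)` lies to the left of the rectangle, so block (3) discards the whole left branch without testing any of its three points.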

Once we have a good understanding of the code, it is time to first test the algorithm using the k-D tree we drew manually from the previous modules. Then we will set up a few points and use the following code to test the orthogonal range query on a k-D tree.

Again, the code for this algorithm is in the file called [kdtree2a.py](https://github.com/gisalgs/indexing/blob/master/kdtree2a.py) on GitHub. We should download it and put it in the indexing folder, along with other files like `kdtree1.py` and `bst.py` that we have used before.

We will **import** everything from the `indexing` folder. This can be done locally or we can clone the entire repo called `indexing` from github.

## <font color="red">Question 2: testing k-D tree</font>


In [None]:
!git clone https://github.com/gisalgs/indexing.git 

In [None]:
# TODO
#
# Complete the code here to 
#    (1) import necessary modules from indexing
#    (2) build a balanced k-D tree using kdtree2
#    (3) conduct orthogonal range query 
#    (4) print the tree
#
# still use my_points
#



## Validating the code

We can use the functions in matplotlib to test if the query function actually returns the right result. We will use the following couple of functions to draw points and lines.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

Now we illustrate the results.

In [None]:
# TODO
# 
# Write code here to draw (1) all the points in red crosses, (2) found points in grey pluses, 
# and (3) the rectangle box, in grey, used to search



## Testing on a larger data set

Now we get back to the larger data set. This time we use a k-D tree for this.

In [None]:
import random
import time

In [None]:
t1 = time.time()
points_alot = [Point(p[0], p[1], 0) for p in [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(1000000)]]
t2 = time.time()
print(f'Time for data generation: {t2-t1}')

t1 = time.time()
found = search_all(points_alot, rect)
t2 = time.time()
t_brute_force = t2-t1

print(f'Brute-force found {len(found)} in {t_brute_force:.4f} seconds')

In [None]:
t1 = time.time()
t1mil = kdtree2(points_alot)
t2 = time.time()

print(f'Time creating the tree {t2-t1}')

## <font color="red">Question 3: Calculating the depth of a binary tree</font>

Let's see how big the tree for one million points is. If you had to guess before doing anything, what would your guess be? A depth of 100? The truth will surprise most people.

Let's write a short function to calculate the depth of a binary tree and run it on tree `t1mil`.

In [None]:
# TODO
#
# Write a function called depth that takes the root of a tree as the input and 
# returns the maximum depth (i.e. height) of the tree
#
# Then apply the function on t1mil



In [None]:
t1 = time.time()
found = []
range_query_orthogonal(t1mil, rect, found)

t2 = time.time()
t_kd_tree = t2-t1
print(f'k-D tree found {len(found)} in {t_kd_tree:.4f} seconds')

## <font color="red">Question 4</font>

Complete the code below for a bar chart.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# TODO
# 
# Draw a barchart to show the difference between brute force search and k-D tree search



Here is a take-home message: it may not take much (in terms of the size of a tree) to index a huge number of points, and a lot can be gained. We should also note the cost of building the tree. We will discuss more of this at the end of this section when we focus on computational issues of indexing.
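
To put a number on the size of the tree, a binary tree with `n` nodes needs at least ⌈log2(n + 1)⌉ levels, and a perfectly balanced tree achieves that bound. The short calculation below uses a hypothetical helper called `min_levels`. Whether the tree built by `kdtree2` reaches exactly this bound, and whether the count starts from 0 or 1, depends on the implementation and on how depth is defined.

```python
def min_levels(n):
    """Smallest number of levels a binary tree with n nodes can have."""
    # For n >= 1, n.bit_length() equals ceil(log2(n + 1)), computed with integers only
    return n.bit_length()

for n in (7, 1_000, 1_000_000):
    print(n, min_levels(n))
# 7 3
# 1000 10
# 1000000 20
```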

## <font color="red">Question 5</font>

If we do an orthogonal range query on the tree illustrated below using the rectangle shown in dashed red lines, the result will be an empty list. Which nodes in the tree must be tested by the algorithm? When we say a node is *tested* by the algorithm, we mean that the line with the comment `test t in rect` in [kdtree2a.py](https://github.com/gisalgs/indexing/blob/master/kdtree2a.py) is executed for that node. Clearly, not every point reaches this line because of the various `return` statements before it. You can answer this question by slightly modifying the query function to print out the points that reach that line and then running it. If you choose this approach, the rectangle can be encoded as follows:

```python
rect = [ [2, 7.5], [8.5, 11.5] ]
```

You can also answer this question using a visual examination of the following figure. 

<img width="450" src="figures/kdtree-range-query-1.png"/>

Add more code/markdown cells when needed.

## Circular range query using k-D trees

We are going to test circular range query using k-D trees by comparing it with a brute-force search. The circular range query is implemented in the `kdtree2b` module and can be imported as follows:

In [None]:
from indexing.kdtree2b import *

## <font color="red">Question 6</font>

*Complete the three TODO's below.*

We will use the same list of one million points that are randomly placed in a box of 0 to 100 on both X and Y coordinates. We set the "range" to be a circle centered at (1, 3) with a radius of 1.2345. 

We will also write a function to test whether a point is in a circle. This function is a little more complicated, so it is better to write it as a regular function (rather than a lambda function).
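
One common way to do such a test is to compare squared distances, which avoids computing a square root. The sketch below works on plain `(x, y)` tuples rather than `Point` objects; the name `within_radius` is hypothetical, and counting points exactly on the circle as inside is an assumption.

```python
def within_radius(p, c, r):
    """True if tuple p = (x, y) is no farther than r from tuple c."""
    dx, dy = p[0] - c[0], p[1] - c[1]
    # compare squared distances so no square root is needed
    return dx * dx + dy * dy <= r * r

print(within_radius((1, 4), (1, 3), 1.2345))   # True: the distance is 1
print(within_radius((3, 3), (1, 3), 1.2345))   # False: the distance is 2
```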

In [None]:
# TODO
#
# Complete the function

def in_circle(p, center, radius):
    '''
    Returns True if point p is in a circle defined by center and radius, False otherwise
    '''


In [None]:
center = Point(1, 3)
radius = 1.2345

def search_all_circle(all_points, center, radius):
    '''
    Returns the points that fall in a circle

    Input
        all_points    a list of Point objects
        center        center of circle, a Point object
        radius        float

    Output
        found         a list of Point objects
    '''
    # TODO:
    #
    #    Complete the code for a brute-force search
    #    Make sure the code aligns inside the function


    
# testing using one million points

t1 = time.time()
found = search_all_circle(points_alot, center, radius)
t2 = time.time()

t2-t1, len(found)


In [None]:
found2 = []

t3 = time.time()
range_query_circular(t1mil, center, radius, found2)
t4 = time.time()

print(f'Circular found {len(found2)} in {t4-t3:.4f} seconds')

In [None]:
for p in found2:
    if p not in found:
        print(p)

How do we know the points in `found` and `found2` are unique? We will have to test that too. For most lists, we can convert them to a Python set and check whether the length of the set is the same as that of the original list. But that approach does not work for a list of Point objects because Point is not hashable -- Python doesn't know how to convert a Point object into a fixed-length integer. For an object to be hashable, its class needs to define the `__hash__` method, which the `Point` class does not. This is not a big deal for our situation since we can check uniqueness in other ways (we do have a way to compare the equality of two Point objects, which should be enough here).
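
The behavior described above can be reproduced with a small stand-in class. The class `Pt` and the function `all_distinct` below are hypothetical; they assume a point class that defines `__eq__` but not `__hash__`, which may differ in detail from the actual `Point` class.

```python
class Pt:
    """Hypothetical stand-in for a point class that defines equality only."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __eq__(self, other):
        return self.x == other.x and self.y == other.y

pts = [Pt(1, 2), Pt(3, 4), Pt(1, 2)]

try:
    set(pts)
    hashable = True
except TypeError:
    # in Python 3, defining __eq__ without __hash__ makes instances unhashable
    hashable = False

def all_distinct(items):
    """Pairwise equality check; O(n^2), but it needs only __eq__."""
    return all(a != b for i, a in enumerate(items) for b in items[i + 1:])

print(hashable)               # False
print(all_distinct(pts))      # False, because Pt(1, 2) appears twice
print(all_distinct(pts[:2]))  # True
```

The pairwise check is quadratic in the length of the list, which is acceptable here because a range query over a small region usually returns only a few points.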

In [None]:
# TODO
#
# Define a function called unique and use it on found and found2



## <font color="red">Question 7</font>

Read page 84 of the textbook (*GIS Algorithms*). In the next cell, use your own words to describe how circular range query using a k-D tree works. 

Also, Listing 5.3 on page 84 of the textbook was written in Python 2. If we copy and paste the code directly into Python 3, which line MUST be changed for it to run? Write the line number in the cell below as well.

Answer in this cell:


Key: Line 27 should be changed to 

`print(found)`