<a href="https://colab.research.google.com/github/gisalgs/notebooks/blob/main/computational_issues_2-colab.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Computational Issues of Spatial Indexing


>"How long does getting thin take?" asked Pooh anxiously.  
>"About a week, I should think."  
>"But I can't stay here for a *week*!"  
>"You can *stay* all right, silly old Bear. It's getting you out which is so difficult."
>
><cite>A. A. Milne, Winnie-the-Pooh</cite>

>"And is all this common consciousness satisfied to use me as a black box? Since the black box works, is it unimportant to know what is inside? --- That doesn't suit me. I don't enjoy being a black box. I want to know what's inside."
>
><cite>Isaac Asimov, Foundation and Earth</cite>


The computational time for trees can be broken down into at least two parts: the time used to construct the tree and the time used to query it. The overall time complexity of building a balanced k-D tree is $O(n \log_2 n)$. Searching a k-D tree takes only $O(\log_2 n)$ time on average when the tree is balanced. For unbalanced trees, however, we can imagine a worst case where the points always fall on the same branch of each node, and in this case the search time is $O(n)$, the same as a linear search (though the actual time might be longer because traversing a tree takes more time than traversing a list or an array).

For point quadtrees, the cost of building the tree is $O(n \log_4 n)$ when the points are randomly shuffled before they are inserted into the tree, as we discussed above. A simple search on a balanced point quadtree has a time complexity of $O(\log_4 n)$, while the worst case is $O(n)$ when the tree has only one node at each depth.
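To make these bounds more concrete, the short sketch below prints the approximate number of levels a search visits in a balanced k-D tree and in a balanced point quadtree, next to the number of steps in the degenerate case. It is only an illustration of the formulas, not a measurement, and the function name `ideal_depths` is made up for this example.

```python
import math

def ideal_depths(n):
    """Return (levels in a balanced k-D tree, levels in a balanced point quadtree)."""
    # a k-D tree node has 2 children; a point quadtree node has 4,
    # and log4(n) equals log2(n) / 2
    return math.ceil(math.log2(n)), math.ceil(math.log2(n) / 2)

for n in (1_000, 100_000, 1_000_000):
    kd, quad = ideal_depths(n)
    print(f'{n:>9,} points: ~{kd} levels (k-D), ~{quad} levels (quadtree), '
          f'{n:,} steps if degenerate')
```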

The above discussion, however, is theoretical. In practice, the actual computational time may follow the overall trend as predicted, but many other factors also have a significant impact on performance. For example, the wall-clock time can vary a lot depending on whether the program is compiled into binary code (as with C/C++ programs) or interpreted (as with Python and Java). Generally speaking, interpreted programming languages such as Python are less efficient in terms of actual running time because the code must be interpreted as it runs. It should be noted that neither Python nor Java is an interpreted language in the original sense, where the interpreter literally goes through the source line by line every time the program runs. Instead, they compile the code into an intermediate representation (bytecode) that runs faster. Even so, interpreted languages are generally slower than compiled languages such as C/C++. The difference may not be noticeable for small data (and we probably don't care), but it becomes large when we deal with large data sets.

Aside from the programming language, how the algorithms are actually implemented will be a factor too. For example, the use of recursive functions, as convenient as it is, slows down the algorithm because of the repeated recursive function calls. 
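As a rough illustration of the cost of recursive calls, the sketch below counts the nodes on a simple chain twice, once with recursion and once with a loop, and times both with `timeit`. The chain of tuples and the two function names are invented for this example and are unrelated to the tree classes used later; the actual timings depend on the machine and the Python version, so no numbers are quoted here.

```python
import timeit

def count_recursive(node):
    """Count the nodes on a chain by recursion (one function call per node)."""
    if node is None:
        return 0
    return 1 + count_recursive(node[1])

def count_iterative(node):
    """Count the same nodes with a loop (no extra function calls)."""
    n = 0
    while node is not None:
        n += 1
        node = node[1]
    return n

# build a chain of 500 nodes, where each node is (value, next_node)
chain = None
for v in range(500):
    chain = (v, chain)

assert count_recursive(chain) == count_iterative(chain) == 500
t_rec = timeit.timeit(lambda: count_recursive(chain), number=2000)
t_itr = timeit.timeit(lambda: count_iterative(chain), number=2000)
print(f'recursive: {t_rec:.3f} s, iterative: {t_itr:.3f} s')
```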

The following are some commands that can be used in the notebook to get info about the system.

In [None]:
# Linux (colab), macOS
!uname -a
print()

# The following lists CPU details

# linux (colab)
# !lscpu

# macOS
# in the report, hw = hardware, machdep = machine dependent
# !sysctl -a | grep cpu

# Windows
# !wmic cpu get caption, deviceid, name, numberofcores, maxclockspeed

Also we can check the version of Python:

In [None]:
import sys
sys.version
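If a single cell that works on every operating system is preferred, the standard `platform` and `os` modules report similar information without shell commands. This is a small sketch offered as an alternative; the dictionary name `info` is just for illustration.

```python
import os
import platform

info = {
    'system': platform.system(),          # e.g. 'Linux', 'Darwin', or 'Windows'
    'release': platform.release(),
    'machine': platform.machine(),        # e.g. 'x86_64' or 'arm64'
    'python': platform.python_version(),
    'logical_cpus': os.cpu_count(),
}
for key, value in info.items():
    print(f'{key:>12}: {value}')
```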

## 1. Performance of query using k-D trees and point quadtrees

Here, we put our k-D tree and point quadtree algorithms to the test. We compare the performance of these trees with each other and with the linear (brute-force) search approach. We test the performance by systematically increasing the size of the data and seeing how each approach keeps up. The following are the packages that we will use.

In [None]:
# Uncomment the following if needed in Jupyter notebook to clone the github repos
# !git clone https://github.com/gisalgs/geom.git
# !git clone https://github.com/gisalgs/indexing.git

In [None]:
from geom.point import *
from indexing.kdtree1 import *
from indexing.kdtree2a import *
from indexing.kdtree3 import *
from indexing.pointquadtree1 import *
from indexing.pointquadtree3 import *

from random import random, sample, uniform
import time
import copy

To better organize our experiments, let's use the following dictionary for the parameters. This touches on a bigger topic: not hard-coding our programs. The idea is to move the actual numbers (parameters) out of the program and put them in a separate data structure. In many applications, this means we put them in a file called the configuration file. This is an important part of computer software.

In general, there are a number of ways to organize the parameters of a computer program (other than hard-coding them). The traditional way is to put them in an INI file, which has a structure that the program can read. The following is an example:

```ini
# config.ini
[test]
numpoints = 10000
numfound = 100

[report]
verbose = True
```

And we can use Python's `configparser` module to get the data into our program.
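A minimal sketch of reading these settings with `configparser` is shown below. To keep the example self-contained, the INI text is held in a string (`ini_text`, a name chosen here) and parsed with `read_string`; with a real file on disk, `config.read('config.ini')` would be used instead.

```python
import configparser

ini_text = """
[test]
numpoints = 10000
numfound = 100

[report]
verbose = True
"""

config = configparser.ConfigParser()
config.read_string(ini_text)

# every value is stored as a string, so convert explicitly
numpoints = config.getint('test', 'numpoints')
numfound = config.getint('test', 'numfound')
verbose = config.getboolean('report', 'verbose')
print(numpoints, numfound, verbose)
```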

Similarly, the information in the INI file can be organized in a JSON format:

```json
{
    "test": {
        "numpoints": 10000,
        "numfound": 100
    },
    "report": {
        "verbose": true
    }
}
```

We can use the `json` module to handle a file like this (saved, for example, as `config.json`). Two details matter here: JSON has no comment syntax, so a line starting with `//` or `#` cannot appear in the file, and boolean values are written in lowercase (`true`, `false`) rather than as Python's `True` and `False`. The `json` module raises an error if either rule is violated.
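The sketch below loads the same parameters with `json.loads` from a string (the variable name `json_text` is chosen for this example); `json.load(f)` would do the same for an open file. Strict JSON spells booleans in lowercase and does not allow comments, and the last few lines show the parser rejecting a capitalized `True`.

```python
import json

json_text = """
{
    "test": {"numpoints": 10000, "numfound": 100},
    "report": {"verbose": true}
}
"""

config = json.loads(json_text)
print(config['test']['numpoints'], config['report']['verbose'])

# a capitalized True (or a // comment) is not valid JSON
try:
    json.loads('{"verbose": True}')
except json.JSONDecodeError as err:
    print('invalid JSON:', err.msg)
```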

A more recent and popular way is to organize the parameters in a YAML file. YAML is a human-readable language and its name stands for YAML Ain't Markup Language. Its developers don't want it to be treated as a markup language, which can be tedious and lengthy; YAML instead strives for simplicity and minimalism. When we are happy, we can just call this Yay MAIL. The sample data can then be encoded in a YAML file as

```yaml
# config.yaml
test:
    numpoints: 10000
    numfound: 100
report:
    verbose: True
```

The third-party PyYAML package, imported as `yaml`, can be used to handle this.

Some interesting discussions can be found on [stackoverflow](https://stackoverflow.com/questions/5055042/whats-the-best-practice-using-a-settingsconfig-file-in-python).

Now, in our case, we don't need to make things as complicated as handling configuration files. We can organize things in a dictionary and then retrieve the values from there. 


In [None]:
params = {
    "test": {
        "numpoints": 10000,
        "numfound": 100
    },
    "report": {
        "verbose": True
    }
}

Now we write a function to do the testing. This function has one input, the dictionary described above. The parameters in the dictionary are retrieved and assigned to variables. Among the parameters, a boolean value specifies whether we want verbose (wordy) output.
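One way to pull the values out of the nested dictionary is sketched below. The helper name `unpack_params` and the fallback to silent mode when the `report` section is missing are illustrative choices for this sketch, not requirements of the exercise.

```python
def unpack_params(params):
    """Return (numpoints, numfound, verbose) from the nested dictionary."""
    numpoints = params['test']['numpoints']
    numfound = params['test']['numfound']
    # fall back to silent mode when the 'report' section is missing
    verbose = params.get('report', {}).get('verbose', False)
    return numpoints, numfound, verbose

example = {
    'test': {'numpoints': 10000, 'numfound': 100},
    'report': {'verbose': True},
}
npts, nfound, verbose = unpack_params(example)
if verbose:
    print(f'searching for {nfound} of {npts} points')
```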

## <font color="red">Question 1</font>

Complete the `# TODO` part below.

In [None]:
def test(params):
    '''
    A function that evaluates the performance of four different types of search:
        1. a k-D tree based on the order of points that is given
        2. a balanced k-D tree
        3. a point quadtree based on the order of points in a list
        4. a brute-force approach (linear search)

    INPUT
        params  - a nested dictionary that contains the following entries:

            params['test']['numpoints']  - the number of points to be searched from;
                                           the actual points will be randomly generated,
                                           with the coordinates of each point ranging from 0 to 1
            params['test']['numfound']   - the number of points to be searched for;
                                           these points will be randomly sampled from the generated points
            params['report']['verbose']  - a boolean value (True - print out more info, False - no print out);
                                           when True, print out in the following format:
                             10000 |  0.135  0.050  0.065 |  0.001  0.001  0.001 |  0.073
                                           where the numbers are npts, the time to build the k-D tree, the balanced k-D tree, and the point quadtree, and
                                           the time to search using the k-D tree, balanced k-D tree, point quadtree, and linear search, respectively.

    OUTPUT
        a tuple containing the 8 numbers mentioned above in the "verbose" section.

    Example

        >>> params = {
                'test': {'numpoints': 10000, 'numfound': 100},
                'report': {'verbose': True}
            }
        >>> t1 = test(params)
          10000 |  0.135  0.050  0.065 |  0.001  0.001  0.001 |  0.073
        >>> print(t1)
        (10000, 0.13518762588500977, 0.04988360404968262, 0.0650629997253418, 0.0009088516235351562, 0.0006577968597412109, 0.0006163120269775391, 0.0731801986694336)
    '''
    # TODO: complete the code below



Here is a quick demo of this function, searching for 100 random points among 10,000 points. The `params` dictionary has an entry called `verbose` that can be used to make the function run silently without printing anything. Printing the current result, however, can be a useful feature if we want to know how the program is progressing during a long wait.

In [None]:
t1 = test(params)

print(t1)

The above quick test clearly shows the efficiency of using the indexing method for query. It also shows that building the tree may need some significant amount of time. 

Now we give it a more systematic test. More specifically, we use different numbers of points, ranging from 100,000 to **1,000,000**, with a step of 100,000. The experiments will be done on different systems (local or cloud) and the numbers will be different.

In [None]:
%%time
time1 = time.time()
alltime = []

for i in range(100000, 1000001, 100000):
    params = {
        "test": {
            "numpoints": i,
            "numfound": 100
        },
        "report": {
            "verbose": True
        }
    }
    res = test(params)
    alltime.append(res)

time2 = time.time()

The following code reports some numbers, including the total time in minutes and the time spent on the trees. Almost all of the tree time is used to construct the trees, so using a tree for search takes very little time.

In [None]:
t1 = (time2-time1)/60
t2 = sum([sum(alltime[i][1:]) for i in range(len(alltime))])/60
t3 = sum([sum(alltime[i][1:7]) for i in range(len(alltime))])/60
t4 = sum([sum(alltime[i][1:4]) for i in range(len(alltime))])/60
print(f'total computing time: {t1:.1f} minutes')
print(f'Total processing time: {t2:.1f} minutes')
print(f'Total time on trees: {t3:.1f} minutes')
print(f'Tree construction time: {t4:.1f} minutes')
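The slices used above are easier to follow on a single result row. The sketch below uses the rounded numbers from the example in the `test` docstring; the variable names are chosen for illustration only.

```python
# one result row: (npts, 3 construction times, 3 tree search times, linear search time)
row = (10000, 0.135, 0.050, 0.065, 0.001, 0.001, 0.001, 0.073)

build = sum(row[1:4])        # k-D tree, balanced k-D tree, point quadtree construction
tree_total = sum(row[1:7])   # construction plus the three tree-based searches
everything = sum(row[1:])    # tree_total plus the linear search
print(f'{build:.3f} {tree_total:.3f} {everything:.3f}')
```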

As a way of comparison, we can find [past results](https://github.com/gisalgs/notebooks/blob/main/computational-issues-past-results.md) at the github repo. It is interesting to see how performance varies among computers and even Python versions.


We now plot the results to better visualize the differences in performance. Here is a look at the construction times for the different kinds of trees:

In [None]:
sys.version_info

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
x = [ alltime[i][0]/1000 for i in range(len(alltime))]
plt.plot(x, [ alltime[i][1] for i in range(len(alltime))], label='k-D tree')
plt.plot(x, [ alltime[i][2] for i in range(len(alltime))], label='k-D tree (balanced)')
plt.plot(x, [ alltime[i][3] for i in range(len(alltime))], label = 'point quadtree')
plt.legend(loc='upper left')
plt.xlabel('Number of points (x1000)')
plt.ylabel('Seconds')
plt.title('Time for tree construction')
plt.show()

The benefit of using the tree for query is obvious:

In [None]:
plt.plot(x, [ alltime[i][7] for i in range(len(alltime))], label='linear')
plt.plot(x, [ alltime[i][4] for i in range(len(alltime))], label='k-D tree')
plt.plot(x, [ alltime[i][6] for i in range(len(alltime))], label='point quadtree')
plt.plot(x, [ alltime[i][5] for i in range(len(alltime))], label='k-D tree (balanced)')
plt.legend(loc='upper left')
plt.xlabel('Number of points (x1000)')
plt.ylabel('Seconds')
plt.title('Time to query 100 points')
plt.show()

The trend in query time across the three trees is not clear from the test we just did, but we can still see below that the balanced k-D tree sits at the bottom of the three curves, showing the efficiency of the balanced tree.

In [None]:
plt.plot(x, [ alltime[i][4] for i in range(len(alltime))], label='k-D tree')
plt.plot(x, [ alltime[i][5] for i in range(len(alltime))], label='k-D tree (balanced)')
plt.plot(x, [ alltime[i][6] for i in range(len(alltime))], label='point quadtree')
plt.xlabel('Number of points (x1000)')
plt.ylabel('Seconds')
plt.title('Time to query 100 points')
plt.legend(loc='right', bbox_to_anchor=(1.45, 0.5))
plt.show()

## 2. Performance of orthogonal range search

The parameters we use to test orthogonal range search will be different from our previous experiments.

In [None]:
params2 = {
    "geom": {
        "bounds": [[10,1000], [10,1000]],
        "width": 20,
        "height": 20
    },
    "test": {
        "numpoints": 100000,
        "numfound": 100,
        "numrepeat": 10
    },
    "report": {
        "verbose": True
    }
}

We first define a few functions to make it convenient for testing different cases.

A note on changes: the textbook (*GIS Algorithms*) has the following line to create random points. 

`randpoints0 = [Point(randrange(xmin, xmax), randrange(ymin, ymax)) for i in range(npts)]`

However, `randrange` returns only integers, which will likely produce duplicate points, especially when the range is relatively small. Here we use `uniform` from the `random` module to generate random points.
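The effect is easy to see on a deliberately small range. In the sketch below, plain `(x, y)` tuples stand in for `Point` objects so that the example runs on its own, and the range of 0 to 10 is an assumption chosen to exaggerate the problem.

```python
from random import randrange, uniform, seed

seed(42)                       # fixed seed so the sketch is repeatable
n = 1000
xmin, xmax, ymin, ymax = 0, 10, 0, 10

# integer coordinates: only 10 x 10 = 100 distinct locations are possible
int_points = [(randrange(xmin, xmax), randrange(ymin, ymax)) for _ in range(n)]
# floating-point coordinates: collisions are extremely unlikely
float_points = [(uniform(xmin, xmax), uniform(ymin, ymax)) for _ in range(n)]

print('distinct integer points:', len(set(int_points)))
print('distinct float points:  ', len(set(float_points)))
```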

## <font color="red">Question 2</font>

Complete the `# TODO` part below.

In [None]:
# A rectangle is defined as [ [xmin, xmax], [ymin, ymax]]

def in_rect(p, rect):
    x, y = p.x, p.y
    if not (rect[0][0]>x or rect[0][1] < x or rect[1][0]>y or rect[1][1] < y):
        return True
    return False

def rectangular_linear(points, rect):
    l = []
    for p in points:
        if in_rect(p, rect):
            l.append(p)
    return l

def test_rect_find(params):
    """
    Tests the performance of orthogonal search using a balanced k-D tree and brute-force.

    INPUT

        params   - a dictionary that contains the width, height of the rectangle for search, 
                   the rectangle defining the area where the random points will be placed,
                   the total number of points in the area,
                   the number of times the search will be repeated.
    
        Every time, the search rectangle (width and height specified by the user in params)
        will be randomly decided. The experiment will be repeated a number of times,
        also specified by the user in params. The points will be the same for the repeats.

    OUTPUT
    
        Returns the AVERAGE times of using the k-D tree and linear search, respectively.
    """
    # TODO
    #
    # Complete the function that tests the performance of orthogonal range query



Here is an example of using it:

In [None]:
res = test_rect_find(params2)

We hypothesize that using a k-D tree will help rectangular query, but the increase of the rectangle size will increase the time used to query. We test two things here:

1. when will the additional computation caused by the increase of the rectangle exceed the efficiency of using a k-D tree?
2. what is the impact of increasing the problem size (total number of points)?

We measure the average time used per query for each configuration. The following code will take a significant amount of time to run. It is important to let the computer run with the power plugged in and not to disturb it with other heavy tasks such as watching movies or gaming. We will be better off making lunch or doing a workout while the computer finishes.

In [None]:
%%time

results = []
params2 = {
    "geom": {
        "bounds": [[10,1000], [10,1000]],
        "width": 20,
        "height": 20
    },
    "test": {
        "numpoints": 100000,
        "numfound": 100,
        "numrepeat": 10
    },
    "report": {
        "verbose": False
    }
}

for i in range(100000, 1000001, 100000):
    for w in [25, 50, 100, 200, 400, 600, 800]:
        params2['geom']['width'] = w
        params2['geom']['height'] = w
        params2['test']['numpoints'] = i
        
        x = test_rect_find(params2)
        res = i, w, x[0], x[1]
        results.append(res)

for r in results:
    print(r)

Please note that there is a reason the above code prints out these tedious results. One of the questions for this module asks for a program that computes the total time of the above experiment based on this output. This will be the total time used on the computer that produced this tutorial (a NUC 10), and it will be interesting to see how each of our own computers fares against it.


## <font color="red">Question 3</font>

The above cell reports the total time used by the cell magic command `%%time`. However, what is the time used for search? Will this time be different from the total time used by the entire cell? In this question, we answer two questions:

(1) what is the time used for search (including the time for both the k-D tree and the linear search)? We have the search times recorded in `results`, but it is important to remember that these are average times, so it will be necessary to multiply them by the number of repeats.

(2) what causes the difference between the time for search and the total time reported by the magic command?



In [None]:
# TODO
#
#    Double click on this cell and write your code to answer question (1). Then write your 
#    response to question (2) either as comments below or in a markdown cell.



## <font color="red">Question 4</font>

Now, at this point, we have a lot of data in `results` to visualize. We want to draw a series of 10 plots in a row where each corresponds to one of the 10 sizes (from 100 K to 1 million). On each plot, the horizontal axis is the width of rectangle (from 25 to 800) and the vertical is the time. 

To do so, we need to **reorganize** the data in `results` using a dictionary where the keys are the data sizes, and the value associated with each key is a list of three lists: a list of all the widths in hundreds associated with the size,  all the times used to search the k-D tree, and times for linear search. The following is an example of the first two items in the dictionary where only the first two values in each list are shown:

```python
{
100000: [ [0.25, 0.5, ...], [0.00013082027435302735, 0.0003963470458984375], [0.015203642845153808, 0.015846920013427735] ],
200000: [ [0.25, 0.5, ...], [0.00014896392822265624, 0.0004911422729492188], [0.01960759162902832, 0.0219163179397583] ]
}
```

We initialize the dictionary using a dictionary comprehension with empty lists and then append the corresponding values from `results` using a loop. The code should be completed below:

In [None]:
# initiate the dictionary
new_data = {w: [[], [], []] for w in range(100000, 1000001, 100000)}

# TODO: Complete the code to populate data into the dictionary



The newly organized data will prove convenient when we draw the plots:

In [None]:
# TODO
#
# Code to visualize the results



## 3. Performance of nearest neighbor search

Now we test the performance of nearest neighbor search using three methods: k-D tree, point quadtree, and linear search (`nn_linear`). We did not discuss nearest neighbor search in class this semester, but the algorithms are similar to those of orthogonal and circular searches. Please refer to Sections 5.1.3 and 6.2 of *GIS Algorithms* for more detailed discussions about nearest neighbor search.
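For intuition, the brute-force baseline amounts to computing the distance from the query point to every point and keeping the closest few. The sketch below does this with `heapq.nsmallest`, which avoids sorting the full list. It is a swapped-in variant for illustration, not the notebook's own function: the name `nearest_brute_force` is hypothetical and plain `(x, y)` tuples stand in for `Point` objects.

```python
import heapq
import math

def nearest_brute_force(p, points, k=3):
    """Return the k points closest to p as (point, distance) pairs, nearest first."""
    pairs = ((q, math.hypot(p[0] - q[0], p[1] - q[1])) for q in points)
    # nsmallest keeps only k candidates at a time instead of sorting everything
    return heapq.nsmallest(k, pairs, key=lambda pair: pair[1])

pts = [(0, 0), (3, 4), (1, 1), (10, 10), (0, 2)]
for q, d in nearest_brute_force((0, 0), pts, k=3):
    print(q, round(d, 3))
```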

Here are some necessary functions for the testing.

## <font color="red">Question 4</font>

Complete the `# TODO` part below.

In [None]:
params3 = {
    "geom": {
        "bounds": [[10,1000], [10,1000]]
    },
    "test": {
        "numpoints": 100000,
        "numfound": 25,
        "numrepeat": 10
    },
    "report": {
        "verbose": True
    }
}

def nn_linear(p, points, n_neighbor=10):
    '''Linear search, or exhaustive search, or brute-force search'''
    # pair every point with its distance to p
    pairs = [(z, p.distance(z)) for z in points]
    # sort by distance and keep the n_neighbor closest points
    pairs.sort(key=lambda pair: pair[1])
    return pairs[:n_neighbor]

def test_nn_find(params):
    '''
    Tests the performance of nearest neighbor search using a balanced k-D tree, 
    a point quadtree, and the brute force approach. The brute-force approach (linear search)
    is done using the above function called nn_linear.

    INPUT
        params   - a dictionary that contains
                   the rectangle (bounds) defining the area where the random points will be placed,
                   given as [ [xmin, xmax], [ymin, ymax] ],
                   the number of points to search from (numpoints),
                   the number of neighbors to find (numfound),
                   the number of times to repeat the search (numrepeat).

    OUTPUT
        check the code and text (answer this in Question 3 below)
    '''

    # TODO
    #
    # Complete the function here



In [None]:
res = test_nn_find(params3)

Now we test a few configurations. We search for up to 800 nearest points (note this is not the same as 800 in the previous experiment where 800 is the width of the rectangle). In our tests below, 800 points is really a small portion of all points. More on this later.

In [None]:
%%time

results_nn = []
params3 = {
    "geom": {
        "bounds": [[10,1000], [10,1000]]
    },
    "test": {
        "numpoints": 100000,
        "numfound": 25,
        "numrepeat": 10
    },
    "report": {
        "verbose": True
    }
}

for n in range(200000, 1000001, 200000):
    for i in [25, 50, 100, 200, 400, 800]:
        params3['test']['numpoints'] = n
        params3['test']['numfound'] = i
        
        x = test_nn_find(params3)
        res = n, i, x[0], x[1], x[2]
        results_nn.append(res)

## <font color="red">Question 5</font>

Plot a figure that can show the time complexity trend of nearest neighbor search using a tree (a k-D tree, a point quadtree, or both) from the above experiment. Use the data in `results_nn` to do this. We can use something very similar to the previous section, but note now we have three times, k-D tree, quadtree, and linear search, respectively.

In [1]:
# TODO
#
#    Write your code here to answer the above question.



Generally, finding 800 nearest neighbors of a point on a tree of 900,000 points is a piece of cake! However, before we can be more conclusive, there are more tests to do: what is the downside of using a k-D tree? We know that constructing such a tree takes time, and from the previous experiments we also know that at some point a k-D tree search may do more work than a linear search because it has to traverse the tree back and forth too many times. Does this happen to nearest neighbor search using a k-D tree too? Here are some quick tests that should give us a good idea about this last point!

In [None]:
%%time

params3 = {
    "geom": {
        "bounds": [[10,1000], [10,1000]]
    },
    "test": {
        "numpoints": 100000,
        "numfound": 25,
        "numrepeat": 10
    },
    "report": {
        "verbose": True
    }
}

params3['test']['numfound'] = 10
res = test_nn_find(params3)

params3['test']['numfound'] = 25
res = test_nn_find(params3)

params3['test']['numfound'] = 10000
res = test_nn_find(params3)

We can do some similar tests, but on a much smaller data set (and therefore much smaller trees):

In [None]:
%%time

params3 = {
    "geom": {
        "bounds": [[10,1000], [10,1000]]
    },
    "test": {
        "numpoints": 250,
        "numfound": 25,
        "numrepeat": 10
    },
    "report": {
        "verbose": True
    }
}

params3['test']['numfound'] = 10
res = test_nn_find(params3)

params3['test']['numfound'] = 50
res = test_nn_find(params3)

params3['test']['numfound'] = 100
res = test_nn_find(params3)

params3['test']['numfound'] = 20
res = test_nn_find(params3)