
# **Spatial Pattern Analysis**

## Computer lab notebook for [DDLS 2023 course](https://ddls.aicell.io/course/ddls-2023), module 3.

In this notebook, we will cover Ripley’s K curve testing, DBSCAN and Voronoi Tessellation.

Here are the references:
 - [Ripley K demo notebook](https://github.com/GeostatsGuy/PythonNumericalDemos/blob/master/Ripley_K_demo.ipynb)
 - [DBSCAN demo in scikit-learn](https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html)
 - [SR-Tesseler: a method to segment and quantify localization-based super-resolution microscopy data](https://www.nature.com/articles/nmeth.3579)


## Prerequisites
Before we start the computer lab, please use Google Search or consult ChatGPT to find answers to the following questions:

 - What is Ripley’s K curve and how is it useful in spatial point pattern analysis?
 - How does the DBSCAN algorithm work and what distinguishes it from other clustering algorithms like K-means?
 - What is Voronoi tessellation, and how can it be used to analyse localization-based super-resolution microscopy data?
 - How do clustering algorithms like DBSCAN or spatial statistics like Ripley’s K curve contribute to understanding cellular processes or diseases?

 You can also find the answers to many of these questions in the lecture slides and the references above.


## Getting started

 * Click "Connect" in the top right corner (we won't need GPU this time, so the default runtime type will work)
 * Press `Ctrl + S` or use the `File` menu to save the current notebook to your Google Drive

Now run the following cell to set up some libraries.

In [None]:
!pip install -U shareloc-utils

### Download Super-resolution microscopy data from shareloc.xyz

ShareLoc.XYZ is a website for sharing super-resolution microscopy data.

Here we download an image of microtubules from: https://shareloc.xyz/#/r/7234160

You can visualize the image [here](https://shareloc.xyz/shareloc-potree-viewer.html?pointShape=circle&pointSizeType=adaptive&unit=nm&name=cell-8/data.smlm&load=https://imjoy-s3.pasteur.fr/public/pointclouds/10.5281/zenodo.7234161/cell-8/data.potree.zip).

In [None]:
!wget https://zenodo.org/api/files/0d7b0d10-0503-416f-9a64-6fc553c041a4/cell-8/data.smlm -O cell-8.smlm

## Load the SMLM dataset

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import distance_matrix
from shareloc_utils.smlm_file import read_smlm_file, plot_histogram

def load_data(file_path, points=None):
    # parse the .smlm file
    manifest = read_smlm_file(file_path)
    # one file can contain multiple localization tables
    tables = manifest["files"]
    data = tables[0]["data"]
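    # Rescale the coordinates to [-1, 1] (this assumes the smallest coordinate is close to 0)
    x_real = data['x'] / data['x'].max() * 2 - 1.0
    y_real = data['y'] / data['y'].max() * 2 - 1.0
    # optionally keep only the first `points` localizations
    if points is not None:
        x_real, y_real = x_real[:points], y_real[:points]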

    # Compute the squared distances from the origin for each point
    squared_distances = x_real ** 2 + y_real ** 2

    # Create a boolean mask for points that lie within the circle (squared distance <= 1)
    inside_circle_filter = squared_distances <= 1

    # Use the boolean mask to filter both x_real and y_real
    x_real = x_real[inside_circle_filter]
    y_real = y_real[inside_circle_filter]
    return x_real, y_real


# load real data with specified point number
x_real, y_real = load_data("cell-8.smlm", points=10000)

# generate a histogram image for the data and display it
histogram = plot_histogram({"x": x_real, "y": y_real}, pixel_size=0.01, value_range=(0, 10))
plt.figure()
plt.imshow(histogram)


# print data size and range information
print(len(x_real), len(y_real), x_real.min(), x_real.max(), y_real.min(), y_real.max())


# Spatial Pattern Analysis: Ripley's K Curve
This part of the notebook introduces you to Ripley's K function, a statistical test used in spatial pattern analysis. We'll start by generating random x, y coordinates to simulate Complete Spatial Randomness (CSR), then we'll apply Ripley's K curve to the real Single Molecule Localization Microscopy (SMLM) dataset you've just downloaded.


For convenience, the K values can also be computed with [ripleyk.calculate_ripley](https://github.com/SamPIngram/RipleyK/tree/master); in this notebook we write a simple implementation ourselves.


### Generate Random x, y Coordinates


In [None]:
# Number of points
N = len(x_real)

# Generate random x and y coordinates
import random
import numpy as np
xs = []
ys = []
radius = 1
random.seed(0)
for i in range(0,N):
    positioned = False
    while positioned is False:
        x = random.uniform(-radius, radius)
        y = random.uniform(-radius, radius)
        if (x**2)+(y**2) < radius**2:
            xs.append(x)
            ys.append(y)
            positioned = True
x_random = np.array(xs)
y_random = np.array(ys)
print(len(x_random), x_random.min(), y_random.max())


# generate a histogram image for the random coordinates
histogram = plot_histogram({"x": x_random, "y": y_random}, pixel_size=0.01, value_range=(0, 1))
plt.imshow(histogram)


## Function to Compute Ripley's K Curve

Let's create the Ripley's K function first.

Note that this is a very simple implementation of Ripley's K function; feel free to change it to other variants.
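
For reference, a common form of the naive estimator (without edge correction) for $N$ points in a study region of area $A$ is given below; the implementation in this notebook follows the same idea. Here $d_{ij}$ is the distance between points $i$ and $j$, and $\mathbf{1}[\cdot]$ is the indicator function.

$$\hat{K}(r) = \frac{A}{N(N-1)} \sum_{i=1}^{N} \sum_{j \neq i} \mathbf{1}\left[d_{ij} < r\right]$$

Under complete spatial randomness the expected value is $K(r) = \pi r^2$, so values above this curve suggest clustering at scale $r$ and values below it suggest dispersion.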

In [None]:
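def ripleys_k_function(x, y, max_radius, area):
    """Naive Ripley's K estimate without edge correction."""
    N = len(x)
    points = np.column_stack((x, y))
    distances = distance_matrix(points, points)
    # exclude self-pairs (the zero diagonal) from the neighbor counts
    np.fill_diagonal(distances, np.inf)
    radii = np.linspace(0, max_radius, 20)
    K_values = []

    for r in radii:
        # number of ordered pairs (i, j), i != j, closer than r
        neighbors = np.sum(distances < r)
        K = neighbors / (N * (N - 1)) * area
        K_values.append(K)

    return radii, K_values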

### Compute Ripley's K Curve for random points

In [None]:
# the points lie inside the unit circle, so the study area is pi * radius**2
radii, K_values_random = ripleys_k_function(x_random, y_random, 0.2, np.pi)

plt.plot(radii, K_values_random)
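
As a sanity check for a curve computed on random points, the estimate can be compared with the theoretical CSR value $K(r) = \pi r^2$. The sketch below is a self-contained illustration rather than part of the lab code: the helper `naive_ripley_k`, the sample size and the random seed are assumptions chosen for the example. No edge correction is applied, so the estimate is expected to fall slightly below the theoretical value, more so at larger radii.

```python
import numpy as np
from scipy.spatial.distance import pdist

def naive_ripley_k(points, radii, area):
    """Edge-uncorrected Ripley's K estimate (illustrative helper, not a library API)."""
    n = len(points)
    # pdist returns each unordered pair once, so double the count for ordered pairs
    pair_distances = pdist(points)
    return np.array([2 * np.sum(pair_distances < r) * area / (n * (n - 1)) for r in radii])

# Sample points uniformly in the unit disk by rejection from the enclosing square
rng = np.random.default_rng(0)
candidates = rng.uniform(-1, 1, size=(4000, 2))
points = candidates[np.sum(candidates**2, axis=1) < 1][:1000]

radii = np.array([0.05, 0.1, 0.2])
k_estimated = naive_ripley_k(points, radii, area=np.pi)
k_theory = np.pi * radii**2

# Ratios close to (and slightly below) 1 are consistent with CSR
ratio = k_estimated / k_theory
print(ratio)
```

Using `pdist` instead of a full distance matrix halves the memory needed, which matters once the number of points reaches the thousands.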


## Exercise 1: Compute Ripley's K Curve for real data

Now complete the following exercises:

 - plot the Ripley's K Curve for the example SMLM image (`cell-8.smlm`)
 - compare the Ripley's K Curve with random points by plotting them in the same plot

Here is some code which you may find useful:
```python
# load real data with specified point number
x_real, y_real = load_data("cell-8.smlm", points=10000)
```

To display the image:
```python
# generate a histogram image for the data and display it
histogram = plot_histogram({"x": x_real, "y": y_real}, pixel_size=0.01, value_range=(0, 10))
plt.figure()
plt.imshow(histogram)
```

In [None]:
# your code here

## Exercise 2: Changing point number

Try changing the number of points when loading the data: use `points=100`, `points=1000` and `points=10000`. For each value, display the histogram image first to get an intuition, then plot the Ripley's K curve and try to explain the differences you see.

In [None]:
# your code here

## DBSCAN Clustering

In this part of the tutorial, we'll explore how to use the DBSCAN clustering algorithm to analyze spatial data. Specifically, we'll apply DBSCAN on the real Single Molecule Localization Microscopy (SMLM) data.

### Prepare Data for Clustering
We'll combine the x and y coordinates into a single data matrix, which will be passed into the DBSCAN algorithm.

In [None]:
from sklearn.cluster import DBSCAN

x_real, y_real = load_data("cell-8.smlm", points=10000)
X = np.column_stack((x_real, y_real))

print(X.shape, X.min(), X.max())


### Perform DBSCAN Clustering: First attempt

Let's try DBSCAN and see if it can detect clusters.

In [None]:
# Define DBSCAN parameters
eps = 0.1  # The radius of the neighborhood
min_samples = 10  # The minimum number of samples in a neighborhood for a point to be considered as a core point

# Apply DBSCAN
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
labels = dbscan.fit_predict(X)

# Number of clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print(f'DBSCAN Clustering: {n_clusters} clusters found')

**As you may have noticed, the detected number of clusters is wrong. This is because we did not set the right parameters, namely `eps` and `min_samples`.**
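
To see how strongly the result depends on `eps`, the following sketch runs DBSCAN on a small synthetic data set. It is an illustration under stated assumptions: the three Gaussian blobs, the helper `count_clusters` and the chosen `eps` values are hypothetical and unrelated to the SMLM data. With a neighborhood larger than the whole data set everything should merge into one cluster, while a neighborhood about the size of one blob should recover the three blobs.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def count_clusters(X, eps, min_samples=10):
    """Run DBSCAN and return the number of clusters, excluding noise (label -1)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return len(set(labels)) - (1 if -1 in labels else 0)

# Three tight, well-separated synthetic blobs (hypothetical data, not the SMLM set)
rng = np.random.default_rng(0)
centers = np.array([[-0.5, -0.5], [0.5, -0.5], [0.0, 0.5]])
X_demo = np.vstack([c + rng.normal(scale=0.03, size=(100, 2)) for c in centers])

n_large_eps = count_clusters(X_demo, eps=2.0)  # neighborhood spans the whole data set
n_small_eps = count_clusters(X_demo, eps=0.1)  # neighborhood is about the size of one blob
print(n_large_eps, n_small_eps)
```

The same data gives different answers purely because of the neighborhood radius, which is why `eps` has to be chosen relative to the scale of the structures you expect.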


### Display the result

Despite the incorrect detection, let's see what it found first.

In [None]:
plt.scatter(x_real, y_real, c=labels, cmap='rainbow')
plt.title(f'DBSCAN Clustering: {n_clusters} clusters found')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.colorbar()
plt.show()


### Exercise 3: Fix the DBSCAN parameters

In this exercise, use ChatGPT to help you find the correct parameters.

Construct a prompt with enough context information. Consider including the following:
   - Set a role for ChatGPT (e.g. `Act as a data scientist`)
   - Describe the setting (e.g. you are using Colab notebooks) and what you are trying to achieve
   - Describe how you loaded the data, provide the script for loading your data:
   ```
   x_real, y_real = load_data("cell-8.smlm", points=10000)
   X = np.column_stack((x_real, y_real))
   print(X.shape, X.min(), X.max())
   ```
   describe what was printed so it knows the data shape and value range.
   - Describe what you have tried (copy the code and the output), what went wrong and what you expect
   - Express your intent, e.g. tell ChatGPT that you want some code which will automatically determine the DBSCAN parameters
   - If the code provided by ChatGPT produces an error, copy and paste the error message and ask it to fix it.

Write down the prompt you used in a text block below.

TODO: Your ChatGPT prompt:

In [None]:
# your code here

## Voronoi Tessellation
In this section, we will explore the Voronoi tessellation method to perform clustering analysis. Voronoi tessellation offers another way to understand spatial point patterns: it partitions the plane into cells, one per point, such that each cell contains all locations that are closer to that point than to any other point.

### Perform Voronoi Tessellation

In [None]:
from scipy.spatial import Voronoi, voronoi_plot_2d

# Generate Voronoi tessellation
vor = Voronoi(X)

# Plot Voronoi tessellation
fig, ax = plt.subplots()
voronoi_plot_2d(vor, ax=ax)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='rainbow')


### Clustering with Voronoi Tessellation

Now we can use Voronoi tessellation to detect clusters. For example, the area of each Voronoi cell can serve as a criterion to determine whether a point is part of a cluster or not.
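
The link between points and cell areas can be sketched on a tiny example. The code below is a minimal illustration, assuming a 3 x 3 grid of points with unit spacing; the helper `voronoi_cell_areas` is hypothetical and uses `scipy.spatial.ConvexHull` to measure each bounded cell, which works because Voronoi cells are convex. Note that `vor.point_region` maps each input point to the index of its region, so the loop runs over points rather than over `vor.regions` directly.

```python
import numpy as np
from scipy.spatial import ConvexHull, Voronoi

def voronoi_cell_areas(points):
    """Return one area per input point; unbounded cells get NaN (illustrative helper)."""
    vor = Voronoi(points)
    areas = np.full(len(points), np.nan)
    for point_idx, region_idx in enumerate(vor.point_region):
        region = vor.regions[region_idx]
        # A region containing -1 extends to infinity, so it has no finite area
        if len(region) == 0 or -1 in region:
            continue
        # Voronoi cells are convex, so in 2D the hull "volume" is the cell area
        areas[point_idx] = ConvexHull(vor.vertices[region]).volume
    return areas

# 3 x 3 grid with unit spacing: only the center point has a bounded cell
grid = np.array([[x, y] for x in range(3) for y in range(3)], dtype=float)
areas_demo = voronoi_cell_areas(grid)
print(areas_demo)
```

Only the center point (index 4) is fully surrounded by neighbors, so it is the only one with a finite cell, a unit square. Points on the border of a data set always have unbounded cells and need to be excluded or handled separately.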


Let's see what the area distribution looks like.


In [None]:
from shapely.geometry import Polygon

areas = []
for region in vor.regions:
    # skip empty regions and unbounded cells (marked by a -1 vertex index)
    if -1 not in region and len(region) > 0:
        polygon = Polygon([vor.vertices[i] for i in region])
        areas.append(polygon.area)

plt.hist(areas, bins=50)
plt.title('Histogram of Voronoi Cell Areas')
plt.xlabel('Area')
plt.ylabel('Frequency')
plt.show()


One idea is to simply filter out regions by an area threshold (inferred from the area distribution), and then use the filtered regions to detect clusters.

In [None]:
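threshold = np.percentile(areas, 16)  # 16th percentile of the cell areas as the threshold

# Identify clusters: collect the indices of points whose Voronoi cell is small
cluster_points = []
for point_idx, region_idx in enumerate(vor.point_region):
    # vor.point_region maps each input point to the index of its region
    region = vor.regions[region_idx]
    # skip empty regions and unbounded cells (marked by a -1 vertex index)
    if -1 not in region and len(region) > 0:
        polygon = Polygon([vor.vertices[j] for j in region])
        if polygon.area < threshold:
            cluster_points.append(point_idx)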

# Plot clusters
plt.scatter(x_real, y_real, c='grey')
plt.scatter(x_real[cluster_points], y_real[cluster_points], c='red')
plt.title('Clusters Detected via Voronoi Tessellation')
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.show()


## Exercise 4: Improve Voronoi Tessellation Clustering Detection

The cluster detection attempt above, which uses cell area as the criterion, is far from ideal. Can you think of other criteria or other methods for detecting the clusters? Feel free to use ChatGPT to brainstorm ideas and guide it to implement the solution. How does the result compare to DBSCAN? When designing the prompt, try to use the tips in Exercise 3.

In [None]:
# your code here

## Submitting your work

After you have completed the exercises in the notebook:
 - During the lab session, tell the lab teacher so they can go through what you have done with you and maybe ask you a few questions.
 - Export the notebook by using `File -> Download -> Download .ipynb`, then submit the notebook file to the [submission form](https://forms.gle/MpYYxyZqPeRF7f9r6).

**Submission Deadline: Before Wednesday at 18:00**

**NOTE: If you cannot join the lab session, please submit the notebook before the deadline, and find the lab teacher in the next lab session to go through what you have done together.**