## Problem Set 1: Clustering streaming data (take-home midterm problem) - 40 points

### Introduction

In your NYC taxi trip data assignment, you built a Parquet dataset that contains all the NYC taxi trips. In reality the data stream is available through Kafka / NATS consumers, but you can [stream the dataset you have uploaded to Hugging Face](https://huggingface.co/docs/datasets/en/stream) and process it with a _streaming_ clustering algorithm that clusters the pickup and drop-off points of the trips into different geographical areas.
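A streaming consumer typically pulls records in small batches instead of loading the whole dataset. The sketch below shows one way to do this over any iterable of records. The helper name `iter_batches`, the batch size, and the field names are illustrative assumptions, and a synthetic generator stands in for the real stream so that the example runs offline.

```python
from itertools import islice
import random


def iter_batches(records, batch_size):
    """Yield lists of at most `batch_size` records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_trip_stream(n_trips, seed=0):
    """Synthetic stand-in for the real stream; the field names are assumptions."""
    rng = random.Random(seed)
    for _ in range(n_trips):
        yield {
            "pickup_latitude": rng.uniform(40.5, 40.9),
            "pickup_longitude": rng.uniform(-74.0, -73.9),
        }


# 25 records consumed in batches of 10 leave a final partial batch of 5.
batch_sizes = [len(batch) for batch in iter_batches(fake_trip_stream(25), 10)]
print(batch_sizes)  # [10, 10, 5]
```

With the Hugging Face `datasets` library, `load_dataset("<user>/<dataset>", split="train", streaming=True)` returns an iterable of dictionaries that could be passed to `iter_batches` in place of `fake_trip_stream`; the repository id here is a placeholder.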

### PS1-1 Features (5 points)

You will select the features that allow you to solve the problem. Consult [this Kaggle notebook](https://www.kaggle.com/code/danielviray/clustering-by-district-using-k-means-algorithm/report) to see the required visualization of the produced clusters.

![](clustering-outcome.png)

Please note that this is only an example of a clustering. Since your dataset may cover more areas, the number of clusters may be larger.
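As one possible starting point for feature selection, the sketch below keeps only the location coordinates and drops points that fall outside an approximate NYC bounding box. The bounding-box values, the function names, and the choice to pool pickup and drop-off points into a single array are assumptions made for illustration, not requirements of the assignment.

```python
import numpy as np

# Approximate NYC bounding box (assumed values; adjust them to your data).
LAT_MIN, LAT_MAX = 40.49, 40.92
LON_MIN, LON_MAX = -74.27, -73.68


def in_nyc(lat, lon):
    """Boolean mask of the points that fall inside the bounding box."""
    return (lat >= LAT_MIN) & (lat <= LAT_MAX) & (lon >= LON_MIN) & (lon <= LON_MAX)


def location_features(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):
    """Stack valid pickup and drop-off points into one (n_points, 2) array of (lat, lon)."""
    pickup_lat, pickup_lon = np.asarray(pickup_lat), np.asarray(pickup_lon)
    dropoff_lat, dropoff_lon = np.asarray(dropoff_lat), np.asarray(dropoff_lon)
    pickups = np.column_stack([pickup_lat, pickup_lon])[in_nyc(pickup_lat, pickup_lon)]
    dropoffs = np.column_stack([dropoff_lat, dropoff_lon])[in_nyc(dropoff_lat, dropoff_lon)]
    return np.vstack([pickups, dropoffs])


# Three toy trips; the second pickup is a (0, 0) GPS glitch and is dropped.
points = location_features(
    pickup_lat=[40.75, 0.0, 40.64],
    pickup_lon=[-73.99, 0.0, -73.78],
    dropoff_lat=[40.71, 40.80, 40.76],
    dropoff_lon=[-74.01, -73.95, -73.98],
)
print(points.shape)  # (5, 2)
```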

### PS1-2 K-Medoids clustering algorithm (15 points)

You will use the [K-medoids](https://scikit-learn-extra.readthedocs.io/en/stable/generated/sklearn_extra.cluster.KMedoids.html) algorithm (see video [here](https://www.youtube.com/watch?v=OFELCn-6r2o)) to do the actual clustering. (5 points)

The Gower distance needs to be used as the distance metric - see [this](https://medium.com/analytics-vidhya/gowers-distance-899f9c4bd553) or [this](https://jamesmccaffrey.wordpress.com/2020/04/21/example-of-calculating-the-gower-distance/) reference. (5 points)
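To make the metric concrete, here is a minimal NumPy sketch of the Gower distance with equal feature weights: numeric features contribute their absolute difference divided by the feature's range, categorical features contribute 0 on a match and 1 otherwise, and the result is the average over all features. The function name and the toy data are assumptions for illustration; a library implementation may differ in details such as how it normalizes numeric features.

```python
import numpy as np


def gower_distance_matrix(numeric, categorical):
    """Pairwise Gower distances with equal feature weights.

    numeric:     (n, p) array of numeric features
    categorical: (n, q) array of category labels
    """
    numeric = np.asarray(numeric, dtype=float)
    categorical = np.asarray(categorical)

    # Numeric features: absolute difference scaled by the feature's range.
    ranges = numeric.max(axis=0) - numeric.min(axis=0)
    ranges[ranges == 0] = 1.0  # a constant feature contributes 0 everywhere
    num_part = np.abs(numeric[:, None, :] - numeric[None, :, :]) / ranges

    # Categorical features: 0 when the labels match, 1 otherwise.
    cat_part = (categorical[:, None, :] != categorical[None, :, :]).astype(float)

    n_features = numeric.shape[1] + categorical.shape[1]
    return (num_part.sum(axis=2) + cat_part.sum(axis=2)) / n_features


numeric = [[10.0], [20.0], [30.0]]         # e.g. trip duration in minutes
categorical = [["Mon"], ["Mon"], ["Sun"]]  # e.g. day of week
D = gower_distance_matrix(numeric, categorical)
print(D)
```

For the first two rows the numeric part is 10 / 20 = 0.5 and the categorical part is 0, so their distance is (0.5 + 0) / 2 = 0.25.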

Explain why we need to use the Gower distance in conjunction with K-medoids and why we couldn't use K-means for this problem. (5 points)

What you will deliver: A batch version of the K-medoids algorithm that is demonstrated to work on a subset of the dataset that is as large as your computer's memory allows.
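For orientation, the sketch below shows a simple alternating K-medoids heuristic that works on a precomputed distance matrix: assign every point to its nearest medoid, then move each medoid to the cluster member with the smallest total distance to the other members. This is a simplified variant and not the full PAM swap search, and the function name, defaults, and toy data are assumptions for illustration.

```python
import numpy as np


def k_medoids(D, k, max_iter=100, seed=0):
    """Alternating k-medoids on a precomputed (n, n) distance matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)

    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)

        # Update step: in each cluster, pick the member with the smallest
        # total distance to the other members of that cluster.
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]

        if np.array_equal(new_medoids, medoids):
            break  # medoids stopped moving
        medoids = new_medoids

    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels


# Two well-separated 1-D groups; any distance matrix (e.g. Gower) could be used.
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
D = np.abs(points[:, None] - points[None, :])
medoids, labels = k_medoids(D, k=2)
print(sorted(points[medoids].tolist()))  # [1.0, 11.0]
```

Because the function only reads the distance matrix, the metric can be swapped without touching the clustering code.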


### PS1-3 Execution on Ray (15 points)

Streaming data mining algorithms like K-medoids are executed in the real world on top of distributed computing platforms; one of the most popular in the domain of machine learning is [Ray](https://docs.ray.io/en/latest/ray-core/examples/gentle_walkthrough.html). See [this video](https://www.youtube.com/watch?v=w7uPnEqYz7A) for a brief overview. Ray will allow you to parallelize the execution of the K-medoids algorithm, i.e. you should see increased utilization across all the cores of your laptop when Ray executes your K-medoids method (or across all the cores of all the servers if you had a multi-machine Ray cluster). For guidance on the kind of parallelism you can apply to this problem, see [this reference](https://github.com/EthanWng97/ray-mapreduce-kmeans).

Note: Ray itself [can be orchestrated by Dagster](https://www.samsara.com/blog/building-a-modern-machine-learning-platform-with-ray/), the subject of another assignment, but to keep things simple such orchestration is not a requirement in this assignment - the emphasis is on implementation and execution of the algorithm in Ray.

What you will deliver: The version of the K-medoids algorithm you implemented earlier except that now it is executed more efficiently via parallel workers/actors on a Ray cluster consisting of a single node (your computer).
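The map-reduce shape of a parallel assignment step can be sketched as follows. To keep the example runnable without a Ray installation, the standard-library `ThreadPoolExecutor` is used here as a stand-in for Ray: with Ray, `assign_chunk` would be decorated with `@ray.remote`, each `pool.submit(assign_chunk, x)` would become `assign_chunk.remote(x)`, and the results would be collected with `ray.get`. The function names and the chunking scheme are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def assign_chunk(dist_to_medoids):
    """Map task: nearest-medoid labels and total cost for one chunk of rows."""
    labels = np.argmin(dist_to_medoids, axis=1)
    costs = dist_to_medoids[np.arange(len(labels)), labels]
    return labels, float(costs.sum())


def parallel_assign(D, medoids, n_chunks=4):
    """Split the rows into chunks, map assign_chunk over them, reduce the results."""
    row_chunks = np.array_split(np.arange(D.shape[0]), n_chunks)
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(assign_chunk, D[rows][:, medoids]) for rows in row_chunks]
        results = [future.result() for future in futures]
    # Reduce: concatenate the labels in row order and add up the partial costs.
    labels = np.concatenate([labels for labels, _ in results])
    total_cost = sum(cost for _, cost in results)
    return labels, total_cost


points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
D = np.abs(points[:, None] - points[None, :])
labels, total_cost = parallel_assign(D, medoids=np.array([1, 4]), n_chunks=3)
print(labels.tolist(), total_cost)  # [0, 0, 0, 1, 1, 1] 4.0
```

Each chunk needs only its own rows of the distance matrix and the current medoid indices, so the map tasks are independent of one another.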


### PS1-4 Deployment (5 points)

For this assignment you need to create a new directory in your GitHub repository called `midterm-take-home` and place all the necessary files in it. You will also need to create a Docker-based environment in that directory whose Python environment includes the libraries essential for this assignment, such as `sklearn`, `ray`, and any other necessary libraries.







```python
import pandas as pd
import numpy as np
from sklearn_extra.cluster import KMedoids
import gower

# For the sake of demonstration, create a small DataFrame with mixed data types.
# 'day_of_week' is categorical; the other columns are continuous.
np.random.seed(42)  # make the demo data reproducible
data = {
    'pickup_latitude': np.random.uniform(low=40.5, high=40.9, size=100),
    'pickup_longitude': np.random.uniform(low=-74.0, high=-73.9, size=100),
    'day_of_week': np.random.choice(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], size=100),
    'duration_minutes': np.random.uniform(low=5, high=60, size=100),
}

df = pd.DataFrame(data)

# Calculate the Gower distance matrix
distance_matrix = gower.gower_matrix(df)

# Use K-Medoids with the precomputed distance matrix
kmedoids = KMedoids(n_clusters=5, metric="precomputed", random_state=42)
kmedoids.fit(distance_matrix)

# Assign clusters back to the DataFrame
df['cluster'] = kmedoids.labels_

print(df.head())
```




```python
import ray
import numpy as np
import pandas as pd
from sklearn_extra.cluster import KMedoids
import gower

ray.init(ignore_reinit_error=True)

@ray.remote
def compute_gower_distance_rows(row_block, full_data):
    """
    Compute the Gower distances between one block of rows and the full dataset.
    Returns an array of shape (len(row_block), len(full_data)).
    """
    return gower.gower_matrix(row_block, full_data)

# For demonstration, create a sample DataFrame
df = pd.DataFrame({
    'pickup_latitude': np.random.uniform(low=40.5, high=40.9, size=100),
    'pickup_longitude': np.random.uniform(low=-74.0, high=-73.9, size=100),
    'day_of_week': np.random.choice(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], size=100),
    'duration_minutes': np.random.uniform(low=5, high=60, size=100),
})

# Put the full DataFrame in the object store once so that all tasks share it
df_ref = ray.put(df)

# Split the row positions into blocks for demonstration
# In a real application, choose the split based on the data size and the available memory
row_blocks = np.array_split(np.arange(len(df)), 10)  # 10 blocks as an example

# Submit one task per block of rows
future_results = [compute_gower_distance_rows.remote(df.iloc[rows], df_ref) for rows in row_blocks]

# Retrieve the results and stack the row blocks into the full distance matrix
distance_matrix = np.vstack(ray.get(future_results))
print(distance_matrix.shape)

# Proceed with K-Medoids clustering on the combined matrix
kmedoids = KMedoids(n_clusters=5, metric="precomputed", random_state=42)
df['cluster'] = kmedoids.fit(distance_matrix).labels_

# Don't forget to shut down Ray
ray.shutdown()
```



```dockerfile
# Use the official Python base image
FROM python:3.8-slim

# Set the working directory in the container to /app
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install the required packages
RUN pip install --no-cache-dir -r requirements.txt

# Command to run the application
CMD ["python", "./clustering.py"]
```
