# Introduction

**Real World Code Example: Analyzing Polyp Progression in FAP Patients**

This notebook explores polyp progression in individuals with familial adenomatous polyposis (FAP), using a refined dataset based on a landmark study of sulindac treatment published in the New England Journal of Medicine in 1993. We'll use K-Means clustering to look for potential subgroups of patients based on their polyp counts over time.

**Key Variables:**

* `age`: Patient's age
* `baseline`: Baseline polyp count
* `number3m`: Polyp count at 3 months post-treatment
* `number12m`: Polyp count at 12 months post-treatment

# Install and Import

In [None]:
import pandas as pd
import io
from pyodide.http import open_url
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load and Prepare Data

In [None]:
# Load data from GitHub
url = "https://raw.githubusercontent.com/arcus/education_modules/python_clustering/python_clustering/data/polyps.csv"
url_contents = open_url(url)
text = url_contents.read()
file = io.StringIO(text)
df = pd.read_csv(file)

# Print data information (df.info() prints directly and returns None)
df.info()

# Select features for clustering
features = ['age', 'baseline', 'number3m', 'number12m']
X = df[features]

# Fill missing values with the mean
X.fillna(X.mean(), inplace=True)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Cluster Data

In [None]:
# Define number of clusters
num_clusters = 3

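# Apply K-Means clustering (n_init is set explicitly so the result does not
# depend on the default of the installed scikit-learn version)
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)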
kmeans.fit(X_scaled)

# Assign cluster labels
df['cluster'] = kmeans.labels_
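
The value `num_clusters = 3` is a choice, not something the data dictates. One common way to check it is to compare silhouette scores across several candidate values of `k`. The sketch below is a minimal illustration on synthetic data: `X_demo`, `silhouette_by_k`, and `best_k` are hypothetical names, and the generated points are a stand-in, not the polyp dataset. In the notebook, the same loop could be run on `X_scaled`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for four numeric features (NOT the polyp data):
# three well-separated groups of 20 points each.
rng = np.random.default_rng(42)
group_centers = np.array([[0, 0, 0, 0], [10, 10, 10, 10], [20, 0, 20, 0]])
X_demo = np.vstack(
    [rng.normal(loc=c, scale=1.0, size=(20, 4)) for c in group_centers]
)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit K-Means for several values of k and record the mean silhouette score.
silhouette_by_k = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X_demo_scaled)
    silhouette_by_k[k] = silhouette_score(X_demo_scaled, labels)
    print(f"k={k}: silhouette={silhouette_by_k[k]:.3f}")

# A higher silhouette score means tighter, better-separated clusters.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print("Best k by silhouette:", best_k)
```

On a small clinical dataset the scores for neighboring values of `k` may be close, so treat the result as a guide rather than a rule.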

# Visualize Clusters

In [None]:
# Plot the clusters on two of the four clustering features
# (polyp counts at 3 and 12 months)
plt.figure(figsize=(10, 8))
colors = ['red', 'blue', 'green']

for i in range(num_clusters):
    cluster_data = df[df['cluster'] == i]
    plt.scatter(cluster_data['number3m'], cluster_data['number12m'],
                color=colors[i], label=f'Cluster {i}')

plt.xlabel('Number of Polyps at 3 Months')
plt.ylabel('Number of Polyps at 12 Months')
plt.title('K-Means Clustering: Polyp Progression (3 vs. 12 Months)')
plt.legend()
plt.show()

# Interpretation and Further Analysis

K-Means assigns each patient to one of three clusters, labeled 0, 1, and 2 in the plot. The label numbers are arbitrary, so check each cluster's polyp counts before attaching a meaning to it. The results suggest potential subgroups of patients based on polyp progression patterns:

* **Low progression:** Potentially stable or slow polyp growth.
* **Moderate progression:** Some increase in polyps over time.
* **High progression:** Substantial increase in polyps over time.
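
To decide which fitted cluster matches which description, it helps to look at the cluster centers in the original units. The sketch below is one possible approach, not a result from the study: it back-transforms the centers with `StandardScaler.inverse_transform` and adds the change between 3 and 12 months. The `demo` rows are made-up illustrative values, and `change_3m_to_12m` and `n_patients` are hypothetical column names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ['age', 'baseline', 'number3m', 'number12m']

# Made-up illustrative rows (NOT the study data).
demo = pd.DataFrame({
    'age':       [17, 20, 22, 23, 34, 42, 19, 28, 25],
    'baseline':  [5, 7, 10, 12, 30, 35, 80, 95, 110],
    'number3m':  [3, 5, 8, 10, 28, 36, 70, 90, 120],
    'number12m': [2, 4, 9, 11, 33, 40, 75, 100, 130],
})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(demo[features])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

# Convert the centers from standardized units back to years and polyp counts.
centers = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_), columns=features
)

# Average change in polyp count between the 3- and 12-month visits.
centers['change_3m_to_12m'] = centers['number12m'] - centers['number3m']

# Number of rows assigned to each cluster label.
centers['n_patients'] = np.bincount(kmeans.labels_, minlength=3)

print(centers.round(1))
```

Each row of `centers` corresponds to one cluster label, so sorting by `change_3m_to_12m` gives a data-driven ordering from lowest to highest progression.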

**Further Analysis (Not Shown):**
* Investigate differences in treatment (sulindac vs. placebo) between clusters.
* Explore other patient characteristics (e.g., age, sex) within each cluster.
* Consider alternative clustering methods or a different number of clusters.
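
The first two follow-up steps can be sketched with a cross-tabulation and a grouped summary. This is a minimal illustration under stated assumptions: the `demo` rows are made up, and the `treatment` column name is hypothetical, so check it against the output of `df.info()` before running the same calls on `df`.

```python
import pandas as pd

# Made-up rows (NOT the study data); 'treatment' is an assumed column name.
demo = pd.DataFrame({
    'cluster':   [0, 0, 0, 1, 1, 1, 2, 2],
    'treatment': ['sulindac', 'sulindac', 'placebo', 'placebo',
                  'placebo', 'sulindac', 'placebo', 'sulindac'],
    'age':       [17, 20, 22, 23, 34, 42, 19, 28],
})

# Count patients per treatment arm within each cluster.
counts = pd.crosstab(demo['cluster'], demo['treatment'])
print(counts)

# Same table as row proportions, so clusters of different sizes compare fairly.
row_share = pd.crosstab(demo['cluster'], demo['treatment'], normalize='index')
print(row_share.round(2))

# Summarize another patient characteristic within each cluster.
age_by_cluster = demo.groupby('cluster')['age'].agg(['mean', 'min', 'max'])
print(age_by_cluster)
```

With only a handful of patients per cluster, differences in these tables are descriptive and should not be read as evidence of a treatment effect.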


Remember, clustering is exploratory. Additional analysis is needed to confirm these patterns and understand the underlying factors influencing polyp progression.