<a href="https://colab.research.google.com/github/dornercr/INFO371/blob/main/week3_Customer_Churn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

🔸 Section: Load the Iris Flower Dataset

import pandas as pd
url = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
df = pd.read_csv(url)

We’re loading the Iris dataset directly from GitHub using a URL. Pandas reads it and stores it in a DataFrame named df.

df.shape

This line returns the shape of the dataset — in this case, 150 rows and 5 columns. That means we have 150 samples, and each sample has four measurements plus one label column for the species.
🔸 Section: Explore Target Classes

species = df.species.unique()
species

This code returns the distinct values in the species column. We should see three species: setosa, versicolor, and virginica, spelled in lowercase in this file. These are our three target classes.
🔸 Section: Convert Labels to Numeric

label_maps = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
y_train = df.species.map(label_maps)

Many machine learning tools expect numeric labels. scikit-learn classifiers also accept string labels, but integer codes are convenient, so we map each species to an integer: setosa is 0, versicolor is 1, and virginica is 2. The .map() method applies this lookup to every row in the species column, producing our label vector y_train.
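One thing worth checking after a dictionary-based .map() is that every label was covered, because values missing from the dictionary come back as NaN instead of raising an error. A minimal sketch, using a small hand-made Series (toy_species is a hypothetical sample, not the Iris file):

```python
import pandas as pd

label_maps = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

# Hypothetical sample with one mis-capitalized label to show the failure mode.
toy_species = pd.Series(['setosa', 'virginica', 'Versicolor', 'versicolor'])

toy_labels = toy_species.map(label_maps)

# Labels not found in the dictionary become NaN instead of raising.
unmapped = toy_species[toy_labels.isna()].tolist()
print(unmapped)
```

An empty list here means every row received a numeric label.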
🔸 Section: Extract Features

X_train = df.iloc[:, 0:-1]

This selects all rows and the first four columns — the feature columns — excluding the last column, which contains the species labels. The result is X_train, a matrix of shape 150 by 4.

X_train.head()

Displays the first five rows of the feature matrix, giving us a quick look at the raw numeric data used to train our model.
🔸 Section: Train K-Nearest Neighbors Classifier

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

We import KNN from scikit-learn, set the number of neighbors to 5, and fit the model using the training data. KNN is a lazy learner — it doesn’t actually do much during training except store the training data. The real work happens during prediction.
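To make the "lazy learner" idea concrete, here is a rough from-scratch sketch of what prediction involves: measure the distance to every stored point, keep the k closest, and take a majority vote. The function name knn_predict and the toy arrays are made up for illustration; scikit-learn's implementation is more elaborate (faster neighbor search, different tie handling).

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Predict one query point by majority vote among its k nearest training points."""
    # Distance from the query to every stored training point.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Tiny made-up training set: two well-separated clusters.
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
pred_high = knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3)
print(pred_low, pred_high)
```

Notice that there is no separate training step at all: the "model" is just the stored arrays.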

knn.score(X_train, y_train)

This returns the model's accuracy on the training set. With this small and clean dataset, you’ll see an accuracy around 96 or 97 percent.
🔸 Section: Predict on Training Data

knn.predict(X_train)

This returns the predicted class for each of the 150 samples using the trained KNN model.
🔸 Section: Evaluate Accuracy

from sklearn.metrics import accuracy_score
accuracy_score(y_train, knn.predict(X_train))

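We import accuracy_score to compute the fraction of predictions that match the actual labels; it gives the same number as knn.score above. Again, this is on the training set, so it's expected to be high. It is not a good measure of generalization, because the model is being tested on the very rows it stored.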
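One common way to get a more honest estimate is to hold out part of the data. The sketch below is an illustration, not part of the original notebook: it uses scikit-learn's bundled copy of Iris (load_iris) to avoid the download, and the 30% split and random_state=0 are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# scikit-learn's bundled copy of Iris, used here to avoid a network download.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows; stratify keeps the three species balanced.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

knn_holdout = KNeighborsClassifier(n_neighbors=5)
knn_holdout.fit(X_tr, y_tr)

train_acc = knn_holdout.score(X_tr, y_tr)   # accuracy on rows the model stored
test_acc = knn_holdout.score(X_te, y_te)    # accuracy on rows it never saw
print(f"train accuracy: {train_acc:.3f}, held-out accuracy: {test_acc:.3f}")
```

The held-out number is the one to report when asking how the model would do on new flowers.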
🔸 Section: Evaluate with Precision, Recall, F1 Score

from sklearn.metrics import precision_recall_fscore_support
y_pred = knn.predict(X_train)
precision_recall_fscore_support(y_train, y_pred)

This function gives us:

    Precision: for each class, the fraction of samples predicted as that class that truly belong to it

    Recall: for each class, the fraction of samples truly in that class that the model labeled correctly

    F1 Score: the harmonic mean of precision and recall

    Support: the number of true instances for each class

These are provided for each of the three species.
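The four quantities can be computed by hand for one class at a time, which helps when reading the arrays the function returns. A minimal sketch with made-up labels (y_true, y_hat, and per_class_metrics are illustrative names; the sketch assumes each class is predicted at least once, so there is no division by zero):

```python
import numpy as np

# Made-up labels for illustration: 3 classes, 8 samples.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2])
y_hat = np.array([0, 0, 1, 1, 1, 2, 2, 2])

def per_class_metrics(y_true, y_hat, cls):
    tp = np.sum((y_hat == cls) & (y_true == cls))   # predicted cls and actually cls
    predicted = np.sum(y_hat == cls)                # everything predicted as cls
    actual = np.sum(y_true == cls)                  # support: true members of cls
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1), int(actual)

metrics = {cls: per_class_metrics(y_true, y_hat, cls) for cls in (0, 1, 2)}
for cls, values in metrics.items():
    print(cls, values)
```

For class 0, both samples predicted as 0 are correct (precision 1.0), but one of the three true 0s was missed (recall 2/3).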
🔸 Section: Motivation for Rescaling

    Text: “Many techniques are sensitive to the scale of your data…”

The next block introduces a different dataset — heights and weights of people — to show why feature scaling matters.

df = pd.DataFrame({
    "Person": ['A', 'B', 'C'],
    "height (cm)": [160, 170.2, 177.8],
    "weight (pounds)": [150, 160, 171],
    "height (inches)": [63, 67, 70]
})

We create a toy dataset with three people. The same concept — height — appears in different units (centimeters and inches), along with weight.
🔸 Section: Distance Without Scaling

from scipy.spatial import distance
print("A to B:", distance.euclidean(df.iloc[0, 2:], df.iloc[1, 2:]))
print("A to C:", distance.euclidean(df.iloc[0, 2:], df.iloc[2, 2:]))
print("B to C:", distance.euclidean(df.iloc[1, 2:], df.iloc[2, 2:]))

We calculate Euclidean distances between pairs of people using the raw weight (pounds) and height (inches) columns. The distances vary widely, and they would come out differently if we measured height in centimeters instead, even though the people have not changed. That's a red flag.
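To see the unit dependence directly, the same pairwise distances can be computed twice, once with height in inches and once in centimeters. This is a small standard-library sketch using the values from the toy table; the helper name dist is made up for illustration.

```python
import math

# Same three people as in the toy table; height is stored in both units.
weight_lb = {"A": 150, "B": 160, "C": 171}
height_in = {"A": 63, "B": 67, "C": 70}
height_cm = {"A": 160, "B": 170.2, "C": 177.8}

def dist(p, q, height):
    """Euclidean distance on (weight, height) between persons p and q."""
    return math.hypot(weight_lb[p] - weight_lb[q], height[p] - height[q])

for p, q in [("A", "B"), ("A", "C"), ("B", "C")]:
    print(p, q, round(dist(p, q, height_in), 2), round(dist(p, q, height_cm), 2))
```

Every distance grows when height is expressed in centimeters, because the larger numbers give height more influence relative to weight.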
🔸 Section: Manual Z-Score Scaling

df_data = df.set_index('Person')
df_scaled = (df_data - df_data.mean(axis=0)) / df_data.std(axis=0)

We set “Person” as the index and scale the numeric columns using z-score normalization — subtracting the mean and dividing by the standard deviation. Now, all features have mean 0 and standard deviation 1.
🔸 Section: Verify Scaled Features

df_scaled.mean(axis = 0)
df_scaled.std(axis = 0)

These show the new means and standard deviations, confirming that scaling worked as expected.
🔸 Section: Distance After Scaling

distance.euclidean(df_scaled.iloc[0], df_scaled.iloc[1])

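We recalculate the distance between persons A and B after scaling; the other pairs work the same way with different row indices. This time, each feature contributes on a comparable scale, and the result no longer depends on the units used.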
🔸 Section: Scale with StandardScaler

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df_data)
scaled = scaler.transform(df_data)

Here, we use scikit-learn's built-in scaler. We first .fit() the scaler to compute the mean and std, then .transform() to apply the scaling.
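As a rough check on what .fit() and .transform() do, the transform can be reproduced by hand from the fitted attributes mean_ and scale_. The array X_people below simply retypes the toy measurements so the sketch runs on its own; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same toy measurements as above (columns: cm, pounds, inches).
X_people = np.array([[160.0, 150.0, 63.0],
                     [170.2, 160.0, 67.0],
                     [177.8, 171.0, 70.0]])

toy_scaler = StandardScaler().fit(X_people)    # learns mean_ and scale_ per column
X_scaled = toy_scaler.transform(X_people)      # applies (x - mean_) / scale_

# Reproduce transform() by hand from the fitted attributes.
X_manual = (X_people - toy_scaler.mean_) / toy_scaler.scale_
print(np.allclose(X_scaled, X_manual))
```

Keeping fit and transform separate matters once there is a test set: the scaler is fitted on the training rows only and then applied unchanged to new rows.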
🔸 Section: Validate Scaled Output

scaled.mean(axis = 0)
scaled.std(axis = 0)

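We verify that the transformed data has a mean of 0 and a standard deviation of 1. Here scaled is a NumPy array, so .std() divides by n by default, which matches StandardScaler's own convention. The manual pandas version divided by n - 1 instead, so its scaled values differ from these by a constant factor, but both are centered at 0.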
🔸 Section: Population vs Sample Variance

df_data.var(axis = 0)
scaler.var_

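df_data.var() divides by n - 1, which gives the sample variance.
scaler.var_ divides by n, which gives the population variance.
The two differ by a factor of n / (n - 1). With only three rows that factor is 1.5, but it shrinks toward 1 as n grows, so the choice is good to understand but not usually critical for ML models.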
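The two conventions can be compared directly in NumPy through the ddof argument. A minimal sketch on the centimeter column of the toy data (the variable names are illustrative):

```python
import numpy as np

heights_cm = np.array([160.0, 170.2, 177.8])
n = len(heights_cm)

sample_var = heights_cm.var(ddof=1)       # divides by n - 1, like pandas' default
population_var = heights_cm.var(ddof=0)   # divides by n, like StandardScaler

ratio = sample_var / population_var       # equals n / (n - 1)
print(sample_var, population_var, ratio)
```

Passing ddof=0 to the pandas .var() or .std() call would make the manual scaling agree with StandardScaler.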
🔸 Section: Curse of Dimensionality

    Text: “k-nearest neighbors runs into trouble in higher dimensions…”

As dimensions increase, data becomes sparse, and distances become less meaningful. This is the Curse of Dimensionality.
🔸 Section: Simulate Distance in High Dimensions

import numpy as np
def random_distances(dim, num_pairs):
    return [distance.euclidean(np.random.rand(dim), np.random.rand(dim)) for _ in range(num_pairs)]

We define a function that draws num_pairs pairs of random points, uniformly from the unit hypercube in dim dimensions, and returns the Euclidean distance for each pair.
🔸 Section: Run the Simulation

dimensions = range(1, 101, 5)
avg_distances = []
min_distances = []

for dim in dimensions:
    distances = random_distances(dim, 10000)
    avg_distances.append(np.mean(distances))
    min_distances.append(min(distances))

For each dimension in range(1, 101, 5), that is 1, 6, 11, and so on up to 96, we generate 10,000 distances and record the average and minimum for that dimension.
🔸 Section: Visualize Distance Ratio

import matplotlib.pyplot as plt
plt.plot(list(dimensions), np.array(min_distances) / np.array(avg_distances))

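This plot shows the ratio of minimum to average distance rising toward 1 as the dimension increases.
Meaning: in high dimensions, everything is far away. The closest of the 10,000 sampled pairs is nearly as far apart as a typical pair, which is the same effect that makes a nearest neighbor almost as far as a random point.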
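One way to see why the ratio climbs: for two points drawn uniformly from the unit hypercube, each coordinate contributes 1/6 to the expected squared distance, so the expected squared distance in d dimensions is d/6. The typical distance therefore grows like the square root of d/6, while the spread around it stays roughly constant. The sketch below checks the d/6 figure with a vectorized, seeded simulation; the function name and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the run is repeatable

def mean_squared_distance(dim, num_pairs=20000):
    """Average squared Euclidean distance between random pairs in [0, 1)^dim."""
    a = rng.random((num_pairs, dim))
    b = rng.random((num_pairs, dim))
    return float(np.mean(np.sum((a - b) ** 2, axis=1)))

results = {dim: mean_squared_distance(dim) for dim in (1, 10, 100)}
for dim, msd in results.items():
    # Simulated value next to the theoretical d / 6.
    print(dim, round(msd, 3), round(dim / 6, 3))
```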
🔸 Section: Visualizing Sparsity

    Page 12–13: Random points in 1D, 2D, and 3D

We see how 50 points cover the space well in 1D, somewhat in 2D, and poorly in 3D. As dimensionality increases, we need exponentially more data to achieve the same coverage.
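The coverage argument can be put in numbers by laying a grid over the unit cube and counting how many cells contain at least one of the 50 points. This is only a sketch: the grid of 10 bins per axis is an assumed resolution chosen for illustration, and the points are freshly simulated, not the ones on the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
bins_per_axis = 10   # assumed grid resolution, chosen only for illustration
num_points = 50

coverage = {}
for dim in (1, 2, 3):
    points = rng.random((num_points, dim))
    # Map each point to the grid cell it falls in.
    cells = np.floor(points * bins_per_axis).astype(int)
    occupied = len({tuple(c) for c in cells})
    # Fraction of the 10**dim cells that contain at least one point.
    coverage[dim] = occupied / bins_per_axis ** dim
    print(dim, coverage[dim])
```

With 1,000 cells in 3D, 50 points can touch at most 5% of them, and keeping the 1D coverage would take on the order of 10 times more points for each added dimension.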
🔸 Section: KNN and Dimensionality Reduction

    Page 14–15

KNN performance drops in high dimensions. We often use dimensionality reduction techniques like PCA to project data into a lower-dimensional subspace before applying KNN.
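A sketch of how that might look in scikit-learn, chaining scaling, PCA, and KNN in one pipeline. Iris has only four features, so this shows the mechanics, not a real high-dimensional case; the choice of two components and the variable names are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of Iris, used so the sketch runs without a download.
X, y = load_iris(return_X_y=True)

pca_knn = make_pipeline(
    StandardScaler(),                     # PCA is scale-sensitive, so scale first
    PCA(n_components=2),                  # project 4 features down to 2
    KNeighborsClassifier(n_neighbors=5),  # KNN runs in the reduced space
)
pca_knn.fit(X, y)

# Apply only the scaling and PCA steps to inspect the reduced features.
reduced = pca_knn[:-1].transform(X)
print(reduced.shape)
```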
🔸 Section: KNN in scikit-learn

    KNeighborsClassifier: standard k-NN classifier

    RadiusNeighborsClassifier: finds all neighbors within a fixed radius

    weights='uniform' uses majority vote

    weights='distance' gives more weight to closer neighbors

Choose k carefully: too small leads to noise sensitivity, too large oversmooths class boundaries.
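One common way to choose k is to compare cross-validated accuracy over a few candidate values. The sketch below tries both weighting schemes as well; the candidate values of k, the 5-fold setting, and the use of the bundled Iris copy are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

cv_accuracy = {}
for k in (1, 5, 15, 50):
    for weights in ("uniform", "distance"):
        model = KNeighborsClassifier(n_neighbors=k, weights=weights)
        # Mean accuracy over 5 folds, each scored on rows left out of fitting.
        scores = cross_val_score(model, X, y, cv=5)
        cv_accuracy[(k, weights)] = float(scores.mean())

for (k, weights), acc in cv_accuracy.items():
    print(k, weights, round(acc, 3))
```

Because each fold is scored on rows the model did not store, these numbers are a fairer basis for picking k than training accuracy.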