In [1]:
import pandas as pd
df = pd.read_csv("shirt_sizes.csv")
print(df)

   height(cm)  weight(kg) t-shirt size
0         158          58            M
1         163          61            M
2         165          61            L
3         168          66            L


# Intro to ML (Machine Learning)
* Supervised learning: labeled data (ex: there is an attribute (AKA feature) that you are interested in predicting for unseen instances)
    * The attribute is called the "class" or the "class label"
    * If the attribute is categorical, the task is classification
    * If the attribute is numeric, the task is regression
    * Example algorithm: kNN (k nearest neighbors)
* Unsupervised learning: unlabeled data 
    * Example algorithm: k-means clustering

## Supervised Learning
* Need a way to divide a dataset into a training set and a testing set
    * The training set is used to build/train an algorithm/model
    * The testing set is used to evaluate the algorithm/model
    * The training set and the testing set *are different*
* Example
    * We have this super tiny t-shirt sizes dataset
        * 4 instances
        * 3 attributes (1 is the class which is t-shirt size)
        * Goal is to use height and weight attributes to predict t-shirt size
        * We will do this for a test set with a single unseen instance
            * height=161 weight=63 t-shirt size=?
            * Let's say the "ground truth value" is M (medium)

## kNN Algorithm
* Identify the k nearest neighbors in the training set to a test set instance
    * The most frequently occurring class label amongst the k nearest neighbors will be the class label prediction for the unseen instance
* We need a way to measure "nearness" AKA "closeness"
    * 2D: Pythagorean theorem
    * ND: Euclidean distance formula: $dist(a,b) = \sqrt{\sum_{i=1}^{n} (a_i-b_i)^2}$
* We need to normalize (AKA scale) our attributes so we don't have an unanticipated weighting of one attribute more than another (ex: height has a larger scale than weight, so it will dominate the formula)
    * We will use the min-max scaling approach
    * For each attribute, the min becomes 0 and the max becomes 1 (i.e., values are bounded to [0,1], so the units have no weighting effect)
    * For each attribute, for each value subtract the min then divide by the original range (max - min)
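
The scaling and distance steps above can be sketched by hand with NumPy. This is only an illustration using the four training instances and the single test instance from the t-shirt example; the variable names are assumptions, not part of the original notebook.

```python
import numpy as np

# Training instances from the tiny t-shirt dataset: [height(cm), weight(kg)]
X_train = np.array([[158, 58], [163, 61], [165, 61], [168, 66]], dtype=float)
x_test = np.array([161, 63], dtype=float)

# Min-max scaling: subtract each column's min, then divide by its range (max - min)
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min
X_train_scaled = (X_train - col_min) / col_range
# The test instance reuses the min and range computed from the training set
x_test_scaled = (x_test - col_min) / col_range

# Euclidean distance from the test instance to every training instance
distances = np.sqrt(((X_train_scaled - x_test_scaled) ** 2).sum(axis=1))

print(X_train_scaled)
print(x_test_scaled)
print(distances)
```

Note that the test instance is scaled with the training set's min and range, which mirrors calling `fit()` on the training data only.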

In [2]:
# kNN algorithm with the scikit-learn library
# notation
# X: is a feature matrix (rows of feature vectors (instances)) with the class labels stripped off
# y: is a class label vector
# X and y are parallel
# use _train and _test to denote train and test sets respectively
X_train = df.drop("t-shirt size", axis=1) # 1 is for columns
print(X_train)
y_train = df["t-shirt size"]
print(y_train)
X_test = [[161, 63]]
print(X_test)

   height(cm)  weight(kg)
0         158          58
1         163          61
2         165          61
3         168          66
0    M
1    M
2    L
3    L
Name: t-shirt size, dtype: object
[[161, 63]]


Algorithm steps:
1. Normalize the X data
2. Compute the distances to each unseen instance in the test set
3. Apply majority voting to the labels of the k (=3 in our case) closest training instances
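
To make the three steps concrete, here is a minimal from-scratch sketch of steps 2 and 3, assuming the data has already been min-max scaled (step 1). The helper name `knn_predict` is hypothetical and is not a scikit-learn function.

```python
from collections import Counter

import numpy as np

# Min-max scaled training data and labels from the t-shirt example
X_train_scaled = np.array([[0.0, 0.0], [0.5, 0.375], [0.7, 0.375], [1.0, 1.0]])
y_train = ["M", "M", "L", "L"]
x_test_scaled = np.array([0.3, 0.625])


def knn_predict(X, y, x, k=3):
    """Hypothetical helper: predict the label of x by majority vote of its k nearest rows in X."""
    # Step 2: Euclidean distance from x to every training instance
    distances = np.sqrt(((X - x) ** 2).sum(axis=1))
    # Indexes of the k smallest distances, closest first
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote over the neighbors' labels
    votes = Counter(y[i] for i in nearest)
    return votes.most_common(1)[0][0], nearest


label, neighbors = knn_predict(X_train_scaled, y_train, x_test_scaled, k=3)
print(label, neighbors)
```

With k=3 the neighbors are instances 1, 2, and 0 (labels M, L, M), so the vote yields M.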


In [3]:
# normalize the data
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_normalized = scaler.transform(X_train) # fit() and transform() are often combined into fit_transform()
print(X_train_normalized)
X_test_normalized = scaler.transform(X_test)
print(X_test_normalized)

[[0.    0.   ]
 [0.5   0.375]
 [0.7   0.375]
 [1.    1.   ]]
[[0.3   0.625]]


In [4]:
# set up knn classifier
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn_clf.fit(X_train_normalized, y_train)

KNeighborsClassifier(metric='euclidean', n_neighbors=3)

In [5]:
# to make predictions with the classifier for unseen instance
y_predicted = knn_clf.predict(X_test_normalized)
print("y predicted:", y_predicted)
print("nearest neighbors:", knn_clf.kneighbors(X_test_normalized))

y predicted: ['M']
nearest neighbors: (array([[0.32015621, 0.47169906, 0.69327123]]), array([[1, 2, 0]], dtype=int64))


## Closing Thoughts on kNN
* What if our attributes are not numeric (meaning they are categorical)?
    * Simple approach: convert category labels to integers
        * from sklearn.preprocessing import LabelEncoder
    * Another approach: write your own distance function (0 if the labels are the same, 1 otherwise)
* kNN is not the only ML algorithm
    * Naive Bayes
    * Decision trees (random forests)
    * SVMs (support vector machines)
    * Neural networks
    * etc.
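
The two approaches to categorical attributes mentioned above can be sketched as follows. The `colors` attribute and the `categorical_distance` helper are hypothetical, made up for illustration. One caveat worth keeping in mind: integer codes imply an ordering between categories that may not exist, which is one reason a 0/1 distance can be the safer choice.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical attribute (not part of the t-shirt dataset)
colors = ["red", "green", "blue", "green"]

# Simple approach: map each category label to an integer code
encoder = LabelEncoder()
codes = encoder.fit_transform(colors)
print(codes)             # one integer per value; codes follow sorted label order
print(encoder.classes_)  # the label that each integer code stands for


# Another approach: a distance that only checks whether two labels match
def categorical_distance(a, b):
    """Hypothetical helper: 0 if the labels are the same, 1 otherwise."""
    return 0 if a == b else 1


print(categorical_distance("red", "red"), categorical_distance("red", "blue"))
```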

## Nov 30, 2021


In [6]:
long_df = pd.read_csv("shirt_sizes_large.csv")

X = long_df.drop("t-shirt size", axis=1) # 1 is for columns
y = long_df["t-shirt size"]

scaler = MinMaxScaler()
X = scaler.fit_transform(X)
print(X)
print(y)


[[0.         0.        ]
 [0.         0.1       ]
 [0.         0.5       ]
 [0.16666667 0.1       ]
 [0.16666667 0.2       ]
 [0.41666667 0.2       ]
 [0.41666667 0.3       ]
 [0.16666667 0.6       ]
 [0.41666667 0.6       ]
 [0.58333333 0.3       ]
 [0.58333333 0.4       ]
 [0.58333333 0.7       ]
 [0.83333333 0.4       ]
 [0.83333333 0.5       ]
 [0.83333333 0.8       ]
 [1.         0.5       ]
 [1.         0.6       ]
 [1.         1.        ]]
0     M
1     M
2     M
3     M
4     M
5     M
6     M
7     L
8     L
9     L
10    L
11    L
12    L
13    L
14    L
15    L
16    L
17    L
Name: t-shirt size, dtype: object


# Classifier Evaluation
* In our previous demo, we had 1 instance in our "test set"
    * If our classifier predicted this instance's class correctly, accuracy 100%
    * If our classifier predicted this instance's class incorrectly, accuracy 0%
* Note
    * We should use a "large" test set to get a better picture of how our classifier is performing
    * Accuracy doesn't tell the whole story...
        * Ex: 100 samples... 99 M and 1 L
        * And our classifier simply only predicts M
        * We have 99% accuracy yay!
            * Accuracy only makes sense when your class labels are nearly evenly distributed
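
The 99 M / 1 L scenario can be reproduced in a few lines. This is a sketch with made-up label lists that match the counts in the example; recall for the L class is included only to show what accuracy hides.

```python
from sklearn.metrics import accuracy_score, recall_score

# 100 samples: 99 M and 1 L, as in the example above
y_true = ["M"] * 99 + ["L"]
# A "classifier" that always predicts M
y_pred = ["M"] * 100

accuracy = accuracy_score(y_true, y_pred)
# Fraction of the actual L instances that were predicted as L
recall_L = recall_score(y_true, y_pred, pos_label="L")

print(accuracy)  # 0.99
print(recall_L)  # 0.0
```

The classifier scores 99% accuracy while never identifying the single L instance.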

* Given a data set, we need a way to "divide" our dataset into a training set and a test set
    * A few ways to do this
        1. Hold out method
        2. Random subsampling
        3. Cross validation
        4. Bootstrap method

### Hold out method
1. "Hold out" a certain number or percentage of instances in a dataset for testing
    * Train on the remaining instances
    * Typically choose a standard split or percentage
        * ex: 2:1 split means 1/3 of data is held out for testing, so remaining 2/3 for training
        * ex: 25% hold out -- 25% of data is held out for testing and 75% is for training
            * Default for sklearn's `train_test_split()`


In [7]:
from sklearn.model_selection import train_test_split

# random_state is used for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# print(X_train, X_test)
# print(y_train, y_test)

knn_clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn_clf.fit(X_train, y_train)
y_predicted = knn_clf.predict(X_test)
print(y_predicted)
print(list(y_test))


['M' 'M' 'L' 'L' 'L']
['M', 'M', 'L', 'L', 'L']


In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)
y_predicted = tree_clf.predict(X_test)
print(y_predicted)
print(list(y_test))

accuracy = tree_clf.score(X_test, y_test) # score the tree, since y_predicted above comes from tree_clf
print(accuracy)
accuracy = accuracy_score(y_test, y_predicted)
print(accuracy)

['M' 'M' 'L' 'L' 'L']
['M', 'M', 'L', 'L', 'L']
1.0
1.0


### Random Subsampling
* Perform the hold out method k times (this k is different from the k in kNN)
* Accuracy is the mean accuracy over the k runs
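
A minimal sketch of random subsampling follows. It assumes scikit-learn's built-in iris dataset as a stand-in for the t-shirt data (the CSV is not reproduced here), skips scaling for brevity, and uses a different `random_state` for each run so that every hold-out split differs. The name `k_runs` is an assumption.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset (assumption): iris instead of the t-shirt CSV
X, y = load_iris(return_X_y=True)

k_runs = 10  # number of hold-out repetitions (not the k in kNN)
accuracies = []
for seed in range(k_runs):
    # A fresh random 75/25 hold-out split on each run
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X_train, y_train)
    accuracies.append(clf.score(X_test, y_test))

# Reported accuracy is the mean over the k runs
mean_accuracy = np.mean(accuracies)
print(accuracies)
print(mean_accuracy)
```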

### Cross Validation
* With random subsampling, we are not guaranteed that each instance ends up in a test set at least once
* With cross validation, we are more intentional about our "partitions"
* Algorithm: Divide the data set into k folds (again, a different k from the k in kNN)
    * For each fold:
        * Hold out the fold and test on it
        * Train on the remaining folds (folds - fold)
* With this algorithm, each instance is tested exactly 1 time
* Accuracy is the total predicted correctly divided by the total predicted
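
The fold loop described above can be written out by hand with `KFold`. This sketch again assumes the iris dataset as a stand-in and shuffles the folds with a fixed seed; the counter `times_tested` is a hypothetical name used to check that each instance is tested exactly once.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset (assumption): iris instead of the t-shirt CSV
X, y = load_iris(return_X_y=True)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
total_correct = 0
times_tested = np.zeros(len(y), dtype=int)  # how often each instance is in a test fold

for train_idx, test_idx in kfold.split(X):
    # Train on the remaining folds, test on the held-out fold
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])
    total_correct += int((y_pred == y[test_idx]).sum())
    times_tested[test_idx] += 1

# Accuracy = total predicted correctly / total predicted
accuracy = total_correct / len(y)
print(accuracy)
print(times_tested.min(), times_tested.max())
```

Pooling the correct predictions across folds, rather than averaging per-fold accuracies, avoids giving small folds extra weight when the fold sizes differ.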

In [9]:
from sklearn.model_selection import cross_val_score, cross_val_predict

# run 5-fold cross validation for both the knn and tree
for clf in [knn_clf, tree_clf]:
    print(type(clf))
    # lazier approach
    accuracies = cross_val_score(clf, X, y, cv=5)
    print(accuracies, accuracies.mean())
    # better approach
    y_predicted = cross_val_predict(clf, X, y, cv=5)
    print(y_predicted)
    accuracy = accuracy_score(y, y_predicted)
    print(accuracy)

<class 'sklearn.neighbors._classification.KNeighborsClassifier'>
[0.75       0.5        1.         1.         0.66666667] 0.7833333333333333
['M' 'M' 'M' 'M' 'M' 'M' 'L' 'M' 'L' 'M' 'M' 'L' 'L' 'L' 'L' 'L' 'L' 'L']
0.7777777777777778
<class 'sklearn.tree._classes.DecisionTreeClassifier'>
[0.5        0.5        1.         1.         0.66666667] 0.7333333333333333
['M' 'M' 'L' 'M' 'M' 'M' 'L' 'M' 'M' 'M' 'L' 'L' 'L' 'L' 'L' 'L' 'L' 'L']
0.7222222222222222


GS for next class...

NOTE: from cross_val_score() documentation: "For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls."

TODO: cross validation variants plus confusion matrices

Variants of cross-validation
* Stratified k fold cross validation: roughly the same distribution of class labels in each fold
* LOOCV: leave one out cross validation: k = N: each fold contains exactly one instance
    * Good for when you need as much training data as possible
    * Inefficient
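
Both variants are available as splitter classes in scikit-learn. The sketch below assumes the iris dataset as a stand-in (50 instances in each of 3 classes), so each of the 5 stratified folds should hold the same number of instances per class, and LOOCV should produce one score per instance.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset (assumption): iris instead of the t-shirt CSV
X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3)

# Stratified k-fold: count the class labels that land in each test fold
skf = StratifiedKFold(n_splits=5)
fold_label_counts = [Counter(y[test_idx].tolist()) for _, test_idx in skf.split(X, y)]
print(fold_label_counts)

# LOOCV: k = N, so there is one fold (and one 0-or-1 score) per instance
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(len(loo_scores), loo_scores.mean())
```

LOOCV fits the classifier N times, which is why it becomes expensive on larger datasets.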

## Classifier Evaluation Metrics
For binary classification...
* P: # of positive instances in the test set
* N: # of negative instances in the test set
* TP: # of positives that are correctly classified as positives (true positives)
* TN: # of negatives that are correctly classified as negatives
* FN: # of positives that are incorrectly classified as negatives
* FP: # of negatives that are incorrectly classified as positives
* Accuracy = $\frac{TP + TN}{P + N}$
* Error rate = $\frac{FN + FP}{P + N}$
* Accuracy and error rate are not always the best metrics
    * Especially when you have a class imbalance problem
* Other metrics (for further study)
    * Precision, recall, F measure, AUC (area under the receiver operating characteristic (ROC) curve), etc.
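
The counts and formulas above can be checked on a small example. The label lists below are hypothetical, and "L" is arbitrarily treated as the positive class. Passing `labels=["M", "L"]` puts the negative class first, so the flattened matrix reads TN, FP, FN, TP.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical binary test set: "L" is the positive class, "M" the negative class
y_true = ["L", "L", "L", "M", "M", "M", "M", "M"]
y_pred = ["L", "L", "M", "M", "M", "M", "L", "M"]

# Rows are actual labels, columns are predicted labels, ordered [negative, positive]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["M", "L"]).ravel()

p = tp + fn  # P: number of positive instances in the test set
n = tn + fp  # N: number of negative instances in the test set
accuracy = (tp + tn) / (p + n)
error_rate = (fn + fp) / (p + n)

print(tp, tn, fp, fn)        # 2 4 1 1
print(accuracy, error_rate)  # 0.75 0.25
print(accuracy_score(y_true, y_pred))
```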

Last ... confusion matrix demo