# Unsupervised Machine Learning with Clustering
Michaela Webster - mawebster9

_This notebook is an introductory guide to machine learning and walks you through the machine learning process with a sample dataset. You can use your own dataset with some minor adjustments to the code. This notebook provides guidance on how to implement both text-based and boolean-based solutions to machine learning._

_This guide only touches on a few machine learning techniques and should not be used as your one-stop-shop for all machine learning problems._

In [38]:
# our data structure
import pandas as pd

# bag of words vectorizer - take inverse frequency of words to assign weights
from sklearn.feature_extraction.text import TfidfVectorizer

# split data into training/test data, validate our models, and specify number of folds for training/test data
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold 

# our models: KMeans (clustering) and support vector machines (classification)
from sklearn.cluster import KMeans
from sklearn.svm import SVC, LinearSVC, SVR

# our accuracy metrics
from sklearn.metrics import accuracy_score, completeness_score, adjusted_mutual_info_score, adjusted_rand_score, v_measure_score
from sklearn.metrics import fowlkes_mallows_score, homogeneity_score, mutual_info_score, normalized_mutual_info_score

## 1. Import Data and Set X & y


For this example, we will focus solely on the typeNumber field as our label. The typeNumber field is a numerical representation of our animal type. Our goal is to see if the boolean fields can be used to determine the type of animal with high accuracy. For this, our X is all of the boolean fields and the associated label, or y, is the typeNumber field.

In [17]:
#connect to CSV file that contains our data - our data was adapted from https://github.com/selva86/datasets/Zoo.csv
path_to_file = "https://raw.githubusercontent.com/mawebster9/MachineLearningMadeEasy/master/Zoo.csv"

#open, read, and store our data into a pandas dataframe
df = pd.read_csv(path_to_file, encoding='latin-1')

In [18]:
#assign our attributes to X and y
#X_1 is a placeholder for now before we transform our data into the format our ML algorithms can understand
X_1 = df.select_dtypes('bool')
y = df['typeNumber']

In [19]:
#print the first record in X to verify the previous step
X_1.iloc[0]

type         True
feathers    False
eggs        False
milk         True
airborne    False
aquatic     False
predator     True
toothed      True
backbone     True
breathes     True
venomous    False
fins        False
tail        False
domestic    False
catsize      True
Name: 0, dtype: bool

In [20]:
#print the first record in y to verify the previous step
y.iloc[0]

2
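As a rough illustration of how `select_dtypes('bool')` separates the features from the label, here is a minimal sketch on a tiny hand-made frame. The `toy` dataframe, its column values, and the names `toy_X` and `toy_y` are invented for illustration and are not rows from Zoo.csv.

```python
import pandas as pd

# Hypothetical three-row frame; the values are made up for illustration
toy = pd.DataFrame({
    "name": ["animal_a", "animal_b", "animal_c"],
    "milk": [True, False, False],
    "fins": [False, True, False],
    "typeNumber": [2, 5, 1],
})

# Keep only the boolean columns as features
toy_X = toy.select_dtypes("bool")

# The label is a single column, so it comes back as a Series
toy_y = toy["typeNumber"]

print(toy_X.columns.tolist())
print(toy_y.shape)
```

The text and integer columns are left out of `toy_X` automatically, which is why the label column never leaks into the features.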

## 3. Set-up X: Fix Boolean Values

Our y values are already numeric, but our X values are not yet in a form that a machine learning algorithm can understand. To fix this, we need to change the true/false values into a numeric format: all true values become 1.0 and all false values become 0.0.

In [23]:
#replace all instances: True = 1.0, False = 0.0 (transform X_1 and store results in X)
#astype(float) converts every boolean column at once
X = X_1.astype(float)

In [24]:
#print the first record in X - ensure that our conversion worked
X.iloc[0]

type        1.0
feathers    0.0
eggs        0.0
milk        1.0
airborne    0.0
aquatic     0.0
predator    1.0
toothed     1.0
backbone    1.0
breathes    1.0
venomous    0.0
fins        0.0
tail        0.0
domestic    0.0
catsize     1.0
Name: 0, dtype: float64

### Final Data Check

Now that our X and y are in the right format, we need to check one last time that the dimensions of each are correct. For our X, we see that there are 101 rows and 15 columns (one per boolean field). For our y, we see that there are 101 rows and a single dimension, since it is a pandas Series rather than a dataframe.

Our data has passed the check and is ready to be used.

In [25]:
X.shape

(101, 15)

In [26]:
y.shape

(101,)
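The same check can be turned into a small guard that fails loudly if the shapes ever drift apart. This is only a sketch: `check_shapes` is a hypothetical helper, not part of pandas or scikit-learn, and the placeholder data below simply mimics the 101 x 15 shape reported above.

```python
import numpy as np
import pandas as pd

def check_shapes(features, labels):
    """Hypothetical helper: return the row count if features and labels line up."""
    if features.ndim != 2:
        raise ValueError("features must be 2-dimensional (rows x columns)")
    if labels.ndim != 1:
        raise ValueError("labels must be 1-dimensional")
    if features.shape[0] != labels.shape[0]:
        raise ValueError("features and labels must have the same number of rows")
    return features.shape[0]

# Placeholder data with the same shape as X and y in this notebook
demo_X = pd.DataFrame(np.zeros((101, 15)))
demo_y = pd.Series(np.zeros(101))

n_rows = check_shapes(demo_X, demo_y)
print(n_rows)
```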

## 4. Machine Learning Step-by-Step

#### A. Run train_test_split() on X & y

This step is not needed for this notebook, but it shows how the train_test_split function works. Our X and y are randomly split into training and testing groups. In this case, the test group is made up of 33% of the data (test_size), rounded up to 34 of the 101 rows, and the split is the same each time we run this line (random_state).

In [27]:
#break 33% of X and y into X_test and y_test, break other remaining 67% into X_train and y_train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [28]:
#print size of training data (67/101 = .66)
X_train.shape

(67, 15)

In [29]:
#print size of test data (34/101 = .34)
X_test.shape

(34, 15)
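The 67/34 split follows from how train_test_split rounds a fractional test_size: the test count is rounded up and the training set gets the remainder. The sketch below reproduces the arithmetic on placeholder arrays; `X_demo` and `y_demo` are stand-ins with the same shape as our data, not the Zoo records themselves.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 101  # row count reported by X.shape

# Placeholder arrays with the same shape as our X and y
X_demo = np.zeros((n_samples, 15))
y_demo = np.zeros(n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# The test count is rounded up; the training set is whatever is left
expected_test = math.ceil(0.33 * n_samples)
expected_train = n_samples - expected_test

print(len(X_tr), len(X_te))
print(expected_train, expected_test)
```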

#### B. Run fit() on X_train & y_train

The next step is to take our training data and feed it into an algorithm to build a model. For a supervised classifier, this is the step that teaches the algorithm which y goes with each record in X. KMeans is different: it is a clustering algorithm, so it groups similar records using X alone and ignores any y passed to fit(). There are a number of models to choose from; for this example we will focus on KMeans.

In [30]:
#specify which model to use and set parameters (fitting happens in the next step)
clf = KMeans(random_state=0)

In [31]:
#send X_train and y_train into our model to build it (KMeans accepts y but ignores it)
k_means = clf.fit(X_train, y_train)

In [32]:
#print out all information about our model
print(k_means)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=8, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)
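Because KMeans is unsupervised, the y argument to fit() exists only for API consistency and has no effect on the result. A minimal sketch of that behavior follows, using two synthetic blobs; the data and the names `X_demo`, `y_demo`, `without_y`, and `with_y` are illustrative assumptions, not part of the Zoo example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated synthetic blobs of 10 points each
X_demo = np.vstack([
    rng.normal(0.0, 0.1, size=(10, 2)),
    rng.normal(5.0, 0.1, size=(10, 2)),
])
y_demo = np.array([0] * 10 + [1] * 10)

# Same settings and seed; the only difference is whether y is passed
without_y = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
with_y = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo, y_demo)

# The cluster assignments match because y is never used
print(np.array_equal(without_y.labels_, with_y.labels_))
```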


#### C. Run predict() on X_test

The next step is to take the model we just built from the training data and feed the test data into it. This outputs an array with the cluster (0 through 7) that the algorithm assigns to each test record. These cluster numbers are arbitrary IDs, not typeNumber values.

In [33]:
#send our test data into the model we just created
y_pred = k_means.predict(X_test)

In [34]:
#print our results for the predictions
print("Here are the model's predictions: ")
print(y_pred)

Here are the model's predictions: 
[2 2 3 3 3 7 2 3 3 3 5 7 7 1 4 3 2 1 5 2 0 0 0 7 2 4 7 0 2 2 1 0 3 2]


In [35]:
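#labels_ holds the cluster assigned to each of the 67 training rows
print(k_means.labels_)
#y_test holds the true typeNumber for each of the 34 test rows
print(y_test)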

[1 2 3 4 2 1 7 5 5 1 5 2 3 2 0 0 3 4 7 3 2 5 5 1 2 2 2 7 6 1 3 3 3 4 3 5 3
 0 1 6 3 6 1 3 2 6 1 3 4 1 3 2 0 1 5 6 6 7 3 5 5 1 5 6 4 5 7]
84    2
55    2
66    2
67    2
45    2
39    6
22    2
44    2
10    2
0     2
18    3
30    6
97    6
33    4
77    5
4     2
93    2
78    4
12    3
31    2
76    7
89    1
26    1
42    6
70    2
15    5
40    6
72    5
9     2
96    2
11    4
91    7
64    2
28    2
Name: typeNumber, dtype: int64


#### D. Verify Accuracy of Model

Now that we have split our data into training and testing groups, created a model using a machine learning algorithm, and used the model to predict outcomes for our test data, it is time to verify how well our model did compared to the actual outcomes. To do this, there are a number of accuracy metrics. For this example we will focus on the accuracy score.

In [36]:
#compare y_test values with the predicted y values (accuracy_score already returns a single number)
score = accuracy_score(y_test, y_pred)
print("Accuracy score for KMeans classifier:  ", score*100,"%")

Accuracy score for KMeans classifier:   26.47058823529412 %


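***Here we can see that the cluster IDs from our KMeans model matched the typeNumber labels 26.47% of the time. Because KMeans numbers its clusters arbitrarily, this figure is not a reliable measure of how well the clusters line up with the animal types.***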
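KMeans numbers its clusters arbitrarily, so plain accuracy can be low even when the grouping itself is perfect. The sketch below uses a hand-made example (the lists `true_labels` and `cluster_ids` are invented, not notebook output) to compare accuracy with adjusted_rand_score, which only looks at which records are grouped together.

```python
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Three groups of two records each
true_labels = [0, 0, 1, 1, 2, 2]

# The same grouping, but with the cluster IDs shuffled
cluster_ids = [1, 1, 2, 2, 0, 0]

# Accuracy compares the numbers position by position
acc = accuracy_score(true_labels, cluster_ids)

# Adjusted Rand compares the groupings and ignores the ID values
ari = adjusted_rand_score(true_labels, cluster_ids)

print("accuracy:", acc)
print("adjusted rand:", ari)
```

Here no cluster ID equals its true label, yet the records are grouped exactly as in the truth, which is why the clustering metrics are usually a better fit for KMeans than accuracy.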

## Test for Best Classifier to Use

Now that we understand how machine learning is done, we can determine which model is the best choice for our data. In this example we will use 3 different models and evaluate each against 9 scoring metrics.

Note: the cross_val_score() function takes care of splitting the data (here with StratifiedKFold folds rather than a single train_test_split(X, y, test_size=0.33, random_state=42)), fit(X_train, y_train), predict(X_test), and scoring with the chosen metric. This function is essentially an all-in-one function.
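To make the all-in-one behavior concrete, here is a minimal sketch of a single cross_val_score() call. It assumes scikit-learn's bundled iris dataset as stand-in data so that it runs without the Zoo CSV; `X_demo`, `y_demo`, `folds`, and `fold_scores` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Stand-in data bundled with scikit-learn (not the Zoo dataset)
X_demo, y_demo = load_iris(return_X_y=True)

# Three folds, each keeping roughly the same class proportions
folds = StratifiedKFold(n_splits=3)

# One call splits, fits, predicts, and scores once per fold
fold_scores = cross_val_score(
    SVC(gamma="auto"), X_demo, y_demo, scoring="accuracy", cv=folds
)

print(fold_scores)
print(fold_scores.mean())
```

The call returns one score per fold, which is why the loop in this notebook takes `.mean()` to get a single number per metric.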

__Ignore the warning - there will be a default change when the next update is pushed__

In [39]:
classifiers = [SVC(gamma='auto'), LinearSVC(), KMeans(random_state=0)]
clf_names = ['SVC', 'LinearSVC', 'KMeans']
metric_names = ['accuracy', 'completeness_score', 'adjusted_mutual_info_score', 'adjusted_rand_score', 'fowlkes_mallows_score', 'homogeneity_score', 'mutual_info_score', 'normalized_mutual_info_score', 'v_measure_score']

scv = StratifiedKFold(n_splits=3)

scores_df = pd.DataFrame(index=metric_names,columns=clf_names)
clf_scores = []
for clf, name in zip(classifiers, clf_names):
    print('-----------------------------------------------------------------------------------------------------------')
    print('Classifier: ',clf)
    print('')
    print("Scoring Metrics: ")
    for metric in metric_names:
        score = cross_val_score(clf,X,y,scoring=metric, cv=scv).mean()
        clf_scores.append(score)
        print('\t*',metric,'score: ', score)
    scores_df[name] = clf_scores
    clf_scores = []

-----------------------------------------------------------------------------------------------------------
Classifier:  SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

Scoring Metrics: 
	* accuracy score:  0.8712948242360007
	* completeness_score score:  0.9418178347175505
	* adjusted_mutual_info_score score:  0.76144398216278
	* adjusted_rand_score score:  0.8886509558840245
	* fowlkes_mallows_score score:  0.9164954557636752
	* homogeneity_score score:  0.8153033147609291
	* mutual_info_score score:  1.3432862063118838
	* normalized_mutual_info_score score:  0.8762737018254722
	* v_measure_score score:  0.8739928525235382
-----------------------------------------------------------------------------------------------------------
Classifier:  LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=T



	* mutual_info_score score:  1.508640959392465
	* normalized_mutual_info_score score:  0.931478552913411
	* v_measure_score score:  0.9313119044957495
-----------------------------------------------------------------------------------------------------------
Classifier:  KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=8, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

Scoring Metrics: 
	* accuracy score:  0.21875993640699523
	* completeness_score score:  0.7766847898129409
	* adjusted_mutual_info_score score:  0.6775502585101932
	* adjusted_rand_score score:  0.6278825545830816
	* fowlkes_mallows_score score:  0.7106366579954777
	* homogeneity_score score:  0.9089922681799937
	* mutual_info_score score:  1.498575252792693
	* normalized_mutual_info_score score:  0.8402029950403515
	* v_measure_score score:  0.8375761413397314




## Final Results

For boolean features with multi-valued labels, the top-performing model in this comparison is LinearSVC. This classifier fits linear decision boundaries over the input features and, by default, handles more than two classes with a one-vs-rest scheme: one linear model per class, with the highest-scoring class chosen as the prediction.
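A rough sketch of what LinearSVC learns on a multi-class problem is shown below. It again assumes the bundled iris dataset as stand-in data, and `max_iter=10000` is an assumption chosen to give the solver room to converge, not a setting from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

# Stand-in data: 150 rows, 4 features, 3 classes (not the Zoo dataset)
X_demo, y_demo = load_iris(return_X_y=True)

# max_iter is raised here as an assumption to help the solver converge
model = LinearSVC(max_iter=10000).fit(X_demo, y_demo)

# One row of feature weights is learned per class
print(model.coef_.shape)

# Each prediction is one of the class labels seen during training
preds = model.predict(X_demo[:5])
print(preds)
```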