To successfully complete this assignment, you need to participate both individually and in groups during class on **Thursday, October 29**.

# In-Class Assignment: Instructor template


<img alt="Flow diagram of typical classification workflow" src="https://www.researchgate.net/profile/Inigo_Goiri/publication/221561415/figure/fig1/AS:393974662615051@1470942287791/Supervised-Machine-Learning-Schema.png">

Figure from: https://www.researchgate.net/publication/221561415_Towards_energy-aware_scheduling_in_data_centers_using_machine_learning

### Agenda for today's class (80 minutes)

1. [(20 minutes) Pre-class Assignment Review](#Pre-class_Assignment_Review)
2. [(20 minutes) Machine Learning Rules of thumb](#Machine_Learning_Rules_of_thumb)
3. [(40 minutes) Example Application: The Breast Cancer data set](#The_cancer_data_set)

----
<a name="Pre-class_Assignment_Review"></a>
# 1. Pre-class Assignment Review


* [1028--ML-pre-class-assignment](1028--ML-pre-class-assignment.ipynb)


---
<a name="Machine_Learning_Rules_of_thumb"></a>


# 2. Machine Learning Rules of thumb

- [Ugly duckling Theorem](https://en.wikipedia.org/wiki/Ugly_duckling_theorem)
- [Curse of Dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality)


&#9989; **<font color=red>DO THIS:</font>** The above two properties can dominate how machine learning can be used. Briefly review both and discuss with your group how you think they relate to Machine Learning. Be prepared to discuss with the rest of the class.

I think the ugly duckling theorem is actually really interesting! It is an easy-to-understand way to see the flaw in potential groupings and to remember that no grouping can be unbiased. No matter what you do there is a bias in science, which chemists have a hard time admitting lol, and this helps to acknowledge that.
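
One way to see the curse of dimensionality numerically is to watch how distances between random points behave as the number of features grows. The following is a minimal sketch and not part of the original assignment; the helper name `distance_contrast`, the number of points, and the dimensions tried are all illustrative assumptions. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

def distance_contrast(n_dims, n_points=500, seed=0):
    """Relative gap between the farthest and nearest point from one random query.

    Hypothetical helper, written for illustration only.
    """
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform points in the unit hypercube
    query = rng.random(n_dims)                # one extra random query point
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrasts = {d: distance_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:5d} dimensions: relative contrast = {c:.3f}")
```

As the dimension increases the contrast shrinks, so "near" and "far" points become harder to tell apart. This is one reason distance-based classifiers can struggle when there are many features.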

---
<a name="The_cancer_data_set"></a>

# 3. Example Application: The Breast Cancer data set

In this example we will follow the same calculation steps as in the pre-class assignment, but using a different dataset provided by scikit-learn called the "cancer" dataset.


The following commands load a dataset of measurements computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.

In [None]:
%matplotlib inline 
import matplotlib.pylab as plt
import numpy as np
import sympy as sym
sym.init_printing()

In [None]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
data = cancer.data
target = cancer.target
data.shape

# Variables used for plotting
labels=cancer.target
cdict={0:'red',1:'green'}
labl={0:'Malignant',1:'Benign'}
marker={0:'*',1:'o'}
alpha={0:.5, 1:.5}

In [None]:
#print(cancer.DESCR)

## Step A: Feature Extraction

The following is a plot of just the first two features:

In [None]:
plt.scatter(data[:,0],data[:,1], c=labels, s=30, cmap=plt.cm.rainbow);
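
The plotting dictionaries defined earlier (`cdict`, `labl`, `marker`, `alpha`) can be used to draw each class separately so that the plot gets a legend and axis labels. The following is one possible sketch, not the instructor's reference solution; it reloads the data so that it runs on its own.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
data, labels = cancer.data, cancer.target

# Same plotting dictionaries as defined earlier in this notebook
cdict = {0: 'red', 1: 'green'}
labl = {0: 'Malignant', 1: 'Benign'}
marker = {0: '*', 1: 'o'}
alpha = {0: .5, 1: .5}

fig, ax = plt.subplots()
for cls in (0, 1):
    mask = labels == cls   # boolean mask selecting the samples of this class
    ax.scatter(data[mask, 0], data[mask, 1], c=cdict[cls], marker=marker[cls],
               alpha=alpha[cls], s=30, label=labl[cls])

# Label the axes with the names of the two plotted features
ax.set_xlabel(cancer.feature_names[0])
ax.set_ylabel(cancer.feature_names[1])
ax.legend()
```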

In [None]:
dir(cancer)
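
The attributes listed by `dir(cancer)` include `data`, `target`, `feature_names`, `target_names`, and `DESCR`. As a small sketch of how they might be inspected before choosing features (the variable names `n_samples`, `n_features`, and `class_counts` are just illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Shape of the feature matrix: (number of samples, number of features)
n_samples, n_features = cancer.data.shape
print("samples:", n_samples, "features:", n_features)

# Names of the first few features and of the two classes
print("first features:", list(cancer.feature_names[:5]))
print("classes:", list(cancer.target_names))

# Number of samples in each class (index matches cancer.target_names)
class_counts = np.bincount(cancer.target)
for name, count in zip(cancer.target_names, class_counts):
    print(f"{name}: {count}")
```

The two classes are not the same size, which is worth keeping in mind when interpreting accuracy later on.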

## Step B: Splitting the dataset into training and testing sets
&#9989; **<font color=red>DO THIS:</font>** Split the cancer data into a training and testing set like we did in the previous example:

In [None]:
from sklearn.model_selection import train_test_split

feature_vectors = cancer.data
class_labels = cancer.target

train_vectors, test_vectors, train_labels, test_labels = train_test_split(feature_vectors, class_labels, test_size=0.25)

print(len(train_vectors))
print(len(test_vectors))
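
Because `train_test_split` shuffles randomly, the split above changes on every run. If a repeatable split with matching class proportions is wanted, the `random_state` and `stratify` arguments can be used. This is an optional variation, sketched here under the assumption that reproducibility matters for your comparison; the seed value `0` is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()

# stratify keeps the class proportions the same in both subsets;
# random_state fixes the shuffle so the split is repeatable
train_vectors, test_vectors, train_labels, test_labels = train_test_split(
    cancer.data, cancer.target, test_size=0.25,
    stratify=cancer.target, random_state=0)

print("training samples:", len(train_vectors))
print("testing samples:", len(test_vectors))
print("benign fraction (train):", train_labels.mean())
print("benign fraction (test): ", test_labels.mean())
```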

## Step C: Select and train a Classifier using the training dataset

&#9989; **<font color=red>DO THIS:</font>** Use the train_vectors set and train_labels to train a Support Vector Machine. *Hint:* You should be able to use the same code and parameters we used in the previous example: 


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

print("Fitting the classifier to the training set")
param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1], }
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf = clf.fit(train_vectors, train_labels)
print("Best estimator found by grid search:")
print(clf.best_estimator_)
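
The features in this dataset have very different numeric ranges, and an RBF-kernel SVM is sensitive to feature scale. One possible variation, which is not what the cell above does, is to standardize the features inside a pipeline before the SVM. The sketch below assumes that standardization is appropriate here; the grid values are illustrative, not tuned recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
train_vectors, test_vectors, train_labels, test_labels = train_test_split(
    cancer.data, cancer.target, test_size=0.25, random_state=0)

# Scaling lives inside the pipeline, so each cross-validation fold is
# standardized using only its own training portion
pipeline = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='balanced'))

# make_pipeline names each step after its lowercased class name, hence 'svc__'
param_grid = {'svc__C': [0.1, 1, 10, 100],
              'svc__gamma': [0.001, 0.01, 0.1]}   # illustrative values (assumption)

search = GridSearchCV(pipeline, param_grid)
search.fit(train_vectors, train_labels)

print("Best parameters:", search.best_params_)
print("Best cross-validation accuracy:", search.best_score_)
```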

## Step D: Show the results of the classification on the testing dataset

&#9989; **<font color=red>DO THIS:</font>** Test the predictive capabilities of your SVM using the test_vectors and compare the predicted labels to the actual labels. 

In [None]:
pred_labels = clf.predict(test_vectors)
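
To compare predicted and actual labels, scikit-learn provides `accuracy_score`, `confusion_matrix`, and `classification_report` in `sklearn.metrics`. The following sketch shows one way to use them. To stay self-contained and quick to run, it trains a stand-in `SVC` with default `C` and `gamma` rather than the grid-searched classifier, which is an assumption made only for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

cancer = load_breast_cancer()
train_vectors, test_vectors, train_labels, test_labels = train_test_split(
    cancer.data, cancer.target, test_size=0.25, random_state=0)

# Stand-in classifier with default C and gamma (assumption, for a quick example)
clf = SVC(kernel='rbf', class_weight='balanced').fit(train_vectors, train_labels)
pred_labels = clf.predict(test_vectors)

# Fraction of test samples labeled correctly
accuracy = accuracy_score(test_labels, pred_labels)

# Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign)
cm = confusion_matrix(test_labels, pred_labels)

print("Accuracy:", accuracy)
print(cm)
print(classification_report(test_labels, pred_labels,
                            target_names=list(cancer.target_names)))
```

The confusion matrix is often more informative than accuracy alone here, since mistaking a malignant sample for benign is a different kind of error than the reverse.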

In [None]:
feature_vectors = cancer.data
class_labels = cancer.target

# The hyperparameter grid is the same for every split, so define it once
param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1]}

# Repeat the split/train/test cycle to see how accuracy varies with the random split
percents = []
for j in range(100):
    train_vectors, test_vectors, train_labels, test_labels = train_test_split(feature_vectors, class_labels, test_size=0.25)
    clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
    clf = clf.fit(train_vectors, train_labels)
    pred_labels = clf.predict(test_vectors)
    # Percentage of test samples whose predicted label matches the true label
    percent = np.mean(pred_labels == test_labels) * 100
    percents.append(percent)
    print("The percentage of correct labels for split number", j, "is:", percent)
print("Mean percentage correct over all splits:", sum(percents)/len(percents))
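
Repeating the random split many times can also be expressed with scikit-learn's `ShuffleSplit` and `cross_val_score`, which return one accuracy per split so that the spread can be reported alongside the mean. This is a sketch under simplifying assumptions: it uses 20 splits and a fixed `SVC` with default `C` and `gamma` instead of a grid search per split, purely to keep the run short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

cancer = load_breast_cancer()

# 20 random 75/25 splits; random_state makes the splits repeatable
splitter = ShuffleSplit(n_splits=20, test_size=0.25, random_state=0)

# Fixed stand-in classifier (assumption): no grid search, to keep the run short
model = SVC(kernel='rbf', class_weight='balanced')

# One accuracy value per split
scores = cross_val_score(model, cancer.data, cancer.target, cv=splitter)

print("Mean accuracy: %.1f%%" % (scores.mean() * 100))
print("Standard deviation: %.1f%%" % (scores.std() * 100))
```

Reporting the standard deviation along with the mean gives a sense of how much a single train/test split can mislead.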

-----
### Congratulations, we're done!

### Course Resources:


- [Website](https://msu-cmse-courses.github.io/cmse802-f20-student/)
- [ZOOM](https://msu.zoom.us/j/97272546850)
- [Syllabus](https://docs.google.com/document/d/e/2PACX-1vT9Wn11y0ECI_NAUl_2NA8V5jcD8dXKJkqUSWXjlawgqr2gU5hII3IsE0S8-CPd3W4xsWIlPAg2YW7D/pub)
- [Schedule](https://docs.google.com/spreadsheets/d/e/2PACX-1vQRAm1mqJPQs1YSLPT9_41ABtywSV2f3EWPon9szguL6wvWqWsqaIzqkuHkSk7sea8ZIcIgZmkKJvwu/pubhtml?gid=2142090757&single=true)




Written by Dirk Colbry, Michigan State University
<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.

----