**Import Data Set:**

The basic idea behind classification-based supervised machine learning is to develop a model that trains on input data and then classifies new data. Finding a large amount of data that is representative of your population is difficult, and collecting and formatting that data for a machine learning model is time-consuming. Luckily, sklearn ships with several datasets that are practical for our academic needs, so we can use one of them here.

**Run the code below** to import our data set.

In [1]:
#Import scikit-learn dataset library
from sklearn import datasets

#Load dataset
cancer = datasets.load_breast_cancer()

**Features and Labels:**

In supervised machine learning, we give our model a series of data points. In order for the model to learn from this data, each data point must have two components: features and labels.

**Features** are values that the machine learning model can use to classify a data point. For example, if your data points are animals and you are trying to classify dogs, cats, etc. then some features could be height, weight, etc.

**Labels** are the different types of classifications that the machine learning model can make. For example, if you are trying to classify animals, your labels could be "Dog", "Cat", etc.

**Run the code below** to look at the various features and labels that we will be using for our breast cancer classifier.

In [2]:
# print the names of the 30 features
print("Features: ", cancer.feature_names)

# print the label names ('malignant', 'benign')
print("Labels: ", cancer.target_names)

Features:  ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
Labels:  ['malignant' 'benign']
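Before training, it can help to confirm what the loaded dataset actually contains. The sketch below reloads the dataset so that it runs on its own, then prints how many data points there are, how many features each one has, and how many data points carry each label. The names `n_samples`, `n_features`, and `counts` are illustrative and not part of the lab.

```python
import numpy as np
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# cancer.data is a 2-D array: one row per data point, one column per feature
n_samples, n_features = cancer.data.shape
print("Data points:", n_samples)
print("Features per data point:", n_features)

# cancer.target holds one integer label per data point;
# each integer is an index into cancer.target_names
counts = np.bincount(cancer.target)
for name, count in zip(cancer.target_names, counts):
    print(name, ":", count)

# Look at the first data point: its first three feature values and its label
print(cancer.data[0][:3], "->", cancer.target_names[cancer.target[0]])
```

Note that the labels are stored as integers: `0` stands for 'malignant' and `1` stands for 'benign', following the order of `cancer.target_names`.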


**Splitting Data:**

There are three main phases of a machine learning model's life-cycle: Training, Validation, and Deployment.

**Training:** Before a machine learning model can perform its regression/classification task, it must "learn" by observing a set of training data. It is important to make sure you have a large amount of training data and that your training data is representative of your population.

**Validation:** Once a machine learning model has been trained, it must be validated to ensure that it will work well with data it has not seen before. If we validated the model with the same data it trained on, a good result would not be impressive, because the model has already seen the answers. Thus it is important to have a set of data points, separate from the training data, that is used only for validation.

**Deployment:** Once a machine learning model has been validated, it can be put to use in the desired application.

In the context of this lab, we have already imported our data. However, we need to split this data into training and validation data. Luckily, sklearn has a function that will split the data for us.

**Run the code below** to separate our data into training data and validation data.

In [9]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set (70% training, 30% test)
features_train, features_test, labels_train, labels_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109
)
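As a quick sanity check on the split, the sketch below repeats the cell above so that it runs on its own, prints how many data points landed in each set, and then splits a second time with the same `random_state` to show that the shuffle is reproducible. The `_again` variables and `same_split` are illustrative names only.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()

features_train, features_test, labels_train, labels_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109
)

print("Training data points:", len(features_train))
print("Test data points:", len(features_test))

# Splitting again with the same random_state reproduces the same shuffle
features_train_again, features_test_again, _, labels_test_again = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109
)
same_split = (
    np.array_equal(features_test, features_test_again)
    and np.array_equal(labels_test, labels_test_again)
)
print("Same split both times:", same_split)
```

Fixing `random_state` is what makes the statistics later in this lab repeatable from run to run; a different value (or none) would generally put different data points in the test set.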

**Generating A Model:**

There are many types of supervised machine learning models. One of the most popular is the Support Vector Machine (SVM). The theory behind SVM goes beyond the scope of this lab; however, if you are interested in learning more, here is a link to a high-level explanation: https://towardsdatascience.com/support-vector-machine-simply-explained-fee28eba5496.

If you are interested in learning more about machine learning models/theory, consider taking courses such as Math 447, CSCI 467, CSCI 567, EE 541, EE 559, EE 641, etc.


In [10]:
#Import svm model
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel - see the SVM API for alternative kernels

#Train the model using the training sets
clf.fit(features_train, labels_train)

#Predict the response for test dataset
labels_pred = clf.predict(features_test)
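To see what `clf.predict` returns, the sketch below repeats the earlier cells so that it runs on its own, then prints the first few predictions next to the true labels. The names `n_show`, `pred_names`, and `true_names` are illustrative and not part of the lab.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
features_train, features_test, labels_train, labels_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109
)

# Same model as above: an SVM classifier with a linear kernel
clf = svm.SVC(kernel='linear')
clf.fit(features_train, labels_train)
labels_pred = clf.predict(features_test)

# predict returns one integer label (0 or 1) per test data point;
# indexing target_names with those integers gives readable names
n_show = 10
pred_names = cancer.target_names[labels_pred[:n_show]]
true_names = cancer.target_names[labels_test[:n_show]]
for predicted, actual in zip(pred_names, true_names):
    print("predicted:", predicted, "| actual:", actual)
```

Each row where the two names differ is a misclassification; the validation statistics in the next section summarize these over the whole test set.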

**Validation:**

Now that we have trained our model, it is time for us to determine how well it will work. There are many statistics that can be used to quantify the "goodness" of the trained algorithm, and in this case we will use accuracy, precision, and recall.

**Accuracy:** This measures how often the trained machine learning classifier correctly labeled the validation data.

> Accuracy = $\frac{\text{Num Correct}}{\text{Num Validation Data}}$

**Precision:** This measures the portion of predicted positives that were actually positive.

> Precision = $\frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$

**Recall:** This measures the portion of actual positives that were labeled as positive.

> Recall = $\frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$
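The three formulas can be worked through by hand on a small example. The sketch below uses ten made-up labels (not taken from the breast cancer dataset), counts the four possible outcomes, and applies each formula directly. All variable names are illustrative.

```python
# Toy labels for illustration only: 1 = positive, 0 = negative
true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(true_labels, pred_labels))

# Count each kind of outcome by comparing the true label with the prediction
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(true_labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("TP, FP, FN, TN:", tp, fp, fn, tn)  # 3 1 1 5
print("Accuracy:", accuracy)              # 0.8
print("Precision:", precision)            # 0.75
print("Recall:", recall)                  # 0.75
```

Here 8 of the 10 predictions are correct (accuracy 0.8), 3 of the 4 predicted positives are truly positive (precision 0.75), and 3 of the 4 actual positives were found (recall 0.75).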

**Run the code below** to calculate these statistics.


In [12]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(labels_test, labels_pred))

# Model Precision: what percentage of predicted positives are actually positive?
print("Precision:",metrics.precision_score(labels_test, labels_pred))

# Model Recall: what percentage of actual positives are labeled as such?
print("Recall:",metrics.recall_score(labels_test, labels_pred))

Accuracy: 0.9649122807017544
Precision: 0.9811320754716981
Recall: 0.9629629629629629
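One detail worth keeping in mind when reading these numbers: by default, `precision_score` and `recall_score` treat label `1` as the positive class, and in this dataset label `1` is 'benign'. The sketch below, which repeats the earlier cells so that it runs on its own, prints the confusion matrix and recomputes precision and recall with 'malignant' (label `0`) as the positive class by using the `pos_label` parameter. The names `cm`, `precision_malignant`, and `recall_malignant` are illustrative.

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
features_train, features_test, labels_train, labels_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109
)

clf = svm.SVC(kernel='linear')
clf.fit(features_train, labels_train)
labels_pred = clf.predict(features_test)

# Rows are true labels, columns are predicted labels,
# both ordered [malignant (0), benign (1)]
cm = metrics.confusion_matrix(labels_test, labels_pred)
print(cm)

# Treat 'malignant' (label 0) as the positive class instead of the default 1
precision_malignant = metrics.precision_score(labels_test, labels_pred, pos_label=0)
recall_malignant = metrics.recall_score(labels_test, labels_pred, pos_label=0)
print("Precision (malignant):", precision_malignant)
print("Recall (malignant):", recall_malignant)
```

Which class counts as "positive" changes the precision and recall values but not the accuracy, so it is worth stating the choice explicitly whenever these statistics are reported.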
