# Introduction 

*\[Machine Learning is a\] field of study that gives computers the ability to learn without being explicitly programmed.* --- Arthur Samuel, 1959

Machine learning has existed for decades; however, until recently, computing power and data storage were too limited to allow machines to solve many problems in the field effectively.
The amount of data being generated is growing at an enormous rate, and humans are not as reliable or efficient as computers at handling large amounts of data.

In machine learning, the data and the expected output are used by the machine to develop a model, without a person specifying all the details.
<img src="images/traditional_vs_ml.png" width=400>


Let us compare the traditional approach and the machine learning approach to solving a problem such as *classifying email as spam or non-spam.*

## Programmatic approach

Here is an example of the path we might take to develop this classification:

1. First, we would identify some common phrases in a spam message ("4U", "credit card", "free", "amazing", "one simple trick", ...). We might also identify some other common patterns in sender's name, email's name, domains, ...
2. We would write a set of rules (pattern recognition).
3. We would then test our program and likely write more rules.
4. This would continue until our program worked well enough.
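
The steps above can be sketched as a tiny rule-based filter. This is only an illustration: the phrase list comes from the examples in step 1, while the function name `is_spam_by_rules` and the threshold of two matching phrases are assumptions made for this sketch.

```python
# Hand-written rules: phrases taken from the examples in step 1.
SPAM_PHRASES = ["4u", "credit card", "free", "amazing", "one simple trick"]

def is_spam_by_rules(message, min_hits=2):
    """Flag a message as spam when it contains at least `min_hits` known phrases.

    `min_hits` is an assumed threshold; in practice it would be tuned by hand.
    """
    text = message.lower()
    hits = sum(1 for phrase in SPAM_PHRASES if phrase in text)
    return hits >= min_hits

print(is_spam_by_rules("Amazing offer: a FREE credit card 4U"))
print(is_spam_by_rules("Meeting moved to 3 pm, see agenda attached"))
```

Every new kind of spam requires a person to add or adjust a rule, which is the cycle described in steps 3 and 4.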

## ML Approach

1. We take in a large number of emails that have been sorted as *spam* or *not spam*.
2. We define some measurable characteristics of the messages, such as domains, message length, ...
3. We pass the emails through the model, which determines the parameters that give the correct answers.
4. We continue training until the model reaches an acceptable level of accuracy.
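
As a rough sketch of these four steps, the example below learns one weight per word from a handful of labeled messages. The emails are made up, the function names `train_word_weights` and `predict` are hypothetical, and the simple counting scheme is a stand-in for a real model, chosen only to show that the parameters come from the data rather than from hand-written rules.

```python
from collections import Counter

# Step 1: a tiny, made-up labeled set (1 = spam, 0 = not spam).
emails = [
    ("free credit card offer", 1),
    ("amazing free prize", 1),
    ("meeting agenda attached", 0),
    ("lunch meeting tomorrow", 0),
]

def train_word_weights(labeled_emails):
    """Steps 2-3: use words as features and learn one weight per word."""
    weights = Counter()
    for text, label in labeled_emails:
        for word in text.split():
            weights[word] += 1 if label == 1 else -1
    return weights

def predict(weights, text):
    """Classify as spam (1) when the summed word weights are positive."""
    score = sum(weights[word] for word in text.split())
    return 1 if score > 0 else 0

weights = train_word_weights(emails)

# Step 4: measure the accuracy on the labeled examples.
correct = sum(predict(weights, text) == label for text, label in emails)
print("training accuracy:", correct / len(emails))
print(predict(weights, "free prize inside"))
```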




# Practical

The questions that arise when developing a machine learning-based solution come down to obtaining data, putting that data in a form that is amenable to your system, and developing a model. In this workshop, we will go through these steps for a particular problem.

The problem we will target in this example is the detection of breast cancer. Breast cancer is one of the most common cancers among women worldwide. Early diagnosis of breast cancer can greatly decrease the likelihood of death. However, accurate diagnosis can be a challenge and requires expert analysis, which greatly impacts areas where these experts are in short supply.

It would be very beneficial to use machine learning techniques to develop an accurate model that can bring expert levels of accuracy to anywhere the models can be used. 

In this workshop, we will use the Wisconsin breast cancer dataset, a labeled dataset containing 30 parameters obtained by analysis of fine needle aspirate (FNA) of breast masses.
This dataset has been studied in several papers, including [Breast Cancer Detection with Reduced Feature Set](https://www.hindawi.com/journals/cmmm/2015/265138/).


Here is a diagram of our plan:
<img src="images/cancer_diagram.png">

## Steps in this project

1. Getting the needed libraries.
2. Getting the data.
3. Getting the data into a usable form.
4. Exploring the data.
5. Preparing the data for training.
6. Preparing the training and testing sets.
7. Training the model.
8. Testing the model.

## Libraries

Here are the basic libraries that we will use in this notebook for our machine learning workshop.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import sklearn
%matplotlib inline
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

Let us check some information about one of the packages.

In [None]:
print("TensorFlow version: {}".format(tf.__version__))
print("Eager execution is: {}".format(tf.executing_eagerly()))
print("Keras version: {}".format(tf.keras.__version__))

# Breast Cancer data

In [None]:
from sklearn.datasets import load_breast_cancer  # Load the breast cancer dataset bundled with scikit-learn
cancer = load_breast_cancer()

Now, let's look at the data we have:

In [None]:
cancer

That is somewhat human-readable, but it is not convenient to work with, so let's start to transform it.

Let's start by getting the names of the fields.

In [None]:
cancer.keys()

We can start to get some information by looking at one of the components, 'DESCR', which contains a description for users of the data.

In [None]:
print(cancer['DESCR'])

We will look at the labels (ML-terminology) for the dataset.

In [None]:
print(cancer['target'])

These are the labels in coded form. We can see the meanings of the codes by looking at the 'target_names' field.

In [None]:
print(cancer['target_names'])

We note that the array is 0-indexed, so that 0 is 'malignant' and 1 is 'benign'. 
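
Because the codes are positions in the names array, they can be decoded in one step with array indexing. The sketch below uses small stand-in arrays rather than the real dataset; the five example codes are made up for illustration.

```python
import numpy as np

# Stand-ins for cancer['target_names'] and a few entries of cancer['target'].
target_names = np.array(['malignant', 'benign'])
target = np.array([0, 0, 1, 0, 1])  # assumed example codes

# Indexing the names array with the codes decodes every label at once.
decoded = target_names[target]
print(decoded)
```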

The features of the set can be found by looking at the 'feature_names' field.

In [None]:
print(cancer['feature_names'])

# DataFrame

We will use a Pandas DataFrame object to make manipulation of the data easier.

In [None]:
df_cancer = pd.DataFrame(np.c_[cancer['data'],cancer['target']],columns=np.append(cancer['feature_names'],['target']))

An issue to be aware of is that TensorFlow does not allow spaces in feature names, so we will fix that now.

In [None]:
# Replace the spaces in every column name with underscores in a single call.
df_cancer.rename(index=str, columns=lambda name: name.replace(" ", "_"), inplace=True)
print(df_cancer.keys())

In [None]:
df_cancer.head()

In [None]:
df_cancer.tail()

# Feature Scaling

We can note that there is a large difference in the magnitudes of the features. To avoid any issues that this may cause, we can scale the data to a uniform range.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = df_cancer.copy()
# Scale every feature column; the target column is left unchanged.
feature_names = [name for name in df_scaled.columns if name != 'target']
df_scaled[feature_names] = scaler.fit_transform(df_scaled[feature_names])
print(df_scaled.head())
print(df_scaled.tail())
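
With its default settings, `MinMaxScaler` maps each column to the range [0, 1] using x' = (x - min) / (max - min). The sketch below applies the same formula by hand with NumPy; the small array is made up for illustration and is not taken from the dataset.

```python
import numpy as np

# Two made-up feature columns with very different magnitudes.
data = np.array([[10.0, 0.05],
                 [20.0, 0.10],
                 [15.0, 0.20]])

col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Rescale each column independently to the range [0, 1].
scaled = (data - col_min) / (col_max - col_min)
print(scaled)
```

After scaling, both columns span the same range, so neither dominates simply because of its units.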

# Intelligent selection of Data 

Despite the difficulty of obtaining data, we often have redundant data, or too much data, for an efficient model. In our case, we know that we have multiple descriptions of the same property (*mean* and *worst*). In addition, how can we reduce the number of variables without introducing our own biases? One place to start is the correlation of each feature with the target.




In [None]:
print(df_scaled.corr()['target'])

# Visualization of the Data

Let's look at the relationships between some of the variables to get some idea of which data are important and which are redundant.

We'll use a heatmap, which can be used to visualize correlations. Each square of this heatmap shows the correlation between two variables. A correlation close to -1 shows a strong negative correlation (one variable increases as the other decreases), and a correlation close to +1 shows a strong positive correlation (both increase together).
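
To make the meaning of these values concrete, here is a minimal sketch of the Pearson correlation coefficient, which is what pandas' `DataFrame.corr()` computes by default. The helper name `pearson` and the sample arrays are assumptions made for this example.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

x = [1.0, 2.0, 3.0, 4.0]
print(pearson(x, [2.0, 4.0, 6.0, 8.0]))  # both increase together
print(pearson(x, [8.0, 6.0, 4.0, 2.0]))  # one increases as the other decreases
```

The first value comes out close to +1 and the second close to -1, matching the two extremes described above.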

In [None]:
sns.heatmap(df_scaled.corr(),annot=True)

However, the default size can be difficult to view. Let's try to make that a little better. 

In [None]:
plt.figure(figsize=(20,10))
ax = sns.heatmap(df_scaled.corr(),annot=True)
# Workaround for a matplotlib issue that can cut off the top and bottom rows of the heatmap.
bottom, top = ax.get_ylim()
ax.set_ylim(bottom+0.5, top-0.5)

# Choosing features to be studied

It is tempting to use all the features in a haphazard way. In our case, we see that several properties, such as the size of the cells, are captured by multiple types of measurements. Because of this redundancy and the computational expense, it is often advantageous to minimize the number of features.

Thinking about this in a mathematical sense, the features do not necessarily form an orthogonal basis set. This can lead to degenerate answers, which may complicate the optimization process and lead either to a local extremum or to a failure to converge. **This is in general terms of optimization, not strictly ML terms.**

As a related and practical matter, the larger the feature set, the more expensive the calculation is. By reducing the number of features, we try to increase the "signal-to-noise" ratio while decreasing the computational expense.

In our case, we will use five of the mean parameters as a starting point, which reduces the number of features from 30 to 5. Intuitively, mean values tend to be a good choice for measuring trends.

In [None]:
sns.pairplot(df_scaled, vars=['mean_radius','mean_texture','mean_perimeter','mean_area','mean_smoothness'])

Coloring the points by label begins to give some idea of the trends within the dataset.

In [None]:
g = sns.pairplot(df_scaled,hue='target', vars=['mean_radius','mean_texture','mean_perimeter','mean_area','mean_smoothness'])
# Below is to allow the legend to use words instead of numbers. 
handles = g._legend_data.values()
labels = ['Malignant','Benign'] 
g._legend.remove()
g.fig.legend(handles=handles,labels=labels, loc='center right',ncol=1)
g.fig.subplots_adjust(top=0.92,bottom=0.08,right=0.9)

In [None]:
sns.countplot(x='target', data=df_scaled)

In [None]:
df_scaled.corr()['target'].sort_values()

We can see the redundancy in several ways, and I choose to use the mean values for my model.

In [None]:
features=['mean_radius','mean_texture','mean_perimeter','mean_area','mean_smoothness']
labels=['target']

In [None]:
randomized_data = df_scaled.reindex(np.random.permutation(df_scaled.index))

In [None]:
randomized_data.head()

In [None]:
total_records = len(randomized_data)
training_set_size_portion = 0.8
training_set_size = int(total_records*training_set_size_portion)
test_set_size = total_records - training_set_size
print(total_records,training_set_size,test_set_size)

In [None]:
# Building the testing features and labels
testing_features = randomized_data.tail(test_set_size)[features].copy()
testing_labels = randomized_data.tail(test_set_size)[labels].copy()

In [None]:
testing_features.head()

In [None]:
testing_labels.head()

In [None]:
training_features = randomized_data.head(training_set_size)[features].copy()
training_labels = randomized_data.head(training_set_size)[labels].copy()
print(training_features.head())
print(training_labels.head())
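
The shuffle-and-split logic used in the cells above can be checked on a small stand-in. This sketch works on plain index arrays instead of the DataFrame; the record count of 10 and the fixed random seed are assumptions added so that the result is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the split is repeatable
n_records = 10                  # stand-in for len(randomized_data)
indices = rng.permutation(n_records)

train_size = int(n_records * 0.8)
train_idx = indices[:train_size]  # first 80% of the shuffled indices
test_idx = indices[train_size:]   # remaining 20%

print(len(train_idx), len(test_idx))
```

Because the two slices come from the same permutation, no record can land in both the training and the testing set.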

In [None]:
feature_columns = [tf.feature_column.numeric_column(key) for key in features]

In [None]:
print(feature_columns)

# Choice of Model

There are many models that can be used to attempt to solve the problem of classifying whether the cancer is benign or malignant. In this example, we will use a neural network, which is a mathematical model inspired by how brains work.

The strength of neural networks has been shown in their ability to excel at certain problems, especially classification. In this problem, there is a deep pattern connecting the data and the cancer outcome (otherwise, how would the physician's determination be better than a random determination?). It therefore seems a fruitful approach to develop a neural network that classifies each patient's data as malignant or benign.

#  Neural Networks

Neural networks are a type of machine learning algorithm inspired by neurons in the human brain. Similar to neurons in the brain, neural networks are formed by interconnected neurons that interact with each other. Each neuron takes an input, applies a simple algorithm to it, and then passes an output to the next neuron.

Let us look at a *perceptron*; that is, a *single layer neural network*. 

<img src='images/perceptron.png'>

The *perceptron* is a mathematical function that takes a set of inputs, performs some operation, and outputs the result. In this case,
$$ y = \sum_{i} w_{i}x_{i} + w_0,$$
where $x_i$ are the inputs, $w_i$ are the weights of the perceptron, and $w_0$ is the bias. Note that this is the form of a line (plane, hyperplane, ...). The weights determine the importance of each component, and the bias shifts the activation function curve up and down.

The result of the perceptron acting on the inputs is passed to the activation function, which determines how to classify the set.
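
A minimal sketch of this calculation is shown below. The input values, weights, and bias are made up, and the threshold `step` function is just one assumed choice of activation, used here to keep the example short.

```python
import numpy as np

def perceptron(x, w, w0):
    """Weighted sum of the inputs plus the bias: y = sum_i w_i x_i + w_0."""
    return np.dot(w, x) + w0

def step(y):
    """A simple threshold activation (an assumed choice for this sketch)."""
    return 1 if y > 0 else 0

x = np.array([0.5, 0.2])    # made-up inputs
w = np.array([1.0, -2.0])   # made-up weights
w0 = 0.1                    # made-up bias

y = perceptron(x, w, w0)
print(y, step(y))
```

Changing the bias moves the point at which the output switches from one class to the other, without changing the relative importance of the inputs.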




## Architecture of neural networks

A neural network consists of 
* An input layer 
* Any number of hidden layers (these are called hidden because an external observer does not see their output)
* An output layer
* A set of weights and biases between each layer, $\{w_i\}, \{b_i\}$
* An activation function for each layer, $\sigma$


<img src='images/neural_network_1.png'>

## Training Process

Each iteration of the training process consists of the following steps:
1. Calculating the predicted output $\hat{y}$, known as _*Feedforward*_
2. Updating the weights and biases, known as _*Backpropagation*_

Schematically, this can be illustrated as
<img src='images/nn_iteration.png'>

### Feedforward

The forward pass is simply the calculation of the function in series; that is, each neuron computes the sum of the products of the weights and the activations that lead to it. So we are moving forward in the network.

The loss function comes into play at this point, since we must determine the "goodness" of our performance.
There are many possibilities for the *loss* function, such as the familiar *sum-of-squares error*
$$ \mathrm{loss} = \sum_{i=1}^n (y_i-\hat{y}_i)^2$$
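
As a sketch of a forward pass followed by the loss calculation, the example below pushes one made-up input through a network with a single hidden layer. The layer sizes, the random weights, and the use of a sigmoid activation are all assumptions chosen for illustration; they are not the settings used by the classifier later in this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, W1, b1, W2, b2):
    """One hidden layer: each layer is a weighted sum followed by an activation."""
    hidden = sigmoid(W1 @ x + b1)
    return sigmoid(W2 @ hidden + b2)

def sum_of_squares(y, y_hat):
    """Sum-of-squares error between the targets and the predictions."""
    return np.sum((y - y_hat) ** 2)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)  # 2 inputs -> 3 hidden neurons
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)  # 3 hidden neurons -> 1 output

x = np.array([0.5, 0.2])  # made-up input
y = np.array([1.0])       # made-up target
y_hat = feedforward(x, W1, b1, W2, b2)
print(y_hat, sum_of_squares(y, y_hat))
```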

### Backpropagation

Having measured the error of our prediction, we can now use that error to improve the network. This is termed *backpropagation*: we work our way back through the network to update the weights and biases of the neurons.

This optimization is carried out by minimizing the error function. There are multiple methods for optimizing such multidimensional functions; a popular one is to use the derivative of the loss function to determine the path of greatest decrease, as in *gradient descent*.
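
To illustrate gradient descent on its own, the sketch below fits a single linear neuron to made-up data using the sum-of-squares loss. The data, the learning rate, and the number of iterations are assumptions; a full network applies the same update to every weight and bias, with the derivatives obtained by backpropagation.

```python
import numpy as np

# Made-up data generated by y = 2x + 1, so the best parameters are known.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0
learning_rate = 0.02  # assumed value

for _ in range(2000):
    y_hat = w * x + b
    error = y_hat - y
    # Derivatives of sum((y - y_hat)^2) with respect to w and b.
    grad_w = 2.0 * np.sum(error * x)
    grad_b = 2.0 * np.sum(error)
    # Step against the gradient, the direction of steepest decrease.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)
```

The parameters approach the values used to generate the data; a learning rate that is too large would instead make the updates overshoot and diverge.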

## Hyperparameters

*Hyperparameters* are the *variables which determine the network structure* and *how the network is trained*. Examples that affect training are the *learning rate* and the number of *epochs*, *batches*, and *iterations*. These are important parameters that are not learned by the network, so they must be specified by the model designer.

An *epoch* is when an entire training dataset is passed forward and backward through the network *once*. By the end of an epoch, the parameters (weights and biases) have been updated using all of the training data. In short, `batch_size * number_iterations >= number_data`.

The number of *iterations* is the number of *batches* needed to complete one epoch.

In some cases, the dataset will need to be divided into *batches* so that everything fits in memory and the calculations can be completed. For example, you may have 1000 training examples, which are split into smaller batches that are processed one at a time.
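
As a sketch of the arithmetic relating these terms, take 1000 training examples and a batch size of 128. The batch size is an assumption chosen only for illustration.

```python
import math

number_data = 1000  # training examples
batch_size = 128    # assumed batch size for illustration

# Iterations (batches) needed to see every example once, i.e. one epoch.
iterations_per_epoch = math.ceil(number_data / batch_size)
# The final batch holds whatever is left over.
last_batch_size = number_data - (iterations_per_epoch - 1) * batch_size

print(iterations_per_epoch, last_batch_size)
assert batch_size * iterations_per_epoch >= number_data
```

The number of iterations is rounded up because the last batch is usually only partly filled.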


In [None]:
classifier = tf.estimator.DNNClassifier(feature_columns=feature_columns,hidden_units=[10,10,10], n_classes=2,model_dir='tmp/model')

# Train the Network

We now define the training input function.

The function that does this is

`train_input_fn = tf.compat.v1.estimator.inputs.pandas_input_fn(x=training_features, y=training_labels['target'], num_epochs=15,shuffle=True)`

In this case, we will pass through the data set 15 times, updating the weights and biases based on the loss.
See <https://www.tensorflow.org/api_docs/python/tf/estimator/inputs/pandas_input_fn> for complete documentation of the function.



In [None]:
train_input_fn =  tf.compat.v1.estimator.inputs.pandas_input_fn(x=training_features,y=training_labels['target'],num_epochs=15,shuffle=True)

In [None]:
print(type(training_features['mean_radius']), type(training_labels['target']))

**Note:** If you are rerunning the calculation, it may be necessary to clean out the tmp directory.

In [None]:
classifier.train(input_fn=train_input_fn,steps=2000)

# Testing the Model

In [None]:
test_input_fn = tf.compat.v1.estimator.inputs.pandas_input_fn(x=testing_features,y=testing_labels['target'],num_epochs=15,shuffle=False)

In [None]:
classifier.evaluate(input_fn=test_input_fn)

In [None]:
accuracy_score=classifier.evaluate(input_fn=test_input_fn)['accuracy']
print("Accuracy = {}".format(accuracy_score))
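
The accuracy reported here is the fraction of test examples whose predicted class matches the true label. A minimal sketch of that calculation, using made-up labels and predictions rather than the model's actual output:

```python
import numpy as np

labels = np.array([0, 1, 1, 0, 1])       # made-up true labels
predictions = np.array([0, 1, 0, 0, 1])  # made-up model outputs

# Accuracy is the fraction of predictions that match the true labels.
accuracy = np.mean(predictions == labels)
print("Accuracy = {}".format(accuracy))
```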

# Improving the accuracy

Since the accuracy is not very high, let us try to improve it.
There are a number of ways to increase accuracy:

- Increase hidden layers
- Change activation function
- Change activation function in output layer
- Increase number of neurons
- Weight initialization
- More data
- Normalization/scaling data
- Change learning algorithm parameters


Let's try to improve the accuracy by increasing the number of hidden layers. 




In [None]:
classifier_v2 = tf.estimator.DNNClassifier(feature_columns=feature_columns,hidden_units=[10,20,20,10], n_classes=2,model_dir='tmp/model_2')

In [None]:
classifier_v2.train(input_fn=train_input_fn,steps=2000)

In [None]:
classifier_v2.evaluate(input_fn=test_input_fn)

**We have improved the accuracy to 0.86 from 0.36.** Let's try changing the activation function. By default, we are using the ReLU (rectified linear unit). The other options can be found in the [tf.nn documentation](https://www.tensorflow.org/api_docs/python/tf/nn).
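
For reference, here is a NumPy sketch of the default ReLU and of the SELU (scaled exponential linear unit) that is tried next. These are hand-written versions for illustration, not TensorFlow's implementations; the two constants are the standard SELU values.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: passes positive values, zeroes out negatives."""
    return np.maximum(0.0, z)

def selu(z, alpha=1.6732632423543772, scale=1.0507009873554805):
    """Scaled exponential linear unit with its standard constants."""
    z = np.asarray(z, dtype=float)
    return scale * np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))
print(selu(z))
```

Unlike ReLU, SELU returns a small negative value for negative inputs instead of zero.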

In [None]:
%rm -rf './tmp/model_3'
classifier_v3 = tf.estimator.DNNClassifier(feature_columns=feature_columns,hidden_units=[10,20,20,10], activation_fn=tf.nn.selu, n_classes=2,model_dir='tmp/model_3')

In [None]:
classifier_v3.train(input_fn=train_input_fn,steps=2000)

In [None]:
classifier_v3.evaluate(input_fn=test_input_fn)

**We have decreased the accuracy with that.** Let's try something else.

In [None]:
%rm -rf './tmp/model_4'
classifier_v4 = tf.estimator.DNNClassifier(feature_columns=feature_columns,hidden_units=[10,20,20,10], n_classes=2,model_dir='tmp/model_4')
classifier_v4.train(input_fn=train_input_fn,steps=2000)
classifier_v4.evaluate(input_fn=test_input_fn)

# Feedback 

<https://gatech.co1.qualtrics.com/jfe/form/SV_55uzMYLufTuiLch>