# Using Predictive Analysis To Predict Diagnosis of a Breast Tumor

## Identify the problem
Breast cancer is the most common malignancy among women, accounting for nearly 1 in 3 cancers diagnosed among women in the United States, and it is the second leading cause of cancer death among women. Breast cancer occurs as a result of abnormal growth of cells in the breast tissue, commonly referred to as a tumor. A tumor does not necessarily mean cancer: tumors can be benign (not cancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Tests such as MRI, mammogram, ultrasound, and biopsy are commonly used to diagnose breast cancer.

### Expected outcome
The input is the result of a breast fine needle aspiration (FNA) test, a quick and simple procedure that removes some fluid or cells from a breast lesion or cyst (a lump, sore, or swelling) with a fine needle similar to a blood sample needle. From these results, the aim is to build a model that can classify a breast tumor into one of two classes:

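   - 0 = Malignant (cancerous)
   - 1 = Benign (not cancerous)

This is the encoding used by the target array of scikit-learn's load_breast_cancer dataset, which the rest of this notebook relies on.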

### Objective
Since the labels in the data are discrete, the prediction falls into one of two categories (i.e., malignant or benign). In machine learning, this is a classification problem.

Thus, the goal is to classify whether a breast tumor is benign or malignant. To achieve this, we use machine learning classification methods to fit a function that can predict the discrete class of a new input.

### Identify data sources
The Breast Cancer Wisconsin (Diagnostic) dataset is available from the machine learning repository maintained by the University of California, Irvine. The dataset contains 569 samples of malignant and benign tumor cells.

   - The columns contain 30 real-valued features that have been computed from digitized images of the cell nuclei, which can be used to build a model to predict whether a tumor is benign or malignant.

#### Getting Started: Load libraries

In [3]:
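# Core data and plotting libraries used throughout the notebook
import numpy as np
import pandas as pd
import seaborn as sns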
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from IPython.display import Image, display

#### Load Dataset and Describing the Data
First, load the dataset using the load_breast_cancer function supplied by scikit-learn.

In [4]:
data = load_breast_cancer()
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
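Before going further, it can help to confirm how the two classes are encoded and how many samples each one has. The following is a minimal sketch, assuming scikit-learn and NumPy are installed; the variable names (bc, counts) are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()

# target_names[i] is the class name for integer label i
for label, name in enumerate(bc.target_names):
    print(label, '=', name)

# Count how many samples carry each label
counts = np.bincount(bc.target)
print('samples per class:',
      {str(name): int(c) for name, c in zip(bc.target_names, counts)})
print('feature matrix shape:', bc.data.shape)
```

Checking the encoding early avoids misreading the confusion matrices later in the notebook.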

In [5]:
print(data['DESCR'])

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

##### Converting the data to a pandas DataFrame

In [6]:
df_cancer = pd.DataFrame(np.c_[data['data'], data['target']], columns=np.append(data['feature_names'], ['target']))

#### Inspecting the data
The first step is to visually inspect the new data set. There are multiple ways to achieve this:

   - The easiest is to request the first few records using the DataFrame “head()” method. By default, “df_cancer.head()” returns the first 5 rows of the DataFrame object df_cancer (excluding the header row).
   - Alternatively, one can use “df_cancer.tail()” to return the last five rows of the data frame.
   - For both the head and tail methods, the number of records can be specified by passing the required number between the parentheses when calling either method.

In [None]:
df_cancer.head(n=5)
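The head and tail options described above can be tried on any DataFrame. Here is a small sketch on a toy frame with made-up values (not the tumor data); the names df, first_three, last_two, and default_head are illustrative.

```python
import pandas as pd

# Small illustrative frame (hypothetical values, not the tumor data)
df = pd.DataFrame({'id': range(1, 11), 'value': [v * 0.5 for v in range(10)]})

first_three = df.head(3)   # first 3 rows
last_two = df.tail(2)      # last 2 rows
default_head = df.head()   # no argument: 5 rows

print(first_three)
print(last_two)
print(len(default_head))
```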

You can check the number of cases, as well as the number of fields, using the shape attribute, as shown below.

In [None]:
df_cancer.shape

In the result displayed, you can see the data has 569 records, each with 31 columns (the 30 features plus the target).

The “info()” method provides a concise summary of the data; from the output, it provides the type of data in each column, the number of non-null values in each column, and how much memory the data frame is using.

In [None]:
df_cancer.info()

The “describe()” method provides descriptive statistics that summarize the central tendency, dispersion, and shape of each column’s distribution, excluding NaN values.

In [None]:
df_cancer.describe()

##  Data Visualizations
Visualization is the process of projecting the data, or parts of it, into Cartesian space or into abstract images. In the data mining process, data exploration is leveraged in many different steps including preprocessing, modeling, and interpretation of results. Two kinds of plots are used here:

   - Correlation matrix
   - Scatter plots

In [None]:
plt.figure(figsize=(30, 20))
plt.title('Breast Cancer Feature Correlation', fontsize=50, ha='center')
# square expects a boolean; annotate each cell with its correlation value
sns.heatmap(df_cancer.corr(), annot=True, square=True, fmt='.2g', linewidths=2)

In [None]:
# Pair plot of the first five features, colored by class
sns.pairplot(df_cancer, hue='target', vars=list(data['feature_names'][:5]))
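With 31 columns, the annotated heatmap can be hard to read. One possible complement, sketched here as an assumption rather than a step from the original notebook, is to rank the features by the absolute value of their correlation with the target. The names frame, target_corr, and ranked are introduced for illustration only.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns the features and target together as a pandas DataFrame
frame = load_breast_cancer(as_frame=True).frame

# Pearson correlation of every feature with the target column
target_corr = frame.corr()['target'].drop('target')

# Rank by strength of the linear relationship, ignoring sign
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked.head(10))
```

A high absolute correlation only indicates a linear relationship with the label; it does not by itself say how useful a feature is to a kernel SVM.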

## Pre-Processing the data
### Introduction
Data preprocessing is a crucial step for any data analysis problem. It is often a good idea to prepare your data in such a way as to best expose the structure of the problem to the machine learning algorithms that you intend to use. This involves a number of activities such as:

   - Assigning numerical values to categorical data;
   - Handling missing values; and
   - Normalizing the features (so that features on larger scales do not dominate when fitting a model to the data).
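The three activities listed above can be sketched on a toy table. The values below are made up, the column names are hypothetical, and filling with the median is just one of several reasonable choices for missing values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with a categorical label and a missing value
toy = pd.DataFrame({
    'diagnosis': ['M', 'B', 'B', 'M'],
    'radius': [20.0, 10.0, np.nan, 15.0],
})

# 1. Assign numerical values to categorical data
#    (mapping chosen to mirror scikit-learn's 0 = malignant, 1 = benign)
toy['diagnosis_code'] = toy['diagnosis'].map({'M': 0, 'B': 1})

# 2. Handle missing values (here: fill with the column median)
toy['radius'] = toy['radius'].fillna(toy['radius'].median())

# 3. Normalize the feature to the [0, 1] range (min-max scaling)
r = toy['radius']
toy['radius_scaled'] = (r - r.min()) / (r.max() - r.min())

print(toy)
```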

#### Assessing Model Accuracy: Split data into training and test sets
The simplest method to evaluate the performance of a machine learning algorithm is to use different training and testing datasets. Here I will split the available data into a training set and a testing set (75% training, 25% test).

In [None]:
# Feature matrix and target vector from the loaded dataset
X = data.data
Y = data.target
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
x_train.shape, y_train.shape, x_test.shape, y_test.shape
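A quick check of the split sizes and class balance can be sketched as follows. Note that stratify=bc.target is an assumption added here to keep the class proportions equal in both parts; the split in this notebook does not use it. The names bc, x_tr, x_te, y_tr, and y_te are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

bc = load_breast_cancer()

# stratify is an assumption of this sketch, not part of the notebook's split
x_tr, x_te, y_tr, y_te = train_test_split(
    bc.data, bc.target, test_size=0.25, random_state=0, stratify=bc.target
)

print(x_tr.shape, x_te.shape)
# Fraction of class 1 overall, in the training set, and in the test set
print(np.mean(bc.target), np.mean(y_tr), np.mean(y_te))
```

With 569 samples and test_size=0.25, the test set holds 143 samples and the training set 426.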

#### Feature Scaling

In [None]:
scaler = MinMaxScaler(feature_range=(0, 1))
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
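MinMaxScaler rescales each feature as (x - min) / (max - min), where min and max are learned from the training data only and then reused for the test data. The sketch below uses made-up numbers to show that the scaler matches the formula written out by hand; the array names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values: two features on very different scales
train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
test = np.array([[2.5, 600.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train)   # learns min/max from train
test_scaled = scaler.transform(test)         # reuses the training min/max

# The same computation written out by hand
col_min, col_max = train.min(axis=0), train.max(axis=0)
manual_train = (train - col_min) / (col_max - col_min)
manual_test = (test - col_min) / (col_max - col_min)

print(train_scaled)
print(test_scaled)   # test values can fall outside [0, 1]
```

Fitting the scaler on the training set alone keeps information about the test set out of the model.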

## Predictive model using Support Vector Machine (SVM)
The support vector machine (SVM) learning algorithm will be used to build the predictive model. SVMs are one of the most popular classification algorithms, and have an elegant way of transforming nonlinear data so that one can use a linear algorithm to fit a linear model to the data (Cortes and Vapnik 1995).
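To illustrate the idea of handling nonlinear data with a kernel, here is a small sketch on synthetic data rather than the tumor dataset. It assumes scikit-learn's make_circles generator, which produces one class as a ring around the other, so no straight line can separate them.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: one class forms a ring around the other
X, y = make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Same algorithm, two kernels
linear_acc = SVC(kernel='linear').fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel='rbf').fit(X_tr, y_tr).score(X_te, y_te)

print('linear kernel accuracy:', linear_acc)
print('RBF kernel accuracy:', rbf_acc)
```

The RBF kernel implicitly maps the points into a space where the two rings become linearly separable, which is why it should score well above the linear kernel here.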

Kernelized support vector machines are powerful models and perform well on a variety of datasets.

   - SVMs allow for complex decision boundaries, even if the data has only a few features.
   - They work well on low-dimensional and high-dimensional data (i.e., few and many features), but don’t scale very well with the number of samples. Running an SVM on data with up to 10,000 samples might work well, but working with datasets of size 100,000 or more can become challenging in terms of runtime and memory usage.
   - SVMs require careful preprocessing of the data and tuning of the parameters. This is why, these days, most people instead use tree-based models such as random forests or gradient boosting (which require little or no preprocessing) in many applications.

   - SVM models are hard to inspect; it can be difficult to understand why a particular prediction was made, and it might be tricky to explain the model to a nonexpert.

### Important Parameters
The important parameters in kernel SVMs are:

   - The regularization parameter C;
   - The choice of kernel (linear, radial basis function (RBF), or polynomial); and
   - Kernel-specific parameters, such as gamma for the RBF kernel.

gamma and C both control the complexity of the model, with large values in either resulting in a more complex model. Therefore, good settings for the two parameters are usually strongly correlated, and C and gamma should be adjusted together.
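The effect of C and gamma on model complexity can be sketched by fitting a few RBF SVMs and comparing training and test accuracy. The three settings below are illustrative choices, not values from this notebook, and the variable names are assumptions of the sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

bc = load_breast_cancer()
x_tr, x_te, y_tr, y_te = train_test_split(
    bc.data, bc.target, test_size=0.25, random_state=0
)
scaler = MinMaxScaler()
x_tr = scaler.fit_transform(x_tr)
x_te = scaler.transform(x_te)

# Illustrative settings, from a very simple model to a very flexible one
settings = [(0.1, 0.001), (1, 0.1), (100, 10)]
results = {}
for C, gamma in settings:
    model = SVC(kernel='rbf', C=C, gamma=gamma).fit(x_tr, y_tr)
    results[(C, gamma)] = (model.score(x_tr, y_tr), model.score(x_te, y_te))
    print(f'C={C}, gamma={gamma}: train={results[(C, gamma)][0]:.3f}, '
          f'test={results[(C, gamma)][1]:.3f}')
```

A large gap between training and test accuracy suggests the model has become too complex for the data; a low score on both suggests it is too simple.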

In [None]:
svc = SVC()
svc.fit(x_train, y_train)

classifier_score = svc.score(x_test, y_test)
print('\nThe classifier accuracy score is', classifier_score)

In [None]:
y_pred = svc.predict(x_test)
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix with the count written in each cell
fig, ax = plt.subplots(figsize=(5, 5))
ax.matshow(cm, cmap=plt.cm.Reds, alpha=0.3)
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(x=j, y=i, s=cm[i, j], va='center', ha='center')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()
print(classification_report(y_test, y_pred))

#### Observation

There are two possible predicted classes: "1" and "0". In the scikit-learn encoding of this dataset, benign = 1 (indicates absence of cancer cells) and malignant = 0 (indicates presence).

   - The classifier made a total of 143 predictions (i.e., 143 tumor samples were tested for the presence of breast cancer).
   - Out of those 143 cases, the classifier predicted class "1" (benign) 97 times and class "0" (malignant) 46 times.
   - In reality, 90 samples in the test set are class "1" (benign) and 53 are class "0" (malignant).
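The counts in an observation like this one come straight from the confusion matrix: column sums give how often each class was predicted, and row sums give how often each class actually occurs. The sketch below uses ten made-up labels (not this notebook's results) to show how the cells map to accuracy, precision, and recall for class 1.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for ten samples (not the notebook's results)
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_hat = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]

# Rows are actual classes, columns are predicted classes, ordered 0 then 1
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

predicted_one = cm[:, 1].sum()   # how often class 1 was predicted
actual_one = cm[1, :].sum()      # how often class 1 actually occurs

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the samples predicted 1, how many are 1
recall = tp / (tp + fn)      # of the samples that are 1, how many were found

print(cm)
print(predicted_one, actual_one)
print(accuracy, precision, recall)
```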

## Optimizing the SVM Classifier

Machine learning models are parameterized so that their behavior can be tuned for a given problem. Models can have many parameters and finding the best combination of parameters can be treated as a search problem. In this notebook, I aim to tune parameters of the SVM Classification model using scikit-learn.

In [None]:
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.001], 'kernel': ['rbf']}

grid = GridSearchCV(SVC(), param_grid, refit=True, verbose = 4)
grid.fit(x_train, y_train)

In [None]:
print("The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
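One possible variation, shown here as a sketch and not as the method used in this notebook, is to place the scaler and the classifier in a scikit-learn Pipeline so that the scaling is re-fit on the training portion of every cross-validation fold. The step names 'scale' and 'svc' and the variable names are assumptions of the sketch; cv=5 is stated explicitly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

bc = load_breast_cancer()
x_tr, x_te, y_tr, y_te = train_test_split(
    bc.data, bc.target, test_size=0.25, random_state=0
)

# Scaling lives inside the pipeline, so it is re-fit on each training fold
pipe = Pipeline([('scale', MinMaxScaler()), ('svc', SVC(kernel='rbf'))])

# The 'svc__' prefix routes each parameter to the SVC step
grid_params = {'svc__C': [0.1, 1, 10, 100], 'svc__gamma': [1, 0.1, 0.001]}

search = GridSearchCV(pipe, grid_params, cv=5)
search.fit(x_tr, y_tr)

print(search.best_params_)
print('mean cross-validated accuracy:', search.best_score_)
print('test accuracy:', search.score(x_te, y_te))
```

Note that best_score_ is the mean cross-validated accuracy on the training data, so it should be reported separately from the accuracy on the held-out test set.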

In [None]:
grid_score = grid.score(x_test, y_test)
print('\nThe grid accuracy score is', grid_score)

In [None]:
grid_pred = grid.predict(x_test)
cm = confusion_matrix(y_test, grid_pred)

# Plot the confusion matrix with the count written in each cell
fig, ax = plt.subplots(figsize=(5, 5))
ax.matshow(cm, cmap=plt.cm.Reds, alpha=0.3)
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(x=j, y=i, s=cm[i, j], va='center', ha='center')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()
print(classification_report(y_test, grid_pred))

#### Observation

There are two possible predicted classes: "1" and "0". In the scikit-learn encoding of this dataset, benign = 1 (indicates absence of cancer cells) and malignant = 0 (indicates presence).

   - The classifier made a total of 143 predictions (i.e., 143 tumor samples were tested for the presence of breast cancer).
   - Out of those 143 cases, the classifier predicted class "1" (benign) 92 times and class "0" (malignant) 51 times.
   - In reality, 90 samples in the test set are class "1" (benign) and 53 are class "0" (malignant).

### Final Accuracy reached - 97%