# Breast Cancer Detection

## Preprocessing

* Number of instances: 569

* Number of attributes: 32

* Attribute information:
   
   1) ID number

   2) Diagnosis (M = malignant, B = benign)

   3-32) Ten real-valued features are computed for each cell nucleus:

  a) radius (mean of distances from center to points on the perimeter)

  b) texture (standard deviation of gray-scale values)

  c) perimeter

  d) area

  e) smoothness (local variation in radius lengths)

  f) compactness (perimeter^2 / area - 1.0)

  g) concavity (severity of concave portions of the contour)

  h) concave points (number of concave portions of the contour)

  i) symmetry

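  j) fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" (mean of the three largest values) of each feature are recorded, which gives the 30 feature columns 3-32 (for example, `radius_worst`).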


* Missing attribute values: None
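
To make the geometric definitions above concrete, here is a minimal sketch that computes radius, perimeter, area, and compactness for a closed polygon. The function name `contour_features`, the use of the vertex average as the center, and the square test contour are assumptions made for illustration; the dataset's values were not produced by this code.

```python
import math


def contour_features(points):
    """Compute simple shape features for a closed polygon given as (x, y) vertices."""
    n = len(points)
    # Center taken as the plain average of the vertices (an assumption of this sketch)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n

    # radius: mean distance from the center to the points on the perimeter
    radius = sum(math.hypot(x - cx, y - cy) for x, y in points) / n

    perimeter = 0.0
    twice_area = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the contour
        perimeter += math.hypot(x2 - x1, y2 - y1)
        twice_area += x1 * y2 - x2 * y1  # shoelace formula term
    area = abs(twice_area) / 2.0

    # compactness as defined in the attribute list: perimeter^2 / area - 1.0
    compactness = perimeter ** 2 / area - 1.0
    return {"radius": radius, "perimeter": perimeter, "area": area, "compactness": compactness}


# Hypothetical contour: a square with side length 2
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
features = contour_features(square)
print(features)
```

Smoother, rounder contours give lower compactness values under this definition, while irregular outlines push it up.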

In [None]:
# Import our dependencies
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
# Import and read data.csv (the Breast Cancer Wisconsin (Diagnostic) dataset).
df = pd.read_csv("../Resources/data.csv")
df.head(30)

In [None]:
df.columns

In [None]:
# Re-naming columns
# df.rename(columns={'no-recurrence-events': 'recurrence', 
#                    '30-39': 'age',
#                    'premeno': 'menopause',
#                    '30-34': 'tumor_size',
#                    '0-2': 'inv-nodes',
#                    'no': 'node-caps',
#                    '3': 'deg-malig',
#                    'left': 'breast',
#                    'left_low': 'breast-quad',
#                    'no.1': 'irradiat'
#                   }, inplace=True)
# df.head()

The key challenge in breast cancer detection is classifying tumors as malignant (cancerous) or benign (non-cancerous). We ask you to complete the analysis of classifying these tumors using machine learning (with SVMs) and the Breast Cancer Wisconsin (Diagnostic) Dataset.
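
Since the task statement mentions SVMs, the following is a minimal sketch of an SVM baseline. It assumes the copy of the Wisconsin Diagnostic data bundled with scikit-learn rather than the notebook's CSV, and the variable names (`X_demo`, `svm_clf`, and so on) and hyperparameters are illustrative choices, not a tuned model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# scikit-learn ships a copy of the dataset (569 rows, 30 feature columns).
# Note: in this copy the target is encoded 0 = malignant, 1 = benign.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Hold out a test set, keeping the class balance the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=78, stratify=y_demo
)

# Scale the features, then fit an RBF-kernel support vector classifier
svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm_clf.fit(X_tr, y_tr)

svm_accuracy = svm_clf.score(X_te, y_te)
print(f"SVM test accuracy: {svm_accuracy:.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, which mirrors the scaling approach used later in this notebook.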

In [None]:
# 'diagnosis' value count
df['diagnosis'].value_counts()

In [None]:
# Drop the non-beneficial ID columns, 'menopause'.
# df = df.drop(['menopause'], axis = 1)
# df.head(30)
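
The ID number carries no diagnostic information, so it would normally be removed before modelling. A minimal sketch on a toy frame follows; the column names `id` and `empty_col` and all values are hypothetical, and the real CSV's column names should be checked with `df.columns` first.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real CSV; names and values are made up
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [20.1, 11.5, 12.3],
    "empty_col": [np.nan, np.nan, np.nan],
})

# Drop the identifier, then drop any column that is entirely empty
cleaned = toy.drop(columns=["id"]).dropna(axis="columns", how="all")
print(cleaned.columns.tolist())
```

If the identifier stays in the frame, it ends up in the feature matrix and the model may fit noise in the ID values.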

In [None]:
# Find null values
for column in df.columns:
    print(f"Column {column} has {df[column].isnull().sum()} null values")

In [None]:
# Find duplicate entries
print(f"Duplicate entries: {df.duplicated().sum()}")

In [None]:
# df.dtypes[df.dtypes == "object"].index.tolist()

In [None]:
# Look at the number of unique values in each categorical (object) column
# df[df.dtypes[df.dtypes == "object"].index.tolist()].nunique()

In [None]:
# Determine the number of unique values in each column.
df.nunique()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
# Look at 'diagnosis' value counts for binning
val_counts = df['diagnosis'].value_counts()
val_counts

In [None]:
# Encode the target: malignant ("M") becomes 1, any other label (benign, "B") becomes 0
def encode_diagnosis(diagnosis):
    return 1 if diagnosis == "M" else 0


df["diagnosis"] = df["diagnosis"].apply(encode_diagnosis)
df.head(20)
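
The encoding step above treats every label other than "M" as benign, so a typo or unexpected label would silently become 0. A stricter alternative, sketched here on a toy series, is an explicit mapping that fails on unknown labels; the names `labels`, `mapping`, and `encoded` are illustrative.

```python
import pandas as pd

# Toy labels standing in for the diagnosis column
labels = pd.Series(["M", "B", "B", "M"])

# Explicit mapping: malignant -> 1, benign -> 0
mapping = {"M": 1, "B": 0}
encoded = labels.map(mapping)

# Labels missing from the mapping become NaN, so stop instead of guessing
if encoded.isna().any():
    raise ValueError("Unexpected diagnosis label found")

encoded = encoded.astype(int)
print(encoded.tolist())
```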

In [None]:
# Collect any diagnosis labels that occurred exactly once in the original counts.
# With two well-populated classes this list is expected to be empty,
# so the loop below leaves the column unchanged.
diagnosis_to_replace = list(val_counts[val_counts == 1].index)

# Replace rare labels in the dataframe
for label in diagnosis_to_replace:
    df['diagnosis'] = df['diagnosis'].replace(label, "Other")

In [None]:
# Check the diagnosis value counts after encoding
df["diagnosis"].value_counts()

In [None]:
df.columns

In [None]:
# Look at radius_worst value counts for binning
radius_worst_value_counts = df['radius_worst'].value_counts()
radius_worst_value_counts 

In [None]:
# Look at texture_worst value counts for binning
texture_worst_value_counts = df['texture_worst'].value_counts()
texture_worst_value_counts 

In [None]:
# Look at area_worst value counts for binning
area_worst_value_counts = df['area_worst'].value_counts()
area_worst_value_counts 

In [None]:
# Look at perimeter_worst value counts for binning
perimeter_worst_value_counts = df['perimeter_worst'].value_counts()
perimeter_worst_value_counts 

In [None]:
# Look at perimeter_worst values that occur fewer than 100 times
perimeter_worst_value_counts[perimeter_worst_value_counts < 100]
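
For a continuous measurement, `value_counts` tends to return many distinct values with small counts, so if binning is the goal it may be more useful to cut on the values themselves. A minimal sketch with `pd.cut` follows; the sample values, bin edges, and labels are hypothetical and not derived from the dataset.

```python
import pandas as pd

# Hypothetical perimeter measurements
perimeters = pd.Series([80.0, 99.9, 120.5, 150.0, 200.3])

# Bins are right-inclusive by default: (0, 100], (100, 150], (150, inf]
binned = pd.cut(
    perimeters,
    bins=[0, 100, 150, float("inf")],
    labels=["small", "medium", "large"],
)
print(binned.value_counts())
```

A neural network can usually work with the scaled continuous values directly, so binning is optional here.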

In [None]:
# radius_worst_value_counts[radius_worst_value_counts  > 12.3]

In [None]:
# Determine which values to replace if counts are less than 1000
# radius_worst_to_replace = list(radius_worst_value_counts [radius_worst_value_counts > 13].index)

# Replace in dataframe
# for app in radius_worst_to_replace:
#     df['radius_worst'] = df['radius_worst'].replace(app,"Other")

In [None]:
# Check to make sure binning was successful
# df['radius_worst'].value_counts()

In [None]:
# Convert categorical data to numeric with `pd.get_dummies`
dummies_df = pd.get_dummies(df)
dummies_df.head(30)

In [None]:
# Split our preprocessed data into our features and target arrays
X = dummies_df.drop(["diagnosis"], axis='columns').values
y = dummies_df["diagnosis"].values
                    

# Split the preprocessed data into a training and testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)
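
Because the two diagnosis classes are not equally frequent, it can help to keep their proportions identical in the training and test sets. The sketch below shows the `stratify` argument of `train_test_split` on toy arrays; the array contents and the 40/60 class split are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 40 positives and 60 negatives
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 40 + [0] * 60)

# stratify=y_toy keeps the 40/60 class ratio in both splits
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, random_state=78, stratify=y_toy
)

print("train positives:", y_toy_train.mean())
print("test positives:", y_toy_test.mean())
```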

In [None]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the StandardScaler
X_scaler = scaler.fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
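
The scaling step above can be reproduced by hand, which makes clear that the test set is transformed with the training set's statistics. This is a small NumPy sketch with made-up numbers; the names `train`, `test`, `mean`, and `std` are illustrative.

```python
import numpy as np

# Made-up training and test rows with two features
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test = np.array([[2.0, 30.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# z = (x - mean) / std, applied with the same statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Fitting the scaler on the training rows only keeps information about the test set from leaking into the model.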

## Compile, Train and Evaluate the Model

In [None]:
# Define the model - deep neural net, i.e., the number of input features and hidden nodes for each layer.
number_input_features = len(X_train[0]) 
hidden_nodes_layer1 = 8
hidden_nodes_layer2 = 5

nn = tf.keras.models.Sequential()

# First hidden layer
nn.add(tf.keras.layers.Dense(units=hidden_nodes_layer1, input_dim=number_input_features, activation="relu"))

# Second hidden layer
nn.add(tf.keras.layers.Dense(units=hidden_nodes_layer2, activation="relu"))

# Output layer
nn.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

# Check the structure of the model
nn.summary()
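
The parameter counts reported by the model summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes 30 input features (columns 3-32 of the attribute list); the actual number depends on how many columns remain in `X`, and the helper name `dense_param_counts` is hypothetical.

```python
def dense_param_counts(n_inputs, layer_units):
    """Return the trainable parameter count of each fully connected layer."""
    counts = []
    fan_in = n_inputs
    for units in layer_units:
        # weights (fan_in * units) plus one bias per unit
        counts.append(fan_in * units + units)
        fan_in = units  # this layer's outputs feed the next layer
    return counts


# Assumed 30 inputs, then the 8-unit, 5-unit, and 1-unit layers defined above
per_layer = dense_param_counts(30, [8, 5, 1])
total_params = sum(per_layer)
print(per_layer, total_params)
```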

In [None]:
# Compile the model
nn.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
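
The binary cross-entropy loss named above averages the negative log-probability that the model assigns to the true class. Here is a pure-Python sketch of that formula; the labels and predicted probabilities are made up, and Keras applies extra numerical safeguards that this version omits.

```python
import math


def binary_cross_entropy(y_true, y_prob):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Made-up labels (1 = malignant) and sigmoid outputs
y_true_demo = [1, 0, 1, 0]
y_prob_demo = [0.9, 0.1, 0.8, 0.3]

bce = binary_cross_entropy(y_true_demo, y_prob_demo)
print(f"Binary cross-entropy: {bce:.4f}")
```

Confident wrong predictions are penalized heavily, because the log term grows without bound as the probability assigned to the true class approaches zero.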

In [None]:
# Train the model
fit_model = nn.fit(X_train_scaled, y_train, epochs=100)

In [None]:
# Evaluate the model using the test data
model_loss, model_accuracy = nn.evaluate(X_test_scaled, y_test, verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")
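
Accuracy alone does not show how many malignant cases were missed, so it may be worth also looking at precision and recall. The sketch below computes them from thresholded probabilities; the labels and probabilities are made up (in the notebook they could come from `nn.predict(X_test_scaled)`), and the 0.5 threshold is an assumption.

```python
# Made-up true labels (1 = malignant) and predicted probabilities
y_true_eval = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob_eval = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3]

# Convert probabilities to class predictions with an assumed 0.5 threshold
threshold = 0.5
y_pred_eval = [1 if p >= threshold else 0 for p in y_prob_eval]

# Confusion-matrix counts
pairs = list(zip(y_true_eval, y_pred_eval))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # share of predicted malignant cases that are malignant
recall = tp / (tp + fn)     # share of malignant cases that were detected

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"Accuracy={accuracy:.2f} Precision={precision:.2f} Recall={recall:.2f}")
```

In a screening setting a false negative is usually the costlier error, so recall on the malignant class is often the number to watch.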