#Data Manipulation and Building an AI Model to Classify Breast Cancer Biopsies as Malignant or Benign from Scan Measurements
AI-assisted classification matters for breast cancer because earlier and more consistent detection allows treatment to begin sooner.

Our dataset reports several different features of the biopsies. Here's what a few of them mean:

1. $Perimeter$: Total distance between points defining the cell's nuclear perimeter.
2. $Radius$: Average distance from the center of the cell's nucleus to its perimeter.
3. $Texture$: The texture of the cell nucleus is measured by finding the variance of the gray scale intensities in the component pixels.
4. $Area$: Nuclear area is measured by counting the number of pixels on the interior of the nucleus and adding one-half of the pixels in the perimeter.

The following image illustrates these cell nucleus features:

![perimeter](https://drive.google.com/uc?export=view&id=1-U43OAojYbMY9gIlpvLHPNr3V2saqqHJ)

5. $Smoothness$: Measures the smoothness of a nuclear contour by measuring the difference between the length of a radial line and the mean length of the lines surrounding it. The image below demonstrates this:

![smoothness](https://drive.google.com/uc?export=view&id=1oVGDbMi1R23i_dpMsimb3VV3_wrsQG-Q)

6. $Concavity$: Measures the severity of concavities or indentations in a cell nucleus. Chords are drawn between non-adjacent snake points, and the extent to which the actual boundary lies inside each chord is measured. The line in bold in the image below is an example of a chord.

![concavity](https://drive.google.com/uc?export=view&id=1svQHoeu1wKMAnum33lgvSNppx2GVsuKX)

7. $Symmetry$: The major axis (longest chord) through the center is found. Then, for lines drawn perpendicular to the major axis, the difference in length from the axis to the nuclear boundary on either side is measured. The image below shows an example of this:

![symmetry](https://drive.google.com/uc?export=view&id=1BAOqXpqCllq8iInFKlsZehM3qPr99WS9)


The paper that first detailed these measurements for this dataset can be found here for more information: https://pdfs.semanticscholar.org/1c4a/4db612212a9d3806a848854d20da9ddd0504.pdf
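
To make the geometric definitions above more concrete, here is a minimal sketch that computes perimeter, radius, area, and texture for a toy nucleus. It is an illustration under stated assumptions, not the method used to produce the dataset: the boundary is a hypothetical list of (x, y) points, the center is taken as the centroid of those points, the area uses the shoelace formula instead of the pixel count described above, and the texture uses the population variance.

```python
import math

def perimeter(points):
    """Sum of distances between consecutive boundary points (closed contour)."""
    n = len(points)
    return sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))

def mean_radius(points):
    """Average distance from the centroid of the boundary points to each point."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return sum(math.dist((cx, cy), p) for p in points) / len(points)

def polygon_area(points):
    """Shoelace formula; an assumed stand-in for the pixel-counting definition."""
    n = len(points)
    twice_area = sum(
        points[i][0] * points[(i + 1) % n][1] - points[(i + 1) % n][0] * points[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2

def texture(gray_values):
    """Population variance of the gray-scale intensities of the nucleus pixels."""
    mean = sum(gray_values) / len(gray_values)
    return sum((g - mean) ** 2 for g in gray_values) / len(gray_values)

# Hypothetical boundary: a 4 x 2 rectangle traced by its four corners.
boundary = [(0, 0), (4, 0), (4, 2), (0, 2)]

print(perimeter(boundary))      # 12.0
print(mean_radius(boundary))    # distance from (2, 1) to any corner, sqrt(5)
print(polygon_area(boundary))   # 8.0
print(texture([10, 12, 14]))    # variance of three hypothetical gray values
```

A real nucleus boundary would have many more points, but the same functions apply unchanged.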

In [151]:
#import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf

#Exploring our Dataset

In [152]:
#use pandas to read the breast cancer dataset
dataset = pd.read_csv('breast-cancer.csv')
#print the first five rows of data to see the various parameters
dataset.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


We see that our dataset has 2 problems:

1) The dataset has some unnecessary columns, namely the "id" column and the "worst" value of each feature, which this model will not use for breast cancer classification.

2) The diagnosis is currently stored as "M" or "B" for malignant or benign. The model outputs a number between 0 and 1, so the labels need to be stored as 1 (malignant) and 0 (benign) for it to learn from them.

So, we will do some **data manipulation**.

In [None]:
#modifying the dataset to only keep the necessary columns (.copy() avoids pandas' SettingWithCopyWarning on the next assignment)
dataset = dataset[['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']].copy()
#encoding the diagnosis column as plain integers: 1 = malignant, 0 = benign
dataset['diagnosis(1=m, 0=b)'] = dataset['diagnosis'].map({'M': 1, 'B': 0})
dataset = dataset.drop(columns = 'diagnosis')
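
One thing worth checking after a `map` call is whether every label was actually covered, because pandas turns any value missing from the dictionary into NaN instead of raising an error. The sketch below shows that check on a hypothetical five-row column (the lowercase 'm' is a deliberately bad label); it is not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature of the diagnosis column, including one bad label ('m').
labels = pd.Series(['M', 'B', 'B', 'm', 'M'])

encoded = labels.map({'M': 1, 'B': 0})

# Labels that are not keys of the mapping become NaN rather than raising an error.
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)

# Class balance of the rows that did map cleanly.
class_counts = encoded.dropna().astype(int).value_counts().to_dict()
print(class_counts)
```

On the real dataset, a non-zero `n_unmapped` would indicate an unexpected diagnosis code that should be cleaned before training.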

In [154]:
dataset.head()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,"diagnosis(1=m, 0=b)"
0,17.99,10.38,122.8,1001.0,0.1184,0.3001,0.1471,0.2419,0.07871,1
1,20.57,17.77,132.9,1326.0,0.08474,0.0869,0.07017,0.1812,0.05667,1
2,19.69,21.25,130.0,1203.0,0.1096,0.1974,0.1279,0.2069,0.05999,1
3,11.42,20.38,77.58,386.1,0.1425,0.2414,0.1052,0.2597,0.09744,1
4,20.29,14.34,135.1,1297.0,0.1003,0.198,0.1043,0.1809,0.05883,1


We set x (the independent variables, or features) to be every column except the diagnosis. These are the measurements the model trains and tests on.

We set y (the dependent variable, or label) to be the diagnosis, since the model aims to predict the diagnosis from the x values.

In [163]:
x = dataset.drop(columns=['diagnosis(1=m, 0=b)'])
y = dataset[['diagnosis(1=m, 0=b)']]

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean
0,17.99,10.38,122.8,1001.0,0.1184,0.3001,0.1471,0.2419,0.07871
1,20.57,17.77,132.9,1326.0,0.08474,0.0869,0.07017,0.1812,0.05667
2,19.69,21.25,130.0,1203.0,0.1096,0.1974,0.1279,0.2069,0.05999
3,11.42,20.38,77.58,386.1,0.1425,0.2414,0.1052,0.2597,0.09744
4,20.29,14.34,135.1,1297.0,0.1003,0.198,0.1043,0.1809,0.05883


#Split the data into a training set and a testing set.

In [156]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
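
The split above is reshuffled on every run, which is one reason test accuracy can differ between runs. As a hedged sketch of an alternative, `train_test_split` also accepts `random_state` (to fix the shuffle) and `stratify` (to keep the malignant/benign ratio the same in both sets). The arrays below are hypothetical stand-ins, not the real dataset, and the seed value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples, 9 features, 40% positive labels.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 9))
y_demo = np.array([1] * 40 + [0] * 60)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo,
    test_size=0.2,
    stratify=y_demo,   # keep the class ratio the same in both sets
    random_state=42,   # fix the shuffle so the split is repeatable
)

# With stratification, the 20 test rows contain 40% positives, as in the full data.
print(len(y_te), int(y_te.sum()))
```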

#Build and train the neural network.

In [201]:
model = tf.keras.models.Sequential()

In [202]:
#add layers to the neural network: hidden layers use the ReLU activation, and the output layer uses sigmoid to produce a value between 0 and 1
model.add(tf.keras.layers.Dense(256, input_shape=(x_train.shape[1],), activation='relu'))
model.add(tf.keras.layers.Dropout(0.3))  # randomly drop 30% of activations during training to reduce overfitting
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.3))  # same dropout rate after the second hidden layer
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dropout(0.3))  # same dropout rate after the third hidden layer
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))  # single output: estimated probability of malignancy
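
To get a feel for the size of this network, the sketch below counts its trainable parameters by hand. Each Dense layer has one weight per input-unit pair plus one bias per unit, and Dropout layers add none. The input width of 9 is taken from the nine feature columns kept in x; the helper name `dense_params` is made up for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 9                   # feature columns remaining in x
layer_units = [256, 128, 64, 1]  # Dense layer widths used above

counts = []
n_inputs = n_features
for n_units in layer_units:
    counts.append(dense_params(n_inputs, n_units))
    n_inputs = n_units           # this layer's output feeds the next layer

print(counts)       # [2560, 32896, 8256, 65]
print(sum(counts))  # 43777
```

The total should match the "Trainable params" figure reported by `model.summary()` for the same architecture.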

In [203]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
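
The `binary_crossentropy` loss penalizes confident wrong predictions far more than confident right ones. The following plain-Python sketch of the formula, -[y log(p) + (1 - y) log(1 - p)] averaged over samples, illustrates this. It is a simplified illustration, not Keras's implementation; the `eps` clipping value is an assumption added here to avoid taking log(0).

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A malignant sample (label 1) predicted with high vs. low probability.
loss_confident_right = binary_crossentropy([1], [0.9])
loss_confident_wrong = binary_crossentropy([1], [0.1])

print(round(loss_confident_right, 4))  # 0.1054
print(round(loss_confident_wrong, 4))  # 2.3026
```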

In [None]:
model.fit(x_train, y_train, epochs=1000)

#Evaluate the model

In [None]:
model.evaluate(x_test, y_test) #returns [loss, accuracy]; the best test accuracy observed was 0.9474, and it varies between roughly 0.80 and 0.95 across runs
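
The accuracy figure comes from turning each sigmoid output into a class label and comparing it with the true diagnosis. Below is a small sketch of that step, assuming a cutoff of 0.5; the probabilities and labels are made-up values, not model output. It also counts false negatives (malignant cases predicted as benign), which accuracy alone does not reveal.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert sigmoid outputs to 1 (malignant) or 0 (benign)."""
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def false_negatives(y_true, y_pred):
    """Malignant cases (1) that were predicted as benign (0)."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

# Hypothetical sigmoid outputs and true diagnoses for five biopsies.
probs = [0.92, 0.15, 0.61, 0.40, 0.08]
truth = [1, 0, 1, 1, 0]

preds = to_labels(probs)
print(preds)                          # [1, 0, 1, 0, 0]
print(accuracy(truth, preds))         # 0.8
print(false_negatives(truth, preds))  # 1
```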