# Breast Cancer Detection

### Contributors: Hyeeun Hughes, Arnold Schultz, Mauvonte Roberts, Ryan Grimsley

Breast Cancer Data (122 kB): https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)

## Part 1: Prepare the Data

* Number of instances: 569

* Number of attributes: 32

* Attribute information:

  1. ID number

  2. Diagnosis (M = malignant, B = benign)

  3. Columns 3-32: ten real-valued features are computed for each cell nucleus:

     * a) radius (mean of distances from center to points on the perimeter)
     * b) texture (standard deviation of gray-scale values)
     * c) perimeter
     * d) area
     * e) smoothness (local variation in radius lengths)
     * f) compactness (perimeter^2 / area - 1.0)
     * g) concavity (severity of concave portions of the contour)
     * h) concave points (number of concave portions of the contour)
     * i) symmetry
     * j) fractal dimension ("coastline approximation" - 1)

* Missing attribute values: None

In [None]:
# Import our dependencies
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Import and read data.csv
df = pd.read_csv("../Resources/data.csv")
df.head(30)

In [None]:
df.columns

### Definition of mean, se, and worst

* Mean: The average

* Standard Error: The standard error of the mean

* Worst: The mean of the three largest values. (These three statistics were computed for each of the ten features in every image, resulting in 30 features.)
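
As a minimal sketch of these three summaries, the snippet below computes them for one feature measured on the nuclei of a single image. The values in `radii` are made up for illustration and are not taken from the dataset, and `summarize` is a hypothetical helper name.

```python
import math
import statistics


def summarize(values):
    """Return (mean, standard error, worst) for one feature of one image."""
    mean = statistics.mean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    se = statistics.stdev(values) / math.sqrt(len(values))
    # "Worst": mean of the three largest values
    worst = statistics.mean(sorted(values, reverse=True)[:3])
    return mean, se, worst


# Hypothetical radius measurements for the nuclei in one image
radii = [10.0, 12.0, 14.0, 16.0, 18.0]
mean, se, worst = summarize(radii)
print(f"mean={mean:.2f}, se={se:.2f}, worst={worst:.2f}")
```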

In [None]:
# Re-naming columns
# df.rename(columns={'radius_worst': 'radius_largest', 
#                    'texture_worst': 'texture_largest',
#                    'perimeter_worst': 'perimeter_largest',
#                    'area_worst': 'area_largest',
#                    'smoothness_worst': 'smoothness_largest',
#                    'compactness_worst': 'compactness_largest',
#                    'concavity_worst': 'concavity_largest',
#                    'concave_points_worst': 'concave_largest',
#                    'symmetry_worst': 'symmetry_largest',
#                    'fractal_dimension_worst': 'cfractal_dimension_largest',
#                      }, inplace=True)
# df.head()

The key challenge in breast cancer detection is classifying tumors as malignant (cancerous) or benign (non-cancerous). We classify these tumors with machine learning models (k-nearest neighbors, logistic regression, and random forest) trained on the Breast Cancer Wisconsin (Diagnostic) Dataset.

In [None]:
# 'diagnosis' value count
df['diagnosis'].value_counts()

In [None]:
# Find null values
for column in df.columns:
    print(f"Column {column} has {df[column].isnull().sum()} null values")

In [None]:
# Find duplicate entries
print(f"Duplicate entries: {df.duplicated().sum()}")

In [None]:
# Determine the number of unique values in each column.
df.nunique()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
# Look at 'diagnosis' value counts for binning
val_counts = df['diagnosis'].value_counts()
val_counts

In [None]:
# Encode the diagnosis as a binary target:
# malignant ("M") becomes 1, anything else (benign, "B") becomes 0
def encode_diagnosis(diagnosis):
    if diagnosis == "M":
        return 1
    else:
        return 0


df["diagnosis"] = df["diagnosis"].apply(encode_diagnosis)
df.head(20)
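
The same encoding can be written without a helper function. The sketch below uses `Series.map` on a small hypothetical frame (`toy` is not the project data). It assumes the column contains only the labels `M` and `B`. Unlike the function above, any other label would become `NaN` rather than 0, which makes unexpected values easy to spot.

```python
import pandas as pd

# Small stand-in for the diagnosis column (illustrative values only)
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M", "B"]})

# Map malignant to 1 and benign to 0; unmapped labels would become NaN
toy["diagnosis"] = toy["diagnosis"].map({"M": 1, "B": 0})

print(toy["diagnosis"].tolist())
```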

In [None]:
# Collect any diagnosis labels that occur only once so they can be binned as "Other"
# Note: val_counts was computed before encoding, and neither "M" nor "B" occurs
# only once, so this list is expected to be empty and the loop changes nothing
diagnosis_to_replace = list(val_counts[val_counts == 1].index)

# Replace in dataframe
for label in diagnosis_to_replace:
    df['diagnosis'] = df['diagnosis'].replace(label, "Other")

In [None]:
# Check to make sure binning was successful
df["diagnosis"].value_counts()

In [None]:
# Look at radius_worst value counts for binning
# radius_worst_value_counts = df['radius_worst'].value_counts()
# radius_worst_value_counts 

In [None]:
# Look at texture_worst value counts for binning
# texture_worst_value_counts = df['texture_worst'].value_counts()
# texture_worst_value_counts 

In [None]:
# Look at area_worst value counts for binning
# area_worst_value_counts = df['area_worst'].value_counts()
# area_worst_value_counts 

In [None]:
# Look at perimeter_worst value counts for binning
# perimeter_worst_value_counts = df['perimeter_worst'].value_counts()
# perimeter_worst_value_counts 

In [None]:
# Separate the features from the "diagnosis" target
# The "id" column is a record identifier, not a measurement, so it is dropped as well
# Split the data into X_train, X_test, y_train, y_test
X = df.drop(columns=['id', 'diagnosis'])
y = df['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train.head()

In [None]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the StandardScaler
X_scaler = scaler.fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

In [None]:
# Instantiate KNN model and make predictions
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
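
Scaling and classification can also be chained in a `Pipeline`, which refits the scaler inside each cross-validation fold so that no validation data influences the scaling. The block below is a hedged sketch of that idea with the same 9-neighbor classifier and 5-fold cross-validation. It assumes scikit-learn's bundled copy of the Wisconsin diagnostic data (`load_breast_cancer`) as a stand-in for `data.csv`. That copy encodes malignant as 0 and benign as 1, the reverse of the encoding used in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled WDBC copy stands in for data.csv
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# The scaler is refit on the training portion of every fold
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=9))
cv_scores = cross_val_score(knn_pipeline, X_demo, y_demo, cv=5)

print(f"Mean 5-fold accuracy: {cv_scores.mean():.3f}")
```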

In [None]:
# Assess the accuracy score
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
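
Accuracy alone does not show which kind of error the model makes, and in a screening setting a missed malignant case is usually treated as the more serious one. A minimal sketch of a per-class breakdown follows. The `y_true_demo` and `y_pred_demo` lists are hypothetical stand-ins for `y_test` and `y_pred`, using this notebook's encoding (1 = malignant, 0 = benign).

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels and predictions (illustrative values only)
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes (order: 0, 1)
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Recall for the malignant class: share of malignant cases that were caught
malignant_recall = recall_score(y_true_demo, y_pred_demo, pos_label=1)

print(cm)
print(f"Missed malignant cases: {fn}, malignant recall: {malignant_recall:.2f}")
```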

## Part 2: Apply Dimensionality Reduction

In [None]:
df.sample(20)

In [None]:
# Create a new dataframe for t-SNE
# Drop the target and the "id" column (an identifier, not a measurement)
df2 = df.drop(['id', 'diagnosis'], axis=1)
labels = df['diagnosis']

In [None]:
# Initialize t-SNE model
tsne = TSNE(learning_rate=35)

In [None]:
# Reduce dimensions
tsne_features = tsne.fit_transform(df2)
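
t-SNE works on pairwise distances, so columns with large numeric ranges (such as area, or an identifier column if one is left in) can dominate the embedding. A hedged sketch of one common remedy, standardizing the features first, is shown below. It assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) in place of `data.csv`, and `random_state=0` is an arbitrary choice added only to make the embedding repeatable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled WDBC copy stands in for data.csv
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Put every feature on the same scale before distances are computed
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Same learning rate as above; the seed is an illustrative addition
tsne_demo = TSNE(n_components=2, learning_rate=35, random_state=0)
embedding = tsne_demo.fit_transform(X_demo_scaled)

print(embedding.shape)
```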

In [None]:
# The dataset has 2 columns
tsne_features.shape

In [None]:
# Prepare to plot the dataset
# The first column of transformed features
df2['x']=tsne_features[:,0]

In [None]:
df2['y']=tsne_features[:,1]

In [None]:
# Visualize the clusters
plt.scatter(df2['x'], df2['y'])
plt.show()

In [None]:
labels.value_counts()

In [None]:
# Visualize the clusters with color
plt.scatter(df2['x'], df2['y'], c=labels)
plt.show()

In [None]:
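# Standardize the feature columns with StandardScaler
# (the "id" and "diagnosis" columns are excluded so they do not influence the components)
df_scaled = StandardScaler().fit_transform(df.drop(columns=['id', 'diagnosis']))
print(df_scaled[0:15])

In [None]:
# Apply PCA to reduce dimensions from 30 to 2
# Initialize PCA model
pca = PCA(n_components=2)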

In [None]:
# Get two principal components for the breast cancer data
df_pca = pca.fit_transform(df_scaled)

In [None]:
# Transform PCA data to a DataFrame
df_pca = pd.DataFrame(
    data=df_pca, columns=["principal component 1", "principal component 2"])
df_pca.head()

In [None]:
# Fetch the explained variance
pca.explained_variance_ratio_
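
The explained variance ratios indicate how much of the total variance two components retain. As a sketch, the cumulative sum over all components shows how many would be needed to reach a chosen threshold. The 95% target here is an arbitrary illustration, and scikit-learn's bundled copy of the dataset (30 feature columns, no ID or label) is assumed in place of `data.csv`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled WDBC copy stands in for data.csv
X_demo, _ = load_breast_cancer(return_X_y=True)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components to inspect the full variance profile
pca_full = PCA().fit(X_demo_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

print(f"Two components explain {cumulative[1]:.1%} of the variance")
print(f"Components needed for 95%: {n_needed}")
```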

## Part 3: Logistic Regression and Random Forest

In [None]:
print(df.info())

In [None]:
# Assign the data to X and y by listing the feature columns explicitly
# Note: scikit-learn expects X to be two-dimensional;
# selecting with a list of columns already returns a 2-D DataFrame

# X = df[['id', 'radius_mean', 'texture_mean', 'perimeter_mean',
#        'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
#        'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
#        'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
#        'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
#        'fractal_dimension_se', 'radius_worst', 'texture_worst',
#        'perimeter_worst', 'area_worst', 'smoothness_worst',
#        'compactness_worst', 'concavity_worst', 'concave_points_worst',
#        'symmetry_worst', 'fractal_dimension_worst']]
# y = df['diagnosis']

# print("Shape: ", X.shape, y.shape)

In [None]:
# Separate the features from the target; "id" is an identifier, so it is dropped too
# Split the data into X_train, X_test, y_train, y_test
X = df.drop(columns=['id', 'diagnosis'])
y = df['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
X_train.head()

In [None]:
# Train a Logistic Regression model and print the model score
classifier = LogisticRegression(solver='lbfgs', random_state=1)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)
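
On unscaled features the `lbfgs` solver may reach its iteration limit before converging. A minimal sketch of one way around this, standardizing inside a `Pipeline` before the logistic regression, is given below. It assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`, where malignant is 0 and benign is 1) in place of `data.csv`, and the `_demo` and `lr_pipeline` names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled WDBC copy stands in for data.csv
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# Standardize, then fit logistic regression with the same solver and seed
lr_pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", random_state=1),
)
lr_pipeline.fit(X_tr, y_tr)
test_accuracy = lr_pipeline.score(X_te, y_te)

print(f"Test accuracy: {test_accuracy:.3f}")
```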

In [None]:
# Train a Random Forest Classifier model and print the model score
classifier = RandomForestClassifier(random_state=1)
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

In [None]:
predictions = classifier.predict(X_test)
pd.DataFrame({"Prediction": predictions, "Actual": y_test})