# Breast Cancer EDA📊 + Predictive Modelling🎮

![Breast Cancer](https://images.newscientist.com/wp-content/uploads/2019/06/06165424/c0462719-cervical_cancer_cell_sem-spl.jpg?width=1200)

Dataset Description

The Breast Cancer Wisconsin (Diagnostic) dataset is available from the UCI Machine Learning Repository, maintained by the University of California, Irvine. The dataset contains 569 samples of malignant and benign tumor cells.

The first two columns in the dataset store the unique ID numbers of the samples and the corresponding diagnosis (M=malignant, B=benign), respectively. Columns 3-32 contain 30 real-valued features computed from digitized images of the cell nuclei, which can be used to build a model that predicts whether a tumor is benign or malignant.

* 1= Malignant (Cancerous) - Present (M)
* 0= Benign (Not Cancerous) - Absent (B)


Column names and meanings:

* id: ID number
* diagnosis: The diagnosis of breast tissues (M = malignant, B = benign)
* radius_mean: mean of distances from center to points on the perimeter
* texture_mean: standard deviation of gray-scale values
* perimeter_mean: mean perimeter of the cell nuclei
* area_mean: mean area of the cell nuclei
* smoothness_mean: mean of local variation in radius lengths
* compactness_mean: mean of perimeter^2 / area - 1.0
* concavity_mean: mean of severity of concave portions of the contour
* concave points_mean: mean number of concave portions of the contour (the CSV column name contains a space)
* symmetry_mean
* fractal_dimension_mean: mean for "coastline approximation" - 1
* radius_se: standard error for the mean of distances from center to points on the perimeter
* texture_se: standard error for standard deviation of gray-scale values
* perimeter_se
* area_se
* smoothness_se: standard error for local variation in radius lengths
* compactness_se: standard error for perimeter^2 / area - 1.0
* concavity_se: standard error for severity of concave portions of the contour
* concave points_se: standard error for number of concave portions of the contour
* symmetry_se
* fractal_dimension_se: standard error for "coastline approximation" - 1
* radius_worst: "worst" value (mean of the three largest values) of distances from center to points on the perimeter
* texture_worst: "worst" value (mean of the three largest values) of standard deviation of gray-scale values
* perimeter_worst
* area_worst
* smoothness_worst: "worst" value (mean of the three largest values) of local variation in radius lengths
* compactness_worst: "worst" value (mean of the three largest values) of perimeter^2 / area - 1.0
* concavity_worst: "worst" value (mean of the three largest values) of severity of concave portions of the contour
* concave points_worst: "worst" value (mean of the three largest values) of number of concave portions of the contour
* symmetry_worst
* fractal_dimension_worst: "worst" value (mean of the three largest values) of "coastline approximation" - 1
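
The 30 feature names follow a regular pattern: ten base measurements, each reported as a mean, a standard error, and a "worst" value. A minimal sketch that rebuilds the names, assuming the spelling used in the Kaggle CSV (where `concave points` contains a space rather than an underscore); the variable names are illustrative only:

```python
# Ten base measurements computed for each cell nucleus.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Each measurement is summarised three ways.
suffixes = ["mean", "se", "worst"]

# Group by suffix first so the order matches the column list above.
feature_names = [f"{base}_{suffix}" for suffix in suffixes for base in base_features]
print(len(feature_names))
print(feature_names[:3])
```

Comparing this list against `df.columns` after loading is a quick way to spot a renamed or missing column.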

# Import Libraries 📚

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import missingno
from pandas_profiling import ProfileReport
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score,confusion_matrix
from sklearn import metrics

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from IPython.display import HTML
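# Read the stylesheet with a context manager so the file handle is closed
with open("../input/ocean2/ocean.css") as css_file:
    css = css_file.read()
HTML(f"<style>{css}</style>")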

# Loading the Dataset.

### Dropping the `Unnamed: 32` and `id` columns, which are not used as features.

In [None]:
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
df.drop(['Unnamed: 32','id'] ,axis=1, inplace=True)
df

## Checking for missing values in the dataset.

In [None]:
missingno.matrix(df)
plt.show()

### This shows that the dataset has 0 missing values.
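
The matrix plot gives a visual check; a numeric count is easier to verify. A minimal sketch on a made-up two-column frame (`toy` is a stand-in for the notebook's `df`, and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the notebook's `df`; one value is deliberately missing.
toy = pd.DataFrame({
    "radius_mean": [17.99, 20.57, np.nan],
    "texture_mean": [10.38, 17.77, 21.25],
})

missing_per_column = toy.isnull().sum()        # count of NaN values in each column
total_missing = int(missing_per_column.sum())  # count across the whole frame
print(missing_per_column)
print("Total missing values:", total_missing)
```

Running the same two lines on `df` should report zero for every column if the plot above has no gaps.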

## Checking for duplicate rows in the dataset.

In [None]:
df.duplicated().sum()

### This shows that the dataset has 0 duplicate rows.

## Describing the dataset.

In [None]:
df.describe().T

# EDA (Exploratory Data Analysis)📊

## Let's check how the dataset is divided between the two types of diagnosis.

In [None]:
plt.figure(figsize=(12, 8))
sns.countplot(x=df['diagnosis'], palette='RdBu')

# Look counts up by label instead of relying on the order of value_counts()
counts = df['diagnosis'].value_counts()
benign, malignant = counts['B'], counts['M']
print('Number of samples labeled Benign : ', benign)
print('Number of samples labeled Malignant : ', malignant)
print('')
print('% of samples labeled Benign', round(benign / len(df) * 100, 2), '%')
print('% of samples labeled Malignant', round(malignant / len(df) * 100, 2), '%')
plt.show()

### The plot shows 212 Malignant (cancerous) diagnoses and 357 Benign (non-cancerous) diagnoses.

## Let's check the correlation between the features.

In [None]:
fig, ax = plt.subplots(figsize=(20, 20))
# numeric_only=True (pandas >= 1.5) skips the string-valued 'diagnosis' column
sns.heatmap(df.corr(numeric_only=True), annot=True, linewidths=.5, fmt='.1f', ax=ax)
plt.show()

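### The heatmap shows strong positive correlations between several features; the radius, perimeter, and area measurements in particular move together, as expected from geometry.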
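
A 30 x 30 heatmap is hard to read cell by cell, so it can help to list the most strongly correlated pairs directly. A minimal sketch on made-up data (the `toy` frame and its values are illustrative only, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data: perimeter is built from radius, so the two are perfectly correlated.
radius = np.array([10.0, 12.0, 14.0, 18.0, 20.0])
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,
    "texture": [19.0, 14.0, 21.0, 10.0, 17.0],
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# stack() turns the matrix into a (row, column) -> value Series; masked cells are NaN.
pairs = upper.stack().dropna().sort_values(ascending=False)
print(pairs.head())
```

Applied to the notebook's correlation matrix, the top of `pairs` would show which features carry nearly the same information.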

## Cluster-mapping the correlation matrix to group related features together.

In [None]:
# numeric_only=True (pandas >= 1.5) skips the string-valued 'diagnosis' column
sns.clustermap(df.corr(numeric_only=True))
plt.show()

## Checking the relationship between two specific features, `concavity_worst` and `concave points_worst`.

In [None]:
sns.jointplot(x=df.loc[:,'concavity_worst'], y=df.loc[:,'concave points_worst'], kind="reg", color="#ce1414")
plt.show()

# Predictive Modelling 🎮

### Let's look at the columns of the dataset so that we know which features the model will use.

In [None]:
df.columns

## The diagnosis column is categorical, but the model needs a numerical target, so we encode it as numbers.

In [None]:
df['diagnosis']=df['diagnosis'].map({'M':1,'B':0})

### Mapping the Malignant diagnosis to 1 and the Benign diagnosis to 0 gives the model a numeric target.
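
One thing to watch with `Series.map`: any value that is not a key in the dictionary becomes `NaN`, which also happens if the mapping cell is run a second time on already-encoded values. A minimal sketch of a sanity check, using a made-up `diagnosis` Series in place of the notebook's column:

```python
import pandas as pd

# Toy labels standing in for the notebook's `diagnosis` column.
diagnosis = pd.Series(["M", "B", "B", "M", "B"])

encoded = diagnosis.map({"M": 1, "B": 0})

# Series.map returns NaN for any value missing from the dictionary,
# so a NaN here would reveal an unexpected label.
assert encoded.notna().all(), "found a label other than 'M' or 'B'"
print(encoded.value_counts().to_dict())
```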

## Separating the features and the target value for our model.

In [None]:
X = df.drop(["diagnosis"], axis = 1)
y = df.diagnosis.values

## Scaling the feature values to the [0, 1] range (min-max normalization) so that all features are on a comparable scale.

In [None]:
# Min-max scale each column; DataFrame.min()/max() return per-column values
X = (X - X.min()) / (X.max() - X.min())
X
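
The cell above scales using the minimum and maximum of the full dataset, so the test rows influence the scaling. A common alternative is to learn the scaling from the training rows only. The sketch below is one way to do that with scikit-learn's `MinMaxScaler`; it is not what this notebook does. It assumes the copy of the Wisconsin dataset bundled with scikit-learn, whose target encoding is the reverse of this notebook's (0 = malignant, 1 = benign), and all variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# scikit-learn ships a copy of the Wisconsin diagnostic dataset (569 rows, 30 features).
# Its target encoding differs from this notebook's: 0 = malignant, 1 = benign.
X_raw, y_raw = load_breast_cancer(return_X_y=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.15, random_state=42
)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test rows

print(X_tr_scaled.min(), X_tr_scaled.max())
```

With this approach the scaled test values can fall slightly outside [0, 1], which is expected because the test rows were not used to find the minimum and maximum.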

### Here, we split the data into two parts, training and testing. Then we create a Logistic Regression model and fit it on the training data. Finally, we predict on the test set and check the accuracy of the model.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
logistic = LogisticRegression()
logistic.fit(X_train,y_train)
y_pred = logistic.predict(X_test)
ac = accuracy_score(y_test,y_pred)
print('Accuracy is: ',ac)
conm = confusion_matrix(y_test,y_pred)
sns.heatmap(conm,annot=True,fmt="d")
plt.show()

### We got an accuracy of about 0.98 (98%) with this Logistic Regression model.
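
Accuracy alone hides which kind of mistake the model makes, and for a cancer diagnosis a missed malignant case (false negative) matters more than a false alarm. A minimal sketch of reading the four confusion-matrix cells, on made-up labels rather than the notebook's actual predictions:

```python
from sklearn.metrics import confusion_matrix

# Toy labels (1 = malignant, 0 = benign); not the notebook's actual predictions.
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels [0, 1], ravel() yields counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # of the predicted malignant cases, how many really are
recall = tp / (tp + fn)     # of the actual malignant cases, how many were caught
print(tn, fp, fn, tp, precision, recall)
```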

## Checking the classification report for the per-class precision, recall, and F1 score of our model.

In [None]:
print(metrics.classification_report(y_test, y_pred))

## Checking the roc_auc_score and the f1 score of our model.

In [None]:
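# ROC AUC is a ranking metric, so score it on the predicted probability of class 1
print("roc_auc_score: ", roc_auc_score(y_test, logistic.predict_proba(X_test)[:, 1]))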
print("f1 score: ", f1_score(y_test, y_pred))
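
ROC AUC measures how well the model ranks positive cases above negative ones, so it is normally computed from predicted probabilities (for example the second column of `predict_proba`) rather than from hard 0/1 labels. A small sketch on made-up values, showing that the two inputs can give different scores:

```python
from sklearn.metrics import roc_auc_score

# Toy values; not the notebook's actual outputs.
y_true_toy = [0, 0, 1, 1, 1, 0]
proba_toy = [0.1, 0.45, 0.4, 0.8, 0.6, 0.3]      # predicted probability of class 1
labels_toy = [int(p >= 0.5) for p in proba_toy]  # hard labels at a 0.5 threshold

auc_from_proba = roc_auc_score(y_true_toy, proba_toy)
auc_from_labels = roc_auc_score(y_true_toy, labels_toy)
print(auc_from_proba, auc_from_labels)
```

Thresholding discards the ordering among samples on the same side of 0.5, which is why the label-based score differs from the probability-based one here.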