# Importing Necessary Libraries

First, we import the libraries needed for analyzing and visualizing the data:
- NumPy for numerical computation
- Pandas for loading, viewing, and analyzing tabular data
- Matplotlib for data visualization
- Seaborn for statistical data visualization built on top of Matplotlib

We will also need the Scikit-Learn library for data preprocessing and model training. We will import the necessary APIs of the library when we get to those stages.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Viewing the Data

Now, we will use the `read_csv` function of the Pandas library to load the CSV file as a Pandas DataFrame.

In [2]:
df = pd.read_csv("data (1).csv")

# Viewing the first 5 rows of the dataset
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,
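The empty `Unnamed: 32` column in the output above is most likely caused by a trailing comma at the end of each line of the CSV file. The following is a minimal sketch of that effect and one way to remove such a column. The CSV text is a hypothetical four-column excerpt, not the real file, so the unnamed column is called `Unnamed: 4` here.

```python
import io

import pandas as pd

# Hypothetical excerpt: the trailing comma on every line mimics a layout
# that produces an empty, unnamed last column when parsed.
csv_text = (
    "id,diagnosis,radius_mean,texture_mean,\n"
    "842302,M,17.99,10.38,\n"
    "842517,M,20.57,17.77,\n"
    "84300903,M,19.69,21.25,\n"
)

raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# Drop every column whose values are all missing
cleaned = raw.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Dropping by "all values missing" avoids hard-coding the column name, which changes with the number of columns in the file.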


# Loading the Dataset

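Now we shall load the dataset from the Scikit-Learn library as a Pandas DataFrame. With `return_X_y=True` and `as_frame=True`, `load_breast_cancer` returns the features as a DataFrame (`X`) and the target as a Series (`y`).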

In [3]:
# Importing necessary APIs from sklearn library
from sklearn.datasets import load_breast_cancer

# Loading the dataset as a Pandas DataFrame
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Viewing the first 5 rows of the features dataset
X.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


# Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the process of examining and understanding the dataset before building any machine learning model. The goal of EDA is to get a sense of what the data looks like, identify any patterns or trends, and spot any potential issues like missing values or outliers. During EDA, we usually:

1. **Look at the Data**: Check the first few rows and the column types to understand its structure (using `.head()` and `.info()` in Pandas).
2. **Summary Statistics**: Get quick statistics on each column like the mean, median, and standard deviation (using `.describe()`).
3. **Visualize**: Use charts like histograms, box plots, and scatter plots to see the distribution of data and relationships between variables.
4. **Check for Missing Values**: Identify any missing or NaN values that need to be handled.
5. **Find Outliers**: Detect extreme values that could affect the model's performance.

This step helps ensure the data is clean and gives you insights into how to prepare it for machine learning.
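Outlier detection (step 5 above) is commonly done with the interquartile range (IQR) rule. The sketch below applies it to a small made-up Series; the values and the name `toy_feature` are illustrative assumptions, and the 1.5 multiplier is a widely used convention rather than something this notebook prescribes.

```python
import pandas as pd

# Made-up values with one obviously extreme entry
values = pd.Series([10, 12, 11, 13, 12, 95], name="toy_feature")

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(f"bounds: [{lower}, {upper}]")
print(outliers.tolist())
```

The same rule can be applied column by column to a DataFrame, although flagged points are not necessarily errors and may carry real signal.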


## Displaying the First Five Rows of Data

We now check the first five rows of the features DataFrame to observe and understand the structure of the dataset.

In [4]:
# Viewing the first 5 rows of the features dataset
X.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


## Displaying the First Five Rows of Target

We also check the first five rows of the target variable to understand its structure.

In [5]:
# Viewing the first 5 rows of the labels dataset
y.head()

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64
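The target is stored as integer codes. As a small sketch, the `target_names` attribute of the object returned by `load_breast_cancer` can be used to translate those codes into readable class names; the variable names `code_to_name` and `y_named` are introduced here only for illustration.

```python
from sklearn.datasets import load_breast_cancer

# Load the full Bunch object so that target_names is available
data = load_breast_cancer(as_frame=True)

# Map each integer code to its class name
code_to_name = {code: str(name) for code, name in enumerate(data.target_names)}
print(code_to_name)

# Human-readable version of the target column
y_named = data.target.map(code_to_name)
print(y_named.head())
```

In this dataset, code 0 stands for malignant and code 1 for benign, so the five rows shown above are all malignant cases.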

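## Displaying First and Last Five Rows of Data and Target

We join the features and the target into a single DataFrame `df`, then view its first and last five rows.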

In [6]:
# Joining the features and labels into a single dataframe
df = X.join(y)

# Viewing the first 5 rows of the combined dataset
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [7]:
# Viewing the last 5 rows of the combined dataset
df.tail()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
564,21.56,22.39,142.0,1479.0,0.111,0.1159,0.2439,0.1389,0.1726,0.05623,...,26.4,166.1,2027.0,0.141,0.2113,0.4107,0.2216,0.206,0.07115,0
565,20.13,28.25,131.2,1261.0,0.0978,0.1034,0.144,0.09791,0.1752,0.05533,...,38.25,155.0,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,0
566,16.6,28.08,108.3,858.1,0.08455,0.1023,0.09251,0.05302,0.159,0.05648,...,34.12,126.7,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.0782,0
567,20.6,29.33,140.1,1265.0,0.1178,0.277,0.3514,0.152,0.2397,0.07016,...,39.42,184.6,1821.0,0.165,0.8681,0.9387,0.265,0.4087,0.124,0
568,7.76,24.54,47.92,181.0,0.05263,0.04362,0.0,0.0,0.1587,0.05884,...,30.37,59.16,268.6,0.08996,0.06444,0.0,0.0,0.2871,0.07039,1


## Analyzing the Dataset (`df`)

First, we will use `.shape` to check the shape of the DataFrame `df`, i.e. the number of rows and columns in `df`.

In [8]:
# Checking the shape of the dataset
df.shape

(569, 31)

The output `(569, 31)` indicates that the DataFrame `df` has 569 rows and 31 columns: the 30 features plus the `target` column.

Now we shall use the `.info()` method to get an overview of column types and non-null values in the dataset.

In [9]:
# info() method to overview the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

### Checking for Missing Values

To check for missing (null) values, we will use `.isnull().sum()` on the dataframe `df`.

In [10]:
# Checking for missing values
df.isnull().sum()

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64

The output of the cell above shows zero missing values in every column, so we do not need to impute any of the values in the dataset.
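For comparison, here is a sketch of what the same check reports when values are missing. The tiny DataFrame `toy` is made up for illustration and is not part of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame with two deliberately missing entries
toy = pd.DataFrame({
    "mean radius": [17.99, np.nan, 19.69],
    "mean texture": [10.38, 17.77, 21.25],
    "target": [0, 0, np.nan],
})

# Number of missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

A non-zero count in any column would be the signal to impute or drop values before training.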

### Summary Statistics of the dataset (`df`)

Now, we shall display the summary statistics of the dataset using the Pandas `describe()` method. For each numeric column, it reports the count, mean, standard deviation, minimum, maximum, and the quartiles (25%, 50%, 75%). This gives us a sense of the scale and spread of each feature, which is useful for further analysis and model building.

In [11]:
# Displaying summary statistics of the dataset
df.describe()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,0.627417
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,0.483918
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,0.0
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,0.0
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,1.0
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,1.0
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,1.0


### Analyzing Target Variable Distribution

The "target" column contains the labels of the dataset. We shall now use the `value_counts()` method to analyze the distribution of the target variable.

In [12]:
df['target'].value_counts()

target
1    357
0    212
Name: count, dtype: int64
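The counts above can also be read as proportions, which makes the class balance easier to judge. The sketch below rebuilds a Series with the same counts (357 ones and 212 zeros) so that it runs on its own; in the notebook, `df['target']` could be used directly.

```python
import pandas as pd

# Stand-in for df['target'], built from the counts shown above
target = pd.Series([1] * 357 + [0] * 212, name="target")

# Share of each class instead of raw counts
proportions = target.value_counts(normalize=True)
print(proportions)

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = proportions.max()
print(round(majority_baseline, 4))
```

Roughly 63% of the samples belong to class 1, so any useful classifier should clearly beat that figure.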

# Data Preprocessing

Since we do not have any null values in the dataset, we do not need to impute any of the values. We will proceed with directly separating the features (`X`) and the target (`y`) variable.

In [13]:
X = df.drop(columns='target')
y = df['target']

# Checking if the features and target have same number of rows
print(X.shape)
print(y.shape)

(569, 30)
(569,)


The features and the target both have 569 rows, so no rows were lost when separating them.

# Splitting the Dataset

Now that we have separated the features and target variable, we need to split the dataset into train and test datasets. Holding out a test set that the model never sees during training guards against data snooping bias: the model is trained on the train dataset, and its performance is evaluated on the test dataset.
Let us split the dataset into train and test datasets containing 80% and 20% of the data respectively. We fix `random_state` so that the split is reproducible.

In [14]:
# Importing necessary APIs from sklearn library
from sklearn.model_selection import train_test_split

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
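Because the two classes are not equally frequent, one optional variation is a stratified split, which keeps the class proportions nearly identical in both parts. This is a sketch of that alternative using the `stratify` parameter of `train_test_split`; it is not the split used in the rest of this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# stratify=y keeps the share of each class the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2, stratify=y
)

print(X_train.shape, X_test.shape)

# The mean of a 0/1 target is the share of class 1
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

With 569 samples and `test_size=0.2`, the test set holds 114 rows and the training set 455.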

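# Model Training

We now train a logistic regression model on the training dataset. Since this is a binary classification problem, logistic regression is a reasonable choice for this dataset. We raise `max_iter` from its default of 100 to 10000 to give the solver more iterations to converge.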

In [15]:
from sklearn.linear_model import LogisticRegression

# Creating an instance of LogisticRegression
model = LogisticRegression(max_iter=10000)

# Training the model on the train dataset
model.fit(X_train, y_train)

LogisticRegression(max_iter=10000)

# Model Evaluation

Now we shall evaluate the model based on accuracy score. We shall evaluate the accuracy score on both training and test data.

## On Training Data

In [16]:
from sklearn.metrics import accuracy_score

# Predicting the label of the train data using the model
X_train_prediction = model.predict(X_train)

# Evaluating the accuracy of the predicted labels
training_data_accuracy = accuracy_score(y_train, X_train_prediction)
print('Accuracy on training data = ', training_data_accuracy)

Accuracy on training data =  0.9692307692307692


## On Testing Data

In [17]:
# Predicting the label of the test data using the model
X_test_prediction = model.predict(X_test)

# Evaluating the accuracy of the predicted labels
test_data_accuracy = accuracy_score(y_test, X_test_prediction)
print('Accuracy on test data = ', test_data_accuracy)

Accuracy on test data =  0.9298245614035088
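Accuracy alone does not show which kind of mistake the model makes. As a supplementary sketch, a confusion matrix and a classification report break the test results down per class. The code retrains the same model so that it runs on its own, and the class names passed to `target_names` follow the dataset's encoding (0 = malignant, 1 = benign).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F1-score for each class
print(classification_report(y_test, y_pred, target_names=["malignant", "benign"]))
```

For a medical screening task, the count of malignant cases predicted as benign is often the number that deserves the closest look.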


The logistic regression model reaches an accuracy of approximately 96.92% on the training data and 92.98% on the test data. This is a reasonably good accuracy score on the test data, and we shall now use this model to build a predictive system that predicts the label of a given input data point.

# Building a Predictive System

We shall now take a single input data point, given as a tuple of 30 feature values in the same order as the columns of `X`, and predict its label using the logistic regression model we have just built.

In [18]:
input_data = (13.54,14.36,87.46,566.3,0.09779,0.08129,0.06664,0.04781,0.1885,0.05766,0.2699,0.7886,2.058,23.56,0.008462,0.0146,0.02387,0.01315,0.0198,0.0023,15.11,19.26,99.7,711.2,0.144,0.1773,0.239,0.1288,0.2977,0.07259)
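Whether a sample is really outside the dataset can be checked rather than assumed. The sketch below compares the tuple against every row of the Scikit-Learn features and lists the indices of any matching rows; the use of `np.isclose` with its default tolerances is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

input_data = (13.54,14.36,87.46,566.3,0.09779,0.08129,0.06664,0.04781,0.1885,0.05766,0.2699,0.7886,2.058,23.56,0.008462,0.0146,0.02387,0.01315,0.0198,0.0023,15.11,19.26,99.7,711.2,0.144,0.1773,0.239,0.1288,0.2977,0.07259)

# True for each row of X whose 30 values all match the sample
matches = np.isclose(X.to_numpy(), np.asarray(input_data)).all(axis=1)

# Indices of matching rows (an empty list means the sample is not in X)
matching_rows = np.flatnonzero(matches).tolist()
print("Rows matching the sample:", matching_rows)
```

If the list is not empty, the prediction for this sample says little about performance on new data, because the model may have seen the row during training.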

We shall convert the `input_data` tuple into a 2D NumPy array with 1 row, so that we can conveniently input this array into our model for prediction purposes.

In [19]:
# Convert the input_data (tuple) to a NumPy array
input_data_as_numpy_array = np.asarray(input_data)

# Reshape the NumPy array into a 2D array with 1 row and as many columns as needed to fit all elements
# The '-1' means that the number of columns will be inferred based on the total number of elements
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)
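The effect of `reshape(1, -1)` is easiest to see on a short example. This sketch uses a made-up three-value tuple instead of the full 30-value sample.

```python
import numpy as np

# Short stand-in for the 30-value input tuple
sample = (13.54, 14.36, 87.46)

flat = np.asarray(sample)
print(flat.shape)   # 1D array: (3,)

# One row, with the number of columns inferred from the data
row = flat.reshape(1, -1)
print(row.shape)    # 2D array: (1, 3)
```

Scikit-Learn estimators expect a 2D input of shape `(n_samples, n_features)`, which is why a single sample has to be presented as one row.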

In [20]:
# Predicting the label of the input data given
# (in this dataset, 0 = malignant and 1 = benign)
prediction = model.predict(input_data_reshaped)
if prediction[0] == 0:
    print('The Breast Cancer is Malignant')
else:
    print('The Breast Cancer is Benign')

The Breast Cancer is Benign
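Since the model was fitted on a DataFrame, recent Scikit-Learn versions warn about missing feature names when it is given a plain NumPy array. One way around this, sketched below, is to wrap the sample in a one-row DataFrame with the same columns as `X`. The sketch also uses `predict_proba` to show how confident the model is. The block retrains the model so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)
model = LogisticRegression(max_iter=10000).fit(X_train, y_train)

input_data = (13.54,14.36,87.46,566.3,0.09779,0.08129,0.06664,0.04781,0.1885,0.05766,0.2699,0.7886,2.058,23.56,0.008462,0.0146,0.02387,0.01315,0.0198,0.0023,15.11,19.26,99.7,711.2,0.144,0.1773,0.239,0.1288,0.2977,0.07259)

# One-row DataFrame carrying the same feature names the model was fitted on
sample = pd.DataFrame([input_data], columns=X.columns)

predicted_code = int(model.predict(sample)[0])
probabilities = model.predict_proba(sample)[0]

# target_names translates the code into a class name
print("Predicted class:", data.target_names[predicted_code])
print("Class probabilities (malignant, benign):", probabilities)
```

Looking up the class name in `target_names` avoids hard-coding the meaning of 0 and 1 in an `if`/`else` branch.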


