## Breast Cancer Analysis Wisconsin (Original) - Logistic Classification

### Data Set Information:

This dataset is available from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29

##### Background

Samples arrive periodically as Dr. Wolberg reports his clinical cases. The database therefore reflects this chronological grouping of the data. This grouping information appears immediately below, having been removed from the data itself:

- Group 1: 367 instances (January 1989)
- Group 2: 70 instances (October 1989)
- Group 3: 31 instances (February 1990)
- Group 4: 17 instances (April 1990)
- Group 5: 48 instances (August 1990)
- Group 6: 49 instances (Updated January 1991)
- Group 7: 31 instances (June 1991)
- Group 8: 86 instances (November 1991)

Total: 699 points (as of the donated database on 15 July 1992)
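As a quick sanity check, the group sizes listed above can be summed to confirm the stated total. This is only an illustrative sketch; the dictionary `group_sizes` is a name introduced here and is not part of the notebook.

```python
# Group sizes copied from the list above (group number -> number of instances).
group_sizes = {1: 367, 2: 70, 3: 31, 4: 17, 5: 48, 6: 49, 7: 31, 8: 86}

# The sum should match the total of 699 points stated in the dataset description.
total = sum(group_sizes.values())
print(f"Total instances across {len(group_sizes)} groups: {total}")
```

The total also matches the 699 rows that pandas reports once the file is loaded.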

### Attribute Information:

1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant) -> dependent variable

##### Import the relevant libraries:

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

##### Set the column names and load the data

In [2]:
colnames = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']

df = pd.read_csv('breast_cancer_wisconsin.csv', names=colnames, header=None)

In [3]:
m,n = df.shape
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Sample code number           699 non-null    int64 
 1   Clump Thickness              699 non-null    int64 
 2   Uniformity of Cell Size      699 non-null    int64 
 3   Uniformity of Cell Shape     699 non-null    int64 
 4   Marginal Adhesion            699 non-null    int64 
 5   Single Epithelial Cell Size  699 non-null    int64 
 6   Bare Nuclei                  699 non-null    object
 7   Bland Chromatin              699 non-null    int64 
 8   Normal Nucleoli              699 non-null    int64 
 9   Mitoses                      699 non-null    int64 
 10  Class                        699 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.2+ KB


We can see that every column was parsed as int64 except Bare Nuclei, which was read as object.\
Ideally this feature should only contain values between 1 and 10, so the object dtype suggests that it holds some non-numeric junk values.\
We will identify those junk values first and then decide what to do with them.
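One possible way to locate non-numeric entries, sketched here on a small made-up Series rather than the real column, is to coerce the values with `pd.to_numeric` and look at what fails to parse. The toy values and the names `junk_mask`, `junk_values` and `junk_share` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the 'Bare Nuclei' column (illustrative values, not the real data).
bare_nuclei = pd.Series(['1', '10', '?', '5', '?', '3'], name='Bare Nuclei')

# With errors='coerce', anything that cannot be parsed as a number becomes NaN.
as_numbers = pd.to_numeric(bare_nuclei, errors='coerce')
junk_mask = as_numbers.isna()

# Distinct junk entries, how often they occur, and their share of the column.
junk_values = sorted(bare_nuclei[junk_mask].unique())
junk_count = int(junk_mask.sum())
junk_share = 100 * junk_count / len(bare_nuclei)

print(junk_values, junk_count, f"{junk_share:.2f} %")
```

Unlike reading a position out of `value_counts()`, this approach does not depend on how the counts happen to be ordered.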

In [4]:
dfd = df.loc[:,'Bare Nuclei']
dflist = df['Bare Nuclei'].value_counts().tolist()

In [5]:
dfd.value_counts()

1     402
10    132
5      30
2      30
3      28
8      21
4      19
?      16
9       9
7       8
6       4
Name: Bare Nuclei, dtype: int64

In [6]:
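# Count the '?' placeholders directly instead of relying on their position in value_counts()
mis_val = (df['Bare Nuclei'] == '?').sum()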
print("There are a total of {} values containing '?' data".format(mis_val))
print("Which is {:.2f} % of the data".format((mis_val/m)*100))

There are a total of 16 values containing '?' data
Which is 2.29 % of the data


The '?' entries are placeholders for missing values. They occur 16 times, which corresponds to only 2.29% of the data.

Because '?' is a string, it also forces the whole column to the object datatype instead of int64.\
We have no reliable way of knowing what the missing values should have been, and the affected share of the data is small, so it is more advisable to delete these records than to try to fix them.

In particular, we will not replace the missing values with the mean (or average) value.
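As an alternative sketch, the '?' placeholders could also be declared as missing at load time through the `na_values` parameter of `pd.read_csv`, after which dropping the incomplete rows works the same way. The in-memory CSV, its three rows and the reduced column list below are made up for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the real file (illustrative rows only).
raw = io.StringIO("101,5,1,2\n102,8,?,4\n103,6,10,2\n")
cols = ['Sample code number', 'Clump Thickness', 'Bare Nuclei', 'Class']

# na_values='?' tells pandas to read the placeholder as NaN straight away.
toy = pd.read_csv(raw, names=cols, header=None, na_values='?')

# Drop incomplete rows; .copy() gives an independent frame that is safe to modify.
cleaned = toy.dropna().copy()

# A column containing NaN is read as float, so cast back to integers after dropping.
cleaned['Bare Nuclei'] = cleaned['Bare Nuclei'].astype('int64')

print(cleaned.shape)
print(cleaned.dtypes)
```

Either route should lead to the same cleaned table; handling the placeholder at load time simply avoids the intermediate object column.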

In [7]:
# Mark the '?' placeholders as missing, then drop the affected rows.
# .copy() makes clean_data an independent DataFrame, so the column assignment
# in the next cell does not raise a SettingWithCopyWarning.
df['Bare Nuclei'] = df['Bare Nuclei'].replace('?', np.nan)
clean_data = df.dropna().copy()
clean_data.shape

(683, 11)

In [8]:
clean_data['Bare Nuclei'] = clean_data['Bare Nuclei'].astype(np.int64)


In [9]:
clean_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 683 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   Sample code number           683 non-null    int64
 1   Clump Thickness              683 non-null    int64
 2   Uniformity of Cell Size      683 non-null    int64
 3   Uniformity of Cell Shape     683 non-null    int64
 4   Marginal Adhesion            683 non-null    int64
 5   Single Epithelial Cell Size  683 non-null    int64
 6   Bare Nuclei                  683 non-null    int64
 7   Bland Chromatin              683 non-null    int64
 8   Normal Nucleoli              683 non-null    int64
 9   Mitoses                      683 non-null    int64
 10  Class                        683 non-null    int64
dtypes: int64(11)
memory usage: 64.0 KB


We have now removed the rows with missing values and converted all the data into the correct datatype (int64).\
Now we can proceed with modeling the data using logistic regression.

In [10]:
clean_data.head()

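|   | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |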


##### Separate the independent and the dependent variables

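We need to separate the independent variables (the features) into one array and the dependent variable (the class label) into another. The sample code number is only an identifier, so it is left out of the features.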

In [11]:
# Features: every column except the sample ID (first) and the class label (last)
X = clean_data.iloc[:,1:-1].values
# Target: the Class column (2 = benign, 4 = malignant)
y = clean_data.iloc[:,-1].values
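The class labels in this dataset are 2 (benign) and 4 (malignant). scikit-learn's `LogisticRegression` accepts such labels as they are, so no recoding is required, but a 0/1 encoding can make later metrics easier to read. The following is an optional sketch on a made-up label array; `y_toy` and `y_binary` are names introduced here.

```python
import numpy as np

# Made-up labels using the dataset's coding: 2 = benign, 4 = malignant.
y_toy = np.array([2, 4, 2, 2, 4])

# Map malignant (4) to 1 and benign (2) to 0.
y_binary = (y_toy == 4).astype(int)

print(y_binary)
```

With this encoding, malignant is explicitly the positive class, which is the convention assumed when talking about false positives and false negatives.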

##### Separate the dataset into train and test set (80-20 split)

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
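The sizes produced by the 80-20 split can be checked with a little arithmetic. This sketch assumes, as I understand scikit-learn's behaviour, that a fractional `test_size` is rounded up to a whole number of samples and the remainder goes to the training set; the 683 rows come from the cleaned data above.

```python
import math

n_samples = 683   # rows left after dropping the records with missing values
test_size = 0.2

# Assumed rounding rule: test count is rounded up, the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train: {n_train}, test: {n_test}")
```

A test set of 137 samples agrees with the sum of the entries in the confusion matrix reported further down.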

##### Fitting the Logistic Regression Model

Here we fit the Logistic Regression model to our training data. Its accuracy will then be computed on the test data.

In [13]:
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

##### Predicting the results using our Test set 

In [14]:
y_pred = classifier.predict(X_test)
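To give a rough idea of what happens inside `predict`, a binary logistic regression model computes a weighted sum of the features, passes it through the sigmoid function to obtain a probability, and thresholds that probability at 0.5. The sketch below uses two features with hypothetical weights and intercept chosen for illustration; they are not the coefficients learned by the classifier above.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for two features (NOT the fitted model's values).
weights = np.array([0.5, 0.3])
intercept = -4.0

# Two made-up samples: one with low scores, one with high scores.
samples = np.array([[1, 1],
                    [10, 10]])

# Probability of the positive class (malignant, coded as 4).
proba = sigmoid(samples @ weights + intercept)

# Threshold at 0.5 to obtain the class labels used in this dataset.
labels = np.where(proba >= 0.5, 4, 2)

print(proba, labels)
```

For the fitted model, the corresponding probabilities are available through `classifier.predict_proba(X_test)`.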

##### Confusion Matrix

We will now compare our predictions with the actual labels from the test set.\
In particular, we want to check the number of false positives and false negatives in the confusion matrix.

In [15]:
conf_mtx = confusion_matrix(y_test, y_pred)
print(conf_mtx)

[[84  3]
 [ 3 47]]


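In scikit-learn's confusion matrix, rows are the actual classes and columns are the predicted classes, ordered by label (2 = benign, 4 = malignant). Treating malignant as the positive class, our results look good, as we have a total of:
- 84 True Negative (benign predicted as benign)
- 47 True Positive (malignant predicted as malignant)

and only:
- 3 False Positive (benign predicted as malignant)
- 3 False Negative (malignant predicted as benign)

That is 131 correct predictions out of 137, or about 95.6% accuracy on the test set, which is a really good result.\
We will now compare this accuracy with the 10-fold cross-validation technique.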
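The usual summary metrics can be derived directly from the four counts in the confusion matrix. This sketch assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, labels in sorted order) and treats malignant (4) as the positive class; the variable names are introduced here for illustration.

```python
# Confusion matrix printed above: rows = actual, columns = predicted,
# label order 2 (benign) then 4 (malignant).
conf_mtx = [[84, 3],
            [3, 47]]

tn, fp = conf_mtx[0]   # actual benign: predicted benign / predicted malignant
fn, tp = conf_mtx[1]   # actual malignant: predicted benign / predicted malignant
total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # share of all predictions that are correct
sensitivity = tp / (tp + fn)      # share of malignant cases that are detected
specificity = tn / (tn + fp)      # share of benign cases that are recognised
precision = tp / (tp + fp)        # share of malignant predictions that are right

print(f"Accuracy:    {accuracy:.4f}")
print(f"Sensitivity: {sensitivity:.4f}")
print(f"Specificity: {specificity:.4f}")
print(f"Precision:   {precision:.4f}")
```

In a diagnostic setting, sensitivity is often the metric to watch, since a false negative means a malignant case predicted as benign.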

##### Computing accuracy with k-Fold Cross Validation

In [16]:
accuracies = cross_val_score(estimator = classifier, X=X_train, y=y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f}".format(accuracies.std()))

Accuracy: 96.70 %
Standard Deviation: 0.02


Here we can see that the model also performed well under 10-fold cross-validation on the training set.\
We obtained a mean accuracy of 96.70% with a standard deviation of about 0.02, that is, roughly two percentage points.\
One standard deviation around the mean therefore spans roughly 94.7% to 98.7%, which is very good.
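To illustrate what 10-fold cross-validation does with the training set, the sketch below partitions sample indices into 10 folds and uses each fold once for validation. It shows plain k-fold partitioning only: for a classifier with an integer `cv`, `cross_val_score` uses a stratified variant that also preserves the class proportions in each fold. The training-set size of 546 is an assumption derived from the 80-20 split of 683 rows.

```python
import numpy as np

n_train = 546   # assumed training-set size (683 rows minus 137 test rows)
k = 10

# Split the sample indices into k nearly equal, non-overlapping folds.
indices = np.arange(n_train)
folds = np.array_split(indices, k)
fold_sizes = [len(fold) for fold in folds]

# Each fold serves once as the validation set; the other k-1 folds form the training part.
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([fold for j, fold in enumerate(folds) if j != i])
    print(f"fold {i}: train on {len(train_idx)} samples, validate on {len(val_idx)}")
```

The reported accuracy is the mean of the 10 validation scores, and the standard deviation describes how much those scores vary from fold to fold.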

Our objective is now complete.