#### Data Source
### Blood Transfusion Service Center DataSet
- **Citation Request**

    This breast cancer database was obtained from the **University of Wisconsin Hospitals**, **Madison** from **Dr. William H. Wolberg**. If you publish results using this database, please include this information in your acknowledgements.

- **Title**

    Wisconsin Breast Cancer Database (January 8, 1991)

- **Sources**
    - **Creator**
            Dr. William H. Wolberg (physician)
            University of Wisconsin Hospitals
            Madison, Wisconsin
            USA
    - **Donor**
            Olvi Mangasarian (mangasarian@cs.wisc.edu)
            Received by David W. Aha (aha@cs.jhu.edu)
    - **Date**
            15 July 1992
        
### UCI - Machine Learning Repository
- Center for Machine Learning and Intelligent Systems

The [**UCI Machine Learning Repository**](http://archive.ics.uci.edu/ml/about.html) is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.

## Prepare for data analysis

## Load packages

In [1]:
%matplotlib inline    
# Line magic command will make plot outputs appear and be stored within the notebook.
import matplotlib.pyplot as plt   # matplotlib's plotting framework

import numpy as np    # fundamental package for scientific computing
import pandas as pd   # Python Data Analysis Library
from pandas import Series # one-dimensional labeled array capable of holding any data type 
import seaborn as sns # library for making statistical graphics in Python
import os             # operating-system-dependent functionality (used here to list files)
import matplotlib.gridspec as gridspec
import itertools

import warnings
warnings.filterwarnings('ignore')                                # For warning control

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB 
from sklearn.neighbors import KNeighborsClassifier

from mlxtend.classifier import StackingClassifier
from mlxtend.plotting import plot_learning_curves
from mlxtend.plotting import plot_decision_regions

## Load Data

In [2]:
PATH="./../Data/"
os.listdir(PATH)

['transfusion.data', 'transfusion.names']

In [3]:
# Load the data. Only one data file (transfusion.data) is provided.
# Manual inspection of the file gave a basic understanding of the data;
#     the time column, Time (months), is one of the features.
Transfusion_train_data    = pd.read_csv( PATH + 'transfusion.data' )
#Transfusion_train_names  = pd.read_csv( PATH + 'transfusion.names' ) 
# transfusion.names is not a CSV; it is a plain-text description of the dataset

## Data exploration

### Check the data dimension

In [4]:
Transfusion_train_data.shape

(748, 5)

##### Feature Info from transfusion.names
- R (Recency - months since last donation),
- F (Frequency - total number of donations),
- M (Monetary - total blood donated in c.c.),
- T (Time - months since first donation), and
- a binary variable representing whether he/she donated blood in March 2007 
    - (1 stands for donating blood; 0 stands for not donating blood).


In [5]:
Transfusion_train_data.head()

| | Recency (months) | Frequency (times) | Monetary (c.c. blood) | Time (months) | whether he/she donated blood in March 2007 |
|---|---|---|---|---|---|
| 0 | 2 | 50 | 12500 | 98 | 1 |
| 1 | 0 | 13 | 3250 | 28 | 1 |
| 2 | 1 | 16 | 4000 | 35 | 1 |
| 3 | 2 | 20 | 5000 | 45 | 1 |
| 4 | 1 | 24 | 6000 | 77 | 0 |
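
In the five rows printed by `head()`, Monetary is always 250 times Frequency (for example, 50 donations and 12500 c.c.), which suggests a fixed 250 c.c. per donation. The following is a minimal sketch for checking that relationship. The `sample` frame, its shortened column names, and the `CC_PER_DONATION` constant are assumptions made for illustration; the check should be repeated on the full file before drawing conclusions.

```python
import pandas as pd

# Five rows copied from the head() output (column names shortened here).
sample = pd.DataFrame(
    {
        "Recency": [2, 0, 1, 2, 1],
        "Frequency": [50, 13, 16, 20, 24],
        "Monetary": [12500, 3250, 4000, 5000, 6000],
        "Time": [98, 28, 35, 45, 77],
    }
)

# Assumed relationship: every donation is 250 c.c.
CC_PER_DONATION = 250

# True only if Monetary equals 250 * Frequency in every row.
is_redundant = (sample["Monetary"] == CC_PER_DONATION * sample["Frequency"]).all()
print(is_redundant)
```

If the same check passes on all 748 rows, Monetary carries no information beyond Frequency, and one of the two columns could probably be dropped before modelling.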


In [6]:
Transfusion_train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
Recency (months)                              748 non-null int64
Frequency (times)                             748 non-null int64
Monetary (c.c. blood)                         748 non-null int64
Time (months)                                 748 non-null int64
whether he/she donated blood in March 2007    748 non-null int64
dtypes: int64(5)
memory usage: 29.3 KB


In [7]:
Transfusion_train_data.describe()

| | Recency (months) | Frequency (times) | Monetary (c.c. blood) | Time (months) | whether he/she donated blood in March 2007 |
|---|---|---|---|---|---|
| count | 748.0 | 748.0 | 748.0 | 748.0 | 748.0 |
| mean | 9.506684 | 5.514706 | 1378.676471 | 34.282086 | 0.237968 |
| std | 8.095396 | 5.839307 | 1459.826781 | 24.376714 | 0.426124 |
| min | 0.0 | 1.0 | 250.0 | 2.0 | 0.0 |
| 25% | 2.75 | 2.0 | 500.0 | 16.0 | 0.0 |
| 50% | 7.0 | 4.0 | 1000.0 | 28.0 | 0.0 |
| 75% | 14.0 | 7.0 | 1750.0 | 50.0 | 0.0 |
| max | 74.0 | 50.0 | 12500.0 | 98.0 | 1.0 |


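- `info()` reports 748 non-null values in each of the 5 columns, so there is no missing data
- The last column is the target variable and has two states (0 and 1); its mean of 0.238 means about 24% of the rows are labelled 1
    - This makes the task a binary classification problem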
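
The two observations above can also be checked directly in code rather than read off the `info()` and `describe()` output. This is a minimal sketch on a small stand-in frame; `df` and its `target` column are hypothetical names, and in the notebook the same calls would be applied to `Transfusion_train_data`.

```python
import pandas as pd

# Small stand-in frame; in the notebook this would be Transfusion_train_data.
df = pd.DataFrame(
    {
        "Recency": [2, 0, 1, 2, 1],
        "Frequency": [50, 13, 16, 20, 24],
        "target": [1, 1, 1, 1, 0],
    }
)

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column = df.isnull().sum()

# Share of each class in the target column.
class_share = df["target"].value_counts(normalize=True)

print(missing_per_column)
print(class_share)
```

For a 0/1 target, the share of class 1 is the same number as the column mean shown by `describe()`.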

In [8]:
Transfusion_train_data.columns = ['Recency', 'Frequency', 'Monetary', 'Time', 'DonationStatus']

In [9]:
Transfusion_train_data.columns

Index(['Recency', 'Frequency', 'Monetary', 'Time', 'DonationStatus'], dtype='object')
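
As an alternative to renaming after loading, the short names can be supplied when the file is read. The sketch below assumes the file starts with the header row shown in the `head()` output; `raw` is an in-memory stand-in for `transfusion.data`, and `short_names` is an illustrative variable name.

```python
import io
import pandas as pd

# Stand-in for transfusion.data: the original header plus two rows from head().
raw = io.StringIO(
    "Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),"
    "whether he/she donated blood in March 2007\n"
    "2,50,12500,98,1\n"
    "0,13,3250,28,1\n"
)

short_names = ["Recency", "Frequency", "Monetary", "Time", "DonationStatus"]

# header=0 marks the first line as a header row, which `names` then replaces.
df = pd.read_csv(raw, header=0, names=short_names)
print(df.columns.tolist())
```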

In [10]:
# picking just the features
X = Transfusion_train_data.iloc[:,:4]
# target
y = Transfusion_train_data.iloc[:,4:]
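
Note that the slice `iloc[:, 4:]` returns a one-column DataFrame rather than a Series. scikit-learn estimators generally expect a 1-D target and emit a `DataConversionWarning` when given a column vector, which goes unnoticed here because warnings are suppressed. The sketch below shows the difference on a hypothetical three-row frame, together with two ways to obtain a 1-D target.

```python
import pandas as pd

# Hypothetical three-row frame, used only to show the resulting shapes.
df = pd.DataFrame({"Recency": [2, 0, 1], "DonationStatus": [1, 1, 0]})

y_frame = df.iloc[:, 1:]          # slice -> DataFrame with shape (3, 1)
y_series = df.iloc[:, 1]          # single position -> Series with shape (3,)
y_flat = y_frame.values.ravel()   # flatten the 2-D column to shape (3,)

print(y_frame.shape, y_series.shape, y_flat.shape)
```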

In [11]:
X.head()

| | Recency | Frequency | Monetary | Time |
|---|---|---|---|---|
| 0 | 2 | 50 | 12500 | 98 |
| 1 | 0 | 13 | 3250 | 28 |
| 2 | 1 | 16 | 4000 | 35 |
| 3 | 2 | 20 | 5000 | 45 |
| 4 | 1 | 24 | 6000 | 77 |


In [12]:
y.head()

| | DonationStatus |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |


In [13]:
Transfusion_train_data['DonationStatus'].unique()

array([1, 0], dtype=int64)

## Modelling

In [14]:
np.random.seed(0)
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(n_estimators=100, random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression(multi_class='auto', solver='lbfgs')
sclf = StackingClassifier(
    classifiers=[clf1, clf2, clf3],
    meta_classifier=lr)
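
The stack above feeds the predictions of three base classifiers into a logistic-regression meta-classifier. mlxtend's `StackingClassifier` fits the meta-classifier on predictions the base models make for the same data they were trained on, which can overfit. For comparison, here is a sketch of the same idea using scikit-learn's own `StackingClassifier`, which is a different implementation from the one used in this notebook and builds the meta-classifier's inputs from cross-validated predictions. The synthetic data and the names `X_demo`, `y_demo`, and `stack` are assumptions for illustration, not the transfusion data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier as SkStackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with four features, like the R, F, M, T columns.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Imported under an alias so it does not shadow mlxtend's StackingClassifier.
stack = SkStackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=1)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=1)),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(),
    cv=3,  # meta-classifier inputs come from 3-fold cross-validated predictions
)

# Outer 3-fold cross-validation of the whole stack.
scores = cross_val_score(stack, X_demo, y_demo, cv=3, scoring="accuracy")
print(scores.shape)
```

Results from this variant are not expected to match the mlxtend numbers exactly, since the meta-classifier is trained on different inputs.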

## Presenting results

In [15]:
X.values.shape

(748, 4)

In [24]:
type(y.values[:,:1])


numpy.ndarray

In [29]:
labels = ['K-NN', 'Random Forest', 'Naïve Bayes', 'Stacking Classifier']
clf_list = [clf1, clf2, clf3, sclf]

fig = plt.figure(figsize=(10, 8))
gs = gridspec.GridSpec(2, 2)
#grid = itertools.product([0, 1], repeat=2)

clf_cv_mean = []
clf_cv_std = []
# Iterate with a separate loop variable so the list of labels is not overwritten
for clf, label in zip(clf_list, labels):

    # 3-fold cross-validated accuracy for this classifier
    scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
    print('Accuracy: %.2f (+/- %.2f) [%s]' % (scores.mean(),
                                              scores.std(),
                                              label))
    clf_cv_mean.append(scores.mean())
    clf_cv_std.append(scores.std())

    #clf.fit(X, y)

Accuracy: 0.66 (+/- 0.11) [K-NN]
Accuracy: 0.70 (+/- 0.07) [Random Forest]
Accuracy: 0.75 (+/- 0.03) [Naïve Bayes]
Accuracy: 0.70 (+/- 0.07) [Stacking Classifier]
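
One way to put these accuracies in context is to compare them with a classifier that always predicts the majority class. The sketch below uses only figures already shown in this notebook: the target mean from `describe()` and the printed accuracies. The names `positive_share`, `reported`, and `baseline` are illustrative, and the baseline is computed for the whole dataset, so per-fold values may differ slightly.

```python
# Figures copied from the describe() output and the printed accuracies.
positive_share = 0.237968  # mean of the 0/1 target column
reported = {
    "K-NN": 0.66,
    "Random Forest": 0.70,
    "Naive Bayes": 0.75,
    "Stacking Classifier": 0.70,
}

# Always predicting the majority class is correct for that class's share of rows.
baseline = max(positive_share, 1 - positive_share)

# Flag which reported accuracies exceed the majority-class baseline.
beats_baseline = {name: acc > baseline for name, acc in reported.items()}
print(round(baseline, 3))
print(beats_baseline)
```

With about 76% of rows labelled 0, none of the four reported accuracies exceeds this baseline, which suggests that accuracy alone may not be the most informative metric for this imbalanced target.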


<Figure size 720x576 with 0 Axes>