<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.1: Bagging

INSTRUCTIONS:

- Read the guides and hints then create the necessary analysis and code to find an answer and conclusion for the scenario below.
- The baseline results (minimum) are:
    - **Accuracy** = 0.9667
    - **ROC AUC**  = 0.9614
- Try to achieve better results!

# Scenario: Predicting Breast Cancer
The dataset you are going to be using for this laboratory is popularly known as the **Wisconsin Breast Cancer** dataset. The task related to it is Classification.

The dataset contains _699_ instances, each described by _10_ attributes and labelled as either **benign** or **malignant**. _16_ attribute values are missing. All values are numeric.

# Step 1: Define the problem or question
Identify the subject matter and the given or obvious questions that would be relevant in the field.

## Potential Questions
List the given or obvious questions.

## Actual Question
Choose the **one** question that should be answered.

# Step 2: Find the Data
## Wisconsin Breast Cancer DataSet
- **Citation Request**

    This breast cancer database was obtained from the **University of Wisconsin Hospitals**, **Madison** from **Dr. William H. Wolberg**. If you publish results when using this database, then please include this information in your acknowledgements.

- **Title**

    Wisconsin Breast Cancer Database (January 8, 1991)

- **Sources**
    - **Creator**
            Dr. William H. Wolberg (physician)
            University of Wisconsin Hospitals
            Madison, Wisconsin
            USA
    - **Donor**
            Olvi Mangasarian (mangasarian@cs.wisc.edu)
            Received by David W. Aha (aha@cs.jhu.edu)
    - **Date**
            15 July 1992
            
    - **Reference**
    
    [https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29)

# Step 3: Read the Data
- Read the data
- Perform some basic structural cleaning to facilitate the work
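
The frame displayed in cell 4 has 698 rows, while the scenario mentions 699 instances. One possible explanation is that the CSV has no header row, so `read_csv` uses the first record as column names. A minimal sketch of that effect, assuming a headerless file; the two records are read from an in-memory string rather than from the lab's CSV, and `lost_row` and `kept_all` are illustrative names:

```python
import io
import pandas as pd

columns = ['ID', 'Clump_Thickness', 'Uniformity_Cell_Size', 'Uniformity_Cell_Shape',
           'Marginal_Adhesion', 'Single_Epithelial_Cell_Size', 'Bare_Nuclei',
           'Bland_Chromatin', 'Normal_Nucleoli', 'Mitoses', 'Class']

# Two headerless records standing in for the real file (assumption: no header row).
raw = "1002945,5,4,4,5,7,10,3,2,1,2\n1015425,3,1,1,1,2,2,3,1,1,2\n"

# Default behaviour: the first record is consumed as the header.
lost_row = pd.read_csv(io.StringIO(raw))

# header=None keeps every record; names= assigns the column labels.
kept_all = pd.read_csv(io.StringIO(raw), header=None, names=columns)

print(len(lost_row), len(kept_all))
```

If the lab's file does turn out to have a header row, the default call is already correct and only the column renaming is needed.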

In [1]:
# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Classification metrics matching the baseline targets (accuracy and ROC AUC)
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


In [2]:
'''
1. Sample code number: id number         'ID'
2. Clump Thickness: 1 - 10               'Clump_Thickness'
3. Uniformity of Cell Size: 1 - 10       'Uniformity_Cell_Size'
4. Uniformity of Cell Shape: 1 - 10      'Uniformity_Cell_Shape'
5. Marginal Adhesion: 1 - 10             'Marginal_Adhesion'
6. Single Epithelial Cell Size: 1 - 10   'Single_Epithelial_Cell_Size'
7. Bare Nuclei: 1 - 10                   'Bare_Nuclei'
8. Bland Chromatin: 1 - 10               'Bland_Chromatin'
9. Normal Nucleoli: 1 - 10               'Normal_Nucleoli'
10. Mitoses: 1 - 10                      'Mitoses'
11. Class(2 for benign,4 for malignant)  'Class'
'''    

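file = '../Data/breast-cancer-wisconsin-data-old.csv'
# NOTE: read_csv treats the first line as a header by default. If the file has no
# header row, that first record is lost (698 rows instead of 699); passing
# header=None together with names=[...] would keep it.
df = pd.read_csv(file)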
df.columns = ['ID','Clump_Thickness','Uniformity_Cell_Size','Uniformity_Cell_Shape','Marginal_Adhesion',
              'Single_Epithelial_Cell_Size','Bare_Nuclei','Bland_Chromatin','Normal_Nucleoli','Mitoses','Class']

In [3]:
df.replace( {'Class': {2:0, 4:1}}, inplace=True )

In [4]:
df

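| | ID | Clump_Thickness | Uniformity_Cell_Size | Uniformity_Cell_Shape | Marginal_Adhesion | Single_Epithelial_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 0 |
| 1 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 0 |
| 2 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 0 |
| 3 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 0 |
| 4 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 693 | 776715 | 3 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 0 |
| 694 | 841769 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 0 |
| 695 | 888820 | 5 | 10 | 10 | 3 | 7 | 3 | 8 | 10 | 2 | 1 |
| 696 | 897471 | 4 | 8 | 6 | 4 | 3 | 4 | 10 | 6 | 1 | 1 |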


# Step 4: Explore and Clean the Data
- Perform some initial simple **EDA** (Exploratory Data Analysis)
- Check for
    - **Number of features**
    - **Data types**
    - **Domains, Intervals**
    - **Outliers** (are they valid or spurious data [read or measurement errors])
    - **Null** (values not present or coded [as zero or empty strings])
    - **Missing Values** (coded [as zero or empty strings] or values not present)
    - **Coded content** (classes identified by numbers or codes to represent absence of data)
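
Several of these checks can be sketched in a few lines. The frame below is a small made-up sample, not the lab data, and the names `sample`, `non_numeric_columns` and `coded_missing` are illustrative. The point it tries to show is that `isna()` alone does not reveal missing values that are stored as a code such as `'?'`.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
sample = pd.DataFrame({
    'Clump_Thickness': [5, 3, 6, 4],
    'Bare_Nuclei': ['10', '2', '?', '1'],
    'Class': [0, 0, 1, 0],
})

# isna() only sees real NaN values, so a '?' code goes unnoticed.
nan_counts = sample.isna().sum()

# Columns that are not numeric usually hide a non-numeric code.
non_numeric_columns = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
]

# Coercing to numeric turns any non-numeric code into NaN so it can be counted.
coded_missing = {
    col: int(pd.to_numeric(sample[col], errors='coerce').isna().sum())
    for col in non_numeric_columns
}

# Class balance as proportions.
class_balance = sample['Class'].value_counts(normalize=True)

print(nan_counts, non_numeric_columns, coded_missing, class_balance, sep='\n')
```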

In [5]:
# df shape = 698 rows × 11 columns
# features = 10 (including ID), target = 1

In [6]:
# All columns are int64 except Bare_Nuclei, which is object because it contains the '?' code
df.dtypes

ID                              int64
Clump_Thickness                 int64
Uniformity_Cell_Size            int64
Uniformity_Cell_Shape           int64
Marginal_Adhesion               int64
Single_Epithelial_Cell_Size     int64
Bare_Nuclei                    object
Bland_Chromatin                 int64
Normal_Nucleoli                 int64
Mitoses                         int64
Class                           int64
dtype: object

In [7]:
# No NaN values are reported, but isna() does not detect missing values coded as '?'
df.isna().sum()

ID                             0
Clump_Thickness                0
Uniformity_Cell_Size           0
Uniformity_Cell_Shape          0
Marginal_Adhesion              0
Single_Epithelial_Cell_Size    0
Bare_Nuclei                    0
Bland_Chromatin                0
Normal_Nucleoli                0
Mitoses                        0
Class                          0
dtype: int64

In [8]:
df.nunique()

ID                             644
Clump_Thickness                 10
Uniformity_Cell_Size            10
Uniformity_Cell_Shape           10
Marginal_Adhesion               10
Single_Epithelial_Cell_Size     10
Bare_Nuclei                     11
Bland_Chromatin                 10
Normal_Nucleoli                 10
Mitoses                          9
Class                            2
dtype: int64

In [9]:
'''
ID                                 : >> too many numbers
df.Clump_Thickness.unique()        : array([ 5,  3,  6,  4,  8,  1,  2,  7, 10,  9], dtype=int64)
df.Uniformity_Cell_Size.unique()   : array([ 4,  1,  8, 10,  2,  3,  7,  5,  6,  9], dtype=int64)
df.Uniformity_Cell_Shape.unique()  : array([ 4,  1,  8, 10,  2,  3,  5,  6,  7,  9], dtype=int64)
df.Marginal_Adhesion.unique()      : array([ 5,  1,  3,  8, 10,  4,  6,  2,  9,  7], dtype=int64)
df.Single_Epithelial_Cell_Size.unique() : array([ 7,  2,  3,  1,  6,  4,  5,  8, 10,  9], dtype=int64)
df.Bare_Nuclei.unique()            : array(['10', '2', '4', '1', '3', '9', '7', '?', '5', '8', '6'], dtype=object)
df.Bland_Chromatin.unique()        : array([ 3,  9,  1,  2,  4,  5,  7,  8,  6, 10], dtype=int64)
df.Normal_Nucleoli.unique()        : array([ 2,  1,  7,  4,  5,  3, 10,  6,  9,  8], dtype=int64)
df.Mitoses.unique()                : array([ 1,  5,  4,  2,  3,  7, 10,  8,  6], dtype=int64)
df.Class.unique()                  : array([2, 4], dtype=int64) before the recoding in cell 3; array([0, 1]) afterwards
'''
df.Class.unique()

array([0, 1], dtype=int64)

In [10]:
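# '?' marks a missing value; recode it as '0' (outside the valid 1 - 10 range)
# so that the column can be converted to a numeric type
df['Bare_Nuclei'] = df['Bare_Nuclei'].replace('?', '0')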

In [11]:
df.Bare_Nuclei.unique()

array(['10', '2', '4', '1', '3', '9', '7', '0', '5', '8', '6'],
      dtype=object)

In [12]:
df.Bare_Nuclei = pd.to_numeric(df.Bare_Nuclei)

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 698 entries, 0 to 697
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   ID                           698 non-null    int64
 1   Clump_Thickness              698 non-null    int64
 2   Uniformity_Cell_Size         698 non-null    int64
 3   Uniformity_Cell_Shape        698 non-null    int64
 4   Marginal_Adhesion            698 non-null    int64
 5   Single_Epithelial_Cell_Size  698 non-null    int64
 6   Bare_Nuclei                  698 non-null    int64
 7   Bland_Chromatin              698 non-null    int64
 8   Normal_Nucleoli              698 non-null    int64
 9   Mitoses                      698 non-null    int64
 10  Class                        698 non-null    int64
dtypes: int64(11)
memory usage: 60.1 KB


# Step 5: Prepare the Data
- Deal with the data as required by the modelling technique
    - **Outliers** (remove or adjust if possible or necessary)
    - **Null** (remove or interpolate if possible or necessary)
    - **Missing Values** (remove or interpolate if possible or necessary)
    - **Coded content** (transform if possible or necessary [str to number or vice-versa])
    - **Normalisation** (if possible or necessary)
    - **Feature Engineering** (if useful or necessary)
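
Cell 10 recodes the `'?'` entries in `Bare_Nuclei` as 0, which lies outside the 1 - 10 domain of that feature. One alternative, sketched here on made-up values, is to convert the code to `NaN` and impute the median of the observed values. This is only one of several reasonable imputation choices; the series and the names `as_numeric` and `imputed` are illustrative.

```python
import pandas as pd

# Made-up values for illustration; '?' marks a missing entry as in the lab data.
bare_nuclei = pd.Series(['10', '2', '?', '1', '3'], name='Bare_Nuclei')

# Non-numeric codes become NaN instead of an out-of-range 0.
as_numeric = pd.to_numeric(bare_nuclei, errors='coerce')

# Median of the observed values (1, 2, 3, 10).
median_value = as_numeric.median()

# Fill the gaps with the median; all other values are unchanged.
imputed = as_numeric.fillna(median_value)

print(median_value)
print(imputed.tolist())
```

Dropping the affected rows is another option when only a few values are missing.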

# Step 6: Modelling
Refer to the Problem and Main Question.
- What are the input variables (features)?
- Is there an output variable (label)?
- If there is an output variable:
    - What is it?
    - What is its type?
- What type of Modelling is it?
    - [ ] Supervised
    - [ ] Unsupervised 
- What type of Modelling is it?
    - [ ] Regression
    - [ ] Classification (binary) 
    - [ ] Classification (multi-class)
    - [ ] Clustering

In [14]:
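# Features: every column except the target. Note that ID is still included here,
# although it is only a record identifier.
X = df.drop(columns=['Class'])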
y = df['Class']

In [16]:
y

0      0
1      0
2      0
3      0
4      1
      ..
693    0
694    0
695    1
696    1
697    1
Name: Class, Length: 698, dtype: int64

# Step 7: Split the Data

Need to check for **Supervised** modelling:
- Number of known cases or observations
- Define the split in Training/Test or Training/Validation/Test and their proportions
- Check for unbalanced classes and how to keep or avoid the imbalance when splitting
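
One way to keep class proportions identical in the training and test subsets is the `stratify` argument of `train_test_split`. The sketch below uses synthetic, deliberately imbalanced labels as a stand-in for the lab's target; `X_demo`, `y_demo` and the split variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 200 negatives and 100 positives, with integer features in 1..10.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(300, 9))
y_demo = np.array([0] * 200 + [1] * 100)

# stratify=y_demo keeps the class proportions the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of the positive class in each subset.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the two proportions can drift apart by chance, which matters more as the dataset gets smaller.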

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 8: Define and Fit Models

Define the model and its hyper-parameters.

Review the parameters and hyper-parameters of each model at every (re)run, after checking how the model performs on the training and test datasets.
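
To make the idea behind bagging concrete, here is a hand-rolled sketch: fit several decision trees on bootstrap samples and combine them by majority vote. It is an illustration of the principle on synthetic data, not a description of how `BaggingClassifier` is implemented internally, and it does not reproduce the `max_samples` or `max_features` settings used in this lab. All variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the lab's features (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a majority vote over two classes cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap: draw len(X_tr) row indices with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Aggregate: each tree votes and the majority class wins.
votes = np.stack([tree.predict(X_te) for tree in trees])  # shape (n_trees, n_test)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)

bagged_accuracy = (y_pred == y_te).mean()
print(bagged_accuracy)
```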

In [29]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

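tree = DecisionTreeClassifier()
# The estimator is passed positionally: the keyword is `base_estimator` in older
# scikit-learn releases and `estimator` in newer ones.
bag = BaggingClassifier(
    tree,
    n_estimators=100,
    max_samples=0.8,
    max_features=0.8)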

In [30]:
bag.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_features=0.8,
                  max_samples=0.8, n_estimators=100)

# Step 9: Verify and Evaluate the Training Model
- Use the **training** data to make predictions
- Check for overfitting
- What metrics are appropriate for the modelling approach used
- For **Supervised** models:
    - Check the **Training Results** with the **Training Predictions** during development
- Analyse, modify the parameters and hyper-parameters and repeat (within reason) until the model does not improve
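
A minimal sketch of the overfitting check described in this step: compare accuracy on the training data with a cross-validated estimate obtained from the training data only. The data is synthetic (`make_classification`), and `X_demo`, `model`, `train_accuracy` and `cv_scores` are illustrative names rather than the lab's variables.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the lab's features (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Estimator passed positionally so the call does not depend on the keyword name.
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)

# Accuracy on the data the model was fitted on (usually optimistic).
train_accuracy = model.score(X_tr, y_tr)

# 5-fold cross-validation on the training data; the test set stays untouched.
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5)

print(train_accuracy, cv_scores.mean())
```

A training accuracy far above the cross-validated mean is a common sign of overfitting.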

In [31]:
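# Note: this is the accuracy on the held-out test set; the training accuracy
# that this step asks for would be bag.score(X_train, y_train)
bag.score(X_test, y_test)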

0.9714285714285714

# Step 10: Make Predictions and Evaluate the Test Model
**NOTE**: **Do this only once the model is no longer being improved**.

- Use the **test** data to make predictions
- For **Supervised** models:
    - Check the **Test Results** with the **Test Predictions**
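
The baseline for this lab is stated as both accuracy and ROC AUC, whereas `score` reports accuracy only. The sketch below shows one way to compute both metrics, plus a confusion matrix, on a held-out set. It runs on synthetic data, so its numbers are not comparable with the baseline, and all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the lab's features (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)

# Hard class predictions for accuracy and the confusion matrix.
y_pred = model.predict(X_te)
# Probability of the positive class (column 1) for ROC AUC.
y_score = model.predict_proba(X_te)[:, 1]

test_accuracy = accuracy_score(y_te, y_pred)
test_auc = roc_auc_score(y_te, y_score)
cm = confusion_matrix(y_te, y_pred)  # rows: true class, columns: predicted class

print(test_accuracy, test_auc)
print(cm)
```

Passing probabilities rather than hard predictions to `roc_auc_score` lets the metric evaluate the ranking of cases across all thresholds.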

In [32]:
# Refit on the training data and report accuracy on the held-out test set
bag.fit(X_train, y_train).score(X_test, y_test)

0.9714285714285714

# Step 11: Solve the Problem or Answer the Question
The results of an analysis or modelling can be used:
- As part of a product or process, so the model can make predictions when new input data is available
- As part of a report including text and charts to help understand the problem
- As input for further questions
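
For the first use listed above, a fitted model is applied to records it has not seen before. A minimal sketch, assuming a model fitted on synthetic data; `new_case` is a made-up observation and every name here is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the lab's features (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=9, random_state=0)
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
model.fit(X_demo, y_demo)

# A new observation must be 2-D: one row with the same columns as the training data.
new_case = X_demo[:1] + 0.01

predicted_class = int(model.predict(new_case)[0])
positive_probability = float(model.predict_proba(new_case)[0, 1])

print(predicted_class, positive_probability)
```

In practice the new record needs the same preparation (column order, missing-value handling) as the training data.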

© 2020 Institute of Data