<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.2: Boosting

INSTRUCTIONS:

- Read the guides and hints then create the necessary analysis and code to find an answer and conclusion for the scenario below.
- The baseline results (minimum) are:
    - **Accuracy** = 0.9429
    - **ROC AUC**  = 0.9333
- Try to achieve better results!

## Scenario: Predicting Breast Cancer
The dataset you are going to be using for this laboratory is popularly known as the **Wisconsin Breast Cancer** dataset. The task related to it is Classification.

The dataset contains _10_ features, and each instance is labelled as either **benign** or **malignant**. There are _699_ instances, with _16_ missing feature values in total. The dataset only contains numeric values.

# Step 1: Define the problem or question
Identify the subject matter and the given or obvious questions that would be relevant in the field.

## Potential Questions
List the given or obvious questions.

## Actual Question
Choose the **one** question that should be answered.

In [4]:
# Install xgboost first if needed: pip install xgboost
# Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LinearRegression

from xgboost import XGBClassifier


# Step 2: Find the Data
## Wisconsin Breast Cancer DataSet
- **Citation Request**

    This breast cancer database was obtained from the **University of Wisconsin Hospitals**, **Madison** from **Dr. William H. Wolberg**. If you publish results obtained using this database, please include this information in your acknowledgements.

- **Title**

    Wisconsin Breast Cancer Database (January 8, 1991)

- **Sources**
    - **Creator**
            Dr. William H. Wolberg (physician)
            University of Wisconsin Hospitals
            Madison, Wisconsin
            USA
    - **Donor**
            Olvi Mangasarian (mangasarian@cs.wisc.edu)
            Received by David W. Aha (aha@cs.jhu.edu)
    - **Date**
            15 July 1992

# Step 3: Read the Data
- Read the data
- Perform some basic structural cleaning to facilitate the work

In [5]:
# READ FILE
file = '../Data/breast-cancer-wisconsin-data-old.csv'
df = pd.read_csv(file)
# SET COLUMN HEADER
df.columns = ['ID','Clump_Thickness','Uniformity_Cell_Size','Uniformity_Cell_Shape','Marginal_Adhesion',
              'Single_Epithelial_Cell_Size','Bare_Nuclei','Bland_Chromatin','Normal_Nucleoli','Mitoses','Class']

In [6]:
# REPLACE THE '?' PLACEHOLDER (MISSING VALUES) WITH 0, THEN CONVERT THE COLUMN TO NUMERIC
df['Bare_Nuclei'] = df['Bare_Nuclei'].replace(['?'], '0')
df['Bare_Nuclei'] = pd.to_numeric(df['Bare_Nuclei'])

In [9]:
# RECODE THE TARGET: 2 (benign) -> 0, 4 (malignant) -> 1
df.replace({'Class': {2: 0, 4: 1}}, inplace=True)

# Step 4: Explore and Clean the Data
- Perform some initial simple **EDA** (Exploratory Data Analysis)
- Check for
    - **Number of features**
    - **Data types**
    - **Domains, Intervals**
    - **Outliers** (are they valid or spurious data [read or measurement errors])
    - **Null** (values not present or coded [as zero or empty strings])
    - **Missing Values** (coded [as zero or empty strings] or values not present)
    - **Coded content** (classes identified by numbers or codes to represent absence of data)
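
A minimal sketch of the checks listed above. It assumes a small hypothetical DataFrame named `demo` with invented values (not the lab's data), using the same `'?'` coding for missing values and the same 2/4 class codes:

```python
import pandas as pd

# Hypothetical stand-in for the lab's DataFrame (values invented for illustration)
demo = pd.DataFrame({
    'Clump_Thickness': [5, 3, 8, 1, 10],
    'Bare_Nuclei': ['1', '10', '?', '2', '?'],   # '?' marks a missing value
    'Class': [2, 2, 4, 2, 4],
})

n_features = demo.shape[1] - 1                         # every column except the label
dtypes = demo.dtypes                                   # a non-numeric dtype hints at coded content
n_coded_missing = (demo['Bare_Nuclei'] == '?').sum()   # missing values hidden behind a code
n_null = demo.isnull().sum().sum()                     # values that are genuinely absent
class_counts = demo['Class'].value_counts()            # class balance

# Domains and intervals of the numeric columns (min, max, quartiles)
summary = demo.describe()

print(n_features, n_coded_missing, n_null)
print(class_counts)
print(summary)
```

On the real data the same calls would be applied to `df`. Note that `isnull()` alone would not reveal the `'?'` entries, which is why coded content is checked separately.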

# Step 5: Prepare the Data
- Deal with the data as required by the modelling technique
    - **Outliers** (remove or adjust if possible or necessary)
    - **Null** (remove or interpolate if possible or necessary)
    - **Missing Values** (remove or interpolate if possible or necessary)
    - **Coded content** (transform if possible or necessary [str to number or vice-versa])
    - **Normalisation** (if possible or necessary)
    - **Feature Engineering** (if useful or necessary)
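
One possible way to handle coded missing values is median imputation. The sketch below is an illustration only, using a hypothetical Series named `bare` with invented values; it is an alternative to replacing `'?'` with a fixed number, not the method this lab prescribes:

```python
import pandas as pd

# Hypothetical column with the same '?' coding as Bare_Nuclei (values invented)
bare = pd.Series(['1', '10', '?', '2', '?', '1'], name='Bare_Nuclei')

# errors='coerce' turns anything non-numeric (here '?') into NaN
bare_num = pd.to_numeric(bare, errors='coerce')

# Fill the gaps with the median of the observed values
median = bare_num.median()
bare_filled = bare_num.fillna(median)

print(median)
print(bare_filled.tolist())
```

To avoid leaking information from the test set, the median would normally be computed on the training split only and then reused for the test split.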

# Step 6: Modelling
Refer to the Problem and Main Question.
- What are the input variables (features)?
- Is there an output variable (label)?
- If there is an output variable:
    - What is it?
    - What is its type?
- What type of Modelling is it?
    - [ ] Supervised
    - [ ] Unsupervised 
- What type of Modelling is it?
    - [ ] Regression
    - [ ] Classification (binary) 
    - [ ] Classification (multi-class)
    - [ ] Clustering

In [10]:
# Set X and y
X = df.drop(columns=['Class', 'ID'])
y = df['Class']


# Step 7: Split the Data

For **Supervised** modelling, check:
- Number of known cases or observations
- Define the split in Training/Test or Training/Validation/Test and their proportions
- Check for unbalanced classes and decide how to preserve or correct the imbalance when splitting
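
A short sketch of one way to keep the class proportions identical in both splits, using the `stratify` argument of `train_test_split`. The arrays `X_demo` and `y_demo` are synthetic stand-ins (an assumed 65/35 class mix), not the lab's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 65 of class 0 and 35 of class 1
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(100, 9))
y_demo = np.array([0] * 65 + [1] * 35)

# stratify=y_demo keeps the 65/35 mix in both the training and the test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

print(y_tr.mean(), y_te.mean())   # share of class 1 in each split
```

Without `stratify`, the share of each class in the test split depends on the random seed, which matters more as the dataset gets smaller or more unbalanced.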

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Step 8: Define and Fit Models

Define the model and its hyper-parameters.

Consider the parameters and hyper-parameters of each model at each (re)run and after checking the efficiency of a model against the training and test datasets.

In [14]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn import metrics

# from xgboost import XGBClassifier   # already imported with the libraries above

ADBclf = AdaBoostClassifier(n_estimators=50, learning_rate=1)   # Default weak learner: a depth-1 decision tree
ADBclf.fit(X_train, y_train)

# Example with a custom weak learner (svc must be defined first):
# abc = AdaBoostClassifier(n_estimators=50, base_estimator=svc, learning_rate=1)
#
# base_estimator : The weak learner used to train the model (default = DecisionTreeClassifier).
#                  Renamed to `estimator` in scikit-learn 1.2.
# n_estimators   : Number of weak learners to train iteratively.
# learning_rate  : Weight applied to each weak learner at each boosting iteration. The default value is 1.

In [15]:
ADBclf.score(X_test, y_test)

0.9571428571428572

In [16]:
ADBclf.predict([[0, 1, 2, 3, 4, 5, 6, 7, 8]])

array([0], dtype=int64)

In [17]:
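from sklearn.svm import SVC

# Linear SVC as the weak learner; probability=True enables predict_proba
svc = SVC(probability=True, kernel='linear')

# Note: scikit-learn 1.2 renamed base_estimator to estimator (the old name was removed in 1.4)
ADBClf_svc = AdaBoostClassifier(n_estimators=50, base_estimator=svc, learning_rate=1)
ADBClf_svc.fit(X_train, y_train)
ADBClf_svc.score(X_test, y_test)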

0.9285714285714286
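
The fits above use fixed hyper-parameters. As a hedged sketch of how the "(re)run" loop described in this step could be automated, the block below runs a small cross-validated grid search. The data comes from `make_classification` (a synthetic stand-in, not the lab's data) and the grid values are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in with 9 features, like X in this lab
X_syn, y_syn = make_classification(
    n_samples=200, n_features=9, n_informative=5, random_state=0
)

# Arbitrary illustrative grid; widen or narrow it as needed
param_grid = {'n_estimators': [25, 50], 'learning_rate': [0.5, 1.0]}

search = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation on the data passed to fit
    scoring='accuracy',
)
search.fit(X_syn, y_syn)

print(search.best_params_, round(search.best_score_, 3))
```

On the real data, `search.fit` would be called with `X_train` and `y_train` only, so the test set stays untouched until Step 10.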

# Step 9: Verify and Evaluate the Training Model
- Use the **training** data to make predictions
- Check for overfitting
- What metrics are appropriate for the modelling approach used
- For **Supervised** models:
    - Check the **Training Results** with the **Training Predictions** during development
- Analyse, modify the parameters and hyper-parameters and repeat (within reason) until the model does not improve
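
A minimal sketch of an overfitting check that uses the training data only: compare the accuracy on the training set with the cross-validated accuracy. The data is a synthetic stand-in from `make_classification`, and the names `train_acc`, `cv_acc` and `gap` are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the lab's X and y
X_syn, y_syn = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=2)

model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Accuracy on the very rows the model was fitted on (optimistic by construction)
train_acc = model.score(X_tr, y_tr)

# Mean accuracy over 5 folds, each scored on rows held out from that fold's fit
cv_acc = cross_val_score(
    AdaBoostClassifier(n_estimators=50, random_state=0), X_tr, y_tr, cv=5
).mean()

gap = train_acc - cv_acc
print(round(train_acc, 3), round(cv_acc, 3), round(gap, 3))
```

A large positive gap suggests the model is memorising the training data; how large is too large is a judgment call that depends on the problem.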

# Step 10: Make Predictions and Evaluate the Test Model
**NOTE**: **Do this only once no further improvements are being made to the model**.

- Use the **test** data to make predictions
- For **Supervised** models:
    - Check the **Test Results** with the **Test Predictions**
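
A hedged sketch of the final evaluation, computing the two metrics named in the baseline (accuracy and ROC AUC) plus a confusion matrix. It uses a synthetic stand-in dataset, so the numbers it prints say nothing about the breast cancer data. Here ROC AUC is computed from predicted probabilities, which is one common choice; how the lab's baseline figure was computed is not stated:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab's X and y
X_syn, y_syn = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=2)

model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

y_pred = model.predict(X_te)               # hard class labels
y_prob = model.predict_proba(X_te)[:, 1]   # probability of class 1

acc = accuracy_score(y_te, y_pred)
auc = roc_auc_score(y_te, y_prob)
cm = confusion_matrix(y_te, y_pred)        # rows: true class, columns: predicted class

print(round(acc, 4), round(auc, 4))
print(cm)
```

In a medical setting the confusion matrix is worth reading alongside accuracy, since a false negative (a missed malignant case) and a false positive usually carry very different costs.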

# Step 11: Solve the Problem or Answer the Question
The results of an analysis or modelling can be used:
- As part of a product or process, so the model can make predictions when new input data is available
- As part of a report including text and charts to help understand the problem
- As input for further questions
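
For the first use case above, a fitted model has to be stored and reloaded wherever new data arrives. A minimal sketch using `joblib`, with a synthetic stand-in dataset and a hypothetical file name (`adaboost_model.joblib`) written to a temporary directory:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in for the lab's X and y
X_syn, y_syn = make_classification(
    n_samples=200, n_features=9, n_informative=5, random_state=0
)
model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_syn, y_syn)

# Save the fitted model to disk and load it back (hypothetical file name)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'adaboost_model.joblib')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should give the same predictions on new input rows
new_rows = X_syn[:5]
same = bool((restored.predict(new_rows) == model.predict(new_rows)).all())
print(same)
```

A saved model should be loaded with the same scikit-learn version that produced it, and only from a trusted source, since these files are pickle-based.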

© 2020 Institute of Data