# CSC 426 Homework Five - Autoencoder for Feature Extraction

Instructor: Dr. Junxiu Zhou


In this homework, we will explore the feature extraction ability of autoencoders. Specifically, we will:
- train an autoencoder on the Breast Cancer dataset (Homework 2 Question 13)
- use the encoded representation (latent space output) of the trained autoencoder as features for a classification task
- train a classifier on the extracted features to predict the class labels
- compare the classification performance of this approach with that of a classifier trained on the original features.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

## Data

In this project, we will use the well-known Breast Cancer Wisconsin (Diagnostic) Data Set.


There are two ways to get the dataset:

- (recommended) load it from sklearn, i.e., from sklearn.datasets import load_breast_cancer (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html). To identify the features and labels, you need to understand sklearn's Bunch object, i.e., how sklearn organizes a dataset.
- download it from its source (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)) or its Kaggle page
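
For the recommended route, the object returned by `load_breast_cancer()` is a `Bunch`, a dictionary-like container whose entries can be read either as keys or as attributes. The following is a minimal sketch of the fields that matter for this homework; the variable names `bunch`, `features`, and `labels` are chosen here only for illustration.

```python
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer()

# A Bunch behaves like a dict whose keys are also available as attributes.
print(sorted(bunch.keys()))

features = bunch.data        # feature matrix, one row per sample
labels = bunch["target"]     # key access is equivalent to bunch.target

print(features.shape, labels.shape)
print(bunch.feature_names[:3])   # names of the first three feature columns
print(bunch.target_names)        # class names; position i corresponds to label i
```

Reading `target_names` alongside `target` is a quick way to check which integer label stands for which diagnosis before training a classifier.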

In [2]:
from sklearn.datasets import load_breast_cancer
# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
print(X.shape, y.shape)

(569, 30) (569,)


In [3]:
# Split the dataset into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=24)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(455, 30) (114, 30) (455,) (114,)


In [4]:
# Standardize the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
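
To make the standardization step concrete, here is a small NumPy sketch of the arithmetic that fit_transform and transform perform: the mean and standard deviation are estimated on the training split only and then reused for the test split. The arrays `train` and `test` are made-up data, not the homework dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(8, 3))   # made-up training data
test = rng.normal(loc=5.0, scale=2.0, size=(4, 3))    # made-up test data

# Statistics come from the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)   # population std (ddof=0), which StandardScaler also uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics

print(train_scaled.mean(axis=0))   # close to 0 for every column
print(train_scaled.std(axis=0))    # close to 1 for every column
print(test_scaled.mean(axis=0))    # generally not exactly 0
```

Reusing the training statistics on the test split keeps information about the test data out of the preprocessing step.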

## Baseline Model

Here I will use Logistic Regression; you may use another model instead.

In [5]:
from sklearn.linear_model import LogisticRegression
clf_lr = LogisticRegression(random_state=24, max_iter=10000)
clf_lr.fit(X_train_scaled, y_train)

## Autoencoder-based Model

### Task 1: train an autoencoder on the training dataset  (15 points)

In [15]:
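from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Encoder: compress the 30 standardized features into a 20-dimensional code
encoder = Sequential([
    Dense(units=20, activation='relu', input_shape=[30])
])

# Decoder: map the 20-dimensional code back to 30 features
# (ReLU on this output layer limits reconstructions to non-negative values)
decoder = Sequential([
    Dense(units=30, activation='relu', input_shape=[20])
])

# Chain encoder and decoder; train to reproduce the input under MSE loss
autoencoder = Sequential([encoder, decoder])
autoencoder.compile(loss="mse",
                    optimizer=tf.keras.optimizers.SGD(learning_rate=0.001))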
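
One property of this architecture is worth checking numerically. The targets are standardized, so many of them are negative, while a ReLU output layer can only produce values greater than or equal to zero. The NumPy sketch below uses made-up standard-normal data as a stand-in for the scaled features and computes the lowest MSE any ReLU output could reach against such targets. It is an illustration of the bound, not a measurement of the model trained in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up stand-in for a standardized feature matrix (zero mean, unit variance).
targets = rng.standard_normal(size=(455, 30))


def relu(z):
    return np.maximum(z, 0.0)


# Best case for a ReLU output: match every non-negative target exactly
# and output 0 for every negative target.
best_relu_output = relu(targets)
mse_floor = np.mean((best_relu_output - targets) ** 2)

# A linear (identity) output could in principle match the targets exactly.
best_linear_output = targets
mse_linear = np.mean((best_linear_output - targets) ** 2)

print(f"lowest MSE reachable with a ReLU output:   {mse_floor:.3f}")
print(f"lowest MSE reachable with a linear output: {mse_linear:.3f}")
```

In Keras, a linear output corresponds to `activation=None` (or `'linear'`) on the last `Dense` layer. Whether that change would improve the downstream classification is not something this notebook tests.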

In [17]:
ae_history = autoencoder.fit(X_train_scaled,X_train_scaled,epochs=1000)

Epoch 1/1000
Epoch 2/1000
...
Epoch 72/1000
(remaining training output truncated)

### Task 2: use the encoded representation (latent space result) of the trained autoencoder as features for a classification task  (10 points)

In [21]:
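# autoencoder.predict runs encoder and decoder, so this is the 30-feature
# reconstruction; the 20-dimensional latent code is encoder.predict(X_train_scaled)
reconstruction = autoencoder.predict(X_train_scaled)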



In [22]:
reconstruction.shape

(455, 30)
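
The shape (455, 30) matches the input because `autoencoder.predict` passes the data through both the encoder and the decoder. The latent representation is the output of the encoder alone, which for the `encoder` model defined in Task 1 would be `encoder.predict(X_train_scaled)`. The NumPy sketch below shows the difference in shapes without requiring TensorFlow. All weights (`W_enc`, `b_enc`, `W_dec`, `b_dec`) are random placeholders, whereas a trained model would supply learned values, and `X` is made-up data standing in for `X_train_scaled`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(size=(455, 30))   # placeholder for X_train_scaled

# Hypothetical encoder parameters (30 inputs -> 20 latent units).
W_enc = rng.standard_normal(size=(30, 20)) * 0.1
b_enc = np.zeros(20)
# Hypothetical decoder parameters (20 latent units -> 30 outputs).
W_dec = rng.standard_normal(size=(20, 30)) * 0.1
b_dec = np.zeros(30)


def dense_relu(inputs, weights, bias):
    """One fully connected layer followed by ReLU."""
    return np.maximum(inputs @ weights + bias, 0.0)


latent = dense_relu(X, W_enc, b_enc)                 # encoder output only
reconstruction = dense_relu(latent, W_dec, b_dec)    # encoder followed by decoder

print(latent.shape)          # (455, 20)
print(reconstruction.shape)  # (455, 30)
```

If the latent code is used as the feature set, the same encoder has to be applied to the test split as well, so that the classifier sees 20-dimensional inputs at both training and prediction time.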

### Task 3: train a classifier on the extracted features to predict the class labels  (10 points)

In [23]:
clf_lr_encoded = LogisticRegression(random_state=24, max_iter=10000)
clf_lr_encoded.fit(reconstruction, y_train)

### Task 4: compare the classification performance of this approach with using the original data as features for classification. (5 points)
**Present your results comparison and final analysis/conclusion here**:

In [40]:
from sklearn.metrics import accuracy_score
pred = clf_lr.predict(X_test_scaled)
accuracy_score(y_test, pred)

0.956140350877193

In [41]:
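# clf_lr_encoded was fit on autoencoder outputs, while these test features
# are the scaled inputs that have not passed through the autoencoder
pred_encoded = clf_lr_encoded.predict(X_test_scaled)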
accuracy_score(y_test, pred_encoded)

0.9385964912280702
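
Accuracy is the number of correct predictions divided by the test-set size, so the two scores can be converted back into counts. The sketch below uses only the numbers printed in this notebook (114 test samples and the two accuracies). The variable names are chosen here for illustration.

```python
n_test = 114                        # test-set size from the train/test split
acc_baseline = 0.956140350877193    # accuracy reported for clf_lr
acc_encoded = 0.9385964912280702    # accuracy reported for clf_lr_encoded

# Convert each accuracy back to a count of correctly classified test samples.
correct_baseline = round(acc_baseline * n_test)
correct_encoded = round(acc_encoded * n_test)

print(correct_baseline, correct_encoded)    # 109 107
print(correct_baseline - correct_encoded)   # 2
```

The gap between the two models is therefore two test samples out of 114. A difference that small may well change with a different random split, so it is probably best read as roughly comparable performance rather than a clear win for either feature set.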

# Done!