# Classification on Energy data
## This code was written on Google Colab.
# **README**
## ***Please run this file on Google Colab, one cell at a time.*** If you open this file in Jupyter Notebook, it might not run to completion, since the indentation is slightly different.
## ***You have to choose the data file in the 5th code cell.*** The file is included in the zip archive.
## I used Google Colab because I had a problem updating TensorFlow to the latest version. Thank you.





This file contains code to:

* Load a CSV file using Pandas.
* Create train, validation, and test sets.
* Define and train a model using Keras (including setting class weights).
* Evaluate the model using various metrics (including precision and recall).
* Try common techniques for dealing with imbalanced data like:
    * Class weighting 
    * Oversampling


## Setup

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

In [0]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

In [0]:
import tensorflow as tf
from tensorflow import keras

import os
import tempfile

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

import sklearn
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [0]:
mpl.rcParams['figure.figsize'] = (12, 10)
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

## Data processing and exploration

In [0]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

In [0]:
#--- Store Data set in Pandas Data Frame
import io
raw_df = pd.read_csv(io.BytesIO(uploaded['energydata_complete.csv']))
# Dataset is now stored in a Pandas Dataframe
raw_df.head()



In [0]:
raw_df.describe()

In [0]:
#--- According to our Problem setting, Make y as Binary label (0,1)

class_div = raw_df['Appliances'].median()

App = []
for yi in raw_df['Appliances']:
    if yi >= class_div :
        App.append(1)
    else :
        App.append(0)

raw_df['Class'] = App
raw_df = raw_df.drop(['Appliances'], axis=1)
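
The labeling loop above can also be written as a single vectorized comparison. The following is a minimal sketch on a small made-up array (the values are illustrative and not taken from the dataset). It assumes the same rule as the cell above, where ties at the median go to class 1.

```python
import numpy as np

# Hypothetical appliance readings, for illustration only.
appliances = np.array([30, 50, 60, 60, 90, 120, 400])

class_div = np.median(appliances)               # threshold, as in the cell above
labels = (appliances >= class_div).astype(int)  # 1 = at or above the median

print(class_div, labels)
```

With the real data, the equivalent one-liner would be `raw_df['Class'] = (raw_df['Appliances'] >= class_div).astype(int)`. Because ties go to class 1, the positive class can never be the smaller one under this rule.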

### Examine the class label imbalance

In [0]:
neg, pos = np.bincount(raw_df['Class'])
total = neg + pos
print('Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

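This shows that negative samples make up the smaller fraction: the threshold is the median and ties are assigned to class 1, so at least half of the examples are positive.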

### Clean, split and normalize the data

The raw data contains a few columns that are not used as features here. Drop the `date`, `lights`, `rv1`, and `rv2` columns. The `Appliances` column was already replaced by the binary `Class` label above.

In [0]:
cleaned_df = raw_df.copy()

#--- Delete columns we'll not use
labels_to_drop = ['date','lights','rv1', 'rv2']
cleaned_df = cleaned_df.drop(labels=labels_to_drop, axis=1) # drop() returns a new DataFrame, so reassign it
cleaned_df.head()


Split the dataset into train, validation, and test sets. The validation set is used during the model fitting to evaluate the loss and any metrics, however the model is not fit with this data. The test set is completely unused during the training phase and is only used at the end to evaluate how well the model generalizes to new data. This is especially important with imbalanced datasets where [overfitting](https://developers.google.com/machine-learning/crash-course/generalization/peril-of-overfitting) is a significant concern from the lack of training data.

In [0]:
# Use a utility from sklearn to split and shuffle our dataset.
train_df, test_df = train_test_split(cleaned_df, test_size=0.2)
train_df, val_df = train_test_split(train_df, test_size=0.2)

# Form np arrays of labels and features.
train_labels = np.array(train_df.pop('Class'))
bool_train_labels = train_labels != 0
val_labels = np.array(val_df.pop('Class'))
test_labels = np.array(test_df.pop('Class'))

train_features = np.array(train_df)
val_features = np.array(val_df)
test_features = np.array(test_df)
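
Because the second `train_test_split` call is applied to what remains after the first one, the validation set is 20% of the remaining 80%, not 20% of the whole dataset. The arithmetic below is a small sketch of the resulting proportions; the actual row counts will differ slightly because the split sizes are rounded to whole rows.

```python
test_size = 0.2
val_size = 0.2  # applied to what remains after the test split

train_frac = (1 - test_size) * (1 - val_size)
val_frac = (1 - test_size) * val_size
test_frac = test_size

print(round(train_frac, 2), round(val_frac, 2), round(test_frac, 2))
```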

In [0]:
val_labels

Normalize the input features using the sklearn StandardScaler.
This will set the mean to 0 and standard deviation to 1.

Note: The `StandardScaler` is only fit using the `train_features` to be sure the model is not peeking at the validation or test sets.

In [0]:
scaler = StandardScaler()
train_features = scaler.fit_transform(train_features)

val_features = scaler.transform(val_features)
test_features = scaler.transform(test_features)

train_features = np.clip(train_features, -5, 5)
val_features = np.clip(val_features, -5, 5) 
test_features = np.clip(test_features, -5, 5)


print('Training labels shape:', train_labels.shape)
print('Validation labels shape:', val_labels.shape)
print('Test labels shape:', test_labels.shape)

print('Training features shape:', train_features.shape)
print('Validation features shape:', val_features.shape)
print('Test features shape:', test_features.shape)
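
The scaling and clipping steps above amount to the following, shown here as a minimal NumPy sketch on made-up data (the arrays `train` and `val` are hypothetical). The point it illustrates is that the mean and standard deviation come from the training set only and are then reused for the other sets.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=20.0, scale=3.0, size=(100, 2))  # made-up features
val = rng.normal(loc=20.0, scale=3.0, size=(20, 2))

mean = train.mean(axis=0)   # statistics come from the training set only
std = train.std(axis=0)     # population std (ddof=0), as StandardScaler uses

train_scaled = np.clip((train - mean) / std, -5, 5)
val_scaled = np.clip((val - mean) / std, -5, 5)   # reuses the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

The validation features will generally not have exactly zero mean and unit variance after this step, which is expected.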


Caution: If you want to deploy a model, it's critical that you preserve the preprocessing calculations. The easiest way is to implement them as layers and attach them to your model before export.


### Look at the data distribution

In [0]:
pos_df = pd.DataFrame(train_features[ bool_train_labels], columns = train_df.columns)
neg_df = pd.DataFrame(train_features[~bool_train_labels], columns = train_df.columns)

# Pass x and y as keywords; newer seaborn versions no longer accept them positionally.
sns.jointplot(x=pos_df['T6'], y=pos_df['T9'], #T9 can be deleted
              kind='hex', xlim = (-5,5), ylim = (-5,5))
#plt.suptitle("Positive distribution")

sns.jointplot(x=neg_df['T6'], y=neg_df['T9'],
              kind='hex', xlim = (-5,5), ylim = (-5,5))
#_ = plt.suptitle("Negative distribution")

## Define the model and metrics

Define a function that creates a simple neural network with two densely connected hidden layers, a [dropout](https://developers.google.com/machine-learning/glossary/#dropout_regularization) layer to reduce overfitting, and an output sigmoid layer that returns the probability that a sample belongs to the high energy usage class:

In [0]:
METRICS = [
      keras.metrics.TruePositives(name='tp'),
      keras.metrics.FalsePositives(name='fp'),
      keras.metrics.TrueNegatives(name='tn'),
      keras.metrics.FalseNegatives(name='fn'), 
      keras.metrics.BinaryAccuracy(name='accuracy'),
      keras.metrics.Precision(name='precision'),
      keras.metrics.Recall(name='recall'),
      keras.metrics.AUC(name='auc'),
]

Activation_f = ['relu']

def make_model(metrics = METRICS, output_bias=None, activation_f=None):
  # Fall back to the global Activation_f at call time, so that reassigning
  # it between calls (as later cells do) still takes effect.
  if activation_f is None:
    activation_f = Activation_f
  if output_bias is not None:
    output_bias = tf.keras.initializers.Constant(output_bias)
  # Only one model is returned, so use the last activation in the list
  # for the second hidden layer.
  func = activation_f[-1]
  model = keras.Sequential([
      keras.layers.Dense(
          24, activation='relu',
          input_shape=(train_features.shape[-1],)),
      keras.layers.Dense(12, activation=func),
      keras.layers.Dropout(0.5),
      keras.layers.Dense(1, activation='sigmoid',
                         bias_initializer=output_bias),
  ])

  model.compile(
      optimizer=keras.optimizers.Adam(learning_rate=1e-3),
      loss=keras.losses.BinaryCrossentropy(),
      metrics=metrics)

  return model

### Understanding useful metrics

Notice that there are a few metrics defined above that can be computed by the model that will be helpful when evaluating the performance.



*   **False** negatives and **false** positives are samples that were **incorrectly** classified
*   **True** negatives and **true** positives are samples that were **correctly** classified
*   **Accuracy** is the percentage of examples correctly classified
>   $\frac{\text{true samples}}{\text{total samples}}$
*   **Precision** is the percentage of **predicted** positives that were correctly classified
>   $\frac{\text{true positives}}{\text{true positives + false positives}}$
*   **Recall** is the percentage of **actual** positives that were correctly classified
>   $\frac{\text{true positives}}{\text{true positives + false negatives}}$
*   **AUC** refers to the Area Under the Curve of a Receiver Operating Characteristic curve (ROC-AUC). This metric is equal to the probability that a classifier will rank a random positive sample higher than a random negative sample.

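Note: On a heavily imbalanced dataset, accuracy is not a helpful metric, because always predicting the majority class already scores highly. Here the classes are split at the median, so they are much closer to balanced and accuracy is more informative than it would be in that setting.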
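
As a worked example of the definitions above, the sketch below computes accuracy, precision, and recall from a set of confusion-matrix counts, and AUC from its ranking interpretation. All numbers are made up for illustration, and the pairwise AUC computation is a didactic version rather than how Keras computes the metric.

```python
# Hypothetical confusion-matrix counts, for illustration only.
tp, fp, tn, fn = 40, 10, 35, 15

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# AUC as a ranking probability: the fraction of (positive, negative) pairs
# in which the positive example gets the higher score (ties count as half).
pos_scores = [0.9, 0.8, 0.4]
neg_scores = [0.7, 0.3, 0.2]
pairs = [(p, n) for p in pos_scores for n in neg_scores]
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

print(accuracy, precision, recall, auc)
```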

Read more:
*  [True vs. False and Positive vs. Negative](https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative)
*  [Accuracy](https://developers.google.com/machine-learning/crash-course/classification/accuracy)
*   [Precision and Recall](https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall)
*   [ROC-AUC](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)

## Baseline model

### Build the model

Now create and train your model using the function that was defined earlier. Notice that the model is fit using a batch size of 200, which is larger than the Keras default of 32. A larger batch makes it more likely that every batch contains a reasonable number of samples from both classes to learn from.


Note: this model will not handle the class imbalance well. You will improve it later in this tutorial.

In [0]:
EPOCHS = 100
BATCH_SIZE = 200 

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_auc', 
    verbose=1,
    patience=10,
    mode='max',
    restore_best_weights=True)

In [0]:
model = make_model()
model.summary()

In [0]:
Activation_f = ['tanh']

model2 = make_model()
model2.summary()

In [0]:
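Activation_f = ['softmax']

model3 = make_model()
model3.summary()

# Reset the global so that later make_model() calls use the default activation again.
Activation_f = ['relu']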

Test run the model:

In [0]:
model.predict(train_features[:10])

### Optional: Set the correct initial bias.

With the default bias initialization the loss should be about `math.log(2) = 0.69314` 

In [0]:
results = model.evaluate(train_features, train_labels, batch_size=BATCH_SIZE, verbose=0)
print("Loss: {:0.4f}".format(results[0]))

The loss is not very different from ln(2).
As an experiment, though, we will try to find the correct bias.



The correct bias to set can be derived from:

$$ p_0 = pos/(pos + neg) = 1/(1+e^{-b_0}) $$
$$ b_0 = -\log_e(1/p_0 - 1) $$
$$ b_0 = \log_e(pos/neg) $$
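
The derivation can be checked numerically. The sketch below uses made-up class counts and assumes that, at initialization, the output is driven by the bias alone (the hidden layers contribute nothing to the logit). Under that assumption it confirms that the sigmoid of $b_0$ recovers $p_0$, and compares the expected initial loss with and without the careful bias.

```python
import math

# Hypothetical class counts, for illustration only.
pos, neg = 600, 400
p0 = pos / (pos + neg)

b0 = math.log(pos / neg)                 # b_0 = log_e(pos/neg)
sigmoid_b0 = 1 / (1 + math.exp(-b0))     # should recover p_0

# Expected initial loss when every prediction equals p_0.
loss_careful = -(p0 * math.log(p0) + (1 - p0) * math.log(1 - p0))
# Expected initial loss when every prediction equals 0.5 (zero bias).
loss_naive = math.log(2)

print(b0, sigmoid_b0, loss_careful, loss_naive)
```

When the classes are close to balanced, as in this made-up example, the two losses are close to each other, so the benefit of the careful bias is small.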

In [0]:
initial_bias = np.log([pos/neg])
initial_bias

Set that as the initial bias, and the model will give much more reasonable initial guesses. 

It should be near `pos/total`, the fraction of positive examples printed earlier.

In [0]:
model = make_model(output_bias = initial_bias)
model.predict(train_features[:10])

In [0]:
results = model.evaluate(train_features, train_labels, batch_size=BATCH_SIZE, verbose=0)
print("Loss: {:0.4f}".format(results[0]))

This initial loss is less than it would have been with naive initialization.

This way the model doesn't need to spend the first few epochs just learning the base rate of positive examples. This also makes it easier to read plots of the loss during training.

### Checkpoint the initial weights

To make the various training runs more comparable, keep this initial model's weights in a checkpoint file, and load them into each model before training.

In [0]:
initial_weights = os.path.join(tempfile.mkdtemp(),'initial_weights')
model.save_weights(initial_weights)

### Confirm that the bias fix helps

Before moving on, quickly confirm that the careful bias initialization actually helped.

Train the model for 20 epochs, with and without this careful initialization, and compare the losses: 

In [0]:
model = make_model()
model.load_weights(initial_weights)
model.layers[-1].bias.assign([0.0])
zero_bias_history = model.fit(
    train_features,
    train_labels,
    batch_size=BATCH_SIZE,
    epochs=20,
    validation_data=(val_features, val_labels), 
    verbose=0)

In [0]:
model = make_model()
model.load_weights(initial_weights)
careful_bias_history = model.fit(
    train_features,
    train_labels,
    batch_size=BATCH_SIZE,
    epochs=20,
    validation_data=(val_features, val_labels), 
    verbose=0)


In [0]:
def plot_loss(history, label, n):
  # Use a log scale to show the wide range of values.
  plt.semilogy(history.epoch,  history.history['loss'],
               color=colors[n], label='Train '+label)
  plt.semilogy(history.epoch,  history.history['val_loss'],
          color=colors[n], label='Val '+label,
          linestyle="--")
  plt.xlabel('Epoch')
  plt.ylabel('Loss')
  
  plt.legend()

In [0]:
plot_loss(zero_bias_history, "Zero Bias", 0)

In [0]:
plot_loss(zero_bias_history, "Zero Bias", 0)
plot_loss(careful_bias_history, "Careful Bias", 1)

The above figure makes it clear: In terms of validation loss, on this problem, this careful initialization gives an advantage, but not a lot.


In [0]:

model2.load_weights(initial_weights)
model2.layers[-1].bias.assign([0.0])
zero_bias_history = model2.fit(
    train_features,
    train_labels,
    batch_size=BATCH_SIZE,
    epochs=20,
    validation_data=(val_features, val_labels), 
    verbose=0)

In [0]:
plot_loss(zero_bias_history, "Zero Bias", 0)

In [0]:

model3.load_weights(initial_weights)
model3.layers[-1].bias.assign([0.0])
zero_bias_history = model3.fit(
    train_features,
    train_labels,
    batch_size=BATCH_SIZE,
    epochs=20,
    validation_data=(val_features, val_labels), 
    verbose=0)

plot_loss(zero_bias_history, "Zero Bias", 0)

### Train the model

In [0]:
model = make_model()
model.load_weights(initial_weights)
baseline_history = model.fit(
    train_features,
    train_labels,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    callbacks = [early_stopping],
    validation_data=(val_features, val_labels))

### Check training history
In this section, you will produce plots of your model's loss, AUC, precision, and recall on the training and validation sets. These are useful to check for overfitting.
You can produce the same plots for any of the other metrics you created above.

In [0]:
def plot_metrics(history):
  metrics =  ['loss', 'auc', 'precision', 'recall']
  for n, metric in enumerate(metrics):
    name = metric.replace("_"," ").capitalize()
    plt.subplot(2,2,n+1)
    plt.plot(history.epoch,  history.history[metric], color=colors[n], label='Train')
    plt.plot(history.epoch, history.history['val_'+metric],
             color=colors[n], linestyle="--", label='Val')
    plt.xlabel('Epoch')
    plt.ylabel(name)
    if metric == 'loss':
      plt.ylim([0.4,0.7]) #plt.ylim()[2]
    elif metric == 'auc':
      plt.ylim([0.5,1])
    else:
      plt.ylim([0.4,1.01])

    plt.legend()


In [0]:
plot_metrics(baseline_history)

Note that the validation curve sometimes performs better than the training curve. This is mainly caused by the fact that the dropout layer is not active when evaluating the model.

### Evaluate metrics

You can use a [confusion matrix](https://developers.google.com/machine-learning/glossary/#confusion_matrix) to summarize the actual vs. predicted labels where the X axis is the predicted label and the Y axis is the actual label.
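
To make the layout concrete, here is a minimal sketch that builds a 2x2 confusion matrix by hand from made-up labels and scores. It uses the same convention as `sklearn.metrics.confusion_matrix`: rows are actual labels and columns are predicted labels.

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 1, 0])              # made-up ground truth
scores = np.array([0.2, 0.7, 0.9, 0.4, 0.6, 0.1])  # made-up model outputs
predicted = (scores > 0.5).astype(int)

cm = np.zeros((2, 2), dtype=int)
for actual, pred in zip(labels, predicted):
    cm[actual, pred] += 1   # row = actual label, column = predicted label

tn, fp, fn, tp = cm.ravel()
print(cm)
```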

In [0]:
train_predictions_baseline = model.predict(train_features, batch_size=BATCH_SIZE)
test_predictions_baseline = model.predict(test_features, batch_size=BATCH_SIZE)

In [0]:
def plot_cm(labels, predictions, p=0.5):
  cm = confusion_matrix(labels, predictions > p)
  plt.figure(figsize=(5,5))
  sns.heatmap(cm, annot=True, fmt="d")
  plt.title('Confusion matrix @{:.2f}'.format(p))
  plt.ylabel('Actual label')
  plt.xlabel('Predicted label')

  print('Low Energy Usage Correctly Identified (True Negatives): ', cm[0][0])
  print('Low Energy Usage Flagged as High (False Positives): ', cm[0][1])
  print('High Energy Usage Missed (False Negatives): ', cm[1][0])
  print('High Energy Usage Detected (True Positives): ', cm[1][1])
  print('Total High Energy Usage Samples: ', np.sum(cm[1]))
  

Evaluate your model on the test dataset and display the results for the metrics you created above.

In [0]:
baseline_results = model.evaluate(test_features, test_labels,
                                  batch_size=BATCH_SIZE, verbose=0)
for name, value in zip(model.metrics_names, baseline_results):
  print(name, ': ', value)
print()

plot_cm(test_labels, test_predictions_baseline)

### Plot the ROC

Now plot the [ROC](https://developers.google.com/machine-learning/glossary#ROC). This plot is useful because it shows, at a glance, the range of performance the model can reach just by tuning the output threshold.
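
Each point on the ROC curve corresponds to one choice of threshold. The following sketch, on made-up labels and scores, computes the false positive rate and true positive rate at a few thresholds to show how moving the threshold trades one against the other. The helper `rates_at` is a hypothetical name used only for this illustration.

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 1, 0])              # made-up ground truth
scores = np.array([0.2, 0.7, 0.9, 0.4, 0.6, 0.1])  # made-up model outputs

def rates_at(threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    predicted = scores > threshold
    tpr = np.sum(predicted & (labels == 1)) / np.sum(labels == 1)
    fpr = np.sum(predicted & (labels == 0)) / np.sum(labels == 0)
    return float(fpr), float(tpr)

for threshold in (0.3, 0.5, 0.8):
    print(threshold, rates_at(threshold))
```

Lowering the threshold catches more true positives at the cost of more false positives, which is the trade-off the plotted curve summarizes.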

In [0]:
def plot_roc(name, labels, predictions, **kwargs):
  fp, tp, _ = sklearn.metrics.roc_curve(labels, predictions)

  plt.plot(100*fp, 100*tp, label=name, linewidth=2, **kwargs)
  plt.xlabel('False positives [%]')
  plt.ylabel('True positives [%]')
  plt.xlim([-0.5,100])
  plt.ylim([0,100.5])
  plt.grid(True)
  ax = plt.gca()
  ax.set_aspect('equal')

In [0]:
plot_roc("Train Baseline", train_labels, train_predictions_baseline, color=colors[0])
plot_roc("Test Baseline", test_labels, test_predictions_baseline, color=colors[0], linestyle='--')
plt.plot(np.arange(100),np.arange(100),ls=':',alpha=0.5) # x=y
plt.legend(loc='lower right')

## Class weights

### Calculate class weights

The goal is to identify low energy usage (class 0), which is the smaller class, so you would want the classifier to weight those examples more heavily. You can do this by passing Keras weights for each class through a parameter. These will cause the model to "pay more attention" to examples from an under-represented class.

In [0]:
# Scaling by total/2 helps keep the loss to a similar magnitude.
# The sum of the weights of all examples stays the same.
weight_for_0 = (1 / neg)*(total)/2.0 
weight_for_1 = (1 / pos)*(total)/2.0

class_weight = {0: weight_for_0, 1: weight_for_1}

print('Weight for class 0: {:.2f}'.format(weight_for_0))
print('Weight for class 1: {:.2f}'.format(weight_for_1))
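
The comment about the sum of the weights can be verified with a small worked example. The class counts below are made up for illustration; the formula is the one used in the cell above.

```python
# Hypothetical class counts, for illustration only.
neg, pos = 400, 600
total = neg + pos

weight_for_0 = (1 / neg) * total / 2.0
weight_for_1 = (1 / pos) * total / 2.0

# Each class contributes half of the total weight ...
weight_sum_0 = weight_for_0 * neg
weight_sum_1 = weight_for_1 * pos
# ... so the weights summed over all examples still equal `total`.
print(weight_for_0, weight_for_1, weight_sum_0 + weight_sum_1)
```

The smaller class receives the larger per-example weight, and both classes end up contributing equally to the loss.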

### Train a model with class weights

Now try re-training and evaluating the model with class weights to see how that affects the predictions.

Note: Using `class_weight` changes the range of the loss. This may affect the stability of the training depending on the optimizer. Optimizers whose step size depends on the magnitude of the gradient, like `optimizers.SGD`, may fail. The optimizer used here, `optimizers.Adam`, is unaffected by the scaling change. Also note that because of the weighting, the total losses are not comparable between the two models.

In [0]:
weighted_model = make_model()
weighted_model.load_weights(initial_weights)

weighted_history = weighted_model.fit(
    train_features,
    train_labels,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    callbacks = [early_stopping],
    validation_data=(val_features, val_labels),
    # The class weights go here
    class_weight=class_weight) 

### Check training history

In [0]:
plot_metrics(weighted_history)

### Evaluate metrics

In [0]:
train_predictions_weighted = weighted_model.predict(train_features, batch_size=BATCH_SIZE)
test_predictions_weighted = weighted_model.predict(test_features, batch_size=BATCH_SIZE)

In [0]:
weighted_results = weighted_model.evaluate(test_features, test_labels,
                                           batch_size=BATCH_SIZE, verbose=0)
for name, value in zip(weighted_model.metrics_names, weighted_results):
  print(name, ': ', value)
print()

plot_cm(test_labels, test_predictions_weighted)

Here you can see that with class weights the accuracy and recall are lower because there are more false negatives, while the precision and AUC are higher. Since recall fell while precision rose, the weighted model must also be producing fewer false positives. Despite having lower accuracy, this model has higher precision (and correctly identifies more low energy usage). Of course, there is a cost to both types of error (you wouldn't want to bug users by flagging too many low energy usages as high energy usages, either). Carefully consider the trade-offs between these different types of errors for your application.

### Plot the ROC

In [0]:
plot_roc("Train Baseline", train_labels, train_predictions_baseline, color=colors[0])
plot_roc("Test Baseline", test_labels, test_predictions_baseline, color=colors[0], linestyle='--')

plot_roc("Train Weighted", train_labels, train_predictions_weighted, color=colors[1])
plot_roc("Test Weighted", test_labels, test_predictions_weighted, color=colors[1], linestyle='--')

plt.plot(np.arange(100),np.arange(100),ls=':',alpha=0.5) # x=y
plt.legend(loc='lower right')

## Oversampling

### Oversample the minority class

A related approach would be to resample the dataset by oversampling the minority class.

In [0]:
pos_features = train_features[bool_train_labels]
neg_features = train_features[~bool_train_labels]

pos_labels = train_labels[bool_train_labels]
neg_labels = train_labels[~bool_train_labels]

#### Using NumPy

You can balance the dataset manually by choosing the right number of random
indices from the positive examples. Note that `np.random.choice` samples with replacement by default:

In [0]:
ids = np.arange(len(pos_features))
choices = np.random.choice(ids, len(neg_features))

res_pos_features = pos_features[choices]
res_pos_labels = pos_labels[choices]

res_pos_features.shape

In [0]:
resampled_features = np.concatenate([res_pos_features, neg_features], axis=0)
resampled_labels = np.concatenate([res_pos_labels, neg_labels], axis=0)

order = np.arange(len(resampled_labels))
np.random.shuffle(order)
resampled_features = resampled_features[order]
resampled_labels = resampled_labels[order]

resampled_features.shape

#### Using `tf.data`

If you're using `tf.data` the easiest way to produce balanced examples is to start with a `positive` and a `negative` dataset, and merge them. See [the tf.data guide](../../guide/data.ipynb) for more examples.

In [0]:
BUFFER_SIZE = 100000

def make_ds(features, labels):
  ds = tf.data.Dataset.from_tensor_slices((features, labels))#.cache()
  ds = ds.shuffle(BUFFER_SIZE).repeat()
  return ds

pos_ds = make_ds(pos_features, pos_labels)
neg_ds = make_ds(neg_features, neg_labels)

Each dataset provides `(feature, label)` pairs:

In [0]:
for features, label in pos_ds.take(1):
  print("Features:\n", features.numpy())
  print()
  print("Label: ", label.numpy())

Merge the two together using `experimental.sample_from_datasets`:

In [0]:
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds, neg_ds], weights=[0.5, 0.5])
resampled_ds = resampled_ds.batch(BATCH_SIZE).prefetch(2)

In [0]:
for features, label in resampled_ds.take(1):
  print(label.numpy().mean())

To use this dataset, you'll need the number of steps per epoch.

The definition of "epoch" in this case is less clear. Say it's the number of batches required to see each negative example once:

In [0]:
# steps_per_epoch should be an integer, so convert the result of np.ceil.
resampled_steps_per_epoch = int(np.ceil(2.0*neg/BATCH_SIZE))
resampled_steps_per_epoch

### Train on the oversampled data

Now try training the model with the resampled data set instead of using class weights to see how these methods compare.

Note: Because the data was balanced by replicating the positive examples, the total dataset size is larger, and each epoch runs for more training steps.

In [0]:
resampled_model = make_model()
resampled_model.load_weights(initial_weights)

# Reset the bias to zero, since this dataset is balanced.
output_layer = resampled_model.layers[-1] 
output_layer.bias.assign([0])

val_ds = tf.data.Dataset.from_tensor_slices((val_features, val_labels)).cache()
val_ds = val_ds.batch(BATCH_SIZE).prefetch(2) 

resampled_history = resampled_model.fit(
    resampled_ds,
    epochs=EPOCHS,
    steps_per_epoch=resampled_steps_per_epoch,
    callbacks = [early_stopping],
    validation_data=val_ds)

If the training process were considering the whole dataset on each gradient update, this oversampling would be basically identical to the class weighting.

But when training the model batch-wise, as you did here, the oversampled data provides a smoother gradient signal: Instead of each positive example being shown in one batch with a large weight, they're shown in many different batches each time with a small weight. 

This smoother gradient signal makes it easier to train the model.
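
The claim that oversampling and class weighting coincide over the full dataset can be illustrated with a small calculation. The sketch below uses made-up `(label, predicted probability)` pairs and compares the summed loss when one example is weighted by 3 against the summed loss when it is replicated 3 times. The helper `bce` is a hypothetical name defined only for this example.

```python
import math

def bce(y, p):
    """Binary cross-entropy for one example."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Made-up (label, predicted probability) pairs: one minority example, three majority.
minority = [(0, 0.6)]
majority = [(1, 0.7), (1, 0.8), (1, 0.4)]

# Option 1: weight the minority example by 3.
weighted = (3 * sum(bce(y, p) for y, p in minority)
            + sum(bce(y, p) for y, p in majority))

# Option 2: replicate the minority example 3 times.
replicated = sum(bce(y, p) for y, p in minority * 3 + majority)

print(weighted, replicated)
```

The two totals agree, so a full-batch gradient would be the same either way. The difference only shows up once the data is split into batches, where the replicated copies are spread across many updates.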

### Check training history

Note that the distributions of metrics will be different here, because the training data has a totally different distribution from the validation and test data. 

In [0]:
plot_metrics(resampled_history )

### Re-train



Because training is easier on the balanced data, the above training procedure may overfit quickly. 

So break up the epochs to give the `callbacks.EarlyStopping` finer control over when to stop training.
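
Patience is counted in epochs, so shorter "epochs" mean the validation metric is checked more often and training can stop closer to the best point. The sketch below is a simplified model of the patience rule with `mode='max'`, written for illustration only; it is not the callback's actual implementation, and `stopping_epoch` is a hypothetical helper.

```python
def stopping_epoch(val_scores, patience):
    """Return (epoch at which training stops, epoch of the best score), mode='max'."""
    best_score, best_epoch = float('-inf'), -1
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_scores) - 1, best_epoch

# Made-up validation AUC values, for illustration only.
val_auc = [0.70, 0.74, 0.76, 0.75, 0.75, 0.74, 0.73]
print(stopping_epoch(val_auc, patience=3))
```

With `restore_best_weights=True`, as configured earlier, the weights from the best epoch are the ones kept after stopping.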

In [0]:
resampled_model = make_model()
resampled_model.load_weights(initial_weights)

# Reset the bias to zero, since this dataset is balanced.
output_layer = resampled_model.layers[-1] 
output_layer.bias.assign([0])

resampled_history = resampled_model.fit(
    resampled_ds,
    # These are not real epochs
    steps_per_epoch = 20,
    epochs=10*EPOCHS,
    callbacks = [early_stopping],
    validation_data=(val_ds))

### Re-check training history

In [0]:
plot_metrics(resampled_history)

### Evaluate metrics

In [0]:
train_predictions_resampled = resampled_model.predict(train_features, batch_size=BATCH_SIZE)
test_predictions_resampled = resampled_model.predict(test_features, batch_size=BATCH_SIZE)

In [0]:
resampled_results = resampled_model.evaluate(test_features, test_labels,
                                             batch_size=BATCH_SIZE, verbose=0)
for name, value in zip(resampled_model.metrics_names, resampled_results):
  print(name, ': ', value)
print()

plot_cm(test_labels, test_predictions_resampled)

### Plot the ROC

In [0]:
plot_roc("Train Baseline", train_labels, train_predictions_baseline, color=colors[0])
plot_roc("Test Baseline", test_labels, test_predictions_baseline, color=colors[0], linestyle='--')

plot_roc("Train Weighted", train_labels, train_predictions_weighted, color=colors[1])
plot_roc("Test Weighted", test_labels, test_predictions_weighted, color=colors[1], linestyle='--')

plot_roc("Train Resampled", train_labels, train_predictions_resampled,  color=colors[2])
plot_roc("Test Resampled", test_labels, test_predictions_resampled,  color=colors[2], linestyle='--')

plt.plot(np.arange(100),np.arange(100),ls=':',alpha=0.5) # x=y
plt.legend(loc='lower right')

## Applying this tutorial to your problem

Imbalanced data classification is an inherently difficult task since there are so few samples to learn from. You should always start with the data first and do your best to collect as many samples as possible and give substantial thought to what features may be relevant so the model can get the most out of your minority class. At some point your model may struggle to improve and yield the results you want, so it is important to keep in mind the context of your problem and the trade-offs between different types of errors.

# KNN

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import random
#import sklearn 
#print(sklearn.__version__)

%matplotlib inline
sns.set(style ='whitegrid', palette='bright')

In [0]:
cleaned_df.head()

In [0]:
cleaned_df = raw_df.copy()

#--- Delete columns we'll not use
labels_to_drop = ['date','lights','rv1', 'rv2']
cleaned_df = cleaned_df.drop(labels=labels_to_drop, axis=1) # drop() returns a new DataFrame, so reassign it
cleaned_df.head()

y = cleaned_df.pop('Class')
y.head()

df_feat = cleaned_df

In [0]:
#--- Scale our data: For KNN, it is better to scale the data

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df_feat)
scaled_features = scaler.transform(df_feat)
scaled_features

In [0]:
#--- Split Data: Train and Test data set

from sklearn.model_selection import train_test_split

X = df_feat  # the unscaled features are used here; use scaled_features to train on the scaled data
# The target was already assigned to the variable `y` above.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [0]:
#--- KNN
#--- set k and find a class 

random.seed(1)
# k is the number of nearest neighbors that vote on each prediction
# (it is unrelated to the number of classes, which is 2 in this problem)
k = 2 
from sklearn.neighbors import KNeighborsClassifier
#KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', 
#                     leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
#knn.get_params()
knn.get_params()
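
In k-nearest neighbors, `k` is the number of neighbors that vote on each prediction. The sketch below is a minimal NumPy version of that vote on made-up one-dimensional points (the names `X_demo`, `y_demo`, and `knn_predict` are hypothetical). It shows that with an even `k` such as 2, the vote can tie; this sketch resolves ties toward the lower label, which is an arbitrary choice made for the example.

```python
import numpy as np

# Made-up 1-D training points and labels, for illustration only.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])

def knn_predict(x, k):
    """Predict the label of x by majority vote among its k nearest points."""
    distances = np.linalg.norm(X_demo - x, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = np.bincount(y_demo[nearest], minlength=2)
    return int(votes.argmax()), votes.tolist()   # ties resolve to the lower label

print(knn_predict(np.array([1.6]), k=2))
print(knn_predict(np.array([1.6]), k=3))
```

An odd `k` avoids ties in a binary problem, which is one reason to include several values of `n_neighbors` when tuning.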

In [0]:
pred_tr = knn.predict(X_train)

In [0]:
# predict our Target for the test set
pred = knn.predict(X_test)
pred

In [0]:
# Evaluate our result
print('Scores without Cross Validation\n')
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, recall_score, precision_score
print('<< Train Error >>')
print('< Confusion Matrix >')
print(confusion_matrix(y_train, pred_tr))
print('\n< Classification Report >')
print(classification_report(y_train, pred_tr))

accuracy = accuracy_score(y_train, pred_tr)
#recall = recall_score(y_train, pred_tr) # not using it because our focused target is 0 not 1
cm = confusion_matrix(y_train, pred_tr)
recall = cm[0,0]/(cm[0,0]+cm[0,1])
#precision = precision_score(y_train, pred_tr) # not using it because of the same reason above
precision = cm[0,0]/(cm[0,0]+cm[1,0])

print('- Accuracy: {:.2f} \n- Recall: {:.2f}   (=Sensitivity)\n- Precision: {:.2f}'.format(accuracy, recall, precision))
print('\n-----------------------------------------------------------------\n')
print('<<Test error>>')

print('< Confusion Matrix >')
print(confusion_matrix(y_test, pred))
print('\n< Classification Report >')
print(classification_report(y_test, pred))

accuracy = accuracy_score(y_test, pred)
cm1 = confusion_matrix(y_test, pred)
recall = cm1[0,0]/(cm1[0,0]+cm1[0,1])
#precision = precision_score(y_test, pred) # not using it because of the same reason above
precision = cm1[0,0]/(cm1[0,0]+cm1[1,0])
#precision = precision_score(y_test, pred)
print('- Accuracy: {:.2f} \n- Recall: {:.2f}   (=Sensitivity)\n- Precision: {:.2f}'.format(accuracy, recall, precision))


In [0]:
#--- Crossvalidation & Grid Search

from sklearn.neighbors import DistanceMetric
from sklearn.model_selection import GridSearchCV

param_grid = {'algorithm': ['ball_tree', 'kd_tree', 'brute'], 
              'metric':['euclidean','manhattan', 'chebyshev', 'minkowski']}
# metrics: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html
estimator = KNeighborsClassifier(n_neighbors=2)

# cv=5 means 5-fold cross-validation
grid_knn = GridSearchCV(estimator, param_grid, cv=5, verbose=3, return_train_score=True)
grid_knn.fit(X_train, y_train)
grid_knn.cv_results_
sorted(grid_knn.cv_results_.keys())

In [0]:
grid_knn.best_estimator_

In [0]:
#--- Predict with the best estimator after Grid Search and Cross validation
pred_knn_tr=grid_knn.best_estimator_.predict(X_train)
pred_knn_ts=grid_knn.best_estimator_.predict(X_test)

In [0]:
# Results for Train Set

print(confusion_matrix(y_train, pred_knn_tr))
print(classification_report(y_train, pred_knn_tr))

accuracy = accuracy_score(y_train,  pred_knn_tr)
#recall = recall_score(y_train, pred_tr) # not using it because our focused target is 0 not 1
cm = confusion_matrix(y_train,  pred_knn_tr)
recall = cm[0,0]/(cm[0,0]+cm[0,1])
#precision = precision_score(y_train, pred_tr) # not using it because of the same reason above
precision = cm[0,0]/(cm[0,0]+cm[1,0])

print('- Accuracy: {:.2f} \n- Recall: {:.2f}   (=Sensitivity)\n- Precision: {:.2f}'.format(accuracy, recall, precision))

In [0]:
# Results for Test Set

print(confusion_matrix(y_test, pred_knn_ts))
print(classification_report(y_test, pred_knn_ts))

accuracy = accuracy_score(y_test, pred_knn_ts)
#recall = recall_score(y_test, pred_knn_ts) # not using it because our focused target is 0 not 1
cm = confusion_matrix(y_test, pred_knn_ts)
recall = cm[0,0]/(cm[0,0]+cm[0,1])
#precision = precision_score(y_test, pred_knn_ts) # not using it for the same reason as above
precision = cm[0,0]/(cm[0,0]+cm[1,0])

print('- Accuracy: {:.2f} \n- Recall: {:.2f}   (=Sensitivity)\n- Precision: {:.2f}'.format(accuracy, recall, precision))

In [0]:
knn_ts_score =[]

for paramtrs in grid_knn.cv_results_['params']:
    # Refit KNN with each parameter combination and score it on the test set
    knn = KNeighborsClassifier(n_neighbors=2, algorithm=paramtrs['algorithm'], metric=paramtrs['metric'])
    knn.fit(X_train, y_train)
    pred = knn.predict(X_test)
    cm1 = confusion_matrix(y_test, pred)
    recall = cm1[0,0]/(cm1[0,0]+cm1[0,1])   # recall for class 0, the focused target
    knn_ts_score.append(recall)
    
knn_ts_score

In [0]:
from sklearn.model_selection import learning_curve
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):

    plt.figure(figsize=(10,5))
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training Size")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="#0abab5")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="#EE82EE")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="#0abab5",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="#EE82EE",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt


#--- Plotting
cv = 5      # number of folds
n_jobs = -1

X, y = X_train, y_train

title = r"Learning Curves of KNN"
knn = KNeighborsClassifier(n_neighbors=2)
estimator = knn

plot_learning_curve(estimator, title, X, y, (0.5, 1.01), cv=cv, n_jobs=n_jobs)

plt.show()

In [0]:
def plot_search_results(grid, score):
 
    #--- Results from grid search
    results = grid.cv_results_
    means_test = results['mean_test_score']
    stds_test = results['std_test_score']
    means_train = results['mean_train_score']
    stds_train = results['std_train_score']
    testset = np.array(score)
    
    
    #knn_testset = np.array(knn_ts_score)
    #poly_testset = np.array(poly_testset_score)
    #sigmoid_testset = np.array(sigmoid_testset_score)
    #dtree_testset = np.array(dtree_testset_score)
    #gb_testset = np.array(gb_testset_score)

    #--- Getting indexes of values per hyper-parameter
    masks=[]
    masks_names= list(grid.best_params_.keys())
    for p_k, p_v in grid.best_params_.items():
        masks.append(list(results['param_'+p_k].data==p_v))

    params=grid.param_grid
    
    #--- Plotting results
    fig, ax = plt.subplots(1,len(params),sharex='none', sharey='all',figsize=(20,5))
    #fig = plot_figure(style_label='classic') #################
    fig.suptitle('Scores vs. Parameters')
    ax[0].set_ylabel('Sensitivity')
    #fig.text(0.04, 0.5, 'Recall', va='center', rotation='vertical')
    for i, p in enumerate(masks_names):
        # Keep only the runs where every other parameter is at its best value
        m = np.stack(masks[:i] + masks[i+1:])
        best_parms_mask = m.all(axis=0)
        best_index = np.where(best_parms_mask)[0]
        x = np.array(params[p])
        y_1 = np.array(means_train[best_index])
        e_1 = np.array(stds_train[best_index])
        y_2 = np.array(means_test[best_index])
        e_2 = np.array(stds_test[best_index])
        y_3 = np.array(testset[best_index])
        
        ax[i].set_xlabel(p.upper())
        ax[i].plot(x,y_1,color="#0abab5", label='Train Score',linestyle='--',marker ='o')
        ax[i].plot(x,y_2,color="#EE82EE", label='5-fold Cross-validation Score',linestyle='--',marker ='o')
        ax[i].plot(x,y_3,color='b', label='Test Score',linestyle='-',marker ='^')
        ax[i].fill_between(x, y_1-e_1,y_1+e_1,alpha=0.1,
                     color="#0abab5")
        ax[i].fill_between(x, y_2-e_2,y_2+e_2,alpha=0.1,
                     color="#EE82EE")
        ax[i].legend()

    plt.show()


    
plot_search_results(grid_knn, knn_ts_score)

In [0]:
def plot_search_results(grid, score):
 
    #--- Results from grid search
    results = grid.cv_results_
    means_train = results['mean_train_score']
    stds_train = results['std_train_score']
    means_test = results['mean_test_score']
    stds_test = results['std_test_score']

    testset = np.array(score)
    
    
    #knn_testset = np.array(knn_ts_score)
    #poly_testset = np.array(poly_testset_score)
    #sigmoid_testset = np.array(sigmoid_testset_score)
    #dtree_testset = np.array(dtree_testset_score)
    #gb_testset = np.array(gb_testset_score)

    #--- Getting indexes of values per hyper-parameter
    masks=[]
    masks_names= list(grid.best_params_.keys())
    for p_k, p_v in grid.best_params_.items():
        masks.append(list(results['param_'+p_k].data==p_v))

    params=grid.param_grid
    
    #--- Plotting results
    fig, ax = plt.subplots(1,len(params),sharex='none', sharey='all',figsize=(20,6))
    #fig = plot_figure(style_label='classic') #################
    fig.suptitle('Scores vs. Parameters')
    ax[0].set_ylabel('Sensitivity')
    #fig.text(0.04, 0.5, 'Recall', va='center', rotation='vertical')
    for i, p in enumerate(masks_names):
        # Keep only the runs where every other parameter is at its best value
        m = np.stack(masks[:i] + masks[i+1:])
        best_parms_mask = m.all(axis=0)
        best_index = np.where(best_parms_mask)[0]
        #x = np.array(params[p])
        x = np.arange(len(np.array(params[p])))
        
        y_1 = np.array(means_train[best_index])
        e_1 = np.array(stds_train[best_index])
        y_2 = np.array(means_test[best_index])
        e_2 = np.array(stds_test[best_index])
        y_3 = np.array(testset[best_index])
        
        ax[i].set_xticks(x)
        ax[i].set_xticklabels(params[p])
        al, w = 0.8, 0.2
        ax[i].set_xlabel(p.upper())
        ax[i].bar(x-w,y_1,color="#0abab5", label='Train Score', alpha=al, width=w)
        ax[i].bar(x,y_2,color="#EE82EE", label='5-fold Cross-validation Score', alpha=al, width=w)
        ax[i].bar(x+w,y_3,color='b', label='Test Score', alpha=al, width=w)
        #ax[i].fill_between(x, y_1-e_1,y_1+e_1,alpha=0.1,color="#0abab5")
        #ax[i].fill_between(x, y_2-e_2,y_2+e_2,alpha=0.1,color="#EE82EE")
        ax[i].legend(loc='lower right')

    plt.show()


    
plot_search_results(grid_knn, knn_ts_score)

In [0]:
results = grid_knn.cv_results_
means_train = results['mean_train_score']
means_train
grid_knn.best_params_.keys()
grid_knn.best_params_['algorithm']
grid_knn.best_params_['metric']
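
The `cv_results_` dictionary is often easier to read as a table. The sketch below loads it into a pandas DataFrame and sorts by rank; it runs on synthetic stand-in data, and `X_demo`, `y_demo`, `demo_grid`, `results_df`, and `summary` are hypothetical names used only for illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption; the notebook uses X_train / y_train)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_grid = GridSearchCV(
    KNeighborsClassifier(n_neighbors=2),
    param_grid={'algorithm': ['ball_tree', 'kd_tree', 'brute'],
                'metric': ['euclidean', 'manhattan']},
    cv=5,
    return_train_score=True,
)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked candidates first
results_df = pd.DataFrame(demo_grid.cv_results_)
summary = results_df[['param_algorithm', 'param_metric',
                      'mean_train_score', 'mean_test_score',
                      'rank_test_score']].sort_values('rank_test_score')
print(summary)
```

The same pattern should apply to `grid_knn.cv_results_`, since it has the same structure.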

In [0]:
# Elbow method to choose a better K value
error_rate = []

for i in range(1,40):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
    
    
plt.figure(figsize=(8,4))
plt.plot(range(1,40), error_rate, color='blue', ls= 'dashed',marker='o', markersize=5)
plt.title('Error Rate vs K Value (the number of Neighbors)')
plt.xlabel('K')
plt.ylabel('Error Rate')
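
The loop above measures the error rate on the test set. A variant that keeps the test set untouched is to estimate the error for each K with cross-validation on the training data. The following is a minimal sketch under that assumption, using synthetic stand-in data; `X_demo`, `y_demo`, `k_values`, `cv_error`, and `best_k` are hypothetical names:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption; the notebook would use X_train / y_train)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

k_values = list(range(1, 21))
cv_error = []
for k in k_values:
    # Mean 5-fold accuracy for this K; the error rate is its complement
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5)
    cv_error.append(1.0 - scores.mean())

# K with the lowest cross-validated error rate
best_k = k_values[int(np.argmin(cv_error))]
print('Best K: {}, CV error rate: {:.3f}'.format(best_k, min(cv_error)))
```

The resulting `cv_error` list can be plotted against `k_values` in the same way as `error_rate` above.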



##### Copyright 2019 The TensorFlow Authors (applies to the TensorFlow code snippets)
