### CPR Tech Sharing - Jing Li - 04/26/2018
### Demo the use of Naive Bayes classifier with oversimplified RCA (Naive Bayes is a family of ML algorithms for supervised learning)

- Problem statement: we have a data set containing CPU and memory utilization statistics along with the server status (good or bad) at each observation. We want to use machine learning to predict the server status for a future observation of CPU and memory utilization.
- X contains the variables (features): an array of [CPU, memory] utilization pairs, in percent.
- Y contains the corresponding class labels: an array of server statuses, encoded in the code as 0 for 'Good' and 1 for 'Bad'.
- For example, the first data point is labeled 'Good', with CPU utilization at 40% and memory utilization at 80%. The 4th data point has CPU utilization at 99% and memory utilization at 20%, and is labeled 'Bad'.
- So, X and Y are the statistics we collected over time. We want to use machine learning to predict/classify the server status (Y) based on the server's key performance indicators (X).
- We pick a machine learning algorithm (model) called Gaussian Naive Bayes and train it on the collected data so that it learns from past experience.
- We then use the trained model to predict/classify the server status for a new observation of the key performance indicators.
- In the first example, we predict that the server status is 'Good' when CPU utilization = 47% and memory utilization = 82%.
- In the second example, we predict that the server status is 'Bad' when CPU utilization = 60% and memory utilization = 20%.
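
To make the idea behind Gaussian Naive Bayes concrete, here is a minimal pure-Python sketch of the computation: for each class it estimates a prior plus a per-feature mean and variance, then picks the class with the highest log posterior. The helper names (`fit_gaussian_nb`, `log_posterior`, `predict`) are made up for this illustration, and the library implementation may differ in numerical details. The data and the two new observations are the ones used in the notebook's code ([47, 82] and [60, 20]).

```python
import math

# Training data copied from the notebook: [CPU %, memory %], labels 0 = Good, 1 = Bad
X = [[40, 80], [50, 85], [45, 84], [99, 20], [50, 80], [55, 83],
     [40, 81], [53, 85], [46, 84], [95, 25], [51, 87], [52, 80]]
Y = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]


def fit_gaussian_nb(X, Y):
    """Estimate the prior, per-feature mean and per-feature variance of each class."""
    model = {}
    for label in sorted(set(Y)):
        rows = [x for x, y in zip(X, Y) if y == label]
        n = float(len(rows))
        n_features = len(rows[0])
        means = [sum(r[j] for r in rows) / n for j in range(n_features)]
        variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n
                     for j in range(n_features)]
        model[label] = (n / len(Y), means, variances)
    return model


def log_posterior(model, label, x):
    """Unnormalized log P(label | x): log prior plus per-feature Gaussian log densities."""
    prior, means, variances = model[label]
    total = math.log(prior)
    for value, mean, var in zip(x, means, variances):
        total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
    return total


def predict(model, x):
    """Return the class label with the highest log posterior."""
    return max(model, key=lambda label: log_posterior(model, label, x))


model = fit_gaussian_nb(X, Y)
print(predict(model, [47, 82]))  # close to the Good cluster
print(predict(model, [60, 20]))  # low memory value matches the Bad cluster
```

The "naive" part is the sum inside `log_posterior`: each feature contributes independently, as if CPU and memory utilization were unrelated once the class is known.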


In [2]:
# load numpy library
import numpy as np
# construct training data set
X = np.array([[40, 80], [50, 85], [45, 84], [99, 20], [50, 80], [55, 83], [40, 81], [53, 85], [46, 84], [95, 25], [51, 87], [52, 80]])
# Y = np.array(['Good', 'Good', 'Good', 'Bad', 'Good', 'Good', 'Good', 'Good', 'Good', 'Bad', 'Good', 'Good'])
Y = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0])

# load Naive Bayes 
from sklearn.naive_bayes import GaussianNB

# Create a model
clf = GaussianNB()

# Train the model with the data set
clf.fit(X, Y)

# Print model accuracy (measured on the same data used for training)
print("Model accuracy is " + str(clf.score(X, Y)))

# Create a new observation of X
newX1 = [[47,82]]

# Predict server status with CPU and Memory statistics
newY1 = clf.predict(newX1)

# Create a new observation of X
newX2 = [[60,20]]

# Predict server status with CPU and Memory statistics
newY2 = clf.predict(newX2)

# Print out the predictions (0 = Good, 1 = Bad)
print("Naive Bayes predicts that the server status with CPU at " + str(newX1[0][0]) + "% and Memory at " + str(newX1[0][1]) + "% utilization is " + str(newY1[0]))
print("Naive Bayes predicts that the server status with CPU at " + str(newX2[0][0]) + "% and Memory at " + str(newX2[0][1]) + "% utilization is " + str(newY2[0]))


In [3]:
### draw the decision boundary with the test points overlaid
import warnings
warnings.filterwarnings("ignore")

import matplotlib
import matplotlib.pyplot as plt

x_min = 0.0; x_max = 100.0
y_min = 0.0; y_max = 100.0

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
h = .1  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# the training data (X, Y) have both "Good" and "Bad" points mixed
# in together--separate them so we can give them different colors in the scatterplot,
# and visually identify them
cpu_good = [X[ii][0] for ii in range(0, len(X)) if Y[ii]==0]
mem_good = [X[ii][1] for ii in range(0, len(X)) if Y[ii]==0]
cpu_bad = [X[ii][0] for ii in range(0, len(X)) if Y[ii]==1]
mem_bad = [X[ii][1] for ii in range(0, len(X)) if Y[ii]==1]

# Plot also the test points
cpu_good_t = newX1[0][0]
mem_good_t = newX1[0][1]
cpu_bad_t = newX2[0][0]
mem_bad_t = newX2[0][1]


# Put the result into a color plot
# Plot with training data set
plt.title('Server Status')
Z = Z.reshape(xx.shape)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.pcolormesh(xx, yy, Z, cmap='Pastel2')
plt.scatter(cpu_good, mem_good, color = "g", label="Train Good")
plt.scatter(cpu_bad, mem_bad, color = "r", label="Train Bad")
plt.scatter(cpu_good_t, mem_good_t, color = "g", marker = 'v', label="Test Good")
plt.scatter(cpu_bad_t, mem_bad_t, color = "r", marker = 'v', label="Test Bad")
plt.legend()
plt.xlabel("CPU")
plt.ylabel("Memory")
plt.show()

# Plot with testing data set
plt.title('Server Status with Testing Data')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.pcolormesh(xx, yy, Z, cmap='Pastel2', edgecolor='face')
plt.scatter(cpu_good_t, mem_good_t, color = "g", marker = 'v', label="Good")
plt.scatter(cpu_bad_t, mem_bad_t, color = "r", marker = 'v', label="Bad")
plt.legend()
plt.xlabel("CPU")
plt.ylabel("Memory")
plt.show()

## Training an algorithm (Naive Bayes) to predict server performance
- Generate data to simulate server performance (CPU, Memory -> Good or Bad)
- Split the data into a train set and a test set (3 to 1 ratio)
- Train the Naive Bayes algorithm (an ML classifier) with the training data
- Predict server performance (good or bad) using the remaining 25% of the data
- Evaluate model performance: calculate metrics such as recall, precision and accuracy

In [5]:
import numpy as np
import pylab as pl
import random

# Generate data set
# Seed the random number generator so that the results can be reproduced
random.seed(42)
n_points=1000
cpu = [random.random() for ii in range(0,n_points)]
mem = [random.random() for ii in range(0,n_points)]
error = [random.random() for ii in range(0,n_points)]
y = [round(cpu[ii]*mem[ii]+0.3+0.1*error[ii]) for ii in range(0,n_points)]
for ii in range(0, len(y)):
    if cpu[ii]>0.9 or mem[ii]>0.99:
        y[ii] = 1.0
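
The labeling rule in the cell above can be restated as a small function, which makes it easier to see when a simulated server is marked Bad. This is only a sketch: `label_server` is a hypothetical helper, not part of the notebook, and it returns integers where the notebook stores a mix of rounded values and `1.0`.

```python
def label_server(cpu, mem, error):
    """Hypothetical helper restating the notebook's labeling rule: 1 = Bad, 0 = Good."""
    # Hard override: very high CPU (> 0.9) or memory (> 0.99) is always Bad
    if cpu > 0.9 or mem > 0.99:
        return 1
    # Otherwise round the noisy score to the nearest integer (0 or 1)
    score = cpu * mem + 0.3 + 0.1 * error
    return int(round(score))


print(label_server(0.2, 0.3, 0.5))   # score 0.41 -> 0 (Good)
print(label_server(0.6, 0.7, 0.5))   # score 0.77 -> 1 (Bad)
print(label_server(0.95, 0.1, 0.0))  # CPU override -> 1 (Bad)
```

Because of the `+0.3` offset, the score rounds up to 1 once `cpu*mem + 0.1*error` reaches roughly 0.2, so the Bad region is the part of the CPU/memory square where both utilizations are moderately high, plus the high-CPU and high-memory strips.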



In [6]:

# split the data into train/test sets
# Here we want 3/4 as train data and 1/4 as test data
X = [[gg, ss] for gg, ss in zip(cpu, mem)]
split = int(0.75*n_points)
features_train = X[0:split]
features_test  = X[split:]
labels_train = y[0:split]
labels_test  = y[split:]
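
Slicing the first 75% of the rows works here because every point was generated independently at random, so row order carries no information. If the rows were ordered (for example, all 'Good' rows first), a plain slice would give unrepresentative sets. A minimal sketch of a shuffled split using only the standard library follows; `shuffled_split` and the toy data are hypothetical and only illustrate the idea.

```python
import random


def shuffled_split(features, labels, train_fraction=0.75, seed=42):
    """Hypothetical helper: shuffle row indices, then split features and labels together."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    cut = int(train_fraction * len(indices))
    train_idx, test_idx = indices[:cut], indices[cut:]
    return ([features[i] for i in train_idx], [features[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])


# Toy data standing in for the notebook's X and y
toy_X = [[i, 2 * i] for i in range(8)]
toy_y = [i % 2 for i in range(8)]
f_train, f_test, l_train, l_test = shuffled_split(toy_X, toy_y)
print("train size: %d, test size: %d" % (len(f_train), len(f_test)))
```

Shuffling indices rather than the lists themselves keeps each feature row paired with its label.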


In [7]:

# Use Naive Bayes classifier to model the problem
# Here we train the model on the training set and predict labels for the test set
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)


In [8]:

# One important aspect of ML modeling is evaluating the model
# Calculate accuracy
# Accuracy = (true positive + true negative) / (total population)
accuracy = clf.score(features_train, labels_train)
print('Model accuracy with training data is ' + str(accuracy))
accuracy = clf.score(features_test, labels_test)
print('Model accuracy with testing data is ' + str(accuracy))

# Produce a confusion matrix to see how good the model is in various aspects
from sklearn.metrics import confusion_matrix
cf = confusion_matrix(labels_test, pred)
print("\nConfusion matrix:")
print(cf)
# For binary labels 0/1, ravel() returns the four counts in this order
tn, fp, fn, tp = cf.ravel()
print("\nTrue Negative = " + str(tn))
print("\nFalse Positive = " + str(fp))
print("\nFalse Negative = " + str(fn))
print("\nTrue Positive = " + str(tp))

# Calculate precision, recall and f1-score
# precision: tp / (tp + fp) => true alarms relative to all alarms raised
# recall: tp / (tp + fn) => true alarms relative to all Bad conditions
# In this case, we are more concerned about missed alarms, so we want high recall
# f1-score: harmonic mean of precision and recall (2/(1/precision + 1/recall))
# support: number of true samples of each class label
# avg: support-weighted average across all classes (this is a binary case) - sum(precision*support)/sum(support)
from sklearn.metrics import classification_report
print("\nClassification Report:")
print(classification_report(labels_test, pred))
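
The formulas in the comments above can be checked by hand. The sketch below counts the four confusion-matrix cells directly and derives accuracy, precision, recall and f1 from them, treating label 1 (Bad) as the positive class. `binary_metrics` is a hypothetical helper, and the labels are made up for illustration; they are not results from this notebook.

```python
def binary_metrics(y_true, y_pred):
    """Hypothetical helper: confusion-matrix counts and metrics with 1 (Bad) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / float(len(pairs))
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn, "accuracy": accuracy,
            "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: three Bad servers, of which the model catches only one
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0]
m = binary_metrics(y_true, y_pred)
print("accuracy=%.3f precision=%.3f recall=%.3f f1=%.3f"
      % (m["accuracy"], m["precision"], m["recall"], m["f1"]))
```

In this toy case accuracy is 0.625 while recall is only about 0.333: two of the three Bad servers were missed, which is exactly the kind of failure that accuracy alone can hide.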


In [9]:
# We want to visualize our results
# the decision boundary is the boundary separating the classes
# draw the decision boundary with the test points overlaid
import warnings
warnings.filterwarnings("ignore")

import matplotlib
import matplotlib.pyplot as plt
#matplotlib.pyplot.switch_backend('agg')

x_min = 0.0; x_max = 1.0
y_min = 0.0; y_max = 1.0

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
h = .01  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# the training data (features_train, labels_train) have both "Good" and "Bad" points mixed
# in together--separate them so we can give them different colors in the scatterplot,
# and visually identify them
cpu_good = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii]==0]
mem_good = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii]==0]
cpu_bad = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii]==1]
mem_bad = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii]==1]

# Plot also the test points
cpu_good_t = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]==0]
mem_good_t = [features_test[ii][1] for ii in range(0, len(features_test)) if labels_test[ii]==0]
cpu_bad_t = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]==1]
mem_bad_t = [features_test[ii][1] for ii in range(0, len(features_test)) if labels_test[ii]==1]


# Put the result into a color plot
# Plot with training data set
plt.title('Server Performance with Training Data')
Z = Z.reshape(xx.shape)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.pcolormesh(xx, yy, Z, cmap='Pastel2')
plt.scatter(cpu_good, mem_good, color = "g", label="Good")
plt.scatter(cpu_bad, mem_bad, color = "r", label="Bad")
plt.legend()
plt.xlabel("CPU")
plt.ylabel("Memory")
plt.show()

# Plot with testing data set
plt.title('Server Performance with Testing Data')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.pcolormesh(xx, yy, Z, cmap='Pastel2', edgecolor='face')
plt.scatter(cpu_good_t, mem_good_t, color = "g", marker = 'v', label="Good")
plt.scatter(cpu_bad_t, mem_bad_t, color = "r", marker = 'v',  label="Bad")
plt.legend()
plt.xlabel("CPU")
plt.ylabel("Memory")
plt.show()

#plt.savefig("/opt/data/share01/jl2408/test.png")



## Training an SVM algorithm to predict server performance
- Generate data to simulate server performance (CPU, Memory -> Good or Bad)
- Split the data into a train set and a test set (3 to 1 ratio)
- Train the Support Vector Machine algorithm (an ML classifier) with the training data
- Predict server performance (good or bad) using the remaining 25% of the data
- Evaluate model performance: calculate metrics such as recall, precision and accuracy

In [11]:
import numpy as np
import pylab as pl
import random

# Generate data set
# Seed the random number generator so that the results can be reproduced
random.seed(42)
n_points=1000
cpu = [random.random() for ii in range(0,n_points)]
mem = [random.random() for ii in range(0,n_points)]
error = [random.random() for ii in range(0,n_points)]
y = [round(cpu[ii]*mem[ii]+0.3+0.1*error[ii]) for ii in range(0,n_points)]
for ii in range(0, len(y)):
    if cpu[ii]>0.9 or mem[ii]>0.99:
        y[ii] = 1.0

# split into train/test sets
X = [[gg, ss] for gg, ss in zip(cpu, mem)]
split = int(0.75*n_points)
features_train = X[0:split]
features_test  = X[split:]
labels_train = y[0:split]
labels_test  = y[split:]

# Use Support Vector Machine classifier to model the problem
# Here we can tune the model with kernel, gamma and C
# rbf: Radial Basis Function - a ML kernel
# Data points contain a pattern plus stochastic noise. The goal of machine
# learning is to model the pattern and ignore the noise. Whenever an algorithm
# fits the noise in addition to the pattern, it is overfitting.
# Adjust gamma and C to avoid underfitting or overfitting.
from sklearn.svm import SVC
clf = SVC(kernel='rbf', gamma = 1., C=10000.)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

# Calculate accuracy
accuracy = clf.score(features_train, labels_train)
print('Model accuracy with training data is ' + str(accuracy))
accuracy = clf.score(features_test, labels_test)
print('Model accuracy with testing data is ' + str(accuracy))


### draw the decision boundary with the test points overlaid
import warnings
warnings.filterwarnings("ignore")

import matplotlib
import matplotlib.pyplot as plt
#matplotlib.pyplot.switch_backend('agg')

x_min = 0.0; x_max = 1.0
y_min = 0.0; y_max = 1.0

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
h = .01  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# the training data (features_train, labels_train) have both "Good" and "Bad" points mixed
# in together--separate them so we can give them different colors in the scatterplot,
# and visually identify them
cpu_good = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii]==0]
mem_good = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii]==0]
cpu_bad = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii]==1]
mem_bad = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii]==1]

# Plot also the test points
cpu_good_t = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]==0]
mem_good_t = [features_test[ii][1] for ii in range(0, len(features_test)) if labels_test[ii]==0]
cpu_bad_t = [features_test[ii][0] for ii in range(0, len(features_test)) if labels_test[ii]==1]
mem_bad_t = [features_test[ii][1] for ii in range(0, len(features_test)) if labels_test[ii]==1]


# Put the result into a color plot
# Plot with training data set
plt.title('Server Performance with Training Data')
Z = Z.reshape(xx.shape)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.pcolormesh(xx, yy, Z, cmap='Pastel2')
plt.scatter(cpu_good, mem_good, color = "g", label="Good")
plt.scatter(cpu_bad, mem_bad, color = "r", label="Bad")
plt.legend()
plt.xlabel("CPU")
plt.ylabel("Memory")
plt.show()

# Plot with testing data set
plt.title('Server Performance with Testing Data')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.pcolormesh(xx, yy, Z, cmap='Pastel2', edgecolor='face')
plt.scatter(cpu_good_t, mem_good_t, color = "g", marker = 'v', label="Good")
plt.scatter(cpu_bad_t, mem_bad_t, color = "r", marker = 'v',  label="Bad")
plt.legend()
plt.xlabel("CPU")
plt.ylabel("Memory")
plt.show()

#plt.savefig("/opt/data/share01/jl2408/test.png")
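
To give some intuition for the `gamma` parameter passed to `SVC` above: the RBF kernel scores the similarity of two points as `exp(-gamma * squared_distance)`. The sketch below evaluates that formula for one pair of points at several `gamma` values. `rbf_similarity` is a hypothetical helper written for this illustration, and the two points are arbitrary.

```python
import math


def rbf_similarity(a, b, gamma):
    """Hypothetical helper: RBF kernel value exp(-gamma * ||a - b||^2)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


# Two arbitrary (CPU, memory) points on the notebook's 0..1 scale
p = [0.2, 0.3]
q = [0.5, 0.7]  # squared distance to p is about 0.25

for gamma in (0.1, 1.0, 10.0, 100.0):
    print("gamma=%g similarity=%.4f" % (gamma, rbf_similarity(p, q, gamma)))
```

As `gamma` grows, the similarity between the same two points drops toward zero, so each training point influences a smaller neighbourhood and the decision boundary can bend more tightly around individual points. Combined with a very large `C`, which penalizes training errors heavily, this is the direction in which overfitting becomes more likely.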