<a href="https://colab.research.google.com/github/cengizmehmet/BenchmarkNets/blob/main/models/SPEC_CNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CONVOLUTIONAL NEURAL NETWORKS**

**Prepared by Mehmet CENGIZ**

ORCID: 0000-0003-4972-167X

This notebook builds CNNs trained on the SPEC CPU2017 dataset. The source of the dataset is the website of the Standard Performance Evaluation Corporation ([SPEC](https://www.spec.org/cpu2017/results/)). The version of the dataset modified to fit our requirements is available in the [data](https://github.com/cengizmehmet/BenchmarkNets/tree/main/data) folder of this repository. Anyone using this notebook is free to modify it to suit their needs.



---



## **NECESSARY DEPENDENCIES AND LIBRARIES**

In [None]:
import pandas as pd
import numpy as np
from numpy.random import seed
from typing import Tuple, List
from tensorflow import keras
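from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv1D, Flatten
from sklearn.metrics import (r2_score, mean_squared_error, mean_absolute_error,
                             explained_variance_score, mean_pinball_loss)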
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import gaussian_kde

**Versions:** The library versions below are those provided by Google Colab when the models were first designed (around the end of 2022). Newer environments may ship different versions.

* tensorflow: 2.12.0
* pandas: 1.5.3
* numpy: 1.22.4
* seaborn: 0.12.2
* sklearn-pandas: 2.2.0



## **NECESSARY FUNCTIONS**

**Method Name:** CNN

**Parameters:** Tuple, list, list, list, list, list

**Return:** keras.Sequential

This function builds a Convolutional Neural Network. *input_shape* defines the shape of the input layer. *filters_in_convs* holds the number of filters (kernels) in each convolutional layer, so its length sets the number of convolutional layers. *kernels_sizes* and *strides_sizes* hold the kernel size and the stride of each convolutional layer. *neurons_in_denses* holds the number of neurons in each dense layer, so its length sets the number of dense layers. *activations* holds the activation function of each layer.

As this is a regression problem, the output layer has one neuron with a linear activation function. Finally, this function returns a CNN.

In [None]:
def CNN(input_shape: Tuple, filters_in_convs: List, kernels_sizes: List, strides_sizes: List,
                      neurons_in_denses: List, activations: List) -> keras.Sequential:
  #activations holds one entry per convolutional layer followed by one entry per dense layer
  conv_count = len(filters_in_convs)
  model = Sequential()
  #The first convolutional layer also defines the input shape
  model.add(Conv1D(filters = filters_in_convs[0], kernel_size = kernels_sizes[0], strides = strides_sizes[0],
                   activation = activations[0], input_shape = input_shape))
  for i in range(1, conv_count):
    model.add(Conv1D(filters = filters_in_convs[i], kernel_size = kernels_sizes[i],
                     strides = strides_sizes[i], activation = activations[i]))
  model.add(Flatten())
  #Dense layers use the activations that follow those of the convolutional layers
  for i, units in enumerate(neurons_in_denses):
    model.add(Dense(units = units, activation = activations[conv_count + i]))
  #Output layer for regression
  model.add(Dense(units = 1, activation = 'linear'))
  return model
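
To get a feel for how the convolutional parameters affect what the `Flatten` layer receives, the sketch below computes the output length of stacked `Conv1D` layers. It assumes the Keras default `padding='valid'`, which is what the `CNN` function uses since it does not set `padding`. The helper names `conv1d_output_length` and `flattened_size` and the input length of 20 are hypothetical and not part of the notebook.

```python
from typing import List

def conv1d_output_length(input_length: int, kernel_size: int, stride: int) -> int:
  #Output length of a 1D convolution without zero padding ('valid')
  return (input_length - kernel_size) // stride + 1

def flattened_size(input_length: int, filters: List[int], kernels: List[int], strides: List[int]) -> int:
  #Each layer shortens the sequence; the last layer's filter count sets the channel count
  length = input_length
  for kernel_size, stride in zip(kernels, strides):
    length = conv1d_output_length(length, kernel_size, stride)
  return length * filters[-1]

#Hypothetical input with 20 independent columns and the two-layer setup used later in this notebook
print(flattened_size(20, [512, 128], [3, 3], [1, 1]))  # 16 * 128 = 2048
```

With a kernel size of 3 and a stride of 1, each convolutional layer shortens the sequence by 2, so the number of independent columns must be large enough to survive all layers.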

**Method Name:** correlation

**Parameters:** str, float, bool, Tuple, bool, bool

**Return:** set

This function calculates the pairwise correlation of the columns of the global *dataset* variable and returns the columns whose absolute correlation with an earlier column exceeds *threshold*.

* *corr_method* holds the correlation method. The correlation method may be kendall, spearman, or pearson. The default value is 'kendall'.
* *threshold* defines the acceptable correlation range. The default value is 0.7.
* *show* defines whether the matrix is drawn. The default value is True.
* *fig_dims* holds the size of the figure of the correlation matrix. The default value is (12, 8).
* *isCbar* defines whether the colour bar is added. In some cases the figure of the correlation matrix does not fit the screen and the colour bar overlaps the matrix; disabling *isCbar* avoids the overlap. The default value is True.
* *save* defines whether the figure of the matrix is saved. The default value is False.

In [None]:
def correlation(corr_method: str = 'kendall', threshold: float = 0.7, show: bool = True, fig_dims: Tuple = (12, 8),
                isCbar: bool = True, save: bool = False) -> set:
  corr_features = set()
  corr_matrix = dataset.corr(method = corr_method)
  for i in range(len(corr_matrix.columns)):
    for j in range(i):
      if abs(corr_matrix.iloc[i, j]) > threshold:
        colname = corr_matrix.columns[i]
        corr_features.add(colname)
  if show:
    fig, ax = plt.subplots(figsize = fig_dims)
    sns.heatmap(corr_matrix, ax = ax, annot = True, cmap = plt.cm.CMRmap_r, cbar = isCbar)
    if save:
      plt.savefig("png_format.png", dpi = 300, format = "png")
      plt.savefig("tiff_format.tiff", dpi = 300, format = "tiff")
    plt.show()
  return corr_features
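
The selection rule can be tried on a toy table. The sketch below is a variation for illustration, not the notebook's function: it takes the DataFrame as an argument instead of reading the global *dataset*, and it uses 'pearson' so that it runs without SciPy. The name `correlated_columns` and the toy values are hypothetical.

```python
import pandas as pd

def correlated_columns(df: pd.DataFrame, method: str = 'pearson', threshold: float = 0.7) -> set:
  corr_matrix = df.corr(method = method)
  selected = set()
  for i in range(len(corr_matrix.columns)):
    for j in range(i): #Only the lower triangle is visited, so each pair is checked once
      if abs(corr_matrix.iloc[i, j]) > threshold:
        selected.add(corr_matrix.columns[i]) #The later column of the pair is selected
  return selected

#Made-up data: b is a multiple of a, c is uncorrelated with both
toy = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [2, 4, 6, 8, 10], 'c': [1, -1, 0, -1, 1]})
print(correlated_columns(toy))  # {'b'}
```

Because only the later column of each correlated pair is selected, the column order of the DataFrame decides which of the two is kept.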

**Method Name:** split_dataset

**Parameters:** pd.DataFrame, str, float, bool

**Return:** Tuple of eight np.ndarray

This function splits the dataset and returns each part.

* *dataset* is the dataset to be split.
* *target_column* holds the name of the target column.
* *split_ratio* defines the ratio of the training and test sets. The default value is 0.8 which means 80% of the dataset is split as the training dataset.
* *shuffle* allows the dataset to be shuffled. The default value is True.

This function returns, in this order:
* The training dataset
* The independent columns of the training dataset
* The target column of the training dataset
* The test dataset
* The independent columns of the test dataset
* The target column of the test dataset
* All independent columns as one array
* The target column as one array

In [None]:
def split_dataset(dataset: pd.DataFrame, target_column: str, split_ratio: float = 0.8,
                  shuffle: bool = True) -> Tuple[np.ndarray, ...]:
  target_index = dataset.columns.get_loc(target_column)
  data = np.array(dataset)
  rows = data.shape[0]
  if shuffle:
    np.random.shuffle(data) #Rows are shuffled in place
  train_size = int(split_ratio * rows)
  train = data[:train_size]
  test = data[train_size:]
  #The target column is separated from the independent columns
  y_train = train[:, target_index]
  X_train = np.delete(train, obj = target_index, axis = 1)
  y_test = test[:, target_index]
  X_test = np.delete(test, obj = target_index, axis = 1)
  y = data[:, target_index]
  X = np.delete(data, obj = target_index, axis = 1)
  return train, X_train, y_train, test, X_test, y_test, X, y
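
The shuffle-then-slice idea can be checked on a small array. The sketch below is a reduced variation that assumes a local NumPy generator instead of the global random state; the name `split_rows`, the `rng_seed` parameter, and the toy array are hypothetical.

```python
import numpy as np

def split_rows(data: np.ndarray, target_index: int, split_ratio: float = 0.8, rng_seed: int = 1):
  rng = np.random.default_rng(rng_seed)
  shuffled = rng.permutation(data) #Shuffled copy; each row stays intact
  train_size = int(split_ratio * len(shuffled))
  train, test = shuffled[:train_size], shuffled[train_size:]
  X_train = np.delete(train, target_index, axis = 1)
  y_train = train[:, target_index]
  X_test = np.delete(test, target_index, axis = 1)
  y_test = test[:, target_index]
  return X_train, y_train, X_test, y_test

#Made-up data: 10 rows, 3 columns, the last column acts as the target
toy = np.arange(30).reshape(10, 3)
X_train, y_train, X_test, y_test = split_rows(toy, target_index = 2)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)  # (8, 2) (8,) (2, 2) (2,)
```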

**Method Name:** factorize_columns

**Parameters:** pd.DataFrame, str

**Return:** pd.DataFrame, dictionary

This function factorizes every column whose data type differs from that of the target column, replacing its values with integer codes. The details of the factorization process are [here](https://pandas.pydata.org/docs/reference/api/pandas.factorize.html).

* *dataset* is the dataset to be factorized.
* *target* holds the name of the target column.

This function returns both the converted dataset and the labels of each converted column.

In [None]:
def factorize_columns(dataset: pd.DataFrame, target: str) -> Tuple[pd.DataFrame, dict]:
  all_labels = {}
  for column in dataset.columns:
    if dataset[column].dtypes != dataset[target].dtypes:
      dataset[column], labels = pd.factorize(dataset[column])
      all_labels[column] = labels
  return dataset, all_labels
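
A short illustration of what `pd.factorize` produces for one column, and how the stored labels can map the integer codes back to the original values. The column values are made up for the example.

```python
import pandas as pd

vendors = pd.Series(['Intel', 'AMD', 'Intel', 'Ampere']) #Made-up values
codes, uniques = pd.factorize(vendors)
print(list(codes))    # [0, 1, 0, 2]
print(list(uniques))  # ['Intel', 'AMD', 'Ampere']

#The labels restore the original values from the codes
decoded = uniques[codes]
print(list(decoded))  # ['Intel', 'AMD', 'Intel', 'Ampere']
```

Codes are assigned in order of first appearance, so they carry no ranking among the categories.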

## **PREPROCESS**

The variables of the dataset:

In [None]:
path = 'https://raw.githubusercontent.com/cengizmehmet/BenchmarkNets/main/data/SPEC2017_modified.csv'
dataset = pd.read_csv(path)
target = "Base_Result" #Target column

In [None]:
attributes = dataset.columns
rows, columns = dataset.shape
dtypes = dataset.dtypes

print(attributes)
print("-----")
print("Rows: " + str(rows) + "\n" + "Columns: " + str(columns))
print("-----")
print(dtypes)

The variables related to the correlation analysis:

In [None]:
corr_method = 'kendall'
threshold = 0.7 #Correlation values above 0.7 or below -0.7 are treated as high correlation
show = True #This defines whether the correlation matrix is drawn
fig_dims = (12, 8) #This defines the size of the figure in case show is True
isCbar = False #This defines whether the colour bar is added as legend
save = False #This defines whether the correlation matrix is saved

In the following cell, two columns are dropped intentionally. The dataset contains two target column candidates: Peak Result and Base Result. Since we pick Base Result as the target column, Peak Result is dropped. The Disclosures column contains the HTML output of the benchmark of each system. It essentially repeats the information in the other columns, so it is dropped as well.

In [None]:
dataset = dataset.drop(['Peak_Result'], axis = 1)
dataset = dataset.drop(['Disclosures'], axis = 1)

In [None]:
dataset, labels = factorize_columns(dataset, target)

In [None]:
correlated_features = correlation(corr_method = corr_method, threshold = threshold, show = show,
                                  fig_dims = fig_dims, isCbar = isCbar, save = save)

In [None]:
print(correlated_features)

In [None]:
dataset = dataset.drop(correlated_features, axis = 1) #Correlated columns are dropped

After dropping the correlated columns, their labels must also be dropped.

In [None]:
for label in correlated_features:
    labels.pop(label, None)

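A seed is used to make the randomness reproducible. The seed below is NumPy's, so it fixes the shuffle performed by *split_dataset*; it does not seed TensorFlow's weight initialisation.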

In [None]:
seed(1)
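
The effect of the seed can be verified with a small check: resetting NumPy's global seed to the same value reproduces the same permutation. This is a minimal sketch using only NumPy; the variable names are illustrative.

```python
import numpy as np
from numpy.random import seed

seed(1)
first = np.random.permutation(10)
seed(1)
second = np.random.permutation(10)
print(np.array_equal(first, second))  # True
```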

The variables related to the split dataset:

In [None]:
m, n = dataset.shape
split = 0.8
shuffle = True

In [None]:
train, X_train, y_train, test, X_test, y_test, X, y = split_dataset(dataset = dataset, target_column = target, split_ratio = split, shuffle = shuffle)

## **MODEL**

### **Initialisation**

The parameters of the CNN:

In [None]:
input_shape = ((n - 1), 1)
filters_in_convs = [512, 128]
kernels_sizes = [3] * len(filters_in_convs)
strides_sizes = [1] * len(filters_in_convs)
neurons_in_denses = [512, 256, 128, 64, 32, 16]
activations = ['relu'] * (len(filters_in_convs) + len(neurons_in_denses))
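
`Conv1D` layers work on samples of shape (steps, channels), which is why *input_shape* carries a trailing 1. The sketch below shows one cautious way to make that channel axis explicit on a 2D feature matrix. Whether Keras adds the axis automatically may depend on the version, so this is an optional step and not part of the original notebook; `X_toy` and its dimensions are hypothetical.

```python
import numpy as np

X_toy = np.zeros((5, 12)) #Hypothetical: 5 samples with 12 independent columns
X_toy_3d = X_toy[..., np.newaxis] #A channel axis of size 1 is appended
print(X_toy_3d.shape)  # (5, 12, 1)

#The per-sample shape matches the form of input_shape used above
input_shape_toy = X_toy_3d.shape[1:]
print(input_shape_toy)  # (12, 1)
```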

In [None]:
model = CNN(input_shape, filters_in_convs, kernels_sizes, strides_sizes, neurons_in_denses, activations)

In [None]:
model.summary()

### **Training**

The training parameters:

In [None]:
loss = "mean_absolute_error"
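learning_rate = 0.001
opt = keras.optimizers.Adam(learning_rate = learning_rate) #The learning rate is passed to the optimizer explicitly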
metrics = ["mean_absolute_error", "mean_squared_error", "mean_absolute_percentage_error", "mean_squared_logarithmic_error", "logcosh"]
epochs = 5
batch_size = 10
validation_split = 0.2
verbose = 1 #0 = silent, 1 = progress bar, 2 = one line per epoch

In [None]:
model.compile(
        loss = loss,
        optimizer = opt,
        metrics = metrics
        )

In [None]:
history = model.fit(
      X_train,
      y_train,
      epochs = epochs,
      batch_size = batch_size,
      verbose = verbose,
      validation_split = validation_split
      )

Performance of the training phase:

In [None]:
#The training results are stored in a dictionary
results_dict = {}
for key in history.history.keys():
  results_dict[key] = history.history[key]

In [None]:
#The average of each metric over all epochs is presented
for key in results_dict:
  print(str(key) + '= ' + str(sum(results_dict[key]) / len(results_dict[key])))
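
An average over all epochs mixes the early, poorly trained epochs with the later ones, so it can be useful to also look at the last epoch and at the epoch with the lowest validation loss. The sketch below does this on a dictionary shaped like `history.history`; the name `history_toy` and all values are made up for illustration.

```python
#Made-up values in the same layout as history.history
history_toy = {'loss': [0.9, 0.6, 0.5, 0.45], 'val_loss': [1.0, 0.7, 0.72, 0.8]}

final_values = {key: values[-1] for key, values in history_toy.items()}
print(final_values)  # {'loss': 0.45, 'val_loss': 0.8}

#Epochs are counted from 1, as in the training log
val_losses = history_toy['val_loss']
best_epoch = val_losses.index(min(val_losses)) + 1
print(best_epoch)  # 2
```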

Plotting the results of the training phase:

In [None]:
keys = list(results_dict.keys())
size = len(keys) // 2 #The first half holds the training metrics, the second half their validation counterparts
eps = range(1, epochs + 1)
for i in range(size):
  plt.plot(eps, results_dict[keys[i]], 'b', label = keys[i])
  plt.plot(eps, results_dict[keys[i + size]], 'r', label = keys[i + size])
  plt.xlabel('Epochs')
  plt.ylabel(keys[i])
  plt.legend()
  plt.show()

### **Test**

In [None]:
preds = model.predict(X_test)

Evaluation of the test:

In [None]:
#Metrics
  #R2
r2_value = r2_score(y_test, preds)
print("R2 = " + str(r2_value))

  #MSE
mse = mean_squared_error(y_test, preds)
print("MSE = " + str(mse))

  #RMSE (computed from the MSE, as the squared argument is deprecated in recent scikit-learn versions)
rmse = np.sqrt(mse)
print("RMSE = " + str(rmse))

  #MAE
mae = mean_absolute_error(y_test, preds)
print("MAE = " + str(mae))

  #Explained Variance Score
evs = explained_variance_score(y_test, preds)
print("EVS = " + str(evs))

  #Mean Pinball Loss
mpl = mean_pinball_loss(y_test, preds)
print("MPL = " + str(mpl))
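
For reference, the metrics above can be reproduced with plain NumPy. The sketch below follows the standard definitions and assumes one-dimensional arrays and a pinball quantile of 0.5, which is the default of `mean_pinball_loss`; the arrays hold made-up values.

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0]) #Made-up values
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)
#R2 compares the squared errors with the spread of the actual values
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
#Explained variance ignores a constant offset in the errors
evs = 1 - np.var(errors) / np.var(y_true)
#Pinball loss with alpha = 0.5 equals half of the MAE
alpha = 0.5
pinball = np.mean(np.where(errors >= 0, alpha * errors, (alpha - 1) * errors))

print(mae, mse, pinball)  # 0.5 0.375 0.25
```

Since `model.predict` returns a two-dimensional array, flattening the predictions first keeps such element-wise formulas from broadcasting into a matrix.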

Plotting the results of the test phase:

In [None]:
#Scatter plot
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(y_test, preds, c='crimson')
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'b-')
ax.set_xlabel('Actuals')
ax.set_ylabel('Predictions')
plt.show()

In [None]:
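#Scatter plot coloured by point density
xy = np.vstack([y_test, preds.flatten()])
z = gaussian_kde(xy)(xy)
idx = z.argsort() #Points are sorted so that the densest ones are drawn last
actuals, predictions, density = y_test[idx], preds.flatten()[idx], z[idx]
fig, ax = plt.subplots(figsize=(12, 8))
ax.set_xlabel('Actuals')
ax.set_ylabel('Predictions')
cax = ax.scatter(actuals, predictions, c=density, s=50)
fig.colorbar(cax)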
plt.show()