<a href="https://colab.research.google.com/github/cengizmehmet/DesignPatterns/blob/main/SPEC2006/SPEC2006_Cleaner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **PREPROCESSING OF THE SPEC2006 DATASET**

**Prepared by Mehmet CENGIZ**

ORCID: 0000-0003-4972-167X

As we use this format of the SPEC2006 dataset in our studies, we modify it based on our requirements. Those who use this script are free to modify it to suit their needs. Please do not forget to cite our studies.



---



This script prepares the SPEC2006 dataset for use in our further studies. Even though the dataset was prepared carefully by practitioners, it contains many unnecessary and untidy entries. We explain every operation applied to the columns below.

In [None]:
import pandas as pd
import re
from typing import Tuple
import enum

**Versions:**

We developed this script using the following versions:


*   Python: 3.7.12
*   Pandas: 1.1.5
*   Regular expressions: 2.2.1



In [None]:
! python --version

Python 3.7.12


In [None]:
print(pd.__version__)
print(re.__version__)

1.1.5
2.2.1




---



**In our case, we access the dataset from our own Google Drive folder. Thus, in your study, you have to change the path where you access the dataset.**

In [None]:
path = '/content/drive/MyDrive/Colab Notebooks/Datasets/SPEC/SPEC2006_Original.csv' # DO NOT FORGET TO CHANGE THE PATH

In [None]:
dataset = pd.read_csv(path)



---



## **THE DATASET BEFORE PREPROCESSING**

After preprocessing, the dataset will be changed in many ways. In the section below, we present the original formats of some columns of the dataset.

**Column names:**

In [None]:
print(len(dataset.columns))
print(dataset.columns)

33
Index(['Benchmark', 'Hardware Vendor\t', 'System', 'Result', 'Baseline',
       '# Cores', '# Chips ', '# Cores Per Chip ', '# Threads Per Core',
       'Processor ', 'Processor MHz', 'Processor Characteristics',
       'CPU(s) Orderable', 'Auto Parallelization', 'Base Pointer Size',
       'Peak Pointer Size', '1st Level Cache', '2nd Level Cache',
       '3rd Level Cache', 'Other Cache', 'Memory', 'Operating System',
       'File System', 'Compiler', 'HW Avail', 'SW Avail', 'License',
       'Tested By', 'Test Sponsor', 'Test Date', 'Published', 'Updated ',
       'Disclosures'],
      dtype='object')


**Data types:**

In [None]:
dataset.dtypes

Benchmark                     object
Hardware Vendor\t             object
System                        object
Result                       float64
Baseline                     float64
# Cores                        int64
# Chips                        int64
# Cores Per Chip               int64
# Threads Per Core             int64
Processor                     object
Processor MHz                  int64
Processor Characteristics     object
CPU(s) Orderable              object
Auto Parallelization          object
Base Pointer Size             object
Peak Pointer Size             object
1st Level Cache               object
2nd Level Cache               object
3rd Level Cache               object
Other Cache                   object
Memory                        object
Operating System              object
File System                   object
Compiler                      object
HW Avail                      object
SW Avail                      object
License                        int64
T

**Shape:**

In [None]:
dataset.shape

(48381, 33)

**The dataset itself:**

Colab's data table formatter does not work well with this many columns and rows; its limit is 20 columns and 20000 rows. Those who want a tidier, interactive view of the dataset may click the wand (baton) icon in the lower-left corner.

In [None]:
dataset

Unnamed: 0,Benchmark,Hardware Vendor\t,System,Result,Baseline,# Cores,# Chips,# Cores Per Chip,# Threads Per Core,Processor,Processor MHz,Processor Characteristics,CPU(s) Orderable,Auto Parallelization,Base Pointer Size,Peak Pointer Size,1st Level Cache,2nd Level Cache,3rd Level Cache,Other Cache,Memory,Operating System,File System,Compiler,HW Avail,SW Avail,License,Tested By,Test Sponsor,Test Date,Published,Updated,Disclosures
0,CINT2006,ACTION S.A.,"ACTINA SOLAR 110 S6 (Intel Xeon E3-1220 v3, 3....",58.7,56.9,4,1,4,1,Intel Xeon E3-1220 v3,3100,Intel Turbo Boost Technology up to 3.50 GHz,1 chip,Yes,32/64-bit,32/64-bit,32 KB I + 32 KB D on chip per core,256 KB I+D on chip per core,8 MB I+D on chip per chip,,"32 GB (4 x 8 GB 2Rx8 PC3-12800E-11, ECC)","Red Hat Enterprise Linux Server release 7.1, (...",ext4,C/C++: Version 16.0.0.047 of Intel C++ Studio ...,Sep-2014,Aug-2015,9008,ACTION S.A.,ACTION S.A.,Dec-2015,Dec-2015,Dec-2015,"<A HREF=""/cpu2006/results/res2015q4/cpu2006-20..."
1,CINT2006,ACTION S.A.,"ACTINA SOLAR 202 S6 (Intel Xeon E5-2697 v3, 2....",67.5,64.4,28,2,14,1,Intel Xeon E5-2697 v3,2600,Intel Turbo Boost Technology up to 3.60 GHz,"1,2 chips",Yes,32/64-bit,32/64-bit,32 KB I + 32 KB D on chip per core,256 KB I+D on chip per core,35 MB I+D on chip per chip,,"256 GB (16 x 16 GB 2Rx4 PC4-2400P-R, running a...","Red Hat Enterprise Linux Server release 7.1, (...",ext4,C/C++: Version 16.0.0.047 of Intel C++ Studio ...,Sep-2014,Aug-2015,9008,ACTION S.A.,ACTION S.A.,Nov-2015,Dec-2015,Dec-2015,"<A HREF=""/cpu2006/results/res2015q4/cpu2006-20..."
2,CINT2006,ACTION S.A.,ACTINA SOLAR 205 S5 (Intel Xeon E5-2420),35.8,32.8,12,2,6,2,Intel Xeon E5-2420,1900,Intel Turbo Boost Technology up to 2.40 GHz,"1,2 chips",Yes,32-bit,32/64-bit,32 KB I + 32 KB D on chip per core,256 KB I+D on chip per core,15 MB I+D on chip per chip,,"96 GB (6 x 16 GB 2Rx4 PC3-12800R-11, ECC, runn...","SUSE Linux Enterprise Server 11 SP2 (x86_64), ...",ext3,C/C++: Version 12.1.0.225 of Intel C++ Studio ...,May-2012,Feb-2012,9008,ACTION S.A.,ACTION S.A.,Oct-2012,Dec-2012,Jul-2014,"<A HREF=""/cpu2006/results/res2012q4/cpu2006-20..."
3,CINT2006,ACTION S.A.,ACTINA SOLAR 210 X5 (Intel Xeon E5-2630),41.5,38.9,12,2,6,2,Intel Xeon E5-2630,2300,Intel Turbo Boost Technology up to 2.80 GHz,"1,2 chips",Yes,32-bit,32/64-bit,32 KB I + 32 KB D on chip per core,256 KB I+D on chip per core,15 MB I+D on chip per chip,,"128 GB (16 x 8 GB 2Rx4 PC3-12800R-11, ECC)","SUSE Linux Enterprise Server 11 SP2 (x86_64), ...",ext3,C/C++: Version 12.1.0.225 of Intel C++ Studio ...,Mar-2012,Feb-2012,9008,ACTION S.A.,ACTION S.A.,Oct-2012,Dec-2012,Jul-2014,"<A HREF=""/cpu2006/results/res2012q4/cpu2006-20..."
4,CINT2006,ACTION S.A.,"ACTINA SOLAR 210 X6 (Intel Xeon E5-2603 v4, 1....",35.4,34.4,12,2,6,1,Intel Xeon E5-2603 v4,1700,,"1,2 chips",Yes,32/64-bit,32/64-bit,32 KB I + 32 KB D on chip per core,256 KB I+D on chip per core,15 MB I+D on chip per chip,,"256 GB (16 x 16 GB 2Rx4 PC4-2133P-R, running a...","Red Hat Enterprise Linux Server release 7.2, (...",ext4,C/C++: Version 16.0.3.210 of Intel C++ Studio ...,Mar-2016,Mar-2016,9008,ACTION S.A.,ACTION S.A.,Sep-2016,Nov-2016,Nov-2016,"<A HREF=""/cpu2006/results/res2016q4/cpu2006-20..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48376,CFP2006rate,Wipro Limited,Wipro NetPowerZ2243/NetPowerZ2243R,258.0,252.0,12,2,6,2,Intel Xeon X5670,2933,Intel Turbo Boost Technology up to 3.33 GHz,"1,2 chips",No,64-bit,32/64-bit,32 KB I + 32 KB D on chip per core,256 KB I+D on chip per core,12 MB I+D on chip per chip,,"96 GB (12 x 8 GB 2Rx4 PC3-10600R-9, ECC)","SuSe Linux SLES10 (x86_64) SP1, Kernel, 2.6.27...",ReiserFS,"Intel C++ and Fortran Intel 64 Compiler XE, fo...",Apr-2011,May-2011,937,Wipro Limited,Wipro Limited,Jun-2011,Aug-2011,Jul-2014,"<A HREF=""/cpu2006/results/res2011q3/cpu2006-20..."
48377,CFP2006rate,YOYOtech,Fi7EPOWER MLK1610 (Intel Core i7-965),88.3,84.7,4,1,4,2,Intel Core i7-965 Extreme Edition,3733,"Intel Turbo Boost Technology disabled, clocked...",1 chip,No,32-bit,32-bit,32 KB I + 32 KB D on chip per core,256 KB I+D on chip per core,8 MB I+D on chip per chip,,"9 GB (3x 2GB and 3x 1GB Corsair DDR3-1066, 9-9...",Windows Vista Ultimate w/ SP1 (64-bit),NTFS,"Intel C++ Compiler Professional 11.0 for IA32,...",Nov-2008,Nov-2008,3772,Future Publishing Ltd.,Future Publishing Ltd.,Oct-2008,Jan-2009,Jul-2014,"<A HREF=""/cpu2006/results/res2009q1/cpu2006-20..."
48378,CFP2006rate,Yadro,"Yadro Vesnin (2.92 GHz, 40 cores, RHEL 7.4)",0.0,1500.0,40,4,10,4,IBM POWER8,2926,"Intelligent Energy Optimization enabled, up to...",1-4 chips,No,64-bit,Not Applicable,32 KB I + 64 KB D on chip per core,512 KB I+D on chip per core,8 MB I+D on chip per core,16 MB I+D off chip per 8 DIMMs,"4 TB (128 x 32 GB 2Rx4 PC4 - 2400T, running at...","Red Hat Enterprise Linux Server release 7.4, (...",xfs,C/C++: Version 13.1.5 of IBM XL C/C++ for Linu...,Dec-2017,Dec-2016,4813,Yadro,Yadro,Dec-2017,Mar-2018,Mar-2018,"<A HREF=""/cpu2006/results/res2018q1/cpu2006-20..."
48379,CFP2006rate,Yadro,"Yadro Vesnin (3.32 GHz, 32 cores, RHEL 7.2)",0.0,1380.0,32,4,8,4,IBM POWER8,3325,"Intelligent Energy Optimization enabled, up to...",1-4 chips,No,64-bit,Not Applicable,32 KB I + 64 KB D on chip per core,512 KB I+D on chip per core,8 MB I+D on chip per core,16 MB I+D off chip per 8 DIMMs,"8 TB (128 x 64 GB 4Rx4 PC4 - 2400T, running at...","Red Hat Enterprise Linux Server release 7.2, (...",xfs,C/C++: Version 13.1.5 of IBM XL C/C++ for Linu...,Dec-2017,Dec-2016,4813,Yadro,Yadro,Dec-2017,Mar-2018,Mar-2018,"<A HREF=""/cpu2006/results/res2018q1/cpu2006-20..."


## The Common Functions

Some functions below require a column name as the parameter. In order to prevent errors from passing in invalid constants, we defined an enumeration class.

In [None]:
class Column_Names(enum.Enum):
   system = "System"
   file_system = "File_System"
   first_cache = "1st_Level_Cache"
   second_cache = "2nd_Level_Cache"
   third_cache = "3rd_Level_Cache"
   other_cache = "Other_Cache"
   result = "Result"

**Method Name:** drop_empties

**Parameters:** pd.DataFrame, Column_Names, marker value (e.g. 0)

**Return:** pd.DataFrame

This function deletes the rows whose value in the specified column equals the given marker value, e.g. it deletes rows with a value of 0 in the Result column.

In [None]:
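def drop_empties(dataset: pd.DataFrame, column_name: Column_Names, empty_character: object) -> pd.DataFrame:
  # Drop, in place, every row whose value in the chosen column equals the marker (e.g. 0 in Result).
  dataset.drop(dataset.index[dataset[column_name.value] == empty_character], inplace = True)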
  return dataset
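The same row filtering can also be expressed with a boolean mask. The following is a minimal sketch on a made-up toy frame (the values and the names `toy`, `sentinel`, and `filtered` are assumptions for illustration, not part of the original script). Unlike the in-place drop above, it leaves the input frame untouched.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({"System": ["a", "b", "c"], "Result": [58.7, 0.0, 41.5]})

# Keep every row whose Result differs from the marker value.
# The original index labels are preserved, as they are with DataFrame.drop.
sentinel = 0
filtered = toy[toy["Result"] != sentinel]

print(filtered)
```

Whether to mutate the frame in place or to return a filtered copy is a design choice; the copy makes it easier to compare the data before and after cleaning.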

**Method Name:** remove_between_with_delimiter

**Parameters:** pd.Series (a dataset column), str, str

**Return:** list

This function removes all characters between two delimiters, including the delimiters themselves, from each string in a column, e.g.
> input: Test system (test info)

> delimiters: "(" and ")"

> output: Test system

In [None]:
def remove_between_with_delimiter(column: pd.Series, delimiter1: str, delimiter2: str) -> list:
  # Escape the delimiters so that characters such as "(" are matched literally.
  pattern = re.escape(delimiter1) + ".*?" + re.escape(delimiter2)
  # Remove each shortest delimiter1...delimiter2 span, then trim surrounding whitespace.
  return [re.sub(pattern, "", s).strip() for s in column]
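For comparison, the same clean-up can be sketched with pandas string methods instead of an explicit loop. The helper name `remove_between` and the sample values are assumptions made for this illustration; `re.escape` and `Series.str.replace(..., regex=True)` are the only library calls relied on.

```python
import re
import pandas as pd

def remove_between(series: pd.Series, opening: str, closing: str) -> pd.Series:
    # Hypothetical helper: build a non-greedy pattern from the escaped delimiters.
    pattern = re.escape(opening) + ".*?" + re.escape(closing)
    # Remove the matched span from every value, then trim whitespace.
    return series.str.replace(pattern, "", regex=True).str.strip()

demo = pd.Series(["Test system (test info)", "Plain name"])
cleaned = remove_between(demo, "(", ")")
print(list(cleaned))
```

This variant returns a Series rather than a list, which keeps the index aligned with the rest of the DataFrame.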

**Method Name:** remove_after_with_delimiter

**Parameters:** pd.Series (a dataset column), str

**Return:** list

This function removes all characters after a delimiter, including the delimiter itself, from each string in a column, e.g.
> input: Test system - test info

> delimiter: "-"

> output: Test system

In [None]:
def remove_after_with_delimiter(column: pd.Series, delimiter: str) -> list:
  # Keep only the text before the first occurrence of the delimiter, then trim whitespace.
  return [s.split(delimiter, 1)[0].strip() for s in column]
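A vectorized sketch of the same idea is given here, assuming a hypothetical helper named `remove_after`. Because `Series.str.split` may interpret a multi-character pattern as a regular expression, the delimiter is escaped first so that a delimiter such as "s." is matched literally.

```python
import re
import pandas as pd

def remove_after(series: pd.Series, delimiter: str) -> pd.Series:
    # Hypothetical helper: split once on the literal delimiter and keep the left part.
    return series.str.split(re.escape(delimiter), n=1).str[0].str.strip()

demo = pd.Series(["Test system - test info", "no delimiter here"])
cleaned = remove_after(demo, "-")

# A multi-character delimiter containing a regex metacharacter.
vendor = remove_after(pd.Series(["pc factory s.a."]), "s.")
print(list(cleaned), list(vendor))
```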

**Method Name:** eliminate_non_digits

**Parameters:** str

**Return:** int

This function removes all non-digit characters from a string and returns an integer. e. g.
> input: 45 kb info

> output: 45

In [None]:
def eliminate_non_digits(text: str) -> int:
  numeric_filter = filter(str.isdigit, text)
  digits = "".join(numeric_filter)
  if digits == "":
    return 0
  else:
    return int(digits)
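One caveat is worth keeping in mind: because every digit is kept, a decimal value such as "13.75 mb" collapses to 1375, and a string containing two numbers has them concatenated. The sketch below reproduces this behaviour and contrasts it with a hypothetical helper, `first_number`, which is not part of the original script and reads only the first number in the text, decimals included.

```python
import re

def digits_only(text: str) -> int:
    # Same idea as eliminate_non_digits: keep every digit, drop everything else.
    digits = "".join(filter(str.isdigit, text))
    return int(digits) if digits else 0

def first_number(text: str) -> float:
    # Hypothetical alternative: parse only the first number, with an optional fraction.
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else 0.0

merged = digits_only("13.75 mb i+d on chip per chip")
parsed = first_number("13.75 mb i+d on chip per chip")
print(merged, parsed)
```

Which behaviour is appropriate depends on the column: for values like "45 kb info" the two approaches agree, while fractional cache sizes only survive with the second one.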

**Method Name:** find_anagrams

**Parameters:** pd.DataFrame.columns

**Return:** Tuple[list, list]

This function was designed to find variants of the same string that differ only in spacing or hyphenation. The main goal is to find duplicates in columns such as "Pentium dual core e2140" and "Pentium dualcore e2140". The approach is to compare the sorted characters of each value, i.e. to look for anagrams.

In [None]:
def find_anagrams(column: pd.DataFrame.columns) -> Tuple[list, list]:
  values = list(set(column))
  checked = [False for i in range(len(values))]
  originals = []
  anagrams = []
  i = 0
  while i < len(values):
    if checked[i] == False:
      checked[i] = True
      str1 = sorted((values[i].replace("-", "")).replace(" ",""))
      j = i + 1
      while j < len(values):
        if checked[j] == False:
          str2 = sorted((values[j].replace("-", "")).replace(" ",""))
          if str1 == str2:
            checked[j] = True
            originals.append(values[i])
            anagrams.append(values[j])
            # values[j] is already marked as checked, so it needs no further update.
        j += 1
    i += 1
  return originals, anagrams
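The pairwise comparison above can also be sketched as a single pass that groups values by their sorted-character signature. The function name `group_by_signature` and the sample strings are assumptions for illustration. Note that any anagram-based check can also flag genuinely different names, for instance two model numbers that contain the same digits in a different order, so the reported groups are best reviewed by hand.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

def group_by_signature(values: Iterable[str]) -> Dict[str, List[str]]:
    # Hypothetical helper: values sharing a signature are anagrams of each other
    # once spaces and hyphens are ignored.
    groups = defaultdict(list)
    for value in sorted(set(values)):
        key = "".join(sorted(value.replace("-", "").replace(" ", "")))
        groups[key].append(value)
    # Keep only signatures shared by more than one distinct value.
    return {k: v for k, v in groups.items() if len(v) > 1}

names = ["pentium dual core e2140", "pentium dualcore e2140", "xeon e5-2630"]
dupes = group_by_signature(names)
print(list(dupes.values()))
```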

## **PRE CLEARING**

**Clearing column names**

Column names have white spaces between words. In order to access them easily in the script, we trim each name and put "_" between its words. This establishes a naming convention.

In [None]:
def clean_column_names(dataset):
  column_names = dataset.columns
  column_names = [s.strip() for s in column_names]
  column_names = [s.replace(' ', '_') for s in column_names]
  dataset.columns = column_names
  return dataset

In [None]:
dataset = clean_column_names(dataset)

In [None]:
dataset.columns

Index(['Benchmark', 'Hardware_Vendor', 'System', 'Result', 'Baseline',
       '#_Cores', '#_Chips', '#_Cores_Per_Chip', '#_Threads_Per_Core',
       'Processor', 'Processor_MHz', 'Processor_Characteristics',
       'CPU(s)_Orderable', 'Auto_Parallelization', 'Base_Pointer_Size',
       'Peak_Pointer_Size', '1st_Level_Cache', '2nd_Level_Cache',
       '3rd_Level_Cache', 'Other_Cache', 'Memory', 'Operating_System',
       'File_System', 'Compiler', 'HW_Avail', 'SW_Avail', 'License',
       'Tested_By', 'Test_Sponsor', 'Test_Date', 'Published', 'Updated',
       'Disclosures'],
      dtype='object')
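The same renaming can be sketched with the string accessor of the column index, shown here on an empty toy frame whose column names are copied from the original header (the frame itself is an assumption for illustration).

```python
import pandas as pd

# Empty toy frame with a few of the untidy original column names.
toy = pd.DataFrame(columns=["Hardware Vendor\t", "# Chips ", "Processor "])

# Trim surrounding whitespace (including the tab), then join words with "_".
toy.columns = toy.columns.str.strip().str.replace(" ", "_", regex=False)
print(list(toy.columns))
```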



---



**Clearing rows**

The dataset contains some inconclusive tests, for which the practitioners entered a Result value of 0. We start by eliminating these rows.

In [None]:
empty_char = 0
dataset = drop_empties(dataset, Column_Names.result, empty_char)



---



## **COLUMNS TO USE AS IS**

Some columns can be used as they are because their contents are quite clear. The columns are:

*   Processor MHz
*   Operating System
*   Auto Parallelization
*   Base Pointer Size
*   Peak Pointer Size
*   Compiler
*   HW Avail
*   SW Avail

The columns "Result" and "Baseline" are target attributes. Therefore we do not change them.

## **COLUMNS NOT CONSIDERED TO BE USED**

The columns

*   Benchmark
*   License
*   Tested By
*   Test Sponsor
*   Test Date
*   Published
*   Updated
*   Disclosures

are irrelevant for our further study. Therefore, we dropped them.

In [None]:
# Drop all columns that are irrelevant for our further study in a single call.
dataset = dataset.drop(['Benchmark', 'License', 'Tested_By', 'Test_Sponsor', 'Test_Date', 'Published', 'Updated', 'Disclosures'], axis = 1)

## **COLUMNS TO BE CORRECTED**

### **HARDWARE VENDOR**

Originally, there are 59 different hardware vendors. However, after we cleaned the rows containing 0 values, the number decreased to 53.

In [None]:
dataset['Hardware_Vendor'] = dataset['Hardware_Vendor'].str.lower()
dataset['Hardware_Vendor'] = dataset['Hardware_Vendor'].str.strip()
vendors = set(list(dataset['Hardware_Vendor']))
print(len(vendors))
print(sorted(vendors))

53
['acer incorporated', 'action s.a.', 'apple computer, inc.', 'asus computer international', 'asustek computer inc.', 'boxx technologies, inc.', 'bull sas', 'cisco systems', 'clearcube technology', 'clevo', 'cryo performance computing ltd', 'dell inc.', 'e4 computer engineering s.p.a.', 'format', 'fujitsu', 'fujitsu limited', 'fujitsu siemens computers', 'giga-byte technology co. ltd.', 'gigabyte technology co., ltd.', 'h3c', 'hewlett packard enterprise', 'hewlett-packard company', 'hitachi', 'huawei', 'hypertechnologies ciara, inc', 'ibm corporation', 'incom s.a.', 'inspur corporation', 'intel corporation', 'itautec', 'iwill corporation', 'lenovo global technology', 'lenovo group limited', 'm computers s.r.o.', 'maxdata ag', 'microsoft corporation', 'msi', 'nec corporation', 'nokia', 'ntt system s. a.', 'oracle corporation', 'pc factory s.a.', 'quanta cloud technology', 'quanta computer inc.', 'sgi', 'sugon', 'sun microsystems', 'supermicro', 'tyan', 'unisys corporation', 'wipro lim

The column contains duplicated vendors such as "Hewlett-Packard" and "Hewlett Packard", which we need to merge. The process is straightforward and uses no smart approach: we strip common suffixes one by one. However, the order of the calls matters in order not to lose any vendor.

In [None]:
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "computer")
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "corporation")
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "technology")
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], ",")
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "limited")
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "computing")
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "s.")
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "global")
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "group")
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "company")
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "enterprise")



> **Note:**
The Dataframe.replace() method is applied to dataframes and series, and it works only in case when the whole item of that series or dataframe coincides with the indicated value to be replaced.

>The method pandas.Series.str.replace() instead, is applied only to the series of strings, and in each item of that series, it looks for a pattern to be replaced. This pattern can coincide with the whole item, or with a part of it. This means the coinciding part will be replaced.

>In our case, pandas.Series.str.replace() is needed.
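The difference described in the note can be seen on a two-item toy series (the values are chosen for illustration only):

```python
import pandas as pd

vendors = pd.Series(["giga-byte", "hewlett-packard company"])

# Series.replace compares whole values: no item equals "-", so nothing changes.
whole = vendors.replace("-", " ")

# Series.str.replace looks inside each string and replaces the matching part.
partial = vendors.str.replace("-", " ", regex=False)

print(list(whole))
print(list(partial))
```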



In [None]:
dataset['Hardware_Vendor'] = dataset['Hardware_Vendor'].str.replace("-", " ")

In [None]:
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "tek")
dataset['Hardware_Vendor'] = dataset['Hardware_Vendor'].replace("giga byte", "gigabyte")
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "cloud")
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "incorporated")
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "inc.")
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "technologies")
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "sas")
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "performance")
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "systems")
dataset['Hardware_Vendor'] = remove_after_with_delimiter(dataset['Hardware_Vendor'], "microsystems")

As a result, the number of vendors drops to 47. Duplicates and unnecessary parts have been eliminated.

In [None]:
vendors = set(list(dataset['Hardware_Vendor']))
print(len(vendors))
print(sorted(vendors))

47
['acer', 'action', 'apple', 'asus', 'boxx', 'bull', 'cisco', 'clearcube', 'clevo', 'cryo', 'dell', 'e4', 'format', 'fujitsu', 'fujitsu siemens', 'gigabyte', 'h3c', 'hewlett packard', 'hitachi', 'huawei', 'hyper', 'ibm', 'incom', 'inspur', 'intel', 'itautec', 'iwill', 'lenovo', 'm', 'maxdata ag', 'microsoft', 'msi', 'nec', 'nokia', 'ntt system', 'oracle', 'pc factory', 'quanta', 'sgi', 'sugon', 'sun micro', 'supermicro', 'tyan', 'unisys', 'wipro', 'yoyotech', 'zte']


Two exceptions:


1.   The name of M Computers must be kept as it is.
2.   Sun Microsystems is also known as Sun. For further processing, we decided to use this shorter name.

In [None]:
# Whole-value replacement: only cells equal to "m" or "sun micro" are changed.
dataset['Hardware_Vendor'] = dataset['Hardware_Vendor'].replace({"m": "m computers", "sun micro": "sun"})

### **CORES, CHIPS, THREADS, and ORDERABLE CPU(s)**

The SPEC2006 dataset contains 5 attributes that are related to each other: "# Cores", "# Chips", "# Cores Per Chip", "# Threads Per Core", and "CPU(s) Orderable". These carry similar information, so we eliminate some of them. In addition, some of the values in these columns are inconsistent. For example, the system named "Bull Escala PL1650R+ (2200 MHz, 1 CPU)" has 1 chip and 2 cores per chip, but the total number of cores was entered as 1. Therefore, we have to check and correct each value of the number of cores.

The function below multiplies the number of chips by the number of cores per chip to find the total number of cores.

In [None]:
def correct_core_nums(chips, cores_per_chips):
  chips = list(chips)
  cores_per_chips = list(cores_per_chips)
  cores = []
  i = 0
  while i < len(chips):
    cores.append(cores_per_chips[i] * chips[i])
    i += 1
  return cores
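The same correction can be sketched as a column-wise multiplication, which also makes it easy to count how many rows were inconsistent. The toy values below are made up; the first row mirrors the inconsistency described above (1 chip, 2 cores per chip, recorded total of 1).

```python
import pandas as pd

# Toy rows (made-up values) using the cleaned column names of the dataset.
toy = pd.DataFrame({
    "#_Cores": [1, 12],
    "#_Chips": [1, 2],
    "#_Cores_Per_Chip": [2, 6],
})

# Total cores = chips x cores per chip, computed for all rows at once.
expected = toy["#_Chips"] * toy["#_Cores_Per_Chip"]

# Count the rows whose recorded total disagrees with the computed one.
mismatches = int((toy["#_Cores"] != expected).sum())

toy["#_Cores"] = expected
print(mismatches, list(toy["#_Cores"]))
```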

After calculating the total number of cores, we no longer need the number of cores per chip, so we drop that column after saving its values. The number of orderable CPUs is likewise irrelevant; that column is dropped at the end of this section.

In [None]:
cores_per_chips = dataset['#_Cores_Per_Chip']
dataset = dataset.drop(['#_Cores_Per_Chip'], axis = 1)

Below, we update the values of the number of cores.

In [None]:
dataset['#_Cores'] = correct_core_nums(dataset['#_Chips'], cores_per_chips)

Besides the total number of cores, the total number of threads is required. The function below calculates it.

In [None]:
def calculate_thread_nums(cores, threads_per_core):
  # Total threads = total cores x threads per core, computed row by row.
  cores = list(cores)
  threads_per_core = list(threads_per_core)
  threads = []
  i = 0
  while i < len(cores):
    threads.append(threads_per_core[i] * cores[i])
    i += 1
  return threads

We eliminate the number of threads per core and add a new column named #_Threads. Below, we update the values of the total number of threads.

In [None]:
threads_per_chips = dataset['#_Threads_Per_Core']
dataset = dataset.drop(['#_Threads_Per_Core'], axis = 1)
threads = calculate_thread_nums(dataset['#_Cores'], threads_per_chips)
dataset.insert(8, "#_Threads", threads, True)

Finally, as the information of the number of chips is contained in the "#_Chips" column, we drop the "CPU(s) Orderable" column. 

In [None]:
dataset = dataset.drop(['CPU(s)_Orderable'], axis = 1)

### **CACHES**

In the dataset, there are 4 types of caches: 1st Level, 2nd Level, 3rd Level, and Other. The code below converts them into numerical values.

**Method Name:** calculate_cache_size

**Parameters:** pd.DataFrame, Column_Names

**Return:** pd.DataFrame.columns

The main goal is to find the total size of each cache. A cache can hold either instructions or data; however, this distinction is irrelevant for our further studies.
This function splits the contents of the cells in the cache columns. It is designed for the 1st, 2nd, and 3rd level caches in particular. As seen in the later parts of the main script, the cell contents are cleaned first, so that relatively clean data is passed to this function. Before dropping unnecessary parts, the data looked like:

> 16 MB I+D on chip per chip, 2 MB shared / 2 cores

After cleaning:

> 16 MB I+D on chip per chip

Then, this function calculates the total size of the cache as

> 16 x 1000 x num_of_chips

Here, we assume that 1 MB equals 1000 KB to ease conversions.
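As a worked instance of the formula above, the following sketch computes the total size in KB for an already parsed entry. The helper `total_cache_kb` and its parameters are hypothetical and only illustrate the arithmetic; the string parsing itself is handled by the function of this section.

```python
def total_cache_kb(size: int, unit: str, scope: str, chips: int, cores: int) -> int:
    # Convert to KB using the 1 MB = 1000 KB convention.
    kb = size * 1000 if unit == "mb" else size
    # Multiply by the number of chips or cores, depending on the scope.
    multiplier = chips if scope == "per chip" else cores
    return kb * multiplier

# "16 mb i+d on chip per chip" on a 2-chip, 12-core system.
per_chip_total = total_cache_kb(16, "mb", "per chip", chips=2, cores=12)

# "256 kb i+d on chip per core" on the same system.
per_core_total = total_cache_kb(256, "kb", "per core", chips=2, cores=12)

print(per_chip_total, per_core_total)
```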

In [None]:
def calculate_cache_size(dataset: pd.DataFrame, column_name: Column_Names) -> pd.DataFrame.columns:
  caches = list(dataset[column_name.value])
  chips = list(dataset['#_Chips'])
  cores = list(dataset['#_Cores'])
  i = 0
  while i < len(caches):
    if caches[i] == "none":
      caches[i] = 0
    else:
      cache = 0
      parts = caches[i].split("+")
      if len(parts) == 1:
        num = eliminate_non_digits(parts[0])
        if "mb" in parts[0]:
          num = num * 1000
        if "per chip" in parts[0]:
          cache = num * chips[i]
        elif "per core" in parts[0]:
          cache = num * cores[i]
        caches[i] = cache
      elif len(parts) == 2:
        num1 = eliminate_non_digits(parts[0])
        if "mb" in parts[0]:
          num1 = num1 * 1000
        num2 = eliminate_non_digits(parts[1])
        if "mb" in parts[1]:
          num2 = num2 * 1000
        if num2 == 0:
          if "per core" in parts[1]:
            cache = num1 * cores[i]
          elif "per chip" in parts[1]:
            cache = num1 * chips[i]
          caches[i] = cache
        else:
          if parts[0].find("per core") == -1 and parts[0].find("per chip") == -1:
            cache = num1 + num2
            if "per core" in parts[1]:
              cache = cache * cores[i]
            elif "per chip" in parts[1]:
              cache = cache * chips[i]
            caches[i] = cache
          else:
            if "per core" in parts[0]:
              cache = num1 * cores[i]
            elif "per chip" in parts[0]:
              cache = num1 * chips[i]            
            if "per core" in parts[1]:
              cache += num2 * cores[i]
            elif "per chip" in parts[1]:
              cache += num2 * chips[i]
            caches[i] = cache
    i += 1
  return caches

#### **1st Level Cache**

There are 22 different types of 1st level caches. As with Hardware_Vendor, the number was higher (24) before the row-cleaning step.

In [None]:
dataset['1st_Level_Cache'] = dataset['1st_Level_Cache'].str.lower()
dataset['1st_Level_Cache'] = dataset['1st_Level_Cache'].str.strip()

In [None]:
first_level_caches = set(list(dataset['1st_Level_Cache']))
print(len(first_level_caches))

22


In [None]:
first_level_caches

{'12 k micro-ops i + 16 kb d on chip per chip',
 '12 k micro-ops i + 16 kb d on chip per core',
 '128 kb i + 128 kb d on chip per core',
 '128 kb i on chip per chip, 64 kb i shared / 2 cores; 16 kb d on chip per core',
 '128 kb i on chip per chip, 64 kb shared / 2 cores; 16 kb d on chip per core',
 '16 kb i + 16 kb d on chip per core',
 '16 kb i + 8 kb d on chip per core',
 '192 kb i on chip per chip, 64 kb i shared / 2 cores; 16 kb d on chip per core',
 '192 kb i on chip per chip, 96 kb i shared / 2 cores; 16 kb d on chip per core',
 '256 kb i on chip per chip, 64 kb i shared / 2 cores; 16 kb d on chip per core',
 '256 kb i on chip per chip, 64 kb shared / 2 cores; 16 kb d on chip per core',
 '32 kb i + 24 kb d on chip per core',
 '32 kb i + 32 kb d on chip per chip',
 '32 kb i + 32 kb d on chip per core',
 '32 kb i + 64 kb d on chip per core',
 '384 kb i on chip per chip, 64 kb i shared / 2 cores; 16 kb d on chip per core',
 '512 kb i on chip per chip, 64 kb i shared / 2 cores; 16 kb

In [None]:
dataset['1st_Level_Cache'] = remove_after_with_delimiter(dataset['1st_Level_Cache'], "/")
dataset['1st_Level_Cache'] = remove_after_with_delimiter(dataset['1st_Level_Cache'], ",")
dataset['1st_Level_Cache'] = dataset['1st_Level_Cache'].str.replace(";", " +")
dataset['1st_Level_Cache'] = dataset['1st_Level_Cache'].str.replace("12 k micro-ops i + ", "", regex=False)

In [None]:
dataset['1st_Level_Cache'] = calculate_cache_size(dataset, Column_Names.first_cache)

#### **2nd Level Cache**

There are 33 different types of 2nd level caches. As with Hardware_Vendor, the number was higher (41) before the row-cleaning step.

In [None]:
dataset['2nd_Level_Cache'] = dataset['2nd_Level_Cache'].str.lower()
dataset['2nd_Level_Cache'] = dataset['2nd_Level_Cache'].str.strip()

In [None]:
second_level_caches = set(list(dataset['2nd_Level_Cache']))
print(len(second_level_caches))

33


As is well known, 2nd level caches are larger than 1st level caches: 2nd level cache sizes are generally given in MB, whereas 1st level cache sizes are given in KB.

In [None]:
second_level_caches

{'1 mb i + 256 kb d on chip per core',
 '1 mb i+d on chip per chip',
 '1 mb i+d on chip per core',
 '1 mb i+d on chip per two cores',
 '11 mb i+d on chip per chip',
 '12 mb i+d on chip per chip',
 '12 mb i+d on chip per chip, 2 mb shared / 2 cores',
 '12 mb i+d on chip per chip, 6 mb shared / 2 cores',
 '128 kb i+d on chip per core',
 '16 mb i+d on chip per chip, 2 mb shared / 2 cores',
 '1920 kb i+d on chip per chip',
 '2 mb i on chip per chip (256 kb / 4 cores); 4 mb d on chip per chip (256 kb / 2 cores)',
 '2 mb i+d on chip per chip',
 '2 mb i+d on chip per core',
 '22 mb i+d on chip per chip',
 '24 mb i+d on chip per chip',
 '256 kb i+d on chip per core',
 '3 mb i+d on chip per chip',
 '4 mb i+d on chip per chip',
 '4 mb i+d on chip per chip, 1 mb i+d shared / 2 cores',
 '4 mb i+d on chip per chip, 2 mb shared / 2 cores',
 '4 mb i+d on chip per core',
 '5 mb i+d on chip per chip',
 '512 kb i + 256 kb d on chip per core',
 '512 kb i+d on chip per chip',
 '512 kb i+d on chip per core

In [None]:
dataset['2nd_Level_Cache'] = remove_after_with_delimiter(dataset['2nd_Level_Cache'], ",")
dataset['2nd_Level_Cache'] = remove_after_with_delimiter(dataset['2nd_Level_Cache'], "(")

In [None]:
dataset['2nd_Level_Cache'] = calculate_cache_size(dataset, Column_Names.second_cache)

#### **3rd Level Cache**

There are 56 different types of 3rd level caches. As with Hardware_Vendor, the number was higher (57) before the row-cleaning step.

In [None]:
dataset['3rd_Level_Cache'] = dataset['3rd_Level_Cache'].str.lower()
dataset['3rd_Level_Cache'] = dataset['3rd_Level_Cache'].str.strip()

In [None]:
third_level_caches = set(list(dataset['3rd_Level_Cache']))
print(len(third_level_caches))

56


In [None]:
third_level_caches

{'10 mb i+d on chip per chip',
 '10 mb i+d on chip per core',
 '11 mb i+d on chip per chip',
 '12 mb i+d on chip per chip',
 '12 mb i+d on chip per chip, 6 mb shared / 4 cores',
 '12 mb i+d on chip per chip, 6 mb shared / 6 cores',
 '12 mb i+d on chip per core',
 '13.75 mb i+d on chip per chip',
 '15 mb i+d on chip per chip',
 '16 mb i+d on chip per chip',
 '16 mb i+d on chip per chip, 8 mb shared / 2 cores',
 '16 mb i+d on chip per chip, 8 mb shared / 4 cores',
 '16 mb i+d on chip per chip, 8 mb shared / 6 cores',
 '16 mb i+d on chip per chip, 8 mb shared / 8 cores',
 '16.5 mb i+d on chip per chip',
 '18 mb i+d on chip per chip',
 '19.25 mb i+d on chip per chip',
 '2 mb i+d on chip per chip',
 '20 mb i+d on chip per chip',
 '22 mb i+d on chip per chip',
 '24 mb i+d on chip per chip',
 '24.75 mb i+d on chip per chip',
 '25 mb i+d on chip per chip',
 '27.5 mb i+d on chip per chip',
 '3 mb i+d on chip per chip',
 '30 mb i+d on chip per chip',
 '30.25 mb i+d on chip per chip',
 '32 mb i+d

In [None]:
dataset['3rd_Level_Cache'] = remove_after_with_delimiter(dataset['3rd_Level_Cache'], ",")
dataset['3rd_Level_Cache'] = remove_after_with_delimiter(dataset['3rd_Level_Cache'], "(")

In [None]:
dataset['3rd_Level_Cache'] = calculate_cache_size(dataset, Column_Names.third_cache)

#### **Other Caches**

As there are only 4 different types of other caches, we do not need a complex method.

In [None]:
other_caches = set(list(dataset['Other_Cache']))
print(len(other_caches))

4


In [None]:
other_caches

{'16 MB I+D off chip per 4 DIMMs',
 '16 MB I+D off chip per CDIMM',
 '16 MB I+D on chip per chip',
 'None'}

In [None]:
def handle_other_caches(dataset):
  cache_column = list(dataset['Other_Cache'])
  chip_column = list(dataset['#_Chips'])
  i = 0
  while i < len(cache_column):
    stripped = str(cache_column[i]).strip()
    if stripped == "None":
      cache_column[i] = 0
    else:
      cache_column[i] = 16000 * chip_column[i]
    i += 1
  return cache_column

In [None]:
dataset['Other_Cache'] = handle_other_caches(dataset)
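
The same rule can be written more compactly. The following is a minimal sketch on toy data, assuming (as in `handle_other_caches`) that every entry other than "None" stands for a 16 MB cache recorded as 16000 per chip. The helper name `other_cache_sizes` and the toy lists are hypothetical and not part of the notebook.

```python
def other_cache_sizes(caches, chips):
    """Return 0 for "None" entries and 16000 * chip count otherwise."""
    return [
        0 if str(cache).strip() == "None" else 16000 * chip_count
        for cache, chip_count in zip(caches, chips)
    ]

# Toy values in the same shape as the Other_Cache and #_Chips columns.
toy_caches = ["None", "16 MB I+D on chip per chip", " None "]
toy_chips = [2, 4, 1]
sizes = other_cache_sizes(toy_caches, toy_chips)
print(sizes)
```

Pairing the two columns with `zip` avoids the manual index counter while keeping the result identical to the loop version.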

### **SYSTEM**

In many cases, system names include both the vendor's name and the processor specs. We remove these parts because that information already has its own columns.

**Method Name:** remove_vendor_names

**Parameters:** pd.DataFrame.columns, pd.DataFrame.columns

**Return:** pd.DataFrame.columns

This function removes vendor names from system names. Only the first vendor name found in each system name is removed.

In [None]:
def remove_vendor_names(systems: pd.DataFrame.columns, vendors: pd.DataFrame.columns) -> pd.DataFrame.columns:
  vendors = list(set(list(vendors)))
  systems = list(systems)
  i = 0
  while i < len(systems):
    j = 0
    temp = systems[i].strip()
    while j < len(vendors):
      if vendors[j] in temp:
        # Search in the stripped copy so the indices match the string being sliced.
        start_index = temp.find(vendors[j])
        end_index = start_index + len(vendors[j])
        temp = temp[:start_index] + temp[end_index:]
        systems[i] = temp.strip()
        break  # only the first matching vendor is removed
      j += 1
    i += 1
  return systems
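
To illustrate what `remove_vendor_names` does, here is a small self-contained sketch for a single system name. The helper name `strip_vendor` and the toy vendor and system strings are hypothetical. Trying longer vendor names first is an assumption of this sketch to make the result deterministic; the notebook's function iterates over a set in arbitrary order.

```python
def strip_vendor(system: str, vendors: list) -> str:
    """Remove the first vendor name found in a system name."""
    system = system.strip()
    # Assumption of this sketch: prefer longer vendor names when several match.
    for vendor in sorted(vendors, key=len, reverse=True):
        if vendor and vendor in system:
            start = system.find(vendor)
            return (system[:start] + system[start + len(vendor):]).strip()
    return system

toy_vendors = ["dell inc.", "fujitsu"]
toy_systems = [
    "dell inc. poweredge r230",
    "fujitsu primergy rx600 s3",
    "altos r380 f4",
]
cleaned = [strip_vendor(name, toy_vendors) for name in toy_systems]
print(cleaned)
```

Names that contain no known vendor are returned unchanged apart from the surrounding whitespace.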

In [None]:
dataset['System'] = dataset['System'].str.lower()
dataset['System'] = dataset['System'].str.strip()

In [None]:
dataset['System'] = remove_after_with_delimiter(dataset['System'], "(")
dataset['System'] = remove_after_with_delimiter(dataset['System'], ",")
dataset['System'] = remove_after_with_delimiter(dataset['System'], "amd")
dataset['System'] = dataset['System'].str.replace("motherboard", "")
dataset['System'] = dataset['System'].str.replace("-", " ")
dataset['System'] = dataset['System'].str.replace("processor", "")

In [None]:
dataset['System'] = remove_vendor_names(dataset['System'], dataset['Hardware_Vendor'])

In [None]:
dataset['System'] = dataset['System'].str.replace(r"2\.60 ghz\)", "", regex=True)  # raw string avoids the invalid "\)" escape

In [None]:
dataset['System'] = dataset['System'].str.strip()

In [None]:
dataset = drop_empties(dataset, Column_Names.system, "")

Before eliminating duplicates, there are 1705 different system names in the dataset.

In [None]:
systems = set(list(dataset['System']))
print(len(systems))

1705


After eliminating redundancy from the dataset, we have to handle typos. Some system names were entered in different ways, such as "yr190b8228" and "yr190 b8228".

To solve this problem, we designed a method that finds anagrams, as mentioned earlier. However, this cannot solve the problem completely, because there are genuine anagrams in the dataset, such as "poweredge r230" and "poweredge r320".

Still, the anagrams narrow the search space down: the anagram finder reports 47 anagram pairs.
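
The idea behind an anagram search can be sketched as follows. This is an illustrative stand-in, not the notebook's `find_anagrams` (which is defined earlier): the function name `anagram_pairs` is hypothetical, and ignoring spaces is an assumption inferred from pairs such as "yr190b8228" and "yr190 b8228" being reported.

```python
from collections import defaultdict
from itertools import combinations

def anagram_pairs(names):
    """Pair up distinct names that consist of exactly the same characters."""
    groups = defaultdict(list)
    for name in sorted(set(names)):
        # Assumption: spaces are ignored, so spacing typos share a key.
        key = "".join(sorted(name.replace(" ", "")))
        groups[key].append(name)
    pairs = []
    for members in groups.values():
        pairs.extend(combinations(members, 2))
    return pairs

toy = ["yr190b8228", "yr190 b8228", "poweredge r230", "poweredge r320", "ucs c240 m5"]
pairs = anagram_pairs(toy)
print(pairs)
```

Both the typo pair and the genuine anagram pair are reported, which is why the candidates still need a manual check.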

In [None]:
original_systems, system_anagrams = find_anagrams(dataset['System'])
print(len(system_anagrams))

47


We then eliminate duplicated and mistyped entries by selecting them manually from the list below.

In [None]:
for original, anagram in zip(original_systems, system_anagrams):
  print(original + " - " + anagram)

thinkserver rd120 - thinkserver rd210
i610 g20 - i620 g10
fire x4200m2 - fire x4200 m2
primergy tx1310 m3 - primergy tx1330 m1
yr190b8228 - yr190 b8228
altos r380 f4 - altos r480 f3
ucs c240m5 - ucs c240 m5
proliant dl785 g5 - proliant dl585 g7
blade sba 7142g t4 - blade sba7142g t4
fire x6440 - fire x4640
actina solar 402 a2 - actina solar 420 a2
primergy rx600 s3 - primergy rx300 s6
ucs b200m4 - ucs b200 m4
ucs b420 m4 - ucs b440 m2
actina solar 202 s5 - actina solar 220 s5
tecal rh5885 v2 - tecal  rh5885 v2
express5800/r110i 1 - express5800/110ri 1
primergy rx2530m4 - primergy rx2530 m4
aw2000 aw370h f3 - aw2000h f3 aw370
express5800/r110h 1 - express5800/110rh 1
poweredge r230 - poweredge r320
system x3400 m3 - system x3300 m4
express5800/gt110e - express5800/t110g e
system x3690 x5 - system x3950 x6
ucs c420 m3 - ucs c240 m3
actina solar 202 x2 - actina solar 220 x2
thinkserver rd540 - thinkserver rd450
primergy tx300 s6 - primergy tx600 s3
thinksystem sr590 - thinksystem sr950
sy

As a result, we manually corrected the 12 system entries below.

In [None]:
dataset['System'] = dataset['System'].replace("yr190b8228", "yr190 b8228")
dataset['System'] = dataset['System'].replace("novascale r480e1", "novascale r480 e1")
dataset['System'] = dataset['System'].replace("tecal  rh5885 v2", "tecal rh5885 v2")
dataset['System'] = dataset['System'].replace("ucs c240m5", "ucs c240 m5")
dataset['System'] = dataset['System'].replace("fire x4200m2", "fire x4200 m2")
dataset['System'] = dataset['System'].replace("ucs b200m4", "ucs b200 m4")
dataset['System'] = dataset['System'].replace("blade sba7142g t4", "blade sba 7142g t4")
dataset['System'] = dataset['System'].replace("poweredgefc640", "poweredge fc640")
dataset['System'] = dataset['System'].replace("primergy rx2530m4", "primergy rx2530 m4")
dataset['System'] = dataset['System'].replace("superserver1028gr tr", "superserver 1028gr tr")
dataset['System'] = dataset['System'].replace("ar 780 f3", "ar780 f3")
dataset['System'] = dataset['System'].replace("proliant dl580 gen 9", "proliant dl580 gen9")
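
All of the corrections above fix a difference in spacing only. Assuming that this is the typical kind of typo, candidate pairs could be pre-filtered before the manual review with a check like the one sketched below. The helper name `is_spacing_variant` is hypothetical and not part of the notebook.

```python
def is_spacing_variant(first: str, second: str) -> bool:
    """True when two names are identical once all whitespace is removed."""
    return "".join(first.split()) == "".join(second.split())

candidates = [
    ("yr190b8228", "yr190 b8228"),
    ("tecal rh5885 v2", "tecal  rh5885 v2"),
    ("poweredge r230", "poweredge r320"),
]
flags = [is_spacing_variant(a, b) for a, b in candidates]
print(flags)
```

Pairs flagged `True` are likely typos of the same system, while genuine anagrams such as the PowerEdge models are flagged `False`.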

### **PROCESSORS**

There are 1022 different processors in the dataset. However, as with the System column, we need to check whether typos have produced duplicated processor names.

In [None]:
dataset['Processor'] = dataset['Processor'].str.lower()
dataset['Processor'] = dataset['Processor'].str.strip()

In [None]:
dataset['Processor'] = dataset['Processor'].str.replace("intel", "")
dataset['Processor'] = dataset['Processor'].str.replace("amd", "")

In [None]:
processors = set(list(dataset['Processor']))
print(len(processors))

1022


First of all, the "-" characters must be removed from the data.

In [None]:
dataset['Processor'] = dataset['Processor'].str.replace("-", "")

In [None]:
original_processors, processor_anagrams = find_anagrams(dataset['Processor'])
print(len(processor_anagrams))

76


In [None]:
for original, anagram in zip(original_processors, processor_anagrams):
  print(original + " - " + anagram)

 core 2 duo e7600 -  core 2 duo e6700
 xeon e51630 v4 -  xeon e54610 v3
 pentium dualcore e2160 -  pentium dual core e2160
 xeon x6550 -  xeon x5650
 xeon x6550 -  xeon x5560
 xeon e5640 -  xeon e6540
 opteron 8374 he -  opteron 8347 he
 opteron 2218 -  opteron 8212
 xeon e52658 v3 -  xeon e52685 v3
 xeon x7542 -  xeon x5472
 opteron 4122 -  opteron 2214
 xeon e52623 v4 -  xeon e52643 v2
 xeon e54620 -  xeon e52640
 xeon e52620 v3 -  xeon e52630 v2
 xeon e52620 v3 -  xeon e52603 v2
 xeon e7430 -  xeon e7340
 xeon e52603 v3 -  xeon e52630 v3
 xeon e52403 -  xeon e52430
 xeon e52403 v2 -  xeon e52430 v2
 xeon e74807 -  xeon e74870
 xeon e31220 v3 -  xeon e31230 v2
 xeon e54603 v2 -  xeon e54620 v3
 xeon e54603 v2 -  xeon e52630 v4
 xeon e54603 v2 -  xeon e52640 v3
 xeon e54603 v2 -  xeon e52603 v4
 xeon e74830 v2 -  xeon e74820 v3
 xeon e52603 -  xeon e52630
 opteron 8216 -  opteron 6128
 xeon e52690 v3 -  xeon e52609 v3
 opteron 8378 -  opteron 8387
 xeon e52650 v4 -  xeon e54650 v2
 xe

In [None]:
dataset['Processor'] = dataset['Processor'].str.replace("pentium dualcore e2160", "pentium dual core e2160")
dataset['Processor'] = dataset['Processor'].str.replace("pentium dualcore e2140", "pentium dual core e2140")

The Processor_Characteristics column is not relevant for our further studies, so we drop it as well.

In [None]:
dataset = dataset.drop(['Processor_Characteristics'], axis = 1)

### **MEMORY**

In the dataset, the Memory column contains not only the size of the system memory but also the combination of memory components, such as "32 GB (4 x 8 GB 2Rx8 PC3-12800E-11, ECC)". As we only need the total size of the memory, we remove everything from the first parenthesis or comma onwards.
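
As a rough illustration of this step on a single value, the sketch below uses plain string methods. The notebook itself relies on `remove_after_with_delimiter`, defined earlier; the helper name `total_memory` is hypothetical.

```python
def total_memory(entry: str) -> str:
    """Keep only the text before the first "(" or ","."""
    entry = entry.lower().strip()
    for delimiter in ("(", ","):
        # split(..., 1)[0] returns the whole string when the delimiter is absent.
        entry = entry.split(delimiter, 1)[0]
    return entry.strip()

example = total_memory("32 GB (4 x 8 GB 2Rx8 PC3-12800E-11, ECC)")
print(example)
```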

In [None]:
dataset['Memory'] = dataset['Memory'].str.lower()
dataset['Memory'] = dataset['Memory'].str.strip()

In [None]:
dataset['Memory'] = remove_after_with_delimiter(dataset['Memory'], "(")
dataset['Memory'] = remove_after_with_delimiter(dataset['Memory'], ",")
dataset['Memory'] = dataset['Memory'].str.strip()

In [None]:
memories = set(list(dataset['Memory']))
print(len(memories))

43


In [None]:
memories

{'1 gb',
 '1 tb',
 '112 gb',
 '1152 gb',
 '12 gb',
 '120 gb',
 '128 gb',
 '1536 gb',
 '16 gb',
 '16 tb',
 '160 gb',
 '192 gb',
 '2 gb',
 '2 tb',
 '208 gb',
 '24 gb',
 '240 gb',
 '256 gb',
 '284 gb',
 '3 tb',
 '304 gb',
 '32 gb',
 '36 gb',
 '384 gb',
 '4 gb',
 '4 tb',
 '400 gb',
 '48 gb',
 '500 gb',
 '512 gb',
 '6 gb',
 '64 gb',
 '640 gb',
 '704 gb',
 '72 gb',
 '768 gb',
 '8 gb',
 '8 tb',
 '8320 gb',
 '8576 gb',
 '9 gb',
 '96 gb',
 '976 gb'}

After eliminating the non-essential parts, we convert TB values to GB. As with cache sizes, we use the thousand-based convention: 1 TB = 1000 GB.
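
The conversion rule can be sketched for a single value as follows. This is a standalone illustration, not the notebook's implementation: the helper name `memory_in_gb` is hypothetical, and parsing with a regular expression and raising an error for unexpected units are design choices of this sketch.

```python
import re

def memory_in_gb(entry: str) -> float:
    """Convert strings such as "512 gb" or "2 tb" to a size in GB."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(gb|tb)\s*", entry.lower())
    if match is None:
        raise ValueError(f"unrecognised memory size: {entry!r}")
    value, unit = float(match.group(1)), match.group(2)
    # Thousand-based convention: 1 TB = 1000 GB.
    return value * 1000 if unit == "tb" else value

sizes = [memory_in_gb(s) for s in ["512 gb", "2 tb", "16 tb"]]
print(sizes)
```

Failing loudly on an unknown unit makes it harder for a malformed entry to be silently treated as terabytes.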

In [None]:
def convert_TB_to_GB(column: pd.DataFrame.columns) -> pd.DataFrame.columns:
  column = list(column)
  i = 0
  while i < len(column):
    # Extract the numeric part once; entries without "gb" are treated as TB.
    size = eliminate_non_digits(column[i])
    column[i] = size if "gb" in column[i] else 1000 * size
    i += 1
  return column

In [None]:
dataset['Memory'] = convert_TB_to_GB(dataset['Memory'])

### **FILE SYSTEM**

Before applying the following steps, there were 27 different types of file systems. After eliminating duplicates, we reduced this number to 20.

In [None]:
dataset['File_System'] = dataset['File_System'].str.lower()
dataset['File_System'] = dataset['File_System'].str.strip()

In [None]:
file_sys = set(list(dataset['File_System']))
print(len(file_sys))

27


In [None]:
file_sys

{'(see additional details below)',
 'aix/jfs2',
 'benchmark on ufs, os on zfs',
 'btfs',
 'btrfs',
 'ext2',
 'ext3',
 'ext4',
 'hfs+',
 'lustre v1.6.7 over ddr infiniband',
 'nfsv3',
 'nfsv3 (see additional details below)',
 'nfsv3 ipoib',
 'nfsv3, ipoib',
 'nfsv4',
 'ntfs',
 'os on zfs, benchmark on ufs',
 'reiserfs',
 'tmpfs',
 'tmpfs (output_root was used to put run directories in /tmp/cpu2006) zfs',
 'ufs',
 'vxfs',
 'xfs',
 'zfs',
 'zfs (with raidz)',
 'zfs and tmpfs',
 'zfs with gzip compression'}

In [None]:
dataset['File_System'] = remove_between_with_delimiter(dataset['File_System'], "(", ")")

In [None]:
dataset = drop_empties(dataset, Column_Names.file_system, "")

In [None]:
dataset['File_System'] = dataset['File_System'].str.replace(",", "")
dataset['File_System'] = dataset['File_System'].str.replace("/", " ")
dataset['File_System'] = remove_after_with_delimiter(dataset['File_System'], "with")
dataset['File_System'] = remove_after_with_delimiter(dataset['File_System'], "v1.")
dataset['File_System'] = dataset['File_System'].replace("tmpfs  zfs", "tmpfs zfs")
dataset['File_System'] = dataset['File_System'].replace("zfs and tmpfs", "tmpfs zfs")
dataset['File_System'] = dataset['File_System'].replace("os on zfs benchmark on ufs", "ufs zfs")
dataset['File_System'] = dataset['File_System'].replace("benchmark on ufs os on zfs", "ufs zfs")
dataset['File_System'] = dataset['File_System'].str.replace(" ", " + ")
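
The effect of these steps can be checked on the 27 raw values listed above. The sketch below re-implements the normalization with the standard library only, so it is an approximation of the notebook's helpers rather than the helpers themselves: the names `RAW`, `ALIASES` and `normalize` are hypothetical, and collapsing repeated whitespace is a simplification used here instead of replacing "tmpfs  zfs" explicitly.

```python
import re

RAW = [
    "(see additional details below)", "aix/jfs2", "benchmark on ufs, os on zfs",
    "btfs", "btrfs", "ext2", "ext3", "ext4", "hfs+",
    "lustre v1.6.7 over ddr infiniband", "nfsv3",
    "nfsv3 (see additional details below)", "nfsv3 ipoib", "nfsv3, ipoib",
    "nfsv4", "ntfs", "os on zfs, benchmark on ufs", "reiserfs", "tmpfs",
    "tmpfs (output_root was used to put run directories in /tmp/cpu2006) zfs",
    "ufs", "vxfs", "xfs", "zfs", "zfs (with raidz)", "zfs and tmpfs",
    "zfs with gzip compression",
]

# Different spellings of the same combination map to one canonical form.
ALIASES = {
    "zfs and tmpfs": "tmpfs zfs",
    "os on zfs benchmark on ufs": "ufs zfs",
    "benchmark on ufs os on zfs": "ufs zfs",
}

def normalize(name: str) -> str:
    name = re.sub(r"\(.*?\)", "", name)        # drop parenthesised notes
    name = name.replace(",", "").replace("/", " ")
    name = name.split("with")[0]               # drop "with ..." suffixes
    name = name.split("v1.")[0]                # drop version details
    name = " ".join(name.split())              # trim and collapse whitespace
    name = ALIASES.get(name, name)
    return name.replace(" ", " + ")

# Empty results correspond to rows removed by drop_empties in the notebook.
distinct = {normalize(name) for name in RAW} - {""}
print(len(distinct))
print(sorted(distinct))
```

On these values the sketch yields 20 distinct file system labels, with combinations written as, for example, "tmpfs + zfs".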

## **EXPORTING THE DATASET**

**In our case, we export the modified dataset to our own Google Drive folder. Thus, in your study, you have to change the path where you export the dataset.**

In [None]:
dataset.to_excel(r'/content/drive/MyDrive/Colab Notebooks/Datasets/SPEC/SPEC2006_modified.xlsx', index = False, header = True)