# Predicting CPU performance using Linear Regression

This notebook predicts CPU ERP (estimated relative performance) with Spark MLlib linear regression, using attributes such as CPU cycle time and memory as inputs. The dataset is downloaded from the UCI Machine Learning Repository. It contains 209 instances and 10 attributes.
https://archive.ics.uci.edu/ml/datasets/Computer+Hardware

In [94]:
#Initializing pyspark session
import findspark
findspark.init('/home/ubuntu/spark-2.2.1-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cpu_performance').getOrCreate()

In [95]:
# The file has no header row, so Spark assigns default names _c0 ... _c9
data = spark.read.csv('Computerhardware.txt', inferSchema=True, header=False)
data.columns

['_c0', '_c1', '_c2', '_c3', '_c4', '_c5', '_c6', '_c7', '_c8', '_c9']

Attribute Information:

1. Vendor name: 30 
(adviser, amdahl,apollo, basf, bti, burroughs, c.r.d, cambex, cdc, dec, 
dg, formation, four-phase, gould, honeywell, hp, ibm, ipl, magnuson, 
microdata, nas, ncr, nixdorf, perkin-elmer, prime, siemens, sperry, 
sratus, wang) 
2. Model Name: Many unique symbols 
3. MYCT: Machine cycle time in nanoseconds (integer) 
4. MMIN: Minimum main memory in kilobytes (integer) 
5. MMAX: Maximum main memory in kilobytes (integer) 
6. CACH: Cache memory in kilobytes (integer) 
7. CHMIN: Minimum channels in units (integer) 
8. CHMAX: Maximum channels in units (integer) 
9. PRP: Published relative performance (integer) 
10. ERP: Estimated relative performance from the original article (integer)

Renaming the columns to match the dataset description file.

In [96]:
data = data.selectExpr("_c0 as vendor", "_c1 as model",'_c2 as MYCT','_c3 as MMIN','_c4 as MMAX','_c5 as CACH','_c6 as CHMIN','_c7 as CHMAX','_c8 as PRP','_c9 as ERP')
data.show()
data.printSchema()
data.columns
data=data.na.drop()

+---------+--------+----+-----+-----+----+-----+-----+----+----+
|   vendor|   model|MYCT| MMIN| MMAX|CACH|CHMIN|CHMAX| PRP| ERP|
+---------+--------+----+-----+-----+----+-----+-----+----+----+
|  adviser|   32/60| 125|  256| 6000| 256|   16|  128| 198| 199|
|   amdahl|  470v/7|  29| 8000|32000|  32|    8|   32| 269| 253|
|   amdahl| 470v/7a|  29| 8000|32000|  32|    8|   32| 220| 253|
|   amdahl| 470v/7b|  29| 8000|32000|  32|    8|   32| 172| 253|
|   amdahl| 470v/7c|  29| 8000|16000|  32|    8|   16| 132| 132|
|   amdahl|  470v/b|  26| 8000|32000|  64|    8|   32| 318| 290|
|   amdahl|580-5840|  23|16000|32000|  64|   16|   32| 367| 381|
|   amdahl|580-5850|  23|16000|32000|  64|   16|   32| 489| 381|
|   amdahl|580-5860|  23|16000|64000|  64|   16|   32| 636| 749|
|   amdahl|580-5880|  23|32000|64000| 128|   32|   64|1144|1238|
|   apollo|   dn320| 400| 1000| 3000|   0|    1|    2|  38|  23|
|   apollo|   dn420| 400|  512| 3500|   4|    1|    6|  40|  24|
|     basf|    7/65|  60|

The dataset schema shows two string features (vendor and model), which need to be converted to a numeric representation for modeling. In PySpark, this can be done with StringIndexer and OneHotEncoder.
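To make the two transformations concrete, here is a minimal pure-Python sketch of what they do conceptually. It is not Spark's implementation. The helper names `string_index` and `one_hot` are hypothetical, and the alphabetical tie-break is an assumption made only to keep the toy output deterministic. The sketch mirrors two documented behaviours: StringIndexer gives the most frequent label index 0, and OneHotEncoder drops the last category by default, which is why 209 distinct model names produce vectors of size 208.

```python
from collections import Counter

def string_index(values):
    """Map each label to a float index, most frequent label first.

    Ties are broken alphabetically here (an assumption for determinism).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if int(index) < size:
        vector[int(index)] = 1.0
    return vector

# Toy vendor column (illustrative values only)
vendors = ["amdahl", "amdahl", "amdahl", "apollo", "apollo", "adviser"]
vendor_to_index = string_index(vendors)
encoded = [one_hot(vendor_to_index[v], len(vendor_to_index)) for v in vendors]

for vendor, vector in zip(vendors, encoded):
    print(vendor, vendor_to_index[vendor], vector)
```

In the toy data the rarest vendor receives the highest index and, because the last category is dropped, an all-zero vector.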

In [97]:
from pyspark.ml.feature import (VectorAssembler,VectorIndexer,OneHotEncoder,StringIndexer)
vendor_indexer = StringIndexer(inputCol = 'vendor', outputCol = 'vendorIndex')
model_indexer = StringIndexer(inputCol = 'model', outputCol = 'modelIndex')


In [98]:
vendor_Encoder = OneHotEncoder(inputCol = 'vendorIndex', outputCol = 'vendorVec')
model_Encoder = OneHotEncoder(inputCol = 'modelIndex', outputCol = 'modelVec')

In [99]:
modell=  model_indexer.fit(data)
indexed=modell.transform(data)
encoded = model_Encoder.transform(indexed)
encoded.show()

+---------+--------+----+-----+-----+----+-----+-----+----+----+----------+-----------------+
|   vendor|   model|MYCT| MMIN| MMAX|CACH|CHMIN|CHMAX| PRP| ERP|modelIndex|         modelVec|
+---------+--------+----+-----+-----+----+-----+-----+----+----+----------+-----------------+
|  adviser|   32/60| 125|  256| 6000| 256|   16|  128| 198| 199|     181.0|(208,[181],[1.0])|
|   amdahl|  470v/7|  29| 8000|32000|  32|    8|   32| 269| 253|     131.0|(208,[131],[1.0])|
|   amdahl| 470v/7a|  29| 8000|32000|  32|    8|   32| 220| 253|      51.0| (208,[51],[1.0])|
|   amdahl| 470v/7b|  29| 8000|32000|  32|    8|   32| 172| 253|     196.0|(208,[196],[1.0])|
|   amdahl| 470v/7c|  29| 8000|16000|  32|    8|   16| 132| 132|      43.0| (208,[43],[1.0])|
|   amdahl|  470v/b|  26| 8000|32000|  64|    8|   32| 318| 290|      29.0| (208,[29],[1.0])|
|   amdahl|580-5840|  23|16000|32000|  64|   16|   32| 367| 381|      89.0| (208,[89],[1.0])|
|   amdahl|580-5850|  23|16000|32000|  64|   16|   32| 489| 

In [100]:
model2=  vendor_indexer.fit(encoded)
indexed=model2.transform(encoded)
encoded2 = vendor_Encoder.transform(indexed)
encoded2.show()

+---------+--------+----+-----+-----+----+-----+-----+----+----+----------+-----------------+-----------+---------------+
|   vendor|   model|MYCT| MMIN| MMAX|CACH|CHMIN|CHMAX| PRP| ERP|modelIndex|         modelVec|vendorIndex|      vendorVec|
+---------+--------+----+-----+-----+----+-----+-----+----+----+----------+-----------------+-----------+---------------+
|  adviser|   32/60| 125|  256| 6000| 256|   16|  128| 198| 199|     181.0|(208,[181],[1.0])|       28.0|(29,[28],[1.0])|
|   amdahl|  470v/7|  29| 8000|32000|  32|    8|   32| 269| 253|     131.0|(208,[131],[1.0])|        6.0| (29,[6],[1.0])|
|   amdahl| 470v/7a|  29| 8000|32000|  32|    8|   32| 220| 253|      51.0| (208,[51],[1.0])|        6.0| (29,[6],[1.0])|
|   amdahl| 470v/7b|  29| 8000|32000|  32|    8|   32| 172| 253|     196.0|(208,[196],[1.0])|        6.0| (29,[6],[1.0])|
|   amdahl| 470v/7c|  29| 8000|16000|  32|    8|   16| 132| 132|      43.0| (208,[43],[1.0])|        6.0| (29,[6],[1.0])|
|   amdahl|  470v/b|  26

Spark ML expects its input in the form of a label column and a single feature-vector column. The steps below assemble the encoded and numeric columns into one feature vector.
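As a rough illustration of how the assembled vector is laid out, the sketch below concatenates a 29-slot vendor vector, a 208-slot model vector, and the seven numeric columns into one sparse `(size, indices, values)` triple. The function `assemble` is a hypothetical helper, not part of Spark, and it assumes the inputs are joined in the order they are listed. The sizes 29 and 208 are taken from the encoded outputs above.

```python
def assemble(vendor_index, model_index, numeric, n_vendor=29, n_model=208):
    """Concatenate two one-hot blocks and a numeric block into a sparse triple."""
    size = n_vendor + n_model + len(numeric)
    indices, values = [], []
    # Vendor one-hot block occupies positions 0 .. n_vendor - 1
    if vendor_index < n_vendor:
        indices.append(vendor_index)
        values.append(1.0)
    # Model one-hot block follows, shifted by the vendor block size
    if model_index < n_model:
        indices.append(n_vendor + model_index)
        values.append(1.0)
    # Numeric columns come last; zeros are left implicit in a sparse vector
    offset = n_vendor + n_model
    for position, value in enumerate(numeric):
        if value != 0:
            indices.append(offset + position)
            values.append(float(value))
    return size, indices, values

# First row of the dataset: vendorIndex 28, modelIndex 181,
# then MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX, PRP
size, indices, values = assemble(28, 181, [125, 256, 6000, 256, 16, 128, 198])
print(size, indices, values)
```

Under these assumptions the vector has 29 + 208 + 7 = 244 slots, with the model's one-hot entry at position 29 + 181 = 210 and the numeric columns starting at position 237.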

In [101]:
assembler = VectorAssembler(inputCols=['vendorVec','modelVec','MYCT', 'MMIN', 'MMAX', 'CACH', 'CHMIN', 'CHMAX', 'PRP'],outputCol='features')


In [102]:
output = assembler.transform(encoded2)
output.show()

+---------+--------+----+-----+-----+----+-----+-----+----+----+----------+-----------------+-----------+---------------+--------------------+
|   vendor|   model|MYCT| MMIN| MMAX|CACH|CHMIN|CHMAX| PRP| ERP|modelIndex|         modelVec|vendorIndex|      vendorVec|            features|
+---------+--------+----+-----+-----+----+-----+-----+----+----+----------+-----------------+-----------+---------------+--------------------+
|  adviser|   32/60| 125|  256| 6000| 256|   16|  128| 198| 199|     181.0|(208,[181],[1.0])|       28.0|(29,[28],[1.0])|(244,[28,210,237,...|
|   amdahl|  470v/7|  29| 8000|32000|  32|    8|   32| 269| 253|     131.0|(208,[131],[1.0])|        6.0| (29,[6],[1.0])|(244,[6,160,237,2...|
|   amdahl| 470v/7a|  29| 8000|32000|  32|    8|   32| 220| 253|      51.0| (208,[51],[1.0])|        6.0| (29,[6],[1.0])|(244,[6,80,237,23...|
|   amdahl| 470v/7b|  29| 8000|32000|  32|    8|   32| 172| 253|     196.0|(208,[196],[1.0])|        6.0| (29,[6],[1.0])|(244,[6,225,237,2...|

In [103]:
#from pyspark.ml.feature import StandardScaler
#scaler = StandardScaler(inputCol='features', outputCol="scaled_features", withStd=True, withMean=False)
#scalerModel = scaler.fit(output)
#output = scalerModel.transform(output)

Implementing Linear Regression model
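For intuition, linear regression without regularization chooses coefficients that minimize the squared difference between predictions and labels. The sketch below works this out in closed form for a single feature, using PRP and ERP values from the first five rows shown earlier. It is only an illustration: `fit_simple_ols` is a hypothetical helper, and Spark fits all 244 features at once with its own solvers.

```python
def fit_simple_ols(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# PRP and ERP from the first five rows of the dataset
prp = [198, 269, 220, 172, 132]
erp = [199, 253, 253, 253, 132]

slope, intercept = fit_simple_ols(prp, erp)
predicted = [slope * x + intercept for x in prp]
print(slope, intercept)
```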

In [113]:
from pyspark.ml.regression import LinearRegression
lr=LinearRegression(featuresCol='features',labelCol='ERP')

In [114]:
train_data, test_data = output.randomSplit([0.7,.3])
fit_model = lr.fit(train_data)

In [115]:
results = fit_model.transform(test_data)

In [116]:
results.columns

['vendor',
 'model',
 'MYCT',
 'MMIN',
 'MMAX',
 'CACH',
 'CHMIN',
 'CHMAX',
 'PRP',
 'ERP',
 'modelIndex',
 'modelVec',
 'vendorIndex',
 'vendorVec',
 'features',
 'prediction']

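Evaluating the model on the test data. The metrics calculated are mean absolute error (MAE) and R-squared. Note that `fit_model.summary` reports metrics on the training data, so the MAE printed below is the training error; the test-set value is available as `test_results.meanAbsoluteError`.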

In [119]:
test_results = fit_model.evaluate(test_data)
fit_model.summary.meanAbsoluteError

7.897154542423447e-07

In [120]:
print ("R2: {}".format(test_results.r2*100))

R2: 90.11736819084928


In [121]:
results.select(['ERP', 'prediction']).show()

+---+------------------+
|ERP|        prediction|
+---+------------------+
|199| 283.6454928791197|
|253|304.88131331198707|
|132|221.95012748486383|
| 24| 26.53637965616452|
| 23|11.384411789344092|
| 28|27.590608311724022|
| 21| 24.56291117560935|
| 28|26.287349297217293|
| 22|16.674202141115472|
| 74| 75.64039475936923|
| 74|     79.1758015445|
| 74| 85.74510971765501|
|138|139.64319439942466|
|136|161.25128579151536|
| 44|38.633114673065045|
| 54| 68.67368534574764|
| 41|  58.0302471667936|
| 19|10.089008882076318|
| 56| 57.40504067890669|
| 34| 32.15322801325483|
+---+------------------+
only showing top 20 rows
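
The two metrics can also be written out by hand, which makes it easier to see what they measure. The sketch below uses the standard definitions of MAE and R-squared and applies them to the first five ERP and prediction pairs from the table above. The function names are hypothetical helpers, and five rows are far too few to reproduce the notebook's reported figures.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between labels and predictions."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# First five rows of the ERP / prediction table
erp = [199, 253, 132, 24, 23]
prediction = [283.6454928791197, 304.88131331198707, 221.95012748486383,
              26.53637965616452, 11.384411789344092]

mae = mean_absolute_error(erp, prediction)
r2 = r_squared(erp, prediction)
print("MAE: {}".format(mae))
print("R2: {}".format(r2))
```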



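Comparing the ERP and prediction columns above, the difference is large for some rows. We only have 209 instances to build the model, and more data might help increase its accuracy. The model name is also nearly unique per row, so its one-hot encoding accounts for 208 of the 244 features, which likely contributes to the gap between the near-zero training error and the test predictions.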