### Problem Description

#### Dataset:
The training and test datasets are attached to the email.

#### Problem Description:
Concrete is the most important material in civil engineering. Its compressive strength is a highly nonlinear function of age and ingredients. “Concrete compressive strength” is the dependent variable, to be predicted from the independent variables listed below.

#### Independent Variables:
* Cement (component 1) -- kg per m³ of mixture
* Blast Furnace Slag (component 2) -- kg per m³ of mixture
* Fly Ash (component 3) -- kg per m³ of mixture
* Water (component 4) -- kg per m³ of mixture
* Superplasticizer (component 5) -- kg per m³ of mixture
* Coarse Aggregate (component 6) -- kg per m³ of mixture
* Fine Aggregate (component 7) -- kg per m³ of mixture
* Age -- days (1~365)

#### Dependent Variable:
* Concrete compressive strength -- MPa
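
For reference, the variables above appear to correspond to the CSV column names shown in the outputs later in this notebook (`cement`, `slag`, ..., `csMPa`). A minimal sketch of that mapping in plain Python follows. The names `COLUMN_UNITS`, `FEATURE_COLUMNS`, and `is_valid_age` are illustrative assumptions, not part of the dataset or of any library.

```python
# Column names as they appear in the CSV headers shown later in this notebook,
# mapped to the units given in the variable list above.
COLUMN_UNITS = {
    "cement": "kg/m^3",
    "slag": "kg/m^3",
    "flyash": "kg/m^3",
    "water": "kg/m^3",
    "superplasticizer": "kg/m^3",
    "coarseaggregate": "kg/m^3",
    "fineaggregate": "kg/m^3",
    "age": "days",
    "csMPa": "MPa",  # dependent variable
}

# Everything except the target is an input feature.
FEATURE_COLUMNS = [name for name in COLUMN_UNITS if name != "csMPa"]


def is_valid_age(age_days):
    """Check an age against the documented range of 1 to 365 days."""
    return 1 <= age_days <= 365


print(len(FEATURE_COLUMNS))  # 8
```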

In [0]:
# ref:
# https://docs.databricks.com/data/data-sources/azure/azure-storage.html#mount-azure-blob
# https://www.youtube.com/watch?v=zwMksSEjNvU

try:
  dbutils.fs.unmount('/mnt/hackathon')
except Exception:
  # ignore the error raised when nothing is mounted at this path yet
  pass

dbutils.fs.mount(
  source = "wasbs://hackathon@synapseadlsak.blob.core.windows.net",
  mount_point = "/mnt/hackathon",
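  # Use a SAS of the container or the ACCESS KEY of the storage account.
  # The SAS needs READ, WRITE, DELETE, and LIST permissions so Spark can read from and write to Blob storage.
  # Avoid hard-coding the token in a shared notebook; a Databricks secret scope (dbutils.secrets.get) is one alternative.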
  extra_configs =  {"fs.azure.sas.hackathon.synapseadlsak.blob.core.windows.net":
                    "sp=rwdl&st=2021-06-09T18:47:57Z&se=2021-06-24T02:47:57Z&spr=https&sv=2020-02-10&sr=c&sig=HBsJz5SwjmwS4gp3BHYPS737IcBLb9irS78Cr7Gs54U%3D"})

In [0]:
%fs ls mnt/hackathon

path,name,size
dbfs:/mnt/hackathon/Building_Strength_Test.csv,Building_Strength_Test.csv,9603
dbfs:/mnt/hackathon/Building_Strength_Train.csv,Building_Strength_Train.csv,30500


In [0]:
train_df = spark.read.csv('/mnt/hackathon/Building_Strength_Train.csv', header=True, inferSchema=True)
test_df = spark.read.csv('/mnt/hackathon/Building_Strength_Test.csv', header=True, inferSchema=True)

In [0]:
train_df.printSchema()
test_df.printSchema()

In [0]:
display(train_df.summary())
# display(train_df.describe())
# train_df.describe().show()

summary,cement,slag,flyash,water,superplasticizer,coarseaggregate,fineaggregate,age,csMPa
count,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0,740.0
mean,284.65527027027,73.31797297297295,51.68783783783782,178.9208108108108,6.15513513513513,980.6385135135127,781.8905405405394,48.58108108108108,36.96363513513513
stddev,100.85775141703849,87.15508637558649,62.40102409880701,22.721371063280493,6.291589687322423,70.90189807541634,82.0426577530396,67.26497333330794,17.686493394911164
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,3.0,2.33
25%,203.5,0.0,0.0,160.6,0.0,936.2,746.6,7.0,23.52
50%,254.0,24.0,0.0,181.1,5.8,968.0,781.2,28.0,35.96
75%,362.6,133.0,118.3,192.0,10.4,1040.6,845.0,56.0,49.2
max,540.0,359.4,174.7,228.0,32.2,1145.0,992.6,365.0,82.6


In [0]:
display(test_df.summary())
# display(test_df.describe())
# test_df.describe().show()

summary,cement,slag,flyash,water,superplasticizer,coarseaggregate,fineaggregate,age,csMPa
count,290.0,290.0,290.0,290.0,290.0,290.0,290.0,290.0,0.0
mean,272.26896551724155,75.3703448275862,60.56896551724137,188.32034482758624,6.331034482758619,953.2206896551722,752.3755172413795,38.21379310344828,
stddev,112.9813283512792,84.13360821349904,67.59263490236042,15.485469242685207,5.08154973556983,90.1626069666602,71.04988880750282,50.60549448496449,
min,132.0,0.0,0.0,127.0,0.0,801.0,612.0,1.0,
25%,154.8,0.0,0.0,178.5,0.0,878.0,697.7,28.0,
50%,287.3,0.0,0.0,189.0,7.0,949.4,763.0,28.0,
75%,331.0,145.0,113.2,196.0,10.0,1002.0,806.0,28.0,
max,540.0,260.0,200.1,247.0,22.1,1125.0,896.0,360.0,


In [0]:
display(train_df)

cement,slag,flyash,water,superplasticizer,coarseaggregate,fineaggregate,age,csMPa
540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3
266.0,114.0,0.0,228.0,0.0,932.0,670.0,90,47.03
380.0,95.0,0.0,228.0,0.0,932.0,594.0,365,43.7
380.0,95.0,0.0,228.0,0.0,932.0,594.0,28,36.45
266.0,114.0,0.0,228.0,0.0,932.0,670.0,28,45.85
475.0,0.0,0.0,228.0,0.0,932.0,594.0,28,39.29


In [0]:
display(test_df)

cement,slag,flyash,water,superplasticizer,coarseaggregate,fineaggregate,age,csMPa
297.0,0.0,0.0,186.0,0.0,1040.0,734.0,7,
480.0,0.0,0.0,192.0,0.0,936.0,721.0,28,
480.0,0.0,0.0,192.0,0.0,936.0,721.0,90,
397.0,0.0,0.0,186.0,0.0,1040.0,734.0,28,
281.0,0.0,0.0,186.0,0.0,1104.0,774.0,7,
281.0,0.0,0.0,185.0,0.0,1104.0,774.0,28,
500.0,0.0,0.0,200.0,0.0,1125.0,613.0,1,
500.0,0.0,0.0,200.0,0.0,1125.0,613.0,3,
500.0,0.0,0.0,200.0,0.0,1125.0,613.0,7,
500.0,0.0,0.0,200.0,0.0,1125.0,613.0,14,


### ML with SparkML

#### Select Model

**Steps:**

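1. Split the training data into training and validation sets
1. Assemble the feature columns into a single vector
1. Transform the features (scale, then expand to degree-2 polynomial terms)
1. Train a linear regression model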
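
The first step above can be sketched without Spark. The version below sends each row to the validation set independently with probability `test_size`, so the split sizes are only approximately proportional to the requested weights. Spark's `randomSplit` likewise does not guarantee exact split sizes. The function name `random_split_rows` is hypothetical, and this is not Spark's implementation.

```python
import random


def random_split_rows(rows, test_size, seed=None):
    """Split rows into (train, valid) using an independent random draw per row."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    rng = random.Random(seed)  # seeded generator makes the split reproducible
    train, valid = [], []
    for row in rows:
        (valid if rng.random() < test_size else train).append(row)
    return train, valid


rows = list(range(740))  # 740 matches the training row count in the summary above
train, valid = random_split_rows(rows, test_size=0.15, seed=7)
print(len(train), len(valid))  # the two sizes always sum to 740
```

Fixing the seed, as the notebook does with `seed=7`, keeps the validation metrics comparable between runs.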

In [0]:
# ref:
# https://towardsdatascience.com/from-scikit-learn-to-spark-ml-f2886fb46852

from pyspark.sql.dataframe import DataFrame
from pyspark.ml.feature import VectorAssembler, StandardScaler, PolynomialExpansion
from pyspark.ml.regression import LinearRegression
from pyspark.ml.pipeline import Pipeline

def train_valid_split(df: DataFrame, target_col_name: str, test_size: float, seed=None):
  """Randomly split df and assemble every non-target column into a 'features' vector."""
  assert 0.0 < test_size < 1.0, "Invalid value for test_size"
  assembler = VectorAssembler().setInputCols(df.drop(target_col_name).columns).setOutputCol('features')
  train, test = df.randomSplit(weights=[1.0 - test_size, test_size], seed=seed)
  return (assembler.transform(train), assembler.transform(test))

test_size = 0.15
features_list = train_df.drop('csMPa').columns
# train_valid_split is not used here: the Pipeline below has its own VectorAssembler,
# and its output column 'features' would clash with one that already exists.
# traindf, validdf = train_valid_split(train_df, target_col_name='csMPa', test_size=test_size, seed=7)
traindf, validdf = train_df.randomSplit(weights=[1.0 - test_size, test_size], seed=7)

# Transform & Train
model = Pipeline(stages=[
  VectorAssembler(inputCols=features_list, outputCol='features'),
  StandardScaler(inputCol='features', outputCol='scaledFeatures', withMean=False, withStd=True), 
  PolynomialExpansion(inputCol='scaledFeatures', outputCol='scaledPolyExpandedFeatures', degree=2),
  LinearRegression(labelCol='csMPa', featuresCol='scaledPolyExpandedFeatures', predictionCol='prediction')
])

predY = model.fit(traindf).transform(validdf)
display(predY)

# Evaluate
from pyspark.ml.evaluation import RegressionEvaluator

rmse = RegressionEvaluator().setLabelCol('csMPa').setPredictionCol('prediction').setMetricName('rmse')
r2 = RegressionEvaluator().setLabelCol('csMPa').setPredictionCol('prediction').setMetricName('r2')
print('Root Mean Squared Error =', rmse.evaluate(predY))
print('R^2 =', r2.evaluate(predY))

cement,slag,flyash,water,superplasticizer,coarseaggregate,fineaggregate,age,csMPa,features,scaledFeatures,scaledPolyExpandedFeatures,prediction
108.3,162.4,0.0,203.5,0.0,938.2,849.0,7,7.72,"Map(vectorType -> dense, length -> 8, values -> List(108.3, 162.4, 0.0, 203.5, 0.0, 938.2, 849.0, 7.0))","Map(vectorType -> dense, length -> 8, values -> List(1.0776026669021281, 1.864053147711785, 0.0, 9.210019274527367, 0.0, 13.268806491002964, 10.578441151475918, 0.10319857898508855))","Map(vectorType -> dense, length -> 44, values -> List(1.0776026669021281, 1.1612275077145788, 1.864053147711785, 2.008708643221526, 3.4746941374942137, 0.0, 0.0, 0.0, 0.0, 9.210019274527367, 9.924741332450694, 17.167965419168947, 0.0, 84.8244550371656, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.268806491002964, 14.298501261313064, 24.73376050593264, 0.0, 122.20596353211114, 0.0, 176.0612256956824, 10.578441151475918, 11.399356396497668, 19.718776526292565, 0.0, 97.42764689954667, 0.0, 140.36328861539653, 111.90341719523914, 0.10319857898508855, 0.11120706393484134, 0.19236763599653758, 0.0, 0.9504609015565004, 0.0, 1.369321974699625, 1.0916800947096985, 0.010649946704541561))",13.338625966644033
108.3,162.4,0.0,203.5,0.0,938.2,849.0,90,29.23,"Map(vectorType -> dense, length -> 8, values -> List(108.3, 162.4, 0.0, 203.5, 0.0, 938.2, 849.0, 90.0))","Map(vectorType -> dense, length -> 8, values -> List(1.0776026669021281, 1.864053147711785, 0.0, 9.210019274527367, 0.0, 13.268806491002964, 10.578441151475918, 1.3268388726654243))","Map(vectorType -> dense, length -> 44, values -> List(1.0776026669021281, 1.1612275077145788, 1.864053147711785, 2.008708643221526, 3.4746941374942137, 0.0, 0.0, 0.0, 0.0, 9.210019274527367, 9.924741332450694, 17.167965419168947, 0.0, 84.8244550371656, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.268806491002964, 14.298501261313064, 24.73376050593264, 0.0, 122.20596353211114, 0.0, 176.0612256956824, 10.578441151475918, 11.399356396497668, 19.718776526292565, 0.0, 97.42764689954667, 0.0, 140.36328861539653, 111.90341719523914, 1.3268388726654243, 1.4298051077336744, 2.4732981770983407, 0.0, 12.220211591440721, 0.0, 17.605568246138038, 14.03588693198184, 1.760501394016054))",32.9795210519369
122.6,183.9,0.0,203.5,0.0,958.2,800.1,28,24.29,"Map(vectorType -> dense, length -> 8, values -> List(122.6, 183.9, 0.0, 203.5, 0.0, 958.2, 800.1, 28.0))","Map(vectorType -> dense, length -> 8, values -> List(1.219889999650978, 2.1108335829076186, 0.0, 9.210019274527367, 0.0, 13.55166316316248, 9.969152844871475, 0.4127943159403542))","Map(vectorType -> dense, length -> 44, values -> List(1.219889999650978, 1.488131611248463, 2.1108335829076186, 2.5749847787164475, 4.455618414730615, 0.0, 0.0, 0.0, 0.0, 9.210019274527367, 11.23521040958869, 19.440817983898828, 0.0, 84.8244550371656, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.55166316316248, 16.53153837138045, 28.60530570905545, 0.0, 124.81107893462895, 0.0, 183.64757448781492, 9.969152844871475, 12.16126986045081, 21.043222618093736, 0.0, 91.81608985197562, 0.0, 135.09860137578121, 99.38400844440903, 0.4127943159403542, 0.5035636579284044, 0.8713401049202774, 0.0, 3.8018436062260017, 0.0, 5.594049525291752, 4.115209629103557, 0.17039914727266497))",20.758544810342755
133.0,200.0,0.0,192.0,0.0,927.4,839.2,3,6.88,"Map(vectorType -> dense, length -> 8, values -> List(133.0, 200.0, 0.0, 192.0, 0.0, 927.4, 839.2, 3.0))","Map(vectorType -> dense, length -> 8, values -> List(1.3233716961955961, 2.29563195531008, 0.0, 8.689551354836631, 0.0, 13.116063888036823, 10.456334292483618, 0.04422796242218081))","Map(vectorType -> dense, length -> 44, values -> List(1.3233716961955961, 1.751312646291609, 2.29563195531008, 3.0379743545395135, 5.269926074240781, 0.0, 0.0, 0.0, 0.0, 8.689551354836631, 11.499506315628892, 19.94801176747097, 0.0, 75.50830274834313, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.116063888036823, 17.357427714921098, 30.109655389265903, 0.0, 113.97271072841419, 0.0, 172.03113191506364, 10.456334292483618, 13.837616848632225, 24.00389513723001, 0.0, 90.86085381787575, 0.0, 137.14594861488544, 109.33492683616888, 0.04422796242218081, 0.0585300336499165, 0.10153112385461167, 0.0, 0.3843211507873248, 0.0, 0.5800967807670153, 0.462462360161726, 0.0019561126600178376))",11.675219838183692
133.0,200.0,0.0,192.0,0.0,927.4,839.2,28,27.87,"Map(vectorType -> dense, length -> 8, values -> List(133.0, 200.0, 0.0, 192.0, 0.0, 927.4, 839.2, 28.0))","Map(vectorType -> dense, length -> 8, values -> List(1.3233716961955961, 2.29563195531008, 0.0, 8.689551354836631, 0.0, 13.116063888036823, 10.456334292483618, 0.4127943159403542))","Map(vectorType -> dense, length -> 44, values -> List(1.3233716961955961, 1.751312646291609, 2.29563195531008, 3.0379743545395135, 5.269926074240781, 0.0, 0.0, 0.0, 0.0, 8.689551354836631, 11.499506315628892, 19.94801176747097, 0.0, 75.50830274834313, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.116063888036823, 17.357427714921098, 30.109655389265903, 0.0, 113.97271072841419, 0.0, 172.03113191506364, 10.456334292483618, 13.837616848632225, 24.00389513723001, 0.0, 90.86085381787575, 0.0, 137.14594861488544, 109.33492683616888, 0.4127943159403542, 0.5462803140658874, 0.9476238226430423, 0.0, 3.586997407348365, 0.0, 5.414236620492143, 4.316315361509442, 0.17039914727266497))",19.63109893930323
135.7,203.5,0.0,185.7,0.0,1076.2,759.3,28,18.2,"Map(vectorType -> dense, length -> 8, values -> List(135.7, 203.5, 0.0, 185.7, 0.0, 1076.2, 759.3, 28.0))","Map(vectorType -> dense, length -> 8, values -> List(1.3502371366446795, 2.3358055145280066, 0.0, 8.404425451006054, 0.0, 15.220517528903633, 9.46078959518924, 0.4127943159403542))","Map(vectorType -> dense, length -> 44, values -> List(1.3502371366446795, 1.823140325174423, 2.3358055145280066, 3.1538913496951477, 5.455987401699446, 0.0, 0.0, 0.0, 0.0, 8.404425451006054, 11.347967356110082, 19.63110331489947, 0.0, 70.63436716151831, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 15.220517528903633, 20.551308006476994, 35.55216877798329, 0.0, 127.91970489740146, 0.0, 231.66415384766276, 9.46078959518924, 12.774309453406095, 22.098564508232215, 0.0, 79.51250086042171, 0.0, 143.99811387084694, 89.50653976444099, 0.4127943159403542, 0.557370215178503, 0.9642072395392955, 0.0, 3.469299054919747, 0.0, 6.282943121601946, 3.905360169201763, 0.17039914727266497))",19.98020883408208
141.3,212.0,0.0,203.5,0.0,971.8,748.5,7,10.39,"Map(vectorType -> dense, length -> 8, values -> List(141.3, 212.0, 0.0, 203.5, 0.0, 971.8, 748.5, 7.0))","Map(vectorType -> dense, length -> 8, values -> List(1.4059580501687048, 2.4333698726286848, 0.0, 9.210019274527367, 0.0, 13.744005700230952, 9.326222852626296, 0.10319857898508855))","Map(vectorType -> dense, length -> 44, values -> List(1.4059580501687048, 1.9767180388341863, 2.4333698726286848, 3.421215961460295, 5.921288937016942, 0.0, 0.0, 0.0, 0.0, 9.210019274527367, 12.948900741230686, 22.411383428964392, 0.0, 84.8244550371656, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.744005700230952, 19.323495455804274, 33.44424940017891, 0.0, 126.58255740834107, 0.0, 188.8976926879809, 9.326222852626296, 13.112278097317283, 22.69414971500198, 0.0, 85.89469223122579, 0.0, 128.17966004811998, 86.97843269684897, 0.10319857898508855, 0.14509287289005618, 0.2511203130004062, 0.0, 0.9504609015565004, 0.0, 1.4183618578267911, 0.9624529456892927, 0.010649946704541561))",15.953468251539562
160.0,128.0,122.0,182.0,6.4,824.0,879.0,28,39.4,"Map(vectorType -> dense, length -> 8, values -> List(160.0, 128.0, 122.0, 182.0, 6.4, 824.0, 879.0, 28.0))","Map(vectorType -> dense, length -> 8, values -> List(1.5920261006864314, 1.4692044513984512, 1.9585363560418372, 8.236970555105557, 1.0120026546762806, 11.653694892972117, 10.952237658595209, 0.4127943159403542))","Map(vectorType -> dense, length -> 44, values -> List(1.5920261006864314, 2.534547105266843, 1.4692044513984512, 2.339011833871024, 2.158561720009024, 1.9585363560418372, 3.118040997961898, 2.877490332522369, 3.835864657937638, 8.236970555105557, 13.11347211431365, 12.101793805599057, 16.132406295820346, 67.84768392567595, 1.0120026546762806, 1.6111346402085962, 1.486838805077441, 1.9820439915943482, 8.33583606825718, 1.0241493730718392, 11.653694892972117, 18.55298643904778, 17.121660411994032, 22.82418513010498, 95.99114169159535, 11.793570168475197, 135.8086046584844, 10.952237658595209, 17.436248213404422, 16.09107632078183, 21.45035563436924, 90.21325910636696, 11.083693585143884, 127.63403606858789, 119.95150973035106, 0.4127943159403542, 0.6571793251920449, 0.606479246491547, 0.8084726753366042, 3.400174625715638, 0.4177489435669177, 4.810579011522025, 4.5210214522959955, 0.17039914727266497))",43.061611052502485
165.0,0.0,143.6,163.8,0.0,1005.6,900.9,14,16.88,"Map(vectorType -> dense, length -> 8, values -> List(165.0, 0.0, 143.6, 163.8, 0.0, 1005.6, 900.9, 14.0))","Map(vectorType -> dense, length -> 8, values -> List(1.6417769163328824, 0.0, 2.3052936125213757, 7.413273499595002, 0.0, 14.222033476180536, 11.22510910879229, 0.2063971579701771))","Map(vectorType -> dense, length -> 44, values -> List(1.6417769163328824, 2.6954314430035082, 0.0, 0.0, 0.0, 2.3052936125213757, 3.784777838407235, 0.0, 5.314378639931855, 7.413273499595002, 12.170941306097358, 0.0, 17.089772046490342, 54.95662397979753, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 14.222033476180536, 23.349406264506705, 0.0, 32.78596292970417, 105.43182387932215, 0.0, 202.26623619759982, 11.22510910879229, 18.429125018133156, 0.0, 25.87717232835438, 83.21480388627236, 0.0, 159.64387751902302, 126.00307450429165, 0.2063971579701771, 0.33885808955214813, 0.0, 0.4758060499112146, 1.5300785815720372, 0.0, 2.935387290040381, 2.316830617959876, 0.042599786818166244))",11.595858941335791
165.0,0.0,143.6,163.8,0.0,1005.6,900.9,28,26.2,"Map(vectorType -> dense, length -> 8, values -> List(165.0, 0.0, 143.6, 163.8, 0.0, 1005.6, 900.9, 28.0))","Map(vectorType -> dense, length -> 8, values -> List(1.6417769163328824, 0.0, 2.3052936125213757, 7.413273499595002, 0.0, 14.222033476180536, 11.22510910879229, 0.4127943159403542))","Map(vectorType -> dense, length -> 44, values -> List(1.6417769163328824, 2.6954314430035082, 0.0, 0.0, 0.0, 2.3052936125213757, 3.784777838407235, 0.0, 5.314378639931855, 7.413273499595002, 12.170941306097358, 0.0, 17.089772046490342, 54.95662397979753, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 14.222033476180536, 23.349406264506705, 0.0, 32.78596292970417, 105.43182387932215, 0.0, 202.26623619759982, 11.22510910879229, 18.429125018133156, 0.0, 25.87717232835438, 83.21480388627236, 0.0, 159.64387751902302, 126.00307450429165, 0.4127943159403542, 0.6777161791042963, 0.0, 0.9516120998224292, 3.0601571631440745, 0.0, 5.870774580080762, 4.633661235919752, 0.17039914727266497))",16.68002880482254
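
The `scaledPolyExpandedFeatures` column in the output above has 44 entries for 8 input features. That count is the number of degree-1 and degree-2 monomials in 8 variables with the constant term left out. The sketch below reproduces the ordering visible in the output using plain Python. `poly_expand_degree2` is a hypothetical helper written for illustration, not Spark's `PolynomialExpansion` code, and it only handles degree 2.

```python
import math


def poly_expand_degree2(x):
    """Return all degree-1 and degree-2 monomials of x, without a constant term.

    Ordering follows the output above: for each feature x[i], emit x[i],
    then the products x[j] * x[i] for every j <= i.
    """
    out = []
    for i, xi in enumerate(x):
        out.append(xi)
        for j in range(i + 1):
            out.append(x[j] * xi)
    return out


# scaledFeatures of the first validation row shown above
scaled = [1.0776026669021281, 1.864053147711785, 0.0, 9.210019274527367,
          0.0, 13.268806491002964, 10.578441151475918, 0.10319857898508855]
expanded = poly_expand_degree2(scaled)

print(len(expanded))                       # 44
print(math.comb(len(scaled) + 2, 2) - 1)   # 44, i.e. C(n + 2, 2) - 1 for n = 8
print(expanded[:5])                        # x1, x1^2, x2, x1*x2, x2^2
```

The first five values agree with the start of the first row's `scaledPolyExpandedFeatures` vector (1.0776..., 1.1612..., 1.8640..., 2.0087..., 3.4746...).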


#### Predict

In [0]:
pred_df = model.fit(train_df)\
               .transform(test_df)
display(pred_df)

cement,slag,flyash,water,superplasticizer,coarseaggregate,fineaggregate,age,csMPa,features,scaledFeatures,scaledPolyExpandedFeatures,prediction
297.0,0.0,0.0,186.0,0.0,1040.0,734.0,7,,"Map(vectorType -> dense, length -> 8, values -> List(297.0, 0.0, 0.0, 186.0, 0.0, 1040.0, 734.0, 7.0))","Map(vectorType -> dense, length -> 8, values -> List(2.94474143858244, 0.0, 0.0, 8.186125717588869, 0.0, 14.668154566098943, 8.946565361271539, 0.1040660488381369))","Map(vectorType -> dense, length -> 44, values -> List(2.94474143858244, 8.671502140104579, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.186125717588869, 24.106023622029355, 0.0, 0.0, 67.01265426416987, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 14.668154566098943, 43.19392257832379, 0.0, 0.0, 120.07535732311115, 0.0, 215.15475837496928, 8.946565361271539, 26.34532175232258, 0.0, 0.0, 73.23770878799469, 0.0, 131.22960355483775, 80.04103176350374, 0.1040660488381369, 0.3064476063632057, 0.0, 0.0, 0.8518977587217317, 0.0, 1.5264568894409933, 0.9310337078196679, 0.010829742520781494))",12.537645467304174
480.0,0.0,0.0,192.0,0.0,936.0,721.0,28,,"Map(vectorType -> dense, length -> 8, values -> List(480.0, 0.0, 0.0, 192.0, 0.0, 936.0, 721.0, 28.0))","Map(vectorType -> dense, length -> 8, values -> List(4.759178082557479, 0.0, 0.0, 8.450194289123994, 0.0, 13.201339109489048, 8.788111206371635, 0.4162641953525476))","Map(vectorType -> dense, length -> 44, values -> List(4.759178082557479, 22.649776021495487, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.450194289123994, 40.215979454151295, 0.0, 0.0, 71.40578352394375, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.201339109489048, 62.82752375028915, 0.0, 0.0, 111.55388035179358, 0.0, 174.27535428372508, 8.788111206371635, 41.824186240441655, 0.0, 0.0, 74.26124712826817, 0.0, 116.01483616721285, 77.23089857555472, 0.4162641953525476, 1.9810754350752695, 0.0, 0.0, 3.517513326334892, 0.0, 5.4952448019875755, 3.6581760399889953, 0.1732758803325039))",38.147177814829774
480.0,0.0,0.0,192.0,0.0,936.0,721.0,90,,"Map(vectorType -> dense, length -> 8, values -> List(480.0, 0.0, 0.0, 192.0, 0.0, 936.0, 721.0, 90.0))","Map(vectorType -> dense, length -> 8, values -> List(4.759178082557479, 0.0, 0.0, 8.450194289123994, 0.0, 13.201339109489048, 8.788111206371635, 1.3379920564903316))","Map(vectorType -> dense, length -> 44, values -> List(4.759178082557479, 22.649776021495487, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.450194289123994, 40.215979454151295, 0.0, 0.0, 71.40578352394375, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.201339109489048, 62.82752375028915, 0.0, 0.0, 111.55388035179358, 0.0, 174.27535428372508, 8.788111206371635, 41.824186240441655, 0.0, 0.0, 74.26124712826817, 0.0, 116.01483616721285, 77.23089857555472, 1.3379920564903316, 6.367742469884795, 0.0, 0.0, 11.306292834647868, 0.0, 17.663286863531493, 11.758422985678912, 1.7902227432312268))",52.96760092270506
397.0,0.0,0.0,186.0,0.0,1040.0,734.0,28,,"Map(vectorType -> dense, length -> 8, values -> List(397.0, 0.0, 0.0, 186.0, 0.0, 1040.0, 734.0, 28.0))","Map(vectorType -> dense, length -> 8, values -> List(3.936236872448582, 0.0, 0.0, 8.186125717588869, 0.0, 14.668154566098943, 8.946565361271539, 0.4162641953525476))","Map(vectorType -> dense, length -> 44, values -> List(3.936236872448582, 15.493960716023794, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.186125717588869, 32.22252989207291, 0.0, 0.0, 67.01265426416987, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 14.668154566098943, 57.73733085385369, 0.0, 0.0, 120.07535732311115, 0.0, 215.15475837496928, 8.946565361271539, 35.215800456808296, 0.0, 0.0, 73.23770878799469, 0.0, 131.22960355483775, 80.04103176350374, 0.4162641953525476, 1.6385144744268374, 0.0, 0.0, 3.4075910348869267, 0.0, 6.105827557763973, 3.7241348312786715, 0.1732758803325039))",36.15516925824613
281.0,0.0,0.0,186.0,0.0,1104.0,774.0,7,,"Map(vectorType -> dense, length -> 8, values -> List(281.0, 0.0, 0.0, 186.0, 0.0, 1104.0, 774.0, 7.0))","Map(vectorType -> dense, length -> 8, values -> List(2.7861021691638577, 0.0, 0.0, 8.186125717588869, 0.0, 15.570810231705032, 9.4341166071174, 0.1040660488381369))","Map(vectorType -> dense, length -> 44, values -> List(2.7861021691638577, 7.762365297019553, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.186125717588869, 22.807382618822388, 0.0, 0.0, 67.01265426416987, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 15.570810231705032, 43.38186816219218, 0.0, 0.0, 127.46461008145646, 0.0, 242.45013127177012, 9.4341166071174, 26.284412743234558, 0.0, 0.0, 77.22886458025599, 0.0, 146.89683939320196, 89.00255615668831, 0.1040660488381369, 0.28993864440424516, 0.0, 0.0, 0.8518977587217317, 0.0, 1.6203926980219776, 0.9817712395809577, 0.010829742520781494))",18.321940088916563
281.0,0.0,0.0,185.0,0.0,1104.0,774.0,28,,"Map(vectorType -> dense, length -> 8, values -> List(281.0, 0.0, 0.0, 185.0, 0.0, 1104.0, 774.0, 28.0))","Map(vectorType -> dense, length -> 8, values -> List(2.7861021691638577, 0.0, 0.0, 8.142114288999682, 0.0, 15.570810231705032, 9.4341166071174, 0.4162641953525476))","Map(vectorType -> dense, length -> 44, values -> List(2.7861021691638577, 7.762365297019553, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.142114288999682, 22.684762282162055, 0.0, 0.0, 66.2940250951328, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 15.570810231705032, 43.38186816219218, 0.0, 0.0, 126.779316478868, 0.0, 242.45013127177012, 9.4341166071174, 26.284412743234558, 0.0, 0.0, 76.81365563089977, 0.0, 146.89683939320196, 89.00255615668831, 0.4162641953525476, 1.1597545776169806, 0.0, 0.0, 3.389270652978933, 0.0, 6.48157079208791, 3.9270849583238308, 0.1732758803325039))",23.64080616249339
500.0,0.0,0.0,200.0,0.0,1125.0,613.0,1,,"Map(vectorType -> dense, length -> 8, values -> List(500.0, 0.0, 0.0, 200.0, 0.0, 1125.0, 613.0, 1.0))","Map(vectorType -> dense, length -> 8, values -> List(4.957477169330708, 0.0, 0.0, 8.802285717837494, 0.0, 15.86699412198203, 7.471722842587811, 0.014866578405448128))","Map(vectorType -> dense, length -> 44, values -> List(4.957477169330708, 24.576579884435205, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.802285717837494, 43.637130484105136, 0.0, 0.0, 77.48023385844593, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 15.86699412198203, 78.66026110563045, 0.0, 0.0, 139.6658157449339, 0.0, 251.7615024670123, 7.471722842587811, 37.04089540769581, 0.0, 0.0, 65.76823926495085, 0.0, 118.55378242441965, 55.82664223644848, 0.014866578405448128, 0.07370072303107401, 0.0, 0.0, 0.13085987077138736, 0.0, 0.23588791217323044, 0.11107895346310945, 2.210151534853366E-4))",38.31155642522299
500.0,0.0,0.0,200.0,0.0,1125.0,613.0,3,,"Map(vectorType -> dense, length -> 8, values -> List(500.0, 0.0, 0.0, 200.0, 0.0, 1125.0, 613.0, 3.0))","Map(vectorType -> dense, length -> 8, values -> List(4.957477169330708, 0.0, 0.0, 8.802285717837494, 0.0, 15.86699412198203, 7.471722842587811, 0.04459973521634439))","Map(vectorType -> dense, length -> 44, values -> List(4.957477169330708, 24.576579884435205, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.802285717837494, 43.637130484105136, 0.0, 0.0, 77.48023385844593, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 15.86699412198203, 78.66026110563045, 0.0, 0.0, 139.6658157449339, 0.0, 251.7615024670123, 7.471722842587811, 37.04089540769581, 0.0, 0.0, 65.76823926495085, 0.0, 118.55378242441965, 55.82664223644848, 0.04459973521634439, 0.22110216909322206, 0.0, 0.0, 0.3925796123141621, 0.0, 0.7076637365196914, 0.33323686038932837, 0.00198913638136803))",38.91648846954922
500.0,0.0,0.0,200.0,0.0,1125.0,613.0,7,,"Map(vectorType -> dense, length -> 8, values -> List(500.0, 0.0, 0.0, 200.0, 0.0, 1125.0, 613.0, 7.0))","Map(vectorType -> dense, length -> 8, values -> List(4.957477169330708, 0.0, 0.0, 8.802285717837494, 0.0, 15.86699412198203, 7.471722842587811, 0.1040660488381369))","Map(vectorType -> dense, length -> 44, values -> List(4.957477169330708, 24.576579884435205, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.802285717837494, 43.637130484105136, 0.0, 0.0, 77.48023385844593, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 15.86699412198203, 78.66026110563045, 0.0, 0.0, 139.6658157449339, 0.0, 251.7615024670123, 7.471722842587811, 37.04089540769581, 0.0, 0.0, 65.76823926495085, 0.0, 118.55378242441965, 55.82664223644848, 0.1040660488381369, 0.5159050612175181, 0.0, 0.0, 0.9160190953997116, 0.0, 1.651215385212613, 0.7775526742417662, 0.010829742520781494))",40.11177454710469
500.0,0.0,0.0,200.0,0.0,1125.0,613.0,14,,"Map(vectorType -> dense, length -> 8, values -> List(500.0, 0.0, 0.0, 200.0, 0.0, 1125.0, 613.0, 14.0))","Map(vectorType -> dense, length -> 8, values -> List(4.957477169330708, 0.0, 0.0, 8.802285717837494, 0.0, 15.86699412198203, 7.471722842587811, 0.2081320976762738))","Map(vectorType -> dense, length -> 44, values -> List(4.957477169330708, 24.576579884435205, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.802285717837494, 43.637130484105136, 0.0, 0.0, 77.48023385844593, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 15.86699412198203, 78.66026110563045, 0.0, 0.0, 139.6658157449339, 0.0, 251.7615024670123, 7.471722842587811, 37.04089540769581, 0.0, 0.0, 65.76823926495085, 0.0, 118.55378242441965, 55.82664223644848, 0.2081320976762738, 1.0318101224350362, 0.0, 0.0, 1.8320381907994232, 0.0, 3.302430770425226, 1.5551053484835324, 0.04331897008312598))",42.15675406389437
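
The `scaledFeatures` values above are consistent with `StandardScaler(withMean=False, withStd=True)` dividing each column by its sample standard deviation without subtracting the mean. Because the pipeline is refit on the full `train_df` in this step, the divisor for `cement` appears to be the `stddev` reported by `train_df.summary()` earlier. The sketch below shows the arithmetic in plain Python. `scale_by_std` is a hypothetical helper, not Spark's implementation.

```python
import math
import statistics


def scale_by_std(column):
    """Divide each value by the column's sample standard deviation (no centering)."""
    std = statistics.stdev(column)  # corrected (n - 1) sample standard deviation
    return [value / std for value in column]


# Toy column: the sample standard deviation of [2, 4, 6] is 2.
print(scale_by_std([2.0, 4.0, 6.0]))  # [1.0, 2.0, 3.0]

# Check against the notebook: cement = 297.0 in the first test row.
cement_std = 100.85775141703849  # stddev of cement from train_df.summary() above
print(297.0 / cement_std)        # approximately 2.9447414, as in the first prediction row
```

This also explains why the same raw value scales differently in the validation output and in the prediction output: the scaler was fit on different rows in each case.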


#### Save

In [0]:
output = pred_df.select(features_list + ['prediction']).withColumnRenamed('prediction', 'csMPa')
display(output)
# Spark writes a directory at this path holding a single part file (because of coalesce(1)), not one plain CSV file
output.coalesce(1).write.mode('overwrite').csv('/mnt/hackathon/Submission_Building_Strength_Test_900802.csv', header=True)

cement,slag,flyash,water,superplasticizer,coarseaggregate,fineaggregate,age,csMPa
297.0,0.0,0.0,186.0,0.0,1040.0,734.0,7,12.537645467304174
480.0,0.0,0.0,192.0,0.0,936.0,721.0,28,38.147177814829774
480.0,0.0,0.0,192.0,0.0,936.0,721.0,90,52.96760092270506
397.0,0.0,0.0,186.0,0.0,1040.0,734.0,28,36.15516925824613
281.0,0.0,0.0,186.0,0.0,1104.0,774.0,7,18.321940088916563
281.0,0.0,0.0,185.0,0.0,1104.0,774.0,28,23.64080616249339
500.0,0.0,0.0,200.0,0.0,1125.0,613.0,1,38.31155642522299
500.0,0.0,0.0,200.0,0.0,1125.0,613.0,3,38.91648846954922
500.0,0.0,0.0,200.0,0.0,1125.0,613.0,7,40.11177454710469
500.0,0.0,0.0,200.0,0.0,1125.0,613.0,14,42.15675406389437


### ML with Pandas

In [0]:
import pandas as pd

tr_df = train_df.toPandas()
ts_df = test_df.toPandas()

print('dataframe type is:', type(ts_df))
ts_df.head()

Unnamed: 0,cement,slag,flyash,water,superplasticizer,coarseaggregate,fineaggregate,age,csMPa
0,297.0,0.0,0.0,186.0,0.0,1040.0,734.0,7,
1,480.0,0.0,0.0,192.0,0.0,936.0,721.0,28,
2,480.0,0.0,0.0,192.0,0.0,936.0,721.0,90,
3,397.0,0.0,0.0,186.0,0.0,1040.0,734.0,28,
4,281.0,0.0,0.0,186.0,0.0,1104.0,774.0,7,


#### Select Model

In [0]:
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


train_y = tr_df['csMPa']
train_X = tr_df.drop(columns='csMPa')
test_X = ts_df.drop(columns='csMPa')

trainX, validX, trainY, validY = train_test_split(train_X, train_y, test_size=0.15, random_state=7)

# Transform & Train
model = make_pipeline(StandardScaler(), PolynomialFeatures(2), LinearRegression())

model.fit(trainX, trainY)
predY = model.predict(validX)

# Evaluate
from sklearn.metrics import mean_squared_error, r2_score

rmse = mean_squared_error(y_pred=predY, y_true=validY) ** 0.5
r2 = r2_score(y_pred=predY, y_true=validY)
print('Root Mean Squared Error =', rmse)
print('R^2 =', r2)
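
The two metrics printed above can be written out by hand to make their definitions explicit. This is a minimal plain-Python sketch. The function names `root_mean_squared_error` and `r_squared` are chosen here for illustration, and the toy values are not taken from the dataset.

```python
import math


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared difference between truth and prediction."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 8.0]
print(root_mean_squared_error(y_true, y_pred))  # sqrt(2/3), about 0.816
print(r_squared(y_true, y_pred))                # 0.75
```

RMSE is in the same unit as the target (MPa here), while R^2 is unitless and compares the model against always predicting the mean.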

#### Predict

In [0]:
model.fit(train_X, train_y)
pred_y = model.predict(test_X)
pred_y

#### Save

In [0]:
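test_X['csMPa'] = pred_y
# index=False omits the pandas row index so the file has the same columns as the Spark output
test_X.to_csv('Submission_Building_Strength_Test_900802.csv', header=True, index=False)
test_X.head(10)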

Unnamed: 0,cement,slag,flyash,water,superplasticizer,coarseaggregate,fineaggregate,age,csMPa
0,297.0,0.0,0.0,186.0,0.0,1040.0,734.0,7,12.537645
1,480.0,0.0,0.0,192.0,0.0,936.0,721.0,28,38.147178
2,480.0,0.0,0.0,192.0,0.0,936.0,721.0,90,52.967601
3,397.0,0.0,0.0,186.0,0.0,1040.0,734.0,28,36.155169
4,281.0,0.0,0.0,186.0,0.0,1104.0,774.0,7,18.32194
5,281.0,0.0,0.0,185.0,0.0,1104.0,774.0,28,23.640806
6,500.0,0.0,0.0,200.0,0.0,1125.0,613.0,1,38.311556
7,500.0,0.0,0.0,200.0,0.0,1125.0,613.0,3,38.916488
8,500.0,0.0,0.0,200.0,0.0,1125.0,613.0,7,40.111774
9,500.0,0.0,0.0,200.0,0.0,1125.0,613.0,14,42.156754
