# Lab: Spark Machine Learning

In this lab, you will build a simple ML pipeline and model with Spark Machine Learning.

Start your SparkSession:

In [1]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [2]:
spark

<a id="load"></a>
## 1. Load and explore data

Load the file `s3://bigdatateaching/misc/gosales_tx_naivebayes.csv` into a DataFrame:

In [7]:
df = spark.read.format('csv').option('header','true').option('inferSchema','true').load("s3://bigdatateaching/misc/gosales_tx_naivebayes.csv")

Print the schema of the DataFrame:

In [9]:
df.printSchema()

root
 |-- PRODUCT_LINE: string (nullable = true)
 |-- GENDER: string (nullable = true)
 |-- AGE: integer (nullable = true)
 |-- MARITAL_STATUS: string (nullable = true)
 |-- PROFESSION: string (nullable = true)



Print the first few records:

In [10]:
df.show()

+--------------------+------+---+--------------+------------+
|        PRODUCT_LINE|GENDER|AGE|MARITAL_STATUS|  PROFESSION|
+--------------------+------+---+--------------+------------+
|Personal Accessories|     M| 27|        Single|Professional|
|Personal Accessories|     F| 39|       Married|       Other|
|Mountaineering Eq...|     F| 39|       Married|       Other|
|Personal Accessories|     F| 56|   Unspecified| Hospitality|
|      Golf Equipment|     M| 45|       Married|     Retired|
|      Golf Equipment|     M| 45|       Married|     Retired|
|   Camping Equipment|     F| 39|       Married|       Other|
|   Camping Equipment|     F| 49|       Married|       Other|
|  Outdoor Protection|     F| 49|       Married|       Other|
|      Golf Equipment|     M| 47|       Married|     Retired|
|      Golf Equipment|     M| 47|       Married|     Retired|
|Mountaineering Eq...|     M| 21|        Single|      Retail|
|Personal Accessories|     F| 66|       Married|       Other|
|   Camp

Count the number of records:

In [11]:
df.count()

60252

### 2.1: Prepare data

In this subsection you will split your data into train, test, and predict datasets. Create three splits of `df` (train, test, predict) by using the `randomSplit` method:

In [12]:
splitted_data = df.randomSplit([0.8,0.18,0.02],24)

In [13]:
train_data = splitted_data[0]
test_data = splitted_data[1]
predict_data = splitted_data[2]

In [15]:
train_data.count()

48176

In [17]:
test_data.count()

10860
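Because `randomSplit` assigns rows at random, the split sizes only approximate the requested weights. The sketch below is plain Python arithmetic (no Spark needed) that compares the expected sizes for the weights `[0.8, 0.18, 0.02]` with the counts shown above. The predict count is not printed in this lab; it is derived here under the assumption that the three splits together contain every row of `df` exactly once.

```python
# Plain-Python arithmetic; the counts are copied from the cell outputs above.
total_rows = 60252
weights = [0.8, 0.18, 0.02]

# Normalize the weights so the sketch also works when they do not sum to 1.
weight_sum = sum(weights)
expected = [total_rows * w / weight_sum for w in weights]

observed_train = 48176
observed_test = 10860
# Assumption: the splits are disjoint and together contain every row.
observed_predict = total_rows - observed_train - observed_test

observed = [observed_train, observed_test, observed_predict]
for name, exp, obs in zip(["train", "test", "predict"], expected, observed):
    print(f"{name}: expected about {exp:.0f}, observed {obs}")
```

The observed counts differ from the expected ones by a few dozen rows at most, which is what you would expect from a random assignment.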

### 2.2: Create pipeline and train a model

For this lab, your job is to build a model that classifies the `PRODUCT_LINE`. In this section you will create a machine learning pipeline and then train the model. The next cell imports all the packages you will need:

In [18]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model

In the following step, convert all the string fields to numeric ones. Look at the DataFrame structure to determine which ones you need to convert. Use the `StringIndexer` transformer. You need to create a transformer for each column you want to modify.

In [26]:
stringindexer_label = StringIndexer(inputCol="PRODUCT_LINE",outputCol='label')
stringindexer_prof = StringIndexer(inputCol='PROFESSION',outputCol="PROFESSION_IX")
stringindexer_gend = StringIndexer(inputCol='GENDER',outputCol="GENDER_IX")
stringindexer_mar = StringIndexer(inputCol="MARITAL_STATUS",outputCol="MARITAL_STATUS_IX")


Try looking at the values for one of the re-encoded columns using the `labels` attribute. Does it work?

In [21]:
stringindexer_label.labels

AttributeError: 'StringIndexer' object has no attribute 'labels'

`StringIndexer` is an estimator, so the labels only exist after it has been fitted to data. Call the `fit` method, which returns a `StringIndexerModel`, and try again:

In [22]:
si_label_fit = StringIndexer(inputCol="PRODUCT_LINE",outputCol='label').fit(df)

In [24]:
si_label_fit.labels

['Camping Equipment',
 'Personal Accessories',
 'Mountaineering Equipment',
 'Golf Equipment',
 'Outdoor Protection']
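By default, `StringIndexer` orders labels by descending frequency, so the label at index 0 (`'Camping Equipment'` here) is the most frequent value in the column. The following is a plain-Python sketch of that ordering rule, not Spark's implementation. The helper name `frequency_desc_labels`, the sample values, and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import Counter


def frequency_desc_labels(values):
    """Return distinct values, most frequent first (hypothetical helper)."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically in this sketch.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [label for label, _ in ordered]


# Hypothetical sample, not the lab's data.
sample = [
    "Golf Equipment",
    "Camping Equipment",
    "Personal Accessories",
    "Camping Equipment",
    "Personal Accessories",
    "Camping Equipment",
]

labels = frequency_desc_labels(sample)
# Each label's position in the list is the index it would be encoded as.
label_to_index = {label: float(i) for i, label in enumerate(labels)}
print(label_to_index)
```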

In [25]:
si_label_fit.printSchema()

AttributeError: 'StringIndexerModel' object has no attribute 'printSchema'

In the following step, create a feature vector by combining the indexed string features and `AGE` using the `VectorAssembler` transformer:

In [27]:
vectorassembler_features = VectorAssembler(inputCols=['GENDER_IX',"AGE","MARITAL_STATUS_IX","PROFESSION_IX"],outputCol='features')

In [28]:
vectorassembler_features

VectorAssembler_3acf68ee2e4f
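Conceptually, `VectorAssembler` concatenates the values of `inputCols`, in the order given, into a single vector per row. The plain-Python sketch below mimics that for one row. The `assemble` helper is hypothetical, and the row values are taken from the first row of the `predictions` output shown later in this lab.

```python
def assemble(row, input_cols):
    """Concatenate the selected columns into one feature list (hypothetical helper)."""
    # The order of input_cols determines the order of the features.
    return [float(row[col]) for col in input_cols]


input_cols = ["GENDER_IX", "AGE", "MARITAL_STATUS_IX", "PROFESSION_IX"]
row = {"GENDER_IX": 1.0, "AGE": 18, "MARITAL_STATUS_IX": 1.0, "PROFESSION_IX": 0.0}

features = assemble(row, input_cols)
print(features)
```

Because the order of `inputCols` fixes the position of each feature, reordering that list changes the layout of the `features` column.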

Next, define the estimator you want to use for classification. You will build a random forest using the `RandomForestClassifier` estimator:

In [31]:
rf = RandomForestClassifier(labelCol='label',featuresCol="features")

Finally, convert the indexed labels back to the original labels:

In [29]:
labelConverter = IndexToString(inputCol="prediction",
                               outputCol='predictedLabel',
                               labels=stringindexer_label.fit(df).labels)
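The idea behind `IndexToString` can be sketched in plain Python: each numeric prediction is used as a position in the `labels` list. The list below is copied from the `si_label_fit.labels` output earlier in this lab; the helper name `index_to_string` is hypothetical.

```python
# Labels copied from the si_label_fit.labels output above.
labels = [
    "Camping Equipment",
    "Personal Accessories",
    "Mountaineering Equipment",
    "Golf Equipment",
    "Outdoor Protection",
]


def index_to_string(prediction, labels):
    """Map a numeric prediction back to its label (hypothetical helper)."""
    # Predictions are stored as doubles, so convert to int before indexing.
    return labels[int(prediction)]


print(index_to_string(1.0, labels))
```

This is why the converter must receive the same `labels` list that the label indexer produced: a different order would silently map predictions to the wrong product lines.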

Let's build the pipeline now. A pipeline is an ordered list of stages, each of which is either a transformer or an estimator.

In [32]:
pipeline_rf = Pipeline(stages=[stringindexer_label,
                              stringindexer_prof,
                              stringindexer_gend,
                              stringindexer_mar,
                              vectorassembler_features,
                              rf,labelConverter])
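When a pipeline is fitted, its stages run in order: each estimator is fitted on the current data, and the resulting model transforms the data before it is passed to the next stage. The toy classes below are a plain-Python sketch of that idea, not Spark's implementation. `ToyIndexer`, `ToyIndexerModel`, and `fit_pipeline` are hypothetical names, and the toy indexer orders labels alphabetically to stay short.

```python
class ToyIndexer:
    """Toy estimator: learns a string-to-index mapping in fit()."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        # Alphabetical order keeps the toy simple (not Spark's default order).
        labels = sorted({row[self.input_col] for row in rows})
        return ToyIndexerModel(self.input_col, self.output_col, labels)


class ToyIndexerModel:
    """Toy transformer produced by ToyIndexer.fit()."""

    def __init__(self, input_col, output_col, labels):
        self.input_col = input_col
        self.output_col = output_col
        self.labels = labels

    def transform(self, rows):
        # Add the index column without modifying the input rows.
        return [
            {**row, self.output_col: float(self.labels.index(row[self.input_col]))}
            for row in rows
        ]


def fit_pipeline(stages, rows):
    """Fit each stage in order, feeding transformed rows to the next stage."""
    fitted = []
    for stage in stages:
        # Estimators are fitted first; plain transformers are used as they are.
        model = stage.fit(rows) if hasattr(stage, "fit") else stage
        rows = model.transform(rows)
        fitted.append(model)
    return fitted, rows


# Hypothetical rows, not the lab's data.
rows = [
    {"GENDER": "M", "PROFESSION": "Retail"},
    {"GENDER": "F", "PROFESSION": "Other"},
]
stages = [ToyIndexer("GENDER", "GENDER_IX"), ToyIndexer("PROFESSION", "PROFESSION_IX")]
fitted_stages, transformed = fit_pipeline(stages, rows)
print(transformed)
```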

Now, you can train your Random Forest model by using the previously defined **pipeline** and **train data**.

+-----------------+------+---+--------------+-----------+
|     PRODUCT_LINE|GENDER|AGE|MARITAL_STATUS| PROFESSION|
+-----------------+------+---+--------------+-----------+
|Camping Equipment|     F| 18|        Single|      Other|
|Camping Equipment|     F| 18|        Single|      Other|
|Camping Equipment|     F| 18|        Single|     Retail|
|Camping Equipment|     F| 18|        Single|     Retail|
|Camping Equipment|     F| 19|        Single|Hospitality|
|Camping Equipment|     F| 19|        Single|Hospitality|
|Camping Equipment|     F| 19|        Single|Hospitality|
|Camping Equipment|     F| 19|        Single|Hospitality|
|Camping Equipment|     F| 19|        Single|Hospitality|
|Camping Equipment|     F| 19|        Single|Hospitality|
|Camping Equipment|     F| 19|        Single|Hospitality|
|Camping Equipment|     F| 19|        Single|Hospitality|
|Camping Equipment|     F| 19|        Single|Hospitality|
|Camping Equipment|     F| 19|        Single|Hospitality|
|Camping Equip

Fit the pipeline on the **train data**, then check your **model accuracy**. To evaluate the model, use **test data**.

In [35]:
model_rf = pipeline_rf.fit(train_data)

In [43]:
predictions = model_rf.transform(test_data)

In [44]:
predictions.show()

+-----------------+------+---+--------------+-----------+-----+-------------+---------+-----------------+------------------+--------------------+--------------------+----------+--------------------+
|     PRODUCT_LINE|GENDER|AGE|MARITAL_STATUS| PROFESSION|label|PROFESSION_IX|GENDER_IX|MARITAL_STATUS_IX|          features|       rawPrediction|         probability|prediction|      predictedLabel|
+-----------------+------+---+--------------+-----------+-----+-------------+---------+-----------------+------------------+--------------------+--------------------+----------+--------------------+
|Camping Equipment|     F| 18|        Single|      Other|  0.0|          0.0|      1.0|              1.0|[1.0,18.0,1.0,0.0]|[4.49916516171064...|[0.22495825808553...|       1.0|Personal Accessories|
|Camping Equipment|     F| 18|        Single|     Retail|  0.0|          7.0|      1.0|              1.0|[1.0,18.0,1.0,7.0]|[3.70782298786380...|[0.18539114939319...|       1.0|Personal Accessories|
|Camp

In [50]:
evaluatorRF = MulticlassClassificationEvaluator(labelCol='label',predictionCol='prediction',metricName="accuracy")
accuracy = evaluatorRF.evaluate(predictions)

print("Accuracy = %g" % accuracy)
print("Test error = {}".format(1 - accuracy))

Accuracy = 0.584346
Test error = 0.41565377532228365
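The accuracy metric is the fraction of rows whose `prediction` equals the `label`, and the test error is its complement. The plain-Python sketch below shows the arithmetic on a handful of hypothetical (label, prediction) pairs, then checks that the two figures reported above are consistent with each other. The helper name `accuracy_score` is an assumption of this sketch.

```python
def accuracy_score(pairs):
    """Fraction of (label, prediction) pairs that match (hypothetical helper)."""
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)


# Hypothetical (label, prediction) pairs, not the lab's predictions.
pairs = [(0.0, 1.0), (0.0, 0.0), (1.0, 1.0), (3.0, 3.0), (2.0, 0.0)]

toy_accuracy = accuracy_score(pairs)
toy_error = 1 - toy_accuracy
print(f"Toy accuracy = {toy_accuracy}, toy test error = {toy_error}")

# Consistency check on the rounded accuracy printed by the lab.
reported_accuracy = 0.584346
reported_error = 1 - reported_accuracy
print(f"Reported test error is about {reported_error:.6f}")
```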
