## **Exercise 51 - Classification problem**

Input:
- A training data set containing a set of sentences
    - One sentence per line
    - Schema
        - label: 1 (Spark-related sentence) or 0 (non-Spark-related sentence)
        - text: a sentence about something
- A set of unlabeled sentences

Output:
- For each unlabeled sentence, the class label predicted by a logistic regression model

You must train the model by using as input two predictive features:
 - The number of words in each sentence
 - A Boolean value associated with the presence/absence of the word “Spark” in the sentences
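
The two features can be sketched in plain Python before wiring them into Spark. This is only an illustration: `build_features` is a hypothetical helper that is not part of the notebook, and the Boolean is encoded as 1.0/0.0 to mirror the numeric `features` vector shown in the prediction output further down.

```python
def build_features(sentence):
    """Return [number of words, 1.0 if 'Spark' occurs in the sentence else 0.0]."""
    num_words = len(sentence.split(' '))   # words separated by single spaces
    has_spark = 'Spark' in sentence        # case-sensitive substring check
    return [float(num_words), float(has_spark)]


print(build_features("Spark performs better than Hadoop"))  # [5.0, 1.0]
print(build_features("Turin is in Piedmont"))               # [4.0, 0.0]
```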

In [1]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import SQLTransformer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel
from pyspark.sql.types import *

# input and output folders
trainingData = "/data/students/bigdata-01QYD/ex_data/Ex51/data/trainingData.csv"
unlabeledData = "/data/students/bigdata-01QYD/ex_data/Ex51/data/unlabeledData.csv"
outputPath = "./res_ex51/"

In [2]:
# *************************
# Training step
# *************************

# Create a DataFrame from trainingData.csv
# Training data in raw format
trainingData = spark.read.load(trainingData,
                                format="csv",
                                header=True,
                                inferSchema=True)

In [3]:
trainingData.printSchema()
trainingData.show()

root
 |-- label: integer (nullable = true)
 |-- text: string (nullable = true)

+-----+--------------------+
|label|                text|
+-----+--------------------+
|    1|The Spark system ...|
|    1|Spark is a new di...|
|    0|Turin is a beauti...|
|    0|Turin is in the n...|
+-----+--------------------+



In [4]:
# Define a Python function that returns the number
# of words occurring in the input string
def countWords(text):
    return len(text.split(' '))
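
Splitting on a single space counts empty strings when a sentence has repeated or trailing spaces. Whether the exercise data contains such sentences is not known, so the comparison below is only a sketch of the difference; `count_words_simple` and `count_words_whitespace` are hypothetical names.

```python
def count_words_simple(text):
    # Same logic as countWords: split on every single space character
    return len(text.split(' '))


def count_words_whitespace(text):
    # split() with no argument collapses runs of whitespace and ignores
    # leading/trailing whitespace
    return len(text.split())


sentence = "Spark  is fast "   # two spaces after "Spark", one trailing space
print(count_words_simple(sentence))      # 5
print(count_words_whitespace(sentence))  # 3
```

On cleanly formatted sentences both functions return the same count.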

In [5]:
# Register the function as a UDF and explicitly declare its return type
spark.udf.register('countWords', countWords, IntegerType())

<function __main__.countWords(text)>

In [6]:
# Function that checks whether the input string contains the word 'Spark'
def containsSpark(text):
    return text.find('Spark')>=0
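
Note that `text.find('Spark')` is a case-sensitive substring test, so it also matches words such as "Sparkling", and it raises an error if `text` is `None`. A possible stricter variant is sketched below; `contains_spark_word` is a hypothetical name, and treating a missing sentence as `False` is an assumption rather than something the exercise specifies.

```python
import re

# \b marks a word boundary, so "Spark" must appear as a standalone word
_SPARK_WORD = re.compile(r'\bSpark\b')


def contains_spark_word(text):
    # Assumption: a missing (None) sentence counts as not containing the word
    if text is None:
        return False
    return _SPARK_WORD.search(text) is not None


print(contains_spark_word("Spark is fast"))    # True
print(contains_spark_word("Sparkling water"))  # False
print(contains_spark_word(None))               # False
```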

In [7]:
# Register the function as a UDF and explicitly declare its return type
spark.udf.register('containsSpark', containsSpark, BooleanType())

<function __main__.containsSpark(text)>

In [8]:
# Create an SQLTransformer that adds two columns to the input DataFrame:
# numLines (the word count) and SparkWord
sqlTrans = SQLTransformer(statement= """SELECT *,
                                        countWords(text) AS numLines,
                                        containsSpark(text) AS SparkWord
                                        FROM __THIS__""")

In [9]:
# Use an assembler to combine "numLines" and "SparkWord" in a Vector
assembler = VectorAssembler(inputCols=['numLines','SparkWord'], outputCol='features')

In [10]:
# Create a classification model based on logistic regression
lr = LogisticRegression()\
.setMaxIter(10)\
.setRegParam(0.01)

In [11]:
# Define the pipeline 
pipeline = Pipeline().setStages([sqlTrans, assembler, lr])

In [12]:
classificationModel = pipeline.fit(trainingData)

In [13]:
# *************************
# Prediction step
# *************************

# Create a DataFrame from unlabeledData.csv
# Unlabeled data in raw format
unlabeledData = spark.read.load(unlabeledData,
                                format="csv",
                                header=True,
                                inferSchema=True)

In [14]:
unlabeledData.printSchema()
unlabeledData.show()

root
 |-- label: string (nullable = true)
 |-- text: string (nullable = true)

+-----+--------------------+
|label|                text|
+-----+--------------------+
| null|Spark performs be...|
| null|Comparison betwee...|
| null|Turin is in Piedmont|
+-----+--------------------+



In [15]:
# Make predictions on the unlabeled data using the transform() method of the
# trained classification model. The logistic regression stage uses only the
# content of 'features' to perform the predictions.
predictionsDF = classificationModel.transform(unlabeledData)

In [16]:
predictionsDF.printSchema()
predictionsDF.show(truncate=False)

root
 |-- label: string (nullable = true)
 |-- text: string (nullable = true)
 |-- numLines: integer (nullable = true)
 |-- SparkWord: boolean (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)

+-----+-----------------------------------+--------+---------+---------+----------------------------------------+----------------------------------------+----------+
|label|text                               |numLines|SparkWord|features |rawPrediction                           |probability                             |prediction|
+-----+-----------------------------------+--------+---------+---------+----------------------------------------+----------------------------------------+----------+
|null |Spark performs better than Hadoop  |5       |true     |[5.0,1.0]|[-3.1272480248757137,3.1272480248757137]|[0.04199718899423514,0.9580028110057648]|1.0       |
|nu
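
For a binary logistic regression model, the `probability` column is the logistic (sigmoid) function applied to `rawPrediction`, and `prediction` is the class with the higher probability. The sketch below reproduces the first output row in plain Python; it illustrates the relationship under the assumption of the default 0.5 decision threshold and is not Spark's internal code.

```python
import math


def sigmoid(z):
    """Logistic function: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# rawPrediction of the first row ("Spark performs better than Hadoop")
raw_prediction = [-3.1272480248757137, 3.1272480248757137]

probability = [sigmoid(z) for z in raw_prediction]
# With a 0.5 threshold, the predicted class is the index of the larger probability
prediction = float(probability.index(max(probability)))

print(probability)  # approximately [0.042, 0.958]
print(prediction)   # 1.0
```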

In [17]:
predictions = predictionsDF.select('text', 'prediction')
predictions.printSchema()
predictions.show(truncate=False)

root
 |-- text: string (nullable = true)
 |-- prediction: double (nullable = false)

+-----------------------------------+----------+
|text                               |prediction|
+-----------------------------------+----------+
|Spark performs better than Hadoop  |1.0       |
|Comparison between Spark and Hadoop|1.0       |
|Turin is in Piedmont               |0.0       |
+-----------------------------------+----------+



In [None]:
# Save the result in an HDFS output folder
predictions.write.csv(outputPath, header=True)