# ***Exercise 66 prediction***
Predict the variety of iris plants in real time.

Inputs:

- Training data
    - Static input file: training.csv

- New data
    - A stream of records about new iris plants

Output:
- Predicted class label (variety) for each new iris plant, using only the columns “sepallength” and “sepalwidth”

Training data schema

- sepallength: double
- sepalwidth: double
- petallength: double – not considered
- petalwidth: double – not considered
- variety: integer
     - 1 -> Setosa category
     - 2 -> Versicolor category
     - 3 -> Virginica category
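
The integer codes above are also what the classifier predicts, so a small lookup can help when reading the results. The following is a minimal sketch: the names `VARIETY_NAMES` and `variety_name` are illustrative and not part of the exercise, and it assumes predictions arrive as numeric values such as `1.0`.

```python
# Hypothetical helper: maps the numeric variety codes from the training
# schema to readable names. The dictionary and function names are
# illustrative only.
VARIETY_NAMES = {1: "Setosa", 2: "Versicolor", 3: "Virginica"}


def variety_name(code):
    """Return the variety name for a numeric code (int or float, e.g. 2.0)."""
    if code is None:
        return None  # new streaming records carry a null variety
    return VARIETY_NAMES.get(int(code), "Unknown")


print(variety_name(1))    # Setosa
print(variety_name(3.0))  # Virginica
```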
     
New streaming data schema
- sepallength: double
- sepalwidth: double
- petallength: double
- petalwidth: double
- variety: it is always null for the new data
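
To try the stream locally, new records have to appear as CSV files in the input folder. The sketch below shows one possible way to produce such a file with the schema above. The function name `write_stream_file`, the file name, and the sample measurements are all hypothetical. The file is written outside the watched folder and then moved in, since Spark's file source expects files to be placed in the directory atomically.

```python
import csv
import os
import tempfile

# Column order taken from the streaming schema; everything else here
# (function name, file names, sample values) is illustrative.
COLUMNS = ["sepallength", "sepalwidth", "petallength", "petalwidth", "variety"]


def write_stream_file(directory, rows, filename):
    """Write rows of four measurements to a CSV file inside directory.

    The file is first written next to the target directory and then moved
    into it, so a streaming reader never sees a half-written file.
    """
    target_dir = os.path.abspath(directory)
    os.makedirs(target_dir, exist_ok=True)
    staging_dir = os.path.dirname(target_dir)  # same parent, outside the watched folder
    fd, tmp_path = tempfile.mkstemp(suffix=".csv", dir=staging_dir)
    with os.fdopen(fd, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(COLUMNS)
        for row in rows:
            # variety is unknown for new plants, so the field is left empty
            writer.writerow(list(row) + [""])
    final_path = os.path.join(target_dir, filename)
    os.replace(tmp_path, final_path)
    return final_path
```

For example, `write_stream_file("inputStreaming/", [(5.1, 3.5, 1.4, 0.2)], "batch1.csv")` would drop a one-record file into the folder read by the stream.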

In [None]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler, SQLTransformer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql.types import StructType

In [None]:
streaming_path = 'inputStreaming/'
model_path = 'pipelineClassificationModel/'

output_path = 'predictions/'

In [None]:
# Define the schema of the input data
inputSchema = StructType()\
    .add("sepallength", "double")\
    .add("sepalwidth", "double")\
    .add("petallength", "double")\
    .add("petalwidth", "double")\
    .add("variety", "double")  # always null in the streaming data

In [None]:
# Define input stream
data_stream = spark.readStream\
                    .format('csv')\
                    .option('path', streaming_path)\
                    .option('header', True)\
                    .schema(inputSchema)\
                    .load()

In [None]:
data_stream.printSchema()

In [None]:
# Load the previously fitted pipeline model from disk
model = PipelineModel.load(model_path)

In [None]:
# Apply the loaded pipeline to the stream; this adds a 'prediction' column
predictions = model.transform(data_stream)

In [None]:
predictions.printSchema()

In [None]:
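# Write the input columns and the predicted label to CSV files.
# The file sink only supports append mode and needs a checkpoint location.
predictionStreamWriter = predictions\
    .select('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'prediction')\
    .writeStream\
    .format("csv")\
    .outputMode('append')\
    .option("path", output_path)\
    .option("checkpointLocation", "checkpoint/")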

In [None]:
prediction_stream = predictionStreamWriter.start()

In [None]:
prediction_stream.stop()
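
In the notebook the start and stop cells are run by hand, so the query processes whatever arrives in between. Outside a notebook, one option is to let the query run for a bounded time and then stop it. The helper below is a minimal sketch of that idea; `run_for` is a hypothetical name, and it relies only on the `awaitTermination(timeout)` and `stop()` methods of a started streaming query.

```python
def run_for(query, timeout_seconds):
    """Let a started streaming query run for a bounded time, then stop it.

    Returns True if the query terminated on its own within the timeout.
    """
    try:
        # awaitTermination(timeout) blocks for at most timeout_seconds
        finished = query.awaitTermination(timeout_seconds)
    finally:
        # stop() is called even if awaitTermination raised an exception
        query.stop()
    return finished
```

For instance, `run_for(predictionStreamWriter.start(), 60)` would process incoming files for up to a minute before shutting the query down.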