# ***Exercise 66 training***
Predict the variety of iris plants in real time.

Inputs:

- Training data
    - Static input file: training.csv

- New data
    - A stream of records about new iris plants

Output:
- Predicted class label/variety for each new iris plant, using only the columns “sepallength” and “sepalwidth”

Training data schema

- sepallength: double
- sepalwidth: double
- petallength: double – not considered
- petalwidth: double – not considered
- variety: integer
     - 1 -> Setosa category
     - 2 -> Versicolor category
     - 3 -> Virginica category
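
Spark ML classifiers write the predicted label as a double, so under the encoding above a prediction of 2.0 stands for Versicolor. The following is a minimal sketch of a lookup that turns a numeric label back into a variety name. The names `VARIETY_NAMES` and `variety_name` are hypothetical helpers introduced here for illustration; they are not part of the exercise.

```python
# Lookup from the numeric label used in the training data to the variety name.
# VARIETY_NAMES and variety_name are illustrative helpers, not part of the exercise.
VARIETY_NAMES = {1: "Setosa", 2: "Versicolor", 3: "Virginica"}


def variety_name(prediction: float) -> str:
    """Return the variety name for a numeric label such as 2 or 2.0."""
    label = int(prediction)
    # Reject fractional values and labels outside the documented encoding.
    if label != prediction or label not in VARIETY_NAMES:
        raise ValueError(f"unexpected label: {prediction!r}")
    return VARIETY_NAMES[label]


print(variety_name(2.0))  # Versicolor
```

Raising an error for unknown values, rather than returning a default, makes a mismatch between the encoding and the model output visible early.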
     
New streaming data schema
- sepallength: double
- sepalwidth: double
- petallength: double
- petalwidth: double
- variety: it is always null for the new data

In [None]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler, SQLTransformer
from pyspark.ml import Pipeline

In [None]:
train_path = 'data/training.csv'

model_savepath = 'pipelineClassificationModel/'

In [None]:
# Read the training data: the header row provides the column names
# and inferSchema detects the column types
train_df = spark.read\
.load(path=train_path, header=True, format='csv', inferSchema=True)

In [None]:
train_df.printSchema()
train_df.show()

In [None]:
# Combine the two sepal measurements into a single feature vector
vector_assembler = VectorAssembler(inputCols=["sepallength", "sepalwidth"],
                                   outputCol='features')

In [None]:
# Keep the input columns and the feature vector, and rename variety to label
# (the default label column name of LogisticRegression)
sql_transformer = SQLTransformer(statement="""
    SELECT sepallength, sepalwidth, petallength, petalwidth, features, variety as label
    FROM __THIS__
    """)

In [None]:
lr = LogisticRegression()

In [None]:
pipeline = Pipeline().setStages([vector_assembler, sql_transformer, lr])

In [None]:
# Train the model
model = pipeline.fit(train_df)

In [None]:
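# Save the trained model in the output folder
# (save() raises an error if the folder already exists;
# model.write().overwrite().save(model_savepath) replaces it instead)
model.save(model_savepath)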